DaDaDa 2016 - Time Series Processing with Spark
Chronix: open-source time series database built on Apache Spark and Solr for monitoring globally distributed systems at massive scale (6.3 trillion observations/year). Achieves 92% compression (108GB→8.7GB) through innovative encoding.
Abstract
This talk presents Chronix (Chronix Spark), an open-source solution for processing massive time series data at scale using Apache Spark and Apache Solr. Developed to handle operational monitoring of a globally distributed system with over 1,000 nodes generating 6.3 trillion observations annually, Chronix addresses the limitations of existing time series databases when dealing with truly massive-scale monitoring scenarios. The solution achieves exceptional storage efficiency—compressing 108GB of raw data to just 8.7GB—through innovative encoding techniques including delta encoding for timestamps, Protocol Buffers for values, and an optimized chunking strategy. Built on a hybrid architecture that leverages Solr for multi-dimensional queries and low-level aggregations while using Spark for high-level operations like anomaly detection, Chronix provides three data abstractions (RDD, DataFrame, Dataset) to support different use cases and workflows. The system integrates with ecosystem tools including Zeppelin, Grafana, and Prometheus, and demonstrates superior performance particularly for complex operations such as outlier detection, trend analysis, and frequency detection. Developed through a three-year research collaboration with the University of Applied Sciences Munich, Chronix represents a production-grade solution to the challenge of long-term retention and analysis of time series data at trillion-observation scale.
About the Speaker
Joseph Adersberger is co-founder and Chief Technology Officer of a Munich-based custom software development company with over 80 employees, recognized as a "Great Place to Work" award winner (first place). The company specializes in fixed-price projects focusing on legacy software modernization, Big Data solutions based on Apache Spark, IoT platforms, and infrastructure partnerships for cognitive computing frameworks. Serving major clients including BMW and Deutsche Telekom, the company notably contributes to QIVICON, Europe's largest connected home platform. Joseph initiated the Chronix project in 2011 and led a three-year research collaboration with the University of Applied Sciences Munich to develop the open-source time series database solution presented in this talk.
Transcript summary
Joseph Adersberger, co-founder and CTO of a Munich-based software development company, presents Chronix (Chronix Spark)—an open-source solution for processing massive amounts of time series data built on Apache Spark and Apache Solr. Developed to handle a use case requiring analysis of 6.3 trillion observations per year across a globally distributed system, Chronix achieves superior storage efficiency (compressing 108GB to 8.7GB) and performance on high-level operations through innovative encoding schemes and distributed processing architecture.
Opening and Venue Welcome
The speaker opens with a warm welcome to "Area 42," a space designed for employees and the Munich IT community to teach each other, collaborate, and also play and party hard. "I think all of those parts will be addressed today." It is a great honor to host the Munich Data Geeks event as venue sponsor, the first such event since the space opened a few months ago. The speaker introduces himself as Joseph, co-founder and CTO, and today he wants to talk about time series processing with Apache Spark, specifically their solution called Chronix (Chronix Spark), which enables processing large amounts of time series data.
Company Background
First, a few words about the company. They are a custom software development company located in Munich with a bit more than 80 employees, focused on fixed-price projects. They are very glad to have received a "Great Place to Work" award this year, first place, so their employees are doing well. Their customers include large companies like BMW, Deutsche Telekom, and others. Their portfolio focuses on legacy software: systems that are not performing well or have maintainability problems, which they try to heal.
Service Offerings
They also act as an infrastructure partner for customers who need solutions for cognitive computing, based on frameworks like Mesos, Kubernetes, or OpenShift. They also do a lot of Big Data projects, mainly based on Apache Spark, which is why the speaker is talking about time series processing with Apache Spark. Another focus topic is the Internet of Things: they are helping Deutsche Telekom improve their connected home platform QIVICON, the largest connected home platform in Europe based on IoT and smart home technology.
Entering Time Series Processing
Now let's enter time series processing, starting with a few basic terms. We are surrounded by time series; a lot of data is best represented as time series, such as operational data from monitoring systems, performance metrics, or even log events. In data warehouse systems the dominant dimension is usually time, so that data can also be interpreted as time series. The same applies to Internet of Things data such as sensor measurements, financial data, climate data (shown on the right), and the tracking-tool data the speaker's team has also analyzed as time series.
Basic Terms and Definitions
What are the basic terms behind time series? A univariate time series has one value at each point in time. A multivariate time series has multiple values at each point in time (important to keep in mind for later). A multi-dimensional time series is one placed in a multi-dimensional space: which unit is measured, where it was measured geographically, on which object it was measured, and to which logical series it belongs. So there are different types of time series.
Observations and Time Series Sets
There are two more important concepts: an observation is the value at a certain point in time, and a time series set is a group of time series that belong together in some way. These are the basic terms.
Operations on Time Series
For time series, you can think of different kinds of operations to perform. There are basically two types: operations on the time axis of a single time series, and operations on the scale of values. A few examples of the first type: you can align time series to a certain point in time, shift a time series, or downsample it by resampling at coarser intervals or time points. A bit more complex, you can detect outliers in monitoring data; it is important to know whether there are any outliers.
Statistical Operations
The other type covers fairly trivial statistical operations on the values: minimum, maximum, average, median, slope (sometimes the slope is of interest too), standard deviation of the values within the time series, and so on.
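As a rough, non-Chronix illustration of these two operation types, here is a small Scala sketch with a shift and a downsample on the time axis and a couple of statistics over the values:

```scala
// Minimal sketch (not Chronix code) of the two operation types described above,
// on a single univariate time series modeled as (timestamp, value) pairs.
object TimeSeriesOps {
  type Observation = (Long, Double)

  // Time-axis operation: shift every observation by a fixed offset.
  def shift(series: Seq[Observation], offsetMs: Long): Seq[Observation] =
    series.map { case (t, v) => (t + offsetMs, v) }

  // Time-axis operation: downsample by averaging values per fixed-size time bucket.
  def downsample(series: Seq[Observation], bucketMs: Long): Seq[Observation] =
    series
      .groupBy { case (t, _) => t / bucketMs }
      .toSeq
      .sortBy(_._1)
      .map { case (bucket, obs) => (bucket * bucketMs, obs.map(_._2).sum / obs.size) }

  // Value-scale operations: simple statistics over the values.
  def mean(series: Seq[Observation]): Double = series.map(_._2).sum / series.size
  def max(series: Seq[Observation]): Double  = series.map(_._2).max
}
```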
The Use Case: Worldwide Distributed Systems
What was the use case, after clearing up those basic terms and a few examples of operations on time series? Their use case was analyzing the operational data of a worldwide distributed system. It is a critical application, and they wanted to perform root cause analysis when applications are not working properly. The system has more than 1,000 nodes distributed worldwide, about ten processes running per node, more than 20 metrics collected per process, and each metric is measured every second.
The Scale: 6.3 Trillion Observations
If you do the math, this works out to roughly 200,000 observations per second, or about 6.3 trillion observations per year. And they wanted to retain data for five years, so five years of history can be analyzed within the application monitoring system. This is a huge amount of data; trillions of observations is really big data. They had to look for a big data solution, and one very important property of a big data solution is that it has to "scale like a boss," horizontally, as far as possible.
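As a quick sanity check of that figure, the arithmetic from the talk works out as follows (a standalone sketch, not project code):

```scala
// Back-of-the-envelope check of the 6.3 trillion observations/year figure.
object ScaleCheck extends App {
  val nodes            = 1000L
  val processesPerNode = 10L
  val metricsPerProc   = 20L
  val samplesPerSecond = 1L
  val secondsPerYear   = 365L * 24 * 60 * 60                // 31,536,000

  val obsPerSecond = nodes * processesPerNode * metricsPerProc * samplesPerSecond // 200,000
  val obsPerYear   = obsPerSecond * secondsPerYear                                // ~6.3e12
  println(f"$obsPerYear%,d observations per year")          // 6,307,200,000,000
}
```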
Existing Solutions Insufficient
The solutions they evaluated on the market, especially the open-source ones, did not scale far enough for their use case. So they decided to build their own open-source solution, called Chronix. More information is available on the Chronix project website.
Chronix Overall Architecture
This is the overall architecture of Chronix. It is built in blocks: the data ingestion/collection building block, the storage building block where data is stored and processed, and the visualization building block where the data can be analyzed. In the collection building block, data can be ingested from Logstash, Telegraf, Collectd, and other well-known collectors. Chronix can also act as long-term storage for Prometheus, a common monitoring solution, which then takes care of collecting the metrics.
The Storage Layer
Data is then stored in the storage layer. All of these components share a common format, the Chronix format, which defines how time series data is encoded and decoded; the speaker will focus on this layer. For storage on a single node they use Solr, for distributed storage they use Solr Cloud, and Apache Spark does the number crunching to analyze the data.
Access Methods
You can access this stack from Zeppelin via a Zeppelin connector that connects to Chronix Spark and issues operations against it. In Zeppelin you can also run SQL statements and queries on time series data. There is also a rich client user interface called Chronix Analytics (screenshots follow later), and a Grafana connector so the time series can be shown in Grafana. This is the Chronix stack.
Development History
They started developing the stack in 2011. The starting point was a project at [university], and they then launched a research project with the University of Applied Sciences in Munich that ran for about three years, assembling the Chronix stack step by step.
Focus: Chronix Spark
Today the speaker wants to focus on Chronix Spark, where Spark is leveraged for time series processing. The basic architecture of Chronix Spark: they use Solr for [unclear] and for basic data retrieval. Solr is very good at slicing and dicing in a multi-dimensional space, so attribute queries work well, and Solr can also perform low-level aggregations.
Spark for Higher-Level Operations
For distributed processing, they use Spark to perform higher-level operations like outlier detection or anomaly detection, which are very limited in Solr. Spark gets its data from the Solr shards; each Spark partition corresponds to one shard on the Solr side. There are different options for processing the results afterwards: you can use Scala, and there are also connectors for Python and R to formulate your distributed processing code.
Chronix Analytics Screenshots
This is the overall architecture of Chronix Spark. Here are some screenshots of how they use it. One example is their Chronix Analytics tool: "not yet open source, we're working on it right now, it will be open source within the next month." One screenshot shows a multi-dimensional time series within the Chronix stack, with the ability to drill down into the dimensions and to select certain types of time series and the time frame to analyze. Multiple time series are visualized, with basic trend visualization as well as more complicated visualizations like box plots.
Anomaly Detection
Within the tool, they also perform anomaly detection based on Twitter's open-source anomaly detection project. They use it to highlight regions of a time series that are probably anomalous and then react in some way, for example by rebooting the affected system.
Design Goals
Let's get a bit deeper into what Chronix Spark is and how it can be used. But first, the goal: it should be an easy-to-use time series data storage and processing solution. They are not only targeting data scientists but also software engineers like themselves, people who are not data scientists but who still want to run queries on time series. It should be easy to use, and "Chuck Norris approves of it."
Time Series Model
The first question is: what time series model do they use? Within Chronix Spark, the model is a set of univariate, multi-dimensional, numeric time series. Working with sets of time series is more flexible: they can ask questions like "give me all time series of Project X, everywhere globally," get a large set back, and detect anomalies across all of them. Sets also parallelize better: a single time series is the basic unit of parallelization in Spark, so it is good to have large sets of time series.
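A minimal sketch of this model, using illustrative names rather than the actual Chronix classes:

```scala
// Illustrative sketch of the Chronix Spark model: univariate, multi-dimensional,
// numeric time series, grouped into sets. Names are not the real Chronix classes.
final case class TimeSeries(
  metric: String,                    // e.g. "memory.usage"
  attributes: Map[String, String],   // dimensions: host, process, data center, ...
  timestamps: Array[Long],           // epoch millis, one entry per observation
  values: Array[Double]              // exactly one numeric value per timestamp (univariate)
)

object TimeSeriesModel {
  // A "time series set" is simply a collection of time series; each series is
  // the unit of parallelization when the set is distributed as a Spark RDD.
  type TimeSeriesSet = Seq[TimeSeries]
}
```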
Why Univariate
It is univariate because multivariate time series introduce too much complexity, and since Chronix works with sets of time series, a set can be used to represent a multivariate time series anyway. For their use case, univariate is enough.
Why Multi-Dimensional
It is multi-dimensional because the ability to slice, dice, and drill down into different kinds of time series was important, as seen in the earlier screenshot. For monitoring, it is essential to have time series assigned to dimensions such as process, node, and geographic location.
Why Numerical
It is numerical because that is the most common case with monitoring data. They are currently working on supporting more than numeric time series, because they also want to analyze log events in the near future. This is the Chronix Spark time series model.
The Interface: Two Classes
The interface to Chronix Spark is quite simple: two classes are all you need to know to interact with it. The first is the Chronix RDD. RDDs (Resilient Distributed Datasets) are Spark's basic abstraction for distributed data; a Chronix RDD represents a set of time series, the central abstraction of the time series model, and provides distributed operations on that set. The second is the Chronix Spark Context, which is used to create Chronix RDDs and to connect to the Chronix server, i.e. the Solr Cloud backend.
The Data Types
Here is the interface of the Chronix RDD: it is a Spark RDD of metric time series, that is, a set of time series. It offers a few operations on this set, plus map and transformation operations, including transformations of the RDD into a DataFrame or a Dataset as provided by Spark.
Three Abstractions in Spark
Spark provides three abstractions for distributed processing: the RDD, the DataFrame, and the Dataset. Chronix wants to provide all of them because they have different properties. The RDD was the first abstraction available in Spark: it is typed, but only optimized in limited ways, and you cannot run SQL statements on an RDD. That is why Spark later introduced DataFrames.
Data Frames and Datasets
DataFrames are untyped (they are a table abstraction), highly optimized, have been around for some years, and support SQL statements. By transforming an RDD into a DataFrame, SQL statements can be run on time series. The youngest abstraction is the Dataset, which is being stabilized in Spark 2.0: it is typed, highly optimized (in contrast to RDDs), and supports SQL statements. Each abstraction has its own properties, so they decided to provide all of them.
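For illustration, here is a generic Spark sketch (not tied to the Chronix API) that shows the three abstractions side by side, including SQL on a DataFrame:

```scala
// Generic Spark example of the three abstractions; not Chronix-specific.
import org.apache.spark.sql.SparkSession

object AbstractionsDemo {
  final case class Obs(metric: String, timestamp: Long, value: Double)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("abstractions").master("local[*]").getOrCreate()
    import spark.implicits._

    // RDD: typed, but no SQL support.
    val rdd = spark.sparkContext.parallelize(Seq(
      Obs("memory.usage", 1000L, 0.42),
      Obs("memory.usage", 2000L, 0.47)))

    // DataFrame: untyped rows, Catalyst-optimized, SQL-capable.
    val df = rdd.toDF()
    df.createOrReplaceTempView("observations")
    spark.sql("SELECT metric, avg(value) AS avg_value FROM observations GROUP BY metric").show()

    // Dataset: typed like an RDD and optimized like a DataFrame (stable with Spark 2.0).
    val ds = rdd.toDS()
    ds.filter(_.value > 0.45).show()

    spark.stop()
  }
}
```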
Metric Time Series Data Type
Now let's have a look at the metric time series data type. A Chronix RDD is an RDD of metric time series; this is their time series abstraction, a set of time series. The metric time series data type also provides operations: an iterator over all timestamps of the time series, an iterator over all numeric values, and, for more convenient access, a method that streams the observations as pairs of timestamp and numeric value.
Multi-Dimensional Attributes
One method represents the multi-dimensional aspect of the time series: an attributes accessor, through which the existing dimensions can be read and new dimensions defined for the time series.
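A hypothetical sketch of such a data type, with method names that mirror the description above but may not match the real class:

```scala
// Hypothetical sketch of a metric time series data type as described above.
// Method names are illustrative and may differ from the actual Chronix class.
final class MetricTimeSeries(
    val metric: String,
    private val ts: Array[Long],
    private val vs: Array[Double],
    private val attrs: Map[String, String]) {

  def timestamps: Iterator[Long] = ts.iterator                   // all timestamps
  def values: Iterator[Double] = vs.iterator                     // all numeric values
  def points: Iterator[(Long, Double)] =                         // convenient (timestamp, value) pairs
    ts.iterator.zip(vs.iterator)
  def attribute(name: String): Option[String] = attrs.get(name)  // read one dimension
  def attributes: Map[String, String] = attrs                    // all dimensions
}
```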
Setting Up Development Environment
Then on to coding. If you actually want to use Chronix Spark this evening (grab a beer, get into this room, and hack a little), the first thing to do is set up your development environment and use the Chronix Spark context. There is one important method, query, which returns your Chronix RDD; you need to provide a Solr query.
Solr Query Structure
The Solr query grabs all the time series relevant to you. Besides the query, you have to supply the Solr host (so that all Solr Cloud servers can be found), the database name, and the adapter that deserializes data from Solr into Spark. This is shown in a code sample.
Code Example
In the code sample they then perform operations, in this case max or mean. You initialize a Chronix context, essentially by providing a Java Spark context, then define your Solr query. Here the query selects all time series whose metric is memory usage, regardless of process or node, so all matching time series are returned. Then you can run your operation on the result.
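A sketch of that flow, following the naming used in the talk; the Chronix-specific calls are left as comments because the exact API may differ:

```scala
// Sketch of the flow described above: initialize a Chronix context over a Spark
// context, define a Solr query, run an aggregation. The Chronix-specific calls
// are hypothetical and therefore commented out.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.api.java.JavaSparkContext

object ChronixQueryDemo {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("chronix-demo").setMaster("local[*]"))
    val jsc = new JavaSparkContext(sc)   // Chronix expects a Java Spark context

    // Hypothetical Chronix Spark calls, following the talk:
    // val csc = new ChronixSparkContext(jsc)
    // val rdd = csc.query(
    //   new SolrQuery("metric:*memory*usage*"),  // all memory-usage series, any process or node
    //   "zookeeper-host:2181",                   // where to find the Solr Cloud servers
    //   "chronix",                               // database / collection name
    //   storageAdapter)                          // adapter that deserializes Solr docs into Spark
    // val meanUsage = rdd.mean()                 // distributed operation over all matching series

    sc.stop()
  }
}
```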
Using Zeppelin
If you don't want to code like this, you can also use Zeppelin, as mentioned earlier, to query the time series data. One example uses normal SELECT statements: "SELECT timestamp, value, process, metric FROM the time series table WHERE metric is memory free OR metric is CPU user space ORDER BY timestamp." The results can be visualized as charts directly in Zeppelin, which makes things a bit more convenient.
Opening the Black Box
Let's open the black box a little; the speaker presents a few internals of Chronix Spark. One very important internal is storage: how time series are stored in Solr documents. The basic idea is quite simple: a time series, potentially a very large one, is chopped into chunks. The chunks are sized not by time span but by observation count: each chunk holds a fixed number of observations, around 1,000 values, so different chunks can cover different time spans.
Chunk Storage in Solr
Each chunk is stored in one Solr document. A chunk carries its start timestamp, its end timestamp, the unit, the dimensions, and the values (which would more precisely be called observations). The observations themselves are stored as encoded bytes: the values are encoded with Google's Protocol Buffers format, and the timestamps are delta encoded.
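A rough sketch of the chunking step, assuming a fixed observation count per chunk and leaving the byte encoding abstract:

```scala
// Sketch of the chunking idea: split one long time series into chunks of a fixed
// number of observations; each chunk becomes one Solr document carrying its start
// and end timestamp, the dimensions, and the encoded observation bytes.
final case class Chunk(
  start: Long,                      // timestamp of the first observation in the chunk
  end: Long,                        // timestamp of the last observation in the chunk
  attributes: Map[String, String],  // dimensions of the time series
  data: Array[Byte]                 // encoded observations (e.g. delta + Protocol Buffers)
)

object Chunker {
  def chunkSeries(
      observations: Seq[(Long, Double)],
      attributes: Map[String, String],
      chunkSize: Int = 1000)(encode: Seq[(Long, Double)] => Array[Byte]): Seq[Chunk] =
    observations.grouped(chunkSize).map { obs =>
      Chunk(obs.head._1, obs.last._1, attributes, encode(obs))
    }.toSeq
}
```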
Delta Encoding Explained
What happens here: since the number of values per chunk is known, you can assume homogeneously distributed timestamps. Monitoring data is normally not perfectly homogeneous, but the deviations are small, so only the differences from the ideal, evenly distributed timestamps are stored. Values are delta encoded as well: if the value never changes, no value needs to be stored, just zeros. That is why the format is so space-efficient.
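A minimal sketch of this delta-encoding idea (not the actual Chronix codec):

```scala
// Minimal sketch of the delta-encoding idea; not the actual Chronix codec.
object DeltaEncoding {
  // Timestamps: store only the deviation from an assumed evenly spaced grid that
  // starts at the first timestamp. Perfectly regular sampling yields all zeros.
  // Assumes a non-empty chunk.
  def encodeTimestamps(timestamps: Seq[Long], expectedIntervalMs: Long): Seq[Long] = {
    val start = timestamps.head
    timestamps.zipWithIndex.map { case (t, i) => t - (start + i * expectedIntervalMs) }
  }

  // Values: store the first value and then only deltas to the previous value,
  // so a constant series becomes a run of zeros and compresses very well.
  def encodeValues(values: Seq[Double]): Seq[Double] =
    if (values.isEmpty) Seq.empty
    else values.head +: values.sliding(2).collect { case Seq(a, b) => b - a }.toSeq
}
```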
Chunk Size Optimization
They did some research on how big chunks should be, in terms of the number of observations, and which compression codec fits the encoded bytes best. They found that about 128 kilobytes per chunk is the best-fitting combination of chunk size and compression codec.
The Time Series Database Landscape
Chronix is not alone. There is a whole time series database landscape, including a list of the available time series databases: besides Chronix there are, for example, Prometheus, OpenTSDB, and InfluxDB (Prometheus is also a kind of time series database). They did a comparison.
Storage Benchmark
First, they benchmarked storage, measured in gigabytes. The raw data for the project was 108.2 gigabytes, loaded into InfluxDB, OpenTSDB, and Chronix. Chronix is the most storage-efficient: the 108.2 gigabytes of raw data become 8.7 gigabytes in Chronix, versus about 10 gigabytes in InfluxDB, and so on. Thanks mainly to the delta encoding of timestamps and values and the chosen compression codec, they beat the others on storage.
Performance Benchmark
They also benchmarked performance across different kinds of operations, from basic ones like average, maximum, and minimum to high-level ones like outlier detection, trend detection, and frequency detection, which are very useful for anomaly detection. Chronix does not perform as well on the basic operations, but on the high-level operations it is fast: all of those are implemented in Spark, and predicate pushdown is used.
Scalability Architecture
How do they scale? For scale-out of storage, they rely on Solr's built-in mechanisms, which are not container based. For scale-out of data processing, they use Spark, which can also run on container technologies. One showcase runs Spark on top of DC/OS, a container solution with container orchestration, and the benchmark they are currently running for distributed Chronix is based on this DC/OS setup.
Q&A: Pattern Identification
An audience member notes that the talk focused on operations and asks about identifying patterns and structures, say the position of peaks. The speaker responds that the result would be a set of time series containing just those peaks; Chronix does not provide this out of the box right now, but the Chronix RDD could easily be enhanced to do so. Individual peaks can already be identified, but often you want derived features such as the interval between peaks or how high the peaks are in absolute terms. That is not among the basic operations right now, though all observations of a Chronix time series are accessible, so it could be built on top.
Closing and Contributors
The speaker is nearing the end. He thanks the contributors to Chronix Spark, and in particular "this guy is a great contributor," the mastermind behind Chronix, their PhD student shown in the photo [the recording trails off as the presentation ends].
Key Technical Insights and Contributions
This presentation demonstrates several important achievements in time series database engineering. First, the problem scale drives architecture—6.3 trillion observations per year across 1,000 nodes required building custom infrastructure when existing solutions (InfluxDB, OpenTSDB) couldn't scale sufficiently. Second, storage efficiency through clever encoding matters enormously at scale—reducing 108GB to 8.7GB (92% compression) through delta encoding of timestamps, Protocol Buffers for values, and storing only differences from expected distributions.
Third, the chunking strategy (1,000 observations per chunk, 128KB optimal size) balances storage efficiency with query performance—too small creates overhead, too large reduces parallelism. Fourth, multi-dimensional modeling (process, node, geography, metric) enables the "slice and dice" operations critical for troubleshooting distributed systems—you can drill down from "all servers globally" to "this specific process on this node."
Fifth, the three-abstraction approach (RDD, DataFrame, Dataset) acknowledges that different use cases need different trade-offs between type safety, SQL support, and optimization—providing all three gives users flexibility. Sixth, the hybrid architecture (Solr for low-level aggregations and multi-dimensional queries, Spark for high-level operations like anomaly detection) plays to each technology's strengths rather than forcing one tool to do everything.
Seventh, the time series as the unit of parallelization in Spark enables efficient distributed processing—each time series can be processed independently, making horizontal scaling straightforward. Eighth, integration with ecosystem tools (Zeppelin, Grafana, Prometheus) provides familiar interfaces rather than forcing users to learn entirely new workflows—the "don't reinvent the wheel" philosophy for user experience.
Finally, the open-source strategy with university partnership (three-year research project at Munich University of Applied Sciences) shows how academic collaboration can produce production-grade systems—combining research rigor with practical engineering to solve real industrial problems. The work addresses a genuine gap in the time series database landscape for truly massive scale monitoring with long retention periods.