
Scaling Time Series Analysis on the MapR Converged Data Platform

Blog Post created by maprcommunity Employee on Jun 21, 2017

BY Dong Meng

 

Introduction

A time series is a collection of observations x_t, where x_t is the event recorded at time t. Common motivations for time series analysis include forecasting, clustering, classification, point estimation, and detection (in the signal processing domain).

With sensor technologies now widespread, the Internet of Things (IoT) is growing in popularity. In a highly distributed IoT scenario (autonomous driving, oil drilling, healthcare wearables), timestamped data streams back to your data center and is stored there. Today, the value of the data is higher than the value of the IoT technology itself. If you can analyze the data as it arrives in your data center, rather than waiting for a certain period, and engage in exploratory analysis on that data, you will gain more value from that information and be able to make an impact faster.

MapR Time Series Quick Start Solution

The aim of the MapR Time Series Quick Start Solution is to solve the time series data collection and forecasting problem at scale. The applications that form the technology stack are MapR Streams (streaming the event data into your data center), OpenTSDB (storing the data in a high-performance time series database), and Spark (data processing and forecasting). A high-level diagram of the workflow appears below:

Picture 1

MapR Streams is the integrated publish/subscribe messaging engine in the MapR Converged Data Platform. Producer applications can publish messages to topics (i.e., logical collections of messages) that are managed by MapR Streams. Consumer applications can then read those messages at their own pace. All messages published to MapR Streams are persisted, which lets future consumers “catch up” on processing and lets analytics applications process historical data. In addition to reliably delivering messages to applications within a single data center, MapR Streams can continuously replicate data between multiple clusters, delivering messages globally. Like other MapR services, MapR Streams has a distributed, scale-out design, allowing it to scale to billions of messages per second, millions of topics, and millions of producer and consumer applications. Find more information on MapR Streams here.
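
The persistence and "read at your own pace" behavior described above can be illustrated with a small in-memory sketch. This is not the MapR Streams API; the `ToyStream` class, the topic name `sensor-r1`, and the consumer names are all hypothetical, and a real stream would persist to disk and distribute across a cluster.

```python
from collections import defaultdict


class ToyStream:
    """In-memory stand-in for a pub/sub stream; NOT the MapR Streams API."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> append-only message log
        self._offsets = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        # Messages are appended and never removed, so they stay readable later.
        self._topics[topic].append(message)

    def poll(self, consumer, topic, max_messages=10):
        # Each consumer advances its own offset, independent of other consumers.
        start = self._offsets[(consumer, topic)]
        batch = self._topics[topic][start:start + max_messages]
        self._offsets[(consumer, topic)] = start + len(batch)
        return batch


stream = ToyStream()
for reading in (0.12, 0.15, 0.11):
    stream.publish("sensor-r1", reading)

fast = stream.poll("dashboard", "sensor-r1")                  # reads all three messages
slow = stream.poll("batch-job", "sensor-r1", max_messages=1)  # reads only the first
stream.publish("sensor-r1", 0.18)
late = stream.poll("new-app", "sensor-r1")                    # late consumer sees the full history
```

Because the log is append-only and offsets are tracked per consumer, a slow consumer and a consumer that joins late both see every message in order.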

OpenTSDB is an open source, scalable time series database with HBase as its main back-end. Because MapR-DB implements the HBase API, MapR-DB serves as the back-end in this quick start solution instead. The high performance achieved by OpenTSDB is due to the following optimizations, specifically targeted at time series data:

  1. A separate look-up table is used to assign unique IDs to metric names and tag names in the time series;
  2. The number of rows is reduced by storing multiple consecutive data points in the same row, which makes seeking faster on reads.
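
The two optimizations above can be sketched as a toy row-key builder. This is a simplified illustration, not OpenTSDB's actual code: the 3-byte ID width and the one-hour row span are assumed here, the single shared `UidTable` and the name-based tag ordering are simplifications, and the metric name `gas.reading` and the tag `sensor=r15` are hypothetical.

```python
import struct

ROW_SPAN_SECONDS = 3600  # assumed row width: one row per series per hour
UID_WIDTH = 3            # assumed ID width in bytes


class UidTable:
    """Toy look-up table that maps each distinct name to a small fixed-width ID."""

    def __init__(self):
        self._ids = {}

    def get_or_assign(self, name):
        if name not in self._ids:
            self._ids[name] = len(self._ids) + 1  # IDs start at 1
        return self._ids[name].to_bytes(UID_WIDTH, "big")


def row_key_and_offset(uids, metric, timestamp, tags):
    """Return (row_key, offset): the row for this hour and the point's offset within it."""
    base = timestamp - (timestamp % ROW_SPAN_SECONDS)  # align to the start of the hour
    key = uids.get_or_assign(metric) + struct.pack(">I", base)
    for tag_name, tag_value in sorted(tags.items()):
        key += uids.get_or_assign(tag_name) + uids.get_or_assign(tag_value)
    return key, timestamp - base


uids = UidTable()
tags = {"sensor": "r15"}
key_a, off_a = row_key_and_offset(uids, "gas.reading", 1493683200, tags)
key_b, off_b = row_key_and_offset(uids, "gas.reading", 1493683260, tags)
key_c, off_c = row_key_and_offset(uids, "gas.reading", 1493683200 + 3600, tags)
```

Here `key_a` and `key_b` are identical, so the two points taken one minute apart land in the same row and differ only in their offsets, while `key_c` starts a new row for the next hour. The long metric and tag strings appear only once, in the look-up table.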

On MapR, benchmarked ingestion rates reach as high as 100 million data points per second (link).

Apache Spark lets us consume data from MapR Streams and provides data processing/parsing functions, while training machine learning models with multivariate time series regression algorithms. Our Spark Streaming code picks up the data from MapR Streams, briefly processes it, and writes it to OpenTSDB; meanwhile, the machine learning model is fit to the data and its predictions are written to OpenTSDB as well.
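
As a rough sketch of the "write to OpenTSDB" step, the snippet below builds a JSON body in the shape accepted by OpenTSDB's HTTP `/api/put` endpoint (fields `metric`, `timestamp`, `value`, `tags`). How the quick start code actually writes to OpenTSDB is not shown in this article, so treat this as one possible approach; the function name `to_put_body`, the metric name `gas.sensor`, and the `name` tag key are hypothetical.

```python
import json


def to_put_body(timestamp, readings, metric="gas.sensor"):
    """Build a JSON body for OpenTSDB's HTTP /api/put endpoint.

    `readings` maps a series name (e.g. "r15", "prediction") to a numeric value.
    The metric name and the "name" tag key are hypothetical choices.
    """
    points = [
        {
            "metric": metric,
            "timestamp": int(timestamp),
            "value": float(value),
            "tags": {"name": series},
        }
        for series, value in sorted(readings.items())
    ]
    return json.dumps(points)


# One parsed message from the stream, plus the model's prediction for the same instant.
body = to_put_body(1493683200, {"r15": 2.31, "r16": 2.08, "prediction": 0.74})
```

Writing the prediction under the same metric with a different tag value keeps it next to the sensor series, so the two can be overlaid on one graph.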

In our example, we used gas sensor data from the UCI Machine Learning Repository (link). With this dataset, we try to predict the ethylene level based on 16 sensors that monitor the gas content. The exploratory plot below shows the time series for 16 sensor readings:

Picture 2

We use basic linear regression on some autoregressive features as well as some second-derivative features. It is also good practice to look into the seasonality and stationarity of the time series data and apply smoothing/differencing algorithms to prepare the data for processing. For a target with obvious on/off status, we could also consider combining a regression model and a binary classification model to obtain a better RMSE.
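
A minimal sketch of this kind of model, on synthetic data rather than the gas sensor dataset, is shown below. It assumes one autoregressive feature (the previous target value) and interprets "second derivative" as a discrete second difference of a single sensor series; the actual features used in the quick start are not listed here, and the function names are hypothetical. The fit uses plain ordinary least squares via the normal equations, in pure Python so the example stays self-contained.

```python
import math


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def build_features(sensor, target):
    """Rows of [1, target[t-1], second difference of sensor at t] for t >= 2."""
    rows, labels = [], []
    for t in range(2, len(target)):
        second_diff = sensor[t] - 2 * sensor[t - 1] + sensor[t - 2]
        rows.append([1.0, target[t - 1], second_diff])
        labels.append(target[t])
    return rows, labels


def fit_ols(rows, labels):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, labels)) for i in range(k)]
    return solve(xtx, xty)


def rmse(rows, labels, weights):
    """Root mean squared error of the linear predictions against the labels."""
    errors = [sum(w * v for w, v in zip(weights, r)) - y for r, y in zip(rows, labels)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Synthetic series generated from known coefficients (intercept 1.0, AR 0.5, second-diff 2.0).
sensor = [math.sin(0.3 * t) + math.cos(1.1 * t) for t in range(200)]
target = [0.0, 0.0]
for t in range(2, 200):
    d2 = sensor[t] - 2 * sensor[t - 1] + sensor[t - 2]
    target.append(1.0 + 0.5 * target[t - 1] + 2.0 * d2)

rows, labels = build_features(sensor, target)
weights = fit_ols(rows, labels)
error = rmse(rows, labels, weights)
```

Because the synthetic target is an exact linear function of the features, the fit should recover the generating coefficients and the RMSE should be close to zero; on real sensor data the error would of course be larger.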

The screenshot below gives an example from the OpenTSDB UI:

Picture 3

The metric names in the data are stored as tags in OpenTSDB. In this figure, the blue and purple lines are two feature metrics, r15 and r16. The green line is the target time series, and the red line is our prediction: notice how closely the red and green lines track each other. OpenTSDB provides options for automatically refreshing this dashboard.

Summary

This article focuses on the workflow; the algorithm applied can be customized to the distribution of the data and the requirements of the business. I have packaged the quick start solution as an extension of the MapR 5.2 Docker container for demo purposes. If you have Docker installed, you can launch the demo from your laptop by following the steps in my Docker hub link.

A recording of this demo shows how the Docker image works. The image takes some time to start, because it launches the MapR and OpenTSDB services plus the MapR Streams and Spark applications. The video can be viewed on YouTube.

Editor's Note: Article originally posted in the Converge Blog on May 02, 2017.

Additional resources

Persistence in the Age of Microservices: Introducing MapR Converged Data Platform for Docker

Getting Started with MapR Client Container

MapR-XD - PACC
