
How to Set Up MapR-DB to Elasticsearch Replication

Blog Post created by maprcommunity Employee on Dec 8, 2016


by Mathieu Dumoulin

 

The Use Case

Automatic replication of MapR-DB data to Elasticsearch is useful for many environments, and I want to share information about a specific customer deployment I worked on recently. Their use case is related to log security analytics and is centered around using Drill for running interactive queries on aggregated data.

This data is streamed into MapR-DB in real time at a rate of thousands of events per second through a Logstash to MapR-DB plugin. Apache Drill queries are run against Parquet files created at regular intervals from the MapR-DB data. The Drill queries work well but aren’t real-time, and the customer also wanted to visualize the data as it is ingested.

The solution we proposed, and that was implemented, was to leverage a great feature unique to MapR, which is the MapR-DB to Elasticsearch replication through the MapR Gateway service.

By and large, the MapR documentation of this feature is sufficient for an experienced MapR admin to get replication working without too much trouble. In this post, I’ll share information from my on-site experiences and give some pointers to help you avoid issues or understand why your replica isn’t working.

For the purposes of making this post a bit more interactive, I’ll add some sample data and instructions that will let anyone reproduce this on the MapR Sandbox version 5.2, available for free.

For this tutorial, we’ll use network metrics data as an example. The data will be as follows:

Table: nm

Column        Data Type    Example
destHost      string       10.0.0.200
destPort      integer      80
sourceHost    string       10.0.0.176
sourcePort    integer      22
bytesCount    long         10
timestamp     timestamp    2016-10-18T16:17:18.000Z

MapR-DB Replication Using the MapR Gateway Service

MapR-DB is a NoSQL database that follows in the footsteps of Google BigTable. More specifically, it started as a reimplementation of Apache HBase designed from the ground up to take advantage of the advanced inner workings of the MapR Converged Data Platform. It also now has native JSON support to more easily handle hierarchical, nested, and evolving data formats.

At its core, the MapR-DB replication feature was designed to replicate a MapR-DB table automatically to a MapR-DB instance running on another cluster. One primary use case is for a global enterprise to improve speed of access and get multi-region HA automatically, with a guarantee on data consistency. This feature can get really fancy with bi-directional replication, where applications can read from and write to either replica and still know both are always kept up to date.

Additionally, through the MapR Gateway service, it’s possible to automatically replicate a MapR-DB table into an Elasticsearch index (ES currently limited to version 2.2).

Use Cases for Elasticsearch Replication

There are four great use cases I can think of for this feature.

  • Full text search of the data in MapR-DB
  • Geospatial searches
  • Kibana visualization of the data, especially useful for time series data like sensor data or performance/network metrics
  • ES as a secondary index for a MapR-DB table

More info can be found here and here.

Setup Guide

Choices in Solution Design

If you just want to try this feature out, then the MapR Sandbox is a great way to get started quickly and I’ll make sure to cover that in this guide.

For those who may want to use this feature on a production cluster though, there are a couple of configurations to ponder:

  • Co-locate the ES cluster with the MapR cluster
  • Use an external ES cluster

Unsurprisingly, if you have plenty of hardware servers, then the external ES cluster should be the preferred solution to isolate services and reduce failure impact as well as reserve the cluster resources for actual big data processing.

While putting the ES cluster on separate nodes is the recommended solution for a production cluster, it is also possible to colocate part or all of an ES cluster with MapR nodes. Keep in mind that memory resources taken by ES are not available to the cluster.

For sizing of the ES cluster, the main factors are storage needs and incoming data throughput. The more data, the more nodes will be needed. The sizing issue is well explained in the MapR documentation.

Preparation

INSTALL ES (SINGLE NODE OR CLUSTER MODE)

Elasticsearch installation is well documented and I won’t get into details of cluster installation here.

For the purpose of this tutorial, it is enough to download the ES 2.2.0 tar archive and extract it directly into the mapr user’s home directory.

I edited two values in the configuration file (elasticsearch-2.2.0/config/elasticsearch.yml):

cluster.name: elastic
network.host: 10.0.2.15

Setting network.host makes Elasticsearch listen on the sandbox’s host IP, which I found by running the “ifconfig” command from the command line as user mapr.

Optional: add port forwarding to access ES from your host

In VirtualBox, I added TCP port 9200 to the list of Port Forwarding Rules.

Remember the hostname of the ES instances, and remember that the supported ES version is 2.2. This is important: otherwise, there is a good chance the replication will fail.

Let’s go ahead and create an index we’re going to use as a target for the replication.

 

$> curl -X PUT maprdemo:9200/networkmetrics/ -d '
{
  "mappings" : {
    "metrics" : {
      "properties" : {
        "metrics" : {
          "dynamic" : "true",
          "properties" : {
            "destHost" : { "type" : "ip" },
            "sourceHost" : { "type" : "ip" },
            "timestamp" : { "type" : "date", "format" : "strict_date_time" },
            "bytesCount" : { "type" : "long" },
            "destPort" : { "type" : "integer" },
            "sourcePort" : { "type" : "integer" }
          }
        }
      }
    }
  }
}'

 

If you get a {"acknowledged":true} response from the curl command, then your Elasticsearch is ready to go and no further configuration is necessary.

To verify, run this command:

$> curl 'maprdemo:9200/networkmetrics/_mapping/metrics?pretty'
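
The one-line JSON body in the index creation command is easy to mistype. As a rough sketch (not part of the original setup), the same body can be generated in Python so that the braces are always balanced. The helper name build_mapping_body is hypothetical; the field types are copied from the curl command above.

```python
import json

# Field types copied from the index creation command in this post.
FIELD_TYPES = {
    "destHost": {"type": "ip"},
    "sourceHost": {"type": "ip"},
    "timestamp": {"type": "date", "format": "strict_date_time"},
    "bytesCount": {"type": "long"},
    "destPort": {"type": "integer"},
    "sourcePort": {"type": "integer"},
}


def build_mapping_body(doc_type="metrics"):
    """Return the index-creation body as a JSON string (hypothetical helper)."""
    body = {
        "mappings": {
            doc_type: {
                "properties": {
                    "metrics": {
                        "dynamic": "true",
                        "properties": FIELD_TYPES,
                    }
                }
            }
        }
    }
    return json.dumps(body, indent=2)


# Print the body so it can be redirected to a file.
print(build_mapping_body())
```

If the output is saved to a file such as mapping.json (an arbitrary name), curl can send it with -d @mapping.json instead of an inline string.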

Ok, Elasticsearch is done. We can now move on to creating a MapR-DB table.

Create a MapR-DB Table

There are a variety of ways to create a MapR-DB table, the easiest being to use MCS. Log in to MCS (https://localhost:8443) with user/pass mapr/mapr. In the “Volumes” tab, you can see that the ‘/tables’ volume has already been created for us.

So we can go to the ‘MapR-DB’ tab and create a table in that volume. To create the table, just click the ‘Create Table’ button, and enter the path ‘/tables/nm’.

That’s it! Inserting the data will create the columns automatically. We don’t need to worry about data types as MapR-DB only stores bytes and it’s up to the application to convert the data to/from bytes. This is a common pattern for NoSQL databases.
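
To make the “only stores bytes” point concrete, here is a minimal Python sketch of what the application-side conversion can look like. It assumes every value is stored as the UTF-8 bytes of its string form, which is what the HBase shell does for the sample data later in this post. The function names are made up for illustration, and the struct.pack line is only there to show how different a binary encoding of the same number looks.

```python
import struct


def to_cell_bytes(value):
    """Encode a value as the UTF-8 bytes of its string form."""
    return str(value).encode("utf-8")


def from_cell_bytes(raw, as_type=str):
    """Decode UTF-8 bytes and convert the text to the requested type."""
    return as_type(raw.decode("utf-8"))


row = {"destHost": "10.0.0.200", "destPort": 80, "bytesCount": 10}
cells = {name: to_cell_bytes(value) for name, value in row.items()}

print(cells["destPort"])                        # b'80'
print(from_cell_bytes(cells["destPort"], int))  # 80

# The same integer as a 4-byte big-endian value is a different byte string.
print(struct.pack(">i", 80))                    # b'\x00\x00\x00P'
```

The database will happily store either form; what matters is that the writer and every reader of the table agree on one convention.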

Using MapR-DB JSON Tables

MapR-DB also has native support for JSON documents via the OJAI API, as opposed to storing only bytes as in HBase-style tables. Used this way, MapR-DB becomes a MongoDB-style NoSQL database that is more stable, faster, and easier to use. The replication feature works equally well with either JSON or HBase-style tables. Check Getting Started with OJAI in the MapR documentation for the details.

This post is based on what was learned during a customer engagement for a real production use case. Although we would generally recommend JSON tables, we selected HBase tables for the following reasons:

  • Due to tight deadlines, the customer wanted to minimize extra programming effort, and they already had a data input solution that was compatible with the HBase API.
  • The customer’s data was already in tabular format. This format can be used directly with HBase tables.

 

Add data to the MapR-DB Table

We should also add some data to the table. Here is some sample data. It’s just three rows, but if the job succeeds, then we know the setup is correct and we can be confident that it would work equally well with much, much larger sizes. To enter this data, simply copy and paste it into the HBase shell.

Start the MapR-DB HBase shell CLI as mapr user:

$> hbase shell

Then copy and paste these lines into it:

put '/tables/nm','000000f9-d637-4b4f-8ee1-577d0b6812dd','m:destHost','133.189.219.255',1476838608478
put '/tables/nm','000000f9-d637-4b4f-8ee1-577d0b6812dd','m:destPort','138',1476838608478
put '/tables/nm','000000f9-d637-4b4f-8ee1-577d0b6812dd','m:timestamp','2016-10-19T00:57:18.000Z',1476838608478
put '/tables/nm','000000f9-d637-4b4f-8ee1-577d0b6812dd','m:bytesCount','229',1476838608478
put '/tables/nm','000000f9-d637-4b4f-8ee1-577d0b6812dd','m:sourceHost','133.189.219.192',1476838608478
put '/tables/nm','000000f9-d637-4b4f-8ee1-577d0b6812dd','m:sourcePort','138',1476838608478
put '/tables/nm','00000184-7e39-4bc0-89b1-fe69f005f720','m:destHost','133.189.219.84',1476933532115
put '/tables/nm','00000184-7e39-4bc0-89b1-fe69f005f720','m:destPort','10050',1476933532115
put '/tables/nm','00000184-7e39-4bc0-89b1-fe69f005f720','m:timestamp','2016-10-20T02:34:31.000Z',1476933532115
put '/tables/nm','00000184-7e39-4bc0-89b1-fe69f005f720','m:bytesCount','52',1476933532115
put '/tables/nm','00000184-7e39-4bc0-89b1-fe69f005f720','m:sourceHost','133.189.220.120',1476933532115
put '/tables/nm','00000184-7e39-4bc0-89b1-fe69f005f720','m:sourcePort','52757',1476933532115
put '/tables/nm','000001aa-ae6f-43c3-a668-533581bbfa05','m:destHost','172.30.120.101',1476807407102
put '/tables/nm','000001aa-ae6f-43c3-a668-533581bbfa05','m:destPort','1531',1476807407102
put '/tables/nm','000001aa-ae6f-43c3-a668-533581bbfa05','m:timestamp','2016-10-18T16:17:18.000Z',1476807407102
put '/tables/nm','000001aa-ae6f-43c3-a668-533581bbfa05','m:bytesCount','7893',1476807407102
put '/tables/nm','000001aa-ae6f-43c3-a668-533581bbfa05','m:sourceHost','172.31.68.27',1476807407102
put '/tables/nm','000001aa-ae6f-43c3-a668-533581bbfa05','m:sourcePort','49239',1476807407102


To exit the shell, just type ‘quit’.
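
For more than a handful of rows, typing put statements by hand gets tedious. The following Python sketch generates lines in the same format as the ones above, one per column. The function name put_statements is hypothetical, and the sketch assumes that values contain no single quotes.

```python
def put_statements(table, family, row_key, record, ts_millis):
    """Yield one HBase shell put line per column in the record."""
    for column, value in record.items():
        yield "put '{}','{}','{}:{}','{}',{}".format(
            table, row_key, family, column, value, ts_millis)


# First sample row from this post.
record = {
    "destHost": "133.189.219.255",
    "destPort": "138",
    "timestamp": "2016-10-19T00:57:18.000Z",
    "bytesCount": "229",
    "sourceHost": "133.189.219.192",
    "sourcePort": "138",
}

for line in put_statements("/tables/nm", "m",
                           "000000f9-d637-4b4f-8ee1-577d0b6812dd",
                           record, 1476838608478):
    print(line)
```

The printed lines can be pasted into the HBase shell as before.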

One source of problems at this step is that the data entered via the HBase shell is all in string format, which normally means UTF-8. If the data looks fine but isn’t UTF-8, the gateway will parse it incorrectly and ES will complain that the values are all messed up. The best way to avoid this type of bytes conversion issue may be to use OJAI and JSON tables.
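
If the input comes from another system, a quick check before loading can save some debugging. This small Python sketch tests whether a byte string decodes cleanly as UTF-8; the function name is_valid_utf8 is my own, and the check says nothing about whether the decoded text is a sensible value.

```python
def is_valid_utf8(raw):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_valid_utf8(b"133.189.219.255"))              # True
print(is_valid_utf8("é".encode("latin-1")))           # False
print(is_valid_utf8("10.0.0.200".encode("utf-16")))   # False
```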

Install MapR Gateway Service

First, install the mapr-gateway package on one or more nodes. On a production cluster, it’s always recommended to have at least two gateways to enable high availability. The number of nodes running the gateway should be based on the network bandwidth requirement as well as cluster hardware and available resources.

To install the package, log in as root (either run su root after logging in as mapr, or log in as root directly; the password is also ‘mapr’). Then install the package using yum:

$> yum install -y mapr-gateway

After installing the package, still as ‘root’, configure the system again:

 

$> /opt/mapr/server/configure.sh -R
$> service mapr-warden restart

 

The details are all available on the MapR documentation site.

Register Elasticsearch

Next, we need to register ES with the MapR cluster. This basically means copying over some libraries for the gateway to use. An ES cluster needs to be registered only once per MapR cluster, and can then be reused to replicate many tables to different indices and types.

To do this, run the script /opt/mapr/bin/register-elasticsearch. Parameters:

  • -c <ES name>: a tag that will be used as the target in the replica setup command. The recommended name is the ES cluster name, but it could be anything.
  • -r <ES hostname/IP>
  • -t: use the transport client. This is the only client supported by MapR 5.2 and is required in conjunction with the -r parameter.
  • -e: the directory where ES is installed. Note that if ES is installed via the RPM/Deb package, this parameter is not necessary.
  • -y: do not prompt for values. If following the steps here, it’s safe to use.

 

On the sandbox, run this command as the mapr user to register ES:

 

$> /opt/mapr/bin/register-elasticsearch -c elastic -r maprdemo -t -e /home/mapr/elasticsearch-2.2.0 -y

You will be prompted for a password three times. Enter ‘mapr’ each time. If the command completes successfully, there will be a confirmation message.

To verify ES is registered properly, you can then enter this command (notice the -l parameter):

$> /opt/mapr/bin/register-elasticsearch -l 
Found 1 items
drwxr-xr-x - mapr mapr 3 2016-10-27 21:28 /opt/external/elasticsearch/clusters/elastic

We are now done registering the Elasticsearch cluster with the MapR cluster. This only needs to be done once for each Elasticsearch cluster.

Set Up Replication

We are finally there! Time to start the actual replication.

This is done using the maprcli utility as user ‘mapr’:

$> maprcli table replica elasticsearch autosetup -path /tables/nm -target elastic -index networkmetrics -type metrics

Once this command is run, MapR will launch a MapReduce job to do an initial bulk replication of the data currently stored in the MapR-DB table. This can take a long time if the table already holds a lot of data. With our very small test data (3 rows), it should take less than one minute, mostly because of the startup cost of a MapReduce job.

If planning to use replication from the start, it’s probably a good idea to set it up when the table has just a bit of data to make the initial bulk load run quickly. While it’s possible to enable replication on an empty table, I wouldn’t recommend it since there is no way to make sure the replication is set up properly until data is added, which could be in production. I tend to prefer to detect errors and fix issues as early as possible.

From there on out, as data is added to the MapR-DB table, the data will be automatically replicated to ES by the gateway. It’s magic. :-)

Verifying the Replication

In MCS we should now be able to see that the replication has indeed been successful.

In Elasticsearch, we can also make sure that we have 3 hits for the rows we have replicated so far:

$> curl maprdemo:9200/networkmetrics/metrics/_count
{"count":3,"_shards":{"total":5,"successful":5,"failed":0}}
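
The same check can be scripted, which is handy when you want to compare the count against the number of rows you expect. This is a minimal Python sketch using only the standard library. The function names are hypothetical, and fetch_count needs a reachable Elasticsearch, so it is defined here but not called.

```python
import json
from urllib.request import urlopen


def parse_count(response_text):
    """Extract the document count from an ES _count response body."""
    return json.loads(response_text)["count"]


def fetch_count(base_url, index, doc_type):
    """Query <base_url>/<index>/<doc_type>/_count on a running ES."""
    url = "{}/{}/{}/_count".format(base_url.rstrip("/"), index, doc_type)
    with urlopen(url) as resp:
        return parse_count(resp.read().decode("utf-8"))


def missing_rows(expected_rows, es_count):
    """Rows not yet visible in ES; 0 means the replica has caught up."""
    return max(expected_rows - es_count, 0)


# Response body copied from the curl output above.
sample = '{"count":3,"_shards":{"total":5,"successful":5,"failed":0}}'
print(missing_rows(3, parse_count(sample)))  # 0
```

On the sandbox, fetch_count("http://maprdemo:9200", "networkmetrics", "metrics") would query the same endpoint as the curl command.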

Potential Issues

Some sources of issues to be careful about:

    1. Make sure the user running the replication command has POSIX permissions to the MapR-DB table. In our case, we’re creating it with user ‘mapr’ and running the command as ‘mapr,’ so that’s OK. Permissions in MapR matter.
    2. Double check that your index is created and the mappings are well matched to the data. If you’re using our test data and mappings though, it should be smooth sailing!
    3. Finally, ensure that the input data are UTF-8 strings in this particular example. The gateway decodes the bytes stored in MapR-DB as a UTF-8 string, so if the input was written in a different encoding (plain ASCII is fine, since it is a subset of UTF-8), the decoded output will be garbage and ES will complain. UTF-8 is the default encoding on most modern systems, so it should be fine, but it’s something to keep in mind.
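
For the second point, a pre-flight check of the data against the mapping can catch mismatches before ES does. The Python sketch below is only an approximation of what ES accepts: it assumes the ip type takes IPv4 addresses, treats integer and long as 32-bit and 64-bit ranges, and approximates strict_date_time with a fixed pattern ending in Z. All function names are hypothetical.

```python
import ipaddress
from datetime import datetime


def check_ipv4(value):
    ipaddress.IPv4Address(value)


def check_int(value, bits):
    number = int(value)
    if not -2 ** (bits - 1) <= number < 2 ** (bits - 1):
        raise ValueError("out of range for {} bits".format(bits))


def check_date(value):
    # Approximation of strict_date_time for UTC timestamps ending in Z.
    datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


CHECKS = {
    "destHost": check_ipv4,
    "sourceHost": check_ipv4,
    "timestamp": check_date,
    "bytesCount": lambda v: check_int(v, 64),
    "destPort": lambda v: check_int(v, 32),
    "sourcePort": lambda v: check_int(v, 32),
}


def validate(record):
    """Return the names of fields that are missing or fail their check."""
    problems = []
    for field, check in CHECKS.items():
        try:
            check(record.get(field))
        except (ValueError, TypeError):
            problems.append(field)
    return problems


# Third sample row from this post.
good = {
    "destHost": "172.30.120.101", "destPort": "1531",
    "timestamp": "2016-10-18T16:17:18.000Z", "bytesCount": "7893",
    "sourceHost": "172.31.68.27", "sourcePort": "49239",
}
print(validate(good))                             # []
print(validate(dict(good, destPort="eighty")))    # ['destPort']
```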

If the job fails, go to elasticsearch-2.2.0/config and edit the logging.yml file to set the logging level to DEBUG. Tailing the log in elasticsearch-2.2.0/logs/elastic.log will give the most information about conversion errors.

Wrap Up

Replication to Elasticsearch can be a very useful feature, with a lot of great use cases as I described above. It’s pretty easy to set up and will work reliably in the background to keep your data synchronized. I encourage you to experiment with this feature and take advantage of it on your production clusters.


Content originally posted on the MapR Converge Blog.
