mlalapet

MapR-ES Replication Setup and Testing

Blog Post created by mlalapet Employee on Jul 31, 2016

In this article we set up MapR Streams replication and test it using Logstash, both to produce messages (at the source stream) and to consume them (at the destination stream).

 

One of the unique features of MapR Streams is the ability to replicate streams to other MapR clusters worldwide or to other streams within the same MapR cluster. MapR Streams supports several replication topologies: basic one-directional master-slave replication, many-to-one replication, and multi-master replication. This article focuses on setting up basic one-directional master-slave replication.

 

Let's assume we have a data center in New York collecting log data. The log data is collected by Logstash and added to a local stream in the NY data center MapR cluster (cluster name: ny-logdata-cluster). We need to replicate this stream to the headquarters cluster in San Francisco (cluster name: sf-hq-cluster). The high-level steps for getting replication working are:

 

  1. Install and configure MapR Gateway in the destination cluster
  2. Add destination cluster CLDB information in source cluster and vice versa
  3. Set up replication between the streams in the source and destination clusters


Once all of this is set up, we will use a simple Logstash pipeline to test the replication.

 

MapR Gateway Installation and Configuration

 

  • Install MapR Gateway on the destination cluster. In our case it's the SF HQ cluster (cluster name - sf-hq-cluster).
    • Identify the nodes on which to install the MapR Gateway, and install the mapr-gateway package on each of them
    • $ yum install mapr-gateway

 

  • Run the following command on each node where the MapR Gateway is installed, so that Warden knows the gateway is running on that node.
  • $ /opt/mapr/server/configure.sh -R

 

  • Run the following command to check that the gateway is installed and configured properly
  • $ maprcli cluster gateway local -format text gatewayinfo

 

  • On the source cluster (cluster name - ny-logdata-cluster) set the gateway nodes as follows.
    • Log on to one of the nodes in the source cluster (ny-logdata-cluster) and run the following, where <cluster name> is the destination cluster (sf-hq-cluster in our case)
    • $ maprcli cluster gateway set -dstcluster <cluster name> -gateways "<space-delimited list of gateways>"

 

  • Check that the gateway is set up properly in the source cluster by running the following command
  • $ maprcli cluster gateway list
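
Because the -gateways value is a single space-delimited string, it is easy to get the quoting wrong when there are several gateway nodes. The following is a minimal sketch of a helper that only prints the command for review; the function name gateway_set_cmd and the example.com host names are hypothetical, not part of MapR.

```shell
#!/usr/bin/env bash
# Sketch: build the "maprcli cluster gateway set" command for the source cluster.
# Nothing is executed against the cluster; the command is only printed.

gateway_set_cmd() {
  if [ "$#" -lt 2 ]; then
    echo "usage: gateway_set_cmd <destination cluster> <gateway host>..." >&2
    return 1
  fi
  local dst_cluster="$1"
  shift
  # "$*" joins the remaining arguments with spaces, which matches the
  # space-delimited list that -gateways expects.
  printf 'maprcli cluster gateway set -dstcluster %s -gateways "%s"\n' "$dst_cluster" "$*"
}

# Hypothetical gateway host names; replace them with the real gateway nodes.
gateway_set_cmd sf-hq-cluster sf-gw-1.example.com sf-gw-2.example.com
```

Review the printed line, then run it on a node in the source cluster.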


Add CLDB Information

  • Add CLDB server details of the destination cluster in the source cluster.
    • Get the cluster name and CLDB information of the destination cluster
    • Log on to each node in the source cluster, open /opt/mapr/conf/mapr-clusters.conf in an editor such as vi, and add the destination cluster details. For example, below is the conf file from a node in ny-logdata-cluster
    • $ cat /opt/mapr/conf/mapr-clusters.conf

ny-logdata-cluster secure=false x.x.x.x:7222 x.x.x.x:7222

sf-hq-cluster secure=false y.y.y.y:7222

  • The second line above (the sf-hq-cluster entry) is the one that was added on this node.

 

  • Check that the destination cluster is accessible from one of the nodes in the source cluster
  • $ hadoop fs -ls /mapr/sf-hq-cluster/

 

  • Add CLDB server details of the source cluster in the destination cluster.
    • Get the cluster name and CLDB information of the source cluster
    • Log on to each node in the destination cluster, open /opt/mapr/conf/mapr-clusters.conf in an editor such as vi, and add the source cluster details. For example, below is the conf file from a node in sf-hq-cluster
    • $ cat /opt/mapr/conf/mapr-clusters.conf

sf-hq-cluster secure=false y.y.y.y:7222

ny-logdata-cluster secure=false x.x.x.x:7222 x.x.x.x:7222

  • The second line above (the ny-logdata-cluster entry) is the one that was added on this node.

 

  • Check that the source cluster is accessible from one of the nodes in the destination cluster
  • $ hadoop fs -ls /mapr/ny-logdata-cluster/
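
Since the same entry has to be added on every node, editing the file by hand invites duplicates and typos. Below is a minimal sketch of an idempotent helper, assuming the one-line-per-cluster layout shown above; the function name add_cluster_entry is hypothetical, and the demonstration runs against a scratch file rather than the live /opt/mapr/conf/mapr-clusters.conf.

```shell
#!/usr/bin/env bash
# Sketch: append a remote cluster entry to a mapr-clusters.conf-style file,
# skipping the write if an entry for that cluster name already exists.

add_cluster_entry() {
  if [ "$#" -lt 3 ]; then
    echo "usage: add_cluster_entry <conf file> <cluster name> <cldb host:port>..." >&2
    return 1
  fi
  local conf_file="$1" cluster_name="$2"
  shift 2
  # An entry is a line that starts with the cluster name followed by a space.
  if [ -f "$conf_file" ] && grep -q "^${cluster_name} " "$conf_file"; then
    echo "entry for ${cluster_name} already present in ${conf_file}"
    return 0
  fi
  # secure=false mirrors the example above; this is an assumption to adjust
  # for a secure cluster.
  printf '%s secure=false %s\n' "$cluster_name" "$*" >> "$conf_file"
}

# Demonstration on a scratch copy, using the placeholder addresses from the article.
demo_conf="$(mktemp)"
echo 'ny-logdata-cluster secure=false x.x.x.x:7222 x.x.x.x:7222' > "$demo_conf"
add_cluster_entry "$demo_conf" sf-hq-cluster y.y.y.y:7222
cat "$demo_conf"
rm -f "$demo_conf"
```

Running the helper a second time with the same cluster name leaves the file unchanged, so it is safe to repeat across nodes.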

 

MapR Streams Replication Setup

  • We assume the stream and topic at the source cluster (ny-logdata-cluster) have already been created as /streams/logstream:log_topic
  • Create the replica manually with the maprcli stream create command. Use the -copymetafrom option to ensure that the metadata for the replica is identical to the metadata for the source stream. Run the command on the destination cluster (sf-hq-cluster)
    • maprcli stream create -path <path to the replica> -copymetafrom <path to the source stream>
    • In our scenario
    • $ maprcli stream create -path /mapr/sf-hq-cluster/streams/logstream -copymetafrom /mapr/ny-logdata-cluster/streams/logstream
    • NOTE - the /streams volume must already exist in the destination cluster (sf-hq-cluster) before running the command above

 

  • Register the replica as a replica of the source stream by running the maprcli stream replica add command.
    • maprcli stream replica add -path <path to the source stream> -replica <path to the replica> -paused true
    • In our scenario
    • $ maprcli stream replica add -path /mapr/ny-logdata-cluster/streams/logstream -replica /mapr/sf-hq-cluster/streams/logstream -paused true

 

  • Verify that you specified the correct replica by running the maprcli stream replica list command.
    • maprcli stream replica list -path <path to the source stream>
    • $ maprcli stream replica list -path /mapr/ny-logdata-cluster/streams/logstream

 

  • Authorize replication between the streams by defining the source stream as the upstream stream for the replica, using the maprcli stream upstream add command. Defining the upstream stream ensures that a stream cannot replicate updates to just any replica: replication depends on a two-way agreement between the owners of the two streams.
    • maprcli stream upstream add -path <path to the replica> -upstream <path to the source stream>
    • $ maprcli stream upstream add -path /mapr/sf-hq-cluster/streams/logstream -upstream /mapr/ny-logdata-cluster/streams/logstream

 

  • Verify that you specified the correct source stream by running the maprcli stream upstream list command.
    • maprcli stream upstream list -path <path to the replica>
    • $ maprcli stream upstream list -path /mapr/sf-hq-cluster/streams/logstream

 

  • Load the replica with data from the source stream by using the mapr copystream utility.
    • $ mapr copystream -src /mapr/ny-logdata-cluster/streams/logstream -dst /mapr/sf-hq-cluster/streams/logstream

 

  • Start replication with the command maprcli stream replica resume
    • maprcli stream replica resume -path <path to the source stream> -replica <path to the replica>
    • $ maprcli stream replica resume -path /mapr/ny-logdata-cluster/streams/logstream -replica /mapr/sf-hq-cluster/streams/logstream

 


The setup is now complete and the streams are being replicated. Any message produced to /mapr/ny-logdata-cluster/streams/logstream:log_topic will be replicated to /mapr/sf-hq-cluster/streams/logstream:log_topic.
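
The order of the replication commands matters (create, register paused, authorize, load, resume), so it can help to generate the whole sequence from the two stream paths. The following is a minimal sketch that only prints the commands used in this article; the function name replication_plan is hypothetical and nothing is executed against either cluster.

```shell
#!/usr/bin/env bash
# Sketch: print the replication setup commands in the order used above.
# Review the output, then run each line on the appropriate cluster.

replication_plan() {
  if [ "$#" -ne 2 ]; then
    echo "usage: replication_plan <source stream path> <replica path>" >&2
    return 1
  fi
  local src="$1" replica="$2"
  cat <<EOF
maprcli stream create -path ${replica} -copymetafrom ${src}
maprcli stream replica add -path ${src} -replica ${replica} -paused true
maprcli stream replica list -path ${src}
maprcli stream upstream add -path ${replica} -upstream ${src}
maprcli stream upstream list -path ${replica}
mapr copystream -src ${src} -dst ${replica}
maprcli stream replica resume -path ${src} -replica ${replica}
EOF
}

replication_plan /mapr/ny-logdata-cluster/streams/logstream \
                 /mapr/sf-hq-cluster/streams/logstream
```

Printing rather than executing keeps the sketch safe to run anywhere and leaves room to check each step's output before moving to the next.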

 

Testing MapR Streams Replication via Logstash

In our test we use Logstash at ny-logdata-cluster to produce messages to the stream, with stdin as the input and the kafka output plugin as the output. At sf-hq-cluster we use the kafka input plugin as the input and stdout as the output. Anything typed at the source should then appear at the destination, which confirms that replication works. The steps are as follows:

 

  • At the source cluster (ny-logdata-cluster), run the following Logstash command
    • /opt/logstash/bin/logstash -e 'input { stdin {} } output { kafka { topic_id => "/streams/logstream:log_topic" } }'
    • The above configuration takes anything typed into standard input and sends it to the source stream topic through the kafka output plugin

 

  • At the destination cluster (sf-hq-cluster), run the following Logstash command
    • /opt/logstash/bin/logstash -e 'input { kafka { topics => [ "/streams/logstream:log_topic" ] } } output { stdout { codec => "json" } }'
    • The above configuration reads messages from the replica stream topic through the kafka input plugin and writes them to standard output as JSON

 

  • Messages typed at the source cluster should now be replicated and displayed at the destination cluster
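
When other producers are writing to the same topic, a uniquely tagged test message is easier to spot at the destination. Below is a minimal sketch, assuming the destination Logstash output has been redirected to a file; the function names make_probe and probe_seen are hypothetical, and the demonstration uses a scratch file in place of real consumer output.

```shell
#!/usr/bin/env bash
# Sketch: generate a uniquely tagged test message and look for it in
# captured consumer output.

make_probe() {
  # Epoch seconds plus the shell PID make the marker unlikely to collide.
  printf 'replication-probe-%s-%s\n' "$(date +%s)" "$$"
}

probe_seen() {
  local marker="$1" output_file="$2"
  # -F treats the marker as a literal string rather than a regular expression.
  grep -qF -- "$marker" "$output_file"
}

# Self-contained demonstration: a scratch file stands in for the
# destination Logstash output.
marker="$(make_probe)"
scratch="$(mktemp)"
printf '{"message":"%s"}\n' "$marker" > "$scratch"
if probe_seen "$marker" "$scratch"; then
  echo "probe found: ${marker}"
fi
rm -f "$scratch"
```

In practice, the marker printed by make_probe would be typed or piped into the Logstash process at the source cluster, and probe_seen would be pointed at the file that captures the destination Logstash output.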

 

And that's it! You've successfully configured and tested MapR Streams replication.

 
