Configuring and Deploying Ganglia with MapR


Author: Jonathan Bubier

 

Original Publication Date: August 14, 2014

 

 

Ganglia is a popular open-source monitoring tool that can be used to monitor and aggregate metrics from multiple data sources and clusters. It is typically used to monitor system resources such as CPU, memory, disk I/O, and network utilization across many systems. It can also be used to monitor Hadoop clusters, as MapR and much of the Hadoop infrastructure provide metrics that can be displayed in Ganglia. This article describes the steps needed to configure Ganglia, MapR, and the desired Hadoop components so that Hadoop can be monitored in Ganglia. Note that the installation of Ganglia and its components is outside the scope of this article; it assumes that the Ganglia components - gmond, gmetad, ganglia-web - are already installed and functional.

Determine the Desired Ganglia Configuration

Before configuring Ganglia to monitor a MapR cluster, it is important to identify the current Ganglia service layout and the redundancy requirements. Specifically, identify how many nodes in the cluster will run the Ganglia monitoring daemon (gmond) and whether single points of failure in the monitoring architecture can be tolerated or full redundancy is needed. This blog provides a good description of different Ganglia service deployments to consider before integrating with Hadoop: http://hakunamapdata.com/ganglia-configuration-for-a-small-hadoop-cluster-and-some-troubleshooting/.

 

Consider the requirements for your environment before continuing, as the configuration described below depends on the Ganglia service layout.  For the purposes of this article we will use a layout of one Ganglia meta daemon (gmetad) for the cluster and the Ganglia monitoring daemon (gmond) running on all CLDB nodes.  Note that in this layout all gmond instances can be polled by gmetad; we will not use one gmond instance to aggregate metrics from the other gmond instances in the cluster.  This creates a resilient setup with no single point of failure in either the gmond instances or the CLDB nodes.

Configure Ganglia

The first step is to configure the Ganglia monitoring daemon (gmond) and the Ganglia meta daemon (gmetad).  First, we will configure gmond.  By default the configuration for gmond and gmetad is under /etc/ganglia/, and each has a corresponding configuration file - gmond.conf and gmetad.conf, respectively. If your installation path or the location of these files differs, adjust accordingly.

1.  Update /etc/ganglia/gmond.conf on all CLDB nodes.

 

Edit /etc/ganglia/gmond.conf on all CLDB nodes in the MapR cluster and create a configuration similar to the following:

 

cluster {

  name = "MapR Cluster"

...

}

...

udp_send_channel {

  bind_hostname = yes # Highly recommended, soon to be default.

  # This option tells gmond to use a source address

  # that resolves to the machine's hostname. Without

  # this, the metrics may appear to come from any

  # interface and the DNS names associated with

  # those IPs will be used to create the RRDs.

  host = "192.168.1.1"

  port = 8649

  ttl = 1

}

 

/* You can specify as many udp_recv_channels as you like as well. */

udp_recv_channel {

  port = 8649

}

/* You can specify as many tcp_accept_channels as you like to share

  an xml description of the state of the cluster */

tcp_accept_channel {

  port = 8649

}

In the 'cluster' definition replace the value of the "name" field with the display name you wish to use for the Hadoop cluster in Ganglia. In the 'udp_send_channel' replace the value of the "host" field with the IP address of the host running gmond, i.e. the local host. The 'udp_recv_channel' indicates that the gmond instance can receive metrics from other sources on UDP port 8649. The 'tcp_accept_channel' indicates that the gmond instance can be polled for metrics by gmetad on TCP port 8649.

 

2.  Update /etc/ganglia/gmetad.conf.

 

On the host running gmetad edit /etc/ganglia/gmetad.conf and add a configuration similar to the following:

data_source "MapR Cluster" 192.168.1.1 192.168.1.2 192.168.1.3

Replace the data source name "MapR Cluster" with the name you wish to use for this cluster in Ganglia. The name should match the cluster name used in gmond.conf in step 1 on the CLDB nodes. Replace the space-separated list of IPs with the IP addresses of the CLDB hosts in the cluster running gmond.  Note that for each data source defined in /etc/ganglia/gmetad.conf, gmetad polls only the first listed host while it is reachable; the remaining hosts are polled only if that host becomes unreachable.
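
If you want to control how often gmetad polls the cluster, the data_source line also accepts an optional polling interval in seconds before the host list (the default is 15 seconds), and each host may carry an explicit port. A sketch, assuming the default gmond port of 8649 used in this article:

data_source "MapR Cluster" 15 192.168.1.1:8649 192.168.1.2:8649 192.168.1.3:8649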

 

3.  Restart gmond on all CLDB nodes.

 

Restart the Ganglia monitoring daemon on all CLDB nodes.  This is typically done using 'service gmond restart' or 'service ganglia-monitor restart'.
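
To confirm each gmond instance came back up and is listening on the configured ports, a quick check on a CLDB node (assuming the default port 8649 from the gmond.conf example above):

$ service gmond status
$ netstat -ltnu | grep 8649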

 

4.  Restart gmetad.

 

Restart the Ganglia meta daemon.  This is typically done using 'service gmetad restart'. 
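
To confirm gmetad is running and serving aggregated cluster XML, a quick check on the gmetad host (assuming the default xml_port of 8651 in gmetad.conf):

$ service gmetad status
$ telnet localhost 8651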

 

5.  Verify system metrics from all CLDB nodes are available in Ganglia.

 

Once the configuration above is in place and the Ganglia daemons have been restarted, verify that system metrics from all CLDB nodes are available in the Ganglia web interface.
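
If the web interface is not convenient to check, the RRD files written by gmetad are another quick indicator that metrics are flowing. A sketch, assuming the default rrd_rootdir of /var/lib/ganglia/rrds in gmetad.conf; each monitored host should appear as a subdirectory containing .rrd files:

$ ls /var/lib/ganglia/rrds/"MapR Cluster"/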


Configure MapR Metrics

After setting up Ganglia and verifying that system metrics are being reported correctly in the Ganglia web interface, the next step is to configure and enable metrics in MapR.  This enables reporting of MapR-specific metrics from the CLDB as well as metrics for all fileserver nodes registered with the CLDB.

1.  Update /opt/mapr/conf/hadoop-metrics.properties on all CLDB nodes.

 

The first step in configuring CLDB to send metrics to Ganglia is to update the configuration file /opt/mapr/conf/hadoop-metrics.properties on all CLDB nodes. This must be done on all CLDB nodes so the metrics continue to function properly in the event the master CLDB fails over to another node. Edit the file on all CLDB nodes and modify the configuration so it is similar to the following:

 

# Configuration of the "cldb" context for ganglia

cldb.class=com.mapr.fs.cldb.counters.MapRGangliaContext31

cldb.period=10

cldb.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

cldb.spoof=1

...

# Configuration of the "fileserver" context for ganglia

fileserver.class=com.mapr.fs.cldb.counters.MapRGangliaContext31

fileserver.period=37

fileserver.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

fileserver.spoof=1

Replace the values of "cldb.servers" and "fileserver.servers" with a comma-separated list of the IP addresses of the nodes running gmond for this cluster, i.e. the CLDB nodes. These properties define the nodes used by the Ganglia context for sending CLDB and FileServer metrics.  By providing a comma-separated list of hosts, the CLDB can send metrics to all gmond instances for redundancy in the event gmond stops running on one node.
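
Only the currently active (master) CLDB emits the cldb and fileserver metrics, which is why the file must be updated on every CLDB node. To check which node is currently the CLDB master:

$ maprcli node cldbmaster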

 

2.  Set cldb.ganglia.cldb.metrics and cldb.ganglia.fileserver.metrics to "1"

 

Use maprcli to set two properties in the CLDB configuration to enable the CLDB and Fileserver metrics.  Ex:

$ maprcli config save -values '{"cldb.ganglia.cldb.metrics":"1"}'
$ maprcli config save -values '{"cldb.ganglia.fileserver.metrics":"1"}'

Note that these commands need to be run on only one CLDB node as this configuration is shared by all CLDB nodes.
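
To confirm the values were saved, the current CLDB configuration can be dumped and filtered. A sketch, assuming maprcli is run by a user with permission to read the cluster configuration:

$ maprcli config load -json | grep -i ganglia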

 

3.  Restart CLDB on all CLDB nodes.

 

Once /opt/mapr/conf/hadoop-metrics.properties is updated on all CLDB nodes and the Ganglia metrics have been enabled in the CLDB configuration the next step is to restart CLDB on all nodes.  This can be done using maprcli or using the MCS.  If using maprcli the following syntax can be used:

$ maprcli node services -filter [csvc==cldb] -cldb restart

This command can be run on any node and will restart the CLDB service on all nodes that are configured to run CLDB.  Note that restarting CLDB on all nodes simultaneously will disrupt access to MapR-FS, so this step should be done when downtime can be scheduled or when the interruption can be minimized. Alternatively, the CLDB can be restarted on each node in a rolling fashion to minimize the impact, as shown below.
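
A rolling restart can be done by passing an explicit node to maprcli instead of a filter and repeating the command for each CLDB node in turn. A sketch, where cldb-node-1 is a placeholder hostname:

$ maprcli node services -nodes cldb-node-1 -cldb restart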

 

4.  Verify 'CLDB' and 'Fileserver' metrics are reported for all cluster nodes in Ganglia

 

Once the CLDB is restarted verify that the metrics are visible in the Ganglia web interface.  For each cluster node running CLDB two new groups of metrics should be visible - CLDB and Fileserver.  The non-CLDB nodes in the cluster should also now be present in the Ganglia web interface with the Fileserver group of metrics.

 

If the MapR metrics are not present in the Ganglia web interface at this point, verify the configuration is correct according to the above steps.  Specifically verify gmond.conf, gmetad.conf and hadoop-metrics.properties for consistency.  If the configuration looks correct, verify connectivity between the host running gmetad and the hosts running gmond.  This can be done using the 'telnet' utility by attempting to telnet from the gmetad host to each gmond host on the 'tcp_accept_channel' port in gmond.conf - default TCP 8649.

 

Ex:

$ telnet 192.168.1.1 8649

 

Trying 192.168.1.1...

Connected to 192.168.1.1.

Escape character is '^]'.

<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>

<!DOCTYPE GANGLIA_XML [

...

</CLUSTER>

</GANGLIA_XML>

Connection closed by foreign host.

If the telnet command fails with a 'Connection refused' message, inspect the network configuration between the hosts to ensure there is no firewall or other network filtering preventing communication on TCP port 8649. If the command is successful, review the output to determine whether the MapR metric names prefixed with cldb and fileserver are present.
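
A quick scripted version of this check, assuming the netcat ('nc') utility is available on the gmetad host, counts how many metric entries containing 'cldb' and 'fileserver' a gmond instance is currently reporting (a count of zero suggests the MapR metrics are not arriving):

$ nc 192.168.1.1 8649 | grep -ic cldb
$ nc 192.168.1.1 8649 | grep -ic fileserver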

Configure Hadoop Map-Reduce Metrics (optional)

If desired, Ganglia can also report on metrics generated by the Hadoop map-reduce framework in the cluster. These metrics are generated by both the active JobTracker and all running TaskTracker processes and allow an administrator to monitor the map-reduce activity within the MapR cluster. The steps to configure the metrics classes for map-reduce are quite similar to the steps for MapR metrics. Note that once a working configuration is in place between gmetad and the gmond instances in the cluster, no further configuration is needed for Ganglia; all further configuration is in Hadoop-specific configuration files.

 

1.  Update /opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-metrics.properties on all JT and TT nodes.

 

On all configured JobTracker nodes and TaskTracker nodes, edit /opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-metrics.properties and modify the configuration so it is similar to the following:

 

# Configuration of the "mapred" context for ganglia

# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)

# mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

mapred.period=10

mapred.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

...

# Configuration of the "jvm" context for ganglia

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

jvm.period=10

jvm.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

...

# Configuration of the "fairscheduler" context for ganglia

fairscheduler.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

fairscheduler.period=10

fairscheduler.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

 

Replace the values of "mapred.servers", "jvm.servers", and "fairscheduler.servers" with a comma-separated list of the IP addresses of the nodes running gmond for this cluster, i.e. the CLDB nodes in our example. These properties define the nodes used by the Ganglia context for sending map-reduce, JVM, and FairScheduler metrics.

 

Note that if the FairScheduler is not used in your environment, the fairscheduler context does not need to be configured. This context is also not needed on TaskTracker nodes, as these metrics are emitted by the JobTracker only; a minimal TaskTracker-only sketch follows. Also note that "jvm.class" and "fairscheduler.class" must be set to org.apache.hadoop.metrics.ganglia.GangliaContext31 if you are using Ganglia version 3.1 or newer. By default these parameters are set to org.apache.hadoop.metrics.ganglia.GangliaContext.
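
As an example, a TaskTracker-only node needs just the mapred and jvm contexts; a minimal sketch of that portion of hadoop-metrics.properties, using the same example IP addresses as above:

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
mapred.period=10
mapred.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649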

 

 

2.  Restart JobTracker and TaskTracker on all JT and TT nodes.

 

After updating /opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-metrics.properties on all JobTracker and TaskTracker nodes it is necessary to restart those services for the new configuration to take effect. This can be done using maprcli or using the MCS. If using maprcli the following syntax can be used:

$ maprcli node services -filter [csvc==tasktracker] -tasktracker restart 
$ maprcli node services -filter [csvc==jobtracker] -jobtracker restart

These commands can be run on any node and will restart the TaskTracker and JobTracker on all nodes that are configured to run those services. Note that restarting the map-reduce services on all nodes simultaneously will disrupt map-reduce processing, so this step should be done when downtime can be scheduled or when the interruption can be minimized. Alternatively, the services can be restarted on each node in a rolling fashion to minimize the impact.

 

3.  Verify Map-Reduce metrics are reported in Ganglia

 

Once the map-reduce services are restarted, verify that the new metrics are visible in the Ganglia web interface.  For each cluster node, new groups of metrics should be visible - mapred and jvm.  For each JobTracker node, the 'fairscheduler' group of metrics should also be visible if the FairScheduler is the currently configured task scheduler.

Configure Hadoop YARN Metrics (MapR v4.0.1 only)

Ganglia can report on metrics generated by the YARN services running in the cluster.  These metrics are generated by both the Resource Manager and the Node Manager processes and include metrics regarding application activity, node resource utilization and JVM statistics among many others. The following steps can be used to enable the necessary metric contexts in the YARN framework to report metrics to Ganglia. 

 

1.  Update /opt/mapr/hadoop/hadoop-2.3.0/etc/hadoop/hadoop-metrics2.properties on all RM and NM nodes.

 

On all configured Resource Manager nodes edit /opt/mapr/hadoop/hadoop-2.3.0/etc/hadoop/hadoop-metrics2.properties and modify the configuration to add the following two lines:

*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31 
resourcemanager.sink.ganglia.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

On all configured Node Manager nodes edit /opt/mapr/hadoop/hadoop-2.3.0/etc/hadoop/hadoop-metrics2.properties and modify the configuration to add the following two lines:

*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31 
nodemanager.sink.ganglia.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

For both the Resource Manager and Node Manager nodes, replace the value of the "servers" properties with a comma-separated list of the IP addresses of the nodes running gmond for this cluster, i.e. the CLDB nodes in our example.  Note that nodes configured as both Resource Manager and Node Manager need the '*.sink.ganglia.class' line added only once to hadoop-metrics2.properties, as shown in the example below.
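
For a node that runs both Resource Manager and Node Manager, the resulting addition to hadoop-metrics2.properties would therefore contain one shared sink class line and one servers line per role, using the same example IP addresses as above:

*.sink.ganglia.class=org.apache.hadoop.metrics2.sink.ganglia.GangliaSink31
resourcemanager.sink.ganglia.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649
nodemanager.sink.ganglia.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649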

 

2.  Restart ResourceManager and NodeManager on all RM and NM nodes.

 

After updating /opt/mapr/hadoop/hadoop-2.3.0/etc/hadoop/hadoop-metrics2.properties on all Resource Manager and Node Manager nodes it is necessary to restart those services for the new configuration to take effect. This can be done using maprcli or using the MCS. If using maprcli the following syntax can be used:

$ maprcli node services -filter [csvc==resourcemanager] -name resourcemanager -action restart 
$ maprcli node services -filter [csvc==nodemanager] -name nodemanager -action restart

These commands can be run on any node and will restart the Node Manager and Resource Manager on all nodes that are configured to run those services.

 

3.  Verify YARN metrics are reported in Ganglia.

 

Once the YARN services are restarted verify that the new metrics are visible in the Ganglia web interface. For each cluster node new groups of metrics should be visible depending on the configured role of the node.

Configure HBase Metrics (optional)

 

Ganglia can report on metrics generated by the HBase services in the cluster. These metrics are generated by both the active HBase master and all running HBase RegionServer processes.

 

1.  Update /opt/mapr/hbase/hbase-<version>/conf/hadoop-metrics.properties on all HBase nodes.

 

On all configured HBase master and RegionServer nodes, edit /opt/mapr/hbase/hbase-<version>/conf/hadoop-metrics.properties, where <version> is your installed HBase version, and modify the configuration so it is similar to the following:

 

# Configuration of the "hbase" context for ganglia

# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)

# hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext

hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

hbase.period=10

hbase.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

...

# Configuration of the "jvm" context for ganglia

# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)

# jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

jvm.period=10

jvm.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

...

# Configuration of the "rpc" context for ganglia

# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)

# rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext

rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

rpc.period=10

rpc.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

...

# Configuration of the "rest" context for ganglia

# Pick one: Ganglia 3.0 (former) or Ganglia 3.1 (latter)

# rest.class=org.apache.hadoop.metrics.ganglia.GangliaContext

rest.class=org.apache.hadoop.metrics.ganglia.GangliaContext31

rest.period=10

rest.servers=192.168.1.1:8649,192.168.1.2:8649,192.168.1.3:8649

Replace the values of "hbase.servers", "jvm.servers", "rpc.servers", and "rest.servers" with a comma-separated list of the IP addresses of the nodes running gmond for this cluster, i.e. the CLDB nodes in our example. Note that if the HBase REST service is not used in your environment, the rest context does not need to be configured in hadoop-metrics.properties.

 

2.  Restart HBase master and HBase Regionserver on all HBase nodes.

 

After updating /opt/mapr/hbase/hbase-<version>/conf/hadoop-metrics.properties on all HBase master and RegionServer nodes it is necessary to restart those services for the new configuration to take effect. This can be done using maprcli or using the MCS. If using maprcli the following syntax can be used:

$ maprcli node services -filter [csvc==hbmaster] -hbmaster restart 
$ maprcli node services -filter [csvc==hbregionserver] -regionserver restart

These commands can be run on any node and will restart the HBase services on all nodes that are configured to run those services. Note that restarting the HBase services on all nodes simultaneously will disrupt access to HBase data, so this step should be done when downtime can be scheduled or when the interruption can be minimized. Alternatively, the services can be restarted on each node in a rolling fashion to minimize the impact.

 

3.  Verify HBase metrics are reported in Ganglia

 

Once the HBase services are restarted verify that the new metrics are visible in the Ganglia web interface. For each cluster node new groups of metrics should be visible - hbase, jvm, rpc. For each node where the HBase REST service is running there should be an additional group of metrics called 'rest' in the Ganglia web interface.
