Submit Apache Spark Jobs Using the MapR Client from Linux

Created by Rachel Silver on Jun 30, 2017. Last modified by aalvarez on Jul 10, 2017.

Introduction

In order to submit work to a YARN cluster running Apache Spark, the MapR Client, which contains an Apache Spark client, must be installed on the node you use to submit jobs. This node could be an edge node in the cluster, your laptop, or even a Docker container.

 

In this post, we will cover how to install the MapR Client and test it on Linux (CentOS).

 

Collect Information and Files

For any of these installation targets, you're going to need some basic information from the cluster. All of this information is available in MCS (the MapR Control System), but here is how to find it from the CLI on any cluster node:

 

Cluster Name and CLDB nodes:

 

cat /opt/mapr/conf/mapr-clusters.conf
<cluster name> secure=false <cldb host1>:7222 <cldb host2>:7222 <cldb host3>:7222
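
If you'd rather pull these values out programmatically, here's a minimal sketch, assuming the default CLDB port of 7222 and the single-line file format shown above:

# Print the cluster name (the first field):
awk '{print $1}' /opt/mapr/conf/mapr-clusters.conf

# Print each CLDB host:port entry:
grep -o '[^ ]*:7222' /opt/mapr/conf/mapr-clusters.conf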

 

The Job History Server host can be found by checking the roles installed on each node:

 

ls /opt/mapr/roles/
cldb drill-bits fileserver hbaserest hbasethrift hbinternal historyserver hivemetastore hiveserver2 hivewebhcat hue nfs
nodemanager oozie spark-historyserver zookeeper
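
The listing can be long, so a quick filter helps; run this on each node (or loop over your hosts with ssh) and note where it prints a match:

# Matches the MapReduce historyserver role (and spark-historyserver, if present):
ls /opt/mapr/roles/ | grep historyserver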

 

Please gather the hive-site.xml and the hadoop-yarn-server-web-proxy JAR files from any node in the cluster. These files can typically be found here:

 

/opt/mapr/hive/hive-<version>/conf/hive-site.xml

/opt/mapr/hadoop/hadoop-<version>/share/hadoop/yarn/hadoop-yarn-server-web-proxy-<version>.jar
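
One way to fetch them, assuming you have SSH access to a cluster node (the hostname and versions below are placeholders):

scp <cluster node>:/opt/mapr/hive/hive-<version>/conf/hive-site.xml /tmp/
scp <cluster node>:/opt/mapr/hadoop/hadoop-<version>/share/hadoop/yarn/hadoop-yarn-server-web-proxy-<version>.jar /tmp/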

 

Note: Java must be installed, and JAVA_HOME must be set, on the node from which you submit jobs.
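
A quick sanity check; the JDK path below is just an example, so point JAVA_HOME at wherever your JDK actually lives:

# Confirm Java is on the PATH and JAVA_HOME is set:
java -version
echo "$JAVA_HOME"

# If JAVA_HOME is unset, export it, e.g. for OpenJDK 8 on CentOS:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk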

 

Installing the MapR Spark Client on Linux

The full directions can be found here: Installing the MapR Client on CentOS, RedHat, Oracle Linux 

 

Create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following content. Note that the two repositories are versioned independently: replace <version> in the [maprtech] baseurl with the MapR core version you want to install (e.g., v5.2.1), and in the [maprecosystem] baseurl with the matching MEP (MapR Ecosystem Pack) version (e.g., 3.0):

[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/<version>/redhat/
enabled=1
gpgcheck=0
protect=1 

[maprecosystem]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/MEP/MEP-<version>/redhat
enabled=1
gpgcheck=0
protect=1
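
For example, with MapR core 5.2.1 and MEP 3.0 (illustrative versions; substitute the ones you actually run), the two baseurl lines would read:

baseurl=http://package.mapr.com/releases/v5.2.1/redhat/
baseurl=http://package.mapr.com/releases/MEP/MEP-3.0/redhat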

Update YUM:

 

yum update 

 

Install the MapR Client for your target architecture:

 

yum install mapr-client.i386

yum install mapr-client.x86_64
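
You can confirm what was installed with rpm:

# List installed MapR packages:
rpm -qa | grep -i mapr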

 

Next, we need to install the MapR build of Spark. It's important not to use the community build, as it will not work with a secure cluster:

 

yum install mapr-spark

 

Run configure.sh to configure the client. For details about the syntax, parameters, and behavior of configure.sh, see configure.sh:

/opt/mapr/server/configure.sh -N <cluster name> -c -C <cldb host1>:7222,<cldb host2>:7222,<cldb host3>:7222 -HS <JHS node>
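
A filled-in example, using hypothetical hostnames (a cluster named demo.cluster.com with CLDB nodes node1 through node3 and the Job History Server on node2):

/opt/mapr/server/configure.sh -N demo.cluster.com -c -C node1:7222,node2:7222,node3:7222 -HS node2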

Jobs must be submitted as a cluster user that also exists on this host, with the same UID and GID as on the cluster. For example, the default user 'mapr' can be created as follows:

 

groupadd -g 5000 mapr

useradd -u 5000 -g mapr mapr
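
Then verify the IDs line up with what `id mapr` reports on a cluster node:

# Expect uid=5000 and gid=5000 if created as above:
id mapr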

Copy the Hive site file to /opt/mapr/spark/spark-<version>/conf/ and the hadoop-yarn-server-web-proxy JAR file to /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/ on the client node.
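
Continuing the earlier sketch, if both files were staged in /tmp, the copies look like this (versions are placeholders; match them to your install):

cp /tmp/hive-site.xml /opt/mapr/spark/spark-<version>/conf/
cp /tmp/hadoop-yarn-server-web-proxy-<version>.jar /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/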

 

Finally, please follow these steps to add the Spark JAR files to a world-readable location on MapR-FS. This isn't required, but it lets YARN cache the JARs on cluster nodes instead of distributing them each time an application runs:

Configure Spark JAR Location (Spark 2.0.1 and later) 
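
As a rough sketch of what those steps involve, assuming Spark 2.1.0 and an /apps/spark target directory (follow the linked page for the exact procedure on your version):

# Bundle the Spark JARs and publish them to a world-readable spot on MapR-FS:
cd /opt/mapr/spark/spark-2.1.0/jars
zip /tmp/spark-jars.zip *.jar
$HADOOP_HOME/bin/hadoop fs -mkdir -p /apps/spark
$HADOOP_HOME/bin/hadoop fs -put /tmp/spark-jars.zip /apps/spark/

# Then point Spark at the archive in spark-defaults.conf:
# spark.yarn.archive maprfs:///apps/spark/spark-jars.zip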

 

Verify and Test

 

First, let's test that basic Hadoop commands can run on the cluster and submit jobs to the YARN queue:

 

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0-mapr-1703.jar teragen 1000 /tmp/teragen
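
If the job succeeds, the generated data lands in MapR-FS, which you can confirm from the client:

$HADOOP_HOME/bin/hadoop fs -ls /tmp/teragen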

 

Then, let's test that we can access the Spark shell in Scala and Python:

 

$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client

$SPARK_HOME/bin/pyspark --master yarn --deploy-mode client
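
Once a shell comes up, a one-liner confirms that work actually runs on the cluster (Scala shown; in PySpark the equivalent is sc.parallelize(range(100)).count()):

scala> sc.parallelize(1 to 100).count()
res0: Long = 100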

 

And then submit a Spark job on YARN: 

 

/opt/mapr/spark/spark-2.1.0/bin/run-example --master yarn --deploy-mode client SparkPi 10
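
On success, the driver output should include a line like the following (the exact digits vary from run to run):

Pi is roughly 3.1414551414551416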

 
