Submit Apache Spark Jobs Using the MapR Client from Linux

Document created by Rachel Silver (Employee) on Jun 30, 2017. Last modified by Rachel Silver on Aug 24, 2017.


To submit work to a YARN cluster running Apache Spark, you must install the MapR Client, which includes an Apache Spark client, on the node you use to submit jobs. This node could be an edge node in the cluster, your laptop, or even a Docker container.


In this post, we will cover how to install the MapR Client and test it on Linux (CentOS):


Collect Information and Files

Whichever way you install the client, you're going to need some basic information from the cluster. All of this information is available in MCS (the MapR Control System), but here is how to find it from the CLI on any cluster node:


Cluster Name and CLDB nodes:


cat /opt/mapr/conf/mapr-clusters.conf
<cluster name> secure=false <cldb host1>:7222 <cldb host2>:7222 <cldb host3>:7222
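If you script your client setup, both values can be pulled straight out of this file. A minimal sketch, assuming the one-line format shown above; the fallback sample file and its hostnames are placeholders so the snippet runs anywhere:

```shell
# Parse the cluster name and CLDB hosts from mapr-clusters.conf.
# Falls back to a sample file with placeholder hostnames when the
# real client config is not present.
CONF=/opt/mapr/conf/mapr-clusters.conf
if [ ! -r "$CONF" ]; then
    CONF=$(mktemp)
    echo "my.cluster.com secure=false cldb1:7222 cldb2:7222 cldb3:7222" > "$CONF"
fi

# Field 1 is the cluster name; field 2 is the secure flag;
# fields 3 and up are CLDB host:port pairs.
CLUSTER_NAME=$(awk 'NR==1 {print $1}' "$CONF")
CLDB_HOSTS=$(awk 'NR==1 {out=""; for (i=3; i<=NF; i++) out = out (out=="" ? "" : ",") $i; print out}' "$CONF")

echo "cluster: $CLUSTER_NAME"
echo "cldbs:   $CLDB_HOSTS"
```

The comma-joined CLDB list is handy later when configuring the client.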


The Job History Server Host can be found by looking at the roles on any node:


ls /opt/mapr/roles/
cldb drill-bits fileserver hbaserest hbasethrift hbinternal historyserver hivemetastore hiveserver2 hivewebhcat hue nfs
nodemanager oozie spark-historyserver zookeeper


Please gather the hive-site.xml and the hadoop-yarn-server-web-proxy JAR files from any node in the cluster. These files can typically be found here:


/opt/mapr/hive/hive-<version>/conf/hive-site.xml
/opt/mapr/hadoop/hadoop-<version>/share/hadoop/yarn/hadoop-yarn-server-web-proxy-<version>.jar

Note: Java must be installed, and JAVA_HOME must be set, on the node from which you are submitting jobs.
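A quick preflight check for this requirement (a sketch; check_java is just an illustrative helper name, not part of the MapR tooling):

```shell
# Returns 0 only when java is on the PATH and JAVA_HOME points at a
# directory that actually contains bin/java.
check_java() {
    command -v java >/dev/null 2>&1 || { echo "java not on PATH" >&2; return 1; }
    if [ -z "${JAVA_HOME:-}" ] || [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "JAVA_HOME unset or invalid" >&2
        return 1
    fi
    echo "OK: JAVA_HOME=$JAVA_HOME"
}

check_java || echo "fix Java before configuring the client" >&2
```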


Installing the MapR Spark Client on Linux

The full directions can be found here: Installing the MapR Client on CentOS, RedHat, Oracle Linux 


Create a text file called maprtech.repo in the /etc/yum.repos.d/ directory with the following content, replacing <version> with the version of MapR that you want to install (e.g., v5.2.1):


[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/<version>/redhat/
enabled=1
gpgcheck=0

[maprecosystem]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/ecosystem-5.x/redhat
enabled=1
gpgcheck=0
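The repo file can also be generated from a script, which keeps the version string in one place. A sketch: the baseurls follow the layout of MapR's public package server and should be verified for your release, and REPO_DIR falls back to a temp directory only so the script can be dry-run without root:

```shell
# Write maprtech.repo for a given MapR core version.
MAPR_VERSION="${MAPR_VERSION:-v5.2.1}"
REPO_DIR="${REPO_DIR:-/etc/yum.repos.d}"
# Fall back to a temp dir when /etc/yum.repos.d is not writable (dry run).
[ -w "$REPO_DIR" ] || REPO_DIR=$(mktemp -d)

cat > "$REPO_DIR/maprtech.repo" <<EOF
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/${MAPR_VERSION}/redhat/
enabled=1
gpgcheck=0

[maprecosystem]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/ecosystem-5.x/redhat
enabled=1
gpgcheck=0
EOF

echo "wrote $REPO_DIR/maprtech.repo"
```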



Update YUM:


yum update 


Install the MapR Client for your target architecture:


yum install mapr-client.i386

yum install mapr-client.x86_64


Next, we need to install the MapR build of Spark. It's important not to use the community build, because it will not work with a secure cluster:


yum install mapr-spark


Run configure.sh to configure the client. For details about the syntax, parameters, and behavior of configure.sh, see the MapR documentation:

/opt/mapr/server/configure.sh -N <cluster name> -c -C <cldb host1>:7222,<cldb host2>:7222,<cldb host3>:7222 -HS <JHS node>

Jobs must be submitted as a cluster user that exists on this host. For example, the default user 'mapr' can be created as follows:


groupadd mapr -g5000

useradd mapr -gmapr -u5000
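MapR resolves user identity numerically, so the local UID and GID must match the values on the cluster. A small check (check_user is an illustrative helper, and 5000/5000 are just the defaults used above):

```shell
# Returns 0 only when <name> exists locally with the expected UID and GID.
check_user() {  # usage: check_user <name> <uid> <gid>
    [ "$(id -u "$1" 2>/dev/null)" = "$2" ] && [ "$(id -g "$1" 2>/dev/null)" = "$3" ]
}

if check_user mapr 5000 5000; then
    echo "user mapr matches the cluster UID/GID"
else
    echo "create or fix the mapr user before submitting jobs" >&2
fi
```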



Copy the Hive site file to /opt/mapr/spark/spark-<version>/conf/ and the hadoop-yarn-server-web-proxy JAR file to /opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/yarn/ on the client node.


Finally, follow these steps to add the Spark JAR files to a world-readable location on MapR-FS. This is not required, but it lets YARN cache the JARs on the nodes instead of distributing them each time an application runs:

Configure Spark JAR Location (Spark 2.0.1 and later) 
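The linked steps amount to copying the Spark JARs into MapR-FS (for example with hadoop fs -mkdir -p /apps/spark followed by hadoop fs -put $SPARK_HOME/jars/*.jar /apps/spark/) and then pointing Spark at that location in spark-defaults.conf. A sketch of the resulting config line, assuming /apps/spark as the world-readable directory:

```
spark.yarn.jars maprfs:///apps/spark/*.jar
```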


Verify and Test


First, let's test that basic Hadoop commands can run on the cluster and submit jobs to the YARN queue:


$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.0-mapr-<version>.jar teragen 1000 /tmp/teragen


Then, let's test that we can access the Spark shell in Scala and Python:


$SPARK_HOME/bin/spark-shell --master yarn --deploy-mode client

$SPARK_HOME/bin/pyspark --master yarn --deploy-mode client


And then submit a Spark job on YARN: 


/opt/mapr/spark/spark-2.1.0/bin/run-example --master yarn --deploy-mode client SparkPi 10
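run-example is a convenience wrapper around spark-submit; for your own application JAR you would call spark-submit directly. A sketch that only builds the command string so it can be inspected before running (the examples JAR path is an assumption based on the standard Spark 2.1.0 layout; adjust it for your install):

```shell
# Build a spark-submit command line for a YARN client-mode job.
build_submit_cmd() {  # usage: build_submit_cmd <spark_home> <main_class> <jar> [args...]
    local spark_home="$1" main_class="$2" jar="$3"
    shift 3
    printf '%s/bin/spark-submit --master yarn --deploy-mode client --class %s %s %s\n' \
        "$spark_home" "$main_class" "$jar" "$*"
}

CMD=$(build_submit_cmd /opt/mapr/spark/spark-2.1.0 \
    org.apache.spark.examples.SparkPi \
    /opt/mapr/spark/spark-2.1.0/examples/jars/spark-examples_2.11-2.1.0.jar 10)
echo "$CMD"
# On a configured client you would then run: eval "$CMD"
```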


Further Reading