How to Use Spark & PySpark with Zeppelin using the Native Spark Interpreter


Introduction

 *Confirmed working with Spark 2.1.0 and Zeppelin 0.7.2 on 6/14/17

 

Apache Zeppelin is a web-based notebook project that enables interactive data analytics. Apache Zeppelin 0.7.2 was recently released, so we'd like to assist our customers in getting Zeppelin up and running on the MapR Platform. Here, we'll explain how to get Zeppelin working on the MapR Converged Data Platform with Apache Spark and walk through a quick example.

 

The versions used for this demo are:

  • Apache Zeppelin 0.7.2
  • Apache Spark 2.1.0

 

Note: Zeppelin for MapR is not formally supported. Any problems should be addressed in MapR Answers or in the Apache Zeppelin community.

 

Installing Zeppelin 

For these purposes, we're going to use the newest binary package, available from the Apache Zeppelin download page, and install Zeppelin to /opt/zeppelin.

 

As a user with sudo access, download and unpack the Zeppelin binary (use the package that includes all interpreters):

 

mkdir -p /opt/zeppelin

wget <link to suggested mirror>.tgz  -P /tmp/

gunzip /tmp/zeppelin-<version>-bin-all.tgz

tar -xf /tmp/zeppelin-<version>-bin-all.tar -C /opt/zeppelin/

 

Change the owner of these files to your MapR cluster user; we'll use 'mapr' for these purposes:

 

chown -R mapr:mapr /opt/zeppelin

 

Note: do the rest as your MapR cluster user.

su mapr

 

Check whether port 8080 (the default Zeppelin port) is already in use, for example with netstat -an | grep 8080. If it is, here's how you can change the port Zeppelin uses.

 

First, create a Zeppelin environment configuration file:

cp /opt/zeppelin/zeppelin-<version>-bin-all/conf/zeppelin-env.sh.template /opt/zeppelin/zeppelin-<version>-bin-all/conf/zeppelin-env.sh

 

Open this file in a text editor and add the following to change the default port:

export ZEPPELIN_PORT=<Your Port #>                       

 

Start Zeppelin:

/opt/zeppelin/zeppelin-<version>-bin-all/bin/zeppelin-daemon.sh start

You should see output similar to:

Log dir doesn't exist, create /opt/zeppelin/zeppelin-<version>-bin-all/logs

Pid dir doesn't exist, create /opt/zeppelin/zeppelin-<version>-bin-all/run

Zeppelin start                                             [  OK  ]

 

Check to see that Zeppelin is up and running by visiting the Zeppelin Web UI at the port you specified above:

http://<Hostname or IP>:<Your Port #>         


Configure Zeppelin for Spark

From the Zeppelin docs:

 

Apache Spark is supported in Zeppelin via the Spark interpreter group, which consists of the five interpreters below.

Name             Class                  Description
%spark           SparkInterpreter       Creates a SparkContext and provides a Scala environment
%spark.pyspark   PySparkInterpreter     Provides a Python environment
%spark.r         SparkRInterpreter      Provides an R environment with SparkR support
%spark.sql       SparkSQLInterpreter    Provides a SQL environment
%spark.dep       DepInterpreter         Dependency loader
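
The prefix on a paragraph's first line selects the interpreter. As a minimal sketch, a PySpark paragraph looks like this (Zeppelin provides the SparkContext to the note as sc):

%spark.pyspark
# 'sc' is the SparkContext Zeppelin creates for the note
print(sc.version)   # e.g., 2.1.0 for the build used in this demo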

 

To see a list of all configurable Spark properties, please visit:

Configuration - Spark Documentation 
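
Once the interpreter is configured (next section), you can also inspect the properties that are actually in effect from a notebook paragraph. A minimal sketch:

%spark.pyspark
# Dump the effective Spark configuration for this interpreter
for key, value in sorted(sc.getConf().getAll()):
    print("%s=%s" % (key, value))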

 

Configure Spark 

Export SPARK_HOME, HADOOP_HOME, and HADOOP_CONF_DIR in /opt/zeppelin/zeppelin-<version>-bin-all/conf/zeppelin-env.sh as follows:

 

## Spark interpreter options ##

# set spark home dir

export SPARK_HOME=/opt/mapr/spark/<spark version>

 

# set hadoop home dir

export HADOOP_HOME=/opt/mapr/hadoop/<hadoop version>

 

# set hadoop conf dir (the conf dir lives under the Hadoop home, not the home itself)

export HADOOP_CONF_DIR=/opt/mapr/hadoop/<hadoop version>/etc/hadoop

 

Restart Zeppelin with:

 

/opt/zeppelin/zeppelin-<version>-bin-all/bin/zeppelin-daemon.sh restart
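
After the restart, it's worth verifying from a notebook that the interpreter picked up the Spark build under SPARK_HOME. A minimal sketch, assuming the %spark.pyspark interpreter is bound to your note:

%spark.pyspark
# Confirm the interpreter is using the Spark build under SPARK_HOME
print(sc.version)   # should match the version in your SPARK_HOME path
print(sc.master)    # still local[*] until you change it in the next step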

 

Finally, to make SparkSQL work, you need to go to the Spark interpreter settings and change the Master property from local[*] to yarn-client.
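
Once that property is saved, a quick way to confirm SparkSQL works is to register a small DataFrame and query it from PySpark. A minimal sketch; the sample rows and the temp view name 'people' are made up for illustration:

%spark.pyspark
# 'spark' is the SparkSession Zeppelin provides for Spark 2.x
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")   # hypothetical temp view name
spark.sql("SELECT name FROM people WHERE age > 30").show()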


To test this, we recommend running the built-in Zeppelin Tutorial for Basic Features (Spark) and then looking in the YARN ResourceManager and Spark History Server UIs to confirm that the jobs ran on the cluster.
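
If you'd rather not run the whole tutorial, a one-paragraph smoke test also works; the job typically appears in the YARN ResourceManager UI under the interpreter's default application name, Zeppelin. A minimal sketch:

%spark.pyspark
# A trivial distributed job: sum the integers 1..100 across the cluster
rdd = sc.parallelize(range(1, 101))
print(rdd.sum())   # expect 5050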