How to Use Spark via Livy Interpreter with Zeppelin

Document created by Rachel Silver on Jun 20, 2017; last modified Aug 24, 2017.

Introduction

 

Apache Zeppelin is a web-based notebook project that enables interactive data analytics, particularly useful for Apache Spark workloads. While Apache Zeppelin has a native Spark Interpreter, MapR recommends using Livy for Apache Spark instead, so you can leverage some enhancements, such as:

  • The ability to submit jobs in YARN-cluster mode
  • Impersonation
  • Dynamic Memory Allocation controls

 

The versions used for this demo are:

  • Livy 0.3.0
  • Spark 2.1.0 (MapR package)
  • Hadoop 2.7.0 (MapR package)

 

Note: Zeppelin for MapR is not formally supported. Any problems should be addressed in Answers or in the Zeppelin Community.

 

Installing Zeppelin 

For these purposes, we're going to use the newest binary package, available from the Apache Zeppelin download page, and install it to /opt/zeppelin.

 

Get and unpack the Zeppelin binary as a user with sudo access (use the one with all interpreters):

 

mkdir -p /opt/zeppelin

wget <link to suggested mirror>.tgz -P /tmp/

tar -xzf /tmp/zeppelin-<version>-bin-all.tgz -C /opt/zeppelin/

 

Change the owner of these files to your MapR cluster user; we'll use 'mapr' for these purposes:

 

chown -R mapr:mapr /opt/zeppelin

 

Note: do the rest as your MapR cluster user.

su mapr

 

Check whether port 8080 (the default Zeppelin port) is available. If it's already in use, here's how you can change it.

 

First, create a Zeppelin environment configuration file:

cp /opt/zeppelin/zeppelin-<version>-bin-all/conf/zeppelin-env.sh.template /opt/zeppelin/zeppelin-<version>-bin-all/conf/zeppelin-env.sh

 

Open this file in a text editor and add the following to change the default port:

export ZEPPELIN_PORT=<Your Port #>                       

 

Start Zeppelin:

/opt/zeppelin/zeppelin-<version>-bin-all/bin/zeppelin-daemon.sh start

Log dir doesn't exist, create /opt/zeppelin/zeppelin-<version>-bin-all/logs

Pid dir doesn't exist, create /opt/zeppelin/zeppelin-<version>-bin-all/run

Zeppelin start                                             [  OK  ]

 

Check to see that Zeppelin is up and running by visiting the Zeppelin Web UI at the port you specified above:

http://<Hostname or IP>:<Your Port #>         
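If you prefer the command line, a quick check from the Zeppelin host itself might look like this (assuming the default port 8080; adjust to the port you configured):

```shell
# Print the HTTP status code from the Zeppelin web UI; 200 means it is serving
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080/
```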

 

 

Installing & Running Livy

 

Do the following as 'root' or a user with sudo permissions to install Livy:

 

mkdir -p /opt/livy

wget http://archive.cloudera.com/beta/livy/livy-server-0.3.0.zip  -P /tmp

unzip /tmp/livy-server-0.3.0.zip -d /opt/livy/

mkdir /var/log/livy

chown mapr:mapr /var/log/livy

chown -R mapr:mapr /opt/livy

su mapr

 

Go into the Livy configuration file with a text editor of your choice and set the following values: 

 

file: /opt/livy/livy-server-0.3.0/conf/livy-env.sh

export SPARK_HOME=/opt/mapr/spark/spark-2.1.0/
export HADOOP_CONF_DIR=/opt/mapr/hadoop/hadoop-2.7.0/
export LIVY_LOG_DIR=/var/log/livy

export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$SPARK_HOME/python/:$PYTHONPATH

 

And, to configure impersonation, please add the following to /opt/mapr/hadoop/hadoop-2.7.0/etc/hadoop/core-site.xml:

 

<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>
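Depending on your Livy version, impersonation may also need to be enabled in Livy itself; this is an assumption to verify against your release, via livy.conf in the same conf directory:

```
# file: /opt/livy/livy-server-0.3.0/conf/livy.conf
livy.impersonation.enabled = true
```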

 

Then, start Livy with this command:

 

/opt/livy/livy-server-0.3.0/bin/livy-server

 

 

Once launched, it will provide you with a URL to the REST service. You can visit this page and make sure everything is running and accessible. There won't be much there, but you should see "Operational Menu" as a heading.
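You can also probe the REST service directly. For example (assuming Livy's default port 8998), listing sessions should succeed on a fresh server:

```shell
# Query the Livy REST API for active sessions (default port 8998);
# returns a small JSON document describing any running sessions
curl -s http://localhost:8998/sessions
```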

 

 

Configure Zeppelin for Livy

 

There are many values that can be set in Zeppelin's Livy interpreter settings to control dynamic memory allocation and other enhancements that Livy provides. But the only one that must be set is:

 

zeppelin.livy.url=http://<host>:8998 

 

 

Full configuration details can be found in the Apache Zeppelin Documentation.
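As a sketch of the kind of tuning available (property names taken from the Zeppelin Livy interpreter documentation; verify them against your Zeppelin version), the interpreter settings might look like:

```
zeppelin.livy.url                          http://<host>:8998
livy.spark.executor.memory                 4g
livy.spark.dynamicAllocation.enabled       true
livy.spark.dynamicAllocation.minExecutors  1
livy.spark.dynamicAllocation.maxExecutors  10
```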

 

To test that it is working, create a new note and try creating sessions using both Scala and PySpark, like so:
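A minimal smoke test might be one Scala paragraph and one PySpark paragraph. The interpreter names below assume the default Livy interpreter binding; adjust them if your binding differs:

```
%livy.spark
sc.version

%livy.pyspark
sc.parallelize(range(100)).sum()
```

Each paragraph should start a Livy session (visible in the Livy UI) and return a result once the YARN application is up.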

 

 

 
