How To Use Jupyter & PySpark on MapR

Document created by Rachel Silver on Nov 30, 2016. Last modified by aalvarez on Jan 16, 2017.

Introduction

From Project Jupyter:
"The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more."

 

We occasionally hear customers expressing interest in getting Jupyter up and running on the MapR platform. Here, we're going to explain how to get Jupyter working on the MapR Converged Data Platform with Apache Spark in cluster mode.

The versions used for this demo are:

  • Apache Spark 1.6.1
  • Jupyter 1.0.0

This tutorial will assume that you've already installed Spark 1.6.1 on MapR. If this is not the case, please follow the directions here to do so:

MapR Documentation | Spark

 

Install Jupyter + PySpark

PySpark uses Python and Spark; however, a few additional Python packages are needed for array manipulation (NumPy) and unit tests (nose). To install these additional packages, run the following commands as the 'root' user:

Install Dependencies

# as root: install pip, then the Python packages PySpark needs
yum -y install python-pip

# nose for Python unit tests, numpy for array manipulation
pip install nose
pip install numpy
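
To confirm that the packages installed correctly, a quick sanity check from a Python shell (our addition, not part of the original walkthrough) might look like this:

# verify the freshly installed packages are importable
import numpy
import nose

print("numpy %s" % numpy.__version__)
print("nose %s" % nose.__version__)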

 

Install Jupyter on cluster (or local machine)
Jupyter can be installed using conda (with Anaconda) or simply with pip:

 

pip install jupyter

...Collecting jupyter
Downloading jupyter-1.0.0-py2.py3-none-any.whl...
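
As a quick smoke test (our addition; the package name assumes the standard Jupyter 1.0.0 metapackage shown above), you can confirm from Python that the notebook server package imports cleanly:

# confirm the notebook server package that 'pip install jupyter' pulls in
import notebook
print(notebook.__version__)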

 

Configure Jupyter

For this part, you'll need to be a cluster user that can submit Spark/YARN jobs. For this example, we'll perform these tasks as the 'mapr' user.

 

Set Server Parameters (hostname, port, etc.)

First, generate a config file:

 

su mapr

jupyter notebook --generate-config

 

Writing default config to: /home/mapr/.jupyter/jupyter_notebook_config.py

 

Then, open the config file generated above (jupyter_notebook_config.py) and set the variables as needed. For example, to have the server listen on all interfaces so you can access it on a public interface, you would set the following*:

 

c.NotebookApp.ip = '*'

 

*We recommend securing this server with, at a minimum, SSL and a hashed password. Directions can be found in the Jupyter Notebook documentation.
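
For reference, here is a minimal sketch of what those hardening settings look like in jupyter_notebook_config.py. The file paths are placeholders, and the password hash would be generated with Jupyter's passwd() helper:

# minimal hardening sketch for jupyter_notebook_config.py
# generate the hash below with:
#   python -c "from notebook.auth import passwd; print(passwd())"
c.NotebookApp.ip = '*'                             # listen on all interfaces
c.NotebookApp.port = 8888                          # default port
c.NotebookApp.certfile = u'/home/mapr/mycert.pem'  # SSL certificate (placeholder path)
c.NotebookApp.keyfile = u'/home/mapr/mykey.key'    # SSL private key (placeholder path)
c.NotebookApp.password = u'sha1:...'               # hashed password from passwd()
c.NotebookApp.open_browser = False                 # headless server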

 

Start Jupyter Notebook

 

Then, to start the notebook, run the following and note the URL it provides:

 

PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser' $SPARK_HOME/bin/pyspark

 

[I 15:36:06.049 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 16:36:00.125 NotebookApp] The port 8888 is already in use, trying another port.
[I 16:36:00.129 NotebookApp] Serving notebooks from local directory: /root
[I 16:36:00.129 NotebookApp] 0 active kernels
[I 16:36:00.129 NotebookApp] The Jupyter Notebook is running at: http://[hostname]:8889/
[I 16:36:00.129 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

Navigate to the Jupyter Notebook in your browser using your hostname and the port from the log output, e.g. http://<hostname>:8889. (In the output above, the server fell back to port 8889 because 8888 was already in use.)

 

Example

Let's test this out by entering "sc" into a cell to inspect the SparkContext that PySpark created for this session. If everything is wired up correctly, the cell should display a SparkContext object rather than raising a NameError.
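
Beyond inspecting sc, a minimal end-to-end check (our sketch; it uses only the sc handle that the PySpark driver already provides) is the classic Monte Carlo estimate of pi, which pushes a real job out to the executors:

import random

def inside(_):
    # pick a random point in the unit square and test whether
    # it lands inside the quarter circle of radius 1
    x, y = random.random(), random.random()
    return x * x + y * y < 1.0

n = 100000
count = sc.parallelize(range(n)).filter(inside).count()
print("Pi is roughly %f" % (4.0 * count / n))

If this prints a value near 3.14, Jupyter, PySpark, and the cluster are all talking to each other; you can also check sc.master in another cell to confirm which master the job ran against.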

 
