From Project Jupyter:
"The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning and much more."
We occasionally hear customers expressing interest in getting Jupyter up and running on the MapR platform. Here, we're going to explain how to get Jupyter working on the MapR Converged Data Platform with Apache Spark in cluster mode.
The versions used for this demo are:
This tutorial will assume that you've already installed Spark 1.6.1 on MapR. If this is not the case, please follow the directions here to do so:
MapR Documentation | Spark
Install Jupyter + PySpark
PySpark uses Python and Spark; however, a few additional Python packages are needed for array manipulation and unit tests. To install them, run these commands as the 'root' user:
yum -y install python-pip
pip install nose
pip install numpy
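As an optional sanity check, you can confirm that NumPy imports correctly before moving on (the version string will vary by install):

```python
# Quick sanity check that NumPy installed correctly.
import numpy as np

a = np.arange(5)       # array([0, 1, 2, 3, 4])
print(np.__version__)  # version string varies by install
print(a.sum())         # 10
```

If this raises ImportError, re-run the pip commands above as root.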
Install Jupyter on cluster (or local machine)
Jupyter can be installed with conda (via Anaconda) or simply with pip:
pip install jupyter
For this part, you'll need to be a cluster user that can submit Spark/YARN jobs; in this example, we'll perform these tasks as the 'mapr' user.
Set Server Parameters (hostname, port, etc.)
First, generate a config file:
jupyter notebook --generate-config
Writing default config to: /home/mapr/.jupyter/jupyter_notebook_config.py
Then, open the config file written above (jupyter_notebook_config.py) and set the variables as needed. For example, to make the server reachable on a public interface, you would set the following*:
c.NotebookApp.ip = '*'
*We recommend securing this server using, at least, SSL and a hashed password. Directions can be found here.
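For reference, the hashed password that Jupyter expects in `c.NotebookApp.password` has the form `algorithm:salt:digest`. The following is only a standard-library sketch of that scheme to illustrate the format (the passphrase is made up); in practice, use Jupyter's own password helper as described in the security directions linked above:

```python
import hashlib
import uuid

def hash_passphrase(passphrase, algorithm='sha1'):
    """Sketch of Jupyter's 'algorithm:salt:digest' password format."""
    salt = uuid.uuid4().hex[:12]                 # random 12-char hex salt
    h = hashlib.new(algorithm)
    h.update(passphrase.encode('utf-8') + salt.encode('ascii'))
    return ':'.join((algorithm, salt, h.hexdigest()))

hashed = hash_passphrase('not-a-real-password')
print(hashed)  # e.g. sha1:2c9061e0807a:...
```

The resulting string is what goes into `c.NotebookApp.password` in the config file.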
Start Jupyter Notebook
Then, to start the notebook, simply run the following and note the link it provides:
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS='notebook --no-browser' $SPARK_HOME/bin/pyspark
[I 15:36:06.049 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 16:36:00.125 NotebookApp] The port 8888 is already in use, trying another port.
[I 16:36:00.129 NotebookApp] Serving notebooks from local directory: /root
[I 16:36:00.129 NotebookApp] 0 active kernels
[I 16:36:00.129 NotebookApp] The Jupyter Notebook is running at: http://[hostname]:8889/
[I 16:36:00.129 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Navigate to the Jupyter Notebook using your hostname, like http://<hostname>:8889.
Let's test this out by entering "sc" into a cell to display the Spark Context. If this works, we should see the output depicted below:
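Beyond just inspecting `sc`, a quick computation confirms the cluster is actually doing work. This is a sketch of a cell you might paste into the notebook; it assumes the `sc` SparkContext that pyspark injects into the kernel:

```python
# Paste into a notebook cell; `sc` is the SparkContext created by pyspark.
rdd = sc.parallelize(range(100))
total = rdd.map(lambda x: x * x).sum()
print(total)  # sum of squares of 0..99 = 328350
```

If the job completes and prints the expected sum, Jupyter is successfully driving Spark on the cluster.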