
Python Environments for PySpark, Part 1: Using Condas

Blog Post created by maprcommunity Employee on Aug 16, 2017

By Rachel Silver

Are you a data scientist, engineer, or researcher who is just getting into distributed processing and PySpark, and wants to run some of the fancy new Python libraries you’ve heard about, like Matplotlib?

If so, you may have noticed that it’s not as simple as installing them on your local machine and submitting jobs to the cluster. In order for the Spark executors to access these libraries, they have to live on each of the Spark worker nodes.

You could go through and manually install each of these libraries on every node using pip, but maybe you also want the ability to use multiple versions of Python or of other libraries, like pandas? Maybe you also want to allow other colleagues to specify their own environments and combinations?

If this is the case, then you should be looking toward using conda environments to provide specialized and personalized Python configurations that are accessible to your Python programs. Conda is a tool that keeps track of conda packages and tarball files containing Python (or other) libraries and maintains the dependencies between packages and the platform.

Continuum Analytics provides an installer for conda called Miniconda, which contains only conda and its dependencies, and this installer is what we’ll be using today.

For this blog, we’ll focus on submitting jobs with spark-submit. In a later iteration of this blog, we’ll cover how to use these environments in notebooks like Apache Zeppelin and Jupyter.

Installing Miniconda and Python Libraries on All Nodes

If you have a larger cluster, I recommend using a tool like pssh (parallel SSH) to automate these steps across all nodes.

To begin, we’ll download and install the Miniconda installer for Linux (64-bit) on each node where Apache Spark is running. Please make sure, before beginning the install, that you have the bzip2 library installed on all hosts:
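A minimal sketch of those steps (assuming a RHEL/CentOS node and the standard Miniconda3 Linux x86_64 installer URL; adjust the package manager and URL for your distribution):

# bzip2 is needed to unpack the self-extracting installer
sudo yum install -y bzip2

# download and run the 64-bit Linux Miniconda3 installer
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh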


I recommend choosing /opt/miniconda3/ as the install directory, and, when the install completes, you’ll need to close and reopen your terminal session so the updated PATH takes effect.

If your install is successful, you should be able to run ‘conda list’ and see the following packages:
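The exact listing varies with the installer build, but a fresh Miniconda environment should include at least the core packages, roughly like this (versions shown are placeholders):

conda list
# packages in environment at /opt/miniconda3:
#
conda                     <version>
pip                       <version>
python                    3.6.1
setuptools                <version>
wheel                     <version>
...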

 

Miniconda installs an initial default conda environment, running Python 3.6.1. To make sure the installation worked, run a version command:

python -V
Python 3.6.1 :: Continuum Analytics, Inc.

To explain what’s going on here: we haven’t removed the previous default version of Python, which can still be found at its default path, /bin/python. Much like Java alternatives, we’ve simply added new Python installations that we can point to when submitting jobs, without disrupting our cluster environment. See:

/bin/python -V
Python 2.7.5

Now, let’s go ahead and create a test environment with access to Python 3.5 and the NumPy library.

First, we create the conda environment and specify the Python version (do this as your cluster user):

conda create --name mapr_numpy python=3.5

Next, let’s go ahead and install NumPy to this environment:

conda install --name mapr_numpy numpy

Then, let’s activate this environment, and check the Python version:
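A sketch of that step (the reported patch release of Python 3.5 will vary):

source activate mapr_numpy
python -V
Python 3.5.x :: Continuum Analytics, Inc.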

 

Please complete these steps for all nodes that will run PySpark code.

Using Spark-Submit with Conda

Let’s begin with something very simple, referencing environments and checking the Python version to make sure it’s being set correctly. Here, I’ve made a tiny script that prints the Python version:
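Something along these lines should work, with a hypothetical script name (print_version.py) and the PYSPARK_PYTHON environment variable used to point spark-submit at the interpreter inside a conda environment:

print_version.py:

import sys

# Print the interpreter that spark-submit launched this driver with
print(sys.version)

Submitting it against the mapr_numpy environment (client mode, with spark-submit on the PATH):

PYSPARK_PYTHON=/opt/miniconda3/envs/mapr_numpy/bin/python spark-submit print_version.py

The output should report Python 3.5.x rather than the system default.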

 

Testing NumPy

Now, let’s make sure this worked!

I’m creating a little test script called spark_numpy_test.py, containing the following:
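The original script isn’t reproduced here, but a minimal sketch that forces the executors (not just the driver) to import NumPy could look like this:

from pyspark import SparkContext
import numpy as np

sc = SparkContext(appName="spark_numpy_test")

# Using a NumPy function inside the map means every executor has to
# import numpy, not just the driver
rdd = sc.parallelize(range(100)).map(lambda x: np.sqrt(x))
print(rdd.take(5))

sc.stop()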

 

If I were to run this script without activating or pointing to my conda with NumPy installed, I would see this error:
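The exact traceback depends on which interpreter gets picked up, but it ends in an import failure along the lines of:

ImportError: No module named numpy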

 

In order to get around this error, we’ll specify the Python environment in our submit statement:
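One way to do this is to set PYSPARK_PYTHON on the submit command, so both the driver and the executors use the mapr_numpy interpreter (Spark 2.1+ also accepts the equivalent spark.pyspark.python configuration):

PYSPARK_PYTHON=/opt/miniconda3/envs/mapr_numpy/bin/python spark-submit spark_numpy_test.py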

 

Now for Something a Little More Advanced...

This example of PySpark, using the NLTK Library for Natural Language Processing, has been adapted from Continuum Analytics.

We’re going to run through a quick example of word tokenization and part-of-speech tagging that demonstrates the use of Python environments with Spark on YARN.

First, we’ll create a new conda environment and add NLTK to it on all cluster nodes:

conda create --name mapr_nltk nltk python=3.5
source activate mapr_nltk

Note that some builds of PySpark are not compatible with Python 3.6, so we’ve specified an older version.

Next, we have to download the demo data from the NLTK repository:
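A sketch of the download step, using NLTK’s bundled downloader (the target path is a placeholder for your cluster’s NFS-mounted MapR-FS home directory):

python -m nltk.downloader -d /mapr/<cluster-name>/user/<cluster-user>/nltk_data all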

 

This step will download all of the data to the directory that you specify: in this case, the default MapR-FS directory for the cluster user, which is accessible by all nodes in the cluster.

Next, create the following Python script: nltk_test.py
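The original script isn’t shown here; a minimal sketch in the same spirit (the input path and the nltk_data location are placeholders you’d replace with your own) might be:

from pyspark import SparkContext

def word_tokenize(line):
    # Import inside the function so the import happens on the executors,
    # which run the mapr_nltk environment
    import nltk
    nltk.data.path.append("/mapr/<cluster-name>/user/<cluster-user>/nltk_data")
    return nltk.word_tokenize(line)

def pos_tag(word):
    import nltk
    nltk.data.path.append("/mapr/<cluster-name>/user/<cluster-user>/nltk_data")
    return nltk.pos_tag([word])

sc = SparkContext(appName="nltk_test")

# Any text file that the cluster can read will do
data = sc.textFile("/user/<cluster-user>/some_text_file.txt")

words = data.flatMap(word_tokenize)
tagged = words.map(pos_tag)
print(tagged.take(10))

sc.stop()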

 

Then, run the following as the cluster user to test:
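For example, again pointing at the mapr_nltk interpreter and submitting to YARN (adjust the master and any other options to match how your cluster normally runs Spark):

PYSPARK_PYTHON=/opt/miniconda3/envs/mapr_nltk/bin/python spark-submit --master yarn nltk_test.py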

 


Editor's note: this blog post was originally published in the Converge Blog on June 29, 2017
