Sooner or later, you're going to need to run one of the popular Python libraries everybody is talking about, like Matplotlib or SciPy. You may have noticed that it isn't as simple as installing the library on your local machine and submitting jobs to the cluster: for the Spark executors to access these libraries, they have to live on each of the Spark worker nodes.
While you could manually install each of these environments across the cluster using pip or Anaconda, that approach carries a fair bit of IT overhead: someone has to manage the installations, keep them current, and accommodate each developer's preferences.
MapR recommends creating the environment you want to use, zipping it up, storing it in MapR-FS, and then letting Spark distribute the archive across the cluster. There is some minimal impact on job spin-up time, depending on the size of the archive, but the advantages are:
- No IT involvement: archives are unzipped into the YARN temporary staging directory at runtime and then removed when the job is complete.
- Collaboration: many users can share one environment.
- Easy to customize: if you need to make changes, it's very simple to alter and then store back to your global namespace.
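As a sketch of the underlying mechanism (the job script name and paths here are illustrative, not from this post), a PySpark job can ship the zipped environment to every executor with spark-submit's `--archives` option:

```shell
# Illustrative spark-submit: YARN unpacks the archive from MapR-FS on each node.
# The fragment after '#' names the directory the archive is unpacked into.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --archives maprfs:///user/mapr/python_envs/mapr_numpy.zip#mapr_numpy \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./mapr_numpy/bin/python \
  --conf spark.executorEnv.PYSPARK_PYTHON=./mapr_numpy/bin/python \
  my_job.py
```

The two `--conf` lines point both the application master and the executors at the Python interpreter inside the unpacked archive, so the job runs against the shipped environment rather than the system Python.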
This post will walk through how to do this with Conda environments. We have published a parallel post describing how to do this with virtualenv, and steps for both are or will be available in our docs:
Use an Existing Conda Environment from the MapR Data Science Refinery
Launch the MapR Data Science Refinery container, specifying the path to your Python archive in the docker run command or in an environment variable file, like so:
docker run -it [...]
-e ZEPPELIN_ARCHIVE_PYTHON=/path/to/python_envs/custom_pyspark_env.zip [...]
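The same setting can instead live in an environment-variable file passed with docker's `--env-file` flag (a sketch; the file name is illustrative):

```shell
# mapr.env (illustrative name): one KEY=VALUE per line, no quotes
# ZEPPELIN_ARCHIVE_PYTHON=/path/to/python_envs/custom_pyspark_env.zip

docker run -it --env-file ./mapr.env [...]
```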
MSG: Copying archive from MapR-FS: /user/mapr/python_envs/mapr_numpy.zip -> /home/mapr/zeppelin/archives/zip/mapr_numpy.zip
MSG: Extracting archive locally
MSG: Configuring Spark to use custom Python
MSG: Configuring Zeppelin to use custom Python with Spark interpreter
If you built this archive using the example below, the path would be:
Now this environment is available to you in Apache Zeppelin and for all PySpark jobs. You can test this by checking your Python version:
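For example, in a Zeppelin notebook paragraph (a sketch; it assumes the `%pyspark` interpreter and the `sc` SparkContext that Zeppelin provides by default):

```python
# Run inside a Zeppelin %pyspark paragraph.
import sys

# Version of the Python interpreter driving the notebook:
print(sys.version)

# To confirm the executors also use the archived environment, uncomment:
# print(sc.parallelize([0]).map(lambda _: __import__("sys").version).collect())
```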
Create a New Conda Environment Using the MapR Data Science Refinery
Continuum Analytics provides an installer for Conda called Miniconda, which contains only Conda and its dependencies, and this installer is what we’ll be using today. You can also install the full build of Anaconda if you prefer.
The installer unpacks packages compressed with bzip2, so install that first:
sudo yum install bzip2
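The download and install steps look roughly like this (the URL and file name reflect the Linux 64-bit Miniconda installer at the time of writing; check Continuum's download page for the current one):

```shell
# Fetch and run the Miniconda installer for Linux x86_64
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```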
We typically recommend the default settings here, but feel free to change them to suit your needs.
Next, we have to create the Python environment that we want to use. For these purposes, we'll show how to create one using Python 2.7.x and the NumPy library, and we'll call this environment 'mapr_numpy'.
[/path/to/]conda create -p ./mapr_numpy python=2 numpy
[mapr@]$ /home/mapr/miniconda3/bin/conda create -p ./mapr_numpy python=2 numpy
Fetching package metadata ...........
Solving package specifications: .
Package plan for installation in environment /home/mapr/mapr_numpy:
The following NEW packages will be INSTALLED:
Proceed ([y]/n)? y
You can test that this Conda environment was created correctly by checking its Python version (the environment's interpreter lives under its bin/ directory):
./mapr_numpy/bin/python --version
Python 2.7.14 :: Anaconda, Inc.
Store the Conda Environment to MapR-FS
First, we need to zip this environment up, running the command from inside the environment's directory so that the archive root contains bin/, lib/, and so on (use whichever tool you prefer):
cd mapr_numpy
zip -r mapr_numpy.zip ./
Then, store it to a directory you have access to in MapR-FS:
hadoop fs -put mapr_numpy.zip /user/mapr/python_envs/
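You can confirm the upload landed where the Refinery expects it (the path matches the example above):

```shell
# List the target directory in MapR-FS; mapr_numpy.zip should appear.
hadoop fs -ls /user/mapr/python_envs/
```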