Rachel Silver

How To: Run the MapR Data Science Refinery from an Edge Node

Blog post created by Rachel Silver (MapR employee) on Dec 16, 2017

Recently, MapR launched the MapR Data Science Refinery, a novel way to deliver data science functionality and connectivity for your MapR Converged Data Platform.

 

One of the great advantages of this is the ability to deploy the workspace from wherever you choose to do your work: an edge node, a cloud instance, or even your personal laptop!

 

 

Below are the steps required to run it from an edge node. This could be an on-premises server or an edge node deployed in the cloud or on a VM; the only requirement is that a supported flavor of Linux be installed on the node that you intend to use. The supported operating systems are:

  • CentOS 7.x
  • Ubuntu 14
  • Ubuntu 16
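
If you want to check a node against this list before going further, here is a minimal sketch. It assumes the host provides /etc/os-release (CentOS 7 and recent Ubuntu releases do); `is_supported_os` is a hypothetical helper written for this post, not part of any MapR tooling.

```shell
# Sketch: report whether an OS id/version pair is on the supported list above.
# is_supported_os is a hypothetical helper, not a MapR command.
is_supported_os() {
    os_id="$1"        # ID from /etc/os-release, e.g. "centos" or "ubuntu"
    os_version="$2"   # VERSION_ID from /etc/os-release, e.g. "7" or "16.04"
    case "$os_id:$os_version" in
        centos:7|centos:7.*)   return 0 ;;
        ubuntu:14|ubuntu:14.*) return 0 ;;
        ubuntu:16|ubuntu:16.*) return 0 ;;
        *)                     return 1 ;;
    esac
}

# Check the current host in a subshell so the sourced variables do not leak.
if [ -r /etc/os-release ]; then
    (
        . /etc/os-release
        if is_supported_os "${ID:-unknown}" "${VERSION_ID:-unknown}"; then
            echo "supported: ${ID:-unknown} ${VERSION_ID:-unknown}"
        else
            echo "not on the supported list: ${ID:-unknown} ${VERSION_ID:-unknown}"
        fi
    )
fi
```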

 

First, you need to install and start the Docker environment for your operating system. You'll be given a choice between Docker Community Edition (CE) and Docker Enterprise Edition (EE), and either works for this purpose.
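
Before moving on, it can help to confirm that the Docker CLI is installed and that the daemon is actually running. The following is a rough sketch; `docker_ready` is a hypothetical helper, and its optional argument exists only so that the check can be pointed at a different binary name.

```shell
# Sketch: confirm the Docker CLI is installed and the daemon answers.
# docker_ready is a hypothetical helper, not a Docker or MapR command.
docker_ready() {
    bin="${1:-docker}"
    if ! command -v "$bin" >/dev/null 2>&1; then
        echo "$bin is not installed or not on PATH"
        return 1
    fi
    # 'docker info' talks to the daemon, so it fails when the daemon is
    # stopped or the current user is not allowed to reach it.
    if ! "$bin" info >/dev/null 2>&1; then
        echo "$bin is installed, but the daemon is not reachable"
        return 2
    fi
    echo "$bin is ready"
}
```

Call it as `docker_ready`. If it reports that the daemon is not reachable, starting the service (on systemd hosts this is usually `sudo systemctl start docker`) or checking your user's permissions is a reasonable next step.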

 

Once you have this installed, you need to pull the image into your local Docker image repository. The image is published on Docker Hub as maprtech/data-science-refinery, and the command to pull the most recent version is:

 

$ docker pull maprtech/data-science-refinery

 

After you've run this, you can confirm that the image now exists in your local image store by running:

$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
docker.io/maprtech/data-science-refinery v1.0_6.0.0_4.0.0_centos7 <IMAGE ID>
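
If you would rather pin a specific version than take whatever is most recent, you can pull by tag. This is a sketch that assumes the tag shown in the listing above is the one you want and is published on Docker Hub; `image_ref` and `pull_pinned` are hypothetical helpers, and the `DOCKER` variable is only an illustrative override for dry runs.

```shell
# Sketch: pull a pinned tag instead of the default one.
# The tag below is copied from the 'docker images' listing above.
IMAGE_REPO=maprtech/data-science-refinery
IMAGE_TAG=v1.0_6.0.0_4.0.0_centos7

# Build the "repository:tag" reference (hypothetical helper).
image_ref() {
    printf '%s:%s\n' "${1:-$IMAGE_REPO}" "${2:-$IMAGE_TAG}"
}

# Pull the pinned image. Set DOCKER=echo to print the command instead.
pull_pinned() {
    ${DOCKER:-docker} pull "$(image_ref)"
}
```

Running `DOCKER=echo pull_pinned` shows the command without contacting Docker Hub; `pull_pinned` on its own performs the pull.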

For a secure cluster, the only other piece that you need in place at this point is your MapR-SASL ticket, which must be available somewhere on this host. For the steps to generate this ticket, please see this document:

Administrator's Reference for 'maprlogin' 
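
Since the container can only use a ticket that it can read, a quick sanity check on the file can save a failed start later. Here is a minimal sketch; `check_ticket` is a hypothetical helper, and it only inspects the file on disk without validating the ticket's contents or expiry.

```shell
# Sketch: basic checks on the ticket file before handing it to the container.
# check_ticket is a hypothetical helper; it does not verify that the ticket
# is valid or unexpired, only that a non-empty readable file exists.
check_ticket() {
    ticket="$1"
    if [ -z "$ticket" ]; then
        echo "usage: check_ticket </path/to/ticket/file>"
        return 2
    fi
    if [ ! -f "$ticket" ]; then
        echo "no ticket file at $ticket"
        return 1
    fi
    if [ ! -r "$ticket" ]; then
        echo "ticket file $ticket is not readable by the current user"
        return 1
    fi
    if [ ! -s "$ticket" ]; then
        echo "ticket file $ticket is empty"
        return 1
    fi
    echo "ticket file looks usable: $ticket"
}
```

Call it with the same path that you plan to mount into the container, for example `check_ticket </path/to/ticket/file>`.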

 

We recommend creating an environment variable file instead of passing each variable on the Docker Run command line, as it's easier to spot problems. Here is an example file, 'env.list', that we pass into the Docker Run command:

 

MAPR_CLUSTER=<cluster-name>
MAPR_CLDB_HOSTS=<cldb-ip-list>
MAPR_MOUNT_PATH=/mapr
MAPR_TICKETFILE_LOCATION=</path/to/ticket/file>
ZEPPELIN_SSL_PORT=9995
HOST_IP=<docker-host-ip>
MAPR_HS_HOST=<needed if you're using Pig>
MAPR_TZ=<timezone>
MAPR_CONTAINER_USER=<user-name>
MAPR_CONTAINER_PASSWORD=<password>
MAPR_CONTAINER_UID=<uid>
MAPR_CONTAINER_GROUP=<group-name>
MAPR_CONTAINER_GID=<gid>
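
Because the example file is full of placeholders, it's easy to leave one behind. The sketch below flags any value that is still wrapped in angle brackets and checks that a few keys are present. `check_env_file` is a hypothetical helper, and the list of keys it looks for is an assumption picked from the example above, not the documented set of mandatory parameters.

```shell
# Sketch: catch unfilled <placeholders> and missing keys in an env file.
# check_env_file is a hypothetical helper. The key list in the loop is an
# assumption taken from the example file, not an official requirement.
check_env_file() {
    file="$1"
    rc=0
    if [ ! -r "$file" ]; then
        echo "cannot read $file"
        return 2
    fi
    # Print (with line numbers) every value still written as <...>.
    if grep -n '=<.*>' "$file"; then
        echo "the lines above still contain placeholders"
        rc=1
    fi
    for key in MAPR_CLUSTER MAPR_CLDB_HOSTS MAPR_CONTAINER_USER; do
        if ! grep -q "^${key}=." "$file"; then
            echo "missing or empty: $key"
            rc=1
        fi
    done
    return $rc
}
```

Running `check_env_file ./env.list` prints each offending line and returns a non-zero status if anything needs attention.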

Next, you simply use the Docker Run command, passing in the environment file. For more information on this command and its options, please visit this document:

Understanding Zeppelin Docker Parameters 

 

docker run --rm -it \
  --env-file ./env.list \
  --cap-add SYS_ADMIN --cap-add SYS_RESOURCE \
  --device /dev/fuse \
  -p 9995:9995 -p 10000-10010:10000-10010 \
  -v </path/to/ticket/file>:/tmp/dsr_ticket:ro \
  -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
  docker.io/maprtech/data-science-refinery
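
If you start the container often, you may prefer to wrap the command above in a small function so that only the ticket path changes. This is a sketch with the same flags as the command above; `dsr_run` and the `DOCKER` override are hypothetical names introduced here for illustration.

```shell
# Sketch: wrap the docker run command above in a reusable function.
# dsr_run is a hypothetical helper; the flags are copied from the command
# above. Set DOCKER=echo to print the command instead of running it.
dsr_run() {
    ticket="$1"                   # path to the MapR ticket on this host
    env_file="${2:-./env.list}"   # environment file, defaults to ./env.list
    ${DOCKER:-docker} run --rm -it \
        --env-file "$env_file" \
        --cap-add SYS_ADMIN --cap-add SYS_RESOURCE \
        --device /dev/fuse \
        -p 9995:9995 -p 10000-10010:10000-10010 \
        -v "$ticket":/tmp/dsr_ticket:ro \
        -v /sys/fs/cgroup:/sys/fs/cgroup:ro \
        docker.io/maprtech/data-science-refinery
}
```

`DOCKER=echo dsr_run </path/to/ticket/file>` is a handy dry run: it prints the fully expanded command so you can review it before starting anything.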

 

That's it! Now you can log into Zeppelin by visiting the UI at the following address:

 

https://<IP or hostname of host Docker is running on>:9995/
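
Zeppelin can take a little while to come up after the container starts. If you'd like to wait for it from the command line, here is a rough sketch using curl. `zeppelin_url` and `wait_for_zeppelin` are hypothetical helpers, the retry count and delay are arbitrary, and the `-k` flag is there on the assumption that the instance has no CA-signed certificate installed.

```shell
# Sketch: poll the Zeppelin UI until it answers.
# Both functions are hypothetical helpers written for this post.

# Build the UI address from a host and an optional port (default 9995).
zeppelin_url() {
    printf 'https://%s:%s/\n' "$1" "${2:-9995}"
}

wait_for_zeppelin() {
    url="$(zeppelin_url "$1" "$2")"
    tries="${3:-30}"              # number of attempts (arbitrary default)
    while [ "$tries" -gt 0 ]; do
        # -k: skip certificate verification, -s: silent,
        # -f: treat HTTP error statuses as failures.
        if curl -k -s -f -o /dev/null --max-time 5 "$url"; then
            echo "Zeppelin is answering at $url"
            return 0
        fi
        tries=$((tries - 1))
        sleep 2
    done
    echo "no answer from $url"
    return 1
}
```

For example, `wait_for_zeppelin <docker-host-ip>` polls port 9995 on that host until the UI responds or the attempts run out.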

 

Log in using the credentials that you provided to the Docker Run command (in the example above, MAPR_CONTAINER_USER and MAPR_CONTAINER_PASSWORD in env.list). The authorization for the jobs themselves, whether Spark, POSIX, or JDBC, is provided by your MapR-SASL ticket.

 

 

 

In addition, you can peruse the file system using POSIX or Hadoop syntax from the CLI or Zeppelin. This is made possible by the MapR POSIX Client For Containers, which allows MapR customers to mount their global namespace to their Docker container.

$ ls -la /mapr/my.cluster.com/
total 3
drwxr-xr-x 10 mapr mapr 9 Nov 27 08:55 .
dr-xr-xr-x 3 root root 1 Dec 16 17:43 ..
drwxr-xr-x 3 mapr mapr 1 Nov 27 08:51 apps
drwxr-xr-x 2 mapr mapr 0 Nov 27 08:48 hbase
[...]

$ hadoop fs -ls /
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/mapr/hadoop/hadoop-2.7.0/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/mapr/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Found 8 items
drwxr-xr-x - mapr mapr 1 2017-11-27 08:51 /apps
drwxr-xr-x - mapr mapr 0 2017-11-27 08:48 /hbase

[...]
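
The SLF4J lines in the output above are harmless but noisy. If you want just the listing, one option is to filter them out. This is a sketch; `strip_slf4j` is a hypothetical helper that simply drops any line beginning with "SLF4J:".

```shell
# Sketch: drop the SLF4J notices and keep everything else.
# strip_slf4j is a hypothetical helper; it reads stdin and writes stdout.
strip_slf4j() {
    grep -v '^SLF4J:'
}
```

Use it as `hadoop fs -ls / 2>&1 | strip_slf4j`. Merging stderr into the pipe and filtering by prefix, rather than discarding stderr entirely, keeps any real error messages visible.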

 

Common Problems:

 

After running the Docker Run command, you see the following error:

Started service mapr-posix-client-container                [FAILED]

This error can be safely ignored as it is a remnant of an issue with the MapR Persistent Application Client Container (PACC). 

 

Your web browser warns you that the site is unsafe when you visit the Apache Zeppelin UI:

This is okay and expected behavior if you haven't installed an SSL certificate for this instance. 

 

More troubleshooting information can be found here:

Troubleshooting Data Science Refinery 

 
