How To: Using R Studio with the MapR Data Science Refinery

Blog Post created by Rachel Silver on Mar 23, 2018

A design goal for this release of the MapR Data Science Refinery (DSR) was to be both portable and extensible, in order to support all types of data science teams. This means that, while we don't ship every possible tool that users will want, we have the right structure in place to allow them to install those tools and have them work seamlessly with direct data access to their MapR Converged Data Platform.

For R Studio, this is accomplished through the sparklyr project, which provides the ability to:

 

  • Connect to Spark from R
  • Perform functions on data in Spark structures and then bring the results into R for analysis and plotting
  • Allow R to leverage distributed Spark ML

Due to the design of the DSR container, R Studio will inherit the security configuration of the container, and jobs will be submitted as the user specified by the MapR-SASL ticket or in Docker Run.

 

Preparation

 

In order to access the R Studio GUI from your web browser, you will need to pass a port mapping into Docker Run. By default, R Studio listens on port 8787, and this can be passed in as such:

docker run ... -p 8787:8787 ...
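As a rough sketch, a fuller invocation might look like the following. The image name, user name, and cluster name here are illustrative placeholders, not values from this post; substitute whatever you use when launching the DSR:

```shell
# Illustrative only: image name, user, and cluster below are placeholders.
# The -p flag maps the container's R Studio port (8787) to the host.
docker run -it \
  -p 8787:8787 \
  -e MAPR_CONTAINER_USER=mapruser \
  -e MAPR_CLUSTER=my.cluster.com \
  maprtech/data-science-refinery
```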

 

There are some container-size considerations when adding projects to the DSR. If you aren't planning to use Apache Zeppelin, it might behoove you to remove it in order to keep memory requirements down. Here are the image sizes from my testing with DSR v1.1:

 

  • Size of DSR with Apache Zeppelin:   6.12 GB
  • Size of DSR with Apache Zeppelin + R Studio:  7.157 GB
  • Size of DSR with R Studio: 6.147 GB

 

If you want to remove Zeppelin before starting this install, you can do so with the following commands:

 

rm -rf zeppelin/
rm -rf /opt/mapr/zeppelin

 

If you plan to use R Studio regularly, my recommendation is to save the container with Docker Commit once your configuration is complete.
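Saving the configured container can be sketched as follows; the container ID and image tag below are placeholders for your own values:

```shell
# Find the running DSR container's ID, then snapshot it as a new image.
# "dsr-rstudio:configured" is an arbitrary example tag.
docker ps
docker commit <container-id> dsr-rstudio:configured
```

Afterward, you can launch the saved image directly instead of repeating the install steps.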

 

Install R Studio

 

To begin with, grab the most recent open-source build of R Studio Server for your OS - we're going to use CentOS 7 here:

 

wget https://download2.rstudio.org/rstudio-server-rhel-1.1.442-x86_64.rpm

 

Then, install the RPM along with the libcurl, openssl, and xml2 development packages:

 

sudo yum install -y rstudio-server-rhel-1.1.442-x86_64.rpm libcurl-devel openssl-devel libxml2-devel

rm rstudio-server-rhel-1.1.442-x86_64.rpm

 

Log into R Studio

 

As soon as the install is complete, you should be able to log in to R Studio at http://[hostname]:8787, using the credentials that you specified in Docker Run.
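If the login page doesn't load, you can check the server from a shell inside the container; `verify-installation` and `status` are standard rstudio-server subcommands:

```shell
# Confirm R Studio Server installed cleanly and is running.
sudo rstudio-server verify-installation
sudo rstudio-server status
```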

Install sparklyr and dplyr

To install the latest versions of sparklyr and dplyr, we recommend doing so via the devtools package, as this will allow you to pull the most recent builds from GitHub:

 

install.packages("devtools")
devtools::install_github("rstudio/sparklyr")
devtools::install_github("tidyverse/dplyr")

 

These will each take a while to install.

Now you just have to set SPARK_HOME and create the Spark connection:

 

library(sparklyr)
options("sparklyr.verbose" = TRUE)
Sys.setenv(SPARK_HOME="/opt/mapr/spark/spark-2.1.0")
sc <- spark_connect(master = "http://localhost:8998", method = "livy")
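If the connection fails, you can first verify from a shell inside the container that the Livy server is listening on its default port (8998); `/sessions` is a standard Livy REST endpoint:

```shell
# Confirm Livy is reachable before connecting from sparklyr.
# A healthy server returns a JSON list of active sessions.
curl -s http://localhost:8998/sessions
```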

Test Spark Connection

 

To test that this is working, I recommend copying R's built-in iris dataset into the Spark context as a table:

 

library(dplyr)

iris_tbl <- copy_to(sc, iris)

 

Once the copy completes, the table appears in the Connections pane in the upper right of the R Studio window.
