
Use SparkR with RStudio

Question asked by Vinayak Meghraj on Sep 21, 2016

Steps to connect your R program to a Spark cluster (via YARN) from RStudio:

 

1) library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

2) Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client --num-executors 1 sparkr-shell")

3) sc <- sparkR.init("yarn-client",sparkEnvir = list(spark.driver.memory="4g"))

4) sqlContext <- sparkRSQL.init(sc)

5) df <- createDataFrame(sqlContext, faithful)

6) head(df)
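Step 1) assumes SPARK_HOME is set in the environment of the R process. RStudio does not always inherit your login shell's profile, so as a fallback you can set it from R before loading the package. The path below is the MapR Spark 1.6.1 location that appears in the spark-submit line of the session output further down; adjust it for your own install:

```r
# Set SPARK_HOME for this R session if RStudio did not inherit it.
# Path taken from the spark-submit line in the session below;
# adjust for your cluster's Spark install location.
if (nchar(Sys.getenv("SPARK_HOME")) < 1) {
  Sys.setenv(SPARK_HOME = "/opt/mapr/spark/spark-1.6.1")
}
```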

  

[mapr@n2b ~]$ R

 

R version 3.3.1 (2016-06-21) -- "Bug in Your Hair"

Copyright (C) 2016 The R Foundation for Statistical Computing

Platform: x86_64-redhat-linux-gnu (64-bit)

 

R is free software and comes with ABSOLUTELY NO WARRANTY.

You are welcome to redistribute it under certain conditions.

Type 'license()' or 'licence()' for distribution details.

 

  Natural language support but running in an English locale

 

R is a collaborative project with many contributors.

Type 'contributors()' for more information and

'citation()' on how to cite R or R packages in publications.

 

Type 'demo()' for some demos, 'help()' for on-line help, or

'help.start()' for an HTML browser interface to help.

Type 'q()' to quit R.

 

Warning: namespace ‘SparkR’ is not available and has been replaced

by .GlobalEnv when processing object ‘df’

[Previously saved workspace restored]

 

> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

 

Attaching package: ‘SparkR’

 

The following objects are masked from ‘package:stats’:

 

    cov, filter, lag, na.omit, predict, sd, var

 

The following objects are masked from ‘package:base’:

 

    colnames, colnames<-, endsWith, intersect, rank, rbind, sample,

    startsWith, subset, summary, table, transform

 

> Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client --num-executors 1 sparkr-shell")

> sc <- sparkR.init("yarn-client")

Launching java with spark-submit command /opt/mapr/spark/spark-1.6.1/bin/spark-submit   --master yarn-client --num-executors 1 sparkr-shell /tmp/RtmpfMjP3S/backend_port6c9a2baab610

16/09/19 08:53:44 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

> sqlContext <- sparkRSQL.init(sc)

> df <- createDataFrame(sqlContext, faithful)

> head(df)

  eruptions waiting

1     3.600      79

2     1.800      54

3     3.333      74

4     2.283      62

5     4.533      85

6     2.883      55

> q()
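The session above quits R while the YARN application is still holding its executor. With the Spark 1.x API used here, you can shut the context down first with sparkR.stop() so the cluster resources are released cleanly before exiting:

```r
# Shut down the SparkR backend and free the YARN containers
# before quitting the R session.
sparkR.stop()
```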
