Use SparkR with RStudio

Question asked by Vinayak Meghraj on Sep 21, 2016

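Steps to connect an R program to a Spark cluster from RStudio. The commands assume that SPARK_HOME is set in the environment R runs in and points at the Spark installation (Spark 1.6.1 in the example session below).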


1) library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))

2) Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client --num-executors 1 sparkr-shell")

3) sc <- sparkR.init("yarn-client",sparkEnvir = list(spark.driver.memory="4g"))

4) sqlContext <- sparkRSQL.init(sc)

5) df <- createDataFrame(sqlContext, faithful)

6) head(df)
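
The six steps above can also be collected into one script. This is only a minimal sketch, assuming SparkR 1.6.x is installed under SPARK_HOME and a YARN cluster is reachable. The function names `build_submit_args` and `run_faithful_demo` are hypothetical helpers introduced for illustration and are not part of SparkR. The SPARK_HOME check and the `sparkR.stop()` cleanup are additions to the original steps.

```r
# Sketch only: assumes SPARK_HOME is set and SparkR 1.6.x is installed there.

# Hypothetical helper: assembles the SPARKR_SUBMIT_ARGS value used in step 2.
build_submit_args <- function(master = "yarn-client", num_executors = 1) {
  paste("--master", master, "--num-executors", num_executors, "sparkr-shell")
}

# Hypothetical wrapper around steps 1-6. Returns the first rows of the
# Spark DataFrame as a local data.frame.
run_faithful_demo <- function(master = "yarn-client", driver_memory = "4g") {
  spark_home <- Sys.getenv("SPARK_HOME")
  if (!nzchar(spark_home)) {
    stop("SPARK_HOME is not set; point it at the Spark installation first.")
  }

  # Step 1: load SparkR from the Spark installation.
  library(SparkR, lib.loc = file.path(spark_home, "R", "lib"))

  # Step 2: arguments passed to spark-submit when the backend is launched.
  Sys.setenv(SPARKR_SUBMIT_ARGS = build_submit_args(master))

  # Step 3: start the Spark context; stop it again when the function exits.
  sc <- sparkR.init(master, sparkEnvir = list(spark.driver.memory = driver_memory))
  on.exit(sparkR.stop(), add = TRUE)

  # Steps 4-6: SQL context, DataFrame from the built-in faithful dataset, preview.
  sqlContext <- sparkRSQL.init(sc)
  df <- createDataFrame(sqlContext, faithful)
  head(df)
}
```

Calling `run_faithful_demo()` from the RStudio console should then return the same first rows of `faithful` that `head(df)` prints in the session below, provided the cluster is available.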


[mapr@n2b ~]$ R


> library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))


Attaching package: ‘SparkR’

The following objects are masked from ‘package:stats’:

    cov, filter, lag, na.omit, predict, sd, var

The following objects are masked from ‘package:base’:

    colnames, colnames<-, endsWith, intersect, rank, rbind, sample,
    startsWith, subset, summary, table, transform


> Sys.setenv("SPARKR_SUBMIT_ARGS"="--master yarn-client --num-executors 1 sparkr-shell")

> sc <- sparkR.init("yarn-client")

Launching java with spark-submit command /opt/mapr/spark/spark-1.6.1/bin/spark-submit   --master yarn-client --num-executors 1 sparkr-shell /tmp/RtmpfMjP3S/backend_port6c9a2baab610

> sqlContext <- sparkRSQL.init(sc)

> df <- createDataFrame(sqlContext, faithful)

> head(df)

  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74
4     2.283      62
5     4.533      85
6     2.883      55

> q()