How To: Use Spark 2.0 on MapR

Document created by Rachel Silver on Aug 8, 2016. Last modified by aalvarez on Jan 25, 2017.
Version 3

Introduction

As you may have heard, we recently released Spark 2.0 as a developer preview for the MapR Platform. Since then, the community has released Spark 2.0 as GA. This tutorial shows how to set up our developer preview using the GA build of Spark 2.0.

 

There are multiple ways to run Spark (standalone, YARN), but in this tutorial everything runs as Spark on YARN.

 

Note: Spark 2.0 for MapR is in developer preview mode and is not recommended for production. Any problems should be addressed in Answers or in the Spark community:

Community | Apache Spark

 

Test Environment

  • AWS: 3 x m4.2xlarge Centos 6.7 instances (ami-0dde2e6d)
  • MapR 5.1
  • Hive 1.2.0

For guidance on setting up a MapR cluster on AWS, this blog may be helpful:

Spinning Up a Hadoop Cluster in the Cloud | MapR

 

Set Up Spark 2.0 Developer Preview

Detailed steps can be found in our documentation. The steps below are tailored to this specific environment.

 

These steps should be performed on each node that you intend to use for Spark.

 

  1. Create the Spark directory if it doesn't exist, then download the Spark 2.0 Technical Preview package and unpack it into that directory:

    mkdir -p /opt/mapr/spark

    cd /tmp

    wget http://d3kbcqa49mib13.cloudfront.net/spark-2.0.0.tgz

    gunzip spark-2.0.0.tgz

    tar -xf spark-2.0.0.tar -C /opt/mapr/spark/

     

  2. Change the owner of these files to your YARN/MapR user; we'll use 'mapr' for these purposes:

    sudo chown -R mapr:mapr /opt/mapr/spark/spark-2.0.0

    su mapr

  3. Set the SPARK_HOME environment variable:

    export SPARK_HOME=/opt/mapr/spark/spark-2.0.0

  4. Create the initial Spark configuration file from the template. The full version, including Hive integration settings, is attached:
    • Copy template file:

      cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh

    • Add the following information to $SPARK_HOME/conf/spark-env.sh:

      export SPARK_HOME=/opt/mapr/spark/spark-2.0.0

      export HADOOP_HOME=/opt/mapr/hadoop/hadoop-2.7.0

      export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

      MAPR_HADOOP_CLASSPATH=`hadoop classpath`:/opt/mapr/lib/slf4j-log4j12-1.7.5.jar:

      MAPR_HADOOP_JNI_PATH=`hadoop jnipath`

      export SPARK_LIBRARY_PATH=$MAPR_HADOOP_JNI_PATH

      MAPR_SPARK_CLASSPATH="$MAPR_HADOOP_CLASSPATH"

      SPARK_DIST_CLASSPATH=$MAPR_SPARK_CLASSPATH

      # Security status

      source /opt/mapr/conf/env.sh

      if [ "$MAPR_SECURITY_STATUS" = "true" ]; then  

      SPARK_SUBMIT_OPTS="$SPARK_SUBMIT_OPTS -Dmapr_sec_enabled=true"

      fi
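Because every node needs the same settings, it may help to check each node's spark-env.sh before moving on. The following is a minimal sketch and not part of the MapR tooling: check_spark_env is a hypothetical helper that only scans the file's text for the variable assignments listed above. It does not source the file or validate the values.

```shell
# Hypothetical helper: report which expected settings are missing from a
# spark-env.sh file. It inspects the text only and never sources the file.
check_spark_env() {
  local env_file="$1"
  local missing=0
  local name
  for name in SPARK_HOME HADOOP_HOME HADOOP_CONF_DIR SPARK_LIBRARY_PATH SPARK_DIST_CLASSPATH; do
    # Match "NAME=" at the start of a line, optionally indented and
    # optionally preceded by "export".
    if ! grep -Eq "^[[:space:]]*(export[[:space:]]+)?${name}=" "$env_file"; then
      echo "missing: ${name}"
      missing=1
    fi
  done
  return "$missing"
}
```

One possible way to call it on a node:

    check_spark_env "$SPARK_HOME/conf/spark-env.sh" && echo "spark-env.sh looks complete"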

 

Build Spark with Hive Integration

For building other modes, such as standalone mode or one that is integrated with HBase, please visit our documentation page.

These steps should be performed on each node that you intend to use for Spark:

  1. Update the following parameters in the $SPARK_HOME/pom.xml file with the new values shown below:

    <curator.version>2.7.1</curator.version>

    <hive.group>org.apache.hive</hive.group>

    <hive.version>1.2.0-mapr-1605</hive.version>

    <hive.version.short>1.2.0</hive.version.short>

    <datanucleus-core.version>4.1.6</datanucleus-core.version>

  2. Add the following repository to the $SPARK_HOME/pom.xml file under the <repositories> tag:

    <repository>

    <id>mapr-repo</id>

    <name>MapR Repository</name>

    <url>http://repository.mapr.com/maven/</url>

    <releases>

      <enabled>true</enabled>

    </releases>

    <snapshots>

      <enabled>false</enabled>

    </snapshots>

    </repository>

  3. Run the following commands to change the Scala version and build Spark with Hive (must be run from $SPARK_HOME):

    cd $SPARK_HOME

    ./dev/change-scala-version.sh 2.10

    ./dev/make-distribution.sh --tgz -Phadoop-provided -Pyarn -Phive -Phive-thriftserver -Dscala-2.10

  4. Copy the hive-site.xml file from /opt/mapr/hive/hive-1.2/conf/ to $SPARK_HOME/conf/:

    cp /opt/mapr/hive/hive-1.2/conf/hive-site.xml $SPARK_HOME/conf/

  5. Add the following property to $SPARK_HOME/conf/hive-site.xml:

    <property>

    <name>datanucleus.schema.autoCreateTables</name>

    <value>true</value>

    </property>

  6. In the $SPARK_HOME/conf/spark-env.sh file, add the following configurations:

    MAPR_HIVE_CLASSPATH="$(find /opt/mapr/hive/hive-1.2/lib/* -name '*.jar' -not -name '*derby*' -printf '%p:' | sed 's/:$//')"

    SPARK_DIST_CLASSPATH=$SPARK_DIST_CLASSPATH:$MAPR_HIVE_CLASSPATH

  7. Create a Spark configuration file from the provided template:

    cp $SPARK_HOME/conf/spark-defaults.conf.template $SPARK_HOME/conf/spark-defaults.conf

  8. Add the following to it:

    spark.sql.hive.metastore.version 1.2.1

    spark.sql.hive.metastore.sharedPrefixes com.mysql.jdbc,org.postgresql,com.microsoft.sqlserver,oracle.jdbc,com.mapr.fs.shim.LibraryLoader,com.mapr.security.JNISecurity,com.mapr.fs.jni

    spark.yarn.dist.files /opt/mapr/hive/hive-1.2/lib/datanucleus-api-jdo-4.2.1.jar,/opt/mapr/hive/hive-1.2/lib/datanucleus-core-4.1.6.jar,/opt/mapr/hive/hive-1.2/lib/datanucleus-rdbms-4.1.7.jar,/opt/mapr/hive/hive-1.2/conf/hive-site.xml

    spark.executor.extraClassPath
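The pom.xml property changes from step 1 can also be scripted, which may reduce typos when the build is repeated on several nodes. This is a minimal sketch that assumes each property already exists in pom.xml exactly once and on a single line. set_pom_property is a hypothetical helper, and values containing | or & would need extra escaping.

```shell
# Hypothetical helper: replace the text of a single-line
# <name>...</name> element in a pom.xml file.
# Assumption: the element exists once and is written on one line.
set_pom_property() {
  local pom="$1"
  local name="$2"
  local value="$3"
  local pattern
  local tmp
  # Escape dots so "hive.version" is matched literally by sed.
  pattern="$(printf '%s' "$name" | sed 's/\./\\./g')"
  tmp="$(mktemp)" || return 1
  # Write to a temporary file first, then copy back so the original
  # file keeps its ownership and permissions.
  if sed "s|<${pattern}>[^<]*</${pattern}>|<${name}>${value}</${name}>|" "$pom" > "$tmp"; then
    cat "$tmp" > "$pom"
  fi
  rm -f "$tmp"
}
```

With that function defined, the Hive version from step 1 could be set like this:

    set_pom_property "$SPARK_HOME/pom.xml" hive.version 1.2.0-mapr-1605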

 

Test Your Spark Install

  1. Test Spark-on-YARN:

    $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.DFSReadWriteTest --master yarn --files ./README.md $SPARK_HOME/dist/examples/jars/spark-examples_2.10-2.0.0.jar ./README.md /user/mapr/

    [...]

    Success! Local Word Count (450) and DFS Word Count (450) agree.

    16/06/22 17:07:42 INFO util.ShutdownHookManager: Shutdown hook called

    16/06/22 17:07:42 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-889e02a6-9e23-4024-8531-ec801be6b8ef

    [mapr@~]$ hadoop fs -ls /user/mapr/

    Found 3 items

    drwxr-xr-x   - mapr mapr          0 2016-06-22 17:07 /user/mapr/.sparkStaging

    drwxr-xr-x   - mapr mapr          3 2016-06-22 17:07 /user/mapr/dfs_read_write_test

    drwxr-xr-x   - mapr mapr          1 2016-06-22 13:54 /user/mapr/tmp

  2. Test Hive integration:

    $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.sql.hive.SparkHiveExample --master yarn --deploy-mode client $SPARK_HOME/dist/examples/jars/spark-examples_2.10-2.0.0.jar

    [...]

    16/08/05 13:23:13 INFO codegen.CodeGenerator: Code generated in 10.885662 ms

    +---+------+---+------+

    |key| value|key| value|

    +---+------+---+------+

    | 86|val_86| 86|val_86|

    | 27|val_27| 27|val_27|

    | 98|val_98| 98|val_98|

    | 66|val_66| 66|val_66|

    | 37|val_37| 37|val_37|

    | 15|val_15| 15|val_15|

    | 82|val_82| 82|val_82|

    | 17|val_17| 17|val_17|

    | 57|val_57| 57|val_57|

    | 20|val_20| 20|val_20|

    | 92|val_92| 92|val_92|

    | 47|val_47| 47|val_47|

    | 72|val_72| 72|val_72|

    |  4| val_4|  4| val_4|

    | 35|val_35| 35|val_35|

    | 54|val_54| 54|val_54|

    | 51|val_51| 51|val_51|

    | 65|val_65| 65|val_65|

    | 83|val_83| 83|val_83|

    | 12|val_12| 12|val_12|

    +---+------+---+------+
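For the Spark-on-YARN test in step 1, the success line can be checked automatically rather than by reading through the log. This is a minimal sketch that assumes the spark-submit output was saved to a file. dfs_test_passed is a hypothetical helper, and the log path in the usage line is only an example.

```shell
# Hypothetical helper: succeed only if the saved output contains the
# success line printed by DFSReadWriteTest, as shown in step 1.
dfs_test_passed() {
  local log_file="$1"
  # In a basic regular expression the parentheses are literal, so this
  # matches e.g. "Local Word Count (450) and DFS Word Count (450) agree."
  grep -q 'Success! Local Word Count ([0-9]*) and DFS Word Count ([0-9]*) agree\.' "$log_file"
}
```

A possible way to use it, after redirecting the step 1 command's output to /tmp/dfs_test.log:

    dfs_test_passed /tmp/dfs_test.log && echo "Spark-on-YARN test passed"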
