How to deal with memory configurations in Spark 1.5.2

Question asked by Karthee on Jul 14, 2017
Latest reply on Jul 26, 2017

Hi There,


I am confused about the memory configurations in Spark 1.5.2 (Spark-on-YARN mode).


My environment settings are as below:


3-node MapR cluster - each node: 256 GB memory, 16 CPUs
Hadoop 2.7.0
Spark 1.5.2 - Spark-on-Yarn


Input data information:

The input is a 480 GB Parquet-format table from Hive. I'm using spark-sql to query it through the Hive context with Spark-on-YARN, but it's a lot slower than Hive, and I'm not sure what the right memory configuration for Spark is.

These are my configs:

--> spark-defaults.conf

spark.executor.memory                            64g
spark.logConf                                    true
spark.eventLog.dir                               maprfs:///apps/spark
spark.eventLog.enabled                           true
spark.serializer                                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory                              16g
spark.executor.instances                         70
spark.kryoserializer.buffer.max                  1024m
spark.yarn.executor.memoryOverhead               6144m

spark.sql.inMemoryColumnarStorage.compressed     true
spark.sql.inMemoryColumnarStorage.batchSize      100000
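
As a rough sanity check of the settings above, here is a small Python sketch of the memory arithmetic. On YARN, each executor is requested as a container of roughly spark.executor.memory plus spark.yarn.executor.memoryOverhead. The sketch assumes, optimistically, that all 256 GB on every node is available to YARN containers; in practice the NodeManager limit is lower, and the driver or application master also needs a container. The function names are made up for illustration.

```python
# Figures copied from the post; the helper names are illustrative only.
NODES = 3
NODE_MEMORY_GB = 256                # assumes all of it is usable by YARN
EXECUTOR_MEMORY_GB = 64             # spark.executor.memory 64g
MEMORY_OVERHEAD_GB = 6144 // 1024   # spark.yarn.executor.memoryOverhead 6144m
EXECUTOR_INSTANCES = 70             # spark.executor.instances 70


def container_size_gb(executor_memory_gb, overhead_gb):
    """Approximate size of one executor container: heap plus overhead."""
    return executor_memory_gb + overhead_gb


def max_executors(nodes, node_memory_gb, container_gb):
    """How many containers of this size fit, packing whole containers per node."""
    return nodes * (node_memory_gb // container_gb)


container_gb = container_size_gb(EXECUTOR_MEMORY_GB, MEMORY_OVERHEAD_GB)
requested_gb = EXECUTOR_INSTANCES * container_gb
available_gb = NODES * NODE_MEMORY_GB
fitting = max_executors(NODES, NODE_MEMORY_GB, container_gb)

print("container size per executor: %d GB" % container_gb)
print("requested in total:          %d GB" % requested_gb)
print("physical memory in cluster:  %d GB" % available_gb)
print("executors that can fit:      %d" % fitting)
```

With the numbers from this post, that comes to 70 GB per container and 4900 GB requested against 768 GB of physical memory, so at most 9 executors of this size could ever be granted, not 70.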



Also, how can I avoid GC overhead exceptions as well as Java heap space exceptions in the spark-sql CLI?

So I am using Apache Zeppelin with the Spark interpreter, but querying in Spark still takes much longer than in Hive.

I am not sure how to use "CACHE TABLE" in Zeppelin with the Spark interpreter.

This is the setting shown in the environment section of the Spark web UI:

spark.master           local[*]

This is supposed to be yarn-cluster, right? If it's wrong, how do I change spark.master?
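
To make the spark.master question concrete, here is a small Python sketch that reads text in spark-defaults.conf style and reports whether the job would run on YARN. In Spark 1.x the YARN master values are yarn-client and yarn-cluster. The `yarn-client` value in the sample is an assumption about what the setting would need to be, on the understanding that interactive front ends such as the spark-sql CLI and Zeppelin keep the driver in the local process. The parser is simplified (whitespace-separated key and value only), and its function names are hypothetical.

```python
# Sample text in spark-defaults.conf style. "yarn-client" is an assumed value,
# not something taken from the original configuration.
SAMPLE_CONF = """
# master to use when none is given on the command line
spark.master              yarn-client
spark.executor.memory     64g
"""


def parse_spark_defaults(text):
    """Parse 'key value' lines, skipping blank lines and # comments."""
    conf = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def runs_on_yarn(conf):
    """True if spark.master names a YARN mode; an unset master counts as local[*]."""
    return conf.get("spark.master", "local[*]").startswith("yarn")


conf = parse_spark_defaults(SAMPLE_CONF)
print("spark.master = %s" % conf.get("spark.master", "local[*]"))
print("runs on YARN: %s" % runs_on_yarn(conf))
```

A web UI that still shows local[*] would suggest that the process starting the SparkContext is not picking this value up. For Zeppelin, I believe the master comes from the Spark interpreter's own `master` setting (or MASTER in zeppelin-env.sh) rather than from spark-defaults.conf alone.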


Your assistance would be really appreciated!