Large ingest to MapR-DB via Pig often fails

Question asked by cleranceroberts on Jun 27, 2015
Latest reply on Jun 27, 2015 by cleranceroberts
I have a MapR cluster running on Amazon EC2 with very large instance types (d2.8xlarge).

I have a simple Pig script that reads CSV files from Amazon S3 and stores the data in a MapR-DB table.

     REGISTER /opt/mapr/lib/mapr-hbase-4.1.0-mapr.jar
     SET fs.s3.awsAccessKeyId '{}'
     SET fs.s3.awsSecretAccessKey '{}'
     A = LOAD 's3://mydata' USING PigStorage(',') AS (col1:chararray, col2:chararray, rk:chararray, col3:int, col5:int, col4:int);
     B = FOREACH A GENERATE rk, TOMAP(col1, '-1');
     STORE B INTO '/user/mapr/mytable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('col2:*','-loadKey true -noWAL true');

This script runs periodically and loads many millions of CSV lines into my MapR-DB table. Many rows grow quite large, accumulating many different column qualifiers in the col2 column family.
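
To make the row layout concrete, here is a rough Python sketch of how I understand the script to fold CSV lines into table rows. It is only an illustration, not part of the job: the sample lines and the fold_rows helper are made up, and it assumes HBaseStorage writes each map key as a qualifier in the col2 family with rk as the row key. Note that the CSV field named col2 is not used by the FOREACH.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample lines, in the script's column order:
# col1, col2, rk, col3, col5, col4
SAMPLE_CSV = """\
a,x,row1,1,2,3
b,x,row1,4,5,6
c,y,row2,7,8,9
a,z,row1,0,0,0
"""


def fold_rows(csv_text):
    """Mimic GENERATE rk, TOMAP(col1, '-1') stored with 'col2:*'."""
    rows = defaultdict(dict)
    for col1, _col2, rk, _col3, _col5, _col4 in csv.reader(io.StringIO(csv_text)):
        # Each distinct col1 value becomes one qualifier in family 'col2';
        # the cell value is always the literal string '-1'.
        rows[rk]["col2:" + col1] = "-1"
    return dict(rows)


table = fold_rows(SAMPLE_CSV)
for rk, cells in sorted(table.items()):
    # Print the row key, its cell count, and its qualifiers.
    print(rk, len(cells), sorted(cells))
```

Every new col1 value seen for a given rk adds another cell to that row, which is why rows keep getting wider across runs, while a repeated (rk, col1) pair just rewrites the same cell.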

This script works well at first, but after a few days of periodic runs (4x/hr), jobs begin failing or taking a very long time to complete. The logs for the failing map tasks aren't particularly useful:

    2015-06-27 13:35:53,650 INFO mapred.Task [communication thread]: Communication exception: org.apache.hadoop.ipc.RemoteException( JvmValidate Failed. Ignoring request from task: attempt_201506261047_0058_m_000002_5002, with JvmId: jvm_201506261047_0058_m_1900709183
        at org.apache.hadoop.mapred.TaskTracker.validateJVM(
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(
        at java.lang.reflect.Method.invoke(
        at org.apache.hadoop.ipc.WritableRpcEngine$Server$
        at org.apache.hadoop.ipc.RPC$
        at org.apache.hadoop.ipc.Server$Handler$
        at org.apache.hadoop.ipc.Server$Handler$
        at Method)
        at org.apache.hadoop.ipc.Server$

        at org.apache.hadoop.ipc.WritableRpcEngine$Invoker.invoke(
        at com.sun.proxy.$ Source)
        at org.apache.hadoop.mapred.Task$

    2015-06-27 13:35:56,657 WARN mapred.Task [communication thread]: Parent died.  Exiting attempt_201506261047_0058_m_000002_5002

What is happening here, and why do my jobs keep failing or taking many hours to complete when they should take minutes? Is there a better way to do this, or are there other optimizations I should consider?

My cluster is a MapR 4.1 Community Edition cluster with four d2.8xlarge nodes (36 cores, 244 GB RAM, 24 x 2 TB drives, 10 Gigabit Ethernet).