
Spark "Row length is 0" exception when writing to MapR-DB

Question asked by afzal.shaikh on May 31, 2016
Latest reply on Aug 3, 2016 by maprcommunity

I am ingesting 3000 rows from a flat file into MapR-DB using Spark Streaming, which works fine using Scala code.

 

Running the Python code against a small dataset of around 10 rows works fine as well, but once it goes past 3000 rows it fails on the line

datamap_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)

 

with the following exception:

***Printing last key/value pair in RDD*****
18000 total rows in RDD
16/05/31 12:22:53 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.IllegalArgumentException: Row length is 0
    at org.apache.hadoop.hbase.client.Mutation.checkRow(Mutation.java:549)
    at org.apache.hadoop.hbase.client.Put.<init>(Put.java:106)
    at org.apache.hadoop.hbase.client.Put.<init>(Put.java:64)
    at org.apache.hadoop.hbase.client.Put.<init>(Put.java:54)
    at org.apache.spark.examples.pythonconverters.StringListToPutConverter.convert(HBaseConverters.scala:67)
    at org.apache.spark.examples.pythonconverters.StringListToPutConverter.convert(HBaseConverters.scala:64)
    at org.apache.spark.api.python.PythonHadoopUtil$$anonfun$convertRDD$1.apply(PythonHadoopUtil.scala:188)
    at org.apache.spark.api.python.PythonHadoopUtil$$anonfun$convertRDD$1.apply(PythonHadoopUtil.scala:188)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1035)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1034)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1034)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1042)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1014)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
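
For context, the write follows the standard PySpark HBase output pattern (the converters from Spark's hbase_outputformat.py example, which is where StringListToPutConverter in the trace comes from). A minimal sketch of the setup, with the quorum host, table path, and column names as placeholders rather than my actual values:

    # Minimal sketch of the write path; host/table/column names are placeholders.
    from pyspark import SparkContext

    sc = SparkContext(appName="MapRDBIngest")

    # Converters shipped with the Spark examples jar.
    keyConv = "org.apache.spark.examples.pythonconverters.StringToImmutableBytesWritableConverter"
    valueConv = "org.apache.spark.examples.pythonconverters.StringListToPutConverter"

    conf = {
        "hbase.zookeeper.quorum": "localhost",          # placeholder
        "hbase.mapred.outputtable": "/tables/mytable",  # placeholder MapR-DB table path
        "mapreduce.outputformat.class": "org.apache.hadoop.hbase.mapreduce.TableOutputFormat",
        "mapreduce.job.output.key.class": "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "mapreduce.job.output.value.class": "org.apache.hadoop.io.Writable",
    }

    # Each element is (rowkey, [rowkey, column_family, qualifier, value]),
    # the shape StringListToPutConverter expects.
    datamap_rdd = sc.parallelize([("row1", ["row1", "cf1", "col1", "value1"])])
    datamap_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)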

 

 

EDIT: Adding a Stack Overflow link with more details. Any help on how to debug this problem would be appreciated.
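
Since the trace shows Mutation.checkRow throwing "Row length is 0", my working theory is that some record past the first few thousand rows ends up with an empty row key. A sketch of the check I am running, assuming the (rowkey, [rowkey, family, qualifier, value]) pair shape from the setup above:

    # Count and inspect records whose row key is empty; Mutation.checkRow
    # rejects zero-length row keys with exactly this exception.
    bad = datamap_rdd.filter(lambda kv: len(kv[0]) == 0)
    print("records with empty row key:", bad.count())
    print(bad.take(5))

    # If the empty-key rows turn out to be junk, drop them before writing.
    clean_rdd = datamap_rdd.filter(lambda kv: len(kv[0]) > 0)
    clean_rdd.saveAsNewAPIHadoopDataset(conf=conf, keyConverter=keyConv, valueConverter=valueConv)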
