
Spark Streaming - Takes more than max records (MapR Streams)

Question asked by john.humphreys on Sep 6, 2017
Latest reply on Oct 10, 2017 by john.humphreys

I've been running a Spark Streaming job for 5 days or so with varying load.

 

I've set the batch interval to 7 minutes and set the Kafka rate-per-partition and number of partitions so that they equate to 27.72 million records per 7-minute batch.
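
For context, the setup looks roughly like the sketch below. The topic name, partition count (66), and per-partition rate (1,000 records/sec) are illustrative values chosen so the arithmetic matches: 1,000 records/sec x 66 partitions x 420 sec = 27,720,000 records per batch. The real job reads from MapR Streams, but the rate cap works the same way as with the standard Kafka direct stream.

```scala
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010._

object RateCapSketch {
  def main(args: Array[String]): Unit = {
    // Illustrative numbers: 1,000 records/sec/partition * 66 partitions * 420 sec batch
    // = 27,720,000 records per batch, which is the cap described above.
    val conf = new SparkConf()
      .setAppName("rate-cap-sketch")
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")

    // 7-minute batch interval, as in the job described above.
    val ssc = new StreamingContext(conf, Minutes(7))

    val kafkaParams = Map[String, Object](
      ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "broker:9092", // placeholder; not needed for MapR Streams paths
      ConsumerConfig.GROUP_ID_CONFIG -> "rate-cap-sketch",
      ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer],
      ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> classOf[StringDeserializer]
    )

    // Hypothetical topic name; with MapR Streams this would be a "/stream:topic" path.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams)
    )

    // With the cap honored, each batch should contain at most ~27.72M records.
    stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```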

 

I saw this work for literally days (the load averages 1-5 million records per batch and spikes up to 27.72 million every so often). Then, after 5 days, it suddenly spiked to 40 million records in one batch and disrupted the job.

 

Are there known conditions under which "spark.streaming.kafka.maxRatePerPartition" will fail to be honored by Spark Streaming + MapR Streams? I don't see any OOM errors or anything else that could have triggered the change.

 

The job seems to be catching up and recovering now, but a slightly bigger spike could easily have killed it, which is why I'm concerned.

 

Recent batches:

Batch Time            Records            Scheduling Delay   Processing Time   Total Delay   Output Ops (Succeeded/Total)
2017/09/06 01:34:00   27720000 records   17 min             5.2 min           23 min        1/1
2017/09/06 01:27:00   27720000 records   13 min             5.3 min           18 min        1/1
2017/09/06 01:20:00   30923257 records   9.4 min            4.6 min           14 min        1/1
2017/09/06 01:13:00   40422828 records   5.2 min            4.5 min           9.7 min       1/1
2017/09/06 01:06:00   27720000 records   2 ms               6.1 min           6.1 min       1/1
2017/09/06 00:59:00   9344608 records    0 ms               3.1 min           3.1 min       1/1
2017/09/06 00:52:00   15178678 records   0 ms               3.8 min           3.8 min       1/1
