I am planning to use Spark Streaming applications to read data from MapR Streams. I am considering the following design options:
1. Have multiple streams (thousands of them) and a single job that listens on all of them.
2. Have one stream whose topics carry different data, with the Spark application listening on a per-topic basis.
After some experiments, it looks like approach 1 has significant overhead (polling time, plus reading data from multiple streams using a direct stream) and may perform badly (apologies if I am missing something on this front). It also looks like data is read from the different streams sequentially. Even if we configure the concurrent_jobs parameter, it seems that the concurrency is limited by the number of cores, and beyond that limit the jobs run sequentially.
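To make the concern about sequential execution concrete, here is a rough back-of-the-envelope sketch. It assumes one job per stream per batch, that every job takes about the same time, and that concurrent_jobs corresponds to Spark's spark.streaming.concurrentJobs setting. The function names and the numbers are hypothetical and only for illustration; this is a simplified model, not measured behaviour.

```python
import math


def sequential_waves(num_streams: int, concurrent_jobs: int) -> int:
    """Rounds needed to run one job per stream when at most
    `concurrent_jobs` jobs can run at the same time (simplified model)."""
    if num_streams < 0 or concurrent_jobs < 1:
        raise ValueError("num_streams must be >= 0 and concurrent_jobs >= 1")
    return math.ceil(num_streams / concurrent_jobs)


def batch_processing_ms(num_streams: int, concurrent_jobs: int, ms_per_job: int) -> int:
    """Estimated time to finish one batch, assuming every job takes
    `ms_per_job` milliseconds (an assumption, not a measurement)."""
    return sequential_waves(num_streams, concurrent_jobs) * ms_per_job


if __name__ == "__main__":
    # Hypothetical numbers: 1000 streams, 200 ms per job.
    for jobs in (1, 8, 32):
        print(jobs, batch_processing_ms(1000, jobs, 200))
```

Under this model, the batch interval would have to be longer than the estimated processing time, otherwise batches would queue up.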
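Also, when I ran the second experiment, I found that I need to tune some parameters, such as the Kafka max rate per partition (spark.streaming.kafka.maxRatePerPartition, which caps the number of messages read per partition per second and therefore the size of each RDD), because I have to perform a group-by operation (in my case, I need to group the data on two keys).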
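As a rough sketch of why this setting matters for the group-by: with the direct stream, spark.streaming.kafka.maxRatePerPartition limits messages per second per partition, so the largest possible batch is that rate times the number of partitions times the batch interval. The helper name and the numbers below are hypothetical and only illustrate the arithmetic.

```python
def max_records_per_batch(max_rate_per_partition: int,
                          num_partitions: int,
                          batch_interval_seconds: int) -> int:
    """Upper bound on the records in one batch RDD when
    spark.streaming.kafka.maxRatePerPartition is set."""
    return max_rate_per_partition * num_partitions * batch_interval_seconds


# Hypothetical values, chosen only for illustration.
spark_conf = {"spark.streaming.kafka.maxRatePerPartition": "1000"}

if __name__ == "__main__":
    rate = int(spark_conf["spark.streaming.kafka.maxRatePerPartition"])
    # 1000 msg/s per partition, 10 partitions, 5 s batches
    print(max_records_per_batch(rate, num_partitions=10, batch_interval_seconds=5))
```

That upper bound is the most data the group-by would have to shuffle in a single batch, which is what I am trying to keep manageable.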
It would be very helpful if somebody could provide some pointers on the points above. Thanks in advance.