
Dealing with slow operations in Spark?

Question asked by john.humphreys on Aug 15, 2017
Latest reply on Aug 15, 2017 by dmeng

I'm writing a Spark job that aggregates data and then writes it to OpenTSDB.

It turns out that the aggregation takes around 5 minutes, and if I just write the results to a MapR stream the whole job finishes in roughly that time.  Writing to OpenTSDB, however, takes much longer (12 minutes total), even though I have multiple TSDs behind a load balancer.
In the non-Spark world, I would create 30 threads (even on my 4-core box, since the work is I/O-bound) and have them all throw requests at OpenTSDB in parallel, and this would speed things up considerably.
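To be concrete, the non-Spark approach I'm describing looks roughly like this (a minimal sketch; `send_to_opentsdb` is a hypothetical stand-in for an HTTP POST to OpenTSDB's `/api/put` endpoint through the load balancer):

```python
from concurrent.futures import ThreadPoolExecutor

def send_to_opentsdb(batch):
    # Hypothetical helper: in the real app this is an HTTP POST of the
    # batch to OpenTSDB's /api/put endpoint behind the load balancer.
    return len(batch)

def write_all(batches, workers=30):
    # The work is I/O-bound, so far more threads than cores still helps:
    # each thread spends most of its time waiting on the TSD's response.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_to_opentsdb, batches))
```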
What do I do in Spark?  I don't want to give the job twice the cores just to work around a bottleneck at the very end.  I assume it's bad practice to explicitly multi-thread inside Spark though, right?  I also considered writing the results to a stream and having something other than Spark consume it on the other end, but that would use a lot of extra storage (the stream) and would require another server to run the writer app (which isn't ideal).
Is there a better way to handle this in Spark?
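The one direction I've thought of is multi-threading per partition via `foreachPartition`, so each executor task drives many concurrent requests instead of one.  Sketch only: the partition-level function below is Spark-independent, and `send_to_opentsdb` is the same hypothetical HTTP-POST helper as above, not a real API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def send_to_opentsdb(batch):
    # Hypothetical helper: HTTP POST of one batch of data points to
    # OpenTSDB's /api/put endpoint via the load balancer.
    return len(batch)

def write_partition(records, workers=30, batch_size=50):
    # Runs on the executor for one partition: chunk the records and post
    # the chunks from a local thread pool.  Because the work is I/O-bound,
    # 30 threads per task can hide the TSD round-trip latency.
    it = iter(records)
    batches = iter(lambda: list(islice(it, batch_size)), [])
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_to_opentsdb, batches))

# In the Spark job this would be wired up as:
# results_rdd.foreachPartition(write_partition)
```

Is this kind of per-partition threading actually acceptable, or does it fight the scheduler?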