I'm writing a Spark job that aggregates data and then writes it to OpenTSDB.
It turns out that the aggregation takes around 5 minutes, and if I just write the results to a MapR stream the whole job finishes in roughly that time. Writing to OpenTSDB, however, takes much longer (about 12 minutes), even though I have multiple TSDs behind a load balancer.
In the non-Spark world, I would create 30 threads, even on my 4-core box, and have them all throw requests at OpenTSDB in parallel, and that would speed things up considerably.
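To make that concrete, outside Spark the pattern would look roughly like this sketch (the `send_to_opentsdb` function is a placeholder for an actual HTTP POST of a batch of data points to a TSD endpoint):

```python
from concurrent.futures import ThreadPoolExecutor

def send_to_opentsdb(batch):
    # Placeholder: in practice this would POST the batch to a TSD
    # behind the load balancer and return how many points were written.
    return len(batch)

def write_parallel(batches, workers=30):
    # Far more threads than cores is fine here: the work is I/O-bound,
    # so each thread spends most of its time waiting on the network.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_to_opentsdb, batches))
```

With 30 workers the requests overlap, so total wall-clock time is dominated by the slowest batches rather than the sum of all round trips.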
What do I do in Spark? I don't want to give the job twice the cores just because of a bottleneck at the very end. I assume it's bad to explicitly multi-thread in Spark, though, right? I considered writing the results to a stream and consuming it with something other than Spark, but that would use a lot of extra storage (for the stream) and would require another server to run the writer app, which isn't ideal.
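For clarity, by "explicitly multi-thread in Spark" I mean something like the following sketch inside each task, here in PySpark terms (`post_one` is a placeholder for an HTTP POST of a single data point):

```python
from concurrent.futures import ThreadPoolExecutor

def post_one(row):
    # Placeholder: in practice this would POST one data point
    # to OpenTSDB and return the row on success.
    return row

def write_partition(rows, workers=10):
    # Fan this partition's rows out over a local thread pool so the
    # network I/O overlaps, instead of one blocking request at a time.
    # This runs inside a single Spark task on one executor core.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(post_one, rows))

# In the job this would be invoked per partition, e.g.:
# results.foreachPartition(write_partition)
```

This is exactly the part I'm unsure about: whether spinning up threads inside a task like this fights with Spark's own scheduling.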
Is there a better way to handle this in Spark?