I'm migrating a batch system to streaming, and I'm unsure which kinds of aggregations are suitable for Spark Streaming. We collect second-level metrics, but we only want to keep them as minute-, hour-, and day-level aggregations.
I understand that I can trivially micro-batch on, say, 1 minute intervals in Spark.
- Is it sensible/intelligent to attempt to do the hour/day aggregations in the stream, or do people typically save the data to file and process it in a batch for this?
- I understand Spark can calculate aggregations in a way that takes little memory by using deltas (something about providing the inverse of an operation, which lets it be extra efficient). Does this mean you can literally handle a day's worth of data, because the storage used by the function is constant/small (just the current element and the new element being merged into it)?
For more information on the second question, I'm referring to the following from Spark Streaming - Spark 2.1.0 Documentation (which I believe was also in my Spark book from 1.3, though it's been a while).
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and “inverse reducing” the old data that leaves the window. An example would be that of “adding” and “subtracting” counts of keys as the window slides. However, it is applicable only to “invertible reduce functions”, that is, those reduce functions which have a corresponding “inverse reduce” function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
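To make the "reduce new data, inverse reduce old data" idea from that passage concrete, here is a minimal plain-Python sketch. It is not Spark code and not how Spark implements it internally; the class name `IncrementalWindowReduce` and the `host1.cpu` key are hypothetical, and it assumes one slide step per incoming batch.

```python
from collections import deque


class IncrementalWindowReduce:
    """Hypothetical helper: per-key reduce over the last `window_length` batches."""

    def __init__(self, window_length, func, inv_func):
        self.window_length = window_length
        self.func = func          # e.g. addition
        self.inv_func = inv_func  # its inverse, e.g. subtraction
        self.batches = deque()    # per-key reduced value of each batch still in the window
        self.totals = {}          # running reduce value per key for the current window

    def add_batch(self, records):
        # Reduce the incoming batch per key.
        batch = {}
        for key, value in records:
            batch[key] = self.func(batch[key], value) if key in batch else value

        # "Reduce" the new data that enters the window.
        for key, value in batch.items():
            if key in self.totals:
                self.totals[key] = self.func(self.totals[key], value)
            else:
                self.totals[key] = value
        self.batches.append(batch)

        # "Inverse reduce" the old data that leaves the window.
        if len(self.batches) > self.window_length:
            expired = self.batches.popleft()
            for key, value in expired.items():
                self.totals[key] = self.inv_func(self.totals[key], value)

        return dict(self.totals)


window = IncrementalWindowReduce(2, lambda a, b: a + b, lambda a, b: a - b)
r1 = window.add_batch([("host1.cpu", 1), ("host1.cpu", 2)])
r2 = window.add_batch([("host1.cpu", 4)])
r3 = window.add_batch([("host1.cpu", 5)])  # the first batch (sum 3) leaves the window
```

Note that even in this sketch, each batch's per-key values are retained until they leave the window, so the state is more than just the current element and the new element.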
For context, we're going to be dealing with 18.7 billion data points a day: ~325 metrics for ~40,000 hosts, with 1,440 readings a day each (1 per minute). Ignoring the time dimension, that would be around 13 million host/metric combinations to keep track of simultaneously.
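The arithmetic behind those figures, as a quick check (the variable names are just for illustration, and the counts are the approximate ones given above):

```python
metrics = 325               # approximate metrics per host
hosts = 40_000              # approximate number of hosts
readings_per_day = 1_440    # one reading per minute

# Distinct host/metric combinations, ignoring the time dimension.
series = metrics * hosts

# Total data points per day and per hour.
points_per_day = series * readings_per_day
points_per_hour = series * 60

print(series)           # 13000000
print(points_per_day)   # 18720000000
print(points_per_hour)  # 780000000
```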