I've read various things about this both in Spark docs and in forums/stack-overflow, but I'm still a little confused.
- Stream of data coming in from Kafka.
- Various data-points are provided at different sampling intervals, e.g.:
- CPU @ 30 times/minute (every 2 seconds).
- Memory @ 2 times/minute (every 30 seconds).
- Need to aggregate seconds to minute level in the Spark streaming code.
- Don't want to double-count anything.
- Can assume that all the data for a given minute arrives within a minute of real time, but it might be data from last week (the event timestamp that comes with the data is what matters).
Which Spark streaming technique is appropriate/helpful for this? I've seen the SQL window function, the event-time window functions in Spark Structured Streaming, etc., but nothing has fully clicked. How do I ensure I have all the values for a given minute? I suppose I could keep a window open and reason that if I have data from minutes 3 and 5, then minute 4 should be complete; but that sounds painful for what must be such a common problem.