Spark streaming - time stamp aggregation?

Question asked by john.humphreys on Nov 2, 2017
I've read various things about this, both in the Spark docs and on forums/Stack Overflow, but I'm still a little confused.


Problem Constraints

  • Stream of data coming in from Kafka.
  • Various data-points are provided at different sampling intervals, e.g.:
    • CPU @ 30 times/minute (every 2 seconds).
    • Memory @ 2 times/minute (every 30 seconds).
  • Need to aggregate seconds to minute level in the Spark streaming code.
  • Don't want to double-count anything.
  • Can assume that all data for a minute comes within a minute, but it might be data from last week (the time-stamp coming with the data is important).
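
To make the constraints above concrete, here is a rough sketch in plain Python (not Spark) of the result I'm after. The names are made up for illustration, and two things are my own assumptions: that a (metric, timestamp) pair uniquely identifies a sample, and that the minute-level aggregate is a mean. The key point is that bucketing uses the time-stamp carried by the data, not the arrival time.

```python
from collections import defaultdict

def aggregate_to_minutes(samples):
    """Aggregate (metric, epoch_seconds, value) samples to minute level.

    Returns {(metric, minute_start_epoch): mean_value}.
    Assumption: (metric, epoch_seconds) uniquely identifies a sample,
    so a repeated pair is treated as a duplicate and skipped.
    Assumption: the per-minute aggregate is a mean.
    """
    seen = set()
    buckets = defaultdict(list)
    for metric, ts, value in samples:
        key = (metric, ts)
        if key in seen:
            continue  # avoid double-counting a re-delivered sample
        seen.add(key)
        minute_start = ts - ts % 60  # bucket by event time, not arrival time
        buckets[(metric, minute_start)].append(value)
    return {k: sum(v) / len(v) for k, v in buckets.items()}

samples = [
    ("cpu", 120, 10.0),
    ("cpu", 122, 20.0),
    ("cpu", 122, 20.0),  # duplicate delivery
    ("cpu", 180, 30.0),
    ("mem", 150, 4.0),
]
result = aggregate_to_minutes(samples)
```

What I'm looking for is the Spark streaming equivalent of this, without having to hold everything in memory myself.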


Which Spark streaming technique is appropriate/helpful for this?  I've seen the SQL window function, the actual windowing functions in Spark Streaming, etc., but nothing has fully clicked.  How do I ensure I have all the values for a minute?  I guess I could keep a window and assume that if I have data from minutes 3 and 5, then minute 4 must be complete, but that sounds painful for what must be such a common problem.
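
For reference, the "minute 3 and 5 means 4 is done" idea would look roughly like this in plain Python (again not Spark, and the function name is hypothetical). It relies on my assumption that all data for a minute arrives within a minute, so any minute older than the newest one seen can be treated as closed.

```python
def completed_minutes(minute_indices):
    """Return the minutes that can be treated as complete.

    minute_indices: minute numbers (e.g. epoch_seconds // 60) seen so far.
    Heuristic (an assumption, not a guarantee): once a sample from a later
    minute has arrived, every earlier minute back to the oldest one seen
    is considered complete, including minutes that produced no data.
    """
    if not minute_indices:
        return []
    oldest = min(minute_indices)
    newest = max(minute_indices)
    # The newest minute may still be receiving data, so it is excluded.
    return list(range(oldest, newest))

done = completed_minutes([3, 5, 5, 3])
```

Having to track this by hand per metric is the part that feels painful.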