
Spark streaming - time stamp aggregation?

Question asked by john.humphreys on Nov 2, 2017
Latest reply on Nov 14, 2017 by john.humphreys

I've read various things about this both in Spark docs and in forums/stack-overflow, but I'm still a little confused.


Problem Constraints

  • Stream of data coming in from Kafka.
  • Various data-points are provided at different sampling intervals, e.g.:
    • CPU @ 30 times/minute (every 2 seconds).
    • Memory @ 2 times/minute (every 30 seconds).
  • Need to aggregate seconds to minute level in the Spark streaming code.
  • Don't want to double-count anything.
  • Can assume that all data for a minute comes within a minute, but it might be data from last week (the time-stamp coming with the data is important).
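To make the constraints concrete, here's a plain-Python sketch of the batch version of what I'm after: floor each event's own timestamp to its minute and average per (metric, minute). The metric names and sample values are made up for illustration; the point is that the bucket comes from the event timestamp, not arrival time.

```python
from collections import defaultdict

def minute_bucket(ts: float) -> int:
    """Floor an epoch-seconds timestamp to the start of its minute."""
    return int(ts) - int(ts) % 60

# Hypothetical samples: (metric, event epoch-seconds, value).
# CPU arrives every 2 seconds, memory every 30 seconds.
samples = [
    ("cpu", 60.0, 10.0),
    ("cpu", 62.0, 20.0),
    ("mem", 60.0, 512.0),
    ("mem", 90.0, 512.0),
    ("cpu", 120.0, 30.0),   # falls into the next minute bucket
]

# Group values by (metric, minute-start) using the event timestamp.
buckets = defaultdict(list)
for metric, ts, value in samples:
    buckets[(metric, minute_bucket(ts))].append(value)

# One aggregate per metric per minute -- nothing counted twice.
minute_avgs = {k: sum(v) / len(v) for k, v in buckets.items()}
```

The hard part, of course, is knowing in a *streaming* job when a given minute's bucket is complete, which is what the question below is about.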


Which Spark streaming technique is appropriate/helpful for this?  I've seen the SQL window functions, the real windowing operations in Spark Streaming, etc., but nothing has fully clicked.  How do I ensure I have all the values for a given minute?  I suppose I could keep a window open and reason that if I have data from minutes 3 and 5, then minute 4 should be complete; but that sounds painful for what must be a very common problem.
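For what it's worth, here is a toy plain-Python sketch of the behavior I *think* Spark Structured Streaming's `withWatermark` + `window` combination provides: buckets are keyed by event time, and a minute is only emitted once the watermark (max event time seen, minus an allowed-lateness delay) has passed the end of that minute. The class name, the 60-second delay, and the averaging are my own assumptions, not Spark's API.

```python
from collections import defaultdict

class MinuteAggregator:
    """Toy event-time aggregator.

    Values are bucketed by (key, minute-start of the event timestamp).
    A bucket is emitted only once the watermark -- the maximum event
    time seen so far minus an allowed-lateness delay -- has passed the
    end of that minute, so late data within the delay is still counted
    exactly once.
    """

    def __init__(self, delay_seconds: int = 60):
        self.delay = delay_seconds
        self.max_event_time = 0.0
        self.open = defaultdict(list)  # (key, minute_start) -> values

    def add(self, key, event_ts, value):
        """Ingest one event; return any minute buckets that just closed."""
        self.max_event_time = max(self.max_event_time, event_ts)
        minute_start = int(event_ts) - int(event_ts) % 60
        self.open[(key, minute_start)].append(value)
        return self._flush()

    def _flush(self):
        watermark = self.max_event_time - self.delay
        closed = {}
        # A minute [start, start+60) is complete once watermark >= start+60.
        for k in [k for k in self.open if k[1] + 60 <= watermark]:
            vals = self.open.pop(k)
            closed[k] = sum(vals) / len(vals)
        return closed
```

With a 60-second delay, the minute starting at t=60 only closes once an event with timestamp >= 180 arrives, so data "from last week" would need a correspondingly larger delay.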