
Spark Streaming - Waiting for data in window?

Question asked by john.humphreys on Jun 27, 2017
Latest reply on Jul 5, 2017 by john.humphreys

When I started implementing our stream, I quickly ran into what is either a significant problem or a fundamental gap in my understanding.

 

  • Our data is time-oriented (host | metric | value | time-stamp).
  • We want to aggregate our data to the hour level, and we also aggregate multiple hosts into applications.
  • If I use an arbitrary one-hour window, I'll just be aggregating whatever arrived in the last hour. The window might hold 50 minutes of hour 16 and 10 minutes of hour 17, when what I really want is to aggregate all of hour 16 once its data has arrived.
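
To make the goal concrete, here is a minimal sketch in plain Python (not Spark code) of the aggregation I'm after: each record is assigned to the clock hour its own time-stamp falls in, regardless of when it arrives. The names `hour_bucket`, `aggregate_by_hour`, and `host_to_app` are hypothetical, and I'm assuming time-stamps are epoch seconds and that the aggregate is a simple mean.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600

def hour_bucket(epoch_seconds):
    """Return the start of the clock hour containing the time-stamp."""
    return epoch_seconds - (epoch_seconds % SECONDS_PER_HOUR)

def aggregate_by_hour(records, host_to_app):
    """Average values per (application, metric, hour start).

    records: iterable of (host, metric, value, epoch_seconds) tuples.
    host_to_app: hypothetical mapping from host name to application name.
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for host, metric, value, ts in records:
        key = (host_to_app[host], metric, hour_bucket(ts))
        totals[key][0] += value
        totals[key][1] += 1
    return {key: s / n for key, (s, n) in totals.items()}
```

With this grouping, a batch that contains 50 minutes of hour 16 and 10 minutes of hour 17 produces two separate partial results instead of one blended number, which is why the remaining question is how to know when hour 16 is complete.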


How do people deal with this?

  • Can you tell the window when to start (like on the hour)?
  • Do you make the window much bigger than needed and just discard data for an incomplete hour?
  • Is there a crafty way to 'repair' aggregations when more data shows up after the fact?
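
For the last two ideas, this is roughly what I picture, again as a plain-Python sketch rather than anything Spark provides. The class `HourlyAggregator` and the `allowed_lateness_s` parameter are made up for illustration. It keeps a running (sum, count) per hour so late records can be merged in, and it only reports an hour once the newest time-stamp seen is far enough past the end of that hour.

```python
SECONDS_PER_HOUR = 3600

class HourlyAggregator:
    """Hypothetical helper: mergeable hourly state plus a lateness cutoff."""

    def __init__(self, allowed_lateness_s=600):
        self.allowed_lateness_s = allowed_lateness_s
        self.state = {}          # (key, hour_start) -> (sum, count)
        self.max_event_time = 0  # newest time-stamp seen so far

    def add(self, key, value, ts):
        # Merge the record into its hour, even if that hour was seen before.
        hour = ts - (ts % SECONDS_PER_HOUR)
        s, n = self.state.get((key, hour), (0.0, 0))
        self.state[(key, hour)] = (s + value, n + 1)
        self.max_event_time = max(self.max_event_time, ts)

    def finalized(self):
        # An hour counts as complete once the cutoff has passed its end.
        cutoff = self.max_event_time - self.allowed_lateness_s
        return {
            k: s / n
            for k, (s, n) in self.state.items()
            if k[1] + SECONDS_PER_HOUR <= cutoff
        }
```

Because the state stores a sum and a count rather than a finished average, a record that shows up after the hour was reported simply updates the stored pair, and the next call to `finalized()` returns the repaired value. The open question is whether Spark Streaming has a built-in equivalent of this.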
