When I started implementing our stream, I quickly ran into what is either a serious problem or a fundamental gap in my understanding.
- Our data is time-oriented (host | metric | value | time-stamp).
- We want to aggregate our data to the hour level, and we also aggregate multiple hosts into applications.
- If I use an arbitrary one-hour window, I just aggregate whatever arrived in the last hour. That might be 50 minutes of hour 16 and 10 minutes of hour 17, when what I really want is to aggregate all of hour 16 once its data has arrived.
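
To make the mismatch concrete, here is a small plain-Python sketch of the two behaviours. It is only an illustration and not any streaming framework's API: the sample records, the dates, and the function names `arrival_window` and `hour_buckets` are all made up for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sample records in the (host, metric, value, time-stamp) shape.
records = [
    ("host-a", "cpu", 10.0, datetime(2024, 1, 1, 16, 15)),
    ("host-b", "cpu", 20.0, datetime(2024, 1, 1, 16, 55)),
    ("host-a", "cpu", 30.0, datetime(2024, 1, 1, 17, 5)),
]


def arrival_window(records, now, length=timedelta(hours=1)):
    """Arbitrary window: every record stamped within `length` before `now`."""
    return [r for r in records if now - length < r[3] <= now]


def hour_buckets(records):
    """Hour-aligned grouping: key each record by its time-stamp truncated to the hour."""
    buckets = defaultdict(list)
    for host, metric, value, ts in records:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(metric, hour)].append(value)
    return dict(buckets)


# A window evaluated at 17:10 covers 16:10 to 17:10, so it mixes two clock hours.
now = datetime(2024, 1, 1, 17, 10)
mixed = arrival_window(records, now)

# Grouping by truncated time-stamp keeps hour 16 and hour 17 apart.
by_hour = hour_buckets(records)
```

What I am after is the second behaviour, but produced by the stream itself rather than by a batch pass over everything.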
How do people deal with this?
- Can you tell the window when to start (like on the hour)?
- Do you make the window much bigger than needed and just discard data for an incomplete hour?
- Is there a crafty way to 'repair' aggregations when more data shows up after the fact?