I'd like to aggregate second- or minute-level batches into hourly batches without windowing, etc. (so I can use much less memory). Suppose I did the following:
- Define a DataFrame X that holds, say, the last 2 hours of data.
- For each streaming batch:
  - Do the normal processing.
  - Merge the results into X.
  - Drop rows from X that are older than 2 hours.
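To make the intent of the loop concrete, here is a plain-Python sketch of the accumulate-and-trim pattern, using a list of (timestamp, value) pairs as a stand-in for the DataFrame X. The names `process_batch`, `merge_batch`, and `WINDOW` are illustrative, not Spark API:

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=2)  # retention horizon for X

def process_batch(batch):
    """Stand-in for the normal per-batch processing step."""
    return batch

def merge_batch(x, batch, now):
    """Merge a processed batch into X, then drop rows older than WINDOW."""
    x = x + process_batch(batch)  # merge results into X
    cutoff = now - WINDOW
    return [(ts, v) for ts, v in x if ts >= cutoff]  # trim stale rows

# Example: three batches arriving, one of them already stale
now = datetime(2014, 1, 1, 12, 0)
x = []
x = merge_batch(x, [(now - timedelta(hours=3), "old")], now)
x = merge_batch(x, [(now - timedelta(minutes=30), "recent")], now)
x = merge_batch(x, [(now, "new")], now)
# → only the rows within the last 2 hours remain in x
```

In Spark terms each `merge_batch` call would be a `union` followed by a `filter` on X, which is exactly what raises the lineage question below.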
Would this cause X's lineage graph to grow to include every batch, even though the old ones have been functionally merged out of X?
I'm not sure how smart Spark's lineage system is in this case.