AnsweredAssumed Answered

Spark - Data Frame Lineage After Repeated Merges

Question asked by john.humphreys on Jul 18, 2017
Latest reply on Jul 31, 2017 by john.humphreys

I'd like to aggregate second/minute level batches into hourly batches without windowing/etc (so I can use much less memory).  If I did the following:


  • Define data frame X to hold, say, the last 2 hours of data.
  • For each streaming batch:
    • Do normal processing
    • Merge results into X.
    • Merge results out of X that are older than 2 hours.


Or would this cause X's lineage graph to include all batches, even though all the old ones were functionally merged out of X?


I'm not sure how smart Spark's lineage system is in this case.


Thank you!

-John Humphreys