
Monitoring completed files in MapR-FS +/- Spark

Question asked by john.humphreys on Jun 27, 2017
Latest reply on Jun 27, 2017 by rupal

I eventually have to process the data with Spark Streaming (I'm happy to have intermediate steps in other technologies), but the source data is being written sporadically into the following directory structure:

YEAR/
    DAY_OF_YEAR/
        HOUR_OF_DAY/
            FILE_1/
                PART 1
                PART 2
                _SUCCESS
            FILE_2/

            ...

If you use an HDFS file source in Spark Streaming, you can only monitor a single directory (sub-directories are not scanned). Files also have to appear in that directory atomically, which isn't great for this layout.
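For reference, the stock source I mean looks roughly like this (a minimal sketch; the maprfs path is purely illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("FileSourceSketch")
    val ssc = new StreamingContext(conf, Seconds(60))

    // Watches exactly one directory: files under YEAR/DAY_OF_YEAR/HOUR_OF_DAY
    // are never seen, and a file is only picked up if it appears atomically.
    val lines = ssc.textFileStream("maprfs:///data/incoming")
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()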


The amount of data I have to pick up and use is huge (billions of records across hundreds of files a day). Is there any general convention for detecting completed files and getting them into Spark Streaming?
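The sort of convention I'm imagining is a small poller over the Hadoop FileSystem API that treats an hour directory as complete once its _SUCCESS marker lands, then hands the directory off for processing. This is just a sketch of my own; the function and names are placeholders, not anything standard:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.mutable

    // Walk the tree recursively and return directories whose _SUCCESS marker
    // has appeared since the last poll; `seen` tracks what was already handed off.
    def newlyCompletedDirs(fs: FileSystem, root: Path,
                           seen: mutable.Set[String]): Seq[Path] = {
      val files = fs.listFiles(root, true) // recursive listing
      val completed = mutable.Buffer[Path]()
      while (files.hasNext) {
        val status = files.next()
        if (status.getPath.getName == "_SUCCESS") {
          val dir = status.getPath.getParent
          if (seen.add(dir.toString)) completed += dir
        }
      }
      completed.toSeq
    }

Each returned directory could then be read as a normal batch (e.g. sc.textFile(dir + "/part-*")) or have its path published somewhere a streaming job can consume it, but I don't know if there's an established pattern for this.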


Alternatively, is there a standard way to monitor files in this layout and write them to MapR Streams? I feel like this problem must have been solved before.
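If writing to streams is the way to go, I assume it would use the standard Kafka producer API that MapR Streams exposes, with each completed file's records pushed into a stream:topic. A rough sketch, where the /ingest/raw:hourly name and the part file name are placeholders:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
    import scala.io.Source

    val props = new Properties()
    props.put("key.serializer",
              "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
              "org.apache.kafka.common.serialization.StringSerializer")

    // Push each record of a completed part file into a MapR Streams topic;
    // the stream:topic name here is purely illustrative.
    val producer = new KafkaProducer[String, String](props)
    for (line <- Source.fromFile("part-00000").getLines())
      producer.send(new ProducerRecord[String, String]("/ingest/raw:hourly", line))
    producer.close()

Spark could then consume that topic with its Kafka direct stream, but again, I'd expect this to be a solved problem already.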
