I eventually have to process the data with Spark Streaming (I'm happy to have intermediate steps in other technologies), but the source data is being written sporadically into the following directory structure:
If you're using an HDFS file source in Spark Streaming, you can only monitor a single directory (sub-directories are not processed). Files also have to appear in that directory atomically (e.g. via a rename once the write is finished), which doesn't fit this layout well.
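One workaround for both constraints is a small "mover" process that walks the nested tree, decides when a file is complete, and atomically renames it into the single flat directory that Spark Streaming monitors. The sketch below is a minimal, illustrative Python version under the assumption that "not modified for N seconds" means the writer is done; in practice you'd substitute whatever completion signal your writers provide (a `.done` marker, a temporary-name convention, etc.), and all names here are hypothetical:

```python
import os
import shutil
import time

def collect_completed_files(source_root, target_dir, settle_seconds=60):
    """Move files that have been idle for `settle_seconds` out of a
    nested source tree into one flat directory.

    The final os.rename within a single filesystem is atomic, which is
    the guarantee a file-based streaming source needs: it never sees a
    half-written file.
    """
    now = time.time()
    moved = []
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            # Heuristic completion check: skip files touched recently.
            if now - os.path.getmtime(src) < settle_seconds:
                continue
            # Encode the sub-directory path into the file name so that
            # flattening the tree cannot cause name collisions.
            flat_name = os.path.relpath(src, source_root).replace(os.sep, "_")
            tmp = os.path.join(target_dir, "." + flat_name)
            dst = os.path.join(target_dir, flat_name)
            shutil.move(src, tmp)   # may copy if crossing filesystems
            os.rename(tmp, dst)     # atomic: the consumer sees a whole file
            moved.append(dst)
    return moved
```

Running this on a schedule (cron, or a loop) keeps the monitored directory fed with only complete, atomically-appearing files; the same pattern translates to HDFS using a rename within the same volume.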
The volume I have to pick up is large (billions of records across hundreds of files per day). Is there any general convention for detecting completed files and getting them into Spark Streaming?
Alternatively, is there any standard way to monitor files in this layout and write them to MapR Streams? I feel like this problem must have been solved before.