
How to prevent a Map/Reduce job from picking up an incomplete input file that is still being copied through NFS?

Question asked by volans on Oct 28, 2013
Latest reply on Oct 28, 2013 by volans
Map/Reduce job processes only a portion of an input file that is still being copied through NFS

I have a cron job that runs a Map/Reduce job on an hourly basis. The Map/Reduce job picks up its input files from an HDFS directory.

The problem is that an upstream, non-Map/Reduce application asynchronously copies a file (a couple of gigabytes in size) into that HDFS directory through an NFS mount. We have hit cases where the Map/Reduce job starts while the application is still copying the file into the input directory. As a result, the Map/Reduce job processes only a portion of the file. In the end, both the Map/Reduce job and the copy complete successfully, but the Map/Reduce job never sees the full file when it runs.

The ideal solution would be to make the two jobs synchronous, so that one finishes before the next starts. But as in many software pipelines, this one is a collection of many systems, and making it synchronous is not possible.

We have thought about checking the timestamp of the input file, so that the Map/Reduce job starts only if the file has not been updated within the current minute.
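To make the timestamp idea concrete, here is a rough sketch of the check we have in mind, run from the cron wrapper before the job is submitted. It assumes the wrapper can see the input directory through the same NFS mount and that the modification time visible there keeps advancing while the copy is in progress. The function names and the 60-second threshold are hypothetical, and this is only a heuristic: a copy that stalls for longer than the threshold would still look finished.

```python
import os
import time


def is_quiescent(path, min_idle_seconds=60.0, now=None):
    """Return True if `path` was last modified at least `min_idle_seconds` ago.

    `now` can be passed explicitly (seconds since the epoch) to make the
    check reproducible; otherwise the current wall-clock time is used.
    """
    if now is None:
        now = time.time()
    mtime = os.stat(path).st_mtime
    return (now - mtime) >= min_idle_seconds


def ready_inputs(directory, min_idle_seconds=60.0, now=None):
    """Return the regular files in `directory` that look idle, sorted by name."""
    ready = []
    for name in sorted(os.listdir(directory)):
        full = os.path.join(directory, name)
        if os.path.isfile(full) and is_quiescent(full, min_idle_seconds, now):
            ready.append(full)
    return ready
```

The wrapper would then submit the Map/Reduce job only for the files returned by `ready_inputs`, or skip the hourly run entirely if any file in the directory is still changing.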

Are there any other suggestions for what we can do at the file system level? Any input is appreciated. Thanks.