I am uploading a large amount of XML data in part-file format (the output of MapReduce/Hive jobs) to a MarkLogic database using a MapR MapReduce job. Due to cluster or network issues, a small number of records (5, 10, 50, or 100 out of 20 million) sometimes fail to upload. When that happens, I have to upload all 20 million records again, which is very time-consuming: each re-run costs us another 2 to 3 hours.
I want to identify the specific split/part files that the missing records came from, so that I can re-ingest only those part files instead of all 20 to 30 million records. How can I find those specific part files?
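To make the goal concrete, here is a rough sketch of the kind of check I have in mind. It is not a working solution for my setup. It assumes that every record has a unique ID that can be matched against what is in the database, that the IDs in each part file can be listed, and that the IDs actually loaded can be exported to a plain list. The function name `find_incomplete_parts` and the sample data are hypothetical, and nothing here calls a MarkLogic or Hadoop API.

```python
from collections import defaultdict


def find_incomplete_parts(expected_ids_by_part, loaded_ids):
    """Return {part_file: sorted IDs from that part file missing in the database}.

    expected_ids_by_part: dict mapping a part-file name to the record IDs it contains
                          (assumed to be extractable from the part files).
    loaded_ids:           iterable of record IDs actually present in the database
                          (assumed to be exportable as a plain list).
    """
    loaded = set(loaded_ids)  # set gives O(1) membership checks
    missing = defaultdict(list)
    for part, ids in expected_ids_by_part.items():
        for rec_id in ids:
            if rec_id not in loaded:
                missing[part].append(rec_id)
    # Only part files with at least one missing record appear in the result.
    return {part: sorted(ids) for part, ids in missing.items()}


# Hypothetical sample data standing in for real part files and a real export.
expected = {
    "part-00000": ["a1", "a2", "a3"],
    "part-00001": ["b1", "b2"],
    "part-00002": ["c1"],
}
loaded = ["a1", "a2", "a3", "b1", "c1"]

incomplete = find_incomplete_parts(expected, loaded)
print(incomplete)  # {'part-00001': ['b2']}
```

With a result like this, only the part files listed as keys would need to be re-ingested. For 20 million records the ID lists would probably have to be compared in a distributed job or after sorting, rather than in memory as shown here.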
Could you please help me with this?
Thanks a lot for your help.