We have been using text file formats (CSV and TSV) for a long time. In many places we also have hundreds or thousands of small CSV files, which used to hurt our MapReduce job performance because of the large number of mappers they spawn. We now use CombineFileInputFormat to work around the read performance issue, but that doesn't solve the problem for other frameworks like Spark, Hive, and Drill. So I am looking into ways to improve our input datasets in terms of both storage space and file count.
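To make the small-files overhead concrete, here is a rough back-of-the-envelope sketch (my own illustration, not any framework's actual split logic) comparing mapper counts for many small files versus combined splits, assuming a 256 MB max split size:

```python
# Rough illustration (assumed numbers): with the default per-file input
# format, each small file gets at least one mapper; CombineFileInputFormat
# packs many small files into a single split up to the max split size.
import math

SPLIT_SIZE_MB = 256  # assumed max split size, matching a 256 MB chunk

def mappers_per_file(file_sizes_mb):
    # One mapper per file, plus extra mappers for files larger than a split.
    return sum(math.ceil(s / SPLIT_SIZE_MB) for s in file_sizes_mb)

def mappers_combined(file_sizes_mb):
    # Idealized combining: small files are packed into full 256 MB splits.
    return math.ceil(sum(file_sizes_mb) / SPLIT_SIZE_MB)

# Example: 2,000 small CSV files of 5 MB each (10 GB total).
small_files = [5] * 2000
print(mappers_per_file(small_files))   # 2000 mappers, one per file
print(mappers_combined(small_files))   # 40 mappers after combining
```

Even at the same total data volume, the per-file scheme launches 50x more mappers, which is the scheduling overhead we are seeing.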
So my questions are:
- What is the ideal size for an individual file? I see that MapR-FS uses 256 MB as the default chunk (block) size, so should we aim to keep every file close to that size? And what if our individual files are bigger, say 512 MB, 1 GB, or tens of GB? My guess is that 256 MB or larger is ideal, but I just want to confirm.
- Any suggestions on file format? Assume most of our processes are equally read- and write-intensive, but we would like to save storage space.
- To revisit the file size question with format in mind: what is the ideal size given that we are using some file format xyz? My guess is that, regardless of file format, files should be 256 MB or bigger.
- What is the ideal size of a volume?
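For context on my guess in the file size questions above: my understanding (an assumption I'd like confirmed) is that files larger than the chunk size are fine, because the filesystem simply splits them into multiple chunks that can be read in parallel; the real penalty comes from files much smaller than a chunk. A small sketch of that reasoning:

```python
# Sketch of my assumption: a file occupies ceil(size / chunk_size) chunks,
# so large files parallelize naturally across chunks, while each tiny file
# still costs one chunk/split of scheduling and metadata overhead.
import math

CHUNK_MB = 256  # MapR-FS default chunk size

def chunks_for(file_size_mb):
    return math.ceil(file_size_mb / CHUNK_MB)

for size in (64, 256, 512, 1024, 10 * 1024):
    print(f"{size} MB file -> {chunks_for(size)} chunk(s)")
```

So a 1 GB or 10 GB file is not a problem per se; it just maps to more chunks, which is why I suspect "256 MB or bigger" is the right target.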