
Compression weirdness: LZ4 and Zlib return similar compression sizes?

Question asked by rhinomike on May 16, 2015
Latest reply on May 16, 2015 by rhinomike
I am currently reviewing my options for long-term compression of files, and so I have been running some compression tests on previously processed data.

    [user~]$ ls -l
    ...
    -rw-r--r-- 1 user group        288339499 May 16 22:30 gz_compressed.gz
    -rw-r--r-- 1 user group         83443156 May 16 22:18 lbzip2_compressed.bz2
    -rwxr-xr-x 1 user group      36769870848 May 12 07:10 part-00000
    drwxr-xr-x 2 user group                1 May 16 23:07 subdir
    -rwxr-xr-x 1 user group                0 May 12 07:10 _SUCCESS
    -rw-r--r-- 1 user group        184155640 May 16 22:06 xz1_compressed.xz-1
    -rw-r--r-- 1 user group        135548124 May 16 00:40 xz_compressed.xz
    ...
    [user~]$ hadoop mfs -lss /user/testfiles/subdir/ | grep rw
    -rw-r--r-- z U   3 user group 36769870848      570384 2015-05-16 23:11  268435456 /user/testfiles/subdir/part-00000
    [user~]$ hadoop mfs -lss /user/testfiles/ | grep part-00000 | grep rw
    -rwxr-xr-x Z U   3 user group 36769870848      570384 2015-05-12 07:10  268435456 /user/testfiles/part-00000


From what I gather, the size of the file on disk is obtained by multiplying `570384 * 8000`, which gives me around 4 GB. This conclusion seems to be corroborated by @cmatta's handy [checkcomp][1] script.

    [user~]$ ./checkcomp.sh -h /user/testfiles/subdir/part-00000
    4G compressed
    34G uncompressed
    4G / 34G(12.70%)
    [user~]$ ./checkcomp.sh -h /user/testfiles/part-00000
    4G compressed
    34G uncompressed
    4G / 34G(12.70%)
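
As a rough cross-check of that arithmetic, here is a minimal sketch that recomputes the physical size and the ratio from the `-lss` numbers. The function names are hypothetical, and the 8192 block size is an assumption added for comparison with the round 8000 figure; it is not stated anywhere in the output above. On these numbers, 8192 lands close to the 12.70% that checkcomp reports, while 8000 gives a lower ratio.

```shell
#!/bin/sh
# Values copied from the `hadoop mfs -lss` output above.
logical_bytes=36769870848   # logical (uncompressed) file size
blocks=570384               # block count column from -lss

# Physical size in bytes = block count * assumed block size.
physical_bytes() {
    echo $(( $1 * $2 ))
}

# Physical size as a percentage of the logical size, two decimals.
ratio_pct() {
    awk -v p="$1" -v l="$2" 'BEGIN { printf "%.2f", 100 * p / l }'
}

# 8000 is the figure used in the question; 8192 (8 KiB) is an assumption.
for block_size in 8000 8192; do
    phys=$(physical_bytes "$blocks" "$block_size")
    echo "block_size=$block_size physical=$phys ratio=$(ratio_pct "$phys" "$logical_bytes")%"
done
```

Either way the result is about 4 GB, and it is identical for both copies because both report the same block count.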

**How come the physical size is the same despite the different compression settings?**

All test files are generated the same way, e.g.

    cat part-00000 | gzip > gz_compressed.gz
    cat part-00000 | lbzip2 > lbzip2_compressed.bz2
    cat part-00000 | xz -1 > xz1_compressed.xz-1

and in the case of the subdir flatfile:

    cat part-00000 > subdir/part-00000
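
For comparison with the on-disk figure, the sizes of the externally compressed files can be expressed as a percentage of the original. This is only a sketch using the byte counts from the `ls -l` listing above; `pct_of_original` is a hypothetical helper name.

```shell
#!/bin/sh
# Logical size of part-00000, taken from the ls -l listing.
logical_bytes=36769870848

# Compressed size as a percentage of the original, two decimals.
pct_of_original() {
    awk -v c="$1" -v l="$logical_bytes" 'BEGIN { printf "%.2f", 100 * c / l }'
}

# File names and sizes copied from the ls -l listing.
while read -r name bytes; do
    echo "$name $(pct_of_original "$bytes")%"
done <<'EOF'
gz_compressed.gz 288339499
lbzip2_compressed.bz2 83443156
xz1_compressed.xz-1 184155640
xz_compressed.xz 135548124
EOF
```

All four external results come out below 1% of the original, well under the roughly 12.7% that checkcomp reports for the file on the MapR volume.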

Am I missing something? Or is MapR incorrectly applying LZ4 (upper case Z) to a zlib-compressed file (lower case z)?


  [1]: https://gist.github.com/cjmatta/8409de7e92e0d5c016e5#file-checkcomp-sh
