Interpreting compression ratios in hadoop fs command outputs

Document created by mufeed on Feb 7, 2016

Author: Mufeed Usman

 

Original Publication Date: April 30, 2015

 

Created an uncompressed volume as detailed below.

[root@n72 ~]# hadoop mfs -lsd /v 
Found 1 items
vrwxr-xr-x U U   - root root          1 2015-04-10 10:48  268435456 /v
           p v default 2049.2095.771424 -> 2242.16.2  n72:5660

 

Copied a file to this volume (shows a size of 419720 bytes)

[root@n72 ~]# hadoop mfs -ls /v 
Found 1 items
-rwxr-xr-x U U   3 root root     419720 2015-04-10 10:48  268435456 /v/cldb.log
     p 2242.36.131208  n72:5660
     0 2243.32.131232  n72:5660

 

Set compression on the volume

[root@n72 ~]# hadoop mfs -set compression on /v  

[root@n72 ~]# hadoop mfs -lsd /v
Found 1 items
vrwxr-xr-x Z U   - root root          1 2015-04-10 10:48  268435456 /v
          p v default 2049.2095.771424 -> 2242.16.2  n72:5660

         

Copied the same file again into the now-compressed volume (the previous copy was renamed to /v/cldb.log.U)

[root@n72 ~]# hadoop mfs -ls /v 
Found 2 items
-rwxr-xr-x Z U   3 root root     419720 2015-04-10 10:52  268435456 /v/cldb.log
     p 2242.38.131212  n72:5660
     0 2247.33.131374  n72:5660
-rwxr-xr-x U U   3 root root     419720 2015-04-10 10:48  268435456 /v/cldb.log.U
     p 2242.36.131208  n72:5660
     0 2243.32.131232  n72:5660

 

Now both files show the same size, 419720 bytes. Why? Did the compression not take effect? Shouldn't the size of /v/cldb.log be lower? The catch is that the reported file size stays the same, while the number of disk blocks used goes down. Compression is transparent to users: the size shown as a file property is always the real size of the data (i.e. the logical size).

 

The effect of compression can be gauged by the following command.

[root@n72 ~]# hadoop mfs -lss /v 
Found 2 items
-rwxr-xr-x Z U   3 root root     419720 2015-04-10 10:52  268435456 /v/cldb.log
     p 2242.38.131212                   8 n72:5660
     0 2247.33.131374                  18 n72:5660
     Total Disk Blocks : 26
-rwxr-xr-x U U   3 root root     419720 2015-04-10 10:48  268435456 /v/cldb.log.U
     p 2242.36.131208                   8 n72:5660
     0 2243.32.131232                  45 n72:5660
     Total Disk Blocks : 53
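
To compare many files at once, the "Total Disk Blocks" figures can be pulled out of the listing with a small script. The following is only a minimal sketch: it assumes the -lss output looks exactly like the sample above (file entries start with "-", each followed by a "Total Disk Blocks" line) and that paths contain no spaces. The names LSS_OUTPUT and disk_blocks_per_file are hypothetical and are not part of any MapR tool.

```python
import re

# Sample text copied from the -lss listing above (hypothetical variable name).
LSS_OUTPUT = """\
Found 2 items
-rwxr-xr-x Z U   3 root root     419720 2015-04-10 10:52  268435456 /v/cldb.log
     p 2242.38.131212                   8 n72:5660
     0 2247.33.131374                  18 n72:5660
     Total Disk Blocks : 26
-rwxr-xr-x U U   3 root root     419720 2015-04-10 10:48  268435456 /v/cldb.log.U
     p 2242.36.131208                   8 n72:5660
     0 2243.32.131232                  45 n72:5660
     Total Disk Blocks : 53
"""


def disk_blocks_per_file(text):
    """Map each file path to its 'Total Disk Blocks' value.

    Assumes the layout shown in the sample; paths with spaces are not handled.
    """
    blocks = {}
    current_path = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("-"):
            # File entry line: the path is the last whitespace-separated field.
            current_path = stripped.split()[-1]
        else:
            match = re.match(r"Total Disk Blocks\s*:\s*(\d+)", stripped)
            if match and current_path is not None:
                blocks[current_path] = int(match.group(1))
    return blocks


print(disk_blocks_per_file(LSS_OUTPUT))
```

For the sample listing this yields 26 blocks for /v/cldb.log and 53 blocks for /v/cldb.log.U.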

 

Multiplying the block count by 8K (our standard filesystem block size) gives an estimate of the space used on disk before and after compression. In this case,

Disk usage before compression  = 53 * 8192 = 434176 bytes
Disk usage after compression   = 26 * 8192 = 212992 bytes

 

This tallies with the compression ratio detailed at http://doc.mapr.com/display/MapR/Compression for lz4, which is our default.

Compression Ratio = logical size / compressed size on disk = 419720 / 212992 = 1.97 (approximately 2)
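
The arithmetic above can be reproduced with a few lines of Python. This is a rough sketch, assuming the 8K block size stated in this article; the function names on_disk_bytes and compression_ratio are made up for illustration.

```python
BLOCK_SIZE = 8192  # 8K filesystem block size, as stated in this article


def on_disk_bytes(total_disk_blocks, block_size=BLOCK_SIZE):
    """Estimate the space used on disk from a 'Total Disk Blocks' count."""
    return total_disk_blocks * block_size


def compression_ratio(logical_size, total_disk_blocks, block_size=BLOCK_SIZE):
    """Logical file size divided by the estimated on-disk size."""
    return logical_size / on_disk_bytes(total_disk_blocks, block_size)


uncompressed = on_disk_bytes(53)        # /v/cldb.log.U
compressed = on_disk_bytes(26)          # /v/cldb.log
ratio = compression_ratio(419720, 26)   # logical size taken from the listing

print(uncompressed, compressed, round(ratio, 2))
# prints: 434176 212992 1.97
```

Because disk usage is counted in whole blocks, the result is an estimate rather than an exact byte count.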

 

 

 
