Understanding MapR compression

Question asked by elleg on Jul 1, 2014
Latest reply on Jul 2, 2014 by elleg

We're testing MapR's compression feature, but we're not sure we're seeing the expected results when we store raw data in a compression-enabled directory. (As a side note, we have also been testing MapR tables, which do seem to compress the data that we store in them.)

According to MapR's documentation, compression is set at the directory level. So in our test, we created 2 new directories (using mkdir) in /mapr/ (we're using MapR on Amazon EMR), one that was compressed and another that was uncompressed. For the uncompressed directory we turned off compression by doing:

       hadoop mfs -setcompression off /mapr/

In all of our tests, the compressed and uncompressed directories ended up practically the same size when they held approximately the same number of files. Here is what we tried, without success, to affect the size of the files:

*Note: the data we were working with were tweets.

- Inserting ~100,000 individual files (1 tweet per file) into the filesystem, each averaging 1.4KB and none larger than 1MB. In this case the total size of each directory, compressed and uncompressed, was ~300MB.
- Inserting 10 files (each containing 10,000 tweets) into the filesystem, with an average file size of ~28MB. Again, each directory, compressed and uncompressed, totaled ~300MB.
- Changing the default compression of the compressed directory from lzf to lz4. The size of the directories did not change once they had documents inserted.
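
For reference, directory totals like the ones above can be compared with something like the following sketch. It assumes the cluster is reachable as a local mount and that both directories are passed on the command line; the class name `DirSizeSketch` and the method `totalBytes` are made up for illustration and are not part of MapR. Note that it sums the file lengths the filesystem reports, which may well be logical (uncompressed) lengths rather than on-disk usage, so equal totals here would not by themselves prove that compression is off.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Illustrative helper (hypothetical name): compares the reported sizes of two directory trees.
class DirSizeSketch {

    /** Sums the reported length, in bytes, of every regular file under root. */
    static long totalBytes(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: DirSizeSketch <compressed-dir> <uncompressed-dir>");
            return;
        }
        // Both paths are supplied by the caller; nothing here is specific to MapR.
        long compressed = totalBytes(Paths.get(args[0]));
        long uncompressed = totalBytes(Paths.get(args[1]));
        System.out.printf("compressed dir:   %d bytes%n", compressed);
        System.out.printf("uncompressed dir: %d bytes%n", uncompressed);
    }
}
```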

Some of the possible reasons that we think might explain why we're not really seeing a noticeable difference when working with raw files in the filesystem are:

- There are not enough files.
- The files are not large enough.
- We're missing a property/setting when we insert documents into the filesystem. This is the code that does the insert:

        // Uses org.apache.hadoop.conf.Configuration and
        // org.apache.hadoop.fs.{FileSystem, Path, FSDataOutputStream}.
        String dirName = "/compressed"; // and "/uncompressed" for the other directory

        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);
        Path wFilePath = new Path(dirName + "/file.w"); // the filenames were different depending on which idea we were trying out

        FSDataOutputStream outputStream = fileSystem.create(wFilePath);

        byte[] messageBytes = json.getBytes();

        try {
          outputStream.write(messageBytes);
        } catch (IOException e) {
          LOGGER.error("Error writing to the MapR output stream.", e);
        } catch (Exception e) {
          LOGGER.error("Unknown error writing to the MapR output stream.", e);
        }
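
One sanity check related to the possibilities listed above is whether tweet-like JSON compresses at all. The sketch below uses DEFLATE from `java.util.zip` as a stand-in, not MapR's lzf or lz4, and the sample record is made up; `CompressibilitySketch` and `deflatedSize` are illustrative names, so treat the ratio it prints as a rough indication only.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Illustrative check (hypothetical name): how well does a sample payload compress with DEFLATE?
class CompressibilitySketch {

    /** Returns the number of bytes the input occupies after DEFLATE compression. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        // Drain the compressor; only the byte count is kept, the output is discarded.
        while (!deflater.finished()) {
            total += deflater.deflate(buffer);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Made-up tweet-like records; real data should be substituted here.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("{\"id\":").append(i)
              .append(",\"text\":\"example tweet text\",\"lang\":\"en\"}\n");
        }
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        int packed = deflatedSize(raw);
        System.out.printf("raw=%d bytes, deflated=%d bytes, ratio=%.2f%n",
                raw.length, packed, (double) packed / raw.length);
    }
}
```

If a sample of the real tweets shrinks noticeably here, the data itself is unlikely to be the reason the two directories look the same size.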

At this point, we're pretty much stumped as to why compression doesn't seem to be affecting the raw data that we insert into MapR. Any suggestions that might help explain what we're doing wrong would be much appreciated.