
HDFS API file read 23-times slower over WAN (MapR v1.2.9)

Question asked by matroyd on Sep 17, 2012
Latest reply on Oct 26, 2012 by Ted Dunning
Details of the test we conducted:
We read 10,000 files, 5 GB in total, all from one directory. We ran a Java program that reads the contents of the files using the HDFS API. The same program was run from a Linux host on the same LAN as the MapR cluster and from another host across the WAN, to compare wall-clock time.

    The read times from MapR are as follows:
    1> over WAN: 1731 secs
    2> over LAN: 73 secs

To account for latency, bandwidth, etc. over the WAN, we ran the same Java program to read the same amount of data from another vendor's product.

    The read times from the other vendor's product are as follows:
    1> over WAN: 241 secs
    2> over LAN: 160 secs
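
To put the four timings side by side, here is a minimal sketch that derives the WAN/LAN slowdown, the average time per file, and the average throughput from the figures above. The class and method names are made up for illustration, and it assumes 5 GB means 5 * 1024 MB.

```java
// Hypothetical helper: derives ratios from the measured wall-clock times.
class ReadTimings {
    // Figures taken from the test description above.
    static final int FILE_COUNT = 10_000;
    static final double TOTAL_MB = 5 * 1024.0; // assumption: 5 GB = 5 * 1024 MB

    /** How many times slower the WAN run was than the LAN run. */
    static double slowdown(double wanSecs, double lanSecs) {
        return wanSecs / lanSecs;
    }

    /** Average wall-clock time spent per file, in milliseconds. */
    static double perFileMillis(double totalSecs, int fileCount) {
        return totalSecs * 1000.0 / fileCount;
    }

    /** Average throughput over the whole run, in MB/s. */
    static double throughputMBps(double totalMb, double totalSecs) {
        return totalMb / totalSecs;
    }

    /** Prints the derived figures for one product. */
    static void report(String name, double wanSecs, double lanSecs) {
        System.out.printf(
            "%s: WAN/LAN = %.1fx, per file = %.1f ms (WAN) vs %.1f ms (LAN), "
                + "throughput = %.1f MB/s (WAN) vs %.1f MB/s (LAN)%n",
            name,
            slowdown(wanSecs, lanSecs),
            perFileMillis(wanSecs, FILE_COUNT),
            perFileMillis(lanSecs, FILE_COUNT),
            throughputMBps(TOTAL_MB, wanSecs),
            throughputMBps(TOTAL_MB, lanSecs));
    }

    public static void main(String[] args) {
        report("MapR", 1731, 73);
        report("Other vendor", 241, 160);
    }
}
```

For MapR this works out to a slowdown of about 23.7x, or roughly 173 ms per file over the WAN against about 7 ms on the LAN. The other product comes to about 1.5x, or roughly 24 ms against 16 ms per file. That both products share the same WAN link suggests the gap is a per-file cost rather than raw bandwidth, though this is an inference from the averages and not something we measured directly.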

MapR gave us the fastest read time, but only from a co-located box. We would like some pointers on how to achieve the same kind of performance over the WAN. Any pointers on tuning the code or the MapR settings would be greatly appreciated.
Thanks!

The application reads the files in a loop on a single thread. The FileSystem object is obtained once at the start of the application and closed once at the end.

The following method is used to read each file (we are using Google protobuf):

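// Imports needed by this method (Hadoop client and java.io):
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Reads one protobuf-encoded file from /foo and converts it to an XYZ.
// Returns null if the file cannot be opened or parsed.
public XYZ getXYZ(String fileName, FileSystem fs) {
    FSDataInputStream istr = null;
    try {
        String dirName = "/foo";
        Path path = new Path(dirName + "/" + fileName);
        istr = fs.open(path);
        // parseFrom reads the stream to the end and decodes the message
        PBTypes.PBxyz pbxyz = PBTypes.PBxyz.parseFrom(istr);

        XYZ xyz = new XYZ();
        // set all the data into xyz
        xyz.set......(pbxyz.get....());
        return xyz;
    } catch (Exception e) {
        // do nothing: the failure is swallowed and null is returned below
    } finally {
        // always release the stream, even if open or parse failed
        if (istr != null) {
            try {
                istr.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return null;
}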
