
MapR scanning slowly

Question asked by snelson on Jun 17, 2014
Latest reply on Jun 23, 2014 by snelson
All,
I'm using the AsyncHBase library to scan millions of rows in a MapR native table, and I'm getting poor results. I've already made sure that server block caching is turned on for my scan, I'm setting my max rows per RPC to 10,000 (the rows aren't very big), and my max KeyValues to -1. I've measured both the number of rows returned from each call to nextRows() and the time elapsed, and I'm disappointed with what I see: the cluster isn't sending me 10,000 rows per request, it's sending about 500, and each call to nextRows() takes 10-20 milliseconds to return. Adding up all the requests, scanning a million rows takes 30-50 seconds. Even more concerning, when I increased my cluster size, node size, and disk throughput, the scan time hardly changed.
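
For reference, the scan loop is roughly the following (a minimal sketch of what I described above; the ZooKeeper quorum and table path are placeholders, and I block on each Deferred here just to keep the timing easy to read):

    import java.util.ArrayList;

    import org.hbase.async.HBaseClient;
    import org.hbase.async.KeyValue;
    import org.hbase.async.Scanner;

    public class ScanTimer {
      public static void main(String[] args) throws Exception {
        // "zkhost" and "/tables/mytable" are placeholders for my quorum and table path.
        final HBaseClient client = new HBaseClient("zkhost");
        final Scanner scanner = client.newScanner("/tables/mytable");
        scanner.setServerBlockCache(true);  // server block caching on, as noted above
        scanner.setMaxNumRows(10000);       // ask for up to 10,000 rows per RPC
        scanner.setMaxNumKeyValues(-1);     // no per-RPC KeyValue limit

        long totalRows = 0;
        while (true) {
          long start = System.nanoTime();
          ArrayList<ArrayList<KeyValue>> rows = scanner.nextRows().join();
          long elapsedMs = (System.nanoTime() - start) / 1000000;
          if (rows == null) {
            break;  // scanner exhausted
          }
          totalRows += rows.size();
          // Typically prints ~500 rows in 10-20 ms, not the 10,000 requested.
          System.out.println(rows.size() + " rows in " + elapsedMs + " ms");
        }
        scanner.close().join();
        client.shutdown().join();
        System.out.println("Scanned " + totalRows + " rows total");
      }
    }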

I have two questions. First, how can I get the cluster to send back more rows per RPC? I'm worried that the sheer number of RPCs is killing my performance.

Second, are there other ways to improve the performance of the scan?

**Update:**
I tried using the HBase/HTable library in place of AsyncHBase, and the performance got worse.
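
The synchronous version looked roughly like this (again a sketch; the table path and the caching value are stand-ins for my real settings):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class HTableScanTimer {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // "/tables/mytable" is a placeholder for my real MapR table path.
        HTable table = new HTable(conf, "/tables/mytable");

        Scan scan = new Scan();
        scan.setCaching(10000);     // rows fetched per RPC, analogous to setMaxNumRows
        scan.setCacheBlocks(true);  // keep server block caching on, as in the AsyncHBase test

        long totalRows = 0;
        long start = System.currentTimeMillis();
        ResultScanner scanner = table.getScanner(scan);
        for (Result r = scanner.next(); r != null; r = scanner.next()) {
          totalRows++;
        }
        scanner.close();
        table.close();
        System.out.println("Scanned " + totalRows + " rows in "
            + (System.currentTimeMillis() - start) + " ms");
      }
    }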

**Update 2:**
I found this in the MapR asynchbase code:

    // TODO: MapR, Need to support Scanner regex and max_num_kvs
    maprscan.rowConstraint = toRowConstraint(mtable,
                                             Bytes.toString(scan.getFamily()),
                                             scan.getQualifiers(),
                                             scan.getMinTimestamp(),
                                             scan.getMaxTimestamp(),
                                             scan.getMaxVersions());

This suggests that the max rows parameter isn't respected by the MapR scanner implementation, which would explain why I don't get 10,000 rows back from the scanner at a time.
