
RPC Timeouts and Network Saturation at MFS Primary Node for Root FID

Question asked by dannyman on Aug 27, 2015
Latest reply on Aug 28, 2015 by dannyman
MapR v. 2.1.3.19871.GA

We get errors like this:

    2015-08-27 01:40:03,8147 ERROR Client fs/client/fileclient/cc/client.cc:3439 Thread: 140290951104256 rpc err Connection timed out(110) 28.15 to 10.10.12.149:5660, fid 2049.56066.28199832, upd 1, failed err 17
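
To see which node these errors cluster on, something like the following can scan a client log and tally the timeouts per destination IP. This is only a sketch; the log path is a placeholder for wherever your fileclient messages actually land.

    #!/usr/bin/env python
    # Tally "rpc err Connection timed out" lines per destination IP so the
    # worst-offending MFS node stands out. LOG is a placeholder path.
    import re
    from collections import Counter

    LOG = "/opt/mapr/logs/client.log"   # assumption: point this at your client log
    pattern = re.compile(r"rpc err Connection timed out\(110\).* to (\d+\.\d+\.\d+\.\d+):\d+, fid \S+,")

    counts = Counter()
    with open(LOG) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                counts[m.group(1)] += 1

    for ip, n in counts.most_common():
        print("%-16s %d timeouts" % (ip, n))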

The correlation is that whichever node is responsible for the RPC timeout is also listed as the primary node for the mfs root:

    0-09:06 djh@mapr-01 ~$ hadoop mfs -lsd /
    Found 1 items
    drwxrwxr-- Z   - mapr mapr         10 2015-08-21 12:25  268435456 /
                   p 2049.16.2  badhawk-12.prod.qxxxxxxxxd.com:5660 mapr-10.prod.qxxxxxxxxd.com:5660 c24-mtv-02-35.dev.qxxxxxxxxd.com:5660

1) The IP 10.10.12.149 maps to badhawk-12.prod.qxxxxxxxxd.com ... (see the reverse-lookup sketch after this list)

2) Running hadoop mfs -lsfid on the fid from the above RPC error reveals that the file no longer exists

3) Sometimes we get an RPC timeout for the root node's fid as well
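
For item 1, the reverse mapping is easy to confirm programmatically; a minimal sketch, with the IP taken from the RPC error and the hostname from the "p 2049.16.2" line above:

    #!/usr/bin/env python
    # Confirm that the IP in the RPC error reverse-resolves to the host
    # listed as primary for the root fid in "hadoop mfs -lsd /".
    import socket

    SUSPECT_IP = "10.10.12.149"                     # from the RPC error
    PRIMARY = "badhawk-12.prod.qxxxxxxxxd.com"      # first host on the "p" line

    name, aliases, addrs = socket.gethostbyaddr(SUSPECT_IP)
    print("reverse lookup: %s" % name)
    print("matches primary: %s" % (name == PRIMARY or PRIMARY in aliases))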

Twice now we have blacklisted nodes and gotten the job running again; it then either works or errors out like the above.

Looking at performance graphs, I see that the node in question is maxing out its network while the others are not.
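
As a sanity check on the graphs, the suspect node's NIC throughput can be sampled directly from /proc/net/dev; a rough sketch (the interface name is an assumption):

    #!/usr/bin/env python
    # Spot-check NIC throughput by sampling /proc/net/dev twice, five
    # seconds apart. IFACE is an assumption; adjust for the host in question.
    import time

    IFACE = "eth0"

    def read_bytes(iface):
        with open("/proc/net/dev") as f:
            for line in f:
                if line.strip().startswith(iface + ":"):
                    fields = line.split(":", 1)[1].split()
                    return int(fields[0]), int(fields[8])   # rx_bytes, tx_bytes
        raise RuntimeError("interface %s not found" % iface)

    rx1, tx1 = read_bytes(IFACE)
    time.sleep(5)
    rx2, tx2 = read_bytes(IFACE)
    print("rx %.1f MB/s  tx %.1f MB/s" % ((rx2 - rx1) / 5.0 / 1e6, (tx2 - tx1) / 5.0 / 1e6))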

My hunch is that the client code should be trying one of the secondary nodes when the primary times out, but that we are running an antiquated version. I also wonder whether there are tunables to work around this situation. I'm going to compare configs on the badhawk vs. mapr hosts: the former run Spark jobs, the latter are full-on MapR worker nodes. I've already been looking into network upgrades.
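
For the config comparison, something like the sketch below diffs a couple of candidate files between a badhawk host and a mapr host over ssh. The hostnames and file paths are assumptions (as is passwordless ssh); substitute whatever your layout actually uses.

    #!/usr/bin/env python
    # Diff config files between a badhawk (Spark) host and a mapr (full MapR
    # worker) host. Hostnames and file paths are assumptions.
    import difflib
    import subprocess

    HOSTS = ("badhawk-12.prod.qxxxxxxxxd.com", "mapr-10.prod.qxxxxxxxxd.com")
    FILES = ("/opt/mapr/conf/mfs.conf",
             "/opt/mapr/conf/env.sh")

    def fetch(host, path):
        out = subprocess.check_output(["ssh", host, "cat", path])
        return out.decode("utf-8", "replace").splitlines()

    for path in FILES:
        a, b = fetch(HOSTS[0], path), fetch(HOSTS[1], path)
        diff = list(difflib.unified_diff(a, b,
                                         fromfile="%s:%s" % (HOSTS[0], path),
                                         tofile="%s:%s" % (HOSTS[1], path),
                                         lineterm=""))
        print("\n".join(diff) if diff else "%s: identical" % path)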

I am grateful for any suggestions, observations, ideas, &c.

Further observation: successful runs are characterized by several nodes getting blacklisted after RPC timeouts to the swamped MFS node. This strikes me as a crude throttling mechanism which ends up limiting traffic to a level the node can actually handle.
