Data Local Map Tasks - Configuration Settings?

Question asked by mandoskippy on Apr 7, 2013
Latest reply on Apr 8, 2013 by mandoskippy
We have a compute-only node in our cluster, and when we run a query, we see that the compute-only node only rarely gets a task assigned to it. It works fine, but the cluster seems to prefer the nodes with data. This is MapR 2.1.2 and Hive 0.10.

What I mean by preference is this: when I run a query that needs more map tasks than the cluster has task slots, the compute-only node stays mostly empty. For example's sake, say I have 20 map slots per node, 4 data-local nodes, and 1 compute-only node, so the cluster has 100 available task slots. I kick off a job that requires 500 mappers, look at the job tracker, and see that each data-local node has 20 mappers running, but my compute-only node has 0 or sometimes 1 mapper running. (The tasks it does get work: no errors, not blacklisted, etc.) It rarely fills up. In addition, data load functions using our custom transform script work great: they get assigned to the compute-only node, and it contributes well there.
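To put rough numbers on the example, here is a minimal sketch of the slot arithmetic. It assumes every mapper takes about the same time and that tasks run in full "waves"; the function names are made up for illustration and are not part of any Hadoop or MapR API.

```python
# Numbers come from the example in this post; they are illustrative,
# not measured job tracker output.
MAP_SLOTS_PER_NODE = 20
DATA_LOCAL_NODES = 4
COMPUTE_ONLY_NODES = 1
MAPPERS_REQUIRED = 500


def total_slots(slots_per_node, node_count):
    """Map slots available across node_count identical nodes."""
    return slots_per_node * node_count


def waves_needed(mappers, slots):
    """Rounds of tasks needed if every slot is filled each round (ceiling division)."""
    return -(-mappers // slots)


# All five nodes contributing: 100 slots.
slots_all_nodes = total_slots(MAP_SLOTS_PER_NODE, DATA_LOCAL_NODES + COMPUTE_ONLY_NODES)
# Only the data-local nodes contributing: 80 slots.
slots_data_local = total_slots(MAP_SLOTS_PER_NODE, DATA_LOCAL_NODES)

waves_all_nodes = waves_needed(MAPPERS_REQUIRED, slots_all_nodes)
waves_data_local = waves_needed(MAPPERS_REQUIRED, slots_data_local)

print(f"slots with compute-only node used: {slots_all_nodes}, waves: {waves_all_nodes}")
print(f"slots with data-local nodes only:  {slots_data_local}, waves: {waves_data_local}")
```

Under those assumptions, leaving the compute-only node idle means a fifth of the map slots go unused and the job needs more waves to finish.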

Now, obviously, in many cases one might say "that's gonna be a lot of network IO if your compute node is yanking data from around the cluster," and normally I would agree. However, we have 40 Gbps InfiniBand cards handling our node-to-node communication. When I look at the compute node's usage, it's barely over idle. There is a huge network pipe for pulling in remote data, plenty of processor, and plenty of memory, and we want to use it!

So my question is this: what is happening? Am I perceiving things correctly, and if I am, are things working as intended? If so, are there settings we can tweak to make better use of our compute-only nodes?