
Changing Balancer Settings Daily - Increased Performance?

Question asked by mandoskippy on Oct 20, 2012
Latest reply on Apr 25, 2013 by Ted Dunning
I was thinking about balancers: why they are off by default, how they affect data on a cluster, etc. Let's say you have a cluster that is well used during the day, but at night there are fewer queries while ETL work continues. Just for the sake of numbers, let's say your cluster is at a healthy 80% utilization during the day and falls back to 50% at night, with no danger of missing ETL SLAs.

Would there be a benefit to programmatically turning on balancers at night with aggressive settings to try to distribute data evenly across the nodes? My hypothesis is that doing so may slow down some ETL jobs (which is not an issue as long as their SLAs are not in danger) but could spread data more evenly across the cluster, allowing for more balanced jobs (the data is local to more nodes, mappers have easier access to the data they need, etc.). I am trying to come up with some test cases for this.
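To make the "programmatically turn it on at night" idea concrete, here is a minimal sketch of the scheduling decision only. The window times, setting names, and "aggressive" values are all hypothetical placeholders, not real balancer parameters; you would map them onto whatever your cluster's balancer actually exposes.

```python
from datetime import time

# Hypothetical night window; adjust to when the cluster is actually quiet.
NIGHT_START = time(22, 0)
NIGHT_END = time(6, 0)

# Hypothetical setting names and values, for illustration only.
DAY_SETTINGS = {"balancer_enabled": False}
NIGHT_SETTINGS = {
    "balancer_enabled": True,
    "imbalance_threshold_pct": 5,    # assumed "aggressive" value
    "max_concurrent_moves_pct": 25,  # assumed "aggressive" value
}


def in_night_window(now, start=NIGHT_START, end=NIGHT_END):
    """Return True if `now` falls in [start, end), allowing a wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def settings_for(now):
    """Pick the balancer profile for the given time of day."""
    return NIGHT_SETTINGS if in_night_window(now) else DAY_SETTINGS
```

A cron job or scheduler could call `settings_for(datetime.now().time())` and push the result to the cluster, so the day profile is restored automatically in the morning.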

The reason I ask is that I just added a physical node with lots of processor and memory, but for now only one disk. This node is the newest in what HAS been an all-virtual cluster. My impatience to try out the new hardware led me to decommission one virtual node, steal the physical drive that was attached to it via RDM, and put it in the beefy physical box. Each virtual node had 3 cores assigned, 3 map tasks, 1 reduce task, 8 GB of RAM, and 1 physical drive attached via VMware's Raw Device Mapping. Yeah, I know the core/map task ratio was a bit wonky, but in my testing that actually gave the best performance... *shrug*.

So before the new node, I had two racks of 4 VMs: 24 map tasks and 8 reduce tasks total. After swapping one VM for the physical node, I had an unbalanced cluster of two racks, one with 1 physical node and 3 VMs and the other with 4 VMs, for a total of 28 map tasks and 11 reduce tasks. My theory was that I'd get better performance by virtue of having a beefy node. However, my disks did not meet the balancer level I had set, so the beefy node had 0% utilization on its disk. This caused slower results and a higher standard deviation in my tests. My hypothesis: map tasks on the physical node had to get all of their data over the network, which hurt performance.
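The slot numbers above work out as follows. Note that the physical node's own slot counts are not stated anywhere; they are inferred here from the 28/11 totals, so treat them as an assumption.

```python
VM_MAP_SLOTS = 3      # map tasks per virtual node
VM_REDUCE_SLOTS = 1   # reduce tasks per virtual node


def totals(vm_count, physical_map=0, physical_reduce=0):
    """Return (map slots, reduce slots) for the whole cluster."""
    return (
        vm_count * VM_MAP_SLOTS + physical_map,
        vm_count * VM_REDUCE_SLOTS + physical_reduce,
    )


before = totals(8)  # two racks of 4 VMs

# Inferred, not stated: whatever is left after 7 VMs must be the physical node.
physical_map = 28 - 7 * VM_MAP_SLOTS
physical_reduce = 11 - 7 * VM_REDUCE_SLOTS
after = totals(7, physical_map, physical_reduce)

# Fraction of map slots living on the node that holds no data yet.
remote_only_share = physical_map / after[0]

print(before, after, physical_map, physical_reduce, remote_only_share)
```

Under that inference, a quarter of the map slots sit on a node with no local data, which fits the hypothesis that those tasks were reading everything over the network.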

So I tweaked some balancer settings to push data back to the "new" node (I'll be getting more drives too...), and now I am going to rerun the tests (I'll post results if anyone is interested). However, I am curious: if I do see an advantage, would there also be an advantage on a large cluster to using slower times to ensure a strong balance of data across the cluster?
