
What is the most elegant way to redistribute data in a targeted manner?

Question asked by jacques on May 4, 2012
Latest reply on May 4, 2012 by srivas
As we add new nodes to our existing cluster, we would like to push certain files so that at least one of their replicas lands on these new machines.  This would allow us to better balance reads of older, non-changing data.

With existing tools it seems like we have two options to do this:

 - Ramp up the replication factor with FileSystem.setReplication(path, replication), hope a new replica lands on the target machine, then decrease it again and hope that replica is the one that survives (super ugly/overkill; see the sketch after this list)
 - Copy the file using a client running on the new machine, then delete the old file and rename the new one into place (also sketched below). This relies on the HDFS behavior that the first replica of a newly created file is written locally, which tools like HBase use to achieve data locality.  (It requires every reading application to be aware of this pattern, and it makes much less sense when there is no natural tendency for files to be rewritten.)
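
For concreteness, here is a rough sketch of what the two workarounds look like against the standard Hadoop FileSystem API. The class name, method names, replication factors, and the ".rewrite" suffix are illustrative only, not something from the original question:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class ReplicaPush {

        // Option 1: temporarily raise the replication factor and hope a new
        // replica lands on the freshly added nodes, then drop it back down.
        // When the factor is lowered, the namenode prunes excess replicas
        // arbitrarily, so there is no guarantee the "right" one survives.
        static void bumpReplication(FileSystem fs, Path file,
                                    short normal, short boosted) throws Exception {
            fs.setReplication(file, boosted);
            // ... wait for re-replication to complete before lowering it again ...
            fs.setReplication(file, normal);
        }

        // Option 2: from a client running on the new node, rewrite the file so
        // the first replica of each block is written locally, then swap it in.
        static void rewriteLocally(FileSystem fs, Path file) throws Exception {
            Path tmp = new Path(file.getParent(), file.getName() + ".rewrite");
            FileUtil.copy(fs, file, fs, tmp, false /* deleteSource */, fs.getConf());
            fs.delete(file, false);
            fs.rename(tmp, file);
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            rewriteLocally(fs, new Path(args[0]));  // must be run on the new node
        }
    }

Note that neither approach is atomic with respect to readers: option 1 briefly over-replicates the whole file, and option 2 has a window between the delete and the rename where the path does not exist.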

Is there a more efficient way of doing this?  It would be ideal if we had an interface along the lines of FileSystem.makeLocalReplica(path, serverName) that could be run from anywhere (see the sketch below).
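
Something like the following, which is purely hypothetical and does not exist in the Hadoop FileSystem class today, is what I have in mind:

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;

    // Hypothetical interface as described above; shown only to illustrate
    // the desired call shape, not an existing API.
    public interface ReplicaPlacement {
        // Ask the filesystem to create (or move) a replica of 'path' onto the
        // node 'serverName', regardless of where the calling client runs.
        void makeLocalReplica(Path path, String serverName) throws IOException;
    }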
