I have a large set of data, split over hundreds of files, that I am copying over to my cluster via ssh. I have 9 nodes, all but one set up as data nodes; the remaining node (hadoop-node2) runs most of the MapR processes. On hadoop-node2 I run
hadoop dfs -mkdir /data
and then stream each file over ssh via
ssh username@hadoop-node2 "hadoop dfs -put - /data/myData"
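Concretely, my copy loop looks something like this (shown as a dry run: the file names are placeholders and `echo` just prints each command instead of executing it):

```shell
# Dry-run sketch of the copy loop. "file1.dat"/"file2.dat" stand in for the
# hundreds of real files; remove the echo wrapper to actually run the copies.
for f in file1.dat file2.dat; do
  echo "cat $f | ssh username@hadoop-node2 \"hadoop dfs -put - /data/$f\""
done
```

Every file is piped through hadoop-node2, since that is the only node I ssh into.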
When I monitor the space used on my cluster via the MapR web interface, I see that only hadoop-node2 is filling up.
I've just switched to MapR from Cloudera (so I apologize for my ignorance) where I didn't have to worry about where my data was being stored. Is this not the case with MapR? Do I have to manually pick which node my data is being copied to? Or am I doing something completely wrong?