I am trying out a small Hadoop setup (for experimentation) with just two machines. I am loading about 13 GB of data, a table of around 39 million rows, into it via Hive with a replication factor of 1.
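For reference, this is roughly how I check that the loaded files really carry a replication factor of 1 (the warehouse path below is just an example of where Hive puts the table on my setup; yours may differ):

hadoop fs -ls /user/hive/warehouse/my_table    # second column of each file entry shows the replication factor (1 here)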
My problem is that Hadoop always stores all of this data on a single datanode. Only if I change the dfs.replication factor to 2 using setrep does Hadoop copy data to the other node. I also tried the balancer ($HADOOP_HOME/bin/start-balancer.sh -threshold 0). The balancer recognizes that it needs to move around 5 GB to balance the cluster, but then reports "No block can be moved. Exiting..." and stops:
2010-07-05 08:27:54,974 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Using a threshold of 0.0
2010-07-05 08:27:56,995 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.252.130.177:1036
2010-07-05 08:27:56,995 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 over utilized nodes: 10.220.222.64:1036
2010-07-05 08:27:56,996 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: 1 under utilized nodes: 10.252.130.177:1036
2010-07-05 08:27:56,997 INFO org.apache.hadoop.hdfs.server.balancer.Balancer: Need to move 5.42 GB bytes to make the cluster balanced.
Time Stamp Iteration# Bytes Already Moved Bytes Left To Move Bytes Being Moved
No block can be moved. Exiting...
Balancing took 2.222 seconds
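For completeness, these are roughly the commands involved (the table path is a placeholder for wherever Hive stored the data):

hadoop fsck /user/hive/warehouse/my_table -files -blocks -locations   # every block is reported on the same datanode
hadoop dfs -setrep -w 2 /user/hive/warehouse/my_table                 # only after this does the second node receive any data
$HADOOP_HOME/bin/start-balancer.sh -threshold 0                       # produces the log shown above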
Can anybody suggest how to achieve an even distribution of data across the nodes on Hadoop, without resorting to replication?