Copy data between different versions of Hadoop

I had to copy data from one Hadoop cluster to another recently. However, the two clusters ran different versions of Hadoop, which made using distcp a little tricky.

Some notes on distcp: by default, distcp will skip files that already exist in the destination, but they can be overwritten by supplying the -overwrite option. You can also update only the files that have changed using the -update option. distcp is implemented as a MapReduce job in which the work of copying is done by maps that run in parallel across the cluster; there are no reducers. Each file is copied by a single map, and distcp tries to give each map approximately the same amount of data by bucketing files into roughly equal allocations.
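As an illustration (the paths here are hypothetical; only the flags come from the notes above), -update and -overwrite control how existing destination files are handled, and -m caps the number of maps used for the copy:

hadoop distcp -update hdfs://namenode/source/dir hdfs://namenode/dest/dir
hadoop distcp -overwrite -m 20 hdfs://namenode/source/dir hdfs://namenode/dest/dir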

The following command will copy the folder contents from one Hadoop cluster to a folder on another Hadoop cluster. Using hftp is necessary because the clusters run different versions of Hadoop. The command must be run on the destination cluster. Be sure your user has access to write to the destination folder.

hadoop distcp -pb hftp://namenode:50070/tmp/* hdfs://namenode/tmp/

Note: The -pb option will preserve the block size.

Double Note: For copying between two different versions of Hadoop, we must use the HftpFileSystem, which is a read-only file system. This is why distcp must be run on the destination cluster.

The following command will copy data between Hadoop clusters that run the same version.

hadoop distcp -pb hdfs://namenode/tmp/* hdfs://namenode/tmp/


Using Cloudera Manager, I found it very easy to configure single-node Hadoop development nodes that our developers can use to test their Pig scripts. However, out of the box dfs.replication is set to 3, which is great for a cluster, but on a single-node development workstation it throws warnings. I set dfs.replication to 1, but any blocks written previously are still reported as under-replicated blocks. Having a quick way to change the replication factor is very handy.
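For reference, if you set this outside of Cloudera Manager, the equivalent entry in hdfs-site.xml would look something like this (dfs.replication is a standard HDFS property; the value shown is just the single-node setting described above):

<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>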

There are other reasons for managing the replication level of data on a running Hadoop system. For example, if you don’t have even distribution of blocks across your DataNodes, you can increase replication temporarily and then bring it back down.

To set replication of an individual file to 4:
sudo -u hdfs hadoop dfs -setrep -w 4 /path/to/file

You can also do this recursively. To change the replication of the entire HDFS to 1:
sudo -u hdfs hadoop dfs -setrep -R -w 1 /
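After changing the replication, a quick way to confirm there are no remaining under-replicated blocks is to run fsck (again as the hdfs user, matching the commands above):

sudo -u hdfs hadoop fsck /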

I found these easy instructions on the Streamy Development Blog.
