could only be replicated to 1 nodes instead of minReplication (=2). There are 3 datanode(s) running

I chained a long series of Hive scripts in one bash script, and after running for a while it always fails with:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hadoop-yarn/staging/mqq/.staging/job_1540025341471_0619/libjars/mail-1.4.1.jar could only be replicated to 1 nodes instead of minReplication (=2).  There are 3 datanode(s) running and no node(s) are excluded in this operation.
        at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1571)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3112)

The next run sometimes works fine.

could only be replicated to 1 nodes instead of minReplication (=2).  There are 3 datanode(s) running and no node(s) are excluded in this operation.

This means all 3 datanodes are healthy, yet the data could only be replicated to 1 datanode, which falls short of the required minimum of 2. minReplication is the minimum number of replicas HDFS must successfully write before a write call is allowed to succeed; my understanding is that a distributed file system wants at least 2 copies.
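
A quick way to check what the cluster is actually configured with (hdfs getconf -confKey is a standard subcommand; dfs.namenode.replication.min is the current name for the minimum, with dfs.replication.min as an older alias):

hdfs getconf -confKey dfs.replication
hdfs getconf -confKey dfs.namenode.replication.min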

 

Checking the corresponding job error in hive.log:

2018-10-25 20:13:14,880 WARN  [Thread-209]: hdfs.DFSClient (DFSOutputStream.java:run(557)) - DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /tmp/hadoop-yarn/staging/mqq/.staging/job_1540025341471_0619/libjars/mail-1.4.1.jar could only be replicated to 1 nodes instead of minReplication (=2).  There are 3 datanode(s) running and no node(s) are excluded in this operation.
 

That gave no additional information.

Next, check the Hadoop logs on the machine running the Hive script:

2018-10-25 20:10:54,813 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: svr1.master.hadoop.xx/10.28.0.23:9000. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-10-25 20:10:55,813 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: svr1.master.hadoop.xx/10.28.0.23:9000. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-10-25 20:10:56,814 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: svr1.master.hadoop.xx/10.28.0.23:9000. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-10-25 20:10:57,319 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.28.0.21:50010, datanodeUuid=10d02f06-b673-4bbb-afe8-c560f4cecdd7, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-ac6b0b1a-6fb2-434d-b5d2-d8051a1ba3d5;nsid=1362816802;c=0) Starting thread to transfer BP-843453834-10.28.0.22-1540029719983:blk_1074105134_367475 to 10.28.0.23:50010
2018-10-25 20:10:57,319 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(10.28.0.21:50010, datanodeUuid=10d02f06-b673-4bbb-afe8-c560f4cecdd7, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-ac6b0b1a-6fb2-434d-b5d2-d8051a1ba3d5;nsid=1362816802;c=0) Starting thread to transfer BP-843453834-10.28.0.22-1540029719983:blk_1074105135_367476 to 10.28.0.23:50010
2018-10-25 20:10:57,320 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer: Transmitted BP-843453834-10.28.0.22-1540029719983:blk_1074105134_367475 (numBytes=6664) to /10.28.0.23:50010
2018-10-25 20:10:57,320 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer: Transmitted BP-843453834-10.28.0.22-1540029719983:blk_1074105135_367476 (numBytes=4377) to /10.28.0.23:50010
2018-10-25 20:10:57,814 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: svr1.master.hadoop.xx/10.28.0.23:9000. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-10-25 20:10:57,815 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: svr1.master.hadoop.xx/10.28.0.23:9000
 

In other words, connections to svr1.master.hadoop.xx/10.28.0.23:9000 are failing.

 

From the current machine (10.28.0.21) I tried telnet 10.28.0.23 9000, and sure enough it could not connect; telnet 10.28.0.22 9000 connected fine.
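
A quick sketch to probe port 9000 on all three nodes at once (assumes nc is installed; the IPs are this cluster's nodes):

for ip in 10.28.0.21 10.28.0.22 10.28.0.23; do
  # -z: only scan for a listener, -w 3: three-second timeout
  nc -z -w 3 "$ip" 9000 && echo "$ip:9000 OK" || echo "$ip:9000 UNREACHABLE"
done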

That roughly locates the problem. Answers online say to check the slaves and masters configuration and the firewall.

I ran service iptables status on both machines; both reported nothing, meaning no firewall is running.
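
For good measure, the rules can also be listed directly (needs root; on newer systemd-based distros the service would be firewalld rather than iptables):

iptables -L -n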

 

Check which machines are the namenodes:

hdfs getconf -namenodes
svr1.master.hadoop.xx (23) svr2.master.hadoop.xx (22)

So the current machine is not a namenode; the .23 machine is one.

Going to that machine, jps shows no NameNode process running.

Check each namenode's state, active or standby:
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

These likewise fail to respond, again indicating the namenode is not started.

I tried running the bash Hive script on the .23 machine instead; it worked at first, but after a while the same error appeared again:

could only be replicated to 1 nodes instead of minReplication (=2).  There are 3 datanode(s) running and no node(s) are excluded in this operation.
 

I was going crazy at this point.

So I ran hadoop namenode directly on that machine to try starting the NameNode in the foreground.

It failed to start.

The reason:

java.io.IOException: NameNode is not formatted.
 

According to answers online, this is the usual cluster-ID inconsistency between the namenode and the datanodes, and the suggested fix is to

delete the old metadata directories and re-run the namenode format.

Been burned before: formatting the namenode deletes all the existing Hive tables, so that is not an option.
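
The cluster ID those answers refer to lives in the VERSION file under each node's metadata directory, so a mismatch can at least be verified without formatting anything (the paths below are assumptions; the real ones are whatever dfs.namenode.name.dir and dfs.datanode.data.dir point to in hdfs-site.xml):

# on a namenode machine (hypothetical path)
grep clusterID /data/hadoop/dfs/name/current/VERSION
# on a datanode machine (hypothetical path)
grep clusterID /data/hadoop/dfs/data/current/VERSION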

To make the namenode directory contents consistent across the machines, I copied the current folder from the other namenode machine (.22) over to the .23 machine.
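
Roughly like this, assuming the metadata directory is /data/hadoop/dfs/name (a made-up path; use whatever dfs.namenode.name.dir points to):

NN_DIR=/data/hadoop/dfs/name   # hypothetical; read the real value from hdfs-site.xml
scp -r svr2.master.hadoop.xx:"$NN_DIR/current" "$NN_DIR/"

For an HA pair, the supported way to rebuild a namenode from its peer is hdfs namenode -bootstrapStandby, run on the machine being rebuilt.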

The result:

hadoop namenode now started, but it still kept looping with errors:

 

18/10/25 23:50:27 INFO ipc.Server: IPC Server handler 14 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from 10.28.0.21:41567 Call#25812 Retry#13: org.apache.hadoop.ipc.StandbyException: Operation category WRITE is not supported in state standby
18/10/25 23:50:27 INFO ipc.Client: Retrying connect to server: svr2.master.hadoop.xx/10.28.0.22:9000. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/10/25 23:50:28 INFO ipc.Server: IPC Server handler 13 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from 10.28.0.21:41567 Call#25813 Retry#7: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
18/10/25 23:50:28 INFO ipc.Client: Retrying connect to server: svr2.master.hadoop.xx/10.28.0.22:9000. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/10/25 23:50:29 INFO ipc.Client: Retrying connect to server: svr2.master.hadoop.xx/10.28.0.22:9000. Already tried 6 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/10/25 23:50:30 INFO ipc.Client: Retrying connect to server: svr2.master.hadoop.xx/10.28.0.22:9000. Already tried 7 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/10/25 23:50:31 INFO ipc.Client: Retrying connect to server: svr2.master.hadoop.xx/10.28.0.22:9000. Already tried 8 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
18/10/25 23:50:32 INFO ipc.Server: IPC Server handler 13 on 9000, call org.apache.hadoop.hdfs.protocol.ClientProtocol.renewLease from 10.28.0.23:46477 Call#20330 Retry#11: org.apache.hadoop.ipc.StandbyException: Operation category WRITE is not supported in state standby
18/10/25 23:50:32 INFO ipc.Client: Retrying connect to server: svr2.master.hadoop.xx/10.28.0.22:9000. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)

 

It still looks like the connection to the .22 machine is the problem: telnet to .22 on port 9000 does not go through from here. The StandbyException lines also show this namenode came up in standby state, and since it cannot reach its peer on .22, clients are still left without an active namenode.

No idea how to get that connection open.

 

Forget it.

 

minReplication is 2 and I can only ever get 1 replica written, so what if I just set minReplication to 1?

Then search the Hadoop conf folder for which file contains the replication setting:

fgrep "replication" -n *.xml

hdfs-site.xml has it:

<name>dfs.replication.min</name> 

Change its value to 1.

Do this on all three machines.
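
The resulting block in hdfs-site.xml looks like this (a value of 1 means a write succeeds as soon as one replica lands; the namenode should still bring under-replicated blocks up to the dfs.replication target in the background):

<property>
  <name>dfs.replication.min</name>
  <value>1</value>
</property>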

 

Then restart the cluster with stop-all.sh and start-all.sh.
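
That is:

stop-all.sh
start-all.sh
jps   # check on each node that the expected daemons came back

(stop-all.sh/start-all.sh restart everything; for HDFS alone, stop-dfs.sh/start-dfs.sh would also do.)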

 

Run the bash Hive script again.

Success.