问题描述:
在SparkStreaming长时间写入HBase的时候,会下面的异常问题:
2017-12-24 23:20:34 [ SparkListenerBus:540107357 ] - [ ERROR ] Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[ip:50010,DS-d9caacf5-a95a-45ab-8231-95decdbe4889,DISK], DatanodeInfoWithStorage[ip:50010,DS-7e2e14d9-3d8b-412d-bf38-3d2930a83d2f,DISK]], original=[DatanodeInfoWithStorage[ip:50010,DS-d9caacf5-a95a-45ab-8231-95decdbe4889,DISK], DatanodeInfoWithStorage[ip:50010,DS-7e2e14d9-3d8b-412d-bf38-3d2930a83d2f,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1191)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1265)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1433)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1147)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:632)
2017-12-24 23:20:34 [ SparkListenerBus:540107357 ] - [ ERROR ] Listener EventLoggingListener threw an exception
java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[ip:50010,DS-d9caacf5-a95a-45ab-8231-95decdbe4889,DISK], DatanodeInfoWithStorage[ip:50010,DS-7e2e14d9-3d8b-412d-bf38-3d2930a83d2f,DISK]], original=[DatanodeInfoWithStorage[ip:50010,DS-d9caacf5-a95a-45ab-8231-95decdbe4889,DISK], DatanodeInfoWithStorageip:50010,DS-7e2e14d9-3d8b-412d-bf38-3d2930a83d2f,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:1191)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1265)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1433)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:1147)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:632)
根据异常栈的信息,DataNode写入策略问题导致失败
然后寻找源码位置在dfsclient中,发现是客户端在pipeline写数据块时候的问题,也出现了两个相关的参数:
dfs.client.block.write.replace-datanode-on-failure.enable=true
如果在写pipeline中存在一个DataNode或者网络故障时,那么DFSClient将尝试从pipeline中删除失败的DataNode,然后继续尝试剩下的DataNodes进行写入。结果,pipeline中的DataNodes的数量在减少。该特性是在pipeline中添加新的DataNodes。这是一个site-wide属性来enable/disable该特性。当集群规模非常小时,例如3个节点或更少时,集群管理员可能希望将策略设置为NEVER在默认配置文件或禁用该特性。否则,因为找不到新的DataNode来替换,用户可能会经历异常高的pipeline错误
dfs.client.block.write.replace-datanode-on-failure.policy=DEFAULT
这个属性只有在dfs.client.block.write.replace-datanode-on-failure.enable设置true时有效:
ALWAYS:当一个存在的DataNode被删除时,总是添加一个新的DataNode
NEVER:永远不添加新的DataNode
DEFAULT:副本数是r,DataNode的数时n,只要r >= 3时,或者floor(r/2)大于等于n时,r>n时再添加一个新的DataNode,并且这个块是hflushed/appended
conf.set("dfs.client.block.write.replace-datanode-on-failure.policy","NEVER");
conf.set("dfs.client.block.write.replace-datanode-on-failure.enable","true");