当一个HDFS系统同时处理许多个并行的put操作,往HDFS上传数据 时,有时候会出现dfsclient 端发生socket 链接超时的报错,有的时候甚至会由于这种原因导致最终的put操作失败,造成数据上传不完整。
log类似如下:
All datanodes *** are bad. Aborting...
类似这样的错误,常常会在并行的put操作比较多,比如 60-80个,每个put的数据量约100G的时候,产生类似的错误,错误出现以后,比较好一点的情况是DFSClient端会报出一些列的错误log,如:
error Recovery for block block_-13954o849583405 bad datanode ** "
Bad response for block block_-254u94545923 from datanode ***
10/01/18 18:48:00 WARN hdfs.DFSClient: Error Recovery for block blk_6828192944006126093_201296138 bad datanode[0] 172.23.115.79:50010
10/01/18 18:48:00 WARN hdfs.DFSClient: Error Recovery for block blk_6828192944006126093_201296138 in pipeline 172.23.115.79:50010, 172.23.115.68:50010: bad datanode 172.23.115.79:50010
10/01/18 18:48:27 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-1574627828968965286_201296769java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java .nio.channels.SocketChannel[connected local=/172.23.113.2:50391 remote=/172.23.114.41:50010]
at org.apache.Hadoop.net.SocketIOWithTimeo