Hadoop cluster DataNode error: 10000 millis timeout while waiting for channel to be ready for write

Background:

The cluster had been running normally. As requirements grew, the data volume increased sharply, and after that the DataNodes started going down frequently. Each failure took out a whole batch of nodes at once, and the affected nodes were different every time.

Analysis of the DataNode logs turned up the following:

java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.3.20:10010 remote=/192.168.3.8:44300]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:508)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
        at java.lang.Thread.run(Thread.java:745)
2018-08-09 09:50:22,193 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode14:10010:DataXceiver error processing READ_BLOCK operation  src: /192.168.3.8:44300 dst: /192.168.3.20:10010
java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.3.20:10010 remote=/192.168.3.8:44300]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:508)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
        at java.lang.Thread.run(Thread.java:745)
2018-08-09 09:50:33,217 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to send data: java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.3.20:10010 remote=/192.168.3.8:44313]
2018-08-09 09:50:33,217 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.3.20:10010, dest: /192.168.3.8:44313, bytes: 3170304, op: HDFS_READ, cliID: DFSClient_attempt_20180809094616_0013_m_000268_0_-86635497_202, offset: 4194304, srvID: ac0dbce0-bb78-4832-a007-e83e2193128f, blockid: BP-1970763073-192.168.3.2-1472438963698:blk_1194008763_120310850, duration: 78482413413
2018-08-09 09:50:33,217 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.3.20, datanodeUuid=ac0dbce0-bb78-4832-a007-e83e2193128f, infoPort=10075, ipcPort=10020, storageInfo=lv=-55;cid=CID-b318428f-0cb7-4ec2-98ef-44b0441c82b2;nsid=896056531;c=0):Got exception while serving BP-1970763073-192.168.3.2-1472438963698:blk_1194008763_120310850 to /192.168.3.8:44313
java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.3.20:10010 remote=/192.168.3.8:44313]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:508)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
        at java.lang.Thread.run(Thread.java:745)
2018-08-09 09:50:33,217 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode14:10010:DataXceiver error processing READ_BLOCK operation  src: /192.168.3.8:44313 dst: /192.168.3.20:10010
java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.3.20:10010 remote=/192.168.3.8:44313]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:508)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
        at java.lang.Thread.run(Thread.java:745)
2018-08-09 09:50:44,908 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1778ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=2273ms
2018-08-09 09:50:44,908 ERROR org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: Error compiling report
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.getDiskReport(DirectoryScanner.java:545)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:422)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:403)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:359)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.io.UnixFileSystem.resolve(UnixFileSystem.java:108)
        at java.io.File.<init>(File.java:262)
        at java.io.File.listFiles(File.java:1212)
        at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1162)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:595)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:585)
        at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:570)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        ... 3 more

The first thing that stands out is: 10000 millis timeout while waiting for channel to be ready for write

This suggested the failures were caused by waiting too long on the channel while writing data, so I checked the HDFS configuration file hdfs-site.xml.

Two key settings stood out:

  <property>
    <name>dfs.client.socket-timeout</name>
    <value>10000</value>
    <description>Down the DFS timeout from 60 to 10 seconds.</description>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>10000</value>
    <description>Down the DFS timeout from 8 * 60 to 10 seconds.</description>
  </property>

Both values had been set to 10000 ms, which made these settings the prime suspects, so I looked up what they mean:

These are the HDFS client/DataNode socket read and write timeouts, in milliseconds:

dfs.client.socket-timeout (default 60000, i.e. 60 seconds)

dfs.datanode.socket.write.timeout (default 480000, i.e. 8 * 60 seconds)

For this cluster, 10 seconds is clearly too small.

As a fix, I raised dfs.client.socket-timeout to 15000 and dfs.datanode.socket.write.timeout to 30000, as sketched below.
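A minimal sketch of the adjusted section of hdfs-site.xml (values are in milliseconds; the descriptions are mine, and everything else in the file is assumed unchanged):

  <property>
    <name>dfs.client.socket-timeout</name>
    <value>15000</value>
    <description>Socket read timeout, raised from 10 to 15 seconds.</description>
  </property>
  <property>
    <name>dfs.datanode.socket.write.timeout</name>
    <value>30000</value>
    <description>DataNode socket write timeout, raised from 10 to 30 seconds.</description>
  </property>

Both values are still well below the upstream defaults (60 seconds and 8 minutes), so there is room to raise them further if the timeouts come back.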

After restarting HDFS and observing the cluster for half a day, the problem no longer occurred.
