Background:
The cluster had been running normally. As demand grew, the data volume increased sharply, and after that the DataNodes began going down frequently, a batch at a time, with a different set of nodes failing on each occurrence.
Examining a DataNode log showed the following:
java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.3.20:10010 remote=/192.168.3.8:44300]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:508)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
at java.lang.Thread.run(Thread.java:745)
2018-08-09 09:50:22,193 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode14:10010:DataXceiver error processing READ_BLOCK operation src: /192.168.3.8:44300 dst: /192.168.3.20:10010
java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.3.20:10010 remote=/192.168.3.8:44300]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:508)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
at java.lang.Thread.run(Thread.java:745)
2018-08-09 09:50:33,217 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to send data: java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.3.20:10010 remote=/192.168.3.8:44313]
2018-08-09 09:50:33,217 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.3.20:10010, dest: /192.168.3.8:44313, bytes: 3170304, op: HDFS_READ, cliID: DFSClient_attempt_20180809094616_0013_m_000268_0_-86635497_202, offset: 4194304, srvID: ac0dbce0-bb78-4832-a007-e83e2193128f, blockid: BP-1970763073-192.168.3.2-1472438963698:blk_1194008763_120310850, duration: 78482413413
2018-08-09 09:50:33,217 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.3.20, datanodeUuid=ac0dbce0-bb78-4832-a007-e83e2193128f, infoPort=10075, ipcPort=10020, storageInfo=lv=-55;cid=CID-b318428f-0cb7-4ec2-98ef-44b0441c82b2;nsid=896056531;c=0):Got exception while serving BP-1970763073-192.168.3.2-1472438963698:blk_1194008763_120310850 to /192.168.3.8:44313
java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.3.20:10010 remote=/192.168.3.8:44313]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:508)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
at java.lang.Thread.run(Thread.java:745)
2018-08-09 09:50:33,217 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode14:10010:DataXceiver error processing READ_BLOCK operation src: /192.168.3.8:44313 dst: /192.168.3.20:10010
java.net.SocketTimeoutException: 10000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/192.168.3.20:10010 remote=/192.168.3.8:44313]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:508)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:110)
at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:68)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:234)
at java.lang.Thread.run(Thread.java:745)
2018-08-09 09:50:44,908 INFO org.apache.hadoop.util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 1778ms
GC pool 'PS MarkSweep' had collection(s): count=1 time=2273ms
2018-08-09 09:50:44,908 ERROR org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: Error compiling report
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.getDiskReport(DirectoryScanner.java:545)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:422)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:403)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:359)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.io.UnixFileSystem.resolve(UnixFileSystem.java:108)
at java.io.File.<init>(File.java:262)
at java.io.File.listFiles(File.java:1212)
at org.apache.hadoop.fs.FileUtil.listFiles(FileUtil.java:1162)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:595)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.compileReport(DirectoryScanner.java:610)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:585)
at org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:570)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
The first thing that stands out: 10000 millis timeout while waiting for channel to be ready for write
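What that message means: the DataNode waited 10 seconds for the client's socket to become writable and then gave up. The same condition can be reproduced in miniature with a plain socket whose peer never reads; this is an illustrative sketch, not Hadoop code:

```python
import socket

def send_until_timeout(timeout_s=0.5, chunk=64 * 1024, max_chunks=2048):
    """Write to a peer that never reads. Once both socket buffers fill,
    the next write blocks until the timeout fires -- the same condition
    the DataNode hit while 'waiting for channel to be ready for write'."""
    writer, reader = socket.socketpair()
    writer.settimeout(timeout_s)  # plays the role of dfs.datanode.socket.write.timeout
    sent = 0
    try:
        for _ in range(max_chunks):
            sent += writer.send(b"x" * chunk)
    except socket.timeout:
        return sent, True  # timed out: the peer (a slow reader) never drained the stream
    finally:
        writer.close()
        reader.close()
    return sent, False

if __name__ == "__main__":
    sent, timed_out = send_until_timeout()
    print(f"sent {sent} bytes, timed out: {timed_out}")
```

In the cluster's case the "peer that never reads" was a client (or a GC-paused JVM) that stopped draining the read stream quickly enough, so the DataNode's 10-second write timeout fired.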
This suggested the failures were caused by writes waiting too long on the socket. Checking the HDFS configuration file hdfs-site.xml turned up two key settings:
<property>
  <name>dfs.client.socket-timeout</name>
  <value>10000</value>
  <description>Down the DFS timeout from 60 to 10 seconds.</description>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>10000</value>
  <description>Down the DFS timeout from 8 * 60 to 10 seconds.</description>
</property>
Both values were set to 10000 (milliseconds), which looked suspicious, so the next step was to check what these settings mean. They are the HDFS client and DataNode socket read/write timeouts:
dfs.client.socket-timeout (default 60000 ms, i.e. 60 seconds)
dfs.datanode.socket.write.timeout (default 480000 ms, i.e. 8 minutes)
At 10000 ms, the cluster's settings were indeed far below the defaults.
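Comparing configured values against the shipped defaults can be scripted. A small sketch, assuming the standard hdfs-site.xml layout; the default values are the ones quoted above, and the sample document is hypothetical:

```python
import xml.etree.ElementTree as ET

# Shipped defaults in milliseconds, per the discussion above.
TIMEOUT_KEYS = {
    "dfs.client.socket-timeout": 60_000,
    "dfs.datanode.socket.write.timeout": 480_000,
}

def audit_timeouts(xml_text):
    """Return {key: (configured_ms, default_ms)} for each timeout key
    present in an hdfs-site.xml document."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name in TIMEOUT_KEYS:
            found[name] = (int(prop.findtext("value")), TIMEOUT_KEYS[name])
    return found

if __name__ == "__main__":
    # Hypothetical sample mirroring the cluster's configuration.
    sample = """<configuration>
      <property><name>dfs.client.socket-timeout</name><value>10000</value></property>
      <property><name>dfs.datanode.socket.write.timeout</name><value>10000</value></property>
    </configuration>"""
    for key, (cur, default) in audit_timeouts(sample).items():
        print(f"{key}: {cur} ms (default {default} ms)")
```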
As a fix, dfs.client.socket-timeout was raised to 15000 and dfs.datanode.socket.write.timeout to 30000.
After restarting HDFS and observing for half a day, the problem was resolved.
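The resulting hdfs-site.xml fragment with the values chosen above (still well below the stock defaults, so slower networks may need further tuning):

```xml
<property>
  <name>dfs.client.socket-timeout</name>
  <value>15000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>30000</value>
</property>
```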