Investigating YARN Task Delays on a Big Data Cluster

I. Background

        The BI cluster has more than 60 nodes and over 2 PB of data, and every machine has been in service for over three years.

II. Symptoms

        1. YARN tasks are severely delayed, sometimes to the point of timing out and failing.

        2. After a delayed YARN task is manually killed, a rerun usually succeeds.

III. Investigation approach

        At first we assumed the jobs were simply short of resources at run time, and kept treating it as a resource-shortage problem.

        1. Examine the logs of the delayed tasks.

        2. Examine the node logs.

        3. Analyze each job's tasks and containers; this revealed one node that took an exceptionally long time whenever a task ran on it.

        4. Conclude that this machine was probably at fault, and focus the investigation on it:

        Tracing that node's logs, the YARN logs looked basically normal, but the Hadoop DataNode log contained exceptions. Searching for those exception messages turned up no conclusive answer.
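
As a quick way to triage a DataNode log like the excerpts that follow, the exception headers can be tallied by class (a sketch; the log path is an assumption and will vary by installation):

```shell
# Tally exception types in a DataNode log.
LOG=/var/log/hadoop-hdfs/hadoop-hdfs-datanode.log   # assumed path

# Exception headers start at column 0 ("java...."); stack frames are
# indented, so a start-of-line anchor separates the two.
grep -E '^java\..*(Exception|Error)' "$LOG" \
  | cut -d: -f1 | sort | uniq -c | sort -rn
```

This shows at a glance which exception classes dominate on the suspect node.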

2021-09-27 16:03:12,467 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475378156_3517112822, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.InterruptedIOException: Interruped while waiting for IO on channel java.nio.channels.SocketChannel[connected 
local=/10.xx.xx.xx:50010 remote=/10.204.114.146:55280]. 447424 millis timeout left.
	at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:352)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
	at java.lang.Thread.run(Thread.java:745)

2021-09-27 16:26:06,623 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475387788_3517121113 src: /10.ee.ee.ee:19545 dest: /10.xx.xx.xx:50010
2021-09-27 16:26:07,228 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.ee.ee.ee:19545, dest: /10.xx.xx.xx:50010, bytes: 5730, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_-857996562_138, offset: 0, srvID: ff8d66b8-7176-4c2e-a530-6b5038d64e52, blockid: BP-1382344001-10.204.25.17-1458873906864:blk_4475387788_3517121113, duration: 604633141
2021-09-27 16:26:07,228 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475387788_3517121113, type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2021-09-27 16:26:07,399 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2207)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1165)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run(): 
java.nio.channels.ClosedByInterruptException
	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714, type=HAS_DOWNSTREAM_IN_PIPELINE
java.nio.channels.ClosedByInterruptException
	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2021-09-27 16:26:07,406 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1382344001-10.204.25.17-1458873906864:blk_4475385852_3517120714 received exception java.io.IOException: Connection reset by peer
2021-09-27 16:26:07,406 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode88.bi:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.ee.ee.ee:31015 dst: /10.xx.xx.xx:50010
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
	at java.lang.Thread.run(Thread.java:745)


2021-09-27 16:30:39,560 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,561 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run(): 
java.nio.channels.ClosedByInterruptException
	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022, type=HAS_DOWNSTREAM_IN_PIPELINE
java.nio.channels.ClosedByInterruptException
	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:478)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2021-09-27 16:30:39,561 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022 received exception java.io.IOException: Connection reset by peer
2021-09-27 16:30:39,561 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode88.bi:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.ee.ee.ee:2043 dst: /10.xx.xx.xx:50010
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
	at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:57)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.BufferedInputStream.read1(BufferedInputStream.java:284)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
	at java.io.DataInputStream.read(DataInputStream.java:149)
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:192)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 16:30:39,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475389081_3517122514 src: /10.ee.ee.ee:34679 dest: /10.xx.xx.xx:50010
2021-09-27 16:30:40,252 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022 src: /10.ee.ee.ee:6123 dest: /10.xx.xx.xx:50010
2021-09-27 16:30:40,252 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-1382344001-10.204.25.17-1458873906864:blk_4475378492_3517117022
2021-09-27 16:30:40,252 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_4475378492_3517117022, RBW
  getNumBytes()     = 44179902
  getBytesOnDisk()  = 44179902
  getVisibleLength()= 44179902
  getVolume()       = /data/dfs/data/current
  getBlockFile()    = /data/dfs/data/current/BP-1382344001-10.204.25.17-1458873906864/current/rbw/blk_4475378492
  bytesAcked=44179902
  bytesOnDisk=44179902

2021-09-27 18:27:02,533 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.EOFException: Premature EOF: no length prefix available
	at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:2207)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:176)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1165)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception for BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755
java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in BlockReceiver.run(): 
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755, type=HAS_DOWNSTREAM_IN_PIPELINE
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
	at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47)
	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
	at sun.nio.ch.IOUtil.write(IOUtil.java:65)
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
	at org.apache.hadoop.net.SocketOutputStream$Writer.performIO(SocketOutputStream.java:63)
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
	at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
	at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
	at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
	at java.io.DataOutputStream.flush(DataOutputStream.java:123)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstreamUnprotected(BlockReceiver.java:1389)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.sendAckUpstream(BlockReceiver.java:1328)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver$PacketResponder.run(BlockReceiver.java:1249)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755, type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2021-09-27 18:27:02,535 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opWriteBlock BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755 received exception java.io.IOException: Premature EOF from inputStream
2021-09-27 18:27:02,536 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode88.bi:50010:DataXceiver error processing WRITE_BLOCK operation  src: /10.ee.ee.ee:3268 dst: /10.xx.xx.xx:50010
java.io.IOException: Premature EOF from inputStream
	at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:194)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:213)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
	at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:109)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:467)
	at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:781)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:761)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
	at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
	at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:237)
	at java.lang.Thread.run(Thread.java:745)
2021-09-27 18:27:02,625 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer: Transmitted BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755 (numBytes=6635939) to /10.216.5.16:50010
2021-09-27 18:27:02,759 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755 src: /10.ee.ee.ee:3340 dest: /10.xx.xx.xx:50010
2021-09-27 18:27:02,759 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recover RBW replica BP-1382344001-10.204.25.17-1458873906864:blk_4475546259_3517286755
2021-09-27 18:27:02,759 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Recovering ReplicaBeingWritten, blk_4475546259_3517286755, RBW
  getNumBytes()     = 6635939
  getBytesOnDisk()  = 6635939
  getVisibleLength()= 6635939
  getVolume()       = /data/dfs/data/current
  getBlockFile()    = /data/dfs/data/current/BP-1382344001-10.204.25.17-1458873906864/current/rbw/blk_4475546259
  bytesAcked=6635939
  bytesOnDisk=6635939
2021-09-27 18:27:03,630 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475550789_3517286896 src: /10.ee.ee.ee:25889 dest: /10.xx.xx.xx:50010
2021-09-27 18:27:03,670 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1382344001-10.204.25.17-1458873906864:blk_4475550790_3517286897 src: /10.ee.ee.ee:48873 dest: /10.xx.xx.xx:50010
2021-09-27 18:27:03,673 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /10.ee.ee.ee:48873, dest: /10.xx.xx.xx:50010, bytes: 19994, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1365176392_177806, offset: 0, srvID: ff8d66b8-7176-4c2e-a530-6b5038d64e52, blockid: BP-1382344001-10.204.25.17-1458873906864:blk_4475550790_3517286897, duration: 2383047
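
The `clienttrace` lines in the same excerpt carry a per-block `duration` in nanoseconds (e.g. `duration: 604633141` is ~0.6 s for a 5730-byte write), so slow writes on a node can be surfaced directly. A sketch, again assuming the log path:

```shell
LOG=/var/log/hadoop-hdfs/hadoop-hdfs-datanode.log   # assumed path

# clienttrace lines carry "duration: <nanoseconds>"; flag HDFS writes
# slower than 0.5 s (5e8 ns) and print the duration in seconds.
awk '/clienttrace/ && /op: HDFS_WRITE/ {
  d = 0
  for (i = 1; i <= NF; i++)
    if ($i == "duration:") d = $(i + 1) + 0
  if (d > 5e8) print d / 1e9, "s:", $0
}' "$LOG"
```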

        1) Suspected data skew: a job's input data might be unevenly distributed and concentrated on this node, giving it far more to process and making it unusually slow. This was later ruled out.

        2) Suspected a configuration problem on this machine. Checking the various settings turned up inconsistent JDK versions: this machine ran OpenJDK 1.8 while most others ran Oracle JDK 1.8 (though some also ran OpenJDK 1.8). To rule out the JDK, we switched this machine to Oracle JDK 1.8 and restarted the services; the problem persisted. (Along the way we found several different minor JDK versions across the cluster. One has to ask: when expanding a cluster, shouldn't new nodes be kept consistent with the existing ones?)

        We also briefly suspected inconsistent CentOS kernel versions. A check found nothing wrong there, though one particular minor version is reportedly buggy.
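
Version drift like this, whether JDK builds or kernel minor versions, is easy to survey with a small loop, assuming passwordless SSH and a node-list file (the file paths below are assumptions):

```shell
HOSTS=/etc/hadoop/conf/slaves   # assumed node-list file, one host per line

# Collect the first line of `java -version` (printed on stderr) per host.
while read -r host; do
  ver=$(ssh -o ConnectTimeout=5 "$host" 'java -version 2>&1 | head -n1')
  printf '%s\t%s\n' "$host" "$ver"
done < "$HOSTS" > /tmp/jdk_versions.tsv

# Summarize: how many nodes run each distinct JDK build.
cut -f2 /tmp/jdk_versions.tsv | sort | uniq -c | sort -rn
```

The same pattern with `uname -r` in place of `java -version` surveys the kernel versions mentioned above.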

        3) Since the machine was old, we suspected a failing disk. Disk checks and full hardware diagnostics all came back normal, ruling out the disks.

        4) Rebooted the machine; the problem persisted. One observation stood out, though: while this machine was down for the reboot, cluster jobs ran fast. That made it all but certain that this single machine was dragging down the whole cluster.

        5) Suspected a network problem. Initial checks looked normal, but a sustained ping eventually revealed about 1% packet loss.
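
A handful of pings will usually show 0% loss; loss on the order of 1% only becomes visible over many probes. A sketch of such a sustained ping (the target address is a placeholder):

```shell
HOST=10.xx.xx.xx   # the suspect node (placeholder)

# 1000 probes at 0.2 s intervals (~3.5 minutes), then pull the loss
# figure out of ping's summary line.
ping -c 1000 -i 0.2 "$HOST" | awk '/packet loss/ {
  for (i = 1; i <= NF; i++)
    if ($i ~ /%/) print "packet loss:", $i
}'
```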

        6) That pointed to the network cable or the port itself. Switching to a different fiber port (from port A to port B) finally resolved the problem.

IV. Conclusions

        1. A sustained ping is a simple, effective way to judge whether a link is healthy.

        2. Network problems can surface as all sorts of unexpected failures.

        The symptom was failing YARN tasks, the telltale exceptions sat in the DataNode logs, and the root cause turned out to be a fiber port. The endings of these investigations always manage to surprise.
