I. Problem Symptoms
Some time earlier, ORC delta-file compaction had kept failing, leaving a huge number of delta files in HDFS and eventually making the table inaccessible. We managed to migrate the data to a new table and then dropped the old one. Unexpectedly, users soon reported that the entire HDFS cluster had become inaccessible.
A simple HDFS restart brought the system back to normal, but an hour later the whole cluster went down again.
II. Investigation
Inspecting the namenode logs revealed the following exceptions:
1. A large number of blocks that do not belong to any file
......
2018-04-25 15:08:41,271 INFO BlockStateChange (BlockManager.java:processReport(1922)) - BLOCK* processReport 0x373452f9be8730: blk_1094510209_20773700 on node 44.53.4.178:50010 size 26712 does not belong to any file
......
2. READ operations not supported in the standby state
2018-04-25 15:08:49,942 INFO ipc.Server (Server.java:run(2231)) - IPC Server handler 199 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.getListing from 44.53.4.151:52096 Call#3577 Retry#14: org.apache.hadoop.ipc.StandbyException: Operation category READ is not supported in state standby
3. Closed-channel exceptions
2018-04-25 15:08:53,378 INFO ipc.Server (Server.java:run(2279)) - IPC Server handler 787 on 8020 caught an exception
java.nio.channels.ClosedChannelException
    at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:270)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:461)
    at org.apache.hadoop.ipc.Server.channelWrite(Server.java:2768)
    at org.apache.hadoop.ipc.Server.access$2200(Server.java:139)
    at org.apache.hadoop.ipc.Server$Responder.processResponse(Server.java:1121)
    at org.apache.hadoop.ipc.Server$Responder.doRespond(Server.java:1193)
    at org.apache.hadoop.ipc.Server$Connection.sendResponse(Server.java:2134)
    at org.apache.hadoop.ipc.Server$Connection.access$400(Server.java:1261)
    at org.apache.hadoop.ipc.Server$Call.sendResponse(Server.java:644)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2268)
This appears to mean that by the time the namenode tried to write a response back to the datanode, the datanode had already closed the connection.
We then checked the datanode logs and found the following problems:
1. Large numbers of block deletions failing:
......
2018-04-23 23:17:51,875 INFO impl.FsDatasetImpl (FsDatasetImpl.java:invalidate(1897)) - Failed to delete replica blk_1092108254_18371745: ReplicaInfo not found.
......
2. Block reports failing to send:
2018-04-23 23:17:54,835 INFO datanode.DataNode (BPServiceActor.java:blockReport(340)) - Unsuccessfully sent block report 0x3538bf64a1717a, containing 10 storage report(s), of which we sent 0. The reports had 321118 total blocks and used 0 RPC(s). This took 91 msec to generate and 4005 msecs for RPC and NN processing. Got back no commands.
3. Timeouts while waiting to receive commands:
2018-04-23 23:17:54,836 WARN datanode.DataNode (BPServiceActor.java:offerService(560)) - IOException in offerService
java.net.SocketTimeoutException: Call From node151.gbase.hd/44.53.4.151 to node156.gbase.hd:8020 failed on socket timeout exception: java.net.SocketTimeoutException: 4000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/44.53.4.151:45744 remote=node156.gbase.hd/44.53.4.156:8020]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
    at sun.reflect.GeneratedConstructorAccessor11.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
    at org.apache.hadoop.ipc.Client.call(Client.java:1480)
    at org.apache.hadoop.ipc.Client.call(Client.java:1413)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
    at com.sun.proxy.$Proxy15.blockReport(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:202)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:311)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:534)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:659)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketTimeoutException: 4000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/44.53.4.151:45744 remote=node156.gbase.hd/44.53.4.156:8020]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at java.io.FilterInputStream.read(FilterInputStream.java:133)
    at org.apache.hadoop.ipc.Client$Connection$PingInputStream.read(Client.java:521)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1085)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:980)
To summarize, we were facing the following problems:
1. A huge number of blocks waiting to be deleted
2. Channel communication timeouts (4000 ms)
3. The namenode failing to send the pending deletion blocks list
4. The datanodes failing to send their block reports
III. Resolution
We changed the following Hadoop parameters in core-site.xml (see the sketch after this list):
1. Set ipc.client.rpc-timeout.ms to 0, meaning the RPC client never times out;
2. Set ipc.maximum.data.length to 100,000,000, so that datanode block reports are no longer rejected for being too large;
3. Set ipc.maximum.response.length to 100,000,000, so that the namenode's pending deletion blocks list can be delivered.
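A minimal core-site.xml sketch of the three settings above (property names and values are exactly those listed; note that XML values must be written without thousands separators):
<configuration>
  <!-- 0 disables the RPC client timeout entirely. -->
  <property>
    <name>ipc.client.rpc-timeout.ms</name>
    <value>0</value>
  </property>
  <!-- Maximum size of an incoming IPC request the server will accept;
       raised so that very large datanode block reports get through. -->
  <property>
    <name>ipc.maximum.data.length</name>
    <value>100000000</value>
  </property>
  <!-- Maximum size of an IPC response the client will accept; raised so
       that the namenode's huge pending-deletion reply can be received. -->
  <property>
    <name>ipc.maximum.response.length</name>
    <value>100000000</value>
  </property>
</configuration>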
IV. Analysis
1. Why did deleting a huge number of files cause these problems?
This comes down to how HDFS works. When a file is deleted, the namenode only removes the directory entry and records the file's blocks in a pending deletion blocks list. The actual delete commands, carrying entries from that list, are sent to each datanode in the reply to its next heartbeat.
Because we had deleted a huge number of files at once, this pending deletion blocks list grew extremely long, which is what caused the timeouts.
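As a side note on this mechanism: stock HDFS also bounds how many pending deletions ride on a single heartbeat reply, via dfs.block.invalidate.limit in hdfs-site.xml. The sketch below assumes a stock Hadoop 2.x build where this property defaults to 1000; the property and default are our assumption about the cluster's version, not something taken from the incident itself:
<!-- Upper bound on the number of blocks the namenode asks one datanode
     to invalidate per heartbeat reply; 1000 is the assumed default. -->
<property>
  <name>dfs.block.invalidate.limit</name>
  <value>1000</value>
</property>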
2. Why did the cluster stay healthy for an hour after the HDFS restart before failing again?
Because we had set dfs.namenode.startup.delay.block.deletion.sec (in hdfs-site.xml) to 3600, no block deletion takes place during the first hour after the namenode starts up.
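For reference, the corresponding hdfs-site.xml entry (parameter name and value as stated above):
<!-- Delay the start of block deletions for 3600 s (one hour)
     after namenode startup. -->
<property>
  <name>dfs.namenode.startup.delay.block.deletion.sec</name>
  <value>3600</value>
</property>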