HBase Cluster-Wide Outage Report (2016.7.13)

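On July 13, 2016, an HBase cluster went down completely because HDFS had entered safe mode. During recovery, HDFS appeared to have entered safe mode because of insufficient disk space; the actual problem was that the network mount point on Slave2.hadoop timed out. The proposed solution is NameNode HA with a JournalNode cluster to avoid similar problems.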

Scenario and Operation Log

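At around 10:50, operations staff notified us that all nodes of HBase cluster B were down. The following records every operation performed to recover the cluster.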

Open the HBase UI at http://192.168.3.146:60010/: unable to connect.
Open the hbase shell to check:

>status 'simple'
5 dead servers

All regionservers were indeed down, so we quickly restarted all of them:

service hbase-regionserver start

Checking the cluster status again with status 'simple' showed that the regionservers had not come up. The regionserver log showed:

2016-07-13 10:37:13,480 ERROR [regionserver60020] regionserver.HRegionServer: Failed init
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /hbase/WALs/Slave4.hadoop,60020,1468377431049. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1197)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3568)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3544)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:739)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:558)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)

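According to the log above, the regionservers failed to start because HDFS had entered safe mode. We turned to HDFS and checked its state with hdfs dfsadmin -safemode get, which confirmed that it was in safe mode. We ran hdfs dfsadmin -safemode leave to leave safe mode, then started all regionservers again.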
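The safe-mode check described above can be scripted. The following is a minimal sketch, assuming the `hdfs` CLI is on the PATH and that `hdfs dfsadmin -safemode get` prints a line containing "Safe mode is ON" or "Safe mode is OFF". The function names are illustrative and not part of the original incident procedure.

```python
import subprocess


def parse_safemode(output: str) -> bool:
    """Return True if the dfsadmin output reports that safe mode is ON."""
    return "safe mode is on" in output.lower()


def hdfs_in_safemode() -> bool:
    """Run `hdfs dfsadmin -safemode get` and parse its output.

    Requires the hdfs CLI; raises CalledProcessError if the command fails.
    """
    result = subprocess.run(
        ["hdfs", "dfsadmin", "-safemode", "get"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_safemode(result.stdout)
```

As the NameNode message in the log warns, leaving safe mode before resources are freed makes the NameNode return to safe mode immediately, so it is worth calling `hdfs_in_safemode()` again after `-safemode leave` and before restarting the regionservers.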
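Back in the hbase shell, all regionservers were now up. The web UI, however, showed that although the regionservers were up, their regions had not been brought online. Running hbase hbck reported more than 1,000 inconsistencies, so we ran the repair command hbase hbck -repair. Watching the web UI again, the regionservers were rapidly bringing their regions online. Once all regions were online and the inconsistencies were gone, the HBase cluster was back in service.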
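To track whether the inconsistencies are actually disappearing, the summary at the end of the hbck output can be parsed. This is a rough sketch, assuming the output ends with lines of the form "N inconsistencies detected." and "Status: OK" or "Status: INCONSISTENT"; the helper name and the sample text are hypothetical and not taken from the incident.

```python
import re


def parse_hbck_summary(output: str):
    """Return (inconsistency_count, status) from hbck output.

    Either element is None if the corresponding line is not found.
    """
    count = re.search(r"(\d+) inconsistencies detected", output)
    status = re.search(r"^Status:\s*(\w+)", output, re.MULTILINE)
    return (
        int(count.group(1)) if count else None,
        status.group(1) if status else None,
    )


# Hypothetical sample of the last lines of an hbck run.
sample = "1024 inconsistencies detected.\nStatus: INCONSISTENT\n"
print(parse_hbck_summary(sample))
```

Re-running `hbase hbck` and feeding its output to this function until the count reaches 0 gives a simple check that the cluster has converged.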

Failure Analysis

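With the HBase cluster recovered, we looked for what had caused such a severe cluster-wide failure. The regionserver log from the time of the crash reads as follows: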

2016-07-13 10:29:40,383 WARN [Thread-16] regionserver.HStore: Failed flushing store file, retrying num=0
java.io.IOException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file/hbase/data/default/platform_common_user_flow_consumer/e887812d9c1be58014f0733cf6e7b058/.tmp/72ace704aa374894afbabf2118225ebb. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1197)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2225)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2180)

Next:

2016-07-13 10:29:51,603 FATAL [Thread-16] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2016-07-13 10:29:51,660 INFO [Thread-16] regionserver.HRegionServer: STOPPED: Replay of HLog required. Forcing server shutdown
2016-07-13 10:29:51,663 INFO [RpcServer.handler=45,port=60020] ipc.RpcServer: RpcServer.handler=45,port=60020: exiting
2016-07-13 10:29:51,661 INFO [Priority.RpcServer.handler=0,port=60020] ipc.RpcServer: Priority.RpcServer.handler=0,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=3,port=60020] ipc.RpcServer: RpcServer.handler=3,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=49,port=60020] ipc.RpcServer: RpcServer.handler=49,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=52,port=60020] ipc.RpcServer: RpcServer.handler=52,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=53,port=60020] ipc.RpcServer: RpcServer.handler=53,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=55,port=60020] ipc.RpcServer: RpcServer.handler=55,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=56,port=60020] ipc.RpcServer: RpcServer.handler=56,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=0,port=60020] ipc.RpcServer: RpcServer.handler=0,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=59,port=60020] ipc.RpcServer: RpcServer.handler=59,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=11,port=60020] ipc.RpcServer: RpcServer.handler=11,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=10,port=60020] ipc.RpcServer: RpcServer.handler=10,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=64,port=60020] ipc.RpcServer: RpcServer.handler=64,port=60020: exiting
2016-07-13 10:29:51,661 INFO [RpcServer.handler=2,port=60020] ipc.RpcServer: RpcServer.handler=2,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=67,port=60020] ipc.RpcServer: RpcServer.handler=67,port=60020: exiting
2016-07-13 10:29:51,664 INFO [RpcServer.handler=69,port=60020] ipc.RpcServer: RpcServer.handler=69,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=66,port=60020] ipc.RpcServer: RpcServer.handler=66,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=65,port=60020] ipc.RpcServer: RpcServer.handler=65,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=63,port=60020] ipc.RpcServer: RpcServer.handler=63,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=62,port=60020] ipc.RpcServer: RpcServer.handler=62,port=60020: exiting
2016-07-13 10:29:51,663 INFO [RpcServer.handler=61,port=60020] ipc.RpcServer: RpcServer.handler=61,port=60020: exiting
2016-07-13 10:29:51,664 INFO [RpcServer.handler=79,port=60020] ipc.RpcServer: RpcServer.handler=79,port=60020: exiting

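The log above shows that the regionserver failed while flushing a store file to HDFS and then aborted: the HLog needed to be replayed, which forced a server shutdown. The culprit was HDFS being in safe mode, so everything pointed at the NameNode. The NameNode log reads as follows: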

2016-07-13 10:29:25,239 WARN org.apache.hadoop.hdfs.server.namenode.NameNodeResourceChecker: Space available on volume 'null' is 0, which is below the configured reserved amount 104857600
2016-07-13 10:29:25,239 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: NameNode low on available disk space. Entering safe mode.
2016-07-13 10:29:25,239 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for (journal JournalAndStream(mgr=FileJournalManager(root=/data/namenode/nfsmount/nn), stream=EditLogFileOutputStream(/data/namenode/nfsmount/nn/current/edits_inprogress_0000000000339257361)))
java.io.IOException: Input/output error
at sun.nio.ch.FileDispatcherImpl.size0(Native Method)
at sun.nio.ch.FileDispatcherImpl.size(FileDispatcherImpl.java:83)
at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:294)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.preallocate(EditLogFileOutputStream.java:219)
at org.apache.hadoop.hdfs.server.namenode.EditLogFileOutputStream.flushAndSync(EditLogFileOutputStream.java:202)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:112)
at org.apache.hadoop.hdfs.server.namenode.EditLogOutputStream.flush(EditLogOutputStream.java:106)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream$8.apply(JournalSet.java:498)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.mapJournalsAndReportErrors(JournalSet.java:358)
at org.apache.hadoop.hdfs.server.namenode.JournalSet.access$100(JournalSet.java:57)
at org.apache.hadoop.hdfs.server.namenode.JournalSet$JournalSetOutputStream.flush(JournalSet.java:494)
at org.apache.hadoop.hdfs.server.namenode.FSEditLog.logSync(FSEditLog.java:624)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2238)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2180)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:505)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:354)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos Cli
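
The NameNode log above reports 0 bytes available on the volume, below the reserved amount of 104857600 bytes (100 MB), at the same moment that writing to /data/namenode/nfsmount/nn fails with an I/O error. The sketch below illustrates the same kind of threshold check in Python. It is an analogy for monitoring purposes, not Hadoop's implementation; the function name is hypothetical, and treating an unreadable volume as 0 bytes free is an assumption made to mirror what the log shows.

```python
import shutil

# Reserved amount shown in the NameNode log (100 MB).
RESERVED_BYTES = 104857600


def volume_has_space(path: str, reserved: int = RESERVED_BYTES) -> bool:
    """Return True if the volume holding `path` has at least `reserved` bytes free.

    A volume that cannot be read (for example a missing or failed mount)
    is treated as having 0 bytes free, so the check fails.
    """
    try:
        free = shutil.disk_usage(path).free
    except OSError:
        free = 0
    return free >= reserved


if __name__ == "__main__":
    # Path taken from the NameNode log; adjust for the host being checked.
    print(volume_has_space("/data/namenode/nfsmount/nn"))
```

Running such a check periodically against each NameNode storage directory would flag a failing mount before the NameNode itself reacts. Note that a hard-mounted NFS volume that hangs may block the call instead of raising an error, so this sketch is not a complete health check.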
