HBase restart error: "add to deadNodes and continue"

It started when the server disk filled up. There wasn't much data on it, but the senior engineer wasn't around to investigate, so as a lowly ops person I simply expanded the disk, and a sad story began from there...

After the expansion I started Hadoop first. Once the HDFS nodes and YARN all checked out, I started HBase, ran hbase shell to get into the HBase command line, and executed list to look at the tables (sometimes HBase goes half-dead: the processes are still there but you cannot list tables). That is where the error showed up...
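
For reference, the start-and-verify sequence was roughly the following (a minimal sketch, assuming the standard Hadoop and HBase start scripts are on the PATH; exact script locations vary by installation):

# start HDFS and YARN first, then HBase
start-dfs.sh
start-yarn.sh
start-hbase.sh

# sanity-check from the HBase shell: if the master is really serving, `list` returns the tables
hbase shell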

I first checked the HBase Master web UI on port 60010:

[Screenshot: HBase Master web UI on port 60010, RegionServer list empty]

As you can see, not a single RegionServer had registered...

So I went through the RegionServer logs; here is a partial excerpt:

2020-09-23 16:14:32,724 WARN  [regionserver/h3/20.88.0.204:16020] regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2020-09-23 16:14:35,725 INFO  [regionserver/h3/20.88.0.204:16020] regionserver.HRegionServer: reportForDuty to master=h1,16000,1600847708094 with port=16020, startcode=1600847708936
2020-09-23 16:14:35,726 WARN  [regionserver/h3/20.88.0.204:16020] regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.
2020-09-23 16:14:38,727 INFO  [regionserver/h3/20.88.0.204:16020] regionserver.HRegionServer: reportForDuty to master=h1,16000,1600847708094 with port=16020, startcode=1600847708936

From the look of it the RegionServers could not reach the HMaster, but the 60010 page itself seemed fine, so I went to check the HMaster log (partial error content):

2020-09-23 10:56:12,957 INFO  [h1:16000.activeMasterManager] hdfs.DFSClient: Could not obtain BP-1598488466-192.168.1.202-1589938686053:blk_1075553367_1906199 from any node: java.io.IOException: No live nodes contain current block No live nodes contain current block Block locations: 20.88.0.203:50010 Dead nodes:  20.88.0.203:50010. Will get new block locations from namenode and retry...
2020-09-23 10:56:12,957 WARN  [h1:16000.activeMasterManager] hdfs.DFSClient: DFS chooseDataNode: got # 1 IOException, will wait for 2981.932633213984 msec.
2020-09-23 10:56:15,940 WARN  [h1:16000.activeMasterManager] hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.io.IOException: Got error for OP_READ_BLOCK, self=/20.88.0.202:53036, remote=/20.88.0.203:50010, for file /hbase/MasterProcWALs/state-00000000000000005943.log, for pool BP-1598488466-192.168.1.202-1589938686053 block 1075553367_1906199
	at org.apache.hadoop.hdfs.RemoteBlockReader2.checkSuccess(RemoteBlockReader2.java:445)
	at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:410)
	at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReader(BlockReaderFactory.java:787)
	at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:666)
	at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:326)
	at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:570)
	at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:793)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:840)
	at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:648)
	at java.io.FilterInputStream.read(FilterInputStream.java:83)
	at com.google.protobuf.AbstractParser.parsePartialDelimitedFrom(AbstractParser.java:232)
	at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:253)
	at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:259)
	at com.google.protobuf.AbstractParser.parseDelimitedFrom(AbstractParser.java:49)
	at org.apache.hadoop.hbase.protobuf.generated.ProcedureProtos$ProcedureWALHeader.parseDelimitedFrom(ProcedureProtos.java:3870)
	at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFormat.readHeader(ProcedureWALFormat.java:138)
	at org.apache.hadoop.hbase.procedure2.store.wal.ProcedureWALFile.open(ProcedureWALFile.java:76)
	at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLog(WALProcedureStore.java:1027)
	at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.initOldLogs(WALProcedureStore.java:990)
	at org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.recoverLease(WALProcedureStore.java:302)
	at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.start(ProcedureExecutor.java:516)
	at org.apache.hadoop.hbase.master.HMaster.startProcedureExecutor(HMaster.java:1253)
	at org.apache.hadoop.hbase.master.HMaster.startServiceThreads(HMaster.java:1165)
	at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:749)
	at org.apache.hadoop.hbase.master.HMaster.access$600(HMaster.java:199)
	at org.apache.hadoop.hbase.master.HMaster$2.run(HMaster.java:1871)
	at java.lang.Thread.run(Thread.java:748)
2020-09-23 10:56:15,940 WARN  [h1:16000.activeMasterManager] hdfs.DFSClient: Failed to connect to /20.88.0.203:50010 for block, add to deadNodes and continue. java.io.IOException: Got error for OP_READ_BLOCK, self=/20.88.0.202:53036, remote=/20.88.0.203:50010, for file /hbase/MasterProcWALs/state-00000000000000005943.log, for pool BP-1598488466-192.168.1.202-1589938686053 block 1075553367_1906199
java.io.IOException: Got error for OP_READ_BLOCK, self=/20.88.0.202:53036, remote=/20.88.0.203:50010, for file /hbase/MasterProcWALs/state-00000000000000005943.log, for pool BP-1598488466-192.168.1.202-1589938686053 block 1075553367_1906199

At this point things got strange: HDFS was healthy, yet the HMaster was complaining that it could not connect to a DataNode? I went back to the 50070 web UI to check HDFS status and also checked HDFS file integrity; everything looked fine.
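
For completeness, these are the kinds of checks I mean (a sketch, not the exact commands from that day; it assumes the hdfs client is on the PATH):

# report live/dead DataNodes, capacity and last-contact times
hdfs dfsadmin -report

# run a block-level integrity check over the whole namespace
hdfs fsck / -files -blocks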

Since the error said the DataNode could not be reached, one of the two sides had to be at fault. Most search results for this error tell you to check the hosts file and other network settings, which clearly was not my case, since all the nodes are virtual machines on the same physical server... though I checked them anyway.

Since the HMaster log did not explain much, I went through the DataNode log, and there I finally found something (partial excerpt):

2020-09-23 15:49:54,726 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: h1/20.88.0.202:8020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2020-09-23 15:49:54,726 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in offerService
java.io.IOException: Failed on local exception: java.io.IOException: 打开的文件过多; Host Details : local host is: "h3/20.88.0.204"; destination host is: "h1":8020; 
	at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:776)
	at org.apache.hadoop.ipc.Client.call(Client.java:1480)
	at org.apache.hadoop.ipc.Client.call(Client.java:1413)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
	at com.sun.proxy.$Proxy15.sendHeartbeat(Unknown Source)
	at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:152)
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:402)
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:500)
	at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:659)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: 打开的文件过多
	at sun.nio.ch.IOUtil.makePipe(Native Method)
	at sun.nio.ch.EPollSelectorImpl.<init>(EPollSelectorImpl.java:65)
	at sun.nio.ch.EPollSelectorProvider.openSelector(EPollSelectorProvider.java:36)
	at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get(SocketIOWithTimeout.java:409)
	at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:325)
	at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:203)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
	at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
	at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:615)
	at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:713)
	at org.apache.hadoop.ipc.Client$Connection.access$2900(Client.java:376)
	at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
	at org.apache.hadoop.ipc.Client.call(Client.java:1452)
	... 8 more
2020-09-23 15:49:55,727 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: h1/20.88.0.202:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2020-09-23 15:49:56,728 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: h1/20.88.0.202:8020. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2020-09-23 15:49:57,523 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: meta file /data/hdfs/current/BP-1598488466-192.168.1.202-1589938686053/current/finalized/subdir27/subdir160/blk_1075552308_1905140.meta is missing!
2020-09-23 15:49:57,524 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: opReadBlock BP-1598488466-192.168.1.202-1589938686053:blk_1075552308_1905140 received exception java.io.FileNotFoundException: /data/hdfs/current/BP-1598488466-192.168.1.202-1589938686053/current/finalized/subdir27/subdir160/blk_1075552308_1905140.meta (打开的文件过多)

"Too many open files"??? (打开的文件过多 is the JVM's localized message for "Too many open files".)

I asked the senior engineer about it; he had not seen this either. Searching around, the common advice was to raise the maximum number of open files (the default limit is 1024).

Before this, the cluster had been load-tested with far more data (several hundred million rows) without ever hitting this problem, and right now it only holds a hundred thousand or so rows, so in theory this should not be the bottleneck. Still, I followed the advice.

Following the same tuning we apply for Elasticsearch, I raised the maximum open-file count and related limits as follows:

echo '* soft nofile 65536
* hard nofile 131072
* soft nproc 4096
* hard nproc 4096' >> /etc/security/limits.conf

echo 'vm.max_map_count=655360' >> /etc/sysctl.conf
sysctl -p

After running this, ulimit -n in the current session still showed 1024; the new limits only take effect after disconnecting and logging back in.
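
After reconnecting, a quick way to confirm the new limits took hold (a small sketch; the numbers should match whatever was written to limits.conf):

# maximum open file descriptors for this session (should now show 65536)
ulimit -n

# maximum user processes (the nproc setting above)
ulimit -u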

Then I stopped HBase and Hadoop and restarted everything. Sure enough, that error was gone, but a new problem appeared:
[Screenshot: HMaster log loading procedure WAL files from HDFS one by one]

It kept loading log files from HDFS non-stop. I copied the path and looked: the files were numbered from roughly 3000 up to 13000, about ten thousand log files in total, and HBase was loading one every two seconds or so...
Screenshots of the first and last of these files in HDFS:

[Screenshot: HDFS listing showing the earliest log file]

[Screenshot: HDFS listing showing the latest log file]
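
The same scale can be checked from the command line (a sketch; the path comes from the HMaster log above):

# count the procedure WAL files (the first output line is a "Found N items" header)
hdfs dfs -ls /hbase/MasterProcWALs | wc -l

# total size of the directory, to see how much log data the master has to replay
hdfs dfs -du -s -h /hbase/MasterProcWALs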

Then it suddenly clicked: HDFS had far too many of these small files (around 30 MB each). Too many small files forces the NameNode to hold a large amount of metadata in memory, and since HBase has to read every one of these logs on startup, the DataNode apparently could not keep up...

But I was not sure whether HBase could still run without these files, so I stopped HBase and moved the directory out of the way.

The directory is /hbase/MasterProcWALs; I renamed it with a -bak suffix.
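
The exact commands are not reproduced in the screenshot above, but the move amounts to something like this (a hypothetical sketch; the -bak name is just a convention, and the scripts are assumed to be on the PATH):

# stop HBase first so the master is not holding the procedure WAL
stop-hbase.sh

# rename the procedure WAL directory instead of deleting it, so it can be restored if needed
hdfs dfs -mv /hbase/MasterProcWALs /hbase/MasterProcWALs-bak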
Then I started HBase again. Watching the logs, HBase threw a lot of errors about not finding those files, but that did not stop it from starting, and after a while both RegionServers registered...

I ran hbase shell and list again and everything was back to normal.
Then I ran scan 'table name', {LIMIT => 1}; the data came back fine. Recovered.
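
As a copy-pasteable sketch of that final verification (the table name is a placeholder for an actual table):

hbase shell
hbase(main):001:0> list
hbase(main):002:0> scan 'table_name', {LIMIT => 1}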

I then looked up what this directory is for: it is the HMaster's procedure WAL store, which records the state of master operations such as table DDL, somewhat like MySQL's binlog. Moving it away discards that saved procedure state, which is why HBase could still start, just with a lot of complaints.
