Some approaches to fixing frequent HBase RegionServer crashes

Copyright notice: this is an original post by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original and this notice.
Original link: https://blog.csdn.net/xiaolong_4_2/article/details/84323990

HBase runs into a great many problems in practice, and the most common is the HBase RegionServer going down (abbreviated RS below).

The causes of RS crashes vary widely. This post is a brief summary of the RS crash scenarios I have encountered.
Insufficient reserved memory on the HBase cluster

Symptom: the HBase RS dies within 1-2 minutes of starting.

The log is a repetition of the following segment:

Mon Aug  6 10:23:54 CST 2018 Starting regionserver on node2.rosa.com

core file size          (blocks, -c) 0

data seg size           (kbytes, -d) unlimited

scheduling priority             (-e) 0

file size               (blocks, -f) unlimited

pending signals                 (-i) 127902

max locked memory       (kbytes, -l) 64

max memory size         (kbytes, -m) unlimited

open files                      (-n) 65536

pipe size            (512 bytes, -p) 8

POSIX message queues     (bytes, -q) 819200

real-time priority              (-r) 0

stack size              (kbytes, -s) 8192

cpu time               (seconds, -t) unlimited

max user processes              (-u) 2048

virtual memory          (kbytes, -v) unlimited

file locks                      (-x) unlimited

Analysis: this shows that what happened was only a phantom start of the HBase RS. The front end makes it look as if startup succeeded, but the process never actually came up, because the node cannot provide the memory the RS was configured to use.

Fix: reduce the value of HBase RegionServer Maximum Memory appropriately.
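As a rough sizing sketch (the node size and OS reservation below are assumptions for illustration, not values from this cluster), the RS heap can be capped so the OS and other daemons keep enough headroom:

```shell
# Hypothetical sizing: reserve memory for the OS and other services, then
# give the RegionServer at most half of what remains (illustrative rule of
# thumb, not an official formula).
total_mb=32768        # e.g. a 32 GB node (assumed)
os_reserved_mb=4096   # reserved for the OS and other daemons (assumed)
rs_heap_mb=$(( (total_mb - os_reserved_mb) / 2 ))
echo "RegionServer heap cap: ${rs_heap_mb} MB"
```

If the configured maximum exceeds what the node can actually provide, the RS aborts shortly after start, exactly as in the log above.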

 
Cluster clocks out of sync

Symptom: the RS dies after running for a while.

Log: ERROR [B.defaultRpcServer.handler=4,queue=1,port=16000] master.MasterRpcServices: Region server slave1,16020,1494163890158 reported a fatal error:

ABORTING region server slave1,16020,1494163890158: Unhandled: org.apache.hadoop.hbase.ClockOutOfSyncException: Server slave1,16020,1494163890158 has been rejected; Reported time is too far out of sync with master. Time difference of 52782ms > max allowed of 30000ms

Analysis: the clocks on different machines in the cluster differ too much; HBase requires tight clock synchronization across the cluster.

Fix:

    Check that the server entries in /etc/ntp.conf are sensible on every node.
    Check the NTP daemon's status with service ntpd status.
    Start it with service ntpd start.
    Run ntpdate -u node1 on all three nodes to synchronize the clocks immediately.
    Enable the ntpd service at boot on all nodes with chkconfig ntpd on.
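The rejection above is simple arithmetic: the master compares the reported clock difference against the maximum allowed skew (the hbase.master.maxclockskew property, 30000 ms by default). A minimal sketch using the values from the log:

```shell
# Values taken from the log above; the comparison mirrors the master's
# check against hbase.master.maxclockskew (default 30000 ms).
max_skew_ms=30000
drift_ms=52782
if [ "$drift_ms" -gt "$max_skew_ms" ]; then
  echo "RS rejected: clock skew ${drift_ms}ms > ${max_skew_ms}ms"
else
  echo "RS accepted"
fi
```

Raising maxclockskew only hides the symptom; keeping NTP healthy is the real fix.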

 
ZooKeeper connection limit set too low

Symptom: the RS dies after running for a while.

Log:

ERROR [regionserver/node1/101.12.38.119:16020] zookeeper.ZooKeeperWatcher: regionserver:16020-0x3651eb7c95b0006, quorum=node1:2181,node2:2181,node3.hde.h3c.com:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/node1,16020,1533819219139

The ZooKeeper log at the corresponding time shows:

2018-08-14 10:31:23,855 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /101.12.38.120 - max is 60

2018-08-14 10:31:23,935 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /101.12.38.120 - max is 60

2018-08-14 10:31:24,015 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /101.12.38.120 - max is 60

2018-08-14 10:31:24,037 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /101.12.38.120 - max is 60

2018-08-14 10:31:24,122 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too many connections from /101.12.38.120 - max is 60

2018-08-14 10:31:24,152 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@193] - Too

      

Analysis: the ZooKeeper connection limit is set too low; under high-concurrency HBase workloads the errors above appear.

Fix: add the following custom setting to ZooKeeper's configuration file zoo.cfg (the default used with HBase is 60):

Add maxClientCnxns=600, restart ZooKeeper, then start HBase.
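To see which client host is exhausting its quota, per-host connection counts can be tallied from ZooKeeper's four-letter `cons` command output. The sample input below is made up for illustration; in practice, pipe in the output of `echo cons | nc <zk-host> 2181` instead:

```shell
# Tally connections per client IP (sample data standing in for the real
# `cons` output, which lists one /ip:port entry per connection).
cons_sample='/101.12.38.120:41234
/101.12.38.120:41235
/101.12.38.119:50000'
printf '%s\n' "$cons_sample" | cut -d: -f1 | sort | uniq -c | sort -rn
```

A host whose count approaches maxClientCnxns is the one triggering the "Too many connections" warnings above.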

 
HBase cannot write (data cannot be persisted to HDFS)

日志:java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try

Analysis: writes fail. There are 3 datanodes and the replication factor is set to 3, so every write pipelines through all 3 machines. The default for replace-datanode-on-failure.policy is DEFAULT: when the cluster has 3 or more datanodes, a failed pipeline node is replaced by copying to another datanode. With only 3 machines there is no spare node, so as soon as any one datanode has a problem, writes can never succeed.

Fix:

Modify hdfs-site.xml, adding or changing the following two properties:

    <property>
        <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
        <value>true</value>
    </property>
    <property>
        <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
        <value>NEVER</value>
    </property>

 

    dfs.client.block.write.replace-datanode-on-failure.enable controls whether the client applies a replacement policy at all when a write fails; the default of true is fine.

    For dfs.client.block.write.replace-datanode-on-failure.policy, DEFAULT attempts to swap in a new datanode when there are 3 or more replicas, while with 2 replicas it simply continues writing without replacement. On a cluster with only 3 datanodes, a single unresponsive node breaks every write, so the replacement behavior can be turned off with NEVER.
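The pipeline-recovery dead end can be sketched numerically. This is a simplified model of the DEFAULT policy's effect on a fully-replicated small cluster, not the exact HDFS algorithm:

```shell
# With replication 3 on a 3-datanode cluster, the pipeline already uses
# every datanode; when one fails there is no spare node to swap in, so
# pipeline recovery cannot succeed (simplified model).
replication=3
cluster_datanodes=3
spare_nodes=$(( cluster_datanodes - replication ))
if [ "$replication" -ge 3 ] && [ "$spare_nodes" -lt 1 ]; then
  echo "no spare datanode: pipeline recovery fails"
fi
```

Adding a fourth datanode would give the replacement policy somewhere to go; setting the policy to NEVER is the configuration-only workaround.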

 
HBase read timeouts

日志:ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: xxx:50010:DataXceiver error processing WRITE_BLOCK operation  src: /xxx:52827 dst: /xxx:50010

java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/xxx:50010 remote=/xxx:52827]

Analysis: reads time out. The error above appears because all of the datanode's service threads are occupied, so requests wait until they exceed the timeout.

Fix:

    According to the datanode's resources, increase its service thread count appropriately: add or modify dfs.datanode.handler.count in hdfs-site.xml (default 10).
    Increase the client-side timeout dfs.client.socket-timeout (default 60000 ms).
    Increase the zookeeper session timeout value in the HBase configuration.
    Raise DataNode max data transfer threads to a reasonable value.
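Put together, the HDFS side of this tuning amounts to a few hdfs-site.xml properties. The values below are illustrative starting points, not prescriptions; adjust them to the hardware:

```shell
# Write an illustrative addendum file (merge these properties into your
# real hdfs-site.xml; the values are assumptions, not recommendations).
cat > hdfs-site-tuning.xml <<'EOF'
<property>
  <name>dfs.datanode.handler.count</name>
  <value>30</value>
</property>
<property>
  <name>dfs.client.socket-timeout</name>
  <value>120000</value>
</property>
<property>
  <name>dfs.datanode.max.transfer.threads</name>
  <value>8192</value>
</property>
EOF
grep -c '<name>' hdfs-site-tuning.xml
```

dfs.datanode.max.transfer.threads is the property behind the "DataNode max data transfer threads" setting mentioned above.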

 
HBase GC tuning

Analysis:

The RS log shows a session timeout: the service itself was actually healthy, but ZooKeeper judged it dead, so the RS killed itself.

Note that this error is still not the underlying cause. Reading further in the log:

2018-08-07 12:27:39,919 WARN  [regionserver/node1/210.26.111.41:16020] util.Sleeper: We slept 39000ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2018-08-07 12:27:39,920 INFO  [node1,16020,1533607130571_ChoreService_1] regionserver.HRegionServer$CompactionChecker: Chore: CompactionChecker missed its start time

2018-08-07 12:27:39,920 INFO  [main-SendThread(node2:2181)] zookeeper.ClientCnxn: Client session timed out, have not heard from server in 50201ms for sessionid 0x264ea4df477b894, closing socket connection and attempting reconnect

2018-08-07 12:27:39,920 WARN  [JvmPauseMonitor] util.JvmPauseMonitor: Detected pause in JVM or host machine (eg GC): pause of approximately 37173ms

GC pool 'ParNew' had collection(s): count=1 time=207ms

The log above shows that HBase put operations clearly suffer from Full GC, and the timeout discussed earlier was also caused by Full GC. In this state, requests sent by clients are blocked, and clients cannot write data to HBase normally.
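The mechanism is simple to state: a stop-the-world pause longer than the ZooKeeper session timeout makes the session expire while the process is still alive. A sketch with the pause from the log above and an assumed session timeout:

```shell
# gc_pause_ms comes from the JvmPauseMonitor line above; the session
# timeout here is an assumed configuration value, for illustration only.
gc_pause_ms=37173
zk_session_timeout_ms=30000
if [ "$gc_pause_ms" -gt "$zk_session_timeout_ms" ]; then
  echo "session expires mid-pause: the RS is declared dead and aborts"
fi
```

This is why both shortening GC pauses and lengthening the session timeout appear below as remedies: either side of the inequality can be moved.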

 

Changes:

HBase write tuning:

    Simple write tuning: raise HBase's write buffer to 55% of the heap and lower the read (block cache) buffer to 25% (the two together must not exceed 80%).
    Raise the regionserver handler count to its maximum.
    Adjust the pre-splitting of HBase tables.

ZooKeeper side:

    Lengthen HBase's ZooKeeper session timeout.

After these operations the site had still not returned to normal, so the following HBase JVM tuning was recommended.

The cause of the Full GCs: old-generation collection is slow.

 

1. HBase heap size

You can set export HBASE_HEAPSIZE=16384 for 16 GB of heap. The official documentation notes:

Thus, ~20-24Gb or less memory dedicated to one RS is recommended
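As a quick sanity check of the unit (a bare HBASE_HEAPSIZE value is interpreted as megabytes, so 16 GB is 16384):

```shell
# 16 GB expressed in MB for HBASE_HEAPSIZE in hbase-env.sh
# (bare numeric values are treated as megabytes).
heap_gb=16
export HBASE_HEAPSIZE=$(( heap_gb * 1024 ))
echo "HBASE_HEAPSIZE=${HBASE_HEAPSIZE}"
```

Staying under the ~20-24 GB guideline above keeps GC pauses manageable with the CMS settings that follow.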

2. GC parameter settings

HBase JVM tuning:

export HBASE_OPTS="$HBASE_OPTS -XX:+UseCompressedOops -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSClassUnloadingEnabled -XX:+UseCMSCompactAtFullCollection -XX:CMSFullGCsBeforeCompaction=0 -XX:+CMSParallelRemarkEnabled  -XX:CMSInitiatingOccupancyFraction=75 -XX:SoftRefLRUPolicyMSPerMB=0"

The parameters mean the following:

 

-XX:+UseCompressedOops

Use compressed ordinary object pointers to reduce memory usage.

 

-XX:+UseParNewGC

Use a parallel collector for the young generation.

 

-XX:+UseConcMarkSweepGC

Use the CMS collector for the old generation.

 

-XX:+CMSClassUnloadingEnabled

Unlike the parallel collectors, the CMS collector does not garbage-collect the permanent generation by default. To have it collect the permanent generation, set -XX:+CMSClassUnloadingEnabled. Early JVM versions additionally required -XX:+CMSPermGenSweepingEnabled. Note that even without this flag, a collection is attempted once the permanent generation runs out of space, but that collection is not concurrent; instead another Full GC is triggered.

 

-XX:+UseCMSCompactAtFullCollection

When using the concurrent collector, enable compaction of the old generation.

 

-XX:CMSFullGCsBeforeCompaction

Because the concurrent collector does not compact or defragment the memory it manages, fragmentation accumulates over time and efficiency drops. This value sets how many full GCs to run before the space is compacted.

 

-XX:+CMSParallelRemarkEnabled

Reduces remark pauses.

 

-XX:CMSInitiatingOccupancyFraction=75

With CMS as the collector, start a CMS collection once old-generation occupancy reaches 75%.

 

-XX:SoftRefLRUPolicyMSPerMB

The lifetime of SoftReferences per megabyte of free heap.

 
