Troubleshooting HBase RegionServers that keep going down on their own

This post documents the troubleshooting of repeated automatic RegionServer shutdowns in an HBase cluster. Adjusting the JVM parameters resolved the abnormal RegionServer terminations caused by memory pressure and garbage collection.

I have recently been tuning an HBase cluster of 10 nodes. The services came up fine, but as soon as data was being written, RegionServers kept going down on their own. The RegionServer log looked like this:

2016-05-04 13:29:09,690 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] wal.ProtobufLogWriter: Failed to write trailer, non-fatal, continuing...
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/oldWALs/ma7.cloud%2C16020%2C1461926336242.default.1462336775368 (inode 294646): File is not open for writing. Holder DFSClient_NONMAPREDUCE_-309271655_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3454)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3354)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:823)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

    at org.apache.hadoop.ipc.Client.call(Client.java:1411)
    at org.apache.hadoop.ipc.Client.call(Client.java:1364)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy16.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolTranslatorPB.java:393)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy17.getAdditionalDatanode(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:279)
    at com.sun.proxy.$Proxy18.getAdditionalDatanode(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:1028)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:1184)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:933)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:487)
2016-05-04 13:29:09,692 ERROR [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.HRegionServer: Shutdown / close of WAL failed: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /apps/hbase/data/oldWALs/ma7.cloud%2C16020%2C1461926336242.default.1462336775368 (inode 294646): File is not open for writing. Holder DFSClient_NONMAPREDUCE_-309271655_1 does not have any open files.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3454)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalDatanode(FSNamesystem.java:3354)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getAdditionalDatanode(NameNodeRpcServer.java:823)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getAdditionalDatanode(ClientNamenodeProtocolServerSideTranslatorPB.java:515)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2151)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2147)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2145)

2016-05-04 13:29:09,702 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.Leases: regionserver/ma7.cloud/192.168.1.46:16020 closing leases
2016-05-04 13:29:09,702 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.Leases: regionserver/ma7.cloud/192.168.1.46:16020 closed leases
2016-05-04 13:29:09,702 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] hbase.ChoreService: Chore service for: ma7.cloud,16020,1461926336242 had [[ScheduledChore: Name: ma7.cloud,16020,1461926336242-MemstoreFlusherChore Period: 10000 Unit: MILLISECONDS], [ScheduledChore: Name: MovedRegionsCleaner for region ma7.cloud,16020,1461926336242 Period: 120000 Unit: MILLISECONDS]] on shutdown
2016-05-04 13:29:09,702 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.CompactSplitThread: Waiting for Split Thread to finish...
2016-05-04 13:29:09,703 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.CompactSplitThread: Waiting for Merge Thread to finish...
2016-05-04 13:29:09,703 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.CompactSplitThread: Waiting for Large Compaction Thread to finish...
2016-05-04 13:29:09,703 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.CompactSplitThread: Waiting for Small Compaction Thread to finish...
2016-05-04 13:29:09,703 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:10,703 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:12,704 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:16,704 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:18,007 INFO  [regionserver/ma7.cloud/192.168.1.46:16020.leaseChecker] regionserver.Leases: regionserver/ma7.cloud/192.168.1.46:16020.leaseChecker closing leases
2016-05-04 13:29:18,008 INFO  [regionserver/ma7.cloud/192.168.1.46:16020.leaseChecker] regionserver.Leases: regionserver/ma7.cloud/192.168.1.46:16020.leaseChecker closed leases
2016-05-04 13:29:24,704 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:24,704 ERROR [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: ZooKeeper getChildren failed after 4 attempts
2016-05-04 13:29:24,704 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.ZKUtil: regionserver:16020-0x15460f0ceb70046, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, baseZNode=/hbase-unsecure Unable to list children of znode /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:454)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchThem(ZKUtil.java:482)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenBFSAndWatchThem(ZKUtil.java:1461)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1383)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1265)
    at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeAllQueues(ReplicationQueuesZKImpl.java:187)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:292)
    at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:180)
    at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:172)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2137)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1071)
    at java.lang.Thread.run(Thread.java:745)
2016-05-04 13:29:24,705 ERROR [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.ZooKeeperWatcher: regionserver:16020-0x15460f0ceb70046, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, baseZNode=/hbase-unsecure Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/replication/rs/ma7.cloud,16020,1461926336242
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:1472)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getChildren(RecoverableZooKeeper.java:295)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchForNewChildren(ZKUtil.java:454)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenAndWatchThem(ZKUtil.java:482)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.listChildrenBFSAndWatchThem(ZKUtil.java:1461)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursivelyMultiOrSequential(ZKUtil.java:1383)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNodeRecursively(ZKUtil.java:1265)
    at org.apache.hadoop.hbase.replication.ReplicationQueuesZKImpl.removeAllQueues(ReplicationQueuesZKImpl.java:187)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:292)
    at org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:180)
    at org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:172)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2137)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1071)
    at java.lang.Thread.run(Thread.java:745)
2016-05-04 13:29:24,705 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] ipc.RpcServer: Stopping server on 16020
2016-05-04 13:29:24,705 INFO  [RpcServer.listener,port=16020] ipc.RpcServer: RpcServer.listener,port=16020: stopping
2016-05-04 13:29:24,706 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopped
2016-05-04 13:29:24,706 INFO  [RpcServer.responder] ipc.RpcServer: RpcServer.responder: stopping
2016-05-04 13:29:24,706 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:25,706 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:27,707 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:31,707 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:39,707 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=mapping2.100.cloud:2181,mapping1.100.cloud:2181,mapping3.100.cloud:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
2016-05-04 13:29:39,707 ERROR [regionserver/ma7.cloud/192.168.1.46:16020] zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 4 attempts
2016-05-04 13:29:39,707 WARN  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.HRegionServer: Failed deleting my ephemeral node
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/rs/ma7.cloud,16020,1461926336242
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.delete(ZooKeeper.java:873)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.delete(RecoverableZooKeeper.java:178)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1221)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.deleteNode(ZKUtil.java:1210)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.deleteMyEphemeralNode(HRegionServer.java:1403)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1079)
    at java.lang.Thread.run(Thread.java:745)
2016-05-04 13:29:39,708 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.HRegionServer: stopping server ma7.cloud,16020,1461926336242; zookeeper connection closed.
2016-05-04 13:29:39,708 INFO  [regionserver/ma7.cloud/192.168.1.46:16020] regionserver.HRegionServer: regionserver/ma7.cloud/192.168.1.46:16020 exiting
2016-05-04 13:29:39,708 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:68)
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:87)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2651)
2016-05-04 13:29:39,710 INFO  [Thread-7] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@7a7471ce
2016-05-04 13:29:39,710 INFO  [Thread-7] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2016-05-04 13:29:39,710 INFO  [Thread-7] regionserver.ShutdownHook: Shutdown hook finished.


Analysis:

At first glance this looks like a ZooKeeper problem. Looking at the monitoring, however, free memory occasionally dropped to zero while network traffic was high (data was being written at the time). That points to long stop-the-world GC pauses under memory pressure, which let the ZooKeeper session expire and made the RegionServer abort itself. The common advice found online for this problem is to adjust the JVM parameters.
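To confirm that long stop-the-world GC pauses were really the trigger, one quick check (a sketch of my own, not part of the original fix; the flags are standard JDK 8 GC-logging flags and the log paths are the HDP defaults, so treat them as assumptions) is to enable GC logging on the RegionServer JVM and line long pauses up with the "Session expired" entries in its log:

# Append to HBASE_REGIONSERVER_OPTS in hbase-env (same Ambari template syntax as below):
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:{{log_dir}}/gc-regionserver.log"

# After reproducing the write load, look for multi-second pauses...
grep "Total time for which application threads were stopped" /var/log/hbase/gc-regionserver.log | tail -n 20
# ...and correlate their timestamps with session expirations in the RegionServer log
grep "Session expired" /var/log/hbase/hbase-hbase-regionserver-*.log | tail -n 20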


Changes made through Ambari:

Before:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xmn{{regionserver_xmn_size}} -XX:CMSInitiatingOccupancyFraction=70  -Xms{{regionserver_heapsize}} -Xmx{{regionserver_heapsize}} $JDK_DEPENDED_OPTS"
After:
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxTenuringThreshold=3 -XX:SurvivorRatio=8 -XX:+UseG1GC -XX:MaxGCPauseMillis=50 -XX:InitiatingHeapOccupancyPercent=75 -XX:NewRatio=39 -Xms{{regionserver_heapsize}} -Xmx{{regionserver_heapsize}} $JDK_DEPENDED_OPTS"

Before:
export HBASE_OPTS="$HBASE_OPTS -XX:+UseConcMarkSweepGC -XX:ErrorFile={{log_dir}}/hs_err_pid%p.log -Djava.io.tmpdir={{java_io_tmpdir}}"
After:
export HBASE_OPTS="$HBASE_OPTS -XX:ErrorFile={{log_dir}}/hs_err_pid%p.log -Djava.io.tmpdir={{java_io_tmpdir}}"
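After restarting the RegionServers from Ambari, it is worth confirming that the new options actually reached the JVM. A minimal sketch (not part of the original article; the -Dproc_regionserver marker in the command line and the availability of jcmd are assumptions about this particular setup):

# Find the RegionServer PID (HBase's launcher adds -Dproc_regionserver to the command line)
RS_PID=$(pgrep -f proc_regionserver | head -n 1)
# List the -X/-XX options the process was actually started with
ps -o args= -p "$RS_PID" | tr ' ' '\n' | grep -E '^-(X|XX)'
# Or dump the effective VM flags (run as the same user the RegionServer runs as)
jcmd "$RS_PID" VM.flags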


Explanation: the heap size was set to 40 GB, the young generation to about 1 GB, and the garbage collector was switched to G1.

-XX:NewRatio=39 is the ratio of the young generation to the rest of the heap (old and permanent generations), so the young generation works out to 1/(39+1) * 40 GB = 1 GB.

With the default CMS collector, exceptions kept occurring and the RegionServer would kill itself.

-Xmn{{regionserver_xmn_size}} sets the young-generation size explicitly; it may no longer apply under G1 (which sizes the young generation dynamically), so it was removed.
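With the new settings in place, GC behaviour can be watched while the write load runs again, to make sure young-generation collections stay short and the ZooKeeper session no longer expires. A rough monitoring sketch (my own, not from the original article; RS_PID found as in the verification step above, log path is the HDP default):

# Print GC utilization and cumulative GC times every 5 seconds (YGCT/FGCT/GCT columns are in seconds)
jstat -gcutil "$RS_PID" 5000
# In parallel, watch for any remaining session expirations while data is written
tail -f /var/log/hbase/hbase-hbase-regionserver-*.log | grep --line-buffered "Session expired"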


References:

http://www.cnblogs.com/chengxin1982/p/3818448.html

http://www.cnblogs.com/zhenjing/archive/2012/11/13/hbase_is_OK.html
