[HBase命案系列] HBase regionserver被强制关闭问题分析
一.案发现场
2018-09-06 15:51:59,685 ERROR [regionserver/node-210:16020.logRoller] regionserver.LogRoller: Log rolling failed
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
at java.lang.String.substring(String.java:1967)
at org.apache.hadoop.hbase.wal.AbstractFSWALProvider.getWALPrefixFromWALName(AbstractFSWALProvider.java:515)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.preLogRoll(ReplicationSourceManager.java:555)
at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceWALActionListener.preLogRoll(ReplicationSourceWALActionListener.java:50)
at org.apache.hadoop.hbase.wal.DisabledWALProvider$DisabledWAL.rollWriter(DisabledWALProvider.java:126)
at org.apache.hadoop.hbase.wal.DisabledWALProvider$DisabledWAL.rollWriter(DisabledWALProvider.java:144)
at org.apache.hadoop.hbase.regionserver.LogRoller.run(LogRoller.java:184)
at java.lang.Thread.run(Thread.java:748)
2018-09-06 15:51:59,707 ERROR [regionserver/node-210:16020.logRoller] regionserver.HRegionServer: ***** ABORTING region server node-210,16020,1536216717222: Log rolling failed *****
二、案情分析:在使用多线程对HBase进行高频数据插入时,RegionServer无故被停止服务!需要找出真凶!从日志当中发现疑犯LogRoller,RegionServer被杀死前有疑犯留下的痕迹“String index out of range: -1”。对此进行进一步推理。
三、案情推理:
1.根据痕迹,找到帮凶,源码 AbstractFSWALProvider.java
.....
@VisibleForTesting
public static final String META_WAL_PROVIDER_ID = ".meta";
.....
public static String getWALPrefixFromWALName(String name) {
int endIndex = name.replaceAll(META_WAL_PROVIDER_ID, "").lastIndexOf(".");
return name.substring(0, endIndex);
}
进一步确定,endIndex出问题了,死亡的信息来源于此。
2.查看其老大ReplicationSourceManager.java
public void preLogRoll(Path newLog) throws IOException {
String logName = newLog.getName();
String logPrefix = AbstractFSWALProvider.getWALPrefixFromWALName(logName);
......
发现凶器可能来源于logName,因为log name不存在,而导致字符数组越界!
3.追踪凶器来源
public void preLogRoll(Path oldPath, Path newPath) throws IOException {
manager.preLogRoll(newPath);
}
再查看其死者生前的日志记录hbase-hadoop-regionserver-node-218.log
2018-09-04 18:22:23,832 INFO [RS_OPEN_META-regionserver/node-210:16020-0] wal.AbstractFSWAL: WAL configuration: blocksize=256 MB, rollsize=128 MB, prefix=node-210%2C16020%2C1536056537213.meta, suffix=.meta, logDir=hdfs://node-210:9000/hbase/WALs/node-210,16020,1536056537213, archiveDir=hdfs://node-210:9000/hbase/oldWALs
2018-09-04 18:22:23,847 INFO [RS_OPEN_META-regionserver/node-210:16020-0] wal.AbstractFSWAL: New WAL /hbase/WALs/node-210,16020,1536056537213/node-210%2C16020%2C1536056537213.meta.1536056543835.meta
初步判断.meta的WAL文件可能出问题,导致数组越界。
四、案情复现
因HDFS上面的.meta文件出问题(被删除?memstore flush时被改名?)
1.使用ycsb工具对HBase不间断进行上亿条数据插入
../bin/ycsb load hbase20 -P workloada -threads 100 -p table=t1 -p columnfamily=family -p recordcount=100000000 -s
2.删除HDFS的hbase的.meta表
]# su - hadoop -c "/usr/local/hadoop/bin/hdfs dfs -rm /hbase/WALs/node-218,16020,1536235084000/node-218%2C16020%2C1536235084000.meta.1536237574487.meta"
18/09/06 20:41:13 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /hbase/WALs/node-218,16020,1536235084000/node-218%2C16020%2C1536235084000.meta.1536237574487.meta
查看regionserver日志
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /hbase/WALs/gz-52-218,16020,1536235084000/gz-52-218%2C16020%2C1536235084000.meta.1536237591217.meta (inode 22978): File does not exist. [Lease. Holder: DFSClient_NONMAPREDUCE_95794702_1, pendingcreates: 3]
······
2018-09-06 20:41:14,281 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
案发现场初步复现,因.meta.文件出错,导致HRegionServer被杀死。
后续需要对.meta.凶器进行进一步分析,达到最终结案。
【未完持续】