Tracking down a problem after the HBase meta region's server goes down

qq_31384327

Tags: HBase, zk, virtual machine

Article link: https://blog.csdn.net/qq_31384327/article/details/83940615

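I recently ran stability tests on HBase. Because host resources were limited, I built a distributed HBase cluster out of several virtual machines. I accidentally stopped one of the VMs, and when I then started HBase the whole cluster failed to come up: HMaster reported a network exception and exited immediately.
So I started tracing the HMaster startup source code.
The rough HMaster startup steps are:
1. Connect to zk and create a watcher on the master node
2. Check whether the root region exists
3. Start tracking the root node and the meta node in zk
4. Assign root to its region server
5. Assign meta to its region server
Startup was stuck at step 5, assigning meta to its region server.
Looking at the source: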
    private HRegionInterface getMetaServerConnection(boolean refresh)
    throws IOException, InterruptedException {
        synchronized (metaAvailable) {
            if (metaAvailable.get()) {
                HRegionInterface current = getCachedConnection(metaLocation);
                if (!refresh) {
                    return current;
                }
                if (verifyRegionLocation(current, this.metaLocation, META_REGION)) {
                    return current;
                }
                resetMetaLocation();
            }
            HRegionInterface rootConnection = getRootServerConnection();
            if (rootConnection == null) {
                return null;
            }
            HServerAddress newLocation = MetaReader.readMetaLocation(rootConnection);
            if (newLocation == null) {
                return null;
            }

            HRegionInterface newConnection = getCachedConnection(newLocation); // IOException is thrown here
            if (verifyRegionLocation(newConnection, this.metaLocation, META_REGION)) {
                setMetaLocation(newLocation);
                return newConnection;
            }
            return null;
        }
    }

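Replaying the failure:
While the HBase cluster was stopped, the VM of the region server hosting META went down. When HMaster started, it looked up the address of the region server hosting meta in the root table and tried to open a connection to it. That region server was already down, so the connection failed and a network-level IOException was thrown, which prevented HMaster from starting.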
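
The replay above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `MetaFailureReplay`, `connect`, `locateMetaDuringStartup` and the server names `rs1` to `rs3` are hypothetical and are not HBase APIs. It shows only how a stale address read from the root table turns into an unhandled IOException during startup.

```java
import java.io.IOException;
import java.util.Set;

/** Toy replay of the failure. None of these names are HBase APIs. */
class MetaFailureReplay {
    /** Hypothetical: region servers that currently accept connections. */
    private final Set<String> liveServers;
    /** Hypothetical: the meta location recorded in the root table before shutdown. */
    private final String metaLocationInRoot;

    MetaFailureReplay(Set<String> liveServers, String metaLocationInRoot) {
        this.liveServers = liveServers;
        this.metaLocationInRoot = metaLocationInRoot;
    }

    /** Stands in for opening a connection; a dead host surfaces as an IOException. */
    String connect(String server) throws IOException {
        if (!liveServers.contains(server)) {
            throw new IOException("cannot connect to " + server);
        }
        return "connection:" + server;
    }

    /** Stands in for the startup step that locates meta; the exception is not handled. */
    String locateMetaDuringStartup() throws IOException {
        return connect(metaLocationInRoot);
    }

    public static void main(String[] args) {
        // rs2 hosted meta when the cluster was stopped, and its VM is now down.
        MetaFailureReplay master = new MetaFailureReplay(Set.of("rs1", "rs3"), "rs2");
        try {
            master.locateMetaDuringStartup();
            System.out.println("master started");
        } catch (IOException e) {
            System.out.println("master startup aborted: " + e.getMessage());
        }
    }
}
```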
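The probability of this failure is low, because it only happens when the server hosting meta dies while the HBase cluster is stopped, which leaves the meta table pointing at a region server that no longer exists. With very large data volumes, however, the meta table would also split and could end up spread across several region servers, which makes the failure more likely.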
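
As a rough back-of-the-envelope illustration of why spreading meta over more servers raises the risk, assume (this is an assumption, not something measured here) that each region server hosting a meta region is independently unreachable at startup with probability p, and that meta is spread over k such servers:

```latex
P(\text{at least one meta host unreachable}) = 1 - (1 - p)^{k}
% Example with the assumed value p = 0.01:
%   k = 1:  1 - 0.99^{1} = 0.01
%   k = 5:  1 - 0.99^{5} \approx 0.049
```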
So I am considering changing the code to catch the IOException and assign meta to a different region server:

    HRegionInterface newConnection = null;
    try {
        newConnection = getCachedConnection(newLocation);
    } catch (IOException e) {
        // Pick one of the region servers that has already started successfully
        // to take over the meta table

    }
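
The selection logic inside that catch block might look roughly like the following. This is a minimal sketch, not HBase code: `MetaFallbackSketch`, `connect` and `chooseMetaHost` are hypothetical names, servers are modeled as plain strings, and the list of candidate servers is assumed to be supplied by the caller. It only shows how a replacement host could be picked; actually reassigning the meta region to that server is outside the sketch.

```java
import java.io.IOException;
import java.util.List;
import java.util.Set;

/** Sketch of the fallback idea. None of these names are HBase APIs. */
class MetaFallbackSketch {
    /** Hypothetical: region servers that currently accept connections. */
    private final Set<String> liveServers;

    MetaFallbackSketch(Set<String> liveServers) {
        this.liveServers = liveServers;
    }

    /** Stands in for opening a connection; a dead host surfaces as an IOException. */
    void connect(String server) throws IOException {
        if (!liveServers.contains(server)) {
            throw new IOException("cannot connect to " + server);
        }
    }

    /**
     * Returns the server that should host meta: the recorded one if reachable,
     * otherwise the first reachable candidate, or null if none is reachable.
     */
    String chooseMetaHost(String recorded, List<String> candidates) {
        try {
            connect(recorded);
            return recorded;
        } catch (IOException e) {
            for (String candidate : candidates) {
                try {
                    connect(candidate);
                    return candidate;
                } catch (IOException ignored) {
                    // This candidate is also unreachable; try the next one.
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        MetaFallbackSketch sketch = new MetaFallbackSketch(Set.of("rs1", "rs3"));
        // rs2 is the recorded meta host but is down, so a live candidate is chosen.
        System.out.println(sketch.chooseMetaHost("rs2", List.of("rs2", "rs1", "rs3")));
    }
}
```

Returning null when no candidate is reachable mirrors how `getMetaServerConnection` already returns null when no location is available, so the caller can retry later instead of exiting.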