After a Region Server Exits Unexpectedly…

Original post: http://www.spnguru.com/2011/04/region-server%E6%84%8F%E5%A4%96%E9%80%80%E5%87%BA%E4%B9%8B%E5%90%8E/

The morning at work had started out fine when nagios suddenly reported that one of our regionservers was down, and things got busy fast.

I logged in to the machine and found this line in the log:

2011-04-08 04:02:22,083 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer: ZooKeeper session expired

After that, the regionserver exited as if it had every right to.

So I checked the code and found the following in org.apache.hadoop.hbase.regionserver.HRegionServer.java:

/**
* We register ourselves as a watcher on the master address ZNode. This is
* called by ZooKeeper when we get an event on that ZNode. When this method
* is called it means either our master has died, or a new one has come up.
* Either way we need to update our knowledge of the master.
* @param event WatchedEvent from ZooKeeper.
*/
public void process(WatchedEvent event) {
	EventType type = event.getType();
	KeeperState state = event.getState();
	LOG.info("Got ZooKeeper event, state: " + state + ", type: " +
		type + ", path: " + event.getPath());
	// Ignore events if we're shutting down.
	if (stopRequested.get()) {
		LOG.debug("Ignoring ZooKeeper event while shutting down");
		return;
	}
	if (state == KeeperState.Expired) {
		LOG.error("ZooKeeper session expired");
		boolean restart =
			this.conf.getBoolean("hbase.regionserver.restart.on.zk.expire", false);
		if (restart) {
			restart();
		} else {
			abort();
		}
	} else if (type == EventType.NodeDeleted) {
		watchMasterAddress();
	} else if (type == EventType.NodeCreated) {
		getMaster();
		// ZooKeeper watches are one time only, so we need to re-register our watch.
		watchMasterAddress();
	}
}


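The comment spells it out. A regionserver registers itself as a watcher on the master's address znode in ZooKeeper, so that it is notified promptly when the master dies or a new master comes up. Staying connected to ZooKeeper is therefore essential for a regionserver: it has to send heartbeats to the master periodically, and if it cannot learn about a master change in time, it loses contact with the master and turns into a zombie process.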

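So by default, a regionserver that runs into this situation chooses to exit.

Why would the session between the regionserver and ZooKeeper expire? Possible causes:
1. A bad network.
2. A Java full GC, which blocks all threads. If the pause lasts long enough, the session expires.

What can be done about it?
1. Increase the ZooKeeper session timeout.
2. Set "hbase.regionserver.restart.on.zk.expire" to true. With this setting, a regionserver that sees "ZooKeeper session expired" will restart instead of abort.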
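
To tell whether long thread stalls (cause 2) are actually happening, one common trick is to sleep for a short, fixed interval and measure how much longer than requested the sleep took. The following is only a minimal sketch of that idea: the class PauseDetector, its method names, the 500 ms interval and the 10 s warning threshold are all assumptions made for illustration and are not part of HBase. A large overshoot shows that the JVM was stalled, but not why (a full GC, swapping and CPU starvation all look the same here).

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of HBase): detects JVM stalls by measuring sleep overshoot. */
final class PauseDetector {

    /** Time spent beyond the requested sleep, never negative. */
    static long overshootMillis(long requestedSleepMs, long elapsedMs) {
        return Math.max(0L, elapsedMs - requestedSleepMs);
    }

    /** Sleeps once and returns how much longer than requested the thread was away. */
    static long measureOnce(long sleepMs) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(sleepMs);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        return overshootMillis(sleepMs, elapsedMs);
    }

    public static void main(String[] args) throws InterruptedException {
        long sleepMs = 500L;     // assumed sampling interval
        long warnMs = 10_000L;   // assumed threshold; pick something well below the session timeout
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        for (int i = 0; i < samples; i++) {
            long stall = measureOnce(sleepMs);
            if (stall > warnMs) {
                System.err.println("Stalled for about " + stall + " ms beyond the requested sleep");
            }
        }
    }
}
```

In a real process this loop would run in a daemon thread for the lifetime of the JVM; the sample count here only keeps the sketch finite.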

The concrete configuration is to add the following to hbase-site.xml:

<property>
	<name>zookeeper.session.timeout</name>
	<value>90000</value>
	<description>ZooKeeper session timeout.
		HBase passes this to the zk quorum as suggested maximum time for a
		session.  See http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
		"The client sends a requested timeout, the server responds with the
		timeout that it can give the client. The current implementation
		requires that the timeout be a minimum of 2 times the tickTime
		(as set in the server configuration) and a maximum of 20 times
		the tickTime." Set the zk ticktime with hbase.zookeeper.property.tickTime.
		In milliseconds.
	</description>
</property>
<property>
	<name>hbase.regionserver.restart.on.zk.expire</name>
	<value>true</value>
	<description>
		By default, an expired ZooKeeper session forces the regionserver to exit.
		Enabling this makes the regionserver restart instead.
	</description>
</property>
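
The description quoted in the first property implies that the value you request is not necessarily the value you get: the server keeps it between 2 and 20 times its tickTime. The sketch below only works through that arithmetic. The class and method names are made up for illustration, and the tickTime of 3000 ms is an assumed value, so check the tickTime your own quorum uses.

```java
/** Hypothetical helper: applies the 2x..20x tickTime rule quoted in the property description. */
final class SessionTimeout {

    /** Returns the timeout the server would grant under the quoted rule, in milliseconds. */
    static long negotiated(long requestedMs, long tickTimeMs) {
        long min = 2 * tickTimeMs;   // lower bound from the quoted rule
        long max = 20 * tickTimeMs;  // upper bound from the quoted rule
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        long requested = 90_000L;  // zookeeper.session.timeout from the snippet above
        long tickTime = 3_000L;    // assumed tickTime, for illustration only
        System.out.println("granted: " + negotiated(requested, tickTime) + " ms");
    }
}
```

Under these assumptions the requested 90000 ms would be cut down to 60000 ms, so raising zookeeper.session.timeout may have no effect unless hbase.zookeeper.property.tickTime is raised along with it.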

To keep threads suspended by a Java full GC from disrupting the ZooKeeper heartbeat, we also need to configure hbase-env.sh.

export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError \
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"

Change it to:

export HBASE_OPTS="$HBASE_OPTS -XX:+HeapDumpOnOutOfMemoryError \
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled \
-XX:CMSInitiatingOccupancyFraction=70 \
-XX:+UseCMSInitiatingOccupancyOnly -XX:+UseParNewGC -Xmn256m"
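
One way to sanity-check a set of JVM flags before putting them into production is to start a small Java program with the same options and print what the JVM reports about itself. The sketch below uses only the standard java.lang.management API; the class name GcFlagsCheck is made up for illustration. If the JVM rejects one of the flags, it refuses to start at all, which is itself a useful signal.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: prints the JVM arguments and the collectors actually in use. */
final class GcFlagsCheck {

    /** Arguments the JVM was started with, e.g. the -XX flags. */
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    /** Names of the garbage collectors this JVM is running with. */
    static List<String> collectorNames() {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("JVM arguments: " + jvmArguments());
        System.out.println("Collectors:    " + collectorNames());
    }
}
```

A possible way to run it is `java $HBASE_OPTS GcFlagsCheck` after sourcing the same environment the regionserver uses. With the CMS flags accepted, the collector names are typically reported as ParNew and ConcurrentMarkSweep, though the exact names depend on the JVM version.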

For more on HBase performance tuning, see
http://wiki.apache.org/hadoop/PerformanceTuning