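Our big data platform is deployed on Kingsoft Cloud. When a RegionServer (RS) process in the HBase cluster dies, it is not restarted automatically and has to be brought back up by hand. So I wrote a scheduled monitoring script that starts the RS as soon as it detects that the process is down, so that the failed node can keep serving HBase reads and writes. The benefit of restarting promptly is that it avoids putting excessive load on the other RSs: when the RS on one node dies, the HMaster reassigns the regions that RS was managing to the other healthy RegionServers. (The HMaster watches the ephemeral nodes that the RSs register in ZooKeeper; once it sees that an RS has gone offline, it reassigns that RegionServer's table regions to the healthy ones.)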
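My script runs a check every 30 minutes: each half hour it goes through the nodes kmr-core-machine-001-kingsoft to kmr-core-machine-008-kingsoft one by one, checks whether the RS process is running on each, and starts it if it is not.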
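The check-then-start loop described above can be sketched roughly as follows. This is not the original script, only a minimal illustration under stated assumptions: the function names `keep_inspecting` and `check_then_start_rs` mirror the names that appear in the log output, the cluster URL is taken from the request URL in that log, and the component name `HBASE_REGIONSERVER` and the `HostRoles` payload follow the usual Ambari REST conventions rather than anything the script itself shows. The actual HTTP calls are left to the caller (`is_running` and `start` are hypothetical callables) so the logic can be exercised without a live cluster.

```python
import json
import time
from typing import Callable, Iterable, List, Optional

# Cluster URL as it appears in the log output; adjust for your own Ambari server.
AMBARI = "http://localhost:8080/api/v1/clusters/ks-ksai_kmr"


def core_hosts() -> List[str]:
    """Host names kmr-core-machine-001-kingsoft .. kmr-core-machine-008-kingsoft."""
    return [f"kmr-core-machine-{i:03d}-kingsoft" for i in range(1, 9)]


def component_url(host: str) -> str:
    """Assumed Ambari endpoint for the RegionServer component on one host."""
    return f"{AMBARI}/hosts/{host}/host_components/HBASE_REGIONSERVER"


def start_payload() -> str:
    """Assumed request body asking Ambari to move the component to STARTED."""
    return json.dumps({
        "RequestInfo": {"context": "Start RegionServer (monitor script)"},
        "Body": {"HostRoles": {"state": "STARTED"}},
    })


def check_then_start_rs(
    hosts: Iterable[str],
    is_running: Callable[[str], bool],
    start: Callable[[str], None],
) -> List[str]:
    """Check every host once; start the RS where it is down.

    Returns the hosts on which a start was requested.
    """
    restarted = []
    for host in hosts:
        if not is_running(host):
            start(host)
            restarted.append(host)
    return restarted


def keep_inspecting(
    hosts: Iterable[str],
    is_running: Callable[[str], bool],
    start: Callable[[str], None],
    interval_s: int = 30 * 60,
    sleep: Callable[[float], None] = time.sleep,
    max_rounds: Optional[int] = None,
) -> None:
    """Repeat the check every interval_s seconds (forever if max_rounds is None)."""
    hosts = list(hosts)
    done = 0
    while max_rounds is None or done < max_rounds:
        check_then_start_rs(hosts, is_running, start)
        done += 1
        sleep(interval_s)
```

In a real deployment, `is_running` would issue a GET against `component_url(host)` and inspect the reported state, and `start` would send `start_payload()` to the same URL with a PUT. Both need Ambari credentials and are omitted here.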
Below is an excerpt of the monitoring script's log output.
IN FUNCTION keep_inspecting()
Active Hbase Master host is kmr-5b9c18fc-gn-7b3518df-master-1-001-kingsoft
IN FUNCTION check_then_start_RS()
2017-12-15 03:17:01 Checked regionServer in kmr-core-machine-001-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-002-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-003-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-004-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:02 Checked regionServer in kmr-core-machine-005-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-006-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-007-kingsoft by Ambari, result is____ 1
2017-12-15 03:17:04 Checked regionServer in kmr-core-machine-008-kingsoft by Ambari, result is____ 1
IN FUNCTION keep_inspecting()
Active Hbase Master host is kmr-5b9c18fc-gn-7b3518df-master-1-001-kingsoft
IN FUNCTION check_then_start_RS()
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-001-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-002-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-003-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-004-kingsoft by Ambari, result is____ 1
2017-12-15 03:47:04 Checked regionServer in kmr-core-machine-005-kingsoft by Ambari, result is____0, now Starting......
IN FUNCTION restart_regionserver()
{
"href" : "http://localhost:8080/api/v1/clusters/ks-ksai_kmr/requests/669",
"Requests" : {
"id" : 669,
"status" : "Accepted"
}
}Now checking IF kmr-core-machine-005-kingsoft IS RUNNING RegionServer process......
IN FUNCTION after_start_RS()
RegionServer process on