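3 Failover Time Study
Failover time here means the period during which Hadoop cannot serve requests normally while an HA failover is in progress.
3.1 Failover time when the ANN is killed
Start of service unavailability (when the ANN, the active NameNode, is killed):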
2021-03-09 15:47:18,732 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: RECEIVED SIGNAL 15: SIGTERM
Failover completion time (when the SNN, the standby NameNode, successfully becomes the ANN):
2021-03-09 15:47:29,661 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at 172.19.6.21/172.19.6.21:9000 to active state
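Total time: about 11s (10.929s between the two log entries)
Breakdown: 10s of this is the retry interval after a failed connection (fencing cannot reach the old ANN, so it retries once); the actual failover takes less than 1s.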
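The total above is simply the difference between the two log timestamps. A minimal sketch of that arithmetic follows, assuming the log lines start with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp as in the excerpts; the helper names `parse_log_ts` and `elapsed_seconds` are hypothetical and not part of Hadoop.

```python
from datetime import datetime

# Timestamp layout used at the start of the log lines quoted in this section.
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
LOG_TS_LENGTH = 23  # len("2021-03-09 15:47:18,732")


def parse_log_ts(line: str) -> datetime:
    """Parse the leading timestamp of a log line (hypothetical helper)."""
    return datetime.strptime(line[:LOG_TS_LENGTH], LOG_TS_FORMAT)


def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Seconds between the timestamps of two log lines."""
    return (parse_log_ts(end_line) - parse_log_ts(start_line)).total_seconds()


killed = (
    "2021-03-09 15:47:18,732 ERROR org.apache.hadoop.hdfs.server.namenode."
    "NameNode: RECEIVED SIGNAL 15: SIGTERM"
)
active = (
    "2021-03-09 15:47:29,661 INFO org.apache.hadoop.ha.ZKFailoverController: "
    "Successfully transitioned NameNode at 172.19.6.21/172.19.6.21:9000 to active state"
)

total = elapsed_seconds(killed, active)
retry_interval = 10.0  # retry interval stated in the breakdown above
print(f"total: {total:.3f}s, excluding retry wait: {total - retry_interval:.3f}s")
```

Subtracting the 10s retry wait leaves under 1s, which matches the breakdown given above.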
3.2 Failover time when the ZKFC on the ANN's node is killed
Start of service unavailability (when fencing of the ANN begins):
2021-03-09 18:44:47,706 INFO org.apache.hadoop.ha.ZKFailoverController: Should fence: NameNode at /172.19.6.21:9000
Failover completion time (when the SNN successfully becomes the ANN):
2021-03-09 18:44:48,085 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at 172.19.6.20/172.19.6.20:9000 to active state
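Total time: less than 1s (0.379s between the two log entries)
3.3 Failover time when the ANN's node loses network connectivity
Start of service unavailability (the moment the network is cut, which can be estimated as roughly 18:59:05,324 by subtracting the 6668ms of silence from the timestamp of the log entry below):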
2021-03-09 18:59:11,992 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 6668ms for sessionid 0x20110cc70c50006, closing socket connection and attempting reconnect
Failover completion time (when the SNN successfully becomes the ANN):
2021-03-09 18:59:46,922 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at 172.19.6.21/172.19.6.21:9000 to active state
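Total time: about 41s (41.6s measured from the estimated disconnect time)
Breakdown: the logs show 18s spent trying to connect to the NN during fencing, 10s for the retry interval, and 3s for the second connection attempt. Most of the remaining 10s or so is the ZooKeeper session timeout.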
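The estimate of the disconnect time and the breakdown can be checked with a few lines of arithmetic. This is only a sketch: the phase durations are the approximate whole-second figures quoted above, and the variable names are illustrative.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps taken from the two log entries in this subsection.
timeout_logged = datetime.strptime("2021-03-09 18:59:11,992", FMT)
became_active = datetime.strptime("2021-03-09 18:59:46,922", FMT)

# The ZooKeeper client reports it had not heard from the server for 6668ms,
# so the network was cut roughly that long before the message was logged.
silence = timedelta(milliseconds=6668)
disconnect_estimate = timeout_logged - silence

total = (became_active - disconnect_estimate).total_seconds()

# Approximate phase durations (seconds) as read from the logs in the breakdown.
phases = {
    "first connection attempt while fencing": 18,
    "retry interval": 10,
    "second connection attempt": 3,
}
remainder = total - sum(phases.values())

print("estimated disconnect:", disconnect_estimate.strftime("%H:%M:%S,%f")[:-3])
print(f"total: {total:.3f}s")
print(f"remaining (mostly ZooKeeper session timeout): {remainder:.3f}s")
```

The remainder comes out slightly above 10s because the phase figures are rounded to whole seconds.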
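3.4 Failover time when disk space runs out
Start of service unavailability (when insufficient disk space is detected; the actual onset may be earlier, since the check runs at a 1s interval):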
2021-03-11 11:13:39,492 WARN org.apache.hadoop.ha.HealthMonitor: Service health check failed for NameNode at 172.19.6.20/172.19.6.20:9000: The NameNode has no resources available
Failover completion time (when the SNN successfully becomes the ANN):
2021-03-11 11:13:40,351 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at 172.19.6.21/172.19.6.21:9000 to active state
Total time: about 1s (0.859s between the two log entries)
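Because the health check only runs once per second, the measured figure is a lower bound on the real outage. The sketch below brackets the outage under two assumptions that the logs do not confirm: that the disk filled up at most one check interval before detection, and that the check itself returned promptly.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps taken from the two log entries in this subsection.
detected = datetime.strptime("2021-03-11 11:13:39,492", FMT)
became_active = datetime.strptime("2021-03-11 11:13:40,351", FMT)

# Health-check interval as stated in this section (1s).
check_interval = timedelta(seconds=1)

# Lower bound: the fault happened right at detection time.
lower_bound = (became_active - detected).total_seconds()
# Upper bound (assumption): the fault happened just after the previous check.
upper_bound = lower_bound + check_interval.total_seconds()

print(f"outage between {lower_bound:.3f}s and {upper_bound:.3f}s")
```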