ZKFC Principles and Source Code Walkthrough
Principles
Overview
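NameNode active/standby failover is implemented jointly by three components: ZKFailoverController (zkfc), HealthMonitor (HM) and ActiveStandbyElector (ASE).
HealthMonitor monitors the health of the NameNode (NN). It runs a thread that sends RPC requests and determines the NN state from the responses; whenever the state changes, it notifies zkfc through a callback.
ActiveStandbyElector handles coordination through ZK: it watches the znodes on the zk cluster and is used for the election.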
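The ZK cluster holds two paths that are used for Hadoop HA failover:
The first, /hadoop-ha/${dfs.nameservices}/ActiveStandbyElectorLock, is the active path where the ephemeral node is created; this node is the lock.
The second, /hadoop-ha/${dfs.nameservices}/ActiveBreadCrumb, is where the persistent node is created; it stores the address of the active NameNode (ANN).
In the normal case, the persistent node is deleted together with the ephemeral node.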
[zk: localhost:2181(CONNECTED) 10] ls /hadoop-ha/xxxxx
[ActiveBreadCrumb, ActiveStandbyElectorLock]
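To make the layout of the two znodes concrete, here is a minimal sketch that builds both paths from a nameservice id. The class ZkfcPaths and its methods are hypothetical helpers written for this article, not Hadoop code; only the parent /hadoop-ha and the two node names come from the listing above.

```java
// Hypothetical helper (not part of Hadoop): builds the two znode paths used for HA.
final class ZkfcPaths {
    static final String PARENT = "/hadoop-ha";
    static final String LOCK = "ActiveStandbyElectorLock";
    static final String BREADCRUMB = "ActiveBreadCrumb";

    private ZkfcPaths() {}

    // Ephemeral lock znode: whoever creates it holds the lock.
    static String lockPath(String nameservice) {
        return PARENT + "/" + nameservice + "/" + LOCK;
    }

    // Persistent breadcrumb znode: stores the address of the active NN.
    static String breadcrumbPath(String nameservice) {
        return PARENT + "/" + nameservice + "/" + BREADCRUMB;
    }
}
```

For the nameservice xxxxx shown in the zk shell output, lockPath and breadcrumbPath return the two children listed under /hadoop-ha/xxxxx.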
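When the HM on the ANN detects that the NN state is abnormal, it notifies zkfc through a callback, and zkfc calls the ASE to quit the election, which deletes the zk node. Alternatively, if the whole ZKFC service becomes unavailable and sends no heartbeat to the zk cluster for a long time, the zk cluster deletes the ha node itself.
The standby NameNodes (SNN) learn through their ASE that the active node on the zk cluster has been deleted. An SNN that is itself healthy then tries to create the ephemeral node at the ActiveStandbyElectorLock path, that is, it competes for the lock. The SNN that succeeds checks whether a node exists at the breadcrumb path. If one exists, the SNN tries to remove it: it first calls transitionToStandby on the old ANN, and if that does not work it fences the ANN by logging in over ssh and killing the process, or by running a shell script.
If the node cannot be removed, the SNN gives up the lock, quits the election and waits for a while (so that the other SNNs get a chance to grab the lock). If removal succeeds, it calls becomeActive, which in turn calls transitionToActive to turn the NN into the ANN.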
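The decision a standby makes after winning the lock can be sketched as a small pure function. This is a toy model under stated assumptions: TakeoverSketch, Result and the two BooleanSupplier parameters are illustrative names, and the suppliers merely stand in for the transitionToStandby RPC and for the ssh/shell fencing described above.

```java
import java.util.function.BooleanSupplier;

// Toy model of what a standby does after it has won the lock.
// All names are illustrative; none of them are Hadoop classes.
final class TakeoverSketch {
    enum Result { BECAME_ACTIVE, GAVE_UP_LOCK }

    private TakeoverSketch() {}

    static Result afterWinningLock(boolean staleBreadcrumb,
                                   BooleanSupplier gracefulStandby,
                                   BooleanSupplier forcefulFence) {
        if (staleBreadcrumb) {
            // Try the gentle option first; fall back to forceful fencing only if it fails.
            boolean fenced = gracefulStandby.getAsBoolean() || forcefulFence.getAsBoolean();
            if (!fenced) {
                // The old ANN may still be running: release the lock and retry later.
                return Result.GAVE_UP_LOCK;
            }
        }
        // Safe to proceed: this is where becomeActive / transitionToActive would run.
        return Result.BECAME_ACTIVE;
    }
}
```

Because of the short-circuit ||, the forceful fence is only attempted when the graceful request has failed, which mirrors the order described in the text.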
[Figure: the zkfc service]
[Figure: the zkfc state machine]
The official Hadoop issue describes the design and functionality of ZKFC in detail:
https://issues.apache.org/jira/browse/HDFS-2185
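Workflow
1. Once HealthMonitor has been initialized, it starts an internal thread that periodically calls methods of the HAServiceProtocol RPC interface on its NameNode to check the NameNode's health.
2. If HealthMonitor detects a change in the NameNode's health state, it invokes the callbacks registered by ZKFailoverController.
3. If ZKFailoverController decides that a failover is needed, it first uses ActiveStandbyElector to run an automatic active/standby election.
4. ActiveStandbyElector talks to Zookeeper to complete the election.
5. When the election is over, ActiveStandbyElector calls back into ZKFailoverController to tell it whether the current NameNode should become the active or the standby NameNode.
6. ZKFailoverController calls methods of the HAServiceProtocol RPC interface on its NameNode to switch the NameNode to the Active or Standby state.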
HealthMonitor
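HealthMonitor uses RPC to monitor the health state (HealthMonitor.State) and the service state (HAServiceStatus) of the local NN. When either of them changes, it reports to ZKFC through a callback.
//The five states of HealthMonitor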
/**
* The health monitor is still starting up.
*/
INITIALIZING,
/**
* The service is not responding to health check RPCs.
*/
SERVICE_NOT_RESPONDING,
/**
* The service is connected and healthy.
*/
SERVICE_HEALTHY,
/**
* The service is running but unhealthy.
*/
SERVICE_UNHEALTHY,
/**
* The health monitor itself failed unrecoverably and can
* no longer provide accurate information.
*/
HEALTH_MONITOR_FAILED;
//The four states of HAServiceState (reported inside HAServiceStatus)
INITIALIZING("initializing"),
ACTIVE("active"),
STANDBY("standby"),
STOPPING("stopping");
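As an illustration of how a single health check could map onto these monitor states, here is a minimal sketch. It is a toy model: HealthClassifier, HealthCheck and HealthCheckFailed are hypothetical stand-ins defined here, and the rule "a dedicated exception means unhealthy, any other failure means not responding" is an assumption chosen to match the state descriptions above.

```java
// Toy model: maps the outcome of one health check onto a monitor state.
// HealthCheckFailed is a hypothetical stand-in for "reachable but unhealthy".
final class HealthClassifier {
    enum State { INITIALIZING, SERVICE_NOT_RESPONDING, SERVICE_HEALTHY,
                 SERVICE_UNHEALTHY, HEALTH_MONITOR_FAILED }

    static final class HealthCheckFailed extends Exception {
        HealthCheckFailed(String message) { super(message); }
    }

    interface HealthCheck { void run() throws Exception; }

    private HealthClassifier() {}

    static State classify(HealthCheck check) {
        try {
            check.run();
            return State.SERVICE_HEALTHY;          // the call returned normally
        } catch (HealthCheckFailed e) {
            return State.SERVICE_UNHEALTHY;        // service answered, but reports a problem
        } catch (Exception e) {
            return State.SERVICE_NOT_RESPONDING;   // e.g. timeout or connection failure
        }
    }
}
```

A check that returns normally is classified as SERVICE_HEALTHY, one that throws HealthCheckFailed as SERVICE_UNHEALTHY, and any other exception as SERVICE_NOT_RESPONDING.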
ActiveStandbyElector
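ActiveStandbyElector controls and watches the state of the znodes on ZK and interacts with ZKFC. When joinElection is called, the ASE tries to create the node on ZK (that is, to acquire the lock). If the node is created, it calls becomeActive and the NN becomes the ANN. If creation fails, it calls becomeStandby and the NN becomes an SNN; the NN's health continues to be monitored, and a watcher is registered on the active lock.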
/**
* To participate in election, the app will call joinElection. The result will
* be notified by a callback on either the becomeActive or becomeStandby app
* interfaces.
*/
public synchronized void joinElection(byte[] data)
/**
* Any service instance can drop out of the election by calling quitElection.
* <br/>
*/
public synchronized void quitElection(boolean needFence)
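The join/quit behaviour can be tried out without a ZooKeeper cluster using an in-memory stand-in for the lock znode. This is only a sketch: ToyLock and ToyElector are made-up classes, the "watch" is a plain list of waiting electors, and fencing and session handling are left out entirely.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory stand-in for the lock znode; illustrative only, no ZooKeeper involved.
final class ToyLock {
    private ToyElector owner;
    private final List<ToyElector> watchers = new ArrayList<>();

    boolean tryAcquire(ToyElector candidate) {
        if (owner == null) {
            owner = candidate;                      // "create" succeeded
        } else if (owner != candidate && !watchers.contains(candidate)) {
            watchers.add(candidate);                // loser watches the lock
        }
        return owner == candidate;
    }

    void release(ToyElector candidate) {
        watchers.remove(candidate);
        if (owner != candidate) {
            return;                                 // a standby simply stops watching
        }
        owner = null;
        List<ToyElector> toNotify = new ArrayList<>(watchers);
        watchers.clear();
        for (ToyElector watcher : toNotify) {
            watcher.joinElection();                 // "node deleted" event: retry
        }
    }
}

final class ToyElector {
    enum Role { NONE, ACTIVE, STANDBY }

    private final ToyLock lock;
    private Role role = Role.NONE;

    ToyElector(ToyLock lock) { this.lock = lock; }

    void joinElection() {
        role = lock.tryAcquire(this) ? Role.ACTIVE : Role.STANDBY;
    }

    void quitElection() {
        role = Role.NONE;
        lock.release(this);
    }

    Role role() { return role; }
}
```

With two electors on one lock, the first to join becomes ACTIVE and the second STANDBY; when the active one quits, the standby is notified and takes over.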
ZKFC
When ZKFC is created it initializes a HealthMonitor and an ActiveStandbyElector. ZKFC's job is to coordinate the two and to carry out HA failover according to the events they deliver.
Fencing
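If the zkfc on the active node is killed, zk stops receiving heartbeats from the ANN and notifies the zkfc of the SNN. Once the SNN's zkfc has created the znode on zk, it asks the previous ANN to run transitionToStandby(). If that has no effect, it falls back to other methods (for example killing the node), and then calls transitionToActive() to make its own NN the active node.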
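Source Code
DFSZKFailoverController is simply a Java program with a main method.
The main method constructs a DFSZKFailoverController and calls its run method.
The doRun method, which is called from run, contains several important calls: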
private int doRun(String[] args)
    throws Exception {
  try {
    //Initialize zk
    initZK();
    //Format zk (abridged: the full method does this only for the -formatZK argument)
    formatZK(force, interactive);
    //Initialize rpc
    initRPC();
    //Initialize hm
    initHM();
    //Start rpc
    startRPC();
    mainLoop();
  } finally {
    rpcServer.stopAndJoin();
    elector.quitElection(true);
    healthMonitor.shutdown();
    healthMonitor.join();
  }
  return 0;
}
initZK() initializes zk: it reads the zk connection settings (quorum address, ACLs, auth info and so on), parses them, and creates the ActiveStandbyElector
//Create the ActiveStandbyElector and pass in the callbacks (becomeActive, becomeStandby and so on)
elector = new ActiveStandbyElector(zkQuorum,
    zkTimeout, getParentZnode(), zkAcls, zkAuths,
    new ElectorCallbacks(), maxRetryNum);
//ActiveStandbyElector constructor (the overload that also takes failFast)
public ActiveStandbyElector(String zookeeperHostPorts,
    int zookeeperSessionTimeout, String parentZnodeName, List<ACL> acl,
    List<ZKAuthInfo> authInfo, ActiveStandbyElectorCallback app,
    int maxRetryNum, boolean failFast) throws IOException,
    HadoopIllegalArgumentException, KeeperException {
  ...
  if (failFast) {
    createConnection();
  } else {
    reEstablishSession();
  }
}
//Create the connection to zk
private void createConnection() throws IOException, KeeperException {
  if (zkClient != null) {
    try {
      zkClient.close();
    } catch (InterruptedException e) {
      throw new IOException("Interrupted while closing ZK",
          e);
    }
    zkClient = null;
    watcher = null;
  }
  zkClient = connectToZooKeeper();
  if (LOG.isDebugEnabled()) {
    LOG.debug("Created new connection for " + this);
  }
}
//Connect to zk and set up the watcher that listens for znode events
protected synchronized ZooKeeper connectToZooKeeper() throws IOException,
    KeeperException {
  watcher = new WatcherWithClientRef();
  ZooKeeper zk = createZooKeeper();
  watcher.setZooKeeperRef(zk);
  watcher.waitForZKConnectionEvent(zkSessionTimeout);
  ...
}
formatZK() formats zk: it creates a znode that is later used to record the NN state on zk
initRPC() initializes the ZKFCRpcServer
initHM() starts the HealthMonitor health checks
private void initHM() {
  //1. Create the hm together with its monitoring thread
  healthMonitor = new HealthMonitor(conf, localTarget);
  //2. Register the health-state callback
  healthMonitor.addCallback(new HealthCallbacks());
  //3. Register the service-state callback
  healthMonitor.addServiceStateCallback(new ServiceStateCallBacks());
  //4. Start the thread
  healthMonitor.start();
}
//2. Health-state callback
class HealthCallbacks implements HealthMonitor.Callback {
  @Override
  public void enteredState(HealthMonitor.State newState) {
    //Record the latest state
    setLastHealthState(newState);
    //2.1 Check whether to take part in the election
    recheckElectability();
  }
}
//2.1 Check electability. The HealthMonitor callback calls recheckElectability, which
//looks at the most recently detected health state and acts on it:
//healthy: trigger joinElection and try to create the znode on zk;
//initializing: do not take part in the election yet;
//unhealthy: quit the election (if the NN is active, its node on zk is deleted).
private void recheckElectability() {
  // Maintain lock ordering of elector -> ZKFC
  synchronized (elector) {
    synchronized (this) {
      boolean healthy = lastHealthState == State.SERVICE_HEALTHY;
      switch (lastHealthState) {
      case SERVICE_HEALTHY:
        //2.1.1 Join the election
        elector.joinElection(targetToData(localTarget));
        if (quitElectionOnBadState) {
          quitElectionOnBadState = false;
        }
        break;
      case SERVICE_UNHEALTHY:
        //2.1.2 Quit the election
        elector.quitElection(true);
        serviceState = HAServiceState.INITIALIZING;
        break;
      }
    }
  }
}
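The mapping from health state to election action can be summarized as a small decision function. Only the SERVICE_HEALTHY and SERVICE_UNHEALTHY branches are taken from the listing above, and the INITIALIZING branch follows the accompanying comment. The handling of SERVICE_NOT_RESPONDING and HEALTH_MONITOR_FAILED is an assumption made for this sketch, and ElectabilitySketch and Action are hypothetical names.

```java
// Toy decision table: which election action follows from which health state.
// Branches marked "assumption" are not shown in the abridged listing.
final class ElectabilitySketch {
    enum State { INITIALIZING, SERVICE_NOT_RESPONDING, SERVICE_HEALTHY,
                 SERVICE_UNHEALTHY, HEALTH_MONITOR_FAILED }

    enum Action { JOIN_ELECTION, STAY_OUT, QUIT_AND_FENCE, FATAL }

    private ElectabilitySketch() {}

    static Action decide(State lastHealthState) {
        switch (lastHealthState) {
            case SERVICE_HEALTHY:
                return Action.JOIN_ELECTION;      // corresponds to elector.joinElection(...)
            case INITIALIZING:
                return Action.STAY_OUT;           // not electable yet
            case SERVICE_UNHEALTHY:
                return Action.QUIT_AND_FENCE;     // corresponds to elector.quitElection(true)
            case SERVICE_NOT_RESPONDING:
                return Action.QUIT_AND_FENCE;     // assumption: treated like unhealthy
            case HEALTH_MONITOR_FAILED:
            default:
                return Action.FATAL;              // assumption: the monitor itself is broken
        }
    }
}
```

Keeping the decision in one pure function makes it easy to see that only a healthy NN is ever allowed to compete for the lock.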
//2.1.1 joinElection delegates to joinElectionInternal
private void joinElectionInternal() {
  ...
  createRetryCount = 0;
  wantToBeInElection = true;
  createLockNodeAsync();
}
//createLockNodeAsync, called from joinElectionInternal, uses the zk client to create the ephemeral znode
private void createLockNodeAsync() {
  zkClient.create(zkLockFilePath, appData, zkAcl, CreateMode.EPHEMERAL,
      this, zkClient);
}
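//2.1.2 quitElection leaves the election; the ephemeral node on zk is removed as well
public synchronized void quitElection(boolean needFence) {
  // If the NameNode is going from Active back to Standby, delete its own
  // breadcrumb (persistent) znode so that it will not be fenced
  tryDeleteOwnBreadCrumbNode();
}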
//3. Service-state callback
class ServiceStateCallBacks implements HealthMonitor.ServiceStateCallback {
  @Override
  public void reportServiceStatus(HAServiceStatus status) {
    // Verify the service state that was just detected
    verifyChangedServiceState(status.getState());
  }
}
//3.1 Verify the service state that was just detected
void verifyChangedServiceState(HAServiceState changedState) {
  synchronized (elector) {
    synchronized (this) {
      if (serviceState == HAServiceState.INITIALIZING) {
        if (quitElectionOnBadState) {
          LOG.debug("rechecking for electability from bad state");
          recheckElectability();
        }
        return;
      }
      if (changedState == serviceState) {
        serviceStateMismatchCount = 0;
        return;
      }
      if (serviceStateMismatchCount == 0) {
        // recheck one more time. As this might be due to parallel transition.
        serviceStateMismatchCount++;
        return;
      }
      // quit the election as the expected state and reported state
      // mismatches.
      LOG.error("Local service " + localTarget
          + " has changed the serviceState to " + changedState
          + ". Expected was " + serviceState
          + ". Quitting election marking fencing necessary.");
      delayJoiningUntilNanotime = System.nanoTime()
          + TimeUnit.MILLISECONDS.toNanos(1000);
      elector.quitElection(true);
      quitElectionOnBadState = true;
      serviceStateMismatchCount = 0;
      serviceState = HAServiceState.INITIALIZING;
    }
  }
}
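The "tolerate one mismatch, quit on the second" rule in verifyChangedServiceState can be isolated in a few lines. This is a simplified sketch: MismatchTracker is a hypothetical class, states are plain strings, and the INITIALIZING branch, the rejoin delay and the actual quitElection call are left out.

```java
// Toy model of the mismatch counter; names are illustrative, not Hadoop's.
final class MismatchTracker {
    private int mismatchCount = 0;

    /** Returns true when the caller should quit the election. */
    boolean shouldQuit(String expectedState, String reportedState) {
        if (expectedState.equals(reportedState)) {
            mismatchCount = 0;      // states agree again: forget any earlier mismatch
            return false;
        }
        if (mismatchCount == 0) {
            mismatchCount++;        // first mismatch may be a transition still in flight
            return false;
        }
        mismatchCount = 0;          // second consecutive mismatch: give up
        return true;
    }
}
```

A single disagreement between the expected and the reported state is ignored; only two disagreements in a row, with no matching report in between, make the method return true.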
//4. The thread that was started
public void run() {
  while (shouldRun) {
    try {
      //The MonitorDaemon thread runs two methods
      //Keep retrying until the HA service is reached through an HAServiceProtocol proxy
      loopUntilConnected();
      //4.1 Health checks
      doHealthChecks();
    } catch (InterruptedException ie) {
      Preconditions.checkState(!shouldRun,
          "Interrupted but still supposed to run");
    }
  }
}
//4.1 Health checks
private void doHealthChecks() throws InterruptedException {
  while (shouldRun) {
    HAServiceStatus status = null;
    boolean healthy = false;
    try {
      //Send rpc requests and use the responses to judge the health of the NN
      status = proxy.getServiceStatus();
      proxy.monitorHealth();
      healthy = true;
    } ...
    if (healthy) {
      //enterState is called with the state that was detected
      enterState(State.SERVICE_HEALTHY);
    }
    Thread.sleep(checkIntervalMillis);
  }
}
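The health check loop calls enterState on every iteration, while the callbacks are meant to fire only when the state changes. A minimal sketch of that idea follows. StateNotifier is a hypothetical class and the callbacks are plain Consumer instances; it assumes that repeated reports of the same state are swallowed, in line with the overview's statement that zkfc is notified once the state changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Toy model of "notify callbacks only on a state transition"; not Hadoop code.
final class StateNotifier {
    enum State { INITIALIZING, SERVICE_NOT_RESPONDING, SERVICE_HEALTHY,
                 SERVICE_UNHEALTHY, HEALTH_MONITOR_FAILED }

    private State state = State.INITIALIZING;
    private final List<Consumer<State>> callbacks = new ArrayList<>();

    void addCallback(Consumer<State> callback) {
        callbacks.add(callback);
    }

    void enterState(State newState) {
        if (newState == state) {
            return;                     // same state as before: nothing to report
        }
        state = newState;
        for (Consumer<State> callback : callbacks) {
            callback.accept(newState);  // one notification per transition
        }
    }
}
```

Calling enterState three times in a row with SERVICE_HEALTHY therefore produces a single callback, so the election logic is not re-triggered on every check interval.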
startRPC() starts the ZKFCRpcServer
//When doRun exits, its finally block shuts everything down
rpcServer.stopAndJoin();
elector.quitElection(true);
healthMonitor.shutdown();
healthMonitor.join();