[Big Data Hadoop] The ZKFC (DFSZKFailoverController) Active/Standby Failover Mechanism in HDFS HA Mode

笑起来贼好看

Last modified: 2023-03-26 19:27:12

First published: 2023-03-26 13:14:24

Article link: https://blog.csdn.net/u013412066/article/details/129777627

The DFSZKFailoverController mechanism

Overview
Component internals

Overview

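When a NameNode becomes Active, its ZKFC holds an ephemeral znode in ZooKeeper (ZK). The znode stores information about the current Active NameNode, such as its hostname. If the Active NameNode fails or its connection times out, the monitoring process (ZKFC) removes the corresponding ephemeral znode from ZK, and the deletion event triggers the election of the next Active NameNode.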

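Because ZK provides strong consistency, it guarantees that at most one node can successfully create the znode and become the current Active NameNode. This is why the community uses ZK for automatic HDFS HA failover.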
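The "at most one creator wins" guarantee can be illustrated without a ZooKeeper cluster. The sketch below is only an in-memory stand-in: the class EphemeralLockDemo, its methods, and the path /demo-ha/lock are all hypothetical names invented for this example, and a ConcurrentHashMap plays the role of the ZK namespace. It shows the behavior the text relies on, not the ZooKeeper client API.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** In-memory stand-in for the guarantee that only one create() of a path succeeds. */
class EphemeralLockDemo {
    // Hypothetical path; the real lock path depends on the cluster configuration.
    static final String LOCK_PATH = "/demo-ha/lock";

    // path -> owner; stands in for the ZK namespace.
    private final ConcurrentMap<String, String> znodes = new ConcurrentHashMap<>();

    /** Returns true only for the caller that actually created the node. */
    boolean tryCreate(String path, String owner) {
        return znodes.putIfAbsent(path, owner) == null;
    }

    /** Models ephemeral nodes disappearing when their owner's session ends. */
    void sessionClosed(String owner) {
        znodes.values().removeIf(owner::equals);
    }

    /** Current holder of the path, if any. */
    Optional<String> holder(String path) {
        return Optional.ofNullable(znodes.get(path));
    }

    public static void main(String[] args) {
        EphemeralLockDemo zk = new EphemeralLockDemo();
        System.out.println("nn1 created lock: " + zk.tryCreate(LOCK_PATH, "nn1"));
        System.out.println("nn2 created lock: " + zk.tryCreate(LOCK_PATH, "nn2"));
        zk.sessionClosed("nn1"); // nn1 fails; its ephemeral node goes away
        System.out.println("nn2 created lock after nn1 left: " + zk.tryCreate(LOCK_PATH, "nn2"));
        System.out.println("holder: " + zk.holder(LOCK_PATH).orElse("none"));
    }
}
```

The second create fails while nn1 holds the node and succeeds once nn1's session is gone, which is the property that lets the surviving node take over.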

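Component internals

Inside the ZKFC process, three objects do the work:

ZKFailoverController: coordinates the HealthMonitor and ActiveStandbyElector objects, handles the events they raise, and carries out the automatic failover.
HealthMonitor: monitors the health of the local NameNode.
ActiveStandbyElector: manages and watches the state of the znodes in ZK.

How the three work together is shown in Figure 1-1.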

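Clues from the startup logs

ZKFC first starts its RPC server, then logs "Entering state SERVICE_HEALTHY", writes data to the znode /hadoop-ha-cdp-cluster/cdp-cluster/ActiveBreadCrumb, and finally transitions the NameNode on the local node to active.

ZKFC log: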

2023-03-22 08:45:47,258 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server spark-31/10.253.128.31:2181. Will not attempt to authenticate using SASL (unknown error)
2023-03-22 08:45:47,266 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.253.128.31:50782, server: spark-31/10.253.128.31:2181
2023-03-22 08:45:47,285 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server spark-31/10.253.128.31:2181, sessionid = 0x1000e3e5f9a0005, negotiated timeout = 10000
2023-03-22 08:45:47,289 INFO org.apache.hadoop.ha.ActiveStandbyElector: Session connected.
2023-03-22 08:45:47,299 INFO org.apache.hadoop.ha.ZKFailoverController: ZKFC RpcServer binding to /10.253.128.31:8019
2023-03-22 08:45:47,331 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue, queueCapacity: 300, scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler, ipcBackoff: false.
2023-03-22 08:45:47,378 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8019
2023-03-22 08:45:47,484 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2023-03-22 08:45:47,484 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8019: starting
2023-03-22 08:45:47,724 INFO org.apache.hadoop.ha.HealthMonitor: Entering state SERVICE_HEALTHY
2023-03-22 08:45:47,725 INFO org.apache.hadoop.ha.ZKFailoverController: Local service NameNode at /10.253.128.31:8020 entered state: SERVICE_HEALTHY
2023-03-22 08:45:47,745 INFO org.apache.hadoop.ha.ActiveStandbyElector: Checking for any old active which needs to be fenced...
2023-03-22 08:45:47,756 INFO org.apache.hadoop.ha.ActiveStandbyElector: No old node to fence
2023-03-22 08:45:47,756 INFO org.apache.hadoop.ha.ActiveStandbyElector: Writing znode /hadoop-ha-cdp-cluster/cdp-cluster/ActiveBreadCrumb to indicate that the local node is the most recent active...
2023-03-22 08:45:47,763 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at spark-31/10.253.128.31:8020 active...
2023-03-22 08:45:49,271 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at spark-31/10.253.128.31:8020 to active state
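The timestamps in this log also show how long the transition itself took. The following sketch parses the leading timestamp of two of the lines above and computes the gap between "Trying to make NameNode ... active" and "Successfully transitioned". The class ZkfcLogTiming and its helper names are made up for this example; the only assumption about the log format is the 23-character "date time,millis" prefix visible above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

class ZkfcLogTiming {
    // Timestamp layout of the log lines above, e.g. "2023-03-22 08:45:47,763".
    static final DateTimeFormatter TS =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    /** Parses the leading 23-character timestamp of a log line. */
    static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), TS);
    }

    /** Milliseconds elapsed between two log lines. */
    static long millisBetween(String earlierLine, String laterLine) {
        return Duration.between(timestampOf(earlierLine), timestampOf(laterLine)).toMillis();
    }

    public static void main(String[] args) {
        String trying = "2023-03-22 08:45:47,763 INFO org.apache.hadoop.ha.ZKFailoverController: "
            + "Trying to make NameNode at spark-31/10.253.128.31:8020 active...";
        String done = "2023-03-22 08:45:49,271 INFO org.apache.hadoop.ha.ZKFailoverController: "
            + "Successfully transitioned NameNode at spark-31/10.253.128.31:8020 to active state";
        System.out.println(millisBetween(trying, done) + " ms"); // prints "1508 ms"
    }
}
```

In this run the RPC that makes the NameNode active accounts for about 1.5 seconds, while everything before it (ZK session, RPC server, health check, breadcrumb write) fits into roughly half a second.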

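After it starts, a NameNode enters the standby state by default. Once ZKFC detects that the local NameNode is up, it sends monitorHealth probes and then takes part in the election. If it wins the election, it sends an RPC that transitions this NameNode to active.
The NameNode then calls stopStandbyServices to stop the threads that run only on a standby node, such as standbyCheckpointer and editLogTailer. Finally it calls startActiveServices to start the threads that the active node needs.

NameNode log: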

monitorHealth from 10.253.128.31:45708: org.apache.hadoop.ha.HealthCheckFailedException: The NameNode is configured to report UNHEALTHY to ZKFC in Safemode.
2023-03-26 10:33:21,493 INFO org.apache.hadoop.hdfs.StateChange: STATE* Safe mode is OFF
2023-03-26 10:33:21,493 INFO org.apache.hadoop.hdfs.StateChange: STATE* Leaving safe mode after 33 secs
2023-03-26 10:33:21,493 INFO org.apache.hadoop.hdfs.StateChange: STATE* Network topology has 1 racks and 3 datanodes
2023-03-26 10:33:21,493 INFO org.apache.hadoop.hdfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks
2023-03-26 10:33:21,749 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Stopping services started for standby state
2023-03-26 10:33:21,751 WARN org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer: Edit log tailer interrupted: sleep interrupted
2023-03-26 10:33:21,755 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
2023-03-26 10:33:21,766 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Starting recovery process for unclosed journal segments...
2023-03-26 10:33:21,803 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Successfully started new epoch 8
2023-03-26 10:33:21,804 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Beginning recovery of unclosed segment starting at txid 102544
2023-03-26 10:33:21,850 INFO org.apache.hadoop.hdfs.qjournal.client.QuorumJournalManager: Recovery prepare phase complete. Responses:
10.253.128.33:8485: segmentState { startTxId: 102544 endTxId: 102544 isInProgress: true } lastWriterEpoch: 7 lastCommittedTxId: 102543
10.253.128.31:8485: segmentState { startTxId: 102544 endTxId: 102544 isInProgress: true } lastWriterEpoch: 7 lastCommittedTxId: 102543
10.253.128.32:8485: segmentState { startTxId: 102544 endTxId: 102544 isInProgress: true } lastWriterEpoch: 7 lastCommittedTxId: 102543

ZKFailoverController

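Startup does the following:

Initializes the ZK connection settings and creates the ActiveStandbyElector.
Creates the ZKFC RPC server (bound to port 8019 in the log above), which serves requests sent to ZKFC itself; the health checks against the local NameNode are issued by HealthMonitor.
Creates the HealthMonitor and starts its thread.

Start with the class's member fields; they already hint at what the class does.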

public abstract class ZKFailoverController {
  // omitted...

  // ZK connection string
  private String zkQuorum;
  // The local NameNode managed by this ZKFC
  protected final HAServiceTarget localTarget;
  // Health monitor
  private HealthMonitor healthMonitor;
  // Elector
  private ActiveStandbyElector elector;
  // ZKFC RPC server
  protected ZKFCRpcServer rpcServer;
  // Initial health state
  private State lastHealthState = State.INITIALIZING;

  private volatile HAServiceState serviceState = HAServiceState.INITIALIZING;

  // omitted...
}

What the main thread does


private int doRun(String[] args)
      throws Exception {
    try {
      initZK();
    } catch (KeeperException ke) {
      LOG.error("Unable to start failover controller. Unable to connect "
          + "to ZooKeeper quorum at " + zkQuorum + ". Please check the "
          + "configured value for " + ZK_QUORUM_KEY + " and ensure that "
          + "ZooKeeper is running.", ke);
      return ERR_CODE_NO_ZK;
    }
    // argument parsing omitted...
    try {
      // Create the ZKFC RPC server
      initRPC();
      // Create and start the HealthMonitor
      initHM();
      // Start the RPC server
      startRPC();
      // Block the main thread until shutdown
      mainLoop();
    } catch (Exception e) {
      LOG.error("The failover controller encounters runtime error: ", e);
      throw e;
    } finally {
      rpcServer.stopAndJoin();
      
      elector.quitElection(true);
      healthMonitor.shutdown();
      healthMonitor.join();
    }
    return 0;
  }

HealthMonitor

HealthMonitor is the monitoring object; it watches the state of the NameNode on the local node. Internally it maintains five states:

public enum State {
  /**
   * The health monitor is still starting up.
   */
  INITIALIZING,

  /**
   * The service is not responding to health check RPCs.
   */
  SERVICE_NOT_RESPONDING,

  /**
   * The service is connected and healthy.
   */
  SERVICE_HEALTHY,

  /**
   * The service is running but unhealthy.
   */
  SERVICE_UNHEALTHY,

  /**
   * The health monitor itself failed unrecoverably and can
   * no longer provide accurate information.
   */
  HEALTH_MONITOR_FAILED;
}

Now look at how HealthMonitor is initialized:

private void initHM() {
  // Initialize the HealthMonitor object
  healthMonitor = new HealthMonitor(conf, localTarget);
  // Register callbacks so that state changes trigger them
  healthMonitor.addCallback(new HealthCallbacks());
  healthMonitor.addServiceStateCallback(new ServiceStateCallBacks());
  healthMonitor.start();
}

HealthMonitor's logic for checking NameNode health is very simple: send an RPC request and see whether a response comes back. The relevant code is as follows:

public void run() {
  // Keep checking in a loop
  while (shouldRun) {
    try {
      // Keep trying to connect to the NameNode; return once connected
      loopUntilConnected();
      // Run the health checks
      doHealthChecks();
    } catch (InterruptedException ie) {
      Preconditions.checkState(!shouldRun,
          "Interrupted but still supposed to run");
    }
  }
}

Next, step into doHealthChecks:

private void doHealthChecks() throws InterruptedException {
  while (shouldRun) {
    HAServiceStatus status = null;
    boolean healthy = false;
    try {
      // Ask the NameNode for its service status over RPC
      status = proxy.getServiceStatus();
      // Run the health check
      proxy.monitorHealth();
      // No exception was thrown, so mark the service healthy
      healthy = true;
    } catch (Throwable t) {
      if (isHealthCheckFailedException(t)) {
        LOG.warn("Service health check failed for {}", targetToMonitor, t);
        // The service responded but failed its health check: enter SERVICE_UNHEALTHY
        enterState(State.SERVICE_UNHEALTHY);
      } else {
        LOG.warn("Transport-level exception trying to monitor health of {}",
            targetToMonitor, t);
        RPC.stopProxy(proxy);
        proxy = null;
        // Enter SERVICE_NOT_RESPONDING
        enterState(State.SERVICE_NOT_RESPONDING);
        Thread.sleep(sleepAfterDisconnectMillis);
        return;
      }
    }
    // Record the service status
    if (status != null) {
      setLastServiceStatus(status);
    }
    // Report the healthy state
    if (healthy) {
      enterState(State.SERVICE_HEALTHY);
    }
    // Sleep until the next check
    Thread.sleep(checkIntervalMillis);
  }
}
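The core of doHealthChecks is a three-way classification of one probe's outcome. The sketch below isolates that decision so it can be run on its own. Probe, CheckFailedException, and HealthCheckSketch are hypothetical stand-ins invented here for the RPC proxy and for the exception type that isHealthCheckFailedException recognizes; the sketch also catches Exception rather than Throwable, which is a simplification.

```java
import java.io.IOException;

class HealthCheckSketch {
    enum State { SERVICE_HEALTHY, SERVICE_UNHEALTHY, SERVICE_NOT_RESPONDING }

    /** Hypothetical stand-in for "the service answered, but reported itself unhealthy". */
    static class CheckFailedException extends Exception {
        CheckFailedException(String msg) { super(msg); }
    }

    /** Hypothetical probe; stands in for the RPC proxy to the monitored service. */
    interface Probe {
        void monitorHealth() throws Exception;
    }

    /** One round of the check: map the probe outcome to a state. */
    static State checkOnce(Probe probe) {
        try {
            probe.monitorHealth();
            return State.SERVICE_HEALTHY;          // no exception: healthy
        } catch (CheckFailedException e) {
            return State.SERVICE_UNHEALTHY;        // reachable but unhealthy
        } catch (Exception e) {
            return State.SERVICE_NOT_RESPONDING;   // transport-level failure
        }
    }

    public static void main(String[] args) {
        System.out.println(checkOnce(() -> { }));
        System.out.println(checkOnce(() -> { throw new CheckFailedException("in safe mode"); }));
        System.out.println(checkOnce(() -> { throw new IOException("connection refused"); }));
    }
}
```

The distinction matters because only the transport-level branch tears down the proxy and forces a reconnect; an unhealthy but reachable service keeps being polled at the normal interval.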

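Depending on the state it detects, HealthMonitor calls enterState, which fires the callbacks registered for state changes. These events are handled in the ZKFailoverController class.

The callback that ZKFailoverController registers, HealthCallbacks.enteredState, looks like this: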

class HealthCallbacks implements HealthMonitor.Callback {
  @Override
  public void enteredState(HealthMonitor.State newState) {
    // Remember the most recent health state
    setLastHealthState(newState);
    recheckElectability();
  }
}
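The callback flow described above is a plain listener pattern. Here is a minimal, self-contained model of it. StateNotifierSketch and its members are names invented for this example, and the rule that listeners are notified only when the state actually changes is an assumption of the sketch, not something shown in the excerpts above.

```java
import java.util.ArrayList;
import java.util.List;

class StateNotifierSketch {
    enum State { INITIALIZING, SERVICE_HEALTHY, SERVICE_UNHEALTHY }

    /** Listener notified after the monitor enters a new state. */
    interface Callback {
        void enteredState(State newState);
    }

    private final List<Callback> callbacks = new ArrayList<>();
    private State state = State.INITIALIZING;

    synchronized void addCallback(Callback cb) {
        callbacks.add(cb);
    }

    /** Assumption: repeated reports of the same state do not notify again. */
    synchronized void enterState(State newState) {
        if (newState == state) {
            return;
        }
        state = newState;
        for (Callback cb : callbacks) {
            cb.enteredState(newState);
        }
    }

    public static void main(String[] args) {
        StateNotifierSketch monitor = new StateNotifierSketch();
        monitor.addCallback(s -> System.out.println("controller saw: " + s));
        monitor.enterState(State.SERVICE_HEALTHY);   // notifies
        monitor.enterState(State.SERVICE_HEALTHY);   // same state, silent
        monitor.enterState(State.SERVICE_UNHEALTHY); // notifies
    }
}
```

Keeping the monitor ignorant of what the listener does is what lets the controller own the election decision while the monitor only reports health.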

Next, look at the recheckElectability method:

private void recheckElectability() {
  // Maintain lock ordering of elector -> ZKFC
  synchronized (elector) {
    synchronized (this) {
      boolean healthy = lastHealthState == State.SERVICE_HEALTHY;
      // part of the code omitted...
      
      switch (lastHealthState) {
      // Healthy: join this round of the election
      case SERVICE_HEALTHY:
        if(serviceState != HAServiceState.OBSERVER) {
          elector.joinElection(targetToData(localTarget));
        }
        if (quitElectionOnBadState) {
          quitElectionOnBadState = false;
        }
        break;
        
      case INITIALIZING:
        LOG.info("Ensuring that " + localTarget + " does not " +
            "participate in active master election");
        // Still initializing: stay out of the election for now
        elector.quitElection(false);
        serviceState = HAServiceState.INITIALIZING;
        break;
  
      case SERVICE_UNHEALTHY:
      case SERVICE_NOT_RESPONDING:
        LOG.info("Quitting master election for " + localTarget +
            " and marking that fencing is necessary");
        // Unhealthy or not responding: quit the election
        elector.quitElection(true);
        serviceState = HAServiceState.INITIALIZING;
        break;
        
      case HEALTH_MONITOR_FAILED:
        fatalError("Health monitor failed!");
        break;
        
      default:
        throw new IllegalArgumentException("Unhandled state:"
                                             + lastHealthState);
      }
    }
  }
}
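The switch above is effectively a decision table from health state to election action. The sketch below restates it as a pure function so the mapping is easy to read and test. ElectabilitySketch, the Action enum, and the observer flag are names chosen for this example; they summarize the branches shown above and are not part of the Hadoop code.

```java
class ElectabilitySketch {
    enum Health {
        INITIALIZING, SERVICE_NOT_RESPONDING, SERVICE_HEALTHY,
        SERVICE_UNHEALTHY, HEALTH_MONITOR_FAILED
    }

    /** Hypothetical labels for what the controller does with the elector. */
    enum Action { JOIN_ELECTION, NO_CHANGE, QUIT_WITHOUT_FENCING, QUIT_NEEDS_FENCING, FATAL_ERROR }

    /** Mirrors the branches of the switch: one action per health state. */
    static Action decide(Health health, boolean observer) {
        switch (health) {
            case SERVICE_HEALTHY:
                // An observer is healthy but never stands for election.
                return observer ? Action.NO_CHANGE : Action.JOIN_ELECTION;
            case INITIALIZING:
                return Action.QUIT_WITHOUT_FENCING;   // quitElection(false)
            case SERVICE_UNHEALTHY:
            case SERVICE_NOT_RESPONDING:
                return Action.QUIT_NEEDS_FENCING;     // quitElection(true)
            case HEALTH_MONITOR_FAILED:
                return Action.FATAL_ERROR;
            default:
                throw new IllegalArgumentException("Unhandled state: " + health);
        }
    }

    public static void main(String[] args) {
        for (Health h : Health.values()) {
            System.out.println(h + " -> " + decide(h, false));
        }
    }
}
```

The only difference between the two quit branches is the fencing flag: a node that is merely initializing has nothing to fence, while a node that went bad may still believe it is active.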

ActiveStandbyElector

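The ActiveStandbyElector object is mainly responsible for interacting with ZooKeeper. For example, when a node is successfully switched to Active NameNode, the ActiveStandbyElector object creates a znode on ZK. The class has two key methods for Active NameNode election: joinElection() and quitElection().

Calling joinElection means that the local NameNode is ready to take part in the Active NameNode election as a candidate. Calling quitElection means that the local node withdraws from the current election.

These two methods are what automatic HDFS HA failover ultimately calls. Naturally, quitElection is called on the node that hosted the former Active NameNode.
When joining the election, joinElection goes on to create the ephemeral znode on ZK. The code is as follows: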

private void joinElectionInternal() {
  Preconditions.checkState(appData != null,
      "trying to join election without any app data");
  if (zkClient == null) {
    if (!reEstablishSession()) {
      fatalError("Failed to reEstablish connection with ZooKeeper");
      return;
    }
  }

  createRetryCount = 0;
  wantToBeInElection = true;
  createLockNodeAsync();
}

private void createLockNodeAsync() {
  zkClient.create(zkLockFilePath, appData, zkAcl, CreateMode.EPHEMERAL,
      this, zkClient);
}

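quitElection gives up the znodes this node created. If no fencing is needed and the node is currently active, it first deletes its own breadcrumb znode; reset() then drops the ZK session state, which releases the ephemeral lock znode. The methods are as follows: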

public synchronized void quitElection(boolean needFence) {
  LOG.info("Yielding from election");
  if (!needFence && state == State.ACTIVE) {
    // If active is gracefully going back to standby mode, remove
    // our permanent znode so no one fences us.
    tryDeleteOwnBreadCrumbNode();
  }
  reset();
  wantToBeInElection = false;
}

private void tryDeleteOwnBreadCrumbNode() {
  assert state == State.ACTIVE;
  LOG.info("Deleting bread-crumb of active node...");
  
  // Sanity check the data. This shouldn't be strictly necessary,
  // but better to play it safe.
  Stat stat = new Stat();
  byte[] data = null;
  try {
    data = zkClient.getData(zkBreadCrumbPath, false, stat);

    if (!Arrays.equals(data, appData)) {
      throw new IllegalStateException(
          "We thought we were active, but in fact " +
          "the active znode had the wrong data: " +
          StringUtils.byteToHexString(data) + " (stat=" + stat + ")");
    }
    
    deleteWithRetries(zkBreadCrumbPath, stat.getVersion());
  } catch (Exception e) {
    LOG.warn("Unable to delete our own bread-crumb of being active at {}." +
        ". Expecting to be fenced by the next active.", zkBreadCrumbPath, e);
  }
}
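The interplay between the ephemeral lock and the persistent breadcrumb is easier to see in a toy model. In the sketch below two plain fields stand in for the two znodes. BreadCrumbSketch and its method names are invented for this example, and the rule that a leftover breadcrumb from another node means "fence that node first" is inferred from the "Checking for any old active which needs to be fenced" log line and the comment in quitElection, so treat it as an assumption.

```java
class BreadCrumbSketch {
    private String lockHolder;  // models the ephemeral lock znode
    private String breadCrumb;  // models the persistent breadcrumb znode

    /** Returns the old active that must be fenced first, or null if there is none. */
    String becomeActive(String node) {
        if (lockHolder != null) {
            throw new IllegalStateException("lock is held by " + lockHolder);
        }
        String toFence = (breadCrumb != null && !breadCrumb.equals(node)) ? breadCrumb : null;
        lockHolder = node;
        breadCrumb = node; // record that this node is the most recent active
        return toFence;
    }

    /** Graceful quit (needFence == false) also removes the breadcrumb. */
    void quitElection(String node, boolean needFence) {
        boolean active = node.equals(lockHolder);
        if (!needFence && active) {
            breadCrumb = null;
        }
        if (active) {
            lockHolder = null; // the ephemeral lock goes away with the session
        }
    }

    public static void main(String[] args) {
        BreadCrumbSketch zk = new BreadCrumbSketch();
        System.out.println("nn1 must fence: " + zk.becomeActive("nn1"));
        zk.quitElection("nn1", true);  // nn1 went bad; breadcrumb stays behind
        System.out.println("nn2 must fence: " + zk.becomeActive("nn2"));
        zk.quitElection("nn2", false); // graceful; breadcrumb removed
        System.out.println("nn1 must fence: " + zk.becomeActive("nn1"));
    }
}
```

After an ungraceful exit the next winner finds the old node's breadcrumb and knows it has to fence it; after a graceful exit there is nothing left to fence.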

