Hadoop Source Code Analysis (26)

huserblog

Tags: hadoop, java, big data

Article link: https://blog.csdn.net/qq_39210987/article/details/124669569

ZKFC Source Code Analysis

Part 25 analyzed how ZKFC starts up. During startup, ZKFC connects to ZooKeeper and to the NameNode, and then runs health checks against the NameNode.

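The NameNode health check is performed over RPC: ZKFC calls a method on the NameNode itself. The main health-check method is monitorHealth. The earlier analysis of NameNode startup showed that the class serving the NameNode's RPC requests is NameNodeRpcServer, and its implementation of this method is as follows: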

  public synchronized void monitorHealth() throws HealthCheckFailedException,
      AccessControlException, IOException {
    checkNNStartup();
    nn.monitorHealth();
  }

The key part is the call to nn.monitorHealth() on line 4. That method is as follows:

synchronized void monitorHealth() 
      throws HealthCheckFailedException, AccessControlException {
    namesystem.checkSuperuserPrivilege();
    if (!haEnabled) {
      return; // no-op, if HA is not enabled
    }
    getNamesystem().checkAvailableResources();
    if (!getNamesystem().nameNodeHasResourcesAvailable()) {
      throw new HealthCheckFailedException(
          "The NameNode has no resources available");
    }
  }

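Once the health check succeeds, control returns to the RPC client side in ZKFC. Part 25 showed that after this call returns, an enterState method is invoked to move into the corresponding state. It is called as follows: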

[Figure: code snippet showing the call to enterState]

The enterState method itself is as follows:

private synchronized void enterState(State newState) {
    if (newState != state) {
      LOG.info("Entering state " + newState);
      state = newState;
      synchronized (callbacks) {
        for (Callback cb : callbacks) {
          cb.enteredState(newState);
        }
      }
    }
  }
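The notify-on-change behaviour of enterState can be tried out in isolation. The following is a minimal sketch, not Hadoop's HealthMonitor: the class HealthMonitorSketch, its reduced State enum, and the demo helper are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for the state/callback part of a health monitor.
class HealthMonitorSketch {
    enum State { INITIALIZING, SERVICE_HEALTHY, SERVICE_UNHEALTHY }

    interface Callback {
        void enteredState(State newState);
    }

    private final List<Callback> callbacks = new ArrayList<>();
    private State state = State.INITIALIZING;

    void addCallback(Callback cb) {
        synchronized (callbacks) {
            callbacks.add(cb);
        }
    }

    // Callbacks are notified only when the state actually changes.
    synchronized void enterState(State newState) {
        if (newState != state) {
            state = newState;
            synchronized (callbacks) {
                for (Callback cb : callbacks) {
                    cb.enteredState(newState);
                }
            }
        }
    }

    // Demo helper: returns the sequence of states that the callback observed.
    static List<State> demo() {
        List<State> seen = new ArrayList<>();
        HealthMonitorSketch monitor = new HealthMonitorSketch();
        monitor.addCallback(seen::add);
        monitor.enterState(State.SERVICE_HEALTHY);
        monitor.enterState(State.SERVICE_HEALTHY);   // unchanged, so no callback
        monitor.enterState(State.SERVICE_UNHEALTHY);
        return seen;
    }
}
```

Reporting the same state twice triggers the callback only once, which is why repeated successful health checks do not cause repeated election attempts through this path.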

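The enterState method is simple: if the new state differs from the current one, it assigns the new value to state, then iterates over the objects in callbacks and calls each one's enteredState method. As described in Part 25, when the healthMonitor is created, addCallback is called to register a callback in this callbacks list. The callback registered there is an instance of the HealthCallbacks class, whose enteredState method is as follows: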

  public void enteredState(HealthMonitor.State newState) {
      setLastHealthState(newState);
      recheckElectability();
    }

The key part is the call to recheckElectability on line 3, which is as follows:

private void recheckElectability() {
    // Maintain lock ordering of elector -> ZKFC
    synchronized (elector) {
      synchronized (this) {
        boolean healthy = lastHealthState == State.SERVICE_HEALTHY;

        long remainingDelay = delayJoiningUntilNanotime - System.nanoTime(); 
        if (remainingDelay > 0) {
          if (healthy) {
            LOG.info("Would have joined master election, but this node is " +
                "prohibited from doing so for " +
                TimeUnit.NANOSECONDS.toMillis(remainingDelay) + " more ms");
          }
          scheduleRecheck(remainingDelay);
          return;
        }

        switch (lastHealthState) {
        case SERVICE_HEALTHY:
          elector.joinElection(targetToData(localTarget));
          if (quitElectionOnBadState) {
            quitElectionOnBadState = false;
          }
          break;

        case INITIALIZING:
          LOG.info("Ensuring that " + localTarget + " does not " +
              "participate in active master election");
          elector.quitElection(false);
          serviceState = HAServiceState.INITIALIZING;
          break;

        case SERVICE_UNHEALTHY:
        case SERVICE_NOT_RESPONDING:
          LOG.info("Quitting master election for " + localTarget +
              " and marking that fencing is necessary");
          elector.quitElection(true);
          serviceState = HAServiceState.INITIALIZING;
          break;

        case HEALTH_MONITOR_FAILED:
          fatalError("Health monitor failed!");
          break;

        default:
          throw new IllegalArgumentException("Unhandled state:" + lastHealthState);
        }
      }
    }
  }
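The branches of recheckElectability can be condensed into a pure decision function, which makes the mapping from health state to election action easier to see. This is only a sketch under simplifying assumptions: ElectabilitySketch and its Action enum are hypothetical names, and side effects such as logging, rescheduling, and updating serviceState are left out.

```java
// Hypothetical condensation of the decision logic; not part of Hadoop.
class ElectabilitySketch {
    enum Health {
        INITIALIZING, SERVICE_HEALTHY, SERVICE_UNHEALTHY,
        SERVICE_NOT_RESPONDING, HEALTH_MONITOR_FAILED
    }

    enum Action {
        RECHECK_LATER, JOIN_ELECTION, QUIT_WITHOUT_FENCING,
        QUIT_AND_REQUIRE_FENCING, FATAL_ERROR
    }

    static Action decide(Health lastHealthState, long remainingDelayNanos) {
        // While the node is still barred from joining, only schedule a recheck.
        if (remainingDelayNanos > 0) {
            return Action.RECHECK_LATER;
        }
        switch (lastHealthState) {
            case SERVICE_HEALTHY:
                return Action.JOIN_ELECTION;
            case INITIALIZING:
                return Action.QUIT_WITHOUT_FENCING;      // quitElection(false)
            case SERVICE_UNHEALTHY:
            case SERVICE_NOT_RESPONDING:
                return Action.QUIT_AND_REQUIRE_FENCING;  // quitElection(true)
            case HEALTH_MONITOR_FAILED:
                return Action.FATAL_ERROR;
            default:
                throw new IllegalArgumentException("Unhandled state:" + lastHealthState);
        }
    }
}
```

For example, decide(Health.SERVICE_HEALTHY, 0) yields JOIN_ELECTION, while any state with a positive remaining delay yields RECHECK_LATER.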

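The key part of recheckElectability is the switch statement on line 18, which takes a different action depending on the value of lastHealthState. This value is passed in by enterState after the health check; as noted above, after a successful check it is SERVICE_HEALTHY. So the SERVICE_HEALTHY branch (lines 19 to 24) runs, and the important step in it is the call to elector.joinElection. As mentioned in the discussion of ZKFC initialization, elector is an instance of the ActiveStandbyElector class. Its joinElection method is as follows: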

 public synchronized void joinElection(byte[] data)
      throws HadoopIllegalArgumentException {

    if (data == null) {
      throw new HadoopIllegalArgumentException("data cannot be null");
    }

    if (wantToBeInElection) {
      LOG.info("Already in election. Not re-connecting.");
      return;
    }

    appData = new byte[data.length];
    System.arraycopy(data, 0, appData, 0, data.length);

    LOG.debug("Attempting active election for " + this);
    joinElectionInternal();
  }

This method mostly validates and copies its arguments; the key part is the call to joinElectionInternal on line 17. That method is as follows:

  private void joinElectionInternal() {
    Preconditions.checkState(appData != null,
        "trying to join election without any app data");
    if (zkClient == null) {
      if (!reEstablishSession()) {
        fatalError("Failed to reEstablish connection with ZooKeeper");
        return;
      }
    }

    createRetryCount = 0;
    wantToBeInElection = true;
    createLockNodeAsync();
  }

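It first checks whether zkClient is null (re-establishing the session if so), then sets a few fields, and finally calls createLockNodeAsync. That method is as follows: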

  private void createLockNodeAsync() {
    zkClient.create(zkLockFilePath, appData, zkAcl, CreateMode.EPHEMERAL,
        this, zkClient);
  }

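As the code shows, the election amounts to creating a znode at a designated path in ZooKeeper. If the create succeeds, this node has won the election and becomes the master; if it fails, the node becomes a standby. Among the arguments: zkLockFilePath is the path of the lock znode; CreateMode.EPHEMERAL is the znode type (an ephemeral znode is deleted automatically when the session of the client that created it ends); and this is the elector itself, the ActiveStandbyElector instance. It is passed as the callback, which works because ActiveStandbyElector implements both the StatCallback and StringCallback interfaces. The last argument, zkClient, is a context object that is handed back to the callback as ctx.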
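The idea of electing a leader by creating an ephemeral node can be illustrated without a ZooKeeper cluster. The sketch below is an in-memory simulation under stated assumptions: FakeCoordinationService, LockElectionSketch, and the lock path are hypothetical and are not the ZooKeeper client API; the only properties modelled are that a create fails if the node exists and that a node vanishes when its owner's session ends.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a coordination service; NOT the ZooKeeper API.
class FakeCoordinationService {
    // Node path -> id of the session that owns the ephemeral node.
    private final Map<String, String> ephemeralOwners = new ConcurrentHashMap<>();

    // Returns true if the node was created, false if it already exists.
    boolean createEphemeral(String path, String sessionId) {
        return ephemeralOwners.putIfAbsent(path, sessionId) == null;
    }

    // Ending a session removes every ephemeral node that the session created.
    void closeSession(String sessionId) {
        ephemeralOwners.values().removeIf(owner -> owner.equals(sessionId));
    }
}

class LockElectionSketch {
    // Illustrative path only; the real lock path is derived from configuration.
    static final String LOCK_PATH = "/hypothetical-ha/lock";

    // Whoever creates the lock node first is active; everyone else is standby.
    static String tryElect(FakeCoordinationService service, String sessionId) {
        return service.createEphemeral(LOCK_PATH, sessionId) ? "ACTIVE" : "STANDBY";
    }
}
```

With two participants nn1 and nn2, nn1 becomes ACTIVE and nn2 STANDBY; once nn1's session is closed the lock node disappears, and a new attempt by nn2 succeeds.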

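Whether the create succeeds or fails, the ZooKeeper client invokes ActiveStandbyElector's processResult method. It checks the result: on success the corresponding NameNode is made active, otherwise it becomes standby. The processResult method is as follows: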

public synchronized void processResult(int rc, String path, Object ctx,
      String name) {
    if (isStaleClient(ctx)) return;
    LOG.debug("CreateNode result: " + rc + " for path: " + path
        + " connectionState: " + zkConnectionState +
        "  for " + this);

    Code code = Code.get(rc);
    if (isSuccess(code)) {
      // we successfully created the znode. we are the leader. start monitoring
      if (becomeActive()) {
        monitorActiveStatus();
      } else {
        reJoinElectionAfterFailureToBecomeActive();
      }
      return;
    }

    if (isNodeExists(code)) {
      if (createRetryCount == 0) {
        // znode exists and we did not retry the operation. so a different
        // instance has created it. become standby and monitor lock.
        becomeStandby();
      }
      // if we had retried then the znode could have been created by our first
      // attempt to the server (that we lost) and this node exists response is
      // for the second attempt. verify this case via ephemeral node owner. this
      // will happen on the callback for monitoring the lock.
      monitorActiveStatus();
      return;
    }

    String errorMessage = "Received create error from Zookeeper. code:"
        + code.toString() + " for path " + path;
    LOG.debug(errorMessage);

    if (shouldRetry(code)) {
      if (createRetryCount < maxRetryNum) {
        LOG.debug("Retrying createNode createRetryCount: " + createRetryCount);
        ++createRetryCount;
        createLockNodeAsync();
        return;
      }
      errorMessage = errorMessage
          + ". Not retrying further znode create connection errors.";
    } else if (isSessionExpired(code)) {
      // This isn't fatal - the client Watcher will re-join the election
      LOG.warn("Lock acquisition failed because session was lost");
      return;
    }

    fatalError(errorMessage);
  }

Two calls matter here: becomeActive on line 11 and becomeStandby on line 23. These two methods switch the NameNode's state.
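The branching in processResult can be summarized as a function from the create result to the follow-up actions. This is a rough sketch, assuming a reduced set of result codes: CreateResultSketch and its Code enum are hypothetical, and treating CONNECTION_LOSS as the retryable code is an assumption, since the body of shouldRetry is not shown in this article.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical summary of the result handling; action names are labels, not real calls.
class CreateResultSketch {
    // Reduced, illustrative result codes (the real ones come from the ZooKeeper client).
    enum Code { OK, NODE_EXISTS, CONNECTION_LOSS, SESSION_EXPIRED, OTHER }

    static List<String> handle(Code code, int createRetryCount,
                               int maxRetryNum, boolean activationSucceeds) {
        List<String> actions = new ArrayList<>();
        switch (code) {
            case OK:
                // The lock node was created: try to become active.
                actions.add("becomeActive");
                actions.add(activationSucceeds ? "monitorActiveStatus" : "rejoinElection");
                break;
            case NODE_EXISTS:
                // Without a retry, another instance must have created the node.
                if (createRetryCount == 0) {
                    actions.add("becomeStandby");
                }
                actions.add("monitorActiveStatus");
                break;
            case CONNECTION_LOSS:
                // Assumed retryable: retry until the retry budget is used up.
                actions.add(createRetryCount < maxRetryNum ? "retryCreate" : "fatalError");
                break;
            case SESSION_EXPIRED:
                // Not fatal: the watcher is expected to rejoin the election.
                actions.add("waitForWatcherToRejoin");
                break;
            default:
                actions.add("fatalError");
        }
        return actions;
    }
}
```

Note the NODE_EXISTS case after a retry: the node may have been created by this instance's own earlier attempt, so the sketch only monitors the lock instead of immediately becoming standby.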

First, the becomeActive method:

private boolean becomeActive() {
    assert wantToBeInElection;
    if (state == State.ACTIVE) {
      // already active
      return true;
    }
    try {
      Stat oldBreadcrumbStat = fenceOldActive();
      writeBreadCrumbNode(oldBreadcrumbStat);

      LOG.debug("Becoming active for " + this);
      appClient.becomeActive();
      state = State.ACTIVE;
      return true;
    } catch (Exception e) {
      LOG.warn("Exception handling the winning of election", e);
      // Caller will handle quitting and rejoining the election.
      return false;
    }
  }

Two calls matter here. The first is fenceOldActive on line 8, which deals with the node that was active before the switch. The second is appClient.becomeActive on line 12.

The focus here is on appClient.becomeActive, which is as follows:

   public void becomeActive() throws ServiceFailedException {
      ZKFailoverController.this.becomeActive();
    }

It simply delegates to ZKFailoverController's becomeActive method, which is as follows:

private synchronized void becomeActive() throws ServiceFailedException {
    LOG.info("Trying to make " + localTarget + " active...");
    try {
      HAServiceProtocolHelper.transitionToActive(localTarget.getProxy(
          conf, FailoverController.getRpcTimeoutToNewActive(conf)),
          createReqInfo());
      String msg = "Successfully transitioned " + localTarget +
          " to active state";
      LOG.info(msg);
      serviceState = HAServiceState.ACTIVE;
      recordActiveAttempt(new ActiveAttemptRecord(true, msg));

    } catch (Throwable t) {
      String msg = "Couldn't make " + localTarget + " active";
      LOG.fatal(msg, t);

      recordActiveAttempt(new ActiveAttemptRecord(false, msg + "\n" +
          StringUtils.stringifyException(t)));

      if (t instanceof ServiceFailedException) {
        throw (ServiceFailedException)t;
      } else {
        throw new ServiceFailedException("Couldn't transition to active",
            t);
      }
    }
  }

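The key part is the call on line 4, HAServiceProtocolHelper.transitionToActive, which takes two arguments. The first is obtained from localTarget's getProxy method which, as explained in Part 25, returns a proxy object for the NameNode. The transitionToActive method being called is as follows: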

  public static void transitionToActive(HAServiceProtocol svc,
      StateChangeRequestInfo reqInfo)
      throws IOException {
    try {
      svc.transitionToActive(reqInfo);
    } catch (RemoteException e) {
      throw e.unwrapRemoteException(ServiceFailedException.class);
    }
  }

Line 5 calls the proxy object's transitionToActive method directly; through RPC, this call ends up in the corresponding method of NameNodeRpcServer, which is as follows:

 public synchronized void transitionToActive(StateChangeRequestInfo req) 
      throws ServiceFailedException, AccessControlException, IOException {
    checkNNStartup();
    nn.checkHaStateChange(req);
    nn.transitionToActive();
  }

The key part is the call to nn.transitionToActive on line 5, which is as follows:

  synchronized void transitionToActive() 
      throws ServiceFailedException, AccessControlException {
    namesystem.checkSuperuserPrivilege();
    if (!haEnabled) {
      throw new ServiceFailedException("HA for namenode is not enabled");
    }
    state.setState(haContext, ACTIVE_STATE);
  }

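The key part is the call to setState on line 7, which changes the current state. Earlier parts showed that a NameNode in an HA setup always starts in the standby state, so state here is the standby state. The setState method it executes is as follows: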

  public void setState(HAContext context, HAState s) throws ServiceFailedException {
    if (s == NameNode.ACTIVE_STATE) {
      setStateInternal(context, s);
      return;
    }
    super.setState(context, s);
  }

The value of s passed in is ACTIVE_STATE, so the if condition on line 2 is true and the code calls setStateInternal on line 3. That method is as follows:

protected final void setStateInternal(final HAContext context, final HAState s)
      throws ServiceFailedException {
    prepareToExitState(context);
    s.prepareToEnterState(context);
    context.writeLock();
    try {
      exitState(context);
      context.setState(s);
      s.enterState(context);
      s.updateLastHATransitionTime();
    } finally {
      context.writeUnlock();
    }
  }

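The logic is simple: first prepare to leave the current state and to enter the new one (lines 3 and 4); if that succeeds, exit the current state (line 7), set the new state (line 8), and enter the new state (line 9).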
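The ordering of these steps can be made visible with a small trace. The following is a simplified sketch, not Hadoop's HAState: TransitionSketch and SimpleState are hypothetical, each hook only records its name, and a ReentrantReadWriteLock stands in for the context's write lock.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical illustration of the prepare / exit / set / enter ordering.
class TransitionSketch {
    static class SimpleState {
        final String name;
        SimpleState(String name) { this.name = name; }
        void prepareToExitState(List<String> trace) { trace.add("prepareToExit:" + name); }
        void prepareToEnterState(List<String> trace) { trace.add("prepareToEnter:" + name); }
        void exitState(List<String> trace) { trace.add("exit:" + name); }
        void enterState(List<String> trace) { trace.add("enter:" + name); }
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final List<String> trace = new ArrayList<>();
    private SimpleState current;

    TransitionSketch(SimpleState initial) { current = initial; }

    void transitionTo(SimpleState next) {
        // The preparation hooks run before the write lock is taken.
        current.prepareToExitState(trace);
        next.prepareToEnterState(trace);
        lock.writeLock().lock();
        try {
            current.exitState(trace);  // leave the old state
            current = next;            // record the new state
            next.enterState(trace);    // start the new state's services
        } finally {
            lock.writeLock().unlock();
        }
    }

    List<String> trace() { return trace; }
    String currentName() { return current.name; }
}
```

Transitioning from a state named standby to one named active records prepareToExit:standby, prepareToEnter:active, exit:standby, enter:active, in that order.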

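The exit step calls exitState, which here means leaving the standby state. Part 24 showed that entering the standby state mainly starts two threads, one that syncs data from the active node and one that performs checkpoints. Leaving the standby state mostly consists of stopping those two threads. The exitState method called here is as follows: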

 public void exitState(HAContext context) throws ServiceFailedException {
    try {
      context.stopStandbyServices();
    } catch (IOException e) {
      throw new ServiceFailedException("Failed to stop standby services", e);
    }
  }

It in turn calls the context's stopStandbyServices method, which is as follows:

   public void stopStandbyServices() throws IOException {
      try {
        if (namesystem != null) {
          namesystem.stopStandbyServices();
        }
      } catch (Throwable t) {
        doImmediateShutdown(t);
      }
    }

The key part is the call to namesystem's stopStandbyServices method on line 4, which is as follows:

  void stopStandbyServices() throws IOException {
    LOG.info("Stopping services started for standby state");
    if (standbyCheckpointer != null) {
      standbyCheckpointer.stop();
    }
    if (editLogTailer != null) {
      editLogTailer.stop();
    }
    if (dir != null && getFSImage() != null && getFSImage().editLog != null) {
      getFSImage().editLog.close();
    }
  }

On lines 4 and 7 this method stops the two threads mentioned above, standbyCheckpointer and editLogTailer, and then closes the edit log.
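Stopping a background thread of this kind usually combines a flag, an interrupt, and a join. The sketch below shows that general pattern only; StoppableWorker is a hypothetical class, and nothing here is claimed about how standbyCheckpointer or editLogTailer implement stop internally.

```java
// Hypothetical periodic worker that can be stopped cleanly.
class StoppableWorker {
    private volatile boolean shouldRun = true;
    private final Thread thread;

    StoppableWorker(String name) {
        thread = new Thread(() -> {
            while (shouldRun) {
                try {
                    Thread.sleep(50); // placeholder for one round of periodic work
                } catch (InterruptedException e) {
                    // Woken up by stop(); the loop condition decides whether to exit.
                }
            }
        }, name);
        thread.setDaemon(true);
    }

    void start() {
        thread.start();
    }

    // Clear the flag, wake the thread if it is sleeping, and wait for it to finish.
    void stop() {
        shouldRun = false;
        thread.interrupt();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    boolean isAlive() {
        return thread.isAlive();
    }
}
```

Two such workers can be started and then stopped one after the other, mirroring the order in stopStandbyServices; after stop returns, the thread is no longer alive.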

Next comes enterState for the new state. The new state is active, so the active state's enterState method is called; it is as follows:

public void enterState(HAContext context) throws ServiceFailedException {
    try {
      context.startActiveServices();
    } catch (IOException e) {
      throw new ServiceFailedException("Failed to start active services", e);
    }
  }

  public void startActiveServices() throws IOException {
      try {
        namesystem.startActiveServices();
      } catch (Throwable t) {
        doImmediateShutdown(t);
      }
    }

As before, the call is passed down level by level, ending at namesystem's startActiveServices method, which is as follows:

void startActiveServices() throws IOException {
    startingActiveService = true;
    LOG.info("Starting services required for active state");
    writeLock();
    try {
      FSEditLog editLog = getFSImage().getEditLog();

      if (!editLog.isOpenForWrite()) {
        // During startup, we're already open for write during initialization.
        editLog.initJournalsForWrite();
        // May need to recover
        editLog.recoverUnclosedStreams();

        LOG.info("Catching up to latest edits from old active before " +
            "taking over writer role in edits logs");
        editLogTailer.catchupDuringFailover();

        blockManager.setPostponeBlocksFromFuture(false);
        blockManager.getDatanodeManager().markAllDatanodesStale();
        blockManager.clearQueues();
        blockManager.processAllPendingDNMessages();

        // Only need to re-process the queue, If not in SafeMode.
        if (!isInSafeMode()) {
          LOG.info("Reprocessing replication and invalidation queues");
          initializeReplQueues();
        }

        if (LOG.isDebugEnabled()) {
          LOG.debug("NameNode metadata after re-processing " +
              "replication and invalidation queues during failover:\n" +
              metaSaveAsString());
        }

        long nextTxId = getFSImage().getLastAppliedTxId() + 1;
        LOG.info("Will take over writing edit logs at txnid " + 
            nextTxId);
        editLog.setNextTxId(nextTxId);

        getFSImage().editLog.openForWrite();
      }

      // Enable quota checks.
      dir.enableQuotaChecks();
      if (haEnabled) {
        // Renew all of the leases before becoming active.
        // This is because, while we were in standby mode,
        // the leases weren't getting renewed on this NN.
        // Give them all a fresh start here.
        leaseManager.renewAllLeases();
      }
      leaseManager.startMonitor();
      startSecretManagerIfNecessary();

      //ResourceMonitor required only at ActiveNN. See HDFS-2914
      this.nnrmthread = new Daemon(new NameNodeResourceMonitor());
      nnrmthread.start();

      nnEditLogRoller = new Daemon(new NameNodeEditLogRoller(
          editLogRollerThreshold, editLogRollerInterval));
      nnEditLogRoller.start();

      if (lazyPersistFileScrubIntervalSec > 0) {
        lazyPersistFileScrubber = new Daemon(new LazyPersistFileScrubber(
            lazyPersistFileScrubIntervalSec));
        lazyPersistFileScrubber.start();
      }

      cacheManager.startMonitorThread();
      blockManager.getDatanodeManager().setShouldSendCachingCommands(true);
    } finally {
      startingActiveService = false;
      checkSafeMode();
      writeUnlock("startActiveServices");
    }
  }

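First comes the if statement on lines 8 to 41. It mainly handles unclosed log streams, that is, the inprogress files discussed in the earlier analysis of edit log files, and then opens the edit log for writing. Then, from line 44 to the end, the method starts the threads that the active state needs. The most important of these is NameNodeEditLogRoller, which is created on line 59 and started right after.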
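Two details of startActiveServices lend themselves to a small model: writing resumes at lastAppliedTxId + 1, and a roller periodically decides whether to start a new log segment. The sketch below is an assumption-laden illustration: EditLogSketch and RollerSketch are hypothetical, and the rule "roll when the open segment holds more edits than a threshold" is assumed from the editLogRollerThreshold parameter name rather than taken from the NameNodeEditLogRoller source.

```java
// Hypothetical model of an edit log that hands out transaction ids.
class EditLogSketch {
    private long nextTxId;
    private long segmentStartTxId;
    private int rolls = 0;

    // Mirrors "nextTxId = lastAppliedTxId + 1" from startActiveServices.
    EditLogSketch(long lastAppliedTxId) {
        this.nextTxId = lastAppliedTxId + 1;
        this.segmentStartTxId = nextTxId;
    }

    // Logs one edit and returns the transaction id assigned to it.
    long logEdit() {
        return nextTxId++;
    }

    long editsInCurrentSegment() {
        return nextTxId - segmentStartTxId;
    }

    // Closes the current segment and starts a new one at the next transaction id.
    void roll() {
        segmentStartTxId = nextTxId;
        rolls++;
    }

    int rolls() {
        return rolls;
    }
}

class RollerSketch {
    // One pass of a hypothetical roller loop: roll if the open segment is too large.
    static boolean checkOnce(EditLogSketch log, long rollThreshold) {
        if (log.editsInCurrentSegment() > rollThreshold) {
            log.roll();
            return true;
        }
        return false;
    }
}
```

In a real daemon this check would run in a loop with a sleep between passes; here a single pass is exposed so the behaviour is deterministic.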
