4. Hadoop source code analysis: the DataNode heartbeat mechanism

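To recap the earlier analysis: when a DataNode starts, it starts a BPOfferService for each block pool (nameservice). In an HA setup each BPOfferService has two BPServiceActors, one talking to the active NameNode and one to the standby. Each actor is a thread, and on startup it executes its own run method. The previous article covered the registration logic in run; this article covers the most important part of what follows in run: the heartbeat. Here is the run method again: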

@Override
public void run() {
  LOG.info(this + " starting to offer service");

  try {
    while (true) {
      // init stuff
      try {
        // setup storage
        // register with the NameNode
        connectToNNAndHandshake();
        break;
      } catch (IOException ioe) {
        // Initial handshake, storage recovery or registration failed
        runningState = RunningState.INIT_FAILED;
        if (shouldRetryInit()) {
          // Retry until all namenode's of BPOS failed initialization
          LOG.error("Initialization failed for " + this + " "
              + ioe.getLocalizedMessage());
          sleepAndLogInterrupts(5000, "initializing");
        } else {
          runningState = RunningState.FAILED;
          LOG.fatal("Initialization failed for " + this + ". Exiting. ", ioe);
          return;
        }
      }
    }

    runningState = RunningState.RUNNING;

    while (shouldRun()) {
      try {
        // heartbeat logic
        offerService();
      } catch (Exception ex) {
        LOG.error("Exception in BPOfferService for " + this, ex);
        sleepAndLogInterrupts(5000, "offering service");
      }
    }
    runningState = RunningState.EXITED;
  } catch (Throwable ex) {
    LOG.warn("Unexpected exception in block pool " + this, ex);
    runningState = RunningState.FAILED;
  } finally {
    LOG.warn("Ending block pool service for: " + this);
    cleanUp();
  }
}
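
The two-phase shape of run() can be reduced to a small sketch: retry initialization until it succeeds or is abandoned, then keep calling the service step and survive individual failures. This is not Hadoop code. ActorLifecycle, Step, initialize and serve are hypothetical names, and the maxAttempts and rounds parameters are simplifications that stand in for shouldRetryInit() and shouldRun().

```java
// Sketch only: ActorLifecycle and every name in it are hypothetical, not Hadoop code.
class ActorLifecycle {
    enum State { RUNNING, FAILED }

    /** A unit of work that may fail, such as a handshake or one service round. */
    interface Step {
        void run() throws Exception;
    }

    /**
     * Phase 1: retry the handshake until it succeeds. maxAttempts is a simplification
     * that stands in for the shouldRetryInit() check in the real method.
     */
    static State initialize(Step handshake, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handshake.run();
                return State.RUNNING;   // corresponds to the break out of while (true)
            } catch (Exception e) {
                // The real actor logs the error and sleeps 5 seconds before retrying.
            }
        }
        return State.FAILED;
    }

    /**
     * Phase 2: call the service step repeatedly; an exception ends only that round.
     * rounds stands in for while (shouldRun()). Returns the number of failed rounds.
     */
    static int serve(Step offerService, int rounds) {
        int failures = 0;
        for (int i = 0; i < rounds; i++) {
            try {
                offerService.run();
            } catch (Exception e) {
                failures++;             // logged and retried in the real method
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int[] handshakes = {0};
        // A handshake that fails twice before succeeding.
        State state = initialize(() -> {
            if (++handshakes[0] < 3) {
                throw new java.io.IOException("NameNode not reachable yet");
            }
        }, 5);
        System.out.println("state=" + state + " after " + handshakes[0] + " attempts");
    }
}
```

The point of the split is that a failed registration can end the actor, while a failed service round only costs one iteration.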

In run(), the first while (true) loop performs registration. Once registration succeeds, control moves on to the second while loop, which calls offerService:

/**
 * For each block pool, this method keeps being called for as long as the block pool service lives.
 */
private void offerService() throws Exception {
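  // The log message below shows the timing parameters involved:
  // 1. interval for reporting received/deleted blocks
  // 2. block report interval
  // 3. cache report interval
  // 4. initial block report delay
  // 5. heartbeat interval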
  LOG.info("For namenode " + nnAddr + " using"
      + " DELETEREPORT_INTERVAL of " + dnConf.deleteReportInterval + " msec "
      + " BLOCKREPORT_INTERVAL of " + dnConf.blockReportInterval + "msec"
      + " CACHEREPORT_INTERVAL of " + dnConf.cacheReportInterval + "msec"
      + " Initial delay: " + dnConf.initialBlockReportDelay + "msec"
      + "; heartBeatInterval=" + dnConf.heartBeatInterval);

  //
  // Now loop for a long time....
  //
  while (shouldRun()) {
    try {
      final long startTime = now();

      //
      // Every so often, send heartbeat or block-report
      //
      if (startTime - lastHeartbeat >= dnConf.heartBeatInterval) {
        //
        // All heartbeat messages include following info:
        // -- Datanode name
        // -- data transfer port
        // -- Total capacity
        // -- Bytes remaining
        //
        lastHeartbeat = startTime;
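        // send the heartbeat (unless heartbeats are disabled for tests)
        if (!dn.areHeartbeatsDisabledForTests()) {
          // heartbeat response returned by the NameNode; it carries commands
          // for the DataNode, the NameNode's HA state, and the RollingUpgradeStatus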
          HeartbeatResponse resp = sendHeartBeat();
          assert resp != null;
          dn.getMetrics().addHeartbeat(now() - startTime);

          // If the state of this NN has changed (eg STANDBY->ACTIVE)
          // then let the BPOfferService update itself.
          //
          // Important that this happens before processCommand below,
          // since the first heartbeat to a new active might have commands
          // that we should actually process.
          // update the actor's state from the HA state returned by the NameNode
          bpos.updateActorStatesFromHeartbeat(
              this, resp.getNameNodeHaState());
          state = resp.getNameNodeHaState().getState();

          if (state == HAServiceState.ACTIVE) {
            handleRollingUpgradeStatus(resp);
          }

          long startProcessCommands = now();
          // process the commands sent by the NameNode
          if (!processCommand(resp.getCommands()))
            continue;
          long endProcessCommands = now();
          if (endProcessCommands - startProcessCommands > 2000) {
            LOG.info("Took " + (endProcessCommands - startProcessCommands)
                + "ms to process " + resp.getCommands().length
                + " commands from NN");
          }
        }
      }
      if (sendImmediateIBR ||
          (startTime - lastDeletedReport > dnConf.deleteReportInterval)) {
        // incremental report of received and deleted blocks; the interval
        // defaults to 5 minutes (100 heartbeat intervals)
        reportReceivedDeletedBlocks();
        lastDeletedReport = startTime;
      }
      // full block report; the interval defaults to 6 hours
      List<DatanodeCommand> cmds = blockReport();
      processCommand(cmds == null ? null : cmds.toArray(new DatanodeCommand[cmds.size()]));
      // cache report; the interval defaults to 10 seconds
      DatanodeCommand cmd = cacheReport();
      processCommand(new DatanodeCommand[]{ cmd });

      // Now safe to start scanning the block pool.
      // If it has already been started, this is a no-op.
      if (dn.blockScanner != null) {
        dn.blockScanner.addBlockPool(bpos.getBlockPoolId());
      }

      //
      // There is no work to do;  sleep until heartbeat timer elapses,
      // or work arrives, and then iterate again.
      //
      long waitTime = dnConf.heartBeatInterval - 
      (Time.now() - lastHeartbeat);
      synchronized(pendingIncrementalBRperStorage) {
        if (waitTime > 0 && !sendImmediateIBR) {
          try {
            pendingIncrementalBRperStorage.wait(waitTime);
          } catch (InterruptedException ie) {
            LOG.warn("BPOfferService for " + this + " interrupted");
          }
        }
      } // synchronized
    } catch(RemoteException re) {
      String reClass = re.getClassName();
      if (UnregisteredNodeException.class.getName().equals(reClass) ||
          DisallowedDatanodeException.class.getName().equals(reClass) ||
          IncorrectVersionException.class.getName().equals(reClass)) {
        LOG.warn(this + " is shutting down", re);
        shouldServiceRun = false;
        return;
      }
      LOG.warn("RemoteException in offerService", re);
      try {
        long sleepTime = Math.min(1000, dnConf.heartBeatInterval);
        Thread.sleep(sleepTime);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
      }
    } catch (IOException e) {
      LOG.warn("IOException in offerService", e);
    }
  } // while (shouldRun())
} // offerService
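
The timing arithmetic in offerService can be isolated into a few lines. The sketch below is a hypothetical helper, not part of Hadoop: HeartbeatTimer, isDue, markSent and waitTimeMs are made-up names that mirror the check `startTime - lastHeartbeat >= dnConf.heartBeatInterval` and the computation of waitTime. The 3000 ms interval in main assumes the usual default heartbeat interval of 3 seconds.

```java
// Sketch only: HeartbeatTimer is a hypothetical helper, not a Hadoop class.
class HeartbeatTimer {
    private final long intervalMs;   // plays the role of dnConf.heartBeatInterval
    private long lastHeartbeatMs;    // plays the role of lastHeartbeat

    HeartbeatTimer(long intervalMs, long lastHeartbeatMs) {
        this.intervalMs = intervalMs;
        this.lastHeartbeatMs = lastHeartbeatMs;
    }

    /** True once at least one full interval has passed since the last heartbeat. */
    boolean isDue(long nowMs) {
        return nowMs - lastHeartbeatMs >= intervalMs;
    }

    /** Records that a heartbeat was started at nowMs. */
    void markSent(long nowMs) {
        lastHeartbeatMs = nowMs;
    }

    /** Time left until the next heartbeat; zero or negative means "do not wait". */
    long waitTimeMs(long nowMs) {
        return intervalMs - (nowMs - lastHeartbeatMs);
    }

    public static void main(String[] args) {
        HeartbeatTimer timer = new HeartbeatTimer(3000, 0);
        long[] samples = {1000, 3000, 3500, 6200};
        for (long now : samples) {
            if (timer.isDue(now)) {
                timer.markSent(now);
                System.out.println(now + " ms: send heartbeat");
            } else {
                System.out.println(now + " ms: wait " + timer.waitTimeMs(now) + " ms");
            }
        }
    }
}
```

Note that the real loop sets lastHeartbeat to the time the iteration started, not the time the RPC returned, so a slow heartbeat shortens the following wait instead of delaying the schedule.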

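offerService starts with a while loop. On each pass, if the time elapsed since the last heartbeat has reached the configured heartbeat interval, a heartbeat is sent. The sending happens on the line HeartbeatResponse resp = sendHeartBeat();, so step into sendHeartBeat():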

  HeartbeatResponse sendHeartBeat() throws IOException {
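    // Each StorageReport carries:
    //   - storage ID
    //   - storage state
    //   - storage type: regular disk, SSD, or another type
    //   - storage ID prefix
    //   - capacity
    //   - space used
    //   - space remaining
    //   - space used by this block pool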
    StorageReport[] reports =
        dn.getFSDataset().getStorageReports(bpos.getBlockPoolId());
    if (LOG.isDebugEnabled()) {
      LOG.debug("Sending heartbeat with " + reports.length +
                " storage reports from service actor: " + this);
    }

    return bpNamenode.sendHeartbeat(bpRegistration,
        reports,
        dn.getFSDataset().getCacheCapacity(),
        dn.getFSDataset().getCacheUsed(),
        dn.getXmitsInProgress(),
        dn.getXceiverCount(),
        dn.getFSDataset().getNumFailedVolumes());
  }

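sendHeartBeat first obtains an array of StorageReport objects. Looking at the class, each report carries the following information:

storage ID

storage state

storage type: regular disk, SSD, or another type

storage ID prefix

total capacity

space used

space remaining

space used by this block pool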

The reports are obtained through the getStorageReports method:

@Override // FsDatasetSpi
public StorageReport[] getStorageReports(String bpid)
    throws IOException {
  StorageReport[] reports;
  synchronized (statsLock) {
    List<FsVolumeImpl> curVolumes = getVolumes();
    reports = new StorageReport[curVolumes.size()];
    int i = 0;
    for (FsVolumeImpl volume : curVolumes) {
      reports[i++] = new StorageReport(volume.toDatanodeStorage(),
                                       false,
                                       volume.getCapacity(),
                                       volume.getDfsUsed(),
                                       volume.getAvailable(),
                                       volume.getBlockPoolUsed(bpid));
    }
  }

  return reports;
}

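The DataNode uses FsVolumeImpl to track the state of each disk, so the reports are ultimately built from the volume information that FsVolumeList holds in memory. With the storage reports in hand, the actor calls bpNamenode.sendHeartbeat: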

@Override
public HeartbeatResponse sendHeartbeat(DatanodeRegistration registration,
    StorageReport[] reports, long cacheCapacity, long cacheUsed,
        int xmitsInProgress, int xceiverCount, int failedVolumes)
            throws IOException {
  // build the heartbeat request message
  HeartbeatRequestProto.Builder builder = HeartbeatRequestProto.newBuilder()
      .setRegistration(PBHelper.convert(registration))
      .setXmitsInProgress(xmitsInProgress).setXceiverCount(xceiverCount)
      .setFailedVolumes(failedVolumes);
  builder.addAllReports(PBHelper.convertStorageReports(reports));
  if (cacheCapacity != 0) {
    builder.setCacheCapacity(cacheCapacity);
  }
  if (cacheUsed != 0) {
    builder.setCacheUsed(cacheUsed);
  }
  HeartbeatResponseProto resp;
  try {
    // call the NameNode's method through the RPC proxy
    resp = rpcProxy.sendHeartbeat(NULL_CONTROLLER, builder.build());
  } catch (ServiceException se) {
    throw ProtobufHelper.getRemoteException(se);
  }
  DatanodeCommand[] cmds = new DatanodeCommand[resp.getCmdsList().size()];
  int index = 0;
  for (DatanodeCommandProto p : resp.getCmdsList()) {
    cmds[index] = PBHelper.convert(p);
    index++;
  }
  RollingUpgradeStatus rollingUpdateStatus = null;
  if (resp.hasRollingUpgradeStatus()) {
    rollingUpdateStatus = PBHelper.convert(resp.getRollingUpgradeStatus());
  }
  return new HeartbeatResponse(cmds, PBHelper.convert(resp.getHaStatus()),
      rollingUpdateStatus);
}

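The request is built with HeartbeatRequestProto's builder. Then, just as with registration, the client-side RPC proxy invokes the method of the same name, sendHeartbeat, on the NameNode's RPC server.

On the NameNode, the RPC server's sendHeartbeat method looks like this: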

  // NameNode-side handler for the heartbeat RPC
  @Override // DatanodeProtocol
  public HeartbeatResponse sendHeartbeat(DatanodeRegistration nodeReg,
      StorageReport[] report, long dnCacheCapacity, long dnCacheUsed,
      int xmitsInProgress, int xceiverCount,
      int failedVolumes) throws IOException {
    checkNNStartup();
    verifyRequest(nodeReg);
    return namesystem.handleHeartbeat(nodeReg, report,
        dnCacheCapacity, dnCacheUsed, xceiverCount, xmitsInProgress,
        failedVolumes);
  }

The RPC server holds an FSNamesystem object and calls its handleHeartbeat method:

HeartbeatResponse handleHeartbeat(DatanodeRegistration nodeReg,
    StorageReport[] reports, long cacheCapacity, long cacheUsed,
    int xceiverCount, int xmitsInProgress, int failedVolumes)
      throws IOException {
  readLock();
  try {
    //get datanode commands
    final int maxTransfer = blockManager.getMaxReplicationStreams()
        - xmitsInProgress;
    // DatanodeManager handles the heartbeat
    DatanodeCommand[] cmds = blockManager.getDatanodeManager().handleHeartbeat(
        nodeReg, reports, blockPoolId, cacheCapacity, cacheUsed,
        xceiverCount, maxTransfer, failedVolumes);
    
    //create ha status
    final NNHAStatusHeartbeat haState = new NNHAStatusHeartbeat(
        haContext.getState().getServiceState(),
        getFSImage().getLastAppliedOrWrittenTxId());

    return new HeartbeatResponse(cmds, haState, rollingUpgradeInfo);
  } finally {
    readUnlock();
  }
}

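Handling the heartbeat on the server serves two purposes. It updates the last heartbeat time so that the NameNode can judge the DataNode's health, and it updates the DataNode's storage usage so that later block allocation decisions can take it into account. FSNamesystem.handleHeartbeat delegates the work to the DatanodeManager held by the BlockManager, through blockManager.getDatanodeManager().handleHeartbeat: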

public DatanodeCommand[] handleHeartbeat(DatanodeRegistration nodeReg,
    StorageReport[] reports, final String blockPoolId,
    long cacheCapacity, long cacheUsed, int xceiverCount, 
    int maxTransfers, int failedVolumes
    ) throws IOException {
  synchronized (heartbeatManager) {
    synchronized (datanodeMap) {
      // descriptor object for this DataNode
      DatanodeDescriptor nodeinfo = null;
      try {
        nodeinfo = getDatanode(nodeReg);
      } catch(UnregisteredNodeException e) {
        return new DatanodeCommand[]{RegisterCommand.REGISTER};
      }
      
      // Check if this datanode should actually be shutdown instead. 
      if (nodeinfo != null && nodeinfo.isDisallowed()) {
        setDatanodeDead(nodeinfo);
        throw new DisallowedDatanodeException(nodeinfo);
      }

      if (nodeinfo == null || !nodeinfo.isAlive) {
        return new DatanodeCommand[]{RegisterCommand.REGISTER};
      }
      // update the heartbeat statistics
      heartbeatManager.updateHeartbeat(nodeinfo, reports,
                                       cacheCapacity, cacheUsed,
                                       xceiverCount, failedVolumes);

      // If we are in safemode, do not send back any recovery / replication
      // requests. Don't even drain the existing queue of work.
      // in safe mode the NameNode sends no commands
      if(namesystem.isInSafeMode()) {
        return new DatanodeCommand[0];
      }

      //check lease recovery
      BlockInfoUnderConstruction[] blocks = nodeinfo
          .getLeaseRecoveryCommand(Integer.MAX_VALUE);
      if (blocks != null) {
        BlockRecoveryCommand brCommand = new BlockRecoveryCommand(
            blocks.length);
        for (BlockInfoUnderConstruction b : blocks) {
          final DatanodeStorageInfo[] storages = b.getExpectedStorageLocations();
          // Skip stale nodes during recovery - not heart beated for some time (30s by default).
          final List<DatanodeStorageInfo> recoveryLocations =
              new ArrayList<DatanodeStorageInfo>(storages.length);
          for (int i = 0; i < storages.length; i++) {
            if (!storages[i].getDatanodeDescriptor().isStale(staleInterval)) {
              recoveryLocations.add(storages[i]);
            }
          }
          // If we only get 1 replica after eliminating stale nodes, then choose all
          // replicas for recovery and let the primary data node handle failures.
          if (recoveryLocations.size() > 1) {
            if (recoveryLocations.size() != storages.length) {
              LOG.info("Skipped stale nodes for recovery : " +
                  (storages.length - recoveryLocations.size()));
            }
            brCommand.add(new RecoveringBlock(
                new ExtendedBlock(blockPoolId, b),
                DatanodeStorageInfo.toDatanodeInfos(recoveryLocations),
                b.getBlockRecoveryId()));
          } else {
            // If too many replicas are stale, then choose all replicas to participate
            // in block recovery.
            brCommand.add(new RecoveringBlock(
                new ExtendedBlock(blockPoolId, b),
                DatanodeStorageInfo.toDatanodeInfos(storages),
                b.getBlockRecoveryId()));
          }
        }
        return new DatanodeCommand[] { brCommand };
      }
      // commands to send back to the DataNode
      final List<DatanodeCommand> cmds = new ArrayList<DatanodeCommand>();
      //check pending replication
      List<BlockTargetPair> pendingList = nodeinfo.getReplicationCommand(
            maxTransfers);
      if (pendingList != null) {
        cmds.add(new BlockCommand(DatanodeProtocol.DNA_TRANSFER, blockPoolId,
            pendingList));
      }
      //check block invalidation
      Block[] blks = nodeinfo.getInvalidateBlocks(blockInvalidateLimit);
      if (blks != null) {
        cmds.add(new BlockCommand(DatanodeProtocol.DNA_INVALIDATE,
            blockPoolId, blks));
      }
      boolean sendingCachingCommands = false;
      long nowMs = Time.monotonicNow();
      if (shouldSendCachingCommands && 
          ((nowMs - nodeinfo.getLastCachingDirectiveSentTimeMs()) >=
              timeBetweenResendingCachingDirectivesMs)) {
        DatanodeCommand pendingCacheCommand =
            getCacheCommand(nodeinfo.getPendingCached(), nodeinfo,
              DatanodeProtocol.DNA_CACHE, blockPoolId);
        if (pendingCacheCommand != null) {
          cmds.add(pendingCacheCommand);
          sendingCachingCommands = true;
        }
        DatanodeCommand pendingUncacheCommand =
            getCacheCommand(nodeinfo.getPendingUncached(), nodeinfo,
              DatanodeProtocol.DNA_UNCACHE, blockPoolId);
        if (pendingUncacheCommand != null) {
          cmds.add(pendingUncacheCommand);
          sendingCachingCommands = true;
        }
        if (sendingCachingCommands) {
          nodeinfo.setLastCachingDirectiveSentTimeMs(nowMs);
        }
      }

      blockManager.addKeyUpdateCommand(cmds, nodeinfo);

      // check for balancer bandwidth update
      if (nodeinfo.getBalancerBandwidth() > 0) {
        cmds.add(new BalancerBandwidthCommand(nodeinfo.getBalancerBandwidth()));
        // set back to 0 to indicate that datanode has been sent the new value
        nodeinfo.setBalancerBandwidth(0);
      }

      if (!cmds.isEmpty()) {
        return cmds.toArray(new DatanodeCommand[cmds.size()]);
      }
    }
  }

  return new DatanodeCommand[0];
}
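
The order of the early returns in handleHeartbeat decides what a DataNode gets back, and it is easier to see with the details stripped away. The following is a simplified sketch under stated assumptions: HeartbeatReply and commandsFor are hypothetical, the boolean inputs replace the real lookups, the command names are plain strings chosen for illustration, and the caching, key update and balancer bandwidth commands are left out.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: HeartbeatReply and its boolean inputs are hypothetical simplifications.
class HeartbeatReply {
    static List<String> commandsFor(boolean registered, boolean alive, boolean safeMode,
                                    boolean hasLeaseRecovery, boolean hasReplication,
                                    boolean hasInvalidation) {
        List<String> cmds = new ArrayList<>();
        // An unknown or dead node is only told to register again.
        if (!registered || !alive) {
            cmds.add("REGISTER");
            return cmds;
        }
        // The real method updates heartbeat statistics at this point.
        // In safe mode no work is handed out at all.
        if (safeMode) {
            return cmds;
        }
        // Pending lease recovery is returned on its own, ahead of everything else.
        if (hasLeaseRecovery) {
            cmds.add("RECOVERBLOCK");
            return cmds;
        }
        if (hasReplication) {
            cmds.add("TRANSFER");
        }
        if (hasInvalidation) {
            cmds.add("INVALIDATE");
        }
        return cmds;
    }

    public static void main(String[] args) {
        System.out.println(commandsFor(false, false, false, false, false, false));
        System.out.println(commandsFor(true, true, true, true, true, true));
        System.out.println(commandsFor(true, true, false, true, true, true));
        System.out.println(commandsFor(true, true, false, false, true, true));
    }
}
```

One consequence worth noticing: a heartbeat that triggers block recovery carries no replication or invalidation work, so that work waits for a later heartbeat.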

handleHeartbeat mainly does two things: it calls HeartbeatManager's updateHeartbeat method, and it builds commands for the DataNode to carry out. First, the update method:

synchronized void updateHeartbeat(final DatanodeDescriptor node,
    StorageReport[] reports, long cacheCapacity, long cacheUsed,
    int xceiverCount, int failedVolumes) {
  stats.subtract(node);
  // update the information held in the DatanodeDescriptor
  node.updateHeartbeat(reports, cacheCapacity, cacheUsed,
      xceiverCount, failedVolumes);
  stats.add(node);
}

Step into DatanodeDescriptor.updateHeartbeat:

public void updateHeartbeat(StorageReport[] reports, long cacheCapacity,
    long cacheUsed, int xceiverCount, int volFailures) {
  updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount,
      volFailures);
  // records that a heartbeat has been received since registration
  heartbeatedSinceRegistration = true;
}

Step into updateHeartbeatState:

public void updateHeartbeatState(StorageReport[] reports, long cacheCapacity,
      long cacheUsed, int xceiverCount, int volFailures) {
    long totalCapacity = 0;
    long totalRemaining = 0;
    long totalBlockPoolUsed = 0;
    long totalDfsUsed = 0;
    Set<DatanodeStorageInfo> failedStorageInfos = null;

    // Decide if we should check for any missing StorageReport and mark it as
    // failed. There are different scenarios.
    // 1. When DN is running, a storage failed. Given the current DN
    //    implementation doesn't add recovered storage back to its storage list
    //    until DN restart, we can assume volFailures won't decrease
    //    during the current DN registration session.
    //    When volumeFailures == this.volumeFailures, it implies there is no
    //    state change. No need to check for failed storage. This is an
    //    optimization.
    // 2. After DN restarts, volFailures might not increase and it is possible
    //    we still have new failed storage. For example, admins reduce
    //    available storages in configuration. Another corner case
    //    is the failed volumes might change after restart; a) there
    //    is one good storage A, one restored good storage B, so there is
    //    one element in storageReports and that is A. b) A failed. c) Before
    //    DN sends HB to NN to indicate A has failed, DN restarts. d) After DN
    //    restarts, storageReports has one element which is B.
    boolean checkFailedStorages = (volFailures > this.volumeFailures) ||
        !heartbeatedSinceRegistration;
    // Assume every known storage has failed, then remove each storage
    // that shows up in this heartbeat's reports.
    if (checkFailedStorages) {
      LOG.info("Number of failed storage changes from "
          + this.volumeFailures + " to " + volFailures);
      failedStorageInfos = new HashSet<DatanodeStorageInfo>(
          storageMap.values());
    }

    setCacheCapacity(cacheCapacity);
    setCacheUsed(cacheUsed);
    setXceiverCount(xceiverCount);
    // record the time of the latest heartbeat (method inherited from the parent class)
    setLastUpdate(Time.now());    
    this.volumeFailures = volFailures;
    for (StorageReport report : reports) {
      DatanodeStorageInfo storage = updateStorage(report.getStorage());
      if (checkFailedStorages) {
        failedStorageInfos.remove(storage);
      }
      // accumulate the per-disk figures into node totals
      storage.receivedHeartbeat(report);
      totalCapacity += report.getCapacity();
      totalRemaining += report.getRemaining();
      totalBlockPoolUsed += report.getBlockPoolUsed();
      totalDfsUsed += report.getDfsUsed();
    }
    rollBlocksScheduled(getLastUpdate());

    // Update total metrics for the node.
    setCapacity(totalCapacity);
    setRemaining(totalRemaining);
    setBlockPoolUsed(totalBlockPoolUsed);
    setDfsUsed(totalDfsUsed);
    if (checkFailedStorages) {
      updateFailedStorage(failedStorageInfos);
    }

    if (storageMap.size() != reports.length) {
      pruneStorageMap(reports);
    }
  }

This method mainly refreshes the storage information that the NameNode keeps for the DataNode.
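
Two ideas in updateHeartbeatState are worth seeing in isolation: summing the per-volume figures into node totals, and finding failed storages by starting from "everything failed" and removing whatever reported in. The sketch below is illustrative only. StorageAggregation, Report, totals and missingStorages are hypothetical names, and storages are identified by plain strings instead of DatanodeStorageInfo objects.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch only: StorageAggregation and Report are hypothetical stand-ins for
// DatanodeDescriptor and StorageReport.
class StorageAggregation {
    static final class Report {
        final String storageId;
        final long capacity;
        final long dfsUsed;
        final long remaining;

        Report(String storageId, long capacity, long dfsUsed, long remaining) {
            this.storageId = storageId;
            this.capacity = capacity;
            this.dfsUsed = dfsUsed;
            this.remaining = remaining;
        }
    }

    /** Sums per-volume figures into node totals: {capacity, dfsUsed, remaining}. */
    static long[] totals(List<Report> reports) {
        long capacity = 0;
        long dfsUsed = 0;
        long remaining = 0;
        for (Report r : reports) {
            capacity += r.capacity;
            dfsUsed += r.dfsUsed;
            remaining += r.remaining;
        }
        return new long[] {capacity, dfsUsed, remaining};
    }

    /** Storages known to the NameNode that are absent from this heartbeat. */
    static Set<String> missingStorages(Set<String> known, List<Report> reports) {
        Set<String> failed = new HashSet<>(known);   // assume every known storage failed
        for (Report r : reports) {
            failed.remove(r.storageId);              // clear the ones that reported in
        }
        return failed;
    }

    public static void main(String[] args) {
        List<Report> reports = Arrays.asList(
            new Report("DS-1", 100, 40, 60),
            new Report("DS-3", 200, 50, 150));
        Set<String> known = new HashSet<>(Arrays.asList("DS-1", "DS-2", "DS-3"));
        System.out.println("totals=" + Arrays.toString(totals(reports)));
        System.out.println("missing=" + missingStorages(known, reports));
    }
}
```

The real method only runs the failed-storage check when the failure count has grown or when this is the first heartbeat since registration, which keeps the common case cheap.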

 

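After the NameNode has updated its record of the DataNode, control returns to the DataNode. Acting as the RPC client, it receives the heartbeat response and calls bpos.updateActorStatesFromHeartbeat: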

void updateActorStatesFromHeartbeat(
    BPServiceActor actor,
    NNHAStatusHeartbeat nnHaState) {
  writeLock();
  try {
    final long txid = nnHaState.getTxId();

    final boolean nnClaimsActive =
        nnHaState.getState() == HAServiceState.ACTIVE;
    final boolean bposThinksActive = bpServiceToActive == actor;
    final boolean isMoreRecentClaim = txid > lastActiveClaimTxId;

    if (nnClaimsActive && !bposThinksActive) {
      LOG.info("Namenode " + actor + " trying to claim ACTIVE state with " +
          "txid=" + txid);
      if (!isMoreRecentClaim) {
        // Split-brain scenario - an NN is trying to claim active
        // state when a different NN has already claimed it with a higher
        // txid.
        LOG.warn("NN " + actor + " tried to claim ACTIVE state at txid=" +
            txid + " but there was already a more recent claim at txid=" +
            lastActiveClaimTxId);
        return;
      } else {
        if (bpServiceToActive == null) {
          LOG.info("Acknowledging ACTIVE Namenode " + actor);
        } else {
          LOG.info("Namenode " + actor + " taking over ACTIVE state from " +
              bpServiceToActive + " at higher txid=" + txid);
        }
        bpServiceToActive = actor;
      }
    } else if (!nnClaimsActive && bposThinksActive) {
      LOG.info("Namenode " + actor + " relinquishing ACTIVE state with " +
          "txid=" + nnHaState.getTxId());
      bpServiceToActive = null;
    }

    if (bpServiceToActive == actor) {
      assert txid >= lastActiveClaimTxId;
      lastActiveClaimTxId = txid;
    }
  } finally {
    writeUnlock();
  }
}
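
The method above decides which NameNode the DataNode treats as active, using the transaction ID to reject a stale claim. A reduced sketch of that decision follows. ActiveTracker and onHeartbeat are hypothetical names, actors are identified by plain strings, and the initial value of -1 for the last claimed txid is an assumption made for this sketch.

```java
// Sketch only: ActiveTracker is a hypothetical reduction of the logic in
// updateActorStatesFromHeartbeat; actors are identified by plain strings here.
class ActiveTracker {
    private String active;                  // plays the role of bpServiceToActive
    private long lastActiveClaimTxId = -1;  // assumed starting value

    /** Applies one heartbeat response; returns the actor currently treated as active. */
    String onHeartbeat(String actor, boolean claimsActive, long txid) {
        boolean thinksActive = actor.equals(active);
        if (claimsActive && !thinksActive) {
            if (txid <= lastActiveClaimTxId) {
                return active;              // stale claim (possible split brain): ignore it
            }
            active = actor;                 // newer claim wins
        } else if (!claimsActive && thinksActive) {
            active = null;                  // the active NameNode stepped down
        }
        if (actor.equals(active)) {
            lastActiveClaimTxId = txid;     // remember the latest txid seen from the active
        }
        return active;
    }

    public static void main(String[] args) {
        ActiveTracker tracker = new ActiveTracker();
        System.out.println(tracker.onHeartbeat("nn1", true, 100));
        System.out.println(tracker.onHeartbeat("nn2", true, 90));
        System.out.println(tracker.onHeartbeat("nn2", true, 120));
        System.out.println(tracker.onHeartbeat("nn2", false, 130));
    }
}
```

In this run nn1 is accepted first, the claim from nn2 at the lower txid 90 is ignored, the claim at txid 120 takes over, and the final heartbeat leaves no active NameNode at all.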

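At this point the heartbeat exchange is complete.

To summarize:

1. In the run method of the DataNode's BPServiceActor, the actor collects the storage information of each local disk, packs it into a request, and sends it through the client-side RPC proxy to the corresponding method on the NameNode's RPC server.

2. The NameNode's RPC server receives the request and hands it to the method of the same name, which calls FSNamesystem. FSNamesystem delegates to the BlockManager's DatanodeManager, which in turn uses HeartbeatManager. Handling a heartbeat mainly means updating the last heartbeat time and the DataNode storage information kept on the NameNode. The NameNode then generates commands and returns them to the DataNode in the response.

3. After receiving the response, the DataNode executes the commands it contains.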
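
Point 2 says the last heartbeat time is what lets the NameNode judge a DataNode's health. As a rough illustration of how such a timestamp can become a liveness decision, here is a minimal sketch. LivenessCheck is a hypothetical class, and the constants are what I understand to be the usual defaults (a 3 second heartbeat interval and a 5 minute recheck interval, giving an expiry window of 2 × recheck + 10 × heartbeat); verify them against the configuration of your Hadoop version.

```java
// Sketch only: LivenessCheck is hypothetical; the constants are assumed defaults.
class LivenessCheck {
    static final long HEARTBEAT_INTERVAL_MS = 3_000L;         // assumed dfs.heartbeat.interval
    static final long RECHECK_INTERVAL_MS = 5 * 60 * 1_000L;  // assumed recheck interval

    /** Window after which a silent DataNode is considered dead. */
    static long expireIntervalMs() {
        return 2 * RECHECK_INTERVAL_MS + 10 * HEARTBEAT_INTERVAL_MS;
    }

    /** True when the last heartbeat is older than the expiry window. */
    static boolean isDead(long lastUpdateMs, long nowMs) {
        return lastUpdateMs < nowMs - expireIntervalMs();
    }

    public static void main(String[] args) {
        long lastUpdate = 0L;
        System.out.println("expiry window (ms): " + expireIntervalMs());
        System.out.println("dead after 10 minutes? " + isDead(lastUpdate, 10 * 60 * 1_000L));
        System.out.println("dead after 11 minutes? " + isDead(lastUpdate, 11 * 60 * 1_000L));
    }
}
```

With these assumed values a DataNode has to miss heartbeats for ten and a half minutes before it is treated as dead, which is why a single lost heartbeat has no visible effect.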

 
