Hadoop 3.2.1 HDFS Source Code Analysis: BlockManager (Part 1)

 

1. Preface

One of the most important functions of BlockManager is maintaining the data block information held in the Namenode's memory. The block information stored in BlockManager consists of two parts:

■ The mapping from a data block to the datanode storages holding its replicas. This is kept in the storages[] array of the block's BlockInfo object, and all BlockInfo objects in the Namenode's memory are stored in the BlockManager.blocksMap field.
■ The mapping from a datanode storage to all of the data blocks stored on it. This is kept in the DatanodeStorageInfo.blocks field (called blockList in older releases); through it, a DatanodeStorageInfo can reach the BlockInfo objects of every block stored on that datanode storage.

Essentially all of BlockManager's management of block information in the Namenode is implemented by modifying these two data structures.
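To make these two mappings concrete, here is a minimal toy sketch. The class and field names below are hypothetical stand-ins, not the real Hadoop classes (the real BlockInfo and DatanodeStorageInfo carry far more state), but they show how each side references the other:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy stand-in for BlockInfo: a block knows the storages holding its replicas.
class ToyBlockInfo {
    final long blockId;
    final Set<ToyStorageInfo> storages = new HashSet<>(); // mapping 1

    ToyBlockInfo(long blockId) { this.blockId = blockId; }
}

// Toy stand-in for DatanodeStorageInfo: a storage knows the blocks it holds.
class ToyStorageInfo {
    final String storageId;
    final Set<ToyBlockInfo> blocks = new HashSet<>();     // mapping 2

    ToyStorageInfo(String storageId) { this.storageId = storageId; }

    // Adding a replica updates both directions, which is conceptually what
    // DatanodeStorageInfo.addBlock() does for the real structures.
    void addBlock(ToyBlockInfo block) {
        blocks.add(block);
        block.storages.add(this);
    }
}

// Toy stand-in for BlockManager: every ToyBlockInfo is registered here,
// mirroring BlockManager.blocksMap.
class ToyBlockManager {
    final Map<Long, ToyBlockInfo> blocksMap = new HashMap<>();
}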

2. Adding, Deleting, Modifying, and Querying Data Blocks

2.1. Adding a Data Block

As an example, let's upload a spark-examples_2.11-2.3.1.jar file to the HDFS root directory "/" and walk through the code.

When a client writes a new file to HDFS and fills up a data block, it calls ClientProtocol.addBlock() to request a new block from the Namenode. When the request reaches the Namenode, it is handled by FSNamesystem.getAdditionalBlock().
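For context, the client side of this exchange is just the ordinary FileSystem API. A minimal sketch (the local path is hypothetical, and fs.defaultFS is assumed to point at the target cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    try (FileSystem fs = FileSystem.get(conf)) {
      // As DFSOutputStream fills each block during the copy, the client
      // calls ClientProtocol.addBlock() to ask the Namenode for the next one.
      fs.copyFromLocalFile(new Path("/tmp/spark-examples_2.11-2.3.1.jar"),
                           new Path("/"));
    }
  }
}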

 

NameNodeRpcServer#addBlock delegates to FSNamesystem#getAdditionalBlock:

getAdditionalBlock() first checks the file system state, then chooses the Datanodes that will store the replicas of the new block, and finally constructs a Block object and calls FSDirWriteFileOp.storeAllocatedBlock() to add the Block to the file's INode, updating the in-memory metadata and syncing the operation to the edit log.

 


  /**
   * getAdditionalBlock() first checks the file system state, then chooses
   * Datanodes to store the replicas of the new block, and finally constructs
   * a Block object and calls FSDirWriteFileOp.storeAllocatedBlock() to add
   * the Block to the file's corresponding INode.
   *
   * The client would like to obtain an additional block for the indicated
   * filename (which is being written-to).  Return an array that consists
   * of the block, plus a set of machines.  The first on this list should
   * be where the client writes data.  Subsequent items in the list must
   * be provided in the connection to the first datanode.
   *
   * Make sure the previous blocks have been reported by datanodes and
   * are replicated.  Will return an empty 2-elt array if we want the
   * client to "try again later".
   */
  LocatedBlock getAdditionalBlock(
      String src, long fileId, String clientName, ExtendedBlock previous,
      DatanodeInfo[] excludedNodes, String[] favoredNodes,
      EnumSet<AddBlockFlag> flags) throws IOException {


    final String operationName = "getAdditionalBlock";
    NameNode.stateChangeLog.debug("BLOCK* getAdditionalBlock: {}  inodeId {}" +
        " for {}", src, fileId, clientName);

    LocatedBlock[] onRetryBlock = new LocatedBlock[1];
    FSDirWriteFileOp.ValidateAddBlockResult r;
    checkOperation(OperationCategory.READ);
    final FSPermissionChecker pc = getPermissionChecker();
    readLock();
    try {
      checkOperation(OperationCategory.READ);
      r = FSDirWriteFileOp.validateAddBlock(this, pc, src, fileId, clientName,
                                            previous, onRetryBlock);
    } finally {
      readUnlock(operationName);
    }

    if (r == null) {
      assert onRetryBlock[0] != null : "Retry block is null";
      // This is a retry. Just return the last block.
      return onRetryBlock[0];
    }
    // Choose the datanode storages that will hold the new block's replicas
    DatanodeStorageInfo[] targets = FSDirWriteFileOp.chooseTargetForNewBlock(
        blockManager, src, excludedNodes, favoredNodes, flags, r);

    checkOperation(OperationCategory.WRITE);
    writeLock();
    LocatedBlock lb;
    try {
      checkOperation(OperationCategory.WRITE);

      // Construct and persist the new block
      lb = FSDirWriteFileOp.storeAllocatedBlock(
          this, src, fileId, clientName, previous, targets);

    } finally {
      writeUnlock(operationName);
    }
    getEditLog().logSync();
    return lb;
  }

Finally, the LocatedBlock with the chosen storage locations is returned to the client.
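After the file is closed, the placement decided here can be observed from the client through the public API; a small sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    try (FileSystem fs = FileSystem.get(new Configuration())) {
      FileStatus st = fs.getFileStatus(new Path("/spark-examples_2.11-2.3.1.jar"));
      // One BlockLocation per block, listing the datanodes the Namenode chose.
      for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
        System.out.println(loc); // offset, length, and replica hosts
      }
    }
  }
}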

 

2.2. Adding a Replica

When a Datanode writes a new block replica or completes a block replica copy, it reports the newly added replica to the Namenode via DatanodeProtocol.blockReport() or DatanodeProtocol.blockReceivedAndDeleted().

Both interfaces eventually call BlockManager.addStoredBlock() to update the mapping between block replicas and Datanodes in BlockManager.blocksMap.


  /**
   * When a Datanode writes a new block replica or completes a block replica
   * copy, it reports the newly added replica to the Namenode via
   * DatanodeProtocol.blockReport() or
   * DatanodeProtocol.blockReceivedAndDeleted(). Both interfaces eventually
   * call BlockManager.addStoredBlock() to update the mapping between block
   * replicas and Datanodes in BlockManager.blocksMap.
   *
   * addStoredBlock() first checks whether the replica belongs to an HDFS file
   * tracked in the Namenode's memory, and returns immediately if it does not.
   * It then calls storageInfo.addBlock() to record the datanode storage in
   * the block-to-storage mapping (adding the current DatanodeStorageInfo to
   * the BlockInfo's storages) and to record the block on the datanode storage
   * (adding the replica's BlockInfo to the DatanodeStorageInfo's block list).
   *
   * Modify (block-->datanode) map. Remove block from set of
   * needed reconstruction if this takes care of the problem.
   * @return the block that is stored in blocksMap.
   */
  private Block addStoredBlock(final BlockInfo block,
                               final Block reportedBlock,
                               DatanodeStorageInfo storageInfo,
                               DatanodeDescriptor delNodeHint,
                               boolean logEveryBlock)
  throws IOException {

    assert block != null && namesystem.hasWriteLock();

    BlockInfo storedBlock;
    DatanodeDescriptor node = storageInfo.getDatanodeDescriptor();
    if (!block.isComplete()) {
      //refresh our copy in case the block got completed in another thread
      storedBlock = getStoredBlock(block);
    } else {
      storedBlock = block;
    }

    // If the block does not belong to any INode, return without doing anything
    if (storedBlock == null || storedBlock.isDeleted()) {

      // If this block does not belong to any file, then we are done.
      blockLog.debug("BLOCK* addStoredBlock: {} on {} size {} but it does not" +
          " belong to any file", block, node, block.getNumBytes());

      // we could add this block to invalidate set of this datanode.
      // it will happen in next block report otherwise.
      return block;
    }

    // Add the current datanode to the block -> datanode mapping
    // add block to the datanode
    AddBlockResult result = storageInfo.addBlock(storedBlock, reportedBlock);

    int curReplicaDelta;
    if (result == AddBlockResult.ADDED) {
      curReplicaDelta =
          (node.isDecommissioned() || node.isDecommissionInProgress()) ? 0 : 1;
      if (logEveryBlock) {
        blockLog.debug("BLOCK* addStoredBlock: {} is added to {} (size={})",
            node, storedBlock, storedBlock.getNumBytes());
      }
    } else if (result == AddBlockResult.REPLACED) {
      curReplicaDelta = 0;
      blockLog.warn("BLOCK* addStoredBlock: block {} moved to storageType " +
          "{} on node {}", storedBlock, storageInfo.getStorageType(), node);
    } else {
      // if the same block is added again and the replica was corrupt
      // previously because of a wrong gen stamp, remove it from the
      // corrupt block list.
      corruptReplicas.removeFromCorruptReplicasMap(block, node,
          Reason.GENSTAMP_MISMATCH);
      curReplicaDelta = 0;
      blockLog.debug("BLOCK* addStoredBlock: Redundant addStoredBlock request"
              + " received for {} on node {} size {}", storedBlock, node,
          storedBlock.getNumBytes());
    }

    // Now check for completion of blocks and safe block count
    NumberReplicas num = countNodes(storedBlock);
    int numLiveReplicas = num.liveReplicas();
    int pendingNum = pendingReconstruction.getNumReplicas(storedBlock);
    int numCurrentReplica = numLiveReplicas + pendingNum;
    int numUsableReplicas = num.liveReplicas() +
        num.decommissioning() + num.liveEnteringMaintenanceReplicas();

    // If the block of the newly added replica is in COMMITTED state
    if(storedBlock.getBlockUCState() == BlockUCState.COMMITTED &&
        hasMinStorage(storedBlock, numUsableReplicas)) {
      addExpectedReplicasToPending(storedBlock);

      // Switch the block's state in the Namenode from under construction to
      // complete: update the reference held by the file's INode and by
      // BlockManager.blocksMap from the under-construction form to a normal
      // BlockInfo. completeBlock() is only invoked when the block is in
      // COMMITTED state and its replica count satisfies the minimum storage
      // requirement; otherwise execution falls through.

      completeBlock(storedBlock, null, false);

    } else if (storedBlock.isComplete() && result == AddBlockResult.ADDED) {
      // check whether safe replication is reached for the block
      // only complete blocks are counted towards that
      // Is no-op if not in safe mode.
      // In the case that the block just became complete above, completeBlock()
      // handles the safe block count maintenance.
      bmSafeMode.incrementSafeBlockCount(numCurrentReplica, storedBlock);
    }
    
    // if block is still under construction, then done for now
    if (!storedBlock.isCompleteOrCommitted()) {
      return storedBlock;
    }

    // do not try to handle extra/low redundancy blocks during first safe mode
    if (!isPopulatingReplQueues()) {
      return storedBlock;
    }

    // handle low redundancy/extra redundancy
    short fileRedundancy = getExpectedRedundancyNum(storedBlock);
    if (!isNeededReconstruction(storedBlock, num, pendingNum)) {
      neededReconstruction.remove(storedBlock, numCurrentReplica,
          num.readOnlyReplicas(), num.outOfServiceReplicas(), fileRedundancy);
    } else {
      updateNeededReconstructions(storedBlock, curReplicaDelta, 0);
    }
    if (shouldProcessExtraRedundancy(num, fileRedundancy)) {
      processExtraRedundancyBlock(storedBlock, fileRedundancy, node,
          delNodeHint);
    }
    // If the file redundancy has reached desired value
    // we can remove any corrupt replicas the block may have
    int corruptReplicasCount = corruptReplicas.numCorruptReplicas(storedBlock);
    int numCorruptNodes = num.corruptReplicas();
    if (numCorruptNodes != corruptReplicasCount) {
      LOG.warn("Inconsistent number of corrupt replicas for {}" +
          ". blockMap has {} but corrupt replicas map has {}",
          storedBlock, numCorruptNodes, corruptReplicasCount);
    }
    if ((corruptReplicasCount > 0) && (numLiveReplicas >= fileRedundancy)) {
      invalidateCorruptReplicas(storedBlock, reportedBlock, num);
    }
    return storedBlock;
  }

 

2.3. Deleting a Data Block

When a client deletes an HDFS file, it calls the RPC interface ClientProtocol.delete() to remove the file or directory, all of the file's data blocks, and every replica of those blocks on the Datanodes. The request is handled along FSNamesystem.delete() -> FSDirDeleteOp.delete() -> FSDirDeleteOp.deleteInternal().
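On the client side this is a single FileSystem.delete() call; a minimal sketch, reusing the example file from section 2.1:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DeleteExample {
  public static void main(String[] args) throws Exception {
    try (FileSystem fs = FileSystem.get(new Configuration())) {
      // delete(path, recursive): recursive=false suffices for a single file.
      // This issues the ClientProtocol.delete() RPC whose entry point is below.
      boolean deleted = fs.delete(new Path("/spark-examples_2.11-2.3.1.jar"), false);
      System.out.println("deleted = " + deleted);
    }
  }
}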

Below is the entry point of the deletion path; the rest of the call chain is long, so explore it on your own...


  @Override // ClientProtocol
  public boolean delete(String src, boolean recursive) throws IOException {
    checkNNStartup();
    if (stateChangeLog.isDebugEnabled()) {
      stateChangeLog.debug("*DIR* Namenode.delete: src=" + src
          + ", recursive=" + recursive);
    }
    namesystem.checkOperation(OperationCategory.WRITE);
    CacheEntry cacheEntry = RetryCache.waitForCompletion(retryCache);
    if (cacheEntry != null && cacheEntry.isSuccess()) {
      return true; // Return previous response
    }

    boolean ret = false;
    try {
      // Perform the deletion
      ret = namesystem.delete(src, recursive, cacheEntry != null);
    } finally {
      RetryCache.setState(cacheEntry, ret);
    }
    if (ret) 
      metrics.incrDeleteFileOps();
    return ret;
  }

3. Block Reports

As we know, the mapping between data blocks and datanodes in the Namenode is not persisted in the fsimage file. Instead, Datanodes periodically send block reports to the Namenode, and the Namenode rebuilds the in-memory mapping between blocks and datanodes from them.

The block-to-datanode-storage mapping is maintained by the BlockManager.blocksMap field, while the storage-to-block information is maintained by the DatanodeStorageInfo.blocks objects.

After a Datanode starts, it handshakes and registers with the Namenode, then sends its first full block report, which contains every replica stored on the Datanode. After that, the Datanode's BPServiceActor sends a full block report every dfs.blockreport.intervalMsec (6 hours by default), and sends incremental block reports at intervals of 100*heartBeatInterval (100 times the heartbeat interval, i.e. 300 seconds by default); an incremental report contains the replicas the Datanode has recently added or deleted.
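As a quick sanity check of the arithmetic, here is a small sketch that reads the two configuration keys named above (HDFS itself parses them with its own helpers, so this is only illustrative) and derives the effective intervals:

import org.apache.hadoop.conf.Configuration;

public class ReportIntervals {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Full block report interval, 21600000 ms (6 hours) by default.
    long fullBrMs = conf.getLong("dfs.blockreport.intervalMsec", 21600000L);
    // Heartbeat interval, 3 seconds by default.
    long heartbeatSec = conf.getLong("dfs.heartbeat.interval", 3L);
    // Incremental reports of added/deleted replicas are sent at most every
    // 100 * heartbeat interval, i.e. 100 * 3 s = 300 s with the defaults.
    long incrementalSec = 100 * heartbeatSec;
    System.out.printf("full BR every %d ms, incremental every %d s%n",
        fullBrMs, incrementalSec);
  }
}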

The Namenode distinguishes two kinds of full block reports: the first one sent at Datanode startup, and the periodic ones. For the first report, in order to respond quickly, the Namenode computes neither the metadata that should be deleted nor the invalid replicas; that processing is deferred to the next block report.

When a block report reaches the Namenode, it is handled by BlockManager.processReport(). This method checks whether the report is the first one from that datanode storage; if so, it calls processFirstBlockReport(), which is very efficient. Otherwise it calls the private processReport() overload. Afterwards, processReport() also rescans the postponedMisreplicatedBlocks set and removes the replicas that are no longer in the stale state.

In an HA deployment, a Datanode's heartbeats, full block reports, and incremental block reports are sent to both the Standby and the Active Namenode. When the Standby Namenode processes a full block report, its namespace may not yet be synchronized with the Active Namenode; in that case the replicas to be processed are queued temporarily and handled after the Standby finishes loading the editlog and updating its namespace.


  /**
   * processReport() checks whether this is the first block report from the
   * given datanode storage. If so, it calls processFirstBlockReport(), which
   * is much more efficient; otherwise it calls the private processReport()
   * overload.
   *
   * The given storage is reporting all its blocks.
   * Update the (storage-->block list) and (block-->storage list) maps.
   *
   * @return true if all known storages of the given DN have finished reporting.
   * @throws IOException
   */
  public boolean processReport(final DatanodeID nodeID,
      final DatanodeStorage storage,
      final BlockListAsLongs newReport,
      BlockReportContext context) throws IOException {

    namesystem.writeLock();
    final long startTime = Time.monotonicNow(); //after acquiring write lock
    final long endTime;
    DatanodeDescriptor node;
    Collection<Block> invalidatedBlocks = Collections.emptyList();
    String strBlockReportId =
        context != null ? Long.toHexString(context.getReportId()) : "";

    try {
      node = datanodeManager.getDatanode(nodeID);
      if (node == null || !node.isRegistered()) {
        throw new IOException(
            "ProcessReport from dead or unregistered node: " + nodeID);
      }

      // To minimize startup time, we discard any second (or later) block reports
      // that we receive while still in startup phase.
      // Register DN with provided storage, not with storage owned by DN
      // DN should still have a ref to the DNStorageInfo.
      DatanodeStorageInfo storageInfo =
          providedStorageMap.getStorage(node, storage);

      if (storageInfo == null) {
        // We handle this for backwards compatibility.
        storageInfo = node.updateStorage(storage);
      }
      if (namesystem.isInStartupSafeMode()
          && storageInfo.getBlockReportCount() > 0) {
        blockLog.info("BLOCK* processReport 0x{}: "
            + "discarded non-initial block report from {}"
            + " because namenode still in startup phase",
            strBlockReportId, nodeID);
        blockReportLeaseManager.removeLease(node);
        return !node.hasStaleStorages();
      }

      if (storageInfo.getBlockReportCount() == 0) {
        // The first block report can be processed a lot more efficiently than
        // ordinary block reports.  This shortens restart times.
        blockLog.info("BLOCK* processReport 0x{}: Processing first "
            + "storage report for {} from datanode {}",
            strBlockReportId,
            storageInfo.getStorageID(),
            nodeID.getDatanodeUuid());

        // First block report: call processFirstBlockReport()
        processFirstBlockReport(storageInfo, newReport);
      } else {
        // Block reports for provided storage are not
        // maintained by DN heartbeats
        if (!StorageType.PROVIDED.equals(storageInfo.getStorageType())) {

          // Not the first report: call the private processReport() overload
          invalidatedBlocks = processReport(storageInfo, newReport, context);
        }
      }
      storageInfo.receivedBlockReport();
    } finally {
      endTime = Time.monotonicNow();
      namesystem.writeUnlock();
    }

    for (Block b : invalidatedBlocks) {
      blockLog.debug("BLOCK* processReport 0x{}: {} on node {} size {} does not"
          + " belong to any file", strBlockReportId, b, node, b.getNumBytes());
    }

    // Log the block report processing stats from Namenode perspective
    final NameNodeMetrics metrics = NameNode.getNameNodeMetrics();
    if (metrics != null) {
      metrics.addStorageBlockReport((int) (endTime - startTime));
    }
    blockLog.info("BLOCK* processReport 0x{}: from storage {} node {}, " +
        "blocks: {}, hasStaleStorage: {}, processing time: {} msecs, " +
        "invalidatedBlocks: {}", strBlockReportId, storage.getStorageID(),
        nodeID, newReport.getNumberOfBlocks(),
        node.hasStaleStorages(), (endTime - startTime),
        invalidatedBlocks.size());
    return !node.hasStaleStorages();
  }

3.1. The First Block Report: processFirstBlockReport()

Note: to speed up startup, processFirstBlockReport() does not handle replicas that need to be deleted (absent on the Datanode but present in the Namenode's memory).


  /**
   * A Datanode sends a full block report to the Namenode via
   * DatanodeProtocol.blockReport(); on the Namenode the call flows through
   * NameNodeRpcServer.blockReport() -> BlockManager.processReport().
   *
   * processReport() calls processFirstBlockReport() to handle the first full
   * block report after a Datanode starts. processFirstBlockReport() calls
   * addStoredBlockImmediate() to add every valid replica in the report to
   * the Namenode's memory, then calls markBlockAsCorrupt() to handle invalid
   * replicas (present on the Datanode but inconsistent with the Namenode's
   * blocksMap).
   *
   * Note that processFirstBlockReport() does not handle replicas that need
   * to be deleted (absent on the Datanode but present in the Namenode's
   * memory).
   *
   * In an HDFS HA deployment, a Datanode's heartbeats, full block reports,
   * and incremental block reports are sent to both the Standby and the
   * Active Namenode. When the Standby processes a full block report, its
   * namespace may not yet be synchronized with the Active Namenode; replicas
   * that cannot be processed yet are queued and handled after the Standby
   * finishes loading the editlog and updating its namespace.
   *
   * processFirstBlockReport is intended only for processing "initial" block
   * reports, the first block report received from a DN after it registers.
   * It just adds all the valid replicas to the datanode, without calculating 
   * a toRemove list (since there won't be any).  It also silently discards 
   * any invalid blocks, thereby deferring their processing until 
   * the next block report.
   * @param storageInfo - DatanodeStorageInfo that sent the report
   * @param report - the initial block report, to be processed
   * @throws IOException 
   */
  void processFirstBlockReport(
      final DatanodeStorageInfo storageInfo,
      final BlockListAsLongs report) throws IOException {


    if (report == null) return;
    assert (namesystem.hasWriteLock());
    assert (storageInfo.getBlockReportCount() == 0);

    for (BlockReportReplica iblk : report) {
      ReplicaState reportedState = iblk.getState();

      if (LOG.isDebugEnabled()) {
        LOG.debug("Initial report of block {} on {} size {} replicaState = {}",
            iblk.getBlockName(), storageInfo.getDatanodeDescriptor(),
            iblk.getNumBytes(), reportedState);
      }
      if (shouldPostponeBlocksFromFuture && isGenStampInFuture(iblk)) {
        queueReportedBlock(storageInfo, iblk, reportedState,
            QUEUE_REASON_FUTURE_GENSTAMP);
        continue;
      }

      BlockInfo storedBlock = getStoredBlock(iblk);

      // If block does not belong to any file, we check if it violates
      // an integrity assumption of Name node
      if (storedBlock == null) {
        bmSafeMode.checkBlocksWithFutureGS(iblk);
        continue;
      }

      // If block is corrupt, mark it and continue to next block.
      BlockUCState ucState = storedBlock.getBlockUCState();
      BlockToMarkCorrupt c = checkReplicaCorrupt(
          iblk, reportedState, storedBlock, ucState,
          storageInfo.getDatanodeDescriptor());
      if (c != null) {
        if (shouldPostponeBlocksFromFuture) {
          // In the Standby, we may receive a block report for a file that we
          // just have an out-of-date gen-stamp or state for, for example.
          queueReportedBlock(storageInfo, iblk, reportedState,
              QUEUE_REASON_CORRUPT_STATE);
        } else {
          markBlockAsCorrupt(c, storageInfo, storageInfo.getDatanodeDescriptor());
        }
        continue;
      }
      
      // If block is under construction, add this replica to its list
      if (isBlockUnderConstruction(storedBlock, ucState, reportedState)) {
        storedBlock.getUnderConstructionFeature()
            .addReplicaIfNotPresent(storageInfo, iblk, reportedState);
        // OpenFileBlocks only inside snapshots also will be added to safemode
        // threshold. So we need to update such blocks to safemode
        // refer HDFS-5283
        if (namesystem.isInSnapshot(storedBlock.getBlockCollectionId())) {
          int numOfReplicas = storedBlock.getUnderConstructionFeature()
              .getNumExpectedLocations();
          bmSafeMode.incrementSafeBlockCount(numOfReplicas, storedBlock);
        }
        //and fall through to next clause
      }      
      //add replica if appropriate
      if (reportedState == ReplicaState.FINALIZED) {
        addStoredBlockImmediate(storedBlock, iblk, storageInfo);
      }
    }
  }

 

3.2. Periodic Block Reports: processReport()

For a Datanode's periodic block reports, processReport() calls its private processReport() overload. That method calls reportDiffSorted() (named reportDiff() in earlier releases) to compare the replicas in the report against the replica state currently recorded in the Namenode's memory, producing five operation queues (a simplified sketch of this classification follows the list):

■ toAdd: the reported replica has the same generation stamp and length as the block recorded in the Namenode's memory, so it is added to the toAdd queue. For each element of toAdd, addStoredBlock() is called to add the replica to the Namenode's memory.

■ toRemove: the replica exists on the DatanodeStorageInfo object in the Namenode's memory but was not included in the block report, so it is added to the toRemove queue. For each element of toRemove, removeStoredBlock() is called to remove the block from the Namenode's memory.

■ toInvalidate: BlockManager's blocksMap holds no record of the reported replica, so it is added to the toInvalidate queue. For each element of toInvalidate, addToInvalidates() adds the replica to BlockManager.invalidateBlocks, which later triggers the Datanode to delete the replica.

■ toCorrupt: the reported replica's generation stamp or length is inconsistent, so it is added to the toCorrupt queue. For each element of toCorrupt, markBlockAsCorrupt() is called.

■ toUC: the block of the reported replica is under construction, so addStoredBlockUnderConstruction() is called to construct a ReplicaUnderConstruction object and add it to the replicas list of the block's under-construction feature.
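To make the classification rules concrete, here is a heavily simplified, hypothetical sketch of the per-replica decision (toy types; the real logic lives in reportDiffSorted() and handles many more replica and block states):

import java.util.Map;

public class ReportDiffSketch {

  // Toy stand-in for the stored BlockInfo.
  static class Stored {
    long genStamp; long len;
    boolean underConstruction;
    boolean onReportingStorage; // already recorded on this DatanodeStorageInfo?
  }

  // Toy stand-in for a reported replica (BlockReportReplica).
  static class Reported {
    long blockId; long genStamp; long len;
  }

  /** Returns the queue the reported replica falls into. */
  static String classify(Map<Long, Stored> blocksMap, Reported r) {
    Stored stored = blocksMap.get(r.blockId);
    if (stored == null) {
      return "toInvalidate";   // not in blocksMap: tell the Datanode to delete
    }
    if (stored.underConstruction) {
      return "toUC";           // attach the replica to the UC block
    }
    if (stored.genStamp != r.genStamp || stored.len != r.len) {
      return "toCorrupt";      // inconsistent generation stamp or length
    }
    if (!stored.onReportingStorage) {
      return "toAdd";          // valid replica not yet recorded on this storage
    }
    return "alreadyRecorded";
    // toRemove is derived the other way around: blocks recorded on this
    // storage that did not appear anywhere in the report.
  }
}

The actual processReport() implementation in 3.2.1 is shown below: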


  Collection<Block> processReport(
      final DatanodeStorageInfo storageInfo,
      final BlockListAsLongs report,
      BlockReportContext context) throws IOException {
    // Normal case:
    // Modify the (block-->datanode) map, according to the difference
    // between the old and new block report.
    //

    // Queues to be filled in by reportDiffSorted()
    Collection<BlockInfoToAdd> toAdd = new LinkedList<>();
    Collection<BlockInfo> toRemove = new TreeSet<>();
    Collection<Block> toInvalidate = new LinkedList<>();
    Collection<BlockToMarkCorrupt> toCorrupt = new LinkedList<>();
    Collection<StatefulBlockInfo> toUC = new LinkedList<>();

    boolean sorted = false;
    String strBlockReportId = "";
    if (context != null) {
      sorted = context.isSorted();
      strBlockReportId = Long.toHexString(context.getReportId());
    }

    Iterable<BlockReportReplica> sortedReport;
    if (!sorted) {
      blockLog.warn("BLOCK* processReport 0x{}: Report from the DataNode ({}) "
                    + "is unsorted. This will cause overhead on the NameNode "
                    + "which needs to sort the Full BR. Please update the "
                    + "DataNode to the same version of Hadoop HDFS as the "
                    + "NameNode ({}).",
                    strBlockReportId,
                    storageInfo.getDatanodeDescriptor().getDatanodeUuid(),
                    VersionInfo.getVersion());
      Set<BlockReportReplica> set = new FoldedTreeSet<>();
      for (BlockReportReplica iblk : report) {
        set.add(new BlockReportReplica(iblk));
      }
      sortedReport = set;
    } else {
      sortedReport = report;
    }

    reportDiffSorted(storageInfo, sortedReport,
                     toAdd, toRemove, toInvalidate, toCorrupt, toUC);


    // Process each queue with the corresponding handler
    DatanodeDescriptor node = storageInfo.getDatanodeDescriptor();
    // Process the blocks on each queue
    for (StatefulBlockInfo b : toUC) { 
      addStoredBlockUnderConstruction(b, storageInfo);
    }
    for (BlockInfo b : toRemove) {
      removeStoredBlock(b, node);
    }
    int numBlocksLogged = 0;
    for (BlockInfoToAdd b : toAdd) {
      addStoredBlock(b.stored, b.reported, storageInfo, null,
          numBlocksLogged < maxNumBlocksToLog);
      numBlocksLogged++;
    }
    if (numBlocksLogged > maxNumBlocksToLog) {
      blockLog.info("BLOCK* processReport 0x{}: logged info for {} of {} " +
          "reported.", strBlockReportId, maxNumBlocksToLog, numBlocksLogged);
    }
    for (Block b : toInvalidate) {
      addToInvalidates(b, node);
    }
    for (BlockToMarkCorrupt b : toCorrupt) {
      markBlockAsCorrupt(b, storageInfo, node);
    }

    return toInvalidate;
  }

 

3.3. Incremental Block Reports

A Datanode calls DatanodeProtocol.blockReceivedAndDeleted() to report, incrementally, the replicas it has received or deleted in the recent period. When the Namenode receives an incremental report, it handles it in processIncrementalBlockReport().


  /**
   * Incremental block report.
   *
   * The given node is reporting incremental information about some blocks.
   * This includes blocks that are starting to be received, completed being
   * received, or deleted.
   * 
   * This method must be called with FSNamesystem lock held.
   */
  public void processIncrementalBlockReport(final DatanodeID nodeID,
      final StorageReceivedDeletedBlocks srdb) throws IOException {
    assert namesystem.hasWriteLock();
    final DatanodeDescriptor node = datanodeManager.getDatanode(nodeID);
    if (node == null || !node.isRegistered()) {
      blockLog.warn("BLOCK* processIncrementalBlockReport"
              + " is received from dead or unregistered node {}", nodeID);
      throw new IOException(
          "Got incremental block report from unregistered or dead node");
    }

    boolean successful = false;
    try {
      processIncrementalBlockReport(node, srdb);
      successful = true;
    } finally {
      if (!successful) {
        node.setForceRegistration(true);
      }
    }
  }

processIncrementalBlockReport() iterates over all blocks in the incremental report. For a newly added block (RECEIVED_BLOCK) it calls addBlock() to handle the addition; for a deleted block (DELETED_BLOCK) it calls removeStoredBlock() to update the mapping between the block and the datanode storage that held it; for a replica that is still being received (RECEIVING_BLOCK) it calls processAndHandleReportedBlock().


  private void processIncrementalBlockReport(final DatanodeDescriptor node,
      final StorageReceivedDeletedBlocks srdb) throws IOException {
    DatanodeStorageInfo storageInfo =
        node.getStorageInfo(srdb.getStorage().getStorageID());
    if (storageInfo == null) {
      // The DataNode is reporting an unknown storage. Usually the NN learns
      // about new storages from heartbeats but during NN restart we may
      // receive a block report or incremental report before the heartbeat.
      // We must handle this for protocol compatibility. This issue was
      // uncovered by HDFS-6094.
      storageInfo = node.updateStorage(srdb.getStorage());
    }

    int received = 0;
    int deleted = 0;
    int receiving = 0;

    for (ReceivedDeletedBlockInfo rdbi : srdb.getBlocks()) {
      switch (rdbi.getStatus()) {
      case DELETED_BLOCK:
        removeStoredBlock(storageInfo, rdbi.getBlock(), node);
        deleted++;
        break;
      case RECEIVED_BLOCK:
        addBlock(storageInfo, rdbi.getBlock(), rdbi.getDelHints());
        received++;
        break;
      case RECEIVING_BLOCK:
        receiving++;
        processAndHandleReportedBlock(storageInfo, rdbi.getBlock(),
                                      ReplicaState.RBW, null);
        break;
      default:
        String msg = 
          "Unknown block status code reported by " + node +
          ": " + rdbi;
        blockLog.warn(msg);
        assert false : msg; // if assertions are enabled, throw.
        break;
      }
      blockLog.debug("BLOCK* block {}: {} is received from {}",
          rdbi.getStatus(), rdbi.getBlock(), node);
    }
    blockLog.debug("*BLOCK* NameNode.processIncrementalBlockReport: from "
            + "{} receiving: {}, received: {}, deleted: {}", node, receiving,
        received, deleted);
  }

Reference:
《Hadoop 2.X HDFS源码剖析》 (HDFS Source Code Analysis for Hadoop 2.X), by 徐鹏 (Xu Peng)