Hadoop 3.2.1 [HDFS] Source Code Analysis: BPOfferService

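1. Introduction

An HDFS cluster can be configured with multiple namespaces, so each DataNode stores blocks for multiple block pools. The DataNode implementation therefore defines the BlockPoolManager class to manage all block pools on a DataNode. Other DataNode modules must go through BlockPoolManager for any block pool operation, and each DataNode has exactly one BlockPoolManager instance.

In an HDFS Federation deployment, a DataNode holds blocks for several block pools, so BlockPoolManager owns several BPOfferService objects, each of which encapsulates the API for operating on a single block pool. In an HDFS HA deployment, each namespace additionally has two NameNodes, one Active and one hot Standby, so each BPOfferService contains two BPServiceActor objects. Each BPServiceActor encapsulates the interaction with a single NameNode of that namespace: periodically sending heartbeats (heartbeat), incremental block reports (blockReceivedAndDeleted), full block reports (blockReport), and cache reports (cacheReport), and executing the NameNode commands returned in heartbeat and block report responses.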


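2. Overview of BPOfferService

In an HDFS Federation deployment, a cluster can define multiple namespaces. Each namespace has a corresponding block pool on the DataNode that stores its blocks, and that block pool is managed by one BPOfferService instance. Because a namespace can define two NameNodes, BPOfferService has to communicate with both and run the corresponding logic, which is why it holds references to two BPServiceActor objects. To guard against split-brain, a namespace must have one and only one NameNode in the Active state, so BPOfferService also tracks which NameNode the DataNode currently considers Active (through the bpServiceToActive field).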
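
The ownership chain described above (one manager per DataNode, one offer service per block pool, one actor per NameNode) can be sketched as follows. This is only an illustration: the class names ending in Sketch, the refresh() signature, and the nameservice and address strings are assumptions made for the example, not the Hadoop classes themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for BlockPoolManager: one per DataNode.
class BlockPoolManagerSketch {
    final Map<String, OfferServiceSketch> byNameservice = new LinkedHashMap<>();

    // Build one offer service per configured nameservice.
    void refresh(Map<String, List<String>> nameserviceToNnAddrs) {
        for (Map.Entry<String, List<String>> e : nameserviceToNnAddrs.entrySet()) {
            byNameservice.put(e.getKey(), new OfferServiceSketch(e.getKey(), e.getValue()));
        }
    }

    // Total number of NameNode connections across all block pools.
    int actorCount() {
        int n = 0;
        for (OfferServiceSketch s : byNameservice.values()) {
            n += s.actors.size();
        }
        return n;
    }

    public static void main(String[] args) {
        Map<String, List<String>> cfg = new LinkedHashMap<>();
        cfg.put("ns1", Arrays.asList("nn1:8020", "nn2:8020"));
        cfg.put("ns2", Arrays.asList("nn3:8020", "nn4:8020"));
        BlockPoolManagerSketch mgr = new BlockPoolManagerSketch();
        mgr.refresh(cfg);
        System.out.println(mgr.byNameservice.size() + " block pools, "
            + mgr.actorCount() + " actors");
    }
}

// Hypothetical stand-in for BPOfferService: one per block pool.
class OfferServiceSketch {
    final String nameserviceId;
    final List<ActorSketch> actors = new ArrayList<>();
    ActorSketch active; // role of bpServiceToActive; null until a NameNode is accepted as Active

    OfferServiceSketch(String nameserviceId, List<String> nnAddrs) {
        this.nameserviceId = nameserviceId;
        for (String addr : nnAddrs) {
            actors.add(new ActorSketch(addr));
        }
    }
}

// Hypothetical stand-in for BPServiceActor: one per NameNode.
class ActorSketch {
    final String nnAddr;

    ActorSketch(String nnAddr) {
        this.nnAddr = nnAddr;
    }
}
```

With two federated namespaces that each run an HA pair, a single DataNode ends up with two offer services and four actors, and no actor is treated as Active until a NameNode has claimed that role.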

3. BPOfferService fields


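■ NamespaceInfo bpNSInfo: information about the namespace served by this BPOfferService, obtained during the handshake with the NameNode.
■ DatanodeRegistration bpRegistration: the registration information, on the NameNode, of the block pool served by this BPOfferService, obtained when the DataNode registers.
■ DataNode dn: reference to the owning DataNode object.
■ BPServiceActor bpServiceToActive: the BPServiceActor for the NameNode that this BPOfferService currently considers Active.
■ List<BPServiceActor> bpServices: the BPServiceActor objects for all NameNodes in this namespace. Note that this is a CopyOnWriteArrayList.
■ long lastActiveClaimTxId: each time a heartbeat response arrives from a NameNode that claims to be Active, the most recent transaction ID is recorded here. The field is used to prevent split-brain.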
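
One likely reason for keeping bpServices in a copy-on-write list is that the fan-out loops shown later iterate over it without holding a lock. The sketch below is an illustration of that property under stated assumptions, not Hadoop code: the class and method names are hypothetical, and the list elements are plain strings standing in for actors.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

class CowIterationSketch {
    // Visit every element while adding to the same list during the loop.
    // Returns how many elements the loop actually visited.
    static int fanOutWhileRefreshing(List<String> actors) {
        int visited = 0;
        for (String actor : actors) {
            // A copy-on-write iterator works on a snapshot, so this add is
            // invisible to the running loop and does not break it.
            actors.add(actor + "-refreshed");
            visited++;
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> actors = new CopyOnWriteArrayList<>();
        actors.add("nn1");
        actors.add("nn2");
        int visited = fanOutWhileRefreshing(actors);
        System.out.println("visited=" + visited + ", size afterwards=" + actors.size());
    }
}
```

The same loop over a plain ArrayList fails with ConcurrentModificationException, which is the failure mode the copy-on-write list avoids at the cost of copying the backing array on every write.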

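4. BPOfferService methods

4.1. Triggering reports

trySendErrorReport(), reportRemoteBadBlock(), and reportBadBlocks() send an error report, report a bad block on a remote DataNode, and report a local bad block to the NameNode, respectively. Each of them iterates over the BPServiceActor objects (two in an HA pair) and invokes the corresponding operation on every actor, which in turn calls the matching DatanodeProtocol method to deliver the report.

reportBadBlocks() calls bpThreadEnqueue() on every BPServiceActor held by the current BPOfferService to report the corrupt block to the NameNode: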

  void reportBadBlocks(ExtendedBlock block,
                       String storageUuid, StorageType storageType) {
    checkBlock(block);
    // Iterate over every BPServiceActor managed by this BPOfferService
    for (BPServiceActor actor : bpServices) {
      // Build the ReportBadBlockAction
      ReportBadBlockAction rbbAction = new ReportBadBlockAction(
          block, storageUuid, storageType);
      // Queue the action on the actor via bpThreadEnqueue()
      actor.bpThreadEnqueue(rbbAction);
    }
  }

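BPServiceActor.bpThreadEnqueue() adds the action to the bpThreadQueue queue, where it waits for the actor thread to process it. Processing eventually calls ReportBadBlockAction#reportTo, which reports to the NameNode through DatanodeProtocolClientSideTranslatorPB: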


  public void bpThreadEnqueue(BPServiceActorAction action) {
    synchronized (bpThreadQueue) {
      if (!bpThreadQueue.contains(action)) {
        // Add to the processing queue
        bpThreadQueue.add(action);
      }
    }
  }

  @Override
  public void reportTo(DatanodeProtocolClientSideTranslatorPB bpNamenode, 
    DatanodeRegistration bpRegistration) throws BPServiceActorActionException {
    if (bpRegistration == null) {
      return;
    }
    DatanodeInfo[] dnArr = {new DatanodeInfoBuilder()
        .setNodeID(bpRegistration).build()};
    String[] uuids = { storageUuid };
    StorageType[] types = { storageType };
    LocatedBlock[] locatedBlock = { new LocatedBlock(block,
        dnArr, uuids, types) };

    try {
      bpNamenode.reportBadBlocks(locatedBlock);
    } catch (RemoteException re) {
      DataNode.LOG.info("reportBadBlock encountered RemoteException for "
          + "block:  " + block , re);
    } catch (IOException e) {
      throw new BPServiceActorActionException("Failed to report bad block "
          + block + " to namenode.", e);
    }
  }
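
The contains() check in bpThreadEnqueue() only suppresses duplicates if the action type defines value equality. The following minimal sketch shows that interaction; it assumes such an equals()/hashCode() pair exists, and all class names, fields, and the drain() helper are hypothetical simplifications rather than the Hadoop implementation.

```java
import java.util.LinkedList;
import java.util.Objects;
import java.util.Queue;

class ActionQueueSketch {
    private final Queue<BadBlockActionSketch> queue = new LinkedList<>();

    // Mirrors the shape of bpThreadEnqueue(): skip an action that is already pending.
    void enqueue(BadBlockActionSketch action) {
        synchronized (queue) {
            if (!queue.contains(action)) {
                queue.add(action);
            }
        }
    }

    // Stand-in for the actor thread: remove everything pending and count it.
    int drain() {
        int processed = 0;
        synchronized (queue) {
            while (queue.poll() != null) {
                processed++;
            }
        }
        return processed;
    }

    int size() {
        synchronized (queue) {
            return queue.size();
        }
    }

    public static void main(String[] args) {
        ActionQueueSketch q = new ActionQueueSketch();
        q.enqueue(new BadBlockActionSketch(1001L, "DS-a"));
        q.enqueue(new BadBlockActionSketch(1001L, "DS-a")); // equal to the first, skipped
        q.enqueue(new BadBlockActionSketch(1002L, "DS-a"));
        System.out.println("pending=" + q.size());
    }
}

// Hypothetical action carrying just enough state to compare two reports.
final class BadBlockActionSketch {
    final long blockId;
    final String storageUuid;

    BadBlockActionSketch(long blockId, String storageUuid) {
        this.blockId = blockId;
        this.storageUuid = storageUuid;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof BadBlockActionSketch)) {
            return false;
        }
        BadBlockActionSketch other = (BadBlockActionSketch) o;
        return blockId == other.blockId && storageUuid.equals(other.storageUuid);
    }

    @Override
    public int hashCode() {
        return Objects.hash(blockId, storageUuid);
    }
}
```

Note that contains() on a linked list is a linear scan, which is acceptable only while the number of pending actions stays small.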

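4.2. Adding and deleting blocks

Whenever a DataNode receives a new block, for example when a client writes a block through the write pipeline, when block recovery updates the generation stamp and length of a replica in the RUR (replica under recovery) state, or when a block is copied through the DataTransferProtocol streaming interface, it calls BPOfferService.notifyNamenodeReceivedBlock() to notify the namespace.

When a DataNode deletes an existing block, it calls BPOfferService.notifyNamenodeDeletedBlock() to notify the namespace. Specifically, when FsDatasetImpl.invalidate() removes a block, it creates a ReplicaFileDeleteTask that deletes the block file from disk asynchronously; once the deletion completes, BPOfferService.notifyNamenodeDeletedBlock() is called.

notifyNamenodeDeletedBlock(), notifyNamenodeReceivingBlock(), and notifyNamenodeReceivedBlock() share one implementation: each delegates to the private notifyNamenodeBlock() method, which iterates over all BPServiceActor objects held by the BPOfferService and calls notifyNamenodeBlock() on each actor's incremental block report manager (getIbrManager()).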

  private void notifyNamenodeBlock(ExtendedBlock block, BlockStatus status,
      String delHint, String storageUuid, boolean isOnTransientStorage) {
    checkBlock(block);

    final ReceivedDeletedBlockInfo info = new ReceivedDeletedBlockInfo(
        block.getLocalBlock(), status, delHint);
    final DatanodeStorage storage = dn.getFSDataset().getStorage(storageUuid);

    for (BPServiceActor actor : bpServices) {
      // For every BPServiceActor, hand the block info to its
      // incremental block report (IBR) manager
      actor.getIbrManager().notifyNamenodeBlock(info, storage,
          isOnTransientStorage);
    }
  }

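4.3. Managing bpNSInfo, bpRegistration, and bpServiceToActive

4.3.1. NamespaceInfo bpNSInfo

The bpNSInfo field holds the namespace information for the block pool served by this BPOfferService. It is set by BPOfferService.verifyAndSetNamespaceInfo(), which is called when the DataNode handshakes with a NameNode. verifyAndSetNamespaceInfo() also checks that the NamespaceInfo obtained from the two NameNodes of the namespace is consistent.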

  /**
   * Called by the BPServiceActors when they handshake to a NN.
   * If this is the first NN connection, this sets the namespace info
   * for this BPOfferService. If it's a connection to a new NN, it
   * verifies that this namespace matches (eg to prevent a misconfiguration
   * where a StandbyNode from a different cluster is specified)
   */
  void verifyAndSetNamespaceInfo(BPServiceActor actor, NamespaceInfo nsInfo)
    throws IOException {
    writeLock();

    if(nsInfo.getState() == HAServiceState.ACTIVE
        && bpServiceToActive == null) {
      LOG.info("Acknowledging ACTIVE Namenode during handshake" + actor);
      bpServiceToActive = actor;
    }

    try {

      if (setNamespaceInfo(nsInfo) == null) {
        boolean success = false;
        // First NameNode to respond: bpNSInfo was null until now, so the DataNode
        // can initialize its local storage through DataNode.initBlockPool().

        // Now that we know the namespace ID, etc, we can pass this to the DN.
        // The DN can now initialize its local storage if we are the
        // first BP to handshake, etc.
        try {
          dn.initBlockPool(this);
          success = true;
        } finally {
          if (!success) {
            // The datanode failed to initialize the BP. We need to reset
            // the namespace info so that other BPService actors still have
            // a chance to set it, and re-initialize the datanode.

            // If initialization fails, reset bpNSInfo to null and wait for
            // the next NameNode's response.
            setNamespaceInfo(null);
          }
        }
      }
    } finally {
      writeUnlock();
    }
  }
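
The control flow above is a "first responder initializes, later responders are verified, failure rolls back" pattern. The sketch below reproduces it in isolation. It assumes that the consistency check happens inside the setter, which the excerpt above does not show; the class name, the BlockPoolInit callback, and the use of a plain cluster ID string in place of NamespaceInfo are all hypothetical.

```java
import java.io.IOException;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class NamespaceHolderSketch {
    // Hypothetical callback standing in for dn.initBlockPool(this).
    interface BlockPoolInit {
        void run() throws IOException;
    }

    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String clusterId; // stands in for bpNSInfo; null until a handshake succeeds

    // Returns the previous value and rejects a value that conflicts with the stored one.
    // Assumption: the real check lives in the setter, not in the excerpt shown above.
    private String setClusterId(String incoming) throws IOException {
        String old = clusterId;
        if (old != null && incoming != null && !old.equals(incoming)) {
            throw new IOException("Cluster ID mismatch: " + old + " vs " + incoming);
        }
        clusterId = incoming;
        return old;
    }

    void verifyAndSet(String incoming, BlockPoolInit init) throws IOException {
        lock.writeLock().lock();
        try {
            if (setClusterId(incoming) == null) {
                // First NameNode to answer: initialize, and roll back on failure
                // so that the next handshake can try again.
                boolean success = false;
                try {
                    init.run();
                    success = true;
                } finally {
                    if (!success) {
                        setClusterId(null);
                    }
                }
            }
        } finally {
            lock.writeLock().unlock();
        }
    }

    String current() {
        lock.readLock().lock();
        try {
            return clusterId;
        } finally {
            lock.readLock().unlock();
        }
    }

    public static void main(String[] args) throws IOException {
        NamespaceHolderSketch holder = new NamespaceHolderSketch();
        holder.verifyAndSet("CID-1", () -> System.out.println("initializing block pool"));
        holder.verifyAndSet("CID-1", () -> System.out.println("not reached on second handshake"));
        System.out.println("cluster ID: " + holder.current());
    }
}
```

The success flag combined with finally covers both checked exceptions and runtime errors thrown by the initializer, which a catch block for IOException alone would not.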

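4.3.2. DatanodeRegistration bpRegistration

The bpRegistration field holds the registration information, on the NameNode, of the block pool managed by this BPOfferService. It is set by registrationSucceeded(), which is called while the DataNode registers with a NameNode. registrationSucceeded() also verifies that the DatanodeRegistration values obtained from the two NameNodes of the namespace agree on namespace ID and cluster ID.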

  /**
   * After one of the BPServiceActors registers successfully with the
   * NN, it calls this function to verify that the NN it connected to
   * is consistent with other NNs serving the block-pool.
   */
  void registrationSucceeded(BPServiceActor bpServiceActor,
      DatanodeRegistration reg) throws IOException {
    writeLock();
    try {


      if (bpRegistration != null) {
        // Check that the namespace IDs match
        checkNSEquality(bpRegistration.getStorageInfo().getNamespaceID(),
            reg.getStorageInfo().getNamespaceID(), "namespace ID");
        // Check that the cluster IDs match
        checkNSEquality(bpRegistration.getStorageInfo().getClusterID(),
            reg.getStorageInfo().getClusterID(), "cluster ID");
      }
      bpRegistration = reg;

      // Tell the DataNode that registration for this block pool succeeded
      dn.bpRegistrationSucceeded(bpRegistration, getBlockPoolId());

      // Add the initial block token secret keys to the DN's secret manager.
      if (dn.isBlockTokenEnabled) {
        dn.blockPoolTokenSecretManager.addKeys(getBlockPoolId(),
            reg.getExportedKeys());
      }
    } finally {
      writeUnlock();
    }
  }

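4.3.3. BPServiceActor bpServiceToActive

The bpServiceToActive field holds the BPServiceActor for the NameNode that this BPOfferService considers Active. It is assigned in updateActorStatesFromHeartbeat() (and, as the code in 4.3.1 shows, during the handshake). When two NameNodes both claim to be Active, the txid decides which one is accepted.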


  /**
   * Update the BPOS's view of which NN is active, based on a heartbeat
   * response from one of the actors.
   * 
   * @param actor the actor which received the heartbeat
   * @param nnHaState the HA-related heartbeat contents
   */
  void updateActorStatesFromHeartbeat(
      BPServiceActor actor,
      NNHAStatusHeartbeat nnHaState) {
    writeLock();
    try {
      // txid carried in the NameNode's heartbeat response
      final long txid = nnHaState.getTxId();


      // Does this NameNode claim to be the Active NameNode?
      final boolean nnClaimsActive =
          nnHaState.getState() == HAServiceState.ACTIVE;


      final boolean bposThinksActive = bpServiceToActive == actor;

      // Is this txid higher than the txid of the last accepted Active claim?
      final boolean isMoreRecentClaim = txid > lastActiveClaimTxId;

      // A NameNode we do not consider Active claims to be Active: possible failover
      if (nnClaimsActive && !bposThinksActive) {
        LOG.info("Namenode " + actor + " trying to claim ACTIVE state with " +
            "txid=" + txid);
        if (!isMoreRecentClaim) {
          // Split-brain scenario - an NN is trying to claim active
          // state when a different NN has already claimed it with a higher
          // txid.
          LOG.warn("NN " + actor + " tried to claim ACTIVE state at txid=" +
              txid + " but there was already a more recent claim at txid=" +
              lastActiveClaimTxId);
          return;
        } else {
          if (bpServiceToActive == null) {
            LOG.info("Acknowledging ACTIVE Namenode " + actor);
          } else {
            LOG.info("Namenode " + actor + " taking over ACTIVE state from " +
                bpServiceToActive + " at higher txid=" + txid);
          }
          bpServiceToActive = actor;
        }
      } else if (!nnClaimsActive && bposThinksActive) {
        LOG.info("Namenode " + actor + " relinquishing ACTIVE state with " +
            "txid=" + nnHaState.getTxId());
        bpServiceToActive = null;
      }

      // If this actor is (now) the Active one, record its txid as the latest claim
      if (bpServiceToActive == actor) {
        assert txid >= lastActiveClaimTxId;
        lastActiveClaimTxId = txid;
      }
    } finally {
      writeUnlock();
    }
  }
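
The txid comparison is what resolves competing Active claims. The sketch below isolates that decision so it can be traced with concrete numbers. It is a simplified model under stated assumptions: actors are identified by plain strings, locking and logging are left out, and the class and method names are hypothetical.

```java
class ActiveTrackerSketch {
    private String active;                  // role of bpServiceToActive
    private long lastActiveClaimTxId = -1;  // txid of the last accepted Active claim

    // Process one heartbeat response. Returns true if the given actor is
    // treated as Active once the heartbeat has been applied.
    boolean onHeartbeat(String actor, boolean claimsActive, long txid) {
        boolean thinksActive = actor.equals(active);
        boolean isMoreRecentClaim = txid > lastActiveClaimTxId;

        if (claimsActive && !thinksActive) {
            if (!isMoreRecentClaim) {
                // Another NameNode already claimed Active at an equal or higher txid.
                return false;
            }
            active = actor;
        } else if (!claimsActive && thinksActive) {
            // The NameNode we trusted has stepped down.
            active = null;
        }

        if (actor.equals(active)) {
            lastActiveClaimTxId = txid;
        }
        return actor.equals(active);
    }

    String active() {
        return active;
    }

    long lastActiveClaimTxId() {
        return lastActiveClaimTxId;
    }

    public static void main(String[] args) {
        ActiveTrackerSketch tracker = new ActiveTrackerSketch();
        tracker.onHeartbeat("nn1", true, 100);  // accepted as Active
        tracker.onHeartbeat("nn2", true, 100);  // rejected: txid is not higher
        tracker.onHeartbeat("nn2", true, 150);  // accepted: failover at a higher txid
        System.out.println("active=" + tracker.active()
            + ", lastActiveClaimTxId=" + tracker.lastActiveClaimTxId());
    }
}
```

Two details are easy to miss: a claim at an equal txid is rejected because the comparison is strict, and lastActiveClaimTxId is kept when the Active NameNode steps down, so a stale NameNode cannot reclaim the role at an older txid.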

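4.4. Handling NameNode commands

The NameNode returns commands to the DataNode in its responses to heartbeats, block reports, and cache reports. BPOfferService provides processCommandFromActor() to handle them. Commands from the Active NameNode and from the Standby NameNode are handled differently.

processCommandFromActor() is the entry point for command handling in BPOfferService. When a BPServiceActor receives a command from its NameNode in the offerService() method of its worker thread, it calls this method.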


  boolean processCommandFromActor(DatanodeCommand cmd,
      BPServiceActor actor) throws IOException {
    assert bpServices.contains(actor);
    if (cmd == null) {
      return true;
    }
    /*
     * Datanode Registration can be done asynchronously here. No need to hold
     * the lock. for more info refer HDFS-5014
     */
    if (DatanodeProtocol.DNA_REGISTER == cmd.getAction()) {
      // namenode requested a registration - at start or if NN lost contact
      // Just logging the claiming state is OK here instead of checking the
      // actor state by obtaining the lock
      LOG.info("DatanodeCommand action : DNA_REGISTER from " + actor.nnAddr
          + " with " + actor.state + " state");

      // The NameNode asked the DataNode to re-register, so call reRegister()
      actor.reRegister();
      return false;
    }
    writeLock();
    try {
      
      if (actor == bpServiceToActive) {
        // Command from the Active NameNode: handle it in processCommandFromActive()
        return processCommandFromActive(cmd, actor);
      } else {
        // Command from a Standby NameNode: handle it in processCommandFromStandby()
        return processCommandFromStandby(cmd, actor);
      }
    } finally {
      writeUnlock();
    }
  }

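Commands returned by the Active NameNode and by the Standby NameNode are handled by two entirely different methods.

4.4.1. processCommandFromActive

processCommandFromActive() handles commands from the Active NameNode. (DNA_REGISTER is already handled in processCommandFromActor().) Ten command types are processed here: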

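No.  Name                                Description
1    DNA_TRANSFER                        Replicate blocks to other DataNodes
2    DNA_INVALIDATE                      Delete blocks
3    DNA_CACHE                           Cache blocks
4    DNA_UNCACHE                         Uncache blocks
5    DNA_SHUTDOWN                        Shut down the DataNode
6    DNA_FINALIZE                        Finalize the previous upgrade
7    DNA_RECOVERBLOCK                    Recover blocks
8    DNA_ACCESSKEYUPDATE                 Update the access keys
9    DNA_BALANCERBANDWIDTHUPDATE         Update the balancer bandwidth
10   DNA_ERASURE_CODING_RECONSTRUCTION   Erasure coding reconstruction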

  /**
   * This method should handle all commands from Active namenode except
   * DNA_REGISTER which should be handled earlier itself.
   *
   * @param cmd
   * @return true if further processing may be required or false otherwise.
   * @throws IOException
   */
  private boolean processCommandFromActive(DatanodeCommand cmd,
      BPServiceActor actor) throws IOException {
    final BlockCommand bcmd =
      cmd instanceof BlockCommand? (BlockCommand)cmd: null;
    final BlockIdCommand blockIdCmd =
      cmd instanceof BlockIdCommand ? (BlockIdCommand)cmd: null;

    switch(cmd.getAction()) {
    case DatanodeProtocol.DNA_TRANSFER:
      // Block replication
      // Send a copy of a block to another datanode
      dn.transferBlocks(bcmd.getBlockPoolId(), bcmd.getBlocks(),
          bcmd.getTargets(), bcmd.getTargetStorageTypes(),
          bcmd.getTargetStorageIDs());
      break;
    case DatanodeProtocol.DNA_INVALIDATE:
      // Block deletion
      //
      // Some local block(s) are obsolete and can be
      // safely garbage-collected.
      //
      Block toDelete[] = bcmd.getBlocks();
      try {
        // using global fsdataset
        dn.getFSDataset().invalidate(bcmd.getBlockPoolId(), toDelete);
      } catch(IOException e) {
        // Exceptions caught here are not expected to be disk-related.
        throw e;
      }
      dn.metrics.incrBlocksRemoved(toDelete.length);
      break;
    case DatanodeProtocol.DNA_CACHE:
      // Cache blocks
      LOG.info("DatanodeCommand action: DNA_CACHE for " +
        blockIdCmd.getBlockPoolId() + " of [" +
          blockIdArrayToString(blockIdCmd.getBlockIds()) + "]");
      dn.getFSDataset().cache(blockIdCmd.getBlockPoolId(), blockIdCmd.getBlockIds());
      break;
    case DatanodeProtocol.DNA_UNCACHE:
      // Uncache blocks
      LOG.info("DatanodeCommand action: DNA_UNCACHE for " +
        blockIdCmd.getBlockPoolId() + " of [" +
          blockIdArrayToString(blockIdCmd.getBlockIds()) + "]");
      dn.getFSDataset().uncache(blockIdCmd.getBlockPoolId(), blockIdCmd.getBlockIds());
      break;
    case DatanodeProtocol.DNA_SHUTDOWN:
      // Shut down the DataNode
      // TODO: DNA_SHUTDOWN appears to be unused - the NN never sends this command
      // See HDFS-2987.
      throw new UnsupportedOperationException("Received unimplemented DNA_SHUTDOWN");
    case DatanodeProtocol.DNA_FINALIZE:
      // Finalize the previous upgrade
      String bp = ((FinalizeCommand) cmd).getBlockPoolId();
      LOG.info("Got finalize command for block pool " + bp);
      assert getBlockPoolId().equals(bp) :
        "BP " + getBlockPoolId() + " received DNA_FINALIZE " +
        "for other block pool " + bp;

      dn.finalizeUpgradeForPool(bp);
      break;
    case DatanodeProtocol.DNA_RECOVERBLOCK:
      // Block recovery
      String who = "NameNode at " + actor.getNNSocketAddress();
      dn.getBlockRecoveryWorker().recoverBlocks(who,
          ((BlockRecoveryCommand)cmd).getRecoveringBlocks());
      break;
    case DatanodeProtocol.DNA_ACCESSKEYUPDATE:
      // Security: update the access keys
      LOG.info("DatanodeCommand action: DNA_ACCESSKEYUPDATE");
      if (dn.isBlockTokenEnabled) {
        dn.blockPoolTokenSecretManager.addKeys(
            getBlockPoolId(),
            ((KeyUpdateCommand) cmd).getExportedKeys());
      }
      break;
    case DatanodeProtocol.DNA_BALANCERBANDWIDTHUPDATE:
      // Update the balancer bandwidth
      LOG.info("DatanodeCommand action: DNA_BALANCERBANDWIDTHUPDATE");
      long bandwidth =
          ((BalancerBandwidthCommand) cmd).getBalancerBandwidthValue();
      if (bandwidth > 0) {
        DataXceiverServer dxcs =
            (DataXceiverServer) dn.dataXceiverServer.getRunnable();
        LOG.info("Updating balance throttler bandwidth from "
            + dxcs.balanceThrottler.getBandwidth() + " bytes/s "
            + "to: " + bandwidth + " bytes/s.");
        dxcs.balanceThrottler.setBandwidth(bandwidth);
      }
      break;
    case DatanodeProtocol.DNA_ERASURE_CODING_RECONSTRUCTION:
      // Erasure coding reconstruction
      LOG.info("DatanodeCommand action: DNA_ERASURE_CODING_RECOVERY");
      Collection<BlockECReconstructionInfo> ecTasks =
          ((BlockECReconstructionCommand) cmd).getECTasks();
      dn.getErasureCodingWorker().processErasureCodingTasks(ecTasks);
      break;
    default:
      LOG.warn("Unknown DatanodeCommand action: " + cmd.getAction());
    }
    return true;
  }

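4.4.2. processCommandFromStandby

processCommandFromStandby() handles commands from the Standby NameNode. Its logic is simple: apart from DNA_ACCESSKEYUPDATE, which is still applied, every command returned by the Standby NameNode is ignored. This prevents split-brain in an HA deployment, where the Active and Standby NameNodes would otherwise both be issuing commands to the DataNode. Because BPOfferService executes only the commands returned by the NameNode it considers Active, a DataNode acts on behalf of a single Active NameNode per namespace.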


  /**
   * Handles commands from the Standby NameNode. In practice only
   * DNA_ACCESSKEYUPDATE is acted on; everything else is ignored.
   * 
   * This method should handle commands from Standby namenode except
   * DNA_REGISTER which should be handled earlier itself.
   */
  private boolean processCommandFromStandby(DatanodeCommand cmd,
      BPServiceActor actor) throws IOException {
    switch(cmd.getAction()) {
    case DatanodeProtocol.DNA_ACCESSKEYUPDATE:
      LOG.info("DatanodeCommand action from standby: DNA_ACCESSKEYUPDATE");
      if (dn.isBlockTokenEnabled) {
        dn.blockPoolTokenSecretManager.addKeys(
            getBlockPoolId(), 
            ((KeyUpdateCommand) cmd).getExportedKeys());
      }
      break;
    case DatanodeProtocol.DNA_TRANSFER:
    case DatanodeProtocol.DNA_INVALIDATE:
    case DatanodeProtocol.DNA_SHUTDOWN:
    case DatanodeProtocol.DNA_FINALIZE:
    case DatanodeProtocol.DNA_RECOVERBLOCK:
    case DatanodeProtocol.DNA_BALANCERBANDWIDTHUPDATE:
    case DatanodeProtocol.DNA_CACHE:
    case DatanodeProtocol.DNA_UNCACHE:
    case DatanodeProtocol.DNA_ERASURE_CODING_RECONSTRUCTION:
      LOG.warn("Got a command from standby NN - ignoring command:" + cmd.getAction());
      break;
    default:
      LOG.warn("Unknown DatanodeCommand action: " + cmd.getAction());
    }
    return true;
  }
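
Taken together, section 4.4 describes a three-way routing decision: re-register requests are served regardless of HA state, commands from the Active NameNode are executed, and commands from a Standby NameNode are dropped unless they update access keys. A compact sketch of that routing follows. The Action enum (a reduced stand-in for the integer DNA_* constants), the log list, and the class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class CommandRouterSketch {
    // Hypothetical, reduced stand-in for the DatanodeProtocol.DNA_* constants.
    enum Action { REGISTER, TRANSFER, INVALIDATE, ACCESSKEYUPDATE }

    // Records what the router decided, in order, so the outcome can be inspected.
    final List<String> log = new ArrayList<>();

    // Returns false only for REGISTER, matching the return values shown in the article.
    boolean process(Action action, boolean fromActive) {
        if (action == null) {
            return true;
        }
        if (action == Action.REGISTER) {
            // Served before the Active/Standby distinction is even considered.
            log.add("reRegister");
            return false;
        }
        if (fromActive) {
            log.add("execute " + action);
        } else if (action == Action.ACCESSKEYUPDATE) {
            // The one command a Standby NameNode is allowed to trigger.
            log.add("execute " + action);
        } else {
            log.add("ignore " + action);
        }
        return true;
    }

    public static void main(String[] args) {
        CommandRouterSketch router = new CommandRouterSketch();
        router.process(Action.TRANSFER, true);
        router.process(Action.TRANSFER, false);
        router.process(Action.ACCESSKEYUPDATE, false);
        System.out.println(router.log);
    }
}
```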