HDFS Replica Placement Policy: A Source Code Analysis

This post analyzes HDFS's replica placement strategy, covering how the 1st through 4th replicas are chosen. It starts from a block-write failure on a cluster that, while heavily used, was nowhere near unable to store data. Walking through the source, it explains the `chooseTarget` methods in `BlockManager` and `BlockPlacementPolicyDefault`, the fallback steps during replica selection, and the conditions a candidate node must satisfy (matching storage type, sufficient capacity, node state, and so on). It also touches on how `scheduledSize` is computed and how the write pipeline is built so that the distance between replicas is minimized.
Background

A while ago, our cluster hit the following error when writing blocks:

Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.ipc.RemoteException(java.io.IOException): 
File /tmp/xxxx.tmp could only be replicated to 0 nodes instead of minReplication (=1).  
There are 983 datanode(s) running and 983 node(s) are excluded in this operation.

The cluster was indeed over 90% full, but that should not have left all of its nearly one thousand datanodes unable to accept a block. So I dug into the relevant code and, along the way, summarized Hadoop's replica placement policy. This cluster runs Hadoop 2.6.

HDFS replica placement policy
  • 1st replica: if the client issuing the write runs on one of the datanodes, the replica is stored locally; otherwise a datanode is chosen at random from the cluster.
  • 2nd replica: stored on a rack different from the first replica's rack.
  • 3rd replica: stored on the same rack as the second replica, but on a different node.
  • 4th and later replicas: stored on randomly chosen datanodes.
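The four rules above can be sketched as a toy selection routine. This is illustrative only: the `Node` class, `chooseTargets`, and the flat node/rack model below are invented for the example and are not Hadoop's actual `BlockPlacementPolicyDefault`.

```java
import java.util.*;

// Toy sketch of the four placement rules (names and structure are made up
// for illustration; Hadoop's real policy also checks capacity, state, etc.).
public class PlacementSketch {

    static class Node {
        final String name, rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
    }

    static List<Node> chooseTargets(Node writer, List<Node> cluster,
                                    int replicas, Random rnd) {
        List<Node> chosen = new ArrayList<>();
        chosen.add(writer);                          // rule 1: local node
        if (replicas >= 2) {
            List<Node> offRack = new ArrayList<>();
            for (Node n : cluster)
                if (!n.rack.equals(writer.rack)) offRack.add(n);
            Node second = offRack.get(rnd.nextInt(offRack.size()));
            chosen.add(second);                      // rule 2: different rack
            if (replicas >= 3) {
                List<Node> sameRack = new ArrayList<>();
                for (Node n : cluster)
                    if (n.rack.equals(second.rack) && n != second)
                        sameRack.add(n);
                // rule 3: same rack as the 2nd replica, different node
                chosen.add(sameRack.get(rnd.nextInt(sameRack.size())));
            }
        }
        while (chosen.size() < replicas) {           // rule 4: random leftovers
            Node n = cluster.get(rnd.nextInt(cluster.size()));
            if (!chosen.contains(n)) chosen.add(n);
        }
        return chosen;
    }

    public static void main(String[] args) {
        // 3 racks x 4 nodes; the writer sits on dn0-0
        List<Node> cluster = new ArrayList<>();
        for (int r = 0; r < 3; r++)
            for (int i = 0; i < 4; i++)
                cluster.add(new Node("dn" + r + "-" + i, "rack" + r));
        List<Node> t = chooseTargets(cluster.get(0), cluster, 3, new Random(7));
        for (Node n : t) System.out.println(n.name + " @ " + n.rack);
        // sanity checks against the rules
        if (t.get(1).rack.equals(t.get(0).rack)) throw new AssertionError();
        if (!t.get(2).rack.equals(t.get(1).rack) || t.get(2) == t.get(1))
            throw new AssertionError();
    }
}
```

Placing the 2nd and 3rd replicas on one remote rack (rather than three different racks) keeps fault tolerance across racks while limiting cross-rack traffic to a single transfer in the pipeline.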


Source Code Analysis

BlockManager.java
The error above is thrown from the method below. From there, we follow the chooseTarget call it makes.

/**
   * Choose target datanodes for creating a new block.
   * 
   * @throws IOException
   *           if the number of targets < minimum replication.
   * @see BlockPlacementPolicy#chooseTarget(String, int, Node,
   *      Set, long, List, BlockStoragePolicy)
   */
  public DatanodeStorageInfo[] chooseTarget4NewBlock(final String src,
      final int numOfReplicas, final Node client,
      final Set<Node> excludedNodes,
      final long blocksize,
      final List<String> favoredNodes,
      final byte storagePolicyID) throws IOException {
   
    List<DatanodeDescriptor> favoredDatanodeDescriptors = 
        getDatanodeDescriptors(favoredNodes);
    final BlockStoragePolicy storagePolicy = storagePolicySuite.getPolicy(storagePolicyID);
    // Delegate replica selection to the blockplacement policy's chooseTarget
    final DatanodeStorageInfo[] targets = blockplacement.chooseTarget(src,
        numOfReplicas, client, excludedNodes, blocksize, 
        favoredDatanodeDescriptors, storagePolicy);
    // If too few target nodes were chosen, throw an IOException
    if (targets.length < minReplication) {
   
      throw new IOException("File " + src + " could only be replicated to "
          + targets.length + " nodes instead of minReplication (="
          + minReplication + ").  There are "
          + getDatanodeManager().getNetworkTopology().getNumOfLeaves()
          + " datanode(s) running and "
          + (excludedNodes == null? "no": excludedNodes.size())
          + " node(s) are excluded in this operation.");
    }
    return targets;
  }
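To see how the message from the background section can report every running datanode as excluded, here is a minimal mock. All names (such as `isGoodTarget`) and the capacity threshold are illustrative assumptions, not Hadoop's real checks: each nearly-full node fails the space check, lands in the excluded set, and the target list stays empty.

```java
import java.util.*;

// Minimal sketch (not Hadoop code): when every candidate fails a check,
// targets stays empty and the familiar "0 nodes instead of minReplication"
// message is produced with all datanodes counted as excluded.
public class ChooseTargetSketch {
    static final long BLOCK_SIZE = 128L * 1024 * 1024; // 128 MB block

    // Illustrative capacity check: a node must fit one full block.
    static boolean isGoodTarget(long remainingBytes) {
        return remainingBytes > BLOCK_SIZE;
    }

    public static void main(String[] args) {
        // 983 datanodes, all nearly full (the situation in the error above)
        long[] remaining = new long[983];
        Arrays.fill(remaining, 64L * 1024 * 1024); // only 64 MB free each

        Set<Integer> excluded = new HashSet<>();
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < remaining.length && targets.size() < 3; i++) {
            if (isGoodTarget(remaining[i])) targets.add(i);
            else excluded.add(i); // rejected nodes go to the excluded set
        }

        int minReplication = 1;
        if (targets.size() < minReplication) {
            System.out.println("could only be replicated to " + targets.size()
                + " nodes instead of minReplication (=" + minReplication
                + "). There are " + remaining.length
                + " datanode(s) running and " + excluded.size()
                + " node(s) are excluded in this operation.");
        }
    }
}
```

So "983 node(s) are excluded" does not mean the nodes were down; it means every one of them was rejected during selection, e.g. for lack of usable space.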

BlockPlacementPolicyDefault.java

  /** This is the implementation. */
  private DatanodeStorageInfo[] chooseTarget(int numOfReplicas,
                                    Node writer,
                                    List<DatanodeStorageInfo> chosenStorage,
                                    boolean returnChosenNodes,
                                    Set<Node> excludedNodes,
                                    long blocksize,
                                    final BlockStoragePolicy storagePolicy) {
   
    // No replicas requested, or no datanodes in the cluster: return an empty array
    if (numOfReplicas == 0 || clusterMap.getNumOfLeaves()==0) {
   
      return DatanodeStorageInfo.EMPTY_ARRAY;
    }
    // Lazily initialize the excluded-node set
    if (excludedNodes == null) {
   
      excludedNodes = new HashSet<Node>();
    }
    // Compute the maximum number of replicas each rack may hold
    int[] result = getMaxNodesPerRack(chosenStorage.size(), numOfReplicas);
    numOfReplicas = result[0];
    int maxNodesPerRack = result[1];
    // Seed the result list with the already-chosen storages
    final List<DatanodeStorageInfo> results = new ArrayList<DatanodeStorageInfo>(chosenStorage);
    for (DatanodeStorageInfo storage : chosenStorage) {
   
      // add localMachine and related nodes to excludedNodes
      addToExcludedNodes(storage.getDatanodeDescriptor(), excludedNodes);
    }

    boolean avoidStaleNodes = (stats != null
        && stats.isAvoidingStaleDataNodesForWrite());
    // Call the recursive chooseTarget to pick the remaining nodes
    final Node localNode = chooseTarget(numOfReplicas, writer, excludedNodes,
        blocksize, maxNodesPerRack, results, avoidStaleNodes, storagePolicy,
        EnumSet.noneOf(StorageType.class), results.isEmpty());
    if (!returnChosenNodes) {
     
      results.removeAll(chosenStorage);
    }
      
    // sorting nodes to form a pipeline
    return getPipeline(
        (writer != null && writer instanceof DatanodeDescriptor) ? writer
            : localNode,
        results.toArray(new DatanodeStorageInfo[results.size()]));
  }
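The final getPipeline step sorts the chosen storages so that consecutive hops in the write pipeline are as close to each other as possible: starting from the writer, it greedily appends the closest remaining node. A rough sketch of that greedy idea, with a simplified two-level distance and names invented for the example:

```java
import java.util.*;

// Sketch of the pipeline-ordering idea behind getPipeline (illustrative,
// not Hadoop's code). Nodes are {name, rack} pairs.
public class PipelineSketch {
    // Toy network distance: 2 within a rack, 4 across racks
    // (Hadoop counts hops to a common ancestor in the topology tree).
    static int distance(String rackA, String rackB) {
        return rackA.equals(rackB) ? 2 : 4;
    }

    static List<String[]> orderPipeline(String writerRack, List<String[]> nodes) {
        List<String[]> remaining = new ArrayList<>(nodes);
        List<String[]> pipeline = new ArrayList<>();
        String currentRack = writerRack;
        while (!remaining.isEmpty()) {
            // Pick the node closest to the current end of the pipeline
            String[] best = remaining.get(0);
            for (String[] n : remaining)
                if (distance(currentRack, n[1]) < distance(currentRack, best[1]))
                    best = n;
            remaining.remove(best);
            pipeline.add(best);
            currentRack = best[1];
        }
        return pipeline;
    }

    public static void main(String[] args) {
        // Writer sits on rack0; chosen targets span rack0 and rack1
        List<String[]> chosen = List.of(
            new String[]{"dnB", "rack1"},
            new String[]{"dnA", "rack0"},
            new String[]{"dnC", "rack1"});
        List<String[]> p = orderPipeline("rack0", chosen);
        // The rack-local node should come first in the pipeline
        if (!p.get(0)[0].equals("dnA")) throw new AssertionError();
        System.out.println(p.get(0)[0] + " -> " + p.get(1)[0] + " -> " + p.get(2)[0]);
    }
}
```

With this ordering, the single expensive cross-rack transfer happens once, between the local node and the remote rack, instead of the pipeline bouncing back and forth between racks.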

getMaxNodesPerRack method (calculates the maximum number of replicas that may be allocated per rack)

  private int[] getMaxNodesPerRack(int numOfChosen, int numOfReplicas) {
   
    int clusterSize = clusterMap.getNumOfLeaves();
    int totalNumOfReplicas = numOfChosen + numOfReplicas;
    if (totalNumOfReplicas > clusterSize) {
   
      numOfReplicas -= (totalNumOfReplicas-clusterSize);
      totalNumOfReplicas = clusterSize;
    }
    // No calculation needed when there is only one rack or picking one node.
    int numOfRacks = clusterMap.getNumOfRacks();
    if (numOfRacks == 1 || totalNumOfReplicas <= 1) {
      return new int[] {numOfReplicas, totalNumOfReplicas};
    }
    int maxNodesPerRack = (totalNumOfReplicas - 1) / numOfRacks + 2;
    // Never let a single rack hold every replica
    if (maxNodesPerRack == totalNumOfReplicas) {
      maxNodesPerRack--;
    }
    return new int[] {numOfReplicas, maxNodesPerRack};
  }
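Assuming the `(totalNumOfReplicas - 1) / numOfRacks + 2` cap used by recent Hadoop versions, a quick check of the arithmetic:

```java
public class MaxNodesPerRack {
    // Sketch of the per-rack cap computed by getMaxNodesPerRack, assuming
    // the (totalNumOfReplicas-1)/numOfRacks + 2 formula; simplified here.
    static int maxNodesPerRack(int totalNumOfReplicas, int numOfRacks) {
        // Single rack or single replica: no cap needed
        if (numOfRacks == 1 || totalNumOfReplicas <= 1) return totalNumOfReplicas;
        int max = (totalNumOfReplicas - 1) / numOfRacks + 2;
        // Never allow all replicas to land on one rack
        if (max == totalNumOfReplicas) max--;
        return max;
    }

    public static void main(String[] args) {
        // 3 replicas across 2 racks: (3-1)/2+2 = 3, capped down to 2,
        // so one rack holds at most 2 of the 3 copies
        System.out.println(maxNodesPerRack(3, 2));  // prints 2
        // 3 replicas across 10 racks: (3-1)/10+2 = 2
        System.out.println(maxNodesPerRack(3, 10)); // prints 2
    }
}
```

For the common case of 3 replicas, the cap is 2, which is exactly what the placement rules produce: one replica on the local rack and two on a single remote rack.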