HDFS DataNode Disk Selection Policy Analysis

风筝Lee

Article link: https://blog.csdn.net/breakout_alex/article/details/89156794

Overview

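HDFS is one of the core components of the Hadoop stack, and the actual data lives on the DataNodes. A DataNode stores data blocks in directories on its local file system, and each DataNode can be configured with multiple storage directories (possibly on different types of disks) through the dfs.datanode.data.dir parameter in hdfs-site.xml.

DataNodes in a typical Hadoop cluster are configured with several data disks. When a new block is written to HDFS, the DataNode uses a volume choosing policy to pick the disk directory that will store the block. Two volume choosing policies are currently available: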

round-robin （default）
available space

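The problem:

Hadoop clusters tend to be large and need long-term maintenance, which involves many routine operations: adding new servers, periodically replacing failed disks, decommissioning servers, deleting historical data, and so on. These operations leave data unbalanced between nodes, and also unbalanced across the disks of a single DataNode.

1. Imbalance between nodes can be corrected with HDFS's own balancer tool;

2. Imbalance across the disks of one DataNode is addressed by the disk balancer (diskbalancer) introduced in Hadoop 3.0.

This raises a question: why doesn't the DataNode's own disk selection policy prevent this imbalance in the first place?

Let's look at the DataNode source code related to disk selection: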

hdfs-site.xml: configuration property dfs.datanode.fsdataset.volume.choosing.policy

org.apache.hadoop.hdfs.server.datanode.fsdataset.RoundRobinVolumeChoosingPolicy (default)

org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy

A. RoundRobinVolumeChoosingPolicy:



/**
 * Choose volumes in round-robin order.
 */
public class RoundRobinVolumeChoosingPolicy<V extends FsVolumeSpi>
    implements VolumeChoosingPolicy<V> {
  public static final Log LOG = LogFactory.getLog(RoundRobinVolumeChoosingPolicy.class);

  private int curVolume = 0;

  @Override
  public synchronized V chooseVolume(final List<V> volumes, long blockSize)
      throws IOException {

    if(volumes.size() < 1) {
      throw new DiskOutOfSpaceException("No more available volumes");
    }
    
    // since volumes could've been removed because of the failure
    // make sure we are not out of bounds
    if(curVolume >= volumes.size()) {
      curVolume = 0;
    }
    
    int startVolume = curVolume;
    long maxAvailable = 0;
    
    // Iterate over the volume list
    while (true) {
      final V volume = volumes.get(curVolume);
      curVolume = (curVolume + 1) % volumes.size();
      long availableVolumeSize = volume.getAvailable();
      // Available space exceeds the block size: return this volume right away
      if (availableVolumeSize > blockSize) {
        return volume;
      }
      
      // Track the largest available space seen so far
      if (availableVolumeSize > maxAvailable) {
        maxAvailable = availableVolumeSize;
      }
      // Wrapped around without finding a volume with enough space
      if (curVolume == startVolume) {
        throw new DiskOutOfSpaceException("Out of space: "
            + "The volume with the most available space (=" + maxAvailable
            + " B) is less than the block size (=" + blockSize + " B).");
      }
    }
  }
}

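As the code shows, round-robin is also meant to balance data, and it does guarantee that every disk gets used. However, the algorithm rotates purely by block count and ignores the size of each block. If block sizes vary widely, the bytes stored on each disk drift apart. Heavy deletion of files on HDFS can likewise leave the data unevenly distributed across disks.

Now let's look at the second implementation.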
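To illustrate the block-size effect, here is a minimal stand-alone sketch, not Hadoop code. The class `RoundRobinSkewDemo`, the toy `Volume` type, and the `simulate` helper are all hypothetical names introduced for this example; only the rotation rule is taken from the excerpt above. It writes alternating large and small blocks to two equally sized volumes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of round-robin volume selection; not part of Hadoop. */
class RoundRobinSkewDemo {

    /** Toy volume that only tracks capacity and bytes used. */
    static final class Volume {
        final long capacity;
        long used;
        Volume(long capacity) { this.capacity = capacity; }
        long available() { return capacity - used; }
    }

    private int cur = 0;

    /** Same rotation rule as the excerpt: next volume whose free space exceeds the block size. */
    Volume choose(List<Volume> volumes, long blockSize) {
        if (volumes.isEmpty()) {
            throw new IllegalStateException("No more available volumes");
        }
        if (cur >= volumes.size()) {
            cur = 0;
        }
        int start = cur;
        while (true) {
            Volume v = volumes.get(cur);
            cur = (cur + 1) % volumes.size();
            if (v.available() > blockSize) {
                return v;
            }
            if (cur == start) {
                throw new IllegalStateException("Out of space");
            }
        }
    }

    /** Writes the blocks in order and returns the bytes used on each volume. */
    static long[] simulate(int volumeCount, long capacity, long[] blockSizes) {
        List<Volume> volumes = new ArrayList<>();
        for (int i = 0; i < volumeCount; i++) {
            volumes.add(new Volume(capacity));
        }
        RoundRobinSkewDemo policy = new RoundRobinSkewDemo();
        for (long size : blockSizes) {
            policy.choose(volumes, size).used += size;
        }
        long[] used = new long[volumeCount];
        for (int i = 0; i < volumeCount; i++) {
            used[i] = volumes.get(i).used;
        }
        return used;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 100 blocks alternating between 128 MB and 1 MB (an assumed workload).
        long[] blocks = new long[100];
        for (int i = 0; i < blocks.length; i++) {
            blocks[i] = (i % 2 == 0 ? 128 : 1) * mb;
        }
        long[] used = simulate(2, 100_000 * mb, blocks);
        System.out.println("volume 0 used MB: " + used[0] / mb); // 6400
        System.out.println("volume 1 used MB: " + used[1] / mb); // 50
    }
}
```

Both volumes receive exactly 50 blocks, yet one ends up holding 6400 MB and the other 50 MB. Real workloads are rarely this regular, so treat the numbers as an extreme case of the drift described above.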

B. AvailableSpaceVolumeChoosingPolicy:

/**
 * A DN volume choosing policy which takes into account the amount of free
 * space on each of the available volumes when considering where to assign a
 * new replica allocation. By default this policy prefers assigning replicas to
 * those volumes with more available free space, so as to over time balance the
 * available space of all the volumes within a DN.
 */
public class AvailableSpaceVolumeChoosingPolicy<V extends FsVolumeSpi>
    implements VolumeChoosingPolicy<V>, Configurable {

  
  // Load and initialize configuration (omitted)
  .................
 
  // Round-robin policy used when all volumes are already balanced
  private final VolumeChoosingPolicy<V> roundRobinPolicyBalanced =
      new RoundRobinVolumeChoosingPolicy<V>();
  // Round-robin policy used among the volumes with high available space
  private final VolumeChoosingPolicy<V> roundRobinPolicyHighAvailable =
      new RoundRobinVolumeChoosingPolicy<V>();
  // Round-robin policy used among the volumes with low available space
  private final VolumeChoosingPolicy<V> roundRobinPolicyLowAvailable =
      new RoundRobinVolumeChoosingPolicy<V>();

  @Override
  public synchronized V chooseVolume(List<V> volumes,
      long replicaSize) throws IOException {
    if (volumes.size() < 1) {
      throw new DiskOutOfSpaceException("No more available volumes");
    }
    
    AvailableSpaceVolumeList volumesWithSpaces =
        new AvailableSpaceVolumeList(volumes);
    // If all volumes are within the configurable balance threshold, use plain round-robin
    if (volumesWithSpaces.areAllVolumesWithinFreeSpaceThreshold()) {
      // If they're actually not too far out of whack, fall back on pure round
      // robin.
      V volume = roundRobinPolicyBalanced.chooseVolume(volumes, replicaSize);
      if (LOG.isDebugEnabled()) {
        LOG.debug("All volumes are within the configured free space balance " +
            "threshold. Selecting " + volume + " for write of block size " +
            replicaSize);
      }
      return volume;
    } else {
      V volume = null;
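      // If none of the low-free-space volumes has room for the replica,
      // always try to choose a volume with plenty of free space.
      // Largest available space among the low-available-space volumes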
      long mostAvailableAmongLowVolumes = volumesWithSpaces
          .getMostAvailableSpaceAmongVolumesWithLowAvailableSpace();
      // Volumes with high available space
      List<V> highAvailableVolumes = extractVolumesFromPairs(
          volumesWithSpaces.getVolumesWithHighAvailableSpace());
      // Volumes with low available space
      List<V> lowAvailableVolumes = extractVolumesFromPairs(
          volumesWithSpaces.getVolumesWithLowAvailableSpace());
      // Scaled preference: probability of choosing from the high-available-space list
      float preferencePercentScaler =
          (highAvailableVolumes.size() * balancedPreferencePercent) +
          (lowAvailableVolumes.size() * (1 - balancedPreferencePercent));
      float scaledPreferencePercent =
          (highAvailableVolumes.size() * balancedPreferencePercent) /
          preferencePercentScaler;
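      // If the largest available space among the low-space volumes cannot hold the replica,
      // or the random draw falls below the scaled preference, choose round-robin among the high-space volumes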
      if (mostAvailableAmongLowVolumes < replicaSize ||
          random.nextFloat() < scaledPreferencePercent) {
        volume = roundRobinPolicyHighAvailable.chooseVolume(
            highAvailableVolumes, replicaSize);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Volumes are imbalanced. Selecting " + volume +
              " from high available space volumes for write of block size "
              + replicaSize);
        }
      } else {
        // Otherwise choose from the low-available-space list
        volume = roundRobinPolicyLowAvailable.chooseVolume(
            lowAvailableVolumes, replicaSize);
        if (LOG.isDebugEnabled()) {
          LOG.debug("Volumes are imbalanced. Selecting " + volume +
              " from low available space volumes for write of block size "
              + replicaSize);
        }
      }
      return volume;
    }
  }
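To make the two preference expressions in chooseVolume concrete, the sketch below evaluates them outside Hadoop. The class `ScaledPreferenceDemo` and the method `scaledPreference` are hypothetical names, and 0.75 is an assumed setting for `balancedPreferencePercent` (I believe it matches the Hadoop default, but check the value for your release).

```java
/** Hypothetical helper that reproduces the preference arithmetic from the excerpt. */
class ScaledPreferenceDemo {

    /**
     * Probability that a write goes to the high-available-space list,
     * given the sizes of both lists and the configured preference fraction.
     */
    static float scaledPreference(int highCount, int lowCount,
                                  float balancedPreferencePercent) {
        float preferencePercentScaler =
            (highCount * balancedPreferencePercent)
            + (lowCount * (1 - balancedPreferencePercent));
        return (highCount * balancedPreferencePercent) / preferencePercentScaler;
    }

    public static void main(String[] args) {
        float fraction = 0.75f; // assumed configuration value
        // One high-space volume, three low-space volumes: roughly 0.5
        System.out.println(scaledPreference(1, 3, fraction));
        // Two and two: roughly 0.75
        System.out.println(scaledPreference(2, 2, fraction));
        // Three high-space volumes, one low-space volume: roughly 0.9
        System.out.println(scaledPreference(3, 1, fraction));
    }
}
```

Under these assumptions, a single freshly replaced disk among three nearly full ones receives about half of the new writes, while each of the three low-space disks still receives about one sixth. The policy only tilts writes toward emptier disks; it never moves existing data.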

The logic that builds the high and low available-space volume lists:

  /**
   * Used to keep track of the list of volumes we're choosing from.
   */
  private class AvailableSpaceVolumeList {
    
    // Omitted
    ................

    /**
     * @return the maximum amount of space available across volumes with low space.
     */
    public long getMostAvailableSpaceAmongVolumesWithLowAvailableSpace() {
      long mostAvailable = Long.MIN_VALUE;
      for (AvailableSpaceVolumePair volume : getVolumesWithLowAvailableSpace()) {
        mostAvailable = Math.max(mostAvailable, volume.getAvailable());
      }
      return mostAvailable;
    }
    
    /**
     * @return the list of volumes with relatively low available space.
     */
    public List<AvailableSpaceVolumePair> getVolumesWithLowAvailableSpace() {
      long leastAvailable = getLeastAvailableSpace();
      List<AvailableSpaceVolumePair> ret = new ArrayList<AvailableSpaceVolumePair>();
      for (AvailableSpaceVolumePair volume : volumes) {
        // Available space <= (least available space + balanced space threshold)
        if (volume.getAvailable() <= leastAvailable + balancedSpaceThreshold) {
          ret.add(volume);
        }
      }
      return ret;
    }
    
    /**
     * @return the list of volumes with a lot of available space.
     */
    public List<AvailableSpaceVolumePair> getVolumesWithHighAvailableSpace() {
      long leastAvailable = getLeastAvailableSpace();
      List<AvailableSpaceVolumePair> ret = new ArrayList<AvailableSpaceVolumePair>();
      for (AvailableSpaceVolumePair volume : volumes) {
        // Available space > (least available space + balanced space threshold)
        if (volume.getAvailable() > leastAvailable + balancedSpaceThreshold) {
          ret.add(volume);
        }
      }
      return ret;
    }
    
  }
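The partition rule used by the two list methods can be tried out in isolation. The following is a minimal sketch with hypothetical names (`VolumePartitionDemo`, `lowSpaceIndexes`, `highSpaceIndexes`); it works on plain arrays of available bytes instead of Hadoop volume objects, and the 10 GB threshold is an assumed value for `balancedSpaceThreshold`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone version of the low/high partition rule; not Hadoop code. */
class VolumePartitionDemo {

    /** Smallest available space across all volumes. */
    static long leastAvailable(long[] available) {
        long least = Long.MAX_VALUE;
        for (long a : available) {
            least = Math.min(least, a);
        }
        return least;
    }

    /** Indexes of volumes whose free space is at most least + threshold. */
    static List<Integer> lowSpaceIndexes(long[] available, long threshold) {
        long least = leastAvailable(available);
        List<Integer> ret = new ArrayList<>();
        for (int i = 0; i < available.length; i++) {
            if (available[i] <= least + threshold) {
                ret.add(i);
            }
        }
        return ret;
    }

    /** Indexes of volumes whose free space is strictly above least + threshold. */
    static List<Integer> highSpaceIndexes(long[] available, long threshold) {
        long least = leastAvailable(available);
        List<Integer> ret = new ArrayList<>();
        for (int i = 0; i < available.length; i++) {
            if (available[i] > least + threshold) {
                ret.add(i);
            }
        }
        return ret;
    }

    public static void main(String[] args) {
        long gb = 1L << 30;
        long[] available = {5 * gb, 12 * gb, 500 * gb, 900 * gb};
        long threshold = 10 * gb; // assumed balancedSpaceThreshold
        System.out.println("low:  " + lowSpaceIndexes(available, threshold));  // [0, 1]
        System.out.println("high: " + highSpaceIndexes(available, threshold)); // [2, 3]
    }
}
```

Note that the split is relative to the emptiest-space minimum, not to any absolute level: the volumes with 500 GB and 900 GB free land in the same "high" list and are then treated equally by round-robin.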

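As the code shows, the available-space policy uses the configured balance threshold to split the disks into two lists: a high-available-space list and a low-available-space list. A random draw then selects disks from the high-available-space list with a correspondingly higher probability.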