HDFS Hedged Read

王小禾

Category: HDFS

Article link: https://blog.csdn.net/answer100answer/article/details/107668089

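Hedged reads are an HDFS client feature available since Hadoop 2.4.0. Normally each read request is served by a single thread. With hedged reads enabled, the client waits a configurable amount of time for a read to return; if it has not returned by then, the client issues a second read for the same data against a different block replica. Whichever read returns first is used, and the other one is discarded.
Hedged reads target the occasional slow read, which may be caused by a transient fault such as a disk error or network jitter.
The client exposes two hedged-read parameters:

dfs.client.hedged.read.threadpool.size (default 0): the number of threads dedicated to serving hedged reads. When it is 0 (the default), hedged reads are disabled.
dfs.client.hedged.read.threshold.millis (default 500, i.e. 0.5 s): how long to wait before spawning the second read.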

DFSClient

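private volatile long hedgedReadThresholdMillis;  // hedged-read threshold; volatile so that updates are visible across threads (volatile does not make it atomic)
private static ThreadPoolExecutor HEDGED_READ_THREAD_POOL;  // static thread pool, shared by every DFSClient in the JVM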

The two parameters are read as follows:

this.hedgedReadThresholdMillis = conf.getLong(
    DFSConfigKeys.DFS_DFSCLIENT_HEDGED_READ_THRESHOLD_MILLIS,
    DFSConfigKeys.DEFAULT_DFSCLIENT_HEDGED_READ_THRESHOLD_MILLIS); // default is 500
int numThreads = conf.getInt(
    DFSConfigKeys.DFS_DFSCLIENT_HEDGED_READ_THREADPOOL_SIZE,
    DFSConfigKeys.DEFAULT_DFSCLIENT_HEDGED_READ_THREADPOOL_SIZE); // default is 0
if (numThreads > 0) {
  // A pool size of 0 leaves hedged reads disabled; the pool is only initialized when the size is positive.
  this.initThreadsNumForHedgedReads(numThreads);
}

/**
 * Create hedged reads thread pool, HEDGED_READ_THREAD_POOL, if
 * it does not already exist.
 * @param num Number of threads for hedged reads thread pool.
 * If zero, skip hedged reads thread pool creation.
 */
private synchronized void initThreadsNumForHedgedReads(int num) {
  if (num <= 0 || HEDGED_READ_THREAD_POOL != null) return;
  HEDGED_READ_THREAD_POOL = new ThreadPoolExecutor(1, num, 60,
      TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
      new Daemon.DaemonFactory() {
        private final AtomicInteger threadIndex =
          new AtomicInteger(0); 
        @Override
        public Thread newThread(Runnable r) {
          Thread t = super.newThread(r);
          t.setName("hedgedRead-" +
            threadIndex.getAndIncrement());
          return t;
        }
      },
      new ThreadPoolExecutor.CallerRunsPolicy() {

    @Override
    public void rejectedExecution(Runnable runnable,
        ThreadPoolExecutor e) {
      LOG.info("Execution rejected, Executing in current thread");
      HEDGED_READ_METRIC.incHedgedReadOpsInCurThread();
      // will run in the current thread
      super.rejectedExecution(runnable, e);
    }
  });
  HEDGED_READ_THREAD_POOL.allowCoreThreadTimeOut(true);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Using hedged reads; pool threads=" + num);
  }
}

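Looking at the initialization code: the pool is created only if it has not been instantiated yet. Because HEDGED_READ_THREAD_POOL is static, a pool that already exists is reused as is.
The ThreadPoolExecutor parameters in general are not covered here; see other references. The arguments used to initialize HEDGED_READ_THREAD_POOL are:

corePoolSize: the number of core threads, 1 here;
maximumPoolSize: the maximum number of threads the pool may create, here num, i.e. the configured dfs.client.hedged.read.threadpool.size;
keepAliveTime: how long an idle thread survives, 60 s here;
workQueue: the queue for tasks that have not started yet, here new SynchronousQueue(), which has no capacity and hands each task directly to a thread;
threadFactory: a daemon thread factory that names its threads hedgedRead-N;
handler: the rejection policy, a ThreadPoolExecutor.CallerRunsPolicy() subclass. Since the queue holds nothing, a task is rejected as soon as all num threads are busy; the handler logs the event, increments a metric, and runs the task in the submitting thread;
HEDGED_READ_THREAD_POOL.allowCoreThreadTimeOut(true): lets the core thread time out as well, so an idle pool shrinks to zero threads.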
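
The combination of a SynchronousQueue and CallerRunsPolicy is easy to misread, so here is a minimal standalone sketch of what happens when the pool is saturated. It uses only java.util.concurrent; the class name CallerRunsDemo and the method threadOfOverflowTask are made up for this illustration and are not part of Hadoop. The pool is built with the same arguments as above, minus the custom thread factory and the metric.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical demo class, not Hadoop code.
class CallerRunsDemo {
  /** Keeps all maxThreads workers busy, then reports which thread ran one extra task. */
  static String threadOfOverflowTask(int maxThreads) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(1, maxThreads, 60,
        TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
        new ThreadPoolExecutor.CallerRunsPolicy());
    pool.allowCoreThreadTimeOut(true);
    CountDownLatch allBusy = new CountDownLatch(maxThreads);
    CountDownLatch release = new CountDownLatch(1);
    try {
      // Each of these tasks gets its own worker thread and blocks it.
      for (int i = 0; i < maxThreads; i++) {
        pool.execute(() -> {
          allBusy.countDown();
          try {
            release.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      allBusy.await();
      // No idle worker can take the task from the SynchronousQueue and no new
      // thread may be created, so the task is rejected and CallerRunsPolicy
      // runs it synchronously in the submitting thread.
      AtomicReference<String> ranOn = new AtomicReference<>();
      pool.execute(() -> ranOn.set(Thread.currentThread().getName()));
      return ranOn.get();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while saturating the pool", e);
    } finally {
      release.countDown();
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("caller thread:   " + Thread.currentThread().getName());
    System.out.println("overflow ran on: " + threadOfOverflowTask(2));
  }
}
```

For hedged reads this means that once all pool threads are busy, a read submitted to the pool runs in the thread that submitted it, which is what the incHedgedReadOpsInCurThread metric counts.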

DFSClient also provides a few helper methods for hedged reads, such as:

long getHedgedReadTimeout() {
  return this.hedgedReadThresholdMillis;
}

@VisibleForTesting
void setHedgedReadTimeout(long timeoutMillis) {
  this.hedgedReadThresholdMillis = timeoutMillis;
}

ThreadPoolExecutor getHedgedReadsThreadPool() {
  return HEDGED_READ_THREAD_POOL;
}

boolean isHedgedReadsEnabled() {
  return (HEDGED_READ_THREAD_POOL != null) &&
    HEDGED_READ_THREAD_POOL.getMaximumPoolSize() > 0;
}

DFSInputStream

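In the input stream DFSInputStream, the positional read path (pread) calls dfsClient.isHedgedReadsEnabled() to check whether the hedged read feature is on. If it is, the data is read through hedgedFetchBlockByteRange(); otherwise it is read through fetchBlockByteRange():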

if (dfsClient.isHedgedReadsEnabled()) {
  hedgedFetchBlockByteRange(blk, targetStart, targetStart + bytesToRead
      - 1, buffer, offset, corruptedBlockMap);
} else {
  fetchBlockByteRange(blk, targetStart, targetStart + bytesToRead - 1,
      buffer, offset, corruptedBlockMap);
}

With hedged reads enabled, the hedgedFetchBlockByteRange path is taken. The method declares the following local variables:

ArrayList<Future<ByteBuffer>> futures = new ArrayList<Future<ByteBuffer>>(); //1
CompletionService<ByteBuffer> hedgedService = new ExecutorCompletionService<ByteBuffer>(dfsClient.getHedgedReadsThreadPool()); //2
ArrayList<DatanodeInfo> ignored = new ArrayList<DatanodeInfo>(); //3
ByteBuffer bb = null; //4
int len = (int) (end - start + 1); //4
int hedgedReadId = 0; //4

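1. futures: a list of Futures, one per outstanding read.
2. hedgedService: an ExecutorCompletionService backed by HEDGED_READ_THREAD_POOL, the pool initialized in dfsClient.
3. ignored: records the DataNodes to exclude from later selection.
4. bb, len and hedgedReadId: the result buffer, the number of bytes to read (end - start + 1), and a counter for the reads issued.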

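These local variables support the hedged read. The main flow runs inside a while (true) loop, which has two branches:

1. First read (futures is empty). On the first pass both ignored and futures are empty. chosenNode = chooseDataNode(block, ignored) picks a DataNode; if no usable node is found, it goes back to the NameNode over RPC to fetch the block locations again, and this is the only branch that does so.
A Callable that returns a ByteBuffer is built from chosenNode and submitted to hedgedService. The returned Future, firstRequest, is added to the list with futures.add(firstRequest).
The client then calls poll with a timeout, which waits at most hedgedReadThresholdMillis for a completed future. If one arrives, the result is copied out and the method returns. Otherwise the client calls incHedgedReadOps, adds the current node to ignored so that it is skipped next time, and continues the loop, which is where the hedged read starts. If the first read fails with an exception instead, its future is removed from futures, so the next pass takes this branch again with a different node.
2. Later passes (futures is not empty). The real hedged read begins on the second pass, because the first pass is an ordinary read: if it completes within the threshold (500 ms by default), the method has already returned. Here refetch starts as false and chooseDataNode(block, ignored, false) is called, so node selection does not go to the NameNode. If a node is chosen, a read from it is submitted and its Future is added to futures; if no node is available, refetch is set to true so that the block's locations are fetched again through refetchLocations. Either way, getFirstToComplete then waits for whichever outstanding read finishes first.
The key code of hedgedFetchBlockByteRange is listed below. It appears to come from a different Hadoop version than the snippets above, which is why its signature differs from the earlier call site.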

/**
 * Like {@link #fetchBlockByteRange} except we start up a second, parallel,
 * 'hedged' read if the first read is taking longer than configured amount of
 * time. We then wait on which ever read returns first.
 */
private void hedgedFetchBlockByteRange(LocatedBlock block, long start,
    long end, ByteBuffer buf, CorruptedBlocks corruptedBlocks)
    throws IOException {
  final DFSClient.Conf conf = dfsClient.getConf();
  ArrayList<Future<ByteBuffer>> futures = new ArrayList<>();
  CompletionService<ByteBuffer> hedgedService =
      new ExecutorCompletionService<>(dfsClient.getHedgedReadsThreadPool());
  ArrayList<DatanodeInfo> ignored = new ArrayList<>();
  ByteBuffer bb;
  int len = (int) (end - start + 1);
  int hedgedReadId = 0;
  while (true) {
    // see HDFS-6591, this metric is used to verify/catch unnecessary loops
    hedgedReadOpsLoopNumForTesting++;
    DNAddrPair chosenNode = null;
    // there is no request already executing.
    if (futures.isEmpty()) {
      // chooseDataNode is a commitment. If no node, we go to
      // the NN to reget block locations. Only go here on first read.
      chosenNode = chooseDataNode(block, ignored);
      // Latest block, if refreshed internally
      block = chosenNode.block;
      bb = ByteBuffer.allocate(len);
      Callable<ByteBuffer> getFromDataNodeCallable = getFromOneDataNode(
          chosenNode, block, start, end, bb,
          corruptedBlocks, hedgedReadId++);
      Future<ByteBuffer> firstRequest = hedgedService
          .submit(getFromDataNodeCallable);
      futures.add(firstRequest);
      Future<ByteBuffer> future = null;
      try {
        future = hedgedService.poll(
            dfsClient.getHedgedReadTimeout(), TimeUnit.MILLISECONDS);
        if (future != null) {
          ByteBuffer result = future.get();
          result.flip();
          buf.put(result);
          return;
        }
        DFSClient.LOG.debug("Waited {}ms to read from {}; spawning hedged "
            + "read", dfsClient.getHedgedReadTimeout(), chosenNode.info);
        dfsClient.getHedgedReadMetrics().incHedgedReadOps();
        // continue; no need to refresh block locations
      } catch (ExecutionException e) {
        futures.remove(future);
      } catch (InterruptedException e) {
        throw new InterruptedIOException(
            "Interrupted while waiting for reading task");
      }
      // Ignore this node on next go around.
      // If poll timeout and the request still ongoing, don't consider it
      // again. If read data failed, don't consider it either.
      ignored.add(chosenNode.info);
    } else {
      // We are starting up a 'hedged' read. We have a read already
      // ongoing. Call getBestNodeDNAddrPair instead of chooseDataNode.
      // If no nodes to do hedged reads against, pass.
      boolean refetch = false;
      try {
        chosenNode = chooseDataNode(block, ignored, false);
        if (chosenNode != null) {
          // Latest block, if refreshed internally
          block = chosenNode.block;
          bb = ByteBuffer.allocate(len);
          Callable<ByteBuffer> getFromDataNodeCallable =
              getFromOneDataNode(chosenNode, block, start, end, bb,
                  corruptedBlocks, hedgedReadId++);
          Future<ByteBuffer> oneMoreRequest =
              hedgedService.submit(getFromDataNodeCallable);
          futures.add(oneMoreRequest);
        } else {
          refetch = true;
        }
      } catch (IOException ioe) {
        DFSClient.LOG.debug("Failed getting node for hedged read: {}",
            ioe.getMessage());
      }
      // if not succeeded. Submit callables for each datanode in a loop, wait
      // for a fixed interval and get the result from the fastest one.
      try {
        ByteBuffer result = getFirstToComplete(hedgedService, futures);
        // cancel the rest.
        cancelAll(futures);
        dfsClient.getHedgedReadMetrics().incHedgedReadWins();
        result.flip();
        buf.put(result);
        return;
      } catch (InterruptedException ie) {
        // Ignore and retry
      }
      if (refetch) {
        refetchLocations(block, ignored);
      }
      // We got here if exception. Ignore this node on next go around IFF
      // we found a chosenNode to hedge read against.
      if (chosenNode != null && chosenNode.info != null) {
        ignored.add(chosenNode.info);
      }
    }
  }
}

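getFirstToComplete is shown below. CompletionService itself is not covered here (see the multithreading posts on this blog); its take() method blocks until some task has completed and returns that task's Future. Note that the method throws InterruptedException("let's retry") as a signal for the caller to loop again, both when futures is empty and when the completed read turned out to have failed.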

private ByteBuffer getFirstToComplete(
    CompletionService<ByteBuffer> hedgedService,
    ArrayList<Future<ByteBuffer>> futures) throws InterruptedException {
  if (futures.isEmpty()) {
    throw new InterruptedException("let's retry");
  }
  Future<ByteBuffer> future = null;
  try {
    future = hedgedService.take();
    ByteBuffer bb = future.get();
    futures.remove(future);
    return bb;
  } catch (ExecutionException | CancellationException e) {
    // already logged in the Callable
    futures.remove(future);
  }

  throw new InterruptedException("let's retry");
}
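
To make the behavior of take() concrete, the following standalone sketch submits a slow task first and a fast task second, then drains the CompletionService. The class name CompletionOrderDemo, the task labels and the 300 ms delay are assumptions chosen for the illustration; nothing here is Hadoop code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class, not Hadoop code.
class CompletionOrderDemo {
  /** Returns the task labels in the order in which take() handed them back. */
  static List<String> completionOrder() {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    CompletionService<String> service = new ExecutorCompletionService<>(pool);
    try {
      // Submitted first, but finishes last.
      service.submit(() -> {
        Thread.sleep(300);
        return "submitted-first";
      });
      // Submitted second, finishes immediately.
      service.submit(() -> "submitted-second");

      List<String> order = new ArrayList<>();
      // take() blocks until any task is done and returns its Future,
      // so results come back in completion order, not submission order.
      order.add(service.take().get());
      order.add(service.take().get());
      return order;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("task failed", e.getCause());
    } finally {
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println(completionOrder());
  }
}
```

This is the property getFirstToComplete relies on: with several reads in flight, the Future it receives belongs to whichever DataNode answered first.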

The asynchronous read task:

  private Callable<ByteBuffer> getFromOneDataNode(final DNAddrPair datanode,
      final LocatedBlock block, final long start, final long end,
      final ByteBuffer bb,
      final CorruptedBlocks corruptedBlocks,
      final int hedgedReadId) {
    return new Callable<ByteBuffer>() {
      @Override
      public ByteBuffer call() throws Exception {
        DFSClientFaultInjector.get().sleepBeforeHedgedGet();
        actualGetFromOneDataNode(datanode, start, end, bb, corruptedBlocks);
        return bb;
      }
    };
  }
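
Stripped of the HDFS specifics, the control flow of hedgedFetchBlockByteRange can be sketched with plain java.util.concurrent as below. This is a simplified model under stated assumptions, not the Hadoop implementation: replicas are represented as an ordered list of Callables instead of being chosen through chooseDataNode, there is no NameNode refetch and no metrics, and the names HedgedCallSketch, hedgedCall and replica are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the hedging pattern, not Hadoop code.
class HedgedCallSketch {
  /** Tries replicas in order; hedges with the next one after thresholdMillis. */
  static <T> T hedgedCall(List<Callable<T>> replicas, long thresholdMillis) {
    ExecutorService pool = Executors.newCachedThreadPool();
    CompletionService<T> service = new ExecutorCompletionService<>(pool);
    List<Future<T>> futures = new ArrayList<>();
    try {
      int next = 0;
      futures.add(service.submit(replicas.get(next++)));
      // Ordinary read: wait up to the threshold for the first request.
      Future<T> done = service.poll(thresholdMillis, TimeUnit.MILLISECONDS);
      while (true) {
        if (done != null) {
          futures.remove(done);
          try {
            return done.get();
          } catch (ExecutionException e) {
            // This replica failed; fall through and try another one.
          }
        }
        if (next < replicas.size()) {
          // Hedge: start one more read against the next replica.
          futures.add(service.submit(replicas.get(next++)));
        } else if (futures.isEmpty()) {
          throw new IllegalStateException("all replicas failed");
        }
        // Block until whichever outstanding read finishes first.
        done = service.take();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for a read", e);
    } finally {
      // Cancel the losers, as cancelAll(futures) does in the original.
      for (Future<T> f : futures) {
        f.cancel(true);
      }
      pool.shutdownNow();
    }
  }

  /** A fake replica that answers with its name after delayMillis. */
  static Callable<String> replica(String name, long delayMillis) {
    return () -> {
      Thread.sleep(delayMillis);
      return name;
    };
  }

  public static void main(String[] args) {
    List<Callable<String>> replicas =
        Arrays.asList(replica("slow", 2000), replica("fast", 0));
    System.out.println(hedgedCall(replicas, 50));
  }
}
```

In the usage shown in main, the first replica needs 2 s while the threshold is 50 ms, so the second replica is started and its answer is returned while the first read is cancelled. When the first replica answers within the threshold, no second request is ever submitted.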

Summary

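1. The hedged read implementation follows the typical asynchronous Future pattern: as soon as any task produces a result, the others are cancelled. A timed poll decides when to start the hedged read, the blocking take retrieves whichever result arrives first, and cancelAll cancels the remaining tasks. The design is worth studying.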
