HDFS Hedged Read

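Hedged reads are an HDFS feature introduced in Hadoop 2.4.0. Normally, each read request is handled by a single spawned thread. With hedged reads enabled, the client waits for a configured amount of time; if the read has not returned by then, the client issues a second read request against a different replica of the same block. Whichever read returns first is used, and the other read request is discarded.
Hedged reads are intended for rare slow reads, which may be caused by transient faults such as a disk error or network jitter.
Two client-side parameters control hedged reads: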

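  • dfs.client.hedged.read.threadpool.size: default 0. The number of threads dedicated to serving hedged reads. If set to 0 (the default), hedged reads are disabled.
  • dfs.client.hedged.read.threshold.millis: default 500 (0.5 seconds). How long to wait before spawning the second read thread.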
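
As an illustration, the two parameters above could be set in the client-side hdfs-site.xml roughly as follows. The values 20 and 10 are assumptions chosen only for the example, not recommendations; suitable values depend on the workload.

```xml
<!-- Illustrative values only: enable hedged reads with a pool of 20 threads -->
<property>
  <name>dfs.client.hedged.read.threadpool.size</name>
  <value>20</value>
</property>
<!-- Illustrative value only: start a hedged read after waiting 10 ms -->
<property>
  <name>dfs.client.hedged.read.threshold.millis</name>
  <value>10</value>
</property>
```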
DFSClient
private volatile long hedgedReadThresholdMillis;  // hedged read threshold; volatile so that updates are visible to all threads
private static ThreadPoolExecutor HEDGED_READ_THREAD_POOL;  // static thread pool used for hedged reads

The parameters are read from the configuration as follows:

this.hedgedReadThresholdMillis = conf.getLong(
    DFSConfigKeys.DFS_DFSCLIENT_HEDGED_READ_THRESHOLD_MILLIS,
    DFSConfigKeys.DEFAULT_DFSCLIENT_HEDGED_READ_THRESHOLD_MILLIS); // default is 500
int numThreads = conf.getInt(
    DFSConfigKeys.DFS_DFSCLIENT_HEDGED_READ_THREADPOOL_SIZE,
    DFSConfigKeys.DEFAULT_DFSCLIENT_HEDGED_READ_THREADPOOL_SIZE); // default is 0
if (numThreads > 0) {
  this.initThreadsNumForHedgedReads(numThreads); // a pool size of 0 means hedged reads are off; the pool is initialized only when the size is greater than 0
}
/**
 * Create hedged reads thread pool, HEDGED_READ_THREAD_POOL, if
 * it does not already exist.
 * @param num Number of threads for hedged reads thread pool.
 * If zero, skip hedged reads thread pool creation.
 */
private synchronized void initThreadsNumForHedgedReads(int num) {
  if (num <= 0 || HEDGED_READ_THREAD_POOL != null) return;
  HEDGED_READ_THREAD_POOL = new ThreadPoolExecutor(1, num, 60,
      TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
      new Daemon.DaemonFactory() {
        private final AtomicInteger threadIndex =
          new AtomicInteger(0); 
        @Override
        public Thread newThread(Runnable r) {
          Thread t = super.newThread(r);
          t.setName("hedgedRead-" +
            threadIndex.getAndIncrement());
          return t;
        }
      },
      new ThreadPoolExecutor.CallerRunsPolicy() {

    @Override
    public void rejectedExecution(Runnable runnable,
        ThreadPoolExecutor e) {
      LOG.info("Execution rejected, Executing in current thread");
      HEDGED_READ_METRIC.incHedgedReadOpsInCurThread();
      // will run in the current thread
      super.rejectedExecution(runnable, e);
    }
  });
  HEDGED_READ_THREAD_POOL.allowCoreThreadTimeOut(true);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Using hedged reads; pool threads=" + num);
  }
}


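Looking at the initialization code: the thread pool is created only if it has not been instantiated yet.
The ThreadPoolExecutor parameters in general are not covered here; see other references. The arguments used to initialize HEDGED_READ_THREAD_POOL are: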

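  1. corePoolSize: the number of core threads in the pool, here 1;
  2. maximumPoolSize: the maximum number of threads the pool may create, here num, that is, the configured dfs.client.hedged.read.threadpool.size;
  3. keepAliveTime: how long idle threads beyond the core size stay alive, here 60 s;
  4. workQueue: the queue holding tasks that have not yet run, here new SynchronousQueue(), which has no capacity and hands each task directly to a thread;
  5. threadFactory: here a daemon thread factory that names the threads hedgedRead-N;
  6. Rejection policy: what happens to a task the pool cannot accept, which with a SynchronousQueue means all num threads are busy. Here it is a subclass of ThreadPoolExecutor.CallerRunsPolicy that logs, increments a metric, and runs the task in the calling thread;
  7. HEDGED_READ_THREAD_POOL.allowCoreThreadTimeOut(true): core threads are allowed to time out as well.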
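
To make the interaction between SynchronousQueue and CallerRunsPolicy concrete, here is a minimal standalone sketch. It does not use any Hadoop classes; the class CallerRunsDemo and its methods are hypothetical names invented for this illustration, and the pool is only shaped like HEDGED_READ_THREAD_POOL.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CallerRunsDemo {
  /** Counts tasks that fell back to the submitting thread. */
  static final AtomicInteger ranInCaller = new AtomicInteger();

  /** Builds a pool with the same shape as the one described above (illustrative only). */
  static ThreadPoolExecutor newPool(int maxThreads) {
    ThreadPoolExecutor pool = new ThreadPoolExecutor(1, maxThreads, 60,
        TimeUnit.SECONDS, new SynchronousQueue<Runnable>(),
        new ThreadPoolExecutor.CallerRunsPolicy() {
          @Override
          public void rejectedExecution(Runnable r, ThreadPoolExecutor e) {
            ranInCaller.incrementAndGet();
            super.rejectedExecution(r, e); // runs r in the submitting thread
          }
        });
    pool.allowCoreThreadTimeOut(true);
    return pool;
  }

  /** Occupies every pool thread, submits one more task, and returns how many tasks ran in the caller. */
  static int saturate(int maxThreads) {
    ranInCaller.set(0);
    ThreadPoolExecutor pool = newPool(maxThreads);
    CountDownLatch release = new CountDownLatch(1);
    try {
      for (int i = 0; i < maxThreads; i++) {
        pool.execute(() -> {
          try {
            release.await(); // keep this pool thread busy
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // All threads are busy and the queue has no capacity, so this task is rejected
      // and the policy runs it in the current thread.
      pool.execute(() -> { });
      return ranInCaller.get();
    } finally {
      release.countDown();
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    System.out.println("tasks run in caller thread: " + saturate(2));
  }
}
```

The point of the design is that a saturated pool never drops or queues a read: the extra task simply runs in the thread that submitted it.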

DFSClient also provides a few helper methods for hedged reads, for example:

long getHedgedReadTimeout() {
  return this.hedgedReadThresholdMillis;
}

@VisibleForTesting
void setHedgedReadTimeout(long timeoutMillis) {
  this.hedgedReadThresholdMillis = timeoutMillis;
}

ThreadPoolExecutor getHedgedReadsThreadPool() {
  return HEDGED_READ_THREAD_POOL;
}

boolean isHedgedReadsEnabled() {
  return (HEDGED_READ_THREAD_POOL != null) &&
    HEDGED_READ_THREAD_POOL.getMaximumPoolSize() > 0;
}
DFSInputStream

In the read method of the input stream DFSInputStream, dfsClient.isHedgedReadsEnabled() is used to check whether the hedged read feature is enabled. If it is, hedgedFetchBlockByteRange() is called to read the data, as follows:

if (dfsClient.isHedgedReadsEnabled()) {
  hedgedFetchBlockByteRange(blk, targetStart, targetStart + bytesToRead
      - 1, buffer, offset, corruptedBlockMap);
} else {
  fetchBlockByteRange(blk, targetStart, targetStart + bytesToRead - 1,
      buffer, offset, corruptedBlockMap);
}

When hedged reads are enabled, the hedgedFetchBlockByteRange path is taken. The method declares the following local variables:

ArrayList<Future<ByteBuffer>> futures = new ArrayList<Future<ByteBuffer>>(); //1
CompletionService<ByteBuffer> hedgedService = new ExecutorCompletionService<ByteBuffer>(dfsClient.getHedgedReadsThreadPool()); //2
ArrayList<DatanodeInfo> ignored = new ArrayList<DatanodeInfo>(); //3
ByteBuffer bb = null; //4
int len = (int) (end - start + 1); //4
int hedgedReadId = 0; //4
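  1. A list of Futures, one per submitted read.
  2. An ExecutorCompletionService backed by HEDGED_READ_THREAD_POOL, the thread pool initialized in dfsClient.
  3. ignored records the datanodes to exclude.
  4. The result buffer, the length of the byte range to read, and a counter used to number the hedged reads.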

These local variables support the hedged read. The main flow runs inside a while(true) loop, which has two parts:

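1. On the first pass, ignored and futures are both empty. A datanode is chosen with chosenNode = chooseDataNode(block, ignored); only this call may go back to the NameNode (an RPC) to refetch the block locations when no usable node is found.
chosenNode is used to build a Callable that returns a ByteBuffer, which is submitted to hedgedService. The returned Future, firstRequest, is added to the list with futures.add(firstRequest).
The code then calls poll with a timeout equal to the hedged read threshold. If a future comes back within that time and the read succeeded, the method returns. Otherwise the current node is added to ignored so that it is skipped next time, incHedgedReadOps is counted when the wait timed out, and the loop continues; this is where the hedged read starts.
2. After the first pass, later reads happen in further iterations of while(true). The real hedged read starts from the second pass: the first pass is just a normal read that returns immediately if the data arrives, and the loop moves on only after waiting for the threshold (500 ms by default). In this branch refetch is initialized to false and the datanode is chosen without an RPC to the NameNode. If a node is found, a read is submitted to it and its Future is added to futures; if no node is found, refetch is set to true so that the datanode locations of the block are fetched again.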
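
The two phases can be sketched without any HDFS classes as follows. This is a simplified illustration under stated assumptions, not the HDFS implementation: HedgedReadSketch, replica, and hedgedRead are hypothetical names, the replica reads are simulated with Thread.sleep, there are exactly two replicas, and failure handling and the ignored list are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class HedgedReadSketch {
  /** Hypothetical stand-in for reading from one replica: sleeps, then returns its name. */
  static Callable<String> replica(String name, long delayMillis) {
    return () -> {
      Thread.sleep(delayMillis);
      return name;
    };
  }

  /** Reads from primary; if it is slower than the threshold, also reads from backup. */
  static String hedgedRead(Callable<String> primary, Callable<String> backup,
      long thresholdMillis) {
    ExecutorService pool = Executors.newCachedThreadPool();
    CompletionService<String> service = new ExecutorCompletionService<>(pool);
    List<Future<String>> futures = new ArrayList<>();
    try {
      // Phase 1: a normal read, waited on for at most thresholdMillis.
      futures.add(service.submit(primary));
      Future<String> done = service.poll(thresholdMillis, TimeUnit.MILLISECONDS);
      if (done != null) {
        return done.get();
      }
      // Phase 2: the first read is still running, so start the hedged read
      // and block until whichever read finishes first.
      futures.add(service.submit(backup));
      return service.take().get();
    } catch (InterruptedException | ExecutionException e) {
      throw new IllegalStateException("read failed", e);
    } finally {
      // Cancel whatever is still running, like cancelAll(futures).
      for (Future<String> f : futures) {
        f.cancel(true);
      }
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    String winner = hedgedRead(replica("replica-1", 2000), replica("replica-2", 10), 50);
    System.out.println("served by " + winner);
  }
}
```

In this sketch a fast primary never triggers the second request, so the extra load of hedging is paid only on reads that exceed the threshold.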
The key code of the hedged read is shown below:

/**
 * Like {@link #fetchBlockByteRange} except we start up a second, parallel,
 * 'hedged' read if the first read is taking longer than configured amount of
 * time. We then wait on which ever read returns first.
 */
private void hedgedFetchBlockByteRange(LocatedBlock block, long start,
    long end, ByteBuffer buf, CorruptedBlocks corruptedBlocks)
    throws IOException {
  final DFSClient.Conf conf = dfsClient.getConf();
  ArrayList<Future<ByteBuffer>> futures = new ArrayList<>();
  CompletionService<ByteBuffer> hedgedService =
      new ExecutorCompletionService<>(dfsClient.getHedgedReadsThreadPool());
  ArrayList<DatanodeInfo> ignored = new ArrayList<>();
  ByteBuffer bb;
  int len = (int) (end - start + 1);
  int hedgedReadId = 0;
  while (true) {
    // see HDFS-6591, this metric is used to verify/catch unnecessary loops
    hedgedReadOpsLoopNumForTesting++;
    DNAddrPair chosenNode = null;
    // there is no request already executing.
    if (futures.isEmpty()) {
      // chooseDataNode is a commitment. If no node, we go to
      // the NN to reget block locations. Only go here on first read.
      chosenNode = chooseDataNode(block, ignored);
      // Latest block, if refreshed internally
      block = chosenNode.block;
      bb = ByteBuffer.allocate(len);
      Callable<ByteBuffer> getFromDataNodeCallable = getFromOneDataNode(
          chosenNode, block, start, end, bb,
          corruptedBlocks, hedgedReadId++);
      Future<ByteBuffer> firstRequest = hedgedService
          .submit(getFromDataNodeCallable);
      futures.add(firstRequest);
      Future<ByteBuffer> future = null;
      try {
        future = hedgedService.poll(
            dfsClient.getHedgedReadTimeout(), TimeUnit.MILLISECONDS);
        if (future != null) {
          ByteBuffer result = future.get();
          result.flip();
          buf.put(result);
          return;
        }
        DFSClient.LOG.debug("Waited {}ms to read from {}; spawning hedged "
            + "read", dfsClient.getHedgedReadTimeout(), chosenNode.info);
        dfsClient.getHedgedReadMetrics().incHedgedReadOps();
        // continue; no need to refresh block locations
      } catch (ExecutionException e) {
        futures.remove(future);
      } catch (InterruptedException e) {
        throw new InterruptedIOException(
            "Interrupted while waiting for reading task");
      }
      // Ignore this node on next go around.
      // If poll timeout and the request still ongoing, don't consider it
      // again. If read data failed, don't consider it either.
      ignored.add(chosenNode.info);
    } else {
      // We are starting up a 'hedged' read. We have a read already
      // ongoing. Call getBestNodeDNAddrPair instead of chooseDataNode.
      // If no nodes to do hedged reads against, pass.
      boolean refetch = false;
      try {
        chosenNode = chooseDataNode(block, ignored, false);
        if (chosenNode != null) {
          // Latest block, if refreshed internally
          block = chosenNode.block;
          bb = ByteBuffer.allocate(len);
          Callable<ByteBuffer> getFromDataNodeCallable =
              getFromOneDataNode(chosenNode, block, start, end, bb,
                  corruptedBlocks, hedgedReadId++);
          Future<ByteBuffer> oneMoreRequest =
              hedgedService.submit(getFromDataNodeCallable);
          futures.add(oneMoreRequest);
        } else {
          refetch = true;
        }
      } catch (IOException ioe) {
        DFSClient.LOG.debug("Failed getting node for hedged read: {}",
            ioe.getMessage());
      }
      // if not succeeded. Submit callables for each datanode in a loop, wait
      // for a fixed interval and get the result from the fastest one.
      try {
        ByteBuffer result = getFirstToComplete(hedgedService, futures);
        // cancel the rest.
        cancelAll(futures);
        dfsClient.getHedgedReadMetrics().incHedgedReadWins();
        result.flip();
        buf.put(result);
        return;
      } catch (InterruptedException ie) {
        // Ignore and retry
      }
      if (refetch) {
        refetchLocations(block, ignored);
      }
      // We got here if exception. Ignore this node on next go around IFF
      // we found a chosenNode to hedge read against.
      if (chosenNode != null && chosenNode.info != null) {
        ignored.add(chosenNode.info);
      }
    }
  }
}

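The getFirstToComplete method is shown below. Note: the implementation of CompletionService is not explained here (see the multithreading posts on this blog); its take() method blocks and returns the first Future that has completed.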

private ByteBuffer getFirstToComplete(
    CompletionService<ByteBuffer> hedgedService,
    ArrayList<Future<ByteBuffer>> futures) throws InterruptedException {
  if (futures.isEmpty()) {
    throw new InterruptedException("let's retry");
  }
  Future<ByteBuffer> future = null;
  try {
    future = hedgedService.take();
    ByteBuffer bb = future.get();
    futures.remove(future);
    return bb;
  } catch (ExecutionException | CancellationException e) {
    // already logged in the Callable
    futures.remove(future);
  }

  throw new InterruptedException("let's retry");
}
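
The retry behaviour of getFirstToComplete, where a failed future is dropped and the caller waits for the next one, can be illustrated with plain java.util.concurrent. This is a sketch under stated assumptions: FirstToCompleteSketch and its helper methods are hypothetical names, and the retry is written as an inner loop instead of the "let's retry" exception that the HDFS code throws back to its outer while(true) loop.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FirstToCompleteSketch {
  /** Hypothetical task that fails immediately, like a read hitting a bad disk. */
  static Callable<String> failing(String message) {
    return () -> {
      throw new IOException(message);
    };
  }

  /** Hypothetical task that succeeds after a delay. */
  static Callable<String> succeeding(String value, long delayMillis) {
    return () -> {
      Thread.sleep(delayMillis);
      return value;
    };
  }

  /** Returns the result of the first task that completes successfully. */
  static String firstSuccessful(List<Callable<String>> tasks) {
    ExecutorService pool = Executors.newCachedThreadPool();
    CompletionService<String> service = new ExecutorCompletionService<>(pool);
    List<Future<String>> futures = new ArrayList<>();
    try {
      for (Callable<String> task : tasks) {
        futures.add(service.submit(task));
      }
      while (!futures.isEmpty()) {
        Future<String> done = service.take(); // blocks until some task completes
        futures.remove(done);
        try {
          return done.get();
        } catch (ExecutionException e) {
          // This task failed; wait for the next one to complete.
        }
      }
      throw new IllegalStateException("all tasks failed");
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } finally {
      for (Future<String> f : futures) {
        f.cancel(true); // cancel anything still running
      }
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    String result = firstSuccessful(
        Arrays.asList(failing("disk error"), succeeding("data", 100)));
    System.out.println("result: " + result);
  }
}
```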

About the asynchronous task

  private Callable<ByteBuffer> getFromOneDataNode(final DNAddrPair datanode,
      final LocatedBlock block, final long start, final long end,
      final ByteBuffer bb,
      final CorruptedBlocks corruptedBlocks,
      final int hedgedReadId) {
    return new Callable<ByteBuffer>() {
      @Override
      public ByteBuffer call() throws Exception {
        DFSClientFaultInjector.get().sleepBeforeHedgedGet();
        actualGetFromOneDataNode(datanode, start, end, bb, corruptedBlocks);
        return bb;
      }
    };
  }


Summary

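1. The hedged read implementation follows the typical asynchronous Future pattern: as soon as any one task produces a result, the others are cancelled. A timed poll is used to trigger the hedged read, the blocking take method retrieves whichever result arrives first, and cancelAll cancels the remaining tasks. The design is worth studying.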
