Hadoop 3.1.3 Study Notes 2

This post walks through the striped block reconstruction process, i.e., how a damaged block is recovered under striped (erasure-coded) storage. The discussion involves related classes such as StripedReader and StripedWriter, which will be covered in later posts.

/**
 * StripedBlockReconstructor reconstruct one or more missed striped block in
 * the striped block group, the minimum number of live striped blocks should
 * be no less than data block number.
 */

The Javadoc above describes what StripedBlockReconstructor does: under an erasure coding policy, at least k live blocks (the data block number) are required to fully recover the data. StripedBlockReconstructor extends StripedReconstructor and implements its abstract reconstruct method, the most critical step of the whole process: the "reconstruction" itself. The parent class StripedReconstructor documents this process in detail:

/**
 * StripedReconstructor reconstruct one or more missed striped block in the
 * striped block group, the minimum number of live striped blocks should be
 * no less than data block number.
 *
 * | <- Striped Block Group -> |
 *  blk_0      blk_1       blk_2(*)   blk_3   ...   <- A striped block group
 *    |          |           |          |
 *    v          v           v          v
 * +------+   +------+   +------+   +------+
 * |cell_0|   |cell_1|   |cell_2|   |cell_3|  ...
 * +------+   +------+   +------+   +------+
 * |cell_4|   |cell_5|   |cell_6|   |cell_7|  ...
 * +------+   +------+   +------+   +------+
 * |cell_8|   |cell_9|   |cell10|   |cell11|  ...
 * +------+   +------+   +------+   +------+
 *  ...         ...       ...         ...
 *
 *
 * We use following steps to reconstruct striped block group, in each round, we
 * reconstruct <code>bufferSize</code> data until finish, the
 * <code>bufferSize</code> is configurable and may be less or larger than
 * cell size:
 * step1: read <code>bufferSize</code> data from minimum number of sources
 *        required by reconstruction.
 * step2: decode data for targets.
 * step3: transfer data to targets.
 *
 * In step1, try to read <code>bufferSize</code> data from minimum number
 * of sources , if there is corrupt or stale sources, read from new source
 * will be scheduled. The best sources are remembered for next round and
 * may be updated in each round.
 *
 * In step2, typically if source blocks we read are all data blocks, we
 * need to call encode, and if there is one parity block, we need to call
 * decode. Notice we only read once and reconstruct all missed striped block
 * if they are more than one.
 *
 * In step3, send the reconstructed data to targets by constructing packet
 * and send them directly. Same as continuous block replication, we
 * don't check the packet ack. Since the datanode doing the reconstruction work
 * are one of the source datanodes, so the reconstructed data are sent
 * remotely.
 *
 * There are some points we can do further improvements in next phase:
 * 1. we can read the block file directly on the local datanode,
 *    currently we use remote block reader. (Notice short-circuit is not
 *    a good choice, see inline comments).
 * 2. We need to check the packet ack for EC reconstruction? Since EC
 *    reconstruction is more expensive than continuous block replication,
 *    it needs to read from several other datanodes, should we make sure the
 *    reconstructed result received by targets?
 */

From this comment we can see that reconstruction takes three steps. Step 1 reads bufferSize bytes of data from the minimum number of sources required for reconstruction. Step 2 decodes (or encodes) that data for the targets. Step 3 transfers the resulting data to the target datanodes.

The comment also raises some key points about each step. In step 1, if a source turns out to be corrupt or stale, a read from a new source is scheduled; the "best sources" are remembered for the next round and may be updated each round. In step 2, "decode" is used loosely for the whole repair computation rather than in the strict sense: if only a parity block is lost and the blocks read are all data blocks, the encode method is called; if the lost blocks include a data block and the sources read include a parity block, the decode method is called. Note that the data is read only once even when more than one striped block is missing. Step 3 sends the result to the targets as "packets", in the networking sense of the word; the comment stresses that reconstruction runs on one of the source datanodes, so the reconstructed data is sent remotely. The author even lists two ideas for improvement in a later phase.
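To make the encode/decode distinction in step 2 concrete, here is a minimal toy sketch using a single XOR parity block as a stand-in for the real Reed-Solomon codec (the class and method names here are made up for illustration; they are not Hadoop APIs):

```java
import java.util.Arrays;

public class XorCodecDemo {
    // Toy "codec": parity = d0 ^ d1, a stand-in for Reed-Solomon with m = 1.
    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++) out[i] = (byte) (a[i] ^ b[i]);
        return out;
    }

    public static void main(String[] args) {
        byte[] d0 = {1, 2, 3};
        byte[] d1 = {4, 5, 6};

        // Parity block lost: all sources are data blocks, so we "encode".
        byte[] parity = xor(d0, d1);

        // Data block d1 lost: sources include a parity block, so we "decode".
        byte[] recovered = xor(d0, parity);
        System.out.println(Arrays.equals(recovered, d1)); // prints true
    }
}
```

With XOR the two operations happen to be the same function; with Reed-Solomon they are distinct matrix operations, but the dispatch logic (all-data sources vs. sources containing parity) is the same idea.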

Now let's return to StripedBlockReconstructor.

First, the run method:

public void run() {
  try {
    initDecoderIfNecessary();

    getStripedReader().init();

    stripedWriter.init();

    reconstruct();

    stripedWriter.endTargetBlocks();

    // Currently we don't check the acks for packets, this is similar as
    // block replication.
  } catch (Throwable e) {
    LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
    getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
    getDatanode().decrementXmitsInProgress(getXmits());
    final DataNodeMetrics metrics = getDatanode().getMetrics();
    metrics.incrECReconstructionTasks();
    metrics.incrECReconstructionBytesRead(getBytesRead());
    metrics.incrECReconstructionRemoteBytesRead(getRemoteBytesRead());
    metrics.incrECReconstructionBytesWritten(getBytesWritten());
    getStripedReader().close();
    stripedWriter.close();
    cleanup();
  }
}

The run method corresponds to the reconstruction process described in the comment above: it first initializes the decoder, then the striped reader and the striped writer, performs the reconstruction, and finally ends the target blocks. Let's start with the most critical piece, the reconstruct method:

void reconstruct() throws IOException {
  while (getPositionInBlock() < getMaxTargetLength()) {
    DataNodeFaultInjector.get().stripedBlockReconstruction();
    long remaining = getMaxTargetLength() - getPositionInBlock();
    final int toReconstructLen =
        (int) Math.min(getStripedReader().getBufferSize(), remaining);

    long start = Time.monotonicNow();
    // step1: read from minimum source DNs required for reconstruction.
    // The returned success list is the source DNs we do real read from
    getStripedReader().readMinimumSources(toReconstructLen);
    long readEnd = Time.monotonicNow();

    // step2: decode to reconstruct targets
    reconstructTargets(toReconstructLen);
    long decodeEnd = Time.monotonicNow();

    // step3: transfer data
    if (stripedWriter.transferData2Targets() == 0) {
      String error = "Transfer failed for all targets.";
      throw new IOException(error);
    }
    long writeEnd = Time.monotonicNow();

    // Only the succeed reconstructions are recorded.
    final DataNodeMetrics metrics = getDatanode().getMetrics();
    metrics.incrECReconstructionReadTime(readEnd - start);
    metrics.incrECReconstructionDecodingTime(decodeEnd - readEnd);
    metrics.incrECReconstructionWriteTime(writeEnd - decodeEnd);

    updatePositionInBlock(toReconstructLen);

    clearBuffers();
  }
}

Let's analyze this method in detail. Step 1 fetches data from the minimum number of sources required; start and readEnd just record timing, which we can ignore. readMinimumSources takes one parameter, toReconstructLen, the number of bytes to reconstruct in this round:

/**
 * Read from minimum source DNs required for reconstruction in the iteration.
 * First try the success list which we think they are the best DNs
 * If source DN is corrupt or slow, try to read some other source DN,
 * and will update the success list.
 *
 * Remember the updated success list and return it for following
 * operations and next iteration read.
 *
 * @param reconstructLength the length to reconstruct.
 * @return updated success list of source DNs we do real read
 * @throws IOException
 */
void readMinimumSources(int reconstructLength) throws IOException {
  CorruptedBlocks corruptedBlocks = new CorruptedBlocks();
  try {
    successList = doReadMinimumSources(reconstructLength, corruptedBlocks);
  } finally {
    // report corrupted blocks to NN
    datanode.reportCorruptedBlocks(corruptedBlocks);
  }
}

readMinimumSources in turn calls doReadMinimumSources to locate and read the source data. Setting aside corrupted blocks for now, let's focus on doReadMinimumSources:

int[] doReadMinimumSources(int reconstructLength,
                           CorruptedBlocks corruptedBlocks)
    throws IOException {
  Preconditions.checkArgument(reconstructLength >= 0 &&
      reconstructLength <= bufferSize);
  int nSuccess = 0;
  int[] newSuccess = new int[minRequiredSources];
  BitSet usedFlag = new BitSet(sources.length);
  /*
   * Read from minimum source DNs required, the success list contains
   * source DNs which we think best.
   */
  for (int i = 0; i < minRequiredSources; i++) {
    StripedBlockReader reader = readers.get(successList[i]);
    int toRead = getReadLength(liveIndices[successList[i]],
        reconstructLength);
    if (toRead > 0) {
      Callable<BlockReadStats> readCallable =
          reader.readFromBlock(toRead, corruptedBlocks);
      Future<BlockReadStats> f = readService.submit(readCallable);
      futures.put(f, successList[i]);
    } else {
      // If the read length is 0, we don't need to do real read
      reader.getReadBuffer().position(0);
      newSuccess[nSuccess++] = successList[i];
    }
    usedFlag.set(successList[i]);
  }

  while (!futures.isEmpty()) {
    try {
      StripingChunkReadResult result =
          StripedBlockUtil.getNextCompletedStripedRead(
              readService, futures, stripedReadTimeoutInMills);
      int resultIndex = -1;
      if (result.state == StripingChunkReadResult.SUCCESSFUL) {
        resultIndex = result.index;
      } else if (result.state == StripingChunkReadResult.FAILED) {
        // If read failed for some source DN, we should not use it anymore
        // and schedule read from another source DN.
        StripedBlockReader failedReader = readers.get(result.index);
        failedReader.closeBlockReader();
        resultIndex = scheduleNewRead(usedFlag,
            reconstructLength, corruptedBlocks);
      } else if (result.state == StripingChunkReadResult.TIMEOUT) {
        // If timeout, we also schedule a new read.
        resultIndex = scheduleNewRead(usedFlag,
            reconstructLength, corruptedBlocks);
      }
      if (resultIndex >= 0) {
        newSuccess[nSuccess++] = resultIndex;
        if (nSuccess >= minRequiredSources) {
          // cancel remaining reads if we read successfully from minimum
          // number of source DNs required by reconstruction.
          cancelReads(futures.keySet());
          futures.clear();
          break;
        }
      }
    } catch (InterruptedException e) {
      LOG.info("Read data interrupted.", e);
      cancelReads(futures.keySet());
      futures.clear();
      break;
    }
  }

  if (nSuccess < minRequiredSources) {
    String error = "Can't read data from minimum number of sources "
        + "required by reconstruction, block id: " +
        reconstructor.getBlockGroup().getBlockId();
    throw new IOException(error);
  }

  return newSuccess;
}

The logic here is fairly involved and took me quite a while to work out. First, following the read process described in the Javadoc, data is read from the datanodes we consider the best sources (the initial for loop over successList). Here toRead is the number of bytes to read from that source: if toRead > 0 a real read is needed, so a read task is submitted and tracked in futures; if toRead <= 0 no further read is needed and the source goes straight into newSuccess. This raises a question: what exactly is this "length to read", and how is it obtained? It is computed by getReadLength, so let's look at that method:

private int getReadLength(int index, int reconstructLength) {
  // the reading length should not exceed the length for reconstruction
  long blockLen = reconstructor.getBlockLen(index);
  long remaining = blockLen - reconstructor.getPositionInBlock();
  return (int) Math.min(remaining, reconstructLength);
}

This method is very simple. blockLen is the length of the block in bytes, and remaining is how much of it is left to read; getPositionInBlock returns the current read position, which starts at 0 and advances by reconstructLength after each round, i.e.

position += reconstructLength;

reconstructLength is at most bufferSize, the size of the read buffer; you can think of it as reading at most bufferSize bytes per round. The method returns the smaller of the remaining bytes and the reconstruct length, which is easy to understand: if more than one bufferSize of data remains, read a full bufferSize; otherwise, read whatever is left.
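The interplay between position, bufferSize, and the remaining block length can be sketched as follows (a self-contained illustration in the spirit of getReadLength, not Hadoop code; the blockLen and bufferSize values are made up):

```java
public class ReadLengthDemo {
    // Mirrors getReadLength: read at most reconstructLength bytes,
    // and never read past the end of the block.
    static int readLength(long blockLen, long position, int reconstructLength) {
        long remaining = blockLen - position;
        return (int) Math.min(remaining, reconstructLength);
    }

    public static void main(String[] args) {
        long blockLen = 10;   // hypothetical 10-byte block
        int bufferSize = 4;   // hypothetical buffer size
        long position = 0;
        while (position < blockLen) {
            int toRead = readLength(blockLen, position, bufferSize);
            System.out.println("round reads " + toRead + " bytes");
            position += toRead; // position += reconstructLength, as in the text
        }
    }
}
```

For a 10-byte block and a 4-byte buffer, the rounds read 4, 4, and finally 2 bytes.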

Back in doReadMinimumSources, the while loop over futures collects the results of the submitted reads; if a read fails or times out, a new read is scheduled from another source datanode.
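The usedFlag bookkeeping behind that failover can be sketched with a BitSet, in the spirit of scheduleNewRead (a simplified illustration only; the real method also constructs a new reader and resubmits the read task):

```java
import java.util.BitSet;

public class ScheduleNewReadDemo {
    // Pick the first source DN not yet tried, mark it used, and return its
    // index, or -1 when every source has already been tried.
    static int pickNewSource(BitSet usedFlag, int numSources) {
        int i = usedFlag.nextClearBit(0);
        if (i >= numSources) return -1;
        usedFlag.set(i);
        return i;
    }

    public static void main(String[] args) {
        int numSources = 4;
        BitSet used = new BitSet(numSources);
        used.set(0); used.set(1); used.set(2); // three sources already tried
        System.out.println(pickNewSource(used, numSources)); // prints 3
        System.out.println(pickNewSource(used, numSources)); // prints -1
    }
}
```

This is why the original loop sets usedFlag for every source it touches: once a source has failed or timed out, it must never be chosen again in this round.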

With that, step 1 of the reconstruct method is complete.

Step 2 of reconstruct is reconstructTargets, which ultimately calls doDecode; the actual decode method invoked depends on the erasure coding codec in use.

Step 3 sends the reconstructed data to the output streams, packaged as packets:

/**
 * buf is pointed into like follows:
 *  (C is checksum data, D is payload data)
 *
 * [_________CCCCCCCCC________________DDDDDDDDDDDDDDDD___]
 *           ^        ^               ^               ^
 *           |        checksumPos     dataStart       dataPos
 *           checksumStart
 *
 * Right before sending, we move the checksum data to immediately precede
 * the actual data, and then insert the header into the buffer immediately
 * preceding the checksum data, so we make sure to keep enough space in
 * front of the checksum data to support the largest conceivable header.
 */
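
The buffer shuffle this comment describes, moving the checksum so it sits immediately before the data just prior to sending, can be illustrated like this (a toy sketch; the offsets and the packBody helper are made up for the demo):

```java
import java.util.Arrays;

public class PacketBufferDemo {
    // Slide the checksum bytes so they immediately precede the data, then
    // return the contiguous [checksum | data] region to be sent.
    static byte[] packBody(byte[] buf, int checksumStart, int checksumLen,
                           int dataStart, int dataLen) {
        System.arraycopy(buf, checksumStart, buf, dataStart - checksumLen, checksumLen);
        return Arrays.copyOfRange(buf, dataStart - checksumLen, dataStart + dataLen);
    }

    public static void main(String[] args) {
        // Layout before sending: [ gap | CCC | gap | DDDDD ], offsets made up.
        byte[] buf = new byte[16];
        int checksumStart = 4, checksumLen = 3, dataStart = 10, dataLen = 5;
        for (int i = 0; i < checksumLen; i++) buf[checksumStart + i] = 'C';
        for (int i = 0; i < dataLen; i++) buf[dataStart + i] = 'D';

        byte[] body = packBody(buf, checksumStart, checksumLen, dataStart, dataLen);
        System.out.println(new String(body)); // prints CCCDDDDD
    }
}
```

The gap left in front of the checksum after the move is what the comment means by keeping "enough space ... to support the largest conceivable header": the header is written there right before the packet goes on the wire.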

Finally, reconstruct updates the position and clears the buffers for the next round, repeating until it reaches maxTargetLength, the size of the block being recovered.
