Zero-copy data reads in the DFSInputStream class: read(ByteBufferPool bufferPool, int maxLength, EnumSet<ReadOption> opts)

Latest recommended article published on 2019-03-13 23:42:27

乘风如水

Category: hadoop

Article link: https://blog.csdn.net/weixin_39935887/article/details/87209055


Before reading this post, I recommend reading the previous post in this series first.

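Let's return to the read functions in DFSInputStream. Their logic is largely the same: each one creates a BlockReaderFactory object and calls its build function. I covered that code in earlier posts, so refer to those for the details. Here we look at the function

public synchronized ByteBuffer read(ByteBufferPool bufferPool,int maxLength, EnumSet<ReadOption> opts)

and specifically at its other part, zero-copy. The code is as follows: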

@Override
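  /* bufferPool supplies the buffer that holds the data when the read
     falls back to a copy, maxLength is the maximum number of bytes to read,
     and opts holds the read options (for example SKIP_CHECKSUMS, which
     skips checksum verification).
  */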
  public synchronized ByteBuffer read(ByteBufferPool bufferPool,
      int maxLength, EnumSet<ReadOption> opts) 
          throws IOException, UnsupportedOperationException {
    if (maxLength == 0) {
      return EMPTY_BUFFER;
    } else if (maxLength < 0) {
      throw new IllegalArgumentException("can't read a negative " +
          "number of bytes.");
    }
    // If blockReader is null or blockEnd is -1 (the current block is invalid or the block reader is not initialized yet)
    if ((blockReader == null) || (blockEnd == -1)) {
      // If there is no more data to read, return null
      if (pos >= getFileLength()) {
        return null;
      }
      /*
       * If we don't have a blockReader, or the one we have has no more bytes
       * left to read, we call seekToBlockSource to get a new blockReader and
       * recalculate blockEnd.  Note that we assume we're not at EOF here
       * (we check this above).
       */
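      /* Fetch the block that contains pos. The ! here is redundant: seekToBlockSource
         either throws an exception or returns true, so !seekToBlockSource(pos) is always false.
         If it did return false, or if blockReader is still null, an exception is thrown.
      */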
      if ((!seekToBlockSource(pos)) || (blockReader == null)) {
        throw new IOException("failed to allocate new BlockReader " +
            "at position " + pos);
      }
    }
    ByteBuffer buffer = null;
    // Check whether mmap-based zero-copy reads are enabled
    if (dfsClient.getConf().shortCircuitMmapEnabled) {
      buffer = tryReadZeroCopy(maxLength, opts);
    }
    if (buffer != null) {
      return buffer;
    }
    // If the zero-copy read did not succeed, fall back to an ordinary read
    buffer = ByteBufferUtil.fallbackRead(this, bufferPool, maxLength);
    if (buffer != null) {
      // Record the buffer, together with its pool, in extendedReadBuffers
      extendedReadBuffers.put(buffer, bufferPool);
    }
    return buffer;
  }
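
Both read paths put the buffer they return into extendedReadBuffers: the zero-copy path stores it with its ClientMmap, the fallback path with its ByteBufferPool. The sketch below is my own illustration of that bookkeeping, not Hadoop code. BufferTracker, Source, track and release are hypothetical names, java.util.IdentityHashMap is a stand-in for the map type Hadoop uses, and I am assuming the purpose of the map is to know later how each buffer should be handed back.

```java
import java.nio.ByteBuffer;
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical stand-in for the extendedReadBuffers bookkeeping; not Hadoop code.
class BufferTracker {
    // Where a buffer came from, which decides how it must be released.
    enum Source { MMAP, POOL }

    // Keys are compared by identity, not by contents.
    private final Map<ByteBuffer, Source> outstanding = new IdentityHashMap<>();

    // Remember a buffer that was handed to the caller.
    synchronized void track(ByteBuffer buffer, Source source) {
        outstanding.put(buffer, source);
    }

    // Forget the buffer and report its source; rejects unknown buffers.
    synchronized Source release(ByteBuffer buffer) {
        Source source = outstanding.remove(buffer);
        if (source == null) {
            throw new IllegalArgumentException("buffer was not handed out by this tracker");
        }
        return source;
    }

    synchronized int outstandingCount() {
        return outstanding.size();
    }
}
```

Identity keys matter here because ByteBuffer.equals compares the remaining contents, so two distinct buffers holding the same bytes would collide in an ordinary HashMap.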

Next, let's analyze the tryReadZeroCopy function. The code is as follows:

private synchronized ByteBuffer tryReadZeroCopy(int maxLength,
      EnumSet<ReadOption> opts) throws IOException {
    // Copy 'pos' and 'blockEnd' to local variables to make it easier for the
    // JVM to optimize this function.
    final long curPos = pos;
    final long curEnd = blockEnd;
    final long blockStartInFile = currentLocatedBlock.getStartOffset();
    final long blockPos = curPos - blockStartInFile;

    // Shorten this read if the end of the block is nearby.
    long length63;
    if ((curPos + maxLength) <= (curEnd + 1)) {
      length63 = maxLength;
    } else {
      length63 = 1 + curEnd - curPos;
      if (length63 <= 0) {
        if (DFSClient.LOG.isDebugEnabled()) {
          DFSClient.LOG.debug("Unable to perform a zero-copy read from offset " +
            curPos + " of " + src + "; " + length63 + " bytes left in block.  " +
            "blockPos=" + blockPos + "; curPos=" + curPos +
            "; curEnd=" + curEnd);
        }
        return null;
      }
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Reducing read length from " + maxLength +
            " to " + length63 + " to avoid going more than one byte " +
            "past the end of the block.  blockPos=" + blockPos +
            "; curPos=" + curPos + "; curEnd=" + curEnd);
      }
    }
    // Make sure that don't go beyond 31-bit offsets in the MappedByteBuffer.
    int length;
    if (blockPos + length63 <= Integer.MAX_VALUE) {
      length = (int)length63;
    } else {
      long length31 = Integer.MAX_VALUE - blockPos;
      if (length31 <= 0) {
        // Java ByteBuffers can't be longer than 2 GB, because they use
        // 4-byte signed integers to represent capacity, etc.
        // So we can't mmap the parts of the block higher than the 2 GB offset.
        // FIXME: we could work around this with multiple memory maps.
        // See HDFS-5101.
        if (DFSClient.LOG.isDebugEnabled()) {
          DFSClient.LOG.debug("Unable to perform a zero-copy read from offset " +
            curPos + " of " + src + "; 31-bit MappedByteBuffer limit " +
            "exceeded.  blockPos=" + blockPos + ", curEnd=" + curEnd);
        }
        return null;
      }
      length = (int)length31;
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Reducing read length from " + maxLength +
            " to " + length + " to avoid 31-bit limit.  " +
            "blockPos=" + blockPos + "; curPos=" + curPos +
            "; curEnd=" + curEnd);
      }
    }
    final ClientMmap clientMmap = blockReader.getClientMmap(opts);
    if (clientMmap == null) {
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("unable to perform a zero-copy read from offset " +
          curPos + " of " + src + "; BlockReader#getClientMmap returned " +
          "null.");
      }
      return null;
    }
    boolean success = false;
    ByteBuffer buffer;
    try {
      seek(curPos + length);
      // Take the data directly from the mapped memory
      buffer = clientMmap.getMappedByteBuffer().asReadOnlyBuffer();
      buffer.position((int)blockPos);
      buffer.limit((int)(blockPos + length));
      extendedReadBuffers.put(buffer, clientMmap);
      readStatistics.addZeroCopyBytes(length);
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("readZeroCopy read " + length + 
            " bytes from offset " + curPos + " via the zero-copy read " +
            "path.  blockEnd = " + blockEnd);
      }
      success = true;
    } finally {
      if (!success) {
        IOUtils.closeQuietly(clientMmap);
      }
    }
    return buffer;
  }
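
The length handling in tryReadZeroCopy comes down to two clamps: first the read is shortened so it stays inside the current block, then it is shortened again so the end offset inside the block still fits in a signed 32-bit int. As a minimal sketch of that arithmetic only (ZeroCopyLength and clamp are hypothetical names, and returning -1 stands in for the places where the real function returns null):

```java
// Hypothetical helper mirroring the length arithmetic in tryReadZeroCopy; not Hadoop code.
class ZeroCopyLength {
    /**
     * Returns how many bytes a zero-copy read may cover, or -1 when the
     * caller should fall back to the copying path.
     *
     * curPos     current offset in the file
     * blockStart file offset of the first byte of the current block
     * blockEnd   file offset of the last byte of the current block (inclusive)
     * maxLength  number of bytes the caller asked for
     */
    static int clamp(long curPos, long blockStart, long blockEnd, int maxLength) {
        long blockPos = curPos - blockStart;
        // Clamp 1: never read past the end of the current block.
        long length63 = Math.min((long) maxLength, blockEnd + 1 - curPos);
        if (length63 <= 0) {
            return -1;
        }
        // Clamp 2: ByteBuffer positions are ints, so the end offset inside
        // the block must not exceed Integer.MAX_VALUE.
        if (blockPos + length63 <= Integer.MAX_VALUE) {
            return (int) length63;
        }
        long length31 = Integer.MAX_VALUE - blockPos;
        return length31 <= 0 ? -1 : (int) length31;
    }
}
```

For example, with a 100-byte block starting at file offset 0, asking for 10 bytes at position 95 yields 5, because only bytes 95 to 99 remain in the block.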

The code of the getClientMmap function is as follows:

/**
   * Get or create a memory map for this replica.
   * 
   * There are two kinds of ClientMmap objects we could fetch here: one that 
   * will always read pre-checksummed data, and one that may read data that
   * hasn't been checksummed.
   *
   * If we fetch the former, "safe" kind of ClientMmap, we have to increment
   * the anchor count on the shared memory slot.  This will tell the DataNode
   * not to munlock the block until this ClientMmap is closed.
   * If we fetch the latter, we don't bother with anchoring.
   *
   * @param opts     The options to use, such as SKIP_CHECKSUMS.
   * 
   * @return         null on failure; the ClientMmap otherwise.
   */
  @Override
  public ClientMmap getClientMmap(EnumSet<ReadOption> opts) {
    // Whether checksum verification is required
    boolean anchor = verifyChecksum &&
        (opts.contains(ReadOption.SKIP_CHECKSUMS) == false);
    if (anchor) {
      // If a no-checksum context cannot be created, the mmap cannot be used
      if (!createNoChecksumContext()) {
        if (LOG.isTraceEnabled()) {
          LOG.trace("can't get an mmap for " + block + " of " + filename + 
              " since SKIP_CHECKSUMS was not given, " +
              "we aren't skipping checksums, and the block is not mlocked.");
        }
        return null;
      }
    }
    ClientMmap clientMmap = null;
    try {
      // Get a memory-map object
      clientMmap = replica.getOrCreateClientMmap(anchor);
    } finally {
      if ((clientMmap == null) && anchor) {
        // If no memory-map object was obtained and checksum verification is required, release the anchor taken earlier
        releaseNoChecksumContext();
      }
    }
    return clientMmap;
  }

The code of the createNoChecksumContext function is as follows:

  // Returns false when a no-checksum anchor could not be added to the replica
  private boolean createNoChecksumContext() {
    if (verifyChecksum) {
      // If the storage type is known and transient, checksums are never verified for it
      if (storageType != null && storageType.isTransient()) {
        // Checksums are not stored for replicas on transient storage.  We do not
        // anchor, because we do not intend for client activity to block eviction
        // from transient storage on the DataNode side.
        return true;
      } else {
        // If the storage type is persistent, add a no-checksum anchor to this block's replica on the DataNode
        return replica.addNoChecksumAnchor();
      }
    } else {
      return true;
    }
  }
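
Taken together, getClientMmap and createNoChecksumContext answer two questions: does this read need an anchor, and if so, can the read still go through mmap? The sketch below restates that decision as two pure functions so the cases are easy to enumerate. It is an illustration under my own assumptions: MmapDecision, Option, needsAnchor and canMmap are hypothetical names, anchorAvailable models the result of replica.addNoChecksumAnchor(), and the sketch assumes that creating the mapping itself succeeds.

```java
import java.util.EnumSet;

// Hypothetical model of the checksum/anchor decision; names are illustrative, not Hadoop's.
class MmapDecision {
    enum Option { SKIP_CHECKSUMS }

    // An anchor is requested only when checksums are being verified and
    // the caller did not opt out with SKIP_CHECKSUMS.
    static boolean needsAnchor(boolean verifyChecksum, EnumSet<Option> opts) {
        return verifyChecksum && !opts.contains(Option.SKIP_CHECKSUMS);
    }

    /**
     * verifyChecksum  whether the reader verifies checksums
     * transientStore  whether the replica lives on transient storage
     * anchorAvailable whether adding a no-checksum anchor would succeed
     */
    static boolean canMmap(boolean verifyChecksum, EnumSet<Option> opts,
                           boolean transientStore, boolean anchorAvailable) {
        if (!needsAnchor(verifyChecksum, opts)) {
            return true;   // caller accepts data that was not checksummed
        }
        if (transientStore) {
            return true;   // no checksums are stored there, and no anchor is taken
        }
        return anchorAvailable;   // persistent storage: the anchor must succeed
    }
}
```

The only combination that rules out mmap is a verifying reader, no SKIP_CHECKSUMS, persistent storage, and a failed anchor.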

The getOrCreateClientMmap function eventually calls the getOrCreateClientMmap function of the ShortCircuitCache class. The code of that function is as follows:

ClientMmap getOrCreateClientMmap(ShortCircuitReplica replica,
      boolean anchored) {
    Condition newCond;
    lock.lock();
    try {
      while (replica.mmapData != null) {
        // If a mapping already exists, create the ClientMmap object directly
        if (replica.mmapData instanceof MappedByteBuffer) {
          // Add a reference to the ShortCircuitReplica object
          ref(replica);
          MappedByteBuffer mmap = (MappedByteBuffer)replica.mmapData;
          return new ClientMmap(replica, mmap, anchored);
        } else if (replica.mmapData instanceof Long) {
          long lastAttemptTimeMs = (Long)replica.mmapData;
          long delta = Time.monotonicNow() - lastAttemptTimeMs;
          if (delta < mmapRetryTimeoutMs) {
            if (LOG.isTraceEnabled()) {
              LOG.trace(this + ": can't create client mmap for " +
                  replica + " because we failed to " +
                  "create one just " + delta + "ms ago.");
            }
            return null;
          }
          if (LOG.isTraceEnabled()) {
            LOG.trace(this + ": retrying client mmap for " + replica +
                ", " + delta + " ms after the previous failure.");
          }
        } else if (replica.mmapData instanceof Condition) {
          Condition cond = (Condition)replica.mmapData;
          cond.awaitUninterruptibly();
        } else {
          Preconditions.checkState(false, "invalid mmapData type " +
              replica.mmapData.getClass().getName());
        }
      }
      newCond = lock.newCondition();
      replica.mmapData = newCond;
    } finally {
      lock.unlock();
    }
    MappedByteBuffer map = replica.loadMmapInternal();
    lock.lock();
    try {
      if (map == null) {
        replica.mmapData = Long.valueOf(Time.monotonicNow());
        newCond.signalAll();
        return null;
      } else {
        outstandingMmapCount++;
        replica.mmapData = map;
        // Add a reference to the ShortCircuitReplica object
        ref(replica);
        newCond.signalAll();
        return new ClientMmap(replica, map, anchored);
      }
    } finally {
      lock.unlock();
    }
  }
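
The interesting part of this function is that replica.mmapData acts as a small state machine: null means no attempt yet, a Condition means another thread is loading the mapping, a Long is the time of the last failed attempt, and a MappedByteBuffer is the finished mapping. The slow load runs outside the lock so that other threads only wait on the Condition. Here is a minimal sketch of the same pattern with plain JDK classes. SingleFlightSlot and getOrLoad are hypothetical names, the loader Supplier stands in for loadMmapInternal, and reference counting is left out.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical single-slot loader modelled on the mmapData state machine; not Hadoop code.
// T must not be Long or Condition, because those types encode the other states.
class SingleFlightSlot<T> {
    private final ReentrantLock lock = new ReentrantLock();
    private final long retryTimeoutMs;
    // null = never tried, Condition = load in progress,
    // Long = time of the last failure, anything else = the loaded value.
    private Object state;

    SingleFlightSlot(long retryTimeoutMs) {
        this.retryTimeoutMs = retryTimeoutMs;
    }

    private static long nowMs() {
        return System.nanoTime() / 1_000_000;   // monotonic clock
    }

    @SuppressWarnings("unchecked")
    T getOrLoad(Supplier<T> loader) {
        Condition mine;
        lock.lock();
        try {
            while (state != null) {
                if (state instanceof Condition) {
                    // Another thread is loading; wait, then re-check the state.
                    ((Condition) state).awaitUninterruptibly();
                } else if (state instanceof Long) {
                    long lastFailureMs = (Long) state;
                    if (nowMs() - lastFailureMs < retryTimeoutMs) {
                        return null;            // failed too recently, do not retry yet
                    }
                    break;                      // timeout elapsed, try again
                } else {
                    return (T) state;           // already loaded
                }
            }
            mine = lock.newCondition();
            state = mine;                       // announce that a load is in progress
        } finally {
            lock.unlock();
        }
        T value = loader.get();                 // slow work runs outside the lock
        lock.lock();
        try {
            state = (value != null) ? value : (Object) Long.valueOf(nowMs());
            mine.signalAll();                   // wake every thread that was waiting
            return value;
        } finally {
            lock.unlock();
        }
    }
}
```

One deliberate difference: the sketch leaves the loop with an explicit break once the retry timeout has elapsed, whereas the retry branch in the listing above has no such exit.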

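To summarize the zero-copy process:

1. Locate the block that holds the data to read, and obtain the memory mapping of that block's file.

2. Read the requested data directly from that mapped memory.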
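
The two steps above can be tried out on an ordinary local file with nothing but the JDK. The following is a minimal sketch, not the HDFS code path: MmapSliceDemo, readView and demo are hypothetical names, and a temporary file plays the role of the block file. It maps the file read-only and then narrows a read-only view with position and limit, the same way tryReadZeroCopy does with the buffer it gets from ClientMmap.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical local-file illustration of an mmap-backed read; not Hadoop code.
class MmapSliceDemo {
    // Maps the whole file read-only and returns a view of [offset, offset + length).
    static ByteBuffer readView(Path file, int offset, int length) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            // The view shares the mapping; position and limit only select
            // the range, and no bytes are copied into the Java heap.
            ByteBuffer view = mapped.asReadOnlyBuffer();
            view.position(offset);
            view.limit(offset + length);
            return view;   // the mapping stays valid after the channel is closed
        }
    }

    // Writes a small file, reads 4 bytes at offset 6 through the mapping.
    static String demo() {
        try {
            Path tmp = Files.createTempFile("block", ".bin");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, "hello zero-copy".getBytes(StandardCharsets.US_ASCII));
            ByteBuffer view = readView(tmp, 6, 4);
            byte[] out = new byte[view.remaining()];
            view.get(out);   // this copy exists only to print the bytes
            return new String(out, StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());   // prints "zero"
    }
}
```

The copy into a byte array happens only so the result can be printed; a real consumer would read from the view directly.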
