HDFS Book Reading Notes: 5.2 File Read Operations and Input Streams (5.2.2), Part 1

5.2.2 Read Operations: The DFSInputStream Implementation

   HDFS currently implements read operations at three levels: network reads, short-circuit reads, and zero-copy reads. Read efficiency increases in that order.

Network read:

A network read is the most basic kind of HDFS read: the DFSClient and the Datanode transfer data over a Socket connection.

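Short-circuit read:

When the DFSClient runs on the same physical node as the Datanode that stores the target block, the DFSClient can open the block replica file directly and read from it, without the Datanode process forwarding the data. This is covered later.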

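Zero-copy read:

When the DFSClient runs on the same physical node as the Datanode that caches the target block, the DFSClient can read the block via zero copy, which greatly improves efficiency. Even if the Datanode evicts the block from its cache during the read, the read can degrade to a local short-circuit read.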
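The relationship between the three levels can be summarized as a small decision function. This is only a sketch of the conditions described above, not HDFS code: ReadTier, ReadTierChooser, and the two boolean parameters are hypothetical names introduced for illustration, and the real client also depends on configuration that is ignored here.

```java
/** Hypothetical model of the three read levels; not an HDFS class. */
enum ReadTier { NETWORK, SHORT_CIRCUIT, ZERO_COPY }

final class ReadTierChooser {
    private ReadTierChooser() {}

    /**
     * Picks the fastest level that the situation allows.
     *
     * @param clientIsLocal true if the client runs on the same host as the
     *                      Datanode holding the replica (assumption of this sketch)
     * @param blockIsCached true if that Datanode currently caches the block
     */
    static ReadTier choose(boolean clientIsLocal, boolean blockIsCached) {
        if (clientIsLocal && blockIsCached) {
            return ReadTier.ZERO_COPY;
        }
        if (clientIsLocal) {
            // Also the degraded case: the block was evicted from the cache.
            return ReadTier.SHORT_CIRCUIT;
        }
        return ReadTier.NETWORK;
    }
}
```

Calling choose(true, false) models the degradation mentioned above: the client is still local, but the block is no longer cached, so the read falls back to a short-circuit read.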

 

HdfsDataInputStream.read() implements the three levels of reads described above. The code, which lives in its parent class FSDataInputStream, is as follows:

@Override
  public ByteBuffer read(ByteBufferPool bufferPool, int maxLength,
      EnumSet<ReadOption> opts) 
          throws IOException, UnsupportedOperationException {
    try {
      return ((HasEnhancedByteBufferAccess)in).read(bufferPool,
          maxLength, opts);
    }
    catch (ClassCastException e) {
      ByteBuffer buffer = ByteBufferUtil.
          fallbackRead(this, bufferPool, maxLength);
      if (buffer != null) {
        extendedReadBuffers.put(buffer, bufferPool);
      }
      return buffer;
    }
  }

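HdfsDataInputStream.read() first calls HasEnhancedByteBufferAccess.read() to attempt a zero-copy read. If the zero-copy path is not available (in the code above, when the wrapped stream cannot be cast to HasEnhancedByteBufferAccess and a ClassCastException is thrown), it calls the static method ByteBufferUtil.fallbackRead() to degrade to a short-circuit read or a network read. The call flow of HdfsDataInputStream.read() is shown in the figure below: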

Figure: call flow of HdfsDataInputStream.read()
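The "try the enhanced interface, fall back on ClassCastException" pattern from the listing above can be reproduced with plain JDK classes. The sketch below is not Hadoop code: EnhancedRead, EnhancedStream, and WrapperStream are hypothetical stand-ins for HasEnhancedByteBufferAccess, DFSInputStream, and FSDataInputStream, and the "enhanced" path simply exposes a read-only view of an in-memory array instead of performing a real zero-copy read.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Hypothetical stand-in for HasEnhancedByteBufferAccess. */
interface EnhancedRead {
    ByteBuffer readEnhanced(int maxLength);
}

/** A stream that supports the enhanced path (plays the role of DFSInputStream). */
final class EnhancedStream extends ByteArrayInputStream implements EnhancedRead {
    EnhancedStream(byte[] data) {
        super(data);
    }

    @Override
    public ByteBuffer readEnhanced(int maxLength) {
        int n = Math.min(maxLength, available());
        // Expose a read-only view of the backing array instead of copying it.
        ByteBuffer view = ByteBuffer.wrap(buf, pos, n).asReadOnlyBuffer();
        pos += n;
        return view;
    }
}

/** Plays the role of the wrapper class that holds the inner stream in a field. */
final class WrapperStream {
    private final InputStream in;

    WrapperStream(InputStream in) {
        this.in = in;
    }

    ByteBuffer read(int maxLength) {
        try {
            // Succeeds only if the wrapped stream implements the interface.
            return ((EnhancedRead) in).readEnhanced(maxLength);
        } catch (ClassCastException e) {
            return fallbackRead(maxLength);
        }
    }

    /** Ordinary copying read, used when the enhanced path is unavailable. */
    private ByteBuffer fallbackRead(int maxLength) {
        byte[] copy = new byte[maxLength];
        try {
            int n = in.read(copy, 0, maxLength);
            return n < 0 ? null : ByteBuffer.wrap(copy, 0, n);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Wrapping an EnhancedStream takes the first branch, while wrapping a plain ByteArrayInputStream triggers the ClassCastException and takes the copying fallback. The caller sees a ByteBuffer either way.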

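HdfsDataInputStream provides both HasEnhancedByteBufferAccess.read() and InputStream.read(), and both are carried out by calling the corresponding methods of the underlying wrapped class DFSInputStream. HasEnhancedByteBufferAccess.read() defines the zero-copy read, while InputStream.read() defines the short-circuit read and the network read. The following sections cover DFSInputStream's implementations of InputStream.read() and HasEnhancedByteBufferAccess.read() in turn.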

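One point needs explaining first: why do we say that both HasEnhancedByteBufferAccess.read() and InputStream.read() are carried out by calling the corresponding methods of the underlying wrapped class DFSInputStream?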

Let's look at doCall(), the method that creates the FSDataInputStream:
public FSDataInputStream doCall(final Path p) throws IOException, UnresolvedLinkException {

        // open() returns a DFSInputStream object
        final DFSInputStream dfsis = dfs.open(getPathName(p), bufferSize, verifyChecksum);

        // Wrap the DFSInputStream in an HdfsDataInputStream (createWrappedInputStream is listed below)
        return dfs.createWrappedInputStream(dfsis);
}

The code of createWrappedInputStream() is as follows:

/**
   * Wraps the stream in a CryptoInputStream if the underlying file is
   * encrypted.
   */
  public HdfsDataInputStream createWrappedInputStream(DFSInputStream dfsis)
      throws IOException {
    final FileEncryptionInfo feInfo = dfsis.getFileEncryptionInfo();
    if (feInfo != null) {
      // File is encrypted, wrap the stream in a crypto stream.
      // Currently only one version, so no special logic based on the version #
      getCryptoProtocolVersion(feInfo);
      final CryptoCodec codec = getCryptoCodec(conf, feInfo);
      final KeyVersion decrypted = decryptEncryptedDataEncryptionKey(feInfo);
      final CryptoInputStream cryptoIn =
          new CryptoInputStream(dfsis, codec, decrypted.getMaterial(),
              feInfo.getIV());
      return new HdfsDataInputStream(cryptoIn);
    } else {
      // No FileEncryptionInfo so no encryption.
      return new HdfsDataInputStream(dfsis);
    }
  }

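As you can see, for an unencrypted file the DFSInputStream object is passed to the HdfsDataInputStream constructor and is ultimately assigned to the member variable in, which FSDataInputStream inherits from java.io.FilterInputStream and which is declared as follows: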

protected volatile InputStream in;

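A call to read goes to HdfsDataInputStream. Since that class does not implement this read method itself, the read method of its parent class FSDataInputStream is invoked. That method first attempts a zero-copy read with the following code: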

return ((HasEnhancedByteBufferAccess)in).read(bufferPool, maxLength, opts);

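Here the actual type of in is DFSInputStream. Because that class implements the HasEnhancedByteBufferAccess interface, the cast succeeds, and the read that gets called is the one in DFSInputStream. If the cast fails with a ClassCastException, the following code is executed instead: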

ByteBuffer buffer = ByteBufferUtil.fallbackRead(this, bufferPool, maxLength);

The code of fallbackRead is as follows:

/**
   * Perform a fallback read.
   */
  public static ByteBuffer fallbackRead(
      InputStream stream, ByteBufferPool bufferPool, int maxLength)
          throws IOException {
    if (bufferPool == null) {
      throw new UnsupportedOperationException("zero-copy reads " +
          "were not available, and you did not provide a fallback " +
          "ByteBufferPool.");
    }
    boolean useDirect = streamHasByteBufferRead(stream);
    ByteBuffer buffer = bufferPool.getBuffer(useDirect, maxLength);
    if (buffer == null) {
      throw new UnsupportedOperationException("zero-copy reads " +
          "were not available, and the ByteBufferPool did not provide " +
          "us with " + (useDirect ? "a direct" : "an indirect") +
          "buffer.");
    }
    Preconditions.checkState(buffer.capacity() > 0);
    Preconditions.checkState(buffer.isDirect() == useDirect);
    maxLength = Math.min(maxLength, buffer.capacity());
    boolean success = false;
    try {
      if (useDirect) {
        buffer.clear();
        buffer.limit(maxLength);
        ByteBufferReadable readable = (ByteBufferReadable)stream;
        int totalRead = 0;
        while (true) {
          if (totalRead >= maxLength) {
            success = true;
            break;
          }
          int nRead = readable.read(buffer);
          if (nRead < 0) {
            if (totalRead > 0) {
              success = true;
            }
            break;
          }
          totalRead += nRead;
        }
        buffer.flip();
      } else {
        buffer.clear();
        int nRead = stream.read(buffer.array(),
            buffer.arrayOffset(), maxLength);
        if (nRead >= 0) {
          buffer.limit(nRead);
          success = true;
        }
      }
    } finally {
      if (!success) {
        // If we got an error while reading, or if we are at EOF, we 
        // don't need the buffer any more.  We can give it back to the
        // bufferPool.
        bufferPool.putBuffer(buffer);
        buffer = null;
      }
    }
    return buffer;
  }

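The code above calls two kinds of read methods on FSDataInputStream. Both ultimately call the read method of FSDataInputStream's member variable in, whose actual type is DFSInputStream, so the read calls always end up in DFSInputStream.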
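The direct-buffer branch of fallbackRead() loops because a single read(ByteBuffer) call may return fewer bytes than requested. A minimal sketch of that loop using only JDK classes is shown here. It is a simplification under stated assumptions: FallbackFill and fill() are hypothetical names, a java.nio ReadableByteChannel stands in for Hadoop's ByteBufferReadable, and the buffer-pool handling of the original is left out.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;

final class FallbackFill {
    private FallbackFill() {}

    /**
     * Reads up to maxLength bytes from source into buffer, looping until the
     * buffer limit is reached or the source reports end of stream.
     *
     * @return the flipped buffer ready for reading, or null if the source
     *         was already at end of stream and nothing could be read
     */
    static ByteBuffer fill(ReadableByteChannel source, ByteBuffer buffer, int maxLength) {
        maxLength = Math.min(maxLength, buffer.capacity());
        buffer.clear();
        buffer.limit(maxLength);
        int totalRead = 0;
        boolean eof = false;
        try {
            while (totalRead < maxLength) {
                int nRead = source.read(buffer);
                if (nRead < 0) {
                    eof = true;
                    break;
                }
                totalRead += nRead;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (eof && totalRead == 0) {
            // Mirrors the original: at EOF with no data, no buffer is returned.
            return null;
        }
        buffer.flip();
        return buffer;
    }
}
```

As in the original, hitting end of stream after some bytes were read still counts as success. Only an immediate end of stream yields null.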

InputStream.read()

The flow of InputStream.read() is shown below:

Figure: call flow of InputStream.read()

The code of read is as follows:

/**
   * Read the entire buffer.
   */
  @Override
  public synchronized int read(final byte buf[], int off, int len) throws IOException {
    // Use a byte array as the destination container
    ReaderStrategy byteArrayReader = new ByteArrayStrategy(buf);

    return readWithStrategy(byteArrayReader, off, len);
  }

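read() reads up to len bytes from the input stream, starting at the stream's current position, and stores them in the buf[] array starting at index off. Here off, len, and buf[] are all input parameters of read(). read() first constructs a ByteArrayStrategy object, indicating that this read uses a byte array as its container, and then calls readWithStrategy() to read the data. The code of the ByteArrayStrategy class is as follows: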

/**
   * Used to read bytes into a byte[]
   */
  private static class ByteArrayStrategy implements ReaderStrategy {
    final byte[] buf;

    public ByteArrayStrategy(byte[] buf) {
      this.buf = buf;
    }

    @Override
    public int doRead(BlockReader blockReader, int off, int len,
            ReadStatistics readStatistics) throws ChecksumException, IOException {
        int nRead = blockReader.read(buf, off, len);
        updateReadStatistics(readStatistics, nRead, blockReader);
        return nRead;
    }
  }
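The point of the strategy object is that the read loop does not need to know where the bytes end up. The sketch below illustrates this with JDK classes only. TargetStrategy, ArrayTarget, and BufferTarget are hypothetical names (they are not the Hadoop classes), and a plain InputStream stands in for BlockReader. It also shows that off is an index into the destination array, not a position in the stream.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Hypothetical counterpart of ReaderStrategy: decides where the bytes go. */
interface TargetStrategy {
    int doRead(InputStream source, int off, int len);
}

/** Stores the bytes in a byte[] starting at index off. */
final class ArrayTarget implements TargetStrategy {
    private final byte[] buf;

    ArrayTarget(byte[] buf) {
        this.buf = buf;
    }

    @Override
    public int doRead(InputStream source, int off, int len) {
        try {
            return source.read(buf, off, len);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

/** Stores the bytes in a ByteBuffer at its current position. */
final class BufferTarget implements TargetStrategy {
    private final ByteBuffer buf;

    BufferTarget(ByteBuffer buf) {
        this.buf = buf;
    }

    @Override
    public int doRead(InputStream source, int off, int len) {
        // off is ignored here: a ByteBuffer tracks its own write position.
        byte[] tmp = new byte[Math.min(len, buf.remaining())];
        try {
            int nRead = source.read(tmp, 0, tmp.length);
            if (nRead > 0) {
                buf.put(tmp, 0, nRead);
            }
            return nRead;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller that holds only a TargetStrategy reference can serve both destinations with the same retry and bookkeeping code, which is the role readWithStrategy() plays in DFSInputStream.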

Now let's look at the implementation of readWithStrategy(). The code is as follows:

private int readWithStrategy(ReaderStrategy strategy, int off, int len) throws IOException {
    dfsClient.checkOpen();
    if (closed) {
      throw new IOException("Stream closed");
    }
    Map<ExtendedBlock,Set<DatanodeInfo>> corruptedBlockMap 
      = new HashMap<ExtendedBlock, Set<DatanodeInfo>>();
    failures = 0;
    if (pos < getFileLength()) { // the read position is within the file
      int retries = 2; // at most two attempts in total (the first try plus one retry)
      while (retries > 0) {
        try {
          // currentNode can be left as null if previous read had a checksum
          // error on the same block. See HDFS-3067
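          // pos is past the current block boundary (or there is no current node), so reading must start on a new block
          if (pos > blockEnd || currentNode == null) {
            // call blockSeekTo() to obtain a Datanode that stores this block
            currentNode = blockSeekTo(pos);
          }
          // compute the length of this read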
          int realLen = (int) Math.min(len, (blockEnd - pos + 1L));
          if (locatedBlocks.isLastBlockComplete()) {
            realLen = (int) Math.min(realLen, locatedBlocks.getFileLength());
          }

          // call readBuffer() to read the data
          int result = readBuffer(strategy, off, realLen, corruptedBlockMap);
          
          if (result >= 0) {
            pos += result; // advance pos
          } else {
            // got a EOS from reader though we expect more data on it.
            throw new IOException("Unexpected EOS from the reader");
          }
          if (dfsClient.stats != null) {
            dfsClient.stats.incrementBytesRead(result);
          }
          return result;
        } catch (ChecksumException ce) {
          throw ce;          // checksum error: rethrow the exception
        } catch (IOException e) {
          if (retries == 1) {
            DFSClient.LOG.warn("DFS Read", e);
          }
          blockEnd = -1;
          if (currentNode != null) { addToDeadNodes(currentNode); } // blacklist the node that just failed
          if (--retries == 0) { // both attempts failed, so rethrow the exception
            throw e;
          }
        } finally {
          // Check if need to report block replicas corruption either read
          // was successful or ChecksumException occured.
          // check whether corrupt block replicas need to be reported to the Namenode
          reportCheckSumFailure(corruptedBlockMap, 
              currentLocatedBlock.getLocations().length);
        }
      }
    }
    return -1;
  }

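readWithStrategy() first calls blockSeekTo() to obtain a Datanode that stores the target block, and then calls readBuffer() to read the block from that Datanode. If an IO exception occurs during the read, it adds that Datanode to the blacklist and tries again, making at most two attempts in total.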

As you can see, readWithStrategy() calls blockSeekTo() and readBuffer(). These two methods are explained next.
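The retry skeleton of readWithStrategy() (two attempts, blacklisting the node that failed) can be modeled without any HDFS classes. The following is a minimal sketch under stated assumptions: RetryingRead, NodeReader, and the use of strings as node names are hypothetical, and the final failure is rethrown as an UncheckedIOException only to keep the example short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RetryingRead {
    /** Hypothetical stand-in for reading one chunk from a chosen node. */
    interface NodeReader {
        int read(String node) throws IOException;
    }

    /** Nodes that failed during a read and must not be chosen again. */
    final Set<String> deadNodes = new HashSet<>();

    /** Makes at most two attempts, blacklisting the node of each failed attempt. */
    int readWithRetries(List<String> replicas, NodeReader reader) {
        int retries = 2;
        String currentNode = null;
        while (retries > 0) {
            try {
                if (currentNode == null) {
                    currentNode = chooseNode(replicas);
                }
                return reader.read(currentNode);
            } catch (IOException e) {
                if (currentNode != null) {
                    deadNodes.add(currentNode);
                }
                currentNode = null; // force a new choice on the next attempt
                if (--retries == 0) {
                    throw new UncheckedIOException(e);
                }
            }
        }
        throw new IllegalStateException("unreachable");
    }

    /** Returns the first replica that is not blacklisted. */
    private String chooseNode(List<String> replicas) throws IOException {
        for (String node : replicas) {
            if (!deadNodes.contains(node)) {
                return node;
            }
        }
        throw new IOException("no live replica left");
    }
}
```

With replicas dn1 and dn2 where only dn1 fails, the first attempt blacklists dn1 and the second attempt succeeds on dn2. If both fail, the second exception propagates to the caller.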

(1) blockSeekTo()

The code of this method is as follows:

/**
   * Open a DataInputStream to a DataNode so that it can be read from.
   * We get block ID and the IDs of the destinations at startup, from the namenode.
   */
  private synchronized DatanodeInfo blockSeekTo(long target) throws IOException {
    if (target >= getFileLength()) { // reading past the end of the HDFS file: throw an exception
      throw new IOException("Attempted to read past end of file");
    }

    // Will be getting a new BlockReader.
    if (blockReader != null) { // close the BlockReader of the previous block
      blockReader.close();
      blockReader = null;
    }

    //
    // Connect to best DataNode for desired Block, with potential offset
    //
    DatanodeInfo chosenNode = null;
    int refetchToken = 1; // only need to get a new access token once
    int refetchEncryptionKey = 1; // only need to get a new encryption key once
    
    boolean connectFailedOnce = false;

    while (true) {
      //
      // Compute desired block
      //
      // get the location information of the block that contains target
      LocatedBlock targetBlock = getBlockAt(target, true);
      assert (target==pos) : "Wrong postion " + pos + " expect " + target;
      // compute the offset of target within that block
      long offsetIntoBlock = target - targetBlock.getStartOffset();

      // call chooseDataNode() to pick a Datanode to read this block from
      DNAddrPair retval = chooseDataNode(targetBlock, null);
      chosenNode = retval.info;
      InetSocketAddress targetAddr = retval.addr;
      StorageType storageType = retval.storageType;

      try {
        ExtendedBlock blk = targetBlock.getBlock();
        Token<BlockTokenIdentifier> accessToken = targetBlock.getBlockToken();
        // obtain the blockReader object through BlockReaderFactory
        blockReader = new BlockReaderFactory(dfsClient.getConf()).
            setInetSocketAddress(targetAddr).
            setRemotePeerFactory(dfsClient).
            setDatanodeInfo(chosenNode).
            setStorageType(storageType).
            setFileName(src).
            setBlock(blk).
            setBlockToken(accessToken).
            setStartOffset(offsetIntoBlock).
            setVerifyChecksum(verifyChecksum).
            setClientName(dfsClient.clientName).
            setLength(blk.getNumBytes() - offsetIntoBlock).
            setCachingStrategy(cachingStrategy).
            setAllowShortCircuitLocalReads(!shortCircuitForbidden()).
            setClientCacheContext(dfsClient.getClientContext()).
            setUserGroupInformation(dfsClient.ugi).
            setConfiguration(dfsClient.getConfiguration()).
            build();
        if(connectFailedOnce) {
          DFSClient.LOG.info("Successfully connected to " + targetAddr +
                             " for " + blk);
        }
        return chosenNode;
      } catch (IOException ex) {
        if (ex instanceof InvalidEncryptionKeyException && refetchEncryptionKey > 0) {
          // security-related exception
          DFSClient.LOG.info("Will fetch a new encryption key and retry, " 
              + "encryption key was invalid when connecting to " + targetAddr
              + " : " + ex);
          // The encryption key used is invalid.
          refetchEncryptionKey--;
          dfsClient.clearDataEncryptionKey();
        } else if (refetchToken > 0 && tokenRefetchNeeded(ex, targetAddr)) {
          // security-related: the block access token needs to be refetched
          refetchToken--;
          fetchBlockAt(target);
        } else {
          connectFailedOnce = true;
          DFSClient.LOG.warn("Failed to connect to " + targetAddr + " for block"
            + ", add to deadNodes and continue. " + ex, ex);
          // Put chosen node into dead list, continue
          // constructing the BlockReader failed: blacklist chosenNode
          addToDeadNodes(chosenNode);
        }
      }
    }
  }

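An HDFS file is split into multiple blocks, which are spread across the Datanodes of the HDFS cluster. Reading a file means reading its blocks in order. When the read of one block is finished, an input stream for the next block has to be constructed, and this is when blockSeekTo() is called to obtain a Datanode that stores the next block.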

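blockSeekTo() first calls getBlockAt() to get the information of the block that contains the cursor (stored in the DFSInputStream.pos field), and then calls chooseDataNode() to pick a Datanode that stores that block. It then constructs a BlockReader object for reading the block from that node and saves it in the DFSInputStream.blockReader field. Note that the BlockReader is constructed through the factory class BlockReaderFactory. The implementation of BlockReader is covered later.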

Next we analyze the chain getBlockAt() -> chooseDataNode() -> blockReader, that is, getting the block, getting a Datanode for the block, and getting the BlockReader object.

<1> getBlockAt()

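This method obtains the location information of the block that contains the file cursor pos, that is, the LocatedBlock object for that block. The LocatedBlock object records all the Datanodes that store the block, sorted by their distance from the client, and also records information such as whether the block is cached. getBlockAt() calls ClientProtocol.getBlockLocations() to fetch the LocatedBlock object from the Namenode and saves it in the DFSInputStream.locatedBlocks field.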
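The arithmetic that connects pos, the block boundaries, and the length of a single read can be worked through in isolation. The sketch below assumes a fixed block size purely for illustration. BlockMath and its methods are hypothetical names, whereas the real client takes the block start offset and size from the LocatedBlock returned by the Namenode rather than computing them.

```java
final class BlockMath {
    private BlockMath() {}

    /** Index of the block that contains pos, assuming fixed-size blocks. */
    static long blockIndex(long pos, long blockSize) {
        return pos / blockSize;
    }

    /** Counterpart of "target - targetBlock.getStartOffset()". */
    static long offsetIntoBlock(long pos, long blockSize) {
        return pos - blockIndex(pos, blockSize) * blockSize;
    }

    /** Last file offset that belongs to the block containing pos. */
    static long blockEnd(long pos, long blockSize, long fileLength) {
        long nextBlockStart = (blockIndex(pos, blockSize) + 1) * blockSize;
        return Math.min(nextBlockStart, fileLength) - 1;
    }

    /** Counterpart of "Math.min(len, blockEnd - pos + 1L)": never read across a block. */
    static int realLen(long pos, int len, long blockEnd) {
        return (int) Math.min(len, blockEnd - pos + 1L);
    }
}
```

For example, with a block size of 128 and a file length of 300, pos = 130 lies in block 1 at offset 2. That block ends at offset 255, so a request for 1000 bytes is clamped to 126 bytes and the following read starts on the next block.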

<2> chooseDataNode()

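This method chooses a suitable Datanode to read the block from. Its logic is simple: the LocatedBlock object already contains the list of Datanodes sorted by distance from the client, so the method only needs to walk that list and pick the first Datanode that is not in the Datanode blacklist (stored in the DFSInputStream.deadNodes field).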
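The selection rule described above fits in a few lines. This is a sketch of the rule only: NodeChooser is a hypothetical name, node identities are plain strings, and returning an empty Optional stands in for whatever the real method does when every replica is blacklisted.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

final class NodeChooser {
    private NodeChooser() {}

    /**
     * Returns the first node of a distance-sorted list that is not blacklisted.
     *
     * @param sortedByDistance candidate nodes, nearest to the client first
     * @param deadNodes        nodes that failed earlier and must be skipped
     */
    static Optional<String> chooseDataNode(List<String> sortedByDistance, Set<String> deadNodes) {
        for (String node : sortedByDistance) {
            if (!deadNodes.contains(node)) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }
}
```

Because the list is already ordered by distance, skipping blacklisted entries automatically yields the nearest node that is still considered healthy.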

<3> BlockReaderFactory.build()

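This call constructs the BlockReader object that reads the block from the chosen Datanode. It goes through the factory class BlockReaderFactory. The implementation of BlockReaderFactory.build() is covered later.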

(2) readBuffer()

The code of this method is as follows:

/* This is a used by regular read() and handles ChecksumExceptions.
   * name readBuffer() is chosen to imply similarity to readBuffer() in
   * ChecksumFileSystem
   */ 
  private synchronized int readBuffer(ReaderStrategy reader, int off, int len,
      Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
      throws IOException {
    IOException ioe;
    
    /* we retry current node only once. So this is set to true only here.
     * Intention is to handle one common case of an error that is not a
     * failure on datanode or client : when DataNode closes the connection
     * since client is idle. If there are other cases of "non-errors" then
     * then a datanode might be retried by setting this to true again.
     */
    boolean retryCurrentNode = true;

    while (true) {
      // retry as many times as seekToNewSource allows.
      try {
        // read the data
        return reader.doRead(blockReader, off, len, readStatistics);
      } catch ( ChecksumException ce ) {
        DFSClient.LOG.warn("Found Checksum error for "
            + getCurrentBlock() + " from " + currentNode
            + " at " + ce.getPos()); 
        // a checksum exception means the block replica on currentNode is corrupt
        ioe = ce;
        retryCurrentNode = false;
        // we want to remember which block replicas we have tried
        // record the corrupt replica in corruptedBlockMap so that it can be reported to the Namenode later
        addIntoCorruptedBlockMap(getCurrentBlock(), currentNode,
            corruptedBlockMap);
      } catch ( IOException e ) {
        if (!retryCurrentNode) {
          DFSClient.LOG.warn("Exception while reading from "
              + getCurrentBlock() + " of " + src + " from "
              + currentNode, e);
        }
        ioe = e;
      }
      boolean sourceFound = false;
      if (retryCurrentNode) {
        /* possibly retry the same node so that transient errors don't
         * result in application level failures (e.g. Datanode could have
         * closed the connection because the client is idle for too long).
         */ 
        // retry the current node
        sourceFound = seekToBlockSource(pos);
      } else {
        // do not retry the current node (the retry already failed, or a checksum error occurred): blacklist it and choose a new Datanode to read from
        addToDeadNodes(currentNode);
        sourceFound = seekToNewSource(pos);
      }
      if (!sourceFound) {
        throw ioe;
      }
      retryCurrentNode = false;
    }
  }

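readBuffer() delegates the actual read to the BlockReader object and retries when an exception occurs. A checksum exception means that the block replica on currentNode is corrupt. In that case readBuffer() adds the replica to corruptedBlockMap and moves on to another Datanode without retrying the current one. The corrupt replicas are then reported to the Namenode by reportCheckSumFailure(), which readWithStrategy() calls in its finally block. An ordinary IO exception may just mean that the connection between the client and the Datanode was closed, so readBuffer() first retries the current node by calling seekToBlockSource(). If that retry fails, it adds the current Datanode to the blacklist and calls seekToNewSource() to choose a new Datanode. Both seekToBlockSource() and seekToNewSource() call the blockSeekTo() method described above.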
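The policy described above (retry the same node once after an ordinary IO error, but switch nodes immediately after a checksum error) can be modeled as a small state machine. The sketch below is not the HDFS implementation: ReadBufferPolicy, ChecksumError, NodeReader, and the string node names are hypothetical, and reconnecting to the same node is modeled simply by calling the reader again.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ReadBufferPolicy {
    /** Hypothetical marker for a checksum failure on a replica. */
    static final class ChecksumError extends IOException {
        ChecksumError(String message) {
            super(message);
        }
    }

    /** Hypothetical stand-in for reading through a BlockReader. */
    interface NodeReader {
        int read(String node) throws IOException;
    }

    final Set<String> deadNodes = new HashSet<>();
    final List<String> corruptReplicas = new ArrayList<>();

    int readBuffer(List<String> replicas, NodeReader reader) {
        String currentNode = firstAlive(replicas);
        if (currentNode == null) {
            throw new IllegalStateException("no live replica");
        }
        boolean retryCurrentNode = true; // the current node is retried only once
        while (true) {
            IOException failure;
            try {
                return reader.read(currentNode);
            } catch (ChecksumError ce) {
                failure = ce;
                retryCurrentNode = false;         // corrupt data: retrying is pointless
                corruptReplicas.add(currentNode); // remembered for later reporting
            } catch (IOException e) {
                failure = e;                      // possibly just a dropped connection
            }
            if (!retryCurrentNode) {
                deadNodes.add(currentNode);
                currentNode = firstAlive(replicas);
                if (currentNode == null) {
                    throw new UncheckedIOException(failure);
                }
            }
            retryCurrentNode = false;
        }
    }

    private String firstAlive(List<String> replicas) {
        for (String node : replicas) {
            if (!deadNodes.contains(node)) {
                return node;
            }
        }
        return null;
    }
}
```

A transient failure on the first call therefore leaves the blacklist empty, while a checksum failure both records the replica as corrupt and blacklists the node before the read continues elsewhere.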

For reasons of length, the rest of this section continues in the next post.
