HDFS Client Read Path: A Source Code Analysis

午后的红茶meton

Category: Hadoop Analysis and Understanding. Tags: hadoop, hdfs, client read

Article link: https://blog.csdn.net/u012151684/article/details/107948538

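Most readers are already familiar with how a file is read from the HDFS file system, usually through a diagram like the one below.

[Figure: client read flow diagram]

The basic read flow is as follows: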

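The client opens a file on HDFS by calling open() on the FileSystem object. Under the hood this delegates to DFSClient.open(), which returns an HdfsDataInputStream for reading data blocks. HdfsDataInputStream is a decorator around DFSInputStream; the DFSInputStream object does the actual block reading.
Through the RPC method ClientProtocol.getBlockLocations(), the client asks the NameNode for the locations of the file's first blocks. Each block comes back with one location per replica, and these locations are sorted by the cluster's network topology so that DataNodes closer to the client come first. DFSInputStream therefore picks the best DataNode, opens a data connection to it, and reads the block.
The client reads the block from that DataNode through DFSInputStream.read(). The block is streamed from the DataNode to the client in units of packets. When a block has been read completely, the client obtains the location of the file's next block (calling ClientProtocol.getBlockLocations() again when that location is not already cached), connects to the best DataNode for the new block, and continues reading.
Once the client has finished reading, it calls close() on the HdfsDataInputStream to close the input stream.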
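The four steps above can be modelled with a small in-memory toy. This is only a sketch under stated assumptions: ToyReadFlow, ToyBlock, and the node names dn1 to dn3 are hypothetical and are not Hadoop classes; the replica list of each block is assumed to arrive already sorted nearest-first, as the NameNode is described as doing.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the read flow; none of these types are Hadoop classes. */
class ToyReadFlow {
    /** One block: its bytes plus replica hosts, assumed sorted nearest-first. */
    static final class ToyBlock {
        final byte[] data;
        final List<String> sortedReplicas;

        ToyBlock(String text, String... replicas) {
            this.data = text.getBytes(StandardCharsets.UTF_8);
            this.sortedReplicas = Arrays.asList(replicas);
        }
    }

    /** Stand-in for asking the NameNode for a file's block locations. */
    static List<ToyBlock> getBlockLocations() {
        return Arrays.asList(
            new ToyBlock("hello ", "dn1", "dn2", "dn3"),
            new ToyBlock("hdfs", "dn2", "dn3", "dn1"));
    }

    /** Reads every block from its first-listed replica and records the nodes used. */
    static String readAll(List<ToyBlock> blocks, List<String> nodesUsed) {
        StringBuilder out = new StringBuilder();
        for (ToyBlock block : blocks) {
            nodesUsed.add(block.sortedReplicas.get(0)); // nearest replica wins
            out.append(new String(block.data, StandardCharsets.UTF_8));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> nodesUsed = new ArrayList<>();
        System.out.println(readAll(getBlockLocations(), nodesUsed)); // hello hdfs
        System.out.println(nodesUsed);                               // [dn1, dn2]
    }
}
```

The toy leaves out everything that makes the real client interesting (RPC, sockets, packets, failures); it only shows the shape of the loop: look up locations, pick the first replica, read the block, move on.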

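The rest of this article walks through the source code step by step to see how the HDFS client interacts with the NameNode and the DataNodes when reading a file.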

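1. The client first calls FSDataInputStream inputStream = DistributedFileSystem.open(); to open the file and obtain an input stream. As the code shows, this ultimately constructs a DFSInputStream that is used to read the HDFS file.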

  @Override
  public FSDataInputStream open(Path f, final int bufferSize)
      throws IOException {
    statistics.incrementReadOps(1);
    Path absF = fixRelativePart(f);
    return new FileSystemLinkResolver<FSDataInputStream>() {
      @Override
      public FSDataInputStream doCall(final Path p)
          throws IOException, UnresolvedLinkException {
        final DFSInputStream dfsis =
          dfs.open(getPathName(p), bufferSize, verifyChecksum);
        return dfs.createWrappedInputStream(dfsis);
      }
    }.resolve(this, absF);
  }

  public HdfsDataInputStream createWrappedInputStream(DFSInputStream dfsis)
      throws IOException {
    // ......... (mainly checks for whether an encrypted stream is needed)
    return new HdfsDataInputStream(dfsis);
  }

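DistributedFileSystem.open() delegates the real work to dfs.open(), where dfs is the DFSClient instance. Its job is to open the file and construct the corresponding DFSInputStream. The DFSInputStream constructor does two things:

Initializes the basic fields of DFSInputStream: the dfsClient reference, verifyChecksum (whether to verify checksums while reading; this mainly matters for zero-copy reads), buffersize (the read buffer size, 4 KB), and src (the path of the file being read);
Calls openInfo(): fetches the locations of the file's blocks from the NameNode and stores them in the DFSInputStream.locatedBlocks field.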

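Now look at openInfo() in detail. openInfo() calls fetchLocatedBlocksAndGetLastBlockLength() to obtain the block locations for the file. The main steps are:

First, call dfsClient.getLocatedBlocks(), which uses the RPC method ClientProtocol.getBlockLocations() to fetch the file's block locations from the NameNode;
Next, compare the newly fetched block locations with those already held in locatedBlocks, then replace locatedBlocks with the new list;
Finally, if the last block is not yet complete (it is still being written), call readBlockLength(), which uses the ClientDatanodeProtocol RPC interface to ask a DataNode for the length of that block, and update the length recorded for the last block in locatedBlocks.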

  private long fetchLocatedBlocksAndGetLastBlockLength() throws IOException {
    // Fetch the file's block locations from the NameNode via the ClientProtocol.getBlockLocations() RPC
    final LocatedBlocks newInfo = dfsClient.getLocatedBlocks(src, 0);
    if (DFSClient.LOG.isDebugEnabled()) {
      DFSClient.LOG.debug("newInfo = " + newInfo);
    }
    if (newInfo == null) {
      throw new IOException("Cannot open filename " + src);
    }

    // Compare against the cached list, then update the locatedBlocks field
    if (locatedBlocks != null) {
      Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
      Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
      while (oldIter.hasNext() && newIter.hasNext()) {
        if (! oldIter.next().getBlock().equals(newIter.next().getBlock())) {
          throw new IOException("Blocklist for " + src + " has changed!");
        }
      }
    }
    locatedBlocks = newInfo;
    long lastBlockBeingWrittenLength = 0;
    if (!locatedBlocks.isLastBlockComplete()) {
      final LocatedBlock last = locatedBlocks.getLastLocatedBlock();
      if (last != null) {
        if (last.getLocations().length == 0) {
          if (last.getBlockSize() == 0) {
            // if the length is zero, then no data has been written to
            // datanode. So no need to wait for the locations.
            return 0;
          }
          return -1;
        }
        // Ask a DataNode (ClientDatanodeProtocol RPC) for the last block's length and record it
        final long len = readBlockLength(last);
        last.getBlock().setNumBytes(len);
        lastBlockBeingWrittenLength = len; 
      }
    }
	
    fileEncryptionInfo = locatedBlocks.getFileEncryptionInfo();
    currentNode = null;
    return lastBlockBeingWrittenLength;
  }
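The comparison loop in the middle of this method only checks the positions that both lists share, so a file that has grown since the previous fetch still passes. A minimal sketch of that idea follows, using plain long IDs in place of ExtendedBlock objects; the class and method names are hypothetical and returning a boolean instead of throwing is a simplification.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Sketch of the block-list consistency check; block IDs are plain longs here. */
class BlockListCheck {
    /**
     * Returns true when the two lists agree on every position they share.
     * A longer new list is accepted: the file may have grown since the last fetch.
     */
    static boolean sameCommonPrefix(List<Long> oldIds, List<Long> newIds) {
        Iterator<Long> oldIter = oldIds.iterator();
        Iterator<Long> newIter = newIds.iterator();
        while (oldIter.hasNext() && newIter.hasNext()) {
            if (!oldIter.next().equals(newIter.next())) {
                return false; // the listing above throws an IOException at this point
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Same first two blocks, one new block appended: consistent.
        System.out.println(sameCommonPrefix(Arrays.asList(1L, 2L), Arrays.asList(1L, 2L, 3L))); // true
        // Second block differs: the block list has changed underneath the reader.
        System.out.println(sameCommonPrefix(Arrays.asList(1L, 2L), Arrays.asList(1L, 9L)));     // false
    }
}
```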

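2. inputStream.read(): once the DFSInputStream for the file has been constructed, inputStream.read() can be called to read block data. The basic process is:

currentNode = blockSeekTo(targetPos): finds the best DataNode holding the next block to read. blockSeekTo() first calls getBlockAt() to get the block that contains the current position, then calls chooseDataNode() to pick the best DataNode, and finally constructs the blockReader used to stream that block;
The blockReader reads a block from a specific DataNode. While it is being constructed, it creates a Sender that sends the Op.READ_BLOCK opcode to the DataNode. There are several kinds of reader (this article covers the remote one):
1. BlockReaderLocal: short-circuit local read (the client and the DataNode are on the same machine, so the data can be read straight from local disk)
2. RemoteBlockReader2: reads the block from the DataNode over a socket connection
readBuffer() reads the block's data from the stream by delegating to blockReader.read(buf). On a read error it follows the retry policy: either seekToBlockSource() to retry the current node, or seekToNewSource() (which calls blockSeekTo() again internally) to pick a new DataNode.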

  /**
   * Open a DataInputStream to a DataNode so that it can be read from.
   * We get block ID and the IDs of the destinations at startup, from the namenode.
   */
  private synchronized DatanodeInfo blockSeekTo(long target) throws IOException {
    //
    // Connect to best DataNode for desired Block, with potential offset
    //
    DatanodeInfo chosenNode = null;
    while (true) {
      // Look up the block that contains the current position
      LocatedBlock targetBlock = getBlockAt(target, true);
      assert (target==pos) : "Wrong postion " + pos + " expect " + target;
      long offsetIntoBlock = target - targetBlock.getStartOffset();

      // Pick the best DataNode for this block
      DNAddrPair retval = chooseDataNode(targetBlock, null);
      chosenNode = retval.info;
      InetSocketAddress targetAddr = retval.addr;
      StorageType storageType = retval.storageType;

      try {
        ExtendedBlock blk = targetBlock.getBlock();
        Token<BlockTokenIdentifier> accessToken = targetBlock.getBlockToken();
        // Build the blockReader used to stream this block
        blockReader = new BlockReaderFactory(dfsClient.getConf()).
            setInetSocketAddress(targetAddr).
            setRemotePeerFactory(dfsClient).
            setDatanodeInfo(chosenNode).
            // ...... (remaining builder calls elided in the original)
            build();
        if(connectFailedOnce) {
          DFSClient.LOG.info("Successfully connected to " + targetAddr +
                             " for " + blk);
        }
        return chosenNode;
      } catch (IOException ex) {
        // ...... (other failure-handling branches elided in the original)
        connectFailedOnce = true;
        DFSClient.LOG.warn("Failed to connect to " + targetAddr + " for block"
          + ", add to deadNodes and continue. " + ex, ex);
        // Put chosen node into dead list, continue
        addToDeadNodes(chosenNode); // blacklist chosenNode for later attempts
      }
    }
  }
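The first two statements inside the loop turn a file offset into a block plus an offset within that block (offsetIntoBlock = target - targetBlock.getStartOffset()). A minimal sketch of that arithmetic, assuming blocks are described only by their start offsets and sizes; BlockOffsetLookup and locate are hypothetical names, and the linear scan is a simplification of whatever lookup the real client uses.

```java
/** Sketch: map a file offset to (block index, offset inside that block). */
class BlockOffsetLookup {
    /** Returns {blockIndex, offsetIntoBlock}, or null when target is past the last block. */
    static long[] locate(long[] blockStartOffsets, long[] blockSizes, long target) {
        for (int i = 0; i < blockStartOffsets.length; i++) {
            long start = blockStartOffsets[i];
            if (target >= start && target < start + blockSizes[i]) {
                return new long[] { i, target - start };
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Two blocks: bytes [0, 100) and bytes [100, 140).
        long[] starts = { 0L, 100L };
        long[] sizes = { 100L, 40L };
        long[] hit = locate(starts, sizes, 110L);
        System.out.println("block " + hit[0] + ", offset " + hit[1]); // block 1, offset 10
    }
}
```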

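The strategy for choosing the best DataNode is simple: the locations in locatedBlocks were already sorted by distance from the client when they were fetched, so it is enough to take the first DataNode that is not in deadNodes.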
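As a rough illustration of that rule, the selection reduces to a first-match scan over an already sorted list. The sketch below uses plain strings for node names; ReplicaChooser and chooseNode are hypothetical names, and what the real client does when every replica is dead is outside this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of "first replica that is not blacklisted"; not a Hadoop class. */
class ReplicaChooser {
    /** Returns the first node of the nearest-first list that is not dead, or null if none is left. */
    static String chooseNode(List<String> sortedReplicas, Set<String> deadNodes) {
        for (String node : sortedReplicas) {
            if (!deadNodes.contains(node)) {
                return node;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> replicas = Arrays.asList("dn1", "dn2", "dn3");
        Set<String> dead = new HashSet<>(Arrays.asList("dn1"));
        System.out.println(chooseNode(replicas, dead)); // dn2
    }
}
```

Because the ordering work was done when the locations were fetched, the client side needs no distance calculation of its own; it only has to remember which nodes have already failed.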

The Op.READ_BLOCK opcode is sent while the reader is being built by reader = new BlockReaderFactory().build(); the call chain is:

getRemoteBlockReaderFromTcp()
- blockReader = getRemoteBlockReader(peer)
  - RemoteBlockReader2.newBlockReader()
    - new Sender(out).readBlock(block, blockToken, clientName, startOffset, len, verifyChecksum, cachingStrategy); this is where Sender finally sends the READ_BLOCK opcode

DFSInputStream#read() reads the block data through readBuffer():

  private synchronized int readBuffer(ReaderStrategy reader, int off, int len,
      Map<ExtendedBlock, Set<DatanodeInfo>> corruptedBlockMap)
      throws IOException {
    IOException ioe;
    
    boolean retryCurrentNode = true;

    while (true) {
      // retry as many times as seekToNewSource allows.
      try {
        // Delegate the actual read to the reader strategy
        return reader.doRead(blockReader, off, len, readStatistics);
      } catch ( ChecksumException ce ) {
        DFSClient.LOG.warn("Found Checksum error for "
            + getCurrentBlock() + " from " + currentNode
            + " at " + ce.getPos());        
        ioe = ce;
        retryCurrentNode = false;
        // we want to remember which block replicas we have tried
        // Record the corrupt replica in corruptedBlockMap so it can be reported to the NameNode
        addIntoCorruptedBlockMap(getCurrentBlock(), currentNode,
            corruptedBlockMap);
      } catch ( IOException e ) {
        // .........
      }
      boolean sourceFound = false;
      if (retryCurrentNode) {
        // Retry the current node
        sourceFound = seekToBlockSource(pos);
      } else {
        // Blacklist the current node and pick a new DataNode to read from
        addToDeadNodes(currentNode);
        sourceFound = seekToNewSource(pos);
      }
      if (!sourceFound) {
        throw ioe;
      }
      retryCurrentNode = false;
    }
  }
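The control flow of readBuffer() is easier to see in isolation: a plain I/O failure earns the current node one more attempt, while a checksum failure (or a second failure) moves the read to another node. The toy below models only that decision logic. Everything in it is hypothetical: ReadFailure and ChecksumFailure stand in for IOException and ChecksumException (unchecked here to keep the sketch short), Attempt stands in for reader.doRead(), and a queue of node names stands in for seekToBlockSource()/seekToNewSource().

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Toy model of the retry policy; none of these types are Hadoop classes. */
class RetryPolicySketch {
    /** Stand-in for IOException. */
    static class ReadFailure extends RuntimeException {
        ReadFailure(String msg) { super(msg); }
    }

    /** Stand-in for ChecksumException. */
    static class ChecksumFailure extends ReadFailure {
        ChecksumFailure(String msg) { super(msg); }
    }

    /** Hypothetical single read attempt against one node. */
    interface Attempt { int read(String node); }

    static int readWithRetry(Deque<String> nodes, Attempt attempt, List<String> deadNodes) {
        String current = nodes.poll();          // assumes at least one candidate node
        boolean retryCurrentNode = true;
        while (true) {
            ReadFailure failure;
            try {
                return attempt.read(current);
            } catch (ChecksumFailure ce) {
                failure = ce;
                retryCurrentNode = false;       // corrupt replica: do not retry it
            } catch (ReadFailure e) {
                failure = e;
            }
            if (!retryCurrentNode) {
                deadNodes.add(current);         // give up on this node
                current = nodes.poll();         // move on to the next candidate
                if (current == null) {
                    throw failure;              // no source left
                }
            }
            retryCurrentNode = false;           // the same node is retried at most once
        }
    }

    public static void main(String[] args) {
        List<String> tried = new ArrayList<>();
        List<String> dead = new ArrayList<>();
        Deque<String> nodes = new ArrayDeque<>(Arrays.asList("dn1", "dn2"));
        int n = readWithRetry(nodes, node -> {
            tried.add(node);
            if (node.equals("dn1")) throw new ReadFailure("connection reset");
            return 42;
        }, dead);
        System.out.println(n + " " + tried + " " + dead); // 42 [dn1, dn1, dn2] [dn1]
    }
}
```

In the listing above, retrying the current node can itself fail (seekToBlockSource() returns false and the saved exception is thrown); the toy skips that case and always allows the one retry.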

3. blockReader.read(): in remote mode a RemoteBlockReader2 is constructed, which reads the block from the DataNode over a socket connection. Its main read() method calls readNextPacket() to fetch a new packet from the stream.

  @Override
  public synchronized int read(byte[] buf, int off, int len) 
                               throws IOException {
    // Fetch the next packet when the current slice is used up and data is still expected
    if (curDataSlice == null || (curDataSlice.remaining() == 0 && bytesNeededToFinish > 0)) {
      readNextPacket();
    }
    if (curDataSlice.remaining() == 0) {
      // we're at EOF now
      return -1;
    }
  
    int nRead = Math.min(curDataSlice.remaining(), len);
    curDataSlice.get(buf, off, nRead);
    return nRead;
  }

  private void readNextPacket() throws IOException {
    // Read the next packet: its header and its data
    packetReceiver.receiveNextPacket(in);

    PacketHeader curHeader = packetReceiver.getHeader();
    curDataSlice = packetReceiver.getDataSlice();
    assert curDataSlice.capacity() == curHeader.getDataLen();
    
    // Sanity check the lengths in the packet header
    if (!curHeader.sanityCheck(lastSeqNo)) {
         throw new IOException("BlockReader: error in packet header " +
                               curHeader);
    }
    
    // Verify the packet's checksums
    if (curHeader.getDataLen() > 0) {
      int chunks = 1 + (curHeader.getDataLen() - 1) / bytesPerChecksum;
      int checksumsLen = chunks * checksumSize;

      assert packetReceiver.getChecksumSlice().capacity() == checksumsLen :
        "checksum slice capacity=" + packetReceiver.getChecksumSlice().capacity() + 
          " checksumsLen=" + checksumsLen;
      
      lastSeqNo = curHeader.getSeqno();
      if (verifyChecksum && curDataSlice.remaining() > 0) {
        checksum.verifyChunkedSums(curDataSlice,
            packetReceiver.getChecksumSlice(),
            filename, curHeader.getOffsetInBlock());
      }
      bytesNeededToFinish -= curHeader.getDataLen();
    }    
    
    // First packet will include some data prior to the first byte
    // the user requested. Skip it.
    if (curHeader.getOffsetInBlock() < startOffset) {
      int newPos = (int) (startOffset - curHeader.getOffsetInBlock());
      curDataSlice.position(newPos);
    }

    // If we've now satisfied the whole client read, read one last packet
    // header, which should be empty
    if (bytesNeededToFinish <= 0) {
      readTrailingEmptyPacket();
      if (verifyChecksum) {
        sendReadResult(Status.CHECKSUM_OK);
      } else {
        sendReadResult(Status.SUCCESS);
      }
    }
  }
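Two pieces of arithmetic in readNextPacket() are worth working through: the expected checksum length (one checksum per chunk of bytesPerChecksum bytes, with a partial trailing chunk still counting as one) and the number of leading bytes to skip in the first packet. The sketch below restates both as standalone functions. The values 512 bytes per checksum and 4 bytes per checksum used in the examples are assumptions for illustration, and PacketMath with its method names is hypothetical.

```java
/** Worked arithmetic from readNextPacket(); a sketch, not Hadoop code. */
class PacketMath {
    /** Number of checksum chunks covering dataLen bytes (dataLen > 0): a ceiling division. */
    static int checksumChunks(int dataLen, int bytesPerChecksum) {
        return 1 + (dataLen - 1) / bytesPerChecksum;
    }

    /** Total bytes of checksum data expected alongside dataLen bytes of payload. */
    static int checksumsLen(int dataLen, int bytesPerChecksum, int checksumSize) {
        return checksumChunks(dataLen, bytesPerChecksum) * checksumSize;
    }

    /** Bytes to skip when a packet starts before the offset the caller asked for. */
    static int bytesToSkip(long startOffset, long packetOffsetInBlock) {
        return packetOffsetInBlock < startOffset
            ? (int) (startOffset - packetOffsetInBlock)
            : 0;
    }

    public static void main(String[] args) {
        // 65536 payload bytes in 512-byte chunks: exactly 128 chunks.
        System.out.println(checksumChunks(65536, 512));   // 128
        // 1000 payload bytes: one full chunk plus a partial one, 4 bytes each.
        System.out.println(checksumsLen(1000, 512, 4));   // 8
        // Caller wants offset 700, but the packet starts at the chunk boundary 512.
        System.out.println(bytesToSkip(700L, 512L));      // 188
    }
}
```

The skip exists because, as the source comment in the listing says, the first packet includes some data before the first byte the user requested; positioning curDataSlice past those bytes hides them from the caller.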

4. When reading is finished, the client simply calls DFSInputStream.close() to close the stream, which in the end closes the blockReader.

  @Override
  public synchronized void close() throws IOException {
    if (closed) {
      return;
    }
    dfsClient.checkOpen();

    if (!extendedReadBuffers.isEmpty()) {
      final StringBuilder builder = new StringBuilder();
      extendedReadBuffers.visitAll(new IdentityHashStore.Visitor<ByteBuffer, Object>() {
        private String prefix = "";
        @Override
        public void accept(ByteBuffer k, Object v) {
          builder.append(prefix).append(k);
          prefix = ", ";
        }
      });
      DFSClient.LOG.warn("closing file " + src + ", but there are still " +
          "unreleased ByteBuffers allocated by read().  " +
          "Please release " + builder.toString() + ".");
    }
    if (blockReader != null) {
      blockReader.close();
      blockReader = null;
    }
    super.close();
    closed = true;
  }
