Chapter 7: Xiaozhu's Notes on Hadoop Source Code Analysis - HDFS
Section 3: HDFS Implementation Analysis
3.4 DataNode Data Structures
The Storage-related classes describe, at a high level, how each storage directory is organized. They manage the directories and files (current, previous, detach, tmp, storage, etc.) under the paths given by the HDFS property dfs.data.dir, and define the operations on the storage as a whole.
The Dataset-related classes describe how block files and their metadata files are organized.
All block-level operations are handled by the FSDataset family of classes. From coarse to fine, the storage structure is: volume (FSVolume), directory (FSDir), and file (the Block data file, its metadata file, and so on).
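For orientation, one storage directory (one FSVolume) is laid out on disk roughly as follows; the block ID, generation stamp (1001) and subdirectory names here are only illustrative:

${dfs.data.dir}/
    storage                                # storage marker/guard file
    current/                               # finalized blocks (the FSDir tree)
        VERSION
        blk_3148782637964391313            # block data file
        blk_3148782637964391313_1001.meta  # block metadata (checksums)
        subdir0/ ... subdir63/             # created once the directory fills up
    previous/                              # previous version, kept across upgrades
    tmp/                                   # blocks currently being written
    detach/                                # copy-on-write staging (see the detach technique below)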
(1) FSDatasetInterface
FSDatasetInterface is the DataNode's abstraction of its underlying storage. Its main methods (a usage sketch follows the list):
getMetaDataLength(Block b): returns the length of the block's metadata file
getMetaDataInputStream(Block b): opens an input stream on the block's metadata file
metaFileExists(Block b): checks whether the block's metadata file exists
getLength(Block b): returns the length of the block's data file
getStoredBlock(long blkid): returns the block's information, looked up by block ID
getBlockInputStream(long blkid): opens an input stream on the block's data file
getBlockInputStream(Block b, long seekOffset): opens an input stream positioned at a given offset in the block's data file
getTmpInputStreams(Block b, long blkoff, long ckoff): opens input streams at given offsets in the block's data and metadata files while the block still resides in the temporary directory
writeToBlock(Block b, boolean isRecovery, boolean replicationRequest): creates the block file and returns output streams for it
updateBlock(Block oldblock, Block newblock): updates a block (generation stamp and/or length)
finalizeBlock(Block b): completes the write of a block
unfinalizeBlock(Block b): aborts the write of a block and deletes its temporary files
getBlockReport(): returns the report of blocks stored on this DataNode
isValidBlock(Block b): checks whether the block is valid
invalidate(Block invalidBlks[]): invalidates the given blocks, deleting their data and metadata files
checkDataDir(): checks whether the storage directories are healthy
shutdown(): shuts down the FSDataset
getChannelPosition(Block b, BlockWriteStreams stream): returns the current position in the block's output streams
setChannelPosition(Block b, BlockWriteStreams stream, long dataOffset, long ckOffset): sets the position in the block's output streams
validateBlockMetadata(Block b): validates the block's metadata file
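To make the call pattern concrete, here is a minimal usage sketch (not from the original text): it assumes dataset is an initialized FSDatasetInterface instance and blockId names a block this DataNode already stores.

// Illustrative sketch only: `dataset` and `blockId` are assumed to exist.
static void printBlockInfo(FSDatasetInterface dataset, long blockId) throws IOException {
  Block b = dataset.getStoredBlock(blockId);        // resolve ID -> Block(id, length, genstamp)
  if (b != null && dataset.metaFileExists(b)) {
    long dataLen = dataset.getLength(b);            // size of the blk_<id> data file
    long metaLen = dataset.getMetaDataLength(b);    // size of the blk_<id>_<gs>.meta file
    System.out.println(b + ": data=" + dataLen + " bytes, meta=" + metaLen + " bytes");
  }
}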
(2) FSDataset
FSDataset implements the FSDatasetInterface interface; all operations on data blocks are handled by FSDataset and its related classes. Key member variables:
// The set of all storage volumes used by this FSDataset
FSVolumeSet volumes;
// Map from Block to ActiveFile; every block that is being created is recorded in ongoingCreates
private HashMap<Block,ActiveFile> ongoingCreates = new HashMap<Block,ActiveFile>();
// Maximum number of blocks a single directory may hold
private int maxBlocksPerDir = 0;
// In-memory map from each block to its on-disk location: the FSVolume the block lives on and its concrete file
HashMap<Block,DatanodeBlockInfo> volumeMap = new HashMap<Block, DatanodeBlockInfo>();
Key methods:
// Look up a Block by its block ID
public synchronized Block getStoredBlock(long blkid) throws IOException {
  File blockfile = findBlockFile(blkid);
  if (blockfile == null) {
    return null;
  }
  File metafile = findMetaFile(blockfile);
  Block block = new Block(blkid);
  return new Block(blkid, getVisibleLength(block),
                   parseGenerationStamp(blockfile, metafile));
}
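getStoredBlock recovers the generation stamp from the metadata file's name. The naming convention is blk_<blockId>_<generationStamp>.meta, so the helper can be sketched as follows (a simplified version of the real parseGenerationStamp; the ".meta" literal stands in for the METADATA_EXTENSION constant):

// The stamp sits between the block file name and the ".meta" extension,
// e.g. blk_3148782637964391313_1001.meta -> generation stamp 1001.
static long parseGenerationStamp(File blockFile, File metaFile) throws IOException {
  String metaname = metaFile.getName();
  String gs = metaname.substring(blockFile.getName().length() + 1,
                                 metaname.length() - ".meta".length());
  try {
    return Long.parseLong(gs);
  } catch (NumberFormatException nfe) {
    throw new IOException("Corrupt meta file name: " + metaname, nfe);
  }
}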
// Check whether the block's metadata file exists
public boolean metaFileExists(Block b) throws IOException {
  return getMetaFile(b).exists();
}
// Get the length of a block's metadata: find the metadata file for the block's ID and return its length
public long getMetaDataLength(Block b) throws IOException {
  File checksumFile = getMetaFile(b);
  return checksumFile.length();
}
// Get an input stream on a block's metadata: find the metadata file for the block's ID and open a stream on it
public MetaDataInputStream getMetaDataInputStream(Block b)
    throws IOException {
  File checksumFile = getMetaFile(b);
  return new MetaDataInputStream(new FileInputStream(checksumFile),
                                 checksumFile.length());
}
/**
 * Returns handles to the block file and its metadata file.
 * Gets "temporary" input streams for a block; temporary means the underlying
 * files still live in the tmp directory. While a block is being created its
 * data is written under tmp/, and only after the write succeeds is the file
 * moved into current/; if the write fails, current/ is never affected.
 */
public synchronized BlockInputStreams getTmpInputStreams(Block b,
    long blkOffset, long ckoff) throws IOException {
  DatanodeBlockInfo info = volumeMap.get(b);
  if (info == null) {
    throw new IOException("Block " + b + " does not exist in volumeMap.");
  }
  FSVolume v = info.getVolume();
  File blockFile = info.getFile();
  // a newly created block is still written under the tmp directory
  if (blockFile == null) {
    blockFile = v.getTmpFile(b);
  }
  RandomAccessFile blockInFile = new RandomAccessFile(blockFile, "r");
  if (blkOffset > 0) {
    blockInFile.seek(blkOffset);
  }
  File metaFile = getMetaFile(blockFile, b);
  RandomAccessFile metaInFile = new RandomAccessFile(metaFile, "r");
  if (ckoff > 0) {
    metaInFile.seek(ckoff);
  }
  return new BlockInputStreams(new FileInputStream(blockInFile.getFD()),
                               new FileInputStream(metaInFile.getFD()));
}
/** {@inheritDoc}
 *
 * The outer loop of updateBlock runs until no write threads remain for this
 * block. Each iteration calls the internal method tryUpdateBlock. If
 * tryUpdateBlock finds that no thread is still writing the block, it updates
 * everything associated with the block, including the files on disk and the
 * in-memory map volumeMap. If tryUpdateBlock finds threads still attached to
 * the block, updateBlock interrupts them and waits for them in join().
 */
public void updateBlock(Block oldblock, Block newblock) throws IOException {
  if (oldblock.getBlockId() != newblock.getBlockId()) {
    throw new IOException("Cannot update oldblock (=" + oldblock
        + ") to newblock (=" + newblock + ").");
  }
  // Protect against a straggler updateblock call moving a block backwards
  // in time.
  boolean isValidUpdate =
      (newblock.getGenerationStamp() > oldblock.getGenerationStamp()) ||
      (newblock.getGenerationStamp() == oldblock.getGenerationStamp() &&
       newblock.getNumBytes() == oldblock.getNumBytes());
  if (!isValidUpdate) {
    throw new IOException(
        "Cannot update oldblock=" + oldblock +
        " to newblock=" + newblock + " since generation stamps must " +
        "increase, or else length must not change.");
  }
  for (;;) {
    final List<Thread> threads = tryUpdateBlock(oldblock, newblock);
    if (threads == null) {
      return;
    }
    interruptAndJoinThreads(threads);
  }
}
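interruptAndJoinThreads is not reproduced in this note; a plausible sketch of what the loop above relies on (an assumption, not verbatim source) is:

// Interrupt every ongoing writer of the block and wait for each to exit,
// after which updateBlock retries tryUpdateBlock.
private void interruptAndJoinThreads(List<Thread> threads) throws IOException {
  for (Thread t : threads) {
    t.interrupt();
  }
  for (Thread t : threads) {
    try {
      t.join();   // block until the writer thread has terminated
    } catch (InterruptedException e) {
      throw new IOException("Interrupted while waiting for writer " + t);
    }
  }
}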
/**
 * Try to update an old block to a new block.
 * If there are ongoing create threads running for the old block,
 * the threads will be returned without updating the block.
 * Truncates the old block into the new block (via truncateBlock), truncates
 * the corresponding metadata file, and updates ongoingCreates and volumeMap.
 * @return ongoing create threads if there is any. Otherwise, return null.
 */
private synchronized List<Thread> tryUpdateBlock(Block oldblock, Block newblock) throws IOException {
  // check ongoing create threads: find the threads currently accessing this block
  ArrayList<Thread> activeThreads = getActiveThreads(oldblock);
  if (activeThreads != null) {
    return activeThreads; // writers are still active; return them instead of updating
  }
  // No ongoing create threads is alive. Update block.
  // locate the old block's data file
  File blockFile = findBlockFile(oldblock.getBlockId());
  if (blockFile == null) {
    throw new IOException("Block " + oldblock + " does not exist.");
  }
  File oldMetaFile = findMetaFile(blockFile);
  long oldgs = parseGenerationStamp(blockFile, oldMetaFile);
  // First validate the update
  // update generation stamp: the old stamp must not exceed the new one
  if (oldgs > newblock.getGenerationStamp()) {
    throw new IOException("Cannot update block (id=" + newblock.getBlockId()
        + ") generation stamp from " + oldgs
        + " to " + newblock.getGenerationStamp());
  }
  // update length: the new block must not be larger than the old one
  if (newblock.getNumBytes() > oldblock.getNumBytes()) {
    throw new IOException("Cannot update block file (=" + blockFile
        + ") length from " + oldblock.getNumBytes() + " to " + newblock.getNumBytes());
  }
  // Now perform the update
  // rename the old meta file to a tmp file
  File tmpMetaFile = new File(oldMetaFile.getParent(),
      oldMetaFile.getName() + "_tmp" + newblock.getGenerationStamp());
  if (!oldMetaFile.renameTo(tmpMetaFile)) {
    throw new IOException("Cannot rename block meta file to " + tmpMetaFile);
  }
  // if the new block is smaller than the old one, truncate the block file and its meta file
  if (newblock.getNumBytes() < oldblock.getNumBytes()) {
    truncateBlock(blockFile, tmpMetaFile, oldblock.getNumBytes(), newblock.getNumBytes());
  }
  // rename the tmp file to the new meta file (with new generation stamp)
  File newMetaFile = getMetaFile(blockFile, newblock);
  if (!tmpMetaFile.renameTo(newMetaFile)) {
    throw new IOException("Cannot rename tmp meta file to " + newMetaFile);
  }
  updateBlockMap(ongoingCreates, oldblock, newblock);
  updateBlockMap(volumeMap, oldblock, newblock);
  // paranoia! verify that the contents of the stored block
  // matches the block file on disk.
  validateBlockMetadata(newblock);
  return null;
}
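getActiveThreads, called at the top of tryUpdateBlock, consults ongoingCreates. A simplified sketch (close to, but not verbatim, the source):

// Look the block up in ongoingCreates, prune threads that already died,
// and return the still-live writers, or null if there are none.
private synchronized ArrayList<Thread> getActiveThreads(Block block) {
  ActiveFile activefile = ongoingCreates.get(block);
  if (activefile != null && !activefile.threads.isEmpty()) {
    for (Iterator<Thread> i = activefile.threads.iterator(); i.hasNext(); ) {
      if (!i.next().isAlive()) {
        i.remove();                       // drop writers that have exited
      }
    }
    if (!activefile.threads.isEmpty()) {
      return new ArrayList<Thread>(activefile.threads);
    }
  }
  return null;                            // no live writers: safe to update
}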
// truncateBlock truncates the old block file blockFile and its metadata file metaFile;
// after truncation the block's length is newlen (newlen < oldlen).
static void truncateBlock(File blockFile, File metaFile, long oldlen, long newlen) throws IOException {
  if (newlen == oldlen) {
    return;
  }
  if (newlen > oldlen) {
    throw new IOException("Cannot truncate block from oldlen (=" + oldlen
        + ") to newlen (=" + newlen + ")");
  }
  if (newlen == 0) {
    // Special case for truncating to 0 length, since there's no previous
    // chunk.
    RandomAccessFile blockRAF = new RandomAccessFile(blockFile, "rw");
    try {
      // truncate blockFile
      blockRAF.setLength(newlen);
    } finally {
      blockRAF.close();
    }
    // update metaFile
    RandomAccessFile metaRAF = new RandomAccessFile(metaFile, "rw");
    try {
      metaRAF.setLength(BlockMetadataHeader.getHeaderSize());
    } finally {
      metaRAF.close();
    }
    return;
  }
  // Because the block is simply cut short, the last checksum chunk of the new
  // block may differ from the old one, so after setLength the last chunk must
  // be re-read and its checksum recomputed.
  DataChecksum dcs = BlockMetadataHeader.readHeader(metaFile).getChecksum();
  int checksumsize = dcs.getChecksumSize();
  int bpc = dcs.getBytesPerChecksum();
  long newChunkCount = (newlen - 1)/bpc + 1;   // number of checksum chunks
  long newmetalen = BlockMetadataHeader.getHeaderSize() + newChunkCount*checksumsize; // new length of the meta file
  long lastchunkoffset = (newChunkCount - 1)*bpc; // offset of the last chunk in the block file
  int lastchunksize = (int)(newlen - lastchunkoffset); // size of the last (possibly partial) chunk
  byte[] b = new byte[Math.max(lastchunksize, checksumsize)];
  RandomAccessFile blockRAF = new RandomAccessFile(blockFile, "rw"); // open the old block file
  try {
    // truncate blockFile
    blockRAF.setLength(newlen);
    // read last chunk
    blockRAF.seek(lastchunkoffset);
    blockRAF.readFully(b, 0, lastchunksize);
  } finally {
    blockRAF.close();
  }
  // compute checksum
  dcs.update(b, 0, lastchunksize);
  dcs.writeValue(b, 0, false);
  // update metaFile
  RandomAccessFile metaRAF = new RandomAccessFile(metaFile, "rw");
  try {
    metaRAF.setLength(newmetalen);
    metaRAF.seek(newmetalen - checksumsize);
    metaRAF.write(b, 0, checksumsize);
  } finally {
    metaRAF.close();
  }
}
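A concrete example of the chunk arithmetic: with bytesPerChecksum = 512 and checksumSize = 4, truncating a block to newlen = 1000 gives newChunkCount = (1000 - 1)/512 + 1 = 2, lastchunkoffset = (2 - 1)*512 = 512, and lastchunksize = 1000 - 512 = 488; the meta file is cut to headerSize + 2*4 bytes, and the 488-byte tail chunk is re-read so that its 4-byte checksum can be recomputed, since truncation changed that chunk's contents.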
/**
 * Commits (finalizes) a block opened via writeToBlock: the write completed
 * without error, so the block can be moved from the tmp folder into the
 * current folder for good. In FSDataset, finalizeBlock removes the block from
 * ongoingCreates and puts the block's DatanodeBlockInfo into volumeMap.
 * Taking blk_3148782637964391313 as an example again: when the DataNode
 * finalizes the data block with block ID 3148782637964391313, it moves
 * tmp/blk_3148782637964391313 into some directory under current/, say
 * subdir12, so the file ends up as current/subdir12/blk_3148782637964391313.
 * The corresponding meta file is placed in current/subdir12 as well.
 */
@Override
public void finalizeBlock(Block b) throws IOException {
finalizeBlockInternal(b, false);
}
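finalizeBlockInternal is not reproduced in this note; sketched from the description above (close to, but not verbatim, the source), its core is:

// Move the finished block from tmp/ into the FSDir tree (current/...) and
// update the in-memory maps accordingly.
private synchronized void finalizeBlockInternal(Block b, boolean reFinalizeOk) throws IOException {
  ActiveFile activeFile = ongoingCreates.get(b);
  if (activeFile == null) {
    if (reFinalizeOk) {
      return;                              // already finalized; tolerated on this path
    }
    throw new IOException("Block " + b + " is already finalized.");
  }
  File f = activeFile.file;
  if (f == null || !f.exists()) {
    throw new IOException("No temporary file " + f + " for block " + b);
  }
  FSVolume v = volumeMap.get(b).getVolume();
  File dest = v.addBlock(b, f);            // FSDir picks a spot under current/
  volumeMap.put(b, new DatanodeBlockInfo(v, dest));
  ongoingCreates.remove(b);                // no longer under construction
}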
/**
 * Start writing to a block file. If isRecovery is true and the block
 * pre-exists, then we kill all other threads that might be writing to this
 * block, and then reopen the file. If replicationRequest is true, then this
 * operation is part of a block replication request.
 *
 * If the block's data file already exists and this is a recovery, detach the
 * data file from any files hard-linked to it. Recovery is needed in two
 * cases: the client re-opened the connection and is resending packets, or
 * data is being appended to the block.
 * If the block is in ongoingCreates, remove it.
 * If this is not a recovery, pick a volume to hold the block, create a
 * temporary file, and add the block to volumeMap.
 * For a recovery: if the block's temporary file exists, reuse it and add the
 * block to volumeMap; if it does not exist, move the block's data file and
 * its metadata file into the temporary directory and add the block to volumeMap.
 * Add the block to ongoingCreates.
 * Return the block file's output streams.
 */
public BlockWriteStreams writeToBlock(Block b, boolean isRecovery,
boolean replicationRequest) throws IOException {
//
// Make sure the block isn't a valid one - we're still creating it!
//
if (isValidBlock(b)) {
if (!isRecovery) {
throw new BlockAlreadyExistsException("Block " + b + " is valid, and cannot be written to.");
}
// If the block was successfully finalized because all packets
// were successfully processed at the Datanode but the ack for
// some of the packets were not received by the client. The client
// re-opens the connection and retries sending those packets.
// The other reason is that an "append" is occurring to this block.
detachBlock(b, 1);
}
long blockSize = b.getNumBytes();
//
// Serialize access to /tmp, and check if file already there.
//
File f = null;
List<Thread> threads = null;
synchronized (this) {
//
// Is it already in the create process?
//
ActiveFile activeFile = ongoingCreates.get(b);
if (activeFile != null) {
f = activeFile.file;
threads = activeFile.threads;
if (!isRecovery) {
throw new BlockAlreadyExistsException("Block " + b +
" has already been started (though not completed), and thus cannot be created.");
} else {
for (Thread thread:threads) {
thread.interrupt();
}
}
ongoingCreates.remove(b);
}
FSVolume v = null;
if (!isRecovery) {
v = volumes.getNextVolume(blockSize);
// create temporary file to hold block in the designated volume
f = createTmpFile(v, b, replicationRequest);
} else if (f != null) {
DataNode.LOG.info("Reopen already-open Block for append " + b);
// create or reuse temporary file to hold block in the designated volume
v = volumeMap.get(b).getVolume();
volumeMap.put(b, new DatanodeBlockInfo(v, f));
} else {
// reopening block for appending to it.
DataNode.LOG.info("Reopen Block for append " + b);
v = volumeMap.get(b).getVolume();
f = createTmpFile(v, b, replicationRequest);
File blkfile = getBlockFile(b);
File oldmeta = getMetaFile(b);
File newmeta = getMetaFile(f, b);
// rename meta file to tmp directory
DataNode.LOG.debug("Renaming " + oldmeta + " to " + newmeta);
if (!oldmeta.renameTo(newmeta)) {
throw new IOException("Block " + b + " reopen failed. " +
" Unable to move meta file " + oldmeta +
" to tmp dir " + newmeta);
}
// rename block file to tmp directory
DataNode.LOG.debug("Renaming " + blkfile + " to " + f);
if (!blkfile.renameTo(f)) {
if (!f.delete()) {
throw new IOException("Block " + b + " reopen failed. " +
" Unable to remove file " + f);
}
if (!blkfile.renameTo(f)) {
throw new IOException("Block " + b + " reopen failed. " +
" Unable to move block file " + blkfile +
" to tmp dir " + f);
}
}
}
if (f == null) {
DataNode.LOG.warn("Block " + b + " reopen failed " +
" Unable to locate tmp file.");
throw new IOException("Block " + b + " reopen failed " +
" Unable to locate tmp file.");
}
// If this is a replication request, then this is not a permanent
// block yet, it could get removed if the datanode restarts. If this
// is a write or append request, then it is a valid block.
if (replicationRequest) {
volumeMap.put(b, new DatanodeBlockInfo(v));
} else {
volumeMap.put(b, new DatanodeBlockInfo(v, f));
}
ongoingCreates.put(b, new ActiveFile(f, threads));
}
try {
if (threads != null) {
for (Thread thread:threads) {
thread.join();
}
}
} catch (InterruptedException e) {
throw new IOException("Recovery waiting for thread interrupted.");
}
//
// Finally, allow a writer to the block file
// REMIND - mjc - make this a filter stream that enforces a max
// block size, so clients can't go crazy
//
File metafile = getMetaFile(f, b);
DataNode.LOG.debug("writeTo blockfile is " + f + " of size " + f.length());
DataNode.LOG.debug("writeTo metafile is " + metafile + " of size " + metafile.length());
return createBlockWriteStreams( f , metafile);
}
The detach technique:
When the system is upgraded, a snapshot is created. The files in the snapshot and the block data and metadata files in current are hard links pointing at the same content. If a file in current were modified without a detach, the change would also show through in the snapshot's file, so the hard link has to be broken first. The method is simple: copy the file in a temporary folder, then rename the temporary file to the corresponding name in current; after that, the file in current and the file in the snapshot are detached. This technique is also known as copy-on-write and is an effective way to improve system performance. detachBlock in DatanodeBlockInfo performs this detach operation on a block's data file and metadata file, as sketched below.
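A sketch of the copy step behind detach (modeled on DatanodeBlockInfo.detachFile; createDetachFile is assumed here to create an empty staging file under detach/):

// Copy the shared, hard-linked file into detach/, verify the copy, then
// rename the private copy back over the original name, which breaks the
// hard link with the snapshot.
private void detachFile(File file, Block b) throws IOException {
  File tmpFile = volume.createDetachFile(b, file.getName());
  IOUtils.copyBytes(new FileInputStream(file),
                    new FileOutputStream(tmpFile), 16 * 1024, true);
  if (file.length() != tmpFile.length()) {
    throw new IOException("Copy of file " + file + " into " + tmpFile
        + " changed its size from " + file.length() + " to " + tmpFile.length());
  }
  if (!tmpFile.renameTo(file)) {
    throw new IOException("Unable to rename " + tmpFile + " to " + file);
  }
}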
(3) DataStorage
DataStorage manages all local storage paths in a unified way, i.e. it manages the StorageDirectory objects; it does not manage the concrete data files inside those paths.
(4) FSVolumeSet
FSVolumeSet manages all FSVolume objects, which in practice means managing all the storage paths. Its main job for the upper layer (the DataNode process) is to choose a storage path (partition) for a data block, i.e. to create the local disk file for that block (see the getNextVolume sketch below); it is also responsible for collecting statistics on storage space usage and for gathering all block information.
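Volume selection is a simple round-robin over the configured storage directories; a sketch of getNextVolume (close to, but not verbatim, the source):

// Advance a cursor over the volumes and return the first one with enough
// free space for the new block; fail once every volume has been tried.
synchronized FSVolume getNextVolume(long blockSize) throws IOException {
  int startVolume = curVolume;
  while (true) {
    FSVolume volume = volumes[curVolume];
    curVolume = (curVolume + 1) % volumes.length;   // round-robin cursor
    if (volume.getAvailable() > blockSize) {
      return volume;
    }
    if (curVolume == startVolume) {                 // wrapped around: all full
      throw new DiskOutOfSpaceException("Insufficient space for an additional block");
    }
  }
}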
(5) FSVolume
FSVolume manages block files and tracks the usage of its storage directory.
private File currentDir;          // the current/ directory
private File blocksBeingWritten;  // clients write here
private FSDir dataDir;            // final home of valid data blocks (current/)
private File tmpDir;              // intermediate home of data blocks being written (tmp/)
private File detachDir;           // copy-on-write staging for data blocks (detach/)
private DF usage;                 // reports space on the disk partition holding this directory
private DU dfsUsage;              // reports space consumed by this storage directory
private long reserved;            // reserved storage space in bytes
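reserved, DF and DU combine when the volume reports its free space; sketched from FSVolume (close to, but not verbatim, the source):

// Capacity excludes the reserved bytes. The space available for new blocks
// is bounded both by capacity minus what DFS already uses (DU) and by what
// the partition actually has free (DF).
long getCapacity() throws IOException {
  if (reserved > usage.getCapacity()) {
    return 0;
  }
  return usage.getCapacity() - reserved;
}

long getAvailable() throws IOException {
  long remaining = getCapacity() - getDfsUsed();  // getDfsUsed() reads dfsUsage (DU)
  long available = usage.getAvailable();          // free space on the partition (DF)
  if (remaining > available) {
    remaining = available;
  }
  return (remaining > 0) ? remaining : 0;
}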
(6) FSDir
An FSDir corresponds to one directory in HDFS storage; the directory holds block data files and their meta files. By default each directory has at most 64 subdirectories and can store at most 64 blocks. When a directory is initialized, the directories and files beneath it are scanned recursively, producing a tree structure. When a data block arrives at the DataNode, the DataNode does not immediately pick a final home for it under current/; instead it first stores the block under the storage path's tmp/ subdirectory, and only once the block has been received successfully is it moved into the appropriate directory under current/. The DataNode first stores blocks directly in the current/ subdirectory of the storage path; once current/ holds maxBlocksPerDir blocks, it creates maxBlocksPerDir subdirectories under current/ and picks one of them to store the next block. If the chosen subdirectory has also filled up with maxBlocksPerDir blocks, maxBlocksPerDir subdirectories are created beneath it and one of those is chosen, and so on recursively, until the storage path's remaining space can no longer hold a block. maxBlocksPerDir defaults to 64 and can be set in the DataNode's configuration via the option dfs.datanode.numblocks. A sketch of the placement logic follows the field list below.
File dir;             // this FSDir's directory (current/ or a subdirNN under it)
int numBlocks = 0;    // number of data blocks currently stored directly in this directory
FSDir children[];     // subdirectories of this directory
int lastChildIdx = 0; // index of the child that stored the previous data block
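A simplified sketch of the placement logic (modeled on FSDir.addBlock; the real method also retries the last-used child before recursing):

// Place the block in this directory if there is room; otherwise lazily
// create the subdirNN children and recurse into one of them.
private File addBlock(Block b, File src) throws IOException {
  if (numBlocks < maxBlocksPerDir) {
    // room here: move the data file and its meta file in, side by side
    File dest = new File(dir, b.getBlockName());
    File oldmeta = FSDataset.getMetaFile(src, b);
    File newmeta = FSDataset.getMetaFile(dest, b);
    if (!oldmeta.renameTo(newmeta) || !src.renameTo(dest)) {
      throw new IOException("could not move files for " + b + " into " + dir);
    }
    numBlocks += 1;
    return dest;
  }
  if (children == null) {
    // this level is full: create maxBlocksPerDir subdirectories once
    children = new FSDir[maxBlocksPerDir];
    for (int idx = 0; idx < maxBlocksPerDir; idx++) {
      children[idx] = new FSDir(new File(dir, DataStorage.BLOCK_SUBDIR_PREFIX + idx));
    }
  }
  // round-robin over the children and recurse
  lastChildIdx = (lastChildIdx + 1) % children.length;
  return children[lastChildIdx].addBlock(b, src);
}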
(7) BlockAndFile ActiveFile
BlockAndFile: a block's information together with its file. ActiveFile: a file that is currently being written, as sketched below.
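A sketch of ActiveFile (close to, but not verbatim, the source):

// An entry in ongoingCreates: the temporary file plus the threads writing
// it. Besides any threads inherited from a recovery attempt, the thread
// that opens the block registers itself.
static class ActiveFile {
  final File file;
  final List<Thread> threads = new ArrayList<Thread>(2);

  ActiveFile(File f, List<Thread> list) {
    file = f;
    if (list != null && !list.isEmpty()) {
      threads.addAll(list);
    }
    threads.add(Thread.currentThread());
  }
}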
(8) DatanodeBlockInfo
DatanodeBlockInfo stores a Block's location in the local file system: the volume (FSVolume) the block is stored on, its file name, and its detach state.
private FSVolume volume; // volume where the block belongs
private File file; // block file
private boolean detached; // copy-on-write done for block