Chapter 7: Xiao Zhu's Hadoop Source Code Notes - HDFS Analysis
Section 3: HDFS Implementation Analysis

3.4 DataNode Data Structures

The Storage-related classes describe, at a macro level, how each storage directory is organized. They manage the directories configured through the HDFS property dfs.data.dir, such as the current, previous, detach, tmp and storage directories and files, and they define the operations on the storage as a whole.
The Dataset-related classes describe how block files and their metadata files are organized.

All block-related operations are handled by the FSDataset family of classes. From coarse to fine, the storage structure consists of volumes (FSVolume), directories (FSDir), and files (block files, metadata files, etc.). A typical on-disk layout is sketched below.
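To make the hierarchy concrete, a typical layout under one dfs.data.dir directory looks roughly like this (directory names follow the Hadoop 1.x defaults; the block ID and subdir indices are illustrative):

  ${dfs.data.dir}/
    storage                            <- marker file for old-layout detection
    in_use.lock                        <- lock file held by the running DataNode
    current/                           <- finalized blocks, managed as an FSDir tree
      VERSION
      blk_3148782637964391313
      blk_3148782637964391313_1001.meta
      subdir0/ ... subdir63/           <- created once a directory fills up
    tmp/                               <- blocks currently being written
    detach/                            <- copy-on-write staging used during upgrades
    previous/                          <- only present after an upgrade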

 

(1) FSDatasetInterface

     FSDatasetInterface is the DataNode's abstraction over the underlying storage. Its main methods are listed below, followed by a short usage sketch.

    

getMetaDataLength(Block b): returns the length of the block's metadata file
getMetaDataInputStream(Block b): returns an input stream over the block's metadata file
metaFileExists(Block b): checks whether the block's metadata file exists
getLength(Block b): returns the length of the block's data file
getStoredBlock(long blkid): looks up the stored block's information by block ID
getBlockInputStream(long blkid): returns an input stream over the block's data file
getBlockInputStream(Block b, long seekOffset): returns an input stream positioned at the given offset in the block's data file
getTmpInputStreams(Block b, long blkoff, long ckoff): returns input streams positioned inside the block's data and checksum files while the block still lives in the temporary directory
writeToBlock(Block b, boolean isRecovery, boolean replicationRequest): creates the block file and returns output streams for writing to it
updateBlock(Block oldblock, Block newblock): updates a block's generation stamp and/or length
finalizeBlock(Block b): completes the write to a block
unfinalizeBlock(Block b): aborts the write to a block and deletes its temporary files
getBlockReport(): returns the block report
isValidBlock(Block b): checks whether the block is valid
invalidate(Block invalidBlks[]): invalidates the given blocks and deletes their on-disk files
checkDataDir(): checks whether the storage directories are healthy
shutdown(): shuts down the FSDataset
getChannelPosition(Block b, BlockWriteStreams stream): returns the current position in the block's output streams
setChannelPosition(Block b, BlockWriteStreams stream, long dataOffset, long ckOffset): sets the position in the block's output streams
validateBlockMetadata(Block b): validates the block's metadata file
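
As a quick orientation before diving into FSDataset, here is a minimal, hedged sketch of how a caller could stream a block's bytes through this interface; dumpBlock is a hypothetical helper, and only methods listed above are used (error handling is elided):

  //hypothetical helper: copy one block's data out of an FSDatasetInterface
  void dumpBlock(FSDatasetInterface dataset, Block b, OutputStream out)
      throws IOException {
    if (!dataset.isValidBlock(b)) {
      throw new IOException("Block " + b + " is not valid");
    }
    long len = dataset.getLength(b);                     //size of the data file
    InputStream in = dataset.getBlockInputStream(b, 0);  //stream from offset 0
    try {
      byte[] buf = new byte[64 * 1024];
      long remaining = len;
      int n;
      while (remaining > 0 &&
             (n = in.read(buf, 0, (int) Math.min(buf.length, remaining))) > 0) {
        out.write(buf, 0, n);
        remaining -= n;
      }
    } finally {
      in.close();
    }
  }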
 

 

(2) FSDataset

FSDataset implements the FSDatasetInterface interface; all block-related operations are handled inside the FSDataset family of classes. Key member variables:

 

  //all storage volumes (FSVolumes) used by this FSDataset
  FSVolumeSet volumes;
  //map from Block to ActiveFile; every block that is currently being created is recorded in ongoingCreates
  private HashMap<Block,ActiveFile> ongoingCreates = new HashMap<Block,ActiveFile>();
  //maximum number of blocks a single directory may hold
  private int maxBlocksPerDir = 0;
  //in-memory map from a block to the FSVolume and the concrete file that store it on this DataNode
  HashMap<Block,DatanodeBlockInfo> volumeMap = new HashMap<Block, DatanodeBlockInfo>();
  

Key methods:

 

  //look up the stored Block by its block ID
  public synchronized Block getStoredBlock(long blkid) throws IOException {
    File blockfile = findBlockFile(blkid);
    if (blockfile == null) {
      return null;
    }
    File metafile = findMetaFile(blockfile);
    Block block = new Block(blkid);
    return new Block(blkid, getVisibleLength(block),
        parseGenerationStamp(blockfile, metafile));
  }

  //check whether the block's metadata file exists
  public boolean metaFileExists(Block b) throws IOException {
    return getMetaFile(b).exists();
  }
  
  //return the length of a block's metadata: locate the metadata file by the block's ID and return the file's length
  public long getMetaDataLength(Block b) throws IOException {
    File checksumFile = getMetaFile( b );
    return checksumFile.length();
  }

  //return an input stream over a block's metadata: locate the metadata file by the block's ID and open a stream on it
  public MetaDataInputStream getMetaDataInputStream(Block b)
      throws IOException {
    File checksumFile = getMetaFile( b );
    return new MetaDataInputStream(new FileInputStream(checksumFile),
                                                    checksumFile.length());
  }
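
For orientation, the metadata ("checksum") file sits next to the block file, and its name embeds the generation stamp; a hedged sketch of the naming convention that getMetaFile relies on (metaFileName is a hypothetical helper, but the blk_<id>_<genstamp>.meta pattern matches Hadoop 1.x):

  //data file:  blk_<blockId>
  //meta file:  blk_<blockId>_<generationStamp>.meta
  static String metaFileName(long blockId, long generationStamp) {
    return "blk_" + blockId + "_" + generationStamp + ".meta";
  }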


/**
   * Returns handles to the block file and its metadata file.
   * Returns temporary input streams for the block; "temporary" means the
   * underlying files still live in the tmp directory. A newly created block is
   * written under tmp, and the file is only moved into the current directory
   * once the write succeeds; if the write fails, current is never affected.
   */
  public synchronized BlockInputStreams getTmpInputStreams(Block b, 
                          long blkOffset, long ckoff) throws IOException {

    DatanodeBlockInfo info = volumeMap.get(b);
    if (info == null) {
      throw new IOException("Block " + b + " does not exist in volumeMap.");
    }
    FSVolume v = info.getVolume();
    File blockFile = info.getFile();
    //a newly created block is written under the tmp directory
    if (blockFile == null) {
      blockFile = v.getTmpFile(b);
    }
    RandomAccessFile blockInFile = new RandomAccessFile(blockFile, "r");
    if (blkOffset > 0) {
      blockInFile.seek(blkOffset);
    }
    File metaFile = getMetaFile(blockFile, b);
    RandomAccessFile metaInFile = new RandomAccessFile(metaFile, "r");
    if (ckoff > 0) {
      metaInFile.seek(ckoff);
    }
    return new BlockInputStreams(new FileInputStream(blockInFile.getFD()),
                                new FileInputStream(metaInFile.getFD()));
  }


/** {@inheritDoc} 
   * 
   * The outer layer of updateBlock is an infinite loop whose exit condition is
   * that no write thread remains attached to this block. Each iteration calls
   * the internal method tryUpdateBlock. If tryUpdateBlock finds that no thread
   * is still writing the block, it updates everything associated with the
   * block, including the metadata file and the in-memory volumeMap. If it finds
   * active threads still attached to the block, updateBlock interrupts them and
   * waits for them via join before retrying. A sketch of the
   * interruptAndJoinThreads helper appears after this method.
   */
  public void updateBlock(Block oldblock, Block newblock) throws IOException {
    if (oldblock.getBlockId() != newblock.getBlockId()) {
      throw new IOException("Cannot update oldblock (=" + oldblock
          + ") to newblock (=" + newblock + ").");
    }


    // Protect against a straggler updateblock call moving a block backwards
    // in time.
    boolean isValidUpdate =
      (newblock.getGenerationStamp() > oldblock.getGenerationStamp()) ||
      (newblock.getGenerationStamp() == oldblock.getGenerationStamp() &&
       newblock.getNumBytes() == oldblock.getNumBytes());

    if (!isValidUpdate) {
      throw new IOException(
        "Cannot update oldblock=" + oldblock +
        " to newblock=" + newblock + " since generation stamps must " +
        "increase, or else length must not change.");
    }

    
    for(;;) {
      final List<Thread> threads = tryUpdateBlock(oldblock, newblock);
      if (threads == null) {
        return;
      }
     
      interruptAndJoinThreads(threads);
    }
  }
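
interruptAndJoinThreads is not shown in this excerpt; roughly, it interrupts every remaining writer thread and waits for each to exit. A hedged sketch of that helper:

  //interrupt all ongoing create threads of a block and wait for them to die;
  //returns false if we were ourselves interrupted while waiting
  private boolean interruptAndJoinThreads(List<Thread> threads) {
    for (Thread t : threads) {
      t.interrupt();
    }
    for (Thread t : threads) {
      try {
        t.join();
      } catch (InterruptedException e) {
        DataNode.LOG.warn("interruptAndJoinThreads: t=" + t, e);
        return false;
      }
    }
    return true;
  }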

  /**
   * Try to update an old block to a new block.
   * If there are ongoing create threads running for the old block,
   * the threads will be returned without updating the block. 
   * Truncates the old block into the new one (via truncateBlock), truncates the
   * corresponding metadata file, and updates ongoingCreates and volumeMap.
   * @return ongoing create threads if there is any. Otherwise, return null.
   */
  private synchronized List<Thread> tryUpdateBlock(Block oldblock, Block newblock) throws IOException {
    //check ongoing create threads: collect the threads that are still accessing this block
    ArrayList<Thread> activeThreads = getActiveThreads(oldblock);
    if (activeThreads != null) {
      return activeThreads; //threads are still operating on this block; return them to the caller
    }
    //no thread has recently operated on this block, so update it

    //No ongoing create threads is alive. Update block.
    //locate the old block's file
    File blockFile = findBlockFile(oldblock.getBlockId());
    if (blockFile == null) {
      throw new IOException("Block " + oldblock + " does not exist.");
    }

    File oldMetaFile = findMetaFile(blockFile);
    long oldgs = parseGenerationStamp(blockFile, oldMetaFile);
    
    // First validate the update
    
    //update generation stamp
    //invalid: the old block's generation stamp is greater than the new block's
    if (oldgs > newblock.getGenerationStamp()) {
      throw new IOException("Cannot update block (id=" + newblock.getBlockId()
          + ") generation stamp from " + oldgs
          + " to " + newblock.getGenerationStamp());
    }
    
    //update length
    //invalid: the new block is larger than the old block
    if (newblock.getNumBytes() > oldblock.getNumBytes()) {
      throw new IOException("Cannot update block file (=" + blockFile
          + ") length from " + oldblock.getNumBytes() + " to " + newblock.getNumBytes());
    }

    // Now perform the update

    //rename meta file to a tmp file
    //rename the old block's metadata file out of the way
    File tmpMetaFile = new File(oldMetaFile.getParent(),oldMetaFile.getName()+"_tmp" + newblock.getGenerationStamp());
    if (!oldMetaFile.renameTo(tmpMetaFile)){
      throw new IOException("Cannot rename block meta file to " + tmpMetaFile);
    }

    //the new block is shorter than the old one: truncate the block file and its metadata file
    if (newblock.getNumBytes() < oldblock.getNumBytes()) {
      truncateBlock(blockFile, tmpMetaFile, oldblock.getNumBytes(), newblock.getNumBytes());
    }

    //rename the tmp file to the new meta file (with new generation stamp)
    File newMetaFile = getMetaFile(blockFile, newblock);
    if (!tmpMetaFile.renameTo(newMetaFile)) {
      throw new IOException("Cannot rename tmp meta file to " + newMetaFile);
    }

    updateBlockMap(ongoingCreates, oldblock, newblock);
    updateBlockMap(volumeMap, oldblock, newblock);

    // paranoia! verify that the contents of the stored block 
    // matches the block file on disk.
    validateBlockMetadata(newblock);
    return null;
  }
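
updateBlockMap, called near the end of tryUpdateBlock, just re-keys a map entry from the old block to the new one; a hedged sketch:

  //move the value keyed by oldblock to newblock, if such an entry exists
  static private <T> void updateBlockMap(Map<Block, T> blockmap,
      Block oldblock, Block newblock) throws IOException {
    if (blockmap.containsKey(oldblock)) {
      T value = blockmap.remove(oldblock);
      blockmap.put(newblock, value);
    }
  }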

  //truncateBlock truncates the block file blockFile and its metadata file metaFile, so that the block's new length is newlen (newlen < oldlen).
  static void truncateBlock(File blockFile, File metaFile,long oldlen, long newlen) throws IOException {
    if (newlen == oldlen) {
      return;
    }
    if (newlen > oldlen) {
      throw new IOException("Cannot truncate block to from oldlen (=" + oldlen
          + ") to newlen (=" + newlen + ")");
    }

    if (newlen == 0) {
      // Special case for truncating to 0 length, since there's no previous
      // chunk.
      RandomAccessFile blockRAF = new RandomAccessFile(blockFile, "rw");
      try {
        //truncate blockFile 
        blockRAF.setLength(newlen);   
      } finally {
        blockRAF.close();
      }
      //update metaFile 
      RandomAccessFile metaRAF = new RandomAccessFile(metaFile, "rw");
      try {
        metaRAF.setLength(BlockMetadataHeader.getHeaderSize());
      } finally {
        metaRAF.close();
      }
      return;
    }
    
    //Because we only truncate the old block, the last checksum chunk of the
    //new block may differ from the old one; so after setLength truncates the
    //file, we must re-read the last chunk and recompute its checksum.
    DataChecksum dcs = BlockMetadataHeader.readHeader(metaFile).getChecksum(); 
    int checksumsize = dcs.getChecksumSize();
    int bpc = dcs.getBytesPerChecksum();
    long newChunkCount = (newlen - 1)/bpc + 1;//number of checksum chunks
    long newmetalen = BlockMetadataHeader.getHeaderSize() + newChunkCount*checksumsize;//new metadata file length
    long lastchunkoffset = (newChunkCount - 1)*bpc;//offset of the last checksum chunk
    int lastchunksize = (int)(newlen - lastchunkoffset); //size of the last data chunk
    byte[] b = new byte[Math.max(lastchunksize, checksumsize)]; 

    RandomAccessFile blockRAF = new RandomAccessFile(blockFile, "rw");//open the block file to truncate it and read back the last chunk
    try {
      //truncate blockFile 
      blockRAF.setLength(newlen);
 
      //read last chunk
      blockRAF.seek(lastchunkoffset);
      blockRAF.readFully(b, 0, lastchunksize);
    } finally {
      blockRAF.close();
    }

    //compute checksum
    dcs.update(b, 0, lastchunksize);
    dcs.writeValue(b, 0, false);

    //update metaFile 
    RandomAccessFile metaRAF = new RandomAccessFile(metaFile, "rw");
    try {
      metaRAF.setLength(newmetalen);
      metaRAF.seek(newmetalen - checksumsize);
      metaRAF.write(b, 0, checksumsize);
    } finally {
      metaRAF.close();
    }
  }
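
A worked example of the arithmetic above: with bpc = 512 bytes per checksum chunk, checksumsize = 4 and newlen = 1200, we get newChunkCount = (1200 - 1)/512 + 1 = 3, lastchunkoffset = 2 * 512 = 1024 and lastchunksize = 1200 - 1024 = 176; newmetalen is the header size plus 3 * 4 = 12 bytes of checksums. So the code truncates the block file to 1200 bytes, re-reads the last 176 data bytes, recomputes their 4-byte checksum, and overwrites the final checksum slot in the metadata file.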

 /**
   * Finalize (commit) a block that was opened through writeToBlock. This means
   * the write completed without error, and the block can be moved from the tmp
   * directory into the current directory. In FSDataset, finalizeBlock removes
   * the block from ongoingCreates and puts the block's DatanodeBlockInfo into
   * volumeMap. Taking blk_3148782637964391313 as an example: when the DataNode
   * finalizes the block with ID 3148782637964391313, it moves
   * tmp/blk_3148782637964391313 into some directory under current, say
   * current/subdir12, so the file ends up as
   * current/subdir12/blk_3148782637964391313; the corresponding meta file lands
   * in current/subdir12 as well.
   */
  @Override
  public void finalizeBlock(Block b) throws IOException {
    finalizeBlockInternal(b, false);
  }
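
finalizeBlockInternal is not shown in this excerpt; its core job is to move the temporary file into the FSDir tree of its volume and to fix up the two maps. A hedged sketch (not the verbatim source):

  private synchronized void finalizeBlockInternal(Block b, boolean reFinalizeOk)
      throws IOException {
    ActiveFile activeFile = ongoingCreates.get(b);
    if (activeFile == null) {
      if (reFinalizeOk) {
        return;                      //already finalized, which is acceptable here
      }
      throw new IOException("Block " + b + " is already finalized.");
    }
    File f = activeFile.file;
    FSVolume v = volumeMap.get(b).getVolume();
    File dest = v.addBlock(b, f);    //move the tmp file (and meta file) under current/
    volumeMap.put(b, new DatanodeBlockInfo(v, dest));
    ongoingCreates.remove(b);        //the block is no longer "being created"
  }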


/**
   * Start writing to a block file.
   * If isRecovery is true and the block pre-exists, then we kill all
   * other threads that might be writing to this block, and then reopen the file.
   * If replicationRequest is true, then this operation is part of a block
   * replication request.
   * 
   * If the block data file already exists and this is a recovery, detach the
   * block file from any files hard-linked to it. Recovery is needed in two
   * cases: the client reopened the connection and is resending packets, or data
   * is being appended to the block.
   * If the block is in ongoingCreates, remove it.
   * If this is not a recovery, pick a volume to hold the block, create a
   * temporary file, and add the block to volumeMap.
   * For a recovery: if the block's temporary file exists, reuse it and add the
   * block to volumeMap; if it does not exist, move the block data file and its
   * metadata file into the temporary directory and add the block to volumeMap.
   * Add the block to ongoingCreates.
   * Return output streams for the block file.
   * 
   */
  public BlockWriteStreams writeToBlock(Block b, boolean isRecovery,
                           boolean replicationRequest) throws IOException {
    //
    // Make sure the block isn't a valid one - we're still creating it!
    //
    if (isValidBlock(b)) {
      if (!isRecovery) {
        throw new BlockAlreadyExistsException("Block " + b + " is valid, and cannot be written to.");
      }
      // If the block was successfully finalized because all packets
      // were successfully processed at the Datanode but the ack for
      // some of the packets were not received by the client. The client 
      // re-opens the connection and retries sending those packets.
      // The other reason is that an "append" is occurring to this block.
      detachBlock(b, 1);
    }
    long blockSize = b.getNumBytes();

    //
    // Serialize access to /tmp, and check if file already there.
    //
    File f = null;
    List<Thread> threads = null;
    synchronized (this) {
      //
      // Is it already in the create process?
      //
      ActiveFile activeFile = ongoingCreates.get(b);
      if (activeFile != null) {
        f = activeFile.file;
        threads = activeFile.threads;
        
        if (!isRecovery) {
          throw new BlockAlreadyExistsException("Block " + b +
                                  " has already been started (though not completed), and thus cannot be created.");
        } else {
          for (Thread thread:threads) {
            thread.interrupt();
          }
        }
        ongoingCreates.remove(b);
      }
      FSVolume v = null;
      if (!isRecovery) {
        v = volumes.getNextVolume(blockSize);
        // create temporary file to hold block in the designated volume
        f = createTmpFile(v, b, replicationRequest);
      } else if (f != null) {
        DataNode.LOG.info("Reopen already-open Block for append " + b);
        // create or reuse temporary file to hold block in the designated volume
        v = volumeMap.get(b).getVolume();
        volumeMap.put(b, new DatanodeBlockInfo(v, f));
      } else {
        // reopening block for appending to it.
        DataNode.LOG.info("Reopen Block for append " + b);
        v = volumeMap.get(b).getVolume();
        f = createTmpFile(v, b, replicationRequest);
        File blkfile = getBlockFile(b);
        File oldmeta = getMetaFile(b);
        File newmeta = getMetaFile(f, b);

        // rename meta file to tmp directory
        DataNode.LOG.debug("Renaming " + oldmeta + " to " + newmeta);
        if (!oldmeta.renameTo(newmeta)) {
          throw new IOException("Block " + b + " reopen failed. " +
                                " Unable to move meta file  " + oldmeta +
                                " to tmp dir " + newmeta);
        }

        // rename block file to tmp directory
        DataNode.LOG.debug("Renaming " + blkfile + " to " + f);
        if (!blkfile.renameTo(f)) {
          if (!f.delete()) {
            throw new IOException("Block " + b + " reopen failed. " +
                                  " Unable to remove file " + f);
          }
          if (!blkfile.renameTo(f)) {
            throw new IOException("Block " + b + " reopen failed. " +
                                  " Unable to move block file " + blkfile +
                                  " to tmp dir " + f);
          }
        }
      }
      if (f == null) {
        DataNode.LOG.warn("Block " + b + " reopen failed " +
                          " Unable to locate tmp file.");
        throw new IOException("Block " + b + " reopen failed " +
                              " Unable to locate tmp file.");
      }
      // If this is a replication request, then this is not a permanent
      // block yet, it could get removed if the datanode restarts. If this
      // is a write or append request, then it is a valid block.
      if (replicationRequest) {
        volumeMap.put(b, new DatanodeBlockInfo(v));
      } else {
        volumeMap.put(b, new DatanodeBlockInfo(v, f));
      }
      ongoingCreates.put(b, new ActiveFile(f, threads));
    }

    try {
      if (threads != null) {
        for (Thread thread:threads) {
          thread.join();
        }
      }
    } catch (InterruptedException e) {
      throw new IOException("Recovery waiting for thread interrupted.");
    }

    //
    // Finally, allow a writer to the block file
    // REMIND - mjc - make this a filter stream that enforces a max
    // block size, so clients can't go crazy
    //
    File metafile = getMetaFile(f, b);
    DataNode.LOG.debug("writeTo blockfile is " + f + " of size " + f.length());
    DataNode.LOG.debug("writeTo metafile is " + metafile + " of size " + metafile.length());
    return createBlockWriteStreams( f , metafile);
  }

 

 

 

The detach technique:
During an upgrade the system creates a snapshot. The snapshot's block files and block metadata files are hard links pointing at the same content as the files under current. If a file under current were modified without a detach, the modification would also show up in the snapshot's file, so the corresponding hard link must first be broken. The method is simple: copy the file into a temporary directory, then rename the temporary copy over the corresponding file in current; after that, the file in current and the file in the snapshot are detached. This technique is a form of copy-on-write and an effective way to improve system performance. detachBlock in DatanodeBlockInfo performs this detach operation on a block's data file and metadata file; a sketch follows.
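
A hedged sketch of that detach step, shaped after DatanodeBlockInfo.detachFile (createDetachFile and IOUtils.copyBytes are the helpers the real code relies on; details simplified):

  //break the hard link between a block file under current/ and its snapshot:
  //copy the bytes into detach/, then rename the copy over the original
  private void detachFile(File file, Block b) throws IOException {
    File tmpFile = volume.createDetachFile(b, file.getName()); //under detach/
    IOUtils.copyBytes(new FileInputStream(file),
                      new FileOutputStream(tmpFile), 16 * 1024, true);
    if (file.length() != tmpFile.length()) {
      throw new IOException("Copy of file " + file
          + " resulted in a different-length copy " + tmpFile);
    }
    //the rename replaces the directory entry, so the file under current/ no
    //longer shares an inode with the snapshot's hard link
    if (!tmpFile.renameTo(file)) {
      throw new IOException("Unable to rename " + tmpFile + " to " + file);
    }
  }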

 

(3) DataStorage

DataStorage manages all the local storage paths in a unified way, i.e. it manages the StorageDirectory objects; it does not manage the concrete data files inside those paths.

 

(4) FSVolumeSet

 

       FSVolumeSet manages all FSVolume objects, which in practice means managing all the storage paths. Its main job is to pick a storage path (partition) for a data block on behalf of the upper layer (the DataNode process), that is, to create a local disk file for that block; it is also responsible for aggregating storage-space statistics and for collecting information about all stored blocks. A sketch of the round-robin volume selection follows.
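
A hedged sketch of that selection logic, shaped after FSVolumeSet.getNextVolume in Hadoop 1.x (volumes and curVolume are the set's FSVolume array and round-robin cursor):

  //walk the volume array round-robin, starting after the previous pick, and
  //return the first volume with enough free space for the new block
  synchronized FSVolume getNextVolume(long blockSize) throws IOException {
    int startVolume = curVolume;
    while (true) {
      FSVolume volume = volumes[curVolume];
      curVolume = (curVolume + 1) % volumes.length;
      if (volume.getAvailable() > blockSize) {
        return volume;
      }
      if (curVolume == startVolume) {
        throw new DiskOutOfSpaceException("Insufficient space for an additional block");
      }
    }
  }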

 

(5) FSVolume

 

FSVolume manages the block files under one storage directory and tracks that directory's space usage; its fields and the free-space arithmetic are sketched below.

 

    private File currentDir;
    private File blocksBeingWritten;     // clients write here
    private FSDir dataDir;  //final home of valid blocks (current/)
    private File tmpDir;    //staging area for blocks being written (tmp/)
    private File detachDir; //copy-on-write area for blocks (detach/)
    private DF usage;       //space statistics for the disk partition holding this directory
    private DU dfsUsage;    //space used by this storage directory itself
    private long reserved;  //reserved space, not usable by HDFS
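
usage (df), dfsUsage (du) and reserved combine into the volume's free-space answer roughly as follows; a hedged sketch of getCapacity/getAvailable:

  //capacity this volume exposes to HDFS: the partition size minus the reserved slice
  long getCapacity() throws IOException {
    long remaining = usage.getCapacity() - reserved;
    return remaining > 0 ? remaining : 0;
  }

  //space still usable for new blocks: the capacity not yet consumed by HDFS,
  //capped by what the partition actually has available right now
  long getAvailable() throws IOException {
    long remaining = getCapacity() - dfsUsage.getUsed();
    long available = usage.getAvailable();
    return (remaining > available) ? available : remaining;
  }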
 

 

(6) FSDir

 

      FSDir corresponds to one directory in the DataNode's storage; the directory holds block files and their meta files. By default a directory contains at most 64 subdirectories and stores at most 64 blocks. When a directory is initialized, its subdirectories and files are scanned recursively, producing a tree structure.

      When a data block arrives at a DataNode, the DataNode does not immediately choose a final directory for it under current/; the block is first written to the tmp/ subdirectory of the storage path, and only after the block has been successfully received is it moved into a suitable directory under current/.

      The DataNode first stores blocks directly under current/. Once current/ holds maxBlocksPerDir blocks, it creates maxBlocksPerDir subdirectories under current/ and picks one of them to store the next block. If the chosen subdirectory has also accumulated maxBlocksPerDir blocks, another level of maxBlocksPerDir subdirectories is created beneath it and one of those is picked, and so on recursively, until the storage path no longer has room for another block. maxBlocksPerDir defaults to 64 and can be changed through the DataNode configuration option dfs.datanode.numblocks. A sketch of this placement logic follows the field list below.

 

 

    File dir;//this directory (current/ or one of its subdirectories)
    int numBlocks = 0;//number of blocks currently stored in this directory
    FSDir children[];//subdirectories of this directory
    int lastChildIdx = 0;//index of the child that received the previous block
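
A hedged, simplified sketch of the placement logic described above (the real FSDir.addBlock also moves the meta file, retries across children, and handles directory-creation failures):

  //place a finalized block file into this directory tree, spilling into
  //child directories once this directory already holds maxBlocksPerDir blocks
  File addBlock(Block b, File src) throws IOException {
    if (numBlocks < maxBlocksPerDir) {
      File dest = new File(dir, b.getBlockName());
      if (!src.renameTo(dest)) {
        throw new IOException("Failed to move " + src + " to " + dest);
      }
      numBlocks += 1;
      return dest;
    }
    if (children == null) {
      //first overflow: create maxBlocksPerDir subdirectories (subdir0, subdir1, ...)
      children = new FSDir[maxBlocksPerDir];
      for (int i = 0; i < maxBlocksPerDir; i++) {
        children[i] = new FSDir(new File(dir, "subdir" + i));
      }
    }
    //round-robin over the children, remembering where the last block went
    lastChildIdx = (lastChildIdx + 1) % children.length;
    return children[lastChildIdx].addBlock(b, src);
  }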

 

 

(7) BlockAndFile and ActiveFile

 

      BlockAndFile pairs a Block with the file that stores it; ActiveFile represents a block file that is currently being written. A sketch of ActiveFile follows.
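
A hedged sketch of ActiveFile's shape: it pins the temporary file together with the threads currently writing it, registering the constructing thread automatically:

  //a block file under construction plus the threads writing to it
  static class ActiveFile {
    final File file;
    final List<Thread> threads = new ArrayList<Thread>(2);

    ActiveFile(File f, List<Thread> list) {
      file = f;
      if (list != null) {
        threads.addAll(list);
      }
      threads.add(Thread.currentThread()); //the writer that just opened the file
    }
  }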

     

(8) DatanodeBlockInfo

     DatanodeBlockInfo records where a Block lives in the local file system: the volume (FSVolume) that stores it, its file name, and its detach state.
    
  private FSVolume volume;       // volume where the block belongs
  private File     file;         // block file
  private boolean detached;      // copy-on-write done for block