Hadoop Source Code Analysis Notes (13): NameNode -- Block and Datanode Management

Block and Datanode Management

The namenode maintains two important relationships for the HDFS file system:

1. The file system's directory tree, together with each file's block index, i.e. the list of blocks making up every file.

2. The mapping from blocks to datanodes, i.e. the information about which datanodes hold a given block.

For the second relationship, the "fsimage" file does not record which datanodes store each block. Instead, that mapping is built dynamically from the block lists datanodes report when they join the cluster. As with the first relationship, blocks and datanodes keep changing while the system runs, so this mapping changes constantly, and the namenode must track and maintain it to keep the system working correctly.

Data Structures

Compared with the namenode's first relationship, the block-to-datanode mapping is more complex and carries more state, so more classes are needed to model the related information. We will start from INodeFile and work through the data structures involved in the namenode's second relationship. INodeFile is a good starting point: its member variable blocks holds the blocks a file owns, which ties it directly to the second relationship.

Note that INodeFile.blocks is also the only member variable in the first relationship that touches the second one. Its element type is BlockInfo, an inner class of BlocksMap.

1. BlocksMap and DatanodeDescriptor

BlocksMap (the block map) manages the metadata of blocks on the namenode, including which INode a block belongs to and which datanodes store it. In other words, to locate the datanodes holding a given block, one only needs to consult the namenode's BlocksMap object.

DatanodeDescriptor (the datanode descriptor) is the namenode's abstraction of a datanode. It extends org.apache.hadoop.hdfs.protocol.DatanodeInfo, adding a great deal of datanode information that the namenode needs for its work.

Although BlocksMap holds the most important data of the namenode's second relationship, its implementation is not complicated: it uses org.apache.hadoop.hdfs.util.GSet, a rather special collection type, to store all the blocks managed by the namenode (in the member blocks). What makes GSet special is that although it is a set, it provides map-like operations such as blocks.get(), which returns the BlockInfo corresponding to a given block object.
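To make the map-like behaviour concrete, here is a minimal, hypothetical usage sketch (not from the Hadoop source; the generic signature GSet<K, E extends K> and the LightWeightGSet constructor taking a recommended capacity are as found in org.apache.hadoop.hdfs.util, while blk and blockInfo are illustrative variables):

// Hypothetical sketch: GSet used as a set with map-like lookups.
// The capacity value is illustrative only.
GSet<Block, BlockInfo> blocks = new LightWeightGSet<Block, BlockInfo>(1 << 16);
blocks.put(blockInfo);                  // store a BlockInfo (a BlockInfo is-a Block)
BlockInfo info = blocks.get(blk);       // look up the BlockInfo for an equal Block
boolean present = blocks.contains(blk); // membership test by key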

BlockInfo (block information) is an inner class of BlocksMap and a subclass of Block. It adds the INode the block belongs to and the datanodes that store it. The INode is kept in BlockInfo.inode, but the DatanodeDescriptor objects for the block's datanodes are stored in an unusual way: in an Object array named triplets, rather than in a DatanodeDescriptor array. Besides the datanode itself (the i-th datanode is stored in triplets[3*i]), the array also threads a doubly linked list through the other blocks on that datanode: triplets[3*i+1] holds the BlockInfo of the previous block in that datanode's block list, and triplets[3*i+2] the BlockInfo of the next one. In other words, by following triplets[3*i+1] or triplets[3*i+2], one can traverse the BlockInfo of every block a datanode owns. The code is as follows:

       

class BlocksMap {
        
  /**
   * Internal class for block metadata.
   */
  static class BlockInfo extends Block implements LightWeightGSet.LinkedElement {
    private INodeFile          inode;

    /** For implementing {@link LightWeightGSet.LinkedElement} interface */
    private LightWeightGSet.LinkedElement nextLinkedElement;

    /**
     * This array contains triplets of references.
     * For each i-th data-node the block belongs to
     * triplets[3*i] is the reference to the DatanodeDescriptor
     * and triplets[3*i+1] and triplets[3*i+2] are references 
     * to the previous and the next blocks, respectively, in the 
     * list of blocks belonging to this data-node.
     */
    private Object[] triplets;

    BlockInfo(Block blk, int replication) {
      super(blk);
      this.triplets = new Object[3*replication];
      this.inode = null;
    }

    INodeFile getINode() {
      return inode;
    }

    DatanodeDescriptor getDatanode(int index) {
      assert this.triplets != null : "BlockInfo is not initialized";
      assert index >= 0 && index*3 < triplets.length : "Index is out of bound";
      DatanodeDescriptor node = (DatanodeDescriptor)triplets[index*3];
      assert node == null || 
          DatanodeDescriptor.class.getName().equals(node.getClass().getName()) : 
                "DatanodeDescriptor is expected at " + index*3;
      return node;
    }

    BlockInfo getPrevious(int index) {
      assert this.triplets != null : "BlockInfo is not initialized";
      assert index >= 0 && index*3+1 < triplets.length : "Index is out of bound";
      BlockInfo info = (BlockInfo)triplets[index*3+1];
      assert info == null || 
          BlockInfo.class.getName().equals(info.getClass().getName()) : 
                "BlockInfo is expected at " + index*3;
      return info;
    }

    BlockInfo getNext(int index) {
      assert this.triplets != null : "BlockInfo is not initialized";
      assert index >= 0 && index*3+2 < triplets.length : "Index is out of bound";
      BlockInfo info = (BlockInfo)triplets[index*3+2];
      assert info == null || 
          BlockInfo.class.getName().equals(info.getClass().getName()) : 
                "BlockInfo is expected at " + index*3;
      return info;
    }

    void setDatanode(int index, DatanodeDescriptor node) {
      assert this.triplets != null : "BlockInfo is not initialized";
      assert index >= 0 && index*3 < triplets.length : "Index is out of bound";
      triplets[index*3] = node;
    }

    void setPrevious(int index, BlockInfo to) {
      assert this.triplets != null : "BlockInfo is not initialized";
      assert index >= 0 && index*3+1 < triplets.length : "Index is out of bound";
      triplets[index*3+1] = to;
    }

    void setNext(int index, BlockInfo to) {
      assert this.triplets != null : "BlockInfo is not initialized";
      assert index >= 0 && index*3+2 < triplets.length : "Index is out of bound";
      triplets[index*3+2] = to;
    }

    private int getCapacity() {
      assert this.triplets != null : "BlockInfo is not initialized";
      assert triplets.length % 3 == 0 : "Malformed BlockInfo";
      return triplets.length / 3;
    }

    /**
     * Ensure that there is enough space to include num more triplets.
     * @return first free triplet index.
     */
    private int ensureCapacity(int num) {
      assert this.triplets != null : "BlockInfo is not initialized";
      int last = numNodes();
      if(triplets.length >= (last+num)*3)
        return last;
      /* Not enough space left. Create a new array. Should normally 
       * happen only when replication is manually increased by the user. */
      Object[] old = triplets;
      triplets = new Object[(last+num)*3];
      for(int i=0; i < last*3; i++) {
        triplets[i] = old[i];
      }
      return last;
    }

    /**
     * Count the number of data-nodes the block belongs to.
     */
    int numNodes() {
      assert this.triplets != null : "BlockInfo is not initialized";
      assert triplets.length % 3 == 0 : "Malformed BlockInfo";
      for(int idx = getCapacity()-1; idx >= 0; idx--) {
        if(getDatanode(idx) != null)
          return idx+1;
      }
      return 0;
    }

    /**
     * Add data-node this block belongs to.
     */
    boolean addNode(DatanodeDescriptor node) {
      if(findDatanode(node) >= 0) // the node is already there
        return false;
      // find the last null node
      int lastNode = ensureCapacity(1);
      setDatanode(lastNode, node);
      setNext(lastNode, null);
      setPrevious(lastNode, null);
      return true;
    }

    /**
     * Remove data-node from the block.
     */
    boolean removeNode(DatanodeDescriptor node) {
      int dnIndex = findDatanode(node);
      if(dnIndex < 0) // the node is not found
        return false;
      assert getPrevious(dnIndex) == null && getNext(dnIndex) == null : 
        "Block is still in the list and must be removed first.";
      // find the last not null node
      int lastNode = numNodes()-1; 
      // replace current node triplet by the lastNode one 
      setDatanode(dnIndex, getDatanode(lastNode));
      setNext(dnIndex, getNext(lastNode)); 
      setPrevious(dnIndex, getPrevious(lastNode)); 
      // set the last triplet to null
      setDatanode(lastNode, null);
      setNext(lastNode, null); 
      setPrevious(lastNode, null); 
      return true;
    }

    /**
     * Find specified DatanodeDescriptor.
     * @param dn
     * @return index or -1 if not found.
     */
    int findDatanode(DatanodeDescriptor dn) {
      int len = getCapacity();
      for(int idx = 0; idx < len; idx++) {
        DatanodeDescriptor cur = getDatanode(idx);
        if(cur == dn)
          return idx;
        if(cur == null)
          break;
      }
      return -1;
    }

    /**
     * Insert this block into the head of the list of blocks 
     * related to the specified DatanodeDescriptor.
     * If the head is null then form a new list.
     * @return current block as the new head of the list.
     */
    BlockInfo listInsert(BlockInfo head, DatanodeDescriptor dn) {
      int dnIndex = this.findDatanode(dn);
      assert dnIndex >= 0 : "Data node is not found: current";
      assert getPrevious(dnIndex) == null && getNext(dnIndex) == null : 
              "Block is already in the list and cannot be inserted.";
      this.setPrevious(dnIndex, null);
      this.setNext(dnIndex, head);
      if(head != null)
        head.setPrevious(head.findDatanode(dn), this);
      return this;
    }

    /**
     * Remove this block from the list of blocks 
     * related to the specified DatanodeDescriptor.
     * If this block is the head of the list then return the next block as 
     * the new head.
     * @return the new head of the list or null if the list becomes
     * empty after deletion.
     */
    BlockInfo listRemove(BlockInfo head, DatanodeDescriptor dn) {
      if(head == null)
        return null;
      int dnIndex = this.findDatanode(dn);
      if(dnIndex < 0) // this block is not on the data-node list
        return head;

      BlockInfo next = this.getNext(dnIndex);
      BlockInfo prev = this.getPrevious(dnIndex);
      this.setNext(dnIndex, null);
      this.setPrevious(dnIndex, null);
      if(prev != null)
        prev.setNext(prev.findDatanode(dn), next);
      if(next != null)
        next.setPrevious(next.findDatanode(dn), prev);
      if(this == head)  // removing the head
        head = next;
      return head;
    }

    int listCount(DatanodeDescriptor dn) {
      int count = 0;
      for(BlockInfo cur = this; cur != null;
            cur = cur.getNext(cur.findDatanode(dn)))
        count++;
      return count;
    }

    boolean listIsConsistent(DatanodeDescriptor dn) {
      // going forward
      int count = 0;
      BlockInfo next, nextPrev;
      BlockInfo cur = this;
      while(cur != null) {
        next = cur.getNext(cur.findDatanode(dn));
        if(next != null) {
          nextPrev = next.getPrevious(next.findDatanode(dn));
          if(cur != nextPrev) {
            System.out.println("Inconsistent list: cur->next->prev != cur");
            return false;
          }
        }
        cur = next;
        count++;
      }
      return true;
    }

    @Override
    public LightWeightGSet.LinkedElement getNext() {
      return nextLinkedElement;
    }

    @Override
    public void setNext(LightWeightGSet.LinkedElement next) {
      this.nextLinkedElement = next;
    }
  }
  ......
}
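To make the triplets layout above concrete, the following hypothetical fragment (not from the Hadoop source) walks the whole block list of one datanode dn along the "next" links; head would be the DatanodeDescriptor.blockList field introduced below, and visit() stands for any per-block processing:

// Hypothetical traversal of one datanode's block list via the triplets links.
BlockInfo cur = head;            // head of dn's block list
while (cur != null) {
  int i = cur.findDatanode(dn);  // dn's slot in cur's triplets
  visit(cur);                    // process this block's metadata
  cur = cur.getNext(i);          // follow triplets[3*i+2] to the next block
}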

BlockInfo's triplets[3*i] holds datanode information of type DatanodeDescriptor, a subclass of DatanodeInfo, which in turn is a subclass of DatanodeID; DatanodeDescriptor also has several inner classes. We have met DatanodeInfo before: it can identify a datanode and expose some of its state, and it appears in the IPC interfaces between datanode and namenode and between client and namenode.

Like BlockInfo, DatanodeDescriptor adds data that the namenode needs for its work. This data falls into three categories:

1. Datanode state

This includes isAlive, firstBlockReport, decommissioningStatus, currApproxBlocksScheduled and others. decommissioningStatus (of type DatanodeDescriptor.DecommissioningStatus) is special in that it is only used while the node is being decommissioned; currApproxBlocksScheduled and a few related members estimate the datanode's load, so that when allocating blocks for a file write or scheduling block replication, the namenode can prefer relatively idle nodes as targets.

2. Data used to generate namenode instructions for the datanode

This includes four members: bandwidth, replicateBlocks, recoverBlocks and invalidateBlocks. They are used to generate, respectively, the balancer bandwidth update (DNA_BALANCERBANDWIDTHUPDATE), block replication (DNA_TRANSFER), block recovery (DNA_RECOVERBLOCK) and block deletion (DNA_INVALIDATE) namenode instructions. Taking deletion as an example, invalidateBlocks holds the blocks waiting to be deleted on this datanode; on the next heartbeat, the namenode may issue a DNA_INVALIDATE command telling the datanode to delete some blocks from that set. replicateBlocks and recoverBlocks are of type BlockQueue, a queue whose elements are BlockTargetPair objects. As the name suggests, a BlockTargetPair carries a block and a (list of) target datanodes: for replication, the targets are the destinations of the copy; for block recovery, they are the datanodes taking part in the recovery.

3. The member variable blockList

It is the head of the list of blocks stored on this datanode; the list elements are of type BlocksMap.BlockInfo.

The code is as follows:

        

public class DatanodeDescriptor extends DatanodeInfo {
  
  // Stores status of decommissioning.
  // If node is not decommissioning, do not use this object for anything.
  DecommissioningStatus decommissioningStatus = new DecommissioningStatus();

  /** Block and targets pair */
  public static class BlockTargetPair {
    public final Block block;
    public final DatanodeDescriptor[] targets;    

    BlockTargetPair(Block block, DatanodeDescriptor[] targets) {
      this.block = block;
      this.targets = targets;
    }
  }

  /** A BlockTargetPair queue. */
  private static class BlockQueue {
    private final Queue<BlockTargetPair> blockq = new LinkedList<BlockTargetPair>();
    ......
  }

  private volatile BlockInfo blockList = null;
  // isAlive == heartbeats.contains(this)
  // This is an optimization, because contains takes O(n) time on Arraylist
  protected boolean isAlive = false;
  protected boolean needKeyUpdate = false;

  // A system administrator can tune the balancer bandwidth parameter
  // (dfs.balance.bandwidthPerSec) dynamically by calling
  // "dfsadmin -setBalanacerBandwidth <newbandwidth>", at which point the
  // following 'bandwidth' variable gets updated with the new value for each
  // node. Once the heartbeat command is issued to update the value on the
  // specified datanode, this value will be set back to 0.
  private long bandwidth;

  /** A queue of blocks to be replicated by this datanode */
  private BlockQueue replicateBlocks = new BlockQueue();
  /** A queue of blocks to be recovered by this datanode */
  private BlockQueue recoverBlocks = new BlockQueue();
  /** A set of blocks to be invalidated by this datanode */
  private Set<Block> invalidateBlocks = new TreeSet<Block>();

  // Set to false after processing first block report
  private boolean firstBlockReport = true; 
  class DecommissioningStatus {
    int underReplicatedBlocks;
    int decommissionOnlyReplicas;
    int underReplicatedInOpenFiles;
    long startTime;
    ......
  }
  ......
}


When a datanode has successfully received a block, it reports this to the namenode through the remote method blockReceived(). The namenode then has to add/update the block in the corresponding datanode's DatanodeDescriptor object, using the addBlock() method. addBlock() first calls BlockInfo.addNode() to add the object itself, i.e. the DatanodeDescriptor, to the list of datanodes holding the block; it then calls BlockInfo.listInsert() to insert the block into the list of blocks managed by the datanode. DatanodeDescriptor.addBlock() nicely illustrates the intricate relationship between BlocksMap and DatanodeDescriptor. The code is as follows:

     

/**************************************************
 * DatanodeDescriptor tracks stats on a given DataNode,
 * such as available storage capacity, last update time, etc.,
 * and maintains a set of blocks stored on the datanode. 
 *
 * This data structure is a data structure that is internal
 * to the namenode. It is *not* sent over-the-wire to the Client
 * or the Datnodes. Neither is it stored persistently in the
 * fsImage.

 **************************************************/
public class DatanodeDescriptor extends DatanodeInfo {
   /**
   * Add data-node to the block.
   * Add block to the head of the list of blocks belonging to the data-node.
   */
  boolean addBlock(BlockInfo b) {
    if(!b.addNode(this))
      return false;
    // add to the head of the data-node list
    blockList = b.listInsert(blockList, this);
    return true;
  }
  ......
}

/**
 * This class maintains the map from a block to its metadata.
 * block's metadata currently includes INode it belongs to and
 * the datanodes that store the block.
 */
class BlocksMap {
        
  /**
   * Internal class for block metadata.
   */
  static class BlockInfo extends Block implements LightWeightGSet.LinkedElement {

    /**
     * Add data-node this block belongs to.
     */
    boolean addNode(DatanodeDescriptor node) {
      if(findDatanode(node) >= 0) // the node is already there
        return false;
      // find the last null node
      int lastNode = ensureCapacity(1);
      setDatanode(lastNode, node);
      setNext(lastNode, null);
      setPrevious(lastNode, null);
      return true;
    }

  /**
     * Insert this block into the head of the list of blocks 
     * related to the specified DatanodeDescriptor.
     * If the head is null then form a new list.
     * @return current block as the new head of the list.
     */
    BlockInfo listInsert(BlockInfo head, DatanodeDescriptor dn) {
      int dnIndex = this.findDatanode(dn);
      assert dnIndex >= 0 : "Data node is not found: current";
      assert getPrevious(dnIndex) == null && getNext(dnIndex) == null : 
              "Block is already in the list and cannot be inserted.";
      this.setPrevious(dnIndex, null);
      this.setNext(dnIndex, head);
      if(head != null)
        head.setPrevious(head.findDatanode(dn), this);
      return this;
    }
......

}


DatanodeDescriptor.getReplicationCommand() produces a block replication command: it takes at most maxTransfers elements from the replicateBlocks queue and builds a BlockCommand object from them. The method is very simple; DatanodeDescriptor has similar methods such as getLeaseRecoveryCommand() and getInvalidateBlocks(), as shown below:

      

public class DatanodeDescriptor extends DatanodeInfo {

  BlockCommand getReplicationCommand(int maxTransfers) {
    List<BlockTargetPair> blocktargetlist = replicateBlocks.poll(maxTransfers);
    return blocktargetlist == null? null:
        new BlockCommand(DatanodeProtocol.DNA_TRANSFER, blocktargetlist);
  }
  ......
}
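BlockQueue.poll(int) itself is elided in the excerpts above; judging from how getReplicationCommand() uses it, a plausible sketch (an assumption, not the verbatim source) is:

  // Plausible sketch of BlockQueue.poll(): dequeue at most numBlocks
  // BlockTargetPair entries, or return null if the queue is empty.
  synchronized List<BlockTargetPair> poll(int numBlocks) {
    if (numBlocks <= 0 || blockq.isEmpty()) {
      return null;
    }
    List<BlockTargetPair> results = new ArrayList<BlockTargetPair>();
    for (; !blockq.isEmpty() && numBlocks > 0; numBlocks--) {
      results.add(blockq.poll());
    }
    return results;
  }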

Block Replica States

BlocksMap and DatanodeDescriptor together hold the namenode's second relationship, the mapping between blocks and datanodes, but block management on the namenode needs the support of several more data structures. This section looks at the block replica states on the namenode (more precisely, the operation currently pending on a particular replica) and the classes used to manage those states. It also begins the analysis of org.apache.hadoop.hdfs.server.namenode.FSNamesystem. This class is the facade of the whole namenode, with lengthy code and a large number of member variables and methods; its member FSNamesystem.dir of type FSDirectory, for instance, is used to operate on the HDFS directory tree.

When the system works in a perfect, failure-free state, block replicas are created as clients write file data and removed when files are deleted. But a running cluster, especially one built from thousands of servers, has a fairly high failure rate, and all sorts of incidents occur:

1. A datanode crashes during a write, breaking the data pipeline; the client performs block recovery.

2. A client crashes while writing and does not recover for a long time; the namenode performs lease recovery (which, to the datanodes, is a block recovery initiated by the namenode).

3. A datanode's disk is damaged, and the replicas stored on it are permanently lost.

4. A datanode fails and later comes back.

All of these affect the state of block replicas on datanodes; block recovery, for example, is one state a replica can be in (or one operation currently applied to it). Sometimes a normal operation also changes replica state: lowering a file's replication factor causes some replicas to be deleted.

The FSNamesystem member variables involved in block management are shown below:

     

/***************************************************
 * FSNamesystem does the actual bookkeeping work for the
 * DataNode.
 *
 * It tracks several important tables.
 *
 * 1)  valid fsname --> blocklist  (kept on disk, logged)
 * 2)  Set of all valid blocks (inverted #1)
 * 3)  block --> machinelist (kept in memory, rebuilt dynamically from reports)
 * 4)  machine --> blocklist (inverted #2)
 * 5)  LRU cache of updated-heartbeat machines
 ***************************************************/
public class FSNamesystem implements FSConstants, FSNamesystemMBean,
    NameNodeMXBean, MetricsSource {

  volatile long pendingReplicationBlocksCount = 0L;
  volatile long corruptReplicaBlocksCount = 0L;
  volatile long underReplicatedBlocksCount = 0L;
  volatile long scheduledReplicationBlocksCount = 0L;
  volatile long excessBlocksCount = 0L;
  volatile long pendingDeletionBlocksCount = 0L;
  //
  // Stores the correct file name hierarchy
  //
  public FSDirectory dir;

  //
  // Mapping: Block -> { INode, datanodes, self ref } 
  // Updated only in response to client-sent information.
  //
  final BlocksMap blocksMap = new BlocksMap(DEFAULT_INITIAL_MAP_CAPACITY, 
                                            DEFAULT_MAP_LOAD_FACTOR);

  //
  // Store blocks-->datanodedescriptor(s) map of corrupt replicas
  //
  public CorruptReplicasMap corruptReplicas = new CorruptReplicasMap();
/**
   * Stores the datanode -> block map.  
   * <p>
   * Done by storing a set of {@link DatanodeDescriptor} objects, sorted by 
   * storage id. In order to keep the storage map consistent it tracks 
   * all storages ever registered with the namenode.
   * A descriptor corresponding to a specific storage id can be
   * <ul> 
   * <li>added to the map if it is a new storage id;</li>
   * <li>updated with a new datanode started as a replacement for the old one 
   * with the same storage id; and </li>
   * <li>removed if and only if an existing datanode is restarted to serve a
   * different storage id.</li>
   * </ul> <br>
   * The list of the {@link DatanodeDescriptor}s in the map is checkpointed
   * in the namespace image file. Only the {@link DatanodeInfo} part is 
   * persistent, the list of blocks is restored from the datanode block
   * reports. 
   * <p>
   * Mapping: StorageID -> DatanodeDescriptor
   */
  NavigableMap<String, DatanodeDescriptor> datanodeMap = 
    new TreeMap<String, DatanodeDescriptor>();

  //
  // Keeps a Collection for every named machine containing
  // blocks that have recently been invalidated and are thought to live
  // on the machine in question.
  // Mapping: StorageID -> ArrayList<Block>
  //
  private Map<String, Collection<Block>> recentInvalidateSets = 
    new TreeMap<String, Collection<Block>>();

  //
  // Keeps a TreeSet for every named node.  Each treeset contains
  // a list of the blocks that are "extra" at that location.  We'll
  // eventually remove these extras.
  // Mapping: StorageID -> TreeSet<Block>
  //
  Map<String, Collection<Block>> excessReplicateMap = 
    new TreeMap<String, Collection<Block>>();

  Random r = new Random();

  /**
   * Stores a set of DatanodeDescriptor objects.
   * This is a subset of {@link #datanodeMap}, containing nodes that are 
   * considered alive.
   * The {@link HeartbeatMonitor} periodically checks for outdated entries,
   * and removes them from the list.
   */
  ArrayList<DatanodeDescriptor> heartbeats = new ArrayList<DatanodeDescriptor>();

/**
 * Store set of Blocks that need to be replicated 1 or more times.
 * Set of: Block
 */
  private UnderReplicatedBlocks neededReplications = new UnderReplicatedBlocks();
  // We also store pending replication-orders.
  private PendingReplicationBlocks pendingReplications;

  public LeaseManager leaseManager = new LeaseManager(this); 
......
}


For essentially every pending operation on a particular replica, FSNamesystem has a member variable holding the affected blocks and any extra information the operation needs; some also come with counters for the number of blocks waiting. In detail:

corruptReplicas: corrupt block replicas, such as those found by a datanode's block scanner.

recentInvalidateSets: invalid replicas, i.e. replicas waiting to be deleted. When a file is deleted, all replicas of its blocks become invalid; corrupt replicas are generally invalid too.

excessReplicateMap: excess replicas. Reducing a file's replication factor (e.g. via the "hadoop fs -setrep" command) produces excess replicas, and the namenode may also find excess replicas at startup; the namenode chooses which of a block's replicas are the excess ones.

neededReplications: blocks waiting to be replicated, for which replication requests are about to be generated; replication brings a block's replica count back up to the file's replication factor.

pendingReplications: replicas for which replication requests have already been generated. That is, once a replication request is created, the block's information is moved out of neededReplications and into pendingReplications.

leaseManager: the lease manager, which indirectly records the blocks that are under construction or under recovery.

For the convenience of later discussion, the code above also includes the blocksMap and datanodeMap members. blocksMap is the only BlocksMap instance in the namenode; datanodeMap holds all the datanodes the namenode currently manages and provides fast lookup of the DatanodeDescriptor for a given datanode storage ID.

Of the variables above, recentInvalidateSets and excessReplicateMap have fairly simple types: maps from a datanode storage ID to a group of block objects (Collection<Block>). In the implementation, recentInvalidateSets uses java.util.HashSet and excessReplicateMap uses java.util.TreeSet as the concrete collection. The remaining variables, corruptReplicas, neededReplications and pendingReplications, each define a dedicated class to hold the data they need.
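As an illustration, queueing a replica for deletion is little more than a map insertion. The following is a simplified sketch of what FSNamesystem.addToInvalidates() does, based on the description above (the real method also writes a log message; treat the details as assumptions):

  // Simplified sketch: record block b as awaiting deletion on datanode dn.
  void addToInvalidates(Block b, DatanodeInfo dn) {
    Collection<Block> invalidateSet = recentInvalidateSets.get(dn.getStorageID());
    if (invalidateSet == null) {
      invalidateSet = new HashSet<Block>();
      recentInvalidateSets.put(dn.getStorageID(), invalidateSet);
    }
    if (invalidateSet.add(b)) {
      pendingDeletionBlocksCount++;  // counter shown in the code above
    }
  }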

FSNamesystem.corruptReplicas has type CorruptReplicasMap and records the corrupt replicas reported by datanodes. Since a block usually has several replicas, the system only considers the block itself corrupt when all of its replicas are corrupt; when only some replicas are damaged, the namenode replicates the healthy ones until the replica count is back to normal. CorruptReplicasMap supports this process: through its member corruptReplicasMap it records each corrupt replica together with the (set of) datanodes holding it. When a new corrupt replica is found, addToCorruptReplicasMap() adds it to the CorruptReplicasMap object; when a corrupt block is deleted from a datanode, removeFromCorruptReplicasMap() removes the record. The CorruptReplicasMap code is as follows:

     

public class CorruptReplicasMap{

  private Map<Block, Collection<DatanodeDescriptor>> corruptReplicasMap =
    new TreeMap<Block, Collection<DatanodeDescriptor>>();
  
  /**
   * Mark the block belonging to datanode as corrupt.
   *
   * @param blk Block to be added to CorruptReplicasMap
   * @param dn DatanodeDescriptor which holds the corrupt replica
   */
  public void addToCorruptReplicasMap(Block blk, DatanodeDescriptor dn) {
    Collection<DatanodeDescriptor> nodes = getNodes(blk);
    if (nodes == null) {
      nodes = new TreeSet<DatanodeDescriptor>();
      corruptReplicasMap.put(blk, nodes);
    }
    if (!nodes.contains(dn)) {
      nodes.add(dn);
      NameNode.stateChangeLog.info("BLOCK NameSystem.addToCorruptReplicasMap: "+
                                   blk.getBlockName() +
                                   " added as corrupt on " + dn.getName() +
                                   " by " + Server.getRemoteIp());
    } else {
      NameNode.stateChangeLog.info("BLOCK NameSystem.addToCorruptReplicasMap: "+
                                   "duplicate requested for " + 
                                   blk.getBlockName() + " to add as corrupt " +
                                   "on " + dn.getName() +
                                   " by " + Server.getRemoteIp());
    }
  }

  /**
   * Remove Block from CorruptBlocksMap
   *
   * @param blk Block to be removed
   */
  void removeFromCorruptReplicasMap(Block blk) {
    if (corruptReplicasMap != null) {
      corruptReplicasMap.remove(blk);
    }
  }

  /**
   * Remove the block at the given datanode from CorruptBlockMap
   * @param blk block to be removed
   * @param datanode datanode where the block is located
   * @return true if the removal is successful; 
             false if the replica is not in the map
   */ 
  boolean removeFromCorruptReplicasMap(Block blk, DatanodeDescriptor datanode) {
    Collection<DatanodeDescriptor> datanodes = corruptReplicasMap.get(blk);
    if (datanodes==null)
      return false;
    if (datanodes.remove(datanode)) { // remove the replicas
      if (datanodes.isEmpty()) {
        // remove the block if there is no more corrupted replicas
        corruptReplicasMap.remove(blk);
      }
      return true;
    }
    return false;
  }
    

  /**
   * Get Nodes which have corrupt replicas of Block
   * 
   * @param blk Block for which nodes are requested
   * @return collection of nodes. Null if does not exists
   */
  Collection<DatanodeDescriptor> getNodes(Block blk) {
    return corruptReplicasMap.get(blk);
  }

  /**
   * Check if replica belonging to Datanode is corrupt
   *
   * @param blk Block to check
   * @param node DatanodeDescriptor which holds the replica
   * @return true if replica is corrupt, false if does not exists in this map
   */
  boolean isReplicaCorrupt(Block blk, DatanodeDescriptor node) {
    Collection<DatanodeDescriptor> nodes = getNodes(blk);
    return ((nodes != null) && (nodes.contains(node)));
  }

  public int numCorruptReplicas(Block blk) {
    Collection<DatanodeDescriptor> nodes = getNodes(blk);
    return (nodes == null) ? 0 : nodes.size();
  }
  
  public int size() {
    return corruptReplicasMap.size();
  }
}


Compared with CorruptReplicasMap, the implementation of UnderReplicatedBlocks is more involved.

Blocks waiting to be replicated are kept in the member priorityQueues, a list with three TreeSet elements, one per priority level; each TreeSet contains Block objects. In other words, priorityQueues holds three sets of blocks, as shown below:

     

class UnderReplicatedBlocks implements Iterable<Block> {
  static final int LEVEL = 3;
  private List<TreeSet<Block>> priorityQueues = new ArrayList<TreeSet<Block>>();
      
  /* constructor */
  UnderReplicatedBlocks() {
    for(int i=0; i<LEVEL; i++) {
      priorityQueues.add(new TreeSet<Block>());
    }
  }
/* Return the priority of a block
   * @param block a under replication block
   * @param curReplicas current number of replicas of the block
   * @param expectedReplicas expected number of replicas of the block
   */
  private int getPriority(Block block, 
                          int curReplicas, 
                          int decommissionedReplicas,
                          int expectedReplicas) {
    if (curReplicas<0 || curReplicas>=expectedReplicas) {
      return LEVEL; // no need to replicate
    } else if(curReplicas==0) {
      // If there are zero non-decommissioned replica but there are
      // some decommissioned replicas, then assign them highest priority
      if (decommissionedReplicas > 0) {
        return 0;
      }
      return 2; // keep these blocks in needed replication.
    } else if(curReplicas==1) {
      return 0; // highest priority
    } else if(curReplicas*3<expectedReplicas) {
      return 1;
    } else {
      return 2;
    }
  }
 ......
}


As the code shows, a block gets the highest priority when its replication source is a node being decommissioned or when only one replica is left; when the replica count is below one third of the expected value, the priority is next highest; in all other cases the priority is lowest. Higher-priority blocks are replicated first. This priority computation lives in the private method UnderReplicatedBlocks.getPriority(), which must be called whenever replication entries are added to or updated in the object.
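For instance, inserting a block into the right queue reduces to a getPriority() call. A simplified sketch of the corresponding add() method (logging and edge cases omitted; details are assumptions):

  // Sketch of UnderReplicatedBlocks.add(): compute the priority level and
  // insert the block into the matching queue; LEVEL means "no replication needed".
  synchronized boolean add(Block block, int curReplicas,
                           int decommissionedReplicas, int expectedReplicas) {
    int priLevel = getPriority(block, curReplicas,
                               decommissionedReplicas, expectedReplicas);
    return priLevel != LEVEL && priorityQueues.get(priLevel).add(block);
  }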

After blocks pending replication are read out of the UnderReplicatedBlocks object, replication requests are generated according to the load of the cluster's datanodes and placed on the source datanode, i.e. stored in the replicateBlocks member of that node's DatanodeDescriptor. Note that a replication request can fail for many reasons, so the namenode also keeps the request in FSNamesystem.pendingReplications.

Datanode Management

When a datanode starts up, it communicates intensively with the namenode: handshake, registration and block report; after that it sends periodic heartbeats to keep the relationship alive. This section explains how the namenode manages datanodes, covering: adding and decommissioning datanodes, the flow executed on the namenode when a datanode starts, heartbeat processing, and how namenode instructions are generated and delivered.

1. Adding and decommissioning datanodes

When the cluster needs more capacity, HDFS allows new datanodes to be added dynamically. Conversely, shrinking the cluster means decommissioning existing datanodes. A datanode that fails frequently or runs slowly can also be taken out of service through decommissioning. These operations give HDFS a certain elasticity, letting it expand or contract with the application. Implementing them requires the namenode to manage explicitly which datanodes may connect to it, so that datanodes remain under cluster control and misconfigured datanodes cannot join the namenode by mistake.

The simplest way to add a datanode is to install the new node, set up its configuration files for the cluster, and start the DataNode daemon by hand; it automatically contacts the NameNode and joins the cluster. Beyond this, HDFS provides the files specified by dfs.hosts and dfs.hosts.exclude, giving the namenode that explicit control over which datanodes may connect.

The refreshNodes command actually uses the remote interface method ClientProtocol.refreshNodes() to make the namenode re-read the include and exclude files. The main logic of this remote method is in FSNamesystem.refreshNodes(), shown below:

     

/**
   * Rereads the config to get hosts and exclude list file names.
   * Rereads the files to update the hosts and exclude lists.  It
   * checks if any of the hosts have changed states:
   * 1. Added to hosts  --> no further work needed here.
   * 2. Removed from hosts --> mark AdminState as decommissioned. 
   * 3. Added to exclude --> start decommission.
   * 4. Removed from exclude --> stop decommission.
   */
  public void refreshNodes(Configuration conf) throws IOException {
    checkSuperuserPrivilege();
    // Reread the config to get dfs.hosts and dfs.hosts.exclude filenames.
    // Update the file names and refresh internal includes and excludes list
    if (conf == null)
      conf = new Configuration();
    hostsReader.updateFileNames(conf.get("dfs.hosts",""), 
                                conf.get("dfs.hosts.exclude", ""));
    hostsReader.refresh();
    synchronized (this) {
      for (Iterator<DatanodeDescriptor> it = datanodeMap.values().iterator();
           it.hasNext();) {
        DatanodeDescriptor node = it.next();
        // Check if not include.
        if (!inHostsList(node, null)) {
          node.setDecommissioned();  // case 2.
        } else {
          if (inExcludedHostsList(node, null)) {
            if (!node.isDecommissionInProgress() && 
                !node.isDecommissioned()) {
              startDecommission(node);   // case 3.
            }
          } else {
            if (node.isDecommissionInProgress() || 
                node.isDecommissioned()) {
              stopDecommission(node);   // case 4.
            } 
          }
        }
      }
    } 
      
  }


To decommission nodes, add them to the exclude file and run "hadoop dfsadmin -refreshNodes" again; the namenode then starts decommissioning those datanodes. The blocks on a node being decommissioned are replicated to other datanodes in the cluster; during this process the node is in the "decommission in progress" state, and only when the copying has finished does it move to "decommissioned", at which point the datanode can be shut down.
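As a concrete example, assuming dfs.hosts.exclude points at /etc/hadoop/conf/excludes (the path and hostname here are made up), the whole procedure is:

$ echo "datanode03.example.com" >> /etc/hadoop/conf/excludes
$ hadoop dfsadmin -refreshNodes   # namenode starts decommissioning the node
$ hadoop dfsadmin -report         # wait until the node shows as Decommissioned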

2. Datanode startup

Starting a newly added datanode follows the same flow as any normal datanode startup. At startup, a datanode performs a handshake with the namenode, registers, and reports its blocks; if the system supports append, it also reports the blocks currently open for client writes. This section focuses on the first three remote calls, i.e. the namenode-side implementations of the DatanodeProtocol methods versionRequest(), register() and blockReport().

(1) Handshake

Of these methods, versionRequest() is the simplest: it calls FSNamesystem.getNamespaceInfo() and returns the result.

        //implemented in org.apache.hadoop.hdfs.server.namenode.NameNode

        

 public NamespaceInfo versionRequest() throws IOException {
    return namesystem.getNamespaceInfo();
  }
        //implemented in org.apache.hadoop.hdfs.server.namenode.FSNamesystem

      

  synchronized NamespaceInfo getNamespaceInfo() {
    return new NamespaceInfo(dir.fsImage.getNamespaceID(),
                             dir.fsImage.getCTime(),
                             getDistributedUpgradeVersion());
  }
(2) Registration

The main logic of the remote method register() is in FSNamesystem.registerDatanode(). Once you notice that a datanode may send its registration more than once, the method is fairly easy to follow.

First, registerDatanode() has to produce the datanode identifier for the registering node. The namenode cannot fully trust what the datanode sends, so it updates the identifier carried in the registration according to the actual situation and then uses that identifier for the rest of the processing. This part of the code also decides whether the datanode is allowed to connect to the namenode, based on the include/exclude file rules of the previous section, through the verifyNodeRegistration() method. If the datanode may not connect, a DisallowedDatanodeException is simply thrown.

Next, the namenode handles the registration according to the situation at hand. The possible cases are:

1. The datanode has never registered before.

2. The datanode has registered before, and this registration is a repeat.

3. The datanode has registered before, but this registration carries a new datanode storage ID (storageID), which means the datanode's storage has been wiped and its original block replicas have been deleted.

We know that datanodeMap maps datanode storage IDs to DatanodeDescriptor objects. To distinguish the cases above, the namenode introduces another map, Host2NodesMap, which maps a host (name) to the DatanodeDescriptor(s) of the datanode(s) started on that host (there may be more than one). Host2NodesMap's member variable is declared as follows:

      

class Host2NodesMap {
  private HashMap<String, DatanodeDescriptor[]> map
    = new HashMap<String, DatanodeDescriptor[]>();
.......
}
Let nodeS be the datanode descriptor looked up in datanodeMap by storage ID, and nodeN the descriptor looked up in host2DataNodeMap (a Host2NodesMap instance) by host name and port. If nodeN is non-null and nodeN differs from nodeS, we are in case 3 above: the datanode is registering with a new storage ID. nodeN is then the descriptor of the old datanode identity; using it, the namenode calls FSNamesystem.removeDatanode() and wipeDatanode() to clean up the information kept for the old node, and sets nodeN to null, after which processing continues as in case 1. With case 3 ruled out, whether this is a repeated registration (case 2) is decided simply by whether nodeS is non-null. The code is as follows:

      

public synchronized void registerDatanode(DatanodeRegistration nodeReg
                                            ) throws IOException {
    String dnAddress = Server.getRemoteAddress();
    if (dnAddress == null) {
      // Mostly called inside an RPC.
      // But if not, use address passed by the data-node.
      dnAddress = nodeReg.getHost();
    }      

    // check if the datanode is allowed to be connect to the namenode
    if (!verifyNodeRegistration(nodeReg, dnAddress)) {
      throw new DisallowedDatanodeException(nodeReg);
    }

    String hostName = nodeReg.getHost();
      
    // update the datanode's name with ip:port
    DatanodeID dnReg = new DatanodeID(dnAddress + ":" + nodeReg.getPort(),
                                      nodeReg.getStorageID(),
                                      nodeReg.getInfoPort(),
                                      nodeReg.getIpcPort());
    nodeReg.updateRegInfo(dnReg);
    nodeReg.exportedKeys = getBlockKeys();
      
    NameNode.stateChangeLog.info(
                                 "BLOCK* NameSystem.registerDatanode: "
                                 + "node registration from " + nodeReg.getName()
                                 + " storage " + nodeReg.getStorageID());

    DatanodeDescriptor nodeS = datanodeMap.get(nodeReg.getStorageID());
    DatanodeDescriptor nodeN = host2DataNodeMap.getDatanodeByName(nodeReg.getName());
      
    if (nodeN != null && nodeN != nodeS) {
      NameNode.LOG.info("BLOCK* NameSystem.registerDatanode: "
                        + "node from name: " + nodeN.getName());
      // nodeN previously served a different data storage, 
      // which is not served by anybody anymore.
      removeDatanode(nodeN);
      // physically remove node from datanodeMap
      wipeDatanode(nodeN);
      nodeN = null;
    }

    if (nodeS != null) {
      if (nodeN == nodeS) {
        // The same datanode has been just restarted to serve the same data 
        // storage. We do not need to remove old data blocks, the delta will
        // be calculated on the next block report from the datanode
        NameNode.stateChangeLog.debug("BLOCK* NameSystem.registerDatanode: "
                                      + "node restarted.");
      } else {
        // nodeS is found
        /* The registering datanode is a replacement node for the existing 
          data storage, which from now on will be served by a new node.
          If this message repeats, both nodes might have same storageID 
          by (insanely rare) random chance. User needs to restart one of the
          nodes with its data cleared (or user can just remove the StorageID
          value in "VERSION" file under the data directory of the datanode,
          but this is might not work if VERSION file format has changed 
       */        
        NameNode.stateChangeLog.info( "BLOCK* NameSystem.registerDatanode: "
                                      + "node " + nodeS.getName()
                                      + " is replaced by " + nodeReg.getName() + 
                                      " with the same storageID " +
                                      nodeReg.getStorageID());
      }
      // update cluster map
      clusterMap.remove(nodeS);
      nodeS.updateRegInfo(nodeReg);
      nodeS.setHostName(hostName);
      
      // resolve network location
      resolveNetworkLocation(nodeS);
      clusterMap.add(nodeS);
        
      // also treat the registration message as a heartbeat
      synchronized(heartbeats) {
        if( !heartbeats.contains(nodeS)) {
          heartbeats.add(nodeS);
          //update its timestamp
          nodeS.updateHeartbeat(0L, 0L, 0L, 0);
          nodeS.isAlive = true;
        }
      }
      return;
    } 

    // this is a new datanode serving a new data storage
    if (nodeReg.getStorageID().equals("")) {
      // this data storage has never been registered
      // it is either empty or was created by pre-storageID version of DFS
      nodeReg.storageID = newStorageID();
      NameNode.stateChangeLog.debug(
                                    "BLOCK* NameSystem.registerDatanode: "
                                    + "new storageID " + nodeReg.getStorageID() + " assigned.");
    }
    // register new datanode
    DatanodeDescriptor nodeDescr 
      = new DatanodeDescriptor(nodeReg, NetworkTopology.DEFAULT_RACK, hostName);
    resolveNetworkLocation(nodeDescr);
    unprotectedAddDatanode(nodeDescr);
    clusterMap.add(nodeDescr);
      
    // also treat the registration message as a heartbeat
    synchronized(heartbeats) {
      heartbeats.add(nodeDescr);
      nodeDescr.isAlive = true;
      // no need to update its timestamp
      // because its is done when the descriptor is created
    }
    return;
  }
For a repeated registration, the namenode already holds the node's information, so the handling is relatively simple: update the node's position in the network topology and (possibly) its heartbeat information. For a newly registering datanode, the namenode must assign a datanode storage ID (if the node has never registered), create a datanode descriptor, resolve the node's network location, and add the node to datanodeMap and host2DataNodeMap (through unprotectedAddDatanode()) as well as to the heartbeat list heartbeats.

Registering a new node involves a series of bookkeeping steps. Removing a datanode from the namenode uses removeDatanode() and wipeDatanode(). Besides deleting the corresponding records from datanodeMap and related objects, these methods must also update the namenode's second relationship: removeStoredBlock() removes replicas recorded in the BlocksMap object, which often changes the state of replicas of the blocks this datanode managed (the affected replicas are not limited to this datanode), for instance triggering block replication or deleting the records related to this datanode from the invalid-replica set recentInvalidateSets. The code is as follows:

         

 private void removeDatanode(DatanodeDescriptor nodeInfo) {
    synchronized (heartbeats) {
      if (nodeInfo.isAlive) {
        updateStats(nodeInfo, false);
        heartbeats.remove(nodeInfo);
        nodeInfo.isAlive = false;
      }
    }

    for (Iterator<Block> it = nodeInfo.getBlockIterator(); it.hasNext();) {
      removeStoredBlock(it.next(), nodeInfo);
    }
    unprotectedRemoveDatanode(nodeInfo);
    clusterMap.remove(nodeInfo);
  }

 void unprotectedRemoveDatanode(DatanodeDescriptor nodeDescr) {
    nodeDescr.resetBlocks();
    removeFromInvalidates(nodeDescr.getStorageID());
    NameNode.stateChangeLog.debug(
                                  "BLOCK* NameSystem.unprotectedRemoveDatanode: "
                                  + nodeDescr.getName() + " is out of service now.");
  }

  void wipeDatanode(DatanodeID nodeID) throws IOException {
    String key = nodeID.getStorageID();
    host2DataNodeMap.remove(datanodeMap.remove(key));
    NameNode.stateChangeLog.debug(
                                  "BLOCK* NameSystem.wipeDatanode: "
                                  + nodeID.getName() + " storage " + key 
                                  + " is removed from datanodeMap.");
  }
(3) Block report

A successfully registered datanode then reports its blocks, supplying the namenode with its block information. The main handling of this request is in FSNamesystem.processReport(), which has two points of interest:

The first is the DisallowedDatanodeException: via shouldNodeShutdown(), processReport() checks whether the datanode's state is AdminStates.DECOMMISSIONED; if it is, the node may no longer connect to the namenode, and throwing the exception tells the datanode to stop working.

The other is DatanodeDescriptor.reportDiff(), which classifies the replicas managed by this datanode according to the current state of the cluster and adds them to the various replica-state management objects. The local variable toInvalidate, for example, collects the replicas to be deleted, which are eventually added to the FSNamesystem member recentInvalidateSets through addToInvalidates(). The code is as follows:
        
/**
   * The given node is reporting all its blocks.  Use this info to 
   * update the (machine-->blocklist) and (block-->machinelist) tables.
   */
  public synchronized void processReport(DatanodeID nodeID, 
                                         BlockListAsLongs newReport
                                        ) throws IOException {
    long startTime = now();
    if (NameNode.stateChangeLog.isDebugEnabled()) {
      NameNode.stateChangeLog.debug("BLOCK* NameSystem.processReport: "
                             + "from " + nodeID.getName()+" " + 
                             newReport.getNumberOfBlocks()+" blocks");
    }
    DatanodeDescriptor node = getDatanode(nodeID);
    if (node == null || !node.isAlive) {
      throw new IOException("ProcessReport from dead or unregisterted node: "
                            + nodeID.getName());
    }

    // Check if this datanode should actually be shutdown instead.
    if (shouldNodeShutdown(node)) {
      setDatanodeDead(node);
      throw new DisallowedDatanodeException(node);
    }
    
    // To minimize startup time, we discard any second (or later) block reports
    // that we receive while still in startup phase.
    if (isInStartupSafeMode() && !node.firstBlockReport()) {
      NameNode.stateChangeLog.info("BLOCK* NameSystem.processReport: "
          + "discarded non-initial block report from " + nodeID.getName()
          + " because namenode still in startup phase");
      return;
    }

    //
    // Modify the (block-->datanode) map, according to the difference
    // between the old and new block report.
    //
    Collection<Block> toAdd = new LinkedList<Block>();
    Collection<Block> toRemove = new LinkedList<Block>();
    Collection<Block> toInvalidate = new LinkedList<Block>();
    node.reportDiff(blocksMap, newReport, toAdd, toRemove, toInvalidate);
        
    for (Block b : toRemove) {
      removeStoredBlock(b, node);
    }
    for (Block b : toAdd) {
      addStoredBlock(b, node, null);
    }
    for (Block b : toInvalidate) {
      NameNode.stateChangeLog.info("BLOCK* NameSystem.processReport: block " 
          + b + " on " + node.getName() + " size " + b.getNumBytes()
          + " does not belong to any file.");
      addToInvalidates(b, node);
    }
    long endTime = now();
    NameNode.getNameNodeMetrics().addBlockReport(endTime - startTime);
    NameNode.stateChangeLog.info("*BLOCK* NameSystem.processReport: from "
        + nodeID.getName() + ", blocks: " + newReport.getNumberOfBlocks()
        + ", processing time: " + (endTime - startTime) + " msecs");
    node.processedBlockReport();
  }
3. Heartbeats

In DataNode.offerService(), the datanode sends heartbeats to the namenode in a loop, maintaining their relationship, reporting its load and fetching namenode instructions. The namenode's heartbeat-related code therefore falls into two parts: heartbeat processing and heartbeat checking.

(1) Heartbeat processing

FSNamesystem.handleHeartbeat() is called by NameNode.sendHeartbeat() and processes a heartbeat in three steps.

1) It first checks the requesting datanode and decides whether the node may connect to the namenode. It also checks whether the datanode has registered; an unregistered datanode receives the DatanodeCommand.REGISTER instruction, after which it must register again and report its blocks.

2) The namenode uses the load information in the heartbeat to update the load statistics of the whole HDFS system. The two updateStats() calls first subtract the previously reported values from the counters and then add the newly reported ones, producing the system's new load figures. DatanodeDescriptor.updateHeartbeat() updates not only the node's load but also its heartbeat time.

3) The namenode generates instructions for this datanode and returns them as the remote call's return value. DatanodeCommand.REGISTER, mentioned above, is one such namenode instruction.

handleHeartbeat() generally produces namenode instructions through the corresponding DatanodeDescriptor methods; the block deletion command in the code below, for example, is obtained through getInvalidateBlocks(). The code is as follows:
       
/**
   * The given node has reported in.  This method should:
   * 1) Record the heartbeat, so the datanode isn't timed out
   * 2) Adjust usage stats for future block allocation
   * 
   * If a substantial amount of time passed since the last datanode 
   * heartbeat then request an immediate block report.  
   * 
   * @return an array of datanode commands 
   * @throws IOException
   */
  DatanodeCommand[] handleHeartbeat(DatanodeRegistration nodeReg,
      long capacity, long dfsUsed, long remaining,
      int xceiverCount, int xmitsInProgress) throws IOException {
    DatanodeCommand cmd = null;
    synchronized (heartbeats) {
      synchronized (datanodeMap) {
        DatanodeDescriptor nodeinfo = null;
        try {
          nodeinfo = getDatanode(nodeReg);
        } catch(UnregisteredDatanodeException e) {
          return new DatanodeCommand[]{DatanodeCommand.REGISTER};
        }
          
        // Check if this datanode should actually be shutdown instead. 
        if (nodeinfo != null && shouldNodeShutdown(nodeinfo)) {
          setDatanodeDead(nodeinfo);
          throw new DisallowedDatanodeException(nodeinfo);
        }

        if (nodeinfo == null || !nodeinfo.isAlive) {
          return new DatanodeCommand[]{DatanodeCommand.REGISTER};
        }

        updateStats(nodeinfo, false);
        nodeinfo.updateHeartbeat(capacity, dfsUsed, remaining, xceiverCount);
        updateStats(nodeinfo, true);
        
        //check lease recovery
        cmd = nodeinfo.getLeaseRecoveryCommand(Integer.MAX_VALUE);
        if (cmd != null) {
          return new DatanodeCommand[] {cmd};
        }
      
        ArrayList<DatanodeCommand> cmds = new ArrayList<DatanodeCommand>();
        //check pending replication
        cmd = nodeinfo.getReplicationCommand(
              maxReplicationStreams - xmitsInProgress);
        if (cmd != null) {
          cmds.add(cmd);
        }
        //check block invalidation
        cmd = nodeinfo.getInvalidateBlocks(blockInvalidateLimit);
        if (cmd != null) {
          cmds.add(cmd);
        }
        // check access key update
        if (isAccessTokenEnabled && nodeinfo.needKeyUpdate) {
          cmds.add(new KeyUpdateCommand(accessTokenHandler.exportKeys()));
          nodeinfo.needKeyUpdate = false;
        }
        // check for balancer bandwidth update
        if (nodeinfo.getBalancerBandwidth() > 0) {
          cmds.add(new BalancerBandwidthCommand(nodeinfo.getBalancerBandwidth()));
          // set back to 0 to indicate that datanode has been sent the new value
          nodeinfo.setBalancerBandwidth(0);
        }
        if (!cmds.isEmpty()) {
          return cmds.toArray(new DatanodeCommand[cmds.size()]);
        }
      }
    }

    //check distributed upgrade
    cmd = getDistributedUpgradeCommand();
    if (cmd != null) {
      return new DatanodeCommand[] {cmd};
    }
    return null;
  }
All the data needed to build a namenode instruction is already stored in the DatanodeDescriptor object, so getInvalidateBlocks() takes only two lines of code: fetch the list of replicas to delete from the invalidateBlocks member, then build the delete command from that list. The code is as follows:
      
 /**
   * Remove the specified number of blocks to be invalidated
   */
  BlockCommand getInvalidateBlocks(int maxblocks) {
    Block[] deleteList = getBlockArray(invalidateBlocks, maxblocks); 
    return deleteList == null? 
        null: new BlockCommand(DatanodeProtocol.DNA_INVALIDATE, deleteList);
  }
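The helper getBlockArray() is not shown in the excerpt; from its call site, a plausible sketch (an assumption, not the verbatim source) is:

  // Plausible sketch of getBlockArray(): remove up to max blocks from the
  // set and return them as an array, or null if nothing can be taken.
  static Block[] getBlockArray(Collection<Block> blocks, int max) {
    synchronized (blocks) {
      if (max <= 0 || blocks.isEmpty()) {
        return null;
      }
      int n = Math.min(max, blocks.size());
      Block[] result = new Block[n];
      Iterator<Block> it = blocks.iterator();
      for (int i = 0; i < n; i++) {
        result[i] = it.next();
        it.remove();  // these replicas are now part of an outgoing command
      }
      return result;
    }
  }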
(2) Heartbeat checking

Heartbeat messages themselves are handled by the remote method sendHeartbeat(). The other heartbeat-related part of FSNamesystem is the heartbeat check, which runs in its own thread and performs its logic in heartbeatCheck(). The check interval is kept in the member heartbeatRecheckInterval; the default is 5 minutes, and it can be changed with the configuration item ${heartbeat.recheck.interval}. The code is as follows:
       
  class HeartbeatMonitor implements Runnable {
    private long lastHeartbeatCheck;
    private long lastAccessKeyUpdate;
    /**
     */
    public void run() {
      while (fsRunning) {
        try {
          long now = now();
          if (lastHeartbeatCheck + heartbeatRecheckInterval < now) {
            heartbeatCheck();
            lastHeartbeatCheck = now;
          }
          if (isAccessTokenEnabled && (lastAccessKeyUpdate + accessKeyUpdateInterval < now)) {
            updateAccessKey();
            lastAccessKeyUpdate = now;
          }
        } catch (Exception e) {
          FSNamesystem.LOG.error(StringUtils.stringifyException(e));
        }
        try {
          Thread.sleep(5000);  // 5 seconds
        } catch (InterruptedException ie) {
        }
      }
    }
  }
In heartbeatCheck(), a node that has not reported a heartbeat for a long time (detected in isDatanodeDead()) can be judged no longer functional, and removeDatanode() is called to remove it.

The simplest way to implement heartbeatCheck() would be a single loop that checks each datanode's state in turn and handles any failed node on the spot. heartbeatCheck() does not do this; instead it first locates one failed datanode, handles that failure, and then looks for the next possible failure. The main reason is that failure handling must synchronize on heartbeats, datanodeMap and other structures, during which the heartbeat handler handleHeartbeat() cannot update them; under heavy system load, or when a rack failure takes out many datanodes at once, this could lead to false verdicts. heartbeatCheck() therefore separates failure detection from failure handling to avoid such situations as far as possible.

The price is that the failure has to be confirmed again before it is handled: before calling removeDatanode(), heartbeatCheck() re-checks the node with isDatanodeDead(). The code is as follows:
     
 /**
   * Check if there are any expired heartbeats, and if so,
   * whether any blocks have to be re-replicated.
   * While removing dead datanodes, make sure that only one datanode is marked
   * dead at a time within the synchronized section. Otherwise, a cascading
   * effect causes more datanodes to be declared dead.
   */
  void heartbeatCheck() {
    if (isInSafeMode()) {
      // not to check dead nodes if in safemode
      return;
    }
    boolean allAlive = false;
    while (!allAlive) {
      boolean foundDead = false;
      DatanodeID nodeID = null;

      // locate the first dead node.
      synchronized(heartbeats) {
        for (Iterator<DatanodeDescriptor> it = heartbeats.iterator();
             it.hasNext();) {
          DatanodeDescriptor nodeInfo = it.next();
          if (isDatanodeDead(nodeInfo)) {
            foundDead = true;
            nodeID = nodeInfo;
            break;
          }
        }
      }

      // acquire the fsnamesystem lock, and then remove the dead node.
      if (foundDead) {
        synchronized (this) {
          synchronized(heartbeats) {
            synchronized (datanodeMap) {
              DatanodeDescriptor nodeInfo = null;
              try {
                nodeInfo = getDatanode(nodeID);
              } catch (IOException e) {
                nodeInfo = null;
              }
              if (nodeInfo != null && isDatanodeDead(nodeInfo)) {
                NameNode.stateChangeLog.info("BLOCK* NameSystem.heartbeatCheck: "
                                             + "lost heartbeat from " + nodeInfo.getName());
                removeDatanode(nodeInfo);
              }
            }
          }
        }
      }
      allAlive = !foundDead;
    }
  }

Within the complex namenode implementation, the heartbeat receiving and checking code is comparatively simple, but the separation of failure detection from failure handling in the heartbeat check reflects the HDFS designers' command of complex distributed systems.

Block Management

Managing the namenode's second relationship covers both datanode management and block management. Building on the datanode management mechanisms above, this section continues with the FSNamesystem logic related to block management.

1. Adding a block replica

FSNamesystem.addStoredBlock() is a "big" method that adds/updates, in the BlocksMap, the replica of block held on datanode node. From the analysis of the datanode implementation we know that after a successful client write or block replication, the datanode reports through the remote method DatanodeProtocol.blockReceived(); NameNode.blockReceived() then calls addStoredBlock() to record the replica and datanode in the BlocksMap object. If the replica was produced by block replication, the delNodeHint parameter can additionally request deletion of the replica on the source node. The other user of addStoredBlock() is the block report: FSNamesystem.processReport() uses it to merge the replica information reported by a datanode into FSNamesystem.blocksMap.
      
/**
   * Modify (block-->datanode) map.  Remove block from set of 
   * needed replications if this takes care of the problem.
   * @return the block that is stored in blockMap.
   */
  synchronized Block addStoredBlock(Block block, 
                                    DatanodeDescriptor node,
                                    DatanodeDescriptor delNodeHint) {
    BlockInfo storedBlock = blocksMap.getStoredBlock(block);
    if (storedBlock == null) {
      // If we have a block in the block map with the same ID, but a different
      // generation stamp, and the corresponding file is under construction,
      // then we need to do some special processing.
      storedBlock = blocksMap.getStoredBlockWithoutMatchingGS(block);

      if (storedBlock == null) {
        return rejectAddStoredBlock(
          block, node,
          "Block not in blockMap with any generation stamp");
      }

      INodeFile inode = storedBlock.getINode();
      if (inode == null) {
        return rejectAddStoredBlock(
          block, node,
          "Block does not correspond to any file");
      }

      boolean reportedOldGS = block.getGenerationStamp() < storedBlock.getGenerationStamp();
      boolean reportedNewGS = block.getGenerationStamp() > storedBlock.getGenerationStamp();
      boolean underConstruction = inode.isUnderConstruction();
      boolean isLastBlock = inode.getLastBlock() != null &&
        inode.getLastBlock().getBlockId() == block.getBlockId();

      // We can report a stale generation stamp for the last block under construction,
      // we just need to make sure it ends up in targets.
      if (reportedOldGS && !(underConstruction && isLastBlock)) {
        return rejectAddStoredBlock(
          block, node,
          "Reported block has old generation stamp but is not the last block of " +
          "an under-construction file. (current generation is " +
          storedBlock.getGenerationStamp() + ")");
      }

      // Don't add blocks to the DN when they're part of the in-progress last block
      // and have an inconsistent generation stamp. Instead just add them to targets
      // for recovery purposes. They will get added to the node when
      // commitBlockSynchronization runs
      if (underConstruction && isLastBlock && (reportedOldGS || reportedNewGS)) {
        NameNode.stateChangeLog.info(
          "BLOCK* NameSystem.addStoredBlock: "
          + "Targets updated: block " + block + " on " + node.getName() +
          " is added as a target for block " + storedBlock + " with size " +
          block.getNumBytes());
        ((INodeFileUnderConstruction)inode).addTarget(node);
        return block;
      }
    }

    INodeFile fileINode = storedBlock.getINode();
    if (fileINode == null) {
      return rejectAddStoredBlock(
        block, node,
        "Block does not correspond to any file");
    }
    assert storedBlock != null : "Block must be stored by now";

    // add block to the data-node
    boolean added = node.addBlock(storedBlock);    


    // Is the block being reported the last block of an underconstruction file?
    boolean blockUnderConstruction = false;
    if (fileINode.isUnderConstruction()) {
      INodeFileUnderConstruction cons = (INodeFileUnderConstruction) fileINode;
      Block last = fileINode.getLastBlock();
      if (last == null) {
        // This should never happen, but better to handle it properly than to throw
        // an NPE below.
        LOG.error("Null blocks for reported block=" + block + " stored=" + storedBlock +
          " inode=" + fileINode);
        return block;
      }
      blockUnderConstruction = last.equals(storedBlock);
    }

    // block == storedBlock when this addStoredBlock is the result of a block report
    if (block != storedBlock) {
      if (block.getNumBytes() >= 0) {
        long cursize = storedBlock.getNumBytes();
        INodeFile file = storedBlock.getINode();
        if (cursize == 0) {
          storedBlock.setNumBytes(block.getNumBytes());
        } else if (cursize != block.getNumBytes()) {
          LOG.warn("Inconsistent size for block " + block + 
                   " reported from " + node.getName() + 
                   " current size is " + cursize +
                   " reported size is " + block.getNumBytes());
          try {
            if (cursize > block.getNumBytes() && !blockUnderConstruction) {
              // new replica is smaller in size than existing block.
              // Mark the new replica as corrupt.
              LOG.warn("Mark new replica " + block + " from " + node.getName() + 
                  "as corrupt because its length is shorter than existing ones");
              markBlockAsCorrupt(block, node);
            } else {
              // new replica is larger in size than existing block.
              if (!blockUnderConstruction) {
                // Mark pre-existing replicas as corrupt.
                int numNodes = blocksMap.numNodes(block);
                int count = 0;
                DatanodeDescriptor nodes[] = new DatanodeDescriptor[numNodes];
                Iterator<DatanodeDescriptor> it = blocksMap.nodeIterator(block);
                for (; it != null && it.hasNext();) {
                  DatanodeDescriptor dd = it.next();
                  if (!dd.equals(node)) {
                    nodes[count++] = dd;
                  }
                }
                for (int j = 0; j < count; j++) {
                  LOG.warn("Mark existing replica "
                      + block
                      + " from "
                      + node.getName()
                      + " as corrupt because its length is shorter than the new one");
                  markBlockAsCorrupt(block, nodes[j]);
                }
              }
              //
              // change the size of block in blocksMap
              //
              storedBlock.setNumBytes(block.getNumBytes());
            }
          } catch (IOException e) {
            LOG.warn("Error in deleting bad block " + block + e);
          }
        }
        
        //Updated space consumed if required.
        long diff = (file == null) ? 0 :
                    (file.getPreferredBlockSize() - storedBlock.getNumBytes());
        
        if (diff > 0 && file.isUnderConstruction() &&
            cursize < storedBlock.getNumBytes()) {
          try {
            String path = /* For finding parents */ 
              leaseManager.findPath((INodeFileUnderConstruction)file);
            dir.updateSpaceConsumed(path, 0, -diff*file.getReplication());
          } catch (IOException e) {
            LOG.warn("Unexpected exception while updating disk space : " +
                     e.getMessage());
          }
        }
      }
      block = storedBlock;
    }
    assert storedBlock == block : "Block must be stored by now";
        
    int curReplicaDelta = 0;
        
    if (added) {
      curReplicaDelta = 1;
      // 
      // At startup time, because too many new blocks come in
      // they take up lots of space in the log file. 
      // So, we log only when namenode is out of safemode.
      //
      if (!isInSafeMode()) {
        NameNode.stateChangeLog.info("BLOCK* NameSystem.addStoredBlock: "
                                      +"blockMap updated: "+node.getName()+" is added to "+block+" size "+block.getNumBytes());
      }
    } else {
      NameNode.stateChangeLog.warn("BLOCK* NameSystem.addStoredBlock: "
                                   + "Redundant addStoredBlock request received for " 
                                   + block + " on " + node.getName()
                                   + " size " + block.getNumBytes());
    }

    // filter out containingNodes that are marked for decommission.
    NumberReplicas num = countNodes(storedBlock);
    int numLiveReplicas = num.liveReplicas();
    int numCurrentReplica = numLiveReplicas
      + pendingReplications.getNumReplicas(block);

    // check whether safe replication is reached for the block
    incrementSafeBlockCount(numCurrentReplica);
 
    //
    // if file is being actively written to, then do not check 
    // replication-factor here. It will be checked when the file is closed.
    //
    if (blockUnderConstruction) {
      INodeFileUnderConstruction cons = (INodeFileUnderConstruction) fileINode;
      cons.addTarget(node);
      return block;
    }

    // do not handle mis-replicated blocks during startup
    if(isInSafeMode())
      return block;

    // handle underReplication/overReplication
    short fileReplication = fileINode.getReplication();
    if (numCurrentReplica >= fileReplication) {
      neededReplications.remove(block, numCurrentReplica, 
                                num.decommissionedReplicas, fileReplication);
    } else {
      updateNeededReplications(block, curReplicaDelta, 0);
    }
    if (numCurrentReplica > fileReplication) {
      processOverReplicatedBlock(block, fileReplication, node, delNodeHint);
    }
    // If the file replication has reached desired value
    // we can remove any corrupt replicas the block may have
    int corruptReplicasCount = corruptReplicas.numCorruptReplicas(block); 
    int numCorruptNodes = num.corruptReplicas();
    if ( numCorruptNodes != corruptReplicasCount) {
      LOG.warn("Inconsistent number of corrupt replicas for " + 
          block + "blockMap has " + numCorruptNodes + 
          " but corrupt replicas map has " + corruptReplicasCount);
    }
    if ((corruptReplicasCount > 0) && (numLiveReplicas >= fileReplication)) 
      invalidateCorruptReplicas(block);
    return block;
  }
       Naturally, addStoredBlock() first has to validate the reported block. If the block cannot be found in FSNamesystem.blocksMap, a second lookup is made that ignores the block's generation stamp. As the datanode-side code analysis showed, block recovery changes a block's generation stamp, while replicas on failed nodes that did not take part in the recovery keep the old one, so the cluster can hold replicas of the same block carrying different generation stamps. If both lookups fail (the file owning the block has been deleted), or the block found does not belong to any file (its inode is null), the replica is discarded through rejectAddStoredBlock().
       Next comes the staleness check, which must determine whether the reported replica belongs to the last block of a file under construction. If it does not, the replica is clearly stale because of the generation-stamp mismatch, and it is deleted. Otherwise, the datanode is added to the current write targets of the INodeFileUnderConstruction object, and we can conclude that the block is in the middle of block recovery. In that case the replica added through addStoredBlock() cannot yet be judged valid; further handling must wait for the outcome of the recovery.
       In the code above, every invalid replica reported by a datanode is removed through rejectAddStoredBlock(). After writing a log message, this method calls addToInvalidates() to put the replica into recentInvalidateSets.
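       To make the lookup concrete, the checks just described (listed in full at the top of the method) reduce to roughly the following; treat this as a simplified recap, not the verbatim source:

    // Sketch of the lookups at the top of addStoredBlock().
    BlockInfo storedBlock = blocksMap.getStoredBlock(block);
    if (storedBlock == null) {
      // Retry ignoring the generation stamp: block recovery may have bumped it,
      // leaving stale replicas behind with an older generation stamp.
      storedBlock = blocksMap.getStoredBlockWithoutMatchingGS(block);
    }
    if (storedBlock == null || storedBlock.getINode() == null) {
      // The file was deleted, or the block belongs to no file: discard the replica.
      return rejectAddStoredBlock(block, node,
          "Block not in blockMap with any generation stamp");
    }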
       A replica that passes the checks above can be added to the datanode descriptor. Once the following statement executes, the block is recorded both in blocksMap and in the datanode's DatanodeDescriptor object; this single line performs the most fundamental work of FSNamesystem.addStoredBlock():
      boolean added = node.addBlock(storedBlock);
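       node.addBlock() links the replica into the triplets-based doubly linked list described earlier. For reference, its implementation in DatanodeDescriptor is roughly the following sketch (not the verbatim source):

    // DatanodeDescriptor.addBlock(), roughly: record this node in the block's
    // triplets, then insert the block at the head of this node's block list.
    boolean addBlock(BlockInfo b) {
      if (!b.addNode(this)) {
        return false;                            // this node already holds the block
      }
      blockList = b.listInsert(blockList, this); // head insertion into the block list
      return true;                               // the replica is newly added
    }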
       A newly added replica can affect the other replicas of the same block, and the rest of addStoredBlock() handles these possible effects.
       First, addStoredBlock() determines whether the block is the last block of a file under construction, storing the result in blockUnderConstruction. The code:
       
 // Is the block being reported the last block of an underconstruction file?
    boolean blockUnderConstruction = false;
    if (fileINode.isUnderConstruction()) {
      INodeFileUnderConstruction cons = (INodeFileUnderConstruction) fileINode;
      Block last = fileINode.getLastBlock();
      if (last == null) {
        // This should never happen, but better to handle it properly than to throw
        // an NPE below.
        LOG.error("Null blocks for reported block=" + block + " stored=" + storedBlock +
          " inode=" + fileINode);
        return block;
      }
      blockUnderConstruction = last.equals(storedBlock);
    }
         addStoredBlock() is reached both when a datanode commits a block through blockReceived() and during block reports, and the former case needs some extra handling. The difference is that on the block-report path, the block object passed into the method and the block kept in blocksMap (the variable storedBlock) are the same object: when block == storedBlock holds, the call comes from a block report. Note that the test uses == rather than Block.equals(), i.e., it compares object references, not object contents.
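         A small hypothetical illustration of that distinction (the Block values here are made up for the example):

    // Hypothetical example: reference identity vs. content equality for Block.
    Block reported = new Block(1234L, 1024L, 7L);        // rebuilt from a datanode message
    Block stored   = blocksMap.getStoredBlock(reported); // the object kept in blocksMap
    // Here reported and stored are distinct objects, so (reported == stored) is
    // false even though reported.equals(stored) can be true (equals() compares
    // block ids). In addStoredBlock(), the block-report path passes the stored
    // object itself into the method, which is exactly why block == storedBlock
    // holds there.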
         Unlike block reports, (append) writes can change a block's length. If the new replica is shorter than the recorded length, it is clearly invalid, and markBlockAsCorrupt() marks it as "corrupt"; if the new replica is longer, its length is adopted and the pre-existing replicas are marked as "corrupt" instead. Later HDFS versions abandoned this way of judging replica validity by size. The analysis of the HDFS append operation will explain how replicas of the same block can end up with different lengths in the first place. The code:
       
  // block == storedBlock when this addStoredBlock is the result of a block report
    if (block != storedBlock) {
      if (block.getNumBytes() >= 0) {
        long cursize = storedBlock.getNumBytes();
        INodeFile file = storedBlock.getINode();
        if (cursize == 0) {
          storedBlock.setNumBytes(block.getNumBytes());
        } else if (cursize != block.getNumBytes()) {
          LOG.warn("Inconsistent size for block " + block + 
                   " reported from " + node.getName() + 
                   " current size is " + cursize +
                   " reported size is " + block.getNumBytes());
          try {
            if (cursize > block.getNumBytes() && !blockUnderConstruction) {
              // new replica is smaller in size than existing block.
              // Mark the new replica as corrupt.
              LOG.warn("Mark new replica " + block + " from " + node.getName() + 
                  "as corrupt because its length is shorter than existing ones");
              markBlockAsCorrupt(block, node);
            } else {
              // new replica is larger in size than existing block.
              if (!blockUnderConstruction) {
                // Mark pre-existing replicas as corrupt.
                int numNodes = blocksMap.numNodes(block);
                int count = 0;
                DatanodeDescriptor nodes[] = new DatanodeDescriptor[numNodes];
                Iterator<DatanodeDescriptor> it = blocksMap.nodeIterator(block);
                for (; it != null && it.hasNext();) {
                  DatanodeDescriptor dd = it.next();
                  if (!dd.equals(node)) {
                    nodes[count++] = dd;
                  }
                }
                for (int j = 0; j < count; j++) {
                  LOG.warn("Mark existing replica "
                      + block
                      + " from "
                      + node.getName()
                      + " as corrupt because its length is shorter than the new one");
                  markBlockAsCorrupt(block, nodes[j]);
                }
              }
              //
              // change the size of block in blocksMap
              //
              storedBlock.setNumBytes(block.getNumBytes());
            }
          } catch (IOException e) {
            LOG.warn("Error in deleting bad block " + block + e);
          }
        }
        
        //Updated space consumed if required.
        long diff = (file == null) ? 0 :
                    (file.getPreferredBlockSize() - storedBlock.getNumBytes());
        
        if (diff > 0 && file.isUnderConstruction() &&
            cursize < storedBlock.getNumBytes()) {
          try {
            String path = /* For finding parents */ 
              leaseManager.findPath((INodeFileUnderConstruction)file);
            dir.updateSpaceConsumed(path, 0, -diff*file.getReplication());
          } catch (IOException e) {
            LOG.warn("Unexpected exception while updating disk space : " +
                     e.getMessage());
          }
        }
      }
      block = storedBlock;
    }
    assert storedBlock == block : "Block must be stored by now";
         Beyond this handling of replica length, or more precisely of replica validity, the newly added replica also affects the state of the block's other replicas. The code:
         
int curReplicaDelta = 0;
        
    if (added) {
      curReplicaDelta = 1;
      // 
      // At startup time, because too many new blocks come in
      // they take up lots of space in the log file. 
      // So, we log only when namenode is out of safemode.
      //
      if (!isInSafeMode()) {
        NameNode.stateChangeLog.info("BLOCK* NameSystem.addStoredBlock: "
                                      +"blockMap updated: "+node.getName()+" is added to "+block+" size "+block.getNumBytes());
      }
    } else {
      NameNode.stateChangeLog.warn("BLOCK* NameSystem.addStoredBlock: "
                                   + "Redundant addStoredBlock request received for " 
                                   + block + " on " + node.getName()
                                   + " size " + block.getNumBytes());
    }

    // filter out containingNodes that are marked for decommission.
    NumberReplicas num = countNodes(storedBlock);
    int numLiveReplicas = num.liveReplicas();
    int numCurrentReplica = numLiveReplicas
      + pendingReplications.getNumReplicas(block);

    // check whether safe replication is reached for the block
    incrementSafeBlockCount(numCurrentReplica);
 
    //
    // if file is being actively written to, then do not check 
    // replication-factor here. It will be checked when the file is closed.
    //
    if (blockUnderConstruction) {
      INodeFileUnderConstruction cons = (INodeFileUnderConstruction) fileINode;
      cons.addTarget(node);
      return block;
    }

    // do not handle mis-replicated blocks during startup
    if(isInSafeMode())
      return block;

    // handle underReplication/overReplication
    short fileReplication = fileINode.getReplication();
    if (numCurrentReplica >= fileReplication) {
      neededReplications.remove(block, numCurrentReplica, 
                                num.decommissionedReplicas, fileReplication);
    } else {
      updateNeededReplications(block, curReplicaDelta, 0);
    }
    if (numCurrentReplica > fileReplication) {
      processOverReplicatedBlock(block, fileReplication, node, delNodeHint);
    }
    // If the file replication has reached desired value
    // we can remove any corrupt replicas the block may have
    int corruptReplicasCount = corruptReplicas.numCorruptReplicas(block); 
    int numCorruptNodes = num.corruptReplicas();
    if ( numCorruptNodes != corruptReplicasCount) {
      LOG.warn("Inconsistent number of corrupt replicas for " + 
          block + "blockMap has " + numCorruptNodes + 
          " but corrupt replicas map has " + corruptReplicasCount);
    }
    if ((corruptReplicasCount > 0) && (numLiveReplicas >= fileReplication)) 
      invalidateCorruptReplicas(block);
    return block;
         First, addStoredBlock() gathers the current replica picture: countNodes() yields the number of "healthy" replicas, numLiveReplicas, and the current replica count including pending replications, numCurrentReplica; and if the replica was successfully added above, the replica delta curReplicaDelta is set to 1. After skipping a few special cases in which replica state need not be examined, addStoredBlock() branches on how the current replica count compares with the replication factor.
         When the replica count is greater than or equal to the replication factor, any entry for the block in the pending-replication queue neededReplications can clearly be removed, cancelling the replication request; otherwise, updateNeededReplications() updates the number of replicas the queue still expects.
         When there are excess replicas, addStoredBlock() calls processOverReplicatedBlock() with the delNodeHint argument. Since exactly one replica was just added, the replica on datanode delNodeHint is usually the one chosen for removal. At the very end of the method, once the number of healthy replicas reaches the replication factor, all corrupt replicas of the block can be deleted to free datanode space.
       FSNamesystem.addStoredBlock() looks simple but is in fact a rather intricate method. We have spent a lot of space on it in order to draw out the key points of block handling, more precisely of block handling in the namenode's second relationship: the need to distinguish files under construction from ordinary files; changes to a block replica, including generation-stamp and length changes; and the way a change to a single replica affects the state of the block's other replicas.
        2. Deleting block replicas
        Block replicas are deleted in the following three situations:
        The file owning the replica is deleted, so its replicas are deleted with it.
        The block has more replicas than its replication factor, so the excess replicas are deleted.
        A replica is corrupt, so the damaged copy must be deleted.
        (1) Deleting the replicas owned by a deleted file
         When a file is deleted from the directory tree (implemented in FSDirectory.unprotectedDelete()), its blocks are removed through FSNamesystem.removePathAndBlocks(). The method is simple: for each deleted block it first removes the block from blocksMap; since the block is gone, no replication is needed even if some replica was corrupt, so any corruption record in corruptReplicas is removed as well; finally, addToInvalidates() is called to delete the block on every datanode holding a replica.
       addToInvalidates() comes in two overloads. The one shown in the code below looks up all datanodes holding the block and calls the other addToInvalidates(), which in turn uses addToInvalidatesNoLog() to delete the replica. addToInvalidatesNoLog() simply adds the replica to be deleted to the FSNamesystem member variable recentInvalidateSets. The code:
      
 void removePathAndBlocks(String src, List<Block> blocks) throws IOException {
    leaseManager.removeLeaseWithPrefixPath(src);
    for(Block b : blocks) {
      blocksMap.removeINode(b);
      corruptReplicas.removeFromCorruptReplicasMap(b);
      addToInvalidates(b);
    }
  }

  private void addToInvalidates(Block b) {
    for (Iterator<DatanodeDescriptor> it = 
                                blocksMap.nodeIterator(b); it.hasNext();) {
      DatanodeDescriptor node = it.next();
      addToInvalidates(b, node);
    }
  }


  /**
   * Adds block to list of blocks which will be invalidated on 
   * specified datanode
   * @param b block
   * @param n datanode
   */
  void addToInvalidatesNoLog(Block b, DatanodeInfo n) {
    Collection<Block> invalidateSet = recentInvalidateSets.get(n.getStorageID());
    if (invalidateSet == null) {
      invalidateSet = new HashSet<Block>();
      recentInvalidateSets.put(n.getStorageID(), invalidateSet);
    }
    if (invalidateSet.add(b)) {
      pendingDeletionBlocksCount++;
    }
  }
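         The two-argument overload sits between the two methods above and is not listed; it presumably just delegates and logs, along these lines (a sketch):

  // Sketch of the intermediate overload: delegate to the no-log variant,
  // then record the state change in the namenode log.
  private void addToInvalidates(Block b, DatanodeInfo n) {
    addToInvalidatesNoLog(b, n);
    NameNode.stateChangeLog.info("BLOCK* NameSystem.addToInvalidates: "
        + b.getBlockName() + " is added to invalidSet of " + n.getName());
  }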
         (2) Deleting excess replicas
         In the final stage of addStoredBlock(), excess replicas are removed through processOverReplicatedBlock(). This method really just prepares data for chooseExcessReplicates(): since blocksMap records every datanode holding a replica of the block, the method iterates over those datanodes and marks a node as a candidate for deletion, collecting it into the variable nonExcess, only when all three of the following conditions hold:
          The node does not already appear in excessReplicateMap.
          The node is neither decommissioning nor decommissioned.
          The replica held by the node is not corrupt.
           The code of processOverReplicatedBlock():
           
 /**
   * Find how many of the containing nodes are "extra", if any.
   * If there are any extras, call chooseExcessReplicates() to
   * mark them in the excessReplicateMap.
   */
  private void processOverReplicatedBlock(Block block, short replication, 
      DatanodeDescriptor addedNode, DatanodeDescriptor delNodeHint) {
    if(addedNode == delNodeHint) {
      delNodeHint = null;
    }
    Collection<DatanodeDescriptor> nonExcess = new ArrayList<DatanodeDescriptor>();
    Collection<DatanodeDescriptor> corruptNodes = corruptReplicas.getNodes(block);
    for (Iterator<DatanodeDescriptor> it = blocksMap.nodeIterator(block); 
         it.hasNext();) {
      DatanodeDescriptor cur = it.next();
      Collection<Block> excessBlocks = excessReplicateMap.get(cur.getStorageID());
      if (excessBlocks == null || !excessBlocks.contains(block)) {
        if (!cur.isDecommissionInProgress() && !cur.isDecommissioned()) {
          // exclude corrupt replicas
          if (corruptNodes == null || !corruptNodes.contains(cur)) {
            nonExcess.add(cur);
          }
        }
      }
    }
    chooseExcessReplicates(nonExcess, block, replication, 
        addedNode, delNodeHint);    
  }

/**
   * We want "replication" replicates for the block, but we now have too many.  
   * In this method, copy enough nodes from 'srcNodes' into 'dstNodes' such that:
   *
   * srcNodes.size() - dstNodes.size() == replication
   *
   * We pick node that make sure that replicas are spread across racks and
   * also try hard to pick one with least free space.
   * The algorithm is first to pick a node with least free space from nodes
   * that are on a rack holding more than one replicas of the block.
   * So removing such a replica won't remove a rack. 
   * If no such a node is available,
   * then pick a node with least free space
   */
  void chooseExcessReplicates(Collection<DatanodeDescriptor> nonExcess, 
                              Block b, short replication,
                              DatanodeDescriptor addedNode,
                              DatanodeDescriptor delNodeHint) {
    // first form a rack to datanodes map and
    HashMap<String, ArrayList<DatanodeDescriptor>> rackMap =
      new HashMap<String, ArrayList<DatanodeDescriptor>>();
    for (Iterator<DatanodeDescriptor> iter = nonExcess.iterator();
         iter.hasNext();) {
      DatanodeDescriptor node = iter.next();
      String rackName = node.getNetworkLocation();
      ArrayList<DatanodeDescriptor> datanodeList = rackMap.get(rackName);
      if(datanodeList==null) {
        datanodeList = new ArrayList<DatanodeDescriptor>();
      }
      datanodeList.add(node);
      rackMap.put(rackName, datanodeList);
    }
    
    // split nodes into two sets
    // priSet contains nodes on rack with more than one replica
    // remains contains the remaining nodes
    ArrayList<DatanodeDescriptor> priSet = new ArrayList<DatanodeDescriptor>();
    ArrayList<DatanodeDescriptor> remains = new ArrayList<DatanodeDescriptor>();
    for( Iterator<Entry<String, ArrayList<DatanodeDescriptor>>> iter = 
      rackMap.entrySet().iterator(); iter.hasNext(); ) {
      Entry<String, ArrayList<DatanodeDescriptor>> rackEntry = iter.next();
      ArrayList<DatanodeDescriptor> datanodeList = rackEntry.getValue(); 
      if( datanodeList.size() == 1 ) {
        remains.add(datanodeList.get(0));
      } else {
        priSet.addAll(datanodeList);
      }
    }
    
    // pick one node to delete that favors the delete hint
    // otherwise pick one with least space from priSet if it is not empty
    // otherwise one node with least space from remains
    boolean firstOne = true;
    while (nonExcess.size() - replication > 0) {
      DatanodeInfo cur = null;
      long minSpace = Long.MAX_VALUE;

      // check if we can del delNodeHint
      if (firstOne && delNodeHint !=null && nonExcess.contains(delNodeHint) &&
            (priSet.contains(delNodeHint) || (addedNode != null && !priSet.contains(addedNode))) ) {
          cur = delNodeHint;
      } else { // regular excessive replica removal
        Iterator<DatanodeDescriptor> iter = 
          priSet.isEmpty() ? remains.iterator() : priSet.iterator();
          while( iter.hasNext() ) {
            DatanodeDescriptor node = iter.next();
            long free = node.getRemaining();

            if (minSpace > free) {
              minSpace = free;
              cur = node;
            }
          }
      }

      firstOne = false;
      // adjust rackmap, priSet, and remains
      String rack = cur.getNetworkLocation();
      ArrayList<DatanodeDescriptor> datanodes = rackMap.get(rack);
      datanodes.remove(cur);
      if(datanodes.isEmpty()) {
        rackMap.remove(rack);
      }
      if( priSet.remove(cur) ) {
        if (datanodes.size() == 1) {
          priSet.remove(datanodes.get(0));
          remains.add(datanodes.get(0));
        }
      } else {
        remains.remove(cur);
      }

      nonExcess.remove(cur);

      Collection<Block> excessBlocks = excessReplicateMap.get(cur.getStorageID());
      if (excessBlocks == null) {
        excessBlocks = new TreeSet<Block>();
        excessReplicateMap.put(cur.getStorageID(), excessBlocks);
      }
      if (excessBlocks.add(b)) {
        excessBlocksCount++;
        NameNode.stateChangeLog.debug("BLOCK* NameSystem.chooseExcessReplicates: "
                                      +"("+cur.getName()+", "+b
                                      +") is added to excessReplicateMap");
      }

      //
      // The 'excessblocks' tracks blocks until we get confirmation
      // that the datanode has deleted them; the only way we remove them
      // is when we get a "removeBlock" message.  
      //
      // The 'invalidate' list is used to inform the datanode the block 
      // should be deleted.  Items are removed from the invalidate list
      // upon giving instructions to the namenode.
      //
      addToInvalidatesNoLog(b, cur);
      NameNode.stateChangeLog.info("BLOCK* NameSystem.chooseExcessReplicates: "
                +"("+cur.getName()+", "+b+") is added to recentInvalidateSets");
    }
  }

        FSNamesystem.chooseExcessReplicates() is a lengthy method whose goal is to pick the "right" nodes among the candidates above and schedule the deletions there. Replica placement matters here: the method first tries to pick, from a rack holding more than one replica, the node with the least free space, so that removing a replica never removes a whole rack; only if no such node exists does it fall back to the "busiest" node overall, i.e., the one with the least remaining space.
        (3) Deleting corrupt replicas
        The last case is the deletion of corrupt replicas. Both a client reading a file and the block scanner running on a datanode may discover a corrupt replica; they report their findings to the namenode, where markBlockAsCorrupt() eventually handles them. The code:
      
  /**
   * Mark the block belonging to datanode as corrupt
   * @param blk Block to be marked as corrupt
   * @param dn Datanode which holds the corrupt replica
   */
  public synchronized void markBlockAsCorrupt(Block blk, DatanodeInfo dn)
    throws IOException {
    DatanodeDescriptor node = getDatanode(dn);
    if (node == null) {
      throw new IOException("Cannot mark block" + blk.getBlockName() +
                            " as corrupt because datanode " + dn.getName() +
                            " does not exist. ");
    }
    
    final BlockInfo storedBlockInfo = blocksMap.getStoredBlock(blk);
    if (storedBlockInfo == null) {
      // Check if the replica is in the blockMap, if not 
      // ignore the request for now. This could happen when BlockScanner
      // thread of Datanode reports bad block before Block reports are sent
      // by the Datanode on startup
      NameNode.stateChangeLog.info("BLOCK NameSystem.markBlockAsCorrupt: " +
                                   "block " + blk + " could not be marked " +
                                   "as corrupt as it does not exists in " +
                                   "blocksMap");
    } else {
      INodeFile inode = storedBlockInfo.getINode();
      if (inode == null) {
        NameNode.stateChangeLog.info("BLOCK NameSystem.markBlockAsCorrupt: " +
                                     "block " + blk + " could not be marked " +
                                     "as corrupt as it does not belong to " +
                                     "any file");
        addToInvalidates(storedBlockInfo, node);
        return;
      } 
      // Add this replica to corruptReplicas Map 
      corruptReplicas.addToCorruptReplicasMap(storedBlockInfo, node);
      if (countNodes(storedBlockInfo).liveReplicas()>inode.getReplication()) {
        // the block is over-replicated so invalidate the replicas immediately
        invalidateBlock(storedBlockInfo, node);
      } else {
        // add the block to neededReplication 
        updateNeededReplications(storedBlockInfo, -1, 0);
      }
    }
  }
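
        When the block is over-replicated, markBlockAsCorrupt() calls invalidateBlock(). A condensed sketch of that method (not the verbatim source): the deletion is only scheduled if at least one other live replica exists, so a corrupt replica is never thrown away while it is the only remaining copy.

  // Condensed sketch of FSNamesystem.invalidateBlock().
  private void invalidateBlock(Block blk, DatanodeInfo dn) throws IOException {
    DatanodeDescriptor node = getDatanode(dn);
    if (node == null) {
      throw new IOException("Cannot invalidate block " + blk
          + " because datanode " + dn.getName() + " does not exist.");
    }
    // Only delete if at least one live copy would remain afterwards.
    int count = countNodes(blk).liveReplicas();
    if (count > 1) {
      addToInvalidates(blk, dn);    // schedule deletion on this datanode
      removeStoredBlock(blk, node); // drop the replica from the block map bookkeeping
    }
    // Otherwise keep the corrupt replica: it is the only copy we have.
  }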

        3. Block replication
        Block replication is what keeps HDFS running correctly; it embodies HDFS's fault detection and automatic recovery.
       Several of the flows over the namenode's second relationship introduced earlier trigger block replication, such as datanode decommissioning and replica corruption. These flows generally update the replication bookkeeping through FSNamesystem.updateNeededReplications(); the remaining work is done by the namenode's replication thread.
       Updating the replication bookkeeping ultimately goes through UnderReplicatedBlocks.update(). This method takes quite a few parameters: besides the block to replicate, the rest feed the priority computation. Computing a priority needs only four parameters, but update() must compute the priority both before and after the change and move the block between priority queues accordingly. The main code of update():
     
/* update the priority level of a block */
  synchronized void update(Block block, int curReplicas, 
                           int decommissionedReplicas,
                           int curExpectedReplicas,
                           int curReplicasDelta, int expectedReplicasDelta) {
    int oldReplicas = curReplicas-curReplicasDelta;
    int oldExpectedReplicas = curExpectedReplicas-expectedReplicasDelta;
    int curPri = getPriority(block, curReplicas, decommissionedReplicas, curExpectedReplicas);
    int oldPri = getPriority(block, oldReplicas, decommissionedReplicas, oldExpectedReplicas);
    NameNode.stateChangeLog.debug("UnderReplicationBlocks.update " + 
                                  block +
                                  " curReplicas " + curReplicas +
                                  " curExpectedReplicas " + curExpectedReplicas +
                                  " oldReplicas " + oldReplicas +
                                  " oldExpectedReplicas  " + oldExpectedReplicas +
                                  " curPri  " + curPri +
                                  " oldPri  " + oldPri);
    if(oldPri != LEVEL && oldPri != curPri) {
      remove(block, oldPri);
    }
    if(curPri != LEVEL && priorityQueues.get(curPri).add(block)) {
      NameNode.stateChangeLog.debug(
                                    "BLOCK* NameSystem.UnderReplicationBlock.update:"
                                    + block
                                    + " has only "+curReplicas
                                    + " replicas and need " + curExpectedReplicas
                                    + " replicas so is added to neededReplications"
                                    + " at priority level " + curPri);
    }
  }
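
      update() relies on getPriority(), which is not shown above. Its logic can be reconstructed roughly as follows (levels 0 to 2 are real queues and LEVEL means "no replication needed"; treat this as a sketch rather than the verbatim source):

  // Sketch of UnderReplicatedBlocks.getPriority(): smaller value = more urgent.
  private int getPriority(Block block, int curReplicas,
                          int decommissionedReplicas, int expectedReplicas) {
    if (curReplicas < 0 || curReplicas >= expectedReplicas) {
      return LEVEL;                              // enough replicas: not queued at all
    } else if (curReplicas == 0) {
      // No live replica; if a decommissioned node still holds one, copy it urgently.
      return decommissionedReplicas > 0 ? 0 : 2;
    } else if (curReplicas == 1) {
      return 0;                                  // a single replica: highest priority
    } else if (curReplicas * 3 < expectedReplicas) {
      return 1;                                  // badly under-replicated
    } else {
      return 2;                                  // mildly under-replicated
    }
  }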
      The replication thread is implemented in FSNamesystem's inner class ReplicationMonitor. While FSNamesystem is running, it periodically calls computeDatanodeWork() and processPendingReplications(). The interval is kept in the variable replicationRecheckInterval, configurable through ${dfs.replication.interval}, with a default of 3 seconds.
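      For reference, the field is presumably initialized from the configuration along these lines (a sketch; the configuration value is in seconds, the field in milliseconds):

    // Sketch: reading the recheck interval during FSNamesystem initialization.
    this.replicationRecheckInterval =
        conf.getInt("dfs.replication.interval", 3) * 1000L;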
      Before walking through the methods on run()'s call chain, a word about the two queues ReplicationMonitor works with. ReplicateQueueProcessingStats and InvalidateQueueProcessingStats both extend QueueProcessingStatistics; what they have in common is a workload of many tasks of which the server can process only a bounded number per pass, so several iterations are needed to drain the queue. QueueProcessingStatistics adds some metrics for such workloads, and the two inner classes of ReplicationMonitor wrap neededReplications and recentInvalidateSets respectively. The code:
     
  /**
   * Periodically calls computeReplicationWork().
   */
  class ReplicationMonitor implements Runnable {
    static final int INVALIDATE_WORK_PCT_PER_ITERATION = 32;
    static final float REPLICATION_WORK_MULTIPLIER_PER_ITERATION = 2;
    ReplicateQueueProcessingStats replicateQueueStats = 
        new ReplicateQueueProcessingStats();
    InvalidateQueueProcessingStats invalidateQueueStats = 
        new InvalidateQueueProcessingStats();
    
    public void run() {
      while (fsRunning) {
        try {
          computeDatanodeWork();
          processPendingReplications();
          Thread.sleep(replicationRecheckInterval);
        } catch (InterruptedException ie) {
          LOG.warn("ReplicationMonitor thread received InterruptedException." + ie);
          break;
        } catch (IOException ie) {
          LOG.warn("ReplicationMonitor thread received exception. " + ie +  " " +
                   StringUtils.stringifyException(ie));
        } catch (Throwable t) {
          LOG.warn("ReplicationMonitor thread received Runtime exception. " + t + " " +
                   StringUtils.stringifyException(t));
          Runtime.getRuntime().exit(-1);
        }
      }
      // ...... (rest of run() elided)
    }
        computeDatanodeWork() computes "datanode work", which here means replica deletion and replica replication; these are implemented in computeInvalidateWork() and computeReplicationWork() respectively, both called from computeDatanodeWork(). Replica deletion therefore does not get a thread of its own; it shares one with replica replication.
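        Using the two constants defined in ReplicationMonitor above, computeDatanodeWork() sizes one pass of work from the number of live datanodes (heartbeats is assumed here to be FSNamesystem's list of live datanodes). A sketch consistent with those constants, metrics updates omitted:

  // Sketch of FSNamesystem.computeDatanodeWork(): one pass of replication plus
  // deletion work; nothing is scheduled while the namenode is in safe mode.
  int computeDatanodeWork() throws IOException {
    if (isInSafeMode()) {
      return 0;  // blocks are neither replicated nor removed in safe mode
    }
    int blocksToProcess, nodesToProcess;
    synchronized (heartbeats) {
      // replicate at most 2 blocks per live datanode per pass ...
      blocksToProcess = (int) (heartbeats.size()
          * ReplicationMonitor.REPLICATION_WORK_MULTIPLIER_PER_ITERATION);
      // ... and run deletions on at most 32% of the live datanodes per pass
      nodesToProcess = (int) Math.ceil(heartbeats.size()
          * ReplicationMonitor.INVALIDATE_WORK_PCT_PER_ITERATION / 100.0);
    }
    int workFound = computeReplicationWork(blocksToProcess);
    workFound += computeInvalidateWork(nodesToProcess);
    return workFound;
  }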
       The deletion side, computeInvalidateWork(), processes pending replica deletions on at most nodesToProcess datanodes. After randomly selecting nodesToProcess datanodes, it calls invalidateWorkForOneNode() on each of them. The code:
         
 /**
   * Schedule blocks for deletion at datanodes
   * @param nodesToProcess number of datanodes to schedule deletion work
   * @return total number of block for deletion
   */
  int computeInvalidateWork(int nodesToProcess) {
    int numOfNodes = 0;
    ArrayList<String> keyArray;
    synchronized (this) {
      numOfNodes = recentInvalidateSets.size();
      // get an array of the keys
      keyArray = new ArrayList<String>(recentInvalidateSets.keySet());
    }
    nodesToProcess = Math.min(numOfNodes, nodesToProcess);

    // randomly pick up <i>nodesToProcess</i> nodes 
    // and put them at [0, nodesToProcess)
    int remainingNodes = numOfNodes - nodesToProcess;
    if (nodesToProcess < remainingNodes) {
      for(int i=0; i<nodesToProcess; i++) {
        int keyIndex = r.nextInt(numOfNodes-i)+i;
        Collections.swap(keyArray, keyIndex, i); // swap to front
      }
    } else {
      for(int i=0; i<remainingNodes; i++) {
        int keyIndex = r.nextInt(numOfNodes-i);
        Collections.swap(keyArray, keyIndex, numOfNodes-i-1); // swap to end
      }
    }
    
    int blockCnt = 0;
    for(int nodeCnt = 0; nodeCnt < nodesToProcess; nodeCnt++ ) {
      blockCnt += invalidateWorkForOneNode(keyArray.get(nodeCnt));
    }
    return blockCnt;
  }
/**
   * Get blocks to invalidate for <i>nodeId</i> 
   * in {@link #recentInvalidateSets}.
   * 
   * @return number of blocks scheduled for removal during this iteration.
   */
  private synchronized int invalidateWorkForOneNode(String nodeId) {
    // blocks should not be replicated or removed if safe mode is on
    if (isInSafeMode())
      return 0;
    // get blocks to invalidate for the nodeId
    assert nodeId != null;
    DatanodeDescriptor dn = datanodeMap.get(nodeId);
    if (dn == null) {
      recentInvalidateSets.remove(nodeId);
      return 0;
    }
    Collection<Block> invalidateSet = recentInvalidateSets.get(nodeId);
    if (invalidateSet == null) {
      return 0;
    }

    ArrayList<Block> blocksToInvalidate = 
      new ArrayList<Block>(blockInvalidateLimit);

    // # blocks that can be sent in one message is limited
    Iterator<Block> it = invalidateSet.iterator();
    for(int blkCount = 0; blkCount < blockInvalidateLimit && it.hasNext();
                                                                blkCount++) {
      blocksToInvalidate.add(it.next());
      it.remove();
    }

    // If we send everything in this message, remove this node entry
    if (!it.hasNext()) {
      recentInvalidateSets.remove(nodeId);
    }

    dn.addBlocksToBeInvalidated(blocksToInvalidate);

    if(NameNode.stateChangeLog.isInfoEnabled()) {
      StringBuffer blockList = new StringBuffer();
      for(Block blk : blocksToInvalidate) {
        blockList.append(' ');
        blockList.append(blk);
      }
      NameNode.stateChangeLog.info("BLOCK* ask "
          + dn.getName() + " to delete " + blockList);
    }
    pendingDeletionBlocksCount -= blocksToInvalidate.size();
    return blocksToInvalidate.size();
  }

       FSNamesystem.invalidateWorkForOneNode() is also straightforward: it takes up to blockInvalidateLimit of the blocks invalidated on the node and moves them into the DatanodeDescriptor member variable invalidateBlocks, where they wait for the next heartbeat to be shipped out as a namenode instruction.
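        On the DatanodeDescriptor side, addBlocksToBeInvalidated() presumably just appends to that per-node queue under a lock, roughly:

  // Sketch of DatanodeDescriptor.addBlocksToBeInvalidated(): queue the blocks
  // until the next heartbeat picks them up as a block-invalidation command.
  void addBlocksToBeInvalidated(List<Block> blocklist) {
    assert blocklist != null && blocklist.size() > 0;
    synchronized (invalidateBlocks) {
      for (Block blk : blocklist) {
        invalidateBlocks.add(blk);
      }
    }
  }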
       computeReplicationWork() plays the same role as computeInvalidateWork(): it selects the replicas to copy via chooseUnderReplicatedBlocks() and then generates replication requests with computeReplicationWorkForBlock(). The code:
      
 /**
   * Scan blocks in {@link #neededReplications} and assign replication
   * work to data-nodes they belong to. 
   * 
   * The number of process blocks equals either twice the number of live 
   * data-nodes or the number of under-replicated blocks whichever is less.
   * 
   * @return number of blocks scheduled for replication during this iteration.
   */
  private int computeReplicationWork(
                                  int blocksToProcess) throws IOException {
    // stall only useful for unit tests (see TestFileAppend4.java)
    if (stallReplicationWork)  {
      return 0;
    }
    
    // Choose the blocks to be replicated
    List<List<Block>> blocksToReplicate = 
      chooseUnderReplicatedBlocks(blocksToProcess);

    // replicate blocks
    int scheduledReplicationCount = 0;
    for (int i=0; i<blocksToReplicate.size(); i++) {
      for(Block block : blocksToReplicate.get(i)) {
        if (computeReplicationWorkForBlock(block, i)) {
          scheduledReplicationCount++;
        }
      }
    }
    return scheduledReplicationCount;
  }
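
        chooseUnderReplicatedBlocks() is not listed here. A simplified sketch of its selection loop follows, assuming neededReplications exposes a per-level iterator; the real method also remembers its scan position in replIndex across calls:

  // Simplified sketch: pick up to blocksToProcess blocks, visiting the priority
  // queues from the most urgent level (0) downwards.
  List<List<Block>> chooseUnderReplicatedBlocksSketch(int blocksToProcess) {
    List<List<Block>> blocksToReplicate = new ArrayList<List<Block>>();
    for (int level = 0; level < UnderReplicatedBlocks.LEVEL; level++) {
      blocksToReplicate.add(new ArrayList<Block>());
    }
    synchronized (neededReplications) {
      int taken = 0;
      for (int level = 0; level < UnderReplicatedBlocks.LEVEL
           && taken < blocksToProcess; level++) {
        Iterator<Block> it = neededReplications.iterator(level); // assumed helper
        while (it.hasNext() && taken < blocksToProcess) {
          blocksToReplicate.get(level).add(it.next());
          taken++;
        }
      }
    }
    return blocksToReplicate;
  }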
        Replication requests carry priorities: chooseUnderReplicatedBlocks() selects at most blocksToProcess requests by priority, as sketched above, and hands them to computeReplicationWorkForBlock(). That method looks rather "long" and contains some redundant code, but its logic falls cleanly into four parts: parameter checking, choosing the replication source datanode, choosing the target datanodes, and generating the replication request.
        
 /** Replicate a block
   * 
   * @param block block to be replicated
   * @param priority a hint of its priority in the neededReplication queue
   * @return if the block gets replicated or not
   */
  boolean computeReplicationWorkForBlock(Block block, int priority) {
    int requiredReplication, numEffectiveReplicas; 
    List<DatanodeDescriptor> containingNodes;
    DatanodeDescriptor srcNode;
    
    synchronized (this) {
      synchronized (neededReplications) {
        // block should belong to a file
        INodeFile fileINode = blocksMap.getINode(block);
        // abandoned block or block reopened for append
        if(fileINode == null || fileINode.isUnderConstruction()) { 
          neededReplications.remove(block, priority); // remove from neededReplications
          replIndex--;
          return false;
        }
        requiredReplication = fileINode.getReplication(); 

        // get a source data-node
        containingNodes = new ArrayList<DatanodeDescriptor>();
        NumberReplicas numReplicas = new NumberReplicas();
        srcNode = chooseSourceDatanode(block, containingNodes, numReplicas);
        if ((numReplicas.liveReplicas() + numReplicas.decommissionedReplicas())
            <= 0) {          
          missingBlocksInCurIter++;
        }
        if(srcNode == null) // block can not be replicated from any node
          return false;

        // do not schedule more if enough replicas is already pending
        numEffectiveReplicas = numReplicas.liveReplicas() +
                                pendingReplications.getNumReplicas(block);
        if(numEffectiveReplicas >= requiredReplication) {
          neededReplications.remove(block, priority); // remove from neededReplications
          replIndex--;
          NameNode.stateChangeLog.info("BLOCK* "
              + "Removing block " + block
              + " from neededReplications as it has enough replicas.");
          return false;
        }
      }
    }

    // choose replication targets: NOT HOLDING THE GLOBAL LOCK
    DatanodeDescriptor targets[] = replicator.chooseTarget(
        requiredReplication - numEffectiveReplicas,
        srcNode, containingNodes, null, block.getNumBytes());
    if(targets.length == 0)
      return false;

    synchronized (this) {
      synchronized (neededReplications) {
        // Recheck since global lock was released
        // block should belong to a file
        INodeFile fileINode = blocksMap.getINode(block);
        // abandoned block or block reopened for append
        if(fileINode == null || fileINode.isUnderConstruction()) { 
          neededReplications.remove(block, priority); // remove from neededReplications
          replIndex--;
          return false;
        }
        requiredReplication = fileINode.getReplication(); 

        // do not schedule more if enough replicas is already pending
        NumberReplicas numReplicas = countNodes(block);
        numEffectiveReplicas = numReplicas.liveReplicas() +
        pendingReplications.getNumReplicas(block);
        if(numEffectiveReplicas >= requiredReplication) {
          neededReplications.remove(block, priority); // remove from neededReplications
          replIndex--;
          NameNode.stateChangeLog.info("BLOCK* "
              + "Removing block " + block
              + " from neededReplications as it has enough replicas.");
          return false;
        } 

        // Add block to the to be replicated list
        srcNode.addBlockToBeReplicated(block, targets);

        for (DatanodeDescriptor dn : targets) {
          dn.incBlocksScheduled();
        }
        
        // Move the block-replication into a "pending" state.
        // The reason we use 'pending' is so we can retry
        // replications that fail after an appropriate amount of time.
        pendingReplications.add(block, targets.length);
        NameNode.stateChangeLog.debug(
            "BLOCK* block " + block
            + " is moved from neededReplications to pendingReplications");

        // remove from neededReplications
        if(numEffectiveReplicas + targets.length >= requiredReplication) {
          neededReplications.remove(block, priority); // remove from neededReplications
          replIndex--;
        }
        if (NameNode.stateChangeLog.isInfoEnabled()) {
          StringBuffer targetList = new StringBuffer("datanode(s)");
          for (int k = 0; k < targets.length; k++) {
            targetList.append(' ');
            targetList.append(targets[k].getName());
          }
          NameNode.stateChangeLog.info(
                    "BLOCK* ask "
                    + srcNode.getName() + " to replicate "
                    + block + " to " + targetList);
          NameNode.stateChangeLog.debug(
                    "BLOCK* neededReplications = " + neededReplications.size()
                    + " pendingReplications = " + pendingReplications.size());
        }
      }
    }
    
    return true;
  }

   Copyright note: parts of this article are excerpted from the book 《Hadoop技术内幕：深入解析Hadoop Common和HDFS架构设计与实现原理》 by 蔡斌 and 陈湘萍. This is a study note intended only for technical exchange; the commercial copyright remains with the original authors. Readers are encouraged to buy the book for further study, and please credit the original authors when reposting. Thanks!

 
