Hadoop Source Code Analysis Notes (12): NameNode -- The Filesystem Directory Tree

An Overview of the NameNode

        There is only one NameNode instance in a Hadoop distributed file system, yet it is the most complex one. The NameNode maintains the two most important relationships in HDFS:
        1. The HDFS directory tree, together with each file's block index, i.e. the list of data blocks belonging to each file.
        2. The mapping between data blocks and DataNodes, i.e. the information about which DataNodes hold a given block.
        The HDFS directory tree, the file/directory metadata and the block indexes are persisted to disk, stored in the namespace image and the edit log. The block-to-DataNode mapping, by contrast, is built dynamically after the NameNode starts, from reports sent up by the DataNodes. On top of these relationships, the NameNode manages the DataNodes: it receives DataNode registrations, heartbeats, block commits and other reports, and sends back NameNode commands such as block replication, deletion and recovery. At the same time, it supports client operations on the directory tree, client reads and writes of file data, and administration of the HDFS system.
        In the discussion that follows, the relationship among the HDFS directory tree, the file/directory metadata and the block indexes is called the NameNode's first relationship, and the mapping between data blocks and DataNodes is called the NameNode's second relationship.
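As a rough mental model (not Hadoop's actual classes), the two relationships can be sketched as two in-memory mappings; the names `fileToBlocks` and `blockToNodes` below are invented for illustration only:

```java
import java.util.*;

public class NamespaceSketch {
  // First relationship: path -> list of block ids (persisted in fsimage + edits).
  static Map<String, List<Long>> fileToBlocks = new HashMap<>();
  // Second relationship: block id -> DataNodes holding it (rebuilt from block reports).
  static Map<Long, Set<String>> blockToNodes = new HashMap<>();

  public static void main(String[] args) {
    fileToBlocks.put("/user/alice/data.txt", Arrays.asList(1001L, 1002L));
    blockToNodes.put(1001L, new HashSet<>(Arrays.asList("dn1", "dn2", "dn3")));
    blockToNodes.put(1002L, new HashSet<>(Arrays.asList("dn2", "dn4", "dn5")));

    // To read the file, a client asks the NameNode which blocks the file has
    // and where each block lives; only then does it contact DataNodes.
    for (long blk : fileToBlocks.get("/user/alice/data.txt")) {
      System.out.println(blk + " -> " + blockToNodes.get(blk));
    }
  }
}
```

Note how only the first map needs to be persisted; the second can always be reconstructed from DataNode reports, which is exactly why HDFS does not write it to disk.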
The Filesystem Directory Tree
        1) From i-node to INode
        In a Linux file system, the i-node is the index node. An i-node stores a file's metadata, such as the file type and permissions, the owner identifier and the file length in bytes; the latter part of the i-node holds the block index, i.e. the location of the file's or directory's data.
        Depending on how much data is stored, the block index can use in-i-node (direct) entries, a single indirect block, a double indirect block, and so on. For a file, the block index points to the locations where the file's data blocks are stored, and through these index entries the file data can be read or written; for a directory, the directory entries are stored in the blocks allocated to that directory.
       This i-node design takes full account of the fact that i-nodes must be stored on a block device, which is why it introduces many fixed-length records and structures.
        The Linux i-node has been so influential that the NameNode names its file and directory abstractions after it. The INode-related classes in the NameNode include INode, INodeDirectory, INodeFile and INodeFileUnderConstruction. Their relationships are as follows:
        
       INode is an abstract class and the parent of INodeDirectory and INodeFile: INodeDirectory naturally represents a directory in HDFS, while INodeFile abstracts an HDFS file. INodeDirectory has a subclass, INodeDirectoryWithQuota, which, as its name suggests, is a directory with quotas. INodeFile's subclass INodeFileUnderConstruction is the "odd one out" in this inheritance tree: it represents a file that has been opened for writing.
       1. INode
        As the root of this inheritance tree, INode stores the attributes shared by files and directories: the file/directory name name, the parent directory parent, the last modification time modificationTime, the last access time accessTime, and permission, which packs together the access permissions, the owner identifier and the group identifier. Compared with the Linux i-node, the HDFS INode does not need to support hard links, nor features such as the i-node's own last change time, so it stores fewer attributes than an i-node.
       Among INode's attributes, permission deserves attention. Its type is long, i.e. it has 64 bits, and it stores three file/directory attributes. How can three attributes live in a single variable? INode divides the 64 bits of permission into three segments, used respectively for the access permissions, the owner identifier and the group identifier, and makes clever use of a Java enum to define segmented operations on the long, implementing access to all three attributes. The code is as follows:
       
  private static enum PermissionStatusFormat {
    MODE(0, 16),
    GROUP(MODE.OFFSET + MODE.LENGTH, 25),
    USER(GROUP.OFFSET + GROUP.LENGTH, 23);

    final int OFFSET;
    final int LENGTH; //bit length
    final long MASK;

    PermissionStatusFormat(int offset, int length) {
      OFFSET = offset;
      LENGTH = length;
      MASK = ((-1L) >>> (64 - LENGTH)) << OFFSET;
    }

    long retrieve(long record) {
      return (record & MASK) >>> OFFSET;
    }

    long combine(long bits, long record) {
      return (record & ~MASK) | (bits << OFFSET);
    }
  }
       The enum PermissionStatusFormat has three values, MODE, GROUP and USER, which handle the access permissions, the group identifier and the owner identifier respectively. When each of the three enum values is created, it invokes the PermissionStatusFormat constructor, which takes two parameters: the offset and the bit length of the corresponding field inside the long permission.
        INode.getUserName() returns the INode's owner. Since the owner identifier is stored in bits 41~63 of permission, the method uses USER.retrieve(), which ANDs the member variable permission with the mask USER.MASK and then shifts right to obtain the identifier's value; it then looks the identifier up in the SerialNumberManager instance, which keeps the mapping between identifiers and user names, to obtain the user name as a string.
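The bit-field technique can be exercised on its own, outside Hadoop. The sketch below reuses the MODE/GROUP/USER layout from the enum above (offsets 0/16/41, bit lengths 16/25/23); the class and method names are mine:

```java
public class PermissionBits {
  // Same mask construction as PermissionStatusFormat: (offset, bit length).
  static long mask(int offset, int length) {
    return ((-1L) >>> (64 - length)) << offset;
  }
  // Extract a field: mask out the other fields, then shift down to bit 0.
  static long retrieve(long record, int offset, int length) {
    return (record & mask(offset, length)) >>> offset;
  }
  // Overwrite a field: clear its bits, then OR in the new value.
  static long combine(long bits, long record, int offset, int length) {
    return (record & ~mask(offset, length)) | (bits << offset);
  }

  public static void main(String[] args) {
    long permission = 0L;
    permission = combine(0755, permission, 0, 16);  // MODE,  bits 0..15
    permission = combine(42,   permission, 16, 25); // GROUP, bits 16..40
    permission = combine(7,    permission, 41, 23); // USER,  bits 41..63
    System.out.println(retrieve(permission, 0, 16));  // 493 (= 0755 octal)
    System.out.println(retrieve(permission, 16, 25)); // 42
    System.out.println(retrieve(permission, 41, 23)); // 7
  }
}
```

All three fields round-trip through the single long, which is exactly what lets INode avoid three separate member variables.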
        INode.setUser() sets the node's user name. Its implementation also relies on PermissionStatusFormat, using USER.combine() to set the bits corresponding to the owner identifier.
        In HDFS, the mappings from user names to user identifiers and from group names to group identifiers are kept in a SerialNumberManager object. Thanks to SerialNumberManager, the NameNode does not need to store user and group names as strings in every INode object, which saves memory.
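SerialNumberManager is essentially a string-interning table. A minimal sketch of the idea (not Hadoop's actual implementation; the class name is reused loosely) looks like this:

```java
import java.util.*;

public class SerialNumbers {
  private final Map<String, Integer> name2id = new HashMap<>();
  private final List<String> id2name = new ArrayList<>();

  // Return the existing id for a name, or assign the next free one.
  synchronized int getSerialNumber(String name) {
    Integer id = name2id.get(name);
    if (id == null) {
      id = id2name.size();
      id2name.add(name);
      name2id.put(name, id);
    }
    return id;
  }

  // Reverse lookup, used when a client asks for the owner as a string.
  synchronized String getString(int id) {
    return id2name.get(id);
  }

  public static void main(String[] args) {
    SerialNumbers users = new SerialNumbers();
    int a = users.getSerialNumber("alice");
    int b = users.getSerialNumber("bob");
    // INodes store only the small integer; each string exists once.
    System.out.println(a + " " + b + " " + users.getString(a)); // 0 1 alice
  }
}
```

With millions of INodes in memory, storing one int instead of two strings per node is a meaningful saving.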
        Most of INode's methods are simple accessors and mutators of its member variables. Among them, isRoot() is worth mentioning: it determines whether the current node is the root of the directory tree. The root is the most important directory in HDFS; all other directories derive from it. By convention, if the INode's name attribute has length 0, the node is the HDFS root and INode.isRoot() returns true; otherwise the node is not the root.
        2. INodeDirectory and INodeDirectoryWithQuota
        Linux is written in C, which lacks an inheritance mechanism, so the first i-node field, i_mode, stores both the file type and the access permission bits, and subsequent logic must branch on the file type recorded in the i-node. The HDFS implementation, by contrast, exploits Java's inheritance: directories and files are separate subclasses of INode, and their type-specific operations are implemented through polymorphism.
        INodeDirectory abstracts a directory in HDFS. The directory is an important file-system concept: a "directory" is a virtual container holding a group of files and other directories. Except for the root, every file/directory in HDFS belongs to some "directory container"; the INode member variable parent points to the node's parent directory.
         The directory's role as a container shows up in INodeDirectory as the member variable children, a list of INodes. Most INodeDirectory methods operate on this list, e.g. creating child entries, looking them up or traversing them, and replacing them; their implementations are fairly simple, as shown below:
        
/**
 * Directory INode class.
 */
class INodeDirectory extends INode {
   ......
 INode removeChild(INode node) {
    assert children != null;
    int low = Collections.binarySearch(children, node.name);
    if (low >= 0) {
      return children.remove(low);
    } else {
      return null;
    }
  }

  /** Replace a child that has the same name as newChild by newChild.
   * 
   * @param newChild Child node to be added
   */
  void replaceChild(INode newChild) {
    if ( children == null ) {
      throw new IllegalArgumentException("The directory is empty");
    }
    int low = Collections.binarySearch(children, newChild.name);
    if (low>=0) { // an old child exists so replace by the newChild
      children.set(low, newChild);
    } else {
      throw new IllegalArgumentException("No child exists to be replaced");
    }
  }
  
  INode getChild(String name) {
    return getChildINode(DFSUtil.string2Bytes(name));
  }
  private INode getChildINode(byte[] name) {
    if (children == null) {
      return null;
    }
    int low = Collections.binarySearch(children, name);
    if (low >= 0) {
      return children.get(low);
    }
    return null;
  }

  private INode getNode(byte[][] components) {
    INode[] inode  = new INode[1];
    getExistingPathINodes(components, inode);
    return inode[0];
  }
  ......
}
       When a file is deleted, how are the data blocks it owns reclaimed? collectSubtreeBlocksAndClear() is an abstract method of INode; it collects the data blocks owned by all files in the subtree rooted at this INode. Before performing a delete, the NameNode's logic uses this method to gather all the blocks owned by the subtree. When the member variable children is null, i.e. the directory is empty, the method returns immediately; otherwise it calls the same method on each entry the directory manages, collecting the blocks of its files and of the files in its subdirectories. The code is as follows:
       
int collectSubtreeBlocksAndClear(List<Block> v) {
    int total = 1;
    if (children == null) {
      return total;
    }
    for (INode child : children) {
      total += child.collectSubtreeBlocksAndClear(v);
    }
    parent = null;
    children = null;
    return total;
  }
        INodeDirectory has a subclass, INodeDirectoryWithQuota, which implements the HDFS quota mechanism: HDFS allows an administrator to set quotas on each directory. There are two kinds of quota:
        1. Namespace quota: limits the number of names under the directory; if creating a file or directory would exceed it, the operation fails. This quota controls a user's consumption of NameNode resources and is kept in the member variable nsQuota.
        2. Disk-space quota: limits the total size of all files stored under the directory tree. The space quota keeps users from consuming too many DataNode resources; it is kept in the variable dsQuota.
       The HDFS dfsadmin tool provides commands for changing directory quotas, which update the corresponding member variables of the INodeDirectoryWithQuota object. The method INodeDirectoryWithQuota.verifyQuota() checks whether an update to the directory tree satisfies the configured quotas; if not, it throws an exception.
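The essence of the quota check can be captured in a few lines. This is an illustrative sketch under my own names, not the real verifyQuota() (which, among other differences, does not update the counters itself, and throws Hadoop's QuotaExceededException rather than a generic runtime exception):

```java
public class QuotaSketch {
  static final long NO_QUOTA = -1; // unlimited, mirroring HDFS's convention

  long nsQuota;   // max number of names under this directory
  long dsQuota;   // max total bytes under this directory
  long nsCount;   // current number of names
  long diskspace; // current bytes used

  QuotaSketch(long nsQuota, long dsQuota) {
    this.nsQuota = nsQuota;
    this.dsQuota = dsQuota;
  }

  // Reject an update that would push either counter past its quota.
  void verifyQuota(long nsDelta, long dsDelta) {
    if (nsQuota != NO_QUOTA && nsCount + nsDelta > nsQuota)
      throw new IllegalStateException("namespace quota exceeded");
    if (dsQuota != NO_QUOTA && diskspace + dsDelta > dsQuota)
      throw new IllegalStateException("disk space quota exceeded");
    nsCount += nsDelta;      // simplified: apply the update if it passes
    diskspace += dsDelta;
  }

  public static void main(String[] args) {
    QuotaSketch dir = new QuotaSketch(2, 1024);
    dir.verifyQuota(1, 512); // first file: ok
    dir.verifyQuota(1, 256); // second file: ok
    try {
      dir.verifyQuota(1, 0); // a third name exceeds nsQuota == 2
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage()); // namespace quota exceeded
    }
  }
}
```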
      3. INodeFile and INodeFileUnderConstruction
       In the NameNode, a file is abstracted by INodeFile, which is also a subclass of INode.
       INodeFile adds two file-specific attributes: header and blocks. The variable header uses the same trick as INode.permission, storing both the file's replication factor and its block size in a single long: the high 16 bits hold the replication factor and the low 48 bits hold the block size. The array blocks stores the data blocks the file owns; its element type is BlockInfo.
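The header packing can be tried out in isolation; the helper names below are invented for illustration:

```java
public class FileHeader {
  static final long HEADERMASK = 0xffffL << 48; // high 16 bits: replication

  // Pack replication (high 16 bits) and preferred block size (low 48 bits).
  static long pack(short replication, long blockSize) {
    return ((long) replication << 48) | (blockSize & ~HEADERMASK);
  }
  static short getReplication(long header) {
    return (short) ((header & HEADERMASK) >> 48);
  }
  static long getPreferredBlockSize(long header) {
    return header & ~HEADERMASK;
  }

  public static void main(String[] args) {
    long h = pack((short) 3, 64L * 1024 * 1024); // 3 replicas, 64MB blocks
    System.out.println(getReplication(h));        // 3
    System.out.println(getPreferredBlockSize(h)); // 67108864
  }
}
```

48 bits comfortably cover any realistic block size, so one long per file suffices where two fields would otherwise be needed.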
        INodeFileUnderConstruction is a subclass of INodeFile. It represents a file index node in the under-construction state: when a client opens an HDFS file for writing or appending, the file enters this state, and the corresponding node in the HDFS directory tree is an INodeFileUnderConstruction object.
        In HDFS, appending data to a file is a fairly involved affair. This shows not only in the discussion of the streaming write interface in Chapter 7, but also in NameNode-side implementations such as INodeFileUnderConstruction. The code is as follows:
class INodeFileUnderConstruction extends INodeFile {
  String clientName;         // lease holder
  private final String clientMachine;
  private final DatanodeDescriptor clientNode; // if client is a cluster node too.

  private int primaryNodeIndex = -1; //the node working on lease recovery
  private DatanodeDescriptor[] targets = null;   //locations for last block
  private long lastRecoveryTime = 0;
  ......
}
      The INodeFileUnderConstruction fields are:
      clientName: the name of the client that initiated the write. This attribute is also used in lease management; in HDFS, a lease, maintained by the NameNode, is a contract granting a client permission to write a file within a limited period.
      clientMachine: the host the client runs on.
      clientNode: if the client happens to run on one of the cluster's DataNodes, the corresponding DataNode information. DatanodeDescriptor is the class the NameNode uses internally to record DataNode information; it extends DatanodeInfo.
      targets: the members of the data-streaming pipeline for the last block, i.e. the list of DataNodes currently participating in the write.
      primaryNodeIndex and lastRecoveryTime: both are used in NameNode-initiated block recovery, also called lease recovery; they hold, respectively, the index of the primary DataNode for the recovery and the time the recovery began.
The Namespace Image and the Edit Log
        The in-memory HDFS directory tree and file/directory metadata are held by INode and its subclasses. If the node loses power or the process crashes, that data is gone, so the information must be saved on disk. The namespace image captures the directory tree at a particular moment; it is implemented in HDFS by the class FSImage, the bridge between in-memory and on-disk metadata. Modifications to the in-memory tree must also be synchronized to the on-disk metadata, but dumping the entire in-memory state to disk on every change is clearly impractical. The NameNode therefore introduces the edit log and records each change in it, forming the image-plus-log persistence mechanism: the namespace image is a faithful snapshot of the in-memory metadata at some moment, while the edit log records every metadata operation after that moment.
         1. The NameNode's On-Disk Directory Structure
          The directories managed by the NameNode fall into three cases: those storing only the namespace image (named by the configuration item ${dfs.name.dir}), those storing only the edit log (${dfs.name.edits.dir}), and those storing both. As on the DataNode, each configuration item can list multiple directories. If ${dfs.name.edits.dir} is not set, the edit log is also placed in ${dfs.name.dir}; in practice, HDFS deployments rarely set ${dfs.name.edits.dir} separately.
        Under ${dfs.name.dir} there are normally three directories and one file. The file is "in_use.lock", which serves the same purpose as the identically named file on the DataNode: guaranteeing the NameNode exclusive use of the directory. Formatting the NameNode creates only the "current" and "image" directories; the "previous.checkpoint" directory appears only after the node has started and run for a while.
          The NameNode keeps the namespace image and the edit log in the "current" directory, which usually contains four files:
          fsimage: the metadata image file.
          edits: the log file; together with the image file it provides a complete HDFS directory tree plus metadata.
          fstime: records the time of the last checkpoint. A checkpoint, normally produced by the secondary NameNode, is the result of merging fsimage and edits.
          VERSION: as on the DataNode, this file records some attributes of the NameNode's storage.
          ${dfs.name.dir}/previous.checkpoint holds the NameNode's previous checkpoint; its layout matches that of the "current" directory. ${dfs.name.dir}/image is where versions 0.13 and earlier kept the "fsimage" file; the "fsimage" file in that directory plays the same role as the DataNode's ${dfs.data.dir}/storage file, preventing an accidental start by a NameNode that does not understand the current directory layout.
       2. FSImage and FSEditLog
        On the DataNode, storage management is split between the DataNode storage class DataStorage and the file-system dataset FSDataset: DataStorage handles the state management of the storage space, while FSDataset provides the block-storage services the DataNode's logic needs.
        On the NameNode, by contrast, storage management is done jointly by the FSImage and FSEditLog classes. The namespace image class FSImage plays the leading role: it manages the storage's lifecycle, is responsible for saving and loading the namespace image, and cooperates with the secondary NameNode to run the checkpoint process. Like the DataNode's DataStorage, it extends Storage, and both use the methods Storage provides to manage the node's directory layout.
        The edit log class FSEditLog records modifications to the metadata. Unlike FSImage, the edit log grows continuously while the NameNode runs, so FSEditLog relies on output streams, using them to record directory-tree changes; the matching input streams are used to read back edit logs persisted on disk. FSEditLog offers a large family of log*() methods for recording metadata changes; a file rename, for instance, is recorded in the edit log through logRename().
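A log record is basically an opcode byte followed by serialized operands. As a simplified, self-contained illustration of what a log*() method does (Hadoop actually serializes Writable objects, and the opcode value and method shape below are invented for this sketch):

```java
import java.io.*;

public class EditRecordSketch {
  static final byte OP_RENAME = 15; // illustrative opcode, not Hadoop's real value

  // Append one "rename" record: opcode first, then its operands.
  static void logRename(DataOutputStream out, String src, String dst, long timestamp)
      throws IOException {
    out.writeByte(OP_RENAME);
    out.writeUTF(src);
    out.writeUTF(dst);
    out.writeLong(timestamp);
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    logRename(new DataOutputStream(buf), "/a.txt", "/b.txt", 1000L);

    // Replaying is the mirror image: read the opcode, then its operands.
    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
    System.out.println(in.readByte() + " " + in.readUTF() + " " + in.readUTF()
        + " " + in.readLong()); // 15 /a.txt /b.txt 1000
  }
}
```

The symmetry between writing and replaying is what makes loadFSEdits(), discussed later, a large switch on the opcode.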
       3. Saving the Namespace Image
        The namespace image stores the NameNode's in-memory metadata at some moment: the INode information analyzed above, information about files being written (leases and INodeFileUnderConstruction objects), and some other state. FSImage.saveFSImage() saves the namespace image of the current moment into the file named by the parameter newFile. The code is as follows:
       
void saveFSImage(File newFile) throws IOException {
    FSNamesystem fsNamesys = FSNamesystem.getFSNamesystem();
    FSDirectory fsDir = fsNamesys.dir;
    long startTime = FSNamesystem.now();
    //
    // Write out data
    //
    DataOutputStream out = new DataOutputStream(
                                                new BufferedOutputStream(
                                                                         new FileOutputStream(newFile)));
    try {
      out.writeInt(FSConstants.LAYOUT_VERSION);
      out.writeInt(namespaceID);
      out.writeLong(fsDir.rootDir.numItemsInTree());
      out.writeLong(fsNamesys.getGenerationStamp());
      byte[] byteStore = new byte[4*FSConstants.MAX_PATH_LENGTH];
      ByteBuffer strbuf = ByteBuffer.wrap(byteStore);
      // save the root
      saveINode2Image(strbuf, fsDir.rootDir, out);
      // save the rest of the nodes
      saveImage(strbuf, 0, fsDir.rootDir, out);
      fsNamesys.saveFilesUnderConstruction(out);
      fsNamesys.saveSecretManagerState(out);
      strbuf = null;
    } finally {
      out.close();
    }

    LOG.info("Image file of size " + newFile.length() + " saved in " 
        + (FSNamesystem.now() - startTime)/1000 + " seconds.");
  }

          The member function saveFSImage() is straightforward. It first writes the image file header: the namespace image format version, the namespace ID, the number of nodes in the directory tree, and the current generation stamp.
          It then uses the static method saveINode2Image() to write the root node, and saveImage() to save the remaining nodes of the tree. The root is a special INode managed by the NameNode: its INode.name attribute has length 0, so it must be handled separately.

        FSImage.saveImage() writes all the INode information under the directory tree current (excluding current itself) to the output stream out. The method's first two parameters carry the parent node's absolute path: the path is kept in parentPrefix and its length in prefixLength. Since the absolute path of an HDFS file or directory cannot exceed 8000 bytes, the backing buffer of parentPrefix is 4 * 8000 bytes, i.e. 31.25KB. Given a directory "/foo/bar", when the INode for "bar" is written, its parent path "/foo" sits in parentPrefix; when the INodes under "/foo/bar" are written, parentPrefix holds "/foo/bar".
        Suppose the current parameter of the present call is the directory "/foo". saveImage() first loops over all of its children and, for each one:
         1. Resets the buffer position, so that the buffer content is "/foo".
         2. Appends the child's name via ByteBuffer.put(), after which the buffer holds "/foo/bar".
         3. Calls saveINode2Image() to write the node's information.
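The position/put dance on parentPrefix can be demonstrated in isolation; the buffer size and paths below are illustrative:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class PathBuffer {
  // Render the bytes written so far (from 0 up to the current position).
  static String content(ByteBuffer buf) {
    return new String(buf.array(), 0, buf.position(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    ByteBuffer parentPrefix = ByteBuffer.wrap(new byte[4 * 8000]);

    // The parent path "/foo" occupies the first prefixLength bytes.
    parentPrefix.put("/foo".getBytes(StandardCharsets.UTF_8));
    int prefixLength = parentPrefix.position();

    // Emit child "bar": rewind to the prefix, then append "/bar".
    parentPrefix.position(prefixLength);
    parentPrefix.put((byte) '/').put("bar".getBytes(StandardCharsets.UTF_8));
    System.out.println(content(parentPrefix)); // /foo/bar

    // Emit sibling "baz": the same prefix is reused, "bar" is overwritten.
    parentPrefix.position(prefixLength);
    parentPrefix.put((byte) '/').put("baz".getBytes(StandardCharsets.UTF_8));
    System.out.println(content(parentPrefix)); // /foo/baz
  }
}
```

Resetting position() instead of reallocating a buffer per node keeps path construction allocation-free across the whole tree walk.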
         The saveImage() method is as follows:
        
/**
   * Save file tree image starting from the given root.
   * This is a recursive procedure, which first saves all children of
   * a current directory and then moves inside the sub-directories.
   */
  private static void saveImage(ByteBuffer parentPrefix,
                                int prefixLength,
                                INodeDirectory current,
                                DataOutputStream out) throws IOException {
    int newPrefixLength = prefixLength;
    if (current.getChildrenRaw() == null)
      return;
    for(INode child : current.getChildren()) {
      // print all children first
      parentPrefix.position(prefixLength);
      parentPrefix.put(PATH_SEPARATOR).put(child.getLocalNameBytes());
      saveINode2Image(parentPrefix, child, out);
    }
    for(INode child : current.getChildren()) {
      if(!child.isDirectory())
        continue;
      parentPrefix.position(prefixLength);
      parentPrefix.put(PATH_SEPARATOR).put(child.getLocalNameBytes());
      newPrefixLength = parentPrefix.position();
      saveImage(parentPrefix, newPrefixLength, (INodeDirectory)child, out);
    }
    parentPrefix.position(prefixLength);
  }
          The saveINode2Image() method is as follows:
       
/*
   * Save one inode's attributes to the image.
   */
  private static void saveINode2Image(ByteBuffer name,
                                      INode node,
                                      DataOutputStream out) throws IOException {
    int nameLen = name.position();
    out.writeShort(nameLen);
    out.write(name.array(), name.arrayOffset(), nameLen);
    if (!node.isDirectory()) {  // write file inode
      INodeFile fileINode = (INodeFile)node;
      out.writeShort(fileINode.getReplication());
      out.writeLong(fileINode.getModificationTime());
      out.writeLong(fileINode.getAccessTime());
      out.writeLong(fileINode.getPreferredBlockSize());
      Block[] blocks = fileINode.getBlocks();
      out.writeInt(blocks.length);
      for (Block blk : blocks)
        blk.write(out);
      FILE_PERM.fromShort(fileINode.getFsPermissionShort());
      PermissionStatus.write(out, fileINode.getUserName(),
                             fileINode.getGroupName(),
                             FILE_PERM);
    } else {   // write directory inode
      out.writeShort(0);  // replication
      out.writeLong(node.getModificationTime());
      out.writeLong(0);   // access time
      out.writeLong(0);   // preferred block size
      out.writeInt(-1);    // # of blocks
      out.writeLong(node.getNsQuota());
      out.writeLong(node.getDsQuota());
      FILE_PERM.fromShort(node.getFsPermissionShort());
      PermissionStatus.write(out, node.getUserName(),
                             node.getGroupName(),
                             FILE_PERM);
    }
  }
       After writing the directory tree with the two methods above, FSImage.saveFSImage() also writes to the namespace image the files currently open for writing, i.e. the files represented by INodeFileUnderConstruction objects.
       FSNamesystem.saveFilesUnderConstruction() iterates over the leases kept by the lease manager and uses FSImage.writeINodeUnderConstruction() to write each under-construction file index node. The code is as follows:
        
/**
   * Serializes leases. 
   */
  void saveFilesUnderConstruction(DataOutputStream out) throws IOException {
    synchronized (leaseManager) {
      out.writeInt(leaseManager.countPath()); // write the size

      for (Lease lease : leaseManager.getSortedLeases()) {
        for(String path : lease.getPaths()) {
          // verify that path exists in namespace
          INode node = dir.getFileINode(path);
          if (node == null) {
            throw new IOException("saveLeases found path " + path +
                                  " but no matching entry in namespace.");
          }
          if (!node.isUnderConstruction()) {
            throw new IOException("saveLeases found path " + path +
                                  " but is not under construction.");
          }
          INodeFileUnderConstruction cons = (INodeFileUnderConstruction) node;
          FSImage.writeINodeUnderConstruction(out, cons, path);
        }
      }
    }
  }
    4. Saving Edit Log Data
        As a file on disk, the namespace image FSImage can hardly be kept in step with the NameNode's in-memory metadata at every moment. To improve the reliability of the metadata, HDFS records modifications to it in the edit log; the edit log and the namespace image together determine the file system's metadata at the current moment.
        While HDFS runs, every event that modifies the NameNode's first relationship must be logged, so the log can be abstracted as an append-only data output stream. EditLogOutputStream is that abstraction; its subclass EditLogFileOutputStream writes the log to a disk file. The EditLogOutputStream code is as follows:
        
/**
 * A generic abstract class to support journaling of edits logs into 
 * a persistent storage.
 */
abstract class EditLogOutputStream extends OutputStream {
  // these are statistics counters
  private long numSync;        // number of sync(s) to disk
  private long totalTimeSync;  // total time to sync

  EditLogOutputStream() throws IOException {
    numSync = totalTimeSync = 0;
  }

  /**
   * Get this stream name.
   * 
   * @return name of the stream
   */
  abstract String getName();

  /** {@inheritDoc} */
  abstract public void write(int b) throws IOException;

  /**
   * Write edits log record into the stream.
   * The record is represented by operation name and
   * an array of Writable arguments.
   * 
   * @param op operation
   * @param writables array of Writable arguments
   * @throws IOException
   */
  abstract void write(byte op, Writable ... writables) throws IOException;

  /**
   * Create and initialize new edits log storage.
   * 
   * @throws IOException
   */
  abstract void create() throws IOException;

  /** {@inheritDoc} */
  abstract public void close() throws IOException;

  /**
   * All data that has been written to the stream so far will be flushed.
   * New data can be still written to the stream while flushing is performed.
   */
  abstract void setReadyToFlush() throws IOException;

  /**
   * Flush and sync all data that is ready to be flush 
   * {@link #setReadyToFlush()} into underlying persistent store.
   * @throws IOException
   */
  abstract protected void flushAndSync() throws IOException;

  /**
   * Flush data to persistent store.
   * Collect sync metrics.
   */
  public void flush() throws IOException {
    numSync++;
    long start = FSNamesystem.now();
    flushAndSync();
    long end = FSNamesystem.now();
    totalTimeSync += (end - start);
  }

  /**
   * Return the size of the current edits log.
   * Length is used to check when it is large enough to start a checkpoint.
   */
  abstract long length() throws IOException;

  /**
   * Return total time spent in {@link #flushAndSync()}
   */
  long getTotalSyncTime() {
    return totalTimeSync;
  }

  /**
   * Return number of calls to {@link #flushAndSync()}
   */
  long getNumSync() {
    return numSync;
  }
}
     The edit log file output stream EditLogFileOutputStream implements EditLogOutputStream.
     EditLogFileOutputStream owns two working buffers:
     bufCurrent: the buffer log records are written into.
     bufReady: the buffer being flushed to the file.
     Log records produced through write() go into bufCurrent. When bufCurrent's content needs to be written to the file, EditLogFileOutputStream swaps the two buffers: the former write buffer becomes the flush buffer, and the former flush buffer becomes the write buffer. The code is as follows:
     
  /**
   * An implementation of the abstract class {@link EditLogOutputStream},
   * which stores edits in a local file.
   */
  static private class EditLogFileOutputStream extends EditLogOutputStream {
    private File file;
    private FileOutputStream fp;    // file stream for storing edit logs 
    private FileChannel fc;         // channel of the file stream for sync
    private DataOutputBuffer bufCurrent;  // current buffer for writing
    private DataOutputBuffer bufReady;    // buffer ready for flushing
    static ByteBuffer fill = ByteBuffer.allocateDirect(512); // preallocation

    EditLogFileOutputStream(File name) throws IOException {
      super();
      file = name;
      bufCurrent = new DataOutputBuffer(sizeFlushBuffer);
      bufReady = new DataOutputBuffer(sizeFlushBuffer);
      RandomAccessFile rp = new RandomAccessFile(name, "rw");
      fp = new FileOutputStream(rp.getFD()); // open for append
      fc = rp.getChannel();
      fc.position(fc.size());
    }

    @Override
    String getName() {
      return file.getPath();
    }

    /** {@inheritDoc} */
    @Override
    public void write(int b) throws IOException {
      bufCurrent.write(b);
    }

    /** {@inheritDoc} */
    @Override
    void write(byte op, Writable ... writables) throws IOException {
      write(op);
      for(Writable w : writables) {
        w.write(bufCurrent);
      }
    }

    /**
     * Create empty edits logs file.
     */
    @Override
    void create() throws IOException {
      fc.truncate(0);
      fc.position(0);
      bufCurrent.writeInt(FSConstants.LAYOUT_VERSION);
      setReadyToFlush();
      flush();
    }

    @Override
    public void close() throws IOException {
      // close should have been called after all pending transactions 
      // have been flushed & synced.
      int bufSize = bufCurrent.size();
      if (bufSize != 0) {
        throw new IOException("FSEditStream has " + bufSize +
                              " bytes still to be flushed and cannot " +
                              "be closed.");
      } 
      bufCurrent.close();
      bufReady.close();

      // remove the last INVALID marker from transaction log.
      fc.truncate(fc.position());
      fp.close();
      
      bufCurrent = bufReady = null;
    }

    /**
     * All data that has been written to the stream so far will be flushed.
     * New data can be still written to the stream while flushing is performed.
     */
    @Override
    void setReadyToFlush() throws IOException {
      assert bufReady.size() == 0 : "previous data is not flushed yet";
      write(OP_INVALID);           // insert end-of-file marker
      DataOutputBuffer tmp = bufReady;
      bufReady = bufCurrent;
      bufCurrent = tmp;
    }

    /**
     * Flush ready buffer to persistent store.
     * currentBuffer is not flushed as it accumulates new log records
     * while readyBuffer will be flushed and synced.
     */
    @Override
    protected void flushAndSync() throws IOException {
      preallocate();            // preallocate file if necessary
      bufReady.writeTo(fp);     // write data to file
      bufReady.reset();         // erase all data in the buffer
      fc.force(false);          // metadata updates not needed because of preallocation
      fc.position(fc.position()-1); // skip back the end-of-file marker
    }

    /**
     * Return the size of the current edit log including buffered data.
     */
    @Override
    long length() throws IOException {
      // file size + size of both buffers
      return fc.size() + bufReady.size() + bufCurrent.size();
    }

    // allocate a big chunk of data
    private void preallocate() throws IOException {
      long position = fc.position();
      if (position + 4096 >= fc.size()) {
        FSNamesystem.LOG.debug("Preallocating Edit log, current size " +
                                fc.size());
        long newsize = position + 1024*1024; // 1MB
        fill.position(0);
        int written = fc.write(fill, newsize);
        FSNamesystem.LOG.debug("Edit log size is now " + fc.size() +
                              " written " + written + " bytes " +
                              " at offset " +  newsize);
      }
    }
    
    /**
     * Returns the file associated with this stream
     */
    File getFile() {
      return file;
    }
  }
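The double-buffering above can be reduced to a minimal, self-contained sketch; the class and method names are mine, not Hadoop's, and ByteArrayOutputStream stands in for DataOutputBuffer:

```java
import java.io.*;

public class DoubleBufferSketch {
  private ByteArrayOutputStream bufCurrent = new ByteArrayOutputStream();
  private ByteArrayOutputStream bufReady = new ByteArrayOutputStream();

  // Writers always append to bufCurrent.
  void write(byte[] record) throws IOException {
    bufCurrent.write(record);
  }

  // Swap buffers: accumulated records become flushable, writers get a fresh buffer.
  synchronized void setReadyToFlush() {
    assert bufReady.size() == 0 : "previous data is not flushed yet";
    ByteArrayOutputStream tmp = bufReady;
    bufReady = bufCurrent;
    bufCurrent = tmp;
  }

  // Drain bufReady to the store while new writes keep landing in bufCurrent.
  void flushAndSync(OutputStream store) throws IOException {
    bufReady.writeTo(store);
    bufReady.reset();
  }

  public static void main(String[] args) throws IOException {
    DoubleBufferSketch log = new DoubleBufferSketch();
    ByteArrayOutputStream disk = new ByteArrayOutputStream();
    log.write("op1;".getBytes());
    log.setReadyToFlush();
    log.write("op2;".getBytes()); // arrives during the flush, is not lost
    log.flushAndSync(disk);
    System.out.println(disk.toString()); // op1;
    log.setReadyToFlush();
    log.flushAndSync(disk);
    System.out.println(disk.toString()); // op1;op2;
  }
}
```

The point of the swap is that the (slow) disk flush never blocks new log writes; that is exactly why EditLogFileOutputStream keeps two buffers instead of one.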

       5. Reading the Namespace Image and the Edit Log
       FSImage.loadFSImage() reads the namespace image and adds/updates the metadata it contains into the in-memory metadata; FSEditLog.loadFSEdits() then loads the edit log, replaying and applying the logged operations so as to recover the metadata's state at a given moment. The loadFSImage() method is as follows:
       
/**
   * Choose latest image from one of the directories,
   * load it and merge with the edits from that directory.
   * 
   * @return whether the image should be saved
   * @throws IOException
   */
  boolean loadFSImage() throws IOException {
    // Now check all curFiles and see which is the newest
    long latestNameCheckpointTime = Long.MIN_VALUE;
    long latestEditsCheckpointTime = Long.MIN_VALUE;
    StorageDirectory latestNameSD = null;
    StorageDirectory latestEditsSD = null;
    boolean needToSave = false;
    isUpgradeFinalized = true;
    Collection<String> imageDirs = new ArrayList<String>();
    Collection<String> editsDirs = new ArrayList<String>();
    for (Iterator<StorageDirectory> it = dirIterator(); it.hasNext();) {
      StorageDirectory sd = it.next();
      if (!sd.getVersionFile().exists()) {
        needToSave |= true;
        continue; // some of them might have just been formatted
      }
      boolean imageExists = false, editsExists = false;
      if (sd.getStorageDirType().isOfType(NameNodeDirType.IMAGE)) {
        imageExists = getImageFile(sd, NameNodeFile.IMAGE).exists();
        imageDirs.add(sd.getRoot().getCanonicalPath());
      }
      if (sd.getStorageDirType().isOfType(NameNodeDirType.EDITS)) {
        editsExists = getImageFile(sd, NameNodeFile.EDITS).exists();
        editsDirs.add(sd.getRoot().getCanonicalPath());
      }
      
      checkpointTime = readCheckpointTime(sd);
      if ((checkpointTime != Long.MIN_VALUE) && 
          ((checkpointTime != latestNameCheckpointTime) || 
           (checkpointTime != latestEditsCheckpointTime))) {
        // Force saving of new image if checkpoint time
        // is not same in all of the storage directories.
        needToSave |= true;
      }
      if (sd.getStorageDirType().isOfType(NameNodeDirType.IMAGE) && 
         (latestNameCheckpointTime < checkpointTime) && imageExists) {
        latestNameCheckpointTime = checkpointTime;
        latestNameSD = sd;
      }
      if (sd.getStorageDirType().isOfType(NameNodeDirType.EDITS) && 
           (latestEditsCheckpointTime < checkpointTime) && editsExists) {
        latestEditsCheckpointTime = checkpointTime;
        latestEditsSD = sd;
      }
      if (checkpointTime <= 0L)
        needToSave |= true;
      // set finalized flag
      isUpgradeFinalized = isUpgradeFinalized && !sd.getPreviousDir().exists();
    }

    // We should have at least one image and one edits dirs
    if (latestNameSD == null)
      throw new IOException("Image file is not found in " + imageDirs);
    if (latestEditsSD == null)
      throw new IOException("Edits file is not found in " + editsDirs);

    // Make sure we are loading image and edits from same checkpoint
    if (latestNameCheckpointTime > latestEditsCheckpointTime
        && latestNameSD != latestEditsSD
        && latestNameSD.getStorageDirType() == NameNodeDirType.IMAGE
        && latestEditsSD.getStorageDirType() == NameNodeDirType.EDITS) {
      // This is a rare failure when NN has image-only and edits-only
      // storage directories, and fails right after saving images,
      // in some of the storage directories, but before purging edits.
      // See -NOTE- in saveNamespace().
      LOG.error("This is a rare failure scenario!!!");
      LOG.error("Image checkpoint time " + latestNameCheckpointTime +
                " > edits checkpoint time " + latestEditsCheckpointTime);
      LOG.error("Name-node will treat the image as the latest state of " +
                "the namespace. Old edits will be discarded.");
    } else if (latestNameCheckpointTime != latestEditsCheckpointTime)
      throw new IOException("Inconsistent storage detected, " +
                      "image and edits checkpoint times do not match. " +
                      "image checkpoint time = " + latestNameCheckpointTime +
                      "edits checkpoint time = " + latestEditsCheckpointTime);
    
    // Recover from previous interrrupted checkpoint if any
    needToSave |= recoverInterruptedCheckpoint(latestNameSD, latestEditsSD);

    long startTime = FSNamesystem.now();
    long imageSize = getImageFile(latestNameSD, NameNodeFile.IMAGE).length();

    //
    // Load in bits
    //
    latestNameSD.read();
    needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
    LOG.info("Image file of size " + imageSize + " loaded in " 
        + (FSNamesystem.now() - startTime)/1000 + " seconds.");
    
    // Load latest edits
    if (latestNameCheckpointTime > latestEditsCheckpointTime)
      // the image is already current, discard edits
      needToSave |= true;
    else // latestNameCheckpointTime == latestEditsCheckpointTime
      needToSave |= (loadFSEdits(latestEditsSD) > 0);
    
    return needToSave;
  }

          In the loadFSImage(File) overload invoked above, a for loop reads in the INode information of each file/directory and adds the node to the directory tree via FSNamesystem.addToParent(): based on the parameters passed in, that method constructs the appropriate INodeDirectoryWithQuota or INodeFile object and inserts it at the right place in the tree.
          The lease information and security-related state stored at the tail of the namespace image are read in by loadFilesUnderConstruction() and loadSecretManagerState() respectively. Reading the lease information requires updating the directory tree, replacing the original INodeFile object with an INodeFileUnderConstruction object, and updating the lease manager's records as well, as shown below:
        
private void loadFilesUnderConstruction(int version, DataInputStream in, 
                                  FSNamesystem fs) throws IOException {

    FSDirectory fsDir = fs.dir;
    if (version > -13) // pre lease image version
      return;
    int size = in.readInt();

    LOG.info("Number of files under construction = " + size);

    for (int i = 0; i < size; i++) {
      INodeFileUnderConstruction cons = readINodeUnderConstruction(in);

      // verify that file exists in namespace
      String path = cons.getLocalName();
      INode old = fsDir.getFileINode(path);
      if (old == null) {
        throw new IOException("Found lease for non-existent file " + path);
      }
      if (old.isDirectory()) {
        throw new IOException("Found lease for directory " + path);
      }
      INodeFile oldnode = (INodeFile) old;
      fsDir.replaceNode(path, oldnode, cons);
      fs.leaseManager.addLease(cons.clientName, path); 
    }
  }
          After loadFSImage() has read the namespace image, the in-memory first-relationship information reflects only the moment the image was saved; the subsequent metadata modifications, i.e. the contents of the edit log, must also be loaded before the metadata is fully recovered. FSEditLog.loadFSEdits() loads and applies the log; it takes an EditLogInputStream instance, in practice an EditLogFileInputStream object. The loadFSEdits() method is lengthy: most of its body dispatches on the logged opcode and operands, calling the corresponding FSDirectory method to modify the in-memory metadata. The code is as follows:
        
 /**
   * Load an edit log, and apply the changes to the in-memory structure
   * This is where we apply edits that we've been writing to disk all
   * along.
   */
  static int loadFSEdits(EditLogInputStream edits) throws IOException {
    FSNamesystem fsNamesys = FSNamesystem.getFSNamesystem();
    FSDirectory fsDir = fsNamesys.dir;
    int numEdits = 0;
    int logVersion = 0;
    String clientName = null;
    String clientMachine = null;
    String path = null;
    int numOpAdd = 0, numOpClose = 0, numOpDelete = 0,
        numOpRename = 0, numOpSetRepl = 0, numOpMkDir = 0,
        numOpSetPerm = 0, numOpSetOwner = 0, numOpSetGenStamp = 0,
        numOpTimes = 0, numOpGetDelegationToken = 0,
        numOpRenewDelegationToken = 0, numOpCancelDelegationToken = 0,
        numOpUpdateMasterKey = 0, numOpOther = 0;

    long startTime = FSNamesystem.now();

    DataInputStream in = new DataInputStream(new BufferedInputStream(edits));
    try {
      // Read log file version. Could be missing. 
      in.mark(4);
      // If edits log is greater than 2G, available method will return negative
      // numbers, so we avoid having to call available
      boolean available = true;
      try {
        logVersion = in.readByte();
      } catch (EOFException e) {
        available = false;
      }
      if (available) {
        in.reset();
        logVersion = in.readInt();
        if (logVersion < FSConstants.LAYOUT_VERSION) // future version
          throw new IOException(
                          "Unexpected version of the file system log file: "
                          + logVersion + ". Current version = " 
                          + FSConstants.LAYOUT_VERSION + ".");
      }
      assert logVersion <= Storage.LAST_UPGRADABLE_LAYOUT_VERSION :
                            "Unsupported version " + logVersion;

      while (true) {
        long timestamp = 0;
        long mtime = 0;
        long atime = 0;
        long blockSize = 0;
        byte opcode = -1;
        try {
          opcode = in.readByte();
          if (opcode == OP_INVALID) {
            FSNamesystem.LOG.info("Invalid opcode, reached end of edit log " +
                                   "Number of transactions found " + numEdits);
            break; // no more transactions
          }
        } catch (EOFException e) {
          break; // no more transactions
        }
        numEdits++;
        switch (opcode) {
        case OP_ADD:
        case OP_CLOSE: {
          // versions > 0 support per file replication
          // get name and replication
          int length = in.readInt();
          if (-7 == logVersion && length != 3||
              -17 < logVersion && logVersion < -7 && length != 4 ||
              logVersion <= -17 && length != 5) {
              throw new IOException("Incorrect data format."  +
                                    " logVersion is " + logVersion +
                                    " but writables.length is " +
                                    length + ". ");
          }
          path = FSImage.readString(in);
          short replication = adjustReplication(readShort(in));
          mtime = readLong(in);
          if (logVersion <= -17) {
            atime = readLong(in);
          }
          if (logVersion < -7) {
            blockSize = readLong(in);
          }
          // get blocks
          Block blocks[] = null;
          if (logVersion <= -14) {
            blocks = readBlocks(in);
          } else {
            BlockTwo oldblk = new BlockTwo();
            int num = in.readInt();
            blocks = new Block[num];
            for (int i = 0; i < num; i++) {
              oldblk.readFields(in);
              blocks[i] = new Block(oldblk.blkid, oldblk.len, 
                                    Block.GRANDFATHER_GENERATION_STAMP);
            }
          }

          // Older versions of HDFS do not store the block size in the inode.
          // If the file has more than one block, use the size of the
          // first block as the blocksize. Otherwise use the default
          // block size.
          if (-8 <= logVersion && blockSize == 0) {
            if (blocks.length > 1) {
              blockSize = blocks[0].getNumBytes();
            } else {
              long first = ((blocks.length == 1)? blocks[0].getNumBytes(): 0);
              blockSize = Math.max(fsNamesys.getDefaultBlockSize(), first);
            }
          }
           
          PermissionStatus permissions = fsNamesys.getUpgradePermission();
          if (logVersion <= -11) {
            permissions = PermissionStatus.read(in);
          }

          // clientname, clientMachine and block locations of last block.
          if (opcode == OP_ADD && logVersion <= -12) {
            clientName = FSImage.readString(in);
            clientMachine = FSImage.readString(in);
            if (-13 <= logVersion) {
              readDatanodeDescriptorArray(in);
            }
          } else {
            clientName = "";
            clientMachine = "";
          }

          // The open lease transaction re-creates a file if necessary.
          // Delete the file if it already exists.
          if (FSNamesystem.LOG.isDebugEnabled()) {
            FSNamesystem.LOG.debug(opcode + ": " + path + 
                                   " numblocks : " + blocks.length +
                                   " clientHolder " +  clientName +
                                   " clientMachine " + clientMachine);
          }

          fsDir.unprotectedDelete(path, mtime);

          // add to the file tree
          INodeFile node = (INodeFile)fsDir.unprotectedAddFile(
                                                    path, permissions,
                                                    blocks, replication, 
                                                    mtime, atime, blockSize);
          if (opcode == OP_ADD) {
            numOpAdd++;
            //
            // Replace the current node with an INodeFileUnderConstruction.
            // Recreate the in-memory lease record.
            //
            INodeFileUnderConstruction cons = new INodeFileUnderConstruction(
                                      node.getLocalNameBytes(),
                                      node.getReplication(), 
                                      node.getModificationTime(),
                                      node.getPreferredBlockSize(),
                                      node.getBlocks(),
                                      node.getPermissionStatus(),
                                      clientName, 
                                      clientMachine, 
                                      null);
            fsDir.replaceNode(path, node, cons);
            fsNamesys.leaseManager.addLease(cons.clientName, path);
          }
          break;
        } 
        case OP_SET_REPLICATION: {
          numOpSetRepl++;
          path = FSImage.readString(in);
          short replication = adjustReplication(readShort(in));
          fsDir.unprotectedSetReplication(path, replication, null);
          break;
        } 
        case OP_RENAME: {
          numOpRename++;
          int length = in.readInt();
          if (length != 3) {
            throw new IOException("Incorrect data format. " 
                                  + "Rename operation.");
          }
          String s = FSImage.readString(in);
          String d = FSImage.readString(in);
          timestamp = readLong(in);
          HdfsFileStatus dinfo = fsDir.getFileInfo(d);
          fsDir.unprotectedRenameTo(s, d, timestamp);
          fsNamesys.changeLease(s, d, dinfo);
          break;
        }
        case OP_DELETE: {
          numOpDelete++;
          int length = in.readInt();
          if (length != 2) {
            throw new IOException("Incorrect data format. " 
                                  + "delete operation.");
          }
          path = FSImage.readString(in);
          timestamp = readLong(in);
          fsDir.unprotectedDelete(path, timestamp);
          break;
        }
        case OP_MKDIR: {
          numOpMkDir++;
          PermissionStatus permissions = fsNamesys.getUpgradePermission();
          int length = in.readInt();
          if (-17 < logVersion && length != 2 ||
              logVersion <= -17 && length != 3) {
            throw new IOException("Incorrect data format. " 
                                  + "Mkdir operation.");
          }
          path = FSImage.readString(in);
          timestamp = readLong(in);

          // The disk format stores atimes for directories as well.
          // However, currently this is not being updated/used because of
          // performance reasons.
          if (logVersion <= -17) {
            atime = readLong(in);
          }

          if (logVersion <= -11) {
            permissions = PermissionStatus.read(in);
          }
          fsDir.unprotectedMkdir(path, permissions, timestamp);
          break;
        }
        case OP_SET_GENSTAMP: {
          numOpSetGenStamp++;
          long lw = in.readLong();
          fsDir.namesystem.setGenerationStamp(lw);
          break;
        } 
        case OP_DATANODE_ADD: {
          numOpOther++;
          FSImage.DatanodeImage nodeimage = new FSImage.DatanodeImage();
          nodeimage.readFields(in);
          // Datanodes are not persistent any more.
          break;
        }
        case OP_DATANODE_REMOVE: {
          numOpOther++;
          DatanodeID nodeID = new DatanodeID();
          nodeID.readFields(in);
          //Datanodes are not persistent any more.
          break;
        }
        case OP_SET_PERMISSIONS: {
          numOpSetPerm++;
          if (logVersion > -11)
            throw new IOException("Unexpected opcode " + opcode
                                  + " for version " + logVersion);
          fsDir.unprotectedSetPermission(
              FSImage.readString(in), FsPermission.read(in));
          break;
        }
        case OP_SET_OWNER: {
          numOpSetOwner++;
          if (logVersion > -11)
            throw new IOException("Unexpected opcode " + opcode
                                  + " for version " + logVersion);
          fsDir.unprotectedSetOwner(FSImage.readString(in),
              FSImage.readString_EmptyAsNull(in),
              FSImage.readString_EmptyAsNull(in));
          break;
        }
        case OP_SET_NS_QUOTA: {
          if (logVersion > -16) {
            throw new IOException("Unexpected opcode " + opcode
                + " for version " + logVersion);
          }
          fsDir.unprotectedSetQuota(FSImage.readString(in), 
                                    readLongWritable(in), 
                                    FSConstants.QUOTA_DONT_SET);
          break;
        }
        case OP_CLEAR_NS_QUOTA: {
          if (logVersion > -16) {
            throw new IOException("Unexpected opcode " + opcode
                + " for version " + logVersion);
          }
          fsDir.unprotectedSetQuota(FSImage.readString(in),
                                    FSConstants.QUOTA_RESET,
                                    FSConstants.QUOTA_DONT_SET);
          break;
        }

        case OP_SET_QUOTA:
          fsDir.unprotectedSetQuota(FSImage.readString(in),
                                    readLongWritable(in),
                                    readLongWritable(in));
                                      
          break;

        case OP_TIMES: {
          numOpTimes++;
          int length = in.readInt();
          if (length != 3) {
            throw new IOException("Incorrect data format. " 
                                  + "times operation.");
          }
          path = FSImage.readString(in);
          mtime = readLong(in);
          atime = readLong(in);
          fsDir.unprotectedSetTimes(path, mtime, atime, true);
          break;
        }
        case OP_GET_DELEGATION_TOKEN: {
          if (logVersion > -19) {
            throw new IOException("Unexpected opcode " + opcode
                + " for version " + logVersion);
          }
          numOpGetDelegationToken++;
          DelegationTokenIdentifier delegationTokenId = 
              new DelegationTokenIdentifier();
          delegationTokenId.readFields(in);
          long expiryTime = readLong(in);
          fsNamesys.getDelegationTokenSecretManager()
              .addPersistedDelegationToken(delegationTokenId, expiryTime);
          break;
        }
        case OP_RENEW_DELEGATION_TOKEN: {
          if (logVersion > -19) {
            throw new IOException("Unexpected opcode " + opcode
                + " for version " + logVersion);
          }
          numOpRenewDelegationToken++;
          DelegationTokenIdentifier delegationTokenId = 
              new DelegationTokenIdentifier();
          delegationTokenId.readFields(in);
          long expiryTime = readLong(in);
          fsNamesys.getDelegationTokenSecretManager()
              .updatePersistedTokenRenewal(delegationTokenId, expiryTime);
          break;
        }
        case OP_CANCEL_DELEGATION_TOKEN: {
          if (logVersion > -19) {
            throw new IOException("Unexpected opcode " + opcode
                + " for version " + logVersion);
          }
          numOpCancelDelegationToken++;
          DelegationTokenIdentifier delegationTokenId = 
              new DelegationTokenIdentifier();
          delegationTokenId.readFields(in);
          fsNamesys.getDelegationTokenSecretManager()
              .updatePersistedTokenCancellation(delegationTokenId);
          break;
        }
        case OP_UPDATE_MASTER_KEY: {
          if (logVersion > -19) {
            throw new IOException("Unexpected opcode " + opcode
                + " for version " + logVersion);
          }
          numOpUpdateMasterKey++;
          DelegationKey delegationKey = new DelegationKey();
          delegationKey.readFields(in);
          fsNamesys.getDelegationTokenSecretManager().updatePersistedMasterKey(
              delegationKey);
          break;
        }
        default: {
          throw new IOException("Never seen opcode " + opcode);
        }
        }
      }
    } catch (IOException ex) {
      // Failed to load 0.20.203 version edits during upgrade. This version has
      // conflicting opcodes with the later releases. The editlog must be 
      // emptied by restarting the namenode, before proceeding with the upgrade.
      if (Storage.is203LayoutVersion(logVersion) &&
          logVersion != FSConstants.LAYOUT_VERSION) {
        String msg = "During upgrade, failed to load the editlog version " + 
        logVersion + " from release 0.20.203. Please go back to the old " + 
        " release and restart the namenode. This empties the editlog " +
        " and saves the namespace. Resume the upgrade after this step.";
        throw new IOException(msg, ex);
      } else {
        throw ex;
      }
      
    } finally {
      in.close();
    }
    FSImage.LOG.info("Edits file " + edits.getName() 
        + " of size " + edits.length() + " edits # " + numEdits 
        + " loaded in " + (FSNamesystem.now()-startTime)/1000 + " seconds.");

    if (FSImage.LOG.isDebugEnabled()) {
      FSImage.LOG.debug("numOpAdd = " + numOpAdd + " numOpClose = " + numOpClose 
          + " numOpDelete = " + numOpDelete + " numOpRename = " + numOpRename 
          + " numOpSetRepl = " + numOpSetRepl + " numOpMkDir = " + numOpMkDir
          + " numOpSetPerm = " + numOpSetPerm 
          + " numOpSetOwner = " + numOpSetOwner
          + " numOpSetGenStamp = " + numOpSetGenStamp 
          + " numOpTimes = " + numOpTimes
          + " numOpGetDelegationToken = " + numOpGetDelegationToken
          + " numOpRenewDelegationToken = " + numOpRenewDelegationToken
          + " numOpCancelDelegationToken = " + numOpCancelDelegationToken
          + " numOpUpdateMasterKey = " + numOpUpdateMasterKey
          + " numOpOther = " + numOpOther);
    }

    if (logVersion != FSConstants.LAYOUT_VERSION) // other version
      numEdits++; // save this image asap
    return numEdits;
  }
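The replay loop above treats the edit log as a flat stream of transactions: one opcode byte, then an operation-specific payload, terminated either by an explicit OP_INVALID marker or by end-of-file. The following minimal sketch reproduces that record format; the opcode values mirror the real FSEditLog constants, but the payload here is simplified to a single path string purely for illustration.

```java
import java.io.*;

// Minimal sketch of the edit-log record format parsed by loadFSEdits():
// each transaction is one opcode byte followed by an op-specific payload,
// and the stream ends at OP_INVALID or EOF.
public class EditLogSketch {
  static final byte OP_INVALID = -1;
  static final byte OP_DELETE  = 2;   // value matches FSEditLog.OP_DELETE
  static final byte OP_MKDIR   = 3;   // value matches FSEditLog.OP_MKDIR

  // Append one simplified record: opcode + path (real payloads are richer).
  static void writeRecord(DataOutputStream out, byte opcode, String path)
      throws IOException {
    out.writeByte(opcode);
    out.writeUTF(path);
  }

  // Replay the log the same way loadFSEdits() does: read opcodes until
  // OP_INVALID or EOF, consume each payload, and count the transactions.
  static int countEdits(DataInputStream in) throws IOException {
    int numEdits = 0;
    while (true) {
      byte opcode;
      try {
        opcode = in.readByte();
        if (opcode == OP_INVALID) break;  // explicit end-of-log marker
      } catch (EOFException e) {
        break;                            // no more transactions
      }
      in.readUTF();                       // skip the simplified payload
      numEdits++;
    }
    return numEdits;
  }

  public static void main(String[] args) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buf);
    writeRecord(out, OP_MKDIR, "/user");
    writeRecord(out, OP_DELETE, "/tmp/x");
    out.writeByte(OP_INVALID);            // terminate the log

    DataInputStream in =
        new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
    System.out.println(countEdits(in));   // prints 2
  }
}
```

This is why a damaged trailing record is tolerable: replay simply stops at the first unreadable opcode, which is also why loadFSEdits() treats EOFException as a normal end of the log.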
   The Secondary NameNode
        Producing the namespace image via FSImage.saveFSImage() is a very resource-intensive operation. Periodically saving metadata on the name node with this method would require keeping the name node in a read-only state, severely affecting the services running on HDFS. On the other hand, if only the edit log were used to record metadata changes, the log file would grow ever larger as the system runs; this does not affect the system while the name node is running, but a restarted name node would need a long time to execute FSEditLog.loadFSEdits(), hurting its availability. The solution is to run a secondary name node, which periodically fetches and merges the name node's namespace image and edit log, produces a new namespace image (also called a metadata checkpoint), uploads it to replace the name node's original image, and empties the edit log.
      The checkpoint creation flow, implemented in SecondaryNameNode.doCheckpoint(), is as follows:
       
 /**
   * Create a new checkpoint
   */
  void doCheckpoint() throws IOException {

    // Do the required initialization of the merge work area.
    startCheckpoint();

    // Tell the namenode to start logging transactions in a new edit file
    // Returns a token that will be used to upload the merged image.
    CheckpointSignature sig = (CheckpointSignature)namenode.rollEditLog();

    // error simulation code for junit test
    if (ErrorSimulator.getErrorSimulation(0)) {
      throw new IOException("Simulating error0 " +
                            "after creating edits.new");
    }

    downloadCheckpointFiles(sig);   // Fetch fsimage and edits
    doMerge(sig);                   // Do the merge
  
    //
    // Upload the new image into the NameNode. Then tell the Namenode
    // to make this new uploaded image as the most current image.
    //
    putFSImage(sig);

    // error simulation code for junit test
    if (ErrorSimulator.getErrorSimulation(1)) {
      throw new IOException("Simulating error1 " +
                            "after uploading new image to NameNode");
    }

    namenode.rollFsImage();
    checkpointImage.endCheckpoint();

    LOG.warn("Checkpoint done. New Image Size: " 
              + checkpointImage.getFsImageName().length());
  }
        Expanding on this process: the secondary name node keeps the data from both before and after the namespace-image merge on disk and, like the other nodes, maintains a directory structure there, so doCheckpoint() must first make sure its working area is in the proper state. The secondary name node then calls the remote method NamenodeProtocol.rollEditLog() to have the name node prepare: the name node closes output to the edit log file "edits", and subsequent log records are written to the "edits.new" file.
       Next, the secondary name node downloads the name node's namespace image "fsimage" and edit log "edits" over HTTP into its local working area and merges them; the merged in-memory metadata is then saved to the secondary name node's disk, forming a new image file. Via putFSImage(), the new namespace image is uploaded to the name node over HTTP. Once the upload succeeds, the secondary name node again uses the remote interface NamenodeProtocol, calling rollFsImage() to notify the name node. The name node then adopts the new namespace image as the current image and renames "edits.new" to "edits", so that the new namespace image and the edit log "edits" together once again define the true state of the in-memory metadata.
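The two roll steps on the name node side boil down to a pair of file transitions, which the sketch below simulates with plain file operations. Everything here is a simplified stand-in: the real name node manages these files through FSImage/FSEditLog across multiple storage directories, not with bare java.nio.file calls.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the file transitions behind the checkpoint handshake:
// rollEditLog() freezes "edits" and opens "edits.new"; rollFsImage()
// promotes the uploaded merged image and folds edits.new back to edits.
public class CheckpointRollSketch {
  final Path dir;
  CheckpointRollSketch(Path dir) { this.dir = dir; }

  // Step 1 (rollEditLog): new records go to "edits.new" from now on.
  void rollEditLog() throws IOException {
    Files.createFile(dir.resolve("edits.new"));
  }

  // Step 2 (rollFsImage): after the merged image ("fsimage.ckpt" here,
  // a hypothetical name for this sketch) has been uploaded, make it the
  // current image and rename edits.new to edits.
  void rollFsImage() throws IOException {
    Files.move(dir.resolve("fsimage.ckpt"), dir.resolve("fsimage"),
               StandardCopyOption.REPLACE_EXISTING);
    Files.delete(dir.resolve("edits"));
    Files.move(dir.resolve("edits.new"), dir.resolve("edits"));
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("nn");
    Files.createFile(dir.resolve("fsimage"));
    Files.createFile(dir.resolve("edits"));

    CheckpointRollSketch nn = new CheckpointRollSketch(dir);
    nn.rollEditLog();                              // secondary asked us to roll
    Files.createFile(dir.resolve("fsimage.ckpt")); // merged image uploaded
    nn.rollFsImage();                              // promote it

    System.out.println(Files.exists(dir.resolve("fsimage")) &&
                       Files.exists(dir.resolve("edits")) &&
                       !Files.exists(dir.resolve("edits.new")));  // prints true
  }
}
```

Keeping new records in a separate "edits.new" file is what lets the merge run against a frozen "fsimage" + "edits" pair while the name node stays writable throughout the checkpoint.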
      

   Copyright notice: parts of this article are excerpted from the book《Hadoop技术内幕:深入解析Hadoop Common和HDFS架构设计与实现原理》by 蔡斌 and 陈湘萍. These are study notes intended for technical exchange only; the commercial copyright remains with the original authors. Readers are encouraged to buy the book for further study; please credit the original authors when reposting. Thanks!