Hadoop Source Code Analysis Notes (4): Introduction to the Hadoop File System

剑邑龙泉

Introduction to the Hadoop File System

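The Hadoop file system comprises the Hadoop abstract file system and the many concrete file systems built on it, which together meet the varied data access needs of applications built on Hadoop. It represents a new stage in the evolution of file systems.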

File System Implementation

1. Block Management

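The physical structure of a file refers to how the file is stored on a storage device (such as a disk). For ease of management, devices usually organize their storage space into units with a defined structure. Taking a disk as an example, it is logically divided into tracks, cylinders, and sectors. The sector is the disk's unit of reading and writing and also its smallest addressable unit. A sector is typically 512 bytes; disks with 4096-byte sectors were introduced after 2009.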

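Block management records the association between storage blocks and files. For random-access storage devices, there are generally three ways to implement it: (1) contiguous allocation, (2) linked lists, and (3) index tables.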
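The three methods differ mainly in how a file's logical block number is turned into a physical block number. The sketch below illustrates that lookup for each one. It is a simplified model, not code from Hadoop or from any real file system: the class name, the `next` table, and the per-file index array are all assumptions made for illustration, and the third method is read here as a per-file table listing the file's blocks in order.

```java
import java.util.Arrays;

/** Illustrative only: maps a file's logical block number to a physical block number. */
final class BlockAllocationSketch {

    /** Contiguous allocation: the file occupies blocks start, start + 1, ... */
    static int contiguous(int startBlock, int logicalBlock) {
        return startBlock + logicalBlock;
    }

    /**
     * Linked allocation: next[b] holds the block that follows b in the file
     * (-1 marks the end), so reaching logical block n costs n pointer hops.
     */
    static int linked(int[] next, int firstBlock, int logicalBlock) {
        int block = firstBlock;
        for (int i = 0; i < logicalBlock; i++) {
            block = next[block];
            if (block < 0) {
                throw new IndexOutOfBoundsException("past end of file");
            }
        }
        return block;
    }

    /** Indexed allocation: a per-file index lists the physical blocks in order. */
    static int indexed(int[] indexBlock, int logicalBlock) {
        return indexBlock[logicalBlock];
    }

    public static void main(String[] args) {
        // A three-block file stored in physical blocks 4 -> 7 -> 2.
        int[] next = new int[8];
        Arrays.fill(next, -1);
        next[4] = 7;
        next[7] = 2;
        int[] index = {4, 7, 2};

        System.out.println(contiguous(10, 2));   // 12
        System.out.println(linked(next, 4, 2));  // 2
        System.out.println(indexed(index, 2));   // 2
    }
}
```

Contiguous allocation gives constant-time lookup but needs one unbroken run of free blocks. The linked form can use scattered blocks at the cost of a walk for random access, and the index restores constant-time lookup by spending extra space on the table.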

2. Directory Management

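As a container for files and subdirectories, a directory stores its data as a set of structured records, each describing one file or subdirectory in the collection. Each record provides sufficient information about its entry.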

The Hadoop Abstract File System

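In Hadoop, computed results often need to be written to a stream or read back from one. Java's data streams DataOutputStream and DataInputStream provide methods for writing and reading all Java primitive types. Data streams are used widely in Hadoop's implementation, for example in the Writable serialization mechanism.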
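To make the role of the data streams concrete, here is a minimal round trip using only the JDK. `PointRecord` and its fields are hypothetical names invented for this sketch. The write/read pair is written in the spirit of a Writable, which serializes its fields in a fixed order and reads them back in the same order, but it does not use Hadoop's Writable interface itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical record type used only for illustration; it is not a Hadoop class. */
final class PointRecord {
    final int x;
    final long timestamp;
    final String label;

    PointRecord(int x, long timestamp, String label) {
        this.x = x;
        this.timestamp = timestamp;
        this.label = label;
    }

    /** Writes the fields in a fixed order. */
    void write(DataOutputStream out) throws IOException {
        out.writeInt(x);          // 4 bytes
        out.writeLong(timestamp); // 8 bytes
        out.writeUTF(label);      // 2-byte length prefix + modified UTF-8 bytes
    }

    /** Reads the fields back in exactly the order they were written. */
    static PointRecord read(DataInputStream in) throws IOException {
        int x = in.readInt();
        long timestamp = in.readLong();
        String label = in.readUTF();
        return new PointRecord(x, timestamp, label);
    }

    /** Serializes a record into an in-memory byte array. */
    static byte[] toBytes(PointRecord record) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            record.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Rebuilds a record from bytes produced by toBytes. */
    static PointRecord fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return read(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new PointRecord(7, 1234567890123L, "hdfs"));
        PointRecord copy = fromBytes(bytes);
        System.out.println(bytes.length); // 18 = 4 + 8 + 2 + 4
        System.out.println(copy.x + " " + copy.timestamp + " " + copy.label);
    }
}
```

Because the format carries no field names or type tags, the reader must know the write order in advance; that is what keeps the encoding compact.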

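To provide a uniform interface for accessing different kinds of data, Hadoop borrowed the concept of the Linux virtual file system and introduced the Hadoop abstract file system. On top of this abstraction it provides many concrete file system implementations, which meet the varied data access needs of applications built on Hadoop. The abstract file system class is org.apache.hadoop.fs.FileSystem.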

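As with the Linux and Java file APIs, the methods of the Hadoop abstract file system fall into two groups: one handles operations on files and directories, and the other reads and writes file data.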
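Since the text compares this split to the Java file API, the sketch below shows the same two groups using the JDK's java.nio.file package rather than Hadoop itself. In org.apache.hadoop.fs.FileSystem the first group corresponds to methods such as mkdirs, delete, listStatus, and getFileStatus, and the second to open and create. The class name, directory name, and file name here are made up for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: JDK analog of the two method groups, not Hadoop code. */
final class TwoKindsOfFileOps {

    /** Runs both kinds of operation in a temp directory and summarizes the result. */
    static String demo() {
        try {
            // Group 1: file and directory management (namespace and metadata).
            Path root = Files.createTempDirectory("fs-sketch");
            Path dir = Files.createDirectories(root.resolve("logs"));
            Path file = dir.resolve("part-0.txt");

            // Group 2: reading and writing file data through streams.
            try (OutputStream out = Files.newOutputStream(file)) {
                out.write("hello".getBytes(StandardCharsets.UTF_8));
            }
            String text;
            try (InputStream in = Files.newInputStream(file)) {
                text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }

            // Group 1 again: inspect metadata, then clean up.
            long size = Files.size(file);
            boolean isDir = Files.isDirectory(dir);
            Files.delete(file);
            Files.delete(dir);
            Files.delete(root);
            return text + ":" + size + ":" + isDir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // hello:5:true
    }
}
```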

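In the Hadoop abstract file system, file data is read through the stream FSDataInputStream and, correspondingly, written through the class FSDataOutputStream. Their definitions, along with the interfaces they rely on, are as follows: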

public interface Seekable {
  /**
   * Seek to the given offset from the start of the file.
   * The next read() will be from that location.  Can't
   * seek past the end of the file.
   */
  void seek(long pos) throws IOException;
  
  /**
   * Return the current offset from the start of the file
   */
  long getPos() throws IOException;

  /**
   * Seeks a different copy of the data.  Returns true if 
   * found a new source, false otherwise.
   */
  boolean seekToNewSource(long targetPos) throws IOException;
}


/** Stream that permits positional reading. */
public interface PositionedReadable {
  /**
   * Read up to the specified number of bytes, from a given
   * position within a file, and return the number of bytes read. This does not
   * change the current offset of a file, and is thread-safe.
   */
  public int read(long position, byte[] buffer, int offset, int length)
    throws IOException;
  
  /**
   * Read the specified number of bytes, from a given
   * position within a file. This does not
   * change the current offset of a file, and is thread-safe.
   */
  public void readFully(long position, byte[] buffer, int offset, int length)
    throws IOException;
  
  /**
   * Read a number of bytes equal to the length of the buffer, from a given
   * position within a file. This does not
   * change the current offset of a file, and is thread-safe.
   */
  public void readFully(long position, byte[] buffer) throws IOException;
}

public interface Closeable extends AutoCloseable {

    /**
     * Closes this stream and releases any system resources associated
     * with it. If the stream is already closed then invoking this
     * method has no effect.
     *
     * @throws IOException if an I/O error occurs
     */
    public void close() throws IOException;
}

public class FSDataInputStream extends DataInputStream
    implements Seekable, PositionedReadable, Closeable {
......
}

public class FSDataOutputStream extends DataOutputStream implements Syncable {
  ......
}
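The difference between Seekable.seek and the positional reads of PositionedReadable is easiest to see side by side. The sketch below uses the JDK's FileChannel as a stand-in, because it offers both a relative read that advances the current offset and an absolute read that leaves it alone. This is an analogy only: it does not use FSDataInputStream, and the class and helper names are invented for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Illustrative only: JDK analog of seek/getPos versus positional read. */
final class SeekVersusPositionalRead {

    /** Reads len bytes at the channel's current offset, advancing that offset. */
    private static String readRelative(FileChannel ch, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining() && ch.read(buf) >= 0) {
            // keep reading until the buffer is full or end of file
        }
        return new String(buf.array(), 0, buf.position(), StandardCharsets.US_ASCII);
    }

    /** Reads len bytes at an absolute position without moving the channel's offset. */
    private static String readAt(FileChannel ch, long position, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        long at = position;
        while (buf.hasRemaining()) {
            int n = ch.read(buf, at);
            if (n < 0) {
                break; // end of file
            }
            at += n;
        }
        return new String(buf.array(), 0, buf.position(), StandardCharsets.US_ASCII);
    }

    static String demo() {
        try {
            Path file = Files.createTempFile("seek-sketch", ".txt");
            try {
                Files.write(file, "0123456789".getBytes(StandardCharsets.US_ASCII));
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                    ch.position(4);                          // like seek(4)
                    String afterSeek = readRelative(ch, 2);  // "45", offset is now 6
                    String positional = readAt(ch, 0, 3);    // "012", offset unchanged
                    long pos = ch.position();                // like getPos(): still 6
                    return afterSeek + "," + positional + "," + pos;
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 45,012,6
    }
}
```

Because a positional read never touches the shared offset, several threads can read different regions through one open stream, which matches the thread-safety note in the PositionedReadable comments.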

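Among the concrete file systems implemented in Hadoop, the main ones are the local fs.LocalFileSystem and fs.RawLocalFileSystem, HDFS's hdfs.DistributedFileSystem, and the in-memory rs.RamInMemoryFileSystem and fs.InMemoryFileSystem. This range of implementations lets Hadoop applications access data in different environments.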

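Copyright notice: parts of this article are excerpted from the book 【Hadoop技术内幕深入解析Hadoop Common和HDFS架构设计与实现原理】 by 【蔡斌、陈湘萍】. It serves only as study notes for technical exchange, and commercial rights remain with the original authors. Readers are encouraged to buy the book for further study. Please retain the original attribution when reposting. Thank you!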
