The Hadoop Distributed File System

珠峰之巅-程序员

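By default, an HDFS block is 64 MB (the Hadoop 1.x default; later releases default to 128 MB). This is huge compared with a disk block, so the time to transfer the data in a block is much larger than the time to seek to the start of the block. As a result, transferring a large file made up of many blocks runs at close to the disk transfer rate.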
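The trade-off between seek time and transfer time can be checked with a little arithmetic. The sketch below is only an illustration: the 10 ms seek time and 100 MB/s transfer rate are assumed round figures, not measurements, and the class name BlockSizeSketch is hypothetical.

```java
// Sketch only: seek time and transfer rate are assumed round figures.
final class BlockSizeSketch {

    /** Fraction of the total read time for one block that is spent seeking. */
    static double seekFraction(double seekMs, double transferMBps, double blockMB) {
        double transferMs = blockMB / transferMBps * 1000.0; // time to stream the block
        return seekMs / (seekMs + transferMs);
    }

    public static void main(String[] args) {
        double seekMs = 10.0;        // assumed average seek time
        double transferMBps = 100.0; // assumed sustained transfer rate
        double[] blockSizesMB = {0.004, 1.0, 64.0, 128.0}; // 4 KB up to 128 MB
        for (double mb : blockSizesMB) {
            System.out.printf("block %.3f MB: %.2f%% of the time is spent seeking%n",
                    mb, 100.0 * seekFraction(seekMs, transferMBps, mb));
        }
    }
}
```

Under these assumptions a 1 MB block spends about half its read time seeking, while a 64 MB block spends under 2%, which is the reasoning behind the large default.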

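Namenodes and Datanodes

The namenode manages the filesystem namespace. It maintains the filesystem tree and the metadata for all the files and directories in the tree. This information is stored persistently on the local disk in the form of two files: the namespace image and the edit log.

Datanodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to, and they report back to the namenode with lists of the blocks they are storing.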

The command-line interface:

% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
% hadoop fs -copyToLocal quangle.txt quangle.copy.txt
% md5 input/docs/quangle.txt quangle.copy.txt
MD5 (input/docs/quangle.txt) = a16f231da6b05e2ba7a339320e7dacd9

Reading data from a Hadoop URL:

InputStream in = null;
try {
    in = new URL("hdfs://host/path").openStream();
    // process in
} finally {
    IOUtils.closeStream(in);
}

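For Java to recognize the hdfs URL scheme, a FsUrlStreamHandlerFactory has to be registered first, as in this complete program:

import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {

    static {
        // Can be called at most once per JVM, so it is done in a static block.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            in = new URL(args[0]).openStream();
            // Copy to standard output with a 4 KB buffer; do not close the streams here.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}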

Obtaining a FileSystem:

public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException, InterruptedException

Obtaining an input stream:

public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException

FSDataInputStream implements Seekable, which lets you move to a position in the file and query the current position:

public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}
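To see what seeking buys you, here is a rough analogue in plain Java so that it runs without Hadoop on the classpath. It uses java.io.RandomAccessFile, whose seek() and getFilePointer() play the roles of Seekable.seek() and Seekable.getPos(); the class name SeekSketch and the temporary file are hypothetical stand-ins for an FSDataInputStream opened on an HDFS path.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Analogue only: RandomAccessFile stands in for FSDataInputStream.
final class SeekSketch {

    /** Writes text to a temp file, reads it, seeks back to 0, and reads it again. */
    static String readTwice(String text) {
        try {
            File tmp = File.createTempFile("seek-sketch", ".txt");
            tmp.deleteOnExit();
            byte[] data = text.getBytes(StandardCharsets.UTF_8);
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw")) {
                raf.write(data);
                byte[] buf = new byte[data.length];
                StringBuilder result = new StringBuilder();

                raf.seek(0);          // rewind to the start of the file
                raf.readFully(buf);   // first pass
                result.append(new String(buf, StandardCharsets.UTF_8));

                raf.seek(0);          // go back to the start and read again
                raf.readFully(buf);   // second pass
                result.append(new String(buf, StandardCharsets.UTF_8));
                return result.toString();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readTwice("content"));
    }
}
```

With an FSDataInputStream the same pattern would be a copy to the output, then in.seek(0), then a second copy. Seeking is relatively expensive, so it is best used sparingly rather than as the main access pattern.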

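Writing data:

The create() method returns an output stream to write to:

public FSDataOutputStream create(Path f) throws IOException

An overloaded version of create() takes a callback interface, Progressable, so your application can be notified of the progress of the write:

package org.apache.hadoop.util;

public interface Progressable {
    public void progress();
}

Appending to the end of an existing file:

public FSDataOutputStream append(Path f) throws IOException

Directories:

mkdirs() creates a new directory:

public boolean mkdirs(Path f) throws IOException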

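Querying the filesystem:

File metadata is held in FileStatus.

The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, ownership, and permission information.

The getFileStatus() method on FileSystem provides a way of getting a FileStatus object for a single file or directory.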

import static org.hamcrest.Matchers.is;
import static org.hamcrest.Matchers.lessThanOrEqualTo;
import static org.junit.Assert.assertThat;

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ShowFileStatusTest {

    private MiniDFSCluster cluster; // use an in-process HDFS cluster for testing
    private FileSystem fs;

    @Before
    public void setUp() throws IOException {
        Configuration conf = new Configuration();
        if (System.getProperty("test.build.data") == null) {
            System.setProperty("test.build.data", "/tmp");
        }
        cluster = new MiniDFSCluster(conf, 1, true, null);
        fs = cluster.getFileSystem();
        // Create a 7-byte file for the tests below to inspect.
        OutputStream out = fs.create(new Path("/dir/file"));
        out.write("content".getBytes("UTF-8"));
        out.close();
    }

    @After
    public void tearDown() throws IOException {
        if (fs != null) { fs.close(); }
        if (cluster != null) { cluster.shutdown(); }
    }

    @Test(expected = FileNotFoundException.class)
    public void throwsFileNotFoundForNonExistentFile() throws IOException {
        fs.getFileStatus(new Path("no-such-file"));
    }

    @Test
    public void fileStatusForFile() throws IOException {
        Path file = new Path("/dir/file");
        FileStatus stat = fs.getFileStatus(file);
        assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
        assertThat(stat.isDir(), is(false)); // later releases prefer isDirectory()
        assertThat(stat.getLen(), is(7L));
        assertThat(stat.getModificationTime(),
            is(lessThanOrEqualTo(System.currentTimeMillis())));
        assertThat(stat.getReplication(), is((short) 1));
        assertThat(stat.getBlockSize(), is(64 * 1024 * 1024L));
        assertThat(stat.getOwner(), is("tom"));
        assertThat(stat.getGroup(), is("supergroup"));
        assertThat(stat.getPermission().toString(), is("rw-r--r--"));
    }

    @Test
    public void fileStatusForDirectory() throws IOException {
        Path dir = new Path("/dir");
        FileStatus stat = fs.getFileStatus(dir);
        assertThat(stat.getPath().toUri().getPath(), is("/dir"));
        assertThat(stat.isDir(), is(true));
        assertThat(stat.getLen(), is(0L));
        assertThat(stat.getModificationTime(),
            is(lessThanOrEqualTo(System.currentTimeMillis())));
        assertThat(stat.getReplication(), is((short) 0));
        assertThat(stat.getBlockSize(), is(0L));
        assertThat(stat.getOwner(), is("tom"));
        assertThat(stat.getGroup(), is("supergroup"));
        assertThat(stat.getPermission().toString(), is("rwxr-xr-x"));
    }
}

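Listing files:

Finding information about a single file or directory is useful, but you also often need to list the contents of a directory. That is what FileSystem's listStatus() methods are for:

public FileStatus[] listStatus(Path f) throws IOException

public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException

public FileStatus[] listStatus(Path[] files) throws IOException

public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException

The PathFilter argument restricts which files and directories are matched.

File patterns:

It is a common requirement to process a set of files in a single operation. Rather than enumerating each file, you can match them with a glob pattern:

public FileStatus[] globStatus(Path pathPattern) throws IOException

public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

PathFilter:

Glob patterns are not always powerful enough to describe the set of files you want to access. For example, it is not generally possible to exclude a particular file using a glob pattern. An optional PathFilter gives programmatic control over matching:

package org.apache.hadoop.fs;

public interface PathFilter {
    boolean accept(Path path);
}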
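A typical use of such a filter is to reject any path that matches a regular expression. The following is a minimal sketch of that idea in plain Java so that it runs without Hadoop on the classpath: the class name RegexExcludeSketch, the helper filter(), and the sample paths are all hypothetical, and accept(String) stands in for PathFilter.accept(Path).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Sketch only: accept(String) mirrors the shape of PathFilter.accept(Path).
final class RegexExcludeSketch {

    private final Pattern pattern;

    RegexExcludeSketch(String regex) {
        this.pattern = Pattern.compile(regex);
    }

    /** Accepts every path except those that match the regular expression. */
    boolean accept(String path) {
        return !pattern.matcher(path).matches();
    }

    /** Applies the exclusion filter to a list of candidate paths, keeping order. */
    static List<String> filter(List<String> paths, String regex) {
        RegexExcludeSketch filter = new RegexExcludeSketch(regex);
        List<String> kept = new ArrayList<>();
        for (String path : paths) {
            if (filter.accept(path)) {
                kept.add(path);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Pretend a glob already expanded to these paths; drop the last day of the year.
        List<String> expanded = Arrays.asList("/2007/12/30", "/2007/12/31");
        System.out.println(filter(expanded, "^.*/2007/12/31$"));
    }
}
```

In a real Hadoop program the same logic would sit in a class implementing PathFilter and be passed as the second argument to globStatus() or listStatus(). Note that a filter of this kind only sees the path, not other file attributes.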

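Deleting data:

Use the delete() method on FileSystem to permanently remove files or directories:

public boolean delete(Path f, boolean recursive) throws IOException

If recursive is true, a nonempty directory is deleted along with its contents. If f is a file or an empty directory, the value of recursive is ignored.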

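Data flow:

Anatomy of a file read:

The HDFS client opens the file through DistributedFileSystem, which gets the block locations from the namenode.

The client then reads from the FSDataInputStream, which fetches the data from the datanodes.

Anatomy of a file write:

The HDFS client creates the file through DistributedFileSystem, which asks the namenode to create the new file. When the write is finished, the namenode is also told that the file is complete.

The client writes data through the FSDataOutputStream to the datanodes, and closes the stream when it is done.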

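Replica placement:

Hadoop's default strategy is to place the first replica on the same node as the client. The second replica is placed on a randomly chosen node on a different rack from the first. The third replica is placed on the same rack as the second, but on a different node.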
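The three rules above can be sketched as a small simulation. This is not Hadoop's placement code: the class name ReplicaPlacementSketch, the rack and node names, and the "rack/node" string format are hypothetical, and the sketch assumes the client runs on a cluster node, that there are at least two racks, and that every remote rack has at least two nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Simulation only: models the three placement rules, not Hadoop's real implementation.
final class ReplicaPlacementSketch {

    /** Returns three replica locations, each formatted as "rack/node". */
    static List<String> place(Map<String, List<String>> topology,
                              String clientRack, String clientNode, Random rnd) {
        List<String> chosen = new ArrayList<>();

        // Rule 1: the first replica goes on the client's own node.
        chosen.add(clientRack + "/" + clientNode);

        // Rule 2: the second replica goes on a random node on a different rack.
        List<String> otherRacks = new ArrayList<>(topology.keySet());
        otherRacks.remove(clientRack);
        String remoteRack = otherRacks.get(rnd.nextInt(otherRacks.size()));
        List<String> candidates = new ArrayList<>(topology.get(remoteRack));
        String second = candidates.remove(rnd.nextInt(candidates.size()));
        chosen.add(remoteRack + "/" + second);

        // Rule 3: the third replica goes on the same rack as the second, different node.
        String third = candidates.get(rnd.nextInt(candidates.size()));
        chosen.add(remoteRack + "/" + third);

        return chosen;
    }

    public static void main(String[] args) {
        Map<String, List<String>> topology = new TreeMap<>();
        topology.put("rack1", Arrays.asList("node1", "node2"));
        topology.put("rack2", Arrays.asList("node3", "node4"));
        topology.put("rack3", Arrays.asList("node5", "node6"));
        System.out.println(place(topology, "rack1", "node1", new Random()));
    }
}
```

The layout spreads a block over two racks, so it survives the loss of a whole rack, while only one of the three writes has to cross racks.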

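Ensuring consistency:

HDFS can force all buffers to be synchronized to the datanodes via the sync() method on FSDataOutputStream. Once sync() returns successfully, the data written so far is visible to new readers:

Path p = new Path("p");
FSDataOutputStream out = fs.create(p);
out.write("content".getBytes("UTF-8"));
out.flush();
out.sync(); // deprecated in later Hadoop releases in favor of hflush()
assertThat(fs.getFileStatus(p).getLen(), is(((long) "content".length())));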

Parallel copying with distcp:

% hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

Using Hadoop Archives:

% hadoop fs -lsr /my/files