Chapter 3, Section 5: The Java Interface

fkbush

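In this section, we dig into the Hadoop FileSystem class: the API for interacting with one of Hadoop's filesystems.

Although we focus mainly on the HDFS implementation, DistributedFileSystem, in general you should strive to write your code against the FileSystem abstract class, to retain portability across filesystems. This is very useful when testing your program, for example, because you can rapidly run tests using data stored on the local filesystem.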

Reading Data from a Hadoop URL

One of the simplest ways to read a file from a Hadoop filesystem is to use a java.net.URL object to open a stream to read the data from. The general idiom is:

		InputStream in = null;
		try {
			in = new URL("hdfs://localhost/test").openStream();
			IOUtils.copyBytes(in, System.out, 4096, false);
		} finally {
			IOUtils.closeStream(in);
		}

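There's a little more work required to make Java recognize Hadoop's hdfs URL scheme. This is achieved by calling the setURLStreamHandlerFactory() method on URL with an instance of FsUrlStreamHandlerFactory. This method can be called only once per JVM, so it is typically executed in a static block. This limitation means that if some other part of your program (perhaps a third-party component outside your control) has already set a URLStreamHandlerFactory, you won't be able to use this approach for reading data from Hadoop. The next section discusses an alternative.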
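The once-per-JVM restriction can be observed with the JDK alone. The following is a minimal sketch and not part of the Hadoop examples: the class name OneFactoryPerJvm and the helper tryInstall() are made up for illustration, and a do-nothing placeholder factory stands in for FsUrlStreamHandlerFactory so that no Hadoop JARs are needed.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

final class OneFactoryPerJvm {
    /**
     * Tries to install a URLStreamHandlerFactory (hypothetical helper).
     * Returns true on success, false if the JVM already has a factory.
     */
    static boolean tryInstall(URLStreamHandlerFactory factory) {
        try {
            URL.setURLStreamHandlerFactory(factory);
            return true;
        } catch (Error alreadyDefined) {
            // The JDK reports a second installation attempt with java.lang.Error.
            // Catching Error this broadly is acceptable only in a demo.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder factory: returning null defers to the JDK's built-in handlers.
        URLStreamHandlerFactory placeholder = protocol -> null;
        System.out.println("first attempt:  " + tryInstall(placeholder));
        System.out.println("second attempt: " + tryInstall(placeholder));
    }
}
```

On a JVM where no factory has been installed yet, the first attempt should succeed and the second should fail. A third-party component that gets there first therefore locks you out of the URL-based approach.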

Example 3-1 shows a program for displaying files from Hadoop filesystems on standard output, like the Unix cat command, using a URLStreamHandler.

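import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {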
	static {
		URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
	}

	public static void main(String[] args) throws Exception {
		InputStream in = null;
		try {
			in = new URL(args[0]).openStream();
			IOUtils.copyBytes(in, System.out, 4096, false);
		} finally {
			IOUtils.closeStream(in);
		}
	}
}

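We make use of the handy IOUtils class that comes with Hadoop for closing the stream in the finally clause, and also for copying bytes between the input stream and the output stream (System.out, in this case). The last two arguments to the copyBytes() method are the buffer size used for copying and whether to close the streams when the copy is complete. We close the input stream ourselves, and System.out doesn't need to be closed.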
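To make the roles of the last two arguments concrete, here is a rough sketch of the kind of loop such a copy helper performs. It is written against plain java.io and is not Hadoop's actual IOUtils implementation; the class CopySketch and its roundTrip() helper are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class CopySketch {
    /** Copies all bytes from in to out through a buffer of buffSize bytes. */
    static void copyBytes(InputStream in, OutputStream out, int buffSize, boolean close)
            throws IOException {
        if (buffSize <= 0) {
            throw new IllegalArgumentException("buffSize must be positive");
        }
        try {
            byte[] buf = new byte[buffSize];
            int n;
            while ((n = in.read(buf)) != -1) { // -1 signals end of stream
                out.write(buf, 0, n);
            }
            out.flush();
        } finally {
            if (close) {
                // Close both streams, even if closing the first one fails.
                try {
                    in.close();
                } finally {
                    out.close();
                }
            }
        }
    }

    /** Copies text through in-memory streams and returns what arrived (demo helper). */
    static String roundTrip(String text, int buffSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copyBytes(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)),
                    out, buffSize, false);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A deliberately tiny buffer forces several passes through the loop.
        System.out.println(roundTrip("On the top of the Crumpetty Tree", 4));
    }
}
```

Passing false for the last argument, as the examples in this section do, leaves the caller responsible for closing the streams.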

Here's a sample run:

% export HADOOP_CLASSPATH=hadoop-examples.jar
% hadoop URLCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

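Reading Data Using the FileSystem API

As the previous section explained, sometimes it is impossible to set a URLStreamHandlerFactory for your application. In this case, you will need to use the FileSystem API to open an input stream for a file.

A file in a Hadoop filesystem is represented by a Hadoop Path object (and not a java.io.File object, since its semantics are too closely tied to the local filesystem). You can think of a Path as a Hadoop filesystem URI, such as hdfs://localhost/user/tom/quangle.txt.

FileSystem is a general filesystem API, so the first step is to retrieve an instance for the filesystem we want to use (HDFS, in this case). There are several static factory methods for getting a FileSystem instance: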

public static FileSystem get(Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf) throws IOException
public static FileSystem get(URI uri, Configuration conf, String user) throws IOException

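A Configuration object encapsulates a client or server's configuration, which is set using configuration files read from the classpath, such as etc/hadoop/core-site.xml. The first method returns the default filesystem (as specified in core-site.xml, or the local filesystem if not specified there). The second uses the given URI's scheme and authority to determine the filesystem to use, falling back to the default filesystem if no scheme is specified in the given URI. The third retrieves the filesystem as the given user, which is important in the context of security (see "Security" on page 309).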
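The scheme and authority that the second method relies on are exposed directly by the JDK's java.net.URI class. The sketch below only pulls a URI apart to show what there is to work with; it makes no Hadoop call, and the class name UriParts is made up for illustration.

```java
import java.net.URI;

final class UriParts {
    /** Describes the parts of a URI that select a filesystem, plus the path within it. */
    static String describe(String uri) {
        URI u = URI.create(uri);
        return "scheme=" + u.getScheme()
                + " authority=" + u.getAuthority()
                + " path=" + u.getPath();
    }

    public static void main(String[] args) {
        // A fully qualified URI names both the filesystem type and the host.
        System.out.println(describe("hdfs://localhost/user/tom/quangle.txt"));
        // Without a scheme, getScheme() and getAuthority() return null,
        // which is the case where the default filesystem applies.
        System.out.println(describe("/user/tom/quangle.txt"));
    }
}
```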

In some cases, you may want to retrieve a local filesystem instance. For this, you can use the convenience method getLocal():

public static LocalFileSystem getLocal(Configuration conf) throws IOException

With a FileSystem instance in hand, we invoke an open() method to get the input stream for a file:

public FSDataInputStream open(Path f) throws IOException
public abstract FSDataInputStream open(Path f, int bufferSize) throws IOException

The first method uses a default buffer size of 4 KB.

Putting this together, we can rewrite Example 3-1 as shown in Example 3-2:

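import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {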
	public static void main(String[] args) throws Exception {
		String uri = args[0];
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(URI.create(uri), conf);
		InputStream in = null;
		try {
			in = fs.open(new Path(uri));
			IOUtils.copyBytes(in, System.out, 4096, false);
		} finally {
			IOUtils.closeStream(in);
		}
	}
}

The program runs as follows:

% hadoop FileSystemCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

FSDataInputStream

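The open() method on FileSystem actually returns an FSDataInputStream rather than a standard java.io class. This class is a specialization of java.io.DataInputStream with support for random access, so you can read from any part of the stream: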

package org.apache.hadoop.fs;
public class FSDataInputStream extends DataInputStream implements Seekable, PositionedReadable {
    // implementation elided
}

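The Seekable interface permits seeking to a position in the file and provides a query method, getPos(), for the current offset from the start of the file: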

public interface Seekable {
    void seek(long pos) throws IOException;
    long getPos() throws IOException;
}

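Calling seek() with a position that is greater than the length of the file results in an IOException. Unlike the skip() method of java.io.InputStream, which can only move the stream to a point after the current position, seek() can move to an arbitrary, absolute position in the file.

Example 3-3 is a simple extension of Example 3-2 that writes a file to standard output twice: after writing it once, it seeks to the start of the file and streams through it once again: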

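import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemDoubleCat {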
	public static void main(String[] args) throws Exception {
		String uri = args[0];
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(URI.create(uri), conf);
		FSDataInputStream in = null;
		try {
			in = fs.open(new Path(uri));
			IOUtils.copyBytes(in, System.out, 4096, false);
			in.seek(0); // go back to the start of the file
			IOUtils.copyBytes(in, System.out, 4096, false);
		} finally {
			IOUtils.closeStream(in);
		}
	}
}

Here's the result of running it on a small file:

% hadoop FileSystemDoubleCat hdfs://localhost/user/tom/quangle.txt
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.

FSDataInputStream also implements the PositionedReadable interface for reading parts of a file at a given offset:

public interface PositionedReadable {
	public int read(long position, byte[] buffer, int offset, int length) throws IOException;

	public void readFully(long position, byte[] buffer, int offset, int length) throws IOException;

	public void readFully(long position, byte[] buffer) throws IOException;
}

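The read() method reads up to length bytes from the given position in the file into the buffer, starting at the given offset in the buffer. The return value is the number of bytes actually read; callers should check this value, as it may be less than length. The readFully() methods read length bytes into the buffer (or buffer.length bytes for the version that takes only a byte array), unless the end of the file is reached, in which case an EOFException is thrown.

All of these methods preserve the current offset in the file and are thread safe (although FSDataInputStream is not designed for concurrent access, so it is better to create multiple instances). They therefore provide a convenient way to access another part of the file (metadata, perhaps) while reading the main body of the file.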
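Because read() may return fewer bytes than requested, a caller that needs exactly length bytes has to loop. The sketch below shows one way readFully()-style behavior can be layered on a positional read(). It is an illustration only, not Hadoop's implementation: PositionalSource is a hypothetical stand-in that copies the signature of the read() method above, and shortReads() is a toy in-memory source that returns at most three bytes per call to force the loop to run.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class PositionedReadSketch {
    /** Hypothetical stand-in mirroring the positional read() signature. */
    interface PositionalSource {
        int read(long position, byte[] buffer, int offset, int length) throws IOException;
    }

    /** Calls read() repeatedly until length bytes have arrived, or fails with EOFException. */
    static void readFully(PositionalSource src, long position, byte[] buffer,
                          int offset, int length) throws IOException {
        int done = 0;
        while (done < length) {
            int n = src.read(position + done, buffer, offset + done, length - done);
            if (n < 0) {
                throw new EOFException("end of file reached before " + length + " bytes were read");
            }
            done += n;
        }
    }

    /** Toy in-memory source that deliberately returns at most 3 bytes per call. */
    static PositionalSource shortReads(byte[] data) {
        return (position, buffer, offset, length) -> {
            if (position >= data.length) {
                return -1; // nothing left at or after this position
            }
            int n = (int) Math.min(3, Math.min(length, data.length - position));
            System.arraycopy(data, (int) position, buffer, offset, n);
            return n;
        };
    }

    /** Reads length bytes of text starting at position and decodes them (demo helper). */
    static String slice(String text, long position, int length) {
        byte[] out = new byte[length];
        try {
            readFully(shortReads(text.getBytes(StandardCharsets.UTF_8)), position, out, 0, length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Seven bytes are requested, so the three-byte source must be called three times.
        System.out.println(slice("The Quangle Wangle sat,", 4, 7));
    }
}
```

Note that the loop passes an explicit position on every call and never moves a shared cursor, which is what lets positional reads leave the stream's current offset untouched.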

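Finally, bear in mind that calling seek() is a relatively expensive operation and should be done sparingly. You should structure your application access patterns to rely on streaming data (by using MapReduce, for example) rather than performing a large number of seeks.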

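Writing Data

The FileSystem class has a number of methods for creating a file. The simplest is the method that takes a Path object for the file to be created and returns an output stream to write to: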

public FSDataOutputStream create(Path f) throws IOException

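There are overloaded versions of this method that allow you to specify whether to forcibly overwrite existing files, the replication factor of the file, the buffer size to use when writing the file, the block size for the file, and file permissions.

There's also an overloaded method for passing a callback interface, Progressable, so your application can be notified of the progress of the data being written to the datanodes: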

package org.apache.hadoop.util;
public interface Progressable {
    public void progress();
}

As an alternative to creating a new file, you can append to an existing file using the append() method (there are also some other overloaded versions):

public FSDataOutputStream append(Path f) throws IOException

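The append operation allows a single writer to add data at the end of an existing file. With this API, applications that produce unbounded files, such as logfiles, can keep writing to a file that already exists.

The append operation is optional and not implemented by all Hadoop filesystems. For example, HDFS supports append, but S3 filesystems don't.

Example 3-4 shows how to copy a local file to a Hadoop filesystem. We illustrate progress by printing a period every time Hadoop calls the progress() method, which is after each 64 KB of data is written to the datanodes. (Note that this particular behavior is not specified by the API, so it is subject to change in later versions of Hadoop. The API merely allows you to infer that "something is happening.")

Example 3-4: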

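import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.util.Progressable;

public class FileCopyWithProgress {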
	public static void main(String[] args) throws Exception {
		String localSrc = args[0];
		String dst = args[1];
		InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(URI.create(dst), conf);
		OutputStream out = fs.create(new Path(dst), new Progressable() {
			public void progress() {
				System.out.print(".");
			}
		});
		IOUtils.copyBytes(in, out, 4096, true);
	}
}

Typical usage:

% hadoop FileCopyWithProgress input/docs/1400-8.txt hdfs://localhost/user/tom/1400-8.txt
.................

Currently, none of the other Hadoop filesystems call progress() during writes. Progress is important in MapReduce applications, as you will see in later chapters.

FSDataOutputStream

The create() method on FileSystem returns an FSDataOutputStream, which, like FSDataInputStream, has a method for querying the current position in the file:

package org.apache.hadoop.fs;
public class FSDataOutputStream extends DataOutputStream implements Syncable {
	public long getPos() throws IOException {
		// implementation elided
	}
	// implementation elided
}

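However, unlike FSDataInputStream, FSDataOutputStream does not permit seeking. This is because HDFS allows only sequential writes to an open file or appends to an already written file. In other words, there is no support for writing to anywhere other than the end of the file, so there is no value in being able to seek while writing.

FileSystem provides a method to create a directory: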

public boolean mkdirs(Path f) throws IOException

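This method creates all of the necessary parent directories if they don't already exist, just like the mkdirs() method of java.io.File. It returns true if the directory (and all parent directories) was successfully created.

Often, you don't need to explicitly create a directory, because writing a file by calling create() automatically creates any parent directories.

Querying the Filesystem

File metadata: FileStatus

An important feature of any filesystem is the ability to navigate its directory structure and retrieve information about the files and directories that it stores. The FileStatus class encapsulates filesystem metadata for files and directories, including file length, block size, replication, modification time, and permission information.

The getFileStatus() method on FileSystem provides a way of getting a FileStatus object for a single file or directory. Example 3-5 shows an example of its use: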

import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.After;
import org.junit.Before;
import org.junit.Test;

public class ShowFileStatusTest {
	// private MiniDFSCluster cluster; // in-process cluster for testing (commented out, Hadoop 2.2.0)
	private FileSystem fs;

	@Before
	public void setUp() throws IOException {
		String dst = "hdfs://localhost/test2/";
		Configuration conf = new Configuration();
		if (System.getProperty("test.build.data") == null) {
			System.setProperty("test.build.data", "/tmp");
		}
		//cluster = new MiniDFSCluster.Builder(conf).build();
		//fs = cluster.getFileSystem();
		fs = FileSystem.get(URI.create(dst), conf); // used in place of MiniDFSCluster
		OutputStream out = fs.create(new Path("/dir/file"));
		out.write("content".getBytes("UTF-8"));
		out.close();
	}

	@After
	public void tearDown() throws IOException {
		if (fs != null) {
			fs.close();
		}
		//if (cluster != null) {
		//	cluster.shutdown();
		//}
	}

	@Test(expected = FileNotFoundException.class)
	public void throwsFileNotFoundForNonExistentFile() throws IOException {
		fs.getFileStatus(new Path("no-such-file"));
	}

	@Test
	public void fileStatusForFile() throws IOException {
		Path file = new Path("/dir/file");
		FileStatus stat = fs.getFileStatus(file);
		//assertThat(stat.getPath().toUri().getPath(), is("/dir/file"));
		assert stat.getPath().toUri().getPath().equals("/dir/file");
		//assertThat(stat.isDirectory(), is(false));
		assert !stat.isDirectory();
		//assertThat(stat.getLen(), is(7L));
		assert stat.getLen() == 7L;
		//assertThat(stat.getModificationTime(),is(lessThanOrEqualTo(System.currentTimeMillis())));
		assert stat.getModificationTime() <= System.currentTimeMillis();
		//assertThat(stat.getReplication(), is((short) 1));
		assert stat.getReplication() == (short)1;
		//assertThat(stat.getBlockSize(), is(128 * 1024 * 1024L));
		assert stat.getBlockSize() == 128 * 1024 * 1024L;
		//assertThat(stat.getOwner(), is(System.getProperty("user.name")));
		assert stat.getOwner().equals(System.getProperty("user.name"));
		//assertThat(stat.getGroup(), is("supergroup"));
		assert stat.getGroup().equals("supergroup");
		//assertThat(stat.getPermission().toString(), is("rw-r--r--"));
		assert stat.getPermission().toString().equals("rw-r--r--");
	}


	@Test
	public void fileStatusForDirectory() throws IOException {
		Path dir = new Path("/dir");
		// Adapted in the same way as above: plain assert statements replace
		// assertThat() (run the JVM with -ea so that they are checked).
		FileStatus stat = fs.getFileStatus(dir);
		assert stat.getPath().toUri().getPath().equals("/dir");
		assert stat.isDirectory();
		assert stat.getLen() == 0L;
		assert stat.getModificationTime() <= System.currentTimeMillis();
		assert stat.getReplication() == (short) 0;
		assert stat.getBlockSize() == 0L;
		assert stat.getOwner().equals(System.getProperty("user.name"));
		assert stat.getGroup().equals("supergroup");
		assert stat.getPermission().toString().equals("rwxr-xr-x");
	}

}

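If no file or directory exists, a FileNotFoundException is thrown. However, if you are interested only in the existence of a file or directory, the exists() method on FileSystem is more convenient: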

public boolean exists(Path f) throws IOException

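Listing files

Finding information on a single file or directory is useful, but you also often need to be able to list the contents of a directory. That's what FileSystem's listStatus() methods are for: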

public FileStatus[] listStatus(Path f) throws IOException
public FileStatus[] listStatus(Path f, PathFilter filter) throws IOException
public FileStatus[] listStatus(Path[] files) throws IOException
public FileStatus[] listStatus(Path[] files, PathFilter filter) throws IOException

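When the argument is a file, the simplest variant returns an array of FileStatus objects of length 1. When the argument is a directory, it returns zero or more FileStatus objects representing the files and directories contained in the directory.

Overloaded variants allow a PathFilter to be supplied to restrict the files and directories to match. You will see an example of this in "PathFilter" on page 67. Finally, if you specify an array of paths, the result is the same as calling the single-path listStatus() method for each path in turn and accumulating the FileStatus arrays in a single array. This can be useful for building up lists of input files to process from distinct parts of the filesystem tree. Example 3-6 is a simple demonstration of this idea. Note the use of stat2Paths() in Hadoop's FileUtil for turning an array of FileStatus objects into an array of Path objects.

Example 3-6: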

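import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class ListStatus {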
	public static void main(String[] args) throws Exception {
		String uri = args[0];
		Configuration conf = new Configuration();
		FileSystem fs = FileSystem.get(URI.create(uri), conf);
		Path[] paths = new Path[args.length];
		for (int i = 0; i < paths.length; i++) {
			paths[i] = new Path(args[i]);
		}
		FileStatus[] status = fs.listStatus(paths);
		Path[] listedPaths = FileUtil.stat2Paths(status);
		for (Path p : listedPaths) {
			System.out.println(p);
		}
	}
}

We can use this program to find the union of directory listings for a collection of paths:

% hadoop ListStatus hdfs://localhost/ hdfs://localhost/user/tom
hdfs://localhost/user
hdfs://localhost/user/tom/books
hdfs://localhost/user/tom/quangle.txt

File patterns

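It is a common requirement to process sets of files in a single operation. For example, a MapReduce job for log processing might analyze a month's worth of files contained in a number of directories. Rather than having to enumerate each file and directory to specify the input, it is convenient to use wildcard characters to match multiple files with a single expression, an operation that is known as globbing. Hadoop provides two FileSystem methods for processing globs: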

public FileStatus[] globStatus(Path pathPattern) throws IOException
public FileStatus[] globStatus(Path pathPattern, PathFilter filter) throws IOException

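The globStatus() methods return an array of FileStatus objects whose paths match the supplied pattern, sorted by path. An optional PathFilter can be specified to restrict the matches further.

Hadoop supports the same set of glob characters as the Unix bash shell (see Table 3-2):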

Table 3-2. Glob characters and their meanings
Glob            Name                           Matches
*               asterisk                       Matches zero or more characters
?               question mark                  Matches a single character
[ab]            character class                Matches a single character in the set {a, b}
[^ab]           negated character class        Matches a single character that is not in the set {a, b}
[a-b]           character range                Matches a single character in the (closed) range [a, b], where a is lexicographically 
                                               less than or equal to b
[^a-b]          negated character range        Matches a single character that is not in the (closed) range [a, b], where a is
                                               lexicographically less than or equal to b
{a,b}           alternation                    Matches either expression a or b
\c              escaped character              Matches character c when it is a metacharacter

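Imagine that logfiles are stored in a directory structure organized hierarchically by date. So, logfiles for the last day of 2007 would go in a directory named /2007/12/31, for example. Suppose that the full file listing is: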

/
├── 2007/
│   └── 12/
│       ├── 30/
│       └── 31/
└── 2008/
    └── 01/
        ├── 01/
        └── 02/

Here are some file globs and their expansions:

Glob                       Expansion
/*                         /2007 /2008
/*/*                       /2007/12 /2008/01
/*/12/*                    /2007/12/30 /2007/12/31
/200?                      /2007 /2008
/200[78]                   /2007 /2008
/200[7-8]                  /2007 /2008
/200[^01234569]            /2007 /2008
/*/*/{31,01}               /2007/12/31 /2008/01/01
/*/*/3{0,1}                /2007/12/30 /2007/12/31
/*/{12/31,01/01}           /2007/12/31 /2008/01/01

PathFilter

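Glob patterns are not always powerful enough to describe a set of files you want to access. For example, it is not generally possible to exclude a particular file using a glob pattern. The listStatus() and globStatus() methods of FileSystem take an optional PathFilter, which allows programmatic control over matching: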

package org.apache.hadoop.fs;
public interface PathFilter {
    boolean accept(Path path);
}

Example 3-7 shows a PathFilter for excluding paths that match a regular expression:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExcludePathFilter implements PathFilter {
	private final String regex;

	public RegexExcludePathFilter(String regex) {
		this.regex = regex;
	}

	public boolean accept(Path path) {
		return !path.toString().matches(regex);
	}
}

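The filter passes only those files that don't match the regular expression. After the glob picks out an initial set of files to include, the filter is used to refine the results. For example:

fs.globStatus(new Path("/2007/*/*"), new RegexExcludePathFilter("^.*/2007/12/31$"))

will expand to /2007/12/30.

Filters can act only on a file's name, as represented by a Path. They can't use a file's properties, such as creation time, as their basis. Nevertheless, they can perform matching that neither glob patterns nor regular expressions can achieve. For example, if you store files in a directory structure that is laid out by date (as in the previous section), you can write a PathFilter to pick out files that fall in a given date range.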
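A date-range filter of that kind might look roughly like the sketch below. To keep it runnable without Hadoop on the classpath, accept() takes the path as a plain String; this is an assumption made for the illustration, and a real filter would implement PathFilter and apply the same logic to path.toString(). The class name DateRangeFilterSketch and the /yyyy/MM/dd layout check are likewise illustrative.

```java
import java.time.DateTimeException;
import java.time.LocalDate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateRangeFilterSketch {
    // Matches paths ending in /yyyy/MM/dd, such as /2007/12/31.
    private static final Pattern DATE_DIR =
            Pattern.compile("^.*/(\\d{4})/(\\d{2})/(\\d{2})$");

    private final LocalDate start;
    private final LocalDate end;

    /** Both bounds are inclusive. */
    DateRangeFilterSketch(LocalDate start, LocalDate end) {
        this.start = start;
        this.end = end;
    }

    /** Accepts a path only if it ends in a valid date within [start, end]. */
    boolean accept(String path) {
        Matcher m = DATE_DIR.matcher(path);
        if (!m.matches()) {
            return false; // not a day-level directory
        }
        try {
            LocalDate date = LocalDate.of(
                    Integer.parseInt(m.group(1)),
                    Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)));
            return !date.isBefore(start) && !date.isAfter(end);
        } catch (DateTimeException e) {
            return false; // digits that do not form a real date, e.g. /2007/13/40
        }
    }

    public static void main(String[] args) {
        DateRangeFilterSketch filter = new DateRangeFilterSketch(
                LocalDate.of(2007, 12, 31), LocalDate.of(2008, 1, 1));
        String[] paths = {"/2007/12/30", "/2007/12/31", "/2008/01/01", "/2008/01/02"};
        for (String p : paths) {
            System.out.println(p + " accepted: " + filter.accept(p));
        }
    }
}
```

A range that spans a year boundary, as in main(), is awkward to express as a single glob, which is the kind of case the paragraph above has in mind.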

Deleting Data

Use the delete() method on FileSystem to remove files or directories:

public boolean delete(Path f, boolean recursive) throws IOException

If f is a file or an empty directory, the value of recursive is ignored. A nonempty directory is deleted, along with its contents, only if recursive is true (otherwise, an IOException is thrown).
