Hadoop I/O: SequenceFile
1. File-based data structures
HDFS and the MapReduce framework were designed primarily for large files; handling many small files is not only inefficient but also wastes memory, because each small file occupies its own block and the metadata for every block must be stored in the namenode. The usual remedy is to pack small files into a container file. Hadoop provides two container types: SequenceFile and MapFile.
2. A brief introduction to SequenceFile
(1) SequenceFile is a flat file designed by Hadoop for storing binary key-value pairs.
(2) In a SequenceFile, each key-value pair is treated as one record (Record).
(3) SequenceFile suggests a solution to the small-files problem in HDFS: merge the small files into one large file, for example by using each small file's name as the key and its contents as the value, then writing these key-value pairs into a SequenceFile.
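The packing scheme in (3) can be sketched as below. The class name, the argument layout, and the use of the (deprecated but still working) createWriter(FileSystem, ...) overload are illustrative assumptions, not a fixed recipe:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch: pack every file in a local directory into one SequenceFile,
// using each file's name as the key and its raw bytes as the value.
public class PackSmallFiles {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path out = new Path(args[1]);                    // destination SequenceFile
        FileSystem fs = FileSystem.get(out.toUri(), conf);
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, out,
                    Text.class, BytesWritable.class);
            for (File f : new File(args[0]).listFiles()) {   // local input directory
                byte[] bytes = Files.readAllBytes(f.toPath());
                writer.append(new Text(f.getName()), new BytesWritable(bytes));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```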
(4) SequenceFile supports three compression types (SequenceFile.CompressionType):
a. NONE: records are not compressed;
b. RECORD: only the value of each record is compressed;
c. BLOCK: all records in a block are compressed together.
For these three compression types, Hadoop provides three corresponding Writer implementations:
a. SequenceFile.Writer: writes without compression;
b. SequenceFile.RecordCompressWriter: compresses only the value of each key-value pair;
c. SequenceFile.BlockCompressWriter: compresses a batch of key-value pairs into one block.
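The mapping from compression type to writer implementation can be made explicit by passing a SequenceFile.CompressionType to a createWriter() overload. A minimal sketch; the output path and codec choice are assumptions:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class CompressionTypeDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path("/test/block-compressed.seq");  // hypothetical path
        FileSystem fs = FileSystem.get(path.toUri(), conf);
        SequenceFile.Writer writer = null;
        try {
            // CompressionType.BLOCK selects a BlockCompressWriter internally;
            // RECORD or NONE would select the other two writer types.
            writer = SequenceFile.createWriter(fs, conf, path,
                    IntWritable.class, Text.class,
                    CompressionType.BLOCK, new DefaultCodec());
            writer.append(new IntWritable(1), new Text("hello"));
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
```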
(5) The structure of a SequenceFile
A SequenceFile consists of a header (Header) followed by one or more records:
1) Header
The first 3 bytes of a SequenceFile are SEQ (the sequence-file magic code), followed by one byte giving the file format's version number. The header also contains other fields, such as the names of the key and value classes, compression details, and the sync marker. The sync marker lets a reader identify record boundaries starting from any position in the file: each file has a randomly generated sync marker whose value is stored in the header, and sync markers appear between records in the body of the file. Since the storage overhead of sync markers is required to stay below 1%, a sync marker is not added after every record.
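The sync markers can be exercised directly through the reader API: SequenceFile.Reader.sync(position) advances the reader to the first sync point after the given position, which is how a reader (or a MapReduce input split) can safely start reading from an arbitrary byte offset. A hedged sketch, assuming an existing SequenceFile is passed as the argument:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SyncDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);  // an existing SequenceFile
        SequenceFile.Reader reader =
                new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
        try {
            // Jump into the middle of the file, then align to the next sync
            // point; reading resumes on a record boundary from there.
            reader.sync(100);
            System.out.println("aligned position: " + reader.getPosition());
        } finally {
            reader.close();
        }
    }
}
```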
2) Records
The internal structure of a record depends on whether compression is enabled, and if so, whether it is record compression or block compression.
A. No compression
Each record consists of, in order: the record length (in bytes), the key length, the key, and the value.
B. Record compression
Essentially the same as the uncompressed format, except that the value is compressed with the codec declared in the header.
C. Block compression
Block compression compresses multiple records at once: records are appended to a block until it reaches the number of bytes set by the io.seqfile.compress.blocksize property, 1 MB by default. A sync marker is inserted before the start of every new block.
Each block consists of the following fields, in order (as documented in the Apache Hadoop 2.9.1 API): the number of records, then a compressed key-lengths block, a compressed keys block, a compressed value-lengths block, and a compressed values block.
3. Writing a SequenceFile
Create a SequenceFile with the static createWriter() method, which returns a SequenceFile.Writer. The method has many overloads, all of which require the output destination (an FSDataOutputStream, or a FileSystem object plus a Path object), a Configuration object, and the key and value classes, for example:
org.apache.hadoop.io.SequenceFile.Writer createWriter(FileSystem fs, Configuration conf, Path name, Class keyClass, Class valClass)
Once you have a SequenceFile.Writer instance, call its append() method to write key-value pairs.
Hands-on:
(1) First, list the /test/ directory in HDFS:
root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# hadoop fs -ls /test/
18/10/12 11:50:07 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 2 items
-rw-r--r-- 1 root supergroup 10240 2018-10-11 06:27 /test/1
-rw-r--r-- 1 root supergroup 135 2018-10-11 06:26 /test/1.gz
(2) Write MySequenceWrite.java, implementing the MySequenceWrite class:
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MySequenceWrite {
    private static final String[] Data = {
        "lala, I am an apple",
        "haha, I am an banana",
        "guagua, This is a shoot",
        "mama, These are flowers",
        "gaga, Thoses are tickets"
    };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        IntWritable key = new IntWritable();
        Text value = new Text();
        Path path = new Path(uri);
        SequenceFile.Writer writer = null;
        try {
            // This overload is deprecated in Hadoop 2.9 but still works;
            // newer code would pass SequenceFile.Writer.Option values instead.
            writer = SequenceFile.createWriter(fs, conf, path,
                    key.getClass(), value.getClass());
            for (int i = 1; i < 21; i++) {
                key.set(i);
                value.set(Data[i % Data.length]);
                // getLength() reports the current position in the output file
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
(3) Compile MySequenceWrite.java:
root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# javac MySequenceWrite.java
Note: MySequenceWrite.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
(4) Run the class with hadoop:
root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# hadoop MySequenceWrite /test/2.txt
18/10/12 11:56:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/12 11:56:41 INFO compress.CodecPool: Got brand-new compressor [.deflate]
[128] 1 haha, I am an banana
[167] 2 guagua, This is a shoot
[208] 3 mama, These are flowers
[252] 4 gaga, Thoses are tickets
[297] 5 lala, I am an apple
[337] 6 haha, I am an banana
[376] 7 guagua, This is a shoot
[417] 8 mama, These are flowers
[461] 9 gaga, Thoses are tickets
[506] 10 lala, I am an apple
[546] 11 haha, I am an banana
[585] 12 guagua, This is a shoot
[626] 13 mama, These are flowers
[670] 14 gaga, Thoses are tickets
[715] 15 lala, I am an apple
[755] 16 haha, I am an banana
[794] 17 guagua, This is a shoot
[835] 18 mama, These are flowers
[879] 19 gaga, Thoses are tickets
[924] 20 lala, I am an apple
(5) List the /test/ directory in HDFS again; 2.txt has been added:
root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# hadoop fs -ls /test/
18/10/12 11:58:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r-- 1 root supergroup 10240 2018-10-11 06:27 /test/1
-rw-r--r-- 1 root supergroup 135 2018-10-11 06:26 /test/1.gz
-rw-r--r-- 1 root supergroup 964 2018-10-12 11:56 /test/2.txt
4. Reading a SequenceFile
To read a sequence file from beginning to end, create a SequenceFile.Reader instance and repeatedly call a next() method to iterate over the records.
(1) If both key and value are Writable types, use the next() method that takes a key and a value as arguments; it reads the next key-value pair from the stream into those variables, and returns false once the end of the file is reached:
public boolean next(Writable key, Writable val)
(2) If a non-Writable serialization framework is in use, use these two methods instead:
public Object next(Object key) throws IOException
public Object getCurrentValue(Object val) throws IOException
If next() returns a non-null object, a key-value pair was read from the stream and the value can be retrieved with getCurrentValue(); if next() returns null, the end of the file has been reached.
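With a non-Writable serialization framework (configured via the io.serializations property), the read loop described above would look roughly like this sketch; the structure follows the two methods just listed, and the rest is an illustrative assumption:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class GenericRead {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        SequenceFile.Reader reader = new SequenceFile.Reader(
                conf, SequenceFile.Reader.file(new Path(args[0])));
        try {
            Object key = null;
            Object value = null;
            // next(Object) returns null at end-of-file; getCurrentValue()
            // fetches the value belonging to the key just read.
            while ((key = reader.next(key)) != null) {
                value = reader.getCurrentValue(value);
                System.out.println(key + "\t" + value);
            }
        } finally {
            reader.close();
        }
    }
}
```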
Hands-on:
(1) Write MySequenceRead.java, implementing the MySequenceRead class:
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class MySequenceRead {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            // The key and value classes are recorded in the file header,
            // so instances can be created reflectively.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            long position = reader.getPosition();
            while (reader.next(key, value)) {
                System.out.printf("[%s]\t%s\t%s\n", position, key, value);
                position = reader.getPosition();  // start of the next record
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
(2) Compile MySequenceRead.java:
root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# javac MySequenceRead.java
Note: MySequenceRead.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
(3) Run the class with hadoop:
root@6b08ff31fc7f:/hadoop/hadoop-2.9.1/test/hadoop-io# hadoop MySequenceRead /test/2.txt
18/10/12 12:12:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/12 12:12:05 INFO compress.CodecPool: Got brand-new decompressor [.deflate]
[128] 1 haha, I am an banana
[167] 2 guagua, This is a shoot
[208] 3 mama, These are flowers
[252] 4 gaga, Thoses are tickets
[297] 5 lala, I am an apple
[337] 6 haha, I am an banana
[376] 7 guagua, This is a shoot
[417] 8 mama, These are flowers
[461] 9 gaga, Thoses are tickets
[506] 10 lala, I am an apple
[546] 11 haha, I am an banana
[585] 12 guagua, This is a shoot
[626] 13 mama, These are flowers
[670] 14 gaga, Thoses are tickets
[715] 15 lala, I am an apple
[755] 16 haha, I am an banana
[794] 17 guagua, This is a shoot
[835] 18 mama, These are flowers
[879] 19 gaga, Thoses are tickets
[924] 20 lala, I am an apple
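As an aside, the file written above can also be inspected without any custom code: the hadoop fs -text command recognizes the SEQ header and prints each record as text (plain -cat would dump raw binary instead):

```shell
# Print the key-value records of a SequenceFile as text.
hadoop fs -text /test/2.txt
```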