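SequenceFile is a file-based data structure intended for storing large files. Its defining feature is that it stores data as binary key-value pairs.
I. Writing a SequenceFile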
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
// vv SequenceFileWriteDemo
public class SequenceFileWriteDemo {
    private static final String[] DATA = {
        "One, two, buckle my shoe",
        "Three, four, shut the door",
        "Five, six, pick up sticks",
        "Seven, eight, lay them straight",
        "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        // The same key and value objects are reused for every record.
        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            writer = SequenceFile.createWriter(fs, conf, path,
                key.getClass(), value.getClass());
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                // getLength() is the current file position, i.e. where this record will start.
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}
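Writing a SequenceFile is fairly simple: create a writer instance with SequenceFile.createWriter, then call its append method to write the data as key-value pairs.
Below is the first part of the output.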
$ hadoop SequenceFileWriteDemo numbers.seq
13/11/06 21:48:39 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/06 21:48:39 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/11/06 21:48:39 INFO compress.CodecPool: Got brand-new compressor
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
[590] 90 One, two, buckle my shoe
[635] 89 Three, four, shut the door
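To make the offsets in the first column easier to picture, here is a minimal plain-Java sketch of the same pattern: ask the writer for its current length, then append one record. It does not use Hadoop at all. The class OffsetTrackingWriterSketch and its record layout (4-byte key, 4-byte value length, UTF-8 value bytes) are assumptions made up for illustration, not SequenceFile's real on-disk format, so the absolute numbers differ from the run above.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified record writer. NOT Hadoop's SequenceFile format.
class OffsetTrackingWriterSketch {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private final DataOutputStream out = new DataOutputStream(bytes);

    /** Bytes written so far, which is also the offset of the next record. */
    long getLength() {
        return out.size();
    }

    /** Appends one record: 4-byte key, 4-byte value length, UTF-8 value bytes. */
    void append(int key, String value) {
        byte[] encoded = value.getBytes(StandardCharsets.UTF_8);
        try {
            out.writeInt(key);
            out.writeInt(encoded.length);
            out.write(encoded);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes four records and returns the offset at which each one starts. */
    static List<Long> demoOffsets() {
        String[] data = {"One, two, buckle my shoe", "Three, four, shut the door"};
        OffsetTrackingWriterSketch writer = new OffsetTrackingWriterSketch();
        List<Long> offsets = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            offsets.add(writer.getLength()); // position before the append
            writer.append(100 - i, data[i % data.length]);
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(demoOffsets());
    }
}
```

As in the real output, a record with a longer value pushes the next offset further along: the 26-character line takes two bytes more than the 24-character one.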
II. Reading a SequenceFile
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
// vv SequenceFileReadDemo
public class SequenceFileReadDemo {
    public static void main(String[] args) throws IOException {
        String uri = args[0];
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            /*
             * getKeyClass() and getValueClass() return the types that
             * SequenceFile.Reader found in the file, and ReflectionUtils is
             * used to create instances of them. This way we never need to
             * know the concrete data types and can read everything as
             * Writable. Note that key and value are only instantiated here;
             * they do not contain any data yet.
             */
            Writable key = (Writable)
                ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable)
                ReflectionUtils.newInstance(reader.getValueClass(), conf);
            long position = reader.getPosition();
            /*
             * next() is what actually fills key and value. syncSeen() returns
             * true if and only if the preceding call to next() passed a sync
             * marker. next() reads one record at a time, but a sync marker is
             * not written after every record, only at intervals, so several
             * records go by before one shows up.
             */
            while (reader.next(key, value)) {
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                position = reader.getPosition(); // beginning of next record
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
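The trick of reading a file without knowing its types in advance can be sketched in plain Java, without Hadoop. In this sketch the interface Field and the classes IntField and TextField are hypothetical stand-ins for Writable, IntWritable and Text, and the two-string header is an invented layout, not the real SequenceFile header. The point is only that the class names travel with the file and the reader instantiates them by reflection.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types are part of Hadoop.
class ReflectiveReadSketch {

    /** Stand-in for Writable: an object that can serialize and refill itself. */
    interface Field {
        void write(DataOutputStream out) throws IOException;
        void readFields(DataInputStream in) throws IOException;
    }

    static class IntField implements Field {
        int value;
        public void write(DataOutputStream out) throws IOException { out.writeInt(value); }
        public void readFields(DataInputStream in) throws IOException { value = in.readInt(); }
        @Override public String toString() { return Integer.toString(value); }
    }

    static class TextField implements Field {
        String value = "";
        public void write(DataOutputStream out) throws IOException { out.writeUTF(value); }
        public void readFields(DataInputStream in) throws IOException { value = in.readUTF(); }
        @Override public String toString() { return value; }
    }

    /** Writes a header naming the key and value classes, then the records. */
    static byte[] writeFile(int[] keys, String[] values) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(IntField.class.getName());
        out.writeUTF(TextField.class.getName());
        IntField key = new IntField();
        TextField value = new TextField();
        for (int i = 0; i < keys.length; i++) {
            key.value = keys[i];
            value.value = values[i];
            key.write(out);
            value.write(out);
        }
        return bytes.toByteArray();
    }

    /** Reads every record without naming the concrete key or value type. */
    static List<String> readFile(byte[] file) throws IOException, ReflectiveOperationException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        // Freshly created instances are empty; only readFields() fills them.
        Field key = (Field) Class.forName(in.readUTF()).getDeclaredConstructor().newInstance();
        Field value = (Field) Class.forName(in.readUTF()).getDeclaredConstructor().newInstance();
        List<String> lines = new ArrayList<>();
        while (in.available() > 0) {
            key.readFields(in);   // the same two objects are reused for every record
            value.readFields(in);
            lines.add(key + "\t" + value);
        }
        return lines;
    }

    /** Writes two records and reads them back, wrapping checked exceptions. */
    static List<String> roundTrip() {
        try {
            byte[] file = writeFile(new int[] {100, 99},
                new String[] {"One, two, buckle my shoe", "Three, four, shut the door"});
            return readFile(file);
        } catch (IOException | ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        roundTrip().forEach(System.out::println);
    }
}
```

readFile never mentions IntField or TextField, which is the same property the demo above gets from getKeyClass(), getValueClass() and ReflectionUtils.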
Output of running SequenceFileReadDemo:
$ hadoop SequenceFileReadDemo numbers.seq
13/11/06 21:50:44 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/11/06 21:50:44 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
13/11/06 21:50:44 INFO compress.CodecPool: Got brand-new decompressor
[128] 100 One, two, buckle my shoe
[173] 99 Three, four, shut the door
[220] 98 Five, six, pick up sticks
[264] 97 Seven, eight, lay them straight
[314] 96 Nine, ten, a big fat hen
[359] 95 One, two, buckle my shoe
[404] 94 Three, four, shut the door
[451] 93 Five, six, pick up sticks
[495] 92 Seven, eight, lay them straight
[545] 91 Nine, ten, a big fat hen
[590] 90 One, two, buckle my shoe
[635] 89 Three, four, shut the door
[682] 88 Five, six, pick up sticks
[726] 87 Seven, eight, lay them straight
[776] 86 Nine, ten, a big fat hen
[821] 85 One, two, buckle my shoe
[866] 84 Three, four, shut the door
[913] 83 Five, six, pick up sticks
[957] 82 Seven, eight, lay them straight
[1007] 81 Nine, ten, a big fat hen
[1052] 80 One, two, buckle my shoe
[1097] 79 Three, four, shut the door
[1144] 78 Five, six, pick up sticks
[1188] 77 Seven, eight, lay them straight
[1238] 76 Nine, ten, a big fat hen
[1283] 75 One, two, buckle my shoe
[1328] 74 Three, four, shut the door
[1375] 73 Five, six, pick up sticks
[1419] 72 Seven, eight, lay them straight
[1469] 71 Nine, ten, a big fat hen
[1514] 70 One, two, buckle my shoe
[1559] 69 Three, four, shut the door
[1606] 68 Five, six, pick up sticks
[1650] 67 Seven, eight, lay them straight
[1700] 66 Nine, ten, a big fat hen
[1745] 65 One, two, buckle my shoe
[1790] 64 Three, four, shut the door
[1837] 63 Five, six, pick up sticks
[1881] 62 Seven, eight, lay them straight
[1931] 61 Nine, ten, a big fat hen
[1976] 60 One, two, buckle my shoe
[2021*] 59 Three, four, shut the door
[2088] 58 Five, six, pick up sticks
[2132] 57 Seven, eight, lay them straight
[2182] 56 Nine, ten, a big fat hen
[2227] 55 One, two, buckle my shoe
[2272] 54 Three, four, shut the door
[2319] 53 Five, six, pick up sticks
[2363] 52 Seven, eight, lay them straight
[2413] 51 Nine, ten, a big fat hen
[2458] 50 One, two, buckle my shoe
[2503] 49 Three, four, shut the door
[2550] 48 Five, six, pick up sticks
[2594] 47 Seven, eight, lay them straight
[2644] 46 Nine, ten, a big fat hen
[2689] 45 One, two, buckle my shoe
[2734] 44 Three, four, shut the door
[2781] 43 Five, six, pick up sticks
[2825] 42 Seven, eight, lay them straight
[2875] 41 Nine, ten, a big fat hen
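The starred position in the output is where the reader passed a sync marker. A rough plain-Java sketch of the idea follows. It is a simplified model, not Hadoop's implementation: the class SyncMarkerSketch, the fixed marker bytes, the 64-byte interval and the record layout are all assumptions chosen so that a small demo triggers a marker.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of periodic sync markers; all constants are illustrative.
class SyncMarkerSketch {
    static final int SYNC_ESCAPE = -1;   // a "length" of -1 announces a marker
    static final int SYNC_INTERVAL = 64; // assumed number of bytes between markers
    static final byte[] SYNC = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);

    /** Writes length-prefixed values, inserting a marker once enough bytes have passed. */
    static byte[] write(String[] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        int lastSync = 0;
        try {
            for (String value : values) {
                if (out.size() - lastSync >= SYNC_INTERVAL) {
                    out.writeInt(SYNC_ESCAPE);
                    out.write(SYNC);
                    lastSync = out.size();
                }
                byte[] encoded = value.getBytes(StandardCharsets.UTF_8);
                out.writeInt(encoded.length);
                out.write(encoded);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the records back, starring each position where a marker was passed. */
    static List<String> read(byte[] file) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(file));
        List<String> lines = new ArrayList<>();
        try {
            while (in.available() > 0) {
                int position = file.length - in.available();
                boolean syncSeen = false;
                int length = in.readInt();
                if (length == SYNC_ESCAPE) {
                    byte[] marker = new byte[SYNC.length];
                    in.readFully(marker);
                    if (!Arrays.equals(marker, SYNC)) {
                        throw new IOException("corrupt sync marker at " + position);
                    }
                    syncSeen = true;
                    length = in.readInt(); // the real record follows the marker
                }
                byte[] encoded = new byte[length];
                in.readFully(encoded);
                lines.add("[" + position + (syncSeen ? "*" : "") + "] "
                    + new String(encoded, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Round-trips the five rhyme lines used in the demos above. */
    static List<String> demo() {
        return read(write(new String[] {
            "One, two, buckle my shoe", "Three, four, shut the door",
            "Five, six, pick up sticks", "Seven, eight, lay them straight",
            "Nine, ten, a big fat hen"}));
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this model the gap between a starred record and the next one is larger than usual because it also covers the marker, which matches the pattern in the output above, where the jump from 2021 to 2088 is 20 bytes bigger than the same line's usual 47.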
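The book goes on to cover MapFile. A MapFile is essentially a sorted SequenceFile with an index added, so a MapFile consists of two files: a data file, which is itself a SequenceFile, and an index file. Writing one therefore works much like writing a SequenceFile, except that the keys have to be appended in sorted order. When reading a MapFile you can jump straight to the record you want by key (the book covers the details). Converting a SequenceFile into a MapFile is also simple: once the data is sorted, all that is left is to add the index file.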
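The "sorted data plus index" idea can be sketched in a few lines of plain Java. This is a toy model under stated assumptions, not MapFile's implementation: the class SparseIndexSketch, its in-memory lists standing in for the data file, and the index interval of 4 are all hypothetical. Only every fourth key goes into the index; a lookup finds the nearest indexed key at or below the target and scans forward from there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a sorted data file with a sparse index.
class SparseIndexSketch {
    private final List<Integer> keys = new ArrayList<>();   // stands in for the data file
    private final List<String> values = new ArrayList<>();
    private final TreeMap<Integer, Integer> index = new TreeMap<>(); // key -> record number
    private final int indexInterval;

    SparseIndexSketch(int indexInterval) {
        this.indexInterval = indexInterval;
    }

    /** Appends a record; keys must arrive in sorted order or the index is useless. */
    void append(int key, String value) {
        if (!keys.isEmpty() && key < keys.get(keys.size() - 1)) {
            throw new IllegalArgumentException("key out of order: " + key);
        }
        if (keys.size() % indexInterval == 0) {
            index.put(key, keys.size()); // only every indexInterval-th key is indexed
        }
        keys.add(key);
        values.add(value);
    }

    /** Looks up a key: jump via the index, then scan a few records forward. */
    String get(int key) {
        Map.Entry<Integer, Integer> entry = index.floorEntry(key);
        if (entry == null) {
            return null; // smaller than every key in the file
        }
        for (int i = entry.getValue(); i < keys.size() && keys.get(i) <= key; i++) {
            if (keys.get(i) == key) {
                return values.get(i);
            }
        }
        return null;
    }

    int indexSize() {
        return index.size();
    }

    /** Builds a ten-record example with an index entry every four records. */
    static SparseIndexSketch demo() {
        SparseIndexSketch file = new SparseIndexSketch(4);
        for (int key = 1; key <= 10; key++) {
            file.append(key, "v" + key);
        }
        return file;
    }

    public static void main(String[] args) {
        SparseIndexSketch file = demo();
        System.out.println(file.get(7));
        System.out.println(file.indexSize());
    }
}
```

Keeping the index sparse is the design choice worth noticing: the index stays small enough to hold in memory, at the cost of a short scan after each jump.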
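These are only the notes I made while reading; for the details you still need to go to the book. Writing down your own understanding as you read does pay off, though.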