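A SequenceFile is a flat file made up of binary-serialized key/value pairs (it is a binary format, not a text file). It can be used as the input/output format of a map/reduce job, and the temporary output that map tasks produce while processing files is also stored as SequenceFile data. In the typical case, a SequenceFile is generated in a FileSystem and then serves as the raw input that map tasks read.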
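SequenceFile can also be used to deal with small files: if the key is the small file's name and the value is the file's contents, a large batch of small files can be merged into a single large file.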
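The packing idea itself can be illustrated without Hadoop. The following is only a minimal sketch, not the SequenceFile format or API: it uses plain java.io to append (file name, file bytes) records to one in-memory container and read them back. The class name SmallFilePacker, the methods pack and unpack, and the length-prefixed record layout are all assumptions made for illustration; a real job would write the records through SequenceFile.Writer instead, for example with Text keys and BytesWritable values.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: key = file name, value = file contents.
// The record layout used here is NOT the on-disk SequenceFile format.
class SmallFilePacker {

    // Append each (name, contents) pair as one record: UTF name, length, raw bytes.
    static byte[] pack(Map<String, byte[]> files) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                out.writeUTF(file.getKey());
                out.writeInt(file.getValue().length);
                out.write(file.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Read records back, in order, until the container is exhausted.
    static Map<String, byte[]> unpack(byte[] container) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] contents = new byte[in.readInt()];
                in.readFully(contents);
                files.put(name, contents);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", "first small file".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "second small file".getBytes(StandardCharsets.UTF_8));

        // Many small inputs become one container; each one stays addressable by its key.
        byte[] container = pack(files);
        for (Map.Entry<String, byte[]> file : unpack(container).entrySet()) {
            System.out.println(file.getKey() + "\t"
                    + new String(file.getValue(), StandardCharsets.UTF_8));
        }
    }
}
```

The point of the sketch is only the shape of the data: one container, many records, each record carrying the original file name as its key so the contents can be told apart again on the read side.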
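
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws IOException {
        String uri = "file:///Repositories/workspace/Demo/hadoop1/target/seq.sq";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            // The key and value classes are recorded in the file header.
            writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
            key.set(1);
            value.set(uri);
            writer.append(key, value); // append a single record
        } finally {
            // Closes the writer; a null writer is ignored.
            IOUtils.closeStream(writer);
        }
    }
}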
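
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

public class SequenceFileReadDemo {
    public static void main(String[] args) throws IOException {
        String uri = "file:///Repositories/workspace/Demo/hadoop1/target/seq.sq";
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf);
            // Create key/value objects of the types recorded in the file header.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
            long position = reader.getPosition();
            while (reader.next(key, value)) {
                // Mark a record with "*" if a sync marker was seen while reading it.
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen, key, value);
                position = reader.getPosition(); // beginning of next record
            }
        } finally {
            // Closes the reader; a null reader is ignored.
            IOUtils.closeStream(reader);
        }
    }
}
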
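The compression type is represented by CompressionType, an enum nested inside the SequenceFile class. It defines three modes:
No compression: CompressionType.NONE
Record-level compression (only the values are compressed): CompressionType.RECORD
Block-level compression (keys and values are collected into blocks, and each block is compressed): CompressionType.BLOCK
The mode to use can be specified with the parameter io.seqfile.compression.type=[NONE|RECORD|BLOCK].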
The contents of a generated file can be viewed with the following command:
    hadoop fs -text numbers.seq
A sequence file consists of one header and multiple records.
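One quick way to see the header in practice is to look at the first four bytes of a file, which hold the ASCII magic SEQ and one version byte. The sketch below does this with plain java.io and no Hadoop classes. The class name SeqHeaderProbe, the method readVersion, and the convention of returning -1 for "not a sequence file" are assumptions made for illustration; the sample bytes in main are made up for the demo and are not a complete file.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: inspects only the first four bytes of a stream.
class SeqHeaderProbe {

    // Returns the version byte if the stream starts with 'S','E','Q'; otherwise -1.
    static int readVersion(InputStream in) {
        try {
            byte[] head = new byte[4];
            // readFully throws EOFException when fewer than 4 bytes are available.
            new DataInputStream(in).readFully(head);
            boolean magic = head[0] == 'S' && head[1] == 'E' && head[2] == 'Q';
            return magic ? (head[3] & 0xFF) : -1;
        } catch (IOException e) {
            return -1; // too short or unreadable: treat as "not a sequence file"
        }
    }

    public static void main(String[] args) {
        // Sample bytes for the demo only: the magic, a version byte, then filler.
        byte[] sample = {'S', 'E', 'Q', 6, 0, 0};
        int version = readVersion(new ByteArrayInputStream(sample));
        System.out.println(version >= 0
                ? "SequenceFile header found, version byte = " + version
                : "not a SequenceFile");
    }
}
```

To probe a real file on the local disk, the same method could be handed a java.io.FileInputStream instead of the in-memory sample.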
The first 3 bytes of the header are SEQ, followed by 1 byte that gives the version. It then includes