SequenceFile in One Article: What It Is and How to Use It

Shockang

Published 2021-05-29 00:26:02

Category: Big Data Technology System (大数据技术体系); Tags: big data, hdfs

Article link: https://blog.csdn.net/Shockang/article/details/117376761

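Preface

This article belongs to the column "1000 Questions to Master the Big Data Technology System" (《1000个问题搞定大数据技术体系》). The column is the author's original work; please credit the source when quoting it, and please point out any gaps or mistakes in the comments. Thank you!

For the column's table of contents and references, see "1000 Questions to Master the Big Data Technology System".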

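Answer

Overview

SequenceFile is Hadoop's built-in support for binary files. Such a file stores <Key, Value> pairs serialized directly into it.

HDFS is designed for storing large files. A large number of very small files puts heavy pressure on the NameNode, because every file has a metadata entry stored on the NameNode, so the more small files there are, the more metadata the NameNode has to hold.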

Hadoop is built for storing big data, so we can use SequenceFile to merge small files into larger ones and get more efficient storage and computation.
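To get a feel for the NameNode pressure described above, here is a rough back-of-the-envelope sketch. It assumes a commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object. That figure, the class name `SmallFileEstimate`, and the file counts and sizes are illustrative assumptions, not numbers from this article.

```java
class SmallFileEstimate {
    // Assumed rule of thumb (not from the article): ~150 bytes of heap per namespace object.
    static final long BYTES_PER_OBJECT = 150;

    /** Number of blocks needed to hold totalBytes, rounding up. */
    static long blocksFor(long totalBytes, long blockSize) {
        return (totalBytes + blockSize - 1) / blockSize;
    }

    /** Each file costs one file object plus one object per block. */
    static long namenodeBytes(long fileCount, long blocksPerFile) {
        return fileCount * (1 + blocksPerFile) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long files = 10_000_000L;             // hypothetical: 10 million small files
        long fileSize = 10L * 1024;           // hypothetical: 10 KB each
        long blockSize = 128L * 1024 * 1024;  // hypothetical: 128 MB HDFS block size

        // Stored separately: every small file occupies one file object and one block object.
        long separate = namenodeBytes(files, 1);
        // Merged into one container file: one file object plus its blocks.
        long merged = namenodeBytes(1, blocksFor(files * fileSize, blockSize));

        System.out.println("separate files: " + separate + " bytes of metadata");
        System.out.println("one merged file: " + merged + " bytes of metadata");
    }
}
```

The estimate ignores the framing that the container file itself adds, but the gap of several orders of magnitude is the point.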

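The key and value in a SequenceFile can be any Writable type, including custom Writable types.

For a given amount of data, say 100 GB, storing it in an uncompressed SequenceFile takes more than 100 GB, because SequenceFile adds some extra information to the stored data to make lookups easier.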
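Part of that overhead is the per-record framing. The sketch below mimics the uncompressed record layout described under Features (record length, key length, key, value, with 4-byte length fields) in plain Java. It is a simplified illustration, not Hadoop's actual writer: the class `RecordLayoutSketch` and its method are hypothetical, and the file header and sync markers are left out.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class RecordLayoutSketch {
    /** Frames one record as: record length (4 bytes), key length (4 bytes), key, value. */
    static byte[] frame(byte[] key, byte[] value) {
        ByteBuffer buf = ByteBuffer.allocate(8 + key.length + value.length);
        buf.putInt(key.length + value.length); // record length = key bytes + value bytes
        buf.putInt(key.length);                // key length
        buf.put(key);
        buf.put(value);
        return buf.array();
    }

    public static void main(String[] args) {
        byte[] key = "k".getBytes(StandardCharsets.UTF_8);
        byte[] value = "value".getBytes(StandardCharsets.UTF_8);
        byte[] record = frame(key, value);
        int payload = key.length + value.length;
        // The two length fields add 8 bytes on top of the payload.
        System.out.println("payload=" + payload + " bytes, framed=" + record.length + " bytes");
    }
}
```

With many tiny records, those 8 bytes per record become a noticeable share of the file.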

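Features

Compression support: compression can be configured as Record-based or Block-based.
No compression: if compression is not enabled (the default), each record consists of its record length (in bytes), the key length, the key, and the value. Each length field is 4 bytes. The internal structure of a SequenceFile is shown in the figure.
Record compression works per record and compresses only the value, not the key; Block compression compresses both keys and values.
Data-local tasks: because the file can be split, data locality should be very good when running MapReduce jobs, and as many map tasks as possible can be launched in parallel, which improves job efficiency.
Low difficulty: because it is an API provided by the Hadoop framework, the changes needed on the business-logic side are fairly simple.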
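The difference between the two compression modes can be illustrated without Hadoop. The sketch below uses `java.util.zip.Deflater` as a stand-in codec: the record style leaves keys raw and compresses each value on its own, while the block style compresses the keys of many records together and the values together. The class `CompressionModeSketch`, the sample data, and the choice of codec are assumptions made for illustration; real SequenceFiles also store lengths, headers, and sync markers that are ignored here.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

class CompressionModeSketch {
    /** Compresses a byte array with the default Deflater settings. */
    static byte[] deflate(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(chunk);
            out.write(chunk, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Record style: keys stay uncompressed, each value is compressed separately. */
    static int recordStyleSize(String[] keys, String[] values) {
        int total = 0;
        for (int i = 0; i < keys.length; i++) {
            total += keys[i].getBytes(StandardCharsets.UTF_8).length;
            total += deflate(values[i].getBytes(StandardCharsets.UTF_8)).length;
        }
        return total;
    }

    /** Block style: all keys are compressed together, and all values together. */
    static int blockStyleSize(String[] keys, String[] values) {
        byte[] allKeys = String.join("", keys).getBytes(StandardCharsets.UTF_8);
        byte[] allValues = String.join("", values).getBytes(StandardCharsets.UTF_8);
        return deflate(allKeys).length + deflate(allValues).length;
    }

    public static void main(String[] args) {
        String[] lines = {"One, two, buckle my shoe", "Three, four, shut the door"};
        String[] keys = new String[100];
        String[] values = new String[100];
        for (int i = 0; i < 100; i++) {
            keys[i] = "key-" + i;
            values[i] = lines[i % lines.length];
        }
        System.out.println("record style: " + recordStyleSize(keys, values) + " bytes");
        System.out.println("block style:  " + blockStyleSize(keys, values) + " bytes");
    }
}
```

Short values compress poorly one at a time, because the codec sees no repetition within a single value. Batching gives it repetition across records to exploit, which is why the block style usually comes out smaller on data like this.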

Programming examples

Reading

package com.shockang.study.bigdata.hdfs.sequencefile;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

import java.io.IOException;

public class SequenceFileReadDemo {

    public static void main(String[] args) throws IOException {
        // args[0] is the URI of the SequenceFile to read
        String uri = args[0];
        Configuration conf = new Configuration();
        Path path = new Path(uri);

        SequenceFile.Reader reader = null;
        try {
            // Option-based constructor; the (fs, path, conf) constructor is deprecated
            reader = new SequenceFile.Reader(conf, SequenceFile.Reader.file(path));
            // The file header records the key and value classes, so create instances reflectively
            Writable key = (Writable)
                    ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            Writable value = (Writable)
                    ReflectionUtils.newInstance(reader.getValueClass(), conf);
            // Remember where each record starts so it can be printed next to the record
            long position = reader.getPosition();
            while (reader.next(key, value)) {
                // Mark records that were preceded by a sync point with an asterisk
                String syncSeen = reader.syncSeen() ? "*" : "";
                System.out.printf("[%s%s]\t%s\t%s%n", position, syncSeen, key, value);
                position = reader.getPosition();
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
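The read demo marks records that follow a sync point with an asterisk. Sync points are what make the file splittable: a reader dropped at an arbitrary byte offset can scan forward to the next sync marker and resume at a record boundary. The plain-Java sketch below illustrates that idea only. The class `SyncMarkerSketch`, the fixed marker bytes, and the value-only record format are hypothetical simplifications and do not reproduce the real on-disk format byte for byte.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SyncMarkerSketch {
    static final int SYNC_ESCAPE = -1;       // a length of -1 announces a sync marker
    static final byte[] SYNC = new byte[16]; // fixed 16-byte marker, chosen for this sketch
    static {
        for (int i = 0; i < SYNC.length; i++) SYNC[i] = (byte) (0xA0 + i);
    }

    static void putInt(ByteArrayOutputStream buf, int v) {
        buf.write(ByteBuffer.allocate(4).putInt(v).array(), 0, 4);
    }

    /** Writes length-prefixed values, with a sync entry before every interval-th record. */
    static byte[] write(List<String> values, int interval) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < values.size(); i++) {
            if (i % interval == 0) {
                putInt(buf, SYNC_ESCAPE);
                buf.write(SYNC, 0, SYNC.length);
            }
            byte[] v = values.get(i).getBytes(StandardCharsets.UTF_8);
            putInt(buf, v.length);
            buf.write(v, 0, v.length);
        }
        return buf.toByteArray();
    }

    /** Returns the offset just past the first sync marker at or after from, or -1 if none. */
    static int sync(byte[] file, int from) {
        for (int p = from; p + SYNC.length <= file.length; p++) {
            if (Arrays.equals(Arrays.copyOfRange(file, p, p + SYNC.length), SYNC)) {
                return p + SYNC.length;
            }
        }
        return -1;
    }

    /** Reads every record from a record boundary to the end, skipping sync entries. */
    static List<String> readFrom(byte[] file, int start) {
        List<String> result = new ArrayList<>();
        ByteBuffer in = ByteBuffer.wrap(file);
        in.position(start);
        while (in.remaining() >= 4) {
            int len = in.getInt();
            if (len == SYNC_ESCAPE) {
                in.position(in.position() + SYNC.length);
                continue;
            }
            byte[] v = new byte[len];
            in.get(v);
            result.add(new String(v, StandardCharsets.UTF_8));
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] file = write(Arrays.asList("a", "b", "c", "d", "e", "f"), 2);
        // Offset 22 falls in the middle of the first record; resync before reading.
        int boundary = sync(file, 22);
        System.out.println(readFrom(file, boundary));
    }
}
```

A map task assigned a split that starts mid-record would do the same thing: skip to the first sync point inside its split and read whole records from there.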

Writing

package com.shockang.study.bigdata.hdfs.sequencefile;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import java.io.IOException;

public class SequenceFileWriteDemo {

    private static final String[] DATA = {
            "One, two, buckle my shoe",
            "Three, four, shut the door",
            "Five, six, pick up sticks",
            "Seven, eight, lay them straight",
            "Nine, ten, a big fat hen"
    };

    public static void main(String[] args) throws IOException {
        // args[0] is the URI of the SequenceFile to create
        String uri = args[0];
        Configuration conf = new Configuration();
        Path path = new Path(uri);

        IntWritable key = new IntWritable();
        Text value = new Text();
        SequenceFile.Writer writer = null;
        try {
            // Option-based factory method; the overloads that take a FileSystem are deprecated
            writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(path),
                    SequenceFile.Writer.keyClass(key.getClass()),
                    SequenceFile.Writer.valueClass(value.getClass()));

            // Write 100 records: keys count down from 100, values cycle through DATA
            for (int i = 0; i < 100; i++) {
                key.set(100 - i);
                value.set(DATA[i % DATA.length]);
                // getLength() is the current position in the file, where this record will start
                System.out.printf("[%s]\t%s\t%s%n", writer.getLength(), key, value);
                writer.append(key, value);
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}