Merging Small Files with a Custom InputFormat

情深不仅李义山

Category: Hadoop. Tags: big data, mapreduce, hadoop

Article link: https://blog.csdn.net/weixin_43854618/article/details/108816239

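When we read files with MapReduce, we usually use TextInputFormat. It reads its input line by line: the key is the byte offset of the line within the file, and the value is the text of that line. In Hadoop, large numbers of small files hurt NameNode performance, because every file takes up metadata in NameNode memory, and they hurt MapReduce performance as well, because each small file typically ends up with a map task of its own. So merging small files with a custom InputFormat is a useful exercise, and it also gives us a look at how InputFormat works underneath.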
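To make the (offset, line) pairing concrete, here is a small plain-Java sketch with no Hadoop dependency. The class `OffsetLineDemo` and its `toRecords` method are made up for illustration and are not part of Hadoop; the sketch only handles `\n` line endings, and it mimics the shape of the records a mapper receives rather than how LineRecordReader is actually implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: mimics the (byte offset, line text) records a mapper sees
class OffsetLineDemo {

    // Splits the bytes on '\n' and maps each line's starting byte offset to its text
    static Map<Long, String> toRecords(byte[] data) {
        Map<Long, String> records = new LinkedHashMap<>();
        int start = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                records.put((long) start, new String(data, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // Last line without a trailing newline
        if (start < data.length) {
            records.put((long) start, new String(data, start, data.length - start, StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "hello\nhadoop world\nbye".getBytes(StandardCharsets.UTF_8);
        // Prints:
        // 0    hello
        // 6    hadoop world
        // 19   bye
        toRecords(data).forEach((offset, line) -> System.out.println(offset + "\t" + line));
    }
}
```

Note that the keys are byte offsets, not line numbers, which is why the second line starts at 6 rather than 1.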

Since we want to implement file reading ourselves, it makes sense to first see how TextInputFormat does it. Let's look at its source code.

@Public
@Stable
public class TextInputFormat extends FileInputFormat<LongWritable, Text> {
    public TextInputFormat() {
    }

    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) {
        String delimiter = context.getConfiguration().get("textinputformat.record.delimiter");
        byte[] recordDelimiterBytes = null;
        if (null != delimiter) {
            recordDelimiterBytes = delimiter.getBytes(Charsets.UTF_8);
        }

        return new LineRecordReader(recordDelimiterBytes);
    }

    protected boolean isSplitable(JobContext context, Path file) {
        CompressionCodec codec = (new CompressionCodecFactory(context.getConfiguration())).getCodec(file);
        return null == codec ? true : codec instanceof SplittableCompressionCodec;
    }
}

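As you can see, TextInputFormat extends FileInputFormat, so our custom InputFormat can extend FileInputFormat as well and override the same methods. TextInputFormat's createRecordReader method returns a LineRecordReader, which produces one line per record. That is clearly not what we want: we want to read a whole file in one go, so we have to override this method. Its return type is RecordReader, so we first need a RecordReader of our own, which we get by extending RecordReader.
The code is below, with comments where needed: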

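import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MyRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private Configuration configuration = null;
    private FileSplit fileSplit = null;
    // true once the single record (the whole file) has been read
    private boolean processed = false;
    private final BytesWritable byteWritable = new BytesWritable();
    private FSDataInputStream inputStream = null;

    // Initialization
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Get the Configuration object
        configuration = taskAttemptContext.getConfiguration();

        // Get the file split
        fileSplit = (FileSplit) inputSplit;
    }

    // Advance to the next key and value
    // key: NullWritable     value: BytesWritable
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (processed) {
            // The whole file was already returned as the only record.
            // Returning true here again would make the caller loop forever.
            return false;
        }

        // 1. Open a byte input stream on the source file
        // 1.1 Get the file system that owns the file's path
        Path path = fileSplit.getPath();
        FileSystem fileSystem = path.getFileSystem(configuration);

        // 1.2 Open the stream through the file system
        inputStream = fileSystem.open(path);

        // 2. Read the whole file into a plain byte[]
        //    (the file is never split, so the split length is the file length)
        int length = (int) fileSplit.getLength();
        byte[] inputBytes = new byte[length];
        IOUtils.readFully(inputStream, inputBytes, 0, length);

        // 3. Wrap the byte[] in the BytesWritable value
        byteWritable.set(inputBytes, 0, length);

        processed = true;
        return true;
    }

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return byteWritable;
    }

    // Report progress: 0 before the file is read, 1 afterwards
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return processed ? 1.0f : 0.0f;
    }

    // Clean up and release resources
    @Override
    public void close() throws IOException {
        // Close only the stream. FileSystem instances are cached and shared,
        // so closing one here could break other code that uses the same instance.
        if (inputStream != null) {
            inputStream.close();
        }
    }
}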
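The contract of nextKeyValue() is easy to get wrong, so here is a minimal plain-Java sketch of it. The framework drives a reader roughly as `while (nextKeyValue()) { map(getCurrentKey(), getCurrentValue()); }`, so a reader that returns a whole file as one record has to return true exactly once and false afterwards. `WholeFileReaderDemo` and `OneShotReader` are hypothetical stand-ins written for this example, not Hadoop classes.

```java
// Hypothetical stand-ins that mimic how a one-record-per-file reader is driven
class WholeFileReaderDemo {

    // Yields the given content as a single record, then reports end of input
    static class OneShotReader {
        private final byte[] content;
        private boolean processed = false;
        private byte[] current = null;

        OneShotReader(byte[] content) {
            this.content = content;
        }

        boolean nextKeyValue() {
            if (processed) {
                return false; // nothing left: the only record was already handed out
            }
            current = content;
            processed = true;
            return true;
        }

        byte[] getCurrentValue() {
            return current;
        }
    }

    // Mimics the driving loop: one "map call" per record until nextKeyValue() is false
    static int countRecords(OneShotReader reader) {
        int records = 0;
        while (reader.nextKeyValue()) {
            reader.getCurrentValue(); // a real mapper would process the value here
            records++;
        }
        return records;
    }

    public static void main(String[] args) {
        OneShotReader reader = new OneShotReader(new byte[] {1, 2, 3});
        System.out.println(countRecords(reader)); // prints 1
    }
}
```

If nextKeyValue() kept returning true after the first call, the loop in countRecords would never end, and in a real job the map task would process the same file over and over.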

The custom InputFormat is below, with comments where needed:

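import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MyInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Create the RecordReader. The framework calls initialize(inputSplit, taskAttemptContext)
        // on it before reading, so there is no need to call initialize here
        // (TextInputFormat does not call it either).
        return new MyRecordReader();
    }

    // Whether a file may be split: never, so each file reaches the reader whole
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}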
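To see what returning false from isSplitable changes, the following plain-Java sketch models how many splits a single file produces. It is a simplified model based on my reading of the split loop in FileInputFormat (including its 1.1 slop factor), not the real implementation; `SplitCountDemo`, `countSplits`, and the 128 MB split size are assumptions made for the example.

```java
// Simplified, hypothetical model of how many splits one file yields
class SplitCountDemo {

    // A trailing chunk up to 10% larger than the split size is not split again
    private static final double SPLIT_SLOP = 1.1;

    static int countSplits(long fileLength, long splitSize, boolean splittable) {
        if (!splittable || fileLength == 0) {
            return 1; // the whole file becomes a single split
        }
        int splits = 0;
        long bytesRemaining = fileLength;
        while ((double) bytesRemaining / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++; // the remainder forms the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(countSplits(300 * mb, 128 * mb, true));  // prints 3
        System.out.println(countSplits(300 * mb, 128 * mb, false)); // prints 1
        System.out.println(countSplits(4 * 1024L, 128 * mb, true)); // prints 1
    }
}
```

A small file yields one split either way, but disabling splitting guarantees it: the record reader reads getLength() bytes from the start of the file, which is only correct when the split covers the entire file.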

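To merge the small files, reading them is not enough: we also have to serialize them into a single output. Each serialized record has two parts, a head and a tail: the head (the key) is the file name, and the tail (the value) is the file content. So in map we look up the name of the file the content belongs to and write it to the context together with the content.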

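import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Emits (file name, file bytes) for each file read by the custom InputFormat; SequenceFileOutputFormat writes the output
public class SequenceFileMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // The input split tells us which file this record came from
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();

        context.write(new Text(fileName), value);
    }
}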
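The idea of storing many small files as (file name, content) records in one container can be sketched in plain Java. This is only an illustration of the key/value layout: the byte format below is invented for the example and is not the SequenceFile format, and `PackDemo`, `pack`, and `unpack` are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical container: [name length][name bytes][content length][content bytes] per file
class PackDemo {

    // Writes every (name, content) pair into one byte array
    static byte[] pack(Map<String, byte[]> files) {
        int size = 0;
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            size += 4 + e.getKey().getBytes(StandardCharsets.UTF_8).length + 4 + e.getValue().length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (Map.Entry<String, byte[]> e : files.entrySet()) {
            byte[] name = e.getKey().getBytes(StandardCharsets.UTF_8);
            buf.putInt(name.length).put(name);
            buf.putInt(e.getValue().length).put(e.getValue());
        }
        return buf.array();
    }

    // Reads the records back until the container is exhausted
    static Map<String, byte[]> unpack(byte[] packed) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        ByteBuffer buf = ByteBuffer.wrap(packed);
        while (buf.hasRemaining()) {
            byte[] name = new byte[buf.getInt()];
            buf.get(name);
            byte[] content = new byte[buf.getInt()];
            buf.get(content);
            files.put(new String(name, StandardCharsets.UTF_8), content);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.txt", "first".getBytes(StandardCharsets.UTF_8));
        files.put("b.txt", "second".getBytes(StandardCharsets.UTF_8));

        Map<String, byte[]> restored = unpack(pack(files));
        restored.forEach((name, content) ->
                System.out.println(name + " -> " + new String(content, StandardCharsets.UTF_8)));
    }
}
```

Keeping the file name as the key is what makes the merge reversible: each original file can be found again inside the merged output.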

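This job does not need a custom Reducer, because map already produces the final key/value pairs. With no Reducer set, the default identity Reducer, running as a single reduce task, simply passes those pairs through into one output file. So we can go straight to writing the Job. Note that the Job has to specify the output format with job.setOutputFormatClass(SequenceFileOutputFormat.class).
The code is below: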

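import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class JobMain extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {

        Job job = Job.getInstance(super.getConf(), "Inputformat");
        // Lets Hadoop locate the job jar when the job runs on a cluster
        job.setJarByClass(JobMain.class);

        // Input: read every small file whole with the custom InputFormat
        job.setInputFormatClass(MyInputFormat.class);
        MyInputFormat.addInputPath(job, new Path("file:///D:\\in\\Inputformat"));

        // Map: emit (file name, file bytes)
        job.setMapperClass(SequenceFileMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);

        // No Reducer is set, so the final output types match the map output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        // Output: write the records as a SequenceFile
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setOutputPath(job, new Path("file:///D:\\out\\InputFormat"));

        boolean succeeded = job.waitForCompletion(true);
        return succeeded ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();

        int exitCode = ToolRunner.run(configuration, new JobMain(), args);

        System.exit(exitCode);
    }
}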