MapReduce: Custom Input and Output Formats and Computing TopN


Imflash

Category: hadoop. Tags: hadoop

Article link: https://blog.csdn.net/Imflash/article/details/100619037


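This article shows how to write a custom InputFormat in Hadoop MapReduce to merge small files, how to write a custom OutputFormat to control where output is written, and how to compute TopN with custom grouping. The custom InputFormat merges small files to improve processing efficiency; the custom OutputFormat meets specific file-path requirements; finally, custom grouping together with a Reducer computes TopN.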


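1. Merging small files with a custom InputFormat

1.1 Requirements

Small files hurt efficiency in both HDFS and MapReduce: HDFS keeps metadata for every file in NameNode memory, and MapReduce by default creates at least one split, and therefore one map task, per file. In practice, workloads with large numbers of small files are hard to avoid, so a solution is needed.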

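1.2 Analysis

Small-file handling comes down to the following approaches:

1. At data-collection time, combine small files or small batches of data into large files before uploading them to HDFS.

2. Before business processing, run a MapReduce program on HDFS to merge the small files.

3. At MapReduce processing time, use CombineFileInputFormat (for example CombineTextInputFormat) to pack several small files into one split.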
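To make the benefit of the third approach concrete, the following plain-Java sketch compares the number of splits (and so map tasks) with and without combining. It is a simplified model under stated assumptions, not Hadoop's actual algorithm: the class name `SplitCountSketch`, the file sizes, and the 4 MB cap are hypothetical, and the real CombineFileInputFormat also takes node and rack locality into account.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitCountSketch {

    // Without combining, each small file becomes its own split,
    // so N small files mean N map tasks.
    public static int splitsWithoutCombining(long[] fileSizes) {
        return fileSizes.length;
    }

    // Simplified model of combining: greedily pack whole files into a split
    // until adding the next file would exceed maxSplitSize.
    public static List<List<Long>> packIntoSplits(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : fileSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);          // close the full split
                current = new ArrayList<>();  // and start a new one
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Hypothetical file sizes in bytes and a hypothetical cap of 4 MB per split.
        long[] sizes = {1_000_000L, 2_000_000L, 1_500_000L, 3_000_000L, 500_000L};
        long maxSplitSize = 4_000_000L;
        System.out.println("splits without combining: " + splitsWithoutCombining(sizes));
        System.out.println("splits with combining:    " + packIntoSplits(sizes, maxSplitSize).size());
    }
}
```

With these sample sizes the five files are packed into three splits. In a real job the cap would be set through the input format's maximum split size rather than passed as a method argument.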

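1.3 Implementation

This section implements the second approach above.

Core mechanism of the program:

Define a custom InputFormat.

Override the RecordReader so that it reads one complete file at a time and wraps it as a single key-value pair.

Write the merged file with SequenceFileOutputFormat (a binary key-value format; compression is optional and has to be enabled in the job configuration).

Each record in the merged file consists of two parts:

Key: the file name

Value: the file content as a byte array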
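The idea of storing each small file as one (file name, content bytes) record can be illustrated without Hadoop. The sketch below is only a conceptual model: the class `MergedRecordSketch` and its length-prefixed layout are hypothetical and are not the on-disk format of a real SequenceFile, which also carries a header and sync markers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class MergedRecordSketch {

    // Appends every (fileName, content) pair as one record:
    // the name as a UTF string, a 4-byte length, then the raw bytes.
    public static byte[] merge(Map<String, byte[]> smallFiles) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : smallFiles.entrySet()) {
                out.writeUTF(file.getKey());
                out.writeInt(file.getValue().length);
                out.write(file.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the records back in the order they were written.
    public static Map<String, byte[]> split(byte[] merged) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(merged))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                files.put(name, content);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, byte[]> input = new LinkedHashMap<>();
        input.put("a.txt", "hello".getBytes(StandardCharsets.UTF_8));
        input.put("b.txt", "world!".getBytes(StandardCharsets.UTF_8));
        for (Map.Entry<String, byte[]> e : split(merge(input)).entrySet()) {
            System.out.println(e.getKey() + " -> " + new String(e.getValue(), StandardCharsets.UTF_8));
        }
    }
}
```

Because the file name travels with the content, the original small files can still be told apart after merging, which is the property the merged output relies on.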


The code is as follows:

Custom InputFormat

Written against Hadoop 2.6.0 (CDH 5.4.0)

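import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class MyInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // 1. Create the custom RecordReader.
        MyRecordReader myRecordReader = new MyRecordReader();
        // 2. Pass the split and the context to it. The framework also calls
        //    initialize() on the returned reader, so this call is redundant but harmless.
        myRecordReader.initialize(inputSplit, taskAttemptContext);

        return myRecordReader;
    }

    /*
     Controls whether a file may be split. Returning false keeps every
     file in a single split, so one record can hold the whole file.
     */
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}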

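Custom RecordReader

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MyRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private Configuration configuration = null;
    private FileSplit fileSplit = null;
    private boolean processed = false;
    private BytesWritable bytesWritable = new BytesWritable();
    private FileSystem fileSystem = null;
    private FSDataInputStream inputStream = null;

    // Initialization
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // Keep the file split this reader is responsible for.
        fileSplit = (FileSplit) inputSplit;

        // Keep the job Configuration.
        configuration = taskAttemptContext.getConfiguration();
    }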

    // Produces the single (K1, V1) pair for this split.
    /*
     K1: NullWritable
     V1: BytesWritable holding the whole file
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            // 1. Open a byte input stream on the source file.
            // 1.1 Resolve the FileSystem from the file's own path, so the file
            //     is found even when it is not on the default file system.
            fileSystem = fileSplit.getPath().getFileSystem(configuration);
            // 1.2 Open the file through that FileSystem.
            inputStream = fileSystem.open(fileSplit.getPath());

            // 2. Read the whole file into a byte[]. The int cast is safe only
            //    because the inputs are assumed to be small files (under 2 GB).
            int length = (int) fileSplit.getLength();
            byte[] bytes = new byte[length];
            IOUtils.readFully(inputStream, bytes, 0, length);

            // 3. Wrap the bytes in the BytesWritable to form V1.
            bytesWritable.set(bytes, 0, length);

            processed = true;

            return true;
        }

        return false;
    }

    // Returns K1
    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    // Returns V1
    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return bytesWritable;
    }

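    // Reports read progress.
    // A reader that consumes a file incrementally would return the fraction of
    // bytes read so far; here the file is read in one go, so progress is 0 or 1.
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return processed ? 1.0f : 0.0f;
    }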

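    // Releases resources
    @Override
    public void close() throws IOException {
        // closeStream is null-safe, so this also works if nextKeyValue() was never called.
        IOUtils.closeStream(inputStream);
        // fileSystem is deliberately not closed: FileSystem instances are cached and
        // shared within the JVM, and closing one here can break other users of it.
    }
}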
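A whole-file reader like the one above returns true from nextKeyValue() exactly once and false afterwards, which is what turns one file into one record. The plain-Java sketch below imitates how such a reader is consumed. It does not depend on Hadoop: `WholeValueReader` and `OneShotReader` are hypothetical stand-ins for RecordReader and MyRecordReader, and `drain` has the same shape as the loop in Mapper.run().

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class ReaderLoopSketch {

    // Hypothetical stand-in for the three RecordReader methods used below.
    public interface WholeValueReader {
        boolean nextKeyValue();
        byte[] getCurrentValue();
        float getProgress();
    }

    // Emits its entire content as exactly one record.
    public static class OneShotReader implements WholeValueReader {
        private final byte[] content;
        private boolean processed = false;

        public OneShotReader(byte[] content) {
            this.content = content;
        }

        @Override
        public boolean nextKeyValue() {
            if (processed) {
                return false; // the single record was already handed out
            }
            processed = true;
            return true;
        }

        @Override
        public byte[] getCurrentValue() {
            return content;
        }

        @Override
        public float getProgress() {
            return processed ? 1.0f : 0.0f;
        }
    }

    // Pulls records until the reader is exhausted, as Mapper.run() does.
    public static List<String> drain(WholeValueReader reader) {
        List<String> records = new ArrayList<>();
        while (reader.nextKeyValue()) {
            records.add(new String(reader.getCurrentValue(), StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        OneShotReader reader = new OneShotReader("whole file content".getBytes(StandardCharsets.UTF_8));
        List<String> records = drain(reader);
        System.out.println(records.size() + " record(s): " + records);
    }
}
```

The `processed` flag is the only state the loop depends on: if it were never set, the loop would hand the same file to the mapper forever.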