Custom InputFormat in MapReduce

minchowang

Category: Hadoop. Tags: mapreduce, big data, java

Article link: https://blog.csdn.net/qq_39261894/article/details/104545554

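Hadoop's built-in input format classes include:
1) FileInputFormat<K,V>: the abstract base class for file-based input formats. A custom input format normally extends it directly.
2) TextInputFormat<LongWritable,Text>: the default input format. The key is the byte offset of the current line from the start of the file, and the value is the content of that line.
3) SequenceFileInputFormat<K,V>: reads sequence files. Sequence files are an efficient binary format but are not human-readable, so they are best used for intermediate data between jobs, with a readable format for the final output.
4) KeyValueTextInputFormat<Text,Text>: reads lines that are split by a separator, which defaults to Tab (\t). For each line, the text before the first \t becomes the key and the text after it becomes the value.
5) CombineFileInputFormat<K,V>: packs a large number of small files into fewer splits.
When using its text subclass CombineTextInputFormat, set the maximum split size in the Driver: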

CombineTextInputFormat.setMaxInputSplitSize(job, 4194304); // 4 MB (4 * 1024 * 1024 bytes)

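6) MultipleInputs: a helper class rather than an input format itself. It supports multiple input paths and lets you assign a separate InputFormat and Mapper to each one.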
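To make the key/value shapes of the two text formats in the list above concrete, here is a small plain-Java sketch that does not use Hadoop at all. It models, under simplifying assumptions (UTF-8 input held in memory, `\n` as the only line terminator), how TextInputFormat pairs each line with its byte offset and how KeyValueTextInputFormat splits a line at the first Tab. The names `TextFormatSketch`, `offsetsAndLines`, and `splitAtFirstTab` are hypothetical and exist only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TextFormatSketch {

    // Models TextInputFormat: key = byte offset of the line start, value = line text.
    static List<Map.Entry<Long, String>> offsetsAndLines(String content) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        int start = 0;
        for (int i = 0; i <= bytes.length; i++) {
            boolean atEnd = (i == bytes.length);
            if (atEnd || bytes[i] == '\n') {
                // Skip the empty tail that follows a final newline
                if (!(atEnd && start == i)) {
                    String line = new String(bytes, start, i - start, StandardCharsets.UTF_8);
                    records.add(new AbstractMap.SimpleEntry<>((long) start, line));
                }
                start = i + 1;
            }
        }
        return records;
    }

    // Models KeyValueTextInputFormat: split at the FIRST tab only.
    // A line without a tab becomes the key, with an empty value.
    static String[] splitAtFirstTab(String line) {
        int pos = line.indexOf('\t');
        if (pos < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, pos), line.substring(pos + 1)};
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> record : offsetsAndLines("hello\tworld\nfoo\n")) {
            String[] kv = splitAtFirstTab(record.getValue());
            System.out.println("offset=" + record.getKey()
                    + " key=" + kv[0] + " value=" + kv[1]);
        }
    }
}
```

In this sketch the second line starts at offset 12, because the first line occupies 11 bytes plus one newline byte. The offset counts bytes, not lines or characters.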

Source code walkthrough

1. Mapper

// Step into context.nextKeyValue(), which leads to the WrappedMapper class.

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
}

2. WrappedMapper

// Step into this class's nextKeyValue(), which leads to the MapContextImpl class.

public boolean nextKeyValue() throws IOException, InterruptedException{
      return mapContext.nextKeyValue();
}

3. MapContextImpl

public boolean nextKeyValue() throws IOException, InterruptedException {
    return reader.nextKeyValue();
}

// Declaration and assignment of reader

public class MapContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT> 
    extends TaskInputOutputContextImpl<KEYIN,VALUEIN,KEYOUT,VALUEOUT> 
    implements MapContext<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  private RecordReader<KEYIN,VALUEIN> reader;
  private InputSplit split;
  public MapContextImpl(Configuration conf, TaskAttemptID taskid,
                        RecordReader<KEYIN,VALUEIN> reader,
                        RecordWriter<KEYOUT,VALUEOUT> writer,
                        OutputCommitter committer,
                        StatusReporter reporter,
                        InputSplit split) {
    super(conf, taskid, writer, committer, reporter);
    this.reader = reader;
    this.split = split;
  }
}
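The call chain traced above (Mapper.run, then the context, then the RecordReader) can be modelled in a few lines of plain Java. The sketch below is not Hadoop code: `DelegationSketch`, `Reader`, `Context`, `run`, and `readerOver` are hypothetical stand-ins, and the WrappedMapper layer is folded into `Context` for brevity. It only shows that the loop in `run` ends exactly when the reader's `nextKeyValue()` returns false, which is the hook a custom RecordReader controls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class DelegationSketch {

    // Stand-in for RecordReader: produces one record per call.
    interface Reader {
        boolean nextKeyValue();
        String getCurrentValue();
    }

    // Stand-in for MapContextImpl: holds the reader and forwards every call to it.
    static class Context {
        private final Reader reader;

        Context(Reader reader) {
            this.reader = reader;
        }

        boolean nextKeyValue() {
            return reader.nextKeyValue();
        }

        String getCurrentValue() {
            return reader.getCurrentValue();
        }
    }

    // Stand-in for Mapper.run(): setup, one map call per record, then cleanup.
    static List<String> run(Context context) {
        List<String> trace = new ArrayList<>();
        trace.add("setup");
        try {
            while (context.nextKeyValue()) {
                trace.add("map:" + context.getCurrentValue());
            }
        } finally {
            trace.add("cleanup");
        }
        return trace;
    }

    // A trivial reader that yields the given lines one by one.
    static Reader readerOver(List<String> lines) {
        final Iterator<String> it = lines.iterator();
        return new Reader() {
            private String current;

            @Override
            public boolean nextKeyValue() {
                if (!it.hasNext()) {
                    return false;
                }
                current = it.next();
                return true;
            }

            @Override
            public String getCurrentValue() {
                return current;
            }
        };
    }

    public static void main(String[] args) {
        // prints [setup, map:a, map:b, cleanup]
        System.out.println(run(new Context(readerOver(Arrays.asList("a", "b")))));
    }
}
```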

Custom InputFormat

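Define a class that extends FileInputFormat.
Implement a custom RecordReader that reads one whole file at a time and wraps it as a single key/value pair.
Use SequenceFileOutputFormat on the output side to write the merged file.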

WholeFileInputformat

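import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputformat extends FileInputFormat<Text, BytesWritable> {

    // Never split a file: each input file becomes exactly one split
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    @Override
    public RecordReader<Text, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        WholeRecordReader wholeRecordReader = new WholeRecordReader();
        wholeRecordReader.initialize(split, context);
        return wholeRecordReader;
    }
}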
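Why returning false from isSplitable matters can be seen from how the number of splits per file is derived. The sketch below is a plain-Java model, not Hadoop code. It assumes the splitting loop in FileInputFormat.getSplits keeps cutting splits while the remaining bytes exceed 1.1 times the split size (worth checking against your Hadoop version), assumes a non-empty file, and ignores block locations. `SplitCountSketch` and `countSplits` are hypothetical names.

```java
class SplitCountSketch {

    // Assumed slack factor: the last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Returns how many splits a single non-empty file would produce in this model.
    static int countSplits(long fileLength, long splitSize, boolean splitable) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength and splitSize must be positive");
        }
        if (!splitable) {
            // The whole file is handed to one mapper as a single split
            return 1;
        }
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            // The tail (at most 1.1 * splitSize) forms the last split
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(countSplits(300 * mb, 128 * mb, true));  // prints 3
        System.out.println(countSplits(300 * mb, 128 * mb, false)); // prints 1
    }
}
```

With a single split per file, the RecordReader can safely treat the split length as the file length, which is what the whole-file approach relies on.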

WholeRecordReader

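import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeRecordReader extends RecordReader<Text, BytesWritable> {
    // The split being read (one whole file, because isSplitable returns false)
    private FileSplit fileSplit;
    private Configuration conf;
    // True until the single record for this file has been produced
    private boolean isProgress = true;

    private BytesWritable v = new BytesWritable();
    private Text k = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        fileSplit = (FileSplit) split;
        conf = context.getConfiguration();
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (isProgress) {
            // Length of the split, which equals the file length here
            long length = fileSplit.getLength();
            // Buffer for the whole file (the int cast limits files to under 2 GB)
            byte[] contents = new byte[(int) length];
            // Path of the file behind this split
            Path path = fileSplit.getPath();
            // File system that owns the path
            FileSystem fs = path.getFileSystem(conf);
            // Open the file
            FSDataInputStream fis = fs.open(path);
            try {
                // Read the entire file into the buffer
                IOUtils.readFully(fis, contents, 0, (int) length);
            } finally {
                // Close the stream even if the read fails
                IOUtils.closeStream(fis);
            }

            // Value: the file contents
            v.set(contents, 0, (int) length);

            // Key: the full path of the file
            k.set(path.toString());

            // Mark the file as consumed so the next call returns false
            isProgress = false;
            return true;
        }
        return false;
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return k;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return v;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        // Either nothing or the whole file has been read
        return isProgress ? 0.0f : 1.0f;
    }

    @Override
    public void close() throws IOException {
        // Nothing to release: the stream is closed in nextKeyValue()
    }
}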
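The core of the reader above is a read-once flag: the first call to nextKeyValue() loads the whole file and returns true, and every later call returns false. The following plain-Java sketch reproduces that pattern against the local file system with java.nio instead of Hadoop's FileSystem, so it can be run without a cluster. `WholeFileReaderSketch` and `countRecords` are hypothetical names, and the temporary file exists only for the demonstration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WholeFileReaderSketch {
    private final Path path;
    private boolean processed = false; // flips to true after the single record
    private String key;
    private byte[] value;

    WholeFileReaderSketch(Path path) {
        this.path = path;
    }

    // First call: read the whole file as one record. Later calls: no more records.
    boolean nextKeyValue() throws IOException {
        if (processed) {
            return false;
        }
        value = Files.readAllBytes(path);
        key = path.toString();
        processed = true;
        return true;
    }

    String getCurrentKey() {
        return key;
    }

    byte[] getCurrentValue() {
        return value;
    }

    float getProgress() {
        return processed ? 1.0f : 0.0f;
    }

    // Demo helper: writes content to a temp file and counts the records read back.
    static int countRecords(byte[] content) {
        try {
            Path tmp = Files.createTempFile("whole-file-sketch", ".txt");
            try {
                Files.write(tmp, content);
                WholeFileReaderSketch reader = new WholeFileReaderSketch(tmp);
                int records = 0;
                while (reader.nextKeyValue()) {
                    if (reader.getCurrentValue().length != content.length) {
                        throw new IllegalStateException("record does not hold the whole file");
                    }
                    records++;
                }
                return records;
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "line1\nline2\nline3\n".getBytes(StandardCharsets.UTF_8);
        // A three-line file still yields exactly one record
        System.out.println("records = " + countRecords(content)); // prints records = 1
    }
}
```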

Driver configuration


    // Set the input format
    job.setInputFormatClass(WholeFileInputformat.class);

    // Set the output format
    job.setOutputFormatClass(SequenceFileOutputFormat.class);