Original article: http://shiyanjun.cn/archives/299.html
If you choose CombineFileInputFormat for the common Hadoop scenario of processing huge numbers of small files, and it is your first time using it, you may find it a bit confusing. The idea behind the approach is easy to grasp, but in practice you can run into all sorts of problems.
To use CombineFileInputFormat as the input format of a Map task, you first need to implement a custom RecordReader.
CombineFileInputFormat works roughly as follows: it packs the metadata of multiple input data files (the small files) into a single CombineFileSplit. In the small-file case, every file occupies a single block in HDFS (one file, one block), so a CombineFileSplit holds a group of file blocks together with each file's start offset, length, block locations, and other metadata. To process a CombineFileSplit, the obvious approach is to handle each file it contains in turn; note that it does not actually carry nested InputSplits, so when you want to read one small file's block you have to construct a FileSplit object for it yourself.
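To make this concrete, here is a minimal sketch (assuming a combineFileSplit variable of type CombineFileSplit is in scope) of iterating over the files packed into one split and wrapping one of them in a FileSplit:

// A CombineFileSplit is essentially a set of parallel arrays: paths, offsets
// and lengths, with one entry per packed small file.
for (int i = 0; i < combineFileSplit.getNumPaths(); i++) {
    Path path = combineFileSplit.getPath(i);      // the i-th small file
    long offset = combineFileSplit.getOffset(i);  // start offset within that file
    long length = combineFileSplit.getLength(i);  // number of bytes to read
    // To read file i with an ordinary reader, wrap its metadata in a FileSplit:
    FileSplit fileSplit = new FileSplit(path, offset, length, combineFileSplit.getLocations());
}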
When a MapReduce task runs, it needs to read records from the files (text lines in the simple case, though other formats are possible). For a CombineFileSplit, each small-file block it contains must be paired with a RecordReader before its contents can be read correctly. Typically, the batch of small files all share the same format, so when implementing a RecordReader for the CombineFileSplit it is enough to embed another RecordReader that reads a single small-file block; that inner reader then handles every small file packed into the CombineFileSplit.
Implementation
Given the description above, the work needed to process massive numbers of small files on top of Hadoop's built-in CombineFileInputFormat is clear:
- Implement a RecordReader that reads the file blocks wrapped in a CombineFileSplit
- Subclass CombineFileInputFormat to build an input format that uses our custom RecordReader
- Implement the Mapper that processes the data
- Configure the MapReduce job that handles the massive small files
Each of these pieces is explained in detail below:
- The CombineSmallfileRecordReader class
We implement a RecordReader for CombineFileSplit that internally uses Hadoop's own LineRecordReader to read the text lines of the small files. The implementation is as follows:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombineSmallfileRecordReader extends RecordReader<LongWritable, BytesWritable> {

    private CombineFileSplit combineFileSplit;
    private LineRecordReader lineRecordReader = new LineRecordReader();
    private Path[] paths;
    private int totalLength;
    private int currentIndex;
    private float currentProgress = 0;
    private LongWritable currentKey;
    private BytesWritable currentValue = new BytesWritable();

    public CombineSmallfileRecordReader(CombineFileSplit combineFileSplit, TaskAttemptContext context, Integer index) throws IOException {
        super();
        this.combineFileSplit = combineFileSplit;
        this.currentIndex = index; // index of the small-file block to process within the CombineFileSplit
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.combineFileSplit = (CombineFileSplit) split;
        // To process one small-file block of the CombineFileSplit with a LineRecordReader,
        // we must first wrap its metadata in a FileSplit object, then we can read its data
        FileSplit fileSplit = new FileSplit(combineFileSplit.getPath(currentIndex), combineFileSplit.getOffset(currentIndex), combineFileSplit.getLength(currentIndex), combineFileSplit.getLocations());
        lineRecordReader.initialize(fileSplit, context);

        this.paths = combineFileSplit.getPaths();
        totalLength = paths.length;
        // record the current file's name so the Mapper can use it as the output key
        context.getConfiguration().set("map.input.file.name", combineFileSplit.getPath(currentIndex).getName());
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        currentKey = lineRecordReader.getCurrentKey();
        return currentKey;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        // Text.getBytes() returns the backing array, which may be longer than the
        // actual content, so copy only the first getLength() bytes
        Text content = lineRecordReader.getCurrentValue();
        currentValue.set(content.getBytes(), 0, content.getLength());
        return currentValue;
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (currentIndex >= 0 && currentIndex < totalLength) {
            return lineRecordReader.nextKeyValue();
        } else {
            return false;
        }
    }

    @Override
    public float getProgress() throws IOException {
        if (currentIndex >= 0 && currentIndex < totalLength) {
            currentProgress = (float) currentIndex / totalLength;
        }
        return currentProgress;
    }

    @Override
    public void close() throws IOException {
        lineRecordReader.close();
    }
}
If your application's small files come in different formats, you need to use a different inner RecordReader for each file type; that dispatch logic also belongs in the class above, as sketched below.
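Here is a hypothetical sketch of such a dispatch. The ".seq" suffix rule and the assumption that those files are SequenceFiles holding LongWritable/Text pairs are made up for illustration; LineRecordReader and SequenceFileRecordReader (from org.apache.hadoop.mapreduce.lib.input) are real Hadoop classes. The generic innerReader field would take the place of the lineRecordReader field above, and nextKeyValue(), getCurrentKey(), getCurrentValue() and close() would delegate to it in the same way:

private RecordReader<LongWritable, Text> innerReader;

@Override
public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
    CombineFileSplit combineSplit = (CombineFileSplit) split;
    Path path = combineSplit.getPath(currentIndex);
    FileSplit fileSplit = new FileSplit(path, combineSplit.getOffset(currentIndex),
            combineSplit.getLength(currentIndex), combineSplit.getLocations());
    // Choose the inner reader by file suffix (an assumed naming convention):
    if (path.getName().endsWith(".seq")) {
        innerReader = new SequenceFileRecordReader<LongWritable, Text>();
    } else {
        innerReader = new LineRecordReader();
    }
    innerReader.initialize(fileSplit, context);
}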
- The CombineSmallfileInputFormat class
Now that we have a RecordReader for CombineFileSplit, we need to hand an instance of this implementation class, CombineSmallfileRecordReader, to a CombineFileInputFormat. To do so, we implement a subclass of CombineFileInputFormat and override its createRecordReader method. Our CombineSmallfileInputFormat is implemented as follows:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
public class CombineSmallfileInputFormat extends CombineFileInputFormat<LongWritable, BytesWritable> {

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        CombineFileSplit combineFileSplit = (CombineFileSplit) split;
        CombineFileRecordReader<LongWritable, BytesWritable> recordReader = new CombineFileRecordReader<LongWritable, BytesWritable>(combineFileSplit, context, CombineSmallfileRecordReader.class);
        try {
            recordReader.initialize(combineFileSplit, context);
        } catch (InterruptedException e) {
            // propagate initialization failures instead of swallowing them
            throw new RuntimeException("Error to initialize CombineSmallfileRecordReader.", e);
        }
        return recordReader;
    }
}
The important point here is that the RecordReader must be created through a CombineFileRecordReader, and your reader class's constructor must take exactly the parameter types defined above, in that order: the first parameter of type CombineFileSplit, the second of type TaskAttemptContext, and the third of type Integer (the index of the file within the split). CombineFileRecordReader instantiates your class reflectively against precisely that signature, so any deviation fails at runtime.
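For reference, here is a simplified sketch of what CombineFileRecordReader does internally (condensed; the real Hadoop class also tracks progress and closes each child reader), which is why that exact constructor is mandatory. The class name ReflectiveReaderSketch is ours, for illustration only:

import java.lang.reflect.Constructor;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class ReflectiveReaderSketch {
    // CombineFileRecordReader looks up the child reader's constructor against this
    // exact signature and instantiates one reader per file index in the split.
    static <K, V> RecordReader<K, V> newFileReader(Class<? extends RecordReader<K, V>> rrClass,
            CombineFileSplit split, TaskAttemptContext context, int idx) throws Exception {
        Constructor<? extends RecordReader<K, V>> ctor = rrClass.getDeclaredConstructor(
                CombineFileSplit.class, TaskAttemptContext.class, Integer.class);
        return ctor.newInstance(split, context, Integer.valueOf(idx));
    }
}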
- The CombineSmallfileMapper class
Next, we implement the Mapper for our MapReduce job. The CombineSmallfileMapper code is as follows:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class CombineSmallfileMapper extends Mapper<LongWritable, BytesWritable, Text, BytesWritable> {

    private Text file = new Text();

    @Override
    protected void map(LongWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // key the output by the source file name that the RecordReader stored in the configuration
        String fileName = context.getConfiguration().get("map.input.file.name");
        file.set(fileName);
        context.write(file, value);
    }
}
This is straightforward: for every text line read from the input, the mapper emits a key/value pair whose key is the name of the source file and whose value is the raw bytes of the line.
- The CombineSmallfiles class
Finally, here is the entry-point class with the main method, which configures the MapReduce job using the classes implemented above:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.shirdrn.kodz.inaction.hadoop.smallfiles.IdentityReducer;
public class CombineSmallfiles {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: combinesmallfiles <in> <out>");
            System.exit(2);
        }

        conf.setInt("mapred.min.split.size", 1);
        conf.setLong("mapred.max.split.size", 26214400); // 25 MB
        conf.setInt("mapred.reduce.tasks", 5);

        Job job = new Job(conf, "combine smallfiles");
        job.setJarByClass(CombineSmallfiles.class);
        job.setMapperClass(CombineSmallfileMapper.class);
        job.setReducerClass(IdentityReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        job.setInputFormatClass(CombineSmallfileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);
    }
}
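The job above references an IdentityReducer that is not shown in this post. A minimal sketch, assuming it does nothing but forward each key/value pair to the output:

package org.shirdrn.kodz.inaction.hadoop.smallfiles;

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IdentityReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {
    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        // Pass every (file name, line bytes) pair straight through to the SequenceFile output.
        for (BytesWritable value : values) {
            context.write(key, value);
        }
    }
}

Packaged into a jar (the jar name here is just an example), the job is submitted as usual:

hadoop jar smallfiles.jar org.shirdrn.kodz.inaction.hadoop.smallfiles.combine.CombineSmallfiles <in> <out>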