Merging Small Files in MapReduce with CombineFileInputFormat

技术蚂蚁

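A small file is a file whose size is smaller than the HDFS block size. Such files cause serious scalability and performance problems for Hadoop. First, in HDFS every block, file, and directory is held in NameNode memory as an object, and each object takes about 150 bytes. With 10,000,000 small files, each occupying one block, the NameNode needs roughly 2 GB of memory; with 100 million files it needs roughly 20 GB (see references [1][4][5]). NameNode memory therefore severely limits how far the cluster can scale. Second, reading a large number of small files is much slower than reading a few large files. HDFS was originally designed for streaming access to large files; reading many small files means hopping constantly from one DataNode to another, which hurts performance badly. Finally, processing a large number of small files is much slower than processing the same volume of data stored in large files. Each small file occupies a task slot, and most of the job's time can end up being spent starting and tearing down tasks rather than doing useful work.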
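The arithmetic behind such estimates can be sketched in a few lines. This is a back-of-the-envelope sketch, assuming one file object plus one block object per small file at about 150 bytes each, and ignoring directories and JVM overhead. Under those assumptions the result is about 3 GB for 10 million files and about 30 GB for 100 million, the same order of magnitude as the figures quoted above. The class and method names are illustrative only.

```java
final class NameNodeMemoryEstimate {
    // Assumption taken from the text: each namespace object costs about 150 bytes.
    static final long BYTES_PER_OBJECT = 150L;

    /** Estimated NameNode heap, in bytes, for `files` files of `blocksPerFile` blocks each. */
    static long estimateBytes(long files, long blocksPerFile) {
        // One object for the file itself plus one object per block.
        long objects = files + files * blocksPerFile;
        return objects * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long[] fileCounts = {10_000_000L, 100_000_000L};
        for (long files : fileCounts) {
            double gigabytes = estimateBytes(files, 1) / 1e9;
            System.out.println(files + " files -> about " + gigabytes + " GB");
        }
    }
}
```

The estimate grows linearly with the number of files, which is why packing many small files into fewer large ones relieves the NameNode.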

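This article starts from the solutions to the small-files problem that ship with Hadoop in the form of tools, namely Hadoop Archive, SequenceFile, and CombineFileInputFormat, and works through the last of these.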

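To process a large number of small files on top of Hadoop's built-in CombineFileInputFormat, the work breaks down into the steps listed below. Note that Hadoop now ships a ready-made implementation, org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat, which can be used directly for plain text input.

Implement a RecordReader that reads the file blocks wrapped by a CombineFileSplit
Extend CombineFileInputFormat with an input format class that uses our custom RecordReader
Implement the Mapper class that processes the data
Configure the MapReduce job that processes the small files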

CombineSmallfileRecordReader class

We implement a RecordReader for CombineFileSplit that internally uses Hadoop's own LineRecordReader to read the text lines of each small file. The code is as follows:

 
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombineSmallfileRecordReader extends RecordReader<LongWritable, BytesWritable> {

    private CombineFileSplit combineFileSplit;
    private LineRecordReader lineRecordReader = new LineRecordReader();
    private Path[] paths;
    private int totalLength;
    private int currentIndex;
    private float currentProgress = 0;
    private LongWritable currentKey;
    private BytesWritable currentValue = new BytesWritable();

    // CombineFileRecordReader creates one instance of this class per file in the split,
    // passing the index of that file as the third argument.
    public CombineSmallfileRecordReader(CombineFileSplit combineFileSplit, TaskAttemptContext context, Integer index) throws IOException {
        super();
        this.combineFileSplit = combineFileSplit;
        this.currentIndex = index; // index of the small-file block to process within the CombineFileSplit
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.combineFileSplit = (CombineFileSplit) split;
        // Process one small-file block of the CombineFileSplit. LineRecordReader reads from a FileSplit,
        // so build one for the current file before reading any data.
        FileSplit fileSplit = new FileSplit(combineFileSplit.getPath(currentIndex), combineFileSplit.getOffset(currentIndex), combineFileSplit.getLength(currentIndex), combineFileSplit.getLocations());
        lineRecordReader.initialize(fileSplit, context);

        this.paths = combineFileSplit.getPaths();
        totalLength = paths.length;
        // Publish the current file name so the Mapper can read it from the configuration.
        context.getConfiguration().set("map.input.file.name", combineFileSplit.getPath(currentIndex).getName());
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        currentKey = lineRecordReader.getCurrentKey();
        return currentKey;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        // Text.getBytes() returns the backing array, which is only valid up to getLength(),
        // so copy exactly getLength() bytes rather than the whole array.
        Text line = lineRecordReader.getCurrentValue();
        currentValue.set(line.getBytes(), 0, line.getLength());
        return currentValue;
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (currentIndex >= 0 && currentIndex < totalLength) {
            return lineRecordReader.nextKeyValue();
        } else {
            return false;
        }
    }

    @Override
    public float getProgress() throws IOException {
        // Progress is approximated by the position of the current file within the split.
        if (currentIndex >= 0 && currentIndex < totalLength) {
            currentProgress = (float) currentIndex / totalLength;
            return currentProgress;
        }
        return currentProgress;
    }

    @Override
    public void close() throws IOException {
        lineRecordReader.close();
    }
}

If your small files come in different formats, you need to pick a different built-in RecordReader for each type of file. That logic also belongs in the class above.
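One possible way to organize that choice is to decide by file name before delegating to a concrete reader. The sketch below is a plain-Java illustration only: the extensions, the `ReaderKind` enum, and the `select` method are assumptions made for the example, and a real job would map each kind to an actual RecordReader inside `initialize`, for instance based on `combineFileSplit.getPath(currentIndex).getName()`.

```java
import java.util.Locale;

final class ReaderSelector {
    /** Hypothetical reader kinds; a real job would map these to RecordReader classes. */
    enum ReaderKind { TEXT_LINES, SEQUENCE_FILE, UNSUPPORTED }

    /** Picks a reader kind from the file extension (an assumed naming convention). */
    static ReaderKind select(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".txt") || name.endsWith(".log")) {
            return ReaderKind.TEXT_LINES;
        }
        if (name.endsWith(".seq")) {
            return ReaderKind.SEQUENCE_FILE;
        }
        return ReaderKind.UNSUPPORTED;
    }

    public static void main(String[] args) {
        for (String fileName : new String[] {"a.txt", "b.seq", "c.bin"}) {
            System.out.println(fileName + " -> " + select(fileName));
        }
    }
}
```

Keeping the decision in one small method makes it easy to fail fast on unexpected file types instead of feeding them to the wrong reader.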

CombineSmallfileInputFormat class

With a RecordReader for CombineFileSplit in place, the next step is to plug CombineSmallfileRecordReader into a CombineFileInputFormat. To do that, we subclass CombineFileInputFormat and override its createRecordReader method. Our implementation, CombineSmallfileInputFormat, looks like this:

 
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class CombineSmallfileInputFormat extends CombineFileInputFormat<LongWritable, BytesWritable> {

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        CombineFileSplit combineFileSplit = (CombineFileSplit) split;
        // The framework calls initialize() on the returned reader before the first record is read,
        // so it is not called here (the original code called it and silently dropped the exception).
        return new CombineFileRecordReader<LongWritable, BytesWritable>(combineFileSplit, context, CombineSmallfileRecordReader.class);
    }
}

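The key point above is that the RecordReader must be created through CombineFileRecordReader, and the arguments to its constructor must have exactly the types and order shown: the first is a CombineFileSplit, the second a TaskAttemptContext, and the third a Class<? extends RecordReader>. CombineFileRecordReader then creates one CombineSmallfileRecordReader per file in the split, which is why that class declares a constructor taking (CombineFileSplit, TaskAttemptContext, Integer).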
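Why the constructor signature matters can be illustrated with plain Java reflection. The sketch below does not use Hadoop: `FakeSplit` and `FakeContext` are hypothetical stand-ins for CombineFileSplit and TaskAttemptContext, and the lookup mimics how a wrapper could locate a per-file reader's constructor by exact parameter types. A reader declared with `int` instead of `Integer` is not found.

```java
import java.lang.reflect.Constructor;

final class ReflectiveReaderDemo {
    // Hypothetical stand-ins for CombineFileSplit and TaskAttemptContext.
    static final class FakeSplit {}
    static final class FakeContext {}

    /** A per-file reader with the expected (split, context, Integer) constructor. */
    static final class GoodReader {
        final int index;
        public GoodReader(FakeSplit split, FakeContext context, Integer index) {
            this.index = index;
        }
    }

    /** A reader whose third parameter is int, so an exact-type lookup does not find it. */
    static final class BadReader {
        public BadReader(FakeSplit split, FakeContext context, int index) {}
    }

    /** Returns true if readerClass declares the exact (FakeSplit, FakeContext, Integer) constructor. */
    static boolean hasExpectedConstructor(Class<?> readerClass) {
        try {
            readerClass.getDeclaredConstructor(FakeSplit.class, FakeContext.class, Integer.class);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Creates a reader for the file at the given index, the way a wrapper would once per file. */
    static GoodReader newReader(int index) {
        try {
            Constructor<GoodReader> ctor =
                GoodReader.class.getDeclaredConstructor(FakeSplit.class, FakeContext.class, Integer.class);
            return ctor.newInstance(new FakeSplit(), new FakeContext(), index);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate reader", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("GoodReader found: " + hasExpectedConstructor(GoodReader.class));
        System.out.println("BadReader found: " + hasExpectedConstructor(BadReader.class));
        System.out.println("Reader index: " + newReader(2).index);
    }
}
```

Because the lookup matches parameter types exactly, a mismatch only shows up at run time, so it is worth keeping the custom reader's constructor identical to the expected form.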

CombineSmallfileMapper class

Next comes the class that implements the map task. The code for CombineSmallfileMapper is as follows:

 
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CombineSmallfileMapper extends Mapper<LongWritable, BytesWritable, Text, BytesWritable> {

    private Text file = new Text();

    @Override
    protected void map(LongWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // File name published by CombineSmallfileRecordReader.initialize()
        String fileName = context.getConfiguration().get("map.input.file.name");
        file.set(fileName);
        context.write(file, value);
    }
}

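The mapper is simple: for each input text line it emits a key-value pair whose key is the name of the source file and whose value is the bytes of the line.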

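CombineSmallfiles class

Finally, here is the entry-point class with the main method, which configures the MapReduce job built from the classes above. The code is as follows: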

package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.shirdrn.kodz.inaction.hadoop.smallfiles.IdentityReducer;

public class CombineSmallfiles {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: combinesmallfiles <in> <out>");
            System.exit(2);
        }

        // Bounds on the size of each combined split
        conf.setInt("mapred.min.split.size", 1);
        conf.setLong("mapred.max.split.size", 26214400); // 25 MB

        conf.setInt("mapred.reduce.tasks", 5);

        Job job = new Job(conf, "combine smallfiles");
        job.setJarByClass(CombineSmallfiles.class);
        job.setMapperClass(CombineSmallfileMapper.class);
        job.setReducerClass(IdentityReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        job.setInputFormatClass(CombineSmallfileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);
    }
}
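The driver caps each combined split at 25 MB through mapred.max.split.size. As a rough illustration of what such a cap does, the plain-Java sketch below groups file lengths into splits and closes a split once its accumulated size reaches the cap. It is a simplification under stated assumptions: it ignores node and rack locality, block boundaries, and the per-node and per-rack minimum sizes that the real CombineFileInputFormat also considers, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitPackingSketch {
    /** Groups file lengths (in bytes) into splits, closing a split once it reaches maxSplitSize. */
    static List<List<Long>> pack(long[] fileLengths, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long length : fileLengths) {
            current.add(length);
            currentSize += length;
            if (currentSize >= maxSplitSize) {
                // The accumulated size has reached the cap: emit this split and start a new one.
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) {
            splits.add(current); // leftover files form the final, smaller split
        }
        return splits;
    }

    public static void main(String[] args) {
        long tenMb = 10L * 1024 * 1024;
        long[] files = new long[10];
        java.util.Arrays.fill(files, tenMb);
        List<List<Long>> splits = pack(files, 26214400L); // same 25 MB cap as the driver
        System.out.println("Number of splits: " + splits.size());
        for (List<Long> split : splits) {
            System.out.println("Files in split: " + split.size());
        }
    }
}
```

With ten 10 MB files and a 25 MB cap, this sketch needs four map tasks instead of ten, which is the kind of reduction in task startup overhead the combined input format is meant to deliver.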
