Small files are files whose size is smaller than the HDFS block size, and they cause serious scalability and performance problems for Hadoop. First, every block, file and directory in HDFS is represented in the namenode's memory as an object of roughly 150 bytes. With 10,000,000 small files, each occupying its own block, the namenode needs about 2 GB of memory; storing 100,000,000 files pushes that to roughly 20 GB (see references [1][4][5]). The namenode's memory capacity therefore becomes a hard limit on how far the cluster can scale. Second, reading a large number of small files is far slower than reading a few large files: HDFS was designed for streaming access to large files, and reading many small files forces the client to hop from one datanode to another, which badly hurts performance. Finally, processing many small files is far slower than processing the same volume of data stored in large files, because every small file occupies its own task slot, and a large share of the job's time, sometimes most of it, is spent just starting and tearing down tasks.
This article first introduces the solutions to the small-file problem that ship with Hadoop (provided in the form of tools), namely Hadoop Archive, SequenceFile and CombineFileInputFormat; the rest of the article then walks through a concrete implementation built on CombineFileInputFormat.
We base our handling of massive numbers of small files on Hadoop's built-in CombineFileInputFormat, so the work to be done is fairly clear. Hadoop also ships a ready-made implementation, org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat, which can be used directly for plain text input (a minimal usage sketch follows the list below). The pieces we need to build are:
- A RecordReader that reads the file blocks wrapped by a CombineFileSplit
- An input format class that extends CombineFileInputFormat and plugs in our custom RecordReader
- A Mapper implementation that processes the data
- A MapReduce job configured to process the massive number of small files
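Before building those custom pieces, note that for plain text input the built-in CombineTextInputFormat alone is often enough. The driver below is only a minimal sketch of that shortcut; the class and mapper names are made up for illustration, and it reuses the same 25 MB split limit as the driver later in this article:

package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical example class, not part of the original implementation below.
public class CombineTextSmallfiles {

    // Pass-through mapper: CombineTextInputFormat delivers every text line of every
    // packed small file as (byte offset, line), just like TextInputFormat would.
    static class PassThroughMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            context.write(key, value);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Pack small files into combined splits of at most 25 MB (same limit as below).
        conf.setLong("mapred.max.split.size", 26214400);

        Job job = new Job(conf, "combine text smallfiles");
        job.setJarByClass(CombineTextSmallfiles.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setNumReduceTasks(0); // map-only job

        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // The only change compared with an ordinary text job: the input format.
        job.setInputFormatClass(CombineTextInputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

When the small files are not plain text, or when the mapper needs to know which file each record came from, we fall back to the custom classes described next.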
- The CombineSmallfileRecordReader class
We implement a RecordReader for CombineFileSplit, internally delegating to Hadoop's own LineRecordReader to read the text lines of each small file. The implementation is as follows:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class CombineSmallfileRecordReader extends RecordReader<LongWritable, BytesWritable> {

    private CombineFileSplit combineFileSplit;
    private LineRecordReader lineRecordReader = new LineRecordReader();
    private Path[] paths;
    private int totalLength;
    private int currentIndex;
    private float currentProgress = 0;
    private LongWritable currentKey;
    private BytesWritable currentValue = new BytesWritable();

    public CombineSmallfileRecordReader(CombineFileSplit combineFileSplit, TaskAttemptContext context, Integer index) throws IOException {
        this.combineFileSplit = combineFileSplit;
        this.currentIndex = index; // index of the file handled by this reader within the CombineFileSplit
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.combineFileSplit = (CombineFileSplit) split;
        // Build a FileSplit for the current small file and hand it to the inner LineRecordReader.
        FileSplit fileSplit = new FileSplit(combineFileSplit.getPath(currentIndex), combineFileSplit.getOffset(currentIndex), combineFileSplit.getLength(currentIndex), combineFileSplit.getLocations());
        lineRecordReader.initialize(fileSplit, context);

        this.paths = combineFileSplit.getPaths();
        totalLength = paths.length;
        // Expose the name of the current file to the mapper through the configuration.
        context.getConfiguration().set("map.input.file.name", combineFileSplit.getPath(currentIndex).getName());
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        currentKey = lineRecordReader.getCurrentKey();
        return currentKey;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        // Copy only the valid bytes of the line (Text.getBytes() may return a larger backing array).
        Text line = lineRecordReader.getCurrentValue();
        currentValue.set(line.getBytes(), 0, line.getLength());
        return currentValue;
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (currentIndex >= 0 && currentIndex < totalLength) {
            return lineRecordReader.nextKeyValue();
        }
        return false;
    }

    @Override
    public float getProgress() throws IOException {
        if (currentIndex >= 0 && currentIndex < totalLength) {
            currentProgress = (float) currentIndex / totalLength;
            return currentProgress;
        }
        return currentProgress;
    }

    @Override
    public void close() throws IOException {
        lineRecordReader.close();
    }
}
If your application has to deal with small files of several different formats, you need to choose a different built-in RecordReader for each file type; that dispatch logic also lives in the class above, as in the hypothetical sketch below.
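Such a dispatching reader is not part of the original code; the following is only a rough sketch under the assumption that file types can be told apart by extension and that the .seq files store LongWritable/Text pairs:

package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileRecordReader;

// Hypothetical variant: picks an inner reader per file based on its extension.
public class MultiFormatSmallfileRecordReader extends RecordReader<LongWritable, Text> {

    private CombineFileSplit combineFileSplit;
    private int currentIndex;
    private RecordReader<LongWritable, Text> delegate;

    // Constructor signature required by CombineFileRecordReader: (split, context, index).
    public MultiFormatSmallfileRecordReader(CombineFileSplit combineFileSplit, TaskAttemptContext context, Integer index) {
        this.combineFileSplit = combineFileSplit;
        this.currentIndex = index;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        Path path = combineFileSplit.getPath(currentIndex);
        // Choose the inner reader by file suffix; extend with more formats as needed.
        if (path.getName().endsWith(".seq")) {
            delegate = new SequenceFileRecordReader<LongWritable, Text>();
        } else {
            delegate = new LineRecordReader();
        }
        FileSplit fileSplit = new FileSplit(path, combineFileSplit.getOffset(currentIndex),
                combineFileSplit.getLength(currentIndex), combineFileSplit.getLocations());
        delegate.initialize(fileSplit, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        return delegate.nextKeyValue();
    }

    @Override
    public LongWritable getCurrentKey() throws IOException, InterruptedException {
        return delegate.getCurrentKey();
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return delegate.getCurrentValue();
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return delegate.getProgress();
    }

    @Override
    public void close() throws IOException {
        delegate.close();
    }
}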
- The CombineSmallfileInputFormat class
Having implemented a RecordReader for CombineFileSplit, we now need a CombineFileInputFormat that injects an instance of our CombineSmallfileRecordReader. To do that we subclass CombineFileInputFormat and override its createRecordReader method. Our CombineSmallfileInputFormat looks like this:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

public class CombineSmallfileInputFormat extends CombineFileInputFormat<LongWritable, BytesWritable> {

    @Override
    public RecordReader<LongWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        CombineFileSplit combineFileSplit = (CombineFileSplit) split;
        // CombineFileRecordReader drives one CombineSmallfileRecordReader per file in the split.
        CombineFileRecordReader<LongWritable, BytesWritable> recordReader = new CombineFileRecordReader<LongWritable, BytesWritable>(combineFileSplit, context, CombineSmallfileRecordReader.class);
        try {
            recordReader.initialize(combineFileSplit, context);
        } catch (InterruptedException e) {
            throw new RuntimeException("Error to initialize CombineSmallfileRecordReader.", e);
        }
        return recordReader;
    }
}
The important point here is that the RecordReader must be created through CombineFileRecordReader, and its constructor arguments must have exactly the types and order shown above: the first is a CombineFileSplit, the second a TaskAttemptContext, and the third a Class<? extends RecordReader>. CombineFileRecordReader in turn instantiates the supplied reader class (here CombineSmallfileRecordReader) through its (CombineFileSplit, TaskAttemptContext, Integer) constructor, once for each file in the split.
Next comes the Mapper implementation for our MapReduce job; the CombineSmallfileMapper class is shown below:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CombineSmallfileMapper extends Mapper<LongWritable, BytesWritable, Text, BytesWritable> {

    private Text file = new Text();

    @Override
    protected void map(LongWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // The file name was put into the configuration by CombineSmallfileRecordReader.
        String fileName = context.getConfiguration().get("map.input.file.name");
        file.set(fileName);
        context.write(file, value);
    }
}
This is quite simple: each text line read from the input files arrives as a key/value pair and is written out, keyed by the file name that our RecordReader put into the configuration.
Finally, here is the main entry class, which configures the MapReduce job from the pieces implemented above:
package org.shirdrn.kodz.inaction.hadoop.smallfiles.combine;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.shirdrn.kodz.inaction.hadoop.smallfiles.IdentityReducer;

public class CombineSmallfiles {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: combinesmallfiles <in> <out>");
            System.exit(2);
        }

        // Split-size bounds used by CombineFileInputFormat when packing small files into splits.
        conf.setInt("mapred.min.split.size", 1);
        conf.setLong("mapred.max.split.size", 26214400); // at most 25 MB per combined split

        conf.setInt("mapred.reduce.tasks", 5);

        Job job = new Job(conf, "combine smallfiles");
        job.setJarByClass(CombineSmallfiles.class);
        job.setMapperClass(CombineSmallfileMapper.class);
        job.setReducerClass(IdentityReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        job.setInputFormatClass(CombineSmallfileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);
    }
}
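The IdentityReducer referenced above lives in the org.shirdrn.kodz.inaction.hadoop.smallfiles package and its source is not shown in this article; a minimal sketch, assuming it does nothing but forward each (file name, line bytes) pair into the output SequenceFile, could look like this:

package org.shirdrn.kodz.inaction.hadoop.smallfiles;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sketch of the pass-through reducer used by the driver above.
public class IdentityReducer extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException, InterruptedException {
        for (BytesWritable value : values) {
            context.write(key, value);
        }
    }
}

Once everything is packaged into a jar (the jar name below is only a placeholder), the job can be submitted in the usual way, for example: hadoop jar smallfiles.jar org.shirdrn.kodz.inaction.hadoop.smallfiles.combine.CombineSmallfiles <in> <out>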