Hadoop MapReduce on Massive Numbers of Small Files: A Custom InputFormat and RecordReader

Generally speaking, the Hadoop MapReduce framework is designed for massive amounts of large data; that is the workload where Hadoop truly shows its strength. Huge numbers of small files can still be processed with Hadoop, but processing them directly is inefficient, and, given how HDFS is architected, a large population of small files consumes a great deal of NameNode memory just for bookkeeping the file metadata. Moreover, because the files are small (far below the default HDFS block size of 64 MB, say 1 KB to 2 MB), a job cannot take full advantage of data locality, so large amounts of data get shipped around the cluster at considerable cost.
Nevertheless, such scenarios do arise in practice, and the need to process masses of small files is widespread. When using Hadoop for such computation, we should turn the small data into big data, for example by merging or compressing the files, which to some degree makes the workload a much better fit for the Hadoop cluster computing model. Hadoop also ships with some built-in solutions, and the APIs it provides make them easy to apply.
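For instance, one built-in route is to pack the small files into a single SequenceFile up front, using Hadoop's SequenceFile.Writer API. The following is a minimal sketch under stated assumptions (the class name is made up, args[0] is assumed to be a local directory of small files, args[1] an HDFS target path); it is not the approach taken in the rest of this article, which merges the files inside MapReduce itself:

package org.shirdrn.kodz.inaction.hadoop.smallfiles; // hypothetical class, for illustration

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilesToSequenceFile {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path(args[1]); // target SequenceFile on HDFS
        SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, output, Text.class, BytesWritable.class);
        try {
            // Assumes args[0] is an existing local directory of small files.
            for (File f : new File(args[0]).listFiles()) {
                byte[] content = new byte[(int) f.length()];
                FileInputStream in = new FileInputStream(f);
                try {
                    IOUtils.readFully(in, content, 0, content.length);
                } finally {
                    IOUtils.closeStream(in);
                }
                // One record per small file: key = file name, value = raw bytes.
                writer.append(new Text(f.getName()), new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }
}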
Below, we implement parallel processing of massive numbers of small files through a custom InputFormat and RecordReader.
The basic idea is as follows:
the Mapper merges the small files, and each record of the output consists of two parts: the small file's name and that file's content.

Implementation

We implement a WholeFileInputFormat to control the Mapper's input specification, with a custom WholeFileRecordReader handling the actual reading during input. When the Map tasks finish, we write the Map output to HDFS as-is, using the simplest possible IdentityReducer.
Now let's look at what we need to implement:

  1. WholeFileRecordReader, which reads the content of each small file
  2. WholeFileInputFormat, which defines the input specification for the small files
  3. WholeSmallfilesMapper, the Mapper implementation that merges the small files
  4. IdentityReducer, the Reducer implementation that writes out the merged files
  5. The job configuration that merges many small files into a few large ones

The following sections describe each of these in detail.

  • The WholeFileRecordReader class

On the input key/value types: each small file corresponds to one InputSplit, and reading that InputSplit really means reading the content of the entire file (which fits in a single block) into a BytesWritable, that is, into a byte array.

package org.shirdrn.kodz.inaction.hadoop.smallfiles.whole;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private JobContext jobContext;
    private NullWritable currentKey = NullWritable.get();
    private BytesWritable currentValue;
    private boolean finishConverting = false;

    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return currentKey;
    }

    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return currentValue;
    }

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.fileSplit = (FileSplit) split;
        this.jobContext = context;
        // Stash the file name so the Mapper can use it as the output key.
        context.getConfiguration().set("map.input.file", fileSplit.getPath().getName());
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!finishConverting) {
            // Read the whole file backing this split into a byte array.
            currentValue = new BytesWritable();
            int len = (int) fileSplit.getLength();
            byte[] content = new byte[len];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(jobContext.getConfiguration());
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, content, 0, len);
                currentValue.set(content, 0, len);
            } finally {
                if (in != null) {
                    IOUtils.closeStream(in);
                }
            }
            finishConverting = true;
            return true; // exactly one record per file
        }
        return false;
    }

    @Override
    public float getProgress() throws IOException {
        float progress = 0;
        if (finishConverting) {
            progress = 1;
        }
        return progress;
    }

    @Override
    public void close() throws IOException {
        // Nothing to close: the input stream is closed in nextKeyValue().
    }
}

We extend the RecordReader abstract class; the core of it is the record-iteration logic. On each iteration the framework calls nextKeyValue() to check whether another record can be read, and the method sets the current key and value directly, which getCurrentKey() and getCurrentValue() then return. Because the whole file is a single record here, nextKeyValue() returns true exactly once per file and false thereafter.
We also set "map.input.file" to the file name, so that the Map task can retrieve it and write the file name out as the key.
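To make that contract concrete, this is roughly how the framework drives the reader, paraphrased from Mapper.run() in Hadoop 1.x and shown purely for illustration; the Context delegates these calls to our WholeFileRecordReader:

public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    // Each iteration pulls one record from the RecordReader.
    while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
}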

  • The WholeFileInputFormat class
package org.shirdrn.kodz.inaction.hadoop.smallfiles.whole;

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // Never split a file: the RecordReader reads each file whole.
        return false;
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        RecordReader<NullWritable, BytesWritable> recordReader = new WholeFileRecordReader();
        recordReader.initialize(split, context);
        return recordReader;
    }
}

This class is fairly simple: extending FileInputFormat, we implement createRecordReader() to return the RecordReader used to read file records, creating an instance of the WholeFileRecordReader implemented above and calling its initialize() method. We also override isSplitable() to return false, which guarantees that a file is never split across InputSplits, so the whole-file read in nextKeyValue() stays valid.

  • The WholeSmallfilesMapper class
package org.shirdrn.kodz.inaction.hadoop.smallfiles.whole;

import java.io.IOException;

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WholeSmallfilesMapper extends Mapper<NullWritable, BytesWritable, Text, BytesWritable> {

    private Text file = new Text();

    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        // Emit (file name, file content); the name was stashed in the
        // configuration by WholeFileRecordReader.initialize().
        String fileName = context.getConfiguration().get("map.input.file");
        file.set(fileName);
        context.write(file, value);
    }
}
  • The IdentityReducer class
package org.shirdrn.kodz.inaction.hadoop.smallfiles;

import java.io.IOException;

import org.apache.hadoop.mapreduce.Reducer;

public class IdentityReducer<K, V> extends Reducer<K, V, K, V> {

    @Override
    protected void reduce(K key, Iterable<V> values, Context context) throws IOException, InterruptedException {
        // Pass every (key, value) pair through unchanged.
        for (V value : values) {
            context.write(key, value);
        }
    }
}

This is the Reduce task implementation; it simply writes the Map output to HDFS unchanged.

  • The WholeCombinedSmallfiles driver
package org.shirdrn.kodz.inaction.hadoop.smallfiles.whole;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.shirdrn.kodz.inaction.hadoop.smallfiles.IdentityReducer;

public class WholeCombinedSmallfiles {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: combinesmallfiles <in> <out>");
            System.exit(2);
        }

        Job job = new Job(conf, "combine smallfiles");

        job.setJarByClass(WholeCombinedSmallfiles.class);
        job.setMapperClass(WholeSmallfilesMapper.class);
        job.setReducerClass(IdentityReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(BytesWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);

        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setNumReduceTasks(5);

        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

        int exitFlag = job.waitForCompletion(true) ? 0 : 1;
        System.exit(exitFlag);
    }
}

This is the program's entry point; it does nothing beyond configuring the MapReduce job. We set 5 Reduce tasks, so there will eventually be 5 output files.
The Reduce tasks' output format here is the one defined by SequenceFileOutputFormat, i.e., SequenceFile, a binary format.
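If you later want to consume those SequenceFiles programmatically rather than via hadoop fs -text, a minimal sketch like the following (the class name and path argument are illustrative, not part of the job above) recovers the (file name, content) records with Hadoop's SequenceFile.Reader:

package org.shirdrn.kodz.inaction.hadoop.smallfiles; // hypothetical class, for illustration

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadCombinedSmallfiles {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]); // e.g. one of the part-r-* files
        FileSystem fs = path.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text fileName = new Text();
            BytesWritable content = new BytesWritable();
            while (reader.next(fileName, content)) {
                // Each record is one original small file: name plus raw bytes.
                System.out.println(fileName + "\t" + content.getLength() + " bytes");
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}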

Running the Program

  • Preparation
jar -cvf combine-smallfiles.jar -C ./ org/shirdrn/kodz/inaction/hadoop/smallfiles
xiaoxiang@ubuntu3:~$ cd /opt/stone/cloud/hadoop-1.0.3
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -mkdir /user/xiaoxiang/datasets/smallfiles
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -copyFromLocal /opt/stone/cloud/dataset/smallfiles/* /user/xiaoxiang/datasets/smallfiles
  • Running the MapReduce program
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop jar combine-smallfiles.jar org.shirdrn.kodz.inaction.hadoop.smallfiles.whole.WholeCombinedSmallfiles /user/xiaoxiang/datasets/smallfiles /user/xiaoxiang/output/smallfiles/whole
13/03/23 14:09:24 INFO input.FileInputFormat: Total input paths to process : 117
13/03/23 14:09:24 INFO mapred.JobClient: Running job: job_201303111631_0016
13/03/23 14:09:25 INFO mapred.JobClient: map 0% reduce 0%
13/03/23 14:09:40 INFO mapred.JobClient: map 1% reduce 0%
13/03/23 14:09:46 INFO mapred.JobClient: map 3% reduce 0%
13/03/23 14:09:52 INFO mapred.JobClient: map 5% reduce 0%
13/03/23 14:09:58 INFO mapred.JobClient: map 6% reduce 0%
13/03/23 14:10:04 INFO mapred.JobClient: map 8% reduce 0%
13/03/23 14:10:10 INFO mapred.JobClient: map 10% reduce 0%
13/03/23 14:10:13 INFO mapred.JobClient: map 10% reduce 1%
13/03/23 14:10:16 INFO mapred.JobClient: map 11% reduce 1%
13/03/23 14:10:22 INFO mapred.JobClient: map 13% reduce 1%
13/03/23 14:10:28 INFO mapred.JobClient: map 15% reduce 1%
13/03/23 14:10:34 INFO mapred.JobClient: map 17% reduce 1%
13/03/23 14:10:40 INFO mapred.JobClient: map 18% reduce 2%
13/03/23 14:10:46 INFO mapred.JobClient: map 20% reduce 2%
13/03/23 14:10:52 INFO mapred.JobClient: map 22% reduce 2%
13/03/23 14:10:58 INFO mapred.JobClient: map 23% reduce 2%
13/03/23 14:11:04 INFO mapred.JobClient: map 25% reduce 3%
13/03/23 14:11:10 INFO mapred.JobClient: map 27% reduce 3%
13/03/23 14:11:16 INFO mapred.JobClient: map 29% reduce 3%
13/03/23 14:11:22 INFO mapred.JobClient: map 30% reduce 3%
13/03/23 14:11:28 INFO mapred.JobClient: map 32% reduce 3%
13/03/23 14:11:34 INFO mapred.JobClient: map 34% reduce 4%
13/03/23 14:11:40 INFO mapred.JobClient: map 35% reduce 4%
13/03/23 14:11:46 INFO mapred.JobClient: map 37% reduce 4%
13/03/23 14:11:52 INFO mapred.JobClient: map 39% reduce 4%
13/03/23 14:11:58 INFO mapred.JobClient: map 41% reduce 5%
13/03/23 14:12:04 INFO mapred.JobClient: map 42% reduce 5%
13/03/23 14:12:10 INFO mapred.JobClient: map 44% reduce 5%
13/03/23 14:12:16 INFO mapred.JobClient: map 46% reduce 5%
13/03/23 14:12:22 INFO mapred.JobClient: map 47% reduce 5%
13/03/23 14:12:25 INFO mapred.JobClient: map 47% reduce 6%
13/03/23 14:12:28 INFO mapred.JobClient: map 49% reduce 6%
13/03/23 14:12:34 INFO mapred.JobClient: map 51% reduce 6%
13/03/23 14:12:40 INFO mapred.JobClient: map 52% reduce 6%
13/03/23 14:12:46 INFO mapred.JobClient: map 54% reduce 7%
13/03/23 14:12:52 INFO mapred.JobClient: map 56% reduce 7%
13/03/23 14:12:58 INFO mapred.JobClient: map 58% reduce 7%
13/03/23 14:13:04 INFO mapred.JobClient: map 59% reduce 7%
13/03/23 14:13:10 INFO mapred.JobClient: map 61% reduce 7%
13/03/23 14:13:13 INFO mapred.JobClient: map 61% reduce 8%
13/03/23 14:13:16 INFO mapred.JobClient: map 63% reduce 8%
13/03/23 14:13:22 INFO mapred.JobClient: map 64% reduce 8%
13/03/23 14:13:28 INFO mapred.JobClient: map 66% reduce 8%
13/03/23 14:13:34 INFO mapred.JobClient: map 68% reduce 8%
13/03/23 14:13:40 INFO mapred.JobClient: map 70% reduce 9%
13/03/23 14:13:46 INFO mapred.JobClient: map 71% reduce 9%
13/03/23 14:13:52 INFO mapred.JobClient: map 73% reduce 9%
13/03/23 14:13:58 INFO mapred.JobClient: map 75% reduce 9%
13/03/23 14:14:04 INFO mapred.JobClient: map 76% reduce 9%
13/03/23 14:14:10 INFO mapred.JobClient: map 78% reduce 10%
13/03/23 14:14:16 INFO mapred.JobClient: map 80% reduce 10%
13/03/23 14:14:22 INFO mapred.JobClient: map 82% reduce 10%
13/03/23 14:14:28 INFO mapred.JobClient: map 83% reduce 10%
13/03/23 14:14:34 INFO mapred.JobClient: map 85% reduce 10%
13/03/23 14:14:37 INFO mapred.JobClient: map 85% reduce 11%
13/03/23 14:14:40 INFO mapred.JobClient: map 87% reduce 11%
13/03/23 14:14:46 INFO mapred.JobClient: map 88% reduce 11%
13/03/23 14:14:52 INFO mapred.JobClient: map 90% reduce 11%
13/03/23 14:14:58 INFO mapred.JobClient: map 92% reduce 12%
13/03/23 14:15:04 INFO mapred.JobClient: map 94% reduce 12%
13/03/23 14:15:10 INFO mapred.JobClient: map 95% reduce 12%
13/03/23 14:15:16 INFO mapred.JobClient: map 97% reduce 12%
13/03/23 14:15:22 INFO mapred.JobClient: map 99% reduce 12%
13/03/23 14:15:28 INFO mapred.JobClient: map 100% reduce 13%
13/03/23 14:15:37 INFO mapred.JobClient: map 100% reduce 26%
13/03/23 14:15:40 INFO mapred.JobClient: map 100% reduce 39%
13/03/23 14:15:49 INFO mapred.JobClient: map 100% reduce 59%
13/03/23 14:15:52 INFO mapred.JobClient: map 100% reduce 79%
13/03/23 14:15:58 INFO mapred.JobClient: map 100% reduce 100%
13/03/23 14:16:03 INFO mapred.JobClient: Job complete: job_201303111631_0016
13/03/23 14:16:03 INFO mapred.JobClient: Counters: 29
13/03/23 14:16:03 INFO mapred.JobClient: Job Counters
13/03/23 14:16:03 INFO mapred.JobClient: Launched reduce tasks=5
13/03/23 14:16:03 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=491322
13/03/23 14:16:03 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
13/03/23 14:16:03 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
13/03/23 14:16:03 INFO mapred.JobClient: Launched map tasks=117
13/03/23 14:16:03 INFO mapred.JobClient: Data-local map tasks=117
13/03/23 14:16:03 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=719836
13/03/23 14:16:03 INFO mapred.JobClient: File Output Format Counters
13/03/23 14:16:03 INFO mapred.JobClient: Bytes Written=147035685
13/03/23 14:16:03 INFO mapred.JobClient: FileSystemCounters
13/03/23 14:16:03 INFO mapred.JobClient: FILE_BYTES_READ=147032689
13/03/23 14:16:03 INFO mapred.JobClient: HDFS_BYTES_READ=147045529
13/03/23 14:16:03 INFO mapred.JobClient: FILE_BYTES_WRITTEN=296787727
13/03/23 14:16:03 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=147035685
13/03/23 14:16:03 INFO mapred.JobClient: File Input Format Counters
13/03/23 14:16:03 INFO mapred.JobClient: Bytes Read=147029851
13/03/23 14:16:03 INFO mapred.JobClient: Map-Reduce Framework
13/03/23 14:16:03 INFO mapred.JobClient: Map output materialized bytes=147036169
13/03/23 14:16:03 INFO mapred.JobClient: Map input records=117
13/03/23 14:16:03 INFO mapred.JobClient: Reduce shuffle bytes=145779618
13/03/23 14:16:03 INFO mapred.JobClient: Spilled Records=234
13/03/23 14:16:03 INFO mapred.JobClient: Map output bytes=147032074
13/03/23 14:16:03 INFO mapred.JobClient: CPU time spent (ms)=79550
13/03/23 14:16:03 INFO mapred.JobClient: Total committed heap usage (bytes)=19630391296
13/03/23 14:16:03 INFO mapred.JobClient: Combine input records=0
13/03/23 14:16:03 INFO mapred.JobClient: SPLIT_RAW_BYTES=15678
13/03/23 14:16:03 INFO mapred.JobClient: Reduce input records=117
13/03/23 14:16:03 INFO mapred.JobClient: Reduce input groups=117
13/03/23 14:16:03 INFO mapred.JobClient: Combine output records=0
13/03/23 14:16:03 INFO mapred.JobClient: Physical memory (bytes) snapshot=20658409472
13/03/23 14:16:03 INFO mapred.JobClient: Reduce output records=117
13/03/23 14:16:03 INFO mapred.JobClient: Virtual memory (bytes) snapshot=65064620032
13/03/23 14:16:03 INFO mapred.JobClient: Map output records=117
  • Verifying the results
xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -ls /user/xiaoxiang/output/smallfiles/whole
Found 7 items
-rw-r--r-- 3 xiaoxiang supergroup 0 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/_SUCCESS
drwxr-xr-x - xiaoxiang supergroup 0 2013-03-23 14:09 /user/xiaoxiang/output/smallfiles/whole/_logs
-rw-r--r-- 3 xiaoxiang supergroup 30161482 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/part-r-00000
-rw-r--r-- 3 xiaoxiang supergroup 30160646 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/part-r-00001
-rw-r--r-- 3 xiaoxiang supergroup 27647901 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/part-r-00002
-rw-r--r-- 3 xiaoxiang supergroup 30161567 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/part-r-00003
-rw-r--r-- 3 xiaoxiang supergroup 28904089 2013-03-23 14:15 /user/xiaoxiang/output/smallfiles/whole/part-r-00004

xiaoxiang@ubuntu3:/opt/stone/cloud/hadoop-1.0.3$ bin/hadoop fs -text /user/xiaoxiang/output/smallfiles/whole/part-r-00000 | cut -d" " -f 1
data_50000_000 53
data_50000_005 4c
data_50000_014 47
data_50000_019 47
data_50000_023 50
data_50000_028 54
data_50000_032 45
data_50000_037 55
data_50000_041 4e
data_50000_046 4d
data_50000_050 4c
data_50000_055 55
data_50000_064 54
data_50000_069 42
data_50000_073 48
data_50000_078 54
data_50000_082 42
data_50000_087 53
data_50000_091 43
data_50000_096 41
data_50000_203 4d
data_50000_208 49
data_50000_212 48
data_50000_230 46

As you can see, the Reducer stage produced 5 files, each one a large file obtained by merging the small files. If these files need further processing, implement a Mapper that fits the actual processing and point the job's input path at the Reducer output path above; a minimal driver sketch follows below. In this way, processing a huge number of small files becomes processing a handful of large files, which can take full advantage of a Hadoop MapReduce compute cluster.
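As a rough sketch of such a follow-up job (the class, its example Mapper logic, and the hard-coded paths are illustrative assumptions, not part of the code above), the only essential change is switching the input format to SequenceFileInputFormat so the merged records come back as (Text, BytesWritable) pairs:

package org.shirdrn.kodz.inaction.hadoop.smallfiles.whole; // hypothetical class, for illustration

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ProcessCombinedSmallfiles {

    // Placeholder per-file logic: here it just emits each original file's size.
    public static class SmallfileSizeMapper extends Mapper<Text, BytesWritable, Text, IntWritable> {

        private final IntWritable size = new IntWritable();

        @Override
        protected void map(Text fileName, BytesWritable content, Context context)
                throws IOException, InterruptedException {
            size.set(content.getLength());
            context.write(fileName, size);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "process combined smallfiles");
        job.setJarByClass(ProcessCombinedSmallfiles.class);
        job.setMapperClass(SmallfileSizeMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The essential change: read the merged SequenceFiles produced above.
        job.setInputFormatClass(SequenceFileInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/user/xiaoxiang/output/smallfiles/whole"));
        FileOutputFormat.setOutputPath(job, new Path("/user/xiaoxiang/output/smallfiles/processed"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}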
