A MapReduce Program Example, The Devil Is in the Details (Part 6): CombineFileInputFormat

Hadoop's MapReduce framework is designed for large files, but in practice you will inevitably run into large numbers of small files, as with the character-count MR job we have been building in this series.

The input consists of three small files, so each file produces at least one split, and each split in turn produces one map task. For such a small amount of data, the cost of starting a JVM per task is disproportionately high.
This is where CombineFileInputFormat comes in: it packs the small input files together so they can be processed by fewer map tasks.
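As an aside (not covered in the original post): Hadoop 2.x also ships a stock org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat that does this packing for plain-text input. A minimal sketch of plugging it in, assuming a Job named job has already been configured:

// Assumption: Hadoop 2.x+, where CombineTextInputFormat is available.
// It extends CombineFileInputFormat<LongWritable, Text>.
job.setInputFormatClass(CombineTextInputFormat.class);
// Cap each combined split at 64 MB (static helper inherited from FileInputFormat).
CombineTextInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);

Writing our own subclass, as this article does, shows what such a format does under the hood.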


In this experiment, we write a custom InputFormat to reduce the number of mapper tasks.

Before writing a custom InputFormat, let's first look at what an InputFormat is, along with the related concepts InputSplit and RecordReader.


InputFormat describes the input format of an MR job. It serves three purposes (the underlying contract is sketched right after this list):
  1. Validate that the job's input specification is sound.
  2. Split the input files into logical InputSplits; each InputSplit instance is then fed to a separate Mapper. (This is the focus of this article and where the optimization lives: more InputSplits mean more map tasks.)
  3. Provide a RecordReader that extracts key/value pairs from an InputSplit for the mapper to consume.
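For reference, these responsibilities map directly onto the two abstract methods of the real org.apache.hadoop.mapreduce.InputFormat class; the sketch below reproduces its shape for illustration (comments are ours; in practice the input validation of item 1 happens inside getSplits, and the real class lives in hadoop-mapreduce-client-core):

import java.io.IOException;
import java.util.List;

public abstract class InputFormat<K, V> {
    // Item 2: carve the input into logical InputSplits, one per map task.
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Item 3: create the RecordReader that turns one split into key/value pairs.
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}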

File-based InputFormats split their input into InputSplits according to file size by default. In general, the upper bound on an InputSplit is the block size of the distributed file system (128 MB by default; 64 MB in earlier versions), and the lower bound can be set via mapreduce.input.fileinputformat.split.minsize.
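As a minimal sketch (ours, not from the original post) of where those knobs live: FileInputFormat exposes static helpers that write the corresponding split-size properties into the job configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        // Writes mapreduce.input.fileinputformat.split.minsize (in bytes).
        FileInputFormat.setMinInputSplitSize(job, 1L);
        // Writes mapreduce.input.fileinputformat.split.maxsize (in bytes).
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}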

InputSplit

An InputSplit describes the chunk of data that is sent to one mapper. In general, an InputSplit presents a byte-oriented view of that data; turning those bytes into records is the job of the RecordReader, discussed next.
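The abstraction itself is tiny; for reference, this is the shape of the real org.apache.hadoop.mapreduce.InputSplit class (reproduced for illustration, comments ours):

import java.io.IOException;

public abstract class InputSplit {
    // The split's size in bytes; the framework sorts splits by size so the
    // largest ones are processed first.
    public abstract long getLength() throws IOException, InterruptedException;

    // Hostnames where the split's data is stored, used for locality-aware
    // scheduling of map tasks.
    public abstract String[] getLocations() throws IOException, InterruptedException;
}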

RecordReader

A RecordReader is responsible for parsing key/value pairs out of an InputSplit and handing them to the map function.
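To make the division of labor concrete, here is a simplified sketch (ours, not Hadoop's actual MapTask code) of how the framework drives a RecordReader during a map task:

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class ReaderLoopSketch {
    // Simplified read loop; the real framework passes the current key/value
    // to the Mapper through a Context object rather than exposing them like this.
    static <K, V> void drive(RecordReader<K, V> reader, InputSplit split,
            TaskAttemptContext context) throws Exception {
        reader.initialize(split, context);       // bind the reader to its split
        while (reader.nextKeyValue()) {          // advance to the next record
            K key = reader.getCurrentKey();      // key handed to map()
            V value = reader.getCurrentValue();  // value handed to map()
            // ... map(key, value) would be invoked here ...
        }
        reader.close();
    }
}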


The two methods that matter most in an InputFormat are getSplits and createRecordReader. getSplits is already implemented by FileInputFormat, the class that file-based InputFormats generally extend, so the real work is in the RecordReader.
Let's go straight to the code.
Pay attention to the following points:

  1. MyCombinedFilesInputFormat extends CombineFileInputFormat.
  2. createRecordReader returns a CombineFileRecordReader, whose constructor takes our custom RecordReader class.
  3. MyRecordReader extends RecordReader.
  4. Note how the split is handled in the initialize method.
MyCombinedFilesInputFormat



package wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MyCombinedFilesInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader hands each file packed into the combined
        // split to a fresh instance of our MyRecordReader.
        return new CombineFileRecordReader<LongWritable, Text>((CombineFileSplit) split,
                context, MyRecordReader.class);
    }

    public static class MyRecordReader extends RecordReader<LongWritable, Text> {
        private Integer index;            // index of this file within the CombineFileSplit
        private LineRecordReader reader;  // does the actual line-by-line reading

        public MyRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
            this.index = index;
            reader = new LineRecordReader();
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Carve the single file at `index` out of the combined split and
            // hand it to the wrapped LineRecordReader as an ordinary FileSplit.
            CombineFileSplit cfsplit = (CombineFileSplit) split;
            FileSplit fileSplit = new FileSplit(cfsplit.getPath(index),
                    cfsplit.getOffset(index),
                    cfsplit.getLength(index),
                    cfsplit.getLocations());
            reader.initialize(fileSplit, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return reader.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return reader.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return reader.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return reader.getProgress();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }
}
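A note on the design: CombineFileRecordReader is a generic wrapper that walks the files packed into a CombineFileSplit and, via reflection, constructs one MyRecordReader per file; that is why MyRecordReader must expose exactly the (CombineFileSplit, TaskAttemptContext, Integer) constructor shown above. The Integer is the file's index within the combined split, which initialize then uses to carve out a single-file FileSplit for the wrapped LineRecordReader.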

Now let's look at the job configuration.
Note the following two lines:

  1. job.setInputFormatClass(MyCombinedFilesInputFormat.class);
  2. MyCombinedFilesInputFormat.setMaxInputSplitSize(job, 1024*1024*64);


// Inside MyWordCountJob (which implements Tool):
@Override
public int run(String[] args) throws Exception {
    // Validate the parameters: expect exactly <input path> <output path>.
    if (args.length != 2) {
        return -1;
    }

    Job job = Job.getInstance(getConf(), "MyWordCountJob");
    job.setJarByClass(MyWordCountJob.class);

    Path inPath = new Path(args[0]);
    Path outPath = new Path(args[1]);

    // Remove any stale output from a previous run.
    outPath.getFileSystem(getConf()).delete(outPath, true);
    TextInputFormat.setInputPaths(job, inPath);
    TextOutputFormat.setOutputPath(job, outPath);

    job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
    job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);

    // The two lines that matter: plug in the combining input format
    // and cap each combined split at 64 MB.
    job.setInputFormatClass(MyCombinedFilesInputFormat.class);
    MyCombinedFilesInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 64);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
}
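For completeness, a minimal sketch of the driver's entry point, assuming (as the run() override implies) that MyWordCountJob implements org.apache.hadoop.util.Tool; this main() is not shown in the original post:

// Requires org.apache.hadoop.conf.Configuration and org.apache.hadoop.util.ToolRunner.
public static void main(String[] args) throws Exception {
    // ToolRunner parses generic Hadoop options, then calls run(args).
    int exitCode = ToolRunner.run(new Configuration(), new MyWordCountJob(), args);
    System.exit(exitCode);
}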
Run the job to verify the effect on the number of map tasks.
As the log below shows, the input consists of three small files ("Total input paths to process : 3"), yet only a single map task is launched ("number of splits:1").


[train@sandbox MyWordCount]$ hadoop jar mywordcount.jar mrdemo/ output
16/05/12 11:12:48 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/12 11:12:49 INFO input.FileInputFormat: Total input paths to process : 3
16/05/12 11:12:49 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 157
16/05/12 11:12:49 INFO mapreduce.JobSubmitter: number of splits:1
16/05/12 11:12:49 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/05/12 11:12:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462517728035_0104
16/05/12 11:12:50 INFO impl.YarnClientImpl: Submitted application application_1462517728035_0104 to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/12 11:12:50 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1462517728035_0104/
16/05/12 11:12:50 INFO mapreduce.Job: Running job: job_1462517728035_0104
16/05/12 11:12:58 INFO mapreduce.Job: Job job_1462517728035_0104 running in uber mode : false
16/05/12 11:12:58 INFO mapreduce.Job: map 0% reduce 0%
16/05/12 11:13:10 INFO mapreduce.Job: map 100% reduce 0%
16/05/12 11:13:17 INFO mapreduce.Job: map 100% reduce 100%
16/05/12 11:13:18 INFO mapreduce.Job: Job job_1462517728035_0104 completed successfully
16/05/12 11:13:18 INFO mapreduce.Job: Counters: 43
        File System Counters
                FILE: Number of bytes read=1198
                FILE: Number of bytes written=170905
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=487
                HDFS: Number of bytes written=108
                HDFS: Number of read operations=8
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Other local map tasks=1
                Total time spent by all maps in occupied slots (ms)=71600
                Total time spent by all reduces in occupied slots (ms)=33720
        Map-Reduce Framework
                Map input records=8
                Map output records=149
                Map output bytes=894
                Map output materialized bytes=1198
                Input split bytes=330
                Combine input records=0
                Combine output records=0
                Reduce input groups=26
                Reduce shuffle bytes=1198
                Reduce input records=149
                Reduce output records=26
                Spilled Records=298
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=54
                CPU time spent (ms)=1850
                Physical memory (bytes) snapshot=445575168
                Virtual memory (bytes) snapshot=1995010048
                Total committed heap usage (bytes)=345636864
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=108


This article kept MyRecordReader deliberately simple; the next installment will show how to use a custom RecordReader to feed user-defined key/value pairs into the map function.

From the ITPUB blog: http://blog.itpub.net/30066956/viewspace-2109215/ (please credit the source when reposting).
