Hadoop MapReduce jobs are designed with large files in mind, but in practice you will inevitably run into lots of small files, as in our character-count MR job: the input is three small files. Each file produces at least one split, and each split spawns its own map task. For such a tiny amount of data, the cost of starting a JVM per task is disproportionately high.

This is where CombineFileInputFormat comes in: it merges small input files into fewer splits. In this experiment we use a custom InputFormat built on it to reduce the number of mapper tasks.
An InputFormat describes the input specification of an MR job. It has the following responsibilities:
- Validate that the input files are well-formed.
- Split the input files into logical InputSplits, each of which is handed to a separate Mapper. (This is the focus of this post and the place to optimize: too many InputSplits mean too many map tasks.)
- Provide a RecordReader that extracts key/value pairs from an InputSplit for the mapper to consume.
File-based InputFormats split the input by file size by default. Normally the upper bound of an InputSplit is the block size of the distributed file system (128 MB by default; 64 MB in older versions), and the lower bound can be set via mapreduce.input.fileinputformat.split.minsize.
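As a minimal sketch of where those knobs live (the class name SplitSizeDemo is hypothetical, but both static helpers do exist on FileInputFormat):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Hypothetical demo: bound the split size from both ends. The helpers set
// mapreduce.input.fileinputformat.split.minsize / .maxsize on the job.
public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        FileInputFormat.setMinInputSplitSize(job, 1L);
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}
```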
An InputSplit describes which chunk of data is sent to which mapper. Typically an InputSplit presents a byte-oriented view of the input; parsing those bytes is the job of the RecordReader discussed below.
The RecordReader is responsible for extracting, from an InputSplit, the key/value pairs that the map function consumes.
The two essential methods of an InputFormat are getSplits and createRecordReader. getSplits is usually already implemented by FileInputFormat, the base class that file-based InputFormats generally extend, so the focus falls on the RecordReader.
The code follows. Pay attention to these points:
- MyCombinedFilesInputFormat extends CombineFileInputFormat.
- createRecordReader returns a CombineFileRecordReader, whose constructor takes our custom RecordReader class.
- MyRecordReader extends RecordReader.
- Note how the split is handled in the initialize method.
```java
package wordcount;

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class MyCombinedFilesInputFormat extends CombineFileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException {
        // CombineFileRecordReader iterates over the chunks of the combined
        // split, creating one MyRecordReader per chunk via reflection.
        return new CombineFileRecordReader<LongWritable, Text>((CombineFileSplit) split, context, MyRecordReader.class);
    }

    public static class MyRecordReader extends RecordReader<LongWritable, Text> {
        private Integer index;           // which chunk of the combined split this instance reads
        private LineRecordReader reader; // delegate that does the actual line reading

        public MyRecordReader(CombineFileSplit split, TaskAttemptContext context, Integer index) {
            this.index = index;
            reader = new LineRecordReader();
        }

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context)
                throws IOException, InterruptedException {
            // Carve chunk `index` out of the combined split and hand it to the
            // wrapped LineRecordReader as an ordinary FileSplit.
            CombineFileSplit cfsplit = (CombineFileSplit) split;
            FileSplit fileSplit = new FileSplit(cfsplit.getPath(index),
                    cfsplit.getOffset(index),
                    cfsplit.getLength(index),
                    cfsplit.getLocations());
            reader.initialize(fileSplit, context);
        }

        @Override
        public boolean nextKeyValue() throws IOException, InterruptedException {
            return reader.nextKeyValue();
        }

        @Override
        public LongWritable getCurrentKey() throws IOException, InterruptedException {
            return reader.getCurrentKey();
        }

        @Override
        public Text getCurrentValue() throws IOException, InterruptedException {
            return reader.getCurrentValue();
        }

        @Override
        public float getProgress() throws IOException, InterruptedException {
            return reader.getProgress();
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }
}
```
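One design detail worth spelling out: CombineFileRecordReader instantiates the reader class via reflection, once per chunk of the combined split, resolving exactly the (CombineFileSplit, TaskAttemptContext, Integer) constructor; the Integer argument tells each instance which chunk it owns. As a hedged illustration (ConstructorCheck is a hypothetical helper, not part of the original post), that lookup can be reproduced directly:

```java
import java.lang.reflect.Constructor;

import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.CombineFileSplit;

import wordcount.MyCombinedFilesInputFormat.MyRecordReader;

// Hypothetical sanity check: verify that MyRecordReader exposes the
// constructor signature CombineFileRecordReader resolves via reflection.
// If this lookup failed, the job would instead fail at runtime.
public class ConstructorCheck {
    public static void main(String[] args) throws Exception {
        Constructor<MyRecordReader> ctor = MyRecordReader.class.getDeclaredConstructor(
                CombineFileSplit.class, TaskAttemptContext.class, Integer.class);
        System.out.println("Found required constructor: " + ctor);
    }
}
```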
Now let's look at the job configuration. Two lines matter here:
- job.setInputFormatClass(MyCombinedFilesInputFormat.class);
- MyCombinedFilesInputFormat.setMaxInputSplitSize(job, 1024*1024*64);
```java
@Override
public int run(String[] args) throws Exception {
    // Validate the parameters.
    if (args.length != 2) {
        return -1;
    }

    Job job = Job.getInstance(getConf(), "MyWordCountJob");
    job.setJarByClass(MyWordCountJob.class);

    Path inPath = new Path(args[0]);
    Path outPath = new Path(args[1]);

    // Remove the output directory if it already exists.
    outPath.getFileSystem(getConf()).delete(outPath, true);
    MyCombinedFilesInputFormat.setInputPaths(job, inPath);
    TextOutputFormat.setOutputPath(job, outPath);

    job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
    job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);

    // Use the combining input format and cap each combined split at 64 MB.
    job.setInputFormatClass(MyCombinedFilesInputFormat.class);
    MyCombinedFilesInputFormat.setMaxInputSplitSize(job, 1024 * 1024 * 64);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    return job.waitForCompletion(true) ? 0 : 1;
}
```
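For what it's worth, setMaxInputSplitSize is a thin wrapper around a configuration property (the deprecation notice for mapred.max.split.size in the job log below shows the current property name), so the same cap could presumably be set on the configuration directly:

```java
// Assumed equivalent of the setMaxInputSplitSize(job, 1024*1024*64) call:
// write the split-size cap straight into the job's configuration.
job.getConfiguration().setLong(
        "mapreduce.input.fileinputformat.split.maxsize", 64L * 1024 * 1024);
```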
As the job output below shows, although the input is three small files, only one map task is launched.
```
[train@sandbox MyWordCount]$ hadoop jar mywordcount.jar mrdemo/ output
16/05/12 11:12:48 INFO client.RMProxy: Connecting to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/12 11:12:49 INFO input.FileInputFormat: Total input paths to process : 3
16/05/12 11:12:49 INFO input.CombineFileInputFormat: DEBUG: Terminated node allocation with : CompletedNodes: 1, size left: 157
16/05/12 11:12:49 INFO mapreduce.JobSubmitter: number of splits:1
16/05/12 11:12:49 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.jar is deprecated. Instead, use mapreduce.job.jar
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.mapoutput.value.class is deprecated. Instead, use mapreduce.map.output.value.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.map.class is deprecated. Instead, use mapreduce.job.map.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.reduce.class is deprecated. Instead, use mapreduce.job.reduce.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.inputformat.class is deprecated. Instead, use mapreduce.job.inputformat.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.input.dir is deprecated. Instead, use mapreduce.input.fileinputformat.inputdir
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
16/05/12 11:12:49 INFO Configuration.deprecation: mapreduce.outputformat.class is deprecated. Instead, use mapreduce.job.outputformat.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.mapoutput.key.class is deprecated. Instead, use mapreduce.map.output.key.class
16/05/12 11:12:49 INFO Configuration.deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
16/05/12 11:12:49 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1462517728035_0104
16/05/12 11:12:50 INFO impl.YarnClientImpl: Submitted application application_1462517728035_0104 to ResourceManager at sandbox.hortonworks.com/192.168.252.131:8050
16/05/12 11:12:50 INFO mapreduce.Job: The url to track the job: http://sandbox.hortonworks.com:8088/proxy/application_1462517728035_0104/
16/05/12 11:12:50 INFO mapreduce.Job: Running job: job_1462517728035_0104
16/05/12 11:12:58 INFO mapreduce.Job: Job job_1462517728035_0104 running in uber mode : false
16/05/12 11:12:58 INFO mapreduce.Job: map 0% reduce 0%
16/05/12 11:13:10 INFO mapreduce.Job: map 100% reduce 0%
16/05/12 11:13:17 INFO mapreduce.Job: map 100% reduce 100%
16/05/12 11:13:18 INFO mapreduce.Job: Job job_1462517728035_0104 completed successfully
16/05/12 11:13:18 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=1198
        FILE: Number of bytes written=170905
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=487
        HDFS: Number of bytes written=108
        HDFS: Number of read operations=8
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=2
    Job Counters
        Launched map tasks=1
        Launched reduce tasks=1
        Other local map tasks=1
        Total time spent by all maps in occupied slots (ms)=71600
        Total time spent by all reduces in occupied slots (ms)=33720
    Map-Reduce Framework
        Map input records=8
        Map output records=149
        Map output bytes=894
        Map output materialized bytes=1198
        Input split bytes=330
        Combine input records=0
        Combine output records=0
        Reduce input groups=26
        Reduce shuffle bytes=1198
        Reduce input records=149
        Reduce output records=26
        Spilled Records=298
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=54
        CPU time spent (ms)=1850
        Physical memory (bytes) snapshot=445575168
        Virtual memory (bytes) snapshot=1995010048
        Total committed heap usage (bytes)=345636864
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=0
    File Output Format Counters
        Bytes Written=108
```
In this post MyRecordReader does little more than wrap LineRecordReader; the next post will show how to use a custom RecordReader to feed custom key/value pairs into the map function.
Source: ITPUB blog, http://blog.itpub.net/30066956/viewspace-2109215/