In the previous post, we wrote a MapReduce program that counts the occurrences of each character a~z in an input file. The Counter values in its run log showed how much data was transferred over the network between map and reduce.
In this post we introduce the Combiner, a very useful component whose main job is to reduce that network transfer. The idea is simple: on each node that runs a map task, the map output is first aggregated locally, and only the aggregated records are transferred to the reduce task. You can think of it as a map-side reduce.
Let's start with the code. Because our combiner logic is identical to the reducer logic, we can simply reuse MyWordCountReducer as the job's combiner:
job.setCombinerClass(MyWordCountJob.MyWordCountReducer.class);
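One caveat worth noting (standard Hadoop behavior, not something this post's log shows): the framework may run the combiner zero, one, or several times on a map task's output, so the combiner must not change the final result. Reusing the reducer is safe here because summing integers is associative and commutative, and because this reducer's input and output types are both Text/IntWritable, matching the map output types. When the reducer cannot be reused, a standalone combiner class is written instead; a minimal sketch, with the hypothetical class name MyLetterCountCombiner:

    public static class MyLetterCountCombiner
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable sum = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // partial sum over this map task's local output only
            int n = 0;
            for (IntWritable value : values) {
                n += value.get();
            }
            sum.set(n);
            context.write(key, sum);
        }
    }

It would be wired in the same way: job.setCombinerClass(MyWordCountJob.MyLetterCountCombiner.class);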
The interesting part is the run log:
File System Counters
    FILE: Number of bytes read=422
    FILE: Number of bytes written=338601
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=556
    HDFS: Number of bytes written=103
    HDFS: Number of read operations=12
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
Job Counters
    Launched map tasks=3
    Launched reduce tasks=1
    Data-local map tasks=3
    Total time spent by all maps in occupied slots (ms)=355336
    Total time spent by all reduces in occupied slots (ms)=66400
Map-Reduce Framework
    Map input records=8
    Map output records=137
    Map output bytes=822
    Map output materialized bytes=434
    Input split bytes=399
    Combine input records=137
    Combine output records=52
    Reduce input groups=25
    Reduce shuffle bytes=434
    Reduce input records=52
    Reduce output records=25
    Spilled Records=104
    Shuffled Maps =3
    Failed Shuffles=0
    Merged Map outputs=3
    GC time elapsed (ms)=274
    CPU time spent (ms)=3430
    Physical memory (bytes) snapshot=1078874112
    Virtual memory (bytes) snapshot=3947868160
    Total committed heap usage (bytes)=884539392
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=157
File Output Format Counters
    Bytes Written=103
The log shows that the map input and output record counts are still 8 and 137, exactly as they were without a combiner: the combiner does not change the map phase itself. But the reduce input has dropped from 137 records to 52, which means far less data was sent over the network.
The combine counters show where this change comes from: the 137 records emitted by the maps were fed into the combiner (Combine input records=137), and the combiner's 52 output records (Combine output records=52) are what was shuffled to the reducer for the final aggregation.
Note: the combine step is a map-side aggregation and involves no network transfer. It aggregates only the output of the map task on its own node, not the data from all nodes; it is the reduce phase that processes the data from all nodes.
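To sanity-check these numbers against the counters: the job launched 3 map tasks, and there are at most 26 distinct letters, so each map task's combiner can emit at most 26 records, for an upper bound of 3 × 26 = 78 combine output records. The observed 52 fits that bound (each map task saw only some of the letters), and the reducer merges those 52 partial sums into the 25 letters that actually occur in the input (Reduce input groups=25, Reduce output records=25).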
In the next post, we will introduce another way to get a combiner-like effect: in-map aggregation.
Finally, here is the complete code:
package wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class MyWordCountJob extends Configured implements Tool {
    Logger log = Logger.getLogger(MyWordCountJob.class);

    public static class MyWordCountMapper extends
            Mapper<LongWritable, Text, Text, IntWritable> {
        Logger log = Logger.getLogger(MyWordCountJob.class);
        Text mapKey = new Text();
        IntWritable mapValue = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // emit (letter, 1) for every a~z character in the line
            for (char c : value.toString().toLowerCase().toCharArray()) {
                if (c >= 'a' && c <= 'z') {
                    mapKey.set(String.valueOf(c));
                    context.write(mapKey, mapValue);
                }
            }
        }
    }

    public static class MyWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        IntWritable rvalue = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // sum the counts for one letter
            int n = 0;
            for (IntWritable value : values) {
                n += value.get();
            }
            rvalue.set(n);
            context.write(key, rvalue);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        // validate the parameters
        if (args.length != 2) {
            return -1;
        }
        Job job = Job.getInstance(getConf(), "MyWordCountJob");
        job.setJarByClass(MyWordCountJob.class);

        Path inPath = new Path(args[0]);
        Path outPath = new Path(args[1]);
        outPath.getFileSystem(getConf()).delete(outPath, true);
        TextInputFormat.setInputPaths(job, inPath);
        TextOutputFormat.setOutputPath(job, outPath);

        job.setMapperClass(MyWordCountJob.MyWordCountMapper.class);
        job.setReducerClass(MyWordCountJob.MyWordCountReducer.class);
        job.setCombinerClass(MyWordCountJob.MyWordCountReducer.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        int result = 0;
        try {
            result = ToolRunner.run(new Configuration(), new MyWordCountJob(), args);
        } catch (Exception e) {
            e.printStackTrace();
        }
        System.exit(result);
    }
}
In the next post, we will use in-map aggregation to optimize this job; in some situations it is more efficient than a combiner.
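As a rough preview, here is a minimal sketch of what in-mapper aggregation might look like for this job, assuming the counts are accumulated in memory inside the mapper and emitted once in cleanup(); the class name InMapAggregationMapper is illustrative, not from this series:

    public static class InMapAggregationMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        // one in-memory counter per letter a~z, instead of one output
        // record per input character
        private final int[] counts = new int[26];

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            for (char c : value.toString().toLowerCase().toCharArray()) {
                if (c >= 'a' && c <= 'z') {
                    counts[c - 'a']++;
                }
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // emit at most 26 records per map task, after all input is consumed
            Text k = new Text();
            IntWritable v = new IntWritable();
            for (int i = 0; i < 26; i++) {
                if (counts[i] > 0) {
                    k.set(String.valueOf((char) ('a' + i)));
                    v.set(counts[i]);
                    context.write(k, v);
                }
            }
        }
    }

Compared with a combiner, this avoids serializing the intermediate (letter, 1) records entirely, at the cost of holding the counts in the mapper's memory.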