Hadoop Series, Part 7 (A Simple Look at Combiner + Shuffle)

YinJuan791739156

Tags: hadoop, big data, distributed

Article link: https://blog.csdn.net/YinJuan791739156/article/details/135202529

I. Introduction

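A Combiner is a component of a MapReduce program besides the Mapper and the Reducer.
A Combiner's parent class is Reducer.
The difference between a Combiner and a Reducer is where they run. A Combiner runs on the node of each MapTask and receives only that MapTask's output, whereas a Reducer receives the output of all Mappers.
The point of a Combiner is to pre-process the Mapper's output with a local aggregation, which reduces the amount of data sent over the network.
A Combiner may only be used when it does not change the final result, and its output key/value types must match the Reducer's input key/value types (that is, the Mapper's output types).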
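The "must not change the final result" condition is easiest to see with numbers. The following is a minimal plain-Java sketch (no Hadoop classes involved); the class name CombinerSafetyDemo and the sample values are made up for illustration. It shows that summing partial sums gives the true total, while averaging partial averages does not give the true average, which is why a sum can be used as a Combiner and a naive average cannot.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSafetyDemo {

    // Sum is associative and commutative, so partial sums can be summed again.
    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Plain arithmetic mean of the values.
    static double average(List<Integer> values) {
        return (double) sum(values) / values.size();
    }

    public static void main(String[] args) {
        // Hypothetical values emitted for one key by two different MapTasks.
        List<Integer> mapTask1 = Arrays.asList(3, 5, 7);
        List<Integer> mapTask2 = Arrays.asList(2, 6);
        List<Integer> all = Arrays.asList(3, 5, 7, 2, 6);

        // Safe: combining locally and then reducing gives the same total.
        int combinedSum = sum(Arrays.asList(sum(mapTask1), sum(mapTask2)));
        System.out.println(sum(all) + " vs " + combinedSum); // 23 vs 23

        // Unsafe: the average of local averages differs from the true average.
        double averageOfAverages = (average(mapTask1) + average(mapTask2)) / 2;
        System.out.println(average(all) + " vs " + averageOfAverages); // 4.6 vs 4.5
    }
}
```

If an average is needed, one common workaround is to let the Combiner emit partial (sum, count) pairs and divide only in the Reducer.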

II. Hands-on Code

1. Input data

banzhang ni hao
xihuan hadoop banzhang
banzhang ni hao
xihuan hadoop banzhang

2. Combiner class

package cn.nuwa.hap.cm;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;



public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // Wrap the sum in the output value
        outV.set(sum);

        // Emit the (key, sum) pair
        context.write(key, outV);
    }

}

3. Mapper class

package cn.nuwa.hap.cm;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;


public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

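    // Reused output key/value objects, so no new object is allocated per record
    private final Text k = new Text();
    private final IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // Split the line on whitespace and emit (word, 1) for every word
        String line = value.toString();
        String[] words = line.trim().split("\\s+");
        for (String word : words) {
            if (word.isEmpty()) {
                continue; // a blank line yields a single empty token
            }
            k.set(word);
            context.write(k, v);
        }
    }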
}

4. Reducer class

package cn.nuwa.hap.cm;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;



public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reused output value object
    private final IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
        // Sum all counts that arrived for this key
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }

        v.set(sum);
        context.write(key, v);
    }
}

5. Driver class

package cn.nuwa.hap.cm;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;



public class WordCountDriver {
    public static void main(String[] args) throws Exception{
        // 1. Load the configuration and create the Job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2. Associate the jar that contains this Driver
        job.setJarByClass(WordCountDriver.class);

        // 3. Set the Mapper class. The Reducer is left commented out here,
        //    so this run shows what the Combiner does on its own.
        job.setMapperClass(WordCountMapper.class);
//        job.setReducerClass(WordCountReduce.class);
        // 3.1 Set the Combiner
        job.setCombinerClass(WordCountCombiner.class);


        // 4. Set the Mapper output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // 7. Submit the job and exit with its status
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);

    }
}

6. Result

banzhang	4
hadoop	2
hao	2
ni	2
xihuan	2

7. Summary

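The Combiner is designed mainly for local aggregation, which reduces the data sent over the network. When there is only one MapTask, the Combiner alone can produce the same result as the Reducer, which is what the run above shows with the Reducer commented out. Note that Hadoop does not guarantee how many times a Combiner runs (it may be zero, one, or several times), so a real job should still set the Reducer and treat the Combiner purely as an optimization.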
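To make the saving concrete, here is a small plain-Java simulation of the word count above (no Hadoop classes are used). It is only a sketch under assumptions: the class name LocalCombineDemo, the helper names, and the idea that the four input lines are divided between two MapTasks are all made up for illustration. It counts how many records would cross the network with and without local combining, and checks that the final counts are identical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LocalCombineDemo {

    // Map step: emit one (word, 1) pair per word, as WordCountMapper does.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    out.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return out;
    }

    // Summing step shared by the combiner and the reducer in this sketch.
    static TreeMap<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // Returns the records that would be sent to the reduce side.
    static List<Map.Entry<String, Integer>> shuffle(List<List<String>> splits, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = map(split);
            if (useCombiner) {
                // Local aggregation: one record per distinct word per MapTask.
                shuffled.addAll(sumByKey(mapOutput).entrySet());
            } else {
                shuffled.addAll(mapOutput);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        // Assumption: the four input lines are processed by two MapTasks.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("banzhang ni hao", "xihuan hadoop banzhang"),
                Arrays.asList("banzhang ni hao", "xihuan hadoop banzhang"));

        System.out.println(shuffle(splits, false).size()); // 12 records without a combiner
        System.out.println(shuffle(splits, true).size());  // 10 records with a combiner
        System.out.println(sumByKey(shuffle(splits, true)));
        // {banzhang=4, hadoop=2, hao=2, ni=2, xihuan=2}
    }
}
```

The saving is small here because each split contains few repeated words; it grows with the number of duplicate keys per MapTask.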

III. Introduction to Shuffle

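The work Hadoop does on the data after the Mapper and before the Reducer is called the Shuffle.
Records emitted by the Mapper are first stored in a circular buffer in memory.
The circular buffer is 100 MB by default and holds two kinds of content: the record data and the metadata (index) of each record, which are written from a common starting point in opposite directions. When the buffer is 80% full, a background thread starts a spill while the Mapper keeps writing into the remaining space; if the buffer fills up before the spill finishes, the Mapper waits. A spill is the process of writing data from memory to disk: the records are first assigned to partitions, then sorted by key within each partition, and then written out as one spill file. The Combiner is an optional step during the spill. Every spill produces a new spill file, and once the MapTask has processed all of its input, the spill files are merged into a single output file that is partitioned and sorted. This final output can optionally be compressed before it is stored on disk.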
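The map-side steps (partition, sort, spill, merge) can be sketched in plain Java. This is a simplified model and not Hadoop's implementation: the class MapSideSpillDemo and its methods are hypothetical, the buffer is measured in records instead of bytes, spilling happens inline instead of on a background thread, and the final merge is a re-sort standing in for a streaming merge of sorted files. In Hadoop the buffer size and threshold correspond to the properties mapreduce.task.io.sort.mb and mapreduce.map.sort.spill.percent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class MapSideSpillDemo {

    // One buffered record: its target partition plus the key/value pair.
    static final class Rec {
        final int partition;
        final String key;
        final int value;

        Rec(int partition, String key, int value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }
    }

    // Spill order: by partition first, then by key within the partition.
    static final Comparator<Rec> ORDER =
            Comparator.comparingInt((Rec r) -> r.partition).thenComparing(r -> r.key);

    // Same formula as Hadoop's default HashPartitioner.
    static int partitionOf(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Buffers records and cuts a sorted "spill" each time the threshold is reached.
    static List<List<Rec>> collectAndSpill(List<String> words, int capacity,
                                           double spillPercent, int numReduceTasks) {
        int threshold = Math.max(1, (int) (capacity * spillPercent));
        List<List<Rec>> spills = new ArrayList<>();
        List<Rec> buffer = new ArrayList<>();
        for (String word : words) {
            buffer.add(new Rec(partitionOf(word, numReduceTasks), word, 1));
            if (buffer.size() >= threshold) {
                buffer.sort(ORDER);
                spills.add(buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) { // flush what is left when the input ends
            buffer.sort(ORDER);
            spills.add(buffer);
        }
        return spills;
    }

    // Final merge of all spills into one partitioned, sorted output.
    static List<Rec> mergeSpills(List<List<Rec>> spills) {
        List<Rec> merged = new ArrayList<>();
        for (List<Rec> spill : spills) {
            merged.addAll(spill);
        }
        merged.sort(ORDER);
        return merged;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "banzhang", "ni", "hao", "xihuan", "hadoop", "banzhang",
                "banzhang", "ni", "hao", "xihuan", "hadoop", "banzhang");
        // A 10-record buffer with an 80% threshold spills after every 8 records.
        List<List<Rec>> spills = collectAndSpill(words, 10, 0.8, 2);
        System.out.println("spill files: " + spills.size()); // 2 (8 records, then 4)
        for (Rec r : mergeSpills(spills)) {
            System.out.println(r.partition + "\t" + r.key + "\t" + r.value);
        }
    }
}
```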
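On the reduce side, each ReduceTask copies its own partition from the final on-disk output of every Mapper. The copied data is held in memory and written to disk when memory runs short. The pieces from all Mappers are then merge-sorted and grouped by key (which is why the Reducer receives its values as an Iterable), and only after that is our custom reduce method called.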
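The reduce-side sequence of merge, group, and reduce can also be sketched in plain Java. This is an illustration under assumptions, not Hadoop code: the class ReduceSideGroupingDemo and its methods are hypothetical, the merge sort is approximated by sorting the concatenated inputs, and the two map outputs are the locally combined word counts from the example earlier in this post.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ReduceSideGroupingDemo {

    static Map.Entry<String, Integer> entry(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Stand-in for a user-defined reduce(key, values): sums the values.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    static Map<String, Integer> mergeGroupReduce(List<List<Map.Entry<String, Integer>>> mapOutputs) {
        // 1. Merge: bring the sorted outputs of all Mappers into one key order.
        List<Map.Entry<String, Integer>> merged = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            merged.addAll(output);
        }
        merged.sort(Map.Entry.comparingByKey());

        // 2. Group: consecutive records with the same key form one group.
        // 3. Reduce: call reduce once per group, passing the values as an Iterable.
        Map<String, Integer> result = new LinkedHashMap<>();
        int i = 0;
        while (i < merged.size()) {
            String key = merged.get(i).getKey();
            List<Integer> group = new ArrayList<>();
            while (i < merged.size() && merged.get(i).getKey().equals(key)) {
                group.add(merged.get(i).getValue());
                i++;
            }
            result.put(key, reduce(key, group));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed output of two MapTasks after local combining.
        List<Map.Entry<String, Integer>> mapOutput = Arrays.asList(
                entry("banzhang", 2), entry("hadoop", 1), entry("hao", 1),
                entry("ni", 1), entry("xihuan", 1));
        System.out.println(mergeGroupReduce(Arrays.asList(mapOutput, mapOutput)));
        // {banzhang=4, hadoop=2, hao=2, ni=2, xihuan=2}
    }
}
```

Because the merged data is sorted, grouping only needs to compare each record with the one before it, which is the reason the sort comes before the grouping.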

Note:

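If you are new to all this, the phrase "all Mappers" may be confusing, but it is easy to understand. The defining feature of MapReduce is distributed computation, so where can it actually compute in a distributed way? Answer that question and the phrase makes sense. The Mapper phase can be distributed: Hadoop divides the input into splits by size and hands each split to a different MapTask for processing, and each MapTask is in effect an independent Mapper phase. After all MapTasks have finished processing their data, the Reduce side first has to collect their output, and only then does it compute the final statistics.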
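As a rough illustration of how the input size determines the number of MapTasks, the sketch below counts splits for a single splittable file. The class InputSplitDemo is hypothetical, and the split size of 128 MB is an assumption (a common default block size). The 1.1 slop factor mirrors the behavior of FileInputFormat as I understand it: the last split may be up to 10% larger than the split size, so that a tiny trailing split is avoided.

```java
class InputSplitDemo {

    static final long MB = 1024L * 1024L;

    // The last split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Counts the splits (and therefore MapTasks) for one splittable file.
    static int countSplits(long fileSize, long splitSize) {
        int splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long splitSize = 128 * MB; // assumed split size
        System.out.println(countSplits(300 * MB, splitSize)); // 3 (128 + 128 + 44)
        System.out.println(countSplits(130 * MB, splitSize)); // 1 (130 is within the 10% slop)
    }
}
```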
