11 - MapReduce workflow, the shuffle mechanism, partitioning, sorting, and combining

爱上口袋的天空

Last modified 2024-04-18 22:01:51

First published 2019-08-04 12:44:42

Article link: https://blog.csdn.net/K_520_W/article/details/98106410

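I. MapReduce Workflow

The flow above is the complete MapReduce workflow, but the Shuffle phase covers only steps 7 through 16. In detail, the Shuffle works as follows:

1) The MapTask collects the key-value pairs emitted by our map() method and places them in an in-memory buffer.
2) The buffer repeatedly spills to files on the local disk; this may produce several spill files.
3) The spill files are merged into one larger spill file.
4) During both spilling and merging, records are kept partitioned (via the Partitioner) and sorted by key.
5) Each ReduceTask uses its partition number to fetch the data for its partition from every MapTask machine.
6) A ReduceTask therefore gathers result files for the same partition from different MapTasks, and it merges these files again (merge sort).
7) Once they are merged into one large file, the Shuffle is over and the ReduceTask's own logic begins: it reads the key-value pairs from the file one group at a time and calls the user-defined reduce() method.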
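The merge in step 6 can be pictured with a small standalone sketch. It is not Hadoop code: the class name SortedRunMerger and the use of plain string keys are assumptions made for illustration. It merges several runs that are each already sorted, which is the situation a ReduceTask is in after fetching sorted output from different MapTasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Hadoop: merges already-sorted runs of keys.
class SortedRunMerger {

    // Merge sorted lists into one sorted list using a min-heap of cursors.
    static List<String> merge(List<List<String>> runs) {
        // Each heap entry is {runIndex, positionInRun}, ordered by the key it points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heap.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] cursor = heap.poll();
            List<String> run = runs.get(cursor[0]);
            merged.add(run.get(cursor[1]));
            // Advance the cursor within its run, if anything is left.
            if (cursor[1] + 1 < run.size()) {
                heap.add(new int[] {cursor[0], cursor[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = Arrays.asList(
                Arrays.asList("a", "c", "e"),   // sorted output fetched from one MapTask
                Arrays.asList("b", "c", "d"));  // sorted output fetched from another MapTask
        System.out.println(merge(runs));
    }
}
```

Because every input run is already sorted, the merge only ever compares the current head of each run, which is why the map side sorts before the data is shipped.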

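Note:

The size of the Shuffle buffer affects the efficiency of a MapReduce program: in principle, the larger the buffer, the fewer disk I/O operations and the faster the job runs.
The buffer size can be tuned with the parameter mapreduce.task.io.sort.mb, which defaults to 100 (MB).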
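To get a feel for why a larger buffer means fewer disk writes, here is a rough back-of-the-envelope model in plain Java. It is only a sketch: the class name SpillEstimate is made up, the 80% spill threshold is an assumption based on the usual default of mapreduce.map.sort.spill.percent, and per-record metadata overhead is ignored, so real spill counts can be higher.

```java
// Rough model, not Hadoop code: estimates how many spill files one MapTask writes.
class SpillEstimate {

    // A spill is assumed to happen each time the buffer reaches
    // bufferMb * spillPercent / 100 of map output; the final partial
    // buffer is flushed as one more spill.
    static long estimateSpills(long mapOutputMb, int bufferMb, int spillPercent) {
        long usableMb = (long) bufferMb * spillPercent / 100;
        // Integer ceiling division.
        return (mapOutputMb + usableMb - 1) / usableMb;
    }

    public static void main(String[] args) {
        // 400 MB of map output with the default 100 MB buffer.
        System.out.println(estimateSpills(400, 100, 80));
        // The same output with the buffer doubled to 200 MB.
        System.out.println(estimateSpills(400, 200, 80));
    }
}
```

Under these assumptions, doubling the buffer takes the estimate from 5 spills to 3, and fewer spill files also means less merging work afterwards.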

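II. The Shuffle Mechanism

1. The Shuffle Mechanism

The data processing that takes place after the map method and before the reduce method is called the Shuffle.

2. Partitioning

2.1. Partitioning: hands-on example

1) Requirement: write the statistics to different files according to the province each phone number belongs to (partitioning).

2) Input data: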

1   13736230513   192.196.100.1   www.atguigu.com   2481   24681   200
2   13846544121   192.196.100.2           264   0   200
3    13956435636   192.196.100.3           132   1512   200
4    13966251146   192.168.100.1           240   0   404
5    18271575951   192.168.100.2   www.atguigu.com   1527   2106   200
6    84188413   192.168.100.3   www.atguigu.com   4116   1432   200
7    13590439668   192.168.100.4           1116   954   200
8    15910133277   192.168.100.5   www.hao123.com   3156   2936   200
9    13729199489   192.168.100.6           240   0   200
10    13630577991   192.168.100.7   www.shouhu.com   6960   690   200
11    15043685818   192.168.100.8   www.baidu.com   3659   3538   200
12    15959002129   192.168.100.9   www.atguigu.com   1938   180   500
13    13560439638   192.168.100.10           918   4938   200
14    13470253144   192.168.100.11           180   180   200
15    13682846555   192.168.100.12   www.qq.com   1938   2910   200
16    13992314666   192.168.100.13   www.gaga.com   3008   3720   200
17    13509468723   192.168.100.14   www.qinghua.com   7335   110349   404
18    18390173782   192.168.100.15   www.sogou.com   9531   2412   200
19    13975057813   192.168.100.16   www.baidu.com   11058   48243   200
20    13768778790   192.168.100.17           120   120   200
21    13568436656   192.168.100.18   www.alibaba.com   2481   24681   200
22    13568436656   192.168.100.19           1116   954   200

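3) Expected output

Phone numbers beginning with 136, 137, 138, and 139 each go to a separate file (four files in total); numbers with any other prefix go to a fifth file.

4) Requirement analysis

5) Build on the traffic-summary example from the earlier article:

10 - MapReduce: Hadoop serialization and MapReduce framework principles https://blog.csdn.net/K_520_W/article/details/97563837

6) Add a partitioner class as follows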

package com.kgf.mapreduce.partition;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner extends Partitioner<Text, FlowBean> {

    @Override
    public int getPartition(Text text, FlowBean flowBean, int numPartitions) {
        // Take the first three digits of the phone number (prePhone)
        String phone = text.toString();
        String prePhone = phone.substring(0, 3);

        // Choose the partition number from prePhone
        int partition;

        if("136".equals(prePhone)){
            partition = 0;
        }else if("137".equals(prePhone)){
            partition = 1;
        }else if("138".equals(prePhone)){
            partition = 2;
        }else if("139".equals(prePhone)){
            partition = 3;
        }else {
            partition = 4;
        }

        // Return the partition number
        return partition;
    }
}
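The prefix rule can be checked without a cluster. The sketch below is a standalone mirror of the logic, not the Hadoop class itself; the class name PrefixPartitionDemo is hypothetical. It uses startsWith instead of substring(0, 3), an optional variation that avoids an exception if a key ever has fewer than three characters.

```java
// Standalone mirror of the prefix rule; does not depend on Hadoop.
class PrefixPartitionDemo {

    // 136 -> 0, 137 -> 1, 138 -> 2, 139 -> 3, anything else -> 4.
    static int partitionFor(String phone) {
        if (phone.startsWith("136")) {
            return 0;
        } else if (phone.startsWith("137")) {
            return 1;
        } else if (phone.startsWith("138")) {
            return 2;
        } else if (phone.startsWith("139")) {
            return 3;
        }
        return 4;
    }

    public static void main(String[] args) {
        // A few phone numbers taken from the sample input.
        String[] phones = {"13682846555", "13736230513", "13846544121", "13956435636", "84188413"};
        for (String phone : phones) {
            System.out.println(phone + " -> partition " + partitionFor(phone));
        }
    }
}
```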

7) In the driver, set the custom partitioner and the number of ReduceTasks

package com.kgf.mapreduce.partition;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        //1 Get the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        //2 Set the jar by the driver class
        job.setJarByClass(FlowDriver.class);

        //3 Register the custom Mapper and Reducer classes
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        //4 Mapper output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);

        //5 Reducer (final) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        //6 Specify the custom partitioner
        job.setPartitionerClass(ProvincePartitioner.class);

        //7 Set a matching number of ReduceTasks
        job.setNumReduceTasks(5);

        //8 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        //9 Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }

}
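The driver sets five ReduceTasks because the partitioner returns numbers 0 through 4. The sketch below is a simplified model of how, as far as I understand it, the ReduceTask count interacts with a custom partitioner: with a single ReduceTask the custom partitioner is bypassed, and a partition number outside the range of ReduceTasks makes the map side fail. The class name ReduceTaskCountDemo is hypothetical, and IllegalStateException stands in for the IOException a real job would report.

```java
// Simplified model, not Hadoop code: which partition does a record end up in?
class ReduceTaskCountDemo {

    static int effectivePartition(int customPartition, int numReduceTasks) {
        if (numReduceTasks == 1) {
            // Single reducer: everything goes to partition 0, the custom class is not consulted.
            return 0;
        }
        if (customPartition < 0 || customPartition >= numReduceTasks) {
            // Stand-in for the "Illegal partition" failure of a real job.
            throw new IllegalStateException("Illegal partition " + customPartition);
        }
        return customPartition;
    }

    public static void main(String[] args) {
        System.out.println(effectivePartition(4, 5)); // five reducers: partition 4 is valid
        System.out.println(effectivePartition(4, 1)); // one reducer: all records in one file
        try {
            effectivePartition(4, 3);                 // three reducers: partition 4 has no target
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Setting more ReduceTasks than the partitioner ever returns is harmless in this model: the extra tasks simply receive no records, which in a real job shows up as empty output files.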

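3. Sorting with WritableComparable

3.1. How custom sorting with WritableComparable works

When a bean object is used as the key, it must implement the WritableComparable interface and override compareTo; the framework then sorts the records by that comparison.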

@Override
public int compareTo(FlowBean bean) {

	int result;
		
	// Compare by total flow, in descending order
	if (this.sumFlow > bean.getSumFlow()) {
		result = -1;
	}else if (this.sumFlow < bean.getSumFlow()) {
		result = 1;
	}else {
		result = 0;
	}

	return result;
}
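The effect of returning -1 for the larger value can be tried outside Hadoop with an ordinary Comparable. This is a minimal sketch: the classes Flow and DescendingSortDemo are illustrative stand-ins for FlowBean, and Long.compare with swapped arguments is used as a compact equivalent of the if/else chain.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in for FlowBean that carries only the total flow.
class Flow implements Comparable<Flow> {
    final long sumFlow;

    Flow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public int compareTo(Flow other) {
        // Arguments are swapped relative to natural order, so larger totals sort first.
        return Long.compare(other.sumFlow, this.sumFlow);
    }
}

class DescendingSortDemo {

    // Sort the given totals using Flow.compareTo and return them in the resulting order.
    static List<Long> sortedTotals(long... totals) {
        List<Flow> flows = new ArrayList<>();
        for (long total : totals) {
            flows.add(new Flow(total));
        }
        Collections.sort(flows);
        List<Long> result = new ArrayList<>();
        for (Flow flow : flows) {
            result.add(flow.sumFlow);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(sortedTotals(360, 117684, 5856));
    }
}
```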

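3.2. WritableComparable sorting example (total sort)

1) Requirement

Take the output of the traffic-summary example and sort it again by total flow, in descending order.

2) Input data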

13470253144   180   180   360
13509468723   7335   110349   117684
13560439638   918   4938   5856
13568436656   3597   25635   29232
13590439668   1116   954   2070
13630577991   6960   690   7650
13682846555   1938   2910   4848
13729199489   240   0   240
13736230513   2481   24681   27162
13768778790   120   120   240
13846544121   264   0   264
13956435636   132   1512   1644
13966251146   240   0   240
13975057813   11058   48243   59301
13992314666   3008   3720   6728
15043685818   3659   3538   7197
15910133277   3156   2936   6092
15959002129   1938   180   2118
18271575951   1527   2106   3633
18390173782   9531   2412   11943
84188413   4116   1432   5548

3) Expected output

4) Requirement analysis

5) Implementation

5.1) The FlowBean from requirement 1, extended with comparison support

package com.kgf.mapreduce.writablecompable1;

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements WritableComparable<FlowBean> {

    private long upFlow; // upstream flow
    private long downFlow; // downstream flow
    private long sumFlow; // total flow

    // No-argument constructor, needed for deserialization
    public FlowBean() {
    }

    // Getters and setters for the three fields
    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    public void setSumFlow() {
        this.sumFlow = this.upFlow + this.downFlow;
    }

    // Serialization and deserialization; the field order must be identical in both methods
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(this.upFlow);
        out.writeLong(this.downFlow);
        out.writeLong(this.sumFlow);

    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.upFlow = in.readLong();
        this.downFlow = in.readLong();
        this.sumFlow = in.readLong();
    }

    // Override toString, since the FlowBean is written to the final output
    @Override
    public String toString() {
        return upFlow + "\t" + downFlow + "\t" + sumFlow;
    }

    @Override
    public int compareTo(FlowBean o) {

        // Compare by total flow, in descending order
        if(this.sumFlow > o.sumFlow){
            return -1;
        }else if(this.sumFlow < o.sumFlow){
            return 1;
        }else {
            return 0;
        }
    }
}
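The reason the field order in write and readFields has to match is that the serialized form is just a flat sequence of bytes with no field names. The following sketch shows the same round trip with the standard java.io streams only; the class name FieldOrderDemo is hypothetical, and it is not the Writable interface itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Round-trips three longs through a byte array, reading them back in write order.
class FieldOrderDemo {

    static long[] roundTrip(long upFlow, long downFlow) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                // Write order: upFlow, downFlow, sumFlow.
                out.writeLong(upFlow);
                out.writeLong(downFlow);
                out.writeLong(upFlow + downFlow);
            }
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                // Read order must be the same, or the values land in the wrong fields.
                long up = in.readLong();
                long down = in.readLong();
                long sum = in.readLong();
                return new long[] {up, down, sum};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] fields = roundTrip(2481, 24681);
        System.out.println(fields[0] + "\t" + fields[1] + "\t" + fields[2]);
    }
}
```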

5.2) Write the Mapper class

package com.kgf.mapreduce.writablecompable1;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowMapper extends Mapper<LongWritable, Text, FlowBean, Text> {
    private FlowBean outK = new FlowBean();
    private Text outV = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        //1 Read one line
        String line = value.toString();

        //2 Split the line on "\t"
        String[] split = line.split("\t");

        //3 Populate outK and outV
        outK.setUpFlow(Long.parseLong(split[1]));
        outK.setDownFlow(Long.parseLong(split[2]));
        outK.setSumFlow();
        outV.set(split[0]);

        //4 Write out outK and outV
        context.write(outK,outV);
    }
}

5.3) Write the Reducer class

package com.kgf.mapreduce.writablecompable1;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlowReducer extends Reducer<FlowBean, Text, Text, FlowBean> {
    
    @Override
    protected void reduce(FlowBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {

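        // Iterate over values and write each one: several phone numbers can share the same total flow
        for (Text value : values) {
            // Swap key and value back, so the phone number is written first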
            context.write(value,key);
        }
    }
}
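Why the loop matters: records whose compareTo returns 0 are treated as one key, so every phone number with the same total flow reaches a single reduce() call. The sketch below imitates that grouping with a TreeMap in plain Java. The class name EqualKeyGroupingDemo is hypothetical, the rows are taken from the sample input, and the order of values inside a group here simply follows insertion order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Imitates grouping by key: total flow (descending) -> phone numbers with that total.
class EqualKeyGroupingDemo {

    // Each record is {phone, totalFlow}.
    static Map<Long, List<String>> group(String[][] records) {
        Map<Long, List<String>> groups = new TreeMap<>(Comparator.<Long>reverseOrder());
        for (String[] record : records) {
            long total = Long.parseLong(record[1]);
            groups.computeIfAbsent(total, k -> new ArrayList<>()).add(record[0]);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"13729199489", "240"},
            {"13768778790", "240"},
            {"13846544121", "264"},
            {"13966251146", "240"},
        };
        // Each map entry corresponds to one reduce() call; writing only the first
        // value of a group would silently drop the other phone numbers.
        group(records).forEach((total, phones) -> System.out.println(total + " -> " + phones));
    }
}
```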

5.4) Write the Driver class

package com.kgf.mapreduce.writablecompable1;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        //1 Get the job object
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        //2 Associate this Driver class
        job.setJarByClass(FlowDriver.class);

        //3 Associate the Mapper and Reducer
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReducer.class);

        //4 Set the key/value types of the map output
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(Text.class);

        //5 Set the key/value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        //6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("F:\\test\\input2"));
        FileOutputFormat.setOutputPath(job, new Path("F:\\test\\output3"));

        //7 Submit the job
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}

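3.3. WritableComparable sorting example (sorting within partitions)

1) Requirement

Within the output file for each province's phone numbers, records must be sorted by total flow.

2) Requirement analysis

Building on the previous requirement, add a custom partitioner class that partitions by the province prefix of the phone number.

3) Hands-on

(1) Add a custom partitioner class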

package com.kgf.mapreduce.writablecompable2;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ProvincePartitioner2 extends Partitioner<FlowBean, Text> {

    @Override
    public int getPartition(FlowBean flowBean, Text text, int numPartitions) {
        // Take the first three digits of the phone number (here the phone number is the value)
        String phone = text.toString();
        String prePhone = phone.substring(0, 3);

        // Choose the partition number from prePhone
        int partition;
        if("136".equals(prePhone)){
            partition = 0;
        }else if("137".equals(prePhone)){
            partition = 1;
        }else if("138".equals(prePhone)){
            partition = 2;
        }else if("139".equals(prePhone)){
            partition = 3;
        }else {
            partition = 4;
        }

        // Return the partition number
        return partition;
    }
}

(2) Register the partitioner in the driver class

// Set the custom partitioner
job.setPartitionerClass(ProvincePartitioner2.class);

// Set the matching number of ReduceTasks
job.setNumReduceTasks(5);
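What the combination of partitioner and compareTo should produce can be sketched in plain Java: records are first bucketed by prefix, then each bucket is ordered by total flow, largest first. This is only an illustration of the expected result, not of how Hadoop implements it; the class name PartitionedSortDemo is hypothetical and the rows come from the sample input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Buckets {phone, totalFlow} records by prefix, then sorts each bucket by total flow descending.
class PartitionedSortDemo {

    static int partitionFor(String phone) {
        if (phone.startsWith("136")) return 0;
        if (phone.startsWith("137")) return 1;
        if (phone.startsWith("138")) return 2;
        if (phone.startsWith("139")) return 3;
        return 4;
    }

    static Map<Integer, List<String>> sortWithinPartitions(String[][] records) {
        Map<Integer, List<String[]>> byPartition = new TreeMap<>();
        for (String[] record : records) {
            byPartition.computeIfAbsent(partitionFor(record[0]), k -> new ArrayList<>()).add(record);
        }
        Map<Integer, List<String>> result = new TreeMap<>();
        for (Map.Entry<Integer, List<String[]>> entry : byPartition.entrySet()) {
            List<String[]> rows = entry.getValue();
            // Larger total flow first, matching FlowBean.compareTo.
            rows.sort((a, b) -> Long.compare(Long.parseLong(b[1]), Long.parseLong(a[1])));
            List<String> phones = new ArrayList<>();
            for (String[] row : rows) {
                phones.add(row[0]);
            }
            result.put(entry.getKey(), phones);
        }
        return result;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"13682846555", "4848"}, {"13630577991", "7650"},
            {"13729199489", "240"}, {"13736230513", "27162"},
            {"84188413", "5548"},
        };
        System.out.println(sortWithinPartitions(records));
    }
}
```

Each map entry in the result corresponds to one output file: the ordering holds inside a file, but nothing is implied about the order across files.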

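4. Combiner

4.1. Combiner example

1) Requirement

During the job, aggregate the output of each MapTask locally to reduce the amount of data sent over the network, i.e. use a Combiner.

(1) Input data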

1   zhang1   10   100
2   lisi1   20   100
3   lisi2   40   100
4   lisi3   20   100
5   lisi3   40   100
6   lisi4   20   100
7   lisi5   30   100
8   lisi6   30   100
9   lisi7   20   100
10   lisi8   30   100
11   lisi9   10   100
12   zhao   20   100

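(2) Expected output

Expected: the Combiner receives many input records and, after merging, emits fewer output records.

2) Requirement analysis

Note: this builds on the earlier example:

09 - MapReduce: introduction, pros and cons, core ideas, MapReduce processes, programming conventions, and the WordCount example https://blog.csdn.net/K_520_W/article/details/97485863

3) Hands-on, approach 1

(1) Add a WordCountCombiner class that extends Reducer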

package com.kgf.mapreduce.Combiner;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }

        // Populate the output value
        outV.set(sum);

        // Write out the key and the summed value
        context.write(key,outV);
    }
}

(2) Specify the Combiner in the WordcountDriver driver class

// Enable a combiner and specify which class provides its logic
job.setCombinerClass(WordCountCombiner.class);
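The saving a Combiner brings can be seen with a small simulation of one MapTask's output. This is a sketch in plain Java rather than Hadoop code: the class name CombinerDemo and the word list are made up, and each word stands for one (word, 1) pair emitted by the mapper.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates local aggregation of (word, 1) pairs produced by a single MapTask.
class CombinerDemo {

    // Sums the implicit 1 emitted for every occurrence of a word.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : mapOutputKeys) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("hadoop", "spark", "hadoop", "hadoop", "spark");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("records before combine: " + mapOutput.size());
        System.out.println("records after combine:  " + combined.size());
        System.out.println(combined);
    }
}
```

Local summing is safe here because addition gives the same final result whether it is applied once at the reducer or in stages; an operation such as averaging would not survive this kind of pre-aggregation unchanged.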

4) Hands-on, approach 2
