MapReduce Framework Internals: The Shuffle Mechanism

Article link: https://blog.csdn.net/qq_44110741/article/details/108682745

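Preface

MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs this sort and passes the mapper output to the reducers as input is called the shuffle.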

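I. Partition

The key-value pairs output by map tasks are placed into different partition files, and all data in the same partition is processed by a single reduce task. This lets the reduce tasks run in parallel and write their results to separate files.

1. Default partitioning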

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

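The default partition is obtained by taking the key's hashCode modulo the number of reduce tasks. The bitwise AND with Integer.MAX_VALUE clears the sign bit, so the result is never negative. The goal is to spread the data as evenly as possible so that each reduce task handles roughly the same amount. With this partitioner the user cannot control which key is stored in which partition.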
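The arithmetic can be tried outside Hadoop. The following is a minimal plain-Java sketch, not Hadoop code: the class name HashPartitionDemo and the sample keys are made up for illustration, and only the expression inside partitionFor is taken from the HashPartitioner listing above. It also shows why the sign bit is cleared: Java's % operator keeps the sign of the dividend, so a negative hashCode would otherwise yield a negative partition number.

```java
class HashPartitionDemo {
    // Same expression as in HashPartitioner.getPartition:
    // clear the sign bit, then take the remainder.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3; // hypothetical number of reduce tasks
        Object[] keys = {"hadoop", "spark", "hive", -7};
        for (Object key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
        // Without the mask, a negative hash gives a negative remainder.
        System.out.println("-7 % 3 without the mask = " + (Integer.valueOf(-7).hashCode() % 3));
    }
}
```

Every key lands in the range 0 to numReduceTasks - 1, which is what the framework requires of a partition number.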

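2. Steps for a custom Partitioner

(1) Define a partitioner class that extends Partitioner and overrides getPartition(), implementing your own partitioning logic in that method.

(2) Register the custom partitioner in the job driver: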

job.setPartitionerClass(CustomPartitioner.class);

(3) After defining a custom partitioner, set the number of reduce tasks to match its logic:

job.setNumReduceTasks(5);

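3. Notes

(1) When the partitioner is called: after the mapper writes a key-value pair and before the pair enters the circular buffer. In effect each pair is stamped with a partition number, and that number determines where the pair lands when it is later spilled to disk. For example, a pair assigned to partition 1 is spilled into the partition 1 file.

(2) The number of reduce tasks determines the number of output files.
If the number of reduce tasks > the number of distinct getPartition results, a few extra empty output files part-r-000xx are produced.
If 1 < the number of reduce tasks < the number of distinct getPartition results, some partitions have nowhere to go and the job fails with an exception.
If the number of reduce tasks = 1, then no matter how many partitions the map side would produce, everything is handed to that single reduce task and only one result file, part-r-00000, is produced.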
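The three cases can be modeled with a few lines of plain Java. This is a simplified sketch of the behavior described above, not the framework's code: PartitionCountDemo, customPartition, and resolve are hypothetical names, the partitioner is assumed to produce three partition numbers (0, 1, 2), and the exception message is only illustrative.

```java
import java.io.IOException;

class PartitionCountDemo {
    // Hypothetical custom partitioner that yields partition 0, 1, or 2.
    static int customPartition(String key) {
        char first = key.charAt(0);
        if (first <= 'h') {
            return 0;
        }
        if (first <= 'p') {
            return 1;
        }
        return 2;
    }

    // Simplified model of resolving the partition of one map output record.
    static int resolve(String key, int numReduceTasks) throws IOException {
        if (numReduceTasks == 1) {
            return 0; // a single reduce task receives every record
        }
        int partition = customPartition(key);
        if (partition < 0 || partition >= numReduceTasks) {
            // No reduce task exists for this partition number.
            throw new IOException("Illegal partition for " + key + " (" + partition + ")");
        }
        return partition;
    }

    public static void main(String[] args) {
        for (int reduceTasks : new int[]{1, 2, 4}) {
            try {
                System.out.println(reduceTasks + " reduce task(s): zebra -> partition "
                        + resolve("zebra", reduceTasks));
            } catch (IOException e) {
                System.out.println(reduceTasks + " reduce task(s): " + e.getMessage());
            }
        }
    }
}
```

With 4 reduce tasks the record is accepted and partition 3 simply stays empty, with 2 reduce tasks the record for partition 2 is rejected, and with 1 reduce task the partition number is always 0.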

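4. Partition example

(1) Requirement

Count the total number of occurrences of each word in a given text file and write the results to two files: words whose first letter is a-p go to one file, and words whose first letter is q-z go to the other.

(2) Data preparation: wordcount.txt

(3) Implementation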

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

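/**
 * Counts how many times each word appears in the text.
 * The Mapper reads the input text one line at a time,
 * processes each line with the logic in map(),
 * and emits the result as key-value pairs.
 */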
public class WcMapper extends Mapper<LongWritable, Text,Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
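        // 1 Convert the input value to a String
        String line = value.toString();
        // 2 Split the line on spaces
        String[] split = line.split(" ");
        // 3 Loop over the resulting words
        for (String s : split) {
            // Skip empty tokens produced by consecutive spaces
            if (s.isEmpty()) {
                continue;
            }
            Text k = new Text();
            k.set(s);
            IntWritable v = new IntWritable(1);
            // 4 Assemble and emit the key-value pair
            context.write(k,v);
            System.out.println(k+"  mapper  "+v);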
        }
    }
}

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

import java.util.regex.Pattern;

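/**
 * Assigns a partition number to each record emitted by the mapper:
 * words starting with a-p go to partition 0, all other words (q-z) go to partition 1.
 * The type parameters of Partitioner<Text, IntWritable> are the map output key and value types.
 */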
public class WcPartition extends Partitioner<Text, IntWritable> {
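    @Override
    public int getPartition(Text text, IntWritable intWritable, int numPartitions) {
        // Default: partition 1 (q-z and anything else)
        int res = 1;
        String word = text.toString();
        // Check the first letter; an empty key stays in partition 1
        if (!word.isEmpty() && Pattern.matches("[a-pA-P]", word.substring(0,1))) {
            // a-p goes to partition 0
            res = 0;
        }
        return res;
    }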
}

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Reducer
 * Merges the data emitted by the mapper:
 * reduce() combines the <k,v> pairs that share the same key.
 */
public class WcReduce extends Reducer<Text, IntWritable,Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Pairs with the same key arrive through the iterator
        // Iterate over them and add them up
        // sum holds the running total
        int sum = 0;
        for (IntWritable value : values) {
            int i = value.get();
            sum = sum + i;
        }
        System.out.println(key+" reduce "+new IntWritable(sum));
        // Assemble the key-value pair and write it out
        context.write(key,new IntWritable(sum));
    }

}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WcDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // Create the configuration object
        Configuration conf = new Configuration();
        // Create the job
        Job job = Job.getInstance(conf);
        // 1 Set the jar location
        job.setJarByClass(WcDriver.class);
        // 2 Set the Mapper and Reducer classes
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReduce.class);
        // 3 Set the Mapper output key and value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 4 Set the final output key and value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Set the partitioner
        job.setPartitionerClass(WcPartition.class);
        job.setNumReduceTasks(2);// number of reduce tasks

        // 5 Set the input and output paths
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));
        // 6 Submit the job to YARN
        boolean b = job.waitForCompletion(true);
        System.out.println("Job succeeded: "+b);
    }
}

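II. WritableComparable Sorting

Sorting is one of the most important jobs of the MapReduce framework. Both map tasks and reduce tasks sort data by key. This is default Hadoop behavior: the data in every job is sorted, whether or not the application logic needs it. The default order is lexicographic.

A map task places its output temporarily in a buffer. When buffer usage reaches a threshold, it sorts the data in the buffer and writes the sorted data to disk. Once all input has been processed, it merges all the files on disk into one large sorted file.

A reduce task remotely copies the relevant data file from each map task. If a file is larger than a threshold it is written to disk; otherwise it is kept in memory. When the number of files on disk reaches a threshold, they are merged into a larger file; when the size or number of in-memory files exceeds a threshold, they are merged and written to disk. After all data has been copied, the reduce task performs one final merge over everything in memory and on disk.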
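The "sort each buffer, then merge the sorted files" pattern can be sketched in plain Java. This is a toy model under stated assumptions, not Hadoop's implementation: SpillMergeDemo and its methods are hypothetical names, the "buffer" spills when it holds bufferSize records instead of at a usage percentage, records are bare String keys, and lists in memory stand in for files on disk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

class SpillMergeDemo {
    // Fills a buffer; whenever it is full, sorts it and "spills" it as one sorted run.
    static List<List<String>> spill(List<String> records, int bufferSize) {
        List<List<String>> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        for (String record : records) {
            buffer.add(record);
            if (buffer.size() == bufferSize) {
                Collections.sort(buffer);
                runs.add(buffer);
                buffer = new ArrayList<>();
            }
        }
        if (!buffer.isEmpty()) { // final partial buffer
            Collections.sort(buffer);
            runs.add(buffer);
        }
        return runs;
    }

    // Merges the sorted runs into one sorted list with a k-way merge.
    static List<String> mergeSortedRuns(List<List<String>> runs) {
        // Each heap entry is {runIndex, positionInRun}, ordered by the key it points to.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                (a, b) -> runs.get(a[0]).get(a[1]).compareTo(runs.get(b[0]).get(b[1])));
        for (int i = 0; i < runs.size(); i++) {
            if (!runs.get(i).isEmpty()) {
                heads.add(new int[]{i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            List<String> run = runs.get(head[0]);
            merged.add(run.get(head[1]));
            if (head[1] + 1 < run.size()) {
                heads.add(new int[]{head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = spill(Arrays.asList("d", "a", "c", "b", "e"), 2);
        System.out.println("sorted runs: " + runs);
        System.out.println("merged:      " + mergeSortedRuns(runs));
    }
}
```

Because every run is already sorted, the merge only ever compares the current head of each run, which is why merging is much cheaper than re-sorting all the data.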

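1. Types of sorting

(1) Partial sort: MapReduce sorts the dataset by the key of the input records, guaranteeing that each output file is sorted internally.

(2) Total sort: the simplest approach is to produce a single result file. If partitions are used, make the partitions ordered relative to one another as well, so that the final result is still totally ordered.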
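One way to keep partitions ordered relative to one another is to partition by key range rather than by hash. The sketch below is plain Java with made-up names (RangePartitionDemo, partitionFor) and made-up split points "h" and "q"; Hadoop's own TotalOrderPartitioner is built on a similar idea, but this is not its code.

```java
import java.util.Arrays;

class RangePartitionDemo {
    // splitPoints must be sorted. Keys below splitPoints[0] go to partition 0,
    // keys in [splitPoints[0], splitPoints[1]) go to partition 1, and so on.
    static int partitionFor(String key, String[] splitPoints) {
        int idx = Arrays.binarySearch(splitPoints, key);
        // binarySearch returns -(insertionPoint) - 1 when the key is not found.
        return idx >= 0 ? idx + 1 : -(idx + 1);
    }

    public static void main(String[] args) {
        String[] splitPoints = {"h", "q"}; // hypothetical boundaries, giving 3 partitions
        for (String key : new String[]{"apple", "h", "kiwi", "zebra"}) {
            System.out.println(key + " -> partition " + partitionFor(key, splitPoints));
        }
    }
}
```

Since every key in partition 0 is smaller than every key in partition 1, and so on, reading part-r-00000, part-r-00001, ... in order yields a totally ordered result even though each reduce task only sorts its own partition.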

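2. Custom sorting with WritableComparable

Step 1: Put the data to be sorted in the mapper's output key position. If that data is a bean object, step 2 is also required.

Step 2: Tell the framework which bean property to sort by, and whether in ascending or descending order. That is, have the custom class implement the WritableComparable interface and override compareTo, returning a positive number, 0, or a negative number (typically 1, 0, -1) to define the order.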
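The effect of the compareTo return value can be checked without a cluster. The sketch below uses java.lang.Comparable as a stand-in for WritableComparable (the compareTo contract is the same); FeeRecord, its fields, and the phone numbers are invented for illustration. It sorts in descending order by swapping the two arguments of Long.compare.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class FeeRecord implements Comparable<FeeRecord> {
    final String phone;
    final long sumFee;

    FeeRecord(String phone, long sumFee) {
        this.phone = phone;
        this.sumFee = sumFee;
    }

    @Override
    public int compareTo(FeeRecord other) {
        // Long.compare(this.sumFee, other.sumFee) would sort ascending;
        // swapping the arguments sorts descending.
        return Long.compare(other.sumFee, this.sumFee);
    }

    // Returns the phone numbers in the order defined by compareTo.
    static List<String> sortedPhones(List<FeeRecord> records) {
        List<FeeRecord> copy = new ArrayList<>(records);
        Collections.sort(copy);
        List<String> phones = new ArrayList<>();
        for (FeeRecord record : copy) {
            phones.add(record.phone);
        }
        return phones;
    }

    public static void main(String[] args) {
        List<FeeRecord> records = Arrays.asList(
                new FeeRecord("13500000001", 76),
                new FeeRecord("13600000002", 120),
                new FeeRecord("13700000003", 30));
        System.out.println(sortedPhones(records));
    }
}
```

Using Long.compare rather than subtracting the two values avoids overflow when the fees are far apart.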

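3. WritableComparable sorting example

(1) Requirement

Partition by phone number, and within each partition output the records sorted by total fee in ascending order.

(2) Data preparation: the last column is the total fee

(3) Implementation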

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class PhoneFeeSortBean implements WritableComparable<PhoneFeeSortBean> {

    private long baseFee;
    private long communicateFee;
    private long msgFee;
    private long flowFee;
    private long sumFee;

    public PhoneFeeSortBean() {
    }

    public PhoneFeeSortBean(long baseFee, long communicateFee, long msgFee, long flowFee, long sumFee) {
        this.baseFee = baseFee;
        this.communicateFee = communicateFee;
        this.msgFee = msgFee;
        this.flowFee = flowFee;
        this.sumFee = sumFee;
    }

    public long getBaseFee() {
        return baseFee;
    }

    public void setBaseFee(long baseFee) {
        this.baseFee = baseFee;
    }

    public long getCommunicateFee() {
        return communicateFee;
    }

    public void setCommunicateFee(long communicateFee) {
        this.communicateFee = communicateFee;
    }

    public long getMsgFee() {
        return msgFee;
    }

    public void setMsgFee(long msgFee) {
        this.msgFee = msgFee;
    }

    public long getFlowFee() {
        return flowFee;
    }

    public void setFlowFee(long flowFee) {
        this.flowFee = flowFee;
    }

    public long getSumFee() {
        return sumFee;
    }

    public void setSumFee(long sumFee) {
        this.sumFee = sumFee;
    }

    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(baseFee);
        dataOutput.writeLong(communicateFee);
        dataOutput.writeLong(msgFee);
        dataOutput.writeLong(flowFee);
        dataOutput.writeLong(sumFee);
    }

    public void readFields(DataInput dataInput) throws IOException {
        this.baseFee = dataInput.readLong();
        this.communicateFee = dataInput.readLong();
        this.msgFee = dataInput.readLong();
        this.flowFee = dataInput.readLong();
        this.sumFee = dataInput.readLong();
    }

    @Override
    public String toString() {
        return this.baseFee+"\t"+this.communicateFee+"\t"+this.msgFee+"\t"+this.flowFee+"\t"+this.sumFee;
    }

    // Sort by total fee in ascending order.
    // compareTo tells the framework which property to sort by:
    // comparing this against o gives ascending order,
    // swapping the two arguments gives descending order.
    @Override
    public int compareTo(PhoneFeeSortBean o) {
        return Long.compare(this.getSumFee(), o.getSumFee());
    }

}

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/**
 * Wraps an input line such as: 13329142740	5	60	8	3	76
 * in a bean and emits it in the output key position.
 */
public class PhoneFeeSortMapper extends Mapper<LongWritable, Text, PhoneFeeSortBean, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        //13329142740	5	60	8	3	76
        String line = value.toString();
        String[] split = line.split("\t");

        Text v = new Text(split[0]);

        PhoneFeeSortBean k = new PhoneFeeSortBean(Long.parseLong(split[1]),Long.parseLong(split[2]),Long.parseLong(split[3]),Long.parseLong(split[4]),Long.parseLong(split[5]));
        context.write(k,v);
    }
}

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
/**
 * Partitions the mapper output by phone number prefix
 */
public class PhonePartition extends Partitioner<PhoneFeeSortBean, Text> {
    public int getPartition(PhoneFeeSortBean phoneFeeSortBean, Text text, int i) {
        String phoneNum = text.toString().substring(0, 3);
        int res = 0;
        if("135".equals(phoneNum)){
            res = 1;
        }else if("136".equals(phoneNum)){
            res = 2;
        }else if("137".equals(phoneNum)){
            res = 3;
        }else if("138".equals(phoneNum)){
            res = 4;
        }else if("139".equals(phoneNum)){
            res = 5;
        }
        return res;
    }
}

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * Swaps the key and value and writes them out
 */
public class PhoneFeeSortReduce extends Reducer<PhoneFeeSortBean, Text, Text, PhoneFeeSortBean> {
    @Override
    protected void reduce(PhoneFeeSortBean key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(value,key);
        }
    }
}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class PhoneFeeSortDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(PhoneFeeSortDriver.class);
        job.setMapperClass(PhoneFeeSortMapper.class);
        job.setReducerClass(PhoneFeeSortReduce.class);
        job.setMapOutputKeyClass(PhoneFeeSortBean.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(PhoneFeeSortBean.class);

        // Set the partitioner; PhonePartition returns partitions 0-5, so 6 reduce tasks are needed
        job.setPartitionerClass(PhonePartition.class);
        job.setNumReduceTasks(6);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));


        boolean res = job.waitForCompletion(true);
        System.out.println("Job succeeded: "+res);
    }
}

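III. Combiner <b,1> <b,1> == <b,2>

1. About the combiner

A Combiner is a component of a MapReduce program besides the mapper and reducer; its parent class is Reducer.

Difference between Combiner and Reducer: where they run.
A Combiner runs on the node of each map task and performs a local merge.
A Reducer receives the output of all mappers and performs the global merge.

Purpose of the Combiner: pre-aggregation. It merges the output of each map task to reduce the amount of data sent over the network.

Notes:
Not every MapReduce job can use a combiner, for example computing an average.
The combiner's output key-value types must match the reducer's input key-value types.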
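The averaging caveat is easy to verify with numbers. In the plain-Java sketch below (CombinerAverageDemo and the sample values are made up), each inner array stands for the values one map task emits for the same key. Averaging locally and then averaging the averages gives the wrong answer, while pre-aggregating (sum, count) pairs and dividing once at the end gives the right one.

```java
class CombinerAverageDemo {
    // Wrong: a "combiner" that averages locally, followed by a reducer that averages again.
    static double averageOfAverages(double[][] perMapTask) {
        double total = 0;
        for (double[] values : perMapTask) {
            double sum = 0;
            for (double v : values) {
                sum += v;
            }
            total += sum / values.length; // local average
        }
        return total / perMapTask.length;
    }

    // Right: the local step only pre-aggregates sum and count; division happens once.
    static double averageFromSumCount(double[][] perMapTask) {
        double sum = 0;
        long count = 0;
        for (double[] values : perMapTask) {
            for (double v : values) {
                sum += v;
            }
            count += values.length;
        }
        return sum / count;
    }

    public static void main(String[] args) {
        double[][] perMapTask = {{1, 2, 3}, {10}}; // outputs of two hypothetical map tasks
        System.out.println("average of averages: " + averageOfAverages(perMapTask));
        System.out.println("true average:        " + averageFromSumCount(perMapTask));
    }
}
```

Summation works as a combiner because it is associative and commutative, so partial sums can be merged in any grouping; averaging is not, which is why it has to be rewritten in terms of sum and count.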

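2. Steps for a custom combiner

(1) Define a combiner class that extends Reducer and override reduce() to describe the pre-aggregation logic.

(2) Register it in the job driver class: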

job.setCombinerClass(XXX.class);

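3. Combiner example

(1) Requirement

During the count, locally aggregate the output of each map task to reduce network traffic, i.e. use a Combiner.

(2) Data preparation

(3) Implementation (based on the earlier word count example)

Option 1

1) Add a WordcountCombiner class that extends Reducer.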

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WordcountCombiner extends Reducer<Text,IntWritable,Text,IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // <hadoop,1>
        // <hadoop,1>
        // Total number of occurrences of the word
        int sum = 0;
        for (IntWritable value : values) {
            int i = value.get();
            sum = sum + i;
        }
        // Assemble the <word, total> key-value pair
        IntWritable v = new IntWritable(sum);
        // Write the key-value pair out
        context.write(key,v);

    }
}

2) Specify the combiner in the WcDriver driver class.

job.setCombinerClass(WordcountCombiner.class);

Option 2

Specify the WcReduce class itself as the combiner in the WcDriver driver class.

// Enable the combiner and choose which class provides its logic
job.setCombinerClass(WcReduce.class);

Run the program; the result is as shown in the figure.

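IV. GroupingComparator (Secondary Sort)

A GroupingComparator tells the MapReduce framework by what criterion to group records for each reduce() call; it lets you define a custom grouping rule.

1. Example

(1) Requirement

The following is electricity bill data for some household users over several months:

User ID	Month	Electricity fee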
5686621	201906	132.6
5686622	201908	100.8
5686621	201912	144.6
5686622	201906	54.8
5686622	201905	322.4
5686621	201907	22.5
5686622	201910	152.4
5686622	201912	612.4
5686624	201906	22.8
5686624	201908	12.8
5686623	201901	33.8
5686623	201907	63.5
5686623	201906	93.5
5686624	201909	13.6

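We need to find, for each household, the month with the highest bill and the amount of that bill.

(2) Input data

(3) Analysis

1) Use a bean containing "user ID, month, fee" as the key, so that all bill records read in the map phase are sorted by userid ascending and fee descending before being sent to reduce.

2) On the reduce side, use a GroupingComparator to group the key-value pairs that have the same userid; the first record in each group is then the maximum, as shown in the figure.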
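The two steps of the analysis can be simulated in plain Java before wiring them into a job. The sketch below is not MapReduce code: GroupingDemo, Bill, and maxPerUser are hypothetical names, a list sort stands in for the shuffle sort, and a "did the userid change?" check stands in for the grouping comparator. The sample rows are taken from the data above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GroupingDemo {
    static final class Bill {
        final int userId;
        final String month;
        final double fee;

        Bill(int userId, String month, double fee) {
            this.userId = userId;
            this.month = month;
            this.fee = fee;
        }
    }

    static List<Bill> sampleBills() {
        return Arrays.asList(
                new Bill(5686621, "201906", 132.6),
                new Bill(5686622, "201908", 100.8),
                new Bill(5686621, "201912", 144.6),
                new Bill(5686621, "201907", 22.5),
                new Bill(5686622, "201912", 612.4));
    }

    static List<Bill> maxPerUser(List<Bill> bills) {
        List<Bill> sorted = new ArrayList<>(bills);
        // Step 1, key sort order: userId ascending, then fee descending.
        sorted.sort((x, y) -> {
            int byUser = Integer.compare(x.userId, y.userId);
            return byUser != 0 ? byUser : Double.compare(y.fee, x.fee);
        });
        // Step 2, grouping: consecutive records with the same userId form one group;
        // only the first record of each group (the highest fee) is kept.
        List<Bill> result = new ArrayList<>();
        Integer currentUser = null;
        for (Bill bill : sorted) {
            if (currentUser == null || currentUser != bill.userId) {
                result.add(bill);
                currentUser = bill.userId;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (Bill bill : maxPerUser(sampleBills())) {
            System.out.println(bill.userId + "\t" + bill.month + "\t" + bill.fee);
        }
    }
}
```

The sort compares the full key (userid and fee) while the grouping compares only userid; that difference is exactly what lets the first record of each group carry the maximum.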

(4) Implementation

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class DianfeiBean implements WritableComparable<DianfeiBean> {

    private int userid;
    private String month;
    private double fee;

    public DianfeiBean() {
    }

    public DianfeiBean(int userid, String month, double fee) {
        this.userid = userid;
        this.month = month;
        this.fee = fee;
    }

    public int getUserid() {
        return userid;
    }

    public void setUserid(int userid) {
        this.userid = userid;
    }

    public String getMonth() {
        return month;
    }

    public void setMonth(String month) {
        this.month = month;
    }

    public double getFee() {
        return fee;
    }

    public void setFee(double fee) {
        this.fee = fee;
    }

    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeInt(userid);
        dataOutput.writeUTF(month);
        dataOutput.writeDouble(fee);
    }

    public void readFields(DataInput dataInput) throws IOException {
        this.userid = dataInput.readInt();
        this.month = dataInput.readUTF();
        this.fee = dataInput.readDouble();
    }

    // userid ascending, fee descending
    @Override
    public int compareTo(DianfeiBean o) {
        int byUser = Integer.compare(this.userid, o.userid);
        if (byUser != 0) {
            return byUser;
        }
        // Same user: swap the arguments so that the higher fee sorts first
        return Double.compare(o.fee, this.fee);
    }

    public String toString() {
        return this.userid+"\t"+this.month+"\t"+this.fee;
    }
}

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class DianfeiMapper extends Mapper<LongWritable, Text, DianfeiBean, NullWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        String[] split = line.split("\t");
        DianfeiBean k = new DianfeiBean(Integer.parseInt(split[0]), split[1], Double.parseDouble(split[2]));
        context.write(k,NullWritable.get());
    }
}

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class DianfeiReduce extends Reducer<DianfeiBean, NullWritable, DianfeiBean, NullWritable> {
    @Override
    protected void reduce(DianfeiBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
        context.write(key,NullWritable.get());
    }
}

import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class DianfeiGroupComparator extends WritableComparator {

    public DianfeiGroupComparator(){
        super(DianfeiBean.class,true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        DianfeiBean aa = (DianfeiBean)a;
        DianfeiBean bb = (DianfeiBean)b;
        int res = 0;
        if(aa.getUserid() > bb.getUserid()){
            res = 1;
        }else if(aa.getUserid() < bb.getUserid()){
            res = -1;
        }
        return res;
    }
}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DianfeiDriver {
    public static void main(String[] args) throws Exception {
        // 1 Create a configuration object
        Configuration conf = new Configuration();
        // 2 Get a job object from the configuration
        Job job = Job.getInstance(conf);
        // 3 Set the job's jar
        job.setJarByClass(DianfeiDriver.class);
        // 4 Set the job's mapper and reducer classes
        job.setMapperClass(DianfeiMapper.class);
        job.setReducerClass(DianfeiReduce.class);
        // 5 Set the mapper output key and value types
        job.setMapOutputKeyClass(DianfeiBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        // 6 Set the final output key and value types
        job.setOutputKeyClass(DianfeiBean.class);
        job.setOutputValueClass(NullWritable.class);
        // 7 Set the input and output paths
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        // Note: the output directory must not exist beforehand; the framework creates it and fails if it already exists
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        // Group the reduce input by userid
        job.setGroupingComparatorClass(DianfeiGroupComparator.class);

        // 8 Submit the job to YARN or to the local runner
        boolean res = job.waitForCompletion(true);
        System.out.println("Job succeeded: "+res);

    }
}