Day 04: MapReduce examples and the shuffle process

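Last time we learned how to write a simple MapReduce program that counts how often each word occurs in a file. All results went to a single output file,
sorted in the default order of the keys. Today we add a custom sort order and split the results across several output files.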

1. Implementation

The dataset we process this time is:

1363157985066   13726230503 00-FD-07-A4-72-B8:CMCC  120.196.100.82  i02.c.aliimg.com        24  27  2481    24681   200
1363157995052   13826544101 5C-0E-8B-C7-F1-E0:CMCC  120.197.40.4            4   0   264 0   200
1363157991076   13926435656 20-10-7A-28-CC-0A:CMCC  120.196.100.99          2   4   132 1512    200
1363154400022   13926251106 5C-0E-8B-8B-B1-50:CMCC  120.197.40.4            4   0   240 0   200
1363157993044   18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99  iface.qiyi.com  视频网站    15  12  1527    2106    200
1363157995074   84138413    5C-0E-8B-8C-E8-20:7DaysInn  120.197.40.4    122.72.52.12        20  16  4116    1432    200
1363157993055   13560439658 C4-17-FE-BA-DE-D9:CMCC  120.196.100.99          18  15  1116    954 200
1363157995033   15920133257 5C-0E-8B-C7-BA-20:CMCC  120.197.40.4    sug.so.360.cn   信息安全    20  20  3156    2936    200
1363157983019   13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82          4   0   240 0   200
1363157984041   13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4    s19.cnzz.com    站点统计    24  9   6960    690 200
1363157973098   15013685858 5C-0E-8B-C7-F7-90:CMCC  120.197.40.4    rank.ie.sogou.com   搜索引擎    28  27  3659    3538    200
1363157986029   15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99  www.umeng.com   站点统计    3   3   1938    180 200
1363157992093   13560439658 C4-17-FE-BA-DE-D9:CMCC  120.196.100.99          15  9   918 4938    200
1363157986041   13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4            3   3   180 180 200
1363157984040   13602846565 5C-0E-8B-8B-B6-00:CMCC  120.197.40.4    2052.flash2-http.qq.com 综合门户    15  12  1938    2910    200
1363157995093   13922314466 00-FD-07-A2-EC-BA:CMCC  120.196.100.82  img.qfc.cn      12  12  3008    3720    200
1363157982040   13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99  y0.ifengimg.com 综合门户    57  102 7335    110349  200
1363157986072   18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99  input.shouji.sogou.com  搜索引擎    21  18  9531    2412    200
1363157990043   13925057413 00-1F-64-E1-E6-9A:CMCC  120.196.100.55  t3.baidu.com    搜索引擎    69  63  11058   48243   200
1363157988072   13760778710 00-FD-07-A4-7B-08:CMCC  120.196.100.82          2   2   120 120 200
1363157985079   13823070001 20-7C-8F-70-68-1F:CMCC  120.196.100.99          6   3   360 180 200
1363157985069   13600217502 00-1F-64-E2-E8-B1:CMCC  120.196.100.55          18  138 1080    186852  200

Custom data type

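This time we use MapReduce to total the upstream and downstream traffic of each phone number.
First we define our own data type. It must implement the Writable interface, with its serialization and deserialization methods, so that MapReduce can pass it between tasks without errors or data loss.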

package cn.itcast.hadoop.mr.flowsum;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

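/**
 * FlowBean is our custom data type. It is transferred between Hadoop nodes,
 * so it must follow Hadoop's serialization mechanism and implement the
 * corresponding serialization interface.
 */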

public class FlowBean implements Writable {

    private String phoneNB;
    private long up_flow;
    private long d_flow;
    private long s_flow;

    // Deserialization creates the object via reflection, which needs a no-arg constructor
    public FlowBean(){}

    public FlowBean(String phoneNB, long up_flow, long d_flow) {
        super();
        this.phoneNB = phoneNB;
        this.up_flow = up_flow;
        this.d_flow = d_flow;
    }

    public String getPhoneNB() {
        return phoneNB;
    }

    public void setPhoneNB(String phoneNB) {
        this.phoneNB = phoneNB;
    }

    public long getUp_flow() {
        return up_flow;
    }

    public void setUp_flow(long up_flow) {
        this.up_flow = up_flow;
    }

    public long getD_flow() {
        return d_flow;
    }

    public void setD_flow(long d_flow) {
        this.d_flow = d_flow;
    }

    /**
     * @return the s_flow
     */
    public long getS_flow() {
        return s_flow;
    }

    /**
     * @param s_flow the s_flow to set
     */
    public void setS_flow(long s_flow) {
        this.s_flow = s_flow;
    }


    @Override
    public String toString() {

        return ""+ phoneNB + "\t" + up_flow + "\t" + d_flow;
    }


    // Serialize the object's fields to the stream
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(phoneNB);
        out.writeLong(up_flow);
        out.writeLong(d_flow);
        out.writeLong(s_flow);

    }

    // Deserialize the object's fields from the stream
    // Fields must be read in exactly the order in which they were written
    @Override
    public void readFields(DataInput in) throws IOException {

        phoneNB = in.readUTF();
        up_flow = in.readLong();
        d_flow = in.readLong();
        s_flow = in.readLong();
    }
}
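The rule that readFields must mirror write can be checked without a cluster. The following is a minimal sketch, assuming a hypothetical stand-in class FlowRecord that copies the shape of FlowBean but depends only on java.io, so it is not the Hadoop Writable interface itself. It writes a record to a byte array and rebuilds a fresh instance from those bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for FlowBean: same write/readFields shape as Writable,
// but without the Hadoop dependency, so it runs on a plain JDK.
final class FlowRecord {
    String phoneNB;
    long upFlow;
    long dFlow;

    FlowRecord() {}  // no-arg constructor, as reflection-based deserialization would need

    FlowRecord(String phoneNB, long upFlow, long dFlow) {
        this.phoneNB = phoneNB;
        this.upFlow = upFlow;
        this.dFlow = dFlow;
    }

    void write(DataOutput out) throws IOException {
        out.writeUTF(phoneNB);
        out.writeLong(upFlow);
        out.writeLong(dFlow);
    }

    // Reads the fields in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        phoneNB = in.readUTF();
        upFlow = in.readLong();
        dFlow = in.readLong();
    }

    // Serializes a record to bytes and rebuilds a fresh instance from them.
    static FlowRecord roundTrip(FlowRecord original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            original.write(new DataOutputStream(bytes));
            FlowRecord copy = new FlowRecord();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        FlowRecord copy = roundTrip(new FlowRecord("13726230503", 2481L, 24681L));
        System.out.println(copy.phoneNB + "\t" + copy.upFlow + "\t" + copy.dFlow);
    }
}
```

Swapping two of the read calls in readFields would either throw or silently assign values to the wrong fields, which is why the order matters.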

To save space, the mapper and reducer classes are defined as static nested classes.

package cn.itcast.hadoop.mr.flowsum;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;


public class FlowSumRunner extends Configured implements Tool {


    public static class FlowSumMapper extends Mapper<LongWritable, Text, Text,FlowBean> {
        // Take one line of the log and split it into fields
        @Override
        protected void map(LongWritable key, Text value, Context context)
         throws IOException, InterruptedException{

            String line = value.toString();

            String[] fields = StringUtils.split(line, "\t");

            String phoneNB = fields[1];
            long up_flow = Long.parseLong(fields[7]);
            long d_flow = Long.parseLong(fields[8]);

            context.write(new Text(phoneNB), new FlowBean(phoneNB,up_flow, d_flow));
        }
    }


    public static class FlowSumReducer extends Reducer<Text, FlowBean, Text, FlowBean> {

        // The framework calls reduce once per key, passing all values that belong to that key
        @Override
        protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {

            long up_flow_counter = 0;
            long d_flow_counter= 0;

            for (FlowBean value : values) {
                up_flow_counter += value.getUp_flow();
                d_flow_counter += value.getD_flow();
            }

            context.write(key, new FlowBean(key.toString(),up_flow_counter, d_flow_counter));
        }
    }

    @Override
    public int run(String[] args) throws Exception {

        Configuration conf = getConf(); // use the Configuration that ToolRunner passed in

        Job job = Job.getInstance(conf);

        job.setJarByClass(FlowSumRunner.class);

        job.setMapperClass(FlowSumMapper.class);
        job.setReducerClass(FlowSumReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));  // input path from the command-line arguments
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path from the command-line arguments

        return job.waitForCompletion(true)?0:1;
    }

    public static void main(String[] args) throws Exception{

    //hadoop jar flow.jar cn.itcast.hadoop.mr.flowsum.FlowSumRunner /flow/data /flow/output

        int res = ToolRunner.run(new Configuration(), new FlowSumRunner(), args);
        System.exit(res);
    }
}

Upload the file to HDFS: hadoop fs -put  HTTP_20130313143750.dat  /flow/data

Build the jar and run it:
hadoop jar flow.jar cn.itcast.hadoop.mr.flowsum.FlowSumRunner /flow/data  /flow/output

[hadoop@hadoop1 ~]$ hadoop fs -cat /flow/output/part-r-00000
13480253104 13480253104 180 200
13502468823 13502468823 102 7335
13560439658 13560439658 5892    400
13600217502 13600217502 186852  200
13602846565 13602846565 12  1938
13660577991 13660577991 9   6960
13719199419 13719199419 0   200
13726230503 13726230503 2481    24681
13760778710 13760778710 120 200
13823070001 13823070001 180 200
13826544101 13826544101 0   200
13922314466 13922314466 3008    3720
13925057413 13925057413 63  11058
13926251106 13926251106 0   200
13926435656 13926435656 1512    200
15013685858 15013685858 27  3659
15920133257 15920133257 20  3156
15989002119 15989002119 3   1938
18211575961 18211575961 12  1527
18320173382 18320173382 18  9531
84138413    84138413    4116    1432

// The first column is the default key; the rest is the output of FlowBean's own toString() method
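Several rows in the output above, such as 13826544101, end in 200, which is the HTTP status column of the input rather than a traffic figure. That is what one would expect when the split call drops empty fields: commons-lang StringUtils.split treats adjacent separators as one, so the fixed indexes 7 and 8 land on different columns depending on how many optional fields (host, category) a record has. A minimal sketch of one way around this follows. It assumes that the fields are tab-separated and that every record ends with upstream bytes, downstream bytes and status; FlowLineParser and parseTraffic are hypothetical names.

```java
// Hypothetical helper: reads the traffic columns by counting from the end of the record,
// so missing optional fields in the middle do not shift the result.
final class FlowLineParser {

    // Returns {upstream, downstream}.
    // Assumption: the last three columns are always upstream bytes, downstream bytes, status.
    static long[] parseTraffic(String line) {
        String[] fields = line.split("\t", -1); // limit -1 keeps empty fields
        int n = fields.length;
        if (n < 3) {
            throw new IllegalArgumentException("too few fields: " + n);
        }
        long up = Long.parseLong(fields[n - 3]);
        long down = Long.parseLong(fields[n - 2]);
        return new long[] {up, down};
    }

    public static void main(String[] args) {
        // One record with a host but no category, one with neither.
        String withHost = "1363157985066\t13726230503\t00-FD-07-A4-72-B8:CMCC\t120.196.100.82"
                + "\ti02.c.aliimg.com\t\t24\t27\t2481\t24681\t200";
        String withoutHost = "1363157995052\t13826544101\t5C-0E-8B-C7-F1-E0:CMCC\t120.197.40.4"
                + "\t\t\t4\t0\t264\t0\t200";
        for (String line : new String[] {withHost, withoutHost}) {
            long[] t = parseTraffic(line);
            System.out.println(t[0] + "\t" + t[1]);
        }
    }
}
```

Counting from the end keeps the mapper independent of which optional columns are present, at the cost of assuming the tail of the record has a fixed layout.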

Sorting by upstream traffic

To replace MapReduce's default sort order with a custom one, FlowBean must also implement the Comparable interface, that is, the compareTo method.
Writable and Comparable combined are the WritableComparable interface.

package cn.itcast.hadoop.mr.flowsum;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class FlowBean implements WritableComparable<FlowBean> {

    ...
    ...
    ...

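    // Sort descending by upstream traffic. This never returns 0, so two beans with the
    // same upstream traffic are never treated as the same key and both reach the output.
    @Override
    public int compareTo(FlowBean o) {
        return this.getUp_flow() > o.getUp_flow() ? -1:1;
    }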
}
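A compareTo that never returns 0 keeps equal-traffic records apart, but it does not satisfy the usual Comparable contract (comparing a with b and b with a should give opposite signs). One alternative, sketched here under the assumption that the phone number is an acceptable tie-breaker, is to order by upstream traffic descending and then by phone number. The Flow class and the FlowOrdering helper are hypothetical and use only the JDK.

```java
import java.util.ArrayList;
import java.util.List;

final class FlowOrdering {

    // Hypothetical minimal record: phone number plus upstream traffic.
    static final class Flow {
        final String phoneNB;
        final long upFlow;

        Flow(String phoneNB, long upFlow) {
            this.phoneNB = phoneNB;
            this.upFlow = upFlow;
        }
    }

    // Descending by upstream traffic; ties are broken by phone number, so the
    // result is 0 only when both fields are equal.
    static int compare(Flow a, Flow b) {
        int byUp = Long.compare(b.upFlow, a.upFlow); // arguments swapped for descending order
        return byUp != 0 ? byUp : a.phoneNB.compareTo(b.phoneNB);
    }

    // Returns the phone numbers in sorted order, leaving the input list untouched.
    static List<String> sortedPhones(List<Flow> flows) {
        List<Flow> copy = new ArrayList<>(flows);
        copy.sort(FlowOrdering::compare);
        List<String> phones = new ArrayList<>();
        for (Flow f : copy) {
            phones.add(f.phoneNB);
        }
        return phones;
    }

    public static void main(String[] args) {
        List<Flow> flows = new ArrayList<>();
        flows.add(new Flow("13823070001", 180L));
        flows.add(new Flow("13600217502", 186852L));
        flows.add(new Flow("13480253104", 180L));
        System.out.println(sortedPhones(flows));
    }
}
```

With this ordering two records compare as equal only if they have the same phone number and the same upstream traffic, so distinct phone numbers still produce distinct keys.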

MapReduce sorts by key by default, so the mapper and reducer have to change accordingly: FlowBean now becomes the key.

package cn.itcast.hadoop.mr.flowsort;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import cn.itcast.hadoop.mr.flowsum.FlowBean;

public class SortMR {

    // Use NullWritable when there is nothing to emit for the value
    public static class SortMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {

        @Override
        protected void map(LongWritable key, Text value, Context context )
        throws IOException, InterruptedException{
            String line = value.toString();

            String[] fields = StringUtils.split(line,"\t");

            String phoneNB = fields[0];
            long up_flow = Long.parseLong(fields[1]);
            long d_flow = Long.parseLong(fields[2]);


            context.write(new FlowBean(phoneNB, up_flow, d_flow), NullWritable.get());
        }
    }

    public static class SortReducer extends Reducer<FlowBean, NullWritable, FlowBean, NullWritable> {

        @Override
        protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context)
                 throws IOException, InterruptedException

        {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) 
            throws IllegalArgumentException, IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration(); 

        Job job = Job.getInstance(conf);

        job.setJarByClass(SortMR.class);

        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);

        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(FlowBean.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));


        int result = job.waitForCompletion(true)?0:1;

        System.exit(result);
    }

}

[hadoop@hadoop1 ~]$ hadoop jar flowsort.jar cn.itcast.hadoop.mr.flowsort.SortMR /flow/data  /flow/sortoutput


[hadoop@hadoop1 ~]$ hadoop fs -cat /flow/sortoutput/part-r-00000
13600217502 186852  200
13560439658 4938    200
84138413    4116    1432
13922314466 3008    3720
13726230503 2481    24681
13926435656 1512    200
13560439658 954 200
13480253104 180 200
13823070001 180 200
13760778710 120 200
13502468823 102 7335
13925057413 63  11058
15013685858 27  3659
15920133257 20  3156
18320173382 18  9531
18211575961 12  1527
13602846565 12  1938
13660577991 9   6960
15989002119 3   1938
13719199419 0   200
13826544101 0   200
13926251106 0   200

Partitioning the output

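Up to now the output has always been a single file.
Now we want to separate numbers starting with 135, 136, 137, 138 and 139 from each other and from all other numbers.
In practice this means handing the different parts of the data to different reducers, each of which writes its own result file.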

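Distributing the map output to different reducers requires a partitioning step. By default a job runs a single reduce task, so every record goes to the same reducer and only one output file is produced.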

We now extend the Partitioner class so that the data is partitioned according to our own rules.

package cn.itcast.hadoop.mr.areapartition;

import java.util.HashMap;
import org.apache.hadoop.mapreduce.Partitioner;

public  class AreaPartitioner<KEY, VALUE> extends Partitioner<KEY, VALUE> {

    private static  HashMap<String, Integer> areaMap = new HashMap<>();

    static {        // partitions are hard-coded for now
        areaMap.put("135", 0);
        areaMap.put("136", 1);
        areaMap.put("137", 2);
        areaMap.put("138", 3);
        areaMap.put("139", 4);
    }

    @Override
    public int getPartition(KEY key, VALUE value, int numPartitions) {

        // Take the phone number from the key and look up its prefix in the area dictionary; each area maps to its own partition number
        Integer code = areaMap.get(key.toString().substring(0,3));
        int areaCoder = code  ==  null ? 5:code;

        return areaCoder;
    }

}

Then set the partitioner class on the job before it runs:

job.setPartitionerClass(AreaPartitioner.class);
job.setNumReduceTasks(6); // then set the number of reduce tasks to start

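// Note: the number of reduce tasks must be at least the number of partitions the partitioner can return, otherwise the job fails with an error.
// If it is not set, it defaults to 1; no partitioning happens and all data ends up in one file.
// If it is set higher than needed, the extra output files are simply empty.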
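The lookup inside getPartition, and the reason six reduce tasks are requested, can be tried outside Hadoop. This is a minimal sketch using a hypothetical PrefixPartitionSketch class that takes a plain String instead of a Hadoop key; the prefix table is the one from AreaPartitioner.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical plain-Java version of the prefix lookup in AreaPartitioner.
final class PrefixPartitionSketch {

    private static final Map<String, Integer> AREA = new HashMap<>();
    static final int DEFAULT_PARTITION = 5; // every prefix that is not listed below

    static {
        AREA.put("135", 0);
        AREA.put("136", 1);
        AREA.put("137", 2);
        AREA.put("138", 3);
        AREA.put("139", 4);
    }

    // Maps a phone number to a partition number by its first three characters.
    static int partitionFor(String phone) {
        String prefix = phone.length() >= 3 ? phone.substring(0, 3) : phone;
        Integer code = AREA.get(prefix);
        return code == null ? DEFAULT_PARTITION : code;
    }

    // The highest partition number is 5, so at least 6 reduce tasks are needed.
    static int requiredReduceTasks() {
        return DEFAULT_PARTITION + 1;
    }

    public static void main(String[] args) {
        for (String phone : new String[] {"13502468823", "13926435656", "84138413"}) {
            System.out.println(phone + " -> partition " + partitionFor(phone));
        }
        System.out.println("reduce tasks needed: " + requiredReduceTasks());
    }
}
```

Partition n is written to part-r-0000n, which is why the 135 numbers appear in part-r-00000 in the listing that follows.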

hadoop jar flowarea.jar cn.itcast.hadoop.mr.areapartition.FlowSumArea /flow/data  /flow/areaoutput

[hadoop@hadoop1 ~]$ hadoop fs -ls /flow/areaoutput
Found 7 items
-rw-r--r--   1 hadoop supergroup          0 2018-03-18 13:57 /flow/areaoutput/_SUCCESS
-rw-r--r--   1 hadoop supergroup         66 2018-03-18 13:56 /flow/areaoutput/part-r-00000
-rw-r--r--   1 hadoop supergroup         98 2018-03-18 13:56 /flow/areaoutput/part-r-00001
-rw-r--r--   1 hadoop supergroup         97 2018-03-18 13:57 /flow/areaoutput/part-r-00002
-rw-r--r--   1 hadoop supergroup         62 2018-03-18 13:57 /flow/areaoutput/part-r-00003
-rw-r--r--   1 hadoop supergroup        130 2018-03-18 13:57 /flow/areaoutput/part-r-00004
-rw-r--r--   1 hadoop supergroup        219 2018-03-18 13:57 /flow/areaoutput/part-r-00005

[hadoop@hadoop1 ~]$ hadoop fs -cat /flow/areaoutput/part-r-00000
13502468823 13502468823 102 7335
13560439658 13560439658 5892    400

2. The shuffle process under MRAppMaster

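Hadoop's shuffle is everything that happens between the map-side output and the reduce-side input. It is arguably the core of Hadoop,
because it consumes network bandwidth, the scarcest resource in a cluster. The shuffle therefore exposes many tunable parameters and leaves room for many strategies.
For more depth, see the articles by Dong Xicheng or his books on MapReduce.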

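The figure above is divided into a Map task stage and a Reduce task stage. The red and green lines running from the map-side output to the reduce side show the data flow, which is the shuffle process we want to understand.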

2.1 Map side

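(1) On the map side the first thing involved is the InputSplit. Each InputSplit describes a slice of the data stored on the DataNodes and is assigned to one Mapper task. The Mapper task produces <K2,V2> output, which first goes into a buffer: each map task has a circular in-memory buffer for its output. The default size is 100 MB (the io.sort.mb property). Once the buffer reaches the threshold of 0.8 (io.sort.spill.percent), a background thread writes (spills) the contents to a newly created spill file in the configured directory (mapred.local.dir) on the local Linux disk. In Hadoop 2 and later these properties are named mapreduce.task.io.sort.mb, mapreduce.map.sort.spill.percent and mapreduce.cluster.local.dir.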

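Summary: map output is written to the local disk rather than to HDFS, but it is not written to disk directly; it is buffered in memory first. Buffering reduces disk I/O and speeds up sorting and merging. Because the buffer is 100 MB by default (this is configurable), a map function should use as little memory as it can and leave more memory for the shuffle, which is usually the most time-consuming part of a job.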

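(2) Before a spill is written to disk, the data is partitioned, sorted and, optionally, combined. Partitioning separates the records by the reducer they are destined for; the records within each partition are then sorted; if a Combiner is configured, it runs on the sorted data. After the last record has been written, all spill files are merged into a single file that is partitioned and sorted.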

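(3) Finally the data on disk is delivered to the reducers. In the figure, the map output has three partitions: one is sent to the Reduce task shown, and the other two go to other Reduce tasks. The other three inputs of the Reduce task shown come from map output on other nodes.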
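Steps (1) and (2) can be modelled in a few lines. This is a simplified sketch and not Hadoop's actual buffer implementation: SpillSketch, shouldSpill and spill are hypothetical names, the partitioner is passed in as a function, and the combiner is assumed to sum the values of equal keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

final class SpillSketch {

    // One buffered map output record, tagged with its target partition.
    static final class Rec {
        final int partition;
        final String key;
        long value;

        Rec(int partition, String key, long value) {
            this.partition = partition;
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return partition + "\t" + key + "\t" + value;
        }
    }

    // True once buffer usage has reached the spill threshold (for example 0.8 of capacity).
    static boolean shouldSpill(long usedBytes, long capacityBytes, double spillPercent) {
        return usedBytes >= (long) (capacityBytes * spillPercent);
    }

    // Partitions, sorts by (partition, key) and combines equal keys by summing their values.
    // Each input element is a {key, value} pair.
    static List<String> spill(List<String[]> buffered, ToIntFunction<String> partitioner) {
        List<Rec> recs = new ArrayList<>();
        for (String[] kv : buffered) {
            recs.add(new Rec(partitioner.applyAsInt(kv[0]), kv[0], Long.parseLong(kv[1])));
        }
        recs.sort((a, b) -> a.partition != b.partition
                ? Integer.compare(a.partition, b.partition)
                : a.key.compareTo(b.key));

        List<String> out = new ArrayList<>();
        Rec current = null;
        for (Rec r : recs) {
            if (current != null && current.partition == r.partition && current.key.equals(r.key)) {
                current.value += r.value; // combine step: same key, add the values
            } else {
                if (current != null) {
                    out.add(current.toString());
                }
                current = r;
            }
        }
        if (current != null) {
            out.add(current.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("spill at 81 MB of 100 MB: " + shouldSpill(81 * mb, 100 * mb, 0.8));

        List<String[]> buffered = new ArrayList<>();
        buffered.add(new String[] {"b", "1"});
        buffered.add(new String[] {"a", "1"});
        buffered.add(new String[] {"b", "2"});
        // Toy partitioner for the demo: key "a" goes to partition 1, everything else to 0.
        for (String line : spill(buffered, k -> k.equals("a") ? 1 : 0)) {
            System.out.println(line);
        }
    }
}
```

In the real framework the spill runs in a background thread while the map task keeps writing into the remaining part of the buffer; the sketch only shows the ordering of the partition, sort and combine steps.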

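Note: compressing the map output as it is written to disk is a very effective way to reduce network traffic. How to enable compression is covered in the third part of this article.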
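To get a feel for why compression helps, the sketch below compresses a block of repetitive, map-output-like text with the JDK's GZIP streams and compares sizes. It is only an illustration: it uses java.util.zip rather than a Hadoop compression codec, and MapOutputCompressionSketch and its sample data are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class MapOutputCompressionSketch {

    // Compresses a byte array with GZIP.
    static byte[] compress(byte[] raw) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(bytes)) {
                gzip.write(raw);
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Restores the original bytes from GZIP-compressed data.
    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gzip.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds repetitive key/value text that stands in for intermediate map output.
    static byte[] sampleMapOutput(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("13560439658\t1116\t954\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleMapOutput(1000);
        byte[] packed = compress(raw);
        System.out.println("raw bytes: " + raw.length + ", compressed bytes: " + packed.length);
    }
}
```

Real map output is less repetitive than this sample, so the ratio in practice depends on the data and on the codec chosen.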

2.2 Reduce side

(1) Copy phase: the reducer fetches its partition of each map output file over HTTP.

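The reduce side may pull data from the output of n map tasks, and those map tasks finish at different times. When a map task finishes, its TaskTracker is notified and reports this to the JobTracker; the reduce task polls the JobTracker periodically to learn which map outputs are ready (this describes Hadoop 1; under YARN the reduce task gets the same information from the MRAppMaster). By default the reduce side uses 5 copier threads to fetch data from the map side.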

(2) Merge phase: if several files build up on disk, they are merged.

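Data copied from the map side is first written to a buffer on the reduce side. As on the map side, once buffer usage reaches a threshold the data is written to disk, being merged and sorted on the way (and combined, if a Combiner is configured). If several disk files have built up, they are merged as well; the result of the final merge is fed directly to reduce rather than written to disk.

(3) Reducer input: finally, the merged result is passed to the Reduce task as its input.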
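The merge phase combines several runs that are already sorted into one sorted stream. A common way to do this is a k-way merge with a priority queue, sketched below. SortedRunMerger is a hypothetical class operating on in-memory lists of keys, whereas the framework merges files on disk and segments in memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

final class SortedRunMerger {

    // Read position inside one sorted run.
    private static final class Cursor {
        final List<String> run;
        int pos;

        Cursor(List<String> run) {
            this.run = run;
        }

        String head() {
            return run.get(pos);
        }
    }

    // Merges runs that are each sorted ascending into one ascending list.
    static List<String> merge(List<List<String>> runs) {
        // The heap always exposes the run whose current head is smallest.
        PriorityQueue<Cursor> heap = new PriorityQueue<>((a, b) -> a.head().compareTo(b.head()));
        for (List<String> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.head());
            c.pos++;
            if (c.pos < c.run.size()) {
                heap.add(c); // re-insert with its next element as the new head
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> runs = new ArrayList<>();
        runs.add(Arrays.asList("13480253104", "13926435656"));
        runs.add(Arrays.asList("13502468823", "13560439658"));
        runs.add(Arrays.asList("13480253104", "84138413"));
        System.out.println(merge(runs));
    }
}
```

Because every run is already sorted, the merge only ever compares the current heads of the runs, so equal keys from different map outputs end up next to each other, which is what lets reduce receive all values for a key together.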

Summary: the shuffle is complete only once the reducer's input is determined. After that the reducer runs and finally stores its result on HDFS.