Section 4: MapReduce (Part 2)

Author: 吴琼老师

Last modified: 2022-09-27 13:51:15

Category: 大数据BigData. Tags: mapreduce, hadoop, big data

First published: 2022-09-27 13:47:24

Article link: https://blog.csdn.net/u013280750/article/details/126915589

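How MapReduce Works

1. Topic: the job execution flow

After the first day of MapReduce you should already have a feel for the programming model. Next we walk through how a MapReduce job actually runs, i.e. the job execution flow. The component names below follow the classic MapReduce 1 description (JobTracker and TaskTracker).

The flow has four phases: job submission (steps 1-4), job initialization (steps 5-6), task assignment (steps 7-8), and task execution (steps 9-10).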

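(1) run job: equivalent to calling job.waitForCompletion(true). The client collects the job's runtime parameters and validates them, for example the Driver class and the input and output paths. If a check fails, an error is reported and the submission is aborted.
(2) get new job ID: the client asks the ResourceManager (JobTracker) for a MapReduce job ID, which uniquely identifies the job.
(3) The client computes the input splits and uploads the resources the job needs (roughly: the job jar and the job.xml configuration) to HDFS.
(4) submit job: the client tells the JobTracker that the job is ready to run and where its resources are stored.
(5) The JobTracker receives the job information and initializes the job, including allocating resources for it.
(6) It works out how many map and reduce tasks to create: one map task per input split (a file smaller than the 128 MB block size yields one split) and one reduce task per partition, as configured by the user.
(7) Each TaskTracker sends heartbeats to the JobTracker, reporting its status and picking up the tasks assigned to it.
(8) The TaskTracker downloads the job resources (the jar) from HDFS to its own node.
(9) It launches a JVM to run the task.
(10) The JVM runs the map or reduce task, and the final output files are written to HDFS.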
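
The number of map tasks in step 6 follows from the number of input splits. As a rough illustration, the sketch below counts the splits for one file the way Hadoop's FileInputFormat does in the common case: it keeps cutting full-size splits while the remainder is more than 1.1 times the split size. The class and method names are made up for this example, and the split size is assumed to equal the 128 MB block size.

```java
// Illustrative sketch only: SplitCountSketch and countSplits are hypothetical names.
final class SplitCountSketch {
    // A final chunk up to 10% larger than the split size is not cut again.
    static final double SPLIT_SLOP = 1.1;

    /** Counts the splits for one non-empty file of fileLength bytes. */
    static int countSplits(long fileLength, long splitSize) {
        if (fileLength <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("lengths must be positive");
        }
        int splits = 0;
        long remaining = fileLength;
        // Cut full-size splits while the remainder is clearly larger than one split.
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        // Whatever is left becomes the last split.
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println(countSplits(100 * mb, 128 * mb));
        System.out.println(countSplits(130 * mb, 128 * mb));
        System.out.println(countSplits(300 * mb, 128 * mb));
    }
}
```

Under these assumptions a 130 MB file still gives a single split, because 130/128 is below the 1.1 slop factor, while a 300 MB file gives three.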

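2. Processing multiple files

2.1 Summing scores per subject

Multi-file computation. Requirement: compute each student's total score per subject over three months, using an object-oriented design.
- Hint 1: how can the mapper reliably find out the name of the file it is processing?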

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class Score implements Writable {
    // Name, Chinese, math, English
    private String student_Name;
    private int chinese;
    private int math;
    private int english;

    // MapReduce serialization; fields are written in this order
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(student_Name);
        dataOutput.writeInt(chinese);
        dataOutput.writeInt(math);
        dataOutput.writeInt(english);
    }

    // MapReduce deserialization; fields must be read in the same order as written
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.student_Name = dataInput.readUTF();
        this.chinese = dataInput.readInt();
        this.math = dataInput.readInt();
        this.english = dataInput.readInt();
    }

    // Getters and setters

    public String getStudent_Name() {
        return student_Name;
    }

    public void setStudent_Name(String student_Name) {
        this.student_Name = student_Name;
    }

    public int getChinese() {
        return chinese;
    }

    public void setChinese(int chinese) {
        this.chinese = chinese;
    }

    public int getMath() {
        return math;
    }

    public void setMath(int math) {
        this.math = math;
    }

    public int getEnglish() {
        return english;
    }

    public void setEnglish(int english) {
        this.english = english;
    }

    // toString()

    @Override
    public String toString() {
        return "Score{" +
                "student_Name='" + student_Name + '\'' +
                ", chinese=" + chinese +
                ", math=" + math +
                ", english=" + english +
                '}';
    }
}
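
The write/readFields pair only works if both methods handle the fields in exactly the same order. The sketch below shows the same round trip with plain JDK streams, so it runs without Hadoop on the classpath. ScoreRoundTrip is a hypothetical stand-in for the Score bean and does not implement Hadoop's Writable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for the Score bean, using only JDK classes.
final class ScoreRoundTrip {
    String name;
    int chinese;
    int math;
    int english;

    // Same field order as Score.write
    void write(DataOutputStream out) throws IOException {
        out.writeUTF(name);
        out.writeInt(chinese);
        out.writeInt(math);
        out.writeInt(english);
    }

    // Same field order as Score.readFields
    void readFields(DataInputStream in) throws IOException {
        name = in.readUTF();
        chinese = in.readInt();
        math = in.readInt();
        english = in.readInt();
    }

    /** Serializes src to bytes and reads it back into a fresh object. */
    static ScoreRoundTrip copyViaBytes(ScoreRoundTrip src) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                src.write(out);
            }
            ScoreRoundTrip dst = new ScoreRoundTrip();
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                dst.readFields(in);
            }
            return dst;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Swapping two reads of the same type in readFields would silently put values in the wrong fields, which is why the order matters.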

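The key technique: getting the name of the HDFS file being processed
- FileSplit fileSplit = (FileSplit) context.getInputSplit();
- String file_name = fileSplit.getPath().getName();

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ScoreMapper extends Mapper<LongWritable, Text,Text,Score> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1. Split the line and pick out the fields we need.
        String[] data = value.toString().split(" ");

        String name = data[1]; // student name
        int score = Integer.parseInt(data[2]); // the score: Chinese, math, or English, depending on the file

        // 2. Wrap the values in a Score object.
        Score s = new Score();
        s.setStudent_Name(name);

        // 2.1 A file smaller than 128 MB is a single split; read the file name from the split.
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String file_name = fileSplit.getPath().getName();

        if (file_name.equals("Chinese.txt")){
            s.setChinese(score);
        }else if (file_name.equals("English.txt")){
            s.setEnglish(score);
        }else{
            // Any other file is treated as the math file.
            s.setMath(score);
        }

        // 3. Emit <name, Score>.
        context.write(new Text(s.getStudent_Name()),s);
    }
}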

import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ScoreReduce extends Reducer<Text,Score,Text,Score> {
    @Override
    protected void reduce(Text key, Iterable<Score> values, Context context) throws IOException, InterruptedException {
        // 1. Sum each subject over all records of this student.
        int math = 0;
        int english = 0;
        int chinese = 0;

        for (Score value : values){
            math = math + value.getMath();
            english = english + value.getEnglish();
            chinese = chinese + value.getChinese();
        }

        // 2. Wrap the totals in a Score object and emit it.
        Score temp = new Score();
        temp.setStudent_Name(key.toString());
        temp.setChinese(chinese);
        temp.setMath(math);
        temp.setEnglish(english);

        context.write(key,temp);
    }
}

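import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ScoreDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job and tell Hadoop which jar contains it.
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(ScoreDriver.class);

        // 2. Set the Mapper and Reducer classes.
        job.setMapperClass(ScoreMapper.class);
        job.setReducerClass(ScoreReduce.class);

        // 3. Set the Mapper and Reducer output types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Score.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Score.class);

        // 4. Set the paths; every file in the input directory is processed.
        //    The output directory must not exist before the job runs.
        FileInputFormat.setInputPaths(job,new Path("hdfs://192.168.150.130:9000/score"));
        FileOutputFormat.setOutputPath(job,new Path("hdfs://192.168.150.130:9000/score/resault"));

        // 5. Submit and wait; exit non-zero if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}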

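2.2 Sorting with WritableComparable<>

Sort the movie records by rate (rating) in descending order, highest first.
- Input data comes in many formats; in this example each line is a JSON object.
A type that has to be compared and sorted must implement WritableComparable<>.
- It must also override the comparison method compareTo.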

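import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class Movie implements WritableComparable<Movie> {
    private long movie_id; // movie id
    private int rate; // rating
    private int user_id; // user id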

    public long getMovie_id() {
        return movie_id;
    }

    public void setMovie_id(long movie_id) {
        this.movie_id = movie_id;
    }

    public int getRate() {
        return rate;
    }

    public void setRate(int rate) {
        this.rate = rate;
    }

    public int getUser_id() {
        return user_id;
    }

    public void setUser_id(int user_id) {
        this.user_id = user_id;
    }

    // Serialization
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(movie_id);
        dataOutput.writeInt(rate);
        dataOutput.writeInt(user_id);
    }

    // Deserialization
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.movie_id =dataInput.readLong();
        this.rate = dataInput.readInt();
        this.user_id=dataInput.readInt();
    }

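    // Comparison used by the framework when sorting keys
    @Override
    public int compareTo(Movie o) {
        // Descending by rate; Integer.compare avoids the overflow risk of o.rate - this.rate.
        return Integer.compare(o.rate, this.rate);
    }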

    @Override
    public String toString() {
        return "Movie{" +
                "movie_id=" + movie_id +
                ", rate=" + rate +
                ", user_id=" + user_id +
                '}';
    }
}
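
A note on writing a descending compareTo. Subtracting the two values works for small ratings, but in general it can overflow and report the wrong order, while Integer.compare cannot. The sketch below contrasts the two on an extreme input; the class and method names are invented for this illustration.

```java
// Hypothetical helper names; only Integer.compare is a real JDK method here.
final class DescendingCompareSketch {

    /** Descending order by subtraction: fine for small values, unsafe in general. */
    static int bySubtraction(int thisRate, int otherRate) {
        return otherRate - thisRate;
    }

    /** Descending order with Integer.compare: safe for every int. */
    static int byIntegerCompare(int thisRate, int otherRate) {
        return Integer.compare(otherRate, thisRate);
    }

    public static void main(String[] args) {
        // Ordinary ratings: both agree that 5 sorts before 3 (negative result).
        System.out.println(bySubtraction(5, 3) < 0);
        System.out.println(byIntegerCompare(5, 3) < 0);
        // Extreme value: the subtraction wraps around and flips the sign.
        System.out.println(bySubtraction(Integer.MIN_VALUE, 1) > 0);
        System.out.println(byIntegerCompare(Integer.MIN_VALUE, 1) > 0);
    }
}
```

Movie ratings are far too small to overflow, so this is about habit rather than a failure you would see with the data in this example.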

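In the Mapper, the point to watch is how the JSON string is parsed.
- The job only needs sorting. The framework sorts the map output keys during the shuffle, so no custom Reducer is written (the default identity Reducer passes the sorted records through).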

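import java.io.IOException;

// Assumes Jackson 2 is on the classpath; with Jackson 1 the package is org.codehaus.jackson.
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CompareMapper extends Mapper<LongWritable, Text,Movie, NullWritable> {

    // One ObjectMapper per task is enough; it is reused for every line.
    private final ObjectMapper o = new ObjectMapper();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1. Parse the JSON line.
        JsonNode tree = o.readTree(value.toString());
        // 2. Read the fields.
        int rate = tree.get("rate").asInt();
        int uid = tree.get("uid").asInt();
        long movie = tree.get("movie").asLong();

        // 3. Fill the Movie object and emit it as the key, so it takes part in the sort.
        Movie m = new Movie();
        m.setMovie_id(movie);
        m.setRate(rate);
        m.setUser_id(uid);

        context.write(m,NullWritable.get());
    }
}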

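import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompareDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job and tell Hadoop which jar contains it.
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(CompareDriver.class);

        // 2. Set the Mapper class (no custom Reducer).
        job.setMapperClass(CompareMapper.class);

        // 3. Set the Mapper output types.
        job.setMapOutputKeyClass(Movie.class);
        job.setMapOutputValueClass(NullWritable.class);

        // 4. Set the paths; every file in the input directory is processed.
        FileInputFormat.setInputPaths(job,new Path("hdfs://192.168.150.130:9000/movie"));
        FileOutputFormat.setOutputPath(job,new Path("hdfs://192.168.150.130:9000/movie/resault"));

        // 5. Submit and wait; exit non-zero if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}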

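2.3 Total sort

By default MapReduce sorts the output of each reduce task by key, but there is no ordering between the outputs of different reduce tasks.
A total sort means that the outputs of all reduce tasks, taken together, are globally ordered.
- Single-reducer sort: each map task sorts only its own output, which is not a global order. Sending everything to one reduce task does give a total order, but then a single reduce task does all the work, so parallelism is low and cluster resources are poorly used. The approach below is used instead.
- Range-partitioned sort: each reduce task handles one key range. The job is only as fast as its slowest partition, so the data should be spread fairly evenly across the ranges. This avoids the slowdown caused by data skew.

// 1. Test data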
1123 112 1123
12 34 344
23 323 3232
1 7 8
33 222 3333
1234 3 5 66

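import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TotalMapper extends Mapper<LongWritable, Text, IntWritable,IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] data = value.toString().split(" ");
        // Emit every integer on the line with a count of 1 (same idea as WordCount).
        for (String s : data){
            int result = Integer.parseInt(s);
            context.write(new IntWritable(result),new IntWritable(1));
        }
    }
}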

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class TotalReduce extends Reducer<IntWritable,IntWritable,IntWritable,IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // 1. The sort itself needs no reduce logic; any business logic (sums, string joins) can go here.
        // 1.1 The Mapper emits 1 for every key, so summing gives the number of times the key occurs.
        int temp = 0; // running total of the values
        for (IntWritable value : values){
            temp += value.get(); // count
        }

        context.write(key,new IntWritable(temp));
    }
}

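import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class Totalpartition extends Partitioner<IntWritable,IntWritable> {
    @Override
    public int getPartition(IntWritable key, IntWritable value, int numPartitions) {

        int num = key.get(); // the key value
        // Assign the key to a partition by value range.
        if (num >= 0 && num < 10){
            return 0;
        }else if (num >= 10 && num < 100){
            return 1;
        }else if (num >= 100 && num < 1000){
            return 2;
        }else{
            // Everything else (1000 and above; negative keys are not expected here).
            return 3;
        }
    }
}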
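
To see why range partitioning yields a total order, the following plain-Java sketch applies the same ranges to the test data, sorts each partition independently (a TreeSet stands in for a reduce task's sorted, de-duplicated keys), and concatenates the partitions in order. All names are hypothetical, and the input is assumed to contain only non-negative numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical simulation of range partitioning; no Hadoop classes involved.
final class RangePartitionSketch {

    /** Same ranges as the partitioner in this section. */
    static int partitionFor(int num) {
        if (num >= 0 && num < 10) return 0;
        if (num >= 10 && num < 100) return 1;
        if (num >= 100 && num < 1000) return 2;
        return 3;
    }

    /** Sorts each partition on its own, then reads the partitions in order 0..3. */
    static List<Integer> totalOrder(int[] input) {
        List<TreeSet<Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            partitions.add(new TreeSet<>());
        }
        for (int n : input) {
            partitions.get(partitionFor(n)).add(n);
        }
        List<Integer> out = new ArrayList<>();
        for (TreeSet<Integer> partition : partitions) {
            out.addAll(partition);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] data = {1123, 112, 1123, 12, 34, 344, 23, 323, 3232,
                      1, 7, 8, 33, 222, 3333, 1234, 3, 5, 66};
        System.out.println(totalOrder(data));
    }
}
```

Because every key in partition 0 is smaller than every key in partition 1, and so on, reading the output files in partition order gives one globally sorted sequence.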

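import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TotalDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job and tell Hadoop which jar contains it.
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(TotalDriver.class);

        // 2. Set the Mapper and Reducer classes.
        job.setMapperClass(TotalMapper.class);
        job.setReducerClass(TotalReduce.class);

        // 3. Set the Mapper and Reducer output types.
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(IntWritable.class);

        // 3.1 Four reduce tasks, one per range returned by the custom partitioner.
        job.setNumReduceTasks(4);
        job.setPartitionerClass(Totalpartition.class);
        // 4. Set the paths; every file in the input directory is processed.
        FileInputFormat.setInputPaths(job,new Path("hdfs://192.168.150.130:9000/sort"));
        FileOutputFormat.setOutputPath(job,new Path("hdfs://192.168.150.130:9000/sort/resault"));

        // 5. Submit and wait; exit non-zero if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}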

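2.4 Chaining multiple MapReduce jobs

Requirement: compute the total profit (income minus expense) over three quarters, and sort the result.
- This adds another way of solving problems: some scenarios need two MapReduce jobs, so you have to decide what the first job does and what the second one does.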

The first MapReduce job produces the per-key totals. Its output is not yet sorted by profit.

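import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultMapper_1 extends Mapper<LongWritable,Text,Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1. Skip the header line. The key is the byte offset of the line within the file,
        //    so offset 0 is the first line.
        if (key.get() == 0){
            System.out.println("Skipping header line: " + value);
            return;
        }
        // Key: name + company. Value: profit for this row (income minus expense).
        String[] data = value.toString().split(" ");
        String name = data[1];
        String company = data[2];
        int exp_Money = Integer.parseInt(data[3]);
        int inc_Money = Integer.parseInt(data[4]);

        // Produces records such as: 冯宝宝,哪都通	8818
        // The subtraction is kept as in the original (data[3] minus data[4]). Profit is income
        // minus expense, so check which column holds which value in your file.
        context.write(new Text(name+","+company),new IntWritable(exp_Money-inc_Money));
    }
}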

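import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MultReduce_1 extends Reducer<Text, IntWritable,Text,IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Running total of the profit for this key.
        int temp = 0;
        for (IntWritable value : values){
            temp += value.get();
        }
        context.write(key,new IntWritable(temp));
    }
}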

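import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultDriver_1 {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job and tell Hadoop which jar contains it.
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(MultDriver_1.class);

        // 2. Set the Mapper and Reducer classes.
        job.setMapperClass(MultMapper_1.class);
        job.setReducerClass(MultReduce_1.class);

        // 3. Set the Mapper and Reducer output types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 4. Set the paths; every file in the input directory is processed.
        FileInputFormat.setInputPaths(job,new Path("hdfs://192.168.150.130:9000/mult"));
        FileOutputFormat.setOutputPath(job,new Path("hdfs://192.168.150.130:9000/mult/resault"));

        // 5. Submit and wait; exit non-zero if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}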

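The second Mapper parses the first job's output and wraps each record in an object. Note: MapReduce's text output separates key and value with a tab character. When the first job's output directory on HDFS is used as input, the _SUCCESS marker file is ignored automatically.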

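import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class MultProfit implements WritableComparable<MultProfit> {

    private String name;
    private String company; // company
    private int profit; // total profit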

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getCompany() {
        return company;
    }

    public void setCompany(String company) {
        this.company = company;
    }

    public int getProfit() {
        return profit;
    }

    public void setProfit(int profit) {
        this.profit = profit;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(name);
        dataOutput.writeUTF(company);
        dataOutput.writeInt(profit);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.name =dataInput.readUTF();
        this.company=dataInput.readUTF();
        this.profit=dataInput.readInt();
    }

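    @Override
    public int compareTo(MultProfit o) {
        // Descending by profit; Integer.compare avoids the overflow risk of subtraction.
        return Integer.compare(o.profit, this.profit);
    }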

    @Override
    public String toString() {
        return "MultProfit{" +
                "name='" + name + '\'' +
                ", company='" + company + '\'' +
                ", profit=" + profit +
                '}';
    }
}

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MultMapper_2 extends Mapper<LongWritable,Text,MultProfit, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // 1. The first job wrote key and value separated by a tab, e.g. 冯宝宝,哪都通	10593
        String[] data = value.toString().split("\t");
        int profit = Integer.parseInt(data[1]); // total profit
        // Split "冯宝宝,哪都通" into name and company.
        String[] split = data[0].split(",");

        String name = split[0];
        String company = split[1];

        // 2. Fill the object and emit it as the key, so it takes part in the sort.
        MultProfit multProfit = new MultProfit();
        multProfit.setName(name);
        multProfit.setCompany(company);
        multProfit.setProfit(profit);

        context.write(multProfit,NullWritable.get());
    }
}

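import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultDriver_2 {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job and tell Hadoop which jar contains it.
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(MultDriver_2.class);

        // 2. Set the Mapper class (no custom Reducer).
        job.setMapperClass(MultMapper_2.class);

        // 3. Set the Mapper output types.
        job.setMapOutputKeyClass(MultProfit.class);
        job.setMapOutputValueClass(NullWritable.class);

        // 4. The input is the output directory of the first job.
        FileInputFormat.setInputPaths(job,new Path("hdfs://192.168.150.130:9000/mult/resault"));
        FileOutputFormat.setOutputPath(job,new Path("hdfs://192.168.150.130:9000/mult/resault2"));

        // 5. Submit and wait; exit non-zero if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}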
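
The division of labour between the two jobs can be tried out without a cluster. The sketch below is a plain-Java imitation, not Hadoop code: stage 1 sums the profit per "name,company" key, and stage 2 sorts the totals in descending order. The class name, method names, and sample rows are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical two-stage pipeline mirroring "job 1 aggregates, job 2 sorts".
final class TwoStageSketch {

    /** Stage 1: each row is {key, profit}; sum the profit per key. */
    static Map<String, Integer> sumByKey(String[][] rows) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (String[] row : rows) {
            totals.merge(row[0], Integer.parseInt(row[1]), Integer::sum);
        }
        return totals;
    }

    /** Stage 2: order the stage-1 output by total profit, highest first. */
    static List<Map.Entry<String, Integer>> sortDescending(Map<String, Integer> totals) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(totals.entrySet());
        sorted.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return sorted;
    }

    public static void main(String[] args) {
        String[][] rows = {{"a,x", "100"}, {"b,y", "300"}, {"a,x", "50"}};
        for (Map.Entry<String, Integer> e : sortDescending(sumByKey(rows))) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

The sort has to wait for the totals, which is why it cannot be folded into the first job: the first job's keys are names, while the second job's keys are the profits being sorted.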

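3. Analysis of Combine (map-side aggregation)

3.1 Combine overview

The Mapper first emits <k,v> pairs, and the Reducer then merges them into the result. Suppose the input has 1 billion lines: the Mappers produce 1 billion key/value pairs that must cross the network, which puts pressure on it.
Can we merge on the Mapper side first, for example emit only each Mapper's local maximum when the job computes a maximum? That saves network bandwidth and makes the whole job faster.
A Combiner can be thought of as a Reduce step that runs on the Mapper side: it merges first, then sends the result to the Reducer. The precondition is that it must not change the final result, so not every job can use one. Computing an average is a typical counter-example: the average of per-Mapper averages is generally not the overall average.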
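
The average case is easy to check with a few numbers. In the sketch below (hypothetical names, plain Java), two mappers hold {1, 2, 3} and {10}. Pre-summing is safe, pre-averaging is not, and carrying a (sum, count) pair per mapper restores the correct average.

```java
// Hypothetical illustration of which aggregations survive map-side combining.
final class CombinerSafetySketch {

    static long sum(long[] xs) {
        long total = 0;
        for (long x : xs) {
            total += x;
        }
        return total;
    }

    /** Summing per mapper and then summing the partial sums gives the true total. */
    static long sumOfSums(long[][] perMapper) {
        long total = 0;
        for (long[] part : perMapper) {
            total += sum(part);
        }
        return total;
    }

    /** Averaging per mapper and then averaging the averages is wrong in general. */
    static double meanOfMeans(long[][] perMapper) {
        double total = 0;
        for (long[] part : perMapper) {
            total += (double) sum(part) / part.length;
        }
        return total / perMapper.length;
    }

    /** Combining (sum, count) pairs and dividing once at the end is correct. */
    static double meanFromSumAndCount(long[][] perMapper) {
        long total = 0;
        long count = 0;
        for (long[] part : perMapper) {
            total += sum(part);
            count += part.length;
        }
        return (double) total / count;
    }

    public static void main(String[] args) {
        long[][] perMapper = {{1, 2, 3}, {10}};
        System.out.println(sumOfSums(perMapper));
        System.out.println(meanOfMeans(perMapper));
        System.out.println(meanFromSumAndCount(perMapper));
    }
}
```

With this input the true average is 16 / 4 = 4, while the average of the two per-mapper averages (2 and 10) is 6.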

3.2 Combine illustrated

Take the WordCount example from earlier.

	hello angelbaby
	hello yangmi
	hello angelbaby
	hello angelbaby
	hello yangmi
	hello liuyifei

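Suppose two Mappers process the WordCount input and send their output to a single Reducer; the Reducer then has to perform 11 merges.

Question: what if 10 map tasks were feeding it? The Reducer would be overloaded. How do we solve this? With a Combiner.

Running one Combine pass on the Mapper side reduces the number of merges the Reducer has to do.

Conclusion: the Combiner runs on the map side and aggregates data locally. Less data is sent to the reduce side, the transfer takes less time, and the overall job time goes down.
Combine is one of MapReduce's optimization techniques. Because repeated keys are pre-aggregated before the shuffle, it also softens the effect of skewed (hot) keys on the Reducer.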
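
The saving can be counted on the six sample lines. The sketch below assumes, purely for illustration, that the first three lines go to one mapper and the last three to another; the class and method names are hypothetical. It compares how many key/value pairs leave the mappers with and without a local combine step.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical count of shuffled records for the WordCount sample.
final class CombinerShuffleCount {

    /** Without a combiner, the mapper emits one <word, 1> pair per word. */
    static int pairsWithoutCombiner(String[] lines) {
        int pairs = 0;
        for (String line : lines) {
            pairs += line.trim().split("\\s+").length;
        }
        return pairs;
    }

    /** With a combiner, the mapper emits one <word, localCount> pair per distinct word. */
    static Map<String, Integer> combine(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] mapper1 = {"hello angelbaby", "hello yangmi", "hello angelbaby"};
        String[] mapper2 = {"hello angelbaby", "hello yangmi", "hello liuyifei"};
        System.out.println(pairsWithoutCombiner(mapper1) + pairsWithoutCombiner(mapper2));
        System.out.println(combine(mapper1).size() + combine(mapper2).size());
    }
}
```

Under that split, 12 pairs are shuffled without a combiner and 7 with one; the gap grows with the number of repeated words per mapper.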

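3.3 Combine example

The Combiner's input is the map output, and its output is the Reducer's input. Since the map output types and the reduce input types are the same, the Combiner's input and output key/value types must also be identical.
Create a WcCombine class that extends Reducer and leave everything else unchanged; then register it as the Combiner in the Driver class.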

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WcCombine extends Reducer<Text, IntWritable,Text,IntWritable> {

    // 2. Debug output to check that the Combiner is actually used.
    public WcCombine(){
        System.out.println("WcCombine was called.....");
    }

    /**
     * A Combiner merges values just like a Reducer, only earlier, on the Mapper side.
     *  The Combiner's input is the Mapper's output.
     *  The Combiner's output is the Reducer's input.
     *          So the Combiner's input and output types must be the same.
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // 1. Pre-aggregate the counts for this key.
        int temp = 0;
        for (IntWritable value : values){
            temp += value.get();
        }
        // 1.1 Emit the partial result.
        System.out.println(key+"-------"+temp);
        context.write(key,new IntWritable(temp));
    }
}

The Driver class must register the Combiner.

        // 6. Register the Combiner class
        job.setCombinerClass(WcCombine.class);

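4. Topic: Shuffle

4.1 Overview

The Mapper's output is the Reducer's input, and MapReduce guarantees that the input to every Reducer is sorted by key. The process by which the system performs this sort and transfers the Mapper output to the Reducers as input is called the shuffle.

This stage, between Map and Reduce, is where the magic happens.

4.2 Map side

Buffering. Map output is not written straight to disk; it first goes into a circular memory buffer. The buffer is 100 MB by default and can be changed with mapreduce.task.io.sort.mb.
Spill. The buffer has a threshold (mapreduce.map.sort.spill.percent), 0.8 or 80% by default. Once the threshold is exceeded, the contents start spilling to disk (think of it as: memory is running out, so park the data on disk). While the spill is in progress, map output keeps being written into the circular buffer.
- If the buffer fills up completely in the meantime, the Map blocks until the spill finishes (in other words: the spill takes priority; the 80 MB is being spilled while the remaining 20 MB is left for new map output).
Partition. Before the data is written to disk it is partitioned, and within each partition it is sorted by key. If a Combiner is configured, it runs on the sorted output. The order is: partition, then sort, then combine.
- The purpose of the map-side Combine is to make the map output more compact, so less data is written to disk and sent to the Reducers.
Merge. Every time the buffer reaches the threshold a new spill file is written. Before the map task finishes, the spill files are merged into a single output file that is partitioned and sorted. By default at most 10 streams are merged at once (mapreduce.task.io.sort.factor).
- The difference between combine and merge, given two pairs <hello,1> <hello,1>:
- Combine produces <hello,2>. Merge produces <hello,<1,1>>.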
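
The merge of sorted spill files is a standard k-way merge. The following sketch shows the idea on small in-memory lists of integers, using a priority queue that always yields the smallest head element. It is a simplified illustration with hypothetical names, not Hadoop's actual merger, and it ignores partitions and values.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical k-way merge of already sorted runs ("spill files").
final class SpillMergeSketch {

    /** Merges sorted runs into one sorted list; duplicates are kept, not combined. */
    static List<Integer> merge(List<List<Integer>> sortedRuns) {
        // Heap entries are {value, runIndex, positionInRun}, ordered by value.
        PriorityQueue<int[]> heap =
            new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int r = 0; r < sortedRuns.size(); r++) {
            if (!sortedRuns.get(r).isEmpty()) {
                heap.add(new int[] {sortedRuns.get(r).get(0), r, 0});
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] top = heap.poll();
            out.add(top[0]);
            // Advance within the run the smallest element came from.
            List<Integer> run = sortedRuns.get(top[1]);
            int next = top[2] + 1;
            if (next < run.size()) {
                heap.add(new int[] {run.get(next), top[1], next});
            }
        }
        return out;
    }
}
```

Note that the merge only interleaves records: two equal keys both appear in the output. Turning them into a single record is the Combiner's job, which matches the combine/merge distinction above.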

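4.3 Reduce side

The reduce side is mainly responsible for copying and sorting.

The map side has already assigned every record to a partition, and the partitions correspond to the Reduce tasks; the map output files are partitioned and sorted.
Copy. Each Reduce task fetches the map output that belongs to its partition; this fetch is the copy phase. Small map outputs are copied into the Reduce task's memory; once the in-memory data exceeds a threshold, it is merged and spilled to disk.
Merge. The data held in memory and the files on disk are merged as the copy proceeds, so that neither memory nor the number of disk files grows too large.
Sort. Once all map outputs have been copied, the Reduce task enters the merge phase, which sorts while it merges. Because each map task has already sorted its own output, the Reduce task only needs one final merge sort over the copied data.
Reduce. Finally the reduce function runs and the result is written to HDFS.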

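4.4 In simple terms: what is the circular buffer?

MapTask has an inner class MapOutputBuffer<K, V>. Its init() method allocates the buffer, which is essentially a byte array (byte[] kvbuffer). The array holds the serialized key/value data together with the kvmeta index records (for each record: where the key starts, where the value starts, the partition number, and the value length).
- Its purpose is to collect the map output quickly and sort it in memory, reducing disk I/O.
The circular buffer is a FIFO data buffer, a ring queue with a fixed maximum length (100 MB by default). It works best when data enters and leaves at a roughly steady rate. With this structure it is cheap to tell whether the queue is full or empty, and reads and writes are fast.
- When the buffer fills up to the 0.8 threshold, the 80 MB already written is spilled while the remaining 20 MB keeps accepting map output.
The byte[] kvbuffer is divided into a data region, an index region, and the free space between them. The data and index regions never overlap: they grow in opposite directions from a dividing point called the equator, which initially sits at position 0.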
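
To make the ring-queue idea concrete, here is a deliberately small sketch of a fixed-capacity circular FIFO buffer with a spill threshold. It is an assumption-laden teaching model with invented names: Hadoop's real kvbuffer stores bytes and metadata growing from the equator in opposite directions, which this sketch does not attempt to reproduce.

```java
// Hypothetical, simplified ring buffer; not Hadoop's MapOutputBuffer.
final class RingBufferSketch {
    private final int[] slots;
    private final int spillThreshold; // number of used slots that triggers a spill
    private int head;                 // index of the oldest element
    private int size;                 // number of stored elements

    RingBufferSketch(int capacity, int spillThreshold) {
        this.slots = new int[capacity];
        this.spillThreshold = spillThreshold;
    }

    boolean isEmpty() { return size == 0; }

    boolean isFull() { return size == slots.length; }

    /** True once the used part reaches the threshold (for example 8 of 10 slots). */
    boolean shouldSpill() { return size >= spillThreshold; }

    /** Adds a value at the tail; returns false when the buffer is full. */
    boolean offer(int value) {
        if (isFull()) {
            return false;
        }
        slots[(head + size) % slots.length] = value; // the tail wraps around the array end
        size++;
        return true;
    }

    /** Removes and returns the oldest value (first in, first out). */
    int poll() {
        if (isEmpty()) {
            throw new IllegalStateException("buffer is empty");
        }
        int value = slots[head];
        head = (head + 1) % slots.length;
        size--;
        return value;
    }
}
```

Because only head and size are tracked, checking for full or empty is a constant-time comparison, and freed slots at the front are reused by later writes that wrap around.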

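5 MapReduce tuning suggestions

Reduce data skew by using a custom partitioner (extends Partitioner<key,value>).
Adjust the size of the circular buffer.
Add a Combine step on the map side to reduce the merging done by the Reducers and the network traffic.
Enable compression for MapReduce data.
Scale the hardware: add nodes to the cluster.

5.1 Compression

Using bzip2 as an example.

Just add the following settings in the Driver class:

// Compress the map output to cut the data sent over the network to the reducers.
// Set these on the Configuration before passing it to Job.getInstance(conf).
// (Older releases name these keys mapred.compress.map.output and mapred.map.output.compression.codec.)
conf.setBoolean("mapreduce.map.output.compress", true);
// Choose the codec; this example uses bzip2.
conf.setClass("mapreduce.map.output.compress.codec", BZip2Codec.class, CompressionCodec.class);

// Compress the final job output; this is off by default.
FileOutputFormat.setCompressOutput(job, true);
// Use the bzip2 codec.
FileOutputFormat.setOutputCompressorClass(job, BZip2Codec.class);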

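6 Custom Mapper input format

6.1 Using the line number as the input key

In the earlier examples, Mapper<LongWritable, Text,Text,IntWritable> receives a LongWritable byte offset as its first type and the Text content of each line as its second.
- Now we define a custom input format that passes the line number instead of the offset: the key is an IntWritable line number rather than a LongWritable.
Two classes on the input side have to be written:
- the input format class, extends FileInputFormat<keyin,valuein>{}, which is responsible for reading files;
- the record reader, extends RecordReader<keyin, valuein>.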

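import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// FileInputFormat<key,value>: the key type is IntWritable, holding the line number.
public class WordCountFormat extends FileInputFormat<IntWritable,Text> {
    // Supply the custom RecordReader.
    @Override
    public RecordReader<IntWritable, Text> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) {
        return new WordCountRecordReader();
    }
    // The reader always starts at the beginning of the file, so keep each file in one split.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}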

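import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

// RecordReader is an abstract class; all of its abstract methods must be implemented.
public class WordCountRecordReader extends RecordReader<IntWritable, Text> {

    // Fields shared between the methods below.
    private FileSplit fs; // the split being read
    // Does the actual line-by-line reading.
    // Note the import: org.apache.hadoop.util.LineReader
    private LineReader lineReader;

    // Current key and value
    private IntWritable key;
    private Text value;
    // Counter: number of lines read so far
    private int count;

    /**
     * Called once before any record is read; opens the file behind the split.
     */
    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // 1. Cast the InputSplit to a FileSplit to get at the file path.
        fs = (FileSplit) inputSplit;
        Path path = fs.getPath();
        // 2. Get the file system (HDFS), using the job's own configuration.
        FileSystem fileSystem = path.getFileSystem(taskAttemptContext.getConfiguration());
        // 3. Open the file and wrap the stream in a line reader.
        FSDataInputStream fsin = fileSystem.open(path);
        lineReader = new LineReader(fsin);
    }

    /**
     * Point 1:
     * nextKeyValue() is called repeatedly. Each call that returns true delivers one
     * key/value pair to map(); the first call that returns false ends the loop.
     * Conceptually the framework runs:
     *
     *     while (nextKeyValue()) {
     *         map(getCurrentKey(), getCurrentValue());
     *     }
     *
     * Point 2: after each successful nextKeyValue(), getCurrentKey() and getCurrentValue()
     * are each called once.
     *
     * Point 3: getCurrentKey() supplies the key argument of map(), and getCurrentValue()
     * supplies the value argument.
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // 1. Fresh key and value objects for this record.
        key = new IntWritable();
        value = new Text();
        // 2. Temporary holder for the line.
        Text temp = new Text();
        // 3. Read one line into temp, e.g. the first line "hello angelbaby".
        //    The return value is the number of bytes consumed; 0 means end of file.
        int len = lineReader.readLine(temp);

        if (len == 0){
            // Nothing left to read.
            return false;
        }else {
            // Copy the line into value.
            value.set(temp);
            count++; // one more line read
            key.set(count); // must be key.set(count); "key = count" does not compile
            return true;
        }
    }

    // Supplies the key passed to the Mapper.
    @Override
    public IntWritable getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    // Supplies the value passed to the Mapper.
    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    @Override
    public void close() throws IOException {
        // Release resources: close the reader and the underlying stream.
        if (lineReader != null) {
            lineReader.close();
            lineReader = null;
        }
    }
}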

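import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<IntWritable, Text,IntWritable,Text> {
    @Override
    protected void map(IntWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Pass the <line number, line> pair straight through.
        context.write(key,value);
    }
}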

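import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job and tell Hadoop which jar contains it.
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WordCountDriver.class);

        // 2. Set the Mapper class (no custom Reducer).
        job.setMapperClass(WordCountMapper.class);

        // 3. Set the Mapper output types.
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);

        // Register the custom input format.
        job.setInputFormatClass(WordCountFormat.class);

        // 4. Set the paths; every file in the input directory is processed.
        FileInputFormat.setInputPaths(job,new Path("hdfs://192.168.150.130:9000/mr"));
        FileOutputFormat.setOutputPath(job,new Path("hdfs://192.168.150.130:9000/mr/resault2"));

        // 5. Submit and wait; exit non-zero if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}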
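
The calling contract between the framework and a record reader can be imitated in a few lines of plain Java. In the sketch below (all names hypothetical, no Hadoop classes), the reader hands out <line number, line> pairs and a small driver loop plays the role of the framework, calling nextKeyValue() until it returns false.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory imitation of a line-number record reader.
final class LineNumberReaderSketch {
    private final String[] lines;
    private int next;      // index of the next unread line
    private int key;       // current line number, starting at 1
    private String value;  // current line content

    LineNumberReaderSketch(String[] lines) {
        this.lines = lines;
    }

    /** Advances to the next record; returns false when the input is exhausted. */
    boolean nextKeyValue() {
        if (next >= lines.length) {
            return false;
        }
        value = lines[next];
        next++;
        key = next; // line numbers are 1-based
        return true;
    }

    int getCurrentKey() { return key; }

    String getCurrentValue() { return value; }

    /** Plays the framework: feeds every record to a pass-through "map". */
    static List<String> runMapper(String[] lines) {
        LineNumberReaderSketch reader = new LineNumberReaderSketch(lines);
        List<String> output = new ArrayList<>();
        while (reader.nextKeyValue()) {
            output.add(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
        }
        return output;
    }

    public static void main(String[] args) {
        for (String record : runMapper(new String[] {"hello angelbaby", "hello yangmi"})) {
            System.out.println(record);
        }
    }
}
```

The reader keeps the state (position and counter) while the loop stays generic, which is the same split of responsibilities as between RecordReader and the Mapper's run loop.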

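6.2 Example: a custom input type

The Mapper output should look like this: 张楚岚 英语 80 数学 90 语文 100 java 100
- Approach: the key is the person's name, and the value is the following two lines of data joined together.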

张楚岚
英语 80 数学 90
语文 100 java 100
冯宝宝
英语 20 数学 30
语文 20 java 100
张灵玉
英语 100 数学 100
语文 100 java 100

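// Needs imports for Path, Text, InputSplit, JobContext, RecordReader,
// TaskAttemptContext and FileInputFormat (org.apache.hadoop.*).
public class MyFormat extends FileInputFormat<Text, Text> {

    @Override
    public RecordReader<Text, Text> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        return new MyRecordReader();
    }

    // A record spans three lines and the reader starts at the top of the file: do not split.
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }
}

Note: the value has to be assembled by concatenation inside the record reader.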

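import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class MyRecordReader extends RecordReader<Text,Text> {
    // Fields shared between the methods below.
    private FileSplit fs;
    private Text key;
    private Text value;

    // The line reader
    private LineReader lineReader;

    @Override
    public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
        // 1. The goal is to open an input stream on the file behind the split.
        fs = (FileSplit) inputSplit;
        Path path = fs.getPath(); // the file path
        // 2. Get the file system (HDFS), using the job's own configuration.
        FileSystem fileSystem = path.getFileSystem(taskAttemptContext.getConfiguration());

        // 3. Open the file and wrap the stream in a line reader.
        FSDataInputStream fsin = fileSystem.open(path);
        lineReader = new LineReader(fsin);
    }

    // After each successful nextKeyValue(), getCurrentKey() and getCurrentValue() are called.
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // 1. Fresh key and value objects for this record.
        key = new Text();
        value = new Text();
        // 2. Temporary holder for each line.
        Text temp = new Text();

        int len = lineReader.readLine(temp); // number of bytes consumed; 0 means end of file
        if (len == 0){
            return false;
        }else{
            // 1. The first line of a record is the key (the name).
            key.set(temp);
            // 2. The next two lines form the value; read them in a loop.
            for (int i = 0; i < 2; i++) {
                lineReader.readLine(temp); // second and third line of the record
                value.append(temp.getBytes(),0,temp.getLength());
                // Separate the pieces with a space.
                value.append(" ".getBytes(),0," ".length());
            }

            return true;
        }
    }

    @Override
    public Text getCurrentKey() throws IOException, InterruptedException {
        return key;
    }

    @Override
    public Text getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return 0;
    }

    @Override
    public void close() throws IOException {
        // Close only if the reader was actually opened.
        if (lineReader != null) lineReader.close();
    }
}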

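import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<Text,Text,Text,Text> {
    @Override
    protected void map(Text key, Text value, Context context) throws IOException, InterruptedException {
        // Pass the <name, scores> pair straight through.
        context.write(key,value);
    }
}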

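import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job.
        Job job = Job.getInstance(new Configuration());

        // 2. Set the jar (located via the driver class) and the Mapper class.
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMapper.class);

        // 3. Register the custom input format.
        job.setInputFormatClass(MyFormat.class);

        // 3.1 Set the Mapper output types.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // 4. Set the input and output paths.
        FileInputFormat.setInputPaths(job,new Path("hdfs://192.168.150.130:9000/custom"));
        FileOutputFormat.setOutputPath(job,new Path("hdfs://192.168.150.130:9000/custom/result"));

        // 5. Submit and wait; exit non-zero if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}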
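
The three-lines-per-record grouping can also be tested in isolation. The sketch below is a plain-Java approximation with hypothetical names and made-up ASCII sample data: the first line of each group becomes the key and the next two lines are joined with a space to form the value. Unlike the reader above it does not append a trailing space, and it simply drops an incomplete group at the end of the input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical grouping of three input lines into one <key, value> record.
final class ThreeLineRecordSketch {

    /** Returns one {key, value} pair for every complete group of three lines. */
    static List<String[]> readRecords(List<String> lines) {
        List<String[]> records = new ArrayList<>();
        for (int i = 0; i + 2 < lines.size(); i += 3) {
            String key = lines.get(i);
            String value = lines.get(i + 1) + " " + lines.get(i + 2);
            records.add(new String[] {key, value});
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("alice");
        lines.add("english 80 math 90");
        lines.add("chinese 100 java 100");
        for (String[] record : readRecords(lines)) {
            System.out.println(record[0] + "\t" + record[1]);
        }
    }
}
```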
