(13) More MapReduce Examples and a Summary

Ares_song

Category: Cloud Computing and Big Data. Tags: hadoop

Article link: https://blog.csdn.net/Ares_song/article/details/106891924

Besides WordCount, this post covers two more MapReduce examples: the Combiner and the Partitioner.

I. MapReduce Example: Combiner

1. About the combiner

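1) A single map task can produce a large amount of output. A combiner merges that output on the map side first, reducing the amount of data transferred to the reducers.

2) At its most basic, a combiner merges values that share a key locally; it acts like a local reduce.

3) Without a combiner, all aggregation is left to the reducers, which is comparatively inefficient. With a combiner, each map task aggregates its own output locally as soon as it finishes, which speeds up the job.

4) In short, a combiner performs a partial, per-map-task aggregation in order to cut network traffic.

5) Note: the combiner's output becomes the reducer's input, and the combiner is pluggable: Hadoop may run it zero, one, or several times on a map task's output. Adding a combiner must therefore never change the final result. Use one only when its input and output key/value types are identical (both must match the map output types) and applying it does not affect the final result, for example when summing or taking a maximum.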

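A combiner is a component of an MR program separate from the Mapper and the Reducer.
The parent class of a combiner is Reducer.
A combiner and a reducer differ in where they run: the combiner runs on the node of each map task and sees only that task's output, whereas the reducer receives the output of all mappers.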

If that still feels abstract, the worked analysis that follows should help.

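2. Combiner analysis

Suppose there are two map tasks.

The first map outputs: (1950,0) (1950,20) (1950,10)

The second map outputs: (1950,25) (1950,15) (1950,30)

Without a combiner, the reduce function is called with (1950, [0,20,10,25,15,30]). Since 30 is the largest value, the output is (1950,30).

With a combiner that takes the local maximum, each map sends only one value, so reduce is called with (1950, [20,30]) and still outputs (1950,30). As an expression: max(0,20,10,25,15,30) = max(max(0,20,10), max(25,15,30)) = max(20,30) = 30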
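The max example can be replayed in a few lines of plain Java. This is only a sketch of the idea: `MaxCombinerSketch`, `combine`, and `reduce` are hypothetical names, and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java simulation of a "max" combiner; no Hadoop classes are used. */
final class MaxCombinerSketch {

    /** Stand-in for the combiner: collapse one map task's values to their local maximum. */
    static int combine(List<Integer> mapOutput) {
        return Collections.max(mapOutput);
    }

    /** Stand-in for the reducer: take the maximum of whatever values arrive. */
    static int reduce(List<Integer> values) {
        return Collections.max(values);
    }

    public static void main(String[] args) {
        List<Integer> map1 = Arrays.asList(0, 20, 10);
        List<Integer> map2 = Arrays.asList(25, 15, 30);

        // Without a combiner, all six values would be shuffled to the reducer.
        List<Integer> all = new ArrayList<>(map1);
        all.addAll(map2);

        // With a combiner, only one value per map task is shuffled.
        List<Integer> combined = Arrays.asList(combine(map1), combine(map2));

        System.out.println(all.size() + " values -> " + reduce(all));
        System.out.println(combined.size() + " values -> " + reduce(combined));
    }
}
```

Both paths yield 30, but the second moves two values instead of six. That saving is the whole point of a combiner.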

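Combiners must be used with care. Computing a maximum works, and the combiner improves efficiency. But what about an average? The true average is avg(0,20,10,25,15,30) = 100/6 ≈ 16.67.

What happens if each map averages its own output in a combiner?

With the split above, the first map emits avg(0,20,10) = 10

and the second map emits avg(25,15,30) ≈ 23.33

so reduce computes avg(10, 23.33) ≈ 16.67. That matches only because both maps happened to emit the same number of values. Split the same six values unevenly, say (0,20) and (10,25,15,30), and the combiners emit 10 and 20, so reduce computes avg(10,20) = 15, which is not 16.67.

So an average of averages is not, in general, the overall average. Use a combiner only for operations where partial aggregation leaves the final result unchanged.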
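One common way to make an average safe to combine is to carry a (sum, count) pair through the combine step and divide only once, at the end. The plain-Java sketch below illustrates this on an uneven split of the six example values, (0,20) and (10,25,15,30). `AverageCombinerSketch` and `SumCount` are hypothetical names; in a real Hadoop job the pair would presumably need to be a custom Writable emitted by the mapper, which is not shown here.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch: an average that survives combining by carrying (sum, count). */
final class AverageCombinerSketch {

    /** Partial aggregate that is safe to merge: a running sum and a count. */
    static final class SumCount {
        final long sum;
        final long count;

        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        double average() {
            return (double) sum / count;
        }
    }

    /** Stand-in for the combiner: summarize one map task's values as (sum, count). */
    static SumCount combine(List<Integer> mapOutput) {
        long sum = 0;
        for (int v : mapOutput) {
            sum += v;
        }
        return new SumCount(sum, mapOutput.size());
    }

    /** Naive approach (wrong in general): average the per-map averages. */
    static double averageOfAverages(List<List<Integer>> mapOutputs) {
        double total = 0;
        for (List<Integer> m : mapOutputs) {
            total += combine(m).average();
        }
        return total / mapOutputs.size();
    }

    /** Stand-in for the reducer: merge the partial aggregates, then divide once. */
    static double reduce(List<SumCount> partials) {
        SumCount acc = new SumCount(0, 0);
        for (SumCount p : partials) {
            acc = acc.merge(p);
        }
        return acc.average();
    }

    public static void main(String[] args) {
        // An uneven split of the six values from the example.
        List<Integer> map1 = Arrays.asList(0, 20);
        List<Integer> map2 = Arrays.asList(10, 25, 15, 30);

        System.out.println("average of averages: " + averageOfAverages(Arrays.asList(map1, map2)));
        System.out.println("sum/count merge:     " + reduce(Arrays.asList(combine(map1), combine(map2))));
    }
}
```

The naive path gives 15.0. The sum/count path gives 100/6, roughly 16.67, matching the average computed over all six values directly.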

Steps to implement a combiner:

Define a custom combiner class that extends Reducer and override its reduce method.
Register it on the job: job.setCombinerClass(CustomCombiner.class)

3. WordCount optimized with a combiner

Below is the combiner version, an upgrade of the WordCount program.

package MapReduce.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * WordCount implemented with MapReduce, with the reducer reused as a combiner.
 */
public class CombinerApp {

    /**
     * Map: reads the input file line by line and emits (word, 1).
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{

        LongWritable one = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // One line of input
            String line = value.toString();

            // Split the line on the delimiter (a single space)
            String[] words = line.split(" ");

            for(String word :  words) {
                // Skip empty tokens produced by consecutive spaces
                if(word.isEmpty()) {
                    continue;
                }
                // Emit the map result through the context
                context.write(new Text(word), one);
            }

        }
    }

    /**
     * Reduce: merges the values of each key (also used as the combiner below).
     */
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

            long sum = 0;
            for(LongWritable value : values) {
                // Add up the occurrences of this key
                sum += value.get();
            }

            // Emit the final count
            context.write(key, new LongWritable(sum));
        }
    }

    /**
     * Driver: wraps all the information about the MapReduce job.
     */
    public static void main(String[] args) throws Exception{

        // Create the Configuration
        Configuration configuration = new Configuration();

        // Delete the output directory if it already exists
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        if(fileSystem.exists(outputPath)){
            fileSystem.delete(outputPath, true);
            System.out.println("output path already existed and has been deleted");
        }

        // Create the Job
        Job job = Job.getInstance(configuration, "wordcount");

        // Set the class used to locate the job's jar
        job.setJarByClass(CombinerApp.class);

        // Set the job's input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // Map settings
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Reduce settings
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Set the combiner class on the job; here its logic is identical to the reducer's
        job.setCombinerClass(MyReducer.class);

        // Set the job's output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

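II. MapReduce Example: Partitioner

1. Understanding the Partitioner

A common misconception among first-time MapReduce users is that a program uses only one reducer. After all, a single reducer sorts all of the data before processing it and stores the output in one single file, and who doesn't like sorted data? It is easy to see that such a constraint makes no sense: most of the time multiple reducers are required, otherwise the map/reduce idea would no longer be useful.

The Partitioner splits the Mapper's intermediate data into partitions and hands all data in the same partition to the same reducer. Partitioning is a key part of the map-side shuffle.
This places two requirements on partitioning:

1) Load balance: distribute the work across the reducers as evenly as possible.

2) Efficiency: assigning a record to a partition must be fast.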

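2. Using a Partitioner

In the old Hadoop API (org.apache.hadoop.mapred), Partitioner is an interface; in the new API (org.apache.hadoop.mapreduce) it is an abstract class (I am currently using version 2.6.5). Hadoop's default partitioner is HashPartitioner, which partitions by the hashCode of the key emitted by the Mapper.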
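The hash-partitioning rule is small enough to try out in plain Java. The sketch below follows the rule HashPartitioner applies, masking off the sign bit and then taking the remainder modulo the number of reduce tasks. `HashPartitionSketch` and `partitionFor` are hypothetical names, and no Hadoop classes are used.

```java
/** Plain-Java sketch of hash partitioning; it mirrors the rule, not Hadoop's source file. */
final class HashPartitionSketch {

    /** Masks off the sign bit so the result is never negative, then takes the remainder. */
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;

        for (String key : new String[] {"xiaomi", "huawei", "iphone7"}) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }

        // Integer.valueOf(-7).hashCode() is -7: a plain % would give a negative partition.
        Integer negative = -7;
        System.out.println("plain %: " + (negative.hashCode() % numReduceTasks));
        System.out.println("masked:  " + partitionFor(negative, numReduceTasks));
    }
}
```

The mask matters because Java's `%` keeps the sign of the dividend: a key with a negative hashCode would otherwise map to a negative, and therefore illegal, partition number.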

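In many scenarios we need to override the Partitioner to meet our own requirements. For example, given nationwide data broken down by province, we often need records from the same province to go into the same file. Overriding the Partitioner achieves exactly that.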

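3. Code example

3.1 Requirement:

Total the quantity for each phone brand and write each brand's result to a different file.

3.2 Analysis:

MapReduce groups the key/value pairs emitted by map by key and distributes them to the reduce tasks. The default rule is the key's hashCode modulo the number of reduce tasks. To distribute the data according to our own rule, we replace the distribution component, the Partitioner: define a CustomPartitioner that extends the abstract class Partitioner, then register it on the job object: job.setPartitionerClass(CustomPartitioner.class)

3.3 Implementation: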

package MapReduce.demo;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class ParititonerApp {

    /**
     * Map: reads the input file; each line is expected to be "<brand> <quantity>".
     */
    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable>{

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // One line of input
            String line = value.toString();

            // Split the line on the delimiter (a single space)
            String[] words = line.split(" ");

            // Skip malformed lines that lack a quantity field
            if(words.length < 2) {
                return;
            }

            context.write(new Text(words[0]), new LongWritable(Long.parseLong(words[1])));

        }
    }

    /**
     * Reduce: merges the values of each key.
     */
    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

            long sum = 0;
            for(LongWritable value : values) {
                // Add up the quantities for this key
                sum += value.get();
            }

            // Emit the final total
            context.write(key, new LongWritable(sum));
        }
    }

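    /**
     * Partitioner: routes each brand to a fixed partition so that each brand
     * ends up in its own output file. Intended for use with 4 reduce tasks.
     */
    public static class MyPartitioner extends Partitioner<Text, LongWritable> {

        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {

            String brand = key.toString();

            // Partition 3 collects every brand not listed below
            int partition = 3;
            if("xiaomi".equals(brand)) {
                partition = 0;
            } else if("huawei".equals(brand)) {
                partition = 1;
            } else if("iphone7".equals(brand)) {
                partition = 2;
            }

            // Stay inside [0, numPartitions) even if fewer reducers are configured
            return partition % numPartitions;
        }
    }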


    /**
     * Driver: wraps all the information about the MapReduce job.
     */
    public static void main(String[] args) throws Exception{

        // Create the Configuration
        Configuration configuration = new Configuration();

        // Delete the output directory if it already exists
        Path outputPath = new Path(args[1]);
        FileSystem fileSystem = FileSystem.get(configuration);
        if(fileSystem.exists(outputPath)){
            fileSystem.delete(outputPath, true);
            System.out.println("output path already existed and has been deleted");
        }

        // Create the Job
        Job job = Job.getInstance(configuration, "wordcount");

        // Set the class used to locate the job's jar
        job.setJarByClass(ParititonerApp.class);

        // Set the job's input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));

        // Map settings
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        // Reduce settings
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // Set the job's partitioner
        job.setPartitionerClass(MyPartitioner.class);
        // Use 4 reducers, one per partition
        job.setNumReduceTasks(4);

        // Set the job's output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
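To see what the job above should produce without a cluster, here is a plain-Java simulation of the same flow: parse "<brand> <quantity>" lines, route each brand to a partition, and sum per brand within each partition. `PartitionedSumSketch` and the sample lines are made up for illustration, and no Hadoop classes are used. Each map in the returned list stands in for one reducer's output file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java simulation of the brand-partitioned sum; not a Hadoop job. */
final class PartitionedSumSketch {

    /** Same routing rule as the custom partitioner: three named brands, the rest in partition 3. */
    static int partitionFor(String brand, int numPartitions) {
        int partition;
        switch (brand) {
            case "xiaomi":  partition = 0; break;
            case "huawei":  partition = 1; break;
            case "iphone7": partition = 2; break;
            default:        partition = 3; break;
        }
        return partition % numPartitions;
    }

    /** Returns one sorted brand-to-total map per partition, like one output file per reducer. */
    static List<Map<String, Long>> run(List<String> lines, int numPartitions) {
        List<Map<String, Long>> partitions = new ArrayList<Map<String, Long>>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new TreeMap<String, Long>());
        }
        for (String line : lines) {
            String[] fields = line.split(" ");
            if (fields.length < 2) {
                continue; // skip malformed lines
            }
            long quantity = Long.parseLong(fields[1]);
            partitions.get(partitionFor(fields[0], numPartitions))
                      .merge(fields[0], quantity, Long::sum);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // Made-up sample input in the "<brand> <quantity>" format the mapper expects.
        List<String> lines = Arrays.asList(
                "xiaomi 200", "huawei 300", "xiaomi 100", "iphone7 500", "nokia 20");

        List<Map<String, Long>> out = run(lines, 4);
        for (int i = 0; i < out.size(); i++) {
            System.out.println("partition " + i + ": " + out.get(i));
        }
    }
}
```

With four partitions, each of the three named brands gets a partition of its own and every other brand lands in the last one. This mirrors the four reduce tasks configured in the driver.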

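III. Summary of MR Program Development

MapReduce programming largely follows a fixed pattern, with little room for variation except in the following places:

1. Input interface: InputFormat -> FileInputFormat (the common abstract base class for reading file data), DBInputFormat (for reading data from a database)
The default implementation is TextInputFormat: job.setInputFormatClass(TextInputFormat.class)
TextInputFormat reads one line of text at a time and returns the line's starting offset as the key and the line's content as the value.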

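2. Processing logic: Mapper
The user supplies the logic by overriding map(), setup(), and cleanup().

3. During the shuffle, map output is partitioned and sorted. Two components can be customized here:
Partitioner
The default implementation is HashPartitioner, which returns a partition number computed from the key and numReduceTasks:

(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks
The default HashPartitioner is usually sufficient; write a custom one if the business logic has special requirements.

Comparable
When a custom object is emitted as the key, it must implement the WritableComparable interface and override its compareTo() method.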

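4. Reduce-side grouping comparator: GroupingComparator
After a reduce task has received its input (all the data of one partition), it first groups the data. By default, records with equal keys form a group. reduce() is then called once per group: the key of the first key/value pair in the group is passed as the key argument, and an iterator over the group's values is passed as the values argument.

This mechanism gives an efficient way to find the maximum per group. Wrap the data in a custom bean and write its compareTo method to sort in descending order. Then define a custom GroupingComparator that groups beans by the business id (an order number, say). The maximum we want is then simply the key passed into reduce().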

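5. Processing logic: Reducer
The user supplies the logic by overriding reduce(), setup(), and cleanup().

6. Output interface: OutputFormat -> a family of subclasses: FileOutputFormat, DBOutputFormat, ...
The default implementation is TextOutputFormat, which writes each key/value pair as one line of the target text file.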
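The grouped-maximum trick from point 4 can be simulated in plain Java: sort the beans by order id and then by amount descending, group on the order id alone, and keep the first bean of each group. `GroupMaxSketch` and `OrderBean` are hypothetical names and the data is made up. In a real job the bean would implement WritableComparable and the grouping would be registered with job.setGroupingComparatorClass(...); neither is shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java simulation of "sort descending, group by id, take the first key". */
final class GroupMaxSketch {

    /** Hypothetical bean: ordered by orderId ascending, then by amount descending. */
    static final class OrderBean implements Comparable<OrderBean> {
        final String orderId;
        final long amount;

        OrderBean(String orderId, long amount) {
            this.orderId = orderId;
            this.amount = amount;
        }

        @Override
        public int compareTo(OrderBean other) {
            int byId = orderId.compareTo(other.orderId);
            // Reverse the amount comparison so the largest amount sorts first.
            return byId != 0 ? byId : Long.compare(other.amount, amount);
        }
    }

    /** Sorts like the shuffle would, groups by orderId only, keeps the first bean per group. */
    static Map<String, Long> maxPerOrder(List<OrderBean> beans) {
        List<OrderBean> sorted = new ArrayList<>(beans);
        Collections.sort(sorted);

        Map<String, Long> result = new LinkedHashMap<>();
        for (OrderBean bean : sorted) {
            // Grouping step: the first bean seen for an orderId carries its maximum.
            result.putIfAbsent(bean.orderId, bean.amount);
        }
        return result;
    }

    public static void main(String[] args) {
        List<OrderBean> beans = Arrays.asList(
                new OrderBean("order_1", 10), new OrderBean("order_2", 5),
                new OrderBean("order_1", 30), new OrderBean("order_2", 7));
        System.out.println(maxPerOrder(beans));
    }
}
```

Because the sort already puts the largest amount first within each order, the grouping step never has to scan the rest of the group. That is what makes the approach efficient.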
