Section 3: MapReduce (Part 1)

Author: 吴琼老师

Last modified: 2022-09-19 15:22:50

Category: 大数据BigData. Tags: mapreduce, big data, hadoop

First published: 2022-09-18 11:33:34

Article link: https://blog.csdn.net/u013280750/article/details/126799634

MapReduce

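1. Overview

1.1 What Is MapReduce

MapReduce comes from one of the three Google papers often credited with changing the industry (the other two describe GFS and Bigtable). It is a simplified programming model for parallel computation. Its most significant contribution is that it lets developers with no distributed-programming experience run their programs on a distributed system without writing any parallel code themselves.
MapReduce follows the idea of "split the work, then merge the results": the processing (for example, cleaning) of a large data set is handed out to worker nodes, the intermediate results of those nodes are collected, and a final aggregation step produces the result.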

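1.2 Why Process and Compute Files on a Cluster?

The name MapReduce can be read as two words that cooperate on a single job.
- Two components: the Mapper (mapping) and the Reducer (reduction).
- In short: the Mapper cleans and organizes the data; the Reducer aggregates and merges it.
On a single machine, HDFS is mainly responsible for managing and storing data, so it mostly uses disk capacity, while MapReduce mostly uses the CPU for computation. Together they make use of the whole machine.
- Figure: component names of HDFS and MapReduce in a cluster.
With that background, recall how files used to be processed: on a single machine. Why move to a cluster? Big-data workloads deal with data sets at the TB or PB scale. A single data set is too large for one machine, so the work is spread across a cluster, which boils down to "more hands make light work".
- MapReduce applies the cluster idea: each machine processes one part of the whole file.
- MapReduce has its own scheduling mechanism for assigning tasks: the ResourceManager assigns the work and the NodeManagers carry it out.
- MapReduce cuts the input into logical splits (InputSplit). The default split size is 128 MB, and each split is processed by one map task.
- MapReduce processes data line by line by default.
- MapReduce monitors worker nodes through RPC heartbeats, so that lost nodes are detected and computing capacity is maintained.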
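The splitting rule can be sketched as a small calculation. This is a simplified model, assuming a fixed 128 MB split size and the 10% slack that Hadoop's FileInputFormat allows before it cuts off another split. The names SplitEstimate and countSplits are hypothetical and not part of Hadoop, and empty files are ignored.

```java
class SplitEstimate {
    static final long SPLIT_SIZE = 128L * 1024 * 1024; // assumed split size: 128 MB
    static final double SPLIT_SLOP = 1.1;              // the last split may be up to 10% larger

    // Returns how many logical splits (and therefore map tasks) a file of this size yields.
    static int countSplits(long fileSizeBytes) {
        int splits = 0;
        long remaining = fileSizeBytes;
        // Cut full-size splits while more than 1.1 split sizes remain
        while ((double) remaining / SPLIT_SIZE > SPLIT_SLOP) {
            splits++;
            remaining -= SPLIT_SIZE;
        }
        if (remaining > 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        System.out.println(countSplits(300 * mb)); // 3: 128 MB + 128 MB + 44 MB
        System.out.println(countSplits(130 * mb)); // 1: 130 MB is within the 10% slack
    }
}
```

Under these assumptions a 300 MB file is handled by three map tasks, while a 130 MB file stays in one split instead of producing a tiny 2 MB second split.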

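1.3 Cluster vs. Distributed System

Cluster: the main purpose is to share request load by adding machines. Every node in the cluster performs the complete task.
- The same service is deployed on several servers.
Distributed system: the main use case is a workload that a single machine can no longer handle, so several nodes must be combined, and those nodes depend on and interact with each other.
- One system is split into subsystems that are deployed on different servers.

An everyday example:
Think of the chefs and the prep cooks in a restaurant kitchen. Several chefs who each cook complete dishes resemble a cluster; a prep cook and a chef who each handle a different step of the same dish resemble a distributed system.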

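2. The MapReduce Programming Model

2.1 How the Programming Model Evolved

Consider the problem of counting the words in a large file.
- How would a traditional program solve it? The data is read from HDFS.
- Upload the prepared data to HDFS first.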

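import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/**
 * Single-machine word count that reads its input from HDFS.
 */
public class WordCount {
    public static void main(String[] args) throws Exception {
        // 1. Connect to HDFS as user "root"
        FileSystem fs = FileSystem.get(new URI("hdfs://192.168.150.130:9000"), new Configuration(), "root");
        // 2. Open the file
        FSDataInputStream fin = fs.open(new Path("/wordcount.txt"));
        // 2.1 Wrap the byte stream in a BufferedReader for efficient line reads
        BufferedReader br = new BufferedReader(new InputStreamReader(fin));

        // 3. Map from word to number of occurrences
        HashMap<String, Integer> hashMap = new HashMap<>();

        // 4. Read line by line
        String line = null;
        while ((line = br.readLine()) != null) {
            String[] words = line.split(" "); // split on spaces
            // 4.1 Count each word
            for (String word : words) {
                if (hashMap.containsKey(word)) {
                    hashMap.put(word, hashMap.get(word) + 1); // seen before: add 1
                } else {
                    hashMap.put(word, 1);
                }
            }
        }

        // 5. Print the counts
        for (Map.Entry<String, Integer> entry : hashMap.entrySet()) {
            System.out.println(entry.getKey() + "----" + entry.getValue());
        }

        // Close in reverse order of opening
        br.close();
        fin.close();
    }
}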

Output:

yangmi   ...    2
hello   ...    6
liuyifei   ...    1
angelbaby   ...    3

A question to think about: how would you design this so that several machines do the work together?

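2.2 The Programming Model in Brief

The figure above is essentially the design idea that MapReduce grew out of. As the name suggests, MapReduce consists of two parts: Map means mapping and Reduce means reduction.
Map and Reduce can be understood as follows.
- A large input file is cut into several splits.
- Each split is processed by a separate node. This is the Map step.
- The results computed by the nodes are merged into the final result. This is the Reduce step.
- A job = Map + Reduce. The output of Map is the input of Reduce, and every input and output is a <Key,Value> pair.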
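The flow above can be imitated in plain Java without a cluster. The sketch below is only an in-memory model of the idea, not Hadoop code: the class MiniMapReduce and its methods are hypothetical names, each input string stands in for one split, and a sorted map plays the role of the grouping step between Map and Reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class MiniMapReduce {
    // Map step: turn one split into <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Grouping step: collect all values that share a key; TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse the value list of one key into a single count.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs map on every split, groups the pairs, then reduces each key.
    static Map<String, Integer> run(List<String> splits) {
        List<Map.Entry<String, Integer>> all = new ArrayList<>();
        for (String split : splits) {
            all.addAll(map(split)); // on a cluster, each split could be mapped on its own node
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : group(all).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("hello yangmi", "hello liuyifei hello")));
        // prints {hello=3, liuyifei=1, yangmi=1}
    }
}
```

The point of the model is that map never needs to see more than its own split, which is what makes the Map step easy to spread over many machines.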

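3. MapReduce Components

3.1 Importing the MR Dependency JARs

Writing a MapReduce program requires the MapReduce dependency JARs, so add them to the project first.
Upload wordcount.txt to HDFS.

3.2 Programming the Mapper Component

Start with the Mapper and see what it actually does.
- A class that extends Mapper<????>
- A Driver class that launches the program.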

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;


// 1. A map class must extend Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>; the four type parameters are explained later.
public class WordCountMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        context.write(key, value); // context.write emits a record; here the input key and value pass through unchanged
    }
}

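Note: the job's output directory must not exist beforehand, otherwise the job fails. Before running the job a second time, delete the output directory of the previous run.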

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Create the job; a plain new Configuration() is enough when no extra settings are needed
        Job job = Job.getInstance(new Configuration());
        // 2. Tell Hadoop which JAR to ship by naming a class inside it
        job.setJarByClass(WordCountDriver.class);

        // 3. Declare the map output key/value types (classes from org.apache.hadoop.io).
        //    No mapper class is registered here, so Hadoop runs its default pass-through Mapper,
        //    which behaves the same as WordCountMapper above.
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);

        // 4. Input data on HDFS and the output directory (must not exist yet; delete it if it does)
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.150.130:9000/wordcount.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.150.130:9000/resault"));

        // 5. Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}

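3.3 Two Ways to Run the Program

3.3.1 Running as a JAR

Package the program as a JAR (in IDEA: File ---> Project Structure) and upload it to the server.
On the Linux server, run it with the command hadoop jar followed by the JAR name.
- Note: start YARN before running hadoop jar, then use jps to check that the processes are up.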

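3.3.2 Running from IDEA

Running from IDEA also depends on hadoop.dll and winutils, so if these were already set up for the HDFS lessons they can be reused (see the HDFS plugin configuration). Whichever way the program is run, it needs YARN, so when running from IDEA the YARN dependency JARs must be added to the project.
- Note: without the YARN JARs the run fails with an error mentioning org/apache/hadoop/yarn/util/Apps.
Then simply run main. What does the output mean? Remember the byte offset?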
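The keys in that output are byte offsets: for each line, the position in the file at which the line starts. The sketch below recomputes such offsets in plain Java. It assumes "\n" line endings and UTF-8 text, the sample content is made up for illustration, and LineOffsets is a hypothetical helper rather than anything Hadoop provides.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsets {
    // Returns "offset<TAB>line" records, the shape a pass-through mapper emits.
    static List<String> records(String content) {
        List<String> out = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            out.add(offset + "\t" + line);
            // The next line starts after this line's bytes plus one newline byte
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        for (String r : records("hello yangmi\nhello liuyifei\nhello\n")) {
            System.out.println(r);
        }
        // prints:
        // 0	hello yangmi
        // 13	hello liuyifei
        // 28	hello
    }
}
```

The first line is 12 bytes long, so with its newline the second line starts at offset 13, and so on.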

3.4 Mapper Output

Now suppose we want to count the words in wordcount.txt. What should the mapper do?
- Hint: convert Text to String with Text.toString().

package com.java.mapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

// KEYIN and VALUEIN stay the same (byte offset and line content); KEYOUT and VALUEOUT are chosen by the user.
/*
    The desired output is one record per word:
            hello 1
            angelbaby 1
            ...
    so KEYOUT is Text and VALUEOUT is IntWritable.
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String s = value.toString();  // Text is Hadoop's string type; toString() converts it to a Java String
        String[] words = s.split(" ");
        for (String word : words) {
            context.write(new Text(word), new IntWritable(1)); // the value 1 means "one occurrence"
        }
    }
}
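One detail worth checking in the mapper is the split on a single space: when a line contains two spaces in a row, String.split(" ") yields an empty token, which would then be counted as a word. The sketch below shows the effect and one possible alternative. Tokenize is a hypothetical helper, and splitting on "\\s+" is a suggestion here, not what the original mapper does.

```java
import java.util.ArrayList;
import java.util.List;

class Tokenize {
    // Splits on any run of whitespace and drops empty tokens.
    static List<String> words(String line) {
        List<String> out = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(w);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "hello  yangmi";               // two spaces between the words
        System.out.println(line.split(" ").length);  // 3: the middle token is empty
        System.out.println(words(line));             // [hello, yangmi]
    }
}
```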

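Note for the Driver: a custom Mapper only runs if it is registered with job.setMapperClass(). Without that call Hadoop uses its default pass-through Mapper, whose output types no longer match the declared ones, and the job fails with a type mismatch.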

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class Driver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job and associate it with the launcher class
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(Driver.class); // needed when the JAR runs on Linux

        // 1.1 Register the mapper; otherwise the default pass-through Mapper runs and the types do not match
        job.setMapperClass(WordCountMapper.class);

        // 2. Declare the mapper output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);


        // 3. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.150.130:9000/wordcount.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.150.130:9000/resault"));

        // 4. Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}

3.5 Programming the Reduce Component

With the Mapper covered, let's see what the Reduce component does.
- A class that extends Reducer<????>

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

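/*  The first pair of type parameters (KEYIN, VALUEIN) is the mapper's output type.
    The second pair is the output type you choose.
    To see what the framework hands to the reducer,
    the second pair is simply Text for now.
 */
public class WordCountReduce extends Reducer<Text, IntWritable, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // 1. Buffer that collects the concatenated values
        StringBuilder result = new StringBuilder();

        // 2. Walk through all values that arrived for this key
        for (IntWritable value : values) {
            // 3. Append each value, followed by a comma
            result.append(value.get()).append(",");
        }

        // 4. Emit the result; the key has the mapper's KEYOUT type
        context.write(key, new Text(result.toString()));
    }
}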

The Driver class now also registers the Reducer class and declares its output types.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class Driver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job and associate it with the launcher class
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(Driver.class); // needed when the JAR runs on Linux

        // 1.1 Register the mapper and the reducer
        job.setMapperClass(WordCountMapper.class); // without this the default pass-through Mapper runs
        job.setReducerClass(WordCountReduce.class); // the reducer class

        // 2. Declare the mapper output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 2.1 Declare the reducer (final) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // 3. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.150.130:9000/wordcount.txt"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.150.130:9000/resault"));

        // 4. Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}

4. MapReduce Programming Examples

4.1 WordCount

Based on the code above, try writing the word count yourself.
- Hint: IntWritable.get() returns the value as an int.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class WcMapper extends Mapper<LongWritable, Text,Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String[] words = value.toString().split(" ");
        for (String word : words){
            context.write(new Text(word),new IntWritable(1));
        }
    }
}

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class WcReduce extends Reducer<Text, IntWritable, Text,IntWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int count = 0;  // running total
        for (IntWritable value : values) {
            count = count + value.get(); // IntWritable.get() returns the int value
        }
        context.write(key,new IntWritable(count));
    }
}

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WcDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(WcDriver.class); // locates the JAR when running on Linux

        // 2. Register the mapper and reducer classes
        job.setMapperClass(WcMapper.class);
        job.setReducerClass(WcReduce.class);

        // 3. Declare the mapper and reducer output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 4. Set the HDFS input path and the output directory.
        //    All files in the input directory are processed; a single file can be given instead.
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.150.130:9000/mr"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.150.130:9000/resault"));

        // 5. Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}

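4.2 Removing Duplicates

Suppose an e-commerce site is tallying prize claims and a phone number may only claim once, but the data contains many repeated numbers. How can the duplicates be filtered out?
- The key idea is that reduce merges records with the same key. Hint: NullWritable.get() returns a placeholder that plays the role of null.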
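The idea can be tried out in plain Java first. In this minimal sketch every line is treated as a key, and a sorted set keeps one copy per key, which mirrors how reduce is called once for each distinct key. DistinctSketch and the phone numbers are made up for illustration and are not taken from the original data set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class DistinctSketch {
    // A TreeSet is sorted and duplicate-free, like the keys that arrive at the reduce step.
    static List<String> distinct(List<String> lines) {
        return new ArrayList<>(new TreeSet<>(lines));
    }

    public static void main(String[] args) {
        // hypothetical phone numbers
        List<String> lines = Arrays.asList("13800000002", "13800000001", "13800000002");
        System.out.println(distinct(lines)); // [13800000001, 13800000002]
    }
}
```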

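import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class DistinctMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // NullWritable.get() returns the singleton placeholder that stands in for Java's null.
        context.write(value, NullWritable.get()); // the whole line is the key; keys are sorted on the way to reduce
    }
}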

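import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class DistinctReduce extends Reducer<Text, NullWritable, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {

        context.write(key, NullWritable.get()); // reduce runs once per distinct key, so writing the key once removes duplicates
    }
}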

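import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class DistinctDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(DistinctDriver.class); // locates the JAR when running on Linux

        // 2. Register the mapper and reducer classes
        job.setMapperClass(DistinctMapper.class);
        job.setReducerClass(DistinctReduce.class);

        // 3. Declare the mapper and reducer output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // 4. Set the HDFS input path and the output directory.
        //    All files in the input directory are processed; a single file can be given instead.
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.150.130:9000/dis"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.150.130:9000/dis/resault"));

        // 5. Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}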

4.3 Average Salary per Department (1)

Compute the average salary of each department. The source data is an export from a database.

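import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class AvgSalaryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Fields are separated by spaces: index 1 is the salary, index 2 is the department
        String[] arrs = value.toString().split(" ");
        String department = arrs[2];
        int salary = Integer.parseInt(arrs[1]);

        context.write(new Text(department), new IntWritable(salary));
    }
}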

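import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class AvgSalaryReduce extends Reducer<Text, IntWritable, Text, DoubleWritable> {

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        // Sum of the salaries
        int count = 0;
        // Number of salaries
        int num = 0;
        for (IntWritable value : values) { // all values that arrived for this key
            count = count + value.get();
            num++; // one more salary

        }
        double avg = count/num; // Type conversion? Is it fine to write it like this? Is there a problem?
        //System.out.println(key + ": " + count + " " + num);

        context.write(key, new DoubleWritable(avg)); // one output record per key after merging
    }
}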

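import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class AvgSalaryDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(AvgSalaryDriver.class); // locates the JAR when running on Linux

        // 2. Register the mapper and reducer classes
        job.setMapperClass(AvgSalaryMapper.class);
        job.setReducerClass(AvgSalaryReduce.class);

        // 3. Declare the mapper and reducer output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // 4. Set the HDFS input path and the output directory.
        //    All files in the input directory are processed; a single file can be given instead.
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.150.130:9000/salary"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.150.130:9000/salary/resault"));

        // 5. Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}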

A question to think about: is this result correct?
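One way to check is to reproduce the division outside Hadoop. The reducer divides one int by another and only then stores the result in a double, so the fractional part is already gone. The sketch below contrasts that with a division that casts first. AverageSketch and the salary figures are hypothetical.

```java
class AverageSketch {
    // Mirrors the reducer: int / int drops the fraction before the result is widened to double.
    static double truncatedAvg(int sum, int count) {
        return sum / count;
    }

    // Casting one operand first makes the division happen in floating point.
    static double exactAvg(int sum, int count) {
        return (double) sum / count;
    }

    public static void main(String[] args) {
        int sum = 7001;
        int count = 2; // hypothetical salaries 3500 and 3501
        System.out.println(truncatedAvg(sum, count)); // 3500.0
        System.out.println(exactAvg(sum, count));     // 3500.5
    }
}
```

So the averages produced by the job are rounded down to whole numbers whenever the sum is not an exact multiple of the count.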

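4.4 The Highest-Paid Employee (2)

Using the same data as in 4.3, find the person with the maximum salary. The output contains only that person's name and salary.
The question is how to pass the data from map to reduce.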

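import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class MaxMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] words = value.toString().split(" ");
        /* Approach: map splits the records, reduce aggregates them.
         * To end up with a single maximum, every record must reach the same reduce call,
         * so all records share one key and the value carries "name,salary".
         */
        // Concatenate into the form name,salary, for example 徐三,2450
        StringBuilder name_Salary = new StringBuilder().append(words[0]).append(",").append(words[1]);
        String string = name_Salary.toString();
        context.write(new IntWritable(0), new Text(string)); // the key is arbitrary; reduce does not output it
    }
}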

Because every key is 0, the key carries no information, and the Reducer simply does not output it.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class MaxReduce extends Reducer<IntWritable, Text, Text, IntWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        int max = 0;          // highest salary seen so far
        String resName = "";  // name of the person with that salary
        // 1. Walk through the "name,salary" strings
        for (Text value : values) {
            String[] words = value.toString().split(",");
            String name = words[0];
            int salary = Integer.parseInt(words[1]);

            if (max < salary) {
                max = salary;
                resName = name;
            }
        }

        context.write(new Text(resName), new IntWritable(max));
    }
}

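import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class MaxDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(MaxDriver.class); // locates the JAR when running on Linux

        // 2. Register the mapper and reducer classes
        job.setMapperClass(MaxMapper.class);
        job.setReducerClass(MaxReduce.class);

        // 3. Declare the mapper and reducer output types
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 4. Set the HDFS input path and the output directory.
        //    All files in the input directory are processed; a single file can be given instead.
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.150.130:9000/max"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.150.130:9000/max/resault1"));

        // 5. Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}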

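4.5 Object Serialization: Mobile Data Traffic

Given mobile internet usage records, compute the total upstream and downstream traffic per phone number.
This example is object oriented, which brings in object serialization: the serialization and deserialization code must stay consistent with each other.
If the class does not implement the Writable interface, the job fails with a NullPointerException.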
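Why the order matters can be seen with the plain java.io data streams, which offer the same writeUTF/writeLong and readUTF/readLong calls that a Writable uses through DataOutput and DataInput. The sketch below is not Hadoop code: FieldOrderSketch is a hypothetical class with only two fields, and it wraps IOException so that it can be called without a throws clause.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class FieldOrderSketch {
    // Serializes two fields in a fixed order: phone first, then upFlow.
    static byte[] write(String phone, long upFlow) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(phone);
            out.writeLong(upFlow);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in exactly the order in which they were written.
    static String read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String phone = in.readUTF();
            long upFlow = in.readLong();
            return phone + ":" + upFlow;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(write("15639120688", 3936L))); // 15639120688:3936
    }
}
```

The byte stream carries no field names, only values back to back, so swapping the two read calls would make the reader interpret the bytes of the string as a number and the round trip would break.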

Wrap each record in a Flow object. Remember reading a file from HDFS with an explicit encoding?

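import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class Flow implements Writable {
    private String address;
    private String phone_Number;
    private long upFlow;
    private long downFlow;
    private long sumFlow; // total traffic = upstream + downstream

    // No-argument constructor, needed so that Hadoop can create instances when deserializing
    public Flow(){}

    // Constructor that sets the fields
    public Flow(String phone_Number,String address,long upFlow, long downFlow) {
        this.phone_Number =phone_Number;
        this.address = address;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow =upFlow+downFlow; // the total is computed once, at construction time
    }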

    public String getAddress() {
        return address;
    }

    public void setAddress(String address) {
        this.address = address;
    }

    public String getPhone_Number() {
        return phone_Number;
    }

    public void setPhone_Number(String phone_Number) {
        this.phone_Number = phone_Number;
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public String toString() {
        return "Flow{" +
                "address='" + address + '\'' +
                ", phone_Number='" + phone_Number + '\'' +
                ", upFlow=" + upFlow +
                ", downFlow=" + downFlow +
                ", sumFlow=" + sumFlow +
                '}';
    }

    /* 1. Override two methods: write() serializes, readFields() deserializes.
       2. The fields must be read in exactly the order in which they were written.
     */
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(phone_Number);
        dataOutput.writeUTF(address);
        dataOutput.writeLong(upFlow);
        dataOutput.writeLong(downFlow);
        dataOutput.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.phone_Number=dataInput.readUTF();
        this.address=dataInput.readUTF();
        this.upFlow = dataInput.readLong();
        this.downFlow = dataInput.readLong();
        this.sumFlow = dataInput.readLong();
    }

}

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class FlowMapper extends Mapper<LongWritable, Text, Text, Flow> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Sample record: 15639120688	http://v.baidu.com/movie 3936 12058 shanghai
        // 1. Split on the tab: the phone number, then the rest of the record
        String[] words = value.toString().split("\t");
        String phone = words[0];

        // 1.1 Split the rest on spaces: URL, upstream, downstream, address
        String[] words2 = words[1].split(" ");
        long upFlow = Long.parseLong(words2[1]);
        long downFlow = Long.parseLong(words2[2]);
        String address = words2[3];

        // 2. Wrap the fields in a Flow object
        Flow flow = new Flow(phone, address, upFlow, downFlow);
        context.write(new Text(phone), flow);
    }
}

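import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class FlowReduce extends Reducer<Text, Flow, Text, Flow> {
    @Override
    protected void reduce(Text key, Iterable<Flow> values, Context context) throws IOException, InterruptedException {
        // The data contains repeated phone numbers, so the traffic has to be summed per number
        long upFlow = 0;
        long downFlow = 0;
        String address = "";
        String phone_Num = "";

        for (Flow f : values) {
            upFlow = upFlow + f.getUpFlow();       // add up the upstream traffic
            downFlow = downFlow + f.getDownFlow(); // add up the downstream traffic
            phone_Num = f.getPhone_Number();
            address = f.getAddress();
        }

        // Build the result through the constructor, which also computes the total
        Flow flow = new Flow(phone_Num, address, upFlow, downFlow);
        context.write(new Text(key), flow);
    }
}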

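import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlowDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {

        // 1. Create the job
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(FlowDriver.class); // locates the JAR when running on Linux

        // 2. Register the mapper and reducer classes
        job.setMapperClass(FlowMapper.class);
        job.setReducerClass(FlowReduce.class);

        // 3. Declare the mapper and reducer output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Flow.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Flow.class);

        // 4. Set the HDFS input path and the output directory.
        //    All files in the input directory are processed; a single file can be given instead.
        FileInputFormat.setInputPaths(job, new Path("hdfs://192.168.150.130:9000/flow"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://192.168.150.130:9000/flow/resault"));

        // 5. Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}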

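5. Partitioning

5.1 Overview

A MapReduce job sometimes needs to write its final output into several files. For example, to group people by department, everyone from the same department should land in the same file. Who produces the final output? The Reducer.
To get several output files, the same number of Reducer tasks must run. The Reducers receive their data from the Mappers, so the map side has to divide its output and send each portion to the right Reducer. This division is called partitioning (Partition), and the class that implements it is called a Partitioner.
- Hadoop runs a single Reducer by default. The number of Reducers is set in the Driver class.

job.setNumReduceTasks(3); // for example three partitions, which produces three output files

Run the average salary case from 4.3 with this setting and look at the partitions. Why does the result look the way it does?
What does the Partitioner class do?
- By default it takes the key's hashCode() modulo the number of Reducers, which does not guarantee that records are spread evenly across the output files.
- The result can be data skew.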
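The default rule can be written out in a few lines. The sketch below uses the same arithmetic as Hadoop's HashPartitioner (clear the sign bit, then take the remainder), but it is a stand-alone illustration: HashPartitionSketch is a hypothetical class, the department names are made up, and String.hashCode() stands in for Text.hashCode(), which generally produces different values.

```java
class HashPartitionSketch {
    // Clears the sign bit so the result is never negative, then takes the remainder.
    static int partitionFor(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] departments = {"sales", "dev", "hr"}; // hypothetical keys
        for (String d : departments) {
            // The partition depends only on the hash, not on how many records share the key
            System.out.println(d + " -> " + partitionFor(d.hashCode(), 3));
        }
    }
}
```

Because the partition is derived from the hash alone, two keys can land in the same partition while another partition stays empty, and a key with very many records makes its partition much larger than the others.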

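5.2 Custom Partitioner

Rework the mobile traffic case from 4.5 so that the output is written to one file per region.
- The data covers five regions, so five output files are needed. This calls for a custom partitioner.
- A custom partitioner extends Partitioner<KEY, VALUE>: the first type parameter is the Mapper's output key type and the second is the Mapper's output value type.
- In the overridden getPartition(Text text, Flow flow, int numPartitions), the first parameter is the Mapper's output key, the second is the Mapper's output value, and the third is the number of partitions, which equals the number of reduce tasks configured on the job.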

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

/* Custom partitioner
 *      extends Partitioner<?, ?>; it runs on the map side.
 *      The first type parameter is the map output key type.
 *      The second type parameter is the map output value type.
 *
 *      Override getPartition().
 */
public class MyPartitioner extends Partitioner<Text, Flow> {

    /*
     * getPartition receives the map output key as its first argument
     * and the map output value as its second.
     * numPartitions is the number of partitions configured on the job.
     */
    @Override
    public int getPartition(Text text, Flow flow, int numPartitions) {
        if (flow.getAddress().equals("chengdu")) {
            return 0; // goes to partition 0
        } else if (flow.getAddress().equals("guangzhou")) {
            return 1;
        } else if (flow.getAddress().equals("shenzhen")) {
            return 2;
        } else if (flow.getAddress().equals("beijing")) {
            return 3;
        } else {
            return 4; // the remaining region, shanghai
        }
    }
}

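Add the following settings to the Driver class:

        // Custom partitioning: raise the number of reduce tasks and register the partitioner class
        job.setNumReduceTasks(5);
        job.setPartitionerClass(MyPartitioner.class);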
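The region-to-partition mapping can also be kept in a lookup table, which makes it easier to see that the highest index it returns must stay below the number of reduce tasks set on the job. The sketch below shows that variant in plain Java. AddressPartitionSketch is a hypothetical helper, not part of the original code, and the final modulo is an added safeguard for the case that fewer reduce tasks are configured than regions exist.

```java
import java.util.HashMap;
import java.util.Map;

class AddressPartitionSketch {
    private static final Map<String, Integer> PARTITIONS = new HashMap<>();
    static {
        PARTITIONS.put("chengdu", 0);
        PARTITIONS.put("guangzhou", 1);
        PARTITIONS.put("shenzhen", 2);
        PARTITIONS.put("beijing", 3);
    }
    static final int FALLBACK = 4; // shanghai and any address that is not listed

    // Looks up the partition for an address and keeps the result inside [0, numPartitions).
    static int partitionFor(String address, int numPartitions) {
        int p = PARTITIONS.getOrDefault(address, FALLBACK);
        return p % numPartitions;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("chengdu", 5));  // 0
        System.out.println(partitionFor("shanghai", 5)); // 4
    }
}
```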

6. Java and Hadoop Data Types

Java type	Hadoop type
boolean	BooleanWritable
Integer / int	IntWritable
Long / long	LongWritable
Float / float	FloatWritable
Double / double	DoubleWritable
String	Text
null	NullWritable

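7. Scenarios MapReduce Is Not Suited For

MapReduce has many strengths: it handles PB-scale data; it is fault tolerant, so when one machine fails the job can continue on other machines; and it scales out, so more machines can be added when resources run short. There are, however, scenarios it handles poorly:
- Real-time computation: MapReduce is a batch framework and cannot return results within milliseconds or seconds the way MySQL can.
- Stream computation: the input must be a static data set. Data that keeps arriving, such as logs generated in real time, cannot be fed in dynamically and needs a dedicated streaming framework instead.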
