MapReduce Framework Principles in Hadoop: Join Applications, Reduce Join with a Hands-On Example, and Map Join Using DistributedCache with a Hands-On Example

Redamancy_06

Last modified: 2022-11-20 20:13:36

Category: Hadoop. Tags: hadoop, mapreduce, big data

First published: 2022-10-09 08:00:00

Article link: https://blog.csdn.net/Redamancy06/article/details/127187881

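13. MapReduce Framework Principles

13.6 Join Applications

13.6.1 Reduce Join

Main work on the Map side: tag each key/value pair coming from a different table or file so that records from different sources can be told apart. The join field becomes the key, the remaining fields plus the new tag become the value, and the pair is written out.
Main work on the Reduce side: by the time the Reducer runs, grouping by the join field is already done. Within each group we only need to separate the records that came from different files (they were tagged in the Map phase) and then merge them.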
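The two phases described above can be sketched in plain Java without any Hadoop dependency. This is only an illustration under stated assumptions: the class `ReduceJoinSketch`, the type `Tagged`, and the methods `tag` and `join` are hypothetical names, a `TreeMap` stands in for the shuffle that groups records by key, and the sample product name is written in ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of the Reduce Join idea.
class ReduceJoinSketch {

    // A tagged record: join key, source flag, and the raw fields of the line.
    static final class Tagged {
        final String key;      // join field (pid)
        final String flag;     // "order" or "pd"
        final String[] fields; // the tab-separated columns of the line

        Tagged(String key, String flag, String[] fields) {
            this.key = key;
            this.flag = flag;
            this.fields = fields;
        }
    }

    // "Map" step: tag a line with its source and extract the join key.
    static Tagged tag(String source, String line) {
        String[] f = line.split("\t");
        return "order".equals(source)
                ? new Tagged(f[1], "order", f)  // order line: id, pid, amount
                : new Tagged(f[0], "pd", f);    // product line: pid, pname
    }

    // "Shuffle" and "Reduce" steps: group by key, then merge the two sources.
    static List<String> join(List<Tagged> records) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (Tagged t : records) {
            groups.computeIfAbsent(t.key, k -> new ArrayList<>()).add(t);
        }
        List<String> out = new ArrayList<>();
        for (List<Tagged> group : groups.values()) {
            String pname = "";
            List<Tagged> orders = new ArrayList<>();
            for (Tagged t : group) {
                if ("pd".equals(t.flag)) {
                    pname = t.fields[1];   // the product row of this group
                } else {
                    orders.add(t);         // an order row of this group
                }
            }
            for (Tagged o : orders) {
                out.add(o.fields[0] + "\t" + pname + "\t" + o.fields[2]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> records = new ArrayList<>();
        records.add(tag("order", "1001\t01\t1"));
        records.add(tag("pd", "01\tXiaomi"));
        records.add(tag("order", "1004\t01\t4"));
        join(records).forEach(System.out::println);
    }
}
```

The point of the tag is visible in `join`: once records with the same pid sit in one group, the flag is the only way to tell an order row from a product row.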

13.6.2 Reduce Join Hands-On Example

13.6.2.1 Requirements

Order table t_order

id	pid	amount
1001	01	1
1002	02	2
1003	03	3
1004	01	4
1005	02	5
1006	03	6

Product information table t_product

pid	pname
01	小米
02	华为
03	格力

Merge the data from the product information table into the order table by product pid.

Final data format

id	pname	amount
1001	小米	1
1004	小米	4
1002	华为	2
1005	华为	5
1003	格力	3
1006	格力	6

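13.6.2.2 Requirements Analysis

Use the join condition as the key of the Map output, so that rows from both tables that satisfy the join condition, each carrying the name of its source file, are sent to the same ReduceTask. The rows are then joined in the Reducer.

13.6.2.3 Code Implementation

Create a reducejoin package.

13.6.2.3.1 Create the TableBean class for the merged product and order data

Create a TableBean class.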

package com.summer.mapreduce.reducejoin;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/**
 * @author Redamancy
 * @create 2022-10-07 9:19
 */
public class TableBean implements Writable {

    private String id; // order id
    private String pid; // product id
    private int amount; // product quantity
    private String pname; // product name
    private String flag; // marks whether the record comes from the order table or the pd table

    public TableBean() {
    }

    public String getId() {
        return id;
    }

    public void setId(String id) {
        this.id = id;
    }

    public String getPid() {
        return pid;
    }

    public void setPid(String pid) {
        this.pid = pid;
    }

    public int getAmount() {
        return amount;
    }

    public void setAmount(int amount) {
        this.amount = amount;
    }

    public String getPname() {
        return pname;
    }

    public void setPname(String pname) {
        this.pname = pname;
    }

    public String getFlag() {
        return flag;
    }

    public void setFlag(String flag) {
        this.flag = flag;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(id);
        dataOutput.writeUTF(pid);
        dataOutput.writeInt(amount);
        dataOutput.writeUTF(pname);
        dataOutput.writeUTF(flag);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.id = dataInput.readUTF();
        this.pid = dataInput.readUTF();
        this.amount = dataInput.readInt();
        this.pname = dataInput.readUTF();
        this.flag = dataInput.readUTF();
    }

    @Override
    public String toString() {
        return id + '\t' + pname +  '\t'  + amount;
    }
}
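The `write` and `readFields` methods above must handle the fields in exactly the same order, because the serialized form carries no field names. A minimal sketch of that round trip, using only `java.io` so it runs without Hadoop: `BeanRoundTrip` and `MiniBean` are hypothetical stand-ins for `TableBean`, not part of the original project.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for TableBean that needs no Hadoop on the classpath.
class BeanRoundTrip {

    static final class MiniBean {
        String id = "";
        int amount;
        String pname = "";

        // Serialize the fields in a fixed order.
        void write(DataOutput out) throws IOException {
            out.writeUTF(id);
            out.writeInt(amount);
            out.writeUTF(pname);
        }

        // Deserialize in exactly the order used by write().
        void readFields(DataInput in) throws IOException {
            id = in.readUTF();
            amount = in.readInt();
            pname = in.readUTF();
        }
    }

    // Serialize src to bytes and read the bytes back into a fresh bean.
    static MiniBean roundTrip(MiniBean src) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                src.write(out);
            }
            MiniBean copy = new MiniBean();
            try (DataInputStream in =
                         new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        MiniBean bean = new MiniBean();
        bean.id = "1001";
        bean.amount = 1;
        MiniBean copy = roundTrip(bean);
        System.out.println(copy.id + "\t" + copy.pname + "\t" + copy.amount);
    }
}
```

Note that `DataOutput.writeUTF` throws a `NullPointerException` when given `null`, which is one reason to fill unused String fields with an empty string before writing a bean.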

13.6.2.3.2 Write the TableMapper class

Create a TableMapper class.

package com.summer.mapreduce.reducejoin;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

/**
 * @author Redamancy
 * @create 2022-10-07 9:27
 */
public class TableMapper extends Mapper<LongWritable, Text, Text,TableBean> {

    private String filename;
    private Text outK = new Text();
    private TableBean outV = new TableBean();
    

    @Override
    protected void setup(Mapper<LongWritable, Text, Text, TableBean>.Context context) throws IOException, InterruptedException {
        // Get the name of the file that this input split belongs to
        InputSplit split = context.getInputSplit();
        FileSplit fileSplit = (FileSplit) split;
        filename = fileSplit.getPath().getName();

    }

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, TableBean>.Context context) throws IOException, InterruptedException {
        // 1 Get one line
        String line = value.toString();

        // 2 Determine which file the line comes from and handle it accordingly
        if (filename.contains("order")) { // order table
            String[] split = line.split("\t");
            // Populate outK with the join field (pid)
            outK.set(split[1]);

            // Populate outV
            outV.setId(split[0]);
            outV.setPid(split[1]);
            outV.setAmount(Integer.parseInt(split[2]));
            outV.setPname("");
            outV.setFlag("order");
        } else { // product table
            String[] split = line.split("\t");
            // Populate outK with the join field (pid)
            outK.set(split[0]);

            // Populate outV
            outV.setId("");
            outV.setPid(split[0]);
            outV.setAmount(0);
            outV.setPname(split[1]);
            outV.setFlag("pd");
        }

        // Write out the key/value pair
        context.write(outK, outV);
    }
}

13.6.2.3.3 Write the TableReducer class

Create a TableReducer class.

package com.summer.mapreduce.reducejoin;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;

/**
 * @author Redamancy
 * @create 2022-10-07 9:58
 */
public class TableReducer extends Reducer<Text,TableBean, TableBean, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<TableBean> values, Reducer<Text, TableBean, TableBean, NullWritable>.Context context) throws IOException, InterruptedException {

        ArrayList<TableBean> orderBeans = new ArrayList<>();
        TableBean pdBean = new TableBean();

        for (TableBean value : values) {
            // Determine which table the record comes from
            if ("order".equals(value.getFlag())) { // order table
                // Hadoop reuses a single TableBean instance while iterating over values,
                // so adding value to the list directly would store the same reference
                // repeatedly. Copy it into a new TableBean first.
                TableBean tmpOrderBean = new TableBean();
                try {
                    BeanUtils.copyProperties(tmpOrderBean, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    throw new IOException("Failed to copy order record", e);
                }
                // Add the copy to orderBeans
                orderBeans.add(tmpOrderBean);
            } else { // product table
                try {
                    BeanUtils.copyProperties(pdBean, value);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    throw new IOException("Failed to copy product record", e);
                }
            }
        }

        // Iterate over orderBeans, fill in pname for each order, then write it out
        for (TableBean orderBean : orderBeans) {
            orderBean.setPname(pdBean.getPname());
            // Write out the updated orderBean
            context.write(orderBean, NullWritable.get());
        }

    }
}
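The reducer copies each order record into a new bean before storing it because the framework hands back the same value object on every iteration. The effect can be reproduced in plain Java. In this sketch `ReusePitfall`, `Bean`, and `reusingIterable` are hypothetical names, and the iterable merely imitates the reuse behaviour; it is not Hadoop's actual iterator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical demonstration of why values must be copied before being kept.
class ReusePitfall {

    static final class Bean {
        String id;

        Bean() {
        }

        // Copy constructor, playing the role of BeanUtils.copyProperties.
        Bean(Bean other) {
            this.id = other.id;
        }
    }

    // Imitates the reduce-side iterator: one Bean instance, refilled on each next().
    static Iterable<Bean> reusingIterable(final String... ids) {
        return () -> new Iterator<Bean>() {
            private final Bean shared = new Bean();
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < ids.length;
            }

            @Override
            public Bean next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.id = ids[i++];
                return shared;
            }
        };
    }

    // Keeps the references as-is: every list element is the same object.
    static List<String> collectWithoutCopy(Iterable<Bean> values) {
        List<Bean> kept = new ArrayList<>();
        for (Bean b : values) {
            kept.add(b);
        }
        return idsOf(kept);
    }

    // Copies each element first: every list element is independent.
    static List<String> collectWithCopy(Iterable<Bean> values) {
        List<Bean> kept = new ArrayList<>();
        for (Bean b : values) {
            kept.add(new Bean(b));
        }
        return idsOf(kept);
    }

    private static List<String> idsOf(List<Bean> beans) {
        List<String> out = new ArrayList<>();
        for (Bean b : beans) {
            out.add(b.id);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectWithoutCopy(reusingIterable("1001", "1004")));
        System.out.println(collectWithCopy(reusingIterable("1001", "1004")));
    }
}
```

Without the copy, every entry in the list ends up showing the last record read, since all entries point at the one shared object.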

13.6.2.3.4 Write the TableDriver class

Create a TableDriver class.

package com.summer.mapreduce.reducejoin;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * @author Redamancy
 * @create 2022-10-07 10:56
 */
public class TableDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // 1 Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Set the jar by class
        job.setJarByClass(TableDriver.class);

        // 3 Set the Mapper and Reducer
        job.setMapperClass(TableMapper.class);
        job.setReducerClass(TableReducer.class);

        // 4 Set the Map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(TableBean.class);

        // 5 Set the final output key/value types
        job.setOutputKeyClass(TableBean.class);
        job.setOutputValueClass(NullWritable.class);

        // 6 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("D:\\Acode\\Hadoop\\input\\inputJoinorderandpd"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\Acode\\Hadoop\\output\\output1"));

        // 7 Submit the job
        boolean result = job.waitForCompletion(true);

        System.exit(result ? 0 : 1);
    }
}

13.6.2.4 Testing

Run the program and check the result.

1004	小米	4
1001	小米	1
1005	华为	5
1002	华为	2
1006	格力	6
1003	格力	3

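13.6.2.5 Summary

Drawback: with this approach the merge happens in the Reduce phase, so the Reduce side carries most of the processing load while the Map nodes do very little computation. Resource utilization is poor, and data skew arises very easily in the Reduce phase.
Solution: merge the data on the Map side.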
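To see how skew arises, recall that every record with the same join key goes to the same ReduceTask. The sketch below counts records per partition using the same formula as Hadoop's default HashPartitioner, `(hash & Integer.MAX_VALUE) % numReduceTasks`. As an assumption for illustration, `String.hashCode()` stands in for the hash of the real key type, and `SkewSketch`, `partition`, and `load` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of reduce-side data skew.
class SkewSketch {

    // Same formula as the default hash partitioner; String.hashCode() is a stand-in
    // for the hash of the actual key class.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Count how many records each ReduceTask would receive.
    static int[] load(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partition(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Eight of the ten records share pid 01.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            keys.add("01");
        }
        keys.add("02");
        keys.add("03");
        System.out.println(Arrays.toString(load(keys, 3)));
    }
}
```

Because all eight records with pid 01 land in one partition, one ReduceTask does most of the work no matter how many ReduceTasks are configured.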

13.6.3 Map Join

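13.6.3.1 Use Cases

Map Join suits the case where one table is very small and the other is very large.

13.6.3.2 Advantages

Question: processing too many tables on the Reduce side very easily causes data skew. What can be done?
Cache the tables on the Map side and handle the business logic there in advance. This adds work to the Map side, relieves the data pressure on the Reduce side, and reduces data skew as much as possible.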
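The idea can be sketched as an in-memory hash join in plain Java: the small table is loaded once into a map, and each row of the large table is joined by a lookup, with no grouping step at all. `MapJoinSketch`, `loadSmallTable`, and `joinLine` are hypothetical names used only for this illustration, and the sample product names are written in ASCII.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, Hadoop-free illustration of a map-side (hash) join.
class MapJoinSketch {

    // Corresponds to setup(): load the small table into memory once.
    static Map<String, String> loadSmallTable(List<String> productLines) {
        Map<String, String> pidToName = new HashMap<>();
        for (String line : productLines) {
            String[] f = line.split("\t"); // pid, pname
            pidToName.put(f[0], f[1]);
        }
        return pidToName;
    }

    // Corresponds to map(): join one row of the large table by lookup.
    static String joinLine(String orderLine, Map<String, String> pidToName) {
        String[] f = orderLine.split("\t"); // id, pid, amount
        return f[0] + "\t" + pidToName.get(f[1]) + "\t" + f[2];
    }

    public static void main(String[] args) {
        Map<String, String> products =
                loadSmallTable(Arrays.asList("01\tXiaomi", "02\tHuawei"));
        System.out.println(joinLine("1001\t01\t1", products));
        System.out.println(joinLine("1002\t02\t2", products));
    }
}
```

The approach only works while the small table fits comfortably in the memory of every map task, which is why it is tied to the small-table use case.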

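13.6.3.3 Approach: Use DistributedCache

(1) In the setup phase of the Mapper, read the file into an in-memory collection.
(2) In the Driver class, register the cache file.

// Cache a regular file on the nodes where the tasks run.
job.addCacheFile(new URI("file:///e:/cache/pd.txt"));
// When running on a cluster, use an HDFS path instead.
job.addCacheFile(new URI("hdfs://hadoop102:8020/cache/pd.txt"));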
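The argument to `new URI(...)` has to be a syntactically valid URI, which a raw Windows path with backslashes is not. The small check below uses only `java.net.URI` and `java.io.File`; `CacheUriSketch`, `isValidUri`, and `fromLocalPath` are hypothetical helper names, and the backslash path is a made-up example.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helpers for checking strings before passing them to new URI(...).
class CacheUriSketch {

    // Returns true when the string parses as a URI.
    static boolean isValidUri(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Builds a file: URI from a local path and lets the JDK do the escaping.
    static URI fromLocalPath(String localPath) {
        return new File(localPath).toURI();
    }

    public static void main(String[] args) {
        System.out.println(isValidUri("file:///e:/cache/pd.txt"));
        System.out.println(isValidUri("hdfs://hadoop102:8020/cache/pd.txt"));
        System.out.println(isValidUri("e:\\cache\\pd.txt")); // backslashes are illegal in a URI
        System.out.println(fromLocalPath("pd.txt").getScheme());
    }
}
```

Writing local paths with forward slashes and an explicit `file:///` prefix avoids the `URISyntaxException` that a backslash path would raise in the driver.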

13.6.4 Map Join Hands-On Example

13.6.4.1 Requirements

Order table t_order

id	pid	amount
1001	01	1
1002	02	2
1003	03	3
1004	01	4
1005	02	5
1006	03	6

Product information table t_product

pid	pname
01	小米
02	华为
03	格力

Merge the data from the product information table into the order table by product pid.

Final data format

id	pname	amount
1001	小米	1
1004	小米	4
1002	华为	2
1005	华为	5
1003	格力	3
1006	格力	6

13.6.4.2 Requirements Analysis

Map Join applies when one of the joined tables is small.

13.6.4.3 Code Implementation

Create a mapjoin package.

13.6.4.3.1 Add the cache file in the MapJoinDriver class first

Create a MapJoinDriver class.

package com.summer.mapreduce.mapjoin;


import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * @author Redamancy
 * @create 2022-10-07 12:10
 */
public class MapJoinDriver {
    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException, URISyntaxException {
        // 1 Get the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // 2 Set the jar by class
        job.setJarByClass(MapJoinDriver.class);

        // 3 Set the Mapper
        job.setMapperClass(MapJoinMapper.class);

        // 4 Set the Map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);

        // 5 Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // 6 Register the cache file; the argument must be a valid URI, not a raw Windows path
        job.addCacheFile(new URI("file:///D:/Acode/Hadoop/input/inputmapJoinorderandpd/pd"));

        // 7 A Map-side join needs no Reduce phase, so set the number of ReduceTasks to 0
        job.setNumReduceTasks(0);

        // 8 Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path("D:\\Acode\\Hadoop\\input\\inputmapJoinorderandpd\\order"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\Acode\\Hadoop\\output\\output2"));

        // 9 Submit the job
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}

13.6.4.3.2 Read the cache file in the setup method of the MapJoinMapper class

Create a MapJoinMapper class.

package com.summer.mapreduce.mapjoin;


import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;


import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;


/**
 * @author Redamancy
 * @create 2022-10-07 11:22
 */
public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    private HashMap<String, String> pdmap = new HashMap<>();
    private Text text = new Text();


    // Before the task starts, load the pd data into pdmap
    @Override
    protected void setup(Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {

        // Get the small-table data (pd.txt) from the cache files registered with addCacheFile
        URI[] cacheFiles = context.getCacheFiles();
        Path path = new Path(cacheFiles[0]);

        // Get the file system object and open an input stream
        FileSystem fs = FileSystem.get(context.getConfiguration());
        FSDataInputStream fis = fs.open(path);

        // Wrap the stream in a reader so it can be read line by line
        BufferedReader reader = new BufferedReader(new InputStreamReader(fis, "UTF-8"));

        // Read and process line by line
        String line;
        while (StringUtils.isNotEmpty(line = reader.readLine())) {
            // Split the line into pid and pname
            String[] split = line.split("\t");
            pdmap.put(split[0], split[1]);
        }

        // Close the stream
        IOUtils.closeStream(reader);
    }

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context) throws IOException, InterruptedException {

        // Read one line of the large table
        String[] fields = value.toString().split("\t");

        // Look up pname in pdmap by the pid of the current line
        String pname = pdmap.get(fields[1]);

        // Replace the pid of the line with pname
        text.set(fields[0] + "\t" + pname + "\t" + fields[2]);

        // Write out
        context.write(text, NullWritable.get());
    }
}
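Two details of the mapper above are worth knowing: the read loop in setup stops at the first empty line, and a pid missing from the lookup table yields the literal text "null" in the output. A slightly more defensive way to load and query the lookup table might look like the sketch below. It is an assumption-laden illustration, not the author's code: `LookupLoader`, `load`, and `nameOrDefault` are hypothetical names, and the placeholder value "NULL" is an arbitrary choice.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, Hadoop-free loader for a tab-separated pid/pname lookup table.
class LookupLoader {

    // Reads every line, skipping blank or malformed ones instead of stopping early.
    static Map<String, String> load(BufferedReader reader) {
        Map<String, String> pidToName = new HashMap<>();
        try (BufferedReader r = reader) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // blank line: keep reading
                }
                String[] f = line.split("\t");
                if (f.length < 2) {
                    continue; // malformed line: no pname column
                }
                pidToName.put(f[0], f[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pidToName;
    }

    // Returns a visible placeholder when the pid is unknown.
    static String nameOrDefault(Map<String, String> pidToName, String pid) {
        return pidToName.getOrDefault(pid, "NULL");
    }

    public static void main(String[] args) {
        String data = "01\tXiaomi\n\n02\tHuawei\n";
        Map<String, String> products = load(new BufferedReader(new StringReader(data)));
        System.out.println(nameOrDefault(products, "02"));
        System.out.println(nameOrDefault(products, "09"));
    }
}
```

In the real mapper the same loop could wrap the stream opened from the cache file; whether unmatched orders should be dropped or written with a placeholder depends on the join semantics wanted.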

13.6.4.4 Testing

Run the program and check the result.

1004	小米	4
1001	小米	1
1005	华为	5
1002	华为	2
1006	格力	6
1003	格力	3
