Hadoop MapReduce Join in Practice | A ReduceJoin Worked Example

Latest recommended article published on 2023-05-09 09:44:02

lesileqin

Categories: Big Data Study Notes, Hadoop. Tags: big data, java, mapreduce, join, hadoop

Article link: https://blog.csdn.net/lesileqin/article/details/115917135

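MapReduce in Hadoop is a programming model for parallel computation over large data sets.

The link below points to my MapReduce blog series; this post reads best alongside it!

MapReduce Development Summary | So good that someone else's girlfriend ran off with me after reading it!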

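I. What Is ReduceJoin

In the real world many things are related. When these related things are abstracted into data, keeping everything in a single file is cumbersome, so the data is usually stored across multiple files. The job of a Join is to tie these related files together. This may sound abstract for now; the case analysis below makes it concrete.

Main work on the Map side: tag the key-value pairs coming from different tables or files so that records from different sources can be told apart, use the join field as the key and the remaining fields plus the new tag as the value, then emit them.

Main work on the Reduce side: by the time the Reduce side runs, grouping by the join field as the key is already done. Within each group we only need to separate the records that came from different files (they were tagged in the Map phase) and then merge them.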
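To make the two phases concrete before touching Hadoop, here is a minimal plain-Java sketch of the same idea. It is only a simulation under my own assumptions: a TreeMap stands in for the shuffle's group-by-key, the class and method names (ReduceJoinSketch, mapAndShuffle, reduce) are made up for illustration, the order rows are hypothetical, and the product names are romanized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ReduceJoinSketch {

    /** One tagged record: which table it came from plus its raw fields. */
    static final class Tagged {
        final String flag;      // "order" or "pd"
        final String[] fields;  // the tab-separated fields of the line

        Tagged(String flag, String[] fields) {
            this.flag = flag;
            this.fields = fields;
        }
    }

    /** "Map" step: tag every line and group it under its pid (TreeMap simulates the shuffle). */
    static Map<String, List<Tagged>> mapAndShuffle(List<String> orderLines, List<String> pdLines) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (String line : orderLines) {
            String[] f = line.split("\t");   // orderId, pid, amount
            groups.computeIfAbsent(f[1], k -> new ArrayList<>()).add(new Tagged("order", f));
        }
        for (String line : pdLines) {
            String[] f = line.split("\t");   // pid, pname
            groups.computeIfAbsent(f[0], k -> new ArrayList<>()).add(new Tagged("pd", f));
        }
        return groups;
    }

    /** "Reduce" step: inside each pid group, separate records by flag, then merge them. */
    static List<String> reduce(Map<String, List<Tagged>> groups) {
        List<String> out = new ArrayList<>();
        for (List<Tagged> group : groups.values()) {
            String pname = "";
            List<String[]> orders = new ArrayList<>();
            for (Tagged t : group) {
                if ("order".equals(t.flag)) {
                    orders.add(t.fields);
                } else {
                    pname = t.fields[1];
                }
            }
            for (String[] o : orders) {
                out.add(o[0] + "\t" + pname + "\t" + o[2]);   // orderId, pname, amount
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical order rows, for illustration only.
        List<String> orders = Arrays.asList("1001\t01\t1", "1002\t02\t2", "1003\t01\t4");
        List<String> products = Arrays.asList("01\tXiaomi", "02\tHuawei", "03\tGree");
        reduce(mapAndShuffle(orders, products)).forEach(System.out::println);
    }
}
```

Note that pid 03 has no orders in this sample, so nothing is emitted for it; the sketch behaves like an inner join from the order side.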

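II. ReduceJoin Case Analysis

1. Requirements Analysis

We have two tables, supplied as input files: an order table and a product table, related to each other through pid.

In the desired final output, each line holds the order ID, the product name, and the amount, separated by tabs.

The order table file holds one order per line with three tab-separated fields: order ID, pid, and amount.

The pd product table file holds the pid and the product name: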

01	小米
02	华为
03	格力


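For the detailed analysis, see the coding section below.

2. Writing the Code

1) The Bean Object

This bean should contain every field from both tables, plus a flag variable that marks which file the record came from.

So it needs five fields: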

private String orderId; // order ID
private String pid;     // product ID
private int amount;     // product quantity
private String pname;   // product name
private String flag;    // marks which table the record came from

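In IDEA, press Alt+Insert to generate their getters and setters; they are not shown here.

Since this is a bean that must be transferred between nodes in the cluster, it needs to implement the Writable serialization interface, provide a no-argument constructor, and override toString:

// A Writable must have a no-argument constructor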
public TableBean() {
}

@Override
public void write(DataOutput out) throws IOException {
    // String fields are written with writeUTF
    out.writeUTF(this.orderId);
    out.writeUTF(this.pid);
    out.writeInt(this.amount);
    out.writeUTF(this.pname);
    out.writeUTF(this.flag);
}

@Override
public void readFields(DataInput in) throws IOException {
    this.orderId = in.readUTF();
    this.pid = in.readUTF();
    this.amount = in.readInt();
    this.pname = in.readUTF();
    this.flag = in.readUTF();
}

// Only three fields are written to the final output file, so toString includes just those three
@Override
public String toString() {
    return orderId + "\t" + pname + "\t" + amount;
}
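Two details of write and readFields are easy to get wrong: the fields must be read back in exactly the order they were written, and writeUTF throws a NullPointerException on a null String (which is why the Mapper later fills missing fields with empty strings). The sketch below checks both using only java.io; Bean is a trimmed-down stand-in for TableBean that I made up for the demo, not the real class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class WritableRoundTrip {

    /** Stand-in for TableBean with the same write/readFields field order. */
    static class Bean {
        String orderId = "";
        String pid = "";
        int amount;
        String pname = "";
        String flag = "";

        void write(DataOutput out) throws IOException {
            out.writeUTF(orderId);
            out.writeUTF(pid);
            out.writeInt(amount);
            out.writeUTF(pname);
            out.writeUTF(flag);
        }

        void readFields(DataInput in) throws IOException {
            // Must mirror write() field by field
            orderId = in.readUTF();
            pid = in.readUTF();
            amount = in.readInt();
            pname = in.readUTF();
            flag = in.readUTF();
        }
    }

    /** Serializes a bean to bytes and reads it back into a fresh instance. */
    static Bean roundTrip(Bean src) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            src.write(new DataOutputStream(bytes));
            Bean copy = new Bean();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if writing a bean whose pname is null fails with a NullPointerException. */
    static boolean nullFieldFails() {
        Bean b = new Bean();
        b.pname = null;
        try {
            b.write(new DataOutputStream(new ByteArrayOutputStream()));
            return false;
        } catch (NullPointerException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Bean b = new Bean();
        b.orderId = "1001";
        b.pid = "01";
        b.amount = 1;
        b.flag = "order";
        Bean copy = roundTrip(b);
        System.out.println(copy.orderId + " " + copy.pid + " " + copy.amount + " " + copy.flag);
        System.out.println("null field fails: " + nullFieldFails());
    }
}
```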

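2) Mapper

This Mapper needs some care. First look at its four type parameters: the first two stay unchanged (LongWritable, Text), and the last two are the pid (Text) and TableBean:

public class TableMapper extends Mapper<LongWritable, Text,Text,TableBean>

We need to know which file each record came from, and we want to avoid looking that up again for every record, so we override the Mapper's setup initialization method (an earlier post explains the Mapper's three methods). setup runs once per map task, so the file name is obtained there from the input split: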

private String fileName;
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    FileSplit split = (FileSplit) context.getInputSplit();
    fileName = split.getPath().getName();
}

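Next comes the map() method. It first determines which file the record came from, then populates the TableBean (fields the record does not have are set to empty values), and finally emits it through context. The complete Mapper code: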

package com.wzq.mapreduce.reducejoin;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class TableMapper extends Mapper<LongWritable, Text,Text,TableBean> {

    private String fileName;
    private Text outK = new Text();
    private TableBean outV = new TableBean();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        fileName = split.getPath().getName();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split("\t");
        if(fileName.contains("order")){
            // Set the key: the pid field of the order record
            outK.set(split[1]);
            // Set the value
            outV.setOrderId(split[0]);
            outV.setPid(split[1]);
            outV.setAmount(Integer.parseInt(split[2]));
            outV.setPname("");
            outV.setFlag("order");
        }else {
            // Set the key: the pid field of the pd record
            outK.set(split[0]);
            // Set the value
            outV.setOrderId("");
            outV.setPid(split[0]);
            outV.setAmount(0);
            outV.setPname(split[1]);
            outV.setFlag("pd");
        }
        context.write(outK,outV);
    }
}
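The map() above assumes every line is well formed; a blank or truncated line would throw an ArrayIndexOutOfBoundsException and fail the task. As a hedged sketch (not part of the original job), the tagging logic can be pulled into a plain function that returns null for lines with too few fields, so the caller can skip them. The class name TagByFileName, the method tag, and the file names order.txt and pd.txt are assumptions for illustration.

```java
import java.util.Arrays;

class TagByFileName {

    /**
     * Returns {key, orderId, pid, amount, pname, flag} for one input line,
     * or null when the line does not have the expected number of fields.
     */
    static String[] tag(String fileName, String line) {
        String[] f = line.split("\t");
        if (fileName.contains("order")) {
            if (f.length < 3) {
                return null;                 // order line needs orderId, pid, amount
            }
            return new String[] {f[1], f[0], f[1], f[2], "", "order"};
        }
        if (f.length < 2) {
            return null;                     // pd line needs pid, pname
        }
        return new String[] {f[0], "", f[0], "0", f[1], "pd"};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(tag("order.txt", "1001\t01\t1")));
        System.out.println(Arrays.toString(tag("pd.txt", "01\tXiaomi")));
        System.out.println(Arrays.toString(tag("pd.txt", "")));
    }
}
```

One thing to keep in mind with the contains("order") check: any input file whose name happens to include "order" is treated as the order table, so the file naming has to be kept under control.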

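3) Reducer

The Reducer also takes four type parameters. The first two match the Mapper's output types; since all fields are wrapped in the bean, the last two can simply be the bean and NullWritable:

public class TableReducer extends Reducer<Text,TableBean,TableBean, NullWritable>

Complete code: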

package com.wzq.mapreduce.reducejoin;

import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.util.ArrayList;

public class TableReducer extends Reducer<Text,TableBean,TableBean, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<TableBean> values, Context context) throws IOException, InterruptedException {
        // Holds the records that came from the order file
        ArrayList<TableBean> order = new ArrayList<>();
        // The pd file has only one record per pid, so a single object is enough
        TableBean pd = new TableBean();

        // Iterate over all records that share this key (pid)
        for (TableBean value : values) {
            // Read the flag that marks the source table
            String flag = value.getFlag();
            try {
                if ("order".equals(flag)) {
                    // Hadoop reuses the same value object across iterations,
                    // so copy it into a fresh bean before storing it
                    TableBean tmp = new TableBean();
                    BeanUtils.copyProperties(tmp, value);
                    order.add(tmp);
                } else {
                    // Copy the product record's properties into pd
                    BeanUtils.copyProperties(pd, value);
                }
            } catch (IllegalAccessException | InvocationTargetException e) {
                // Fail the task instead of silently emitting a half-copied bean
                throw new IOException("Failed to copy TableBean properties", e);
            }
        }

        // Fill in the product name on every order bean, then write it out through context
        for (TableBean tableBean : order) {
            tableBean.setPname(pd.getPname());
            context.write(tableBean, NullWritable.get());
        }

    }
}
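Why does the reducer copy each order record into a tmp bean instead of adding value to the list directly? The values iterator in a Hadoop reducer hands back the same object on every iteration, refilled with the next record, so stored references would all end up pointing at the last record. The plain-Java sketch below imitates that behavior; Holder and the reusing iterable are stand-ins I made up for the demo, not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ValueReuseDemo {

    /** A mutable holder, standing in for TableBean. */
    static final class Holder {
        String orderId;
    }

    /** Iterable that returns the SAME Holder every time, refilled with the next id. */
    static Iterable<Holder> reusing(List<String> ids) {
        Holder shared = new Holder();
        return () -> new Iterator<Holder>() {
            private int i = 0;

            @Override
            public boolean hasNext() {
                return i < ids.size();
            }

            @Override
            public Holder next() {
                shared.orderId = ids.get(i++);
                return shared;
            }
        };
    }

    /** Stores the references directly: every entry ends up showing the last record. */
    static List<String> withoutCopy(List<String> ids) {
        List<Holder> kept = new ArrayList<>();
        for (Holder h : reusing(ids)) {
            kept.add(h);
        }
        List<String> out = new ArrayList<>();
        for (Holder h : kept) {
            out.add(h.orderId);
        }
        return out;
    }

    /** Copies each record into a fresh object first, as the reducer does with tmp. */
    static List<String> withCopy(List<String> ids) {
        List<Holder> kept = new ArrayList<>();
        for (Holder h : reusing(ids)) {
            Holder tmp = new Holder();
            tmp.orderId = h.orderId;
            kept.add(tmp);
        }
        List<String> out = new ArrayList<>();
        for (Holder h : kept) {
            out.add(h.orderId);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1001", "1004");   // hypothetical order IDs
        System.out.println(withoutCopy(ids));   // [1004, 1004]
        System.out.println(withCopy(ids));      // [1001, 1004]
    }
}
```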