Hadoop-MapReduce: Join Applications

Latest recommended article published on 2023-07-18 15:13:09

魔笛Love

Tags: mapreduce hadoop big data

Article link: https://blog.csdn.net/clearlxj/article/details/118568695

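This article explains in detail how to use MapJoin to merge a product information table into an order data table by product ID. Handling the business logic in the Map phase reduces pressure on the Reduce phase. Using product data such as 小米 (Xiaomi), 华为 (Huawei), and 格力 (Gree) as the example, it walks through the complete flow from data preprocessing to final output.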

Join Applications

Reduce Join

Hands-on:

Merge the data from the product information table into the order data table by product id.

Input data:

Order data

id pid amount
1001 01 1
1002 02 2
1003 03 3
1004 01 4
1005 02 5
1006 03 6

Product data

pid name
01 小米
02 华为
03 格力

The expected data after merging:

1001 小米 1
1004 小米 4
1002 华为 2
1005 华为 5
1003 格力 3
1006 格力 6

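Analysis:

Map phase: determine which input file the record comes from, read the input line, handle each file separately, and wrap the result in a Bean object for output. The key is pid (the field common to both tables, comparable to a foreign key in a database); the value carries the remaining fields plus a flag identifying the source table. After the map phase, the framework sorts the output by key by default, so all records with the same pid arrive in the same reduce call.

Reduce phase: the reduce method caches the order records in a collection, holds the product record, and then merges the two.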
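To make the two phases concrete, the following is a minimal plain-Java sketch (no Hadoop dependencies) that imitates the flow in memory: tag each record with its source, group by pid, then merge inside each group. The class ReduceJoinSketch, the method join, the use of a TreeMap as a stand-in for the shuffle's group-and-sort step, and the romanized product names are all assumptions made for illustration; they are not part of the original job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-alone sketch; it imitates Map -> shuffle -> Reduce in memory.
class ReduceJoinSketch {

    static List<String> join(List<String> orders, List<String> products) {
        // "Map" step: key every record by pid and tag it with its source table.
        // The TreeMap plays the role of the shuffle, which groups and sorts by key.
        Map<String, List<String[]>> groups = new TreeMap<>();
        for (String line : orders) {
            String[] f = line.split(" ");              // id pid amount
            groups.computeIfAbsent(f[1], p -> new ArrayList<>())
                  .add(new String[] {"order", f[0], f[2]});
        }
        for (String line : products) {
            String[] f = line.split(" ");              // pid name
            groups.computeIfAbsent(f[0], p -> new ArrayList<>())
                  .add(new String[] {"pd", f[1]});
        }

        // "Reduce" step: inside one pid group, find the product name,
        // then fill it into every cached order record.
        List<String> out = new ArrayList<>();
        for (List<String[]> group : groups.values()) {
            String pname = "";
            List<String[]> orderRecords = new ArrayList<>();
            for (String[] record : group) {
                if ("order".equals(record[0])) {
                    orderRecords.add(record);
                } else {
                    pname = record[1];
                }
            }
            for (String[] order : orderRecords) {
                out.add(order[1] + "\t" + pname + "\t" + order[2]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList("1001 01 1", "1002 02 2", "1003 03 3",
                                            "1004 01 4", "1005 02 5", "1006 03 6");
        // Romanized stand-ins for the product names in the sample data.
        List<String> products = Arrays.asList("01 Xiaomi", "02 Huawei", "03 Gree");
        join(orders, products).forEach(System.out::println);
    }
}
```

Because the groups are visited in key order, the orders for pid 01 come out first, then 02, then 03, which is the same ordering as the expected merged data.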

The Bean object is as follows:

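import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class JoinBean implements Writable {

    // Order table:   id pid amount
    // Product table: pid pname

    private String id; // order id
    private String pid; // product id
    private int amount; // quantity
    private String pname; // product name
    private String flag; // marks whether the record comes from the order table or the product table

    // No-arg constructor; required because the framework creates the bean via reflection
    public JoinBean () {

        super();
    }


    @Override
    public void write (DataOutput out) throws IOException {

        // Serialization
        // writeUTF writes a String; it does not accept null, so every String field must be set
        out.writeUTF(id);
        out.writeUTF(pid);
        out.writeInt(amount);
        out.writeUTF(pname);
        out.writeUTF(flag);
    }

    @Override
    public void readFields (DataInput in) throws IOException {

        // Deserialization; fields must be read in the same order they were written
        id = in.readUTF();
        pid = in.readUTF();
        amount = in.readInt();
        pname = in.readUTF();
        flag = in.readUTF();
    }

    @Override
    public String toString () {

        return id + "\t" + pname + "\t" + amount;
    }

    ...
}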
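The write and readFields methods only work as a pair if they handle the fields in exactly the same order. The sketch below checks this with a round trip through a byte array, using only java.io. SimpleJoinBean and roundTrip are hypothetical names; the class mirrors the fields of JoinBean but deliberately does not implement Hadoop's Writable, so it can run without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for JoinBean, without the Hadoop Writable interface.
class SimpleJoinBean {
    // String fields default to "" because DataOutput.writeUTF throws on null.
    String id = "";
    String pid = "";
    int amount;
    String pname = "";
    String flag = "";

    void write(DataOutput out) throws IOException {
        out.writeUTF(id);
        out.writeUTF(pid);
        out.writeInt(amount);
        out.writeUTF(pname);
        out.writeUTF(flag);
    }

    // Reads the fields back in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        id = in.readUTF();
        pid = in.readUTF();
        amount = in.readInt();
        pname = in.readUTF();
        flag = in.readUTF();
    }

    // Serializes the bean to bytes and rebuilds a fresh bean from those bytes.
    static SimpleJoinBean roundTrip(SimpleJoinBean bean) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            bean.write(new DataOutputStream(bytes));
            SimpleJoinBean copy = new SimpleJoinBean();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SimpleJoinBean order = new SimpleJoinBean();
        order.id = "1001";
        order.pid = "01";
        order.amount = 1;
        order.flag = "order";
        SimpleJoinBean copy = roundTrip(order);
        System.out.println(copy.id + " " + copy.pid + " " + copy.amount + " " + copy.flag);
    }
}
```

The empty-string defaults also show why the Mapper sets the unused fields to "" rather than leaving them null: a null String field would make writeUTF fail during serialization.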

Mapper

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, JoinBean> {

    String filename;

    JoinBean joinBean = new JoinBean();
    Text k = new Text();

    @Override
    protected void setup (Context context) throws IOException, InterruptedException {

        // Get the name of the input file:
        // fetch the split information and cast it to a file split
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        filename = inputSplit.getPath().getName();
    }

    @Override
    protected void map (LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Order table
        // id pid amount
        // 1001 01 1

        // Product table
        // pid name
        // 01 小米

        // Read one line of data
        String line = value.toString();

        String[] fields = line.split(" ");

        // Build the output key and value
        if (filename.startsWith("order")) {
            // Treat the record as an order
            joinBean.setId(fields[0]);
            joinBean.setPid(fields[1]);
            joinBean.setAmount(Integer.parseInt(fields[2]));
            joinBean.setPname("");
            joinBean.setFlag("order");

            k.set(joinBean.getPid());
        } else {
            // Treat everything else as a product record
            joinBean.setId("");
            joinBean.setPid(fields[0]);
            joinBean.setAmount(0);
            joinBean.setPname(fields[1]);
            joinBean.setFlag("pd");

            k.set(joinBean.getPid());
        }

        context.write(k, joinBean);
    }
}

Reducer

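import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// copyProperties(dest, orig) matches the argument order of Apache Commons BeanUtils
import org.apache.commons.beanutils.BeanUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceJoinReducer extends Reducer<Text, JoinBean, JoinBean, NullWritable> {


    @Override
    protected void reduce (Text key, Iterable<JoinBean> values, Context context) throws IOException, InterruptedException {

        // Holds the order records
        List<JoinBean> orderBeanList = new ArrayList<>();

        // Holds the product record
        JoinBean pdBean = new JoinBean();

        for (JoinBean value : values) {
            if ("order".equals(value.getFlag())) {

                // The bean must be copied, not added directly:
                // Hadoop reuses one value object and overwrites its fields on every iteration
                JoinBean joinBean = new JoinBean();

                try {
                    BeanUtils.copyProperties(joinBean, value);
                } catch (Exception e) {
                    // Fail the task instead of silently emitting an empty bean
                    throw new IOException("Failed to copy order bean", e);
                }

                orderBeanList.add(joinBean);
            } else {
                try {
                    BeanUtils.copyProperties(pdBean, value);
                } catch (Exception e) {
                    throw new IOException("Failed to copy product bean", e);
                }
            }
        }

        // Fill the product name into every order and write it out
        for (JoinBean bean : orderBeanList) {
            bean.setPname(pdBean.getPname());

            context.write(bean, NullWritable.get());
        }
    }
}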
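The copy inside the loop matters because the values iterator hands back the same object each time, with its fields overwritten. The following plain-Java sketch reproduces that effect without Hadoop. ReuseSketch, Holder, and the reusing iterable are hypothetical names invented for this illustration; they only imitate the reuse behavior and are not Hadoop classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of an iterator that reuses a single mutable object.
class ReuseSketch {

    // Mutable holder standing in for a Writable value.
    static class Holder {
        String id;

        Holder() {}

        // Copy constructor: takes a snapshot of the current field value.
        Holder(Holder other) {
            this.id = other.id;
        }
    }

    // Returns an Iterable that refills ONE shared Holder on every next() call.
    static Iterable<Holder> reusing(List<String> ids) {
        final Holder shared = new Holder();
        return () -> {
            final Iterator<String> it = ids.iterator();
            return new Iterator<Holder>() {
                @Override
                public boolean hasNext() {
                    return it.hasNext();
                }

                @Override
                public Holder next() {
                    shared.id = it.next();
                    return shared;
                }
            };
        };
    }

    // Adds the iterated object directly: every list entry is the same object.
    static List<String> collectWithoutCopy(List<String> ids) {
        List<Holder> kept = new ArrayList<>();
        for (Holder h : reusing(ids)) {
            kept.add(h);
        }
        return idsOf(kept);
    }

    // Adds a copy: every list entry keeps the value seen at that iteration.
    static List<String> collectWithCopy(List<String> ids) {
        List<Holder> kept = new ArrayList<>();
        for (Holder h : reusing(ids)) {
            kept.add(new Holder(h));
        }
        return idsOf(kept);
    }

    static List<String> idsOf(List<Holder> holders) {
        List<String> out = new ArrayList<>();
        for (Holder h : holders) {
            out.add(h.id);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("1001", "1004");
        System.out.println(collectWithoutCopy(ids)); // prints [1004, 1004]
        System.out.println(collectWithCopy(ids));    // prints [1001, 1004]
    }
}
```

Without the copy, both list entries point at the one shared object and therefore show the last value seen; with the copy, each order keeps its own data.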

The run result is as follows:

[Image: job output]

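Because there are usually more MapTasks than ReduceTasks, whatever can be done in the Map phase is best done there, to relieve as much pressure as possible from the Reduce phase.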

MapJoin

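Use case

Map Join suits the case where one table is very small and the other is very large, so that the small table can be held in memory.

Advantages

Handling too many tables on the Reduce side very easily produces data skew. Caching the small tables on the Map side and applying the business logic there moves work to the Map side, lowers the data pressure on the Reduce side, and keeps data skew to a minimum. It also reduces the amount of data transferred.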

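Approach: use DistributedCache

1. In the Mapper's setup phase (the initialization phase), read the file into an in-memory collection.

2. In the driver, load the cache file.

// Cache an ordinary file on the nodes where the tasks run.
job.addCacheFile(new URI("/path"));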

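Hands-on example:

The goal and the input are the same as for Reduce Join.

Analysis:

Driver, caching the file:

1. Load the cache data.

2. The Map-side join logic needs no Reduce phase, so set the number of ReduceTasks to 0. The job then skips the Reduce phase and avoids the very time-consuming Shuffle.

Reading the cached file data:

1. In setup(): get the cached file, read it line by line, split each line, store the data in a collection, and finally close the stream.

2. In map(): read a line and split it, get the order id, look up the product name, concatenate them, and write out the result.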
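The two steps above can be sketched in plain Java without Hadoop: one method plays the role of setup() and loads the small table into a HashMap, and another plays the role of map() and joins a single order line by lookup. MapJoinSketch, loadProducts, joinLine, the "UNKNOWN" placeholder for a missing pid, and the romanized product names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone sketch of a map-side join; no Hadoop classes involved.
class MapJoinSketch {

    // Mirrors setup(): load the small product table into memory once.
    static Map<String, String> loadProducts(List<String> productLines) {
        Map<String, String> pdMap = new HashMap<>();
        for (String line : productLines) {
            String[] fields = line.split(" ");     // pid name
            pdMap.put(fields[0], fields[1]);
        }
        return pdMap;
    }

    // Mirrors map(): one order line in, one joined line out, with no shuffle.
    static String joinLine(String orderLine, Map<String, String> pdMap) {
        String[] fields = orderLine.split(" ");    // id pid amount
        // "UNKNOWN" is an assumed placeholder for a pid missing from the product table.
        String pname = pdMap.getOrDefault(fields[1], "UNKNOWN");
        return fields[0] + " " + pname + " " + fields[2];
    }

    public static void main(String[] args) {
        // Romanized stand-ins for the product names in the sample data.
        Map<String, String> pdMap =
                loadProducts(Arrays.asList("01 Xiaomi", "02 Huawei", "03 Gree"));
        List<String> orders = Arrays.asList("1001 01 1", "1002 02 2", "1003 03 3");
        List<String> joined = new ArrayList<>();
        for (String order : orders) {
            joined.add(joinLine(order, pdMap));
        }
        joined.forEach(System.out::println);
    }
}
```

Each order line is joined independently of the others, which is why no grouping by key, and therefore no Reduce phase, is required. The output keeps the order of the input rather than being sorted by pid.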

Driver

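import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapJoinDriver {

    public static void main (String[] args) throws Exception {

        // 1. Get the Job object
        Configuration conf = new Configuration();
        // Any Configuration settings (e.g. a separator) must be made before the Job is instantiated
        Job job = Job.getInstance(conf);

        // 2. Set the jar location, i.e. the driver class
        job.setJarByClass(MapJoinDriver.class);

        // 3. Attach the Mapper class
        job.setMapperClass(MapJoinMapper.class);

        // 4. Set the key and value types of the final output
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // 5. Set the input path and output path
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Load the cache data
        job.addCacheFile(new URI("file:///home/lxj/hadoop-data/input/reduceJodin/pd.txt"));
        // The MapJoin logic needs no Reduce phase, so set the number of ReduceTasks to 0
        job.setNumReduceTasks(0);

        // 6. Submit the Job and wait for it to finish
//        job.submit();
        boolean succeeded = job.waitForCompletion(true);
        System.exit(succeeded ? 0 : 1);
    }
}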

Mapper

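import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapJoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

    Map<String, String> pdMap = new HashMap<>();

    Text k = new Text();

    @Override
    protected void setup (Context context) throws IOException, InterruptedException {

        // Cached table
        URI[] cacheFiles = context.getCacheFiles();
        // Opening the path with FileInputStream assumes a local file (the driver uses a file:// URI)
        String path = cacheFiles[0].getPath();

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(path), StandardCharsets.UTF_8))) {

            String line;
            // Stop at end of file or at the first empty line
            while ((line = reader.readLine()) != null && !line.isEmpty()) {
                // pid name
                // 01 小米

                // Split the line and cache it
                String[] fields = line.split(" ");
                pdMap.put(fields[0], fields[1]);
            }
        }
    }

    @Override
    protected void map (LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        // Order table
        // id pid amount
        // 1001 01 1

        // Product table
        // pid name
        // 01 小米
        String line = value.toString();
        String[] fields = line.split(" ");

        // Look up the product name by pid in the cached table
        String pid = fields[1];
        String pname = pdMap.get(pid);

        // Concatenate order id, product name, and amount
        k.set(fields[0] + " " + pname + " " + fields[2]);

        context.write(k, NullWritable.get());
    }
}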