Differences between mapJoin and reduceJoin in Hadoop, and when to use each

shining0903lxy

Last modified: 2022-08-30 10:32:43

First published: 2020-04-25 22:19:25

Article link: https://blog.csdn.net/weixin_43548518/article/details/105753150

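This article discusses the differences between MapJoin and ReduceJoin as implemented with Hadoop MapReduce. ReduceJoin performs the JOIN in the reduce phase, which can put heavy load on the reduce side and lead to data skew. MapJoin suits joins between a small table and a large table: it caches the small table in the map phase, which relieves the reduce side and avoids data skew. The article uses a worked example to explain how both JOIN operations work and how to implement them.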

SQL statement:

select order.id, product.pname, order.amount from product join order on product.pid = order.pid

The same join can also be implemented with MapReduce, using either mapJoin or reduceJoin.
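
To make the contrast concrete before the reduceJoin walkthrough, here is a minimal plain-Java sketch of the map-side idea: the small table is held in an in-memory lookup, and each record of the large table is joined as it is read, so no grouping by key is required. It does not use the Hadoop API. The class name MapJoinSketch, its method names, and the sample rows are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: simulates a map-side join in plain Java, without Hadoop.
class MapJoinSketch {

    /** Builds the in-memory lookup (pid -> pname) from the small table. */
    static Map<String, String> loadSmallTable(List<String[]> products) {
        Map<String, String> pidToName = new HashMap<>();
        for (String[] p : products) {            // p = {pid, pname}
            pidToName.put(p[0], p[1]);
        }
        return pidToName;
    }

    /** Joins the large table record by record against the lookup. */
    static List<String> join(List<String[]> orders, Map<String, String> pidToName) {
        List<String> joined = new ArrayList<>();
        for (String[] o : orders) {              // o = {id, pid, amount}
            String pname = pidToName.get(o[1]);
            if (pname != null) {                 // inner join: skip unmatched pids
                joined.add(o[0] + "\t" + pname + "\t" + o[2]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows for the small and the large table.
        List<String[]> products = Arrays.asList(
                new String[] {"01", "Xiaomi"}, new String[] {"02", "Huawei"});
        List<String[]> orders = Arrays.asList(
                new String[] {"1001", "01", "1"}, new String[] {"1002", "02", "2"},
                new String[] {"1004", "01", "4"});
        join(orders, loadSmallTable(products)).forEach(System.out::println);
    }
}
```

Because every output row is produced while reading a single input record, this approach only works when the lookup table fits in memory, which is why it is suited to the small-table-with-large-table case.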

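How reduceJoin works
mapTask:
Tag each record so that records from different source files can be told apart
Use the join (ON) field as the key, and the remaining fields plus the tag as the value
reduceTask:
Records with the same join field arrive in the same reduce call
Merge the records that come from the different sources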

reduceJoin worked example
There are two data files.
The order file contains:

id      pid     amount
1001    01      1
1002    02      2
1003    03      3
1004    01      4
1005    02      5
1006    03      6

The product file contains:

pid     pname
01      Xiaomi
02      Huawei
03      Gree

Required result:

id      pname   amount
1001    Xiaomi  1
1004    Xiaomi  4
1002    Huawei  2
1005    Huawei  5
1003    Gree    3
1006    Gree    6
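
Before turning to the Hadoop code, the tag, group, and merge steps of reduceJoin can be checked against this required result with a small plain-Java simulation. This is a sketch under stated assumptions, not the article's implementation: the class ReduceJoinSketch and its methods are hypothetical, a TreeMap stands in for the shuffle, and the product names are written as Xiaomi, Huawei, and Gree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: simulates the reduce-side join in plain Java, without Hadoop.
class ReduceJoinSketch {

    // Rows of the order file {id, pid, amount} and the product file {pid, pname}.
    static final List<String[]> ORDERS = Arrays.asList(
            new String[] {"1001", "01", "1"}, new String[] {"1002", "02", "2"},
            new String[] {"1003", "03", "3"}, new String[] {"1004", "01", "4"},
            new String[] {"1005", "02", "5"}, new String[] {"1006", "03", "6"});
    static final List<String[]> PRODUCTS = Arrays.asList(
            new String[] {"01", "Xiaomi"}, new String[] {"02", "Huawei"},
            new String[] {"03", "Gree"});

    /** "Map" plus shuffle: key each record by pid and tag the value with its source. */
    static Map<String, List<String[]>> mapAndShuffle(List<String[]> orders,
                                                     List<String[]> products) {
        // The TreeMap groups values per key and keeps the keys sorted.
        Map<String, List<String[]>> grouped = new TreeMap<>();
        for (String[] o : orders) {
            grouped.computeIfAbsent(o[1], k -> new ArrayList<>())
                   .add(new String[] {"order", o[0], o[2]});   // tag, id, amount
        }
        for (String[] p : products) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>())
                   .add(new String[] {"product", p[1]});       // tag, pname
        }
        return grouped;
    }

    /** "Reduce": for each pid, combine the product name with every order record. */
    static List<String> reduce(Map<String, List<String[]>> grouped) {
        List<String> joined = new ArrayList<>();
        for (List<String[]> values : grouped.values()) {
            String pname = null;
            List<String[]> orderValues = new ArrayList<>();
            for (String[] v : values) {
                if (v[0].equals("product")) {
                    pname = v[1];
                } else {
                    orderValues.add(v);
                }
            }
            if (pname == null) {
                continue; // inner join: orders without a matching product are dropped
            }
            for (String[] o : orderValues) {
                joined.add(o[1] + "\t" + pname + "\t" + o[2]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        System.out.println("id\tpname\tamount");
        reduce(mapAndShuffle(ORDERS, PRODUCTS)).forEach(System.out::println);
    }
}
```

In this simulation the orders for one pid keep their file order; in a real job the order of values within a key is not guaranteed unless a secondary sort is configured. Note also that the reduce step buffers all order records for a key, which is where the reduce-side memory pressure and skew come from when one key is very frequent.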

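mapTask:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Mapper for the reduce-side join. TableBean is this example's custom map-output value class.
public class TableMapper extends Mapper<LongWritable, Text, Text, TableBean> {
    String fileName; // name of the file the current input split comes from
    String p_id;
    Text k = new Text();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Runs once per map task: record which input file this split belongs to,
        // so that each record can be tagged by its source.
        FileSplit inputSplit = (FileSplit) context.getInputSplit();
        fileName = inputSplit.getPath().getName();
    }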

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String line = value.to