Big Data Hadoop: MR Map Join Hands-On Example

Latest recommended article published 2024-05-03 11:44:49

@阿证1024

Article link: https://blog.csdn.net/qq_43437122/article/details/106403354

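1. Use case

Map Join suits the case where one table is very small and the other is very large.

2. Advantages

Question: joining too many tables on the Reduce side very easily causes data skew. What can be done?

Cache the small tables on the Map side and apply the join logic there ahead of time. This moves work to the Map side, relieves the data pressure on the Reduce side, and reduces data skew as far as possible.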
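To make the skew concern concrete, the following minimal sketch counts how many records each reducer would receive if the join were done on the Reduce side with pid as the key. It is plain Java with no Hadoop dependency. The class name `SkewSketch`, the sample keys, and the use of `String.hashCode()` in place of a Hadoop key type are assumptions for illustration; the partition formula follows the one used by Hadoop's default `HashPartitioner`.

```java
import java.util.ArrayList;
import java.util.List;

final class SkewSketch {

    // Non-negative hash of the key, modulo the number of reducers.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many records would land on each reducer.
    static int[] recordsPerReducer(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partition(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical workload: one hot product id dominates the orders.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 98; i++) {
            keys.add("01");
        }
        keys.add("02");
        keys.add("03");

        int[] counts = recordsPerReducer(keys, 3);
        for (int r = 0; r < counts.length; r++) {
            System.out.println("reducer " + r + ": " + counts[r] + " records");
        }
    }
}
```

All records sharing a key go to the same reducer, so one hot pid overloads a single reducer. A map-side join skips the shuffle entirely, which is why it sidesteps this problem.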

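3. Approach: use DistributedCache

(1) In the Mapper's setup phase, read the file into an in-memory collection.
(2) In the driver, register the file with the cache.

// Cache a regular file on the nodes where the tasks run.
job.addCacheFile(new URI("file:///e:/cache/pd.txt"));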
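The number of slashes in a `file:` URI matters for Windows paths with a drive letter. The sketch below uses only `java.net.URI` to show how the two spellings are parsed. The class name `CacheUriSketch` and the helper `pathOf` are assumptions for illustration; the path is the one from this article.

```java
import java.net.URI;

final class CacheUriSketch {

    // Returns the path component of the given URI string.
    static String pathOf(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        // Three slashes: the authority is empty, so the drive letter stays in the path.
        System.out.println(pathOf("file:///e:/cache/pd.txt"));

        // Two slashes: "e:" is parsed as the authority and drops out of the path.
        System.out.println(pathOf("file://e:/cache/pd.txt"));
    }
}
```

Since the Mapper later opens the file through `getPath()`, the three-slash form is the safer choice when the path starts with a drive letter.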

4. Hands-on example

4.1 Requirements:

Order table order.txt:

id	pid	amount
1001	01	1
1002	02	2
1003	03	3
1004	01	4
1005	02	5
1006	03	6

Product table pd.txt:

pid	pname
01	小米
02	华为
03	格力

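Merge the data from the product table into the order table, matching on the product pid.

Final output format (a map-only job emits rows in input order, so the actual row order may differ from the grouped listing below):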

id	pname	amount
1001	小米	1
1004	小米	4
1002	华为	2
1005	华为	5
1003	格力	3
1006	格力	6
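As a quick check of the expected result, the join can be reproduced in plain Java without Hadoop. This is only a minimal sketch: the class `JoinSketch` and its method names are assumptions for illustration, the two lists stand in for pd.txt and order.txt without header rows, and the product names are romanized (Xiaomi, Huawei, Gree) to keep the source ASCII-only. Rows come out in input order here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class JoinSketch {

    // Builds the in-memory lookup table pid -> pname from the small table.
    static Map<String, String> loadProducts(List<String> pdLines) {
        Map<String, String> pdMap = new HashMap<>();
        for (String line : pdLines) {
            String[] fields = line.split("\t"); // pid, pname
            pdMap.put(fields[0], fields[1]);
        }
        return pdMap;
    }

    // Streams the large table and replaces each pid with its pname.
    static List<String> join(List<String> orderLines, Map<String, String> pdMap) {
        List<String> out = new ArrayList<>();
        for (String line : orderLines) {
            String[] fields = line.split("\t"); // id, pid, amount
            out.add(fields[0] + "\t" + pdMap.get(fields[1]) + "\t" + fields[2]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> pd = Arrays.asList("01\tXiaomi", "02\tHuawei", "03\tGree");
        List<String> orders = Arrays.asList(
                "1001\t01\t1", "1002\t02\t2", "1003\t03\t3",
                "1004\t01\t4", "1005\t02\t5", "1006\t03\t6");

        for (String row : join(orders, loadProducts(pd))) {
            System.out.println(row);
        }
    }
}
```

The small table is loaded once into a `HashMap`, and each order row needs only a constant-time lookup. The Mapper in section 4.3 follows the same pattern.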

4.2. Requirement analysis

MapJoin suits joins in which one of the tables is small.

4.3. Implementation

Driver:

Note: set the number of reduce tasks to 0. The MapJoin logic is already complete on the Map side, so no Reducer is needed.

package com.mapreduce.mapjoin;

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;


public class TableDriver {
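	public static void main(String[] args) throws Exception {
		// Hard-coded local paths for testing; adjust them to your environment
		args = new String[] { "D:\\hadoop-2.7.1\\winMR\\MapJoin\\input",
				"D:\\hadoop-2.7.1\\winMR\\MapJoin\\output1" };

		// Get the job instance
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf);

		// Locate the jar through the driver class
		job.setJarByClass(TableDriver.class);

		// Set the Mapper
		job.setMapperClass(TableMapper.class);

		// Set the Mapper output key and value types
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(NullWritable.class);

		// Register the file to be cached
		job.addCacheFile(new URI("file:///D:/hadoop-2.7.1/winMR/MapJoin/cachefile/pd.txt"));
		
		// A map-side join needs no reduce phase, so set the number of reduce tasks to 0
		job.setNumReduceTasks(0);
		
		// Set the input and output paths
		FileInputFormat.setInputPaths(job, new Path(args[0]));
		FileOutputFormat.setOutputPath(job, new Path(args[1]));

		// Submit the job and report success or failure through the exit code
		boolean result = job.waitForCompletion(true);
		System.exit(result ? 0 : 1);
	}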
}

Mapper:

package com.mapreduce.mapjoin;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TableMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
	
	// pid -> pname, loaded once per map task from the cached product file
	private final Map<String, String> pdMap = new HashMap<>();
	
	// Reused output key, so no new Text is allocated per record
	private final Text k = new Text();
	
	@Override
	protected void setup(Context context) throws IOException, InterruptedException {
		// Get the cached file
		URI[] cacheFiles = context.getCacheFiles();
		String path = cacheFiles[0].getPath();
		
		// Read the file; try-with-resources closes the stream even if reading fails
		try (BufferedReader reader = new BufferedReader(
				new InputStreamReader(new FileInputStream(path), "UTF-8"))) {
			String line;
			// Stops at end of file or at the first empty line
			while (StringUtils.isNotEmpty(line = reader.readLine())) {
				// Split the line into pid and pname
				String[] fields = line.split("\t");
				// Cache the pair in the map
				pdMap.put(fields[0], fields[1]);
			}
		}
	}
	
	@Override
	protected void map(LongWritable key, Text value, Context context)
			throws IOException, InterruptedException {
		// 1. Get one line
		String line = value.toString();
		
		// 2. Split it into id, pid, amount
		String[] fields = line.split("\t");
		
		// 3. Get the product id
		String pId = fields[1];
		
		// 4. Look up the product name
		String pName = pdMap.get(pId);
		
		// 5. Assemble id, pname, amount to match the output format in section 4.1
		k.set(fields[0] + "\t" + pName + "\t" + fields[2]);
		
		// 6. Write out
		context.write(k, NullWritable.get());
	}
}
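The `setup` loop above assumes that every line of pd.txt has at least two tab-separated fields, and it stops at the first empty line. A more defensive loader might skip blank or malformed lines instead. The following is a minimal sketch of that idea, not part of the original job: `LookupLoader` and `load` are hypothetical names, and a `StringReader` stands in for the cached file so the example runs without Hadoop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

final class LookupLoader {

    // Reads "pid<TAB>pname" lines into a map, skipping blank and malformed lines.
    static Map<String, String> load(Reader source) {
        Map<String, String> pdMap = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t");
                if (fields.length < 2 || fields[0].isEmpty()) {
                    continue; // blank line or missing field: ignore it
                }
                pdMap.put(fields[0], fields[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pdMap;
    }

    public static void main(String[] args) {
        // Hypothetical file content with a blank line and a malformed line.
        String data = "01\tXiaomi\n\n02\tHuawei\nbroken-line\n03\tGree\n";
        Map<String, String> pdMap = load(new StringReader(data));
        System.out.println(pdMap.size() + " products loaded");

        // A fallback avoids writing the literal string "null" for an unknown pid.
        System.out.println(pdMap.getOrDefault("99", "UNKNOWN"));
    }
}
```

Whether to skip, log, or fail on bad lines depends on how much the lookup file is trusted. Skipping silently is only one option.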