Hadoop (6) -- MapReduce (4): join

BubbleMa

Category: Hadoop. Tags: mapreduce, hadoop, big data

Article link: https://blog.csdn.net/qq_26857793/article/details/121910565


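I. Methods in Mapper

1. setup()

A hook that runs before map(). It is called once when each map task initializes.

Purpose: prepare the parameters (data) that the map function needs.

reduce join: read the file path and file name, and prepare parameters

map join: read the small file into memory to prepare the data

2. cleanup()

Wrap-up work. It is called once when each map task finishes.

One input split corresponds to one map task. setup and cleanup are each called once per map task, while map is called once per line.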

public void run(Context context) throws IOException, InterruptedException {
    // Call setup once before any record is processed
    setup(context);
    try {
      // context.nextKeyValue() checks whether there is another key/value pair
      // key - byte offset of the line, value - the content of one line
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      // Call cleanup once, even if map() threw an exception
      cleanup(context);
    }
}
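
To make the call counts concrete, here is a minimal plain-Java sketch of the same lifecycle. It uses no Hadoop classes; `LifecycleSketch` and its `calls` list are hypothetical names introduced only for illustration, and a `List<String>` stands in for one input split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java stand-in for the Mapper lifecycle; no Hadoop classes are used. */
public class LifecycleSketch {
    // Records every lifecycle call in order, so the sequence can be inspected.
    final List<String> calls = new ArrayList<>();

    void setup() { calls.add("setup"); }
    void map(long offset, String line) { calls.add("map:" + line); }
    void cleanup() { calls.add("cleanup"); }

    /** Mirrors the shape of Mapper.run(): setup, one map() per record, cleanup in finally. */
    void run(List<String> split) {
        setup();
        try {
            long offset = 0;
            for (String line : split) {
                map(offset, line);
                offset += line.length() + 1; // +1 for the newline; assumes single-byte characters
            }
        } finally {
            cleanup();
        }
    }

    public static void main(String[] args) {
        LifecycleSketch task = new LifecycleSketch();
        task.run(Arrays.asList("a", "b", "c"));
        System.out.println(task.calls); // [setup, map:a, map:b, map:c, cleanup]
    }
}
```

For a three-line split, setup and cleanup each appear once and map appears three times, which matches the rule stated above.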

Getting the file name in setup()

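// Requires org.apache.hadoop.mapreduce.InputSplit and
// org.apache.hadoop.mapreduce.lib.input.FileSplit
String fileName = null;
/**
 * @param context the context object, used to read file information
 *                (each map task corresponds to one input split)
 */
@Override
protected void setup(Context context) throws IOException, InterruptedException {
	// Get the split handled by the current map task
	InputSplit inputSplit = context.getInputSplit();
	// File-based input formats hand out FileSplit instances
	FileSplit fs = (FileSplit) inputSplit;
	fileName = fs.getPath().getName();
}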

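II. Reduce join

1) Map side

setup() -- get the file name

In map: key -- the join key; value -- a tag + the remaining fields

2) Reduce side

Receives all the records that share the same join key.

First create two collections to hold the remaining fields of the two tables. Loop over values and put each record into the collection that matches its tag. Then loop over the collections and concatenate the records. (If table a has no row for a join key that appears in table b, collection a is empty and the loop body never runs -- this gives an inner join.)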

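import java.io.IOException;
import java.util.ArrayList;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

/**
 * Data (tab-separated):
 * 1. order table
 * 		oid		date		pid		num
 * 		1001	20150710	p0001	2
 * 		1002	20150710	p0001	3
 * 2. product table
 * 		pid		product			product_code	price
 * 		p0001	Xiaomi 5		C01				2000
 * 		p0002	Smartisan T1	C01				3500
 */
public class ReduceJoin {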
	static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
		String fileName = null;
		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			InputSplit inputSplit = context.getInputSplit();
			FileSplit fs = (FileSplit) inputSplit;
			fileName = fs.getPath().getName();
		}

		@Override
		protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
			String[] infos = value.toString().split("\t");
			if (fileName.startsWith("order")) {
				// Tag each record with its source table before sending it to reduce
				context.write(new Text(infos[2]), new Text("o," + infos[0] + "," + infos[1] + "," + infos[3]));
			} else {
				context.write(new Text(infos[0]), new Text("p," + infos[1] + "," + infos[2] + "," + infos[3]));
			}
		}
	}

	static class MyReducer extends Reducer<Text, Text, Text, Text> {
		@Override
		protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
			// Create two containers on the reduce side, one per table
			ArrayList<String> orders = new ArrayList<String>();
			ArrayList<String> products = new ArrayList<String>();
			for (Text value : values) {
				String[] infos = value.toString().split(",");
				if (infos[0].equals("o")) orders.add(value.toString().substring(2));
				else products.add(value.toString().substring(2));
			}
			for (String order : orders) {
				for (String product : products) {
					// Concatenate the records from the two tables
					context.write(key, new Text(order + "," + product));
				}
			}
		}
	}
}
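
To trace what this job produces on the sample data, the following is a plain-Java sketch of the same tag-and-join logic without Hadoop. The class `ReduceJoinSketch` and its helper methods are hypothetical names; a `TreeMap` stands in for the shuffle's grouping by key, and the product names are written in English.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceJoinSketch {
    /** Map phase: tag each record with its source table and group by pid (stand-in for the shuffle). */
    static Map<String, List<String>> mapAndShuffle(List<String> orders, List<String> products) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : orders) {
            String[] f = line.split("\t");
            grouped.computeIfAbsent(f[2], k -> new ArrayList<>())
                   .add("o," + f[0] + "," + f[1] + "," + f[3]);
        }
        for (String line : products) {
            String[] f = line.split("\t");
            grouped.computeIfAbsent(f[0], k -> new ArrayList<>())
                   .add("p," + f[1] + "," + f[2] + "," + f[3]);
        }
        return grouped;
    }

    /** Reduce phase for one key: split values by tag, then emit the cross product. */
    static List<String> reduce(String key, List<String> values) {
        List<String> orders = new ArrayList<>();
        List<String> products = new ArrayList<>();
        for (String v : values) {
            if (v.startsWith("o,")) orders.add(v.substring(2));
            else products.add(v.substring(2));
        }
        List<String> out = new ArrayList<>();
        for (String o : orders) {
            for (String p : products) {
                out.add(key + "\t" + o + "," + p);
            }
        }
        return out;
    }

    /** Runs both phases over the two tables and collects the joined lines. */
    static List<String> join(List<String> orders, List<String> products) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : mapAndShuffle(orders, products).entrySet()) {
            out.addAll(reduce(e.getKey(), e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList(
            "1001\t20150710\tp0001\t2", "1002\t20150710\tp0001\t3");
        List<String> products = Arrays.asList(
            "p0001\tXiaomi 5\tC01\t2000", "p0002\tSmartisan T1\tC01\t3500");
        join(orders, products).forEach(System.out::println);
    }
}
```

Only p0001 appears in the output, once per order. p0002 has no orders, so its order list is empty and nothing is emitted for it, which is the inner-join behavior described above.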

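Drawbacks of reduce join:

1) Joining on the reduce side is prone to data skew

2) All joinable records are sent to the reduce side and stored in list collections; when a single key has a very large number of associated records, this can cause an OOM (OutOfMemoryError)

3) Parallelism on the reduce side is inherently limited; as a rule of thumb, the number of reduce tasks <= number of nodes * 0.95 (the Hadoop documentation multiplies the node count by the maximum number of containers per node before applying the factor)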
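
The skew in point 1 follows from how keys are assigned to reduce tasks. The sketch below applies the formula used by Hadoop's default HashPartitioner, `(hashCode & Integer.MAX_VALUE) % numReduceTasks`, to a made-up key distribution. `SkewSketch` is a hypothetical name, the key list is invented for illustration, and `String.hashCode` stands in for `Text.hashCode`, so the actual partition numbers in a real job would differ.

```java
import java.util.Arrays;
import java.util.List;

public class SkewSketch {
    /** Same formula as Hadoop's HashPartitioner; String.hashCode stands in for Text.hashCode. */
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Counts how many records each reduce task would receive. */
    static int[] load(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String k : keys) {
            counts[partition(k, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical distribution: p0001 is a "hot" product that owns most of the orders.
        List<String> keys = Arrays.asList(
            "p0001", "p0001", "p0001", "p0001", "p0001", "p0002", "p0003");
        System.out.println(Arrays.toString(load(keys, 3)));
    }
}
```

Because every record with the same key lands in the same partition, the reduce task that owns p0001 receives at least five of the seven records no matter how many reduce tasks are configured.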

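III. Map join

Before map() runs, load the other file into a cache; each map() call then reads one record and joins it against the cached data.

Stream: the stream used here can only read local files, so a copy of the file must be present on every node that runs a map task.

Specify it in the job: job.addCacheFile(URI); this caches the file at the given URI on every node that runs a map task.

Overall logic:

The map task reads one table (the large table), and the other table (the small table) is cached on every map task node.

Map side:

setup: create a local stream, read the locally cached small table, and store it in a Map container

map: each call reads one record from the large table and looks up the matching entry in the Map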

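static class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
	// Small table held in memory: join key -> remaining fields
	Map<String, String> cache = new HashMap<String, String>();
	@Override
	protected void setup(Context context) throws IOException, InterruptedException {
		// Get the local cache paths; it is an array because several files may be cached.
		// Note: getLocalCacheFiles() is deprecated in newer Hadoop releases.
		Path[] path = context.getLocalCacheFiles();
		// Open a local character stream; try-with-resources closes it when done
		try (BufferedReader reader = new BufferedReader(new FileReader(path[0].toString()))) {
			String line = null;
			while ((line = reader.readLine()) != null) {
				String[] infos = line.split("\t");
				// Store the small table's rows in the cache map
				cache.put(infos[0], infos[1] + "\t" + infos[2]);
			}
		}
	}

	@Override
	protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		// Assumption: the large table is the order table, with the join key (pid) in column 2
		String[] infos = value.toString().split("\t");
		String product = cache.get(infos[2]);
		// Inner join: emit only when the small table has a matching row
		if (product != null) {
			context.write(new Text(infos[2]), new Text(value.toString() + "\t" + product));
		}
	}
}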
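
The lookup step can be tried without a cluster. The following plain-Java sketch assumes the product table is the small table and the order table is the large one; `MapJoinSketch`, `loadCache`, and `mapOne` are hypothetical names, and a `List<String>` stands in for the lines read from the cached file.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapJoinSketch {
    /** setup() stand-in: read the small table once into an in-memory map keyed by pid. */
    static Map<String, String> loadCache(List<String> smallTableLines) {
        Map<String, String> cache = new HashMap<>();
        for (String line : smallTableLines) {
            String[] f = line.split("\t");
            cache.put(f[0], f[1] + "\t" + f[2]);
        }
        return cache;
    }

    /** map() stand-in: join one large-table record; returns null when there is no match. */
    static String mapOne(String orderLine, Map<String, String> cache) {
        String[] f = orderLine.split("\t");
        String product = cache.get(f[2]); // f[2] is pid in the order table
        return product == null ? null : orderLine + "\t" + product;
    }

    public static void main(String[] args) {
        Map<String, String> cache = loadCache(Arrays.asList(
            "p0001\tXiaomi 5\tC01\t2000", "p0002\tSmartisan T1\tC01\t3500"));
        // The second order uses a made-up pid (p0009) that has no product row.
        for (String order : Arrays.asList(
                "1001\t20150710\tp0001\t2", "1002\t20150710\tp0009\t3")) {
            String joined = mapOne(order, cache);
            if (joined != null) {
                System.out.println(joined);
            }
        }
    }
}
```

Each record is joined with a single hash lookup and nothing is buffered per key, which is why this approach needs neither a shuffle nor a reduce phase.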

Job settings

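// The number of reduce tasks defaults to 1 if not set; set it to 0 when no reduce task is needed
// If no Reducer class is set, the default Reducer (which passes records through unchanged) runs
job.setNumReduceTasks(0);
// Add the cache file
job.addCacheFile(new URI("/"));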

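Drawback of map join: one table has to be held in the cache, so it is not suitable for joining a large table with another large table

Advantages of map join:

1) It does not cause data skew

2) High parallelism and high execution efficiency

3) There is no reduce task and no shuffle (no sorting | grouping | partitioning | Combiner), so execution is efficient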
