Joining two or more tables is unavoidable in big-data processing, and it is expensive because it requires a Shuffle. This post walks through the classic reduce-side join; the complete implementation is available on my GitHub.
Base data
We have two tables, one holding user info and one holding the user page-view log. The DDL is as follows:
-- user behavior log
CREATE TABLE `tmp.log_user_behavior`(
  `user_id` string COMMENT 'user id',
  `item_id` string COMMENT 'item id',
  `category_id` string COMMENT 'item category id',
  `behavior` string COMMENT 'behavior type',
  `ts` date COMMENT 'time the behavior occurred')
PARTITIONED BY (
  `date` string)
Sample rows (the ts field is omitted here; the MapReduce job below parses exactly these four fields):
231758,3622677,4758477,pv
92253,642337,4135185,pv
297958,1762578,4801426,pv
786771,1940649,1320293,pv
789048,3144191,2355072,pv
895384,1138468,1602288,pv
578800,1324176,4135836,pv
886777,4606952,996587,pv
-- user info
CREATE TABLE `tmp.base_user_info`(
  `user_id` string COMMENT 'user id',
  `user_name` string COMMENT 'user name')
PARTITIONED BY (
  `date` string)
Sample rows:
66985,name-66985
332113,name-332113
102932,name-102932
874086,name-874086
We want to attach each user's name to their behavior records. In SQL this is simply:
select a.user_id, a.item_id, a.category_id, a.behavior, a.ts, b.user_name
from tmp.log_user_behavior a
join tmp.base_user_info b
on a.user_id = b.user_id
where a.date='20200831' and b.date='20200831'
This post implements the same join in MapReduce, walking through the reduce-side approach.
Reduce-side join
How it works
The reduce-side join is the simplest join strategy. Its main idea:
- In the Map phase, the map function reads both files, File1 and File2. To distinguish the two sources, every record is tagged: here tag=1 means the record comes from File1 (user info) and tag=2 means it comes from File2 (the action log). Tagging records by source is the Map phase's only job.
- In the Reduce phase, the reduce function receives, for each key, the value list collected from both File1 and File2, and joins the two sides for that key (a cartesian product). The Reduce phase performs the actual join and produces the result; see the sketch below.
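To make the per-key logic concrete, here is a minimal, framework-free sketch in plain Java (class and method names are illustrative, not from the original repo): split the tagged values into the two sides, then emit their cartesian product.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Framework-free sketch of the per-key reduce-side join (illustrative only). */
public class JoinSketch {

    /** Splits the tagged values for one key into two sides and returns their cartesian product. */
    public static List<String> joinOneKey(String key, List<String> taggedValues) {
        List<String> userInfoSide = new ArrayList<>();   // tag "1": user info
        List<String> userActionSide = new ArrayList<>(); // tag "2": user actions
        for (String v : taggedValues) {
            if (v.startsWith("1;")) {
                userInfoSide.add(v.substring(2));
            } else if (v.startsWith("2;")) {
                userActionSide.add(v.substring(2));
            }
        }
        List<String> joined = new ArrayList<>();
        for (String action : userActionSide) {
            for (String info : userInfoSide) {
                joined.add(key + ";" + info + ";" + action);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("2;642337;4135185;pv", "1;name-92253");
        // prints [92253;name-92253;642337;4135185;pv]
        System.out.println(joinOneKey("92253", values));
    }
}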
UserActionInfo implementation
package com.hadoop.mapreduce.bean;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** User action record, made serializable so it can flow between Map and Reduce. */
public class UserActionInfo implements Writable {

    private String userId;
    private String itemId;
    private String categoryId;
    private String behavior;

    public UserActionInfo() {
    }

    public UserActionInfo(String userId, String itemId, String categoryId, String behavior) {
        this.userId = userId;
        this.itemId = itemId;
        this.categoryId = categoryId;
        this.behavior = behavior;
    }

    public String getUserId() {
        return userId;
    }

    public void setUserId(String userId) {
        this.userId = userId;
    }

    public String getItemId() {
        return itemId;
    }

    public void setItemId(String itemId) {
        this.itemId = itemId;
    }

    public String getCategoryId() {
        return categoryId;
    }

    public void setCategoryId(String categoryId) {
        this.categoryId = categoryId;
    }

    public String getBehavior() {
        return behavior;
    }

    public void setBehavior(String behavior) {
        this.behavior = behavior;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(this.userId);
        dataOutput.writeUTF(this.itemId);
        dataOutput.writeUTF(this.categoryId);
        dataOutput.writeUTF(this.behavior);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // Fields must be read back in exactly the order they were written
        this.userId = dataInput.readUTF();
        this.itemId = dataInput.readUTF();
        this.categoryId = dataInput.readUTF();
        this.behavior = dataInput.readUTF();
    }
}
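As a quick local sanity check (not part of the original post), the Writable contract can be exercised with a plain-Java round trip: write() the object into a byte buffer, then readFields() it back out.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class UserActionInfoRoundTrip {
    public static void main(String[] args) throws Exception {
        UserActionInfo original = new UserActionInfo("92253", "642337", "4135185", "pv");

        // Serialize through write(), as the framework would between Map and Reduce
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize through readFields()
        UserActionInfo copy = new UserActionInfo();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.getUserId() + "," + copy.getBehavior()); // 92253,pv
    }
}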
Mapper
The custom Mapper tags each record according to its input source, emits the user id as the key and the remaining fields as the value for the Reducer, and increments Counters so the record flow can be debugged.
package com.hadoop.mapreduce.map;

import com.hadoop.mapreduce.enums.FileRecorder;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/** Mapper for the reduce-side join: tags every record with its source. */
public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length == 4) { // user action log: user_id,item_id,category_id,behavior
            String userId = fields[0];
            String itemId = fields[1];
            String categoryId = fields[2];
            String behavior = fields[3];
            // tag "2" marks a user-action record
            context.write(new Text(userId), new Text("2;" + itemId + ";" + categoryId + ";" + behavior));
            context.getCounter(FileRecorder.UserActionMapRecorder).increment(1);
        } else if (fields.length == 2) { // user info: user_id,user_name
            String userId = fields[0];
            String userName = fields[1];
            // tag "1" marks a user-info record
            context.write(new Text(userId), new Text("1;" + userName));
            context.getCounter(FileRecorder.UserInfoMapRecorder).increment(1);
        }
    }
}
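The FileRecorder enum referenced in the imports is not shown in the post; reconstructed from how the counters are used here and from the counter names in the run log below, it is presumably just:

package com.hadoop.mapreduce.enums;

/** Counters that trace how many records of each type flow through the job. */
public enum FileRecorder {
    UserActionMapRecorder,
    UserInfoMapRecorder,
    UserActionReduceRecorder,
    UserInfoReduceRecoreder // spelling kept as-is to match the original code and run log
}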
Reducer
In the custom Reducer, each user has exactly one name, so we collect all of the user's action records, attach the user name, and emit the joined rows.
package com.hadoop.mapreduce.reduce;

import com.hadoop.mapreduce.bean.UserActionInfo;
import com.hadoop.mapreduce.enums.FileRecorder;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ReduceJoinReducer extends Reducer<Text, Text, NullWritable, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String userId = key.toString();
        String userName = "";
        List<UserActionInfo> userActionInfos = new ArrayList<>();
        for (Text value : values) {
            String[] items = value.toString().split(";");
            if (items[0].equals("2")) { // tag "2": user-action record
                UserActionInfo userActionInfo = new UserActionInfo(userId, items[1], items[2], items[3]);
                userActionInfos.add(userActionInfo);
                context.getCounter(FileRecorder.UserActionReduceRecorder).increment(1);
            } else if (items[0].equals("1")) { // tag "1": user-info record
                userName = items[1];
                context.getCounter(FileRecorder.UserInfoReduceRecoreder).increment(1);
            }
        }
        // Emit every action record for this user with the user name attached
        for (UserActionInfo userActionInfo : userActionInfos) {
            context.write(NullWritable.get(), new Text(userActionInfo.getUserId() + ";" + userName + ";"
                    + userActionInfo.getItemId() + ";" + userActionInfo.getCategoryId() + ";"
                    + userActionInfo.getBehavior()));
        }
    }
}
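Two design points are worth noting. First, the reducer buffers every action record for a key in memory before the user name is known, because value order within a group is not guaranteed; for heavily skewed keys this can exhaust reducer memory, and the usual remedy is a secondary sort on a composite key so that the user-info record is guaranteed to arrive first and actions can be streamed out directly. Second, as written this behaves like a left outer join: action records with no matching user are still emitted, with an empty name, whereas the SQL above is an inner join.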
Main program
The driver wires the Mapper and Reducer together. Because there are multiple input sources, MultipleInputs is used to declare each input path along with the Mapper that handles it; here both paths share the same Mapper.
package com.hadoop.mapreduce.main;

import com.hadoop.mapreduce.map.ReduceJoinMapper;
import com.hadoop.mapreduce.reduce.ReduceJoinReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * The reduce-side join is the basic pattern for joining tables in MapReduce:
 * Map emits key-value pairs tagged with their source;
 * Reduce merges the records that share a key and writes the joined output.
 */
public class ReduceJoinMain {

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: ReduceJoinMain <user_info_dir> <user_action_log_dir> <output_dir>");
            System.exit(1);
        }
        String userInfoDir = args[0];
        String userLogDir = args[1];
        String output = args[2];

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        /* Make sure both inputs exist */
        if (!fs.exists(new Path(userInfoDir))) {
            System.err.println("not found: " + userInfoDir);
            System.exit(1);
        }
        if (!fs.exists(new Path(userLogDir))) {
            System.err.println("not found: " + userLogDir);
            System.exit(1);
        }

        /* Remove any previous output */
        Path outputPath = new Path(output);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }

        /* Job configuration */
        Job job = Job.getInstance(conf, "reduce-side-join-task");
        job.setJarByClass(ReduceJoinMain.class);

        /* Mapper settings: both inputs go through the same Mapper */
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        MultipleInputs.addInputPath(job, new Path(userInfoDir), TextInputFormat.class, ReduceJoinMapper.class);
        MultipleInputs.addInputPath(job, new Path(userLogDir), TextInputFormat.class, ReduceJoinMapper.class);

        /* Reducer settings */
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, outputPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run
#!/bin/bash
jar=hadoop-repos-1.0-SNAPSHOT-jar-with-dependencies.jar
userInfoDir=hdfs://hadoop1/user/bigdata/tmp/mr/user_info/date=20200831
userActionDir=hdfs://hadoop1/user/bigdata/tmp/mr/user_behavior_log/date=20200831
outputDir=hdfs://hadoop1/user/bigdata/tmp/mr/reduce_side_join_result
hadoop jar ${jar} com.hadoop.mapreduce.main.ReduceJoinMain ${userInfoDir} ${userActionDir} ${outputDir}
## Run log
20/08/31 19:05:12 INFO impl.YarnClientImpl: Submitted application application_1597930504574_105878
20/08/31 19:05:12 INFO mapreduce.Job: The url to track the job: http://bj3-data-master-hadoop02.tencn:8088/proxy/application_1597930504574_105878/
20/08/31 19:05:12 INFO mapreduce.Job: Running job: job_1597930504574_105878
20/08/31 19:05:20 INFO mapreduce.Job: Job job_1597930504574_105878 running in uber mode : false
20/08/31 19:05:20 INFO mapreduce.Job: map 0% reduce 0%
20/08/31 19:05:30 INFO mapreduce.Job: map 1% reduce 0%
...
20/08/31 19:06:48 INFO mapreduce.Job: map 100% reduce 94%
20/08/31 19:06:49 INFO mapreduce.Job: map 100% reduce 100%
20/08/31 19:06:56 INFO mapreduce.Job: Job job_1597930504574_105878 completed successfully
20/08/31 19:06:56 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=10872377
FILE: Number of bytes written=61728468
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=16894309
HDFS: Number of bytes written=12919635
HDFS: Number of read operations=738
HDFS: Number of large read operations=0
HDFS: Number of write operations=64
Job Counters
Launched map tasks=214
Launched reduce tasks=32
Data-local map tasks=36
Rack-local map tasks=178
Total time spent by all maps in occupied slots (ms)=1264900
Total time spent by all reduces in occupied slots (ms)=315118
Total time spent by all map tasks (ms)=632450
Total time spent by all reduce tasks (ms)=157559
Total vcore-milliseconds taken by all map tasks=632450
Total vcore-milliseconds taken by all reduce tasks=157559
Total megabyte-milliseconds taken by all map tasks=1295257600
Total megabyte-milliseconds taken by all reduce tasks=322680832
Map-Reduce Framework
Map input records=717642
Map output records=717642
Map output bytes=18266565
Map output materialized bytes=12108909
Input split bytes=63028
Combine input records=0
Combine output records=0
Reduce input groups=231912
Reduce shuffle bytes=12108909
Reduce input records=717642
Reduce output records=485730
Spilled Records=1435284
Shuffled Maps =6848
Failed Shuffles=0
Merged Map outputs=6848
GC time elapsed (ms)=18418
CPU time spent (ms)=253370
Physical memory (bytes) snapshot=139216171008
Virtual memory (bytes) snapshot=918350282752
Total committed heap usage (bytes)=188129214464
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
com.hadoop.mapreduce.enums.FileRecorder
UserActionMapRecorder=485730
UserActionReduceRecorder=485730
UserInfoMapRecorder=231912
UserInfoReduceRecoreder=231912
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=12919635
User Action Map Num:485730
References
- http://shzhangji.com/cnblogs/2015/01/13/understand-reduce-side-join/
- https://www.cnblogs.com/codeOfLife/p/5521356.html