Joining two or more tables is unavoidable in big-data processing, and it is expensive because it requires a Shuffle. This post walks through the classic reduce-side join; the complete implementation is available on my GitHub.
Base data
We have two tables, one holding user info and one holding the user page-view log. The DDL is as follows:
-- user behavior log
CREATE TABLE `tmp.log_user_behavior`(
  `user_id` string COMMENT 'user id',
  `item_id` string COMMENT 'item id',
  `category_id` string COMMENT 'item category id',
  `behavior` string COMMENT 'behavior type',
  `ts` date COMMENT 'time the behavior occurred')
PARTITIONED BY (
  `date` string)
Sample rows (the ts field is omitted here; the MapReduce job below parses exactly these four fields):
231758,3622677,4758477,pv
92253,642337,4135185,pv
297958,1762578,4801426,pv
786771,1940649,1320293,pv
789048,3144191,2355072,pv
895384,1138468,1602288,pv
578800,1324176,4135836,pv
886777,4606952,996587,pv
-- user info
CREATE TABLE `tmp.base_user_info`(
  `user_id` string COMMENT 'user id',
  `user_name` string COMMENT 'user name')
PARTITIONED BY (
  `date` string)
Sample rows:
66985,name-66985
332113,name-332113
102932,name-102932
874086,name-874086
We want to attach each user's name to their behavior records. In SQL this is simply:
select a.user_id, a.item_id, a.category_id, a.behavior, a.ts, b.user_name
from tmp.log_user_behavior a
join tmp.base_user_info b
on a.user_id = b.user_id
where a.date='20200831' and b.date='20200831'
This post implements the same join in MapReduce, walking through the reduce-side approach.
Reduce-side join
How it works
The reduce-side join is the simplest join strategy. Its main idea:
- In the Map phase, the map function reads both files, File1 and File2. To distinguish the two sources, every record is tagged: here tag=1 means the record comes from File1 (user info) and tag=2 means it comes from File2 (the action log). Tagging records by source is the Map phase's only job.
- In the Reduce phase, the reduce function receives, for each key, the value list collected from both File1 and File2, and joins the two sides for that key (a cartesian product). The Reduce phase performs the actual join and produces the result; see the sketch below.
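To make the per-key logic concrete, here is a minimal, framework-free sketch in plain Java (class and method names are illustrative, not from the original repo): split the tagged values into the two sides, then emit their cartesian product.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Framework-free sketch of the per-key reduce-side join (illustrative only). */
public class JoinSketch {

    /** Splits the tagged values for one key into two sides and returns their cartesian product. */
    public static List<String> joinOneKey(String key, List<String> taggedValues) {
        List<String> userInfoSide = new ArrayList<>();   // tag "1": user info
        List<String> userActionSide = new ArrayList<>(); // tag "2": user actions
        for (String v : taggedValues) {
            if (v.startsWith("1;")) {
                userInfoSide.add(v.substring(2));
            } else if (v.startsWith("2;")) {
                userActionSide.add(v.substring(2));
            }
        }
        List<String> joined = new ArrayList<>();
        for (String action : userActionSide) {
            for (String info : userInfoSide) {
                joined.add(key + ";" + info + ";" + action);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("2;642337;4135185;pv", "1;name-92253");
        // prints [92253;name-92253;642337;4135185;pv]
        System.out.println(joinOneKey("92253", values));
    }
}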
UserActionInfo implementation
package com.hadoop.mapreduce.bean;

import org.apache.hadoop.io.Writable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

/** User action record, made serializable so it can flow between Map and Reduce. */
public class UserActionInfo implements Writable {

    private String userId;
    private String itemId;
    private String categoryId;
    private String behavior;

    public UserActionInfo() {
    }

    public UserActionInfo(String userId, String itemId, String categoryId, String behavior) {
        this.userId = userId;
        this.itemId = itemId;
        this.categoryId = categoryId;
        this.behavior = behavior;
    }

    public String getUserId() {
        return userId;
    }

    public void setUserId(String userId) {
        this.userId = userId;
    }

    public String getItemId() {
        return itemId;
    }

    public void setItemId(String itemId) {
        this.itemId = itemId;
    }

    public String getCategoryId() {
        return categoryId;
    }

    public void setCategoryId(String categoryId) {
        this.categoryId = categoryId;
    }

    public String getBehavior() {
        return behavior;
    }

    public void setBehavior(String behavior) {
        this.behavior = behavior;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(this.userId);
        dataOutput.writeUTF(this.itemId);
        dataOutput.writeUTF(this.categoryId);
        dataOutput.writeUTF(this.behavior);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        // Fields must be read back in exactly the order they were written
        this.userId = dataInput.readUTF();
        this.itemId = dataInput.readUTF();
        this.categoryId = dataInput.readUTF();
        this.behavior = dataInput.readUTF();
    }
}
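As a quick local sanity check (not part of the original post), the Writable contract can be exercised with a plain-Java round trip: write() the object into a byte buffer, then readFields() it back out.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

public class UserActionInfoRoundTrip {
    public static void main(String[] args) throws Exception {
        UserActionInfo original = new UserActionInfo("92253", "642337", "4135185", "pv");

        // Serialize through write(), as the framework would between Map and Reduce
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize through readFields()
        UserActionInfo copy = new UserActionInfo();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy.getUserId() + "," + copy.getBehavior()); // 92253,pv
    }
}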
Mapper
The custom Mapper tags each record according to its input source, emits the user id as the key and the remaining fields as the value for the Reducer, and increments Counters so the record flow can be debugged.
package com.hadoop.mapreduce.map;

import com.hadoop.mapreduce.enums.FileRecorder;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

/** Mapper for the reduce-side join: tags every record with its source. */
public class ReduceJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length == 4) { // user action log: user_id,item_id,category_id,behavior
            String userId = fields[0];
            String itemId = fields[1];
            String categoryId = fields[2];
            String behavior = fields[3];
            // tag "2" marks a user-action record
            context.write(new Text(userId), new Text("2;" + itemId + ";" + categoryId + ";" + behavior));
            context.getCounter(FileRecorder.UserActionMapRecorder).increment(1);
        } else if (fields.length == 2) { // user info: user_id,user_name
            String userId = fields[0];
            String userName = fields[1];
            // tag "1" marks a user-info record
            context.write(new Text(userId), new Text("1;" + userName));
            context.getCounter(FileRecorder.UserInfoMapRecorder).increment(1);
        }
    }
}
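The FileRecorder enum referenced in the imports is not shown in the post; reconstructed from how the counters are used here and from the counter names in the run log below, it is presumably just:

package com.hadoop.mapreduce.enums;

/** Counters that trace how many records of each type flow through the job. */
public enum FileRecorder {
    UserActionMapRecorder,
    UserInfoMapRecorder,
    UserActionReduceRecorder,
    UserInfoReduceRecoreder // spelling kept as-is to match the original code and run log
}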
Reducer
In the custom Reducer, each user has exactly one name, so we collect all of the user's action records, attach the user name, and emit the joined rows.
package com.hadoop.mapreduce.reduce;

import com.hadoop.mapreduce.bean.UserActionInfo;
import com.hadoop.mapreduce.enums.FileRecorder;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ReduceJoinReducer extends Reducer<Text, Text, NullWritable, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String userId = key.toString();
        String userName = "";
        List<UserActionInfo> userActionInfos = new ArrayList<>();
        for (Text value : values) {
            String[] items = value.toString().split(";");
            if (items[0].equals("2")) { // tag "2": user-action record
                UserActionInfo userActionInfo = new UserActionInfo(userId, items[1], items[2], items[3]);
                userActionInfos.add(userActionInfo);
                context.getCounter(FileRecorder.UserActionReduceRecorder).increment(1);
            } else if (items[0].equals("1")) { // tag "1": user-info record
                userName = items[1];
                context.getCounter(FileRecorder.UserInfoReduceRecoreder).increment(1);
            }
        }
        // Emit every action record for this user with the user name attached
        for (UserActionInfo userActionInfo : userActionInfos) {
            context.write(NullWritable.get(), new Text(userActionInfo.getUserId() + ";" + userName + ";"
                    + userActionInfo.getItemId() + ";" + userActionInfo.getCategoryId() + ";"
                    + userActionInfo.getBehavior()));
        }
    }
}
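Two design points are worth noting. First, the reducer buffers every action record for a key in memory before the user name is known, because value order within a group is not guaranteed; for heavily skewed keys this can exhaust reducer memory, and the usual remedy is a secondary sort on a composite key so that the user-info record is guaranteed to arrive first and actions can be streamed out directly. Second, as written this behaves like a left outer join: action records with no matching user are still emitted, with an empty name, whereas the SQL above is an inner join.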
Main program
The driver wires the Mapper and Reducer together. Because there are multiple input sources, MultipleInputs is used to declare each input path along with the Mapper that handles it; here both paths share the same Mapper.
package com.hadoop.mapreduce.main;

import com.hadoop.mapreduce.map.ReduceJoinMapper;
import com.hadoop.mapreduce.reduce.ReduceJoinReducer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * The reduce-side join is the basic pattern for joining tables in MapReduce:
 * Map emits key-value pairs tagged with their source;
 * Reduce merges the records that share a key and writes the joined output.
 */
public class ReduceJoinMain {

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: ReduceJoinMain <user_info_dir> <user_action_log_dir> <output_dir>");
            System.exit(1);
        }
        String userInfoDir = args[0];
        String userLogDir = args[1];
        String output = args[2];

        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        /* Make sure both inputs exist */
        if (!fs.exists(new Path(userInfoDir))) {
            System.err.println("not found: " + userInfoDir);
            System.exit(1);
        }
        if (!fs.exists(new Path(userLogDir))) {
            System.err.println("not found: " + userLogDir);
            System.exit(1);
        }

        /* Remove any previous output */
        Path outputPath = new Path(output);
        if (fs.exists(outputPath)) {
            fs.delete(outputPath, true);
        }

        /* Job configuration */
        Job job = Job.getInstance(conf, "reduce-side-join-task");
        job.setJarByClass(ReduceJoinMain.class);

        /* Mapper settings: both inputs go through the same Mapper */
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        MultipleInputs.addInputPath(job, new Path(userInfoDir), TextInputFormat.class, ReduceJoinMapper.class);
        MultipleInputs.addInputPath(job, new Path(userLogDir), TextInputFormat.class, ReduceJoinMapper.class);

        /* Reducer settings */
        job.setReducerClass(ReduceJoinReducer.class);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(job, outputPath);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run
#!/bin/bash
jar=hadoop-repos-1.0-SNAPSHOT-jar-with-dependencies.jar
userInfoDir=hdfs://hadoop1/user/bigdata/tmp/mr/user_info/date=20200831
userActionDir=hdfs://hadoop1/user/bigdata/tmp/mr/user_behavior_log/date=20200831
outputDir=hdfs://hadoop1/user/bigdata/tmp/mr/reduce_side_join_result
hadoop jar ${jar} com.hadoop.mapreduce.main.ReduceJoinMain ${userInfoDir} ${userActionDir} ${outputDir}
## Run log
20/08/31 19:05:12 INFO impl.YarnClientImpl: Submitted application application_1597930504574_105878
20/08/31 19:05:12 INFO mapreduce.Job: The url to track the job: http://bj3-data-master-hadoop02.tencn:8088/proxy/application_1597930504574_105878/
20/08/31 19:05:12 INFO mapreduce.Job: Running job: job_1597930504574_105878
20/08/31 19:05:20 INFO mapreduce.Job: Job job_1597930504574_105878 running in uber mode : false
20/08/31 19:05:20 INFO mapreduce.Job: map 0% reduce 0%
20/08/31 19:05:30 INFO mapreduce.Job: map 1% reduce 0%
...
20/08/31 19:06:48 INFO mapreduce.Job: map 100% reduce 94%
20/08/31 19:06:49 INFO mapreduce.Job: map 100% reduce 100%
20/08/31 19:06:56 INFO mapreduce.Job: Job job_1597930504574_105878 completed successfully
20/08/31 19:06:56 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=10872377
FILE: Number of bytes written=61728468
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=16894309
HDFS: Number of bytes written=12919635
HDFS: Number of read operations=738
HDFS: Number of large read operations=0
HDFS: Number of write operations=64
Job Counters
Launched map tasks=214
Launched reduce tasks=32
Data-local map tasks=36
Rack-local map tasks=178
Total time spent by all maps in occupied slots (ms)=1264900
Total time spent by all reduces in occupied slots (ms)=315118
Total time spent by all map tasks (ms)=632450
Total time spent by all reduce tasks (ms)=157559
Total vcore-milliseconds taken by all map tasks=632450
Total vcore-milliseconds taken by all reduce tasks=157559
Total megabyte-milliseconds taken by all map tasks=1295257600
Total megabyte-milliseconds taken by all reduce tasks=322680832
Map-Reduce Framework
Map input records=717642
Map output records=717642
Map output bytes=18266565
Map output materialized bytes=12108909
Input split bytes=63028
Combine input records=0
Combine output records=0
Reduce input groups=231912
Reduce shuffle bytes=12108909
Reduce input records=717642
Reduce output records=485730
Spilled Records=1435284
Shuffled Maps =6848
Failed Shuffles=0
Merged Map outputs=6848
GC time elapsed (ms)=18418
CPU time spent (ms)=253370
Physical memory (bytes) snapshot=139216171008
Virtual memory (bytes) snapshot=918350282752
Total committed heap usage (bytes)=188129214464
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
com.hadoop.mapreduce.enums.FileRecorder
UserActionMapRecorder=485730
UserActionReduceRecorder=485730
UserInfoMapRecorder=231912
UserInfoReduceRecoreder=231912
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=12919635
User Action Map Num:485730
References
- http://shzhangji.com/cnblogs/2015/01/13/understand-reduce-side-join/
- https://www.cnblogs.com/codeOfLife/p/5521356.html