MapReduce: Implementing a Multi-Table Join Query

0. Approach

1. In the map phase, work out which table each input record came from and handle the two cases separately.
2. In the reduce phase, for the values sharing the same key, separate the department-table entries from the employee-table entries.
3. Write a Job (driver) class and configure the mapper, reducer, and input/output paths.
4. Put emp.csv and dept.csv in the same input directory (sample rows are shown below).
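The screenshots of the input directory and file contents are not reproduced here. Going by the record formats quoted in the mapper below, each file holds comma-separated rows along these lines (only these two rows appear verbatim in the original; the column layout follows the classic emp/dept sample schema):

```
emp.csv  : 7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
dept.csv : 20,RESEARCH,DALLAS
```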

1. Dependencies

```xml
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-common</artifactId>
    <version>2.7.3</version>
</dependency>
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-mapreduce-client-core</artifactId>
    <version>2.7.3</version>
</dependency>
```
2. Add a log4j.properties file to the resources directory with the following content:

```properties
log4j.rootLogger=FATAL, dest1
log4j.logger.dsaLogging=DEBUG, dsa
log4j.additivity.dsaLogging=false

log4j.appender.dest1=org.apache.log4j.ConsoleAppender
log4j.appender.dest1.layout=org.apache.log4j.PatternLayout
log4j.appender.dest1.layout.ConversionPattern=%-5p:%l: %m%n
log4j.appender.dest1.ImmediateFlush=true

log4j.appender.dsa=org.apache.log4j.RollingFileAppender
log4j.appender.dsa.File=./logs/dsa.log
log4j.appender.dsa.MaxFileSize=2000KB
# Previously MaxBackupIndex=2
log4j.appender.dsa.MaxBackupIndex=5
log4j.appender.dsa.layout=org.apache.log4j.PatternLayout
log4j.appender.dsa.layout.ConversionPattern=%l:%d: %m%n
```
3. The mapper class: EqualJoinMapper.java

```java
package com.mr.jointable;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class EqualJoinMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    IntWritable key2 = new IntWritable();
    Text value2 = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        /* Record formats:
         * emp:  7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30
         * dept: 20,RESEARCH,DALLAS
         */
        System.out.println("offset: " + key + ", value: " + value.toString());
        // 1. Split the line on commas
        String[] splits = value.toString().split(",");
        // 2. Tell emp records apart from dept records by the number of fields
        if (splits.length >= 8) {       // emp record
            String empName = splits[1];
            String empDept = splits[7];
            key2.set(Integer.parseInt(empDept));
            value2.set(empName);
        } else {                        // dept record
            String deptNo = splits[0];
            // The leading "*" marks this value as coming from the dept table
            String deptName = "*" + splits[1];
            key2.set(Integer.parseInt(deptNo));
            value2.set(deptName);
        }
        // 3. Emit (deptNo, name) via the context
        context.write(key2, value2);
    }
}
```
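As a quick trace of the mapper logic (my illustration, not output from the original post): the sample emp record `7499,ALLEN,SALESMAN,7698,1981/2/20,1600,300,30` has eight fields, so the mapper emits the pair (30, ALLEN), taking the department number from field 7 and the name from field 1; the sample dept record `20,RESEARCH,DALLAS` has only three fields, so it emits (20, *RESEARCH), with the leading `*` marking the value as department data.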
4. The reducer class: EqualJoinReducer.java

```java
package com.mr.jointable;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

public class EqualJoinReducer extends Reducer<IntWritable, Text, Text, Text> {

    Text key4 = new Text();
    Text value4 = new Text();

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // Pull the department name and the employee names out of the values
        String empNameList = "";
        String deptName = "";
        for (Text v : values) {
            String data = v.toString();
            int deptFlag = data.indexOf("*");
            if (deptFlag != -1) {
                // A value containing "*" is the department name, e.g. *RESEARCH
                deptName = data.substring(1);
            } else {
                empNameList = data + ";" + empNameList;
            }
        }
        key4.set(deptName);
        value4.set(empNameList);

        context.write(key4, value4);
    }
}
```
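To make the reduce step concrete (again my illustration, assuming the classic sample data in which department 30 is named SALES): for key 30 the reducer receives a mix of values such as `ALLEN` and `*SALES`; the value carrying the `*` becomes the output key, the remaining names are concatenated with `;`, and the emitted record looks roughly like this (key and value are tab-separated by the default TextOutputFormat):

```
SALES   ALLEN;...
```

A department with no employees still produces a line, with an empty name list after the tab.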
5. The Job (driver) class: EqualJoinJob.java

```java
package com.mr.jointable;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.util.Random;

public class EqualJoinJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(EqualJoinJob.class);

        // Mapper
        job.setMapperClass(EqualJoinMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);   // key2
        job.setMapOutputValueClass(Text.class);        // value2

        // Reducer
        job.setReducerClass(EqualJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Test against local files first
        FileInputFormat.setInputPaths(job, new Path("D:\\equaljoin\\"));
        FileOutputFormat.setOutputPath(job, new Path(getOutputDir()));

        boolean result = job.waitForCompletion(true);

        System.out.println("result:" + result);
    }

    // Generates a fresh output directory for every run, since MapReduce refuses to
    // write into an output directory that already exists
    public static String getOutputDir() {
        String prefix = "D:\\output_equaljoin\\";
        long time = System.currentTimeMillis();
        int random = new Random().nextInt();
        return prefix + "result_" + time + "_" + random;
    }
}
```
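The driver above hard-codes local Windows paths, which is convenient for a first local test. As a minimal sketch (not part of the original post), the same driver can take its paths from the command line so the packaged jar can also be submitted with `hadoop jar`; the class name EqualJoinJobArgs is made up for illustration:

```java
package com.mr.jointable;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class EqualJoinJobArgs {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(EqualJoinJobArgs.class);

        job.setMapperClass(EqualJoinMapper.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(Text.class);

        job.setReducerClass(EqualJoinReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // args[0] = input directory holding emp.csv and dept.csv
        // args[1] = output directory (must not exist yet)
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

It would then be run with something like `hadoop jar equaljoin.jar com.mr.jointable.EqualJoinJobArgs /input/equaljoin /output/equaljoin_result` (jar name and paths are illustrative).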
6. Results

The original post shows the console output and the resulting file as screenshots. With the reducer above, each line of the result file has the form `deptName<TAB>employee1;employee2;...`.

MapReduce joins are usually implemented in one of two ways: reduce-side joins and map-side joins.

1. Reduce-side join

The reduce-side join is the most common approach. The basic idea is to map both tables to (key, value) pairs, where the key is the join field and the value carries that field plus whatever other fields should appear in the output. Records from both tables flow through the map function, which tags each record with the table it came from and emits the join field as the key. In the reduce function, the values that share a key are combined into the final joined records.

An example reduce-side join:

Map function:

```java
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] fields = line.split(",");
    String joinKey = fields[0];
    String table = fields[1];   // table name
    String data = fields[2];    // payload
    Text outputKey = new Text(joinKey);
    Text outputValue = new Text(table + ":" + data);
    context.write(outputKey, outputValue);
}
```

Reduce function:

```java
public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    List<String> table1Data = new ArrayList<String>();
    List<String> table2Data = new ArrayList<String>();
    for (Text value : values) {
        String[] fields = value.toString().split(":");
        if (fields[0].equals("table1")) {
            table1Data.add(fields[1]);
        } else if (fields[0].equals("table2")) {
            table2Data.add(fields[1]);
        }
    }
    for (String data1 : table1Data) {
        for (String data2 : table2Data) {
            context.write(key, new Text(data1 + "," + data2));
        }
    }
}
```

2. Map-side join

The map-side join is a more efficient variant. The idea is to cache one table in memory and join the other table against that cache inside the map function. Note that a map-side join is only suitable for joining a small table with a large table, because the small table has to fit entirely in memory.

An example map-side join:

Map function:

```java
private Map<String, String> table1Data = new HashMap<String, String>();

public void setup(Context context) throws IOException, InterruptedException {
    // Read the small table and cache it in memory
    BufferedReader br = new BufferedReader(new FileReader("table1.csv"));
    String line;
    while ((line = br.readLine()) != null) {
        String[] fields = line.split(",");
        String joinKey = fields[0];
        String data = fields[1] + "," + fields[2];
        table1Data.put(joinKey, data);
    }
    br.close();
}

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    String line = value.toString();
    String[] fields = line.split(",");
    String joinKey = fields[0];
    String data = fields[1] + "," + fields[2];
    if (table1Data.containsKey(joinKey)) {
        String cached = table1Data.get(joinKey);
        context.write(new Text(joinKey), new Text(cached + "," + data));
    }
}
```

Note that the map-side join must load the small table into memory before any input records are processed, which is why the loading happens in setup(), which runs once before map(). A HashMap (or a similar structure) is typically used to cache the small table so that lookups are fast.
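The setup() above reads table1.csv from the local filesystem, which only works when that file is present on every node that runs a map task. In practice the small table is usually shipped with the job via the distributed cache. Below is a minimal sketch of that approach, written against this article's emp/dept layout; the class name MapSideJoinSketch, the argument order, and the field positions are my assumptions, not part of the original post:

```java
package com.mr.jointable;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class MapSideJoinSketch {

    public static class CacheJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> deptById = new HashMap<>();

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            // Files registered with job.addCacheFile() are visible here
            URI[] cacheFiles = context.getCacheFiles();
            if (cacheFiles != null && cacheFiles.length > 0) {
                Path deptFile = new Path(cacheFiles[0]);
                FileSystem fs = FileSystem.get(cacheFiles[0], context.getConfiguration());
                try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(deptFile)))) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        String[] fields = line.split(",");
                        deptById.put(fields[0], fields[1]); // deptNo -> deptName
                    }
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] fields = value.toString().split(",");
            if (fields.length >= 8) { // emp record: join it against the cached dept table
                String deptName = deptById.get(fields[7]);
                if (deptName != null) {
                    context.write(new Text(deptName), new Text(fields[1]));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        job.setJarByClass(MapSideJoinSketch.class);
        job.setMapperClass(CacheJoinMapper.class);
        job.setNumReduceTasks(0); // the join happens entirely on the map side
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));   // directory with emp.csv
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
        job.addCacheFile(new URI(args[2]));                      // path to dept.csv
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Because there is no shuffle or reduce phase, this variant avoids moving the employee records across the network, which is where the efficiency claim above comes from.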
