Join strategies
(1)Reduce-side join
(2)Map-side join
(3)Semi join
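(1)How the reduce-side join works
Main work on the map side: tagging. Each key/value pair coming from a different table (file) is tagged so that records from different sources can be told apart. The join field becomes the output key, and the remaining fields plus the new tag become the output value.
Main work on the reduce side: by the time data arrives, grouping by the join field is already done. Within each group we only need to separate the records by source file (using the tag added in the map phase) and then take the Cartesian product of the two sets.
Use case: both tables are large.
Drawback: the shuffle phase between map and reduce has to move a large amount of data, so this approach is inefficient.
The reason the reduce join exists is easy to see: the data set as a whole is split up, and each map task processes only part of it, so it cannot see every record it would need for the join. We therefore use the join key as the grouping key on the reduce side, which brings all records with the same join key together in one place where they can be processed.
In short: map tags the records, and reduce takes the Cartesian product according to the tags.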
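The tag, group, and Cartesian-product steps described above can be sketched in plain Java without any Hadoop classes. This is only an illustration under stated assumptions: the class name `ReduceSideJoinSketch`, the tags `log` and `info`, and the sample rows are all made up for this sketch, and the "shuffle" is simulated with an in-memory map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a reduce-side join; no Hadoop classes are used.
final class ReduceSideJoinSketch {

    // A map-output record: join key, remaining fields, and a source tag.
    static final class Tagged {
        final String key;
        final String value;
        final String tag;

        Tagged(String key, String value, String tag) {
            this.key = key;
            this.value = value;
            this.tag = tag;
        }
    }

    // "Map" step: the first field is the join key; the rest is the value plus a tag.
    static List<Tagged> mapPhase(List<String> lines, String tag) {
        List<Tagged> out = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split("\\s+", 2);
            out.add(new Tagged(fields[0], fields.length > 1 ? fields[1] : "", tag));
        }
        return out;
    }

    // "Shuffle" step: bring all records with the same join key together.
    static Map<String, List<Tagged>> shuffle(List<Tagged> records) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (Tagged t : records) {
            groups.computeIfAbsent(t.key, k -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    // "Reduce" step: split one group by tag, then emit the Cartesian product.
    static List<String> reducePhase(String key, List<Tagged> group) {
        List<String> logs = new ArrayList<>();
        List<String> infos = new ArrayList<>();
        for (Tagged t : group) {
            if (t.tag.equals("log")) {
                logs.add(t.value);
            } else if (t.tag.equals("info")) {
                infos.add(t.value);
            }
        }
        List<String> out = new ArrayList<>();
        for (String log : logs) {
            for (String info : infos) {
                out.add(key + "\t" + log + "\t" + info);
            }
        }
        return out;
    }

    // Runs the three steps end to end on two in-memory "tables".
    static List<String> join(List<String> logLines, List<String> infoLines) {
        List<Tagged> all = new ArrayList<>(mapPhase(logLines, "log"));
        all.addAll(mapPhase(infoLines, "info"));
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle(all).entrySet()) {
            result.addAll(reducePhase(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the real data files.
        List<String> logs = Arrays.asList("alice login 10.0.0.1", "alice logout 10.0.0.1", "carol login 10.0.0.3");
        List<String> infos = Arrays.asList("alice F hebei", "bob M henan");
        for (String row : join(logs, infos)) {
            System.out.println(row);
        }
    }
}
```

Keys that appear in only one table (here `bob` and `carol`) still travel through the simulated shuffle but produce no output, which is exactly the wasted traffic the drawback above refers to.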
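(2)How the map-side join works
DistributedCache is an implementation of a distributed cache. It plays a very important role in the MapReduce framework and makes it possible to write fairly complex yet efficient distributed programs. Coming back to the join: before the job starts, the JobTracker obtains the list of resource URIs registered in the DistributedCache and distributes the corresponding files to every TaskTracker that runs tasks of this job (these are the Hadoop 1 daemon names; under YARN the files are localized by the NodeManager instead). There is more to the relationship between DistributedCache and a job, for example permissions, storage path separation, and the public/private attributes.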
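There is also a more extreme form of map join that combines it with HBase. Because the small table no longer has to be held in memory, the memory limit stops being a constraint and you can use map join freely, and the performance is quite good as well.
Usage: when submitting the job, first put the small table's file into the job's DistributedCache. Each task then reads the small table from the DistributedCache, parses every record into join key / value, and keeps the result in memory (for example in a HashMap). It then scans the large table and checks, for each record, whether its join key matches a key held in memory; if it does, the joined result is written out directly.
Use case: one table is very small and the other is very large.
In short: the mapper's setup method loads the small table from the cache, and map compares each record against it and keeps only the records that match.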
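The load-then-probe pattern described above is an ordinary in-memory hash join. The following plain-Java sketch shows it without Hadoop; the class name `MapSideJoinSketch`, the method names, and the sample rows are assumptions made for illustration, and the small table is assumed to have at most one row per join key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a map-side (hash) join; no Hadoop classes are used.
final class MapSideJoinSketch {

    // Plays the role of setup(): index the small table by its join key (first field).
    // Assumption: keys are unique in the small table; a repeated key overwrites the earlier row.
    static Map<String, String> loadSmallTable(List<String> smallTableLines) {
        Map<String, String> table = new HashMap<>();
        for (String line : smallTableLines) {
            String[] fields = line.split("\\s+", 2);
            if (fields.length == 2) {
                table.put(fields[0], fields[1]);
            }
        }
        return table;
    }

    // Plays the role of map(): probe the in-memory table once per large-table record.
    static List<String> scanLargeTable(List<String> largeTableLines, Map<String, String> smallTable) {
        List<String> joined = new ArrayList<>();
        for (String line : largeTableLines) {
            String[] fields = line.split("\\s+", 2);
            String match = smallTable.get(fields[0]);
            if (match != null) {
                String rest = fields.length > 1 ? fields[1] : "";
                joined.add(fields[0] + "\t" + rest + "\t" + match);
            }
            // Records without a partner in the small table are simply dropped.
        }
        return joined;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the real data files.
        Map<String, String> small = loadSmallTable(Arrays.asList("alice F hebei", "bob M henan"));
        List<String> large = Arrays.asList("alice login 10.0.0.1", "carol login 10.0.0.3", "bob view 10.0.0.2");
        for (String row : scanLargeTable(large, small)) {
            System.out.println(row);
        }
    }
}
```

No grouping step is needed here: each large-table record is joined the moment it is read, which is why this strategy needs no shuffle for the join itself.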
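(3)How the semi join works
The idea is to filter data on the map side so that only records that take part in the join travel over the network; records that do not take part never need to be transferred. This reduces the shuffle traffic and improves overall efficiency. Everything else is exactly the same as in the reduce join.
Concretely, the join keys of the small table are extracted on their own and distributed to the relevant nodes through DistributedCache, where they are loaded into memory (for example into a HashSet). During the map phase the tables being joined are scanned, records whose join key is not in the in-memory HashSet are filtered out, and only the records that take part in the join go through the shuffle to the reduce side, where the join is performed as in the reduce join.
In practice the process has two steps:
Step 1: extract the join field of the relatively smaller table on its own and store it in a data file.
Step 2: add the file produced in step 1 to the cache. While parsing the data, drop the records of the other table whose key does not match any key in the cached file, then tag the key/value pairs of both tables. The reducer can then group the data by tag and take the Cartesian product, which completes the join.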
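The two steps above can be sketched in plain Java as a key-extraction pass followed by a filter. The class name `SemiJoinSketch`, its method names, and the sample rows are assumptions made for illustration; the tagging and Cartesian product that follow the filter are the same as in the reduce join and are left out here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java simulation of the two semi-join steps; no Hadoop classes are used.
final class SemiJoinSketch {

    // The join key is assumed to be the first whitespace-separated field.
    static String joinKey(String line) {
        return line.split("\\s+", 2)[0];
    }

    // Step 1: extract the distinct join keys of the smaller table.
    static Set<String> extractKeys(List<String> smallerTableLines) {
        Set<String> keys = new HashSet<>();
        for (String line : smallerTableLines) {
            keys.add(joinKey(line));
        }
        return keys;
    }

    // Step 2 (map side): keep only the records whose join key is in the key set,
    // so that only these records would go through the shuffle.
    static List<String> filterByKeys(List<String> lines, Set<String> keys) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (keys.contains(joinKey(line))) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the real data files.
        Set<String> keys = extractKeys(Arrays.asList("alice F hebei", "bob M henan"));
        List<String> logs = Arrays.asList("alice login 10.0.0.1", "carol login 10.0.0.3", "bob view 10.0.0.2");
        System.out.println(filterByKeys(logs, keys));
    }
}
```

The key set is usually much smaller than the table it came from, which is what makes it cheap to distribute even when the full table would not fit in memory.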
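Use-case comparison:
Reduce-side join: joining two large tables
Map-side join: joining one large table with one small table
Semi join: an optimization of the reduce-side join
Efficiency analysis:
Reduce-side join: lowest efficiency
Map-side join: highest efficiency
Semi join: more efficient than the plain reduce-side join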
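To make the ranking above a little more concrete, here is a toy count of how many records would cross the shuffle for the join under each strategy. It is a rough sketch, not a benchmark: the class name `ShuffleVolumeSketch` and the sample rows are made up, record counts stand in for bytes, and the cost of distributing the cache file and of running the extra key-extraction job is ignored.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy comparison of shuffle volume (in records) for the three join strategies.
final class ShuffleVolumeSketch {

    // The join key is assumed to be the first whitespace-separated field.
    static String joinKey(String line) {
        return line.split("\\s+", 2)[0];
    }

    // Reduce-side join: every record of both tables is sent to the reducers.
    static int reduceJoinShuffled(List<String> large, List<String> small) {
        return large.size() + small.size();
    }

    // Semi join: the small table plus only those large-table records whose key occurs in it.
    static int semiJoinShuffled(List<String> large, List<String> small) {
        Set<String> keys = new HashSet<>();
        for (String line : small) {
            keys.add(joinKey(line));
        }
        int shuffled = small.size();
        for (String line : large) {
            if (keys.contains(joinKey(line))) {
                shuffled++;
            }
        }
        return shuffled;
    }

    // Map-side join: the join happens inside the mapper, so nothing is shuffled for it.
    static int mapJoinShuffled(List<String> large, List<String> small) {
        return 0;
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the real data files.
        List<String> large = Arrays.asList(
                "alice login 10.0.0.1", "carol login 10.0.0.3", "bob view 10.0.0.2", "dave login 10.0.0.4");
        List<String> small = Arrays.asList("alice F hebei", "bob M henan");
        System.out.println("reduce join: " + reduceJoinShuffled(large, small));
        System.out.println("semi join:   " + semiJoinShuffled(large, small));
        System.out.println("map join:    " + mapJoinShuffled(large, small));
    }
}
```

The semi join only pays off when a good share of the large table has no matching key; if nearly every record matches, it shuffles about as much as the plain reduce join and still has to run the extra step.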
Code walkthrough:
(1)ReduceJoin
package com.zhiyou.bd17.mr1017;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Joins user-logs-large.txt with user_info.txt on the user name (reduce-side join)
public class ReduceJoin {
public static class ValueWithFlag implements Writable{
private String value;
private String flag;
public String getValue() {
return value;
}
public void setValue(String value) {
this.value = value;
}
public String getFlag() {
return flag;
}
public void setFlag(String flag) {
this.flag = flag;
}
// Serialization: the fields are written in a fixed order
public void write(DataOutput out) throws IOException {
out.writeUTF(value);
out.writeUTF(flag);
}
// Deserialization: the fields are read back in the same order they were written
public void readFields(DataInput in) throws IOException {
this.value = in.readUTF();
this.flag = in.readUTF();
}
}
// Reads both files, tags every key/value pair with its source, and emits it to the reducer; the key must be the join field
public static class ReduceJoinMap extends Mapper<LongWritable, Text, Text, ValueWithFlag> {
private FileSplit inputSplit;
private String fileName;
private String[] infos;
private Text outKey = new Text();
private ValueWithFlag outValue = new ValueWithFlag();
@Override
protected void setup(Mapper<LongWritable, Text, Text, ValueWithFlag>.Context context)
throws IOException, InterruptedException {
inputSplit = (FileSplit)context.getInputSplit();
if (inputSplit.getPath().toString().contains("user-logs-large.txt")) {
fileName = "userLogsLarge";
}else if (inputSplit.getPath().toString().contains("user_info.txt")) {
fileName = "userInfo";
}
}
@Override
protected void map(LongWritable key, Text value,
Mapper<LongWritable, Text, Text, ValueWithFlag>.Context context)
throws IOException, InterruptedException {
outValue.setFlag(fileName);
infos = value.toString().split("\\s");
if (fileName.equals("userLogsLarge")) {
// Parse a line of user-logs-large.txt (user name, action type, IP address)
outKey.set(infos[0]);
outValue.setValue(infos[1]+"\t"+infos[2]);
}else if (fileName.equalsIgnoreCase("userInfo")) {
// Parse a line of user_info.txt (user name, gender, province)
outKey.set(infos[0]);
outValue.setValue(infos[1]+"\t"+infos[2]);
}
context.write(outKey, outValue);
}
}
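// Receives the key/value pairs from map and splits the values of one key into two groups by their flag.
// The two groups hold the records from the two tables; their Cartesian product is the join result.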
public static class ReduceJoinReduce extends Reducer<Text, ValueWithFlag, Text, Text> {
private List<String> userLogsLargeList;
private List<String> userInfoList;
private Text outValue = new Text();
@Override
protected void reduce(Text key, Iterable<ValueWithFlag> values,
Reducer<Text, ValueWithFlag, Text, Text>.Context context) throws IOException, InterruptedException {
userLogsLargeList = new ArrayList<String>();
userInfoList = new ArrayList<String>();
for (ValueWithFlag value : values) {
if (value.getFlag().equals("userLogsLarge")) {
userLogsLargeList.add(value.getValue());
}else if (value.getFlag().equals("userInfo")) {
userInfoList.add(value.getValue());
}
}
// Cartesian product of the two groups
for (String userLogsLarge : userLogsLargeList) {
for (String userInfo : userInfoList) {
outValue.set(userLogsLarge+"\t"+userInfo);
context.write(key, outValue);
}
}
}
}
// Driver. It is declared in ReduceJoin itself rather than inside the reducer class,
// so that ReduceJoin can be used as the main class when the job is launched.
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);
job.setJarByClass(ReduceJoin.class);
job.setJobName("Reduce join");
job.setMapperClass(ReduceJoinMap.class);
job.setReducerClass(ReduceJoinReduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(ValueWithFlag.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// Both tables are job inputs; the mapper tells them apart by file name
FileInputFormat.addInputPath(job, new Path("/bd17/user_info.txt"));
FileInputFormat.addInputPath(job, new Path("/bd17/user-logs-large.txt"));
// Delete any previous output so the job can be rerun
Path outputDir = new Path("/bd17/output/reducejoin");
outputDir.getFileSystem(configuration).delete(outputDir,true);
FileOutputFormat.setOutputPath(job, outputDir);
System.exit(job.waitForCompletion(true)?0:1);
}
}
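The ValueWithFlag class in the listing above depends on write and readFields mirroring each other: the fields must be read back in exactly the order they were written. The sketch below checks that round trip with plain java.io streams. `TaggedValueRoundTrip` and `TaggedValue` are stand-in names invented for this sketch, and `TaggedValue` deliberately does not implement Hadoop's Writable so that the example runs without Hadoop on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Round-trip check for a two-field value object serialized with writeUTF/readUTF.
final class TaggedValueRoundTrip {

    // Stand-in for ValueWithFlag: the same two fields and the same field order.
    static final class TaggedValue {
        String value = "";
        String flag = "";

        void write(DataOutput out) throws IOException {
            out.writeUTF(value);
            out.writeUTF(flag);
        }

        void readFields(DataInput in) throws IOException {
            value = in.readUTF();
            flag = in.readUTF();
        }
    }

    // Serializes the object to a byte array and reads it back into a fresh instance.
    static TaggedValue roundTrip(TaggedValue original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            original.write(out);
            out.flush();
            TaggedValue copy = new TaggedValue();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TaggedValue v = new TaggedValue();
        v.value = "login\t10.0.0.1"; // hypothetical sample value
        v.flag = "userLogsLarge";
        TaggedValue copy = roundTrip(v);
        System.out.println(copy.flag + " -> " + copy.value);
    }
}
```

Two properties of writeUTF are worth keeping in mind with this design: it throws a NullPointerException when given null, so both fields need to be set before the object is written, and it rejects strings whose encoded form is longer than 65535 bytes.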
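The user_info.txt data (user name, gender, province) is shown in the figure below:
The user-logs-large.txt data (user name, action type, IP address) is as follows:
The result of running the code is as follows: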
(2)MapJoin
package com.zhiyou.bd17.mr1017;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.HashMap;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
// Counts how many times the users of each province accessed the system
public class MapJoin {
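// The mapper reads the distributed cache file and loads it into a HashMap: the join field is the key, the field needed for the computation is the value.
// The map method processes the large table; for each record it takes the join field
// and looks it up in the HashMap: if it is present the record can be joined, otherwise it cannot.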
public static class MapJoinMap extends Mapper<LongWritable, Text, Text, IntWritable> {
private HashMap<String, String> userInfos = new HashMap<String, String>();
private String[] infos;
private Text outKeys = new Text();
private IntWritable ONE = new IntWritable(1);
@Override
protected void setup(Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
// Get the URIs of the distributed cache files
URI[] cacheFiles = context.getCacheFiles();
FileSystem fileSystem = FileSystem.get(context.getConfiguration());
for (URI uri : cacheFiles) {
if (uri.toString().contains("user_info.txt")) {
// try-with-resources closes the reader and the underlying stream once the small table is loaded
try (FSDataInputStream inputStream = fileSystem.open(new Path(uri));
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"))) {
String line = bufferedReader.readLine();
while (line != null) {
infos = line.split("\\s");
// user name -> province
userInfos.put(infos[0], infos[2]);
line = bufferedReader.readLine();
}
}
}
}
}
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
throws IOException, InterruptedException {
infos = value.toString().split("\\s");
if (userInfos.containsKey(infos[0])) {
outKeys.set(userInfos.get(infos[0]));
context.write(outKeys, ONE);
}
}
}
public static class MapJoinReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
private int sum;
private IntWritable outValue = new IntWritable();
@Override
protected void reduce(Text key, Iterable<IntWritable> values,
Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
sum = 0;
for (IntWritable value : values) {
sum += value.get();
}
outValue.set(sum);
context.write(key, outValue);
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);
job.setJarByClass(MapJoin.class);
job.setJobName("Map join");
job.setMapperClass(MapJoinMap.class);
job.setReducerClass(MapJoinReduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
// Register the small table as a distributed cache file
Path cacheFilePath = new Path("/bd17/user_info.txt");
job.addCacheFile(cacheFilePath.toUri());
// The large table is the only regular job input
FileInputFormat.addInputPath(job, new Path("/bd17/user-logs-large.txt"));
Path outputDir = new Path("/bd17/output/mapjoin");
outputDir.getFileSystem(configuration).delete(outputDir,true);
FileOutputFormat.setOutputPath(job, outputDir);
System.exit(job.waitForCompletion(true)?0:1);
}
}
The data is the same as above.
The result is as follows:
(3)SemiJoin1: step one of the semi join
package com.zhiyou.bd17.mr1017;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SemiJoin1 {
public static class SemiJoinMap extends Mapper<LongWritable, Text, Text, NullWritable>{
private Text outKey = new Text();
private String[] infos;
private NullWritable outValue = NullWritable.get();
@Override
protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, NullWritable>.Context context)
throws IOException, InterruptedException {
infos = value.toString().split("\\s");
outKey.set(infos[0]);
context.write(outKey, outValue);
}
}
public static class SemiJoinReduce extends Reducer<Text, NullWritable, Text, NullWritable> {
private NullWritable outValue = NullWritable.get();
@Override
protected void reduce(Text key, Iterable<NullWritable> values,
Reducer<Text, NullWritable, Text, NullWritable>.Context context) throws IOException, InterruptedException {
context.write(key, outValue);
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);
job.setJarByClass(SemiJoin1.class);
job.setJobName("Semi join step 1");
job.setMapperClass(SemiJoinMap.class);
job.setReducerClass(SemiJoinReduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
FileInputFormat.addInputPath(job, new Path("/bd17/user_info.txt"));
Path outputDir = new Path("/bd17/semiJion");
outputDir.getFileSystem(configuration).delete(outputDir,true);
FileOutputFormat.setOutputPath(job, outputDir);
System.exit(job.waitForCompletion(true)?0:1);
}
}
Execution result:
(4)SemiJoin2: step two of the semi join
package com.zhiyou.bd17.mr1017;
import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class SemiJion2 {
public static class ValueWithFlag implements Writable{
private String value;
private String flag;
public String getValue() {
return value;
}
public void setValue(String value) {
this.value = value;
}
public String getFlag() {
return flag;
}
public void setFlag(String flag) {
this.flag = flag;
}
// Serialization: the fields are written in a fixed order
public void write(DataOutput out) throws IOException {
out.writeUTF(value);
out.writeUTF(flag);
}
// Deserialization: the fields are read back in the same order they were written
public void readFields(DataInput in) throws IOException {
this.value = in.readUTF();
this.flag = in.readUTF();
}
}
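// Reads both files, drops records whose key is not in the cached key set, tags the rest with their source, and emits them to the reducer; the key must be the join field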
public static class ReduceJoinMap extends Mapper<LongWritable, Text, Text, ValueWithFlag> {
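// Join keys loaded from the cache file. A HashSet gives constant-time contains(); the names are fully qualified so the import list stays unchanged.
private java.util.Set<String> userInfos = new java.util.HashSet<String>();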
private FileSplit inputSplit;
private String fileName;
private String[] infos;
private Text outKey = new Text();
private ValueWithFlag outValue = new ValueWithFlag();
@Override
protected void setup(Mapper<LongWritable, Text, Text, ValueWithFlag>.Context context)
throws IOException, InterruptedException {
// Get the URIs of the distributed cache files
URI[] cacheFiles = context.getCacheFiles();
FileSystem fileSystem = FileSystem.get(context.getConfiguration());
for (URI uri : cacheFiles) {
if (uri.toString().contains("semiJion")) {
// Load the join keys produced by step one; try-with-resources closes the reader and the stream afterwards
try (FSDataInputStream inputStream = fileSystem.open(new Path(uri));
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(inputStream, "UTF-8"))) {
String line = null;
while ((line = bufferedReader.readLine()) != null) {
userInfos.add(line);
}
}
}
}
inputSplit = (FileSplit)context.getInputSplit();
if (inputSplit.getPath().toString().contains("user-logs-large.txt")) {
fileName = "userLogsLarge";
}else if (inputSplit.getPath().toString().contains("user_info.txt")) {
fileName = "userInfo";
}
}
@Override
protected void map(LongWritable key, Text value,
Mapper<LongWritable, Text, Text, ValueWithFlag>.Context context)
throws IOException, InterruptedException {
outValue.setFlag(fileName);
infos = value.toString().split("\\s");
if (userInfos.contains(infos[0])) {
if (fileName.equals("userLogsLarge")) {
// Parse a line of user-logs-large.txt (user name, action type, IP address)
outKey.set(infos[0]);
outValue.setValue(infos[1]+"\t"+infos[2]);
}else if (fileName.equalsIgnoreCase("userInfo")) {
// Parse a line of user_info.txt (user name, gender, province)
outKey.set(infos[0]);
outValue.setValue(infos[1]+"\t"+infos[2]);
}
context.write(outKey, outValue);
}
}
}
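// Receives the key/value pairs from map and splits the values of one key into two groups by their flag.
// The two groups hold the records from the two tables; their Cartesian product is the join result.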
public static class ReduceJoinReduce extends Reducer<Text, ValueWithFlag, Text, Text> {
private List<String> userLogsLargeList;
private List<String> userInfoList;
private Text outValue = new Text();
@Override
protected void reduce(Text key, Iterable<ValueWithFlag> values,
Reducer<Text, ValueWithFlag, Text, Text>.Context context) throws IOException, InterruptedException {
userLogsLargeList = new ArrayList<String>();
userInfoList = new ArrayList<String>();
for (ValueWithFlag value : values) {
if (value.getFlag().equals("userLogsLarge")) {
userLogsLargeList.add(value.getValue());
}else if (value.getFlag().equals("userInfo")) {
userInfoList.add(value.getValue());
}
}
// Cartesian product of the two groups
for (String userLogsLarge : userLogsLargeList) {
for (String userInfo : userInfoList) {
outValue.set(userLogsLarge+"\t"+userInfo);
context.write(key, outValue);
}
}
}
}
// Driver. It is declared in SemiJion2 itself rather than inside the reducer class,
// so that SemiJion2 can be used as the main class when the job is launched.
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
Configuration configuration = new Configuration();
Job job = Job.getInstance(configuration);
// Needed so that the cluster can locate the jar containing the mapper and reducer
job.setJarByClass(SemiJion2.class);
job.setJobName("Semi join step 2");
job.setMapperClass(ReduceJoinMap.class);
job.setReducerClass(ReduceJoinReduce.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(ValueWithFlag.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
// Register the key file produced by step one as a distributed cache file
Path cacheFilePath = new Path("/bd17/semiJion/part-r-00000");
job.addCacheFile(cacheFilePath.toUri());
FileInputFormat.addInputPath(job, new Path("/bd17/user_info.txt"));
FileInputFormat.addInputPath(job, new Path("/bd17/user-logs-large.txt"));
// Delete any previous output so the job can be rerun
Path outputDir = new Path("/bd17/output/semiJion2");
outputDir.getFileSystem(configuration).delete(outputDir,true);
FileOutputFormat.setOutputPath(job, outputDir);
System.exit(job.waitForCompletion(true)?0:1);
}
}
Execution result:
Comparison with the result of the reduce join