Map-Side Join in Hadoop

tianshidcbw

Article link: https://blog.csdn.net/tianshidcbw/article/details/51697959

I have previously written about Hadoop's map-side join and reduce-side join. Today we look at another kind of join, one that sits between the two.

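SemiJoin, usually called a semi-join, works by filtering out on the map side the records that do not need to take part in the join, which greatly reduces the shuffle time of the reduce phase. With a pure reduce-side join, a dataset may contain a large amount of invalid data that the join does not need. Because nothing preprocesses that data, it is not identified as invalid until the reduce function actually runs. By then the shuffle, merge, and sort have already been executed, so the invalid data has wasted a large amount of network I/O and disk I/O. Overall this hurts performance, and the more invalid data there is, the more pronounced the effect.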

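The semi-join is in fact a variant of the reduce-side join. The only difference is that invalid data is filtered out on the map side, which shortens the shuffle of the reduce phase and therefore yields a performance gain.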

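Concretely, it also uses DistributedCache to distribute the small table to every node. In the setup function of the map phase, we read the cached file and store only the small table's join keys in a HashSet. When the map function runs, it checks every record: if the record's join key is empty or is not present in the HashSet, the record is considered invalid. Such a record is never partitioned and written to disk, and never takes part in the shuffle and sort of the reduce phase, which improves join performance to some extent. Note that if the small table's key set is still very large, the program may run out of memory (OOM), in which case we need to consider other join approaches.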
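
The filtering rule described above can be sketched in plain Java, without any Hadoop classes. This is only an illustration under stated assumptions: the class name SemiJoinFilterSketch, its method names, and the records in main are hypothetical and invented for this sketch. It assumes comma-separated records with the join key in the first column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java sketch of the semi-join filter; no Hadoop classes are involved.
class SemiJoinFilterSketch {

    // Like setup(): keep only the join key (first column) of every small-table line.
    static Set<String> loadJoinKeys(List<String> smallTableLines) {
        Set<String> keys = new HashSet<String>();
        for (String line : smallTableLines) {
            // limit -1 keeps trailing empty fields, so index 0 always exists
            keys.add(line.split(",", -1)[0]);
        }
        return keys;
    }

    // Like the check in map(): keep a record only if its key is non-empty and known.
    static boolean shouldEmit(String record, Set<String> joinKeys) {
        String key = record.split(",", -1)[0];
        return !key.isEmpty() && joinKeys.contains(key);
    }

    // Applies the check to a batch of records, preserving their order.
    static List<String> filter(List<String> records, Set<String> joinKeys) {
        List<String> kept = new ArrayList<String>();
        for (String record : records) {
            if (shouldEmit(record, joinKeys)) {
                kept.add(record);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical records, invented for this sketch.
        Set<String> keys = loadJoinKeys(Arrays.asList("k1,left-a", "k2,left-b"));
        List<String> records = Arrays.asList("k1,x", ",y", "k9,z", "k2,w");
        System.out.println(filter(records, keys));
    }
}
```

Only the keys are held in memory, not the full small-table rows. That is why the memory footprint is smaller than in a map-side join, and also why a very large key set can still cause the OOM mentioned above.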

The test data is as follows:
Simulated small-table data:
1,三劫散仙,13575468248
2,凤舞九天,18965235874
3,忙忙碌碌,15986854789
4,少林寺方丈,15698745862

Simulated large-table data:
3,A,99,2013-03-05
1,B,89,2013-02-05
2,C,69,2013-03-09
3,D,56,2013-06-07
5,E,100,2013-09-09
6,H,200,2014-01-10
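
To make the effect on this sample data concrete, here is a minimal plain-Java sketch that runs the two steps in memory: first the map-side filter, then the pairing that the reduce side would perform. The class SemiJoinSampleSketch, its method names, and the tab-separated output layout are assumptions made for illustration and are not taken from the job itself. The sketch also keeps input order, whereas a real reduce phase would group and sort by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// In-memory walk-through of a semi-join on the sample data (no Hadoop involved).
class SemiJoinSampleSketch {

    static final List<String> SMALL = Arrays.asList(
            "1,三劫散仙,13575468248",
            "2,凤舞九天,18965235874",
            "3,忙忙碌碌,15986854789",
            "4,少林寺方丈,15698745862");

    static final List<String> LARGE = Arrays.asList(
            "3,A,99,2013-03-05",
            "1,B,89,2013-02-05",
            "2,C,69,2013-03-09",
            "3,D,56,2013-06-07",
            "5,E,100,2013-09-09",
            "6,H,200,2014-01-10");

    // Step 1 (map side): drop large-table rows whose key is absent from the small table.
    static List<String> mapSideFilter(List<String> small, List<String> large) {
        Set<String> keys = new HashSet<>();
        for (String row : small) {
            keys.add(row.split(",", -1)[0]);
        }
        List<String> kept = new ArrayList<>();
        for (String row : large) {
            String key = row.split(",", -1)[0];
            if (!key.isEmpty() && keys.contains(key)) {
                kept.add(row);
            }
        }
        return kept;
    }

    // Step 2 (reduce side): pair each surviving large row with the small row of the same key.
    static List<String> join(List<String> small, List<String> keptLarge) {
        Map<String, String> smallByKey = new HashMap<>();
        for (String row : small) {
            String[] parts = row.split(",", 2); // key, then the rest of the row
            smallByKey.put(parts[0], parts[1]);
        }
        List<String> joined = new ArrayList<>();
        for (String row : keptLarge) {
            String[] parts = row.split(",", 2);
            // Assumed output layout: key, small-table part, large-table part
            joined.add(parts[0] + "\t" + smallByKey.get(parts[0]) + "\t" + parts[1]);
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String> kept = mapSideFilter(SMALL, LARGE);
        for (String line : join(SMALL, kept)) {
            System.out.println(line);
        }
    }
}
```

The rows with keys 5 and 6 never reach the join step, because those keys do not exist in the small table. Key 4 produces no output either, since no large-table row carries it.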

The code is as follows:

 package com.semijoin;  
   
 import java.io.BufferedReader;  
 import java.io.DataInput;  
 import java.io.DataOutput;  
 import java.io.FileReader;  
 import java.io.IOException;  
 import java.net.URI;  
 import java.util.ArrayList;  
 import java.util.HashSet;  
 import java.util.List;  
   
 import org.apache.hadoop.conf.Configuration;  
 import org.apache.hadoop.filecache.DistributedCache;  
 import org.apache.hadoop.fs.FileSystem;  
 import org.apache.hadoop.fs.Path;  
 import org.apache.hadoop.io.LongWritable;  
 import org.apache.hadoop.io.Text;  
 import org.apache.hadoop.io.WritableComparable;  
   
 import org.apache.hadoop.mapred.JobConf;  
 import org.apache.hadoop.mapreduce.Job;  
 import org.apache.hadoop.mapreduce.Mapper;  
 import org.apache.hadoop.mapreduce.Reducer;  
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;  
 import org.apache.hadoop.mapreduce.lib.input.FileSplit;  
 import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;  
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;  
 import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;  
   
 /**
  * Hadoop 1.2 version.
  *
  * Semi-join (SemiJoin) implementation for Hadoop.
  *
  * @author qindongliang
  *
  * Big data discussion group: 376932160
  * Search technology discussion group: 324714439
  */
 public class Semjoin {  
       
       
       
     /**
      * A custom output entity.
      */
     private static class CombineEntity implements WritableComparable<CombineEntity>{

         private Text joinKey;// join key
         private Text flag;// marks which file the record came from
         private Text secondPart;// the rest of the record, excluding the key

         public CombineEntity() {
             this.joinKey=new Text();
             this.flag=new Text();
             this.secondPart=new Text();
         }
           
         public Text getJoinKey() {  
             return joinKey;  
         }  
   
         public void setJoinKey(Text joinKey) {  
             this.joinKey = joinKey;  
         }  
   
         public Text getFlag() {  
             return flag;  
         }  
   
         public void setFlag(Text flag) {  
             this.flag = flag;  
         }  
   
         public Text getSecondPart() {  
             return secondPart;  
         }  
   
         public void setSecondPart(Text secondPart) {  
             this.secondPart = secondPart;  
         }  
   
         @Override  
         public void readFields(DataInput in) throws IOException {  
             this.joinKey.readFields(in);  
             this.flag.readFields(in);  
             this.secondPart.readFields(in);  
               
         }  
   
         @Override  
         public void write(DataOutput out) throws IOException {  
             this.joinKey.write(out);  
             this.flag.write(out);  
             this.secondPart.write(out);  
               
         }  
   
         @Override
         public int compareTo(CombineEntity o) {
             // Order entities by join key only
             return this.joinKey.compareTo(o.joinKey);
         }
           
           
           
     }  
       
       
       
       
     private static class JMapper extends Mapper<LongWritable, Text, Text, CombineEntity>{  
           
         private CombineEntity combine=new CombineEntity();  
         private Text flag=new Text();  
         private  Text joinKey=new Text();  
         private Text secondPart=new Text();  
         /**
          * Holds the join keys of the small table.
          */
         private HashSet<String> joinKeySet=new HashSet<String>();
           
           
         @Override
         protected void setup(Context context)throws IOException, InterruptedException {

             // Get the shared files from the DistributedCache
             Path path[]=DistributedCache.getLocalCacheFiles(context.getConfiguration());
             if(path==null){
                 return;// no cached files were registered
             }

             for(Path p:path){

                 if(p.getName().endsWith("a.txt")){
                     // Read the small table line by line
                     BufferedReader br=new BufferedReader(new FileReader(p.toString()));
                     try{
                         String temp;
                         while((temp=br.readLine())!=null){
                             String ss[]=temp.split(",");
                             joinKeySet.add(ss[0]);// add the small table's key
                         }
                     }finally{
                         br.close();// release the file handle
                     }
                 }
             }

         }
           
           
           
         @Override  
         protected void map(LongWritable key, Text value,Context context)  
                 throws IOException, InterruptedException {  
               
           
             // Get the path of the input file this record came from
             String pathName = ((FileSplit) context.getInputSplit()).getPath().toString();  
           
             if(pathName.endsWith("a.txt")){  
                   
                 String  valueItems[]=value.toString().split(",");  
                   
                   
                 /**
                  * Filter on the required join key here.
                  */
