Implementing a Map-Side Join


M_wzz

Category: Code. Tags: mapreduce, big data

Article link: https://blog.csdn.net/M_wzz/article/details/109520667


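1. How it works
A map-side join applies when one of the tables being joined is small.
The small table can be distributed to every map node, so each map task joins the large-table records it reads against its own local copy and emits the final result directly. This greatly increases the parallelism of the join and speeds up processing.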
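
The idea can be tried out without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: it loads a small table into a hash map and then streams the large table past it, one lookup per record. The class name `LocalHashJoin`, the sample rows, and the choice to drop orders with no matching product (an inner join) are all assumptions made for illustration; the field layout mirrors the one the mapper in this article expects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, used only to illustrate the join idea locally.
final class LocalHashJoin {

    // Build the lookup table from the small table: product id -> remaining fields.
    static Map<String, String> loadSmallTable(List<String> productLines) {
        Map<String, String> table = new HashMap<>();
        for (String line : productLines) {
            String[] f = line.split(",");
            table.put(f[0], f[1] + "\t" + f[2] + "\t" + f[3]);
        }
        return table;
    }

    // Stream the large table and join each record with a single hash lookup.
    static List<String> join(List<String> orderLines, Map<String, String> products) {
        List<String> out = new ArrayList<>();
        for (String line : orderLines) {
            String[] f = line.split(","); // orderId, date, productId, amount
            String productInfo = products.get(f[2]);
            if (productInfo == null) {
                continue; // assumption: inner join, so unmatched orders are dropped
            }
            out.add(f[0] + "\t" + f[1] + "\t" + productInfo + "\t" + f[3]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows; the real files from the article are not reproduced here.
        List<String> products = Arrays.asList(
                "p0001,phone,c01,2999",
                "p0002,laptop,c02,5999");
        List<String> orders = Arrays.asList(
                "1001,20150710,p0001,2",
                "1002,20150710,p0002,3",
                "1003,20150711,p0009,1");
        for (String row : join(orders, loadSmallTable(products))) {
            System.out.println(row);
        }
    }
}
```

With these sample rows, the two orders whose product id appears in the small table are printed, and order 1003 is dropped. In the MapReduce version, building the table corresponds to the mapper's `setup` method and the per-record lookup corresponds to `map`.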
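2. Implementation example
First, load the small table into memory in the mapper class, then perform the join there.
The approach used in real-world scenarios: load the data once, from a database or by using …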

Input files: orders.txt and product.txt (the original post showed file contents as an image, which is not available).

Step 1: Define the map-side join mapper

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JoinMap extends Mapper<LongWritable, Text, Text, Text> {
    // In-memory copy of the small (product) table: product id -> remaining product fields
    private final Map<String, String> b_tab = new HashMap<String, String>();

    // Fetch the cache file in the mapper's setup method and load it into the map once
    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        // Returns every registered cache file; this job registers exactly one
        URI[] cacheFiles = context.getCacheFiles();
        // FileSystem.get returns a cached, shared instance, so only the reader is closed
        FileSystem fileSystem = FileSystem.get(cacheFiles[0], context.getConfiguration());
        try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(
                fileSystem.open(new Path(cacheFiles[0])), StandardCharsets.UTF_8))) {
            String line;
            while ((line = bufferedReader.readLine()) != null) {
                String[] split = line.split(",");
                b_tab.put(split[0], split[1] + "\t" + split[2] + "\t" + split[3]);
            }
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // The input here is the one split (on HDFS) that this map task is responsible for
        String[] fields = value.toString().split(",");
        String orderId = fields[0];
        String date = fields[1];
        String pdId = fields[2];
        String amount = fields[3];
        // Look up the product details in the in-memory map;
        // productInfo is null (written as "null") when the product id has no match
        String productInfo = b_tab.get(pdId);
        context.write(new Text(orderId), new Text(date + "\t" + productInfo + "\t" + amount));
    }
}
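
The mapper indexes `split[0]` through `split[3]` directly, so a blank or short line in the cached file would throw an `ArrayIndexOutOfBoundsException` and fail the task. One possible guard is sketched below in plain Java. The class name `ProductLineParser`, the four-field assumption, and the decision to skip malformed lines rather than fail are illustrative choices, not part of the original code.

```java
import java.util.Optional;

// Hypothetical helper: validates one line of the small table before it is cached.
final class ProductLineParser {
    // Assumption: each product line has at least four comma-separated fields.
    private static final int EXPECTED_FIELDS = 4;

    // Returns {productId, tab-joined remaining fields}, or empty if the line is malformed.
    static Optional<String[]> parse(String line) {
        if (line == null || line.trim().isEmpty()) {
            return Optional.empty();
        }
        String[] f = line.split(",", -1); // -1 keeps trailing empty fields
        if (f.length < EXPECTED_FIELDS || f[0].isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(new String[] { f[0], f[1] + "\t" + f[2] + "\t" + f[3] });
    }
}
```

Inside `setup`, the loop body could then become something like `ProductLineParser.parse(line).ifPresent(kv -> b_tab.put(kv[0], kv[1]));`, which silently skips bad lines. Whether skipping or failing fast is the right behavior depends on how much the input files can be trusted.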

Step 2: Define the driver class with the main method

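import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MapSideJoin extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = super.getConf();
        Job job = Job.getInstance(conf, "MapJoin");
        job.setJarByClass(MapSideJoin.class);
        // Note: the cache file must be placed on HDFS; a file on the local
        // filesystem cannot be loaded this way.
        // The small table (product.txt) goes into the cache, because JoinMap
        // looks up product details by product id.
        job.addCacheFile(new URI("hdfs://192.168.10.111:8020/product.txt"));
        // The large table (orders.txt) is the job input that the mappers stream over
        job.setInputFormatClass(TextInputFormat.class);
        TextInputFormat.addInputPath(job, new Path("file:///E:\\input\\orders.txt"));

        job.setMapperClass(JoinMap.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        TextOutputFormat.setOutputPath(job, new Path("file:///E:\\output"));
        boolean b = job.waitForCompletion(true);
        return b ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Propagate the job result as the process exit code
        int exitCode = ToolRunner.run(configuration, new MapSideJoin(), args);
        System.exit(exitCode);
    }
}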

Implementing a Reduce-Side Join