Querying big data tables after moving them to HDFS

ljtyxl

Category: bigdata  Tags: hdfs, big data

Article link: https://blog.csdn.net/u014033218/article/details/75267256

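Reduce-side join implementation
1. Requirements:
Order table t_order:
id    date      pid    amount
1001  20150710  P0001  2
1002  20150710  P0001  3
1002  20150710  P0002  3

Product table t_product:
id     name          category_id  price
P0001  Xiaomi 5      C01          2
P0002  Smartisan T1  C01          3

Suppose the data volume is very large and both tables are stored as files in HDFS. A MapReduce program is needed to implement the following SQL query:
select a.id,a.date,b.name,b.category_id,b.price from t_order a join t_product b on a.pid = b.id

2. Mechanism:
Use the join condition as the map output key. Each record from either table is tagged with the file it came from, so all records that share the same join key are sent to the same reduce task, and that reduce task stitches them together.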

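import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class OrderJoin {

    static class OrderJoinMapper extends Mapper<LongWritable, Text, Text, OrderJoinBean> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

            // Read one line; we also need to work out which file it came from
            String line = value.toString();
            String[] fields = line.split("\t");

            // Get the name of the file this line belongs to (via the input split)
            String name = ((FileSplit) context.getInputSplit()).getPath().getName();

            // The join key is the product id: column 2 (pid) in an order line,
            // column 0 (id) in a product line.
            // Assumption (not from the article): order files have names starting with "order".
            String itemid = name.startsWith("order") ? fields[2] : fields[0];

            // Split out the remaining fields according to the file name and fill the bean.
            // OrderJoinBean is not shown in this article, so the five arguments of
            // set(...) are left as placeholders.
            OrderJoinBean bean = new OrderJoinBean();
            bean.set(null, null, null, null, null);
            context.write(new Text(itemid), bean);
        }
    }

    static class OrderJoinReducer extends Reducer<Text, OrderJoinBean, OrderJoinBean, NullWritable> {

        @Override
        protected void reduce(Text key, Iterable<OrderJoinBean> beans, Context context) throws IOException, InterruptedException {

            // The key is one product id, for example P0001.
            // The beans come from both kinds of file:
            //   {P0001,amount} {P0001,amount} {P0001,amount}   ---   {P0001,price,name}

            // Take the fields of the bean that came from the product file (b), append them
            // to every bean that came from the order file (a), and write each joined bean out.
            // Note: Hadoop reuses the same value object while iterating, so each order bean
            // must be copied before it is buffered in a list.
        }
    }
}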
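
To make the flow above concrete, here is a minimal plain-Java sketch of the same reduce-side join that runs without a Hadoop cluster. It is only an illustration: the class `ReduceSideJoinSketch`, the `Tagged` holder, and the `"order"`/`"product"` tags are hypothetical names, not part of the article's code, and a sorted in-memory map stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReduceSideJoinSketch {

    // One map output value: the split fields plus a tag naming the source table.
    static final class Tagged {
        final String source;
        final String[] fields;
        Tagged(String source, String[] fields) { this.source = source; this.fields = fields; }
    }

    // "Map" + "shuffle": key every line by product id and group records by that key.
    static Map<String, List<Tagged>> mapAndShuffle(List<String> orders, List<String> products) {
        Map<String, List<Tagged>> groups = new TreeMap<>();
        for (String line : orders) {
            String[] f = line.split("\t"); // id, date, pid, amount
            groups.computeIfAbsent(f[2], k -> new ArrayList<>()).add(new Tagged("order", f));
        }
        for (String line : products) {
            String[] f = line.split("\t"); // id, name, category_id, price
            groups.computeIfAbsent(f[0], k -> new ArrayList<>()).add(new Tagged("product", f));
        }
        return groups;
    }

    // "Reduce": within each group, attach the product fields to every order record.
    static List<String> reduce(Map<String, List<Tagged>> groups) {
        List<String> out = new ArrayList<>();
        for (List<Tagged> group : groups.values()) {
            String[] product = null;
            List<String[]> orders = new ArrayList<>();
            for (Tagged t : group) {
                if (t.source.equals("product")) product = t.fields;
                else orders.add(t.fields);
            }
            if (product == null) continue; // inner join: orders without a product are dropped
            for (String[] o : orders) {
                // a.id, a.date, b.name, b.category_id, b.price
                out.add(String.join("\t", o[0], o[1], product[1], product[2], product[3]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList(
                "1001\t20150710\tP0001\t2",
                "1002\t20150710\tP0001\t3",
                "1002\t20150710\tP0002\t3");
        List<String> products = Arrays.asList(
                "P0001\tXiaomi 5\tC01\t2",
                "P0002\tSmartisan T1\tC01\t3");
        for (String row : reduce(mapAndShuffle(orders, products))) {
            System.out.println(row);
        }
    }
}
```

Buffering the order records until the product record has been seen is what makes the reduce step memory-hungry when one key has many orders.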

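Drawback: with this approach the join is done entirely in the reduce phase. The reduce side carries most of the processing load while the map nodes do very little work, so resource utilization is poor, and the reduce phase is highly prone to data skew.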
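
The skew can be illustrated with a small sketch. Hadoop's default `HashPartitioner` assigns a key to partition `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the code below applies the same formula to plain `String` keys as a stand-in (Hadoop's `Text` computes its hash differently, so the actual partition numbers would differ). The class `SkewSketch` and the 900/100 key distribution are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class SkewSketch {

    // Same formula as Hadoop's HashPartitioner, applied here to String keys.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Count how many map output records each reduce task would receive.
    static int[] recordsPerReducer(List<String> keys, int numReduceTasks) {
        int[] counts = new int[numReduceTasks];
        for (String key : keys) {
            counts[partitionFor(key, numReduceTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        // Hypothetical workload: one hot product accounts for 900 of 1000 orders.
        for (int i = 0; i < 900; i++) keys.add("P0001");
        for (int i = 0; i < 100; i++) keys.add("P" + (1000 + i));

        int[] counts = recordsPerReducer(keys, 4);
        for (int r = 0; r < counts.length; r++) {
            System.out.println("reducer " + r + ": " + counts[r] + " records");
        }
    }
}
```

Because every record with the same key lands in one partition, the reducer that owns the hot product receives at least 900 records no matter how many reduce tasks are configured.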

Solution: implement the join on the map side

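4.4.2 Map-side join implementation
1. Principle
This approach applies when one of the joined tables is small.
The small table can be distributed to every map node, so each map task can join the big-table records it reads against its local copy and emit the final result directly. This greatly increases the parallelism of the join and speeds up processing.
2. Implementation example
– First, predefine the small table inside the mapper class and perform the join there
– Then bring in the approach used in real scenarios: load the data once from a database, or use DistributedCache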
public class TestDistributedCache {
static class TestDistributedCacheMapper extends Mapper
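
The map-side idea can be sketched in plain Java without Hadoop. This is a minimal illustration under assumptions, not the article's implementation: `MapSideJoinSketch`, `loadProducts`, and `joinLine` are hypothetical names. In a real job the small file would typically be shipped to the map nodes (for example with `Job.addCacheFile`) and loaded once in `Mapper.setup`, which is the role `loadProducts` plays here, while `joinLine` plays the role of `map`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class MapSideJoinSketch {

    // Load the whole small table into memory once, keyed by product id.
    static Map<String, String[]> loadProducts(List<String> productLines) {
        Map<String, String[]> products = new HashMap<>();
        for (String line : productLines) {
            String[] f = line.split("\t"); // id, name, category_id, price
            products.put(f[0], f);
        }
        return products;
    }

    // Join one order line against the in-memory table; returns null when no product matches.
    static String joinLine(String orderLine, Map<String, String[]> products) {
        String[] o = orderLine.split("\t"); // id, date, pid, amount
        String[] p = products.get(o[2]);
        if (p == null) {
            return null; // inner join: skip orders whose product is unknown
        }
        // a.id, a.date, b.name, b.category_id, b.price
        return String.join("\t", o[0], o[1], p[1], p[2], p[3]);
    }

    public static void main(String[] args) {
        Map<String, String[]> products = loadProducts(Arrays.asList(
                "P0001\tXiaomi 5\tC01\t2",
                "P0002\tSmartisan T1\tC01\t3"));
        List<String> orders = Arrays.asList(
                "1001\t20150710\tP0001\t2",
                "1002\t20150710\tP0001\t3",
                "1002\t20150710\tP0002\t3");
        for (String order : orders) {
            String joined = joinLine(order, products);
            if (joined != null) {
                System.out.println(joined);
            }
        }
    }
}
```

Each order line is joined independently with a single hash lookup, so no shuffle or reduce phase is needed and no single key can overload one task. The trade-off is that the small table must fit in the memory of every map task.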
