hive--Sort Merge Bucket Map Join

liuxw0035

Category: Hadoop. Tags: hadoop hive mapjoin bucket

Article link: https://blog.csdn.net/liuxw0035/article/details/84226093

Bucket Map Join

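1. Test 1: two tables with more than 100 million records each, no data skew and no Cartesian product. The result was about the same as an ordinary join.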

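2. Test 2: a join between a table of 40 million rows and a table of more than 50 million rows, with a skewed join key that also produces a Cartesian product. Here the improvement was obvious.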

create table lxw_test(imei string,sndaid string,data_time string)
CLUSTERED BY(imei) SORTED BY(imei) INTO 10 BUCKETS;

create table lxw_test1(imei string,sndaid string,data_time string)
CLUSTERED BY(imei) SORTED BY(imei) INTO 5 BUCKETS;

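The join key of the two tables is imei, so both tables must be bucketed and sorted by imei. The bucket count of the small table (lxw_test) is a multiple of the bucket count of the large table (lxw_test1). I read online that this is required, so I set it up this way for now.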
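To check that the bucketing and sort definitions were registered as intended, the table metadata can be inspected. This is only a small sketch; the layout of the output differs between Hive versions, but it should list the number of buckets together with the bucket columns and sort columns.

```sql
-- Inspect the storage metadata of both tables.
-- In the output, look for the bucket count, the bucket columns
-- and the sort columns (imei is expected for both).
describe formatted lxw_test;
describe formatted lxw_test1;
```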

set hive.enforce.bucketing = true;

This option must be turned on before inserting data.
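Because both tables are also declared SORTED BY(imei), older Hive releases provide a companion option that makes inserts honor the declared sort order. The sketch below assumes a Hive version before 2.0, where hive.enforce.bucketing and hive.enforce.sorting are still separate switches. It was not part of the original test, so treat it as an assumption to verify against your own version.

```sql
-- Assumption: Hive older than 2.0, where both options still exist.
set hive.enforce.bucketing = true;  -- populate the declared number of buckets on insert
set hive.enforce.sorting = true;    -- keep rows sorted within each bucket on insert
```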

insert overwrite table lxw_test
select imei,sndaid,null  
from woa_all_user_info_his 
where pt = '2012-05-28' 
limit 40000000;


insert overwrite table lxw_test1
select imei,sndaid,data_time 
from woa_all_user_info_his 
where pt = '2012-05-28';
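To get a feel for how skewed the join key is, and how many rows the join will produce, the per-key counts of both tables can be multiplied and summed. This is a rough sketch that reuses the table names above; the aliases and the column name joined_rows are only illustrative, and the query itself has to scan both tables.

```sql
-- The most frequent imei values in the small table (shows the skew).
select imei, count(1) as cnt
from lxw_test
group by imei
order by cnt desc
limit 20;

-- Estimated number of rows produced by the join:
-- for every imei, rows in lxw_test1 times rows in lxw_test.
select sum(a.cnt * b.cnt) as joined_rows
from (select imei, count(1) as cnt from lxw_test1 group by imei) a
join (select imei, count(1) as cnt from lxw_test group by imei) b
on a.imei = b.imei;
```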

Parameters that need to be turned on for the join:

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

select /*+ mapjoin(b) */ count(1) 
from lxw_test1 a 
join lxw_test b 
on a.imei = b.imei;

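The small table is used as the driving table and is named in the mapjoin hint.

Including the time to insert the data, the whole run took about 10 minutes.

When the same two tables were joined with an ordinary join, the job ran for more than an hour without finishing, so I killed it.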
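To confirm that the join really ran as a sort merge bucket map join and did not fall back to an ordinary join, the plan can be checked with explain. This is a sketch using the same settings and query as above. If the optimization applies, the plan is expected to contain a sorted merge bucket map join operator in place of a common join operator, though the exact operator name depends on the Hive version.

```sql
-- Same session settings as for the real query.
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Print the plan without running the job.
explain
select /*+ mapjoin(b) */ count(1)
from lxw_test1 a
join lxw_test b
on a.imei = b.imei;
```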
