一.大表小表join
将key相对分散,并且数据量小的表放在join的左边,这样可以有效减少内存溢出错误发生的几率;再进一步,可以使用Group让小的维度表(1000条以下的记录条数)先进内存。在map端完成reduce。
①:建大表、小表和join后表
create table bigtable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
create table smalltable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
create table jointable(id bigint, time bigint, uid string, keyword string, url_rank int, click_num int, click_url string) row format delimited fields terminated by '\t';
②:向大表和小表中导入数据
load data local inpath '/opt/module/datas/bigtable' into table bigtable;
load data local inpath '/opt/module/datas/smalltable' into table smalltable;
③:关闭mapjoin功能
set hive.auto.convert.join = false;
④:大表join小表
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from bigtable b
left join smalltable s
on s.id = b.id;
⑤:小表join大表
insert overwrite table jointable
select b.id, b.time, b.uid, b.keyword, b.url_rank, b.click_num, b.click_url
from smalltable s
left join bigtable b
on b.id = s.id;
结论
在现有的hive版本里对小表JOIN大表和大表JOIN小表进行了优化,两者运行时间基本没有区别。