众所周知hive多表联查在实际生产中有很广泛的应用,使用较为快捷的当然是right join,left join;要避免的查询那就是inner join,当表很大的时候性能差异特别明显,但是还有一种用法就是full join,其实full join本身性能并不差,与right join,left join性能旗鼓相当,但仅仅使用full join出来的效果可能并不是自己想要的,效果如下:
(举例四个表table1,table2,table3,table4)(为保护隐私数据做简化处理)
SELECT *
FROM
(SELECT name,
datas
FROM table1
WHERE dt="20190110")t1
FULL JOIN
(SELECT name,
datas
FROM table2
WHERE dt="20190110")t2 ON t1.name=t2.name
FULL JOIN
(SELECT name,
datas
FROM table3
WHERE dt="20190110")t3 ON t1.name=t3.name
FULL JOIN
(SELECT name,
datas
FROM table4
WHERE dt="20190110")t4 ON t1.name=t4.name
结果呢是下面这种
Time taken: 23.468 seconds
当然你想要的就是公司名字合并了,于是想到了coalesce()函数
SELECT coalesce(t1.name,t2.name,t3.name,t4.name) AS name,
t1.datas,
t2.datas,
t3.datas,
t4.datas
FROM
(SELECT name,
datas
FROM table1
WHERE dt="20190110")t1
FULL JOIN
(SELECT name,
datas
FROM table2
WHERE dt="20190110")t2 ON t1.name=t2.name
FULL JOIN
(SELECT name,
datas
FROM table3
WHERE dt="20190110")t3 ON coalesce(t1.name,t2.name)=t3.name
FULL JOIN
(SELECT name,
datas
FROM table4
WHERE dt="20190110")t4 ON coalesce(t1.name,t2.name,t3.name)=t4.name
效果如下
Time taken: 76.187 seconds
效果实现了,可性能真的差到了极点。足足多了三倍时间,原因很简单,正是因为每次join都会使用ON coalesce()进行校验,判断前面是否有空值(不这样写,只要前面table1,table2,table3数据有空值结果就会多出来结果,导致错误)
最后划重点:综上所述,性能主要体现在在字段合并上,解决方案得从这里思考,方法当然是使用left join替代full join
答案是:找原始表用原始表的distinct(name)去left join (table1,table2,table3,table4)的name
SELECT t0.name,
t1.datas,
t2.datas,
t3.datas,
t4.datas
FROM
(SELECT distinct(name)
FROM ooodata)t0
LEFT JOIN
(SELECT name,
datas
FROM table1
WHERE dt=dt="20190110")t1 ON t0.name=t1.name
LEFT JOIN
(SELECT name,
datas
FROM table2
WHERE dt=dt="20190110")t2 ON t0.name=t2.name
LEFT JOIN
(SELECT name,
datas
FROM table3
WHERE dt=dt="20190110")t3 ON t0.name=t3.name
LEFT JOIN
(SELECT name,
datas
FROM table4
WHERE dt=dt="20190110")t4 ON t0.name=t4.name
结果当然如下
Time taken: 25.468 seconds
ok,圆满完成,祝你工作愉快