union all的巧妙用法,使用union all也可以横向合并sql查询结果
我们都知道union all是纵向连接查询结果,join是横向,但是用union all横向连接大家试过吗?
恢复菜鸟代码如下:
select ep.productid,productname,count(st.tduserid),count(distinct sl.tduserid),count(distinct sn.tduserid),avg(sl.interval_level)
from(select productid,productname from xxx.product where productid = '3006090') ep
join(select tduserid,productid from xxx_page_ex where l_date <= '2019-04-07' and l_date >= date_add('2019-04-07', -6)) st
on ep.productid=st.productid
join(select tduserid,interval_level,productid from xxx_launch_ex where l_date <= '2019-04-07' and l_date >= date_add('2019-04-07', -6)) sl
on st.productid=sl.productid
join(select tduserid,productid from xxx_newuser_ex where l_date <= '2019-04-07' and l_date >= date_add('2019-04-07', -6)) sn
on sl.productid=sn.productid
group by ep.productid,productname;
刚开始然后写出的hql语句基本没啥优化,然后在生产集群跑了20分钟没跑完。用union all只跑了1m26s,写起来可能复杂些,不多说代码如下:
select '2019-04-07' dates,
'3006090' productid,
max(pro) productname,
sum(pv) pv,
sum(uv) uv,
cast(sum(duration) as decimal(10,4)) duration,
sum(new_uv) new_uv
from
(select productname pro,
'0' pv,
'0' uv,
'0' duration,
'0' new_uv
from xxx.product where productid = '3006090'
union all
select '0' pro,
count(tduserid) pv,
'0' uv,
'0' duration,
'0' new_uv
from xxx_page_ex where l_date <= '2019-04-07' and l_date >= date_add('2019-04-07', -6) and
productid = '3006090'
union all
select '0' pro,
'0' pv,
count(distinct tduserid) uv,
avg(interval_level) duration,
'0' new_uv
from xxx_launch_ex where l_date <= '2019-04-07' and l_date >= date_add('2019-04-07', -6) and
productid = '3006090'
union all
select '0' pro,
'0' pv,
'0' uv,
'0' duration,
count(distinct tduserid) new_uv
from xxx_newuser_ex where l_date <= '2019-04-07' and l_date >= date_add('2019-04-07', -6) and
productid = '3006090'
) t;
那么数值的可以用sum求和,当有汉字时怎么办呢,用max就可以解决这个问题。
这里的是脚本代码
#!/bin/bash
APP=XXX(数据库)
hive=/opt/module/hive/bin/hive
if [ -n "$1" ];then
n_date= $1
esle
n_date=`date -d "1 day ago" +%Y-%m-%d`
fi
sql="set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=16;
insert overwrite table "$APP".dws.dispaly_upv partition(dt='$n_date')
select cast('${n_date}' as date) dates,
'3006090' productid,
max(pro) productname,
sum(pv) pv,
sum(uv) uv,
cast(sum(duration) as decimal(10,2)) duration,
sum(new_uv) new_uv
from
(select productname pro,
'0' pv,
'0' uv,
'0' duration,
'0' new_uv
from "$APP".xxx_product where productid = '3006090'
union all
select '0' pro,
count(tduserid) pv,
'0' uv,
'0' duration,
'0' new_uv
from "$APP".xxx_page_ex where l_date <= '${n_date}' and l_date >= date_add('${n_date}', -6) and
productid = '3006090'
union all
select '0' pro,
'0' pv,
count(distinct tduserid) uv,
avg(interval_level) duration,
'0' new_uv
from "$APP".xxx_launch_ex where l_date <= '${n_date}' and l_date >= date_add('${n_date}', -6) and
productid = '3006090'
union all
select '0' pro,
'0' pv,
'0' uv,
'0' duration,
count(distinct tduserid) new_uv
from "$APP".xxx_newuser_ex where l_date <= '${n_date}' and l_date >= date_add('${n_date}', -6) and
productid = '3006090'
) t"
$hive -e "$sql"
性能调优
为什么 在使用union效率会比较高,因为hive对union all 也做了优化。
前面在设置性能调优参数时提到了hive.exec.parallel可以控制一个sql中多个可并行执行的job的运行方式.
当hive.exec.parallel为true的时候,同一个sql中可以并行执行的job会并发的执行.
而参数hive.exec.parallel.thread.number就是控制对于同一个sql来说同时可以运行的job的最大值,该参数默认为8.此时最大可以同时运行8个job.
这个任务中的最后一个sql,是n个sql一起的union all操作
通过修改参数hive.exec.parallel和hive.exec.parallel.thread.number测试了下不同情况的执行速度
其中hive.exec.parallel.thread.number表示同时运行的job的个数
不开并行:2533.722 seconds
开并行,并行度为8:266.82 seconds
开并行,并行度为16:194.33 seconds
开并行,并行度为32:133.117 seconds
当并行度为32的时候,速度提升了将近20倍,但是会占用更多的map和reduce资源