1.上线该优化后任务性能表现
任务执行时长:
任务一:该任务为wps_android国内应用的天任务耗时截图,执行时长缩短3-4倍。该任务是天任务中最长的任务,也就是说以现在的任务流程部署情况,整体天任务执行时长缩短3倍以上。
任务二:该任务为国内应用的周任务耗时截图,执行时长缩短3-4倍。
任务三:该任务为数仓单个子任务中耗时最长的任务(30天活跃任务),执行时间缩短5-6倍。其中任务的执行时间包括执行hive sql查询的时间+数据写入RDS的时间。
任务四:该任务为数仓月任务耗时截图,执行时长整体缩短2-3倍。
hive SQL数:
数仓计算天任务SQL数从2W降低到8K左右
2.原理分析
grouping sets 语法是hive0.10版引进的高级函数。
我们知道,下面sql
SELECT month, day,COUNT(DISTINCT device_id) AS uv,grouping_id FROM dwd_test GROUP BY month,day GROUPING SETS (month,day,(month,day)) ORDER BY grouping_id ;
等价于
SELECT month,NULL,COUNT(DISTINCT device_id) AS uv,1 AS grouping_id FROM dwd_test GROUP BY month
UNION ALL
SELECT NULL,day,COUNT(DISTINCT device_id) AS uv,2 AS grouping_id FROM dwd_test GROUP BY day
UNION ALL
SELECT month,day,COUNT(DISTINCT device_id) AS uv,3 AS grouping_id FROM dwd_test GROUP BY month,day
查询结果
那为什么我们优化SQL时不直接使用UNION ALL语法呢?
因为使用UNION ALL合并分组集的缺点是代码太长并且性能比较低下,因为每一个查询都会分别单独扫描一次源表。
而使用GROUPINGS SETS合并分组集可明显减少代码长度,并且能够优化扫描源表的次数,不必为每个分组集单独对表源进行扫描,从而提高性能。
EXPLAIN
SELECT SOURCE,
app_version,
channel,
sum(front_start_pv) AS front_start_pv,
count(distinct(if(front_start_pv>0,deviceId,NULL))) AS front_start_uv,
sum(all_start_pv) AS all_start_pv,
count(distinct(deviceId)) AS all_start_uv
FROM
(SELECT SOURCE,
nvl(app_version,'') AS app_version,
nvl(channel,'') AS channel,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_seek_edu.dwd_app_start_d
WHERE dt = '2020-04-21'
AND SOURCE='dw'
GROUP BY SOURCE,
nvl(app_version,''),
nvl(channel,''),
deviceId) a
GROUP BY SOURCE,
app_version,
channel
GROUPING
SETS ((SOURCE), (SOURCE,
app_version), (SOURCE,
channel), (SOURCE,
app_version,
channel))
explain SELECT SOURCE,
app_version,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT SOURCE,
app_version,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_seek_edu.dwd_app_start_d
WHERE dt='2020-04-21'
AND SOURCE='dw'
GROUP BY SOURCE,
app_version,
channel,
province,
model,
deviceId) a
GROUP BY SOURCE,
app_version,
channel,
province,
model
对于SQL分析执行计划。图一为合并后的sql,图二为合并前四条SQL中的一条。从图中可知,grouping sets语法优化后的sql还是一个map一个reduce完成所有指标的计算,并没有更多的计算过程,最后多一个reduce完成数据的整理。
3.使用中遇到的坑
一、数据上报null值与grouping sets合并各维度组合产生的null区分
SQL1
SELECT SOURCE,
app_version,
channel,
sum(front_start_pv) AS front_start_pv,
count(distinct(if(front_start_pv>0,deviceId,NULL))) AS front_start_uv,
sum(all_start_pv) AS all_start_pv,
count(distinct(deviceId)) AS all_start_uv
FROM
(SELECT SOURCE,
nvl(app_version,'') AS app_version,
nvl(channel,'') AS channel,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_seek_edu.dwd_app_start_d
WHERE dt = '2020-04-21'
AND SOURCE='dw'
GROUP BY SOURCE,nvl(app_version,''),nvl(channel,''),deviceId
) a
GROUP BY SOURCE,app_version,channel
GROUPING SETS ((SOURCE), (SOURCE,app_version), (SOURCE,channel), (SOURCE,app_version,channel))
我们可以通过nvl(app_version,'') 函数对上报的null值填充,这样结果中所有的null值就都是组合产生的null值了。
二、不同维度组合join中关联条件不同容易引起数据的不一致
SQL2
SELECT SOURCE,
app_version,
channel,
count(distinct(a.deviceId)) AS uv
FROM
(SELECT SOURCE,
nvl(app_version,'') AS app_version,
nvl(channel,'') AS channel,
deviceId
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-05-24'
AND SOURCE='dw'
AND front_start_pv>0
GROUP BY SOURCE,
nvl(app_version,''),
nvl(channel,''),
deviceId) a
JOIN
(SELECT SOURCE,
deviceId
FROM dw_wps_android.dwd_app_start_d
WHERE dt = '2020-06-23'
AND SOURCE='dw'
AND front_start_pv>0
GROUP BY SOURCE,
deviceId) b ON a.source = b.source
AND a.deviceId = b.deviceId
GROUP BY a.source,
app_version,
channel
GROUPING
SETS ((a.source), (a.source,
app_version), (a.source,
channel), (a.source,
app_version,
channel))
SQL3
SELECT SOURCE,
app_version,
channel,
count(distinct(a.deviceId)) AS uv
FROM
(SELECT SOURCE,
nvl(app_version,'') AS app_version,
nvl(channel,'') AS channel,
deviceId
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-05-24'
AND front_start_pv>0
GROUP BY SOURCE,
nvl(app_version,''),
nvl(channel,''),
deviceId) a
JOIN
(SELECT SOURCE,
deviceId
FROM dw_wps_android.dwd_app_start_d
WHERE dt = '2020-06-23'
AND front_start_pv>0
GROUP BY SOURCE,
deviceId) b ON a.deviceId = b.deviceId
GROUP BY a.source,
app_version,
channel
GROUPING
SETS ((), (app_version), (channel), (app_version,
channel))
如果直接把这两个sql合成一个,会因为不同维度组合join时关联条件不同导致最终数据偏多。