使用groupingsets函数优化数仓离线计算性能表现、原理分析及避坑指南

1.上线该优化后任务性能表现

任务执行时长:

任务一:该任务为wps_android国内应用的天任务耗时截图,执行时长缩短3-4倍。该任务是天任务中最长的任务,也就是说以现在的任务流程部署情况,整体天任务执行时长缩短3倍以上。

任务二:该任务为国内应用的周任务耗时截图,执行时长缩短3-4倍。

 任务三:该任务为数仓单个子任务中耗时最长的任务(30天活跃任务),执行时间缩短5-6倍。其中任务的执行时间包括执行hive sql查询的时间+数据写入RDS的时间。

 任务四:该任务为数仓月任务耗时截图,执行时长整体缩短2-3倍。

hive SQL数:

数仓计算天任务SQL数从2W降低到8K左右

2.原理分析

grouping sets 语法是hive0.10版引进的高级函数。

我们知道,下面sql

SELECT month, day,COUNT(DISTINCT device_id) AS uv,grouping_id FROM dwd_test GROUP BY month,day GROUPING SETS (month,day,(month,day)) ORDER BY grouping_id ;

等价于 

SELECT month,NULL,COUNT(DISTINCT device_id) AS uv,1 AS grouping_id FROM dwd_test GROUP BY month 
UNION ALL 
SELECT NULL,day,COUNT(DISTINCT device_id) AS uv,2 AS grouping_id FROM dwd_test GROUP BY day
UNION ALL 
SELECT month,day,COUNT(DISTINCT device_id) AS uv,3 AS grouping_id FROM dwd_test GROUP BY month,day

查询结果 

 

那为什么我们优化SQL时不直接使用UNION ALL语法呢?

因为使用UNION ALL合并分组集的缺点是代码太长并且性能比较低下,因为每一个查询都会分别单独扫描一次源表。

而使用GROUPINGS SETS合并分组集可明显减少代码长度,并且能够优化扫描源表的次数,不必为每个分组集单独对表源进行扫描,从而提高性能。

EXPLAIN
SELECT SOURCE,
       app_version,
       channel,
       sum(front_start_pv) AS front_start_pv,
       count(distinct(if(front_start_pv>0,deviceId,NULL))) AS front_start_uv,
       sum(all_start_pv) AS all_start_pv,
       count(distinct(deviceId)) AS all_start_uv
FROM
  (SELECT SOURCE,
          nvl(app_version,'') AS app_version,
          nvl(channel,'') AS channel,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_seek_edu.dwd_app_start_d
   WHERE dt = '2020-04-21'
     AND SOURCE='dw'
   GROUP BY SOURCE,
            nvl(app_version,''),
            nvl(channel,''),
            deviceId) a
GROUP BY SOURCE,
         app_version,
         channel
GROUPING
SETS ((SOURCE), (SOURCE,
                 app_version), (SOURCE,
                                channel), (SOURCE,
                                           app_version,
                                           channel))

 

explain SELECT SOURCE,
app_version,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT SOURCE,
app_version,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_seek_edu.dwd_app_start_d
WHERE dt='2020-04-21'
AND SOURCE='dw'
GROUP BY SOURCE,
app_version,
channel,
province,
model,
deviceId) a
GROUP BY SOURCE,
app_version,
channel,
province,
model

 

 

 对于SQL分析执行计划。图一为合并后的sql,图二为合并前四条SQL中的一条。从图中可知,grouping sets语法优化后的sql还是一个map一个reduce完成所有指标的计算,并没有更多的计算过程,最后多一个reduce完成数据的整理。

3.使用中遇到的坑

一、数据上报null值与grouping sets合并各维度组合产生的null区分

SQL1
SELECT SOURCE,
app_version,
channel,
sum(front_start_pv) AS front_start_pv,
count(distinct(if(front_start_pv>0,deviceId,NULL))) AS front_start_uv,
sum(all_start_pv) AS all_start_pv,
count(distinct(deviceId)) AS all_start_uv
FROM
(SELECT SOURCE,
nvl(app_version,'') AS app_version,
nvl(channel,'') AS channel,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_seek_edu.dwd_app_start_d
WHERE dt = '2020-04-21'
AND SOURCE='dw'
GROUP BY SOURCE,nvl(app_version,''),nvl(channel,''),deviceId
) a
GROUP BY SOURCE,app_version,channel
GROUPING SETS ((SOURCE), (SOURCE,app_version), (SOURCE,channel), (SOURCE,app_version,channel))

我们可以通过nvl(app_version,'') 函数对上报的null值填充,这样结果中所有的null值就都是组合产生的null值了。

二、不同维度组合join中关联条件不同容易引起数据的不一致

SQL2
SELECT SOURCE,
app_version,
channel,
count(distinct(a.deviceId)) AS uv
FROM
(SELECT SOURCE,
nvl(app_version,'') AS app_version,
nvl(channel,'') AS channel,
deviceId
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-05-24'
AND SOURCE='dw'
AND front_start_pv>0
GROUP BY SOURCE,
nvl(app_version,''),
nvl(channel,''),
deviceId) a
JOIN
(SELECT SOURCE,
deviceId
FROM dw_wps_android.dwd_app_start_d
WHERE dt = '2020-06-23'
AND SOURCE='dw'
AND front_start_pv>0
GROUP BY SOURCE,
deviceId) b ON a.source = b.source
AND a.deviceId = b.deviceId
GROUP BY a.source,
app_version,
channel
GROUPING
SETS ((a.source), (a.source,
app_version), (a.source,
channel), (a.source,
app_version,
channel))
SQL3
SELECT SOURCE,
app_version,
channel,
count(distinct(a.deviceId)) AS uv
FROM
(SELECT SOURCE,
nvl(app_version,'') AS app_version,
nvl(channel,'') AS channel,
deviceId
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-05-24'
AND front_start_pv>0
GROUP BY SOURCE,
nvl(app_version,''),
nvl(channel,''),
deviceId) a
JOIN
(SELECT SOURCE,
deviceId
FROM dw_wps_android.dwd_app_start_d
WHERE dt = '2020-06-23'
AND front_start_pv>0
GROUP BY SOURCE,
deviceId) b ON a.deviceId = b.deviceId
GROUP BY a.source,
app_version,
channel
GROUPING
SETS ((), (app_version), (channel), (app_version,
channel))

 如果直接把这两个sql合成一个,会因为不同维度组合join时关联条件不同导致最终数据偏多。

  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值