使用grouping sets语法与拆分sql查询时间对比

背景:为满足业务方分析需求,数仓基础统计中存在大量多维度的指标计算。原有计算中根据不同维度拆分多个sql串行提交hive集群进行查询,多个任务在map阶段存在重复读取源数据等问题,因此引入grouping sets语法合并业务逻辑为一条sql。

一次map一次reduce完成一个基础指标的统计,减少集群资源浪费。现以安卓国内版天活跃启动统计任务为例进行测试。

1.sql1 按province,model维度统计source为dw的前后台启动活跃数据。

sql1
SELECT SOURCE,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT SOURCE,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
     AND SOURCE='dw'
   GROUP BY SOURCE,
            province,
            model,
            deviceId) a
GROUP BY SOURCE,
         province,
         model

2.sql2 按channel,province,model维度统计source为dw的前后台启动活跃数据。 

sql2
SELECT SOURCE,
       channel,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT SOURCE,
          channel,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
     AND SOURCE='dw'
   GROUP BY SOURCE,
            channel,
            province,
            model,
            deviceId) a
GROUP BY SOURCE,
         channel,
         province,
         model

 

 sql3 按app_version,province,model维度统计source为dw的前后台启动活跃数据。

sql3
SELECT SOURCE,
app_version,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT SOURCE,
app_version,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
AND SOURCE='dw'
GROUP BY SOURCE,
app_version,
province,
model,
deviceId) a
GROUP BY SOURCE,
app_version,
province,
model

 

 4.sql4 按app_version,channel,province,model维度统计source为dw的前后台启动活跃数据。

sql4
SELECT SOURCE,
app_version,
channel,
province,
model,
sum(front_start_pv),
sum(if(front_start_pv>0,1,0)),
sum(all_start_pv),
sum(1)
FROM
(SELECT SOURCE,
app_version,
channel,
province,
model,
deviceId,
sum(front_start_pv) AS front_start_pv,
sum(all_start_pv) AS all_start_pv
FROM dw_wps_android.dwd_app_start_d
WHERE dt='2020-04-04'
AND SOURCE='dw'
GROUP BY SOURCE,
app_version,
channel,
province,
model,
deviceId) a
GROUP BY SOURCE,
app_version,
channel,
province,
model

 

 5.sql5 按province,model维度统计source为所有平台的前后台启动活跃数据

sql5
SELECT 'all' AS SOURCE,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT 'all' AS SOURCE,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
   GROUP BY 'all',
            province,
            model,
            deviceId) a
GROUP BY 'all',
         province,
         model

 

 6.sql6 按channel,province,model维度统计source为所有平台的前后台启动活跃数据

sql6
SELECT 'all' AS SOURCE,
       channel,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT 'all' AS SOURCE,
          channel,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
   GROUP BY 'all',
            channel,
            province,
            model,
            deviceId) a
GROUP BY 'all',
         channel,
         province,
         model

 

 7.sql7 按app_version,province,model维度统计source为所有平台的前后台启动活跃数据

sql7
 SELECT 'all' AS SOURCE,
       app_version,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT 'all' AS SOURCE,
          app_version,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
   GROUP BY 'all',
            app_version,
            province,
            model,
            deviceId) a
GROUP BY 'all',
         app_version,
         province,
         model

 

 8.sql8 按app_version,channel,province,model维度统计source为所有平台的前后台启动活跃数据

sql8
SELECT 'all' AS SOURCE,
       app_version,
       channel,
       province,
       model,
       sum(front_start_pv),
       sum(if(front_start_pv>0,1,0)),
       sum(all_start_pv),
       sum(1)
FROM
  (SELECT 'all' AS SOURCE,
          app_version,
          channel,
          province,
          model,
          deviceId,
          sum(front_start_pv) AS front_start_pv,
          sum(all_start_pv) AS all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
   GROUP BY 'all',
            app_version,
            channel,
            province,
            model,
            deviceId) a
GROUP BY 'all',
         app_version,
         channel,
         province,
         model 

 

 9.sql9 合并source,app_version,channel,province,model维度组合统计前后台启动活跃数据

sql9
SELECT nvl(source,"all") as source,
       app_version,
       channel,
       province,
       model,
       sum(front_start_pv) as front_start_pv,
       sum(if(front_start_pv>0,1,0)) as front_start_uv,
       sum(all_start_pv) as all_start_pv,
       sum(1) as all_start_uv
FROM
  (SELECT source,
          nvl(app_version,"dw_null") as app_version,
          nvl(channel,"dw_null") as channel,
          nvl(province,"dw_null") as province,
          nvl(model,"dw_null") as model,
          deviceId,
          sum(front_start_pv) as front_start_pv,
          sum(all_start_pv) as all_start_pv
   FROM dw_wps_android.dwd_app_start_d
   WHERE dt='2020-04-04'
   GROUP BY source,
            nvl(app_version,"dw_null"),
            nvl(channel,"dw_null"),
            nvl(province,"dw_null"),
            nvl(model,"dw_null"),
            deviceId) a
GROUP BY source,
         app_version,
         channel,
         province,
         model
GROUPING SETS(
(source,province,model),
(source,channel,province,model),
(source,app_version,province,model),
(source,app_version,channel,province,model),
(province,model),
(channel,province,model),
(app_version,province,model),
(app_version,channel,province,model)
)HAVING source='dw' OR source is null

 

 

sql行号

数据行

time(秒)

sql1

191618

54.289

sql2

349692

186.34

sql3

811334

48.341

sql4

1059390

50.373

sql5

197314

66.308

sql6

405496

68.312

sql7

1077795

152.358

sql8

1376569

182.347

总和

5469208

808.668

sql行号

数据行

time(秒)

sql9

5469208

134.432

说明:sql都是都测试队列相同集群环境下测试的。都是在container复用的情况下进行测试,排除了第一个sql从0申请资源执行时间过长的因素。

结论:采用grouping set语法进行sql合并能有效减少hive任务执行时间,同时以上9个sql占用的集群资源基本相当,优化后会大量减少集群资源的消耗。之后会测试数仓批量任务执行优化前后的时间对比,进行更全面的对比。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值