Spark SQL multi-dimensional analysis optimization: the devil is in the details

This post shares a case study of optimizing a multi-dimensional analysis job.

[Outline of this post]

  1. Business background

  2. How Spark SQL handles count distinct

  3. How Spark SQL handles grouping sets

  4. The optimization process and its results

  5. Summary

1. Business background

Let's start with the SQL:

select 
        if(req_netease_user is null, 'all', req_netease_user) as netease_user, 
        if(campaign_id is null, 'all', campaign_id) as campaign_id, 
        if(spec_id is null, 'all', spec_id) as spec_id, 
        if(app_bundle is null, 'all', app_bundle) as app_bundle, 
        if(render_name is null,'all', render_name) as render_name, 
        if(platform is null, 'all', platform) as platform, 
        count(distinct request_id) as bid_request_num,
        count(distinct deviceid) as bid_request_uv,
        count(distinct case when bid_response_nbr=10001 then bid_response_id else null end) as offer_num,
        count(distinct case when bid_response_nbr=10001 then deviceid else null end) as offer_uv,
        dt
from 
(
    select
        distinct dt, 
        if(req_netease_user is null, 'null', req_netease_user) as req_netease_user, 
        if(render_name is null, 'null', render_name) as render_name, 
        if(platform is null,'null', platform) as platform, 
        if(campaign_id is null, 'null', campaign_id) as campaign_id,
        if(spec_id is null, 'null', spec_id) as spec_id, 
        if(app_bundle is null, 'null', app_bundle) as app_bundle,
        request_id, 
        bid_response_nbr, 
        bid_response_id, 
        deviceid
    from table_a where dt = '2019-08-11' and request_id is not null
) tmp group by dt, req_netease_user, campaign_id, spec_id, app_bundle, render_name, platform
grouping sets(
  (dt),

  (dt, req_netease_user),
  (dt, campaign_id),
  (dt, spec_id),
  (dt, app_bundle),
  (dt, render_name),
  (dt, platform),

  (dt, req_netease_user, campaign_id),
  (dt, req_netease_user, spec_id),
  (dt, req_netease_user, app_bundle),
  (dt, req_netease_user, render_name),
  (dt, req_netease_user, platform),

  (dt, req_netease_user, campaign_id, spec_id),
  (dt, req_netease_user, campaign_id, app_bundle),
  (dt, req_netease_user, campaign_id, render_name),
  (dt, req_netease_user, campaign_id, platform),

  (dt, req_netease_user, campaign_id, spec_id, app_bundle),
  (dt, req_netease_user, campaign_id, spec_id, render_name),
  (dt, req_netease_user, campaign_id, spec_id, platform),

  (dt, req_netease_user, campaign_id, spec_id, app_bundle, render_name),
  (dt, req_netease_user, campaign_id, spec_id, app_bundle, platform),

  (dt, req_netease_user, campaign_id, spec_id, app_bundle, render_name, platform)
)

The logic isn't complicated, it's just slow: the job takes about 5 hours to run.

This is an ad-bidding business table: every bid request (request_id) produces one row, so a single day of data is huge (several billion rows). On top of that, the 7 dimensions have to be rolled up into 22 combinations, and for each combination we need count(distinct request_id), count(distinct deviceid), count(distinct case when bid_response_nbr=10001 then bid_response_id else null end) and count(distinct case when bid_response_nbr=10001 then deviceid else null end). All one can say is: what a merciless requirement.
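As a quick reminder of what the grouping sets clause does: it is semantically a union all of separate group by queries, one per set, where any dimension absent from a set comes back as null, which the outer if(... is null, 'all', ...) then maps to 'all'. A minimal sketch with just two of the 22 sets, ignoring the other measures:

-- grouping sets((dt), (dt, platform)) computes the same result as:
select dt, null as platform, count(distinct request_id) as bid_request_num
from table_a where dt = '2019-08-11'
group by dt
union all
select dt, platform, count(distinct request_id) as bid_request_num
from table_a where dt = '2019-08-11'
group by dt, platform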

2. How Spark SQL optimizes count distinct

In Hive, the way we usually optimize a count distinct looks like this:

--before optimization
select count(distinct id) from table_a

--after optimization
select
  count(id)
from
(
    select
        id
    from table_a group by id
) tmp

Hive tends to use a single reducer to evaluate a global aggregate function like this, and that one reducer ends up as the skewed hotspot; so, other factors aside, our optimization is to group by first and then count.
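The same trick extends to per-group distinct counts; a minimal sketch (hypothetical: counting distinct ids per dt on the same table_a):

-- distinct ids per dt, without count(distinct)
select
    dt,
    count(id) as id_uv
from
(
    select dt, id
    from table_a
    group by dt, id    -- dedup (dt, id) pairs first, so the work is spread across reducers
) tmp
group by dt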

With Spark SQL, it seems we don't need to worry about this, because Spark already optimizes count distinct for us:

explain 
select 
    count(distinct id),
    count(distinct name) 
from table_a

The physical plan:

== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[count(if ((gid#147005 = 2)) table_a.`id`#147007 else null), count(if ((gid#147005 = 1)) table_a.`name`#147006 else null)])
+- Exchange SinglePartition
   +- *(2) HashAggregate(keys=[], functions=[partial_count(if ((gid#147005 = 2)) table_a.`id`#147007 else null), partial_count(if ((gid#147005 = 1)) table_a.`name`#147006 else null)])
      +- *(2) HashAggregate(keys=[table_a.`name`#147006, table_a.`id`#147007, gid#147005], functions=[])
         +- Exchange(coordinator id: 387101114) hashpartitioning(table_a.`name`#147006, table_a.`id`#147007, gid#147005, 4096), coordinator[target post-shuffle partition size: 67108864]
            +- *(1) HashAggregate(keys=[table_a.`name`#147006, table_a.`id`#147007, gid#147005], functions=[])
               +- *(1) Expand [List(name#146984, null, 1), List(null, id#146979, 2)], [table_a.`name`#147006, table_a.`id`#147007, gid#147005]
                  +- *(1) Project [id#146979, name#146984]
                     +- *(1) FileScan parquet table_a

From the physical plan we can see that Spark handles the count distinct aggregates with an Expand operator. What does the expansion actually look like? Each input row is duplicated once per distinct aggregate: the columns that copy doesn't need are nulled out, and an extra gid column records which aggregate the copy belongs to.
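For example, take a made-up input row with id = 1 and name = 'a'. The Expand in the plan above, with projections List(name, null, 1) and List(null, id, 2), emits two copies of it:

(name, id, gid)
(null, 1, 2)      -- the copy consumed by count(distinct id)
('a', null, 1)    -- the copy consumed by count(distinct name)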

After the Expand, the rows are hash-aggregated (i.e. grouped) on the keys id, name and gid, which deduplicates each column within its own gid group. After that, counting the non-null id values of the gid = 2 rows and the non-null name values of the gid = 1 rows gives the two distinct counts, and the work is split across many partitions instead of one. The data is divided and conquered, which alleviates the skew of a single global distinct to some extent. A rough SQL equivalent is sketched below.
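Purely as a sketch for intuition (this is the semantics of the rewrite, not literally the plan Spark executes):

select
    count(if(gid = 2, id, null))   as distinct_id_cnt,
    count(if(gid = 1, name, null)) as distinct_name_cnt
from
(
    -- plays the role of the HashAggregate on (name, id, gid):
    -- deduplicates each column within its own gid group
    select name, id, gid
    from
    (
        -- plays the role of Expand: one copy of every row per distinct aggregate
        select name, null as id, 1 as gid from table_a
        union all
        select null as name, id, 2 as gid from table_a
    ) expanded
    group by name, id, gid
) deduped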

For reference, here is an excerpt of the relevant Spark code (the RewriteDistinctAggregates rule), to compare against the plan above:

def rewrite(a: Aggregate): Aggregate = {
    // collect all aggregate expressions in the Aggregate node
    val aggExpressions = a.aggregateExpressions.flatMap { e =>
      e.collect { case ae: AggregateExpression => ae }
    }
    // ... (the rest of the rule is omitted here)