Spark SQL multi-dimensional analysis optimization: the devil is in the details

This time I am sharing a case study of optimizing a multi-dimensional analysis job.

[Outline]

  1. Business background

  2. How Spark SQL handles count distinct

  3. How Spark SQL handles grouping sets

  4. The optimization process and its results

  5. Summary

1. Business background

First, the SQL:

select
    if(req_netease_user is null, 'all', req_netease_user) as netease_user,
    if(campaign_id is null, 'all', campaign_id) as campaign_id,
    if(spec_id is null, 'all', spec_id) as spec_id,
    if(app_bundle is null, 'all', app_bundle) as app_bundle,
    if(render_name is null, 'all', render_name) as render_name,
    if(platform is null, 'all', platform) as platform,
    count(distinct request_id) as bid_request_num,
    count(distinct deviceid) as bid_request_uv,
    count(distinct case when bid_response_nbr=10001 then bid_response_id else null end) as offer_num,
    count(distinct case when bid_response_nbr=10001 then deviceid else null end) as offer_uv,
    dt
from (
    select distinct
        dt,
        if(req_netease_user is null, 'null', req_netease_user) as req_netease_user,
        if(render_name is null, 'null', render_name) as render_name,
        if(platform is null, 'null', platform) as platform,
        if(campaign_id is null, 'null', campaign_id) as campaign_id,
        if(spec_id is null, 'null', spec_id) as spec_id,
        if(app_bundle is null, 'null', app_bundle) as app_bundle,
        request_id,
        bid_response_nbr,
        bid_response_id,
        deviceid
    from table_a
    where dt = '2019-08-11' and request_id is not null
) tmp
group by dt, req_netease_user, campaign_id, spec_id, app_bundle, render_name, platform
grouping sets(
    (dt),
    (dt, req_netease_user),
    (dt, campaign_id),
    (dt, spec_id),
    (dt, app_bundle),
    (dt, render_name),
    (dt, platform),
    (dt, req_netease_user, campaign_id),
    (dt, req_netease_user, spec_id),
    (dt, req_netease_user, app_bundle),
    (dt, req_netease_user, render_name),
    (dt, req_netease_user, platform),
    (dt, req_netease_user, campaign_id, spec_id),
    (dt, req_netease_user, campaign_id, app_bundle),
    (dt, req_netease_user, campaign_id, render_name),
    (dt, req_netease_user, campaign_id, platform),
    (dt, req_netease_user, campaign_id, spec_id, app_bundle),
    (dt, req_netease_user, campaign_id, spec_id, render_name),
    (dt, req_netease_user, campaign_id, spec_id, platform),
    (dt, req_netease_user, campaign_id, spec_id, app_bundle, render_name),
    (dt, req_netease_user, campaign_id, spec_id, app_bundle, platform),
    (dt, req_netease_user, campaign_id, spec_id, app_bundle, render_name, platform)
)
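A quick note on grouping sets semantics (this rewrite is only for illustration and is not part of the job): a grouping sets clause is equivalent to a union all of one group by per combination, with the non-grouped columns filled with null. The first two combinations above, written out explicitly against the deduplicated subquery tmp, would look roughly like this:

-- illustrative only: the (dt) and (dt, req_netease_user) grouping sets as explicit group bys
select dt, null as req_netease_user, count(distinct request_id) as bid_request_num
from tmp
group by dt
union all
select dt, req_netease_user, count(distinct request_id) as bid_request_num
from tmp
group by dt, req_netease_user

With 22 such combinations and 4 count distinct measures each, the amount of work multiplies quickly.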

The logic is not complicated; it is just slow. The running time is shown below:

[Screenshot: job running time]

It takes 5 hours to run!

This is an ad-bidding business table: every bid request (request_id) produces one row, so a single day's data is huge (billions of rows). On top of that, we have to take 7 dimensions, build 22 combinations out of them, and for each combination compute count(distinct request_id), count(distinct deviceid), count(distinct case when bid_response_nbr=10001 then bid_response_id else null end), and count(distinct case when bid_response_nbr=10001 then deviceid else null end). All I can say is: what a brutal requirement.

2. How Spark SQL optimizes count distinct

In Hive, we usually optimize count distinct like this:

-- before optimization
select count(distinct id) from table_a

-- after optimization
select
    count(id)
from (
    select
        id
    from table_a
    group by id
) tmp

Hive tends to run a global aggregate function like this with a single reducer, so all the data ends up piled onto one node. Leaving other factors aside, our fix is to group by first and then count.
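Note that this manual rewrite is straightforward only when there is a single distinct column. With several count(distinct ...) on different columns, you would need one deduplicating subquery per column and then stitch the results back together, roughly like this (a sketch; name is a second, hypothetical distinct column on table_a):

-- one dedup subquery per distinct column, results combined with a cross join
select a.cnt_id, b.cnt_name
from (
    select count(id) as cnt_id
    from (select id from table_a group by id) t1
) a
cross join (
    select count(name) as cnt_name
    from (select name from table_a group by name) t2
) b

That gets unwieldy fast, and the multi-column case is exactly where the engine's own rewrite helps.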

When using Spark SQL, it seems we do not have to worry about this, because Spark optimizes count distinct on its own:

explain
select
    count(distinct id),
    count(distinct name)
from table_a

The physical plan looks like this:

== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[count(if ((gid#147005 = 2)) table_a.`id`#147007 else null), count(if ((gid#147005 = 1)) table_a.`name`#147006 else null)])
+- Exchange SinglePartition
   +- *(2) HashAggregate(keys=[], functions=[partial_count(if ((gid#147005 = 2)) table_a.`id`#147007 else null), partial_count(if ((gid#147005 = 1)) table_a.`name`#147006 else null)])
      +- *(2) HashAggregate(keys=[table_a.`name`#147006, table_a.`id`#147007, gid#147005], functions=[])
         +- Exchange(coordinator id: 387101114) hashpartitioning(table_a.`name`#147006, table_a.`id`#147007, gid#147005, 4096), coordinator[target post-shuffle partition size: 67108864]
            +- *(1) HashAggregate(keys=[table_a.`name`#147006, table_a.`id`#147007, gid#147005], functions=[])
               +- *(1) Expand [List(name#146984, null, 1), List(null, id#146979, 2)], [table_a.`name`#147006, table_a.`id`#147007, gid#147005]
                  +- *(1) Project [id#146979, name#146984]
                     +- *(1) FileScan parquet table_a

From the physical plan we can see that count distinct is handled with an Expand. How exactly does it expand? See the diagram below:

[Diagram: Expand turns each input row into one row per distinct column, with a gid column marking which count distinct the row belongs to]

After the Expand, a HashAggregate, i.e. a group by, is performed with id, name (and gid) as the key. This way, the deduplication required by count distinct is spread across all the shuffle partitions instead of being funneled through a single reducer.
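Put differently, the Expand-based plan is logically equivalent to the following rewrite (a sketch of what Spark does internally; the gid values 1 and 2 match the plan above):

select
    count(id)   as cnt_distinct_id,    -- only rows with gid = 2 carry a non-null id
    count(name) as cnt_distinct_name   -- only rows with gid = 1 carry a non-null name
from (
    -- Expand: every source row becomes two rows, one per distinct column;
    -- the group by then deduplicates (name, id, gid) in a fully parallel shuffle
    select name, id, gid
    from (
        select name, null as id, 1 as gid from table_a
        union all
        select null as name, id, 2 as gid from table_a
    ) expanded
    group by name, id, gid
) deduped

Because the expensive deduplication happens in a group by keyed on (name, id, gid), it is distributed over all the shuffle partitions (4096 in the plan above) rather than handled by one node.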
