Vol09-多字段多粒度count(distinct) 优化

最新推荐文章于 2024-07-30 09:01:15 发布

Pysamlam

最新推荐文章于 2024-07-30 09:01:15 发布

阅读量692

点赞数

数据处理过程中count(distinct)效率较低，数据量一多准出问题，如果是多字段count(distinct)效率更低。

解决方法：使用row_number实现count(distinct)的功能，不存在数据倾斜问题。

实例分析：流量分站点分渠道数据。

此需求中需要汇总分站点分渠道count(distinct)的字段有：

访客数、新访客数、新用户访问次数、访次、成功下单访次、商品页访次、加入购物车访次、订单提交页访次、唯一IP数量、新注册用户数、首次购买用户数、重复购买用户数、下单用户数、新注册且下单的用户数。

使用过的方法：

多count(distinct) 存在数据倾斜，数据无法生成。
将多个需要count(distinct)的数据拆为单个left outer join ,每个按照需要汇总的字段进行group by,最后做count，数据可以生成，但是job数过多，以上所有字段全站以及分站点分渠道加完总计>130个job，而且新增字段job也会随之增加。

Row_number具体实现：

维护一张gdm_s14_ol_site_tmp表，存放需要count(distinct)的去重数据(按照需要统计的字段以及统计的具体值做去重)。

select

web_site_id as site,

'uv' as type,

browser_uniq_id as datavalue

from

gdm.gdm_online_log

where dt = '"""+ht.data_day_str + """'

union all

select

web_site_id as site,

'new_visitor_qtty' as type,

browser_uniq_id as datavalue

from

gdm.gdm_online_log

where dt = '"""+ht.data_day_str + """' and visit_times = 1

union all

select

web_site_id as site,

'reg_user_qtty' as type,

user_log_acct as datavalue

from

gdm.gdm_d14_user_reg_log

where dt = '"""+ht.data_day_str + """' ……

group by site, type, datavalue

使用row_number完成数据汇总

select

site,

sum(case when type = 'uv' and uniq_num= 1 then cast(1 as bigint) end) as uv,

sum(case when type = 'new_visitor_qtty'and uniq_num = 1 then cast(1 as bigint) end) as new_visitor_qtty,

……

from

(

select

site,

row_number(type, datavalue, site) asuniq_num,

type

from

(

select

case when site_id is null then '-1'

when site_id in('1', '2', '3', '4','5', '6', '7', '8', '9') then site_id

else '-1' end as site,

field_fenxi as type,

value_fenxi as datavalue

from

gdm.gdm_s14_ol_site_tmp

where

dt = '"""+ht.data_day_str + """'

and value_fenxi is not null distributeby type, datavalue, site

sort by type, datavalue, site

) tmp3

) tmp4

group by site

运行结果

天，周，月正常运行，每天需要做count(distinct)1亿多条记录，月数据有30-40亿记录，数据正常运行。

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

评论

被折叠的条评论为什么被折叠?

到【灌水乐园】发言

查看更多评论

添加红包

成就一亿技术人!

hope_wisdom

发出的红包

实付元

使用余额支付

点击重新获取

扫码支付

钱包余额 0

抵扣说明：

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。