数据处理过程中count(distinct)效率较低,数据量一多准出问题,如果是多字段count(distinct)效率更低。
解决方法:使用row_number实现count(distinct)的功能,不存在数据倾斜问题。
实例分析:流量分站点分渠道数据。
此需求中需要汇总分站点分渠道count(distinct)的字段有:
访客数、新访客数、新用户访问次数、访次、成功下单访次、商品页访次、加入购物车访次、订单提交页访次、唯一IP数量、新注册用户数、首次购买用户数、重复购买用户数、下单用户数、新注册且下单的用户数。
使用过的方法:
多count(distinct) 存在数据倾斜,数据无法生成。
将多个需要count(distinct)的数据拆为单个left outer join ,每个按照需要汇总的字段进行group by,最后做count,数据可以生成,但是job数过多,以上所有字段全站以及分站点分渠道加完总计>130个job,而且新增字段job也会随之增加。
Row_number具体实现:
维护一张gdm_s14_ol_site_tmp表,存放需要count(distinct)的去重数据(按照需要统计的字段以及统计的具体值做去重)。
select
web_site_id as site,
'uv' as type,
browser_uniq_id as datavalue
from
gdm.gdm_online_log
where dt = '"""+ht.data_day_str + """'
union all
select
web_site_id as site,
'new_visitor_qtty' as type,
browser_uniq_id as datavalue
from
gdm.gdm_online_log
where dt = '"""+ht.data_day_str + """' and visit_times = 1
union all
select
web_site_id as site,
'reg_user_qtty' as type,
user_log_acct as datavalue
from
gdm.gdm_d14_user_reg_log
where dt = '"""+ht.data_day_str + """' ……
group by site, type, datavalue
使用row_number完成数据汇总
select
site,
sum(case when type = 'uv' and uniq_num= 1 then cast(1 as bigint) end) as uv,
sum(case when type = 'new_visitor_qtty'and uniq_num = 1 then cast(1 as bigint) end) as new_visitor_qtty,
……
from
(
select
site,
row_number(type, datavalue, site) asuniq_num,
type
from
(
select
case when site_id is null then '-1'
when site_id in('1', '2', '3', '4','5', '6', '7', '8', '9') then site_id
else '-1' end as site,
field_fenxi as type,
value_fenxi as datavalue
from
gdm.gdm_s14_ol_site_tmp
where
dt = '"""+ht.data_day_str + """'
and value_fenxi is not null distributeby type, datavalue, site
sort by type, datavalue, site
) tmp3
) tmp4
group by site
运行结果
天,周,月正常运行,每天需要做count(distinct)1亿多条记录,月数据有30-40亿记录,数据正常运行。