Background: I recently worked on an A/B testing (ABT) project built on the push channel of the company's marketing system. The audience is split into an A branch and a B branch, each served by a different algorithm model, and we compare the UV conversion and order conversion that each branch drives.
Code: original pattern
select dim1
      ,dim2
      ,count(distinct if(condition1, member_id, null)) as member_cnt1
      ,count(distinct if(condition2, member_id, null)) as member_cnt2
      ,count(distinct if(condition3, member_id, null)) as member_cnt3
      ,count(distinct if(condition1, item_id, null)) as item_cnt1
      ,count(distinct if(condition2, item_id, null)) as item_cnt2
      ,count(distinct if(condition3, item_id, null)) as item_cnt3
      ,sum(index3) as index3
  from table
 group by dim1
      ,dim2
Symptom: a few reducers get stuck, and the job fails to finish even after running for days. Retrying multiple times does not help.
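The stuck-reducer symptom can be illustrated with a toy shuffle simulation (the row counts and key choices below are hypothetical, and Hive's actual partitioning is more involved): with count(distinct), rows are partitioned only by the group-by key, so a handful of hot keys funnel almost all rows into a few reducers, while partitioning by the full (dim, id) key spreads them evenly.

```python
import random
from collections import Counter

random.seed(42)
N_REDUCERS = 8

# Hypothetical rows: (dim1, dim2, member_id). Only 2 distinct (dim1, dim2)
# combinations exist, so when count(distinct) shuffles by the group-by key
# alone, every row lands on at most 2 of the 8 reducers.
rows = [(random.choice(["A", "B"]), "push", random.randrange(1_000_000))
        for _ in range(100_000)]

by_group_key = Counter(hash(r[:2]) % N_REDUCERS for r in rows)
print("rows per reducer, partitioned by (dim1, dim2):", dict(by_group_key))

# The rewrite's first job partitions by (dim1, dim2, member_id), which
# spreads the same rows almost evenly across all reducers.
by_full_key = Counter(hash(r) % N_REDUCERS for r in rows)
print("rows per reducer, partitioned by (dim1, dim2, member_id):", dict(by_full_key))
```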
Optimization: pull the distinct-count metrics (count(distinct xxxx)) out and compute them separately. Each one is rewritten as a group by followed by an upward aggregation, and the pieces are then stitched back together with union all.
New pattern:
select dim1
      ,dim2
      ,max(member_cnt1) as member_cnt1
      ,max(member_cnt2) as member_cnt2
      ,max(member_cnt3) as member_cnt3
      ,max(item_cnt1) as item_cnt1
      ,max(item_cnt2) as item_cnt2
      ,max(item_cnt3) as item_cnt3
      ,max(index3) as index3
  from (select dim1
              ,dim2
              -- after the inner group by, each row is one distinct member,
              -- so summing the 0/1 flag counts distinct members per condition
              ,sum(condition1) as member_cnt1
              ,sum(condition2) as member_cnt2
              ,sum(condition3) as member_cnt3
              ,0 as item_cnt1
              ,0 as item_cnt2
              ,0 as item_cnt3
              ,sum(index3) as index3
          from (select dim1
                      ,dim2
                      ,member_id
                      ,max(if(condition1, 1, 0)) as condition1
                      ,max(if(condition2, 1, 0)) as condition2
                      ,max(if(condition3, 1, 0)) as condition3
                      ,sum(index3) as index3
                  from table
                 group by dim1
                      ,dim2
                      ,member_id
               ) m
         group by dim1
              ,dim2
        union all
        select dim1
              ,dim2
              ,0 as member_cnt1
              ,0 as member_cnt2
              ,0 as member_cnt3
              ,sum(condition1) as item_cnt1
              ,sum(condition2) as item_cnt2
              ,sum(condition3) as item_cnt3
              ,sum(index3) as index3
          from (select dim1
                      ,dim2
                      ,item_id
                      ,max(if(condition1, 1, 0)) as condition1
                      ,max(if(condition2, 1, 0)) as condition2
                      ,max(if(condition3, 1, 0)) as condition3
                      ,sum(index3) as index3
                  from table
                 group by dim1
                      ,dim2
                      ,item_id
               ) i
         group by dim1
              ,dim2
       ) u
 group by dim1
      ,dim2
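The member branch of the query above can be sketched in plain Python (the toy rows and condition flags are hypothetical): the first job groups by (dim1, dim2, member_id) and max-reduces the condition flags, and the second job groups by (dim1, dim2), where summing the 0/1 flag counts each qualifying member exactly once, with no count(distinct) needed.

```python
from collections import defaultdict

# Toy rows: (dim1, dim2, member_id, condition1, index3) -- hypothetical data.
rows = [
    ("A", "push", 1, True,  10),
    ("A", "push", 1, True,   5),   # same member, must be counted once
    ("A", "push", 2, False,  3),
    ("B", "push", 3, True,   7),
]

# Job 1: group by (dim1, dim2, member_id); max() the flag, sum() the metric.
stage1 = defaultdict(lambda: [0, 0])          # key -> [condition1, index3]
for d1, d2, mid, cond1, idx in rows:
    acc = stage1[(d1, d2, mid)]
    acc[0] = max(acc[0], int(cond1))
    acc[1] += idx

# Job 2: group by (dim1, dim2); each stage-1 row is one distinct member,
# so summing the flag yields the distinct member count per condition.
stage2 = defaultdict(lambda: [0, 0])          # key -> [member_cnt1, index3]
for (d1, d2, _mid), (flag, idx) in stage1.items():
    stage2[(d1, d2)][0] += flag
    stage2[(d1, d2)][1] += idx

print(dict(stage2))  # {('A', 'push'): [1, 18], ('B', 'push'): [1, 7]}
```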
Result: after rewriting count(distinct) as group by plus upward aggregation, the job sped up dramatically and finished in a little over 10 minutes.
Principle:
When counting distinct values of a column over a very large dataset, count(distinct) becomes very slow. The reason is similar to group-by skew:
the count(distinct) logic is handled by only a few reducers. In that case it can be rewritten with group by.
Before the optimization, a single reduce task performs the whole count(distinct).
After the optimization, the rewrite launches two MR jobs (a plain distinct launches only one), so only consider this approach when the data is large
enough that the job-startup overhead is far smaller than the compute time. When the dataset is small, or the key skew is pronounced, group
by may actually be slower than distinct.
-- before: the distinct count funnels into few reducers
select t.a, count(distinct t.b) from t group by t.a;
-- after: the inner group by deduplicates, the outer group by counts
select a, count(b)
  from (select a, b from t group by a, b) s
 group by a;
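The generic rewrite can be checked for equivalence on a small in-memory SQLite table (the table and column names here are hypothetical): the group-by form produces the same per-group distinct counts as count(distinct).

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table t (a text, b int)")
con.executemany("insert into t values (?, ?)",
                [("x", 1), ("x", 1), ("x", 2), ("y", 3), ("y", 3)])

# Direct form: count(distinct) in one pass.
direct = con.execute(
    "select a, count(distinct b) from t group by a order by a").fetchall()

# Rewritten form: the inner group by deduplicates (a, b) pairs,
# the outer group by then counts plain rows.
rewritten = con.execute(
    "select a, count(b) from (select a, b from t group by a, b) s "
    "group by a order by a").fetchall()

print(direct, rewritten)  # [('x', 2), ('y', 1)] [('x', 2), ('y', 1)]
```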