hive count(distinct )的优化手段

背景:  近期做了ABT项目,基于公司的营销系统的push渠道去做AB测试,把人群划分为A分支和B分支,每个分支采用不同的算法模型,去查看其对应A分支和B分支带来的uv转化和对应的订单转化.

代码: 原有模式

select dim1
       ,dim2
       ,count(distinct if(condition1,member_id,null)) as member_cnt1
       ,count(distinct if(condition2,member_id,null)) as member_cnt2
       ,count(distinct if(condition3,member_id,null)) as member_cnt3
       ,count(distinct if(condition1,itme_id,null))   as itme_cnt1
       ,count(distinct if(condition2,itme_id,null))   as itme_cnt2
       ,count(distinct if(condition3,itme_id,null))   as itme_cnt3
       ,sum(index3)  as index3
from   table

现象: 可以发现数据存在在某几个reduce 卡主的情况,而且存在着很多天跑步过去的情况.重试多次还是不行.

优化手段:优化手段:  将 去重指标count(distinct  xxxx) 拉出去单独计算,   这一部分去重指标改为group  by再向上聚合,再union all 回去.

现有模式:

select dim1
        ,dim2
        ,max(member_cnt1 ) as member_cnt1
        ,max(member_cnt2 ) as member_cnt2
        ,max(member_cnt3 ) as member_cnt3
        ,max(itme_cnt1   ) as itme_cnt1 
        ,max(itme_cnt2   ) as itme_cnt2 
        ,max(itme_cnt3   ) as itme_cnt3 
        ,max(index3      ) as index3    
from  (select  dim1
                ,dim2
               ,count(if(condition1,member_id,null)) as member_cnt1
               ,count(if(condition2,member_id,null)) as member_cnt2
               ,count(if(condition3,member_id,null)) as member_cnt3
               ,0     as   item_cnt1
               ,0     as   item_cnt2
               ,0     as   item_cnt3
               ,sum(index3)  as index3
        from    (select    dim1
                         ,dim2
                         ,member_id
                         ,max(if(condition1,1,0)) as condition1
                         ,max(if(condition2,1,0)) as condition2
                         ,max(if(condition3,1,0)) as condition3
                         ,sum(index3)             as index3
                from   table
                group by dim1
                         ,dim2
                         ,member_id
                ) s 
        group by dim1
                 ,dim2
 
        union all
 
        select  dim1
                ,dim2
               ,0     as   member_cnt1
               ,0     as   member_cnt2
               ,0     as   member_cnt3
               ,count(if(condition1,item_id,null)) as item_cnt1
               ,count(if(condition2,item_id,null)) as item_cnt2
               ,count(if(condition3,item_id,null)) as item_cnt3
               ,sum(index3)  as index3
        from    (select    dim1
                         ,dim2
                         ,item_id
                         ,max(if(condition1,1,0)) as condition1
                         ,max(if(condition2,1,0)) as condition2
                         ,max(if(condition3,1,0)) as condition3
                         ,sum(index3)             as index3
                from   table
                group by dim1
                         ,dim2
                         ,item_id
                ) s 
        group by dim1
                 ,dim2
        ) s
group by dim1
        ,dim2

结果  : 将count(distinct)  改为group by 向上聚合后,速度提升很快,10多分钟就跑过去了.

原理:

当要统计某一列去重数时,如果数据量很大,count(distinct) 就会非常慢,原因与 group by 类似,
count(distinct) 逻辑只会有很少的 reducer 来处理。这时可以用 group by 来改写:


优化前 ,一个普通的只使用一个reduceTask来进行count(distinct) 操作
优化后 ,但是这样写会启动两个MR job(单纯 distinct 只会启动一个),所以要确保数据量大到启动
job 的 overhead 远小于计算耗时,才考虑这种方法。当数据集很小或者 key 的倾斜比较明显时,group
by 还可能会比 distinct 慢。
select t.a, count(t.b) , sum(t.c) from t group by t.a;
select t.a, count(distinct t.b, t.c) from t group by t.a;

  • 2
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值