Impala: count(distinct QUESTION_ID) vs. ndv(QUESTION_ID)

In Impala, a single select that contains several count(distinct col) expressions over different columns fails with an error. For example:

select C_DEPT2,
         count(distinct QUESTION_BUSI_ID) as wo_num,
         count(distinct CREATOR_ID) as creator_num
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2

Error message:

ERROR: AnalysisException: all DISTINCT aggregate functions need to have the same set of parameters as count(DISTINCT QUESTION_BUSI_ID); deviating function: count(DISTINCT CREATOR_ID)
Consider using NDV() instead of COUNT(DISTINCT) if estimated counts are acceptable. Enable the APPX_COUNT_DISTINCT query option to perform this rewrite automatically.

There are two ways to work around this:

1. Approximate values. The result is an estimate, and the absolute deviation tends to grow with the data volume:

(1) Before running the query, run the command: set APPX_COUNT_DISTINCT=true;

set APPX_COUNT_DISTINCT=true;
select C_DEPT2,
       count(distinct QUESTION_BUSI_ID) as wo_num,
       count(distinct CREATOR_ID) as creator_num
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2

(2) Replace count(distinct col) with the function ndv(col)

select C_DEPT2,
       ndv(QUESTION_BUSI_ID) as wo_num,
       ndv(CREATOR_ID) as creator_num
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2

Note that with set APPX_COUNT_DISTINCT=true; in effect, count(distinct col) is automatically rewritten to ndv(col) and returns an approximate value, so the two methods above produce the same results.
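
To get a feel for how far the estimate drifts on your own data, one option is to put the exact and the approximate count of a single column side by side. This is only a sketch: the aliases wo_num_exact and wo_num_approx are made up for illustration, and it relies on Impala accepting one count(distinct col) together with ndv(col) in the same select, since ndv is not a DISTINCT aggregate.

```sql
set APPX_COUNT_DISTINCT=false; -- keep count(distinct ...) exact for this comparison
select C_DEPT2,
       count(distinct QUESTION_BUSI_ID) as wo_num_exact,  -- exact distinct count
       ndv(QUESTION_BUSI_ID)            as wo_num_approx  -- estimated distinct count
  from pdm.kudu_q_basic
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2;
```

If the two columns differ by more than you can tolerate, use the exact approach instead.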

2. Exact values. Compute each count(distinct col) in its own subquery, then join the subqueries:

set APPX_COUNT_DISTINCT = false; -- set the option to false so that count(distinct col) is not rewritten to ndv(col)
select a.C_DEPT2, a.wo_num, b.creator_num
  from (select C_DEPT2, count(distinct QUESTION_BUSI_ID) as wo_num
          from pdm.kudu_q_basic
         where substr(CREATE_DATE, 1, 7) = '2020-10'
         group by C_DEPT2) a
  left join (select C_DEPT2, count(distinct CREATOR_ID) as creator_num
               from pdm.kudu_q_basic
              where substr(CREATE_DATE, 1, 7) = '2020-10'
              group by C_DEPT2) b on a.C_DEPT2 = b.C_DEPT2
 order by a.C_DEPT2
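
One caveat with the join above: if C_DEPT2 can be NULL, the NULL group from subquery a will not match anything under =, so its creator_num comes back as NULL. A possible variant, assuming your Impala version supports the NULL-safe comparison IS NOT DISTINCT FROM (also written <=>), is sketched below. Whether C_DEPT2 actually contains NULLs in this table is an assumption.

```sql
set APPX_COUNT_DISTINCT = false; -- keep count(distinct col) exact
select a.C_DEPT2, a.wo_num, b.creator_num
  from (select C_DEPT2, count(distinct QUESTION_BUSI_ID) as wo_num
          from pdm.kudu_q_basic
         where substr(CREATE_DATE, 1, 7) = '2020-10'
         group by C_DEPT2) a
  left join (select C_DEPT2, count(distinct CREATOR_ID) as creator_num
               from pdm.kudu_q_basic
              where substr(CREATE_DATE, 1, 7) = '2020-10'
              group by C_DEPT2) b
    on a.C_DEPT2 is not distinct from b.C_DEPT2 -- NULL-safe match, so the NULL group also joins
 order by a.C_DEPT2;
```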

Verification:

select C_DEPT2, count(*)
  from pdm.kudu_q_basic -- the table contains no duplicate rows
 where substr(CREATE_DATE, 1, 7) = '2020-10'
 group by C_DEPT2
 order by C_DEPT2

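Summary: when a single select with several count(distinct col) expressions fails in Impala, you can either run set APPX_COUNT_DISTINCT = true; or replace count(distinct col) with ndv(col). Both return approximate values, not exact ones. To get exact values, compute each count(distinct col) in a separate subquery and join the results, but make sure APPX_COUNT_DISTINCT = false; otherwise count(distinct col) is still rewritten to ndv(col) and the result is again approximate.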
