count(distinct ) over(partition by order by)替换成size(collect_set() over(partition by order by))

这个函数的大致意思是:在分组内求去重后的数量

为什么不用count(distinct ),不支持原因

例子:

数据准备:

SELECT '1' as id ,'201808' as m,'a' as k
union all
SELECT '2' as id ,'201808' as m,'a' as k
union all
SELECT '1' as id ,'201809' as m,'a' as k
union all
SELECT '1' as id ,'201808' as m,'b' as k
union all
SELECT '2' as id ,'201809' as m,'b' as k


id代表人编号,m代表月份,k代表其他key键。

idmk
1201808a
2201808a
1201809a
1201808b
2201809b


需求:本月累计人数(即9月份的客户要包含9月以前的客户数)

预计结果

k          m            n

a        201808     2

a        201809     2

b        201808    1

b        201809    2

第一步:

SELECT
k,
m,
size(collect_set(id) OVER(PARTITION BY k ORDER BY m asc rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )) as n
from
(
SELECT '1' as id ,'201808' as m,'a' as k
union all
SELECT '2' as id ,'201808' as m,'a' as k
union all
SELECT '1' as id ,'201809' as m,'a' as k
union all
SELECT '1' as id ,'201808' as m,'b' as k
union all
SELECT '2' as id ,'201809' as m,'b' as k
)t
order by k,m

 

kmn
a2018081
a2018082
a2018092
b2018081
b2018092


第二步:取出分组内最后一条即可

SELECT
k,
m,
n,
row_number() over(PARTITION BY k,m ORDER BY n desc)rk
from
(
SELECT
k,
m,
size(collect_set(id) OVER(PARTITION BY k ORDER BY m asc rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW )) as n
from
(
SELECT '1' as id ,'201808' as m,'a' as k
union all
SELECT '2' as id ,'201808' as m,'a' as k
union all
SELECT '1' as id ,'201809' as m,'a' as k
union all
SELECT '1' as id ,'201808' as m,'b' as k
union all
SELECT '2' as id ,'201809' as m,'b' as k
)t
order by k,m
)having rk = 1
 

结果为:

kmnrk
a20180821
a20180921
b20180811
b20180921

 

若需要了解rows BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,请参考https://mp.csdn.net/postedit/82909903

  • 3
    点赞
  • 13
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
是的,count over partition by 可以去重。在SQL中,使用count函数结合over和partition by子句可以实现对分组后的数据进行去重计数。具体实现方式是将count(distinct ) over(partition by order by)替换成size(collect_set() over(partition by order by))。这样可以求得分组后的去重个数。这种方法适用于既要保证原数据明细表不变,又要保证分组统计数据的场景。另外,还可以使用distinct关键字结合count函数来实现去重计数,例如select distinct name,count(*) over () from t_window where substring(orderdate,1,7) = '2015-04'。还可以使用group by子句来实现去重计数,例如select name,count(*) over () from t_window where substring(orderdate,1,7) = '2015-04' group by name。 #### 引用[.reference_title] - *1* [【hive&spark】HiveSql&SparkSql—COUNT(DISTINCT ) OVER (PARTITION BY )报错解决方案](https://blog.csdn.net/qyj19920704/article/details/126372968)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v91^control_2,239^v3^insert_chatgpt"}} ] [.reference_item] - *2* *3* [Hive窗口函数Over partition by](https://blog.csdn.net/weixin_32265569/article/details/115475759)[target="_blank" data-report-click={"spm":"1018.2226.3001.9630","extra":{"utm_source":"vip_chatgpt_common_search_pc_result","utm_medium":"distribute.pc_search_result.none-task-cask-2~all~insert_cask~default-1-null.142^v91^control_2,239^v3^insert_chatgpt"}} ] [.reference_item] [ .reference_list ]
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值