hive优化之distinct转group by实战

最新推荐文章于 2023-02-19 22:58:37 发布

cclovezbf

最新推荐文章于 2023-02-19 22:58:37 发布

阅读量1.1k

点赞数 1

分类专栏： hive 文章标签： hive distinct group by

本文链接：https://blog.csdn.net/cclovezbf/article/details/111928426

版权

hive 专栏收录该内容

49 篇文章 10 订阅

订阅专栏

说明：一个公司有多个业务biz_id，每个业务收款的时候通过不同或者相同的account去收款

业务需要需要计算出该公司有过滤重复后biz_count 和过滤重复后的account_count

1先来个distinct ，占用内存，容易oom，所以一般都会把distinct改为group by

select company_name,count(distinct  biz_id),count(distinct account_id)
from  odsiadata.ia_fdw_b_profile_account_info
where company_id ='1001000000113619'
group by company_name

2不用distinct，想到了 collect_set ，个人感觉应该和distinct差不多也会占用内存，有大佬知道哪里有distinct占内存或者collect_set占内存出处留言让我学习下

select company_name,size(collect_set(biz_id)),size(collect_set(account_id))
from  odsiadata.ia_fdw_b_profile_account_info
where company_id ='1001000000113619'
group by company_name

3用group by 借鉴https://blog.csdn.net/henrrywan/article/details/86597468 但是他的sql不完整感觉像复制的。。。。不管

select company_name ,
		sum(IF(t_type='biz',1,0)) biz_count, --记住用sum。。。之前用count浪费了会时间
		sum(if(t_type='account',1,0)) account_num , --记住用sum。。。之前用count浪费了会时间
		MIN(created_time) 
from(
select company_name,biz_id as col ,"biz" as t_type ,min(created_time) created_time from  odsiadata.ia_fdw_b_profile_account_info 
group by company_name,biz_id 
union
select company_name,account_id as col ,"account" as t_type,min(created_time) created_time from odsiadata.ia_fdw_b_profile_account_info 
group by company_name,account_id 
)t
GROUP by company_name

cclovezbf

关注

1
点赞
踩
2

收藏

觉得还不错? 一键收藏
0
评论
hive优化之distinct转group by实战

说明：一个公司有多个业务biz_id，每个业务收款的时候通过不同或者相同的account去收款业务需要需要计算出该公司有过滤重复后biz_count 和过滤重复后的account_count1先来个distinct ，占用内存，容易oom，所以一般都会把distinct改为group byselect company_name,count(distinct biz_id),count(distinct account_id)from odsiadata.ia_fdw_b_profi..
复制链接

扫一扫

专栏目录