hive去重问题的几种方式

最新推荐文章于 2023-07-18 11:36:06 发布

水墨风漾

最新推荐文章于 2023-07-18 11:36:06 发布

阅读量3.3k

点赞数 1

分类专栏： Hive

本文链接：https://blog.csdn.net/shuimofengyang/article/details/97106639

版权

Hive 专栏收录该内容

8 篇文章 0 订阅

订阅专栏

在做维表时，很重要一点是要去重。那么去重有那些方式呢？
1.distinct
2.group by
前面两种关系型的也常用，我就不说了。
3.row_number
insert overwrite table dw_dw.dw_dim_address
select
row_number() over() as pk_address_code
, t2.province_name
,t2.city_name
,t2.region_name
,t2.town_name
,t2.address
, t2.detail_address
,CAST (t2.lng as DOUBLE ) lng
,CAST (t2.lat as DOUBLE ) lat
,FROM_UNIXTIME(UNIX_TIMESTAMP(),‘yyyyMMdd’) as etl_date
from (
select row_number() over(distribute by t1.province,t1.city,t1.region,t1.town,t1.address) as pk
, t1.province province_name
,t1.city city_name
,t1.region region_name
,t1.town town_name
,t1.address address
, t1.detail_address detail_address
,CAST (t1.lng as DOUBLE )lng
,CAST (t1.lat as DOUBLE ) lat
,FROM_UNIXTIME(UNIX_TIMESTAMP(),‘yyyyMMdd’) as etl_date
from
(select
i.province
,i.city
,i.region
,i.town
,i.address
,concat(i.province,i.city,i.region,i.town,i.address) detail_address
,case when j.lng is not null and j.lng != ‘’ and j.lng<>‘error’ then j.lng else ‘0’ end as lng
,case when j.lat is not null and j.lat != ‘’ and j.lat<>‘error’ then j.lat else ‘0’ end as lat
from dw_dw.dw_map_jwd j right join dw_dw.dw_iu_merge i on j.ids=i.ids
where 1 = 1
and i.province<>‘KUYU90’
and (i.province<>‘KUYU90’ and i.city<>‘KUYU90’)
and (i.province<>‘KUYU90’ and i.city<>‘KUYU90’ and i.region <> ‘KUYU90’)
) t1
) t2 where t2.pk=1

注：1.这里是row_number（）over（）可以做主键这样用，但impala中可不行，用法还是有些区别的。2.这里对于row_number() over(distribute by x,x,x) 等分组排序后，对于只取第一个因此 pk=1。

4.collect_set
SELECT
t2.ssoid as user_id
,collect_set(t2.cp)[0] as cp
,collect_set(t2.id)[0] as id
,t2.ssoid as ssoid
,collect_set(t2.huan_id)[0] as huan_id
,collect_set(t2.bind_email)[0] as bind_emails
,collect_set(t2.weixin_unionid)[0] as weixin_unionids
,collect_set(t2.table_source)[0] as table_sources
,collect_set(t2.name)[0] as name
,collect_set(t2.sex)[0] as sex
,collect_set(t2.birth)[0] as birth
,collect_set(t2.constellation)[0] as constellation
,collect_set(t2.interests)[0] as interests
,collect_set(t2.education)[0] as education
,collect_set(t2.province)[0] as province
,collect_set(t2.city)[0] as city
,collect_set(t2.region)[0] as region
,collect_set(t2.town)[0] as town
,collect_set(t2.address)[0] as address
,collect_set(t2.statdate)[0] as statdate
,sort_array(collect_set(t2.statdate))[0] as first_statdate
FROM (
SELECT *
FROM DW_DW.DW_USER_BASEANDKEY t1
where
t1.ssoid is not null
and t1.ssoid != ‘’
and t1.ssoid != ‘KUYU90’
and (t1.cp is null
or t1.cp = ‘’
or t1.cp = ‘KUYU90’)
distribute by ssoid sort by statdate desc
) t2 group by t2.ssoid
这样利用collect_set来突破group by的限制，Hive中在group by查询的时候要求出现在select后面的列都必须是出现在group by后面的，即select列必须是作为分组依据的列。