hive去重问题的几种方式

在做维表时,很重要一点是要去重。那么去重有那些方式呢?
1.distinct
2.group by
前面两种关系型的也常用,我就不说了。
3.row_number
insert overwrite table dw_dw.dw_dim_address
select
row_number() over() as pk_address_code
, t2.province_name
,t2.city_name
,t2.region_name
,t2.town_name
,t2.address
, t2.detail_address
,CAST (t2.lng as DOUBLE ) lng
,CAST (t2.lat as DOUBLE ) lat
,FROM_UNIXTIME(UNIX_TIMESTAMP(),‘yyyyMMdd’) as etl_date
from (
select row_number() over(distribute by t1.province,t1.city,t1.region,t1.town,t1.address) as pk
, t1.province province_name
,t1.city city_name
,t1.region region_name
,t1.town town_name
,t1.address address
, t1.detail_address detail_address
,CAST (t1.lng as DOUBLE )lng
,CAST (t1.lat as DOUBLE ) lat
,FROM_UNIXTIME(UNIX_TIMESTAMP(),‘yyyyMMdd’) as etl_date
from
(select
i.province
,i.city
,i.region
,i.town
,i.address
,concat(i.province,i.city,i.region,i.town,i.address) detail_address
,case when j.lng is not null and j.lng != ‘’ and j.lng<>‘error’ then j.lng else ‘0’ end as lng
,case when j.lat is not null and j.lat != ‘’ and j.lat<>‘error’ then j.lat else ‘0’ end as lat
from dw_dw.dw_map_jwd j right join dw_dw.dw_iu_merge i on j.ids=i.ids
where 1 = 1
and i.province<>‘KUYU90’
and (i.province<>‘KUYU90’ and i.city<>‘KUYU90’)
and (i.province<>‘KUYU90’ and i.city<>‘KUYU90’ and i.region <> ‘KUYU90’)
) t1
) t2 where t2.pk=1

注:1.这里是row_number()over()可以做主键这样用,但impala中可不行,用法还是有些区别的。2.这里对于row_number() over(distribute by x,x,x) 等分组排序后 ,对于只取第一个因此 pk=1。

4.collect_set
SELECT
t2.ssoid as user_id
,collect_set(t2.cp)[0] as cp
,collect_set(t2.id)[0] as id
,t2.ssoid as ssoid
,collect_set(t2.huan_id)[0] as huan_id
,collect_set(t2.bind_email)[0] as bind_emails
,collect_set(t2.weixin_unionid)[0] as weixin_unionids
,collect_set(t2.table_source)[0] as table_sources
,collect_set(t2.name)[0] as name
,collect_set(t2.sex)[0] as sex
,collect_set(t2.birth)[0] as birth
,collect_set(t2.constellation)[0] as constellation
,collect_set(t2.interests)[0] as interests
,collect_set(t2.education)[0] as education
,collect_set(t2.province)[0] as province
,collect_set(t2.city)[0] as city
,collect_set(t2.region)[0] as region
,collect_set(t2.town)[0] as town
,collect_set(t2.address)[0] as address
,collect_set(t2.statdate)[0] as statdate
,sort_array(collect_set(t2.statdate))[0] as first_statdate
FROM (
SELECT *
FROM DW_DW.DW_USER_BASEANDKEY t1
where
t1.ssoid is not null
and t1.ssoid != ‘’
and t1.ssoid != ‘KUYU90’
and (t1.cp is null
or t1.cp = ‘’
or t1.cp = ‘KUYU90’)
distribute by ssoid sort by statdate desc
) t2 group by t2.ssoid
这样利用collect_set来突破group by的限制,Hive中在group by查询的时候要求出现在select后面的列都必须是出现在group by后面的,即select列必须是作为分组依据的列。

扩展:
Hive中collect相关的函数有collect_list和collect_set,它们都是将分组中的某列转为一个数组返回,不同的是collect_list不去重而collect_set去重。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值