Hive的数据去重Group by Having count()等方法

查询出当前表中,电话号码出现数次数为两次的数据!并且删除掉重复多余的数据,只留下一条数据!having count and group by

select mobile,count(1)cnt from rbc_audience_users group by mobile having count(mobile) >1

查询出rbc_audience_users表中 mobile大于一条的,并且查询出最小的uid 嵌套一个子查询,将查询的结果作为一个子表,用嵌套查询uid 查询出uid ,然后用当前表去 uid IN (),即可查询出来电话号码重的其中一个!

可以查询出当前需要去重的UID 和电话号码! 仅有一份

SELECT uid, mobile
FROM rbc_audience_users
WHERE mobile IN (
    SELECT mobile
    FROM (
        SELECT mobile, COUNT(1) AS cnt
        FROM rbc_audience_users
        WHERE mobile != ''
        GROUP BY mobile
        HAVING COUNT(mobile) > 1
    ) a
);

查询出重复的所有的电话号码的用户ID,和Mobile

SELECT uid,mobile
FROM rbc_audience_users
WHERE mobile IN (
    SELECT mobile
    FROM (
        SELECT mobile, COUNT(1) AS cnt
        FROM rbc_audience_users
        WHERE mobile != ''
            AND mobile != 'null'
        GROUP BY mobile
        HAVING COUNT(mobile) > 1
    ) a
);

查询出所有的电话号码出现次数为两次数据,并且查询mobile 不为null  和空串的电话号码!

将上述两个表的结果数据数据插入到分别的两张表,然后使用 left join where is null(取差集的方式) 的方式查询出来重复数据

将ID全部查询出来,然后传入Es对当前用户ID数据进行删除! 

即可达到数据删除冗余数据的目的! 

全部重复数据,

微信
insert into rbc_audience_user_not_distinct_mobiledata select uid,mobile,'10300010001' from rbc_audience_user_v1 where mobile in (select mobile from (select mobile,count(1)cnt from rbc_audience_user_v1 where mobile !='null' and mobile !='' and data_code='10300010001'  group by mobile having count(mobile) >1) a) and data_code='10300010001';
1947

听听
insert into rbc_audience_user_not_distinct_mobiledata  select uid,mobile,'10300010002' from rbc_audience_user_v1 where mobile in (select mobile from (select mobile,count(1)cnt from rbc_audience_user_v1 where mobile !='null' and mobile !='' and data_code='10300010002'  group by mobile having count(mobile) >1) a) and data_code='10300010002';
0

交通台数据
insert into rbc_audience_user_not_distinct_mobiledata  select uid,mobile,'10300010003' from rbc_audience_user_v1 where mobile in (select mobile from (select mobile,count(1)cnt from rbc_audience_user_v1 where mobile !='null' and mobile !='' and data_code='10300010003'  group by mobile having count(mobile) >1) a) and data_code='10300010003';
35743
========================================================================================
仅有一条数据

微信
insert into rbc_audience_user_distinct_mobiledata SELECT uid, mobile,'10300010001' FROM rbc_audience_user_v1 WHERE uid IN (SELECT uid FROM (SELECT count(1) cnt,min(uid) uid, mobile FROM rbc_audience_user_v1 WHERE mobile != '' and  mobile !='null' and data_code='10300010001' GROUP BY mobile HAVING count(1) > 1) tb) and  data_code='10300010001';
860

听听
insert into rbc_audience_user_distinct_mobiledata SELECT uid, mobile,'10300010002' FROM rbc_audience_user_v1 WHERE uid IN (SELECT uid FROM (SELECT count(1) cnt,min(uid) uid, mobile FROM rbc_audience_user_v1 WHERE mobile != '' and  mobile !='null'  and  data_code='10300010002' GROUP BY mobile HAVING count(1) > 1) tb) and  data_code='10300010002';
0

交通台数据
insert into rbc_audience_user_distinct_mobiledata SELECT uid, mobile,'10300010003' FROM rbc_audience_user_v1 WHERE uid IN (SELECT uid FROM (SELECT count(1) cnt,min(uid) uid, mobile FROM rbc_audience_user_v1 WHERE mobile != '' and  mobile !='null' and  data_code='10300010003' GROUP BY mobile HAVING count(1) > 1) tb) and  data_code='10300010003';


数据取差集
select a.id from rbc_audience_user_not_distinct_mobiledata where id not in (select id from rbc_audience_user_distinct_mobiledata) limit 10;

 

  • 0
    点赞
  • 7
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值