hivesql数据清洗,对比新老数据方法

本文介绍了如何将Python的数据清洗脚本迁移到Hive中,通过创建临时表并执行SQL查询来对比新老数据的一致性,关注var1、var2和var3字段。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

项目背景

公司最近需要将python的数据清洗脚本迁移到hive上面,为了保证hive的清洗逻辑正确,需要对比清洗后的新老数据是否相同,下面贴出题主提供的清洗方法

方法

举个例子:

旧表表结构
create table if not exist db.table_old (
session_id string,
var1 string,
var2 string
var2 string
);
新表表结构
create table if not exist db.table_new (
session_id string,
var1 string,
var2 string
var2 string
);
对比步骤:
1、建一张用于比对计数的临时表:
create table if not exist db.table_cnt (
var1 string,
var2 string
var2 string
);
2、利用脚本生成对比计数sql
insert overwrite table db.table_cnt partition(dt='2023-11-19')
select 
sum(case when t1.var1 is null and t2.var1 is null then 0 when t1.var1 = t2.var1 then 0 else 1 end) as var1,
sum(case when t1.var2 is null and t2.var2 is null then 0 when t1.var2 = t2.var2 then 0 else 1 end) as var2,
sum(case when t1.var3 is null and t2.var3 is null then 0 when t1.var3 = t2.var3 then 0 else 1 end) as var3
from 
(
select * from db.table_old where dt='2023-11-19' and sessionid in ('')
) t1
left join
(
select * from db.table_new where dt='2023-11-19' and sessionid in ('')
) t2
on t1.session_id = t2.session_id;
3、在2生成的表基础上近一步统计(这一部分sql也可以用脚本生成)
select name, cnt
from
(
select  'var1' as name, var1 as cnt from db.table_cnt where dt='2023-11-19' union all
select  'var2' as name, var2 as cnt from db.table_cnt where dt='2023-11-19' union all
select  'var3' as name, var3 as cnt from db.table_cnt where dt='2023-11-19'
) t1
order by cnt desc

43的统计结果按照不一致的字段数量从大到小排列,基于上述结果case by case解决

select t1.session_id,
   t1.var1 as var1_old,
   t1.var1 as var1_new
from 
(
select * from db.table_old where dt='2023-11-19'
) t1
left join
(
select * from db.table_new where dt='2023-11-19'
) t2
on t1.session_id = t2.session_id
where t1.var1 <> t2.var1
limit 100
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值