项目背景
公司最近需要将python的数据清洗脚本迁移到hive上面,为了保证hive的清洗逻辑正确,需要对比清洗后的新老数据是否相同,下面贴出题主提供的清洗方法
方法
举个例子:
旧表表结构
create table if not exist db.table_old (
session_id string,
var1 string,
var2 string
var2 string
);
新表表结构
create table if not exist db.table_new (
session_id string,
var1 string,
var2 string
var2 string
);
对比步骤:
1、建一张用于比对计数的临时表:
create table if not exist db.table_cnt (
var1 string,
var2 string
var2 string
);
2、利用脚本生成对比计数sql
insert overwrite table db.table_cnt partition(dt='2023-11-19')
select
sum(case when t1.var1 is null and t2.var1 is null then 0 when t1.var1 = t2.var1 then 0 else 1 end) as var1,
sum(case when t1.var2 is null and t2.var2 is null then 0 when t1.var2 = t2.var2 then 0 else 1 end) as var2,
sum(case when t1.var3 is null and t2.var3 is null then 0 when t1.var3 = t2.var3 then 0 else 1 end) as var3
from
(
select * from db.table_old where dt='2023-11-19' and sessionid in ('')
) t1
left join
(
select * from db.table_new where dt='2023-11-19' and sessionid in ('')
) t2
on t1.session_id = t2.session_id;
3、在2生成的表基础上近一步统计(这一部分sql也可以用脚本生成)
select name, cnt
from
(
select 'var1' as name, var1 as cnt from db.table_cnt where dt='2023-11-19' union all
select 'var2' as name, var2 as cnt from db.table_cnt where dt='2023-11-19' union all
select 'var3' as name, var3 as cnt from db.table_cnt where dt='2023-11-19'
) t1
order by cnt desc
4、3的统计结果按照不一致的字段数量从大到小排列,基于上述结果case by case解决
select t1.session_id,
t1.var1 as var1_old,
t1.var1 as var1_new
from
(
select * from db.table_old where dt='2023-11-19'
) t1
left join
(
select * from db.table_new where dt='2023-11-19'
) t2
on t1.session_id = t2.session_id
where t1.var1 <> t2.var1
limit 100