hivesql数据清洗，对比新老数据方法_hive sql 数据清洗函数-CSDN博客

本文链接：https://blog.csdn.net/ggplot_2/article/details/134594296

本文介绍了如何将Python的数据清洗脚本迁移到Hive中，通过创建临时表并执行SQL查询来对比新老数据的一致性，关注var1、var2和var3字段。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

项目背景

公司最近需要将python的数据清洗脚本迁移到hive上面，为了保证hive的清洗逻辑正确，需要对比清洗后的新老数据是否相同，下面贴出题主提供的清洗方法

方法

举个例子：

旧表表结构
create table if not exist db.table_old (
session_id string,
var1 string,
var2 string
var2 string
);
新表表结构
create table if not exist db.table_new (
session_id string,
var1 string,
var2 string
var2 string
);
对比步骤：
1、建一张用于比对计数的临时表：
create table if not exist db.table_cnt (
var1 string,
var2 string
var2 string
);
2、利用脚本生成对比计数sql
insert overwrite table db.table_cnt partition(dt='2023-11-19')
select 
sum(case when t1.var1 is null and t2.var1 is null then 0 when t1.var1 = t2.var1 then 0 else 1 end) as var1,
sum(case when t1.var2 is null and t2.var2 is null then 0 when t1.var2 = t2.var2 then 0 else 1 end) as var2,
sum(case when t1.var3 is null and t2.var3 is null then 0 when t1.var3 = t2.var3 then 0 else 1 end) as var3
from 
(
select * from db.table_old where dt='2023-11-19' and sessionid in ('')
) t1
left join
(
select * from db.table_new where dt='2023-11-19' and sessionid in ('')
) t2
on t1.session_id = t2.session_id;
3、在2生成的表基础上近一步统计(这一部分sql也可以用脚本生成)
select name, cnt
from
(
select  'var1' as name, var1 as cnt from db.table_cnt where dt='2023-11-19' union all
select  'var2' as name, var2 as cnt from db.table_cnt where dt='2023-11-19' union all
select  'var3' as name, var3 as cnt from db.table_cnt where dt='2023-11-19'
) t1
order by cnt desc

4、3的统计结果按照不一致的字段数量从大到小排列，基于上述结果case by case解决

select t1.session_id,
   t1.var1 as var1_old,
   t1.var1 as var1_new
from 
(
select * from db.table_old where dt='2023-11-19'
) t1
left join
(
select * from db.table_new where dt='2023-11-19'
) t2
on t1.session_id = t2.session_id
where t1.var1 <> t2.var1
limit 100