# 上亿条数据，如何比对并发现两个表数据差异

12 篇文章 0 订阅
1 篇文章 1 订阅

3.1 数据量级比对

3.2 一致性比对

3.2.1 勾稽验证+md5方法

3.2.2 暴力比对法

3.3 差异数据发现

……

## 三、验数方法

1. 数据量级比对：先比对两个表核心字段数据量级，如果量级不同，两个表数据肯定不一致。
2. 一致性比对：如果量级相同，比对一致性。
3. 差异数据发现：如果数据不一致，把不一致的数据打印出来。

### 3.1 数据量级比对

select left_table.pv-right_table.pv as pv_diff,
left_table.user_id_uv - right_table.user_id_uv as user_id_uv_diff,
left_table.order_id_uv - right_table.order_id_uv as order_id_uv_diff,
left_table.city_id_uv - right_table.city_id_uv as city_id_uv_diff
from (
select count(1) as pv,
count(distinct user_id) as user_id_uv,
count(distinct order_id) as order_id_uv,
count(distinct city_id) as city_id_uv
from mart_online.fact_user_order_day
where dt=20190413
)left_table
left outer join (
select count(1) as pv,
count(distinct user_id) as user_id_uv,
count(distinct order_id) as order_id_uv,
count(distinct city_id) as city_id_uv
from mart_test.fact_user_order_day
where dt=20190413
)right_table
on 1=1
左表pv减去右表pv值为：[0]，核心字段uv差为：[0] 即两个表数据条数相同
+-------+----------------+------------------+---------------+
|pv_diff|user_id_uv_diff |order_id_uv_diff  |city_id_uv_diff|
+-------+----------------+------------------+---------------+
|      0|               0|                 0|              0|
+-------+----------------+------------------+---------------+

### 3.2 一致性比对

#### 3.2.1 勾稽验证+md5方法

md5：就是计算一行数据的md5值，把它当成key去做比对。尤其是在百亿数据规模的情况下，这种方法也使用。

************ 数据量一致性验证SQL ************* 注意：这里采用 full join

select sum(case when left_table.record_key is not null or left_table.record_key !='' then 1 else 0 end) as left_table_num,
sum(case when right_table.record_key is not null or right_table.record_key !='' then 1 else 0 end) as right_table_num,
sum(case when left_table.record_key = right_table.record_key then 1 else 0 end) as left_right_equal_num
from (
select md5(
concat(
if(user_id is null, '-', cast(user_id as string)),
if(user_name is null, '-', cast(user_name as string)),
if(order_id is null, '-', cast(order_id as string)),
if(city_id is null, '-', cast(city_id as string)),
if(city_name is null, '-', cast(city_name as string)),
if(字段n…… is null, '-', cast(字段n…… as string)),
if(dt is null, '-', cast(dt as string))
)
) as record_key
from mart_online.fact_user_order_day
where dt=20190413
)left_table
full outer join (
select md5(
concat(
if(user_id is null, '-', cast(user_id as string)),
if(user_name is null, '-', cast(user_name as string)),
if(order_id is null, '-', cast(order_id as string)),
if(city_id is null, '-', cast(city_id as string)),
if(city_name is null, '-', cast(city_name as string)),
if(字段n…… is null, '-', cast(字段n…… as string)),
if(dt is null, '-', cast(dt as string))
)
) as record_key
from mart_test.fact_user_order_day
where dt=20190413
)right_table
on left_table.record_key=right_table.record_key
************ 数据量一致性验证报表 *************
[left_table_num]左表中的数据条数，[right_table_num]右表中的条数，[left_right_equal_num]两个表中相等的数据条数。

+--------------+---------------+--------------------+
|left_table_num|right_table_num|left_right_equal_num|
+--------------+---------------+--------------------+
|      16358699|       16358699|            16353039|
+--------------+---------------+--------------------+

#### 3.2.2 暴力比对法

select online.*,
test.* from(
select id,
user_id,
user_name,
order_id,
city_id,
city_name
from mart_online.fact_user_order_day
where dt='20190413'
)online
left outer join (
select id,
user_id,
user_name,
order_id,
city_id,
city_name
from mart_test.fact_user_order_day
where dt='20190413'
) test
on test.id=online.id
where test.user_id!=online.user_id
or test.user_name!=online.user_name
or test.order_id!=online.order_id
or test.city_id!= online.city_id
or test.city_name!= online.city_name

### 3.3 差异数据发现

select left_table.*,
right_table.*
from (
select *
from mart_online.fact_user_order_day
where dt=20190413
)left_table
full outer join (
select *
from mart_test.fact_user_order_day
where dt=20190413
)right_table
on left_table.id = right_table.id
and left_table.dt = right_table.dt
where COALESCE(left_table.user_id, 0) <> COALESCE(right_table.user_id, 0)
or COALESCE(left_table.user_name, 0) <> COALESCE(right_table.user_name, 0)
or COALESCE(left_table.order_id, 0) <> COALESCE(right_table.order_id, 0)
or COALESCE(left_table.city_id, 0) <> COALESCE(right_table.city_id, 0)
or COALESCE(left_table.city_name, 0) <> COALESCE(right_table.city_name, 0)
or COALESCE(left_table.字段n……, 0) <> COALESCE(right_table.字段n……, 0)
不一致的条数：[5660]，case如下表所示：
+-------+----------------+------------------+---------------+---------------+
|id     |left_user_id    |left_字段n……       |right_user_id  |right_字段n……   |
+-------+----------------+------------------+---------------+---------------+
|      0|               1|             哇哈哈|              1|           养乐多|
+-------+----------------+------------------+---------------+---------------+

## 四、总结

• 16
点赞
• 131
收藏
觉得还不错? 一键收藏
• 5
评论
10-12 2599
08-06 335
05-09 1万+
09-22 775
08-24
05-09 5128

### “相关推荐”对你有帮助么？

• 非常没帮助
• 没帮助
• 一般
• 有帮助
• 非常有帮助

1.余额是钱包充值的虚拟货币，按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载，可以购买VIP、付费专栏及课程。