row_num引起的数据差异

一、背景

在对一个Hive历史Job优化后需要比对两张结果表的数据,发现有一张表两个字段有一些差异(优化只是加了优化参数没有修改逻辑,所以数据有差异是不正常的)

二、定位

根据差异字段定位到是一张临时表里面的使用了row_num引起的

1、现象

就是以下代码在同一天不同时间段跑,最终数据量不一致

drop table if exists  tz0002_v2 purge; 
create external table tmp_tz0002_v2 stored as orc tblproperties ('external.table.purge'='true')  as
select cast(vip_card as bigint) member17_nbr
      ,date_format(a.operate_time,'yyyy-MM')   as auto_renew_month    --签约年月
from (select card_no as vip_card,
             case when status='1' then '签约' when status='-1' then '解约' else status end status, 
             operate_time, -- 操作时间
             row_number() over (partition by card_no order by operate_time desc) rn
      from auto_renew_sign_d t1 
      where date_format(t1.operate_time,'yyyy-MM-dd') <= '${yestoday_str}'  
      )a
   where rn=1
  and status='签约'
group by  cast(vip_card as bigint)
        , date_format(a.operate_time,'yyyy-MM')
;
select count(1) as cnt , 'v1' as type from tz0002_v1
union all 
select count(1) as cnt , 'v2' as type from tz0002_v2
+----------+-------+
|   cnt    | type  |
+----------+-------+
| 2748473  | v1|
| 2748463  | v2|
+----------+-------+

2、找出差异的id

select 
    a.*
from 
tz0002_v1  a
left join tz0002_v2  b
on a.member17_nbr=b.member17_nbr
where b.member17_nbr is null;

拿具体差异的id查看数据

select
uid
,card_no
,status
,operate_time
,row_number() over (partition by card_no order by operate_time desc) rn
from  cn_ods_dmp_secure.o_bas_auto_renew_sign_d
where card_no='107421002798006351';
;
+--------------+--------------------+---------+----------------------+-----+
|     uid      |      card_no       | status  |     operate_time     | rn  |
+--------------+--------------------+---------+----------------------+-----+
| 18181734302  | 107421002798006351  | 1       | 2023-05-14 13:56:19  | 1   |
| 18181734302  | 107421002798006351  | -1      | 2023-05-14 13:56:19  | 2   |
| 18181734302  | 107421002798006351  | 1       | 2022-11-29 16:10:29  | 3   |
| 18181734302  | 107421002798006351  | -1      | 2022-04-09 12:06:28  | 4   |
| 18181734302  | 107421002798006351  | 1       | 2021-12-07 13:37:42  | 5   |
+--------------+--------------------+---------+----------------------+-----+

以card_no分组operate_time倒序排序,会发现最新的两条记录 operate_time相同,status不同【那么不同时间跑有可能这次排序号1的是状态等于1的,下次跑是状态等于-1的】,最终会取状态等于1的且rn等于1的数据

最后然后发现每个ID会有一个update_time,而且最新时间两条不同状态的数据update_time不相同,那么在row_num中将operate_time 修改成update_time更加符合具体业务逻辑,得到的数据也更加准确

+--------------------+---------+----------------------+----------------------+
|      card_no       | status  |     operate_time     |     update_time      |
+--------------------+---------+----------------------+----------------------+
| 107421002798006351  | 1       | 2021-12-07 13:37:42  | 2021-12-07 13:37:44  |
| 107421002798006351  | -1      | 2022-04-09 12:06:28  | 2022-04-09 12:06:29  |
| 107421002798006351  | 1       | 2022-11-29 16:10:29  | 2022-11-29 16:10:32  |
| 107421002798006351  | 1       | 2023-05-14 13:56:19  | 2023-05-14 13:56:19  |
| 107421002798006351  | -1      | 2023-05-14 13:56:19  | 2023-05-14 13:56:20  |
+--------------------+---------+----------------------+----------------------+

三 、修复

将代码修改成下面

drop table if exists  tz0002_v2 purge; 
create external table tmp_tz0002_v2 stored as orc tblproperties ('external.table.purge'='true')  as
select cast(vip_card as bigint) member17_nbr
      ,date_format(a.operate_time,'yyyy-MM')   as auto_renew_month    --签约年月
from (select card_no as vip_card,
             case when status='1' then '签约' when status='-1' then '解约' else status end status, 
             operate_time, -- 操作时间
             row_number() over (partition by card_no order by update_time desc) rn
      from auto_renew_sign_d t1 
      where date_format(t1.operate_time,'yyyy-MM-dd') <= '${yestoday_str}'  
      )a
   where rn=1
  and status='签约'
group by  cast(vip_card as bigint)
        , date_format(a.operate_time,'yyyy-MM')
;

经过测试不同时间段跑出来的结果都是相同的,以前同事上线的历史代码修改后再比对数据也是正常的

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值