原始数据如下
date | id_no | state |
---|---|---|
20220131 | 30001 | 101 |
20220228 | 30001 | 101 |
20220331 | 30001 | 102 |
20220430 | 30001 | 103 |
20220531 | 30001 | 101 |
20220630 | 30001 | 101 |
20220731 | 30001 | 101 |
20220831 | 30001 | 102 |
20220930 | 30001 | 101 |
20220331 | 30002 | 101 |
20220430 | 30002 | 102 |
20220531 | 30002 | 103 |
20220630 | 30002 | 103 |
20220731 | 30002 | 103 |
要求向下合并,并转成拉链表,形式如下
id_no | state | begin_date | end_date |
---|---|---|---|
30001 | 101 | 20220131 | 20220331 |
30001 | 102 | 20220331 | 20220430 |
30001 | 103 | 20220430 | 20220531 |
30001 | 101 | 20220531 | 20220831 |
30001 | 102 | 20220831 | 20220930 |
30002 | 101 | 20220331 | 20220430 |
30002 | 102 | 20220430 | 20220531 |
30002 | 103 | 20220531 | 20220731 |
由于是向下合并,每个id_no的最后一个date的状态会被舍去,如果需要加上这个状态,文末会有方法。
为了大家理解方便,我没有把代码写在一起,采取分步制表形式,另外我的数据库版本很老,不支持with X as的语句。。。
步骤一加序号
先按照id_no分组,按照date排序加上序号(order_no)
CREATE TABLE TMP_cut_to_zip1 AS
select date,id_no,state,ROW_NUMBER() OVER(PARTITION BY id_no ORDER BY date) as order_no from TMP_cut_to_zip
date | id_no | state | order_no |
---|---|---|---|
20220131 | 30001 | 101 | 1 |
20220228 | 30001 | 101 | 2 |
20220331 | 30001 | 102 | 3 |
20220430 | 30001 | 103 | 4 |
20220531 | 30001 | 101 | 5 |
20220630 | 30001 | 101 | 6 |
20220731 | 30001 | 101 | 7 |
20220831 | 30001 | 102 | 8 |
20220930 | 30001 | 101 | 9 |
20220331 | 30002 | 101 | 1 |
20220430 | 30002 | 102 | 2 |
20220531 | 30002 | 103 | 3 |
20220630 | 30002 | 103 | 4 |
20220731 | 30002 | 103 | 5 |
步骤二去重
我们向下向上自关联一下表格,观察一下情况
select *
from TMP_cut_to_zip1 a
left join TMP_cut_to_zip1 b on a.order_no=b.order_no-1 and a.id_no=b.id_no
left join TMP_cut_to_zip1 c on a.order_no=c.order_no+1 and a.id_no=c.id_no
order by 2,1
date | id_no | state | order_no | date_1 | id_no_1 | state_1 | order_no_1 | date_2 | id_no_2 | state_2 | order_no_2 |
---|---|---|---|---|---|---|---|---|---|---|---|
20220131 | 30001 | 101 | 1 | 20220228 | 30001 | 101 | 2 | None | None | None | None |
20220228 | 30001 | 101 | 2 | 20220331 | 30001 | 102 | 3 | 20220131 | 30001 | 101 | 1 |
20220331 | 30001 | 102 | 3 | 20220430 | 30001 | 103 | 4 | 20220228 | 30001 | 101 | 2 |
20220430 | 30001 | 103 | 4 | 20220531 | 30001 | 101 | 5 | 20220331 | 30001 | 102 | 3 |
20220531 | 30001 | 101 | 5 | 20220630 | 30001 | 101 | 6 | 20220430 | 30001 | 103 | 4 |
20220630 | 30001 | 101 | 6 | 20220731 | 30001 | 101 | 7 | 20220531 | 30001 | 101 | 5 |
20220731 | 30001 | 101 | 7 | 20220831 | 30001 | 102 | 8 | 20220630 | 30001 | 101 | 6 |
20220831 | 30001 | 102 | 8 | 20220930 | 30001 | 101 | 9 | 20220731 | 30001 | 101 | 7 |
20220930 | 30001 | 101 | 9 | None | None | None | None | 20220831 | 30001 | 102 | 8 |
20220331 | 30002 | 101 | 1 | 20220430 | 30002 | 102 | 2 | None | None | None | None |
20220430 | 30002 | 102 | 2 | 20220531 | 30002 | 103 | 3 | 20220331 | 30002 | 101 | 1 |
20220531 | 30002 | 103 | 3 | 20220630 | 30002 | 103 | 4 | 20220430 | 30002 | 102 | 2 |
20220630 | 30002 | 103 | 4 | 20220731 | 30002 | 103 | 5 | 20220531 | 30002 | 103 | 3 |
20220731 | 30002 | 103 | 5 | None | None | None | None | 20220630 | 30002 | 103 | 4 |
以上b是向下关联,c是向上关联,由于我们需要向下合并记录,仔细观察上表,按照ab,ac的关系排列组合
1.a=b,a=c:上下都相同,去除
2.a<>b,a<>c:上下都不同,保留
3.a=b,a<>c:与下相同,与上不同,保留
4.a<>b,a=c:与下不同,与上相同,去除
另外遇到倒数2条记录都相同的情况,尾条必须保留
select *
from TMP_cut_to_zip1 a
left join TMP_cut_to_zip1 b on a.order_no=b.order_no-1 and a.id_no=b.id_no
left join TMP_cut_to_zip1 c on a.order_no=c.order_no+1 and a.id_no=c.id_no
where (a.state<>ifnull(b.state,0) and a.state<>ifnull(c.state,0)) #上下都不同
or (a.state=ifnull(b.state,0) and a.state<>ifnull(c.state,0)) #与下相同与上不同
or b.state is null #保留尾条
order by 2,1
date | id_no | state | order_no | date_1 | id_no_1 | state_1 | order_no_1 | date_2 | id_no_2 | state_2 | order_no_2 |
---|---|---|---|---|---|---|---|---|---|---|---|
20220131 | 30001 | 101 | 1 | 20220228 | 30001 | 101 | 2 | None | None | None | None |
20220331 | 30001 | 102 | 3 | 20220430 | 30001 | 103 | 4 | 20220228 | 30001 | 101 | 2 |
20220430 | 30001 | 103 | 4 | 20220531 | 30001 | 101 | 5 | 20220331 | 30001 | 102 | 3 |
20220531 | 30001 | 101 | 5 | 20220630 | 30001 | 101 | 6 | 20220430 | 30001 | 103 | 4 |
20220831 | 30001 | 102 | 8 | 20220930 | 30001 | 101 | 9 | 20220731 | 30001 | 101 | 7 |
20220930 | 30001 | 101 | 9 | None | None | None | None | 20220831 | 30001 | 102 | 8 |
20220331 | 30002 | 101 | 1 | 20220430 | 30002 | 102 | 2 | None | None | None | None |
20220430 | 30002 | 102 | 2 | 20220531 | 30002 | 103 | 3 | 20220331 | 30002 | 101 | 1 |
20220531 | 30002 | 103 | 3 | 20220630 | 30002 | 103 | 4 | 20220430 | 30002 | 102 | 2 |
20220731 | 30002 | 103 | 5 | None | None | None | None | 20220630 | 30002 | 103 | 4 |
然后就可以提取我们需要的了,顺便再按照id分组date排序加个编号,给下一步用
CREATE TABLE TMP_cut_to_zip2 AS
select a.date,a.id_no,a.state,
ROW_NUMBER() OVER(PARTITION BY a.id_no ORDER BY a.date) as order_no
from TMP_cut_to_zip1 a
left join TMP_cut_to_zip1 b on a.order_no=b.order_no-1 and a.id_no=b.id_no
left join TMP_cut_to_zip1 c on a.order_no=c.order_no+1 and a.id_no=c.id_no
where (a.state<>ifnull(b.state,0) and a.state<>ifnull(c.state,0))
or (a.state=ifnull(b.state,0) and a.state<>ifnull(c.state,0))
or b.state is null
结果如下
date | id_no | state | order_no |
---|---|---|---|
20220131 | 30001 | 101 | 1 |
20220331 | 30001 | 102 | 2 |
20220430 | 30001 | 103 | 3 |
20220531 | 30001 | 101 | 4 |
20220831 | 30001 | 102 | 5 |
20220930 | 30001 | 101 | 6 |
20220331 | 30002 | 101 | 1 |
20220430 | 30002 | 102 | 2 |
20220531 | 30002 | 103 | 3 |
20220731 | 30002 | 103 | 4 |
第三步出结果
先向下自连接,观察一下
select *
from TMP_cut_to_zip2 a
left join TMP_cut_to_zip2 b on a.order_no=b.order_no-1 and a.id_no=b.id_no
order by 2,1
date | id_no | state | order_no | date_1 | id_no_1 | state_1 | order_no_1 |
---|---|---|---|---|---|---|---|
20220131 | 30001 | 101 | 1 | 20220331 | 30001 | 102 | 2 |
20220331 | 30001 | 102 | 2 | 20220430 | 30001 | 103 | 3 |
20220430 | 30001 | 103 | 3 | 20220531 | 30001 | 101 | 4 |
20220531 | 30001 | 101 | 4 | 20220831 | 30001 | 102 | 5 |
20220831 | 30001 | 102 | 5 | 20220930 | 30001 | 101 | 6 |
20220930 | 30001 | 101 | 6 | None | None | None | None |
20220331 | 30002 | 101 | 1 | 20220430 | 30002 | 102 | 2 |
20220430 | 30002 | 102 | 2 | 20220531 | 30002 | 103 | 3 |
20220531 | 30002 | 103 | 3 | 20220731 | 30002 | 103 | 4 |
20220731 | 30002 | 103 | 4 | None | None | None | None |
观察后,取a.state,a.date做开始日期,b.date做结束日期,即可
%%sql
select a.id_no,a.state,a.date as begin_date,b.date as end_date
from TMP_cut_to_zip2 a
inner join TMP_cut_to_zip2 b on a.order_no=b.order_no-1 and a.id_no=b.id_no
order by 1,3
最后结果如下
id_no | state | begin_date | end_date |
---|---|---|---|
30001 | 101 | 20220131 | 20220331 |
30001 | 102 | 20220331 | 20220430 |
30001 | 103 | 20220430 | 20220531 |
30001 | 101 | 20220531 | 20220831 |
30001 | 102 | 20220831 | 20220930 |
30002 | 101 | 20220331 | 20220430 |
30002 | 102 | 20220430 | 20220531 |
30002 | 103 | 20220531 | 20220731 |
补充:如果需要保留每个id的最后一个状态,那需要在原表中加入29991231的日期,此日期沿用最后一个时间的状态。
select (select '29991231')as date,id_no,state from TMP_cut_to_zip a
where not exists(select 1 from TMP_cut_to_zip where id_no=a.id_no and date>a.date)
union
select * from TMP_cut_to_zip
效果如下:
date | id_no | state |
---|---|---|
20220731 | 30001 | 101 |
20220930 | 30001 | 101 |
20220731 | 30002 | 103 |
20220331 | 30001 | 102 |
20220331 | 30002 | 101 |
20220531 | 30001 | 101 |
20220531 | 30002 | 103 |
20220831 | 30001 | 102 |
20220131 | 30001 | 101 |
29991231 | 30002 | 103 |
20220430 | 30001 | 103 |
20220430 | 30002 | 102 |
29991231 | 30001 | 101 |
20220630 | 30001 | 101 |
20220630 | 30002 | 103 |
20220228 | 30001 | 101 |
重复步骤一、二、三的代码即可
CREATE TABLE TMP_cut_to_zip1_1 AS
select date,id_no,state,ROW_NUMBER() OVER(PARTITION BY id_no ORDER BY date) as order_no from
(select (select '29991231')as date,id_no,state from TMP_cut_to_zip a
where not exists(select 1 from TMP_cut_to_zip where id_no=a.id_no and date>a.date)
union
select * from TMP_cut_to_zip)a;
CREATE TABLE TMP_cut_to_zip1_2 AS
select a.date,a.id_no,a.state,
ROW_NUMBER() OVER(PARTITION BY a.id_no ORDER BY a.date) as order_no
from TMP_cut_to_zip1_1 a
left join TMP_cut_to_zip1_1 b on a.order_no=b.order_no-1 and a.id_no=b.id_no
left join TMP_cut_to_zip1_1 c on a.order_no=c.order_no+1 and a.id_no=c.id_no
where (a.state<>ifnull(b.state,0) and a.state<>ifnull(c.state,0)) or (a.state=ifnull(b.state,0) and a.state<>ifnull(c.state,0))
or b.state is null;
select a.id_no,a.state,a.date as begin_date,b.date as end_date
from TMP_cut_to_zip1_2 a
inner join TMP_cut_to_zip1_2 b on a.order_no=b.order_no-1 and a.id_no=b.id_no
order by 1,3
结果如下
id_no | state | begin_date | end_date |
---|---|---|---|
30001 | 101 | 20220131 | 20220331 |
30001 | 102 | 20220331 | 20220430 |
30001 | 103 | 20220430 | 20220531 |
30001 | 101 | 20220531 | 20220831 |
30001 | 102 | 20220831 | 20220930 |
30001 | 101 | 20220930 | 29991231 |
30002 | 101 | 20220331 | 20220430 |
30002 | 102 | 20220430 | 20220531 |
30002 | 103 | 20220531 | 29991231 |