查询重复数据（某个字段允许指定范围内偏移）

最新推荐文章于 2024-04-17 15:25:46 发布

置顶 Jess_eagle

最新推荐文章于 2024-04-17 15:25:46 发布

阅读量551

点赞数

分类专栏：数据库文章标签： postgresql 数据库重复数据查询

本文链接：https://blog.csdn.net/u011099093/article/details/78795535

版权

数据库专栏收录该内容

10 篇文章 0 订阅

订阅专栏

前一段时间遇到一个问题，要求在pg数据库环境下编写一个sql，实现相同记录在不同商家的查询规则，因为相对于商家来说有些在a商店购入的记录会在b商店显示为卖出，而且交易的时间有时也不会完全一样，针对这样的情况，我尝试了许多次但都不能使用一个sql来实现，遂使用了临时表的方式来进行多次查询操作：

介绍下表和字段：原始表tb，字段pa1,pa2与pb1,pb2相对应不同方向的相关字段，pc为时间字段，pd为方向；

首先第一步，把所有现有的指定字段相同交易方向的重复数据剔除，不重复的数据放在临时表中备用：

drop table if exists tmp_tb1;

select * into tb from(select * ,row_number() over(partition by pa1,pa2,pb1,pb2,pc,pd order by pc)t from tb )tp where tp.t=1;

select ...into... from用来将查询的数据导入一个不存在的表；

第二步，将交易数据的改成交易方向相同的数据，交易方向保持不变：

drop table if exists tmp_tb2;

select *,row_number() over(partition by pa1,pa2,pb1,pb2,pc ,pd order by pc) t into tmp_tb2 from(select id,pb1 pa1,pb2 pa2,pa1 pb1,pa2 pb2,pc from tmp_tb1 where pd like '%进%'

union all

select id,pa1,pa2,pb1,pb2,pc,pd from tmp_tb1 where pd like '%出%')tp;

第三步，获取指定偏移时间内的数据：

drop if exists tmp_tb3;

select id,pa1,pa2,pb1,pb2,pc,pd,id1,id2,bmini,cmini into tmp_tb3 from(

select a.*,b.id id1,c.id id2,abs(extract(epoch from(a.pc-b.pc))) bmini,abs(extract(epoch from(a.pc-c.pc))) cmini from

tmp_tb2 a

left join tmp_tb2 b on a.pd<>b.pd and a.pa1=b.pa1 and a.pa2=b.pa2 and a.pb1=b.pb1 and a.pb2=b.pb2 and a.t=b.t-1

and abs(extract(epoch from(a.pc-b.pc))) between 0 and(60*$filter$)

left join tmp_tb2 c

on a.pd<>b.pd and a.pa1=c.pa1 and a.pa2=c.pa2 and a.pb1=c.pb1 and a.pb2=c.pb2 and a.t=c.t+1

and abs(extract(epoch from(a.pc-c.pc))) between 0 and(60*$filter$)

)tp where bmini is not null or cmini is not null;

上述sql中$filter$为指定偏移分钟的占位符，abs(extract(epoch from(a.pc-c.pc)))最内层epoch from配合extract可计算出指定时间的秒数，可能为负，所以取绝对值；

最后，查询最终结果：

select * from tb where id in(select id from tmp_tb3)

如果想删除这部分重复数据：

delete from tb where id in（select id from tmp_tb3 where id2 in（select id from tmp_tb3 where id1 is not null））

-----------20190426更新--------------

经过线上测试发现，如果要经过上面的方式对不同方向且时间存在偏差的数据进行去重，当数据量很大时在第一步第二步会很慢，故将上述sql整合为一个sql：

with tmp as(
    -- 先把所有按列不重复的数据调换进出方向查出来备用，并根据关键字段排序
    --注意这里要保证排序后的顺序是固定的，因为之前没按id排序导致查询的结果每次排序都不同
    with tmptb  as(    
        select row_number() over(partition by kh,mc,je,dskh,dsmc order by sj,id)tt,
        id,kh,mc,sj,je,fx,dskh,dsmc from(
            select tp.id,case when fx like '%出%' then kh else dskh end kh,
            case when fx like '%出%' then mc else dsmc end mc,sj,je,fx,
            case when fx like '%出%' then dskh else kh end dskh,
            case when fx like '%出%' then dsmc else mc end dsmc from
            (select distinct id,jlid from tbname where jlid =1)aa
            inner join
            (select row_number() over(partition by kh,fx,je,dskh,sj order by id)t,
            id,kh,mc,sj,je,fx,dskh,dsmc from tbname where jlid=1)tp
            on(aa.id=tp.id and tp.t=1)
        )tp1
    )
    -- 再比较相邻重复数据是否在所选时间偏移范围
    select id,kh,mc,sj,je,fx,dskh,dsmc,id1,id2,bmini,cmini from(
        select tb2.id id1,tb3.id id2,
        abs(extract(epoch from(tb1.sj-tb2.sj)))bmini,
        abs(extract(epoch from(tb1.sj-tb3.sj)))cmini,tb1.* from tmptb tb1
        left join tmptb tb2 on
        (tb1.kh=tb2.kh and coalesce(tb1.mc,'1')=coalesce(tb2.mc,'1') 
        and tb1.je=tb2.je and tb1.dskh=tb2.dskh and 
        coalesce(tb1.dsmc,'1')=coalesc(tb2.dsmc,'1') and tb1.tt=tb2.tt-1 
        and tb1.fx<>tb2.fx
        and abs(extract(epoch from(tb1.sj-tb2.sj))) between 0 and(60*1)
        left join tmptb tb3 on
        (tb1.kh=tb3.kh and coalesce(tb1.mc,'1')=coalesce(tb3.mc,'1') 
        and tb1.je=tb3.je and tb1.dskh=tb3.dskh and 
        coalesce(tb1.dsmc,'1')=coalesc(tb3.dsmc,'1') and tb1.tt=tb3.tt-1 
        and tb1.fx<>tb3.fx
        and abs(extract(epoch from(tb1.sj-tb3.sj))) between 0 and(60*1)
    )tpp where bmini is not null or cmini is not null
) 
-- 查出多余的重复数据id
select id from tbname where sjid=1 and id in(select id from tmp where id2 in
(select id from tmp where id1 is not null))