Deduplication is a common requirement in databases; typical cases include single-column deduplication, multi-column deduplication, and whole-row deduplication. In PostgreSQL we can use a different method for each of these requirements.
1. Single-column deduplication
Single-column deduplication is probably the most common case: remove the rows whose value in a given column is duplicated, keeping either the newest or the oldest record as required.
—Create test data
bill=# create table test1(id int primary key, c1 int, c2 timestamp);
CREATE TABLE
bill=# insert into test1 select generate_series(1,1000000), random()*1000, clock_timestamp();
INSERT 0 1000000
bill=# create index idx_test1 on test1(c1,id);
CREATE INDEX
—Method 1: aggregation with NOT IN
Keep only the row with the largest id for each c1 value and delete everything else:
bill=# explain delete from test1 where id not in (select max(id) from test1 group by c1);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Delete on test1 (cost=30609.23..48515.23 rows=500000 width=6)
-> Seq Scan on test1 (cost=30609.23..48515.23 rows=500000 width=6)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> GroupAggregate (cost=0.42..30606.73 rows=1001 width=8)
Group Key: test1_1.c1
-> Index Only Scan using idx_test1 on test1 test1_1 (cost=0.42..25596.72 rows=1000000 width=8)
(7 rows)
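The keep-the-latest semantics of method 1 (one surviving row per c1 value, the one with the largest id) can be illustrated outside the database. A minimal Python sketch, with hypothetical sample rows:

```python
# Method 1 semantics: for each c1 value keep only the row with the max id,
# mirroring DELETE ... WHERE id NOT IN (SELECT max(id) ... GROUP BY c1).
def dedup_keep_latest(rows):
    """rows: list of (id, c1) tuples; returns the surviving rows."""
    max_id = {}
    for rid, c1 in rows:
        if c1 not in max_id or rid > max_id[c1]:
            max_id[c1] = rid
    keep = set(max_id.values())          # the ids that survive the delete
    return [r for r in rows if r[0] in keep]

rows = [(1, 10), (2, 10), (3, 20), (4, 20), (5, 30)]
print(dedup_keep_latest(rows))  # [(2, 10), (4, 20), (5, 30)]
```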
—Method 2: window function with IN
Rank the rows within each c1 group with row_number(); every row whose rank is not 1 is a duplicate, and its id can then be fed to a DELETE via an IN subquery. The plan below is for the select that identifies the duplicate ids:
bill=# explain select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Subquery Scan on t (cost=0.42..55596.72 rows=995000 width=4)
Filter: (t.rn <> 1)
-> WindowAgg (cost=0.42..43096.72 rows=1000000 width=16)
-> Index Only Scan using idx_test1 on test1 (cost=0.42..25596.72 rows=1000000 width=8)
(4 rows)
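The window-function approach numbers rows within each c1 partition in id order; rows with a number other than 1 are duplicates. A sketch of that ranking in Python (sample data hypothetical; note that because the SQL orders by id ascending, this variant keeps the oldest id per group):

```python
from itertools import groupby

# Emulate row_number() over (partition by c1 order by id): sort by (c1, id),
# number rows within each c1 group, and collect ids with rn <> 1 for deletion.
def duplicate_ids(rows):
    """rows: list of (id, c1); returns ids whose rn within their c1 group > 1."""
    dupes = []
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    for c1, group in groupby(ordered, key=lambda r: r[1]):
        for rn, (rid, _) in enumerate(group, start=1):
            if rn != 1:
                dupes.append(rid)
    return dupes

rows = [(1, 10), (2, 10), (3, 20), (4, 20), (5, 30)]
print(duplicate_ids(rows))  # [2, 4]
```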
—Method 3: cursor traversal
Walk the table with a cursor ordered by (c1, id), comparing each row's c1 with the previous row's and deleting the current row when they match (so the first row of each group survives):
bill=# do language plpgsql $$
bill$# declare
bill$# v_rec record;
bill$# v_c1 int;
bill$# cur1 cursor for select c1,id from test1 order by c1,id for update;
bill$# begin
bill$# for v_rec in cur1 loop
bill$# if v_rec.c1 = v_c1 then
bill$# delete from test1 where current of cur1;
bill$# end if;
bill$# v_c1 := v_rec.c1;
bill$# end loop;
bill$# end;
bill$# $$;
DO
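The cursor loop above makes a single ordered pass over the table, deleting any row whose c1 repeats the previous row's. The same one-pass logic in Python (a sorted list stands in for the cursor; names are illustrative):

```python
# One ordered pass, as in the plpgsql cursor loop: rows sorted by (c1, id);
# drop the current row whenever its c1 equals the previous row's c1.
def cursor_dedup(rows):
    """rows: list of (c1, id); returns the rows that survive the pass."""
    survivors = []
    prev_c1 = None                     # like the initially-NULL v_c1
    for c1, rid in sorted(rows):
        if c1 != prev_c1:              # first row of a new c1 group: keep it
            survivors.append((c1, rid))
        prev_c1 = c1                   # remember c1 for the next iteration
    return survivors

rows = [(10, 1), (10, 2), (20, 3), (30, 5), (20, 4)]
print(cursor_dedup(rows))  # [(10, 1), (20, 3), (30, 5)]
```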
Of the three approaches above, method 2 is the most efficient, followed by method 3.