PostgreSQL 数据去重大法

最新推荐文章于 2023-10-16 14:52:41 发布

宁延安

最新推荐文章于 2023-10-16 14:52:41 发布

阅读量1.9k

点赞数

文章标签： c postgresql velocity Lotus JNI

本文链接：https://blog.csdn.net/qq_28125719/article/details/84897315

版权

阅读原文请点击：[url]http://click.aliyun.com/m/22549/[/url]
摘要：标签 PostgreSQL , 去重 , 单列去重 , 多列去重 , 行去重 , 多列混合去重 , varidict 参数 , 数组排序 , 数组元素重排背景去重的需求比较常见，去重也可以衍生出很多变种。

标签

PostgreSQL , 去重 , 单列去重 , 多列去重 , 行去重 , 多列混合去重 , varidict 参数 , 数组排序 , 数组元素重排

背景

去重的需求比较常见，去重也可以衍生出很多变种。例如

1. 单列去重，很好理解，就是按某列去除重复记录。保留规则（例如保留最新的，保留最旧的，或者保留某个其他字段最大的）。

2. 多列去重，按多列，去除重复记录。保留规则（例如保留最新的，保留最旧的，或者保留某个其他字段最大的）。

3. 行去重，按行，去除重复记录。保留规则（例如保留最新的，保留最旧的，或者保留某个其他字段最大的）。

4. 多列混合去重，按多列混合模式(ROW1: col1=1 , col2=2；ROW2: col1=2, col2=1；这种混合重复的去重)，去除重复记录。保留规则（例如保留最新的，保留最旧的，或者保留某个其他字段最大的）。

下面依次举例，使用不同的方法去重，用户可以自由选择效率最佳的。

单列去重

测试数据

create table test1(id int primary key, c1 int, c2 timestamp);
insert into test1 select generate_series(1,1000000), random()*1000, clock_timestamp();

create index idx_test1 on test1(c1,id);
-- 这个索引可以起到加速效果。如果没有这个索引，以下三种方法，第二种效率最高，其次是第三种。
需求：去除c1重复的行，保留id最大的。

方法1，使用聚合，not in

postgres=# explain delete from test1 where id not in (select max(id) from test1 group by c1);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Delete on test1 (cost=35115.63..53023.01 rows=500055 width=6)
-> Seq Scan on test1 (cost=35115.63..53023.01 rows=500055 width=6)
Filter: (NOT (hashed SubPlan 1))
SubPlan 1
-> GroupAggregate (cost=0.42..35113.13 rows=1001 width=8)
Group Key: test1_1.c1
-> Index Only Scan using idx_test1 on test1 test1_1 (cost=0.42..30102.57 rows=1000110 width=8)
(7 rows)
Time: 0.564 ms

postgres=# delete from test1 where id not in (select max(id) from test1 group by c1);
DELETE 998999
Time: 1126.504 ms (00:01.127)
方法2，使用窗口查询，IN

postgres=# explain select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1;
QUERY PLAN
--------------------------------------------------------------------------------------------------
Subquery Scan on t (cost=0.42..60075.54 rows=995109 width=4)
Filter: (t.rn <> 1)
-> WindowAgg (cost=0.42..47574.17 rows=1000110 width=16)
-> Index Only Scan using idx_test1 on test1 (cost=0.42..30072.24 rows=1000110 width=8)
(4 rows)
Time: 0.512 ms

postgres=# delete from test1 where id in (select id from (select row_number() over(partition by c1 order by id) as rn, id from test1) t where t.rn<>1);
DELETE 998999
Time: 2430.276 ms (00:02.430)
方法3，使用PLPGSQL，内部使用排序+游标。

每条记录判断一次的方法去重，只有一次排序+每条记录比对的开销。

do language plpgsql $$
declare
v_rec record;
v_c1 int;
cur1 cursor for select c1,id from test1 order by c1,id for update;
begin
for v_rec in cur1 loop
if v_rec.c1 = v_c1 then
delete from test1 where current of cur1;
end if;
v_c1 := v_rec.c1;
end loop;
end;
$$;

DO
Time: 7345.773 ms (00:07.346)

postgres=# select count(*) from test1;
count
-------
1001
(1 row)

Time: 61.672 ms
postgres=# select * from test1 limit 10;
id | c1 | c2
----+-----+----------------------------
1 | 582 | 2017-06-02 10:21:10.60918
2 | 278 | 2017-06-02 10:21:10.609331
3 | 659 | 2017-06-02 10:21:10.609338
4 | 372 | 2017-06-02 10:21:10.609341
5 | 184 | 2017-06-02 10:21:10.609343
6 | 121 | 2017-06-02 10:21:10.609345
7 | 132 | 2017-06-02 10:21:10.609347
8 | 290 | 2017-06-02 10:21:10.609348
9 | 980 | 2017-06-02 10:21:10.60935
10 | 305 | 2017-06-02 10:21:10.609352
(10 rows)
PostgreSQL 10黑科技

即使只有部分驱动列，也能使用索引排序。

例如index(c1)，可以用于order by c1,id;

《PostgreSQL 10.0 preview 优化器改进 - 不完整索引支持复合排序》

多列去重

测试数据

create table test1(id int primary key, c1 int, c2 int, c3 timestamp);
insert into test1 select generate_series(1,1000000), random()*1000, random()*1000, clock_timestamp();

create index idx_test1 on test1(c1,c2,id);
-- 这个索引可以起到加速效果。
需求：去除c1,c2重复的行，保留id最大的。

方法1，

postgres=# explain (analyze,verbose,timing,costs,buffers) delete from test1 where id not in (select max(id) from test1 group by c1,c2);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Delete on public.test1 (cost=40820.36..59690.36 rows=500000 width=6) (actual time=1634.960..1634.960 rows=0 loops=1)
Buffers: shared hit=1378788
-> Seq Scan on public.test1 (cost=40820.36..59690.36 rows=500000 width=6) (actual time=1090.956..1446.374 rows=367618 loops=1)
Output: test1.ctid
Filter: (NOT (hashed SubPlan 1))
Rows Removed by Filter: 632382
Buffers: shared hit=1011170
SubPlan 1
-> GroupAggregate (cost=0.42..40570.36 rows=100000 width=12) (actual time=0.035..842.497 rows=632382 loops=1)
Output: max(test1_1.id), test1_1.c1, test1_1.c2
Group Key: test1_1.c1, test1_1.c2
Buffers: shared hit=1004800
-> Index Only Scan using idx_test1 on public.test1 test1_1 (cost=0.42..32070.36 rows=1000000 width=12) (actual time=0.027..587.506 rows=1000000 loops=1)
Output: test1_1.c1, test1_1.c2, test1_1.id
Heap Fetches: 1000000
Buffers: shared hit=1004800
Planning time: 0.211 ms
Execution time: 1641.679 ms
(18 rows)
方法2，
阅读原文请点击：[url]http://click.aliyun.com/m/22549/[/url]

宁延安

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
PostgreSQL 数据去重大法

阅读原文请点击：[url]http://click.aliyun.com/m/22549/[/url]摘要：标签 PostgreSQL , 去重 , 单列去重 , 多列去重 , 行去重 , 多列混合去重 , varidict 参数 , 数组排序 , 数组元素重排背景去重的需求比较常见，去重也可以衍生出很多变种。标签PostgreSQL , 去重 , 单列去重 , 多列去重...
复制链接

扫一扫