PostgreSQL 百亿级数据范围查询, 分组排序窗口取值极致优化 case

最新推荐文章于 2024-05-06 21:25:17 发布

weixin_34224941

最新推荐文章于 2024-05-06 21:25:17 发布

阅读量2.0k

点赞数 3

文章标签：数据库

原文链接：https://yq.aliyun.com/articles/39680

版权

本文介绍了针对PostgreSQL数据库的一个案例，如何对一个查询进行多阶段优化，以实现对亿级数据的范围查询、分组排序和窗口取值的极致性能。优化包括利用函数、游标、并行处理和特定索引策略，最终达到毫秒级响应，性能提升一万倍。

摘要由CSDN通过智能技术生成

本文将对一个任意范围按ID分组查出每个ID对应的最新记录的CASE做一个极致的优化体验。
优化后性能维持在可控范围内，任意数据量，毫秒级返回，性能平稳可控。
比优化前性能提升1万倍。

CASE

有一张数据表，结构：   

CREATE TABLE target_position ( 
target_id varchar(80), 
time bigint, 
content text 
); 

数据量是 100 亿条左右   
target_id 大约 20 万个   

数据库使用的是 PostgreSQL 9.4    

需求：   
查询每个目标指定时间段的最新一条数据，要求1秒内返回数据。  
时间段不确定     

现在是使用窗口函数来实现，如下：   
select target_id,time,content from (select *,row_number() over (partition by target_id order by time desc) rid from target_position where time>开始时间 and time<=结束时间) as t where rid=1; 
效果很差。

分析一下原理，这个case其实慢就慢在扫描的时间段，因为需要遍历整个时间段的数据，然后分组排序，取出该时间段内每个target_id的最新一条记录。
这个语句决定了时间段越大，可能的扫描量就越大，时间越久。
直奔最优方案，CASE里有提到，target_id大约20万个，理论上不管要扫描的范围有多大，最多只需要扫描20万条tuple。
怎样做到呢，用函数即可。
首先要开另外一种表维护target_id的唯一值，方便取数据，这个需要应用层配合来做到这一点，其实不难的，就是关系解耦。
下面是测试样本

postgres=# create unlogged table t1(id int, crt_time timestamp);
CREATE TABLE
postgres=# create unlogged table t2(id int primary key);
CREATE TABLE
postgres=# insert into t1 select trunc(random()*200000),clock_timestamp() from generate_series(1,100000000);
INSERT 0 100000000
postgres=# create index idx_t1_1 on t1(id,crt_time desc);
CREATE INDEX
postgres=# select * from t1 limit 10;
   id   |          crt_time          
--------+----------------------------
  49092 | 2016-05-06 16:50:29.88595
    947 | 2016-05-06 16:50:29.887553
 179124 | 2016-05-06 16:50:29.887562
 197308 | 2016-05-06 16:50:29.887564
  93558 | 2016-05-06 16:50:29.887566
 127133 | 2016-05-06 16:50:29.887568
 163507 | 2016-05-06 16:50:29.887569
 110546 | 2016-05-06 16:50:29.887571
  65363 | 2016-05-06 16:50:29.887573
 122666 | 2016-05-06 16:50:29.887575
(10 rows)
postgres=# insert into t2 select generate_series(1,200000);
INSERT 0 200000

来看一个未优化的查询计划和耗时，从查询计划来看，已经很优了，但是由于提供的查询范围内数据量有450多万，所以最后查询的耗时也达到了15秒。

postgres=# explain analyze select * from (select *,row_number() over(partition by id order by crt_time desc) rn from t1 where crt_time between '2016-05-06 16:50:29.887566' and '2016-05-06 16:50:34.887566') t where rn=1;
                                                                                   QUERY PLAN                                                                                    
----------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on t  (cost=0.57..1819615.87 rows=2500 width=20) (actual time=0.083..15301.915 rows=200000 loops=1)
   Filter: (t.rn = 1)
   Rows Removed by Filter: 4320229
   ->  WindowAgg  (cost=0.57..1813365.87 rows=500000 width=12) (actual time=0.078..14012.867 rows=4520229 loops=1)
         ->  Index Only Scan using idx_t1_1 on t1  (cost=0.57..1804615.87 rows=500000 width=12) (actual time=0.066..10603.161 rows=4520229 loops=1)
               Index Cond: ((crt_time >= '2016-05-06 16:50:29.887566'::timestamp without time zone) AND (crt_time <= '2016-05-06 16:50:34.887566'::timestamp without time zone))
               Heap Fetches: 4520229
 Planning time: 0.202 ms
 Execution time: 15356.066 ms
(9 rows)

优化阶段1

通过online code循环，性能提升到了秒级。

postgres=# do language plpgsql 
$$
  
declare
x int;
begin
  for x in select id from t2 loop
    perform * from t1 wher

最低0.47元/天解锁文章

weixin_34224941

关注

3
点赞
踩
12

收藏

觉得还不错? 一键收藏
0
评论
复制链接

分享到 QQ

分享到新浪微博

扫一扫

PostgreSQL 百亿级数据范围查询, 分组排序窗口取值 极致优化 case

CASE

优化阶段1

PostgreSQL 百亿级数据范围查询, 分组排序窗口取值极致优化 case