PostgreSQL 十亿级模糊查询最佳实践

最新推荐文章于 2023-12-27 17:04:47 发布

weixin_34007906

最新推荐文章于 2023-12-27 17:04:47 发布

阅读量3.8k

点赞数

文章标签：数据库大数据前端 ViewUI

原文链接：https://yq.aliyun.com/articles/74480

版权

背景

前模糊(有前缀的模糊)，后模糊(有后缀的模糊)，前后模糊(无前后缀的模糊)，正则匹配都属于文本搜索领域常见的需求。

PostgreSQL拥有很强的文本搜索能力，除了支持全文检索，还支持模糊查询、正则查询。内置的pg_trgm插件是一般数据库没有的，可能很多人没有听说过。同时还内置了表达式索引、GIN索引的功能。

不同的模糊查询需求，有不同的优化方法。

对于前模糊和后模糊，PostgreSQL则与其他数据库一样，可以使用btree来加速。后模糊可以使用反转函数的函数索引来加速。

对于前后模糊和正则匹配，一种方法是使用pg_trgm插件，利用GIN索引加速模糊和正则查询（输入3个或3个以上字符的模糊查询效果很好）。另一种方法是自定义GIN表达式索引的方法，适合于定制的模糊查询。

一、前模糊与后模糊的优化

1. 前模糊(有前缀的模糊)优化方法

使用b-tree可以支持前模糊的查询。

1.1 当使用类型默认的index ops class时，仅适合于collate="C"的查询（当数据库默认的lc_collate<>C时，索引和查询都需要明确指定collate "C"）。

索引、查询条件的collate必须一致才能使用索引。

例子

test=# create table test(id int, info text);      
CREATE TABLE      
test=# insert into test select generate_series(1,1000000),md5(random()::text);      
INSERT 0 1000000      
test=# create index idx on test(info collate "C");      
CREATE INDEX      
      
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";      
                                                      QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------      
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1)      
   Output: id, info      
   Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text))      
   Filter: (test.info ~~ 'abcd%'::text COLLATE "C")      
   Buffers: shared hit=18 read=3      
 Planning time: 0.424 ms      
 Execution time: 0.124 ms      
(7 rows)

1.2 当数据库默认的lc_collate<>C时，还有一种方法让b-tree索引支持模糊查询。使用对应类型的pattern ops，使用pattern ops将使用字符的查询方式而非binary的搜索方式。

文档中有如下解释

https://www.postgresql.org/docs/9.6/static/indexes-opclass.html

The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops   
support B-tree indexes on the types text, varchar, and char respectively.   
  
The difference from the default operator classes is that the values are compared strictly   
character by character rather than according to the locale-specific collation rules.   
  
This makes these operator classes suitable for use by queries involving pattern   
matching expressions (LIKE or POSIX regular expressions) when the database   
does not use the standard "C" locale.

例子

test=# drop table test;  
DROP TABLE  
test=# create table test(id int, info text);      
CREATE TABLE  
test=# insert into test select generate_series(1,1000000),md5(random()::text);      
INSERT 0 1000000  
test=# create index idx on test(info text_pattern_ops);   
CREATE INDEX  
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "zh_CN";      
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.038..0.059 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))  
   Filter: (test.info ~~ 'abcd%'::text COLLATE "zh_CN")  
   Buffers: shared hit=12 read=3  
 Planning time: 0.253 ms  
 Execution time: 0.081 ms  
(7 rows)  
  
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";      
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.027..0.050 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))  
   Filter: (test.info ~~ 'abcd%'::text COLLATE "C")  
   Buffers: shared hit=15  
 Planning time: 0.141 ms  
 Execution time: 0.072 ms  
(7 rows)

使用类型对应的pattern ops，索引搜索不仅支持LIKE的写法，还支持规则表达式的写法，如下：

test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '^abcd';  
                                                      QUERY PLAN                                                        
----------------------------------------------------------------------------------------------------------------------  
 Index Scan using idx on public.test  (cost=0.42..16.76 rows=100 width=37) (actual time=0.031..0.061 rows=12 loops=1)  
   Output: id, info  
   Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))  
   Filter: (test.info ~ '^abcd'::text)  
   Buffers: shared hit=15  
 Planning time: 0.213 ms  
 Execution time: 0.083 ms  
(7 rows)

2. 后模糊(有后缀的模糊)的优化方法

使用反转函数(reverse)索引，可以支持后模糊的查询。

最低0.47元/天解锁文章

weixin_34007906

关注

0
点赞
踩
8

收藏

觉得还不错? 一键收藏
0
评论
PostgreSQL 十亿级模糊查询最佳实践

标签PostgreSQL , 模糊查询 , 正则查询 , pg_trgm , bytea , gin , 函数索引背景前模糊(有前缀的模糊)，后模糊(有后缀的模糊)，前后模糊(无前后缀的模糊)，正则匹配都属于文本搜索领域常见的需求。PostgreSQL拥有很强的文本搜索能力，除了支持全文检索，还支持模糊查询、正则查询。内置的pg_tr...
复制链接

扫一扫