标签
PostgreSQL , 模糊查询 , 正则查询 , pg_trgm , bytea , gin , 函数索引
背景
前模糊(有前缀的模糊),后模糊(有后缀的模糊),前后模糊(无前后缀的模糊),正则匹配都属于文本搜索领域常见的需求。
PostgreSQL拥有很强的文本搜索能力,除了支持全文检索,还支持模糊查询、正则查询。内置的pg_trgm插件是一般数据库没有的,可能很多人没有听说过。同时还内置了表达式索引、GIN索引的功能。
不同的模糊查询需求,有不同的优化方法。
对于前模糊和后模糊,PostgreSQL则与其他数据库一样,可以使用btree来加速。后模糊可以使用反转函数的函数索引来加速。
对于前后模糊和正则匹配,一种方法是使用pg_trgm插件,利用GIN索引加速模糊和正则查询(输入3个或3个以上字符的模糊查询效果很好)。另一种方法是自定义GIN表达式索引的方法,适合于定制的模糊查询。
一、前模糊与后模糊的优化
1. 前模糊(有前缀的模糊)优化方法
使用b-tree可以支持前模糊的查询。
1.1 当使用类型默认的index ops class时,仅适合于collate="C"的查询(当数据库默认的lc_collate<>C时,索引和查询都需要明确指定collate "C")。
索引、查询条件的collate必须一致才能使用索引。
例子
test=# create table test(id int, info text);
CREATE TABLE
test=# insert into test select generate_series(1,1000000),md5(random()::text);
INSERT 0 1000000
test=# create index idx on test(info collate "C");
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.057..0.093 rows=18 loops=1)
Output: id, info
Index Cond: ((test.info >= 'abcd'::text) AND (test.info < 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "C")
Buffers: shared hit=18 read=3
Planning time: 0.424 ms
Execution time: 0.124 ms
(7 rows)
1.2 当数据库默认的lc_collate<>C时,还有一种方法让b-tree索引支持模糊查询。使用对应类型的pattern ops,使用pattern ops将使用字符的查询方式而非binary的搜索方式。
文档中有如下解释
https://www.postgresql.org/docs/9.6/static/indexes-opclass.html
The operator classes text_pattern_ops, varchar_pattern_ops, and bpchar_pattern_ops
support B-tree indexes on the types text, varchar, and char respectively.
The difference from the default operator classes is that the values are compared strictly
character by character rather than according to the locale-specific collation rules.
This makes these operator classes suitable for use by queries involving pattern
matching expressions (LIKE or POSIX regular expressions) when the database
does not use the standard "C" locale.
例子
test=# drop table test;
DROP TABLE
test=# create table test(id int, info text);
CREATE TABLE
test=# insert into test select generate_series(1,1000000),md5(random()::text);
INSERT 0 1000000
test=# create index idx on test(info text_pattern_ops);
CREATE INDEX
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "zh_CN";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.038..0.059 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "zh_CN")
Buffers: shared hit=12 read=3
Planning time: 0.253 ms
Execution time: 0.081 ms
(7 rows)
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info like 'abcd%' collate "C";
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.027..0.050 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~~ 'abcd%'::text COLLATE "C")
Buffers: shared hit=15
Planning time: 0.141 ms
Execution time: 0.072 ms
(7 rows)
使用类型对应的pattern ops,索引搜索不仅支持LIKE的写法,还支持规则表达式的写法,如下:
test=# explain (analyze,verbose,timing,costs,buffers) select * from test where info ~ '^abcd';
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Index Scan using idx on public.test (cost=0.42..16.76 rows=100 width=37) (actual time=0.031..0.061 rows=12 loops=1)
Output: id, info
Index Cond: ((test.info ~>=~ 'abcd'::text) AND (test.info ~<~ 'abce'::text))
Filter: (test.info ~ '^abcd'::text)
Buffers: shared hit=15
Planning time: 0.213 ms
Execution time: 0.083 ms
(7 rows)
2. 后模糊(有后缀的模糊)的优化方法
使用反转函数(reverse)索引,可以支持后模糊的查询。