说明:
数据库中常用到模糊查询基本分为三类:后模糊(abc%)、前模糊(%abc)、前后模糊(%abc%),一般数据库中都能够支持后模糊查询,但是对于前模糊和前后模糊的支持并不好,但是PostgreSQL对于模糊查询的支持可以说是相当强大,针对不同场景有不同的优化手段,在PostgreSQL中对于这三种模糊查询采用的方法大致如下:
后模糊(abc%):可以使用btree来优化
前模糊(%abc):可以利用reverse函数建立函数索引来优化
前后模糊(%abc%):常利用pg_trgm插件和gin索引来优化
前模糊和后模糊:
后模糊(abc%):
1、利用btree索引来进行优化即可,但是需要注意使用类型默认的index ops class时,仅适合于collate="C"的查询,例如:
bill=# create table t1(id int,info text);
CREATE TABLE
bill=# insert into t1 select generate_series(1,100000),hashtext(random()::text);
INSERT 0 100000
bill=# create index idx_t1 on t1 using btree(info collate "C");
CREATE INDEX
bill=# explain (analyze,verbose,timing,costs,buffers) select * from t1 where info like 'abc%' collate "C";
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Scan using idx_t1 on public.t1 (cost=0.42..3.04 rows=10 width=14) (actual time=0.005..0.005 rows=0 loops=1)
Output: id, info
Index Cond: ((t1.info >= 'abc'::text) AND (t1.info < 'abd'::text))
Filter: (t1.info ~~ 'abc%'::text COLLATE "C")
Buffers: shared hit=3
Planning Time: 0.414 ms
Execution Time: 0.024 ms
(7 rows)
2、当collate<>"C"时,使用对应类型的pattern ops来让btree索引支持模糊查询
bill=# create index idx_t1 on t1 using btree(info text_pattern_ops);
CREATE INDEX
bill=# explain (analyze,verbose,timing,costs,buffers) select * from t1 where info like 'abc%';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Scan using idx_t1 on public.t1 (cost=0.42..3.04 rows=10 width=14) (actual time=0.027..0.027 rows=0 loops=1)
Output: id, info
Index Cond: ((t1.info ~>=~ 'abc'::text) AND (t1.info ~<~ 'abd'::text))
Filter: (t1.info ~~ 'abc%'::text)
Buffers: shared read=3
Planning Time: 0.335 ms
Execution Time: 0.054 ms
(7 rows)
bill=# explain (analyze,verbose,timing,costs,buffers) select * from t1 where info like 'abc%' collate "C";
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Scan using idx_t1 on public.t1 (cost=0.42..3.04 rows=10 width=14) (actual time=0.027..0.027 rows=0 loops=1)
Output: id, info
Index Cond: ((t1.info ~>=~ 'abc'::text) AND (t1.info ~<~ 'abd'::text))
Filter: (t1.info ~~ 'abc%'::text COLLATE "C")
Buffers: shared hit=3
Planning Time: 0.133 ms
Execution Time: 0.056 ms
(7 rows)
同时还能支持规则表达式的写法,如下:
bill=# explain (analyze,verbose,timing,costs,buffers) select * from t1 where info ~ '^abc';
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------
Index Scan using idx_t1 on public.t1 (cost=0.42..3.04 rows=10 width=14) (actual time=0.009..0.009 rows=0 loops=1)
Output: id, info
Index Cond: ((t1.info ~>=~ 'abc'::text) AND (t1.info ~<~ 'abd'::text))
Filter: (t1.info ~ '^abc'::text)
Buffers: shared hit=3
Planning Time: 0.210 ms
Execution Time: 0.026 ms
(7 rows)
前模糊(%abc)
使用反转函数(reverse)索引,可以支持前模糊的查询。
1、同样,使用类型默认的index ops class时,仅适合于collate="C"的查询,例如:
bill=# create index idx_t1 on t1 using btree(reverse(info) collate "C");
CREATE INDEX
bill=# select * from t1 limit 1;
id | info
----+-------------
1 | -2100249117
(1 row)
bill=# explain (analyze,verbose,timing,costs,buffers) select * from t1 where reverse(info) like '117%' collate "C";
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.t1 (cost=8.14..378.67 rows=500 width=14) (actual time=0.142..0.287 rows=101 loops=1)
Output: id, info
Filter: (reverse(t1.info) ~~ '117%'::text COLLATE "C")
Heap Blocks: exact=94
Buffers: shared hit=94 read=3
-> Bitmap Index Scan on idx_t1 (cost=0.00..8.02 rows=500 width=0) (actual time=0.118..0.118 rows=101 loops=1)
Index Cond: ((reverse(t1.info) >= '117'::text) AND (reverse(t1.info) < '118'::text))
Buffers: shared read=3
Planning Time: 0.117 ms
Execution Time: 0.324 ms
(10 rows)
2、当collate<>"C"时,使用对应类型的pattern ops来让btree索引支持模糊查询,并且也能支持规则表达式的写法
bill=# create index idx_t1 on t1 using btree(reverse(info) text_pattern_ops);
CREATE INDEX
bill=# explain (analyze,verbose,timing,costs,buffers) select * from t1 where reverse(info) like '117%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.t1 (cost=8.14..378.67 rows=500 width=14) (actual time=0.084..0.212 rows=101 loops=1)
Output: id, info
Filter: (reverse(t1.info) ~~ '117%'::text)
Heap Blocks: exact=94
Buffers: shared hit=94 read=3
-> Bitmap Index Scan on idx_t1 (cost=0.00..8.02 rows=500 width=0) (actual time=0.062..0.063 rows=101 loops=1)
Index Cond: ((reverse(t1.info) ~>=~ '117'::text) AND (reverse(t1.info) ~<~ '118'::text))
Buffers: shared read=3
Planning Time: 0.096 ms
Execution Time: 0.242 ms
(10 rows)
bill=# explain (analyze,verbose,timing,costs,buffers) select * from t1 where reverse(info) ~ '^117';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.t1 (cost=8.14..378.67 rows=500 width=14) (actual time=0.060..0.283 rows=101 loops=1)
Output: id, info
Filter: (reverse(t1.info) ~ '^117'::text)
Heap Blocks: exact=94
Buffers: shared hit=97
-> Bitmap Index Scan on idx_t1 (cost=0.00..8.02 rows=500 width=0) (actual time=0.037..0.037 rows=101 loops=1)
Index Cond: ((reverse(t1.info) ~>=~ '117'::text) AND (reverse(t1.info) ~<~ '118'::text))
Buffers: shared hit=3
Planning Time: 0.129 ms
Execution Time: 0.320 ms
(10 rows)
前、后模糊的合体:
使用pg_trgm索引,可以支持前、后模糊的查询。
如果想要支持中文模糊查询,需要注意数据库的lc_ctype不能为"C",并且索引、查询条件的collate必须一致才能使用索引,例如:
bill=# \l+ bill
List of databases
Name | Owner | Encoding | Collate | Ctype | Access privileges | Size | Tablespace | Description
------+----------+----------+---------+-------------+-------------------+-------+------------+-------------
bill | postgres | UTF8 | C | zh_CN.UTF-8 | | 95 MB | pg_default |
(1 row)
构造环境:
bill=# create table test001(c1 text);
CREATE TABLE
bill=# create or replace function gen_hanzi(int) returns text as $$
bill$# declare
bill$# res text;
bill$# begin
bill$# if $1 >=1 then
bill$# select string_agg(chr(19968+(random()*20901)::int), '') into res from generate_series(1,$1);
bill$# return res;
bill$# end if;
bill$# return null;
bill$# end;
bill$# $$ language plpgsql strict;
CREATE FUNCTION
bill=# insert into test001 select gen_hanzi(20) from generate_series(1,100000);
INSERT 0 100000
bill=# create index idx_test001_1 on test001 using gin (c1 gin_trgm_ops);
CREATE INDEX
bill=# select * from test001 limit 5;
c1
------------------------------------------
頺縈鈍鮆嘻頔傉澩裂驁沧蜏鉩靣鞪僗怅鏒翌豇
傸鵾鰔骝髑熯偣孊弰瞅油禩轷贶墯佁貫阖翜蹽
戦綷筿降峝蹮險田螻攠當剞嘌蛓獿鮯諨嬰虡纙
錮虒弧蜖比蒞窐砫宥軡泜箏矘犍秥冴僕礫四逇
跐藉圹縲稡霃黇兯蒟穭曣轋虫徼嗣嶹矧鳨冥肺
(5 rows)
模糊查询:
bill=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '你%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=10.38..475.76 rows=500 width=32) (actual time=0.041..0.052 rows=5 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '你%'::text)
Heap Blocks: exact=5
Buffers: shared hit=9
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..10.25 rows=500 width=0) (actual time=0.031..0.032 rows=5 loops=1)
Index Cond: (test001.c1 ~~ '你%'::text)
Buffers: shared hit=4
Planning Time: 0.098 ms
Execution Time: 0.076 ms
(10 rows)
bill=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%中国';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=6.58..19.42 rows=10 width=61) (actual time=0.023..0.023 rows=0 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%中国'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=1
Buffers: shared hit=5
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..6.58 rows=10 width=0) (actual time=0.015..0.015 rows=1 loops=1)
Index Cond: (test001.c1 ~~ '%中国'::text)
Buffers: shared hit=4
Planning Time: 0.248 ms
Execution Time: 0.045 ms
(11 rows)
前后模糊(%abc%):
使用pg_trgm插件,支持前后模糊的查询。
同样,需要注意数据库的lc_ctype不能为"C",并且索引、查询条件的collate必须一致才能使用索引
建议输入3个或3个以上字符,否则效果不佳
bill=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%中国%';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on public.test001 (cost=0.00..2387.00 rows=10 width=61) (actual time=21.432..21.432 rows=0 loops=1)
Output: c1
Filter: (test001.c1 ~~ '%中国%'::text)
Rows Removed by Filter: 100000
Buffers: shared hit=1137
Planning Time: 0.097 ms
Execution Time: 21.451 ms
(7 rows)
bill=# explain (analyze,verbose,timing,costs,buffers) select * from test001 where c1 like '%我是程序员%';
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on public.test001 (cost=16.98..29.82 rows=10 width=61) (actual time=0.041..0.042 rows=0 loops=1)
Output: c1
Recheck Cond: (test001.c1 ~~ '%我是程序员%'::text)
Buffers: shared hit=10
-> Bitmap Index Scan on idx_test001_1 (cost=0.00..16.98 rows=10 width=0) (actual time=0.039..0.039 rows=0 loops=1)
Index Cond: (test001.c1 ~~ '%我是程序员%'::text)
Buffers: shared hit=10
Planning Time: 0.112 ms
Execution Time: 0.066 ms
(9 rows)
可以看到只有2个字符性能有明显下降
总结:
- 对于只有后模糊(abc%)查询需求,使用collate "C"的b-tree索引;当collate不为"C"时,可以使用类型对应的pattern ops(例如text_pattern_ops)建立b-tree索引。
- 对于只有前模糊(%abc)的查询需求,使用collate "C"的reverse()表达式的b-tree索引;当collate不为"C"时,可以使用类型对应的pattern ops(例如text_pattern_ops)建立b-tree索引。
- 对于前后均模糊(%abc%)查询需求,并且包含中文,请使用lc_ctype <> "C"的数据库,同时使用pg_trgm插件的gin索引。