在数据库SQL查询优化中,索引是最常用的优化手段。合理的使用索引,可以极大提升数据库查询效率,充分利用系统资源。数据库中,数据选择率比较低的时候使用索引可以带来性能提升,但是如果数据选择率比较高,索引额外带来索引读代价,同时会带来随机读的问题,性能不如顺序扫描。因此本文介绍的索引类型只能用于数据选择率比较低的SQL语句。这是选择索引的前提。
Postgresql中主要支持6种类型的索引:BTREE、HASH、GiST、SP-GiSP、GIN、BRIN。可以根据实际的应用场景选择合适的索引,BTREE、HASH是比较常用的索引。
1. BTREE索引:
CREATE INDEX默认使用BTREE索引,适合按照顺序存储的数据进行比较查询和范围查询,查询优化器会优先考虑使用BTREE索引,如果涉及到以下任何一种操作:
1)<,<=,=,>,>=
2)以及这些操作的组合,比如between and,也可以使用BTREE。
3)在索引列上的IS NULL 或者IS NOT NULL也可以使用BTREE。
4)BTREE索引也可以用于模糊查询,但是仅限字符串开头是常量的情况下,比如 name LIKE ‘Jason%’,或者name ~ ’^Jason’。但是name LIKE ‘%Jason’是不能用的。
5)Min/Max聚集操作也可使用BTREE索引。
6)其实在merge join以及order by中,可以通过使用BTREE索引的有序性来减少sort带来的代价。
例子:
test=# create table t1 (id int, info text);
CREATE TABLE
test=# insert into t1 values(generate_series(1,100000), md5(random()::text));
INSERT 0 100000
test=# \d t1
Table "public.t1"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | |
info | text | | |
test=# analyze t1;
ANALYZE
--不建立索引,默认使用顺序扫描
test=# explain select * from t1 where t1.id = 10007;
QUERY PLAN
------------------------------------------------------
Seq Scan on t1 (cost=0.00..2084.00 rows=1 width=37)
Filter: (id = 10007)
(2 rows)
test=# explain select * from t1 where t1.id >10007;
QUERY PLAN
----------------------------------------------------------
Seq Scan on t1 (cost=0.00..2084.00 rows=90249 width=37)
Filter: (id > 10007)
(2 rows)
test=# explain select * from t1 where t1.id > 10007 and t1.id < 12000;
QUERY PLAN
---------------------------------------------------------
Seq Scan on t1 (cost=0.00..2334.00 rows=2042 width=37)
Filter: ((id > 10007) AND (id < 12000))
(2 rows)
--建立BTREE索引
test=# create index on t1(id);
CREATE INDEX
test=# analyze t1;
ANALYZE
test=# explain select * from t1 where t1.id = 10007;
QUERY PLAN
---------------------------------------------------------------------
Index Scan using t1_id_idx on t1 (cost=0.29..8.31 rows=1 width=37)
Index Cond: (id = 10007)
(2 rows)
--下面例子中没有使用索引的原因是选择率太高,优化器会使用顺序扫描
test=# explain select * from t1 where t1.id >10007;
QUERY PLAN
----------------------------------------------------------
Seq Scan on t1 (cost=0.00..2084.00 rows=90103 width=37)
Filter: (id > 10007)
(2 rows)
test=# explain select * from t1 where t1.id > 10007 and t1.id < 12000;
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using t1_id_idx on t1 (cost=0.29..83.73 rows=1972 width=37)
Index Cond: ((id > 10007) AND (id < 12000))
(2 rows)
test=# explain select * from t1 where t1.id >98765;
QUERY PLAN
-------------------------------------------------------------------------
Index Scan using t1_id_idx on t1 (cost=0.29..51.31 rows=1201 width=37)
Index Cond: (id > 98765)
(2 rows)
test=# \d t1
Table "public.t1"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | |
info | text | | |
Indexes:
"t1_id_idx" btree (id)
test=# explain select * from t1 where t1.id between 10007 and 11000;
QUERY PLAN
------------------------------------------------------------------------
Index Scan using t1_id_idx on t1 (cost=0.29..43.79 rows=975 width=37)
Index Cond: ((id >= 10007) AND (id <= 11000))
(2 rows)
test=# explain select * from t1 where t1.id IS NULL;
QUERY PLAN
---------------------------------------------------------------------
Index Scan using t1_id_idx on t1 (cost=0.29..4.31 rows=1 width=37)
Index Cond: (id IS NULL)
(2 rows)
test=# create index on t1(info text_pattern_ops);
CREATE INDEX
test=# \d t1;
Table "public.t1"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | |
info | text | | |
Indexes:
"t1_id_idx" btree (id)
"t1_info_idx" btree (info text_pattern_ops)
test=# analyze t1;
ANALYZE
--模糊查询使用索引
test=# explain select * from t1 where t1.info like '0123%';
QUERY PLAN
------------------------------------------------------------------------
Index Scan using t1_info_idx on t1 (cost=0.42..8.44 rows=10 width=37)
Index Cond: ((info ~>=~ '0123'::text) AND (info ~<~ '0124'::text))
Filter: (info ~~ '0123%'::text)
(3 rows)
test=# explain analyze select min(id) from t1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.33..0.34 rows=1 width=4) (actual time=0.143..0.143 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.29..0.33 rows=1 width=4) (actual time=0.114..0.116 rows=1 loops=1)
-> Index Only Scan using t1_id_idx on t1 (cost=0.29..3691.29 rows=100000 width=4) (actual time=0.107..0.107 rows=1 loops=1)
Index Cond: (id IS NOT NULL)
Heap Fetches: 1
Planning Time: 0.531 ms
Execution Time: 0.263 ms
(8 rows)
test=# explain analyze select max(id) from t1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Result (cost=0.33..0.34 rows=1 width=4) (actual time=0.054..0.054 rows=1 loops=1)
InitPlan 1 (returns $0)
-> Limit (cost=0.29..0.33 rows=1 width=4) (actual time=0.043..0.044 rows=1 loops=1)
-> Index Only Scan Backward using t1_id_idx on t1 (cost=0.29..3691.29 rows=100000 width=4) (actual time=0.040..0.041 rows=1 loops=1)
Index Cond: (id IS NOT NULL)
Heap Fetches: 1
Planning Time: 0.485 ms
Execution Time: 0.128 ms
(8 rows)
test=#
2. Hash索引:
只能处理简单的等值比较,当索引列涉及到等于操作比较时,优化器会考虑使用Hash索引。Hash索引是通过比较hash值来查找定位,如果hash索引列的数据重复度比较高,容易产生严重的hash冲突,从而降低查询效率,因此这种情况下,不适合hash索引。
CREATE INDEX idx_name ON table_name USING HASH (column_name);
例子:
test=# create table t2 (id int, info text);
CREATE TABLE
test=# insert into t2 values(generate_series(1,100000), md5(random()::text));
INSERT 0 100000
test=# create index on t2 using hash(id);
CREATE INDEX
test=# analyze t2;
ANALYZE
test=# \d t2;
Table "public.t2"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | |
info | text | | |
Indexes:
"t2_id_idx" hash (id)
test=# explain select * from t2 where id = 10008;
QUERY PLAN
---------------------------------------------------------------------
Index Scan using t2_id_idx on t2 (cost=0.00..8.02 rows=1 width=37)
Index Cond: (id = 10008)
(2 rows)
--非等于操作不会用到hash索引
test=# explain select * from t2 where id < 10008;
QUERY PLAN
---------------------------------------------------------
Seq Scan on t2 (cost=0.00..2084.00 rows=9826 width=37)
Filter: (id < 10008)
(2 rows)
test=# explain select * from t2 where id is NULL;
QUERY PLAN
------------------------------------------------------
Seq Scan on t2 (cost=0.00..1834.00 rows=1 width=37)
Filter: (id IS NULL)
(2 rows)
test=#
3. GiST索引
不是独立的索引类型,是一种架构或者索引模板,是一棵平衡二叉树。适用于多维数据类型和集合数据类型,和Btree索引类似,同样适用于其他的数据类型。GiST可以用来做位置搜索,如包含、相交、左边、右边等。和Btree索引相比,GiST多字段索引在查询条件中包含索引字段的任何子集都会使用索引扫描,而Btree索引只有查询条件包含第一个索引字段才会使用索引扫描。GiST索引特定操作符类型高度依赖于索引策略(操作符类)。GiST跟Btree索引相比,索引创建耗时较长,占用空间也比较大。
下面例子中建立了BTREE组合索引(a, b),如果SQL where条件中有a或者a,b都可以使用该组合索引,但是如果where条件中只有b,则无法使用索引。此时,GiST可以解决这种情况。
例子:
test=# create table t3(a bigint, b timestamp without time zone,c varchar(64));
CREATE TABLE ^
test=# insert into t3 values(generate_series(1,100000), now()::timestamp, md5(random()::text));
INSERT 0 100000
test=# create index on t3(a, b);
CREATE INDEX
test=# analyze t3;
ANALYZE
test=# \d t3
Table "public.t3"
Column | Type | Collation | Nullable | Default
--------+-----------------------------+-----------+----------+---------
a | bigint | | |
b | timestamp without time zone | | |
c | character varying(64) | | |
Indexes:
"t3_a_b_idx" btree (a, b)
test=# explain select * from t3 where a = 10000;
QUERY PLAN
----------------------------------------------------------------------
Index Scan using t3_a_b_idx on t3 (cost=0.42..8.44 rows=1 width=49)
Index Cond: (a = 10000)
(2 rows)
test=# explain select * from t3 where b = '2022-11-18 17:50:29.245683';
QUERY PLAN
---------------------------------------------------------------------------
Seq Scan on t3 (cost=0.00..2281.00 rows=1 width=49)
Filter: (b = '2022-11-18 17:50:29.245683'::timestamp without time zone)
(2 rows)
test=# create extension btree_gist;
CREATE EXTENSION
test=# create index idx_t3_gist on t3 using gist(a,b);
CREATE INDEX
test=# analyze t3;
ANALYZE
test=# explain select * from t3 where a = 10000;
QUERY PLAN
----------------------------------------------------------------------
Index Scan using t3_a_b_idx on t3 (cost=0.42..8.44 rows=1 width=49)
Index Cond: (a = 10000)
(2 rows)
test=# explain select * from t3 where b = '2022-11-18 17:50:29.245683';
QUERY PLAN
-------------------------------------------------------------------------------
Index Scan using idx_t3_gist on t3 (cost=0.28..8.30 rows=1 width=49)
Index Cond: (b = '2022-11-18 17:50:29.245683'::timestamp without time zone)
(2 rows)
test=# explain select * from t3 where a = '10000';
QUERY PLAN
-----------------------------------------------------------------------
Index Scan using idx_t3_gist on t3 (cost=0.28..8.30 rows=1 width=49)
Index Cond: (a = '10000'::bigint)
(2 rows)
test=# explain select * from t3 where a = '10000' or b = '2022-11-18 17:50:29.245683';
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t3 (cost=8.58..12.59 rows=1 width=49)
Recheck Cond: ((a = '10000'::bigint) OR (b = '2022-11-18 17:50:29.245683'::timestamp without time zone))
-> BitmapOr (cost=8.58..8.58 rows=1 width=0)
-> Bitmap Index Scan on idx_t3_gist (cost=0.00..4.29 rows=1 width=0)
Index Cond: (a = '10000'::bigint)
-> Bitmap Index Scan on idx_t3_gist (cost=0.00..4.29 rows=1 width=0)
Index Cond: (b = '2022-11-18 17:50:29.245683'::timestamp without time zone)
(7 rows)
test=# explain select * from t3 where a = '10000' and b = '2022-11-18 17:50:29.245683';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------
Index Scan using t3_a_b_idx on t3 (cost=0.42..6.19 rows=1 width=49)
Index Cond: ((a = '10000'::bigint) AND (b = '2022-11-18 17:50:29.245683'::timestamp without time zone))
(2 rows)
test=#
4. SP-GiST索引
和GiST类似,但是是一棵不平衡树,支持多维和海量数据,把空间分割成互不相交的部分。SP-GiST适用于空间可以递归分割成不相交区域的结构,包括四叉树、k-D树和基数树。
5. GIN索引
倒排序索引,适合于包含多个组成值的数据,比如数组,全文检索等。用来保存一个键值对的集合,支持用户定义的索引策略,对于不同的索引策略,可以使用不同的操作符。
键值对(Key,postion list):其中Key是一个键值,而postion list是包含Key的位置值。比如('Bob','10:25 14:3 29:5') 就表示关键字'Bob'在这些位置(元组TID)上存在。 当我们用关键字'Bob'去查询的时候,一下就定位到包含关键字的元组有这三个。
使用方法如下:
test=# create table t4(id int, info text);
CREATE TABLE
test=# insert into t4 values(generate_series(1,10000), md5(random()::text));
INSERT 0 10000
test=# create index idx_t4_gin on t4 using gin(to_tsvector('english',info));
CREATE INDEX
test=# analyze t4;
ANALYZE
test=# \d t4
Table "public.t4"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | |
info | text | | |
Indexes:
"idx_t4_gin" gin (to_tsvector('english'::regconfig, info))
test=# explain select * from t4 where to_tsvector('english', info) @@ plainto_tsquery( 'hello');
QUERY PLAN
-------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t4 (cost=12.25..16.77 rows=1 width=37)
Recheck Cond: (to_tsvector('english'::regconfig, info) @@ plainto_tsquery('hello'::text))
-> Bitmap Index Scan on idx_t4_gin (cost=0.00..12.25 rows=1 width=0)
Index Cond: (to_tsvector('english'::regconfig, info) @@ plainto_tsquery('hello'::text))
(4 rows)
6. BRIN索引
块范围索引,它将数据在磁盘上的block按照一定的数目进行分组,分组之后,计算每组的取值范围。在查找数据时,会遍历这些取值范围,排除掉不在范围之内的分组。BRIN索引适用于存储流式数据日志。例如:按照时间插入的数据,由于数据是按照时间插入,因此数据块上记录的范围信息很少会出现交叉情况,索引过滤后需要比较的数据块也会少很多;反之,如果数据交叉严重,通过索引无法过滤掉任何一个数据块时,操作起来会比全表扫描更加耗时。
test=# create table t5(id int, name text);
CREATE TABLE
test=# insert into t5 values(generate_series(1,100000), md5(random()::text));
INSERT 0 100000
test=# create index idx_t5_brin on t5 using brin(id);
CREATE INDEX
test=# analyze t5;
ANALYZE
test=# \d t5
Table "public.t5"
Column | Type | Collation | Nullable | Default
--------+---------+-----------+----------+---------
id | integer | | |
name | text | | |
Indexes:
"idx_t5_brin" brin (id)
test=# explain select * from t5 where id > 98765;
QUERY PLAN
-------------------------------------------------------------------------------
Bitmap Heap Scan on t5 (cost=12.33..1024.91 rows=1199 width=37)
Recheck Cond: (id > 98765)
-> Bitmap Index Scan on idx_t5_brin (cost=0.00..12.03 rows=14286 width=0)
Index Cond: (id > 98765)
(4 rows)
test=# explain analyze select * from t5 where id > 98765;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on t5 (cost=12.33..1024.91 rows=1199 width=37) (actual time=2.033..2.408 rows=1235 loops=1)
Recheck Cond: (id > 98765)
Rows Removed by Index Recheck: 6605
Heap Blocks: lossy=66
-> Bitmap Index Scan on idx_t5_brin (cost=0.00..12.03 rows=14286 width=0) (actual time=0.043..0.043 rows=1280 loops=1)
Index Cond: (id > 98765)
Planning Time: 0.115 ms
Execution Time: 2.545 ms
(8 rows)
test=#