Bitmap Index Scan

本文转载自:https://yq.aliyun.com/articles/662100?spm=a2c4e.11155435.0.0.56e45560mi4bdC#

Bitmap Index Scan

数据库里面的表的扫描方式主要是以下几种方式:sequential scans, index scans, and bitmap index scans,当然还有index only scan,这种算是index scans中比较特殊的一种,需要的信息在索引中都能找到,扫描索引即可,不需要去扫描表。

这篇文章主要谈谈Bitmap Index Scan的原理及适用的场景。

定义

A plain indexscan fetches one tuple-pointer at a time from the index,and immediately visits that tuple in the table. A bitmap scan fetches all the tuple-pointers from the index in one go, sorts them using an in-memory “bitmap” data structure, and then visits the table tuples in physical tuple-location order. The bitmap scan improves locality of reference to the table at the cost of more bookkeeping overhead to manage the “bitmap” data structure — and at the cost that the data is no longer retrieved in index order, which doesn’t matter for your query but would matter if you said ORDER BY.

A bitmapped index scan works in two stages. First the index or indexes are scanned to create a bitmap representing matching tuple.

这段解释是tom lane在postgres邮件组中的回答,我觉得是比较权威而且通俗易懂的解释。

核心是传统的index scan每次从索引中去取一个tuple的指针,然后立马去表中取数据,每一次会造成一次随机io。如果数据量较多的情况下,会比较低效。而bitmap scan一次性将符合条件的tuple-pointers全部取出来,然后在内存中进行地址排序,然后去取出数据,这时的读取数据由于进行的地址排序,读取时就变成了顺序的读。其实就是一个随机读转化为顺序读取的过程,但是取出的数据由于进行了地址的排序,就没有顺序。同时,对于limit这种sql,bitmap index scan这种就不适合,因为它一次会取出所有数据。

和传统索引扫描对比:

在这里插入图片描述

在读取数据量比较小时,index scan比较合适,在读取数据量比较大的情况下,bitmap index scan会更有优势。

下面通过实验验证

创建测试表

CREATE TABLE sampletable (x numeric);

插入数据

INSERT INTO sampletable SELECT random() * 10000 FROM generate_series(1, 10000000);

全表扫描

postgres=# explain (analyze,buffers) select * from sampletable where x = 12;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on sampletable  (cost=10000000000.00..10000179055.00 rows=1 width=11) (actual time=1144.790..1144.790 rows=0 loops=1)
   Filter: (x = '12'::numeric)
   Rows Removed by Filter: 10000000
   Buffers: shared hit=54055
 Planning time: 0.190 ms
 Execution time: 1144.811 ms
(6 rows)

从执行计划看到整体的读取page的数量(一般在磁盘中我们叫block,如果数据已经加载至内存我们叫page)是54055。有没有同学会质疑以下这个大小是不是真正的表大小?

我们查看pg_class表:

postgres=# SELECT relpages, reltuples FROM pg_class WHERE relname = 'sampletable';
 relpages | reltuples
----------+-----------
    54055 |     1e+07

postgres=# select pg_relation_size('sampletable')/8/1024;
 ?column?
----------
    54055
(1 row)

上面两种方法都可以计算中实际表的大小。

建立索引:

CREATE INDEX  ON sampletable(x);

再来运行语句

postgres=# explain (analyze,buffers) select * from sampletable where x = 12;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using sampletable_x_idx on sampletable  (cost=0.43..8.45 rows=1 width=11) (actual time=0.034..0.034 rows=0 loops=1)
   Index Cond: (x = '12'::numeric)
   Heap Fetches: 0
   Buffers: shared read=3
 Planning time: 0.220 ms
 Execution time: 0.054 ms

由于取的数据比较少,所以不会使用bitmap index scan,这时我们读取一个范围内的数据:

postgres=# explain (analyze,buffers) select * from sampletable where x >0 and x <100;
                                                              QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sampletable  (cost=2536.93..59167.58 rows=98780 width=11) (actual time=26.456..100.278 rows=99979 loops=1)
   Recheck Cond: ((x > '0'::numeric) AND (x < '100'::numeric))
   Heap Blocks: exact=45642
   Buffers: shared hit=46028
   ->  Bitmap Index Scan on sampletable_x_idx  (cost=0.00..2512.24 rows=98780 width=0) (actual time=19.426..19.426 rows=99979 loops=1)
         Index Cond: ((x > '0'::numeric) AND (x < '100'::numeric))
         Buffers: shared hit=386
 Planning time: 0.172 ms
 Execution time: 105.153 ms

如果我们希望走传统index scan,有两种方法,一种是修改enable_bitmapscan(默认是on),还有一种是修改random_page_cost,后面一种修改随机扫描的cost方法其实就是利用我们前面所讲的原理,让执行计划认为随机扫描的代价很低,虽然实际中并不是这样。

这里我们采用第一种方法:

postgres=# explain (analyze,buffers) select * from sampletable where x >0 and x <100;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using sampletable_x_idx on sampletable  (cost=0.43..209968.64 rows=98780 width=11) (actual time=0.033..91.571 rows=99979 loops=1)
   Index Cond: ((x > '0'::numeric) AND (x < '100'::numeric))
   Heap Fetches: 99979
   Buffers: shared hit=100364
 Planning time: 0.095 ms
 Execution time: 96.648 ms

可以看到,两种扫描方式时间其实差不多。我们继续扩大扫描的范围看看效果:

postgres=# explain (analyze,buffers) select * from sampletable where x >0 and x <2000;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using sampletable_x_idx on sampletable  (cost=0.43..287155.15 rows=2003705 width=11) (actual time=0.033..1681.522 rows=2000273 loops=1)
   Index Cond: ((x > '0'::numeric) AND (x < '2000'::numeric))
   Heap Fetches: 2000273
   Buffers: shared hit=2007910
 Planning time: 0.105 ms
 Execution time: 1782.037 ms
(6 rows)

设置打开bitmapscan

postgres=# set enable_bitmapscan = on;
SET

postgres=# explain (analyze,buffers) select * from sampletable where x >0 and x <2000;
                                                                  QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sampletable  (cost=51402.41..135512.99 rows=2003705 width=11) (actual time=247.258..589.842 rows=2000273 loops=1)
   Recheck Cond: ((x > '0'::numeric) AND (x < '2000'::numeric))
   Heap Blocks: exact=54055
   Buffers: shared hit=61721
   ->  Bitmap Index Scan on sampletable_x_idx  (cost=0.00..50901.49 rows=2003705 width=0) (actual time=238.489..238.489 rows=2000273 loops=1)
         Index Cond: ((x > '0'::numeric) AND (x < '2000'::numeric))
         Buffers: shared hit=7666
 Planning time: 0.109 ms
 Execution time: 677.900 ms

可以看到,当范围扩大0~2000时,bitmap index scan 明显更优。

还有在and or 这种条件下,bitmap index scan 优势也是很明显

postgres=# explain (analyze,buffers) select * from sampletable where x >0 and x <2000 or x >9999;
                                                                     QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sampletable  (cost=51927.12..141063.23 rows=2004448 width=11) (actual time=257.583..611.480 rows=2001278 loops=1)
   Recheck Cond: (((x > '0'::numeric) AND (x < '2000'::numeric)) OR (x > '9999'::numeric))
   Heap Blocks: exact=54055
   Buffers: shared hit=61728
   ->  BitmapOr  (cost=51927.12..51927.12 rows=2004635 width=0) (actual time=248.819..248.819 rows=0 loops=1)
         Buffers: shared hit=7673
         ->  Bitmap Index Scan on sampletable_x_idx  (cost=0.00..50901.49 rows=2003705 width=0) (actual time=248.723..248.723 rows=2000273 loops=1)
               Index Cond: ((x > '0'::numeric) AND (x < '2000'::numeric))
               Buffers: shared hit=7666
         ->  Bitmap Index Scan on sampletable_x_idx  (cost=0.00..23.41 rows=930 width=0) (actual time=0.093..0.093 rows=1005 loops=1)
               Index Cond: (x > '9999'::numeric)
               Buffers: shared hit=7
 Planning time: 0.236 ms
 Execution time: 699.203 ms
(14 rows)

postgres=# set enable_bitmapscan = off;
SET

postgres=# explain (analyze,buffers) select * from sampletable where x >0 and x <2000 or x >9999;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using sampletable_x_idx on sampletable  (cost=0.43..595241.76 rows=2004448 width=11) (actual time=0.016..8253.516 rows=2001278 loops=1)
   Filter: (((x > '0'::numeric) AND (x < '2000'::numeric)) OR (x > '9999'::numeric))
   Rows Removed by Filter: 7998722
   Heap Fetches: 10000000
   Buffers: shared hit=10038131
 Planning time: 0.199 ms
 Execution time: 8346.357 ms
(7 rows)

可以看到bitmap index scan比普通的index scan快了一个数量级,同时我们在执行计划中有bitmap or,通过对两个bitmap做或运算,筛选数据。

其他

不知道有没有注意执行计划中有一个recheck cond的过程,这个意思是在取出的数据比较多时,内存中就不会单独存储每行的指针,而是直接存储对应的page的指针,这种就是有损的(lossy),所以做Bitmap Heap Scan时需要double check一下page中满足条件的行。我们这里的Heap Blocks都是exact,意思是无损的,精确的,其实并不需要recheck的过程,只是这里会显示出来。如果我们遇到了Heap Blocks lossy的情况,可以适当提高work_mem.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值