PostgreSQL中bloom的介绍（二）

瀚高PG实验室

于 2018-05-15 14:29:58 发布

阅读量655

点赞数

分类专栏： PostgreSQL之扩展语言文章标签： postgresql

本文链接：https://blog.csdn.net/pg_hgdb/article/details/80322040

版权

PostgreSQL之扩展语言专栏收录该内容

18 篇文章 7 订阅

订阅专栏

作者：瀚高PG实验室（Highgo PG Lab）- Kalath
接下来，我们测试一下使用bloom索引之后的效果。
首先，需要创建一些测试数据。所以，先新建一张表：

test=# CREATE TABLE bloomtest(id serial, v1 int, v2 int, v3 int, v4 int, v5 int, v6 int);
CREATE TABLE

然后向表中插入一些随机值：

test=# insert into bloomtest (v1, v2, v3, v4, v5, v6)
test-#     select
test-#         random() * 10000000,
test-#         random() * 10000000,
test-#         random() * 10000000,
test-#         random() * 10000000,
test-#         random() * 10000000,
test-#         random() * 10000000
test-#     from generate_series(1, 1000000);
INSERT 0 1000000

先看一下表中数据的分布情况（取前十行）：

test=# select * from bloomtest limit 10;
 id |   v1    |   v2    |   v3    |   v4    |   v5    |   v6    
----+---------+---------+---------+---------+---------+---------
  1 | 7331968 | 9314138 | 2780210 | 7231840 | 7799359 | 6998207
  2 |  638584 |  186670 | 8859652 | 3975114 | 5966950 | 3116526
  3 | 5769711 | 7454585 | 6832561 | 1960396 | 7641411 |  324686
  4 |  920342 | 6663771 | 2444043 | 6599157 | 1988208 | 1493191
  5 | 8097268 | 8394039 |  973589 | 2365498 | 5094445 | 6625843
  6 |  542468 | 2426413 | 5939981 | 3322678 | 9658252 | 3739340
  7 |  320884 |  296837 | 3926010 | 9180536 | 4271951 | 9892960
  8 | 2297062 |   41662 | 7347545 | 9129623 | 2002058 | 4988956
  9 | 9454309 | 2922400 | 1652727 | 1898353 | 9521557 | 3640934
 10 | 3391544 | 7618825 | 2034974 | 4365134 | 9984323 | 7129419
(10 rows)

接下来，我们用特定列的值来验证行的存在性：

test=# explain analyze select * from bloomtest where v1 = 781269 and v5=56161;
                              QUERY PLAN                  
                                      
-----------------------------------------------------------------------------------

 Gather  (cost=1000.00..14147.10 rows=1 width=28) (actual time=104.709..104.709 row
s=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on bloomtest  (cost=0.00..13147.00 rows=1 width=28) (actua
l time=95.857..95.857 rows=0 loops=3)
         Filter: ((v1 = 781269) AND (v5 = 56161))
         Rows Removed by Filter: 333333
 Planning time: 0.073 ms
 Execution time: 105.153 ms
(8 rows)

从结果中可以看到没有符合条件的行，但是得到这个结果需要花费一定的时间（本例中约为105ms），PostgreSQL会使用顺序扫描的方式扫描整个表。
有几种改进办法：
方法1.可以使用为（v1,v5）列创建b-tree索引的方法，但是如果这样的话会有局限性。下次如果想要从（v2,v4）列上查询，这个索引就不适用了。
方法2.可以给每个列单独建一个索引，但是这同样有局限性：

test=# create index idx_1 on bloomtest (v1);
CREATE INDEX
...
...
...
test=# create index idx_6 on bloomtest (v6);
CREATE INDEX

看一下效果：

test=# explain analyze select * from bloomtest where v1 = 781269 and v5=56161;
                           QUERY PLAN                     
                               
-----------------------------------------------------------------------------------
 Index Scan using idx_5 on bloomtest  (cost=0.42..8.45 rows=1 width=28) (actual tim
e=0.053..0.053 rows=0 loops=1)
   Index Cond: (v5 = 56161)
   Filter: (v1 = 781269)
 Planning time: 0.509 ms
 Execution time: 0.071 ms
(5 rows)

从结果中可以看到，查询确实变快了。然而，这种方法也有明显缺陷，PostgreSQL只使用了v5列上的索引，然后过滤v1列。如果v5列上有多个列值都为56161的话这种方法就会适得其反。
方法3. 使用bloom过滤器。
首先创建扩展：

test=# create extension bloom;
CREATE EXTENSION

然后创建索引：

test=# CREATE INDEX bloom_idx ON bloomtest using bloom (v1,v2,v3,v4,v5,v6);
CREATE INDEX

然后删除之前为每列创建的索引。

test=# drop index idx_1,idx_2,idx_3,idx_4,idx_5,idx_6;
DROP INDEX

最后，看一下效果：

test=# explain analyze select * from bloomtest where v1 = 781269 and v5=56161;
                           QUERY PLAN                 
                                       
-----------------------------------------------------------------------------------
 Bitmap Heap Scan on bloomtest  (cost=17848.00..17852.02 rows=1 width=28) (actual t
ime=18.311..18.311 rows=0 loops=1)
   Recheck Cond: ((v1 = 781269) AND (v5 = 56161))
   Rows Removed by Index Recheck: 246
   Heap Blocks: exact=244
   ->  Bitmap Index Scan on bloom_idx  (cost=0.00..17848.00 rows=1 width=0) (actual
 time=16.066..16.066 rows=246 loops=1)
         Index Cond: ((v1 = 781269) AND (v5 = 56161))
 Planning time: 0.070 ms
 Execution time: 18.336 ms
(8 rows)

可以看到，使用bloom过滤器以后，两列都会被使用，而且只会花费大约18ms
还可以看到bloom索引反回了246个false positive，它们之后会被堆扫描中的过滤器进行处理。