数据库查询代价估算优化的深度介绍

最新推荐文章于 2023-02-10 19:05:49 发布

勉仁

最新推荐文章于 2023-02-10 19:05:49 发布

阅读量4.1k

点赞数

分类专栏：数据库代价估算文章标签：数据库优化 CBO PostgreSQL 代价估算

本文链接：https://blog.csdn.net/michaelyang_yz/article/details/51873261

版权

数据库同时被 2 个专栏收录

11 篇文章 1 订阅

订阅专栏

代价估算

1 篇文章 0 订阅

订阅专栏

数据库查询代价估算优化的深度介绍

今天在看一本书的时候发现在查询代价估算一节讲的不是很清晰，受篇幅限制吧。例如原文会出现这样两句话“索引列出现在JOIN/ON子句中，作为连接条件，不可使用索引”，“索引列出现在JOIN/ON子句中，作为限制条件满足”key<op>常量”格式可用索引”，这样的话就很容易误导读者，没有将本质叙述出来。真实原因，如果驱动表行数很少，在约束条件上内表走索引过滤性强的时候，那么通常选择走索引代价会更小。举个例子：

test=# create table t(c1 int unique, c2 int unique);

test=# create table t1(c1 int primary key, c2 int, c3 int, c4 int 
unique);

test=# explain select * from t join t1 on t.c2 = t1.c4 where t.c1 = 1;

+---------------------------------------------------------------------------+
| QUERY PLAN|
+---------------------------------------------------------------------------+
| Nested Loop  (cost=0.31..16.36 rows=1 width=24)   |
|   ->  Index Scan using t_c1_key on t  (cost=0.16..8.17 rows=1 width=8)|
| Index Cond: (c1 = 1)  |
|   ->  Index Scan using t1_c4_key on t1  (cost=0.15..8.17 rows=1 width=16) |
| Index Cond: (c4 = t.c2)   |
+---------------------------------------------------------------------------+
5 rows in set

在t1索引列c4上，并没有常量condition，也没有其等值的t.c2的常量condition。但t.c1 = 1这个equal condition可以估算返回1行数据，获得这一行数据后t.c2为常量值，那么可以利用t1.c4 = t.c2这一条件走索引t1_c4_key快速找到需要的行做NL-JOIN。

自己已经工作两年，在这段时间中一项任务就是负责代价估算的开发工作，所以在这里写一下自己对CBO（基于代价的优化）的认识作为分享和总结，希望对大家提高对CBO的认识有帮助，也希望大家多多拍砖帮助自己更快成长。

接下来，我会按照自己项目中各个Operator的代价估算来讲，从底层基表访问到上层LIMIT逐一讲述。在介绍的过程中，我会举实际SQL的例子，也会加一些自己所在团队数据库（后面简称为OurDB）的基础实现介绍，希望能帮助你更好的理解。所以别闲我的介绍啰嗦啊，我是希望大部分有数据库基础知识的人都能看懂。由于能有条件安装OurDB的会非常非常少，当然我也没提DB名字，因此这里举例主要用PostgreSQL，即使是OurDB的例子，也会写Postgre的等价SQL。

基表访问(TABLE/INDEX SCAN/GET)的代价估算

基表访问路径选择的重要性

基表的访问路径(Table Access Path)的代价估算和选择，自己认为是最重要的。

就其自身而言，如果SQL最优路径是走索引A，结果走了前缀索引列不相关或者需要大量回表的索引B，那这条SQL的执行很可能比最优路径性能差几个数量级。

同时如果基表路径返回行数估算偏差大，对于上层Operator的选择往往是毁灭性的。例如两表t、t1做JOIN时，t被估算返回1行数据，然后优化器在选择JOIN type的时候选择了t作为外表，t1作为内表的无MATERIAL的NestLoop Join。如下：

test=# explain select * from t, t1 where t.c1 = 1;
+------------------------------------------------------------------------+
| QUERY PLAN |
+------------------------------------------------------------------------+
| Nested Loop  (cost=0.16..53.57 rows=1770 width=24) |
|   ->  Index Scan using t_c1_key on t  (cost=0.16..8.17 rows=1 width=8) |
| Index Cond: (c1 = 1)   |
|   ->  Seq Scan on t1  (cost=0.00..27.70 rows=1770 width=16)|
+------------------------------------------------------------------------+
4 rows in set

如果行数估算错误，等到实际执行时候，t表返回了1000行，这样实际执行时内表要执行1000次的Seq Scan，真正执行的代价会是估算代价的近1000倍。这个时候有MATERIAL的NL-JOIN会比所选择的执行效率高很多。

test=# create table t2(c1 int unique, c2 int);
test=# explain select * from t2, t1 where t2.c2 = 1;
+----------------------------------------------------------------+
| QUERY PLAN |
+----------------------------------------------------------------+
| Nested Loop  (cost=0.00..307.85 rows=19470 width=24)   |
|   ->  Seq Scan on t1  (cost=0.00..27.70 rows=1770 width=16)|
|   ->  Materialize  (cost=0.00..36.81 rows=11 width=8)  |
| ->  Seq Scan on t2  (cost=0.00..36.75 rows=11 width=8) |
|   Filter: (c2 = 1) |
+----------------------------------------------------------------+
5 rows in set

读者可以在介绍JOIN OPERATOR代价估算之前估计一下，以下两条SELECT SQL会选择的JOIN方式，然后在PostgreSQL上执行一下，看自己的判断是否准确。

test=# create table t1(c1 int primary key, c2 int, c3 int, c4 int 
unique);
test=# create table t2(c1 int unique, c2 int);
test=# explain select * from t2, t1 where t1.c1 = t2.c1 and t2.c1
 = 1;
test=# explain select * from t2, t1 where t1.c1 = t2.c1 and t2.c2 = 1
;

经过上面的介绍，相信你也会认为最底层的TABLE ACCESS PATH选择是非常关键的。接下来我会基于OurDB的实现来介绍TABLE ACCESS PATH的代价估算。

OurDB背景介绍

在OurDB中，索引表和主表在存储层存储方式、访问方式是一样的，在内部会加前缀__index作为区分，比如创建t_c1_key，在内部可以理解为一张名为__index_t_c1_key的table。在访问主表或者索引表的时候，都会根据SQL抽取query_range计算需要访问的数据区间。例如下面SQL中你可以看到，Query1选择走主表，其range为(1,2)。Query2选择走索引t_c2，range为((1, MAX)-(3,MIN))。在t_c2的range expr中，你可以看到c1，是因为索引表为保证row的唯一性，会加入主表的primay key。

(root@test)> create table t(c1 int primary key, c2 int, c3 int, key t_c2(c2));

Query1：
(root@test)> explain extended_noaddr select * from t where c1 > 1 and c1 < 2\G
*************************** 1. row ***************************
Query Plan: ===================================
|ID|OPERATOR  |NAME|EST. ROWS|COST|
-----------------------------------
|0 |TABLE SCAN|t   |1|37  |
===================================

Outputs & filters:
-------------------------------------
  0 - output([t.c1], [t.c2], [t.c3]), filter(nil),
  access([t.c1], [t.c2], [t.c3]), partitions(p0),
  is_index_back=false,
  range_expr([t.c1]), range(1 ; 2)


Query2：
(root@test)> explain extended_noaddr select * from t where c2 > 1 and c2 < 3\G
*************************** 1. row ***************************
Query Plan: ======================================
|ID|OPERATOR  |NAME   |EST. ROWS|COST|
--------------------------------------
|0 |TABLE SCAN|t(t_c2)|1|164 |
======================================

Outputs & filters:
-------------------------------------
  0 - output([t.c1], [t.c2], [t.c3]), filter(nil),
  access([t.c2], [t.c1], [t.c3]), partitions(p0),
  is_index_back=true,
  range_expr([t.c2], [t.c1]), range(1,MAX ; 3,MIN)

在Postgre中的等价SQL为：

test=# create table t(c1 int primary key, c2 int, c3 int);

test=# create index t_c2 on t(c2);

test=# explain verbose select * from t where c1 > 1 and c1;

test=# explain verbose select * from t where c2 > 1 and c2 < 3
;

query range(index cond)

我们知道table的所有行数 * table filter的选择率就是基表访问返回的行数。但是对于基表路径的选择，更为关键的是query range的选择性，即索引表上需要访问多少行数据。同一个表，同样的filters，各个路径经过filter返回的行数是一致的，但是query range决定了需要在访问路径上访问多少数据。例如，上面例子中Qury1选择走主表和索引t_c2最终都会是1行。但它们一个是range（1,2）的范围扫描，一个是range(min,min);(max，max)的Seq Scan，其代价差距会非常大。

(root@test)> explain extended_noaddr select * from t use index(t_c2) where c1 > 1 and c1 < 2\G
*************************** 1. row ***************************
Query Plan: ======================================
|ID|OPERATOR  |NAME   |EST. ROWS|COST|
--------------------------------------
|0 |TABLE SCAN|t(t_c2)|1|302 |
======================================

Outputs & filters:
-------------------------------------
  0 - output([t.c1], [t.c2], [t.c3]), filter([t.c1 > 1], [t.c1 < 2]),
  access([t.c1], [t.c2], [t.c3]), partitions(p0),
  is_index_back=true, filter_before_indexback[true,true],
  range_expr([t.c2], [t.c1]), range(MIN,MIN ; MAX,MAX)always true

在PostgreSQL中等价SQL，由于pg不支持Hint，所以需要自己安装插件，可参考文档https://yq.aliyun.com/articles/4796和http://pghintplan.osdn.jp/pg_hint_plan.html：

test=# /*+IndexScan(t3 t3_c2)*/explain verbose select * from t3 where c1 > 1 and c1 < 2;

STATEMENT:  /*+IndexScan(t3 t3_c2)*/explain verbose select * from t3 where c1 > 1 and c1 < 2;
  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on public.t3  (cost=10000000000.00..10000000039.10 rows=10 width=12)
   Output: c1, c2, c3
   Filter: ((t3.c1 > 1) AND (t3.c1 < 2))
(3 rows)

test=# /*+IndexScan(t3 t3_pkey)*/explain verbose select * from t3 where c1 > 1 and c1 < 2;

STATEMENT:  /*+IndexScan(t3 t3_pkey)*/explain verbose select * from t3 where c1 > 1 and c1 < 2;
 QUERY PLAN
----------------------------------------------------------------------------
 Index Scan using t3_pkey on public.t3  (cost=0.15..32.35 rows=10 width=12)
   Output: c1, c2, c3
   Index Cond: ((t3.c1 > 1) AND (t3.c1 < 2))

query range sel

并不是所有在索引列上的条件都可以成为index cond，对于同时包含索引列和其他列计算的filter，复杂计算无法抽取成range的filter都会无法成为index cond。例如在PostgreSQL中：

test=# create table t4 (c1 int primary key, c2 varchar(10), c3 varchar(10));

test=# create index t4_c3 on t4(c3);
QUERY1：
test=# explain select * from t4 where c3 >= 'a' and c3 <= 'b';
+-------------------------------------------------------------------------------+
| QUERY PLAN|
+-------------------------------------------------------------------------------+
| Bitmap Heap Scan on t4  (cost=4.19..12.66 rows=4 width=80)|
|   Recheck Cond: (((c3)::text >= 'a'::text) AND ((c3)::text <= 'b'::text)) |
|   ->  Bitmap Index Scan on t4_c3  (cost=0.00..4.19 rows=4 width=0)|
| Index Cond: (((c3)::text >= 'a'::text) AND ((c3)::text <= 'b'::text)) |
+-------------------------------------------------------------------------------+

QUERY2：
test=#  explain select * from t4 where c3 between 'a' and 'b';
+-------------------------------------------------------------------------------+
| QUERY PLAN|
+-------------------------------------------------------------------------------+
| Bitmap Heap Scan on t4  (cost=4.19..12.66 rows=4 width=80)|
|   Recheck Cond: (((c3)::text >= 'a'::text) AND ((c3)::text <= 'b'::text)) |
|   ->  Bitmap Index Scan on t4_c3  (cost=0.00..4.19 rows=4 width=0)|
| Index Cond: (((c3)::text >= 'a'::text) AND ((c3)::text <= 'b'::text)) |
+-------------------------------------------------------------------------------+

QUERY3：
test=# explain select * from t4 where c3 like 'a%';
+----------------------------------------------------+
| QUERY PLAN |
+----------------------------------------------------+
| Seq Scan on t4  (cost=0.00..19.25 rows=4 width=80) |
|   Filter: ((c3)::text ~~ 'a%'::text)   |
+----------------------------------------------------+

QUERY4：
test=# explain select * from t4 where c3 like '%a';
+------------------------------------------------------+
| QUERY PLAN   |
+------------------------------------------------------+
| Seq Scan on t4  (cost=0.00..19.25 rows=148 width=80) |
|   Filter: ((c3)::text ~~ '%a'::text) |
+------------------------------------------------------+

QUERY5：
test=# explain select * from t4 where repeat(c3, 4) = 'a';
+----------------------------------------------------+
| QUERY PLAN |
+----------------------------------------------------+
| Seq Scan on t4  (cost=0.00..21.10 rows=4 width=80) |
|   Filter: (repeat((c3)::text, 4) = 'a'::text)  |
+----------------------------------------------------+

可以看到只有QUERY1和QUERY2抽取除了Index Cond（其实like ‘a%’可以抽取出Index Cond[a,b) ）。

如果索引有两列(a,b)，如果range为(1,2)-(2,3)，那么其range选择率可以近似为(1,min)-(2,max)。当需要回表的时候，使用range选择数据，可以通过索引列上的filter做过滤获取rowkey后再回表。