PG顺序扫描和索引扫描代价估算

最新推荐文章于 2024-09-12 23:58:52 发布

南風_入弦

最新推荐文章于 2024-09-12 23:58:52 发布

阅读量538

点赞数 11

文章标签：数据库 postgresql

本文链接：https://blog.csdn.net/qq_29431123/article/details/141009751

版权

一、顺序扫描

select * from tb1 where id<8000;

公式：启动代价+运行代价
顺序扫描启动代价为0.

run_cost = cpu_run_cost+disk_run_cost
		 = (cpu_tuple_cost+cpu_operator_cost)*num_tuple+seq_page_cost*num_page;

seq_page_cost、cpu_tuple_cost、cpu_operator_cost可以通过参数配置来确定

test=# show seq_page_cost;

 seq_page_cost 
---------------

 1
(1 row)

test=# show cpu_tuple_cost
test-# ;

 cpu_tuple_cost 
----------------

 0.01
(1 row)

test=# show cpu_operator_cost;

 cpu_operator_cost 
-------------------

 0.0025
(1 row)


num_tuple和num_page可以查看数据字典

test=# select relpages,reltuples from pg_class where relname='tb1';
 relpages | reltuples 
----------+-----------
       23 |     10000
	   
	   
可得 run_cost=(0.01+0.0025)*10000+1*23=148
顺序扫描代价=0+148=148。

验证
test=# explain select * from tb1 where id<8000;

                       QUERY PLAN                       
--------------------------------------------------------

 Seq Scan on tb1  (cost=0.00..148.00 rows=7999 width=8)
   Filter: (id < 8000)
(2 rows)

前面的0.00为启动代价；后面的148.00为总代价

二、索引扫描

假设查询：select id ,data from tb1 where date<240;
num_page和num_tuple 值参数上面
num_index_page 和num_index_tuple如下

test=# select relpages,reltuples from pg_class where relname='tb1_data_idx';
 relpages | reltuples 
----------+-----------
       16 |     10000

启动代价：
启动代价就是读取索引页以访问第一条元组的代价，公式如下：

startup_cost={ceil(log2(num_index_tuple))+(index_high+1)*50}*cpu_operator_cost

ceil为取整函数
log2为取对数即 2的多少次方等于括号内的值

index_high为索引高度，目前值为1

startup_cost={ceil(log2(10000)+(1+1)*50}*0.0025=0.2825

运行代价：
索引的运行代价是表和索引的CPU代价与IO代价之和。

run_cost=(index_cpu_cost+table_cpu_cost)+(index_io_cost+table_io_cost)

如果使用仅索引扫描，则不会估计表的CPU和IO消耗

index_cpu_cost=Snum_index_tuple(cpu_index_tuple_cost+qual_op_cost)
table_cpu_cost=Snum_tuplecpu_tuple_cost
index_io_cost=ceil(S*num_index_page)*random_page_cost

test=# show cpu_index_tuple_cost;

 cpu_index_tuple_cost 
----------------------

 0.005
(1 row)
test=# show cpu_tuple_cost;

 cpu_tuple_cost 
----------------

 0.01
(1 row)

test=# show random_page_cost;

 random_page_cost 
------------------

 4
(1 row)

test=# 

qual_op_cost粗略的来说是索引求值的代价，默认是0.0025

S是指选择率 Selectivity,S*元组总数即为估计要读取的元组的数量 

查询谓词的选择率是通过直方图界值与高频值估计的、这些信息在pg_stats视图可以查到
 most_common_freqs 即为选择率

select schemaname,tablename, most_common_vals,most_common_freqs 
 from pg_stats where tablename='tb1';
 schemaname | tablename | most_common_vals | most_common_freqs 
------------+-----------+------------------+-------------------
 public     | tb1       |                  | 
 public     | tb1       |                  | 
(2 rows)

test=# 

如果高频直方值不可用，则采用目标列上的直方图界值计算

test=# select histogram_bounds from pg_stats 
test-# where tablename='tb1' and attname='data';
                                                                                                                                                                                                   

                                  histogram_bounds                                                     

-------------------------------------------------------------------------------------------------------

 {1,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,220
0,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,3700,3800,3900,4000,4100,4200,4
300,4400,4500,4600,4700,4800,4900,5000,5100,5200,5300,5400,5500,5600,5700,5800,5900,6000,6100,6200,6300
,6400,6500,6600,6700,6800,6900,7000,7100,7200,7300,7400,7500,7600,7700,7800,7900,8000,8100,8200,8300,84
00,8500,8600,8700,8800,8900,9000,9100,9200,9300,9400,9500,9600,9700,9800,9900,10000}
(1 row)

test=#

从上述查询可以看到，10000个值被分为100个bucket，桶从0开始，每个桶保存了大概相同数量的元组。直方图界值就是相应桶的边界。
比如直方图界值的第0个值是1 ，即这个bucket_0中最小的值为1，hb(0)=1。bucket_1 的界值为100，即最小值为100，hb(1)=100。

240在第3个桶，即hb（2）
根据界值计算选择率公式

S=｛2+[240-hb(2)]/[hb(3)-hb(2)]｝/100=｛2+[240-200]/[300-200]｝/100=0.024

选择率S即为0.024

带入上面公式

index_cpu_cost=0.02410000(0.005+0.0025)=1.8
table_cpu_cost=0.024100000.01=2.4
index_io_cost=ceil(0.024*16)*4=4

table_io_cost=max_io_cost+indexCorrelation平方*（min_io_cost-max_io_cost）

max_io_cost是最差情况下的IO代价，即扫描所有数据页的代价 max_io_cost=num_pagerandom_page_cost=234=92
min_io_cost是最优情况下的IO代价，即顺序扫描选定的数据页

min_io_cost=1random_page_cost+[ceil{Snum_page)-1]seq_page_cost
=14+[ceil(0.024*23)-1]*1=5

indexCorerelation是索引相关性，取值范围从-1 到1 从pg_stats视图可以查看到

select tablename,attname,correlation from pg_stats where tablename='tb1';
 tablename | attname | correlation 
-----------+---------+-------------
 tb1       | id      |           1
 tb1       | data    |           1
(2 rows)

table_io_cost=92+11（5-92）=5

run_cost=(1.8+2.4)+(4+5)=13.2
综上所述总代价=0.2825+13.2=13.4825

test=#  explain select id ,data from tb1 where data<240;

                                QUERY PLAN                                 
---------------------------------------------------------------------------

 Index Scan using tb1_data_idx on tb1  (cost=0.29..12.47 rows=239 width=8)
   Index Cond: (data < 240)
(2 rows)

seq_page_cost和random_page_cost分表代表顺序扫描和随机扫描 1/4显然是机械硬盘的代价。

对SSD来说，随机扫描代价为4显然太大，可以设置为1