一、顺序扫描
select * from tb1 where id<8000;
公式:启动代价+运行代价
顺序扫描启动代价为0.
run_cost = cpu_run_cost+disk_run_cost
= (cpu_tuple_cost+cpu_operator_cost)*num_tuple+seq_page_cost*num_page;
seq_page_cost、cpu_tuple_cost、cpu_operator_cost可以通过参数配置来确定
test=# show seq_page_cost;
seq_page_cost
---------------
1
(1 row)
test=# show cpu_tuple_cost
test-# ;
cpu_tuple_cost
----------------
0.01
(1 row)
test=# show cpu_operator_cost;
cpu_operator_cost
-------------------
0.0025
(1 row)
num_tuple和num_page可以查看数据字典
test=# select relpages,reltuples from pg_class where relname='tb1';
relpages | reltuples
----------+-----------
23 | 10000
可得 run_cost=(0.01+0.0025)*10000+1*23=148
顺序扫描代价=0+148=148。
验证
test=# explain select * from tb1 where id<8000;
QUERY PLAN
--------------------------------------------------------
Seq Scan on tb1 (cost=0.00..148.00 rows=7999 width=8)
Filter: (id < 8000)
(2 rows)
前面的0.00为启动代价;后面的148.00为总代价
二、索引扫描
假设查询:select id ,data from tb1 where date<240;
num_page和num_tuple 值参数上面
num_index_page 和num_index_tuple如下
test=# select relpages,reltuples from pg_class where relname='tb1_data_idx';
relpages | reltuples
----------+-----------
16 | 10000
启动代价:
启动代价就是读取索引页以访问第一条元组的代价,公式如下:
startup_cost={ceil(log2(num_index_tuple))+(index_high+1)*50}*cpu_operator_cost
ceil为取整函数
log2为取对数 即 2的多少次方等于括号内的值
index_high为索引高度,目前值为1
startup_cost={ceil(log2(10000)+(1+1)*50}*0.0025=0.2825
运行代价:
索引的运行代价是表和索引的CPU代价与IO代价之和。
run_cost=(index_cpu_cost+table_cpu_cost)+(index_io_cost+table_io_cost)
如果使用仅索引扫描,则不会估计表的CPU和IO消耗
index_cpu_cost=Snum_index_tuple(cpu_index_tuple_cost+qual_op_cost)
table_cpu_cost=Snum_tuplecpu_tuple_cost
index_io_cost=ceil(S*num_index_page)*random_page_cost
test=# show cpu_index_tuple_cost;
cpu_index_tuple_cost
----------------------
0.005
(1 row)
test=# show cpu_tuple_cost;
cpu_tuple_cost
----------------
0.01
(1 row)
test=# show random_page_cost;
random_page_cost
------------------
4
(1 row)
test=#
qual_op_cost粗略的来说是索引求值的代价,默认是0.0025
S是指选择率 Selectivity,S*元组总数即为估计要读取的元组的数量
查询谓词的选择率是通过直方图界值与高频值估计的、这些信息在pg_stats视图可以查到
most_common_freqs 即为选择率
select schemaname,tablename, most_common_vals,most_common_freqs
from pg_stats where tablename='tb1';
schemaname | tablename | most_common_vals | most_common_freqs
------------+-----------+------------------+-------------------
public | tb1 | |
public | tb1 | |
(2 rows)
test=#
如果高频直方值不可用,则采用目标列上的直方图界值计算
test=# select histogram_bounds from pg_stats
test-# where tablename='tb1' and attname='data';
histogram_bounds
-------------------------------------------------------------------------------------------------------
{1,100,200,300,400,500,600,700,800,900,1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000,2100,220
0,2300,2400,2500,2600,2700,2800,2900,3000,3100,3200,3300,3400,3500,3600,3700,3800,3900,4000,4100,4200,4
300,4400,4500,4600,4700,4800,4900,5000,5100,5200,5300,5400,5500,5600,5700,5800,5900,6000,6100,6200,6300
,6400,6500,6600,6700,6800,6900,7000,7100,7200,7300,7400,7500,7600,7700,7800,7900,8000,8100,8200,8300,84
00,8500,8600,8700,8800,8900,9000,9100,9200,9300,9400,9500,9600,9700,9800,9900,10000}
(1 row)
test=#
从上述查询可以看到,10000个值被分为100个bucket,桶从0开始,每个桶保存了大概相同数量的元组。直方图界值就是相应桶的边界。
比如直方图界值的第0个值是1 ,即 这个bucket_0中最小的值为1,hb(0)=1。bucket_1 的界值为100,即最小值为100,hb(1)=100。
240在第3个桶,即hb(2)
根据界值计算选择率公式
S={2+[240-hb(2)]/[hb(3)-hb(2)]}/100={2+[240-200]/[300-200]}/100=0.024
选择率S即为0.024
带入上面公式
index_cpu_cost=0.02410000(0.005+0.0025)=1.8
table_cpu_cost=0.024100000.01=2.4
index_io_cost=ceil(0.024*16)*4=4
table_io_cost=max_io_cost+indexCorrelation平方*(min_io_cost-max_io_cost)
max_io_cost是最差情况下的IO代价,即扫描所有数据页的代价 max_io_cost=num_pagerandom_page_cost=234=92
min_io_cost是最优情况下的IO代价,即顺序扫描选定的数据页
min_io_cost=1random_page_cost+[ceil{Snum_page)-1]seq_page_cost
=14+[ceil(0.024*23)-1]*1=5
indexCorerelation是索引相关性,取值范围从-1 到1 从pg_stats视图可以查看到
select tablename,attname,correlation from pg_stats where tablename='tb1';
tablename | attname | correlation
-----------+---------+-------------
tb1 | id | 1
tb1 | data | 1
(2 rows)
table_io_cost=92+11(5-92)=5
run_cost=(1.8+2.4)+(4+5)=13.2
综上所述 总代价=0.2825+13.2=13.4825
test=# explain select id ,data from tb1 where data<240;
QUERY PLAN
---------------------------------------------------------------------------
Index Scan using tb1_data_idx on tb1 (cost=0.29..12.47 rows=239 width=8)
Index Cond: (data < 240)
(2 rows)
seq_page_cost和random_page_cost分表代表顺序扫描和随机扫描 1/4显然是机械硬盘的代价。
对SSD来说,随机扫描代价为4显然太大,可以设置为1