CLUSTERING_FACTOR、回表、数据分布倾斜走全表还是索引

最新推荐文章于 2023-05-29 11:25:59 发布

cjh81378066

最新推荐文章于 2023-05-29 11:25:59 发布

阅读量192

点赞数

CLUSTERING_FACTOR Indicates the amount of order of the rows in the table based on the values of the index.
■ If the value is near the number of blocks, then the
table is very well ordered. In this case, the index
entries in a single leaf block tend to point to rows in
the same data blocks.
■ If the value is near the number of rows, then the
table is very randomly ordered. In this case, it is
unlikely that index entries in the same leaf block
point to rows in the same data blocks.

Assessing I/O for Blocks, not Rows
Oracle Database performs I/O by blocks. Therefore, the optimizer's decision to use full table scans is influenced by the percentage of blocks accessed, not rows. This is called the index clustering factor. If blocks contain single rows, then rows accessed and blocks accessed are the same.
However, most tables have multiple rows in each block. Consequently, the desired number of rows may be clustered in a few blocks or spread out over a larger number of blocks.
Although the clustering factor is a property of the index, the clustering factor actually relates to the spread of similar indexed column values within data blocks in the table. A lower clustering factor indicates that the individual rows are concentrated within fewer blocks in the table. Conversely, a high clustering factor indicates that the individual rows are scattered more randomly across blocks in the table. Therefore, a high clustering factor means that it costs more to use a range scan to fetch rows by rowid, because more blocks in the table need to be visited to return the data
索引扫描取OR全表扫描决于：
1、

where条件囊括的数据键值在索引上能够找到，详见NULL相关知识点

2、where条件囊括的数据块占整表数据块范围比较小，为什么是数据块，而非数据行数？主要是因为有一个索引聚类因子的概念
Oracle数据库通过块执行I/O。因此，优化器使用全表扫描的决定受到访问块的百分比而不是行的影响。这称为索引聚类因子。如果块包含单行，则访问的行和访问的块是相同的。
但是，大多数表在每个块中都有多行。因此，期望数量的行可以聚集在几个块中或者扩展到更大的数目的块。
虽然聚类因子是索引的属性，但聚类因子实际上与表中数据块内类似索引列值的扩展有关。较低的聚类因子表示各行集中在表中较少的块中。相反，高聚类因子表明单独的行在表中的块之间更随机地散布。因此，高聚类因素意味着使用范围扫描以rowid获取行花费更多，因为表中的更多块需要被访问才能返回数据

查看ALL_INDEXES的CLUSTERING_FACTOR*值来判断数据的离散程度
假如表的总块数是5000个块，有1000000行，如果表的数据太离散，带上where 条件object_id=100的数据只有10000行（1%的数据），但是分布在太多不同的块上假如就是5000个块，走索引10000次就需要访问5000个数据块，成本IO等于5000数据块+索引块数>整表块数，那肯定走全表扫描了。
如果表的数据不那么离散，object_id=100的数据分布就在200个块上，成本IO等于200数据块+索引块数<整表块数，那肯定走索引扫描了。

如下三种数据倾斜情况
SQL> select * from test where object_id=1;--数据占95%，理应走全表
SQL> select * from test where object_id=2;--数据占1%，理应走索引
SQL> select * from test where object_id=3;--数据占1%，理应走索引

如果CLUSTERING_FACTOR值是建议走索引。
比如第一个语句，它也不会走全表而是直接走了索引
比如表的总块数是100个块（可能有空块或半空块），object_id=1却只占据了50个块，select全表扫描时会扫描HWM以内的所有块，即100个块
但是索引扫描时，只是去定位到那50个块，成本为索引块+50数据块<全表的100数据块

如果CLUSTERING_FACTOR值是建议走全表。
上面第二、三个语句，它也不会走索引而是直接走了全表

一个索引块中索引值如下
A
A
A
B
B
B
C
C
C

三个数据块中数据分布如下1
A
B
C

A
B
C

A
B
C

三个数据块中数据分布如下2
A
A
A

B
B
B

C
C
C

按分布1的情况，找到对应行，需要9次IO
A要访问3个数据块
B要访问3个数据块
C要访问3个数据块
总计需要访问9次数据块

按分布2的情况，找到对应行，需要3次IO
A要访问1个数据块
B要访问1个数据块
C要访问1个数据块
总计需要访问3次数据块

回表开销：索引存储索引列的值和rowid，可以通过rowid定位回到表中，得到这个索引列以外的列信息，要做这个类事，必然有开销

其实回表，不就是去访问表中某些具体的行嘛，如果回表涉及的行的数据块数量占整表所有行数据块数量的80%，那不就是回表成本太高吗,所以回表和数据离散程度相关

按上面的例子，回表要回10000次，比全表扫描1000000行开销总要少吧
比如不走索引，直接select *　from table，总计1000000，难道不就是回表1000000次吗

针对查询较少列的sql，还可以通过建立复合索引，使sql一次将数据从索引段中读取出来，不用回表。
select id,name from tab_t where id=1;而(id,name)是复合索引，这种情况下就不用回表。

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/30126024/viewspace-2143207/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/30126024/viewspace-2143207/

cjh81378066

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
CLUSTERING_FACTOR、回表、数据分布倾斜走全表还是索引

CLUSTERING_FACTOR Indicates the amount of order of the rows in the table based on the values of the index.■ If t...
复制链接

扫一扫