DBA_INDEXES.CLUSTERING_FACTOR 索引的群集因子初探

最新推荐文章于 2021-04-06 00:14:59 发布

长烟慢慢

最新推荐文章于 2021-04-06 00:14:59 发布

阅读量2.1k

点赞数

分类专栏： oracle基础知识文章标签： table object insert sql delete 集群

oracle基础知识专栏收录该内容

167 篇文章 11 订阅

订阅专栏

http://space.itpub.net/?uid-7728585-action-viewspace-itemid-612691

先引出ORACLE WAIT INTERFACE中的原话：

In addition to SQL tuning, it may also be worthwhile to check the index’s clustering factor if the execution plan calls fortable access by index rowid. The clustering factor of an index defines how ordered the rows are in the table. It affects the number of I/Os required for the whole operation. If the DBA_INDEXES.CLUSTERING_FACTOR of the index approaches the number of blocks in the table, then most of the rows in the table are ordered. This is desirable. However, if the clustering factor approaches the number of rows in the table, it means the rows in the table are randomly ordered. In this case, it is unlikely for the index entries in the same leaf block to point to rows in the same data block, and thus it requires more I/Os to complete the operation. You can improve the index’s clustering factor by rebuilding the table so that rows are ordered according to the index key and rebuilding the index thereafter. What happens if the table has more than one index? Well, that is the downside. You can only cater to the most used index.

然后给出中文概念：

在索引的分析数据上clustering_factor是一个很重要的参数，表示的是索引和表之间的关系，因为，索引是按照一定的顺序排列的，但是，对于表来说是按照一种heap的形式存放，每一行可能分布在段上任何一个块上，所以要是通过索引来查找数据行的时候，就有可以一个索引块对应多个，甚至全部表的块，所以引入了clustering_factor这个参数来表示表上数据存放和索引之间的对应关系。这样CBO就可以根据这个参数来判断使用这个索引产生的cost是多少。
一般来说，如果这个表的排列是按照索引列的顺序存放数据的话，这个参数就应该和数据表上的块相类似。

然后做简单的试验证明：

1、试验1

SQL> create table test10
2 as select * from dba_tables;

SQL> create index test10_ind
2 on test10(SAMPLE_SIZE);

execute dbms_stats.gather_table_stats(ownname => 'sys',tabname => 'TEST10',cascade => true);

SQL> select BLOCKS from dba_tables where table_name='TEST10';

BLOCKS
----------
46

SQL> select dba_indexes.clustering_factor from dba_indexes where dba_indexes.index_name='TEST10_IND';

CLUSTERING_FACTOR
-----------------
496

由于是建立索引的时候是以SAMPLE_SIZE为关键字，这个时候索引会对其排序然后形成一个B-Tree的结构，当然这个时候索引上的一个块也许通过ROWID会访问TABLE上的多个块，所以因子就是496，相差很大。

2、试验2

SQL> create table test11
2 as
3 select * from dba_tables order by SAMPLE_SIZE;

SQL> create table index11
2 on test11(SAMPLE_SIZE);

SQL> execute dbms_stats.gather_table_stats(ownname => 'sys',tabname => 'TEST11',cascade => true);
SQL> select BLOCKS from dba_tables where table_name='TEST11';

BLOCKS
----------
46

SQL> select dba_indexes.clustering_factor from dba_indexes where dba_indexes.index_name='INDEX11';

CLUSTERING_FACTOR
-----------------
44

这次的试验建立表的时候我使用了索引关键字进行排序，这个时候看到索引的群集因子和表的块数相差不大，除去表中HEADER块，其实是相等的。这个时候索引块一一对应表块，通过索引访问表的I/O也会低，是最好的

学习ORACLE的思维-索引结构分析

一、思考集群因子CLUSTERING_FACTOR
It is calculated as follows:
1. The index is scanned in order.
2. The block portion of the rowid pointed at by the current indexed value is compared with that pointed at by the previous indexed value.
This is achieved by comparing adjacent rowids in the index leaf block (the blocks themselves do not need to be visited).
3. If these rowids are pointing at different blocks then a counter is incremented.
4. This is continued throughout the whole index.
5. The resultant count is then stored.

索引集群因子的计算方法：
1. 用当前索引值的rowid中指向块部份的值和上一条索引值的rowid中指向块部份的值相比较。
2. 这一步通过在索引的树叶块（本身不用被访问的块）中比较相邻的rowid来完成。
3. 如果相邻rowid指向不同的块，那么就会使计数器增加。
4. 以上比较会持续进行直到扫描完整个索引。
5. 最后结果的数值就被记录下来。

测试思路：创建两张内容相同但索引字段是无序和升序的数据表，数据来源为dba_objects，从而测试对索引集群因子和其他索引统计数据的影响。

create table test_objects1 as select * from dba_objects;

create table test_objects2 as select * from dba_objects order by object_id;

create index idx1_test_objects1 on test_objects1(object_id);

create index idx1_test_objects2 on test_objects2(object_id);

select * from test_objects1 where rownum < 10;
test_objects1全表扫描可以看到object_id是无序的。

select * from test_objects2 where rownum < 10;
test_objects2全表扫描可以看到object_id是升序的。

通过分析表获得统计数据
DBA_INDEXES:
execute dbms_stats.gather_table_stats('SCOTT', 'TEST_OBJECTS1', cascade => true);

BLEVEL 1
LEAF_BLOCKS 68
DISTINCT_KEYS 30761
AVG_LEAF_BLOCKS_PER_KEY 1
AVG_DATA_BLOCKS_PER_KEY 1
CLUSTERING_FACTOR 25581
STATUS VALID
NUM_ROWS 30761
SAMPLE_SIZE 30761

execute dbms_stats.gather_table_stats('SCOTT', 'TEST_OBJECTS2', cascade => true);

BLEVEL 1
LEAF_BLOCKS 68
DISTINCT_KEYS 30762
AVG_LEAF_BLOCKS_PER_KEY 1
AVG_DATA_BLOCKS_PER_KEY 1
CLUSTERING_FACTOR 421
STATUS VALID
NUM_ROWS 30762
SAMPLE_SIZE 30762

INDEX_STATS:
ANALYZE INDEX IDX1_TEST_OBJECTS1 VALIDATE STRUCTURE;

HEIGHT 2
BLOCKS 80
NAME IDX1_TEST_OBJECTS1
PARTITION_NAME
LF_ROWS 30761
LF_BLKS 68
LF_ROWS_LEN 481877
LF_BLK_LEN 7996
BR_ROWS 67
BR_BLKS 1
BR_ROWS_LEN 784
BR_BLK_LEN 8028
DEL_LF_ROWS 0
DEL_LF_ROWS_LEN 0
DISTINCT_KEYS 30761
MOST_REPEATED_KEY 1
BTREE_SPACE 551756
USED_SPACE 482661
PCT_USED 88
ROWS_PER_KEY 1
BLKS_GETS_PER_ACCESS 3
PRE_ROWS 0
PRE_ROWS_LEN 0
OPT_CMPR_COUNT 0
OPT_CMPR_PCTSAVE 0

ANALYZE INDEX IDX1_TEST_OBJECTS2 VALIDATE STRUCTURE;

HEIGHT 2
BLOCKS 80
NAME IDX1_TEST_OBJECTS2
PARTITION_NAME
LF_ROWS 30762
LF_BLKS 68
LF_ROWS_LEN 481893
LF_BLK_LEN 7996
BR_ROWS 67
BR_BLKS 1
BR_ROWS_LEN 784
BR_BLK_LEN 8028
DEL_LF_ROWS 0
DEL_LF_ROWS_LEN 0
DISTINCT_KEYS 30762
MOST_REPEATED_KEY 1
BTREE_SPACE 551756
USED_SPACE 482677
PCT_USED 88
ROWS_PER_KEY 1
BLKS_GETS_PER_ACCESS 3
PRE_ROWS 0
PRE_ROWS_LEN 0
OPT_CMPR_COUNT 0
OPT_CMPR_PCTSAVE 0

分析统计数据：
根据索引的统计数据和结构数据可以看出，除了集群因子有较大差别外，其他统计信息都是相同。
那集群因子对SQL有什么影响呢？
对于一般的等值查询来说，集群因子是不会有任何影响的，而对于范围的查询则有很大的影响。
例如：
查询object_id = xxxx的等值查询
select * from test_objects1 where object_id = xxxx;
select * from test_objects2 where object_id = xxxx;
按索引的查询原理，
1. 首先查询索引根节点，
2. 然后查询分支节点，
3. 然后是叶子节点，
4. 最后回表查询数据。
所以对于高度相同的索引来说，查询的开销和效率应该是一致的，完全不会受到集群因子的影响。

查询objct_id >= xxxx and object_id <= yyyy的范围查询
select * from test_objects1 where objct_id >= xxxx and object_id <= yyyy;
select * from test_objects2 where objct_id >= xxxx and object_id <= yyyy;
按索引的查询原理，
1. 首先查询索引根节点，
2. 然后查询分支节点，
3. 然后是范围开始叶子节点，
4. 下一个叶子节点，回表查询数据，
5. 重复2 - 4直到范围结束的节点。
在1 - 3步骤是不会有差别的，差别发生在步骤4的回表查询，
如果集群因子接近与数据块的数量，则表明表的数据是有序的，回表查询需要更少的数据块，从而减少逻辑读和物理读。
而相反集群因子接近于行的数量，则表明表的数据是无序的甚至可能是存在碎片的，回表查询需要更多的数据块，从而需要更多的逻辑读和物理读。

以上观点可以通过统计数据证明：

SQL> select * from test_objects1 where object_id = 2000;
Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
          4 consistent gets
          0 physical reads
          0 redo size
       1364 bytes sent via SQL*Net to client
        655 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

SQL> select * from test_objects2 where object_id = 2000;
Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
          4 consistent gets
          0 physical reads
          0 redo size
       1364 bytes sent via SQL*Net to client
        655 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

SQL> select * from test_objects1 where object_id >= 8000 and object_id <= 9000
Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
        489 consistent gets
          0 physical reads
          0 redo size
      73539 bytes sent via SQL*Net to client
       1381 bytes received via SQL*Net from client
         68 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
       1001 rows processed

SQL> select * from test_objects2 where object_id >= 8000 and object_id <= 9000;
Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
        153 consistent gets
          0 physical reads
          0 redo size
      52350 bytes sent via SQL*Net to client
       1381 bytes received via SQL*Net from client
         68 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
       1001 rows processed

从统计数据可以看出等值查询的统计数据是完全相同，
而范围查询中，一致读相差3倍，网络数据增加50%。
所以根据上述的分析可以知道，如果索引只是等值查询的话，可以不考虑集群因子的影响，
如果是范围查询的话则需要考虑如果让数据块更加有序，从而提高SQL语句的效率。
引申到多表等值关联和范围关联等也会造成较大的影响。

二、思考索引rebuild
索引是否需要重建？这是一个困扰大多数DBA的问题，
想知道索引是否需要重建，首先需要了解BTREE索引的原理和实现。
简单的说BTREE索引就是将索引列组织成树形结构，索引列是按照顺序排列的，分支节点指示索引列的范围，叶子节点只是索引列的位置rowid。
Oracle的BTREE索引是会自动平衡叶子节点的高度，从根节点到任何一个叶子节点的长度是相等的。所以坊间流传对索引的操作会引起索引不平衡是一个错误的说法。
Oracle的BTREE索引处理DML操作时主要分为两种分裂方式：99/1和50/50。
99/1是插入最大索引值时（最小是不适用的，这点需要继续研究，我的猜想和索引的序列方向有关系），索引节点分裂后，新插入的值分裂到新的叶子节点，而原来的叶子节点不变。
50/50是插入当前索引值范围内的索引值时，原叶子节点的数据被分裂为相同大小的两个叶子节点并插入新的索引值到其中的节点，两个叶子节点都只有50%的利用率。
所以99/1方式一般不会产生索引碎片，而50/50方式就很容易产生碎片了。

根据上述原理我们来做个测试：
（1）99/1方式
drop table t1;
create table t1 (name varchar2(10),nr number) pctfree 0;
create index i1 on t1(nr);
declare
i number;
begin
for i in 1..50000
loop
insert into t1 values('XX',i);
end loop;
end;
/

SQL> analyze index i1 validate structure;

SQL> select blocks, lf_blks, pct_used from index_stats;
BLOCKS LF_BLKS PCT_USED
---------- ---------- ----------
112 99 99

（2）50/50方式
drop table t2;
create table t2 (name varchar2(10),nr number) pctfree 0;
create index i2 on t2(nr);
declare
i number;
begin
insert into t2 values('XX',50000);
for i in 1..49999
loop
insert into t2 values('XX',i);
end loop;
end;
/

SQL> analyze index i2 validate structure;

SQL> select blocks, lf_blks, pct_used from index_stats;

    BLOCKS    LF_BLKS   PCT_USED
---------- ---------- ----------
       384        193         52

（3）从大往小插50/50方式
drop table t1;
create table t1 (name varchar2(10),nr number) pctfree 0;
create index i1 on t1(nr);
declare
i number;
begin
for i in reverse 1..50000
loop
    insert into t1 values('XX',i);
end loop;
end;
/

SQL> analyze index i1 validate structure;

SQL> select blocks, lf_blks, pct_used from index_stats;

BLOCKS LF_BLKS PCT_USED
---------- ---------- ----------
384 200 50

根据上述的分析显示，要让索引高效则应用应该尽可能适应这种索引分裂模式，
插入按日期，序列的正序列方式，删除按照FIFO方式则是很好的适应Oracle的索引设计模式。
按照上述设计模式进行设计的应用一般是不需要重建索引的。
而违反这种设计模式的应用则需要定期进行索引重建，使索引保持高效。

请看一下例子：
create table t1 (name varchar2(10),nr number) pctfree 0;
create index i1 on t1(nr);

declare
i number;
begin
for i in 1..50000
loop
insert into t1 values('XX',i);
end loop;
commit;
end;
/

declare
i number;
begin
for i in 1..50000
loop
delete from t1 where nr = i;
end loop;
commit;
end;
/

declare
i number;
begin
for i in 50000..100000
loop
insert into t1 values('XX',i);
end loop;
commit;
end;
/

declare
i number;
begin
for i in 50000..60000
loop
delete from t1 where nr = i;
end loop;
commit;
end;
/

declare
i number;
begin
for i in 100000..110000
loop
insert into t1 values('XX',i);
end loop;
commit;
end;
/

SQL> analyze index i1 validate structure;

Index analyzed.

SQL> select blocks, lf_blks, pct_used from index_stats;

BLOCKS LF_BLKS PCT_USED
---------- ---------- ----------
112 102 98

drop table t1;
create table t1 (name varchar2(10),nr number) pctfree 0;
create index i1 on t1(nr);

declare
i number;
begin
for i in 1..50000
loop
insert into t1 values('XX',i);
end loop;
commit;
end;
/

declare
i number;
begin
for i in 2..9999
loop
    delete from t1 where nr = i;
end loop;
for i in 10001..19999
loop
    delete from t1 where nr = i;
end loop;
for i in 20001..29999
loop
    delete from t1 where nr = i;
end loop;
for i in 30001..39999
loop
    delete from t1 where nr = i;
end loop;
for i in 40001..49999
loop
    delete from t1 where nr = i;
end loop;
commit;
end;
/

declare
i number;
begin
for i in 50000..100000
loop
insert into t1 values('XX',i);
end loop;
commit;
end;
/

declare
i number;
begin
for i in 50001..59999
loop
delete from t1 where nr = i;
end loop;
commit;
end;
/

declare
i number;
begin
for i in 100000..110000
loop
insert into t1 values('XX',i);
end loop;
commit;
end;
/

SQL> analyze index i1 validate structure;

Index analyzed.

SQL> select blocks, lf_blks, pct_used from index_stats;

BLOCKS LF_BLKS PCT_USED
---------- ---------- ----------
120 109 92

SQL> alter index i1 rebuild online;

Index altered.

SQL> analyze index i1 validate structure;

Index analyzed.

SQL> select blocks, lf_blks, pct_used from index_stats;

BLOCKS LF_BLKS PCT_USED
---------- ---------- ----------
128 112 89

根据分析上述统计数据可以看出，叶子节点多了7个，基本上和6个剩余的记录相等，表示每个记录就要浪费一个数据块的空间。
如果这样的记录不断增加，那么索引就需要进行重建了。
有时候重建后由于需要平衡各方面的因素，重建后的索引的结构可能会显得更差，但其实索引的整体性能是相差不大的，因为数据块都变满了，同时分布也更平均了。