在微信群中,老虎刘老师提了一个有趣的问题,这个SQL,object_id列的可选择性非常高,owner列的可选择性比较差,你认为创建什么索引最佳?
select max(object_id) from t where owner='SYS';
但从这条SQL看,一共就用了两个字段,object_id和owner,如果使用穷举法,排列组合,能创建的B*Tree索引的类型就这几种,
1. object_id单键值索引。
2. owner单键值索引。
3. (object_id, owner)复合索引。
4. (owner, object_id)复合索引。
我们从实际的成本消耗,看下这几个方案,孰劣孰优?
首先创建测试表,可以看到,object_id选择率很高,owner选择率很低,
SQL> create table t as select object_id, owner from all_objects;
Table created.
SQL> select count(*) from t;
COUNT(*)
----------
5757
SQL> select count(distinct object_id) from t;
COUNT(DISTINCTOBJECT_ID)
------------------------
5757
SQL> select owner, count(*) from t group by owner;
OWNER COUNT(*)
-------------------- ----------
BISAL 6
PUBLIC 3340
SYSTEM 4
OE 2
SYS 2405
为了对比,我们看下无任何索引的消耗,全表扫描,consistent gets是69,
SQL> select max(object_id) from t where owner='SYS';
...
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 30 | 6 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 30 | | |
|* 2 | TABLE ACCESS FULL| T | 2405 | 72150 | 6 (0)| 00:00:01 |
---------------------------------------------------------------------------
...
Statistics
----------------------------------------------------------
45 recursive calls
0 db block gets
69 consistent gets
0 physical reads
0 redo size
534 bytes sent via SQL*Net to client
519 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
5 sorts (memory)
0 sorts (disk)
1 rows processed
方案1,object_id单键值索引,由于条件字段owner,无索引,采用了全表扫描,consistent gets是32,
SQL> create index idx_t_01 on t(object_id);
Index created.
SQL> select max(object_id) from t where owner='SYS';
...
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 30 | 6 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 30 | | |
|* 2 | TABLE ACCESS FULL| T | 2405 | 72150 | 6 (0)| 00:00:01 |
---------------------------------------------------------------------------
...
Statistics
----------------------------------------------------------
5 recursive calls
0 db block gets
32 consistent gets
0 physical reads
0 redo size
534 bytes sent via SQL*Net to client
519 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
方案2,owner单键值索引,Oracle的CBO会根据各种执行计划的成本,选择出成本值最低的一个,虽然owner有索引,但是owner='SYS'的记录会返回接近一半的数据,相比索引单块读,全表扫描多块读,效率会更高一些,此时consistents gets是52,
SQL> create index idx_t_01 on t(owner);
Index created.
SQL> select max(object_id) from t where owner='SYS';
...
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 30 | 6 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 30 | | |
|* 2 | TABLE ACCESS FULL| T | 2405 | 72150 | 6 (0)| 00:00:01 |
---------------------------------------------------------------------------
...
Statistics
----------------------------------------------------------
17 recursive calls
0 db block gets
52 consistent gets
6 physical reads
0 redo size
534 bytes sent via SQL*Net to client
519 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
方案3,(object_id, owner)复合索引,因为条件只有owner,而索引前导列object_id的区分度很高,所以不会选择索引跳跃扫描,consistent gets是45,
SQL> create index idx_t_01 on t(object_id, owner);
Index created.
SQL> select max(object_id) from t where owner='SYS';
...
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 30 | 6 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 30 | | |
|* 2 | TABLE ACCESS FULL| T | 1 | 30 | 6 (0)| 00:00:01 |
---------------------------------------------------------------------------
...
Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
45 consistent gets
0 physical reads
0 redo size
534 bytes sent via SQL*Net to client
519 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
如果使用HINT强制采用了这个索引,可以看到用的是索引全扫描,consistent get是32,这个效率和object_id单键值索引相差无几,
SQL> create index idx_t_01 on t(object_id, owner);
Index created.
SQL> select max(object_id) from t where owner='SYS';
...
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 30 | 23 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 30 | | |
| 2 | FIRST ROW | | 1 | 30 | 23 (0)| 00:00:01 |
|* 3 | INDEX FULL SCAN (MIN/MAX)| IDX_T_01 | 1 | 30 | 23 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
...
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("OWNER"='SYS')
...
Statistics
----------------------------------------------------------
13 recursive calls
0 db block gets
32 consistent gets
1 physical reads
0 redo size
534 bytes sent via SQL*Net to client
519 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
如果我们增加测试数据量,
SQL> select count(*), count(distinct object_id), count(distinct owner) from t;
COUNT(*) COUNT(DISTINCTOBJECT_ID) COUNT(DISTINCTOWNER)
---------- ------------------------ --------------------
5898240 5760 5
SQL> select owner, count(*) from t group by owner;
OWNER COUNT(*)
-------------------- ----------
SYSTEM 4096
OE 2048
PUBLIC 3420160
BISAL 6144
SYS 2465792
此时,即使不用HINT了,Oracle仍会选择索引全扫描,因为索引叶子结点是有序排列,max/min的值,不是最左边,就是最右边,当数据量很小的情况下,可以通过owner='SYS'先从数据块中找到符合条件的记录(毕竟全表扫描是多块读,数据量小的时候,即使读所有数据,可能不会读几次,如果是索引读,则要一个索引块一个索引块地读),然后再统计max/min的值。但是当数据量非常大的情况下,通过owner='SYS'扫描数据块的开销,就会比之前增加几个数量级,而通过索引全扫描的方式,多块读索引块,用owner当作filter过滤条件,开销就会小很多,如下所示,数据量增加1000倍,consistent gets只增加了一倍,
SQL> select max(object_id) from t where owner = 'SYS';
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 30 | 59 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 30 | | |
| 2 | FIRST ROW | | 1 | 30 | 59 (0)| 00:00:01 |
|* 3 | INDEX FULL SCAN (MIN/MAX)| IDX_T_01 | 1 | 30 | 59 (0)| 00:00:01 |
----------------------------------------------------------------------------------------
...
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
66 consistent gets
0 physical reads
0 redo size
534 bytes sent via SQL*Net to client
519 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
方案4,(owner, object_id)复合索引,因为检索条件中owner是索引的前导列,所以能使用索引范围扫描,consistent gets是40,但是不如object_id单键值索引,和object_id作为前导列的复合索引,
SQL> create index idx_t_01 on t(owner, object_id);
Index created.
SQL> select max(object_id) from t where owner='SYS';
...
------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 30 | 2 (0)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 30 | |
| 2 | FIRST ROW | | 1 | 30 | 2 (0)| 00:00:01 |
|* 3 | INDEX RANGE SCAN (MIN/MAX)| IDX_T_01 | 1 | 30 | 2 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------
...
Predicate Information (identified by operation id):
---------------------------------------------------
3 - access("OWNER"='SYS')
Statistics
----------------------------------------------------------
17 recursive calls
0 db block gets
40 consistent gets
7 physical reads
0 redo size
534 bytes sent via SQL*Net to client
519 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
1 rows processed
从实验来看,object_id单键值索引和object_id作为前导列的复合索引,效率最高,当然这的需求是max/min,虽然where条件不带object_id,但基于B*Tree索引的特点,能用上索引,否则条件中没有object_id,就可能无法用上这个索引,要考虑其他方案。
此外,object_id单键值索引和object_id作为前导列的复合索引,对这两种索引的选择,还取决于业务的需求,如果有同时用object_id和owner这两个字段做检索条件的,可以选择使用复合索引,如果没这种需求,就可以创建object_id单键值索引,这样一来,使用object_id和其他字段的复合检索,都可能用上object_id的单键值索引,一举多得。
索引的选择,其实还是非常讲究的,无论是索引类型,还是索引字段的顺序,针对不同的业务场景,都会有不同方案,最根本的,还是对索引的原理通晓,加上经验的积累,才可能慢慢掌握。