分类: Oracle
在ORACLE 11G大行其道的今天,还有很多人受早期版本的影响,记住一些既定的规则,
1.子查询结果集小,用IN
2.外表小,子查询表大,用EXISTS
这是完全错误的观点。在8i时代,这经常是正确的,但是现在已经11G了,马上12C就要面世了。其实在ORACLE 9i CBO就已经优化了IN,EXISTS的区别,ORACLE优化器有个查询转换器,很多SQL虽然写法不同,但是ORACLE优化器会根据既定规则进行查询重写,重写为优化器觉得效率最高的SQL,所以可能SQL写法不同,但是执行计划却是完全一样的。
IN与EXISTS有一点要记住:IN一般是用于非相关子查询,而EXISTS一般用于相关子查询。当然IN也可以用于相关子查询,EXISTS也可以用于非相关子查询。但是这区别很重要,虽然优化器很强大,但是查询转换是有一定的限制的,在EXISTS性能低下,无法进行相关查询转换,比如不能UNNEST SUBQUERY,那么可能我们需要改写SQL,通常可以用IN/JOIN等改写相关子查询,或对于含有OR的SEMI JOIN改为UNION
ALL/UNION的形式。
下面就用例子说明一下:
DROP TABLE a; DROP TABLE b; CREATE TABLE a AS SELECT * FROM hr.employees; CREATE TABLE b AS SELECT * FROM hr.employees; –反复插入,构造20万行+ INSERT INTO a SELECT * FROM a; INSERT INTO b SELECT * FROM a; COMMIT; BEGIN dbms_stats.gather_table_stats(ownname => USER,tabname => ‘a’,estimate_percent => 100,cascade => TRUE); dbms_stats.gather_table_stats(ownname => USER,tabname => ‘b’,estimate_percent => 100,cascade => TRUE); END; / |
1.测试IN,EXISTS在简单查询中,是等价的
SQL> set autotrace traceonly exp SQL> SELECT * FROM a 2 WHERE EXISTS( 3 SELECT 1 FROM b WHERE a.employee_id=b.employee_id); 执行计划 ———————————————————- Plan hash value: 2317816356 ——————————————————————————— | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ——————————————————————————— | 0 | SELECT STATEMENT | | 217K| 15M| 1375 (2)| 00:00:17 | |* 1 | HASH JOIN | | 217K| 15M| 1375 (2)| 00:00:17 | | 2 | SORT UNIQUE | | 217K| 848K| 126 (1)| 00:00:02 | | 3 | INDEX FAST FULL SCAN| IDX2_B | 217K| 848K| 126 (1)| 00:00:02 | | 4 | TABLE ACCESS FULL | A | 217K| 14M| 620 (1)| 00:00:08 | ——————————————————————————— Predicate Information (identified by operation id): ————————————————— 1 - access(“A”.”EMPLOYEE_ID”=”B”.”EMPLOYEE_ID”) SQL> SELECT * FROM a 2 WHERE a.employee_id IN (SELECT b.employee_id FROM b); 执行计划 ———————————————————- Plan hash value: 2317816356 ——————————————————————————— | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ——————————————————————————— | 0 | SELECT STATEMENT | | 217K| 15M| 1375 (2)| 00:00:17 | |* 1 | HASH JOIN | | 217K| 15M| 1375 (2)| 00:00:17 | | 2 | SORT UNIQUE | | 217K| 848K| 126 (1)| 00:00:02 | | 3 | INDEX FAST FULL SCAN| IDX2_B | 217K| 848K| 126 (1)| 00:00:02 | | 4 | TABLE ACCESS FULL | A | 217K| 14M| 620 (1)| 00:00:08 | ——————————————————————————— Predicate Information (identified by operation id): ————————————————— 1 - access(“A”.”EMPLOYEE_ID”=”B”.”EMPLOYEE_ID”) |
可以看出,两个计划完全没有区别。类似于ORACLE查询重写为:
SELECT a.* FROM a,(SELECT DISTINCT b.employee_id FROM b) b1 WHERE a.employee_id=b1.employee_id;
看看8i时代的区别:
SQL> SELECT/*+optimizer_features_enable(‘8.1.7’)*/ * FROM a 2 WHERE EXISTS( 3 SELECT 1 FROM b WHERE a.employee_id=b.employee_id); 已用时间: 00: 00: 00.00 执行计划 ———————————————————- Plan hash value: 3422092984 ————————————————————- | Id | Operation | Name | Rows | Bytes | Cost | ————————————————————- | 0 | SELECT STATEMENT | | 10854 | 731K| 344 | |* 1 | FILTER | | | | | | 2 | TABLE ACCESS FULL| A | 10854 | 731K| 344 | |* 3 | INDEX RANGE SCAN | IDX2_B | 2049 | 8196 | 5 | ————————————————————- Predicate Information (identified by operation id): ————————————————— 1 - filter( EXISTS (SELECT 0 FROM “B” “B” WHERE “B”.”EMPLOYEE_ID”=:B1)) 3 - access(“B”.”EMPLOYEE_ID”=:B1) Note —– - cpu costing is off (consider enabling it) SQL> SELECT/*+optimizer_features_enable(‘8.1.7’)*/ * FROM a 2 WHERE a.employee_id IN (SELECT b.employee_id FROM b); 已用时间: 00: 00: 00.00 执行计划 ———————————————————- Plan hash value: 1679318093 ————————————————————————- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost | ————————————————————————- | 0 | SELECT STATEMENT | | 217K| 16M| | 1126 | |* 1 | HASH JOIN | | 217K| 16M| | 1126 | | 2 | VIEW | VW_NSO_1 | 106 | 1378 | | 779 | | 3 | SORT UNIQUE | | 106 | 424 | 2576K| 779 | | 4 | TABLE ACCESS FULL| B | 217K| 848K| | 344 | | 5 | TABLE ACCESS FULL | A | 217K| 14M| | 344 | ————————————————————————- Predicate Information (identified by operation id): ————————————————— 1 - access(“A”.”EMPLOYEE_ID”=”EMPLOYEE_ID”) Note —– - cpu costing is off (consider enabling it) |
驱动,内表返回行少,效率高。当然具体情况肯定会很复杂,这里不做研究了,因为8i已经是过去式了。
2.遇到优化器限制的做法:改写SQL
ORACLE优化器虽然已经很强大,但是还有很多限制,比如无法UNNEST的限制如子查询有CONNECT BY,SET操作,ROWNUM,关联子查询内部包含分组函数等。还比如SEMI JOIN.ANTI JOIN条件带OR的形式等。这时候,我们常用的优化方式就是SQL的等价改写,这要根据具体的业务和数据特点,来重写等价的SQL,千万别改写成结果不等价,那就糟糕了。
比如这个SQL:
SELECT * FROM a WHERE EXISTS( SELECT 1 FROM b WHERE a.department_id=b.department_id GROUP BY b.department_id HAVING a.salary>=MAX(b.salary)); SQL> @display_cursor SQL_ID dgc8b80sxwct2, child number 1 ————————————- SELECT * FROM a WHERE EXISTS( SELECT 1 FROM b WHERE a.department_id=b.department_id GROUP BY b.department_id HAVING a.salary>=MAX(b.salary)) Plan hash value: 705769378 —————————————————————————————– | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | —————————————————————————————– | 0 | SELECT STATEMENT | | 1 | | 22528 |00:09:13.61 | 32M| |* 1 | FILTER | | 1 | | 22528 |00:09:13.61 | 32M| | 2 | TABLE ACCESS FULL | A | 1 | 217K| 217K|00:00:00.20 | 3733 | |* 3 | FILTER | | 14403 | | 2058 |00:09:13.20 | 32M| | 4 | SORT GROUP BY NOSORT| | 14403 | 1 | 14403 |00:09:13.17 | 32M| |* 5 | TABLE ACCESS FULL | B | 14403 | 19745 | 973M|00:06:17.21 | 32M| —————————————————————————————– Predicate Information (identified by operation id): ————————————————— 1 - filter( IS NOT NULL) 3 - filter(MAX(“B”.”SALARY”)<=:B1) 5 - filter(“B”.”DEPARTMENT_ID”=:B1) |
这是个很简单的SQL,但是因为使用了EXISTS关联子查询,并且内部有分组操作,无法进行有效的查询转换,走了FILTER操作,FILTER操作类似于NESTED LOOPS,但是不同于NESTED LOOPS的是,他还可以通过条件判断,是否走子步骤。这里全表扫描B 14403次(这是无法忍受的,特别遇到大表,甭想跑出来了)。
那么如何优化这种SQL呢,要改写,改写为JOIN形式:
SQL> SELECT * FROM a,(SELECT department_id,MAX(b.salary) max_salary FROM b GROUP BY b.department_id) b1 2 WHERE a.department_id=b1.department_id AND a.salary>=b1.max_salary; 已选择22528行。 已用时间: 00: 00: 00.64 执行计划 ———————————————————- Plan hash value: 774961296 —————————————————————————– | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | —————————————————————————– | 0 | SELECT STATEMENT | | 10854 | 1006K| 1250 (2)| 00:00:16 | |* 1 | HASH JOIN | | 10854 | 1006K| 1250 (2)| 00:00:16 | | 2 | VIEW | | 11 | 286 | 629 (3)| 00:00:08 | | 3 | HASH GROUP BY | | 11 | 88 | 629 (3)| 00:00:08 | | 4 | TABLE ACCESS FULL| B | 217K| 1696K| 620 (1)| 00:00:08 | | 5 | TABLE ACCESS FULL | A | 217K| 14M| 620 (1)| 00:00:08 | —————————————————————————– Predicate Information (identified by operation id): ————————————————— 1 - access(“A”.”DEPARTMENT_ID”=”B1”.”DEPARTMENT_ID”) filter(“A”.”SALARY”>=”B1”.”MAX_SALARY”) 统计信息 ———————————————————- 0 recursive calls 0 db block gets 5986 consistent gets 0 physical reads 0 redo size 1851483 bytes sent via SQL*Net to client 16926 bytes received via SQL*Net from client 1503 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 22528 rows processed |
从上面看出,逻辑读相对于没有改写之前的32M,时间9分钟,效率现在大幅度提升,时间变为00: 00: 00.64,l逻辑读变为5986次,因为走了HASH JOIN,相当于两表各扫描1次。
上面是通过将子查询改为JOIN的形式来优化,当然改写方式多样,下面用一个EXISTS改写为IN的方式来提高效率:
SELECT * FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.department_id=b.department_id AND a.employee_id=b.employee_id START WITH b.employee_id =202 CONNECT BY PRIOR b.employee_id=b.manager_id); SQL> set autotrace traceonly exp SQL> SELECT * FROM a WHERE EXISTS(SELECT 1 FROM b WHERE a.department_id=b.department_id AND a.employee_id=b.employee_id 2 START WITH b.employee_id =202 CONNECT BY PRIOR b.employee_id=b.manager_id); 已用时间: 00: 00: 00.00 执行计划 ———————————————————- Plan hash value: 985844456 ————————————————————————————————— | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ————————————————————————————————— | 0 | SELECT STATEMENT | | 186 | 12834 | 723K (1)| 02:24:37 | |* 1 | FILTER | | | | | | | 2 | TABLE ACCESS FULL | A | 217K| 14M| 620 (1)| 00:00:08 | |* 3 | FILTER | | | | | | |* 4 | CONNECT BY NO FILTERING WITH SW (UNIQUE)| | | | | | | 5 | TABLE ACCESS FULL | B | 217K| 2545K| 620 (1)| 00:00:08 | ————————————————————————————————— Predicate Information (identified by operation id): ————————————————— 1 - filter( EXISTS (SELECT 0 FROM “B” “B” WHERE “B”.”DEPARTMENT_ID”=:B1 AND “B”.”EMPLOYEE_ID”=:B2 START WITH “B”.”EMPLOYEE_ID”=202 CONNECT BY “B”.”MANAGER_ID”=PRIOR “B”.”EMPLOYEE_ID”)) 3 - filter(“B”.”DEPARTMENT_ID”=:B1 AND “B”.”EMPLOYEE_ID”=:B2) 4 - access(“B”.”MANAGER_ID”=PRIOR “B”.”EMPLOYEE_ID”) filter(“B”.”EMPLOYEE_ID”=202) |
又是和前面类似的FILTER操作,这条SQL也要运行很长时间,FILTER操作不是不好,就像NESTED LOOPS一样,也有高效的时候,如果FILTER操作做的次数不多,而且分支操作可以高效实用索引,那么也是高效的,这得注意。
改写,以上查询可以很容易改写为IN的形式:
SQL> SELECT * FROM a WHERE (a.department_id,a.employee_id) IN 2 (SELECT b.department_id,b.employee_id FROM b START WITH b.employee_id =202 CONNECT BY PRIOR b.employee_id=b.manager_id ); 已用时间: 00: 00: 00.13 执行计划 ———————————————————- Plan hash value: 1584203533 ————————————————————————————————————– | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | ————————————————————————————————————– | 0 | SELECT STATEMENT | | 2048 | 190K| | 2466 (1)| 00:00:30 | |* 1 | HASH JOIN RIGHT SEMI | | 2048 | 190K| 8064K| 2466 (1)| 00:00:30 | | 2 | VIEW | VW_NSO_1 | 217K| 5514K| | 1617 (62)| 00:00:20 | |* 3 | CONNECT BY NO FILTERING WITH START-WITH| | | | | | | | 4 | TABLE ACCESS FULL | B | 217K| 2545K| | 620 (1)| 00:00:08 | | 5 | TABLE ACCESS FULL | A | 217K| 14M| | 620 (1)| 00:00:08 | ————————————————————————————————————– Predicate Information (identified by operation id): ————————————————— 1 - access(“A”.”DEPARTMENT_ID”=”DEPARTMENT_ID” AND “A”.”EMPLOYEE_ID”=”EMPLOYEE_ID”) 3 - access(“B”.”MANAGER_ID”=PRIOR “B”.”EMPLOYEE_ID”) filter(“B”.”EMPLOYEE_ID”=202) |
走了HASH JOIN,这条SQL效率大增,基本在<1s内会返回结果。
在学习过程中,一定要亲自实践,不能遵循于从某个地方看到的规则,特别是N年前的规则,规则是有用的,但是规则也会不断滴更新的,如果规则发生了更新,但是在你的脑子里没有更新,你却不经过实践,就永远遵循这规则,那你对这方面的知识永远知之甚少或知而不全。