使用exists(Semi-Join)优化distinct语句

在Oracle 官方文档,semi-join是这么解释的:

A semijoin returns rows that match an EXISTS subquery without duplicating rows from the left side of the predicate when multiple rows on the right side satisfy the
criteria of the subquery.

Semijoin and antijoin transformation cannot be done if the subquery is on an OR branch of the WHERE clause.

由上可见,semi-join特别适用于如下场景:
表A 和B有1:N关系,需要根据B表某个条件,通过semi-join查询A表上某个列(属性)。因为semi-join已经包含一个去重过程,不需要在A表上加distinct。比内连接然后去重方式性能更好。

假设有两个表Customer , orders。当然,每个客户可以有多个订单。现在需要找出订单金额超过10000的客户数量。本文例子在12cR1上实现,11g上应该是同样结果。

show rel;
release 1201000100

desc customer
Name        Null     Type          
----------- -------- -------------
CUSTOMER_ID NOT NULL NUMBER(12)    
NAME                 VARCHAR2(100)
STATUS               VARCHAR2(10)  

desc orders
Name        Null Type         
----------- ---- ------------
ORDER_ID         NUMBER(12)   
CUSTOMER_ID      NUMBER(12)   
CREATE_DATE      DATE         
ORDER_PRICE      NUMBER(38,6)

select count(*) from  orders;


count(*)
--------------
1000002

select count(*) from customer;


count(*)
--------------
100000

orders表有100万行,customer表有10万行。

假设我需要一个用户列表,该列表中用户的订单金额低于10000。我们有两种写法 - 内连接或Exists。

下面来比较下inner join和semi-join两种SQL写法的性能。
Select /*HE7*/ count(distinct cus.name)
from customer cus, orders ord
where cus.customer_id=ord.customer_id
and ord.order_price < 10000;

SQL_ID  3kfvh06bqrn5h, child number 0
-------------------------------------
select /*HE7*/ count(distinct cus.name) from customer cus, orders ord
where cus.customer_id=ord.customer_id and ord.order_price < 10000
Plan hash value: 3082118893
------------------------------------------------------------------------------------------
| Id  | Operation             | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |       |       |  3212 (100)|          |
|   1 |  SORT AGGREGATE       |          |     1 |    52 |       |            |          |
|   2 |   VIEW                | VW_DAG_0 | 52570 |  2669K|       |  3212   (1)| 00:00:01 |
|   3 |    HASH GROUP BY      |          | 52570 |  2104K|  2688K|  3212   (1)| 00:00:01 |
|*  4 |     HASH JOIN SEMI    |          | 52570 |  2104K|  2640K|  2651   (1)| 00:00:01 |
|   5 |      TABLE ACCESS FULL| CUSTOMER |   100K|  1464K|       |   102   (0)| 00:00:01 |
|*  6 |      TABLE ACCESS FULL| ORDERS   |   753K|    18M|       |  1063   (1)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("CUS"."CUSTOMER_ID"="ORD"."CUSTOMER_ID")
   6 - filter("ORD"."ORDER_PRICE"<10000)
Note
-----

   - dynamic statistics used: dynamic sampling (level=2)

这个执行计划的总成本为3212。由于做了内连接(inner join),最终结果集里将会出现所有匹配行,包含一部分重复customer(用户)。著名Oracle布道者Tom Kyte在2006年一篇文章里就写到:


“In general, you should phrase the queries in the manner that says it best. If one set of joins were particularly efficient to use in all cases, Oracle would not have implemented the rest of them!

In general, you use a join when you need data from more than one table in the ultimate SELECT list.”

文章出处:
http://www.oracle.com/technetwork/issue-archive/2006/06-may/o36asktom-099107.html
在这里,我们只需要customer表的信息,orders表是用来做检索(filtering)的。
只将customer表放在from处,更加自然。用exists(semi-join)改写后:

Select /*HE8*/ count(name)
from customer
where exists(
  select 1 from orders
  where customer.customer_id=orders.customer_id
  and orders.order_price < 10000
);

由于semi-join,这里我们不需要distinct,从customer能直接得到去重后的结果。查看下执行计划和最终的cost

SQL_ID  ggs8p2c27nzmu, child number 0
-------------------------------------
select /*HE8*/ count(name) from customer where exists(   select 1 from
orders   where customer.customer_id=orders.customer_id   and
orders.order_price < 10000 )
Plan hash value: 2841769433
----------------------------------------------------------------------------------------
| Id  | Operation           | Name     | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |          |       |       |       |  2651 (100)|          |
|   1 |  SORT AGGREGATE     |          |     1 |    41 |       |            |          |
|*  2 |   HASH JOIN SEMI    |          | 52570 |  2104K|  2640K|  2651   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| CUSTOMER |   100K|  1464K|       |   102   (0)| 00:00:01 |
|*  4 |    TABLE ACCESS FULL| ORDERS   |   753K|    18M|       |  1063   (1)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("CUSTOMER"."CUSTOMER_ID"="ORDERS"."CUSTOMER_ID")
   4 - filter("ORDERS"."ORDER_PRICE"<10000)
Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

很明显,第二个执行计划优于第一个。

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值