在Oracle 官方文档,semi-join是这么解释的:
A semijoin returns rows that match an EXISTS subquery without duplicating rows from the left side of the predicate when multiple rows on the right side satisfy the
criteria of the subquery.
Semijoin and antijoin transformation cannot be done if the subquery is on an OR branch of the WHERE clause.
由上可见,semi-join特别适用于如下场景:
表A 和B有1:N关系,需要根据B表某个条件,通过semi-join查询A表上某个列(属性)。因为semi-join已经包含一个去重过程,不需要在A表上加distinct。比内连接然后去重方式性能更好。
假设有两个表Customer , orders。当然,每个客户可以有多个订单。现在需要找出订单金额超过10000的客户数量。本文例子在12cR1上实现,11g上应该是同样结果。
show rel;
release 1201000100
desc customer
Name Null Type
----------- -------- -------------
CUSTOMER_ID NOT NULL NUMBER(12)
NAME VARCHAR2(100)
STATUS VARCHAR2(10)
desc orders
Name Null Type
----------- ---- ------------
ORDER_ID NUMBER(12)
CUSTOMER_ID NUMBER(12)
CREATE_DATE DATE
ORDER_PRICE NUMBER(38,6)
count(*)
--------------
1000002
count(*)
--------------
100000
SQL_ID 3kfvh06bqrn5h, child number 0
-------------------------------------
select /*HE7*/ count(distinct cus.name) from customer cus, orders ord
where cus.customer_id=ord.customer_id and ord.order_price < 10000
Plan hash value: 3082118893
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 3212 (100)| |
| 1 | SORT AGGREGATE | | 1 | 52 | | | |
| 2 | VIEW | VW_DAG_0 | 52570 | 2669K| | 3212 (1)| 00:00:01 |
| 3 | HASH GROUP BY | | 52570 | 2104K| 2688K| 3212 (1)| 00:00:01 |
|* 4 | HASH JOIN SEMI | | 52570 | 2104K| 2640K| 2651 (1)| 00:00:01 |
| 5 | TABLE ACCESS FULL| CUSTOMER | 100K| 1464K| | 102 (0)| 00:00:01 |
|* 6 | TABLE ACCESS FULL| ORDERS | 753K| 18M| | 1063 (1)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("CUS"."CUSTOMER_ID"="ORD"."CUSTOMER_ID")
6 - filter("ORD"."ORDER_PRICE"<10000)
Note
-----
“In general, you should phrase the queries in the manner that says it best. If one set of joins were particularly efficient to use in all cases, Oracle would not have implemented the rest of them!
http://www.oracle.com/technetwork/issue-archive/2006/06-may/o36asktom-099107.html
在这里,我们只需要customer表的信息,orders表是用来做检索(filtering)的。
只将customer表放在from处,更加自然。用exists(semi-join)改写后:
由于semi-join,这里我们不需要distinct,从customer能直接得到去重后的结果。查看下执行计划和最终的cost
SQL_ID ggs8p2c27nzmu, child number 0
-------------------------------------
select /*HE8*/ count(name) from customer where exists( select 1 from
orders where customer.customer_id=orders.customer_id and
orders.order_price < 10000 )
Plan hash value: 2841769433
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 2651 (100)| |
| 1 | SORT AGGREGATE | | 1 | 41 | | | |
|* 2 | HASH JOIN SEMI | | 52570 | 2104K| 2640K| 2651 (1)| 00:00:01 |
| 3 | TABLE ACCESS FULL| CUSTOMER | 100K| 1464K| | 102 (0)| 00:00:01 |
|* 4 | TABLE ACCESS FULL| ORDERS | 753K| 18M| | 1063 (1)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("CUSTOMER"."CUSTOMER_ID"="ORDERS"."CUSTOMER_ID")
4 - filter("ORDERS"."ORDER_PRICE"<10000)
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
很明显,第二个执行计划优于第一个。
A semijoin returns rows that match an EXISTS subquery without duplicating rows from the left side of the predicate when multiple rows on the right side satisfy the
criteria of the subquery.
Semijoin and antijoin transformation cannot be done if the subquery is on an OR branch of the WHERE clause.
由上可见,semi-join特别适用于如下场景:
表A 和B有1:N关系,需要根据B表某个条件,通过semi-join查询A表上某个列(属性)。因为semi-join已经包含一个去重过程,不需要在A表上加distinct。比内连接然后去重方式性能更好。
假设有两个表Customer , orders。当然,每个客户可以有多个订单。现在需要找出订单金额超过10000的客户数量。本文例子在12cR1上实现,11g上应该是同样结果。
show rel;
release 1201000100
desc customer
Name Null Type
----------- -------- -------------
CUSTOMER_ID NOT NULL NUMBER(12)
NAME VARCHAR2(100)
STATUS VARCHAR2(10)
desc orders
Name Null Type
----------- ---- ------------
ORDER_ID NUMBER(12)
CUSTOMER_ID NUMBER(12)
CREATE_DATE DATE
ORDER_PRICE NUMBER(38,6)
select count(*) from orders;
count(*)
--------------
1000002
select count(*) from customer;
count(*)
--------------
100000
orders表有100万行,customer表有10万行。
假设我需要一个用户列表,该列表中用户的订单金额低于10000。我们有两种写法 - 内连接或Exists。
Select /*HE7*/ count(distinct cus.name)
from customer cus, orders ord
where cus.customer_id=ord.customer_id
and ord.order_price < 10000;
SQL_ID 3kfvh06bqrn5h, child number 0
-------------------------------------
select /*HE7*/ count(distinct cus.name) from customer cus, orders ord
where cus.customer_id=ord.customer_id and ord.order_price < 10000
Plan hash value: 3082118893
------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 3212 (100)| |
| 1 | SORT AGGREGATE | | 1 | 52 | | | |
| 2 | VIEW | VW_DAG_0 | 52570 | 2669K| | 3212 (1)| 00:00:01 |
| 3 | HASH GROUP BY | | 52570 | 2104K| 2688K| 3212 (1)| 00:00:01 |
|* 4 | HASH JOIN SEMI | | 52570 | 2104K| 2640K| 2651 (1)| 00:00:01 |
| 5 | TABLE ACCESS FULL| CUSTOMER | 100K| 1464K| | 102 (0)| 00:00:01 |
|* 6 | TABLE ACCESS FULL| ORDERS | 753K| 18M| | 1063 (1)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
4 - access("CUS"."CUSTOMER_ID"="ORD"."CUSTOMER_ID")
6 - filter("ORD"."ORDER_PRICE"<10000)
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
这个执行计划的总成本为3212。由于做了内连接(inner join),最终结果集里将会出现所有匹配行,包含一部分重复customer(用户)。著名Oracle布道者Tom Kyte在2006年一篇文章里就写到:
“In general, you should phrase the queries in the manner that says it best. If one set of joins were particularly efficient to use in all cases, Oracle would not have implemented the rest of them!
In general, you use a join when you need data from more than one table in the ultimate SELECT list.”
文章出处:http://www.oracle.com/technetwork/issue-archive/2006/06-may/o36asktom-099107.html
在这里,我们只需要customer表的信息,orders表是用来做检索(filtering)的。
只将customer表放在from处,更加自然。用exists(semi-join)改写后:
Select /*HE8*/ count(name)
from customer
where exists(
select 1 from orders
where customer.customer_id=orders.customer_id
and orders.order_price < 10000
);
由于semi-join,这里我们不需要distinct,从customer能直接得到去重后的结果。查看下执行计划和最终的cost
SQL_ID ggs8p2c27nzmu, child number 0
-------------------------------------
select /*HE8*/ count(name) from customer where exists( select 1 from
orders where customer.customer_id=orders.customer_id and
orders.order_price < 10000 )
Plan hash value: 2841769433
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | | 2651 (100)| |
| 1 | SORT AGGREGATE | | 1 | 41 | | | |
|* 2 | HASH JOIN SEMI | | 52570 | 2104K| 2640K| 2651 (1)| 00:00:01 |
| 3 | TABLE ACCESS FULL| CUSTOMER | 100K| 1464K| | 102 (0)| 00:00:01 |
|* 4 | TABLE ACCESS FULL| ORDERS | 753K| 18M| | 1063 (1)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("CUSTOMER"."CUSTOMER_ID"="ORDERS"."CUSTOMER_ID")
4 - filter("ORDERS"."ORDER_PRICE"<10000)
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
很明显,第二个执行计划优于第一个。