使用exists(Semi-Join)优化distinct语句

最新推荐文章于 2024-01-11 21:51:29 发布

chncaesar

最新推荐文章于 2024-01-11 21:51:29 发布

阅读量2.2k

点赞数

分类专栏： Oracle调优 SQL & PL/SQL 文章标签： semijoin

本文链接：https://blog.csdn.net/chncaesar/article/details/17320919

版权

SQL & PL/SQL 同时被 2 个专栏收录

41 篇文章 0 订阅

订阅专栏

Oracle调优

15 篇文章 0 订阅

订阅专栏

在Oracle 官方文档，semi-join是这么解释的：

A semijoin returns rows that match an EXISTS subquery without duplicating rows from the left side of the predicate when multiple rows on the right side satisfy the
criteria of the subquery.

Semijoin and antijoin transformation cannot be done if the subquery is on an OR branch of the WHERE clause.

由上可见，semi-join特别适用于如下场景：
表A 和B有1：N关系，需要根据B表某个条件，通过semi-join查询A表上某个列（属性）。因为semi-join已经包含一个去重过程，不需要在A表上加distinct。比内连接然后去重方式性能更好。

假设有两个表Customer ， orders。当然，每个客户可以有多个订单。现在需要找出订单金额超过10000的客户数量。本文例子在12cR1上实现，11g上应该是同样结果。

show rel;
release 1201000100

desc customer
Name        Null     Type
----------- -------- -------------
CUSTOMER_ID NOT NULL NUMBER(12)
NAME                 VARCHAR2(100)
STATUS               VARCHAR2(10)

desc orders
Name        Null Type
----------- ---- ------------
ORDER_ID         NUMBER(12)
CUSTOMER_ID      NUMBER(12)
CREATE_DATE      DATE
ORDER_PRICE      NUMBER(38,6)

select count(*) from  orders;

count(*)
--------------
1000002

select count(*) from customer;

count(*)
--------------
100000

orders表有100万行，customer表有10万行。

假设我需要一个用户列表，该列表中用户的订单金额低于10000。我们有两种写法 - 内连接或Exists。

下面来比较下inner join和semi-join两种SQL写法的性能。

Select /*HE7*/ count(distinct cus.name)
from customer cus, orders ord
where cus.customer_id=ord.customer_id
and ord.order_price < 10000;

SQL_ID 3kfvh06bqrn5h, child number 0
-------------------------------------
select /*HE7*/ count(distinct cus.name) from customer cus, orders ord
where cus.customer_id=ord.customer_id and ord.order_price < 10000
Plan hash value: 3082118893
------------------------------------------------------------------------------------------
| Id | Operation             | Name     | Rows | Bytes |TempSpc| Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |          |       |       |       | 3212 (100)|          |
|   1 | SORT AGGREGATE       |          |     1 |    52 |       |            |          |
|   2 |   VIEW                | VW_DAG_0 | 52570 | 2669K|       | 3212   (1)| 00:00:01 |
|   3 |    HASH GROUP BY      |          | 52570 | 2104K| 2688K| 3212   (1)| 00:00:01 |
|* 4 |     HASH JOIN SEMI    |          | 52570 | 2104K| 2640K| 2651   (1)| 00:00:01 |
|   5 |      TABLE ACCESS FULL| CUSTOMER |   100K| 1464K|       |   102   (0)| 00:00:01 |
|* 6 |      TABLE ACCESS FULL| ORDERS   |   753K|    18M|       | 1063   (1)| 00:00:01 |
------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - access("CUS"."CUSTOMER_ID"="ORD"."CUSTOMER_ID")
   6 - filter("ORD"."ORDER_PRICE"<10000)
Note
-----

- dynamic statistics used: dynamic sampling (level=2)

这个执行计划的总成本为3212。由于做了内连接(inner join)，最终结果集里将会出现所有匹配行，包含一部分重复customer(用户）。著名Oracle布道者Tom Kyte在2006年一篇文章里就写到：

“In general, you should phrase the queries in the manner that says it best. If one set of joins were particularly efficient to use in all cases, Oracle would not have implemented the rest of them!

In general, you use a join when you need data from more than one table in the ultimate SELECT list.”

文章出处：
http://www.oracle.com/technetwork/issue-archive/2006/06-may/o36asktom-099107.html
在这里，我们只需要customer表的信息，orders表是用来做检索（filtering）的。
只将customer表放在from处，更加自然。用exists(semi-join)改写后：

Select /*HE8*/ count(name)
from customer
where exists(
  select 1 from orders
  where customer.customer_id=orders.customer_id
  and orders.order_price < 10000
);

由于semi-join，这里我们不需要distinct，从customer能直接得到去重后的结果。查看下执行计划和最终的cost

SQL_ID ggs8p2c27nzmu, child number 0
-------------------------------------
select /*HE8*/ count(name) from customer where exists(   select 1 from
orders   where customer.customer_id=orders.customer_id   and
orders.order_price < 10000 )
Plan hash value: 2841769433
----------------------------------------------------------------------------------------
| Id | Operation           | Name     | Rows | Bytes |TempSpc| Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |          |       |       |       | 2651 (100)|          |
|   1 | SORT AGGREGATE     |          |     1 |    41 |       |            |          |
|* 2 |   HASH JOIN SEMI    |          | 52570 | 2104K| 2640K| 2651   (1)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| CUSTOMER |   100K| 1464K|       |   102   (0)| 00:00:01 |
|* 4 |    TABLE ACCESS FULL| ORDERS   |   753K|    18M|       | 1063   (1)| 00:00:01 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("CUSTOMER"."CUSTOMER_ID"="ORDERS"."CUSTOMER_ID")
   4 - filter("ORDERS"."ORDER_PRICE"<10000)
Note
-----
   - dynamic statistics used: dynamic sampling (level=2)

很明显，第二个执行计划优于第一个。