对含distinct操作的SQL的优化

今天帮同事优化了一条含distinct操作的SQL,因为这个SQL在数据仓库环境中非常经典,因此有比较拿出来说一说。

[@more@] 原SQL:

sys@dwrac1> select count(distinct a.ip)
2 from (select t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

---------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 40 | | 1032K (4)| 04:00:50 | | |
| 1 | SORT GROUP BY | | 1 | 40 | | | | | |
|* 2 | HASH JOIN | | 3533K| 134M| 107M| 1032K (4)| 04:00:50 | | |
| 3 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 4 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 5 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
---------------------------------------------------------------------------------------------------------------------------------

从执行计划可以看到,Oracle先扫描两个表的部分分区,在用hash连接,最后做distinct操作。
在这个案例中,这样是有很大问题的,因为两个结果集的IP重复度都比较高,根据a.ip=b.ip关联后结果集剧烈膨胀,导致需要distinct的记录集太大。
知道原因,那么就可以想办法避免了。我们的目标很简单,就是让关联后的记录集减少。
这个例子很明显就可以先分别对两个结果集做排重后再关联,于是得到第一个优化后SQL:

sys@dwrac1> select count(distinct a.ip)
2 from (select distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

---------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 40 | | 1032K (4)| 04:00:50 | | |
| 1 | SORT GROUP BY | | 1 | 40 | | | | | |
|* 2 | HASH JOIN | | 3533K| 134M| 107M| 1032K (4)| 04:00:50 | | |
| 3 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 4 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 5 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
---------------------------------------------------------------------------------------------------------------------------------

从执行计划看到,虽然我们改写了SQL,但是执行计划没有改变,这是因为CBO“聪明”地把两个子查询merge进去了。
我们可以用no_merge禁止子查询的merge操作,于是得到第二次优化的SQL:


sys@dwrac1> select count(distinct a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1087K (4)| 04:13:49 | | |
| 1 | SORT GROUP BY | | 1 | 26 | | | | | |
| 2 | MERGE JOIN | | 3533K| 87M| | 1087K (4)| 04:13:49 | | |
| 3 | SORT JOIN | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 4 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 5 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 6 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 7 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 8 | SORT JOIN | | 3533K| 43M| 135M| 989K (4)| 03:50:48 | | |
| 9 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 10 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 11 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 12 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
------------------------------------------------------------------------------------------------------------------------------------

从执行计划看出,no_merge已经生效了,但是目前还不是最优的,因为对一个子查询都做了太多的工作,既要hash unique又要sort,最后两个记录集关联还是用merge join。
实际上,这种关联用hash join更合适,可以强制它们走hash join,于是得到第三个优化SQL:

sys@dwrac1> select /*+ use_hash(a,b) */count(distinct a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;


-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1095K (4)| 04:15:35 | | |
| 1 | SORT GROUP BY | | 1 | 26 | | | | | |
|* 2 | HASH JOIN | | 3533K| 87M| 84M| 1095K (4)| 04:15:35 | | |
| 3 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 4 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 5 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 7 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 8 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 9 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 10 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
-----------------------------------------------------------------------------------------------------------------------------------

从执行计划看,已经基本达到我们的要求了,但是还不够完美,因为我们对两个记录集做完排重后,关联出来的记录肯定是没有重复的了,因此可以把最外层的distinct去掉,于是得到第四个优化的SQL:

sys@dwrac1> select /*+ use_hash(a,b) */count(a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;


-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1095K (4)| 04:15:35 | | |
| 1 | SORT AGGREGATE | | 1 | 26 | | | | | |
|* 2 | HASH JOIN | | 3533K| 87M| 84M| 1095K (4)| 04:15:35 | | |
| 3 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 4 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 5 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 7 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 8 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 9 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 10 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
-----------------------------------------------------------------------------------------------------------------------------------

原SQL运行超过几小时都出不来,最终优化过的SQL 只需要不到1分钟就得到结果,优化效果非常显著。

以上步骤是我做优化的思路,从优化的步骤看,要优化,不但需要对技术了解,更需要对数据分布了解!
在数据仓库中,我们可能经常会碰到类似的SQL,对此,我们的处理原则是:先排重,再关联,尽可能避免中间结果集膨胀。

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/231499/viewspace-1047939/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/231499/viewspace-1047939/

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值