对含distinct操作的SQL的优化

最新推荐文章于 2024-08-17 21:51:39 发布

cuilin2183

最新推荐文章于 2024-08-17 21:51:39 发布

阅读量514

点赞数

今天帮同事优化了一条含distinct操作的SQL，因为这个SQL在数据仓库环境中非常经典，因此有比较拿出来说一说。

[@more@] 原SQL：

sys@dwrac1> select count(distinct a.ip)
2 from (select t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

---------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 40 | | 1032K (4)| 04:00:50 | | |
| 1 | SORT GROUP BY | | 1 | 40 | | | | | |
|* 2 | HASH JOIN | | 3533K| 134M| 107M| 1032K (4)| 04:00:50 | | |
| 3 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 4 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 5 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
---------------------------------------------------------------------------------------------------------------------------------

从执行计划可以看到，Oracle先扫描两个表的部分分区，在用hash连接，最后做distinct操作。
在这个案例中，这样是有很大问题的，因为两个结果集的IP重复度都比较高，根据a.ip=b.ip关联后结果集剧烈膨胀，导致需要distinct的记录集太大。
知道原因，那么就可以想办法避免了。我们的目标很简单，就是让关联后的记录集减少。
这个例子很明显就可以先分别对两个结果集做排重后再关联，于是得到第一个优化后SQL：

sys@dwrac1> select count(distinct a.ip)
2 from (select distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

---------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
---------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 40 | | 1032K (4)| 04:00:50 | | |
| 1 | SORT GROUP BY | | 1 | 40 | | | | | |
|* 2 | HASH JOIN | | 3533K| 134M| 107M| 1032K (4)| 04:00:50 | | |
| 3 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 4 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 5 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
---------------------------------------------------------------------------------------------------------------------------------

从执行计划看到，虽然我们改写了SQL，但是执行计划没有改变，这是因为CBO“聪明”地把两个子查询merge进去了。
我们可以用no_merge禁止子查询的merge操作，于是得到第二次优化的SQL：

sys@dwrac1> select count(distinct a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1087K (4)| 04:13:49 | | |
| 1 | SORT GROUP BY | | 1 | 26 | | | | | |
| 2 | MERGE JOIN | | 3533K| 87M| | 1087K (4)| 04:13:49 | | |
| 3 | SORT JOIN | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 4 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 5 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 6 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 7 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 8 | SORT JOIN | | 3533K| 43M| 135M| 989K (4)| 03:50:48 | | |
| 9 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 10 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 11 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 12 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
------------------------------------------------------------------------------------------------------------------------------------

从执行计划看出，no_merge已经生效了，但是目前还不是最优的，因为对一个子查询都做了太多的工作，既要hash unique又要sort，最后两个记录集关联还是用merge join。
实际上，这种关联用hash join更合适，可以强制它们走hash join，于是得到第三个优化SQL：

sys@dwrac1> select /*+ use_hash(a,b) */count(distinct a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1095K (4)| 04:15:35 | | |
| 1 | SORT GROUP BY | | 1 | 26 | | | | | |
|* 2 | HASH JOIN | | 3533K| 87M| 84M| 1095K (4)| 04:15:35 | | |
| 3 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 4 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 5 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 7 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 8 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 9 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 10 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
-----------------------------------------------------------------------------------------------------------------------------------

从执行计划看，已经基本达到我们的要求了，但是还不够完美，因为我们对两个记录集做完排重后，关联出来的记录肯定是没有重复的了，因此可以把最外层的distinct去掉，于是得到第四个优化的SQL：

sys@dwrac1> select /*+ use_hash(a,b) */count(a.ip)
2 from (select /*+ no_merge */distinct t.ip
3 from creater_user.popt_total_login_month_his t
4 where t.data_desc = '2010-02-01') a, (select /*+ no_merge */distinct t.ip
5 from creater_user.popt_total_login_month_his t
6 where t.data_desc between
7 '2010-03-01' and
8 '2011-12-31') b
9 where a.ip = b.ip;

-----------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop |
-----------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 26 | | 1095K (4)| 04:15:35 | | |
| 1 | SORT AGGREGATE | | 1 | 26 | | | | | |
|* 2 | HASH JOIN | | 3533K| 87M| 84M| 1095K (4)| 04:15:35 | | |
| 3 | VIEW | | 3533K| 43M| | 978K (4)| 03:48:18 | | |
| 4 | HASH UNIQUE | | 3533K| 67M| 94M| 978K (4)| 03:48:18 | | |
| 5 | PARTITION RANGE ITERATOR| | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
|* 6 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 3533K| 67M| | 964K (4)| 03:45:07 | 2 | 17 |
| 7 | VIEW | | 14M| 175M| | 98667 (4)| 00:23:02 | | |
| 8 | HASH UNIQUE | | 14M| 269M| 378M| 98667 (4)| 00:23:02 | | |
| 9 | PARTITION RANGE SINGLE | | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
|* 10 | TABLE ACCESS FULL | POPT_TOTAL_LOGIN_MONTH_HIS | 14M| 269M| | 44061 (4)| 00:10:17 | 1 | 1 |
-----------------------------------------------------------------------------------------------------------------------------------

原SQL运行超过几小时都出不来，最终优化过的SQL 只需要不到1分钟就得到结果，优化效果非常显著。

以上步骤是我做优化的思路，从优化的步骤看，要优化，不但需要对技术了解，更需要对数据分布了解！
在数据仓库中，我们可能经常会碰到类似的SQL，对此，我们的处理原则是：先排重，再关联，尽可能避免中间结果集膨胀。

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/231499/viewspace-1047939/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/231499/viewspace-1047939/