Notes on the efficiency of NOT IN, LEFT JOIN ... IS NULL, and NOT EXISTS in MySQL

Latest recommended article published 2023-08-22 11:38:02

小爽无敌


Article link: https://blog.csdn.net/weixin_42419782/article/details/113339843


Statement 1: select count(*) from A where A.a not in (select a from B)

Statement 2: select count(*) from A left join B on A.a = B.a where B.a is null

Statement 3: select count(*) from A where not exists (select a from B where A.a = B.a)

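I have long known that the three statements above produce the same result, but I never looked closely at how their efficiency compares; my impression had always been that Statement 2 was the fastest. Today at work I had to purge data from a database with tens of millions of rows, deleting more than twenty million of them, and made heavy use of what these three statements do. I started with Statement 1: the run took 1 hour 32 minutes and the log file grew to 21 GB. The time was acceptable, but the disk usage was a problem, so I replaced every Statement 1 with Statement 2, expecting it to be faster. Instead, after more than 40 minutes the first batch of 50,000 rows still had not been deleted and SQL Server crashed, which was a surprise. I then ran the statements on their own against a table of nearly ten million rows: Statement 1 took 4 seconds and Statement 2 took 18 seconds, a large gap. Statement 3 performed close to Statement 1.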

The second form is one to avoid wherever possible. The first and third forms are essentially almost identical.
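The equivalence of the three statements can be checked on a toy data set. The following is a minimal sketch, assuming SQLite (through Python's standard sqlite3 module) as a stand-in for the database in the post; the helper name anti_join_counts and the sample values are made up for illustration. It also shows one case where the statements stop agreeing: under standard SQL NULL semantics, a NULL in B.a makes Statement 1 match nothing.

```python
import sqlite3

def anti_join_counts(a_values, b_values):
    """Return the counts produced by Statements 1, 2 and 3 for the given data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (a INTEGER)")
    conn.execute("CREATE TABLE B (a INTEGER)")
    conn.executemany("INSERT INTO A VALUES (?)", [(v,) for v in a_values])
    conn.executemany("INSERT INTO B VALUES (?)", [(v,) for v in b_values])
    statements = {
        "not_in": "SELECT COUNT(*) FROM A WHERE A.a NOT IN (SELECT a FROM B)",
        "left_join": "SELECT COUNT(*) FROM A LEFT JOIN B ON A.a = B.a "
                     "WHERE B.a IS NULL",
        "not_exists": "SELECT COUNT(*) FROM A WHERE NOT EXISTS "
                      "(SELECT a FROM B WHERE A.a = B.a)",
    }
    counts = {name: conn.execute(sql).fetchone()[0]
              for name, sql in statements.items()}
    conn.close()
    return counts

# No NULLs in B.a: all three statements agree.
print(anti_join_counts([1, 2, 3, 4], [2, 4]))
# A NULL in B.a: NOT IN matches nothing, the other two are unchanged.
print(anti_join_counts([1, 2, 3, 4], [2, 4, None]))
```

If B.a can contain NULLs, the choice between the three forms is therefore a question of correctness as well as speed.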

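Assuming the buffer pool is large enough, form 2 has the following drawbacks compared with form 1:
(1) The left join itself consumes more resources (more are needed to process the intermediate result set it produces).
(2) The intermediate result set of the left join is never smaller than table A.
(3) Form 2 must also filter the intermediate result of the left join with the is null condition, whereas form 1 completes the filtering while the two sets are being joined, so this is extra overhead.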

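Taken together, these three points make a noticeable difference when processing very large volumes of data (mainly in memory and CPU overhead). I suspect the buffer pool was already saturated during the original poster's test; in that case the extra overhead of form 2 has to fall back on virtual memory on disk, and when SQL Server pages, the slower I/O involved makes the gap even more pronounced.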

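As for the oversized log file, that is normal too, since so many records are being deleted. Depending on what the database is used for, consider setting the recovery model to simple, or truncate the log after the deletion finishes and shrink the file.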

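I once wrote a script to delete from this database unconditionally, that is, to delete all rows from the larger tables. The customer did not allow truncate table for fear of damaging the existing database structure, so delete was the only option. I ran into the oversized log file problem then too, and the approach I took was to delete in batches: set rowcount @chunk in SQL Server 2000 and delete top @chunk in SQL Server 2005. This not only cut the deletion time substantially but also greatly reduced the log volume, which grew by only about 1 GB. This time, however, the purge needs a condition, that is, delete A from A where ... with a condition at the end. I tried the batched approach again, but it no longer had any effect. Do you know why that might be?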
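For reference, the batched delete with a condition described above has roughly the following shape. This is only a sketch under stated assumptions: it uses SQLite through Python's sqlite3 module instead of SQL Server, so set rowcount and delete top are replaced by a rowid subquery with LIMIT; the function name delete_in_chunks, the tables A and B, and the data are hypothetical. It illustrates the batching pattern and does not attempt to explain the behaviour the poster observed.

```python
import sqlite3

def delete_in_chunks(conn, chunk_size):
    """Delete rows of A that have no match in B, chunk_size rows per transaction."""
    total = 0
    while True:
        cur = conn.execute(
            """
            DELETE FROM A
            WHERE rowid IN (
                SELECT A.rowid FROM A
                WHERE NOT EXISTS (SELECT 1 FROM B WHERE B.a = A.a)
                LIMIT ?
            )
            """,
            (chunk_size,),
        )
        conn.commit()  # commit each chunk so every transaction stays small
        if cur.rowcount == 0:
            break  # nothing left that satisfies the condition
        total += cur.rowcount
    return total

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (a INTEGER)")
conn.execute("CREATE TABLE B (a INTEGER)")
conn.executemany("INSERT INTO A VALUES (?)", [(i,) for i in range(1000)])
conn.executemany("INSERT INTO B VALUES (?)", [(i,) for i in range(0, 1000, 4)])

deleted = delete_in_chunks(conn, chunk_size=100)
remaining = conn.execute("SELECT COUNT(*) FROM A").fetchone()[0]
print(deleted, remaining)
```

Note that every chunk re-evaluates the condition against the whole table, so the cost of the condition itself is paid once per batch.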

Notes on the efficiency of NOT IN and LEFT JOIN in MySQL

First, the purpose of this SQL is to find the rows in set a that are not in set b. The NOT IN version:

select add_tb.RUID
from (select distinct RUID
      from UserMsg
      where SubjectID = 12
        and CreateTime > '2009-8-14 15:30:00'
        and CreateTime <= '2009-8-17 16:00:00'
     ) add_tb
where add_tb.RUID
  not in (select distinct RUID
          from UserMsg
          where SubjectID = 12
            -- upper bound lost in the source; restored from the equivalent queries in this article
            and CreateTime < '2009-8-14 15:30:00'
         )

Returns 444 rows in 0.07 sec. EXPLAIN result:

[Only the Extra column of the EXPLAIN output survives in the source:]
+------------------------------+
| Extra                        |
+------------------------------+
| Using where                  |
| Using index; Using where     |
| Using where; Using temporary |
+------------------------------+

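Analysis: this query is fast because the query with id=2 returns relatively few rows, so the id=1 query runs quickly. The id=2 query uses a temporary table; it is unclear whether an index is used at that point.

One LEFT JOIN version: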

select a.ruid,b.ruid
from (select distinct RUID
      from UserMsg
      where SubjectID = 12
        and CreateTime >= '2009-8-14 15:30:00'
        and CreateTime <= '2009-8-17 16:00:00'
     ) a left join (
      select distinct RUID
      from UserMsg
      where SubjectID = 12 and CreateTime < '2009-8-14 15:30:00'
     ) b on a.ruid = b.ruid
where b.ruid is null

Returns 444 rows in 0.39 sec. EXPLAIN result:

[Only one complete row of the EXPLAIN output survives in the source; the column headers are the standard EXPLAIN ones, and one other row's Extra also ends in "temporary":]
+----+-------------+---------+------+----------------------+-----------+---------+-----+------+------------------------------+
| id | select_type | table   | type | possible_keys        | key       | key_len | ref | rows | Extra                        |
+----+-------------+---------+------+----------------------+-----------+---------+-----+------+------------------------------+
|  3 | DERIVED     | UserMsg | ref  | SubjectID,CreateTime | SubjectID | 5       |     | 6667 | Using where; Using temporary |
+----+-------------+---------+------+----------------------+-----------+---------+-----+------+------------------------------+

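Analysis: two temporary tables are used and a Cartesian product is taken between them, so no index can be used and the volume of data is large.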

Another LEFT JOIN version:

select distinct a.RUID
from UserMsg a
left join UserMsg b
  on a.ruid = b.ruid
 and b.subjectID = 12 and b.createTime < '2009-8-14 15:30:00'
where a.subjectID = 12
  and a.createTime >= '2009-8-14 15:30:00'
  and a.createtime <= '2009-8-17 16:00:00'
  and b.ruid is null;

Returns 444 rows in 0.07 sec. EXPLAIN result:

[Only the row for b survives in full in the source; the column headers are the standard EXPLAIN ones, and the other row's Extra ends in "Using temporary":]
+----+-------------+-------+------+---------------------------+------+---------+--------------+------+-----------------------------------+
| id | select_type | table | type | possible_keys             | key  | key_len | ref          | rows | Extra                             |
+----+-------------+-------+------+---------------------------+------+---------+--------------+------+-----------------------------------+
|  1 | SIMPLE      | b     | ref  | RUID,SubjectID,CreateTime | RUID | 96      | dream.a.RUID |    2 | Using where; Not exists; Distinct |
+----+-------------+-------+------+---------------------------+------+---------+--------------+------+-----------------------------------+

Analysis: both lookups use an index and are carried out together in a single pass, so the query should be very efficient.

The NOT EXISTS version:

select distinct a.ruid
from UserMsg a
where a.subjectID = 12
  and a.createTime >= '2009-8-14 15:30:00'
  and a.createTime <= '2009-8-17 16:00:00'
  and not exists (
      select distinct RUID
      from UserMsg
      where subjectID = 12 and createTime < '2009-8-14 15:30:00'
        and ruid = a.ruid
  )

Returns 444 rows in 0.08 sec. EXPLAIN result:

[EXPLAIN output truncated in the source; the only surviving fragments are the ends of two Extra values, "... where; Using temporary" and "... where"]

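Analysis: essentially the same as above, except that it is broken into two queries executed in sequence, so it is less efficient than the third query.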

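To verify query efficiency, the subjectID = 12 restriction was removed from the queries above. The measured query times, with the queries numbered in the order they appear above, were:

Query 1: 0.20s
Query 2: 21.31s
Query 3: 0.25s
Query 4: 0.43s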
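A comparison of this kind can be scripted so that all four forms are timed against the same data and checked for identical results. The sketch below assumes SQLite (Python's sqlite3 module) rather than MySQL, a small synthetic UserMsg table, zero-padded timestamps (SQLite compares them as text), and >= for the lower bound in every query; the names build_db and run_all are hypothetical. The timings it prints say nothing about MySQL's planner, only the harness carries over.

```python
import sqlite3
import time

START, STOP = "2009-08-14 15:30:00", "2009-08-17 16:00:00"

# The four query shapes from the article, without the SubjectID restriction.
QUERIES = {
    "not_in": """
        SELECT add_tb.RUID
        FROM (SELECT DISTINCT RUID FROM UserMsg
              WHERE CreateTime >= :start AND CreateTime <= :stop) add_tb
        WHERE add_tb.RUID NOT IN (SELECT DISTINCT RUID FROM UserMsg
                                  WHERE CreateTime < :start)""",
    "left_join_derived": """
        SELECT a.RUID
        FROM (SELECT DISTINCT RUID FROM UserMsg
              WHERE CreateTime >= :start AND CreateTime <= :stop) a
        LEFT JOIN (SELECT DISTINCT RUID FROM UserMsg
                   WHERE CreateTime < :start) b ON a.RUID = b.RUID
        WHERE b.RUID IS NULL""",
    "left_join_direct": """
        SELECT DISTINCT a.RUID
        FROM UserMsg a
        LEFT JOIN UserMsg b ON a.RUID = b.RUID AND b.CreateTime < :start
        WHERE a.CreateTime >= :start AND a.CreateTime <= :stop
          AND b.RUID IS NULL""",
    "not_exists": """
        SELECT DISTINCT a.RUID
        FROM UserMsg a
        WHERE a.CreateTime >= :start AND a.CreateTime <= :stop
          AND NOT EXISTS (SELECT 1 FROM UserMsg
                          WHERE CreateTime < :start AND RUID = a.RUID)""",
}

def build_db():
    """Synthetic data: u0..u99 posted before the window, u50..u149 inside it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE UserMsg "
                 "(RUID TEXT NOT NULL, SubjectID INTEGER, CreateTime TEXT)")
    rows = [(f"u{i}", 12, "2009-08-10 00:00:00") for i in range(100)]
    rows += [(f"u{i}", 12, "2009-08-15 12:00:00") for i in range(50, 150)]
    conn.executemany("INSERT INTO UserMsg VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_ruid ON UserMsg (RUID)")
    conn.execute("CREATE INDEX idx_createtime ON UserMsg (CreateTime)")
    return conn

def run_all(conn):
    """Run every query once; return {name: (sorted RUIDs, elapsed seconds)}."""
    results = {}
    for name, sql in QUERIES.items():
        t0 = time.perf_counter()
        rows = conn.execute(sql, {"start": START, "stop": STOP}).fetchall()
        elapsed = time.perf_counter() - t0
        results[name] = (sorted(r[0] for r in rows), elapsed)
    return results

results = run_all(build_db())
for name, (ruids, elapsed) in results.items():
    print(f"{name}: {len(ruids)} rows in {elapsed:.4f}s")
```

Checking that every form returns the same set before comparing times guards against measuring queries that are not actually equivalent.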

Summary of the analysis contributed by laserhe

select a.ruid,b.ruid
from (select distinct RUID
      from UserMsg
      where CreateTime >= '2009-8-14 15:30:00'
        and CreateTime <= '2009-8-17 16:00:00'
     ) a left join UserMsg b
  on a.ruid = b.ruid
 and b.createTime < '2009-8-14 15:30:00'
where b.ruid is null;

Execution time: 0.13s

[Only the row for b survives in the source; the column headers are the standard EXPLAIN ones, and the other row's Extra ends in "temporary":]
+----+-------------+-------+------+-----------------+------+---------+--------+------+-------------------------+
| id | select_type | table | type | possible_keys   | key  | key_len | ref    | rows | Extra                   |
+----+-------------+-------+------+-----------------+------+---------+--------+------+-------------------------+
|  1 | PRIMARY     | b     | ref  | RUID,CreateTime | RUID | 96      | a.RUID |    2 | Using where; Not exists |
+----+-------------+-------+------+-----------------+------+---------+--------+------+-------------------------+

Its efficiency is similar to that of the NOT IN version.

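A basic principle of database optimization: make the Cartesian product happen between the smallest possible sets. When MySQL performs a join it can scan directly via an index, but once the data is wrapped inside a subquery, the query planner no longer knows to use a suitable index.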

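An SQL statement is optimized in the database like this: first the SQL is parsed into parse trees, a tree data structure; then the query planner looks through this structure for suitable indexes, enumerates the possible combinations for the situation at hand, computes the cost of each combination (something like a machine-readable version of the EXPLAIN output), compares them, and picks the cheapest one to execute. So: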

explain select a.ruid,b.ruid from(select distinct RUID from UserMsg where CreateTime >= '2009-8-14 15:30:00'
and CreateTime<='2009-8-17 16:00:00' ) a left join UserMsg b on a.ruid = b.ruid and b.createTime < '2009-8-14 15:30:00'
where b.ruid is null;

and

explain select add_tb.RUID
from (select distinct RUID
      from UserMsg
      where CreateTime > '2009-8-14 15:30:00'
        and CreateTime <= '2009-8-17 16:00:00'
     ) add_tb
where add_tb.RUID
  not in (select distinct RUID
          from UserMsg
          -- upper bound lost in the source; restored from the equivalent queries in this article
          where CreateTime < '2009-8-14 15:30:00'
         );

EXPLAIN result:

[Only two Extra values of the EXPLAIN output survive in the source:]
+-----------------+
| Extra           |
+-----------------+
| Using where     |
| Using temporary |
+-----------------+

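The cost is exactly the same. The cost can be read from the rows column: it is essentially the product of the rows values across all lines of the output, that is, the Cartesian product.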
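As a worked illustration of this rule of thumb, the sketch below multiplies the rows values of a plan together. It reflects only the heuristic stated above, not MySQL's actual cost model, and the rows values are hypothetical numbers chosen for the example rather than figures from the EXPLAIN outputs in this article.

```python
from math import prod

def estimated_cost(rows_per_line):
    """Rough cost per the rule of thumb above: product of the rows column."""
    return prod(rows_per_line)

# Hypothetical rows values, not taken from the EXPLAIN output in this article.
indexed_join = [500, 2]      # 500 outer rows, each index lookup finds about 2 rows
derived_join = [500, 60000]  # 500 outer rows against an unindexed derived table

print(estimated_cost(indexed_join))   # 500 * 2
print(estimated_cost(derived_join))   # 500 * 60000
```

Under this estimate, a join that can probe an index for each outer row is cheaper by several orders of magnitude than one that has to pair every outer row with a large intermediate set.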

However, consider this one:

explain select a.ruid,b.ruid from(select distinct RUID from UserMsg where CreateTime >= '2009-8-14 15:30:00'
and CreateTime<='2009-8-17 16:00:00' ) a left join ( select distinct RUID from UserMsg where createTime < '2009-8-14 15:30:00' ) b
on a.ruid = b.ruid where b.ruid is null;

Execution time: 21.31s

[EXPLAIN output not preserved in the source]

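What puzzles me is why there are four rows, and why the two in the middle are so enormous. In principle the query planner should be able to optimize this query to match the previous two (at least in pgsql, the database I know well, I am confident it would be the same). But in MySQL it is not, so I get the sense that the query planner is still a bit crude.

As I said earlier, the basic principle of optimization is to make the Cartesian product happen between the smallest possible sets. The last form above at least does not violate that principle. Because so many rows of table b satisfy the condition, an index would basically not be used there, but that should not stop the query optimizer from seeing the outer join on condition and, like the previous two SQL statements, choosing the primary key for the join.

I also described the job of the query planner earlier: in theory, it is reasonable to enumerate all the possibilities and compute the cost of each. My feeling is that for this last form it did not enumerate every possibility. Perhaps the reason is that the subquery implementation is still fairly simple? Subqueries are a genuine challenge for a database because they are basically recursive, so a flaw at this stage is not surprising.

If you think it through, the last form is just a variant of our first form; the key is where the where condition on table b is placed. Put it inside the subquery and the join will not use the index; put it outside and it will. That in itself is one of the possible combinations.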

Source: http://www.jb51.net/article/29122.htm
