MySQL NOT IN vs. LEFT JOIN: notes on query efficiency


ethanzhao


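Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.

Article link: https://blog.csdn.net/ethanzhao/article/details/4467022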


Acknowledgment: thanks to laserhe, denniswwh, ACMAIN_CHM, and vinsonshen for their generous help.

First, a note on what this SQL does: it finds the rows of set a that are not in set b.
The NOT IN version
select add_tb.RUID
from (select distinct RUID
      from UserMsg
      where SubjectID = 12
      and CreateTime > '2009-8-14 15:30:00'
      and CreateTime <= '2009-8-17 16:00:00'
) add_tb
where add_tb.RUID
not in (select distinct RUID
        from UserMsg
        where SubjectID = 12
        and CreateTime < '2009-8-14 15:30:00'
)
Returned 444 rows in 0.07 sec
EXPLAIN output
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+------------------------------+
| id | select_type        | table      | type           | possible_keys             | key        | key_len | ref  | rows | Extra                        |
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+------------------------------+
|  1 | PRIMARY            | <derived2> | ALL            | NULL                      | NULL       |    NULL | NULL |  452 | Using where                  |
|  3 | DEPENDENT SUBQUERY | UserMsg    | index_subquery | RUID,SubjectID,CreateTime | RUID       |      96 | func |    2 | Using index; Using where     |
|  2 | DERIVED            | UserMsg    | range          | SubjectID,CreateTime      | CreateTime |       9 | NULL | 1857 | Using where; Using temporary |
+----+--------------------+------------+----------------+---------------------------+------------+---------+------+------+------------------------------+
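Analysis: this query is fast because the id=2 subquery returns few rows, so the id=1 query that runs over them is quick. The id=2 step uses a temporary table, and I was not sure whether an index is still used at that point. (Its EXPLAIN row shows type = range with key = CreateTime, so the scan itself does go through the CreateTime index; the temporary table most likely serves the DISTINCT.)
One LEFT JOIN version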
select a.ruid,b.ruid
from(select distinct RUID
     from UserMsg
     where SubjectID =12
     and CreateTime >= '2009-8-14 15:30:00'
     and CreateTime<='2009-8-17 16:00:00'
) a left join (
    select distinct RUID
    from UserMsg
    where SubjectID =12 and CreateTime< '2009-8-14 15:30:00'
) b on a.ruid = b.ruid
where b.ruid is null
Returned 444 rows in 0.39 sec
EXPLAIN output
+----+-------------+------------+-------+----------------------+------------+---------+------+------+------------------------------+
| id | select_type | table      | type  | possible_keys        | key        | key_len | ref  | rows | Extra                        |
+----+-------------+------------+-------+----------------------+------------+---------+------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL                 | NULL       |    NULL | NULL |  452 |                              |
|  1 | PRIMARY     | <derived3> | ALL   | NULL                 | NULL       |    NULL | NULL | 1112 | Using where; Not exists      |
|  3 | DERIVED     | UserMsg    | ref   | SubjectID,CreateTime | SubjectID  |       5 |      | 6667 | Using where; Using temporary |
|  2 | DERIVED     | UserMsg    | range | SubjectID,CreateTime | CreateTime |       9 | NULL | 1838 | Using where; Using temporary |
+----+-------------+------------+-------+----------------------+------------+---------+------+------+------------------------------+
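Analysis: two temporary (derived) tables are built and then joined as a Cartesian product. The derived tables have no index in this plan (both show type = ALL), so the join cannot use one and the number of row combinations examined is large.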
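To make the point concrete, here is a minimal Python sketch (not MySQL's actual executor) that contrasts an anti-join that has to scan the inner set for every outer row with one that can probe an index. The function names, the generated ids, and the use of a Python set as a stand-in for an index are assumptions made for illustration; only the sizes 452 and 1112 are taken from the EXPLAIN output above.

```python
def anti_join_scan(outer, inner):
    """Anti-join that scans the inner list for every outer row (no index)."""
    comparisons = 0
    result = []
    for a in outer:
        matched = False
        for b in inner:
            comparisons += 1
            if a == b:
                matched = True
                break  # stop at the first match, in the spirit of "Not exists"
        if not matched:
            result.append(a)
    return result, comparisons


def anti_join_indexed(outer, inner):
    """Anti-join that probes a lookup structure once per outer row."""
    index = set(inner)  # assumption: a set stands in for an index on RUID
    lookups = 0
    result = []
    for a in outer:
        lookups += 1
        if a not in index:
            result.append(a)
    return result, lookups


# Hypothetical ids; only the two sizes mirror the plan above.
outer = list(range(1000, 1452))  # 452 ids seen inside the time window
inner = list(range(0, 1112))     # 1112 ids seen before the window

scan_result, scan_comparisons = anti_join_scan(outer, inner)
idx_result, idx_lookups = anti_join_indexed(outer, inner)

print("same answer:", scan_result == idx_result)
print("comparisons without index:", scan_comparisons)
print("lookups with index:", idx_lookups)
```

Both functions return the same ids; the difference is only in how much work is done per outer row, which is the gap between the 0.39 sec plan and the faster ones.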
Another LEFT JOIN version
select distinct a.RUID
from UserMsg a
left join UserMsg b
    on a.ruid = b.ruid
    and b.subjectID =12 and b.createTime < '2009-8-14 15:30:00'
where a.subjectID =12
and a.createTime >= '2009-8-14 15:30:00'
and a.createtime <='2009-8-17 16:00:00'
and b.ruid is null;
Returned 444 rows in 0.07 sec
EXPLAIN output
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-----------------------------------+
| id | select_type | table | type  | possible_keys             | key        | key_len | ref          | rows | Extra                             |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-----------------------------------+
|  1 | SIMPLE      | a     | range | SubjectID,CreateTime      | CreateTime |       9 | NULL         | 1839 | Using where; Using temporary      |
|  1 | SIMPLE      | b     | ref   | RUID,SubjectID,CreateTime | RUID       |      96 | dream.a.RUID |    2 | Using where; Not exists; Distinct |
+----+-------------+-------+-------+---------------------------+------------+---------+--------------+------+-----------------------------------+
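Analysis: both table accesses use an index and they are carried out together as a single join, so the query should be very efficient.
The NOT EXISTS version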
select distinct a.ruid
from UserMsg a
where a.subjectID =12
and a.createTime >= '2009-8-14 15:30:00'
and a.createTime <='2009-8-17 16:00:00'
and not exists (
    select distinct RUID
    from UserMsg
    where subjectID =12 and createTime < '2009-8-14 15:30:00'
    and ruid=a.ruid
)
Returned 444 rows in 0.08 sec
EXPLAIN output
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------------------------------+
| id | select_type        | table   | type  | possible_keys             | key        | key_len | ref          | rows | Extra                        |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------------------------------+
|  1 | PRIMARY            | a       | range | SubjectID,CreateTime      | CreateTime |       9 | NULL         | 1839 | Using where; Using temporary |
|  2 | DEPENDENT SUBQUERY | UserMsg | ref   | RUID,SubjectID,CreateTime | RUID       |      96 | dream.a.RUID |    2 | Using where                  |
+----+--------------------+---------+-------+---------------------------+------------+---------+--------------+------+------------------------------+
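Analysis: essentially the same as the previous plan, except that it is split into two queries executed in sequence (the dependent subquery runs for each outer row), so it is slightly slower than the third query.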
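As a sanity check that these forms really answer the same question, the following Python sketch runs three of them against an in-memory SQLite database. This is only an illustration under stated assumptions: SQLite is not MySQL, the seven sample rows are invented, the timestamps are zero-padded so that plain string comparison orders them correctly, and RUID is declared NOT NULL because NOT IN behaves differently once the subquery can return NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE UserMsg (RUID TEXT NOT NULL, SubjectID INTEGER, CreateTime TEXT)"
)
conn.execute("CREATE INDEX idx_ruid ON UserMsg (RUID)")

# Invented sample data: u2 and u3 first appear inside the window.
conn.executemany(
    "INSERT INTO UserMsg VALUES (?, ?, ?)",
    [
        ("u1", 12, "2009-08-10 09:00:00"),  # seen before the window
        ("u1", 12, "2009-08-15 10:00:00"),  # ...and again inside it
        ("u2", 12, "2009-08-15 11:00:00"),  # new inside the window
        ("u3", 12, "2009-08-16 12:00:00"),  # new inside the window
        ("u3", 12, "2009-08-16 13:00:00"),  # duplicate, removed by DISTINCT
        ("u4", 12, "2009-08-01 08:00:00"),  # only before the window
        ("u5", 7, "2009-08-15 12:00:00"),   # different SubjectID
    ],
)

params = {"start": "2009-08-14 15:30:00", "end": "2009-08-17 16:00:00"}

queries = {
    "not_in": """
        SELECT add_tb.RUID
        FROM (SELECT DISTINCT RUID FROM UserMsg
              WHERE SubjectID = 12
                AND CreateTime >= :start AND CreateTime <= :end) add_tb
        WHERE add_tb.RUID NOT IN (SELECT RUID FROM UserMsg
                                  WHERE SubjectID = 12 AND CreateTime < :start)
    """,
    "left_join": """
        SELECT DISTINCT a.RUID
        FROM UserMsg a
        LEFT JOIN UserMsg b
               ON a.RUID = b.RUID
              AND b.SubjectID = 12 AND b.CreateTime < :start
        WHERE a.SubjectID = 12
          AND a.CreateTime >= :start AND a.CreateTime <= :end
          AND b.RUID IS NULL
    """,
    "not_exists": """
        SELECT DISTINCT a.RUID
        FROM UserMsg a
        WHERE a.SubjectID = 12
          AND a.CreateTime >= :start AND a.CreateTime <= :end
          AND NOT EXISTS (SELECT 1 FROM UserMsg b
                          WHERE b.SubjectID = 12 AND b.CreateTime < :start
                            AND b.RUID = a.RUID)
    """,
}

results = {
    name: sorted(row[0] for row in conn.execute(sql, params))
    for name, sql in queries.items()
}
for name, ruids in results.items():
    print(name, ruids)
```

All three forms return the same set of RUIDs on this data, so the differences discussed in this post are about execution plans, not about results.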

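To verify the efficiency comparison, I removed the subjectID = 12 condition from the queries above. The query times, in the same order as the four versions above, were:
NOT IN: 0.20s
LEFT JOIN (two derived tables): 21.31s
LEFT JOIN (directly on the table): 0.25s
NOT EXISTS: 0.43s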

Summary of laserhe's analysis
select a.ruid, b.ruid
from (select distinct RUID
      from UserMsg
      where CreateTime >= '2009-8-14 15:30:00'
      and CreateTime <= '2009-8-17 16:00:00'
) a left join UserMsg b
    on a.ruid = b.ruid
    and b.createTime < '2009-8-14 15:30:00'
where b.ruid is null;
Execution time: 0.13s
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------------------------+
| id | select_type | table      | type  | possible_keys   | key        | key_len | ref    | rows | Extra                        |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL            | NULL       |    NULL | NULL   | 1248 |                              |
|  1 | PRIMARY     | b          | ref   | RUID,CreateTime | RUID       |      96 | a.RUID |    2 | Using where; Not exists      |
|  2 | DERIVED     | UserMsg    | range | CreateTime      | CreateTime |       9 | NULL   | 3553 | Using where; Using temporary |
+----+-------------+------------+-------+-----------------+------------+---------+--------+------+------------------------------+
Its performance is similar to that of the NOT IN version.

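A basic principle of database optimization: make the Cartesian product happen between the smallest possible sets. When MySQL joins against a table directly, it can look rows up through an index; once that table is wrapped in a subquery, the query planner no longer knows to use a suitable index.
Here is how a database optimizes an SQL statement. The SQL is first parsed into a parse tree, a tree-shaped data structure. The query planner then searches that structure for usable indexes, enumerates the possible combinations of access paths, estimates the cost of each combination (something like a machine-readable version of the EXPLAIN output), and finally picks the cheapest one and executes it. So: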
explain select a.ruid, b.ruid
from (select distinct RUID
      from UserMsg
      where CreateTime >= '2009-8-14 15:30:00'
      and CreateTime <= '2009-8-17 16:00:00'
) a left join UserMsg b
    on a.ruid = b.ruid
    and b.createTime < '2009-8-14 15:30:00'
where b.ruid is null;
and
explain select add_tb.RUID
from (select distinct RUID
      from UserMsg
      where CreateTime > '2009-8-14 15:30:00'
      and CreateTime <= '2009-8-17 16:00:00'
) add_tb
where add_tb.RUID
not in (select distinct RUID
        from UserMsg
        where CreateTime < '2009-8-14 15:30:00'
);
EXPLAIN output for the NOT IN query (the plan for the LEFT JOIN query is the one shown earlier):
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------------------------+
| id | select_type        | table      | type           | possible_keys   | key        | key_len | ref  | rows | Extra                        |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------------------------+
|  1 | PRIMARY            | <derived2> | ALL            | NULL            | NULL       |    NULL | NULL | 1248 | Using where                  |
|  3 | DEPENDENT SUBQUERY | UserMsg    | index_subquery | RUID,CreateTime | RUID       |      96 | func |    2 | Using index; Using where     |
|  2 | DERIVED            | UserMsg    | range          | CreateTime      | CreateTime |       9 | NULL | 3509 | Using where; Using temporary |
+----+--------------------+------------+----------------+-----------------+------------+---------+------+------+------------------------------+
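The costs of the two plans are essentially the same. The cost can be read from the rows column: it is roughly the product of the rows values of the plan's rows, that is, the size of the Cartesian product.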
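The rows arithmetic can be sketched in a few lines of Python. This is a rough model, not MySQL's real cost formula: it assumes that the rows values of the joined steps multiply together, while a DERIVED step is a one-off scan that is added once. That join/build split, the `plans` dictionary, and the function name are my own labels for illustration; the numbers are copied from the EXPLAIN outputs shown so far.

```python
from math import prod

# Each plan is a list of (role, rows) pairs copied from the EXPLAIN outputs.
#   "join"  = rows combined per outer row (multiplied together)
#   "build" = one-off scan that fills a derived/temporary table (added once)
# The role labels are an assumption of this sketch; EXPLAIN does not report them.
plans = {
    "LEFT JOIN on the table, no SubjectID filter": [
        ("join", 1248), ("join", 2), ("build", 3553),
    ],
    "NOT IN, no SubjectID filter": [
        ("join", 1248), ("join", 2), ("build", 3509),
    ],
    "LEFT JOIN of two derived tables, SubjectID = 12": [
        ("join", 452), ("join", 1112), ("build", 6667), ("build", 1838),
    ],
}


def estimate_rows_examined(steps):
    """Rough estimate: product of the join steps plus the one-off build scans."""
    join_product = prod(rows for role, rows in steps if role == "join")
    build_total = sum(rows for role, rows in steps if role == "build")
    return join_product + build_total


estimates = {name: estimate_rows_examined(steps) for name, steps in plans.items()}
for name, value in estimates.items():
    print(f"{name}: about {value} rows examined")
```

Under this model the first two plans come out nearly equal, while the plan that joins two derived tables is larger by roughly two orders of magnitude even on the smaller SubjectID = 12 data set.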
However, consider the following query:
explain select a.ruid, b.ruid
from (select distinct RUID
      from UserMsg
      where CreateTime >= '2009-8-14 15:30:00'
      and CreateTime <= '2009-8-17 16:00:00'
) a left join (
    select distinct RUID
    from UserMsg
    where createTime < '2009-8-14 15:30:00'
) b on a.ruid = b.ruid
where b.ruid is null;
Execution time: 21.31s
+----+-------------+------------+-------+---------------+------------+---------+------+-------+------------------------------+
| id | select_type | table      | type  | possible_keys | key        | key_len | ref  | rows  | Extra                        |
+----+-------------+------------+-------+---------------+------------+---------+------+-------+------------------------------+
|  1 | PRIMARY     | <derived2> | ALL   | NULL          | NULL       |    NULL | NULL |  1248 |                              |
|  1 | PRIMARY     | <derived3> | ALL   | NULL          | NULL       |    NULL | NULL | 30308 | Using where; Not exists      |
|  3 | DERIVED     | UserMsg    | ALL   | CreateTime    | NULL       |    NULL | NULL | 69366 | Using where; Using temporary |
|  2 | DERIVED     | UserMsg    | range | CreateTime    | CreateTime |       9 | NULL |  3510 | Using where; Using temporary |
+----+-------------+------------+-------+---------------+------------+---------+------+-------+------------------------------+
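What I do not quite understand is why there are four rows here, and why the two in the middle are so huge. In principle the query planner should be able to optimize this query into the same plan as the previous two (in PostgreSQL, which I know better, I am confident it would), but MySQL does not. So my impression is that its query planner is still a bit crude.
As I said, the basic principle of optimization is to make the Cartesian product happen between the smallest possible sets, and this last form does not violate that principle. Table b's condition matches so many rows that an index will basically not be used for it, but that should not stop the optimizer from seeing the outer JOIN ... ON condition and, as with the previous two statements, choosing the key for the join.
I described the planner's job earlier: in theory it walks through every possibility and computes the cost of each, which is reasonable. My feeling is that for this last form it did not enumerate every possibility. Perhaps the subquery implementation is still fairly simple? Subqueries really are a challenge for a database, because they are essentially recursive, so a weakness in this area is not surprising.
If you think about it, the last form is just a variant of the first one; the key question is where table b's WHERE condition is placed. Put it inside the subquery and the join will not use the index; put it outside, in the ON clause, and it will. That choice is itself just one of the possible permutations.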
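laserhe's description of the planner (enumerate the candidates, cost each one, keep the cheapest) can be illustrated with a toy model in Python. This is only a sketch of the idea and says nothing about how MySQL's optimizer is actually implemented: the `Candidate` class, the plan names, and the "outer rows times rows per probe" cost are assumptions, and the numbers are taken from the EXPLAIN outputs above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One hypothetical way of executing the join against table b."""
    name: str
    outer_rows: int      # rows coming from derived table a
    rows_per_probe: int  # rows of b examined for each outer row

    def cost(self) -> int:
        # Toy cost: size of the Cartesian product that gets examined.
        return self.outer_rows * self.rows_per_probe


def choose_plan(candidates):
    """Pick the cheapest candidate among those that were enumerated."""
    return min(candidates, key=lambda c: c.cost())


# If every possibility is enumerated, the index lookup is available.
full_search = [
    Candidate("scan materialized derived table b", 1248, 30308),
    Candidate("index lookup on UserMsg.RUID", 1248, 2),
]

# If the index candidate is never generated for a derived table,
# the planner can only choose among what is left.
restricted_search = [c for c in full_search if "index" not in c.name]

best_full = choose_plan(full_search)
best_restricted = choose_plan(restricted_search)

print(best_full.name, best_full.cost())
print(best_restricted.name, best_restricted.cost())
```

The selection rule is the same in both cases; the result differs only because one candidate is missing from the search space, which matches the suspicion that the last form "did not enumerate every possibility".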
