文章之前的描述存在不少问题,今天重新改写了些内容。
开篇之前还是先来回顾一下这个错误的MySQL优化案例:
开发同学期望将两个表的数据汇集之后,取满足条件的top N,于是使用union all联合两个结果集,最后使用limit取top 30。两个表的数据量接近500w,执行耗时12.66秒:
整体来看,性能比较差,下面这种优化是期望将limit推入到两个联合的子句中来减少联合对性能影响,执行耗时在0.03秒,效果还不错。
但这里用到的优化思路看起来好像没有问题,但逻辑的实现是有问题的,而且改写前原SQL的逻辑也是存在问题的,具体是什么问题,看完下面的实验之后,相信也就自然明白了。
下面我们通过PostgreSQL来演示一下这里埋的是什么坑。MySQL和Oracle应该也是一样的效果。时间有限,就不做过多对比了。
先来创建两个测试表:
akendb=# \d+ tab01 Table "public.tab01" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description -----------+---------+-----------+----------+---------+---------+--------------+------------- id | integer | | not null | 0 | plain | | num_col01 | integer | | not null | 0 | plain | | akendb=# \d+ tab02 Table "public.tab02" Column | Type | Collation | Nullable | Default | Storage | Stats target | Description -----------+---------+-----------+----------+---------+---------+--------------+------------- id | integer | | not null | 0 | plain | | num_col01 | integer | | not null | 0 | plain | | akendb=#
然后每张表简单插入3行数据即可:
akendb=# select * from tab01; id | num_col01 ----+----------- 1 | 100 3 | 45 3 | 11(3 rows)akendb=# select * from tab02; id | num_col01 ----+----------- 1 | 73 2 | 58 2 | 22(3 rows)akendb=#
--两表数据union all:
(select * from tab01 )
union all
(select * from tab02)
akendb=# (select * from tab01 ) akendb-# union all akendb-# (select * from tab02) ; id | num_col01 ----+----------- 1 | 100 3 | 45 3 | 11 1 | 73 2 | 58 2 | 22(6 rows)akendb=#
--测试语句:业务期望是想得出两表合并后根据id排序前面的两条记录,即tab01中(id=1,num_col01=100)和tab02中的(id=1,num_col01=73)。
(select * from tab01 )
union all
(select * from tab02)
order by id asc limit 2;
akendb=# (select * from tab01 ) akendb-# union all akendb-# (select * from tab02) akendb-# order by id asc limit 2; id | num_col01 ----+----------- 1 | 100 1 | 73(2 rows)akendb=#
接着我们对调一下量表的先后顺序,除了记录的顺序不一样,得到的结果是相同的两行数据。
(select * from tab02 )
union all
(select * from tab01)
order by id asc limit 2;
akendb=# (select * from tab02 ) akendb-# union all akendb-# (select * from tab01) akendb-# order by id asc limit 2; id | num_col01 ----+----------- 1 | 73 1 | 100(2 rows)akendb=#
依样画葫芦,我们将order by和limit内推到union all各个字句中,查询结果和上面的一致,好像没什么问题。
(select * from tab01 order by id limit 2)
union all
(select * from tab02 order by id limit 2)
order by id asc limit 2;
(select * from tab02 order by id limit 2)
union all
(select * from tab01 order by id limit 2)
order by id asc limit 2;
akendb=# (select * from tab01 order by id limit 2) akendb-# union all akendb-# (select * from tab02 order by id limit 2) akendb-# order by id asc limit 2; id | num_col01 ----+----------- 1 | 100 1 | 73(2 rows)akendb=# (select * from tab02 order by id limit 2) akendb-# union all akendb-# (select * from tab01 order by id limit 2) akendb-# order by id asc limit 2; id | num_col01 ----+----------- 1 | 73 1 | 100(2 rows)akendb=#
细心的同学或许已经看出了其中的问题,为了让问题更加明显,我这里将order by字段的值改动一下:让每个表或第一张表的order by字段重复值的数量大于limit n中的n:
update tab01 set id=1;
update tab02 set id=2;
akendb=# update tab01 set id=1; --进行update之后id=1的行数为3,大于limit 2取数范围。UPDATE 3akendb=# update tab02 set id=1;UPDATE 3akendb=#
首先,直接执行原来的查询语句:
(select * from tab01 )
union all
(select * from tab02)
order by id asc limit 2;
(select * from tab02 )
union all
(select * from tab01)
order by id asc limit 2;
akendb=# (select * from tab01 ) akendb-# union all akendb-# (select * from tab02) akendb-# order by id asc limit 2; id | num_col01 ----+----------- 1 | 45 1 | 100(2 rows)akendb=# (select * from tab02 ) akendb-# union all akendb-# (select * from tab01) akendb-# order by id asc limit 2; id | num_col01 ----+----------- 1 | 58 1 | 73(2 rows)akendb=#
上面执行结果和之前的查询结果不太一样了,原因是union all中做order by实际上根据第一个表的字段在排序,然后limit的取值实际上倾向于能取第一个结果集的limit n,所以到底是tab01还是tab02在前,其实在union all的应用场景里面是有所区别的。
akendb=# explain analyze (select * from tab01 ) union all (select * from tab02) order by id asc limit 2; QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Limit (cost=110.40..110.41 rows=2 width=8) (actual time=0.031..0.033 rows=2 loops=1) -> Sort (cost=110.40..121.70 rows=4520 width=8) (actual time=0.030..0.030 rows=2 loops=1) Sort Key: tab01.id Sort Method: top-N heapsort Memory: 25kB -> Append (cost=0.00..65.20 rows=4520 width=8) (actual time=0.012..0.018 rows=6 loops=1) -> Seq Scan on tab01 (cost=0.00..32.60 rows=2260 width=8) (actual time=0.012..0.013 rows=3 loops=1) -> Seq Scan on tab02 (cost=0.00..32.60 rows=2260 width=8) (actual time=0.004..0.004 rows=3 loops=1) Planning time: 0.118 ms Execution time: 0.062 ms(9 rows)akendb=# explain analyze (select * from tab02 ) union all (select * from tab01) order by id asc limit 2; QUERY PLAN --------------------------------------------------------------------------------------------------------------------- Limit (cost=110.40..110.41 rows=2 width=8) (actual time=0.029..0.031 rows=2 loops=1) -> Sort (cost=110.40..121.70 rows=4520 width=8) (actual time=0.028..0.028 rows=2 loops=1) Sort Key: tab02.id Sort Method: top-N heapsort Memory: 25kB -> Append (cost=0.00..65.20 rows=4520 width=8) (actual time=0.010..0.016 rows=6 loops=1) -> Seq Scan on tab02 (cost=0.00..32.60 rows=2260 width=8) (actual time=0.010..0.011 rows=3 loops=1) -> Seq Scan on tab01 (cost=0.00..32.60 rows=2260 width=8) (actual time=0.003..0.004 rows=3 loops=1) Planning time: 0.110 ms Execution time: 0.058 ms(9 rows)akendb=#
接着,我们先看看“优化”改写后的效果:
(select * from tab01 order by id limit 2)
union all
(select * from tab02 order by id limit 2)
order by id asc limit 2;
(selt * from tab02 order by id limit 2)
union all
(select * from tab01 order by id limit 2)
order by id asc limit 2;
akendb=# (select * from tab01 order by id limit 2) akendb-# union all akendb-# (select * from tab02 order by id limit 2) akendb-# order by id asc limit 2; id | num_col01 ----+----------- 1 | 100 1 | 45(2 rows) akendb=# (select * from tab02 order by id limit 2) akendb-# union all akendb-# (select * from tab01 order by id limit 2) akendb-# order by id asc limit 2; id | num_col01 ----+----------- 1 | 73 1 | 58(2 rows)akendb=#
上面的输出结果,看似和原句的结果保持一致,事实上,当第一张表的中order by字段的记录满足比第二张表的记录都小的时候,那么这种内推到子句的改法是明显错误的,比如下面这种情况:
akendb=# select * from tab01; id | num_col01 ----+----------- 1 | 45 1 | 6 1 | 65(3 rows)akendb=# select * from tab02; id | num_col01 ----+----------- 2 | 13 2 | 16 2 | 82(3 rows)akendb=# (select * from tab01 order by id limit 2) union all (select * from tab02 order by id limit 2) order by id asc limit 3; id | num_col01 ----+----------- 1 | 45 1 | 6 2 | 13(3 rows)akendb=#
这里的查询结果就不应该出现第二张表的记录,但因为内推改变了取数据的逻辑,这是明显搞错了。
所以,在union all之后执行order by取top n看似逻辑的理解上正确了,但在实际查询结果却偏差了,不少开发和运维的同学可能并没有觉察,以前遇到过,今儿再次遇到了,简单记录一下。