Theory Notes | Hash Joins in MySQL 8.0.18 and Later

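Before MySQL 8.0.18, joins were executed only with nested-loop algorithms (the plain nested loop and its block nested loop variant).
MySQL 8.0.18 introduced the hash join as a replacement for block nested loop in joins that cannot use an index.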

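The principle of a hash join, in brief:
Pick the input that occupies less space, t1 (smaller in bytes, not necessarily in row count), as the driving (build) table. Hash its join column and build an in-memory hash table from its rows, keyed on that column. Then scan the driven (probe) table t2, hash the join column of each row, and look it up in the in-memory hash table to find matches.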
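
The build and probe phases described above can be sketched in a few lines of Python. This is only an illustration of the classic in-memory hash join, not MySQL's implementation; the function name `hash_join`, the rows-as-dicts representation, and the sample tables are assumptions made for the example.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Inner equi-join of two lists of dict rows (illustrative only)."""
    # Build phase: hash the smaller input on its join column.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: look up each row of the other input in the hash table.
    result = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            result.append((match, row))
    return result


# Hypothetical sample data: t1 is the smaller (build) input.
t1 = [{"a": 1, "x": "one"}, {"a": 2, "x": "two"}]
t2 = [{"a": 2, "y": "B"}, {"a": 3, "y": "C"}, {"a": 2, "y": "D"}]

joined = hash_join(t1, t2, "a", "a")
```

Each input is read once, which is why this approach tends to beat re-scanning the inner table for every outer row when no index is available.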

A hash join is used only when no index can be applied to the join columns.
In 8.0.18 the join must also contain an equi-join condition (for example, table1.a = table2.a).

The manual states:
Beginning with MySQL 8.0.18, MySQL employs a hash join for any query for which each join has an equi-join condition, and in which there are no indexes that can be applied to any join conditions.

A hash join can also be used when there are one or more indexes that can be used for single-table predicates.

In 8.0.20 the equi-join requirement was dropped, so hash joins now also cover non-equi-joins, semijoins, antijoins, and left and right outer joins.

The manual states:
In MySQL 8.0.20 and later, it is no longer necessary for the join to contain at least one equi-join condition in order for a hash join to be used. This means that the types of queries which can be optimized using hash joins include those in the following list (with examples):

Inner non-equi-join:

mysql> EXPLAIN FORMAT=TREE SELECT * FROM t1 JOIN t2 ON t1.c1 < t2.c1\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (t1.c1 < t2.c1)  (cost=4.70 rows=12)
    -> Inner hash join (no condition)  (cost=4.70 rows=12)
        -> Table scan on t2  (cost=0.08 rows=6)
        -> Hash
            -> Table scan on t1  (cost=0.85 rows=6)

Semijoin:

mysql> EXPLAIN FORMAT=TREE SELECT * FROM t1
    -> WHERE t1.c1 IN (SELECT t2.c2 FROM t2)\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop inner join
    -> Filter: (t1.c1 is not null)  (cost=0.85 rows=6)
        -> Table scan on t1  (cost=0.85 rows=6)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (c2=t1.c1)
        -> Materialize with deduplication
            -> Filter: (t2.c2 is not null)  (cost=0.85 rows=6)
                -> Table scan on t2  (cost=0.85 rows=6)

Antijoin:

mysql> EXPLAIN FORMAT=TREE SELECT * FROM t2
    -> WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.col1 = t2.col1)\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop antijoin
    -> Table scan on t2  (cost=0.85 rows=6)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (c1=t2.c1)
        -> Materialize with deduplication
            -> Filter: (t1.c1 is not null)  (cost=0.85 rows=6)
                -> Table scan on t1  (cost=0.85 rows=6)

Left outer join:

mysql> EXPLAIN FORMAT=TREE SELECT * FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1\G
*************************** 1. row ***************************
EXPLAIN: -> Left hash join (t2.c1 = t1.c1)  (cost=3.99 rows=36)
    -> Table scan on t1  (cost=0.85 rows=6)
    -> Hash
        -> Table scan on t2  (cost=0.14 rows=6)

Right outer join (observe that MySQL rewrites all right outer joins as left outer joins):

mysql> EXPLAIN FORMAT=TREE SELECT * FROM t1 RIGHT JOIN t2 ON t1.c1 = t2.c1\G
*************************** 1. row ***************************
EXPLAIN: -> Left hash join (t1.c1 = t2.c1)  (cost=3.99 rows=36)
    -> Table scan on t2  (cost=0.85 rows=6)
    -> Hash
        -> Table scan on t1  (cost=0.14 rows=6)
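
To make the "Left hash join" plans above more concrete, here is a minimal Python sketch of how a left outer join can be evaluated with a hash table: the table is built on the inner input (the one under "Hash" in the plan), the outer input is scanned, and an outer row without a match is still emitted, padded with None in place of NULL. The function name `left_hash_join` and the sample rows are hypothetical, and this is not MySQL's code.

```python
from collections import defaultdict


def left_hash_join(outer_rows, inner_rows, key):
    """Left outer equi-join on `key`; rows are dicts (illustrative only)."""
    # Build the hash table on the inner (right-hand) input.
    table = defaultdict(list)
    for row in inner_rows:
        if row[key] is not None:  # NULL never compares equal in SQL
            table[row[key]].append(row)

    result = []
    for row in outer_rows:
        matches = table.get(row[key], [])
        if matches:
            result.extend((row, m) for m in matches)
        else:
            # No match: keep the outer row and NULL-extend the inner side.
            result.append((row, None))
    return result


# Hypothetical sample rows.
t1 = [{"c1": 1}, {"c1": 2}, {"c1": None}]
t2 = [{"c1": 2}, {"c1": 2}, {"c1": 4}]
rows = left_hash_join(t1, t2, "c1")
```

A right outer join can reuse the same function with the two inputs swapped, which mirrors the rewrite visible in the last plan.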

A hash join is usually faster than and is intended to be used in such cases instead of the block nested loop algorithm (see Block Nested-Loop Join Algorithm) employed in previous versions of MySQL. Beginning with MySQL 8.0.20, support for block nested loop is removed, and the server employs a hash join wherever a block nested loop would have been used previously.

By default, MySQL 8.0.18 and later employs hash joins whenever possible. It is possible to control whether hash joins are employed using one of the BNL and NO_BNL optimizer hints.

(MySQL 8.0.18 supported hash_join=on or hash_join=off as part of the setting for the optimizer_switch server system variable as well as the optimizer hints HASH_JOIN or NO_HASH_JOIN. In MySQL 8.0.19 and later, these no longer have any effect.)

Memory usage by hash joins can be controlled using the join_buffer_size system variable; a hash join cannot use more memory than this amount. When the memory required for a hash join exceeds the amount available, MySQL handles this by using files on disk. If this happens, you should be aware that the join may not succeed if a hash join cannot fit into memory and it creates more files than set for open_files_limit. To avoid such problems, make either of the following changes:

Increase join_buffer_size so that the hash join does not spill over to disk.
Increase open_files_limit.
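
Why spilling can run into open_files_limit is easier to see with a toy model. The sketch below assumes a partitioned (Grace-style) strategy: when the build input exceeds a row budget, both inputs are split into temporary files by the hash of the join key, and each pair of files is then joined in memory. Every name here (`partitioned_hash_join`, `max_rows_in_memory`, `num_partitions`) is hypothetical, the row budget merely stands in for join_buffer_size, and MySQL's actual on-disk algorithm may differ in its details.

```python
import json
import tempfile
from collections import defaultdict


def in_memory_join(build_rows, probe_rows, key):
    """Plain build-and-probe hash join on lists of dict rows."""
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    return [(b, p) for p in probe_rows for b in table.get(p[key], [])]


def partition_to_files(rows, key, num_partitions):
    """Write each row to one of num_partitions temp files, chosen by hash."""
    files = [tempfile.TemporaryFile(mode="w+") for _ in range(num_partitions)]
    for row in rows:
        target = files[hash(row[key]) % num_partitions]
        target.write(json.dumps(row) + "\n")
    return files


def read_rows(f):
    f.seek(0)
    return [json.loads(line) for line in f]


def partitioned_hash_join(build_rows, probe_rows, key,
                          max_rows_in_memory, num_partitions=4):
    """Return (joined rows, number of temp files opened)."""
    # Small enough: no spill, no files.
    if len(build_rows) <= max_rows_in_memory:
        return in_memory_join(build_rows, probe_rows, key), 0

    # Too large: spill both inputs; equal keys land in the same partition.
    build_files = partition_to_files(build_rows, key, num_partitions)
    probe_files = partition_to_files(probe_rows, key, num_partitions)
    result = []
    for bf, pf in zip(build_files, probe_files):
        result.extend(in_memory_join(read_rows(bf), read_rows(pf), key))
    for f in build_files + probe_files:
        f.close()
    return result, 2 * num_partitions


# Hypothetical data: 10 build rows against a budget of 3 forces a spill.
t1 = [{"a": i} for i in range(10)]
t2 = [{"a": i % 5} for i in range(20)]
spilled, files_used = partitioned_hash_join(t1, t2, "a", max_rows_in_memory=3)
fits, no_files = partitioned_hash_join(t1, t2, "a", max_rows_in_memory=100)
```

In this model a single spilled join keeps two files per partition open at once, so several concurrent joins can add up quickly; raising the row budget avoids the files altogether, which corresponds to the first remedy listed above.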

Beginning with MySQL 8.0.18, join buffers for hash joins are allocated incrementally; thus, you can set join_buffer_size higher without small queries allocating very large amounts of RAM, but outer joins allocate the entire buffer. In MySQL 8.0.20 and later, hash joins are used for outer joins (including antijoins and semijoins) as well, so this is no longer an issue.

Reference
https://dev.mysql.com/doc/refman/8.0/en/hash-joins.html
