Cardinality对表Join的影响

如果你执行SHOW INDEX FROM TABLE_NAME察看所引信息,中间会有一列Cardinality,MySQL在解析多重Join的时候会根据Cardinality的信息决定选择什么路径来执行Join,这种算法的前提条件是索引数据时很平均分配的,但如果索引中的数据非常不均衡,会导致MySQL做出错误的选择。 下面是一个完整的例子:

首先创建三个测试表:

CREATE TABLE customer (
-- Unique ID
customer_id integer NOT NULL,

CONSTRAINT PK_customer PRIMARY KEY(customer_id)
);

CREATE TABLE my_order (
-- Unique ID
order_id integer NOT NULL,

product_id integer NOT NULL,

customer_id integer NOT NULL,

CONSTRAINT PK_order PRIMARY KEY(product_id, customer_id ),
CONSTRAINT FK_order2customer_id FOREIGN KEY (customer_id) REFERENCES customer(customer_id),
CONSTRAINT UQ_order_id UNIQUE(order_id)

);


CREATE TABLE delivery (

order_id int NOT NULL,

time datetime NOT NULL,

CONSTRAINT PK_order PRIMARY KEY(order_id, time),
CONSTRAINT FK_order FOREIGN KEY (order_id) REFERENCES my_order(order_id)
);

然后加入数据:

set @N=0;

insert into customer select @N:=@N+1 from mysql.help_topic LIMIT 1000;

set @N=0;

insert into my_order select @N:=@N+1, @N, 1 from mysql.help_topic a, mysql.help_topic b LIMIT 100000;

set @I=1;

insert into my_order select @N:=@N+1, @N, @I:=@I+1 from mysql.help_topic a, mysql.help_topic b LIMIT 600;

set @N=0;

insert into delivery select @N:=@N+1, '2010-05-10 15:22:02' from mysql.help_topic a, mysql.help_topic b LIMIT 600;

注意在my_order表中大多数记录的customer_id的值是1。

执行ANALYZE TABLE来更新Cardinality:

ANALYZE TABLE my_order;

ANALYZE TABLE delivery;

下面我们来执行一个Query:

mysql> SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----------+
| count(*) |
+----------+
| 600 |
+----------+
1 row in set (0.51 sec)

然后用Explain分析一下:

mysql> explain SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 JOIN delivery ON b.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
| 1 | SIMPLE | a | ref | PRIMARY,FK_order2customer_id | PRIMARY | 4 | const | 1 | Using index |
| 1 | SIMPLE | b | ref | UQ_order_id,FK_order2customer_id | FK_order2customer_id | 4 | oliver_test.a.customer_id | 167 | |
| 1 | SIMPLE | delivery | eq_ref | PRIMARY | PRIMARY | 12 | oliver_test.b.order_id,const | 1 | Using index |
+----+-------------+----------+--------+----------------------------------+----------------------+---------+------------------------------+------+-------------+
3 rows in set (0.00 sec)

从explain的结果来看,MySQL选择先把my_order自己做Join然后再去Join表delivery。第一个Join需要访问167行数据。 但实际情况是如何呢?

mysql> SELECT count(*) FROM my_order a JOIN my_order b ON a.customer_id = b.customer_id and a.product_id = 123 ;
+----------+
| count(*) |
+----------+
| 100000 |
+----------+
1 row in set (0.06 sec)

如果我们只执行第一个Join,实际将访问100000条记录,远远高于167,相反如果我们先做第二个Join, 我们只需要访问600行:

mysql> select count(*) from delivery JOIN my_order ON my_order.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02';
+----------+
| count(*) |
+----------+
| 600 |
+----------+
1 row in set (0.00 sec)

为什么MySQL会选择错误的顺序来执行呢?首先察看一下my_order表的索引:

mysql> show index from my_order;
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| my_order | 0 | PRIMARY | 1 | product_id | A | 100600 | NULL | NULL | | BTREE | |
| my_order | 0 | PRIMARY | 2 | customer_id | A | 100600 | NULL | NULL | | BTREE | |
| my_order | 0 | UQ_order_id | 1 | order_id | A | 100600 | NULL | NULL | | BTREE | |
| my_order | 1 | FK_order2customer_id | 1 | customer_id | A | 602 | NULL | NULL | | BTREE | |
+----------+------------+----------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
4 rows in set (0.00 sec)

我们可以看到对于索引FK_order2customer_id,Cardinality的值是602,MySQL在估算第一个Join的时候,假设索引是平均分布的,用总行数(100600)除以Cardinality,所以得到167,由于167小于第二个Join需要访问的行数600,所以选择先执行第一个Join。

如何解决这个问题?

修改Query中Join的顺序并用STRAIGHT_JOIN强制MySQL按照这个顺序执行,下面是新的Query:

mysql> SELECT count(*) FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON a.customer_id = b.customer_id and b.product_id = 123 ;
+----------+
| count(*) |
+----------+
| 600 |
+----------+
1 row in set (0.00 sec)

你可以看到查询时间从0.51秒变成0秒。

用Explain察看新的执行顺序:

mysql> explain SELECT count(*) FROM delivery STRAIGHT_JOIN my_order a ON a.order_id = delivery.order_id AND delivery.time = '2010-05-10 15:22:02' JOIN my_order b ON a.customer_id = b.customer_id and b.product_id = 123 ;
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
| 1 | SIMPLE | b | ref | PRIMARY,FK_order2customer_id | PRIMARY | 4 | const | 1 | Using index |
| 1 | SIMPLE | delivery | index | PRIMARY | PRIMARY | 12 | NULL | 600 | Using where; Using index; Using join buffer |
| 1 | SIMPLE | a | eq_ref | UQ_order_id,FK_order2customer_id | UQ_order_id | 4 | oliver_test.delivery.order_id | 1 | Using where |
+----+-------------+----------+--------+----------------------------------+-------------+---------+-------------------------------+------+---------------------------------------------+
3 rows in set (0.00 sec)

另外很重要的一点是这个问题在开始阶段通常不会注意,因为数据量少不会影响性能,但随着数据不断增加,一旦达到一定数量,就会突然出现并影响系统的性能。
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值