Should You Index a MySQL Column with Very Few Distinct Values?

Author: 知其黑、受其白

Article link: https://blog.csdn.net/weiguang102/article/details/119889711

Scenario

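Back when I was new to MySQL, I saw colleagues add indexes to columns that take only a handful of distinct values. That contradicted most of the MySQL index-optimization articles I had read, and it puzzled me.

For example, an order status column with only six values: 0 pending confirmation, 1 confirmed, 2 received, 3 cancelled, 4 completed, 5 closed.

Once I understood how MySQL's B+tree indexes work, it seemed well worth testing in practice whether an index is actually needed in this situation.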

Creating the tables and data

Create the table with the index

DROP TABLE IF EXISTS `bool_index`;
CREATE TABLE `bool_index` (
	`id` INT (11) NOT NULL AUTO_INCREMENT,
	`rand_id` VARCHAR (200) COMMENT 'random value',
	`order_status` TINYINT (1) NOT NULL DEFAULT '0' COMMENT 'order status: 0 pending confirmation, 1 confirmed, 2 received, 3 cancelled, 4 completed, 5 closed',
	`created_at` datetime NOT NULL,
	PRIMARY KEY (`id`),
	KEY `idx_order_status` (`order_status`)
) ENGINE = INNODB DEFAULT CHARSET = utf8;

Create the table without the index

DROP TABLE IF EXISTS `bool_no_index`;
CREATE TABLE `bool_no_index` (
	`id` INT (11) NOT NULL AUTO_INCREMENT,
	`rand_id` VARCHAR (200) COMMENT 'random value',
	`order_status` TINYINT (1) NOT NULL DEFAULT '0' COMMENT 'order status: 0 pending confirmation, 1 confirmed, 2 received, 3 cancelled, 4 completed, 5 closed',
	`created_at` datetime NOT NULL,
	PRIMARY KEY (`id`)
) ENGINE = INNODB DEFAULT CHARSET = utf8;

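Generate test data

DELIMITER $$
DROP PROCEDURE IF EXISTS `proc_index`$$
CREATE PROCEDURE proc_index()
BEGIN
   -- Local variables carry a v_ prefix so they cannot be confused with the column names
   DECLARE v_rand_id VARCHAR(120);
   DECLARE v_order_status INT;
   DECLARE i INT DEFAULT 0;
   DECLARE v_created_at DATETIME;
   -- Insert 10,000 rows per call; each row is written identically to both tables
   WHILE i < 10000 DO
     SET v_rand_id = SUBSTRING(MD5(RAND()),1,28);
     -- Order status: 0 pending confirmation, 1 confirmed, 2 received, 3 cancelled, 4 completed, 5 closed
     -- FLOOR(RAND()*10) is 0..9, so after %6 the values 0-3 come up twice as often as 4 and 5
     SET v_order_status = FLOOR(RAND()*10)%6;
     SET v_created_at = NOW();
     INSERT INTO `bool_index`(`rand_id`,`order_status`,`created_at`) VALUES(v_rand_id,v_order_status,v_created_at);
     INSERT INTO `bool_no_index`(`rand_id`,`order_status`,`created_at`) VALUES(v_rand_id,v_order_status,v_created_at);
     SET i = i+1;
   END WHILE;
END$$
-- Restore the default delimiter before calling the procedure
DELIMITER ;
CALL proc_index();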
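The status expression FLOOR(RAND()*10)%6 in the procedure above does not spread rows evenly over the six statuses. The following is a minimal sketch of why, assuming RAND() is uniform on [0, 1) so that FLOOR(RAND()*10) yields each integer 0..9 equally often. It enumerates those ten outcomes instead of sampling, and the names counts and distribution are only illustrative.

```python
from collections import Counter
from fractions import Fraction

# FLOOR(RAND()*10) yields each integer 0..9 with equal probability (assumed),
# so enumerating those ten outcomes gives the exact distribution of "% 6".
counts = Counter(n % 6 for n in range(10))
distribution = {status: Fraction(counts[status], 10) for status in range(6)}

for status, share in distribution.items():
    print(f"order_status={status}: {float(share):.0%}")
```

Statuses 0 to 3 each receive about 20% of the rows and statuses 4 and 5 about 10% each, which is consistent with the roughly 2,000 rows per 10,000 reported for order_status=3 in the results.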

Each call adds 10,000 rows to both tables.

Test results at different table sizes

Rows in table / elapsed time	select * from bool_index where order_status=0 and rand_id="9a1dae86505d1778cb80e3bcb0b3";	select * from bool_no_index where order_status=1 and rand_id='dade3f937953e2e525e913939381';	Rows with order_status=3
10,000	0.002s	0.002s	approx. 2,000
40,000	0.011s	0.009s	approx. 8,000
80,000	0.021s	0.021s	approx. 16,000
160,000	0.059s	0.040s	approx. 32,000
320,000	0.142s	0.110s	approx. 63,000
640,000	1.194s	0.383s	approx. 120,000
1,000,000	2.761s	0.563s	approx. 200,000
2,000,000	7.025s	1.158s	approx. 400,000

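Comparing the two, the indexed and unindexed queries perform about the same while the table holds fewer than 160,000 rows. Beyond 160,000 rows, the indexed query becomes relatively slower and slower as the table grows: at 2,000,000 rows it takes 7.025 s against 1.158 s without the index.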
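To make the trend easier to see, the measured timings can be turned into ratios. This sketch only re-expresses the numbers from the results table; the helper name slowdown is hypothetical.

```python
# (rows in table, indexed seconds, unindexed seconds), copied from the results table
timings = [
    (10_000, 0.002, 0.002),
    (40_000, 0.011, 0.009),
    (80_000, 0.021, 0.021),
    (160_000, 0.059, 0.040),
    (320_000, 0.142, 0.110),
    (640_000, 1.194, 0.383),
    (1_000_000, 2.761, 0.563),
    (2_000_000, 7.025, 1.158),
]


def slowdown(indexed_s, unindexed_s):
    """How many times longer the indexed query took than the unindexed one."""
    return indexed_s / unindexed_s


ratios = {rows: slowdown(indexed, unindexed) for rows, indexed, unindexed in timings}
for rows, ratio in ratios.items():
    print(f"{rows:>9,} rows: indexed query takes {ratio:.2f}x the unindexed time")
```

The ratio stays near 1 for the smaller tables and climbs past 3 from 640,000 rows onward.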

Why does the indexed query get slower than the unindexed one as the data grows?

Take the 20,001st record as an example: rand_id='6b84b2dbd3e81bfc67366a640d1a', order_status=0.

mysql> select * from bool_no_index where id>20000 limit 1;
+-------+------------------------------+--------------+---------------------+
| id    | rand_id                      | order_status | created_at          |
+-------+------------------------------+--------------+---------------------+
| 20001 | 6b84b2dbd3e81bfc67366a640d1a |            0 | 2021-08-24 15:05:59 |
+-------+------------------------------+--------------+---------------------+
1 row in set (0.00 sec)

Analyzing the execution plans with EXPLAIN

mysql> explain select * from bool_no_index where order_status=0 and rand_id="6b84b2dbd3e81bfc67366a640d1a";
+----+-------------+---------------+------------+------+---------------+------+---------+------+-------+----------+-------------+
| id | select_type | table         | partitions | type | possible_keys | key  | key_len | ref  | rows  | filtered | Extra       |
+----+-------------+---------------+------------+------+---------------+------+---------+------+-------+----------+-------------+
|  1 | SIMPLE      | bool_no_index | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 30259 |     1.00 | Using where |
+----+-------------+---------------+------------+------+---------------+------+---------+------+-------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

mysql> explain select * from bool_index where order_status=0 and rand_id="6b84b2dbd3e81bfc67366a640d1a";
+----+-------------+------------+------------+------+------------------+------------------+---------+-------+------+----------+-------------+
| id | select_type | table      | partitions | type | possible_keys    | key              | key_len | ref   | rows | filtered | Extra       |
+----+-------------+------------+------------+------+------------------+------------------+---------+-------+------+----------+-------------+
|  1 | SIMPLE      | bool_index | NULL       | ref  | idx_order_status | idx_order_status | 1       | const | 6039 |    10.00 | Using where |
+----+-------------+------------+------------+------+------------------+------------------+---------+-------+------+----------+-------------+
1 row in set, 1 warning (0.00 sec)

With the index, the plan estimates rows=6039; without it, rows=30259 (a full table scan). The indexed query clearly examines fewer rows, so why is it slower?
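As a reading aid for the two EXPLAIN rows above: rows is the optimizer's estimate of how many rows it must examine, and filtered is the estimated percentage of those rows that survive the remaining conditions. A small sketch of that arithmetic follows, with a hypothetical helper name estimated_matches. These are optimizer estimates, not measured counts.

```python
def estimated_matches(rows, filtered_pct):
    """rows * filtered / 100: rows the optimizer expects to pass the whole WHERE clause."""
    return rows * filtered_pct / 100


# (rows, filtered) taken from the two EXPLAIN outputs
plans = {
    "bool_no_index (type=ALL)": (30259, 1.00),
    "bool_index (type=ref)": (6039, 10.00),
}

for name, (rows, filtered) in plans.items():
    matches = estimated_matches(rows, filtered)
    print(f"{name}: examines ~{rows} rows, expects ~{matches:.0f} to match")
```

Neither figure says how expensive each examined row is, and that per-row cost is where the two plans differ.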

mysql> show profiles;
+----------+------------+-----------------------------------------------------------------------------------------------------+
| Query_ID | Duration   | Query                                                                                               |
+----------+------------+-----------------------------------------------------------------------------------------------------+
|       21 | 0.00019875 | explain select * from bool_no_index where order_status=0 and rand_id="6b84b2dbd3e81bfc67366a640d1a" |
|       22 | 0.00022850 | explain select * from bool_index where order_status=0 and rand_id="6b84b2dbd3e81bfc67366a640d1a"    |
+----------+------------+-----------------------------------------------------------------------------------------------------+
15 rows in set, 1 warning (0.00 sec)

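A very easy-to-follow scenario (reading 20% of a table's rows through an index) explains this interesting behaviour:

Rows in the table: 100,000
Rows we want to read (20% of the table): 20,000
Size of each row: 80 bytes
Size of a database data block: 8K, taken as 8,000 bytes to keep the numbers round

That gives:

Rows per data block: 100
Data blocks in the table: 1,000

These numbers are simple enough, so let's dig out the story behind them:

Reading 20,000 rows through the index = about 20,000 table accesses by rowid (in InnoDB terms, lookups back into the clustered index by primary key) = up to 20,000 block accesses to run the query.
But note: the whole table has only 1,000 blocks!

So reading 20% of the rows through the index amounts to reading the entire table about 20 times over.

In this situation, reading the whole table directly is more efficient. (The index path also has to go back to the table once for every matching row.)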
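The arithmetic above can be sketched in a few lines. This is a simplified model, not a description of InnoDB's actual I/O: it assumes every row lookup costs one block access in the worst case (no caching) and that matching rows are spread independently across blocks. The function name index_vs_scan and its parameters are hypothetical, and the 8,000-byte block is the article's round figure (InnoDB's default page size is 16 KB).

```python
def index_vs_scan(total_rows, row_bytes, block_bytes, percent_read):
    """Compare worst-case block accesses via an index with a full scan."""
    rows_per_block = block_bytes // row_bytes
    total_blocks = -(-total_rows // rows_per_block)  # ceiling division
    row_lookups = total_rows * percent_read // 100   # one table lookup per matching row
    # Chance that a block holds no matching row, if rows match independently.
    p_block_untouched = (1 - percent_read / 100) ** rows_per_block
    return {
        "rows_per_block": rows_per_block,
        "total_blocks": total_blocks,
        "row_lookups": row_lookups,
        "lookups_per_block": row_lookups / total_blocks,
        "expected_blocks_touched": total_blocks * (1 - p_block_untouched),
    }


result = index_vs_scan(total_rows=100_000, row_bytes=80, block_bytes=8000, percent_read=20)
for name, value in result.items():
    print(f"{name}: {value}")
```

Under these assumptions the index path makes 20 lookups per block on average, and practically every one of the 1,000 blocks contains at least one matching row. Even with perfect caching, the index path would therefore still read the whole table, on top of reading the index itself.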

Summary: do not build indexes on columns that are updated very frequently or that have low selectivity.

Checking MySQL statement execution time

Enable profiling ==> run the query ==> show the timings

mysql> set profiling=1;
Query OK, 0 rows affected (0.00 sec)

mysql> select * from t_user;

mysql> show profiles;
+----------+----------+----------------------+
| Query_ID | Duration | Query                |
+----------+----------+----------------------+
|        1 | 0.277744 | select * from t_user |
+----------+----------+----------------------+
1 row in set