索引选择MySQL Internals索引选择

最新推荐文章于 2023-08-19 18:07:19 发布

dyl_love98

最新推荐文章于 2023-08-19 18:07:19 发布

阅读量280

点赞数

文章标签：索引选择

本文链接：https://blog.csdn.net/javablogs/article/details/8830205

版权

查了好多资料，发明还是不全，干脆自己整理吧，至少保证在我的做法正确的，以免误导读者，也是给自己做个录记吧！

MySQL Internals-Index Merge优化

Louis Hust

0 言前

之前弄错了，认为Index Merge是MySQL5.6的新特性，来原不是，发明5.5也有，看了下manual，发明5.0的manual就已存在了，可以说是一个历史悠久的优化手腕了，好吧，不管怎么样，天今就开拨其秘神的面纱，看看其部内到底如何生成这类Index Merge的划计的。这里只细详绍介Intersect操纵，对于Union和Sort-Union的详细代码，还没开始究研。

1 Index Merge理论基础

Index Merge——索引归并，即针对一张表，同时用使多个索引停止查询，然后将各个索引查出来的结果停止进一步的操纵，可是以求交 ——Intersect，也可是以和求——Union，针对union还有一种弥补法算——Sort-Union，很奇异为什么没有Sort-Intersect，按情理也是可以做的。

什么情况下，同时用使多个索引会有益呢？比如说WHERE件条是C1=10 AND C2 =100，但是只有别分针对C1和C2的索引，而没有(C1,C2)这类索引，两个索引同时用使才有义意,通过两个索引都可以速快位定到一批据数，然后对这一批据数停止进一步的求交或和求操纵便可,这样的率效可能比全表扫描或者只用使其中一个索引停止扫描然后再去主索引查询要快。

Intersect和Union都需要用使的索引是ROR的，也就时ROWID ORDERED，即针对不同的索引扫描出来的据数必须是同时按照ROWID排序的，这里的 ROWID其实也就是InnoDB的主键(如果不定义主键，InnoDB会隐式加添ROWID列作为主键)。只有每一个索引是ROR的，才能停止归并排序，你懂的。当然你可能会有惑疑，查不录记后部内停止一次sort不一样么，何必必需要ROR呢，不错，所以有了SORT-UNION。SORT-UNION就是每一个非ROR的索引排序后再停止Merge。至于为什么没有SORT-INTERSECT，我也是很渺茫。

2 初始化据数

mysql> show create table im\G*************************** 1. row *************************** Table: imCreate Table: CREATE TABLE `im` ( `c1` int(11) DEFAULT NULL, `c2` int(11) DEFAULT NULL, `c3` int(11) DEFAULT NULL, KEY `c1` (`c1`,`c3`), KEY `c2` (`c2`,`c1`)) ENGINE=InnoDB DEFAULT CHARSET=latin11 row in set (0.00 sec)mysql> show create procedure fill_im1\G*************************** 1. row *************************** Procedure: fill_im1 sql_mode: NO_ENGINE_SUBSTITUTION Create Procedure: CREATE DEFINER=`root`@`127.0.0.1` PROCEDURE `fill_im1`(cnt int)begin declare i int default 0; repeat insert into im values(100, 50, 100); set i=i+1; until i > cnt end repeat; endcharacter_set_client: utf8collation_connection: utf8_general_ci Database Collation: latin1_swedish_ci1 row in set (0.07 sec)mysql> show create procedure fill_im2\G*************************** 1. row *************************** Procedure: fill_im2 sql_mode: NO_ENGINE_SUBSTITUTION Create Procedure: CREATE DEFINER=`root`@`127.0.0.1` PROCEDURE `fill_im2`(cnt int)begin declare i int default 0; repeat insert into im values(100, 100, 50); set i=i+1; until i > cnt end repeat; endcharacter_set_client: utf8collation_connection: utf8_general_ci Database Collation: latin1_swedish_ci1 row in set (0.00 sec)mysql> call fill_im1(2000)mysql> call fill_im2(2000)mysql> insert into im values(100,50,50);Query OK, 1 row affected (0.00 sec)mysql> insert into im values(100,50,50);Query OK, 1 row affected (0.00 sec)mysql> commit;Query OK, 0 rows affected (0.05 sec)mysql> select * from im where c1=100 and c2 = 50 and c3 = 50\G*************************** 1. row ***************************c1: 100c2: 50c3: 50*************************** 2. row ***************************c1: 100c2: 50c3: 502 rows in set (0.13 sec)

3 执行划计

mysql> explain select * from im where c1=100 and c2 = 50 and c3 = 50\G*************************** 1. row *************************** id: 1 select_type: SIMPLE table: im type: index_mergepossible_keys: c1,c2 key: c1,c2 key_len: 10,10 ref: NULL rows: 1001 Extra: Using intersect(c1,c2); Using where; Using index1 row in set (0.00 sec)

4 代码析分

从生成据数的方法可以看出来，是专门针对查询的语句停止造构的。无论是根据(c1,c3)的索引查询还是根据(c2,c1)的索引查询，都市查出一般的据数，即率效接近于全表扫描的一半。但是如果利用两个索引同时停止滤过，那么滤过出来的据数就很少了,也就是结果中的两条。

也就是说如果独单查询各个索引，滤过效果不明显，但是如果结合两个索引停止MERGE滤过，那么效果可能很明显，这里所说的滤过，用更专业的词说来是选择因子——selectivity。而划计的选择时价代的盘算，是便盘算这个选择因子。如果综合多个索引，致使选择因子很小，从而到达索引merge出来的结果集很小的话，那么划计就更倾向于Index Merge，反之则不然。

上面是选择子盘算的代码：

static double ror_scan_selectivity(const ROR_INTERSECT_INFO *info, const ROR_SCAN_INFO *scan){ double selectivity_mult= 1.0; const TABLE * const table= info->param->table; const KEY_PART_INFO * const key_part= table->key_info[scan->keynr].key_part; /** key values tuple, used to store both min_range.key and max_range.key. This function is only called for equality ranges; open ranges (e.g. "min_value < X < max_value") cannot be used for rowid ordered retrieval, so in this function we know that min_range.key == max_range.key */ uchar key_val[MAX_KEY_LENGTH+MAX_FIELD_WIDTH]; uchar *key_ptr= key_val; SEL_ARG *sel_arg, *tuple_arg= NULL; key_part_map keypart_map= 0; bool cur_covered; bool prev_covered= test(bitmap_is_set(&info->covered_fields, key_part->fieldnr-1)); key_range min_range; key_range max_range; min_range.key= key_val; min_range.flag= HA_READ_KEY_EXACT; max_range.key= key_val; max_range.flag= HA_READ_AFTER_KEY; ha_rows prev_records= table->file->stats.records; DBUG_ENTER("ror_scan_selectivity"); for (sel_arg= scan->sel_arg; sel_arg; sel_arg= sel_arg->next_key_part) { DBUG_PRINT("info",("sel_arg step")); cur_covered= test(bitmap_is_set(&info->covered_fields, key_part[sel_arg->part].fieldnr-1)); if (cur_covered != prev_covered) { /* create (part1val, ..., part{n-1}val) tuple. */ bool is_null_range= false; ha_rows records; if (!tuple_arg) { tuple_arg= scan->sel_arg; /* Here we use the length of the first key part */ tuple_arg->store_min(key_part[0].store_length, &key_ptr, 0); is_null_range|= tuple_arg->is_null_interval(); keypart_map= 1; } while (tuple_arg->next_key_part != sel_arg) { tuple_arg= tuple_arg->next_key_part; tuple_arg->store_min(key_part[tuple_arg->part].store_length, &key_ptr, 0); is_null_range|= tuple_arg->is_null_interval(); keypart_map= (keypart_map << 1) | 1; } min_range.length= max_range.length= (size_t) (key_ptr - key_val); min_range.keypart_map= max_range.keypart_map= keypart_map; /* Get the number of rows in this range. This is done by calling records_in_range() unless all these are true: 1) The user has requested that index statistics should be used for equality ranges to avoid the incurred overhead of index dives in records_in_range() 2) The range is not on the form "x IS NULL". The reason is that the number of rows with this value are likely to be very different than the values in the index statistics 3) Index statistics is available. @see key_val */ if (!info->param->use_index_statistics || // (1) is_null_range || // (2) !(records= table->key_info[scan->keynr]. rec_per_key[tuple_arg->part])) // (3) { DBUG_EXECUTE_IF("crash_records_in_range", DBUG_SUICIDE();); DBUG_ASSERT(min_range.length > 0); records= (table->file-> records_in_range(scan->keynr, &min_range, &max_range)); } if (cur_covered) { /* uncovered -> covered */ double tmp= rows2double(records)/rows2double(prev_records); DBUG_PRINT("info", ("Selectivity multiplier: %g", tmp)); selectivity_mult *= tmp; prev_records= HA_POS_ERROR; } else { /* covered -> uncovered */ prev_records= records; } } prev_covered= cur_covered; } if (!prev_covered) { double tmp= rows2double(table->quick_rows[scan->keynr]) / rows2double(prev_records); DBUG_PRINT("info", ("Selectivity multiplier: %g", tmp)); selectivity_mult *= tmp; } // Todo: This assert fires in PB sysqa RQG tests. // DBUG_ASSERT(selectivity_mult <= 1.0); DBUG_PRINT("info", ("Returning multiplier: %g", selectivity_mult)); DBUG_RETURN(selectivity_mult);}

每日一情理
巴尔扎克说过“不幸，是天才的进升阶梯，信徒的洗礼之水，弱者的无底深渊”。风雨过后，眼前会是鸥翔鱼游的天水一色；走出荆棘，前面就是铺满鲜花的康庄大道；登上山顶，脚下是便积翠如云的空蒙山色。在这个世界上，一星陨落，黯淡不了星空灿烂，一花凋零，荒芜不了整个春天。人生要尽全力度过每一关，不管遇到什么困难不可轻言放弃。

刚看到这段代码时，确切有点犯懵，代码的注释给了很大的助帮：

/* Get selectivity of adding a ROR scan to the ROR-intersection. SYNOPSIS ror_scan_selectivity() info ROR-interection, an intersection of ROR index scans scan ROR scan that may or may not improve the selectivity of 'info' NOTES Suppose we have conditions on several keys cond=k_11=c_11 AND k_12=c_12 AND ... // key_parts of first key in 'info' k_21=c_21 AND k_22=c_22 AND ... // key_parts of second key in 'info' ... k_n1=c_n1 AND k_n3=c_n3 AND ... (1) //key_parts of 'scan' where k_ij may be the same as any k_pq (i.e. keys may have common parts). Note that for ROR retrieval, only equality conditions are usable so there are no open ranges (e.g., k_ij > c_ij) in 'scan' or 'info' A full row is retrieved if entire condition holds. The recursive procedure for finding P(cond) is as follows: First step: Pick 1st part of 1st key and break conjunction (1) into two parts: cond= (k_11=c_11 AND R) Here R may still contain condition(s) equivalent to k_11=c_11. Nevertheless, the following holds: P(k_11=c_11 AND R) = P(k_11=c_11) * P(R | k_11=c_11). Mark k_11 as fixed field (and satisfied condition) F, save P(F), save R to be cond and proceed to recursion step. Recursion step: We have a set of fixed fields/satisfied conditions) F, probability P(F), and remaining conjunction R Pick next key part on current key and its condition "k_ij=c_ij". We will add "k_ij=c_ij" into F and update P(F). Lets denote k_ij as t, R = t AND R1, where R1 may still contain t. Then P((t AND R1)|F) = P(t|F) * P(R1|t|F) = P(t|F) * P(R1|(t AND F)) (2) (where '|' mean conditional probability, not "or") Consider the first multiplier in (2). One of the following holds: a) F contains condition on field used in t (i.e. t AND F = F). Then P(t|F) = 1 b) F doesn't contain condition on field used in t. Then F and t are considered independent. P(t|F) = P(t|(fields_before_t_in_key AND other_fields)) = = P(t|fields_before_t_in_key). P(t|fields_before_t_in_key) = #records(fields_before_t_in_key) / #records(fields_before_t_in_key, t) The second multiplier is calculated by applying this step recursively. IMPLEMENTATION This function calculates the result of application of the "recursion step" described above for all fixed key members of a single key, accumulating set of covered fields, selectivity, etc. The calculation is conducted as follows: Lets denote #records(keypart1, ... keypartK) as n_k. We need to calculate n_{k1} n_{k2} --------- * --------- * .... (3) n_{k1-1} n_{k2-1} where k1,k2,... are key parts which fields were not yet marked as fixed ( this is result of application of option b) of the recursion step for parts of a single key). Since it is reasonable to expect that most of the fields are not marked as fixed, we calculate (3) as n_{i1} n_{i2} (3) = n_{max_key_part} / ( --------- * --------- * .... ) n_{i1-1} n_{i2-1} where i1,i2, .. are key parts that were already marked as fixed. In order to minimize number of expensive records_in_range calls we group and reduce adjacent fractions. Note that on the optimizer's request, index statistics may be used instead of records_in_range @see RANGE_OPT_PARAM::use_index_statistics. RETURN Selectivity of given ROR scan, a number between 0 and 1. 1 means that adding 'scan' to the intersection does not improve the selectivity.*/

注释想说明的就是选择因子的概率如何停止盘算，其实就是不同INDEX之间差异性的索引列会引起选择因子一直变小，即 Index之间差异性越大，滤过的录记就越多，选择出来的据数集就会越少。INDEX的差异性就是INdex之间索引列列是不是重复出现在不同索引之间，两个INDEX约相似，那么MERGE的结果集越大。详细的实现大家自己看看吧，明确了理原，实现都是浮云了。

BTW, 5.6的Optimizer trace非常好用，对于想要踪跟Optimizer部内的学同说来，可以先把细详的划计生成程流通过Optimizer trace 打印出来，对比优化程流，能就更好的位定到代码。

References

[1]

index-merge-optimization

File translated from T_EX by T_TH, version 4.03.
On 28 Jan 2013, 22:35.

文章结束给大家分享下程序员的一些笑话语录：一位程序员去海边游泳，由于水性不佳，游不回岸了，于是他挥着手臂，大声求.救：“F1，F1！”

dyl_love98

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
0
评论
索引选择MySQL Internals索引选择

查了好多资料，发明还是不全，干脆自己整理吧，至少保证在我的做法正确的，以免误导读者，也是给自己做个录记吧！ MySQL Internals-Index Merge优化 Louis Hust 0 言前之前弄错了，认为Index Merge是MySQL5.6的新特性，来原不是，发明5.5也有，看了下man
复制链接

扫一扫