/*
Latching strategy of the InnoDB B-tree
--------------------------------------
A tree latch protects all non-leaf nodes of the tree. Each node of a tree
also has a latch of its own.
A B-tree operation normally first acquires an S-latch on the tree. It
searches down the tree and releases the tree latch when it has the
leaf node latch. To save CPU time we do not acquire any latch on
non-leaf nodes of the tree during a search; those pages are only buffer-fixed.
If an operation needs to restructure the tree, it acquires an X-latch on
the tree before searching to a leaf node. If it needs, for example, to
split a leaf,
(1) InnoDB decides the split point in the leaf,
(2) allocates a new page,
(3) inserts the appropriate node pointer to the first non-leaf level,
(4) releases the tree X-latch,
(5) and then moves records from the leaf to the newly allocated page.
Node pointers
-------------
Leaf pages of a B-tree contain the index records stored in the
tree. On levels n > 0 we store 'node pointers' to pages on level
n - 1. For each page there is exactly one node pointer stored:
thus our tree is an ordinary B-tree, not a B-link tree.
A node pointer contains a prefix P of an index record. The prefix
is long enough so that it determines an index record uniquely.
The file page number of the child page is added as the last
field. In the child page we can store node pointers or index records
which are >= P in the alphabetical order, but < P1 if there is
a next node pointer on the level, and P1 is its prefix.
If a node pointer with a prefix P points to a non-leaf child,
then the leftmost record in the child must have the same
prefix P. If it points to a leaf node, the child is not required
to contain any record with a prefix equal to P. The leaf case
is decided this way to allow arbitrary deletions in a leaf node
without touching upper levels of the tree.
We have predefined a special minimum record which we
define as the smallest record in any alphabetical order.
A minimum record is denoted by setting a bit in the record
header. A minimum record acts as the prefix of a node pointer
which points to a leftmost node on any level of the tree.
File page allocation
--------------------
In the root node of a B-tree there are two file segment headers.
The leaf pages of a tree are allocated from one file segment, to
make them consecutive on disk if possible. From the other file segment
we allocate pages for the non-leaf levels of the tree.
*/
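The node pointer rule above (a child holds the records that are >= P and < P1, and the minimum record sorts before everything) can be illustrated with a small lookup. This is only a sketch under simplifying assumptions: node_ptr_t, node_ptr_cmp and node_ptr_find_child are hypothetical names, prefixes are plain C strings compared with strcmp, and none of this is InnoDB's actual record format or search code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified node pointer: a string prefix P plus the
page number of the child. is_min models the "minimum record" bit in
the record header. */
typedef struct {
    const char*     prefix;         /* prefix P of an index record */
    int             is_min;         /* nonzero for the minimum record */
    unsigned long   child_page_no;  /* file page number of the child */
} node_ptr_t;

/* Compares a search key with a node pointer prefix. The minimum
record is smaller than every key. */
static int
node_ptr_cmp(const char* key, const node_ptr_t* ptr)
{
    if (ptr->is_min) {
        return 1;
    }

    return strcmp(key, ptr->prefix);
}

/* Returns the page number of the child whose range [P, P1) contains
key, i.e. the last node pointer whose prefix is <= key. The array must
be sorted by prefix. It is assumed that key is not smaller than the
first prefix, which holds when ptrs[0] is the minimum record. */
static unsigned long
node_ptr_find_child(const node_ptr_t* ptrs, size_t n, const char* key)
{
    size_t  lo = 0;     /* ptrs[lo] is known to be <= key */
    size_t  hi = n;     /* ptrs[hi] is known to be > key, or hi == n */

    assert(n > 0);

    while (hi - lo > 1) {
        size_t  mid = lo + (hi - lo) / 2;

        if (node_ptr_cmp(key, &ptrs[mid]) >= 0) {
            lo = mid;
        } else {
            hi = mid;
        }
    }

    return ptrs[lo].child_page_no;
}
```

With three node pointers (minimum record, "g", "p"), a key such as "horse" falls in the range ["g", "p") and resolves to the second child, whether or not that child still contains a record equal to "g".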
/* DICT_MAX_INDEX_COL_LEN is measured in bytes and is the maximum
indexed column length (or indexed prefix length). It is set to 3*256,
so that one can create a column prefix index on 256 characters of a
TEXT or VARCHAR column also in the UTF-8 charset. In that charset,
a character may take at most 3 bytes.
This constant MUST NOT BE CHANGED, or the compatibility of InnoDB data
files would be at risk! */
#define DICT_MAX_INDEX_COL_LEN 768
/* Data structure for a field in an index */
struct dict_field_struct{
    dict_col_t*     col;            /* pointer to the table column */
    const char*     name;           /* name of the column */
    unsigned        prefix_len:10;  /* 0 or the length of the column
                                    prefix in bytes in a MySQL index of
                                    type, e.g., INDEX (textcol(25));
                                    must be smaller than
                                    DICT_MAX_INDEX_COL_LEN; NOTE that
                                    in the UTF-8 charset, MySQL sets this
                                    to 3 * the prefix len in UTF-8 chars */
    unsigned        fixed_len:10;   /* 0 or the fixed length of the
                                    column if smaller than
                                    DICT_MAX_INDEX_COL_LEN */
};
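As a worked example of the prefix_len arithmetic described in the comments above: INDEX (textcol(25)) on a UTF-8 column gives 3 * 25 = 75 bytes, and a 10-bit field (maximum 1023) is wide enough for any value below DICT_MAX_INDEX_COL_LEN. The helpers prefix_len_in_bytes and prefix_len_is_valid below are hypothetical and only restate that arithmetic; they are not InnoDB functions.

```c
#include <assert.h>

#ifndef DICT_MAX_INDEX_COL_LEN
#define DICT_MAX_INDEX_COL_LEN 768  /* 3 * 256, as in the comment above */
#endif

/* Largest value a 10-bit bit-field such as prefix_len can hold */
#define PREFIX_LEN_FIELD_MAX ((1u << 10) - 1)

/* Hypothetical helper: byte length of a column prefix of n_chars
characters in a charset where one character takes at most mbmaxlen
bytes (3 for the UTF-8 charset discussed above). */
static unsigned
prefix_len_in_bytes(unsigned n_chars, unsigned mbmaxlen)
{
    return n_chars * mbmaxlen;
}

/* Hypothetical check mirroring the struct comment: a prefix length
must be smaller than DICT_MAX_INDEX_COL_LEN, and therefore always
fits in the 10-bit field. */
static int
prefix_len_is_valid(unsigned len_in_bytes)
{
    assert(PREFIX_LEN_FIELD_MAX >= DICT_MAX_INDEX_COL_LEN);

    return len_in_bytes < DICT_MAX_INDEX_COL_LEN;
}
```

For instance, prefix_len_in_bytes(25, 3) yields 75, which passes the check, while a length of 768 bytes does not.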
/* Data structure for an index */
struct dict_index_struct{
    dulint          id;         /* id of the index */
    mem_heap_t*     heap;       /* memory heap */
    ulint           type;       /* index type */
    const char*     name;       /* index name */
    const char*     table_name; /* table name */
    dict_table_t*   table;      /* back pointer to table */
    unsigned        space:32;   /* space where the index tree is placed */
    unsigned        page:32;    /* index tree root page number */
    unsigned        trx_id_offset:10; /* position of the trx id column
                                in a clustered index record, if the fields
                                before it are known to be of a fixed size,
                                0 otherwise */
    unsigned        n_user_defined_cols:10;
                                /* number of columns the user defined to
                                be in the index: in the internal
                                representation we add more columns */
    unsigned        n_uniq:10;  /* number of fields from the beginning
                                which are enough to determine an index
                                entry uniquely */
    unsigned        n_def:10;   /* number of fields defined so far */
    unsigned        n_fields:10; /* number of fields in the index */
    unsigned        n_nullable:10; /* number of nullable fields */
    unsigned        cached:1;   /* TRUE if the index object is in the
                                dictionary cache */
    dict_field_t*   fields;     /* array of field descriptions */
    UT_LIST_NODE_T(dict_index_t)
                    indexes;    /* list of indexes of the table */
    btr_search_t*   search_info; /* info used in optimistic searches */
    /*----------------------*/
    ib_longlong*    stat_n_diff_key_vals;
                                /* approximate number of different key values
                                for this index, for each n-column prefix
                                where n <= dict_get_n_unique(index); we
                                periodically calculate new estimates */
    ulint           stat_index_size;
                                /* approximate index size in database pages */
    ulint           stat_n_leaf_pages;
                                /* approximate number of leaf pages in the
                                index tree */
    rw_lock_t       lock;       /* read-write lock protecting the upper levels
                                of the index tree */
#ifdef UNIV_DEBUG
    ulint           magic_n;    /* magic number */
# define DICT_INDEX_MAGIC_N 76789786
#endif
};
/*****************************************************************
Makes tree one level higher by splitting the root, and inserts
the tuple. It is assumed that mtr contains an x-latch on the tree.
NOTE that the operation of this function must always succeed,
we cannot reverse it: therefore enough free disk space must be
guaranteed to be available before this function is called. */
/* Allocate a new page to the tree. Root splitting is done by first
moving the root records to the new page, emptying the root, putting
a node pointer to the new page, and then splitting the new page. */
/*****************************************************************
Splits an index page into halves and inserts the tuple. It is assumed
that mtr holds an x-latch on the index tree. */
/* 1. Decide the split record; split_rec == NULL means that the
tuple to be inserted should be the first record on the upper
half-page */
/* 2. Allocate a new page to the index */
/* 3. Calculate the first record on the upper half-page, and the
first record (move_limit) on original page which ends up on the
upper half */
/* 4. First do the modifications in the tree structure */
/* 5. Then move the records to the new page */
/* 6. The split and the tree modification are now complete. Decide the
page where the tuple should be inserted */
/* 7. Reposition the cursor for insert and try insertion */
/* 8. If insert did not fit, try page reorganization */
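The numbered steps above can be sketched on a toy in-memory page. This is a simplified illustration, not InnoDB's split code: page_t, PAGE_CAP, page_insert and page_split_and_insert are hypothetical, records are plain ints, the new page is supplied by the caller (standing in for step 2), and neither the split_rec == NULL case nor the page reorganization of step 8 is modeled. The returned key is what a caller would place in the parent as the node pointer for the new page (step 4).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_CAP 4  /* hypothetical, tiny page capacity in records */

/* Toy page: a sorted array of integer records */
typedef struct {
    int     recs[PAGE_CAP];
    size_t  n;
} page_t;

/* Inserts key into a sorted page. Returns 1 on success, 0 if the
page is full. */
static int
page_insert(page_t* page, int key)
{
    size_t  i;

    if (page->n == PAGE_CAP) {
        return 0;
    }

    /* Shift larger records one slot to the right */
    for (i = page->n; i > 0 && page->recs[i - 1] > key; i--) {
        page->recs[i] = page->recs[i - 1];
    }

    page->recs[i] = key;
    page->n++;

    return 1;
}

/* Splits page into halves, moves the upper half to new_page and
inserts key into the half where it belongs. Returns the first record
of the upper half. */
static int
page_split_and_insert(page_t* page, page_t* new_page, int key)
{
    size_t  split;
    int     ok;

    assert(page->n >= 2);

    /* Steps 1 and 3: the middle record starts the upper half */
    split = page->n / 2;

    /* Step 5: move the upper half to the new page */
    new_page->n = page->n - split;
    memcpy(new_page->recs, page->recs + split,
           new_page->n * sizeof(page->recs[0]));
    page->n = split;

    /* Steps 6 and 7: choose the target page and insert */
    if (key >= new_page->recs[0]) {
        ok = page_insert(new_page, key);
    } else {
        ok = page_insert(page, key);
    }

    assert(ok);
    (void) ok;  /* silence unused warning when NDEBUG is defined */

    return new_page->recs[0];
}
```

Splitting a full page {10, 20, 30, 40} while inserting 25 leaves {10, 20, 25} on the original page and {30, 40} on the new one, with 30 as the separator for the parent.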
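InnoDB's B-tree is just an ordinary B-tree. It does not adopt any of the efficiency techniques proposed in the research literature, so its concurrency performance is probably not high.
When an insert causes a split, the whole tree is locked with an X-latch, so no other operation can proceed until the split completes; correctness is guaranteed at the cost of concurrency. Moreover, once a split's change to the tree structure succeeds, it cannot be rolled back.
One unusual feature is that InnoDB can build a hash index over some of the leaf pages. This is an index on the B-tree itself, in effect an index on top of an index. The hash index points to B-tree leaf pages, and the leaf pages are kept in order, which makes sequential range lookups convenient. All B-trees share a single hash table.
It stores leaf pages and non-leaf pages in two different segments, which improves the efficiency of read-ahead.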