/* PHYSICAL RECORD (OLD STYLE)
   ===========================

The physical record, which is the data type of all the records
found in index pages of the database, has the following format
(lower addresses and more significant bits inside a byte are below
represented on a higher text line):

| offset of the end of the last field of data, the most significant
  bit is set to 1 if and only if the field is SQL-null,
  if the offset is 2-byte, then the second most significant
  bit is set to 1 if the field is stored on another page:
  mostly this will occur in the case of big BLOB fields |
...
| offset of the end of the first field of data + the SQL-null bit |
| 4 bits used to delete mark a record, and mark a predefined
  minimum record in alphabetical order |
| 4 bits giving the number of records owned by this record
  (this term is explained in page0page.h) |
| 13 bits giving the order number of this record in the
  heap of the index page |
| 10 bits giving the number of fields in this record |
| 1 bit which is set to 1 if the offsets above are given in
  one byte format, 0 if in two byte format |
| two bytes giving an absolute pointer to the next record in the page |
ORIGIN of the record
| first field of data |
...
| last field of data |

The origin of the record is the start address of the first field
of data. The offsets are given relative to the origin.
The offsets of the data fields are stored in an inverted
order because then the offset of the first fields are near the
origin, giving maybe a better processor cache hit rate in searches.

The offsets of the data fields are given as one-byte
(if there are less than 127 bytes of data in the record)
or two-byte unsigned integers. The most significant bit
is not part of the offset, instead it indicates the SQL-null
if the bit is set to 1.
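
The one-byte offset format described above can be illustrated with a
minimal sketch. It is not the server's own accessor code: every name
below (sketch_field_t, sketch_end_info, sketch_old_field and the
SKETCH_ constants) is hypothetical, and the sketch assumes that the
end-offset bytes sit directly below the six bytes of fixed header
fields, as the layout above lists them. Two-byte offsets and
externally stored fields are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical names; only the bit layout follows the text above.
#define SKETCH_OLD_EXTRA_BYTES 6u  // fixed header fields: 48 bits in total
#define SKETCH_NULL_MASK 0x80u     // most significant bit: SQL-null flag
#define SKETCH_OFFS_MASK 0x7Fu     // remaining 7 bits: end offset

typedef struct {
    size_t start;  // offset of the field's first byte, relative to origin
    size_t end;    // offset just past the field's last byte
    int is_null;   // nonzero if the SQL-null bit is set
} sketch_field_t;

// Reads the raw end-offset byte of field i. The offsets are stored in
// inverted order below the fixed header, so field 0 is nearest the origin.
static uint8_t sketch_end_info(const uint8_t *origin, size_t i)
{
    return *(origin - (SKETCH_OLD_EXTRA_BYTES + i + 1));
}

// Locates field i of a record that uses the one-byte offset format.
static sketch_field_t sketch_old_field(const uint8_t *origin,
                                       size_t n_fields, size_t i)
{
    sketch_field_t f;
    uint8_t info;

    assert(i < n_fields);
    info = sketch_end_info(origin, i);
    // A field starts where the previous field ends; field 0 starts at 0.
    f.start = (i == 0)
        ? 0
        : (size_t)(sketch_end_info(origin, i - 1) & SKETCH_OFFS_MASK);
    f.end = (size_t)(info & SKETCH_OFFS_MASK);
    f.is_null = (info & SKETCH_NULL_MASK) != 0;
    return f;
}
```

The sketch reports only where a field starts and ends and whether its
SQL-null bit is set; how many bytes an SQL-null field occupies is left
to the stored offsets themselves.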
*/

/* PHYSICAL RECORD (NEW STYLE)
   ===========================

The physical record, which is the data type of all the records
found in index pages of the database, has the following format
(lower addresses and more significant bits inside a byte are below
represented on a higher text line):

| length of the last non-null variable-length field of data:
  if the maximum length is up to 255, one byte; otherwise,
  0xxxxxxx (one byte, length=0..127), or 1exxxxxxxxxxxxxx (two bytes,
  length=128..16383, extern storage flag) |
...
| length of first variable-length field of data |
| SQL-null flags (1 bit per nullable field), padded to full bytes |
| 4 bits used to delete mark a record, and mark a predefined
  minimum record in alphabetical order |
| 4 bits giving the number of records owned by this record
  (this term is explained in page0page.h) |
| 13 bits giving the order number of this record in the
  heap of the index page |
| 3 bits record type: 000=conventional, 001=node pointer (inside B-tree),
  010=infimum, 011=supremum, 1xx=reserved |
| two bytes giving a relative pointer to the next record in the page |
ORIGIN of the record
| first field of data |
...
| last field of data |

The origin of the record is the start address of the first field
of data. The lengths of the variable-length fields are stored in an
inverted order, so that the length of the first field is nearest the
origin. Unlike in the old style, SQL-null is indicated by the separate
SQL-null flag bits and not by the most significant bit of a per-field
offset.
*/

/* CANONICAL COORDINATES. A record can be seen as a single
string of 'characters' in the following way: catenate the bytes in
each field, in the order of fields. An SQL-null field
is taken to be an empty sequence of bytes.
Then after the position of each field insert in the string
the 'character' <FIELD-END>, except that after an SQL-null field
insert <NULL-FIELD-END>. Now the ordinal position of each
byte in this canonical string is its canonical coordinate.
So, for the record ("AA", SQL-NULL, "BB", ""), the canonical
string is "AA<FIELD-END><NULL-FIELD-END>BB<FIELD-END><FIELD-END>".
We identify prefixes (= initial segments) of a record
with prefixes of the canonical string. The canonical
length of the prefix is the length of the corresponding
prefix of the canonical string. The canonical length of
a record is the length of its canonical string.

For example, the maximal common prefix of records
("AA", SQL-NULL, "BB", "C") and ("AA", SQL-NULL, "B", "C")
is "AA<FIELD-END><NULL-FIELD-END>B", and its canonical
length is 5.

A complete-field prefix of a record is a prefix which ends at the
end of some field (containing also <FIELD-END>).
A record is a complete-field prefix of another record, if
the corresponding canonical strings have the same property.
*/

/* THE INDEX PAGE
   ==============

The index page consists of a page header which contains the page's
id and other information. On top of it are the index records
in a heap linked into a one way linear list according to alphabetic order.

Just below page end is an array of pointers which we call page directory,
to about every sixth record in the list. The pointers are placed in
the directory in the alphabetical order of the records pointed to,
enabling us to make binary search using the array. Each slot n:o I
in the directory points to a record, where a 4-bit field contains a count
of those records which are in the linear list between pointer I and
the pointer I - 1 in the directory, including the record
pointed to by pointer I and not including the record pointed to by I - 1.
We say that the record pointed to by slot I, or that slot I, owns
these records.
The count is always kept in the range 4 to 8, with the exception
that it is 1 for the first slot, and 1 to 8 for the second slot.

An essentially binary search can be performed in the list of index
records, as we could do if we had a pointer to every record in the
page directory. The data structure is, however, more efficient when
we are doing inserts, because most inserts are just pushed on a heap.
Only every 8th insert requires a block move in the directory pointer
table, which itself is quite small. A record is deleted from the page
by just taking it off the linear list and updating the number of owned
records-field of the record which owns it, and updating the page
directory, if necessary. A special case is the one when the record
owns itself. Because the overhead of inserts is so small, we may also
increase the page size from the projected default of 8 kB to 64 kB
without too much loss of efficiency in inserts. A bigger page becomes
worthwhile when the disk transfer rate rises relative to seek and
latency time. On the present system, the page size is set so that
the page transfer time (3 ms) is 20 % of the disk random access
time (15 ms).

When the page is split, merged, or becomes full but contains deleted
records, we have to reorganize the page.

Assuming a page size of 8 kB, a typical index page of a secondary
index contains 300 index entries, and the size of the page directory
is 50 x 4 bytes = 200 bytes.
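
The two-stage lookup described above (binary search over the
directory slots, then a short walk along the linked list) can be
sketched as follows. This is a simplified in-memory model and not the
page code itself: sketch_rec_t, sketch_page_search and the use of int
keys and array indices instead of page offsets are all assumptions
made for illustration. Slot 0 is assumed to point to an infimum record
with key INT_MIN and the last slot to a supremum record with key
INT_MAX.

```c
#include <assert.h>
#include <limits.h>

// Hypothetical record model: array indices stand in for page offsets.
typedef struct {
    int key;      // sort key of the record
    int next;     // index of the next record in key order, -1 at the end
    int n_owned;  // nonzero only in a record that a slot points to
} sketch_rec_t;

// Returns the index of the first record whose key is >= target.
// slots[] holds record indices in ascending key order; each slot
// points to the last record of the group it owns.
static int sketch_page_search(const sketch_rec_t *recs, const int *slots,
                              int n_slots, int target)
{
    int lo = 0;
    int hi = n_slots - 1;
    int r;

    assert(n_slots >= 2);
    assert(target > INT_MIN);

    // Invariant: recs[slots[lo]].key < target <= recs[slots[hi]].key
    while (hi - lo > 1) {
        int mid = lo + (hi - lo) / 2;

        if (recs[slots[mid]].key < target) {
            lo = mid;
        } else {
            hi = mid;
        }
    }

    // The answer lies in the group owned by slot hi, which starts just
    // after the record that slot lo points to. Walk the list from there.
    r = recs[slots[lo]].next;
    while (recs[r].key < target) {
        r = recs[r].next;
    }
    return r;
}
```

Because a slot owns at most 8 records, the walk after the binary
search visits only a handful of records.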
*/

/* PAGE HEADER
   ===========

Index page header starts at the first offset left free by the FIL-module */

typedef byte page_header_t;

#define PAGE_HEADER      FSEG_PAGE_DATA /* index page header starts at
                                        this offset */
/*-----------------------------*/
#define PAGE_N_DIR_SLOTS 0   /* number of slots in page directory */
#define PAGE_HEAP_TOP    2   /* pointer to record heap top */
#define PAGE_N_HEAP      4   /* number of records in the heap,
                             bit 15=flag: new-style compact page format */
#define PAGE_FREE        6   /* pointer to start of page free record list */
#define PAGE_GARBAGE     8   /* number of bytes in deleted records */
#define PAGE_LAST_INSERT 10  /* pointer to the last inserted record, or
                             NULL if this info has been reset by a delete,
                             for example */
#define PAGE_DIRECTION   12  /* last insert direction: PAGE_LEFT, ... */
#define PAGE_N_DIRECTION 14  /* number of consecutive inserts to the same
                             direction */
#define PAGE_N_RECS      16  /* number of user records on the page */
#define PAGE_MAX_TRX_ID  18  /* highest id of a trx which may have modified
                             a record on the page; trx_id_t; defined only
                             in secondary indexes and in the insert buffer
                             tree; NOTE: this may be modified only
                             when the thread has an x-latch to the page,
                             and ALSO an x-latch to btr_search_latch
                             if there is a hash index to the page! */
#define PAGE_HEADER_PRIV_END 26 /* end of private data structure of the page
                             header which are set in a page create */
/*----*/
#define PAGE_LEVEL       26  /* level of the node in an index tree; the
                             leaf level is the level 0. This field should
                             not be written to after page creation. */
#define PAGE_INDEX_ID    28  /* index id where the page belongs.
                             This field should not be written to after
                             page creation.
*/
#define PAGE_BTR_SEG_LEAF 36 /* file segment header for the leaf pages in
                             a B-tree: defined only on the root page of a
                             B-tree, but not in the root of an ibuf tree */
#define PAGE_BTR_IBUF_FREE_LIST      PAGE_BTR_SEG_LEAF
#define PAGE_BTR_IBUF_FREE_LIST_NODE PAGE_BTR_SEG_LEAF
                             /* in the place of PAGE_BTR_SEG_LEAF and _TOP
                             there is a free list base node if the page is
                             the root page of an ibuf tree, and at the same
                             place is the free list node if the page is in
                             a free list */
#define PAGE_BTR_SEG_TOP (36 + FSEG_HEADER_SIZE)
                             /* file segment header for the non-leaf pages
                             in a B-tree: defined only on the root page of
                             a B-tree, but not in the root of an ibuf
                             tree */
/*----*/
#define PAGE_DATA (PAGE_HEADER + 36 + 2 * FSEG_HEADER_SIZE)
                             /* start of data on the page */

#define PAGE_OLD_INFIMUM (PAGE_DATA + 1 + REC_N_OLD_EXTRA_BYTES)
                             /* offset of the page infimum record on an
                             old-style page */
#define PAGE_OLD_SUPREMUM (PAGE_DATA + 2 + 2 * REC_N_OLD_EXTRA_BYTES + 8)
                             /* offset of the page supremum record on an
                             old-style page */
#define PAGE_OLD_SUPREMUM_END (PAGE_OLD_SUPREMUM + 9)
                             /* offset of the page supremum record end on
                             an old-style page */
#define PAGE_NEW_INFIMUM (PAGE_DATA + REC_N_NEW_EXTRA_BYTES)
                             /* offset of the page infimum record on a
                             new-style compact page */
#define PAGE_NEW_SUPREMUM (PAGE_DATA + 2 * REC_N_NEW_EXTRA_BYTES + 8)
                             /* offset of the page supremum record on a
                             new-style compact page */
#define PAGE_NEW_SUPREMUM_END (PAGE_NEW_SUPREMUM + 8)
                             /* offset of the page supremum record end on
                             a new-style compact page */

/* Offsets of the bit-fields in an old-style record. NOTE! In the table the
most significant bytes and bits are written below less significant.

    (1) byte offset         (2) bit usage within byte
    downward from
    origin ->       1       8 bits pointer to next record
                    2       8 bits pointer to next record
                    3       1 bit short flag
                            7 bits number of fields
                    4       3 bits number of fields
                            5 bits heap number
                    5       8 bits heap number
                    6       4 bits n_owned
                            4 bits info bits
*/

/* Offsets of the bit-fields in a new-style record. NOTE! In the table the
most significant bytes and bits are written below less significant.

    (1) byte offset         (2) bit usage within byte
    downward from
    origin ->       1       8 bits relative offset of next record
                    2       8 bits relative offset of next record
                              the relative offset is an unsigned 16-bit
                              integer:
                              (offset_of_next_record
                               - offset_of_this_record) mod 64Ki,
                              where mod is the modulo as a non-negative
                              number;
                              we can calculate the offset of the next
                              record with the formula:
                              (relative_offset + offset_of_this_record)
                              mod UNIV_PAGE_SIZE
                    3       3 bits status:
                              000=conventional record
                              001=node pointer record (inside B-tree)
                              010=infimum record
                              011=supremum record
                              1xx=reserved
                            5 bits heap number
                    4       8 bits heap number
                    5       4 bits n_owned
                            4 bits info bits
*/

/* The maximum and minimum number of records owned by a directory slot. The
number may drop below the minimum in the first and the last slot in the
directory. */
#define PAGE_DIR_SLOT_MAX_N_OWNED 8
#define PAGE_DIR_SLOT_MIN_N_OWNED 4

/* index page creation

1. INCREMENT MODIFY CLOCK
   Modifies the in-memory structure allocated by the buffer pool.
3. CREATE THE INFIMUM AND SUPREMUM RECORDS
4. INITIALIZE THE PAGE
   Initializes the page header, which is not a struct but a set of
   integers at fixed offsets.
5. SET POINTERS IN RECORDS AND DIR SLOTS
   The INFIMUM and SUPREMUM records created above are the records that
   the two initial slots point to.

Layout of a MySQL index page (there should be no ordinary pages; all
pages are index pages):

page head: the variables stored at PAGE_N_DIR_SLOTS etc. defined above
record0, record1, ... recordn: each one corresponds to one tuple
... free space ...
slotn, ...
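slot1, slot0 | the last few bytes store the checksum

A search first does a binary search across the slots and then a linear
search within the group of one slot. Each record stores the offset of
the next record, which is used to walk from one record to the next.
Unlike in PostgreSQL, slots and records are not one-to-one but
one-to-many: the group size is bounded by PAGE_DIR_SLOT_MAX_N_OWNED,
and a slot points to the last record of its group. That last record
stores the number of records the slot owns; in the records before it
this field is all zero. A slot is only two bytes and holds the page
offset of that last record, along with other information.
*/

/* If the maximum length of a variable-length field is up to 255 bytes,
the actual length is always stored in one byte. If the maximum length is
more than 255 bytes, the actual length is stored in one byte for
0..127. The length will be encoded in two bytes when it is 128 or
more, or when the field is stored externally. */

/* Number of extra bytes in a new-style record,
in addition to the data and the offsets */
#define REC_N_NEW_EXTRA_BYTES 5

/* The length of a tuple stored in the file has three parts:
REC_N_NEW_EXTRA_BYTES + the NULL-flag bits + the data length.
A variable-length field needs 1 or 2 further bytes to record its length.
The part computed by the following excerpt is added on top of that: */

    switch (UNIV_EXPECT(status, REC_STATUS_ORDINARY)) {
    case REC_STATUS_ORDINARY:
        ut_ad(n_fields == dict_index_get_n_fields(index));
        size = 0;
        break;
    case REC_STATUS_NODE_PTR:
        n_fields--;
        ut_ad(n_fields == dict_index_get_n_unique_in_tree(index));
        ut_ad(dfield_get_len(&fields[n_fields]) == REC_NODE_PTR_SIZE);
        size = REC_NODE_PTR_SIZE; /* child page number */
        break;
    case REC_STATUS_INFIMUM:
    case REC_STATUS_SUPREMUM:
        /* infimum or supremum record, 8 data bytes */
        if (UNIV_LIKELY_NULL(extra)) {
            *extra = REC_N_NEW_EXTRA_BYTES;
        }
        return(REC_N_NEW_EXTRA_BYTES + 8);
    default:
        ut_error;
        return(ULINT_UNDEFINED);
    }

/* A tuple stored on disk consists mainly of four parts:

| the lengths of the columns, one or two bytes per column, in the
  inverted order described for the new-style record; needed only for
  columns that are not fixed-length |
| the NULL bit string, 1 meaning NULL; present only for nullable
  columns |
| five bytes of status |
| the data of the columns, in column order |

On update, if the new value has the same length as the old one, the old
value is overwritten in place; if the length differs, a delete followed
by an insert is performed. Either kind of insert consumes page space
incrementally, and the page is reorganized only when space runs out.
*/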
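
/* The length encoding for variable-length fields described above (one
byte when the maximum length is up to 255 bytes or the length is 0..127,
two bytes with the extern storage flag otherwise) can be sketched as
below. The names sketch_encode_len and sketch_decode_len are
hypothetical. out[0] is the byte that carries the flag bits; where the
two bytes sit relative to the record origin is not modelled here, and
treating an externally stored field with a maximum length of up to 255
bytes as invalid is an assumption of this sketch. */

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Encodes the stored length of one variable-length field.
// Returns the number of bytes used (1 or 2).
static size_t sketch_encode_len(uint8_t out[2], size_t len,
                                size_t max_len, int is_extern)
{
    if (max_len <= 255) {
        // Always one byte; assumption: such fields are never external.
        assert(!is_extern && len <= 255);
        out[0] = (uint8_t)len;
        return 1;
    }
    if (len < 128 && !is_extern) {
        out[0] = (uint8_t)len;  // 0xxxxxxx
        return 1;
    }
    assert(len <= 0x3FFFu);     // 14 bits: at most 16383
    // 1exxxxxx xxxxxxxx: top bit set, then the extern storage flag
    out[0] = (uint8_t)(0x80u | (is_extern ? 0x40u : 0u) | (len >> 8));
    out[1] = (uint8_t)(len & 0xFFu);
    return 2;
}

// Decodes what sketch_encode_len produced.
// Returns the number of bytes consumed (1 or 2).
static size_t sketch_decode_len(const uint8_t in[2], size_t max_len,
                                size_t *len, int *is_extern)
{
    if (max_len <= 255 || (in[0] & 0x80u) == 0) {
        *len = in[0];
        *is_extern = 0;
        return 1;
    }
    *len = ((size_t)(in[0] & 0x3Fu) << 8) | in[1];
    *is_extern = (in[0] & 0x40u) != 0;
    return 2;
}
```

/* Note that the decoder needs the maximum length of the column to tell a
one-byte length of 128..255 apart from the first byte of a two-byte
length, so the bytes alone are not self-describing. */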