1. Preface
The ANSI SQL standard defines four isolation levels (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE), along with concrete rules that specify which changes made inside and outside a transaction are visible to it and which are not. Lower isolation levels generally allow higher concurrency and incur lower system overhead.
- Read Uncommitted: at this level every transaction can see the results of other, not-yet-committed transactions. It is rarely used in practice, because its performance is not much better than the other levels. Reading uncommitted data is known as a dirty read (Dirty Read).
- Read Committed: a transaction can only see changes made by transactions that have already committed. This level allows so-called nonrepeatable reads (Nonrepeatable Read): while a transaction is running, other transactions may commit new changes, so the same SELECT can return different results.
- Repeatable Read: this is MySQL's default transaction isolation level. It guarantees that multiple reads within the same transaction see the same rows even under concurrency. In theory this still leaves one tricky problem: the phantom read (Phantom Read). Briefly, a phantom read occurs when a transaction reads a range of rows, another transaction inserts new rows into that range, and a re-read of the same range turns up new "phantom" rows. InnoDB addresses this with multiversion concurrency control (MVCC, Multiversion Concurrency Control).
- Serializable: the highest isolation level. It forces transactions into a serial order so they cannot conflict with one another, thereby eliminating phantom reads. In short, it places a shared lock on every row read. At this level, heavy lock contention and frequent timeouts are likely. The four isolation levels are implemented with different lock types. When concurrent transactions read the same table, the following problems can arise:
Dirty read (Dirty Read): transaction T1 modifies a row, and transaction T2 reads that row before T1 commits.
Non-repeatable read (Non-repeatable Read): transaction T1 reads a row; transaction T2 then modifies or deletes that row. When T1 reads the same row again, it sees the modified data or finds the row has been deleted.
Phantom read (Phantom Read): transaction T1 reads the set of rows satisfying some condition; transaction T2 inserts one or more rows that also satisfy T1's condition. When T1 repeats the read with the same condition, it gets back more rows than the first time.
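The ANSI mapping between isolation levels and the anomalies they permit can be summarized as a small lookup table (a sketch; the dictionary and function names here are invented for illustration):

```python
# Which read anomalies the ANSI SQL standard permits at each isolation level.
# True means the anomaly MAY occur at that level.
ANOMALIES = {
    # level:            (dirty_read, non_repeatable_read, phantom_read)
    "READ UNCOMMITTED": (True,  True,  True),
    "READ COMMITTED":   (False, True,  True),
    "REPEATABLE READ":  (False, False, True),   # phantoms still allowed by the standard
    "SERIALIZABLE":     (False, False, False),
}

def may_occur(level, anomaly):
    """Return True if the standard allows `anomaly` at isolation `level`."""
    idx = {"dirty_read": 0, "non_repeatable_read": 1, "phantom_read": 2}[anomaly]
    return ANOMALIES[level][idx]
```

Note in particular that the standard still permits phantom reads at REPEATABLE READ, which is the point this article keeps returning to.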
MySQL/InnoDB supports all four isolation levels defined by the ANSI SQL standard (READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, SERIALIZABLE). This article focuses on how MySQL's implementation of the REPEATABLE READ isolation level differs from that of other databases.
Let us look at how MySQL behaves under the REPEATABLE READ isolation level:
When MySQL executes the UPDATE statement, it can see the row that session2's INSERT committed in the meantime; in other words, a phantom read has occurred. This is permitted: the ANSI SQL standard allows phantom reads at the REPEATABLE READ isolation level.
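The reason the UPDATE sees the phantom while plain SELECTs do not is that under REPEATABLE READ a plain SELECT is a consistent (snapshot) read, whereas an UPDATE is a "current read" of the latest committed rows. A toy model of the two read modes (all class and method names here are invented for illustration, not InnoDB APIs):

```python
# Toy model of InnoDB REPEATABLE READ: a plain SELECT reads from the snapshot
# taken at the transaction's first read, while UPDATE does a "current read"
# of the latest committed rows -- which is how the phantom slips in.
class Table:
    def __init__(self):
        self.committed = []              # rows visible to new snapshots

class Txn:
    def __init__(self, table):
        self.table = table
        self.snapshot = None             # assigned lazily, at the first SELECT

    def select(self):
        if self.snapshot is None:        # first read: capture the snapshot
            self.snapshot = list(self.table.committed)
        return self.snapshot             # consistent (snapshot) read

    def update_all(self):
        return list(self.table.committed)  # current read: sees latest commits

table = Table()
table.committed.append("r1")

t1 = Txn(table)
assert t1.select() == ["r1"]             # snapshot taken here

table.committed.append("r2")             # session2 INSERTs and commits

assert t1.select() == ["r1"]             # repeatable: snapshot unchanged
assert t1.update_all() == ["r1", "r2"]   # UPDATE sees the phantom row
```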
2. Source Code Analysis
Next we analyze, from the source code, how InnoDB executes under the REPEATABLE READ isolation level (only the important parts of the code are covered), using the example above as our guide. For the first SELECT statement, InnoDB calls row_search_for_mysql to return the scanned rows. The relevant code, taken from MySQL 5.7.18, is:
row_search_for_mysql(
byte* buf,
page_cur_mode_t mode,
row_prebuilt_t* prebuilt,
ulint match_mode,
ulint direction)
{
if (!dict_table_is_intrinsic(prebuilt->table)) {
return(row_search_mvcc(
buf, mode, prebuilt, match_mode, direction));
} else {
return(row_search_no_mvcc(
buf, mode, prebuilt, match_mode, direction));
}
}
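The dispatch above can be sketched as follows (a Python stand-in that just returns the name of the path taken):

```python
# Sketch of the dispatch in row_search_for_mysql: intrinsic (session-private
# temporary) tables skip MVCC entirely; ordinary shared tables go through
# the full MVCC/locking path.
def row_search_for_mysql(table_is_intrinsic):
    if not table_is_intrinsic:
        return "row_search_mvcc"      # shared table: full MVCC/locking path
    return "row_search_no_mvcc"       # intrinsic temp table: simplified path
```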
Let us look at row_search_no_mvcc first.
/** Searches for rows in the database using cursor.
Function is for temporary tables that are not shared accross connections
and so lot of complexity is reduced especially locking and transaction related.
The cursor is an iterator over the table/index.
@param[out] buf buffer for the fetched row in MySQL format
@param[in] mode search mode PAGE_CUR_L
@param[in,out] prebuilt prebuilt struct for the table handler;
this contains the info to search_tuple,
index; if search tuple contains 0 field then
we position the cursor at start or the end of
index, depending on 'mode'
@param[in] match_mode 0 or ROW_SEL_EXACT or ROW_SEL_EXACT_PREFIX
@param[in] direction 0 or ROW_SEL_NEXT or ROW_SEL_PREV;
Note: if this is != 0, then prebuilt must has a
pcur with stored position! In opening of a
cursor 'direction' should be 0.
@return DB_SUCCESS or error code */
dberr_t
row_search_no_mvcc(
byte* buf, /* buffer in which the fetched record is returned */
page_cur_mode_t mode, /* InnoDB page scan mode */
row_prebuilt_t* prebuilt, /* everything the scan needs (table, index, etc.) is carried in this struct */
ulint match_mode, /* index match mode: exact match or prefix match */
ulint direction) /* scan direction: forward or backward */
{
dict_index_t* index = prebuilt->index;
const dtuple_t* search_tuple = prebuilt->search_tuple;
btr_pcur_t* pcur = prebuilt->pcur;
const rec_t* result_rec = NULL;
const rec_t* clust_rec = NULL;
dberr_t err = DB_SUCCESS;
mem_heap_t* heap = NULL;
ulint offsets_[REC_OFFS_NORMAL_SIZE];
ulint* offsets = offsets_;
rec_offs_init(offsets_);
ut_ad(index && pcur && search_tuple);
/* Step-0: Re-use the cached mtr. */
mtr_t* mtr = &index->last_sel_cur->mtr;
dict_index_t* clust_index = dict_table_get_first_index(index->table);
/* Step-1: Build the select graph. */
if (direction == 0 && prebuilt->sel_graph == NULL) {
row_prebuild_sel_graph(prebuilt);
}
que_thr_t* thr = que_fork_get_first_thr(prebuilt->sel_graph);
bool moves_up;
if (direction == 0) {
if (mode == PAGE_CUR_GE || mode == PAGE_CUR_G) {
moves_up = true;
} else {
moves_up = false;
}
} else if (direction == ROW_SEL_NEXT) {
moves_up = true;
} else {
moves_up = false;
}
/* Step-2: Open or Restore the cursor.
If search key is specified, cursor is open using the key else
cursor is open to return all the records. */
if (direction != 0) {
if (index->last_sel_cur->invalid) {
/* Index tree has changed and so active cached cursor
is no more valid. Re-set it based on the last selected
position. */
index->last_sel_cur->release();
mtr_start(mtr);
dict_disable_redo_if_temporary(index->table, mtr);
mem_heap_t* heap = mem_heap_create(256);
dtuple_t* tuple;
tuple = dict_index_build_data_tuple(
index, pcur->old_rec,
pcur->old_n_fields, heap);
btr_pcur_open_with_no_init(
index, tuple, pcur->search_mode,
BTR_SEARCH_LEAF, pcur, 0, mtr);
mem_heap_free(heap);
} else {
/* Restore the cursor for reading next record from cache
information. */
ut_ad(index->last_sel_cur->rec != NULL);
pcur->btr_cur.page_cur.rec = index->last_sel_cur->rec;
pcur->btr_cur.page_cur.block =
index->last_sel_cur->block;
err = row_search_traverse(
moves_up, match_mode, pcur, mtr);
if (err != DB_SUCCESS) {
return(err);
}
}
} else {
/* There could be previous uncommitted transaction if SELECT
is operation as part of SELECT (IF NOT FOUND) INSERT
(IF DUPLICATE) UPDATE plan. */
index->last_sel_cur->release();
/* Capture table snapshot in form of trx-id. */
index->trx_id = dict_table_get_curr_table_sess_trx_id(
index->table);
/* Fresh search commences. */
mtr_start(mtr);
dict_disable_redo_if_temporary(index->table, mtr);
if (dtuple_get_n_fields(search_tuple) > 0) {
btr_pcur_open_with_no_init(
index, search_tuple, mode, BTR_SEARCH_LEAF,
pcur, 0, mtr);
} else if (mode == PAGE_CUR_G || mode == PAGE_CUR_L) {
btr_pcur_open_at_index_side(
mode == PAGE_CUR_G, index, BTR_SEARCH_LEAF,
pcur, false, 0, mtr);
}
}
/* Step-3: Traverse the records filtering non-qualifiying records. */
for (/* No op */;
err == DB_SUCCESS;
err = row_search_traverse(moves_up, match_mode, pcur, mtr)) {
const rec_t* rec = btr_pcur_get_rec(pcur);
if (page_rec_is_infimum(rec)
|| page_rec_is_supremum(rec)
|| rec_get_deleted_flag(
rec, dict_table_is_comp(index->table))) {
/* The infimum record on a page cannot be in the
result set, and neither can a record lock be placed on
it: we skip such a record. */
continue;
}
offsets = rec_get_offsets(
rec, index, offsets, ULINT_UNDEFINED, &heap);
/* Note that we cannot trust the up_match value in the cursor
at this place because we can arrive here after moving the
cursor! Thus we have to recompare rec and search_tuple to
determine if they match enough. */
if (match_mode == ROW_SEL_EXACT) {
/* Test if the index record matches completely to
search_tuple in prebuilt: if not, then we return with
DB_RECORD_NOT_FOUND */
if (0 != cmp_dtuple_rec(search_tuple, rec, offsets)) {
err = DB_RECORD_NOT_FOUND;
break;
}
} else if (match_mode == ROW_SEL_EXACT_PREFIX) {
if (!cmp_dtuple_is_prefix_of_rec(
search_tuple, rec, offsets)) {
err = DB_RECORD_NOT_FOUND;
break;
}
}
/* Get the clustered index. We always need clustered index
record for snapshort verification. */
if (index != clust_index) {
err = row_sel_get_clust_rec_for_mysql(
prebuilt, index, rec, thr, &clust_rec,
&offsets, &heap, NULL, mtr);
if (err != DB_SUCCESS) {
break;
}
if (rec_get_deleted_flag(
clust_rec, dict_table_is_comp(index->table))) {
/* The record is delete marked in clustered
index. We can skip this record. */
continue;
}
result_rec = clust_rec;
} else {
result_rec = rec;
}
/* Step-4: Check if row is part of the consistent view that was
captured while SELECT statement started execution. */
{
trx_id_t trx_id;
ulint len;
ulint trx_id_off = rec_get_nth_field_offs(
offsets, clust_index->n_uniq, &len);
ut_ad(len == DATA_TRX_ID_LEN);
trx_id = trx_read_trx_id(result_rec + trx_id_off);
if (trx_id > index->trx_id) {
/* This row was recently added skip it from
SELECT view. */
continue;
}
}
/* Step-5: Cache the row-id of selected row to prebuilt cache.*/
if (prebuilt->clust_index_was_generated) {
row_sel_store_row_id_to_prebuilt(
prebuilt, result_rec, clust_index, offsets);
}
/* Step-6: Convert selected record to MySQL format and
store it. */
if (prebuilt->template_type == ROW_MYSQL_DUMMY_TEMPLATE) {
const rec_t* ret_rec =
(index != clust_index
&& prebuilt->need_to_access_clustered)
? result_rec : rec;
offsets = rec_get_offsets(ret_rec, index, offsets,
ULINT_UNDEFINED, &heap);
memcpy(buf + 4, ret_rec - rec_offs_extra_size(offsets),
rec_offs_size(offsets));
mach_write_to_4(buf, rec_offs_extra_size(offsets) + 4);
} else if (!row_sel_store_mysql_rec(
buf, prebuilt, result_rec, NULL, TRUE,
clust_index, offsets, false)) {
err = DB_ERROR;
break;
}
/* Step-7: Store cursor position to fetch next record.
MySQL calls this function iteratively get_next(), get_next()
fashion. */
ut_ad(err == DB_SUCCESS);
index->last_sel_cur->rec = btr_pcur_get_rec(pcur);
index->last_sel_cur->block = btr_pcur_get_block(pcur);
/* This is needed in order to restore the cursor if index
structure changes while SELECT is still active. */
pcur->old_rec = dict_index_copy_rec_order_prefix(
index, rec, &pcur->old_n_fields,
&pcur->old_rec_buf, &pcur->buf_size);
break;
}
if (err != DB_SUCCESS) {
index->last_sel_cur->release();
}
if (heap != NULL) {
mem_heap_free(heap);
}
return(err);
}
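Step-4 above is the heart of how row_search_no_mvcc filters rows for intrinsic temporary tables: instead of a full read view, it compares the record's DB_TRX_ID against the single trx-id captured when the SELECT started. A simplified Python model of that check (the 6-byte trx-id width matches DATA_TRX_ID_LEN; the function names are illustrative, not InnoDB APIs):

```python
DATA_TRX_ID_LEN = 6  # InnoDB stores DB_TRX_ID as a 6-byte big-endian integer

def trx_read_trx_id(rec_bytes, trx_id_off):
    """Read the 6-byte transaction id starting at trx_id_off, as InnoDB does."""
    return int.from_bytes(rec_bytes[trx_id_off:trx_id_off + DATA_TRX_ID_LEN], "big")

def row_visible(rec_trx_id, snapshot_trx_id):
    # Step-4 of row_search_no_mvcc: a row written by a transaction newer than
    # the trx-id captured at the start of the SELECT is skipped from the view.
    return rec_trx_id <= snapshot_trx_id

rec = bytes([0, 0, 0, 0, 0, 42])          # DB_TRX_ID == 42, stored at offset 0
assert trx_read_trx_id(rec, 0) == 42
assert row_visible(42, snapshot_trx_id=100)      # old enough: visible
assert not row_visible(142, snapshot_trx_id=100) # too new: skipped
```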
/** Extract virtual column data from a virtual index record and fill a dtuple
@param[in] rec the virtual (secondary) index record
@param[in] index the virtual index
@param[in,out] vrow the dtuple where data extract to
@param[in] heap memory heap to allocate memory
*/
static
void
row_sel_fill_vrow(
const rec_t* rec,
dict_index_t* index,
const dtuple_t** vrow,
mem_heap_t* heap)
{
ulint offsets_[REC_OFFS_NORMAL_SIZE];
ulint* offsets = offsets_;
rec_offs_init(offsets_);
ut_ad(!(*vrow));
offsets = rec_get_offsets(rec, index, offsets,
ULINT_UNDEFINED, &heap);
*vrow = dtuple_create_with_vcol(
heap, 0, dict_table_get_n_v_cols(index->table));
/* Initialize all virtual row's mtype to DATA_MISSING */
dtuple_init_v_fld(*vrow);
for (ulint i = 0; i < dict_index_get_n_fields(index); i++) {
const dict_field_t* field;
const dict_col_t* col;
field = dict_index_get_nth_field(index, i);
col = dict_field_get_col(field);
if (dict_col_is_virtual(col)) {
const byte* data;
ulint len;
data = rec_get_nth_field(rec, offsets, i, &len);
const dict_v_col_t* vcol = reinterpret_cast<
const dict_v_col_t*>(col);
dfield_t* dfield = dtuple_get_nth_v_field(
*vrow, vcol->v_pos);
dfield_set_data(dfield, data, len);
dict_col_copy_type(col, dfield_get_type(dfield));
}
}
}
/** Searches for rows in the database using cursor.
Function is mainly used for tables that are shared accorss connection and
so it employs technique that can help re-construct the rows that
transaction is suppose to see.
It also has optimization such as pre-caching the rows, using AHI, etc.
@param[out] buf buffer for the fetched row in MySQL format
@param[in] mode search mode PAGE_CUR_L
@param[in,out] prebuilt prebuilt struct for the table handler;
this contains the info to search_tuple,
index; if search tuple contains 0 field then
we position the cursor at start or the end of
index, depending on 'mode'
@param[in] match_mode 0 or ROW_SEL_EXACT or ROW_SEL_EXACT_PREFIX
@param[in] direction 0 or ROW_SEL_NEXT or ROW_SEL_PREV;
Note: if this is != 0, then prebuilt must has a
pcur with stored position! In opening of a
cursor 'direction' should be 0.
@return DB_SUCCESS or error code */
dberr_t
row_search_mvcc(
byte* buf,
page_cur_mode_t mode,
row_prebuilt_t* prebuilt,
ulint match_mode,
ulint direction)
{
DBUG_ENTER("row_search_mvcc");
dict_index_t* index = prebuilt->index;
ibool comp = dict_table_is_comp(index->table);
const dtuple_t* search_tuple = prebuilt->search_tuple;
btr_pcur_t* pcur = prebuilt->pcur;
trx_t* trx = prebuilt->trx;
dict_index_t* clust_index;
que_thr_t* thr;
const rec_t* prev_rec = NULL;
const rec_t* rec = NULL;
byte* end_range_cache = NULL;
const dtuple_t* prev_vrow = NULL;
const dtuple_t* vrow = NULL;
const rec_t* result_rec = NULL;
const rec_t* clust_rec;
dberr_t err = DB_SUCCESS;
ibool unique_search = FALSE;
ibool mtr_has_extra_clust_latch = FALSE;
ibool moves_up = FALSE;
ibool set_also_gap_locks = TRUE;
/* if the query is a plain locking SELECT, and the isolation level
is <= TRX_ISO_READ_COMMITTED, then this is set to FALSE */
ibool did_semi_consistent_read = FALSE;
/* if the returned record was locked and we did a semi-consistent
read (fetch the newest committed version), then this is set to
TRUE */
ulint next_offs;
ibool same_user_rec;
mtr_t mtr;
mem_heap_t* heap = NULL;
ulint offsets_[REC_OFFS_NORMAL_SIZE];
ulint* offsets = offsets_;
ibool table_lock_waited = FALSE;
byte* next_buf = 0;
bool spatial_search = false;
ulint end_loop = 0;
rec_offs_init(offsets_);
ut_ad(index && pcur && search_tuple);
ut_a(prebuilt->magic_n == ROW_PREBUILT_ALLOCATED);
ut_a(prebuilt->magic_n2 == ROW_PREBUILT_ALLOCATED);
/* We don't support FTS queries from the HANDLER interfaces, because
we implemented FTS as reversed inverted index with auxiliary tables.
So anything related to traditional index query would not apply to
it. */
if (prebuilt->index->type & DICT_FTS) {
DBUG_RETURN(DB_END_OF_INDEX);
}
#ifdef UNIV_DEBUG
{
btrsea_sync_check check(trx->has_search_latch);
ut_ad(!sync_check_iterate(check));
}
#endif /* UNIV_DEBUG */
if (dict_table_is_discarded(prebuilt->table)) {
DBUG_RETURN(DB_TABLESPACE_DELETED);
} else if (prebuilt->table->ibd_file_missing) {
DBUG_RETURN(DB_TABLESPACE_NOT_FOUND);
} else if (!prebuilt->index_usable) {
DBUG_RETURN(DB_MISSING_HISTORY);
} else if (dict_index_is_corrupted(prebuilt->index)) {
DBUG_RETURN(DB_CORRUPTION);
}
/* We need to get the virtual column values stored in secondary
index key, if this is covered index scan or virtual key read is
requested. */
bool need_vrow = dict_index_has_virtual(prebuilt->index)
&& (prebuilt->read_just_key
|| prebuilt->m_read_virtual_key);
/*-------------------------------------------------------------*/
/* PHASE 0: Release a possible s-latch we are holding on the
adaptive hash index latch if there is someone waiting behind */
if (trx->has_search_latch
#ifndef INNODB_RW_LOCKS_USE_ATOMICS
&& rw_lock_get_writer(
btr_get_search_latch(index)) != RW_LOCK_NOT_LOCKED
#endif /* !INNODB_RW_LOCKS_USE_ATOMICS */
) {
/* There is an x-latch request on the adaptive hash index:
release the s-latch to reduce starvation and wait for
BTR_SEA_TIMEOUT rounds before trying to keep it again over
calls from MySQL */
trx_search_latch_release_if_reserved(trx);
}
/* Reset the new record lock info if srv_locks_unsafe_for_binlog
is set or session is using a READ COMMITED isolation level. Then
we are able to remove the record locks set here on an individual
row. */
prebuilt->new_rec_locks = 0;
/*-------------------------------------------------------------*/
/* PHASE 1: Try to pop the row from the prefetch cache */
if (UNIV_UNLIKELY(direction == 0)) {
trx->op_info = "starting index read";
prebuilt->n_rows_fetched = 0;
prebuilt->n_fetch_cached = 0;
prebuilt->fetch_cache_first = 0;
if (prebuilt->sel_graph == NULL) {
/* Build a dummy select query graph */
row_prebuild_sel_graph(prebuilt);
}
} else {
trx->op_info = "fetching rows";
if (prebuilt->n_rows_fetched == 0) {
prebuilt->fetch_direction = direction;
}
if (UNIV_UNLIKELY(direction != prebuilt->fetch_direction)) {
if (UNIV_UNLIKELY(prebuilt->n_fetch_cached > 0)) {
ut_error;
/* TODO: scrollable cursor: restore cursor to
the place of the latest returned row,
or better: prevent caching for a scroll
cursor! */
}
prebuilt->n_rows_fetched = 0;
prebuilt->n_fetch_cached = 0;
prebuilt->fetch_cache_first = 0;
} else if (UNIV_LIKELY(prebuilt->n_fetch_cached > 0)) {
row_sel_dequeue_cached_row_for_mysql(buf, prebuilt);
prebuilt->n_rows_fetched++;
err = DB_SUCCESS;
goto func_exit;
} else if (prebuilt->m_end_range == true) {
prebuilt->m_end_range = false;
err = DB_RECORD_NOT_FOUND;
goto func_exit;
}
if (prebuilt->fetch_cache_first > 0
&& prebuilt->fetch_cache_first < MYSQL_FETCH_CACHE_SIZE) {
/* The previous returned row was popped from the fetch
cache, but the cache was not full at the time of the
popping: no more rows can exist in the result set */
err = DB_RECORD_NOT_FOUND;
goto func_exit;
}
prebuilt->n_rows_fetched++;
if (prebuilt->n_rows_fetched > 1000000000) {
/* Prevent wrap-over */
prebuilt->n_rows_fetched = 500000000;
}
mode = pcur->search_mode;
}
/* In a search where at most one record in the index may match, we
can use a LOCK_REC_NOT_GAP type record lock when locking a
non-delete-marked matching record.
Note that in a unique secondary index there may be different
delete-marked versions of a record where only the primary key
values differ: thus in a secondary index we must use next-key
locks when locking delete-marked records. */
if (match_mode == ROW_SEL_EXACT
&& dict_index_is_unique(index)
&& dtuple_get_n_fields(search_tuple)
== dict_index_get_n_unique(index)
&& (dict_index_is_clust(index)
|| !dtuple_contains_null(search_tuple))) {
/* Note above that a UNIQUE secondary index can contain many
rows with the same key value if one of the columns is the SQL
null. A clustered index under MySQL can never contain null
columns because we demand that all the columns in primary key
are non-null. */
unique_search = TRUE;
/* Even if the condition is unique, MySQL seems to try to
retrieve also a second row if a primary key contains more than
1 column. Return immediately if this is not a HANDLER
command. */
if (UNIV_UNLIKELY(direction != 0
&& !prebuilt->used_in_HANDLER)) {
err = DB_RECORD_NOT_FOUND;
goto func_exit;
}
}
/* We don't support sequencial scan for Rtree index, because it
is no meaning to do so. */
if (dict_index_is_spatial(index)
&& !RTREE_SEARCH_MODE(mode)) {
err = DB_END_OF_INDEX;
goto func_exit;
}
mtr_start(&mtr);
/*-------------------------------------------------------------*/
/* PHASE 2: Try fast adaptive hash index search if possible */
/* Next test if this is the special case where we can use the fast
adaptive hash index to try the search. Since we must release the
search system latch when we retrieve an externally stored field, we
cannot use the adaptive hash index in a search in the case the row
may be long and there may be externally stored fields */
if (UNIV_UNLIKELY(direction == 0)
&& unique_search
&& btr_search_enabled
&& dict_index_is_clust(index)
&& !prebuilt->templ_contains_blob
&& !prebuilt->used_in_HANDLER
&& (prebuilt->mysql_row_len < UNIV_PAGE_SIZE / 8)
&& !prebuilt->innodb_api) {
mode = PAGE_CUR_GE;
if (trx->mysql_n_tables_locked == 0
&& !prebuilt->ins_sel_stmt
&& prebuilt->select_lock_type == LOCK_NONE
&& trx->isolation_level > TRX_ISO_READ_UNCOMMITTED
&& MVCC::is_view_active(trx->read_view)) {
/* This is a SELECT query done as a consistent read,
and the read view has already been allocated:
let us try a search shortcut through the hash
index.
NOTE that we must also test that
mysql_n_tables_locked == 0, because this might
also be INSERT INTO ... SELECT ... or
CREATE TABLE ... SELECT ... . Our algorithm is
NOT prepared to inserts interleaved with the SELECT,
and if we try that, we can deadlock on the adaptive
hash index semaphore! */
ut_a(!trx->has_search_latch);
rw_lock_s_lock(btr_get_search_latch(index));
trx->has_search_latch = true;
switch (row_sel_try_search_shortcut_for_mysql(
&rec, prebuilt, &offsets, &heap,
&mtr)) {
case SEL_FOUND:
/* At this point, rec is protected by
a page latch that was acquired by
row_sel_try_search_shortcut_for_mysql().
The latch will not be released until
mtr_commit(&mtr). */
ut_ad(!rec_get_deleted_flag(rec, comp));
if (prebuilt->idx_cond) {
switch (row_search_idx_cond_check(
buf, prebuilt,
rec, offsets)) {
case ICP_NO_MATCH:
case ICP_OUT_OF_RANGE:
goto shortcut_mismatch;
case ICP_MATCH:
goto shortcut_match;
}
}
if (!row_sel_store_mysql_rec(
buf, prebuilt,
rec, NULL, FALSE, index,
offsets, false)) {
/* Only fresh inserts may contain
incomplete externally stored
columns. Pretend that such
records do not exist. Such
records may only be accessed
at the READ UNCOMMITTED
isolation level or when
rolling back a recovered
transaction. Rollback happens
at a lower level, not here. */
/* Proceed as in case SEL_RETRY. */
break;
}
shortcut_match:
mtr_commit(&mtr);
/* NOTE that we do NOT store the cursor
position */
err = DB_SUCCESS;
rw_lock_s_unlock(btr_get_search_latch(index));
trx->has_search_latch = false;
goto func_exit;
case SEL_EXHAUSTED:
shortcut_mismatch:
mtr_commit(&mtr);
err = DB_RECORD_NOT_FOUND;
rw_lock_s_unlock(btr_get_search_latch(index));
trx->has_search_latch = false;
/* NOTE that we do NOT store the cursor
position */
goto func_exit;
case SEL_RETRY:
break;
default:
ut_ad(0);
}
mtr_commit(&mtr);
mtr_start(&mtr);
rw_lock_s_unlock(btr_get_search_latch(index));
trx->has_search_latch = false;
}
}
/*-------------------------------------------------------------*/
/* PHASE 3: Open or restore index cursor position */
trx_search_latch_release_if_reserved(trx);
spatial_search = dict_index_is_spatial(index)
&& mode >= PAGE_CUR_CONTAIN;
/* The state of a running trx can only be changed by the
thread that is currently serving the transaction. Because we
are that thread, we can read trx->state without holding any
mutex. */
ut_ad(prebuilt->sql_stat_start || trx->state == TRX_STATE_ACTIVE);
ut_ad(!trx_is_started(trx) || trx->state == TRX_STATE_ACTIVE);
ut_ad(prebuilt->sql_stat_start
|| prebuilt->select_lock_type != LOCK_NONE
|| MVCC::is_view_active(trx->read_view)
|| srv_read_only_mode);
	/* Note that a new transaction is not started at BEGIN: it is
	started (and its transaction ID assigned) by the first statement
	executed after BEGIN. */
trx_start_if_not_started(trx, false);
if (trx->isolation_level <= TRX_ISO_READ_COMMITTED
&& prebuilt->select_lock_type != LOCK_NONE
&& trx->mysql_thd != NULL
&& thd_is_select(trx->mysql_thd)) {
/* It is a plain locking SELECT and the isolation
level is low: do not lock gaps */
set_also_gap_locks = FALSE;
}
/* Note that if the search mode was GE or G, then the cursor
naturally moves upward (in fetch next) in alphabetical order,
otherwise downward */
if (direction == 0) {
if (mode == PAGE_CUR_GE
|| mode == PAGE_CUR_G
|| mode >= PAGE_CUR_CONTAIN) {
moves_up = TRUE;
}
} else if (direction == ROW_SEL_NEXT) {
moves_up = TRUE;
}
thr = que_fork_get_first_thr(prebuilt->sel_graph);
que_thr_move_to_run_state_for_mysql(thr, trx);
clust_index = dict_table_get_first_index(index->table);
/* Do some start-of-statement preparations */
if (!prebuilt->sql_stat_start) {
		/* The SQL statement has already started executing: this
		branch handles the record-scanning loop of the running
		statement. */
/* No need to set an intention lock or assign a read view */
if (!MVCC::is_view_active(trx->read_view)
&& !srv_read_only_mode
&& prebuilt->select_lock_type == LOCK_NONE) {
ib::error() << "MySQL is trying to perform a"
" consistent read but the read view is not"
" assigned!";
trx_print(stderr, trx, 600);
fputc('\n', stderr);
ut_error;
}
} else if (prebuilt->select_lock_type == LOCK_NONE) {
		/* First execution of the SELECT: build a READ_VIEW, which
		is later used to decide record visibility. */
/* This is a consistent read */
/* Assign a read view for the query */
if (!srv_read_only_mode) {
trx_assign_read_view(trx);
}
prebuilt->sql_stat_start = FALSE;
} else {
		/* Locking reads and non-INSERT DML take this branch, since
		DML places exclusive record locks. */
wait_table_again:
		/* Place an intention lock on the table */
err = lock_table(0, index->table,
prebuilt->select_lock_type == LOCK_S
? LOCK_IS : LOCK_IX, thr);
if (err != DB_SUCCESS) {
table_lock_waited = TRUE;
goto lock_table_wait;
}
prebuilt->sql_stat_start = FALSE;
}
/* Open or restore index cursor position */
if (UNIV_LIKELY(direction != 0)) {
if (spatial_search) {
/* R-Tree access does not need to do
cursor position and resposition */
goto next_rec;
}
ibool need_to_process = sel_restore_position_for_mysql(
&same_user_rec, BTR_SEARCH_LEAF,
pcur, moves_up, &mtr);
if (UNIV_UNLIKELY(need_to_process)) {
if (UNIV_UNLIKELY(prebuilt->row_read_type
== ROW_READ_DID_SEMI_CONSISTENT)) {
/* We did a semi-consistent read,
but the record was removed in
the meantime. */
prebuilt->row_read_type
= ROW_READ_TRY_SEMI_CONSISTENT;
}
} else if (UNIV_LIKELY(prebuilt->row_read_type
!= ROW_READ_DID_SEMI_CONSISTENT)) {
/* The cursor was positioned on the record
that we returned previously. If we need
to repeat a semi-consistent read as a
pessimistic locking read, the record
cannot be skipped. */
goto next_rec;
}
} else if (dtuple_get_n_fields(search_tuple) > 0) {
pcur->btr_cur.thr = thr;
if (dict_index_is_spatial(index)) {
bool need_pred_lock;
need_pred_lock = (set_also_gap_locks
&& !(srv_locks_unsafe_for_binlog
|| trx->isolation_level
<= TRX_ISO_READ_COMMITTED)
&& prebuilt->select_lock_type
!= LOCK_NONE);
if (!prebuilt->rtr_info) {
prebuilt->rtr_info = rtr_create_rtr_info(
need_pred_lock, true,
btr_pcur_get_btr_cur(pcur), index);
prebuilt->rtr_info->search_tuple = search_tuple;
prebuilt->rtr_info->search_mode = mode;
rtr_info_update_btr(btr_pcur_get_btr_cur(pcur),
prebuilt->rtr_info);
} else {
rtr_info_reinit_in_cursor(
btr_pcur_get_btr_cur(pcur),
index, need_pred_lock);
prebuilt->rtr_info->search_tuple = search_tuple;
prebuilt->rtr_info->search_mode = mode;
}
}
btr_pcur_open_with_no_init(index, search_tuple, mode,
BTR_SEARCH_LEAF,
pcur, 0, &mtr);
pcur->trx_if_known = trx;
rec = btr_pcur_get_rec(pcur);
if (!moves_up
&& !page_rec_is_supremum(rec)
&& set_also_gap_locks
&& !(srv_locks_unsafe_for_binlog
|| trx->isolation_level <= TRX_ISO_READ_COMMITTED)
&& prebuilt->select_lock_type != LOCK_NONE
&& !dict_index_is_spatial(index)) {
/* Try to place a gap lock on the next index record
to prevent phantoms in ORDER BY ... DESC queries */
const rec_t* next_rec = page_rec_get_next_const(rec);
offsets = rec_get_offsets(next_rec, index, offsets,
ULINT_UNDEFINED, &heap);
err = sel_set_rec_lock(pcur,
next_rec, index, offsets,
prebuilt->select_lock_type,
LOCK_GAP, thr, &mtr);
switch (err) {
case DB_SUCCESS_LOCKED_REC:
err = DB_SUCCESS;
case DB_SUCCESS:
break;
default:
goto lock_wait_or_error;
}
}
} else if (mode == PAGE_CUR_G || mode == PAGE_CUR_L) {
btr_pcur_open_at_index_side(
mode == PAGE_CUR_G, index, BTR_SEARCH_LEAF,
pcur, false, 0, &mtr);
}
rec_loop:
DEBUG_SYNC_C("row_search_rec_loop");
if (trx_is_interrupted(trx)) {
if (!spatial_search) {
btr_pcur_store_position(pcur, &mtr);
}
err = DB_INTERRUPTED;
goto normal_return;
}
/*-------------------------------------------------------------*/
/* PHASE 4: Look for matching records in a loop */
rec = btr_pcur_get_rec(pcur);
ut_ad(!!page_rec_is_comp(rec) == comp);
if (page_rec_is_infimum(rec)) {
/* The infimum record on a page cannot be in the result set,
and neither can a record lock be placed on it: we skip such
a record. */
prev_rec = NULL;
goto next_rec;
}
if (page_rec_is_supremum(rec)) {
/** Compare the last record of the page with end range
passed to InnoDB when there is no ICP and number of
loops in row_search_mvcc for rows found but not
reporting due to search views etc. */
if (prev_rec != NULL
&& prebuilt->m_mysql_handler->end_range != NULL
&& prebuilt->idx_cond == false && end_loop >= 100) {
dict_index_t* key_index = prebuilt->index;
bool clust_templ_for_sec = false;
if (end_range_cache == NULL) {
end_range_cache = static_cast<byte*>(
ut_malloc_nokey(prebuilt->mysql_row_len));
}
if (index != clust_index
&& prebuilt->need_to_access_clustered) {
/** Secondary index record but the template
based on PK. */
key_index = clust_index;
clust_templ_for_sec = true;
}
/** Create offsets based on prebuilt index. */
offsets = rec_get_offsets(prev_rec, prebuilt->index,
offsets, ULINT_UNDEFINED, &heap);
if (row_sel_store_mysql_rec(
end_range_cache, prebuilt, prev_rec, prev_vrow,
clust_templ_for_sec, key_index, offsets,
clust_templ_for_sec)) {
if (row_search_end_range_check(
end_range_cache,
prebuilt->m_mysql_handler)) {
/** In case of prebuilt->fetch,
set the error in prebuilt->end_range. */
if (prebuilt->n_fetch_cached > 0) {
prebuilt->m_end_range = true;
}
err = DB_RECORD_NOT_FOUND;
goto normal_return;
}
}
}
if (set_also_gap_locks
&& !(srv_locks_unsafe_for_binlog
|| trx->isolation_level <= TRX_ISO_READ_COMMITTED)
&& prebuilt->select_lock_type != LOCK_NONE
&& !dict_index_is_spatial(index)) {
/* Try to place a lock on the index record */
/* If innodb_locks_unsafe_for_binlog option is used
or this session is using a READ COMMITTED isolation
level we do not lock gaps. Supremum record is really
a gap and therefore we do not set locks there. */
offsets = rec_get_offsets(rec, index, offsets,
ULINT_UNDEFINED, &heap);
err = sel_set_rec_lock(pcur,
rec, index, offsets,
prebuilt->select_lock_type,
LOCK_ORDINARY, thr, &mtr);
switch (err) {
case DB_SUCCESS_LOCKED_REC:
err = DB_SUCCESS;
case DB_SUCCESS:
break;
default:
goto lock_wait_or_error;
}
}
/* A page supremum record cannot be in the result set: skip
it now that we have placed a possible lock on it */
prev_rec = NULL;
goto next_rec;
}
/*-------------------------------------------------------------*/
/* Do sanity checks in case our cursor has bumped into page
corruption */
if (comp) {
next_offs = rec_get_next_offs(rec, TRUE);
if (UNIV_UNLIKELY(next_offs < PAGE_NEW_SUPREMUM)) {
goto wrong_offs;
}
} else {
next_offs = rec_get_next_offs(rec, FALSE);
if (UNIV_UNLIKELY(next_offs < PAGE_OLD_SUPREMUM)) {
goto wrong_offs;
}
}
if (UNIV_UNLIKELY(next_offs >= UNIV_PAGE_SIZE - PAGE_DIR)) {
wrong_offs:
if (srv_force_recovery == 0 || moves_up == FALSE) {
ib::error() << "Rec address "
<< static_cast<const void*>(rec)
<< ", buf block fix count "
<< btr_cur_get_block(
btr_pcur_get_btr_cur(pcur))->page
.buf_fix_count;
ib::error() << "Index corruption: rec offs "
<< page_offset(rec) << " next offs "
<< next_offs << ", page no "
<< page_get_page_no(page_align(rec))
<< ", index " << index->name
<< " of table " << index->table->name
<< ". Run CHECK TABLE. You may need to"
" restore from a backup, or dump + drop +"
" reimport the table.";
ut_ad(0);
err = DB_CORRUPTION;
goto lock_wait_or_error;
} else {
/* The user may be dumping a corrupt table. Jump
over the corruption to recover as much as possible. */
ib::info() << "Index corruption: rec offs "
<< page_offset(rec) << " next offs "
<< next_offs << ", page no "
<< page_get_page_no(page_align(rec))
<< ", index " << index->name
<< " of table " << index->table->name
<< ". We try to skip the rest of the page.";
btr_pcur_move_to_last_on_page(pcur, &mtr);
prev_rec = NULL;
goto next_rec;
}
}
/*-------------------------------------------------------------*/
/* Calculate the 'offsets' associated with 'rec' */
ut_ad(fil_page_index_page_check(btr_pcur_get_page(pcur)));
ut_ad(btr_page_get_index_id(btr_pcur_get_page(pcur)) == index->id);
offsets = rec_get_offsets(rec, index, offsets, ULINT_UNDEFINED, &heap);
if (UNIV_UNLIKELY(srv_force_recovery > 0)) {
if (!rec_validate(rec, offsets)
|| !btr_index_rec_validate(rec, index, FALSE)) {
ib::info() << "Index corruption: rec offs "
<< page_offset(rec) << " next offs "
<< next_offs << ", page no "
<< page_get_page_no(page_align(rec))
<< ", index " << index->name
<< " of table " << index->table->name
<< ". We try to skip the record.";
prev_rec = NULL;
goto next_rec;
}
}
prev_rec = rec;
/* Note that we cannot trust the up_match value in the cursor at this
place because we can arrive here after moving the cursor! Thus
we have to recompare rec and search_tuple to determine if they
match enough. */
if (match_mode == ROW_SEL_EXACT) {
/* Test if the index record matches completely to search_tuple
in prebuilt: if not, then we return with DB_RECORD_NOT_FOUND */
/* fputs("Comparing rec and search tuple\n", stderr); */
if (0 != cmp_dtuple_rec(search_tuple, rec, offsets)) {
if (set_also_gap_locks
&& !(srv_locks_unsafe_for_binlog
|| trx->isolation_level
<= TRX_ISO_READ_COMMITTED)
&& prebuilt->select_lock_type != LOCK_NONE
&& !dict_index_is_spatial(index)) {
/* Try to place a gap lock on the index
record only if innodb_locks_unsafe_for_binlog
option is not set or this session is not
using a READ COMMITTED isolation level. */
err = sel_set_rec_lock(
pcur,
rec, index, offsets,
prebuilt->select_lock_type, LOCK_GAP,
thr, &mtr);
switch (err) {
case DB_SUCCESS_LOCKED_REC:
case DB_SUCCESS:
break;
default:
goto lock_wait_or_error;
}
}
btr_pcur_store_position(pcur, &mtr);
/* The found record was not a match, but may be used
as NEXT record (index_next). Set the relative position
to BTR_PCUR_BEFORE, to reflect that the position of
the persistent cursor is before the found/stored row
(pcur->old_rec). */
ut_ad(pcur->rel_pos == BTR_PCUR_ON);
pcur->rel_pos = BTR_PCUR_BEFORE;
err = DB_RECORD_NOT_FOUND;
goto normal_return;
}
} else if (match_mode == ROW_SEL_EXACT_PREFIX) {
if (!cmp_dtuple_is_prefix_of_rec(search_tuple, rec, offsets)) {
if (set_also_gap_locks
&& !(srv_locks_unsafe_for_binlog
|| trx->isolation_level
<= TRX_ISO_READ_COMMITTED)
&& prebuilt->select_lock_type != LOCK_NONE
&& !dict_index_is_spatial(index)) {
/* Try to place a gap lock on the index
record only if innodb_locks_unsafe_for_binlog
option is not set or this session is not
using a READ COMMITTED isolation level. */
err = sel_set_rec_lock(
pcur,
rec, index, offsets,
prebuilt->select_lock_type, LOCK_GAP,
thr, &mtr);
switch (err) {
case DB_SUCCESS_LOCKED_REC:
case DB_SUCCESS:
break;
default:
goto lock_wait_or_error;
}
}
btr_pcur_store_position(pcur, &mtr);
/* The found record was not a match, but may be used
as NEXT record (index_next). Set the relative position
to BTR_PCUR_BEFORE, to reflect that the position of
the persistent cursor is before the found/stored row
(pcur->old_rec). */
ut_ad(pcur->rel_pos == BTR_PCUR_ON);
pcur->rel_pos = BTR_PCUR_BEFORE;
err = DB_RECORD_NOT_FOUND;
goto normal_return;
}
}
/* We are ready to look at a possible new index entry in the result
set: the cursor is now placed on a user record */
/* From here we can see how InnoDB fetches a record. Since the first
statement of SESSION 1 in the example above is a plain SELECT, and
InnoDB takes no locks for SELECT under REPEATABLE READ,
prebuilt->select_lock_type is LOCK_NONE while that SELECT runs.
Let us look directly at how InnoDB scans rows when
prebuilt->select_lock_type is LOCK_NONE. */
if (prebuilt->select_lock_type != LOCK_NONE) {
/* Try to place a lock on the index record; note that delete
marked records are a special case in a unique search. If there
is a non-delete marked record, then it is enough to lock its
existence with LOCK_REC_NOT_GAP. */
/* If innodb_locks_unsafe_for_binlog option is used
or this session is using a READ COMMITTED isolation
level we lock only the record, i.e., next-key locking is
not used. */
ulint lock_type;
if (!set_also_gap_locks
|| srv_locks_unsafe_for_binlog
|| trx->isolation_level <= TRX_ISO_READ_COMMITTED
|| (unique_search && !rec_get_deleted_flag(rec, comp))
|| dict_index_is_spatial(index)) {
// No gap lock is needed for READ UNCOMMITTED/READ COMMITTED, or for a unique-key search on a non-delete-marked record
goto no_gap_lock;
} else {
lock_type = LOCK_ORDINARY;
}
/* If we are doing a 'greater or equal than a primary key
value' search from a clustered index, and we find a record
that has that exact primary key value, then there is no need
to lock the gap before the record, because no insert in the
gap can be in our search range. That is, no phantom row can
appear that way.
An example: if col1 is the primary key, the search is WHERE
col1 >= 100, and we find a record where col1 = 100, then no
need to lock the gap before that record. */
if (index == clust_index
&& mode == PAGE_CUR_GE
&& direction == 0
&& dtuple_get_n_fields_cmp(search_tuple)
== dict_index_get_n_unique(index)
&& 0 == cmp_dtuple_rec(search_tuple, rec, offsets)) {
no_gap_lock:
lock_type = LOCK_REC_NOT_GAP;
}
err = sel_set_rec_lock(pcur,
rec, index, offsets,
prebuilt->select_lock_type,
lock_type, thr, &mtr);
switch (err) {
const rec_t* old_vers;
case DB_SUCCESS_LOCKED_REC:
if (srv_locks_unsafe_for_binlog
|| trx->isolation_level
<= TRX_ISO_READ_COMMITTED) {
/* Note that a record of
prebuilt->index was locked. */
prebuilt->new_rec_locks = 1;
}
err = DB_SUCCESS;
case DB_SUCCESS:
// Once the lock is granted the record is treated as visible; unlike a SELECT, no visibility check is made against the read view opened at transaction start. So for DML, every committed transaction's changes are visible.
break;
case DB_LOCK_WAIT:
// Lock conflict: another transaction is currently updating the same row
/* Lock wait for R-tree should already
be handled in sel_set_rtr_rec_lock() */
ut_ad(!dict_index_is_spatial(index));
/* Never unlock rows that were part of a conflict. */
prebuilt->new_rec_locks = 0;
if (UNIV_LIKELY(prebuilt->row_read_type
!= ROW_READ_TRY_SEMI_CONSISTENT)
|| unique_search
|| index != clust_index) {
goto lock_wait_or_error;
}
/* The following call returns 'offsets'
associated with 'old_vers' */
// Check whether another transaction has committed, so we can fetch the latest committed version of the record
row_sel_build_committed_vers_for_mysql(
clust_index, prebuilt, rec,
&offsets, &heap, &old_vers, need_vrow ? &vrow : NULL,
&mtr);
/* Check whether it was a deadlock or not, if not
a deadlock and the transaction had to wait then
release the lock it is waiting on. */
err = lock_trx_handle_wait(trx);
switch (err) {
case DB_SUCCESS:
/* The lock was granted while we were
searching for the last committed version.
Do a normal locking read. */
offsets = rec_get_offsets(
rec, index, offsets, ULINT_UNDEFINED,
&heap);
goto locks_ok;
case DB_DEADLOCK:
goto lock_wait_or_error;
case DB_LOCK_WAIT:
ut_ad(!dict_index_is_spatial(index));
err = DB_SUCCESS;
break;
default:
ut_error;
}
if (old_vers == NULL) {
/* The row was not yet committed */
goto next_rec;
}
did_semi_consistent_read = TRUE;
rec = old_vers;
prev_rec = rec;
break;
case DB_RECORD_NOT_FOUND:
if (dict_index_is_spatial(index)) {
goto next_rec;
} else {
goto lock_wait_or_error;
}
default:
goto lock_wait_or_error;
}
} else {
/* This is a non-locking consistent read: if necessary, fetch
a previous version of the record */
if (trx->isolation_level == TRX_ISO_READ_UNCOMMITTED) {
/* Under READ UNCOMMITTED nothing is needed: just let it read the latest version of the record */
/* Do nothing: we let a non-locking SELECT read the
latest version of the record */
} else if (index == clust_index) {
/* Fetch a previous version of the row if the current
one is not visible in the snapshot; if we have a very
high force recovery level set, we try to avoid crashes
by skipping this lookup */
// For a full-table or primary-key scan, check whether the current record is visible to this transaction
if (srv_force_recovery < 5
&& !lock_clust_rec_cons_read_sees(
rec, index, offsets,
trx_get_read_view(trx))) {
rec_t* old_vers;
/* The following call returns 'offsets'
associated with 'old_vers' */
// If it is not visible, look up an older version via the undo log
err = row_sel_build_prev_vers_for_mysql(
trx->read_view, clust_index,
prebuilt, rec, &offsets, &heap,
&old_vers, need_vrow ? &vrow : NULL,
&mtr);
if (err != DB_SUCCESS) {
goto lock_wait_or_error;
}
if (old_vers == NULL) {
/* The row did not exist yet in
the read view */
// Not visible to this transaction and no older version exists: move straight on to the next record
goto next_rec;
}
rec = old_vers;
prev_rec = rec;
}
} else {
/* We are looking into a non-clustered index,
and to get the right version of the record we
have to look also into the clustered index: this
is necessary, because we can only get the undo
information via the clustered index record. */
ut_ad(!dict_index_is_clust(index));
// This branch handles scans over a secondary index
if (!srv_read_only_mode
&& !lock_sec_rec_cons_read_sees(
rec, index, trx->read_view)) {
/* We should look at the clustered index.
However, as this is a non-locking read,
we can skip the clustered index lookup if
the condition does not match the secondary
index entry. */
// Optimization: if the record fails the pushed-down index
// condition (ICP), skip straight to the next record; if it
// matches, look up the clustered index to find the right version
switch (row_search_idx_cond_check(
buf, prebuilt, rec, offsets)) {
case ICP_NO_MATCH:
goto next_rec;
case ICP_OUT_OF_RANGE:
err = DB_RECORD_NOT_FOUND;
goto idx_cond_failed;
case ICP_MATCH:
goto requires_clust_rec;
}
ut_error;
}
}
}
locks_ok:
/* NOTE that at this point rec can be an old version of a clustered
index record built for a consistent read. We cannot assume after this
point that rec is on a buffer pool page. Functions like
page_rec_is_comp() cannot be used! */
if (rec_get_deleted_flag(rec, comp)) {
/* The record is delete-marked: we can skip it */
if ((srv_locks_unsafe_for_binlog
|| trx->isolation_level <= TRX_ISO_READ_COMMITTED)
&& prebuilt->select_lock_type != LOCK_NONE
&& !did_semi_consistent_read) {
/* No need to keep a lock on a delete-marked record
if we do not want to use next-key locking. */
row_unlock_for_mysql(prebuilt, TRUE);
}
/* This is an optimization to skip setting the next key lock
on the record that follows this delete-marked record. This
optimization works because of the unique search criteria
which precludes the presence of a range lock between this
delete marked record and the record following it.
For now this is applicable only to clustered indexes while
doing a unique search except for HANDLER queries because
HANDLER allows NEXT and PREV even in unique search on
clustered index. There is scope for further optimization
applicable to unique secondary indexes. Current behaviour is
to widen the scope of a lock on an already delete marked record
if the same record is deleted twice by the same transaction */
if (index == clust_index && unique_search
&& !prebuilt->used_in_HANDLER) {
err = DB_RECORD_NOT_FOUND;
goto normal_return;
}
goto next_rec;
}
/* Check if the record matches the index condition. */
switch (row_search_idx_cond_check(buf, prebuilt, rec, offsets)) {
case ICP_NO_MATCH:
if (did_semi_consistent_read) {
row_unlock_for_mysql(prebuilt, TRUE);
}
goto next_rec;
case ICP_OUT_OF_RANGE:
err = DB_RECORD_NOT_FOUND;
goto idx_cond_failed;
case ICP_MATCH:
break;
}
/* Get the clustered index record if needed, if we did not do the
search using the clustered index. */
if (index != clust_index && prebuilt->need_to_access_clustered) {
requires_clust_rec:
ut_ad(index != clust_index);
/* We use a 'goto' to the preceding label if a consistent
read of a secondary index record requires us to look up old
versions of the associated clustered index record. */
ut_ad(rec_offs_validate(rec, index, offsets));
/* It was a non-clustered index and we must fetch also the
clustered index record */
mtr_has_extra_clust_latch = TRUE;
ut_ad(!vrow);
/* The following call returns 'offsets' associated with
'clust_rec'. Note that 'clust_rec' can be an old version
built for a consistent read. */
err = row_sel_get_clust_rec_for_mysql(prebuilt, index, rec,
thr, &clust_rec,
&offsets, &heap,
need_vrow ? &vrow : NULL,
&mtr);
switch (err) {
case DB_SUCCESS:
if (clust_rec == NULL) {
/* The record did not exist in the read view */
ut_ad(prebuilt->select_lock_type == LOCK_NONE
|| dict_index_is_spatial(index));
goto next_rec;
}
break;
case DB_SUCCESS_LOCKED_REC:
ut_a(clust_rec != NULL);
if (srv_locks_unsafe_for_binlog
|| trx->isolation_level
<= TRX_ISO_READ_COMMITTED) {
/* Note that the clustered index record
was locked. */
prebuilt->new_rec_locks = 2;
}
err = DB_SUCCESS;
break;
default:
vrow = NULL;
goto lock_wait_or_error;
}
if (rec_get_deleted_flag(clust_rec, comp)) {
/* The record is delete marked: we can skip it */
if ((srv_locks_unsafe_for_binlog
|| trx->isolation_level <= TRX_ISO_READ_COMMITTED)
&& prebuilt->select_lock_type != LOCK_NONE) {
/* No need to keep a lock on a delete-marked
record if we do not want to use next-key
locking. */
row_unlock_for_mysql(prebuilt, TRUE);
}
goto next_rec;
}
if (need_vrow && !vrow) {
if (!heap) {
heap = mem_heap_create(100);
}
row_sel_fill_vrow(rec, index, &vrow, heap);
}
result_rec = clust_rec;
ut_ad(rec_offs_validate(result_rec, clust_index, offsets));
if (prebuilt->idx_cond) {
/* Convert the record to MySQL format. We were
unable to do this in row_search_idx_cond_check(),
because the condition is on the secondary index
and the requested column is in the clustered index.
We convert all fields, including those that
may have been used in ICP, because the
secondary index may contain a column prefix
rather than the full column. Also, as noted
in Bug #56680, the column in the secondary
index may be in the wrong case, and the
authoritative case is in result_rec, the
appropriate version of the clustered index record. */
if (!row_sel_store_mysql_rec(
buf, prebuilt, result_rec, vrow,
TRUE, clust_index, offsets, false)) {
goto next_rec;
}
}
} else {
result_rec = rec;
}
/* We found a qualifying record 'result_rec'. At this point,
'offsets' are associated with 'result_rec'. */
ut_ad(rec_offs_validate(result_rec,
result_rec != rec ? clust_index : index,
offsets));
ut_ad(!rec_get_deleted_flag(result_rec, comp));
/* Decide whether to prefetch extra rows.
At this point, the clustered index record is protected
by a page latch that was acquired when pcur was positioned.
The latch will not be released until mtr_commit(&mtr). */
if ((match_mode == ROW_SEL_EXACT
|| prebuilt->n_rows_fetched >= MYSQL_FETCH_CACHE_THRESHOLD)
&& prebuilt->select_lock_type == LOCK_NONE
&& !prebuilt->m_no_prefetch
&& !prebuilt->templ_contains_blob
&& !prebuilt->templ_contains_fixed_point
&& !prebuilt->clust_index_was_generated
&& !prebuilt->used_in_HANDLER
&& !prebuilt->innodb_api
&& prebuilt->template_type != ROW_MYSQL_DUMMY_TEMPLATE
&& !prebuilt->in_fts_query) {
/* Inside an update, for example, we do not cache rows,
since we may use the cursor position to do the actual
update, that is why we require ...lock_type == LOCK_NONE.
Since we keep space in prebuilt only for the BLOBs of
a single row, we cannot cache rows in the case there
are BLOBs in the fields to be fetched. In HANDLER we do
not cache rows because there the cursor is a scrollable
cursor. */
ut_a(prebuilt->n_fetch_cached < MYSQL_FETCH_CACHE_SIZE);
/* We only convert from InnoDB row format to MySQL row
format when ICP is disabled. */
if (!prebuilt->idx_cond) {
/* We use next_buf to track the allocation of buffers
where we store and enqueue the buffers for our
pre-fetch optimisation.
If next_buf == 0 then we store the converted record
directly into the MySQL record buffer (buf). If it is
!= 0 then we allocate a pre-fetch buffer and store the
converted record there.
If the conversion fails and the MySQL record buffer
was not written to then we reset next_buf so that
we can re-use the MySQL record buffer in the next
iteration. */
next_buf = next_buf
? row_sel_fetch_last_buf(prebuilt) : buf;
if (!row_sel_store_mysql_rec(
next_buf, prebuilt, result_rec, vrow,
result_rec != rec,
result_rec != rec ? clust_index : index,
offsets, false)) {
if (next_buf == buf) {
ut_a(prebuilt->n_fetch_cached == 0);
next_buf = 0;
}
/* Only fresh inserts may contain incomplete
externally stored columns. Pretend that such
records do not exist. Such records may only be
accessed at the READ UNCOMMITTED isolation
level or when rolling back a recovered
transaction. Rollback happens at a lower
level, not here. */
goto next_rec;
}
if (next_buf != buf) {
row_sel_enqueue_cache_row_for_mysql(
next_buf, prebuilt);
}
} else {
row_sel_enqueue_cache_row_for_mysql(buf, prebuilt);
}
if (prebuilt->n_fetch_cached < MYSQL_FETCH_CACHE_SIZE) {
goto next_rec;
}
} else {
if (UNIV_UNLIKELY
(prebuilt->template_type == ROW_MYSQL_DUMMY_TEMPLATE)) {
/* CHECK TABLE: fetch the row */
if (result_rec != rec
&& !prebuilt->need_to_access_clustered) {
/* We used 'offsets' for the clust
rec, recalculate them for 'rec' */
offsets = rec_get_offsets(rec, index, offsets,
ULINT_UNDEFINED,
&heap);
result_rec = rec;
}
memcpy(buf + 4, result_rec
- rec_offs_extra_size(offsets),
rec_offs_size(offsets));
mach_write_to_4(buf,
rec_offs_extra_size(offsets) + 4);
} else if (!prebuilt->idx_cond && !prebuilt->innodb_api) {
/* The record was not yet converted to MySQL format. */
if (!row_sel_store_mysql_rec(
buf, prebuilt, result_rec, vrow,
result_rec != rec,
result_rec != rec ? clust_index : index,
offsets, false)) {
/* Only fresh inserts may contain
incomplete externally stored
columns. Pretend that such records do
not exist. Such records may only be
accessed at the READ UNCOMMITTED
isolation level or when rolling back a
recovered transaction. Rollback
happens at a lower level, not here. */
goto next_rec;
}
}
if (prebuilt->clust_index_was_generated) {
row_sel_store_row_id_to_prebuilt(
prebuilt, result_rec,
result_rec == rec ? index : clust_index,
offsets);
}
}
/* From this point on, 'offsets' are invalid. */
/* We have an optimization to save CPU time: if this is a consistent
read on a unique condition on the clustered index, then we do not
store the pcur position, because any fetch next or prev will anyway
return 'end of file'. Exceptions are locking reads and the MySQL
HANDLER command where the user can move the cursor with PREV or NEXT
even after a unique search. */
err = DB_SUCCESS;
idx_cond_failed:
if (!unique_search
|| !dict_index_is_clust(index)
|| direction != 0
|| prebuilt->select_lock_type != LOCK_NONE
|| prebuilt->used_in_HANDLER
|| prebuilt->innodb_api) {
/* Inside an update always store the cursor position */
if (!spatial_search) {
btr_pcur_store_position(pcur, &mtr);
}
if (prebuilt->innodb_api) {
prebuilt->innodb_api_rec = result_rec;
}
}
goto normal_return;
next_rec:
if (end_loop >= 99
&& need_vrow && vrow == NULL && prev_rec != NULL) {
if (!heap) {
heap = mem_heap_create(100);
}
prev_vrow = NULL;
row_sel_fill_vrow(prev_rec, index, &prev_vrow, heap);
} else {
prev_vrow = vrow;
}
end_loop++;
/* Reset the old and new "did semi-consistent read" flags. */
if (UNIV_UNLIKELY(prebuilt->row_read_type
== ROW_READ_DID_SEMI_CONSISTENT)) {
prebuilt->row_read_type = ROW_READ_TRY_SEMI_CONSISTENT;
}
did_semi_consistent_read = FALSE;
prebuilt->new_rec_locks = 0;
vrow = NULL;
/*-------------------------------------------------------------*/
/* PHASE 5: Move the cursor to the next index record */
/* NOTE: For moves_up==FALSE, the mini-transaction will be
committed and restarted every time when switching b-tree
pages. For moves_up==TRUE in index condition pushdown, we can
scan an entire secondary index tree within a single
mini-transaction. As long as the prebuilt->idx_cond does not
match, we do not need to consult the clustered index or
return records to MySQL, and thus we can avoid repositioning
the cursor. What prevents us from buffer-fixing all leaf pages
within the mini-transaction is the btr_leaf_page_release()
call in btr_pcur_move_to_next_page(). Only the leaf page where
the cursor is positioned will remain buffer-fixed.
For R-tree spatial search, we also commit the mini-transaction
each time */
if (mtr_has_extra_clust_latch || spatial_search) {
/* If we have extra cluster latch, we must commit
mtr if we are moving to the next non-clustered
index record, because we could break the latching
order if we would access a different clustered
index page right away without releasing the previous. */
/* No need to do store restore for R-tree */
if (!spatial_search) {
btr_pcur_store_position(pcur, &mtr);
}
mtr_commit(&mtr);
mtr_has_extra_clust_latch = FALSE;
mtr_start(&mtr);
if (!spatial_search
&& sel_restore_position_for_mysql(&same_user_rec,
BTR_SEARCH_LEAF,
pcur, moves_up, &mtr)) {
goto rec_loop;
}
}
if (moves_up) {
bool move;
if (spatial_search) {
move = rtr_pcur_move_to_next(
search_tuple, mode, pcur, 0, &mtr);
} else {
move = btr_pcur_move_to_next(pcur, &mtr);
}
if (!move) {
not_moved:
if (!spatial_search) {
btr_pcur_store_position(pcur, &mtr);
}
if (match_mode != 0) {
err = DB_RECORD_NOT_FOUND;
} else {
err = DB_END_OF_INDEX;
}
goto normal_return;
}
} else {
if (UNIV_UNLIKELY(!btr_pcur_move_to_prev(pcur, &mtr))) {
goto not_moved;
}
}
goto rec_loop;
lock_wait_or_error:
/* Reset the old and new "did semi-consistent read" flags. */
if (UNIV_UNLIKELY(prebuilt->row_read_type
== ROW_READ_DID_SEMI_CONSISTENT)) {
prebuilt->row_read_type = ROW_READ_TRY_SEMI_CONSISTENT;
}
did_semi_consistent_read = FALSE;
/*-------------------------------------------------------------*/
if (!dict_index_is_spatial(index)) {
btr_pcur_store_position(pcur, &mtr);
}
lock_table_wait:
mtr_commit(&mtr);
mtr_has_extra_clust_latch = FALSE;
trx->error_state = err;
/* The following is a patch for MySQL */
if (thr->is_active) {
que_thr_stop_for_mysql(thr);
}
thr->lock_state = QUE_THR_LOCK_ROW;
if (row_mysql_handle_errors(&err, trx, thr, NULL)) {
/* It was a lock wait, and it ended */
thr->lock_state = QUE_THR_LOCK_NOLOCK;
mtr_start(&mtr);
/* Table lock waited, go try to obtain table lock
again */
if (table_lock_waited) {
table_lock_waited = FALSE;
goto wait_table_again;
}
if (!dict_index_is_spatial(index)) {
sel_restore_position_for_mysql(
&same_user_rec, BTR_SEARCH_LEAF, pcur,
moves_up, &mtr);
}
if ((srv_locks_unsafe_for_binlog
|| trx->isolation_level <= TRX_ISO_READ_COMMITTED)
&& !same_user_rec) {
/* Since we were not able to restore the cursor
on the same user record, we cannot use
row_unlock_for_mysql() to unlock any records, and
we must thus reset the new rec lock info. Since
in lock0lock.cc we have blocked the inheriting of gap
X-locks, we actually do not have any new record locks
set in this case.
Note that if we were able to restore on the 'same'
user record, it is still possible that we were actually
waiting on a delete-marked record, and meanwhile
it was removed by purge and inserted again by some
other user. But that is no problem, because in
rec_loop we will again try to set a lock, and
new_rec_lock_info in trx will be right at the end. */
prebuilt->new_rec_locks = 0;
}
mode = pcur->search_mode;
goto rec_loop;
}
thr->lock_state = QUE_THR_LOCK_NOLOCK;
goto func_exit;
normal_return:
/*-------------------------------------------------------------*/
que_thr_stop_for_mysql_no_error(thr, trx);
mtr_commit(&mtr);
/* Rollback blocking transactions from hit list for high priority
transaction, if any. We should not be holding latches here as
we are going to rollback the blocking transactions. */
if (!trx->hit_list.empty()) {
ut_ad(trx_is_high_priority(trx));
trx_kill_blocking(trx);
}
DEBUG_SYNC_C("row_search_for_mysql_before_return");
if (prebuilt->idx_cond != 0) {
/* When ICP is active we don't write to the MySQL buffer
directly, only to buffers that are enqueued in the pre-fetch
queue. We need to dequeue the first buffer and copy the contents
to the record buffer that was passed in by MySQL. */
if (prebuilt->n_fetch_cached > 0) {
row_sel_dequeue_cached_row_for_mysql(buf, prebuilt);
err = DB_SUCCESS;
}
} else if (next_buf != 0) {
/* We may or may not have enqueued some buffers to the
pre-fetch queue, but we definitely wrote to the record
buffer passed to use by MySQL. */
DEBUG_SYNC_C("row_search_cached_row");
err = DB_SUCCESS;
}
#ifdef UNIV_DEBUG
if (dict_index_is_spatial(index) && err != DB_SUCCESS
&& err != DB_END_OF_INDEX && err != DB_INTERRUPTED) {
rtr_node_path_t* path = pcur->btr_cur.rtr_info->path;
ut_ad(path->empty());
}
#endif
func_exit:
trx->op_info = "";
if (end_range_cache != NULL) {
ut_free(end_range_cache);
}
if (heap != NULL) {
mem_heap_free(heap);
}
/* Set or reset the "did semi-consistent read" flag on return.
The flag did_semi_consistent_read is set if and only if
the record being returned was fetched with a semi-consistent read. */
ut_ad(prebuilt->row_read_type != ROW_READ_WITH_LOCKS
|| !did_semi_consistent_read);
if (prebuilt->row_read_type != ROW_READ_WITH_LOCKS) {
if (did_semi_consistent_read) {
prebuilt->row_read_type = ROW_READ_DID_SEMI_CONSISTENT;
} else {
prebuilt->row_read_type = ROW_READ_TRY_SEMI_CONSISTENT;
}
}
#ifdef UNIV_DEBUG
{
btrsea_sync_check check(trx->has_search_latch);
ut_ad(!sync_check_iterate(check));
}
#endif /* UNIV_DEBUG */
DEBUG_SYNC_C("innodb_row_search_for_mysql_exit");
DBUG_RETURN(err);
}
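The lock-type decision earlier in the listing (the `prebuilt->select_lock_type != LOCK_NONE` branch, where the code either jumps to `no_gap_lock` or chooses `LOCK_ORDINARY`) can be sketched as a small model. This is a simplified illustration, not InnoDB code; `Iso` and `choose_lock_type` are hypothetical names:

```python
from enum import Enum

class Iso(Enum):
    READ_UNCOMMITTED = 0
    READ_COMMITTED = 1
    REPEATABLE_READ = 2
    SERIALIZABLE = 3

# Hypothetical model of the branch in the listing that picks the
# record-lock type for a locking read.
def choose_lock_type(iso, unique_search, delete_marked, exact_pk_equality):
    # RC and below lock only the record itself: no gap locking,
    # so rows committed by others can still appear (phantoms).
    if iso.value <= Iso.READ_COMMITTED.value:
        return "LOCK_REC_NOT_GAP"
    # A unique search that found a live record needs no gap lock either.
    if unique_search and not delete_marked:
        return "LOCK_REC_NOT_GAP"
    # An exact match on the full primary key leaves no gap to protect.
    if exact_pk_equality:
        return "LOCK_REC_NOT_GAP"
    # Otherwise REPEATABLE READ takes a next-key lock
    # (the record plus the gap before it).
    return "LOCK_ORDINARY"
```

This mirrors why a locking read under REPEATABLE READ blocks inserts into the scanned range, while READ COMMITTED does not.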
The execution flow of an UPDATE can be outlined as follows:
- Scan for a record matching the WHERE condition (row_search_for_mysql).
- Update the fetched record (ha_innobase::update_row).
- Write the updated record back into the InnoDB storage engine (row_upd_step).
From the code above we can see that the records modified by the UPDATE include the latest committed rows brought in by the phantom read. Why, then, does the SELECT issued after the UPDATE see every row the UPDATE touched? Because the UPDATE stamps each modified record with the current transaction's ID, and anything modified by the current transaction is visible to all of its own subsequent SELECTs. The final result therefore contains all the records the UPDATE changed.
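This visibility rule can be sketched with a toy read-view model (hypothetical names, not InnoDB's actual trx_sys structures): a snapshot SELECT hides rows committed after the view was opened, but once the UPDATE, which is a current read, stamps a row with our own transaction ID, the same snapshot sees it.

```python
class ReadView:
    """Toy MVCC read view; a sketch only, not InnoDB's real structures."""
    def __init__(self, creator_trx_id, active_ids, low_limit_id):
        self.creator = creator_trx_id   # our own transaction
        self.active = set(active_ids)   # trx still active when the view opened
        self.low_limit = low_limit_id   # ids >= this started after the view

    def visible(self, row_trx_id):
        if row_trx_id == self.creator:          # our own changes are visible
            return True
        if row_trx_id >= self.low_limit:        # started after the snapshot
            return False
        return row_trx_id not in self.active    # committed before the snapshot

# Session 1 (trx 10) opens its view; trx 11 has not started yet.
view = ReadView(creator_trx_id=10, active_ids=[], low_limit_id=11)

row = {"a": 1, "trx_id": 11}        # inserted and committed by session 2 (trx 11)
assert not view.visible(row["trx_id"])   # the snapshot SELECT misses it

# UPDATE is a current read: it locks the latest version regardless of the
# view, modifies it, and stamps it with our transaction ID.
row["trx_id"] = 10
assert view.visible(row["trx_id"])       # later SELECTs in trx 10 now see it
```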
Original article: http://mysql.taobao.org/monthly/2017/06/07/. The original was presented in two parts and appears to be based on MySQL 5.6; it has been merged and reorganized here against MySQL 5.7.