12 flush与fsync，checkpoint与redolog

最新推荐文章于 2023-07-16 13:06:19 发布

tz_sz

最新推荐文章于 2023-07-16 13:06:19 发布

阅读量1.5k

点赞数

分类专栏： mysql

mysql 专栏收录该内容

18 篇文章 4 订阅

订阅专栏

一、flush

log file或data file的write和flush是不同的操作，写后是否立即flush受参数srv_unix_file_flush_method和srv_flush_log_at_trx_commit的影响。

二、主线程中的flush

每1S：若 buffer pool 中的脏页比率超过了 srv_max_buf_pool_modified_pct = 75，则进行刷脏页，flush PCT_IO(100)的 dirty pages = 200；若采用 adaptive flushing，则计算 flush rate，进行必要的 flush。
每10S：若 buffer pool 中的脏页比率超过了 70%，flush PCT_IO(100)的 dirty pages；若 buffer pool 中的脏页比率未超过 70%，flush PCT_IO(10)的 dirty pages = 20。

flush采用buf_flush_list()进行刷脏页。

三、buf_flush_list

1、buf_flush_start

判断当前是否有正在进行的相同类型的 flush，有则直接退出。

2、buf_flush_batch

Flush dirty blocks from the end of the LRU list or flush_list。这里只考虑flush_list。

（1）于是进一步调用buf_flush_flush_list_batch -> buf_flush_page_and_try_neighbors ->buf_flush_try_neighbors -> buf_flush_page ->buf_flush_write_block_low -> log_write_up_to保证脏页对应的日志已经写回日志文件。

（2）之后buf_flush_batch调用buf_flush_buffered_writes，使用fil_io(OS_FILE_WRITE, TRUE)同步写doublewrite buffer，同步写之后，调用 fil_flush 函数将 doublewrite buffer 中的内容 flush 到 disk，然后才使用fil_io(OS_FILE_WRITE | OS_AIO_SIMULATED_WAKE_LATER, FALSE)异步将doublewrite buffer 对应的 dirty pages 写到 disk。之后buf_flush_sync_datafiles()（其中会调用fil_flush()）等待系统完成这所有的IO操作。

3、buf_flush_end

标识当前 flush 操作结束。

4、buf_flush_common

收集当前 flush 操作的统计信息。

四、checkpoint的更新

位于主线程srv_master_thread中，分为每10s一次、1s一次（IO较为繁忙时）。

1、10s一次的log_checkpoint()

This function does not flush dirty blocks from the buffer pool: it only checks what is lsn of the oldest modification in the pool, and writes information about the lsn in log files. 它会调用以下函数：

log_buf_pool_get_oldest_modification()读取系统中，最老的日志序列号。实现简单，读取 flush list 中最老日志对应的 lsn 即可。

log_write_up_to(oldest_lsn, LOG_WAIT_ALL_GROUPS, TRUE)将日志 flush 到 oldest_lsn。

log_groups_write_checkpoint_info-> log_group_checkpoint() Writes the checkpoint info to a log group header。将lsn信息异步写入redo log文件的特定位置（使用fil_io(OS_FILE_WRITE | OS_FILE_LOG, FLASE)）。（注意多个redo log files属于一个group，即一个log_group_t表示，本例只用了redo log，只有一个log_group_t。）

主线程中在之后会使用log_io_complete来处理（进一步调用fil_flush--->fsync()写到磁盘），这里通过独占锁来保证log在写完成之后才能继续运行（参考第九篇《异步IO的实现》）。这里要注意的是，不管是同步IO还是异步IO，都是将内存中的数据写到系统缓存（除非是O_DIRECT，而O_SYNC也是经过系统缓存的），如果需要保证修改过的块立即写到磁盘上，还必须在完成IO之后，调用fsync()完成系统缓存到磁盘的刷新。

******************************************************************************************

a、进一步：fsync()与加上O_SYNC标志的作用类似，直到IO同步才返回；O_DIRECT保证不经过内核缓存，但是不保证在所有数据写完后才返回（好像linux2.6已经保证）。因此最佳保证是fsync+O_DIRECT或O_SYNC+O_DIRECT。但是O_DIRECT用时需谨慎。参考：http://stackoverflow.com/questions/5055859/how-are-the-o-sync-and-o-direct-flags-in-open2-different-alike

b、再看下面这篇文章：http://www.orczhou.com/index.php/2009/08/innodb_flush_method-file-io/

回到MySQL，参数Innodb_flush_method(Linux)可以设定为：Fdatasync、O_DSYNC、O_DIRECT。我们看看这个三个参数是如何影响程序MySQL对日志和数据文件的操作：

fdatasync被认为是安全的，因为在MySQL总会调用fsync来flush数据。使用O_DSYNC是有些风险的，有些OS会忽略该参数O_SYNC。

我们看到O_DIRECT和fdatasync和很类似，但是它会使用O_DIRECT来打开数据文件。有数据表明，如果是大量随机写入操作，O_DIRECT会提升效率。但是顺序写入和读取效率都会降低。所以使用O_DIRECT需要谨慎。

c、关于O_DIRECT：以下代码位于os_file_create_func()（用于打开文件）中：

	/* We disable OS caching (O_DIRECT) only on data files */
	if (type != OS_LOG_FILE
	    && srv_unix_file_flush_method == SRV_UNIX_O_DIRECT) {		
		os_file_set_nocache(file, name, mode_str);
	}

*********************************************************************************************

2、1s一次的log_free_check()

log_free_check->log_check_margins -> log_checkpoint_margin

log_checkpoint_margin根据 oldest_lsn 与 log->lsn(current lsn)之间的差距，判断日志空间是否足够，是否需要进行 flush dirty pages 操作—— buf_flush_list()；根据last_checkpoint_lsn 与 log->lsn 之间的差距，判断是否需要向前推进检查点——log_checkpoint()。

参考资料：《Mysql Innodb Insert Buffer/Checkpoint/Aio 实现分析》何登成

备注：redo log

参考：https://blogs.oracle.com/mysqlinnodb/entry/redo_logging_in_innodb

1、The life cycle of a redo log record is as follows:

The redo log record is first created by a mini transaction and stored in the mini transaction buffer. It has information necessary to redo the same operation again in the time of database recovery.

When mtr_commit() is done, the redo log record is transferred to the global in-memory redo log buffer. If necessary, the redo log buffer will be flushed to the redo log file, to make space for the new redo log record.

The redo log record has a specific lsn associated with it. This association is established during mtr_commit(), when the redo log record is transferred from mtr buffer to log buffer. Once the lsn is established for a redo log record, then its location in the redo log file is also established.

The redo log record is then transferred from the log buffer to redo log file when it is written + flushed. This means that the redo log record is now durably stored on the disk.

Each redo log record has an associated list of dirty pages. This relationship is established via LSN. A redo log record must be flushed to disk, before its associated dirty pages. A redo log record can be discarded only after all the associated dirty pages are flushed to the disk.

A redo log record can be used to re-create its associated list of dirty pages. This happens during database recovery.

2、Redo log file header

redo log files（属于一个group，log_group_t）是一系列的block（512字节），每个文件前四个block用于file header。

The first 4 bytes contain the log group number to which the log file belongs.

The next 8 bytes contain the lsn of the start of data in this log file.

First checkpoint field located in the beginning of the second log block.

Second checkpoint field located in the beginning of the fourth log block.

tz_sz

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
1
评论
12 flush与fsync，checkpoint与redolog

一、flushlog file或data file的write和flush是不同的操作，写后是否立即flush受参数srv_unix_file_flush_method和srv_flush_log_at_trx_commit的影响。二、主线程中的flush每1S：若 buffer pool 中的脏页比率超过了 srv_max_buf_pool_modified_pct =
复制链接

扫一扫

专栏目录