(Original content; please do not repost without permission.)
Part 1: Overview of pagecache write
1. In a pagecache write, i.e. a non-direct-IO write, data arriving through the sys_write system call is first copied into pages allocated in the kernel's pagecache;
2. balance_dirty_pages_ratelimited then checks whether the dirty pages exceed the configured limits; if so, the bdi_writeback_workfn work is woken to perform writeback.
3. The dirty-page thresholds are:
dirty_background_bytes: when the total dirty bytes exceed this value, the writeback work is woken;
dirty_background_ratio: when dirty memory exceeds this percentage of total available memory, the writeback work is woken; dirty_background_ratio and dirty_background_bytes are mutually exclusive, so setting one automatically zeroes the other;
dirty_bytes: when the dirty bytes exceed this value, the writing process starts writeback itself;
dirty_expire_centisecs: data that has been dirty for longer than this interval is written out on the next periodic writeback wakeup;
dirty_ratio: when the dirty pages written by a process exceed this percentage of total available memory, the writeback work is woken; dirty_ratio and dirty_bytes are mutually exclusive, so setting one automatically zeroes the other;
dirty_writeback_centisecs: the writeback work wakes up periodically, with this value as its period;
4. Older kernels used a global pool of pdflush threads; since 2.6.32 the per-bdi mechanism has replaced pdflush: each device registers a bdi device, and each bdi structure carries its own writeback work, so each device can be flushed independently.
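The tunables listed above live under /proc/sys/vm and can be inspected from user space. A minimal sketch, assuming a Linux system (the file names are the standard sysctl entries; the values vary per machine):

```python
from pathlib import Path

# The dirty-writeback sysctls described above (standard names).
TUNABLES = [
    "dirty_background_ratio",
    "dirty_background_bytes",
    "dirty_ratio",
    "dirty_bytes",
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
]

def read_dirty_tunables(base="/proc/sys/vm"):
    """Return {name: value} for each tunable that exists on this system."""
    vals = {}
    for name in TUNABLES:
        p = Path(base) / name
        if p.exists():
            vals[name] = int(p.read_text())
    return vals

if __name__ == "__main__":
    for name, val in sorted(read_dirty_tunables().items()):
        print(f"{name} = {val}")
```

Note that on a stock kernel only one of each ratio/bytes pair is nonzero, which is the mutual exclusion described above.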
Part 2: Main writeback functions
Global variables used by writeback:
struct workqueue_struct *bdi_wq; // every bdi->wb.dwork is queued on this workqueue
LIST_HEAD(bdi_list); // the global bdi_list; every registered bdi device is linked onto it
blk_alloc_queue_node->bdi_init->bdi_wb_init; // bdi initialization
add_disk->bdi_register_dev->bdi_register->list_add_tail_rcu(&bdi->bdi_list, &bdi_list); // registers the bdi device and adds the bdi to bdi_list
A write wakes the flusher along: sys_write->ext4_file_write_iter->__generic_file_write_iter->generic_perform_write->balance_dirty_pages_ratelimited->balance_dirty_pages->bdi_start_background_writeback->bdi_wakeup_thread.
1. bdi initialization
bdi_wb_init initializes the wb structure; the function registered on wb->dwork is bdi_writeback_workfn.
static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
{
memset(wb, 0, sizeof(*wb));
wb->bdi = bdi;
wb->last_old_flush = jiffies;
INIT_LIST_HEAD(&wb->b_dirty);
INIT_LIST_HEAD(&wb->b_io);
INIT_LIST_HEAD(&wb->b_more_io);
spin_lock_init(&wb->list_lock);
INIT_DELAYED_WORK(&wb->dwork, bdi_writeback_workfn);
}
2. Waking the flusher threads
There are four places that wake the flusher threads:
free_more_memory->wakeup_flusher_threads; // used during memory reclaim
sys_sync->wakeup_flusher_threads; // manual sync, to flush dirty pages
do_try_to_free_pages->wakeup_flusher_threads; // the main entry point for reclaiming dirty pages
generic_perform_write->balance_dirty_pages_ratelimited; // every write checks whether writeback is needed
A few example call stacks:
a. alloc_pages_node->__alloc_pages->__alloc_pages_nodemask->__alloc_pages_slowpath->__alloc_pages_direct_reclaim->__perform_reclaim->try_to_free_pages->do_try_to_free_pages
b. hibernate->hibernation_snapshot->hibernate_preallocate_memory->shrink_all_memory->do_try_to_free_pages
c. do_fallocate->ext4_fallocate->ext4_zero_range->ext4_zero_partial_blocks->ext4_block_zero_page_range->create_empty_buffers->alloc_page_buffers->free_more_memory
d. ext4_fill_super->ext4_load_journal->ext4_get_dev_journal->__bread->__getblk->__getblk_slow->free_more_memory
The paths that end up running bdi_writeback_workfn:
- wakeup_flusher_threads is one entry point for waking the writeback work; do_try_to_free_pages is its main caller, and sys_sync is the manual entry point;
void wakeup_flusher_threads(long nr_pages, enum wb_reason reason)
{
	struct backing_dev_info *bdi;

	if (!nr_pages)
		nr_pages = get_nr_dirty_pages();

	rcu_read_lock();
	list_for_each_entry_rcu(bdi, &bdi_list, bdi_list) {
		if (!bdi_has_dirty_io(bdi))
			continue;
		__bdi_start_writeback(bdi, nr_pages, false, reason);
	}
	rcu_read_unlock();
}
- The pagecache write path of sys_write is the other main entry point: each write calls balance_dirty_pages_ratelimited from generic_perform_write, and once the thresholds listed above are exceeded the writeback work is woken;
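From user space, the sys_sync entry point above can be exercised directly: a buffered write only dirties pagecache pages, and sync(2) then wakes the flusher for every bdi. A minimal sketch, assuming a POSIX system (the temp file is purely illustrative):

```python
import os
import tempfile

def buffered_write_then_sync(data: bytes) -> int:
    """Write data through the pagecache, then enter sys_sync ->
    wakeup_flusher_threads via sync(2). Returns the bytes written."""
    fd, path = tempfile.mkstemp()
    try:
        n = os.write(fd, data)  # dirties pagecache pages; no disk I/O yet
    finally:
        os.close(fd)            # close alone does not flush a regular file
    os.sync()                   # wakes the flusher for every registered bdi
    os.unlink(path)
    return n

if __name__ == "__main__":
    print(buffered_write_then_sync(b"dirty page demo\n"))
```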
Part 3: Processing after sys_close
We usually assume that a non-direct-IO write first puts the data into the pagecache and that the background writeback work later flushes those dirty pages to the HDD; in practice the path is not always exactly that.
The sys_close system call runs sys_close->__close_fd->filp_close->fput.
Let's look at what fput does:
void fput(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		struct task_struct *task = current;

		if (likely(!in_interrupt() && !(task->flags & PF_KTHREAD))) {
			init_task_work(&file->f_u.fu_rcuhead, ____fput);
			if (!task_work_add(task, &file->f_u.fu_rcuhead, true))
				return;
			/*
			 * After this task has run exit_task_work(),
			 * task_work_add() will fail.  Fall through to delayed
			 * fput to avoid leaking *file.
			 */
		}

		if (llist_add(&file->f_u.fu_llist, &delayed_fput_list))
			schedule_delayed_work(&delayed_fput_work, 1);
	}
}
fput initializes a task_work whose function is ____fput and adds it to the current task.
When the system call exits, do_notify_resume runs all pending task_work items on current; since sys_close added one via task_work_add, ____fput is called at system-call exit.
For a block device file, this final fput synchronously writes the dirty pages to the HDD. That is why, after a non-direct-IO write completes, the writing process stalls for a while at close: that time is spent synchronously flushing the dirty pages.
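The synchronous flush at close applies to block device files; for a regular file, close(2) returns without flushing, and an explicit fsync(2) is what forces the dirty pages out. A sketch contrasting the two (the size and temp paths are illustrative, and absolute timings depend on the backing device):

```python
import os
import tempfile
import time

def timed_close_vs_fsync(size: int = 4 * 1024 * 1024):
    """Return (close_seconds, fsync_seconds) for a buffered write of size bytes."""
    buf = b"\0" * size

    # Path 1: write + close. close() drops the file reference but leaves
    # the pages dirty for the background writeback work.
    fd, p1 = tempfile.mkstemp()
    os.write(fd, buf)
    t0 = time.perf_counter()
    os.close(fd)
    t_close = time.perf_counter() - t0
    os.unlink(p1)

    # Path 2: write + fsync. fsync() synchronously flushes the dirty
    # pages, the same effect the blkdev_close path has at the last fput.
    fd, p2 = tempfile.mkstemp()
    os.write(fd, buf)
    t0 = time.perf_counter()
    os.fsync(fd)
    t_fsync = time.perf_counter() - t0
    os.close(fd)
    os.unlink(p2)
    return t_close, t_fsync
```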
Part 4: system_call
Here is the system_call entry path for reference; the code is in arch/x86/kernel/entry_64.S:
ENTRY(system_call)
.......
call *sys_call_table(,%rax,8) # XXX: rip relative
......
int_signal:
testl $_TIF_DO_NOTIFY_MASK,%edx
jz 1f
movq %rsp,%rdi # &ptregs -> arg1
xorl %esi,%esi # oldset -> arg2
call do_notify_resume
1: movl $_TIF_WORK_MASK,%edi
int_restore_rest:
RESTORE_REST
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
jmp int_with_check
CFI_ENDPROC
END(system_call)
Finally, let's see how writeback proceeds through do_notify_resume, i.e. do_notify_resume->task_work_run->____fput->__blkdev_put->__sync_blockdev->filemap_write_and_wait->do_writepages->_submit_bh.....; the call trace below, captured from a dd process, shows this path:
[14779.469828] CPU: 1 PID: 9430 Comm: dd Tainted: G OE 3.16.4-0920 #6
[14779.489322] Hardware name: System manufacturer System Product Name/Z170-A, BIOS 1602 01/07/2016
[14779.509117] ffff880266706540 ffff88002d577960 ffffffff81726ce6 ffff880265c6ab80
[14779.528942] ffff88002d577970 ffffffffa0339078 ffff88002d5779a8 ffffffff8111017e
[14779.548593] ffffffffa033b020 0000000000000296 ffffffff8134b9d1 ffff88002d577a30
[14779.568411] Call Trace:
[14779.588049] [<ffffffff81726ce6>] dump_stack+0x4e/0x7a
[14779.608005] [<ffffffffa0339078>] entry_handler+0x68/0x73
[14779.628326] [<ffffffff8111017e>] pre_handler_kretprobe+0x9e/0x1c0
[14779.648794] [<ffffffff8134b9d1>] ? blk_queue_bio+0x1/0x380
[14779.669496] [<ffffffff8105398c>] kprobe_ftrace_handler+0xdc/0x130
[14779.690081] [<ffffffff8134b9d0>] ? blk_flush_plug_list+0x220/0x220
[14779.710423] [<ffffffff81347960>] ? generic_make_request+0xc0/0x110
[14779.730497] [<ffffffff811239de>] ftrace_ops_list_func+0xae/0x170
[14779.750658] [<ffffffff817311f2>] ftrace_regs_call+0x5/0x77
[14779.770846] [<ffffffff8116611f>] ? mempool_alloc+0x4f/0x130
[14779.791056] [<ffffffff8134b9d1>] ? blk_queue_bio+0x1/0x380
[14779.811211] [<ffffffff8134b9d5>] ? blk_queue_bio+0x5/0x380
[14779.831096] [<ffffffff81347960>] ? generic_make_request+0xc0/0x110
[14779.851102] [<ffffffff81347a19>] submit_bio+0x69/0x130
[14779.871130] [<ffffffff81342406>] ? bio_alloc_bioset+0x1a6/0x2b0
[14779.891139] [<ffffffff8120641d>] _submit_bh+0x13d/0x230
[14779.910951] [<ffffffff81209445>] __block_write_full_page.constprop.35+0x125/0x370
[14779.931048] [<ffffffff81209a80>] ? I_BDEV+0x10/0x10
[14779.951139] [<ffffffff81209a80>] ? I_BDEV+0x10/0x10
[14779.970969] [<ffffffff81209776>] block_write_full_page+0xe6/0x100
[14779.991104] [<ffffffff8120a358>] blkdev_writepage+0x18/0x20
[14780.011454] [<ffffffff8116dd43>] __writepage+0x13/0x50
[14780.032006] [<ffffffff8116e848>] write_cache_pages+0x238/0x4c0
[14780.052338] [<ffffffff8116dd30>] ? global_dirtyable_memory+0x50/0x50
[14780.072510] [<ffffffff8116eb13>] generic_writepages+0x43/0x60
[14780.092459] [<ffffffff811701ce>] do_writepages+0x1e/0x40
[14780.112405] [<ffffffff81164b19>] __filemap_fdatawrite_range+0x59/0x60
[14780.132267] [<ffffffff81164b7c>] filemap_write_and_wait+0x2c/0x60
[14780.151908] [<ffffffff8120ae1f>] __sync_blockdev+0x1f/0x40
[14780.171549] [<ffffffff8120b16c>] __blkdev_put+0x5c/0x1a0
[14780.190981] [<ffffffff8172e583>] ? _raw_spin_unlock_irq+0x23/0x60
[14780.210257] [<ffffffff8120bbbe>] blkdev_put+0x4e/0x140
[14780.229202] [<ffffffff8120bd65>] blkdev_close+0x25/0x30
[14780.247950] [<ffffffff811d5b63>] __fput+0xd3/0x210
[14780.266483] [<ffffffff811d5cee>] ____fput+0xe/0x10
[14780.285180] [<ffffffff8108f6d7>] task_work_run+0xa7/0xe0
[14780.303859] [<ffffffff81014f37>] do_notify_resume+0x97/0xb0
[14780.322490] [<ffffffff8172ee6a>] int_signal+0x12/0x17
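The writeback activity in the trace above can also be observed from user space by watching the Dirty counter in /proc/meminfo before and after a buffered write. A small sketch, assuming Linux (the field name is standard; the value varies per machine):

```python
import re

def dirty_kb(meminfo="/proc/meminfo"):
    """Return the current amount of dirty pagecache in kB, or -1 if absent."""
    with open(meminfo) as f:
        for line in f:
            m = re.match(r"Dirty:\s+(\d+)\s+kB", line)
            if m:
                return int(m.group(1))
    return -1

if __name__ == "__main__":
    print("Dirty:", dirty_kb(), "kB")
```

Sampling this value while a dd writer runs shows it rise during the write and fall as the flusher (or the synchronous flush at close) drains the dirty pages.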