Original link: Linux Kernel Processes Explained, Part 2: bdi-default
Repost notes (last updated: 2019-04-03, 12:26:50)
- This article is reposted only to capture what "bdi" means and the basic working principles behind it, in case the original becomes unreachable.
- The formatting has been adjusted to the reposter's taste; the content itself is unchanged. Annotations and supplements have been added based on the reposter's own reading.
- This article applies to CentOS 6.x. On CentOS 7.x the bdi-default and flush-x:y kernel threads are no longer visible in the system: they have been merged into the "workqueues" mechanism (see Working on workqueues and backing-dev: replace private thread pool with workqueue).
bdi is short for "backing device info", i.e. the descriptive information for a backing storage device. The kernel represents it with the structure backing_dev_info: http://lxr.linux.no/#linux+v2.6.38.8/include/linux/backing-dev.h#L62.
A backing device is, simply put, a device that can store data persistently: its contents survive a power-off. By that definition, floppy drives, optical drives, USB storage devices, and hard disks are all backing devices (referred to as bdi devices below), while RAM clearly is not; see this link for details: (link no longer available).
Compared with RAM, bdi devices (most commonly hard disks) are very slow to read and write. To improve overall system performance, Linux therefore buffers reads and writes to bdi devices: the data is kept temporarily in memory so that not every operation touches the device directly. But the dirty data must then be synchronized to the bdi device at appropriate moments (for example every 5 seconds, or when dirty pages reach a certain ratio), otherwise it can be lost while lingering in memory (say, on a sudden crash or reboot). The process that used to perform this periodic synchronization was named pdflush, but at some point in the 2.6.2x/3x kernels (the exact minor version is unclear; see, e.g., Linux 2.6.35, kill unnecessary bdi wakeups + cleanups) it was reworked into several kernel threads, bdi-default, flush-x:y, and so on, which is what these two articles cover.
Note added 2019-04-03; for details see Linux Page Cache Basics:
- Up to Version 2.6.31 of the Kernel: pdflush
Up to and including the 2.6.31 version of the Linux kernel, the pdflush threads ensured that dirty pages were periodically written to the underlying storage device.
- As of Version 2.6.32: per-backing-device based writeback
Since pdflush had several performance disadvantages, Jens Axboe developed a new, more effective writeback mechanism for Linux Kernel version 2.6.32.
We will not dwell on the old pdflush; here we only discuss bdi-default and flush-x:y. The relationship and working model of these processes (in fact there are several flush-x:y processes) resemble lighttpd's classic parent/child daemon model. Since many people are unfamiliar with lighttpd's process model, it is explained in detail below.
A Linux system typically mounts many bdi devices. When a bdi device is registered (function bdi_register(…)), it is linked onto the global list bdi_list. There is one special exception: the default bdi device (default_backing_dev_info). Besides being added to bdi_list, it also spawns a new kernel thread, bdi-default, the protagonist of this article. The code follows; you will immediately notice the key calls kthread_run and list_add_tail_rcu.
struct backing_dev_info default_backing_dev_info = {
        .name           = "default",
        .ra_pages       = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
        .state          = 0,
        .capabilities   = BDI_CAP_MAP_COPY,
        .unplug_io_fn   = default_unplug_io_fn,
};
EXPORT_SYMBOL_GPL(default_backing_dev_info);

static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
{
        return bdi == &default_backing_dev_info;
}

int bdi_register(struct backing_dev_info *bdi, struct device *parent,
                const char *fmt, ...)
{
        va_list args;
        struct device *dev;

        if (bdi->dev)   /* The driver needs to use separate queues per device */
                return 0;

        va_start(args, fmt);
        dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
        va_end(args);
        if (IS_ERR(dev))
                return PTR_ERR(dev);

        bdi->dev = dev;

        /*
         * Just start the forker thread for our default backing_dev_info,
         * and add other bdi's to the list. They will get a thread created
         * on-demand when they need it.
         */
        if (bdi_cap_flush_forker(bdi)) {
                struct bdi_writeback *wb = &bdi->wb;

                wb->task = kthread_run(bdi_forker_thread, wb, "bdi-%s",
                                       dev_name(dev));
                if (IS_ERR(wb->task))
                        return PTR_ERR(wb->task);
        }

        bdi_debug_register(bdi, dev_name(dev));
        set_bit(BDI_registered, &bdi->state);

        spin_lock_bh(&bdi_lock);
        list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
        spin_unlock_bh(&bdi_lock);

        trace_writeback_bdi_register(bdi);
        return 0;
}
EXPORT_SYMBOL(bdi_register);
Next we follow the function bdi_forker_thread, the body of the bdi-default kernel thread:
static int bdi_forker_thread(void *ptr)
{
        struct bdi_writeback *me = ptr;

        current->flags |= PF_SWAPWRITE;
        set_freezable();

        /*
         * Our parent may run at a different priority, just set us to normal
         */
        set_user_nice(current, 0);

        for (;;) {
                struct task_struct *task = NULL;
                struct backing_dev_info *bdi;
                enum {
                        NO_ACTION,   /* Nothing to do */
                        FORK_THREAD, /* Fork bdi thread */
                        KILL_THREAD, /* Kill inactive bdi thread */
                } action = NO_ACTION;

                /*
                 * Temporary measure, we want to make sure we don't see
                 * dirty data on the default backing_dev_info
                 */
                if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list)) {
                        del_timer(&me->wakeup_timer);
                        wb_do_writeback(me, 0);
                }

                spin_lock_bh(&bdi_lock);
                set_current_state(TASK_INTERRUPTIBLE);

                list_for_each_entry(bdi, &bdi_list, bdi_list) {
                        bool have_dirty_io;

                        if (!bdi_cap_writeback_dirty(bdi) ||
                             bdi_cap_flush_forker(bdi))
                                continue;

                        WARN(!test_bit(BDI_registered, &bdi->state),
                             "bdi %p/%s is not registered!\n", bdi, bdi->name);

                        have_dirty_io = !list_empty(&bdi->work_list) ||
                                        wb_has_dirty_io(&bdi->wb);

                        /*
                         * If the bdi has work to do, but the thread does not
                         * exist - create it.
                         */
                        if (!bdi->wb.task && have_dirty_io) {
                                /*
                                 * Set the pending bit - if someone will try to
                                 * unregister this bdi - it'll wait on this bit.
                                 */
                                set_bit(BDI_pending, &bdi->state);
                                action = FORK_THREAD;
                                break;
                        }

                        spin_lock(&bdi->wb_lock);

                        /*
                         * If there is no work to do and the bdi thread was
                         * inactive long enough - kill it. The wb_lock is taken
                         * to make sure no-one adds more work to this bdi and
                         * wakes the bdi thread up.
                         */
                        if (bdi->wb.task && !have_dirty_io &&
                            time_after(jiffies, bdi->wb.last_active +
                                                bdi_longest_inactive())) {
                                task = bdi->wb.task;
                                bdi->wb.task = NULL;
                                spin_unlock(&bdi->wb_lock);
                                set_bit(BDI_pending, &bdi->state);
                                action = KILL_THREAD;
                                break;
                        }
                        spin_unlock(&bdi->wb_lock);
                }
                spin_unlock_bh(&bdi_lock);

                /* Keep working if default bdi still has things to do */
                if (!list_empty(&me->bdi->work_list))
                        __set_current_state(TASK_RUNNING);

                switch (action) {
                case FORK_THREAD:
                        __set_current_state(TASK_RUNNING);
                        task = kthread_create(bdi_writeback_thread, &bdi->wb,
                                              "flush-%s", dev_name(bdi->dev));
                        if (IS_ERR(task)) {
                                /*
                                 * If thread creation fails, force writeout of
                                 * the bdi from the thread.
                                 */
                                bdi_flush_io(bdi);
                        } else {
                                /*
                                 * The spinlock makes sure we do not lose
                                 * wake-ups when racing with 'bdi_queue_work()'.
                                 * And as soon as the bdi thread is visible, we
                                 * can start it.
                                 */
                                spin_lock_bh(&bdi->wb_lock);
                                bdi->wb.task = task;
                                spin_unlock_bh(&bdi->wb_lock);
                                wake_up_process(task);
                        }
                        break;

                case KILL_THREAD:
                        __set_current_state(TASK_RUNNING);
                        kthread_stop(task);
                        break;

                case NO_ACTION:
                        if (!wb_has_dirty_io(me) || !dirty_writeback_interval)
                                /*
                                 * There are no dirty data. The only thing we
                                 * should now care about is checking for
                                 * inactive bdi threads and killing them. Thus,
                                 * let's sleep for longer time, save energy and
                                 * be friendly for battery-driven devices.
                                 */
                                schedule_timeout(bdi_longest_inactive());
                        else
                                schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
                        try_to_freeze();
                        /* Back to the main loop */
                        continue;
                }

                /*
                 * Clear pending bit and wakeup anybody waiting to tear us down.
                 */
                clear_bit(BDI_pending, &bdi->state);
                smp_mb__after_clear_bit();
                wake_up_bit(&bdi->state, BDI_pending);
        }

        return 0;
}
The code looks long, but the logic is simple: an infinite for loop containing a list_for_each_entry pass over every bdi device on bdi_list, checking whether the corresponding flush-x:y kernel thread exists, what state it is in, and whether any action (kill or create) is needed.
Most bdi devices have a corresponding flush-x:y kernel thread, except for a few special ones such as the default bdi device and other in-memory virtual bdi devices, as the first if check shows:
if (!bdi_cap_writeback_dirty(bdi) ||
     bdi_cap_flush_forker(bdi))
        continue;
What the flush-x:y kernel thread actually does is left for the next article, but for now we need to know that if a bdi device currently has dirty data to synchronize, its flush-x:y kernel thread gets created (assuming, of course, that it does not already exist):
have_dirty_io = !list_empty(&bdi->work_list) ||
                wb_has_dirty_io(&bdi->wb);

/*
 * If the bdi has work to do, but the thread does not
 * exist - create it.
 */
if (!bdi->wb.task && have_dirty_io) {
        /*
         * Set the pending bit - if someone will try to
         * unregister this bdi - it'll wait on this bit.
         */
        set_bit(BDI_pending, &bdi->state);
        action = FORK_THREAD;
        break;
}
action is marked FORK_THREAD, and the actual creation of the flush-x:y kernel thread then happens in the switch (action) body (note the break inside the if: it jumps out of the list_for_each_entry loop).
If a bdi device currently has no dirty data to synchronize, and its flush-x:y kernel thread has been inactive for long enough (determined by comparing the last activity time last_active against the current jiffies), the thread gets killed:
/*
 * If there is no work to do and the bdi thread was
 * inactive long enough - kill it. The wb_lock is taken
 * to make sure no-one adds more work to this bdi and
 * wakes the bdi thread up.
 */
if (bdi->wb.task && !have_dirty_io &&
    time_after(jiffies, bdi->wb.last_active +
                        bdi_longest_inactive())) {
        task = bdi->wb.task;
        bdi->wb.task = NULL;
        spin_unlock(&bdi->wb_lock);
        set_bit(BDI_pending, &bdi->state);
        action = KILL_THREAD;
        break;
}

/*
 * Calculate the longest interval (jiffies) bdi threads are allowed to be
 * inactive.
 */
static unsigned long bdi_longest_inactive(void)
{
        unsigned long interval;

        interval = msecs_to_jiffies(dirty_writeback_interval * 10);
        return max(5UL * 60 * HZ, interval);
}

unsigned int dirty_writeback_interval = 5 * 100; /* centiseconds */
As you can see, "long enough" is 5 minutes by default. In that case action is marked KILL_THREAD, and the actual kill of the flush-x:y kernel thread happens in the following switch (action) body.
Once the pass over all bdi devices is finished, the bdi-default kernel thread itself executes the NO_ACTION branch of the switch (action) body and sleeps, until the timeout expires and continue repeats the work above.