Original link: Linux Kernel Processes Explained, Part 2: bdi-default
Repost notes (last updated: 2019-04-03, 12:26:50)
- This article is reposted only to capture what "bdi" means and the basic working principles behind it, in case the original becomes unreachable.
- The formatting has been adjusted to the reposter's taste; the content itself is unchanged. Annotations and supplements have been added based on the reposter's own reading.
- This article applies to CentOS 6.x. On CentOS 7.x the bdi-default and flush-x:y kernel threads are no longer visible in the system: they have been merged into the "workqueues" mechanism (see Working on workqueues and backing-dev: replace private thread pool with workqueue).
bdi is short for "backing device info", i.e. the descriptive information for a backing storage device. The kernel represents it with the structure backing_dev_info: http://lxr.linux.no/#linux+v2.6.38.8/include/linux/backing-dev.h#L62.
A backing device is, simply put, a device that can store data persistently: its contents survive a power-off. By that definition, floppy drives, optical drives, USB storage devices, and hard disks are all backing devices (referred to as bdi devices below), while RAM clearly is not; see this link for details: (link no longer available).
Compared with RAM, bdi devices (most commonly hard disks) are very slow to read and write. To improve overall system performance, Linux therefore buffers reads and writes to bdi devices: the data is kept temporarily in memory so that not every operation touches the device directly. But the dirty data must then be synchronized to the bdi device at appropriate moments (for example every 5 seconds, or when dirty pages reach a certain ratio), otherwise it can be lost while lingering in memory (say, on a sudden crash or reboot). The process that used to perform this periodic synchronization was named pdflush, but at some point in the 2.6.2x/3x kernels (the exact minor version is unclear; see, e.g., Linux 2.6.35, kill unnecessary bdi wakeups + cleanups) it was reworked into several kernel threads, bdi-default, flush-x:y, and so on, which is what these two articles cover.
Note added 2019-04-03; for details see Linux Page Cache Basics:
- Up to Version 2.6.31 of the Kernel: pdflush
Up to and including the 2.6.31 version of the Linux kernel, the pdflush threads ensured that dirty pages were periodically written to the underlying storage device.
- As of Version 2.6.32: per-backing-device based writeback
Since pdflush had several performance disadvantages, Jens Axboe developed a new, more effective writeback mechanism for Linux Kernel version 2.6.32.
We will not dwell on the old pdflush; here we only discuss bdi-default and flush-x:y. The relationship and working model of these processes (in fact there are several flush-x:y processes) resemble lighttpd's classic parent/child daemon model. Since many people are unfamiliar with lighttpd's process model, it is explained in detail below.
A Linux system typically mounts many bdi devices. When a bdi device is registered (function bdi_register(…)), it is linked onto the global list bdi_list. There is one special exception: the default bdi device (default_backing_dev_info). Besides being added to bdi_list, it also spawns a new kernel thread, bdi-default, the protagonist of this article. The code follows; you will immediately notice the key calls kthread_run and list_add_tail_rcu.
struct backing_dev_info default_backing_dev_info = {
        .name           = "default",
        .ra_pages       = VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
        .state          = 0,
        .capabilities   = BDI_CAP_MAP_COPY,
        .unplug_io_fn   = default_unplug_io_fn,
};
EXPORT_SYMBOL_GPL(default_backing_dev_info);

static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
{
        return bdi == &default_backing_dev_info;
}

int bdi_register(struct backing_dev_info *bdi, struct device *parent,
                const char *fmt, ...)
{
        va_list args;
        struct device *dev;

        if (bdi->dev)   /* The driver needs to use separate queues per device */
                return 0;

        va_start(args, fmt);
        dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
        va_end(args);
        if (IS_ERR(dev))
                return PTR_ERR(dev);

        bdi->dev = dev;

        /*
         * Just start the forker thread for our default backing_dev_info,
         * and add other bdi's to the list. They will get a thread created
         * on-demand when they need it.
         */
        if (bdi_cap_flush_forker(bdi)) {
                struct bdi_writeback *wb = &bdi->wb;

                wb->task = kthread_run(bdi_forker_thread, wb, "bdi-%s",
                                       dev_name(dev));
                if (IS_ERR(wb->task))
                        return PTR_ERR(wb->task);
        }

        bdi_debug_register(bdi, dev_name(dev));
        set_bit(BDI_registered, &bdi->state);

        spin_lock_bh(&bdi_lock);
        list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
        spin_unlock_bh(&bdi_lock);

        trace_writeback_bdi_register(bdi);
        return 0;
}
EXPORT_SYMBOL(bdi_register);
Next we follow the function bdi_forker_thread, the body of the bdi-default kernel thread:
static int bdi_forker_thread(void *ptr)
{
        struct bdi_writeback *me = ptr;

        current->flags |= PF_SWAPWRITE;
        set_freezable();

        /*
         * Our parent may run at a different priority, just set us to normal
         */
        set_user_nice(current, 0);

        for (;;) {
                struct task_struct *task = NULL;
                struct backing_dev_info *bdi;
                enum {
                        NO_ACTION,   /* Nothing to do */
                        FORK_THREAD, /* Fork bdi thread */
                        KILL_THREAD, /* Kill inactive bdi thread */
                } action = NO_ACTION;

                /*
                 * Temporary measure, we want to make sure we don't see
                 * dirty data on the default backing_dev_info
                 */
                if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list)) {
                        del_timer(&me->wakeup_timer);
                        wb_do_writeback(me, 0);
                }

                spin_lock_bh(&bdi_lock);
                set_current_state(TASK_INTERRUPTIBLE);

                list_for_each_entry(bdi, &bdi_list, bdi_list) {
                        bool have_dirty_io;

                        if (!bdi_cap_writeback_dirty(bdi) ||
                             bdi_cap_flush_forker(bdi))
                                continue;

                        WARN(!test_bit(BDI_registered, &bdi->state),
                             "bdi %p/%s is not registered!\n", bdi, bdi->name);

                        have_dirty_io = !list_empty(&bdi->work_list) ||
                                        wb_has_dirty_io(&bdi->wb);

                        /*
                         * If the bdi has work to do, but the thread does not
                         * exist - create it.
                         */
                        if (!bdi->wb.task && have_dirty_io) {
                                /*
                                 * Set the pending bit - if someone will try to
                                 * unregister this bdi - it'll wait on this bit.
                                 */
                                set_bit(BDI_pending, &bdi->state);
                                action = FORK_THREAD;
                                break;
                        }

                        spin_lock(&bdi->wb_lock);

                        /*
                         * If there is no work to do and the bdi thread was
                         * inactive long enough - kill it. The wb_lock is taken
                         * to make sure no-one adds more work to this bdi and
                         * wakes the bdi thread up.
                         */
                        if (bdi->wb.task && !have_dirty_io &&
                            time_after(jiffies, bdi->wb.last_active +
                                                bdi_longest_inactive())) {
                                task = bdi->wb.task;
                                bdi->wb.task = NULL;
                                spin_unlock(&bdi->wb_lock);
                                set_bit(BDI_pending, &bdi->state);
                                action = KILL_THREAD;
                                break;
                        }
                        spin_unlock(&bdi->wb_lock);
                }
                spin_unlock_bh(&bdi_lock);

                /* Keep working if default bdi still has things to do */
                if (!list_empty(&me->bdi->work_list))
                        __set_current_state(TASK_RUNNING);

                switch (action) {
                case FORK_THREAD:
                        __set_current_state(TASK_RUNNING);
                        task = kthread_create(bdi_writeback_thread, &bdi->wb,
                                              "flush-%s", dev_name(bdi->dev));
                        if (IS_ERR(task)) {
                                /*
                                 * If thread creation fails, force writeout of
                                 * the bdi from the thread.
                                 */
                                bdi_flush_io(bdi);
                        } else {
                                /*
                                 * The spinlock makes sure we do not lose
                                 * wake-ups when racing with 'bdi_queue_work()'.
                                 * And as soon as the bdi thread is visible, we
                                 * can start it.
                                 */
                                spin_lock_bh(&bdi->wb_lock);
                                bdi->wb.task = task;
                                spin_unlock_bh(&bdi->wb_lock);
                                wake_up_process(task);
                        }
                        break;

                case KILL_THREAD:
                        __set_current_state(TASK_RUNNING);
                        kthread_stop(task);
                        break;

                case NO_ACTION:
                        if (!wb_has_dirty_io(me) || !dirty_writeback_interval)
                                /*
                                 * There are no dirty data. The only thing we
                                 * should now care about is checking for
                                 * inactive bdi threads and killing them. Thus,
                                 * let's sleep for longer time, save energy and
                                 * be friendly for battery-driven devices.
                                 */
                                schedule_timeout(bdi_longest_inactive());
                        else
                                schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
                        try_to_freeze();
                        /* Back to the main loop */
                        continue;
                }

                /*
                 * Clear pending bit and wakeup anybody waiting to tear us down.
                 */
                clear_bit(BDI_pending, &bdi->state);
                smp_mb__after_clear_bit();
                wake_up_bit(&bdi->state, BDI_pending);
        }

        return 0;
}
The code looks long, but the logic is simple: an infinite for loop containing a list_for_each_entry pass over every bdi device on bdi_list, checking whether the corresponding flush-x:y kernel thread exists, what state it is in, and whether any action (kill or create) is needed.
Most bdi devices have a corresponding flush-x:y kernel thread, except for a few special ones such as the default bdi device and other in-memory virtual bdi devices, as the first if check shows:
if (!bdi_cap_writeback_dirty(bdi) ||
     bdi_cap_flush_forker(bdi))
        continue;
What the flush-x:y kernel thread actually does is left for the next article, but for now we need to know that if a bdi device currently has dirty data to synchronize, its flush-x:y kernel thread gets created (assuming, of course, that it does not already exist):
have_dirty_io = !list_empty(&bdi->work_list) ||
                wb_has_dirty_io(&bdi->wb);

/*
 * If the bdi has work to do, but the thread does not
 * exist - create it.
 */
if (!bdi->wb.task && have_dirty_io) {
        /*
         * Set the pending bit - if someone will try to
         * unregister this bdi - it'll wait on this bit.
         */
        set_bit(BDI_pending, &bdi->state);
        action = FORK_THREAD;
        break;
}
action is marked FORK_THREAD, and the actual creation of the flush-x:y kernel thread then happens in the switch (action) body (note the break inside the if: it jumps out of the list_for_each_entry loop).
If a bdi device currently has no dirty data to synchronize, and its flush-x:y kernel thread has been inactive for long enough (determined by comparing the last activity time last_active against the current jiffies), the thread gets killed:
/*
 * If there is no work to do and the bdi thread was
 * inactive long enough - kill it. The wb_lock is taken
 * to make sure no-one adds more work to this bdi and
 * wakes the bdi thread up.
 */
if (bdi->wb.task && !have_dirty_io &&
    time_after(jiffies, bdi->wb.last_active +
                        bdi_longest_inactive())) {
        task = bdi->wb.task;
        bdi->wb.task = NULL;
        spin_unlock(&bdi->wb_lock);
        set_bit(BDI_pending, &bdi->state);
        action = KILL_THREAD;
        break;
}

/*
 * Calculate the longest interval (jiffies) bdi threads are allowed to be
 * inactive.
 */
static unsigned long bdi_longest_inactive(void)
{
        unsigned long interval;

        interval = msecs_to_jiffies(dirty_writeback_interval * 10);
        return max(5UL * 60 * HZ, interval);
}

unsigned int dirty_writeback_interval = 5 * 100; /* centiseconds */
As you can see, "long enough" is 5 minutes by default. In that case action is marked KILL_THREAD, and the actual kill of the flush-x:y kernel thread happens in the following switch (action) body.
Once the pass over all bdi devices is finished, the bdi-default kernel thread itself executes the NO_ACTION branch of the switch (action) body and sleeps, until the timeout expires and continue repeats the work above.