[Repost] Linux Kernel Processes Explained, Part 2: bdi-default

Original article: Linux 内核进程详解之二: bdi-default


Repost notes (last updated: 2019-04-03, 12:26:50)

  1. This article is reposted primarily to understand what "bdi" stands for and the basic working principles and mechanisms behind it, and to preserve the text in case the original becomes inaccessible.
  2. The blogger adjusted the formatting of the original to taste without changing its content, and added notes and supplementary material based on further reading.
  3. This article applies to CentOS 6.x. On CentOS 7.x the bdi-default and flush-x:y kernel threads are no longer visible in the system; they have been merged into the new "workqueues" mechanism (see Working on workqueues and backing-dev: replace private thread pool with workqueue).

bdi is short for "backing device info", i.e., the descriptive information associated with a backing storage device. In the kernel it is represented by the structure backing_dev_info: http://lxr.linux.no/#linux+v2.6.38.8/include/linux/backing-dev.h#L62.

A backing storage device is, simply put, a device that can store data and retain it even when the computer is powered off. By that definition, floppy drives, optical drives, USB storage devices, and hard disks are all backing storage devices (referred to as bdi devices below), while RAM clearly is not; see this link for details: (link no longer available).

Compared with RAM, bdi devices (most commonly hard disks) are very slow to read and write. To improve overall system performance, Linux therefore buffers reads and writes to bdi devices: the data is kept temporarily in memory so that not every access touches the bdi device itself. That data then has to be synced to the bdi device at appropriate moments (for example every 5 seconds, or when dirty data reaches a certain ratio), since data that lingers in memory is easily lost (for example on a sudden crash or reboot). The process that used to do this periodic sync work was called pdflush, but somewhere in kernel 2.6.2x/3x (the exact minor version was not noted; see, for example, Linux 2.6.35, kill unnecessary bdi wakeups + cleanups) it was reworked and replaced by several kernel threads, bdi-default, flush-x:y, and so on, which is what these two articles cover.


Note added on 2019-04-03; for details see Linux Page Cache Basics:

  • Up to Version 2.6.31 of the Kernel: pdflush

    Up to and including the 2.6.31 version of the Linux kernel, the pdflush threads ensured that dirty pages were periodically written to the underlying storage device.

  • As of Version 2.6.32: per-backing-device based writeback

    Since pdflush had several performance disadvantages, Jens Axboe developed a new, more effective writeback mechanism for Linux Kernel version 2.6.32.


We will not dwell on the old pdflush here, and only discuss bdi-default and flush-x:y. The relationship and working model of these threads (in fact there are multiple flush-x:y threads) resembles lighttpd's classic parent/child daemon model. Since many people are unfamiliar with lighttpd's process model, it is explained in detail below.

In general, a Linux system mounts many bdi devices. When a bdi device is registered (function bdi_register(...)), it is linked onto the global list bdi_list. One bdi device is special: the default bdi device (default_backing_dev_info). Besides being added to bdi_list, registering it also spawns a new bdi-default kernel thread, the protagonist of this article. The code is shown below; the key calls kthread_run and list_add_tail_rcu should jump out at you.

struct backing_dev_info default_backing_dev_info = {
	.name		= "default",
	.ra_pages	= VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
	.state		= 0,
	.capabilities	= BDI_CAP_MAP_COPY,
	.unplug_io_fn	= default_unplug_io_fn,
};
EXPORT_SYMBOL_GPL(default_backing_dev_info);

static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
{
	return bdi == &default_backing_dev_info;
}

int bdi_register(struct backing_dev_info *bdi, struct device *parent,
		const char *fmt, ...)
{
	va_list args;
	struct device *dev;

	if (bdi->dev)	/* The driver needs to use separate queues per device */
		return 0;

	va_start(args, fmt);
	dev = device_create_vargs(bdi_class, parent, MKDEV(0, 0), bdi, fmt, args);
	va_end(args);
	if (IS_ERR(dev))
		return PTR_ERR(dev);

	bdi->dev = dev;

	/*
	 * Just start the forker thread for our default backing_dev_info,
	 * and add other bdi's to the list. They will get a thread created
	 * on-demand when they need it.
	 */
	if (bdi_cap_flush_forker(bdi)) {
		struct bdi_writeback *wb = &bdi->wb;

		wb->task = kthread_run(bdi_forker_thread, wb, "bdi-%s",
						dev_name(dev));
		if (IS_ERR(wb->task))
			return PTR_ERR(wb->task);
	}

	bdi_debug_register(bdi, dev_name(dev));
	set_bit(BDI_registered, &bdi->state);

	spin_lock_bh(&bdi_lock);
	list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
	spin_unlock_bh(&bdi_lock);

	trace_writeback_bdi_register(bdi);
	return 0;
}
EXPORT_SYMBOL(bdi_register);

Now let's follow the function bdi_forker_thread, which is the body of the bdi-default kernel thread:

static int bdi_forker_thread(void *ptr)
{
	struct bdi_writeback *me = ptr;

	current->flags |= PF_SWAPWRITE;
	set_freezable();

	/*
	 * Our parent may run at a different priority, just set us to normal
	 */
	set_user_nice(current, 0);

	for (;;) {
		struct task_struct *task = NULL;
		struct backing_dev_info *bdi;
		enum {
			NO_ACTION,   /* Nothing to do */
			FORK_THREAD, /* Fork bdi thread */
			KILL_THREAD, /* Kill inactive bdi thread */
		} action = NO_ACTION;

		/*
		 * Temporary measure, we want to make sure we don't see
		 * dirty data on the default backing_dev_info
		 */
		if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list)) {
			del_timer(&me->wakeup_timer);
			wb_do_writeback(me, 0);
		}

		spin_lock_bh(&bdi_lock);
		set_current_state(TASK_INTERRUPTIBLE);

		list_for_each_entry(bdi, &bdi_list, bdi_list) {
			bool have_dirty_io;

			if (!bdi_cap_writeback_dirty(bdi) ||
			    bdi_cap_flush_forker(bdi))
				continue;

			WARN(!test_bit(BDI_registered, &bdi->state),
			     "bdi %p/%s is not registered!\n", bdi, bdi->name);

			have_dirty_io = !list_empty(&bdi->work_list) ||
					wb_has_dirty_io(&bdi->wb);

			/*
			 * If the bdi has work to do, but the thread does not
			 * exist - create it.
			 */
			if (!bdi->wb.task && have_dirty_io) {
				/*
				 * Set the pending bit - if someone will try to
				 * unregister this bdi - it'll wait on this bit.
				 */
				set_bit(BDI_pending, &bdi->state);
				action = FORK_THREAD;
				break;
			}

			spin_lock(&bdi->wb_lock);

			/*
			 * If there is no work to do and the bdi thread was
			 * inactive long enough - kill it. The wb_lock is taken
			 * to make sure no-one adds more work to this bdi and
			 * wakes the bdi thread up.
			 */
			if (bdi->wb.task && !have_dirty_io &&
			    time_after(jiffies, bdi->wb.last_active +
						bdi_longest_inactive())) {
				task = bdi->wb.task;
				bdi->wb.task = NULL;
				spin_unlock(&bdi->wb_lock);
				set_bit(BDI_pending, &bdi->state);
				action = KILL_THREAD;
				break;
			}
			spin_unlock(&bdi->wb_lock);
		}
		spin_unlock_bh(&bdi_lock);

		/* Keep working if default bdi still has things to do */
		if (!list_empty(&me->bdi->work_list))
			__set_current_state(TASK_RUNNING);

		switch (action) {
		case FORK_THREAD:
			__set_current_state(TASK_RUNNING);
			task = kthread_create(bdi_writeback_thread, &bdi->wb,
					      "flush-%s", dev_name(bdi->dev));
			if (IS_ERR(task)) {
				/*
				 * If thread creation fails, force writeout of
				 * the bdi from the thread.
				 */
				bdi_flush_io(bdi);
			} else {
				/*
				 * The spinlock makes sure we do not lose
				 * wake-ups when racing with 'bdi_queue_work()'.
				 * And as soon as the bdi thread is visible, we
				 * can start it.
				 */
				spin_lock_bh(&bdi->wb_lock);
				bdi->wb.task = task;
				spin_unlock_bh(&bdi->wb_lock);
				wake_up_process(task);
			}
			break;

		case KILL_THREAD:
			__set_current_state(TASK_RUNNING);
			kthread_stop(task);
			break;

		case NO_ACTION:
			if (!wb_has_dirty_io(me) || !dirty_writeback_interval)
				/*
				 * There are no dirty data. The only thing we
				 * should now care about is checking for
				 * inactive bdi threads and killing them. Thus,
				 * let's sleep for longer time, save energy and
				 * be friendly for battery-driven devices.
				 */
				schedule_timeout(bdi_longest_inactive());
			else
				schedule_timeout(msecs_to_jiffies(dirty_writeback_interval * 10));
			try_to_freeze();
			/* Back to the main loop */
			continue;
		}

		/*
		 * Clear pending bit and wakeup anybody waiting to tear us down.
		 */
		clear_bit(BDI_pending, &bdi->state);
		smp_mb__after_clear_bit();
		wake_up_bit(&bdi->state, BDI_pending);
	}

	return 0;
}

That looks like a lot of code, but the logic is quite simple: an infinite for loop, and inside it a list_for_each_entry that walks every bdi device on bdi_list, checking whether its flush-x:y kernel thread exists, what state it is in, and whether any action is needed (kill it or create it).

Most bdi devices get a corresponding flush-x:y kernel thread, except for a few special ones such as the default bdi device and certain memory-backed virtual bdi devices, as the first if check shows:

if (!bdi_cap_writeback_dirty(bdi) ||
    bdi_cap_flush_forker(bdi))
    continue;

What the flush-x:y kernel threads actually do is left for the next article, but note here that if a bdi device currently has dirty data to sync, its flush-x:y kernel thread is created (provided, of course, that it does not already exist):

have_dirty_io = !list_empty(&bdi->work_list) ||
        wb_has_dirty_io(&bdi->wb);
 
/*
 * If the bdi has work to do, but the thread does not
 * exist - create it.
 */
if (!bdi->wb.task && have_dirty_io) {
    /*
     * Set the pending bit - if someone will try to
     * unregister this bdi - it'll wait on this bit.
     */
    set_bit(BDI_pending, &bdi->state);
    action = FORK_THREAD;
    break;
}

action is set to FORK_THREAD, and the actual creation of the flush-x:y kernel thread happens in the subsequent switch (action) body (note the break inside the if, which exits the list_for_each_entry loop).

If a bdi device currently has no dirty data to sync and its flush-x:y kernel thread has been inactive for a long time (determined by comparing the last activity time last_active with the current jiffies), the thread is killed:

            /*
             * If there is no work to do and the bdi thread was
             * inactive long enough - kill it. The wb_lock is taken
             * to make sure no-one adds more work to this bdi and
             * wakes the bdi thread up.
             */
            if (bdi->wb.task && !have_dirty_io &&
                time_after(jiffies, bdi->wb.last_active +
                        bdi_longest_inactive())) {
                task = bdi->wb.task;
                bdi->wb.task = NULL;
                spin_unlock(&bdi->wb_lock);
                set_bit(BDI_pending, &bdi->state);
                action = KILL_THREAD;
                break;
            }
  
/*
 * Calculate the longest interval (jiffies) bdi threads are allowed to be
 * inactive.
 */
static unsigned long bdi_longest_inactive(void)
{
    unsigned long interval;
  
    interval = msecs_to_jiffies(dirty_writeback_interval * 10);
    return max(5UL * 60 * HZ, interval);
}
  
unsigned int dirty_writeback_interval = 5 * 100; /* centiseconds */

As you can see, "a long time" is 5 minutes by default. In that case action is set to KILL_THREAD, and the actual killing of the flush-x:y kernel thread happens in the subsequent switch (action) body.

Once the walk over all bdi devices finishes, the bdi-default kernel thread itself executes the NO_ACTION branch of the switch (action) body and goes to sleep, until the timeout expires and continue restarts the work described above.

