%IOwait is a puzzling metric. top/iostat/mpstat/sar all report it, it literally reads as "time the system spent waiting for IO", and it is routinely used to gauge system IO pressure. But what is the CPU actually doing during IOwait? What time counts as IOwait, and what exactly is being waited on?
Meaning and source of the data
Start with the official definition from man mpstat:
%iowait
Percentage of time that the CPU or CPUs were idle during which the system
had an outstanding disk I/O request.
So the statistics tools define %iowait as the percentage of time during which the CPU was idle while the system had an outstanding IO request. Two conditions must both hold for time to be counted as IOwait:
1. the CPU is idle
2. an IO request is being processed
%iowait itself comes from the 5th field of /proc/stat. Where does that value come from?
IOwait accounting in the kernel (kernel 5.3)
1. Where the /proc/stat value comes from
In fs/proc/stat.c, show_stat() shows the path by which /proc/stat obtains iowait: get_iowait_time() -> get_cpu_iowait_time_us().
Depending on whether the CPU is online or offline, iowait is read either from get_cpu_iowait_time_us() or from cpustat.iowait:
static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
{
    u64 iowait, iowait_usecs = -1ULL;

    if (cpu_online(cpu))
        iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);

    if (iowait_usecs == -1ULL)
        /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
        iowait = kcs->cpustat[CPUTIME_IOWAIT];
    else
        iowait = iowait_usecs * NSEC_PER_USEC;

    return iowait;
}
get_cpu_iowait_time_us() takes its data from each CPU's ts->iowait_sleeptime:
u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
{
    struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
    ktime_t now, iowait;

    if (!tick_nohz_active)
        return -1;

    now = ktime_get();
    if (last_update_time) {
        update_ts_time_stats(cpu, ts, now, last_update_time);
        iowait = ts->iowait_sleeptime;
    } else {
        if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
            ktime_t delta = ktime_sub(now, ts->idle_entrytime);

            iowait = ktime_add(ts->iowait_sleeptime, delta);
        } else {
            iowait = ts->iowait_sleeptime;
        }
    }

    return ktime_to_us(iowait);
}
So iowait comes either from cpustat.iowait or from the per-CPU ts->iowait_sleeptime.
2. How cpustat.iowait / iowait_sleeptime are accumulated
The functions that accumulate these two counters are:
void account_idle_time(u64 cputime)
{
    u64 *cpustat = kcpustat_this_cpu->cpustat;
    struct rq *rq = this_rq();

    if (atomic_read(&rq->nr_iowait) > 0)
        cpustat[CPUTIME_IOWAIT] += cputime;
    else
        cpustat[CPUTIME_IDLE] += cputime;
}
static void
update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
{
    ktime_t delta;

    if (ts->idle_active) {
        delta = ktime_sub(now, ts->idle_entrytime);
        if (nr_iowait_cpu(cpu) > 0)
            ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
        else
            ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
        ts->idle_entrytime = now;
    }

    if (last_update_time)
        *last_update_time = ktime_to_us(now);
}
The call stacks are:
do_idle()->tick_nohz_idle_exit()->__tick_nohz_idle_restart_tick()->tick_nohz_account_idle_ticks()->account_idle_ticks()->account_idle_time()
do_idle()->tick_nohz_idle_exit()->tick_nohz_stop_idle()->update_ts_time_stats()
That is, accounting is triggered when the CPU goes idle, and which counter gets updated depends on the CPU's state.
The accounting logic of the two counters is identical: based on nr_iowait of the current CPU's runqueue, the elapsed time is added to either idle or iowait.
3. Where nr_iowait is counted
At schedule time, if the outgoing task is in_iowait, the nr_iowait of the current CPU's runqueue is incremented, indicating that a task on this CPU is waiting for IO; see __schedule():
if (!preempt && prev->state) {
    if (signal_pending_state(prev->state, prev)) {
        prev->state = TASK_RUNNING;
    } else {
        deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

        if (prev->in_iowait) {
            atomic_inc(&rq->nr_iowait);
            delayacct_blkio_start();
        }
    }
    switch_count = &prev->nvcsw;
}
A task's in_iowait flag is set in io_schedule_prepare(). Functions that call io_schedule_prepare() include io_schedule(), io_schedule_timeout(), mutex_lock_io(), mutex_lock_io_nested(), and so on. When any of these calls leads to a reschedule, the current CPU is marked as having a task in iowait; when the task is woken up again, in_iowait is restored and the runqueue's nr_iowait is decremented.
int io_schedule_prepare(void)
{
    int old_iowait = current->in_iowait;

    current->in_iowait = 1;
    blk_schedule_flush_plug(current);

    return old_iowait;
}
In short: when a task is scheduled off a CPU because of IO, that CPU is marked as having a task waiting for IO. When the CPU then goes idle, the time is accounted as IOwait if it still has tasks waiting for IO, and as idle otherwise. This matches the definition in man mpstat.
Where IO blocks
There are many IO paths in the system, and correspondingly many points at which they can block. Here are the two most common blocking points in ordinary IO operations:
1. After the IO has been submitted to the driver, waiting for the data to return.
2. Under concurrent IO, contending for resources such as the software/hardware IO queues.
Using kprobe, observe the io_schedule call stacks in two test scenarios: single-threaded IO and multi-threaded IO.
With a single-threaded read there is only one kind of call stack:
fio-7834 [001] d... 875382.127151: io_schedule: (io_schedule+0x0/0x40)
fio-7834 [001] d... 875382.127163: <stack trace>
=> io_schedule
=> ext4_file_read_iter
=> new_sync_read
=> __vfs_read
=> vfs_read
=> ksys_pread64
=> sys_pread64
=> ret_fast_syscall
With multi-threaded reads, an additional io_schedule path appears, caused by contention for IO resources:
fio-9800 [001] d... 875471.769845: io_schedule: (io_schedule+0x0/0x40)
fio-9800 [001] d... 875471.769858: <stack trace>
=> io_schedule
=> ext4_file_read_iter
=> new_sync_read
=> __vfs_read
=> vfs_read
=> ksys_pread64
=> sys_pread64
=> ret_fast_syscall
=> 0xbe9445a8
fio-9801 [003] d... 875471.770153: io_schedule: (io_schedule+0x0/0x40)
fio-9801 [003] d... 875471.770164: <stack trace>
=> io_schedule
=> blk_mq_get_request
=> blk_mq_make_request
=> generic_make_request
=> submit_bio
=> ext4_mpage_readpages
=> ext4_readpages
=> read_pages
=> __do_page_cache_readahead
=> force_page_cache_readahead
=> page_cache_sync_readahead
=> generic_file_read_iter
=> ext4_file_read_iter
=> new_sync_read
=> __vfs_read
=> vfs_read
=> ksys_pread64
=> sys_pread64
=> ret_fast_syscall
=> 0xbe9445a8
Conclusion
IOwait is the time during which the CPU is idle and some task is waiting for IO.
Scheduling due to blocked IO mainly arises from 1. waiting for data to return, and 2. contention for resources under concurrent IO.
Many factors affect this number, not just IO load: CPU load in particular can heavily distort it. Do not judge a system's IO pressure from IOwait alone; combine it with iostat and other data.