%IOwait is a puzzling metric. top/iostat/mpstat/sar all report it, it literally reads as "time the system spent waiting for IO", and it is routinely used to gauge system IO pressure. But what is the CPU actually doing during IOwait? What time counts as IOwait, and what exactly is being waited on?
Meaning and source of the data
Start with the official definition from man mpstat:
%iowait
Percentage of time that the CPU or CPUs were idle during which the system
had an outstanding disk I/O request.
So the statistics tools define %iowait as the percentage of time during which the CPU was idle while the system had an outstanding IO request. Two conditions must both hold for time to be counted as IOwait:
1. the CPU is idle
2. an IO request is being processed
%iowait itself comes from the 5th field of /proc/stat. Where does that value come from?
IOwait accounting in the kernel (kernel 5.3)
1. Where the /proc/stat value comes from
In fs/proc/stat.c, show_stat() shows the path by which /proc/stat obtains iowait: get_iowait_time() -> get_cpu_iowait_time_us().
Depending on whether the CPU is online or offline, iowait is read either from get_cpu_iowait_time_us() or from cpustat.iowait:
static u64 get_iowait_time(struct kernel_cpustat *kcs, int cpu)
{
    u64 iowait, iowait_usecs = -1ULL;

    if (cpu_online(cpu))
        iowait_usecs = get_cpu_iowait_time_us(cpu, NULL);

    if (iowait_usecs == -1ULL)
        /* !NO_HZ or cpu offline so we can rely on cpustat.iowait */
        iowait = kcs->cpustat[CPUTIME_IOWAIT];
    else
        iowait = iowait_usecs * NSEC_PER_USEC;

    return iowait;
}
get_cpu_iowait_time_us() takes its data from each CPU's ts->iowait_sleeptime:
u64 get_cpu_iowait_time_us(int cpu, u64 *last_update_time)
{
    struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
    ktime_t now, iowait;

    if (!tick_nohz_active)
        return -1;

    now = ktime_get();
    if (last_update_time) {
        update_ts_time_stats(cpu, ts, now, last_update_time);
        iowait = ts->iowait_sleeptime;
    } else {
        if (ts->idle_active && nr_iowait_cpu(cpu) > 0) {
            ktime_t delta = ktime_sub(now, ts->idle_entrytime);

            iowait = ktime_add(ts->iowait_sleeptime, delta);
        } else {
            iowait = ts->iowait_sleeptime;
        }
    }

    return ktime_to_us(iowait);
}
So iowait comes either from cpustat.iowait or from the per-CPU ts->iowait_sleeptime.
2. How cpustat.iowait / iowait_sleeptime are accumulated
The functions that accumulate these two counters are:
void account_idle_time(u64 cputime)
{
    u64 *cpustat = kcpustat_this_cpu->cpustat;
    struct rq *rq = this_rq();

    if (atomic_read(&rq->nr_iowait) > 0)
        cpustat[CPUTIME_IOWAIT] += cputime;
    else
        cpustat[CPUTIME_IDLE] += cputime;
}
static void
update_ts_time_stats(int cpu, struct tick_sched *ts, ktime_t now, u64 *last_update_time)
{
    ktime_t delta;

    if (ts->idle_active) {
        delta = ktime_sub(now, ts->idle_entrytime);
        if (nr_iowait_cpu(cpu) > 0)
            ts->iowait_sleeptime = ktime_add(ts->iowait_sleeptime, delta);
        else
            ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
        ts->idle_entrytime = now;
    }

    if (last_update_time)
        *last_update_time = ktime_to_us(now);
}
The call stacks are:
do_idle()->tick_nohz_idle_exit()->__tick_nohz_idle_restart_tick()->tick_nohz_account_idle_ticks()->account_idle_ticks()->account_idle_time()
do_idle()->tick_nohz_idle_exit()->tick_nohz_stop_idle()->update_ts_time_stats()
That is, accounting is triggered when the CPU goes idle, and which counter gets updated depends on the CPU's state.
The accounting logic of the two counters is identical: based on nr_iowait of the current CPU's runqueue, the elapsed time is added to either idle or iowait.
3. Where nr_iowait is counted
At schedule time, if the outgoing task is in_iowait, the nr_iowait of the current CPU's runqueue is incremented, indicating that a task on this CPU is waiting for IO; see __schedule():
if (!preempt && prev->state) {
    if (signal_pending_state(prev->state, prev)) {
        prev->state = TASK_RUNNING;
    } else {
        deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

        if (prev->in_iowait) {
            atomic_inc(&rq->nr_iowait);
            delayacct_blkio_start();
        }
    }
    switch_count = &prev->nvcsw;
}
A task's in_iowait flag is set in io_schedule_prepare(). Functions that call io_schedule_prepare() include io_schedule(), io_schedule_timeout(), mutex_lock_io(), mutex_lock_io_nested(), and so on. When any of these calls leads to a reschedule, the current CPU is marked as having a task in iowait; when the task is woken up again, in_iowait is restored and the runqueue's nr_iowait is decremented.
int io_schedule_prepare(void)
{
    int old_iowait = current->in_iowait;

    current->in_iowait = 1;
    blk_schedule_flush_plug(current);

    return old_iowait;
}
In short: when a task is scheduled off a CPU because of IO, that CPU is marked as having a task waiting for IO. When the CPU then goes idle, the time is accounted as IOwait if it still has tasks waiting for IO, and as idle otherwise. This matches the definition in man mpstat.
Where IO blocks
There are many IO paths in the system, and correspondingly many points at which they can block. Here are the two most common blocking points in ordinary IO operations:
1. After the IO has been submitted to the driver, waiting for the data to return.
2. Under concurrent IO, contending for resources such as the software/hardware IO queues.
Using kprobe, observe the io_schedule call stacks in two test scenarios: single-threaded IO and multi-threaded IO.
With a single-threaded read there is only one kind of call stack:
fio-7834 [001] d... 875382.127151: io_schedule: (io_schedule+0x0/0x40)
fio-7834 [001] d... 875382.127163: <stack trace>
=> io_schedule
=> ext4_file_read_iter
=> new_sync_read
=> __vfs_read
=> vfs_read
=> ksys_pread64
=> sys_pread64
=> ret_fast_syscall
With multi-threaded reads, an additional io_schedule path appears, caused by contention for IO resources:
fio-9800 [001] d... 875471.769845: io_schedule: (io_schedule+0x0/0x40)
fio-9800 [001] d... 875471.769858: <stack trace>
=> io_schedule
=> ext4_file_read_iter
=> new_sync_read
=> __vfs_read
=> vfs_read
=> ksys_pread64
=> sys_pread64
=> ret_fast_syscall
=> 0xbe9445a8
fio-9801 [003] d... 875471.770153: io_schedule: (io_schedule+0x0/0x40)
fio-9801 [003] d... 875471.770164: <stack trace>
=> io_schedule
=> blk_mq_get_request
=> blk_mq_make_request
=> generic_make_request
=> submit_bio
=> ext4_mpage_readpages
=> ext4_readpages
=> read_pages
=> __do_page_cache_readahead
=> force_page_cache_readahead
=> page_cache_sync_readahead
=> generic_file_read_iter
=> ext4_file_read_iter
=> new_sync_read
=> __vfs_read
=> vfs_read
=> ksys_pread64
=> sys_pread64
=> ret_fast_syscall
=> 0xbe9445a8
Conclusion
IOwait is the time during which the CPU is idle and some task is waiting for IO.
Scheduling due to blocked IO mainly arises from 1. waiting for data to return, and 2. contention for resources under concurrent IO.
Many factors affect this number, not just IO load: CPU load in particular can heavily distort it. Do not judge a system's IO pressure from IOwait alone; combine it with iostat and other data.