CFS Scheduling Period, Scheduling Granularity, and Time Slice Analysis

OS Developer

Last modified 2023-07-25 15:45:24

Categories: Process Scheduling, CFS. Tags: linux

First published 2023-07-25 15:43:38

Article link: https://blog.csdn.net/lizhijun_buaa/article/details/131919334

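This article explains in detail how the Completely Fair Scheduler (CFS) in the Linux kernel uses vruntime to decide the order in which processes are scheduled, and how the nice value affects a process's priority. The smaller the vruntime, the sooner a process runs; the higher the nice value, the friendlier the process is to others and the less CPU time it gets. The scheduling period is related to the scheduling granularity and is used to keep process execution fair. The article also covers process states, scheduling-related functions, and how to view the scheduling latency and related information.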

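This code analysis is based on kernel 6.1.31. Please credit the source when reposting.

Before getting to the main topic, a few basic concepts need to be covered.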

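The role of vruntime

vruntime is a process's virtual run time. When several processes sit on the CFS run queue, the scheduler decides which one runs first by comparing their vruntime values.

The smaller the vruntime, the further left the process's scheduling entity (se) sits in the red-black tree and the sooner it runs. The entity with the smallest vruntime is the leftmost node and is scheduled first.

vruntime is computed using the weight of a nice-0 process as the reference: the ratio of that reference weight to the process's actual weight is multiplied by the actual run time.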

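The relationship between nice values and run time

nice values range from -20 to 19, and a process's default nice value is 0. They act like 40 priority levels: the higher the nice value, the lower the priority, and the lower the nice value, the higher the priority.

Why is it defined this way?

Because nice expresses how friendly a process is to others: the larger the value, the friendlier the process is and the more CPU time it yields to other processes.

Each time a process's nice value drops by one level, its priority rises by one level and it gets roughly 10% more CPU time.

Each time a process's nice value rises by one level, its priority drops by one level and it gets roughly 10% less CPU time. To achieve this, the weights of adjacent nice levels differ by a factor of about 1.25.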

The nice-to-weight mapping table provided by the kernel

< kernel/sched/core.c >

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

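Example: suppose processes A and B both have nice 0. The ideal share of CPU time for each is

1024 / (1024 + 1024) = 50%

1024 appears because the weight of both A and B is 1024.

Now raise A's nice value to 1 while B stays at 0. A's ideal share becomes

820 / (820 + 1024) = 44.5%

and B's ideal share becomes

1024 / (820 + 1024) = 55.5%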
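The arithmetic above can be reproduced with a small user-space sketch. The table is copied from sched_prio_to_weight; the helper names weight_of_nice() and cpu_share_percent() are hypothetical and exist only for this illustration, which assumes all tasks are CPU-bound and share a single CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Copy of sched_prio_to_weight from kernel/sched/core.c (index = nice + 20). */
static const int nice_to_weight[40] = {
	88761, 71755, 56483, 46273, 36291,
	29154, 23254, 18705, 14949, 11916,
	 9548,  7620,  6100,  4904,  3906,
	 3121,  2501,  1991,  1586,  1277,
	 1024,   820,   655,   526,   423,
	  335,   272,   215,   172,   137,
	  110,    87,    70,    56,    45,
	   36,    29,    23,    18,    15,
};

/* Hypothetical helper: look up the weight for a nice value in [-20, 19]. */
static int weight_of_nice(int nice)
{
	assert(nice >= -20 && nice <= 19);
	return nice_to_weight[nice + 20];
}

/*
 * Hypothetical helper: ideal CPU share, in percent, of task i among
 * n CPU-bound tasks whose nice values are given in nices[].
 */
static double cpu_share_percent(const int *nices, size_t n, size_t i)
{
	long total = 0;

	assert(i < n);
	for (size_t k = 0; k < n; k++)
		total += weight_of_nice(nices[k]);

	return 100.0 * weight_of_nice(nices[i]) / (double)total;
}
```

Calling cpu_share_percent() with nice values {1, 0} gives roughly 44.5 for the first task and 55.5 for the second, matching the figures above.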

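How vruntime is computed

The formula is

vruntime = ( delta_exec * nice_0_weight ) / weight

Here vruntime is the virtual run time accrued by the process (the amount added to its se->vruntime), delta_exec is the actual run time, nice_0_weight is the weight of a nice-0 process, and weight is the weight of the process itself.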
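As a rough illustration of the formula, the sketch below uses a plain 64-bit division. The helper name vruntime_delta_ns() is hypothetical; the kernel itself avoids the division and uses a fixed-point multiply instead.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_WEIGHT 1024ULL	/* weight of a nice-0 task in sched_prio_to_weight */

/*
 * Hypothetical helper: virtual run time accrued for delta_exec_ns of
 * real run time by a task with the given weight. Assumes
 * delta_exec_ns * 1024 does not overflow 64 bits.
 */
static uint64_t vruntime_delta_ns(uint64_t delta_exec_ns, uint64_t weight)
{
	assert(weight > 0);
	return delta_exec_ns * NICE_0_WEIGHT / weight;
}
```

For 1 ms of real run time, a nice-0 task (weight 1024) accrues 1,000,000 ns of vruntime, a nice-1 task (weight 820) accrues about 1,248,780 ns, and a nice -5 task (weight 3121) accrues about 328,099 ns. Lighter tasks therefore move right in the red-black tree faster than heavier ones.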

The actual calculation code is as follows.

calc_delta_fair() analysis

< kernel/sched/fair.c >

/*
 * delta /= w
 */
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
	if (unlikely(se->load.weight != NICE_0_LOAD))                       /* scale only when the weight differs from the nice-0 weight; for a nice-0 entity the virtual time equals the actual run time */
		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

	return delta;
}

__calc_delta()

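Computes a process's virtual run time. The code uses some arithmetic tricks (a precomputed inverse weight plus shifts instead of a division), but the underlying principle is still the vruntime formula above.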

< kernel/sched/fair.c >

/*
 * delta_exec * weight / lw.weight
 *   OR
 * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
 *
 * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
 * we're guaranteed shift stays positive because inv_weight is guaranteed to
 * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
 *
 * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
 * weight/lw.weight <= 1, and therefore our shift will also be positive.
 */
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
	u64 fact = scale_load_down(weight);
	u32 fact_hi = (u32)(fact >> 32);
	int shift = WMULT_SHIFT;
	int fs;

	__update_inv_weight(lw);

	if (unlikely(fact_hi)) {
		fs = fls(fact_hi);
		shift -= fs;
		fact >>= fs;
	}

	fact = mul_u32_u32(fact, lw->inv_weight);

	fact_hi = (u32)(fact >> 32);
	if (fact_hi) {
		fs = fls(fact_hi);
		shift -= fs;
		fact >>= fs;
	}

	return mul_u64_u32_shr(delta_exec, fact, shift);
}
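The multiply-and-shift trick in __calc_delta() can be approximated in user space as follows. This is only a sketch under simplifying assumptions: inv_weight_of() computes 2^32 / weight at run time (the kernel reads a precomputed, rounded table), and mul_shr_sketch() is a hypothetical stand-in for the kernel's multiply-then-shift helper, valid for shift values up to 32.

```c
#include <assert.h>
#include <stdint.h>

#define WMULT_SHIFT	32
#define NICE_0_WEIGHT	1024u

/* Inverse weight: 2^32 / weight (truncated here, rounded in the kernel table). */
static uint32_t inv_weight_of(uint32_t weight)
{
	assert(weight > 1);	/* weight 1 would not fit in 32 bits */
	return (uint32_t)((1ULL << WMULT_SHIFT) / weight);
}

/* Hypothetical stand-in: (a * mul) >> shift without 128-bit math, shift <= 32. */
static uint64_t mul_shr_sketch(uint64_t a, uint32_t mul, unsigned int shift)
{
	uint32_t ah = (uint32_t)(a >> 32);
	uint32_t al = (uint32_t)a;
	uint64_t ret = ((uint64_t)al * mul) >> shift;

	if (ah)
		ret += ((uint64_t)ah * mul) << (32 - shift);
	return ret;
}

/* delta_exec * NICE_0_WEIGHT / weight, computed as multiply + shift. */
static uint64_t scaled_delta(uint64_t delta_exec, uint32_t weight)
{
	uint64_t fact = (uint64_t)NICE_0_WEIGHT * inv_weight_of(weight);
	unsigned int shift = WMULT_SHIFT;

	/* Keep fact within 32 bits by giving up low bits, as __calc_delta() does. */
	while (fact >> 32) {
		fact >>= 1;
		shift--;
	}
	return mul_shr_sketch(delta_exec, (uint32_t)fact, shift);
}
```

Because the inverse weight is truncated, the result can be slightly below the exact quotient, which is the price paid for replacing a 64-bit division with a multiplication.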

update_min_vruntime()

< kernel/sched/fair.c >

/*
 * update_min_vruntime - update the min_vruntime of the CFS run queue
 */
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);

	u64 vruntime = cfs_rq->min_vruntime;

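	/*
	 * Start with the run queue's current min_vruntime as the candidate.
	 * If there is a current entity curr and it is still on the run queue,
	 * use curr->vruntime as the candidate instead; otherwise ignore curr
	 * (curr = NULL) and keep the old min_vruntime for now.
	 */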
	if (curr) {
		if (curr->on_rq)
			vruntime = curr->vruntime;
		else
			curr = NULL;
	}

	if (leftmost) { /* non-empty tree */
		struct sched_entity *se = __node_2_se(leftmost);
        
		/*
		 * If there is no usable curr, take the vruntime of the
		 * leftmost node in the tree; otherwise take the smaller of
		 * curr->vruntime and the leftmost entity's vruntime.
		 */
		if (!curr)
			vruntime = se->vruntime;
		else
			vruntime = min_vruntime(vruntime, se->vruntime);
	}

	/* ensure we never gain time by being placed backwards. */
	/*
	 * Keep min_vruntime monotonically non-decreasing: it is only
	 * updated when the candidate exceeds cfs_rq->min_vruntime.
	 */
	u64_u32_store(cfs_rq->min_vruntime,
		      max_vruntime(cfs_rq->min_vruntime, vruntime));
}

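Based on the vruntime of the current process and of the next process waiting to be scheduled, update_min_vruntime computes a candidate vruntime. The run queue's min_vruntime is updated only when this candidate is greater than the old min_vruntime. With this policy the kernel guarantees that min_vruntime can only increase and never decrease.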
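The selection logic can be condensed into a user-space model. The comparison helpers below follow the usual wraparound-safe pattern (compare via a signed difference); next_min_vruntime() and its flag parameters are hypothetical names used only for this sketch.

```c
#include <stdint.h>

/* Wraparound-safe max: pick v only if it is "ahead" of max. */
static uint64_t max_vruntime(uint64_t max, uint64_t v)
{
	int64_t delta = (int64_t)(v - max);

	if (delta > 0)
		max = v;
	return max;
}

/* Wraparound-safe min: pick v only if it is "behind" min. */
static uint64_t min_vruntime(uint64_t min, uint64_t v)
{
	int64_t delta = (int64_t)(v - min);

	if (delta < 0)
		min = v;
	return min;
}

/*
 * Hypothetical model of update_min_vruntime():
 * has_curr - a current entity exists and is on the run queue
 * has_left - the red-black tree is not empty
 */
static uint64_t next_min_vruntime(uint64_t old_min,
				  int has_curr, uint64_t curr_v,
				  int has_left, uint64_t left_v)
{
	uint64_t candidate = old_min;

	if (has_curr)
		candidate = curr_v;
	if (has_left)
		candidate = has_curr ? min_vruntime(candidate, left_v) : left_v;

	/* Never move min_vruntime backwards. */
	return max_vruntime(old_min, candidate);
}
```

With old_min = 100, curr at 150 and the leftmost entity at 120, the result is 120. If both curr and the leftmost entity are below 100, the result stays at 100.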

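What states can a process be in

Operating systems courses usually teach that a process keeps moving among three states: blocked, ready, and running.

On Linux, however, both ready and running are represented by TASK_RUNNING. This confuses many people, since operating system theory distinguishes the ready state from the running state.

Below are the process states defined by Linux.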

< include/linux/sched.h >

/* Used in tsk->state: */
#define TASK_RUNNING			0x00000000
#define TASK_INTERRUPTIBLE		0x00000001
#define TASK_UNINTERRUPTIBLE		0x00000002
#define __TASK_STOPPED			0x00000004
#define __TASK_TRACED			0x00000008

How Linux checks whether a task is in the running state

#define task_is_running(task)		(READ_ONCE((task)->__state) == TASK_RUNNING)

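task_is_running checks whether the task's state field equals TASK_RUNNING, which means the process is either ready or executing.

So how can we tell that a process is merely on the run queue?

se_runnable uses the on_rq field to tell whether the scheduling entity is on the run queue, as follows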

< kernel/sched/sched.h >

static inline long se_runnable(struct sched_entity *se)
{
	return !!se->on_rq;
}

task_on_cpu() uses the on_cpu field to tell whether the process is currently executing on a CPU

< kernel/sched/sched.h >

static inline int task_on_cpu(struct rq *rq, struct task_struct *p)
{
#ifdef CONFIG_SMP
	return p->on_cpu;
#else
	return task_current(rq, p);
#endif
}

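In the end, a TASK_RUNNING process is classified as ready or running by checking whether it is only on_rq or also on_cpu.

Two states indicate that a process is blocked: TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE.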
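The mapping from Linux fields to the three textbook states can be sketched as below. struct task_model and classify() are hypothetical simplifications written for this illustration; they are not kernel types, and real tasks can also be in transitional states that this model lumps into STATE_OTHER.

```c
#include <stdbool.h>

/* Same values as in include/linux/sched.h. */
#define TASK_RUNNING		0x00000000
#define TASK_INTERRUPTIBLE	0x00000001
#define TASK_UNINTERRUPTIBLE	0x00000002

/* Hypothetical, simplified stand-in for the task_struct fields used here. */
struct task_model {
	unsigned int state;
	bool on_rq;
	bool on_cpu;
};

enum textbook_state {
	STATE_EXECUTING,
	STATE_READY,
	STATE_BLOCKED,
	STATE_OTHER,
};

static enum textbook_state classify(const struct task_model *t)
{
	if (t->state == TASK_RUNNING) {
		if (t->on_cpu)
			return STATE_EXECUTING;	/* currently on a CPU */
		if (t->on_rq)
			return STATE_READY;	/* queued, waiting for a CPU */
		return STATE_OTHER;		/* transitional, e.g. mid-wakeup */
	}
	if (t->state & (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE))
		return STATE_BLOCKED;
	return STATE_OTHER;			/* stopped, traced, and so on */
}
```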

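The CFS scheduling period

CFS has no scheduling period in the traditional sense, and no time slice in the traditional sense either.

Scheduling granularity:

The scheduling granularity is the minimum time a task must run on a CPU before it can be preempted.

Note that the granularity only bounds how long a process runs under involuntary (preemptive) scheduling. If the process blocks and gives up the CPU voluntarily, the limit does not apply.

The kernel defines sysctl_sched_min_granularity as the scheduling granularity. Its initial value is 0.75 ms, but that is not the value finally used: the variable is assigned again during boot. Let's look at the code.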

/*
 * Targeted preemption latency for CPU-bound tasks:
 *
 * NOTE: this latency value is not the same as the concept of
 * 'timeslice length' - timeslices in CFS are of variable length
 * and have no persistent notion like in traditional, time-slice
 * based scheduling concepts.
 *
 * (to see the precise effective timeslice length of your workload,
 *  run vmstat and monitor the context-switches (cs) field)
 *
 * (default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)
 */
unsigned int sysctl_sched_latency			= 6000000ULL;
static unsigned int normalized_sysctl_sched_latency	= 6000000ULL;

/*
 * Minimal preemption granularity for CPU-bound tasks:
 *
 * (default: 0.75 msec * (1 + ilog(ncpus)), units: nanoseconds)
 */
unsigned int sysctl_sched_min_granularity			= 750000ULL;
static unsigned int normalized_sysctl_sched_min_granularity	= 750000ULL;

/*
 * SCHED_OTHER wake-up granularity.
 *
 * This option delays the preemption effects of decoupled workloads
 * and reduces their over-scheduling. Synchronous workloads will still
 * have immediate wakeup/sleep latencies.
 *
 * (default: 1 msec * (1 + ilog(ncpus)), units: nanoseconds)
 */
unsigned int sysctl_sched_wakeup_granularity			= 1000000UL;
static unsigned int normalized_sysctl_sched_wakeup_granularity	= 1000000UL;

update_sysctl()

< kernel/sched/fair.c >

static void update_sysctl(void)
{
	unsigned int factor = get_update_sysctl_factor();

#define SET_SYSCTL(name) \
	(sysctl_##name = (factor) * normalized_sysctl_##name)
	SET_SYSCTL(sched_min_granularity);
	SET_SYSCTL(sched_latency);
	SET_SYSCTL(sched_wakeup_granularity);
#undef SET_SYSCTL
}

After macro expansion, the function is equivalent to the following code

static void update_sysctl(void)
{
	unsigned int factor = get_update_sysctl_factor();

	sysctl_sched_min_granularity = factor * normalized_sysctl_sched_min_granularity;
	sysctl_sched_latency = factor * normalized_sysctl_sched_latency;
	sysctl_sched_wakeup_granularity = factor * normalized_sysctl_sched_wakeup_granularity;
}

update_sysctl() is reached through the following call path:

sched_init_smp() -> sched_init_granularity() -> update_sysctl();

So the scheduling period depends on the scheduling granularity.
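The excerpts above do not show get_update_sysctl_factor(). The kernel comments quoted earlier describe the default as "1 + ilog(ncpus)"; the sketch below models that logarithmic mode and additionally assumes, as in the mainline scheduler, that the CPU count is capped at 8. The helper names are hypothetical.

```c
#include <assert.h>

/* Integer log2, rounded down (v must be non-zero). */
static unsigned int ilog2_u32(unsigned int v)
{
	unsigned int r = 0;

	while (v >>= 1)
		r++;
	return r;
}

/* Hypothetical model of the default (logarithmic) scaling factor. */
static unsigned int log_scaling_factor(unsigned int ncpus)
{
	assert(ncpus > 0);
	if (ncpus > 8)		/* assumption: CPU count is capped at 8 */
		ncpus = 8;
	return 1 + ilog2_u32(ncpus);
}

/* Scale a normalized tunable (in ns) the way update_sysctl() does. */
static unsigned long long scaled_ns(unsigned long long normalized_ns,
				    unsigned int ncpus)
{
	return (unsigned long long)log_scaling_factor(ncpus) * normalized_ns;
}
```

Under these assumptions a machine with 8 or more CPUs gets a factor of 4, so the 0.75 ms granularity becomes 3 ms and the 6 ms latency becomes 24 ms.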

__sched_period()

< kernel/sched/fair.c >

/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
    if (unlikely(nr_running > sched_nr_latency))
        return nr_running * sysctl_sched_min_granularity;
    else
        return sysctl_sched_latency;
}

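So when the number of running processes is 8 or fewer (sched_nr_latency is 8), the scheduling period equals the scheduling latency. When there are more than 8, the period equals the number of running processes multiplied by the scheduling granularity. The 3 ms figure used here assumes the default scaling on a machine with 8 or more CPUs, where factor is 4: the latency is then 6 ms × 4 = 24 ms and the granularity is 0.75 ms × 4 = 3 ms. With at most 8 processes sharing the period equally, each gets at least 24 / 8 = 3 ms; with more than 8, an equal split gives each process exactly 3 ms per period. Because some processes have higher weights, they receive more than 3 ms, and others correspondingly receive less.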
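A user-space version of the same rule makes the numbers easy to check. The function sched_period_ns() is a hypothetical rewrite of __sched_period() that takes the tunables as parameters; the constant 8 mirrors sched_nr_latency as described in the text.

```c
#include <stdint.h>

#define SCHED_NR_LATENCY 8UL	/* mirrors sched_nr_latency from the text */

/* Hypothetical user-space rewrite of __sched_period(). */
static uint64_t sched_period_ns(unsigned long nr_running,
				uint64_t latency_ns,
				uint64_t min_granularity_ns)
{
	if (nr_running > SCHED_NR_LATENCY)
		return (uint64_t)nr_running * min_granularity_ns;
	return latency_ns;
}
```

With a 24 ms latency and a 3 ms granularity, 4 or 8 running tasks give a 24 ms period, while 10 running tasks stretch the period to 30 ms.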

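Time slice:

Under CFS, the time slice is the longest a task may run in one scheduling turn.

The time slice depends on the scheduling period. For a scheduling entity it is computed as

time slice (actual runnable time) = current CFS scheduling period × load weight of the scheduling entity / total load weight of the CFS run queue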
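The formula translates directly into code. time_slice_ns() below is a hypothetical helper using plain integer division; it ignores group scheduling and the fixed-point arithmetic the kernel uses.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: share of the scheduling period that an entity
 * with weight se_weight receives on a run queue whose entities sum to
 * rq_weight. Assumes period_ns * se_weight fits in 64 bits.
 */
static uint64_t time_slice_ns(uint64_t period_ns, uint64_t se_weight,
			      uint64_t rq_weight)
{
	assert(rq_weight > 0 && se_weight <= rq_weight);
	return period_ns * se_weight / rq_weight;
}
```

For a 24 ms period, two nice-0 tasks get 12 ms each. If one of them moves to nice 1 (weight 820), it gets about 10.67 ms and the nice-0 task gets about 13.33 ms.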

How to view the scheduling latency

Go to the directory /sys/kernel/debug/sched

root [ /sys/kernel/debug/sched ]# ls -l
total 0
-r--r--r--  1 root root 0 Jul 21 08:33 debug
drwxr-xr-x 10 root root 0 Jul 21 08:33 domains
-rw-r--r--  1 root root 0 Jul 21 08:33 features
-rw-r--r--  1 root root 0 Jul 21 08:33 idle_min_granularity_ns
-rw-r--r--  1 root root 0 Jul 21 08:33 latency_ns
-rw-r--r--  1 root root 0 Jul 21 08:33 latency_warn_ms
-rw-r--r--  1 root root 0 Jul 21 08:33 latency_warn_once
-rw-r--r--  1 root root 0 Jul 21 08:33 migration_cost_ns
-rw-r--r--  1 root root 0 Jul 21 08:33 min_granularity_ns
-rw-r--r--  1 root root 0 Jul 21 08:33 nr_migrate
-rw-r--r--  1 root root 0 Jul 21 08:33 preempt
-rw-r--r--  1 root root 0 Jul 21 08:33 tunable_scaling
-rw-r--r--  1 root root 0 Jul 21 08:33 verbose
-rw-r--r--  1 root root 0 Jul 21 08:33 wakeup_granularity_ns

latency_ns corresponds to sysctl_sched_latency

min_granularity_ns corresponds to sysctl_sched_min_granularity
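These files can also be read programmatically. The sketch below is a generic helper that parses one unsigned integer from a file; read_u64_file() is a hypothetical name, and reading the debugfs entries assumes debugfs is mounted at /sys/kernel/debug and that the caller has root privileges.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: read a single unsigned decimal value from path.
 * Returns 0 and stores the value in *out on success, -1 on any failure.
 */
static int read_u64_file(const char *path, uint64_t *out)
{
	FILE *f;
	int ok;

	assert(path != NULL && out != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	ok = fscanf(f, "%" SCNu64, out) == 1;
	fclose(f);

	return ok ? 0 : -1;
}
```

For example, read_u64_file("/sys/kernel/debug/sched/latency_ns", &v) would store the current scheduling latency in nanoseconds in v, and the same call with min_granularity_ns would return the scheduling granularity.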

The sched_init_debug() function

< kernel/sched/debug.c >

static __init int sched_init_debug(void)
{
	struct dentry __maybe_unused *numa;

	debugfs_sched = debugfs_create_dir("sched", NULL);

	debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
	debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
#ifdef CONFIG_PREEMPT_DYNAMIC
	debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
#endif

	debugfs_create_u32("latency_ns", 0644, debugfs_sched, &sysctl_sched_latency);
	debugfs_create_u32("min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_min_granularity);
	debugfs_create_u32("idle_min_granularity_ns", 0644, debugfs_sched, &sysctl_sched_idle_min_granularity);
	debugfs_create_u32("wakeup_granularity_ns", 0644, debugfs_sched, &sysctl_sched_wakeup_granularity);

	debugfs_create_u32("latency_warn_ms", 0644, debugfs_sched, &sysctl_resched_latency_warn_ms);
	debugfs_create_u32("latency_warn_once", 0644, debugfs_sched, &sysctl_resched_latency_warn_once);

#ifdef CONFIG_SMP
	debugfs_create_file("tunable_scaling", 0644, debugfs_sched, NULL, &sched_scaling_fops);
	debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
	debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);

	mutex_lock(&sched_domains_mutex);
	update_sched_domain_debugfs();
	mutex_unlock(&sched_domains_mutex);
#endif

#ifdef CONFIG_NUMA_BALANCING
	numa = debugfs_create_dir("numa_balancing", debugfs_sched);

	debugfs_create_u32("scan_delay_ms", 0644, numa, &sysctl_numa_balancing_scan_delay);
	debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balancing_scan_period_min);
	debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
	debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
	debugfs_create_u32("hot_threshold_ms", 0644, numa, &sysctl_numa_balancing_hot_threshold);
#endif

	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);

	return 0;
}

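What is ideal_runtime

ideal_runtime is the ideal run time, that is, the time allotted to the process.

actual run time allotted to a process = total CPU time of the current scheduling period * ( weight of the process / sum of the weights of all runnable processes on the run queue )

So we conclude

ideal_runtime = total CPU time of the current scheduling period * ( weight of the process / sum of the weights of all runnable processes on the run queue )

From sched_period we know that once the total time is fixed, each process's ideal run time is known as well. A CPU-bound process uses its entire ideal run time every time it is scheduled, whereas an I/O-bound process will not use it all.

Code related to ideal_runtime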

< kernel/sched/fair.c >

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	unsigned long ideal_runtime, delta_exec;
	struct sched_entity *se;
	s64 delta;

	/*
	 * When many tasks blow up the sched_period; it is possible that
	 * sched_slice() reports unusually large results (when many tasks are
	 * very light for example). Therefore impose a maximum.
	 */
	ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);          /* time slice curr is entitled to in this scheduling period */

	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;                    /* actual time curr has run since it was last put on the CPU */
	if (delta_exec > ideal_runtime) {                                                     /* curr has run longer than its slice, so schedule it out */
		resched_curr(rq_of(cfs_rq));                           /* request preemption of the current task by setting the TIF_NEED_RESCHED flag */
		/*
		 * The current task ran long enough, ensure it doesn't get
		 * re-elected due to buddy favours.
		 */
		clear_buddies(cfs_rq, curr);
		return;
	}

	/*
	 * Ensure that a task that missed wakeup preemption by a
	 * narrow margin doesn't have to wait for a full slice.
	 * This also mitigates buddy induced latencies under load.
	 */
	if (delta_exec < sysctl_sched_min_granularity)                   /* ran for less than the minimum granularity: do not preempt */
		return;

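	se = __pick_first_entity(cfs_rq);                               /* leftmost entity in the red-black tree, i.e. the one with the smallest vruntime */
	delta = curr->vruntime - se->vruntime;                          /* if curr's vruntime is below even the smallest queued vruntime, no reschedule is needed */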

	if (delta < 0)
		return;

	if (delta > ideal_runtime)                                      /* less obvious: the intent is to make lower-weight tasks easier to preempt */
		resched_curr(rq_of(cfs_rq));
}

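sum_exec_runtime: the total run time of the scheduling entity, measured in real time

prev_sum_exec_runtime: the value of sum_exec_runtime recorded when the entity was last picked to run, so the difference between the two is the time consumed in the current run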
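The decision made by check_preempt_tick() can be modeled as a pure function. should_resched() is a hypothetical user-space condensation written for this sketch: it takes the already computed values as parameters and leaves out the buddy handling and the actual flag setting.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of check_preempt_tick(). All times are in ns.
 * delta_exec     - sum_exec_runtime - prev_sum_exec_runtime
 * ideal_runtime  - slice of curr, already capped at the latency
 * min_gran       - sysctl_sched_min_granularity
 */
static bool should_resched(uint64_t delta_exec, uint64_t ideal_runtime,
			   uint64_t min_gran, uint64_t curr_vruntime,
			   uint64_t leftmost_vruntime)
{
	int64_t delta;

	if (delta_exec > ideal_runtime)
		return true;		/* slice used up */

	if (delta_exec < min_gran)
		return false;		/* too early to preempt */

	delta = (int64_t)(curr_vruntime - leftmost_vruntime);
	if (delta < 0)
		return false;		/* curr is still the most deserving */

	/* curr is far enough ahead of the leftmost entity. */
	return (uint64_t)delta > ideal_runtime;
}
```

With a 12 ms slice and a 3 ms granularity, a task that has run 13 ms is always rescheduled, a task that has run 1 ms never is, and a task that has run 5 ms is rescheduled only if its vruntime leads the leftmost entity by more than 12 ms.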