Understanding the Linux Kernel: CFS Scheduling

self-motivation

Last modified: 2024-08-28 20:36:31

First published: 2024-08-28 20:36:14

Article link: https://blog.csdn.net/happyAnger6/article/details/140885479

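This is a long article (about 16,000 characters in the original Chinese). After reading it, you will understand:

Why Linux needs scheduling
How to think about CFS
How CFS is implemented in the kernel: the relationship between task priority and scheduling, preemption, the cgroup quota implementation, and the cgroup debugging interfaces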

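1. Why scheduling is needed

CPU resources are finite, and tasks differ in urgency.
Scheduling addresses how to divide CPU time among tasks sensibly and use system resources efficiently.

2. The problems scheduling must solve

When several tasks are ready, which one should get the CPU first?
How much run time should the current task be given?
When should scheduling happen, and when is preemption needed?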

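3. The Completely Fair Scheduler

3.1 History

The Completely Fair Scheduler (CFS) is a process scheduler designed for desktop workloads. It was implemented by Ingo Molnar and merged in Linux
2.6.23. It replaced the interactivity code of the SCHED_OTHER policy in the previous vanilla scheduler.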

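3.2 What "completely fair" means

CFS models an "ideal, precise multitasking CPU" on real hardware.
An "ideal multitasking CPU" is a (nonexistent 😃) CPU with 100% physical power that can run every task
in parallel at exactly the same speed, 1/nr_running each. For example, with two tasks running, each
task gets 50% of the physical power; that is, the tasks truly run in parallel.

On real hardware only one task can run at a time, so CFS introduces the concept of "virtual runtime".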

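3.3 Basic CFS workflow

In short, CFS works like this: it runs a task for a while, and at a suitable point, such as when the task schedules or the scheduler tick
fires, it accounts for the task's CPU usage: the (small) amount of time the task just spent on the physical CPU is added to
p->se.vruntime. Once p->se.vruntime grows large enough that another task becomes the "leftmost task" of the time-ordered red-black tree
(plus a small "granularity" margin relative to the leftmost task, so that tasks are not over-scheduled
and the cache is not thrashed), the new leftmost task is picked and the current task is preempted.

A task's virtual runtime is directly tied to its priority: the higher the priority, the more slowly its virtual runtime grows.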
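
The workflow above can be illustrated with a toy simulation. This is a minimal sketch, not kernel code: `toy_task`, `toy_pick_next`, and `toy_run` are hypothetical names, a linear scan stands in for the red-black tree, every tick runs exactly one task, and the granularity margin is ignored.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NICE_0_WEIGHT 1024u

/* Hypothetical, simplified stand-in for a scheduling entity. */
struct toy_task {
    uint32_t weight;    /* load weight; 1024 corresponds to nice 0 */
    uint64_t vruntime;  /* virtual runtime in ns */
    uint64_t runtime;   /* real CPU time received in ns */
};

/* Pick the task with the smallest vruntime (the "leftmost" task). */
static size_t toy_pick_next(const struct toy_task *tasks, size_t n)
{
    size_t best = 0;

    for (size_t i = 1; i < n; i++)
        if (tasks[i].vruntime < tasks[best].vruntime)
            best = i;
    return best;
}

/* Run `ticks` ticks of `tick_ns` each, charging weighted virtual time. */
static void toy_run(struct toy_task *tasks, size_t n,
                    uint64_t tick_ns, unsigned int ticks)
{
    for (unsigned int t = 0; t < ticks; t++) {
        struct toy_task *p = &tasks[toy_pick_next(tasks, n)];

        p->runtime += tick_ns;
        /* A heavier (higher-priority) task accumulates vruntime more slowly. */
        p->vruntime += tick_ns * TOY_NICE_0_WEIGHT / p->weight;
    }
}
```

With two tasks of equal weight the real CPU time splits evenly; if one task has twice the weight of the other, it ends up with roughly twice the real CPU time even though both vruntime values stay close together.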

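4. Core CFS concepts

4.1 Virtual runtime (vruntime)

A task's virtual runtime indicates when its next time slice would start executing on the ideal multitasking CPU described above. In practice, a task's
virtual runtime is its actual runtime normalized to the total number of running tasks.

In CFS the virtual runtime is expressed and tracked through the per-task p->se.vruntime value (in nanoseconds). This makes it
possible to timestamp and measure precisely the "expected CPU time" a task should have received.

On "ideal" hardware, all tasks would have the same p->se.vruntime value at any moment;
that is, tasks would execute simultaneously and none would ever get "out of balance" with respect to its ideal share of CPU time.

CFS's task-picking logic is based on the p->se.vruntime value and is therefore very simple: it always tries to run the task with the
smallest p->se.vruntime (that is, the task that has executed least so far). CFS always tries to split CPU time among runnable tasks
as closely as possible to what "ideal multitasking hardware" would do.

Most of the rest of CFS's design follows from this simple concept, with add-ons such as nice levels, multiprocessing, and various
algorithm variants for recognizing tasks that have been sleeping.

CFS's design is quite radical: it does not use the old runqueue data structures, but a time-ordered red-black tree that builds a
"timeline" of future task execution. It therefore carries none of the "array switch" baggage (which affects both the previous vanilla scheduler and
RSDL/SD).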

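CFS also maintains the rq->cfs.min_vruntime value, which increases monotonically and tracks the smallest
virtual runtime among all tasks in the runqueue. The total amount of work done by the system is tracked with min_vruntime, and that value is used to place newly activated
scheduling entities as far to the left of the red-black tree as possible.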
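
The "monotonically increasing" property can be sketched as follows. This is a simplified illustration, not the kernel's implementation: the function names are hypothetical, and it assumes both a current entity and a leftmost entity exist. The signed-difference comparison is used so that the ordering stays correct if the unsigned counter wraps around.

```c
#include <stdint.h>

/* Wraparound-safe maximum of two virtual times. */
static uint64_t toy_max_vruntime(uint64_t a, uint64_t b)
{
    return ((int64_t)(b - a) > 0) ? b : a;
}

/* Wraparound-safe minimum of two virtual times. */
static uint64_t toy_min_vruntime(uint64_t a, uint64_t b)
{
    return ((int64_t)(b - a) < 0) ? b : a;
}

/*
 * New min_vruntime: the smaller of the current entity's vruntime and the
 * leftmost queued entity's vruntime, but never less than the old value.
 */
static uint64_t toy_update_min_vruntime(uint64_t old_min,
                                        uint64_t curr_vruntime,
                                        uint64_t leftmost_vruntime)
{
    uint64_t candidate = toy_min_vruntime(curr_vruntime, leftmost_vruntime);

    return toy_max_vruntime(old_min, candidate);
}
```

For example, with an old minimum of 100 and entities at 150 and 120 the result is 120, whereas entities at 90 and 95 leave the minimum at 100.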

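The total number of running tasks in the runqueue is accounted through rq->cfs.load, which is the sum of the weights of the tasks on the runqueue.

CFS maintains a time-ordered red-black tree in which all runnable tasks are sorted by the p->se.vruntime key. CFS picks the
"leftmost" task from this tree and runs it. As the system keeps running, tasks that have executed move further and further to the right of the tree, slowly
but surely giving every task a chance to become the "leftmost task" and thus to get CPU time within a deterministic amount of time.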

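4.2 Task priority

The kernel represents internal priority with a simple numeric range, 0 to 139 inclusive. The lower the value, the higher the priority.
Values 0-99 are reserved for real-time processes.
Nice values -20 to +19 map to 100-139.
A real-time process always has higher priority than a normal process.

Nice values range from -20 to +19.
For a nice 0 task, virtual runtime equals actual runtime.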
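
The mapping between nice values and internal priorities is simple arithmetic. The sketch below mirrors the spirit of the kernel's priority macros, but the `toy_` names are hypothetical helpers written for this article.

```c
#define TOY_MAX_RT_PRIO   100                                  /* 0..99 are real-time */
#define TOY_NICE_WIDTH    40                                   /* nice -20..+19 */
#define TOY_DEFAULT_PRIO  (TOY_MAX_RT_PRIO + TOY_NICE_WIDTH / 2) /* 120 == nice 0 */

/* Convert a nice value (-20..+19) to an internal priority (100..139). */
static int toy_nice_to_prio(int nice)
{
    return nice + TOY_DEFAULT_PRIO;
}

/* Convert an internal priority (100..139) back to a nice value. */
static int toy_prio_to_nice(int prio)
{
    return prio - TOY_DEFAULT_PRIO;
}

/* Real-time priorities occupy the range below TOY_MAX_RT_PRIO. */
static int toy_is_rt_prio(int prio)
{
    return prio < TOY_MAX_RT_PRIO;
}
```

Under these assumptions nice -20 maps to 100, nice 0 to 120, and nice +19 to 139.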

4.2.1 How task priority relates to virtual runtime

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

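The kernel uses the weight to compute how much CPU time a task receives within one scheduling period.

As a rule of thumb, each nice level a process goes down earns it about 10% more CPU time, and each level it goes up costs it about 10%. To implement this policy, the kernel converts priorities into the weight table above.
For example:
Two processes with nice 0 both have weight 1024, so each gets 1024/(1024+1024) = 50% of the CPU time.
If the two processes have nice values 0 and 1, the weights are 1024 and 820, and they get 1024/(1024+820) = 55.5% and 820/(1024+820) = 44.5% of the CPU time.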
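
The arithmetic in this example can be reproduced with a few lines of C. The table is the one shown above; `toy_weight_of_nice` and `toy_cpu_share` are hypothetical helpers, and the sketch assumes all tasks are CPU-bound and share a single runqueue.

```c
#include <stddef.h>

static const int toy_prio_to_weight[40] = {
 /* -20 */ 88761, 71755, 56483, 46273, 36291,
 /* -15 */ 29154, 23254, 18705, 14949, 11916,
 /* -10 */  9548,  7620,  6100,  4904,  3906,
 /*  -5 */  3121,  2501,  1991,  1586,  1277,
 /*   0 */  1024,   820,   655,   526,   423,
 /*   5 */   335,   272,   215,   172,   137,
 /*  10 */   110,    87,    70,    56,    45,
 /*  15 */    36,    29,    23,    18,    15,
};

/* Look up the weight for a nice value in the range -20..+19. */
static int toy_weight_of_nice(int nice)
{
    return toy_prio_to_weight[nice + 20];
}

/*
 * Fraction of CPU time a task with `nice` receives when the runqueue
 * holds the tasks listed in `all_nice` (which must include the task itself).
 */
static double toy_cpu_share(int nice, const int *all_nice, size_t n)
{
    long total = 0;

    for (size_t i = 0; i < n; i++)
        total += toy_weight_of_nice(all_nice[i]);
    return (double)toy_weight_of_nice(nice) / (double)total;
}
```

Calling `toy_cpu_share(0, (int[]){0, 1}, 2)` gives about 0.555, matching the figure above.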

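Summary: the higher a task's priority, the more readily it is picked to run first, and the more CPU time it gets within a scheduling period.

Length of a scheduling period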

static u64 __sched_period(unsigned long nr_running)
{
	if (unlikely(nr_running > sched_nr_latency))
		return nr_running * sysctl_sched_min_granularity;
	else
		return sysctl_sched_latency;
}

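As the code above shows, the length of the scheduling period falls into two cases.

If the number of runnable tasks is greater than 8, the period is the minimum run time (0.75 ms) multiplied by the number of runnable tasks.

sysctl_sched_min_granularity: // minimum time slice of 0.75 ms, to limit task switching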
unsigned int sysctl_sched_min_granularity	= 750000ULL;

If the runqueue holds 8 or fewer tasks, the scheduling period is 6 ms (the minimum scheduling period).

unsigned int sysctl_sched_latency			= 6000000ULL;
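
Plugging in these defaults gives a standalone version of the period calculation. This is a sketch using the two values quoted above; the `TOY_` constants and `toy_sched_period` are hypothetical names, and the threshold of 8 is derived here as latency divided by minimum granularity.

```c
#include <stdint.h>

#define TOY_SCHED_LATENCY_NS    6000000ull  /* 6 ms */
#define TOY_MIN_GRANULARITY_NS   750000ull  /* 0.75 ms */
#define TOY_NR_LATENCY (TOY_SCHED_LATENCY_NS / TOY_MIN_GRANULARITY_NS) /* 8 */

/* Length of one scheduling period in ns for nr_running runnable tasks. */
static uint64_t toy_sched_period(unsigned long nr_running)
{
    if (nr_running > TOY_NR_LATENCY)
        return nr_running * TOY_MIN_GRANULARITY_NS;
    return TOY_SCHED_LATENCY_NS;
}
```

With 3 or 8 runnable tasks the period stays at 6 ms; with 10 tasks it stretches to 7.5 ms so that each task can still run for at least 0.75 ms.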

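CPU time a task can receive within one scheduling period

To compute CPU time quickly, the kernel precomputes not only the weights but also the values used for division.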

/*
 * Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.
 *
 * In cases where the weight does not change often, we can use the
 * precalculated inverse to speed up arithmetics by turning divisions
 * into multiplications:
 */
const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

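This lets the kernel turn the division into a multiplication and a shift when it derives a task's virtual runtime from its weight, which is faster.
Suppose the task's actual run time is delta_exec and its weight is lw.weight. Then delta_exec*1024/lw.weight can be rewritten as (delta_exec * (1024 * lw->inv_weight)) >> 32, where lw->inv_weight is the precomputed 2^32/lw.weight.
1024 is the weight of a nice 0 task.
For a nice 0 task, virtual runtime equals actual runtime.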
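
The multiply-and-shift trick can be tried out in isolation. The sketch below is a simplified illustration with a hypothetical function name; it uses a plain 64-bit product, which can overflow for long intervals at high nice values, whereas the kernel's real helper takes extra care to avoid that.

```c
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024ull

/*
 * Approximate delta_exec_ns * 1024 / weight using the precomputed
 * inverse inv_weight = 2^32 / weight, so no division is needed.
 * Assumption: delta_exec_ns is small enough that the product fits in 64 bits.
 */
static uint64_t toy_calc_delta_fair(uint64_t delta_exec_ns, uint32_t inv_weight)
{
    return (delta_exec_ns * TOY_NICE_0_LOAD * inv_weight) >> 32;
}
```

For a nice 0 task (inverse 4194304, which is 2^22) the result equals delta_exec exactly, because 1024 * 2^22 is 2^32. For a nice 1 task (inverse 5237765), 1 ms of real time becomes roughly 1.249 ms of virtual time, close to the exact quotient 1000000 * 1024 / 820.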

Computing the CPU time a task can get within one period:

static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	unsigned int nr_running = cfs_rq->nr_running;
	struct sched_entity *init_se = se;
	unsigned int min_gran;
	u64 slice;

	if (sched_feat(ALT_PERIOD))
		nr_running = rq_of(cfs_rq)->cfs.h_nr_running;

	slice = __sched_period(nr_running + !se->on_rq);  // length of one scheduling period, as computed above

	for_each_sched_entity(se) {
		struct load_weight *load;
		struct load_weight lw;
		struct cfs_rq *qcfs_rq;

		qcfs_rq = cfs_rq_of(se);
		load = &qcfs_rq->load;

		if (unlikely(!se->on_rq)) {
			lw = qcfs_rq->load;

			update_load_add(&lw, se->load.weight);
			load = &lw;
		}
		slice = __calc_delta(slice, se->load.weight, load); // scale the slice by the task's weight, i.e. its priority
	}

	if (sched_feat(BASE_SLICE)) {
		if (se_is_idle(init_se) && !sched_idle_cfs_rq(cfs_rq))
			min_gran = sysctl_sched_idle_min_granularity;
		else
			min_gran = sysctl_sched_min_granularity;  // guarantee a minimum run time so tasks are not switched too often

		slice = max_t(u64, slice, min_gran);
	}

	return slice;
}
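
Ignoring group scheduling and the feature flags, the core of this calculation reduces to a proportional share with a lower bound. The following is a flat, single-level sketch under those assumptions; `toy_sched_slice` is a hypothetical name and plain division replaces __calc_delta.

```c
#include <stdint.h>

/*
 * Time slice in ns for one task:
 *   period * (task weight / total runqueue weight),
 * but never less than the minimum granularity.
 */
static uint64_t toy_sched_slice(uint64_t period_ns, uint32_t weight,
                                uint64_t total_weight, uint64_t min_gran_ns)
{
    uint64_t slice = period_ns * weight / total_weight;

    return slice < min_gran_ns ? min_gran_ns : slice;
}
```

Two nice 0 tasks sharing a 6 ms period get 3 ms each. A nice 19 task (weight 15) next to a nice 0 task would get only about 0.087 ms from the proportional share, so the 0.75 ms minimum applies instead.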

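5. Running the scheduler

Scheduling happens at the following points:

The periodic timer tick
A task uses up its time slice, or voluntarily gives up the CPU while waiting for a resource, which triggers scheduling
At preemption check points (such as system call exit or interrupt return), where the kernel checks whether the current task should be preempted

5.1 Periodic scheduling

Periodic scheduling is driven by the timer interrupt. On each tick it does the following:

Updates the virtual runtime of the current task
Checks whether the current task should be preempted (its time within the scheduling period is used up)
If cgroups are configured, performs cgroup processing as well (updating cgroup statistics, checking quota usage, and so on)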

5.1.1 Updating task run time

Periodic scheduling: `scheduler_tick` ----> `task_tick_fair`

The periodic scheduling function calls into the task's scheduling class; for CFS that is task_tick_fair:

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &curr->se;

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		entity_tick(cfs_rq, se, queued); // update the scheduling entity (the task's bookkeeping), see below
	}

	if (static_branch_unlikely(&sched_numa_balancing))
		task_tick_numa(rq, curr);

	update_misfit_status(curr, rq);
	update_overutilized_status(task_rq(curr));

	task_tick_core(rq, curr); // this feature (core scheduling) is usually not enabled, so ignore it
}

entity_tick:

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq); // update the run-time bookkeeping of the current task, annotated in detail below

...

	if (cfs_rq->nr_running > 1) // if more than one entity is on the runqueue, check for preemption
		check_preempt_tick(cfs_rq, curr); // decide whether to preempt the current task, annotated in detail below
}

update_curr:

static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;

	curr->exec_start = now;

	if (schedstat_enabled()) {
		struct sched_statistics *stats;

		stats = __schedstats_from_se(curr);
		__schedstat_set(stats->exec_max,
				max(delta_exec, stats->exec_max));
	}

	curr->sum_exec_runtime += delta_exec;  // update the task's total run time
	schedstat_add(cfs_rq->exec_clock, delta_exec);

	curr->vruntime += calc_delta_fair(delta_exec, curr); // update the current task's virtual runtime; the calculation depends on the task's priority, see above
	update_min_vruntime(cfs_rq); // update the runqueue's minimum vruntime

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
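		cgroup_account_cputime(curtask, delta_exec); // update cgroup statistics, /sys/fs/cgroup/cpuacct.usage: run-time accounting in ns
		account_group_exec_runtime(curtask, delta_exec); /* update the process's run time; this run time can be read through the system call
		int getitimer(int which, struct itimerval *curr_value); */
	}

	account_cfs_rq_runtime(cfs_rq, delta_exec); /*
	This function only does real work when CFS cgroup bandwidth control (quota) is enabled.
	If the previously assigned runtime is used up and quota remains, more runtime is assigned: the smaller of the remaining quota and 5 ms.
	static unsigned int sysctl_sched_cfs_bandwidth_slice		= 5000UL;
	If the cgroup quota for this period is exhausted, the cgroup bandwidth timer is started and the `need_resched` flag is set, so the scheduler switches tasks and the current task can be preempted */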
}

Preemption check

check_preempt_tick:

static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	unsigned long ideal_runtime, delta_exec;
	struct sched_entity *se;
	s64 delta;

	/*
	 * When many tasks blow up the sched_period; it is possible that
	 * sched_slice() reports unusually large results (when many tasks are
	 * very light for example). Therefore impose a maximum.
	 */
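	ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);  // ideal run time for the current task
	/* sched_slice computes the run time the current task is entitled to within one scheduling period. With many tasks on the runqueue this value can become large,
	so to keep the system responsive it is capped at sysctl_sched_latency (6 ms by default).

	The scheduling period has two cases. If the number of runnable tasks is greater than 8, it is the minimum run time (0.75 ms) multiplied by the number of runnable tasks.
		sysctl_sched_min_granularity: minimum time slice of 0.75 ms, to limit task switching.
		unsigned int sysctl_sched_min_granularity	= 750000ULL;

		If the runqueue holds 8 or fewer tasks, the scheduling period is 6 ms (the minimum scheduling period).
		unsigned int sysctl_sched_latency			= 6000000ULL;
	*/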

	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime; // real time the current task has run since it was last picked
	if (delta_exec > ideal_runtime) {  // it has run longer than its ideal run time, so trigger preemption
		resched_curr(rq_of(cfs_rq));
		/*
		 * The current task ran long enough, ensure it doesn't get
		 * re-elected due to buddy favours.
		 */
		clear_buddies(cfs_rq, curr);
		return;
	}

	/*
	 * Ensure that a task that missed wakeup preemption by a
	 * narrow margin doesn't have to wait for a full slice.
	 * This also mitigates buddy induced latencies under load.
	 */
	if (delta_exec < sysctl_sched_min_granularity) // it has not yet run for the minimum run time (0.75 ms), so return and keep running
		return;

	se = __pick_first_entity(cfs_rq);
	delta = curr->vruntime - se->vruntime; // how far the current task's vruntime is ahead of the leftmost task on the runqueue

	if (delta < 0)
		return;

	if (delta > ideal_runtime) // the vruntime gap exceeds the ideal run time, so preempt
		resched_curr(rq_of(cfs_rq));
}
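
Stripped of the kernel data structures, the decision made by check_preempt_tick is a small pure function. The sketch below restates the three checks under that simplification; `toy_should_preempt` and its parameter list are hypothetical, and buddy handling is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the current task should be preempted on this tick.
 *   ideal_runtime     - slice the task is entitled to (ns)
 *   delta_exec        - real time it has run since it was picked (ns)
 *   curr_vruntime     - vruntime of the current task
 *   leftmost_vruntime - vruntime of the leftmost queued task
 *   min_gran          - minimum run time before preemption is considered (ns)
 */
static bool toy_should_preempt(uint64_t ideal_runtime, uint64_t delta_exec,
                               uint64_t curr_vruntime,
                               uint64_t leftmost_vruntime, uint64_t min_gran)
{
    int64_t delta;

    if (delta_exec > ideal_runtime)   /* slice used up */
        return true;
    if (delta_exec < min_gran)        /* too early to switch */
        return false;

    delta = (int64_t)(curr_vruntime - leftmost_vruntime);
    if (delta < 0)                    /* current task is still the most deserving */
        return false;
    return (uint64_t)delta > ideal_runtime;
}
```

Note that the last check compares a virtual-time gap against a real-time slice, just as the listing above does.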

cgroup handling

account_cfs_rq_runtime:

static void __account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
	/* dock delta_exec before expiring quota (as it could span periods) */
	cfs_rq->runtime_remaining -= delta_exec; // subtract the time just run from the local quota

	if (likely(cfs_rq->runtime_remaining > 0)) // quota remains, return directly
		return;

	if (cfs_rq->throttled) // quota already exhausted and already handled, return directly
		return;
	/*
	 * if we're unable to extend our runtime we resched so that the active
	 * hierarchy can be throttled
	 */
	if (!assign_cfs_rq_runtime(cfs_rq) && likely(cfs_rq->curr)) // no quota could be obtained, so trigger a reschedule
		resched_curr(rq_of(cfs_rq));
}
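
The interplay between the per-runqueue budget and the group-wide pool can be modelled in a few lines. This is a simplified sketch under stated assumptions: `toy_pool`, `toy_rq`, and `toy_account_runtime` are hypothetical, locking and timers are omitted, and the 5 ms slice is the default mentioned above.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_BANDWIDTH_SLICE_NS 5000000ll  /* 5 ms */

struct toy_pool { int64_t runtime; };            /* group-wide quota left this period */
struct toy_rq   { int64_t runtime_remaining; };  /* local budget of one runqueue */

/*
 * Charge delta_exec to the runqueue. When the local budget runs out, try to
 * top it up to one slice from the pool. Returns false when no runtime could
 * be obtained, which is the point where the kernel would reschedule so the
 * group can be throttled.
 */
static bool toy_account_runtime(struct toy_pool *pool, struct toy_rq *rq,
                                int64_t delta_exec)
{
    int64_t want, amount;

    rq->runtime_remaining -= delta_exec;
    if (rq->runtime_remaining > 0)
        return true;

    /* Cover any overrun and then add one full slice, if the pool allows it. */
    want = TOY_BANDWIDTH_SLICE_NS - rq->runtime_remaining;
    amount = want < pool->runtime ? want : pool->runtime;
    pool->runtime -= amount;
    rq->runtime_remaining += amount;

    return rq->runtime_remaining > 0;
}
```

Handing out quota in slices rather than all at once keeps several CPUs from each reserving the whole quota for themselves.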

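5.2 Preemption

5.2.1 When preemption happens

5.2.1.1 Voluntarily giving up the CPU

A running process gives up the CPU voluntarily because it is waiting for a resource such as a lock, a semaphore, or a wait queue. The scheduler then has to choose the next process to run.

pick_next_task(...):
Checks whether a task has become runnable and should preempt the current task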

static inline struct task_struct *
__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	const struct sched_class *class;
	struct task_struct *p;

	/*
	 * Optimization: we know that if all tasks are in the fair class we can
	 * call that function directly, but only if the @prev task wasn't of a
	 * higher scheduling class, because otherwise those lose the
	 * opportunity to pull in more work from other CPUs.
	 */
	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&  // prev's scheduling class is CFS or lower (that is, CFS or idle) and every task on the runqueue is a CFS task. Class order from highest to lowest priority: deadline, rt, fair, idle
		   rq->nr_running == rq->cfs.h_nr_running)) {

		p = pick_next_task_fair(rq, prev, rf); // ask the CFS scheduling class for the next task
		if (unlikely(p == RETRY_TASK))
			goto restart;

		/* Assume the next prioritized class is idle_sched_class */
		if (!p) {
			put_prev_task(rq, prev);
			p = pick_next_task_idle(rq);
		}

		return p;
	}

restart:
	put_prev_task_balance(rq, prev, rf);

	for_each_class(class) {   // try the classes in the order stop_sched, dl_sched, rt_sched, fair_sched, idle_sched
		p = class->pick_next_task(rq);
		if (p)
			return p;
	}

	BUG(); /* The idle class should always have a runnable task. */
}

pick_next_task_fair:


```c
struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	struct cfs_rq *cfs_rq = &rq->cfs;
	struct sched_entity *se;
	struct task_struct *p;
	int new_tasks;

again:
	if (!sched_fair_runnable(rq))  // is there any task on the runqueue?
		goto idle;

#ifdef CONFIG_FAIR_GROUP_SCHED
	if (!prev || prev->sched_class != &fair_sched_class)
		goto simple;

	/*
	 * Because of the set_next_buddy() in dequeue_task_fair() it is rather
	 * likely that a next task is from the same cgroup as the current.
	 *
	 * Therefore attempt to avoid putting and setting the entire cgroup
	 * hierarchy, only change the part that actually changes.
	 */

	do {
		struct sched_entity *curr = cfs_rq->curr;

		/*
		 * Since we got here without doing put_prev_entity() we also
		 * have to consider cfs_rq->curr. If it is still a runnable
		 * entity, update_curr() will update its vruntime, otherwise
		 * forget we've ever seen it.
		 */
		if (curr) {
			if (curr->on_rq)
				update_curr(cfs_rq);  // update the current task's data (vruntime and so on)
			else
				curr = NULL;

			/*
			 * This call to check_cfs_rq_runtime() will do the
			 * throttle and dequeue its entity in the parent(s).
			 * Therefore the nr_running test will indeed
			 * be correct.
			 */
			if (unlikely(check_cfs_rq_runtime(cfs_rq))) { // check whether the cfs_rq's cgroup quota is used up
				cfs_rq = &rq->cfs;

				if (!cfs_rq->nr_running)
					goto idle;

				goto simple;
			}
		}

		se = pick_next_entity(cfs_rq);
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);

	p = task_of(se);

	/*
	 * Since we haven't yet done put_prev_entity and if the selected task
	 * is a different task than we started out with, try and touch the
	 * least amount of cfs_rqs.
	 */
	if (prev != p) {
		struct sched_entity *pse = &prev->se;

		while (!(cfs_rq = is_same_group(se, pse))) {
			int se_depth = se->depth;
			int pse_depth = pse->depth;

			if (se_depth <= pse_depth) {
				put_prev_entity(cfs_rq_of(pse), pse);
				pse = parent_entity(pse);
			}
			if (se_depth >= pse_depth) {
				set_next_entity(cfs_rq_of(se), se);
				se = parent_entity(se);
			}
		}

		put_prev_entity(cfs_rq, pse);
		set_next_entity(cfs_rq, se);
	}

	goto done;
simple:
#endif
	if (prev)
		put_prev_task(rq, prev);

	do {
		se = pick_next_entity(cfs_rq);
		set_next_entity(cfs_rq, se);
		cfs_rq = group_cfs_rq(se);
	} while (cfs_rq);

	p = task_of(se);

done: __maybe_unused;
#ifdef CONFIG_SMP
	/*
	 * Move the next running task to the front of
	 * the list, so our cfs_tasks list becomes MRU
	 * one.
	 */
	list_move(&p->se.group_node, &rq->cfs_tasks);
#endif

	if (hrtick_enabled_fair(rq))
		hrtick_start_fair(rq, p);

	update_misfit_status(p, rq);
	sched_fair_update_stop_tick(rq, p);

	return p;

idle:
	if (!rf)
		return NULL;

	new_tasks = newidle_balance(rq, rf);

	/*
	 * Because newidle_balance() releases (and re-acquires) rq->lock, it is
	 * possible for any higher priority task to appear. In that case we
	 * must re-start the pick_next_entity() loop.
	 */
	if (new_tasks < 0)
		return RETRY_TASK;

	if (new_tasks > 0)
		goto again;

	/*
	 * rq is about to be idle, check if we need to update the
	 * lost_idle_time of clock_pelt
	 */
	update_idle_rq_clock_pelt(rq);

	return NULL;
}
```

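pick_next_entity (this listing calls pick_eevdf, so it appears to come from a newer kernel, 6.6 or later, in which EEVDF replaced the leftmost-vruntime pick used by the listings above):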

/*
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq)
{
	/*
	 * Enabling NEXT_BUDDY will affect latency but not fairness.
	 */
	if (sched_feat(NEXT_BUDDY) &&
	    cfs_rq->next && entity_eligible(cfs_rq, cfs_rq->next))
		return cfs_rq->next;

	return pick_eevdf(cfs_rq);
}

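6. Scheduling and cgroups

6.1 Configuring cgroup quota

In cgroup v1, CPU bandwidth is configured through two files:

cpu.cfs_period_us: the length of the bandwidth allocation period
cpu.cfs_quota_us: the CPU time that may be used within one allocation period

In cgroup v2 it is configured through a single file:

cpu.max

Limits: an allocation period is at most 1 s.
The quota is at least 1 ms, and its maximum is more than 203 days.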

const u64 max_cfs_quota_period = 1 * NSEC_PER_SEC; /* 1s */
static const u64 min_cfs_quota_period = 1 * NSEC_PER_MSEC; /* 1ms */
/* More than 203 days if BW_SHIFT equals 20. */
static const u64 max_cfs_runtime = MAX_BW * NSEC_PER_USEC;
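
These limits can be turned into a small validity check, together with the CPU share that a quota/period pair implies. The sketch is illustrative only: all `toy_` and `TOY_` names are hypothetical, inputs are in microseconds as in the cgroup v1 files, and the value of MAX_BW is an assumption derived from the "BW_SHIFT equals 20" comment above (2^44 - 1).

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_NSEC_PER_USEC 1000ull
#define TOY_NSEC_PER_MSEC 1000000ull
#define TOY_NSEC_PER_SEC  1000000000ull
#define TOY_MAX_BW        ((1ull << 44) - 1)   /* assumed: 64 - BW_SHIFT bits */

static const uint64_t toy_max_period  = 1 * TOY_NSEC_PER_SEC;   /* 1 s */
static const uint64_t toy_min_period  = 1 * TOY_NSEC_PER_MSEC;  /* 1 ms */
static const uint64_t toy_max_runtime = TOY_MAX_BW * TOY_NSEC_PER_USEC;

/* Check a quota/period pair (both in microseconds) against the limits. */
static bool toy_bandwidth_valid(uint64_t quota_us, uint64_t period_us)
{
    uint64_t quota  = quota_us * TOY_NSEC_PER_USEC;
    uint64_t period = period_us * TOY_NSEC_PER_USEC;

    if (period < toy_min_period || period > toy_max_period)
        return false;
    if (quota < toy_min_period || quota > toy_max_runtime)
        return false;
    return true;
}

/* Number of CPUs' worth of time the group may use, e.g. 0.5 or 2.0. */
static double toy_cpu_limit(uint64_t quota_us, uint64_t period_us)
{
    return (double)quota_us / (double)period_us;
}
```

For instance, a quota of 50000 us with a period of 100000 us limits the group to half a CPU, while a quota of 200000 us with the same period allows up to two CPUs.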

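6.2 Quota allocation

Once a CPU cgroup quota is set, the kernel replenishes cfs_quota_us of quota every cfs_period_us.
If the quota is used up before the period ends, the group's tasks cannot get the CPU again until the next period.

Throttled state: the state of a runqueue that cannot continue to run in the current period because it has used up its cgroup quota.

6.2.1 Implementation of periodic quota allocation

A timer is started whose timeout is the cgroup's `cpu.cfs_period_us`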

void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
{
	lockdep_assert_held(&cfs_b->lock);

	if (cfs_b->period_active)
		return;

	cfs_b->period_active = 1;
	hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
	hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
}

Timer handler (excerpt)

do_sched_cfs_period_timer

if (cfs_b->idle && !throttled)   // idle and no runqueue was waiting for runtime, jump straight to out_deactivate
	goto out_deactivate;

if (!throttled) {   // no runqueue currently needs runtime, mark the group idle and return
	/* mark as potentially idle for the upcoming period */
	cfs_b->idle = 1;
	return 0;
}

/* account preceding periods in which throttling occurred */
cfs_b->nr_throttled += overrun;  // count the periods in which throttling occurred

/* Refill extra burst quota even if cfs_b->idle */
__refill_cfs_bandwidth_runtime(cfs_b);  // refill the cgroup quota held in cfs_b, annotated in detail below
	
while (throttled && cfs_b->runtime > 0) {  // some runqueues are throttled and the cgroup has quota, so start handing it out
	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
	/* we can't nest cfs_b->lock while distributing bandwidth */
	throttled = distribute_cfs_runtime(cfs_b); // walk the throttled list and give each runqueue some quota
	raw_spin_lock_irqsave(&cfs_b->lock, flags);
}

cfs_b->idle = 0; // clear the idle state

return 0;

out_deactivate:
	return 1;

__refill_cfs_bandwidth_runtime

void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
{
	s64 runtime;

	if (unlikely(cfs_b->quota == RUNTIME_INF)) // no cgroup quota configured, return directly
		return;

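	cfs_b->runtime += cfs_b->quota;      // add the quota (the cgroup's cpu.cfs_quota_us). Note: runtime left over from the previous period is carried along, subject to the cap below
	runtime = cfs_b->runtime_snap - cfs_b->runtime; // compute the burst time used in the previous period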
	if (runtime > 0) {
		cfs_b->burst_time += runtime;
		cfs_b->nr_burst++;
	}

	cfs_b->runtime = min(cfs_b->runtime, cfs_b->quota + cfs_b->burst); // the runtime available in one period cannot exceed quota + burst
	cfs_b->runtime_snap = cfs_b->runtime;
}
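
The effect of the cap is easiest to see with numbers. The function below is a reduced sketch of the refill step only: `toy_refill` is a hypothetical name, the burst statistics are left out, and the unlimited-quota case is not handled.

```c
#include <stdint.h>

/*
 * Runtime available at the start of a new period:
 * leftover + quota, capped at quota + burst.
 */
static uint64_t toy_refill(uint64_t leftover, uint64_t quota, uint64_t burst)
{
    uint64_t runtime = leftover + quota;
    uint64_t cap = quota + burst;

    return runtime < cap ? runtime : cap;
}
```

With a quota of 50 and no burst, 30 units of leftover runtime are simply lost: the new period starts with 50. With a burst of 20, the same leftover yields 70, and a leftover of 10 yields 60.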

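6.3 cgroup debugging data

/sys/fs/cgroup/cpu/cpu.stat

nr_periods 0  # number of bandwidth periods that have elapsed for this CFS cgroup
nr_throttled 0  # number of times the group has been throttled
throttled_time 0 # total time the group has spent in the throttled state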
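
A quick way to use these counters is to compute the fraction of periods in which the group was throttled. The sketch below parses text in the format shown above from an in-memory string; `toy_cpu_stat`, `toy_parse_cpu_stat`, and `toy_throttle_ratio` are hypothetical helpers, and reading the actual file is left to the caller.

```c
#include <stdio.h>
#include <string.h>

struct toy_cpu_stat {
    unsigned long long nr_periods;
    unsigned long long nr_throttled;
    unsigned long long throttled_time;
};

/* Parse "key value" lines; returns how many known keys were found. */
static int toy_parse_cpu_stat(const char *text, struct toy_cpu_stat *out)
{
    const char *line = text;
    int found = 0;

    memset(out, 0, sizeof(*out));
    while (line && *line) {
        char key[32];
        unsigned long long val;

        if (sscanf(line, "%31s %llu", key, &val) == 2) {
            if (strcmp(key, "nr_periods") == 0) {
                out->nr_periods = val;
                found++;
            } else if (strcmp(key, "nr_throttled") == 0) {
                out->nr_throttled = val;
                found++;
            } else if (strcmp(key, "throttled_time") == 0) {
                out->throttled_time = val;
                found++;
            }
        }
        line = strchr(line, '\n');   /* advance to the next line, if any */
        if (line)
            line++;
    }
    return found;
}

/* Fraction of periods in which throttling occurred (0.0 if no periods yet). */
static double toy_throttle_ratio(const struct toy_cpu_stat *s)
{
    if (s->nr_periods == 0)
        return 0.0;
    return (double)s->nr_throttled / (double)s->nr_periods;
}
```

A ratio that stays high over time suggests the quota is too small for the workload, or that the workload is bursty relative to the configured period.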