Xiao Zhang Learns the Linux Kernel, Part 5: The CFS and RT Scheduling Classes


加油2019


Article link: https://blog.csdn.net/qq_40036519/article/details/106038525


Today we look at two scheduling classes: CFS and RT.

The CFS scheduling class

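Overview
CFS is the Completely Fair Scheduler. Ideally, two tasks of equal priority each get 50% of the CPU, and three such tasks each get 1/3. CFS, however, plays down the notion of priority and uses a weight to decide how long each task runs. For example, if tasks A, B, and C have weights 1, 2, and 3, the total weight is 6; over a scheduling period of 6 time units, A ideally runs for 1 unit, B for 2, and C for 3.
CFS uses a virtual time, vruntime, to decide which task runs: nice value (-20 to 19) --> weight --> vruntime. Scheduling entities (se) are kept in an rbtree, and CFS always runs the task with the smallest vruntime, which is the leftmost se in the tree. Fairness comes from limiting how long the current task runs: its vruntime only increases, so it moves to the right in the rbtree and gives up the CPU to a task with a smaller vruntime.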
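The selection rule can be tried out in userspace. The following is a minimal simulation under stated assumptions, not kernel code: a linear scan stands in for the rbtree's leftmost lookup, and `struct sim_task`, `pick_min_vruntime`, and `simulate` are hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NICE_0_LOAD 1024u

/* Hypothetical stand-in for a scheduling entity. */
struct sim_task {
	uint32_t weight;    /* load weight; larger means more CPU time */
	uint64_t vruntime;  /* weighted virtual runtime */
	uint64_t runtime;   /* real time actually received */
};

/* Linear scan replacing the rbtree: index of the smallest vruntime.
 * Ties go to the lowest index. */
static size_t pick_min_vruntime(const struct sim_task *t, size_t n)
{
	size_t best = 0;

	assert(n > 0);
	for (size_t i = 1; i < n; i++)
		if (t[i].vruntime < t[best].vruntime)
			best = i;
	return best;
}

/* Run `ticks` slices of `tick_ns` each, always picking the smallest vruntime. */
static void simulate(struct sim_task *t, size_t n, unsigned ticks, uint64_t tick_ns)
{
	for (unsigned k = 0; k < ticks; k++) {
		struct sim_task *cur = &t[pick_min_vruntime(t, n)];

		cur->runtime += tick_ns;
		/* A heavier task accumulates vruntime more slowly. */
		cur->vruntime += tick_ns * NICE_0_LOAD / cur->weight;
	}
}
```

With three tasks of weights 1024, 2048, and 3072 (the 1:2:3 example above), the real runtimes handed out by `simulate` settle into the same 1:2:3 ratio, because each task's vruntime advances in inverse proportion to its weight.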

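Converting between nice, weight, and vruntime

Normal processes, which CFS handles, have nice values from -20 to 19, corresponding to priorities 100 to 139. Priorities 0 to 99 belong to real-time processes, which use the RT scheduling class.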
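The mapping between nice values and priorities is plain arithmetic. Here is a small sketch of it; the constants follow the ranges stated above, while the function names `nice_to_prio`, `prio_to_nice`, and `prio_is_rt` are hypothetical helpers written for this example rather than kernel functions.

```c
#include <assert.h>

#define MAX_RT_PRIO   100                            /* priorities 0..99 are real-time */
#define NICE_WIDTH    40                             /* nice spans -20..19 */
#define DEFAULT_PRIO  (MAX_RT_PRIO + NICE_WIDTH / 2) /* 120, i.e. nice 0 */

/* Map a nice value (-20..19) to a priority (100..139). */
static int nice_to_prio(int nice)
{
	assert(nice >= -20 && nice <= 19);
	return nice + DEFAULT_PRIO;
}

/* Inverse mapping, valid only for the normal (non-real-time) range. */
static int prio_to_nice(int prio)
{
	assert(prio >= MAX_RT_PRIO && prio < MAX_RT_PRIO + NICE_WIDTH);
	return prio - DEFAULT_PRIO;
}

/* Nonzero when the priority falls in the real-time range 0..99. */
static int prio_is_rt(int prio)
{
	return prio >= 0 && prio < MAX_RT_PRIO;
}
```

For instance, nice -20 maps to priority 100, nice 0 to 120, and nice 19 to 139.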
A nice value is converted to a weight through the table sched_prio_to_weight.
In kernel/sched/core.c:

const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

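A table lookup quickly gives the weight for a nice value. nice 0 has weight 1024, and for a nice-0 task vruntime advances at the same rate as physical time, so no weighting is needed. Each step up in nice lowers the weight by about 20% (adjacent entries differ by a factor of about 1.25), which corresponds to roughly 10% less CPU time for that task.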
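To see what the table means in practice, the sketch below looks up weights and computes how the CPU splits between two competing tasks. The table is a copy of the one shown above; `nice_to_weight` and `share_permille` are hypothetical helper names, and the two-task setup is an assumption made for the example.

```c
#include <assert.h>

/* Copy of the 40-entry table shown above, indexed by nice + 20. */
static const int prio_to_weight[40] = {
	88761, 71755, 56483, 46273, 36291,
	29154, 23254, 18705, 14949, 11916,
	 9548,  7620,  6100,  4904,  3906,
	 3121,  2501,  1991,  1586,  1277,
	 1024,   820,   655,   526,   423,
	  335,   272,   215,   172,   137,
	  110,    87,    70,    56,    45,
	   36,    29,    23,    18,    15,
};

/* Look up the weight for a nice value in -20..19. */
static int nice_to_weight(int nice)
{
	assert(nice >= -20 && nice <= 19);
	return prio_to_weight[nice + 20];
}

/* CPU share, in tenths of a percent, of a task with weight `w`
 * that competes with exactly one other task of weight `other`. */
static int share_permille(int w, int other)
{
	assert(w > 0 && other > 0);
	return w * 1000 / (w + other);
}
```

Two nice-0 tasks split the CPU 50/50. If one of them moves to nice 1, the split becomes about 55.5% to 44.5%, which is the "roughly 10%" step the table is designed around.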
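vruntime is computed from delta, the difference between the current time and the time of the task's last update, scaled by the task's weight.
The periodic scheduler, scheduler_tick(), calls the task_tick function of the current scheduling class:
task_tick_fair()
->entity_tick()
->->update_curr() updates the current task's vruntime
In kernel/sched/fair.c: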

static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));        /* current time */
	u64 delta_exec;       

	if (unlikely(!curr))
		return;

	delta_exec = now - curr->exec_start;         /* time elapsed since the last update */
	if (unlikely((s64)delta_exec <= 0))
		return;

	curr->exec_start = now;				/* record the time of this update */

	schedstat_set(curr->statistics.exec_max,
		      max(delta_exec, curr->statistics.exec_max));

	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq->exec_clock, delta_exec);

	curr->vruntime += calc_delta_fair(delta_exec, curr);      /* advance the current task's vruntime by the weighted delta */
	update_min_vruntime(cfs_rq);

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
		cgroup_account_cputime(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}

	account_cfs_rq_runtime(cfs_rq, delta_exec);
}

	curr->vruntime += calc_delta_fair(delta_exec, curr);
	/* advance the current task's vruntime by the weighted delta */

Now look at calc_delta_fair().
When nice is 0, vruntime simply grows by delta, the physical time elapsed between the two updates. When nice is not 0, delta is weighted first.

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
	if (unlikely(se->load.weight != NICE_0_LOAD))
		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
	return delta;
}

/*
 * delta_exec * weight / lw.weight
 *   OR
 * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
 *
 * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
 * we're guaranteed shift stays positive because inv_weight is guaranteed to
 * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
 *
 * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
 * weight/lw.weight <= 1, and therefore our shift will also be positive.
 */
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
	u64 fact = scale_load_down(weight);
	int shift = WMULT_SHIFT;

	__update_inv_weight(lw);

	if (unlikely(fact >> 32)) {
		while (fact >> 32) {
			fact >>= 1;
			shift--;
		}
	}

	/* hint to use a 32x32->64 mul */
	fact = (u64)(u32)fact * lw->inv_weight;

	while (fact >> 32) {
		fact >>= 1;
		shift--;
	}

	return mul_u64_u32_shr(delta_exec, fact, shift);
}
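To make the accounting path concrete, here is a minimal userspace sketch of what update_curr() and calc_delta_fair() do to one entity. It is a simplification under stated assumptions: `struct sim_entity` and the `sim_` functions are hypothetical, statistics and cgroup accounting are left out, and the weighting uses a plain 64-bit division instead of the kernel's multiply-and-shift.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024u

/* Hypothetical, trimmed-down scheduling entity. */
struct sim_entity {
	uint32_t weight;            /* load weight of the entity */
	uint64_t exec_start;        /* time of the last accounting update */
	uint64_t sum_exec_runtime;  /* total real time consumed */
	uint64_t vruntime;          /* total weighted time consumed */
};

/* Weight a real-time delta: unchanged at nice 0,
 * scaled by NICE_0_LOAD / weight otherwise. */
static uint64_t sim_calc_delta_fair(uint64_t delta, const struct sim_entity *se)
{
	assert(se->weight > 0);
	if (se->weight != NICE_0_LOAD)
		delta = delta * NICE_0_LOAD / se->weight;
	return delta;
}

/* Account the time that passed since the last update. */
static void sim_update_curr(struct sim_entity *se, uint64_t now)
{
	int64_t delta_exec = (int64_t)(now - se->exec_start);

	if (delta_exec <= 0)  /* nothing to account, or clock went backwards */
		return;

	se->exec_start = now;
	se->sum_exec_runtime += (uint64_t)delta_exec;
	se->vruntime += sim_calc_delta_fair((uint64_t)delta_exec, se);
}
```

After 4 ms of real time, an entity with weight 1024 has gained 4 ms of vruntime, while an entity with weight 2048 has gained only 2 ms, so the heavier entity stays further left in the tree for the same amount of CPU time.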

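In formula form:

$$\Delta\mathrm{vruntime} = \Delta\mathrm{exec} \times \frac{\mathrm{NICE\_0\_LOAD}}{\mathrm{load}}$$

Division is slow, so the kernel turns it into a multiplication and a shift:

$$\Delta\mathrm{vruntime} = \left(\Delta\mathrm{exec} \times \mathrm{NICE\_0\_LOAD} \times \frac{2^{32}}{\mathrm{load}}\right) \gg 32$$

The factor $2^{32}/\mathrm{load}$ is again obtained by table lookup (sched_prio_to_wmult, mentioned in the comment above) and stored in inv_weight.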
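The division-free form can be checked against the straightforward division in userspace. This is a sketch under stated assumptions: `vruntime_delta_div` and `vruntime_delta_mul` are hypothetical names, the inverse weight is computed on the fly instead of being read from a table, and delta is restricted to less than 2^32 ns so that the 64-bit product cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024u
#define WMULT_SHIFT 32

/* Reference: delta * NICE_0_LOAD / weight using a 64-bit division. */
static uint64_t vruntime_delta_div(uint64_t delta_ns, uint32_t weight)
{
	assert(weight > 0);
	return delta_ns * NICE_0_LOAD / weight;
}

/* Fixed point: multiply by (NICE_0_LOAD * 2^32 / weight), then shift right. */
static uint64_t vruntime_delta_mul(uint64_t delta_ns, uint32_t weight)
{
	/* weight > 1 keeps 2^32 / weight within 32 bits;
	 * delta < 2^32 keeps delta * fact within 64 bits. */
	assert(weight > 1 && delta_ns < (1ull << 32));

	uint32_t inv_weight = (uint32_t)((1ull << 32) / weight);
	uint64_t fact = (uint64_t)NICE_0_LOAD * inv_weight;
	int shift = WMULT_SHIFT;

	/* Shrink the factor to 32 bits and shift less at the end to compensate. */
	while (fact >> 32) {
		fact >>= 1;
		shift--;
	}
	return (delta_ns * fact) >> shift;
}
```

Because the inverse is truncated to an integer, the fixed-point result can differ from the exact quotient by a small rounding error; for weight 1024 the two agree exactly.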

1. How is the next task chosen?

pick_next_task_fair():
It picks the se with the smallest vruntime in the rbtree and schedules it.
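The "smallest vruntime is leftmost" property holds for any binary search tree keyed by vruntime. A minimal sketch, assuming an unbalanced tree with hypothetical names (`struct vnode`, `vtree_insert`, `vtree_leftmost`); the kernel uses a red-black tree, which additionally keeps the tree balanced:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tree node keyed by vruntime. */
struct vnode {
	uint64_t vruntime;
	struct vnode *left, *right;
};

/* Insert into an unbalanced binary search tree and return the root.
 * Smaller keys go left; equal keys go right. */
static struct vnode *vtree_insert(struct vnode *root, struct vnode *n)
{
	struct vnode **link = &root;

	n->left = n->right = NULL;
	while (*link)
		link = (n->vruntime < (*link)->vruntime) ? &(*link)->left
							 : &(*link)->right;
	*link = n;
	return root;
}

/* The entity to run next is the leftmost node, i.e. the smallest vruntime. */
static struct vnode *vtree_leftmost(struct vnode *root)
{
	if (!root)
		return NULL;
	while (root->left)
		root = root->left;
	return root;
}
```

Inserting entities with vruntime 300, 100, and 200 in that order leaves the entity with vruntime 100 as the leftmost node, so it would be the next one to run.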

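2. When is vruntime updated, and when does the position in the rbtree change?

vruntime is updated in update_curr(), which entity_tick() calls on every tick. The running se is not in the rbtree while it runs; it is reinserted at the position given by its new vruntime when it is put back on the queue.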

3. Scheduling latency

The scheduling latency is the interval within which every runnable task gets to run at least once.
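The sketch below shows one way this interval and a task's portion of it can be computed. The numbers are assumptions: 6 ms latency, 0.75 ms minimum granularity, and a threshold of 8 tasks are taken here as base defaults for kernels of this article's era, before any scaling by CPU count. The function names `sched_period_ns` and `sched_slice_ns` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed base defaults, not read from a running kernel. */
#define SCHED_LATENCY_NS          6000000ull  /* 6 ms */
#define SCHED_MIN_GRANULARITY_NS   750000ull  /* 0.75 ms */
#define SCHED_NR_LATENCY          8u          /* latency / min granularity */

/* Length of one scheduling period for nr_running runnable tasks.
 * With many tasks the period is stretched so that no slice
 * drops below the minimum granularity. */
static uint64_t sched_period_ns(unsigned nr_running)
{
	if (nr_running > SCHED_NR_LATENCY)
		return nr_running * SCHED_MIN_GRANULARITY_NS;
	return SCHED_LATENCY_NS;
}

/* Portion of the period given to one task: period * weight / total weight. */
static uint64_t sched_slice_ns(unsigned nr_running, uint32_t weight,
			       uint32_t total_weight)
{
	assert(total_weight > 0 && weight <= total_weight);
	return sched_period_ns(nr_running) * weight / total_weight;
}
```

Under these assumptions, three tasks with weights 1024, 2048, and 3072 share a 6 ms period as 1 ms, 2 ms, and 3 ms, while 16 runnable tasks stretch the period to 12 ms.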

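4. How is vruntime set under group scheduling?

A scheduling group has its own cfs_rq. How is the vruntime of each se inside the group computed?
And how is the vruntime of the se that represents the group itself computed?

Every task_group has a shares value. shares is not the process priority discussed earlier but a scheduling weight. It is a CFS management concept, and in the end it shows up in how CFS orders entities for scheduling. Groups that have not been given an explicit weight all have the same default shares value.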
entity_tick()
->update_cfs_group()

static void update_cfs_group(struct sched_entity *se)
{
	struct cfs_rq *gcfs_rq = group_cfs_rq(se);
	long shares, runnable;

	if (!gcfs_rq)
		return;

	if (throttled_hierarchy(gcfs_rq))
		return;

#ifndef CONFIG_SMP
	runnable = shares = READ_ONCE(gcfs_rq->tg->shares);

	if (likely(se->load.weight == shares))
		return;
#else
	shares   = calc_group_shares(gcfs_rq);
	runnable = calc_group_runnable(gcfs_rq, shares);
#endif

	reweight_entity(cfs_rq_of(se), se, shares, runnable);
}

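The group's shares determines the load weight of the group's se: without CONFIG_SMP, tg->shares is used directly, and with CONFIG_SMP the value passed to reweight_entity() is derived by calc_group_shares().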
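The effect of shares on individual tasks can be illustrated with a two-level calculation. This is a simplified model under stated assumptions: a single CPU, every task always runnable, and a hypothetical helper `task_share_permille`; the per-CPU distribution done by calc_group_shares() on SMP is not modeled.

```c
#include <assert.h>

/* CPU share of one task, in tenths of a percent, on a single CPU.
 * The group first receives group_shares / total_shares of the CPU;
 * inside the group, the task receives task_weight / group_weight of that. */
static int task_share_permille(int group_shares, int total_shares,
			       int task_weight, int group_weight)
{
	assert(group_shares > 0 && total_shares >= group_shares);
	assert(task_weight > 0 && group_weight >= task_weight);

	long long num = 1000LL * group_shares * task_weight;
	long long den = (long long)total_shares * group_weight;

	return (int)(num / den);
}
```

Take two groups with equal shares of 1024 each (an assumed value for the example). If the first group holds one nice-0 task and the second holds three, the single task gets about 50% of the CPU and each of the three gets about 16.6%, because fairness is applied between the groups first and between tasks second.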