Inside the Linux Kernel (Process Chapter): CFS Scheduling

Before diving into CFS scheduling in Linux, it is best to read "Inside the Linux Kernel (Process Chapter): Process Scheduling" first to get a general picture of process scheduling.
Processes can be roughly divided into interactive, batch, and real-time processes. For these process types the Linux kernel defines six scheduling policies:

#define SCHED_NORMAL		0
#define SCHED_FIFO		1
#define SCHED_RR		2
#define SCHED_BATCH		3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE		5
#define SCHED_DEADLINE		6

SCHED_NORMAL and SCHED_BATCH correspond to normal processes; SCHED_FIFO, SCHED_RR and SCHED_DEADLINE correspond to real-time processes; SCHED_IDLE corresponds to the idle process (process 0, which runs when the CPU has no runnable process).
Linux schedules normal processes with the CFS scheduler and real-time processes with the realtime scheduler.
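As a side note, a task's policy can be inspected and changed from user space. Below is a minimal sketch (an illustration only, not taken from the kernel sources discussed in this article) that uses the standard sched_getscheduler()/sched_setscheduler() system calls; switching to SCHED_BATCH keeps the task under CFS:

#define _GNU_SOURCE		/* SCHED_BATCH/SCHED_IDLE need this with glibc */
#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param param = { .sched_priority = 0 };	/* must be 0 for non-RT policies */
	int policy = sched_getscheduler(0);			/* pid 0 means the calling process */

	printf("current policy: %d (SCHED_OTHER/SCHED_NORMAL is %d)\n", policy, SCHED_OTHER);

	/* Move this process to the batch policy; it is still scheduled by CFS. */
	if (sched_setscheduler(0, SCHED_BATCH, &param) != 0)
		perror("sched_setscheduler");
	return 0;
}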
In most cases processes in Linux are normal processes scheduled by the completely fair scheduler. This can be seen in pick_next_task, the function that selects the next process during a switch.

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	const struct sched_class *class;
	struct task_struct *p;

	/*
	 * Optimization: we know that if all tasks are in the fair class we can
	 * call that function directly, but only if the @prev task wasn't of a
	 * higher scheduling class, because otherwise those loose the
	 * opportunity to pull in more work from other CPUs.
	 */
	 /* likely tells the compiler this branch is taken most of the time, i.e. the system usually contains only CFS-scheduled tasks */
	if (likely((prev->sched_class == &idle_sched_class ||
		    prev->sched_class == &fair_sched_class) &&
		   rq->nr_running == rq->cfs.h_nr_running)) {

		p = pick_next_task_fair(rq, prev, rf);
		if (unlikely(p == RETRY_TASK))
			goto restart;

		/* Assumes fair_sched_class->next == idle_sched_class */
		if (!p) {
			put_prev_task(rq, prev);
			p = pick_next_task_idle(rq);
		}

		return p;
	}
	…………
	for_each_class(class) {
		p = class->pick_next_task(rq);
		if (p)
			return p;
	}

	/* The idle class should always have a runnable task: */
	BUG();
}

To implement process scheduling, Linux introduces the following concepts:

  1. Scheduling classes: 5 scheduling classes that provide the scheduling methods;
  2. Scheduling entities: 3 kinds of scheduling entities, the basic unit of scheduling, supporting both per-task scheduling and group scheduling;
  3. Scheduling policies: 6 scheduling policies, split between real-time and normal processes;
  4. Scheduling algorithms: 4 scheduling algorithms, one implemented for each family of policies.

The table below shows how these concepts correspond to each other, ordered by scheduling class priority from highest to lowest.

Scheduling class | Scheduling policy | Scheduling entity | Scheduling algorithm
stop_sched_class | - | - | -
dl_sched_class | SCHED_DEADLINE | sched_dl_entity | EDF
rt_sched_class | SCHED_RR / SCHED_FIFO | sched_rt_entity | RR / FIFO
fair_sched_class | SCHED_NORMAL / SCHED_BATCH | sched_entity | CFS
idle_sched_class | - | - | -

The CFS scheduling covered in this article corresponds to the fair_sched_class scheduling class, the SCHED_NORMAL/SCHED_BATCH policies and the sched_entity scheduling entity.

I. The CFS Scheduling Class

The CFS scheduling class fair_sched_class provides the methods that CFS scheduling has to implement.

Method | Description
enqueue_task_fair | Add the process's scheduling entity to the run queue red-black tree
dequeue_task_fair | Remove the process's scheduling entity from the run queue red-black tree
yield_task_fair | Backs the sched_yield system call; records the current entity in the run queue's skip member
check_preempt_wakeup | Check whether the current process can be preempted by a newly woken process
__pick_next_task_fair | Pick the next process to run within this scheduling class
put_prev_task_fair | Update the process's vruntime and put it back on the run queue; used together with set_next_task
set_next_task_fair | Remove the process from the run queue and make it the currently running entity; used together with put_prev_task
task_tick_fair | Called from the scheduler_tick timer; CFS calls update_curr to account vruntime and may trigger rescheduling
task_fork_fair | Called from do_fork; CFS initializes the child's vruntime
/*
 * All the scheduling class methods:
 */
const struct sched_class fair_sched_class = {
	.next			= &idle_sched_class, /* next scheduling class: idle */
	.enqueue_task		= enqueue_task_fair, /* add a task to the run queue */
	.dequeue_task		= dequeue_task_fair,/* remove a task from the run queue */
	.yield_task		= yield_task_fair,/* task voluntarily gives up the CPU (sched_yield) */
	.yield_to_task		= yield_to_task_fair,

	.check_preempt_curr	= check_preempt_wakeup,/* check whether the current task can be preempted by a newly woken task */

	.pick_next_task		= __pick_next_task_fair,/* pick the next task to run within this class */
	.put_prev_task		= put_prev_task_fair,/* put the previous task back on the run queue */
	.set_next_task          = set_next_task_fair,

#ifdef CONFIG_SMP
	.balance		= balance_fair,
	.select_task_rq		= select_task_rq_fair,/* return the CPU number of the run queue the task should run on */
	.migrate_task_rq	= migrate_task_rq_fair,/* migrate the task to a given CPU, called from set_task_cpu */

	.rq_online		= rq_online_fair,/* run queue brought online */
	.rq_offline		= rq_offline_fair,/* run queue taken offline */

	.task_dead		= task_dead_fair,/* called during a switch when the previous task is in TASK_DEAD state */
	.set_cpus_allowed	= set_cpus_allowed_common,
#endif

	.task_tick		= task_tick_fair,/* called from the scheduler_tick timer; CFS calls update_curr to account vruntime */
	.task_fork		= task_fork_fair,/* called from do_fork; CFS initializes the child's vruntime */
	
	/* __sched_setscheduler calls check_class_changed: invoked when a task changes scheduling policy, switching class and priority */
	.prio_changed		= prio_changed_fair,
	.switched_from		= switched_from_fair,
	.switched_to		= switched_to_fair,

	/* sched_rr_get_interval system call: returns the round-robin timeslice, 0 for non-RR classes */
	.get_rr_interval	= get_rr_interval_fair,

	.update_curr		= update_curr_fair,/* update the current task's runtime accounting */

#ifdef CONFIG_FAIR_GROUP_SCHED
	.task_change_group	= task_change_group_fair,
#endif

#ifdef CONFIG_UCLAMP_TASK
	.uclamp_enabled		= 1,
#endif
};

II. The CFS Run Queue

The kernel allocates one run queue rq per CPU (this_rq()). Each rq contains a CFS run queue cfs_rq, which organizes and manages the scheduling of normal processes.

Member | Description
load | Total load of all scheduling entities on the run queue; update_load_add adds an entity's weight on enqueue, update_load_sub subtracts it on dequeue
nr_running | Number of scheduling entities on this CFS run queue; incremented on enqueue, decremented on dequeue
h_nr_running | Number of SCHED_NORMAL/SCHED_BATCH/SCHED_IDLE tasks in this CFS run queue's hierarchy
idle_h_nr_running | Number of SCHED_IDLE scheduling entities
min_vruntime | Smallest vruntime in the CFS run queue's red-black tree, kept monotonically increasing; it is used when compensating the vruntime of woken and forked processes
tasks_timeline | Root of the CFS run queue's red-black tree; all entities are inserted keyed by se->vruntime
curr | The scheduling entity currently running on this CFS run queue
next | An entity that urgently wants to run; a wakeup may record the woken entity here, and pick_next_entity prefers cfs_rq->next
last | Unlike curr, which always tracks the running entity, last records only the entity that performed a wakeup; pick_next_entity prefers cfs_rq->last as a second choice, which helps reuse the cache
skip | The entity to skip; sched_yield records the current entity here, and pick_next_entity picks a different entity if its choice equals cfs_rq->skip
/* CFS-related fields in a runqueue */
struct cfs_rq {
	struct load_weight	load; /* total load of all entities on this CFS run queue */
	unsigned long		runnable_weight;
	unsigned int		nr_running; /* number of entities on this CFS run queue */
	unsigned int		h_nr_running;   /* SCHED_{NORMAL,BATCH,IDLE} */
	unsigned int		idle_h_nr_running; /* SCHED_IDLE */

	u64			exec_clock;
	u64			min_vruntime;/* smallest vruntime in this CFS run queue's red-black tree */
#ifndef CONFIG_64BIT
	u64			min_vruntime_copy;
#endif

	struct rb_root_cached	tasks_timeline;/* root of this CFS run queue's red-black tree */

	/*
	 * 'curr' points to currently running entity on this cfs_rq.
	 * It is set to NULL otherwise (i.e when none are currently running).
	 */
	struct sched_entity	*curr;/* entity currently running on this CFS run queue */
	struct sched_entity	*next;/* entity that was woken up */
	struct sched_entity	*last;/* entity that performed the wakeup */
	struct sched_entity	*skip;/* entity to skip (sched_yield) */

#ifdef	CONFIG_SCHED_DEBUG
	unsigned int		nr_spread_over;
#endif

#ifdef CONFIG_SMP
	/*
	 * CFS load tracking
	 */
	struct sched_avg	avg;
#ifndef CONFIG_64BIT
	u64			load_last_update_time_copy;
#endif
	struct {
		raw_spinlock_t	lock ____cacheline_aligned;
		int		nr;
		unsigned long	load_avg;
		unsigned long	util_avg;
		unsigned long	runnable_sum;
	} removed;

#ifdef CONFIG_FAIR_GROUP_SCHED
	unsigned long		tg_load_avg_contrib;
	long			propagate;
	long			prop_runnable_sum;

	/*
	 *   h_load = weight * f(tg)
	 *
	 * Where f(tg) is the recursive weight fraction assigned to
	 * this group.
	 */
	unsigned long		h_load;
	u64			last_h_load_update;
	struct sched_entity	*h_load_next;
#endif /* CONFIG_FAIR_GROUP_SCHED */
#endif /* CONFIG_SMP */

#ifdef CONFIG_FAIR_GROUP_SCHED
	struct rq		*rq;	/* CPU runqueue to which this cfs_rq is attached */

	/*
	 * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
	 * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
	 * (like users, containers etc.)
	 *
	 * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a CPU.
	 * This list is used during load balance.
	 */
	int			on_list;
	struct list_head	leaf_cfs_rq_list;
	struct task_group	*tg;	/* group that "owns" this runqueue */
	…………
#endif /* CONFIG_FAIR_GROUP_SCHED */
};

III. The CFS Scheduling Entity

Every process descriptor task_struct contains a sched_entity member, the scheduling entity. The scheduling entity is the basic unit of scheduling and can be either a single process or a scheduling group.

struct sched_entity {
	/* For load-balancing: */
	struct load_weight		load; /* scheduling weight */
	unsigned long			runnable_weight;
	struct rb_node			run_node;/* this entity's node in the cfs_rq red-black tree */
	struct list_head		group_node;
	unsigned int			on_rq;/* whether this entity is on a run queue */

	u64				exec_start;
	u64				sum_exec_runtime;/* accumulated real running time */
	u64				vruntime;/* virtual runtime */
	u64				prev_sum_exec_runtime;/* accumulated real running time at the previous scheduling point */

	u64				nr_migrations;

	struct sched_statistics		statistics;

#ifdef CONFIG_FAIR_GROUP_SCHED
	int				depth;
	struct sched_entity		*parent;
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq			*cfs_rq;
	/* rq "owned" by this entity/group: */
	struct cfs_rq			*my_q;
#endif

#ifdef CONFIG_SMP
	/*
	 * Per entity load average tracking.
	 *
	 * Put into separate cache line so it does not
	 * collide with read-mostly values above.
	 */
	struct sched_avg		avg;/* per-entity load tracking */
#endif
};

IV. Scheduling Priority

	int				prio;                    /* dynamic priority */
	int				static_prio;             /* static priority */
	int				normal_prio;             /* normal priority */
	unsigned int			rt_priority;     /* real-time priority */

The kernel uses 0-139 to represent process priorities; the lower the value, the higher the priority. Priorities 0-99 are used for real-time processes, 100-139 for normal processes.
The MAX_RT_PRIO macro is the upper bound of real-time priorities, with value 100.
The DEFAULT_PRIO macro is the default priority of a normal process, 120; a newly created normal process gets priority 120, which can be changed with the nice system call.
NICE ranges from -20 to 19; negative values raise the priority, positive values lower it.
Real-time priorities range over 0..MAX_RT_PRIO-1 (0-99) and normal-process priorities range over MAX_RT_PRIO..MAX_PRIO-1 (100-139).

#define MAX_NICE	19
#define MIN_NICE	-20
#define NICE_WIDTH	(MAX_NICE - MIN_NICE + 1)

/*
 * Priority of a process goes from 0..MAX_PRIO-1, valid RT
 * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
 * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
 * values are inverted: lower p->prio value means higher priority.
 *
 * The MAX_USER_RT_PRIO value allows the actual maximum
 * RT priority to be separate from the value exported to
 * user-space.  This allows kernel threads to set their
 * priority to a value higher than any user task. Note:
 * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
 */

#define MAX_USER_RT_PRIO	100
#define MAX_RT_PRIO		MAX_USER_RT_PRIO

#define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)
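With the definitions above, DEFAULT_PRIO = 100 + 40/2 = 120. The sketch below (an illustration only) shows the conversion between nice values and the 0-139 priority scale; the kernel has equivalent NICE_TO_PRIO/PRIO_TO_NICE helpers:

/* Illustration: convert between nice values and the kernel's 0-139
 * priority scale, using the same constants as the defines above. */
#include <stdio.h>

#define MAX_RT_PRIO	100
#define NICE_WIDTH	40
#define DEFAULT_PRIO	(MAX_RT_PRIO + NICE_WIDTH / 2)	/* 120 */

static int nice_to_prio(int nice) { return DEFAULT_PRIO + nice; }
static int prio_to_nice(int prio) { return prio - DEFAULT_PRIO; }

int main(void)
{
	printf("nice   0 -> prio %d\n", nice_to_prio(0));	/* 120 */
	printf("nice -20 -> prio %d\n", nice_to_prio(-20));	/* 100 */
	printf("nice  19 -> prio %d\n", nice_to_prio(19));	/* 139 */
	printf("prio 130 -> nice %d\n", prio_to_nice(130));	/* 10  */
	return 0;
}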

V. Scheduling Weight

A scheduling entity contains a weight member, struct load_weight load, defined as follows.

struct load_weight {
	unsigned long			weight; /* weight of the scheduling entity */
	u32				inv_weight; /* precomputed inverse of the weight (intermediate value) */
};

A few points about scheduling weights:

  1. The priority of a normal process is determined by its NICE value. The default priority is 120 and NICE ranges over [-20, 19], so normal-process priorities range over [100, 139].
  2. The kernel converts priorities into weights for scheduling normal processes. The NICE range [-20, 19] spans 40 priority levels; the higher the NICE value, the lower the priority and the lower the weight, and vice versa.
  3. Increasing NICE by 1 costs roughly 10% of CPU time relative to NICE 0; decreasing it by 1 gains roughly 10%.
  4. For convenience, the kernel fixes the weight at NICE 0 to 1024; the weights for other NICE values are looked up in the global array sched_prio_to_weight.
  5. Adjacent entries of sched_prio_to_weight differ by a factor of about 1.25, i.e. sched_prio_to_weight[i] is roughly 1.25 times sched_prio_to_weight[i+1].
  6. The kernel provides a second array, sched_prio_to_wmult, which stores 2^32/sched_prio_to_weight; these values are precomputed to speed up the arithmetic.

A quick example of how weight translates into CPU time:
Processes A and B are both created with NICE 0, so each has weight 1024 and each gets 50% of the CPU: CPU_A = CPU_B = 1024/(1024+1024) = 50%.
Now change B's NICE value to 1; B should get roughly 10% less CPU time than A. Their weights become 1024 and 820, so CPU_A = 1024/(1024+820) ≈ 55% and CPU_B = 820/(1024+820) ≈ 45%.
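The same arithmetic can be checked with a tiny user-space sketch (an illustration only, reusing the nice 0 and nice 1 entries of the weight table quoted below):

/* Illustration: compute the CPU share of two CFS tasks from their weights. */
#include <stdio.h>

int main(void)
{
	unsigned long weight_a = 1024;	/* nice  0 */
	unsigned long weight_b = 820;	/* nice +1 */
	unsigned long total = weight_a + weight_b;

	printf("A: %.1f%%  B: %.1f%%\n",
	       100.0 * weight_a / total,	/* ~55.5% */
	       100.0 * weight_b / total);	/* ~44.5% */
	return 0;
}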
The code for sched_prio_to_weight and sched_prio_to_wmult is shown below.

/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};

/*
 * Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.
 *
 * In cases where the weight does not change often, we can use the
 * precalculated inverse to speed up arithmetics by turning divisions
 * into multiplications:
 */
const u32 sched_prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};

The relationship between sched_prio_to_weight[i] and sched_prio_to_wmult[i] is:

inv_weight = 2^32 / weight

The kernel provides the set_load_weight function to set a scheduling entity's load (p->se.load).

static void set_load_weight(struct task_struct *p, bool update_load)
{
	int prio = p->static_prio - MAX_RT_PRIO;
	struct load_weight *load = &p->se.load;

	/*
	 * SCHED_IDLE tasks get minimal weight:
	 */
	if (task_has_idle_policy(p)) {
		load->weight = scale_load(WEIGHT_IDLEPRIO);
		load->inv_weight = WMULT_IDLEPRIO;
		p->se.runnable_weight = load->weight;
		return;
	}

	/*
	 * SCHED_OTHER tasks have to update their load when changing their
	 * weight
	 */
	if (update_load && p->sched_class == &fair_sched_class) {
		reweight_task(p, prio);
	} else {
		load->weight = scale_load(sched_prio_to_weight[prio]);
		load->inv_weight = sched_prio_to_wmult[prio];
		p->se.runnable_weight = load->weight;
	}
}

Why does the kernel define the sched_prio_to_wmult array at all? The answer lies in how the CFS scheduler computes the virtual runtime.

VI. Computing the Virtual Runtime

The CFS scheduler abandons the old timeslice-based algorithm; instead it schedules processes by computing a virtual runtime from their weights.

  1. Each process's virtual runtime is its real runtime scaled by the ratio of the NICE-0 weight to its own weight.
  2. A process with a small NICE value has a large weight, so its virtual runtime advances more slowly than its real runtime and it therefore obtains more running time.

The virtual runtime is computed as:

vruntime = delta_exec * NICE_0_LOAD / weight

where delta_exec is the real running time, NICE_0_LOAD is the weight at NICE 0, and weight is the process's weight.
This formula requires a division, which the kernel would rather avoid, so it is transformed into:

vruntime = (delta_exec * NICE_0_LOAD * 2^32 / weight) >> 32

This explains why the kernel needs the sched_prio_to_wmult (2^32/weight) array. The formula then becomes:

vruntime = (delta_exec * NICE_0_LOAD * inv_weight) >> 32

By precomputing the division in sched_prio_to_wmult, the kernel only needs multiplications and shifts at runtime.
The kernel implementation is as follows:

static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
	u64 fact = scale_load_down(weight);/* weight=NICE_0_LOAD */
	int shift = WMULT_SHIFT;

	__update_inv_weight(lw);

	if (unlikely(fact >> 32)) {
		while (fact >> 32) {
			fact >>= 1;
			shift--;
		}
	}
	
	/* fact=NICE_0_LOAD*inv_weight */
	fact = mul_u32_u32(fact, lw->inv_weight);

	while (fact >> 32) {
		fact >>= 1;
		shift--;
	}

	/* delta_exec*=fact >> 32 */
	return mul_u64_u32_shr(delta_exec, fact, shift);
}

static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
	if (unlikely(se->load.weight != NICE_0_LOAD))
		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

	/* if the entity's weight is NICE_0_LOAD, its virtual runtime equals its real runtime */
	return delta;
}

Clearly, when the NICE value is 0 the kernel skips the calculation and directly returns the real runtime; in other words, when an entity's weight is NICE_0_LOAD its virtual runtime vruntime equals its real runtime delta.
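To see the fixed-point trick in isolation, here is a small user-space sketch (an illustration only, with made-up numbers) that reproduces vruntime = (delta_exec * NICE_0_LOAD * inv_weight) >> 32 and compares it with the exact division; the overflow handling that __calc_delta does by reducing the shift is omitted here:

/* Illustration: the multiply-and-shift form of the vruntime formula.
 * NICE_0_LOAD is 1024; the weight/inv_weight pair below is the nice +5
 * entry of sched_prio_to_weight/sched_prio_to_wmult (335 / 12820798). */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t delta_exec = 1000000;		/* 1 ms of real runtime, in ns */
	uint64_t nice_0_load = 1024;
	uint64_t weight = 335;			/* nice +5 */
	uint64_t inv_weight = 12820798;		/* ~2^32 / 335, precomputed */

	uint64_t exact = delta_exec * nice_0_load / weight;
	uint64_t fast = (delta_exec * nice_0_load * inv_weight) >> 32;

	printf("exact: %llu ns, fast: %llu ns\n",
	       (unsigned long long)exact, (unsigned long long)fast);
	return 0;
}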

VII. Updating the Virtual Runtime

CFS uses vruntime to implement its scheduling policy for normal processes; vruntime is updated by the update_curr function.
1. update_curr takes the current process's CFS run queue as its argument;
2. rq_clock_task reads the run queue's clock_task member, which is updated by update_rq_clock_task;
3. delta_exec is the time elapsed since this process last called update_curr;
4. calc_delta_fair converts it into a vruntime increment, as described in the previous section;
5. update_min_vruntime refreshes cfs_rq->min_vruntime, which records the smallest vruntime on the CFS run queue.

/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr; /* scheduling entity of the current task */
	u64 now = rq_clock_task(rq_of(cfs_rq)); /* the run queue's clock_task value */
	u64 delta_exec;

	if (unlikely(!curr))
		return;

	/* delta_exec is the time elapsed since this task last called update_curr */
	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;

	/* record the time of this call in curr->exec_start */
	curr->exec_start = now;

	schedstat_set(curr->statistics.exec_max,
		      max(delta_exec, curr->statistics.exec_max));
	/* curr->sum_exec_runtime accumulates the total CPU time used by the task */
	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq->exec_clock, delta_exec);

	/* convert the real-time increment delta_exec into a vruntime increment */
	curr->vruntime += calc_delta_fair(delta_exec, curr);
	/* update cfs_rq->min_vruntime */
	update_min_vruntime(cfs_rq);

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
		cgroup_account_cputime(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}

	account_cfs_rq_runtime(cfs_rq, delta_exec);
}

update_min_vruntime updates cfs_rq->min_vruntime; the basic rule is to track the smallest virtual runtime on the run queue.
1. A local vruntime starts from the previously recorded min_vruntime;
2. If the curr entity is on the run queue, vruntime is set to curr->vruntime;
3. If the red-black tree is not empty and curr is on the run queue, vruntime becomes the smaller of the step-2 value and the leftmost entity's vruntime;
4. If curr is not on the run queue, vruntime becomes the leftmost entity's vruntime;
5. cfs_rq->min_vruntime is set to the larger of its old value and the vruntime computed above.
update_min_vruntime thus guarantees that cfs_rq->min_vruntime never decreases.

static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr; /* scheduling entity of the current task */
	struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);/* leftmost node of the CFS run queue red-black tree */
	/* start from the previously recorded min_vruntime */
	u64 vruntime = cfs_rq->min_vruntime;

	if (curr) {
		if (curr->on_rq)
			vruntime = curr->vruntime;/* curr is on the run queue: take its vruntime */
		else
			curr = NULL;
	}

	if (leftmost) { /* non-empty tree */
		struct sched_entity *se;
		/* entity at the leftmost node of the red-black tree */
		se = rb_entry(leftmost, struct sched_entity, run_node);

		if (!curr)
			/* no current entity: take the leftmost entity's vruntime */
			vruntime = se->vruntime;
		else
			/* current entity exists: take the smaller of its vruntime and the leftmost entity's vruntime */
			vruntime = min_vruntime(vruntime, se->vruntime);
	}

	/* ensure we never gain time by being placed backwards. */
	/* cfs_rq->min_vruntime is the larger of its old value and the computed vruntime */
	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
#ifndef CONFIG_64BIT
	smp_wmb();
	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
}

Virtual runtime is the basic element of CFS scheduling and its computation has been covered. Since processes keep running, when is vruntime actually updated? update_curr is called from many places in the kernel; a few of them:

  1. When a process is created, do_fork calls sched_fork, which calls the task_fork_fair method, which calls update_curr. task_fork_fair also calls update_rq_clock to refresh rq->clock_task.
  2. The scheduler_tick timer periodically calls the scheduling class's task_tick method (see the scheduler_tick section of "Inside the Linux Kernel (Process Chapter): Process Scheduling"). For the CFS class this is task_tick_fair, which calls entity_tick, which calls update_curr. This call happens periodically.
  3. When a scheduling entity is added to a CFS run queue, the enqueue_task_fair method also calls update_curr.

VIII. Using the Virtual Runtime

How the kernel uses virtual runtime to schedule processes:

  1. An entity's vruntime keeps accumulating while it uses the CPU. Once the entity has used more than its allotted share of CPU time, TIF_NEED_RESCHED is set in the task's thread_info, triggering deferred scheduling via schedule. This is the first way vruntime comes into play.
  2. On the other hand, vruntime is the key of the CFS run queue's red-black tree: entities with smaller vruntime sit toward the left of the tree, and when pick_next_task runs inside schedule, CFS prefers the leftmost node of the tree. vruntime is also compared to decide whether the current entity can be preempted.

This section covers the first point; the second is covered in the "CFS Enqueue" and "CFS Picks the Next Process" sections.
The scheduler_tick timer periodically calls curr->sched_class->task_tick, which for the CFS class is task_tick_fair.
task_tick_fair calls entity_tick to implement the deferred scheduling described above.

/*
 * scheduler tick hitting a task of our scheduling class.
 *
 * NOTE: This function can be called remotely by the tick offload that
 * goes along full dynticks. Therefore no local assumption can be made
 * and everything must be accessed through the @rq and @curr passed in
 * parameters.
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &curr->se;

	/* walk up the entity hierarchy; without group scheduling this is just the current entity */
	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		entity_tick(cfs_rq, se, queued);
	}

	if (static_branch_unlikely(&sched_numa_balancing))
		task_tick_numa(rq, curr);

	update_misfit_status(curr, rq);
	update_overutilized_status(task_rq(curr));
}

entity_tick is implemented as follows:
1. call update_curr to update the current entity's vruntime and the CFS run queue's min_vruntime;
2. call update_load_avg to update the entity's load average;
3. call check_preempt_tick to check whether another process needs to run.

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	/* update the current entity's vruntime and the cfs_rq's min_vruntime */
	update_curr(cfs_rq);

	/*
	 * Ensure that runnable average is periodically updated.
	 */
	update_load_avg(cfs_rq, curr, UPDATE_TG);
	update_cfs_group(curr);

#ifdef CONFIG_SCHED_HRTICK
	/*
	 * queued ticks are scheduled to match the slice, so don't bother
	 * validating it and just reschedule.
	 */
	if (queued) {
		resched_curr(rq_of(cfs_rq));
		return;
	}
	/*
	 * don't let the period tick interfere with the hrtick preemption
	 */
	if (!sched_feat(DOUBLE_TICK) &&
			hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
		return;
#endif

	/* check whether another task needs to run */
	if (cfs_rq->nr_running > 1)
		check_preempt_tick(cfs_rq, curr);
}

check_preempt_tick is implemented as follows:
1. call sched_slice to obtain the current entity's ideal runtime ideal_runtime, i.e. the CPU time it is entitled to according to its weight;
2. compute the entity's actual runtime delta_exec in this slice, i.e. the CPU time already used;
3. if the actual runtime exceeds the ideal runtime, resched_curr sets TIF_NEED_RESCHED in the task's thread_info, triggering deferred scheduling;
4. if the actual runtime is less than the minimum granularity sysctl_sched_min_granularity (0.75 ms by default), no rescheduling is needed;
5. compute the difference between the current entity's vruntime and that of the leftmost entity in the CFS red-black tree, i.e. the entity with the smallest vruntime on the queue;
6. if the difference is negative, do not reschedule;
7. if the difference exceeds the current entity's ideal runtime, trigger deferred scheduling.

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	unsigned long ideal_runtime, delta_exec;
	struct sched_entity *se;
	s64 delta;
	/* ideal runtime of the current entity */
	ideal_runtime = sched_slice(cfs_rq, curr);
	/* actual runtime of the current entity in this slice */
	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
	/* actual runtime exceeds the ideal runtime: resched_curr sets TIF_NEED_RESCHED in thread_info to trigger deferred scheduling */
	if (delta_exec > ideal_runtime) {
		resched_curr(rq_of(cfs_rq));
		/*
		 * The current task ran long enough, ensure it doesn't get
		 * re-elected due to buddy favours.
		 */
		clear_buddies(cfs_rq, curr);
		return;
	}

	/*
	 * Ensure that a task that missed wakeup preemption by a
	 * narrow margin doesn't have to wait for a full slice.
	 * This also mitigates buddy induced latencies under load.
	 */
	 /* actual runtime below the minimum granularity: no rescheduling needed */
	if (delta_exec < sysctl_sched_min_granularity)
		return;
	/* leftmost entity of the red-black tree, i.e. the one with the smallest vruntime on the queue */
	se = __pick_first_entity(cfs_rq);
	delta = curr->vruntime - se->vruntime;
	/* current vruntime is smaller than the smallest queued vruntime: do not reschedule */
	if (delta < 0)
		return;
	/* the vruntime gap exceeds the ideal runtime: trigger deferred scheduling */
	if (delta > ideal_runtime)
		resched_curr(rq_of(cfs_rq));
}

IX. Allocating the Virtual Runtime

When virtual runtime is used to decide whether to trigger deferred scheduling, it is compared against the ideal runtime, i.e. the CPU time the entity may use.
As seen in the previous section, this time is obtained from sched_slice.
1. For a non-group entity the loop runs once and the ideal runtime is the entity's weight-proportional share of __sched_period;
2. For group scheduling, every entity along the hierarchy is walked to compute the ideal runtime;
3. __sched_period is the length of one CFS scheduling period, which can be thought of as the scheduling timeslice.

/*
 * We calculate the wall-time slice from the period by taking a part
 * proportional to the weight.
 *
 * s = p*P[w/rw]
 */
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);

	for_each_sched_entity(se) {
		struct load_weight *load;
		struct load_weight lw;

		cfs_rq = cfs_rq_of(se);
		load = &cfs_rq->load;

		if (unlikely(!se->on_rq)) {
			lw = cfs_rq->load;

			update_load_add(&lw, se->load.weight);
			load = &lw;
		}
		slice = __calc_delta(slice, se->load.weight, load);
	}
	return slice;
}

__sched_period computes the scheduling period from the number of runnable entities:
1. when the run queue holds more than 8 entities, the period is the number of entities times the minimum granularity sysctl_sched_min_granularity;
2. otherwise the period is the default scheduling latency sysctl_sched_latency, 6 ms.

/* default scheduling latency (period) */
unsigned int sysctl_sched_latency			= 6000000ULL;
/*
 * This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
 */
static unsigned int sched_nr_latency = 8;
/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
	if (unlikely(nr_running > sched_nr_latency))
		return nr_running * sysctl_sched_min_granularity;
	else
		return sysctl_sched_latency;
}
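As a quick worked example (with assumed task counts and the default tunables quoted above), the sketch below recomputes __sched_period() and an equal-weight slice in user space; values are in nanoseconds:

/* Illustration: __sched_period() and an equal-weight slice with the default
 * tunables (sysctl_sched_latency = 6 ms, sysctl_sched_min_granularity = 0.75 ms,
 * sched_nr_latency = 8). */
#include <stdio.h>

static unsigned long long sched_period(unsigned long nr_running)
{
	const unsigned long long latency = 6000000ULL;	/* 6 ms */
	const unsigned long long min_gran = 750000ULL;	/* 0.75 ms */
	const unsigned long nr_latency = 8;

	return nr_running > nr_latency ? nr_running * min_gran : latency;
}

int main(void)
{
	/* 4 runnable tasks: the period stays at 6 ms, and a nice-0 task among
	 * four equal-weight tasks gets a 1.5 ms slice. */
	unsigned long long p4 = sched_period(4);
	printf("4 tasks:  period %llu ns, equal-weight slice %llu ns\n", p4, p4 / 4);

	/* 16 runnable tasks: the period stretches to 16 * 0.75 ms = 12 ms. */
	printf("16 tasks: period %llu ns\n", sched_period(16));
	return 0;
}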

X. Virtual Runtime of Newly Created and Newly Woken Processes

After a sleeping process is woken, or when a process is newly created, its vruntime has not been kept up to date by update_curr. If the stale vruntime were used as-is after wakeup, the process would monopolize the CPU in revenge (it would win both the entity_before and the wakeup_preempt_entity comparisons). The vruntime of these two kinds of processes therefore has to be corrected.

  1. A woken process has cfs_rq->min_vruntime added to its vruntime when it is enqueued in enqueue_entity, and place_entity is called to compensate it.
	update_curr(cfs_rq);
	if (renorm && !curr)
		se->vruntime += cfs_rq->min_vruntime;
	…………
	if (flags & ENQUEUE_WAKEUP)
		place_entity(cfs_rq, se, 0);
  2. A newly forked process has its vruntime set to the parent's curr->vruntime in task_fork_fair, and place_entity is called to penalize it. Note that at the end task_fork_fair subtracts cfs_rq->min_vruntime from the child's vruntime. Later in the _do_fork call chain, after copy_process->sched_fork->task_fork_fair, wake_up_new_task->activate_task->enqueue_task->enqueue_task_fair->enqueue_entity adds cfs_rq->min_vruntime back. In effect, the forked process goes through the wakeup path.
	if (curr) {
		update_curr(cfs_rq);
		se->vruntime = curr->vruntime;
	}
	place_entity(cfs_rq, se, 1);
	se->vruntime -= cfs_rq->min_vruntime;

place_entity computes the entity's vruntime as follows.
1. initial is passed as 0 for a woken process and as 1 for a forked process;
2. a local variable vruntime serves as the reference for the final se->vruntime; it starts at cfs_rq->min_vruntime;
3. for a forked process, vruntime is increased by sched_vslice (one slice worth of virtual time) as a penalty;
4. for a woken process, vruntime is decreased by half of sysctl_sched_latency (6 ms by default) as compensation;
5. the final se->vruntime is the larger of its current value and the local vruntime.

static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;/* start from the smallest vruntime in the red-black tree */

	/*
	 * The 'current' period is already promised to the current tasks,
	 * however the extra weight of the new task will slow them down a
	 * little, place the new task so that it fits in the slot that
	 * stays open at the end.
	 */
	/* forked task: add one slice worth of virtual runtime as a penalty */
	if (initial && sched_feat(START_DEBIT))
		vruntime += sched_vslice(cfs_rq, se);

	/* sleeps up to a single latency don't count. */
	/* woken task: subtract sysctl_sched_latency/2 worth of virtual runtime as compensation */
	if (!initial) {
		unsigned long thresh = sysctl_sched_latency;

		/*
		 * Halve their sleep time's effect, to allow
		 * for a gentler effect of sleepers:
		 */
		if (sched_feat(GENTLE_FAIR_SLEEPERS))
			thresh >>= 1;

		vruntime -= thresh;
	}

	/* ensure we never gain time by being placed backwards. */
	/* the final vruntime is the larger of the two */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}
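For intuition, take some made-up numbers: assume cfs_rq->min_vruntime is 100 ms and GENTLE_FAIR_SLEEPERS is enabled (thresh = 6 ms / 2 = 3 ms). A task woken with a stale se->vruntime of 40 ms gets max(40 ms, 100 ms - 3 ms) = 97 ms, so it is slightly favoured but cannot starve the other tasks; a forked child whose inherited vruntime is 100 ms gets max(100 ms, 100 ms + sched_vslice) and therefore starts a little behind its siblings.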

XI. CFS Enqueue

CFS enqueue is implemented by the enqueue_task_fair function.
1. for_each_sched_entity walks the entity hierarchy; without group scheduling this is just se itself;
2. if the entity is not already on the queue (se->on_rq == 0), enqueue_entity is called to do the actual enqueue.

/*
 * The enqueue_task method is called before nr_running is
 * increased. Here we update the fair scheduling stats and
 * then put the task into the rbtree:
 */
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se;
	int idle_h_nr_running = task_has_idle_policy(p);

	…………
	for_each_sched_entity(se) {
		if (se->on_rq)/* entity already queued: stop walking up */
			break;
		cfs_rq = cfs_rq_of(se);
		/* do the actual enqueue */
		enqueue_entity(cfs_rq, se, flags);

		/*
		 * end evaluation on encountering a throttled cfs_rq
		 *
		 * note: in the case of encountering a throttled cfs_rq we will
		 * post the final h_nr_running increment below.
		 */
		if (cfs_rq_throttled(cfs_rq))
			break;
		cfs_rq->h_nr_running++;
		cfs_rq->idle_h_nr_running += idle_h_nr_running;

		flags = ENQUEUE_WAKEUP;
	}

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		cfs_rq->h_nr_running++;
		cfs_rq->idle_h_nr_running += idle_h_nr_running;

		if (cfs_rq_throttled(cfs_rq))
			break;

		update_load_avg(cfs_rq, se, UPDATE_TG);
		update_cfs_group(se);
	}
	…………
}

enqueue_entity performs the actual enqueue.
1. renorm: true when the flags do not contain ENQUEUE_WAKEUP, or when they contain ENQUEUE_MIGRATED, i.e. the vruntime has to be re-normalized against this queue's min_vruntime.
2. curr: true when the entity being enqueued is the currently running entity.
3. If the enqueued entity is the currently running one, min_vruntime is added to its vruntime before update_curr; the current task must be renormalized before update_curr because update_curr may advance min_vruntime.
4. update_curr updates the currently running entity's vruntime and min_vruntime.
5. If the enqueued entity is not the running one, min_vruntime is added to its vruntime after update_curr, so it is placed at the current moment in time rather than at some moment in the past.
6. If the process is enqueued because of a wakeup, place_entity compensates its vruntime so that a stale vruntime does not let it monopolize the CPU.
7. If the enqueued entity is not the currently running one, __enqueue_entity inserts it into the CFS run queue's red-black tree.
8. se->on_rq is set to 1, marking the entity as present on the run queue.

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
	bool curr = cfs_rq->curr == se;

	/*
	 * If we're the current task, we must renormalise before calling
	 * update_curr().
	 */
	if (renorm && curr)
		se->vruntime += cfs_rq->min_vruntime;

	update_curr(cfs_rq);

	/*
	 * Otherwise, renormalise after, such that we're placed at the current
	 * moment in time, instead of some random moment in the past. Being
	 * placed in the past could significantly boost this task to the
	 * fairness detriment of existing tasks.
	 */
	if (renorm && !curr)
		se->vruntime += cfs_rq->min_vruntime;

	/*
	 * When enqueuing a sched_entity, we must:
	 *   - Update loads to have both entity and cfs_rq synced with now.
	 *   - Add its load to cfs_rq->runnable_avg
	 *   - For group_entity, update its weight to reflect the new share of
	 *     its group cfs_rq
	 *   - Add its new weight to cfs_rq->load.weight
	 */
	update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
	update_cfs_group(se);
	enqueue_runnable_load_avg(cfs_rq, se);
	account_entity_enqueue(cfs_rq, se);/* update cfs_rq->nr_running and the total load */
	/* if the entity is enqueued because of a wakeup, compensate its vruntime
	   so a stale vruntime does not let it monopolize the CPU */
	if (flags & ENQUEUE_WAKEUP)
		place_entity(cfs_rq, se, 0);

	check_schedstat_required();
	update_stats_enqueue(cfs_rq, se, flags);
	check_spread(cfs_rq, se);
	/* if the enqueued entity is not the currently running one,
	   call __enqueue_entity to insert it into the CFS run queue red-black tree */
	if (!curr)
		__enqueue_entity(cfs_rq, se);
	se->on_rq = 1;/* mark the entity as on the run queue */

	if (cfs_rq->nr_running == 1) {
		list_add_leaf_cfs_rq(cfs_rq);
		check_enqueue_throttle(cfs_rq);
	}
}

__enqueue_entity inserts the scheduling entity into the CFS run queue's red-black tree.
1. walk the tree, using vruntime as the key, to find the insertion point;
2. entity_before compares the vruntime of the entity being inserted with that of the current node;
3. if the entity's vruntime is smaller, descend to the left;
4. if it is larger, descend to the right; in that case the entity can no longer become the leftmost node;
5. note that entities toward the left of the tree are always picked first to run (pick_next_task_fair);
6. insert the entity into the tree, completing the enqueue.

/*
 * Enqueue an entity into the rb-tree:
 */
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
	struct rb_node *parent = NULL;
	struct sched_entity *entry;
	bool leftmost = true;

	/*
	 * Find the right place in the rbtree:
	 */
	/* walk the red-black tree, using vruntime as the key, to find the insertion point */
	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct sched_entity, run_node);
		/*
		 * We dont care about collisions. Nodes with
		 * the same key stay together.
		 */
		/* the entity's vruntime is smaller than this node's: go left */
		if (entity_before(se, entry)) {
			link = &parent->rb_left;
		} else {/* the entity's vruntime is larger: go right, so it can no longer be the leftmost node */
			link = &parent->rb_right;
			leftmost = false;
		}
	}

	/* link the node and rebalance the tree */
	rb_link_node(&se->run_node, parent, link);
	rb_insert_color_cached(&se->run_node,
			       &cfs_rq->tasks_timeline, leftmost);
}

XII. CFS Dequeue

CFS dequeue is implemented by the dequeue_task_fair function.
1. for_each_sched_entity walks the entity hierarchy; without group scheduling this is just se itself;
2. dequeue_entity is called to do the actual dequeue.

/*
 * The dequeue_task method is called before nr_running is
 * decreased. We remove the task from the rbtree and
 * update the fair scheduling stats:
 */
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &p->se;
	int task_sleep = flags & DEQUEUE_SLEEP;
	int idle_h_nr_running = task_has_idle_policy(p);
	bool was_sched_idle = sched_idle_rq(rq);

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		dequeue_entity(cfs_rq, se, flags);/* dequeue the scheduling entity */

		/*
		 * end evaluation on encountering a throttled cfs_rq
		 *
		 * note: in the case of encountering a throttled cfs_rq we will
		 * post the final h_nr_running decrement below.
		*/
		if (cfs_rq_throttled(cfs_rq))
			break;
		cfs_rq->h_nr_running--;
		cfs_rq->idle_h_nr_running -= idle_h_nr_running;

		/* Don't dequeue parent if it has other entities besides us */
		if (cfs_rq->load.weight) {
			/* Avoid re-evaluating load for this entity: */
			se = parent_entity(se);
			/*
			 * Bias pick_next to pick a task from this cfs_rq, as
			 * p is sleeping when it is within its sched_slice.
			 */
			if (task_sleep && se && !throttled_hierarchy(cfs_rq))
				set_next_buddy(se);
			break;
		}
		flags |= DEQUEUE_SLEEP;
	}
	…………
}

dequeue_entity performs the actual dequeue.
1. update_curr updates the currently running entity's vruntime and min_vruntime;
2. if the dequeued entity is not the currently running one, __dequeue_entity removes it from the CFS run queue's red-black tree;
3. se->on_rq is set to 0, marking the entity as no longer on the run queue.

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);

	/*
	 * When dequeuing a sched_entity, we must:
	 *   - Update loads to have both entity and cfs_rq synced with now.
	 *   - Subtract its load from the cfs_rq->runnable_avg.
	 *   - Subtract its previous weight from cfs_rq->load.weight.
	 *   - For group entity, update its weight to reflect the new share
	 *     of its group cfs_rq.
	 */
	update_load_avg(cfs_rq, se, UPDATE_TG);
	dequeue_runnable_load_avg(cfs_rq, se);

	update_stats_dequeue(cfs_rq, se, flags);

	clear_buddies(cfs_rq, se);
	/* if the dequeued entity is not the currently running one, remove it from the CFS run queue red-black tree */
	if (se != cfs_rq->curr)
		__dequeue_entity(cfs_rq, se);
	se->on_rq = 0;/* mark the entity as no longer on the run queue */
	account_entity_dequeue(cfs_rq, se);/* update cfs_rq->nr_running and the total load */

	/*
	 * Normalize after update_curr(); which will also have moved
	 * min_vruntime if @se is the one holding it back. But before doing
	 * update_min_vruntime() again, which will discount @se's position and
	 * can move min_vruntime forward still more.
	 */
	if (!(flags & DEQUEUE_SLEEP))
		se->vruntime -= cfs_rq->min_vruntime;

	/* return excess runtime on last dequeue */
	return_cfs_rq_runtime(cfs_rq);

	update_cfs_group(se);

	/*
	 * Now advance min_vruntime if @se was the entity holding it back,
	 * except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
	 * put back on, and if we advance min_vruntime, we'll be placed back
	 * further than we started -- ie. we'll be penalized.
	 */
	if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
		update_min_vruntime(cfs_rq);
}

__dequeue_entity removes the scheduling entity from the CFS run queue's red-black tree.

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
}

XIII. CFS Picks the Next Process

When the schedule function performs a switch it has to pick a process to run (see "Inside the Linux Kernel (Process Chapter): Process Scheduling"); for the CFS scheduling class this is done by pick_next_task_fair.
1. if rq->cfs.nr_running is not greater than 0, i.e. the CFS queue currently has no runnable process, jump to the idle label;
2. if CONFIG_FAIR_GROUP_SCHED is defined, the group-scheduling selection path is taken; group scheduling is not considered here;
3. call put_prev_task to re-enqueue the previous process into the CFS run queue's red-black tree according to its vruntime;
4. call pick_next_entity to select the next entity to run; this is the heart of pick_next_task_fair;
5. call set_next_entity to dequeue the chosen entity from the red-black tree and make it the currently running entity;
6. task_of maps the scheduling entity back to its process descriptor;
7. the process descriptor p is returned to the caller.

struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	struct cfs_rq *cfs_rq = &rq->cfs;
	struct sched_entity *se;
	struct task_struct *p;
	int new_tasks;

again:
	/* no runnable CFS task: jump to the idle label */
	if (!sched_fair_runnable(rq)) 
		goto idle;

#ifdef CONFIG_FAIR_GROUP_SCHED /* group scheduling */
	…………
#endif
	/* put the previous task back on the run queue */
	if (prev)
		put_prev_task(rq, prev);

	do {
		/* pick the next entity to run */
		se = pick_next_entity(cfs_rq, NULL);
		/* dequeue the chosen entity and make it the current running entity */
		set_next_entity(cfs_rq, se);
		cfs_rq = group_cfs_rq(se);/* without group scheduling group_cfs_rq returns NULL */
	} while (cfs_rq);/* without group scheduling this condition fails, so the loop runs only once */

	p = task_of(se);/* map the scheduling entity back to its process descriptor */

done: __maybe_unused;
#ifdef CONFIG_SMP
	/*
	 * Move the next running task to the front of
	 * the list, so our cfs_tasks list becomes MRU
	 * one.
	 */
	list_move(&p->se.group_node, &rq->cfs_tasks);
#endif

	if (hrtick_enabled(rq))
		hrtick_start_fair(rq, p);

	update_misfit_status(p, rq);

	return p;

idle:
	if (!rf)
		return NULL;

	new_tasks = newidle_balance(rq, rf);

	/*
	 * Because newidle_balance() releases (and re-acquires) rq->lock, it is
	 * possible for any higher priority task to appear. In that case we
	 * must re-start the pick_next_entity() loop.
	 */
	if (new_tasks < 0)
		return RETRY_TASK;

	if (new_tasks > 0)
		goto again;

	/*
	 * rq is about to be idle, check if we need to update the
	 * lost_idle_time of clock_pelt
	 */
	update_idle_rq_clock_pelt(rq);

	return NULL;
}

The three steps of picking a process are detailed below: put_prev_task, pick_next_entity and set_next_entity.

1. put_prev_task

Picking a process starts with put_prev_task, which re-enqueues the prev process. prev was previously the curr process, the one occupying the CPU, and therefore was not in the run queue's red-black tree; since prev is about to stop running, it has to be put back on the run queue.
put_prev_task is a generic helper that calls the put_prev_task method of the process's scheduling class; for the CFS class this is put_prev_task_fair.

static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
	WARN_ON_ONCE(rq->curr != prev);
	prev->sched_class->put_prev_task(rq, prev);
}

put_prev_task_fair walks the entity hierarchy and calls put_prev_entity to put prev->se back on the run queue.

/*
 * Account for a descheduled task:
 */
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
	struct sched_entity *se = &prev->se;
	struct cfs_rq *cfs_rq;

	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		put_prev_entity(cfs_rq, se);
	}
}

put_prev_entity is implemented as follows.
1. call update_curr to update vruntime and min_vruntime;
2. call __enqueue_entity to insert the prev entity into the CFS run queue's red-black tree;
3. set cfs_rq->curr to NULL, because the currently running process prev is about to stop; it will be set to the newly chosen next process in set_next_entity.

static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
	/*
	 * If still on the runqueue then deactivate_task()
	 * was not called and update_curr() has to be done:
	 */
	if (prev->on_rq)
		update_curr(cfs_rq);

	/* throttle cfs_rqs exceeding runtime */
	check_cfs_rq_runtime(cfs_rq);

	check_spread(cfs_rq, prev);

	if (prev->on_rq) {
		update_stats_wait_start(cfs_rq, prev);
		/* Put 'current' back into the tree. */
		__enqueue_entity(cfs_rq, prev);
		/* in !on_rq case, update occurred at dequeue */
		update_load_avg(cfs_rq, prev, 0);
	}
	cfs_rq->curr = NULL;
}

2. pick_next_entity

The basic rules CFS uses to pick the next scheduling entity:

  1. keep things fair between tasks and task groups;
  2. prefer the cfs_rq->next entity, because cfs_rq->next records an entity that really wants to run;
  3. prefer the cfs_rq->last entity, to exploit cache locality; cfs_rq->last records the entity (not a scheduling group) that last occupied the CPU and performed the wakeup;
  4. do not run the cfs_rq->skip entity: cfs_rq->skip records the entity to skip, which is how the sched_yield system call is implemented. After calling sched_yield a process voluntarily gives up the CPU and the currently running entity is recorded in cfs_rq->skip; skipping cfs_rq->skip in pick_next_task implements the yield.

pick_next_entity follows exactly these rules:
1. get the leftmost node of the CFS run queue's red-black tree, i.e. the entity with the smallest vruntime on the queue;
2. if the currently running entity's vruntime is smaller than the leftmost entity's (entity_before(a, b) returns true when a < b), left is set to the current entity;
3. if the entity se chosen in steps 1 and 2 is cfs_rq->skip, a replacement second has to be picked;
4. if the chosen se is curr, the leftmost node of the tree becomes second;
5. if the chosen se is not curr, the second-leftmost node of the tree becomes second;
6. if the second-leftmost node is empty, or the currently running entity's vruntime is smaller than it, curr becomes second;
7. the remaining decisions depend on wakeup_preempt_entity. That function decides whether its second argument se may preempt curr: if vdiff (curr->vruntime - se->vruntime) is less than or equal to 0 it returns -1, meaning curr keeps the CPU in preference to se; if vdiff is greater than the wakeup granularity it returns 1, meaning se gets the CPU in preference to curr; if vdiff is less than or equal to the wakeup granularity it returns 0, meaning curr keeps the CPU in preference to se.

static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	/* vruntime difference between the current entity and the preempting entity */
	s64 gran, vdiff = curr->vruntime - se->vruntime;
	if (vdiff <= 0) /* difference <= 0: return -1 */
		return -1;
	/* wakeup granularity, 1 ms of real time by default, converted to virtual time using se's weight */
	gran = wakeup_gran(se);
	if (vdiff > gran)/* difference greater than the granularity: return 1 */
		return 1;
	return 0; /* difference within the granularity: return 0 */
}

8. call wakeup_preempt_entity to compare second with left; if second is preferable, se is set to second;
9. call wakeup_preempt_entity to compare cfs_rq->last with left; if cfs_rq->last is preferable, se is set to cfs_rq->last;
10. call wakeup_preempt_entity to compare cfs_rq->next with left; if cfs_rq->next is preferable, se is set to cfs_rq->next.

/*
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	/* get the leftmost node of the CFS run queue red-black tree */
	struct sched_entity *left = __pick_first_entity(cfs_rq);
	struct sched_entity *se;

	/*
	 * If curr is set we have to see if its left of the leftmost entity
	 * still in the tree, provided there was anything in the tree at all.
	 */
	/* the current entity's vruntime is smaller than the leftmost entity's: pick the current entity */
	if (!left || (curr && entity_before(curr, left)))
		left = curr;

	/* se is the entity that will eventually be returned; start with left */
	se = left; /* ideally we run the leftmost entity */

	/*
	 * Avoid running the skip buddy, if running something else can
	 * be done without getting too unfair.
	 */
	 /* the chosen entity se is the skip entity: pick again */
	if (cfs_rq->skip == se) {
		struct sched_entity *second;

		if (se == curr) {/* se == curr == skip: take the leftmost node of the red-black tree */
			second = __pick_first_entity(cfs_rq);
		} else {/* se == leftmost == skip: take the second-leftmost node */
			second = __pick_next_entity(se);
			if (!second || (curr && entity_before(curr, second)))
				second = curr;/* curr is preferable to the second-leftmost node: take curr */
		}

		if (second && wakeup_preempt_entity(second, left) < 1)
			se = second;/* second is preferable to left: take second */
	}

	/*
	 * Prefer last buddy, try to return the CPU to a preempted task.
	 */
	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
		se = cfs_rq->last;/* cfs_rq->last is preferable to left: take last */

	/*
	 * Someone really wants this to run. If it's not unfair, run it.
	 */
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;/* cfs_rq->next is preferable to left: take next */

	/* clear cfs_rq->last, cfs_rq->next and cfs_rq->skip */
	clear_buddies(cfs_rq, se);

	return se;
}

3. set_next_entity

After dealing with the previous process and picking the next one, a few things still have to be done for the next process; set_next_entity takes care of them.
1. the process about to run has to be removed from the CFS run queue's red-black tree, so __dequeue_entity is called to remove its node;
2. the CFS run queue's curr pointer is set to the entity about to run (it was set to NULL in put_prev_entity);
3. se->prev_sum_exec_runtime records the value of se->sum_exec_runtime. The "Using the Virtual Runtime" section showed that check_preempt_tick computes the actual runtime of the current slice as curr->sum_exec_runtime - curr->prev_sum_exec_runtime: prev_sum_exec_runtime is the accumulated CPU time recorded when the entity was last scheduled in, sum_exec_runtime is the total accumulated CPU time, and their difference is the CPU time used in this slice.

static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* 'current' is not kept within the tree. */
	if (se->on_rq) {
		/*
		 * Any task has to be enqueued before it get to execute on
		 * a CPU. So account for the time it spent waiting on the
		 * runqueue.
		 */
		update_stats_wait_end(cfs_rq, se);
		__dequeue_entity(cfs_rq, se);/* remove the entity's node from the red-black tree */
		update_load_avg(cfs_rq, se, UPDATE_TG);
	}

	update_stats_curr_start(cfs_rq, se);
	cfs_rq->curr = se;/* point curr at the entity about to run */

	/*
	 * Track our maximum slice length, if the CPU's load is at
	 * least twice that of our own weight (i.e. dont track it
	 * when there are only lesser-weight tasks around):
	 */
	if (schedstat_enabled() &&
	    rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
		schedstat_set(se->statistics.slice_max,
			max((u64)schedstat_val(se->statistics.slice_max),
			    se->sum_exec_runtime - se->prev_sum_exec_runtime));
	}
	/* remember the accumulated runtime so the CPU time of the coming slice can be computed */
	se->prev_sum_exec_runtime = se->sum_exec_runtime;
}

XIV. Process Wakeup

A sleeping process can be woken by another process through wake_up_process, and a forked child is woken by its parent through wake_up_new_task; see the process-wakeup section of "Inside the Linux Kernel (Process Chapter): Process Scheduling".
When a process is woken, check_preempt_curr is called to check whether the current process can be preempted.
1. if the waking process and the current process belong to the same scheduling class, the class's check_preempt_curr method makes the decision;
2. if they belong to different scheduling classes, the decision is made by class priority;
3. the scheduling classes are walked from highest to lowest priority;
4. if the current process's class is matched first, its class outranks the waking process's class, so no preemption happens and the loop ends;
5. if the waking process's class is matched first, its class outranks the current process's class, so preemption is allowed: resched_curr (which sets TIF_NEED_RESCHED in the current task's thread_info) triggers deferred scheduling to take over the CPU.

void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
	const struct sched_class *class;
	/* same scheduling class: let the class's check_preempt_curr method decide */
	if (p->sched_class == rq->curr->sched_class) {
		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
	} else {/* different scheduling classes: decide by class priority */
		for_each_class(class) {/* walk the classes from highest to lowest priority */
			/* current task's class found first: it outranks the waking task, no preemption */
			if (class == rq->curr->sched_class)
				break;
			/* waking task's class found first: it outranks the current task, preempt */
			if (class == p->sched_class) {
				resched_curr(rq);/* trigger deferred scheduling to take the CPU */
				break;
			}
		}
	}

	/*
	 * A queue event has occurred, and we're going to schedule.  In
	 * this case, we can save a useless back to back clock update.
	 */
	if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
		rq_clock_skip_update(rq);
}

For the CFS scheduling class, check_preempt_curr is implemented by check_preempt_wakeup.
1. if the current entity and the waking entity are the same, return;
2. if the wakeup is not a fork wakeup and the CFS run queue holds more than 8 entities, the waking entity is recorded as cfs_rq->next, which pick_next_entity prefers;
3. SCHED_BATCH and SCHED_IDLE processes do not preempt SCHED_NORMAL processes; their scheduling is driven by scheduler_tick;
4. wakeup_preempt_entity(se, pse) decides whether pse (the woken entity) may preempt se (the current entity); the function was described in the "pick_next_entity" section. If preemption is allowed the preempt path is taken, otherwise return;
5. preempt path: resched_curr sets TIF_NEED_RESCHED in the current task's thread_info, triggering deferred scheduling;
6. if the entity is a task (not a group) and the CFS run queue holds more than 8 entities, the preempted current entity is recorded as cfs_rq->last; to exploit cache locality and reduce cache refills, pick_next_entity may pick the preempted process to run again.

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
	struct task_struct *curr = rq->curr;
	struct sched_entity *se = &curr->se, *pse = &p->se;
	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
	int scale = cfs_rq->nr_running >= sched_nr_latency;/* sched_nr_latency is 8 */
	int next_buddy_marked = 0;

	if (unlikely(se == pse))
		return;/* the current entity and the waking entity are the same: return */

	/*
	 * This is possible from callers such as attach_tasks(), in which we
	 * unconditionally check_prempt_curr() after an enqueue (which may have
	 * lead to a throttle).  This both saves work and prevents false
	 * next-buddy nomination below.
	 */
	if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
		return;
	/* (wake_flags & WF_FORK) tests for a fork wakeup; scale tests whether the CFS run queue holds more than 8 entities */
	if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
		set_next_buddy(pse);/* record the waking entity as cfs_rq->next; pick_next_entity prefers it */
		next_buddy_marked = 1;/* remember that cfs_rq->next has been set */
	}

	/*
	 * We can come here with TIF_NEED_RESCHED already set from new task
	 * wake up path.
	 *
	 * Note: this also catches the edge-case of curr being in a throttled
	 * group (e.g. via set_curr_task), since update_curr() (in the
	 * enqueue of curr) will have resulted in resched being set.  This
	 * prevents us from potentially nominating it as a false LAST_BUDDY
	 * below.
	 */
	if (test_tsk_need_resched(curr))
		return;

	/* Idle tasks are by definition preempted by non-idle tasks. */
	if (unlikely(task_has_idle_policy(curr)) &&
	    likely(!task_has_idle_policy(p)))
		goto preempt;

	/*
	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
	 * is driven by the tick):
	 */
	 /* SCHED_BATCH and SCHED_IDLE tasks do not preempt SCHED_NORMAL tasks */
	if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
		return;

	find_matching_se(&se, &pse);
	update_curr(cfs_rq_of(se));
	BUG_ON(!pse);
	/* check whether pse may preempt se; wakeup_preempt_entity returning 1 means preemption is allowed */
	if (wakeup_preempt_entity(se, pse) == 1) {
		/*
		 * Bias pick_next to pick the sched entity that is
		 * triggering this preemption.
		 */
		if (!next_buddy_marked)
			set_next_buddy(pse);/* cfs_rq->next not set yet: set it to pse */
		goto preempt;
	}

	return;

preempt:
	resched_curr(rq);/* trigger deferred scheduling to take the CPU */
	/*
	 * Only set the backward buddy when the current task is still
	 * on the rq. This can happen when a wakeup gets interleaved
	 * with schedule on the ->pre_schedule() or idle_balance()
	 * point, either of which can * drop the rq lock.
	 *
	 * Also, during early boot the idle thread is in the fair class,
	 * for obvious reasons its a bad idea to schedule back to it.
	 */
	if (unlikely(!se->on_rq || curr == rq->idle))
		return;/* return if se is not queued or the current task is the idle task */

	if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
		set_last_buddy(se);/* record the preempted entity as cfs_rq->last; to exploit cache locality and reduce cache refills, pick_next_entity may pick the preempted task again */
}

XV. Summary

  1. The CFS scheduling class provides the methods CFS needs, including enqueue and dequeue, picking the next process to run, scheduler_tick handling and so on.
  2. The CFS run queue organizes processes in a red-black tree whose key is se->vruntime; entities with smaller vruntime sit toward the left of the tree.
  3. CFS schedules scheduling entities, i.e. normal processes or normal scheduling groups; normal-process priorities range over [100, 139] with a default of 120, modifiable through the nice system call;
  4. NICE values are converted into scheduling weights inside the kernel;
  5. the weight is used to compute the scheduling entity's vruntime;
  6. the vruntime of newly created and newly woken processes has to be corrected before use;
  7. picking the next process to run is a three-step sequence: put_prev_task, pick_next_entity and set_next_entity;
  8. scheduler_tick runs periodically and calls check_preempt_tick to trigger deferred scheduling;
  9. wake_up_process and wake_up_new_task call check_preempt_wakeup to trigger deferred scheduling.

The kernel version used in this article is Linux 5.6.4.
