CFS Scheduling
Before digging into CFS scheduling in Linux, it helps to first read 《深入Linux内核(进程篇)—进程调度》 for a general picture of process scheduling.
Processes can be roughly divided into interactive processes, batch processes, and real-time processes. To cover these process types, the Linux kernel defines six scheduling policies:
#define SCHED_NORMAL 0
#define SCHED_FIFO 1
#define SCHED_RR 2
#define SCHED_BATCH 3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE 5
#define SCHED_DEADLINE 6
SCHED_NORMAL and SCHED_BATCH apply to normal processes; SCHED_FIFO, SCHED_RR, and SCHED_DEADLINE apply to real-time processes; SCHED_IDLE applies to the idle process (process 0, which runs when the CPU has no runnable process).
Linux schedules normal processes with the CFS scheduler and real-time processes with the realtime scheduler.
In most systems the majority of processes are normal processes handled by CFS, the Completely Fair Scheduler. This can be seen in pick_next_task, the function that selects the next process during a context switch.
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
const struct sched_class *class;
struct task_struct *p;
/*
* Optimization: we know that if all tasks are in the fair class we can
* call that function directly, but only if the @prev task wasn't of a
* higher scheduling class, because otherwise those loose the
* opportunity to pull in more work from other CPUs.
*/
/* likely tells the compiler this branch is the common case, i.e. almost all runnable tasks are CFS tasks */
if (likely((prev->sched_class == &idle_sched_class ||
prev->sched_class == &fair_sched_class) &&
rq->nr_running == rq->cfs.h_nr_running)) {
p = pick_next_task_fair(rq, prev, rf);
if (unlikely(p == RETRY_TASK))
goto restart;
/* Assumes fair_sched_class->next == idle_sched_class */
if (!p) {
put_prev_task(rq, prev);
p = pick_next_task_idle(rq);
}
return p;
}
…………
for_each_class(class) {
p = class->pick_next_task(rq);
if (p)
return p;
}
/* The idle class should always have a runnable task: */
BUG();
}
To implement process scheduling, Linux introduces the following concepts:
- Scheduling classes: 5 scheduling classes, each providing a set of scheduling methods;
- Scheduling entities: 3 scheduling entity types, the basic unit of scheduling, supporting both per-task and group scheduling;
- Scheduling policies: 6 scheduling policies, split between real-time and normal processes;
- Scheduling algorithms: 4 scheduling algorithms, one implementing each scheduling policy.
The table below maps these concepts to each other, ordered from the highest-priority scheduling class to the lowest.
Scheduling class | Scheduling policy | Scheduling entity | Scheduling algorithm |
---|---|---|---|
stop_sched_class | none | none | none |
dl_sched_class | SCHED_DEADLINE | sched_dl_entity | EDF |
rt_sched_class | SCHED_RR/SCHED_FIFO | sched_rt_entity | RR/FIFO |
fair_sched_class | SCHED_NORMAL/SCHED_BATCH | sched_entity | CFS |
idle_sched_class | none | none | none |
The CFS scheduling described in this article corresponds to the fair_sched_class scheduling class, the SCHED_NORMAL/SCHED_BATCH scheduling policies, and the sched_entity scheduling entity.
1. The CFS Scheduling Class
The CFS scheduling class, fair_sched_class, provides the methods CFS must implement.
Method | Description |
---|---|
enqueue_task_fair | Add the process's scheduling entity to the run queue's red-black tree |
dequeue_task_fair | Remove the process's scheduling entity from the run queue's red-black tree |
yield_task_fair | Backs the sched_yield system call; places the current entity in the run queue's skip slot |
check_preempt_wakeup | Check whether the current process can be preempted by a newly woken process |
__pick_next_task_fair | Select the next process to run within this scheduling class |
put_prev_task_fair | Update the process's vruntime and put it back on the run queue; used together with set_next_task |
set_next_task_fair | Remove the process from the run queue and make it the current running entity; used together with put_prev_task |
task_tick_fair | Called from the scheduler_tick timer; CFS calls update_curr to account vruntime and may trigger rescheduling |
task_fork_fair | Called from do_fork; CFS initializes the child's vruntime |
/*
* All the scheduling class methods:
*/
const struct sched_class fair_sched_class = {
.next = &idle_sched_class, /* next scheduling class: idle */
.enqueue_task = enqueue_task_fair, /* add a task to the run queue */
.dequeue_task = dequeue_task_fair,/* remove a task from the run queue */
.yield_task = yield_task_fair,/* task voluntarily yields the CPU: dequeued, then re-enqueued at the tail */
.yield_to_task = yield_to_task_fair,
.check_preempt_curr = check_preempt_wakeup,/* check whether current can be preempted by a newly woken task */
.pick_next_task = __pick_next_task_fair,/* select the next task to run within this class */
.put_prev_task = put_prev_task_fair,/* put the previous task back on the run queue */
.set_next_task = set_next_task_fair,
#ifdef CONFIG_SMP
.balance = balance_fair,
.select_task_rq = select_task_rq_fair,/* return the CPU number of the task's run queue */
.migrate_task_rq = migrate_task_rq_fair,/* migrate the task to a given CPU; called from set_task_cpu */
.rq_online = rq_online_fair,/* run queue goes online */
.rq_offline = rq_offline_fair,/* run queue goes offline */
.task_dead = task_dead_fair,/* called during a switch when the previous task's state is TASK_DEAD */
.set_cpus_allowed = set_cpus_allowed_common,
#endif
.task_tick = task_tick_fair,/* called from the scheduler_tick timer; CFS calls update_curr to account vruntime */
.task_fork = task_fork_fair,/* called from do_fork; CFS initializes vruntime */
/* __sched_setscheduler calls check_class_changed; switching a task's scheduling policy triggers the class switch and priority-change callbacks */
.prio_changed = prio_changed_fair,
.switched_from = switched_from_fair,
.switched_to = switched_to_fair,
/* sched_rr_get_interval system call: returns the round-robin timeslice; non-RR classes return 0 */
.get_rr_interval = get_rr_interval_fair,
.update_curr = update_curr_fair,/* update the current task's runtime accounting */
#ifdef CONFIG_FAIR_GROUP_SCHED
.task_change_group = task_change_group_fair,
#endif
#ifdef CONFIG_UCLAMP_TASK
.uclamp_enabled = 1,
#endif
};
2. The CFS Run Queue
The kernel allocates one run queue rq per CPU (this_rq()). Each rq contains a CFS run queue cfs_rq, which organizes and schedules the normal processes on that CPU.
Member | Description |
---|---|
load | Total load of all scheduling entities on the run queue; update_load_add adds an entity's load at enqueue, update_load_sub subtracts it at dequeue |
nr_running | Number of scheduling entities on this CFS run queue; incremented at enqueue, decremented at dequeue |
h_nr_running | Number of scheduling entities counted hierarchically, i.e. including tasks queued on child group run queues |
idle_h_nr_running | Number of SCHED_IDLE scheduling entities |
min_vruntime | The smallest vruntime on the CFS run queue's red-black tree, kept monotonically non-decreasing. This value is central to compensating the vruntime of woken and newly forked processes |
tasks_timeline | Root of the CFS run queue's red-black tree; all scheduling entities are inserted keyed by se->vruntime |
curr | The scheduling entity currently running on this CFS run queue |
next | An entity that should run soon; a wakeup may store the woken entity here. pick_next_entity prefers cfs_rq->next |
last | The entity that performed the wakeup. Unlike curr, which always tracks the running entity, last is only set on wakeup. pick_next_entity takes cfs_rq->last as its second choice, which helps cache reuse |
skip | An entity to skip; the sched_yield system call stores the current entity in cfs_rq->skip. If pick_next_entity selects the skip entity, it picks again |
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load; /* total load of all scheduling entities on this CFS run queue */
unsigned long runnable_weight;
unsigned int nr_running; /* number of scheduling entities on this CFS run queue */
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
unsigned int idle_h_nr_running; /* SCHED_IDLE */
u64 exec_clock;
u64 min_vruntime;/* smallest vruntime on this CFS run queue's red-black tree */
#ifndef CONFIG_64BIT
u64 min_vruntime_copy;
#endif
struct rb_root_cached tasks_timeline;/* root of the CFS run queue's red-black tree */
/*
* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
struct sched_entity *curr;/* entity currently running on this cfs_rq */
struct sched_entity *next;/* entity woken up (wakeup target) */
struct sched_entity *last;/* entity that performed the wakeup */
struct sched_entity *skip;/* entity to skip over */
#ifdef CONFIG_SCHED_DEBUG
unsigned int nr_spread_over;
#endif
#ifdef CONFIG_SMP
/*
* CFS load tracking
*/
struct sched_avg avg;
#ifndef CONFIG_64BIT
u64 load_last_update_time_copy;
#endif
struct {
raw_spinlock_t lock ____cacheline_aligned;
int nr;
unsigned long load_avg;
unsigned long util_avg;
unsigned long runnable_sum;
} removed;
#ifdef CONFIG_FAIR_GROUP_SCHED
unsigned long tg_load_avg_contrib;
long propagate;
long prop_runnable_sum;
/*
* h_load = weight * f(tg)
*
* Where f(tg) is the recursive weight fraction assigned to
* this group.
*/
unsigned long h_load;
u64 last_h_load_update;
struct sched_entity *h_load_next;
#endif /* CONFIG_FAIR_GROUP_SCHED */
#endif /* CONFIG_SMP */
#ifdef CONFIG_FAIR_GROUP_SCHED
struct rq *rq; /* CPU runqueue to which this cfs_rq is attached */
/*
* leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
* a hierarchy). Non-leaf lrqs hold other higher schedulable entities
* (like users, containers etc.)
*
* leaf_cfs_rq_list ties together list of leaf cfs_rq's in a CPU.
* This list is used during load balance.
*/
int on_list;
struct list_head leaf_cfs_rq_list;
struct task_group *tg; /* group that "owns" this runqueue */
…………
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
3. The CFS Scheduling Entity
Every process descriptor task_struct embeds a sched_entity member, the scheduling entity. The scheduling entity is the basic unit of scheduling; it can represent a single process or a scheduling group.
struct sched_entity {
/* For load-balancing: */
struct load_weight load; /* scheduling weight */
unsigned long runnable_weight;
struct rb_node run_node;/* this entity's node in the cfs_rq red-black tree */
struct list_head group_node;
unsigned int on_rq;/* whether this entity is on a run queue being scheduled */
u64 exec_start;
u64 sum_exec_runtime;/* cumulative real running time */
u64 vruntime;/* virtual running time */
u64 prev_sum_exec_runtime;/* cumulative real running time as of the previous pick */
u64 nr_migrations;
struct sched_statistics statistics;
#ifdef CONFIG_FAIR_GROUP_SCHED
int depth;
struct sched_entity *parent;
/* rq on which this entity is (to be) queued: */
struct cfs_rq *cfs_rq;
/* rq "owned" by this entity/group: */
struct cfs_rq *my_q;
#endif
#ifdef CONFIG_SMP
/*
* Per entity load average tracking.
*
* Put into separate cache line so it does not
* collide with read-mostly values above.
*/
struct sched_avg avg;/* per-entity load tracking */
#endif
};
4. Scheduling Priorities
int prio; /* dynamic priority */
int static_prio; /* static priority */
int normal_prio; /* normal priority */
unsigned int rt_priority; /* real-time priority */
The kernel represents process priority as 0-139, with lower numbers meaning higher priority. Priorities 0-99 are used for real-time processes and 100-139 for normal processes.
The macro MAX_RT_PRIO is the upper bound of real-time priorities, with value 100.
The macro DEFAULT_PRIO is the default normal-process priority, with value 120; a newly created normal process starts at 120 and can be adjusted via the nice system call.
Nice values range from -20 to 19; negative values raise the process's priority and positive values lower it.
Real-time priorities span 0..MAX_RT_PRIO-1 (0-99), and normal-process priorities span MAX_RT_PRIO..MAX_PRIO-1 (100-139).
#define MAX_NICE 19
#define MIN_NICE -20
#define NICE_WIDTH (MAX_NICE - MIN_NICE + 1)
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
* tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
* values are inverted: lower p->prio value means higher priority.
*
* The MAX_USER_RT_PRIO value allows the actual maximum
* RT priority to be separate from the value exported to
* user-space. This allows kernel threads to set their
* priority to a value higher than any user task. Note:
* MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
*/
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO MAX_USER_RT_PRIO
#define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
5. Scheduling Weights
The scheduling entity contains a weight member, struct load_weight load, defined as follows.
struct load_weight {
unsigned long weight; /* weight of the scheduling entity */
u32 inv_weight; /* precomputed inverse of the weight, used to avoid division */
};
Key points about scheduling weights:
- A normal process's priority is determined by its nice value; the default priority is 120, nice ranges over [-20,19], so normal priorities range over [100,139].
- The kernel converts priority to weight for scheduling normal processes. The nice range [-20,19] gives 40 priority levels; a higher nice value means lower priority and lower weight, and vice versa.
- Each +1 step in nice costs about 10% CPU time relative to nice 0, and each -1 step gains about 10%.
- For computational convenience the kernel fixes the weight at nice 0 to 1024; weights for other nice values are looked up in the global array sched_prio_to_weight.
- Adjacent entries differ by a factor of about 1.25: sched_prio_to_weight[i] ≈ 1.25 * sched_prio_to_weight[i+1].
- The kernel provides a companion array sched_prio_to_wmult holding 2^32/sched_prio_to_weight; these values are precomputed for efficiency.
A worked example of how weight translates into CPU time:
Processes A and B are both created with nice 0, so both have weight 1024 and each gets 50% of the CPU: CPU_A = CPU_B = 1024/(1024+1024) = 50%.
Now change B's nice value to 1, so B gets about 10% less CPU time than A. The weights of A and B are then 1024 and 820 respectively, so CPU_A = 1024/(1024+820) ≈ 55% and CPU_B = 820/(1024+820) ≈ 45%.
The code for sched_prio_to_weight and sched_prio_to_wmult follows.
/*
* Nice levels are multiplicative, with a gentle 10% change for every
* nice level changed. I.e. when a CPU-bound task goes from nice 0 to
* nice 1, it will get ~10% less CPU time than another CPU-bound task
* that remained on nice 0.
*
* The "10% effect" is relative and cumulative: from _any_ nice level,
* if you go up 1 level, it's -10% CPU usage, if you go down 1 level
* it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
* If a task goes up by ~10% and another task goes down by ~10% then
* the relative distance between them is ~25%.)
*/
const int sched_prio_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */ 9548, 7620, 6100, 4904, 3906,
/* -5 */ 3121, 2501, 1991, 1586, 1277,
/* 0 */ 1024, 820, 655, 526, 423,
/* 5 */ 335, 272, 215, 172, 137,
/* 10 */ 110, 87, 70, 56, 45,
/* 15 */ 36, 29, 23, 18, 15,
};
/*
* Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated.
*
* In cases where the weight does not change often, we can use the
* precalculated inverse to speed up arithmetics by turning divisions
* into multiplications:
*/
const u32 sched_prio_to_wmult[40] = {
/* -20 */ 48388, 59856, 76040, 92818, 118348,
/* -15 */ 147320, 184698, 229616, 287308, 360437,
/* -10 */ 449829, 563644, 704093, 875809, 1099582,
/* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,
/* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,
/* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,
/* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
sched_prio_to_weight[i] and sched_prio_to_wmult[i] are related by:
inv\_weight = \frac{2^{32}}{weight}
The kernel provides set_load_weight to set a scheduling entity's load (p->se.load).
static void set_load_weight(struct task_struct *p, bool update_load)
{
int prio = p->static_prio - MAX_RT_PRIO;
struct load_weight *load = &p->se.load;
/*
* SCHED_IDLE tasks get minimal weight:
*/
if (task_has_idle_policy(p)) {
load->weight = scale_load(WEIGHT_IDLEPRIO);
load->inv_weight = WMULT_IDLEPRIO;
p->se.runnable_weight = load->weight;
return;
}
/*
* SCHED_OTHER tasks have to update their load when changing their
* weight
*/
if (update_load && p->sched_class == &fair_sched_class) {
reweight_task(p, prio);
} else {
load->weight = scale_load(sched_prio_to_weight[prio]);
load->inv_weight = sched_prio_to_wmult[prio];
p->se.runnable_weight = load->weight;
}
}
Why does the kernel define the sched_prio_to_wmult array at all? The answer lies in how the CFS scheduler computes virtual runtime.
6. Computing Virtual Runtime
The CFS scheduler abandons the classic timeslice-based algorithm; instead it schedules processes by a virtual runtime computed from their weights.
- Each process's virtual runtime is its real runtime scaled by its weight relative to the nice-0 weight.
- A process with a small nice value has a large weight, so its virtual runtime advances more slowly than real time, and it therefore receives more CPU time.
Virtual runtime is computed as:
vruntime = \frac{delta\_exec * NICE\_0\_LOAD}{weight}
where delta_exec is the real running time, NICE_0_LOAD is the weight at nice 0, and weight is the process's weight.
This formula requires a division, which the kernel would rather avoid, so it is transformed as follows:
vruntime = (\frac{delta\_exec * NICE\_0\_LOAD * 2^{32}}{weight}) >> 32
This explains why the kernel needs the sched_prio_to_wmult (2^32/weight) array. The formula then reduces to:
vruntime = (delta\_exec * NICE\_0\_LOAD * inv\_weight) >> 32
By precomputing the division in sched_prio_to_wmult, the kernel ensures that the actual calculation uses only multiplications and shifts.
The kernel implementation is:
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
u64 fact = scale_load_down(weight);/* here weight == NICE_0_LOAD */
int shift = WMULT_SHIFT;
__update_inv_weight(lw);
if (unlikely(fact >> 32)) {
while (fact >> 32) {
fact >>= 1;
shift--;
}
}
/* fact = NICE_0_LOAD * inv_weight */
fact = mul_u32_u32(fact, lw->inv_weight);
while (fact >> 32) {
fact >>= 1;
shift--;
}
/* return (delta_exec * fact) >> shift */
return mul_u64_u32_shr(delta_exec, fact, shift);
}
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
if (unlikely(se->load.weight != NICE_0_LOAD))
delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
/* if the entity's weight is NICE_0_LOAD, its virtual runtime equals its real runtime */
return delta;
}
Clearly, when the nice value is 0, i.e. the entity's weight is NICE_0_LOAD, the kernel skips the computation entirely and returns the real runtime: vruntime equals delta.
7. Updating Virtual Runtime
CFS implements its scheduling policy for normal processes on top of vruntime, and vruntime is updated by update_curr.
1. update_curr takes the current process's CFS run queue as its argument;
2. rq_clock_task reads the run queue's clock_task member, which is updated by update_rq_clock_task;
3. delta_exec is the time elapsed since this process last called update_curr;
4. calc_delta_fair converts that into a vruntime increment, as described in the previous section;
5. update_min_vruntime updates cfs_rq->min_vruntime, which records the smallest vruntime on the CFS run queue.
/*
* Update the current task's runtime statistics.
*/
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr; /* current scheduling entity */
u64 now = rq_clock_task(rq_of(cfs_rq)); /* the run queue's clock_task value */
u64 delta_exec;
if (unlikely(!curr))
return;
/* delta_exec: time elapsed since this process last called update_curr */
delta_exec = now - curr->exec_start;
if (unlikely((s64)delta_exec <= 0))
return;
/* curr->exec_start records the time of this call */
curr->exec_start = now;
schedstat_set(curr->statistics.exec_max,
max(delta_exec, curr->statistics.exec_max));
/* curr->sum_exec_runtime accumulates the process's total CPU time */
curr->sum_exec_runtime += delta_exec;
schedstat_add(cfs_rq->exec_clock, delta_exec);
/* convert the real-time increment delta_exec into a vruntime increment */
curr->vruntime += calc_delta_fair(delta_exec, curr);
/* update cfs_rq->min_vruntime */
update_min_vruntime(cfs_rq);
if (entity_is_task(curr)) {
struct task_struct *curtask = task_of(curr);
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cgroup_account_cputime(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
account_cfs_rq_runtime(cfs_rq, delta_exec);
}
update_min_vruntime updates cfs_rq->min_vruntime so that it tracks the smallest virtual runtime on the run queue.
1. A local vruntime starts from the previously recorded min_vruntime;
2. If the curr entity is on the run queue, vruntime is set to curr->vruntime;
3. If the red-black tree is non-empty and curr is on the run queue, vruntime becomes the minimum of the leftmost entity's vruntime and the value from step 2;
4. If curr is not on the run queue, vruntime becomes the leftmost entity's vruntime;
5. cfs_rq->min_vruntime is set to the maximum of its old value and the vruntime computed above.
update_min_vruntime thus guarantees that cfs_rq->min_vruntime is monotonically non-decreasing.
static void update_min_vruntime(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr; /* current scheduling entity */
struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);/* leftmost node of the CFS red-black tree */
/* start from the previously recorded min_vruntime */
u64 vruntime = cfs_rq->min_vruntime;
if (curr) {
if (curr->on_rq)
vruntime = curr->vruntime;/* curr is on the run queue, take its vruntime */
else
curr = NULL;
}
if (leftmost) { /* non-empty tree */
struct sched_entity *se;
/* get the scheduling entity at the leftmost node */
se = rb_entry(leftmost, struct sched_entity, run_node);
if (!curr)
/* no current entity: take the leftmost entity's vruntime */
vruntime = se->vruntime;
else
/* otherwise take the smaller of the current value and the leftmost entity's vruntime */
vruntime = min_vruntime(vruntime, se->vruntime);
}
/* ensure we never gain time by being placed backwards. */
/* cfs_rq->min_vruntime becomes the larger of its old value and vruntime */
cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);
#ifndef CONFIG_64BIT
smp_wmb();
cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;
#endif
}
Virtual runtime is the basic currency of CFS scheduling, and its computation has now been covered; but while a process runs, when is its vruntime actually updated? update_curr is called from many places in the kernel; a few examples:
- At process creation, do_fork calls sched_fork, which reaches the task_fork_fair method, which calls update_curr. task_fork_fair also calls update_rq_clock to refresh rq->clock_task.
- The kernel timer task scheduler_tick periodically invokes the scheduling class's task_tick method (see the scheduler_tick section of 《深入Linux内核(进程篇)—进程调度》). For CFS this is task_tick_fair, which calls entity_tick, which calls update_curr. This is the periodic path.
- When a scheduling entity joins the CFS run queue, enqueue_task_fair also calls update_curr.
8. Using Virtual Runtime
How the kernel uses virtual runtime to drive scheduling:
- An entity's vruntime accumulates as it uses the CPU. Once the entity has run beyond its allotted share, the kernel sets TIF_NEED_RESCHED in the process's thread_info, triggering a deferred call to schedule. This is where virtual runtime earns its keep.
- vruntime is also the key of the CFS run queue's red-black tree: entities with smaller vruntime sort to the left, and pick_next_task during schedule prefers the leftmost node. vruntime comparisons also decide whether the current entity may be preempted.
This section covers the first point; the second is covered in the "CFS Enqueue" and "CFS Picking the Next Process" sections.
The kernel's scheduler_tick periodically calls curr->sched_class->task_tick, which for CFS is task_tick_fair.
task_tick_fair calls entity_tick to implement the deferred rescheduling described above.
/*
* scheduler tick hitting a task of our scheduling class.
*
* NOTE: This function can be called remotely by the tick offload that
* goes along full dynticks. Therefore no local assumption can be made
* and everything must be accessed through the @rq and @curr passed in
* parameters.
*/
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &curr->se;
/* walk the entity hierarchy; without group scheduling this is just the current entity */
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
entity_tick(cfs_rq, se, queued);
}
if (static_branch_unlikely(&sched_numa_balancing))
task_tick_numa(rq, curr);
update_misfit_status(curr, rq);
update_overutilized_status(task_rq(curr));
}
entity_tick does the following:
1. Call update_curr to update the current entity's vruntime and the CFS run queue's min_vruntime;
2. Call update_load_avg to update the entity's load average;
3. Call check_preempt_tick to check whether rescheduling is needed.
static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
/*
* Update run-time statistics of the 'current'.
*/
/* update the current entity's vruntime and the CFS run queue's min_vruntime */
update_curr(cfs_rq);
/*
* Ensure that runnable average is periodically updated.
*/
update_load_avg(cfs_rq, curr, UPDATE_TG);
update_cfs_group(curr);
#ifdef CONFIG_SCHED_HRTICK
/*
* queued ticks are scheduled to match the slice, so don't bother
* validating it and just reschedule.
*/
if (queued) {
resched_curr(rq_of(cfs_rq));
return;
}
/*
* don't let the period tick interfere with the hrtick preemption
*/
if (!sched_feat(DOUBLE_TICK) &&
hrtimer_active(&rq_of(cfs_rq)->hrtick_timer))
return;
#endif
/* check whether rescheduling is needed */
if (cfs_rq->nr_running > 1)
check_preempt_tick(cfs_rq, curr);
}
check_preempt_tick works as follows:
1. Call sched_slice to get the current entity's ideal runtime, ideal_runtime, i.e. the CPU time the entity is entitled to by weight;
2. Compute the entity's actual runtime in this slice, delta_exec, i.e. the CPU time already used;
3. If the actual runtime exceeds the ideal runtime, call resched_curr to set TIF_NEED_RESCHED in the process's thread_info, triggering a deferred reschedule;
4. If the actual runtime is below the minimum granularity sysctl_sched_min_granularity (0.75 ms by default), do not reschedule;
5. Compute the difference between the current entity's vruntime and that of the leftmost entity in the red-black tree, i.e. the entity with the smallest vruntime on the CFS run queue;
6. If the difference is negative, do not reschedule;
7. If the difference exceeds the current entity's ideal runtime, trigger a deferred reschedule.
/*
* Preempt the current task with a newly woken task if needed:
*/
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
unsigned long ideal_runtime, delta_exec;
struct sched_entity *se;
s64 delta;
/* ideal runtime for the current entity */
ideal_runtime = sched_slice(cfs_rq, curr);
/* actual runtime of the current entity in this slice */
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
/* ran past its ideal runtime: set TIF_NEED_RESCHED via resched_curr to trigger a deferred reschedule */
if (delta_exec > ideal_runtime) {
resched_curr(rq_of(cfs_rq));
/*
* The current task ran long enough, ensure it doesn't get
* re-elected due to buddy favours.
*/
clear_buddies(cfs_rq, curr);
return;
}
/*
* Ensure that a task that missed wakeup preemption by a
* narrow margin doesn't have to wait for a full slice.
* This also mitigates buddy induced latencies under load.
*/
/* ran less than the minimum granularity: no reschedule */
if (delta_exec < sysctl_sched_min_granularity)
return;
/* get the leftmost entity in the red-black tree, i.e. the one with the smallest vruntime */
se = __pick_first_entity(cfs_rq);
delta = curr->vruntime - se->vruntime;
/* current vruntime is still below the smallest queued vruntime: no reschedule */
if (delta < 0)
return;
/* current vruntime exceeds the smallest queued vruntime by more than the ideal runtime: reschedule */
if (delta > ideal_runtime)
resched_curr(rq_of(cfs_rq));
}
9. Allocating Virtual Runtime
When deferred rescheduling is considered, the entity's runtime is compared against its ideal runtime, i.e. the CPU time the entity may use.
The previous section showed that this value comes from sched_slice.
1. Without group scheduling, the ideal runtime is the entity's weighted share of __sched_period;
2. With group scheduling, the hierarchy of entities must be walked to compute it;
3. __sched_period is the length of one CFS scheduling period, loosely the analogue of a timeslice.
/*
* We calculate the wall-time slice from the period by taking a part
* proportional to the weight.
*
* s = p*P[w/rw]
*/
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
u64 slice = __sched_period(cfs_rq->nr_running + !se->on_rq);
for_each_sched_entity(se) {
struct load_weight *load;
struct load_weight lw;
cfs_rq = cfs_rq_of(se);
load = &cfs_rq->load;
if (unlikely(!se->on_rq)) {
lw = cfs_rq->load;
update_load_add(&lw, se->load.weight);
load = &lw;
}
slice = __calc_delta(slice, se->load.weight, load);
}
return slice;
}
__sched_period computes the scheduling period from the number of runnable entities:
1. If the run queue holds more than 8 entities (sched_nr_latency), the period is the entity count times the minimum granularity sysctl_sched_min_granularity;
2. Otherwise the period is the default target latency sysctl_sched_latency, 6 ms.
/* default target latency: 6 ms */
unsigned int sysctl_sched_latency = 6000000ULL;
/*
* This value is kept at sysctl_sched_latency/sysctl_sched_min_granularity
*/
static unsigned int sched_nr_latency = 8;
/*
* The idea is to set a period in which each task runs once.
*
* When there are too many tasks (sched_nr_latency) we have to stretch
* this period because otherwise the slices get too small.
*
* p = (nr <= nl) ? l : l*nr/nl
*/
static u64 __sched_period(unsigned long nr_running)
{
if (unlikely(nr_running > sched_nr_latency))
return nr_running * sysctl_sched_min_granularity;
else
return sysctl_sched_latency;
}
10. Virtual Runtime of New and Newly Woken Processes
A process that has been sleeping, or one that was just created, has not had its vruntime advanced by update_curr. If it kept its stale vruntime after wakeup, it would monopolize the CPU in revenge (it would win every entity_before and wakeup_preempt_entity comparison). Both cases therefore need their vruntime corrected.
- For a woken process, enqueue_entity adds cfs_rq->min_vruntime to its vruntime at enqueue time, and place_entity then grants it a compensation.
update_curr(cfs_rq);
if (renorm && !curr)
se->vruntime += cfs_rq->min_vruntime;
…………
if (flags & ENQUEUE_WAKEUP)
place_entity(cfs_rq, se, 0);
- For a newly forked process, task_fork_fair sets the child's vruntime to the parent's curr->vruntime, and place_entity applies a penalty. Note that task_fork_fair ends by subtracting cfs_rq->min_vruntime from the child's vruntime. Later in the _do_fork call chain, after copy_process->sched_fork->task_fork_fair, the path wake_up_new_task->activate_task->enqueue_task->enqueue_task_fair->enqueue_entity adds cfs_rq->min_vruntime back. In effect, a forked process goes through the wakeup path.
if (curr) {
update_curr(cfs_rq);
se->vruntime = curr->vruntime;
}
place_entity(cfs_rq, se, 1);
se->vruntime -= cfs_rq->min_vruntime;
place_entity computes the entity's vruntime as follows:
1. The initial argument is 0 for a woken process and 1 for a forked process;
2. A local vruntime variable serves as the reference value for se->vruntime, starting from cfs_rq->min_vruntime;
3. For a forked process, vruntime is increased by the sched_vslice virtual time as a penalty;
4. For a woken process, vruntime is decreased by sysctl_sched_latency/2 (3 ms by default) as compensation;
5. se->vruntime becomes the larger of its current value and the local vruntime.
static void
place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
u64 vruntime = cfs_rq->min_vruntime;/* start from the queue's minimum vruntime */
/*
* The 'current' period is already promised to the current tasks,
* however the extra weight of the new task will slow them down a
* little, place the new task so that it fits in the slot that
* stays open at the end.
*/
/* forked process: add one scheduling period's worth of virtual runtime as a penalty */
if (initial && sched_feat(START_DEBIT))
vruntime += sched_vslice(cfs_rq, se);
/* sleeps up to a single latency don't count. */
/* woken process: subtract sysctl_sched_latency/2 of virtual runtime as compensation */
if (!initial) {
unsigned long thresh = sysctl_sched_latency;
/*
* Halve their sleep time's effect, to allow
* for a gentler effect of sleepers:
*/
if (sched_feat(GENTLE_FAIR_SLEEPERS))
thresh >>= 1;
vruntime -= thresh;
}
/* ensure we never gain time by being placed backwards. */
/* take the larger value as the final vruntime */
se->vruntime = max_vruntime(se->vruntime, vruntime);
}
11. CFS Enqueue
CFS enqueue is implemented by enqueue_task_fair.
1. for_each_sched_entity walks the entity hierarchy; without group scheduling that is just se itself;
2. If the entity is not on a queue (se->on_rq == 0), enqueue_entity performs the actual enqueue.
/*
* The enqueue_task method is called before nr_running is
* increased. Here we update the fair scheduling stats and
* then put the task into the rbtree:
*/
static void
enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int idle_h_nr_running = task_has_idle_policy(p);
…………
for_each_sched_entity(se) {
if (se->on_rq)/* already queued: stop walking up the hierarchy */
break;
cfs_rq = cfs_rq_of(se);
/* perform the enqueue */
enqueue_entity(cfs_rq, se, flags);
/*
* end evaluation on encountering a throttled cfs_rq
*
* note: in the case of encountering a throttled cfs_rq we will
* post the final h_nr_running increment below.
*/
if (cfs_rq_throttled(cfs_rq))
break;
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
flags = ENQUEUE_WAKEUP;
}
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
cfs_rq->h_nr_running++;
cfs_rq->idle_h_nr_running += idle_h_nr_running;
if (cfs_rq_throttled(cfs_rq))
break;
update_load_avg(cfs_rq, se, UPDATE_TG);
update_cfs_group(se);
}
…………
}
enqueue_entity performs the actual enqueue.
1. renorm: true when the enqueue is not caused by a wakeup (no ENQUEUE_WAKEUP flag), or when the task has migrated (ENQUEUE_MIGRATED); in these cases the vruntime must be renormalized against this queue's min_vruntime.
2. curr: true when the entity being enqueued is the currently running entity.
3. If the enqueued entity is the currently running one, its vruntime is renormalized (vruntime += min_vruntime) before update_curr is called, since update_curr will advance min_vruntime.
4. update_curr updates the running entity's vruntime and min_vruntime;
5. If the enqueued entity is not the running one, it is renormalized after update_curr (vruntime += min_vruntime) so that it is placed at the current moment rather than at some point in the past;
6. If the process is enqueued because of a wakeup, place_entity compensates its vruntime so a stale value cannot let it monopolize the CPU;
7. If the enqueued entity is not the currently running one, __enqueue_entity inserts it into the CFS red-black tree;
8. se->on_rq is set to 1, marking the entity as on the run queue.
static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
bool renorm = !(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATED);
bool curr = cfs_rq->curr == se;
/*
* If we're the current task, we must renormalise before calling
* update_curr().
*/
if (renorm && curr)
se->vruntime += cfs_rq->min_vruntime;
update_curr(cfs_rq);
/*
* Otherwise, renormalise after, such that we're placed at the current
* moment in time, instead of some random moment in the past. Being
* placed in the past could significantly boost this task to the
* fairness detriment of existing tasks.
*/
if (renorm && !curr)
se->vruntime += cfs_rq->min_vruntime;
/*
* When enqueuing a sched_entity, we must:
* - Update loads to have both entity and cfs_rq synced with now.
* - Add its load to cfs_rq->runnable_avg
* - For group_entity, update its weight to reflect the new share of
* its group cfs_rq
* - Add its new weight to cfs_rq->load.weight
*/
update_load_avg(cfs_rq, se, UPDATE_TG | DO_ATTACH);
update_cfs_group(se);
enqueue_runnable_load_avg(cfs_rq, se);
account_entity_enqueue(cfs_rq, se);/* update cfs_rq->nr_running and the total load */
/* woken process: compensate its vruntime so a stale value
   cannot let it monopolize the CPU */
if (flags & ENQUEUE_WAKEUP)
place_entity(cfs_rq, se, 0);
check_schedstat_required();
update_stats_enqueue(cfs_rq, se, flags);
check_spread(cfs_rq, se);
/* when the enqueued entity is not the currently running one,
   insert it into the CFS red-black tree */
if (!curr)
__enqueue_entity(cfs_rq, se);
se->on_rq = 1;/* mark the entity as on the run queue */
if (cfs_rq->nr_running == 1) {
list_add_leaf_cfs_rq(cfs_rq);
check_enqueue_throttle(cfs_rq);
}
}
__enqueue_entity inserts the scheduling entity into the CFS run queue's red-black tree.
1. Walk the tree, using vruntime as the key, to find the insertion point;
2. entity_before compares the vruntime of the entity being inserted with that of the current node;
3. If the entity's vruntime is smaller, descend left;
4. If it is larger, descend right; in that case the entity cannot become the leftmost node;
5. Note that entities on the left side of the tree are always picked to run first (pick_next_task_fair);
6. Insert the entity into the tree, completing the enqueue.
/*
* Enqueue an entity into the rb-tree:
*/
static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
struct rb_node *parent = NULL;
struct sched_entity *entry;
bool leftmost = true;
/*
* Find the right place in the rbtree:
*/
/* walk the tree, keyed by vruntime, to find the insertion point */
while (*link) {
parent = *link;
entry = rb_entry(parent, struct sched_entity, run_node);
/*
* We dont care about collisions. Nodes with
* the same key stay together.
*/
/* entity's vruntime is smaller than this node's: descend left */
if (entity_before(se, entry)) {
link = &parent->rb_left;
} else {/* entity's vruntime is larger: descend right; it cannot be the leftmost node */
link = &parent->rb_right;
leftmost = false;
}
}
/* perform the red-black tree insertion */
rb_link_node(&se->run_node, parent, link);
rb_insert_color_cached(&se->run_node,
&cfs_rq->tasks_timeline, leftmost);
}
12. CFS Dequeue
CFS dequeue is implemented by dequeue_task_fair.
1. for_each_sched_entity walks the entity hierarchy; without group scheduling that is just se itself;
2. dequeue_entity performs the actual dequeue.
/*
* The dequeue_task method is called before nr_running is
* decreased. We remove the task from the rbtree and
* update the fair scheduling stats:
*/
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
int task_sleep = flags & DEQUEUE_SLEEP;
int idle_h_nr_running = task_has_idle_policy(p);
bool was_sched_idle = sched_idle_rq(rq);
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
dequeue_entity(cfs_rq, se, flags);/* perform the dequeue */
/*
* end evaluation on encountering a throttled cfs_rq
*
* note: in the case of encountering a throttled cfs_rq we will
* post the final h_nr_running decrement below.
*/
if (cfs_rq_throttled(cfs_rq))
break;
cfs_rq->h_nr_running--;
cfs_rq->idle_h_nr_running -= idle_h_nr_running;
/* Don't dequeue parent if it has other entities besides us */
if (cfs_rq->load.weight) {
/* Avoid re-evaluating load for this entity: */
se = parent_entity(se);
/*
* Bias pick_next to pick a task from this cfs_rq, as
* p is sleeping when it is within its sched_slice.
*/
if (task_sleep && se && !throttled_hierarchy(cfs_rq))
set_next_buddy(se);
break;
}
flags |= DEQUEUE_SLEEP;
}
…………
}
dequeue_entity performs the actual dequeue.
1. update_curr updates the running entity's vruntime and the queue's min_vruntime;
2. If the dequeued entity is not the currently running one, __dequeue_entity removes it from the CFS red-black tree;
3. se->on_rq is set to 0, marking the entity as no longer on the run queue.
static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
/*
* Update run-time statistics of the 'current'.
*/
update_curr(cfs_rq);
/*
* When dequeuing a sched_entity, we must:
* - Update loads to have both entity and cfs_rq synced with now.
* - Subtract its load from the cfs_rq->runnable_avg.
* - Subtract its previous weight from cfs_rq->load.weight.
* - For group entity, update its weight to reflect the new share
* of its group cfs_rq.
*/
update_load_avg(cfs_rq, se, UPDATE_TG);
dequeue_runnable_load_avg(cfs_rq, se);
update_stats_dequeue(cfs_rq, se, flags);
clear_buddies(cfs_rq, se);
/* when the dequeued entity is not the currently running one, remove it from the CFS red-black tree */
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
se->on_rq = 0;/* mark the entity as no longer on the run queue */
account_entity_dequeue(cfs_rq, se);/* update cfs_rq->nr_running and the total load */
/*
* Normalize after update_curr(); which will also have moved
* min_vruntime if @se is the one holding it back. But before doing
* update_min_vruntime() again, which will discount @se's position and
* can move min_vruntime forward still more.
*/
if (!(flags & DEQUEUE_SLEEP))
se->vruntime -= cfs_rq->min_vruntime;
/* return excess runtime on last dequeue */
return_cfs_rq_runtime(cfs_rq);
update_cfs_group(se);
/*
* Now advance min_vruntime if @se was the entity holding it back,
* except when: DEQUEUE_SAVE && !DEQUEUE_MOVE, in this case we'll be
* put back on, and if we advance min_vruntime, we'll be placed back
* further than we started -- ie. we'll be penalized.
*/
if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
update_min_vruntime(cfs_rq);
}
__dequeue_entity removes the scheduling entity from the CFS run queue's red-black tree.
static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
}
XIII. Picking the Next Process in CFS
When the schedule function performs a context switch it must select a process to run (see 《深入Linux内核(进程篇)—进程调度》); for the CFS scheduling class this is implemented by pick_next_task_fair.
1. If rq->cfs.nr_running is not greater than 0, i.e. the CFS queue currently has no runnable process, jump to the idle label;
2. If CONFIG_FAIR_GROUP_SCHED is defined, enter the group-scheduling selection path; group scheduling is not considered here;
3. Call put_prev_task to re-enqueue the previous process into the CFS run queue's red-black tree according to its vruntime;
4. Call pick_next_entity to select the next entity to run; this is the core of pick_next_task_fair;
5. Call set_next_entity to dequeue the chosen entity from the red-black tree and mark it as the currently running entity;
6. task_of maps the scheduling entity back to its process descriptor;
7. Return the process descriptor p to the caller.
struct task_struct *
pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
struct cfs_rq *cfs_rq = &rq->cfs;
struct sched_entity *se;
struct task_struct *p;
int new_tasks;
again:
/* if cfs.nr_running is not greater than 0, i.e. the CFS queue has no runnable process, jump to the idle label */
if (!sched_fair_runnable(rq))
goto idle;
#ifdef CONFIG_FAIR_GROUP_SCHED /* group scheduling */
…………
#endif
/* put the previous process back on the queue */
if (prev)
put_prev_task(rq, prev);
do {
/* pick the next entity to run */
se = pick_next_entity(cfs_rq, NULL);
/* dequeue the chosen entity and mark it as the currently running entity */
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);/* returns NULL when not using group scheduling */
} while (cfs_rq);/* without group scheduling this condition fails, so the loop runs exactly once */
p = task_of(se);/* map the scheduling entity back to its process descriptor */
done: __maybe_unused;
#ifdef CONFIG_SMP
/*
* Move the next running task to the front of
* the list, so our cfs_tasks list becomes MRU
* one.
*/
list_move(&p->se.group_node, &rq->cfs_tasks);
#endif
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);
update_misfit_status(p, rq);
return p;
idle:
if (!rf)
return NULL;
new_tasks = newidle_balance(rq, rf);
/*
* Because newidle_balance() releases (and re-acquires) rq->lock, it is
* possible for any higher priority task to appear. In that case we
* must re-start the pick_next_entity() loop.
*/
if (new_tasks < 0)
return RETRY_TASK;
if (new_tasks > 0)
goto again;
/*
* rq is about to be idle, check if we need to update the
* lost_idle_time of clock_pelt
*/
update_idle_rq_clock_pelt(rq);
return NULL;
}
The following sections detail the three steps of process selection: put_prev_task, pick_next_entity and set_next_entity.
1. put_prev_task
Selection begins by calling put_prev_task to re-enqueue the prev process. prev was the curr process, i.e. the one occupying the CPU, and was therefore not in the run queue's red-black tree; now that prev is being stopped, it must be put back on the run queue.
put_prev_task is a generic function that invokes the put_prev_task method of the process's scheduling class. For the CFS class the corresponding function is put_prev_task_fair.
static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
{
WARN_ON_ONCE(rq->curr != prev);
prev->sched_class->put_prev_task(rq, prev);
}
put_prev_task_fair walks up the scheduling-entity hierarchy and calls put_prev_entity to put prev->se back on the run queue.
/*
* Account for a descheduled task:
*/
static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
{
struct sched_entity *se = &prev->se;
struct cfs_rq *cfs_rq;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
put_prev_entity(cfs_rq, se);
}
}
put_prev_entity is implemented as follows:
1. Call update_curr to update vruntime and min_vruntime;
2. Call __enqueue_entity to insert the prev entity into the CFS run queue's red-black tree;
3. Set cfs_rq->curr to NULL, since the currently running process prev is about to stop; it will be set again in set_next_entity, to the newly selected process next.
static void put_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev)
{
/*
* If still on the runqueue then deactivate_task()
* was not called and update_curr() has to be done:
*/
if (prev->on_rq)
update_curr(cfs_rq);
/* throttle cfs_rqs exceeding runtime */
check_cfs_rq_runtime(cfs_rq);
check_spread(cfs_rq, prev);
if (prev->on_rq) {
update_stats_wait_start(cfs_rq, prev);
/* Put 'current' back into the tree. */
__enqueue_entity(cfs_rq, prev);
/* in !on_rq case, update occurred at dequeue */
update_load_avg(cfs_rq, prev, 0);
}
cfs_rq->curr = NULL;
}
2. pick_next_entity
The basic principles by which CFS selects the next scheduling entity:
- keep things fair between processes and task groups;
- prefer the cfs_rq->next entity, since cfs_rq->next records an entity that someone really wants to run;
- prefer the cfs_rq->last entity to exploit cache locality; cfs_rq->last records the entity that last occupied the CPU (not a scheduling group);
- avoid the cfs_rq->skip entity. cfs_rq->skip records the entity to be skipped during selection, in support of the sched_yield system call: after a process calls sched_yield it voluntarily gives up the CPU and the currently running entity is recorded in cfs_rq->skip, so skipping it at pick_next_task time is what implements yielding.
pick_next_entity follows exactly these principles:
1. Fetch the leftmost node of the CFS red-black tree, i.e. the entity with the smallest vruntime in the queue;
2. If the currently running entity's vruntime is smaller than the leftmost entity's (compared with entity_before(a, b), which returns true when a < b), set left to the current entity;
3. If the entity se chosen by steps 1 and 2 is cfs_rq->skip, a replacement second must be selected;
4. Given how steps 1 and 2 choose, se may be curr; in that case select the leftmost node of the red-black tree as second;
5. If se is not curr, select the second-leftmost node of the tree as second;
6. If the second-leftmost node is empty, or curr's vruntime is smaller than that node's, select curr as second;
7. The subsequent decisions are governed by wakeup_preempt_entity, which judges whether its argument se may preempt curr. If the difference vdiff (curr->vruntime - se->vruntime) is less than or equal to 0, it returns -1: curr keeps priority over se. If vdiff exceeds the wakeup granularity, it returns 1: se takes priority over curr. If vdiff is positive but within the wakeup granularity, it returns 0: curr keeps priority over se.
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
/* difference between the current entity's and the preempting entity's vruntime */
s64 gran, vdiff = curr->vruntime - se->vruntime;
if (vdiff <= 0) /* difference <= 0: return -1 */
return -1;
/* wakeup granularity, 1 ms of wall time by default, converted to virtual time using entity se's weight */
gran = wakeup_gran(se);
if (vdiff > gran)/* difference exceeds the granularity: return 1 */
return 1;
return 0; /* difference within the granularity: return 0 */
}
8. Call wakeup_preempt_entity to compare second with left; if second is preferable, set se to second;
9. Call wakeup_preempt_entity to compare cfs_rq->last with left; if cfs_rq->last is preferable, set se to cfs_rq->last;
10. Call wakeup_preempt_entity to compare cfs_rq->next with left; if cfs_rq->next is preferable, set se to cfs_rq->next.
/*
* Pick the next process, keeping these things in mind, in this order:
* 1) keep things fair between processes/task groups
* 2) pick the "next" process, since someone really wants that to run
* 3) pick the "last" process, for cache locality
* 4) do not run the "skip" process, if something else is available
*/
static struct sched_entity *
pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
/* fetch the leftmost node of the CFS red-black tree */
struct sched_entity *left = __pick_first_entity(cfs_rq);
struct sched_entity *se;
/*
* If curr is set we have to see if its left of the leftmost entity
* still in the tree, provided there was anything in the tree at all.
*/
/* if the current entity is before the leftmost entity, choose the current entity */
if (!left || (curr && entity_before(curr, left)))
left = curr;
/* se is the entity that will finally be returned; initialize it to left */
se = left; /* ideally we run the leftmost entity */
/*
* Avoid running the skip buddy, if running something else can
* be done without getting too unfair.
*/
/* the chosen entity se is the skip entity: pick again */
if (cfs_rq->skip == se) {
struct sched_entity *second;
if (se == curr) {/* se == curr == skip: take the leftmost tree node */
second = __pick_first_entity(cfs_rq);
} else {/* se == leftmost == skip: take the second-leftmost tree node */
second = __pick_next_entity(se);
if (!second || (curr && entity_before(curr, second)))
second = curr;/* curr is before the second-leftmost node: take curr */
}
if (second && wakeup_preempt_entity(second, left) < 1)
se = second;/* second is preferable to left: choose second */
}
/*
* Prefer last buddy, try to return the CPU to a preempted task.
*/
if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
se = cfs_rq->last;/* cfs_rq->last is preferable to left: choose cfs_rq->last */
/*
* Someone really wants this to run. If it's not unfair, run it.
*/
if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
se = cfs_rq->next;/* cfs_rq->next is preferable to left: choose cfs_rq->next */
/* clear cfs_rq->last, cfs_rq->next and cfs_rq->skip */
clear_buddies(cfs_rq, se);
return se;
}
3. set_next_entity
Having dealt with the previous process and selected the next one to run, some work remains for the next process, done by set_next_entity:
1. The process about to run must be removed from the CFS red-black tree, via __dequeue_entity;
2. Point the CFS run queue's curr pointer at the entity about to run (put_prev_entity had set cfs_rq->curr to NULL);
3. Record se->sum_exec_runtime into se->prev_sum_exec_runtime. As described in the section on how virtual runtime is used, check_preempt_tick obtains a process's actual runtime as curr->sum_exec_runtime - curr->prev_sum_exec_runtime: prev_sum_exec_runtime is the accumulated CPU time when the entity last went on CPU, sum_exec_runtime is the total accumulated CPU time, and their difference is the CPU time consumed in the current stint.
static void
set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
/* 'current' is not kept within the tree. */
if (se->on_rq) {
/*
* Any task has to be enqueued before it get to execute on
* a CPU. So account for the time it spent waiting on the
* runqueue.
*/
update_stats_wait_end(cfs_rq, se);
__dequeue_entity(cfs_rq, se);/* remove the entity's node from the red-black tree */
update_load_avg(cfs_rq, se, UPDATE_TG);
}
update_stats_curr_start(cfs_rq, se);
cfs_rq->curr = se;/* point curr at the entity about to run */
/*
* Track our maximum slice length, if the CPU's load is at
* least twice that of our own weight (i.e. dont track it
* when there are only lesser-weight tasks around):
*/
if (schedstat_enabled() &&
rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {
schedstat_set(se->statistics.slice_max,
max((u64)schedstat_val(se->statistics.slice_max),
se->sum_exec_runtime - se->prev_sum_exec_runtime));
}
/* snapshot the accumulated runtime so the CPU time consumed after this scheduling can be computed */
se->prev_sum_exec_runtime = se->sum_exec_runtime;
}
XIV. Process Wakeup
A sleeping process can be woken by another process via wake_up_process, and a forked child is woken by its parent via wake_up_new_task (see the process-wakeup section of 《深入Linux内核(进程篇)—进程调度》).
On wakeup, check_preempt_curr is called to check whether the current process can be preempted:
1. If the waking process and the current process belong to the same scheduling class, invoke that class's check_preempt_curr method;
2. If they belong to different scheduling classes, decide by scheduling-class priority;
3. Iterate over the classes from highest to lowest priority;
4. If the current process's class is matched first, its class has the higher priority and it cannot be preempted, so break out of the loop;
5. If the waking process's class is matched first, its class has the higher priority and it may preempt: call resched_curr (which sets TIF_NEED_RESCHED in the current process's thread_info) to trigger a deferred reschedule and take over the CPU.
void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
const struct sched_class *class;
/* the waking process is in the same scheduling class as the current process: call that class's check_preempt_curr method */
if (p->sched_class == rq->curr->sched_class) {
rq->curr->sched_class->check_preempt_curr(rq, p, flags);
} else {/* different scheduling classes: decide by class priority */
for_each_class(class) {/* iterate over the classes from highest to lowest priority */
/* the current process's class matched first: its class has higher priority, so no preemption */
if (class == rq->curr->sched_class)
break;
/* the waking process's class matched first: its class has higher priority, so it may preempt */
if (class == p->sched_class) {
resched_curr(rq);/* trigger a deferred reschedule to take over the CPU */
break;
}
}
}
/*
* A queue event has occurred, and we're going to schedule. In
* this case, we can save a useless back to back clock update.
*/
if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
rq_clock_skip_update(rq);
}
For the CFS scheduling class, check_preempt_curr is implemented by check_preempt_wakeup:
1. If the current entity and the waking entity are the same, return;
2. If this is not a fork wakeup and the CFS run queue holds at least 8 tasks, mark the waking entity as cfs_rq->next, which pick_next_entity will prefer;
3. SCHED_BATCH and SCHED_IDLE processes do not preempt SCHED_NORMAL processes; their preemption is driven by scheduler_tick;
4. Call wakeup_preempt_entity(se, pse) to judge whether the waking entity pse may preempt the current entity se (the function is detailed in the pick_next_entity section); if preemption is allowed, take the preempt path, otherwise return;
5. The preempt path calls resched_curr to set TIF_NEED_RESCHED in the current process's thread_info and trigger a deferred reschedule;
6. If the entity is a task (not a group) and the CFS run queue holds at least 8 tasks, mark the preempted curr as cfs_rq->last; to exploit cache locality and reduce cache refills, pick_next_entity may then pick the preempted process again.
/*
* Preempt the current task with a newly woken task if needed:
*/
static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
{
struct task_struct *curr = rq->curr;
struct sched_entity *se = &curr->se, *pse = &p->se;
struct cfs_rq *cfs_rq = task_cfs_rq(curr);
int scale = cfs_rq->nr_running >= sched_nr_latency;/* sched_nr_latency is 8 */
int next_buddy_marked = 0;
if (unlikely(se == pse))
return;/* current entity and waking entity are the same: return */
/*
* This is possible from callers such as attach_tasks(), in which we
* unconditionally check_prempt_curr() after an enqueue (which may have
* lead to a throttle). This both saves work and prevents false
* next-buddy nomination below.
*/
if (unlikely(throttled_hierarchy(cfs_rq_of(pse))))
return;
/* (wake_flags & WF_FORK) tests for a fork wakeup; scale tests whether the CFS run queue holds at least 8 tasks */
if (sched_feat(NEXT_BUDDY) && scale && !(wake_flags & WF_FORK)) {
set_next_buddy(pse);/* mark the waking entity as cfs_rq->next, which pick_next_entity will prefer */
next_buddy_marked = 1;/* note that cfs_rq->next has been set */
}
/*
* We can come here with TIF_NEED_RESCHED already set from new task
* wake up path.
*
* Note: this also catches the edge-case of curr being in a throttled
* group (e.g. via set_curr_task), since update_curr() (in the
* enqueue of curr) will have resulted in resched being set. This
* prevents us from potentially nominating it as a false LAST_BUDDY
* below.
*/
if (test_tsk_need_resched(curr))
return;
/* Idle tasks are by definition preempted by non-idle tasks. */
if (unlikely(task_has_idle_policy(curr)) &&
likely(!task_has_idle_policy(p)))
goto preempt;
/*
* Batch and idle tasks do not preempt non-idle tasks (their preemption
* is driven by the tick):
*/
/* SCHED_BATCH and SCHED_IDLE processes do not preempt SCHED_NORMAL processes */
if (unlikely(p->policy != SCHED_NORMAL) || !sched_feat(WAKEUP_PREEMPTION))
return;
find_matching_se(&se, &pse);
update_curr(cfs_rq_of(se));
BUG_ON(!pse);
/* judge whether pse may preempt se; wakeup_preempt_entity returning 1 means it may */
if (wakeup_preempt_entity(se, pse) == 1) {
/*
* Bias pick_next to pick the sched entity that is
* triggering this preemption.
*/
if (!next_buddy_marked)
set_next_buddy(pse);/* cfs_rq->next not yet set: set it to pse */
goto preempt;
}
return;
preempt:
resched_curr(rq);/* trigger a deferred reschedule to take over the CPU */
/*
* Only set the backward buddy when the current task is still
* on the rq. This can happen when a wakeup gets interleaved
* with schedule on the ->pre_schedule() or idle_balance()
* point, either of which can * drop the rq lock.
*
* Also, during early boot the idle thread is in the fair class,
* for obvious reasons its a bad idea to schedule back to it.
*/
if (unlikely(!se->on_rq || curr == rq->idle))
return;/* se is off the queue, or the current process is the idle process: return */
if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))
set_last_buddy(se);/* mark the preempted entity as cfs_rq->last; to exploit cache locality and reduce cache refills, pick_next_entity may pick the preempted process again */
}
XV. Summary
- The CFS scheduling class provides the methods CFS scheduling needs, including enqueue/dequeue, picking the next process to run, scheduler_tick, and so on.
- The CFS run queue organizes processes in a red-black tree keyed by se->vruntime; entities with smaller vruntime sit to the left.
- CFS schedules scheduling entities, i.e. normal processes or normal group entities; normal-process priority ranges over [100, 139] with a default of 120, and can be changed via the nice system call.
- The nice value is converted to a scheduling weight inside the kernel.
- The weight is used to compute the entity's vruntime.
- The vruntime of newly created and newly woken processes must be adjusted before use.
- Picking the next process to run is a three-step sequence: put_prev_task, pick_next_entity and set_next_entity.
- scheduler_tick drives periodic scheduling and calls check_preempt_tick to trigger deferred rescheduling.
- wake_up_process and wake_up_new_task call check_preempt_wakeup to trigger deferred rescheduling.
The kernel version used in this article is Linux 5.6.4.