1. Background
1.1 Setting thread priority in Android
There are two common ways to set a thread's priority in Android: via Process.setThreadPriority or via Process.setThreadScheduler. Through system calls, these two methods ultimately invoke the Linux setpriority and sched_setscheduler functions, respectively, to set the thread's priority.
/frameworks/base/core/java/android/os/Process.java
/**
 * Set the priority of a thread, based on Linux priorities.
 *
 * @param tid The identifier of the thread/process to change.
 * @param priority A Linux priority level, from -20 for highest scheduling
 * priority to 19 for lowest scheduling priority.
 *
 * @throws IllegalArgumentException Throws IllegalArgumentException if
 * <var>tid</var> does not exist.
 * @throws SecurityException Throws SecurityException if your process does
 * not have permission to modify the given thread, or to use the given
 * priority.
 */
public static final native void setThreadPriority(int tid, int priority)
        throws IllegalArgumentException, SecurityException;
/**
 * Set the scheduling policy and priority of a thread, based on Linux.
 *
 * @param tid The identifier of the thread/process to change.
 * @param policy A Linux scheduling policy such as SCHED_OTHER etc.
 * @param priority A Linux priority level in a range appropriate for the given policy.
 *
 * @throws IllegalArgumentException Throws IllegalArgumentException if
 * <var>tid</var> does not exist, or if <var>priority</var> is out of range for the policy.
 * @throws SecurityException Throws SecurityException if your process does
 * not have permission to modify the given thread, or to use the given
 * scheduling policy or priority.
 *
 * {@hide}
 */
public static final native void setThreadScheduler(int tid, int policy, int priority)
        throws IllegalArgumentException;
1.2 Schedulers and scheduling policies in Linux
The commonly used Linux schedulers (scheduling classes) are deadline, realtime, and CFS (Completely Fair Scheduler). The deadline class implements the SCHED_DEADLINE policy; the realtime class implements SCHED_FIFO and SCHED_RR; CFS implements SCHED_NORMAL, SCHED_BATCH, and SCHED_IDLE. The policies most commonly used by Android threads are SCHED_NORMAL, SCHED_RR, and SCHED_FIFO.
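For reference, the numeric values of these policies as defined in include/uapi/linux/sched.h:

```c
/* Scheduling policy constants from include/uapi/linux/sched.h */
#define SCHED_NORMAL    0   /* CFS: default time-sharing (exposed as SCHED_OTHER in userspace) */
#define SCHED_FIFO      1   /* realtime: runs until it blocks or a higher-priority task preempts */
#define SCHED_RR        2   /* realtime: like FIFO, but round-robins among equal priorities */
#define SCHED_BATCH     3   /* CFS: CPU-bound batch work, treated as less interactive */
#define SCHED_IDLE      5   /* CFS: runs only when the CPU is otherwise idle */
#define SCHED_DEADLINE  6   /* deadline: EDF scheduling with runtime/deadline/period */
```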
1.3 task_struct
task_struct is the structure that represents a thread/process in Linux. It holds a large amount of information, including priority; the fields directly related to priority are prio, static_prio, normal_prio, rt_priority, policy, and sched_class.
/include/linux/sched.h
struct task_struct {
...
int prio;
int static_prio;
int normal_prio;
unsigned int rt_priority;
const struct sched_class *sched_class;
...
unsigned int policy;
int nr_cpus_allowed;
const cpumask_t *cpus_ptr;
cpumask_t cpus_mask;
...
};
The meanings of some important fields in task_struct:
normal_prio: the normalized priority computed by the normal_prio function, ranging 0~139 (covering both realtime and normal thread priorities); a larger value means a lower priority. This is the value shown in systrace.
prio: the value the scheduler actually uses; in most cases equal to p->normal_prio.
static_prio: the priority converted from the nice value (range -20~19); mainly used to hold a non-realtime thread's nice value, and mainly meaningful for non-realtime threads.
rt_priority: the priority of a realtime thread, mainly meaningful for realtime threads; a larger value means a higher priority, range 0~99 (a value of 0 usually indicates a non-realtime thread).
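The relationships among these fields can be sketched as plain functions (the helper names are mine, mirroring the kernel's NICE_TO_PRIO macro and the normal_prio() function shown later):

```c
#define MAX_RT_PRIO  100
#define DEFAULT_PRIO 120                        /* MAX_RT_PRIO + 20 */
#define NICE_TO_PRIO(nice) ((nice) + DEFAULT_PRIO)

/* Normal (CFS) thread: normal_prio == static_prio == 120 + nice, i.e. 100..139 */
int normal_prio_cfs(int nice)        { return NICE_TO_PRIO(nice); }

/* Realtime thread: a larger rt_priority means a numerically smaller normal_prio */
int normal_prio_rt(int rt_priority)  { return MAX_RT_PRIO - 1 - rt_priority; }
```

So nice 0 maps to 120 and nice -20 to 100, while rt_priority 99 maps to 0 and rt_priority 1 to 98, giving the 0~139 scale seen in systrace.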
2. setThreadPriority internals
setThreadPriority is mainly used to set a normal thread's nice value. The main flow in the kernel is:
(1) Validate the nice value passed from user space, clamping it to the range -20~19;
(2) Look up the thread's task_struct by the given tid;
(3) Check the task's scheduling policy; if it is SCHED_FIFO, SCHED_RR, or SCHED_DEADLINE, the remaining steps are skipped;
(4) Compute static_prio from the nice value;
(5) Compute normal_prio and prio from static_prio.
setpriority->set_one_prio->set_user_nice
/kernel/sys.c
SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
{
struct task_struct *g, *p;
struct user_struct *user;
const struct cred *cred = current_cred();
int error = -EINVAL;
struct pid *pgrp;
kuid_t uid;
if (which > PRIO_USER || which < PRIO_PROCESS)
goto out;
/* normalize: avoid signed division (rounding problems) */
error = -ESRCH;
>>Clamp the nice value into the allowed range: -20 <= niceval <= 19
if (niceval < MIN_NICE)
niceval = MIN_NICE;
if (niceval > MAX_NICE)
niceval = MAX_NICE;
rcu_read_lock();
read_lock(&tasklist_lock);
switch (which) {
case PRIO_PROCESS:
if (who)
>>Look up the task_struct for the given tid
p = find_task_by_vpid(who);
else
p = current;
if (p)
>>Set the nice value via set_one_prio
error = set_one_prio(p, niceval, error);
break;
case PRIO_PGRP:
...
case PRIO_USER:
...
}
/*
* set the priority of a task
* - the caller must hold the RCU read lock
*/
static int set_one_prio(struct task_struct *p, int niceval, int error)
{
int no_nice;
if (!set_one_prio_perm(p)) {
error = -EPERM;
goto out;
}
if (niceval < task_nice(p) && !can_nice(p, niceval)) {
error = -EACCES;
goto out;
}
no_nice = security_task_setnice(p, niceval);
if (no_nice) {
error = no_nice;
goto out;
}
if (error == -ESRCH)
error = 0;
>>Finally set the nice value via set_user_nice
set_user_nice(p, niceval);
out:
return error;
}
/kernel/sched/core.c
void set_user_nice(struct task_struct *p, long nice)
{
bool queued, running;
int old_prio, delta;
struct rq_flags rf;
struct rq *rq;
if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
return;
/*
* We have to be careful, if called from sys_setpriority(),
* the task might be in the middle of scheduling on another CPU.
*/
rq = task_rq_lock(p, &rf);
update_rq_clock(rq);
/*
* The RT priorities are set via sched_setscheduler(), but we still
* allow the 'normal' nice value to be set - but as expected
* it wont have any effect on scheduling until the task is
* SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
*/
>>For rt/deadline tasks the nice value is still accepted and stored in p->static_prio (a nice change has no effect on an rt thread), and the function then returns immediately
if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
p->static_prio = NICE_TO_PRIO(nice);
goto out_unlock;
}
queued = task_on_rq_queued(p);
running = task_current(rq, p);
if (queued)
dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
if (running)
put_prev_task(rq, p);
>>Reaching this point means a CFS task: first update static_prio = 120 + nice via NICE_TO_PRIO
>>#define NICE_TO_PRIO(nice) ((nice) + DEFAULT_PRIO)
p->static_prio = NICE_TO_PRIO(nice);
set_load_weight(p, true);
old_prio = p->prio;
>>Update p->prio; effective_prio also updates p->normal_prio
p->prio = effective_prio(p);
delta = p->prio - old_prio;
if (queued) {
enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
/*
* If the task increased its priority or is running and
* lowered its priority, then reschedule its CPU:
*/
if (delta < 0 || (delta > 0 && task_running(rq, p)))
>>If the priority was raised, or it was lowered while the task is running, set the reschedule flag
resched_curr(rq);
}
if (running)
>>If the task is currently running, reinstate it as the running task on the runqueue
set_next_task(rq, p);
out_unlock:
task_rq_unlock(rq, p, &rf);
}
/*
* Calculate the current priority, i.e. the priority
* taken into account by the scheduler. This value might
* be boosted by RT tasks, or might be boosted by
* interactivity modifiers. Will be RT if the task got
* RT-boosted. If not then it returns p->normal_prio.
*/
static int effective_prio(struct task_struct *p)
{
>>Compute p->normal_prio via the normal_prio function
p->normal_prio = normal_prio(p);
/*
* If we are RT tasks or we were boosted to RT priority,
* keep the priority unchanged. Otherwise, update priority
* to the normal priority:
*/
>>For a non-realtime thread, update p->prio to p->normal_prio
if (!rt_prio(p->prio))
return p->normal_prio;
return p->prio;
}
/*
* Calculate the expected normal priority: i.e. priority
* without taking RT-inheritance into account. Might be
* boosted by interactivity modifiers. Changes upon fork,
* setprio syscalls, and whenever the interactivity
* estimator recalculates.
*/
static inline int normal_prio(struct task_struct *p)
{
int prio;
if (task_has_dl_policy(p))
prio = MAX_DL_PRIO-1;
else if (task_has_rt_policy(p))
>>For an rt thread, normal_prio = 99 - p->rt_priority
>>#define MAX_USER_RT_PRIO 100
>>#define MAX_RT_PRIO MAX_USER_RT_PRIO
prio = MAX_RT_PRIO-1 - p->rt_priority;
else
>>For a normal thread, normal_prio = static_prio
prio = __normal_prio(p);
return prio;
}
/*
* __normal_prio - return the priority that is based on the static prio
*/
static inline int __normal_prio(struct task_struct *p)
{
return p->static_prio;
}
3. setThreadScheduler internals
sched_setscheduler is mainly used to set a realtime thread's priority, or to switch a thread's policy from SCHED_RR or SCHED_FIFO back to SCHED_NORMAL and restore its original priority.
The main flow in the kernel is:
(1) Look up the thread's task_struct by the given tid;
(2) Wrap the user-space parameters into a sched_attr structure;
(3) Validate the sched_priority from user space: it must not exceed the maximum of 99; for a realtime thread it must not be 0, and for a normal thread it must not be greater than 0;
(4) Assign the user-supplied sched_priority to rt_priority in task_struct and update normal_prio;
(5) When restoring the SCHED_NORMAL policy, also update static_prio;
(6) Assign the normalized normal_prio to prio.
SYSCALL_DEFINE3(sched_setscheduler,...)->do_sched_setscheduler->sched_setscheduler->
_sched_setscheduler(sched_param->sched_attr)->__sched_setscheduler->__setscheduler->
1.__setscheduler_params; 2.normal_prio; 3.p->sched_class
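The parameter check in step (3) can be sketched as follows (check_sched_params is a condensed stand-in I wrote for the checks inside __sched_setscheduler shown below):

```c
#define MAX_USER_RT_PRIO 100
#define EINVAL 22

/* SCHED_FIFO (1) and SCHED_RR (2) are the realtime policies */
int rt_policy(int policy) { return policy == 1 || policy == 2; }

/* 0 if the (policy, sched_priority) pair passes the checks, -EINVAL otherwise */
int check_sched_params(int policy, int sched_priority)
{
    if (sched_priority < 0 || sched_priority > MAX_USER_RT_PRIO - 1)
        return -EINVAL;
    /* rt policies require 1..99; non-rt policies require exactly 0 */
    if (rt_policy(policy) != (sched_priority != 0))
        return -EINVAL;
    return 0;
}
```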
/kernel/sched/core.c
/**
* sys_sched_setscheduler - set/change the scheduler policy and RT priority
* @pid: the pid in question.
* @policy: new policy.
* @param: structure containing the new RT priority.
*
* Return: 0 on success. An error code otherwise.
*/
SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param)
{
if (policy < 0)
return -EINVAL;
>>User space reaches this point through the system call
return do_sched_setscheduler(pid, policy, param);
}
static int
do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
{
struct sched_param lparam;
struct task_struct *p;
int retval;
if (!param || pid < 0)
return -EINVAL;
if (copy_from_user(&lparam, param, sizeof(struct sched_param)))
return -EFAULT;
rcu_read_lock();
retval = -ESRCH;
>>Look up the task_struct for the given tid
p = find_process_by_pid(pid);
if (likely(p))
get_task_struct(p);
rcu_read_unlock();
if (likely(p)) {
>>Call sched_setscheduler to update the relevant task_struct fields
retval = sched_setscheduler(p, policy, &lparam);
put_task_struct(p);
}
return retval;
}
/**
* sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
* @p: the task in question.
* @policy: new policy.
* @param: structure containing the new RT priority.
*
* Return: 0 on success. An error code otherwise.
*
* NOTE that the task may be already dead.
*/
int sched_setscheduler(struct task_struct *p, int policy,
const struct sched_param *param)
{
return _sched_setscheduler(p, policy, param, true);
}
static int _sched_setscheduler(struct task_struct *p, int policy,
const struct sched_param *param, bool check)
{
>>Pack the user-space parameters into a sched_attr: sched_policy (just passed in), sched_priority (just passed in), sched_nice (the task's existing nice value)
struct sched_attr attr = {
.sched_policy = policy,
.sched_priority = param->sched_priority,
.sched_nice = PRIO_TO_NICE(p->static_prio),
};
/* Fixup the legacy SCHED_RESET_ON_FORK hack. */
if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
policy &= ~SCHED_RESET_ON_FORK;
attr.sched_policy = policy;
}
>>Continue into __sched_setscheduler
return __sched_setscheduler(p, &attr, check, true);
}
static int __sched_setscheduler(struct task_struct *p,
const struct sched_attr *attr,
bool user, bool pi)
{
int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 :
MAX_RT_PRIO - 1 - attr->sched_priority;
...
/*
* Valid priorities for SCHED_FIFO and SCHED_RR are
* 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
* SCHED_BATCH and SCHED_IDLE is 0.
*/
>>The priority set for an rt thread must not exceed 99
>>#define MAX_USER_RT_PRIO 100
>>#define MAX_RT_PRIO MAX_USER_RT_PRIO
if ((p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
(!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
return -EINVAL;
>>If the policy is realtime but the sched_priority from user space is 0, return and abort the flow;
>>conversely, for a non-realtime policy, a non-zero sched_priority also aborts the flow
if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
(rt_policy(policy) != (attr->sched_priority != 0)))
return -EINVAL;
...
/*
* If not changing anything there's no need to proceed further,
* but store a possible modification of reset_on_fork.
*/
if (unlikely(policy == p->policy)) {
>>Policy unchanged (fair): if sched_nice differs from the task's current nice value, take the change path
if (fair_policy(policy) && attr->sched_nice != task_nice(p))
goto change;
>>Policy unchanged (rt): if sched_priority differs from the task's current rt_priority, take the change path
if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
goto change;
if (dl_policy(policy) && dl_param_changed(p, attr))
goto change;
if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
goto change;
p->sched_reset_on_fork = reset_on_fork;
retval = 0;
goto unlock;
}
...
queued = task_on_rq_queued(p);
running = task_current(rq, p);
if (queued)
dequeue_task(rq, p, queue_flags);
if (running)
put_prev_task(rq, p);
prev_class = p->sched_class;
>>Here the task's scheduling fields are actually updated
__setscheduler(rq, p, attr, pi);
__setscheduler_uclamp(p, attr);
if (queued) {
/*
* We enqueue to tail when the priority of a task is
* increased (user space view).
*/
if (oldprio < p->prio)
queue_flags |= ENQUEUE_HEAD;
enqueue_task(rq, p, queue_flags);
}
if (running)
set_next_task(rq, p);
...
}
/* Actually do priority change: must hold pi & rq lock. */
static void __setscheduler(struct rq *rq, struct task_struct *p,
const struct sched_attr *attr, bool keep_boost)
{
/*
* If params can't change scheduling class changes aren't allowed
* either.
*/
if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
return;
>>Update the task's scheduling policy along with its rt_priority and normal_prio
__setscheduler_params(p, attr);
/*
* Keep a potential priority boosting if called from
* sched_setscheduler().
*/
>>Update p->prio
p->prio = normal_prio(p);
if (keep_boost)
p->prio = rt_effective_prio(p, p->prio);
>>Update the scheduling class sched_class
if (dl_prio(p->prio))
p->sched_class = &dl_sched_class;
else if (rt_prio(p->prio))
p->sched_class = &rt_sched_class;
else
p->sched_class = &fair_sched_class;
}
/*
* sched_setparam() passes in -1 for its policy, to let the functions
* it calls know not to change it.
*/
#define SETPARAM_POLICY -1
static void __setscheduler_params(struct task_struct *p,
const struct sched_attr *attr)
{
int policy = attr->sched_policy;
if (policy == SETPARAM_POLICY)
policy = p->policy;
>>Update the scheduling policy
p->policy = policy;
if (dl_policy(policy))
__setparam_dl(p, attr);
else if (fair_policy(policy))
>>For a fair policy, update static_prio from the nice value passed down;
>>as the code above shows, sched_nice carries the task's original nice value
p->static_prio = NICE_TO_PRIO(attr->sched_nice);
/*
* __sched_setscheduler() ensures attr->sched_priority == 0 when
* !rt_policy. Always setting this ensures that things like
* getparam()/getattr() don't report silly values for !rt tasks.
*/
>>Update rt_priority; from the checks above, sched_priority is greater than 0 for an rt policy
>>and exactly 0 for a non-rt policy
p->rt_priority = attr->sched_priority;
>>Update normal_prio, normalizing the priority into the 0-139 range shown in systrace
p->normal_prio = normal_prio(p);
set_load_weight(p, true);
}
static inline int normal_prio(struct task_struct *p)
{
int prio;
if (task_has_dl_policy(p))
prio = MAX_DL_PRIO-1;
else if (task_has_rt_policy(p))
>>For an rt thread, normal_prio = 99 - p->rt_priority
>>#define MAX_USER_RT_PRIO 100
>>#define MAX_RT_PRIO MAX_USER_RT_PRIO
prio = MAX_RT_PRIO-1 - p->rt_priority;
else
>>For a normal thread, normal_prio = static_prio
prio = __normal_prio(p);
return prio;
}
/*
* __normal_prio - return the priority that is based on the static prio
*/
static inline int __normal_prio(struct task_struct *p)
{
return p->static_prio;
}
4. CFS scheduling and the calculation of vruntime
The most common scheduling policy for Android threads is SCHED_NORMAL, handled by the CFS (Completely Fair Scheduler) class, so it is worth understanding roughly how CFS works.
4.1 Common CFS data structures
4.1.1 task_struct
task_struct was introduced above; it represents a process/thread in Linux (for convenience, "process", "thread", and "task" are used interchangeably below). When setting a thread's priority, the kernel first finds the task_struct for the given tid. The fields directly related to priority are prio, static_prio, normal_prio, and rt_priority. The fields related to the CPUs a task may run on are nr_cpus_allowed, cpus_ptr, and cpus_mask; cpus_mask specifies which CPUs the task is allowed to run on. The fields related to scheduling are sched_class, policy, and se: the scheduling class, the scheduling policy, and the scheduling entity, where se is of type sched_entity.
struct task_struct {
...
int on_rq;
int prio;
int static_prio;
int normal_prio;
unsigned int rt_priority;
const struct sched_class *sched_class;
struct sched_entity se;
struct sched_rt_entity rt;
#ifdef CONFIG_CGROUP_SCHED
struct task_group *sched_task_group;
#endif
struct sched_dl_entity dl;
...
unsigned int policy;
int nr_cpus_allowed;
const cpumask_t *cpus_ptr;
cpumask_t cpus_mask;
...
};
4.1.2 sched_entity
sched_entity is the structure CFS works with directly when scheduling. load is the weight (of type load_weight), derived from the priority (nice value), and it affects the vruntime calculation. exec_start is the timestamp at which the task most recently got a CPU time slice; subtracting it from the current time gives the runtime of the current stint. sum_exec_runtime is the task's total CPU time, in real time. vruntime is the virtual runtime, computed from the real runtime sum_exec_runtime, and is CFS's key scheduling input. prev_sum_exec_runtime is the task's accumulated total runtime as of the last time it got a CPU time slice.
struct sched_entity {
/* For load-balancing: */
struct load_weight load;
unsigned long runnable_weight;
struct rb_node run_node;
struct list_head group_node;
unsigned int on_rq;
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
u64 prev_sum_exec_runtime;
...
};
struct load_weight {
unsigned long weight;
u32 inv_weight;
};
4.1.3 rq
Each CPU has a run queue, represented by struct rq; its cfs, rt, and dl members are the ready queues for the CFS, realtime, and deadline schedulers respectively.
kernel/sched/sched.h
/*
* This is the main, per-CPU runqueue data structure.
*
* Locking rule: those places that want to lock multiple runqueues
* (such as the load balancing or the thread migration code), lock
* acquire operations must be ordered by ascending &runqueue.
*/
struct rq {
...
struct cfs_rq cfs;
struct rt_rq rt;
struct dl_rq dl;
...
};
4.1.4 cfs_rq
cfs_rq is the CFS scheduler's ready queue. load is the sum of the weights of all scheduling entities on the queue; min_vruntime is the smallest vruntime among them; curr is the scheduling entity of the currently running task; tasks_timeline is the red-black tree that holds the scheduling entities, keyed by each entity's vruntime.
/* CFS-related fields in a runqueue */
struct cfs_rq {
struct load_weight load;
unsigned long runnable_weight;
unsigned int nr_running;
unsigned int h_nr_running; /* SCHED_{NORMAL,BATCH,IDLE} */
unsigned int idle_h_nr_running; /* SCHED_IDLE */
u64 exec_clock;
u64 min_vruntime;
#ifndef CONFIG_64BIT
u64 min_vruntime_copy;
#endif
struct rb_root_cached tasks_timeline;
/*
* 'curr' points to currently running entity on this cfs_rq.
* It is set to NULL otherwise (i.e when none are currently running).
*/
struct sched_entity *curr;
struct sched_entity *next;
struct sched_entity *last;
struct sched_entity *skip;
...
};
/*
* Leftmost-cached rbtrees.
*
* We do not cache the rightmost node based on footprint
* size vs number of potential users that could benefit
* from O(1) rb_last(). Just not worth it, users that want
* this feature can always implement the logic explicitly.
* Furthermore, users that want to cache both pointers may
* find it a bit asymmetric, but that's ok.
*/
struct rb_root_cached {
struct rb_root rb_root;
struct rb_node *rb_leftmost;
};
4.2 How CFS scheduling works
CFS picks the next task to run from its ready queue based on each task's vruntime, always favoring the task with the smallest vruntime. The ready queue keeps scheduling entities in a red-black tree, where each node's key is the task's vruntime and its value is the task's sched_entity. By the structure of a red-black tree, nodes to the left always have smaller vruntime than nodes to the right, so the leftmost leaf is the entity with the smallest vruntime, and CFS always picks the leftmost entity to run.
vruntime is virtual runtime, computed from the entity's actual runtime. The more CPU time an entity gets, the larger its actual runtime and hence its vruntime; while an entity is off the CPU, its vruntime stays unchanged. For a higher-priority task (smaller nice value), vruntime grows more slowly than for a lower-priority one, so a higher-priority task ends up with more CPU time.
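The selection rule above can be shown with a minimal user-space simulation (not kernel code; the names and the tick size are invented for illustration): the "scheduler" repeatedly picks the entity with the smallest vruntime, and each tick of real time advances vruntime inversely to the entity's weight.

```c
#define NICE_0_LOAD 1024

/* A toy runnable entity: a weight from sched_prio_to_weight plus the two clocks */
struct entity {
    unsigned long weight;
    unsigned long long vruntime;  /* virtual runtime, the rb-tree key */
    unsigned long long runtime;   /* accumulated real runtime */
};

/* Pick the entity with the smallest vruntime -- the leftmost node of the
 * real rb-tree, found here by a linear scan for simplicity */
int pick_next(const struct entity *e, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (e[i].vruntime < e[best].vruntime)
            best = i;
    return best;
}

/* Charge one tick of real time to entity i; vruntime grows inversely to weight */
void run_tick(struct entity *e, int i)
{
    unsigned long long delta_exec = 1000;  /* 1000 time units of real runtime */
    e[i].runtime  += delta_exec;
    e[i].vruntime += delta_exec * NICE_0_LOAD / e[i].weight;
}
```

Running, say, a nice-0 entity (weight 1024) against a nice-5 entity (weight 335) for many ticks, the heavier entity accumulates roughly three times the runtime, matching the weight ratio.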
vruntime is also adjusted at moments other than priority changes: when a task migrates to another CPU, wakes up after a long sleep, or is first created, its vruntime is adjusted to preserve fairness. For example, each CPU's runqueue has its own notion of vruntime, so when a task migrates from one CPU to another, its vruntime is adjusted to avoid the task either starving or monopolizing the CPU after the move. Likewise, a newly created task starts with vruntime 0, which must be adjusted relative to the cfs_rq's min_vruntime so that other tasks are not starved of CPU time. All of this reflects the fairness of CFS scheduling.
4.3 Calculating vruntime
As described above, CFS scheduling depends on each task's vruntime, so it is worth tracing how vruntime is updated when a task's priority changes.
In the priority-update flow, the kernel first updates the priority fields in task_struct (static_prio, normal_prio, prio, etc.), then updates the task's most recent actual on-CPU time delta_exec and its virtual runtime increment Δvruntime, and finally the weight corresponding to the new nice value. The vruntime update is:
vruntime += Δvruntime;
Δvruntime = delta_exec * NICE_0_LOAD / lw.weight;
The larger the weight, the more slowly vruntime grows; that is why a higher-priority task, with its larger weight, ends up with a larger share of CPU time.
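The formula above can be written out directly (a sketch using integer division; the helper name is mine):

```c
#define NICE_0_LOAD 1024  /* the weight that corresponds to nice 0 */

/* Δvruntime = delta_exec * NICE_0_LOAD / weight,
 * where weight comes from sched_prio_to_weight[nice + 20] */
unsigned long long delta_vruntime(unsigned long long delta_exec, unsigned long weight)
{
    return delta_exec * NICE_0_LOAD / weight;
}
```

For nice 0 (weight 1024) virtual time equals real time; for nice -5 (weight 3121) vruntime advances roughly 3x slower, and for nice 5 (weight 335) roughly 3x faster.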
//Code flow for updating weight and vruntime
set_user_nice -> set_load_weight -> reweight_task -> reweight_entity (updates runnable_weight and weight) -> update_curr -> 1. calc_delta_fair (updates the task's vruntime); 2. update_min_vruntime (updates the cfs_rq's min_vruntime)
/kernel/sched/core.c
static void set_load_weight(struct task_struct *p, bool update_load)
{
int prio = p->static_prio - MAX_RT_PRIO;
struct load_weight *load = &p->se.load;
/*
* SCHED_IDLE tasks get minimal weight:
*/
if (task_has_idle_policy(p)) {
load->weight = scale_load(WEIGHT_IDLEPRIO);
load->inv_weight = WMULT_IDLEPRIO;
p->se.runnable_weight = load->weight;
return;
}
/*
* SCHED_OTHER tasks have to update their load when changing their
* weight
*/
if (update_load && p->sched_class == &fair_sched_class) {
>>For a CFS-scheduled task, update its weight
reweight_task(p, prio);
} else {
load->weight = scale_load(sched_prio_to_weight[prio]);
load->inv_weight = sched_prio_to_wmult[prio];
p->se.runnable_weight = load->weight;
}
}
/kernel/sched/fair.c
void reweight_task(struct task_struct *p, int prio)
{
>>Get the task_struct's scheduling entity sched_entity
struct sched_entity *se = &p->se;
>>Get the entity's CFS ready queue cfs_rq
struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>Get the task's weight structure load_weight
struct load_weight *load = &se->load;
>>Look up the weight for this nice level in the sched_prio_to_weight array
unsigned long weight = scale_load(sched_prio_to_weight[prio]);
>>Update the sched_entity's weight
reweight_entity(cfs_rq, se, weight, weight);
load->inv_weight = sched_prio_to_wmult[prio];
}
Each nice value maps to a weight, stored in the sched_prio_to_weight array. A smaller nice value means a higher priority and a larger weight. The ratio between weights of adjacent nice levels is about 0.8 (a factor of ~1.25 going the other way), chosen so that each +1 step in nice costs the task roughly 10% of its CPU time.
/kernel/sched/core.c
/*
* Nice levels are multiplicative, with a gentle 10% change for every
* nice level changed. I.e. when a CPU-bound task goes from nice 0 to
* nice 1, it will get ~10% less CPU time than another CPU-bound task
* that remained on nice 0.
*
* The "10% effect" is relative and cumulative: from _any_ nice level,
* if you go up 1 level, it's -10% CPU usage, if you go down 1 level
* it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
* If a task goes up by ~10% and another task goes down by ~10% then
* the relative distance between them is ~25%.)
*/
const int sched_prio_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */ 9548, 7620, 6100, 4904, 3906,
/* -5 */ 3121, 2501, 1991, 1586, 1277,
/* 0 */ 1024, 820, 655, 526, 423,
/* 5 */ 335, 272, 215, 172, 137,
/* 10 */ 110, 87, 70, 56, 45,
/* 15 */ 36, 29, 23, 18, 15,
};
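A quick check against the table above (a sketch; share_permille is a helper I invented) that adjacent weights differ by a factor of about 1.25, which translates into roughly a 10% relative change in CPU share against a nice-0 competitor:

```c
/* Two adjacent entries from sched_prio_to_weight: nice 0 and nice 1 */
#define WEIGHT_NICE_0 1024
#define WEIGHT_NICE_1 820

/* CPU share (in permille) of a task with weight w competing against one nice-0 task */
int share_permille(int w)
{
    return 1000 * w / (w + WEIGHT_NICE_0);
}
```

Against one nice-0 peer, a nice-0 task gets 50% of the CPU while a nice-1 task gets about 44.4%, roughly a 10% relative drop.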
static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
unsigned long weight, unsigned long runnable)
{
if (se->on_rq) {
/* commit outstanding execution time */
if (cfs_rq->curr == se)
>>If this task is currently running on the CPU, first bring its vruntime up to date via update_curr
update_curr(cfs_rq);
account_entity_dequeue(cfs_rq, se);
dequeue_runnable_load_avg(cfs_rq, se);
}
dequeue_load_avg(cfs_rq, se);
se->runnable_weight = runnable;
>>After vruntime is updated, update the weight
update_load_set(&se->load, weight);
...
}
static inline void update_load_set(struct load_weight *lw, unsigned long w)
{
>>Update the weight: assign the new nice level's weight w to lw->weight
lw->weight = w;
lw->inv_weight = 0;
}
/*
* Update the current task's runtime statistics.
*/
static void update_curr(struct cfs_rq *cfs_rq)
{
struct sched_entity *curr = cfs_rq->curr;
u64 now = rq_clock_task(rq_of(cfs_rq));
u64 delta_exec;
if (unlikely(!curr))
return;
>>Compute the runtime of the current stint, delta_exec
delta_exec = now - curr->exec_start;
if (unlikely((s64)delta_exec <= 0))
return;
>>Reset exec_start
curr->exec_start = now;
schedstat_set(curr->statistics.exec_max,
max(delta_exec, curr->statistics.exec_max));
>>Update sum_exec_runtime by adding this stint's runtime delta_exec
curr->sum_exec_runtime += delta_exec;
schedstat_add(cfs_rq->exec_clock, delta_exec);
>>Update the entity's vruntime by adding this stint's virtual runtime
curr->vruntime += calc_delta_fair(delta_exec, curr);
>>Update the cfs_rq's min_vruntime
update_min_vruntime(cfs_rq);
if (entity_is_task(curr)) {
struct task_struct *curtask = task_of(curr);
trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
cgroup_account_cputime(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
account_cfs_rq_runtime(cfs_rq, delta_exec);
}
/*
* delta /= w
*/
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
if (unlikely(se->load.weight != NICE_0_LOAD))
>>Compute the vruntime increment via __calc_delta
>>delta is the runtime since the task last got the CPU, NICE_0_LOAD is the weight for nice 0, and se->load is the entity's weight
delta = __calc_delta(delta, NICE_0_LOAD, &se->load);
return delta;
}
As the code comment shows, the formula is Δvruntime = delta_exec * NICE_0_LOAD / lw.weight; the actual computation is a slightly optimized version of this. So the higher a task's priority, the larger its weight, the more slowly its vruntime grows, and the more CPU time it receives. For a nice-0 task, the virtual runtime Δvruntime equals the actual runtime delta_exec.
/*
* delta_exec * weight / lw.weight
* OR
* (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
*
* Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
* we're guaranteed shift stays positive because inv_weight is guaranteed to
* fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
*
* Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
* weight/lw.weight <= 1, and therefore our shift will also be positive.
*/
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
u64 fact = scale_load_down(weight);
int shift = WMULT_SHIFT;
__update_inv_weight(lw);
if (unlikely(fact >> 32)) {
while (fact >> 32) {
fact >>= 1;
shift--;
}
}
fact = mul_u32_u32(fact, lw->inv_weight);
while (fact >> 32) {
fact >>= 1;
shift--;
}
return mul_u64_u32_shr(delta_exec, fact, shift);
}
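The division is implemented with a multiply-and-shift: the kernel precomputes inv_weight ≈ 2^32 / weight per nice level (the sched_prio_to_wmult[] array), so runtime accounting avoids a 64-bit division. A simplified sketch of the idea, without the overflow handling of the real __calc_delta and assuming delta_exec is small enough for the products to fit in 64 bits:

```c
#include <stdint.h>

#define NICE_0_LOAD 1024
#define WMULT_SHIFT 32

/* Approximate delta_exec * NICE_0_LOAD / weight via multiply-and-shift */
uint64_t calc_delta_sketch(uint64_t delta_exec, uint32_t weight)
{
    /* inv_weight ~ 2^32 / weight; the kernel precomputes this per nice level */
    uint64_t inv_weight = 0xffffffffULL / weight;
    uint64_t fact = NICE_0_LOAD * inv_weight;
    /* valid only while delta_exec * fact fits in 64 bits */
    return (delta_exec * fact) >> WMULT_SHIFT;
}
```

For weight 1024 the result tracks delta_exec almost exactly, and for weight 335 it closely matches delta_exec * 1024 / 335, at the cost of a tiny rounding error.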
5. Summary and thoughts
(1) setThreadPriority can only set the priority of normal threads; it has no effect on realtime threads.
(2) When setThreadScheduler targets a normal thread, the priority argument (sched_priority) must be 0; when it targets a realtime thread, the valid range is 1~99.
(3) Android sets realtime priorities in the range 1~99, which map to normalized priorities 0~98, so Android cannot produce a normalized thread priority of 99?
(4) For normal threads, the nice value and normal_prio are related by normal_prio = static_prio = DEFAULT_PRIO + nice = 120 + nice; for realtime threads, the rt_priority value set from Android relates to normal_prio by normal_prio = MAX_RT_PRIO - 1 - rt_priority = 99 - rt_priority.
(5) The smaller a thread's nice value, the higher its priority and the larger its weight, so its virtual runtime vruntime grows more slowly and it receives more CPU time. The vruntime formulas are:
Δvruntime = delta_exec * NICE_0_LOAD / lw.weight; vruntime += Δvruntime;