How Android sets thread priority under the hood

1. Background

1.1 Setting thread priority in Android

Android offers two common ways to set a thread's priority: Process.setThreadPriority and Process.setThreadScheduler. Both end up in the kernel via system calls, landing in the Linux functions setpriority and sched_setscheduler respectively.

/frameworks/base/core/java/android/os/Process.java
    /**
     * Set the priority of a thread, based on Linux priorities.
     *
     * @param tid The identifier of the thread/process to change.
     * @param priority A Linux priority level, from -20 for highest scheduling
     * priority to 19 for lowest scheduling priority.
     *
     * @throws IllegalArgumentException Throws IllegalArgumentException if
     * <var>tid</var> does not exist.
     * @throws SecurityException Throws SecurityException if your process does
     * not have permission to modify the given thread, or to use the given
     * priority.
     */
    public static final native void setThreadPriority(int tid, int priority)
            throws IllegalArgumentException, SecurityException;


    /**
     * Set the scheduling policy and priority of a thread, based on Linux.
     *
     * @param tid The identifier of the thread/process to change.
     * @param policy A Linux scheduling policy such as SCHED_OTHER etc.
     * @param priority A Linux priority level in a range appropriate for the given policy.
     *
     * @throws IllegalArgumentException Throws IllegalArgumentException if
     * <var>tid</var> does not exist, or if <var>priority</var> is out of range for the policy.
     * @throws SecurityException Throws SecurityException if your process does
     * not have permission to modify the given thread, or to use the given
     * scheduling policy or priority.
     *
     * {@hide}
     */
    public static final native void setThreadScheduler(int tid, int policy, int priority)
            throws IllegalArgumentException;

1.2 Schedulers and scheduling policies in Linux

    The main Linux scheduling classes are deadline, realtime, and CFS (Completely Fair Scheduler). The deadline class implements the SCHED_DEADLINE policy; the realtime class implements SCHED_FIFO and SCHED_RR; CFS implements SCHED_NORMAL, SCHED_BATCH and SCHED_IDLE. The policies Android threads commonly use are SCHED_NORMAL, SCHED_RR and SCHED_FIFO.
 

1.3 task_struct

    task_struct is the structure that represents a thread/process in Linux. Among its many fields, the ones directly related to priority are prio, static_prio, normal_prio, rt_priority, policy and sched_class.

/include/linux/sched.h
struct task_struct {
...
	int				prio;
	int				static_prio;
	int				normal_prio;
	unsigned int			rt_priority;

	const struct sched_class	*sched_class;
...
	unsigned int			policy;
	int				nr_cpus_allowed;
	const cpumask_t			*cpus_ptr;
	cpumask_t			cpus_mask;
...
}

The key priority-related fields of task_struct are:

normal_prio: the normalized priority computed by the normal_prio() function, in the range 0~139 (covering both realtime and normal threads); a larger value means a lower priority. This is the value shown in systrace.
prio: the value the scheduler actually uses; in most cases it equals p->normal_prio.
static_prio: the priority derived from the nice value (range -20~19); it only matters for non-realtime threads.
rt_priority: the priority of a realtime thread, in the range 0~99; a larger value means a higher priority (a value of 0 normally indicates a non-realtime thread).
 

2. How setThreadPriority works under the hood

setThreadPriority is used to set the nice value of a normal (CFS) thread. The main kernel flow is:
(1) Clamp the nice value passed from userspace to the -20~19 range;
(2) Find the task_struct for the given tid;
(3) Check the task's scheduling policy: if it is SCHED_FIFO, SCHED_RR or SCHED_DEADLINE, only static_prio is recorded and the rest of the flow is skipped;
(4) Compute static_prio from the nice value;
(5) Compute normal_prio and prio from static_prio.

setpriority->set_one_prio->set_user_nice



/kernel/sys.c
SYSCALL_DEFINE3(setpriority, int, which, int, who, int, niceval)
{
	struct task_struct *g, *p;
	struct user_struct *user;
	const struct cred *cred = current_cred();
	int error = -EINVAL;
	struct pid *pgrp;
	kuid_t uid;

	if (which > PRIO_USER || which < PRIO_PROCESS)
		goto out;

	/* normalize: avoid signed division (rounding problems) */
	error = -ESRCH;
>>Clamp the nice value to the valid range: -20 <= niceval <= 19
	if (niceval < MIN_NICE)
		niceval = MIN_NICE;
	if (niceval > MAX_NICE)
		niceval = MAX_NICE;

	rcu_read_lock();
	read_lock(&tasklist_lock);
	switch (which) {
	case PRIO_PROCESS:
		if (who)
>>Look up the task_struct for the given tid
			p = find_task_by_vpid(who);
		else
			p = current;
		if (p)
>>Set the nice value via set_one_prio
			error = set_one_prio(p, niceval, error);
		break;
	case PRIO_PGRP:
    ...
	case PRIO_USER:


/*
 * set the priority of a task
 * - the caller must hold the RCU read lock
 */
static int set_one_prio(struct task_struct *p, int niceval, int error)
{
	int no_nice;

	if (!set_one_prio_perm(p)) {
		error = -EPERM;
		goto out;
	}
	if (niceval < task_nice(p) && !can_nice(p, niceval)) {
		error = -EACCES;
		goto out;
	}
	no_nice = security_task_setnice(p, niceval);
	if (no_nice) {
		error = no_nice;
		goto out;
	}
	if (error == -ESRCH)
		error = 0;
>>Finally set the nice value via set_user_nice
	set_user_nice(p, niceval);
out:
	return error;
}


/kernel/sched/core.c
void set_user_nice(struct task_struct *p, long nice)
{
	bool queued, running;
	int old_prio, delta;
	struct rq_flags rf;
	struct rq *rq;

	if (task_nice(p) == nice || nice < MIN_NICE || nice > MAX_NICE)
		return;
	/*
	 * We have to be careful, if called from sys_setpriority(),
	 * the task might be in the middle of scheduling on another CPU.
	 */
	rq = task_rq_lock(p, &rf);
	update_rq_clock(rq);

	/*
	 * The RT priorities are set via sched_setscheduler(), but we still
	 * allow the 'normal' nice value to be set - but as expected
	 * it wont have any effect on scheduling until the task is
	 * SCHED_DEADLINE, SCHED_FIFO or SCHED_RR:
	 */
>>For an RT task, the new nice value is still recorded in p->static_prio (it has no effect on RT scheduling), then we bail out immediately
	if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
		p->static_prio = NICE_TO_PRIO(nice);
		goto out_unlock;
	}
	queued = task_on_rq_queued(p);
	running = task_current(rq, p);
	if (queued)
		dequeue_task(rq, p, DEQUEUE_SAVE | DEQUEUE_NOCLOCK);
	if (running)
		put_prev_task(rq, p);
>>Reaching here means this is a normal (CFS) task; first update static_prio = 120 + nice via NICE_TO_PRIO
>>#define NICE_TO_PRIO(nice)	((nice) + DEFAULT_PRIO)
	p->static_prio = NICE_TO_PRIO(nice);
	set_load_weight(p, true);
	old_prio = p->prio;
>>Update p->prio; effective_prio also updates p->normal_prio
	p->prio = effective_prio(p);
	delta = p->prio - old_prio;

	if (queued) {
		enqueue_task(rq, p, ENQUEUE_RESTORE | ENQUEUE_NOCLOCK);
		/*
		 * If the task increased its priority or is running and
		 * lowered its priority, then reschedule its CPU:
		 */
		if (delta < 0 || (delta > 0 && task_running(rq, p)))
>>If the priority was raised, or it was lowered while the task is running, set the reschedule flag
			resched_curr(rq);
	}
	if (running)
>>If the task is currently running, re-install it as the runqueue's current task
		set_next_task(rq, p);
out_unlock:
	task_rq_unlock(rq, p, &rf);
}


/*
 * Calculate the current priority, i.e. the priority
 * taken into account by the scheduler. This value might
 * be boosted by RT tasks, or might be boosted by
 * interactivity modifiers. Will be RT if the task got
 * RT-boosted. If not then it returns p->normal_prio.
 */
static int effective_prio(struct task_struct *p)
{
>>Compute p->normal_prio via the normal_prio function
	p->normal_prio = normal_prio(p);
	/*
	 * If we are RT tasks or we were boosted to RT priority,
	 * keep the priority unchanged. Otherwise, update priority
	 * to the normal priority:
	 */
>>For a non-realtime task, update p->prio to p->normal_prio
	if (!rt_prio(p->prio))
		return p->normal_prio;
	return p->prio;
}


/*
 * Calculate the expected normal priority: i.e. priority
 * without taking RT-inheritance into account. Might be
 * boosted by interactivity modifiers. Changes upon fork,
 * setprio syscalls, and whenever the interactivity
 * estimator recalculates.
 */
static inline int normal_prio(struct task_struct *p)
{
	int prio;

	if (task_has_dl_policy(p))
		prio = MAX_DL_PRIO-1;
	else if (task_has_rt_policy(p))
>>For an RT thread, normal_prio = 99 - p->rt_priority
>>#define MAX_USER_RT_PRIO	100
>>#define MAX_RT_PRIO		MAX_USER_RT_PRIO
		prio = MAX_RT_PRIO-1 - p->rt_priority;
	else
>>For a normal thread, normal_prio = static_prio
		prio = __normal_prio(p);
	return prio;
}


/*
 * __normal_prio - return the priority that is based on the static prio
 */
static inline int __normal_prio(struct task_struct *p)
{
	return p->static_prio;
}

3. How setThreadScheduler works under the hood

sched_setscheduler is mainly used to set the priority of a realtime thread, or to switch a thread back from SCHED_RR/SCHED_FIFO to SCHED_NORMAL and restore its original priority.
The main kernel flow is:
(1) Find the task_struct for the given tid;
(2) Package the userspace parameters into a sched_attr structure;
(3) Validate the sched_priority passed from userspace: it may not exceed the maximum of 99; for a realtime policy it may not be 0, and for a normal policy it may not be greater than 0;
(4) Assign the userspace sched_priority to rt_priority in task_struct and update normal_prio;
(5) If the thread is being restored to SCHED_NORMAL, also update static_prio;
(6) Assign the normalized priority normal_prio to prio.

SYSCALL_DEFINE3(sched_setscheduler,...)->do_sched_setscheduler->sched_setscheduler->
_sched_setscheduler(sched_param->sched_attr)->__sched_setscheduler->__setscheduler->
1.__setscheduler_params; 2.normal_prio; 3.p->sched_class




/**
 * sys_sched_setscheduler - set/change the scheduler policy and RT priority
 * @pid: the pid in question.
 * @policy: new policy.
 * @param: structure containing the new RT priority.
 *
 * Return: 0 on success. An error code otherwise.
 */
SYSCALL_DEFINE3(sched_setscheduler, pid_t, pid, int, policy, struct sched_param __user *, param)
{
	if (policy < 0)
		return -EINVAL;
>>Userspace reaches here via the system call
	return do_sched_setscheduler(pid, policy, param);
}



static int
do_sched_setscheduler(pid_t pid, int policy, struct sched_param __user *param)
{
	struct sched_param lparam;
	struct task_struct *p;
	int retval;

	if (!param || pid < 0)
		return -EINVAL;
		
	if (copy_from_user(&lparam, param, sizeof(struct sched_param)))
		return -EFAULT;

	rcu_read_lock();
	retval = -ESRCH;
>>Look up the task_struct for the given tid
	p = find_process_by_pid(pid);
	if (likely(p))
		get_task_struct(p);
	rcu_read_unlock();

	if (likely(p)) {
>>Call sched_setscheduler to update the task_struct fields
		retval = sched_setscheduler(p, policy, &lparam);
		put_task_struct(p);
	}

	return retval;
}


/**
 * sched_setscheduler - change the scheduling policy and/or RT priority of a thread.
 * @p: the task in question.
 * @policy: new policy.
 * @param: structure containing the new RT priority.
 *
 * Return: 0 on success. An error code otherwise.
 *
 * NOTE that the task may be already dead.
 */
int sched_setscheduler(struct task_struct *p, int policy,
		       const struct sched_param *param)
{
	return _sched_setscheduler(p, policy, param, true);
}


static int _sched_setscheduler(struct task_struct *p, int policy,
			       const struct sched_param *param, bool check)
{
>>Package the userspace parameters into a sched_attr: sched_policy (just passed in), sched_priority (just passed in), sched_nice (the task's existing nice value)
	struct sched_attr attr = {
		.sched_policy   = policy,
		.sched_priority = param->sched_priority,
		.sched_nice	= PRIO_TO_NICE(p->static_prio),
	};

	/* Fixup the legacy SCHED_RESET_ON_FORK hack. */
	if ((policy != SETPARAM_POLICY) && (policy & SCHED_RESET_ON_FORK)) {
		attr.sched_flags |= SCHED_FLAG_RESET_ON_FORK;
		policy &= ~SCHED_RESET_ON_FORK;
		attr.sched_policy = policy;
	}
>>Continue into __sched_setscheduler
	return __sched_setscheduler(p, &attr, check, true);
}




static int __sched_setscheduler(struct task_struct *p,
				const struct sched_attr *attr,
				bool user, bool pi)
{
	int newprio = dl_policy(attr->sched_policy) ? MAX_DL_PRIO - 1 :
		      MAX_RT_PRIO - 1 - attr->sched_priority;
...
	/*
	 * Valid priorities for SCHED_FIFO and SCHED_RR are
	 * 1..MAX_USER_RT_PRIO-1, valid priority for SCHED_NORMAL,
	 * SCHED_BATCH and SCHED_IDLE is 0.
	 */
>>The priority set for an RT thread may not exceed 99
>>#define MAX_USER_RT_PRIO	100
>>#define MAX_RT_PRIO		MAX_USER_RT_PRIO
	if ((p->mm && attr->sched_priority > MAX_USER_RT_PRIO-1) ||
	    (!p->mm && attr->sched_priority > MAX_RT_PRIO-1))
		return -EINVAL;
>>For an RT policy, a sched_priority of 0 from userspace is rejected and the flow stops
>>likewise, for a normal (non-RT) policy, a nonzero sched_priority is rejected and the flow stops
	if ((dl_policy(policy) && !__checkparam_dl(attr)) ||
	    (rt_policy(policy) != (attr->sched_priority != 0)))
		return -EINVAL;
...
	/*
	 * If not changing anything there's no need to proceed further,
	 * but store a possible modification of reset_on_fork.
	 */
	if (unlikely(policy == p->policy)) {
>>Policy unchanged and fair, but sched_nice differs from the task's current nice value: take the change path
		if (fair_policy(policy) && attr->sched_nice != task_nice(p))
			goto change;
>>Policy unchanged and RT, but sched_priority differs from the task's current rt_priority: take the change path
		if (rt_policy(policy) && attr->sched_priority != p->rt_priority)
			goto change;
		if (dl_policy(policy) && dl_param_changed(p, attr))
			goto change;
		if (attr->sched_flags & SCHED_FLAG_UTIL_CLAMP)
			goto change;

		p->sched_reset_on_fork = reset_on_fork;
		retval = 0;
		goto unlock;
	}
...
	queued = task_on_rq_queued(p);
	running = task_current(rq, p);
	if (queued)
		dequeue_task(rq, p, queue_flags);
	if (running)
		put_prev_task(rq, p);

	prev_class = p->sched_class;
>>The actual update of the task's scheduling fields happens here
	__setscheduler(rq, p, attr, pi);
	__setscheduler_uclamp(p, attr);

	if (queued) {
		/*
		 * We enqueue to tail when the priority of a task is
		 * increased (user space view).
		 */
		if (oldprio < p->prio)
			queue_flags |= ENQUEUE_HEAD;

		enqueue_task(rq, p, queue_flags);
	}
	if (running)
		set_next_task(rq, p);
		
		

/* Actually do priority change: must hold pi & rq lock. */
static void __setscheduler(struct rq *rq, struct task_struct *p,
			   const struct sched_attr *attr, bool keep_boost)
{
	/*
	 * If params can't change scheduling class changes aren't allowed
	 * either.
	 */
	if (attr->sched_flags & SCHED_FLAG_KEEP_PARAMS)
		return;
>>Update the task's scheduling policy, rt_priority and normal_prio
	__setscheduler_params(p, attr);

	/*
	 * Keep a potential priority boosting if called from
	 * sched_setscheduler().
	 */
>>Update p->prio
	p->prio = normal_prio(p);
	if (keep_boost)
		p->prio = rt_effective_prio(p, p->prio);
>>Update the scheduling class sched_class
	if (dl_prio(p->prio))
		p->sched_class = &dl_sched_class;
	else if (rt_prio(p->prio))
		p->sched_class = &rt_sched_class;
	else
		p->sched_class = &fair_sched_class;
}



/*
 * sched_setparam() passes in -1 for its policy, to let the functions
 * it calls know not to change it.
 */
#define SETPARAM_POLICY	-1

static void __setscheduler_params(struct task_struct *p,
		const struct sched_attr *attr)
{
	int policy = attr->sched_policy;

	if (policy == SETPARAM_POLICY)
		policy = p->policy;
>>Update the scheduling policy
	p->policy = policy;

	if (dl_policy(policy))
		__setparam_dl(p, attr);
	else if (fair_policy(policy))
>>For a fair policy, refresh static_prio from sched_nice
>>per _sched_setscheduler above, sched_nice is the task's existing nice value
		p->static_prio = NICE_TO_PRIO(attr->sched_nice);

	/*
	 * __sched_setscheduler() ensures attr->sched_priority == 0 when
	 * !rt_policy. Always setting this ensures that things like
	 * getparam()/getattr() don't report silly values for !rt tasks.
	 */
>>Update rt_priority; per the earlier checks, sched_priority is greater than 0 for an RT policy
>>and 0 for a non-RT policy
	p->rt_priority = attr->sched_priority;
>>Update normal_prio, i.e. normalize the priority into the 0-139 range seen in systrace
	p->normal_prio = normal_prio(p);
	set_load_weight(p, true);
}


static inline int normal_prio(struct task_struct *p)
{
	int prio;

	if (task_has_dl_policy(p))
		prio = MAX_DL_PRIO-1;
	else if (task_has_rt_policy(p))
>>For an RT thread, normal_prio = 99 - p->rt_priority
>>#define MAX_USER_RT_PRIO	100
>>#define MAX_RT_PRIO		MAX_USER_RT_PRIO
		prio = MAX_RT_PRIO-1 - p->rt_priority;
	else
>>For a normal thread, normal_prio = static_prio
		prio = __normal_prio(p);
	return prio;
}


/*
 * __normal_prio - return the priority that is based on the static prio
 */
static inline int __normal_prio(struct task_struct *p)
{
	return p->static_prio;
}

4. CFS scheduling and the computation of vruntime

The most common scheduling policy for Android threads is SCHED_NORMAL, handled by CFS (Completely Fair Scheduler), so it is worth understanding roughly how CFS works.

4.1 Core CFS data structures

4.1.1 task_struct

task_struct was introduced above; it represents a process/thread in Linux (below, "process", "thread" and "task" are used interchangeably). When setting a thread's priority, the kernel first finds the task_struct for the given tid. The fields directly related to priority are prio, static_prio, normal_prio and rt_priority. The fields related to CPU placement are nr_cpus_allowed, cpus_ptr and cpus_mask, where cpus_mask specifies which CPUs the task may run on. The fields related to scheduling are sched_class, policy and se, which are the scheduling class, the scheduling policy and the scheduling entity respectively; se is a sched_entity.

struct task_struct {
    ...
	int				on_rq;

	int				prio;
	int				static_prio;
	int				normal_prio;
	unsigned int			rt_priority;

	const struct sched_class	*sched_class;
	struct sched_entity		se;
	struct sched_rt_entity		rt;
#ifdef CONFIG_CGROUP_SCHED
	struct task_group		*sched_task_group;
#endif
	struct sched_dl_entity		dl;
	...
	unsigned int			policy;
	int				nr_cpus_allowed;
	const cpumask_t			*cpus_ptr;
	cpumask_t			cpus_mask;
	...
}

4.1.2 sched_entity

sched_entity is the structure CFS operates on directly. load is the weight (a load_weight), derived from the priority (nice value), and feeds into the vruntime computation; exec_start is the timestamp at which the task last got on the CPU, so now - exec_start gives the time run in the current stint; sum_exec_runtime is the task's total CPU time, in real time; vruntime is the virtual runtime, derived from the real sum_exec_runtime, and is the key input to CFS scheduling decisions; prev_sum_exec_runtime is the task's accumulated total runtime as of the last time it was given the CPU.

struct sched_entity {
	/* For load-balancing: */
	struct load_weight		load;
	unsigned long			runnable_weight;
	struct rb_node			run_node;
	struct list_head		group_node;
	unsigned int			on_rq;

	u64				exec_start;
	u64				sum_exec_runtime;
	u64				vruntime;
	u64				prev_sum_exec_runtime;
	...
}


struct load_weight {
	unsigned long			weight;
	u32				inv_weight;
};

4.1.3 rq

Each CPU has a runqueue, represented by struct rq; its cfs, rt and dl members are the ready queues for the CFS, realtime and deadline scheduling classes respectively.

kernel/sched/sched.h
/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
    ...
	struct cfs_rq		cfs;
	struct rt_rq		rt;
	struct dl_rq		dl;
	...
}

4.1.4 cfs_rq

The CFS ready queue is represented by struct cfs_rq. load is the sum of the weights of all scheduling entities on the queue; min_vruntime is the smallest vruntime among them; curr is the scheduling entity of the currently running task; tasks_timeline is the red-black tree holding the scheduling entities, keyed by each entity's vruntime.

/* CFS-related fields in a runqueue */
struct cfs_rq {
	struct load_weight	load;
	unsigned long		runnable_weight;
	unsigned int		nr_running;
	unsigned int		h_nr_running;      /* SCHED_{NORMAL,BATCH,IDLE} */
	unsigned int		idle_h_nr_running; /* SCHED_IDLE */

	u64			exec_clock;
	u64			min_vruntime;
#ifndef CONFIG_64BIT
	u64			min_vruntime_copy;
#endif

	struct rb_root_cached	tasks_timeline;

	/*
	 * 'curr' points to currently running entity on this cfs_rq.
	 * It is set to NULL otherwise (i.e when none are currently running).
	 */
	struct sched_entity	*curr;
	struct sched_entity	*next;
	struct sched_entity	*last;
	struct sched_entity	*skip;
	...
}


/*
 * Leftmost-cached rbtrees.
 *
 * We do not cache the rightmost node based on footprint
 * size vs number of potential users that could benefit
 * from O(1) rb_last(). Just not worth it, users that want
 * this feature can always implement the logic explicitly.
 * Furthermore, users that want to cache both pointers may
 * find it a bit asymmetric, but that's ok.
 */
struct rb_root_cached {
	struct rb_root rb_root;
	struct rb_node *rb_leftmost;
};

4.2 How CFS scheduling works

CFS picks the next task to run from its ready queue based on vruntime, always favoring the task with the smallest vruntime. The ready queue keeps the scheduling entities in a red-black tree whose keys are the tasks' vruntime values and whose values are the corresponding sched_entity structures. By the properties of a red-black tree, nodes to the left always have smaller vruntime than nodes to the right, so the leftmost leaf is the entity with the smallest vruntime, and CFS always schedules the leftmost entity.

vruntime is virtual runtime, computed from the entity's actual runtime. The more CPU time an entity gets, the larger its actual runtime and hence its vruntime; while an entity is off the CPU, its vruntime stays unchanged. For a higher-priority task (smaller nice value), vruntime grows more slowly than for a lower-priority one, so higher-priority tasks end up with more CPU time.

vruntime is updated and adjusted not only when the priority changes: it is also adjusted when a task migrates to another CPU, when it wakes up after a long sleep, when it is first created, and so on, all to preserve fairness. For example, each CPU's runqueue has its own vruntime baseline, so when a task migrates between CPUs its vruntime is adjusted to prevent it from either starving or monopolizing the CPU after the move. Likewise, a newly created task starts with a vruntime of 0, which must be adjusted relative to the cfs_rq's min_vruntime so that other tasks are not starved of CPU. All of this reflects the fairness of CFS.

4.3 Computing vruntime

As described above, CFS scheduling depends on each task's vruntime, so it is worth tracing how vruntime is updated when a task's priority changes.

When a task's priority is updated, the priority-related fields of task_struct (static_prio, normal_prio, prio, etc.) are updated first; then the task's actual runtime in its latest stint on the CPU (delta_exec) and its virtual runtime increment (Δvruntime) are updated in turn; finally the weight for the new nice value is installed. vruntime is updated as:

vruntime += Δvruntime;

Δvruntime = delta_exec * NICE_0_LOAD / lw.weight;

So the larger the weight, the more slowly vruntime grows: a higher-priority task has a larger weight and therefore gets more CPU time.

// code flow for updating weight and vruntime
set_user_nice -> set_load_weight -> reweight_task -> reweight_entity (updates runnable_weight and weight) -> update_curr -> 1. calc_delta_fair (updates the task's vruntime); 2. update_min_vruntime (updates the cfs_rq's min_vruntime)


/kernel/sched/core.c
static void set_load_weight(struct task_struct *p, bool update_load)
{
	int prio = p->static_prio - MAX_RT_PRIO;
	struct load_weight *load = &p->se.load;

	/*
	 * SCHED_IDLE tasks get minimal weight:
	 */
	if (task_has_idle_policy(p)) {
		load->weight = scale_load(WEIGHT_IDLEPRIO);
		load->inv_weight = WMULT_IDLEPRIO;
		p->se.runnable_weight = load->weight;
		return;
	}

	/*
	 * SCHED_OTHER tasks have to update their load when changing their
	 * weight
	 */
	if (update_load && p->sched_class == &fair_sched_class) {
>>For a CFS-scheduled task, update its weight via reweight_task
		reweight_task(p, prio);
	} else {
		load->weight = scale_load(sched_prio_to_weight[prio]);
		load->inv_weight = sched_prio_to_wmult[prio];
		p->se.runnable_weight = load->weight;
	}
}



/kernel/sched/fair.c
void reweight_task(struct task_struct *p, int prio)
{
>>Get the task's scheduling entity sched_entity
	struct sched_entity *se = &p->se;
>>Get the CFS ready queue (cfs_rq) the entity is on
	struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>Get the task's weight structure load_weight
	struct load_weight *load = &se->load;
>>Look up the weight for the nice level in the sched_prio_to_weight array
	unsigned long weight = scale_load(sched_prio_to_weight[prio]);
>>Update the entity's weight via reweight_entity
	reweight_entity(cfs_rq, se, weight, weight);
	load->inv_weight = sched_prio_to_wmult[prio];
}



Each nice value has a corresponding weight, stored in the sched_prio_to_weight array. The smaller the nice value, the higher the priority and the larger the weight. Each weight is roughly 0.8 of the weight one priority level above it (equivalently, adjacent weights differ by a factor of about 1.25), chosen so that each +1 step in nice costs a task roughly 10% of its CPU time.
/kernel/sched/core.c
/*
 * Nice levels are multiplicative, with a gentle 10% change for every
 * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
 * nice 1, it will get ~10% less CPU time than another CPU-bound task
 * that remained on nice 0.
 *
 * The "10% effect" is relative and cumulative: from _any_ nice level,
 * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
 * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
 * If a task goes up by ~10% and another task goes down by ~10% then
 * the relative distance between them is ~25%.)
 */
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};



static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
			    unsigned long weight, unsigned long runnable)
{
	if (se->on_rq) {
		/* commit outstanding execution time */
		if (cfs_rq->curr == se)
>>If this task is currently running on the CPU, first update its vruntime via update_curr
			update_curr(cfs_rq);
		account_entity_dequeue(cfs_rq, se);
		dequeue_runnable_load_avg(cfs_rq, se);
	}
	dequeue_load_avg(cfs_rq, se);

	se->runnable_weight = runnable;
>>After updating vruntime, install the new weight
	update_load_set(&se->load, weight);
	...
}


static inline void update_load_set(struct load_weight *lw, unsigned long w)
{
>>Install the new weight: assign w (for the new nice value) to lw->weight
	lw->weight = w;
	lw->inv_weight = 0;
}


/*
 * Update the current task's runtime statistics.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec;

	if (unlikely(!curr))
		return;
>>Compute the time run in the latest stint: delta_exec
	delta_exec = now - curr->exec_start;
	if (unlikely((s64)delta_exec <= 0))
		return;
>>Reset exec_start
	curr->exec_start = now;

	schedstat_set(curr->statistics.exec_max,
		      max(delta_exec, curr->statistics.exec_max));
>>Update sum_exec_runtime: add the latest stint's delta_exec
	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq->exec_clock, delta_exec);
>>Update the entity's vruntime: add the virtual runtime of the latest stint
	curr->vruntime += calc_delta_fair(delta_exec, curr);
>>Update the cfs_rq's min_vruntime
	update_min_vruntime(cfs_rq);

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
		cgroup_account_cputime(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}

	account_cfs_rq_runtime(cfs_rq, delta_exec);
}


/*
 * delta /= w
 */
static inline u64 calc_delta_fair(u64 delta, struct sched_entity *se)
{
	if (unlikely(se->load.weight != NICE_0_LOAD))
>>Compute the vruntime increment via __calc_delta
>>delta is the time run since last getting the CPU, NICE_0_LOAD is the weight for nice 0, se->load is the entity's weight
		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);

	return delta;
}


As the comment says, the formula is Δvruntime = delta_exec * NICE_0_LOAD / lw.weight; the actual computation optimizes this slightly with fixed-point arithmetic. So the higher a task's priority, the larger its weight, the more slowly its vruntime grows, and the more CPU time it gets. For a task with nice 0, the virtual runtime Δvruntime equals the actual runtime delta_exec.
/*
 * delta_exec * weight / lw.weight
 *   OR
 * (delta_exec * (weight * lw->inv_weight)) >> WMULT_SHIFT
 *
 * Either weight := NICE_0_LOAD and lw \e sched_prio_to_wmult[], in which case
 * we're guaranteed shift stays positive because inv_weight is guaranteed to
 * fit 32 bits, and NICE_0_LOAD gives another 10 bits; therefore shift >= 22.
 *
 * Or, weight =< lw.weight (because lw.weight is the runqueue weight), thus
 * weight/lw.weight <= 1, and therefore our shift will also be positive.
 */
static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw)
{
	u64 fact = scale_load_down(weight);
	int shift = WMULT_SHIFT;

	__update_inv_weight(lw);

	if (unlikely(fact >> 32)) {
		while (fact >> 32) {
			fact >>= 1;
			shift--;
		}
	}

	fact = mul_u32_u32(fact, lw->inv_weight);

	while (fact >> 32) {
		fact >>= 1;
		shift--;
	}

	return mul_u64_u32_shr(delta_exec, fact, shift);
}

5. Summary and observations

(1) setThreadPriority can only set the priority of normal threads; it has no effect on realtime threads.
(2) When setThreadScheduler is used with a normal policy, the priority argument (the nice value) must be 0; with a realtime policy, it must be in the range 1~99.
(3) Since Android sets realtime priorities in the range 1~99, the corresponding normalized priority range is 0~98, so a normalized priority of 99 cannot be reached from Android?
(4) For a normal thread, nice maps to normal_prio as normal_prio = static_prio = DEFAULT_PRIO + nice = 120 + nice; for a realtime thread, the priority value rt_priority set from Android maps as normal_prio = MAX_RT_PRIO - 1 - rt_priority = 99 - rt_priority.

(5) The smaller a thread's nice value, the higher its priority, the larger its weight, the more slowly its vruntime grows, and the more CPU time it gets. vruntime is computed as:

Δvruntime = delta_exec * NICE_0_LOAD / lw.weight;  vruntime += Δvruntime;
