Linux Kernel Deep Dive (Processes) — Process Scheduling

Process scheduling means switching from one process to another. A scheduler must answer three questions:
1. When does scheduling happen, i.e. at what moments is a switch triggered?
2. Which process replaces the current one?
3. How are the two processes actually switched?

1. Process Classification

By the resource that limits the process:
1. I/O-bound: performs I/O frequently and spends most of its time waiting for I/O operations to complete;
2. CPU-bound: spends most of its time doing computation on the CPU.

By function:
1. Interactive processes: interact with the user and spend time waiting for the mouse, keyboard, and other peripherals. Their latency must stay low or the user experience suffers. Typical examples are shells and editors;
2. Batch processes: run in the background without user interaction and do not need fast response times. A typical example is a compiler;
3. Real-time processes: must be scheduled promptly. Typical examples are industrial control programs.

2. Scheduling-Related Members of task_struct

An excerpt of the scheduling-related members of task_struct in Linux 5.6.4:

	int				prio;                    /* dynamic priority */
	int				static_prio;             /* static priority */
	int				normal_prio;             /* normal priority */
	unsigned int			rt_priority;     /* real-time priority */

	const struct sched_class	*sched_class;/* scheduling class */
	struct sched_entity		se;              /* CFS scheduling entity */
	struct sched_rt_entity		rt;          /* RT scheduling entity */
#ifdef CONFIG_CGROUP_SCHED
	struct task_group		*sched_task_group;/* group scheduling */
#endif
	struct sched_dl_entity		dl;          /* DL scheduling entity */
	unsigned int			policy;          /* scheduling policy */
	int				nr_cpus_allowed;

3. Assigning Scheduling-Related Members at Process Creation

《深入Linux内核(进程篇)—进程创建与退出》 covered process creation: _do_fork creates the process, and the scheduling-related members of the task_struct are set up in sched_fork, which _do_fork calls.
sched_fork does the following:
1. Calls __sched_fork to initialize the child's scheduling-entity members;
2. If the SCHED_RESET_ON_FORK policy flag is set, forces the child's priority and scheduling policy back to those of a normal process.
3. SCHED_RESET_ON_FORK is set through the sched_setscheduler system call; once the parent sets it, the child inherits it, so checking the child's sched_reset_on_fork flag is enough.
4. Sets the process's scheduling class.
5. Calls the class's p->sched_class->task_fork method; only the CFS class implements it (task_fork_fair), which initializes the child entity's vruntime (virtual runtime).

/*
 * fork()/clone()-time setup:
 */
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
	unsigned long flags;
	/* initialize the scheduling entity */
	__sched_fork(clone_flags, p);
	/*
	 * We mark the process as NEW here. This guarantees that
	 * nobody will actually run it, and a signal or other external
	 * event cannot wake it up and insert it on the runqueue either.
	 */
	p->state = TASK_NEW;

	/*
	 * Make sure we do not leak PI boosting priority to the child.
	 */
	p->prio = current->normal_prio;

	uclamp_fork(p);

	/*
	 * Revert to default priority/policy on fork if requested.
	 */
	 /* if SCHED_RESET_ON_FORK is set, force the child's priority and policy back to a normal process */
	if (unlikely(p->sched_reset_on_fork)) {
		if (task_has_dl_policy(p) || task_has_rt_policy(p)) {
			p->policy = SCHED_NORMAL;
			p->static_prio = NICE_TO_PRIO(0);
			p->rt_priority = 0;
		} else if (PRIO_TO_NICE(p->static_prio) < 0)
			p->static_prio = NICE_TO_PRIO(0);

		p->prio = p->normal_prio = __normal_prio(p);
		set_load_weight(p, false);

		/*
		 * We don't need the reset flag anymore after the fork. It has
		 * fulfilled its duty:
		 */
		p->sched_reset_on_fork = 0;
	}
	
 	/* pick the scheduling class */
	if (dl_prio(p->prio))
		return -EAGAIN;
	else if (rt_prio(p->prio))
		p->sched_class = &rt_sched_class; /* real-time class */
	else
		p->sched_class = &fair_sched_class; /* CFS class */

	init_entity_runnable_average(&p->se);

	/*
	 * The child is not yet in the pid-hash so no cgroup attach races,
	 * and the cgroup is pinned to this child due to cgroup_fork()
	 * is ran before sched_fork().
	 *
	 * Silence PROVE_RCU.
	 */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	/*
	 * We're setting the CPU for the first time, we don't migrate,
	 * so use __set_task_cpu().
	 */
	__set_task_cpu(p, smp_processor_id());
	/* Invoke the class's task_fork method; only the CFS class implements it
	   (task_fork_fair), initializing the child entity's vruntime. */
	if (p->sched_class->task_fork)
		p->sched_class->task_fork(p);
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);

#ifdef CONFIG_SCHED_INFO
	if (likely(sched_info_on()))
		memset(&p->sched_info, 0, sizeof(p->sched_info));
#endif
#if defined(CONFIG_SMP)
	p->on_cpu = 0;
#endif
	init_task_preempt_count(p);
#ifdef CONFIG_SMP
	plist_node_init(&p->pushable_tasks, MAX_PRIO);
	RB_CLEAR_NODE(&p->pushable_dl_tasks);
#endif
	return 0;
}

4. Priority

	int				prio;                    /* dynamic priority */
	int				static_prio;             /* static priority */
	int				normal_prio;             /* normal priority */
	unsigned int			rt_priority;     /* real-time priority */

The kernel represents process priority with the range 0-139; lower values mean higher priority. Priorities 0-99 are for real-time processes, 100-139 for normal processes.
The MAX_RT_PRIO macro bounds the real-time priorities; its value is 100.
The DEFAULT_PRIO macro is the default priority of a normal process, 120: a newly created normal process gets 120, adjustable via the nice system call.
Nice values range from -20 to 19; negative values raise a process's priority, positive values lower it.
Real-time priorities span 0..MAX_RT_PRIO-1 (0-99); normal priorities span MAX_RT_PRIO..MAX_PRIO-1 (100-139).

#define MAX_NICE	19
#define MIN_NICE	-20
#define NICE_WIDTH	(MAX_NICE - MIN_NICE + 1)

/*
 * Priority of a process goes from 0..MAX_PRIO-1, valid RT
 * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
 * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority
 * values are inverted: lower p->prio value means higher priority.
 *
 * The MAX_USER_RT_PRIO value allows the actual maximum
 * RT priority to be separate from the value exported to
 * user-space.  This allows kernel threads to set their
 * priority to a value higher than any user task. Note:
 * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO.
 */

#define MAX_USER_RT_PRIO	100
#define MAX_RT_PRIO		MAX_USER_RT_PRIO

#define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)
#define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)
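
The NICE_TO_PRIO/PRIO_TO_NICE conversions used by sched_fork above come from the same header (include/linux/sched/prio.h):

/*
 * Convert user-nice values [ -20 ... 0 ... 19 ]
 * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
 * and back.
 */
#define NICE_TO_PRIO(nice)	((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)	((prio) - DEFAULT_PRIO)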

static_prio
The static priority. It does not change over time, but can be modified with the nice or sched_setscheduler system calls.
normal_prio
A priority computed from static_prio and the scheduling policy; on fork, the child inherits the parent's normal_prio. For normal processes normal_prio equals static_prio; for real-time processes it is derived from rt_priority.
prio
The dynamic priority, the value the scheduler actually consults.
rt_priority
The real-time priority.
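
From user space these fields are reached through nice values. A minimal sketch using the standard getpriority/setpriority APIs (the static_prio arithmetic in the printf is ours, following NICE_TO_PRIO above):

#include <sys/resource.h>
#include <errno.h>
#include <stdio.h>

int main(void)
{
	errno = 0;	/* getpriority() can legitimately return -1 */
	int nice_val = getpriority(PRIO_PROCESS, 0);	/* 0 = this process */
	if (nice_val == -1 && errno != 0) {
		perror("getpriority");
		return 1;
	}
	/* static_prio = DEFAULT_PRIO (120) + nice, per NICE_TO_PRIO */
	printf("nice %d -> static_prio %d\n", nice_val, 120 + nice_val);

	/* Raising the nice value (lowering priority) needs no privilege. */
	if (setpriority(PRIO_PROCESS, 0, nice_val + 1) != 0)
		perror("setpriority");
	return 0;
}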

/*
 * __normal_prio - return the priority that is based on the static prio
 */
static inline int __normal_prio(struct task_struct *p)
{
	return p->static_prio;
}

/*
 * Calculate the expected normal priority: i.e. priority
 * without taking RT-inheritance into account. Might be
 * boosted by interactivity modifiers. Changes upon fork,
 * setprio syscalls, and whenever the interactivity
 * estimator recalculates.
 */
static inline int normal_prio(struct task_struct *p)
{
	int prio;

	if (task_has_dl_policy(p))
		prio = MAX_DL_PRIO-1;
	else if (task_has_rt_policy(p))
		prio = MAX_RT_PRIO-1 - p->rt_priority;
	else
		prio = __normal_prio(p);
	return prio;
}

/*
 * Calculate the current priority, i.e. the priority
 * taken into account by the scheduler. This value might
 * be boosted by RT tasks, or might be boosted by
 * interactivity modifiers. Will be RT if the task got
 * RT-boosted. If not then it returns p->normal_prio.
 */
static int effective_prio(struct task_struct *p)
{
	p->normal_prio = normal_prio(p);
	/*
	 * If we are RT tasks or we were boosted to RT priority,
	 * keep the priority unchanged. Otherwise, update priority
	 * to the normal priority:
	 */
	if (!rt_prio(p->prio))
		return p->normal_prio;
	return p->prio;
}

5. Scheduling Classes

The kernel provides five scheduling classes, each supplying a set of methods that implement scheduling.
The task_struct member that names a task's scheduling class:

	const struct sched_class	*sched_class;/* scheduling class */

The five classes, in order:

kernel/sched/stop_task.c
const struct sched_class stop_sched_class = {
	.next			= &dl_sched_class,
	……
};
kernel/sched/deadline.c
const struct sched_class dl_sched_class = {
	.next			= &rt_sched_class,
	……
};
kernel/sched/rt.c
const struct sched_class rt_sched_class = {
	.next			= &fair_sched_class,
	……
};
kernel/sched/fair.c
const struct sched_class fair_sched_class = {
	.next			= &idle_sched_class,
	……
};
kernel/sched/idle.c
const struct sched_class idle_sched_class = {
	/* .next is NULL */
	……
};

Each higher-priority class's next pointer links to the next-lower-priority class. This ordering shows up in how schedule() picks the next task to run.

static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	struct rq *rq;
	……
	next = pick_next_task(rq, prev, &rf); /* pick the task to run after the switch */
	……
}
/* the highest-priority scheduling class */
#ifdef CONFIG_SMP
#define sched_class_highest (&stop_sched_class)
#else
#define sched_class_highest (&dl_sched_class)
#endif
/* iterate over the scheduling classes */
#define for_class_range(class, _from, _to) \
	for (class = (_from); class != (_to); class = class->next)

#define for_each_class(class) \
	for_class_range(class, sched_class_highest, NULL)

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	const struct sched_class *class;
	struct task_struct *p;

	……
	/* walk the classes from highest to lowest priority; the first class with a
	   runnable task supplies next, and lower classes are not consulted */
	for_each_class(class) {
		p = class->pick_next_task(rq);
		if (p)
			return p;
	}

	/* The idle class should always have a runnable task: */
	BUG();
}

So the scheduling classes rank as follows:

stop_sched_class > dl_sched_class > rt_sched_class > fair_sched_class > idle_sched_class 

stop_sched_class has the highest priority: it can preempt entities of every other class and is never preempted itself. The CPU can always pick some task to run; if nothing else is runnable, the idle task runs. Reaching idle_sched_class during the walk therefore means no other task is runnable, and idle_sched_class->pick_next_task(rq) selects rq->idle, where rq is the runqueue.

struct task_struct *pick_next_task_idle(struct rq *rq)
{
	struct task_struct *next = rq->idle;
	set_next_task_idle(rq, next, true);
	return next; /* return the idle task */
}

The sched_class data structure is as follows.

struct sched_class {
	const struct sched_class *next; /* link to the next, lower-priority class */

#ifdef CONFIG_UCLAMP_TASK
	int uclamp_enabled;
#endif

	/* add a task to the runqueue */
	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
	/* remove a task from the runqueue */
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
	/* the task voluntarily yields the CPU: it is dequeued, then re-enqueued at the tail */
	void (*yield_task)   (struct rq *rq);
	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);

	/* check whether the current task should be preempted by the new task */
	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

	/* pick the next task to run from this class */
	struct task_struct *(*pick_next_task)(struct rq *rq);

	/* put the previous task back on the runqueue */
	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
	void (*set_next_task)(struct rq *rq, struct task_struct *p, bool first);

#ifdef CONFIG_SMP
	int (*balance)(struct rq *rq, struct task_struct *prev, struct rq_flags *rf);
	/* return the CPU on whose runqueue the task should run */
	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
	/* migrate the task to the given CPU; called from set_task_cpu */
	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);

	/* called when a task is woken */
	void (*task_woken)(struct rq *this_rq, struct task_struct *task);

	/* change the task's CPU affinity */
	void (*set_cpus_allowed)(struct task_struct *p,
				 const struct cpumask *newmask);

	/* bring the runqueue online */
	void (*rq_online)(struct rq *rq);
	/* take the runqueue offline */
	void (*rq_offline)(struct rq *rq);
#endif

	/* called from the scheduler_tick timer; for CFS, update_curr accounts vruntime here */
	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
	/* called from do_fork; for CFS, initializes the child's vruntime */
	void (*task_fork)(struct task_struct *p);
	/* called during a switch when the previous task's state is TASK_DEAD */
	void (*task_dead)(struct task_struct *p);

	/*
	 * The switched_from() call is allowed to drop rq->lock, therefore we
	 * cannot assume the switched_from/switched_to pair is serialized by
	 * rq->lock. They are however serialized by p->pi_lock.
	 */
	/* __sched_setscheduler calls check_class_changed: changing a task's scheduling policy triggers a class switch and a priority change */
	void (*switched_from)(struct rq *this_rq, struct task_struct *task);
	void (*switched_to)  (struct rq *this_rq, struct task_struct *task);
	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
			      int oldprio);

	/* for the sched_rr_get_interval syscall: returns the round-robin timeslice; non-RR classes return 0 */
	unsigned int (*get_rr_interval)(struct rq *rq,
					struct task_struct *task);

	/* update the current task's runtime statistics */
	void (*update_curr)(struct rq *rq);

#define TASK_SET_GROUP		0
#define TASK_MOVE_GROUP		1

#ifdef CONFIG_FAIR_GROUP_SCHED
	void (*task_change_group)(struct task_struct *p, int type);
#endif
};

6. Scheduling Entities

task_struct has three scheduling-entity members:
the CFS entity, for normal processes;
the RT entity, for real-time processes;
the DL entity, for deadline processes.

	struct sched_entity		se;              /* CFS scheduling entity */
	struct sched_rt_entity		rt;          /* RT scheduling entity */
	struct sched_dl_entity		dl;          /* DL scheduling entity */

Scheduling entities exist because the unit Linux schedules may be a single process or a task group (sched_task_group); the entity abstracts the object being scheduled.
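
For instance, the CFS entity (abridged here from include/linux/sched.h; most fields trimmed) carries the per-entity CFS bookkeeping, notably vruntime, so the same code can schedule either a task or a group:

struct sched_entity {
	/* For load-balancing: */
	struct load_weight		load;
	struct rb_node			run_node;   /* node in the CFS red-black tree */
	unsigned int			on_rq;      /* queued on a runqueue? */

	u64				exec_start;
	u64				sum_exec_runtime;
	u64				vruntime;   /* virtual runtime, CFS's ordering key */
	……
};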

7. Scheduling Policies

The rules that determine when, and in what way, a new process is picked to run are called the scheduling policy. Userspace sets a process's policy with the sched_setscheduler() system call; a userspace sketch follows the policy descriptions below.
Linux defines six scheduling policies:

/*
 * Scheduling policies
 */
#define SCHED_NORMAL		0
#define SCHED_FIFO		1
#define SCHED_RR		2
#define SCHED_BATCH		3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE		5
#define SCHED_DEADLINE		6

SCHED_NORMAL
The ordinary time-sharing policy for normal processes, implemented by the CFS scheduler.
SCHED_FIFO
Real-time, first-in first-out: higher priority preempts lower; equal-priority tasks run in arrival order. Implemented by the RT scheduler.
SCHED_RR
Real-time, round-robin: higher priority preempts lower; equal-priority tasks rotate on timeslices, and a task whose slice expires goes to the tail of its queue. Implemented by the RT scheduler.
SCHED_BATCH
A time-sharing policy similar to SCHED_NORMAL, scheduled by dynamic priority and implemented by CFS.
SCHED_IDLE
A time-sharing policy for extremely low-priority tasks, which run only when the CPU has nothing else to execute. Implemented by CFS.
SCHED_DEADLINE
A real-time policy for bursty workloads that are highly sensitive to latency and completion time, based on the Earliest Deadline First (EDF) algorithm. Implemented by the deadline scheduler.
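
As promised, a minimal userspace sketch (POSIX sched API; requires root or CAP_SYS_NICE, and the priority value 50 is an arbitrary choice) that switches the calling process to SCHED_FIFO:

#include <sched.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 50 };

	/* Become a SCHED_FIFO task at rt_priority 50. */
	if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("policy is now %d (SCHED_FIFO)\n", sched_getscheduler(0));
	return 0;
}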

8. How Classes, Entities, and Policies Relate

Ordered from highest to lowest class priority:

Scheduling class     Scheduling policy           Scheduling entity    Algorithm
stop_sched_class     -                           -                    -
dl_sched_class       SCHED_DEADLINE              sched_dl_entity      EDF
rt_sched_class       SCHED_RR / SCHED_FIFO       sched_rt_entity      RR/FIFO
fair_sched_class     SCHED_NORMAL / SCHED_BATCH  sched_entity         CFS
idle_sched_class     -                           -                    -

9. Runqueues

The kernel allocates one runqueue per CPU (a per-CPU variable) to organize the processes that run on that CPU.
this_rq() returns the current CPU's runqueue.
task_rq() returns the runqueue of the CPU a given task is on.
cpu_curr() returns the task currently running on a given CPU's runqueue.

DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

#define cpu_rq(cpu)		(&per_cpu(runqueues, (cpu)))
#define this_rq()		this_cpu_ptr(&runqueues)
#define task_rq(p)		cpu_rq(task_cpu(p))
#define cpu_curr(cpu)		(cpu_rq(cpu)->curr)
#define raw_rq()		raw_cpu_ptr(&runqueues)

struct rq is the main per-CPU runqueue structure. It records everything a runqueue needs: the CFS runqueue struct cfs_rq, the real-time runqueues struct rt_rq and struct dl_rq, load-weight information, the currently running task struct task_struct __rcu *curr, the idle task struct task_struct *idle, and more.

/*
 * This is the main, per-CPU runqueue data structure.
 */
struct rq {
	/* runqueue lock: */
	raw_spinlock_t		lock; /* spinlock protecting the runqueue */

	/*
	 * nr_running and cpu_load should be in the same cacheline because
	 * remote CPUs use both these fields when doing load calculation.
	 */
	unsigned int		nr_running; /* number of runnable tasks on this runqueue */
	…………

	unsigned long		nr_load_updates;
	u64			nr_switches; /* context-switch count */

#ifdef CONFIG_UCLAMP_TASK
	/* Utilization clamp values based on CPU's RUNNABLE tasks */
	struct uclamp_rq	uclamp[UCLAMP_CNT] ____cacheline_aligned;
	unsigned int		uclamp_flags;
#define UCLAMP_FLAG_IDLE 0x01
#endif

	struct cfs_rq		cfs;/* CFS runqueue, for normal tasks */
	struct rt_rq		rt;/* RT runqueue, for real-time tasks */
	struct dl_rq		dl;/* DL runqueue, for deadline tasks */

#ifdef CONFIG_FAIR_GROUP_SCHED
	/* list of leaf cfs_rq on this CPU: */
	struct list_head	leaf_cfs_rq_list;
	struct list_head	*tmp_alone_branch;
#endif /* CONFIG_FAIR_GROUP_SCHED */

	/*
	 * This is part of a global counter where only the total sum
	 * over all CPUs matters. A task can increase this counter on
	 * one CPU and if it got migrated afterwards it may decrease
	 * it on another CPU. Always updated under the runqueue lock:
	 */
	unsigned long		nr_uninterruptible;

	struct task_struct __rcu	*curr;/* task currently running on this CPU */
	struct task_struct	*idle;/* the idle task */
	struct task_struct	*stop;/* the stop task */
	unsigned long		next_balance;
	struct mm_struct	*prev_mm;/* holds the outgoing task's mm during a switch */

	unsigned int		clock_update_flags;
	u64			clock;
	/* Ensure that all clocks are in the same cache line */
	u64			clock_task ____cacheline_aligned;
	u64			clock_pelt;
	unsigned long		lost_idle_time;

	atomic_t		nr_iowait;/* number of tasks waiting for I/O completion */

#ifdef CONFIG_MEMBARRIER
	int membarrier_state;
#endif

#ifdef CONFIG_SMP
	struct root_domain		*rd;
	struct sched_domain __rcu	*sd;/* scheduling domain */

	unsigned long		cpu_capacity;
	unsigned long		cpu_capacity_orig;

	struct callback_head	*balance_callback;

	unsigned char		idle_balance;

	unsigned long		misfit_task_load;

	/* For active balancing */
	int			active_balance;
	int			push_cpu;
	struct cpu_stop_work	active_balance_work;

	/* CPU of this runqueue: */
	int			cpu;
	int			online;

	struct list_head cfs_tasks;

	struct sched_avg	avg_rt;
	struct sched_avg	avg_dl;
#ifdef CONFIG_HAVE_SCHED_AVG_IRQ
	struct sched_avg	avg_irq;
#endif
	u64			idle_stamp;
	u64			avg_idle;

	/* This is used to determine avg_idle's max value */
	u64			max_idle_balance_cost;
#endif

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
	u64			prev_irq_time;
#endif
#ifdef CONFIG_PARAVIRT
	u64			prev_steal_time;
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
	u64			prev_steal_time_rq;
#endif

	/* calc_load related fields */
	unsigned long		calc_load_update;
	long			calc_load_active;

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
	int			hrtick_csd_pending;
	call_single_data_t	hrtick_csd;
#endif
	struct hrtimer		hrtick_timer;
#endif

#ifdef CONFIG_SCHEDSTATS
	/* latency stats */
	struct sched_info	rq_sched_info;
	unsigned long long	rq_cpu_time;
	/* could above be rq->cfs_rq.exec_clock + rq->rt_rq.rt_runtime ? */

	/* sys_sched_yield() stats */
	unsigned int		yld_count;

	/* schedule() stats */
	unsigned int		sched_count;
	unsigned int		sched_goidle;

	/* try_to_wake_up() stats */
	unsigned int		ttwu_count;
	unsigned int		ttwu_local;
#endif

#ifdef CONFIG_SMP
	struct llist_head	wake_list;
#endif

#ifdef CONFIG_CPU_IDLE
	/* Must be inspected within a rcu lock section */
	struct cpuidle_state	*idle_state;
#endif
};

10. schedule

schedule is the entry point for process scheduling; it delegates to __schedule, the scheduler's core function.
__schedule is triggered in the following situations:

  1. Explicit blocking: the process takes a mutex, semaphore, waitqueue, etc.;
  2. On return from interrupts and on return from system calls to user space, TIF_NEED_RESCHED is checked; if it is set, a reschedule occurs. The flag may be set from the scheduler_tick timer handler.
  3. A task being woken does not call schedule right away: it is added to the run-queue and TIF_NEED_RESCHED is set. When the woken task is actually scheduled depends on whether the kernel is preemptible (CONFIG_PREEMPTION).
    With preemption enabled (CONFIG_PREEMPTION=y):
    1) if the wakeup happens in syscall or exception context, preemption is checked at the next preempt_enable();
    2) if the wakeup happens in hard-interrupt (IRQ) context, preemption is checked before the interrupt handler returns.
    With preemption disabled (CONFIG_PREEMPTION unset), scheduling happens at:
    1) a cond_resched() call by the current task;
    2) an explicit schedule() call;
    3) return from a syscall or exception to user space;
    4) return from an interrupt handler to user space.

Note that "return from an interrupt" and "return from an interrupt to user space" are different things: the former checks for preemption on every interrupt return, whether the interrupt landed in kernel or user space; the latter checks only when the interrupt landed in user space. A hedged sketch of the two paths follows.
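
To make the distinction concrete, here is a minimal sketch (illustrative only; the real checks live in architecture-specific entry code, and the two function names below are invented for this sketch — only test_tsk_thread_flag, preempt_count, and preempt_schedule_irq are real kernel helpers):

#include <linux/sched.h>
#include <linux/preempt.h>

/* Return to user space: checked on every config, preemptible or not. */
static void sketch_irq_exit_to_user(struct task_struct *curr)
{
	if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED))
		schedule();	/* reschedule before resuming user code */
}

/* Return to kernel space: only checked when CONFIG_PREEMPTION=y. */
static void sketch_irq_exit_to_kernel(struct task_struct *curr)
{
#ifdef CONFIG_PREEMPTION
	if (test_tsk_thread_flag(curr, TIF_NEED_RESCHED) &&
	    !preempt_count())
		preempt_schedule_irq();	/* the kernel's helper for this case */
#endif
}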

/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *
 *      Now, if the new task added to the run-queue preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *
 *       - If the kernel is preemptible (CONFIG_PREEMPTION=y):
 *
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *
 *       - If the kernel is not preemptible (CONFIG_PREEMPTION is not set)
 *         then at the next:
 *
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
 *
 * WARNING: must be called with preemption disabled!
 */

__schedule works as follows:
1. Find prev, the task to be switched out: the task recorded as current in the runqueue.
2. Find next, the task to switch to, by calling pick_next_task, which relies on the classes' pick_next_task methods described in the scheduling-class section.
3. Call context_switch to switch to next.

static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr; /* prev is the task currently running */

	schedule_debug(prev, preempt);

	if (sched_feat(HRTICK))
		hrtick_clear(rq);

	local_irq_disable();
	rcu_note_context_switch(preempt);

	/*
	 * Make sure that signal_pending_state()->signal_pending() below
	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
	 * done by the caller to avoid the race with signal_wake_up().
	 *
	 * The membarrier system call requires a full memory barrier
	 * after coming from user-space, before storing to rq->curr.
	 */
	rq_lock(rq, &rf);
	smp_mb__after_spinlock();

	/* Promote REQ to ACT */
	rq->clock_update_flags <<= 1;
	update_rq_clock(rq);

	switch_count = &prev->nivcsw;
	if (!preempt && prev->state) {
		if (signal_pending_state(prev->state, prev)) {
			prev->state = TASK_RUNNING;
		} else {
			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);

			if (prev->in_iowait) {
				atomic_inc(&rq->nr_iowait);
				delayacct_blkio_start();
			}
		}
		switch_count = &prev->nvcsw;
	}

	/* ask the scheduling classes' pick_next_task methods for the task to switch to */
	next = pick_next_task(rq, prev, &rf);
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();

	/* only call context_switch if next differs from the current task */
	if (likely(prev != next)) {
		rq->nr_switches++;/* bump the runqueue's switch count */
		/*
		 * RCU users of rcu_dereference(rq->curr) may not see
		 * changes to task_struct made by pick_next_task().
		 */
		RCU_INIT_POINTER(rq->curr, next);
		/*
		 * The membarrier system call requires each architecture
		 * to have a full memory barrier after updating
		 * rq->curr, before returning to user-space.
		 *
		 * Here are the schemes providing that barrier on the
		 * various architectures:
		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
		 * - finish_lock_switch() for weakly-ordered
		 *   architectures where spin_unlock is a full barrier,
		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
		 *   is a RELEASE barrier),
		 */
		++*switch_count;

		trace_sched_switch(preempt, prev, next);

		/* Also unlocks the rq: */
		/* context_switch performs the actual switch */
		rq = context_switch(rq, prev, next, &rf);
	} else {
		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
		rq_unlock_irq(rq, &rf);
	}

	balance_callback(rq);
}

A process switch consists of two parts:

  1. switching the page global directory to install the new address space;
  2. switching the kernel-mode stack and the hardware context.

context_switch implements both.
1. It checks whether next->mm is NULL to decide whether the incoming task is a kernel thread, since a kernel thread's mm_struct *mm is always NULL; see the memory-descriptor section of 《深入Linux内核(进程篇)—进程描述》.
2. A kernel thread borrows prev's active_mm: for a user process active_mm == mm, while for a kernel thread mm == NULL and active_mm == prev->active_mm.
3. If prev->mm is non-NULL, prev is a user process, and mmgrab raises the mm->mm_count reference count.
4. For a kernel thread, lazy TLB mode is entered to avoid useless TLB flushes; see the TLB section of 《深入Linux内核(内存篇)-内存管理》. enter_lazy_tlb is architecture-specific.
5. For a user process, switch_mm_irqs_off (or switch_mm), also architecture-specific, switches the user address space.
6. switch_to, again architecture-specific, switches the kernel stack and the hardware context.
7. Once switch_to completes, next owns the CPU and prev goes to sleep.
8. finish_task_switch runs; if prev was a kernel thread, mmdrop drops the borrowed mm's reference count, and when it reaches zero all descriptors tied to the page tables and the virtual memory are freed.

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	prepare_task_switch(rq, prev, next);

	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	/*
	 * kernel -> kernel   lazy + transfer active
	 *   user -> kernel   lazy + mmgrab() active
	 *
	 * kernel ->   user   switch + mmdrop() active
	 *   user ->   user   switch
	 */
	if (!next->mm) {                                // to kernel
		enter_lazy_tlb(prev->active_mm, next);

		next->active_mm = prev->active_mm;
		if (prev->mm)                           // from user
			mmgrab(prev->active_mm);
		else
			prev->active_mm = NULL;
	} else {                                        // to user
		membarrier_switch_mm(rq, prev->active_mm, next->mm);
		/*
		 * sys_membarrier() requires an smp_mb() between setting
		 * rq->curr / membarrier_switch_mm() and returning to userspace.
		 *
		 * The below provides this either through switch_mm(), or in
		 * case 'prev->active_mm == next->mm' through
		 * finish_task_switch()'s mmdrop().
		 */
		 /* switch_mm_irqs_off switches the user address space */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);

		if (!prev->mm) {                        // from kernel
			/* will mmdrop() in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}

	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

	prepare_lock_switch(rq, next, rf);

	/* Here we just switch the register state and the stack. */
	/* switch_to switches the kernel stack and hardware context */
	switch_to(prev, next, prev);
	barrier();

	return finish_task_switch(prev);
}

11. scheduler_tick

The schedule section showed that the scheduling points hinge on the TIF_NEED_RESCHED flag, and that flag is set from the scheduler_tick timer interrupt:

To drive preemption between tasks, the scheduler sets the flag in timer
 interrupt handler scheduler_tick().

scheduler_tick is invoked by the timer code at HZ frequency.

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;
	struct rq_flags rf;

	sched_clock_tick();

	rq_lock(rq, &rf);

	/* update the runqueue's clock and clock_task counters */
	update_rq_clock(rq);
	/* the class's task_tick method; for the CFS and RR policies this is where
	   TIF_NEED_RESCHED gets set; other policies do not set it here */
	curr->sched_class->task_tick(rq, curr, 0);
	calc_global_load_tick(rq);
	psi_task_tick(rq);

	rq_unlock(rq, &rf);

	perf_event_task_tick();

#ifdef CONFIG_SMP
	rq->idle_balance = idle_cpu(cpu);
	trigger_load_balance(rq);
#endif
}

12. Process Wakeup

A sleeping process can passively wait to be scheduled (picked as next during a switch) or be explicitly woken by another process via wake_up_process.
wake_up_process is just a call to try_to_wake_up with TASK_NORMAL (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE) as its second argument, i.e. it wakes processes in either sleep state.

int wake_up_process(struct task_struct *p)
{
	return try_to_wake_up(p, TASK_NORMAL, 0);
}
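
A common in-kernel use (a minimal sketch, not taken from this article's sources): kthread_create leaves the new thread asleep, and wake_up_process is what first makes it runnable — this is exactly what kthread_run wraps:

#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/err.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
	while (!kthread_should_stop()) {
		set_current_state(TASK_INTERRUPTIBLE);
		schedule();	/* sleep until woken again or stopped */
	}
	return 0;
}

static int start_worker(void)
{
	/* The thread is created sleeping... */
	worker = kthread_create(worker_fn, NULL, "demo_worker");
	if (IS_ERR(worker))
		return PTR_ERR(worker);
	/* ...and wake_up_process() makes it runnable for the first time. */
	wake_up_process(worker);
	return 0;
}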

try_to_wake_up takes:

  • task_struct *p: the task to wake;
  • unsigned int state: the mask of task states eligible for wakeup;
  • int wake_flags: wakeup flags: WF_SYNC, WF_FORK, and WF_MIGRATED.

It returns:

  • int success: whether the task's state changed; 1 means the task was woken, 0 means it was not.

try_to_wake_up is implemented as follows.

static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
{
	unsigned long flags;
	int cpu, success = 0;

	preempt_disable();/* disable preemption */
	if (p == current) {
		/*
		 * We're waking current, this means 'p->on_rq' and 'task_cpu(p)
		 * == smp_processor_id()'. Together this means we can special
		 * case the whole 'p->on_rq && ttwu_remote()' case below
		 * without taking any locks.
		 *
		 * In particular:
		 *  - we rely on Program-Order guarantees for all the ordering,
		 *  - we're serialized against set_special_state() by virtue of
		 *    it disabling IRQs (this allows not taking ->pi_lock).
		 */
		if (!(p->state & state))
			goto out;

		success = 1;
		cpu = task_cpu(p);
		trace_sched_waking(p);
		p->state = TASK_RUNNING;
		trace_sched_wakeup(p);
		goto out;
	}

	/*
	 * If we are going to wake up a thread waiting for CONDITION we
	 * need to ensure that CONDITION=1 done by the caller can not be
	 * reordered with p->state check below. This pairs with mb() in
	 * set_current_state() the waiting thread does.
	 */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	smp_mb__after_spinlock();
	if (!(p->state & state))
		goto unlock; /* the task's state does not match the wakeup mask; bail out */

	trace_sched_waking(p);

	/* We're going to change ->state: */
	success = 1;/* the task's state will change */
	cpu = task_cpu(p);/* the CPU the task is on */

	/*
	 * Ensure we load p->on_rq _after_ p->state, otherwise it would
	 * be possible to, falsely, observe p->on_rq == 0 and get stuck
	 * in smp_cond_load_acquire() below.
	 *
	 * sched_ttwu_pending()			try_to_wake_up()
	 *   STORE p->on_rq = 1			  LOAD p->state
	 *   UNLOCK rq->lock
	 *
	 * __schedule() (switch to task 'p')
	 *   LOCK rq->lock			  smp_rmb();
	 *   smp_mb__after_spinlock();
	 *   UNLOCK rq->lock
	 *
	 * [task p]
	 *   STORE p->state = UNINTERRUPTIBLE	  LOAD p->on_rq
	 *
	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
	 * __schedule().  See the comment for smp_mb__after_spinlock().
	 */
	smp_rmb();
	if (p->on_rq && ttwu_remote(p, wake_flags))
		goto unlock;

#ifdef CONFIG_SMP
	/*
	 * Ensure we load p->on_cpu _after_ p->on_rq, otherwise it would be
	 * possible to, falsely, observe p->on_cpu == 0.
	 *
	 * One must be running (->on_cpu == 1) in order to remove oneself
	 * from the runqueue.
	 *
	 * __schedule() (switch to task 'p')	try_to_wake_up()
	 *   STORE p->on_cpu = 1		  LOAD p->on_rq
	 *   UNLOCK rq->lock
	 *
	 * __schedule() (put 'p' to sleep)
	 *   LOCK rq->lock			  smp_rmb();
	 *   smp_mb__after_spinlock();
	 *   STORE p->on_rq = 0			  LOAD p->on_cpu
	 *
	 * Pairs with the LOCK+smp_mb__after_spinlock() on rq->lock in
	 * __schedule().  See the comment for smp_mb__after_spinlock().
	 */
	smp_rmb();

	/*
	 * If the owning (remote) CPU is still in the middle of schedule() with
	 * this task as prev, wait until its done referencing the task.
	 *
	 * Pairs with the smp_store_release() in finish_task().
	 *
	 * This ensures that tasks getting woken will be fully ordered against
	 * their previous state and preserve Program Order.
	 */
	smp_cond_load_acquire(&p->on_cpu, !VAL);

	p->sched_contributes_to_load = !!task_contributes_to_load(p);
	p->state = TASK_WAKING;

	if (p->in_iowait) {
		delayacct_blkio_end(p);
		atomic_dec(&task_rq(p)->nr_iowait);
	}

	/* choose a CPU for p */
	cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
	if (task_cpu(p) != cpu) {/* a different CPU was chosen: migrate the task */
		wake_flags |= WF_MIGRATED;
		psi_ttwu_dequeue(p);
		set_task_cpu(p, cpu);
	}

#else /* CONFIG_SMP */

	if (p->in_iowait) {
		delayacct_blkio_end(p);
		atomic_dec(&task_rq(p)->nr_iowait);
	}

#endif /* CONFIG_SMP */

	ttwu_queue(p, cpu, wake_flags);/* add the task to a runqueue */
unlock:
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);
out:
	if (success)
		ttwu_stat(p, cpu, wake_flags);
	preempt_enable();/* re-enable preemption */

	return success;
}

Adding the woken task to a runqueue is done by ttwu_queue, which in turn calls ttwu_do_activate.

static void ttwu_queue(struct task_struct *p, int cpu, int wake_flags)
{
	struct rq *rq = cpu_rq(cpu);
	struct rq_flags rf;

#if defined(CONFIG_SMP)
	if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
		sched_clock_cpu(cpu); /* Sync clocks across CPUs */
		ttwu_queue_remote(p, cpu, wake_flags);/* remote wakeup via cross-CPU communication */
		return;
	}
#endif

	rq_lock(rq, &rf);
	update_rq_clock(rq);
	ttwu_do_activate(rq, p, wake_flags, &rf);/* enqueue the task */
	rq_unlock(rq, &rf);
}

ttwu_do_activate works as follows:
1. It calls activate_task, which performs the enqueue_task insertion;
2. It calls ttwu_do_wakeup, which checks whether the current task should be preempted and sets the woken task's state to TASK_RUNNING.

static void
ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags,
		 struct rq_flags *rf)
{
	int en_flags = ENQUEUE_WAKEUP | ENQUEUE_NOCLOCK;

	lockdep_assert_held(&rq->lock);

#ifdef CONFIG_SMP
	if (p->sched_contributes_to_load)
		rq->nr_uninterruptible--;

	if (wake_flags & WF_MIGRATED)
		en_flags |= ENQUEUE_MIGRATED;
#endif
	/* enqueue_task adds the task to its class's runqueue */
	activate_task(rq, p, en_flags);
	/* check whether the current task should be preempted, and set the woken task to TASK_RUNNING */
	ttwu_do_wakeup(rq, p, wake_flags, rf);
}

Besides wake_up_process, a forked child is woken inside the do_fork flow via wake_up_new_task, which:
1. sets the task's state to TASK_RUNNING;
2. calls activate_task to put the task on its class's runqueue;
3. calls check_preempt_curr to test whether the current task should be preempted.

/*
 * wake_up_new_task - wake up a newly created task for the first time.
 *
 * This function will do some initial scheduler statistics housekeeping
 * that must be done for every newly created context, then puts the task
 * on the runqueue and wakes it.
 */
void wake_up_new_task(struct task_struct *p)
{
	struct rq_flags rf;
	struct rq *rq;

	raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
	p->state = TASK_RUNNING;/* set the task's state to TASK_RUNNING */
#ifdef CONFIG_SMP
	/*
	 * Fork balancing, do it here and not earlier because:
	 *  - cpus_ptr can change in the fork path
	 *  - any previously selected CPU might disappear through hotplug
	 *
	 * Use __set_task_cpu() to avoid calling sched_class::migrate_task_rq,
	 * as we're not fully set-up yet.
	 */
	p->recent_used_cpu = task_cpu(p);
	__set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
	rq = __task_rq_lock(p, &rf);
	update_rq_clock(rq);
	post_init_entity_util_avg(p);
	/* enqueue_task adds the task to its class's runqueue */
	activate_task(rq, p, ENQUEUE_NOCLOCK);
	trace_sched_wakeup_new(p);
	check_preempt_curr(rq, p, WF_FORK);/* check whether the current task should be preempted */
#ifdef CONFIG_SMP
	if (p->sched_class->task_woken) {
		/*
		 * Nothing relies on rq->lock after this, so its fine to
		 * drop it.
		 */
		rq_unpin_lock(rq, &rf);
		p->sched_class->task_woken(rq, p);
		rq_repin_lock(rq, &rf);
	}
#endif
	task_rq_unlock(rq, p, &rf);
}

check_preempt_curr decides whether the current task can be preempted:

  1. If the preempting task and the current task belong to the same scheduling class, the class's check_preempt_curr method decides;
  2. Otherwise the decision is made by class priority.

The implementation:

void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
{
	const struct sched_class *class;
	/* same class: let the class's check_preempt_curr method decide */
	if (p->sched_class == rq->curr->sched_class) {
		rq->curr->sched_class->check_preempt_curr(rq, p, flags);
	} else {/* different classes: decide by class priority */
		for_each_class(class) {/* walk the classes from highest to lowest priority */
			/* hit the current task's class first: it outranks the preempting task's class, so no preemption */
			if (class == rq->curr->sched_class)
				break;
			/* hit the preempting task's class first: it outranks the current task's class, so preempt */
			if (class == p->sched_class) {
				resched_curr(rq);/* request a deferred reschedule to take the CPU */
				break;
			}
		}
	}

	/*
	 * A queue event has occurred, and we're going to schedule.  In
	 * this case, we can save a useless back to back clock update.
	 */
	if (task_on_rq_queued(rq->curr) && test_tsk_need_resched(rq->curr))
		rq_clock_skip_update(rq);
}
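
resched_curr, used above to request preemption, essentially sets TIF_NEED_RESCHED on the runqueue's current task and, for a remote CPU, kicks it with a rescheduling IPI. Abridged from kernel/sched/core.c (Linux 5.6-era; lightly trimmed, so treat it as illustrative):

void resched_curr(struct rq *rq)
{
	struct task_struct *curr = rq->curr;
	int cpu;

	lockdep_assert_held(&rq->lock);

	if (test_tsk_need_resched(curr))
		return;			/* already marked */

	cpu = cpu_of(rq);

	if (cpu == smp_processor_id()) {
		/* Local CPU: set TIF_NEED_RESCHED; it is checked at the
		 * next interrupt return / preempt_enable(). */
		set_tsk_need_resched(curr);
		set_preempt_need_resched();
		return;
	}

	/* Remote CPU: send a rescheduling IPI unless it is already
	 * polling the flag. */
	if (set_nr_and_not_polling(curr))
		smp_send_reschedule(cpu);
}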

13. Scheduling-Related System Calls

The main scheduling-related system calls:

System call               Purpose
nice                      change a process's priority
get_priority              get a process's priority (nice value)
sched_yield               move the current process to the tail of the runqueue and yield the CPU, invoking schedule
sched_get_priority_min    return the minimum real-time priority for a given scheduling policy
sched_get_priority_max    return the maximum real-time priority for a given scheduling policy
sched_getscheduler        get a process's scheduling policy by PID
sched_setscheduler        change a process's scheduling policy
sched_rr_get_interval     get a process's round-robin timeslice
sched_setparam            like sched_setscheduler, but leaves the policy unchanged
sched_getparam            get a process's rt_priority
sched_getaffinity         get a process's CPU affinity
sched_setaffinity         set a process's CPU affinity
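
For illustration, a minimal userspace sketch (glibc's CPU_* macros; error handling kept short) pinning the calling process to CPU 0 with sched_setaffinity and reading the mask back:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);	/* allow CPU 0 only */
	if (sched_setaffinity(0, sizeof(set), &set) != 0) {
		perror("sched_setaffinity");
		return 1;
	}

	CPU_ZERO(&set);
	if (sched_getaffinity(0, sizeof(set), &set) == 0)
		printf("CPU 0 in mask: %d\n", CPU_ISSET(0, &set));
	return 0;
}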

14. Summary

We can now answer the three questions posed at the start:

  1. When does a process switch happen?
    The entry point is schedule(); anything that triggers schedule (whose body is __schedule) causes a switch.
    __schedule is triggered by:

    • explicit blocking: the process takes a mutex, semaphore, waitqueue, etc.;
    • the TIF_NEED_RESCHED check on return from interrupts and on return from system calls to user space; if the flag is set, a reschedule occurs. The flag may be set in the scheduler_tick timer handler.
    • a task being woken is not scheduled immediately: it is added to the run-queue and TIF_NEED_RESCHED is set. When it actually runs depends on CONFIG_PREEMPTION.
    • with preemption enabled (CONFIG_PREEMPTION=y):
      1) a wakeup in syscall or exception context is checked at the next preempt_enable();
      2) a wakeup in IRQ context is checked before the interrupt handler returns.
    • without preemption (CONFIG_PREEMPTION unset), scheduling happens at:
      1) a cond_resched() call by the current task;
      2) an explicit schedule() call;
      3) return from a syscall or exception to user space;
      4) return from an interrupt handler to user space.
  2. Which process replaces the current one?
    schedule() calls pick_next_task, which delegates to the scheduling classes' pick_next_task methods. The kernel tries the CFS class first, since under Linux most tasks are normal tasks; if entities of other classes are present, the classes are walked from highest to lowest priority to choose next.

struct task_struct *(*pick_next_task)(struct rq *rq);
  3. How are the two processes switched?
    With the first two questions settled — when schedule runs and which task replaces the current one — the switch itself can proceed. The outgoing task is the one recorded in rq->curr; __schedule calls context_switch, which switches the address space and then the kernel stack and hardware context.

This article is based on kernel version Linux 5.6.4.
