Notes on Professional Linux Kernel Architecture - Process Management and Scheduling 4: Scheduler Implementation

snoopyljc

Categories: Arm, Linux. Tags: Linux, Process

Article link: https://blog.csdn.net/snoopyljc/article/details/96508622

Overview

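The scheduler's job has two parts: one concerns scheduling policy, the other concerns context switching.
Linux's CFS scheduler does not need the traditional notion of time slices. It considers only how long each process has been waiting, and the CPU goes first to the process with the most urgent need for CPU time.
General principle of the scheduler: distribute the available computing power fairly among all processes in the system.
A computer simulates multitasking by running processes in turn, so the process that is currently running is clearly treated better than the processes waiting to be picked. In other words, the waiting processes are treated unfairly. Each time the scheduler is invoked, it therefore picks the process that has waited longest, so that unfairness does not accumulate but is spread evenly over all processes in the system.
All runnable processes are sorted by waiting time in a red-black tree. The leftmost node is the process that has waited longest for the CPU, and processes that have waited less are arranged from left to right.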
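Virtual clock: time passes more slowly on it than on the real clock. The exact speed depends on the number of processes currently waiting to be picked by the scheduler.
fair_clock: the virtual time of the run queue, a measure of the CPU time a process would have received under perfectly fair scheduling. wait_runtime: the waiting time of a process, a direct measure of the unfairness caused by the imperfection of the real system. The red-black tree is sorted by the key fair_clock - wait_runtime.
When a process is allowed to run, the time it has run is subtracted from its wait_runtime, so it moves to the right in the time-ordered tree and another process gets picked. The virtual clock also advances while the process runs, however, which means that the share of CPU time the process would have received in a perfectly fair system is deducted from the time it actually spent on the real CPU. This slows down the reduction of unfairness: decreasing wait_runtime is equivalent to lowering the unfairness the process has suffered, but the kernel must not forget that part of the time used to lower it belonged to the process anyway in the perfectly fair world. Suppose four processes are on the run queue and one of them has waited 20 s. It is now allowed to run for 10 s, after which its wait_runtime is 10. But the process would have received 10/4 = 2.5 s (about 2 s) of that span in any case, so only about 7.5 s (roughly 8 s) actually count toward its new position on the run queue. (I do not fully understand this part yet.)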

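Data structures

Generic scheduler
- It can be activated in two ways: the main scheduler is invoked when a process wants to sleep or gives up the CPU for some other reason; the periodic scheduler runs at a fixed frequency via a periodic mechanism and checks from time to time whether a process switch is necessary.
- It queries the scheduler classes to decide which process runs next. The kernel supports different scheduling policies (completely fair, real-time, idle), and scheduler classes allow these policies to be implemented in a modular way.
- After the process to run has been selected, the low-level task switch is performed.
Scheduling-related members of task_struct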
```
struct task_struct {
...
	int prio, static_prio, normal_prio;
	unsigned int rt_priority;
	struct list_head run_list;
	const struct sched_class *sched_class;
	struct sched_entity se;
	unsigned int policy;
	cpumask_t cpus_allowed;
	unsigned int time_slice;
...
};
```
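- static_prio: the static priority, assigned when the process starts. It can be changed with the nice and sched_setscheduler system calls.
- normal_prio: the priority computed from the static priority and the scheduling policy. Even if a normal process and a real-time process have the same static priority, their normal priorities differ. A child created by fork inherits the normal priority.
- prio: the priority the scheduler actually considers. The kernel sometimes needs to boost the priority of a process temporarily, which is why this separate field exists.
- rt_priority: the priority of a real-time process.
- sched_class: the scheduler class the process belongs to.
- policy: the scheduling policy applied to the process.
1. SCHED_NORMAL: used for normal processes and handled by the completely fair scheduler. SCHED_BATCH and SCHED_IDLE are also handled by the completely fair scheduler but are meant for less important processes; SCHED_IDLE tasks carry very little weight.
2. SCHED_RR and SCHED_FIFO are used for soft real-time processes.
- cpus_allowed: restricts the CPUs on which the process may run.
- run_list and time_slice: needed by the round-robin scheduler. run_list maintains a run list containing the processes, and time_slice specifies the remaining time quantum during which the process may use the CPU.
Scheduler classes: provide the connection between the generic scheduler and the individual scheduling methods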
```
struct sched_class {
	const struct sched_class *next;
	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
	void (*yield_task) (struct rq *rq);
	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p);
	struct task_struct * (*pick_next_task) (struct rq *rq);
	void (*put_prev_task) (struct rq *rq, struct task_struct *p);
	void (*set_curr_task) (struct rq *rq);
	void (*task_tick) (struct rq *rq, struct task_struct *p);
	void (*task_new) (struct rq *rq, struct task_struct *p);
}
```
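- next: links the sched_class instances of the different scheduling classes in the order real-time, completely fair, idle.
- enqueue_task: adds a process to the run queue. This happens when a process changes from sleeping to runnable.
- dequeue_task: the inverse operation; removes a process from the run queue.
- yield_task: called by the kernel when a process gives up the CPU voluntarily via sched_yield.
- check_preempt_curr: preempts the current process with a newly woken process if necessary. It is called, for example, when a new process is woken with wake_up_new_task.
- pick_next_task: selects the next process to run.
- put_prev_task: called before the currently running process is replaced by another one.
- set_curr_task: called when the scheduling policy of a process changes.
- task_tick: called by the periodic scheduler each time it is activated.
- task_new: connects the fork system call with the scheduler. Each time a new process is created, the scheduler is notified via task_new.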
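Run queue: the central data structure the core scheduler uses to manage active processes. Each CPU has its own run queue, and each active process appears on exactly one run queue.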
```
struct rq {
	unsigned long nr_running;
	#define CPU_LOAD_IDX_MAX 5
	unsigned long cpu_load[CPU_LOAD_IDX_MAX];
	...
	struct load_weight load;
	struct cfs_rq cfs;
	struct rt_rq rt;
	struct task_struct *curr, *idle;
	u64 clock;
	...
};
```
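- nr_running: the number of runnable processes on the queue.
- load: a measure of the current load on the run queue. It is essentially proportional to the number of currently active processes on the queue, each weighted by its priority. The speed of each run queue's virtual clock is based on this information.
- cpu_load: tracks the load history.
- cfs and rt: the embedded sub-run queues for the completely fair scheduler and the real-time scheduler, respectively.
- curr: points to the task_struct instance of the currently running process.
- idle: points to the task_struct instance of the idle process.
- clock and prev_raw_clock: implement the run queue's own clock. clock is updated each time the periodic scheduler is called, and it can also be updated with update_rq_clock.
Scheduling entities let the scheduler operate on entities more general than processes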
```
struct sched_entity {
	struct load_weight load; /* for load-balancing */
	struct rb_node run_node;
	unsigned int on_rq;
	u64 exec_start;
	u64 sum_exec_runtime;
	u64 vruntime;
	u64 prev_sum_exec_runtime;
	...
}
```
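- load: the weight, which determines each entity's share of the total load on the queue.
- run_node: a standard tree node that allows the entity to be sorted in the red-black tree.
- on_rq: indicates whether the entity is currently on a run queue and eligible for scheduling.
- sum_exec_runtime: records the CPU time consumed; it is accumulated continuously by update_curr.
- vruntime: the amount of time that has elapsed on the virtual clock while the process was executing.
- prev_sum_exec_runtime: when the process is taken off the CPU, its current sum_exec_runtime value is saved in prev_sum_exec_runtime.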

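Priority handling

The kernel represents process priorities with a simple numeric range from 0 to 139; the lower the value, the higher the priority. The range 0-99 is reserved for real-time processes, and nice values [-20, 19] map to the range [100, 139].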

Computing priorities

The starting point for the priority calculation is static_prio. prio is computed as p->prio = effective_prio(p)

```
static int effective_prio(struct task_struct *p)
{
	p->normal_prio = normal_prio(p);
	/*
	* If we are RT tasks or we were boosted to RT priority,
	* keep the priority unchanged. Otherwise, update priority
	* to the normal priority:
	*/
	if (!rt_prio(p->prio))
		return p->normal_prio;
	return p->prio;
}

static inline int normal_prio(struct task_struct *p)
{
	int prio;
	if (task_has_rt_policy(p))
		prio = MAX_RT_PRIO - 1 - p->rt_priority;
	else
		prio = __normal_prio(p);
	return prio;
}

static inline int __normal_prio(struct task_struct *p)
{
	return p->static_prio;
}

static inline int rt_policy(int policy)
{
	if (unlikely(policy == SCHED_FIFO) || unlikely(policy == SCHED_RR))
		return 1;
	return 0;
}

static inline int task_has_rt_policy(struct task_struct *p)
{
	return rt_policy(p->policy);
}

static inline int rt_prio(int prio)
{
	if (unlikely(prio < MAX_RT_PRIO))
		return 1;
	return 0;
}
```

Computing load weights: the load weight is stored in task_struct->se.load

```
struct task_struct {
	...
	struct sched_entity se;
	...
};
struct sched_entity {
	...
	struct load_weight	load;
	...
};
struct load_weight {
	unsigned long weight, inv_weight;
};
/*
 * Inverse (2^32/x) values of the prio_to_weight[] array, precalculated.
 *
 * In cases where the weight does not change often, we can use the
 * precalculated inverse to speed up arithmetics by turning divisions
 * into multiplications:
 */
static const u32 prio_to_wmult[40] = {
 /* -20 */     48388,     59856,     76040,     92818,    118348,
 /* -15 */    147320,    184698,    229616,    287308,    360437,
 /* -10 */    449829,    563644,    704093,    875809,   1099582,
 /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326,
 /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587,
 /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126,
 /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
 /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
static void set_load_weight(struct task_struct *p)
{
	if (task_has_rt_policy(p)) {
		p->se.load.weight = prio_to_weight[0] * 2;
		p->se.load.inv_weight = prio_to_wmult[0] >> 1;
		return;
	}

	/*
	 * SCHED_IDLE tasks get minimal weight:
	 */
	if (p->policy == SCHED_IDLE) {
		p->se.load.weight = WEIGHT_IDLEPRIO;
		p->se.load.inv_weight = WMULT_IDLEPRIO;
		return;
	}

	p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO];
	p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
}
```

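Besides the load weights themselves, the kernel also keeps a precomputed inverse of each weight, 2^32/weight (prio_to_wmult), so that divisions by the weight can be turned into multiplications.
Each nice level a process goes down gains it about 10% more CPU time, and each nice level up gives up about 10%; see prio_to_weight. Consider processes A and B both running at nice 0, with weight 1024. Each gets a CPU share of 1024/(1024 + 1024) = 50%. If B's nice value is raised by 1, A's share becomes 1024/(1024 + 820) ≈ 55% and B's share becomes 820/(1024 + 820) ≈ 45%, a difference of about 10%.
Real-time processes get twice the weight of a nice -20 process (prio_to_weight[0] * 2), and SCHED_IDLE processes always get the minimal weight.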

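Not only processes but also run queues carry a load weight. Every time a process is added to a run queue, the kernel calls inc_nr_running, which updates both the number of running processes and the weight of the run queue.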

```
static inline void inc_load(struct rq *rq, const struct task_struct *p)
{
	update_load_add(&rq->load, p->se.load.weight);
}
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
	rq->nr_running++;
	inc_load(rq, p);
}
```

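Core scheduler

Periodic scheduler
- scheduler_tick: while the system is active, the kernel calls this function automatically at frequency HZ. If no process is waiting to be scheduled, the periodic tick can be switched off to save power. The main tasks of the function:
1. Maintain the scheduling statistics the kernel keeps for the whole system and for individual processes.
2. Invoke the periodic scheduling method of the scheduler class of the current process.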
```
void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;
	u64 next_tick = rq->tick_timestamp + TICK_NSEC;

	spin_lock(&rq->lock);
	__update_rq_clock(rq);
	update_cpu_load(rq);
	if (curr != rq->idle) /* FIXME: needed? */
		curr->sched_class->task_tick(rq, curr);
	spin_unlock(&rq->lock);
	...
}
```
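__update_rq_clock: updates the run queue clock, i.e. advances the clock timestamp of the current struct rq instance.

update_cpu_load: updates the cpu_load array of the run queue. Essentially, it shifts the previously stored load values one position back and records the current run queue load in the first position.

task_tick(rq, curr): depends on the underlying scheduling class and is implemented differently in each one. If the current process should be rescheduled, the scheduler class sets the TIF_NEED_RESCHED flag in the task structure, and the kernel fulfils the request at the next suitable opportunity.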

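Main scheduler

The schedule function: it is called directly wherever the CPU is to be handed to a process other than the currently active one. In addition, when returning from a system call the kernel checks whether the current process has the TIF_NEED_RESCHED flag set, and if so it calls schedule.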

```
asmlinkage void __sched schedule(void)
{
	struct task_struct *prev, *next;
	long *switch_count;
	struct rq *rq;
	int cpu;

need_resched:
	preempt_disable();
	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	rcu_qsctr_inc(cpu);
	prev = rq->curr;
	__update_rq_clock(rq);
	clear_tsk_need_resched(prev);

	if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
		if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
				unlikely(signal_pending(prev)))) {
			prev->state = TASK_RUNNING;
		} else {
			deactivate_task(rq, prev, 1);
		}
		switch_count = &prev->nvcsw;
	}

	prev->sched_class->put_prev_task(rq, prev);
	next = pick_next_task(rq, prev);
	if (likely(prev != next)) {
		context_switch(rq, prev, next);
	}

	if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
		goto need_resched;
	...
}
```

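- __update_rq_clock: updates the run queue clock.
- clear_tsk_need_resched: clears the TIF_NEED_RESCHED flag.
- prev->state = TASK_RUNNING: if a process in interruptible sleep has a pending signal, it is promoted to a running task again.
- deactivate_task: calls the appropriate scheduler class method to deactivate the process.
- put_prev_task: tells the scheduler class that the current process is about to be replaced by another one.
- pick_next_task: selects the next process to execute.
- context_switch: performs the context switch.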

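Interaction with fork

copy_process calls sched_fork to do the scheduler-related work: it initializes the scheduling-related fields, sets up the data structures, and determines the dynamic priority of the process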

```
void sched_fork(struct task_struct *p, int clone_flags)
{
	int cpu = get_cpu();

	__sched_fork(p);

#ifdef CONFIG_SMP
	cpu = sched_balance_self(cpu, SD_BALANCE_FORK);
#endif
	set_task_cpu(p, cpu);

	/*
	 * Make sure we do not leak PI boosting priority to the child:
	 */
	p->prio = current->normal_prio;
	if (!rt_prio(p->prio))
		p->sched_class = &fair_sched_class;
}
```

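When a new process is woken with wake_up_new_task, the kernel calls the task_new function of the scheduler class, which places the new process on the run queue of the corresponding class.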

Context switching

After selecting a new process, the kernel must perform a context switch with context_switch, which calls architecture-specific methods

```
static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);
	mm = next->mm;
	oldmm = prev->active_mm;
	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_enter_lazy_cpu_mode();

	if (unlikely(!mm)) {
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm(oldmm, mm, next);

	if (unlikely(!prev->mm)) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}
	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);

	barrier();
	/*
	 * this_rq must be evaluated again because prev may have moved
	 * CPUs since it called schedule(), thus the 'rq' on its stack
	 * frame will be invalid.
	 */
	finish_task_switch(this_rq(), prev);
}
```

switch_mm: switches the memory-management context: page tables, TLB, and so on
switch_to: switches the processor register contents and the kernel stack

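The subtlety of switch_to
Consider the switch sequence A->B->C->A. When A runs again, prev should point to C. But if the stack is simply restored to its earlier state, prev still points to A, the value it held when A was switched out. Because control flow returns to the middle of the function, an ordinary function return value cannot be used to make prev point to C. The kernel solves this with a macro that takes three parameters:
switch_to(prev, next, prev)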
```
#define switch_to(prev,next,last)					\
do {									\
	last = __switch_to(prev,task_thread_info(prev), task_thread_info(next));	\
} while (0)
```

```
ENTRY(__switch_to)
	add	ip, r1, #TI_CPU_SAVE
	ldr	r3, [r2, #TI_TP_VALUE]
	stmia	ip!, {r4 - sl, fp, sp, lr}	@ Store most regs on stack
	...
	mov	r5, r0				@ prev task => r5
	add	r4, r2, #TI_CPU_SAVE
	ldr	r0, =thread_notify_head
	mov	r1, #THREAD_NOTIFY_SWITCH
	bl	atomic_notifier_call_chain
	mov	r0, r5				@ return r5, i.e. the prev task
```

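Lazy FPU mode

Because the speed of context switching has a major impact on system performance, the floating-point registers are not saved and restored unless the application actually uses floating point.
Take A=>B=>A, where A uses floating point and B does not. When switching from A to B, A's floating-point registers are saved in the thread data structure of process A. When switching back from B to A, no restore is needed: B never touched the floating-point registers, so A's values are still in them.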
