Relearning Computers (17: The Linux Scheduler and Scheduler Classes)

Article link: https://blog.csdn.net/C1033177205/article/details/121300520

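I did not expect the previous article to cover nothing but priorities. In this one I will try to sort out the overall architecture of the Linux scheduler, and the next one will properly start on CFS, the Completely Fair Scheduler.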

17.1 Overall framework

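I still prefer going from the big picture to the details. Introducing the whole thing at this point is a bit bewildering, but with an overall picture in mind we can refine the analysis step by step, and when we come back to the framework at the end it should look very clear.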

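[Figure: overall structure of the scheduler, showing the main scheduler, the periodic scheduler, the scheduler classes and the processes]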

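This figure is copied from "Professional Linux Kernel Architecture" (《深入linux内核架构》). The first time I saw it I was rather confused: why do the main scheduler and the periodic scheduler sit above the scheduler classes, and how exactly do the three relate to each other?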

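The relationship between scheduler classes and processes is fairly obvious, because Linux has several scheduler classes: the CFS completely fair scheduler class, a real-time scheduler class, and an idle class that schedules the idle task when there is nothing else to do. Ordinary processes belong to the CFS class, and processes with real-time requirements belong to the real-time class.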

17.2 The schedulers

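Next let's look at how the scheduler is implemented. It is built on the two functions mentioned above: the periodic scheduler function and the main scheduler function. Let's go through them: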

17.2.1 The periodic scheduler

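The periodic scheduler is the simpler of the two. It is implemented in scheduler_tick, and as long as the system is active the kernel calls this function automatically at frequency HZ. We may cover the details later if we get the chance (quite possibly not, since there is not much to gain from going that low-level).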

Let's go straight to the source:

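// kernel/sched/core.c
// There are many details here I do not understand. Back in the day I kept digging into kernel details and got lost right away, so this time I avoid them until the overall picture is mostly in place.
/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
	int cpu = smp_processor_id();   // ID of the CPU we are running on (multi-core)
	struct rq *rq = cpu_rq(cpu);	// the runqueue of this CPU
	struct task_struct *curr = rq->curr;	// the control block of the task currently running on this runqueue

	sched_clock_tick();		// bookkeeping for the scheduler clock; ignored here

	raw_spin_lock(&rq->lock);    // take the runqueue lock; kernel locking is left for later
	update_rq_clock(rq);		// more important: updates the runqueue's clock. We will meet the runqueue again; it is where the runnable tasks live
	curr->sched_class->task_tick(rq, curr, 0);		// the key line: the task's control block holds a pointer to its scheduler class, and we call that class's task_tick. If the task is scheduled by CFS, this is the CFS task_tick; each class handles the tick differently, hence the indirect call
	update_cpu_load_active(rq);   // updates the runqueue's cpu_load[] array (honestly I did not follow this one)
	calc_global_load_tick(rq);
	raw_spin_unlock(&rq->lock);   // release the lock

	perf_event_task_tick();

    // On multi-core systems: trigger load balancing when it is due, for example when a core is fairly idle
#ifdef CONFIG_SMP
	rq->idle_balance = idle_cpu(cpu);
	trigger_load_balance(rq);
#endif
	rq_last_tick_reset(rq);		// records when this runqueue last saw a tick
}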

More than half of this code is still unclear to me, but the core of it is visible:

curr->sched_class->task_tick(rq, curr, 0);

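We go through the task to reach its scheduler class, and the class then does whatever is appropriate. When we get to CFS we will come back and see what the class actually does here.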

17.2.2 The main scheduler

The periodic scheduler felt pleasantly simple, didn't it? Don't celebrate too early: the main scheduler is not going to be that easy.

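The periodic scheduler is triggered by the timer because its main job is to check how long the task has been running and whether it has used up its time slice. If it has, the periodic scheduler takes the appropriate action (note that I did not say it preempts directly; this will make sense once we analyze CFS).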

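The main scheduler is called from many more places, for example when the current task voluntarily gives up the CPU, or where the reschedule flag is checked. I tried searching the kernel source for its callers, got a huge pile of results and gave up. Let's see what turns up when we get to CFS.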

Enough chatting, here is the code:

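//#define __sched		__attribute__((__section__(".sched.text")))
// As covered in earlier chapters, this is a custom gcc section: all scheduler code is collected in .sched.text so that scheduler-related frames can be skipped when stack information is displayed. That is why these functions do not show up there.
asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;  
    // current is interesting and is used all over the kernel. It refers to the control block of the current task. How it is stored is left for the task_struct chapter (feel free to read ahead); for fast access it is kept in a register on some architectures and in a per-CPU variable on others.

	sched_submit_work(tsk);   // if the task is about to go to sleep, submit its plugged block I/O first
	do {
		preempt_disable();		// increments the preemption counter preempt_count: 0 means the task may be preempted, nonzero means it may not
		__schedule(false);		// the main scheduling function, analyzed in detail below
		sched_preempt_enable_no_resched();   // decrements preempt_count again
	} while (need_resched());	
}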
EXPORT_SYMBOL(schedule);

Now let's concentrate on __schedule and see how much of it we can follow. Even if not everything makes sense, a rough picture is enough.

/*
 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *
 *      Now, if the new task added to the run-queue preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *
 *       - If the kernel is preemptible (CONFIG_PREEMPT=y):
 *         (note: kernels built with CONFIG_PREEMPT fall into this case)
 *
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *
 *       - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
 *         then at the next:
 *
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
 *
 * WARNING: must be called with preemption disabled!
 *   (note: schedule() calls preempt_disable() right before calling this function)
 */
static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq *rq;
	int cpu;

	cpu = smp_processor_id();    // multi-core handling shows up at last
	rq = cpu_rq(cpu);
	rcu_note_context_switch();
	prev = rq->curr;

	/*
	 * do_exit() calls schedule() with preemption disabled as an exception;
	 * however we must fix that up, otherwise the next task will see an
	 * inconsistent (higher) preempt count.
	 *
	 * It also avoids the below schedule_debug() test from complaining
	 * about this.
	 */
    // In other words: if the task is dead, decrement the preempt counter once more
	if (unlikely(prev->state == TASK_DEAD))
		preempt_enable_no_resched_notrace();

    // various schedule()-time debugging checks and statistics (that is what the kernel comment says)
	schedule_debug(prev);

	if (sched_feat(HRTICK))
		hrtick_clear(rq);

	/*
	 * Make sure that signal_pending_state()->signal_pending() below
	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
	 * done by the caller to avoid the race with signal_wake_up().
	 */
	smp_mb__before_spinlock();
	raw_spin_lock_irq(&rq->lock);
	lockdep_pin_lock(&rq->lock);

	rq->clock_skip_update <<= 1; /* promote REQ to ACT */

	switch_count = &prev->nivcsw;
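	if (!preempt && prev->state) {   // an interesting check: preempt is the argument, false on this path; prev->state is the task state, and nonzero means not runnable. So a preempted task never takes this branch and stays on the runqueue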
		if (unlikely(signal_pending_state(prev->state, prev))) {
			prev->state = TASK_RUNNING;
		} else {
            // prev is no longer runnable, so remove it from the runqueue
			deactivate_task(rq, prev, DEQUEUE_SLEEP);
			prev->on_rq = 0;

			/*
			 * If a worker went to sleep, notify and ask workqueue
			 * whether it wants to wake up a task to maintain
			 * concurrency.
			 */
			if (prev->flags & PF_WQ_WORKER) {
				struct task_struct *to_wakeup;

				to_wakeup = wq_worker_sleeping(prev, cpu);
				if (to_wakeup)
					try_to_wake_up_local(to_wakeup);
			}
		}
		switch_count = &prev->nvcsw;
	}

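	if (task_on_rq_queued(prev))
		update_rq_clock(rq);		// update the runqueue clock

    // pick the next task right here; that was quick
	next = pick_next_task(rq, prev);
	clear_tsk_need_resched(prev);		// clear the reschedule flag
	clear_preempt_need_resched();
	rq->clock_skip_update = 0;			// reset the clock-skip flag that was shifted above

    // if the chosen task differs from the previous one, a context switch is needed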
	if (likely(prev != next)) {
		rq->nr_switches++;
		rq->curr = next;
		++*switch_count;

		trace_sched_switch(preempt, prev, next);	// tracepoint for the sched_switch trace event
        // the actual context switch
		rq = context_switch(rq, prev, next); /* unlocks the rq */
		cpu = cpu_of(rq);
	} else {
		lockdep_unpin_lock(&rq->lock);
		raw_spin_unlock_irq(&rq->lock);
	}

    // run any balance callbacks queued on this runqueue (relevant on multi-CPU systems)
	balance_callback(rq);
}

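I feel I have caught the core of it, yet something is still missing, most likely the details. My skills are not there yet, so I will not get absorbed in them; once you start chasing details it is hard to stop. Next, let's look at pick_next_task, the function that selects the next task.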

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev)
{
	const struct sched_class *class = &fair_sched_class;
	struct task_struct *p;

	/*
	 * Optimization: we know that if all tasks are in
	 * the fair class we can call that function directly:
	 */
    // fast path: prev is a CFS task and every runnable task on this runqueue belongs to CFS
	if (likely(prev->sched_class == class &&
		   rq->nr_running == rq->cfs.h_nr_running)) {
		p = fair_sched_class.pick_next_task(rq, prev);
		if (unlikely(p == RETRY_TASK))
			goto again;

		/* assumes fair_sched_class->next == idle_sched_class */
        // CFS did not find a task, so fall back to the idle class
		if (unlikely(!p))
			p = idle_sched_class.pick_next_task(rq, prev);

		return p;
	}

again:
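    // walk the scheduler classes in priority order (the macro is shown further down)
	for_each_class(class) {
		p = class->pick_next_task(rq, prev);  // ask this class for a task; pick_next_task is a method of the scheduler class and is analyzed in detail later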
		if (p) {
			if (unlikely(p == RETRY_TASK))
				goto again;
			return p;
		}
	}

	BUG(); /* the idle class will always have a runnable task */
}

Let's look directly at the code of for_each_class:

#define sched_class_highest (&stop_sched_class)
#define for_each_class(class) \
for (class = sched_class_highest; class; class = class->next)

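for_each_class walks the Linux scheduler classes in priority order, checking whether each class has a runnable task and scheduling the first one found. Here are the classes, listed from highest to lowest priority: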

stop_sched_class (stop class)
dl_sched_class (deadline scheduling class)
rt_sched_class (real-time scheduling class)
fair_sched_class (CFS, completely fair scheduling class)
idle_sched_class (idle scheduling class)

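That is basically it for the main scheduler. Many details were skipped, but the core is clear: it searches the scheduler classes in priority order for a suitable task and then switches to it. The specifics are left for when we look at each class.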

17.3 Scheduler classes

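I had originally planned to cover fork and context switching here, but fork can wait until after CFS, and context switching until I know more. Roughly speaking, a context switch saves and restores registers and switches the address space and the stack. I hope there will be a chance to analyze it later.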

17.3.1 The scheduler class abstraction

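Next let's look at the scheduler class abstraction. C is not an object-oriented language, but the Linux kernel uses object-oriented ideas almost everywhere, and the scheduler class is one example. Here is the abstraction: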

struct sched_class {
	const struct sched_class *next;

    // add a task to the runqueue
	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
	// remove a task from the runqueue
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
    // the task voluntarily gives up the CPU
	void (*yield_task) (struct rq *rq);
    // yield the CPU to a specific task p
	bool (*yield_to_task) (struct rq *rq, struct task_struct *p, bool preempt);

    // check whether a newly woken task should preempt the current one
	void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

	/*
	 * It is the responsibility of the pick_next_task() method that will
	 * return the next task to call put_prev_task() on the @prev task or
	 * something equivalent.
	 *
	 * May return RETRY_TASK when it finds a higher prio class has runnable
	 * tasks.
	 */
    // select the next task to run, as analyzed above
	struct task_struct * (*pick_next_task) (struct rq *rq,
						struct task_struct *prev);
    // called when the running task is about to be replaced, before the context switch
	void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP  // CONFIG_SMP: multi-processor support
	int  (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
	void (*migrate_task_rq)(struct task_struct *p);

	void (*task_waking) (struct task_struct *task);
	void (*task_woken) (struct rq *this_rq, struct task_struct *task);

	void (*set_cpus_allowed)(struct task_struct *p,
				 const struct cpumask *newmask);

	void (*rq_online)(struct rq *rq);
	void (*rq_offline)(struct rq *rq);
#endif
	
	void (*set_curr_task) (struct rq *rq);
    // this is the hook called by the periodic scheduler
	void (*task_tick) (struct rq *rq, struct task_struct *p, int queued);
	void (*task_fork) (struct task_struct *p);
	void (*task_dead) (struct task_struct *p);

	/*
	 * The switched_from() call is allowed to drop rq->lock, therefore we
	 * cannot assume the switched_from/switched_to pair is serliazed by
	 * rq->lock. They are however serialized by p->pi_lock.
	 */
	void (*switched_from) (struct rq *this_rq, struct task_struct *task);
	void (*switched_to) (struct rq *this_rq, struct task_struct *task);
	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
			     int oldprio);

	unsigned int (*get_rr_interval) (struct rq *rq,
					 struct task_struct *task);

	void (*update_curr) (struct rq *rq);

#ifdef CONFIG_FAIR_GROUP_SCHED
	void (*task_move_group) (struct task_struct *p);
#endif
};

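Each scheduler class provides its own implementation of these methods, and the periodic scheduler and the main scheduler perform scheduling by calling them.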

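As analyzed above, there are currently five scheduler classes, each with its own implementation. The next article analyzes CFS, the completely fair scheduler.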

17.3.2 The runqueue

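The central data structure the main scheduler uses to manage active tasks is called the runqueue. Each CPU has its own runqueue, and each active task appears on exactly one of them.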

Let's also look at the runqueue structure:

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
	/* runqueue lock: */
	raw_spinlock_t lock;

	/*
	 * nr_running and cpu_load should be in the same cacheline because
	 * remote CPUs use both these fields when doing load calculation.
	 */
	unsigned int nr_running;  // number of runnable tasks on this queue

	#define CPU_LOAD_IDX_MAX 5
	unsigned long cpu_load[CPU_LOAD_IDX_MAX];	// tracks the load history
	unsigned long last_load_update_tick;

	/* capture load from *all* tasks on this cpu: */
	struct load_weight load;		// a measure of the runqueue's current load
	unsigned long nr_load_updates;
	u64 nr_switches;

	struct cfs_rq cfs;		// embedded per-class sub-runqueues
	struct rt_rq rt;
	struct dl_rq dl;


	/*
	 * This is part of a global counter where only the total sum
	 * over all CPUs matters. A task can increase this counter on
	 * one CPU and if it got migrated afterwards it may decrease
	 * it on another CPU. Always updated under the runqueue lock:
	 */
	unsigned long nr_uninterruptible;

	struct task_struct *curr, *idle, *stop;
	unsigned long next_balance;
	struct mm_struct *prev_mm;

	unsigned int clock_skip_update;
	u64 clock;
	u64 clock_task;

	atomic_t nr_iowait;
};

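More than half of the structure has been omitted here and I still cannot make sense of all of it, so it will have to wait for later analysis. For now a first impression is enough.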

17.3.3 Scheduling entities

The entity that the Linux kernel schedules:

struct sched_entity {
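    // the load weight: this is where the priority-derived weight we computed in the previous article is stored
	struct load_weight	load;		/* for load-balancing */
	struct rb_node		run_node;  // red-black tree node, used for ordering
	struct list_head	group_node;	
	unsigned int		on_rq;		// whether this entity is currently queued on a runqueue

    // these time fields are analyzed in the next article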
	u64			exec_start;
	u64			sum_exec_runtime;
	u64			vruntime;
	u64			prev_sum_exec_runtime;

	u64			nr_migrations;

#ifdef CONFIG_SCHEDSTATS
	struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
	int			depth;
	struct sched_entity	*parent;
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq		*cfs_rq;
	/* rq "owned" by this entity/group: */
	struct cfs_rq		*my_q;
#endif

#ifdef CONFIG_SMP
	/* Per entity load average tracking */
	struct sched_avg	avg;
#endif
};