Linux Kernel Process Management and Scheduling

Article link: https://blog.csdn.net/qq_37070988/article/details/132390483

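Preface

What Is Process Scheduling

The kernel keeps a unique descriptor for every process in memory and links it to other processes through several data structures. The scheduler decides which process runs on a CPU, and this involves two parts: the scheduling policy and the context switch.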

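Process Classification

The basic criterion the scheduler uses to allocate CPU time is process priority. Based on priority, processes fall into two categories.

Real-time process: a high-priority process that must run as soon as possible. Priorities range from 0 to 99. Real-time processes must never be held up by normal processes; examples include video playback and various monitoring systems.
Normal process: a lower-priority process that typically runs longer. Priorities range from 100 to 139. Examples include compilers, batch processing of a document, and graphics rendering. Depending on their behavior, normal processes can be further divided into interactive processes and batch processes.

(1) Interactive process: must be activated as soon as a particular event occurs. It may spend long periods waiting, for example for user input.

(2) Batch process: does not need to interact with the user, so it usually runs in the background.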
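
The priority ranges above can be sketched in a few lines of C. The constants mirror the kernel's include/linux/sched/prio.h; the helper names demo_nice_to_prio() and demo_classify() are made up for this illustration and are not kernel functions.

```c
#include <assert.h>

/* Constants mirror the kernel's include/linux/sched/prio.h. */
#define MAX_RT_PRIO   100                             /* real-time priorities are 0..99 */
#define NICE_WIDTH    40                              /* nice values span -20..19 */
#define MAX_PRIO      (MAX_RT_PRIO + NICE_WIDTH)      /* 140 */
#define DEFAULT_PRIO  (MAX_RT_PRIO + NICE_WIDTH / 2)  /* 120, i.e. nice 0 */

/* Hypothetical names used only in this sketch. */
enum demo_kind { DEMO_REALTIME, DEMO_NORMAL };

/* Map a nice value (-20..19) to a kernel priority (100..139). */
int demo_nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return nice + DEFAULT_PRIO;
}

/* Classify a kernel priority as real-time (0..99) or normal (100..139). */
enum demo_kind demo_classify(int prio)
{
    assert(prio >= 0 && prio < MAX_PRIO);
    return prio < MAX_RT_PRIO ? DEMO_REALTIME : DEMO_NORMAL;
}
```

A lower number means a higher priority, which is why the real-time range sits below the normal range.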

Data Structures Related to Process Scheduling

Let us first pick out the scheduling-related members of the process descriptor task_struct:

struct task_struct {
    ......
    const struct sched_class *sched_class;
    unsigned int   policy;
    struct sched_entity  se;
    struct sched_rt_entity  rt;
    struct sched_dl_entity  dl;
    ......
};

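Scheduler Class: struct sched_class

There are five scheduler classes:

extern const struct sched_class stop_sched_class; // stop scheduling class
extern const struct sched_class dl_sched_class;   // deadline scheduling class
extern const struct sched_class rt_sched_class;   // real-time scheduling class
extern const struct sched_class fair_sched_class; // fair scheduling class
extern const struct sched_class idle_sched_class; // idle scheduling class

Ordered from highest to lowest priority, the scheduler classes are:

stop_sched_class > dl_sched_class > rt_sched_class > fair_sched_class > idle_sched_class

stop_sched_class: the highest-priority class. It runs the per-CPU stop task, which can preempt any other task and cannot be preempted by any other task.
dl_sched_class: uses a red-black tree to sort processes by absolute deadline in ascending order, and picks the process with the earliest deadline to run.
rt_sched_class: schedules real-time processes with the Round-Robin or FIFO algorithm; the specific policy is given by task_struct->policy.
fair_sched_class (the CFS scheduler): uses a completely fair scheduling algorithm that schedules processes according to their "virtual runtime":

virtual runtime = actual runtime * NICE_0_LOAD / process weight

(NICE_0_LOAD is the weight corresponding to nice 0.)

idle_sched_class: the idle class. Every CPU has an idle thread (thread 0), which runs when no other process can be scheduled.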
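
As a minimal sketch of the virtual runtime formula given for the fair class, the function below applies it with plain integer division. It assumes a nice-0 weight of 1024; demo_calc_delta_vruntime() is a made-up name, and the kernel itself avoids the division by multiplying with a precomputed inverse weight, so this is an illustration and not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL   /* assumed weight of a nice-0 task */

/*
 * Virtual runtime accrued for delta_exec_ns nanoseconds of real runtime.
 * Plain integer division is a simplification chosen for readability.
 */
uint64_t demo_calc_delta_vruntime(uint64_t delta_exec_ns, uint64_t weight)
{
    assert(weight > 0);
    return delta_exec_ns * NICE_0_LOAD / weight;
}
```

A task with twice the nice-0 weight accrues virtual runtime at half the rate, so it stays toward the left of the tree longer and receives more CPU time.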

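Scheduling Policy: unsigned int policy

There are six scheduling policies, and a user can select among the policies that the scheduler classes provide:

Policy	Description	Scheduler class
SCHED_DEADLINE	Schedules by task deadline; the task whose deadline is nearest has the highest priority	deadline scheduler
SCHED_FIFO	First come, first served among tasks of the same priority	rt scheduler
SCHED_RR	Round-robin; emphasizes fairness, so tasks of the same priority take turns with equal time slices	rt scheduler
SCHED_NORMAL	Also called SCHED_OTHER; used for normal processes	CFS scheduler
SCHED_BATCH	Used for background processes	CFS scheduler
SCHED_IDLE	Used for tasks that should run only when the system is otherwise idle	CFS scheduler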
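
The table can be expressed as a small lookup. The policy numbers below follow the kernel's include/uapi/linux/sched.h; the DEMO_ prefixed names and demo_policy_to_class() are invented for this sketch.

```c
/* Policy numbers as defined in the kernel's include/uapi/linux/sched.h. */
enum demo_policy {
    DEMO_SCHED_NORMAL   = 0,
    DEMO_SCHED_FIFO     = 1,
    DEMO_SCHED_RR       = 2,
    DEMO_SCHED_BATCH    = 3,
    DEMO_SCHED_IDLE     = 5,
    DEMO_SCHED_DEADLINE = 6
};

/* Hypothetical labels for the scheduler class that serves a policy. */
enum demo_class {
    DEMO_CLASS_DL,
    DEMO_CLASS_RT,
    DEMO_CLASS_FAIR,
    DEMO_CLASS_UNKNOWN
};

/* Return the scheduler class that handles the given policy. */
enum demo_class demo_policy_to_class(int policy)
{
    switch (policy) {
    case DEMO_SCHED_DEADLINE:
        return DEMO_CLASS_DL;
    case DEMO_SCHED_FIFO:
    case DEMO_SCHED_RR:
        return DEMO_CLASS_RT;
    case DEMO_SCHED_NORMAL:
    case DEMO_SCHED_BATCH:
    case DEMO_SCHED_IDLE:
        return DEMO_CLASS_FAIR;
    default:
        return DEMO_CLASS_UNKNOWN;
    }
}
```

Note that SCHED_IDLE is a policy handled by the fair class; it is different from idle_sched_class, which only runs the per-CPU idle thread.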

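Scheduling Entities

With the policies in place, we also need a structure that gathers the information used for scheduling. There are three main scheduling entities.

Normal Process Scheduling Entity

struct sched_entity se: the scheduling entity for normal, non-real-time processes scheduled by the CFS algorithm.

Real-Time Scheduling Entity

struct sched_rt_entity rt: the scheduling entity for real-time processes scheduled by the Round-Robin or FIFO algorithm.

DEADLINE Scheduling Entity

struct sched_dl_entity dl: the scheduling entity for real-time processes scheduled by the EDF (Earliest Deadline First) algorithm.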

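Run Queue

When a task is scheduled, the kernel first determines whether it is a real-time task or a normal task and places its scheduling entity on the matching queue. For normal tasks, the entities are organized in a red-black tree ordered by virtual runtime: the entity with the smallest vruntime is on the left of the tree and the one with the largest vruntime is on the right. These per-class queues are gathered in the per-CPU run queue, struct rq.

struct rq {
 ......
 struct cfs_rq cfs;    // CFS queue
 struct rt_rq rt;      // RT queue
 struct dl_rq dl;      // DL queue
 ......
};

The run queue is the collection of all runnable processes on its CPU. Every CPU has one run queue, and every run queue contains three scheduling queues (cfs_rq, rt_rq, dl_rq). cfs_rq and dl_rq are each organized as a red-black tree, while rt_rq keeps one list per real-time priority. In the cfs_rq tree, every node is a scheduling entity sched_entity, and each such entity normally corresponds to one task_struct. The sched_class referenced by the task_struct supplies the handler functions for its policy, and these do the actual scheduling work.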

Taking the cfs_rq queue as an example, part of its definition is:

struct cfs_rq {
  ...
  struct rb_root_cached tasks_timeline;
  ...
};

struct rb_root_cached {
    struct rb_root rb_root;
    struct rb_node* rb_leftmost;
};

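It maintains a red-black tree ordered by virtual runtime. tasks_timeline.rb_root is the root of the tree, and tasks_timeline.rb_leftmost points to the leftmost node, i.e. the scheduling entity with the smallest virtual runtime. The nodes of the tree, sched_entity, are the entities the kernel can schedule; the definition is: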

struct sched_entity {
  ...
  struct rb_node    run_node;      
  ...
  u64          vruntime;              
  ...
};

struct rb_node {
    unsigned long __rb_parent_color;
    struct rb_node* rb_right;
    struct rb_node* rb_left;
};

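Every runnable scheduling entity sched_entity embeds an rb_node used to insert it into the red-black tree, and its vruntime member records the virtual time it has already run.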
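
The embedded-node pattern and the cached leftmost pointer can be sketched in user space as follows. To keep the example short, a list sorted by vruntime stands in for the red-black tree, so insertion here is linear rather than logarithmic; all demo_ names are invented for this sketch, and only the container_of idea is taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Recover the enclosing structure from a pointer to one of its members. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_node {
    struct demo_node *next;      /* stand-in for the rb_left/rb_right links */
};

struct demo_entity {
    struct demo_node run_node;   /* embedded link, like sched_entity.run_node */
    uint64_t vruntime;
};

struct demo_cfs_rq {
    struct demo_node *leftmost;  /* node with the smallest vruntime */
};

/* Insert se so that the queue stays sorted by ascending vruntime. */
void demo_enqueue(struct demo_cfs_rq *rq, struct demo_entity *se)
{
    struct demo_node **link;

    assert(rq != NULL && se != NULL);
    link = &rq->leftmost;
    while (*link) {
        struct demo_entity *cur =
            demo_container_of(*link, struct demo_entity, run_node);
        if (se->vruntime < cur->vruntime)
            break;
        link = &(*link)->next;
    }
    se->run_node.next = *link;
    *link = &se->run_node;
}

/* Return the entity with the smallest vruntime, or NULL if the queue is empty. */
struct demo_entity *demo_pick_first(struct demo_cfs_rq *rq)
{
    if (!rq->leftmost)
        return NULL;
    return demo_container_of(rq->leftmost, struct demo_entity, run_node);
}
```

Because the link is embedded in the entity, the queue code never allocates memory: it only threads pointers through structures that already exist, and demo_container_of() converts a link back into its entity.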

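The relationships among processes, scheduling entities, and scheduling queues can be summarized as follows: each CPU has one struct rq that embeds a cfs_rq, an rt_rq, and a dl_rq; each task_struct embeds the entities se, rt, and dl; and the entity that matches the task's policy is linked into the corresponding queue.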

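The Scheduling Process

The previous sections introduced the data structures and policies related to process scheduling. This section walks through the scheduling process itself.

In essence, scheduling means selecting the next process and then switching to it. There are two kinds of scheduling: voluntary scheduling and preemptive scheduling.

Voluntary scheduling: after running for a while, the task gives up the CPU on its own, and the scheduling policy selects a suitable next task to run.
Preemptive scheduling: also called involuntary scheduling. The running task is forced off the CPU, for example because it has run too long or because a higher-priority task has woken up, and the scheduler switches to the next task.

Whether the scheduling is voluntary or preemptive, the work is ultimately done by schedule() and the core function __schedule() that it wraps.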

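Voluntary Scheduling

In voluntary scheduling, the process itself triggers one of the following situations, traps into kernel mode, and eventually calls schedule():

The process makes a system call that has to wait for I/O, such as read or write.
The process calls sleep.
The process waits for a semaphore or a mutex. (Note that a spinlock does not trigger scheduling; the waiter may simply busy-wait.)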
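
Voluntary switches like these can be observed from user space. The sketch below, which assumes a Linux system, sleeps for about one millisecond and reports how much the process's voluntary context switch counter (ru_nvcsw from getrusage()) grew in the meantime; demo_voluntary_switches_during_sleep() is a made-up helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <sys/resource.h>
#include <time.h>

/*
 * Sleep for about one millisecond and return how many voluntary context
 * switches the process accumulated meanwhile, or -1 on error.
 */
long demo_voluntary_switches_during_sleep(void)
{
    struct rusage before, after;
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000 };

    if (getrusage(RUSAGE_SELF, &before) != 0)
        return -1;
    nanosleep(&ts, NULL);   /* blocks, so the task gives up the CPU */
    if (getrusage(RUSAGE_SELF, &after) != 0)
        return -1;
    return after.ru_nvcsw - before.ru_nvcsw;
}
```

On a typical Linux system the result is expected to be at least 1, since the sleeping task leaves the CPU voluntarily, but the exact number depends on the environment.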

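First, look at the scheduling function schedule(). sched_submit_work() finishes pending work for the current task, to avoid situations such as deadlock or stalled I/O. schedule() then disables preemption, calls __schedule() to do the scheduling, and re-enables preemption. If rescheduling is still needed, the loop repeats; otherwise the function returns.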

// kernel/sched/core.c
asmlinkage __visible void __sched schedule(void)
{
    struct task_struct *tsk = current;
    sched_submit_work(tsk);
    do {
        preempt_disable();
        __schedule(false);
        sched_preempt_enable_no_resched();
    } while (need_resched());
}
EXPORT_SYMBOL(schedule);

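The Core Scheduling Function __schedule()

__schedule() is the actual core scheduling function. Its main operations are selecting the next process and performing the context switch; the context switch in turn consists of switching the user address space and switching the kernel-mode state.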

static void __sched notrace __schedule(bool preempt)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq_flags rf;
    struct rq *rq;
    int cpu;
    
    // Get this CPU's run queue rq and set prev to the current task
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);
    prev = rq->curr;
    
    // Sanity-check that the current task is allowed to schedule here
    schedule_debug(prev);
    if (sched_feat(HRTICK))
        hrtick_clear(rq);
    
    // Disable interrupts, report the context switch to RCU, lock the run queue, then issue a memory barrier
    local_irq_disable();
    rcu_note_context_switch(preempt);
    /*
     * Make sure that signal_pending_state()->signal_pending() below
     * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
     * done by the caller to avoid the race with signal_wake_up().
     *
     * The membarrier system call requires a full memory barrier
     * after coming from user-space, before storing to rq->curr.
     */
    rq_lock(rq, &rf);
    smp_mb__after_spinlock();
    
    /* Promote REQ to ACT */
    rq->clock_update_flags <<= 1;
    update_rq_clock(rq);
    switch_count = &prev->nivcsw;
    
    if (!preempt && prev->state) {
        // If a signal is pending for a task in a signal-wakeable sleep state, keep the task runnable
        if (signal_pending_state(prev->state, prev)) {
            prev->state = TASK_RUNNING;
        } else {
            // Dequeue the current task from rq and set on_rq to 0; if it is waiting on I/O, account for that
            deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
            prev->on_rq = 0;
            if (prev->in_iowait) {
                atomic_inc(&rq->nr_iowait);
                delayacct_blkio_start();
            }
            /* Wake up another worker if needed:
             * If a worker went to sleep, notify and ask workqueue
             * whether it wants to wake up a task to maintain
             * concurrency.
             */
            if (prev->flags & PF_WQ_WORKER) {
                struct task_struct *to_wakeup;
                to_wakeup = wq_worker_sleeping(prev);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup, &rf);
            }
        }
        switch_count = &prev->nvcsw;
    }
    
    // Call pick_next_task to get the next task and assign it to next
    next = pick_next_task(rq, prev, &rf);
    clear_tsk_need_resched(prev);
    clear_preempt_need_resched();
    
    // If the task actually changes, switch context
    if (likely(prev != next)) {
        rq->nr_switches++;
        rq->curr = next;
        /*
         * The membarrier system call requires each architecture
         * to have a full memory barrier after updating
         * rq->curr, before returning to user-space.
         *
         * Here are the schemes providing that barrier on the
         * various architectures:
         * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
         *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
         * - finish_lock_switch() for weakly-ordered
         *   architectures where spin_unlock is a full barrier,
         * - switch_to() for arm64 (weakly-ordered, spin_unlock
         *   is a RELEASE barrier),
         */
        ++*switch_count;
        trace_sched_switch(preempt, prev, next);
        /* Also unlocks the rq: */
        rq = context_switch(rq, prev, next, &rf);
    } else {
        // No switch: clear the clock-skip flags, unlock the run queue, and re-enable interrupts
        rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
        rq_unlock_irq(rq, &rf);
    }
    // Run any load-balancing callbacks queued on this run queue
    balance_callback(rq);
}

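The key functions here are pick_next_task(), which selects the next task, and context_switch(), which switches context. Both are examined in detail below.

Selecting the Next Task: pick_next_task()

First, pick_next_task(). It walks the scheduler classes and calls each class's own pick function to choose the next task. From the earlier analysis, for the tree-based classes this ultimately means taking the leftmost node of the corresponding red-black tree as the next task and returning it.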

static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
    const struct sched_class *class;
    struct task_struct *p;
    /*
     * Optimization: we know that if all tasks are in the fair class we can
     * call that function directly, but only if the @prev task wasn't of a
     * higher scheduling class, because otherwise those lose the
     * opportunity to pull in more work from other CPUs.
     */
    if (likely((prev->sched_class == &idle_sched_class ||
            prev->sched_class == &fair_sched_class) &&
           rq->nr_running == rq->cfs.h_nr_running)) {
        p = fair_sched_class.pick_next_task(rq, prev, rf);
        if (unlikely(p == RETRY_TASK))
            goto again;
        /* Assumes fair_sched_class->next == idle_sched_class */
        if (unlikely(!p))
            p = idle_sched_class.pick_next_task(rq, prev, rf);
        return p;
    }
again:
    // Call each class's pick function in priority order; return as soon as one yields a task
    for_each_class(class) {
        p = class->pick_next_task(rq, prev, rf);
        if (p) {
            if (unlikely(p == RETRY_TASK))
                goto again;
            return p;
        }
    }
    /* The idle class should always have a runnable task: */
    BUG();
}
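
The slow path of pick_next_task() can be modeled in user space as a walk over a priority-ordered table of classes. This is only a sketch: the stop and deadline classes are left out for brevity, each class queue is reduced to a single slot, and every demo_ name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task {
    const char *name;
};

/* One scheduler class: pick() returns a runnable task or NULL. */
struct demo_class {
    const char *name;
    struct demo_task *(*pick)(void);
};

/* Hypothetical per-class queues, reduced to a single slot each. */
struct demo_task *demo_rt_slot;
struct demo_task *demo_fair_slot;
struct demo_task demo_idle_task = { "idle" };

struct demo_task *demo_pick_rt(void)   { return demo_rt_slot; }
struct demo_task *demo_pick_fair(void) { return demo_fair_slot; }
struct demo_task *demo_pick_idle(void) { return &demo_idle_task; }

/* Highest-priority class first, mirroring the order used by for_each_class(). */
const struct demo_class demo_classes[] = {
    { "rt",   demo_pick_rt   },
    { "fair", demo_pick_fair },
    { "idle", demo_pick_idle },
};

/* Ask each class in turn; the first class with a runnable task wins. */
struct demo_task *demo_pick_next_task(void)
{
    size_t i;

    for (i = 0; i < sizeof(demo_classes) / sizeof(demo_classes[0]); i++) {
        struct demo_task *p = demo_classes[i].pick();
        if (p)
            return p;
    }
    /* The idle class should always have a runnable task. */
    assert(!"unreachable: idle class returned no task");
    return NULL;
}
```

Because the idle class always returns its idle task, the walk never comes back empty, which is why the kernel version ends with BUG().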

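Context Switching: context_switch()

A process context is a static description of the whole course of a process's execution. The contents that already-executed instructions and data have left in the relevant registers and stack are called the preceding context; the contents of the registers and stack for the instructions and data currently executing are called the current context; and the contents for the instructions and data still to be executed are called the following context.

For a more detailed introduction to context switching, see these two blog posts:

聊聊Linux中CPU上下文切换 (On CPU context switching in Linux) - 小牛呼噜噜 - 博客园 (cnblogs.com)

linux内核上下文切换解析 (An analysis of context switching in the Linux kernel) - 码农教程 (manongjc.com)

A context switch does two main things: it switches the process address space, and it switches the registers and CPU context.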

static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
           struct task_struct *next, struct rq_flags *rf)
{
    struct mm_struct *mm, *oldmm;
    prepare_task_switch(rq, prev, next);
    mm = next->mm;
    oldmm = prev->active_mm;
    /*
     * For paravirt, this is coupled with an exit in switch_to to
     * combine the page table reload and the switch backend into
     * one hypercall.
     */
    arch_start_context_switch(prev);
    /*
     * If mm is non-NULL, we pass through switch_mm(). If mm is
     * NULL, we will pass through mmdrop() in finish_task_switch().
     * Both of these contain the full memory barrier required by
     * membarrier after storing to rq->curr, before returning to
     * user-space.
     */
    if (!mm) {
        next->active_mm = oldmm;
        mmgrab(oldmm);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm_irqs_off(oldmm, mm, next);
    if (!prev->mm) {
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }
    rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
    prepare_lock_switch(rq, next, rf);
    /* Here we just switch the register state and the stack. */
    switch_to(prev, next, prev);
    // barrier() is a compiler barrier: it keeps the compiler from reordering switch_to and finish_task_switch during optimization
    barrier();
    return finish_task_switch(prev);
}
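
The mm handling in context_switch() can be modeled as follows: a kernel thread has no address space of its own (mm is NULL), so it borrows the previous task's active_mm instead of switching page tables. This is a simplified sketch with invented demo_ names; in the kernel the borrowed reference is dropped later in finish_task_switch(), whereas here it is dropped immediately.

```c
#include <assert.h>
#include <stddef.h>

struct demo_mm {
    int users;                   /* stand-in for the mm reference count */
};

struct demo_task {
    struct demo_mm *mm;          /* NULL for a kernel thread */
    struct demo_mm *active_mm;   /* address space actually in use */
};

/* Returns 1 if the page tables would be switched, 0 if the old mm is borrowed. */
int demo_switch_address_space(struct demo_task *prev, struct demo_task *next)
{
    struct demo_mm *oldmm = prev->active_mm;
    int switched = 0;

    assert(oldmm != NULL);
    if (!next->mm) {
        /* Kernel thread: borrow the previous address space. */
        next->active_mm = oldmm;
        oldmm->users++;          /* plays the role of mmgrab() */
    } else {
        /* User task: the kernel would reload the page tables here. */
        switched = 1;
    }
    if (!prev->mm) {
        /* prev was a kernel thread: give the borrowed mm back. */
        prev->active_mm = NULL;
        oldmm->users--;          /* deferred to finish_task_switch() in the kernel */
    }
    return switched;
}
```

Borrowing avoids an unnecessary page-table reload, because a kernel thread only touches kernel addresses, which are mapped the same way in every address space.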

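Switching the Process Address Space

The address space switch is performed by switch_mm(). The listing below appears to come from an older x86 kernel than the context_switch() shown above, which calls switch_mm_irqs_off(); it is kept here because it shows the idea clearly.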

/*
 *   Switch the mm and reload the cr3 page-table pointer
 */
static inline void switch_mm(struct mm_struct *prev,
			     struct mm_struct *next,
			     struct task_struct *tsk)
{
	int cpu = smp_processor_id(); // get the current CPU

	// If the current mm differs from the mm of the task about to run, this is a switch between processes.
	// On a uniprocessor, a switch between threads of the same process needs no work here.
	if (likely(prev != next)) {
		/* stop flush ipis for the previous mm */
		cpu_clear(cpu, prev->cpu_vm_mask); // clear this CPU's bit in the old mm's cpu_vm_mask
#ifdef CONFIG_SMP
		per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK; // TLB state of the current CPU
		per_cpu(cpu_tlbstate, cpu).active_mm = next; // this CPU's active mm is now the mm of the task about to run
#endif
		cpu_set(cpu, next->cpu_vm_mask); // set this CPU's bit in the new mm's cpu_vm_mask

		/* Re-load page tables */
		load_cr3(next->pgd);   // load the new page-table base; from here on the CPU runs in next's address space

		/*
		 * If the local descriptor tables differ, load the new one
		 */
		if (unlikely(prev->context.ldt != next->context.ldt))
			load_LDT_nolock(&next->context, cpu);
	}
#ifdef CONFIG_SMP
	// On SMP, equal mm pointers mean the same address space, i.e. a switch between threads of one process
	else {
		per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK;
		BUG_ON(per_cpu(cpu_tlbstate, cpu).active_mm != next);
		// If this CPU's bit in next's cpu_vm_mask was not set, we were in lazy TLB mode
		// and must reload cr3
		if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) {
			/* We were in lazy tlb mode and leave_mm disabled 
			 * tlb flush IPI delivery. We must reload %cr3.
			 */
			load_cr3(next->pgd);
			load_LDT_nolock(&next->context, cpu);
		}
	}
#endif
}

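Switching Register State

switch_to() switches the registers and the stack. It calls __switch_to_asm, a piece of assembly code that mainly switches the stack; the 32-bit version uses esp as the stack pointer and the 64-bit version uses rsp, and the rest of the code is the same. This assembly switches the stack pointer and then calls __switch_to to complete the final TSS update. Note that switch_to actually takes three arguments, prev, next, and last, and that in practice last is also passed as prev. The reason for this design is best explained with an example. Suppose there are three tasks A, B, and C, and the scheduler goes from A to B, from B to C, and finally from C back to A. If only prev and next were kept, the sequence would be:

1. A saves its kernel stack and registers and switches to B. At this point prev = A and next = B; these values are saved on A's stack and restored the next time A runs. B then continues in finish_task_switch() and returns B's run queue rq.
2. B saves its kernel stack and registers and switches to C.
3. C saves its kernel stack and registers and switches to A. A resumes at barrier(), but the values it saved in step 1, prev = A and next = B, say nothing about C, so the information about C is lost. This is where the last pointer matters. After __switch_to_asm returns, A's kernel stack and registers have restored A's own prev and next, but the return value supplies the address of C, which is stored in last so that finish_task_switch can do the cleanup for C.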
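
The A, B, C example can be replayed with a tiny model. Each task remembers the prev and next values that were live on its own stack when it was switched out, and the switch function hands the outgoing task to whichever task resumes. The structure and the function demo_switch_to() are hypothetical and only mimic the data flow, not the actual stack switch.

```c
#include <assert.h>
#include <stddef.h>

struct demo_task {
    const char *name;
    /* Locals that sit on this task's kernel stack while it is switched out. */
    struct demo_task *saved_prev;
    struct demo_task *saved_next;
};

/*
 * Switch from cur to next. cur's locals are frozen on its own stack, and
 * the task that resumes is handed cur as its "last".
 */
struct demo_task *demo_switch_to(struct demo_task *cur, struct demo_task *next)
{
    assert(cur != NULL && next != NULL);
    cur->saved_prev = cur;
    cur->saved_next = next;
    return cur;   /* the value __switch_to_asm returns to the resumed task */
}
```

After the sequence A to B, B to C, C to A, task A's saved locals still say prev = A and next = B, and only the return value of the last switch identifies C as the task that just left the CPU.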

#define switch_to(prev, next, last)			      \
do {									       \
    prepare_switch_to(next);					\
                                               \
    ((last) = __switch_to_asm((prev), (next)));	  \
} while (0)

/*
 * %eax: prev task
 * %edx: next task
 */
ENTRY(__switch_to_asm)
......
  /* switch stack */
  movl  %esp, TASK_threadsp(%eax)
  movl  TASK_threadsp(%edx), %esp
......
  jmp  __switch_to
END(__switch_to_asm)

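Finally __switch_to() is called. This function involves a structure called the TSS (Task State Segment), which has room for all the registers. A special register, TR (Task Register), points to the TSS. On x86 hardware, changing the value of TR makes the CPU save all registers into the current TSS and load the register values from the new TSS, which completes a hardware context switch. Linux does not switch tasks this way, as the comment in the code below explains. During system initialization, cpu_init() associates one TSS with each CPU and points TR at it; while the operating system runs, TR is never switched again and always points to that TSS. A task switch only updates the relevant fields inside that TSS, such as the kernel stack pointer, rather than changing TR.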

/*
 *	switch_to(x,y) should switch tasks from x to y.
 *
 * We fsave/fwait so that an exception goes off at the right time
 * (as a call from the fsave or fwait in effect) rather than to
 * the wrong process. Lazy FP saving no longer makes any sense
 * with modern CPU's, and this simplifies a lot of things (SMP
 * and UP become the same).
 *
 * NOTE! We used to use the x86 hardware context switching. The
 * reason for not using it any more becomes apparent when you
 * try to recover gracefully from saved state that is no longer
 * valid (stale segment register values in particular). With the
 * hardware task-switch, there is no way to fix up bad state in
 * a reasonable manner.
 *
 * The fact that Intel documents the hardware task-switching to
 * be slow is a fairly red herring - this code is not noticeably
 * faster. However, there _is_ some room for improvement here,
 * so the performance issues may eventually be a valid point.
 * More important, however, is the fact that this allows us much
 * more flexibility.
 *
 * The return value (in %ax) will be the "prev" task after
 * the task-switch, and shows up in ret_from_fork in entry.S,
 * for example.
 */
__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
    struct thread_struct *prev = &prev_p->thread,
                 *next = &next_p->thread;
    struct fpu *prev_fpu = &prev->fpu;
    struct fpu *next_fpu = &next->fpu;
    int cpu = smp_processor_id();
    /* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
    switch_fpu_prepare(prev_fpu, cpu);
    /*
     * Save away %gs. No need to save %fs, as it was saved on the
     * stack on entry.  No need to save %es and %ds, as those are
     * always kernel segments while inside the kernel.  Doing this
     * before setting the new TLS descriptors avoids the situation
     * where we temporarily have non-reloadable segments in %fs
     * and %gs.  This could be an issue if the NMI handler ever
     * used %fs or %gs (it does not today), or if the kernel is
     * running inside of a hypervisor layer.
     */
    lazy_save_gs(prev->gs);
    /*
     * Load the per-thread Thread-Local Storage descriptor.
     */
    load_TLS(next, cpu);
    /*
     * Restore IOPL if needed.  In normal use, the flags restore
     * in the switch assembly will handle this.  But if the kernel
     * is running virtualized at a non-zero CPL, the popf will
     * not restore flags, so it must be done in a separate step.
     */
    if (get_kernel_rpl() && unlikely(prev->iopl != next->iopl))
        set_iopl_mask(next->iopl);
    switch_to_extra(prev_p, next_p);
    /*
     * Leave lazy mode, flushing any hypercalls made here.
     * This must be done before restoring TLS segments so
     * the GDT and LDT are properly updated, and must be
     * done before fpu__restore(), so the TS bit is up
     * to date.
     */
    arch_end_context_switch(next_p);
    /*
     * Reload esp0 and cpu_current_top_of_stack.  This changes
     * current_thread_info().  Refresh the SYSENTER configuration in
     * case prev or next is vm86.
     */
    update_task_stack(next_p);
    refresh_sysenter_cs(next);
    this_cpu_write(cpu_current_top_of_stack,
               (unsigned long)task_stack_page(next_p) +
               THREAD_SIZE);
    /*
     * Restore %gs if needed (which is common)
     */
    if (prev->gs | next->gs)
        lazy_load_gs(next->gs);
    switch_fpu_finish(next_fpu, cpu);
    this_cpu_write(current_task, next_p);
    /* Load the Intel cache allocation PQR MSR. */
    resctrl_sched_in();
    return prev_p;
}

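After switch_to() completes the kernel-mode switch, one more important function, finish_task_switch(), does the cleanup. The discussion of switch_to's three parameters above already explained why last matters. The reason both prev and last are given as prev is that the old value of prev is not needed afterwards, so the same variable is reused to hold last and no extra pointer is required.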

/**
 * finish_task_switch - clean up after a task-switch
 * @prev: the thread we just switched away from.
 *
 * finish_task_switch must be called after the context switch, paired
 * with a prepare_task_switch call before the context switch.
 * finish_task_switch will reconcile locking set up by prepare_task_switch,
 * and do any other architecture-specific cleanup actions.
 *
 * Note that we may have delayed dropping an mm in context_switch(). If
 * so, we finish that here outside of the runqueue lock. (Doing it
 * with the lock held can cause deadlocks; see schedule() for
 * details.)
 *
 * The context switch have flipped the stack from under us and restored the
 * local variables which were saved when this task called schedule() in the
 * past. prev == current is still correct but we need to recalculate this_rq
 * because prev may have moved to another CPU.
 */
static struct rq *finish_task_switch(struct task_struct *prev)
    __releases(rq->lock)
{
    struct rq *rq = this_rq();
    struct mm_struct *mm = rq->prev_mm;
    long prev_state;
    /*
     * The previous task will have left us with a preempt_count of 2
     * because it left us after:
     *
     *	schedule()
     *	  preempt_disable();			// 1
     *	  __schedule()
     *	    raw_spin_lock_irq(&rq->lock)	// 2
     *
     * Also, see FORK_PREEMPT_COUNT.
     */
    if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
              "corrupted preempt_count: %s/%d/0x%x\n",
              current->comm, current->pid, preempt_count()))
        preempt_count_set(FORK_PREEMPT_COUNT);
    rq->prev_mm = NULL;
    /*
     * A task struct has one reference for the use as "current".
     * If a task dies, then it sets TASK_DEAD in tsk->state and calls
     * schedule one last time. The schedule call will never return, and
     * the scheduled task must drop that reference.
     *
     * We must observe prev->state before clearing prev->on_cpu (in
     * finish_task), otherwise a concurrent wakeup can get prev
     * running on another CPU and we could race with its RUNNING -> DEAD
     * transition, resulting in a double drop.
     */
    prev_state = prev->state;
    vtime_task_switch(prev);
    perf_event_task_sched_in(prev, current);
    finish_task(prev);
    finish_lock_switch(rq);
    finish_arch_post_lock_switch();
    kcov_finish_switch(current);
    fire_sched_in_preempt_notifiers(current);
    /*
     * When switching through a kernel thread, the loop in
     * membarrier_{private,global}_expedited() may have observed that
     * kernel thread and not issued an IPI. It is therefore possible to
     * schedule between user->kernel->user threads without passing though
     * switch_mm(). Membarrier requires a barrier after storing to
     * rq->curr, before returning to userspace, so provide them here:
     *
     * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
     *   provided by mmdrop(),
     * - a sync_core for SYNC_CORE.
     */
    if (mm) {
        membarrier_mm_sync_core_before_usermode(mm);
        mmdrop(mm);
    }
    if (unlikely(prev_state == TASK_DEAD)) {
        if (prev->sched_class->task_dead)
            prev->sched_class->task_dead(prev);
        /*
         * Remove function-return probe instances associated with this
         * task and put them back on the free list.
         */
        kprobe_flush_task(prev);
        /* Task is done with its stack. */
        put_task_stack(prev);
        put_task_struct(prev);
    }
    tick_nohz_task_switch();
    return rq;
}

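This completes the kernel-mode switch and, with it, the whole voluntary scheduling path.

Preemptive Scheduling

Unlike voluntary scheduling, which calls schedule() directly, preemptive scheduling first marks the task currently running on the CPU with _TIF_NEED_RESCHED, and later decides whether to call schedule() by checking whether the task carries that mark.

Preemption is usually requested in two situations: when a task has run for too long, and when a task is woken up. In both cases the kernel sets _TIF_NEED_RESCHED in the flags member of the thread_info structure of the process running on the CPU.

A Task Runs Too Long

When the kernel detects that a task has run too long, it requests preemption. The machine has a timer that raises a timer interrupt at regular intervals to tell the operating system that another tick has passed, which gives the kernel a regular opportunity to check whether it is time to preempt.

The timer interrupt handler calls scheduler_tick(). This function takes the current CPU, obtains the corresponding run queue rq and the current task curr, and then calls the task_tick() function of the task's scheduler class sched_class to handle the tick.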

void scheduler_tick(void)
{
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;
    struct rq_flags rf;
    sched_clock_tick();
    rq_lock(rq, &rf);
    update_rq_clock(rq);
    curr->sched_class->task_tick(rq, curr, 0);
    cpu_load_update_active(rq);
    calc_global_load_tick(rq);
    psi_task_tick(rq);
    rq_unlock(rq, &rf);
    perf_event_task_tick();
......
}

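Taking the normal task queue as an example, the scheduler class is fair_sched_class and its tick handler is task_tick_fair(). This function gets the current scheduling entity and its run queue and calls entity_tick() to update the time accounting.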

static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;
    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se, queued);
    }
    if (static_branch_unlikely(&sched_numa_balancing))
        task_tick_numa(rq, curr);
    update_misfit_status(curr, rq);
    update_overutilized_status(task_rq(curr));
}

entity_tick() first calls update_curr() to update the current task's vruntime and then calls check_preempt_tick() to check whether preemption should be requested now.

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq);
    /*
     * Ensure that runnable average is periodically updated.
     */
    update_load_avg(cfs_rq, curr, UPDATE_TG);
    update_cfs_group(curr);
......
    if (cfs_rq->nr_running > 1)
        check_preempt_tick(cfs_rq, curr);
}

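check_preempt_tick() first calls sched_slice() to compute ideal_runtime, the amount of real time this task should run within one scheduling period. sum_exec_runtime is the total real time the task has executed, and prev_sum_exec_runtime is the real time it had already used when it was last scheduled in, so sum_exec_runtime - prev_sum_exec_runtime is the real time used in the current run. If this exceeds ideal_runtime, the task should be preempted. In addition, provided the task has already run for at least sysctl_sched_min_granularity, __pick_first_entity() fetches the entity with the smallest vruntime in the red-black tree. If the current task's vruntime is larger than that entity's vruntime and the difference exceeds ideal_runtime, the task should also be preempted.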

static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;
    struct sched_entity *se;
    s64 delta;
    ideal_runtime = sched_slice(cfs_rq, curr);
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
    if (delta_exec > ideal_runtime) {
        resched_curr(rq_of(cfs_rq));
        /*
         * The current task ran long enough, ensure it doesn't get
         * re-elected due to buddy favours.
         */
        clear_buddies(cfs_rq, curr);
        return;
    }
    /*
     * Ensure that a task that missed wakeup preemption by a
     * narrow margin doesn't have to wait for a full slice.
     * This also mitigates buddy induced latencies under load.
     */
    if (delta_exec < sysctl_sched_min_granularity)
        return;
    se = __pick_first_entity(cfs_rq);
    delta = curr->vruntime - se->vruntime;
    if (delta < 0)
        return;
    if (delta > ideal_runtime)
        resched_curr(rq_of(cfs_rq));
}
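
The decision logic of check_preempt_tick() can be restated as a pure function, together with a simplified slice calculation. This is a sketch under assumptions: demo_sched_slice() divides the period by weight share only, which simplifies what sched_slice() does, and all demo_ names and the time values used with them are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* Share of the scheduling period owed to one entity, proportional to its weight. */
uint64_t demo_sched_slice(uint64_t period_ns, uint64_t weight, uint64_t total_weight)
{
    assert(total_weight > 0 && weight <= total_weight);
    return period_ns * weight / total_weight;
}

/* Returns 1 if the current entity should be marked for rescheduling, else 0. */
int demo_check_preempt_tick(uint64_t delta_exec_ns, uint64_t ideal_runtime_ns,
                            uint64_t min_granularity_ns,
                            uint64_t curr_vruntime, uint64_t leftmost_vruntime)
{
    int64_t delta;

    if (delta_exec_ns > ideal_runtime_ns)
        return 1;   /* the entity has used up its slice */
    if (delta_exec_ns < min_granularity_ns)
        return 0;   /* it has barely run yet, so leave it alone */
    delta = (int64_t)(curr_vruntime - leftmost_vruntime);
    if (delta < 0)
        return 0;   /* the current entity is still the most deserving */
    return (uint64_t)delta > ideal_runtime_ns;
}
```

For instance, with a hypothetical 6 ms period and three tasks of equal weight, each task's slice is 2 ms, and a task that has run 3 ms since it was picked is flagged for preemption.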

If preemption is warranted, resched_curr() is called. It calls set_tsk_need_resched() to mark the task with _TIF_NEED_RESCHED, meaning the task should be preempted.

void resched_curr(struct rq *rq)
{
    struct task_struct *curr = rq->curr;
    int cpu;
.......
    cpu = cpu_of(rq);
    if (cpu == smp_processor_id()) {
        set_tsk_need_resched(curr);
        set_preempt_need_resched();
        return;
    }
    if (set_nr_and_not_polling(curr))
        smp_send_reschedule(cpu);
    else
        trace_sched_wake_idle_without_ipi(cpu);
}
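
The important point is that resched_curr() only records a request; the switch itself happens later, at a safe point. A minimal model of this two-step protocol is sketched below. The bit position and all demo_ names are chosen for this sketch and do not match the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TIF_NEED_RESCHED (1u << 3)   /* bit position chosen for this sketch */

struct demo_thread {
    unsigned int flags;
    int times_scheduled;
};

/* Request preemption: this only records the request. */
void demo_resched_curr(struct demo_thread *t)
{
    assert(t != NULL);
    t->flags |= DEMO_TIF_NEED_RESCHED;
}

/* A safe point, such as a return to user mode: act on a pending request. */
void demo_safe_point(struct demo_thread *t)
{
    assert(t != NULL);
    while (t->flags & DEMO_TIF_NEED_RESCHED) {
        t->flags &= ~DEMO_TIF_NEED_RESCHED;   /* the kernel clears it inside __schedule() */
        t->times_scheduled++;                 /* stands in for the call to schedule() */
    }
}
```

Separating the request from the switch lets the kernel raise the flag from interrupt context, where switching tasks directly would not be safe.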

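A Task Is Woken Up

Some tasks are woken by an interrupt; for example, when I/O completes, the process waiting for that I/O is usually woken. If the woken task has a higher priority than the task currently on the CPU, preemption is triggered.

wake_up_process() calls try_to_wake_up(), which calls ttwu_queue() to add the woken task to the run queue. ttwu_queue() then calls ttwu_do_activate() to activate the task, and ttwu_do_activate() calls ttwu_do_wakeup(), which calls check_preempt_curr() to check whether preemption should occur. Even if it should, the current process is not kicked off immediately. Instead, resched_curr() is called, which uses set_tsk_need_resched() to mark the current process with _TIF_NEED_RESCHED, meaning it should be preempted.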

static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
         struct rq_flags *rf)
{
  check_preempt_curr(rq, p, wake_flags);
  p->state = TASK_RUNNING;
  trace_sched_wakeup(p);
  ......
}
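
A rough model of the wakeup check is sketched below, assuming the usual rules: a task from a higher-priority scheduler class always preempts, two real-time tasks are compared by priority, and two fair tasks are compared by vruntime with a granularity threshold so that small differences do not cause a switch. The names, the rank encoding, and the granularity parameter are assumptions of this sketch, not the kernel's check_preempt_curr().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Lower rank means a higher-priority scheduler class in this sketch. */
enum demo_class_rank { DEMO_RANK_RT = 0, DEMO_RANK_FAIR = 1 };

struct demo_task {
    enum demo_class_rank rank;
    int prio;            /* used for RT tasks: a lower value is a higher priority */
    uint64_t vruntime;   /* used for fair tasks */
};

/* Returns 1 if the woken task should preempt the current one, else 0. */
int demo_wakeup_should_preempt(const struct demo_task *curr,
                               const struct demo_task *woken,
                               uint64_t wakeup_gran_ns)
{
    assert(curr != NULL && woken != NULL);
    if (woken->rank != curr->rank)
        return woken->rank < curr->rank;   /* a higher class always wins */
    if (curr->rank == DEMO_RANK_RT)
        return woken->prio < curr->prio;
    /* Fair class: preempt only if curr is ahead by more than the granularity. */
    if (curr->vruntime <= woken->vruntime)
        return 0;
    return curr->vruntime - woken->vruntime > wakeup_gran_ns;
}
```

The granularity threshold trades a little latency for fewer context switches when two fair tasks have nearly equal virtual runtimes.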

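When Preemption Actually Happens

From the analysis above, whether the current task has run too long or a new task has been woken, the kernel only confirms that rescheduling is needed and sets the _TIF_NEED_RESCHED flag on the process. The flag is then checked at the following points, and schedule() is called if it is set:

Return from a system call to user mode: taking 64-bit as an example, the call chain is do_syscall_64 -> syscall_return_slowpath -> prepare_exit_to_usermode -> exit_to_usermode_loop. exit_to_usermode_loop checks whether _TIF_NEED_RESCHED is set and, if so, calls schedule().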

static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
{
    while (true) {
        /* We have work to do. */
        local_irq_enable();

        if (cached_flags & _TIF_NEED_RESCHED)
          schedule();
    ......
}

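Before returning from an interrupt to user mode or kernel mode: interrupts are handled by do_IRQ. When the handler finishes there are two cases, returning to user mode and returning to kernel mode.
- Returning to user mode calls prepare_exit_to_usermode(), which ends up in exit_to_usermode_loop().
- Returning to kernel mode calls preempt_schedule_irq(), which ends up in __schedule().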

common_interrupt:
        ASM_CLAC
        addq    $-0x80, (%rsp) 
        interrupt do_IRQ
ret_from_intr:
        popq    %rsp
        testb   $3, CS(%rsp)
        jz      retint_kernel
/* Interrupt came from user space */
GLOBAL(retint_user)
        mov     %rsp,%rdi
        call    prepare_exit_to_usermode
        TRACE_IRQS_IRETQ
        SWAPGS
        jmp     restore_regs_and_iret
/* Returning to kernel space */
retint_kernel:
#ifdef CONFIG_PREEMPT
        bt      $9, EFLAGS(%rsp)  
        jnc     1f
0:      cmpl    $0, PER_CPU_VAR(__preempt_count)
        jnz     1f
        call    preempt_schedule_irq
        jmp     0b


asmlinkage __visible void __sched preempt_schedule_irq(void)
{
  ......
  do {
    preempt_disable();
    local_irq_enable();
    __schedule(true);
    local_irq_disable();
    sched_preempt_enable_no_resched();
  } while (need_resched());
  ......
}

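When kernel preemption is re-enabled: on a kernel built with preemption support, preempt_enable() calls preempt_count_dec_and_test(), which uses preempt_count and _TIF_NEED_RESCHED to decide whether preemption is allowed. If it is, the call chain preempt_schedule -> preempt_schedule_common -> __schedule performs the scheduling.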
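
The interplay between the preemption counter and the reschedule flag can be modeled like this. It is a sketch with invented demo_ names: the real preempt_count also encodes interrupt and softirq nesting, which is ignored here.

```c
#include <assert.h>
#include <stddef.h>

struct demo_cpu {
    int preempt_count;   /* greater than 0 means preemption is disabled */
    int need_resched;    /* stands in for _TIF_NEED_RESCHED */
    int preemptions;     /* how many times the sketch "called __schedule()" */
};

/* Enter a section that must not be preempted; calls may nest. */
void demo_preempt_disable(struct demo_cpu *c)
{
    assert(c != NULL);
    c->preempt_count++;
}

/* Leave the section; reschedule only when the outermost level is left. */
void demo_preempt_enable(struct demo_cpu *c)
{
    assert(c != NULL && c->preempt_count > 0);
    c->preempt_count--;
    if (c->preempt_count == 0 && c->need_resched) {
        c->need_resched = 0;
        c->preemptions++;
    }
}
```

A reschedule request raised inside a nested non-preemptible section is therefore deferred until the outermost demo_preempt_enable() call.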

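To summarize:

1. All scheduling happens in kernel mode. Interrupts are also handled in kernel mode, so scheduling never takes place in user mode.

2. All scheduling goes through schedule() or, on the preemption paths, directly through the __schedule() function that it wraps.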
