To control process execution, the operating system must be able to suspend a process that is currently running on the CPU and later resume a previously suspended one. This behavior is called process switching, task switching, or context switching.
Linux does not use the x86 CPU's hardware task-switching mechanism. In essence, a Linux context switch consists of a cr3 switch (the memory address-space switch, done in switch_mm) plus a register switch (EIP, ESP, and so on, done in switch_to).
We can locate the context_switch function (for example by searching the kernel source in VS Code). Its body is as follows:
/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
               struct task_struct *next, struct rq_flags *rf)
{
        prepare_task_switch(rq, prev, next);

        /*
         * For paravirt, this is coupled with an exit in switch_to to
         * combine the page table reload and the switch backend into
         * one hypercall.
         */
        arch_start_context_switch(prev);

        /*
         * kernel -> kernel   lazy + transfer active
         *   user -> kernel   lazy + mmgrab() active
         *
         * kernel ->   user   switch + mmdrop() active
         *   user ->   user   switch
         */
        if (!next->mm) {                                // to kernel
                enter_lazy_tlb(prev->active_mm, next);

                next->active_mm = prev->active_mm;
                if (prev->mm)                           // from user
                        mmgrab(prev->active_mm);
                else
                        prev->active_mm = NULL;
        } else {                                        // to user
                membarrier_switch_mm(rq, prev->active_mm, next->mm);
                /*
                 * sys_membarrier() requires an smp_mb() between setting
                 * rq->curr / membarrier_switch_mm() and returning to userspace.
                 *
                 * The below provides this either through switch_mm(), or in
                 * case 'prev->active_mm == next->mm' through
                 * finish_task_switch()'s mmdrop().
                 */
                switch_mm_irqs_off(prev->active_mm, next->mm, next);

                if (!prev->mm) {                        // from kernel
                        /* will mmdrop() in finish_task_switch(). */
                        rq->prev_mm = prev->active_mm;
                        prev->active_mm = NULL;
                }
        }

        rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

        prepare_lock_switch(rq, next, rf);

        /* Here we just switch the register state and the stack. */
        switch_to(prev, next, prev);
        barrier();

        return finish_task_switch(prev);
}
Let us analyze context_switch in detail.
context_switch first calls prepare_task_switch(), which completes the preparatory work before the switch.
It then calls arch_start_context_switch(), the hook that begins the architecture-specific part of the context switch; each processor architecture may define it. ARM64 does not define arch_start_context_switch, so the default definition is used, which is an empty macro. Next, the function checks whether the next process is a kernel thread; if it is, no user address-space switch is needed.
If the next process is a kernel thread (its mm member is a null pointer), it has no user virtual address space of its own, so it borrows the previous process's user address space. The borrowed address space is recorded in the active_mm member, and the kernel thread runs on top of it.
enter_lazy_tlb() notifies the processor architecture that no user address-space switch is required; this technique for speeding up process switches is called lazy TLB. On ARM64, enter_lazy_tlb() is defined as an empty function.
Then prepare_lock_switch() is executed, which hands the runqueue lock over to next across the switch (in older kernel versions this is also where next->on_cpu was set).
If the next process is a user process, switch_mm_irqs_off() is called to switch the user virtual address space.
The most critical piece is switch_to, a macro that takes three parameters:
prev: input parameter; the address of the old process's descriptor.
next: input parameter; the address of the new process's descriptor.
last: output parameter; records which process we switched from, i.e. the descriptor address of the process that occupied the CPU before the current one.
The macro's steps (on 32-bit x86) are roughly as follows:
1. Move the value of prev into eax and the value of next into edx.
2. Save prev's eflags and ebp registers; these are pushed onto prev's kernel stack.
3. Save the esp register into prev->thread.esp, i.e. save prev's kernel stack pointer.
4. Load next->thread.esp into the esp register, i.e. switch to next's kernel stack.
5. Store the address of label "1:" into prev->thread.eip (the "1" in the macro's "1:\t" line); this is the resume point used when prev runs again.
6. Push next->thread.eip onto next's kernel stack. This value is usually the address of that same label "1:".
7. Jump to __switch_to. When prev is eventually switched back in, execution resumes at label "1:": prev regains the CPU, and its ebp and eflags are restored.
8. Store the contents of eax into the last parameter (the original author notes this was not obvious to him; __switch_to returns prev, and that return value is placed in eax).
ENTRY(__switch_to_asm)
        UNWIND_HINT_FUNC
        /*
         * Save callee-saved registers
         * This must match the order in inactive_task_frame
         */
        pushq   %rbp
        pushq   %rbx
        pushq   %r12
        pushq   %r13
        pushq   %r14
        pushq   %r15

        /* switch stack */
        movq    %rsp, TASK_threadsp(%rdi)       // save the old task's stack pointer
        movq    TASK_threadsp(%rsi), %rsp       // load the new task's stack pointer

        /* restore callee-saved registers */
        popq    %r15
        popq    %r14
        popq    %r13
        popq    %r12
        popq    %rbx
        popq    %rbp

        jmp     __switch_to
END(__switch_to_asm)
Finally, finish_task_switch() runs; reading the code shows that this function performs the cleanup work after a process switch.
Having switched from process prev to process next, finish_task_switch cleans up on prev's behalf. For example, it checks whether the previous process's state is TASK_DEAD, meaning the process exited (voluntarily or not), in which case its resources must be released.
static struct rq *finish_task_switch(struct task_struct *prev)
        __releases(rq->lock)
{
        struct rq *rq = this_rq();
        struct mm_struct *mm = rq->prev_mm;
        long prev_state;

        /*
         * The previous task will have left us with a preempt_count of 2
         * because it left us after:
         *
         *      schedule()
         *        preempt_disable();                    // 1
         *        __schedule()
         *          raw_spin_lock_irq(&rq->lock)        // 2
         *
         * Also, see FORK_PREEMPT_COUNT.
         */
        if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
                      "corrupted preempt_count: %s/%d/0x%x\n",
                      current->comm, current->pid, preempt_count()))
                preempt_count_set(FORK_PREEMPT_COUNT);

        rq->prev_mm = NULL;

        /*
         * A task struct has one reference for the use as "current".
         * If a task dies, then it sets TASK_DEAD in tsk->state and calls
         * schedule one last time. The schedule call will never return, and
         * the scheduled task must drop that reference.
         *
         * We must observe prev->state before clearing prev->on_cpu (in
         * finish_task), otherwise a concurrent wakeup can get prev
         * running on another CPU and we could race with its RUNNING -> DEAD
         * transition, resulting in a double drop.
         */
        prev_state = prev->state;
        vtime_task_switch(prev);
        perf_event_task_sched_in(prev, current);
        finish_task(prev);
        finish_lock_switch(rq);
        finish_arch_post_lock_switch();
        kcov_finish_switch(current);

        fire_sched_in_preempt_notifiers(current);
        /*
         * When switching through a kernel thread, the loop in
         * membarrier_{private,global}_expedited() may have observed that
         * kernel thread and not issued an IPI. It is therefore possible to
         * schedule between user->kernel->user threads without passing though
         * switch_mm(). Membarrier requires a barrier after storing to
         * rq->curr, before returning to userspace, so provide them here:
         *
         * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
         *   provided by mmdrop(),
         * - a sync_core for SYNC_CORE.
         */
        if (mm) {
                membarrier_mm_sync_core_before_usermode(mm);
                mmdrop(mm);
        }
        if (unlikely(prev_state == TASK_DEAD)) {
                if (prev->sched_class->task_dead)
                        prev->sched_class->task_dead(prev);

                /*
                 * Remove function-return probe instances associated with this
                 * task and put them back on the free list.
                 */
                kprobe_flush_task(prev);

                /* Task is done with its stack. */
                put_task_stack(prev);

                put_task_struct_rcu_user(prev);
        }

        tick_nohz_task_switch();
        return rq;
}
With that, the process switch is complete.