Task switch in Linux: timeslice, voluntary, and preemption

Latest recommended article published on 2023-10-24 22:10:31

xzhao28

Category: low level

Article link: https://blog.csdn.net/xzhao28/article/details/111193505


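#
# I was asked the question: "In what cases can code executing in the kernel be interleaved?"
#
#
# I studied this many years ago. When answering, I wanted to describe the details and not just the rough principle, but I did not explain it clearly.       # It was so long ago that my memory had become very blurry.
#
#        I stumbled on task switch in user context --- scheduling / preemption.
#
#
# So here I tidy up my earlier study notes.       # --- These notes are only about task switch in user context.
#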

=================================================================================

--- index:

1. intro

2. reference

3. task switch types

3.1. timeslice task switch

3.2. voluntary task switch

3.3. preemption task switch

3.4. actual task switch for timeslice / preemption task switch

3.5. synchronization - disable preemption

4. misc tips

4.1. question: in kernel-level, if task #A in user context (not irq or bh) wakes up a higher-priority task #B on the same CPU, does the preemption switch happen *immediately*?

4.2. --- In fact, an "active" task switch happens ** immediately **, but a "passive" task switch ** never happens immediately **

================================================================================

1. intro

This doc describes 3 task switch types.

================================================================================

2. reference

[1]. ulk - OReilly.Understanding.The.Linux.Kernel.3rd.Edition

================================================================================

3. task switch types

This doc does not cover the task switch logic itself, which is described by:

<<task - basic task definition, task switch logic, creation, destruction, wait system call.txt>>

But instead, we talk about task switch types.

There are 3 task switch types:

regular timeslice

voluntary

preemption

================================================================================

3.1. timeslice task switch

The timeslice task switch is the most commonly known one in OS theory:

Each task is allocated a timeslice according to its priority. When it runs out of its timeslice, it gives up the CPU, and another task gets scheduled to run.

Since the timeslice task switch is driven by the timeslice, the procedure is:

At each timer interrupt, the timer ISR checks whether "current" has used up its timeslice. If YES, it requests a task switch.

@@trace - timeslice task switch - timer ISR checks "current" timeslice - kernel 2.6.11.12

#
# update_process_times() is called by the global timer ISR on UP, or the local timer ISR on SMP, to update some time-related
# information of "current".
#
#
# [*] For ease of understanding, the simpler kernel 2.6.11.12 code is used here.
#
/*
* Called from the timer interrupt handler to charge one tick to the current
* process. user_tick is 1 if the tick is user time, 0 for system.
*/
void update_process_times(int user_tick)

   #
   # scheduler_tick() is the one that actually does the job.
   #
   scheduler_tick();

       #
       # For real-time tasks:
       #
       #       SCHED_RR
       #
       #       SCHED_FIFO
       #
       if (rt_task(p)) {
           /*
           * RR tasks need a special form of timeslice management.
           * FIFO tasks have no timeslices.
           */

           #
           # For a SCHED_RR task, decrease its timeslice, and if it drops to 0, set the TIF_NEED_RESCHED flag
           # in "thread_info->flags".
           #

           if ((p->policy == SCHED_RR) && !--p->time_slice) {

               p->time_slice = task_timeslice(p);
               p->first_time_slice = 0;

               #
               # set_tsk_need_resched(p) boils down to:
               #       set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
               #
               set_tsk_need_resched(p);

               requeue_task(p, rq->active);
           }

           #
           # For a SCHED_FIFO task, the scheduling policy has no concept of timeslice.
           #

           goto out_unlock;
       }

       #
       # For a SCHED_NORMAL conventional task, decrease its timeslice; if it drops to 0,
       # set the TIF_NEED_RESCHED flag in "thread_info->flags" and do some other scheduling-related work.
       #

       if (!--p->time_slice) {

           dequeue_task(p, rq->active);

           set_tsk_need_resched(p);

           p->prio = effective_prio(p);
           p->time_slice = task_timeslice(p);
           p->first_time_slice = 0;

           if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
               enqueue_task(p, rq->expired);
               if (p->static_prio < rq->best_expired_prio)
                   rq->best_expired_prio = p->static_prio;
           } else
               enqueue_task(p, rq->active);

       } else {
           /*
           * Prevent a too long timeslice allowing a task to monopolize
           * the CPU. We do this by splitting up the timeslice into
           * smaller pieces.
           *
           * Note: this does not mean the task's timeslices expire or
           * get lost in any way, they just might be preempted by
           * another task of equal priority. (one with higher
           * priority would have preempted this task already.) We
           * requeue this task to the end of the list on this priority
           * level, which is in essence a round-robin of tasks with
           * equal priority.
           *
           * This only applies to tasks in the interactive
           * delta range with at least TIMESLICE_GRANULARITY to requeue.
           */
           if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
               p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
               (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
               (p->array == rq->active)) {

               requeue_task(p, rq->active);
               set_tsk_need_resched(p);
           }
       }

--------------------------------------------------------------------------------

As we see from update_process_times() / scheduler_tick(), it just sets the TIF_NEED_RESCHED flag in "thread_info"; it does not actually perform the task switch.

The actual timeslice task switch is performed later, in ASM code, on the return path from the interrupt (see 3.4).
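
The point that the tick only requests a switch can be shown with a tiny user-space model. This is a sketch, not kernel code: "struct toy_task", toy_scheduler_tick() and toy_run_ticks() are made-up names, and the priority recalculation and runqueue handling of the real scheduler_tick() are left out on purpose.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel task structure. */
struct toy_task {
    int  time_slice;     /* ticks left in the current timeslice */
    int  full_slice;     /* value used to refill time_slice */
    bool need_resched;   /* models TIF_NEED_RESCHED */
};

/* Models one timer tick: charge the tick and only *request* a switch. */
void toy_scheduler_tick(struct toy_task *p)
{
    assert(p->time_slice > 0);
    if (--p->time_slice == 0) {
        p->time_slice = p->full_slice;   /* refill for the next round */
        p->need_resched = true;          /* no task switch happens here */
    }
}

/* Runs n ticks and reports how many of them requested a reschedule. */
int toy_run_ticks(struct toy_task *p, int n)
{
    int requests = 0;
    for (int i = 0; i < n; i++) {
        p->need_resched = false;
        toy_scheduler_tick(p);
        if (p->need_resched)
            requests++;
    }
    return requests;
}
```

With a timeslice of 3 ticks, every third call to toy_scheduler_tick() raises the flag, and nothing else happens at that moment. In this model, whoever inspects "need_resched" afterwards is the one that would do the switch.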

================================================================================

3.2. voluntary task switch

Voluntary task switch means:

a task itself calls schedule() voluntarily, as part of its own logic, to give up the CPU.

It is also called "planned task switch".

A task performs a voluntary task switch by:

#1. entering a waitQ

#2. waiting for a semaphore / mutex or rw_semaphore # which is internally a waitQ.
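
From user space, one way to observe voluntary switches is the "ru_nvcsw" counter returned by getrusage(2) ("ru_nivcsw" is its involuntary counterpart). The sketch below assumes Linux and assumes that a 1 ms nanosleep() blocks the caller. The helper names voluntary_switches() and sleep_n_times() are made up for illustration, and mapping these counters onto the three switch types of this doc is only an approximation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/resource.h>
#include <time.h>

/* Voluntary context switches of this process so far, or -1 on error. */
long voluntary_switches(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_nvcsw;
}

/* Blocks n times for 1 ms each; returns the number of completed sleeps. */
int sleep_n_times(int n)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000000L };
    int done = 0;
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        if (nanosleep(&ts, NULL) == 0)   /* blocking means giving up the CPU */
            done++;
    return done;
}
```

Reading voluntary_switches() before and after sleep_n_times() should normally show the counter growing, since each sleep ends up giving up the CPU inside the kernel.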

See:
   <<synchronization - waitQ.txt>>
   <<synchronization - semaphone,mutex - new impl in recent kernel.txt>>
   <<synchronization - rw_semaphore.txt>>

================================================================================

3.3. preemption task switch

Preemption is enabled by the kernel option CONFIG_PREEMPT.

Unlike:

timeslice task switch # controlled by timeslice

voluntary task switch # controlled by task logic itself

preemption means:

A task running in kernel-level (of course, in non-atomic "user context") can be preempted by a higher-priority task.

Too theoretical? Here is one typical scenario of how it happens:

task #A is running in "user context"

a hardirq is generated, and the hardirq handler / softirq wakes up a higher-priority task #B. # or task #A itself wakes up task #B

when the hardirq handler / softirq returns, preemption happens: task #B preempts task #A.

--------------------------------------------------------------------------------

@@trace - preemption task switch - case: a higher-priority task is woken up

/**
* try_to_wake_up - wake up a thread
* @p: the thread to be awakened
* @state: the mask of task states that can be woken
* @wake_flags: wake modifier flags (WF_*)
*
* Put it on the run-queue if it's not already there. The "current"
* thread is always on the run-queue (except when the actual
* re-schedule is in progress), and as such you're allowed to do
* the simpler "current->state = TASK_RUNNING" to mark yourself
* runnable without the overhead of this.
*
* Returns %true if @p was woken up, %false if it was already running
* or @state didn't match @p's state.
*/
#
# try_to_wake_up() is to wake up the specified task.
#
# It is called in:
#
#   #1. waitQ wakeup APIs
#
#   #2. signal APIs
#           void signal_wake_up(struct task_struct *t, int resume)
#                   wake_up_state(t, mask)
#
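# [*] Well, for simplicity, the body below is again from kernel 2.6.11.12, where the third parameter is named "sync";
#     the header comment and signature shown here come from a more recent kernel.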
#
static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

   #
   # If the woken-up task has a higher priority than "current", resched_task() is called to set the TIF_NEED_RESCHED
   # flag of "current", which acts as a clue to do the preemption later.
   #
   if (!sync || cpu != this_cpu) {
       if (TASK_PREEMPTS_CURR(p, rq))
           resched_task(rq->curr);
   }
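
The check at the end of the wake-up can be modelled in a few lines. This is a sketch under simplifying assumptions: "struct toy_task", toy_preempts_curr() and toy_wake_up() are made-up names, and only the priority comparison is modelled (no runqueue, no locking, no "sync" handling).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel task structure. */
struct toy_task {
    int  prio;           /* lower value = higher priority, as in the kernel */
    bool need_resched;   /* models TIF_NEED_RESCHED */
};

/* Mirrors the idea of TASK_PREEMPTS_CURR(p, rq): does p outrank curr? */
bool toy_preempts_curr(const struct toy_task *p, const struct toy_task *curr)
{
    return p->prio < curr->prio;
}

/*
 * Models the tail of the wake-up: flag "curr" if the woken task outranks it.
 * Returns true when a preemption was requested.
 */
bool toy_wake_up(const struct toy_task *woken, struct toy_task *curr)
{
    assert(woken != curr);
    if (toy_preempts_curr(woken, curr)) {
        curr->need_resched = true;   /* request only; curr keeps running */
        return true;
    }
    return false;
}
```

Waking a task with prio 100 while a task with prio 120 is running sets the flag on the running task. Waking one with prio 130 leaves it untouched.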

--------------------------------------------------------------------------------

@@trace - resched_task() - sign of requesting preemption
/*
* resched_task - mark a task 'to be rescheduled now'.
*
* On UP this means the setting of the need_resched flag, on SMP it
* might also involve a cross-CPU call to trigger the scheduler on
* the target CPU.
*/
#
# resched_task() is the sign of requesting preemption.
#
# It is mainly called in the following cases:
#
#       #1. try_to_wake_up(), described above.
#
#       #2. wake_up_new_task(), in do_fork() ?? - this looks odd at first, since the child task should inherit the
#           priority of its parent "current".
#
#       #3. pull_task(), __migrate_task(), when a task migrates between CPUs on SMP.
#
#       #4. set_user_nice(), sched_setscheduler(), when a system call changes the priority of a task (not necessarily "current").
#
#
# It is called like this:
#
#       if (TASK_PREEMPTS_CURR(p, rq_dest))
#               resched_task(rq_dest->curr);
#
#       #define TASK_PREEMPTS_CURR(p, rq) \
#           ((p)->prio < (rq)->curr->prio)
#
#
# TASK_PREEMPTS_CURR() checks whether task "p" has a higher priority than "rq->curr" (a lower "prio" value means a higher priority).
#
#
# [*] Note "rq->curr" is not necessarily "current" of the local CPU, because in the cases above task "p" could be
#     on another CPU. That is why resched_task() takes "rq->curr" as its argument, not simply "current".
#
#
# The logic of resched_task():
#
#
#   If UP, just set the TIF_NEED_RESCHED flag of "rq_dest->curr", that is, "current".
#
#
#   If SMP:
#
#       if "rq_dest->curr" is on the local CPU, then simply set the TIF_NEED_RESCHED flag,
#       because "rq_dest->curr" == "current".
#
#       if "rq_dest->curr" is on another CPU, then besides setting the TIF_NEED_RESCHED flag of "rq_dest->curr",
#       send an IPI through smp_send_reschedule(). Normally the IPI ISR running on the other CPU does nothing
#       but simply IRET; preemption happens during the IRET procedure.
#
#
#       See
#           /arch/x86/kernel/smp.c - smp_reschedule_interrupt()
#

#ifdef CONFIG_SMP
static void resched_task(task_t *p)
{
   int need_resched, nrpolling;

   assert_spin_locked(&task_rq(p)->lock);

   /* minimise the chance of sending an interrupt to poll_idle() */
   nrpolling = test_tsk_thread_flag(p,TIF_POLLING_NRFLAG);
   need_resched = test_and_set_tsk_thread_flag(p,TIF_NEED_RESCHED);
   nrpolling |= test_tsk_thread_flag(p,TIF_POLLING_NRFLAG);

   if (!need_resched && !nrpolling && (task_cpu(p) != smp_processor_id()))
       smp_send_reschedule(task_cpu(p));
}
#else
static inline void resched_task(task_t *p)
{
   set_tsk_need_resched(p);
}
#endif
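
The SMP branch boils down to one decision: set the flag, and send an IPI only when that is really needed. Here is a user-space sketch of that decision. "struct toy_task" and toy_resched_task() are made-up names, and the double read of the polling flag in the real code is collapsed into a single field.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel task structure. */
struct toy_task {
    int  cpu;            /* CPU on which the task is currently running */
    bool need_resched;   /* models TIF_NEED_RESCHED */
    bool polling;        /* models TIF_POLLING_NRFLAG (idle task polls the flag) */
};

/* Sets the flag on p; returns true if a reschedule IPI would be sent. */
bool toy_resched_task(struct toy_task *p, int this_cpu)
{
    bool was_set = p->need_resched;   /* test-and-set: remember the old value */
    assert(p->cpu >= 0 && this_cpu >= 0);
    p->need_resched = true;
    /* IPI only if the request is new, the target is not polling, and it is remote. */
    return !was_set && !p->polling && p->cpu != this_cpu;
}
```

In this model a local target never needs an IPI, a remote target gets exactly one per new request, and a polling idle task is expected to notice the flag on its own.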

--------------------------------------------------------------------------------

We see that resched_task() just sets the TIF_NEED_RESCHED flag, just like scheduler_tick() does for timeslice task switch.

The actual preemption happens later, in ASM code (see 3.4).

================================================================================

3.4. actual task switch for timeslice / preemption task switch

As we saw previously, timeslice task switch and preemption task switch just set the TIF_NEED_RESCHED flag; they do not actually perform the task switch.

The actual task switch is performed by ASM code, at the labels:

"resume_userspace"

"resume_kernel"

See:

<<kernel control path - exception,hardirq - ret_from_exception,ret_from_intr - x86.vsd>>

--------------------------------------------------------------------------------

@@trace - "resume_userspace" - perform actual task switch for timeslice and preemption task switch.

#
# Well, this is more recent kernel code, not the kernel 2.6.11.12 mentioned earlier --- but the principle is the same.
#

/arch/x86/kernel/entry_32.S
resume_userspace

ENTRY(resume_userspace)
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_OFF

#
# Check “thread_info->flags”, to see if any work to do,
# if so, jump to “work_pending”.
#
movl TI_flags(%ebp), %ecx
andl $_TIF_WORK_MASK, %ecx      # is there any work to be done on
                                # int/exception return?
jne work_pending

#
# If no pending work, then, jump to “restore_all”,
# to return to user-level.
#
jmp restore_all
END(ret_from_exception)

work_pending:
#
# Check if TIF_NEED_RESCHED flag set in “thread_info->flags”.
#
#
# If NO, then, the pending work MUST be about signal, then, jump to “work_notifysig”.
#
#
# If YES, then, fall through to “work_resched”.
#

testb $_TIF_NEED_RESCHED, %cl
jz work_notifysig

#
# "work_resched": we know we have pending schedule work, so we call
# schedule() to do the task switch.
#
# After some time, "current" gets scheduled to run again and returns to here.
#
# Again, it needs to check whether it still has pending work.
#
# If NO, then jump to "restore_all" to return to user-level.
#
# If YES, then it checks whether the pending work is again schedule work, and if so jumps back to
# "work_resched" to schedule again.
#
#
# [*] This is where timeslice task switch happens.
#
work_resched:
call schedule
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_ANY)    # make sure we don't miss an interrupt
                                # setting need_resched or sigpending
                                # between sampling and the iret
TRACE_IRQS_OFF
movl TI_flags(%ebp), %ecx
andl $_TIF_WORK_MASK, %ecx      # is there any work to be done other
                                # than syscall tracing?
jz restore_all
testb $_TIF_NEED_RESCHED, %cl
jnz work_resched
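
The control flow of this return-to-user path can be written as a small C loop. This is only a sketch of the flow: the TOY_* bits, "struct toy_exit_stats" and toy_resume_userspace() are made-up names, the model knows only two work bits, and the toy schedule() step simply clears the request, where the real schedule() would switch to another task and come back later.

```c
#include <assert.h>

#define TOY_NEED_RESCHED  0x1u   /* models _TIF_NEED_RESCHED */
#define TOY_SIGPENDING    0x2u   /* models _TIF_SIGPENDING */
#define TOY_WORK_MASK     (TOY_NEED_RESCHED | TOY_SIGPENDING)

/* Counts what the return path did before going back to user-level. */
struct toy_exit_stats {
    int schedule_calls;
    int signal_calls;
};

/* Walks the return-to-user path until no work bit is left in *flags. */
struct toy_exit_stats toy_resume_userspace(unsigned int *flags)
{
    struct toy_exit_stats st = { 0, 0 };
    assert(flags != 0);
    while (*flags & TOY_WORK_MASK) {           /* work_pending */
        if (*flags & TOY_NEED_RESCHED) {       /* work_resched */
            *flags &= ~TOY_NEED_RESCHED;       /* toy schedule(): request served */
            st.schedule_calls++;
        } else {                               /* work_notifysig */
            *flags &= ~TOY_SIGPENDING;
            st.signal_calls++;
        }
    }
    return st;                                 /* restore_all */
}
```

Starting with both bits set, the loop first serves the reschedule request, then the signal, and only then falls out to the "restore_all" step.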

--------------------------------------------------------------------------------

@@trace - "resume_kernel" - perform actual task switch for timeslice and preemption task switch.

#
# /arch/x86/kernel/entry_32.S
#
# Note, if !CONFIG_PREEMPT, then, "resume_kernel" is directly "restore_all":
#
#   #ifdef CONFIG_PREEMPT
#   ...
#   #else
#   #define resume_kernel       restore_all
#   #endif
#

#ifdef CONFIG_PREEMPT
ENTRY(resume_kernel)
DISABLE_INTERRUPTS(CLBR_ANY)

#
# If "thread_info->preempt_count" is not 0, then the interrupted code MUST
# have disabled preemption, so preemption is not allowed here;
# jump to "restore_all" to resume the previously interrupted kernel code.
#
cmpl $0,TI_preempt_count(%ebp) # non-zero preempt_count ?
jnz restore_all

#
# The "need_resched" label performs the preemption.
#
# It checks the TIF_NEED_RESCHED flag in "thread_info->flags".
#
# If the flag is not set, then there is no preemption request, so jump to
# "restore_all" to return to the previously interrupted code (the point in kernel-level where it was interrupted).
#
# Otherwise, call preempt_schedule_irq() to perform the preemption.
#

need_resched:
movl TI_flags(%ebp), %ecx # need_resched set ?
testb $_TIF_NEED_RESCHED, %cl
jz restore_all
testl $X86_EFLAGS_IF,PT_EFLAGS(%esp) # interrupts off (exception path) ?
jz restore_all
#
# Note that "need_resched" is a loop: if "current" gets preempted in
# preempt_schedule_irq() and, after some time, gets scheduled to run again,
# it comes back to here and continues "resume_kernel". It still needs to
# check whether it has a new preemption request. If NO, it goes to
# "restore_all".
#
call preempt_schedule_irq
jmp need_resched
END(resume_kernel)
#endif
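
The three tests in "resume_kernel" can be summarised as one predicate. The sketch below is a user-space model with made-up names ("struct toy_irq_return", toy_should_preempt()); it covers only the decision itself, not the loop around preempt_schedule_irq().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state seen when returning to kernel-level. */
struct toy_irq_return {
    int  preempt_count;   /* models thread_info->preempt_count */
    bool need_resched;    /* models TIF_NEED_RESCHED */
    bool irqs_were_on;    /* models X86_EFLAGS_IF in the saved EFLAGS */
};

/* True if the return-to-kernel path would call preempt_schedule_irq(). */
bool toy_should_preempt(const struct toy_irq_return *s)
{
    assert(s->preempt_count >= 0);
    if (s->preempt_count != 0)
        return false;   /* preemption disabled by the interrupted code */
    if (!s->need_resched)
        return false;   /* nobody asked for a switch */
    if (!s->irqs_were_on)
        return false;   /* interrupted code was running with interrupts off */
    return true;
}
```

All three conditions must hold at the same time. Flipping any one of them makes the path fall through to the "restore_all" case.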

--------------------------------------------------------------------------------

[*][*] !!__Note that the "resume_kernel" label is not the only place doing preemption task switch; in fact, the "resume_userspace" label does it too.

This is an easy misunderstanding to fall into:

"resume_userspace" - just does timeslice task switch.

"resume_kernel" - just does preemption task switch.

Both "resume_kernel" and "resume_userspace" use TIF_NEED_RESCHED as the clue.

We differentiate timeslice task switch and preemption task switch only by how the TIF_NEED_RESCHED flag was set:

for timeslice task switch, it is set by scheduler_tick() in the timer ISR.

for preemption task switch, it is set by resched_task(), which is called in many cases.

Not by whether the actual task switch is done in "resume_kernel" or "resume_userspace".

In fact, it is easy to imagine:

"resume_userspace" doing preemption task switch, like this:

A system call service routine calls set_user_nice() / sched_setscheduler() to boost a task's priority above that of "current", so resched_task() is called.

No hardirq happens, so on the way back to user-level "resume_userspace" finds the TIF_NEED_RESCHED flag set and does the task switch.

"resume_kernel" doing timeslice task switch, like this:

A task is executing a read() system call service routine for a lengthy read operation.

A timer interrupt happens, and the timer ISR finds this task has used up its timeslice, so it sets the TIF_NEED_RESCHED flag.

"resume_kernel" performs the timeslice task switch.

[*][*] If we want to be strict, "kernel preemption" is only about the kernel-level side ("resume_kernel"), not "resume_userspace".

And it is about 2 functions:

preempt_schedule()
called in:

               #define preempt_enable() \           # when preemption is reenabled.
               do { \
                   preempt_enable_no_resched(); \
                   barrier(); \
                   preempt_check_resched(); \
               } while (0)

                   #define preempt_check_resched() \
                   do { \
                       if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
                           preempt_schedule(); \
                   } while (0)

preempt_schedule_irq()
called in "resume_kernel", to perform kernel preemption when IRET.

[*][*] Preemption has become a regular part of the kernel, so don't be too strict about the terms.

================================================================================

3.5. synchronization - disable preemption

preempt_disable()

preempt_enable()

Disabling preemption alone is not that useful.

It just:

prevents "current" from being preempted by other tasks.

On its own it only applies to this situation:

UP, where multiple tasks could access a shared resource in "user context".
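
How a preemption request is deferred across a preempt_disable() / preempt_enable() pair can be sketched as a nesting counter. This is a user-space model with made-up names ("struct toy_cpu", toy_preempt_disable(), toy_preempt_enable()). It assumes the switch is allowed only once the count is back to 0, and a simple counter increment stands in for preempt_schedule().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for the model. */
struct toy_cpu {
    int  preempt_count;   /* nesting depth of toy_preempt_disable() */
    bool need_resched;    /* models TIF_NEED_RESCHED */
    int  switches;        /* how many times the toy scheduler ran */
};

/* Enters a non-preemptible region; calls may nest. */
void toy_preempt_disable(struct toy_cpu *c)
{
    c->preempt_count++;
}

/* Drops one nesting level; runs a deferred switch once the count is 0. */
void toy_preempt_enable(struct toy_cpu *c)
{
    assert(c->preempt_count > 0);
    c->preempt_count--;
    if (c->need_resched && c->preempt_count == 0) {
        c->need_resched = false;
        c->switches++;    /* stands in for preempt_schedule() */
    }
}
```

If the flag gets set while the count is 2, the first toy_preempt_enable() does nothing visible, and the switch is taken on the second one, when the count drops to 0.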

================================================================================

4. misc tips

[*] The TIF_NEED_RESCHED flag is the clue for:

timeslice task switch - set by scheduler_tick()

preemption task switch - set by resched_task()

================================================================================

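4.1. question: in kernel-level, if task #A in "user context" (not irq or bh) wakes up a higher-priority task #B on the same CPU, does the preemption switch happen *immediately*?

Judging from the analysis in

3.3. preemption task switch

above, and from the code details, the switch does not happen *immediately*.

task #A

try_to_wake_up( task #B )

.... only sets the TIF_NEED_RESCHED flag of task #A (as "current").

Then task #A continues to execute.

The code that really does the switch runs in 2 cases:

#1. task #A returns to user-level:

resume_userspace -> work_pending -> work_resched -> call schedule()

#2. an interrupt is triggered, and the switch happens on return from irq context: # (it could be any interrupt on the local CPU, including the timer interrupt, or an IPI sent because another CPU woke up the local task #B)

resume_kernel -> need_resched -> call preempt_schedule_irq() -> schedule()

# Note: with CONFIG_PREEMPT there is also the preempt_enable() -> preempt_check_resched() -> preempt_schedule() path described in 3.4.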

================================================================================

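4.2. --- In fact, an "active" task switch happens ** immediately **, but a "passive" task switch ** never happens immediately **

"Active" task switch --- "current" calls schedule() on its own initiative (acquiring a semaphore, entering a waitQ, and so on; underneath it is always schedule() or schedule_timeout()).

This happens immediately.

"Passive" task switch --- the timeslice is used up, or the task is preempted by a newly woken-up higher-priority task.

Neither happens immediately.

Instead:

timeslice used up --- update_process_times(), called by the timer ISR, sets the TIF_NEED_RESCHED flag of "current"

preempted --- this also just sets the TIF_NEED_RESCHED flag of "current"

The real switch happens at some later moment --- when the ASM code in resume_userspace or resume_kernel sees the TIF_NEED_RESCHED flag and ends up calling schedule().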

================================================================================
