Task switch in Linux: timeslice, voluntary, preemption

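#
# I was asked: "In what situations can code running in the kernel be interleaved?"
#
#
# I studied this years ago. When answering, I wanted to give the details rather than just the general principle, but I could not describe them clearly.        # It was so long ago that my memory had become very vague.
#
#         I stumbled on task switch in "user context": scheduling / preemption.
#
#
# So here I tidy up my old study notes.        # This only covers task switch in "user context".
#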

 

=================================================================================

--- index:


1. intro


2. reference


3. task switch types

    3.1. timeslice task switch

    3.2. voluntary task switch

    3.3. preemption task switch

    3.4. actual task switch for timeslice / preemption task switch

    3.5. synchronization - disable preemption


4. misc tips

 

    4.1. question: in kernel-level, if task #A in "user context" (not irq or bh) wakes up a higher-priority task #B on the same CPU, does the preemption switch happen *immediately*?


    4.2. in fact, an "active" task switch happens ** immediately **, but a "passive" task switch ** never happens immediately **


================================================================================

1. intro


This doc describes 3 task switch types.


================================================================================

2. reference


    [1]. ULK - Understanding the Linux Kernel, 3rd Edition (O'Reilly)


================================================================================

3. task switch types


 This doc does not cover the task switch logic itself, which is described in:


    <<task - basic task definition, task switch logic,  creation, destruction, wait system call.txt>>


 Instead, it covers the task switch types.


 There are 3 task switch types:

        regular timeslice

        voluntary

        preemption


================================================================================

3.1. timeslice task switch


 The timeslice task switch is the one best known from OS theory:


    Each task is allocated a timeslice according to its priority. When it uses up its timeslice, it gives up the CPU, and another task gets scheduled to run.


 Since timeslice task switch is driven by the timeslice, the procedure is:

    At each timer interrupt, the timer ISR checks whether "current" has used up its timeslice. If YES, it requests a task switch.

@@trace - timeslice task switch - timer ISR checks "current" timeslice - kernel 2.6.11.12

#
# update_process_times() is called by global timer ISR on UP, or local timer ISR on SMP, to update some time-related
# information of "current".
#
#
# [*] For ease of understanding, the simpler kernel 2.6.11.12 code is used here.
#
/*
 * Called from the timer interrupt handler to charge one tick to the current
 * process.  user_tick is 1 if the tick is user time, 0 for system.
 */
void update_process_times(int user_tick)

    #
    # scheduler_tick() is the one that actually does the job.
    #
    scheduler_tick();

        #
        # For real-time task:
        #
        #        SCHED_RR
        #
        #        SCHED_FIFO
        #
        if (rt_task(p)) {
            /*
             * RR tasks need a special form of timeslice management.
             * FIFO tasks have no timeslices.
             */

            #
            # For a SCHED_RR task, decrease its timeslice; if it drops to 0, set the TIF_NEED_RESCHED flag
            # in "thread_info->flags".
            #

            if ((p->policy == SCHED_RR) && !--p->time_slice) {

                p->time_slice = task_timeslice(p);
                p->first_time_slice = 0;

                set_tsk_need_resched(p);
                    set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);

                requeue_task(p, rq->active);
            }
    
            #
            # A SCHED_FIFO task has no concept of timeslice in its scheduling policy.
            #

            goto out_unlock;
        }


        #    
        # For a SCHED_NORMAL conventional task, decrease its timeslice; if it drops to 0,
        # set the TIF_NEED_RESCHED flag in "thread_info->flags", and do some other scheduling-related work.
        #

        if (!--p->time_slice) {
                
            dequeue_task(p, rq->active);

            set_tsk_need_resched(p);

            p->prio = effective_prio(p);
            p->time_slice = task_timeslice(p);
            p->first_time_slice = 0;
            

            if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
                enqueue_task(p, rq->expired);
                if (p->static_prio < rq->best_expired_prio)
                    rq->best_expired_prio = p->static_prio;
            } else
                enqueue_task(p, rq->active);

        } else {
            /*
             * Prevent a too long timeslice allowing a task to monopolize
             * the CPU. We do this by splitting up the timeslice into
             * smaller pieces.
             *
             * Note: this does not mean the task's timeslices expire or
             * get lost in any way, they just might be preempted by
             * another task of equal priority. (one with higher
             * priority would have preempted this task already.) We
             * requeue this task to the end of the list on this priority
             * level, which is in essence a round-robin of tasks with
             * equal priority.
             *
             * This only applies to tasks in the interactive
             * delta range with at least TIMESLICE_GRANULARITY to requeue.
             */
            if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
                p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
                (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
                (p->array == rq->active)) {
    
                requeue_task(p, rq->active);
                set_tsk_need_resched(p);
            }
        }


--------------------------------------------------------------------------------

As we see from update_process_times() / scheduler_tick(), they only set the TIF_NEED_RESCHED flag in "thread_info"; they do not actually perform the task switch.


The actual timeslice task switch is performed later, in ASM code (we will see it in 3.4).
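
To make the "set the flag now, switch later" idea concrete, here is a minimal user-space sketch in C. It is only a toy model, not kernel code: struct tick_task, tick_scheduler_tick() and tick_return_path() are names made up for this illustration, and the boolean need_resched merely stands in for TIF_NEED_RESCHED.

```c
#include <stdbool.h>

/* Toy model only: every name below is hypothetical, nothing here is kernel API. */
struct tick_task {
    int  time_slice;     /* remaining ticks in the current timeslice */
    int  full_slice;     /* value used to refill the timeslice on expiry */
    bool need_resched;   /* stands in for TIF_NEED_RESCHED */
};

/* Models the SCHED_NORMAL branch of scheduler_tick(): charge one tick and,
 * on expiry, only mark the task. No task switch is performed here. */
void tick_scheduler_tick(struct tick_task *p)
{
    if (--p->time_slice == 0) {
        p->time_slice = p->full_slice;   /* refill for the next round */
        p->need_resched = true;          /* request the switch, do not perform it */
    }
}

/* Models the check on the interrupt return path.
 * Returns true if a task switch would be performed at this point. */
bool tick_return_path(struct tick_task *p)
{
    if (!p->need_resched)
        return false;
    p->need_resched = false;             /* the request is consumed by the switch */
    return true;
}
```

With a task that has 2 ticks left, the second call to tick_scheduler_tick() sets the flag, and only the following tick_return_path() call reports a switch.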


================================================================================

3.2. voluntary task switch


 Voluntary task switch means:

        a task itself calls schedule() voluntarily, as part of its own logic, to give up the CPU.


 It is also called "planned task switch".


 A task performs a voluntary task switch by:

    #1. entering a waitQ

    #2. waiting for a semaphore / mutex or rw_semaphore        # which internally is a waitQ.


See:
    <<synchronization - waitQ.txt>>
    <<synchronization - semaphone,mutex - new impl in recent kernel.txt>>
    <<synchronization - rw_semaphore.txt>>
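
The shape of a voluntary switch (mark yourself as sleeping, then call the scheduler right away) can be sketched in user space as below. This is a toy model under my own assumptions: struct vol_task, struct vol_cpu, vol_schedule(), vol_wait() and vol_wake_up() are hypothetical names, the scheduler is a plain round robin, and the idle task is not modelled.

```c
#include <stddef.h>

/* Toy model only: hypothetical names, no kernel API involved. */
enum vol_state { VOL_RUNNING, VOL_SLEEPING };

struct vol_task {
    const char    *name;
    enum vol_state state;
};

struct vol_cpu {
    struct vol_task *tasks;     /* all tasks known to this toy CPU */
    size_t           ntasks;
    size_t           current;   /* index of the task that is running */
    unsigned         switches;  /* number of task switches performed */
};

/* Pick the next runnable task after "current" (simple round robin). */
void vol_schedule(struct vol_cpu *cpu)
{
    for (size_t i = 1; i <= cpu->ntasks; i++) {
        size_t next = (cpu->current + i) % cpu->ntasks;
        if (cpu->tasks[next].state == VOL_RUNNING) {
            if (next != cpu->current) {
                cpu->current = next;
                cpu->switches++;
            }
            return;
        }
    }
    /* Nobody is runnable: a real kernel would run the idle task here. */
}

/* Voluntary switch: "current" puts itself to sleep and calls the scheduler
 * immediately, from within its own logic. */
void vol_wait(struct vol_cpu *cpu)
{
    cpu->tasks[cpu->current].state = VOL_SLEEPING;
    vol_schedule(cpu);
}

/* A waker only makes the sleeper runnable again; it does not switch to it. */
void vol_wake_up(struct vol_task *t)
{
    t->state = VOL_RUNNING;
}
```

Note that in this model the switch happens inside vol_wait() itself, while vol_wake_up() changes nothing about who is running.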


================================================================================

3.3. preemption task switch


 Preemption is enabled by the kernel option CONFIG_PREEMPT.


 Unlike:

        timeslice task switch        # controlled by timeslice

        voluntary task switch        # controlled by task logic itself


 preemption means:

    A task running in kernel-level (of course, in non-atomic "user context") can be preempted by a higher-priority task.

 Too theoretical? Here is one typical scenario of how it happens:


    task #A is running in "user context"


        a hardirq is raised, and the hardirq handler / softirq wakes up a higher-priority task #B.        # or task #A itself wakes up task #B


        when the hardirq handler / softirq returns, preemption happens: task #B preempts task #A.


--------------------------------------------------------------------------------

@@trace - preemption task switch - case: a higher-priority task is woken up

/**
 * try_to_wake_up - wake up a thread
 * @p: the thread to be awakened
 * @state: the mask of task states that can be woken
 * @wake_flags: wake modifier flags (WF_*)
 *
 * Put it on the run-queue if it's not already there. The "current"
 * thread is always on the run-queue (except when the actual
 * re-schedule is in progress), and as such you're allowed to do
 * the simpler "current->state = TASK_RUNNING" to mark yourself
 * runnable without the overhead of this.
 *
 * Returns %true if @p was woken up, %false if it was already running
 * or @state didn't match @p's state.
 */
#
# try_to_wake_up() is to wake up the specified task.
#
# It is called in:
#
#    #1. waitQ wakeup APIs
#
#    #2. signal APIs
#            void signal_wake_up(struct task_struct *t, int resume)
#                    wake_up_state(t, mask)
#
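# [*] Well, for simplicity, the body excerpt below is kernel 2.6.11.12 code again. The header comment and prototype
# shown here come from a newer kernel; in 2.6.11.12 the third parameter is "int sync", which is what the excerpt tests.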
#
static int
try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

    #
    # If the woken-up task has higher priority than "rq->curr" (which is "current" when the wakeup targets the local CPU),
    # then resched_task() is called to set the TIF_NEED_RESCHED flag of "rq->curr", which serves as the request to preempt it later.
    #
    if (!sync || cpu != this_cpu) {
        if (TASK_PREEMPTS_CURR(p, rq))
            resched_task(rq->curr);
    }
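
The priority test in the excerpt above can be tried out in user space with the small sketch below. It is a toy model: struct wk_task, struct wk_rq, wk_preempts_curr() and wk_wake_up() are hypothetical names, and only the comparison itself mirrors TASK_PREEMPTS_CURR() (a smaller "prio" value means a higher priority).

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, no kernel API involved. */
struct wk_task {
    int  prio;           /* smaller value = higher priority */
    bool need_resched;   /* stands in for TIF_NEED_RESCHED */
};

struct wk_rq {
    struct wk_task *curr;   /* the task running on this runqueue's CPU */
};

/* Same comparison as TASK_PREEMPTS_CURR(p, rq). */
static bool wk_preempts_curr(const struct wk_task *p, const struct wk_rq *rq)
{
    return p->prio < rq->curr->prio;
}

/* Models the tail of a wakeup: if the woken task "p" beats the running task,
 * only raise a preemption request on the running task.
 * Returns true if a request was raised. rq->curr is never changed here. */
bool wk_wake_up(const struct wk_task *p, struct wk_rq *rq)
{
    if (!wk_preempts_curr(p, rq))
        return false;
    rq->curr->need_resched = true;
    return true;
}
```

Waking a task with prio 100 while a task with prio 120 is running raises the request; waking one with prio 130 does not. In both cases the running task stays the same.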


--------------------------------------------------------------------------------

@@trace - resched_task() - sign of requesting preemption
/*
 * resched_task - mark a task 'to be rescheduled now'.
 *
 * On UP this means the setting of the need_resched flag, on SMP it
 * might also involve a cross-CPU call to trigger the scheduler on
 * the target CPU.
 */
#
# resched_task() is the sign of requesting preemption.
#
# It is mainly called in the following cases:
#
#        #1. try_to_wake_up(), described above.
#
#        #2. wake_up_new_task(), in do_fork() ?? - this doesn't seem to make sense to me, since the child task should inherit the priority of
# its parent "current".
#
#        #3. pull_task(), __migrate_task(), when a task migrates between CPUs on SMP.
#
#        #4. set_user_nice(), sched_setscheduler(), when a system call changes the priority of a task (not necessarily "current").
#
#
# It is called like this:
#
#        if (TASK_PREEMPTS_CURR(p, rq_dest))
#                resched_task(rq_dest->curr);
#    
#        #define TASK_PREEMPTS_CURR(p, rq) \
#            ((p)->prio < (rq)->curr->prio)
#
#    
# TASK_PREEMPTS_CURR() checks whether task "p" has a higher priority than "rq->curr".
#
#
# [*] Note that "rq->curr" is not necessarily "current" of the local CPU, because in the cases above task "p" could be
# on another CPU. That is why resched_task() takes "rq->curr" as its argument, not simply "current".
#
#
# The logic of resched_task():
#
#
#    On UP, just set the TIF_NEED_RESCHED flag of "rq_dest->curr", that is, "current".
#
#
#    On SMP:
#
#        if "rq_dest->curr" is on the local CPU, then simply set the TIF_NEED_RESCHED flag,
#        because "rq_dest->curr" == "current".
#
#        if "rq_dest->curr" is on another CPU, then besides setting the TIF_NEED_RESCHED flag of "rq_dest->curr",
#        send an IPI through smp_send_reschedule(). Normally the IPI ISR running on the other CPU does nothing
#        but simply IRET; preemption happens during the IRET procedure.
#
#
#        See 
#            /arch/x86/kernel/smp.c - smp_reschedule_interrupt()
#


#ifdef CONFIG_SMP
static void resched_task(task_t *p)
{
    int need_resched, nrpolling;

    assert_spin_locked(&task_rq(p)->lock);

    /* minimise the chance of sending an interrupt to poll_idle() */
    nrpolling = test_tsk_thread_flag(p,TIF_POLLING_NRFLAG);
    need_resched = test_and_set_tsk_thread_flag(p,TIF_NEED_RESCHED);
    nrpolling |= test_tsk_thread_flag(p,TIF_POLLING_NRFLAG);

    if (!need_resched && !nrpolling && (task_cpu(p) != smp_processor_id()))
        smp_send_reschedule(task_cpu(p));
}
#else
static inline void resched_task(task_t *p)
{
    set_tsk_need_resched(p);
}
#endif
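
The SMP branch above boils down to one decision: set the flag, and send an IPI only if the flag was not already set, the target is not polling, and the target runs on another CPU. A minimal user-space sketch of that decision follows. It is a toy model: struct ipi_task and ipi_resched_task() are hypothetical names, the second sampling of the polling flag is left out, and no locking or atomicity is modelled.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, no kernel API involved. */
struct ipi_task {
    int  cpu;            /* CPU the task is running on */
    bool need_resched;   /* stands in for TIF_NEED_RESCHED */
    bool polling;        /* stands in for TIF_POLLING_NRFLAG */
};

/* Always sets need_resched on "p".
 * Returns true if a reschedule IPI would additionally be sent to p's CPU. */
bool ipi_resched_task(struct ipi_task *p, int this_cpu)
{
    bool was_set = p->need_resched;   /* the "test" half of test-and-set */

    p->need_resched = true;           /* the "set" half */

    return !was_set && !p->polling && p->cpu != this_cpu;
}
```

The three conditions each suppress a redundant IPI: a flag that is already set has been (or will be) noticed, a polling idle loop watches the flag by itself, and the local CPU will check the flag on its own return path.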


--------------------------------------------------------------------------------

 We see that resched_task() just sets the TIF_NEED_RESCHED flag, just like scheduler_tick() does for timeslice task switch.


 The actual preemption happens in ASM code (we will see it in 3.4).


================================================================================

3.4. actual task switch for timeslice / preemption task switch


 As we saw previously, both timeslice task switch and preemption task switch only set the TIF_NEED_RESCHED flag; they do not actually perform the task switch.


 The actual task switch is performed by the ASM code at these labels:

        "resume_userspace"

        "resume_kernel"


 See:

        <<kernel control path - exception,hardirq - ret_from_exception,ret_from_intr - x86.vsd>>


--------------------------------------------------------------------------------

@@trace - "resume_userspace" - perform actual task switch for timeslice and preemption task switch.

#
# Well, this is more recent kernel code, not the kernel 2.6.11.12 mentioned before, but the principle is the same.
#

/arch/x86/kernel/entry_32.S
resume_userspace

ENTRY(resume_userspace)
   LOCKDEP_SYS_EXIT
   DISABLE_INTERRUPTS(CLBR_ANY)
   TRACE_IRQS_OFF

   #
   # Check “thread_info->flags”, to see if any work to do,
   # if so, jump to “work_pending”.
   #
   movl TI_flags(%ebp), %ecx
   andl $_TIF_WORK_MASK, %ecx  # is there any work to be
                               # done on int/exception
                               # return?
   jne work_pending

  #
  # If no pending work, then, jump to “restore_all”,
  # to return to user-level.
  #
   jmp restore_all
END(ret_from_exception)


work_pending:
    #
    # Check if the TIF_NEED_RESCHED flag is set in “thread_info->flags”.
    #
    #
    # If NO, then the pending work is something else (typically a signal), so jump to “work_notifysig”.
    #
    #
    # If YES, then fall through to “work_resched”.
    #

    testb $_TIF_NEED_RESCHED, %cl
    jz work_notifysig


#
# “work_resched”: we know a reschedule is pending, so we call
# schedule() to do the task switch.
#
# After some time, “current” gets scheduled and runs again, returning to here.
#
# Again, it needs to check whether it still has pending work.
#
# If NO, then jump to “restore_all” to return to user-level.
#
# If YES, then it checks whether a reschedule is pending again, and jumps back to
# “work_resched” to schedule again if so.
#
#
# [*] This is where timeslice task switch happens.
#
work_resched:
    call schedule
    LOCKDEP_SYS_EXIT
    DISABLE_INTERRUPTS(CLBR_ANY)    # make sure we don't miss an interrupt
                    # setting need_resched or sigpending
                    # between sampling and the iret
    TRACE_IRQS_OFF
    movl TI_flags(%ebp), %ecx
    andl $_TIF_WORK_MASK, %ecx  # is there any work to be done other
                    # than syscall tracing?
    jz restore_all
    testb $_TIF_NEED_RESCHED, %cl
    jnz work_resched
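
The control flow of "resume_userspace" / "work_pending" / "work_resched" reads more easily as a loop. Below is a rough user-space C rendering of it, as a sketch under my own simplifications: struct ru_thread, ru_schedule(), ru_do_signal() and ru_resume_userspace() are hypothetical names, the two RU_* bits merely stand in for _TIF_NEED_RESCHED and _TIF_SIGPENDING, and the signal path is assumed to loop back to the same check.

```c
/* Toy model only: hypothetical names, no kernel API involved. */
#define RU_NEED_RESCHED 0x1u   /* stands in for _TIF_NEED_RESCHED */
#define RU_SIGPENDING   0x2u   /* stands in for _TIF_SIGPENDING */
#define RU_WORK_MASK    (RU_NEED_RESCHED | RU_SIGPENDING)

struct ru_thread {
    unsigned flags;            /* stands in for thread_info->flags */
    unsigned schedule_calls;   /* how often the "schedule" step ran */
    unsigned signal_calls;     /* how often the "signal" step ran */
};

/* Stand-in for schedule(): the pending request is consumed. */
static void ru_schedule(struct ru_thread *t)
{
    t->flags &= ~RU_NEED_RESCHED;
    t->schedule_calls++;
}

/* Stand-in for the signal handling done at "work_notifysig". */
static void ru_do_signal(struct ru_thread *t)
{
    t->flags &= ~RU_SIGPENDING;
    t->signal_calls++;
}

/* Keep handling pending work until none is left, then "restore_all". */
void ru_resume_userspace(struct ru_thread *t)
{
    while (t->flags & RU_WORK_MASK) {
        if (t->flags & RU_NEED_RESCHED)
            ru_schedule(t);      /* "work_resched" */
        else
            ru_do_signal(t);     /* "work_notifysig" */
    }
    /* "restore_all": return to user-level */
}
```

The point of the loop is that new work may appear while the task was switched out, so the flags are sampled again after every step.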


--------------------------------------------------------------------------------

@@trace - "resume_kernel" - perform actual task switch for timeslice and preemption task switch.

#
# /arch/x86/kernel/entry_32.S
#
# Note, if !CONFIG_PREEMPT, then, "resume_kernel" is directly "restore_all":
#
#    #ifdef CONFIG_PREEMPT
#    ...
#    #else
#    #define resume_kernel        restore_all
#    #endif
#

#ifdef CONFIG_PREEMPT
ENTRY(resume_kernel)
    DISABLE_INTERRUPTS(CLBR_ANY)

    #
    # If “thread_info->preempt_count” is not 0, then the interrupted code has
    # disabled preemption (or was in atomic context), so preemption is not allowed here;
    # jump to “restore_all” to resume the previously interrupted kernel code.
    #
    cmpl $0,TI_preempt_count(%ebp)  # non-zero preempt_count ?
    jnz restore_all


#
# The “need_resched” label performs the preemption.
#
#   It checks the TIF_NEED_RESCHED flag in “thread_info->flags”.
#
#          If the flag is not set, there is no preemption request, so jump to
# “restore_all” to return to the previously interrupted code (the point in kernel-level where it was interrupted).
#
#          Otherwise, call preempt_schedule_irq() to perform the preemption.

need_resched:
    movl TI_flags(%ebp), %ecx   # need_resched set ?
    testb $_TIF_NEED_RESCHED, %cl
    jz restore_all
    testl $X86_EFLAGS_IF,PT_EFLAGS(%esp)    # interrupts off (exception path) ?
    jz restore_all
    #
    # Note that “need_resched” is a loop: if “current” gets preempted in
    # preempt_schedule_irq(), then after some time it gets scheduled and runs again,
    # returns to here, and continues in “resume_kernel”. It still needs to
    # check whether a new preemption request has arrived. If NO, it goes to
    # “restore_all”.
    #
    call preempt_schedule_irq
    jmp need_resched
END(resume_kernel)
#endif
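
The three checks that "resume_kernel" makes before it preempts can be summed up in one small decision function. This is a user-space sketch, not the real entry code: struct rk_state, enum rk_action and rk_resume_kernel() are hypothetical names, and the three fields stand in for "thread_info->preempt_count", TIF_NEED_RESCHED and the X86_EFLAGS_IF bit of the saved EFLAGS.

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, no kernel API involved. */
struct rk_state {
    int  preempt_count;   /* stands in for thread_info->preempt_count */
    bool need_resched;    /* stands in for TIF_NEED_RESCHED */
    bool irqs_were_on;    /* stands in for X86_EFLAGS_IF in the saved EFLAGS */
};

enum rk_action {
    RK_RESTORE_ALL,   /* resume the interrupted kernel code */
    RK_PREEMPT        /* call preempt_schedule_irq() */
};

/* Same order of checks as the assembly above. */
enum rk_action rk_resume_kernel(const struct rk_state *s)
{
    if (s->preempt_count != 0)   /* preemption disabled or atomic context */
        return RK_RESTORE_ALL;
    if (!s->need_resched)        /* nobody asked for a reschedule */
        return RK_RESTORE_ALL;
    if (!s->irqs_were_on)        /* interrupted code had interrupts off */
        return RK_RESTORE_ALL;
    return RK_PREEMPT;
}
```

Only the combination "preempt_count is 0, flag set, interrupts were on" leads to preemption; any other combination resumes the interrupted code.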


--------------------------------------------------------------------------------

[*][*] !! Note that the "resume_kernel" label is not the only place that does preemption task switch; in fact, the "resume_userspace" label does it too.


 This is an easy misunderstanding to fall into:


        "resume_userspace"  -    only does timeslice task switch.


        "resume_kernel"        -    only does preemption task switch.

 Both "resume_kernel" and "resume_userspace" use TIF_NEED_RESCHED as the clue.


 We differentiate timeslice task switch from preemption task switch only by how the TIF_NEED_RESCHED flag was set:

        for timeslice task switch, it is set by scheduler_tick() in the timer ISR.

        for preemption task switch, it is set by resched_task(), which is called in many cases.


 Not by whether the actual task switch is done in "resume_kernel" or "resume_userspace".


 In fact, it is easy to imagine:


    "resume_userspace" doing a preemption task switch, like this:

        A system call service routine calls set_user_nice() / sched_setscheduler() to boost the priority of a task so that it exceeds that of "current", so resched_task() is called.

        No hardirq happens, so the system call returns through "resume_userspace", which sees the TIF_NEED_RESCHED flag set and performs the task switch.


    "resume_kernel" doing a timeslice task switch, like this:

        A task is executing the read() system call service routine for a lengthy read operation.

        A timer interrupt happens, and the timer ISR finds that this task has used up its timeslice, so it sets the TIF_NEED_RESCHED flag.

        "resume_kernel" (with CONFIG_PREEMPT) performs the timeslice task switch.

[*][*] If we want to be strict, "kernel preemption" only refers to preempting a task while it runs in kernel-level, that is, the "resume_kernel" path and preempt_enable().


 It comes down to 2 functions:


        preempt_schedule()
            called in:

                #define preempt_enable() \            # when preemption is reenabled.
                do { \
                    preempt_enable_no_resched(); \
                    barrier(); \
                    preempt_check_resched(); \
                } while (0)

                    #define preempt_check_resched() \
                    do { \
                        if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
                            preempt_schedule(); \
                    } while (0)


        preempt_schedule_irq()
            called in "resume_kernel", to perform kernel preemption on IRET.
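
How preempt_enable() turns a deferred request into a switch can be sketched in user space as below. It is a toy model under my own assumptions: struct pc_thread and the pc_* functions are hypothetical names that only mimic the counter logic of the macros quoted above, and the "preemption" is reduced to bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: hypothetical names, no kernel API involved. */
struct pc_thread {
    int      preempt_count;   /* > 0 means preemption is disabled */
    bool     need_resched;    /* stands in for TIF_NEED_RESCHED */
    unsigned preemptions;     /* how often the toy preempt_schedule() switched */
};

void pc_preempt_disable(struct pc_thread *t)
{
    t->preempt_count++;
}

/* Stand-in for preempt_schedule(): refuse while preemption is still disabled. */
static void pc_preempt_schedule(struct pc_thread *t)
{
    if (t->preempt_count != 0)
        return;
    t->need_resched = false;   /* the request is consumed by the switch */
    t->preemptions++;
}

void pc_preempt_enable(struct pc_thread *t)
{
    assert(t->preempt_count > 0);   /* unbalanced enable would be a bug */
    t->preempt_count--;             /* the preempt_enable_no_resched() step */
    if (t->need_resched)            /* the preempt_check_resched() step */
        pc_preempt_schedule(t);
}
```

With two nested pc_preempt_disable() calls and a request arriving in between, the first pc_preempt_enable() does nothing visible; the request is only honoured by the outermost one, when the counter returns to 0.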


[*][*] Preemption has become a regular part of the kernel, so there is no need to be too strict about the terminology.


================================================================================

3.5. synchronization - disable preemption


    preempt_disable()

    preempt_enable()


 Disabling preemption alone is not that useful.


 It only:

    prevents "current" from being preempted by other tasks.


 By itself it is only sufficient for this situation:

    UP, where multiple tasks could access a shared resource in "user context".


================================================================================

4. misc tips


[*] TIF_NEED_RESCHED flag is the clue of:

        timeslice task switch        -    set by scheduler_tick()

        preemption task switch        -    set by resched_task()

================================================================================

4.1. question: in kernel-level, if task #A in "user context" (not irq or bh) wakes up a higher-priority task #B on the same CPU, does the preemption switch happen *immediately*?


 Judging from the analysis in

        3.3. preemption task switch


 and from the code details, the switch does NOT happen *immediately*.


        task #A

            try_to_wake_up( task #B )


                .... only sets the TIF_NEED_RESCHED flag of task #A (as "current").


            Then task #A continues to run.


 The code that really performs the switch runs in 2 situations:


    #1. task #A returns to user-level:


            resume_userspace -> work_pending -> work_resched -> call schedule()


    #2. an interrupt is raised, and on return from irq context:     # it may be any interrupt on the local CPU, including the timer interrupt, or an IPI sent by another CPU that woke up local task #B


            resume_kernel -> need_resched -> call preempt_schedule_irq() -> schedule()


================================================================================

4.2. in fact, an "active" task switch happens ** immediately **, but a "passive" task switch ** never happens immediately **


"Active" task switch: "current" itself calls schedule() (acquiring a semaphore, entering a waitQ, and so on; underneath they all end up in schedule() or schedule_timeout()).


    This happens immediately.


"Passive" task switch: the timeslice is used up, or the task is preempted by a newly woken-up higher-priority task.


    Neither happens immediately.


    Instead:

        timeslice used up: update_process_times(), called by the timer ISR, sets the TIF_NEED_RESCHED flag of "current"

        preempted: the TIF_NEED_RESCHED flag of "current" is set as well


        The real switch happens at some later moment: when the ASM code at resume_userspace or resume_kernel sees the TIF_NEED_RESCHED flag and calls schedule().


================================================================================
 
