schedule()

2007-03-08 15:26

schedule() -- scheduling a process

    The goal of the schedule( ) function is to replace the currently executing process with another one. Thus, the key outcome of the function is to set a local variable called next, so that it points to the descriptor of the process selected to replace current. If no runnable process in the system has priority greater than the priority of current, next ends up coinciding with current and no process switch takes place.

asmlinkage void __sched schedule(void)
{
    long *switch_count;
    task_t *prev, *next;
    runqueue_t *rq;
    prio_array_t *array;
    struct list_head *queue;
    unsigned long long now;
    unsigned long run_time;
    int cpu, idx;


    if (likely(!current->exit_state)) {
        if (unlikely(in_atomic())) {
            printk(KERN_ERR "scheduling while atomic: "
                "%s/0x%08x/%d\n",
                current->comm, preempt_count(), current->pid);
            dump_stack();
        }
    }
    profile_hit(SCHED_PROFILING, __builtin_return_address(0));  

Actions performed by schedule( ) before a process switch
==============================================================
Disable kernel preemption; initialize the variables prev and rq
The schedule( ) function starts by disabling kernel preemption and initializing a few local variables:
|---------------------------------|
|need_resched:                    |
|    preempt_disable();           |
|    prev = current;              |
|    release_kernel_lock(prev);   |
|need_resched_nonpreemptible:     |
|    rq = this_rq();              |
|---------------------------------|


    if (unlikely(prev == rq->idle) && prev->state != TASK_RUNNING) {
        printk(KERN_ERR "bad: scheduling from the idle thread!\n");
        dump_stack();
    }

    schedstat_inc(rq, sched_cnt);


Compute how long prev has just run (run_time):
The continuous run time is normally capped at 1 second (expressed in nanoseconds).
    The sched_clock( ) function is invoked to read the TSC and convert its value to nanoseconds; the timestamp obtained is saved in the now local variable. Then, schedule( ) computes the duration of the CPU time slice used by prev. In simplified form:
    now = sched_clock();
    run_time = now - prev->timestamp;
    if (run_time > 1000000000)
        run_time = 1000000000;
|-----------------------------------------------------------------------|
|   now = sched_clock();                                                |
|   if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {|
|       run_time = now - prev->timestamp;                               |
|       if (unlikely((long long)(now - prev->timestamp) < 0))           |
|           run_time = 0;                                               |
|   } else                                                              |
|       run_time = NS_MAX_SLEEP_AVG;                                    |
|-----------------------------------------------------------------------|
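
The clamping in the box above can be tried out in user space. The following is only a minimal sketch, not kernel code: the function name clamp_run_time and its max_ns parameter are hypothetical, and the caller passes whatever value NS_MAX_SLEEP_AVG would have in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space sketch of the run_time computation in schedule().
 * clamp_run_time() and max_ns are illustrative names; in the kernel the
 * cap is NS_MAX_SLEEP_AVG and both timestamps come from sched_clock().
 */
uint64_t clamp_run_time(uint64_t now, uint64_t timestamp, uint64_t max_ns)
{
    /* Signed difference, so a clock that went backwards shows up as < 0. */
    int64_t delta = (int64_t)(now - timestamp);

    if (delta < 0)
        return 0;               /* negative interval: charge nothing */
    if ((uint64_t)delta >= max_ns)
        return max_ns;          /* long interval: charge at most the cap */
    return (uint64_t)delta;     /* normal case: charge the elapsed time */
}
```

With a cap of 1000000000 (one second in nanoseconds), an interval of 500 ns is returned unchanged, while an interval of five seconds is cut down to the cap.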

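Scale down the run time according to the previous average sleep time (CURRENT_BONUS):
Strictly speaking, the average sleep time of prev should be updated to:
    previous average sleep time - run time of this slice;
However, schedule( ) rewards processes whose previous average sleep time is long, that is, whose CURRENT_BONUS(prev) is large: the operation below shrinks run_time, which reduces the effect of this run on the new average sleep time.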
|--------------------------------------------|
|   run_time /= (CURRENT_BONUS(prev) ? : 1); |
|--------------------------------------------|
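
The "a ? : b" form used above is the GNU C shorthand for "a ? a : b", so the divisor is the bonus itself unless the bonus is zero. A minimal sketch in standard C follows; scale_run_time is a hypothetical helper, and its bonus argument stands in for whatever CURRENT_BONUS(prev) evaluates to in the kernel.

```c
#include <assert.h>

/*
 * Sketch of "run_time /= (CURRENT_BONUS(prev) ? : 1)".
 * scale_run_time() is an illustrative name; bonus is supplied by the
 * caller in place of the kernel's CURRENT_BONUS(prev) macro.
 */
unsigned long scale_run_time(unsigned long run_time, unsigned long bonus)
{
    /* Fall back to 1 so that a zero bonus never causes a division by zero. */
    unsigned long divisor = bonus ? bonus : 1;

    return run_time / divisor;
}
```

A process with bonus 10 is charged one tenth of its real run time, whereas a process with bonus 0 or 1 is charged the full amount.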

Disable local interrupts; protect the runqueue with a spin lock
Before starting to look at the runnable processes, schedule( ) must disable the local interrupts and acquire the spin lock that protects the runqueue:
|------------------------------------|
|   spin_lock_irq(&rq->lock);        |
|------------------------------------|

To detect whether the current process has terminated, schedule( ) checks the PF_DEAD flag:
|----------------------------------------|
|   if (unlikely(prev->flags & PF_DEAD)) |
|       prev->state = EXIT_DEAD;         |
|----------------------------------------|  

    switch_count = &prev->nivcsw;

If prev called schedule( ) to give up the CPU because it is waiting for some event, schedule( ) decides from the state of the process (TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE) whether it stays in the active queue or is removed from it.
    schedule( ) examines the state of prev. If it is not runnable and it has not been preempted in Kernel Mode, then it should be removed from the runqueue. However, if it has nonblocked pending signals and its state is TASK_INTERRUPTIBLE, the function sets the process state to TASK_RUNNING and leaves it in the runqueue. This action is not the same as assigning the processor to prev; it just gives prev a chance to be selected for execution:
    If prev is not runnable and has not been preempted in Kernel Mode, it put itself to sleep on a wait queue, waiting for some event, before calling schedule( ). If its state is TASK_INTERRUPTIBLE and it has received a signal (which has not been handled yet), the process goes back to the TASK_RUNNING state; it will handle the signal once it is scheduled again.
|---------------------------------------------------------------|
|   if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {   |
|       switch_count = &prev->nvcsw;                            |
|       if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&      |
|               unlikely(signal_pending(prev))))                |
|           prev->state = TASK_RUNNING;                         |
|       else {                                                  |
|           if (prev->state == TASK_UNINTERRUPTIBLE)            |
|               rq->nr_uninterruptible++;                       |
|           deactivate_task(prev, rq);                          |
|       }                                                       |
|   }                                                           |
|---------------------------------------------------------------|
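
The three possible outcomes of this test can be written as a small decision function. This is only a sketch: the enum names and classify_prev are hypothetical, and the numeric state values are assumed to mirror the 2.6 kernel (0 for running, 1 for interruptible, 2 for uninterruptible).

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the 2.6 task states; the names are illustrative. */
enum { SK_RUNNING = 0, SK_INTERRUPTIBLE = 1, SK_UNINTERRUPTIBLE = 2 };

/* What schedule() does with prev before it picks the next process. */
enum prev_action {
    KEEP_QUEUED,    /* prev is runnable or was preempted: leave it alone */
    MAKE_RUNNABLE,  /* interruptible sleep with a pending signal */
    DEACTIVATE      /* prev really sleeps: remove it from the runqueue */
};

enum prev_action classify_prev(long state, bool preempted, bool signal_pending)
{
    /* Corresponds to "prev->state && !(preempt_count() & PREEMPT_ACTIVE)". */
    if (state == SK_RUNNING || preempted)
        return KEEP_QUEUED;
    if ((state & SK_INTERRUPTIBLE) && signal_pending)
        return MAKE_RUNNABLE;
    return DEACTIVATE;
}
```

Note that a pending signal does not rescue a process in the uninterruptible state: it is deactivated like any other sleeper.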
    cpu = smp_processor_id();

Actions performed by schedule( ) to make the process switch
==============================================================
Check the number of runnable processes in the runqueue and balance the load according to how many remain:
schedule( ) checks the number of runnable processes left in the runqueue.
    If no runnable process exists, the function invokes idle_balance( ) to move some runnable process from another runqueue to the local runqueue; idle_balance( ) is similar to load_balance( ).
    If idle_balance( ) fails to move a process into the local runqueue, schedule( ) invokes wake_sleeping_dependent( ) to reschedule runnable processes on idle CPUs (that is, on every CPU that runs the swapper process). As explained in the discussion of the dependent_sleeper( ) function, this can happen when the kernel supports the hyper-threading technology. However, on uniprocessor systems, or when every attempt to move a runnable process into the local runqueue has failed, the function picks the swapper process as next and continues with the next step.
    If there are some runnable processes, the function invokes the dependent_sleeper( ) function. In most cases, this function immediately returns zero.
|-------------------------------------------------|
|   if (unlikely(!rq->nr_running)) {              |
|go_idle:                                         |
|       idle_balance(cpu, rq);                    |
|       if (!rq->nr_running) {                    |
|           next = rq->idle;                      |
|           rq->expired_timestamp = 0;            |
|           wake_sleeping_dependent(cpu, rq);     |
|           if (!rq->nr_running)                  |
|               goto switch_tasks;                |
|       }                                         |
|-------------------------------------------------|
|   } else {                                      |
|       if (dependent_sleeper(cpu, rq)) {         |
|           next = rq->idle;                      |
|           goto switch_tasks;                    |
|       }                                         |
|       if (unlikely(!rq->nr_running))            |
|           goto go_idle;                         |
|   }                                             |
|-------------------------------------------------|

If the active array of the runqueue (runqueue.active) holds no more active processes, swap the active and expired arrays:
Let's suppose that the schedule( ) function has determined that the runqueue includes some runnable processes; now it has to check that at least one of these runnable processes is active. If not, the function exchanges the contents of the active and expired fields of the runqueue data structure; thus, all expired processes become active, while the empty set is ready to receive the processes that will expire in the future.
|------------------------------------------|
|   array = rq->active;                    |  
|   if (unlikely(!array->nr_active)) {     |
|       schedstat_inc(rq, sched_switch);   |
|       rq->active = rq->expired;          |
|       rq->expired = array;               |
|       array = rq->active;                |
|       rq->expired_timestamp = 0;         |
|       rq->best_expired_prio = MAX_PRIO;  |
|   }                                      |
|------------------------------------------|
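
The exchange is a plain pointer swap; no process descriptor is moved. A minimal sketch under simplified assumptions follows: the toy_ structures are hypothetical stand-ins that keep only the fields needed here, and the statistics and timestamp updates of the real code are left out.

```c
#include <assert.h>

/* Hypothetical, stripped-down stand-ins for prio_array_t and runqueue_t. */
struct toy_prio_array {
    unsigned int nr_active;          /* number of processes in this set */
};

struct toy_runqueue {
    struct toy_prio_array *active;   /* processes that may still run */
    struct toy_prio_array *expired;  /* processes that used up their slice */
};

/* Return the set to pick from, swapping the two sets if active is empty. */
struct toy_prio_array *toy_select_array(struct toy_runqueue *rq)
{
    struct toy_prio_array *array = rq->active;

    if (array->nr_active == 0) {
        rq->active = rq->expired;    /* expired processes become active */
        rq->expired = array;         /* the empty set collects future expiries */
        array = rq->active;
    }
    return array;
}
```

Calling the function a second time does not swap again, because the active set is no longer empty.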

Select the highest-priority process from the priority array as next:
It is time to look up a runnable process in the active prio_array_t data structure. First of all, schedule( ) searches for the first nonzero bit in the bitmask of the active set. Remember that a bit in the bitmask is set when the corresponding priority list is not empty. Thus, the index of the first nonzero bit indicates the list containing the best process to run. Then, the first process descriptor in that list is retrieved:
|-----------------------------------------------------|
|   idx = sched_find_first_bit(array->bitmap);        |
|   queue = array->queue + idx;                       |
|   next = list_entry(queue->next, task_t, run_list); |
|-----------------------------------------------------|
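
The bitmap lookup can be illustrated with a plain linear scan. This is a sketch, not the kernel's sched_find_first_bit( ), which uses an optimized routine: the toy_ names are hypothetical, and the figure of 140 priority levels is an assumption taken from the 2.6 value of MAX_PRIO.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define TOY_MAX_PRIO   140  /* assumed number of priority levels */
#define TOY_WORD_BITS  (sizeof(unsigned long) * CHAR_BIT)
#define TOY_WORDS      ((TOY_MAX_PRIO + TOY_WORD_BITS - 1) / TOY_WORD_BITS)

/* Mark priority level idx as having at least one queued process. */
void toy_set_bit(unsigned long *bitmap, size_t idx)
{
    bitmap[idx / TOY_WORD_BITS] |= 1UL << (idx % TOY_WORD_BITS);
}

/*
 * Return the index of the lowest set bit, i.e. the first non-empty
 * priority list, or TOY_MAX_PRIO if no bit is set at all.
 */
size_t toy_find_first_bit(const unsigned long *bitmap)
{
    for (size_t idx = 0; idx < TOY_MAX_PRIO; idx++) {
        if (bitmap[idx / TOY_WORD_BITS] & (1UL << (idx % TOY_WORD_BITS)))
            return idx;
    }
    return TOY_MAX_PRIO;
}
```

The cost of the search depends only on the number of priority levels, not on how many processes are queued, which is why the bitmap is kept alongside the lists.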

Compute the average sleep time of next:
If next is a conventional process and it is being awakened from the TASK_INTERRUPTIBLE or TASK_STOPPED state, the scheduler adds to the average sleep time (sleep_avg) of the process the nanoseconds elapsed since the process was inserted into the runqueue. In other words, the sleep time of the process is increased to cover also the time spent by the process in the runqueue waiting for the CPU, so the increase is not simply the time the process slept before being woken up:
|-------------------------------------------------------------------|
|   if (!rt_task(next) && next->activated > 0) {                    |
|       unsigned long long delta = now - next->timestamp;           |
|       if (unlikely((long long)(now - next->timestamp) < 0))       |
|           delta = 0;                                              |
|       if (next->activated == 1)                                   |
|           delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128; |
|       array = next->array;                                        |
|       dequeue_task(next, array);                                  |
|       recalc_task_prio(next, next->timestamp + delta);            |
|       enqueue_task(next, array);                                  |
|   }                                                               |
|   next->activated = 0;                                            |
|-------------------------------------------------------------------|
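
The scaling applied when activated equals 1 is integer arithmetic, so the effective weight is slightly lower than the nominal percentage. A minimal sketch follows; weigh_runqueue_wait is a hypothetical helper, and the weight of 30 is an assumption about the value of ON_RUNQUEUE_WEIGHT.

```c
#include <assert.h>

#define TOY_ON_RUNQUEUE_WEIGHT 30  /* assumed percentage, not from the text */

/*
 * Sketch of the delta adjustment: time spent waiting on the runqueue is
 * credited in full unless activated == 1, in which case it is weighted
 * down. With a weight of 30 the factor is (30 * 128 / 100) / 128 = 38/128.
 */
unsigned long long weigh_runqueue_wait(unsigned long long delta, int activated)
{
    if (activated == 1)
        delta = delta * (TOY_ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
    return delta;
}
```

Under that assumption a wait of 12800 ns is credited as 3800 ns, a little under 30 percent because 30 * 128 / 100 truncates to 38.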

switch_tasks:                               
    if (next == rq->idle)
       schedstat_inc(rq, sched_goidle);

Access the thread_info of next
Now the schedule( ) function has determined the next process to run. In a moment, the kernel will access the thread_info data structure of next, whose address is stored close to the top of next's process descriptor (task_struct.thread_info):

|-----------------------|
|   prefetch(next);     |
|-----------------------|

Clear the TIF_NEED_RESCHED flag of prev before replacing it
Before replacing prev, the scheduler should do some administrative work:
The clear_tsk_need_resched( ) function clears the TIF_NEED_RESCHED flag of prev, just in case schedule( ) has been invoked in the lazy way. Then, the function records that the CPU is going through a quiescent state:
|-----------------------------------|
|   clear_tsk_need_resched(prev);   |
|   rcu_qsctr_inc(task_cpu(prev));  |
|-----------------------------------|
    update_cpu_clock(prev, rq, now);

Compute the average sleep time (sleep_avg) of prev
The schedule( ) function must also decrease the average sleep time of prev, charging to it the slice of CPU time (run_time) used by the process before the switch; it also sets the timestamp and last_ran fields of prev to now:
|-------------------------------------------|
|   prev->sleep_avg -= run_time;            |
|   if ((long)prev->sleep_avg <= 0)         |
|       prev->sleep_avg = 0;                |
|   prev->timestamp = prev->last_ran = now; |
|-------------------------------------------|
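
The subtraction saturates at zero: a process that ran longer than its accumulated sleep credit ends up with no credit rather than a negative one. A minimal sketch follows; charge_run_time is a hypothetical helper, and it compares before subtracting instead of casting the wrapped result to long as the kernel code does.

```c
#include <assert.h>

/*
 * Sketch of "sleep_avg -= run_time" with the clamp at zero.
 * The comparison is done first, which avoids relying on how a wrapped
 * unsigned value converts to a signed one.
 */
unsigned long charge_run_time(unsigned long sleep_avg, unsigned long run_time)
{
    if (run_time >= sleep_avg)
        return 0;                  /* credit exhausted */
    return sleep_avg - run_time;   /* credit partly used up */
}
```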
     
    sched_info_switch(prev, next);

Perform the context switch:
At this point, prev and next are different processes, and the process switch is for real:
|-----------------------------------------------|
|   if (likely(prev != next)) {                 |
|       next->timestamp = now;                  |
|       rq->nr_switches++;                      |
|       rq->curr = next;                        |
|       ++*switch_count;                        |
|       prepare_arch_switch(rq, next);          |
|       prev = context_switch(rq, prev, next);  |
|-----------------------------------------------|




Actions performed by schedule() after a process switch
==============================================================
|---------------------------------|
|       barrier();                |
|       finish_task_switch(prev); |
|---------------------------------|

If prev and next are the same process:
It is quite possible that prev and next are the same process: this happens if no other higher or equal priority active process is present in the runqueue. In this case, the function skips the process switch:
|-----------------------------------|
|   } else                          |
|       spin_unlock_irq(&rq->lock); |
|-----------------------------------|

    prev = current;
    if (unlikely(reacquire_kernel_lock(prev) < 0))
        goto need_resched_nonpreemptible;
    preempt_enable_no_resched();
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
        goto need_resched;
}