The Infrastructure That Process Scheduling Relies On

Each process descriptor includes several fields related to scheduling:

        thread_info
     |--------------|
 +---|*task         |
 |   |--------------|
 |   |              |
 |   |--------------|
 |   |flags         |TIF_NEED_RESCHED set to 1 in flags means
 |   |--------------|schedule() must be invoked for this process
 |   |              |
 |   |--------------|
 |   |cpu           |Logical number of the CPU owning the
 |   |--------------|runqueue to which the runnable process belongs
 |   |              |
 |   |--------------|<-- low-address end of the kernel stack
 |
 |
 |
 |       task_struct
 |   |----------------|
 +-->|state           |TASK_RUNNING
     |----------------|
     |*thread_info    |
     |----------------|
     |prio            |Dynamic priority used by the scheduler, stored in prio
     |----------------|
     |static_prio     |Static priority; affects how much CPU time the process gets
     |----------------|
     |run_list        |Pointers to the next and previous elements
     |----------------|in the runqueue list to which the process belongs
     |                |
     |----------------|
     |array           |    Pointer to the runqueue's
     |----------------|    prio_array_t set that includes the process
     |sleep_avg       |Average sleep time of the process
     |----------------|
     |                |
     |                |
     |----------------|
     |policy          |The scheduling class of the process
     |----------------|(SCHED_NORMAL, SCHED_RR, or SCHED_FIFO)
     |                |
     |----------------|
     |time_slice      |Ticks left in the time quantum of the process
     |----------------|
     |first_time_slice|Flag set to 1 if the process
     |----------------|never exhausted its time quantum
     |                |
     |                |
     |----------------|
     | rt_priority    |Real-time priority of the process
     |----------------|
     |                |
     |                |
     |----------------|



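task_struct.activated  (int)
    Condition code used when the process is awakened.
    It records why the process entered the runnable state; this reason affects how the scheduling priority is computed. activated takes one of four values:
  • -1: the process was woken up from the TASK_UNINTERRUPTIBLE state;
  • 0: the default; the process was already runnable;
  • 1: the process was woken up from the TASK_INTERRUPTIBLE state, outside interrupt context;
  • 2: the process was woken up from the TASK_INTERRUPTIBLE state, in interrupt context.
    activated starts at 0 and is changed in two places. schedule() resets it to 0, and activate_task(), which try_to_wake_up() calls to activate a sleeping process, sets it as follows:
  • If activate_task() is called from an interrupt handler, that is, the process is woken by an interrupt, the process is most likely interactive, so activated is set to 2; otherwise activated is set to 1.
  • If the process is woken up from the TASK_UNINTERRUPTIBLE state, activated is set to -1 (this happens in try_to_wake_up()).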
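
The rules for choosing activated on a wakeup can be summarized in a few lines of C. This is only a sketch: activated_on_wakeup() is a hypothetical helper, not a kernel function, and the state constants are defined locally just to tell the two sleep states apart.

```c
#include <assert.h>
#include <stdbool.h>

/* Sleep states, defined locally for this sketch. */
#define TASK_INTERRUPTIBLE   1
#define TASK_UNINTERRUPTIBLE 2

/* Hypothetical helper: the value stored in activated when a sleeper wakes. */
static int activated_on_wakeup(int old_state, bool in_interrupt)
{
    assert(old_state == TASK_INTERRUPTIBLE ||
           old_state == TASK_UNINTERRUPTIBLE);
    if (old_state == TASK_UNINTERRUPTIBLE)
        return -1;                 /* case handled in try_to_wake_up() */
    return in_interrupt ? 2 : 1;   /* case handled in activate_task() */
}
```

A process that never slept keeps the default value 0, which schedule() also restores.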
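task_struct.timestamp (unsigned long long)
    Time of last insertion of the process in the runqueue, or time of last process switch involving the process.
    The field is also described as the timestamp at which the process entered the sleeping state, or at which it most recently gained control of the CPU.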

task_struct.run_list (struct list_head)
    Pointers to the next and previous elements in the runqueue list to which the process belongs.

task_struct.cpus_allowed (cpumask_t)
    Bitmask of the CPUs that can execute the process.




  • task_struct.static_prio

    Processes have an initial priority that is called the nice value. This value ranges from -20 to +19 with a default of zero. Nineteen is the lowest and -20 is the highest priority. The static_prio member of the process's task_struct holds this value (more precisely, it holds the static_prio computed from nice). The variable is called the static priority because it does not change from what the user specifies. The scheduler, in turn, bases its decisions on the dynamic priority that is stored in prio. The dynamic priority is calculated as a function of the static priority and the task's interactivity.

static priority of a conventional process
    The kernel represents the static priority of a conventional process with a number ranging from 100 (highest priority) to 139 (lowest priority); notice that static priority decreases as the values increase.
static priority [0, 139]
static priority of a conventional process: [100, 139]

nice--> static_prio
    A new process always inherits the static priority of its parent. However, a user can change the static priority of the processes that he owns by passing some "nice values" to the nice( ) and setpriority( ) system calls.
nice: [-20, 19]

Computing static_prio from nice:
#define NICE_TO_PRIO(nice)   (MAX_RT_PRIO + (nice) + 20)
#define MAX_RT_PRIO          MAX_USER_RT_PRIO
#define MAX_USER_RT_PRIO     100
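
As a worked example, the macros above map the nice range [-20, 19] onto the static priority range [100, 139]. The sketch below repeats the macros so it compiles on its own; static_prio_from_nice() is a hypothetical wrapper added only to check the input range, and PRIO_TO_NICE is the inverse mapping.

```c
#include <assert.h>

#define MAX_USER_RT_PRIO     100
#define MAX_RT_PRIO          MAX_USER_RT_PRIO
#define NICE_TO_PRIO(nice)   (MAX_RT_PRIO + (nice) + 20)
#define PRIO_TO_NICE(prio)   ((prio) - MAX_RT_PRIO - 20)

/* Hypothetical wrapper that rejects nice values outside [-20, 19]. */
static int static_prio_from_nice(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return NICE_TO_PRIO(nice);   /* -20 -> 100, 0 -> 120, 19 -> 139 */
}
```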



  • task_struct.prio

    Besides a static priority, a conventional process also has a dynamic priority, which is a value ranging from 100 (highest priority) to 139 (lowest priority). The dynamic priority is the number actually looked up by the scheduler when selecting the new process to run.

    The method effective_prio() returns a task's dynamic priority. The method begins with the task's nice value and computes a bonus or penalty in the range -5 to +5 based on the interactivity of the task. For example, a highly interactive (I/O-bound) task with a nice value of ten can have a dynamic priority of five on the nice scale. Conversely, a mild processor hog with a nice value of ten can have a dynamic priority of 12. Tasks that are only mildly interactive (at some theoretical equilibrium of I/O versus processor usage) receive no bonus or penalty, and their dynamic priority is equal to their nice value.

dynamic priority of a conventional process: [100, 139]
dynamic priority = max ( 100, min(static priority - bonus + 5, 139) )
bonus: [0, 10]

(>200ms)      (2)
sleep_avg --> bonus  ----+    
                         |    
                         +----->prio
                         |    
        static_prio  ----+  
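
The formula and the diagram above can be sketched in C as follows. This is not the kernel's effective_prio(). The linear mapping from sleep_avg to bonus (one bonus point per 100 ms, with a 1 second ceiling) is an assumption chosen to match the "200 ms gives bonus 2" hint in the diagram, and the names bonus_from_sleep_avg() and dynamic_prio() are hypothetical.

```c
#include <assert.h>

#define MAX_RT_PRIO      100
#define MAX_PRIO         140
#define MAX_BONUS        10
#define MAX_SLEEP_AVG_MS 1000   /* assumption: 1 s ceiling, in milliseconds */

/* Assumed linear mapping: every 100 ms of sleep_avg is one bonus point. */
static int bonus_from_sleep_avg(int sleep_avg_ms)
{
    assert(sleep_avg_ms >= 0 && sleep_avg_ms <= MAX_SLEEP_AVG_MS);
    return sleep_avg_ms * MAX_BONUS / MAX_SLEEP_AVG_MS;
}

/* dynamic priority = max(100, min(static priority - bonus + 5, 139)) */
static int dynamic_prio(int static_prio, int sleep_avg_ms)
{
    int prio = static_prio - bonus_from_sleep_avg(sleep_avg_ms) + MAX_BONUS / 2;

    if (prio < MAX_RT_PRIO)
        prio = MAX_RT_PRIO;        /* never enter the real-time range */
    if (prio > MAX_PRIO - 1)
        prio = MAX_PRIO - 1;       /* never drop below the lowest priority */
    return prio;
}
```

With a static priority of 120 (nice 0), a task that has slept the maximum gets priority 115, while one that never sleeps gets 125.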

  • task_struct.sleep_avg
    To implement this heuristic, Linux keeps a running tab on how much time a process spends sleeping versus how much time it spends in a runnable state. This value is stored in the sleep_avg member of the task_struct. It ranges from zero to MAX_SLEEP_AVG, which caps the average sleep time at 1 second (the figure of 10 milliseconds sometimes quoted for this default does not fit the 200 ms threshold used for the bonus). When a task becomes runnable after sleeping, sleep_avg is incremented by how long it slept, until the value reaches MAX_SLEEP_AVG. For every timer tick the task runs, sleep_avg is decremented until it reaches zero.
    A newly created interactive process quickly receives a large sleep_avg.
    Roughly speaking, the average sleep time is the average number of nanoseconds that the process spent sleeping. Note, however, that it is not a true average over elapsed time. For example, time spent sleeping in the TASK_INTERRUPTIBLE state contributes to the average sleep time differently from time spent in the TASK_UNINTERRUPTIBLE state. The average sleep time also decreases while the process is running. Finally, it can never exceed 1 second.
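
The bookkeeping described above amounts to a saturating counter. The sketch below works in milliseconds and uses hypothetical helper names. It deliberately ignores the different weighting of TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE sleeps, so it shows the idea rather than the kernel's code.

```c
#include <assert.h>

#define MAX_SLEEP_AVG_MS 1000   /* assumption: the 1 s ceiling, in milliseconds */

/* On wakeup: credit the time slept, saturating at the ceiling. */
static int sleep_avg_after_wakeup(int sleep_avg_ms, int slept_ms)
{
    assert(sleep_avg_ms >= 0 && sleep_avg_ms <= MAX_SLEEP_AVG_MS);
    assert(slept_ms >= 0);
    if (slept_ms > MAX_SLEEP_AVG_MS - sleep_avg_ms)
        return MAX_SLEEP_AVG_MS;
    return sleep_avg_ms + slept_ms;
}

/* While running: debit the time run, saturating at zero. */
static int sleep_avg_after_run(int sleep_avg_ms, int ran_ms)
{
    assert(sleep_avg_ms >= 0 && ran_ms >= 0);
    if (ran_ms > sleep_avg_ms)
        return 0;
    return sleep_avg_ms - ran_ms;
}
```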

sleep_avg --> interactive
    The average sleep time is also used by the scheduler to determine whether a given process should be considered interactive or batch. More precisely, a process is considered "interactive" if it satisfies the following formula:
dynamic priority  <=  3 * static priority / 4 + 28   (3)
which is equivalent to the following:
bonus - 5 >= static priority / 4 - 28
    The expression static priority / 4 - 28 is called the interactive delta. Note that it is far easier for high-priority processes than for low-priority ones to become interactive. For instance, a process with the highest static priority (100) is considered interactive when its bonus value reaches 2, that is, when its average sleep time reaches 200 ms.

(>200ms)      (2)    
sleep_avg --> bonus  ----+      +--> interactive
                         |      |      
                         +------+ interactivity decided by formula (3)
                         |      |
        static_prio  ----+      +-->  batch
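
The interactivity test can be checked with a small C function. To avoid integer-division rounding, the sketch multiplies both sides of bonus - 5 >= static priority / 4 - 28 by 4; the kernel's own macros may round differently, so this is an illustration of the formula as written here, and is_interactive() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>

/* Formula (3) in its equivalent form, scaled by 4 to stay in integers:
 *   bonus - 5 >= static_prio / 4 - 28
 *   4 * (bonus - 5) >= static_prio - 112
 */
static bool is_interactive(int static_prio, int bonus)
{
    assert(static_prio >= 100 && static_prio <= 139);
    assert(bonus >= 0 && bonus <= 10);
    return 4 * (bonus - 5) >= static_prio - 112;
}
```

For static priority 100 a bonus of 2 is enough, for 120 the bonus must be at least 7, and for 139 no bonus in [0, 10] satisfies the test, so such a process is always treated as batch.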


  • task_struct.time_slice

    Timeslice, on the other hand, is a much simpler calculation: it is based only on the static priority. After a task's timeslice is exhausted, it is recalculated from the task's static priority. The function task_timeslice() returns a new timeslice for the given task. The calculation is a simple scaling of the static priority into a range of timeslices.
    The higher a task's priority, the more timeslice it receives per round of execution.
    The maximum timeslice, which is given to the highest priority tasks (a nice value of -20), is 800 milliseconds. Even the lowest-priority tasks (those with a nice value of +19) receive at least the minimum timeslice, MIN_TIMESLICE, which is either 5 milliseconds or one timer tick, whichever is larger. Tasks with the default priority (a nice value of zero) receive a timeslice of 100 milliseconds.

                          +--> (140 - static priority) * 20  
                          |          (static priority < 120)
(task_struct.time_slice)  |         
  base time quantum =     |                                   
(in milliseconds)         |
                          |          (static priority >= 120)
                          +--> (140 - static priority) * 5   

Note:
 -20  nice() /setpriority()      100      (140-100)*20       800ms
nice -----------------------> static_prio --------------> time_slice
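
The two-branch formula for the base time quantum translates directly into C. The function name base_time_quantum_ms() is hypothetical; the arithmetic is exactly the formula shown above.

```c
#include <assert.h>

/* Base time quantum in milliseconds for a conventional process. */
static int base_time_quantum_ms(int static_prio)
{
    assert(static_prio >= 100 && static_prio <= 139);
    if (static_prio < 120)
        return (140 - static_prio) * 20;   /* higher-priority branch */
    return (140 - static_prio) * 5;        /* default and lower priorities */
}
```

This reproduces the figures quoted earlier: 800 ms for nice -20 (static priority 100), 100 ms for nice 0 (120), and 5 ms for nice +19 (139).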

A newly created process gets its time slice from its parent; once the time slice is exhausted, it is recalculated from the static priority:
    When a new process is created, sched_fork( ), invoked by copy_process( ), sets the time_slice field of both current (the parent) and p (the child) processes in the following way:
p->time_slice = (current->time_slice + 1) >> 1;
current->time_slice >>= 1;

p->first_time_slice = 1;

task_struct.first_time_slice
    The first_time_slice flag is set to 1 because the child has never exhausted its time quantum. If a process terminates or executes a new program during its first time slice, the parent process is rewarded with the remaining time slice of the child.
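
To see what the split in sched_fork() does to the tick counts, here is a minimal sketch. struct task and split_time_slice() are hypothetical stand-ins for task_struct and the relevant lines of sched_fork(); only the three assignments come from the text above.

```c
#include <assert.h>

/* Hypothetical, minimal stand-in for the relevant task_struct fields. */
struct task {
    int time_slice;        /* ticks left in the current quantum */
    int first_time_slice;  /* 1 until the first quantum is exhausted */
};

/* Split the parent's remaining ticks between parent and child. */
static void split_time_slice(struct task *parent, struct task *child)
{
    assert(parent->time_slice >= 0);
    child->time_slice = (parent->time_slice + 1) >> 1;  /* rounded-up half */
    parent->time_slice >>= 1;                           /* rounded-down half */
    child->first_time_slice = 1;
}
```

The two halves always add up to the parent's original count, so forking cannot be used to gain extra CPU time.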

  • task_struct.thread_info.flags

    The kernel provides the need_resched flag to signify whether a reschedule should be performed. This flag is set by scheduler_tick() when a process runs out of timeslice, and by try_to_wake_up() when a process that has a higher priority than the currently running process is awakened. The kernel checks the flag, sees that it is set, and calls schedule() to switch to a new process. The flag is a message to the kernel that the scheduler should be invoked as soon as possible because another process deserves to run. Upon returning to user-space or returning from an interrupt, the need_resched flag is checked. If it is set, the kernel invokes the scheduler before continuing.

    The flag is per-process, and not simply global, because it is faster to access a value in the process descriptor (because of the speed of current and the high probability of it being in a cache line) than a global variable. Historically, the flag was global before the 2.2 kernel. In 2.2 and 2.4, the flag was an int inside the task_struct. In 2.6, it was moved into a single bit of a special flag variable inside the thread_info structure. As you can see, the kernel developers are never satisfied.
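
Storing the flag as a single bit of thread_info.flags can be illustrated as follows. Everything here is a sketch: the struct and function names are hypothetical, the bit index chosen for TIF_NEED_RESCHED is an assumption (the real value is architecture-specific), and the kernel uses atomic bit operations rather than the plain ones shown.

```c
#include <assert.h>
#include <stdbool.h>

#define TIF_NEED_RESCHED 3   /* assumption: bit index varies by architecture */

/* Hypothetical, minimal stand-in for thread_info. */
struct thread_info_sketch {
    unsigned long flags;
};

static void set_need_resched_sketch(struct thread_info_sketch *ti)
{
    assert(ti);
    ti->flags |= 1UL << TIF_NEED_RESCHED;
}

static void clear_need_resched_sketch(struct thread_info_sketch *ti)
{
    assert(ti);
    ti->flags &= ~(1UL << TIF_NEED_RESCHED);
}

static bool need_resched_sketch(const struct thread_info_sketch *ti)
{
    assert(ti);
    return (ti->flags >> TIF_NEED_RESCHED) & 1UL;
}
```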






