3.2.2.5. The lists of TASK_RUNNING processes
When looking for a new process to run on a CPU, the kernel has to consider only the runnable processes (that is, the processes in the TASK_RUNNING state).
Earlier Linux versions put all runnable processes in the same list called runqueue. Because it would be too costly to maintain the list ordered according to process priorities, the earlier schedulers were compelled to scan the whole list in order to select the "best" runnable process.
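The following minimal C sketch makes the cost concrete. It assumes a hypothetical `struct old_task` kept on a singly linked, unordered list, and it reduces the selection criterion to a single numeric priority (lower value is better). The earlier schedulers actually computed a more elaborate per-process weight (the goodness() function in 2.4), so treat this only as an illustration of the full O(n) scan.

```c
#include <stddef.h>

/* Hypothetical, simplified descriptor: the real one has many more fields. */
struct old_task {
    int prio;               /* simplified criterion: lower value = better */
    struct old_task *next;  /* single, unordered runqueue list */
};

/* Select the "best" runnable process by visiting every element of the
 * unordered list: the cost grows linearly with the number of runnable
 * processes. Returns NULL if the list is empty. */
struct old_task *pick_best_linear(struct old_task *runqueue)
{
    struct old_task *best = NULL;

    for (struct old_task *t = runqueue; t != NULL; t = t->next)
        if (best == NULL || t->prio < best->prio)
            best = t;
    return best;
}
```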
Linux 2.6 implements the runqueue differently. The aim is to allow the scheduler to select the best runnable process in constant time, independently of the number of runnable processes. We'll defer to Chapter 7 a detailed description of this new kind of runqueue, and we'll provide here only some basic information.
The trick used to achieve the scheduler speedup consists of splitting the runqueue into many lists of runnable processes, one list per process priority. Each task_struct descriptor includes a run_list field of type list_head. If the process priority is equal to k (a value ranging between 0 and 139), the run_list field links the process descriptor into the list of runnable processes having priority k. Furthermore, on a multiprocessor system, each CPU has its own runqueue, that is, its own set of lists of processes. This is a classic example of making data structures more complex to improve performance: to make scheduler operations more efficient, the runqueue list has been split into 140 different lists!
As we'll see, the kernel must preserve a lot of data for every runqueue in the system; however, the main data structures of a runqueue are the lists of process descriptors belonging to the runqueue. All these lists are implemented by a single prio_array_t data structure, whose fields are shown in Table 3-2.
Type | Field | Description |
---|---|---|
int | nr_active | The number of process descriptors linked into the lists |
unsigned long [5] | bitmap | A priority bitmap: each flag is set if and only if the corresponding priority list is not empty |
struct list_head [140] | queue | The 140 heads of the priority lists |
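The bitmap is what makes selection independent of the number of runnable processes: finding the best non-empty list amounts to finding the first set bit, which touches at most a fixed number of words. The C sketch below uses hypothetical helper names (prio_bitmap_set(), prio_bitmap_clear(), prio_bitmap_first()) and a portable loop; the 2.6 kernel uses an architecture-optimized helper, sched_find_first_bit(), for the same job. The word count is computed rather than hard-coded, since the `[5]` in Table 3-2 corresponds to 32-bit longs.

```c
#include <limits.h>
#include <stddef.h>

#define MAX_PRIO 140  /* priorities 0..139, as in the text */
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
/* Words needed for MAX_PRIO bits: 5 with 32-bit longs, 3 with 64-bit longs. */
#define BITMAP_LONGS ((MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Mark the list of priority prio as non-empty. */
static void prio_bitmap_set(unsigned long *bitmap, int prio)
{
    bitmap[prio / BITS_PER_LONG] |= 1UL << (prio % BITS_PER_LONG);
}

/* Mark the list of priority prio as empty. */
static void prio_bitmap_clear(unsigned long *bitmap, int prio)
{
    bitmap[prio / BITS_PER_LONG] &= ~(1UL << (prio % BITS_PER_LONG));
}

/* Return the lowest priority number whose bit is set (the best non-empty
 * list), or -1 if all lists are empty. At most BITMAP_LONGS words and
 * BITS_PER_LONG bits are inspected, regardless of how many processes are
 * queued. */
static int prio_bitmap_first(const unsigned long *bitmap)
{
    for (size_t w = 0; w < BITMAP_LONGS; w++) {
        unsigned long word = bitmap[w];

        if (word == 0)
            continue;
        int bit = 0;
        while ((word & 1UL) == 0) {  /* find lowest set bit in this word */
            word >>= 1;
            bit++;
        }
        return (int)(w * BITS_PER_LONG) + bit;
    }
    return -1;
}
```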
The enqueue_task(p,array) function inserts a process descriptor into a runqueue list; its code is essentially equivalent to:
```c
list_add_tail(&p->run_list, &array->queue[p->prio]);
__set_bit(p->prio, array->bitmap);
array->nr_active++;
p->array = array;
```
The prio field of the process descriptor stores the dynamic priority of the process, while the array field is a pointer to the prio_array_t data structure of its current runqueue. Similarly, the dequeue_task(p,array) function removes a process descriptor from a runqueue list.
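A self-contained user-space sketch of both operations is shown here. It re-implements a minimal circular list in the style of the kernel's list_head, uses a hypothetical `struct task` in place of task_struct, and writes the bit operations out by hand instead of calling __set_bit(). Note that dequeue_task() clears a priority bit only when the corresponding list becomes empty, because other processes with the same priority may still be queued.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define MAX_PRIO 140
#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define BITMAP_LONGS ((MAX_PRIO + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static int list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void list_del(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
    e->next = e->prev = NULL;
}

typedef struct prio_array {
    int nr_active;                       /* descriptors linked into the lists */
    unsigned long bitmap[BITMAP_LONGS];  /* bit k set <=> queue[k] non-empty */
    struct list_head queue[MAX_PRIO];    /* one list head per priority */
} prio_array_t;

/* Hypothetical stand-in for task_struct: only the fields used here. */
struct task {
    int prio;
    struct list_head run_list;
    prio_array_t *array;
};

static void prio_array_init(prio_array_t *a)
{
    a->nr_active = 0;
    for (size_t w = 0; w < BITMAP_LONGS; w++)
        a->bitmap[w] = 0;
    for (int i = 0; i < MAX_PRIO; i++)
        list_init(&a->queue[i]);
}

static void enqueue_task(struct task *p, prio_array_t *array)
{
    assert(p->prio >= 0 && p->prio < MAX_PRIO);
    list_add_tail(&p->run_list, &array->queue[p->prio]);
    array->bitmap[p->prio / BITS_PER_LONG] |= 1UL << (p->prio % BITS_PER_LONG);
    array->nr_active++;
    p->array = array;
}

static void dequeue_task(struct task *p, prio_array_t *array)
{
    assert(p->array == array);
    array->nr_active--;
    list_del(&p->run_list);
    /* Clear the bit only when the last process of this priority leaves. */
    if (list_empty(&array->queue[p->prio]))
        array->bitmap[p->prio / BITS_PER_LONG] &= ~(1UL << (p->prio % BITS_PER_LONG));
}
```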
7.3.1. The runqueue Data Structure
The runqueue data structure is the most important data structure of the Linux 2.6 scheduler. Each CPU in the system has its own runqueue; all runqueue structures are stored in the runqueues per-CPU variable (see the section "Per-CPU Variables" in Chapter 5). The this_rq( ) macro yields the address of the runqueue of the local CPU, while the cpu_rq(n) macro yields the address of the runqueue of the CPU having index n.
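As a rough mental model, the per-CPU variable can be pictured as an array indexed by CPU number. The C sketch below makes that simplification explicit: NR_CPUS, the runqueues array, the reduced `struct runqueue`, the account_switch() helper, and the current_cpu variable (standing in for the kernel's smp_processor_id()) are all assumptions of the sketch. The real per-CPU mechanism keeps each CPU's copy in a separate per-CPU area rather than in a plain array.

```c
#define NR_CPUS 4  /* hypothetical CPU count for the sketch */

/* Reduced runqueue: only two counters from Table 7-4 are kept. */
struct runqueue {
    unsigned long nr_running;
    unsigned long nr_switches;
};

/* Simplified stand-in for the runqueues per-CPU variable. */
static struct runqueue runqueues[NR_CPUS];

/* Simplified stand-in for smp_processor_id(): index of the local CPU. */
static int current_cpu;

/* Address of the runqueue of the CPU having index n. */
#define cpu_rq(n)  (&runqueues[(n)])
/* Address of the runqueue of the local CPU. */
#define this_rq()  cpu_rq(current_cpu)

/* Example user: account one process switch on the local CPU. */
static void account_switch(void)
{
    this_rq()->nr_switches++;
}
```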
Table 7-4 lists the fields included in the runqueue data structure; we will discuss most of them in the following sections of the chapter.
Type | Name | Description |
---|---|---|
spinlock_t | lock | Spin lock protecting the lists of processes |
unsigned long | nr_running | Number of runnable processes in the runqueue lists |
unsigned long | cpu_load | CPU load factor based on the average number of processes in the runqueue |
unsigned long | nr_switches | Number of process switches performed by the CPU |
unsigned long | nr_uninterruptible | Number of processes that were previously in the runqueue lists and are now sleeping in TASK_UNINTERRUPTIBLE state (only the sum of these fields across all runqueues is meaningful) |
unsigned long | expired_timestamp | Insertion time of the eldest process in the expired lists |
unsigned long long | timestamp_last_tick | Timestamp value of the last timer interrupt |
task_t * | curr | Process descriptor pointer of the currently running process (same as current for the local CPU) |
task_t * | idle | Process descriptor pointer of the swapper process for this CPU |
struct mm_struct * | prev_mm | Used during a process switch to store the address of the memory descriptor of the process being replaced |
prio_array_t * | active | Pointer to the lists of active processes |
prio_array_t * | expired | Pointer to the lists of expired processes |
prio_array_t [2] | arrays | The two sets of active and expired processes |
int | best_expired_prio | The best static priority (lowest value) among the expired processes |
atomic_t | nr_iowait | Number of processes that were previously in the runqueue lists and are now waiting for a disk I/O operation to complete |
struct sched_domain * | sd | Points to the base scheduling domain of this CPU (see the section "Scheduling Domains" later in this chapter) |
int | active_balance | Flag set if some process shall be migrated from this runqueue to another (runqueue balancing) |
int | push_cpu | Not used |
task_t * | migration_thread | Process descriptor pointer of the migration kernel thread |
struct list_head | migration_queue | List of processes to be removed from the runqueue |
The most important fields of the runqueue data structure are those related to the lists of runnable processes. Every runnable process in the system belongs to one, and just one, runqueue. As long as a runnable process remains in the same runqueue, it can be executed only by the CPU owning that runqueue. However, as we'll see later, runnable processes may migrate from one runqueue to another.
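The following C sketch illustrates what a migration has to respect, using the thread_info->cpu and cpus_allowed fields listed in Table 7-5. Everything in it is hypothetical: the reduced `struct rq` and `struct task`, the NR_CPUS value, and migrate_task_sketch() are invented for illustration, and cpumask_t is simplified to a single unsigned long. The real kernel also locks both runqueues and moves the descriptor between the priority lists; the sketch only adjusts the counters and the CPU field.

```c
#include <stdbool.h>

#define NR_CPUS 4  /* hypothetical CPU count for the sketch */

/* Reduced runqueue: only the number of runnable processes is kept. */
struct rq {
    unsigned long nr_running;
};

/* Reduced descriptor: only the fields a migration must check. */
struct task {
    unsigned int cpu;            /* plays the role of thread_info->cpu */
    unsigned long cpus_allowed;  /* simplified cpumask_t: bit n set => CPU n allowed */
};

static struct rq runqueues[NR_CPUS];

/* Move a runnable task to dest_cpu's runqueue. Returns false, leaving the
 * task where it is, if dest_cpu is out of range or excluded by the mask. */
static bool migrate_task_sketch(struct task *p, unsigned int dest_cpu)
{
    if (dest_cpu >= NR_CPUS || !(p->cpus_allowed & (1UL << dest_cpu)))
        return false;
    runqueues[p->cpu].nr_running--;
    runqueues[dest_cpu].nr_running++;
    p->cpu = dest_cpu;  /* now only dest_cpu may execute the process */
    return true;
}
```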
The arrays field of the runqueue is an array consisting of two prio_array_t structures. Each data structure represents a set of runnable processes and includes 140 doubly linked list heads (one list for each possible process priority), a priority bitmap, and a counter of the processes included in the set (see Table 3-2 in Chapter 3).
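Keeping the two sets behind the active and expired pointers lets the 2.6 scheduler recycle them cheaply: when no active process is left, the two pointers are exchanged, so the expired set becomes the active one without moving any descriptor. A minimal C sketch, assuming a prio_array_t reduced to its nr_active counter and hypothetical rq_init() and maybe_switch_arrays() helpers:

```c
/* Reduced prio_array_t: only the counter matters for this sketch. */
typedef struct prio_array {
    int nr_active;
} prio_array_t;

/* Reduced runqueue: only the fields involved in the exchange. */
struct runqueue {
    prio_array_t *active;
    prio_array_t *expired;
    prio_array_t arrays[2];
};

static void rq_init(struct runqueue *rq)
{
    rq->arrays[0].nr_active = 0;
    rq->arrays[1].nr_active = 0;
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

/* If the active set is empty, exchange the two pointers: a constant-time
 * operation that does not touch any process descriptor. */
static void maybe_switch_arrays(struct runqueue *rq)
{
    if (rq->active->nr_active == 0) {
        prio_array_t *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
}
```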
Figure 7-1. The runqueue structure and the two sets of runnable processes
7.3.2. Process Descriptor
Each process descriptor includes several fields related to scheduling; they are listed in Table 7-5.
Type | Name | Description |
---|---|---|
unsigned long | thread_info->flags | Stores the TIF_NEED_RESCHED flag, which is set if the scheduler must be invoked (see the section "Returning from Interrupts and Exceptions" in Chapter 4) |
unsigned int | thread_info->cpu | Logical number of the CPU owning the runqueue to which the runnable process belongs |
unsigned long | state | The current state of the process (see the section "Process State" in Chapter 3) |
int | prio | Dynamic priority of the process |
int | static_prio | Static priority of the process |
struct list_head | run_list | Pointers to the next and previous elements in the runqueue list to which the process belongs |
prio_array_t * | array | Pointer to the runqueue's prio_array_t set that includes the process |
unsigned long | sleep_avg | Average sleep time of the process |
unsigned long long | timestamp | Time of last insertion of the process in the runqueue, or time of last process switch involving the process |
unsigned long long | last_ran | Time of last process switch that replaced the process |
int | activated | Condition code used when the process is awakened |
unsigned long | policy | The scheduling class of the process (SCHED_NORMAL, SCHED_RR, or SCHED_FIFO) |
cpumask_t | cpus_allowed | Bit mask of the CPUs that can execute the process |
unsigned int | time_slice | Ticks left in the time quantum of the process |
unsigned int | first_time_slice | Flag set to 1 if the process never exhausted its time quantum |
unsigned long | rt_priority | Real-time priority of the process |