7.1 Scheduling Policy
The set of rules used to determine when and how to select a new process to run is called the scheduling policy.
Linux scheduling is based on the time sharing technique.
The scheduling policy is also based on ranking processes according to their priority.
In Linux, process priority is dynamic.
7.1.1 Process Preemption
When a process enters the TASK_RUNNING state, the kernel checks whether its dynamic priority is greater than the priority of the currently running process. If it is, the execution of current is interrupted and the scheduler is invoked to select another process to run. Of course, a process also may be preempted when its time quantum expires. When this occurs, the TIF_NEED_RESCHED flag in the thread_info structure of the current process is set, so the scheduler is invoked when the timer interrupt handler terminates.
Be aware that a preempted process is not suspended, because it remains in the TASK_RUNNING state; it simply no longer uses the CPU.
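The two preemption triggers described above can be sketched in user-space C. This is only a toy model: struct toy_task, its fields, and both functions are hypothetical names chosen for illustration, and the boolean need_resched merely stands in for the TIF_NEED_RESCHED flag. It assumes the convention used for conventional processes later in this chapter, where a lower number means a higher priority.

```c
#include <stdbool.h>

/* Toy stand-in for a process descriptor; all names are illustrative. */
struct toy_task {
    int  prio;          /* dynamic priority: lower value = higher priority */
    int  time_slice;    /* ticks left in the current quantum */
    bool need_resched;  /* stands in for TIF_NEED_RESCHED */
};

/* Called when `woken` becomes runnable while `current` is on the CPU.
 * Flags `current` for rescheduling if `woken` has a higher priority. */
bool toy_check_preempt(struct toy_task *current, const struct toy_task *woken)
{
    if (woken->prio < current->prio)
        current->need_resched = true;
    return current->need_resched;
}

/* Called on every timer tick for the running task.
 * Flags the task for rescheduling once its quantum is used up. */
bool toy_timer_tick(struct toy_task *current)
{
    if (current->time_slice > 0)
        current->time_slice--;
    if (current->time_slice == 0)
        current->need_resched = true;
    return current->need_resched;
}
```

In both cases the task is only flagged; it stays runnable, which matches the remark that a preempted process is not suspended.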
7.1.2 How Long Must a Quantum Last?
The choice of the average quantum duration is always a compromise. The rule of thumb adopted by Linux is to choose a duration as long as possible, while keeping good system response time.
7.2 The Scheduling Algorithm
The scheduling algorithm of Linux 2.6 is much more sophisticated than that of earlier Linux versions. By design, it scales well with the number of runnable processes. It also scales well with the number of processors. Furthermore, the new algorithm does a better job of distinguishing interactive processes from batch processes.
The scheduler always succeeds in finding a process to be executed; in fact, there is always at least one runnable process: the swapper process, which has PID 0 and executes only when the CPU cannot execute other processes.
Every Linux process is always scheduled according to one of the following scheduling classes:
- SCHED_FIFO A First-In, First-Out real-time process.
- SCHED_RR A Round Robin real-time process.
- SCHED_NORMAL A conventional, time-shared process.
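As a small illustration, the class of the calling process can be inspected from user space. The sketch below assumes a Linux system with glibc; policy_name() and own_policy_name() are hypothetical helpers. Note that user-space headers expose the conventional class under the POSIX name SCHED_OTHER rather than SCHED_NORMAL.

```c
#include <sched.h>

/* Map a policy constant to the class name used in this chapter.
 * SCHED_OTHER is the user-space name for SCHED_NORMAL. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    case SCHED_OTHER: return "SCHED_NORMAL";
    default:          return "unknown";
    }
}

/* Name of the scheduling class of the calling process (pid 0 = self). */
const char *own_policy_name(void)
{
    int policy = sched_getscheduler(0);
    return policy < 0 ? "error" : policy_name(policy);
}
```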
7.2.1 Scheduling of Conventional Processes
Every conventional process has its own static priority, which is a value used by the scheduler to rate the process with respect to the other conventional processes in the system. The kernel represents the static priority of a conventional process with a number ranging from 100 (highest priority) to 139 (lowest priority); notice that static priority decreases as the values increase.
A new process always inherits the static priority of its parent. However, a user can change the static priority of the processes that he owns by passing some "nice values" to the nice() and setpriority() system calls.
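The relation between nice values and static priorities can be sketched as follows. The mapping static priority = 120 + nice, with nice in the range -20 to 19, is not stated in the text above; it is assumed here from the Linux 2.6 sources. The function names are hypothetical.

```c
/* Assumed mapping (Linux 2.6): static priority = 120 + nice.
 * Nice values outside -20..19 are clamped to that range. */
int nice_to_static_prio(int nice)
{
    if (nice < -20)
        nice = -20;
    if (nice > 19)
        nice = 19;
    return 120 + nice;
}

/* Inverse mapping for a static priority in 100..139. */
int static_prio_to_nice(int static_prio)
{
    return static_prio - 120;
}
```

Under this assumption, the default nice value 0 corresponds to static priority 120, the middle of the 100 to 139 range.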
Base time quantum
The static priority essentially determines the base time quantum of a process, that is, the time quantum duration assigned to the process when it has exhausted its previous time quantum.
Higher priority processes usually get longer slices of CPU time than lower priority processes.
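A minimal sketch of how the base time quantum might be derived from the static priority. The formula is an assumption recalled from the Linux 2.6 scheduler and is not given in the text above: (140 - static priority) x 20 ms when the static priority is below 120, and (140 - static priority) x 5 ms otherwise. The function name is hypothetical.

```c
/* Base time quantum in milliseconds for a static priority in 100..139,
 * using the assumed Linux 2.6 formula. Returns -1 for other values. */
int base_time_quantum_ms(int static_prio)
{
    if (static_prio < 100 || static_prio > 139)
        return -1;
    if (static_prio < 120)
        return (140 - static_prio) * 20;
    return (140 - static_prio) * 5;
}
```

With this formula, static priority 100 yields 800 ms, the default 120 yields 100 ms, and 139 yields 5 ms.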
Dynamic priority and average sleep time
Besides a static priority, a conventional process also has a dynamic priority, which is a value ranging from 100 (highest priority) to 139 (lowest priority). The dynamic priority is the number actually looked up by the scheduler when selecting the new process to run.
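The link between static and dynamic priority can be sketched as below. The formula is an assumption taken from the Linux 2.6 scheduler, not from the text above: dynamic priority = max(100, min(static priority - bonus + 5, 139)), where bonus is a value from 0 to 10 that grows with the average sleep time of the process. The function name is hypothetical.

```c
/* Dynamic priority from static priority and a sleep-time bonus (0..10),
 * using the assumed Linux 2.6 formula. The result is clamped to the
 * conventional range 100..139. */
int dynamic_prio(int static_prio, int bonus)
{
    int prio;

    if (bonus < 0)
        bonus = 0;
    if (bonus > 10)
        bonus = 10;

    prio = static_prio - bonus + 5;
    if (prio < 100)
        prio = 100;
    if (prio > 139)
        prio = 139;
    return prio;
}
```

A bonus of 5 is neutral; a smaller bonus penalizes the process and a larger one rewards it, which is how long-sleeping (typically interactive) processes end up favored.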
Active and expired processes
To avoid process starvation, when a process finishes its time quantum, it can be replaced by a lower priority process whose time quantum has not yet been exhausted. To implement this mechanism, the scheduler keeps two disjoint sets of runnable processes:
- Active processes These runnable processes have not yet exhausted their time quantum and are thus allowed to run.
- Expired processes These runnable processes have exhausted their time quantum and are thus forbidden to run until all active processes expire.
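The active/expired mechanism can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the kernel's implementation: priorities are ignored, each set is a small fixed-size array of PIDs, and every name (toy_runqueue, toy_rq_add, and so on) is hypothetical. The point it shows is that when the active set becomes empty, the two sets simply exchange roles.

```c
#include <stddef.h>

#define TOY_MAX_TASKS 8

struct toy_set {
    int    pids[TOY_MAX_TASKS];
    size_t count;
};

struct toy_runqueue {
    struct toy_set  sets[2];
    struct toy_set *active;   /* tasks with quantum left */
    struct toy_set *expired;  /* tasks that used up their quantum */
};

static int toy_set_append(struct toy_set *s, int pid)
{
    if (s->count == TOY_MAX_TASKS)
        return -1;
    s->pids[s->count++] = pid;
    return 0;
}

static int toy_set_remove(struct toy_set *s, int pid)
{
    for (size_t i = 0; i < s->count; i++) {
        if (s->pids[i] == pid) {
            for (size_t j = i + 1; j < s->count; j++)
                s->pids[j - 1] = s->pids[j];
            s->count--;
            return 0;
        }
    }
    return -1;
}

void toy_rq_init(struct toy_runqueue *rq)
{
    rq->sets[0].count = 0;
    rq->sets[1].count = 0;
    rq->active = &rq->sets[0];
    rq->expired = &rq->sets[1];
}

/* A new runnable task starts in the active set. */
int toy_rq_add(struct toy_runqueue *rq, int pid)
{
    return toy_set_append(rq->active, pid);
}

/* The task finished its quantum: move it from active to expired. */
int toy_rq_expire(struct toy_runqueue *rq, int pid)
{
    if (rq->expired->count == TOY_MAX_TASKS)
        return -1;
    if (toy_set_remove(rq->active, pid) != 0)
        return -1;
    return toy_set_append(rq->expired, pid);
}

/* Return the next task to run, or -1 if nothing is runnable.
 * When no active task is left, the two sets swap roles. */
int toy_rq_pick_next(struct toy_runqueue *rq)
{
    if (rq->active->count == 0) {
        struct toy_set *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    if (rq->active->count == 0)
        return -1;
    return rq->active->pids[0];
}
```

Swapping two pointers rather than moving tasks keeps the role exchange cheap regardless of how many processes have expired.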
7.2.2 Scheduling of Real-Time Processes
Every real-time process is associated with a real-time priority, which is a value ranging from 1 (highest priority) to 99 (lowest priority). The scheduler always favors a higher priority runnable process over a lower priority one; in other words, a real-time process inhibits the execution of every lower-priority process while it remains runnable. Contrary to conventional processes, real-time processes are always considered active. The user can change the real-time priority of a process by means of the sched_setparam() and sched_setscheduler() system calls.
The duration of the base time quantum of Round Robin real-time processes does not depend on the real-time priority, but rather on the static priority of the process.
7.3 Data Structures Used by the Scheduler
7.3.1 The runqueue Data Structure
The runqueue data structure is the most important data structure of the Linux 2.6 scheduler. Each CPU in the system has its own runqueue; all runqueue structures are stored in the runqueues per-CPU variable.
7.3.2 Process Descriptor
Each process descriptor includes several fields related to scheduling.
7.4 Functions Used by the Scheduler
The scheduler relies on several functions in order to do its work; the most important are:
- scheduler_tick() Keeps the time_slice counter of current up-to-date
- try_to_wake_up() Awakens a sleeping process
- recalc_task_prio() Updates the dynamic priority of a process
- schedule() Selects a new process to be executed
- load_balance() Keeps the runqueues of a multiprocessor system balanced
7.4.1 The scheduler_tick() Function
7.4.2 The try_to_wake_up() Function
7.4.3 The recalc_task_prio() Function
7.4.4 The schedule() Function
7.5 Runqueue Balancing in Multiprocessor Systems
7.5.1 Scheduling Domains
7.5.2 The rebalance_tick() Function
7.5.3 The load_balance() Function
7.5.4 The move_tasks() Function
7.6 System Calls Related to Scheduling
Several system calls have been introduced to allow processes to change their priorities and scheduling policies.
7.6.1 The nice() System Call
The nice() system call allows processes to change their base priority.
The nice() system call is maintained for backward compatibility only; it has been replaced by the setpriority() system call described next.
7.6.2 The getpriority() and setpriority() System Calls
The nice() system call affects only the process that invokes it. Two other system calls, denoted as getpriority() and setpriority(), act on the base priority of all processes in a given group.
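A short user-space sketch of these two calls, assuming Linux with glibc. It acts only on the calling process (PRIO_PROCESS with who set to 0); read_own_nice() and lower_own_priority() are hypothetical helper names. Because -1 is a legitimate nice value, getpriority() failures have to be detected through errno.

```c
#include <errno.h>
#include <sys/resource.h>

/* Store the nice value of the calling process in *nice_out.
 * Returns 0 on success, -1 on failure. */
int read_own_nice(int *nice_out)
{
    int val;

    errno = 0;  /* -1 can be a valid result, so check errno instead */
    val = getpriority(PRIO_PROCESS, 0);
    if (val == -1 && errno != 0)
        return -1;
    *nice_out = val;
    return 0;
}

/* Raise the nice value of the calling process by `delta` (>= 0),
 * capped at 19. Lowering one's own priority needs no privilege. */
int lower_own_priority(int delta)
{
    int cur, target;

    if (delta < 0 || read_own_nice(&cur) != 0)
        return -1;
    target = cur + delta;
    if (target > 19)
        target = 19;
    return setpriority(PRIO_PROCESS, 0, target);
}
```

Passing PRIO_PGRP or PRIO_USER instead of PRIO_PROCESS selects a process group or all processes of a user, which is the group-wide behavior described above.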
7.6.3 The sched_getaffinity() and sched_setaffinity() System Calls
The sched_getaffinity() and sched_setaffinity() system calls respectively return and set up the CPU affinity mask of a process, that is, the bit mask of the CPUs that are allowed to execute the process.
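The bit-mask view of CPU affinity can be sketched by invoking the two system calls directly through syscall(2), assuming a Linux system. Applications would normally use the glibc wrappers together with the cpu_set_t macros; the raw form is used here only to expose the mask as plain machine words. TOY_MASK_WORDS and the function names are hypothetical, and the buffer size is an assumption that covers 1024 CPUs on a 64-bit machine.

```c
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#define TOY_MASK_WORDS 16  /* assumed large enough for the running kernel */

/* Read the affinity mask of the calling process (pid 0 = self). */
int read_affinity(unsigned long mask[TOY_MASK_WORDS])
{
    size_t len = TOY_MASK_WORDS * sizeof(unsigned long);

    memset(mask, 0, len);  /* the kernel may fill only part of the buffer */
    return syscall(SYS_sched_getaffinity, 0, len, mask) < 0 ? -1 : 0;
}

/* Number of CPUs allowed to execute the calling process, or -1. */
int count_allowed_cpus(void)
{
    unsigned long mask[TOY_MASK_WORDS];
    int count = 0;

    if (read_affinity(mask) != 0)
        return -1;
    for (int w = 0; w < TOY_MASK_WORDS; w++)
        for (unsigned long bits = mask[w]; bits != 0; bits &= bits - 1)
            count++;  /* each iteration clears the lowest set bit */
    return count;
}

/* Restrict the calling process to the first CPU in its current mask.
 * Returns that CPU's index, or -1 on failure. */
int pin_to_first_allowed_cpu(void)
{
    unsigned long mask[TOY_MASK_WORDS];
    const int bits_per_word = (int)(8 * sizeof(unsigned long));

    if (read_affinity(mask) != 0)
        return -1;
    for (int cpu = 0; cpu < TOY_MASK_WORDS * bits_per_word; cpu++) {
        unsigned long bit = 1UL << (cpu % bits_per_word);

        if (mask[cpu / bits_per_word] & bit) {
            memset(mask, 0, sizeof(mask));
            mask[cpu / bits_per_word] = bit;
            if (syscall(SYS_sched_setaffinity, 0, sizeof(mask), mask) != 0)
                return -1;
            return cpu;
        }
    }
    return -1;
}
```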
7.6.4 System Calls Related to Real-Time Processes
The sched_getscheduler() and sched_setscheduler() system calls
The sched_getparam() and sched_setparam() system calls
The sched_yield() system call
The sched_get_priority_min() and sched_get_priority_max() system calls
The sched_rr_get_interval() system call
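A few of these calls can be exercised from user space without special privileges. The following is a minimal sketch assuming Linux with glibc; struct prio_range and the helper names are hypothetical. Changing the policy or the real-time priority with sched_setscheduler() or sched_setparam() usually requires privileges, so only the query calls and sched_yield() are shown.

```c
#include <sched.h>
#include <time.h>

struct prio_range {
    int min;
    int max;
};

/* Query the valid priority range for a scheduling policy.
 * Returns 0 on success, -1 if the policy is rejected. */
int query_priority_range(int policy, struct prio_range *out)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (lo == -1 || hi == -1)
        return -1;
    out->min = lo;
    out->max = hi;
    return 0;
}

/* Round Robin time quantum reported for the calling process,
 * in nanoseconds, or -1 on failure. */
long long own_rr_interval_ns(void)
{
    struct timespec ts;

    if (sched_rr_get_interval(0, &ts) != 0)
        return -1;
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Voluntarily give up the CPU without suspending the process. */
int yield_cpu(void)
{
    return sched_yield();
}
```

Querying the range at run time, rather than hard-coding it, keeps a program independent of the limits a particular kernel reports.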