An EEVDF CPU scheduler for Linux

kevin内核随笔

Last modified: 2023-11-02 20:26:47

Tags: linux

First published: 2023-11-02 19:28:46

Article link: https://blog.csdn.net/weixin_49382066/article/details/134189543

Translated from https://lwn.net/Articles/925371/. Feature newly merged in the 6.6 kernel: New task scheduler: EEVDF

The kernel’s completely fair scheduler (CFS) has the job of managing the allocation of CPU time for most of the processes running on most Linux systems. CFS was merged for the 2.6.23 release in 2007 and has, with numerous ongoing tweaks, handled the job reasonably well ever since. CFS is not perfect, though, and there are some situations it does not handle as well as it should. The EEVDF scheduler, posted by Peter Zijlstra, offers the possibility of improving on CFS while reducing its dependence on often-fragile heuristics.

CFS and scheduling constraints

One of the key design goals of CFS was, as might be understood from its name, fairness — ensuring that every process in the system gets its fair share of CPU time. This goal is achieved by tracking how much time each process has received and running those that have gotten less CPU time than the others, with each process’s run time scaled by its “nice” priority. CFS is, in other words, a weighted fair queuing scheduler at its core.
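The weighted-fair-queuing idea described above can be illustrated with a small simulation. This is a minimal sketch, not the kernel's code: the function names are hypothetical, and `weight_for_nice` assumes that each nice step changes a task's weight by roughly 25% around a nice-0 weight of 1024, whereas the kernel uses a precomputed table. The scheduler here always runs the task with the smallest weighted ("virtual") run time for one tick.

```python
NICE_0_WEIGHT = 1024  # weight of a nice-0 task


def weight_for_nice(nice):
    # Assumption: approximate the kernel's weight table with a 1.25x step per nice level.
    return NICE_0_WEIGHT / (1.25 ** nice)


def simulate_wfq(nices, ticks, tick_ms=1.0):
    """Run `ticks` scheduling ticks and return the CPU time (ms) each task received."""
    weights = [weight_for_nice(n) for n in nices]
    vruntime = [0.0] * len(nices)  # run time scaled by weight
    cpu_ms = [0.0] * len(nices)    # real CPU time received
    for _ in range(ticks):
        # Pick the task that has received the least weighted run time so far.
        i = min(range(len(nices)), key=lambda k: vruntime[k])
        cpu_ms[i] += tick_ms
        # A heavier (lower-nice) task accumulates virtual run time more slowly.
        vruntime[i] += tick_ms * NICE_0_WEIGHT / weights[i]
    return cpu_ms


equal = simulate_wfq([0, 0, 0, 0, 0], ticks=1000)
mixed = simulate_wfq([0, 5], ticks=10000)
print(equal)
print(mixed[0] / mixed[1])
```

With five equal-priority tasks, each ends up with the same CPU time; with one nice-0 and one nice-5 task, the nice-0 task receives roughly three times as much, under the weight approximation assumed here.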
Fairness, it turns out, is enough to solve many CPU-scheduling problems. There are, however, many constraints beyond the fair allocation of CPU time that are placed on the scheduler. It should, for example, maximize the benefit of the system’s memory caches, which requires minimizing the movement of processes between CPUs. At the same time, though, it should keep all CPUs busy if there is work for them to do. Power management is a complication as well; sometimes the optimal decisions for system throughput must take a back seat to preserving battery life. Hybrid systems (where not all CPUs are the same) add more complications. And so on.
One place where there is a desire for improvement is in the handling of latency requirements. Some processes may not need a lot of CPU time but, when they do need that time, they need it quickly. Others might need more CPU time but can wait for it if need be. CFS does not give processes a way to express their latency requirements; nice values (priorities) can be used to give a process more CPU time, but that is not the same thing. The realtime scheduling classes can be used for latency-sensitive work, but running in a realtime class is a privileged operation, and realtime processes can badly affect the operation of the rest of the system.
What is lacking is a way to ensure that some processes can get access to a CPU quickly without necessarily giving those processes the ability to obtain more than their fair share of CPU time. The latency nice patches have been circulating for some time as an attempt to solve this problem; they allow CFS processes with tight latency requirements to jump the queue for the CPU when they want to run. These patches appear to work, but Zijlstra thinks that there might be a better approach to the problem.

Introducing EEVDF

The “Earliest Eligible Virtual Deadline First” (EEVDF) scheduling algorithm is not new; it was described in a 1995 paper by Ion Stoica and Hussein Abdel-Wahab. Its name suggests something similar to the Earliest Deadline First algorithm used by the kernel’s deadline scheduler but, unlike that scheduler, EEVDF is not a realtime scheduler, so it works in different ways. Understanding EEVDF requires getting a handle on a few (relatively) simple concepts.
Like CFS, EEVDF tries to divide the available CPU time fairly among the processes that are contending for it. If, for example, there are five processes trying to run on a single CPU, each of those processes should get 20% of the available time. A given process’s nice value can be used to adjust the calculation of what its fair time is; a process with a lower nice value (and thus a higher priority) is entitled to more CPU time at the expense of those with higher nice values. To this point, there is nothing new here.
Imagine a time period of one second; during that time, in our five-process scenario, each process should have gotten 200ms of CPU time. For a number of reasons, things never turn out exactly that way; some processes will have gotten too much time, while others will have been shortchanged. For each process, EEVDF calculates the difference between the time that process should have gotten and how much it actually got; that difference is called “lag”. A process with a positive lag value has not received its fair share and should be scheduled sooner than one with a negative lag value.
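The lag calculation for the five-process scenario can be worked through in a few lines. This is a sketch under simplifying assumptions: all processes have equal weight, the received times are made-up illustration values, and `compute_lags` is a hypothetical helper rather than anything from the patch set.

```python
def compute_lags(received_ms, period_ms):
    """Return each process's lag: its fair share minus the CPU time it actually got."""
    fair_share_ms = period_ms / len(received_ms)  # equal weights assumed
    return [fair_share_ms - got for got in received_ms]


# Illustrative CPU times received by five processes over one second.
received = [250.0, 200.0, 150.0, 220.0, 180.0]
lags = compute_lags(received, period_ms=1000.0)
print(lags)  # [-50.0, 0.0, 50.0, -20.0, 20.0]
```

The third process (lag of +50ms) has been shortchanged the most, while the first (lag of -50ms) has run ahead of its share. The lags sum to zero because the CPU time handed out equals the time that was available.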
In fact, a process is deemed to be “eligible” if — and only if — its calculated lag is greater than or equal to zero; any process with a negative lag will not be eligible to run. For any ineligible process, there will be a time in the future where the time it is entitled to catches up to the time it has actually gotten and it will become eligible again; that time is deemed the “eligible time”.
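The eligibility rule, and a rough estimate of when an ineligible process becomes eligible again, might be sketched as follows. The helper names are hypothetical, and the estimate assumes that the process does not run while it waits and that its share of the CPU (for example 0.2 with five equal processes) stays fixed; a real scheduler tracks this in virtual time instead.

```python
def is_eligible(lag_ms):
    """A process is eligible if and only if its lag is greater than or equal to zero."""
    return lag_ms >= 0


def wait_until_eligible_ms(lag_ms, share):
    """Estimate the wall-clock time until a process's entitlement catches up.

    Assumption: while waiting, the process accrues entitlement at `share`
    milliseconds of CPU time per millisecond of wall-clock time.
    """
    if is_eligible(lag_ms):
        return 0.0
    return -lag_ms / share


print(is_eligible(20.0), is_eligible(-50.0))
print(wait_until_eligible_ms(-50.0, share=0.2))
```

Under these assumptions, a process that is 50ms ahead of its fair share, competing with four equal peers, would wait about 250ms before its lag returns to zero.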
The calculation of lag is, thus, a key part of the EEVDF scheduler, and much of the patch set is dedicated to finding this value correctly. Even in the absence of the full EEVDF algorithm, a process’s lag can be used to place it fairly in the run queue; processes with higher lag should be run first in an attempt to even out lag values across the system.
The other factor that comes into play is the “virtual deadline”, which is the earliest time by which a process should have received its due CPU time. This deadline is calculated by adding a process’s allocated time slice to its eligible time. A process with a 10ms time slice, and whose eligible time is 20ms in the future, will have a virtual deadline that is 30ms in the future.
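The deadline arithmetic from the example above is simple enough to state directly. The function name is hypothetical, and times are expressed in milliseconds relative to now.

```python
def virtual_deadline(eligible_time_ms, slice_ms):
    """Virtual deadline = eligible time + allocated time slice."""
    return eligible_time_ms + slice_ms


# A 10ms slice with an eligible time 20ms in the future.
print(virtual_deadline(eligible_time_ms=20, slice_ms=10))  # 30
```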
The core of EEVDF, as can be seen in its name, is that it will run the process with the earliest virtual deadline first. The scheduling choice is thus driven by a combination of fairness (the lag value that is used to calculate the eligible time) and the amount of time that each process currently has due to it.
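Putting eligibility and virtual deadlines together, the selection rule can be sketched as a filter followed by a minimum. This is an illustration only: the `Task` fields and `pick_next` are hypothetical names, a linear scan stands in for whatever data structure a real implementation would use, and returning `None` when nothing is eligible is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    lag_ms: float               # positive: owed CPU time; negative: ran ahead
    virtual_deadline_ms: float  # eligible time plus time slice


def pick_next(tasks: List[Task]) -> Optional[Task]:
    """Among eligible tasks (lag >= 0), pick the one with the earliest virtual deadline."""
    eligible = [t for t in tasks if t.lag_ms >= 0]
    if not eligible:
        return None  # sketch-only fallback
    return min(eligible, key=lambda t: t.virtual_deadline_ms)


tasks = [
    Task("A", lag_ms=-5.0, virtual_deadline_ms=10.0),  # earliest deadline, but not eligible
    Task("B", lag_ms=3.0, virtual_deadline_ms=30.0),
    Task("C", lag_ms=0.0, virtual_deadline_ms=25.0),
]
print(pick_next(tasks).name)  # C
```

Task A has the earliest deadline but a negative lag, so it is skipped; of the two eligible tasks, C has the earlier virtual deadline and runs first.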

Addressing the latency problem

With this framework in place, the implementation of quicker access for latency-sensitive processes happens naturally. When the scheduler is calculating the time slice for each process, it factors in that process’s assigned latency-nice value; a process with a lower latency-nice setting (and, thus, tighter latency requirements) will get a shorter time slice. Processes that are relatively indifferent to latency will receive longer slices. Note that the amount of CPU time given to any two processes (with the same nice value) will be the same, but the low-latency process will get it in a larger number of shorter slices.
Remember that the virtual deadline is calculated by adding the time slice to the eligible time. That will cause processes with shorter time slices to have closer virtual deadlines and, as a result, to be executed first. Latency-sensitive processes, which normally don’t need large amounts of CPU time, will be able to respond quickly to events, while processes without latency requirements will be given longer time slices, which can help to improve throughput. No tricky scheduler heuristics are needed to get this result.
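The effect of latency-nice on slices and deadlines can be sketched as follows. The mapping from latency-nice to slice length is purely hypothetical (a 3ms base slice scaled by 1.25 per step); the article does not give the formula used in the patches. What the sketch is meant to show is only the relationship: a lower latency-nice value gives a shorter slice, hence an earlier virtual deadline, while the same total CPU time is delivered in more pieces.

```python
import math

BASE_SLICE_MS = 3.0  # hypothetical base time slice


def slice_for_latency_nice(latency_nice):
    # Hypothetical mapping: lower latency-nice means a shorter slice.
    return BASE_SLICE_MS * (1.25 ** latency_nice)


def slices_needed(cpu_share_ms, slice_ms):
    """Number of slices required to deliver a given amount of CPU time."""
    return math.ceil(cpu_share_ms / slice_ms)


eligible_time_ms = 20.0  # same eligible time for both tasks
for latency_nice in (-5, 5):
    s = slice_for_latency_nice(latency_nice)
    deadline = eligible_time_ms + s
    print(latency_nice, round(s, 2), round(deadline, 2), slices_needed(200.0, s))
```

Both tasks are entitled to the same 200ms, but the latency-sensitive one (latency-nice of -5) has the nearer deadline and receives its time in many short slices, whereas the other gets fewer, longer ones.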
There is a big distance, though, between an academic paper and an implementation that can perform well in the Linux kernel. Zijlstra has only begun to run benchmarks on his EEVDF scheduler; his initial conclusion is that “there’s a bunch of wins and losses, but nothing that indicates a total fail”. Some of the results, he said, “seem to indicate EEVDF schedules a lot more consistently than CFS and has a bunch of latency wins”.
While this is clearly a reasonable starting point, Zijlstra acknowledges that there is still quite a bit of work to be done. But, he said, “if we can pull this off we can delete a whole [bunch] of icky heuristics code”, replacing it with a better-defined policy. This is not a small change, he added: “It completely reworks the base scheduler, placement, preemption, picking – everything. The only thing they have in common is that they’re both a virtual time based scheduler.”
Needless to say, such a fundamental change is unlikely to be rushed into the kernel. Helpfully, the current patches implement EEVDF as an option alongside CFS, which will enable wider testing without actually replacing the current scheduler. The CPU scheduler has to do the right thing for almost any conceivable workload on the wide range of systems supported by the kernel; that leaves a lot of room for unwelcome regressions resulting from even small changes — which this is not. So a lot of that testing will have to happen before consideration might be given to replacing CFS with EEVDF; there is no deadline, virtual or otherwise, for when that might happen.