The extensible scheduler class keeps pushing

Translated from https://lwn.net/Articles/972710/ by Jonathan Corbet, May 9, 2024

Background: Tejun Heo's sched_ext patch set received a NAK from Peter Zijlstra at version 4. Tejun Heo has since posted versions 5 and 6 and continues to push for inclusion in the mainline.

The extensible scheduler class (or “sched_ext”) is a comprehensive framework that enables the implementation of CPU schedulers as a set of BPF programs that can be loaded at run time. Despite having attracted a fair amount of interest from the development community, sched_ext has run into considerable opposition and seems far from acceptance into the mainline. The posting by Tejun Heo of a new version of the sched_ext series at the beginning of May has restarted this long-running discussion, but it is not clear what the end result will be.

As a quick refresher: sched_ext allows the creation of BPF programs that handle almost every aspect of the scheduling problem; these programs can be loaded (and unloaded) at run time. Sched_ext is designed to safely fall back to the completely fair scheduler should something go wrong (if a process fails to be run within a time limit, for example). It has been used to create a number of special-purpose schedulers, often with impressive performance benefits for the intended workload. See this 2023 article for a more detailed overview of this work.
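To make the fallback idea concrete, here is a toy Python model. It is not kernel code and does not use the sched_ext BPF API; every name in it (`Task`, `Scheduler`, `fair_policy`, `starving_policy`, `timeout`) is hypothetical and chosen only for illustration. It sketches one plausible reading of the description above: a pluggable policy picks the next task, and a watchdog ejects that policy in favor of a built-in one as soon as any runnable task has waited longer than a time limit.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    name: str
    runtime: int = 0   # ticks of CPU time received so far
    last_ran: int = 0  # tick at which the task last finished running


# A policy is any function that picks the next task to run.
Policy = Callable[[List[Task]], Task]


def fair_policy(tasks: List[Task]) -> Task:
    """Stand-in for the built-in scheduler: run whoever has had the least CPU."""
    return min(tasks, key=lambda t: t.runtime)


def starving_policy(tasks: List[Task]) -> Task:
    """A deliberately buggy custom policy that only ever runs the first task."""
    return tasks[0]


@dataclass
class Scheduler:
    tasks: List[Task]
    custom: Policy
    timeout: int             # assumed limit: max ticks a task may wait
    now: int = 0
    using_custom: bool = True

    def tick(self) -> str:
        # Watchdog: if any task has waited too long, eject the custom policy.
        if self.using_custom and any(
            self.now - t.last_ran > self.timeout for t in self.tasks
        ):
            self.using_custom = False
        policy = self.custom if self.using_custom else fair_policy
        task = policy(self.tasks)
        task.runtime += 1
        self.now += 1
        task.last_ran = self.now
        return task.name


if __name__ == "__main__":
    sched = Scheduler([Task("a"), Task("b")], custom=starving_policy, timeout=3)
    print([sched.tick() for _ in range(6)], sched.using_custom)
```

In this model the buggy policy runs task `a` until task `b` has waited past the limit; from then on the fallback policy takes over and `b` catches up. A well-behaved custom policy never trips the watchdog and stays loaded.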

Heo lists a number of changes that have been made to sched_ext since the previous version was posted in November. For the most part, these appear to be adjustments to the BPF API to make the writing of schedulers easier. There is also a new shutdown mechanism that, among other things, disables the BPF scheduler during power-management events like system suspend. There is now support for CPU-frequency scaling, and some debugging interfaces have been added to make developing schedulers easier. The core design of sched_ext appears to have stabilized, though.

Increasing interest

Even before getting to the changes, though, Heo called attention to the increasing interest in sched_ext that is being shown across the community and beyond. Valve is planning to use sched_ext for better game scheduling on the Steam Deck. Ubuntu is considering shipping it in the 24.10 release. Meta and Google are increasing their use of it in their production fleets. There is also evidently interest in using it in ChromeOS, and Oculus is looking at it as well. Heo concludes that section with:
Given that there already is substantial adoption which continues to grow and sched_ext doesn’t affect the built-in schedulers or the rest of kernel in an invasive manner, I believe it’s reasonable to consider sched_ext for inclusion.

Whether that inclusion will happen remains an open question, though. The posting of version 4 of the patch set in July 2023 led to a slow-burning discussion on the merits of this development. Scheduler maintainer Peter Zijlstra rejected the patches outright, saying:
There is not a single doubt in my mind that if I were to merge this, there will be Enterprise software out there that will mandate its own BPF sched thing, or else it won’t work.
They will not care, they will not contribute, they might even pull a RedHat and only share the code to customers.

He added that he saw no value in merging the code, and dropped out of the conversation. Mel Gorman also expressed his opposition to merging sched_ext, echoing Zijlstra’s concern that enterprise software would start requiring the use of special-purpose schedulers. He later added that, in his opinion (one shared with Zijlstra), sched_ext would work actively against the improvement of the current scheduler:
I generally worry that certain things may not have existed in the shipped scheduler if plugging was an option including EAS, throttling control, schedutil integration, big.Little, adapting to chiplets and picking preferred SMT siblings for turbo boost. In each case, integrating support was time consuming painful and a pluggable scheduler would have been a relatively easy out that would ultimately cost us if it was never properly integrated.

Heo, naturally, disagreed with a lot of the concerns that had been raised. There are, he said, scheduling problems that cannot be addressed with tweaks to the current scheduler, especially in “hyperscaling” environments like Meta. He disagreed that sched_ext would impose a maintenance burden, arguing that the intrusion of BPF into other parts of the kernel has not had that result. Making it possible for users to do something new is beneficial, even if there will inevitably be “stupid cases” resulting from how some choose to use the new feature. In summary, he said, opponents are focused on the potential (and, in his opinion, overstated) costs of sched_ext without taking into account the benefits it would bring.

Restarting the conversation

That message, in October, was the end of the conversation at the time. Heo is clearly hoping for a better result this time around, but Zijlstra’s response was not encouraging:
I fundamentally believe the approach to be detrimental to the scheduler eco-system. Witness the metric ton of toy schedulers written for it, that’s all effort not put into improving the existing code.

He said that he would not accept any part of this patch series until “the cgroup situation” has been resolved. That “situation” is a performance problem that affects certain workloads when a number of control groups are in use. Rik van Riel had put together a patch series to address this problem in 2019, but it never reached the point of being merged; Zijlstra seems to be insisting that this work be completed before sched_ext can be considered, and he gave little encouragement that it would be more favorably considered even afterward.

Heo expressed a willingness (albeit reluctantly) to work on the control-group problem if it would clear the way for sched_ext. He strongly disagreed with Zijlstra’s characterization of sched_ext schedulers as “toy schedulers” and the claim that working on sched_ext will take effort away from the mainline scheduler, though. There is, he said, no perfect CPU scheduler, so the mainline scheduler has to settle for being good enough for all users. That makes it almost impossible to experiment with “radical ideas”, and severely limits the pool of people who can work on the scheduler. Much of the energy that goes into sched_ext schedulers, he said, is otherwise unavailable for scheduler development at all.

There is, he said, value in some of those radical ideas:
Yet, the many different ways that even simple schedulers can demonstrates sometimes significant behavior and performance benefits for specific workloads suggest that there are a lot of low hanging fruits in the area. Low hanging fruits that we can’t easily reach from our current local optimum. A single implementation which has to satisfy all users all the time is unlikely to be an effective vehicle for mapping out such landscape.

Igalia developer Changwoo Min, who is working with Valve on gaming-oriented scheduling, supported Heo’s argument, saying that: “The successful implementation of sched_ext enriches the scheduler community with fresh insights, ideas, and code”. That, as of this writing, is where this conversation stands.

What next?

Sched_ext is on the schedule for the BPF track of the Linux Storage, Filesystem, Memory-Management, and BPF Summit, which begins on May 13. That discussion will cover the future development of sched_ext but, most likely, will not be able to address the question of whether this work should be merged at all. That discussion could continue, on the mailing lists and elsewhere, for some time yet.

Sometimes, when a significant kernel development stalls in this way, distributors that see value in it will ship the patches anyway, as Ubuntu, Valve, and ChromeOS are considering doing. While shipping out-of-tree code is often discouraged, it can also serve to demonstrate interest in a feature and flush out any early problems that result from its inclusion. If things go well, this practice can strengthen the argument for merging the code into the mainline, albeit with the ever-present possibility of changes that create pain for the early adopters.

Whether that will be the path taken for sched_ext remains to be seen. What is certain is that this work has attracted a lot of interest and is unlikely to go away anytime soon. Sched_ext has the potential to enable a new level of creativity in scheduler development, even if it remains out of the mainline — but that potential will be stronger if it does end up being merged. Significant scheduler patches are not merged quickly even when they are uncontroversial; this one will be slower than most if it is accepted at all.
