sched_ext discussions at LPC 2024

https://lwn.net/Articles/991205/
By Jonathan Corbet, September 26, 2024
The extensible scheduler class (sched_ext) enables the implementation of CPU schedulers as a set of BPF programs loaded from user space; it first hit the mailing lists in late 2022. Sched_ext has engendered its share of controversy since, but is currently slated to be part of the 6.12 kernel release. At the 2024 Linux Plumbers Conference, the growing sched_ext community held one of its first public gatherings; sched_ext would appear to have launched a new burst of creativity in scheduler design.

An overview

The sched_ext microconference began with Tejun Heo, one of the authors of this new subsystem. He introduced sched_ext as a new scheduling class that sits in the hierarchy along with the EEVDF and realtime schedulers. It serves as a bridge between the scheduling core and the BPF virtual machine, where all of the interesting decisions are made. BPF maps are used as a high-performance interface to user space.
[Photo: Tejun Heo at the 2024 Linux Plumbers Conference]

Work on sched_ext is proceeding on several fronts, starting with the merge of the sched_ext core into the mainline kernel, which was still pending at the time of this talk. Basic scheduling is part of that and works now. There are ongoing efforts to support features like CPU-frequency scaling, CPU hotplug, and control groups (basic support for which landed in 6.12); those can be expected to be added in future mainline kernel releases.

There is a repository available with the current sched_ext work, including a number of example schedulers. The scx_lavd scheduler, for example, is focused on interactivity and, specifically, consistently getting higher frame rates out of games. The scx_bpfland scheduler, instead, is aimed at minimizing response times. There is scx_rustland, which simply forwards scheduling events to user space, where the decisions are made. Heo admitted that he had been skeptical of that idea at the outset, but it has turned out to be “quite usable”. The repository also contains scx_rusty for load balancing on complex CPU topologies, and scx_layered, which is a partitioning scheduler.

One of the best things about sched_ext, he said, is that it cannot crash the machine. All of the usual BPF safety checks apply here. Additionally, if the kernel detects a scheduling problem, it will simply revert the system to the EEVDF scheduler and life goes on. That makes experimenting easy and the development cycle short.
It is, he said, still the early days of sched_ext development, and he is focused on getting some practical wins. One of those appears to be scx_lavd (about which more was heard later), which is headed for shipment in Steam Deck gaming systems. scx_bpfland is showing promising results for personal machines, while scx_layered has been deployed in over one million machines and is delivering significant performance gains.
The sched_ext developers are also working on building the development community. Support for sched_ext is now shipping in a number of distributions, including CachyOS, Arch Linux, Ubuntu, Fedora, Nix, and openSUSE. On those distributions, running a new scheduler is just a matter of installing a package and running a program.

Work in the future, Heo said in conclusion, is focused on composability — making it possible for multiple schedulers to work together. That will allow different developers to focus on independent layers; one scheduler could be concerned with time slices, while another would focus on load balancing or partitioning. The plan is also to eventually allow the stacking of schedulers down the control-group hierarchy, so that different schedulers at each level could handle a part of the overall scheduling problem.

User-space scheduling

While sched_ext is meant to put CPU-scheduling decisions into users’ hands, it was still expected that those decisions would be made by BPF programs running within the kernel. So Andrea Righi’s scx_rustland scheduler, which defers all of those decisions to user space, came as a bit of a surprise. Righi started his session by saying that scx_rustland began as just a fun project, with no expectation that something useful would result. He mostly wanted better observability of scheduling decisions and a faster development cycle, where installing a new scheduler is just a matter of restarting a program.
[Photo: Andrea Righi at the 2024 Linux Plumbers Conference]

What he came up with is a new Rust crate providing the scheduling interface; it is licensed under GPLv2. Schedulers are thus written in Rust, but the BPF code, which mostly just communicates events and decisions between the kernel and user space, is still compiled from C code. A pair of ring buffers is used for communication; initially BPF maps had been used, but the ring buffers are much faster. The API for schedulers has been deliberately kept simple, with the idea that anybody should be able to use it to write a new scheduler.
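
As a rough illustration of that design, the sketch below models the two ring buffers in plain Python: one carries runnable tasks from the kernel side to the user-space scheduler, and the other carries dispatch decisions back. It is not the scx_rustland API. `RingBuffer`, `QueuedTask`, `DispatchedTask`, the field names, and the 5ms slice are all hypothetical, and the least-runtime-first policy was picked only to keep the example short.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class QueuedTask:
    pid: int
    runtime_ns: int  # CPU time consumed so far (hypothetical field)


@dataclass
class DispatchedTask:
    pid: int
    slice_ns: int  # time slice granted by the user-space scheduler


class RingBuffer:
    """Fixed-capacity FIFO standing in for a kernel/user-space ring buffer."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity

    def push(self, item):
        # A full buffer rejects the item instead of overwriting old entries.
        if len(self._items) >= self._capacity:
            return False
        self._items.append(item)
        return True

    def pop(self):
        return self._items.popleft() if self._items else None


def user_space_scheduler(queued, dispatched, slice_ns=5_000_000):
    """Drain runnable tasks, order them by least runtime, dispatch them all."""
    batch = []
    while (task := queued.pop()) is not None:
        batch.append(task)
    batch.sort(key=lambda t: t.runtime_ns)
    for task in batch:
        if not dispatched.push(DispatchedTask(task.pid, slice_ns)):
            # A real scheduler would hold the task and retry; this toy gives up.
            raise RuntimeError("dispatch buffer full")
    return len(batch)
```

In this model, the "kernel" would push a `QueuedTask` each time a task becomes runnable and pop `DispatchedTask` entries to learn what to run next; the policy itself lives entirely in `user_space_scheduler()`.
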

Righi admitted, though, that scx_rustland “is not all rainbows and unicorns”. One significant problem is that the scheduler program cannot block for any reason (such as a page fault), or scheduling as a whole comes to a halt. So a custom memory allocator is used to keep the scheduler running in locked-in-RAM memory. Multithreading in the scheduling program is “tricky” but mostly solved. Even with the ring buffers, the communication overhead with the kernel is significant, but not a huge issue. There are some possible sched_ext changes that would help there.

Righi’s future plans include standardizing and locking down the user-space API for schedulers. He would also like to create a concept of “scheduling domains”, each of which is made up of a set of CPUs. The ability to attach a task to one of these domains would make scheduling easier and improve performance.

Higher frame rates

Changwoo Min took over via a remote link to talk about scx_lavd, which is a “latency criticality aware virtual deadline” scheduler aimed at gaming applications. It uses latency criticality (described later) as the primary scheduling decision, handles heterogeneous cores well, and adapts its scheduling decisions to the load pattern on the system.

The goal behind this scheduler was to provide the best gaming experience on Linux in general — not just on the Steam Deck. That requires getting high performance (and high video frame rates) without stuttering (short-term performance loss due to load in the system). The scheduler should deliver reasonable performance across a wide range of CPU configurations, but it is not intended to be the best server or general-purpose scheduler.

A key aspect of gaming workloads is that tasks tend to run quickly, typically no more than 100µs at a time. There are a lot of tightly linked tasks, though, and performance depends on the most critical of those tasks running in the necessary sequence; that is the critical path. Every task has a latency criticality that is determined by its place in this path; tasks that wait on others, and are waited on in turn, have a large impact on overall performance and are thus “latency critical”. Detecting these tasks requires observing which tasks wait for which others, and ensuring that the tasks being waited for are run with low latency.
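
The observation step can be pictured with a small sketch. It assumes the scheduler sees a stream of wakeup events as (waker, wakee) pairs, counts how often each task wakes others and how often it is itself woken after waiting, and ranks tasks that do both. The function names and the product used as a score are illustrative assumptions, not scx_lavd's actual bookkeeping.

```python
from collections import Counter


def count_wake_wait(wakeups):
    """Count wakeups issued and wakeups received for each task.

    wakeups is an iterable of (waker, wakee) pairs.
    """
    wakes = Counter()  # how often a task wakes another (it was waited on)
    waits = Counter()  # how often a task is woken (it was waiting)
    for waker, wakee in wakeups:
        wakes[waker] += 1
        waits[wakee] += 1
    return wakes, waits


def rank_latency_critical(wakeups):
    """Return (task, score) pairs for tasks that both wait and are waited on."""
    wakes, waits = count_wake_wait(wakeups)
    # The product is an assumed score: zero unless the task does both.
    scores = {task: wakes[task] * waits[task] for task in set(wakes) | set(waits)}
    ranked = [(task, score) for task, score in scores.items() if score > 0]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

For a chain such as input, game, render, compositor, only the tasks in the middle of the chain receive a non-zero score; the head never waits and the tail is never waited on.
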

Each task has a virtual deadline calculated for it, which is a function of both its waking and waiting frequencies — its latency criticality, in other words. Tasks that both wait often for others and are often waited upon are seen as the most critical, so their deadline is the shortest. Time slices are then assigned in a manner similar to how the completely fair scheduler does it; slices are fixed, but get shorter as the number of runnable tasks increases.
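
One way to picture that calculation is sketched below. The talk did not spell out the formula, so the product of the two frequencies, the 10ms base, and the slice bounds are assumptions made for illustration; only the general shape (more critical tasks get earlier deadlines, slices shrink as more tasks become runnable) comes from the description above.

```python
def virtual_deadline(now_ns, wake_freq, wait_freq, base_ns=10_000_000):
    """Return an absolute virtual deadline; higher criticality means sooner.

    wake_freq and wait_freq are the observed frequencies with which the task
    wakes others and waits on others. Their product (plus one, so the divisor
    is never zero) is an assumed stand-in for latency criticality.
    """
    criticality = 1.0 + wake_freq * wait_freq
    return now_ns + int(base_ns / criticality)


def time_slice(nr_runnable, target_ns=20_000_000,
               min_ns=500_000, max_ns=5_000_000):
    """Split a target period among runnable tasks, clamped to fixed bounds."""
    share = target_ns // max(nr_runnable, 1)
    return max(min_ns, min(max_ns, share))
```

With these assumed defaults, a task that neither wakes nor waits gets a deadline a full 10ms away, while a task with both frequencies at 3 gets one a tenth as far out. The slice stays at the 5ms cap while few tasks are runnable and bottoms out at 0.5ms under heavy load.
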
Care is also taken to choose CPUs properly on heterogeneous systems. At times of low load, with a simple workload, the low-power cores can get the job done while minimizing power use. If the load is heavy, though, then performance becomes the primary goal, and the fast cores must be used. The in-between case is trickier; some tasks can be put on smaller cores, but some will need the faster ones.

In the scx_lavd “autopilot” mode, the scheduler looks at the current CPU utilization. For light loads, a power-saving mode is chosen; for heavy loads, the fast cores are used in a race-to-idle strategy. In between those extremes, the scheduler tries to minimize the number of cores in use, but takes care to put the latency-critical tasks onto the large cores.
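
The decision logic described here can be sketched as two small functions. The 25% and 75% thresholds, the mode names, and the "big"/"little" labels are hypothetical; the talk only described the three regimes, not the cutoffs or the real scx_lavd interface.

```python
from enum import Enum


class Mode(Enum):
    POWERSAVE = "powersave"
    BALANCED = "balanced"
    PERFORMANCE = "performance"


def pick_mode(cpu_util, low=0.25, high=0.75):
    """Map CPU utilization (0.0 to 1.0) to a mode; thresholds are assumed."""
    if cpu_util < low:
        return Mode.POWERSAVE
    if cpu_util > high:
        return Mode.PERFORMANCE
    return Mode.BALANCED


def pick_core(mode, latency_critical):
    """Choose a core type for a task under the given mode."""
    if mode is Mode.PERFORMANCE:
        return "big"      # race to idle on the fast cores
    if mode is Mode.POWERSAVE:
        return "little"   # light load: low-power cores suffice
    # In between: only latency-critical tasks get the large cores.
    return "big" if latency_critical else "little"
```

The sketch leaves out the other half of the in-between case, which is keeping the number of active cores small; that would require tracking per-core load, which is beyond a short example.
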

Min concluded by saying that, for gaming applications, scx_lavd consistently enables higher frame rates than the EEVDF scheduler while using (slightly) less power and with fewer stutters.

A lot of activity

The sched_ext microconference included a number of other presentations, some from people who had been working on out-of-tree schedulers for years. Barret Rhoden and Josh Don talked about the use of pluggable scheduling within Google, a project that has been underway since 2019. Once again, this effort was able to obtain better performance, but also highlighted the fact that different workloads benefit from different scheduling policies. Himadri Chhaya-Shailesh discussed using sched_ext for paravirtualized scheduling, where host and guest schedulers communicate to optimize the overall result. Masahito Suzuki and Alfred Chen have both been working on out-of-tree schedulers for desktop use. Peter Jung discussed the CachyOS distribution, which has been shipping a range of out-of-tree schedulers for years; developers there have created a whole infrastructure allowing users to switch schedulers on the fly.

The kernel project has long had a policy that it would support one general-purpose CPU scheduler, and that scheduler had to provide good service for all workloads. This policy has, beyond a doubt, resulted in a sophisticated scheduler that is able to run on everything from small embedded systems to massive data-center machines. It has ensured that all users benefit from scheduler improvements.

What was made abundantly clear at the sched_ext microconference, though, is that this policy has also led to the marginalization of a lot of creative work in this area. A scheduler that cannot regress for any workload leaves little room for developers wanting to optimize a specific class of applications, and who cannot even test many other workloads. This is a hard area in which to scratch an itch; developers have been discouraged from trying, and those who have ventured into this area have rarely seen their work enter the mainline kernel.

Sched_ext has removed many of the barriers to entry in the area of scheduler development, and the result has been an immediate increase in the number of developers playing with ideas and seeing where they lead. There is a new community that is quickly forming here, and it seems likely to come up with some novel (and sometimes crazy) approaches to CPU scheduling. This will be an interesting space to watch in the coming years.

[ Thanks to the Linux Foundation, LWN’s travel sponsor, for supporting our travel to this event. ]
