[Understanding K8s Scheduling in Depth] Understanding the Scheduling Framework Extension Points

oceanweave

Published 2024-10-03 18:57:49

Tags: kubernetes, containers, cloud native

Article link: https://blog.csdn.net/qq_24433609/article/details/142694653

Adapted from

K8s Scheduling Framework Design and a scheduler plugins Development and Deployment Example (2024)

Scheduler plugin extension points

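| Phase | Extension point | Description | Examples and notes |
| --- | --- | --- | --- |
| Waiting to be scheduled | PreEnqueue | The Pod is in the ready-for-scheduling stage (internals: sig-scheduling/scheduler_queues.md). These plugins run before the Pod is put into the scheduling queue and let you pre-check or make decisions about it. A Pod that fails here never enters the scheduling queue, let alone the scheduling cycle. | Reject unqualified Pods early: if a Pod's resource requests clearly exceed what the cluster can offer, hold it back in `PreEnqueue` instead of letting it enter the queue and waste scheduler time. Other uses: security checks or policy validation; delaying or cancelling scheduling. |
| Waiting to be scheduled | QueueSort | The scheduler picks the next Pod to schedule from the scheduling queue; `QueueSort` defines the rule and order of that choice. By default Pods are ordered by priority and then by time in the queue (FIFO); a custom `QueueSort` plugin can implement more complex ordering. | |
| Scheduling cycle | PreFilter | Pre-processes and checks the Pod; ends scheduling early if the Pod does not meet expectations. (The subject is the Pod.) | |
| Scheduling cycle | Filter | Filters out nodes that do not meet the requirements. For each node the scheduler runs the filter plugins in configured order; as soon as one plugin fails, the node is excluded. (The subject is the node: each node goes through the plugin checks in order, the results are merged, and a single failure rejects the node.) | |
| Scheduling cycle | PostFilter | Runs only if no node at all is left after Filter; otherwise the plugins of this stage are skipped. Plugins run in order; as soon as one of them marks the node as `Schedulable`, the stage succeeds and the remaining PostFilter plugins are not executed. | Typical example: `preemptiontoleration`. After `Filter()` no usable node is left, so this stage picks a pod/node and preempts its resources. (Once the preemption plugin has obtained resources for the pod, nothing else needs to run.) |
| Scheduling cycle | PreScore | Prepares in advance the data that the `Score` plugins share, for example which nodes have special hardware such as GPUs. It does not assign scores itself. | Suppose the policy scores nodes by hardware type. In `PreScore` the plugin checks which nodes have GPU resources and stores the result in the cycle state; a `Score` plugin can then give GPU nodes 10 and ordinary nodes 0. |
| Scheduling cycle | Score | Evaluates how well each node fits, usually based on the Pod's specific requirements. | The Pod needs a certain amount of CPU and memory and requires a node label (for example `app=web`). Node A (`app=web`, enough CPU and memory) scores 80. Node B (`app=frontend`, enough CPU but the label does not match) scores 20. Node C (`app=web`, insufficient resources) scores 10. |
| Scheduling cycle | NormalizeScore | Converts scores into normalized values so that the suitability of different nodes can be compared fairly. | Rescale the scores of all nodes so that they fall in the same range. |
| Scheduling cycle | Reserve | Informational: maintains plugin state. There are two methods; plugins that maintain runtime state (aka "stateful plugins") use them to receive notifications from the scheduler. `Reserve`: avoids errors caused by race conditions while the scheduler waits for the `bind` operation to finish. Only when all `Reserve` plugins succeed does the pod move on; otherwise the scheduling cycle is aborted. `Unreserve`: runs during rollback when scheduling fails. | `Unreserve()` must be idempotent and must not fail (idempotent: running it several times gives the same result as running it once, so repeated calls cause no unexpected bugs). |
| Scheduling cycle | Permit | The last extension point of the scheduling cycle; it can prevent or delay binding a pod to the candidate node. Three outcomes: approve (once all Permit plugins approve, the pod enters the binding phase); deny (if any Permit plugin denies, the pod cannot enter the binding phase, and the `Unreserve()` method of the `Reserve` plugins is triggered); wait with a timeout (if a Permit plugin returns "wait", the pod is put into an internal "waiting" Pods list). | |
| Binding cycle | WaitOnPermit | The first step of the binding cycle: the framework waits here for a Pod that a `Permit` plugin put into the waiting state. The maximum wait is the timeout returned by that `Permit` plugin; the Pod proceeds if it is allowed in time and is rejected otherwise. | Pod co-scheduling: several Pods must be scheduled together or according to some coordination rule (for example a primary/replica architecture, or a dependency on other Pods having started), so a `Permit` plugin holds one Pod until the others meet the condition and then releases them together. Resource locking: keep some resources unused until other Pods are ready, with the timeout bounding how long a Pod waits. Task queue: Pods that must be processed in turn are suspended by a `Permit` plugin, with the timeout defining the longest they may wait. |
| Binding cycle | PreBind | Pre-processing before `Bind`, for example mounting a volume on the node. If any PreBind plugin fails, the pod is rejected and the `Unreserve()` method of the `Reserve` plugins is called. | |
| Binding cycle | Bind | Entered only after all PreBind plugins have completed. All plugins are called in configured order; each plugin may choose whether to handle a given pod; once a plugin handles it, the remaining plugins are skipped. So at most one bind plugin actually binds the pod. | |
| Binding cycle | PostBind | Informational: maintains plugin state and cannot affect the scheduling decision (no return value). Only pods that were bound successfully reach this stage. As the last stage of the binding cycle, it is generally used to clean up related resources. | Run cleanup or other follow-up actions (for example saving the node a pod was bound to into a CR). |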

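1 Introduction

The K8s scheduling framework provides a plugin mechanism for extending the scheduler, which is very useful when you need custom scheduling logic.

If the pod spec does not set the schedulerName field, the default scheduler is used;
if it does, the pod is handled by the corresponding scheduler/scheduler plugin.

This article collects some related material and shows how to implement a simple sticky-node scheduler plugin in about 300 lines of code. The code is based on k8s v1.28.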

1.1 Scheduling framework extension points

As shown in the figure below, the K8s scheduling framework defines a number of extension points:

Fig. Scheduling framework extension points.

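Users can write their own scheduler plugins and register them at these extension points to implement the scheduling logic they want. An extension point usually has several plugins, which are executed in registration order.

Extension points fall into two categories, depending on whether they affect the scheduling decision.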

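1.1.1 Extension points that affect the scheduling decision

Most extension points affect the scheduling decision:

as shown later, the return value of these functions includes a success/failure field that decides whether the pod is allowed into the next processing stage;
if any of these extension points fails, scheduling of the pod fails;

1.1.2 Extension points that do not affect the scheduling decision (informational)

A few extension points are informational:

these functions have no return value, so they cannot affect the scheduling decision;
however, they can modify pod/node information or perform cleanup.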

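1.2 Categories of scheduler plugins

Plugins fall into two categories, depending on whether they are maintained in the k8s code repository itself.

1.2.1 `in-tree` plugins

These are maintained in the k8s source directory pkg/scheduler/framework/plugins and compiled together with the built-in scheduler. There are a dozen or so of them, most of which are common and in active use: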

$ ll pkg/scheduler/framework/plugins
defaultbinder/
defaultpreemption/
dynamicresources/
feature/
imagelocality/
interpodaffinity/
names/
nodeaffinity/
nodename/
nodeports/
noderesources/
nodeunschedulable/
nodevolumelimits/
podtopologyspread/
queuesort/
schedulinggates/
selectorspread/
tainttoleration/
volumebinding/
volumerestrictions/
volumezone/

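With the in-tree approach, every new or modified plugin requires changing the kube-scheduler code, recompiling it and redeploying kube-scheduler, which is rather heavyweight.

1.2.2 `out-of-tree` plugins

out-of-tree plugins are written and maintained by users and deployed independently; they need no code or configuration change in k8s.

In essence, out-of-tree plugins are also compiled together with the kube-scheduler code, but the relevant kube-scheduler code has been extracted into a separate project, github.com/kubernetes-sigs/scheduler-plugins. Users only need to import this package, write their own scheduler plugin, and deploy it as an ordinary pod (other deployment methods, such as a plain binary, also work). The build result is one combined scheduler program containing the default scheduler and all out-of-tree plugins:

it has the functionality of the built-in scheduler;
it also includes the functionality of the out-of-tree plugins;

There are two ways to use it:

deploy it alongside the existing scheduler and let it manage only certain pods;
replace the existing scheduler, since it is functionally complete.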

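1.3 Built-in plugins at each extension point

The built-in scheduler plugins and the extension points each of them works at are listed in the official documentation. For example,

node selectors and node affinity use the NodeAffinity plugin;
taint/toleration uses the TaintToleration plugin.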

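2 The pod scheduling process

The complete scheduling of a pod can be divided into two phases:

scheduling cycle: selects a node for the pod, similar to a database query with filtering;
binding cycle: carries out that choice, similar to handling all the associated pieces and writing the result to the database.

For example, the scheduling cycle may have selected a node for the pod, but if creating a persistent volume for the pod on that node fails in the following binding cycle, the whole scheduling attempt counts as failed and has to start again from the beginning. The two phases together are called a scheduling context.

In addition, before a scheduling context is entered there is a scheduling queue. Users can write their own algorithm to sort the pods in the queue and decide which pods enter scheduling first. The overall flow is shown in the figure below: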

Fig. queuing/sorting and scheduling context

The following sections look at each part in turn.

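2.1 Waiting-to-be-scheduled phase

2.1.1 `PreEnqueue`

The Pod is in the ready-for-scheduling stage. Internals: sig-scheduling/scheduler_queues.md.

A Pod that does not pass this step never enters the scheduling queue, let alone the scheduling cycle.

Purpose and scenarios

The PreEnqueue extension point gives the scheduler a chance to check and filter a Pod before it enters the scheduling loop. It helps with the following:

- Reject unqualified Pods early: if there is a clear reason why a Pod should not be scheduled, PreEnqueue can decide quickly and keep the Pod out of the scheduling queue, avoiding unnecessary scheduling overhead.
  Example: if a Pod's resource requests clearly exceed what the cluster can offer, it can be held back in the PreEnqueue stage instead of entering the queue and wasting scheduler time.
- Give important Pods precedence: the plugin can decide which Pods are admitted first, so that more important Pods enter the queue, and are therefore scheduled, earlier. (Ordering inside the queue itself is the job of QueueSort.)
  Example: PreEnqueue can recognize the Pods of critical applications and admit them ahead of others so that they are scheduled sooner.
- Run security checks or policy validation: before a Pod joins the queue, checks can ensure that it meets the cluster's security or policy requirements.
  Example: PreEnqueue can check the Pod's security policy to make sure it complies with the cluster's network isolation or resource usage policy.
- Delay or cancel scheduling: PreEnqueue can decide that some Pods should not be scheduled right now, or keep them out of scheduling altogether according to policy.
  Example: if a Pod depends on external services that are currently unavailable, PreEnqueue can keep it out of the queue until the services recover.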

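2.1.2 `QueueSort`

Sorts the pods in the scheduling queue and decides which pods are scheduled first.

The scheduler picks the next Pod to schedule from the scheduling queue, and the QueueSort extension point defines the rule and order of that choice. By default the Kubernetes scheduler orders Pods by priority and then by time in the queue (FIFO, first in first out); a custom QueueSort plugin can implement more complex ordering.

Use cases

QueueSort can be used in the following scenarios:

- Ordering by priority: by default Pods are ordered by priority (PriorityClass), and Pods with higher priority are scheduled first.
  Example: the Pod of a critical service can be given a higher priority, so that QueueSort places it ahead of lower-priority Pods in the scheduling queue.
- Custom ordering rules: for special needs, such as ordering by Pod labels, by requested resources, or by some custom policy, implement a QueueSort plugin.
  Example: a custom rule can schedule Pods that need GPUs first, or order Pods according to a load-balancing policy.
- Fair scheduling: schedule the Pods of different users or queues fairly, so that the Pods of one queue do not take up too much of the scheduler.
  Example: Pods can be ordered by each namespace's resource quota or by user permissions, so that the Pods of one tenant do not monopolize the scheduler.
- Ordering by waiting time: besides priority, Pods can be ordered by how long they have waited, so that long-waiting Pods get a chance to be scheduled.
  Example: if some Pods have been waiting because resources are short, the QueueSort logic can schedule the longest-waiting Pods first to keep them from starving.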

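Implementing a `QueueSort` plugin

A QueueSort plugin must follow the plugin interface of the Kubernetes scheduling framework. The interface has a single core method, and only one QueueSort plugin can be enabled at a time:

Less function: compares two queued Pods. Returning true means the first Pod sorts ahead of the second and is scheduled first.
```
func (p *MyQueueSortPlugin) Less(pod1, pod2 *framework.QueuedPodInfo) bool {
    // Custom ordering logic goes here; as a placeholder, keep FIFO order
    return pod1.Timestamp.Before(pod2.Timestamp)
}
```
There is no separate Sort function to implement: the scheduling queue orders itself by calling Less.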

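Summary

QueueSort is an important extension point of the scheduling framework; it defines how Pods are ordered in the scheduling queue.
Custom logic can optimize the scheduling order, for example by priority, waiting time, resource demand or other policies.
By implementing a QueueSort plugin you control how the scheduler orders Pods and can meet specific scheduling needs.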
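
The default ordering described above (priority first, then FIFO) can be sketched without any Kubernetes dependency. This is a minimal model, not the in-tree implementation: `queuedPod`, `less`, `sortQueue` and `demoQueue` are hypothetical names standing in for `framework.QueuedPodInfo` and the plugin's `Less` method.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// queuedPod is a hypothetical stand-in for framework.QueuedPodInfo.
type queuedPod struct {
	Name     string
	Priority int32     // stand-in for the Pod's priority
	Enqueued time.Time // when the Pod entered the scheduling queue
}

// less reports whether a should be scheduled before b:
// higher priority first, then earlier enqueue time (FIFO).
func less(a, b queuedPod) bool {
	if a.Priority != b.Priority {
		return a.Priority > b.Priority
	}
	return a.Enqueued.Before(b.Enqueued)
}

// sortQueue returns the Pod names in the order implied by less.
func sortQueue(pods []queuedPod) []string {
	sorted := append([]queuedPod(nil), pods...)
	sort.SliceStable(sorted, func(i, j int) bool { return less(sorted[i], sorted[j]) })
	names := make([]string, 0, len(sorted))
	for _, p := range sorted {
		names = append(names, p.Name)
	}
	return names
}

// demoQueue builds a small queue: two batch Pods and one critical Pod that arrived last.
func demoQueue() []queuedPod {
	t0 := time.Unix(0, 0)
	return []queuedPod{
		{"batch-old", 0, t0},
		{"critical", 1000, t0.Add(2 * time.Second)},
		{"batch-new", 0, t0.Add(time.Second)},
	}
}

func main() {
	fmt.Println(sortQueue(demoQueue())) // [critical batch-old batch-new]
}
```

The critical Pod jumps the queue even though it arrived last, while the two batch Pods keep their arrival order.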

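2.2 Scheduling phase (scheduling cycle)

2.2.1 `PreFilter`: pre-process and check the pod; end scheduling early if it does not meet expectations

Plugins here can pre-process the Pod or check conditions. The function signature is: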

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L349-L367

// PreFilterPlugin is an interface that must be implemented by "PreFilter" plugins.
// These plugins are called at the beginning of the scheduling cycle.
type PreFilterPlugin interface {
    // PreFilter is called at the beginning of the scheduling cycle. All PreFilter
    // plugins must return success or the pod will be rejected. PreFilter could optionally
    // return a PreFilterResult to influence which nodes to evaluate downstream. This is useful
    // for cases where it is possible to determine the subset of nodes to process in O(1) time.
    // When it returns Skip status, returned PreFilterResult and other fields in status are just ignored,
    // and coupled Filter plugin/PreFilterExtensions() will be skipped in this scheduling cycle.
    PreFilter(ctx context.Context, state *CycleState, p *v1.Pod) (*PreFilterResult, *Status)

    // PreFilterExtensions returns a PreFilterExtensions interface if the plugin implements one,
    // or nil if it does not. A Pre-filter plugin can provide extensions to incrementally
    // modify its pre-processed info. The framework guarantees that the extensions
    // AddPod/RemovePod will only be called after PreFilter, possibly on a cloned
    // CycleState, and may call those functions more than once before calling
    // Filter again on a specific node.
    PreFilterExtensions() PreFilterExtensions
}

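Input:
- p *v1.Pod is the pod to be scheduled;
- the second parameter, state, can hold state that later extension points (for example the Filter() stage) read back;
Output:
- if any plugin returns failure, scheduling of this pod fails;
- in other words, the pod moves on to the next step only after all registered PreFilter plugins succeed;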

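2.2.2 `Filter`: exclude all nodes that do not meet the requirements

Plugins here filter out nodes that do not meet the requirements (the equivalent of Predicates in a scheduling Policy):

for each node, the scheduler runs the filter plugins in configured order;
as soon as any plugin returns failure, the node is excluded;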

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L349C1-L367C2

// FilterPlugin is an interface for Filter plugins. These plugins are called at the
// filter extension point for filtering out hosts that cannot run a pod.
// This concept used to be called 'predicate' in the original scheduler.
// These plugins should return "Success", "Unschedulable" or "Error" in Status.code.
// However, the scheduler accepts other valid codes as well.
// Anything other than "Success" will lead to exclusion of the given host from running the pod.
type FilterPlugin interface {
    Plugin
    // Filter is called by the scheduling framework.
    // All FilterPlugins should return "Success" to declare that
    // the given node fits the pod. If Filter doesn't return "Success",
    // it will return "Unschedulable", "UnschedulableAndUnresolvable" or "Error".
    // For the node being evaluated, Filter plugins should look at the passed
    // nodeInfo reference for this particular node's information (e.g., pods
    // considered to be running on the node) instead of looking it up in the
    // NodeInfoSnapshot because we don't guarantee that they will be the same.
    // For example, during preemption, we may pass a copy of the original
    // nodeInfo object that has some pods removed from it to evaluate the
    // possibility of preempting them to schedule the target pod.
    Filter(ctx context.Context, state *CycleState, pod *v1.Pod, nodeInfo *NodeInfo) *Status
}

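Input:
- nodeInfo describes the node currently being evaluated; Filter() decides whether this node meets the requirements;
Output:
- allow or reject.

A node passes filtering and becomes one of the candidate nodes only if all Filter plugins return success for it.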
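
The "every plugin must pass" rule can be illustrated with a small self-contained model. The types below (`node`, `filterFunc`) and the two example filters are hypothetical stand-ins for `framework.NodeInfo` and real Filter plugins; the sketch only mirrors the control flow described above.

```go
package main

import "fmt"

// node is a hypothetical stand-in for framework.NodeInfo.
type node struct {
	Name    string
	FreeCPU int // free CPU in millicores
	Labels  map[string]string
}

// filterFunc stands in for a Filter plugin; true means "Success".
type filterFunc func(n node) bool

// feasibleNodes runs the filters against each node in order and keeps
// only the nodes that pass all of them.
func feasibleNodes(nodes []node, filters []filterFunc) []string {
	var passed []string
	for _, n := range nodes {
		ok := true
		for _, f := range filters {
			if !f(n) {
				ok = false
				break // the first failing plugin excludes the node
			}
		}
		if ok {
			passed = append(passed, n.Name)
		}
	}
	return passed
}

// demo filters three nodes by free CPU and by the label app=web.
func demo() []string {
	nodes := []node{
		{"node-a", 4000, map[string]string{"app": "web"}},
		{"node-b", 4000, map[string]string{"app": "frontend"}},
		{"node-c", 100, map[string]string{"app": "web"}},
	}
	filters := []filterFunc{
		func(n node) bool { return n.FreeCPU >= 500 },
		func(n node) bool { return n.Labels["app"] == "web" },
	}
	return feasibleNodes(nodes, filters)
}

func main() {
	fmt.Println(demo()) // [node-a]
}
```

If `feasibleNodes` returns an empty list, that is exactly the situation in which the PostFilter stage would run.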

2.2.3 `PostFilter`: the rescue stage when no node is left after `Filter`

This stage runs only if every node was filtered out in the Filter stage and none is left; otherwise its plugins are not executed.

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L392C1-L407C2

// PostFilterPlugin is an interface for "PostFilter" plugins. These plugins are called after a pod cannot be scheduled.
type PostFilterPlugin interface {
    // A PostFilter plugin should return one of the following statuses:
    // - Unschedulable: the plugin gets executed successfully but the pod cannot be made schedulable.
    // - Success: the plugin gets executed successfully and the pod can be made schedulable.
    // - Error: the plugin aborts due to some internal error.
    //
    // Informational plugins should be configured ahead of other ones, and always return Unschedulable status.
    // Optionally, a non-nil PostFilterResult may be returned along with a Success status. For example,
    // a preemption plugin may choose to return nominatedNodeName, so that framework can reuse that to update the
    // preemptor pod's .spec.status.nominatedNodeName field.
    PostFilter(ctx context.Context, state *CycleState, pod *v1.Pod, filteredNodeStatusMap NodeToStatusMap) (*PostFilterResult, *Status)
}

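Plugins run in order; as soon as one plugin marks the node as Schedulable, the stage counts as successful and the remaining PostFilter plugins are not executed.

Typical example: preemptiontoleration. After Filter() no usable node is left, so this stage picks a pod/node and preempts its resources.

2.2.4 `PreScore`

PreScore/Score/NormalizeScore all serve to score nodes so that the most suitable node is finally selected. They are not covered in detail here; their function signatures are in the source file linked above.

2.2.5 `Score`

The scoring plugins are called for each node in turn, producing a score per node.

2.2.6 `NormalizeScore`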
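
As an illustration of what normalization means, here is a minimal sketch that rescales raw scores into the range 0 to 100 (the framework's node score range, `framework.MinNodeScore` to `framework.MaxNodeScore`). Min-max scaling is an assumption chosen for the example; each real plugin decides its own normalization, and `normalize` is a hypothetical helper rather than a framework function.

```go
package main

import "fmt"

// maxNodeScore mirrors framework.MaxNodeScore.
const maxNodeScore = 100

// normalize rescales raw scores linearly so that the lowest raw score maps
// to 0 and the highest to maxNodeScore. If all scores are equal, every node
// gets maxNodeScore.
func normalize(raw map[string]int64) map[string]int64 {
	out := make(map[string]int64, len(raw))
	if len(raw) == 0 {
		return out
	}
	var lo, hi int64
	first := true
	for _, s := range raw {
		if first {
			lo, hi = s, s
			first = false
			continue
		}
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	for name, s := range raw {
		if hi == lo {
			out[name] = maxNodeScore
			continue
		}
		out[name] = (s - lo) * maxNodeScore / (hi - lo)
	}
	return out
}

func main() {
	raw := map[string]int64{"node-a": 80, "node-b": 20, "node-c": 10}
	fmt.Println(normalize(raw)) // map[node-a:100 node-b:14 node-c:0]
}
```

After normalization, the scores of different plugins are on the same scale and can be combined.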

2.2.7 `Reserve`: informational, maintains plugin state

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L444C1-L462C2

// ReservePlugin is an interface for plugins with Reserve and Unreserve
// methods. These are meant to update the state of the plugin. This concept
// used to be called 'assume' in the original scheduler. These plugins should
// return only Success or Error in Status.code. However, the scheduler accepts
// other valid codes as well. Anything other than Success will lead to
// rejection of the pod.
type ReservePlugin interface {
    // Reserve is called by the scheduling framework when the scheduler cache is
    // updated. If this method returns a failed Status, the scheduler will call
    // the Unreserve method for all enabled ReservePlugins.
    Reserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
    // Unreserve is called by the scheduling framework when a reserved pod was
    // rejected, an error occurred during reservation of subsequent plugins, or
    // in a later phase. The Unreserve method implementation must be idempotent
    // and may be called by the scheduler even if the corresponding Reserve
    // method for the same plugin was not called.
    Unreserve(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string)
}

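The interface has two methods, both described as informational. Plugins that maintain runtime state (aka "stateful plugins") use them to receive notifications from the scheduler:

Reserve

Avoids errors caused by race conditions while the scheduler waits for the bind operation to finish. The pod moves on to the next stage only when all Reserve plugins succeed; otherwise the scheduling cycle is aborted.
Unreserve

Runs during rollback when scheduling fails. Unreserve() must be idempotent and must not fail.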
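
To make the idempotency requirement concrete, here is a small model of a stateful plugin. It is only a sketch: `gpuReserver` and its methods are hypothetical and use plain strings instead of the framework's `*v1.Pod` and `*CycleState` arguments.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// gpuReserver is a hypothetical stateful plugin that tracks GPUs promised
// to pods that have been assigned a node but are not bound yet.
type gpuReserver struct {
	mu       sync.Mutex
	free     map[string]int    // node name -> free GPUs
	reserved map[string]string // pod name -> node name
}

func newGPUReserver(free map[string]int) *gpuReserver {
	return &gpuReserver{free: free, reserved: map[string]string{}}
}

// Reserve claims one GPU on nodeName for pod and fails if none is left.
func (r *gpuReserver) Reserve(pod, nodeName string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.reserved[pod]; ok {
		return nil // already reserved
	}
	if r.free[nodeName] <= 0 {
		return errors.New("no free GPU on " + nodeName)
	}
	r.free[nodeName]--
	r.reserved[pod] = nodeName
	return nil
}

// Unreserve releases the pod's GPU. It is idempotent and also safe to call
// when Reserve never ran for this pod.
func (r *gpuReserver) Unreserve(pod string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	nodeName, ok := r.reserved[pod]
	if !ok {
		return
	}
	delete(r.reserved, pod)
	r.free[nodeName]++
}

// Free returns the number of free GPUs on nodeName.
func (r *gpuReserver) Free(nodeName string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.free[nodeName]
}

func main() {
	r := newGPUReserver(map[string]int{"node-1": 1})
	fmt.Println(r.Reserve("pod-a", "node-1"))        // <nil>
	fmt.Println(r.Reserve("pod-b", "node-1") != nil) // true
	r.Unreserve("pod-a")
	r.Unreserve("pod-a") // a second call changes nothing
	fmt.Println(r.Free("node-1")) // 1
}
```

Because `Unreserve` first checks whether a reservation exists, calling it twice (or without a prior `Reserve`) never releases more than was claimed.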

2.2.8 `Permit`: `approve/deny/wait` before entering the binding cycle

This is the last extension point of the scheduling cycle; it can prevent or delay binding a pod to the candidate node.

// PermitPlugin is an interface that must be implemented by "Permit" plugins.
// These plugins are called before a pod is bound to a node.
type PermitPlugin interface {
    // Permit is called before binding a pod (and before prebind plugins). Permit
    // plugins are used to prevent or delay the binding of a Pod. A permit plugin
    // must return success or wait with timeout duration, or the pod will be rejected.
    // The pod will also be rejected if the wait timeout or the pod is rejected while
    // waiting. Note that if the plugin returns "wait", the framework will wait only
    // after running the remaining plugins given that no other plugin rejects the pod.
    Permit(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) (*Status, time.Duration)
}

Three outcomes:

approve: once all Permit plugins approve, the pod enters the binding phase below;
deny: if any Permit plugin denies, the pod cannot enter the binding phase. This triggers the Unreserve() method of the Reserve plugins;
wait (with a timeout): if a Permit plugin returns "wait", the pod is put into an internal "waiting" Pods list;

2.3 Binding phase (binding cycle)

Fig. Scheduling framework extension points.

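2.3.1 `WaitOnPermit`: wait for a Pod held by a `Permit` plugin until it is allowed, rejected, or times out

WaitOnPermit is the first step of the binding cycle rather than a configurable parameter. If a Permit plugin asked the Pod to wait, the framework blocks here until the Pod is allowed or rejected. The maximum wait is the timeout that the Permit plugin returned together with its "wait" status (the time.Duration in the Permit signature).

Use cases

Pod co-scheduling: sometimes several Pods must be scheduled together under the same conditions or according to some coordination mechanism, for example a primary/replica architecture in a distributed application, or a dependency on other Pods having started. A Permit plugin can make a Pod wait until the other Pods meet the condition and then release them together.
Resource locking: in some cases certain resources must stay unused until other Pods are ready. The Permit stage can implement this kind of lock, and the timeout returned by the Permit plugin bounds how long the Pod waits while the resources are locked.
Task queue: if Pods must be processed in turn, a Permit plugin can suspend them temporarily, with its timeout defining the longest they may wait.

Example

Suppose a scheduler plugin uses the Permit extension point to control when Pods are scheduled, and requires some Pods to wait until other Pods reach a certain state: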

func (p *MyPermitPlugin) Permit(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (*framework.Status, time.Duration) {
    if shouldWait(pod) {
        // Ask the framework to hold the Pod, for at most 30 seconds
        return framework.NewStatus(framework.Wait, "Waiting for other pods to be ready"), 30 * time.Second
    }
    // Success: the Pod may go on to binding; the timeout is not used in this case
    return framework.NewStatus(framework.Success, "Pod can be scheduled"), 0
}

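In this case, if the Permit plugin returns "wait" together with a 30-second timeout, the framework waits at most 30 seconds in WaitOnPermit. If the other conditions are met within that time and the Pod is allowed, it proceeds to binding; if it has still not been allowed after 30 seconds, the Pod is rejected and this scheduling attempt fails, after which the scheduler retries the Pod or reports an error.

Summary

WaitOnPermit is where the framework waits for a Pod held in the Permit stage to be allowed. It suits scenarios that must wait for a specific condition, such as Pod co-scheduling, resource coordination, or task queue management. If the Pod is not allowed within the timeout, scheduling times out and fails.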
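
The wait-until-allowed-or-timeout behaviour can be modelled with a channel and a timer. This is a sketch under stated assumptions, not the framework's code: `waitingPod`, `decide`, `waitOnPermit` and `demo` are hypothetical names, and the durations are shortened so the example finishes quickly.

```go
package main

import (
	"fmt"
	"time"
)

// waitingPod is a hypothetical stand-in for an entry in the framework's
// internal list of waiting Pods.
type waitingPod struct {
	decision chan bool // true = allow, false = reject
}

func newWaitingPod() *waitingPod {
	return &waitingPod{decision: make(chan bool, 1)}
}

// decide records an allow/reject decision; only the first decision counts.
func (w *waitingPod) decide(allow bool) {
	select {
	case w.decision <- allow:
	default: // a decision was already recorded
	}
}

// waitOnPermit blocks until the Pod is allowed or rejected, or until the
// timeout expires. It reports whether the Pod may proceed to binding.
func waitOnPermit(w *waitingPod, timeout time.Duration) bool {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case ok := <-w.decision:
		return ok
	case <-timer.C:
		return false // timed out: treated as a rejection
	}
}

// demo runs two Pods: one is allowed shortly after it starts waiting,
// the other is never decided and times out.
func demo() (allowed, timedOut bool) {
	a := newWaitingPod()
	go func() {
		time.Sleep(10 * time.Millisecond)
		a.decide(true)
	}()
	allowed = waitOnPermit(a, time.Second)

	b := newWaitingPod()
	timedOut = !waitOnPermit(b, 20*time.Millisecond)
	return allowed, timedOut
}

func main() {
	allowed, timedOut := demo()
	fmt.Println(allowed, timedOut) // true true
}
```

In a co-scheduling plugin, the call that corresponds to `decide(true)` would be made once all Pods of the group have reached the Permit stage.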

2.3.2 `PreBind`: pre-processing before `Bind`, for example mounting a volume on the node

For example, before the pod is bound to a node, a network volume is first mounted for the pod on that node.

// PreBindPlugin is an interface that must be implemented by "PreBind" plugins.
// These plugins are called before a pod being scheduled.
type PreBindPlugin interface {
    // PreBind is called before binding a pod. All prebind plugins must return
    // success or the pod will be rejected and won't be sent for binding.
    PreBind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
}

If any PreBind plugin fails, the pod is rejected and the Unreserve() method of the reserve plugins is called;

2.3.3 `Bind`: associate the pod with the node

Bind is entered only after all PreBind plugins have completed.

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L497

// Bind plugins are used to bind a pod to a Node.
type BindPlugin interface {
    // Bind plugins will not be called until all pre-bind plugins have completed. Each
    // bind plugin is called in the configured order. A bind plugin may choose whether
    // or not to handle the given Pod. If a bind plugin chooses to handle a Pod, the
    // remaining bind plugins are skipped. When a bind plugin does not handle a pod,
    // it must return Skip in its Status code. If a bind plugin returns an Error, the
    // pod is rejected and will not be bound.
    Bind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string) *Status
}

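All plugins are called in configured order;
each plugin can choose whether to handle a given pod;
once a plugin chooses to handle the pod, the remaining plugins are skipped. In other words, at most one bind plugin actually handles the pod.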
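
The "first plugin that handles the pod wins" rule can be sketched as follows. The types are hypothetical stand-ins (`bindResult` plays the role of the `Status` code, with `skip` corresponding to the `Skip` status mentioned in the interface comment), and `VMBinder` is an invented plugin name used only for the demo.

```go
package main

import "fmt"

// bindResult is a hypothetical stand-in for the Status code of Bind.
type bindResult int

const (
	bound  bindResult = iota // the plugin handled the pod and bound it
	skip                     // the plugin chose not to handle the pod
	failed                   // the plugin tried to bind and failed
)

// bindPlugin stands in for a BindPlugin.
type bindPlugin struct {
	Name string
	Bind func(pod, node string) bindResult
}

// runBind calls the plugins in configured order and returns the name of
// the plugin that bound the pod.
func runBind(plugins []bindPlugin, pod, node string) (string, error) {
	for _, pl := range plugins {
		switch pl.Bind(pod, node) {
		case skip:
			continue // try the next plugin
		case bound:
			return pl.Name, nil // remaining plugins are skipped
		default:
			return "", fmt.Errorf("bind plugin %q failed for pod %s", pl.Name, pod)
		}
	}
	return "", fmt.Errorf("no bind plugin handled pod %s", pod)
}

// demo binds an ordinary pod: the special-purpose binder skips it and the
// default binder handles it.
func demo() (string, error) {
	plugins := []bindPlugin{
		{"VMBinder", func(pod, node string) bindResult {
			if pod == "vm-pod" {
				return bound
			}
			return skip
		}},
		{"DefaultBinder", func(pod, node string) bindResult { return bound }},
	}
	return runBind(plugins, "web-pod", "node-1")
}

func main() {
	fmt.Println(demo()) // DefaultBinder <nil>
}
```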

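2.3.4 `PostBind`: informational, optional, performs cleanup

This is an informational extension point, which means it cannot affect the scheduling decision (it has no return value).

Only pods that were bound successfully reach this stage;
as the last stage of the binding cycle, it is generally used to clean up related resources.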

// https://github.com/kubernetes/kubernetes/blob/v1.28.4/pkg/scheduler/framework/interface.go#L473

// PostBindPlugin is an interface that must be implemented by "PostBind" plugins.
// These plugins are called after a pod is successfully bound to a node.
type PostBindPlugin interface {
    // PostBind is called after a pod is successfully bound. These plugins are informational.
    // A common application of this extension point is for cleaning
    // up. If a plugin needs to clean-up its state after a pod is scheduled and
    // bound, PostBind is the extension point that it should register.
    PostBind(ctx context.Context, state *CycleState, p *v1.Pod, nodeName string)
}

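3 Developing a minimal sticky node scheduler plugin (out-of-tree)

Taking sticky-node scheduling of kubevirt VMs as an example, this section shows how to implement an out-of-tree scheduler plugin in a few hundred lines of code.

3.1 Design

3.1.1 Background

Some background [2,3]:

VirtualMachine is a virtual machine CRD;
one VirtualMachine corresponds to one VirtualMachineInstance, which is a running VirtualMachine;
one VirtualMachineInstance corresponds to one Pod;

If a failure occurs, the VirtualMachineInstance and the Pod may be recreated and rescheduled, but the VirtualMachine stays the same. The relation VirtualMachine <--> VirtualMachineInstance/Pod is similar to the relation StatefulSet <--> Pod.

3.1.2 Business requirement

Once a VM has been created and scheduled to a node, it must always be scheduled to that node afterwards, no matter what failure occurs (unless someone intervenes manually).

A possible scenario: the VM mounts a local disk of the host, so its data is lost if the host changes. In a failure, it does not matter that the machine or container is unavailable: the microservice system handles instance health checks and traffic removal itself. The underlying infrastructure only has to guarantee that the host does not change, so that the data is still there after recovery.

Technical description:

after a user creates a VirtualMachine, it is scheduled normally to a node and created there;
whatever happens afterwards (pod crash/eviction/recreate, node restart ...), this VirtualMachine must be scheduled to the same machine.

3.1.3 Technical solution

1. After a user creates a VirtualMachine, the default scheduler assigns it a node, and the node is then saved on the VirtualMachine CR;
2. If the VirtualMachineInstance or Pod is deleted or recreated, the scheduler first finds the corresponding VirtualMachine CR. If the CR holds a saved node, that node is used; otherwise (this must be the first scheduling) go to step 1.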

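3.2 Implementation

Implementing the above requires registering scheduler extension functions at three extension points:

PreFilter
Filter
PostBind

The code is based on k8s v1.28.

3.2.1 `PreFilter()`

This mainly does some checks and preparation:

If the Pod is not one of ours: return success right away and leave it to the other plugins;
If the Pod is ours, look up the associated VMI/VM CR. There are two cases:
1. Found a saved node: the VM has been scheduled before (perhaps the pod was deleted and is being rescheduled), so we parse out the original node for the later Filter() stage;
2. No saved node: this is the first scheduling, so we do nothing and let the default scheduler choose the initial node.
Save the pod and the node selected for it (empty if there is none) into a state context; this state is passed on to the later Filter() stage.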

// PreFilter invoked at the preFilter extension point.
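func (pl *StickyVM) PreFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod) (*framework.PreFilterResult, *framework.Status) {
    // Publish the state up front (as a pointer) so that Filter() and PostBind() can always read it,
    // even after the early returns below. Assumes *stickyState implements framework.StateData (a Clone() method).
    s := stickyState{false, ""}
    state.Write(stateKey, &s)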

    // Get pod owner reference
    podOwnerRef := getPodOwnerRef(pod)
    if podOwnerRef == nil {
        return nil, framework.NewStatus(framework.Success, "Pod owner ref not found, return")
    }

    // Get VMI
    vmiName := podOwnerRef.Name
    ns := pod.Namespace

    vmi, err := pl.kubevirtClient.VirtualMachineInstances(ns).Get(context.TODO(), vmiName, metav1.GetOptions{ResourceVersion: "0"})
    if err != nil {
        return nil, framework.NewStatus(framework.Error, "get vmi failed")
    }

    vmiOwnerRef := getVMIOwnerRef(vmi)
    if vmiOwnerRef == nil {
        return nil, framework.NewStatus(framework.Success, "VMI owner ref not found, return")
    }

    // Get VM
    vmName := vmiOwnerRef.Name
    vm, err := pl.kubevirtClient.VirtualMachines(ns).Get(context.TODO(), vmName, metav1.GetOptions{ResourceVersion: "0"})
    if err != nil {
        return nil, framework.NewStatus(framework.Error, "get vm failed")
    }

    // Read the sticky node (if any) from the VM annotations
    s.node, s.nodeExists = vm.Annotations[stickyAnnotationKey]
    return nil, framework.NewStatus(framework.Success, "Check pod/vmi/vm finish, return")
}

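3.2.2 `Filter()`

Based on the pod's nodeSelector and so on, the scheduler first selects some candidate nodes. It then iterates over these nodes and calls each plugin's Filter() method in turn to see whether the node is suitable. Pseudocode: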

// For a given pod
for node in selectedNodes:
    for pl in plugins:
        pl.Filter(ctx, customState, pod, node)

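Our plugin logic first parses the state/pod/node information passed in:

If the state holds a saved node,
1. if the saved node is the node that Filter() is currently given, return success;
2. for all other nodes, return failure.
The effect is that once this pod has been scheduled to a node, we keep scheduling it to that node, that is, **"sticky-node scheduling"**.
If the state holds no saved node, this is the first scheduling, so we also return success and the default scheduler assigns a node. In the later PostBind stage we save this node to the VM CR annotations.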

func (pl *StickyVM) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
    s, err := state.Read(stateKey)
    if err != nil {
        return framework.NewStatus(framework.Error, fmt.Sprintf("read preFilter state fail: %v", err))
    }

    r, ok := s.(*stickyState)
    if !ok {
        return framework.NewStatus(framework.Error, fmt.Sprintf("convert %+v to stickyState fail", s))
    }
    if !r.nodeExists {
        return nil
    }

    if r.node != nodeInfo.Node().Name {
        // returning "framework.Error" will prevent process on other nodes
        return framework.NewStatus(framework.Unschedulable, "already stick to another node")
    }

    return nil
}
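
The decision this Filter() makes can be checked in isolation with a dependency-free model. The sketch below reuses the `stickyState` field names from the code above; `stickyFilter` and `candidates` are hypothetical helpers that take node names as plain strings instead of `*framework.NodeInfo`.

```go
package main

import "fmt"

// stickyState mirrors the state that PreFilter hands to Filter.
type stickyState struct {
	nodeExists bool
	node       string
}

// stickyFilter reports whether the candidate node passes the filter:
// any node on first scheduling, only the saved node afterwards.
func stickyFilter(s stickyState, candidate string) bool {
	if !s.nodeExists {
		return true
	}
	return s.node == candidate
}

// candidates returns the nodes that survive the sticky filter.
func candidates(s stickyState, nodes []string) []string {
	var out []string
	for _, n := range nodes {
		if stickyFilter(s, n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3"}
	fmt.Println(candidates(stickyState{}, nodes))               // [node-1 node-2 node-3]
	fmt.Println(candidates(stickyState{true, "node-1"}, nodes)) // [node-1]
}
```

Note that if the saved node is not among the candidates (for example because it is down), no node passes and the pod stays unschedulable, which matches the business requirement of never changing hosts.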

3.2.3 `PostBind()`

Reaching this stage means a node has already been selected for the pod. We only need to check whether this node has already been saved to the VM CR, and save it if not.

func (pl *StickyVM) PostBind(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) {
    s, err := state.Read(stateKey)
    if err != nil {
        return
    }

    r, ok := s.(*stickyState)
    if !ok {
        klog.Errorf("PostBind: pod %s/%s: convert failed", pod.Namespace, pod.Name)
        return
    }

    if r.nodeExists {
        klog.Infof("PostBind: VM already has sticky annotation, return")
        return
    }

    // Get pod owner reference
    podOwnerRef := getPodOwnerRef(pod)
    if podOwnerRef == nil {
        return
    }

    // Get VMI owner reference
    vmiName := podOwnerRef.Name
    ns := pod.Namespace

    vmi, err := pl.kubevirtClient.VirtualMachineInstances(ns).Get(context.TODO(), vmiName, metav1.GetOptions{ResourceVersion: "0"})
    if err != nil {
        return
    }

    vmiOwnerRef := getVMIOwnerRef(vmi)
    if vmiOwnerRef == nil {
        return
    }

    // Add sticky node to VM annotations
    retry.RetryOnConflict(retry.DefaultRetry, func() error {
        vmName := vmiOwnerRef.Name
        vm, err := pl.kubevirtClient.VirtualMachines(ns).Get(context.TODO(), vmName, metav1.GetOptions{ResourceVersion: "0"})
        if err != nil {
            return err
        }

        if vm.Annotations == nil {
            vm.Annotations = make(map[string]string)
        }

        vm.Annotations[stickyAnnotationKey] = nodeName
        if _, err := pl.kubevirtClient.VirtualMachines(pod.Namespace).Update(ctx, vm, metav1.UpdateOptions{}); err != nil {
            return err
        }
        return nil
    })
}

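As mentioned earlier, this stage is informational: it cannot affect the scheduling decision, so it has no return value.

3.2.4 Other notes

That is the core code; add some initialization code and the necessary scaffolding and it compiles and runs. The complete code is available here (dependencies not included).

In real development, golang dependency issues can be troublesome and have to be resolved for your particular combination of k8s version, scheduler-plugins version, golang version, kubevirt version, and so on.

3.3 Deployment

Scheduling plugins differ from network CNI plugins: the latter are executables (binaries) that only need to be placed in a designated directory, while scheduling plugins are long-running services.

3.3.1 Configuration

Create a configuration for our StickyVM scheduler: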

$ cat ksc.yaml

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
clientConnection:
  kubeconfig: "/etc/kubernetes/scheduler.kubeconfig"
profiles:
- schedulerName: stickyvm
  plugins:
    preFilter:
      enabled:
      - name: StickyVM
      disabled:
      - name: NodeResourcesFit
    filter:
      enabled:
      - name: StickyVM
      disabled:
      - name: NodePorts
      # - name: "*"
    reserve:
      disabled:
      - name: "*"
    preBind:
      disabled:
      - name: "*"
    postBind:
      enabled:
      - name: StickyVM
      disabled:
      - name: "*"

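One ksc can describe multiple profiles, each of which behaves as an independent scheduler. This configuration is meant for kube-scheduler rather than kube-apiserver,
# content of the file passed to "--config"
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
so neither k api-resources nor k get KubeSchedulerConfiguration can find this resource.

A pod selects a profile by setting the corresponding schedulerName. If none is specified, default-scheduler is used.

3.3.2 Running

No configuration change to k8s is needed; the scheduler can be deployed and run as an ordinary pod (a suitable ClusterRole etc. must be created).

For convenience, it is started here directly from a development machine with the k8s cluster admin certificate, which suits fast iteration during development: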

$ ./bin/stickyvm-scheduler --leader-elect=false --config ksc.yaml
Creating StickyVM scheduling plugin
Creating kubevirt clientset
Create kubevirt clientset successful
Create StickyVM scheduling plugin successful
"Starting Kubernetes Scheduler" version="v0.0.20231122"
"Golang settings" GOGC="" GOMAXPROCS="" GOTRACEBACK=""
Serving securely on [::]:10259
"Starting DynamicServingCertificateController"

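3.4 Testing

All that is needed is to specify the scheduler name in the VM CR spec.

3.4.1 Creating a VM for the first time

The workflow when a new VM is created:

the yaml specifies schedulerName: stickyvm,
the k8s default scheduling logic automatically picks a node,
StickyVM follows the ownerrefs to get the vmi/vm in turn, then adds this node to the VM annotations in the postbind hook;

Logs: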

Prefilter: start
Prefilter: processing pod default/virt-launcher-kubevirt-smoke-fedora-nd4hp
PreFilter: parent is VirtualMachineInstance kubevirt-smoke-fedora
PreFilter: found corresponding VMI
PreFilter: found corresponding VM
PreFilter: VM has no sticky node, skip to write to scheduling context
Prefilter: finish
Filter: start
Filter: pod default/virt-launcher-kubevirt-smoke-fedora-nd4hp, sticky node not exist, got node-1, return success
PostBind: start: pod default/virt-launcher-kubevirt-smoke-fedora-nd4hp
PostBind: annotating selected node node-1 to VM
PostBind: parent is VirtualMachineInstance kubevirt-smoke-fedora
PostBind: found corresponding VMI
PostBind: found corresponding VM
PostBind: annotating node node-1 to VM: kubevirt-smoke-fedora

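3.4.2 Rescheduling after deleting the VMI/Pod

When the vmi or pod is deleted, the StickyVM plugin reads the node from the annotation in the prefilter stage and then checks it in the filter stage, returning success only for that node. This achieves sticky-node scheduling: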

Prefilter: start
Prefilter: processing pod default/virt-launcher-kubevirt-smoke-fedora-m8f7v
PreFilter: parent is VirtualMachineInstance kubevirt-smoke-fedora
PreFilter: found corresponding VMI
PreFilter: found corresponding VM
PreFilter: VM already sticky to node node-1, write to scheduling context
Prefilter: finish
Filter: start
Filter: default/virt-launcher-kubevirt-smoke-fedora-m8f7v, already stick to node-1, skip node-2
Filter: start
Filter: default/virt-launcher-kubevirt-smoke-fedora-m8f7v, given node is sticky node node-1, return success
Filter: finish
Filter: start
Filter: default/virt-launcher-kubevirt-smoke-fedora-m8f7v, already stick to node-1, skip node-3
PostBind: start: pod default/virt-launcher-kubevirt-smoke-fedora-m8f7v
PostBind: VM already has sticky annotation, return

At this point the VM already has the annotation, so the postbind stage does not need to do anything.