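Normally, when a Pod fails to be scheduled, it stays in the Pending state until the Pod is updated or the cluster state changes; only then does the scheduler try to schedule it again. PriorityClass can be used to change this behavior. When a Pod that has been given a high priority fails to schedule, the scheduler's preemption logic is triggered: the scheduler tries to find a node in the cluster such that, once one or more lower-priority Pods on that node are deleted, the pending high-priority Pod can be scheduled onto it.
When a high-priority Pod preempts, the scheduler sets the Pod's nominatedNodeName field to the name of the node being preempted. The Pod is not bound right away; whether it actually runs on that node is decided in the next scheduling cycle. While the Pod is waiting, if another Pod with an even higher priority also wants to preempt this node, the scheduler clears the original preemptor's nominatedNodeName field so that the higher-priority preemptor can go ahead.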
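The nomination rule described above can be sketched as a small toy model. This is only an illustration under stated assumptions: the `pod` struct and the `nominate` helper are hypothetical and are not part of kube-scheduler, which keeps this state in `status.nominatedNodeName` and its scheduling queue.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for v1.Pod.
type pod struct {
	name              string
	priority          int32
	nominatedNodeName string
}

// nominate records that preemptor intends to run on node. Any lower-priority
// pod that was nominated to the same node loses its nomination, which mirrors
// the "higher priority preemptor wins" rule from the text.
func nominate(preemptor *pod, node string, pending []*pod) {
	for _, p := range pending {
		if p != preemptor && p.nominatedNodeName == node && p.priority < preemptor.priority {
			p.nominatedNodeName = ""
		}
	}
	preemptor.nominatedNodeName = node
}

func main() {
	a := &pod{name: "a", priority: 100}
	b := &pod{name: "b", priority: 1000}
	pending := []*pod{a, b}

	nominate(a, "node-1", pending) // a preempts first
	nominate(b, "node-1", pending) // b has higher priority and takes over
	fmt.Printf("a=%q b=%q\n", a.nominatedNodeName, b.nominatedNodeName)
}
```

After the second call, pod `a` no longer has a nominated node and has to go through scheduling again, while `b` holds the nomination for `node-1`.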
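1. The scheduleOne function
scheduleOne schedules one pod per call. It takes a pod from the scheduler's active queue (activeQ) and calls sched.Algorithm.Schedule to pick a node for it. This article does not cover the case where scheduling succeeds.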
func (sched *Scheduler) scheduleOne(ctx context.Context) {
	// ...
	scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, fwk, state, pod)
	if err != nil {
		// Schedule() may have failed because the pod would not fit on any host, so we try to
		// preempt, with the expectation that the next time the pod is tried for scheduling it
		// will fit due to the preemption. It is also possible that a different pod will schedule
		// into the resources that were preempted, but this is harmless.
		nominatedNode := ""
		if fitError, ok := err.(*framework.FitError); ok {
			if !fwk.HasPostFilterPlugins() {
				klog.V(3).InfoS("No PostFilter plugins are registered, so no preemption will be performed")
			} else {
				// Run PostFilter plugins to try to make the pod schedulable in a future scheduling cycle.
				result, status := fwk.RunPostFilterPlugins(ctx, state, pod, fitError.Diagnosis.NodeToStatusMap)
				if status.Code() == framework.Error {
					klog.ErrorS(nil, "Status after running PostFilter plugins for pod", klog.KObj(pod), "status", status)
				} else {
					klog.V(5).InfoS("Status after running PostFilter plugins for pod", "pod", klog.KObj(pod), "status", status)
				}
				if status.IsSuccess() && result != nil {
					nominatedNode = result.NominatedNodeName
				}
			}
			// ...
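The failure path above boils down to three checks: is the error a "does not fit" error, are any PostFilter plugins registered, and did they succeed. A minimal standalone sketch follows; `fitError`, `postFilter` and `handleScheduleFailure` are hypothetical names for illustration only and do not exist in the scheduler.

```go
package main

import (
	"errors"
	"fmt"
)

// fitError is a hypothetical stand-in for framework.FitError.
type fitError struct{ reason string }

func (e *fitError) Error() string { return e.reason }

// postFilter is a hypothetical hook playing the role of the PostFilter
// plugins. It returns the nominated node and whether preemption succeeded.
type postFilter func(podName string) (string, bool)

// handleScheduleFailure returns the nominated node name, or "" if preemption
// was not attempted or did not succeed.
func handleScheduleFailure(err error, podName string, pf postFilter) string {
	var fe *fitError
	if !errors.As(err, &fe) {
		return "" // not a "pod does not fit" error, so preemption cannot help
	}
	if pf == nil {
		return "" // no PostFilter plugins registered
	}
	if node, ok := pf(podName); ok {
		return node
	}
	return ""
}

func main() {
	pf := func(string) (string, bool) { return "node-1", true }
	fmt.Println(handleScheduleFailure(&fitError{"0/3 nodes are available"}, "web", pf))
	fmt.Println(handleScheduleFailure(errors.New("api timeout"), "web", pf) == "")
}
```

The sketch uses errors.As instead of the direct type assertion in the excerpt; for an unwrapped error the two behave the same.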
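1.1 The RunPostFilterPlugins function
RunPostFilterPlugins calls RunPostFilterPlugin, which in turn calls PostFilter, a method of the PostFilterPlugin interface. These plugins are invoked after a Pod fails to schedule. PostFilter then calls the key function, preempt.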
2. The preempt function
preempt finds nodes with pods that can be preempted to make room for the pending pod. It picks one of those nodes, preempts the pods on it, and returns:
1) the node name which is picked up for preemption
2) any possible error
func (pl *DefaultPreemption) preempt(ctx context.Context, state *framework.CycleState, pod *v1.Pod, m framework.NodeToStatusMap) (string, *framework.Status) {
	cs := pl.fh.ClientSet()
	nodeLister := pl.fh.SnapshotSharedLister().NodeInfos()
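2.1 PodEligibleToPreemptOthers: make sure the preemptor is eligible to preempt other Pods
PodEligibleToPreemptOthers determines whether this Pod should be considered for preempting other Pods. If this Pod has already preempted other Pods and those Pods are still in their graceful termination period, it should not be considered for preemption again.
The function looks at the node nominated for this Pod; as long as there are terminating Pods on that node, the Pod is not considered for preempting more Pods.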
	// 1) Ensure the preemptor is eligible to preempt other pods.
	if !PodEligibleToPreemptOthers(pod, nodeLister, m[pod.Status.NominatedNodeName]) {
		klog.V(5).InfoS("Pod is not eligible for more preemption", "pod", klog.KObj(pod))
		return "", nil
	}
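The eligibility rule can be illustrated with a toy version. This is a simplified sketch, not the real function: the `pod` struct and `eligibleToPreempt` are hypothetical, and the sketch assumes that only terminating pods of lower priority than the preemptor block a new round of preemption.

```go
package main

import "fmt"

// pod is a hypothetical, simplified stand-in for v1.Pod.
type pod struct {
	name          string
	priority      int32
	terminating   bool   // stands in for a non-nil DeletionTimestamp
	nominatedNode string // stands in for status.nominatedNodeName
}

// eligibleToPreempt reports whether p may trigger another round of
// preemption. podsByNode maps a node name to the pods currently on it.
func eligibleToPreempt(p pod, podsByNode map[string][]pod) bool {
	if p.nominatedNode == "" {
		return true // p has not preempted anything yet
	}
	for _, other := range podsByNode[p.nominatedNode] {
		if other.terminating && other.priority < p.priority {
			return false // earlier victims are still shutting down
		}
	}
	return true
}

func main() {
	nodes := map[string][]pod{
		"node-1": {{name: "old", priority: 10, terminating: true}},
	}
	waiting := pod{name: "web", priority: 500, nominatedNode: "node-1"}
	fresh := pod{name: "api", priority: 500}
	fmt.Println(eligibleToPreempt(waiting, nodes), eligibleToPreempt(fresh, nodes))
}
```

The point of the rule is to avoid evicting a second batch of victims while the first batch is still releasing resources.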
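2.2 FindCandidates: find all preemption candidates
FindCandidates first fetches the node list and calls nodesWherePreemptionMightHelp to find the nodes that failed the predicates (filter) phase but on which preemption might make scheduling succeed. Not every node can be made schedulable through preemption.
getPodDisruptionBudgets returns all PDBs.
dryRunPreemption simulates the preemption logic on <potentialNodes> in parallel and returns the preemption candidates together with a map indicating the status of the filtered-out nodes. The number of candidates depends on the constraints defined in the plugin's args.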
	// 2) Find all preemption candidates.
	candidates, nodeToStatusMap, status := pl.FindCandidates(ctx, state, pod, m)
	if !status.IsSuccess() {
		return "", status
	}
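To make the "preemption might help" filter concrete, here is a small sketch. It assumes that the filter status of each node is either a plain unschedulable code or an "unschedulable and unresolvable" code, and that only the latter rules a node out; the `code` type and the lowercase constants are hypothetical simplifications of the framework's status codes.

```go
package main

import "fmt"

// code is a hypothetical, reduced version of the framework's status codes.
type code int

const (
	// unschedulable: the pod does not fit now, but removing pods might fix it.
	unschedulable code = iota
	// unschedulableAndUnresolvable: removing pods cannot fix the failure
	// (for example, a node selector that does not match).
	unschedulableAndUnresolvable
)

// nodesWherePreemptionMightHelp keeps every node whose filter failure could
// in principle be resolved by evicting pods.
func nodesWherePreemptionMightHelp(nodes []string, status map[string]code) []string {
	var potential []string
	for _, n := range nodes {
		if c, ok := status[n]; ok && c == unschedulableAndUnresolvable {
			continue
		}
		potential = append(potential, n)
	}
	return potential
}

func main() {
	status := map[string]code{
		"node-1": unschedulable,
		"node-2": unschedulableAndUnresolvable,
	}
	fmt.Println(nodesWherePreemptionMightHelp([]string{"node-1", "node-2", "node-3"}, status))
}
```

Only the nodes that survive this filter are worth simulating in the dry run, which keeps the expensive part of preemption small.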
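2.2.1 The dryRunPreemption function
dryRunPreemption calls selectVictimsOnNode to find the pods on a node that would be preempted, that is, the victims. selectVictimsOnNode finds the minimum set of pods on the given node that should be preempted in order to make enough room for "pod" to be scheduled.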
func selectVictimsOnNode(
	ctx context.Context,
	fh framework.Handle,
	state *framework.CycleState,
	pod *v1.Pod,
	nodeInfo *framework.NodeInfo,
	pdbs []*policy.PodDisruptionBudget,
) ([]*v1.Pod, int, *framework.Status) {
	var potentialVictims []*framework.PodInfo
	removePod := func(rpi *framework.PodInfo) error {
		// ...
	}
	addPod := func(api *framework.PodInfo) error {
		// ...
	}
2.2.1.1 The algorithm first checks whether the pod can be scheduled on the node once all lower-priority pods are gone.
	// As the first step, remove all the lower priority pods from the node and
	// check if the given pod can be scheduled.
	podPriority := corev1helpers.PodPriority(pod)
	for _, pi := range nodeInfo.Pods {
		if corev1helpers.PodPriority(pi.Pod) < podPriority {
			potentialVictims = append(potentialVictims, pi)
			if err := removePod(pi); err != nil {
				return nil, 0, framework.AsStatus(err)
			}
		}
	}
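2.2.1.2 The lower-priority pods in potentialVictims are sorted by priority and then split into two groups: pods whose preemption would violate a PodDisruptionBudget, and pods whose preemption would not.
Both groups are sorted by priority. The algorithm first tries to reprieve as many PDB-violating pods as possible, then does the same for the non-violating pods, checking each time that "pod" still fits on the node.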
	violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims, pdbs)
	reprievePod := func(pi *framework.PodInfo) (bool, error) {
		if err := addPod(pi); err != nil {
			return false, err
		}
		status := fh.RunFilterPluginsWithNominatedPods(ctx, state, pod, nodeInfo)
		fits := status.IsSuccess()
		if !fits {
			if err := removePod(pi); err != nil {
				return false, err
			}
			rpi := pi.Pod
			victims = append(victims, rpi)
			klog.V(5).InfoS("Pod is a potential preemption victim on node", "pod", klog.KObj(rpi), "node", klog.KObj(nodeInfo.Node()))
		}
		return fits, nil
	}
	for _, p := range violatingVictims {
		if fits, err := reprievePod(p); err != nil {
			return nil, 0, framework.AsStatus(err)
		} else if !fits {
			numViolatingVictim++
		}
	}
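According to the official documentation, PodDisruptionBudget is supported during preemption, but respecting it is best effort and not guaranteed. Pods that cannot be reprieved are added to the victims list, and the number of PDB-violating victims is recorded in numViolatingVictim.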
	// Now we try to reprieve non-violating victims.
	for _, p := range nonViolatingVictims {
		if _, err := reprievePod(p); err != nil {
			return nil, 0, framework.AsStatus(err)
		}
	}
	return victims, numViolatingVictim, framework.NewStatus(framework.Success)
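The remove-then-reprieve idea is easier to follow on a toy node that tracks a single resource. The sketch below is an assumption-laden simplification: `pod`, `selectVictims`, the CPU-only fit check and the `pdbViolation` flag are all hypothetical, whereas the real function re-runs the filter plugins for every reprieve attempt.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical, simplified stand-in for framework.PodInfo.
type pod struct {
	name         string
	priority     int32
	cpu          int64 // requested millicores
	pdbViolation bool  // true if evicting this pod would violate a PDB
}

// selectVictims returns the victims, the number of PDB-violating victims,
// and whether preemption on this node can make room for the preemptor.
func selectVictims(preemptor pod, capacity int64, running []pod) ([]pod, int, bool) {
	var used int64
	var potential []pod
	for _, p := range running {
		if p.priority < preemptor.priority {
			potential = append(potential, p) // "removed" from the node
		} else {
			used += p.cpu
		}
	}
	if len(potential) == 0 || used+preemptor.cpu > capacity {
		return nil, 0, false // nothing to evict, or eviction cannot help
	}
	// Try to keep the most important pods first.
	sort.SliceStable(potential, func(i, j int) bool {
		return potential[i].priority > potential[j].priority
	})
	var victims []pod
	reprieve := func(p pod) bool {
		if used+p.cpu+preemptor.cpu <= capacity {
			used += p.cpu // p can stay on the node
			return true
		}
		victims = append(victims, p)
		return false
	}
	numViolating := 0
	for _, p := range potential { // PDB-violating pods get the first chance
		if p.pdbViolation && !reprieve(p) {
			numViolating++
		}
	}
	for _, p := range potential {
		if !p.pdbViolation {
			reprieve(p)
		}
	}
	return victims, numViolating, true
}

func main() {
	running := []pod{
		{name: "db", priority: 1000, cpu: 1000},
		{name: "cache", priority: 50, cpu: 1000, pdbViolation: true},
		{name: "batch-1", priority: 10, cpu: 1000},
		{name: "batch-2", priority: 5, cpu: 1000},
	}
	victims, n, ok := selectVictims(pod{name: "web", priority: 500, cpu: 1500}, 4000, running)
	fmt.Println(victims, n, ok)
}
```

In this example the PDB-protected `cache` pod is reprieved first, so both batch pods end up as victims and no PDB is violated.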
2.3 CallExtenders
CallExtenders calls the given <extenders> to select the list of feasible candidates. Only extenders that support preemption are used to check <candidates>.
	// 3) Interact with registered Extenders to filter out some candidates if needed.
	candidates, status = CallExtenders(pl.fh.Extenders(), pod, nodeLister, candidates)
	if !status.IsSuccess() {
		return "", status
	}
2.4 SelectCandidate: choose the best candidate
	// 4) Find the best candidate.
	bestCandidate := SelectCandidate(candidates)
	if bestCandidate == nil || len(bestCandidate.Name()) == 0 {
		return "", nil
	}
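candidatesToVictimsMap converts the candidates into a map keyed by node name, with each node's victims as the value.
2.4.1 pickOneNodeForPreemption
pickOneNodeForPreemption picks one node from the given nodes. The pods in each map entry are assumed to be sorted by decreasing priority. A node is chosen according to the following criteria: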
// 1. A node with minimum number of PDB violations.
// 2. A node with minimum highest priority victim is picked.
// 3. Ties are broken by sum of priorities of all victims.
// 4. If there are still ties, node with the minimum number of victims is picked.
// 5. If there are still ties, node with the latest start time of all highest priority victims is picked.
// 6. If there are still ties, the first such node is picked (sort of randomly).
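The tie-breaking chain can be read as a series of "keep only the minimum" filters. The following sketch applies criteria 1 to 4 and then falls back to criterion 6; criterion 5 (start times) is left out for brevity. The `candidate` struct and all helper names are hypothetical, and the real implementation differs in detail.

```go
package main

import (
	"fmt"
	"math"
)

// candidate is a hypothetical, simplified view of one node and its victims.
type candidate struct {
	node             string
	victimPriorities []int32 // assumed sorted by decreasing priority
	pdbViolations    int
}

// keepMin returns the candidates that share the lowest score.
func keepMin(cands []candidate, score func(candidate) int64) []candidate {
	best := int64(math.MaxInt64)
	var out []candidate
	for _, c := range cands {
		s := score(c)
		if s < best {
			best, out = s, nil // new minimum: drop everything kept so far
		}
		if s == best {
			out = append(out, c)
		}
	}
	return out
}

// highest returns the priority of the most important victim on the node.
func highest(c candidate) int64 {
	if len(c.victimPriorities) == 0 {
		return math.MinInt64 // no victims at all is the best possible case
	}
	return int64(c.victimPriorities[0])
}

// sum returns the sum of all victim priorities on the node.
func sum(c candidate) int64 {
	var total int64
	for _, p := range c.victimPriorities {
		total += int64(p)
	}
	return total
}

// pickOneNode applies the criteria in order and returns the first survivor.
func pickOneNode(cands []candidate) string {
	if len(cands) == 0 {
		return ""
	}
	rules := []func(candidate) int64{
		func(c candidate) int64 { return int64(c.pdbViolations) },         // criterion 1
		highest,                                                           // criterion 2
		sum,                                                               // criterion 3
		func(c candidate) int64 { return int64(len(c.victimPriorities)) }, // criterion 4
	}
	for _, rule := range rules {
		cands = keepMin(cands, rule)
	}
	return cands[0].node // criterion 6: first remaining node
}

func main() {
	fmt.Println(pickOneNode([]candidate{
		{node: "n1", victimPriorities: []int32{10}, pdbViolations: 1},
		{node: "n2", victimPriorities: []int32{100, 5}},
		{node: "n3", victimPriorities: []int32{50, 50}},
		{node: "n4", victimPriorities: []int32{50, 20, 10}},
	}))
}
```

Here `n1` drops out on PDB violations, `n2` on its highest-priority victim, and `n4` beats `n3` on the priority sum (80 against 100).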
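2.5 PrepareCandidate
PrepareCandidate does some preparation work before the selected candidate is nominated.
It evicts the victim pods; if a victim is in the waitingPod map, it is rejected instead of deleted.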
func PrepareCandidate(c Candidate, fh framework.Handle, cs kubernetes.Interface, pod *v1.Pod, pluginName string) *framework.Status {
	for _, victim := range c.Victims().Pods {
		// If the victim is a WaitingPod, send a reject message to the PermitPlugin.
		// Otherwise we should delete the victim.
		if waitingPod := fh.GetWaitingPod(victim.UID); waitingPod != nil {
			waitingPod.Reject(pluginName, "preempted")
		} else if err := util.DeletePod(cs, victim); err != nil {
			klog.ErrorS(err, "Preempting pod", "pod", klog.KObj(victim), "preemptor", klog.KObj(pod))
			return framework.AsStatus(err)
		}
		fh.EventRecorder().Eventf(victim, pod, v1.EventTypeNormal, "Preempted", "Preempting", "Preempted by %v/%v on node %v",
			pod.Namespace, pod.Name, c.Name())
	}
2.5.1 Clear the nominated-node information of lower-priority pods
	// Lower priority pods nominated to run on this node, may no longer fit on
	// this node. So, we should remove their nomination. Removing their
	// nomination updates these pods and moves them to the active queue. It
	// lets scheduler find another place for them.
	nominatedPods := getLowerPriorityNominatedPods(fh, pod, c.Name())
	if err := util.ClearNominatedNodeName(cs, nominatedPods...); err != nil {
		klog.ErrorS(err, "cannot clear 'NominatedNodeName' field")
		// We do not return as this error is not critical.
	}
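The victim handling in PrepareCandidate comes down to one decision per victim: a pod that is still waiting at the Permit stage is rejected, and any other pod is deleted. The sketch below only plans these actions; `action` and `planEvictions` are hypothetical names, and no API calls are made.

```go
package main

import "fmt"

// action is a hypothetical record of what would be done to one victim.
type action struct {
	pod  string
	verb string // "reject" or "delete"
}

// planEvictions decides, per victim, whether it would be rejected (it is
// still waiting and has not been bound yet) or deleted (it is running).
func planEvictions(victims []string, waiting map[string]bool) []action {
	var plan []action
	for _, v := range victims {
		if waiting[v] {
			plan = append(plan, action{pod: v, verb: "reject"})
		} else {
			plan = append(plan, action{pod: v, verb: "delete"})
		}
	}
	return plan
}

func main() {
	waiting := map[string]bool{"batch-2": true}
	for _, a := range planEvictions([]string{"batch-1", "batch-2"}, waiting) {
		fmt.Printf("%s %s\n", a.verb, a.pod)
	}
}
```

Rejecting a waiting pod is cheaper than deleting a running one, because the waiting pod has not been bound to the node yet.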