Understanding Kubernetes in Depth: A Source-Code Walkthrough of kube-scheduler

mujingluo

Published on 2024-04-23 22:10:15

Tags: kubernetes, containers, cloud native

Article link: https://blog.csdn.net/mujingluo/article/details/138136947

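The Kubernetes scheduler (kube-scheduler) is a critical component of the system: it assigns Pods that are waiting to be scheduled to suitable nodes. This article walks through the kube-scheduler source code to show how it works internally.

Core functions of kube-scheduler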

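The core functions of kube-scheduler are:

Watching Pod changes: it watches all unscheduled Pods through the Kubernetes API.

Filtering: it applies a set of rules (historically called Predicates, now implemented as Filter plugins) to find the nodes on which the Pod can run.

Scoring: it scores the nodes that passed filtering to determine the best placement.

Binding: it binds the Pod to the selected node.

Persisting the scheduling decision: the decision is written to the Kubernetes API as part of binding.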

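Main components and data structures

SchedulingQueue

SchedulingQueue is the data structure that holds Pods waiting to be scheduled. It is typically implemented as a priority queue.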

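// Simplified illustration, not the upstream definition: the upstream
// interface has more methods and works on queued-Pod wrappers rather than bare Pods.
type SchedulingQueue interface {
    // AddUnschedulablePod puts a Pod that could not be scheduled back into the queue.
    AddUnschedulablePod(pod *v1.Pod)
    // Pop removes and returns the next Pod to schedule.
    Pop() (*v1.Pod, error)
    // Len returns the number of Pods in the queue.
    Len() int
}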
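To make the priority-queue idea concrete, here is a minimal sketch built on Go's container/heap. The types queuedPod and PodQueue are hypothetical names invented for this example and are not kube-scheduler types; the ordering rule (higher priority first, then insertion order) is an assumption chosen for illustration.

```go
package main

import (
	"container/heap"
	"fmt"
)

// queuedPod is a hypothetical stand-in for a pending Pod.
type queuedPod struct {
	name     string
	priority int32
	seq      int // insertion order, used to break priority ties FIFO
}

// podHeap implements heap.Interface: higher priority first, then older first.
type podHeap []*queuedPod

func (h podHeap) Len() int { return len(h) }
func (h podHeap) Less(i, j int) bool {
	if h[i].priority != h[j].priority {
		return h[i].priority > h[j].priority
	}
	return h[i].seq < h[j].seq
}
func (h podHeap) Swap(i, j int)        { h[i], h[j] = h[j], h[i] }
func (h *podHeap) Push(x interface{}) { *h = append(*h, x.(*queuedPod)) }
func (h *podHeap) Pop() interface{} {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// PodQueue wraps the heap with a small Add/Pop/Len API.
type PodQueue struct {
	items podHeap
	next  int
}

// Add enqueues a pod with the given priority.
func (q *PodQueue) Add(name string, priority int32) {
	heap.Push(&q.items, &queuedPod{name: name, priority: priority, seq: q.next})
	q.next++
}

// Pop returns the name of the highest-priority pod, or false if the queue is empty.
func (q *PodQueue) Pop() (string, bool) {
	if q.items.Len() == 0 {
		return "", false
	}
	return heap.Pop(&q.items).(*queuedPod).name, true
}

// Len returns the number of queued pods.
func (q *PodQueue) Len() int { return q.items.Len() }

func main() {
	q := &PodQueue{}
	q.Add("batch-job", 0)
	q.Add("critical-api", 1000)
	q.Add("web", 100)
	for q.Len() > 0 {
		name, _ := q.Pop()
		fmt.Println(name) // critical-api, then web, then batch-job
	}
}
```

A heap keeps both Add and Pop at O(log n), which matters when many Pods are pending at once.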

Framework

Framework is a collection of plugins, including filter plugins, score plugins, and others.

// Framework defines the interfaces that must be implemented by the components interested in participating in the scheduling process.
// This listing is a simplified illustration, not the upstream definition; upstream exposes
// per-extension-point methods such as RunFilterPlugins, RunScorePlugins, and RunBindPlugins.
type Framework interface {
    // HandlePods registers the given Pods with the framework.
    HandlePods(pods []*v1.Pod) error
    // HandlePodsWithoutBind has the same effect as HandlePods, but won't update any Pod's nodeName.
    HandlePodsWithoutBind(pods []*v1.Pod) error
    // UnhandlePods removes the given Pods from the framework.
    UnhandlePods(pods []*v1.Pod) error

    // List lists all nodes known to the framework.
    List() ([]*v1.Node, error)

    // Run starts the framework's filtering, scoring, and binding plugins, if any, until stopCh is closed.
    Run(stopCh <-chan struct{})

    // Score returns the scores the given nodes get for a pod according to the framework's score plugins.
    Score(ctx context.Context, cycleState *CycleState, pod *v1.Pod, nodes []*v1.Node) (framework.NodeScoreList, *framework.Status)

    // Filter filters the given nodes according to the framework's filter plugins.
    Filter(ctx context.Context, cycleState *CycleState, pod *v1.Pod, nodes []*v1.Node) (framework.NodeToStatusMap, *framework.Status)

    // PreprocessRegister registers the predicate and priority functions and returns the preprocessor.
    PreprocessRegister() framework.PreprocessRegister

    // Bind binds a pod to a node.
    Bind(binding *framework.Binding) *framework.Status
}
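The plugin idea can be sketched independently of the real scheduler. The example below is a minimal sketch under stated assumptions: Node, Pod, FilterPlugin, Framework, and the two plugins fitsCPU and matchesSelector are hypothetical types written for this illustration, not the kube-scheduler API. It shows a framework running its filter plugins in order against a single node and stopping at the first rejection.

```go
package main

import "fmt"

// Node and Pod are hypothetical, heavily simplified types.
type Node struct {
	Name    string
	FreeCPU int64 // millicores still available
	Labels  map[string]string
}

type Pod struct {
	Name         string
	CPURequest   int64 // millicores requested
	NodeSelector map[string]string
}

// FilterPlugin reports whether pod may run on node, with a reason on rejection.
type FilterPlugin interface {
	Name() string
	Filter(pod Pod, node Node) (ok bool, reason string)
}

// fitsCPU rejects nodes that lack enough free CPU.
type fitsCPU struct{}

func (fitsCPU) Name() string { return "FitsCPU" }
func (fitsCPU) Filter(pod Pod, node Node) (bool, string) {
	if pod.CPURequest > node.FreeCPU {
		return false, "insufficient cpu"
	}
	return true, ""
}

// matchesSelector rejects nodes whose labels do not satisfy the pod's selector.
type matchesSelector struct{}

func (matchesSelector) Name() string { return "MatchesSelector" }
func (matchesSelector) Filter(pod Pod, node Node) (bool, string) {
	for k, v := range pod.NodeSelector {
		if node.Labels[k] != v {
			return false, "node selector mismatch"
		}
	}
	return true, ""
}

// Framework holds an ordered list of filter plugins.
type Framework struct {
	filters []FilterPlugin
}

// RunFilters runs the plugins in order on one node and stops at the first rejection.
func (f *Framework) RunFilters(pod Pod, node Node) (bool, string) {
	for _, pl := range f.filters {
		if ok, reason := pl.Filter(pod, node); !ok {
			return false, pl.Name() + ": " + reason
		}
	}
	return true, ""
}

func main() {
	fw := &Framework{filters: []FilterPlugin{fitsCPU{}, matchesSelector{}}}
	pod := Pod{Name: "web", CPURequest: 500, NodeSelector: map[string]string{"disk": "ssd"}}
	nodes := []Node{
		{Name: "n1", FreeCPU: 200, Labels: map[string]string{"disk": "ssd"}},
		{Name: "n2", FreeCPU: 2000, Labels: map[string]string{"disk": "hdd"}},
		{Name: "n3", FreeCPU: 2000, Labels: map[string]string{"disk": "ssd"}},
	}
	for _, n := range nodes {
		ok, reason := fw.RunFilters(pod, n)
		fmt.Println(n.Name, ok, reason)
	}
}
```

Keeping each rule behind a small interface is what lets rules be added, removed, or reordered without touching the scheduling loop itself.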

SchedulerCache

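SchedulerCache caches the state of nodes and Pods so that scheduling decisions can be made from local data instead of querying the API server each time.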
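One reason such a cache helps is that it can record a placement optimistically before binding has finished, so the next Pod already sees the reduced capacity. The sketch below illustrates that idea only. The method names AssumePod and ForgetPod echo names used by the upstream cache, but the Cache type, its fields, and these signatures are hypothetical simplifications that track CPU alone.

```go
package main

import (
	"fmt"
	"sync"
)

// assumedPod records where a pod was optimistically placed (hypothetical type).
type assumedPod struct {
	node string
	cpu  int64
}

// Cache is a toy scheduler cache that tracks requested CPU per node.
type Cache struct {
	mu        sync.Mutex
	capacity  map[string]int64      // node name -> allocatable millicores
	requested map[string]int64      // node name -> millicores already claimed
	assumed   map[string]assumedPod // pod name -> optimistic placement
}

// NewCache builds a cache from a node-capacity map.
func NewCache(capacity map[string]int64) *Cache {
	return &Cache{
		capacity:  capacity,
		requested: make(map[string]int64),
		assumed:   make(map[string]assumedPod),
	}
}

// AssumePod records the placement before binding has completed.
func (c *Cache) AssumePod(pod, node string, cpu int64) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.assumed[pod]; ok {
		return fmt.Errorf("pod %s is already assumed", pod)
	}
	if _, ok := c.capacity[node]; !ok {
		return fmt.Errorf("unknown node %s", node)
	}
	c.requested[node] += cpu
	c.assumed[pod] = assumedPod{node: node, cpu: cpu}
	return nil
}

// ForgetPod rolls back an assumed placement, for example when binding fails.
func (c *Cache) ForgetPod(pod string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if a, ok := c.assumed[pod]; ok {
		c.requested[a.node] -= a.cpu
		delete(c.assumed, pod)
	}
}

// FreeCPU returns the millicores still unclaimed on a node.
func (c *Cache) FreeCPU(node string) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.capacity[node] - c.requested[node]
}

func main() {
	c := NewCache(map[string]int64{"n1": 4000})
	if err := c.AssumePod("web", "n1", 1500); err != nil {
		fmt.Println("assume failed:", err)
		return
	}
	fmt.Println(c.FreeCPU("n1")) // 2500
	c.ForgetPod("web")           // simulate a failed bind
	fmt.Println(c.FreeCPU("n1")) // 4000
}
```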

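Scheduling flow

Scheduling a Pod

Watching Pod changes: kube-scheduler watches the API Server and picks up every Pod that is not yet bound to a node.
Filtering nodes: filter plugins select the nodes that satisfy the Pod's resource requests and placement rules.
Scoring nodes: score plugins compute a score for each remaining node to assess how suitable it is as a host for the Pod.
Selecting a node: the node with the highest score is chosen.
Binding the Pod: the Pod is bound to the selected node.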
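The filter, score, and select steps above can be sketched end to end in a few lines. This is a minimal sketch under stated assumptions: Node, Pod, fits, score, and schedule are hypothetical names, only CPU and memory are considered, and the scoring rule (prefer the node with the largest free fraction after placement) is one simple heuristic chosen for illustration rather than the scheduler's actual set of score plugins.

```go
package main

import (
	"errors"
	"fmt"
)

// Node and Pod are hypothetical simplified types (CPU in millicores, memory in MiB).
type Node struct {
	Name            string
	CapCPU, UsedCPU int64
	CapMem, UsedMem int64
}

type Pod struct {
	Name     string
	CPU, Mem int64
}

// ErrNoFit is returned when no node passes filtering.
var ErrNoFit = errors.New("no feasible node")

// fits is the filter step: the node must have room for the pod's requests.
func fits(pod Pod, n Node) bool {
	return n.CapCPU-n.UsedCPU >= pod.CPU && n.CapMem-n.UsedMem >= pod.Mem
}

// freePercent returns the percentage of capacity left after placing the request.
func freePercent(capacity, used, request int64) int64 {
	if capacity <= 0 {
		return 0
	}
	return (capacity - used - request) * 100 / capacity
}

// score is the scoring step: the average free percentage of CPU and memory.
func score(pod Pod, n Node) int64 {
	return (freePercent(n.CapCPU, n.UsedCPU, pod.CPU) + freePercent(n.CapMem, n.UsedMem, pod.Mem)) / 2
}

// schedule filters the nodes, scores the survivors, and returns the best node's name.
func schedule(pod Pod, nodes []Node) (string, error) {
	var feasible []Node
	for _, n := range nodes {
		if fits(pod, n) {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return "", ErrNoFit
	}
	best, bestScore := feasible[0], score(pod, feasible[0])
	for _, n := range feasible[1:] {
		if s := score(pod, n); s > bestScore {
			best, bestScore = n, s
		}
	}
	return best.Name, nil
}

func main() {
	nodes := []Node{
		{Name: "n1", CapCPU: 4000, UsedCPU: 3000, CapMem: 8192, UsedMem: 1024},
		{Name: "n2", CapCPU: 4000, UsedCPU: 1000, CapMem: 8192, UsedMem: 4096},
	}
	name, err := schedule(Pod{Name: "web", CPU: 500, Mem: 1024}, nodes)
	fmt.Println(name, err) // n2 <nil>
}
```

In this toy version "binding" is just returning the node name; in a real cluster the decision only takes effect once it is written to the API server.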

Source code walkthrough

Scheduling a Pod

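// Simplified: the upstream scheduleOne also takes a context and splits the
// work into a scheduling cycle and a binding cycle.
func (s *Scheduler) scheduleOne() {
    // Take one pending Pod from the queue
    pod, err := s.schedulingQueue.Pop()
    if err != nil {
        utilruntime.HandleError(err)
        return
    }

    // Schedule the Pod; on failure, put it back into the queue for a later retry
    err = s.schedulePod(pod)
    if err != nil {
        utilruntime.HandleError(err)
        s.schedulingQueue.AddUnschedulablePod(pod)
    }
}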

Filtering nodes

func (f *frameworkImpl) Filter(ctx context.Context, cycleState *CycleState, pod *v1.Pod, nodes []*v1.Node) (framework.NodeToStatusMap, *framework.Status) {
    // Simplified: call every filter plugin and stop at the first one that reports an error
    for _, pl := range f.filterPlugins {
        statusMap := pl.Filter(ctx, cycleState, pod, nodes)
        if err := statusMap.AsError(); err != nil {
            return statusMap, framework.NewStatus(framework.Error, err.Error())
        }
    }
    return nil, framework.NewStatus(framework.Success, "")
}

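Preemption

kube-scheduler also supports preemption, which lets high-priority Pods take over node resources occupied by lower-priority Pods. In upstream code, preemption runs in the PostFilter phase (the DefaultPreemption plugin) and is triggered only when no node passes filtering.

Preemption in kube-scheduler typically involves the following steps:

Priority check: the scheduler checks the priority of the Pod being scheduled.
Node selection: the scheduler looks for nodes on which the Pod could be placed.
Conflict detection: if a node lacks sufficient resources, or the Pod has an anti-affinity conflict with Pods already on the node, the scheduler detects the conflict.
Preemption decision: if lower-priority Pods are holding the resources, the scheduler decides whether to preempt them.
Preemption execution: the scheduler issues the preemption, which deletes the lower-priority Pods to make room for the high-priority Pod.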
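The preemption decision in the steps above can be sketched as a victim-selection function. This is a minimal sketch under stated assumptions: RunningPod and selectVictims are hypothetical names, only CPU is considered, and the rule used here (evict the lowest-priority Pods first until the request fits) is a simplification. The upstream logic is more involved and weighs additional factors.

```go
package main

import (
	"fmt"
	"sort"
)

// RunningPod is a hypothetical view of a pod already placed on a node.
type RunningPod struct {
	Name     string
	Priority int32
	CPU      int64 // millicores held by the pod
}

// selectVictims returns the lower-priority pods to evict so that `need`
// millicores fit on a node that currently has `free` millicores available.
// ok is false when evicting every lower-priority pod would still not be enough.
func selectVictims(preemptorPriority int32, need, free int64, running []RunningPod) (victims []RunningPod, ok bool) {
	if free >= need {
		return nil, true // the pod already fits; nothing to evict
	}
	// Only pods with strictly lower priority may be preempted.
	candidates := make([]RunningPod, 0, len(running))
	for _, p := range running {
		if p.Priority < preemptorPriority {
			candidates = append(candidates, p)
		}
	}
	// Evict the least important pods first.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Priority < candidates[j].Priority
	})
	for _, p := range candidates {
		victims = append(victims, p)
		free += p.CPU
		if free >= need {
			return victims, true
		}
	}
	return nil, false
}

func main() {
	running := []RunningPod{
		{Name: "db", Priority: 200, CPU: 1000},
		{Name: "cache", Priority: 50, CPU: 500},
		{Name: "batch", Priority: 0, CPU: 300},
	}
	victims, ok := selectVictims(100, 700, 100, running)
	fmt.Println(ok)
	for _, v := range victims {
		fmt.Println(v.Name) // batch, then cache
	}
}
```

Sorting by ascending priority keeps the set of evicted Pods biased toward the least important workloads, which limits the disruption a preemption causes.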

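Source code walkthrough

Priority check

The scheduler first checks the Pod's priority. The priority normally comes from the PriorityClass named in the Pod spec and is resolved into the pod.Spec.Priority field.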

// pod.Spec.Priority is a *int32 and is nil when no priority has been set.
var podPriority int32 // default priority is 0
if pod.Spec.Priority != nil {
    podPriority = *pod.Spec.Priority
}

Node selection

The scheduler selects nodes through filtering and scoring.

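// Filter returns a per-node status map plus an overall status.
statusMap, status := f.Filter(ctx, cycleState, pod, allNodes)
if !status.IsSuccess() {
    return nil, status.AsError()
}
// nodesWithSuccess is a hypothetical helper that keeps the nodes not rejected in statusMap.
feasibleNodes := nodesWithSuccess(allNodes, statusMap)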

Conflict detection

The scheduler checks each node for resource conflicts and anti-affinity conflicts.

for _, node := range feasibleNodes {
    if !s.isPodAffinitySatisfied(pod, node, cycleState) {
        continue
    }
    if !s.isPodResourcesSatisfied(pod, node, cycleState) {
        continue
    }
    // The node passes both checks; keep it as a candidate for preemption
}

Preemption decision

If the node does not have enough resources, the scheduler considers preemption.

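// Declared outside the if statement so that victims stays in scope for the execution step.
shouldPreempt, victims := s.shouldPreempt(pod, nodeInfo, cycleState)
if shouldPreempt {
    // Run the preemption logic on victims
}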

Preemption execution

The scheduler issues the preemption and deletes the lower-priority Pods.

for _, victim := range victims {
    // Record a preemption event on the victim
    s.recorder.Event(victim, v1.EventTypeNormal, "Preempting", "Preempting pod to fit a higher priority pod")
    if err := s.preemptPod(victim, nodeInfo, cycleState); err != nil {
        // Report the preemption error and move on to the next victim
        utilruntime.HandleError(err)
    }
}

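Caveats of preemption

Preemption can affect service stability: it forcibly deletes running Pods, which can reduce the stability and availability of the services they belong to.
Preemption policy needs careful design: take the priority and disruption tolerance of each workload into account to avoid unnecessary service interruptions.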
