Kubernetes Scheduler: Principles and Analysis

Original post, January 13, 2017, 16:32:30

This article is a walkthrough of the Kubernetes Scheduler's algorithm, focusing on the principles behind the two scheduling steps, Predicates and Priorities, and on the Default Policies that ship in the default configuration. In a follow-up post I will analyze the Kubernetes Scheduler source code, look into the concrete implementation details, and show how to develop a Policy.

The Scheduler and Its Algorithm

The Kubernetes Scheduler is a component of the Kubernetes Master. It is usually deployed on the same node as the API Server and Controller Manager; together, the three make up the Master.

In one sentence, the Scheduler's job is: for each Pod whose PodSpec.NodeName is empty, run the two steps of Predicates and Priorities and pick the most suitable Node as that Pod's destination.

Expanding those two steps gives the Scheduler's algorithm:

  • Predicates: filter out the Nodes that fail any of the configured Predicates Policies (by default, the default predicates policies defined in the DefaultProvider); the Nodes that remain become the input to the Priorities step.
  • Priorities: score and rank the remaining Nodes against the configured Priorities Policies (by default, the default priorities policies defined in the DefaultProvider); the highest-scoring Node is the best fit, and the Pod is bound to that Node.

    If several Nodes tie for the highest score after ranking, the scheduler picks one of them at random as the target Node.

So the logic of the scheduling algorithm itself is very simple; the real substance is in the Policies. Below we look at Kubernetes' Predicates and Priorities Policies, starting with a minimal sketch of the overall flow.
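The following Go sketch illustrates the two-step flow just described. All types and names are simplified stand-ins for illustration, not the actual kube-scheduler code; note that the real scheduler breaks ties randomly, while this sketch simply keeps the first best node.

package main

import (
	"errors"
	"fmt"
)

type Node struct{ Name string }
type Pod struct{ Name string }

// A predicate filters nodes; a priority scores them from 0 to 10.
type Predicate func(pod *Pod, node *Node) bool

type Priority struct {
	Score  func(pod *Pod, node *Node) int // returns 0..10
	Weight int
}

// Schedule runs all predicates, then ranks the survivors by weighted score.
func Schedule(pod *Pod, nodes []*Node, preds []Predicate, prios []Priority) (*Node, error) {
	// Step 1: predicates. A node survives only if every predicate passes.
	var feasible []*Node
	for _, n := range nodes {
		fits := true
		for _, p := range preds {
			if !p(pod, n) {
				fits = false
				break
			}
		}
		if fits {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return nil, errors.New("FailedPredicates: no node fits the pod")
	}

	// Step 2: priorities. Weighted sum of per-policy scores; highest wins.
	best, bestScore := feasible[0], -1
	for _, n := range feasible {
		total := 0
		for _, pr := range prios {
			total += pr.Weight * pr.Score(pod, n)
		}
		if total > bestScore {
			best, bestScore = n, total
		}
	}
	return best, nil
}

func main() {
	pod := &Pod{Name: "nginx"}
	nodes := []*Node{{Name: "node-1"}, {Name: "node-2"}}
	alwaysFits := func(*Pod, *Node) bool { return true }
	preferNode2 := Priority{Weight: 1, Score: func(_ *Pod, n *Node) int {
		if n.Name == "node-2" {
			return 10
		}
		return 5
	}}
	winner, err := Schedule(pod, nodes, []Predicate{alwaysFits}, []Priority{preferNode2})
	if err != nil {
		panic(err)
	}
	fmt.Println("bind", pod.Name, "to", winner.Name) // bind nginx to node-2
}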

Predicates and Priorities Policies

Predicates Policies

Predicates Policies are what the Scheduler uses to filter the Nodes down to those that satisfy the configured conditions. The Nodes are checked concurrently (with at most 16 goroutines): for each Node, every configured Predicates Policy is evaluated in turn, and if any single Policy fails, the Node is eliminated immediately.

Note: the number of concurrent goroutines here equals the number of Nodes, but is capped at 16, controlled by a queue.
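The bounded fan-out that note describes can be sketched in Go as a fixed pool of at most 16 workers draining a shared queue of node indices. This illustrates the mechanism (the real code uses a small workqueue helper for this), with simplified types rather than the scheduler's actual interfaces:

package main

import (
	"fmt"
	"sync"
)

func filterNodes(nodes []string, fits func(node string) bool) []string {
	const maxWorkers = 16
	workers := maxWorkers
	if len(nodes) < workers {
		workers = len(nodes) // goroutine count = node count, capped at 16
	}

	// The queue: one task index per node.
	tasks := make(chan int, len(nodes))
	for i := range nodes {
		tasks <- i
	}
	close(tasks)

	results := make([]bool, len(nodes))
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range tasks { // each worker pulls node indices off the queue
				results[i] = fits(nodes[i])
			}
		}()
	}
	wg.Wait()

	var feasible []string
	for i, ok := range results {
		if ok {
			feasible = append(feasible, nodes[i])
		}
	}
	return feasible
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3"}
	fmt.Println(filterNodes(nodes, func(n string) bool { return n != "node-2" })) // [node-1 node-3]
}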

Kubernetes defines the following Predicates Policies. You can pass --policy-config-file on the kube-scheduler command line to specify the set of Policies to apply, for example:

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    ...
  ]
}

  1. NoDiskConflict: Evaluate if a pod can fit due to the volumes it requests, and those that are already mounted. Currently supported volumes are: AWS EBS, GCE PD, ISCSI and Ceph RBD. Only Persistent Volume Claims for those supported types are checked. Persistent Volumes added directly to pods are not evaluated and are not constrained by this policy.

  2. NoVolumeZoneConflict: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.

  3. PodFitsResources: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check QoS proposal.

  4. PodFitsHostPorts: Check if any HostPort required by the Pod is already occupied on the node. (A minimal sketch of this check follows the list.)

  5. HostName: Filter out all nodes except the one specified in the PodSpec’s NodeName field.

  6. MatchNodeSelector: Check if the labels of the node match the labels specified in the Pod’s nodeSelector field and, as of Kubernetes v1.2, also match the scheduler.alpha.kubernetes.io/affinity pod annotation if present. See here for more details on both.

  7. MaxEBSVolumeCount: Ensure that the number of attached ElasticBlockStore volumes does not exceed a maximum value (by default, 39, since Amazon recommends a maximum of 40 with one of those 40 reserved for the root volume – see Amazon’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

  8. MaxGCEPDVolumeCount: Ensure that the number of attached GCE PersistentDisk volumes does not exceed a maximum value (by default, 16, which is the maximum GCE allows – see GCE’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

  9. CheckNodeMemoryPressure: Check if a pod can be scheduled on a node reporting the memory pressure condition. Currently, no BestEffort pod should be placed on a node under memory pressure, as it would be automatically evicted by kubelet.

  10. CheckNodeDiskPressure: Check if a pod can be scheduled on a node reporting the disk pressure condition. Currently, no pods should be placed on a node under disk pressure, as they would be automatically evicted by kubelet.
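To show how simple an individual predicate can be, here is a hypothetical Go sketch of a PodFitsHostPorts-style check; the types are stand-ins for illustration, not the real scheduler interfaces:

package main

import "fmt"

// podFitsHostPorts fails the node if any HostPort the pod asks for is taken.
func podFitsHostPorts(wantedPorts []int, usedPorts map[int]bool) bool {
	for _, p := range wantedPorts {
		if usedPorts[p] {
			return false // port collision: predicate fails, node is filtered out
		}
	}
	return true
}

func main() {
	used := map[int]bool{80: true, 443: true}
	fmt.Println(podFitsHostPorts([]int{8080}, used)) // true: node passes
	fmt.Println(podFitsHostPorts([]int{443}, used))  // false: node filtered out
}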

The DefaultProvider selects the following Predicates Policies by default:

  1. NoVolumeZoneConflict
  2. MaxEBSVolumeCount
  3. MaxGCEPDVolumeCount
  4. MatchInterPodAffinity

    Note: Fit is determined by inter-pod affinity. AffinityAnnotationKey represents the key of the affinity data (JSON serialized) in a Pod's Annotations.

    AffinityAnnotationKey string = "scheduler.alpha.kubernetes.io/affinity"

  5. NoDiskConflict
  6. GeneralPredicates
    • PodFitsResources
      • pod, in number
      • cpu, in cores
      • memory, in bytes
      • alpha.kubernetes.io/nvidia-gpu, in devices. As of v1.4, each node supports at most one GPU (see the resource-fit sketch after this list)
    • PodFitsHost
    • PodFitsHostPorts
    • PodSelectorMatches
  7. PodToleratesNodeTaints
  8. CheckNodeMemoryPressure
  9. CheckNodeDiskPressure
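To make the resource check inside GeneralPredicates concrete, here is an illustrative Go sketch of a PodFitsResources-style test: the node's capacity minus the sum of the requests of pods already on it must cover the new pod, for pod count, CPU, memory and GPU alike. The struct fields and numbers are assumptions for illustration, not the real API types.

package main

import "fmt"

type Resources struct {
	Pods      int
	MilliCPU  int64 // CPU in millicores
	MemoryB   int64 // memory in bytes
	NvidiaGPU int64 // GPU devices (at most 1 per node as of v1.4)
}

// podFitsResources: what is already requested plus the new pod must fit capacity.
func podFitsResources(capacity, requested, podRequest Resources) bool {
	return requested.Pods+1 <= capacity.Pods &&
		requested.MilliCPU+podRequest.MilliCPU <= capacity.MilliCPU &&
		requested.MemoryB+podRequest.MemoryB <= capacity.MemoryB &&
		requested.NvidiaGPU+podRequest.NvidiaGPU <= capacity.NvidiaGPU
}

func main() {
	capacity := Resources{Pods: 110, MilliCPU: 4000, MemoryB: 8 << 30, NvidiaGPU: 1}
	requested := Resources{Pods: 30, MilliCPU: 3500, MemoryB: 6 << 30}
	pod := Resources{MilliCPU: 800, MemoryB: 1 << 30}
	fmt.Println(podFitsResources(capacity, requested, pod)) // false: CPU would exceed capacity
}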

Priorities Policies

The Nodes that survive the predicates step move on to the priorities step. Here the scheduler concurrently starts one goroutine per priority policy; each goroutine runs its policy's implementation over all of the pre-selected Nodes and gives each a score between 0 and 10, where 0 is the lowest and 10 the highest. Once the goroutines for all policies have finished, each Node's per-policy scores are weighted by the configured priorities policy weights and summed, producing that Node's final score.

finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)

Note: as in the predicates step, concurrency here is bounded to at most 16 goroutines by a queue.
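As a concrete illustration of the weighted sum, a small Go sketch with hypothetical node names and policy scores:

package main

import "fmt"

type WeightedScore struct {
	Weight int
	Scores map[string]int // node name -> 0..10 score from this policy
}

// finalScores multiplies each policy's score by its weight and sums per node.
func finalScores(policies []WeightedScore) map[string]int {
	total := map[string]int{}
	for _, p := range policies {
		for node, s := range p.Scores {
			total[node] += p.Weight * s
		}
	}
	return total
}

func main() {
	policies := []WeightedScore{
		{Weight: 1, Scores: map[string]int{"node-1": 7, "node-2": 4}}, // e.g. LeastRequestedPriority
		{Weight: 1, Scores: map[string]int{"node-1": 3, "node-2": 9}}, // e.g. BalancedResourceAllocation
	}
	fmt.Println(finalScores(policies)) // map[node-1:10 node-2:13] -> node-2 wins
}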

A thought: if no Node survives the predicates, the scheduler returns a FailedPredicates error right away and never enters the Prioritizing phase, which is reasonable. But if exactly one Node survives, Prioritizing is still triggered and runs the same flow as with multiple Nodes. In that case the scheduler could simply return that single Node as the final scheduling result and skip the whole scoring pipeline, as sketched below.
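A Go sketch of that suggested short-circuit (the author's proposed optimization, not the scheduler's actual behavior at the time):

package main

import "fmt"

func pickNode(feasible []string, prioritize func([]string) string) (string, error) {
	switch len(feasible) {
	case 0:
		return "", fmt.Errorf("FailedPredicates: no node fits the pod")
	case 1:
		return feasible[0], nil // only one candidate: scoring cannot change the outcome
	default:
		return prioritize(feasible), nil
	}
}

func main() {
	node, err := pickNode([]string{"node-7"}, func(ns []string) string { return ns[0] })
	fmt.Println(node, err) // node-7 <nil>, without running any priority policy
}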

If several Nodes tie for the highest score after ranking, the scheduler picks one of them at random as the target Node.

Kubernetes defines the following Priorities Policies. You can pass --policy-config-file on the kube-scheduler command line to specify the set of Policies to apply, for example:

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    ...
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ]
}

  • LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption. (A worked sketch of this formula follows the list.)
  • BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
  • SelectorSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
  • CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
  • ImageLocalityPriority: Nodes are prioritized based on the locality of the images requested by a pod. Nodes that already hold a larger total size of the images required by the pod are preferred over nodes that hold none of them, or only a small total size of them.
  • NodeAffinityPriority: (Kubernetes v1.2) Implements preferredDuringSchedulingIgnoredDuringExecution node affinity; see here for more details.
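To see the LeastRequestedPriority arithmetic from the list above in action, here is a Go sketch with illustrative numbers; leastRequestedScore is a hypothetical helper mirroring the quoted formula, not a real scheduler function.

package main

import "fmt"

// Score one resource as 10 * free/capacity after placing the pod.
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}

func main() {
	// Node with 4000m CPU and 8 GiB memory; existing pods plus the new pod
	// request 2000m CPU and 2 GiB memory in total.
	cpu := leastRequestedScore(2000, 4000)   // 5
	mem := leastRequestedScore(2<<30, 8<<30) // 7 (integer division of 7.5)
	fmt.Println((cpu + mem) / 2)             // final 0-10 score: 6
	// CPU and memory are equally weighted; emptier nodes score higher,
	// which spreads pods across nodes by resource consumption.
}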

The DefaultProvider selects the following Priorities Policies by default:

  1. SelectorSpreadPriority, default weight 1
  2. InterPodAffinityPriority, default weight 1

    • Pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.) as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
    • AffinityAnnotationKey represents the key of affinity data (json serialized) in the Annotations of a Pod.

    scheduler.alpha.kubernetes.io/affinity="..."

  3. LeastRequestedPriority, default weight 1

  4. BalancedResourceAllocation, default weight 1
  5. NodePreferAvoidPodsPriority, default weight 10000

    Note: the weight here is deliberately huge (10000). If a node scores above 0 on this policy, the weighted contribution makes its final score enormous; if it scores 0, it falls so far behind the high scorers that it is certain to be eliminated. The analysis:

    If a Node's Annotations do not set the key-value:

    scheduler.alpha.kubernetes.io/preferAvoidPods="..."

    then the node scores 10 on this policy, and with the weight of 10000 it receives at least 100,000 points from this policy alone.

    If a Node's Annotations do set

    scheduler.alpha.kubernetes.io/preferAvoidPods="..."

    and the pod is controlled by a ReplicationController or ReplicaSet, the node scores 0 on this policy. That leaves it hopelessly behind any node without the annotation: the other six default priorities, each with weight 1, can contribute at most 6 × 10 = 60 points in total, nowhere near 100,000. In other words, such a Node is guaranteed to be eliminated.

  6. NodeAffinityPriority, default weight 1

  7. TaintTolerationPriority, default weight 1

Scheduler Algorithm Flowchart

[Figure: flowchart of the scheduler algorithm, image omitted]

Summary

  • The kubernetes scheduler's task is to bind each pod to the most suitable Node.
  • Scheduling happens in two steps: Predicates and Priorities.
  • The default scheduling policy set is the DefaultProvider; the policies it includes are listed above.
  • You can point kube-scheduler's --policy-config-file startup flag at a custom JSON file to assemble your own Predicates and Priorities policies in the format shown above.
Copyright notice: this is an original post by the author; reproduction without permission is prohibited.
