An Analysis of the Kubernetes Scheduler

Originally published January 13, 2017, 16:32:30

This article explains the algorithms behind the Kubernetes Scheduler, focusing on how the two phases, predicates and priorities, work, and on the default policies configured by the DefaultProvider. In my next post I will walk through the Scheduler source code to examine the concrete implementation and show how to develop a custom Policy.

The Scheduler and Its Algorithm

The Kubernetes Scheduler is a component of the Kubernetes Master. It is typically deployed on the same node as the API Server and the Controller Manager; together, the three make up the core of the Master.

The Scheduler's job in one sentence: take each Pod whose PodSpec.NodeName is empty, run it through the two phases of predicates and priorities, and pick the most suitable Node as that Pod's destination.

Expanding those two phases gives the algorithm in outline:

  • Predicates: filter out the Nodes that fail any of the configured Predicates Policies (by default, the default predicates policies defined by the DefaultProvider); the remaining Nodes become the input of the priorities phase.
  • Priorities: score and rank the filtered Nodes according to the configured Priorities Policies (by default, the default priorities policies defined by the DefaultProvider); the Node with the highest score is the best fit, and the Pod is bound to it.

    If several Nodes tie for the highest score after the priorities phase, the scheduler picks one of them at random as the target Node.

The scheduling flow itself is therefore very simple; the real logic lives in the Policies. Before diving into Kubernetes' Predicates and Priorities Policies, the sketch below shows the overall two-phase flow.
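Here is a minimal Go sketch of the two-phase algorithm. All names (Node, Pod, PredicateFunc, WeightedPriority, schedule) are illustrative stand-ins rather than the scheduler's actual APIs, and tie-breaking among equal scores is shown separately in the Priorities section.

package scheduler

// Node and Pod are deliberately minimal stand-ins for the real API objects.
type Node struct{ Name string }
type Pod struct{ Name string }

// PredicateFunc reports whether the pod fits on the node.
type PredicateFunc func(pod Pod, node Node) bool

// WeightedPriority scores a node from 0 to 10 and carries the policy's weight.
type WeightedPriority struct {
    Score  func(pod Pod, node Node) int
    Weight int
}

// schedule mirrors the two phases: filter with predicates, then rank the
// survivors by the weighted sum of their priority scores.
func schedule(pod Pod, nodes []Node, preds []PredicateFunc, prios []WeightedPriority) (Node, bool) {
    // Phase 1: predicates -- a node is eliminated as soon as one policy fails.
    var feasible []Node
    for _, n := range nodes {
        fits := true
        for _, p := range preds {
            if !p(pod, n) {
                fits = false
                break
            }
        }
        if fits {
            feasible = append(feasible, n)
        }
    }
    if len(feasible) == 0 {
        return Node{}, false // FailedPredicates: nothing to prioritize
    }

    // Phase 2: priorities -- the highest weighted sum wins.
    best, bestScore := feasible[0], -1
    for _, n := range feasible {
        score := 0
        for _, pr := range prios {
            score += pr.Weight * pr.Score(pod, n)
        }
        if score > bestScore {
            best, bestScore = n, score
        }
    }
    return best, true
}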

Predicates and Priorities Policies

Predicates Policies

Predicates Policies are what the Scheduler uses to filter out Nodes that fail the configured conditions. The Nodes are checked concurrently (with at most 16 goroutines); for each Node, every configured Predicates Policy is evaluated in turn, and as soon as a single Policy fails, the Node is eliminated.

Note: the number of concurrent goroutines equals the number of Nodes, but is capped at 16 and driven by a work queue. A sketch of this bounded fan-out follows.
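The bounded fan-out can be sketched as a simple worker pool pulling from a queue, reusing the Node type from the sketch above. Apart from the cap of 16 and the queue, which come from the text, the details are illustrative:

package scheduler

import "sync"

const maxParallelism = 16 // cap on concurrent predicate checks

// filterNodes runs checkNode for every node, using at most
// min(len(nodes), maxParallelism) goroutines fed from a shared queue.
func filterNodes(nodes []Node, checkNode func(Node) bool) []Node {
    queue := make(chan Node, len(nodes))
    for _, n := range nodes {
        queue <- n
    }
    close(queue)

    workers := len(nodes)
    if workers > maxParallelism {
        workers = maxParallelism
    }

    var (
        mu       sync.Mutex
        wg       sync.WaitGroup
        feasible []Node
    )
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for n := range queue {
                if checkNode(n) {
                    mu.Lock()
                    feasible = append(feasible, n)
                    mu.Unlock()
                }
            }
        }()
    }
    wg.Wait()
    return feasible
}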

Kubernetes defines the following Predicates Policies. You can add --policy-config-file to the kube-scheduler startup arguments to specify the set of Policies to apply, for example:

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    {"name" : "PodFitsPorts"},
    {"name" : "PodFitsResources"},
    {"name" : "NoDiskConflict"},
    {"name" : "NoVolumeZoneConflict"},
    {"name" : "MatchNodeSelector"},
    {"name" : "HostName"}
  ],
  "priorities" : [
    ...
  ]
}
  1. NoDiskConflict: Evaluate if a pod can fit due to the volumes it requests, and those that are already mounted. Currently supported volumes are: AWS EBS, GCE PD, ISCSI and Ceph RBD. Only Persistent Volume Claims for those supported types are checked. Persistent Volumes added directly to pods are not evaluated and are not constrained by this policy.

  2. NoVolumeZoneConflict: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.

  3. PodFitsResources: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check QoS proposal. (A minimal sketch of this check appears right after this list.)

  4. PodFitsHostPorts: Check if any HostPort required by the Pod is already occupied on the node.

  5. HostName: Filter out all nodes except the one specified in the PodSpec’s NodeName field.

  6. MatchNodeSelector: Check if the labels of the node match the labels specified in the Pod’s nodeSelector field and, as of Kubernetes v1.2, also match the scheduler.alpha.kubernetes.io/affinity pod annotation if present. See here for more details on both.

  7. MaxEBSVolumeCount: Ensure that the number of attached ElasticBlockStore volumes does not exceed a maximum value (by default, 39, since Amazon recommends a maximum of 40 with one of those 40 reserved for the root volume – see Amazon’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

  8. MaxGCEPDVolumeCount: Ensure that the number of attached GCE PersistentDisk volumes does not exceed a maximum value (by default, 16, which is the maximum GCE allows – see GCE’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

  9. CheckNodeMemoryPressure: Check if a pod can be scheduled on a node reporting memory pressure condition. Currently, no BestEffort should be placed on a node under memory pressure as it gets automatically evicted by kubelet.

  10. CheckNodeDiskPressure: Check if a pod can be scheduled on a node reporting disk pressure condition. Currently, no pods should be placed on a node under disk pressure as it gets automatically evicted by kubelet.
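To make item 3 (PodFitsResources) concrete, here is a minimal sketch of the check, under the simplifying assumption that resources are plain integers (CPU in millicores, memory in bytes); the type and field names are illustrative:

package scheduler

// Resources is a simplified resource vector.
type Resources struct {
    MilliCPU int64 // CPU in millicores
    Memory   int64 // memory in bytes
}

// podFitsResources checks whether the node's free resources cover the pod's
// request, where free = capacity - sum of requests of pods already on the
// node. Only requests count; actual usage is irrelevant here.
func podFitsResources(podRequest, nodeCapacity Resources, requestsOnNode []Resources) bool {
    var used Resources
    for _, r := range requestsOnNode {
        used.MilliCPU += r.MilliCPU
        used.Memory += r.Memory
    }
    return podRequest.MilliCPU <= nodeCapacity.MilliCPU-used.MilliCPU &&
        podRequest.Memory <= nodeCapacity.Memory-used.Memory
}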

The DefaultProvider selects the following Predicates Policies by default:

  1. NoVolumeZoneConflict
  2. MaxEBSVolumeCount
  3. MaxGCEPDVolumeCount
  4. MatchInterPodAffinity

    Note: fit is determined by inter-pod affinity. AffinityAnnotationKey represents the key of the affinity data (JSON serialized) in a Pod's Annotations.

    AffinityAnnotationKey string = "scheduler.alpha.kubernetes.io/affinity"

  5. NoDiskConflict
  6. GeneralPredicates
    • PodFitsResources
      • pod, in number
      • cpu, in cores
      • memory, in bytes
      • alpha.kubernetes.io/nvidia-gpu, in devices. As of v1.4, each node supports at most one GPU.
    • PodFitsHost
    • PodFitsHostPorts
    • PodSelectorMatches
  7. PodToleratesNodeTaints
  8. CheckNodeMemoryPressure
  9. CheckNodeDiskPressure

Priorities Policies

The Nodes that survive the predicates phase move on to the priorities phase. Here the Scheduler scores them concurrently: goroutines are started for the scoring work, and each configured priority policy is applied to every predicate-passing Node, giving the Node a score from 0 to 10 per policy, where 0 is the lowest and 10 the highest. Once the goroutines for all policies have completed, each Node's per-policy scores are multiplied by the weight configured for the corresponding priorities policy and summed; the weighted sum is the Node's final score.

finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)

Note: here, too, the number of concurrent goroutines equals the number of Nodes, capped at 16 and driven by a work queue.
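A minimal sketch of this phase, reusing the Pod and Node types from earlier. For simplicity it starts one goroutine per policy; the signatures are illustrative, not the scheduler's actual interfaces:

package scheduler

import "sync"

// PriorityPolicy scores every candidate node from 0 to 10 for the given pod
// and carries the weight configured for this policy.
type PriorityPolicy struct {
    Score  func(pod Pod, nodes []Node) []int // one 0-10 score per node
    Weight int
}

// prioritizeNodes runs each policy in its own goroutine, then combines the
// per-policy scores into a weighted sum per node.
func prioritizeNodes(pod Pod, nodes []Node, policies []PriorityPolicy) []int {
    perPolicy := make([][]int, len(policies))
    var wg sync.WaitGroup
    for i, p := range policies {
        wg.Add(1)
        go func(i int, p PriorityPolicy) {
            defer wg.Done()
            perPolicy[i] = p.Score(pod, nodes)
        }(i, p)
    }
    wg.Wait()

    final := make([]int, len(nodes))
    for i, p := range policies {
        for j, s := range perPolicy[i] {
            final[j] += p.Weight * s
        }
    }
    return final
}

For example, if LeastRequestedPriority (weight 1) scores node A at 8 and BalancedResourceAllocation (weight 1) scores it at 6, then finalScoreNodeA = 1*8 + 1*6 = 14.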

A thought: if no Node passes the predicates, the scheduler immediately returns a FailedPredicates error and the priorities phase is never triggered, which is reasonable. However, if exactly one Node passes, the priorities phase is still triggered and runs exactly the same flow as with multiple Nodes. In that case the scheduler could, in principle, return that single Node as the final result directly, without running the whole scoring pipeline.

If several Nodes tie for the highest score after the priorities phase, the scheduler picks one of them at random as the target Node. A sketch of this selection step follows.
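A minimal sketch of the selection step, again with illustrative names; it assumes at least one scored node:

package scheduler

import "math/rand"

// selectHost picks the node with the highest final score, breaking ties at
// random among all nodes that share the top score.
func selectHost(nodes []Node, finalScores []int) Node {
    var best []Node
    bestScore := -1
    for i, s := range finalScores {
        if s > bestScore {
            best, bestScore = []Node{nodes[i]}, s
        } else if s == bestScore {
            best = append(best, nodes[i])
        }
    }
    return best[rand.Intn(len(best))]
}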

Kubernetes defines the following Priorities Policies. Again, you can add --policy-config-file to the kube-scheduler startup arguments to specify the set of Policies to apply, for example:

{
  "kind" : "Policy",
  "apiVersion" : "v1",
  "predicates" : [
    ...
  ],
  "priorities" : [
    {"name" : "LeastRequestedPriority", "weight" : 1},
    {"name" : "BalancedResourceAllocation", "weight" : 1},
    {"name" : "ServiceSpreadingPriority", "weight" : 1},
    {"name" : "EqualPriority", "weight" : 1}
  ]
}
  • LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption. (A scoring sketch for this policy follows this list.)
  • BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
  • SelectorSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
  • CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
  • ImageLocalityPriority: Nodes are prioritized based on locality of images requested by a pod. Nodes with larger size of already-installed packages required by the pod will be preferred over nodes with no already-installed packages required by the pod or a small total size of already-installed packages required by the pod.
  • NodeAffinityPriority: (Kubernetes v1.2) Implements preferredDuringSchedulingIgnoredDuringExecution node affinity; see here for more details.
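To make the LeastRequestedPriority formula concrete, here is a minimal sketch, reusing the simplified Resources type from the predicates section; the real implementation works on the scheduler's own resource types, so treat this as illustrative:

package scheduler

// leastRequestedScore maps the free fraction of each resource to a 0-10
// score: (capacity - alreadyRequested - podRequest) * 10 / capacity,
// then averages CPU and memory with equal weight.
func leastRequestedScore(podReq, alreadyRequested, capacity Resources) int64 {
    score := func(req, requested, total int64) int64 {
        free := total - requested - req
        if total == 0 || free < 0 {
            return 0
        }
        return free * 10 / total
    }
    cpu := score(podReq.MilliCPU, alreadyRequested.MilliCPU, capacity.MilliCPU)
    mem := score(podReq.Memory, alreadyRequested.Memory, capacity.Memory)
    return (cpu + mem) / 2
}

For example, on a node with 4000 millicores of capacity and 1000 already requested, a pod requesting 1000 leaves a free fraction of (4000 - 1000 - 1000) / 4000 = 0.5, i.e. a CPU score of 5.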

The DefaultProvider selects the following Priorities Policies by default:

  1. SelectorSpreadPriority, default weight 1
  2. InterPodAffinityPriority, default weight 1

    • Pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.) as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
    • AffinityAnnotationKey represents the key of the affinity data (JSON serialized) in a Pod's Annotations.

    scheduler.alpha.kubernetes.io/affinity="..."

  3. LeastRequestedPriority, default weight 1

  4. BalancedResourceAllocation, default weight 1
  5. NodePreferAvoidPodsPriority, default weight 10000

    Note: the weight here is deliberately huge (10000): if a node scores non-zero on this policy, the weighted contribution makes its final score very high, while a score of 0 leaves it so far behind the high scorers that it is certain to be eliminated. In detail:

    If a Node's Annotations do not set the key

    scheduler.alpha.kubernetes.io/preferAvoidPods="..."

    then that node scores 10 on this policy, and with the weight of 10000 this policy alone contributes 10 × 10000 = 100,000 points to its final score.

    If a Node's Annotations do set

    scheduler.alpha.kubernetes.io/preferAvoidPods="..."

    and the Pod being scheduled is controlled by a ReplicationController or ReplicaSet, then that node scores 0 on this policy, leaving its final score hopelessly below that of the nodes without the annotation. In other words, such a Node is guaranteed to be eliminated!

  6. NodeAffinityPriority, default weight 1

  7. TaintTolerationPriority, default weight 1

Scheduler Algorithm Flow Chart

[Figure: scheduler algorithm flow chart; original image not available]

Summary

  • The Kubernetes scheduler's task is to place each Pod onto the most suitable Node.
  • The scheduling process has two phases: predicates and priorities.
  • The default scheduling policy set comes from the DefaultProvider; the specific policies are listed above.
  • You can point the kube-scheduler startup argument --policy-config-file at a custom JSON file to assemble your own predicates and priorities policies in the format shown above.