# Kubernetes Scheduler原理解析

## Scheduler及其算法介绍

Kubernetes Scheduler is a component of the Kubernetes Master. It is normally deployed on the same node as the API Server and Controller Manager, the three together making up the Master's "three musketeers". Its job is to pick the most suitable Node for each Pod, in two phases:

• Predicate phase (预选): filter out the Nodes that fail any of the configured Predicates Policies (by default, the default predicates policies set defined in DefaultProvider); the Nodes that remain become the input of the priority phase.
• Priority phase (优选): score and rank the filtered Nodes according to the configured Priorities Policies (by default, the default priorities policies set defined in DefaultProvider); the Node with the highest score is the best fit, and the Pod is bound to that Node.

If several Nodes tie for the highest score after the priority ranking, the scheduler picks one of them at random as the target Node.
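The tie-break described above can be sketched roughly as follows. This is a minimal illustration under assumed types, not the scheduler's actual code; `HostPriority` mirrors the scheduler's struct name, everything else is made up for the example:

```go
package main

import (
	"fmt"
	"math/rand"
)

// HostPriority pairs a node name with its final weighted score.
type HostPriority struct {
	Host  string
	Score int
}

// selectHost picks the highest-scoring node, choosing randomly
// when several nodes tie for the top score.
func selectHost(priorities []HostPriority) string {
	maxScore := priorities[0].Score
	for _, p := range priorities {
		if p.Score > maxScore {
			maxScore = p.Score
		}
	}
	var best []string
	for _, p := range priorities {
		if p.Score == maxScore {
			best = append(best, p.Host)
		}
	}
	return best[rand.Intn(len(best))]
}

func main() {
	ranked := []HostPriority{{"node-a", 7}, {"node-b", 9}, {"node-c", 9}}
	// node-b and node-c tie at 9, so either may be printed.
	fmt.Println(selectHost(ranked))
}
```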

## Predicates and Priorities Policies

### Predicates Policies

Predicates Policies are what the Scheduler uses to filter out the Nodes that satisfy the defined conditions. The scheduler evaluates Nodes concurrently (with at most 16 goroutines), running every configured Predicates Policy against each Node; if a Node fails even one Policy, it is eliminated immediately.
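The bounded-concurrency filtering described above can be sketched as follows. This is an assumed simplification: real predicates take a pod and rich node info, while here a predicate is just `func(node string) bool`; only the 16-worker bound mirrors the scheduler:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Predicate reports whether a node can run the pod (simplified signature).
type Predicate func(node string) bool

// filterNodes runs every predicate against every node, evaluating at most
// maxWorkers nodes concurrently; a node is kept only if it passes all
// predicates (one failing policy eliminates it immediately).
func filterNodes(nodes []string, predicates []Predicate, maxWorkers int) []string {
	var (
		mu   sync.Mutex
		wg   sync.WaitGroup
		fits []string
	)
	sem := make(chan struct{}, maxWorkers) // bounds concurrency
	for _, node := range nodes {
		wg.Add(1)
		sem <- struct{}{}
		go func(node string) {
			defer wg.Done()
			defer func() { <-sem }()
			for _, p := range predicates {
				if !p(node) {
					return // eliminated by a failing policy
				}
			}
			mu.Lock()
			fits = append(fits, node)
			mu.Unlock()
		}(node)
	}
	wg.Wait()
	sort.Strings(fits) // deterministic order for display
	return fits
}

func main() {
	noDiskConflict := func(node string) bool { return node != "node-b" }
	fmt.Println(filterNodes([]string{"node-a", "node-b", "node-c"},
		[]Predicate{noDiskConflict}, 16)) // [node-a node-c]
}
```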

Kubernetes defines the following Predicates Policies. You can add --policy-config-file to the kube-scheduler startup arguments to specify the set of Policies to apply, for example:

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    {"name": "PodFitsPorts"},
    {"name": "PodFitsResources"},
    {"name": "NoDiskConflict"},
    {"name": "NoVolumeZoneConflict"},
    {"name": "MatchNodeSelector"},
    {"name": "HostName"}
  ],
  "priorities": [
    ...
  ]
}
```
1. NoDiskConflict: Evaluate if a pod can fit due to the volumes it requests, and those that are already mounted. Currently supported volumes are: AWS EBS, GCE PD, ISCSI and Ceph RBD. Only Persistent Volume Claims for those supported types are checked. Persistent Volumes added directly to pods are not evaluated and are not constrained by this policy.

2. NoVolumeZoneConflict: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.

3. PodFitsResources: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check QoS proposal.

4. PodFitsHostPorts: Check if any HostPort required by the Pod is already occupied on the node.

5. HostName: Filter out all nodes except the one specified in the PodSpec’s NodeName field.

6. MatchNodeSelector: Check if the labels of the node match the labels specified in the Pod’s nodeSelector field and, as of Kubernetes v1.2, also match the scheduler.alpha.kubernetes.io/affinity pod annotation if present. See here for more details on both.

7. MaxEBSVolumeCount: Ensure that the number of attached ElasticBlockStore volumes does not exceed a maximum value (by default, 39, since Amazon recommends a maximum of 40 with one of those 40 reserved for the root volume – see Amazon’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

8. MaxGCEPDVolumeCount: Ensure that the number of attached GCE PersistentDisk volumes does not exceed a maximum value (by default, 16, which is the maximum GCE allows – see GCE’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.

9. CheckNodeMemoryPressure: Check if a pod can be scheduled on a node reporting memory pressure condition. Currently, no BestEffort should be placed on a node under memory pressure as it gets automatically evicted by kubelet.

10. CheckNodeDiskPressure: Check if a pod can be scheduled on a node reporting disk pressure condition. Currently, no pods should be placed on a node under disk pressure as it gets automatically evicted by kubelet.
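The PodFitsResources check described above ("capacity minus the sum of requests of all Pods on the node") can be sketched as follows. The types are hypothetical simplifications; the real scheduler works with `resource.Quantity` values from the API objects:

```go
package main

import "fmt"

// Resources holds millicores of CPU and bytes of memory (simplified).
type Resources struct {
	MilliCPU int64
	Memory   int64
}

// podFitsResources computes the free resource as capacity minus the sum of
// requests of all pods already on the node, and checks the incoming pod's
// request against it.
func podFitsResources(podReq, capacity Resources, existing []Resources) bool {
	used := Resources{}
	for _, r := range existing {
		used.MilliCPU += r.MilliCPU
		used.Memory += r.Memory
	}
	freeCPU := capacity.MilliCPU - used.MilliCPU
	freeMem := capacity.Memory - used.Memory
	return podReq.MilliCPU <= freeCPU && podReq.Memory <= freeMem
}

func main() {
	nodeCap := Resources{MilliCPU: 4000, Memory: 8 << 30} // 4 cores, 8 GiB
	running := []Resources{{1000, 2 << 30}, {2000, 2 << 30}}
	// 1000m CPU and 4 GiB remain free on the node.
	fmt.Println(podFitsResources(Resources{500, 1 << 30}, nodeCap, running))  // true
	fmt.Println(podFitsResources(Resources{2000, 1 << 30}, nodeCap, running)) // false
}
```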

The default predicates policies defined in DefaultProvider are:

1. NoVolumeZoneConflict
2. MaxEBSVolumeCount
3. MaxGCEPDVolumeCount
4. MatchInterPodAffinity

Note: fit is determined by inter-pod affinity. AffinityAnnotationKey represents the key of the affinity data (JSON serialized) in the Annotations of a Pod:

AffinityAnnotationKey string = "scheduler.alpha.kubernetes.io/affinity"

5. NoDiskConflict
6. GeneralPredicates
• PodFitsResources
• pod, in number
• cpu, in cores
• memory, in bytes
• alpha.kubernetes.io/nvidia-gpu, in devices. As of v1.4, each node supports at most one GPU
• PodFitsHost
• PodFitsHostPorts
• PodSelectorMatches
7. PodToleratesNodeTaints
8. CheckNodeMemoryPressure
9. CheckNodeDiskPressure

### Priorities Policies

Each Priorities Policy is a priority function with a configured weight. Every function scores each Node that survived the predicate phase, and a Node's final score is the weighted sum:

finalScoreNodeA = (weight1 * priorityFunc1) + (weight2 * priorityFunc2)
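As a small worked example of this formula (the scores and weights below are hypothetical; priority functions conventionally score on a 0–10 scale):

```go
package main

import "fmt"

// PriorityResult is one priority function's score for a node,
// paired with that policy's configured weight.
type PriorityResult struct {
	Score  int
	Weight int
}

// finalScore is the weighted sum of all priority function scores,
// i.e. weight1*priorityFunc1 + weight2*priorityFunc2 + ...
func finalScore(results []PriorityResult) int {
	total := 0
	for _, r := range results {
		total += r.Weight * r.Score
	}
	return total
}

func main() {
	// e.g. one function scored the node 8 (weight 1),
	// another scored it 5 (weight 1): final score 1*8 + 1*5.
	fmt.Println(finalScore([]PriorityResult{{8, 1}, {5, 1}})) // 13
}
```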

Kubernetes defines the following Priorities Policies. You can add --policy-config-file to the kube-scheduler startup arguments to specify the set of Policies to apply, for example:

```json
{
  "kind": "Policy",
  "apiVersion": "v1",
  "predicates": [
    ...
  ],
  "priorities": [
    {"name": "LeastRequestedPriority", "weight": 1},
    {"name": "BalancedResourceAllocation", "weight": 1},
    {"name": "ServiceSpreadingPriority", "weight": 1},
    {"name": "EqualPriority", "weight": 1}
  ]
}
```
• LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption.
• BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
• SelectorSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
• CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
• ImageLocalityPriority: Nodes are prioritized based on locality of the images requested by a pod. Nodes with a larger total size of already-present images required by the pod are preferred over nodes with none, or a small total size, of the required images.
• NodeAffinityPriority: (Kubernetes v1.2) Implements preferredDuringSchedulingIgnoredDuringExecution node affinity; see here for more details.
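LeastRequestedPriority above can be sketched as follows. This is a simplification of the scheduler's calculation: per resource, the score is `(capacity - requested) * 10 / capacity`, and the CPU and memory scores are averaged (equally weighted, as stated above):

```go
package main

import "fmt"

// leastRequestedScore returns a 0-10 score for one resource: the freer the
// node would be after placing the pod, the higher the score.
func leastRequestedScore(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0
	}
	return (capacity - requested) * 10 / capacity
}

// nodeScore averages the CPU and memory scores, since CPU and memory
// are equally weighted in this priority function.
func nodeScore(cpuReq, cpuCap, memReq, memCap int64) int64 {
	return (leastRequestedScore(cpuReq, cpuCap) + leastRequestedScore(memReq, memCap)) / 2
}

func main() {
	// 3000m of 4000m CPU requested and 4 GiB of 8 GiB memory requested
	// (existing pods' requests plus the pod being scheduled):
	// CPU score (4000-3000)*10/4000 = 2, memory score 5, average 3.
	fmt.Println(nodeScore(3000, 4000, 4<<30, 8<<30)) // 3
}
```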

The default priorities policies defined in DefaultProvider are:

1. SelectorSpreadPriority, default weight 1

2. InterPodAffinityPriority, default weight 1

• Pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.) as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
• AffinityAnnotationKey represents the key of the affinity data (JSON serialized) in the Annotations of a Pod:

scheduler.alpha.kubernetes.io/affinity="..."

3. LeastRequestedPriority, default weight 1

4. BalancedResourceAllocation, default weight 1
5. NodePreferAvoidPodsPriority, default weight 10000

Note: the weight here is set large enough (10000) that if a node scores non-zero for this policy, its weighted final score is very high; if it scores 0, it falls so far below the high-scoring nodes that it is bound to be eliminated. The analysis is as follows.

If the Node's Annotations do not set the key-value:

scheduler.alpha.kubernetes.io/preferAvoidPods="..."

then the node scores 10 for this policy; multiplied by the weight of 10000, the node gets at least 100,000 points from this policy.

If the Node's Annotations do set

scheduler.alpha.kubernetes.io/preferAvoidPods="..."

and the Pod's Controller is a ReplicationController or ReplicaSet, then the node scores 0 for this policy, leaving it absurdly far below the nodes without this Annotation. In other words, this Node is certain to be eliminated!
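The 10-versus-0 analysis above can be sketched as follows. This is a hypothetical simplification: the real policy parses the annotation's JSON payload and inspects the pod's controller reference, while here only the two branches that matter for the analysis are modeled:

```go
package main

import "fmt"

const preferAvoidPodsAnnotationKey = "scheduler.alpha.kubernetes.io/preferAvoidPods"

// preferAvoidScore returns 10 when the node does not carry the
// preferAvoidPods annotation, and 0 when it does and the pod is
// controlled by a ReplicationController or ReplicaSet.
func preferAvoidScore(nodeAnnotations map[string]string, controllerKind string) int {
	if _, avoid := nodeAnnotations[preferAvoidPodsAnnotationKey]; !avoid {
		return 10
	}
	if controllerKind == "ReplicationController" || controllerKind == "ReplicaSet" {
		return 0
	}
	return 10
}

func main() {
	plain := map[string]string{}
	avoid := map[string]string{preferAvoidPodsAnnotationKey: "..."}
	// With the default weight of 10000, the 10 below becomes
	// at least 100,000 weighted points.
	fmt.Println(preferAvoidScore(plain, "ReplicaSet")) // 10
	fmt.Println(preferAvoidScore(avoid, "ReplicaSet")) // 0
}
```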

6. NodeAffinityPriority, default weight 1

7. TaintTolerationPriority, default weight 1

## 总结

• The task of the kubernetes scheduler is to schedule each pod to the most suitable Node.
• The scheduling process has two phases: predicate (预选, Predicates) and priority (优选, Priorities).
• The default scheduling policy set is DefaultProvider; the specific policies it contains are listed above.
• You can pass --policy-config-file in the kube-scheduler startup arguments to point at a custom JSON file that assembles your own Predicates and Priorities policies in the format shown above.
