The scheduling algorithm
For a given pod:
+---------------------------------------------+
|              Schedulable nodes:             |
|                                             |
|  +--------+    +--------+    +--------+     |
|  | node 1 |    | node 2 |    | node 3 |     |
|  +--------+    +--------+    +--------+     |
|                                             |
+-------------------+-------------------------+
                    |
                    |
                    v
+-------------------+-------------------------+

Pred. filters: node 3 doesn't have enough resource

+-------------------+-------------------------+
                    |
                    |
                    v
+-------------------+-------------------------+
|              remaining nodes:               |
|  +--------+                  +--------+     |
|  | node 1 |                  | node 2 |     |
|  +--------+                  +--------+     |
|                                             |
+-------------------+-------------------------+
                    |
                    |
                    v
+-------------------+-------------------------+

Priority function:  node 1: p=2
                    node 2: p=5

+-------------------+-------------------------+
                    |
                    |
                    v
       select max{node priority} = node 2
The scheduler looks for a suitable node for one pod at a time:
- First, the scheduler runs a series of checks to filter out unsuitable nodes. For example, if pod.spec declares resource requests, the scheduler filters out the nodes that do not have enough free resources.
- Next, the scheduler ranks the remaining nodes by running a series of priority functions. This ranking does not filter any node out. For example, the scheduler tries to place the pod on a node with ample free resources, and to spread pods across different zones.
- Finally, the node with the highest priority is selected as the destination (if several nodes share the highest priority, one of them is picked at random). For the implementation, see the schedule() function in plugin/pkg/scheduler/generic_scheduler.go.
To summarize, Kubernetes scheduling consists of two parts:
1. Find the nodes that satisfy the pod's requirements (predicates).
2. Among those nodes, select the best one according to a set of policies (priorities).
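To make the two phases concrete, here is a minimal Go sketch of the loop described above. It is illustrative only, not the real code: the actual implementation is the schedule() function in plugin/pkg/scheduler/generic_scheduler.go, and the Pod/Node types below are simplified stand-ins invented for the sketch.

package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// Pod and Node are simplified stand-ins for the real API types.
type Pod struct{ Name string }
type Node struct{ Name string }

// Predicate reports whether a pod fits on a node (filtering phase).
type Predicate func(pod Pod, node Node) bool

// Priority scores a node for a pod; higher is better (ranking phase).
type Priority func(pod Pod, node Node) int

// schedule filters the nodes through every predicate, sums the priority
// scores of the survivors, and returns one of the top-scoring nodes,
// breaking ties at random.
func schedule(pod Pod, nodes []Node, preds []Predicate, prios []Priority) (Node, error) {
	// Phase 1 (predicates): keep only the nodes that pass all checks.
	var feasible []Node
	for _, n := range nodes {
		fits := true
		for _, p := range preds {
			if !p(pod, n) {
				fits = false
				break
			}
		}
		if fits {
			feasible = append(feasible, n)
		}
	}
	if len(feasible) == 0 {
		return Node{}, errors.New("no node fits the pod")
	}

	// Phase 2 (priorities): rank the survivors and collect the best.
	var best []Node
	bestScore := -1
	for _, n := range feasible {
		score := 0
		for _, prio := range prios {
			score += prio(pod, n)
		}
		if score > bestScore {
			best, bestScore = []Node{n}, score
		} else if score == bestScore {
			best = append(best, n)
		}
	}
	// Nodes tied for the highest score are chosen from at random.
	return best[rand.Intn(len(best))], nil
}

func main() {
	nodes := []Node{{"node1"}, {"node2"}, {"node3"}}
	// Toy predicate and priority reproducing the diagram above:
	// node3 is filtered out, node2 scores higher than node1.
	hasResources := func(_ Pod, n Node) bool { return n.Name != "node3" }
	priority := func(_ Pod, n Node) int {
		if n.Name == "node2" {
			return 5
		}
		return 2
	}
	chosen, err := schedule(Pod{"mypod"}, nodes, []Predicate{hasResources}, []Priority{priority})
	if err != nil {
		panic(err)
	}
	fmt.Println("scheduled onto", chosen.Name) // scheduled onto node2
}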
predicates
The descriptions below are quoted from the official design document:
- NoDiskConflict: Evaluate if a pod can fit due to the volumes it requests, and those that are already mounted. Currently supported volumes are: AWS EBS, GCE PD, ISCSI and Ceph RBD. Only Persistent Volume Claims for those supported types are checked. Persistent Volumes added directly to pods are not evaluated and are not constrained by this policy.
- NoVolumeZoneConflict: Evaluate if the volumes a pod requests are available on the node, given the Zone restrictions.
- PodFitsResources: Check if the free resource (CPU and Memory) meets the requirement of the Pod. The free resource is measured by the capacity minus the sum of requests of all Pods on the node. To learn more about the resource QoS in Kubernetes, please check QoS proposal. (A toy version of this check is sketched after this list.)
- PodFitsHostPorts: Check if any HostPort required by the Pod is already occupied on the node.
- HostName: Filter out all nodes except the one specified in the PodSpec’s NodeName field.
- MatchNodeSelector: Check if the labels of the node match the labels specified in the Pod's nodeSelector field and, as of Kubernetes v1.2, also match the scheduler.alpha.kubernetes.io/affinity pod annotation if present. See here for more details on both.
- MaxEBSVolumeCount: Ensure that the number of attached ElasticBlockStore volumes does not exceed a maximum value (by default, 39, since Amazon recommends a maximum of 40 with one of those 40 reserved for the root volume – see Amazon's documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.
- MaxGCEPDVolumeCount: Ensure that the number of attached GCE PersistentDisk volumes does not exceed a maximum value (by default, 16, which is the maximum GCE allows – see GCE’s documentation). The maximum value can be controlled by setting the KUBE_MAX_PD_VOLS environment variable.
- CheckNodeMemoryPressure: Check if a pod can be scheduled on a node reporting memory pressure condition. Currently, no BestEffort should be placed on a node under memory pressure as it gets automatically evicted by kubelet.
- CheckNodeDiskPressure: Check if a pod can be scheduled on a node reporting disk pressure condition. Currently, no pods should be placed on a node under disk pressure as it gets automatically evicted by kubelet.
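As an illustration of how a predicate works, here is a toy version of the PodFitsResources check described above. The Resources and NodeInfo types are invented for this sketch and are much simpler than the real API objects.

package main

import "fmt"

type Resources struct {
	MilliCPU int64 // CPU request in millicores
	Memory   int64 // memory request in bytes
}

type NodeInfo struct {
	Capacity Resources   // allocatable capacity of the node
	Requests []Resources // requests of the pods already on the node
}

// podFitsResources mirrors the predicate's rule: free capacity is the
// node's capacity minus the sum of existing requests, and the new pod
// must fit into the free capacity for every resource.
func podFitsResources(podReq Resources, node NodeInfo) bool {
	used := Resources{}
	for _, r := range node.Requests {
		used.MilliCPU += r.MilliCPU
		used.Memory += r.Memory
	}
	freeCPU := node.Capacity.MilliCPU - used.MilliCPU
	freeMem := node.Capacity.Memory - used.Memory
	return podReq.MilliCPU <= freeCPU && podReq.Memory <= freeMem
}

func main() {
	node := NodeInfo{
		Capacity: Resources{MilliCPU: 2000, Memory: 4 << 30},
		Requests: []Resources{{MilliCPU: 1500, Memory: 1 << 30}},
	}
	// 500m CPU with 1Gi memory still fits; 1000m CPU would not.
	fmt.Println(podFitsResources(Resources{MilliCPU: 500, Memory: 1 << 30}, node))  // true
	fmt.Println(podFitsResources(Resources{MilliCPU: 1000, Memory: 1 << 30}, node)) // false
}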
Among these predicates, MatchNodeSelector is what lets a pod be assigned to specific nodes: you attach labels to the nodes, and the pod's nodeSelector is matched against them. With node affinity you can even configure placement policies between pods. See https://kubernetes.io/docs/user-guide/node-selection/ for details.
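The nodeSelector match itself is just a label-subset test, as the short sketch below shows. (The real predicate additionally evaluates the affinity annotation mentioned above; that part is omitted here.)

package main

import "fmt"

// matchesNodeSelector reports whether every key/value pair in the
// pod's nodeSelector appears among the node's labels.
func matchesNodeSelector(nodeSelector, nodeLabels map[string]string) bool {
	for k, v := range nodeSelector {
		if nodeLabels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	nodeLabels := map[string]string{"disktype": "ssd", "zone": "us-east-1a"}
	fmt.Println(matchesNodeSelector(map[string]string{"disktype": "ssd"}, nodeLabels)) // true
	fmt.Println(matchesNodeSelector(map[string]string{"disktype": "hdd"}, nodeLabels)) // false
}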
priorities policies
After a set of qualifying nodes has been selected, the scheduler computes a weight for each node according to the priority policies; these weights decide which node the pod is finally assigned to.
Currently, Kubernetes provides the following priority policies (quoted from the official design document):
- LeastRequestedPriority: The node is prioritized based on the fraction of the node that would be free if the new Pod were scheduled onto the node. (In other words, (capacity - sum of requests of all Pods already on the node - request of Pod that is being scheduled) / capacity). CPU and memory are equally weighted. The node with the highest free fraction is the most preferred. Note that this priority function has the effect of spreading Pods across the nodes with respect to resource consumption. (A small sketch of this computation follows the list.)
- BalancedResourceAllocation: This priority function tries to put the Pod on a node such that the CPU and Memory utilization rate is balanced after the Pod is deployed.
- SelectorSpreadPriority: Spread Pods by minimizing the number of Pods belonging to the same service, replication controller, or replica set on the same node. If zone information is present on the nodes, the priority will be adjusted so that pods are spread across zones and nodes.
- CalculateAntiAffinityPriority: Spread Pods by minimizing the number of Pods belonging to the same service on nodes with the same value for a particular label.
- ImageLocalityPriority: Nodes are prioritized based on locality of images requested by a pod. Nodes with larger size of already-installed packages required by the pod will be preferred over nodes with no already-installed packages required by the pod or a small total size of already-installed packages required by the pod.
- NodeAffinityPriority: (Kubernetes v1.2) Implements preferredDuringSchedulingIgnoredDuringExecution node affinity; see here for more details.
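As an example, here is a small sketch of the LeastRequestedPriority formula quoted above: the free fraction per resource is (capacity - existing requests - the new pod's request) / capacity, with CPU and memory weighted equally. Scaling the fraction to a 0-10 score is an assumption made for this sketch, matching the convention the scheduler uses for priority functions; the parameters are simplified stand-ins for the real resource objects.

package main

import "fmt"

// leastRequestedScore returns a 0-10 score: 10 means the node would be
// completely free after placing the pod, 0 means fully requested.
func leastRequestedScore(podCPU, podMem, usedCPU, usedMem, capCPU, capMem int64) int64 {
	// frac computes the free fraction for one resource, scaled to 0-10.
	frac := func(request, used, capacity int64) int64 {
		free := capacity - used - request
		if free < 0 {
			return 0
		}
		return free * 10 / capacity
	}
	cpuScore := frac(podCPU, usedCPU, capCPU)
	memScore := frac(podMem, usedMem, capMem)
	return (cpuScore + memScore) / 2 // CPU and memory equally weighted
}

func main() {
	// Node with 2000m CPU / 4Gi memory, 500m / 1Gi already requested,
	// scoring a pod that asks for 500m CPU and 1Gi of memory.
	fmt.Println(leastRequestedScore(500, 1<<30, 500, 1<<30, 2000, 4<<30)) // 5
}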