Scheduling in Kubernetes




Pods are the smallest deployable units in Kubernetes and are where our applications run. Scheduling is a core part of Kubernetes: its job is to place each Pod on a correct, available node. If you want to understand why Pods are placed onto a particular Node, or want to learn the different ways scheduling can be controlled, then this chapter is for you!



kube-scheduler

kube-scheduler is the default scheduler for Kubernetes and runs as part of the control plane. kube-scheduler is designed so that, if you want and need to, you can write your own scheduling component and use that instead.

kube-scheduler selects an optimal node to run newly created or not yet scheduled (unscheduled) pods. Since containers in pods - and pods themselves - can have different requirements, the scheduler filters out any nodes that don’t meet a Pod’s specific scheduling needs. Alternatively, the API lets you specify a node for a Pod when you create it, but this is unusual and is only done in special cases.

In a cluster, Nodes that meet the scheduling requirements for a Pod are called feasible nodes. If none of the nodes are suitable, the pod remains unscheduled until the scheduler is able to place it.

The scheduler finds feasible Nodes for a Pod and then runs a set of functions to score the feasible Nodes and picks a Node with the highest score among the feasible ones to run the Pod. The scheduler then notifies the API server about this decision in a process called binding.
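
You can see the outcome of filtering, scoring, and binding in a Pod’s events. The output below is only illustrative (the Pod name my-pod is hypothetical, and the exact wording varies by Kubernetes version):

kubectl describe pod my-pod
...
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  10s   default-scheduler  Successfully assigned default/my-pod to aks-usernodepool-33612472-vmss000003

If no feasible node exists, a FailedScheduling event explains which filters rejected the candidate nodes.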

Factors that need to be taken into account for scheduling decisions include individual and collective resource requirements, hardware / software / policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference, and so on.



Overview of node selection in kube-scheduler

kube-scheduler selects a node for the pod in a 2-step operation:

  1. Filtering
  2. Scoring

The filtering step finds the set of Nodes where it’s feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resources to meet a Pod’s specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn’t (yet) schedulable.
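
As a concrete illustration, the hypothetical Pod spec below declares resource requests; during filtering, any node that cannot offer 500m of CPU and 256Mi of memory from its allocatable capacity is removed from the candidate list:

apiVersion: v1
kind: Pod
metadata:
  name: resource-request-demo        # hypothetical name
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      requests:
        cpu: "500m"      # the node must have at least this much unreserved CPU
        memory: "256Mi"  # and at least this much unreserved memory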

In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.

Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more than one node with equal scores, kube-scheduler selects one of these at random.
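
If you operate kube-scheduler yourself, scoring can be tuned through a scheduler configuration file passed with --config. The following is a minimal sketch, assuming a self-managed control plane and Kubernetes 1.25+ for the v1 config API (managed offerings such as AKS generally do not expose this file); it re-enables the NodeResourcesFit score plugin with a higher weight so resource fit counts more heavily in the ranking:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      disabled:
      - name: NodeResourcesFit   # disable the default registration...
      enabled:
      - name: NodeResourcesFit   # ...and re-enable it with a higher weight
        weight: 2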



Use nodeName to schedule the Pod

A scheduler watches for newly created pods and finds the best node for their assignment. It chooses the optimal node based on Kubernetes’ scheduling principles and your configuration options.

The simplest configuration option is setting the nodeName field directly in the Pod spec, as follows:

root@AlexRampUpVM-01:~# kubectl get node
NAME                                   STATUS                     ROLES   AGE   VERSION
aks-nodepool1-14102961-vmss000002      Ready                      agent   25d   v1.26.6
aks-usernodepool-33612472-vmss000003   Ready                      agent   25d   v1.26.6
akswin1000002                          Ready,SchedulingDisabled   agent   25d   v1.26.6

root@AlexRampUpVM-01:/tmp# cat schedulingtest.yaml
apiVersion: v1
kind: Pod
metadata:
  name: schedulingtest
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: aks-usernodepool-33612472-vmss000003
 
root@AlexRampUpVM-01:/tmp# kubectl apply -f schedulingtest.yaml
pod/schedulingtest created

root@AlexRampUpVM-01:/tmp# kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP            NODE                                   NOMINATED NODE   READINESS GATES
schedulingtest                   1/1     Running   0          6s      10.243.0.21   aks-usernodepool-33612472-vmss000003   <none>           <none>

The schedulingtest pod above runs on aks-usernodepool-33612472-vmss000003 because nodeName pins it to that node, bypassing the scheduler entirely. However, nodeName has serious limitations that can leave Pods non-functional: the named node may not exist (for example, after the cloud provider replaces it), may not have enough resources, or may be suffering from intermittent network problems. For these reasons, you should not use nodeName outside of testing or development.



Use nodeSelector to schedule the Pod

Labels and selectors are key concepts in Kubernetes that allow you to organize and categorize objects, such as pods, services, and nodes, and perform targeted operations on them. Labels are key-value pairs attached to Kubernetes objects, while selectors are used to filter and select objects based on their labels. Labels and selectors are a standard method to group things together.

Labels

  • Labels are arbitrary key-value pairs attached to Kubernetes objects to identify and categorize them.
  • They are typically used to express metadata about objects, such as their purpose, environment, version, or any other relevant information.
  • Labels are defined within the metadata section of an object, and a single object can have multiple labels assigned to it.

You can check the node labels with the following command:

root@AlexRampUpVM-01:/tmp# kubectl get node --show-labels
NAME                                   STATUS                     ROLES   AGE   VERSION   LABELS
aks-nodepool1-14102961-vmss000002      Ready                      agent   25d   v1.26.6   agentpool=nodepool1,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_B2s,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eastasia ....
aks-usernodepool-33612472-vmss000003   Ready                      agent   25d   v1.26.6   agentpool=usernodepool,beta.kubernetes.io/arch=amd64,beta.kubernetes.io/instance-type=Standard_B2ms,beta.kubernetes.io/os=linux,failure-domain.beta.kubernetes.io/region=eastasia ...
akswin1000002                          Ready,SchedulingDisabled   agent   25d   v1.26.6    ....
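
You can also attach your own labels to a node with kubectl label; the key/value below (disktype=ssd) is just an illustrative example:

kubectl label node aks-usernodepool-33612472-vmss000003 disktype=ssd

The new label then appears in kubectl get node --show-labels and can be used as a scheduling constraint.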

nodeSelector

nodeSelector is the simplest recommended form of node selection constraint. You can add the nodeSelector field to your Pod specification and specify the node labels you want the target node to have. Kubernetes only schedules the Pod onto nodes that have each of the labels you specify.

root@AlexRampUpVM-01:/tmp# cat scheduling_nodeselector.yaml
apiVersion: v1
kind: Pod
metadata:
  name: schedulingwithnodeselector
spec:
  containers:
  - name: nginx
    image: nginx
  nodeSelector:
    agentpool: usernodepool

root@AlexRampUpVM-01:/tmp# kubectl apply -f scheduling_nodeselector.yaml
pod/schedulingwithnodeselector created

root@AlexRampUpVM-01:/tmp# kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP            NODE                                   NOMINATED NODE   READINESS GATES
schedulingtest                   1/1     Running   0          58m     10.243.0.21   aks-usernodepool-33612472-vmss000003   <none>           <none>
schedulingwithnodeselector       1/1     Running   0          17s     10.243.0.6    aks-usernodepool-33612472-vmss000003   <none>           <none>

For the schedulingwithnodeselector pod above, Kubernetes Scheduler will find a node with the agentpool: usernodepool label.

The use of nodeSelector efficiently constrains pods to run on nodes with specific labels. However, it can only express exact matches on label keys and values. Kubernetes offers two more comprehensive features for expressing complicated scheduling requirements: node affinity, which is set on pods to attract them to a set of nodes, and taints and tolerations, which are set on nodes to repel a set of pods. These features are discussed below.



Use nodeAffinity to schedule the Pod

Node affinity is a set of constraints defined on pods that determine which nodes are eligible for scheduling. It’s possible to define hard and soft requirements for the pods’ node assignments using affinity rules. For instance, you can configure a pod to run only on nodes with GPUs, and preferably on one with an NVIDIA_TESLA_V100, for your deep learning workload. The scheduler evaluates the rules and tries to find a suitable node within the defined constraints. Like nodeSelector, node affinity rules work with node labels; however, they are more powerful.

There are two node affinity rules you can add to the Pod spec:

  • requiredDuringSchedulingIgnoredDuringExecution
  • preferredDuringSchedulingIgnoredDuringExecution

The rule names combine two criteria - required or preferred - with two stages: Scheduling and Execution. Rules starting with required describe hard requirements that must be met; rules starting with preferred describe soft requirements that the scheduler tries to satisfy but does not guarantee. The Scheduling stage refers to the initial assignment of the pod to a node, while the Execution stage covers what happens when node labels change after that assignment.

Because both rules end in IgnoredDuringExecution, the scheduler does not re-check them after the initial assignment: if a node’s labels change later, pods already running on it are not moved. (RequiredDuringExecution variants, which would re-enforce the rules at runtime, have been proposed but are not implemented in current Kubernetes releases.)

Check out the following example to help you grasp these affinities:

root@AlexRampUpVM-01:/tmp# kubectl get node --show-labels
NAME                                   STATUS                     ROLES   AGE     VERSION   LABELS
aks-nodepool1-14102961-vmss000002      Ready                      agent   25d     v1.26.6   ...topology.kubernetes.io/region=eastasia,topology.kubernetes.io/zone=0
aks-usernodepool-33612472-vmss000003   Ready                      agent   25d     v1.26.6   ...topology.kubernetes.io/region=eastasia,topology.kubernetes.io/zone=0
aks-usernodepool-33612472-vmss000004   Ready                      agent   8m25s   v1.26.6   ...topology.kubernetes.io/region=eastasia,topology.kubernetes.io/zone=0
akswin1000002                          Ready,SchedulingDisabled   agent   25d     v1.26.6   ...topology.kubernetes.io/region=eastasia,topology.kubernetes.io/zone=0

root@AlexRampUpVM-01:/tmp# cat scheduling_nodeaffinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: schedulingwithnodeaffinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: topology.kubernetes.io/region
            operator: In
            values:
            - eastasia
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values:
            - "1"
            - "2"
  containers:
  - name: nginx
    image: nginx


root@AlexRampUpVM-01:/tmp# kubectl apply -f scheduling_nodeaffinity.yaml
pod/schedulingwithnodeaffinity created

root@AlexRampUpVM-01:/tmp# kubectl get pod -o wide|grep scheduling
schedulingtest                   1/1     Running   0          75m     10.243.0.21   aks-usernodepool-33612472-vmss000003   <none>           <none>
schedulingwithnodeaffinity       1/1     Running   0          25s     10.243.0.17   aks-usernodepool-33612472-vmss000004   <none>           <none>
schedulingwithnodeselector       1/1     Running   0          17m     10.243.0.6    aks-usernodepool-33612472-vmss000003   <none>           <none>

The schedulingwithnodeaffinity pod above has a hard node affinity rule indicating that Kubernetes Scheduler should only place the pod on a node in the eastasia region. The second, soft rule indicates that nodes in zone “1” or “2” should be preferred.

Using affinity rules, you can make Kubernetes scheduling decisions work for your custom requirements.
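
Node affinity match expressions also support operators other than In - such as NotIn, Exists, DoesNotExist, Gt, and Lt - so you can keep pods away from certain nodes. A minimal sketch, reusing the agentpool label from the nodes above:

  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: agentpool
            operator: NotIn   # schedule anywhere except the nodepool1 agent pool
            values:
            - nodepool1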



Use Taints and Tolerations to schedule the Pod

Not all Kubernetes nodes are the same in a cluster. It’s possible to have nodes with special hardware, such as GPU, disk, or network capabilities. Similarly, you may need to dedicate some nodes for testing, data protection, or user groups. Taints can be added to the nodes to repel pods, as in the following example:

root@AlexRampUpVM-01:/tmp# kubectl taint nodes aks-usernodepool-33612472-vmss000004 test-environment=true:NoSchedule
node/aks-usernodepool-33612472-vmss000004 tainted

With the taint test-environment=true:NoSchedule in place, Kubernetes Scheduler will not assign any pod to that node unless the pod has a matching toleration in its spec:

root@AlexRampUpVM-01:/tmp# cat schedulingwithtoleration.yaml
apiVersion: v1
kind: Pod
metadata:
  name: schedulingwithtoleration
spec:
  containers:
  - name: nginx
    image: nginx
  tolerations:
  - key: "test-environment"
    operator: "Exists"
    effect: "NoSchedule"

root@AlexRampUpVM-01:/tmp# kubectl apply -f schedulingwithtoleration.yaml
pod/schedulingwithtoleration created

root@AlexRampUpVM-01:/tmp# kubectl get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP            NODE                                   NOMINATED NODE   READINESS GATES
schedulingwithtoleration         1/1     Running   0          5s      10.243.0.16   aks-usernodepool-33612472-vmss000004   <none>           <none>

Taints and tolerations work together to let Kubernetes Scheduler dedicate certain nodes to specific workloads: the taint keeps ordinary pods away, and only pods that carry a matching toleration can be scheduled onto those nodes.
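
To review or undo a taint, you can inspect the node and then remove the taint by appending a trailing dash to the same specification:

kubectl describe node aks-usernodepool-33612472-vmss000004 | grep -i taints
kubectl taint nodes aks-usernodepool-33612472-vmss000004 test-environment=true:NoSchedule-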


