Pod Scheduling
By default, the node a Pod runs on is computed by the Scheduler component using its scheduling algorithm; the process is not controlled manually.
Kubernetes provides four broad categories of scheduling:
- Automatic scheduling
- Directed scheduling: NodeName, NodeSelector
- Affinity scheduling: NodeAffinity, PodAffinity, PodAntiAffinity
- Taint (toleration) scheduling: Taints, Tolerations
Directed Scheduling
Official docs: https://kubernetes.io/zh-cn/docs/concepts/scheduling-eviction/assign-pod-node/
Directed scheduling uses the nodeName or nodeSelector field on the Pod to force its placement. With nodeName the scheduler is bypassed entirely, so the Pod is bound even if the named node does not exist.
NodeName
List the nodes:
[root@master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 3d10h v1.17.4
node1 Ready <none> 3d10h v1.17.4
node2 Ready <none> 3d10h v1.17.4
Create a pod-nodename.yaml file:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node1 # schedule this Pod onto node1
# create the namespace
[root@master ~]# kubectl create ns dev
namespace/dev created
[root@master ~]# vim pod-nodename.yaml
[root@master ~]# kubectl create -f pod-nodename.yaml
pod/pod-nodename created
# the Pod landed on node1
[root@master ~]# kubectl get pod pod-nodename -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodename 1/1 Running 0 31s 10.244.1.16 node1 <none> <none>
# delete the Pod
[root@master ~]# kubectl delete -f pod-nodename.yaml
pod "pod-nodename" deleted
[root@master ~]# vim pod-nodename.yaml
# change nodeName to node3, a node that does not exist
[root@master ~]# cat pod-nodename.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodename
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node3 # schedule this Pod onto node3
[root@master ~]# kubectl create -f pod-nodename.yaml
pod/pod-nodename created
# the Pod stays Pending and never runs: it was force-bound to the nonexistent node3
[root@master ~]# kubectl get pod pod-nodename -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodename 0/1 Pending 0 5s <none> node3 <none> <none>
NodeSelector
nodeSelector is the simplest recommended form of node selection constraint. You add the nodeSelector field to the Pod spec, listing the node labels you want the target node to have. Kubernetes only schedules the Pod onto nodes that carry every label you specify (this, too, is a hard constraint).
Add a label to each node:
[root@master mine]# kubectl label nodes node1 nodeenv=pro
node/node1 labeled
[root@master mine]# kubectl label nodes node2 nodeenv=test
node/node2 labeled
Create a pod-nodeselector.yaml file:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeselector
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeSelector:
    nodeenv: pro # schedule onto a node carrying the label nodeenv=pro
[root@master mine]# vim pod-nodeselector.yaml
[root@master mine]# kubectl create -f pod-nodeselector.yaml
pod/pod-nodeselector created
# the Pod was scheduled onto node1
[root@master mine]# kubectl get pod pod-nodeselector -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeselector 1/1 Running 0 35s 10.244.1.17 node1 <none> <none>
# delete the Pod and edit the YAML file
[root@master mine]# kubectl delete -f pod-nodeselector.yaml
pod "pod-nodeselector" deleted
[root@master mine]# vim pod-nodeselector.yaml
[root@master mine]# cat pod-nodeselector.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeselector
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeSelector:
    nodeenv: test # schedule onto a node carrying the label nodeenv=test
[root@master mine]# kubectl create -f pod-nodeselector.yaml
pod/pod-nodeselector created
# the Pod is now scheduled onto node2, which carries the nodeenv=test label
[root@master mine]# kubectl get pod pod-nodeselector -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeselector 1/1 Running 0 9s 10.244.2.12 node2 <none> <none>
Affinity Scheduling
Official docs: https://kubernetes.io/zh-cn/docs/tasks/configure-pod-container/assign-pods-nodes-using-node-affinity/
Affinity extends the NodeSelector idea: the scheduler preferentially picks nodes that satisfy the rules, but (for soft rules) can still place the Pod on a node that does not, which makes scheduling more flexible.
Affinity falls into three categories:
- nodeAffinity (node affinity): targets nodes; decides which nodes a Pod may be scheduled onto
- podAffinity (pod affinity): targets Pods; decides which existing Pods a Pod should share a topology domain with
- podAntiAffinity (pod anti-affinity): targets Pods; decides which existing Pods a Pod must NOT share a topology domain with
nodeAffinity
[root@master ~]# kubectl explain pod.spec.affinity.nodeAffinity
KIND: Pod
VERSION: v1
RESOURCE: nodeAffinity <Object>
DESCRIPTION:
Describes node affinity scheduling rules for the pod.
Node affinity is a group of node affinity scheduling rules.
FIELDS:
preferredDuringSchedulingIgnoredDuringExecution <[]Object>
     The scheduler will prefer to schedule pods to nodes that satisfy the
     affinity expressions specified by this field, but it may choose a node
     that violates one or more of the expressions. The most preferred node
     is the one with the greatest sum of weights: for each node that meets
     all of the scheduling requirements (resource requests,
     requiredDuringScheduling affinity expressions, etc.), compute a sum by
     iterating through the elements of this field and adding "weight" to
     the sum if the node matches the corresponding matchExpressions; the
     node(s) with the highest sum are the most preferred.
  Schedule preferentially onto nodes that satisfy the rules; a soft constraint (preference)
  preference  a node selector term, associated with the corresponding weight
    matchFields       a list of node selector requirements by node fields
    matchExpressions  a list of node selector requirements by node labels (recommended)
      key       the label key
      values    the label values
      operator  the operator; supports In, NotIn, Exists, DoesNotExist, Gt, Lt
  weight  preference weight, in the range 1-100
requiredDuringSchedulingIgnoredDuringExecution <Object>
     If the affinity requirements specified by this field are not met at
     scheduling time, the pod will not be scheduled onto the node. If the
     affinity requirements cease to be met at some point during pod
     execution (e.g. due to an update), the system may or may not try to
     eventually evict the pod from its node.
  The node must satisfy all of the specified rules; a hard constraint
  nodeSelectorTerms  a list of node selector terms
    matchFields       a list of node selector requirements by node fields
    matchExpressions  a list of node selector requirements by node labels (recommended)
      key       the label key
      values    the label values
      operator  the operator; supports Exists, DoesNotExist, In, NotIn, Gt, Lt
Operator usage examples:
- matchExpressions:
  - key: nodeenv # match nodes that have a label whose key is nodeenv
    operator: Exists
  - key: nodeenv # match nodes whose nodeenv label value is "xxx" or "yyy"
    operator: In
    values: ["xxx","yyy"]
  - key: nodeenv # match nodes whose nodeenv label value is greater than "xxx"
    operator: Gt
    values: ["xxx"] # Gt/Lt take a list with a single value that must parse as an integer
Create a pod-nodeaffinity-required.yaml file:
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-required
  namespace: dev # created beforehand
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity: # affinity settings
    nodeAffinity: # node affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
        nodeSelectorTerms:
        - matchExpressions: # match nodes whose nodeenv label value is in ["xxx","yyy"]
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]
# scheduling fails
[root@master tmp]# kubectl get pod -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeaffinity-required 0/1 Pending 0 75s <none> <none> <none> <none>
# check the details
[root@master tmp]# kubectl describe pod pod-nodeaffinity-required -n dev
...
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector.
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 3 node(s) didn't match node selector.
# label node1
[root@master tmp]# kubectl label nodes node1 nodeenv=xxx
node/node1 labeled
[root@master tmp]# kubectl get nodes --show-labels
NAME STATUS ROLES AGE VERSION LABELS
master Ready master 20h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=master,kubernetes.io/os=linux,node-role.kubernetes.io/master=
node1 Ready <none> 20h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1,kubernetes.io/os=linux,nodeenv=xxx
node2 Ready <none> 20h v1.17.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node2,kubernetes.io/os=linux
# recreate the Pod; now it runs, and on node1
[root@master tmp]# kubectl create -f pod-nodeaffinity-required.yaml
pod/pod-nodeaffinity-required created
[root@master tmp]# kubectl get pod -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-nodeaffinity-required 1/1 Running 0 111s 10.244.1.2 node1 <none> <none>
Notes on NodeAffinity rules:
1 If both nodeSelector and nodeAffinity are defined, both conditions must be satisfied for the Pod to run on a node
2 If nodeAffinity specifies multiple nodeSelectorTerms, a node only needs to match one of them
3 If a single nodeSelectorTerms entry contains multiple matchExpressions, a node must satisfy all of them to match
4 If the labels of a Pod's node change while the Pod is running, so that the node no longer satisfies the Pod's node affinity, the change is ignored
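The walkthrough above only exercises the hard (required) form. A soft (preferred) variant might look like the sketch below; the file name pod-nodeaffinity-preferred.yaml and the weight value are illustrative choices, not part of the walkthrough:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-nodeaffinity-preferred
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: # soft constraint: a list of weighted terms
      - weight: 1 # in the range 1-100; the node with the highest weight sum wins
        preference:
          matchExpressions: # prefer nodes whose nodeenv label value is "xxx" or "yyy"
          - key: nodeenv
            operator: In
            values: ["xxx","yyy"]
```

Unlike the required example, this Pod would still be scheduled even when no node carries a matching nodeenv label; the rule only biases the scheduler's choice.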
podAffinity
pod.spec.affinity.podAffinity
  requiredDuringSchedulingIgnoredDuringExecution   hard constraint
    namespaces     the namespace(s) of the reference Pods
    topologyKey    the scheduling topology domain
    labelSelector  label selector
      matchExpressions  a list of selector requirements by Pod labels (recommended)
        key       the label key
        values    the label values
        operator  the operator; supports In, NotIn, Exists, DoesNotExist
      matchLabels  a key/value map equivalent to multiple matchExpressions
  preferredDuringSchedulingIgnoredDuringExecution  soft constraint
    podAffinityTerm  the affinity term
      namespaces
      topologyKey
      labelSelector
        matchExpressions
          key       the label key
          values    the label values
          operator
        matchLabels
    weight  preference weight, in the range 1-100
topologyKey specifies the topology domain used at scheduling time, for example:
If set to kubernetes.io/hostname, the domain is the individual node
If set to beta.kubernetes.io/os, the domain is the node's operating system type
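The soft form in the field list above can be sketched as the following manifest; the Pod name and the weight of 50 are illustrative, and the document itself only demonstrates the hard form below:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-preferred
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: # soft constraint
      - weight: 50 # in the range 1-100
        podAffinityTerm:
          labelSelector:
            matchExpressions: # prefer nodes already running a Pod labeled podenv=pro
            - key: podenv
              operator: In
              values: ["pro"]
          topologyKey: kubernetes.io/hostname # the topology domain is the individual node
```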
First create a reference Pod, pod-podaffinity-target.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-target
  namespace: dev
  labels:
    podenv: pro # set a label
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node1 # pin the target Pod explicitly to node1
# create the Pod and check it
[root@master tmp]# kubectl create -f pod-podaffinity-target.yaml
pod/pod-podaffinity-target created
[root@master tmp]# kubectl get pods -n dev --show-labels
NAME READY STATUS RESTARTS AGE LABELS
pod-podaffinity-target 1/1 Running 0 4s podenv=pro
Create pod-podaffinity-required.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity: # affinity settings
    podAffinity: # pod affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
      - labelSelector:
          matchExpressions: # match Pods whose podenv label value is in ["pro","yyy"]
          - key: podenv
            operator: In
            values: ["pro","yyy"]
        topologyKey: kubernetes.io/hostname
[root@master tmp]# vim pod-podaffinity-required.yaml
[root@master tmp]# kubectl create -f pod-podaffinity-required.yaml
pod/pod-podaffinity-required created
[root@master tmp]# kubectl get pods -n dev --show-labels
NAME READY STATUS RESTARTS AGE LABELS
pod-podaffinity-required 1/1 Running 0 8s <none>
pod-podaffinity-target 1/1 Running 0 3m41s podenv=pro
# both Pods are scheduled onto node1
[root@master tmp]# kubectl get pods -n dev -o wide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
pod-podaffinity-required 1/1 Running 0 27s 10.244.1.4 node1 <none> <none> <none>
pod-podaffinity-target 1/1 Running 0 4m 10.244.1.3 node1 <none> <none> podenv=pro
podAntiAffinity
Create a Pod from the same file as above:
[root@master tmp]# cat pod-podaffinity-target.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podaffinity-target
  namespace: dev
  labels:
    podenv: pro # set a label
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  nodeName: node1 # pin the target Pod explicitly to node1
[root@master tmp]# kubectl create -f pod-podaffinity-target.yaml
pod/pod-podaffinity-target created
Create a new YAML file:
apiVersion: v1
kind: Pod
metadata:
  name: pod-podantiaffinity-required
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity: # affinity settings
    podAntiAffinity: # pod anti-affinity
      requiredDuringSchedulingIgnoredDuringExecution: # hard constraint
      - labelSelector:
          matchExpressions: # match Pods whose podenv label value is in ["pro"]
          - key: podenv
            operator: In
            values: ["pro"]
        topologyKey: kubernetes.io/hostname
Here the new Pod is forbidden from sharing a node with any Pod labeled podenv=pro, so after creating it the two Pods are guaranteed to land on different nodes:
[root@master tmp]# kubectl get pod -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-podaffinity-target 1/1 Running 0 6m17s 10.244.1.5 node1 <none> <none>
pod-podantiaffinity-required 1/1 Running 0 31s 10.244.2.4 node2 <none> <none>
As expected, they are not on the same node.
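Hard anti-affinity leaves a Pod Pending once every node already hosts a matching Pod. A common alternative is the soft form, which spreads Pods across nodes but still allows co-location when there is no other choice. A sketch, where the Pod name, the app=web label, and the weight are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-podantiaffinity-preferred
  namespace: dev
  labels:
    app: web
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: # soft constraint
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions: # prefer nodes NOT already running a Pod labeled app=web
            - key: app
              operator: In
              values: ["web"]
          topologyKey: kubernetes.io/hostname
```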
Taints and Tolerations
Official docs: https://kubernetes.io/zh-cn/docs/concepts/scheduling-eviction/taint-and-toleration/
Node affinity is a property of Pods that attracts them to a class of nodes (as a preference or a hard requirement). Taints are the opposite: they allow a node to repel a class of Pods.
Tolerations are applied to Pods. A toleration allows the scheduler to schedule a Pod onto a node with matching taints. Tolerations allow scheduling but do not guarantee it: the scheduler also evaluates other parameters as part of its decision.
Taints and tolerations work together to keep Pods off inappropriate nodes. One or more taints can be applied to a node, marking that the node will not accept any Pod that does not tolerate those taints.
Taints
A taint has the format key=value:effect, where key and value label the taint and effect describes what it does. Three effects are supported:
- PreferNoSchedule: Kubernetes tries to avoid scheduling Pods onto a node with this taint, unless no other node is available
- NoSchedule: Kubernetes will not schedule Pods onto a node with this taint, but Pods already running on the node are unaffected
- NoExecute: Kubernetes will not schedule Pods onto a node with this taint, and Pods already running on the node are evicted
# add a taint
kubectl taint nodes node1 key=value:effect
# remove a taint
kubectl taint nodes node1 key:effect-
# remove all taints with a given key
kubectl taint nodes node1 key-
- Prepare node1 (to make the demonstration clearer, temporarily stop node2)
- Add the taint tag=test:PreferNoSchedule to node1, then create pod1 (pod1 runs)
- Change the taint on node1 to tag=test:NoSchedule, then create pod2 (pod1 keeps running, pod2 fails)
- Change the taint on node1 to tag=test:NoExecute, then create pod3 (all 3 pods fail)
# stop node2
[root@master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master Ready master 21h v1.17.4
node1 Ready <none> 21h v1.17.4
node2 NotReady <none> 21h v1.17.4
# add the taint tag=test:PreferNoSchedule to node1, then create pod1 (pod1 runs)
[root@master ~]# kubectl taint nodes node1 tag=test:PreferNoSchedule
node/node1 tainted
[root@master ~]# kubectl run pod1 --image=nginx:1.17.1 -n dev
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deployment.apps/pod1 created
# pod1 is running
[root@master ~]# kubectl get pod -n dev
NAME READY STATUS RESTARTS AGE
pod1-7c448df459-d24jd 1/1 Running 0 31s
# change the taint on node1 (remove PreferNoSchedule, add NoSchedule)
[root@master ~]# kubectl taint nodes node1 tag:PreferNoSchedule-
node/node1 untainted
[root@master ~]# kubectl taint nodes node1 tag=test:NoSchedule
node/node1 tainted
# create pod2
[root@master ~]# kubectl get pod -n dev
NAME READY STATUS RESTARTS AGE
pod1-7c448df459-d24jd 1/1 Running 0 12m
pod2-684ccb5d4c-t555g 0/1 Pending 0 5m30s
# with the taint tag=test:NoSchedule on node1: pod1 keeps running, pod2 stays Pending
# change the taint on node1 (remove NoSchedule, add NoExecute)
[root@master ~]# kubectl taint nodes node1 tag:NoSchedule-
node/node1 untainted
[root@master ~]# kubectl taint nodes node1 tag=test:NoExecute
node/node1 tainted
# create pod3
[root@master ~]# kubectl run pod3 --image=nginx:1.17.1 -n dev
kubectl run --generator=deployment/apps.v1 is DEPRECATED and will be removed in a future version. Use kubectl run --generator=run-pod/v1 or kubectl create instead.
deployment.apps/pod3 created
[root@master ~]# kubectl get pod -n dev -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod1-7c448df459-fk54q 0/1 Pending 0 113s <none> <none> <none> <none>
pod2-684ccb5d4c-c8pw6 0/1 Pending 0 113s <none> <none> <none> <none>
pod3-6f94998d79-psvdc 0/1 Pending 0 11s <none> <none> <none> <none>
# with the taint tag=test:NoExecute on node1, all 3 pods fail: the running pod1 was evicted (note its new name suffix), and none of the Deployments' replacement Pods can be scheduled
Tolerations
A taint rejects, a toleration ignores: a node uses taints to refuse Pods, and a Pod uses tolerations to ignore that refusal.
- node1 was tainted with NoExecute above, so at this point no Pod can be scheduled onto it
- Adding a toleration lets a Pod be scheduled there anyway
Create pod-toleration.yaml:
apiVersion: v1
kind: Pod
metadata:
  name: pod-toleration
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  tolerations: # add a toleration
  - key: "tag" # the key of the taint to tolerate
    operator: "Equal" # the operator
    value: "test" # the value of the taint to tolerate
    effect: "NoExecute" # must match the effect of the taint on the node
[root@master ~]# kubectl create -f pod-toleration.yaml
pod/pod-toleration created
# only the Pod with the toleration can run on node1
[root@master ~]# kubectl get pod -n dev
NAME READY STATUS RESTARTS AGE
pod-toleration 1/1 Running 0 11s
pod1-7c448df459-fk54q 0/1 Pending 0 20m
pod2-684ccb5d4c-c8pw6 0/1 Pending 0 20m
pod3-6f94998d79-psvdc 0/1 Pending 0 18m
tolerations fields:
[root@master ~]# kubectl explain pod.spec.tolerations
.....
FIELDS:
effect <string>
# the taint effect to match; an empty value matches all effects
key <string>
# the key of the taint to tolerate; an empty value matches all keys
operator <string>
# the key-value operator; supports Equal (the default) and Exists
tolerationSeconds <integer>
# toleration period; only meaningful when effect is NoExecute, it bounds how long the Pod may stay on the node after the taint is added
value <string>
# the value of the taint to tolerate
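Combining these fields, a toleration that uses Exists (so no value is needed) together with tolerationSeconds could be sketched as follows; the Pod name and the 3600-second window are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-toleration-seconds # illustrative name
  namespace: dev
spec:
  containers:
  - name: nginx
    image: nginx:1.17.1
  tolerations:
  - key: "tag" # tolerate any taint whose key is "tag", regardless of its value
    operator: "Exists" # with Exists, no value field is given
    effect: "NoExecute"
    tolerationSeconds: 3600 # stay on the node for at most 3600s after the taint appears, then be evicted
```

This pattern is how Kubernetes itself handles node failures: Pods get default tolerations for node.kubernetes.io/not-ready and unreachable taints with a bounded tolerationSeconds.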