Kubernetes nodes are divided by role into master and node; Pods run on nodes with the node role.
When a Pod is created, the scheduler considers resources (CPU, memory) but also supports many other scheduling policies.
Labels
Labels are one of the core Kubernetes concepts. A label is a simple key-value pair attached to a resource; Pods, Services, Deployments, Nodes, and most other resources can carry labels.
View the labels on the nodes:
[root@k8s-node1 ~]# kubectl get node --show-labels
NAME STATUS ROLES AGE VERSION LABELS
k8s-node1 Ready master,node 8d v1.13.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/fluentd-ds-ready=true,beta.kubernetes.io/os=linux,cputype=intel-xeon-e5-2620-v4,gputype=nvidia-geforce-gtx-1080-ti,kubernetes.io/hostname=k8s-node1,node-role.kubernetes.io/master=k8s-node1,node-role.kubernetes.io/node=k8s-node1,pooltype=shared
k8s-node2 Ready node 8d v1.13.4 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/fluentd-ds-ready=true,beta.kubernetes.io/os=linux,cputype=intel-xeon-e5-2620-v4,gputype=nvidia-geforce-gtx-1080-ti,kubernetes.io/hostname=k8s-node2,node-role.kubernetes.io/node=k8s-node2,pooltype=shared
In the output above, the nodes carry labels such as cputype=intel-xeon-e5-2620-v4 and gputype=nvidia-geforce-gtx-1080-ti.
Set labels on nodes:
# k8s-node1 uses an ordinary spinning disk, so set disk=hdd
[root@k8s-node1 ~]# kubectl label node k8s-node1 disk=hdd
node/k8s-node1 labeled
# k8s-node2 uses a solid-state disk, so set disk=ssd
[root@k8s-node1 ~]# kubectl label node k8s-node2 disk=ssd
node/k8s-node2 labeled
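A label can later be changed or removed with the same kubectl label command; a short sketch using the disk label set above:

```shell
# change an existing label value (kubectl refuses to change it without --overwrite)
kubectl label node k8s-node1 disk=ssd --overwrite
# remove a label by appending a dash to the key
kubectl label node k8s-node1 disk-
```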
Filter nodes by label:
[root@k8s-node1 example]# kubectl get node -l 'disk=ssd'
NAME STATUS ROLES AGE VERSION
k8s-node2 Ready node 8d v1.13.4
# combine multiple label selectors
[root@k8s-node1 example]# kubectl get node -l 'disk=hdd, pooltype!=unshared'
NAME STATUS ROLES AGE VERSION
k8s-node1 Ready master,node 8d v1.13.4
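Besides equality checks, label selectors also support set-based expressions; a few examples, assuming the disk labels set above:

```shell
# nodes whose disk label has one of the listed values
kubectl get node -l 'disk in (hdd, ssd)'
# nodes that carry a disk label at all, regardless of value
kubectl get node -l 'disk'
# nodes that do NOT carry the disk label
kubectl get node -l '!disk'
```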
View the labels on other resources:
[root@k8s-node1 ~]# kubectl --namespace=admin get pod --show-labels
NAME READY STATUS RESTARTS AGE LABELS
enp183pm-session-8tf2d 1/1 Running 0 19h app=enp183pm-session,controller-uid=a0758fe2-81e8-11e9-a660-88d7f6ae9c94,job-name=enp183pm-session,taskname=enp183pm,uuid=03670773-e16f-4886-9306-de13da7c9958
v4exp-bigdata 1/1 Running 0 7d20h volume=bigdata
[root@k8s-node1 ~]#
[root@k8s-node1 ~]# kubectl --namespace=admin get service --show-labels
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE LABELS
gputask-session NodePort 10.10.124.20 <none> 8888:30043/TCP 2d1h app=gputask-session,taskname=gputask
v4exp-bigdata NodePort 10.10.142.143 <none> 80:30379/TCP 7d20h volume=bigdata
Pod scheduling
nodeSelector
This is the simplest scheduling mechanism (it is no longer recommended; node affinity can express the same constraints and more). Its usage:
- label the target nodes;
- when creating the Pod, specify nodeSelector to select nodes by those labels.
View the current nodes:
[root@k8s-node1 example]# kubectl get node --label-columns=disk
NAME STATUS ROLES AGE VERSION DISK
k8s-node1 Ready master,node 8d v1.13.4 hdd
k8s-node2 Ready node 8d v1.13.4 ssd
Now create a Pod that should be scheduled onto a node labeled disk=ssd:
[root@k8s-node1 example]# cat nginx-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  # labels on the Pod itself
  labels:
    k8s-app: nginx-pod
  name: nginx-pod
spec:
  # schedule this Pod only onto nodes labeled disk=ssd
  nodeSelector:
    disk: ssd
  containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      name: nginx
      protocol: TCP
[root@k8s-node1 example]# kubectl create -f nginx-pod.yaml
pod/nginx-pod created
# nginx-pod was scheduled onto k8s-node2, the node labeled disk=ssd
[root@k8s-node1 example]# kubectl get pod nginx-pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-pod 1/1 Running 0 14s 172.17.76.16 k8s-node2 <none> <none>
What happens if nodeSelector references a label that no node carries?
[root@k8s-node1 example]# cat nginx-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: nginx-pod
  name: nginx-pod
spec:
  # nodeSelector asks for disk=ceph (no node in the cluster carries this label)
  nodeSelector:
    disk: ceph
  containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      name: nginx
      protocol: TCP
[root@k8s-node1 example]# kubectl create -f nginx-pod.yaml
pod/nginx-pod created
# the Pod stays stuck in Pending
[root@k8s-node1 example]# kubectl get pod nginx-pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-pod 0/1 Pending 0 7s <none> <none> <none> <none>
# the detailed description shows that neither node matched the node selector
[root@k8s-node1 example]# kubectl describe pod nginx-pod
...
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 14s (x2 over 14s) default-scheduler 0/2 nodes are available: 2 node(s) didn't match node selector.
nodeAffinity
As the name suggests, nodeAffinity expresses node affinity (and its counterpart, anti-affinity): it controls which nodes a Pod may be scheduled onto. It is more flexible than nodeSelector and supports simple logical combinations.
nodeAffinity comes in two forms:
- requiredDuringSchedulingIgnoredDuringExecution: a hard requirement; if no node satisfies it, the Pod stays unscheduled (Pending)
- preferredDuringSchedulingIgnoredDuringExecution: a soft preference; even if no node satisfies it, the Pod is still scheduled
IgnoredDuringExecution means both rules apply only at scheduling time: if a node's labels change after a Pod is already running there, so that the rule no longer holds, the running Pod is not affected.
Now create a Pod scheduled onto a disk=ssd node via nodeAffinity:
[root@k8s-node1 example]# cat nginx-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: nginx-pod
  name: nginx-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disk
            operator: In
            values:
            - ssd
  containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      name: nginx
      protocol: TCP
[root@k8s-node1 example]# kubectl create -f nginx-pod.yaml
pod/nginx-pod created
[root@k8s-node1 example]# kubectl get pod nginx-pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-pod 1/1 Running 0 4s 172.17.76.16 k8s-node2 <none> <none>
To see requiredDuringSchedulingIgnoredDuringExecution fail, change the disk value to ceph: the Pod stays Pending, exactly as with nodeSelector (output omitted).
To see how preferredDuringSchedulingIgnoredDuringExecution behaves when unsatisfied, change the disk value to ceph:
[root@k8s-node1 example]# cat nginx-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: nginx-pod
  name: nginx-pod
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: disk
            operator: In
            values:
            - ceph
  containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      name: nginx
      protocol: TCP
[root@k8s-node1 example]# kubectl create -f nginx-pod.yaml
pod/nginx-pod created
# even though no node is labeled disk=ceph, the Pod is still scheduled
[root@k8s-node1 example]# kubectl get pod nginx-pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx-pod 1/1 Running 0 2s 172.17.76.16 k8s-node2 <none> <none>
As the YAML shows, nodeAffinity supports combined expressions and a richer set of operators:
- In: the label's value is in a given list
- NotIn: the label's value is not in a given list
- Gt: the label's value is greater than a given value
- Lt: the label's value is less than a given value
- Exists: the label exists
- DoesNotExist: the label does not exist
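A sketch combining several of these operators in one matchExpressions block (cpu-count is a hypothetical label, not one set earlier; Gt/Lt compare values as integers, and multiple expressions in one term are ANDed):

```yaml
requiredDuringSchedulingIgnoredDuringExecution:
  nodeSelectorTerms:
  - matchExpressions:
    # node must carry a cputype label, whatever its value
    - key: cputype
      operator: Exists
    # and its (hypothetical) cpu-count label must be greater than 8
    - key: cpu-count
      operator: Gt
      values:
      - "8"
```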
podAffinity
podAffinity is similar to nodeAffinity; the difference is that nodeAffinity selects by node labels, while podAffinity selects by the labels of Pods already running on a node. An example:
First create a Pod labeled k8s-app=nginx-pod; it will be scheduled onto an arbitrary node.
[root@k8s-node1 example]# cat nginx-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: nginx-pod
  name: nginx-pod
spec:
  containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      name: nginx
      protocol: TCP
[root@k8s-node1 example]# kubectl create -f nginx-pod.yaml
pod/nginx-pod created
# the Pod happened to land on k8s-node2
[root@k8s-node1 example]# kubectl get pod nginx-pod -owide --show-labels
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES LABELS
nginx-pod 1/1 Running 0 39s 172.17.76.16 k8s-node2 <none> <none> k8s-app=nginx-pod
Now create a second Pod that should not be co-located with Pods labeled k8s-app=nginx-pod:
[root@k8s-node1 example]# cat mysql-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: mysql-pod
  name: mysql-pod
spec:
  affinity:
    # podAntiAffinity expresses anti-affinity: keep this Pod away from the selected Pods
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          # avoid nodes already running a Pod labeled k8s-app=nginx-pod
          labelSelector:
            matchExpressions:
            - key: k8s-app
              operator: In
              values:
              - nginx-pod
          topologyKey: kubernetes.io/hostname
  containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      name: nginx
      protocol: TCP
[root@k8s-node1 example]# kubectl create -f mysql-pod.yaml
pod/mysql-pod created
# mysql-pod was scheduled onto k8s-node1, away from nginx-pod
[root@k8s-node1 example]# kubectl get pod mysql-pod -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mysql-pod 1/1 Running 0 7s 172.17.86.12 k8s-node1 <none> <none>
podAffinity and podAntiAffinity can also be combined, e.g. schedule together with A but away from B.
A common use case: a Deployment with multiple replicas where each replica should land on a different node; podAntiAffinity expresses exactly that.
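A minimal sketch of that use case, assuming a 2-replica nginx Deployment (the names are illustrative, not from the examples above):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deploy          # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: nginx-deploy
  template:
    metadata:
      labels:
        k8s-app: nginx-deploy
    spec:
      affinity:
        podAntiAffinity:
          # refuse any node that already runs a Pod of this Deployment,
          # so each replica ends up on a different node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: k8s-app
                operator: In
                values:
                - nginx-deploy
            topologyKey: kubernetes.io/hostname
      containers:
      - name: nginx
        image: nginx:latest
```

Note that with the hard (required) form, asking for more replicas than there are nodes leaves the extra replicas Pending; use the preferred form if that is not acceptable.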
Taints
Node affinity, whether the hard (required) or soft (preferred) form, attracts Pods toward the desired nodes. Taints work in the opposite direction: once a node is tainted, no Pod is scheduled onto it unless the Pod declares a matching toleration.
Typical scenarios: a node is reserved for a special purpose, or is about to undergo maintenance, and ordinary Pods should be kept off it.
[root@k8s-node1 example]# kubectl taint nodes k8s-node1 key=value:NoSchedule
node/k8s-node1 tainted
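A Pod can opt back in to the tainted node by declaring a toleration; a spec fragment matching the key=value:NoSchedule taint set above:

```yaml
# Pod spec fragment: tolerate the key=value:NoSchedule taint set above
tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
```

To remove the taint again, append a dash to the taint: kubectl taint nodes k8s-node1 key:NoSchedule-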