Kubernetes Resource Scheduling
Scheduling
Workflow of creating a Pod
Kubernetes is built on a list-watch controller architecture, which decouples the interaction between components.
Each component watches the resources it is responsible for; when those resources change, kube-apiserver notifies the watching components, much like a publish/subscribe system.
Main Pod attributes that affect scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: default
spec:
  ...
    containers:                        ## under spec.template.spec
    - image: lizhenliang/java-demo
      name: java-demo
      imagePullPolicy: Always
      livenessProbe:
        initialDelaySeconds: 30
        periodSeconds: 20
        tcpSocket:
          port: 8080
      resources: {}                    ## basis for resource-based scheduling
    restartPolicy: Always
    schedulerName: default-scheduler   ## this and the fields below control scheduling
    nodeName: ""
    nodeSelector: {}
    affinity: {}
    tolerations: []
How resource limits affect Pod scheduling
Container resource limits (upper bound):
- resources.limits.cpu
- resources.limits.memory
Minimum resources a container requires; used as the basis for resource allocation when scheduling the container:
- resources.requests.cpu
- resources.requests.memory
CPU units: either in millicores (m) or as a decimal, e.g. 0.5 = 500m, 1 = 1000m
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
  - name: web
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"
        cpu: "500m"
K8s uses the Request values to find a Node with enough allocatable resources and schedules the Pod there.
nodeSelector & nodeAffinity
nodeSelector: schedules a Pod onto Nodes whose labels match; if no Node carries a matching label, scheduling fails.
Purpose:
- Constrain a Pod to run on specific nodes
- Exact match against node labels
Use cases:
- Dedicated nodes: group Nodes by business line
- Special hardware: some Nodes have SSD disks or GPUs
Example: ensure a Pod is assigned to a node that has an SSD disk
Step 1: add a label to the node
Format: kubectl label nodes <node-name> <label-key>=<label-value>
Example: kubectl label nodes k8s-node1 disktype=ssd
Verify: kubectl get nodes --show-labels
Step 2: add a nodeSelector field to the Pod spec
Finally, verify:
kubectl get pods -o wide
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
spec:
  nodeSelector:
    disktype: "ssd"
  containers:
  - name: nginx
    image: nginx:1.19
nodeAffinity: node affinity. Serves the same purpose as nodeSelector but is more flexible and supports richer conditions, such as:
- More expressive matching logic, not just exact string equality
- Rules can be soft or hard policies, rather than only hard requirements
- Hard (required): must be satisfied
- Soft (preferred): best effort, not guaranteed
Operators: In, NotIn, Exists, DoesNotExist, Gt, Lt
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values:
            - nvidia-tesla
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: group
            operator: In
            values:
            - ai
  containers:
  - name: web
    image: nginx
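The manifest above only exercises the In operator. As a sketch of the other operators listed earlier, a required term can combine Exists and Gt (Gt/Lt compare the label value as an integer; the cpu-cores label key here is purely illustrative). This fragment would sit under a Pod's spec:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu            # node only needs to carry this label key
          operator: Exists
        - key: cpu-cores      # hypothetical label; its value is compared numerically
          operator: Gt
          values:
          - "8"
```

Both expressions in one matchExpressions list must match (logical AND), while multiple entries under nodeSelectorTerms are ORed.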
Taint & Tolerations
Taints: keep Pods away from specific Nodes
Tolerations: allow a Pod to be scheduled onto a Node that carries matching Taints
Use cases:
- Dedicated nodes: group Nodes by business line; by default no Pods are scheduled onto them, and only Pods with a matching toleration may be assigned
- Special hardware: some Nodes have SSD disks or GPUs; by default no Pods are scheduled onto them, and only Pods with a matching toleration may be assigned
- Taint-based eviction
Step 1: add a taint to the node
Format: kubectl taint node [node] key=value:[effect]
Example: kubectl taint node k8s-node1 gpu=yes:NoSchedule
Verify: kubectl describe node k8s-node1 | grep Taint
where [effect] can be:
- NoSchedule: Pods will never be scheduled here
- PreferNoSchedule: avoid scheduling here if possible; a toleration is not strictly required
- NoExecute: not only blocks scheduling, but also evicts Pods already running on the Node
Step 2: add a tolerations field to the Pod spec
To remove a taint:
Format: kubectl taint node [node] key:[effect]-
Example: kubectl taint node k8s-node1 gpu:NoSchedule-
apiVersion: v1
kind: Pod
metadata:
  name: pod-taints
spec:
  containers:
  - name: pod-taints
    image: busybox:latest
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "yes"
    effect: "NoSchedule"
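The taint-based eviction use case mentioned above pairs the NoExecute effect with tolerationSeconds: the Pod tolerates the taint for a limited time and is evicted afterwards. A minimal sketch (node.kubernetes.io/unreachable is the taint the node controller places on unreachable nodes):

```yaml
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 300   # evict the Pod 300s after the taint appears
```

Without tolerationSeconds, a NoExecute toleration lets the Pod stay on the tainted node indefinitely.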
nodeName
nodeName: specifies a node by name, binding the Pod directly to that Node and bypassing the scheduler entirely.
apiVersion: v1
kind: Pod
metadata:
  name: pod-example
  labels:
    app: nginx
spec:
  nodeName: k8s-node2
  containers:
  - name: nginx
    image: nginx:1.15
Resource Scheduling
Scheduling policies in Kubernetes fall into two broad categories: global policies and runtime policies. Global policies are configured when the scheduler starts, while runtime policies mainly include node selection (nodeSelector), node affinity (nodeAffinity), and pod affinity/anti-affinity (podAffinity and podAntiAffinity). Node affinity, podAffinity/podAntiAffinity, and the taints and tolerations introduced later were all in Beta as of Kubernetes 1.6.
Setting node labels
Labels are one of the core concepts in Kubernetes. They are attached as key/value pairs to all kinds of objects, such as Pods, Services, Deployments, and Nodes, in order to identify those objects and manage relationships between them, e.g. the association between a Node and a Pod.
List all nodes in the current cluster:
[root@master ~]# kubectl get nodes
NAME STATUS ROLES AGE VERSION
master.example.com Ready control-plane,master 2d2h v1.23.1
node1.example.com Ready <none> 2d2h v1.23.1
node2.example.com Ready <none> 2d2h v1.23.1
View a node's default labels:
[root@master ~]# kubectl get node node1.example.com --show-labels
NAME STATUS ROLES AGE VERSION LABELS
node1.example.com Ready <none> 2d2h v1.23.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1.example.com,kubernetes.io/os=linux
Set a label on a specific node:
[root@master ~]# kubectl label node node1.example.com disktype=ssd
node/node1.example.com labeled
Confirm the node label was set successfully:
[root@master ~]# kubectl get node node1.example.com --show-labels
NAME STATUS ROLES AGE VERSION LABELS
node1.example.com Ready <none> 2d2h v1.23.1 beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,disktype=ssd,kubernetes.io/arch=amd64,kubernetes.io/hostname=node1.example.com,kubernetes.io/os=linux
[root@master ~]# kubectl get nodes -l disktype=ssd
NAME STATUS ROLES AGE VERSION
node1.example.com Ready <none> 2d2h v1.23.1
Selecting a node (nodeSelector)
nodeSelector is the simplest form of runtime Pod scheduling constraint and has been available since the earliest Kubernetes releases. Pod.spec.nodeSelector selects nodes through Kubernetes' label-selector mechanism: the scheduler matches against node labels and then places the Pod on a matching node. The match is a hard constraint. Since nodeAffinity (covered below) provides a superset of nodeSelector's functionality, Kubernetes may deprecate nodeSelector in the future.
nodeSelector example:
Set the label:
[root@master ~]# kubectl label node node1.example.com disktype=ssd
node/node1.example.com labeled
List nodes that are not masters and have disktype=ssd:
[root@master ~]# kubectl get nodes -l 'role!=master, disktype=ssd'
NAME STATUS ROLES AGE VERSION
node1.example.com Ready <none> 2d2h v1.23.1
pod.yml file contents:
[root@master ~]# vi pod.yml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd
Create the pod:
[root@master ~]# kubectl apply -f pod.yml
pod/nginx created
Confirm that pod nginx was scheduled onto the expected node:
[root@master ~]# kubectl get pod nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 9m35s 10.244.1.31 node1.example.com <none> <none>
Note: if the Pod is not in the default namespace, specify the namespace explicitly, for example:
kubectl -n kube-system get pods -o wide
Built-in label example
pod.yml file contents:
[root@master ~]# vi pod.yml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    kubernetes.io/hostname: node1.example.com
Create the pod and check that, as expected, it is scheduled onto the preset node node1.example.com:
[root@master ~]# kubectl apply -f pod.yml
pod/nginx unchanged
[root@master ~]# kubectl get pod nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 63s 10.244.1.32 node1.example.com <none> <none>
Affinity and anti-affinity
The nodeSelector described above constrains Pod placement in a very simple way: a hard label match against nodes. Affinity and anti-affinity steer Pods onto the desired nodes far more flexibly. Compared with nodeSelector, their advantages are:
- A richer matching language, no longer limited to hard, exact matches.
- Scheduling rules are no longer only hard constraints; soft rules and preferences are also supported.
- Rules can state which Pods may (or may not) be co-located in the same topology domain.
There are three types of affinity: nodeAffinity plus inter-pod affinity/anti-affinity (podAffinity/podAntiAffinity); they are described in detail below.
Node affinity
Node affinity was introduced as alpha in Kubernetes 1.2 and is a superset of nodeSelector. It comes in two flavors: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. The former is a hard constraint: if no node satisfies the Pod's requirements, scheduling fails. The latter is a soft constraint or preference: the scheduler tries to satisfy it but will still place the Pod elsewhere if it cannot. In both cases, "IgnoredDuringExecution" means that if a node's labels later change so that it no longer matches, Pods already running there keep running.
Node affinity example
Set node labels:
[root@master ~]# kubectl label nodes node1.example.com cpu=high
node/node1.example.com labeled
[root@master ~]# kubectl label node node1.example.com disktype=ssd
node/node1.example.com labeled
[root@master ~]# kubectl label nodes node2.example.com cpu=low
node/node2.example.com labeled
The goal is to place the Pod on a machine with an SSD disk (disktype=ssd) and a high-spec CPU (cpu=high).
List nodes satisfying the conditions:
[root@master ~]# kubectl get nodes -l 'cpu=high, disktype=ssd'
NAME STATUS ROLES AGE VERSION
node1.example.com Ready <none> 2d3h v1.23.1
pod.yml file contents:
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: cpu
            operator: In
            values:
            - high
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
Check that the result matches expectations: pod nginx lands on the machine with an SSD disk and a high-spec CPU.
[root@master ~]# kubectl get pod nginx -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nginx 1/1 Running 0 27s 10.244.1.33 node1.example.com
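The inter-pod affinity/anti-affinity mentioned earlier works like node affinity but keys off the labels of Pods already running in a topology domain rather than node labels. A hedged sketch (the app=cache and app=web labels are illustrative): this Pod must run on a node that already hosts an app=cache Pod, and prefers to avoid nodes running other app=web Pods.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
  labels:
    app: web
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - cache              # co-locate with Pods labeled app=cache
        topologyKey: kubernetes.io/hostname
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - web              # spread away from other app=web Pods
          topologyKey: kubernetes.io/hostname
  containers:
  - name: web
    image: nginx
```

topologyKey defines the domain in which the rule applies; kubernetes.io/hostname scopes it to a single node, while a zone label would scope it to an availability zone.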
Taints and tolerations
Node affinity, whether as a hard constraint or a preference, attracts Pods to the desired nodes. Taints do the opposite: if a node is marked with a taint, no Pod is scheduled onto it unless the Pod explicitly tolerates that taint. Taints and tolerations were in beta at the time of writing.
Typical uses for tainted nodes include reserving the Kubernetes master nodes for system components, or reserving a group of nodes with special resources for particular Pods; ordinary Pods will not be scheduled onto tainted nodes.
Tainting a node, for example:
[root@master ~]# kubectl taint node node1.example.com cpu=high:NoSchedule
node/node1.example.com tainted
[root@master ~]# kubectl apply -f pod.yml
pod/nginx created
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 0/1 Pending 0 6s
If you still want a Pod to be scheduled onto a tainted node, the Pod spec must define a matching toleration, for example:
[root@master ~]# vim pod.yml
---
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype
            operator: In
            values:
            - ssd
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: cpu
            operator: In
            values:
            - high
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  tolerations:
  - key: "cpu"
    operator: "Equal"
    value: "high"
    effect: "NoSchedule"
[root@master ~]# kubectl apply -f pod.yml
pod/nginx configured
[root@master ~]# kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx 1/1 Running 0 3m48s
effect has three options, set according to need:
- NoSchedule: Pods will not be scheduled onto the tainted node.
- PreferNoSchedule: the "soft" or "preference" version of NoSchedule.
- NoExecute: once the taint takes effect, Pods already running on the node that lack a matching toleration are evicted.