Taints and Tolerations (taint && toleration)
Taints and tolerations enable exclusive scheduling: a tainted node only accepts Pods that carry a matching toleration.
kubectl taint node k8s-master01 master-test=test:NoSchedule
# View node details; the Taints field shows the taint applied above
kubectl describe node k8s-master01
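To check only the taints without reading the whole describe output, either of these also works:
# Filter the Taints field out of the describe output
kubectl describe node k8s-master01 | grep -i taints
# Or read the taints straight from the node spec
kubectl get node k8s-master01 -o jsonpath='{.spec.taints}'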
Add a nodeSelector to the Pod spec:
nodeSelector:
  kubernetes.io/hostname: k8s-master01
At this point the Pod stays Pending and is not scheduled (with multiple replicas, some of the Pods will sit in Pending).
Check why it is Pending:
kubectl describe po nginx-deployment-745cccf67b-wbbjz
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 36s (x3 over 117s) default-scheduler 0/5 nodes are available: 1 node(s) had taint {master-test: test}, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity/selector.
Adding a toleration allows the Pods to be scheduled:
tolerations:
- key: master-test
  value: test
  effect: NoSchedule
  operator: Equal
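For context, a minimal Deployment sketch combining the nodeSelector and the toleration might look like this (name, labels, and image are placeholders):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      nodeSelector:
        kubernetes.io/hostname: k8s-master01
      tolerations:
      - key: master-test
        value: test
        effect: NoSchedule
        operator: Equal
      containers:
      - name: nginx
        image: nginx:1.14.2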
All three Pods are scheduled onto k8s-master01:
nginx-deployment-54df5b659b-9zcf4 1/1 Running 0 36s 172.25.244.222 k8s-master01 <none> <none>
nginx-deployment-54df5b659b-fwtg6 1/1 Running 0 31s 172.25.244.224 k8s-master01 <none> <none>
nginx-deployment-54df5b659b-lpgs7 1/1 Running 0 33s 172.25.244.223 k8s-master01 <none> <none>
Taint effects:
NoSchedule: new Pods that do not tolerate the taint are not scheduled onto the node.
NoExecute: Pods that do not tolerate the taint are evicted from the node immediately.
PreferNoSchedule: the scheduler tries not to place intolerant Pods on the node, but may if it has to (a "soft" taint; a sketch of applying it follows below).
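A soft taint uses the same kubectl taint syntax as the other effects; the node name, key, and value below are placeholders:
kubectl taint node <node-name> dedicated=gpu:PreferNoSchedule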
Remove a taint:
kubectl taint node k8s-master01 master-test:NoSchedule-
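To remove every effect of a key at once, the key alone followed by a dash also works:
kubectl taint node k8s-master01 master-test-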
When a node carries multiple taints, a Pod must tolerate all of them to be scheduled there. Tolerations can be written in several ways:
tolerations:
- key: master-test
  effect: NoSchedule
  operator: Exists
This tolerates a taint with key master-test and effect NoSchedule regardless of its value (with Exists the value field must be left empty).
tolerations:
- operator: Exists
This tolerates every taint; rarely needed in practice.
tolerations:
- key: master-test
  operator: Exists
This tolerates any taint with key master-test, regardless of value and effect, so a single entry can cover several taints that share the key (see the control-plane example below).
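The same pattern is how workloads are commonly allowed onto control-plane nodes, which kubeadm taints with node-role.kubernetes.io/control-plane:NoSchedule (older clusters use node-role.kubernetes.io/master:NoSchedule):
tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists
  effect: NoSchedule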
Apply two taints to k8s-master01:
kubectl taint node k8s-master01 master-test=test:NoSchedule
kubectl taint node k8s-master01 master-test=test:NoExecute
The Pod now needs to tolerate both:
tolerations:
- key: master-test
  value: test
  effect: NoSchedule
  operator: Equal
- key: master-test
  value: test
  effect: NoExecute
  operator: Equal
  tolerationSeconds: 60
Once a NoExecute taint lands on a node, Pods that do not tolerate it are evicted immediately. tolerationSeconds bounds how long a tolerating Pod may keep running on the tainted node, which is useful for transient issues such as network flapping: you can wait a few minutes and remove the taint once the node recovers; otherwise the Pod is evicted and rescheduled when the time expires.
Note that value must be left empty when operator is Exists; with Equal (the default) a value is required. Supplying a value together with Exists is rejected:
# deployments.apps "nginx-deployment" was not valid:
# * spec.template.spec.tolerations[1].operator: Invalid value: core.Toleration{Key:"master-test", Operator:"Exists", Value:"test", Effect:"NoExecute", TolerationSeconds:(*int64)(nil)}: value must be empty when `operator` is 'Exists'
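If operator: Exists is what you want for the NoExecute taint, drop the value and keep only key, effect, and (optionally) tolerationSeconds; a minimal sketch:
tolerations:
- key: master-test
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 60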
With the Equal-based tolerations applied, describing the rescheduled Pod shows them:
kubectl describe pod nginx-deployment-6cb78fcb79-96n7v
Node-Selectors: <none>
Tolerations: master-test=test:NoSchedule
master-test=test:NoExecute for 60s
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
The last two entries mean the Pod tolerates the node being not-ready or unreachable for 300s before being evicted.
The default value of operator is Equal.
Other taints that Kubernetes applies automatically: see the reference link.
initContainers: usage and notes
Init containers run to completion before the app containers start; a postStart hook is not guaranteed to run before the entrypoint, but an init container is.
initContainers:
- image: alpine:3.6
  command: ["/sbin/sysctl", "-w", "vm.max_map_count=262144"]
  name: init-es-log
  securityContext:
    privileged: true
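For context, a complete Pod sketch using this init container to raise vm.max_map_count before an Elasticsearch container starts (the Elasticsearch image tag is only an example):
apiVersion: v1
kind: Pod
metadata:
  name: es-log
spec:
  initContainers:
  - name: init-es-log
    image: alpine:3.6
    command: ["/sbin/sysctl", "-w", "vm.max_map_count=262144"]
    securityContext:
      privileged: true
  containers:
  - name: elasticsearch
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0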
Node and Pod Affinity
Affinity:
nodeAffinity: node affinity
  requiredDuringSchedulingIgnoredDuringExecution: hard affinity; the Pod must be placed on (or, with a negative operator, must be kept off) the matching nodes
  preferredDuringSchedulingIgnoredDuringExecution: soft affinity; the scheduler tries to place the Pod on (or keep it off) the matching nodes
podAffinity: pod affinity
  requiredDuringSchedulingIgnoredDuringExecution: applications A, B, C must be co-located
  preferredDuringSchedulingIgnoredDuringExecution: applications A, B, C should preferably be co-located
podAntiAffinity: pod anti-affinity (commonly used, e.g. cluster members that must not run on the same node)
  requiredDuringSchedulingIgnoredDuringExecution: applications A, B, C must not be co-located (hard requirement)
  preferredDuringSchedulingIgnoredDuringExecution: applications A, B, C should preferably not be co-located (soft requirement)
Official docs:
1. Label a node:
kubectl label nodes k8s-master02 disktype=ssd
2. Node affinity with a hard In requirement: the Pod must be scheduled onto a node labeled disktype=ssd
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: disktype
operator: In
values:
- ssd
containers:
- name: nginx
image: nginx:1.14.2
imagePullPolicy: IfNotPresent
Warning FailedScheduling 18s default-scheduler 0/5 nodes are available: 5 node(s) didn't match Pod's node affinity/selector.
Warning FailedScheduling 17s default-scheduler 0/5 nodes are available: 5 node(s) didn't match Pod's node affinity/selector.
When no node satisfies the hard requirement, the Pod cannot be scheduled at all; in that case soft (preferred) affinity can be used instead:
apiVersion: v1
kind: Pod
metadata:
name: nginx
spec:
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: disktype
operator: In
values:
- ssd
containers:
- name: nginx
image: nginx
imagePullPolicy: IfNotPresent
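After applying, kubectl get pod -o wide shows which node the scheduler actually picked; with only a preference, the Pod still runs even when no node carries the disktype=ssd label:
kubectl get pod nginx -o wide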
Note: In the preceding types, IgnoredDuringExecution means that if the node labels change after Kubernetes schedules the Pod, the Pod continues to run.
Affinity rules can be combined: the hard requirement must be satisfied first, and the soft preference is then used to rank the remaining candidate nodes.
apiVersion: v1
kind: Pod
metadata:
name: with-node-affinity
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: topology.kubernetes.io/zone
operator: In
values:
- antarctica-east1
- antarctica-west1
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 1
preference:
matchExpressions:
- key: another-node-label-key
operator: In
values:
- another-node-label-value
containers:
- name: with-node-affinity
image: registry.k8s.io/pause:2.0
In this example, the following rules apply:
- The node must have a label with the key topology.kubernetes.io/zone and the value of that label must be either antarctica-east1 or antarctica-west1.
- The node preferably has a label with the key another-node-label-key and the value another-node-label-value.
You can use the operator field to specify a logical operator for Kubernetes to use when interpreting the rules. You can use In, NotIn, Exists, DoesNotExist, Gt and Lt.
In: schedule onto nodes whose label value is one of the listed values
NotIn: do not schedule onto nodes whose label value is one of the listed values
Exists: schedule onto nodes that have the label key (no values may be given)
DoesNotExist: schedule onto nodes that do not have the label key
Gt: the label value must be greater than the given value; it must be an integer, not an arbitrary string
Lt: the label value must be less than the given value; it must be an integer, not an arbitrary string
A sketch using Gt follows below.
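Assuming nodes carry a hypothetical numeric label such as cpu-cores, Gt can express "more than 8 cores" (label name and threshold are invented for illustration):
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: cpu-cores
          operator: Gt
          values:
          - "8"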
Pod affinity and anti-affinity: podAffinity and podAntiAffinity
Background: wanting several Pods to run together in the same topology domain (e.g. the same node) is pod affinity; wanting them spread across different nodes is pod anti-affinity.
apiVersion: v1
kind: Pod
metadata:
name: with-pod-affinity
spec:
affinity:
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S1
        # If namespaces is omitted or empty, the term matches Pods in this Pod's own
        # namespace; listing namespaces here restricts matching to those namespaces
        # (an empty namespaceSelector {} matches all namespaces).
        namespaces:
        - kube-system   # match Pods in the kube-system namespace
topologyKey: topology.kubernetes.io/zone
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: security
operator: In
values:
- S2
topologyKey: topology.kubernetes.io/zone
containers:
- name: with-pod-affinity
image: nginx:1.14.2
Check which Pods carry the matching label:
kubectl get pod -n kube-system -l security=S2
With podAffinity, the Pod is placed in the same topology domain as the matched Pods (here, the same zone); with podAntiAffinity, the scheduler keeps it out of that domain.
Namespace selector
FEATURE STATE: Kubernetes v1.24 [stable]
You can also select matching namespaces using namespaceSelector, which is a label query over the set of namespaces. The affinity term is applied to namespaces selected by both namespaceSelector and the namespaces field. Note that an empty namespaceSelector ({}) matches all namespaces, while a null or empty namespaces list and null namespaceSelector matches the namespace of the Pod where the rule is defined.
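As a sketch of namespaceSelector (the label team=payments is invented for illustration), the term below only matches Pods in namespaces that carry that label:
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
  - labelSelector:
      matchLabels:
        app: store
    namespaceSelector:
      matchLabels:
        team: payments
    topologyKey: kubernetes.io/hostname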
More practical use-cases
Inter-pod affinity and anti-affinity can be even more useful when they are used with higher level collections such as ReplicaSets, StatefulSets, Deployments, etc. These rules allow you to configure that a set of workloads should be co-located in the same defined topology; for example, preferring to place two related Pods onto the same node.
For example: imagine a three-node cluster. You use the cluster to run a web application and also an in-memory cache (such as Redis). For this example, also assume that latency between the web application and the memory cache should be as low as is practical. You could use inter-pod affinity and anti-affinity to co-locate the web servers with the cache as much as possible.
In the following example Deployment for the Redis cache, the replicas get the label app=store. The podAntiAffinity rule tells the scheduler to avoid placing multiple replicas with the app=store label on a single node. This creates each cache in a separate node.
apiVersion: apps/v1
kind: Deployment
metadata:
name: redis-cache
spec:
selector:
matchLabels:
app: store
replicas: 3
template:
metadata:
labels:
app: store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: redis-server
image: redis:3.2-alpine
The following example Deployment for the web servers creates replicas with the label app=web-store. The Pod affinity rule tells the scheduler to place each replica on a node that has a Pod with the label app=store. The Pod anti-affinity rule tells the scheduler never to place multiple app=web-store servers on a single node.
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-server
spec:
selector:
matchLabels:
app: web-store
replicas: 3
template:
metadata:
labels:
app: web-store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web-store
topologyKey: "kubernetes.io/hostname"
podAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- store
topologyKey: "kubernetes.io/hostname"
containers:
- name: web-app
image: nginx:1.16-alpine
Creating the two preceding Deployments results in the following cluster layout, where each web server is co-located with a cache, on three separate nodes.
node-1 | node-2 | node-3 |
webserver-1 | webserver-2 | webserver-3 |
cache-1 | cache-2 | cache-3 |
The overall effect is that each cache instance is likely to be accessed by a single client, that is running on the same node. This approach aims to minimize both skew (imbalanced load) and latency.
You might have other reasons to use Pod anti-affinity. See the ZooKeeper tutorial in the Kubernetes documentation for an example of a StatefulSet configured with anti-affinity for high availability, using the same technique as this example.
Topology domains:
topologyKey: "kubernetes.io/hostname"
Nodes that share the topology key but have different values for it form different topology domains; only the same key with the same value counts as one domain.
With kubernetes.io/hostname, each domain contains exactly one node. So even if many Pods matching the label selector exist in the selected namespace (spread across several nodes), an affinity rule keyed on hostname confines the new Pod, and all of its replicas, to the single node that forms the chosen domain.
Matching is evaluated roughly in this order: namespaces --> topologyKey --> Pod labels --> node.
Under anti-affinity, once every topology domain already hosts a Pod matching the labels, additional replicas stay Pending and cannot be scheduled.
Summary: how the topology domains are cut determines whether replicas go Pending (anti-affinity) or pile onto one node (affinity).
Example: four replicas spread over three topology domains; under required anti-affinity one Pod stays Pending:
kubectl label node k8s-master01 jigui=jigui1
kubectl label node k8s-master02 jigui=jigui2
kubectl label node k8s-master03 jigui=jigui2
kubectl label node k8s-node01 jigui=jigui3
kubectl label node k8s-node02 jigui=jigui3
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-server
spec:
selector:
matchLabels:
app: web-store
replicas: 4
template:
metadata:
labels:
app: web-store
spec:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: app
operator: In
values:
- web-store
topologyKey: jigui
containers:
- name: web-app
image: nginx:1.16-alpine
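After applying, the spread can be checked with kubectl get pod -o wide: with three jigui domains and required anti-affinity, three replicas land in different domains and the fourth stays Pending (output is illustrative):
kubectl get pod -l app=web-store -o wide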