Limiting the Number of Evicted Pods in k8s to Prevent Cluster Problems
Background
- In production you may see large numbers of evicted pods like the listing below. Left unhandled, they pile up as abnormal pods in the cluster and can make the cluster sluggish.
[root@k8s-master-1 rollout]# kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
default busybox-deployment-rollout-55cdb64f8b-2c4dz 0/1 Evicted 0 1s
default busybox-deployment-rollout-55cdb64f8b-7sx2r 0/1 Pending 0 1s
default busybox-deployment-rollout-55cdb64f8b-9xnw4 0/1 Evicted 0 2s
default busybox-deployment-rollout-55cdb64f8b-bmk28 0/1 Evicted 0 3s
default busybox-deployment-rollout-55cdb64f8b-gz6h2 0/1 Evicted 0 4s
default busybox-deployment-rollout-55cdb64f8b-k5wf9 0/1 Evicted 0 2s
default busybox-deployment-rollout-55cdb64f8b-llk4d 0/1 Evicted 0 4s
default busybox-deployment-rollout-55cdb64f8b-mlxc9 0/1 Evicted 0 4s
default busybox-deployment-rollout-55cdb64f8b-nwv8r 0/1 Evicted 0 4s
default busybox-deployment-rollout-55cdb64f8b-xzbch 0/1 Evicted 0 4s
kube-system coredns-7d9b46dfb8-kgwhr 1/1 Running 0 98m
kube-system kube-flannel-ds-gxfjc 1/1 Running 0 98m
kube-system kube-flannel-ds-lb9sl 0/1 Pending 0 35d
monitor monitor-prometheus-node-exporter-hxrkc 0/1 Evicted 0 41s
monitor monitor-prometheus-node-exporter-p95fh 0/1 Pending 0 35d
Eviction-related parameters
Eviction means removing pods from a node: when a node becomes abnormal, Kubernetes evicts the Pods on it to keep workloads available.
Kubernetes currently has two eviction mechanisms, implemented by kube-controller-manager and kubelet respectively.
- Eviction implemented by kube-controller-manager. kube-controller-manager is made up of multiple controllers, and this eviction function lives in the node controller. It periodically checks the status of every node and, once a node has been NotReady for longer than a set period, evicts all pods on that node.
- kube-controller-manager provides the following startup flags to control eviction (put together, they look like the sketch after this list):
- pod-eviction-timeout: how long a node must be down before the eviction mechanism kicks in and the pods on the dead node are evicted; default 5min.
- node-eviction-rate: the rate at which nodes are drained, implemented with a token-bucket rate limiter; default 0.1, i.e. 0.1 nodes per second. Note this is the rate of draining nodes, not pods: roughly one node is emptied every 10s.
- secondary-node-eviction-rate: the secondary eviction rate; when too many nodes in the cluster are down, the eviction rate is lowered to this value; default 0.01.
- unhealthy-zone-threshold: the unhealthy-zone threshold, which controls when the secondary eviction rate takes over; default 0.55, i.e. a zone is considered unhealthy once more than 55% of its nodes are down.
- large-cluster-size-threshold: the large-cluster threshold; a zone with more nodes than this counts as a large cluster. In a large cluster, once more than 55% of the nodes are down, the eviction rate drops to 0.01; in a small cluster it drops straight to 0.
- terminated-pod-gc-threshold: the maximum number of terminated Pods that may be kept before the terminated-pod garbage collector starts deleting them. A value of 0 or less disables garbage collection of terminated Pods.
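Put together on a command line, these flags might look like the following minimal sketch; the values shown are the defaults described above, and the defaults of 50 for large-cluster-size-threshold and 12500 for terminated-pod-gc-threshold are assumptions taken from the upstream flag reference, not from this cluster:
# Sketch only -- not this cluster's actual unit file
kube-controller-manager \
  --pod-eviction-timeout=5m0s \
  --node-eviction-rate=0.1 \
  --secondary-node-eviction-rate=0.01 \
  --unhealthy-zone-threshold=0.55 \
  --large-cluster-size-threshold=50 \
  --terminated-pod-gc-threshold=12500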
- kubelet's eviction mechanism
- If a node is under resource pressure, kubelet executes its eviction policy. Eviction takes Pod priority, resource usage, and resource requests into account; among Pods of equal priority, the Pod with the highest ratio of resource usage to resource request is evicted first.
- kube-controller-manager's eviction is coarse-grained: it evicts every pod on a node. kubelet's is fine-grained: it evicts selected Pods on its own node, and which Pods it picks depends on their QoS class (the sketch below shows how to read it). This eviction periodically checks the node's memory, disk, and other local resources, and evicts some pods in priority order when a resource runs short.
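A pod's QoS class (Guaranteed / Burstable / BestEffort) is recorded in its status and is easy to inspect; a small sketch using kubectl's custom-columns output, with nothing cluster-specific assumed:
# List every pod together with its QoS class; BestEffort pods are the
# first candidates when kubelet evicts under resource pressure
kubectl get pods -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,QOS:.status.qosClass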
Eviction thresholds come in two kinds, soft eviction thresholds (Soft Eviction Thresholds) and hard eviction thresholds (Hard Eviction Thresholds):
- Soft eviction threshold: when the node's memory/disk usage crosses the threshold, kubelet does not reclaim resources immediately; if usage falls back below the threshold within the grace period, no eviction happens, but if it stays above the threshold for the whole period, eviction proceeds.
- Hard eviction: much simpler; as soon as the threshold is crossed, pods are evicted from the node immediately.
kubelet provides the following flags to control eviction (combined into a command-line sketch after this list):
- eviction-soft: the soft eviction thresholds, a set of thresholds such as memory.available<1.5Gi. Crossing one does not immediately trigger a pod eviction; kubelet waits for eviction-soft-grace-period, and if the threshold is still exceeded after that period, a pod eviction is triggered.
- eviction-soft-grace-period: default 90 seconds; the grace period used when a soft threshold is crossed, i.e. the delay between the soft eviction signal and the eviction itself.
- eviction-max-pod-grace-period: the maximum pod termination grace period used when evicting, i.e. the time between the stop signal and the kill.
- eviction-pressure-transition-period: default 5 minutes; the eviction pressure transition period. Once a threshold is crossed, the node is marked with the memory pressure or disk pressure condition and pod eviction begins.
- eviction-minimum-reclaim: the minimum amount of resources each eviction must reclaim.
- eviction-hard: the hard eviction thresholds, also a set of thresholds, e.g. memory.available<1Gi: as soon as the node's available memory falls below 1Gi, a pod eviction is triggered immediately.
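Combined, these flags might look like the following minimal command-line sketch; the threshold values are illustrative, not recommendations, and the same settings map onto the KubeletConfiguration fields (evictionSoft, evictionHard, ...) used in the config file later in this article:
# Sketch only -- thresholds are examples, tune them for real nodes
kubelet \
  --eviction-soft=memory.available<1.5Gi \
  --eviction-soft-grace-period=memory.available=90s \
  --eviction-max-pod-grace-period=120 \
  --eviction-pressure-transition-period=5m \
  --eviction-minimum-reclaim=memory.available=500Mi \
  --eviction-hard=memory.available<1Gi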
Simulating the creation of evicted pods
- Modify the kubelet parameters to make the eviction condition very easy to trigger
[root@k8s-master-1 rollout]# cat /etc/kubernetes/kubelet.yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
.......................................  # omitted
cgroupDriver: systemd
cgroupsPerQOS: true
eventBurst: 10
eventRecordQPS: 5
evictionHard:
  imagefs.available: 15%
  memory.available: 100Mi
  nodefs.available: 95%   # evict pods if the node has less than 95% of its disk space free
  nodefs.inodesFree: 5%
evictionPressureTransitionPeriod: 5m0s
.......................................  # omitted
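Kubelet must be restarted before the new thresholds take effect. A sketch, assuming kubelet runs as a systemd unit that loads this file via --config:
# Restart kubelet, then watch its log for eviction activity
systemctl restart kubelet
journalctl -u kubelet -f | grep -i evict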
- Test YAML file. Be sure to pin the pods to a specific node; otherwise evicted pods will not be produced in volume.
[root@k8s-master-1 rollout]# cat busybox-deployment-rollout.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox-deployment-rollout
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      containers:
      - name: busybox-2
        image: busybox:1.28
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh","-c","sleep 10000"]
      nodeName: k8s-master-1
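Apply the manifest to start the test. The scale step below is an assumption, not part of the original run; it simply speeds up how quickly evicted pods accumulate on the pressured node:
kubectl apply -f busybox-deployment-rollout.yaml
# Optional: raise the replica count to accumulate evicted pods faster
kubectl scale deployment busybox-deployment-rollout --replicas=10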
# Check how many evicted pods have been produced
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | wc -l
476
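To see where those pods live, the same count can be broken down by namespace:
# Count evicted pods per namespace (column 1 of `kubectl get pods -A`)
kubectl get pods -A | grep Evicted | awk '{print $1}' | sort | uniq -c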
Adding the terminated-pod-gc-threshold flag to kube-controller-manager
[root@k8s-master-1 rollout]# cat /usr/lib/systemd/system/kube-controller-manager.service
[Unit]
Description=Kubernetes Controller Manager
Documentation=https://github.com/kubernetes/kubernetes
[Service]
ExecStart=/usr/bin/kube-controller-manager \
--port=10252 \
--secure-port=10257 \
--bind-address=127.0.0.1 \
--kubeconfig=/etc/kubernetes/kube-controller-manager.kubeconfig \
--service-cluster-ip-range=10.0.0.0/16 \
--cluster-name=kubernetes \
--cluster-signing-cert-file=/etc/kubernetes/ssl/kube-apiserver-ca.pem \
--cluster-signing-key-file=/etc/kubernetes/ssl/kube-apiserver-ca-key.pem \
--cluster-signing-duration=87600h \
--allocate-node-cidrs=true \
--cluster-cidr=10.70.0.0/16 \
--node-cidr-mask-size=24 \
--root-ca-file=/etc/kubernetes/ssl/kube-apiserver-ca.pem \
--service-account-private-key-file=/etc/kubernetes/ssl/service.key \
--use-service-account-credentials=true \
--leader-elect=true \
--feature-gates=RotateKubeletServerCertificate=true,RotateKubeletClientCertificate=true,EphemeralContainers=true \
--controllers=*,bootstrapsigner,tokencleaner \
--tls-cert-file=/etc/kubernetes/ssl/kube-controller-manager.pem \
--tls-private-key-file=/etc/kubernetes/ssl/kube-controller-manager-key.pem \
--requestheader-client-ca-file=/etc/kubernetes/ssl/front-proxy-ca.pem \
--requestheader-allowed-names=front-proxy-client \
--requestheader-extra-headers-prefix=X-Remote-Extra- \
--requestheader-group-headers=X-Remote-Group \
--requestheader-username-headers=X-Remote-User \
--horizontal-pod-autoscaler-use-rest-clients=true \
--alsologtostderr=true \
--logtostderr=false \
--log-dir=/var/log/kubernetes \
--v=2 \
--terminated-pod-gc-threshold=100 # cap on how many terminated Pods may be kept before the terminated-pod garbage collector deletes them
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
# Restart kube-controller-manager
[root@k8s-master-1 rollout]# systemctl daemon-reload && systemctl restart kube-controller-manager.service
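A quick sanity check (not from the original write-up) to confirm the flag actually made it onto the running process:
# Print the flag as seen by the live kube-controller-manager process
ps -ef | grep '[k]ube-controller-manager' | grep -o 'terminated-pod-gc-threshold=[0-9]*'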
- Note that after terminated-pod-gc-threshold=100 is configured, the number of Evicted pods no longer grows, but k8s also does not automatically reclaim the abnormal pods that already exist
[root@k8s-master-1 rollout]# kubectl get pods -A | wc -l
507
[root@k8s-master-1 rollout]# kubectl get pods -A | wc -l
507
[root@k8s-master-1 rollout]# kubectl get pods -A | wc -l
507
[root@k8s-master-1 rollout]# kubectl get pods -A | wc -l
507
[root@k8s-master-1 rollout]# kubectl get pods -A | wc -l
507
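Rather than rerunning the count by hand as above, the same check can be left running:
# Refresh the total pod count every 5 seconds
watch -n 5 'kubectl get pods -A | wc -l'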
- Redeploy and watch whether the number of Evicted pods exceeds 5 (judging from the final config below, terminated-pod-gc-threshold was lowered from 100 to 5 for this test).
The evicted pods do still exceed 5 at times, but kube-controller-manager can now be seen reclaiming Evicted pods
# Delete the deployment
[root@k8s-master-1 rollout]# kubectl delete -f busybox-deployment-rollout.yaml
deployment.apps "busybox-deployment-rollout" deleted
# Clean up the evicted pods, deleting each one in its own namespace
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | awk '{print $2" -n "$1}' | xargs -L1 kubectl delete pod
# Check whether any evicted pods remain
[root@k8s-master-1 rollout]# kubectl get pods -A | grep "Evicted" | wc -l
0
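As an aside, evicted pods sit in the Failed phase, so the same cleanup can be done without awk; note this removes all Failed pods, not only Evicted ones:
kubectl delete pods --all-namespaces --field-selector=status.phase=Failed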
# Redeploy the test YAML
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | wc -l
9
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | wc -l
14
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | wc -l
18
[root@k8s-master-1 rollout]# kubectl get pods -A | grep Evicted | wc -l
7
- To address the situation above, we adjust the kube-controller-manager flags
[root@k8s-master-1 rollout]# cat /usr/lib/systemd/system/kube-controller-manager.service
[Unit]
Description=Kubernetes Controller Manager
Documentation=https://github.com/kubernetes/kubernetes
[Service]
ExecStart=/usr/bin/kube-controller-manager \
--port=10252 \
--secure-port=10257 \
--bind-address=127.0.0.1 \
--kubeconfig=/etc/kubernetes/kube-controller-manager.kubeconfig \
--service-cluster-ip-range=10.0.0.0/16 \
--cluster-name=kubernetes \
--cluster-signing-cert-file=/etc/kubernetes/ssl/kube-apiserver-ca.pem \
--cluster-signing-key-file=/etc/kubernetes/ssl/kube-apiserver-ca-key.pem \
--cluster-signing-duration=87600h \
--allocate-node-cidrs=true \
--cluster-cidr=10.70.0.0/16 \
--node-cidr-mask-size=24 \
--root-ca-file=/etc/kubernetes/ssl/kube-apiserver-ca.pem \
--service-account-private-key-file=/etc/kubernetes/ssl/service.key \
--use-service-account-credentials=true \
--leader-elect=true \
--feature-gates=RotateKubeletServerCertificate=true,RotateKubeletClientCertificate=true,EphemeralContainers=true \
--controllers=*,bootstrapsigner,tokencleaner \
--tls-cert-file=/etc/kubernetes/ssl/kube-controller-manager.pem \
--tls-private-key-file=/etc/kubernetes/ssl/kube-controller-manager-key.pem \
--requestheader-client-ca-file=/etc/kubernetes/ssl/front-proxy-ca.pem \
--requestheader-allowed-names=front-proxy-client \
--requestheader-extra-headers-prefix=X-Remote-Extra- \
--requestheader-group-headers=X-Remote-Group \
--requestheader-username-headers=X-Remote-User \
--horizontal-pod-autoscaler-use-rest-clients=true \
--alsologtostderr=true \
--logtostderr=false \
--log-dir=/var/log/kubernetes \
--v=2 \
--terminated-pod-gc-threshold=5 \
--concurrent-gc-syncs=50 \ # number of GC workers allowed to sync concurrently; raising it speeds up reclamation
--node-eviction-rate=0.05 # lowered from 0.1: when nodes are unhealthy, all pods on one node are cleared roughly every 20s
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
# After adding the two flags above, the rate at which evicted pods pile up slows and the reclamation rate improves; this is hard to demonstrate cleanly, so no output is shown here
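One way to eyeball both effects is to sample the Evicted count over time while the test deployment is running; successive deltas show eviction slowing and reclamation catching up:
# Sample the Evicted count every 10 seconds for one minute
for i in $(seq 1 6); do
  printf '%s %s\n' "$(date +%T)" "$(kubectl get pods -A | grep -c Evicted)"
  sleep 10
done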
References:
https://support.huaweicloud.com/cce_faq/cce_faq_00209.html
https://kubernetes.io/zh-cn/docs/reference/command-line-tools-reference/kube-controller-manager/