A few notes before we start
- Because of how the business is set up, this project spans multiple servers, so its nodes may be spread across any of them. While working through the steps below I was constantly switching between servers. If you have a similar setup, pay close attention to which server each command should run on — logging in to the wrong machine can easily mislead your diagnosis.
- Approach based on: https://blog.csdn.net/wanger5354/article/details/122538340
Problem description
While checking pods today I found many of them in the Evicted state, even though the Grafana dashboards showed nothing abnormal, so I investigated as follows.
Check pod status
With this many Evicted pods, the raw output of kubectl get pod is hard to scan, so it helps to filter with grep (here inverting the match to hide Running pods):
[root@master01 ~]# kubectl get po -A -o wide | grep -v "Running"
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
nosp base-influxdb-8c7559b46-25jbt 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-5vl69 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-bgh8j 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-hbmwh 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-kss7x 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-ksx9d 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-ntb5x 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-pcq2r 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-rt2l4 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-tcpn4 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-tvzx9 0/1 Evicted 0 7d20h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-x4577 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-znmjl 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-zzh4q 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
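From the listing above, all the Evicted pods sit on worker02. When the output is longer, a small awk pipeline can count Evicted pods per node. This is a sketch run against a captured sample of the output above (so it can be tried without a cluster); the live equivalent is in the trailing comment.

```shell
# Count Evicted pods per node from `kubectl get po -A -o wide` output.
# In -o wide output, STATUS is column 4 and NODE is column 8.
sample='nosp base-influxdb-8c7559b46-25jbt 0/1 Evicted 0 3d7h <none> worker02 <none> <none>
nosp base-influxdb-8c7559b46-tvzx9 0/1 Evicted 0 7d20h <none> worker02 <none> <none>
nosp some-other-pod-abc 1/1 Running 0 3d7h 10.0.0.5 worker01 <none> <none>'
printf '%s\n' "$sample" \
  | awk '$4 == "Evicted" {count[$8]++} END {for (n in count) print n, count[n]}'
# Live equivalent:
#   kubectl get po -A -o wide | awk '$4 == "Evicted" {c[$8]++} END {for (n in c) print n, c[n]}'
```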
Describe one of the pods
[root@master01 /]# kubectl describe po base-influxdb-8c7559b46-25jbt -n nosp
Name: base-influxdb-8c7559b46-25jbt
Namespace: nosp
Priority: 0
PriorityClassName: <none>
Node: worker02/
Start Time: Sun, 03 Jul 2022 03:50:43 +0800
Labels: app=base-influxdb
pod-template-hash=8c7559b46
release=base
Annotations: <none>
Status: Failed
Reason: Evicted
Message: Pod The node had condition: [DiskPressure].
IP:
Controlled By: ReplicaSet/base-dicastal-influxdb-8c7559b46
Containers:
base-dicastal-influxdb:
Image: dev.kmx.com.cn:5001/public/influxdb:1.7.6-alpine
Port: 8086/TCP
Host Port: 0/TCP
Liveness: http-get http://:api/ping delay=30s timeout=5s period=10s #success=1 #failure=3
Readiness: http-get http://:api/ping delay=5s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/influxdb from config (rw)
/var/lib/influxdb from base-dicastal-influxdb-data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-2vhzc (ro)
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: base-influxdb
Optional: false
base-influxdb-data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: influxdb-pv-dicastal
ReadOnly: false
default-token-2vhzc:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-2vhzc
Optional: false
QoS Class: BestEffort
Node-Selectors: kubernetes.io/hostname=worker02
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
The Message field shows the pod was evicted because the node reported a DiskPressure condition.
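The same condition can be confirmed at the node level with kubectl describe node. Below is a sketch parsed from a captured sample of the Conditions section so it runs standalone; the live commands (assuming the node name worker02 from above) are in the comments.

```shell
# Extract the DiskPressure status from `kubectl describe node` output.
sample='Conditions:
  Type             Status  Reason
  MemoryPressure   False   KubeletHasSufficientMemory
  DiskPressure     True    KubeletHasDiskPressure
  PIDPressure      False   KubeletHasSufficientPID
  Ready            True    KubeletReady'
printf '%s\n' "$sample" | awk '$1 == "DiskPressure" {print $2}'
# Live equivalents:
#   kubectl describe node worker02 | awk '$1 == "DiskPressure" {print $2}'
#   kubectl get node worker02 -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}'
```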
Log in to node2.pre.ayunw.cn and check
[root@master01 ~]# df -Th | egrep -v "overlay2|kubernetes|docker"
Filesystem Type Size Used Avail Use% Mounted on
/dev/mapper/centos-root xfs 50G 31G 20G 62% /
devtmpfs devtmpfs 32G 0 32G 0% /dev
tmpfs tmpfs 32G 0 32G 0% /dev/shm
tmpfs tmpfs 32G 3.1G 29G 10% /run
tmpfs tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/sda1 xfs 1014M 232M 783M 23% /boot
/dev/sdb1 ext4 1008G 36G 922G 4% /disk1
/dev/sdb2 ext4 1008G 150G 808G 16% /disk2
/dev/mapper/centos-home xfs 441G 154G 288G 35% /data
tmpfs tmpfs 6.3G 12K 6.3G 1% /run/user/42
tmpfs tmpfs 6.3G 0 6.3G 0% /run/user/0
The disk still has over a third of its space free — not a lot, but tolerable at first glance, so let's keep digging.
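Worth noting here: the kubelet raises DiskPressure based on its eviction thresholds (the documented defaults include nodefs.available<10%, imagefs.available<15%, and nodefs.inodesFree<5%), so it can fire on inode exhaustion even when byte usage looks healthy. A quick inode check sketch, parsed from a captured df -i sample (the figures are illustrative, not from this node); on the node itself just run df -i.

```shell
# Flag filesystems above 90% inode usage from `df -i` output.
# Columns: Filesystem Inodes IUsed IFree IUse% Mounted-on.
sample='Filesystem              Inodes    IUsed   IFree IUse% Mounted on
/dev/mapper/centos-root 26214400 25165824 1048576   96% /
/dev/sdb1               67108864  1048576 66060288    2% /disk1'
printf '%s\n' "$sample" | awk 'NR > 1 {gsub(/%/, "", $5); if ($5 + 0 > 90) print $6, $5 "%"}'
# Live equivalent on the node:
#   df -i | awk 'NR > 1 {gsub(/%/, "", $5); if ($5 + 0 > 90) print $6, $5 "%"}'
```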
Check disk I/O
[root@localhost ~]# iostat -xk 1 3
Linux 3.10.0-1160.21.1.el7.x86_64 (localhost.localdomain) 07/06/2022 _x86_64_ (8 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
1.11 0.00 1.96 0.01 0.08 96.84
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.07 0.00 1.58 0.16 18.72 23.85 0.00 2.21 3.66 2.21 1.49 0.24
dm-0 0.00 0.00 0.00 0.57 0.10 9.15 32.32 0.00 3.12 3.01 3.12 0.75 0.04
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 48.19 0.00 0.85 0.85 0.00 0.83 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 99.51 0.00 2.54 2.41 6.80 2.04 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
1.01 0.00 1.88 0.00 0.13 96.98
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 1.00 2.00 8.00 9.50 11.67 0.00 1.00 0.00 1.50 1.67 0.50
dm-0 0.00 0.00 1.00 0.00 8.00 0.00 16.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
avg-cpu: %user %nice %system %iowait %steal %idle
0.38 0.00 0.38 0.00 0.00 99.25
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
(A quick aside: if you're not sure what these columns mean, see a detailed iostat reference such as the article "iostat命令详解".)
So although free disk space isn't plentiful, there is no sign of I/O pressure either.
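When scanning iostat output like this, the quickest saturation signal is the last column, %util. A small sketch that flags devices above 80% — run here against a captured sample (the 92.04 figure is fabricated for illustration; the output above showed no such device):

```shell
# Flag devices whose %util (last column of `iostat -x`) exceeds 80%.
sample='Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
vda 0.00 0.07 0.00 1.58 0.16 18.72 23.85 0.00 2.21 3.66 2.21 1.49 0.24
dm-0 0.00 0.00 0.00 0.57 0.10 9.15 32.32 0.00 3.12 3.01 3.12 0.75 92.04'
printf '%s\n' "$sample" | awk 'NR > 1 && $NF + 0 > 80 {print $1, $NF "%"}'
```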
After a long search with no obvious culprit, and no way to reproduce the scenario, the only immediate option was to delete the pods stuck in the Evicted state.
At the same time this problem appeared, the service also saw many failed tasks, so for now the working theory is that a network issue caused both the task failures and the pod evictions. This needs continued observation and a follow-up network investigation.
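The cleanup step above can be scripted instead of deleting pods one by one. A sketch, dry-run against a captured sample so the selection logic can be checked first; the live pipeline is in the trailing comment (review its output before piping into xargs).

```shell
# Build `kubectl delete po <name> -n <ns>` argument lists for every Evicted pod.
sample='nosp base-influxdb-8c7559b46-25jbt 0/1 Evicted 0 3d7h
nosp base-influxdb-8c7559b46-tvzx9 0/1 Evicted 0 7d20h
nosp healthy-pod-xyz 1/1 Running 0 3d7h'
printf '%s\n' "$sample" | awk '$4 == "Evicted" {print $2, "-n", $1}'
# Live cleanup:
#   kubectl get po -A | awk '$4 == "Evicted" {print $2, "-n", $1}' | xargs -L1 kubectl delete po
```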