Troubleshooting a Kubernetes PV provisioning failure with rook-ceph
1. The problem: pods of a newly created StatefulSet could not get their PVs provisioned. The pod events showed:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler running "VolumeBinding" filter plugin for pod "mysql-0": pod has unbound immediate PersistentVolumeClaims
Warning FailedScheduling <unknown> default-scheduler running "VolumeBinding" filter plugin for pod "mysql-0": pod has unbound immediate PersistentVolumeClaims
The PVC stayed unbound: no PV was being provisioned for it.
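To confirm where binding is stuck, the claim itself can be inspected. The namespace `ztw` appears later in this post; the claim name `data-mysql-0` is an assumption based on the usual `<volumeClaimTemplate>-<pod>` naming convention:

```shell
# List PVCs in the application namespace; a stuck claim shows STATUS=Pending
kubectl get pvc -n ztw

# Describe the claim to see events from the CSI provisioner
# (claim name assumed from the StatefulSet naming convention)
kubectl describe pvc -n ztw data-mysql-0
```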
Troubleshooting steps:
1. First, check the status of the rook-ceph cluster:
[root@master1 images]# kubectl exec -it -n rook-ceph rook-ceph-tools-6b4889fdfd-86dp5 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
[root@rook-ceph-tools-6b4889fdfd-86dp5 /]# ceph -s
cluster:
id: bb5107d5-d3f7-45df-9146-1148efa378b5
health: HEALTH_OK
services:
mon: 3 daemons, quorum b,c,d (age 67m)
mgr: a(active, since 7m)
mds: myfs:1 {0=myfs-b=up:active} 1 up:standby-replay
osd: 10 osds: 10 up (since 7h), 10 in (since 8d)
task status:
scrub status:
mds.myfs-a: idle
mds.myfs-b: idle
data:
pools: 4 pools, 97 pgs
objects: 1.20k objects, 3.2 GiB
usage: 19 GiB used, 1.9 TiB / 2.0 TiB avail
pgs: 97 active+clean
io:
client: 852 B/s rd, 1 op/s rd, 0 op/s wr
Ceph itself reports HEALTH_OK, so the storage backend is not the cause.
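Had `ceph -s` reported a problem, a few standard follow-up commands in the toolbox pod would narrow it down (these are stock Ceph CLI commands, shown here as a sketch):

```shell
# Expanded health output: lists each warning/error with the affected daemons
ceph health detail

# Per-OSD up/in state and usage, to spot a down or full OSD
ceph osd status

# Pool-level capacity, to rule out a full pool blocking writes
ceph df
```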
2. Next, check the core components in kube-system:
[root@master1 images]# kubectl get po -n kube-system
NAME READY STATUS RESTARTS AGE
calico-kube-controllers-578894d4cd-8wlg4 1/1 Running 0 8d
calico-node-5rnjk 1/1 Running 0 8d
calico-node-7rvj2 1/1 Running 0 8d
calico-node-p7hpq 1/1 Running 0 8d
calico-node-vgrlg 1/1 Running 0 8d
calico-node-zd2mn 1/1 Running 0 8d
coredns-66bff467f8-fj7td 1/1 Running 0 5d3h
coredns-66bff467f8-rmnzk 1/1 Running 0 8d
dashboard-metrics-scraper-6b4884c9d5-8gtnl 1/1 Running 0 8d
etcd-master1 1/1 Running 0 20m
etcd-master2 1/1 Running 0 20m
etcd-master3 1/1 Running 0 20m
kube-apiserver-master1 1/1 Running 0 8d
kube-apiserver-master2 1/1 Running 0 8d
kube-apiserver-master3 1/1 Running 0 8d
kube-controller-manager-master1 1/1 Running 63 8d
kube-controller-manager-master2 1/1 Running 64 8d
kube-controller-manager-master3 1/1 Running 64 8d
kube-proxy-6n7lz 1/1 Running 0 8d
kube-proxy-7nstv 1/1 Running 0 8d
kube-proxy-kxzhp 1/1 Running 0 8d
kube-proxy-tw9j4 1/1 Running 0 8d
kube-proxy-w4s47 1/1 Running 0 8d
kube-scheduler-master1 1/1 Running 63 32m
kube-scheduler-master2 1/1 Running 64 22m
kube-scheduler-master3 1/1 Running 74 22m
kubernetes-dashboard-6f77f7cfdb-kb6fx 1/1 Running 4 8d
metrics-server-584b5f4754-z58xl 1/1 Running 0 8d
traefik-5875c779f4-4z62m 1/1 Running 0 4d21h
The kube-scheduler and kube-controller-manager pods on all three masters show dozens of restarts. The kube-scheduler logs contain repeated leader-election failures:
E0406 05:13:21.278261 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
E0406 05:17:05.409616 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: leader changed
E0406 05:35:03.180503 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
E0406 06:07:26.579433 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
E0406 07:14:02.476881 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: leader changed
E0406 07:48:13.975004 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
E0406 08:27:33.280699 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: leader changed
E0406 09:01:27.363775 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
E0406 09:40:58.889225 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: leader changed
E0406 10:13:18.380376 1 leaderelection.go:320] error retrieving resource lock kube-system/kube-scheduler: etcdserver: request timed out
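The entries above can be surfaced with a simple log filter against the scheduler pod (pod name taken from the listing earlier; the grep pattern is just a convenience):

```shell
# Pull only the leader-election errors from the scheduler's log
kubectl logs -n kube-system kube-scheduler-master1 | grep leaderelection

# The -p flag shows logs from the previous (crashed) container instance,
# useful when the pod has just restarted
kubectl logs -p -n kube-system kube-scheduler-master1 | grep leaderelection
</imports>
```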
So etcd was repeatedly timing out and changing leaders.
Checking etcd directly, however, showed a healthy cluster; the preliminary conclusion is that transient network jitter between the etcd members triggered the repeated leader elections:
[root@master1 images]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key --write-out=table --endpoints=172.10.25.184:2379,172.10.25.185:2379,172.10.25.186:2379 endpoint status
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| 172.10.25.184:2379 | 3b85f750baf8d841 | 3.4.3 | 59 MB | false | false | 1886 | 3128844 | 3128844 | |
| 172.10.25.185:2379 | 5f95ee4c3d9d164 | 3.4.3 | 59 MB | false | false | 1886 | 3128845 | 3128845 | |
| 172.10.25.186:2379 | be2885dc23c5f563 | 3.4.3 | 59 MB | true | false | 1886 | 3128846 | 3128846 | |
+--------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
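To rule out an ongoing problem, the same etcdctl flags can be reused for a per-endpoint health probe; consistently high or erratic round-trip times here point to the network or disk pressure that causes "request timed out" and leader changes:

```shell
# Reports health and request latency for each member individually
ETCDCTL_API=3 etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/peer.crt \
  --key=/etc/kubernetes/pki/etcd/peer.key \
  --endpoints=172.10.25.184:2379,172.10.25.185:2379,172.10.25.186:2379 \
  endpoint health
```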
Next, check the state of the kubelet on the node:
[root@master1 mysql]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/usr/lib/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /usr/lib/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2021-04-06 17:57:51 CST; 2min 28s ago
Docs: https://kubernetes.io/docs/
Main PID: 13723 (kubelet)
Tasks: 30
Memory: 88.4M
CGroup: /system.slice/kubelet.service
└─13723 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --cgroup-driver=systemd --network-plugin=cni --pod-infr...
Apr 06 18:00:12 master1 kubelet[13723]: I0406 18:00:12.930233 13723 operation_generator.go:181] scheme "" not registered, fallback to default scheme
Apr 06 18:00:12 master1 kubelet[13723]: I0406 18:00:12.930252 13723 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/var/lib/kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock <nil> 0 <nil>}] <nil> <nil>}
Apr 06 18:00:12 master1 kubelet[13723]: I0406 18:00:12.930263 13723 clientconn.go:933] ClientConn switching balancer to "pick_first"
Apr 06 18:00:12 master1 kubelet[13723]: W0406 18:00:12.930363 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock <nil> 0 <nil>}. Err...
Apr 06 18:00:13 master1 kubelet[13723]: W0406 18:00:13.930508 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.cephfs.csi.ceph.com-reg.sock <nil> 0 <nil>}. ...
Apr 06 18:00:13 master1 kubelet[13723]: W0406 18:00:13.930641 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock <nil> 0 <nil>}. Err...
Apr 06 18:00:15 master1 kubelet[13723]: W0406 18:00:15.319378 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.cephfs.csi.ceph.com-reg.sock <nil> 0 <nil>}. ...
Apr 06 18:00:15 master1 kubelet[13723]: W0406 18:00:15.473849 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock <nil> 0 <nil>}. Err...
Apr 06 18:00:17 master1 kubelet[13723]: W0406 18:00:17.583410 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.cephfs.csi.ceph.com-reg.sock <nil> 0 <nil>}. ...
Apr 06 18:00:18 master1 kubelet[13723]: W0406 18:00:18.305165 13723 clientconn.go:1208] grpc: addrConn.createTransport failed to connect to {/var/lib/kubelet/plugins_registry/rook-ceph.rbd.csi.ceph.com-reg.sock <nil> 0 <nil>}. Err...
Hint: Some lines were ellipsized, use -l to show in full.
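The warnings above all reference Unix sockets under /var/lib/kubelet/plugins_registry, which CSI drivers create to register themselves with the kubelet. Their presence can be checked directly on the affected node:

```shell
# List the CSI registration sockets the kubelet is trying to reach.
# If the rook-ceph CSI driver pods are gone, these sockets are missing or stale.
ls -l /var/lib/kubelet/plugins_registry/
# Entries expected when healthy (per the log messages above):
#   rook-ceph.cephfs.csi.ceph.com-reg.sock
#   rook-ceph.rbd.csi.ceph.com-reg.sock
```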
The kubelet cannot connect to the CSI driver sockets (rook-ceph.rbd.csi.ceph.com and rook-ceph.cephfs.csi.ceph.com), which is exactly why volume provisioning and mounting fail. The root cause is coming into focus.
Checking the pods in the rook-ceph namespace confirms it: all of the CSI plugin and provisioner pods are missing, so nothing can serve the CSI requests:
[root@master1 mysql]# kubectl get po -n rook-ceph
NAME READY STATUS RESTARTS AGE
rook-ceph-crashcollector-master1-9ff9c4b7f-92zqn 1/1 Running 0 8d
rook-ceph-crashcollector-master2-6fd8fd857d-m4ngp 1/1 Running 0 8d
rook-ceph-crashcollector-master3-78869fc5b5-9rsrs 1/1 Running 0 5d
rook-ceph-crashcollector-node1-765b49998c-25wjd 1/1 Running 0 4d20h
rook-ceph-crashcollector-node2-5c9bf65fcd-rpv26 1/1 Running 0 7h31m
rook-ceph-mds-myfs-a-765f596697-cp7zs 1/1 Running 43 4d20h
rook-ceph-mds-myfs-b-5488556c97-kdr9m 1/1 Running 0 8d
rook-ceph-mgr-a-77b889cb6d-rqktg 1/1 Running 0 8d
rook-ceph-mon-b-5d747c4957-mgv2t 1/1 Running 0 8d
rook-ceph-mon-c-55c86765c7-7clf6 1/1 Running 0 5d
rook-ceph-mon-d-85d9bcd45b-lkjvs 1/1 Running 0 7d20h
rook-ceph-operator-6f9fc8c7dd-bk68g 1/1 Running 0 3d16h
rook-ceph-osd-0-65d88658cb-gkthq 1/1 Running 0 5d
rook-ceph-osd-1-7dc95f7cd7-46ppp 1/1 Running 0 8d
rook-ceph-osd-2-5894c6b9c8-88q98 1/1 Running 0 4d20h
rook-ceph-osd-3-8565f8b8cc-l4llh 1/1 Running 0 8d
rook-ceph-osd-4-6cf8449f54-qh2lq 1/1 Running 0 5d
rook-ceph-osd-5-c7b84d7b7-qqv5p 1/1 Running 0 5d
rook-ceph-osd-6-5485fdcfc5-hwdl2 1/1 Running 0 8d
rook-ceph-osd-7-b54c78b68-m88kh 1/1 Running 0 4d20h
rook-ceph-osd-8-74b4575bd8-cx2fb 1/1 Running 0 8d
rook-ceph-osd-9-6b8d6bb87f-7tgcj 1/1 Running 0 5d
rook-ceph-osd-prepare-master1-cbgzb 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-master2-lmglb 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-master3-j9fx5 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-node1-cdmcc 0/1 Completed 0 3d16h
rook-ceph-tools-6b4889fdfd-86dp5 1/1 Running 0 4d20h
rook-discover-5grcs 1/1 Running 0 3d16h
rook-discover-7ltj8 1/1 Running 0 3d16h
rook-discover-bnrrw 1/1 Running 0 3d16h
rook-discover-m8lbb 1/1 Running 0 7h31m
rook-discover-zkdb5 1/1 Running 0 3d16h
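Another quick way to see that the CSI workloads are gone is to check the controllers the operator normally maintains. The resource names below match the pod names in the listings in this post, but may vary between Rook versions:

```shell
# The operator creates these; when the CSI pods are missing, the DaemonSets/
# Deployment will be absent or report 0 desired/ready replicas.
kubectl get ds -n rook-ceph csi-rbdplugin csi-cephfsplugin
kubectl get deploy -n rook-ceph csi-rbdplugin-provisioner
```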
Since the rook-ceph operator is responsible for creating and reconciling all of the rook-ceph workloads (including the CSI driver DaemonSets and provisioner Deployments), the operator pod was restarted by deleting it and letting its Deployment recreate it:
[root@master1 mysql]# kubectl delete po -n rook-ceph rook-ceph-operator-6f9fc8c7dd-bk68g
pod "rook-ceph-operator-6f9fc8c7dd-bk68g" deleted
After the restart, check rook-ceph again:
[root@master1 mysql]# kubectl get po -n rook-ceph
NAME READY STATUS RESTARTS AGE
csi-cephfsplugin-6hntb 0/3 ContainerCreating 0 1s
csi-cephfsplugin-6zdm8 0/3 ContainerCreating 0 1s
csi-cephfsplugin-84dhz 0/3 Pending 0 1s
csi-cephfsplugin-twkbn 0/3 ContainerCreating 0 1s
csi-cephfsplugin-xg4mg 0/3 ContainerCreating 0 1s
csi-rbdplugin-48zk8 0/3 ContainerCreating 0 1s
csi-rbdplugin-4tn8s 0/3 ContainerCreating 0 1s
csi-rbdplugin-6vrwq 0/3 ContainerCreating 0 1s
csi-rbdplugin-provisioner-b4d4bc45d-s2sfx 0/6 ContainerCreating 0 1s
csi-rbdplugin-provisioner-b4d4bc45d-shz27 0/6 ContainerCreating 0 1s
csi-rbdplugin-s4jlv 0/3 ContainerCreating 0 1s
csi-rbdplugin-sdvjt 0/3 ContainerCreating 0 2s
rook-ceph-crashcollector-master1-9ff9c4b7f-92zqn 1/1 Running 0 8d
rook-ceph-crashcollector-master2-6fd8fd857d-m4ngp 1/1 Running 0 8d
rook-ceph-crashcollector-master3-78869fc5b5-9rsrs 1/1 Running 0 5d
rook-ceph-crashcollector-node1-765b49998c-25wjd 1/1 Running 0 4d20h
rook-ceph-crashcollector-node2-5c9bf65fcd-rpv26 1/1 Running 0 7h32m
rook-ceph-detect-version-wkcpw 0/1 Terminating 0 6s
rook-ceph-mds-myfs-a-765f596697-cp7zs 1/1 Running 43 4d20h
rook-ceph-mds-myfs-b-5488556c97-kdr9m 1/1 Running 0 8d
rook-ceph-mgr-a-77b889cb6d-rqktg 1/1 Running 0 8d
rook-ceph-mon-b-5d747c4957-mgv2t 1/1 Running 0 8d
rook-ceph-mon-c-55c86765c7-7clf6 1/1 Running 0 5d
rook-ceph-mon-d-85d9bcd45b-lkjvs 1/1 Running 0 7d20h
rook-ceph-operator-6f9fc8c7dd-ktkw5 1/1 Running 0 12s
rook-ceph-osd-0-65d88658cb-gkthq 1/1 Running 0 5d
rook-ceph-osd-1-7dc95f7cd7-46ppp 1/1 Running 0 8d
rook-ceph-osd-2-5894c6b9c8-88q98 1/1 Running 0 4d20h
rook-ceph-osd-3-8565f8b8cc-l4llh 1/1 Running 0 8d
rook-ceph-osd-4-6cf8449f54-qh2lq 1/1 Running 0 5d
rook-ceph-osd-5-c7b84d7b7-qqv5p 1/1 Running 0 5d
rook-ceph-osd-6-5485fdcfc5-hwdl2 1/1 Running 0 8d
rook-ceph-osd-7-b54c78b68-m88kh 1/1 Running 0 4d20h
rook-ceph-osd-8-74b4575bd8-cx2fb 1/1 Running 0 8d
rook-ceph-osd-9-6b8d6bb87f-7tgcj 1/1 Running 0 5d
rook-ceph-osd-prepare-master1-cbgzb 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-master2-lmglb 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-master3-j9fx5 0/1 Completed 0 3d16h
rook-ceph-osd-prepare-node1-cdmcc 0/1 Completed 0 3d16h
rook-ceph-tools-6b4889fdfd-86dp5 1/1 Running 0 4d20h
rook-discover-5grcs 1/1 Running 0 3d16h
rook-discover-7ltj8 1/1 Running 0 3d16h
rook-discover-bnrrw 1/1 Running 0 3d16h
rook-discover-m8lbb 1/1 Running 0 7h32m
rook-discover-zkdb5 1/1 Running 0 3d16h
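Rather than re-running `kubectl get po` until the ContainerCreating pods settle, readiness of the recreated CSI pods can be awaited. The `app=` labels below follow Rook's conventions for its CSI pods and are an assumption here:

```shell
# Block until the CSI plugin DaemonSet pods report Ready (or time out)
kubectl -n rook-ceph wait pod -l app=csi-rbdplugin \
  --for=condition=Ready --timeout=120s
kubectl -n rook-ceph wait pod -l app=csi-cephfsplugin \
  --for=condition=Ready --timeout=120s
```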
Once the operator recreated the CSI plugin and provisioner pods, everything in the namespace returned to Running.
Finally, re-check the StatefulSet pods that were originally stuck:
[root@master1 mysql]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Running 0 54m
mysql-1 2/2 Running 0 27m
mysql-2 2/2 Running 0 27m
All three pods are Running with their volumes mounted; the PVs provisioned successfully and the issue is resolved.