Data recovery after a K8s node failure
Below is the current state of the cluster and the layout of the MySQL cluster:
mysql-0 is the only pod that accepts writes for the MySQL cluster; mysql-1 and mysql-2 serve read traffic.
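For context, clients reach the writer and the readers through different endpoints. A minimal sketch, assuming the StatefulSet's headless Service is named mysql and the read Service is named mysql-read (neither name appears in the output below):
mysql -h mysql-0.mysql.ztw.svc.cluster.local -u root -p     # writes: the single writer pod via its stable per-pod DNS name
mysql -h mysql-read.ztw.svc.cluster.local -u root -p        # reads: load-balanced across the read replicas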
[root@master1 mysql]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Running 0 8m30s
mysql-1 2/2 Running 0 7d19h
mysql-2 2/2 Running 0 7d19h
Current distribution of the pods across the cluster:
[root@master1 mysql]# kubectl get po -n ztw -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
mysql-0 2/2 Running 0 29m 192.168.180.61 master2 <none> <none>
mysql-1 2/2 Running 0 7d19h 192.168.166.164 node1 <none> <none>
mysql-2 2/2 Running 0 7d19h 192.168.104.63 node2 <none> <none>
Node status:
[root@master1 mysql]# kubectl get no
NAME STATUS ROLES AGE VERSION
master1 Ready master 16d v1.18.6
master2 Ready master 16d v1.18.6
master3 Ready master 16d v1.18.6
node1 Ready worker 16d v1.18.6
node2 Ready worker 16d v1.18.6
Simulated failure test:
1. The kubelet on the node running mysql-0 goes down, and MySQL can no longer accept writes.
The kubelet on the master2 node of the K8s cluster is stopped:
[root@master1 mysql]# kubectl get no
NAME STATUS ROLES AGE VERSION
master1 Ready master 17d v1.18.6
master2 NotReady master 17d v1.18.6
master3 Ready master 17d v1.18.6
node1 Ready worker 17d v1.18.6
node2 Ready worker 17d v1.18.6
Check the MySQL cluster (after the kubelet goes down it takes roughly 5 minutes before the real state of the MySQL cluster becomes visible; the reason for the delay is explained after the output below):
[root@master1 mysql]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Terminating 0 59m
mysql-1 2/2 Running 0 7d20h
mysql-2 2/2 Running 0 7d20h
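The roughly 5-minute delay is not MySQL-specific: once the kubelet stops reporting, the node controller marks the node NotReady and taints it, and pods tolerate the resulting not-ready/unreachable NoExecute taints for 300 seconds by default before being marked for deletion. A quick way to confirm this on a cluster running default controller settings:
kubectl describe node master2 | grep -i -A2 taints                   # NoExecute taints added by the node controller
kubectl get pod mysql-0 -n ztw -o yaml | grep -B3 tolerationSeconds  # default 300s tolerations injected by the DefaultTolerationSeconds admission plugin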
This test covers the case where the kubelet cannot be brought back and the MySQL cluster has to be recovered by hand.
From the output above we know mysql-0 was running on master2. First, check the MySQL PVs:
[root@master1 mysql]# kubectl get pv |grep ztw/data-mysql
pvc-1b2be124-db4a-4220-9a75-bcd9d7ef26fd 10Gi RWO Delete Bound ztw/data-mysql-2 ceph-rbd 7d20h
pvc-2f3248b3-a163-4f6d-964c-744e46cd899a 10Gi RWO Delete Bound ztw/data-mysql-1 ceph-rbd 7d20h
pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856 10Gi RWO Delete Bound ztw/data-mysql-0 ceph-rbd 7d20h
Get the CSI image name used by ztw/data-mysql-0 (note the PV is RWO, so the underlying RBD image has to be released from master2 before mysql-0 can run anywhere else):
[root@master1 mysql]# kubectl get pv pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856 -o yaml|grep imageName
f:imageName: {}
imageName: csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d
The RBD image used here is: csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d
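If you would rather not grep the full YAML (which also matches the managedFields entry above), jsonpath can pull the image name directly; the field path below assumes a ceph-csi provisioned PV:
kubectl get pv pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856 -o jsonpath='{.spec.csi.volumeAttributes.imageName}'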
Check where this image is currently mapped:
[root@master1 mysql]# kubectl exec -it -n rook-ceph rook-ceph-tools-6b4889fdfd-86dp5 /bin/bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
[root@rook-ceph-tools-6b4889fdfd-86dp5 /]# rbd showmapped
id pool namespace image snap device
0 replicapool csi-vol-b0547ce2-9061-11eb-9dfa-c2bce0e658d7 - /dev/rbd0
1 replicapool csi-vol-4abead3a-978d-11eb-b4aa-ee49455e3e6d - /dev/rbd1
[root@rook-ceph-tools-6b4889fdfd-86dp5 /]# rbd status replicapool/csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d
Watchers:
watcher=172.16.25.185:0/2783421160 client.860678 cookie=18446462598732840969
This shows where the RBD image is mapped and which server currently holds it.
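The watcher address (172.16.25.185 here) can be matched against the nodes' internal IPs to confirm which host still holds the mapping:
kubectl get nodes -o wide    # compare the INTERNAL-IP column against the watcher address from rbd status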
Having located the server holding the mapping (master2), run the following on master2:
[root@master2 ~]# docker ps|grep k8s_csi-rbdplugin_csi
6ee7aa256fbd 3d66848c3c6f "/usr/local/bin/ceph…" 45 minutes ago Up 45 minutes k8s_csi-rbdplugin_csi-rbdplugin-provisioner-b4d4bc45d-z45dd_rook-ceph_a7328719-bfea-42b1-9715-2378be38a514_0
6f8307339942 3d66848c3c6f "/usr/local/bin/ceph…" 45 minutes ago Up 45 minutes k8s_csi-rbdplugin_csi-rbdplugin-mll4k_rook-ceph_8a34b870-a840-47eb-9324-3e062fc0bdb9_0
Use the k8s_csi-rbdplugin_csi-rbdplugin-mll4k_rook-ceph container (the per-node csi-rbdplugin DaemonSet pod on master2, not the provisioner) for the next steps:
[root@master2 ~]# docker exec -it 6f8307339942 bash
[root@master2 /]# rbd showmapped|grep csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d
3 replicapool csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d - /dev/rbd3
[root@master2 /]# rbd unmap -o force /dev/rbd3
[root@master2 ~]# umount /dev/rbd3
The commands above show which RBD device on this host the image is mapped to (here /dev/rbd3); rbd unmap -o force releases the mapping, and umount /dev/rbd3 unmounts the directory it was mounted under. (The usual order is to unmount first and then unmap; the forced unmap is what gets around the stale mount left behind by the dead kubelet.)
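Before moving on, it is worth checking from the same csi-rbdplugin container that the device is really released; both checks below should return nothing:
mount | grep rbd3                                                     # the filesystem should no longer be mounted
rbd showmapped | grep csi-vol-33414fe0-978d-11eb-b4aa-ee49455e3e6d    # the image should no longer be mapped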
I assumed that once the PV was released the pod would automatically fail over to another node, but time proved otherwise:
[root@master1 mysql]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Terminating 0 174m
mysql-1 2/2 Running 0 7d22h
mysql-2 2/2 Running 0 7d22h
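Before touching etcd, a less invasive option is to force-delete the pod through the API server; this removes the pod object without waiting for the dead kubelet to acknowledge termination:
kubectl delete pod mysql-0 -n ztw --grace-period=0 --force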
Here I went with the heavy-handed route and cleaned the old pod record out of etcd directly:
[root@master1 mysql]# ETCDCTL_API=3 etcdctl --cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/peer.crt --key=/etc/kubernetes/pki/etcd/peer.key \
--endpoints=172.10.25.184:2379,172.10.25.185:2379,172.10.25.186:2379 \
del /registry/pods/ztw/mysql-0
1
Watching again afterwards, the pod is recreated:
[root@master1 mysql]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 0/2 Init:0/2 0 74s
mysql-1 2/2 Running 0 7d22h
mysql-2 2/2 Running 0 7d22h
While it is recreating, check the events:
[root@master1 mysql]# kubectl describe po -n ztw mysql-0
.....
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled <unknown> default-scheduler Successfully assigned ztw/mysql-0 to master3
Warning FailedAttachVolume 6m54s attachdetach-controller Multi-Attach error for volume "pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856" Volume is already exclusively attached to one node and can't be attached to another
Warning FailedMount 4m51s kubelet, master3 Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[conf config-map default-token-d9qmq data]: timed out waiting for the condition
Warning FailedMount 2m34s kubelet, master3 Unable to attach or mount volumes: unmounted volumes=[data], unattached volumes=[default-token-d9qmq data conf config-map]: timed out waiting for the condition
Normal SuccessfulAttachVolume 54s attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856"
Normal Created 37s kubelet, master3 Created container init-mysql
Normal Pulled 37s kubelet, master3 Container image "hub.youedata.com/rds/mysql:5.7" already present on machine
Normal Started 37s kubelet, master3 Started container init-mysql
Normal Pulled 37s kubelet, master3 Container image "hub.youedata.com/base/xtrabackup:1.0" already present on machine
Normal Started 36s kubelet, master3 Started container clone-mysql
Normal Created 36s kubelet, master3 Created container clone-mysql
Normal Pulled 35s kubelet, master3 Container image "hub.youedata.com/base/xtrabackup:1.0" already present on machine
Normal Created 35s kubelet, master3 Created container xtrabackup
Normal Started 35s kubelet, master3 Started container xtrabackup
Normal Pulled 15s (x3 over 36s) kubelet, master3 Container image "hub.youedata.com/rds/mysql:5.7" already present on machine
Normal Created 15s (x3 over 35s) kubelet, master3 Created container mysql
Normal Started 15s (x3 over 35s) kubelet, master3 Started container mysql
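The Multi-Attach warning above is raised by the attach/detach controller while it still considers the volume attached to master2; once that stale attachment is cleared, AttachVolume.Attach succeeds on master3. Its view can be inspected through VolumeAttachment objects (the grep simply matches on the PV name):
kubectl get volumeattachment | grep pvc-c91fc22e-dd7e-4b23-94f8-c4fdcdb0f856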
At this point the MySQL cluster has recovered:
[root@master1 ~]# kubectl get po -n ztw
NAME READY STATUS RESTARTS AGE
mysql-0 2/2 Running 0 15m
mysql-1 2/2 Running 0 15m
mysql-2 2/2 Running 0 14m
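Running pods alone do not prove replication survived the move, so a quick end-to-end check is to write through mysql-0 and read the row back through the read endpoint. The service names (mysql, mysql-read) and the passwordless root login below are assumptions based on a typical StatefulSet MySQL setup; adjust them to your own configuration:
kubectl run mysql-check -n ztw --image=hub.youedata.com/rds/mysql:5.7 -i --rm --restart=Never -- \
  mysql -h mysql-0.mysql -e "CREATE DATABASE IF NOT EXISTS recovery_check; CREATE TABLE IF NOT EXISTS recovery_check.t (v INT); INSERT INTO recovery_check.t VALUES (1);"
kubectl run mysql-check-read -n ztw --image=hub.youedata.com/rds/mysql:5.7 -i --rm --restart=Never -- \
  mysql -h mysql-read -e "SELECT v FROM recovery_check.t;"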
This chapter covered recovering a StatefulSet-based MySQL cluster when the kubelet on its node goes down and cannot be restored.
A follow-up will cover recovering StatefulSet services when Docker itself goes down.