1. Rethinking the Problem
I had spent some time earlier trying to repair the failure by restoring the data, without success. On reflection, I had fallen into a mental rut: assuming that corrupted data must always be fixed by restoring it. In fact, etcd is an excellent distributed key-value store that uses the Raft protocol to keep the cluster consistent. When one member's data is corrupted, that member can simply rebuild its data by synchronizing from the other members.
Once I realized this, the fix became straightforward.
2. Repairing the Data
2.1. Stop kubelet on the node with corrupted data
In a kubeadm deployment, etcd runs as a static pod managed by kubelet, so kubelet has to be stopped first; otherwise it would immediately restart the etcd container.
[root@k8s-m2 wal]# systemctl stop kubelet
[root@k8s-m2 wal]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: inactive (dead) since Wed 2020-04-15 21:31:31 EDT; 2s ago
Docs: http://kubernetes.io/docs/
Process: 17209 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=0/SUCCESS)
Process: 17195 ExecStartPre=/usr/bin/kubelet-pre-start.sh (code=exited, status=0/SUCCESS)
Main PID: 17209 (code=exited, status=0/SUCCESS)
2.2. Stop the etcd container on the node
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
eec644681fe4 0cae8d5cc64c "kube-apiserver --ad…" 6 minutes ago Up 6 minutes k8s_kube-apiserver_kube-apiserver-k8s-m2_kube-system_a6969daa2e4e9a047c11e645ac639c8f_6543
5f75788ca082 303ce5db0e90 "etcd --advertise-cl…" 7 minutes ago Up 7 minutes k8s_etcd_etcd-k8s-m2_kube-system_8fc15ff127d417c1e3b2180d50ce85e3_1
# docker stop 5f75788ca082 88b1dcb2e14f
5f75788ca082
88b1dcb2e14f
2.3. Delete the corrupted etcd data on the node
Here we can simply delete the corrupted data files; once the node rejoins the cluster, the data will be synchronized automatically from the other members.
# rm -f /var/lib/etcd/member/wal/*
# rm -f /var/lib/etcd/member/snap/*
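If disk space allows, a safer variant of this step is to move the corrupted files aside instead of deleting them, so they remain available for later inspection. A minimal sketch of the pattern, demonstrated on a scratch directory (in the real procedure ETCD_DATA would be /var/lib/etcd/member):

```shell
# Demo of "move aside instead of delete" on a scratch directory.
# Assumption: ETCD_DATA stands in for the real /var/lib/etcd/member.
ETCD_DATA=$(mktemp -d)/member
mkdir -p "$ETCD_DATA/wal" "$ETCD_DATA/snap"
touch "$ETCD_DATA/wal/0000000000000000-0000000000000000.wal" "$ETCD_DATA/snap/db"

# Move wal/ and snap/ into a timestamped backup directory instead of rm -f.
BACKUP=$(mktemp -d)/etcd-backup-$(date +%Y%m%d%H%M%S)
mkdir -p "$BACKUP"
mv "$ETCD_DATA/wal" "$ETCD_DATA/snap" "$BACKUP"/
ls "$BACKUP"
```

After the member has rejoined and resynchronized, the backup directory can be deleted.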
2.4. Remove and re-add the faulty member on the first node
List the current members:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member list
1e2fb9983e528532, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
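The member ID needed for the remove step can be picked out of this output by hand, or with a small text-processing helper. A hypothetical sketch, run here against a copy of the sample output above (the fields are comma-separated: ID, status, name, peer URLs, client URLs, learner flag):

```shell
# Sample `etcdctl member list` output copied from the transcript above.
member_list='1e2fb9983e528532, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false'

# Match on the name field (3rd) and print the member ID field (1st).
member_id=$(printf '%s\n' "$member_list" | awk -F', ' '$3 == "k8s-m2" {print $1}')
echo "$member_id"   # 1e2fb9983e528532
```

In a real run, the member list would come from piping the etcdctl command into awk instead of a saved string.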
Remove the member k8s-m2:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member remove 1e2fb9983e528532
Member 1e2fb9983e528532 removed from cluster 450f66a1edd8aab3
Re-add the faulty node k8s-m2 as a new member:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member add k8s-m2 --peer-urls="https://172.0.2.146:2380"
Member 630ebadbb6f56ec1 added to cluster 450f66a1edd8aab3
ETCD_NAME="k8s-m2"
ETCD_INITIAL_CLUSTER="k8s-m2=https://172.0.2.146:2380,k8s-m3=https://172.0.2.234:2380,k8s-m1=https://172.0.2.139:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.0.2.146:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
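The variables printed by `member add` describe the startup configuration the new member must use. In a kubeadm deployment the static pod manifest already carries the equivalent settings, but for a manually managed etcd they would map onto command-line flags roughly as follows (a config sketch, values taken from the output above):

```
etcd --name k8s-m2 \
  --initial-cluster "k8s-m2=https://172.0.2.146:2380,k8s-m3=https://172.0.2.234:2380,k8s-m1=https://172.0.2.139:2380" \
  --initial-cluster-state existing \
  --initial-advertise-peer-urls https://172.0.2.146:2380
```

The key setting is `--initial-cluster-state existing`: it tells the member to join the running cluster and sync from its peers rather than bootstrap a new cluster.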
The newly added member is initially in the unstarted state:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, unstarted, , https://172.0.2.146:2380, , false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
2.5. Start kubelet on the faulty node
[root@k8s-m2 home]# systemctl start kubelet
2.6. Check the etcd and Kubernetes cluster status
The k8s-m2 member of the etcd cluster is now in the started state:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
The k8s-m2 node in the Kubernetes cluster is also Ready:
# kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
k8s-m1 Ready master 56d v1.17.0 172.0.2.139 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.6
k8s-m2 Ready master 56d v1.17.0 172.0.2.146 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.6
k8s-m3 Ready master 56d v1.17.0 172.0.2.234 <none> CentOS Linux 7 (Core) 3.10.0-1062.el7.x86_64 docker://19.3.6
2.7. Data synchronized successfully
[root@k8s-m2 ~]# ls /var/lib/etcd/member/wal/ -l
total 312512
-rw-------. 1 root root 64000272 Apr 16 14:23 0000000000000000-0000000000000000.wal
-rw-------. 1 root root 64000432 Apr 17 02:00 0000000000000001-0000000000b0c506.wal
-rw-------. 1 root root 64000440 Apr 17 13:36 0000000000000002-0000000000b3001d.wal
-rw-------. 1 root root 64000000 Apr 17 22:08 0000000000000003-0000000000b53b3a.wal
-rw-------. 1 root root 64000000 Apr 17 13:36 1.tmp
3. Summary
This procedure is essentially the same as adding a member when deploying an etcd cluster in the first place.
Going one step further: if a corrupted member can be repaired by resynchronization, why doesn't etcd do this automatically? Presumably because removing and re-adding a member is a membership change that affects quorum, so etcd leaves it as an explicit operator decision rather than reconfiguring the cluster on its own after what might be a transient fault.