1. Stop the etcd pod
All commands in this step are run on the faulty node k8s-m2.
1.1. Stop kubelet
[root@k8s-m2 wal]# systemctl stop kubelet
[root@k8s-m2 wal]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: inactive (dead) since Wed 2020-04-15 21:31:31 EDT; 2s ago
Docs: http://kubernetes.io/docs/
Process: 17209 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=0/SUCCESS)
Process: 17195 ExecStartPre=/usr/bin/kubelet-pre-start.sh (code=exited, status=0/SUCCESS)
Main PID: 17209 (code=exited, status=0/SUCCESS)
1.2. Stop the etcd container
Because etcd has been stuck in a restart loop, the etcd container will no longer be restarted once kubelet is stopped.
If the etcd container is still running, stop it manually:
# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
eec644681fe4 0cae8d5cc64c "kube-apiserver --ad…" 6 minutes ago Up 6 minutes k8s_kube-apiserver_kube-apiserver-k8s-m2_kube-system_a6969daa2e4e9a047c11e645ac639c8f_6543
5f75788ca082 303ce5db0e90 "etcd --advertise-cl…" 7 minutes ago Up 7 minutes k8s_etcd_etcd-k8s-m2_kube-system_8fc15ff127d417c1e3b2180d50ce85e3_1
# docker stop 5f75788ca082 88b1dcb2e14f
5f75788ca082
88b1dcb2e14f
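Rather than copying container IDs by hand, the lookup and stop can be combined; a minimal sketch, assuming the kubelet container naming convention `k8s_etcd_…` visible in the `docker ps` output above:

```shell
# Stop any etcd container on this node by name filter instead of raw IDs.
# "k8s_etcd" matches kubelet-created containers named k8s_etcd_etcd-<node>_...
stop_etcd_containers() {
  local ids
  ids=$(docker ps -q --filter "name=k8s_etcd")
  if [ -n "$ids" ]; then
    docker stop $ids   # word-splitting is intentional: one ID per argument
  fi
}
```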
2. Repair the data from the snap file
All commands in this step are run on the faulty node k8s-m2.
2.1. Back up the data
# mkdir backup
# mv /var/lib/etcd/member backup/
2.2. Restore the data from the snap file
Note that backup/member/snap/db here is a copy of the live database rather than a snapshot taken with `etcdctl snapshot save`, so the restore has to skip the integrity hash check (--skip-hash-check=true).
# rm -rf /var/lib/etcd
# etcdctl snapshot restore backup/member/snap/db --data-dir=/var/lib/etcd --skip-hash-check=true
{"level":"info","ts":1587002300.8804266,"caller":"snapshot/v3_snapshot.go:287","msg":"restoring snapshot","path":"backup/member/snap/db","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
{"level":"info","ts":1587002301.0180507,"caller":"mvcc/kvstore.go:378","msg":"restored last compact revision","meta-bucket-name":"meta","meta-bucket-name-key":"finishedCompactRev","restored-compact-revision":8003675}
{"level":"info","ts":1587002301.028811,"caller":"membership/cluster.go:392","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"0","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]}
{"level":"info","ts":1587002301.046414,"caller":"snapshot/v3_snapshot.go:300","msg":"restored snapshot","path":"backup/member/snap/db","wal-dir":"/var/lib/etcd/member/wal","data-dir":"/var/lib/etcd","snap-dir":"/var/lib/etcd/member/snap"}
# ls /var/lib/etcd/member/snap/
0000000000000001-0000000000000001.snap db
# ls /var/lib/etcd/member/wal/
0000000000000000-0000000000000000.wal
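Before relying on the restored directory, the source db file can be inspected first; a hedged sketch using `etcdctl snapshot status` (present in etcdctl 3.4, the version used here):

```shell
# Print size, total keys, and revision of a db file before restoring from it.
check_snap_db() {
  etcdctl snapshot status "$1" -w table
}
# Usage: check_snap_db backup/member/snap/db
```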
3. Remove and re-add the node in the etcd cluster
All commands in this step are run on the healthy node k8s-m1.
3.1. Remove the k8s-m2 member
Get the member list:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member list
1e2fb9983e528532, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
Remove the member k8s-m2:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member remove 1e2fb9983e528532
Member 1e2fb9983e528532 removed from cluster 450f66a1edd8aab3
3.2. Add the k8s-m2 member back
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member add k8s-m2 --peer-urls="https://172.0.2.146:2380"
Member 630ebadbb6f56ec1 added to cluster 450f66a1edd8aab3
ETCD_NAME="k8s-m2"
ETCD_INITIAL_CLUSTER="k8s-m2=https://172.0.2.146:2380,k8s-m3=https://172.0.2.234:2380,k8s-m1=https://172.0.2.139:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.0.2.146:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
The newly added member is in the unstarted state:
# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, unstarted, , https://172.0.2.146:2380, , false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
4. Start the etcd pod on the faulty node
4.1. Start the kubelet process
[root@k8s-m2 home]# systemctl start kubelet
4.2. Check the etcd cluster status
From the healthy nodes, k8s-m2 still shows as unstarted:
[root@k8s-m1 wal]# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.234:2379 --insecure-skip-tls-verify member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, unstarted, , https://172.0.2.146:2380, , false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false
Querying the k8s-m2 endpoint instead shows that etcd has started a separate single-node cluster on k8s-m2:
[root@k8s-m1 wal]# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key --endpoints https://172.0.2.146:2379 --insecure-skip-tls-verify member list
8e9e05c52164694d, started, k8s-m2, http://localhost:2380, https://172.0.2.146:2379, false
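The split can also be confirmed directly, since each endpoint reports its own raft cluster ID via `endpoint status`. A hypothetical helper (certificate paths as used throughout this article; the JSON output prints the cluster ID in decimal):

```shell
# Report the status (including cluster ID) of a single etcd endpoint.
cluster_id_of() {
  etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt \
          --key /etc/kubernetes/pki/etcd/peer.key \
          --insecure-skip-tls-verify \
          --endpoints "https://$1:2379" \
          endpoint status -w json
}
# cluster_id_of 172.0.2.139   # a healthy member
# cluster_id_of 172.0.2.146   # the restored node: a different cluster ID
```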
4.3. Check the etcd container logs
Containers on the k8s-m2 node:
# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
5a3c10335a0d 303ce5db0e90 "etcd --advertise-cl…" 9 seconds ago Up 9 seconds k8s_etcd_etcd-k8s-m2_kube-system_c45c8fe716669e896c01df9357b80855_1
2bd2e6148d5c 78c190f736b1 "kube-scheduler --au…" 35 seconds ago Up 35 seconds k8s_kube-scheduler_kube-scheduler-k8s-m2_kube-system_ff67867321338ffd885039e188f6b424_57
The etcd container log on the k8s-m2 node:
[root@k8s-m2 home]# docker logs 5a3c10335a0d
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-04-16 02:21:21.879836 I | etcdmain: etcd Version: 3.4.3
2020-04-16 02:21:21.879920 I | etcdmain: Git SHA: 3cf2f69b5
2020-04-16 02:21:21.879926 I | etcdmain: Go Version: go1.12.12
2020-04-16 02:21:21.879931 I | etcdmain: Go OS/Arch: linux/amd64
2020-04-16 02:21:21.879937 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2020-04-16 02:21:21.880006 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-04-16 02:21:21.880065 I | embed: peerTLS: cert = /etc/kubernetes/pki/etcd/peer.crt, key = /etc/kubernetes/pki/etcd/peer.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2020-04-16 02:21:21.881130 I | embed: name = k8s-m2
2020-04-16 02:21:21.881141 I | embed: data dir = /var/lib/etcd
2020-04-16 02:21:21.881147 I | embed: member dir = /var/lib/etcd/member
2020-04-16 02:21:21.881152 I | embed: heartbeat = 100ms
2020-04-16 02:21:21.881157 I | embed: election = 1000ms
2020-04-16 02:21:21.881162 I | embed: snapshot count = 10000
2020-04-16 02:21:21.881174 I | embed: advertise client URLs = https://172.0.2.146:2379
2020-04-16 02:21:21.881180 I | embed: initial advertise peer URLs = https://172.0.2.146:2380
2020-04-16 02:21:21.881192 I | embed: initial cluster =
2020-04-16 02:21:21.886462 I | etcdserver: recovered store from snapshot at index 1
2020-04-16 02:21:21.897999 I | mvcc: restore compact to 8005911
2020-04-16 02:21:21.915289 I | etcdserver: restarting member 8e9e05c52164694d in cluster cdf818194e3a8c32 at commit index 2028
raft2020/04/16 02:21:21 INFO: 8e9e05c52164694d switched to configuration voters=(10276657743932975437)
raft2020/04/16 02:21:21 INFO: 8e9e05c52164694d became follower at term 3
raft2020/04/16 02:21:21 INFO: newRaft 8e9e05c52164694d [peers: [8e9e05c52164694d], term: 3, commit: 2028, applied: 1, lastindex: 2028, lastterm: 3]
2020-04-16 02:21:21.915719 I | etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
2020-04-16 02:21:21.924745 I | mvcc: restore compact to 8005911
2020-04-16 02:21:21.928380 W | auth: simple token is not cryptographically signed
2020-04-16 02:21:21.939722 I | etcdserver: starting server... [version: 3.4.3, cluster version: to_be_decided]
2020-04-16 02:21:21.939852 I | etcdserver: 8e9e05c52164694d as single-node; fast-forwarding 9 ticks (election ticks 10)
2020-04-16 02:21:21.940415 N | etcdserver/membership: set the initial cluster version to 3.4
2020-04-16 02:21:21.940536 I | etcdserver/api: enabled capabilities for version 3.4
2020-04-16 02:21:21.943059 I | embed: ClientTLS: cert = /etc/kubernetes/pki/etcd/server.crt, key = /etc/kubernetes/pki/etcd/server.key, trusted-ca = /etc/kubernetes/pki/etcd/ca.crt, client-cert-auth = true, crl-file =
2020-04-16 02:21:21.943177 I | embed: listening for peers on 172.0.2.146:2380
2020-04-16 02:21:21.943272 I | embed: listening for metrics on http://127.0.0.1:2381
2020-04-16 02:21:21.956914 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:21.981283 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:21.982367 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:21.991827 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:21.992782 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:22.004156 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:22.045179 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
2020-04-16 02:21:22.082271 E | rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
The log repeatedly reports a cluster ID mismatch:
rafthttp: request cluster ID mismatch (got 450f66a1edd8aab3 want cdf818194e3a8c32)
4.4. Log analysis
Further up the log, the cluster ID this node expects, cdf818194e3a8c32, was read from its local store:
etcdserver/membership: added member 8e9e05c52164694d [http://localhost:2380] to cluster cdf818194e3a8c32 from store
8e9e05c52164694d and cdf818194e3a8c32 are the IDs etcd generates for the default single-node membership "default=http://localhost:2380"; they appear here because the snapshot restore in step 2.2 was run without --name/--initial-cluster flags, so the restored data dir carries that default membership. The cluster ID received from the peers on port 2380 is 450f66a1edd8aab3; the mismatch prevents the node from joining the cluster.
5. Conclusion
The analysis above shows that the data restored from the snap file carries the wrong cluster membership metadata, so the node cannot join the healthy cluster. This recovery attempt failed.
Note: copying the etcd snap data from k8s-m1 over to k8s-m2 and restoring from it was also tried, with the same result.
The problem was finally resolved by syncing the data from the healthy members.
etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired (3): repair by syncing from the healthy members
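The sync-based fix can be outlined as follows; this is a hedged sketch, assuming a kubeadm static-pod etcd whose manifest already sets --initial-cluster-state=existing, not a verified procedure (`ETCD_DATA_DIR` is parameterized for illustration; kubeadm's default is /var/lib/etcd):

```shell
# Outline of the eventual fix: discard the bad restored data and let the
# re-added member (section 3) replicate its state from the cluster leader.
ETCD_DATA_DIR="${ETCD_DATA_DIR:-/var/lib/etcd}"

rejoin_from_leader() {
  systemctl stop kubelet                                        # stop the static pods
  docker ps -q --filter "name=k8s_etcd" | xargs -r docker stop  # stop etcd if still up
  rm -rf "$ETCD_DATA_DIR/member"                                # drop the restored state
  systemctl start kubelet                                       # etcd starts empty and syncs
}
```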