etcdserver: read wal error (walpb: crc mismatch) and cannot be repaired (3): repairing by re-syncing from the cluster

1. Rethinking the problem

The previous posts spent quite some time trying to fix this fault by restoring the data, without success. Looking back, I had fallen into a mental rut: assuming that corrupted data must be repaired by recovering the data itself. In fact, etcd is an excellent distributed key-value store that uses the Raft protocol to guarantee consistency across the cluster. When one member's data is corrupted, it can simply be recovered by syncing from the other members.
Once you see it that way, the repair becomes straightforward.
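Before touching any data, it is worth confirming which member is actually unhealthy. A minimal sketch using etcd's own health check, with the certificate paths and endpoints from this article; the helper below only prints the command (dry run), since it needs a live cluster to execute:

```shell
# Print the etcdctl health-check command for a set of endpoints (dry run;
# drop the leading 'echo' inside the function to run it for real).
# Certificate paths and endpoints are the ones used in this article.
health_check_cmd() {
  echo etcdctl \
    --cert /etc/kubernetes/pki/etcd/peer.crt \
    --key /etc/kubernetes/pki/etcd/peer.key \
    --insecure-skip-tls-verify \
    --endpoints "$1" \
    endpoint health
}

health_check_cmd "https://172.0.2.139:2379,https://172.0.2.146:2379,https://172.0.2.234:2379"
```

On a live cluster, the unhealthy member reports an error while the healthy ones answer `is healthy`, which tells you which node to repair.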

2. Repairing the data

2.1 Stop kubelet on the node with the corrupted data

[root@k8s-m2 wal]# systemctl stop kubelet
[root@k8s-m2 wal]# systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
   Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
  Drop-In: /etc/systemd/system/kubelet.service.d
           └─10-kubeadm.conf
   Active: inactive (dead) since Wed 2020-04-15 21:31:31 EDT; 2s ago
     Docs: http://kubernetes.io/docs/
  Process: 17209 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=0/SUCCESS)
  Process: 17195 ExecStartPre=/usr/bin/kubelet-pre-start.sh (code=exited, status=0/SUCCESS)
 Main PID: 17209 (code=exited, status=0/SUCCESS)

2.2 Stop the etcd container on the corrupted node

# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS               NAMES
eec644681fe4        0cae8d5cc64c           "kube-apiserver --ad…"   6 minutes ago       Up 6 minutes                            k8s_kube-apiserver_kube-apiserver-k8s-m2_kube-system_a6969daa2e4e9a047c11e645ac639c8f_6543
5f75788ca082        303ce5db0e90           "etcd --advertise-cl…"   7 minutes ago       Up 7 minutes                            k8s_etcd_etcd-k8s-m2_kube-system_8fc15ff127d417c1e3b2180d50ce85e3_1
# docker stop 5f75788ca082 88b1dcb2e14f
5f75788ca082
88b1dcb2e14f

2.3 Delete the corrupted member data

In this step, simply delete the damaged data; once the node rejoins the cluster, the data is synced over automatically from the other members.

# rm -f /var/lib/etcd/member/wal/*
# rm -f /var/lib/etcd/member/snap/*
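A slightly safer variant (my suggestion, not part of the original procedure) is to move the damaged member directory aside instead of deleting it, so there is something to fall back on if the re-sync fails:

```shell
# Move the damaged member data aside instead of deleting it.
# /var/lib/etcd is the data dir used in this article; adjust if yours differs.
backup_member_data() {
  local data_dir="$1"
  local backup_dir="${data_dir}.bak-$(date +%Y%m%d%H%M%S)"
  if [ -d "$data_dir/member" ]; then
    mkdir -p "$backup_dir"
    mv "$data_dir/member" "$backup_dir/"
    echo "member data moved to $backup_dir"
  else
    echo "no member directory under $data_dir" >&2
    return 1
  fi
}

# Usage (on the damaged node, after stopping kubelet and the etcd container):
# backup_member_data /var/lib/etcd
```

Once the node has rejoined and fully synced, the backup directory can be removed.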

2.4 Remove and re-add the failed member from the first node

Get the member list:

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member list
1e2fb9983e528532, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false

Remove the member k8s-m2:

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member remove 1e2fb9983e528532
Member 1e2fb9983e528532 removed from cluster 450f66a1edd8aab3

Add the failed node k8s-m2 back:

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member add k8s-m2 --peer-urls="https://172.0.2.146:2380"
Member 630ebadbb6f56ec1 added to cluster 450f66a1edd8aab3
ETCD_NAME="k8s-m2"
ETCD_INITIAL_CLUSTER="k8s-m2=https://172.0.2.146:2380,k8s-m3=https://172.0.2.234:2380,k8s-m1=https://172.0.2.139:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://172.0.2.146:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

The newly added member is in the unstarted state:

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, unstarted, , https://172.0.2.146:2380, , false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false

2.5 Start kubelet on the failed node

[root@k8s-m2 home]# systemctl start kubelet

2.6 Check the etcd and Kubernetes cluster status

The etcd member k8s-m2 is now in the started state:

# etcdctl --cert /etc/kubernetes/pki/etcd/peer.crt --key /etc/kubernetes/pki/etcd/peer.key  --endpoints https://172.0.2.139:2379 --insecure-skip-tls-verify  member list
947c9889866d299a, started, k8s-m3, https://172.0.2.234:2380, https://172.0.2.234:2379, false
bb6750bbed808391, started, k8s-m2, https://172.0.2.146:2380, https://172.0.2.146:2379, false
e97c0cc82d69a534, started, k8s-m1, https://172.0.2.139:2380, https://172.0.2.139:2379, false

The Kubernetes node k8s-m2 is also Ready:

# kubectl get nodes -o wide
NAME               STATUS     ROLES    AGE   VERSION                                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                KERNEL-VERSION           CONTAINER-RUNTIME
k8s-m1             Ready      master   56d   v1.17.0                                172.0.2.139   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://19.3.6
k8s-m2             Ready      master   56d   v1.17.0                                172.0.2.146   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://19.3.6
k8s-m3             Ready      master   56d   v1.17.0                                172.0.2.234   <none>        CentOS Linux 7 (Core)   3.10.0-1062.el7.x86_64   docker://19.3.6

2.7 Data synced successfully

[root@k8s-m2 ~]# ls /var/lib/etcd/member/wal/ -l
total 312512
-rw-------. 1 root root 64000272 Apr 16 14:23 0000000000000000-0000000000000000.wal
-rw-------. 1 root root 64000432 Apr 17 02:00 0000000000000001-0000000000b0c506.wal
-rw-------. 1 root root 64000440 Apr 17 13:36 0000000000000002-0000000000b3001d.wal
-rw-------. 1 root root 64000000 Apr 17 22:08 0000000000000003-0000000000b53b3a.wal
-rw-------. 1 root root 64000000 Apr 17 13:36 1.tmp

3. Summary

This procedure is essentially the same as adding a member when first deploying an etcd cluster.
Taking it a step further: since a node can be repaired by re-syncing like this, why doesn't etcd implement automatic repair?
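Put together, the steps above amount to the member-replacement flow below. This is my own consolidation rather than a script from the article; the endpoint, member name, peer URL, and certificate paths are the values used above and must be adjusted for another cluster, and the member-ID placeholder has to be looked up with `member list` first. With DRY_RUN=1 (the default) the commands are only printed, not executed:

```shell
# Consolidated repair flow (sketch; values come from this article's cluster).
DRY_RUN=${DRY_RUN:-1}
ENDPOINT="https://172.0.2.139:2379"        # a healthy member
BAD_NAME="k8s-m2"                          # name of the damaged member
BAD_PEER_URL="https://172.0.2.146:2380"    # peer URL of the damaged member
CERT=/etc/kubernetes/pki/etcd/peer.crt
KEY=/etc/kubernetes/pki/etcd/peer.key

run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

repair_flow() {
  # 1. On the damaged node: stop kubelet so the static etcd pod stays down.
  run systemctl stop kubelet
  # 2. Wipe the damaged member data (after stopping the etcd container).
  run rm -rf /var/lib/etcd/member
  # 3. On a healthy node: remove the dead member, then add it back.
  #    Replace <member-id> with the ID shown by 'member list'.
  run etcdctl --cert "$CERT" --key "$KEY" --endpoints "$ENDPOINT" \
      --insecure-skip-tls-verify member remove "<member-id>"
  run etcdctl --cert "$CERT" --key "$KEY" --endpoints "$ENDPOINT" \
      --insecure-skip-tls-verify member add "$BAD_NAME" --peer-urls="$BAD_PEER_URL"
  # 4. On the damaged node: restart kubelet; etcd rejoins and re-syncs.
  run systemctl start kubelet
}

repair_flow
```

The ordering matters: the member must be removed from the cluster before the node rejoins, otherwise the rejoining node's empty data dir conflicts with its old member ID.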
