Detailed Steps for etcd Cluster Failure Recovery

Scenario 1: A single node in the cluster fails

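Member information is persisted to disk, so a node that has lost its data must rejoin the cluster as a new member. Follow these steps strictly, in order: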

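  1. Remove the failed node: use the member remove command to evict the faulty member, so that the remaining cluster stays healthy.
  2. Clean the data directory thoroughly: stop the etcd service on the failed node, then delete its data dir. This ensures the old member information is wiped (the member directory must be empty).
  3. Expand the cluster: use the member add command to add the failed node from step 1 back as a new member.
  4. Restart: start etcd on the failed node from step 1.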
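
Every etcdctl call in this walkthrough repeats the same TLS flags and endpoint list. As an optional shortcut, etcdctl also reads these settings from ETCDCTL_-prefixed environment variables. The sketch below sets them once, using the certificate paths and addresses that appear in this article; adjust them for your own cluster. etcdctl may complain if the same option is supplied both as a variable and as a flag, so use one or the other.

```shell
# Sketch: set etcdctl connection options once per shell session.
# Paths and addresses are the ones used in this article's transcripts.
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/etcd/pki/etcd-ca.pem
export ETCDCTL_CERT=/etc/etcd/pki/etcd-server.pem
export ETCDCTL_KEY=/etc/etcd/pki/etcd-server-key.pem
export ETCDCTL_ENDPOINTS=https://10.0.2.10:2379,https://10.0.2.11:2379,https://10.0.2.12:2379

# With the variables set, the commands shrink to, for example:
#   etcdctl endpoint health -w table
#   etcdctl member list -w table
```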
[root@node01 ~]# etcdctl endpoint health -w table \
>    --cacert=/etc/etcd/pki/etcd-ca.pem \
>    --key=/etc/etcd/pki/etcd-server-key.pem \
>    --cert=/etc/etcd/pki/etcd-server.pem \
>    --endpoints https://10.0.2.10:2379,https://10.0.2.11:2379,https://10.0.2.12:2379
{"level":"warn","ts":"2023-01-19T15:46:51.582+0800","logger":"client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000324fc0/10.0.2.12:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.2.12:2379: connect: connection refused\""}
+------------------------+--------+--------------+---------------------------+
|        ENDPOINT        | HEALTH |     TOOK     |           ERROR           |
+------------------------+--------+--------------+---------------------------+
| https://10.0.2.10:2379 |   true |  16.441802ms |                           |
| https://10.0.2.11:2379 |   true |  19.920028ms |                           |
| https://10.0.2.12:2379 |  false | 5.000833537s | context deadline exceeded |
+------------------------+--------+--------------+---------------------------+

[root@node01 ~]# etcdctl member list -w table \
>    --cacert=/etc/etcd/pki/etcd-ca.pem \
>    --key=/etc/etcd/pki/etcd-server-key.pem \
>    --cert=/etc/etcd/pki/etcd-server.pem \
>    --endpoints https://10.0.2.10:2379,https://10.0.2.11:2379,https://10.0.2.12:2379
+------------------+---------+-------+------------------------+------------------------+------------+
|        ID        | STATUS  | NAME  |       PEER ADDRS       |      CLIENT ADDRS      | IS LEARNER |
+------------------+---------+-------+------------------------+------------------------+------------+
| 4ef681e2655d2a35 | started | etcd1 | https://10.0.2.10:2380 | https://10.0.2.10:2379 |      false |
| 6c9d6a0746c7546b | started | etcd2 | https://10.0.2.11:2380 | https://10.0.2.11:2379 |      false |
| eb20e4406d78f7f7 | started | etcd3 | https://10.0.2.12:2380 | https://10.0.2.12:2379 |      false |
+------------------+---------+-------+------------------------+------------------------+------------+
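
The member ID needed for member remove is read off the table above by hand. If you would rather script that lookup, the sketch below may help. It assumes the default (non-table) output of etcdctl member list, which prints one comma-separated line per member in the same column order as the table. The function name member_id_by_name is hypothetical, not part of etcdctl.

```shell
# Hypothetical helper: print the ID of the member whose NAME column equals $1.
# Reads the default comma-separated `etcdctl member list` output on stdin
# (assumed column order: ID, STATUS, NAME, PEER ADDRS, CLIENT ADDRS, IS LEARNER).
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Example usage against a live cluster (TLS flags omitted for brevity):
#   id=$(etcdctl member list | member_id_by_name etcd3)
#   etcdctl member remove "$id"
```

An empty result means no member with that name exists, so check for that before calling member remove.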

[root@node01 ~]# etcdctl member remove eb20e4406d78f7f7 \
>    --cacert=/etc/etcd/pki/etcd-ca.pem \
>    --key=/etc/etcd/pki/etcd-server-key.pem \
>    --cert=/etc/etcd/pki/etcd-server.pem \
>    --endpoints https://10.0.2.10:2379,https://10.0.2.11:2379,https://10.0.2.12:2379
Member eb20e4406d78f7f7 removed from cluster de54f873fa2bd441

[root@node01 ~]# etcdctl member add etcd3 --peer-urls=https://10.0.2.12:2380 \
>    --cacert=/etc/etcd/pki/etcd-ca.pem \
>    --key=/etc/etcd/pki/etcd-server-key.pem \
>    --cert=/etc/etcd/pki/etcd-server.pem \
>    --endpoints https://10.0.2.10:2379,https://10.0.2.11:2379,https://10.0.2.12:2379
Member ebe50cbfc9552823 added to cluster de54f873fa2bd441

ETCD_NAME="etcd3"
ETCD_INITIAL_CLUSTER="etcd1=https://10.0.2.10:2380,etcd2=https://10.0.2.11:2380,etcd3=https://10.0.2.12:2380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://10.0.2.12:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"

# On the failed node, set ETCD_INITIAL_CLUSTER_STATE="existing" and restart etcd
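
The transcripts do not show step 2, cleaning the old data on the failed node, which has to happen before etcd is started there again. A minimal sketch follows, assuming the data directory is /var/lib/etcd (the path used later in this article) and that etcd runs as a systemd unit named etcd. The function name clean_member_dir is hypothetical.

```shell
# Hypothetical helper: remove the member/ subdirectory of an etcd data dir.
# Refuses to run when the path is empty or "/" to guard against typos.
clean_member_dir() {
    data_dir="${1%/}"            # strip one trailing slash; "/" becomes ""
    if [ -z "$data_dir" ]; then
        echo "refusing to clean: data dir is empty or /" >&2
        return 1
    fi
    rm -rf -- "$data_dir/member"
}

# On the failed node only:
#   systemctl stop etcd
#   clean_member_dir /var/lib/etcd
#   systemctl start etcd   # after ETCD_INITIAL_CLUSTER_STATE is set to "existing"
```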

[root@node01 ~]# etcdctl endpoint status \
>    --cacert=/etc/etcd/pki/etcd-ca.pem \
>    --key=/etc/etcd/pki/etcd-server-key.pem \
>    --cert=/etc/etcd/pki/etcd-server.pem \
>    --endpoints https://10.0.2.10:2379,https://10.0.2.11:2379,https://10.0.2.12:2379
https://10.0.2.10:2379, 4ef681e2655d2a35, 3.5.4, 3.0 MB, false, false, 32, 375709, 375709,
https://10.0.2.11:2379, 6c9d6a0746c7546b, 3.5.4, 3.0 MB, true, false, 32, 375709, 375709,
https://10.0.2.12:2379, ebe50cbfc9552823, 3.5.4, 3.0 MB, false, false, 32, 375709, 375709,
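
In the output above, the fifth column is taken here to be the "is leader" flag, so exactly one line should show true in that position. That reading of the column order is an assumption based on the etcd 3.5 default output. A small sketch for checking it automatically, with the hypothetical name count_leaders:

```shell
# Hypothetical helper: count endpoints that report themselves as leader.
# Reads the default comma-separated `etcdctl endpoint status` output on stdin
# (assumed column order: endpoint, ID, version, db size, is leader, ...).
count_leaders() {
    awk -F', ' '$5 == "true" { n++ } END { print n + 0 }'
}

# Example usage against a live cluster (TLS flags omitted for brevity):
#   [ "$(etcdctl endpoint status | count_leaders)" -eq 1 ] && echo "one leader"
```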

Scenario 2: Restoring the etcd cluster from a snapshot

Saving an etcd snapshot

https://etcd.io/docs/v3.5/op-guide/recovery/

https://etcd.io/docs/v3.5/upgrades/upgrade_3_5/

[root@node01 ~]# etcdctl snapshot save  backup/$(date +%Y-%m-%d-%H-%M).snapshot \
>     --cacert=/etc/etcd/pki/etcd-ca.pem \
>     --key=/etc/etcd/pki/etcd-server-key.pem \
>     --cert=/etc/etcd/pki/etcd-server.pem \
>     --endpoints https://10.0.2.10:2379
{"level":"info","ts":"2023-01-19T16:11:47.041+0800","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"backup/2023-01-19-16-11.snapshot.part"}
{"level":"info","ts":"2023-01-19T16:11:47.051+0800","logger":"client","caller":"v3/maintenance.go:211","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2023-01-19T16:11:47.051+0800","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://10.0.2.10:2379"}
{"level":"info","ts":"2023-01-19T16:11:47.085+0800","logger":"client","caller":"v3/maintenance.go:219","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2023-01-19T16:11:47.126+0800","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://10.0.2.10:2379","size":"3.0 MB","took":"now"}
{"level":"info","ts":"2023-01-19T16:11:47.126+0800","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"backup/2023-01-19-16-11.snapshot"}
Snapshot saved at backup/2023-01-19-16-11.snapshot
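
The save command above writes into a relative backup/ directory. A minimal sketch, assuming that directory is not created for you: the hypothetical snapshot_path helper creates it if needed and prints a timestamped file name in the same format as the transcript.

```shell
# Hypothetical helper: ensure the backup directory exists and print a
# timestamped snapshot path such as backup/2023-01-19-16-11.snapshot.
snapshot_path() {
    dir="${1:-backup}"
    mkdir -p -- "$dir" || return 1
    printf '%s/%s.snapshot\n' "$dir" "$(date +%Y-%m-%d-%H-%M)"
}

# Example usage (TLS flags as in the transcript above):
#   etcdctl snapshot save "$(snapshot_path backup)" --endpoints https://10.0.2.10:2379
```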

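Restoring from the snapshot

1. First stop the apiserver (make sure no program is writing to etcd).

2. Stop the etcd cluster and empty the data directory on every etcd node.

3. Distribute the snapshot file to every etcd node.

4. Restore the snapshot on every etcd node.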

[root@node01 ~]# etcdutl snapshot restore /tmp/backup/2023-01-19-16-11.snapshot \
    --data-dir=/var/lib/etcd/ --initial-cluster-token="etcd-cluster" \
    --name=etcd1 --initial-advertise-peer-urls=https://10.0.2.10:2380 \
    --initial-cluster="etcd1=https://10.0.2.10:2380,etcd2=https://10.0.2.11:2380,etcd3=https://10.0.2.12:2380"

[root@node02 ~]# etcdutl snapshot restore /tmp/backup/2023-01-19-16-11.snapshot \
    --data-dir=/var/lib/etcd/ --initial-cluster-token="etcd-cluster" \
    --name=etcd2 --initial-advertise-peer-urls=https://10.0.2.11:2380 \
    --initial-cluster="etcd1=https://10.0.2.10:2380,etcd2=https://10.0.2.11:2380,etcd3=https://10.0.2.12:2380"

[root@node03 ~]# etcdutl snapshot restore /tmp/backup/2023-01-19-16-11.snapshot \
    --data-dir=/var/lib/etcd/ --initial-cluster-token="etcd-cluster" \
    --name=etcd3 --initial-advertise-peer-urls=https://10.0.2.12:2380 \
    --initial-cluster="etcd1=https://10.0.2.10:2380,etcd2=https://10.0.2.11:2380,etcd3=https://10.0.2.12:2380"
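
The three restore commands differ only in --name and --initial-advertise-peer-urls, while the --initial-cluster value must be identical on every node. To avoid copy-paste mistakes in that long string, it can be generated from a list of name=IP pairs. This is a sketch only: the helper name initial_cluster is hypothetical, and it assumes every peer listens on https at port 2380, as in this article.

```shell
# Hypothetical helper: build the --initial-cluster value from name=ip pairs.
# Assumes every peer URL has the form https://<ip>:2380.
initial_cluster() {
    out=""
    for pair in "$@"; do
        name="${pair%%=*}"       # text before the first "="
        ip="${pair#*=}"          # text after the first "="
        out="${out:+$out,}${name}=https://${ip}:2380"
    done
    printf '%s\n' "$out"
}

# Example usage on node01:
#   cluster=$(initial_cluster etcd1=10.0.2.10 etcd2=10.0.2.11 etcd3=10.0.2.12)
#   etcdutl snapshot restore /tmp/backup/2023-01-19-16-11.snapshot \
#       --data-dir=/var/lib/etcd/ --initial-cluster-token="etcd-cluster" \
#       --name=etcd1 --initial-advertise-peer-urls=https://10.0.2.10:2380 \
#       --initial-cluster="$cluster"
```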

5. Restart the etcd cluster.

Check the ownership of the etcd data directory first:

chown -R etcd:etcd /var/lib/etcd
systemctl restart etcd

6. Start the apiserver.

