OpenShift 4 disaster recovery: one master node in a multi-master cluster has failed (machine unavailable)...

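Failure scenario

In an offline (disconnected) OpenShift 4 multi-master cluster, one master node has failed (the machine is unavailable).

The cluster remains usable in this state.

To bring the cluster back to full high availability, the failed node has to be removed and a master node added again.

Current cluster state

Check node status

The failed node is already in the NotReady state: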

[root@kr8s-ocp-tools ~]# oc get nodes -l node-role.kubernetes.io/master

NAME STATUS ROLES AGE VERSION

master-0.ocp4-cluster1.guachen.ocp Ready master 21d v1.16.2

master-1.ocp4-cluster1.guachen.ocp Ready master 21d v1.16.2

master-2.ocp4-cluster1.guachen.ocp NotReady master 21d v1.16.2

[root@kr8s-ocp-tools ~]# oc get pod -A|grep -Ev "Running|Completed"

NAMESPACE NAME READY STATUS RESTARTS AGE

openshift-machine-config-operator etcd-quorum-guard-58696fdc97-422jn 1/1 Terminating 0 144m

openshift-machine-config-operator etcd-quorum-guard-58696fdc97-nsnnp 0/1 Pending 0 6m13s
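
The node check above can also be scripted. The following is a minimal sketch, assuming the plain table layout shown in the transcript; the function name not_ready_nodes is made up for illustration, and the sample text is copied from the output above.

```shell
#!/usr/bin/env bash
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads the table printed by `oc get nodes` on stdin and skips the header row.
# Note: a status such as "Ready,SchedulingDisabled" is also reported here.
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Sample input copied from the transcript above.
sample='NAME STATUS ROLES AGE VERSION
master-0.ocp4-cluster1.guachen.ocp Ready master 21d v1.16.2
master-1.ocp4-cluster1.guachen.ocp Ready master 21d v1.16.2
master-2.ocp4-cluster1.guachen.ocp NotReady master 21d v1.16.2'

printf '%s\n' "$sample" | not_ready_nodes
# prints: master-2.ocp4-cluster1.guachen.ocp
```

Against a live cluster the same filter could be fed with: oc get nodes -l node-role.kubernetes.io/master | not_ready_nodes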

Check etcd cluster-health

Log in to one of the remaining healthy master nodes, for example master-0.ocp4-cluster1.guachen.ocp.

The etcd cluster is currently degraded: only two members are available.

[root@kr8s-ocp-tools ~]# ssh core@master-0.ocp4-cluster1.guachen.ocp

[core@master-0 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh

sh-4.2# export ETCDCTL_API=2

sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 \

--ca-file=/etc/ssl/etcd/ca.crt \

--cert-file=$(find /etc/ssl/ -name *peer*crt) \

--key-file=$(find /etc/ssl/ -name *peer*key) cluster-health


member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379

failed to check the health of member d8cb362c01859289 on https://10.72.44.174:2379: Get https://10.72.44.174:2379/health: dial tcp 10.72.44.174:2379: connect: no route to host

member d8cb362c01859289 is unreachable: [https://10.72.44.174:2379] are all unreachable

member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379

cluster is degraded
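
To avoid copying the member ID by hand, the unreachable member can be picked out of the cluster-health text. This is a sketch that assumes the etcdctl v2 output format shown above; the function name unreachable_members is made up for illustration.

```shell
#!/usr/bin/env bash
# Print the IDs of members that cluster-health reports as unreachable.
# Matches lines of the form: "member <id> is unreachable: ..."
unreachable_members() {
  awk '$1 == "member" && $3 == "is" && $4 == "unreachable:" { print $2 }'
}

# Sample input copied from the transcript above.
sample='member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379
failed to check the health of member d8cb362c01859289 on https://10.72.44.174:2379: Get https://10.72.44.174:2379/health: dial tcp 10.72.44.174:2379: connect: no route to host
member d8cb362c01859289 is unreachable: [https://10.72.44.174:2379] are all unreachable
member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379
cluster is degraded'

printf '%s\n' "$sample" | unreachable_members
# prints: d8cb362c01859289
```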

Recovery procedure

1. Delete the failed node

[root@kr8s-ocp-tools ~]# oc delete node master-2.ocp4-cluster1.guachen.ocp

node "master-2.ocp4-cluster1.guachen.ocp" deleted

[root@kr8s-ocp-tools ~]# oc get nodes -l node-role.kubernetes.io/master

NAME STATUS ROLES AGE VERSION

master-0.ocp4-cluster1.guachen.ocp Ready master 22d v1.16.2

master-1.ocp4-cluster1.guachen.ocp Ready master 22d v1.16.2

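2. Remove the failed etcd member

Log in to one of the remaining healthy master nodes, for example master-0.ocp4-cluster1.guachen.ocp. Pass to member remove the ID that cluster-health reports as unreachable in your own cluster; note that the ID in the transcript below differs from the one in the earlier health-check output.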

[root@kr8s-ocp-tools ~]# ssh core@master-0.ocp4-cluster1.guachen.ocp

[core@master-0 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh

sh-4.2# export ETCDCTL_API=2

sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 \

--ca-file=/etc/ssl/etcd/ca.crt \

--cert-file=$(find /etc/ssl/ -name *peer*crt) \

--key-file=$(find /etc/ssl/ -name *peer*key) member remove 3d95fa872c4a2282

Removed member 3d95fa872c4a2282 from cluster

sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 --ca-file=/etc/ssl/etcd/ca.crt --cert-file=$(find /etc/ssl/ -name *peer*crt) --key-file=$(find /etc/ssl/ -name *peer*key) cluster-health

member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379

member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379

cluster is healthy

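3. Add a new node as a master to restore the full high-availability cluster

Adding a node to an offline cluster works the same way as the original deployment: boot an RHCOS node with the master ign (Ignition) file.

If the ign file used for that node at deployment time still exists, it can be reused; otherwise regenerate it the same way as during deployment.

See the cluster deployment steps for details.

Approve the CSRs generated by the newly added node; there are 4 of them.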

[root@kr8s-ocp-tools ~]# oc get csr -o name | xargs oc adm certificate approve
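
The command above approves every CSR in the cluster. A narrower variant is sketched below, assuming the table printed by oc get csr ends each row with a CONDITION column that reads Pending for unapproved requests; the function name pending_csrs and the sample rows are made up for illustration.

```shell
#!/usr/bin/env bash
# Print the names of CSRs whose last column is "Pending".
# Reads the table printed by `oc get csr` on stdin and skips the header row.
pending_csrs() {
  awk 'NR > 1 && $NF == "Pending" { print $1 }'
}

# Hypothetical sample rows, for illustration only.
sample='NAME AGE REQUESTOR CONDITION
csr-aaaaa 20m system:node:master-0.ocp4-cluster1.guachen.ocp Approved,Issued
csr-bbbbb 2m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending'

printf '%s\n' "$sample" | pending_csrs
# prints: csr-bbbbb
```

With GNU xargs this could be used as: oc get csr | pending_csrs | xargs -r oc adm certificate approve. New requests can appear after the first ones are approved, so the check may need to be repeated until nothing is pending.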

4. Restore etcd membership to a complete etcd cluster

a. Deploy the etcd-signer Pod

Log in to one of the remaining healthy master nodes, for example master-0.ocp4-cluster1.guachen.ocp.

i. Log in to the OpenShift cluster

[root@kr8s-ocp-tools ~]# ssh core@master-0.ocp4-cluster1.guachen.ocp

# A user with cluster-admin privileges is required

[core@master-0 ~]$ oc login https://localhost:6443

Authentication required for https://localhost:6443 (openshift)

Username: admin

Password:

Login successful.

ii. Get the pull specification of the kube-etcd-signer-server image

export KUBE_ETCD_SIGNER_SERVER=$(sudo oc adm release info --image-for kube-etcd-signer-server --registry-config=/var/lib/kubelet/config.json)

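The command above returns a quay.io reference. In an offline environment it has to be handled differently and pointed at the local registry:

export KUBE_ETCD_SIGNER_SERVER=$(sudo crictl pull <your-local-registry>:5000/ocp4/openshift4:<your-version>-kube-etcd-signer-server |awk '{print $7}')

### For example, in my environment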

export KUBE_ETCD_SIGNER_SERVER=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-kube-etcd-signer-server |awk '{print $7}')
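
The same registry and version recur in several later commands, so it can help to build the reference in one place. The sketch below assumes the mirror layout used in this article (registry/ocp4/openshift4:version-component) and that the pull command prints a line of the form "Image is up to date for <ref>", which is why the 7th field is taken; the function names local_image and pulled_ref are made up for illustration.

```shell
#!/usr/bin/env bash
# Build a mirrored image reference in the layout used in this article:
#   <registry>/ocp4/openshift4:<version>-<component>
# $1 = registry host:port, $2 = release version, $3 = component name
local_image() {
  printf '%s/ocp4/openshift4:%s-%s\n' "$1" "$2" "$3"
}

# Pick the 7th whitespace-separated field, as the article's awk filter does.
# Assumption: the pull command prints "Image is up to date for <ref>".
pulled_ref() {
  awk '{ print $7 }'
}

local_image kr8s-ocp-tools:5000 4.3.8 kube-etcd-signer-server
# prints: kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-kube-etcd-signer-server

# "example-ref" is a made-up stand-in for whatever reference crictl reports.
echo 'Image is up to date for example-ref' | pulled_ref
# prints: example-ref
```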

iii. Generate the kube-etcd-cert-signer.yaml file

[core@master-0 ~]$ sudo -E /usr/local/bin/tokenize-signer.sh master-0.ocp4-cluster1.guachen.ocp

iv. Create the etcd-signer Pod

oc create -f assets/manifests/kube-etcd-cert-signer.yaml

b. Join the re-added master node to the etcd cluster

Log in to the newly added master node, for example master-2.ocp4-cluster1.guachen.ocp.

i. Log in to the OpenShift cluster

[root@kr8s-ocp-tools ~]# ssh core@master-2.ocp4-cluster1.guachen.ocp

[core@master-2 ~]$ oc login https://localhost:6443

Authentication required for https://localhost:6443 (openshift)

Username: admin

Password:

Login successful.

ii. Set the environment variables needed to restore the etcd cluster (required by the etcd-member-recover.sh script)

export SETUP_ETCD_ENVIRONMENT=$(sudo oc adm release info --image-for machine-config-operator --registry-config=/var/lib/kubelet/config.json)

export KUBE_CLIENT_AGENT=$(sudo oc adm release info --image-for kube-client-agent --registry-config=/var/lib/kubelet/config.json)

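The commands above resolve to quay.io references. In an offline environment they have to be handled differently and pointed at the local registry:

# Replace the registry host (kr8s-ocp-tools:5000) and version (4.3.8) with your own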

[core@master-2 ~]$ export SETUP_ETCD_ENVIRONMENT=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-machine-config-operator |awk '{print $7}')

[core@master-2 ~]$ export KUBE_CLIENT_AGENT=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-kube-client-agent |awk '{print $7}')

iii. Edit openshift-recovery-tools so that the etcd image it references points at the local registry

# Replace the registry host (kr8s-ocp-tools:5000) and version (4.3.8) with your own

[core@master-2 ~]$ export ETCDIMG=$(sudo crictl pull kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-etcd |awk '{print $7}')

[core@master-2 ~]$ sudo -E sed -i "s?local etcdimg=.*?local etcdimg=\"$ETCDIMG\"?g" /usr/local/bin/openshift-recovery-tools
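
Before editing the script in place, the substitution can be tried on a sample line to see what it rewrites. This sketch reuses the sed expression above without -i; the function name rewrite_etcdimg, the input line, and the ETCDIMG value are stand-ins made up for illustration, not the real contents of openshift-recovery-tools.

```shell
#!/usr/bin/env bash
# Apply the article's substitution to stdin instead of editing a file in place.
# $1 = image reference to write into the "local etcdimg=" assignment
rewrite_etcdimg() {
  sed "s?local etcdimg=.*?local etcdimg=\"$1\"?g"
}

# Stand-in value; in the article this comes from the crictl pull output.
ETCDIMG='kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-etcd'

# Hypothetical input line, for illustration only.
printf '  local etcdimg=quay.io/example/etcd:old\n' | rewrite_etcdimg "$ETCDIMG"
# prints:   local etcdimg="kr8s-ocp-tools:5000/ocp4/openshift4:4.3.8-etcd"
```

The ? delimiter is what allows the image reference to contain / characters without escaping; the substitution would break if the reference itself contained ?, & or a backslash.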

iv. Run the etcd membership recovery script etcd-member-recover.sh

sudo -E /usr/local/bin/etcd-member-recover.sh $IP etcd-member-$hostname

IP is the address of a master node that was healthy before the recovery, here 10.72.44.172 for master-0.ocp4-cluster1.guachen.ocp.

hostname is the hostname of the node whose etcd member is being restored, here master-2.ocp4-cluster1.guachen.ocp.

[core@master-2 ~]$ sudo -E /usr/local/bin/etcd-member-recover.sh 10.72.44.172 etcd-member-master-2.ocp4-cluster1.guachen.ocp

4320daf71e2d45927d66c6a74f46faa6a1bfe7cabb708d81344255fdc289b5bb

etcdctl version: 3.3.17

API version: 3.3

Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/

Backing up /etc/etcd/etcd.conf to ./assets/backup/

Trying to backup etcd client certs..

etcd client certs found in /etc/kubernetes/static-pod-resources/kube-apiserver-pod-9 backing up to ./assets/backup/

Stopping etcd..

Waiting for etcd-member to stop

Waiting for etcd-member to stop

Waiting for etcd-member to stop

Waiting for etcd-member to stop

Local etcd snapshot file not found, backup skipped..

Backing up etcd certificates..

Removing etcd certs..

Populating template /usr/local/share/openshift-recovery/template/etcd-generate-certs.yaml.template

Populating template ./assets/tmp/etcd-generate-certs.stage1

Populating template ./assets/tmp/etcd-generate-certs.stage2

Starting etcd client cert recovery agent..

Waiting for certs to generate... (1/60)

Waiting for certs to generate... (2/60)

Waiting for certs to generate... (3/60)

Waiting for certs to generate... (4/60)

Stopping cert recover..

Waiting for generate-certs to stop

Patching etcd-member manifest..

Updating etcd membership..

Removing etcd data_dir /var/lib/etcd..

Member 3c6458d18aa43907 added to cluster a792367fd9b198cc

ETCD_NAME="etcd-member-master-2.ocp4-cluster1.guachen.ocp"

ETCD_INITIAL_CLUSTER="etcd-member-master-2.ocp4-cluster1.guachen.ocp=https://etcd-2.ocp4-cluster1.guachen.ocp:2380,etcd-member-master-1.ocp4-cluster1.guachen.ocp=https://etcd-1.ocp4-cluster1.guachen.ocp:2380,etcd-member-master-0.ocp4-cluster1.guachen.ocp=https://etcd-0.ocp4-cluster1.guachen.ocp:2380"

ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-2.ocp4-cluster1.guachen.ocp:2380"

ETCD_INITIAL_CLUSTER_STATE="existing"

Starting etcd..

Verify the result

Check node and etcd pod status

[root@kr8s-ocp-tools ~]# oc get nodes -l node-role.kubernetes.io/master

NAME STATUS ROLES AGE VERSION

master-0.ocp4-cluster1.guachen.ocp Ready master 22d v1.16.2

master-1.ocp4-cluster1.guachen.ocp Ready master 22d v1.16.2

master-2.ocp4-cluster1.guachen.ocp Ready master 13m v1.16.2

[root@kr8s-ocp-tools ~]# oc -n openshift-etcd get pod -owide

NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES

etcd-member-master-0.ocp4-cluster1.guachen.ocp 2/2 Running 2 22d 10.72.44.172 master-0.ocp4-cluster1.guachen.ocp

etcd-member-master-1.ocp4-cluster1.guachen.ocp 2/2 Running 2 22d 10.72.44.173 master-1.ocp4-cluster1.guachen.ocp

etcd-member-master-2.ocp4-cluster1.guachen.ocp 2/2 Running 0 68s 10.72.44.174 master-2.ocp4-cluster1.guachen.ocp

Check etcd cluster-health

Log in to the newly added master node, for example master-2.ocp4-cluster1.guachen.ocp.

[root@kr8s-ocp-tools ~]# ssh core@master-2.ocp4-cluster1.guachen.ocp

[core@master-2 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh

sh-4.2# export ETCDCTL_API=2

sh-4.2# etcdctl -C https://master-0.ocp4-cluster1.guachen.ocp:2379 \

--ca-file=/etc/ssl/etcd/ca.crt \

--cert-file=$(find /etc/ssl/ -name *peer*crt) \

--key-file=$(find /etc/ssl/ -name *peer*key) cluster-health

member 3c6458d18aa43907 is healthy: got healthy result from https://10.72.44.174:2379

member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379

member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379

cluster is healthy

The etcd cluster now has 3 members and reports a healthy state.
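
The final check can be turned into a pass/fail test. This is a sketch that assumes the etcdctl v2 cluster-health format shown above; the function name cluster_ok is made up for illustration.

```shell
#!/usr/bin/env bash
# Succeed only if the cluster-health text on stdin reports exactly the expected
# number of healthy members and ends with the "cluster is healthy" verdict.
# $1 = expected number of members
cluster_ok() {
  awk -v want="$1" '
    $1 == "member" && $3 == "is" && $4 == "healthy:" { healthy++ }
    $0 == "cluster is healthy" { ok = 1 }
    END { exit !(ok && healthy + 0 == want + 0) }
  '
}

# Sample input copied from the transcript above.
sample='member 3c6458d18aa43907 is healthy: got healthy result from https://10.72.44.174:2379
member 57c7ac1766477035 is healthy: got healthy result from https://10.72.44.173:2379
member e18cfd0175af8004 is healthy: got healthy result from https://10.72.44.172:2379
cluster is healthy'

if printf '%s\n' "$sample" | cluster_ok 3; then
  echo "etcd has 3 healthy members"
fi
# prints: etcd has 3 healthy members
```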

After the recovery is complete, delete the etcd-signer pod

[root@kr8s-ocp-tools ~]# oc delete pod -n openshift-config etcd-signer
