1: Environment where the problem occurred
One etcd node went down because of a disk problem.
Check the cluster health from node1:
[root@node01 ~]# /k8s/etcd/bin/etcdctl --ca-file=/k8s/etcd/ssl/ca.pem --cert-file=/k8s/etcd/ssl/server.pem --key-file=/k8s/etcd/ssl/server-key.pem --endpoints="https://192.168.247.149:2379,https://192.168.247.143:2379,https://192.168.247.144:2379" cluster-health
member 8f4e6ce663f0d49a is healthy: got healthy result from https://192.168.247.143:2379
member b6230d9c6f20feeb is healthy: got healthy result from https://192.168.247.144:2379
failed to check the health of member d618618928dffeba on https://192.168.247.149:2379: Get https://192.168.247.149:2379/health: dial tcp 192.168.247.149:2379: i/o timeout
member d618618928dffeba is unreachable: [https://192.168.247.149:2379] are all unreachable
cluster is degraded
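The failed member can also be picked out of this output mechanically. The following is a minimal sketch, assuming the v2 `cluster-health` line format shown above; the function name `unhealthy_members` is hypothetical and not part of etcdctl.

```shell
#!/bin/bash
# Print the IDs of members that cluster-health does not report as healthy.
# Reads cluster-health output on stdin and assumes lines of the form
#   "member <id> is <state>: ..."
# (hypothetical helper, not part of etcdctl)
unhealthy_members() {
    awk '$1 == "member" && $3 == "is" && $4 != "healthy:" { print $2 }'
}
```

It would be used by piping the same `etcdctl ... cluster-health` command into `unhealthy_members`.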
Switch to the 192.168.247.149 node.
Copy the etcd configuration files, binaries, certificates, and startup script over to it:
[root@node01 ~]# scp -r /k8s root@192.168.247.149:/k8s
The authenticity of host '192.168.247.149 (192.168.247.149)' can't be established.
ECDSA key fingerprint is SHA256:QeJNZeAOre44X0uR34SeAzOr80+OZ173556h07FrT0k.
ECDSA key fingerprint is MD5:e2:4c:4c:bc:ed:a2:e0:03:2c:71:c7:4f:2c:da:32:a8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.247.149' (ECDSA) to the list of known hosts.
root@192.168.247.149's password:
etcd 100% 523 270.3KB/s 00:00
etcd 100% 18MB 77.8MB/s 00:00
etcdctl 100% 15MB 118.9MB/s 00:00
ca-key.pem 100% 1679 1.5MB/s 00:00
ca.pem 100% 1265 361.4KB/s 00:00
server-key.pem 100% 1675 936.5KB/s 00:00
server.pem 100% 1338 1.2MB/s 00:00
[root@node01 ~]# scp /usr/lib/systemd/system/etcd.service root@192.168.247.149:/usr/lib/systemd/system/
root@192.168.247.149's password:
etcd.service 100% 923 3
Then change the parameters to match the local node.
Starting the service fails:
[root@master1 k8s]# systemctl status kube-apiserver.service
● kube-apiserver.service - Kubernetes API Server
Loaded: loaded (/usr/lib/systemd/system/kube-apiserver.service; enabled; vendor preset: disabled)
Active: failed (Result: start-limit) since Thu 2020-04-30 08:22:35 CST; 21s ago
[root@master1 etcd]# journalctl -xe
Apr 30 09:31:32 master1 etcd[51631]: member d618618928dffeba has already been bootstrapped
Apr 30 09:31:32 master1 systemd[1]: etcd.service: main process exited, code=exited, status=1/FAILUR
Apr 30 09:31:32 master1 systemd[1]: Failed to start Etcd Server.
-- Subject: Unit etcd.service has failed
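2: Problem: member d618618928dffeba has already been bootstrapped
Roughly, this means:
One of the members was bootstrapped through the discovery service. The previous data directory must be removed to clean up the member information; otherwise the member ignores the new configuration and starts with the old one. That is why you see the mismatch.
At this point the cause is clear: startup fails because the information recorded in data-dir (/var/lib/etcd/default.etcd) does not match what the etcd startup options specify.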
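Before choosing startup options it can help to know whether the data directory already holds member state. The following is a minimal sketch, relying on the fact that etcd keeps its WAL and snapshots under `<data-dir>/member`; the function name `datadir_state` is hypothetical.

```shell
#!/bin/bash
# Report whether an etcd data directory already contains member state.
# etcd stores its WAL and snapshots under <data-dir>/member.
# (hypothetical helper, not part of etcd)
datadir_state() {
    local dir=$1
    if [ -d "$dir/member" ]; then
        echo "bootstrapped"
    else
        echo "empty"
    fi
}
```

For the node in this article the call would be `datadir_state /var/lib/etcd/default.etcd`.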
The fix used here is to change --initial-cluster-state in the configuration to existing:
#!/bin/bash
# example: ./etcd.sh etcd01 192.168.247.149 etcd02=https://192.168.247.143:2380,etcd03=https://192.168.247.144:2380
ETCD_NAME=$1
ETCD_IP=$2
ETCD_CLUSTER=$3
WORK_DIR=/k8s/etcd
cat <<EOF >$WORK_DIR/cfg/etcd
#[Member]
ETCD_NAME="${ETCD_NAME}"
ETCD_DATA_DIR="/var/lib/etcd/default.etcd"
ETCD_LISTEN_PEER_URLS="https://${ETCD_IP}:2380"
ETCD_LISTEN_CLIENT_URLS="https://${ETCD_IP}:2379"
#[Clustering]
ETCD_INITIAL_ADVERTISE_PEER_URLS="https://${ETCD_IP}:2380"
ETCD_ADVERTISE_CLIENT_URLS="https://${ETCD_IP}:2379"
ETCD_INITIAL_CLUSTER="${ETCD_NAME}=https://${ETCD_IP}:2380,${ETCD_CLUSTER}"
ETCD_INITIAL_CLUSTER_TOKEN="etcd-cluster"
# changed here
ETCD_INITIAL_CLUSTER_STATE="existing"
EOF
cat <<EOF >/usr/lib/systemd/system/etcd.service
[Unit]
Description=Etcd Server
After=network.target
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
EnvironmentFile=${WORK_DIR}/cfg/etcd
ExecStart=${WORK_DIR}/bin/etcd \
--name=\${ETCD_NAME} \
--data-dir=\${ETCD_DATA_DIR} \
--listen-peer-urls=\${ETCD_LISTEN_PEER_URLS} \
--listen-client-urls=\${ETCD_LISTEN_CLIENT_URLS},http://127.0.0.1:2379 \
--advertise-client-urls=\${ETCD_ADVERTISE_CLIENT_URLS} \
--initial-advertise-peer-urls=\${ETCD_INITIAL_ADVERTISE_PEER_URLS} \
--initial-cluster=\${ETCD_INITIAL_CLUSTER} \
--initial-cluster-token=\${ETCD_INITIAL_CLUSTER_TOKEN} \
--initial-cluster-state=existing \
--cert-file=${WORK_DIR}/ssl/server.pem \
--key-file=${WORK_DIR}/ssl/server-key.pem \
--peer-cert-file=${WORK_DIR}/ssl/server.pem \
--peer-key-file=${WORK_DIR}/ssl/server-key.pem \
--trusted-ca-file=${WORK_DIR}/ssl/ca.pem \
--peer-trusted-ca-file=${WORK_DIR}/ssl/ca.pem
Restart=on-failure
LimitNOFILE=65536
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable etcd
systemctl restart etcd
Then rerun the script:
[root@master1 etcd]# bash etcd.sh etcd01 192.168.247.149 etcd02=https://192.168.247.143:2380,etcd03=https://192.168.247.144:2380
It succeeds.
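Because the script takes the peer list as a single positional argument, a malformed list only shows up when etcd fails to start. The following is a minimal sketch of a pre-check, assuming entries of the form `name=https://host:port` as in the invocation above; the function name `valid_cluster_spec` is hypothetical.

```shell
#!/bin/bash
# Return 0 if every comma-separated entry looks like name=http(s)://host:port.
# (hypothetical helper; it checks syntax only, not reachability)
valid_cluster_spec() {
    local spec=$1 entry
    local re='^[A-Za-z0-9_.-]+=https?://[^,=[:space:]]+:[0-9]+$'
    [ -n "$spec" ] || return 1
    local IFS=,
    for entry in $spec; do
        [[ $entry =~ $re ]] || return 1
    done
}
```

It could be called at the top of etcd.sh, for example `valid_cluster_spec "$3" || exit 1`.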
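I also found two other methods online.
The second method is to delete the data-dir on every etcd node (it also works without deleting) and restart the etcd service on each node; the data in every node's data-dir is then refreshed and the failure above no longer occurs. Deleting the data-dir on every node discards everything stored in the cluster, so back it up first.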
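If the data-dir is going to be removed, moving it aside keeps a copy to fall back on. The following is a minimal sketch; the function name `backup_datadir` and the `.bak.<timestamp>` suffix are assumptions for illustration.

```shell
#!/bin/bash
# Move an etcd data directory aside instead of deleting it.
# Prints the backup path; does nothing if the directory does not exist.
# (hypothetical helper; stop etcd before calling it)
backup_datadir() {
    local dir=${1%/}
    [ -d "$dir" ] || return 0
    local backup
    backup="${dir}.bak.$(date +%Y%m%d%H%M%S)"
    mv -- "$dir" "$backup" && echo "$backup"
}
```

A typical sequence would be `systemctl stop etcd`, then `backup_datadir /var/lib/etcd/default.etcd`, then `systemctl start etcd`.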
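The third method is to copy the contents of another node's data-dir, use that copy to force-start a single-member cluster with --force-new-cluster, and then restore the cluster by adding the other nodes back as new members.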