Problem Statement
A node in the cluster was powered off after a failure, and investigation showed that it cannot be repaired and rejoined to the cluster for the time being. The four-node cluster therefore needs to become a three-node cluster: the powered-off node3's information must be deleted from the cluster so that the cluster keeps running normally with the remaining three nodes.
Throughout this article the powered-off node is node3. The cluster status at the time node3 was shut down is as follows:
[huai@node0 ~]# pcs status
Cluster name: my_cluster
Cluster Summary:
* Stack: corosync
* Current DC: node2 (version 2.0.7-1.oe2205-ba59ce7147) - partition with quorum
* Last updated: Wed Jan 11 17:55:03 2023
* Last change: Wed Jan 11 17:13:13 2023 by hacluster via crmd on node1
* 4 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node0 node1 node2 ]
* OFFLINE: [ node3 ]
Full List of Resources:
* Resource Group: skl_data_group:
* skl_shared (ocf::heartbeat:Filesystem): Started node0
* skl_metadata (ocf::heartbeat:Filesystem): Started node0
* Resource Group: skl_service:
* skl_mysql (ocf::heartbeat:mysql): Started node0
* skl_tomcat (ocf::heartbeat:tomcat): Started node0
* webip (ocf::heartbeat:IPaddr2): Started node0
* httpd (lsb:apache2): Started node0
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
Problem Analysis
Since the task is to delete a node, the first thing that comes to mind is the pcs remove command. But does remove on its own actually work?
[huai@node0 ~]# pcs cluster node remove node3 --force
Error: Unable to connect to node3 (Failed to connect to node3 port 2224 after 2005 ms: No route to host), use --skip-offline to override
Warning: Unable to connect to node3 (Failed to connect to node3 port 2224 after 3066 ms: No route to host)
Warning: Unable to determine whether this action will cause a loss of the quorum
Error: Errors have occurred, therefore pcs is unable to continue
The attempt above makes the problem obvious: even though node3 is being removed, remove still tries to connect to it. node3 is powered off, so the connection is bound to fail, and that failure aborts the entire removal. In other words, plain remove is a dead end.
So can the cluster be told not to contact the powered-off node while deleting it? It can.
A closer look at the error output reveals the key hint: "use --skip-offline to override". That is, when deleting a node with remove, the --skip-offline option skips the attempt to connect to the powered-off node, letting the cluster delete the node's information and push the updated corosync.conf to every remaining node.
Solution
[huai@node0 ~]# pcs cluster node remove node3 --force --skip-offline
Warning: Omitting node 'node3'
Warning: Unable to connect to node3 (Failed to connect to node3 port 2224 after 3130 ms: No route to host)
Warning: Unable to determine whether this action will cause a loss of the quorum
Destroying cluster on hosts: 'node3'...
Warning: Unable to connect to node3 (Failed to connect to node3 port 2224 after 3071 ms: No route to host)
Warning: Removed node 'node3' could not be reached and subsequently deconfigured. Run 'pcs cluster destroy' on the unreachable node.
Sending updated corosync.conf to nodes...
node2: Succeeded
node0: Succeeded
node1: Succeeded
node0: Corosync configuration reloaded
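One warning in the output deserves attention: because node3 was unreachable, pcs removed it from the cluster but could not deconfigure node3 itself, so stale cluster configuration is left behind on node3's own disk. If node3 is ever repaired and powered back on, that leftover state should be wiped before the machine joins any cluster again. A minimal sketch (the re-add step is only an illustration; the addr= values are placeholders for node3's actual ring addresses):

[huai@node3 ~]# pcs cluster destroy    # on the repaired node3: wipe the stale configuration, as the warning advises
[huai@node0 ~]# pcs cluster node add node3 addr=<ring0-addr> addr=<ring1-addr>    # optional: re-add node3 once it is healthy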
With that caveat noted, the command has removed node3's information from the cluster successfully. The Pacemaker cluster status after the removal is as follows:
[huai@node0 ~]# pcs status
Cluster name: my_cluster
Cluster Summary:
* Stack: corosync
* Current DC: node0 (version 2.0.7-1.oe2205-ba59ce7147) - partition with quorum
* Last updated: Wed Jan 11 16:17:49 2023
* Last change: Wed Jan 11 16:17:39 2023 by hacluster via crm_node on node2
* 3 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node0 node1 node2 ]
Full List of Resources:
* Resource Group: skl_data_group:
* skl_shared (ocf::heartbeat:Filesystem): Started node0
* skl_metadata (ocf::heartbeat:Filesystem): Started node0
* Resource Group: skl_service:
* skl_mysql (ocf::heartbeat:mysql): Started node0
* skl_tomcat (ocf::heartbeat:tomcat): Started node0
* webip (ocf::heartbeat:IPaddr2): Started node0
* httpd (lsb:apache2): Started node0
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
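Both removal attempts also warned that pcs was "Unable to determine whether this action will cause a loss of the quorum". After the operation this is easy to verify: votequorum now expects 3 votes, quorum is reached at 2 (expected_votes/2 + 1), and all three remaining nodes are online, so the partition stays quorate and can even survive one further node failure. Either of the following standard tools (shown here as a suggested check) prints the vote counts:

[huai@node0 ~]# pcs quorum status       # expected votes, total votes and the quorate flag, via pcs
[huai@node0 ~]# corosync-quorumtool -s  # the same information straight from corosync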
After node3 has been removed, corosync.conf looks like this:
[huai@node0 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: my_cluster
    secauth: on
    transport: knet
    rrp_mode: passive
    crypto_cipher: aes256
    crypto_hash: sha256
}

nodelist {
    node {
        ring0_addr: 192.168.20.5
        ring1_addr: 192.168.21.5
        name: node0
        nodeid: 1
    }

    node {
        ring0_addr: 192.168.20.6
        ring1_addr: 192.168.21.6
        name: node1
        nodeid: 2
    }

    node {
        ring0_addr: 192.168.20.7
        ring1_addr: 192.168.21.7
        name: node2
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
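The file above only shows that the on-disk configuration was rewritten; the last line of the removal output ("node0: Corosync configuration reloaded") claims the running corosync picked it up as well. To double-check that the runtime membership matches the file, the corosync runtime database can be queried directly (a suggested verification, not part of the original procedure):

[huai@node0 ~]# corosync-cmapctl | grep ^nodelist    # runtime nodelist: should contain only node0, node1 and node2
[huai@node0 ~]# corosync-cfgtool -s                  # link status of the local node's knet rings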