Problem Statement
A node in the cluster was powered off after a failure, and investigation showed that it cannot be repaired and rejoined to the cluster for the time being. The four-node cluster therefore needs to become a three-node cluster: the powered-off node3's information must be deleted from the cluster so that the cluster keeps running normally with the remaining three nodes.
Throughout this article the powered-off node is node3. The cluster status at the time node3 was shut down is as follows:
[huai@node0 ~]# pcs status
Cluster name: my_cluster
Cluster Summary:
* Stack: corosync
* Current DC: node2 (version 2.0.7-1.oe2205-ba59ce7147) - partition with quorum
* Last updated: Wed Jan 11 17:55:03 2023
* Last change: Wed Jan 11 17:13:13 2023 by hacluster via crmd on node1
* 4 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node0 node1 node2 ]
* OFFLINE: [ node3 ]
Full List of Resources:
* Resource Group: skl_data_group:
* skl_shared (ocf::heartbeat:Filesystem): Started node0
* skl_metadata (ocf::heartbeat:Filesystem): Started node0
* Resource Group: skl_service:
* skl_mysql (ocf::heartbeat:mysql): Started node0
* skl_tomcat (ocf::heartbeat:tomcat): Started node0
* webip (ocf::heartbeat:IPaddr2): Started node0
* httpd (lsb:apache2): Started node0
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
Problem Analysis
Since the task is to delete a node, the first thing that comes to mind is the pcs remove command. But does remove on its own actually work?
[huai@node0 ~]# pcs cluster node remove node3 --force
Error: Unable to connect to node3 (Failed to connect to node3 port 2224 after 2005 ms: No route to host), use --skip-offline to override
Warning: Unable to connect to node3 (Failed to connect to node3 port 2224 after 3066 ms: No route to host)
Warning: Unable to determine whether this action will cause a loss of the quorum
Error: Errors have occurred, therefore pcs is unable to continue
The attempt above makes the problem obvious: even though node3 is being removed, remove still tries to connect to it. node3 is powered off, so the connection is bound to fail, and that failure aborts the entire removal. In other words, plain remove is a dead end.
So can the cluster be told not to contact the powered-off node while deleting it? It can.
A closer look at the error output reveals the key hint: "use --skip-offline to override". That is, when deleting a node with remove, the --skip-offline option skips the attempt to connect to the powered-off node, letting the cluster delete the node's information and push the updated corosync.conf to every remaining node.
Solution
[huai@node0 ~]# pcs cluster node remove node3 --force --skip-offline
Warning: Omitting node 'node3'
Warning: Unable to connect to node3 (Failed to connect to node3 port 2224 after 3130 ms: No route to host)
Warning: Unable to determine whether this action will cause a loss of the quorum
Destroying cluster on hosts: 'node3'...
Warning: Unable to connect to node3 (Failed to connect to node3 port 2224 after 3071 ms: No route to host)
Warning: Removed node 'node3' could not be reached and subsequently deconfigured. Run 'pcs cluster destroy' on the unreachable node.
Sending updated corosync.conf to nodes...
node2: Succeeded
node0: Succeeded
node1: Succeeded
node0: Corosync configuration reloaded
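One warning in the output deserves attention: because node3 was unreachable, pcs removed it from the cluster but could not deconfigure node3 itself, so stale cluster configuration is left behind on node3's own disk. If node3 is ever repaired and powered back on, that leftover state should be wiped before the machine joins any cluster again. A minimal sketch (the re-add step is only an illustration; the addr= values are placeholders for node3's actual ring addresses):

[huai@node3 ~]# pcs cluster destroy    # on the repaired node3: wipe the stale configuration, as the warning advises
[huai@node0 ~]# pcs cluster node add node3 addr=<ring0-addr> addr=<ring1-addr>    # optional: re-add node3 once it is healthy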
With that caveat noted, the command has removed node3's information from the cluster successfully. The Pacemaker cluster status after the removal is as follows:
[huai@node0 ~]# pcs status
Cluster name: my_cluster
Cluster Summary:
* Stack: corosync
* Current DC: node0 (version 2.0.7-1.oe2205-ba59ce7147) - partition with quorum
* Last updated: Wed Jan 11 16:17:49 2023
* Last change: Wed Jan 11 16:17:39 2023 by hacluster via crm_node on node2
* 3 nodes configured
* 6 resource instances configured
Node List:
* Online: [ node0 node1 node2 ]
Full List of Resources:
* Resource Group: skl_data_group:
* skl_shared (ocf::heartbeat:Filesystem): Started node0
* skl_metadata (ocf::heartbeat:Filesystem): Started node0
* Resource Group: skl_service:
* skl_mysql (ocf::heartbeat:mysql): Started node0
* skl_tomcat (ocf::heartbeat:tomcat): Started node0
* webip (ocf::heartbeat:IPaddr2): Started node0
* httpd (lsb:apache2): Started node0
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled
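Both removal attempts also warned that pcs was "Unable to determine whether this action will cause a loss of the quorum". After the operation this is easy to verify: votequorum now expects 3 votes, quorum is reached at 2 (expected_votes/2 + 1), and all three remaining nodes are online, so the partition stays quorate and can even survive one further node failure. Either of the following standard tools (shown here as a suggested check) prints the vote counts:

[huai@node0 ~]# pcs quorum status       # expected votes, total votes and the quorate flag, via pcs
[huai@node0 ~]# corosync-quorumtool -s  # the same information straight from corosync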
After node3 has been removed, corosync.conf looks like this:
[huai@node0 ~]# cat /etc/corosync/corosync.conf
totem {
    version: 2
    cluster_name: my_cluster
    secauth: on
    transport: knet
    rrp_mode: passive
    crypto_cipher: aes256
    crypto_hash: sha256
}

nodelist {
    node {
        ring0_addr: 192.168.20.5
        ring1_addr: 192.168.21.5
        name: node0
        nodeid: 1
    }

    node {
        ring0_addr: 192.168.20.6
        ring1_addr: 192.168.21.6
        name: node1
        nodeid: 2
    }

    node {
        ring0_addr: 192.168.20.7
        ring1_addr: 192.168.21.7
        name: node2
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
    timestamp: on
}
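The file above only shows that the on-disk configuration was rewritten; the last line of the removal output ("node0: Corosync configuration reloaded") claims the running corosync picked it up as well. To double-check that the runtime membership matches the file, the corosync runtime database can be queried directly (a suggested verification, not part of the original procedure):

[huai@node0 ~]# corosync-cmapctl | grep ^nodelist    # runtime nodelist: should contain only node0, node1 and node2
[huai@node0 ~]# corosync-cfgtool -s                  # link status of the local node's knet rings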