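I. Background
During routine maintenance of lightdb, or while troubleshooting a production problem, we constantly switch between several console windows. When the service has to be stopped, it is easy to run lt_ctl stop on the primary node first by mistake, which directly produces two primary nodes. A task that should take 10 minutes may then take half an hour, adding work for no reason.
To guard against this kind of incident and against mistakes by new users, we added prompts and checks to lt_ctl stop/restart for the case where it is run on the primary node of a lightdb instance in an HA architecture.
II. Changes
1. The normal procedure for stopping the primary node in an HA architecture:
a) First check the cluster status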
[lightdb@centos7-ha-master ~]$ ltcluster -f ./ltcluster.conf service status
ID | Name | Role | Status | Upstream | ltclusterd | PID | Paused? | Upstream last seen
----+--------------------------+---------+-----------+--------------------------+------------+------+---------+--------------------
1 | lightdbCluster1002885432 | primary | * running | | running | 2282 | no | n/a
2 | lightdbCluster1002995432 | standby | running | lightdbCluster1002885432 | running | 1586 | no | n/a
b) If the Paused? column shows no, as it does above, run the following command
[lightdb@centos7-ha-master ~]$ ltcluster -f ./ltcluster.conf service pause
NOTICE: node 1 (lightdbCluster1002885432) paused
NOTICE: node 2 (lightdbCluster1002995432) paused
c) The Paused? state of the primary and standby nodes is now as follows
[lightdb@centos7-ha-master ~]$ ltcluster -f ./ltcluster.conf service status
ID | Name | Role | Status | Upstream | ltclusterd | PID | Paused? | Upstream last seen
----+--------------------------+---------+-----------+--------------------------+------------+------+---------+--------------------
1 | lightdbCluster1002885432 | primary | * running | | running | 2282 | yes | n/a
2 | lightdbCluster1002995432 | standby | running | lightdbCluster1002885432 | running | 1586 | yes | 1 second(s) ago
d) Then run the stop command
[lightdb@centos7-ha-master ~]$ lt_ctl stop
waiting for server to shut down................................................... done
server stopped
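At this point the primary node has stopped normally and the VIP has been released. The standby node keeps waiting for the primary to recover and is not promoted to primary. Its status looks like this: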
[lightdb@centos7-ha-standby ~]$ ltcluster -f ./ltcluster.conf service status
ID | Name | Role | Status | Upstream | ltclusterd | PID | Paused? | Upstream last seen
----+--------------------------+---------+---------------+----------------------------+------------+------+---------+--------------------
1 | lightdbCluster1002885432 | primary | ? unreachable | ? | n/a | n/a | n/a | n/a
2 | lightdbCluster1002995432 | standby | running | ? lightdbCluster1002885432 | running | 1586 | yes | 352 second(s) ago
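Steps a) to d) above can be sketched as a small shell helper. This is only an illustration under stated assumptions: the function names unpaused_nodes and safe_stop_primary are hypothetical, the parsing assumes the status table keeps the column layout shown above (Paused? as the eighth "|"-separated field), and any value other than yes, including n/a, is treated as "not confirmed paused".

```shell
#!/bin/sh
# Hypothetical helper, not part of lightdb: reads `ltcluster service status`
# output on stdin and prints "ID STATE" for every node whose Paused? column
# is anything other than "yes".
unpaused_nodes() {
    awk -F'|' '
        # Data rows start with a numeric node ID; skip header and separator.
        $1 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            id = $1
            paused = $8
            gsub(/[ \t]/, "", id)
            gsub(/[ \t]/, "", paused)
            if (paused != "yes") print id, paused
        }'
}

# Hypothetical wrapper around steps a) to d). The configuration path is an
# assumption taken from the examples above; pass another path as $1 if needed.
safe_stop_primary() {
    conf=${1:-./ltcluster.conf}

    # a) + b) Check the cluster status and pause if any node is not paused.
    if [ -n "$(ltcluster -f "$conf" service status | unpaused_nodes)" ]; then
        ltcluster -f "$conf" service pause || return 1
    fi

    # c) Confirm that every node now reports Paused? = yes.
    left=$(ltcluster -f "$conf" service status | unpaused_nodes)
    if [ -n "$left" ]; then
        echo "nodes still not paused: $left" >&2
        return 1
    fi

    # d) Only now stop the primary.
    lt_ctl stop
}
```

Sourcing the file only defines the two functions; nothing is stopped until safe_stop_primary is called on the primary node.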
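2. Changes made to lt_ctl, based on the correct procedure.
a) When lt_ctl stop/restart is run, it first checks whether the node is running in HA mode. If it is not, the stop proceeds normally.
b) If the node is in HA mode, it then checks whether the current node is a standby. If it is, the stop proceeds normally.
c) If the node is in HA mode, is the primary, and its Paused? state is no, the stop is refused and a message is printed, as shown below. This is what prevents the accident.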
[lightdb@centos7-ha-master ~]$ lt_ctl stop
The ltcluster Paused status must 'yes' or '1', current state:0
d) In special cases where the primary node really has to be stopped, the newly added -F option forces the stop.
[lightdb@centos7-ha-master ~]$ lt_ctl stop -F
Message: In the HA architecture, force stop the primary node may cause a switchover.
waiting for server to shut down........................... done
server stopped
In this case, of course, a dual-primary situation is unavoidable, and the user has to resolve the resulting problems themselves.
[lightdb@centos7-ha-standby ~]$ ltcluster -f ./ltcluster.conf service status
ID | Name | Role | Status | Upstream | ltclusterd | PID | Paused? | Upstream last seen
----+--------------------------+---------+-----------+----------+------------+------+---------+--------------------
1 | lightdbCluster1002885432 | primary | - failed | ? | n/a | n/a | n/a | n/a
2 | lightdbCluster1002995432 | primary | * running | | running | 1586 | no | n/a
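The checks in a) to d) amount to a short decision chain, which can be sketched in shell as follows. This is not the actual lt_ctl code. The function name stop_decision, its yes/no arguments, and the printed words allow, allow-forced and deny are all hypothetical, and the sketch assumes that -F only matters when the stop would otherwise be refused.

```shell
#!/bin/sh
# Hypothetical model of the checks described above; not the lt_ctl source.
# Usage: stop_decision HA_MODE ROLE PAUSED FORCE
#   HA_MODE: yes|no   ROLE: primary|standby   PAUSED: yes|no   FORCE: yes|no
# Prints allow, allow-forced or deny; returns non-zero only for deny.
stop_decision() {
    ha_mode=${1:-no}
    role=${2:-standby}
    paused=${3:-no}
    force=${4:-no}

    # a) Not in HA mode: stop normally.
    if [ "$ha_mode" != "yes" ]; then
        echo allow
        return 0
    fi

    # b) HA mode, standby node: stop normally.
    if [ "$role" = "standby" ]; then
        echo allow
        return 0
    fi

    # HA mode, primary node, ltcluster already paused: stop normally.
    if [ "$paused" = "yes" ]; then
        echo allow
        return 0
    fi

    # d) HA mode, unpaused primary, but -F was given: stop with a warning.
    if [ "$force" = "yes" ]; then
        echo allow-forced
        return 0
    fi

    # c) HA mode, unpaused primary, no -F: refuse.
    echo deny
    return 1
}
```

For example, stop_decision yes primary no no prints deny, which corresponds to the refused lt_ctl stop shown in c), while stop_decision yes primary no yes prints allow-forced, which corresponds to lt_ctl stop -F.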
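III. Summary
With the changes above, an accidental lt_ctl stop/restart on an unpaused primary node no longer produces two primary nodes in an HA architecture.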