
This article walks through the common high-availability operations for a LightDB cluster, including restarting the primary and the standby, switchover, and recovery after a primary or standby failure. With the ltcluster tool, administrators can perform operations such as standby switchover, standby follow, and node rejoin to keep the database service continuous and the data consistent.

LightDB High Availability Common Operations: Administration

  • Installation environment
Operating system:
    CentOS 7
Server IPs:
    1. 192.168.121.112 (primary)
    2. 192.168.121.113 (standby)
    3. 192.168.121.114 (witness, optional)
  • Restarting the primary

1. First stop keepalived on the standby, as root:

# 1. Get the pid of the standby's keepalived process
[root@localhost ~]# cat /var/run/keepalived.pid
7680
# 2. Kill the keepalived process
[root@localhost ~]# kill 7680
# 3. Confirm the keepalived process is really gone
[root@localhost ~]# ps aux | grep keepalived
root      24464  0.0  0.0 112824   976 pts/1    S+   20:27   0:00 grep --color=auto keepalived
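The first two steps can be combined into one command; a minimal sketch, assuming the pid file path shown above:

# Stop keepalived in one step
[root@localhost ~]# kill $(cat /var/run/keepalived.pid)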
  2. Restart the primary, as the lightdb user; the primary does not serve requests while it restarts
# 1. Pause ltclusterd to prevent automatic failover
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service pause
NOTICE: node 1 (lightdbCluster1921681211125432) paused
NOTICE: node 2 (lightdbCluster1921681211135432) paused
NOTICE: node 4 (lightdbCluster1921681211145432) paused

# 2. Check the cluster status and confirm the primary's Paused? column shows yes
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                       | ltclusterd | PID  | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+--------------------------------+------------+------+---------+--------------------
 1  | lightdbCluster1921681211125432 | primary | * running |                                | running    | 4928 | yes     | n/a                
 2  | lightdbCluster1921681211135432 | standby |   running | lightdbCluster1921681211125432 | running    | 3727 | yes     | 0 second(s) ago    
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211125432 | running    | 3525 | yes     | 0 second(s) ago   
 
# 3. First disconnect all clients and applications connected to the database (otherwise the stop will fail), then stop the primary
# 3.1 The default mode rolls back and disconnects any remaining connections
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA stop
waiting for server to shut down................... done
server stopped
# 3.2 Alternatively, smart mode waits for clients to disconnect on their own
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA stop -m smart
waiting for server to shut down............................................................... failed
lt_ctl: server does not shut down
HINT: The "-m fast" option immediately disconnects sessions rather than
waiting for session-initiated disconnection.
# 3.3 If the smart shutdown fails (as above), fall back to fast mode, or immediate mode as a last resort
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA stop -m fast
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA stop -m immediate
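To see which clients are still connected before stopping, you can query pg_stat_activity; a sketch assuming the ltsql client and the ltcluster metadata database used throughout this article:

# List sessions other than your own; each row is a client that must disconnect
[lightdb@localhost ltcluster]$ ltsql -d ltcluster -c "SELECT pid, usename, application_name, client_addr FROM pg_stat_activity WHERE pid <> pg_backend_pid();"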

# 4. Wait for the database to stop completely

# 5. Modify database parameters or perform other maintenance (a sketch follows)
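For example, a restart-only parameter can be changed while the server is down; the parameter and value here are purely illustrative:

# Edit lightdb.conf before starting the server again (parameter/value are examples)
[lightdb@localhost ltcluster]$ vi $LTDATA/lightdb.conf    # e.g. set max_connections = 500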

# 6. Start the primary
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA start
2023-01-10 20:46:30.865738T,,,,,postmaster,,00000,2023-01-10 20:46:30 CST,0,72843,LOG:  LightDB autoprewarm: prewarm dbnum=0
2023-01-10 20:46:30.869043T,,,,,postmaster,,00000,2023-01-10 20:46:30 CST,0,72843,LOG:  ltaudit extension initialized
waiting for server to start........2023-01-10 20:46:35.267239T,,,,,postmaster,,00000,2023-01-10 20:46:30 CST,0,72843,LOG:  redirecting log output to logging collector process
2023-01-10 20:46:35.267239T,,,,,postmaster,,00000,2023-01-10 20:46:30 CST,0,72843,HINT:  Future log output will appear in directory "log".
done
server started

# 7. Once startup succeeds, verify the cluster status
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                       | ltclusterd | PID  | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+--------------------------------+------------+------+---------+--------------------
 1  | lightdbCluster1921681211125432 | primary | * running |                                | running    | 4928 | yes     | n/a                
 2  | lightdbCluster1921681211135432 | standby |   running | lightdbCluster1921681211125432 | running    | 3727 | yes     | 1 second(s) ago    
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211125432 | running    | 3525 | yes     | 1 second(s) ago

# 8. Resume cluster monitoring (unpause)
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service unpause
NOTICE: node 1 (lightdbCluster1921681211125432) unpaused
NOTICE: node 2 (lightdbCluster1921681211135432) unpaused
NOTICE: node 4 (lightdbCluster1921681211145432) unpaused

# 9. Check the cluster status again
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                       | ltclusterd | PID  | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+--------------------------------+------------+------+---------+--------------------
 1  | lightdbCluster1921681211125432 | primary | * running |                                | running    | 4928 | no      | n/a                
 2  | lightdbCluster1921681211135432 | standby |   running | lightdbCluster1921681211125432 | running    | 3727 | no      | 0 second(s) ago    
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211125432 | running    | 3525 | no      | 0 second(s) ago
  3. Restart the keepalived service on the standby
Open a new session as root and run the following:
[root@localhost bin]# cd /usr/local/lightdb/lightdb-x/13.8-22.3/tools/bin

# Start the keepalived service (run on the standby server)
[root@localhost bin]# ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf

# Check the processes
[root@localhost bin]# ps aux|grep keepalived
root      26657  0.0  0.0  48100  1008 ?        Ss   20:55   0:00 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
root      26658  0.1  0.0  48100  1844 ?        S    20:55   0:00 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
root      26659  0.0  0.0  48100  1364 ?        S    20:55   0:00 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
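keepalived's job here is to hold the cluster's virtual IP (VIP); to confirm which machine currently owns it, check the address list. The VIP below is a placeholder; substitute the one from your keepalived.conf:

# The machine that prints a match currently holds the VIP (address is a placeholder)
[root@localhost bin]# ip addr | grep 192.168.121.200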
  • Restarting the standby

When the standby needs a restart (to change parameters or for other reasons), run the following on the standby:

# 1. Pause ltclusterd to prevent automatic failover
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service pause
NOTICE: node 1 (lightdbCluster1921681211125432) paused
NOTICE: node 2 (lightdbCluster1921681211135432) paused
NOTICE: node 4 (lightdbCluster1921681211145432) paused

# 2. Check the cluster status and confirm the standby's Paused? column shows yes
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                       | ltclusterd | PID  | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+--------------------------------+------------+------+---------+--------------------
 1  | lightdbCluster1921681211125432 | primary | * running |                                | running    | 4928 | yes     | n/a                
 2  | lightdbCluster1921681211135432 | standby |   running | lightdbCluster1921681211125432 | running    | 3727 | yes     | 0 second(s) ago    
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211125432 | running    | 3525 | yes     | 0 second(s) ago   
 
# 3. First disconnect all clients and applications connected to the database (otherwise the stop will fail), then stop the standby
# 3.1 The default mode rolls back and disconnects any remaining connections
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA stop
waiting for server to shut down................... done
server stopped
# 3.2 Alternatively, smart mode waits for clients to disconnect on their own
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA stop -m smart
waiting for server to shut down............................................................... failed
lt_ctl: server does not shut down
HINT: The "-m fast" option immediately disconnects sessions rather than
waiting for session-initiated disconnection.
# 3.3 If the smart shutdown fails (as above), fall back to fast mode, or immediate mode as a last resort
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA stop -m fast
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA stop -m immediate

# 4. Wait for the database to stop. Once the standby is down, you can check the cluster status from the primary machine
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                         | ltclusterd | PID  | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+----------------------------------+------------+------+---------+--------------------
 1  | lightdbCluster1921681211125432 | primary | * running |                                  | running    | 4928 | yes     | n/a                
 2  | lightdbCluster1921681211135432 | standby | - failed  | ? lightdbCluster1921681211125432 | n/a        | n/a  | n/a     | n/a                
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211125432   | running    | 3525 | yes     | 0 second(s) ago    

# 5. Modify database parameters or perform other maintenance

# 6. Start the standby
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA start
2023-01-10 20:46:30.865738T,,,,,postmaster,,00000,2023-01-10 20:46:30 CST,0,72843,LOG:  LightDB autoprewarm: prewarm dbnum=0
2023-01-10 20:46:30.869043T,,,,,postmaster,,00000,2023-01-10 20:46:30 CST,0,72843,LOG:  ltaudit extension initialized
waiting for server to start........2023-01-10 20:46:35.267239T,,,,,postmaster,,00000,2023-01-10 20:46:30 CST,0,72843,LOG:  redirecting log output to logging collector process
2023-01-10 20:46:35.267239T,,,,,postmaster,,00000,2023-01-10 20:46:30 CST,0,72843,HINT:  Future log output will appear in directory "log".
done
server started

# 7. Once startup succeeds, verify the cluster status
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                       | ltclusterd | PID  | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+--------------------------------+------------+------+---------+--------------------
 1  | lightdbCluster1921681211125432 | primary | * running |                                | running    | 4928 | yes     | n/a                
 2  | lightdbCluster1921681211135432 | standby |   running | lightdbCluster1921681211125432 | running    | 3727 | yes     | 1 second(s) ago    
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211125432 | running    | 3525 | yes     | 1 second(s) ago

# 8. Resume cluster monitoring (unpause)
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service unpause
NOTICE: node 1 (lightdbCluster1921681211125432) unpaused
NOTICE: node 2 (lightdbCluster1921681211135432) unpaused
NOTICE: node 4 (lightdbCluster1921681211145432) unpaused

# 9. Check the cluster status again
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                       | ltclusterd | PID  | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+--------------------------------+------------+------+---------+--------------------
 1  | lightdbCluster1921681211125432 | primary | * running |                                | running    | 4928 | no      | n/a                
 2  | lightdbCluster1921681211135432 | standby |   running | lightdbCluster1921681211125432 | running    | 3727 | no      | 0 second(s) ago    
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211125432 | running    | 3525 | no      | 0 second(s) ago
  • Primary/standby switchover

With the cluster running normally, you can swap the primary and standby automatically by running standby switchover on the standby machine.

# 1. Dry run on the standby
[lightdb@localhost ~]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf standby switchover --siblings-follow --dry-run
NOTICE: checking switchover on node "lightdbCluster1921681211135432" (ID: 2) in --dry-run mode
INFO: SSH connection to host "192.168.121.112" succeeded
INFO: able to execute "ltcluster" on remote host "192.168.121.112"
INFO: all sibling nodes are reachable via SSH
INFO: 2 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
INFO: 1 replication slots required, 10 available
INFO: would pause ltclusterd on node "lightdbCluster1921681211125432" (ID 1)
INFO: would pause ltclusterd on node "lightdbCluster1921681211135432" (ID 2)
INFO: would pause ltclusterd on node "lightdbCluster1921681211145432" (ID 4)
NOTICE: local node "lightdbCluster1921681211135432" (ID: 2) would be promoted to primary; current primary "lightdbCluster1921681211125432" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "lightdbCluster1921681211125432":
  "/usr/local/lightdb/lightdb-x/13.8-22.3/bin/lt_ctl  -D '/usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster' -W -m fast stop"
INFO: parameter "shutdown_check_timeout" is set to 1800 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met

# 2. If the output shows "prerequisites for executing STANDBY SWITCHOVER are met", the dry run succeeded and you can proceed
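Optionally, you can also run per-node checks on the standby before switching; ltcluster node check (the same command this tool's own hints reference later in this article) reports replication, slot, and archiving status:

# Optional pre-switchover sanity check on the standby
[lightdb@localhost ~]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf node check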

# 3. Run the switchover for real on the standby machine
[lightdb@localhost ~]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf standby switchover --siblings-follow
NOTICE: executing switchover on node "lightdbCluster1921681211135432" (ID: 2)
NOTICE: local node "lightdbCluster1921681211135432" (ID: 2) will be promoted to primary; current primary "lightdbCluster1921681211125432" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "lightdbCluster1921681211125432" (ID: 1)
NOTICE: issuing CHECKPOINT on node "lightdbCluster1921681211125432" (ID: 1) 
DETAIL: executing server command "/usr/local/lightdb/lightdb-x/13.8-22.3/bin/lt_ctl  -D '/usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster' -W -m fast stop"
INFO: checking for primary shutdown; 1 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 2 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 3 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 4 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 5 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 6 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 7 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 8 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 9 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 10 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 11 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 12 of 1800 attempts ("shutdown_check_timeout")
INFO: checking for primary shutdown; 13 of 1800 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/A0000028
NOTICE: promoting standby to primary
DETAIL: promoting server "lightdbCluster1921681211135432" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "lightdbCluster1921681211135432" (ID: 2) was successfully promoted to primary
INFO: local node 1 can attach to rejoin target node 2
DETAIL: local node's recovery point: 0/A0000028; rejoin target node's fork point: 0/A00000A0
INFO: creating replication slot as user "ltcluster"
NOTICE: setting node 1's upstream to node 2
WARNING: unable to ping "host=192.168.121.112 port=5432 user=ltcluster dbname=ltcluster connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/local/lightdb/lightdb-x/13.8-22.3/bin/lt_ctl  -w -D '/usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster' start"
WARNING: node "lightdbCluster1921681211125432" not found in "pg_stat_replication"
ERROR: connection to database failed
DETAIL: 
could not connect to server: Connection refused
	Is the server running on host "192.168.121.112" and accepting
	TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=ltcluster connect_timeout=2 dbname=ltcluster host=192.168.121.112 port=5432 fallback_application_name=ltcluster options=-csearch_path=
WARNING: unable to connect to local node to check replication slot status
HINT: execute "ltcluster node check" to check inactive slots and drop manually if necessary
NOTICE: NODE REJOIN successful
DETAIL: node 1 is now attached to node 2
WARNING: node "lightdbCluster1921681211125432" not found in "pg_stat_replication"
INFO: waiting for node "lightdbCluster1921681211125432" (ID: 1) to connect to new primary; 1 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: checking for record in node "lightdbCluster1921681211135432"'s "pg_stat_replication" table where "application_name" is "lightdbCluster1921681211125432"
WARNING: node "lightdbCluster1921681211125432" not found in "pg_stat_replication"
WARNING: node "lightdbCluster1921681211125432" not found in "pg_stat_replication"
WARNING: node "lightdbCluster1921681211125432" not found in "pg_stat_replication"
WARNING: node "lightdbCluster1921681211125432" not found in "pg_stat_replication"
WARNING: node "lightdbCluster1921681211125432" attached in state "catchup"
INFO: waiting for node "lightdbCluster1921681211125432" (ID: 1) to connect to new primary; 6 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: node "lightdbCluster1921681211135432" (ID: 1) is currrently attached to its upstream node in state "catchup"
WARNING: node "lightdbCluster1921681211125432" attached in state "catchup"
WARNING: node "lightdbCluster1921681211125432" attached in state "catchup"
WARNING: node "lightdbCluster1921681211125432" attached in state "catchup"
WARNING: node "lightdbCluster1921681211125432" attached in state "catchup"
WARNING: node "lightdbCluster1921681211125432" attached in state "catchup"
INFO: waiting for node "lightdbCluster1921681211125432" (ID: 1) to connect to new primary; 11 of max 60 attempts (parameter "node_rejoin_timeout")
DETAIL: node "lightdbCluster1921681211135432" (ID: 1) is currrently attached to its upstream node in state "catchup"
NOTICE: node  "lightdbCluster1921681211135432" (ID: 2) promoted to primary, node "lightdbCluster1921681211125432" (ID: 1) demoted to standby
NOTICE: executing STANDBY FOLLOW on 1 of 1 siblings
INFO:  node 4 received notification to follow node 2
INFO: STANDBY FOLLOW successfully executed on all reachable sibling nodes
NOTICE: replication slot "ltcluster_slot_2" deleted on node 1
NOTICE: switchover was successful
DETAIL: node "lightdbCluster1921681211135432" is now primary and node "lightdbCluster1921681211125432" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully

# 4. Check the cluster status (node 2 has become the primary)
[lightdb@localhost ~]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                       | ltclusterd | PID  | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+--------------------------------+------------+------+---------+--------------------
 1  | lightdbCluster1921681211125432 | standby |   running | lightdbCluster1921681211135432 | running    | 4928 | no      | 1 second(s) ago    
 2  | lightdbCluster1921681211135432 | primary | * running |                                | running    | 3727 | no      | n/a                
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211135432 | running    | 3525 | no      | 0 second(s) ago    
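To double-check the roles at the SQL level, query each node directly; a sketch assuming the ltsql client and the ltcluster account used in this article's connection strings (pg_is_in_recovery() returns f on the primary, t on a standby):

# f = primary, t = standby
[lightdb@localhost ~]$ ltsql -h 192.168.121.113 -p 5432 -U ltcluster -d ltcluster -c "SELECT pg_is_in_recovery();"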
  • Primary failure recovery

When the primary fails (e.g. the machine goes down), failover automatically promotes the standby to be the new primary so the cluster stays available. Because of the switchover above, the current primary is 192.168.121.113 and the standby is 192.168.121.112.

  1. Simulate a primary failure by stopping the lightdb process on the primary machine (192.168.121.113)
# 1. Check the lightdb processes and confirm lightdb is stopped; if it is not, run step 2 to stop it and simulate the failure
[lightdb@localhost ~]$ ps aux|grep lightdb

# 2. Stop lightdb
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA stop
waiting for server to shut down......... done
server stopped

# 3. Check the status (this fails locally because the server is down)
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
ERROR: connection to database failed
DETAIL: 
could not connect to server: Connection refused
	Is the server running on host "192.168.121.113" and accepting
	TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
  user=ltcluster connect_timeout=2 dbname=ltcluster host=192.168.121.113 port=5432 fallback_application_name=ltcluster options=-csearch_path=

# 4. Query the status from the standby machine (node 1 has been promoted to primary)
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                       | ltclusterd | PID  | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+--------------------------------+------------+------+---------+--------------------
 1  | lightdbCluster1921681211125432 | primary | * running |                                | running    | 4928 | no      | n/a                
 2  | lightdbCluster1921681211135432 | primary | - failed  | ?                              | n/a        | n/a  | n/a     | n/a                
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211125432 | running    | 3525 | no      | 0 second(s) ago    

WARNING: following issues were detected
  - unable to  connect to node "lightdbCluster1921681211135432" (ID: 2)

HINT: execute with --verbose option to see connection error messages
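The failover decision can also be followed in the ltclusterd log on the surviving nodes; the file name below matches the naming pattern shown later in this article and is only an example:

# Watch ltclusterd's log for the promotion event (file name is an example)
[lightdb@localhost ltcluster]$ tail -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster2023-01-10_224718.log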
  2. Confirm that the ltcluster daemon (ltclusterd) is running on the failed machine (192.168.121.113)
# 1. Check the ltclusterd process
[lightdb@localhost ltcluster]$ ps aux|grep ltcluster
lightdb   60183  0.0  0.0 116884  1624 ?        Sl   22:47   0:00 /usr/local/lightdb/lightdb-x/13.8-22.3/bin/ltclusterd -d -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf -p /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltclusterd.pid

# 2. Confirm ltclusterd is running; if it is not, start it with the command below
[lightdb@localhost ltcluster]$ /usr/local/lightdb/lightdb-x/13.8-22.3/bin/ltclusterd -d -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf -p /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltclusterd.pid
[2023-01-10 22:47:18] [NOTICE] redirecting logging output to "/usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster2023-01-10_224718.log"

3. Restart the lightdb database process on the failed machine

# 1. Restart the lightdb database process
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA start
2023-01-10 23:02:16.502933T,,,,,postmaster,,00000,2023-01-10 23:02:16 CST,0,62485,LOG:  LightDB autoprewarm: prewarm dbnum=0
waiting for server to start....2023-01-10 23:02:16.508863T,,,,,postmaster,,00000,2023-01-10 23:02:16 CST,0,62485,LOG:  ltaudit extension initialized
...2023-01-10 23:02:20.135427T,,,,,postmaster,,00000,2023-01-10 23:02:16 CST,0,62485,LOG:  redirecting log output to logging collector process
2023-01-10 23:02:20.135427T,,,,,postmaster,,00000,2023-01-10 23:02:16 CST,0,62485,HINT:  Future log output will appear in directory "log".
 done
server started

4. Rejoin the cluster after recovery
The recovered database does not rejoin the cluster automatically. To bring it back, use node rejoin to reattach it as a standby. If you want it to become the primary again afterwards, follow the switchover procedure above.

# 1. rejoin dry run (note: the host address is the IP of the cluster's current primary)
[lightdb@localhost ltcluster]$ ltcluster -f $LTHOME/etc/ltcluster/ltcluster.conf node rejoin -d 'host=192.168.121.112 port=5432 dbname=ltcluster user=ltcluster' --verbose --force-rewind --dry-run
NOTICE: using provided configuration file "/usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf"
INFO: replication slots in use, 1 free slots on node 10
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 7186927613994770615
NOTICE: lt_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 3 forked off current database system timeline 2 before current recovery point 1/28
INFO: prerequisites for using lt_rewind are met
INFO: temporary archive directory "/tmp/ltcluster-config-archive-lightdbCluster1921681211135432" created
INFO: 0 files would have been copied to "/tmp/ltcluster-config-archive-lightdbCluster1921681211135432"
INFO: temporary archive directory "/tmp/ltcluster-config-archive-lightdbCluster1921681211135432" deleted
INFO: lt_rewind would now be executed
DETAIL: lt_rewind command is:
  /usr/local/lightdb/lightdb-x/13.8-22.3/bin/lt_rewind -D '/usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster' --source-server='host=192.168.121.112 port=5432 user=ltcluster dbname=ltcluster connect_timeout=2'
INFO: prerequisites for executing NODE REJOIN are met 

# 2. Confirm the dry run succeeded

# 3. Run the rejoin for real
[lightdb@localhost ltcluster]$ ltcluster -f $LTHOME/etc/ltcluster/ltcluster.conf node rejoin -d 'host=192.168.121.112 port=5432 dbname=ltcluster user=ltcluster' --verbose --force-rewind 
NOTICE: using provided configuration file "/usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf"
NOTICE: lt_rewind execution required for this node to attach to rejoin target node 1
DETAIL: rejoin target server's timeline 3 forked off current database system timeline 2 before current recovery point 1/28
INFO: prerequisites for using lt_rewind are met
INFO: 0 files copied to "/tmp/ltcluster-config-archive-lightdbCluster1921681211135432"
NOTICE: executing lt_rewind
DETAIL: lt_rewind command is "/usr/local/lightdb/lightdb-x/13.8-22.3/bin/lt_rewind -D '/usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster' --source-server='host=192.168.121.112 port=5432 user=ltcluster dbname=ltcluster connect_timeout=2'"
NOTICE: 0 files copied to /usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster
INFO: directory "/tmp/ltcluster-config-archive-lightdbCluster1921681211135432" deleted
INFO: creating replication slot as user "ltcluster"
NOTICE: setting node 2's upstream to node 1
WARNING: unable to ping "host=192.168.121.113 port=5432 user=ltcluster dbname=ltcluster connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
NOTICE: starting server using "/usr/local/lightdb/lightdb-x/13.8-22.3/bin/lt_ctl  -w -D '/usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster' start"
WARNING: unable to ping "host=192.168.121.113 port=5432 user=ltcluster dbname=ltcluster connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
INFO: waiting for node "lightdbCluster1921681211135432" (ID: 2) to respond to pings; 1 of max 60 attempts (parameter "node_rejoin_timeout")
WARNING: unable to ping "host=192.168.121.113 port=5432 user=ltcluster dbname=ltcluster connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
WARNING: unable to ping "host=192.168.121.113 port=5432 user=ltcluster dbname=ltcluster connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
WARNING: unable to ping "host=192.168.121.113 port=5432 user=ltcluster dbname=ltcluster connect_timeout=2"
DETAIL: PQping() returned "PQPING_NO_RESPONSE"
INFO: node "lightdbCluster1921681211135432" (ID: 2) is pingable
INFO: node "lightdbCluster1921681211135432" (ID: 2) has attached to its upstream node
NOTICE: NODE REJOIN successful
DETAIL: node 2 is now attached to node 1

# 4. Check the cluster status (node 2 has rejoined the cluster)
[lightdb@localhost ltcluster]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                       | ltclusterd | PID   | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+--------------------------------+------------+-------+---------+--------------------
 1  | lightdbCluster1921681211125432 | primary | * running |                                | running    | 4928  | no      | n/a                
 2  | lightdbCluster1921681211135432 | standby |   running | lightdbCluster1921681211125432 | running    | 60183 | no      | 1 second(s) ago    
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211125432 | running    | 3525  | no      | 1 second(s) ago 

# 5. Confirm keepalived is running; if it is not, start it
[lightdb@localhost ltcluster]$ ps aux | grep keepalived
root      26657  0.0  0.0  48100  1008 ?        Ss   20:56   0:01 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
root      26658  0.0  0.0  48100  1844 ?        S    20:56   0:01 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
root      26659  0.0  0.0  48100  1408 ?        S    20:56   0:03 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
lightdb   63836  0.0  0.0 112832   984 pts/1    S+   23:32   0:00 grep --color=auto keepalived

5. If the node rejoin above fails, re-clone the standby with the following steps

# 1. Confirm the lightdb database is stopped; if it is not, stop it with
[lightdb@localhost ltcluster]$ lt_ctl -D $LTDATA stop
waiting for server to shut down.... done
server stopped

# 2. Check the cluster status
[lightdb@localhost data]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
ERROR: connection to database failed
DETAIL:
could not connect to server: Connection refused
Is the server running on host "192.168.121.113" and accepting
TCP/IP connections on port 5432?

DETAIL: attempted to connect using:
user=ltcluster connect_timeout=2 dbname=ltcluster host=192.168.121.113 port=5432 fallback_application_name=ltcluster options=-csearch_path=

# 3. Empty the instance directory ($LTDATA/); back it up first if needed (a sketch follows)
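A minimal backup sketch: rather than deleting the old data directory outright, move it aside (the suffix is an example); standby clone recreates the directory for you:

# Preserve the old data directory instead of deleting it
[lightdb@localhost data]$ mv $LTDATA ${LTDATA}.bak.$(date +%Y%m%d)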

# 4. clone dry run
[lightdb@localhost data]$ ltcluster -h 192.168.121.112 -p 5432 -U ltcluster -d ltcluster -f $LTHOME/etc/ltcluster/ltcluster.conf standby clone --dry-run
NOTICE: destination directory "/usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.121.112 port=5432 user=ltcluster dbname=ltcluster
DETAIL: current installation size is 83 MB
INFO: "ltcluster" extension is installed in database "ltcluster"
INFO: parameter "max_replication_slots" set to 10
INFO: parameter "max_wal_senders" set to 10
NOTICE: checking for available walsenders on the source node (2 required)
INFO: sufficient walsenders available on the source node
DETAIL: 2 required, 10 available
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: required number of replication connections could be made to the source server
DETAIL: 2 replication connections required
INFO: replication slots will be created by user "ltcluster"
NOTICE: standby will attach to upstream node 1
HINT: consider using the -c/--fast-checkpoint option
INFO: all prerequisites for "standby clone" are met


# 5. Confirm the output shows all prerequisites for "standby clone" are met

# 6. Clone the instance directory (note: -h is the primary server's address)
[lightdb@localhost data]$ ltcluster -h 192.168.121.112 -p 5432 -U ltcluster -d ltcluster -f $LTHOME/etc/ltcluster/ltcluster.conf standby clone -F
NOTICE: destination directory "/usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster" provided
INFO: connecting to source node
DETAIL: connection string is: host=192.168.121.112 port=5432 user=ltcluster dbname=ltcluster
DETAIL: current installation size is 83 MB
NOTICE: checking for available walsenders on the source node (2 required)
NOTICE: checking replication connections can be made to the source server (2 required)
INFO: creating directory "/usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster"...
INFO: creating replication slot as user "ltcluster"
NOTICE: starting backup (using lt_basebackup)...
HINT: this may take some time; consider using the -c/--fast-checkpoint option
INFO: executing:
  /usr/local/lightdb/lightdb-x/13.8-22.3/bin/lt_basebackup -l "ltcluster base backup"  -D /usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster -h 192.168.121.112 -p 5432 -U ltcluster -X stream -S ltcluster_slot_2 
NOTICE: standby clone (using lt_basebackup) complete
NOTICE: you can now start your LightDB server
HINT: for example: lt_ctl -D /usr/local/lightdb/lightdb-x/13.8-22.3/data/defaultCluster start
HINT: after starting the server, you need to re-register this standby with "ltcluster standby register --force" to update the existing node record

# 7. Start the database
[lightdb@localhost data]$ lt_ctl -D $LTDATA start
2023-01-11 00:02:19.037650T,,,,,postmaster,,00000,2023-01-11 00:02:19 CST,0,65044,LOG:  LightDB autoprewarm: prewarm dbnum=0
2023-01-11 00:02:19.040082T,,,,,postmaster,,00000,2023-01-11 00:02:19 CST,0,65044,LOG:  ltaudit extension initialized
waiting for server to start......2023-01-11 00:02:22.125889T,,,,,postmaster,,00000,2023-01-11 00:02:19 CST,0,65044,LOG:  redirecting log output to logging collector process
2023-01-11 00:02:22.125889T,,,,,postmaster,,00000,2023-01-11 00:02:19 CST,0,65044,HINT:  Future log output will appear in directory "log".
done
server started

# 8. Re-register the node as a standby
[lightdb@localhost data]$ ltcluster -f $LTHOME/etc/ltcluster/ltcluster.conf standby register -F
INFO: connecting to local node "lightdbCluster1921681211135432" (ID: 2)
INFO: connecting to primary database
INFO: standby registration complete
NOTICE: standby node "lightdbCluster1921681211135432" (ID: 2) successfully registered

# 9. Check the cluster status (node 2 has joined the cluster normally)
[lightdb@localhost data]$ ltcluster -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf service status
 ID | Name                           | Role    | Status    | Upstream                       | ltclusterd | PID   | Paused? | Upstream last seen
----+--------------------------------+---------+-----------+--------------------------------+------------+-------+---------+--------------------
 1  | lightdbCluster1921681211125432 | primary | * running |                                | running    | 4928  | no      | n/a
 2  | lightdbCluster1921681211135432 | standby |   running | lightdbCluster1921681211125432 | running    | 60183 | no      | 0 second(s) ago
 4  | lightdbCluster1921681211145432 | witness | * running | lightdbCluster1921681211125432 | running    | 3525  | no      | 0 second(s) ago


# 10. Confirm keepalived is running; if it is not, start it
[lightdb@localhost ltcluster]$ ps aux | grep keepalived
root      26657  0.0  0.0  48100  1008 ?        Ss   20:56   0:01 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
root      26658  0.0  0.0  48100  1844 ?        S    20:56   0:01 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
root      26659  0.0  0.0  48100  1408 ?        S    20:56   0:03 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
  • Standby failure

The recovery steps after a standby failure are as follows

# 1. Confirm ltclusterd is running; if it is not, start it with the command below
[lightdb@localhost ltcluster]$ /usr/local/lightdb/lightdb-x/13.8-22.3/bin/ltclusterd -d -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf -p /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltclusterd.pid
[2023-01-10 22:47:18] [NOTICE] redirecting logging output to "/usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster2023-01-10_224718.log"

# 2. Start the lightdb database
[lightdb@localhost data]$ lt_ctl -D $LTDATA start
2023-01-11 00:02:19.037650T,,,,,postmaster,,00000,2023-01-11 00:02:19 CST,0,65044,LOG:  LightDB autoprewarm: prewarm dbnum=0
2023-01-11 00:02:19.040082T,,,,,postmaster,,00000,2023-01-11 00:02:19 CST,0,65044,LOG:  ltaudit extension initialized
waiting for server to start......2023-01-11 00:02:22.125889T,,,,,postmaster,,00000,2023-01-11 00:02:19 CST,0,65044,LOG:  redirecting log output to logging collector process
2023-01-11 00:02:22.125889T,,,,,postmaster,,00000,2023-01-11 00:02:19 CST,0,65044,HINT:  Future log output will appear in directory "log".
done
server started

# 3. Confirm keepalived is running; if it is not, start it
[lightdb@localhost ltcluster]$ ps aux | grep keepalived
root      26657  0.0  0.0  48100  1008 ?        Ss   20:56   0:01 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
root      26658  0.0  0.0  48100  1844 ?        S    20:56   0:01 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
root      26659  0.0  0.0  48100  1408 ?        S    20:56   0:03 ./keepalived -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/keepalived/keepalived.conf
  • Cluster replication (consistency) levels

Different business scenarios place different consistency requirements on the primary/standby pair, and stronger consistency costs more performance. Adjust synchronous_commit to select the level you need.

Modify the configuration parameters in $LTDATA/lightdb.conf:

# Synchronous mode: set on the primary node
synchronous_commit = 'on'
synchronous_standby_names = '*'

# Asynchronous mode: set on the primary node
synchronous_commit = 'local'
synchronous_standby_names = ''

# Apply the change
[lightdb@localhost defaultCluster]$ lt_ctl -D $LTDATA reload
server signaled
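Beyond these two settings, synchronous_commit in the PostgreSQL family that LightDB derives from supports a full spectrum of levels, and the current value can be inspected with ltsql; a sketch (the level descriptions are standard PostgreSQL semantics):

# From weakest to strongest guarantee:
#   off          - return before the local WAL flush (fastest; recent commits can be lost on crash)
#   local        - wait for the local WAL flush only (asynchronous replication)
#   remote_write - wait until the standby has written (not yet flushed) the WAL
#   on           - wait until the standby has flushed the WAL
#   remote_apply - wait until the standby has replayed the WAL (standby reads see the commit)
[lightdb@localhost defaultCluster]$ ltsql -d ltcluster -c "SHOW synchronous_commit;"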


  • Miscellaneous
ltcluster and lightdb run independently. If all three machines go down and are rebooted, neither the lightdb nor the ltcluster processes will be running.
To restore the cluster, start lightdb first and then ltcluster, in that order; a sketch follows.
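A minimal sketch of the recovery order on each node, using the commands from the sections above:

# 1. Start the database first
[lightdb@localhost ~]$ lt_ctl -D $LTDATA start
# 2. Only then start the ltcluster daemon
[lightdb@localhost ~]$ /usr/local/lightdb/lightdb-x/13.8-22.3/bin/ltclusterd -d -f /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster.conf -p /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltclusterd.pid

If ltclusterd is started while lightdb is still down, it simply fails to connect, as this log excerpt shows: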

[lightdb@localhost ~]$ cat /usr/local/lightdb/lightdb-x/13.8-22.3/etc/ltcluster/ltcluster2023-01-12_103553.log
[2023-01-12 10:35:53] [NOTICE] ltclusterd (ltclusterd 5.2.1) starting up
[2023-01-12 10:35:53] [INFO] connecting to database "host=192.168.121.112 port=5432 user=ltcluster dbname=ltcluster connect_timeout=2"
[2023-01-12 10:35:53] [ERROR] connection to database failed
[2023-01-12 10:35:53] [DETAIL] 
could not connect to server: Connection refused
	Is the server running on host "192.168.121.112" and accepting
	TCP/IP connections on port 5432?

[2023-01-12 10:35:53] [DETAIL] attempted to connect using:
  user=ltcluster connect_timeout=2 dbname=ltcluster host=192.168.121.112 port=5432 fallback_application_name=ltcluster options=-csearch_path=