【PostgreSQL】主从集群之keepalived实现HA和故障自动转移

前言

书接上回,我们已经有了一套集群环境,(参考文章:CSDN)目前工作正常,但还不能高枕无忧,突然哪一天主库发生故障了,业务系统就会无法使用,我们的电话会被打爆,如何睡个安稳觉呢,那就需要继续增加组件实现高可用(HA)。今天介绍的是keepalived方案。闲话少说,马上进入正题。

环境介绍

我们有如下集群,还没有的可以参考之前的文章动手操作

主机角色OSdb
192.168.0.111Ubuntu20postgresql14
192.168.0.112Ubuntu20postgresql14
192.168.0.113Ubuntu20postgresql14

另外需要keepalived工具,如果还没装的,在集群所有主机上执行下面的命令进行安装 

 apt-get -y install keepalived

一些keepalived的基本操作命令如下

sudo service keepalived [start | stop | reload | restart ] #启停操作
keepalived -v #检查是否安装成功

方案介绍

 虽然本方案使用了一主两备的集群,但是目前现有资料都是针对一主一备的,多备的情况更复杂一点,所以本方案在此基础上简化了一些功能,即实现一主两备集群,但当发生主库故障时,只实现单备的故障转移,另一台备机需要手工切换到新主库。这个方案可以满足对实时性要求不高的读场景,也就是说即使另一台备机没有切换,也能支持正常的查询业务,但数据不再更新,直至手工切换后数据再次同步。

部署步骤

设置自启动【主、备】

sudo vim /etc/rc.local 

 加入以下内容

#!/bin/sh -e
service keepalived start &
 
exit 0

配置keepalived.conf【主、备】

注意ubuntu中通过这种方式安装以后目录是/etc/keepalived,初始是没有keepalived.conf文件的,服务也是没起来的,网上找了一些处理过程,参考如下:

在/etc/keepalived目录下自己创建配置文件

cd /etc/keepalived
vi keepalived.conf

# 初始内容如下,这个是最简易的,具体可阅读官方说明文档

#  Keepalived for Linux简易配置

###主机1
global_defs {
router_id 192.168.0.111
script_user root
enable_script_security
}
vrrp_script chk_http_port {
script "/etc/keepalived/pg_alive.sh"
interval 10
weight 2
}
vrrp_instance VI_1 {
state MASTER
interface enp0s3
virtual_router_id 51
priority 99
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.0.50
}
track_script {
chk_http_port
}
smtp_alert
notify_master '/etc/keepalived/failover.sh'
notify_fault '/etc/keepalived/fault.sh'
}

###备库【1】配置信息
global_defs {
router_id 192.168.0.112
script_user root
enable_script_security
}
vrrp_script chk_http_port {
script "/etc/keepalived/pg_alive.sh"
interval 10
weight 2
}
vrrp_instance VI_1 {
state BACKUP
interface enp0s3
virtual_router_id 51
priority 90 
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.0.50
}
track_script {
chk_http_port
}
smtp_alert
notify_master '/etc/keepalived/failover.sh'
notify_fault '/etc/keepalived/fault.sh'
}

###备库【2】配置信息
global_defs {
router_id 192.168.0.113
script_user root
enable_script_security
}
vrrp_script chk_http_port {
script "/etc/keepalived/pg_alive.sh"
interval 10
weight 2
}
vrrp_instance VI_1 {
state BACKUP
interface enp0s3
virtual_router_id 51
priority 80 
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress {
192.168.0.50
}
track_script {
chk_http_port
}
}

 这里几个参数简单说明如下:
router_id 192.168.223.131//你的ip地址
script "/etc/keepalived/a.sh"  健康检查的脚本,这个要自己写
interval 2 #(检测脚本执行的间隔)
state BACKUP # 主机上为 MASTER;备机上为 BACKUP
interface ens33 #网卡,可以通过ifconfig或ip a查看
virtual_router_id 51 # 主、备机的 virtual_router_id 必须相同
priority 90 # 主、备机取不同的优先级,主机值较大,备份机值较小
virtual_ipaddress { 192.168.0.50 // VRRP H 虚拟地址,这里最好用和实体机一个网段的,否则容易出现联不通的情况

smtp_alert # 如果主机状态变成Master|Backup|Fault之后会去执行的通知脚本,脚本要自己写

注意:这里的内容直接复制到linux里可能无法识别某些字符要当心 

编写健康检查脚本 【主、备】

健康检查的脚本这里也给出一段供参考

cd /etc/keepalived
vi pg_alive.sh
# 建好文件还要授权
chmod +x pg_alive.sh

#!/bin/bash
# FILENAME pg_alive.sh
# define DB parameters
PGDATABASE=postgres
PGPORT=5432
PGUSER=postgres
PGHOME=/usr/lib/postgresql/14/bin
LOGFILE=/etc/keepalived/pg_alive.log
PATH=$PATH:$PGHOME
keeplognums=30000

# define sql scripts
SQL_UPDATE_HEART="update sr_delay set sr_date = now() where id =1;"
SQL_ROLE="SELECT pg_is_in_recovery from pg_is_in_recovery();"
SQL_CHK_DBALIVED="SELECT 1;"

# process step
db_role=`echo $SQL_ROLE  | psql -p $PGPORT -d $PGDATABASE -U $PGUSER -At -w`
if [ $db_role == 't' ];then
echo -e `date +"%F %T"` "Attention1:the current database is standby DB!" >> $LOGFILE
exit 0
fi

#备库不检查存活,主库更新状态
echo $SQL_CHK_DBALIVED | psql -p $PGPORT -d $PGDATABASE -U $PGUSER -At -w
if [ $? -eq 0 ] ;then
echo $SQL_UPDATE_HEART | psql -p $PGPORT -d $PGDATABASE -U $PGUSER -At -w
echo -e `date +"%F %T"` "Success: update the master sr_delay successed!" >> $LOGFILE
exit 0
else
echo -e `date +"%F %T"` "Error:Is the server is running?" >> $LOGFILE
echo -e `date +"%F %T"` "Error:Postgresql is tring stop..." >> $LOGFILE
systemctl stop postgresql >> $LOGFILE 2>&1
sleep 5
echo -e `date +"%F %T"` "Error:keepalived is tring stop..." >> $LOGFILE
service keepalived stop >>$LOGFILE 2>&1
exit 1
fi
#日志保留 keeplognums 行
if [ ! -f ${MONITOR_LOG} ] ;then touch ${MONITOR_LOG};fi 
lognums=`cat ${MONITOR_LOG} |wc -l`
catnum=$((${lognums} -${keeplognums}))
if [[ $lognums -gt ${keeplognums} ]] ; then 
sed -i "1,${catnum}d" ${MONITOR_LOG}; 
fi

这是一个很简单的集群健康检查演示脚本,它每隔2秒去检查一下主库的响应,如果正常则更新心跳标记,同时把检查结果输出到日志中

注意:这里的内容直接复制到linux里可能无法识别某些字符要当心 

要让上面这段脚本正常运行,还需要在【主库】中新建一个心跳表,执行下面的脚本

create table sr_delay(id serial ,sr_date timestamp default now(),new_primary_ip varchar(50),standby_num int);
insert into sr_delay(id,sr_date) values(1,now());

字段说明:

sr_date心跳时间,主机健康检查时更新
new_primary_ip 新主库IP,当备库升级为主库时填入,主库健康检查时如果standby_num归0则清空
standby_num 备库数量,不在故障转移阶段(即new_primary_ip为空时)由健康检查时更新,当故障转移时数量自动-1;还在故障转移阶段时(即备库升级为主库后,主库健康检查时获取到的从库数>standby_num),在健康检查时自动-已有从库数,直至0

 确认日志正常【主 备】

root@host111:/etc/keepalived# tail -f pg_alive.log
2022-04-26 08:53:22 Success: update the master sr_delay successed!
2022-04-26 08:53:24 Success: update the master sr_delay successed!

 所有主库备库上都部署一遍以后,我们就完成了HA的第一步,自动检测和VIP服务。接下来就是重点了,通过自定义脚本结合keepalived实现自动故障转移。

故障后处理(fault.sh)

这是keepalived在健康检查时遇到异常时要执行的脚本

cd //etc/keepalived
vi fault.sh
# 编辑完成后授权
chmod +x fault.sh

内容如下:

#!/bin/bash
# FILENAME fault.sh
LOGFILE=/etc/keepalived/fault.log
PGDATA=/var/lib/postgresql/14/main
echo -e `date +"%F %T"` "Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!" >> $LOGFILE

PGPID="`head -n1 $PGDATA/postmaster.pid`"
if [ $? -ne 0 ];then
echo -e `date +"%F %T"` "Error:$PGDATA/postmaster.pid not found!" >> $LOGFILE
fi
# systemctl stop keepalived
service keepalived stop

kill -9 $PGPID
if [ $? -eq 0 ] ;
then
echo -e `date +"%F %T"` "Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!" >> $LOGFILE
# systemctl stop keepalived
service keepalived stop
exit 1
fi

 需要确认PGDATA是否在正确的目录,安装默认在/var/lib/postgresql/14/main

部署自动故障转移(failover.sh)

创建一个故障转移的脚本

cd //etc/keepalived
vi failover.sh
# 编辑完成后授权
chmod +x failover.sh

内容如下:

#!/bin/bash
# FILENAME failover.sh
# define DB parameters
PGPORT=5432
PGUSER=postgres
PG_OS_USER=postgres
PGDBNAME=postgres
LANG=en_US.UTF-8
PGPATH=/usr/lib/postgresql/14/bin
PGDBPATH=/var/lib/postgresql/14/main
LOGFILE=/etc/keepalived/failover.log
PATH=$PATH:$PGPATH

# 主备数据库同步时延,单位为秒
sr_allowed_delay_time=100
# define SQL scripts
SQL_ROLE="select pg_is_in_recovery from pg_is_in_recovery();"
SQL_DELAY_TIME="select sr_date as delay_time from sr_delay where now()>(sr_date + interval '$sr_allowed_delay_time seconds');"
SWITCH_COMMAND="pg_ctl promote -D $PGDBPATH"
echo -e `date +"%F %T"` "Attention:failover process start ..." >> $LOGFILE
echo su - postgres -c "psql" >> $LOGFILE  2>&1
if [ $? -ne 0 ];then
echo -e `date +"%F %T"` "Attention:The current postgreSQL DB is disconnect!" >> $LOGFILE
exit 1
else
echo -e `date +"%F %T"` "Attention:The current postgreSQL DB is OK!" >> $LOGFILE
fi

# process step
db_role=`echo $SQL_ROLE | psql -At -p $PGPORT -U $PGUSER -d $PGDBNAME -w`
# Switch Rule: only standby node and db delay greater than defined delay-time allowed switch to primary node
if [ $db_role != 't' ];then
echo -e `date +"%F %T"` "Attention:The current postgreSQL DB is master database,cannot switched!" >> $LOGFILE
exit 0
fi
slp_sec=$[$sr_allowed_delay_time+10]
echo "wait $slp_sec sec..." >> $LOGFILE
sleep $slp_sec
db_sr_delaytime=`echo $SQL_DELAY_TIME | psql -p $PGPORT -d $PGDBNAME -U $PGUSER -At -w`
if [ -z "$db_sr_delaytime" ];then
echo -e `date +"%F %T"` "Attention:The current master database is health,the standby DB cannot switched!" >> $LOGFILE
exit 0
fi

if [ -n "$db_sr_delaytime" ];then
echo -e `date +"%F %T"` "Attention:The current database is statndby,ready to switch master database!" >> $LOGFILE
echo -e `date +"%F %T"` "su - $PG_OS_USER -c $SWITCH_COMMAND" >>$LOGFILE
echo "su - $PG_OS_USER -c $SWITCH_COMMAND" >>$LOGFILE
su - $PG_OS_USER -c "$SWITCH_COMMAND" >> $LOGFILE 2>&1
if [ $? -eq 0 ];then
echo -e `date +"%F %T"` "success:The current standby database successed to switched the primary PG database !" >> $LOGFILE
exit 0
else
echo -e `date +"%F %T"` "Error: the standby database failed to switch the primary PG database !,pelease checked it!" >> $LOGFILE
exit 1
fi
fi

 切换流程:

当心跳延迟超过设定上限时,开始启动备库升级;备库升级完成后,更新心跳表的IP字段,将新主库的IP地址写进去。 

上述脚本在当前主机上编辑验证完成后传输到其他节点,在所有节点上执行下面的命令

scp postgres@192.168.0.111:/etc/keepalived/*.sh /etc/keepalived

最后启动keepalived服务

service keepalived start

模拟故障转移

这里主要模拟三种场景

  • 主库停止服务,主服务器还在运行
  • 主服务器丢失 
  • 任意备库丢失(不会发生转移)

主库停止服务

在主库(192.168.111)执行下面的语句,停止postgresql服务

# root

systemctl stop postgresql

观察【主库】日志

022-04-27 07:51:00 Success: update the master sr_delay successed!
2022-04-27 07:51:02 Success: update the master sr_delay successed!
2022-04-27 07:51:04 Success: update the master sr_delay successed!
2022-04-27 07:51:06 Error:Is the server is running?
2022-04-27 07:51:08 Error:Is the server is running?
2022-04-27 07:51:10 Error:Is the server is running?
2022-04-27 07:51:12 Error:Is the server is running?

 可以看到日志出现状态异常。此时备库日志还是正常的,在限定时间内(这里我们设的是100秒)不会发生转移。

在看主库的failover.log

root@host111:/etc/keepalived# tail -f failover.log

2022-04-27 08:28:00 Attention:The current postgreSQL DB is disconnect!

再看主库的fault.log

root@host111:/etc/keepalived# cat fault.log
2022-04-27 08:33:12 Error:Because of the priamry DB happend some unknown problem,So turn off the PostgreSQL Database!

如果在目录里找不到日志就要检查一下脚本里面可能有报错,手工执行一下根据报错信息修正脚本。

功能展望

下一步要实现多备集群中所有备机自动切换到新主库上,初步设想如下:

新增一个集群信息表,用来记录整个集群的主机信息

create table sr_cluster(node_ip varchar(50),role_name varchar(30),status varchar(20));

核心字段介绍:

node_ip:集群中的节点IP 

role_name:节点当前角色,值: master-主库;standby-备库;

status:节点状态,值 online-正常在线;offline-脱离集群

处理流程简介:

1. master自动扫描新节点并注册到集群表,扫描现有节点并更新集群表状态

2.在故障转移过程中,同时把本节点更新为master,原来的master更新为standby

3. standby节点检测master节点,一旦发现master节点离线,则从集群表中寻找新的master,找到后切换到新master

  • 0
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值