[My Story with openGauss] Replacing a Failed Node in an openGauss Cluster

ziyoo0830 openGauss 2023-08-07 18:00 Published in Hong Kong, China

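Background

When a node fails or is replaced (the new host keeps the original hostname and IP), the node can be restored by copying the app binaries and the om tool files from a healthy node, then running gs_ctl build (here, building from a standby) to rejoin it to the existing cluster.

This was verified in a test environment with no load on the database. Test carefully before trying it in production.

Cluster information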

2023-08-04 07:43:24 [line:905] INFO <module> 94105 [   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node   node_ip         port      instance                     state
---------------------------------------------------------------------------------------
1  pghost3 192.168.56.30   26000      6001 /app/ogdata/data/dn1   P Primary Normal
2  pghost5 192.168.56.50   26000      6002 /app/ogdata/data/dn1   S Standby Normal
3  pghost6 192.168.56.60   26000      6003 /app/ogdata/data/dn1   S Standby Normal

Simulating the failure

root@pghost6 /app# rm -rf ogdata/
root@pghost6 /app# rm -rf opengauss/
root@pghost6 /app# rm -rf ogxlog/
root@pghost6 /app# rm -rf ogarchive/

# GAUSSDB_PID is a placeholder for the PID of the gaussdb process on pghost6
kill -9 "$GAUSSDB_PID"

Cluster status

omm@pghost3 ~$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Degraded
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node   node_ip         port      instance                     state
---------------------------------------------------------------------------------------
1  pghost3 192.168.56.30   26000      6001 /app/ogdata/data/dn1   P Primary Normal
2  pghost5 192.168.56.50   26000      6002 /app/ogdata/data/dn1   S Standby Normal
3  pghost6 192.168.56.60   26000      6003 /app/ogdata/data/dn1   S Unknown Unknown
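
The status table above can also be checked mechanically. The following is a minimal sketch, assuming the row layout shown in this output (a numeric index first, the node name second, the instance state as the last field). `abnormal_nodes` is a hypothetical helper name, not part of gs_om.

```shell
# abnormal_nodes: hypothetical helper, not an openGauss tool.
# Reads `gs_om -t status --detail` output on stdin and prints the node name
# of every datanode row whose final state field is not "Normal".
# Assumes the 9-column row layout shown in the output above.
abnormal_nodes() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 9 && $NF != "Normal" { print $2 }'
}
```

Used as `gs_om -t status --detail | abnormal_nodes`, it would print pghost6 for the degraded output shown here and nothing once every instance is back to Normal.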

Recovery

Install python3

Install python3 if the system does not already provide it.

Copy directories and files

Add the node name mappings to /etc/hosts.

omm@pghost6 ~$ more /etc/hosts
127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
192.168.56.60 pghost6 
192.168.56.30 pghost3 
192.168.56.50 pghost5 
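
A quick way to confirm that every cluster member has an entry is sketched below. `missing_hosts` is a hypothetical helper written for this article; it only inspects the file and does not test name resolution itself.

```shell
# missing_hosts: hypothetical helper.
# usage: missing_hosts HOSTS_FILE NAME...
# Prints each NAME that does not appear as a hostname field in HOSTS_FILE.
missing_hosts() {
  hosts_file=$1
  shift
  for name in "$@"; do
    # Skip comment lines; fields 2..NF of a hosts entry are hostnames/aliases.
    awk -v n="$name" '
      !/^[[:space:]]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
      END { exit !found }
    ' "$hosts_file" || printf '%s\n' "$name"
  done
}
```

For example, `missing_hosts /etc/hosts pghost3 pghost5 pghost6` prints nothing when all three mappings are present.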

Copy the directories from pghost5 to the corresponding paths on the failed node pghost6.

# Create the target directory on pghost6
root@pghost6 /app# mkdir opengauss
root@pghost6 /app# chown omm: opengauss

# Copy the app and tool directories
omm@pghost5 /app/opengauss$ scp -r app omm pghost6:/app/opengauss/
omm@pghost5 /app/opengauss$ scp -r tool pghost6:/app/opengauss/

Copy pg_hba.conf and postgresql.conf from pghost5 to the corresponding directory on the failed node pghost6.

root@pghost6 /app# mkdir -p ogdata/data/dn1
root@pghost6 /app# chown omm: ogdata/data/dn1/
omm@pghost5 /app/ogdata/data/dn1$ scp pg_hba.conf postgresql.conf pghost6:/app/ogdata/data/dn1/

Update the following values in postgresql.conf

local_bind_address = '192.168.56.60'

replconninfo1 = 'localhost=192.168.56.60 localport=26001 localheartbeatport=26005 localservice=26004 remotehost=192.168.56.30 remoteport=26001 remoteheartbeatport=26005 remoteservice=26004' 
replconninfo2 = 'localhost=192.168.56.60 localport=26001 localheartbeatport=26005 localservice=26004 remotehost=192.168.56.50 remoteport=26001 remoteheartbeatport=26005 remoteservice=26004' 

synchronous_standby_names = 'ANY 1(dn_6001,dn_6002)'
log_directory = '/app/opengauss/gaussdb_log/omm/pg_log/dn_6003'
audit_directory = '/app/opengauss/gaussdb_log/omm/pg_audit/dn_6003'
application_name = 'dn_6003'
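
Because the file was copied from pghost5, these edits amount to exchanging the identities of the two standbys. The sketch below does that with sed. It assumes the copied file refers to pghost5 as 192.168.56.50 / dn_6002 and to pghost6 as 192.168.56.60 / dn_6003 everywhere; that is an assumption, since the original pghost5 file is not shown in this article. `retarget_conf` and the `@SRC_IP@` / `@SRC_DN@` placeholders are hypothetical names.

```shell
# retarget_conf: hypothetical helper.
# usage: retarget_conf COPIED_POSTGRESQL_CONF
# Prints the file to stdout with pghost5's and pghost6's IPs and instance
# names exchanged. A temporary placeholder keeps the two substitutions from
# undoing each other. ASSUMPTION: the copied file uses exactly these values.
retarget_conf() {
  sed -e 's/192\.168\.56\.50/@SRC_IP@/g' \
      -e 's/192\.168\.56\.60/192.168.56.50/g' \
      -e 's/@SRC_IP@/192.168.56.60/g' \
      -e 's/dn_6002/@SRC_DN@/g' \
      -e 's/dn_6003/dn_6002/g' \
      -e 's/@SRC_DN@/dn_6003/g' "$1"
}
```

The function writes to stdout rather than editing in place, so the result can be compared against the values listed above before it replaces postgresql.conf.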

Add the following to .bashrc

PATH="$HOME/.local/bin:$HOME/bin:$PATH"
export PATH
export GPHOME=/app/opengauss/tool
export PATH=$GPHOME/script/gspylib/pssh/bin:$GPHOME/script:$PATH
export LD_LIBRARY_PATH=$GPHOME/lib:$LD_LIBRARY_PATH
export PYTHONPATH=$GPHOME/lib
export GAUSSHOME=/app/opengauss/app/2.0.1
export PATH=$GAUSSHOME/bin:$PATH
export LD_LIBRARY_PATH=$GAUSSHOME/lib:$LD_LIBRARY_PATH
export S3_CLIENT_CRT_FILE=$GAUSSHOME/lib/client.crt
export GAUSS_VERSION=3.0.3
export PGHOST=/app/opengauss/tmp
export GAUSSLOG=/app/opengauss/gaussdb_log/omm
umask 077
export GAUSS_ENV=2
export GS_CLUSTER_NAME=gauss_omm
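
Several of these variables name directories, and the tree copied from pghost5 may not contain all of them. A small pre-check, sketched here with a hypothetical `missing_dirs` helper, lists any that are absent before the build is attempted.

```shell
# missing_dirs: hypothetical helper.
# usage: missing_dirs DIR...
# Prints each argument that is not an existing directory.
missing_dirs() {
  for d in "$@"; do
    [ -d "$d" ] || printf '%s\n' "$d"
  done
}
```

After sourcing .bashrc as omm on pghost6, `missing_dirs "$GPHOME" "$GAUSSHOME" "$PGHOST"` prints whichever of those paths still needs to be created or copied.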

Run build to bring the data in sync

# Build from a standby
gs_ctl build -D /app/ogdata/data/dn1 -b standby_full -C "localhost=192.168.56.60 localport=26000 remotehost=192.168.56.50 remoteport=26000"

0 LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.

0 LOG:  [Alarm Module]Host Name: pghost6

0 LOG:  [Alarm Module]Host IP: pghost6. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>

0 LOG:  [Alarm Module]Cluster Name: gauss_omm

0 LOG:  [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 57

0 WARNING:  failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory.
0 WARNING:  failed to parse feature control file: gaussdb.version.
0 WARNING:  Failed to load the product control file, so gaussdb cannot distinguish product version.
The core dump path is an invalid directory
[2023-08-04 08:36:04.775][68234][][gs_ctl]: gs_ctl standby full build ,datadir is /app/ogdata/data/dn1,conn_str is 'localhost=192.168.56.60 localport=26000 remotehost=192.168.56.50 remoteport=26000'
[2023-08-04 08:36:04.775][68234][][gs_ctl]: fopen build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:36:04.775][68234][][gs_ctl]: fprintf build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:36:04.779][68234][][gs_ctl]: fsync build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:36:04.780][68234][][gs_ctl]: stop failed, killing gaussdb by force ...
[2023-08-04 08:36:04.780][68234][][gs_ctl]: command [ps c -eo pid,euid,cmd | grep gaussdb | grep -v grep | awk '{if($2 == curuid && $1!="-n") print "/proc/"$1"/cwd"}' curuid=`id -u`| xargs ls -l | awk '{if ($NF=="/app/ogdata/data/dn1")  print $(NF-2)}' | awk -F/ '{print $3 }' | xargs kill -9 >/dev/null 2>&1 ] path: [/app/ogdata/data/dn1]
[2023-08-04 08:36:04.812][68234][][gs_ctl]: server stopped
[2023-08-04 08:36:04.812][68234][][gs_ctl]: current workdir is (/home/omm).
[2023-08-04 08:36:04.814][68234][][gs_ctl]: set gaussdb state file when standby full build build:db state(BUILDING_STATE), server mode(STANDBY_MODE), build mode(FULL_BUILD).
[2023-08-04 08:36:04.814][68234][dn_6001_6002_6003][gs_ctl]: Get repl_auth_mode is  and repl_uuid is
[2023-08-04 08:36:04.915][68234][dn_6001_6002_6003][gs_ctl]: standby build try host(192.168.56.50) port(26000) success
[2023-08-04 08:36:04.915][68234][dn_6001_6002_6003][gs_ctl]: connected to server success, build started.
[2023-08-04 08:36:04.915][68234][dn_6001_6002_6003][gs_ctl]: clear old target dir success
[2023-08-04 08:36:04.915][68234][dn_6001_6002_6003][gs_ctl]: create build tag file success
[2023-08-04 08:36:04.916][68234][dn_6001_6002_6003][gs_ctl]: create build tag file again success
[2023-08-04 08:36:04.916][68234][dn_6001_6002_6003][gs_ctl]: get system identifier success
[2023-08-04 08:36:04.916][68234][dn_6001_6002_6003][gs_ctl]: receiving and unpacking files...
[2023-08-04 08:36:04.916][68234][dn_6001_6002_6003][gs_ctl]: create backup label success
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: xlog start point: 0/5008718
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: begin build tablespace list
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: finish build tablespace list
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: begin get xlog by xlogstream
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: starting background WAL receiver
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: starting walreceiver
[2023-08-04 08:36:07.391][68234][dn_6001_6002_6003][gs_ctl]: begin receive tar files
[2023-08-04 08:36:07.392][68234][dn_6001_6002_6003][gs_ctl]: receiving and unpacking files...
[2023-08-04 08:36:07.424][68234][dn_6001_6002_6003][gs_ctl]: standby build try host(192.168.56.50) port(26000) success
[2023-08-04 08:36:07.429][68234][dn_6001_6002_6003][gs_ctl]: check identify system success
[2023-08-04 08:36:07.437][68234][dn_6001_6002_6003][gs_ctl]: send START_REPLICATION 0/5000000 success
[2023-08-04 08:36:12.641][68234][dn_6001_6002_6003][gs_ctl]: finish receive tar files
[2023-08-04 08:36:12.641][68234][dn_6001_6002_6003][gs_ctl]: xlog end point: 0/5008838
[2023-08-04 08:36:12.642][68234][dn_6001_6002_6003][gs_ctl]: fetching MOT checkpoint
[2023-08-04 08:36:12.820][68234][dn_6001_6002_6003][gs_ctl]: waiting for background process to finish streaming...
[2023-08-04 08:36:18.521][68234][dn_6001_6002_6003][gs_ctl]: starting fsync all files come from source.
[2023-08-04 08:36:26.308][68234][dn_6001_6002_6003][gs_ctl]: finish fsync all files.
[2023-08-04 08:36:26.313][68234][dn_6001_6002_6003][gs_ctl]: build dummy dw file success
[2023-08-04 08:36:26.313][68234][dn_6001_6002_6003][gs_ctl]: rename build status file success
[2023-08-04 08:36:26.321][68234][dn_6001_6002_6003][gs_ctl]: standby full build build completed(/app/ogdata/data/dn1).
[2023-08-04 08:36:26.758][68234][dn_6001_6002_6003][gs_ctl]: waiting for server to start...
.0 LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.

0 LOG:  [Alarm Module]Host Name: pghost6

0 LOG:  [Alarm Module]Host IP: pghost6. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>

0 LOG:  [Alarm Module]Cluster Name: gauss_omm

0 LOG:  [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 57

0 WARNING:  failed to open feature control file, please check whether it exists: FileName=gaussdb.version, Errno=2, Errmessage=No such file or directory.
0 WARNING:  failed to parse feature control file: gaussdb.version.
0 WARNING:  Failed to load the product control file, so gaussdb cannot distinguish product version.
The core dump path is an invalid directory
2023-08-04 08:36:26.895 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [REDO] LOG:  Recovery parallelism, cpu count = 1, max = 4, actual = 1
2023-08-04 08:36:26.895 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [REDO] LOG:  ConfigRecoveryParallelism, true_max_recovery_parallelism:4, max_recovery_parallelism:4
2023-08-04 08:36:27.010 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG:  [Alarm Module]can not read GAUSS_WARNING_TYPE env.

2023-08-04 08:36:27.010 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG:  [Alarm Module]Host Name: pghost6

2023-08-04 08:36:27.010 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG:  [Alarm Module]Host IP: pghost6. Copy hostname directly in case of taking 10s to use 'gethostbyname' when /etc/hosts does not contain <HOST IP>

2023-08-04 08:36:27.010 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG:  [Alarm Module]Cluster Name: gauss_omm

2023-08-04 08:36:27.010 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG:  [Alarm Module]Invalid data in AlarmItem file! Read alarm English name failed! line: 57

2023-08-04 08:36:27.139 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] LOG:  loaded library "security_plugin"
2023-08-04 08:36:27.144 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] WARNING:  could not create any HA TCP/IP sockets
2023-08-04 08:36:27.144 [unknown] [unknown] localhost 140411958592896 0[0:0#0] 0 [BACKEND] FATAL:  could not create lock file "/app/opengauss/tmp/.s.PGSQL.26000.lock": No such file or directory
[2023-08-04 08:36:27.159][68234][dn_6001_6002_6003][gs_ctl]: waitpid 68257 failed, exitstatus is 256, ret is 2

[2023-08-04 08:36:27.159][68234][dn_6001_6002_6003][gs_ctl]: stopped waiting
[2023-08-04 08:36:27.159][68234][dn_6001_6002_6003][gs_ctl]: could not start server
Examine the log output.
[2023-08-04 08:36:27.159][68234][dn_6001_6002_6003][gs_ctl]: fopen build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:36:27.159][68234][dn_6001_6002_6003][gs_ctl]: fprintf build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:36:27.164][68234][dn_6001_6002_6003][gs_ctl]: fsync build pid file "/app/ogdata/data/dn1/gs_build.pid" success

Error 1

 could not create lock file "/app/opengauss/tmp/.s.PGSQL.26000.lock": No such file or directory
 # Create the /app/opengauss/tmp directory, then run build again.
 
 .........................................
[2023-08-04 08:45:10.215][68303][dn_6001_6002_6003][gs_ctl]:  done
[2023-08-04 08:45:10.533][68303][dn_6001_6002_6003][gs_ctl]: server started (/app/ogdata/data/dn1)
[2023-08-04 08:45:10.602][68303][dn_6001_6002_6003][gs_ctl]: fopen build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:45:10.602][68303][dn_6001_6002_6003][gs_ctl]: fprintf build pid file "/app/ogdata/data/dn1/gs_build.pid" success
[2023-08-04 08:45:11.525][68303][dn_6001_6002_6003][gs_ctl]: fsync build pid file "/app/ogdata/data/dn1/gs_build.pid" success
# The log above shows that build succeeded. Checking the process and the cluster status confirms that the cluster is back to normal.
omm@pghost6 /app/ogdata/data/dn1$ ps x
    PID TTY      STAT   TIME COMMAND
  68150 pts/0    S      0:00 -bash
  68320 ?        Ssl    0:03 /app/opengauss/app/2.0.1/bin/gaussdb -D /app/ogdata/data/dn1 -M standby
  68377 pts/0    R+     0:00 ps x
omm@pghost6 /app/ogdata/data/dn1$ gs_om -t status --detail
[   Cluster State   ]

cluster_state   : Normal
redistributing  : No
current_az      : AZ_ALL

[  Datanode State   ]

    node   node_ip         port      instance                     state
---------------------------------------------------------------------------------------
1  pghost3 192.168.56.30   26000      6001 /app/ogdata/data/dn1   P Primary Normal
2  pghost5 192.168.56.50   26000      6002 /app/ogdata/data/dn1   S Standby Normal
3  pghost6 192.168.56.60   26000      6003 /app/ogdata/data/dn1   S Standby Normal

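Remove invalid directories under pg_tblspc

If pghost6 was repaired by first installing a single-node cluster and then running build, check the size of the invalid files under the pg_tblspc directory once the repair succeeds. If they are large, consider deleting them so they do not take up significant disk space.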
