Gaussdb分布式集群状态为degraded，其中2个CN节点状态为Deleted（上）

最新推荐文章于 2024-09-27 23:54:01 发布

如清风一般

最新推荐文章于 2024-09-27 23:54:01 发布

阅读量830

点赞数 18

文章标签： gaussdb 分布式

本文链接：https://blog.csdn.net/Oliviahome/article/details/142338205

版权

问题描述

服务器卡顿的厉害，xshell工具不能访问，联系服务器管理员后，其对服务器进行了重启，恢复访问后查看集群状态显示状态异常，出现了CN服务异常告警，查询集群状态， CN组件状态为Deleted。
详情如下：

--查看集群状态
[omm@node01 ~]$ cm_ctl query -Cvipd
[  CMServer State   ]

node             node_ip         instance                                state
--------------------------------------------------------------------------------
1  192.168.30.51 192.168.30.51   1    /gaussdb/cluster/data/cm/cm_server Primary
2  192.168.30.52 192.168.30.52   2    /gaussdb/cluster/data/cm/cm_server Standby
3  192.168.30.53 192.168.30.53   3    /gaussdb/cluster/data/cm/cm_server Standby

[    ETCD State     ]

node             node_ip         instance                        state
------------------------------------------------------------------------------
1  192.168.30.51 192.168.30.51   7001 /gaussdb/cluster/data/etcd StateFollower
2  192.168.30.52 192.168.30.52   7002 /gaussdb/cluster/data/etcd StateFollower
3  192.168.30.53 192.168.30.53   7003 /gaussdb/cluster/data/etcd StateLeader

[   Cluster State   ]

cluster_state   : Degraded  降级
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[ Coordinator State ]

node             node_ip         instance                             state
----------------------------------------------------------------------
1  192.168.30.51 192.168.30.51   5001 8000   /gaussdb/cluster/data/cn Deleted
2  192.168.30.52 192.168.30.52   5002 8000   /gaussdb/cluster/data/cn Normal
3  192.168.30.53 192.168.30.53   5003 8000   /gaussdb/cluster/data/cn Deleted

[ Central Coordinator State ]

node             node_ip         instance                      state
----------------------------------------------------------------------
2  192.168.30.52 192.168.30.52   5002 /gaussdb/cluster/data/cn Normal

[     GTM State     ]

node             node_ip         instance                       state                    sync_state
-------------------------------------------------------------------------------------------------------
1  192.168.30.51 192.168.30.51   1001 /gaussdb/cluster/data/gtm P Primary Connection ok  Sync
2  192.168.30.52 192.168.30.52   1002 /gaussdb/cluster/data/gtm S Standby Connection ok  Sync
3  192.168.30.53 192.168.30.53   1003 /gaussdb/cluster/data/gtm S Standby Connection ok  Sync

[  Datanode State   ]

node             node_ip         instance                                     state            | node             node_ip         instance                                     state            | node             node_ip         instancestate
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1  192.168.30.51 192.168.30.51   6001 33100  /gaussdb/cluster/data/dn/dn_6001 P Primary Normal | 2  192.168.30.52 192.168.30.52   6002 33100  /gaussdb/cluster/data/dn/dn_6002 S Standby Normal | 3  192.168.30.53 192.168.30.53   6003 33100  /gaussdb/cluster/data/dn/dn_6003S Standby Normal
2  192.168.30.52 192.168.30.52   6004 33120  /gaussdb/cluster/data/dn/dn_6004 P Standby Normal | 1  192.168.30.51 192.168.30.51   6005 33120  /gaussdb/cluster/data/dn/dn_6005 S Primary Normal | 3  192.168.30.53 192.168.30.53   6006 33120  /gaussdb/cluster/data/dn/dn_6006S Standby Normal
3  192.168.30.53 192.168.30.53   6007 33140  /gaussdb/cluster/data/dn/dn_6007 P Primary Normal | 2  192.168.30.52 192.168.30.52   6008 33140  /gaussdb/cluster/data/dn/dn_6008 S Standby Normal | 1  192.168.30.51 192.168.30.51   6009 33140  /gaussdb/cluster/data/dn/dn_6009S Standby Normal

连接工具也只能访问30.52节点，访问30.51和30.53节点提示如下报错：

问题原因

可能原因有以下：
1、虚拟机重启，断网等故障导致CN被剔除。
2、 CN与主DN断连导致CN被剔除。
3、 CN组件Down导致CN被剔除。
4、 CN组件频繁重启导致CN被剔除。
5、 CN组件被主动剔除。

分析过程

build相关日志路径：$GAUSSLOG/bin/gs_ctl

cm_agent日志路径：$GAUSSLOG/cm

实例日志路径：$GAUSSLOG/gs_log/dn_xx

cn日志路径：$GAUSSLOG/gs_log/cn_5001

系统日志（仅root用户）：/var/log/message

查看磁盘空间

df -h 查看磁盘空间，3个节点磁盘空间剩余还有不少。

查询集群状态

登录任一节点，切换到数据库用户，查询集群状态，集群状态变成了Cluster State：Degraded，30.51和30.53节点上的CN节点被剔除，状态变成了Deleted。

su - omm
cm_ctl query -Cvipd

输出如下：

[  CMServer State   ]

node             node_ip         instance                                state
--------------------------------------------------------------------------------
1  192.168.30.51 192.168.30.51   1    /gaussdb/cluster/data/cm/cm_server Primary
2  192.168.30.52 192.168.30.52   2    /gaussdb/cluster/data/cm/cm_server Standby
3  192.168.30.53 192.168.30.53   3    /gaussdb/cluster/data/cm/cm_server Standby
......
[   Cluster State   ]

cluster_state   : Degraded
redistributing  : No
balanced        : No
current_az      : AZ_ALL

[ Coordinator State ]

node             node_ip         instance                             state
----------------------------------------------------------------------
1  192.168.30.51 192.168.30.51   5001 8000   /gaussdb/cluster/data/cn Deleted
2  192.168.30.52 192.168.30.52   5002 8000   /gaussdb/cluster/data/cn Normal
3  192.168.30.53 192.168.30.53   5003 8000   /gaussdb/cluster/data/cn Deleted 
......

查看主节点的cm_server日志

登录到CMS主节点查集群状态获取主节点信息

任一个节点登录到CMS主节点(显示节点30.51的实例状态是Primary)，查看cm_server日志。 CMS主节点，可通过查询集群状态获取。

su - omm 
cm_ctl query -Cv

输出如下：

[omm@node01 ~]$ cm_ctl query -Cv
[  CMServer State   ]

node             instance state
---------------------------------
1  192.168.30.51 1        Primary  --显示节点30.51的实例状态是Primary
2  192.168.30.52 2        Standby
3  192.168.30.53 3        Standby
......

查看主节点的cm_server日志

打开对应时间点的cm_server-***.log日志，如对应时间点的日志已被压缩，则查看对应的cm_server-****.log.gz日志。

su - omm
cd $GAUSSLOG/cm/cm_server
vi cm_server-2024-08-26_100433-current.log

输出如下：

2024-08-26 11:32:58.134 tid=24745 Monitor ASYN LOG: instance(5001) heartbeat timeout, heartbeat:6191, threshold:6
2024-08-26 11:32:58.134 tid=24745 Monitor ASYN LOG: instance(5003) heartbeat timeout, heartbeat:6191, threshold:6
2024-08-26 11:32:58.143 tid=24759 AGENT_WORKER ASYN LOG: instance(node =1 instanceid =5001) is not in normal status(db_state=18 status=5)
2024-08-26 11:32:58.567 tid=24758 AGENT_WORKER ASYN LOG: instance(node =3 instanceid =5003) is not in normal status(db_state=18 status=5)
2024-08-26 11:32:58.577 tid=24770 STORAGE ASYN LOG: [ReadOnlyActRecordInitState] instance 5001 in the initial state, disk_usage:8, read_only:0, read_only_threshold:85
2024-08-26 11:32:59.144 tid=24759 AGENT_WORKER ASYN LOG: instance(node =1 instanceid =5001) is not in normal status(db_state=18 status=5)
2024-08-26 11:32:59.568 tid=24758 AGENT_WORKER ASYN LOG: instance(node =3 instanceid =5003) is not in normal status(db_state=18 status=5)
2024-08-26 11:33:00.144 tid=24759 AGENT_WORKER ASYN LOG: instance(node =1 instanceid =5001) is not in normal status(db_state=18 status=5)
2024-08-26 11:33:00.569 tid=24758 AGENT_WORKER ASYN LOG: instance(node =3 instanceid =5003) is not in normal status(db_state=18 status=5)
2024-08-26 11:33:01.146 tid=24759 AGENT_WORKER ASYN LOG: instance(node =1 instanceid =5001) is not in normal status(db_state=18 status=5)
2024-08-26 11:33:01.571 tid=24758 AGENT_WORKER ASYN LOG: instance(node =3 instanceid =5003) is not in normal status(db_state=18 status=5)
2024-08-26 11:33:02.146 tid=24759 AGENT_WORKER ASYN LOG: instance(node =1 instanceid =5001) is not in normal status(db_state=18 status=5)
2024-08-26 11:33:02.570 tid=24758 AGENT_WORKER ASYN LOG: instance(node =3 instanceid =5003) is not in normal status(db_state=18 status=5)

场景1：cm_server日志中搜索有关键词cn_down_to_delete=1

在cm_server日志中搜索有关键词cn_down_to_delete=1:
● 如对应时间点存在该信息，则原因为CN组件Down 导致，参考CN状态Down排查详细原因。
● 确认故障节点是否发生断网。
● 确认故障节点操作系统是否发生过重启。
● 故障处理后执行节点修复加回被剔除的CN。
● 如不涉及继续场景2。

场景2：cm_server日志中搜索有关键词isCnDnDisconnected=1

在cm_server日志中搜索有关键词： isCnDnDisconnected=1：
● 如对应时间点存在该信息，则原因为CN组件与主DN断链导致，此时需要排查CN与主DN之间的网络，待网络恢复后，执行节点修复加回被剔除的CN。
● 如不涉及继续场景3。

场景3：cm_server日志中搜索有关键词cmd_disable_cn=1

在cm_server日志中搜索有关键词： cmd_disable_cn=1：
● 如对应时间点存在该信息，则原因为CN组件被主动剔除，确认剔除原因后，后执行节点修复加回被剔除的CN。
● 如不涉及继续场景4。

场景4：cm_server日志中搜索有关键词cnReadonly=1

在cm_server日志中搜索有关键词： cnReadonly=1：
● 如对应时间点存在该信息，则原因为CN组件readonly 导致，参考CN状态为readonly排查详细原因。
● 如不涉及继续场景5。

场景5：cm_server日志中搜索有关键词cn instance restarts within ten minutes is more
than

在cm_server日志中搜索有关键词” cn instance restarts within ten minutes is more than”：
● 如对应时间点存在此信息，则原因为CN组件频繁重启导致，出现此种情况，参考
CN 启动失败进一步定位，故障处理后执行节点修复加回被剔除的CN。
● 如不涉及继续场景6

场景6：联系华为工程师提供技术支持

若以上均不涉及，则联系华为工程师提供技术支持

本文档属于场景6的情况。

Gaussdb分布式集群状态为degraded，其中2个CN节点状态为Deleted - 墨天轮 (modb.pro)