环境部署:资源池化
node1:主server,主db node2:备1db node3:备2db
问题现象:
stop node2, node2卡在stop流程中, node3节点pg_log日志中出现receive fast shutdown request
1. pg_log
node3正常运行,看到node3日志中不应该出现日志:06:02:25.699 receive fast shutdown request
该日志出现的场景只有两种情况:(1) 手动执行gs_ctl stop或者cm_ctl stop停掉数据库 (2) cm自动触发停掉数据库
去查看cm日志
2.cm日志
假如为了排除原因(1),查看三个节点的cm_ctl日志,尝试寻找时间点对应的cm_ctl stop,发现三个节点都没有,所以不是原因(1)
查看cm主server(即node1的cm_server日志)的key_event-2024-05-14_044825-current.log,可知node3先被踢掉,想知道为啥被踢掉,看主server的cm_server-current.log日志
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(3) kick out.]
2024-05-14 06:02:26.019 tid=53426 MaxClusterAb: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(2) kick out.]
2024-05-14 06:02:26.019 tid=53426 MaxClusterAb: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(3) join in cluster.]
2024-05-14 06:03:51.086 tid=53426 MaxClusterAb: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(2) join in
查看cm主server的cm_server-2024-05-14_044825-current.log,找到对应时间06:02:25.006的日志,kick out result: (index=2,nodeId=3) disconnect with (index=1,nodeId=2)原因是节点2和节点连接时间超过6s。cm应该是有自己的judgment逻辑,具体细节不清楚为啥把node3踢掉,而不是踢掉node2(实际情况是node2正在被stop,卡在stop中)
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(3) kick out.]
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: kick out result: (index=2,nodeId=3) disconnect with (index=1,nodeId=2).
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: Network timeout:6
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: Network base_time:2024-05-14 06:02:25
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: [RHB] hb infos: |1970-01-01 08:00:00|2024-05-14 06:02:19|2024-05-14 06:02:23|
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: [RHB] hb infos: |2024-05-14 06:02:19|1970-01-01 08:00:00|2024-05-14 06:02:18|
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: [RHB] hb infos: |2024-05-14 06:02:24|2024-05-14 06:02:19|1970-01-01 08:00:00|
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: cms will set key(/pkq/CM/CMServer/Cluster) value([{"version":"16", "nodes": [{"id":1}, {"id":2}]}]) to ddb.
2024-05-14 06:02:25.141 tid=53410 IO_WORKER LOG: cm_agent connected from nodeId 2, conn count=3.
2024-05-14 06:02:25.436 tid=53408 IO_WORKER LOG: cmserver send msg to node 3, msgtype: MSG_CM_AGENT_RES_STATUS_LIST
2024-05-14 06:02:25.436 tid=53411 CTL_WORKER LOG: [CUS_RES] [UpdateIsworkList] res(dms_res), version=144, status:
2024-05-14 06:02:25.436 tid=53411 CTL_WORKER LOG: [CUS_RES] nodeId=1, cmInstId=6001, resInstId=6001, status=1, isWork=1;
2024-05-14 06:02:25.436 tid=53411 CTL_WORKER LOG: [CUS_RES] nodeId=2, cmInstId=6002, resInstId=6002, status=1, isWork=1;
2024-05-14 06:02:25.436 tid=53411 CTL_WORKER LOG: [CUS_RES] nodeId=3, cmInstId=6003, resInstId=6003, status=1, isWork=0;
理论上,cm应该有个地方调用dms_contrl.sh,但是一直没找到cm调用dms_contrl.sh的地方,需进一步确定
3. dms_contrl.log
进入$PGDATA目录查看node3的dms_control.log日志,发现06:02:25 stop了数据库,也就是表明cm调用dms_contrl.sh脚本执行了stop,但是一直没找到cm调用dms_contrl.sh的地方,需进一步确定?
2024-05-14 06:01:55 start dn in /data/ss_env/install/dss_home success.
2024-05-14 06:02:25 stop /data/ss_env/install/app/bin/gaussdb...
2024-05-14 06:02:30 check /data/ss_env/install/app/bin/gaussdb in /data/ss_env/install/data/dn fail.
2024-05-14 06:02:31 check /data/ss_env/install/app/bin/gaussdb in /data/ss_env/install/data/dn fail.
2024-05-14 06:02:32 check /data/ss_env/install/app/bin/gaussdb in /data/ss_env/install/data/dn fail.
2024-05-14 06:02:33 check /data/ss_env/install/app/bin/gaussdb in /data/ss_env/install/data/dn fail.
2024-05-14 06:02:35 check /data/ss_env/install/app/bin/gaussdb in /data/ss_env/install/data/dn fail
dms_contrl.sh如下:
# 1st step: kill database
function Stop()
{
log "stop ${GSDB_BIN}..."
${BIN_PATH}/gs_ctl stop -D ${GSDB_HOME}
sleep 5
pid=`program_pid ${GSDB_BIN} ${GSDB_HOME}`
if [[ -z ${pid} ]]
then
log "${GSDB_BIN} stopped in dir ${GSDB_HOME}..."
else
log "Killing ${GSDB_BIN} if running..."
kill_program ${GSDB_BIN} ${GSDB_HOME}
fi
}
# 1st step: kill database
function Clean()
{
log "stop ${GSDB_BIN}..."
${BIN_PATH}/gs_ctl stop -D ${GSDB_HOME}
if [ $? -eq 0 ]
then
log "${GSDB_BIN} stopped in dir ${GSDB_HOME}..."
else
log "Killing ${GSDB_BIN} if running..."
kill_program ${GSDB_BIN} ${GSDB_HOME}
fi
}
4.在gs_ctl日志中可以再确定一下
进入到node3的gs_ctl日志,如下:
[2024-05-14 06:02:25]
[2024-05-14 06:02:25.619][51308][][gs_ctl]: gs_ctl stopped ,datadir is /data/ss_env/install/data/dn
[2024-05-14 06:03:25.678][51308][][gs_ctl]: server does not shut down (/data/ss_env/install/data/dn)
[2024-05-14 06:14:56]