openGauss问题定位经验分享-------启停日志分析

环境部署:资源池化
  node1:主server,主db    node2:备1db    node3:备2db

问题现象:
stop node2, node2卡在stop流程中, node3节点pg_log日志中出现receive fast shutdown request


1. pg_log

  node3正常运行,看到node3日志中不应该出现日志:06:02:25.699 receive fast shutdown request

  该日志出现的场景只有两种情况:(1) 手动执行gs_ctl stop或者cm_ctl stop停掉数据库 (2) cm自动触发停掉数据库

  去查看cm日志

2.cm日志

  假如为了排除原因(1),查看三个节点的cm_ctl日志,尝试寻找时间点对应的cm_ctl stop,发现三个节点都没有,所以不是原因(1)

  查看cm主server(即node1的cm_server日志)的key_event-2024-05-14_044825-current.log,可知node3先被踢掉,想知道为啥被踢掉,看主server的cm_server-current.log日志

2024-05-14 06:02:25.006 tid=53426 MaxClusterAb: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(3) kick out.]
2024-05-14 06:02:26.019 tid=53426 MaxClusterAb: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(2) kick out.]
2024-05-14 06:02:26.019 tid=53426 MaxClusterAb: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(3) join in cluster.]
2024-05-14 06:03:51.086 tid=53426 MaxClusterAb: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(2) join in

  查看cm主server的cm_server-2024-05-14_044825-current.log,找到对应时间06:02:25.006的日志,kick out result: (index=2,nodeId=3) disconnect with (index=1,nodeId=2)原因是节点2和节点连接时间超过6s。cm应该是有自己的judgment逻辑,具体细节不清楚为啥把node3踢掉,而不是踢掉node2(实际情况是node2正在被stop,卡在stop中)

2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: [KeyEvent: KEY_EVENT_RES_ARBITRATE] [Instance: 0] [Details: node(3) kick out.]
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: kick out result: (index=2,nodeId=3) disconnect with (index=1,nodeId=2).
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: Network timeout:6
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: Network base_time:2024-05-14 06:02:25
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: [RHB] hb infos: |1970-01-01 08:00:00|2024-05-14 06:02:19|2024-05-14 06:02:23|
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: [RHB] hb infos: |2024-05-14 06:02:19|1970-01-01 08:00:00|2024-05-14 06:02:18|
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: [RHB] hb infos: |2024-05-14 06:02:24|2024-05-14 06:02:19|1970-01-01 08:00:00|
2024-05-14 06:02:25.006 tid=53426 MaxClusterAb LOG: cms will set key(/pkq/CM/CMServer/Cluster) value([{"version":"16", "nodes": [{"id":1}, {"id":2}]}]) to ddb.
2024-05-14 06:02:25.141 tid=53410 IO_WORKER LOG: cm_agent connected from nodeId 2, conn count=3.
2024-05-14 06:02:25.436 tid=53408 IO_WORKER LOG: cmserver send msg to node 3, msgtype: MSG_CM_AGENT_RES_STATUS_LIST
2024-05-14 06:02:25.436 tid=53411 CTL_WORKER LOG: [CUS_RES] [UpdateIsworkList] res(dms_res), version=144, status:
2024-05-14 06:02:25.436 tid=53411 CTL_WORKER LOG: [CUS_RES] nodeId=1, cmInstId=6001, resInstId=6001, status=1, isWork=1;
2024-05-14 06:02:25.436 tid=53411 CTL_WORKER LOG: [CUS_RES] nodeId=2, cmInstId=6002, resInstId=6002, status=1, isWork=1;
2024-05-14 06:02:25.436 tid=53411 CTL_WORKER LOG: [CUS_RES] nodeId=3, cmInstId=6003, resInstId=6003, status=1, isWork=0;

  理论上,cm应该有个地方调用dms_contrl.sh,但是一直没找到cm调用dms_contrl.sh的地方,需进一步确定

3. dms_contrl.log

进入$PGDATA目录查看node3的dms_control.log日志,发现06:02:25 stop了数据库,也就是表明cm调用dms_contrl.sh脚本执行了stop,但是一直没找到cm调用dms_contrl.sh的地方,需进一步确定?

2024-05-14 06:01:55 start dn in /data/ss_env/install/dss_home success.
2024-05-14 06:02:25 stop /data/ss_env/install/app/bin/gaussdb...
2024-05-14 06:02:30 check /data/ss_env/install/app/bin/gaussdb in /data/ss_env/install/data/dn fail.
2024-05-14 06:02:31 check /data/ss_env/install/app/bin/gaussdb in /data/ss_env/install/data/dn fail.
2024-05-14 06:02:32 check /data/ss_env/install/app/bin/gaussdb in /data/ss_env/install/data/dn fail.
2024-05-14 06:02:33 check /data/ss_env/install/app/bin/gaussdb in /data/ss_env/install/data/dn fail.
2024-05-14 06:02:35 check /data/ss_env/install/app/bin/gaussdb in /data/ss_env/install/data/dn fail

dms_contrl.sh如下:

# 1st step: kill database
function Stop()
{
    log "stop ${GSDB_BIN}..."
    ${BIN_PATH}/gs_ctl stop -D ${GSDB_HOME}
    sleep 5

    pid=`program_pid ${GSDB_BIN} ${GSDB_HOME}`
    if [[ -z ${pid} ]]
    then
        log "${GSDB_BIN} stopped in dir ${GSDB_HOME}..."
    else
        log "Killing ${GSDB_BIN} if running..."
        kill_program ${GSDB_BIN} ${GSDB_HOME}
    fi
}



# 1st step: kill database
function Clean()
{
    log "stop ${GSDB_BIN}..."
    ${BIN_PATH}/gs_ctl stop -D ${GSDB_HOME}

    if [ $? -eq 0 ]
    then
        log "${GSDB_BIN} stopped in dir ${GSDB_HOME}..."
    else
        log "Killing ${GSDB_BIN} if running..."
        kill_program ${GSDB_BIN} ${GSDB_HOME}
    fi
}

4.在gs_ctl日志中可以再确定一下

进入到node3的gs_ctl日志,如下:

[2024-05-14 06:02:25]
[2024-05-14 06:02:25.619][51308][][gs_ctl]: gs_ctl stopped ,datadir is /data/ss_env/install/data/dn
[2024-05-14 06:03:25.678][51308][][gs_ctl]:  server does not shut down (/data/ss_env/install/data/dn)
[2024-05-14 06:14:56]
  • 4
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值