Oracle DB CRS监听故障处理

  1. List item

上周五生产Oracle DB集群出了一次比较离奇的报警,半夜两点开始解决处理,前后折腾到下午总算搞定。不废话,处理过程如下:

一.报错过程

1.Jul 27, 2024 1:40:08 AM CST node #01先报错:
在这里插入图片描述
2.Jul 27, 2024 2:07:05 AM CST node #02报错:
在这里插入图片描述
3.集群down在这里插入图片描述

二.排查过程:

1.查看业务进程,正常;
2.新建连接访问不了DB,初步判断为监听异常;
3.查看集群状态:crsctl stat res -t,异常
4.检查监听状态(lsnrctl status):node #01 hang死,无反应,node #02,正常,但监听service显示为0;
5.尝试重启监听(lsnrctl stop),
在这里插入图片描述
在这里插入图片描述
6.然后是一系列报错:
1.Message=[OHASD(4869)]CRS-2771: Maximum restart attempts reached for resource ‘ora.crsd’; .will not restart. See /grid/app/grid/diag/crs/mcs1/crs/trace/alert.log for details.
2.Message=[CRSD(21006)]CRS-1019: The OCR Service exited on host mcs1. Details in /grid/app/grid/diag/crs/mcs1/crs/trace/crsd.trc2024-07-27T14:20:57.334410+08:00Errors in file /grid/app/grid/diag/crs/mcs1/crs/trace/crsd.trc (incident=89):~CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []~Incident details in: /grid/app/grid/diag/crs/mcs1/crs/incident/incdir_89/crsd_i89.trc~ See /grid/app/grid/diag/crs/mcs1/crs/trace/alert.log for details.
3.Message=The database status is UNKNOWN.
4.Message=[OCSSD(29448)]CRS-1605: CSSD voting file is online: /dev/mapper/ocr02; details in /grid/app/grid/diag/crs/mcs2/crs/trace/ocssd.trc. See /grid/app/grid/diag/crs/mcs2/crs/trace/alert.log for details.
5.Message=[CRSD(30935)]CRS-1019: The OCR Service exited on host mcs1. Details in /grid/app/grid/diag/crs/mcs1/crs/trace/crsd.trc2024-07-27T15:21:23.329962+08:00Errors in file /grid/app/grid/diag/crs/mcs1/crs/trace/crsd.trc (incident=441):~CRS-1019 [] [] [] [] [] [] [] [] [] [] [] []~Incident details in: /grid/app/grid/diag/crs/mcs1/crs/incident/incdir_441/crsd_i441.trc~ See /grid/app/grid/diag/crs/mcs1/crs/trace/alert.log for details.
6.Message=Failed to connect to database instance: Failed to connect: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection.

三.log排查:

定位至/grid/app/grid/diag/crs/mcs1/crs/trace/ohasd_orarootagent_root_70586.trc
在这里插入图片描述
1:25 可见连接存储方向出现闪断

2024-07-27 01:25:07.051 :CLSDYNAM:2728253184: [ora.storage]{0:9:3} [check] StorageAgent::check NODEROLE_HUB getOCRdetails
2024-07-27 01:25:07.054 :CLSDYNAM:2728253184: [ora.storage]{0:9:3} [check] StorageAgent::parsekgforetcodes retcode = 0, kgfoCheckMount(OCRVOTE), flag 4
2024-07-27 01:25:07.054 :CLSDYNAM:2728253184: [ora.storage]{0:9:3} [check] StorageAgent::check kgfo returncode 0
2024-07-27 01:25:29.101 : AGFW:3640911616: {0:0:2} Agent received the message: AGENT_HB[Engine] ID 12293:35716634
2024-07-27 01:25:36.733 :CLSDYNAM:3649316608: [ ora.crf]{0:9:3} [check] Check return = 0, state detail = NULL
2024-07-27 01:25:36.977 : AGENT:3643012864: {0:9:3} {0:9:3} Created alert : (:CRSAGF00113:) : Aborting the command: check for resource: ora.crsd 1 1

1:34 节点2CRS同步报错
2024-07-27 01:34:11.647 [ORAROOTAGENT(5576)]CRS-5818: Aborted command ‘check’ for resource ‘ora.crsd’. Details at (:CRSAGF00113:) {0:9:3} in /grid/app/grid/diag/crs/mcs2/crs/trace/ohasd_orarootagent_root.trc.
2024-07-27 01:50:11.605 [ORAROOTAGENT(5576)]CRS-5818: Aborted command ‘check’ for resource ‘ora.crsd’. Details at (:CRSAGF00113:) {0:9:3} in /grid/app/grid/diag/crs/mcs2/crs/trace/ohasd_orarootagent_root.trc.
2024-07-27 02:05:11.887 [ORAAGENT(21558)]CRS-5822: Agent ‘/grid/12.2.0/grid/bin/oraagent_oracle’ disconnected from server. Details at (:CRSAGF00117:) {0:9:5387} in /grid/app/grid/diag/crs/mcs2/crs/trace/crsd_oraagent_oracle.trc.
2024-07-27 02:05:11.887 [ORAROOTAGENT(9805)]CRS-5822: Agent ‘/grid/12.2.0/grid/bin/orarootagent_root’ disconnected from server. Details at (:CRSAGF00117:) {0:5:482} in /grid/app/grid/diag/crs/mcs2/crs/trace/crsd_orarootagent_root.trc.
2024-07-27 02:05:11.900 [ORAAGENT(9998)]CRS-5822: Agent ‘/grid/12.2.0/grid/bin/oraagent_grid’ disconnected from server. Details at (:CRSAGF00117:) {0:7:4} in /grid/app/grid/diag/crs/mcs2/crs/trace/crsd_oraagent_grid.trc.

6:01 CRS反复重启无法完成GIPC 通讯。
2024-07-27 06:01:01.561 :CLSDYNAM:2193491712: [ora.crsd]{0:3:2263} [check] DaemonAgent::clsdmCheck 040 sendMessage excp:
2024-07-27 06:01:01.562 :CLSDYNAM:2193491712: [ora.crsd]{0:3:2263} [check] DaemonAgent::check returned 1
2024-07-27 06:01:01.562 :CLSDYNAM:2193491712: [ora.crsd]{0:3:2263} [check] DaemonAgent::check checkErrorCode=16777215, pestate=65280,perole=16711680, pemode=255
2024-07-27 06:01:01.562 : AGFW:2199795456: {0:3:2263} ora.crsd 1 1 state changed from: CLEANING to: OFFLINE
2024-07-27 06:01:01.562 : AGFW:2199795456: {0:3:2263} RECYCLE_AGENT attribute not found
2024-07-27 06:01:01.562 : AGFW:2199795456: {0:3:2263} Agent sending last reply for: RESOURCE_CLEAN[ora.crsd 1 1] ID 4100:34536838
2024-07-27 06:01:01.562 : CLSDMC:2166093568: Connecting to ipc://mcs2_CRSD
2024-07-27 06:01:01.562 : CLSDMC:2166093568: Error: gipcWait for gipcConnect - ret_gipcreqinfo=gipcretConnectionRefused, type_gipcreqinfo=gipcreqtypeConnect
2024-07-27 06:01:01.563 :CLSDYNAM:2166093568: [ora.crsd][ALERT]{0:3:2260} [check] ClsdmClient::sendMessage clsdmc_send error rmsg:0 ecode:-7 errbuf:
2024-07-27 06:01:01.563 :CLSDYNAM:2166093568: [ora.crsd]{0:3:2260} [check] DaemonAgent::clsdmCheck 040 sendMessage excp:
2024-07-27 06:01:01.563 :CLSDYNAM:2166093568: [ora.crsd]{0:3:2260} [check] Process id 25201 translated to
2024-07-27 06:01:01.563 :CLSDYNAM:2166093568: [ora.crsd]{0:3:2260} [check] DaemonAgent::clsdmAndPidChecks 040 adjusted timeout:48000
2024-07-27 06:01:01.563 : CLSDMC:2166093568: Connecting to ipc://mcs2_CRSD
2024-07-27 06:01:01.563 : CLSDMC:2166093568: Error: gipcWait for gipcConnect - ret_gipcreqinfo=gipcretConnectionRefused, type_gipcreqinfo=gipcreqtypeConnect
2024-07-27 06:01:01.563 :CLSDYNAM:2166093568: [ora.crsd][ALERT]{0:3:2260} [check] ClsdmClient::sendMessage clsdmc_send error rmsg:0 ecode:-7 errbuf:

分析:初步判断为存储链路闪断后导致的CRS资源OFFLINE,且无法从错误状态中自动改出。
临时解决办法:1.KILL两个节点GIPC进程,以期重新完成GIPC连接。
2.如上述依然无法改出,重启两个节点CRS

总结:

1.本次故障还好没对生产造成太大影响,因为是监听故障,只影响到对数据库的新连接,发生在夜间凌晨。
2.查询Oracle技术支持文档发现,根本原因其实是个Bug:
在这里插入图片描述
3.A3报告:
在这里插入图片描述

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值