故障分析
一:DB Alert日志分析
节点2数据库实例alert日志(/opt/oracle/diag/rdbms/xxxx02/xxxxb2/trace/alert_xxxxb2.log):
[Wed Mar 09 07:18:33 2022
LNS: Standby redo logfile selected for thread 2 sequence 424269 for destination LOG_ARCHIVE_DEST_2
Wed Mar 09 07:18:43 2022
Archived Log entry 2554691 added for thread 2 sequence 424268 ID 0x9f058ce7 dest 1:
Wed Mar 09 07:28:27 2022
Thread 2 advanced to log sequence 424270 (LGWR switch)
Current log# 5 seq# 424270 mem# 0: +REDODG01/xxxx02/redo02_2.log
Wed Mar 09 07:28:54 2022
LNS: Standby redo logfile selected for thread 2 sequence 424270 for destination LOG_ARCHIVE_DEST_2
Wed Mar 09 07:29:19 2022 《------- 从7:29开始无日志输出直到07:36分
Archived Log entry 2554697 added for thread 2 sequence 424269 ID 0x9f058ce7 dest 1:
Wed Mar 09 07:36:56 2022
***********************************************************************
Fatal NI connect error 12537, connecting to:
(LOCAL=NO) 《----- 从07:36分开始出现数据库实例无法和服务器进程进行通讯的报错,限于篇幅原因以下省略了类似错误告警日志
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 09-MAR-2022 07:36:57
Tracing not turned on.
Tns error struct:
ns main err code: 12537
TNS-12537: TNS:connection closed
ns secondary err code: 12560
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
opiodr aborting process unknown ospid (35034) as a result of ORA-609
Wed Mar 09 07:38:30 2022
Thread 2 advanced to log sequence 424271 (LGWR switch)
Current log# 6 seq# 424271 mem# 0: +REDODG01/xxxx02/redo03_2.log
Wed Mar 09 07:40:42 2022
Wed Mar 09 07:40:42 2022
***********************************************************************
Wed Mar 09 07:40:50 2022
Fatal NI connect error 12537, connecting to:
(LOCAL=NO)
Wed Mar 09 07:40:55 2022
Fatal NI connect error 12537, connecting to:
(LOCAL=NO)
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 09-MAR-2022 07:41:00
Tracing not turned on.
Tns error struct:
ns main err code: 12537
TNS-12537: TNS:connection closed
ns secondary err code: 12560
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
opiodr aborting process unknown ospid (35275) as a result of ORA-609
Wed Mar 09 07:40:55 2022
Fatal NI connect error 12537, connecting to:
(LOCAL=NO)
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 09-MAR-2022 07:41:02
Tracing not turned on.
Tns error struct:
ns main err code: 12537
TNS-12537: TNS:connection closed
ns secondary err code: 12560
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
TNS-12537: TNS:connection closed
ns secondary err code: 12560
ns secondary err code: 12560
nt main err code: 0
nt main err code: 0
nt secondary err code: 0
nt secondary err code: 0
nt OS err code: 0
opiodr aborting process unknown ospid (35405) as a result of ORA-609
opiodr aborting process unknown ospid (35362) as a result of ORA-609
Wed Mar 09 07:45:20 2022
opiodr aborting process unknown ospid (35416) as a result of ORA-609
Wed Mar 09 07:45:22 2022
***********************************************************************
Fatal NI connect error 12537, connecting to:
(LOCAL=NO)
nt OS err code: 0
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 09-MAR-2022 07:45:22
Tracing not turned on.
Tns error struct:
ns main err code: 12537
TNS-12537: TNS:connection closed
ns secondary err code: 12560
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Wed Mar 09 07:45:36 2022
opiodr aborting process unknown ospid (35383) as a result of ORA-609Wed Mar 09 07:45:34 2022
opiodr aborting process unknown ospid (35379) as a result of ORA-609
Wed Mar 09 07:45:36 2022
opiodr aborting process unknown ospid (35344) as a result of ORA-609
Wed Mar 09 07:45:42 2022
LMON (ospid: 40530) has not called a wait for 13 secs. 《---- LMON进程此时已经hang了13秒
Wed Mar 09 07:45:53 2022
***********************************************************************
Fatal NI connect error 12537, connecting to:
(LOCAL=NO)
VERSION INFORMATION:
TNS for Linux: Version 11.2.0.3.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
Time: 09-MAR-2022 07:45:54
Tracing not turned on.
Tns error struct:
ns main err code: 12537
TNS-12537: TNS:connection closed
ns secondary err code: 12560
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Wed Mar 09 07:46:17 2022
opiodr aborting process unknown ospid (35333) as a result of ORA-609
Wed Mar 09 07:46:50 2022
Dumping diagnostic data in directory=[cdmp_20220309074647], requested by (instance=1, osid=5621 (CKPT)), summary=[incident=928237].
Wed Mar 09 07:47:35 2022
GES: System Load is HIGH.
GES: Current load is 703.92 and high load threshold is 240.00 《--—— 提示此时系统负载过高
Wed Mar 09 08:10:32 2022
Starting ORACLE instance (normal). 《----- OS重启后数据库实例重新启动
****************** Large Pages Information *****************
Total Shared Global Region in Large Pages = 75 GB (100%)
Large Pages used by this instance: 38401 (75 GB)
Large Pages unused system wide = 2559 (5118 MB) (alloc incr 256 MB)
Large Pages configured system wide = 40960 (80 GB)
Large Page size = 2048 KB
***********************************************************
二:GI相关日志分析
1)节点2 GI alert日志(/opt/grid/products/log/irora07/alertirora07.log)
可以看到,从07:21开始数据库集群相关的进程状态开始出现异常,oraagent_grid 无法检测到集群相关进程的状态,如asm实例、监听、asm磁盘组等,直至 时间点“2022-03-09 07:26:10.134”节点2的crsd服务挂掉,接下来数据库实例也开始出现异常,影响前端业务访问。 接下来集群相关agent尝试重启crsd服务未果,直至时间点“2022-03-09 07:48:14.562”,接下来的一段时间无日志输出,直到8点09分os重启后才继续有新的日志输出,也就是说7点48分开始节点基本上hang死。
2022-03-09 06:42:10.386:
[client(10516)]CRS-0009:log file "/opt/grid/products/log/irora07/client/olsnodes.log" reopened
2022-03-09 06:42:10.386:
[client(10516)]CRS-0019:file rotation terminated. log file: "/opt/grid/products/log/irora07/client/olsnodes.log"
2022-03-09 07:21:02.808:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN1.lsnr'. Details at (:CRSAGF00113:) {0:1:17229} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:22:02.296:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN1.lsnr'. Details at (:CRSAGF00113:) {0:1:17229} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:25:02.522:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN1.lsnr'. Details at (:CRSAGF00113:) {0:1:17229} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:25:02.968:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.asm'. Details at (:CRSAGF00113:) {0:8:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:26:02.385:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN1.lsnr'. Details at (:CRSAGF00113:) {0:1:17229} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:26:03.024:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER1.lsnr'. Details at (:CRSAGF00113:) {0:8:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:26:03.213:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5014:Agent "/opt/grid/products/bin/oraagent.bin" timed out starting process "/opt/grid/products/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log"
2022-03-09 07:26:06.584:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.REDODG01.dg'. Details at (:CRSAGF00113:) {0:8:436} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:26:09.922:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.ARCHDG01.dg'. Details at (:CRSAGF00113:) {0:8:436} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:26:09.992:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.DATADG02.dg'. Details at (:CRSAGF00113:) {0:8:436} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:26:10.063:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.DATADG05.dg'. Details at (:CRSAGF00113:) {0:8:436} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:26:10.134:
[ohasd(31745)]CRS-2765:Resource 'ora.crsd' has failed on server 'irora07'. 《----crsd 服务挂掉
2022-03-09 07:26:11.650:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.OCR_DG.dg'. Details at (:CRSAGF00113:) {0:8:436} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:26:12.852:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5818:Aborted command 'check' for resource 'ora.DATADG01.dg'. Details at (:CRSAGF00113:) {0:8:436} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:26:17.313:
[/opt/grid/products/bin/oraagent.bin(48919)]CRS-5822:Agent '/opt/grid/products/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:8:448} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:26:18.021:
[/opt/grid/products/bin/oraagent.bin(39916)]CRS-5822:Agent '/opt/grid/products/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:7:15} in /opt/grid/products/log/irora07/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2022-03-09 07:26:18.904:
[/opt/grid/products/bin/orarootagent.bin(33985)]CRS-5822:Agent '/opt/grid/products/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:3:47301} in /opt/grid/products/log/irora07/agent/crsd/orarootagent_root/orarootagent_root.log.
2022-03-09 07:28:02.053:
[crsd(30634)]CRS-1012:The OCR service started on node irora07.
2022-03-09 07:28:19.729:
[crsd(30634)]CRS-1201:CRSD started on node irora07. 《---
2022-03-09 07:29:29.995:
[/opt/grid/products/bin/oraagent.bin(32510)]CRS-5818:Aborted command 'check' for resource 'ora.xxxx02.server2.svc'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2022-03-09 07:29:29.995:
[/opt/grid/products/bin/oraagent.bin(32510)]CRS-5818:Aborted command 'check' for resource 'ora.xxxx02.server7.svc'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2022-03-09 07:29:29.995:
[/opt/grid/products/bin/oraagent.bin(32510)]CRS-5818:Aborted command 'check' for resource 'ora.xxxx02.test1.svc'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2022-03-09 07:29:30.255:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.ARCHDG01.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:29:30.265:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.DATADG01.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:29:30.275:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.DATADG02.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:29:30.285:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.DATADG05.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:29:42.252:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.OCR_DG.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:29:42.296:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.REDODG01.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:30:46.717:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:30:46.727:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER1.lsnr'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:36:26.342:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.DATADG01.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:36:52.519:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.asm'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:36:56.245:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.ARCHDG01.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:37:32.672:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.asm'. Details at (:CRSAGF00113:) {0:2:15} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:37:46.988:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:37:48.484:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5014:Agent "/opt/grid/products/bin/oraagent.bin" timed out starting process "/opt/grid/products/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log"
2022-03-09 07:37:57.371:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN1.lsnr'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:39:46.833:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:39:47.369:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5014:Agent "/opt/grid/products/bin/oraagent.bin" timed out starting process "/opt/grid/products/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log"
2022-03-09 07:39:51.707:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.asm'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:40:00.520:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN1.lsnr'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:40:49.505:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER1.lsnr'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:40:56.961:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.asm'. Details at (:CRSAGF00113:) {0:2:23} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:40:57.313:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5014:Agent "/opt/grid/products/bin/oraagent.bin" timed out starting process "/opt/grid/products/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log"
2022-03-09 07:41:09.214:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER_SCAN1.lsnr'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:41:10.370:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5014:Agent "/opt/grid/products/bin/oraagent.bin" timed out starting process "/opt/grid/products/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log"
2022-03-09 07:41:25.557:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.DATADG05.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:41:25.558:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.DATADG01.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:41:25.498:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.DATADG02.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:41:25.377:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.ons'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:41:25.625:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.REDODG01.dg'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:41:30.902:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5014:Agent "/opt/grid/products/bin/oraagent.bin" timed out starting process "/opt/grid/products/opmn/bin/onsctli" for action "check": details at "(:CLSN00009:)" in "/opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log"
2022-03-09 07:41:55.251:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5818:Aborted command 'check' for resource 'ora.LISTENER.lsnr'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log.
2022-03-09 07:41:55.695:
[/opt/grid/products/bin/oraagent.bin(32501)]CRS-5014:Agent "/opt/grid/products/bin/oraagent.bin" timed out starting process "/opt/grid/products/bin/lsnrctl" for action "check": details at "(:CLSN00009:)" in "/opt/grid/products/log/irora07/agent/crsd/oraagent_grid/oraagent_grid.log"
2022-03-09 07:42:07.950:
[/opt/grid/products/bin/orarootagent.bin(32042)]CRS-5818:Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:6:4061} in /opt/grid/products/log/irora07/agent/ohasd/orarootagent_root/orarootagent_root.log.
2022-03-09 07:43:37.405:
[/opt/grid/products/bin/oraagent.bin(32931)]CRS-5818:Aborted command 'check' for resource 'ora.xxxx02.server2.svc'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2022-03-09 07:43:37.409:
[/opt/grid/products/bin/oraagent.bin(32931)]CRS-5818:Aborted command 'check' for resource 'ora.xxxx02.test1.svc'. Details at (:CRSAGF00113:) {2:34448:2} in /opt/grid/products/log/irora07/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2022-03-09 07:45:03.859:
[/opt/grid/products/bin/orarootagent.bin(32042)]CRS-5818:Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:6:4061} in /opt/grid/products/log/irora07/agent/ohasd/orarootagent_root/orarootagent_root.log.
2022-03-09 07:46:19.261:
[crsd(30634)]CRS-5828:Could not start agent '/opt/grid/products/bin/oraagent_grid'. Details at (:CRSAGF00130:) {2:34448:187} in /opt/grid/products/log/irora07/crsd/crsd.log.
2022-03-09 07:46:38.138:
[/opt/grid/products/bin/orarootagent.bin(32042)]CRS-5818:Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:6:4061} in /opt/grid/products/log/irora07/agent/ohasd/orarootagent_root/orarootagent_root.log.
2022-03-09 07:48:14.562:
[/opt/grid/products/bin/orarootagent.bin(32042)]CRS-5818:Aborted command 'check' for resource 'ora.crsd'. Details at (:CRSAGF00113:) {0:6:4061} in /opt/grid/products/log/irora07/agent/ohasd/orarootagent_root/orarootagent_root.log.
2022-03-09 08:09:06.089:
[ohasd(13163)]CRS-2112:The OLR service started on node irora07.
2022-03-09 08:09:06.237:
[ohasd(13163)]CRS-1301:Oracle High Availability Service started on node irora07.
2022-03-09 08:09:06.268:
[ohasd(13163)]CRS-8017:location: /etc/oracle/lastgasp has 2 reboot advisory log files, 0 were announced and 0 errors occurred
2022-03-09 08:09:08.697:
2)节点1 GI alert日志(/opt/grid/products/log/irora06/alertirora06.log)
2022-03-09 07:26:40.172:
[client(19685)]CRS-0014:An error occurred while attempting to delete file "/opt/grid/products/log/irora06/client/olsnodes.l03" during log file rotation. Additional diagnostics: LFI-00142: Unable to delete an existing file [olsnodes][l03] not owned by Oracle.
2022-03-09 07:26:40.172:
[client(19685)]CRS-0009:log file "/opt/grid/products/log/irora06/client/olsnodes.log" reopened
2022-03-09 07:26:40.172:
[client(19685)]CRS-0019:file rotation terminated. log file: "/opt/grid/products/log/irora06/client/olsnodes.log"
2022-03-09 07:44:10.507:
[ctssd(3818)]CRS-2409:The clock on host irora06 is not synchronous with the mean cluster time. No action has been taken as the Cluster Time Synchronization Service is running in observer mode.
2022-03-09 07:51:29.759:
[cssd(3635)]CRS-1663:Member kill issued by PID 5585 for 1 members, group DBxxxx02. Details at (:CSSGM00044:) in /opt/grid/products/log/irora06/cssd/ocssd.log. 《----- 节点1的lmon进程发起对节点2的member kill,即重启gi服务
2022-03-09 07:52:33.246:
[ohasd(3250)]CRS-8011:reboot advisory message from host: irora07, component: cssagent, with time stamp: L-2022-03-09-07:52:33.206。 《----member kill升级为node kill,节点2被驱逐出集群
[ohasd(3250)]CRS-8013:reboot advisory message text: Rebooting after limit 28020 exceeded; disk timeout 28020, network timeout 23310, last heartbeat from CSSD at epoch seconds 1646783525.171, 28030 milliseconds ago based on invariant clock value of 2390470364
2022-03-09 07:52:34.804:
[cssd(3635)]CRS-1601:CSSD Reconfiguration complete. Active nodes are irora06 irora08 . 《-----集群重配置完成
2022-03-09 07:52:34.819:
[crsd(4095)]CRS-5504:Node down event reported for node 'irora07'.
2022-03-09 07:52:34.831:
[ctssd(3818)]CRS-2407:The new Cluster Time Synchronization Service reference node is host irora06.
2022-03-09 07:52:43.584:
[crsd(4095)]CRS-2773:Server 'irora07' has been removed from pool 'Generic'.
2022-03-09 07:52:43.585:
[crsd(4095)]CRS-2773:Server 'irora07' has been removed from pool 'ora.xxxx02'.
2022-03-09 07:52:43.585:
[crsd(4095)]CRS-2773:Server 'irora07' has been removed from pool 'ora.xxxx02_server2'.
3)节点1 cssd 日志(/opt/grid/products/log/irora06/cssd/ocssd.log)
2022-03-09 07:51:26.133: [ CSSD][3037062912]clssnmSendingThread: sending status msg to all nodes
2022-03-09 07:51:26.133: [ CSSD][3037062912]clssnmSendingThread: sent 5 status msgs to all nodes
2022-03-09 07:51:29.759: [ CSSD][3054339840]clssgmExecuteClientRequest: Member kill request from client (0x7f61a033c120)
2022-03-09 07:51:29.759: [ CSSD][3054339840](:CSSGM00044:)clssgmReqMemberKill: Kill requested map 0x00000002 flags 0x2 escalate 0xffffffff 《------------- 开始发起member kill
2022-03-09 07:51:29.760: [ CSSD][3032332032]clssgmMbrKillThread: Kill requested map 0x00000002 id 1 Group name DBxxxx02 flags 0x00000001 start time 0x8b4d5230 end time 0x8b4dc954 time out 30500 req node 1
2022-03-09 07:51:29.760: [ CSSD][3032332032]clssgmAllocateRPCIndex: allocated rpc 1266 (0x7f61b6281ee0)
。。。。
2022-03-09 07:52:00.233: [ CSSD][3139344128]clssgmCleanupNodeContexts(): cleaning up nodes, rcfg(291822534)
2022-03-09 07:52:00.233: [ CSSD][3139344128]clssgmCleanupNodeContexts(): successful cleanup of nodes rcfg(291822534)
2022-03-09 07:52:00.233: [ CSSD][3139344128]clssgmStartNMMon: completed node cleanup
2022-03-09 07:52:00.260: [ CSSD][3032332032]clssgmMbrKillThread: Time up: Start time -1957866960 End time -1957836460 Current time -1957836460 timeout 30500 《------- member kill超时
2022-03-09 07:52:00.260: [ CSSD][3032332032]clssgmMbrKillThread: Member kill request complete.
2022-03-09 07:52:00.260: [ CSSD][3032332032]clssgmMbrKillSendEvent: Missing answers or immediate escalation: Req member 0 Req node 1 Number of answers expected 0 Number of answers outstanding 1
2022-03-09 07:52:00.260: [ CSSD][3032332032]clssgmQueueGrockEvent: groupName(DBxxxx02) count(3) master(1) event(11), incarn 0, mbrc 0, to member 0, events 0x68, state 0x0
2022-03-09 07:52:00.261: [ CSSD][3032332032]clssgmMbrKillEsc: Escalating node 2 Member request 0x00000002 Member success 0x00000000 Member failure 0x00000000 Number left to kill 1
2022-03-09 07:52:00.261: [ CSSD][3032332032]clssnmMarkNodeForRemoval: node 2, irora07 marked for removal
2022-03-09 07:52:00.261: [ CSSD][3032332032]clssnmKillNode: node 2 (irora07) kill initiated
《------- 由member kill升级为node kill
2022-03-09 07:52:00.261: [ CSSD][3032332032]clssgmMbrKillThread: Exiting
三:osw相关日志分析
节点2 OSW日志
故障时间段vmstat数据分析:
1)节点2的osw日志信息有断档(07:19:36至08:09:30),从下面的osw数据可以看到从7点02分左右开始,osw数据采集已经开始出现异常,经常会出现3分钟采集不到数据的情况。
2)从7点开始系统交换分区使用频繁,有大量的和swap相关的页换进和换出操作。
3)系统内存余量很低,虽然系统的r运行队列不高,但是经常会出现b队列较高的情况且wa占比达10%左右,说明存在较为严重的I/O性能问题。
zzz ***Wed Mar 9 07:01:21 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
6 3 3827952 1638668 52080 569560 0 0 553 1175 0 0 6 1 92 1 0
5 5 3823968 1621720 52412 570144 8828 0 10173 50310 63430 15628 5 3 82 9 0
5 2 3821476 1604548 52740 572500 5800 0 7885 47451 62802 14435 5 3 84 8 0
zzz ***Wed Mar 9 07:01:50 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 2 3783400 972404 74272 604084 0 0 553 1175 0 0 6 1 92 1 0
6 2 3783336 945856 75312 602072 48 0 20245 210 62220 13415 3 3 91 3 0
6 2 3782708 926044 76276 602460 1540 0 20653 539 62286 14408 5 3 89 3 0
zzz ***Wed Mar 9 07:03:05 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
9 18 3991048 982544 33764 386008 0 0 553 1175 0 0 6 1 92 1 0
10 13 4038444 1043704 31796 380168 4408 49872 14657 49986 63382 16549 7 5 65 22 0
7 6 4055028 1025820 30864 378096 23880 35308 44841 35371 66177 18751 7 5 72 16 0
zzz ***Wed Mar 9 07:03:34 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
7 4 4024340 1989548 31284 442948 0 0 553 1175 0 0 6 1 92 1 0
7 3 4022876 1976432 31620 445128 3908 0 6950 42625 62407 16874 6 3 84 6 0
6 3 4021268 1970280 32048 445448 3920 0 4961 57801 61703 16986 6 3 86 5 0
zzz ***Wed Mar 9 07:04:03 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 2 4001900 1271372 45636 459204 0 0 553 1175 0 0 6 1 92 1 0
4 2 4001820 1140692 46724 459500 112 0 111097 83 61604 14179 3 4 91 3 0
5 2 4001616 1007028 47656 460148 436 0 114002 187 61673 14444 3 4 90 3 0
zzz ***Wed Mar 9 07:04:32 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 3 3988144 760532 73704 464512 0 0 553 1175 0 0 6 1 92 1 0
8 3 3985844 753980 74084 463632 6844 0 7641 43228 61608 14286 10 4 79 8 0
5 2 3984352 744548 74804 464932 3768 0 4890 60910 61502 13814 7 3 84 6 0
zzz ***Wed Mar 9 07:05:02 CST 2022 《---- 接下来的3分钟osw数据采集异常
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 2 3979188 664112 97732 467560 0 0 553 1175 0 0 6 1 92 1 0
13 2 3988164 792904 59160 420776 464 9088 34473 9287 65324 13469 5 7 85 3 0
7 21 4111284 1058380 27052 356148 14788 127124 33557 127301 70190 14829 7 12 59 22 0
zzz ***Wed Mar 9 07:08:24 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
6 4 4395788 1854460 34828 469948 0 0 553 1175 0 0 6 1 92 1 0
7 2 4388440 1846588 35196 473320 3992 0 6545 138 62242 13458 5 3 85 7 0
6 2 4386380 1821132 35772 475496 3764 0 10340 16395 63013 14999 5 3 85 7 0
zzz ***Wed Mar 9 07:08:53 CST 2022
。。。
zzz ***Wed Mar 9 07:09:23 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 0 4329972 624476 47452 438864 0 0 553 1175 0 0 6 1 92 1 0
4 1 4329940 599544 47584 440108 48 0 1053 34199 61047 12244 4 3 92 1 0
5 0 4329840 598096 47712 440324 196 0 1182 61594 61710 12710 4 3 92 1 0
zzz ***Wed Mar 9 07:11:24 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
6 2 4471196 654472 29452 422528 0 0 553 1175 0 0 6 1 92 1 0
6 2 4468884 649952 30592 425404 28 0 2413 186 59799 12119 5 3 90 3 0
5 12 4467128 642476 31240 425448 3988 0 9329 31615 61397 14133 4 3 75 17 0
zzz ***Wed Mar 9 07:12:51 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 1 4632032 1711644 26104 429928 0 0 553 1175 0 0 6 1 92 1 0
4 3 4629772 1710468 26768 430176 568 0 1869 53291 61757 12385 4 3 90 3 0
6 1 4629556 1708832 27528 430460 436 0 2085 47938 61718 12607 4 3 90 3 0
zzz ***Wed Mar 9 07:13:19 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
6 2 4610112 623096 50076 469208 0 0 553 1175 0 0 6 1 92 1 0
4 2 4609836 596332 50788 468732 228 0 34686 130 62668 13288 4 3 89 4 0
5 2 4609616 572604 51660 469792 136 0 28953 294 63369 13801 3 3 90 4 0
zzz ***Wed Mar 9 07:16:01 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
8 3 4756096 1312068 22824 498764 0 0 553 1175 0 0 6 1 92 1 0
6 7 4751724 1303260 23188 499072 5692 0 7757 47194 63456 17844 10 3 81 6 0
5 3 4749868 1296636 23676 502004 3312 0 6005 48277 65172 18193 3 3 88 6 0
zzz ***Wed Mar 9 07:19:36 CST 2022
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
5 4 5099928 959772 49968 478000 0 0 553 1175 0 0 6 1 92 1 0
5 3 5096136 877512 50620 478236 2792 0 75029 263 62548 13558 5 3 86 6 0
8 2 5095348 834504 51388 478412 2908 0 59097 106 62817 13176 4 3 88 5 0
故障时间段meminfo数据分析:
这里选取7点19分的数据,可以看到系统总内存为125g,在采样时间点剩余内存仅700m不到。
zzz ***Wed Mar 9 07:19:40 CST 2022
MemTotal: 132112984 kB
MemFree: 700236 kB
Buffers: 54364 kB
Cached: 478336 kB
SwapCached: 726196 kB
Active: 2489080 kB
Inactive: 804352 kB
Active(anon): 2434628 kB
Inactive(anon): 597620 kB
Active(file): 54452 kB
Inactive(file): 206732 kB
Unevictable: 344112 kB
Mlocked: 228356 kB
SwapTotal: 16383992 kB
SwapFree: 11289116 kB
Dirty: 2572 kB
Writeback: 0 kB
AnonPages: 2889760 kB
Mapped: 189076 kB
Shmem: 61828 kB
Slab: 38791440 kB
SReclaimable: 3471968 kB
SUnreclaim: 35319472 kB
KernelStack: 2922488 kB
PageTables: 385748 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 40497444 kB
Committed_AS: 15252384 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 815316 kB
VmallocChunk: 34240704308 kB
HardwareCorrupted: 0 kB
AnonHugePages: 401408 kb 《------该值非零,说明透明大页未关闭
HugePages_Total: 40960 《------
HugePages_Free: 5068
HugePages_Rsvd: 2509
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 8496 kB
DirectMap2M: 4134912 kB
DirectMap1G: 130023424 kB
故障时间段iostat数据分析:
因其他磁盘未见异常,这里只列出和磁盘sda相关的采样数据,=从下面的采样数据可以看到磁盘“sda”的繁忙度都在99%以上,很多时候都是保持100%繁忙度。
而磁盘sda服务器本地磁盘,上面存放有os系统相关文件、gi home\db home和rac集群软件相关的文件、osw软件相关文件等,而该磁盘之所以这么繁忙,是因为swap分区的频繁使用引起。
[grid@irora07 oswiostat]$ cat irora07_iostat_22.03.09.0700.dat | grep -E 'zzz|sda '
zzz ***Wed Mar 9 07:01:23 CST 2022
sda 764.00 2.00 538.00 4.00 6320.00 24.00 23.41 5.75 10.54 1.85 100.00
zzz ***Wed Mar 9 07:01:52 CST 2022
sda 8.91 12.87 175.25 2.97 831.68 63.37 10.04 1.15 6.61 5.58 99.50
zzz ***Wed Mar 9 07:03:07 CST 2022
sda 1419.81 893.40 699.06 40.57 14203.77 2177.36 44.30 26.30 32.12 1.36 100.47
zzz ***Wed Mar 9 07:03:36 CST 2022
sda 430.00 30.00 343.00 8.00 4676.00 152.00 27.51 3.83 11.04 2.85 100.10
zzz ***Wed Mar 9 07:04:05 CST 2022
sda 333.06 25.62 416.53 15.70 3054.55 165.29 14.90 2.41 5.54 2.32 100.08
zzz ***Wed Mar 9 07:04:34 CST 2022
sda 2.00 40.00 257.00 19.00 1036.00 236.00 9.22 1.09 3.95 3.62 99.90
zzz ***Wed Mar 9 07:06:39 CST 2022
sda 740.59 644.55 466.34 40.59 8807.92 2827.72 45.91 12.81 25.22 1.95 99.01
zzz ***Wed Mar 9 07:08:26 CST 2022
sda 596.00 3.00 480.00 18.00 4796.00 84.00 19.60 4.85 9.67 2.01 99.90
zzz ***Wed Mar 9 07:08:55 CST 2022
sda 48.00 4.00 226.00 6.00 1096.00 40.00 9.79 0.99 4.26 4.16 96.40
zzz ***Wed Mar 9 07:09:25 CST 2022
sda 75.00 0.00 295.00 0.00 1856.00 0.00 12.58 1.21 4.12 3.37 99.50
zzz ***Wed Mar 9 07:11:27 CST 2022
sda 171.00 1701.00 280.00 202.00 2176.00 7616.00 40.63 2.52 5.22 2.07 99.80
zzz ***Wed Mar 9 07:12:53 CST 2022
sda 67.00 6.00 285.00 6.00 1816.00 48.00 12.81 1.42 4.78 3.43 99.80
zzz ***Wed Mar 9 07:13:21 CST 2022
sda 370.00 8101.00 455.00 435.00 8128.00 32216.00 90.66 65.83 51.15 1.12 99.70
zzz ***Wed Mar 9 07:16:12 CST 2022
sda 533.00 5.00 776.00 4.00 6168.00 36.00 15.91 8.11 10.41 1.28 99.80
zzz ***Wed Mar 9 07:19:38 CST 2022
sda 65.00 730.00 256.00 101.00 1632.00 3152.00 26.80 2.14 5.97 2.78 99.40
分析总结
1)此次xxxx02数据库集群节点2因系统内存资源耗尽且系统本地磁盘sda繁忙度达到100%,导致该节点由member kill升级到node kill,最终导致节点操作系统重启。
2)从集群的ocssd的日志来看,未发现心跳网络异常相关信息,而crsd日志中的“Authorization failed, network error”相关报错信息指的是client端操作时引起的网络超时错误,和服务端无关。 也就是说本次故障从数据库层面来看,和网络基本没有关联。
3)目前该集群的3个节点的透明大页特性均未关闭,存在较大的安全隐患,特别是在内存资源紧张的情况。虽然本次故障和这个特性没有关联,仍然建议关闭,具体可参考如下mos文档:
ALERT: Disable Transparent HugePages on SLES11, RHEL6, RHEL7, OL6, OL7, and UEK2 and above (Doc ID 1557478.1)
4)从当前系统的运行情况来看,内存资源存在瓶颈。
目前共享内存大页配置了80g,有一定的剩余量,可以考虑缩减大页来扩充内存资源