Problem
We received a database connection exception alert; the database alert log showed the instance was restarting. After the restart completed, the related OGG (Oracle GoldenGate) processes were started manually. We then investigated the cause of the restart.
Investigation
Checking the DB alert log:
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
ERROR: slave communication error with ASM
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
ERROR: slave communication error with ASM
ERROR: slave communication error with ASM
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
ERROR: slave communication error with ASM
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
ERROR: slave communication error with ASM
ERROR: slave communication error with ASM
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
ERROR: slave communication error with ASM
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
ERROR: slave communication error with ASM
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
Mon Jan 20 15:10:24 2020
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
Mon Jan 20 15:10:24 2020
WARNING: ASM communication error: op 17 state 0x50 (3113)
Mon Jan 20 15:10:24 2020
Mon Jan 20 15:10:24 2020
ERROR: slave communication error with ASM
ERROR: slave communication error with ASM
Checking the ASM alert log:
alert_+ASM2.log
Mon Jan 20 15:10:13 2020
LMD0 (ospid: 14709) waits for event 'ges remote message' for 162 secs.
LMS0 (ospid: 14710) waits for event 'gcs remote message' for 164 secs.
Errors in file /export/u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_lmhb_14713.trc (incident=312097):
ORA-29770: global enqueue process LMD0 (OSID 14709) is hung for more than 150 seconds
Incident details in: /export/u01/app/oracle/diag/asm/+asm/+ASM2/incident/incdir_312097/+ASM2_lmhb_14713_i312097.trc
Errors in file /export/u01/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_lmhb_14713.trc (incident=312098):
ORA-29770: global enqueue process LMS0 (OSID 14710) is hung for more than 150 seconds
Incident details in: /export/u01/app/oracle/diag/asm/+asm/+ASM2/incident/incdir_312098/+ASM2_lmhb_14713_i312098.trc
Mon Jan 20 15:10:24 2020
ERROR: Some process(s) is not making progress.
LMHB (ospid: 14713) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 14713): terminating the instance due to error 29770
Mon Jan 20 15:10:25 2020
ORA-1092 : opitsk aborting process
Mon Jan 20 15:10:25 2020
License high water mark = 39
Instance terminated by LMHB, pid = 14713
USER (ospid: 22341): terminating the instance
Instance terminated by USER, pid = 22341
Cause of the restart
The restart was caused by an Oracle bug; see the reference document below.
Solution
Restart the ASM instance before its uptime reaches 248 days (restarting only the DB instance will still trigger the bug), or apply the patch for the bug.
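Since the trigger is uptime rather than load, the restart can be scheduled ahead of time from the instance startup time (available from V$INSTANCE.STARTUP_TIME). A minimal sketch of the deadline arithmetic, using a hypothetical startup time and safety margin rather than a live query:

```python
from datetime import datetime, timedelta

# Hypothetical startup time -- in practice, obtain it from the ASM instance:
#   SELECT startup_time FROM v$instance;
startup_time = datetime(2019, 6, 1, 4, 30, 0)

# The bug fires once uptime approaches ~248 days, so schedule the
# restart with a safety margin well before that threshold.
BUG_THRESHOLD_DAYS = 248
SAFETY_MARGIN_DAYS = 14  # margin chosen arbitrarily for illustration

restart_deadline = startup_time + timedelta(days=BUG_THRESHOLD_DAYS - SAFETY_MARGIN_DAYS)
print("Restart the ASM instance before:", restart_deadline.date())
```

With the sample startup time above, the deadline lands on 2020-01-21, i.e. a couple of weeks before uptime would hit 248 days.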
Reference
Solaris SPARC: LMHB terminating the instance due to error 29770 (after around 248 days of instance uptime) (Doc ID 1473937.1)
Applies to:
Oracle Database - Enterprise Edition - Version 10.2.0.4 to 12.1.0.1 [Release 10.2 to 12.1]
Oracle Solaris on SPARC (64-bit)
Symptoms
Running Oracle GI/RAC 11gR2, 11.2.0.1 on a two node SPARC (64-bit) cluster.
LMHB terminated one of the instances.
The following errors were reported in the Alert log:
Errors in file /oracle/product/diag/rdbms///trace/_lmhb_29680.trc (incident=144129):
ORA-29770: global enqueue process LMON (OSID 29660) is hung for more than 70 seconds
ERROR: Some process(s) is not making progress.
LMHB (ospid: 29680) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 29680): terminating the instance due to error 29770
A review of the LMHB trace file shows that this issue is impacting all the Lock Management BG Processes.
For each LM BG processes (LMON, LMD0, LMS0, LMS1 and LCK0 in this case), the trace file shows information similar to the following:
*** 2012-06-17 02:06:31.607
==============================
LMON (ospid: 29660) has not moved for 30 sec (1339891590.1339891560)
: waiting for event 'rdbms ipc message' for 25 secs with wait_id 3381334669.
===[ Wait Chain ]===
Wait chain is empty.
*** 2012-06-17 02:06:36.617
==============================
LMD0 (ospid: 29662) has not moved for 32 sec (1339891595.1339891563)
: waiting for event 'ges remote message' for 32 secs with wait_id 2766541183.
===[ Wait Chain ]===
Wait chain is empty.
*** 2012-06-17 02:06:36.618
==============================
LMS0 (ospid: 29670) has not moved for 33 sec (1339891595.1339891562)
: waiting for event 'gcs remote message' for 30 secs with wait_id 2748989442.
===[ Wait Chain ]===
Wait chain is empty.
*** 2012-06-17 02:06:36.618
==============================
LMS1 (ospid: 29674) has not moved for 34 sec (1339891595.1339891561)
: waiting for event 'gcs remote message' for 29 secs with wait_id 2722779551.
===[ Wait Chain ]===
Wait chain is empty.
*** 2012-06-17 02:06:36.618
==============================
LCK0 (ospid: 29745) has not moved for 31 sec (1339891595.1339891564)
: waiting for event 'rdbms ipc message' for 31 secs with wait_id 128751834.
===[ Wait Chain ]===
Wait chain is empty.
The same sort of output will be repeated for each BG process every 20 seconds until we exceed the thresholds:
*** 2012-06-17 02:07:31.681
==============================
LMON (ospid: 29660) has not moved for 90 sec (1339891650.1339891560)
==================================================
=== LMON (ospid: 29660) Heartbeat Report
==================================================
LMON (ospid: 29660) has no heartbeats for 90 sec. (threshold 70 sec)
.
.
.
*** 2012-06-17 02:07:39.856
==============================
LMON (ospid: 29660) has not moved for 99 sec (1339891659.1339891560)
*** 2012-06-17 02:07:39.857
LMHB (ospid: 29680): terminating the instance due to error 29770
*** 2012-06-17 02:07:42.235
ksuitm: waiting up to [5] seconds before killing DIAG(29644)
Note that in the trace output, the wait_id reported for each BG process does not change throughout the LMHB trace file.
Hence all of the LMON 'waiting for event' reports in this trace reflect the same wait_id (3381334669 in this example).
Cause
A review of the Alert log shows that the previous instance startup took place on October 12 2011.
The October 12th restart was also due to a similar instance termination by the LMHB process.
Prior to October 12th, the previous instance startup was February 7th 2011.
Calculating the number of days between crashes, we see that they took place roughly once every 248 days.
07-FEB-2011 to 12-OCT-2011 is 248 days
12-OCT-2011 to 17-JUN-2012 is 250 days
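The note's day counts are rounded; exact calendar subtraction gives 247 and 249 days, both consistent with a threshold of roughly 248 and a half days of uptime. The check is a one-liner:

```python
from datetime import date

# Crash dates taken from the MOS note above.
crashes = [date(2011, 2, 7), date(2011, 10, 12), date(2012, 6, 17)]

# Days elapsed between consecutive crashes.
intervals = [(later - earlier).days for earlier, later in zip(crashes, crashes[1:])]
print(intervals)  # [247, 249]
```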
The reported symptoms match Solaris SPARC specific bug 10194190. Please refer to the following article for additional information on this issue:
Document 10194190.8 Bug 10194190 - Solaris: Process spin / ASM and DB crash if RAC instance up for > 248 days
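The ~248-day figure is characteristic of a signed 32-bit tick counter wrapping at 100 ticks per second (the Solaris lbolt-style clock tick). This is a hedged reading of the bug's public description, not something the note itself states, but the arithmetic lines up:

```python
# Assumption: a signed 32-bit counter of 10 ms clock ticks overflows at 2^31.
ticks_to_overflow = 2 ** 31
ticks_per_second = 100
seconds_per_day = 86400

days_to_overflow = ticks_to_overflow / ticks_per_second / seconds_per_day
print(f"{days_to_overflow:.2f} days")  # 248.55 days
```

That matches both the "around 248 days" in the bug title and the 247- and 249-day crash intervals observed above.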
Solution
The bug is fixed in 11.2.0.4 and 12.1.0.2.
For other releases, check the link to Interim patches in Doc ID 10194190.8 and if available, apply the patch for bug 10194190.
Alternatively, schedule instance restarts to occur before 248 days of instance uptime.