Solaris SPARC: LMHB terminates the RAC instances (ORA-29770)

 

On a two-node Solaris RAC, the databases on both nodes suddenly went down and were then restarted.

 

From the ASM alert log:

 

Fri Apr 24 07:59:58 2015
Time drift detected. Please check VKTM trace file for more details.
Sat Apr 25 07:09:29 2015
LMD0 (ospid: 28470) waits for event 'ges remote message' for 173 secs.
LMS0 (ospid: 28472) waits for event 'gcs remote message' for 175 secs.
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_lmhb_28476.trc  (incident=376097):
ORA-29770: global enqueue process LMD0 (OSID 28470) is hung for more than 150 seconds
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_376097/+ASM1_lmhb_28476_i376097.trc
Errors in file /u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_lmhb_28476.trc  (incident=376098):
ORA-29770: global enqueue process LMS0 (OSID 28472) is hung for more than 150 seconds
Incident details in: /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_376098/+ASM1_lmhb_28476_i376098.trc
Sat Apr 25 07:09:40 2015
Sweep [inc][376098]: completed
Sweep [inc][376097]: completed
Sweep [inc2][376098]: completed
Sweep [inc2][376097]: completed
Sat Apr 25 07:09:40 2015
ERROR: Some process(s) is not making progress.
LMHB (ospid: 28476) is terminating the instance.
Please check LMHB trace file for more details.
Please also check the CPU load, I/O load and other system properties for anomalous behavior
ERROR: Some process(s) is not making progress.
LMHB (ospid: 28476): terminating the instance due to error 29770
Sat Apr 25 07:09:41 2015
ORA-1092 : opitsk aborting process
Sat Apr 25 07:09:41 2015
License high water mark = 19
Instance terminated by LMHB, pid = 28476
USER (ospid: 27314): terminating the instance
Instance terminated by USER, pid = 27314

 

 

Examining the trace file:

 

 

 /u01/app/grid/diag/asm/+asm/+ASM1/incident/incdir_376098/+ASM1_lmhb_28476_i376098.trc

*** 2015-04-25 07:09:29.355
==============================
LMD0 (ospid: 28470) has not moved for 173 sec (1429916968.1429916795)
kjfmGCR_HBCheckAll: LMD0 (ospid: 28470) has status 6
==================================================
=== LMD0 (ospid: 28470) Heartbeat Report
==================================================
LMD0 (ospid: 28470) has no heartbeats for 173 sec. (threshold 150 sec)
  : waiting for event 'ges remote message' for 173 secs with wait_id 252168469.
===[ Wait Chain ]===
Wait chain is empty.
==============================
Dumping PROCESS LMD0 (ospid: 28470) States
==============================
===[ System Load State ]===
  CPU Total 256 Core 32 Socket 2
  Load normal: Cur 2120 Highmark 327680 (8.28 1280.00)
===[ Latch State ]===
  Not in Latch Get

 

The database alert log:

 

Completed checkpoint up to RBA [0xb8f6.2.10], SCN: 10260692317532
Sat Apr 25 06:36:18 2015
Archived Log entry 72173 added for thread 1 sequence 47349 ID 0xee7d887 dest 1:
Sat Apr 25 06:38:36 2015
Incremental checkpoint up to RBA [0xb8f6.14a22f.0], current log tail at RBA [0xb8f6.163b8f.0]
Sat Apr 25 06:39:23 2015
Beginning log switch checkpoint up to RBA [0xb8f7.2.10], SCN: 10260692457839
Thread 1 advanced to log sequence 47351 (LGWR switch)
  Current log# 2 seq# 47351 mem# 0: +DATA/prod/onlinelog/group_2.459.838352237
  Current log# 2 seq# 47351 mem# 1: +DATA/prod/onlinelog/group_2.509.858593353
Sat Apr 25 06:39:30 2015
Archived Log entry 72174 added for thread 1 sequence 47350 ID 0xee7d887 dest 1:
Sat Apr 25 06:39:38 2015
Completed checkpoint up to RBA [0xb8f7.2.10], SCN: 10260692457839
Sat Apr 25 06:44:32 2015
Beginning log switch checkpoint up to RBA [0xb8f8.2.10], SCN: 10260692664815
Thread 1 advanced to log sequence 47352 (LGWR switch)
  Current log# 3 seq# 47352 mem# 0: +DATA/prod/onlinelog/group_3.436.838352239
  Current log# 3 seq# 47352 mem# 1: +DATA/prod/onlinelog/group_3.510.858593367
Sat Apr 25 06:44:40 2015
Archived Log entry 72175 added for thread 1 sequence 47351 ID 0xee7d887 dest 1:
Sat Apr 25 06:44:50 2015
Completed checkpoint up to RBA [0xb8f8.2.10], SCN: 10260692664815
Sat Apr 25 06:49:27 2015
opiodr aborting process unknown ospid (14579) as a result of ORA-609
Sat Apr 25 06:56:20 2015
opiodr aborting process unknown ospid (16234) as a result of ORA-609
Sat Apr 25 06:58:40 2015
Incremental checkpoint up to RBA [0xb8f8.1ab53b.0], current log tail at RBA [0xb8f8.1b02ca.0]
Sat Apr 25 07:06:08 2015
Beginning log switch checkpoint up to RBA [0xb8f9.2.10], SCN: 10260693117305
Thread 1 advanced to log sequence 47353 (LGWR switch)
  Current log# 1 seq# 47353 mem# 0: +DATA/prod/onlinelog/group_1.465.838352233
  Current log# 1 seq# 47353 mem# 1: +DATA/prod/onlinelog/group_1.508.858593231
Sat Apr 25 07:06:14 2015
Archived Log entry 72176 added for thread 1 sequence 47352 ID 0xee7d887 dest 1:
Sat Apr 25 07:09:40 2015
NOTE: ASMB terminating
Errors in file /u01/prod/db/diag/rdbms/prod/PROD1/trace/PROD1_asmb_9310.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 433 Serial number: 31
Errors in file /u01/prod/db/diag/rdbms/prod/PROD1/trace/PROD1_asmb_9310.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 433 Serial number: 31
ASMB (ospid: 9310): terminating the instance due to error 15064
Sat Apr 25 07:09:40 2015
System state dump requested by (instance=1, osid=9310 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/prod/db/diag/rdbms/prod/PROD1/trace/PROD1_diag_9168.trc
Sat Apr 25 07:09:40 2015
opiodr aborting process unknown ospid (26330) as a result of ORA-1092
Sat Apr 25 07:09:41 2015
ORA-1092 : opitsk aborting process
Sat Apr 25 07:09:41 2015
opiodr aborting process unknown ospid (17901) as a result of ORA-1092
Sat Apr 25 07:09:41 2015
ORA-1092 : opitsk aborting process
Sat Apr 25 07:09:41 2015
opiodr aborting process unknown ospid (24588) as a result of ORA-1092
Sat Apr 25 07:09:41 2015
ORA-1092 : opitsk aborting process
Sat Apr 25 07:09:41 2015
opiodr aborting process unknown ospid (19128) as a result of ORA-1092
Sat Apr 25 07:09:41 2015
opiodr aborting process unknown ospid (19144) as a result of ORA-1092
Sat Apr 25 07:09:41 2015
opiodr aborting process unknown ospid (17899) as a result of ORA-1092
Sat Apr 25 07:09:41 2015
ORA-1092 : opitsk aborting process
Sat Apr 25 07:09:41 2015
ORA-1092 : opitsk aborting process
Sat Apr 25 07:09:41 2015
ORA-1092 : opitsk aborting process
Sat Apr 25 07:09:41 2015
opiodr aborting process unknown ospid (29116) as a result of ORA-1092
Sat Apr 25 07:09:45 2015
ORA-1092 : opitsk aborting process
Sat Apr 25 07:09:45 2015
License high water mark = 1879
Instance terminated by ASMB, pid = 9310
USER (ospid: 27351): terminating the instance
Instance terminated by USER, pid = 27351
Sat Apr 25 07:11:58 2015
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0

 

Reference document (Oracle Support note):

Applies to:

Oracle Database - Enterprise Edition - Version 10.2.0.4 to 12.1.0.1 [Release 10.2 to 12.1]
Oracle Solaris on SPARC (64-bit)

Symptoms

Running Oracle GI/RAC 11gR2 (11.2.0.1) on a two-node SPARC (64-bit) cluster.

LMHB terminated one of the instances.

The following errors were reported in the Alert log:


   Errors in file /oracle/product/diag/rdbms///trace/_lmhb_29680.trc  (incident=144129):

   ORA-29770: global enqueue process LMON (OSID 29660) is hung for more than 70 seconds

   ERROR: Some process(s) is not making progress.

   LMHB (ospid: 29680) is terminating the instance.

   Please check LMHB trace file for more details.

   Please also check the CPU load, I/O load and other system properties for anomalous behavior

   ERROR: Some process(s) is not making progress.

   LMHB (ospid: 29680): terminating the instance due to error 29770

A review of the LMHB trace file shows that this issue impacts all of the Lock Management background (BG) processes.

For each LM BG process (LMON, LMD0, LMS0, LMS1 and LCK0 in this case), the trace file shows information similar to the following:

 

   *** 2012-06-17 02:06:31.607
    ==============================
    LMON (ospid: 29660) has not moved for 30 sec (1339891590.1339891560)
     : waiting for event 'rdbms ipc message' for 25 secs with wait_id 3381334669.
     ===[ Wait Chain ]===
     Wait chain is empty.

   *** 2012-06-17 02:06:36.617
    ==============================
    LMD0 (ospid: 29662) has not moved for 32 sec (1339891595.1339891563)
     : waiting for event 'ges remote message' for 32 secs with wait_id 2766541183.
     ===[ Wait Chain ]===
     Wait chain is empty.

   *** 2012-06-17 02:06:36.618
    ==============================
    LMS0 (ospid: 29670) has not moved for 33 sec (1339891595.1339891562)
     : waiting for event 'gcs remote message' for 30 secs with wait_id 2748989442.
     ===[ Wait Chain ]===
     Wait chain is empty.

   *** 2012-06-17 02:06:36.618
    ==============================
    LMS1 (ospid: 29674) has not moved for 34 sec (1339891595.1339891561)
     : waiting for event 'gcs remote message' for 29 secs with wait_id 2722779551.
     ===[ Wait Chain ]===
     Wait chain is empty.

   *** 2012-06-17 02:06:36.618
    ==============================
    LCK0 (ospid: 29745) has not moved for 31 sec (1339891595.1339891564)
     : waiting for event 'rdbms ipc message' for 31 secs with wait_id 128751834.
     ===[ Wait Chain ]===
     Wait chain is empty.


The same sort of output is repeated for each BG process every 20 seconds until the thresholds are exceeded:

 

   *** 2012-06-17 02:07:31.681
    ==============================
    LMON (ospid: 29660) has not moved for 90 sec (1339891650.1339891560)
    ==================================================
    === LMON (ospid: 29660) Heartbeat Report
    ==================================================
    LMON (ospid: 29660) has no heartbeats for 90 sec. (threshold 70 sec)
   .
   .
   .
   *** 2012-06-17 02:07:39.856
    ==============================
    LMON (ospid: 29660) has not moved for 99 sec (1339891659.1339891560)

   *** 2012-06-17 02:07:39.857
    LMHB (ospid: 29680): terminating the instance due to error 29770

   *** 2012-06-17 02:07:42.235
   ksuitm: waiting up to [5] seconds before killing DIAG(29644)

 

Note that in the trace output, the wait_id associated with each of the BG processes does not change throughout the LMHB trace file.

Hence all LMON 'waiting for event' reports in the trace file carry the same wait_id (3381334669 in this example).
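A quick way to confirm this from the trace is to collect the (event, wait_id) pairs reported for each background process and count the distinct values per process. The sketch below assumes the trace lines keep the format shown above; the trace path is only an example (taken from the first case in this post) and should be adjusted to your own file.

import re
from collections import defaultdict

# Minimal sketch: scan an LMHB trace for the "has not moved" / "waiting for event
# ... wait_id" lines shown above and report, per background process, how many
# distinct (event, wait_id) pairs appear. A single pair for the whole trace means
# the process never completed that wait -- the pattern described in this note.
TRACE = "/u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_lmhb_28476.trc"  # example path

proc_re = re.compile(r"^(LMON|LMD\d|LMS\d|LCK\d) \(ospid: \d+\) has not moved")
wait_re = re.compile(r"waiting for event '([^']+)' for \d+ secs with wait_id (\d+)")

wait_ids = defaultdict(set)
current_proc = None

with open(TRACE) as f:
    for line in f:
        m = proc_re.match(line.strip())
        if m:
            current_proc = m.group(1)
            continue
        m = wait_re.search(line)
        if m and current_proc:
            wait_ids[current_proc].add(m.groups())   # (event, wait_id)

for proc, pairs in sorted(wait_ids.items()):
    status = "stuck on a single wait" if len(pairs) == 1 else "wait_id changed"
    print(f"{proc}: {len(pairs)} distinct (event, wait_id) pair(s) -> {status}")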

Cause

A review of the Alert log shows that the previous instance startup took place on October 12 2011.

The October 12th restart was also due to a similar instance termination by the LMHB process.

Prior to October 12th, the previous instance startup was February 7th 2011.
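The startup history itself can be pulled straight from the alert log rather than paging through it by hand. A minimal sketch, assuming the 11.2 plain-text alert log format shown earlier (a timestamp line followed by the message); the path is illustrative:

import re

# Print each "Starting ORACLE instance" message together with the timestamp line
# that precedes it in the alert log, giving the instance startup history.
ALERT_LOG = "/u01/prod/db/diag/rdbms/prod/PROD1/trace/alert_PROD1.log"  # example path

# e.g. "Sat Apr 25 07:11:58 2015"
ts_re = re.compile(r"^\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}$")

last_ts = None
with open(ALERT_LOG) as f:
    for line in f:
        line = line.rstrip()
        if ts_re.match(line):
            last_ts = line
        elif line.startswith("Starting ORACLE instance"):
            print(f"{last_ts}  {line}")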

Calculating the number of days between crashes, we see that they took place roughly once every 248 days:

   07-FEB-2011 to 12-OCT-2011 is 248 days

   12-OCT-2011 to 17-JUN-2012 is 250 days
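As a quick arithmetic check of those intervals (dates taken from the note; the exact count differs by a day or so depending on how the endpoints and times of day are treated):

from datetime import date

startups = [date(2011, 2, 7), date(2011, 10, 12), date(2012, 6, 17)]

for earlier, later in zip(startups, startups[1:]):
    print(f"{earlier} -> {later}: {(later - earlier).days} days")

# For reference: a 32-bit counter of 10 ms (centisecond) ticks wraps after
# 2**31 / 100 seconds, i.e. about 248.55 days. This is an assumption about why
# the ~248-day pattern appears; the note itself only points to bug 10194190.
print(f"2**31 centiseconds = {2**31 / 100 / 86400:.2f} days")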

The reported symptoms match Solaris SPARC-specific bug 10194190. Please refer to the following article for additional information on this issue:

      Document 10194190.8 Bug 10194190 - Solaris: Process spin / ASM and DB crash if RAC instance up for > 248 days
 

Solution

The bug is fixed in 11.2.0.4 and 12.1.0.2.

For other releases, check the link to interim patches in Doc ID 10194190.8 and, if available, apply the patch for bug 10194190.

Alternatively, schedule instance restarts to occur before 248 days of instance uptime.
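If you opt for scheduled restarts, a simple uptime check makes it easy to see which instances are getting close to the boundary. A sketch, assuming the cx_Oracle driver; the user, password and connect string below are placeholders, not from this case:

import datetime
import cx_Oracle  # assumes the cx_Oracle driver is installed

UPTIME_LIMIT_DAYS = 240   # illustrative margin: restart comfortably before 248 days

conn = cx_Oracle.connect("system", "password", "node1:1521/PROD1")  # placeholders
cur = conn.cursor()
cur.execute("SELECT instance_name, startup_time FROM v$instance")
instance_name, startup_time = cur.fetchone()

uptime_days = (datetime.datetime.now() - startup_time).days
print(f"{instance_name}: up {uptime_days} days (started {startup_time})")
if uptime_days >= UPTIME_LIMIT_DAYS:
    print("WARNING: approaching the 248-day uptime described in Doc ID 10194190.8; "
          "schedule a rolling restart.")

cur.close()
conn.close()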
