ORA-600 [KFDAUDEALLOC2] AND INSTANCE CRASH EVEN WITH THE FIX OF BUG 14467061 (Doc ID 1903273.1)

This document is being delivered to you via Oracle Support's Rapid Visibility (RaV) process and therefore has not been subject to an independent technical review.

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.3 to 12.1.0.1 [Release 11.2 to 12.1]
Information in this document applies to any platform.

SYMPTOMS

The customer hit ORA-600 [kfdAuDealloc2] and the ASM instance crashed even though the fix for bug 14467061 was already applied.

Mon Jun 02 09:39:25 2014
Errors in file /odb1/asm/diag/asm/+asm/+ASM3/trace/+ASM3_ora_2283.trc (incident=561):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [187], [603], [28], [], [], [], [], [], [], [], []
Incident details in: /odb1/asm/diag/asm/+asm/+ASM3/incident/incdir_561/+ASM3_ora_2283_i561.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ERROR: An unrecoverable error has been identified in ASM metadata. The instance will be taken down.
Mon Jun 02 09:39:41 2014
NOTE: AMDU dump of disk group DG2SVC created at /odb1/asm/diag/asm/+asm/+ASM3/incident/incdir_561
NOTE: starting check of diskgroup DG2SVC
ERROR: file +dg2svc.603.849027701: F603 PX3 => D0 A1896 => F3 PX126: fnum mismatch
ERROR: file +dg2svc.603.849027701: F603 PX4 => D0 A1983 => F3 PX127: fnum mismatch
ERROR: file +dg2svc.603.849027701: F603 PX5 => D0 A2214 => F1861 PX221: fnum mismatch
ERROR: file +dg2svc.603.849027701: F603 PX6 => D0 A2228 => F1861 PX222: fnum mismatch
....
ERROR: disk DG2SVC_DISK1, AT 4: D0 A2214 => F1861 X221: extent not mapped
ERROR: disk DG2SVC_DISK1, AT 4: D0 A2216 => F1861 X232: extent not mapped
ERROR: disk DG2SVC_DISK1, AT 4: D0 A2228 => F1861 X222: extent not mapped
ERROR: disk DG2SVC_DISK1, asz 0, AT 14: AT full, FS avail
NOTE: disk DG2SVC_DISK1, used AU total mismatch: DD={52750, 0} AT={53279, 0}
ERROR: check of diskgroup DG2SVC found 51 total errors
ORA-15049: diskgroup "DG2SVC" contains 51 error(s)
Mon Jun 02 09:39:44 2014
Dumping diagnostic data in directory=[cdmp_20140602093944], requested by (instance=3, osid=2283), summary=[incident=561].
Errors in file /odb1/asm/diag/asm/+asm/+ASM3/trace/+ASM3_ora_2283.trc (incident=562):
ORA-00600: internal error code, arguments: [17090], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /odb1/asm/diag/asm/+asm/+ASM3/incident/incdir_562/+ASM3_ora_2283_i562.trc
Dumping diagnostic data in directory=[cdmp_20140602093945], requested by (instance=3, osid=2283), summary=[incident=562].
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
Errors in file /odb1/asm/diag/asm/+asm/+ASM3/trace/+ASM3_ora_2283.trc (incident=563):
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [187], [603], [28], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [17090], [], [], [], [], [], [], [], [], [], [], []
Incident details in: /odb1/asm/diag/asm/+asm/+ASM3/incident/incdir_563/+ASM3_ora_2283_i563.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.
ERROR: An unrecoverable error has been identified in ASM metadata. The instance will be taken down.
....

This fatal assert brought down the ASM instances on all nodes. After the ASM instances restarted, the problem diskgroup was auto-mounted. The RBAL process then detected the corruption again during COD (Continuing Operations Directory) recovery rollback, which triggered the fatal assert and crashed the ASM instances again. This cycle repeated many times on all nodes until the customer manually dropped the problem diskgroup.
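When triaging a crash loop like this, it can help to summarize the alert log before reading the trace files. The following is a minimal sketch in Python, assuming the alert log uses the same line formats as the excerpt above. The names `summarize_alert_log`, `ORA600_RE`, `CHECK_RE` and `SAMPLE` are hypothetical helpers for illustration, not part of any Oracle tool.

```python
import re
from collections import Counter

# Matches the ORA-00600 line format shown in the alert log excerpt and
# captures the first argument (for example kfdAuDealloc2 or 17090).
ORA600_RE = re.compile(r"ORA-00600: internal error code, arguments: \[([^\]]*)\]")

# Matches the summary line written at the end of the diskgroup check.
CHECK_RE = re.compile(r"ERROR: check of diskgroup (\S+) found (\d+) total errors")

# A few lines copied from the excerpt above, used only as sample input.
SAMPLE = """\
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [187], [603], [28], [], [], [], [], [], [], [], []
ERROR: check of diskgroup DG2SVC found 51 total errors
ORA-00600: internal error code, arguments: [17090], [], [], [], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [187], [603], [28], [], [], [], [], [], [], [], []
ORA-00600: internal error code, arguments: [17090], [], [], [], [], [], [], [], [], [], [], []
"""


def summarize_alert_log(text):
    """Return (ORA-00600 counts by first argument, check error totals by diskgroup)."""
    ora600 = Counter()
    check_errors = {}
    for line in text.splitlines():
        match = ORA600_RE.search(line)
        if match:
            ora600[match.group(1)] += 1
            continue
        match = CHECK_RE.search(line)
        if match:
            check_errors[match.group(1)] = int(match.group(2))
    return ora600, check_errors


if __name__ == "__main__":
    counts, totals = summarize_alert_log(SAMPLE)
    print(dict(counts), totals)
```

A repeated [kfdAuDealloc2] count against a single diskgroup name in the check summary is consistent with the crash cycle described above, where only one diskgroup was corrupted.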

CAUSE

The fix for bug 14467061 was already in place, so the corruption was not caused by that bug or its related bugs.

The corruption was found to be caused by lost writes in the storage layer.

However, the corruption affected only one diskgroup, so the expected behavior is that only the problem diskgroup is dismounted; the ASM instances should NOT have crashed.

SOLUTION

Some specific types of diskgroup corruption can trigger fatal asserts that crash the ASM instances on all nodes. This makes all ASM instances, and therefore the other healthy diskgroups, unusable.

The following bug fix prevents ASM instances from crashing AFTER diskgroup corruption is detected and the fatal assert is hit. With this bug fix, only the corrupted diskgroups are forcibly dismounted, so the ASM instances stay online to service the other healthy diskgroups.

Bug 11814376 - FORCE DISMOUNT AFFECTED DISKGROUP ON METADATA CORRUPTION INSTEAD OF CRASHING

A backport patch can be requested for 11.2.0.3 and above. The bug is fixed in 12.1.0.2 and 12.2.
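The version statement above can be read as a three-way classification. The following Python sketch encodes that reading under the assumption that any release at or above 12.1.0.2 includes the fix and that releases from 11.2.0.3 up to (but not including) 12.1.0.2 need a backport. The function names and the returned strings are hypothetical; confirm the actual patch status of a given home with Oracle Support.

```python
def parse_version(version):
    """Turn a dotted version string such as '11.2.0.4' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def fix_status(version):
    """Classify a release against the versions named in this note.

    Assumption: releases at or above 12.1.0.2 carry the fix for
    bug 11814376; 11.2.0.3 up to 12.1.0.1 need a backport request.
    """
    parsed = parse_version(version)
    if parsed >= (12, 1, 0, 2):
        return "fix included"
    if parsed >= (11, 2, 0, 3):
        return "backport can be requested"
    return "not covered by this note"


if __name__ == "__main__":
    for release in ("11.2.0.2", "11.2.0.3", "12.1.0.1", "12.1.0.2", "12.2"):
        print(release, "->", fix_status(release))
```

Tuples are compared element by element, so a short version such as "12.2" still sorts after "12.1.0.2".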

Please note that this bug does NOT cause the diskgroup corruption. Diskgroup corruption is usually the result of lost writes in the OS/storage layer. This bug fix helps reduce the impact of some diskgroup corruptions and makes recovery more robust when this type of corruption is encountered.

Workaround:

Manually dismounting the problem diskgroup on all nodes, in SQL*Plus or ASMCMD, stops the diskgroup from being auto-mounted when the ASM instances restart. This avoids hitting the fatal assert repeatedly and stabilizes the ASM instances.
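As an illustration of the workaround, the Python sketch below only builds the command text for each tool; it does not connect to ASM or run anything. It assumes the standard `ALTER DISKGROUP ... DISMOUNT` statement for SQL*Plus and the `umount` command for ASMCMD. The optional force variant is an assumption that may be needed if a normal dismount is refused; this note itself only says "manual dismount". The helper name `dismount_commands` is hypothetical.

```python
import re

# Conservative identifier check so the diskgroup name cannot carry
# extra SQL or shell text into the generated commands.
_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_$#]*$")


def dismount_commands(diskgroup, force=False):
    """Return the SQL*Plus and ASMCMD command text to dismount a diskgroup.

    The commands are returned as strings for review; they must be run
    on every node by an administrator (for example, connected as SYSASM).
    """
    if not _NAME_RE.match(diskgroup):
        raise ValueError("unexpected diskgroup name: %r" % diskgroup)
    sql = "ALTER DISKGROUP %s DISMOUNT%s;" % (diskgroup, " FORCE" if force else "")
    asmcmd = "asmcmd umount %s%s" % ("-f " if force else "", diskgroup)
    return {"sqlplus": sql, "asmcmd": asmcmd}


if __name__ == "__main__":
    # DG2SVC is the problem diskgroup from the alert log in this note.
    for tool, command in dismount_commands("DG2SVC").items():
        print(tool + ":", command)
```

Either command is enough on a given node; the point of the workaround is that the diskgroup is dismounted everywhere before the ASM instances are restarted again.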