Oracle Database - Enterprise Edition - Version 11.2.0.3 to 12.1.0.1 [Release 11.2 to 12.1]
Information in this document applies to any platform.
SYMPTOMS
A normal or high redundancy diskgroup is dismounted with the following WARNING messages.
//ASM alert.log
Mon Jul 01 09:10:47 2013
WARNING: Waited 15 secs for write IO to PST disk 1 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 6.
....
GMON dismounting group 6 at 72 for pid 44, osid 8782162
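To spot this pattern quickly in a long alert log, the messages can be counted per diskgroup and disk. The following is a minimal sketch only: the regular expressions are taken from the wording of the excerpt in this note (other releases may word the messages differently), and the function name summarize_pst_waits is hypothetical.

```python
import re
from collections import Counter

# Patterns follow the message wording shown in this note's excerpt (an assumption
# for any other release).
PST_WAIT = re.compile(
    r"WARNING: Waited (\d+) secs for write IO to PST disk (\d+) in group (\d+)"
)
DISMOUNT = re.compile(r"GMON dismounting group (\d+)")


def summarize_pst_waits(lines):
    """Count PST write-wait warnings per (group, disk) and list dismounted groups."""
    waits = Counter()
    dismounted = []
    for line in lines:
        m = PST_WAIT.search(line)
        if m:
            _secs, disk, group = m.groups()
            waits[(int(group), int(disk))] += 1
            continue
        m = DISMOUNT.search(line)
        if m:
            dismounted.append(int(m.group(1)))
    return waits, dismounted


# Sample input: the excerpt from this note, one message per line.
sample = """\
Mon Jul 01 09:10:47 2013
WARNING: Waited 15 secs for write IO to PST disk 1 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 6.
WARNING: Waited 15 secs for write IO to PST disk 4 in group 6.
GMON dismounting group 6 at 72 for pid 44, osid 8782162
""".splitlines()

waits, dismounted = summarize_pst_waits(sample)
for (group, disk), count in sorted(waits.items()):
    print(f"group {group} disk {disk}: {count} delayed PST writes")
print("dismounted groups:", dismounted)
```

For a real log, pass the open file object instead of the sample list, since the function only iterates over lines.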
CAUSE
These messages generally appear in the ASM alert log in the following situation:
ASM PST heartbeats to the ASM disks in a normal or high redundancy diskgroup are delayed, so the ASM instance dismounts the diskgroup. By default, the heartbeat I/O wait limit is 15 seconds.
Heartbeat delays are largely ignored for external redundancy diskgroups. The ASM instance stops issuing further PST heartbeats until PST revalidation succeeds, but the heartbeat delays do not directly dismount an external redundancy diskgroup.
An ASM disk can become unresponsive, typically in the following scenarios:
+ Some of the physical paths of the multipath device are offline or lost
+ During path 'failover' in a multipath setup
+ Server load, or any kind of storage/multipath/OS maintenance
Doc ID 10109915.8 describes Bug 10109915 (the fix for this bug introduced the underscore parameter _asm_hbeatiowait). The issue in that bug is the absence of an OS/storage tunable timeout mechanism in the case of a hung NFS server/filer; _asm_hbeatiowait helps by setting the timeout.
SOLUTION
1] Check with the OS and storage administrators whether there is disk unresponsiveness.
2] If possible, keep disk unresponsiveness below 15 seconds. This depends on various factors, such as:
+ Operating system
+ Presence of multipath (and multipath type)
+ Any kernel parameters
So you need to find out the maximum possible disk unresponsiveness for your setup.
For example, on AIX the rw_timeout setting affects this and defaults to 30 seconds.
Another example is Linux with native multipathing. In such a setup, the number of physical paths and the polling_interval value in the multipath.conf file dictate the maximum disk unresponsiveness.
This value has to be determined for your particular combination of OS / multipath / storage.
3] If you cannot keep disk unresponsiveness below 15 seconds, then the following parameter can be set in the ASM instance (on all nodes of the RAC):
_asm_hbeatiowait
As per internal bug 17274537, based on internal testing the value should be increased to 120 seconds; this will be fixed in 12.2.
Run the following in the ASM instance to set the desired value for _asm_hbeatiowait:
alter system set "_asm_hbeatiowait"=<value> scope=spfile sid='*';
Then restart the ASM instance / CRS for the new parameter value to take effect.
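The decision in steps 2 and 3 can be sketched as a small helper: if the worst-case disk unresponsiveness for the setup stays below the 15-second default, nothing is changed; otherwise the ALTER SYSTEM text is produced with a chosen value. This is only an illustration. The function name suggest_statement and its arguments are hypothetical, the default of 120 is the figure this note cites from bug 17274537, and the helper does not connect to ASM or check which values the parameter accepts.

```python
DEFAULT_HBEAT_WAIT_SECS = 15   # default PST heartbeat wait stated in this note
RECOMMENDED_WAIT_SECS = 120    # value this note cites from internal bug 17274537

# Statement text from step 3, with the value left as a placeholder.
STATEMENT = """alter system set "_asm_hbeatiowait"={value} scope=spfile sid='*';"""


def suggest_statement(max_unresponsive_secs, value=RECOMMENDED_WAIT_SECS):
    """Return the ALTER SYSTEM text if the setup can reach the default wait, else None."""
    if max_unresponsive_secs < DEFAULT_HBEAT_WAIT_SECS:
        return None  # unresponsiveness stays below 15 seconds: no change suggested
    return STATEMENT.format(value=value)


# AIX example from this note: rw_timeout defaults to 30 seconds.
print(suggest_statement(30))
# A setup whose worst case is 10 seconds (a made-up figure for illustration).
print(suggest_statement(10))
```

The worst-case figure passed in still has to come from the OS, multipath, and storage analysis described in step 2; the sketch only formats the outcome.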
REFERENCES
BUG:17043894 - DISKGROUP DISMOUNTS IF 2 OUT OF 8 PATHS LOST
BUG:10109915 - ASM HANGS IN HIGH REDUNDANCY CONFIG IF 1 OF 5 DISKS GOES OFFLINE
NOTE:1910315.1 - How to Create a Normal Redundancy Diskgroup Best Practices
[grid@racj1 ~]$ more asm.txt
*._asm_hbeatiowait=120
+ASM2.asm_diskgroups='ARCHDG','DATADG'#Manual Mount
+ASM1.asm_diskgroups='ARCHDG','DATADG'#Manual Mount
*.asm_diskstring='/dev/asmdisk/*'
*.asm_power_limit=1
*.diagnostic_dest='/oracle/app/grid'
*.instance_type='asm'
*.large_pool_size=12M
+ASM1.local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=10.62.xxx.xx2)(PORT=1521))'
+ASM2.local_listener='(ADDRESS=(PROTOCOL=TCP)(HOST=10.62.xxx.xx4)(PORT=1521))'
*.memory_max_target=2147483648
*.memory_target=2147483648
*.remote_login_passwordfile='EXCLUSIVE'
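After the restart, a parameter file dump like the one above can be checked for the new setting. The following is a minimal sketch that assumes the simple one-line "sid.parameter=value" layout shown in this listing; the function name parse_pfile is hypothetical, and trailing comments such as #Manual Mount are left inside the value rather than stripped.

```python
def parse_pfile(text):
    """Map each 'sid.parameter' key to its raw value string."""
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank lines and anything that is not an assignment
        key, value = line.split("=", 1)  # split on the first '=' only
        params[key.strip()] = value.strip()
    return params


# A few lines copied from the listing above.
sample = """\
*._asm_hbeatiowait=120
*.asm_diskstring='/dev/asmdisk/*'
*.asm_power_limit=1
*.instance_type='asm'
"""

params = parse_pfile(sample)
print(params.get("*._asm_hbeatiowait"))
```

A missing key returns None from params.get, which indicates that the parameter was not written to the file.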