CRS-2765: Resource 'ora.crsd' has failed on server


The crsd process could not start: the voting disks were inaccessible, the write I/O timed out after 15 seconds, and the diskgroup was dismounted.

ocssd.log
>>>>>>>>>>>>>>>>>>>>>
2016-01-15 10:33:11.620: [ CSSD][1042200320]clssnmSendingThread: sent 5 status msgs to all nodes
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15550 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvWorkerThread not scheduled for 15550 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15550 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvWorkerThread not scheduled for 15540 msecs
2016-01-15 10:33:15.621: [ CSSD][1069954816]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 15550 msecs

ASM alert.log
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

Fri Jan 15 10:33:20 2016
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 1 in group 2.
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Reconfiguration complete
Fri Jan 15 10:33:20 2016
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 1 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 20 secs for write IO to PST disk 1 in group 2.
Fri Jan 15 10:33:21 2016
NOTE: process _b000_+asm1 (11776) initiating offline of disk 0.3915952267 (OCR_0000) with mask 0x7e in group 2
NOTE: process _b000_+asm1 (11776) initiating offline of disk 1.3915952269 (OCR_0001) with mask 0x7e in group 2
NOTE: checking PST: grp = 2
GMON checking disk modes for group 2 at 9 for pid 34, osid 11776
ERROR: no read quorum in group: required 2, found 1 disks
NOTE: checking PST for grp 2 done.
NOTE: initiating PST update: grp = 2, dsk = 0/0xe968b08b, mask = 0x6a, op = clear
NOTE: initiating PST update: grp = 2, dsk = 1/0xe968b08d, mask = 0x6a, op = clear
GMON updating disk modes for group 2 at 10 for pid 34, osid 11776
ERROR: no read quorum in group: required 2, found 1 disks
Fri Jan 15 10:33:21 2016
NOTE: cache dismounting (not clean) group 2/0x22584077 (OCR)
NOTE: messaging CKPT to quiesce pins Unix process pid: 11785, image: oracle@cntl119 (B001)
Fri Jan 15 10:33:21 2016
NOTE: halting all I/Os to diskgroup 2 (OCR)
Fri Jan 15 10:33:21 2016
NOTE: LGWR doing non-clean dismount of group 2 (OCR)
NOTE: LGWR sync ABA=4.55 last written ABA 4.55
WARNING: Offline for disk OCR_0000 in mode 0x7f failed.
WARNING: Offline for disk OCR_0001 in mode 0x7f failed.
Fri Jan 15 10:33:21 2016
kjbdomdet send to inst 2
detach from dom 2, sending detach message to inst 2
Fri Jan 15 10:33:21 2016
ERROR: ORA-15130 in COD recovery for diskgroup 2/0x22584077 (OCR)
ERROR: ORA-15130 thrown in RBAL for group number 2
Errors in file /grid/base/diag/asm/+asm/+ASM1/trace/+ASM1_rbal_4292.trc:
ORA-15130: diskgroup "OCR" is being dismounted
Fri Jan 15 10:33:21 2016
NOTE: No asm libraries found in the system
ASM Health Checker found 1 new failures
Fri Jan 15 10:33:21 2016
List of instances:
 1 2
Dirty detach reconfiguration started (new ddet inc 1, cluster inc 4)
 Global Resource Directory partially frozen for dirty detach
* dirty detach - domain 2 invalid = TRUE
 130 GCS resources traversed, 0 cancelled
Dirty Detach Reconfiguration complete
Fri Jan 15 10:33:21 2016
WARNING: dirty detached from domain 2
NOTE: cache dismounted group 2/0x22584077 (OCR)
NOTE: cache deleting context for group OCR 2/0x22584077
SQL> alter diskgroup OCR dismount force /* ASM SERVER:576209015 */    >>>>>>>>>>>>>>>>>>>>>>>> CRS force-dismounts the diskgroup
GMON dismounting group 2 at 11 for pid 36, osid 11785
NOTE: Disk OCR_0000 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_0001 in mode 0x7f marked for de-assignment
NOTE: Disk OCR_0002 in mode 0x7f marked for de-assignment


APPLIES TO:

Oracle Database - Standard Edition - Version 11.2.0.1 to 11.2.0.4 [Release 11.2]
Managed Cloud Services Problem Resolution - Version N/A to N/A
Information in this document applies to any platform.

SYMPTOMS

The CRSD process (crsd.bin) aborted due to the OCR location being inaccessible (here +OCR_VOTE).

• <GI_HOME>/log/<node>/alert<node>.log shows the following messages:

2013-11-07 06:50:58.319: 
[crsd(14307)] CRS-1006:The OCR location +OCR_VOTE is inaccessible. Details in /app/11.2.0/grid_1/log/rac1/crsd/crsd.log.
2013-11-07 06:51:08.465: 
[ohasd(12810)]CRS-2765:Resource 'ora.crsd' has failed on server 'rac1'.
2013-11-07 06:51:09.907: 
[crsd(31845)] CRS-0804:Cluster Ready Service aborted due to Oracle Cluster Registry error [PROC-26: Error while accessing the physical storage
]. Details at (:CRSD00111:) in /app/11.2.0/grid_1/log/rac1/crsd/crsd.log.

• The log of the grid user agent (<GI_HOME>/log/<node>/agent/crsd/oraagent_grid/oraagent_grid.log) shows that it received a message to stop the ora.OCR_VOTE.dg resource:

2013-11-07 06:37:28.744: [ AGFW][1112471872]{1:58982:16178} Agent received the message: RESOURCE_STOP[ora.OCR_VOTE.dg rac1 1] ID 4099:346136
2013-11-07 06:37:28.744: [ AGFW][1112471872]{1:58982:16178} Preparing STOP command for: ora.OCR_VOTE.dg rac1 1
2013-11-07 06:37:28.745: [ AGFW][1112471872]{1:58982:16178} ora.OCR_VOTE.dg rac1 1 state changed from: ONLINE to: STOPPING
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] (:CLSN00108:) clsn_agent::stop {
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] DgpAgent::stop: enter { 
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] getResAttrib: attrib name USR_ORA_OPI value true len 4
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] Agent::flagUsrOraOpiIsSet(true) reason not dependency
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] DgpAgent::stop: tha exit }
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] DgpAgent::stopSingle status:2 }
2013-11-07 06:37:28.747: [ora.OCR_VOTE.dg][1101465920]{1:58982:16178} [stop] (:CLSN00108:) clsn_agent::stop }
2013-11-07 06:37:28.747: [ AGFW][1101465920]{1:58982:16178} Command: stop for resource: ora.OCR_VOTE.dg rac1 1 completed with status: SUCCESS

• The CSS daemon log (<GI_HOME>/log/<node>/cssd/ocssd.log) shows that several CSS threads have not been scheduled for several seconds:

2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7760 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvWorkerThread not scheduled for 7830 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 7830 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvDiskPingThread not scheduled for 8040 msecs
2013-11-07 06:36:19.281: [ CSSD][1078569280]clssscMonitorThreads clssnmvWorkerThread not scheduled for 8040 msecs
..
2013-11-07 06:37:11.488: [ CSSD][1118316864]clssscMonitorThreads clssnmvWorkerThread not scheduled for 60030 msecs
2013-11-07 06:37:11.488: [ CSSD][1118316864]clssscMonitorThreads clssnmvWorkerThread not scheduled for 59620 msecs
2013-11-07 06:37:11.488: [ CSSD][1118316864]clssscMonitorThreads clssnmvWorkerThread not scheduled for 60240 msecs

• The ASM alert log shows that the OCR diskgroup was dismounted after an IO issue:

Thu Nov 07 06:36:28 2013
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 1 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 2 in group 1.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 2.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
WARNING: Waited 15 secs for write IO to PST disk 0 in group 3.
Thu Nov 07 06:36:28 2013
NOTE: process _b000_+asm1 (21285) initiating offline of disk 0.3915951409 (OCR_VOTE01) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (21285) initiating offline of disk 1.3915951410 (OCR_VOTE02) with mask 0x7e in group 1
NOTE: process _b000_+asm1 (21285) initiating offline of disk 2.3915951411 (OCR_VOTE03) with mask 0x7e in group 1

Thu Nov 07 06:36:28 2013
WARNING: dirty detached from domain 1
NOTE: cache dismounted group 1/0xDE385DDC (OCR_VOTE) 
SQL> alter diskgroup OCR_VOTE dismount force /* ASM SERVER:3728235996 */

Thu Nov 07 06:37:28 2013
SUCCESS: diskgroup OCR_VOTE was dismounted
SUCCESS: alter diskgroup OCR_VOTE dismount force /* ASM SERVER:3728235996 */
SUCCESS: ASM-initiated MANDATORY DISMOUNT of group OCR_VOTE

• Running crsctl shows that the CRSD process (crsd.bin) is not running:

# crsctl stat res -t
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4000: Command Status failed, or completed with errors.
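
Once the storage responds again, the stack can usually be recovered without a node reboot; a minimal sketch (11.2 syntax; the diskgroup name OCR_VOTE follows this example and may differ in your environment):

# as root: check the lower-stack (ohasd-managed) resources
crsctl stat res -t -init
# as the grid user: remount the dismounted diskgroup in ASM
sqlplus / as sysasm <<'EOF'
alter diskgroup OCR_VOTE mount;
EOF
# as root: restart crsd if it does not come back on its own
crsctl start res ora.crsd -init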

CAUSE

The issue was caused by a storage problem. The disks were not responding for approximately one minute, so the writes to the PST disks exceeded the 15-second threshold; ASM therefore dismounted the OCR diskgroup, which in turn caused the CRS daemon (crsd.bin) to abort.

• The OS messages log shows that there was a disk issue (the message may vary depending on the storage used):

Nov 7 06:37:11 rac1 kernel: qla2xxx 0000:0f:00.0: scsi(3:0:0): Abort command issued -- 1 c1e7a3 2002.

• OSW vmstat shows 10 processes waiting on IO (column b = 10):

zzz ***Thu Nov 7 06:36:18 CST 2013
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
b swpd free buff cache si so bi bo in cs us sy id wa st
10 0 50944832 776328 10809588 0 0 1 3 1 1 0 0 100 0 0
10 0 50936532 776328 10809628 0 0 0 244 1044 9054 1 1 97 1 0
10 0 50939424 776328 10809872 0 0 0 0 1024 6426 0 0 98 2 0

• In this particular case the ASM disks for the +OCR_VOTE diskgroup were mapped to 3 partitions (/dev/sda1 - /dev/sda3) on a single disk device:

# ls -ltr /dev/oracleasm/disks/*
brw-rw---- 1 grid asmadmin 8, 1 Nov 18 05:04 /dev/oracleasm/disks/OCR_VOTE01 --- major is 8, minor is 1
brw-rw---- 1 grid asmadmin 8, 2 Nov 18 05:04 /dev/oracleasm/disks/OCR_VOTE02 --- major is 8, minor is 2
brw-rw---- 1 grid asmadmin 8, 3 Nov 18 05:04 /dev/oracleasm/disks/OCR_VOTE03 --- major is 8, minor is 3
brw-rw---- 1 grid asmadmin 8, 5 Nov 18 05:04 /dev/oracleasm/disks/ASM_DATA
brw-rw---- 1 grid asmadmin 8, 6 Nov 18 05:04 /dev/oracleasm/disks/ASM_BACKUP

# ls -ltr /dev/sda*
brw-r----- 1 root disk 8, 0 Nov 18 05:03 /dev/sda
brw-r----- 1 root disk 8, 1 Nov 18 05:04 /dev/sda1 --- major is 8, minor is 1, used for OCR_VOTE01
brw-r----- 1 root disk 8, 2 Nov 18 05:04 /dev/sda2 --- major is 8, minor is 2, used for OCR_VOTE02
brw-r----- 1 root disk 8, 3 Nov 18 05:04 /dev/sda3 --- major is 8, minor is 3, used for OCR_VOTE03
brw-r----- 1 root disk 8, 4 Nov 18 05:04 /dev/sda4
brw-r----- 1 root disk 8, 5 Nov 18 05:04 /dev/sda5
brw-r----- 1 root disk 8, 6 Nov 18 05:04 /dev/sda6 
brw-r----- 1 root disk 8, 7 Nov 18 05:04 /dev/sda7
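
The same major/minor mapping can be cross-checked from ASMLib itself (ASMLib is in use here; the exact output wording varies by version):

# run as root; reports the major,minor numbers of the backing device
/usr/sbin/oracleasm querydisk -d OCR_VOTE01
# expected to report device [8,1], i.e. /dev/sda1 as listed above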

OSW iostat shows that devices /dev/sda1, /dev/sda2 and /dev/sda3 were abnormal for approximately 1 minute.
Starting at 06:36:18 there were many IO requests in the request queue (avgqu-sz is 16) and %util was 100%,
yet disk throughput was zero (rkB/s = 0 and wkB/s = 0); together these indicate that device saturation occurred.
Disk IO recovered at 06:37:19.

zzz ***Thu Nov 7 06:35:17 CST 2013
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 9.00 10.00 39.00 53.50 9.74 0.03 1.32 1.32 2.50
sda1 0.00 0.00 2.00 2.00 1.00 4.50 2.75 0.00 0.00 0.00 0.00
sda2 0.00 0.00 2.00 2.00 1.00 4.50 2.75 0.01 2.00 2.00 0.80
sda3 0.00 0.00 2.00 2.00 1.00 4.50 2.75 0.00 0.00 0.00 0.00
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 2.00 2.00 32.00 20.00 26.00 0.02 4.00 4.00 1.60
sda6 0.00 0.00 1.00 2.00 4.00 20.00 16.00 0.00 0.33 0.33 0.10
sda7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

zzz ***Thu Nov 7 06:36:18 CST 2013
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 16.02 0.00 0.00 100.10
sda1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.10
sda2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.10
sda3 0.00 0.00 0.00 0.00 0.00 0.00 0.00 3.00 0.00 0.00 100.10
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 0.00 0.00 100.10
sda6 0.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 0.00 100.10
sda7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

zzz ***Thu Nov 7 06:37:19 CST 2013
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 0.00 7.00 7.00 19.00 41.50 8.64 0.03 1.86 1.86 2.60
sda1 0.00 0.00 2.00 1.00 1.00 0.50 1.00 0.00 0.00 0.00 0.00
sda2 0.00 0.00 2.00 1.00 1.00 0.50 1.00 0.01 3.00 3.00 0.90
sda3 0.00 0.00 2.00 1.00 1.00 0.50 1.00 0.01 3.00 3.00 0.90
sda4 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sda5 0.00 0.00 1.00 2.00 16.00 20.00 24.00 0.01 2.67 2.67 0.80
sda6 0.00 0.00 0.00 2.00 0.00 20.00 20.00 0.00 0.00 0.00 0.00
sda7 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
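
The saturation signature (100% busy yet zero throughput) can be scanned for across the whole OSWatcher archive instead of reading each sample. A minimal sketch, assuming the iostat column layout shown above and a hypothetical archive path:

# flag samples where an sd* device is ~100% busy but moves no data
awk '/^zzz/ { ts = $0 }                                 # remember timestamp
     $1 ~ /^sd/ && $NF+0 >= 100 && $6+0 == 0 && $7+0 == 0 {
         print ts, "|", $0                              # saturated sample
     }' /path/to/oswatcher/archive/oswiostat/*.dat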

 

SOLUTION

Contact the system/storage administrator and possibly the storage vendor to determine why the disk (shared storage) was not responding for an extended amount of time.

In addition:

  • Avoid using multiple partitions of the same physical disk as ASM disks for the voting disks. Ideally each ASM disk should be on a different physical device, or even a different SAN if possible. This may not be possible when using LUNs, as they typically don't map to a single physical disk.
  • Make sure the OCR is mirrored onto a 2nd diskgroup so the CRSD process can survive the dismount of one of the diskgroups; a sketch follows this list.
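
A minimal sketch of registering a mirror OCR location (run as root; +OCR2 is a hypothetical second diskgroup that must already be mounted on all nodes):

# add a second OCR location in another diskgroup
<GI_HOME>/bin/ocrconfig -add +OCR2
# verify the registered OCR locations and their integrity
<GI_HOME>/bin/ocrcheck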

OTHER ISSUES

Generally, messages of this kind appear in the ASM alert log in the following situation:

The PST heartbeats to the ASM disks in a normal- or high-redundancy diskgroup are delayed beyond the heartbeat timeout, so the ASM instance dismounts the diskgroup. By default, the timeout is 15 seconds.

 

Note that the heartbeat delays are essentially tolerated for an external-redundancy diskgroup:
the ASM instance stops issuing further PST heartbeats until PST revalidation succeeds,
but the heartbeat delays do not dismount an external-redundancy diskgroup directly.
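
To see which diskgroups are exposed to this dismount behavior, their redundancy type can be checked from the ASM instance; a minimal sketch (run as the grid user):

sqlplus -S / as sysasm <<'EOF'
-- NORMAL/HIGH redundancy groups are subject to the PST heartbeat dismount;
-- EXTERNAL redundancy groups are not dismounted directly by the delays
select name, type, state from v$asm_diskgroup;
EOF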

The ASM disk could become unresponsive, typically in the following scenarios (a quick check for the first two is sketched after this list):

+    Some of the physical paths of the multipath device are offline or lost
+    During path failover in a multipath setup
+    Server load, or any sort of storage/multipath/OS maintenance
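
A quick check for the first two scenarios, assuming Linux device-mapper multipath as in this environment:

# list path states; 'failed' or 'faulty' entries indicate lost paths
multipath -ll
# kernel-reported SCSI/HBA errors around the incident window
grep -iE 'scsi|qla2xxx|multipath' /var/log/messages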

 

Doc ID 10109915.8 describes Bug 10109915 (the fix for this bug introduced the underscore parameter below). The underlying problem is the lack of an OS/storage tunable timeout mechanism in the case of a hung NFS server/filer, and _asm_hbeatiowait helps by making the timeout settable.

 

SOLUTION

1]    Check with the OS and storage administrators whether there is disk unresponsiveness.

2]    If possible, keep the disk response time below 15 seconds.

This will depend on various factors such as:
+    the operating system
+    the presence of multipathing (and the multipath type)
+    relevant kernel parameters

So you need to find out the 'maximum' possible disk unresponsiveness for your setup.

For example, on AIX the rw_timeout setting affects this and defaults to 30 seconds.
                            >>>>>>>>>>>>>>>>>>>>>>>> on AIX rw_timeout defaults to 30 seconds; my environment is Linux
Another example is Linux with native multipathing. In such a setup, the number of physical paths and the polling_interval value in the multipath.conf file dictate the maximum disk unresponsiveness.

So you need to determine this value for your own setup (the combination of OS / multipath / storage); a sketch follows.
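
For Linux native multipath, a sketch for reading the two values that bound this (the non-interactive form works on recent multipath-tools; older releases use multipathd -k'show config'):

# effective daemon settings that bound path-failure queueing
multipathd show config | grep -E 'polling_interval|no_path_retry'
# worst-case unresponsiveness is roughly polling_interval * no_path_retry
# seconds, e.g. 5 * 12 = about 60 seconds, well above the 15s PST default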

3]    If you cannot keep the disk unresponsiveness below 15 seconds, then the parameter below can be set in the ASM instance (on all nodes of the RAC):

    _asm_hbeatiowait

Set it to 200.

                               >>>>>>>>>>>>>>>>>>>>>>>>>>> on Linux multipath the maximum unresponsiveness is roughly polling_interval * no_path_retry

 

Run the following in the ASM instance to set the desired value for _asm_hbeatiowait (200 here, as above):

alter system set "_asm_hbeatiowait"=200 scope=spfile sid='*';
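
Because scope=spfile is used, the new value only takes effect after the ASM instances are restarted. A sketch for verifying it afterwards (x$ views require a SYS-level connection; run as the grid user):

sqlplus -S / as sysasm <<'EOF'
-- current value of the hidden parameter on this instance
select x.ksppinm name, y.ksppstvl value
  from x$ksppi x, x$ksppcv y
 where x.indx = y.indx
   and x.ksppinm = '_asm_hbeatiowait';
EOF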

Since there is no way to fix the storage problem for the time being, increasing the heartbeat I/O wait is used as a temporary workaround. The root cause remains the storage path.




From the ITPUB blog: http://blog.itpub.net/17086096/viewspace-1978542/ (please cite the source when reposting).
