Oracle Database - Enterprise Edition - Version 11.2.0.3 and later
Information in this document applies to any platform.
A node is evicted from the cluster due to a network communication error. The GI alert log reports the following errors, and one node is evicted:

Node 1 GI Alert log
------------------------
CRS-1612:Network communication with node prodrac2(2) missing for 50% of timeout interval. Removal of this node from cluster in 29.240 seconds
..
CRS-1610:Network communication with node prodrac2 (2) missing for 90% of timeout interval. Removal of this node from cluster in 3.740 seconds
CRS-1607:Node utx2db02 is being evicted in cluster incarnation 278185525; details at (:CSSNM00007:) in /orabase1/app/11.2.0.3/grid_6/log/utx2db01/cssd/ocssd.log.

Node 2 GI Alert log
------------------------
CRS-1610:Network communication with node prodrac1 (1) missing for 90% of timeout interval. Removal of this node from cluster in 3.740 seconds
CRS-1609:This node is unable to communicate with other nodes in the cluster and is going down to preserve cluster integrity; details at (:CSSNM00008:) in /orabase1/app/11.2.0.3/grid_6/log/utx2db02/cssd/ocssd.log.
CRS-1656:The CSS daemon is terminating due to a fatal error; Details at (:CSSSC00012:) in /orabase1/app/11.2.0.3/grid_6/log/utx2db02/cssd/ocssd.log

The top output shows that the Cluster Health Monitor (CHM) daemon ologgerd is using high CPU and starts spinning before the reboot:

  PID USER PR NI VIRT  RES  SHR S  %CPU %MEM     TIME+ COMMAND
31439 root RT  0 375m 142m  58m S 161.8  0.1 359:42.49 /orabase1/app/11.2.0.3/grid_6/bin/ologgerd -m utx2db01 -r -d /orabase1/app/11.2.0.3/grid_

The kernel log also shows a page allocation failure for ologgerd, followed by its call trace:

Jan 14 03:16:16 utx2db02 kernel: Free swap = 13181784kB
Jan 14 03:16:21 utx2db02 kernel: Total swap = 25165816kB
Jan 14 03:16:30 utx2db02 kernel: ologgerd: page allocation failure. order:4, mode:0xd0
Jan 14 03:16:35 utx2db02 kernel: Pid: 31475, comm: ologgerd Not tainted 2.6.32-400.21.1.el5uek #1 <<<<<<<<<<
Jan 14 03:16:40 utx2db02 kernel: Call Trace:
Jan 14 03:16:44 utx2db02 kernel: [] __alloc_pages_nodemask+0x524/0x595
Jan 14 03:17:01 utx2db02 kernel: [] kmem_getpages+0x4f/0xf4
Jan 14 03:17:05 utx2db02 kernel: [] fallback_alloc+0x12e/0x1ce
Jan 14 03:17:06 utx2db02 kernel: [] ____cache_alloc_node+0x121/0x134
Jan 14 03:17:07 utx2db02 kernel: [] kmem_cache_alloc_node_notrace+0x84/0xb9
Jan 14 03:17:09 utx2db02 kernel: [] __kmalloc_node+0x46/0x73
Jan 14 03:17:13 utx2db02 kernel: [] ? __alloc_skb+0x72/0x13d
Jan 14 03:17:13 utx2db02 kernel: [] __alloc_skb+0x72/0x13d
Jan 14 03:17:15 utx2db02 kernel: [] sk_stream_alloc_skb+0x3d/0xaf
Jan 14 03:17:16 utx2db02 kernel: [] tcp_sendmsg+0x176/0x6cf
Jan 14 03:17:16 utx2db02 kernel: [] __sock_sendmsg+0x5e/0x67
Jan 14 03:17:18 utx2db02 kernel: [] sock_sendmsg+0xcc/0xe5
Jan 14 03:17:19 utx2db02 kernel: [] ? radix_tree_delete+0xf1/0x194
Jan 14 03:17:20 utx2db02 kernel: [] ? autoremove_wake_function+0x0/0x3d
Jan 14 03:17:21 utx2db02 kernel: [] ? security_sk_alloc+0x16/0x18
Jan 14 03:17:23 utx2db02 kernel: [] ? fget_light+0x58/0x73
Jan 14 03:17:25 utx2db02 kernel: [] ? sockfd_lookup_light+0x20/0x58
Jan 14 03:17:26 utx2db02 kernel: [] sys_sendto+0x12f/0x171
Jan 14 03:17:27 utx2db02 kernel: [] ? audit_syscall_entry+0x103/0x12f
Jan 14 03:17:31 utx2db02 kernel: [] system_call_fastpath+0x16/0x1b
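To check whether another node shows the same symptom pattern, the eviction-related messages can be filtered out of a GI alert log. The following is a minimal sketch, not an Oracle-provided tool: the function names are hypothetical, and the list of message codes is simply the set quoted in the alert logs of this note.

```shell
#!/bin/sh
# Hypothetical helpers: both read a GI alert log on stdin.
# The CRS codes below are only the ones quoted in this note.

# Print the eviction-related messages.
eviction_events() {
    grep -E 'CRS-(1607|1609|1610|1612|1656):'
}

# Print how many such messages the log contains.
# grep -c prints 0 but returns non-zero when nothing matches,
# so the exit status is masked with "|| true".
count_eviction_events() {
    grep -cE 'CRS-(1607|1609|1610|1612|1656):' || true
}
```

A typical call would be `eviction_events < /path/to/alert.log`, where the path is a placeholder for the alert log of the node being examined.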
None.
ologgerd uses high CPU and does a lot of I/O to the disk where the BDB (the Berkeley Database used by CHM) resides. This is due to Bug 13867435 - OLOGGERD USING A LOT OF RESOURCES.
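A quick way to confirm that ologgerd is the process consuming CPU is to filter a process listing. The sketch below is an illustration only: the function name and the default threshold of 50 are arbitrary assumptions. It expects lines in the form produced by `ps -eo pid,pcpu,comm` on Linux. Note that the `ps` figure is averaged over the lifetime of the process, so it will not match the instantaneous value that `top` reports.

```shell
#!/bin/sh
# Hypothetical helper: read "PID %CPU COMMAND" lines on stdin and print
# the PID and CPU figure of every ologgerd entry above the threshold.
# $1 is the threshold in percent; 50 is an arbitrary illustrative default.
ologgerd_over_cpu() {
    awk -v limit="${1:-50}" '
        $3 == "ologgerd" && $2 + 0 > limit + 0 { print $1, $2 }
    '
}
```

For example, `ps -eo pid,pcpu,comm | ologgerd_over_cpu 100` prints nothing when ologgerd is below 100% of one CPU.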
Apply Patch 13867435 - OLOGGERD USING A LOT OF RESOURCES on top of 11.2.0.3. The bug is fixed in 11.2.0.4 GI PSU.

My handling was as follows:

1. grid@woqurac1:/home/grid>crsctl stat res -t -init
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               Instance Shutdown
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE
ora.crf
      1        ONLINE  ONLINE       rac1                     -- check this resource
ora.crsd
      1        ONLINE  OFFLINE
ora.cssd
      1        ONLINE  OFFLINE
ora.cssdmonitor
      1        ONLINE  ONLINE       rac1
ora.ctssd
      1        ONLINE  OFFLINE
ora.diskmon
      1        OFFLINE OFFLINE
ora.drivers.acfs
      1        ONLINE  ONLINE       rac1
ora.evmd
      1        ONLINE  OFFLINE
ora.gipcd
      1        ONLINE  ONLINE       rac1
ora.gpnpd
      1        ONLINE  ONLINE       rac1
ora.mdnsd
      1        ONLINE  ONLINE       rac1
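The state of a single resource, such as the ora.crf resource flagged in the listing, can be pulled out of the `crsctl stat res -t -init` output with a small filter. This is a sketch under assumptions: the function name is hypothetical, and it assumes the usual tabular layout in which the resource name stands on its own line and the next line starts with the instance number followed by TARGET and STATE.

```shell
#!/bin/sh
# Hypothetical helper: read `crsctl stat res -t -init` output on stdin and
# print "TARGET STATE" for the resource named in $1 (default: ora.crf).
# Prints nothing if the resource is missing or the layout is unexpected.
resource_state() {
    awk -v res="${1:-ora.crf}" '
        $1 == res        { found = 1; next }      # name line of the wanted resource
        found && $1 == 1 { print $2, $3; exit }   # detail line: instance, TARGET, STATE
        found            { exit }                 # any other line: layout not as assumed
    '
}
```

Run on a cluster node as `crsctl stat res -t -init | resource_state ora.crf`; for the listing in this note it would report `ONLINE ONLINE`.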