Today our company's DB hit ORA-29702: error occurred in Cluster Group Service operation

cizeb5816

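Right after I got to work this morning, the developers reported that the DB was misbehaving: on our three-node RAC (Linux AS3, Oracle 9.2.0.8), two of the DB instances could not be connected to.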

I logged in to nodeA and ran $ lsnrctl status, which showed that only the primary node, nodeA, was online.

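On nodeA, $ top showed CPU wait at almost 100% and idle at 0, while the CPUs on the other two machines were almost 100% idle. To get the application working again, I first logged in to nodeB and nodeC and ran > startup. The DB started without any problem and the application went back to normal.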

Next, I looked for the cause:

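I checked the alert.log on B (C can be checked the same way). The last log switch was at about 04:21 overnight, and then an ORA-29702 error made the LMON process terminate the instance.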

Mon Oct 18 04:21:21 2010
Errors in file /opt/oracle/admin/pdm/bdump/XXX3_lmon_2798.trc:
ORA-29702: error occurred in Cluster Group Service operation
Mon Oct 18 04:21:21 2010
LMON: terminating instance due to error 29702
Instance terminated by LMON, pid = 2798
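
To find entries like these without paging through the whole alert.log by hand, a small script can tag each ORA-29702 line with the timestamp line above it. This is only a minimal sketch: the function name `find_ora_events` is made up for illustration, and it assumes the 9i-style timestamp layout shown in the excerpt above.

```python
import re
from datetime import datetime

# 9i alert.log timestamp lines look like "Mon Oct 18 04:21:21 2010".
TIMESTAMP_FORMAT = "%a %b %d %H:%M:%S %Y"
ORA_ERROR = re.compile(r"\bORA-(\d{5})\b")


def find_ora_events(alert_text, code="29702"):
    """Return (timestamp, line) pairs for lines that mention ORA-<code>.

    Each hit is tagged with the most recent timestamp line seen above it
    (None if no timestamp line came first).
    """
    events = []
    current = None
    for raw in alert_text.splitlines():
        line = raw.strip()
        try:
            current = datetime.strptime(line, TIMESTAMP_FORMAT)
            continue
        except ValueError:
            pass  # not a timestamp line, so treat it as a message line
        match = ORA_ERROR.search(line)
        if match and match.group(1) == code:
            events.append((current, line))
    return events


# Excerpt copied from the alert.log above.
SAMPLE = """\
Mon Oct 18 04:21:21 2010
Errors in file /opt/oracle/admin/pdm/bdump/XXX3_lmon_2798.trc:
ORA-29702: error occurred in Cluster Group Service operation
Mon Oct 18 04:21:21 2010
LMON: terminating instance due to error 29702
Instance terminated by LMON, pid = 2798
"""

if __name__ == "__main__":
    for when, text in find_ora_events(SAMPLE):
        print(when, text)
```

In practice the text would come from reading the real alert.log; the embedded sample just keeps the sketch self-contained.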

I checked the LMON process trace file XXX3_lmon_2798.trc:

*** 2010-10-18 04:21:21.956
kjxggpoll: received an error event from DBALL_DB
Return code from kjxggpoll: 10
error 29702 detected in background process
ORA-29702: error occurred in Cluster Group Service operation

Then I checked CM.log, which contained many disconnect and SYNC errors. Clearly this was a synchronization problem between the clusterware on the nodes.

I checked /var/log/messages:

Oct 17 04:03:05 DB03 syslogd 1.4.1: restart.
Oct 18 04:20:54 DB03 kernel: tg3: eth0: Link is down.
Oct 18 04:20:54 DB03 kernel: tg3: eth1: Link is down.
Oct 18 04:23:10 DB03 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Oct 18 04:23:10 DB03 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Oct 18 04:23:12 DB03 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
Oct 18 04:23:12 DB03 kernel: tg3: eth1: Flow control is off for TX and off for RX.
Oct 18 04:38:08 DB03 login(pam_unix)[23406]: session opened for user oracle by (uid=0)
Oct 18 04:38:08 DB03 -- oracle[23406]: LOGIN ON pts/2 BY oracle FROM 172.16.0.204
Oct 18 06:07:41 DB03 login(pam_unix)[11801]: session closed for user oracle
Oct 18 07:18:21 DB03 kernel: scsi(0): RSCN database changed -0x1,0x800.
Oct 18 07:18:21 DB03 kernel: scsi(0): Waiting for LIP to complete...
Oct 18 07:18:21 DB03 kernel: scsi(0): Waiting for LIP to complete...
Oct 18 07:18:21 DB03 kernel: scsi(0): Topology - (F_Port), Host Loop address 0xffff
Oct 18 07:18:21 DB03 kernel: scsi(1): RSCN database changed -0x1,0x900.
Oct 18 07:18:21 DB03 kernel: scsi(1): Waiting for LIP to complete...
Oct 18 07:18:21 DB03 kernel: scsi(1): Waiting for LIP to complete...

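This has to be a network problem that broke cluster communication: both NICs lost link at 04:20:54, 27 seconds before LMON terminated the instance at 04:21:21. I checked our other clustered servers and their logs, and every cluster restarted at this time, so this must be the cause.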
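
The timing argument can be checked mechanically. The sketch below pairs each "Link is down" line with the next "Link is up" line for the same interface and tests whether the LMON termination time falls inside that window. The function names are hypothetical, the regular expression assumes the tg3 message format quoted above, and because syslog lines carry no year, the year is passed in as an assumption.

```python
import re
from datetime import datetime

LINK_EVENT = re.compile(
    r"^(\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) \S+ kernel: \S+ (eth\d+): Link is (down|up)"
)


def link_events(syslog_text, year):
    """Yield (timestamp, interface, state) for NIC link up/down lines."""
    for line in syslog_text.splitlines():
        match = LINK_EVENT.match(line)
        if match:
            stamp, iface, state = match.groups()
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            yield when, iface, state


def outages(events):
    """Pair each 'down' with the next 'up' seen on the same interface."""
    down_at = {}
    result = []
    for when, iface, state in events:
        if state == "down":
            down_at[iface] = when
        elif iface in down_at:
            result.append((iface, down_at.pop(iface), when))
    return result


# Lines copied from the syslog excerpt above.
SAMPLE = """\
Oct 18 04:20:54 DB03 kernel: tg3: eth0: Link is down.
Oct 18 04:20:54 DB03 kernel: tg3: eth1: Link is down.
Oct 18 04:23:10 DB03 kernel: tg3: eth0: Link is up at 1000 Mbps, full duplex.
Oct 18 04:23:10 DB03 kernel: tg3: eth0: Flow control is off for TX and off for RX.
Oct 18 04:23:12 DB03 kernel: tg3: eth1: Link is up at 1000 Mbps, full duplex.
"""

# Time of "LMON: terminating instance" taken from the alert.log.
LMON_TERMINATED = datetime(2010, 10, 18, 4, 21, 21)

if __name__ == "__main__":
    for iface, down, up in outages(link_events(SAMPLE, 2010)):
        seconds = (up - down).total_seconds()
        inside = down <= LMON_TERMINATED <= up
        print(iface, seconds, "LMON terminated during outage:", inside)
```

On the lines above this gives outages of 136 and 138 seconds for eth0 and eth1, with the LMON termination inside both windows.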

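When I asked around, it turned out that the UPS in the network cabinet had indeed tripped at that time (I won't go into why) and the switches rebooted, which took roughly two to three minutes.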

In 10g RAC, when a network failure affects CRS, the nodes other than the master node automatically reboot the OS in order to recover.

Why doesn't 9i have something similar? 9i is stable, but it still leaves a lot to be desired.

There is also another cause that can lead to ORA-29702.

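When applying an Oracle patchset, the upgraded nodes sometimes end up with different file versions. With the files out of sync, instances can also go down suddenly and frequently. In that case the installation files have to be copied manually from the primary node to the other nodes. The record follows: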

ORA-29702 Appears After Upgrade To 9.2.0.8.0 [ID 443439.1]
Modified 09-JUL-2007 Type PROBLEM Status MODERATED

In this Document
Symptoms
Cause
Solution
References

This document is being delivered to you via Oracle Support's Rapid Visibility (RaV) process, and therefore has not been subject to an independent technical review.

Applies to:
Oracle Server - Enterprise Edition - Version: 9.2.0.8.0
This problem can occur on any platform.
Symptoms
On 9.2.0.8.0 in Production:
Every now and then LMON is terminating the instance, the following error occurs:

ERROR
ORA-29702 error occurred in Cluster Group Service operation

-- Steps To Reproduce:
The issue can be reproduced at will with the following steps:
1. Suddenly the error ORA-29702 appeared.
2. LMON terminated the instance.

Cause
Some of the files have not copied from node#1 to node#2 correctly.
Solution

-- To implement the solution, please execute the following steps:
Please perform the following when all instances are shut down.
1. Please tar $ORACLE_HOME directory in first Node.
2. Please backup the $ORACLE_HOME directory in second Node.
3. Please remove the $ORACLE_HOME directory in second Node.
4. Please put $ORACLE_HOME directory of the first node into the second node.
5. Please copy the $ORACLE_HOME/dbs of the backup of second node into the $ORACLE_HOME/dbs of new home in second node.
6. Please relink all in second node.
cd $ORACLE_HOME/bin
./relink all
7. Please check Libraries and binary files sizes.
8. Start the instances.
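
Step 7 can be partly automated. The note only asks for library and binary sizes to be checked; the sketch below goes one step further and also records a SHA-256 checksum per file, which is my own addition and not part of the note. `build_manifest` and `diff_manifests` are hypothetical names. The idea is to run `build_manifest` against $ORACLE_HOME on each node, move one result to the other node (for example as JSON), and compare the two.

```python
import hashlib
import os


def build_manifest(root):
    """Map each regular file under root to (size_in_bytes, sha256_hex)."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks, sockets and other special files
            digest = hashlib.sha256()
            with open(path, "rb") as handle:
                # Read in 1 MiB chunks so large binaries are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            rel = os.path.relpath(path, root)
            manifest[rel] = (os.path.getsize(path), digest.hexdigest())
    return manifest


def diff_manifests(first, second):
    """Return sorted relative paths that are missing or differ between nodes."""
    return sorted(
        rel
        for rel in first.keys() | second.keys()
        if first.get(rel) != second.get(rel)
    )


if __name__ == "__main__":
    home = os.environ.get("ORACLE_HOME", ".")
    print(len(build_manifest(home)), "files scanned under", home)
```

Binaries rebuilt by relink may not be byte-identical across nodes, so a checksum difference on those files is not necessarily a problem. A size difference in a shared library such as libskgxn9.so is the kind of mismatch the bug below describes.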
References
BUG:6028803 - LMON TERMINIATING WITH ERROR ORA-29702

Related
Products: Oracle Database Products > Oracle Database > Oracle Database > Oracle Server - Enterprise Edition
Keywords: LMON; UPGRADE TO 9.2.0.8.0
Errors: ORA-29702

Bug 6028803: LMON TERMINIATING WITH ERROR ORA-29702

Bug Attributes
Type: B - Defect
Fixed in Product Version: -
Severity: 3 - Minimal Loss of Service
Product Version: 9.2.0.8
Status: 32 - Not a Bug. To Filer
Platform: 46 - Linux x86
Created: 02-May-2007
Platform Version: RHAS 3
Updated: 12-Jun-2007
Base Bug: -
Database Version: 9.2.0.8
Affects Platforms: Port-Specific
Product Source: Oracle

Related Products
Line: Oracle Database Products
Family: Oracle Database
Area: Oracle Database
Product: 5 - Oracle Server - Enterprise Edition

Hdr: 6028803 9.2.0.8 RDBMS 9.2.0.8 RAC PRODID-5 PORTID-46
Abstract: LMON TERMINIATING WITH ERROR ORA-29702

*** 05/02/07 05:40 am ***
TAR:
----

PROBLEM:
--------
On 9.2.0.8.0 in Production:
Every now and then LMON is terminating the instance,
the following error occurs:

ERROR
-----------------------
ORA-29702 error occurred in Cluster Group Service operation

-- Steps To Reproduce:
The issue can be reproduced at will with the following steps:
1. Suddenly the error ORA-29702 appeared.
2. LMON terminated the instance.

DIAGNOSTIC ANALYSIS:
--------------------
Alert Log: node#2
-------------

Sun Apr 15 00:21:22 2007
Errors in file
/home/oracle/app/product/9.2.0.5.0/rdbms/log/somprd2_lmon_21839.trc:
ORA-29702 : error occurred in Cluster Group Service operation
Sun Apr 15 00:21:22 2007
LMON: terminating instance due to error 29702
Sun Apr 15 00:21:22 2007
System state dump is made for local instance
Sun Apr 15 00:21:27 2007
Instance terminated by LMON, pid = 21839

WORKAROUND:
-----------
N/A

RELATED BUGS:
-------------

REPRODUCIBILITY:
----------------
occurs every now and then.

TEST CASE:
----------
N/A

STACK TRACE:
------------

SUPPORTING INFORMATION:
-----------------------

24 HOUR CONTACT INFORMATION FOR P1 BUGS:
----------------------------------------

DIAL-IN INFORMATION:
--------------------

IMPACT DATE:
------------

*** 05/02/07 05:41 am ***
Log File: cm.log on node#2
-----------

>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850771996
file = unixinc.c, line = 833 {Sun Apr 15 00:19:16 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850788369
file = unixinc.c, line = 833 {Sun Apr 15 00:19:52 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850804764
file = unixinc.c, line = 833 {Sun Apr 15 00:19:53 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850821137
file = unixinc.c, line = 833 {Sun Apr 15 00:20:41 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850837532
file = unixinc.c, line = 833 {Sun Apr 15 00:20:41 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850853905
file = unixinc.c, line = 833 {Sun Apr 15 00:20:51 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850870300
file = unixinc.c, line = 833 {Sun Apr 15 00:20:51 2007 }
>WARNING: ReadCommPort: received error=104 on recv()., tid = 756793359 file
= unixinc.c, line = 841 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 848773149
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850116629
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757022747
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 756989978
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 784973857
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757055504
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 847069206
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 816824340
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 756973593
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757317642
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 756957208
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757006345
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757039122
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 757088275
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>ERROR: WriteEventPort: write failed with error 32., tid = 757088275 file
= unixinc.c, line = 981 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 756940823
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 756760590
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850886665
file = unixinc.c, line = 833 {Sun Apr 15 05:54:16 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850903050
file = unixinc.c, line = 833 {Sun Apr 15 05:54:16 2007 }
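
A log like this is easier to read once the entries are counted per timestamp, because the burst at the moment LMON gave up then stands out. The following is a rough sketch under the assumption that every cm.log entry ends with a brace-delimited timestamp as in the excerpt above; the function names are hypothetical. If the error numbers are Linux errno values, 104 is ECONNRESET and 32 is EPIPE, which fits a peer that dropped its connections.

```python
import re
from collections import Counter

# cm.log entries end with a timestamp such as "{Sun Apr 15 00:21:22 2007 }".
# An entry may be wrapped over two lines, so the whole text is searched.
STAMP = re.compile(r"\{(\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) \}")


def entries_per_second(cm_log_text):
    """Count cm.log entries by their timestamp string."""
    return Counter(STAMP.findall(cm_log_text))


def busiest(cm_log_text, top=3):
    """Return the timestamps with the most entries, busiest first."""
    return entries_per_second(cm_log_text).most_common(top)


# A few entries copied from the cm.log above.
SAMPLE = """\
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 850771996
file = unixinc.c, line = 833 {Sun Apr 15 00:19:16 2007 }
>WARNING: ReadCommPort: received error=104 on recv()., tid = 756793359 file
= unixinc.c, line = 841 {Sun Apr 15 00:21:22 2007 }
>WARNING: ReadCommPort: socket closed by peer on recv()., tid = 848773149
file = unixinc.c, line = 833 {Sun Apr 15 00:21:22 2007 }
>ERROR: WriteEventPort: write failed with error 32., tid = 757088275 file
= unixinc.c, line = 981 {Sun Apr 15 00:21:22 2007 }
"""

if __name__ == "__main__":
    for stamp, count in busiest(SAMPLE):
        print(count, stamp)
```

Run against the full log, the busiest second should line up with the LMON termination time in the alert log if the two events are related.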
*** 05/02/07 05:42 am ***

libskgxn9: node#1:
-rwxr-xr-x 1 oracle oinstall 65236 Apr 8 06:34
/home/oracle/app/product/9.2.0.5.0/lib/libskgxn9.so

libskgxn9: node#2:
-rwxr-xr-x 1 oracle oinstall 64274 Mar 11 2004
/home/oracle/app/product/9.2.0.5.0/lib/libskgxn9.so

ls -al $ORACLE_HOME/bin/ora* : node#1
-rwxr-xr-x    1 oracle   oinstall       46 Nov 20 2001
/home/oracle/app/product/9.2.0.5.0/bin/oracg
-rwsr-s--x    1 oracle   oinstall 53734840 Apr 8 07:23
/home/oracle/app/product/9.2.0.5.0/bin/oracle
-rwsr-s--x    1 oracle   oinstall 53734840 Apr 8 07:20
/home/oracle/app/product/9.2.0.5.0/bin/oracleO
-rwxr-xr-x    1 oracle   oinstall     2548 Jul 26 2003
/home/oracle/app/product/9.2.0.5.0/bin/oraenv

ls -al $ORACLE_HOME/bin/ora* : node#2
-rwxr-xr-x    1 oracle   oinstall       46 Nov 20 2001
/home/oracle/app/product/9.2.0.5.0/bin/oracg
-rwsr-s--x    1 oracle   oinstall 53734840 Apr 8 07:23
/home/oracle/app/product/9.2.0.5.0/bin/oracle
-rwsr-s--x    1 oracle   oinstall 53735479 Apr 8 06:35
/home/oracle/app/product/9.2.0.5.0/bin/oracleO
-rwsr-s--x    1 oracle   oinstall 52999216 May 15 2005
/home/oracle/app/product/9.2.0.5.0/bin/oracleO.bak
-rwxr-xr-x    1 oracle   oinstall     2548 Jul 26 2003
/home/oracle/app/product/9.2.0.5.0/bin/oraenv
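
The size differences in these listings are easy to miss by eye. As an illustration, the sketch below parses two wrapped listings in the layout quoted above (one attribute line followed by one path line) and reports the paths whose sizes differ. `parse_listing` and `size_mismatches` are hypothetical helper names, and the parser assumes the size is the fifth whitespace-separated field.

```python
def parse_listing(text):
    """Parse wrapped `ls -al` output into {path: size_in_bytes}."""
    sizes = {}
    pending_size = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("/"):
            # Path line: attach the size remembered from the previous line.
            if pending_size is not None:
                sizes[line] = pending_size
            pending_size = None
            continue
        fields = line.split()
        # Attribute line: mode, links, owner, group, size, then the date.
        if len(fields) >= 5 and fields[4].isdigit():
            pending_size = int(fields[4])
    return sizes


def size_mismatches(node1, node2):
    """Return {path: (size_on_node1, size_on_node2)} where the sizes differ.

    A size of None means the file is absent from that listing.
    """
    return {
        path: (node1.get(path), node2.get(path))
        for path in sorted(node1.keys() | node2.keys())
        if node1.get(path) != node2.get(path)
    }


# Entries copied from the listings above.
NODE1 = """\
-rwxr-xr-x 1 oracle oinstall 65236 Apr 8 06:34
/home/oracle/app/product/9.2.0.5.0/lib/libskgxn9.so
-rwsr-s--x    1 oracle   oinstall 53734840 Apr 8 07:23
/home/oracle/app/product/9.2.0.5.0/bin/oracle
-rwsr-s--x    1 oracle   oinstall 53734840 Apr 8 07:20
/home/oracle/app/product/9.2.0.5.0/bin/oracleO
"""

NODE2 = """\
-rwxr-xr-x 1 oracle oinstall 64274 Mar 11 2004
/home/oracle/app/product/9.2.0.5.0/lib/libskgxn9.so
-rwsr-s--x    1 oracle   oinstall 53734840 Apr 8 07:23
/home/oracle/app/product/9.2.0.5.0/bin/oracle
-rwsr-s--x    1 oracle   oinstall 53735479 Apr 8 06:35
/home/oracle/app/product/9.2.0.5.0/bin/oracleO
"""

if __name__ == "__main__":
    diff = size_mismatches(parse_listing(NODE1), parse_listing(NODE2))
    for path, (first, second) in diff.items():
        print(path, first, second)
```

For the entries above this flags libskgxn9.so (65236 vs 64274 bytes) and oracleO (53734840 vs 53735479 bytes), while the oracle binary matches on both nodes.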
*** 05/09/07 08:26 am *** (CHG: Sta->16)
*** 05/09/07 08:26 am ***
uploaded the following files:
RDA.RDA_somprd1.zip
RDA.RDA_somprd2.zip
alert_traces.zip

Thanks in advance
*** 05/09/07 09:13 pm *** (CHG: Sta->10)
*** 05/09/07 09:14 pm ***
*** 05/14/07 05:09 am ***
This is what happened on a previous occasion. The same ORA-29702 occurred, as
can be seen in the LMON trace log:
*** ID:(3.1) 2007-04-13 10:45:27.324
GES IPC: Receivers 3 Senders 3
GES IPC: Buffers Receive 1000 Send (i:660 b:660) Reserve 430
GES IPC: Msg Size Regular 396 Batch 2048
Batch msg size = 2048
Batching factor: enqueue replay 48, ack 53
Batching factor: cache replay 34 size per lock 56
kjxggin: receive buffer size = 32768
kjxgmin: SKGXN ver (2 1 Oracle 9i Reference CM)
CMCLI WARNING: CMInitContext: init ctx(0xccfb358)
*** 10:45:31.439
kjxgmrcfg: Reconfiguration started, reason 1
kjxgmcs: Setting state to 0 0.
*** 10:45:31.441
     Name Service frozen
kjxgmcs: Setting state to 0 1.
kjfcpiora: publish my weight 62874
kjxgmps: proposing substate 2
kjxgmcs: Setting state to 6 2.
     Performed the unique instance identification check
kjxgmps: proposing substate 3
kjxgmcs: Setting state to 6 3.
     Name Service recovery started
     Deleted all dead-instance name entries
kjxgmps: proposing substate 4
kjxgmcs: Setting state to 6 4.
     Multicasted all local name entries for publish
     Replayed all pending requests
kjxgmps: proposing substate 5
kjxgmcs: Setting state to 6 5.
     Name Service normal
     Name Service recovery done
*** 10:45:32.808
kjxgmps: proposing substate 6
kjxgmcs: Setting state to 6 6.
*** 10:45:32.920
*** 10:45:32.920
Reconfiguration started (old inc 0, new inc 6)
Synchronization timeout interval: 660 sec
List of nodes:
0 1
Global Resource Directory frozen
node 0
release 9 2 0 8
node 1
release 9 2 0 8
res_master_weight for node 0 is 62874
res_master_weight for node 1 is 62874
Total master weight = 125748
Dead inst
Join inst 0 1
Exist inst
Active Sendback Threshold = 50 %
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Resources and enqueues cleaned out
Resources remastered 0
0 GCS shadows traversed, 0 cancelled, 0 closed
0 GCS resources traversed, 0 cancelled
set master node info
Submitted all remote-enqueue requests
kjfcrfg: Number of mesgs sent to node 0 = 0
Update rdomain variables
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
*** 10:45:34.508
0 GCS shadows traversed, 0 replayed, 0 unopened
Submitted all GCS cache requests
0 write requests issued in 31086 GCS resources
7 PIs marked suspect, 0 flush PI msgs
*** 10:45:34.775
Reconfiguration complete
*** 10:45:39.636
kjxgrtmc2: Member 1 thread 2 mounted
CMCLI WARNING: ReadCommPort: poll() failed
kjxggpoll: received an error event from DBALL_DB
Return code from kjxggpoll: 10
error 29702 detected in background process
ORA-29702: error occurred in Cluster Group Service operation
ksuitm: waiting for [5] seconds before killing DIAG

--
The concern here is that the libskgxn9.so libraries on the two nodes are not
the same, so the issue may have appeared when the reconfiguration happened.

The same time the ORA-29702 appeared in alert log these errors appeared in
cm.log:
>WARNING: ReadCommPort: received error=104 on recv()., tid = 756793359 file
= unixinc.c, line = 841 {Sun Apr 15 00:21:22 2007 }
>ERROR: WriteEventPort: write failed with error 32., tid = 757088275 file
= unixinc.c, line = 981 {Sun Apr 15 00:21:22 2007 }

This happened in the node which has a wrong libskgxn9.so library.

Thanks
*** 05/14/07 05:10 am *** (CHG: Sta->16)
*** 05/23/07 07:24 am ***
Hi,
Could you please provide an update?

Thanks
Amr
*** 05/30/07 10:45 am *** (CHG: G/P->P Asg->NEW OWNER OWNER)
*** 05/30/07 10:45 am ***
*** 06/04/07 06:42 am *** (CHG: Asg->NEW OWNER OWNER)
*** 06/04/07 06:42 am ***
*** 06/04/07 08:34 am ***
*** 06/04/07 08:34 am *** (CHG: Sta->10)
*** 06/11/07 08:43 am *** (CHG: Sta->16)
*** 06/11/07 08:43 am ***
*** 06/12/07 07:51 am ***
*** 06/12/07 07:52 am *** (CHG: Sta->32)


Source: ITPUB blog, http://blog.itpub.net/7608831/viewspace-676166/. Please credit the source when reposting; otherwise legal liability will be pursued.

Reposted from: http://blog.itpub.net/7608831/viewspace-676166/
