After two solid weeks seconded to another performance-tuning team, my manager finally let me come back yesterday. This morning I ssh'ed into the production database to check alert.log
and found that one of the two RAC nodes was unreachable. At first I thought the network was just slow, but after a while it was clear the node was genuinely down;
it didn't even respond to ping. I asked a colleague from the IT department to go check the server room, and he found a failed PCI slot on the midrange server.
The frustrating part is that everything had been fine the whole time I was away; I got back yesterday afternoon, and this happens the very next morning.
What can I say? The only explanation is that the database wanted to celebrate Singles' Day too.
DB VERSION: 10.2.0.4.0 - 64bit Production
OS : AIX
Excerpt from alert.log:
Wed Nov 10 14:12:30 2010
Thread 1 advanced to log sequence 1897 (LGWR switch)
Current log# 1 seq# 1897 mem# 0: /dev/rora1g_lv2
Wed Nov 10 17:17:42 2010
Thread 1 advanced to log sequence 1898 (LGWR switch)
Current log# 2 seq# 1898 mem# 0: /dev/rora1g_lv3
Wed Nov 10 23:42:06 2010
ALTER SYSTEM SET service_names='' SCOPE=MEMORY SID='caprod1';
Wed Nov 10 23:42:06 2010 -- looks like it was getting ready for the 2010-11-11 holiday
Immediate Kill Session#: 96, Serial#: 33556
Immediate Kill Session: sess: 7000001fe2f9e68 OS pid: 893516
Wed Nov 10 23:42:06 2010
Process OS id : 893516 alive after kill
Errors in file
Immediate Kill Session#: 103, Serial#: 13972
Immediate Kill Session: sess: 7000001fc312698 OS pid: 1012654
Wed Nov 10 23:42:07 2010
Process OS id : 1012654 alive after kill
Errors in file /home/oracle/app/admin/caprod/udump/caprod1_ora_127060.trc
Immediate Kill Session#: 106, Serial#: 20794
Immediate Kill Session: sess: 7000002032d5720 OS pid: 843926
Wed Nov 10 23:42:07 2010
Process OS id : 843926 alive after kill
Errors in file /home/oracle/app/admin/caprod/udump/caprod1_ora_127060.trc
Immediate Kill Session#: 108, Serial#: 59655
Immediate Kill Session: sess: 7000001fe2fdea0 OS pid: 782412
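To get a quick tally of how many sessions were killed in a burst like the one above, and which OS pids survived the kill, the alert.log lines can be scanned with a small script. This is only an illustrative sketch: the patterns match the 10.2-style messages shown here and may need adjusting for other versions.

```python
import re

def summarize_kills(alert_lines):
    """Scan alert.log lines for 'Immediate Kill Session' events.

    Returns (kills, survivors): kills is a list of (session#, serial#)
    tuples, survivors is a list of OS pids still alive after the kill.
    """
    kills = []
    survivors = []
    for line in alert_lines:
        m = re.match(r"Immediate Kill Session#:\s*(\d+), Serial#:\s*(\d+)", line)
        if m:
            kills.append((int(m.group(1)), int(m.group(2))))
        m = re.match(r"Process OS id\s*:\s*(\d+)\s+alive after kill", line)
        if m:
            survivors.append(int(m.group(1)))
    return kills, survivors

# A few lines taken from the excerpt above.
sample = [
    "Immediate Kill Session#: 96, Serial#: 33556",
    "Immediate Kill Session: sess: 7000001fe2f9e68  OS pid: 893516",
    "Process OS id : 893516 alive after kill",
    "Immediate Kill Session#: 103, Serial#: 13972",
]
kills, survivors = summarize_kills(sample)
print(len(kills), survivors)  # 2 [893516]
```

Note that "alive after kill" here meant the server processes genuinely refused to die, consistent with the hardware-level failure found later.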
Excerpt from caprod1_ora_127060.trc:
Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit Production
*** 2010-11-10 23:42:06.895
*** ACTION NAME:() 2010-11-10 23:42:06.876
*** MODULE NAME:(racgimon@cadb01 (TNS V1-V3)) 2010-11-10 23:42:06.876
*** SERVICE NAME:(SYS$USERS) 2010-11-10 23:42:06.876
*** SESSION ID:(625.1) 2010-11-10 23:42:06.876
----------------------------------------
SO: 7000001fc2c5660, type: 2, owner: 0, flag: INIT/-/-/0x00
(process) Oracle pid=257, calls cur/top: 0/7000001993b6f60, flag: (0) -
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 0
last post received-location: No post
last process to post me: none
last post sent: 0 0 0
last post sent-location: No post
last process posted by me: none
(latch info) wait_event=0 bits=0
Process Group: DEFAULT, pseudo proc: 7000001fd2a4f80
O/S info: user: oracle, term: UNKNOWN, ospid: 893516
OSD pid info: Unix process pid: 893516, image: oracle@cadb01
Short stack dump: unable to dump stack due to error 72
Dump of memory from 0x07000001FC29AFB0 to 0x07000001FC29B1B8
7000001FC29AFB0 00000005 00000000 07000001 9EFA14A8 [................]
7000001FC29AFC0 00000010 000313A7 07000001 993B6F60 [.............;o`]
7000001FC29AFD0 00000003 000313A7 07000001 FC506BB0 [.............Pk.]
7000001FC29AFE0 00000013 0003129B 07000001 FE401670 [.............@.p]
7000001FC29AFF0 0000000B 000313A7 07000001 FE2F9E68 [............./.h]
7000001FC29B000 00000004 0003129B 00000000 00000000 [................]
7000001FC29B010 00000000 00000000 00000000 00000000 [................]
Repeat 25 times
7000001FC29B1B0 00000000 00000000 [........]
----------------------------------------
Each node is configured with a maximum of 600 connections, and the database's total connection count normally sits around 650, roughly 300 per node.
With only one node left, could it hold up? Watching it for a while, the surviving node's active connection count was already fluctuating around 580. After discussing
with my boss, we shut down a few non-critical systems, and the connection count settled at around 500.
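The back-of-the-envelope check above (about 650 total sessions against a 600-session cap on the one surviving node) can be written out as a tiny sketch. The numbers are just the ones from this incident, and the helper name is my own, nothing Oracle-specific:

```python
def failover_overload(total_sessions, per_node_limit, surviving_nodes=1):
    """Return how many sessions exceed what the surviving nodes can hold.

    A positive result means that much workload must be shed (or the
    per-node limit raised) before the failover is safe.
    """
    capacity = per_node_limit * surviving_nodes
    return total_sessions - capacity

# Values from this incident: ~650 total sessions, a 600-connection cap per node.
print(failover_overload(650, 600))  # 50 -> shed non-critical systems
# After stopping a few non-critical systems (~500 sessions):
print(failover_overload(500, 600))  # -100 -> comfortable headroom
```

The same arithmetic is worth doing in advance for any two-node RAC: if the steady-state total exceeds one node's limit, a single node failure guarantees connection storms.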
Summary:
1. After one node of a RAC cluster goes down, always verify whether the remaining node(s) can carry the current workload.
If they cannot, escalate immediately and work out a plan with management.
2. When one node runs into a problem, the other node will often hit the same problem before long,
so monitor the remaining nodes in the cluster promptly to prevent further failures.
Fortunately, today's was a hardware fault, and the tragedy did not repeat itself on the other node.