Environment: AIX 7100
Oracle 11gR2 RAC
Exact version: 11.2.0.4
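Symptom:
CRS on node 2 hung and crsctl commands did not respond at all. After the CRS processes were killed outright the host rebooted, but the VIP did not fail over to node 1.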
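Analysis approach:
1. Check the DB alert log and the related trace files.
2. Check the output of "errpt -a" on all nodes.
3. Check the GI logs on all nodes from the time of the problem: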
/log//alert*.log
/log//crsd/crsd.log
/log//cssd/ocssd.log
/log//agent/ohasd/oracssdmonitor_root/oracssdmonitor_root.log
/log//agent/ohasd/oracssdagent_root/oracssdagent_root.log
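/etc/oracle/lastgasp/* or /var/opt/oracle/lastgasp/* (if present)
Note: if the host reboot was initiated by CRS, a record is added to a file under /etc/oracle/lastgasp/.
4. Check the LMON, LMS*, and LMD0 trace files on all nodes from the time of the problem.
5. Check all OSW (OSWatcher) output on all nodes from the time of the problem.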
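
Most of the steps above come down to reading each log around the time of the incident. A minimal Python sketch of that filtering for the DB alert log follows. It assumes the 11g alert-log layout shown later in this post, where a timestamp line such as "Tue Mar 25 13:14:54 2014" precedes the messages it applies to. The function names `parse_ts` and `entries_in_window` are hypothetical helpers, not part of any Oracle tool. The GI logs (crsd.log, ocssd.log) use a different timestamp layout and would need their own parser.

```python
from datetime import datetime, timedelta

# Timestamp header format of the 11g DB alert log,
# e.g. "Tue Mar 25 13:14:54 2014" (assumes an English/C locale).
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_ts(line):
    """Return a datetime if the line is a timestamp header, else None."""
    try:
        return datetime.strptime(line.strip(), TS_FORMAT)
    except ValueError:
        return None


def entries_in_window(lines, center, minutes=15):
    """Keep message lines whose preceding timestamp header lies within
    +/- `minutes` of `center`. Returns a list of (timestamp, text)."""
    lo = center - timedelta(minutes=minutes)
    hi = center + timedelta(minutes=minutes)
    current_ts = None
    kept = []
    for line in lines:
        ts = parse_ts(line)
        if ts is not None:
            # A new header: following lines belong to this timestamp.
            current_ts = ts
            continue
        if current_ts is not None and lo <= current_ts <= hi:
            kept.append((current_ts, line.rstrip("\n")))
    return kept


# Example use (the file name is a placeholder, not a real path):
#   with open("alert_<SID>.log", errors="replace") as fh:
#       for ts, text in entries_in_window(fh, parse_ts("Tue Mar 25 13:15:00 2014")):
#           print(ts, text)
```

Running the same window over the alert logs of both nodes makes it easier to line up events such as the IPC send timeouts against entries in errpt and the GI logs.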
--------------------------------------------------------------------------------
The detailed analysis follows:
DB alert log on node 1:
Tue Mar 25 12:59:07 2014
Thread 1 advanced to log sequence 245 (LGWR switch)
Current log# 2 seq# 245 mem# 0: +SYSDG/dbracdb/onlinelog/group_2.264.840562709
Current log# 2 seq# 245 mem# 1: +SYSDG/dbracdb/onlinelog/group_2.265.840562727
Tue Mar 25 12:59:20 2014
Archived Log entry 315 added for thread 1 sequence 244 ID 0xffffffff82080958 dest 1:
Tue Mar 25 13:14:54 2014
IPC Send timeout detected. Sender: ospid 6160700 [oracle@dbrac1 (LMS0)]
Receiver: inst 2 binc 291585594 ospid 11010320
IPC Send timeout to 2.1 inc 50 for msg type 65518 from opid 12
Tue Mar 25 13:14:59 2014
Communications reconfiguration: instance_number 2
Tue Mar 25 13:15:01 2014
IPC Send timeout detected. Sender: ospid 12452050 [oracle@dbrac1 (LMS1)]
Receiver: inst 2 binc 291585600 ospid 11534636
IPC Send timeout to 2.2 inc 50 for msg type 65518 from opid 13
Tue Mar 25 13:15:22 2014
IPC Send timeout detected. Sender: ospid 10682630 [oracle@dbrac1 (TNS V1-V3)]
Rec