linux rac节点主机不定时重启,双节点RAC各个节点主机频繁自动重启故障解决

最近在vmware中搭建了一个oracle10g RAC的双节点实验平台并将oracle RAC从10.2.0.1升级到10.2.0.5,后来发现两台linux经常自动重

1) 背景介绍:

最近在vmware中搭建了一个Oracle10g RAC的双节点实验平台并将oracle RAC从10.2.0.1升级到10.2.0.5,后来发现两台linux经常自动重启;

2) 平台信息:

vmware7 + OEL5.7X64 + ASMLib2.0 + ORACLE10.2.0.5

3) /var/log/message日志:

NODE1:Linux1

Apr 18 20:44:18 Linux1 syslogd 1.4.1: restart.

Apr 18 20:44:18 Linux1 kernel: klogd 1.4.1, log source = /proc/kmsg started.

Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpuset

Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpu

Apr 18 20:44:18 Linux1 kernel: Linux version 2.6.32-200.13.1.el5uek (mockbuild@ca-build9.us.oracle.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Wed Jul 27 21:02:33 EDT 2011

Apr 18 20:44:18 Linux1 kernel: Command line: ro root=/dev/VolGroup00/LogVol00 rhgb quiet

Apr 18 20:44:18 Linux1 kernel: KERNEL supported cpus:

Apr 18 20:44:18 Linux1 kernel: Intel GenuineIntel

Apr 18 20:44:18 Linux1 kernel: AMD AuthenticAMD

Apr 18 20:44:18 Linux1 kernel: Centaur CentaurHauls

Apr 18 20:44:18 Linux1 kernel: BIOS-provided physical RAM map:

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000000000000 - 000000000009f800 (usable)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000ca000 - 00000000000cc000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000dc000 - 00000000000e4000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000000100000 - 00000000bfef0000 (usable)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bfef0000 - 00000000bfeff000 (ACPI data)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bfeff000 - 00000000bff00000 (ACPI NVS)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bff00000 - 00000000c0000000 (usable)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fffe0000 - 0000000100000000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000100000000 - 0000000140000000 (usable)

Apr 18 20:44:18 Linux1 kernel: DMI present.

NODE2:Linux2

Apr 18 20:43:35 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 has been idle for 30.0 seconds, shutting it down.

Apr 18 20:43:35 Linux2 kernel: (swapper,0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr 1334752985.559806 now 1334753015.306532 dr 1334752985.559360 adv 1334752985.559806:1334752985.559807 func (b651ea27:504) 1334752951.27068:1334752951.27323)

Apr 18 20:43:35 Linux2 kernel: o2net: no longer connected to node Linux1 (num 0) at 192.168.3.131:7777

Apr 18 20:43:56 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7

Apr 18 20:44:05 Linux2 kernel: (o2net,3480,0):o2net_connect_expired:1659 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.

Apr 18 20:44:24 Linux2 avahi-daemon[4341]: Registering new address record for 192.168.0.136 on eth0.

Apr 18 20:44:26 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7

Apr 18 20:44:28 Linux2 last message repeated 2 times

Apr 18 20:44:28 Linux2 kernel: (o2hb-9938799A41,3564,1):o2dlm_eviction_cb:267 o2dlm has evicted node 0 from group 9938799A418642218A66FE77029DE473

Apr 18 20:44:28 Linux2 kernel: (ocfs2rec,19793,1):ocfs2_replay_journal:1605 Recovering node 0 from slot 0 on device (8,65)

Apr 18 20:44:30 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 8

Apr 18 20:44:31 Linux2 kernel: (ocfs2rec,19793,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 0

Apr 18 20:44:31 Linux2 kernel: (ocfs2_wq,3567,1):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 0

Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_get_lock_resource:836 9938799A418642218A66FE77029DE473:$RECOVERY: at least one node (0) to recover before lock mastery can begin

Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_get_lock_resource:870 9938799A418642218A66FE77029DE473: recovery map is not empty, but must master $RECOVERY lock now

Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_do_recovery:523 (3573) Node 1 is the Recovery Master for the Dead Node 0 for Domain 9938799A418642218A66FE77029DE473

以上信息在两台机器中会交换出现,说明并不是总是固定的一台机器对另外一台超时。

4) 根据message信息报错,应该是o2cb的idle时间超限导致的,,系统中O2CB服务的状态为:

[oracle@Linux1]service o2cb status

Driver for "configfs": Loaded

Filesystem "configfs": Mounted

Stack glue driver: Loaded

Stack plugin "o2cb": Loaded

Driver for "ocfs2_dlmfs": Loaded

Filesystem "ocfs2_dlmfs": Mounted

Checking O2CB cluster ocfs2: Online

Heartbeat dead threshold = 301

Network idle timeout: 30000 /此处单位为毫秒,正式message中报的30秒

Network keepalive delay: 2000

Network reconnect delay: 2000

Checking O2CB heartbeat: Active

logo.gif

声明:本文原创发布php中文网,转载请注明出处,感谢您的尊重!如有疑问,请联系admin@php.cn处理

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值