Resolving frequent automatic reboots of both node hosts in a two-node RAC cluster

weixin_33849942

Original link: http://blog.51cto.com/ccz320/838687

1) Background:

I recently built a two-node Oracle 10g RAC lab environment in VMware and upgraded Oracle RAC from 10.2.0.1 to 10.2.0.5. Afterwards, both Linux hosts kept rebooting on their own.

2) Platform:

VMware 7 + OEL 5.7 x64 + ASMLib 2.0 + Oracle 10.2.0.5

3) /var/log/messages logs:

NODE1: Linux1

Apr 18 20:44:18 Linux1 syslogd 1.4.1: restart.

Apr 18 20:44:18 Linux1 kernel: klogd 1.4.1, log source = /proc/kmsg started.

Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpuset

Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpu

Apr 18 20:44:18 Linux1 kernel: Linux version 2.6.32-200.13.1.el5uek (mockbuild@ca-build9.us.oracle.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)) #1 SMP Wed Jul 27 21:02:33 EDT 2011

Apr 18 20:44:18 Linux1 kernel: Command line: ro root=/dev/VolGroup00/LogVol00 rhgb quiet

Apr 18 20:44:18 Linux1 kernel: KERNEL supported cpus:

Apr 18 20:44:18 Linux1 kernel: Intel GenuineIntel

Apr 18 20:44:18 Linux1 kernel: AMD AuthenticAMD

Apr 18 20:44:18 Linux1 kernel: Centaur CentaurHauls

Apr 18 20:44:18 Linux1 kernel: BIOS-provided physical RAM map:

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000000000000 - 000000000009f800 (usable)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000ca000 - 00000000000cc000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000dc000 - 00000000000e4000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000000100000 - 00000000bfef0000 (usable)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bfef0000 - 00000000bfeff000 (ACPI data)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bfeff000 - 00000000bff00000 (ACPI NVS)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000bff00000 - 00000000c0000000 (usable)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 00000000fffe0000 - 0000000100000000 (reserved)

Apr 18 20:44:18 Linux1 kernel: BIOS-e820: 0000000100000000 - 0000000140000000 (usable)

Apr 18 20:44:18 Linux1 kernel: DMI present.

NODE2: Linux2

Apr 18 20:43:35 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 has been idle for 30.0 seconds, shutting it down.

Apr 18 20:43:35 Linux2 kernel: (swapper,0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr 1334752985.559806 now 1334753015.306532 dr 1334752985.559360 adv 1334752985.559806:1334752985.559807 func (b651ea27:504) 1334752951.27068:1334752951.27323)

Apr 18 20:43:35 Linux2 kernel: o2net: no longer connected to node Linux1 (num 0) at 192.168.3.131:7777

Apr 18 20:43:56 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7

Apr 18 20:44:05 Linux2 kernel: (o2net,3480,0):o2net_connect_expired:1659 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.

Apr 18 20:44:24 Linux2 avahi-daemon[4341]: Registering new address record for 192.168.0.136 on eth0.

Apr 18 20:44:26 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7

Apr 18 20:44:28 Linux2 last message repeated 2 times

Apr 18 20:44:28 Linux2 kernel: (o2hb-9938799A41,3564,1):o2dlm_eviction_cb:267 o2dlm has evicted node 0 from group 9938799A418642218A66FE77029DE473

Apr 18 20:44:28 Linux2 kernel: (ocfs2rec,19793,1):ocfs2_replay_journal:1605 Recovering node 0 from slot 0 on device (8,65)

Apr 18 20:44:30 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 8

Apr 18 20:44:31 Linux2 kernel: (ocfs2rec,19793,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 0

Apr 18 20:44:31 Linux2 kernel: (ocfs2_wq,3567,1):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 0

Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_get_lock_resource:836 9938799A418642218A66FE77029DE473:$RECOVERY: at least one node (0) to recover before lock mastery can begin

Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_get_lock_resource:870 9938799A418642218A66FE77029DE473: recovery map is not empty, but must master $RECOVERY lock now

Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_do_recovery:523 (3573) Node 1 is the Recovery Master for the Dead Node 0 for Domain 9938799A418642218A66FE77029DE473

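In this excerpt, Linux2 reports at 20:43:35 that its o2net connection to Linux1 has been idle for 30 seconds, and Linux1's log shows a fresh boot at 20:44:18. The same pattern shows up with the roles swapped, so it is not always one particular machine timing out against the other.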
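To see how often each node logs the idle-timeout message, the syslog can be filtered for the o2net lines quoted above. The following is only a rough sketch: the function name `count_o2net_idle` is made up for illustration, and it assumes the standard syslog layout shown in the excerpts, where the fourth field is the host that wrote the line.

```shell
#!/bin/sh
# Count o2net "has been idle for" messages per reporting host.
# Reads syslog-format lines on stdin; field 4 is the host that logged the line.
# count_o2net_idle is a hypothetical helper name, not part of any OCFS2 tool.
count_o2net_idle() {
    awk '/o2net: connection to node/ && /has been idle for/ { n[$4]++ }
         END { for (h in n) print h, n[h] }' | sort
}

# Example usage on each node:
#   count_o2net_idle < /var/log/messages
```

Running it on both nodes and comparing the counts is one way to check whether the timeouts really occur in both directions.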

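4) Judging from the errors in the messages log, the reboots were most likely caused by the o2cb network idle timeout being exceeded. The current status of the O2CB service is: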

[oracle@Linux1] service o2cb status

Driver for "configfs": Loaded

Filesystem "configfs": Mounted

Stack glue driver: Loaded

Stack plugin "o2cb": Loaded

Driver for "ocfs2_dlmfs": Loaded

Filesystem "ocfs2_dlmfs": Mounted

Checking O2CB cluster ocfs2: Online

Heartbeat dead threshold = 301

Network idle timeout: 30000 /* unit is milliseconds; this is exactly the 30 seconds reported in the messages log */

Network keepalive delay: 2000

Network reconnect delay: 2000

Checking O2CB heartbeat: Active
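As a quick cross-check between the status output and the kernel message, the idle timeout can be pulled out of the status text and converted to seconds. This is a sketch only: `o2cb_idle_seconds` is a hypothetical helper, and it assumes the "Network idle timeout: <ms>" line format shown above.

```shell
#!/bin/sh
# Extract the "Network idle timeout" value (milliseconds) from
# `service o2cb status` output on stdin and print it in seconds.
# o2cb_idle_seconds is a hypothetical helper name used for illustration.
o2cb_idle_seconds() {
    awk -F': *' '/Network idle timeout/ { printf "%.1f\n", $2 / 1000 }'
}

# Example usage:
#   service o2cb status | o2cb_idle_seconds
```

With the value of 30000 shown above, the result corresponds to the "idle for 30.0 seconds" wording in the o2net message.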

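5) In /etc/sysconfig/o2cb, the heartbeat dead threshold has already been raised to 301 (the unit is heartbeat iterations, as the comment in the file says, not seconds), but the network idle timeout is still at its default of 30000 ms: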

[root@Linux1 ~]# more /etc/sysconfig/o2cb

# This is a configuration file for automatic startup of the O2CB

# driver. It is generated by running /etc/init.d/o2cb configure.

# On Debian based systems the preferred method is running

# 'dpkg-reconfigure ocfs2-tools'.

# O2CB_ENABLED: 'true' means to load the driver on boot.

O2CB_ENABLED=true

# O2CB_STACK: The name of the cluster stack backing O2CB.

O2CB_STACK=o2cb

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.

O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.

O2CB_HEARTBEAT_THRESHOLD=301 /* unit is heartbeat iterations, not seconds */

# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.

O2CB_IDLE_TIMEOUT_MS=30000 /* unit is milliseconds; this is exactly the 30 seconds reported in the messages log */

# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent

O2CB_KEEPALIVE_DELAY_MS=2000 /* unit is milliseconds */

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts

O2CB_RECONNECT_DELAY_MS=2000 /* unit is milliseconds */
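Since O2CB_HEARTBEAT_THRESHOLD counts iterations, it can help to translate it into wall-clock time. The sketch below assumes the commonly documented relationship for o2cb, threshold = (timeout in seconds / 2) + 1, which relies on a 2-second disk heartbeat interval. That interval is an assumption that does not appear in the output above, and `hb_threshold_to_seconds` is a made-up helper name.

```shell
#!/bin/sh
# Convert an O2CB_HEARTBEAT_THRESHOLD value (iterations) into an approximate
# timeout in seconds. Assumes a 2-second heartbeat interval, i.e.
# threshold = (seconds / 2) + 1; verify this against the OCFS2 documentation
# for the installed version. hb_threshold_to_seconds is a hypothetical name.
hb_threshold_to_seconds() {
    echo $(( ($1 - 1) * 2 ))
}

hb_threshold_to_seconds 301   # prints 600 under the assumption above
```

Under that assumption, the configured disk heartbeat timeout is far longer than the 30-second network idle timeout, which fits the network timeout being the one that fires first.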

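6) Because the experiment runs on several virtual machines under VMware on a single PC, the host is heavily loaded, which probably makes the nodes' network connections sit idle for longer than the timeout allows. The plan is therefore to rerun /etc/init.d/o2cb configure and double the network-related settings: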

[root@Linux1 ~]# /etc/init.d/o2cb configure

Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.

The following questions will determine whether the driver is loaded on

boot. The current values will be shown in brackets ('[]'). Hitting

<ENTER> without typing an answer will keep that current value. Ctrl-C

will abort.

Load O2CB driver on boot (y/n) [y]:

Cluster stack backing O2CB [o2cb]:

Cluster to start on boot (Enter "none" to clear) [ocfs2]:

Specify heartbeat dead threshold (>=7) [301]: /* unit is heartbeat iterations; left unchanged */

Specify network idle timeout in ms (>=5000) [30000]: 60000 /* this and the next two values are in milliseconds */

Specify network keepalive delay in ms (>=1000) [2000]: 4000

Specify network reconnect delay in ms (>=2000) [2000]: 4000

Writing O2CB configuration: OK

Cluster ocfs2 already online

[root@Linux1 ~]# exit
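Before typing new values into the configure dialog, they can be checked against the minimums that the prompts display (>=5000, >=1000 and >=2000). This is a small illustrative sketch; `check_min` is a hypothetical helper, and the variable names are the ones from /etc/sysconfig/o2cb.

```shell
#!/bin/sh
# Check a proposed setting against the minimum shown in the configure prompt.
# Usage: check_min NAME VALUE MIN   (check_min is a hypothetical helper)
check_min() {
    if [ "$2" -ge "$3" ]; then
        echo "$1=$2 ok"
    else
        echo "$1=$2 is below the minimum of $3"
        return 1
    fi
}

# The doubled values used in the session above:
check_min O2CB_IDLE_TIMEOUT_MS    60000 5000
check_min O2CB_KEEPALIVE_DELAY_MS 4000  1000
check_min O2CB_RECONNECT_DELAY_MS 4000  2000
```

The same configure step has to be repeated on the other node so that both nodes end up with the same values in /etc/sysconfig/o2cb.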

Restart the o2cb service on each node in turn:

[root@Linux2 ~]# service o2cb stop

Stopping O2CB cluster ocfs2: Failed

Unable to stop cluster as heartbeat region still active /* the ocfs2 file system is still mounted, so o2cb cannot be stopped; unmount ocfs2 first */

[root@Linux2 ~]# umount /u02

[root@Linux2 ~]# service o2cb stop

Stopping O2CB cluster ocfs2: OK

Unloading module "ocfs2": OK

Unmounting ocfs2_dlmfs filesystem: OK

Unloading module "ocfs2_dlmfs": OK

Unloading module "ocfs2_stack_o2cb": OK

Unmounting configfs filesystem: OK

Unloading module "configfs": OK

[root@Linux2 ~]# service o2cb start

Loading filesystem "configfs": OK

Mounting configfs filesystem at /sys/kernel/config: OK

Loading stack plugin "o2cb": OK

Loading filesystem "ocfs2_dlmfs": OK

Mounting ocfs2_dlmfs filesystem at /dlm: OK

Setting cluster stack "o2cb": OK

Starting O2CB cluster ocfs2: OK

[root@Linux2 ~]# mount /u02

[root@Linux2 ~]# mount

/dev/mapper/VolGroup00-LogVol00 on / type ext3 (rw)

proc on /proc type proc (rw)

sysfs on /sys type sysfs (rw)

devpts on /dev/pts type devpts (rw,gid=5,mode=620)

/dev/sda1 on /boot type ext3 (rw)

tmpfs on /dev/shm type tmpfs (rw)

none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)

sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)

192.168.2.110:/mnt/nfs4backup/nfs4backup/nfs4backup on /mnt/share type nfs (rw,hard,nointr,tcp,noac,nfsvers=3,timeo=600,rsize=32768,wsize=32768,addr=192.168.2.110)

oracleasmfs on /dev/oracleasm type oracleasmfs (rw)

configfs on /sys/kernel/config type configfs (rw)

ocfs2_dlmfs on /dlm type ocfs2_dlmfs (rw)

/dev/sdf1 on /u02 type ocfs2 (rw,_netdev,datavolume,nointr,heartbeat=local)
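The restart sequence from the session above (unmount, stop, start, mount) can be wrapped in a small function so that it is run in the same order on every node. This is a sketch under assumptions: `restart_o2cb` and the `RUN` switch are made up for illustration, /u02 is the OCFS2 mount point from this setup, and nothing (including CRS) may still be using the mount when it runs.

```shell
#!/bin/sh
# Print the o2cb restart sequence for one node; with RUN=1, execute it instead.
# restart_o2cb and RUN are hypothetical names used only in this sketch.
restart_o2cb() {
    mnt=${1:-/u02}
    for cmd in "umount $mnt" "service o2cb stop" "service o2cb start" "mount $mnt"; do
        if [ "${RUN:-0}" = "1" ]; then
            $cmd || return 1      # stop at the first failing step
        else
            echo "$cmd"           # dry run: only show what would be executed
        fi
    done
}

# Dry run (default):   restart_o2cb /u02
# Real run, as root:   RUN=1 restart_o2cb /u02
```

The dry-run default is deliberate, because stopping o2cb on a node with the file system in use fails, as the "heartbeat region still active" message shows.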

Start CRS:

[root@Linux2 ~]# /u01/app/crs/bin/crsctl start crs

Attempting to start CRS stack

The CRS stack will be started shortly

[oracle@Linux2] crs_stat -t

Name           Type         Target   State    Host
------------------------------------------------------------
ora....SM1.asm application  ONLINE   ONLINE   linux1
ora....X1.lsnr application  ONLINE   ONLINE   linux1
ora.linux1.gsd application  ONLINE   ONLINE   linux1
ora.linux1.ons application  ONLINE   ONLINE   linux1
ora.linux1.vip application  ONLINE   ONLINE   linux1
ora....SM2.asm application  ONLINE   ONLINE   linux2
ora....X2.lsnr application  ONLINE   ONLINE   linux2
ora.linux2.gsd application  ONLINE   ONLINE   linux2
ora.linux2.ons application  ONLINE   ONLINE   linux2
ora.linux2.vip application  ONLINE   ONLINE   linux2
ora.racdb.db   application  ONLINE   ONLINE   linux2
ora....b1.inst application  ONLINE   ONLINE   linux1
ora....b2.inst application  ONLINE   ONLINE   linux2
ora...._taf.cs application  OFFLINE  OFFLINE
ora....db1.srv application  OFFLINE  OFFLINE
ora....db2.srv application  OFFLINE  OFFLINE
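To spot resources that did not come back after the restart, the `crs_stat -t` output can be filtered on the State column. This is a rough sketch: `crs_not_online` is a hypothetical helper, and it assumes the five-column layout shown above (Name, Type, Target, State, Host).

```shell
#!/bin/sh
# Print the names of resources whose State column is not ONLINE.
# Expects `crs_stat -t` output on stdin; the header row is skipped by name and
# the separator row by its field count. crs_not_online is a hypothetical name.
crs_not_online() {
    awk 'NF >= 4 && $1 != "Name" && $4 != "ONLINE" { print $1 }'
}

# Example usage:
#   crs_stat -t | crs_not_online
```

For the output above this would list only the three service resources whose target is also OFFLINE, so they are stopped by configuration and have not failed.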

After a period of observation the system has been running normally, and the problem is resolved.

Reposted from: https://blog.51cto.com/ccz320/838687
