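Copyright notice: when reposting, please use a hyperlink to cite the article's original source, the author information, and this notice.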
http://blog.csdn.net/wenshuangzhu/article/details/44078639
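【Problem Description】
After the database servers were moved to another lab, the RAC database no longer started normally. The symptoms were as follows: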
1. The alert log on node 1 contained the following messages:
2013-01-16 03:13:40.314 [cssd(28019)]CRS-1612:node node76 (2) at 50% heartbeat fatal, eviction in 14.280 seconds
2013-01-16 03:13:47.315 [cssd(28019)]CRS-1611:node node76 (2) at 75% heartbeat fatal, eviction in 7.280 seconds
2013-01-16 03:13:48.315 [cssd(28019)]CRS-1611:node node76 (2) at 75% heartbeat fatal, eviction in 6.280 seconds
2013-01-16 03:13:52.306 [cssd(28019)]CRS-1610:node node76 (2) at 90% heartbeat fatal, eviction in 2.290 seconds
2013-01-16 03:13:53.306 [cssd(28019)]CRS-1610:node node76 (2) at 90% heartbeat fatal, eviction in 1.290 seconds
2013-01-16 03:13:54.306 [cssd(28019)]CRS-1610:node node76 (2) at 90% heartbeat fatal, eviction in 0.290 seconds
2013-01-16 03:13:54.597 [cssd(28019)]CRS-1607:CSSD evicting node node76.Details in /home/database/oracle/oracrs/log/node74/cssd/ocssd.log.
2013-01-16 03:14:24.608 [cssd(28019)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node74 .
2013-01-16 03:14:25.245 [crsd(28394)]CRS-1005:The OCR upgrade was completed. Version has changed from 185599744 to 185599744. Details in /home/database/oracle/oracrs/log/node74/crsd/crsd.log.
2013-01-16 03:14:54.755 [crsd(28394)]CRS-1204:Recovering CRS resources for node node76.
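It looks as though node 2's heartbeat was lost, so node 1 evicted node 2 from the cluster and rebooted it. The strange part is that, over many tests, pinging node 2's heartbeat address from node 1 showed no packet loss as long as Oracle was not started. But whenever Oracle was started on both nodes, node 1 reported the errors above and node 2 then rebooted automatically. Repeated tests gave the same result every time.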
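To follow the eviction countdown without reading the alert log by eye, the heartbeat warnings can be pulled out with a small script. This is a minimal sketch that assumes the message layout shown in the excerpt above; the function name `heartbeat_warnings` and the regular expression are illustrative and not part of any Oracle tool.

```python
import re

# Matches CSSD heartbeat warnings in the format seen in the alert log above, e.g.
# "... [cssd(28019)]CRS-1612:node node76 (2) at 50% heartbeat fatal, eviction in 14.280 seconds"
HEARTBEAT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[cssd\(\d+\)\]CRS-\d{4}:node (?P<node>\S+) \(\d+\) "
    r"at (?P<pct>\d+)% heartbeat fatal, eviction in (?P<secs>\d+\.\d+) seconds"
)


def heartbeat_warnings(lines):
    """Return (timestamp, node, percent, seconds_left) for each heartbeat warning."""
    warnings = []
    for line in lines:
        match = HEARTBEAT_RE.match(line.strip())
        if match:
            warnings.append((
                match.group("ts"),
                match.group("node"),
                int(match.group("pct")),
                float(match.group("secs")),
            ))
    return warnings


# Three lines copied from the alert log excerpt; the last one is not a
# heartbeat warning and is therefore skipped.
sample = [
    "2013-01-16 03:13:40.314 [cssd(28019)]CRS-1612:node node76 (2) at 50% heartbeat fatal, eviction in 14.280 seconds",
    "2013-01-16 03:13:52.306 [cssd(28019)]CRS-1610:node node76 (2) at 90% heartbeat fatal, eviction in 2.290 seconds",
    "2013-01-16 03:14:24.608 [cssd(28019)]CRS-1601:CSSD Reconfiguration complete. Active nodes are node74 .",
]

for ts, node, pct, secs in heartbeat_warnings(sample):
    print(f"{ts}  {node}: {pct}% of the heartbeat timeout used, eviction in {secs:.3f} s")
```

Running it over the whole alert log gives a per-node timeline of missed heartbeats, which makes it easier to see whether evictions only occur after Oracle has been started on both nodes.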
2. After several automatic reboots, the database was in an abnormal state on both nodes. An attempt to start CRS left the VIP and listener resources OFFLINE.
oracle@node74:~> crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....SM1.asm application    ONLINE    ONLINE    node74
ora....74.lsnr application    ONLINE    OFFLINE
ora.node74.gsd application    ONLINE    ONLINE    node74
ora.node74.ons application    ONLINE    ONLINE    node74
ora.node74.vip application    ONLINE    OFFLINE
ora....SM2.asm application    ONLINE    ONLINE    node76
ora....76.lsnr application    ONLINE    OFFLINE
ora.node76.gsd application    ONLINE    ONLINE    node76
ora.node76.ons application    ONLINE    ONLINE    node76
ora.node76.vip application    ONLINE    OFFLINE
ora.orcl.db    application    ONLINE    ONLINE    node74
ora....l1.inst application    ONLINE    ONLINE    node74
ora....l2.inst application    ONLINE    ONLINE    node76
ora....srv1.cs application    ONLINE    OFFLINE
ora....cl1.srv application    ONLINE    OFFLINE
ora....srv2.cs application    ONLINE    OFFLINE
ora....cl2.srv application    ONLINE    OFFLINE
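With this many resources, the ones that should be running but are not can be easy to miss. The following is a minimal sketch that filters `crs_stat -t` output for rows whose Target is ONLINE while the State is OFFLINE. It assumes the five-column layout shown above and that every data row has the type `application`, as in this listing; the function name `offline_resources` is hypothetical.

```python
def offline_resources(crs_stat_output):
    """Return names of resources whose Target is ONLINE but whose State is OFFLINE."""
    offline = []
    for line in crs_stat_output.splitlines():
        fields = line.split()
        # Data rows have 4 fields (no Host value) or 5 fields (with Host).
        # The header and the dashed separator do not have "application"
        # in the second column, so they are skipped.
        if len(fields) in (4, 5) and fields[1] == "application":
            name, _type, target, state = fields[:4]
            if target == "ONLINE" and state == "OFFLINE":
                offline.append(name)
    return offline


# A few rows copied from the listing above.
sample = """\
Name Type Target State Host
------------------------------------------------------------
ora....SM1.asm application ONLINE ONLINE node74
ora....74.lsnr application ONLINE OFFLINE
ora.node74.gsd application ONLINE ONLINE node74
ora.node74.vip application ONLINE OFFLINE
"""

for name in offline_resources(sample):
    print(name)
```

In this case the filter would point straight at the VIP, listener, and service resources, and the VIP is the natural one to investigate first.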
Starting the VIP on its own fails with CRS-1006 and CRS-0215:
oracle@node74:~> crs_start ora.node74.vip
Attempting to start `ora.node74.vip` on member `node74`
Start of `ora.node74.vip` on member `node74` failed.
Attempting to start `ora.node74.vip` on member `node76`
Start of `ora.node74.vip` on member `node76` failed.
CRS-1006: No more members to consider
CRS-0215: Could not start resource 'ora.node74.vip'.
【Problem Analysis】
The VIP log on node 1 (path: $CRS_HOME/log/node74/racg/ora.node74.vip.log) contains the following errors:
2013-01-17 11:45:13.230: [ RACG][3049252608] [1549][3049252608][ora.node74.vip]: checkIf: Default gateway is not defined (host=node74)
Interface eth0 checked failed (host=node74)
Invalid parameters, or failed to bring up VIP (host=node74)
The log shows that no default gateway is configured on node 1. The VIP log on node 2 contains the same error.
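Whether a node actually has a default gateway can be confirmed independently of the VIP log. On Linux, the kernel routing table is exposed in `/proc/net/route`, where the default route has destination `00000000` and the gateway flag set. The sketch below parses that format. It assumes a little-endian host such as x86, where the addresses appear byte-reversed in hex; the function name `default_gateways` and the sample routing table are illustrative, with only the gateway address 172.16.52.254 taken from this article.

```python
import socket
import struct

RTF_GATEWAY = 0x0002  # route flag: the destination is reached through a gateway


def default_gateways(route_table_text):
    """Return (interface, gateway_ip) pairs for default routes in /proc/net/route text."""
    gateways = []
    for line in route_table_text.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4:
            continue
        iface, dest, gateway = fields[0], fields[1], fields[2]
        flags = int(fields[3], 16)
        if dest == "00000000" and flags & RTF_GATEWAY:
            # Assumption: little-endian host, so the hex value is byte-reversed.
            ip = socket.inet_ntoa(struct.pack("<L", int(gateway, 16)))
            gateways.append((iface, ip))
    return gateways


# Hypothetical routing table for a node after the fix; on a real node the text
# would come from open("/proc/net/route").read().
SAMPLE = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
    "eth0\t003410AC\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0\n"
    "eth0\t00000000\tFE3410AC\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"
)

found = default_gateways(SAMPLE)
if found:
    for iface, ip in found:
        print(f"default gateway {ip} via {iface}")
else:
    print("no default gateway defined")
```

An empty result on either node would correspond to the "Default gateway is not defined" message in the VIP log.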
【Solution】
After a default gateway (172.16.52.254) was configured on both nodes, all resources started normally and the nodes no longer rebooted automatically.
【Summary】
The $CRS_HOME/bin/racgvip file contains the following comments:
# When the script sets the VIP to an interface, it adds a route to default
# gateway for that interface. It makes sure the node will use the interface
# which VIP is set for going network traffic. ……
# - Variable FAIL_WHEN_DEFAULTGW_NO_FOUND to configure if checkIf() returns
# failure when default gateway is not found. If mii-tool works,
# default gateway is not needed in checkIf().
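These comments show that the default gateway configuration is checked when the VIP starts. If no default gateway is configured, checkIf() returns failure and the VIP cannot start.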