Notes on troubleshooting a RAC VIP failure

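This RAC VIP failure was caused by a fault on ent3, the network adapter that carries the VIP (configured as an EtherChannel, i.e. an active/backup adapter bond). As a result, node 1's VIP failed over to node 2.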
$ crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora....b1.inst application ONLINE ONLINE crmdb01
ora....b2.inst application ONLINE ONLINE crmdb02
ora....db2.srv application ONLINE ONLINE crmdb02
ora....srv1.cs application ONLINE ONLINE crmdb02
ora.crmdb.db application ONLINE ONLINE crmdb02
ora....01.lsnr application ONLINE OFFLINE
ora....b01.gsd application ONLINE ONLINE crmdb01
ora....b01.ons application ONLINE ONLINE crmdb01
ora....b01.vip application ONLINE ONLINE crmdb02
ora....02.lsnr application ONLINE ONLINE crmdb02
ora....b02.gsd application ONLINE ONLINE crmdb02
ora....b02.ons application ONLINE ONLINE crmdb02
ora....b02.vip application ONLINE ONLINE crmdb02
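The two abnormal rows in this listing (a listener whose State does not match its Target, and two VIPs hosted on crmdb02) can also be picked out mechanically. Below is a minimal Python sketch, assuming the five-column `crs_stat -t` layout shown above. The function name `find_anomalies` and the "more than one `.vip` on a host" heuristic are our own illustrative assumptions, not part of any Oracle tool.

```python
from collections import Counter

# A few rows copied from the crs_stat -t output above.
SAMPLE = """\
Name Type Target State Host
ora....b1.inst application ONLINE ONLINE crmdb01
ora....01.lsnr application ONLINE OFFLINE
ora....b01.vip application ONLINE ONLINE crmdb02
ora....b02.vip application ONLINE ONLINE crmdb02
"""

def find_anomalies(text):
    """Return (resources whose State != Target, hosts carrying >1 VIP)."""
    mismatches = []
    vip_hosts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip the header, separators and anything that is not a resource row.
        if len(fields) < 4 or not fields[0].startswith("ora."):
            continue
        name, _type, target, state = fields[:4]
        host = fields[4] if len(fields) > 4 else None  # blank when OFFLINE
        if target != state:
            mismatches.append(name)
        if name.endswith(".vip") and state == "ONLINE" and host:
            vip_hosts[host] += 1
    crowded = sorted(h for h, n in vip_hosts.items() if n > 1)
    return mismatches, crowded

mismatches, crowded = find_anomalies(SAMPLE)
print("target/state mismatch:", mismatches)
print("hosts carrying more than one VIP:", crowded)
```

Because `crs_stat -t` abbreviates resource names, the sketch cannot tell which node a VIP belongs to; it only reports that some host is carrying more VIPs than it normally would.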
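The fix itself is fairly simple: replace the faulty adapter and restart nodeapps on node 1, after which the VIP automatically moves back from node 2 to node 1.
Still, this incident is a good excuse to dig a little deeper: what actually happens behind a RAC VIP failover?
When the fault occurred on node 1, some errors were visible at the operating-system level: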
$ netstat -in
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 0.11.25.be.50.e9 2364166277 0 1352130944 371 0
en0 1500 3.3.22 3.3.22.1 2364166277 0 1352130944 371 0
en3 1500 link#3 0.11.25.be.4d.41 3591277841 0 1817998840 5 0
en3 1500 130.36.23 130.36.23.8 3591277841 0 1817998840 5 0
lo0 16896 link#1 1335635349 0 1335747477 0 0
lo0 16896 127 127.0.0.1 1335635349 0 1335747477 0 0
lo0 16896 ::1 1335635349 0 1335747477 0 0
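As a quick way to read the listing above, the following Python sketch pulls out the interfaces whose Oerrs counter is non-zero. It assumes the column layout shown here; `output_errors` is a hypothetical helper name. Since the Address column can be empty (as on the lo0 link row), the counters are indexed from the right-hand side.

```python
# Header plus the link-level rows from the netstat -in output above.
SAMPLE = """\
Name Mtu Network Address Ipkts Ierrs Opkts Oerrs Coll
en0 1500 link#2 0.11.25.be.50.e9 2364166277 0 1352130944 371 0
en3 1500 link#3 0.11.25.be.4d.41 3591277841 0 1817998840 5 0
lo0 16896 link#1 1335635349 0 1335747477 0 0
"""

def output_errors(text):
    """Map interface name -> Oerrs for link-level rows with Oerrs > 0."""
    result = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Keep only one row per interface: the link#N row.
        if len(fields) < 7 or not fields[2].startswith("link#"):
            continue
        # The last five columns are Ipkts Ierrs Opkts Oerrs Coll.
        oerrs = int(fields[-2])
        if oerrs > 0:
            result[fields[0]] = oerrs
    return result

print(output_errors(SAMPLE))
```

These counters are cumulative, so a non-zero value is only a hint. Comparing two samples taken some time apart says more about whether an interface is failing right now.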

$ errpt
IDENTIFIER TIMESTAMP T C RESOURCE_NAME DESCRIPTION
173C787F 0416124011 I S topsvcs Possible malfunction on local adapter
4FC185D1 0416124011 T H ent1 TRANSMIT FAILURE
173C787F 0416095911 I S topsvcs Possible malfunction on local adapter
4FC185D1 0416095811 T H ent1 TRANSMIT FAILURE
4FC185D1 0416065011 T H ent1 TRANSMIT FAILURE
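The TIMESTAMP column in the errpt summary is printed as MMDDhhmmYY, which is easy to misread. The following Python sketch decodes one summary line so it can be lined up against the Oracle logs. `parse_errpt_line` is a hypothetical helper, and mapping the two-digit year to 20YY is an assumption that fits this incident.

```python
from datetime import datetime

def parse_errpt_line(line):
    """Split one errpt summary line and decode its MMDDhhmmYY timestamp."""
    identifier, stamp, err_type, err_class, resource, description = \
        line.split(None, 5)
    month, day, hour, minute, year = (
        int(stamp[i:i + 2]) for i in range(0, 10, 2)
    )
    return {
        "identifier": identifier,
        # Assumption: two-digit years belong to the 2000s.
        "when": datetime(2000 + year, month, day, hour, minute),
        "type": err_type,
        "class": err_class,
        "resource": resource,
        "description": description,
    }

record = parse_errpt_line("4FC185D1 0416124011 T H ent1 TRANSMIT FAILURE")
print(record["when"].isoformat(), record["resource"], record["description"])
```

The summary only has minute resolution; the detailed `errpt -a` output carries the seconds.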

The more detailed error entries are shown below:
$ errpt -a -j 4FC185D1|more
---------------------------------------------------------------------------
LABEL: GOENT_TX_ERR
IDENTIFIER: 4FC185D1

Date/Time: Sat Apr 16 12:40:04 BEIST 2011
Sequence Number: 10413
Machine Id: 00CE37F34C00
Node Id: crmdb01
Class: H
Type: TEMP
Resource Name: ent1
Resource Class: adapter
Resource Type: 14106802
Location: U5791.001.99B18ND-P1-C06-T1
VPD:
Product Specific.( ).......Gigabit Ethernet-SX PCI-X Adapter
Part Number.................10N8586
FRU Number..................10N8586
EC Level....................D76267
Manufacture ID..............YL1021
Network Address.............001125BE4D41
ROM Level.(alterable).......GOL021

Description
TRANSMIT FAILURE

Recommended Actions
PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
FILE NAME
line: 2187 file: goent_tx.c
PCI ETHERNET STATISTICS
0000 25C5 0063 081B 0000 0003 0000 0003 0000 0000 0000 0000 0000 0000 0000 00DA
0000 010C D192 B18E 0001 B2FA DD4E 1CFC 0000 0041 1C93 93A5 0000 0000 0031 20A1
0000 00EE 256D C53E 0002 3042 90A3 0EE5 0000 0000 0000 0000 0000 0001 0001 B321
0000 09DF 0000 0000 0000 0000 0000 01DF 0000 000F 0000 0205 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 BBA3 087C 0200 D400 4120 8000 01A0 0000 0000
0230 0156 0009 F007 0443 C808 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000
DEVICE DRIVER INTERNAL STATE
2222 2222 256D C53E 0000 00C8
SOURCE ADDRESS
0011 25BE 4D41
---------------------------------------------------------------------------
LABEL: GOENT_TX_ERR
IDENTIFIER: 4FC185D1
$ errpt -a -j 173C787F|more
---------------------------------------------------------------------------
LABEL: TS_LOC_DOWN_ST
IDENTIFIER: 173C787F

Date/Time: Sat Apr 16 12:40:21 BEIST 2011
Sequence Number: 10414
Machine Id: 00CE37F34C00
Node Id: crmdb01
Class: S
Type: INFO
Resource Name: topsvcs

Description
Possible malfunction on local adapter

Probable Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Failure Causes
Local adapter mal-functioned
Local adapter lost connection to network
Local adapter mis-configured

Recommended Actions
Verify adapter configuration
Verify network connectivity

Detail Data
DETECTING MODULE
rsct,nim_control.C,1.39.1.21,4983
ERROR ID
6zV5DL.pqFeB/ThN//Ml.1....................
REFERENCE CODE

Adapter interface name
en3
Adapter offset
0
Adapter IP address
130.36.23.8
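Since this was a hardware fault, we will not go through the OS logs in detail. What we care about is what Oracle did at the moment of the failure.
When the fault occurred, racg first detected that the VIP had failed and then checked the VIP again (racgvip check crmdb01), recording the result in ora.crmdb01.vip.log: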
2011-04-16 12:40:13.049: [ RACG][1] [4276526][1][ora.crmdb01.vip]: Invalid parameters, or failed to bring up VIP (host=crmdb01)

2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: env ORACLE_CONFIG_HOME=/opt/oracle/product/10.2.0.4/crs

2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: cmd = /opt/oracle/product/10.2.0.4/crs/bin/racgeut -e _USR_ORA_DEBUG=0 54 /opt/oracle/product/10.2.0.4/crs/bin/racgvip check crmdb01

2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: clsrcexecut: rc = 1, time = 4.405s

2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: end for resource = ora.crmdb01.vip, action = check, status = 1, time = 4.572s
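The last line of this log is the one that matters: the check action ended with status 1 after about 4.6 seconds. As a minimal sketch, assuming the "end for resource = ..." wording shown above, the following Python snippet extracts those fields. `parse_racg_end` is a hypothetical helper name.

```python
import re

# Matches the summary line racg writes when an action finishes.
END_RE = re.compile(
    r"end for resource = (?P<resource>[^,\s]+), action = (?P<action>\w+), "
    r"status = (?P<status>\d+), time = (?P<seconds>[\d.]+)s"
)

def parse_racg_end(line):
    """Return the action summary as a dict, or None if the line does not match."""
    m = END_RE.search(line)
    if not m:
        return None
    return {
        "resource": m.group("resource"),
        "action": m.group("action"),
        "status": int(m.group("status")),
        "seconds": float(m.group("seconds")),
    }

line = ("2011-04-16 12:40:13.054: [ RACG][1] [4276526][1][ora.crmdb01.vip]: "
        "end for resource = ora.crmdb01.vip, action = check, status = 1, "
        "time = 4.572s")
print(parse_racg_end(line))
```

In this log a non-zero status on a check action is what CRS reports as a CheckResource error.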
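Once the check had finished and confirmed the problem, the crs process carried out the VIP failover. The log shows that after crs saw the VIP go offline unexpectedly (OFFLINE unexpectedly), it stopped the VIP and the resources on crmdb01 that depend on it (the service and the listener),
and then relocated the resource ora.crmdb.crmsrv1.crmdb2.srv to crmdb02, i.e. node 2.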
2011-04-16 12:40:13.058: [ CRSAPP][11051]32CheckResource error for ora.crmdb01.vip error code = 1
2011-04-16 12:40:13.071: [ CRSRES][11051]32In stateChanged, ora.crmdb01.vip target is ONLINE
2011-04-16 12:40:13.072: [ CRSRES][11051]32ora.crmdb01.vip on crmdb01 went OFFLINE unexpectedly
2011-04-16 12:40:13.072: [ CRSRES][11051]32StopResource: setting CLI values
2011-04-16 12:40:13.086: [ CRSRES][11051]32Attempting to stop `ora.crmdb01.vip` on member `crmdb01`
2011-04-16 12:40:13.487: [ CRSRES][11312]32In stateChanged, ora.crmdb.crmsrv1.crmdb2.srv target is ONLINE
2011-04-16 12:40:13.487: [ CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv on crmdb01 went OFFLINE unexpectedly
2011-04-16 12:40:13.488: [ CRSRES][11312]32StopResource: setting CLI values
2011-04-16 12:40:13.520: [ CRSRES][11312]32Attempting to stop `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb01`
2011-04-16 12:40:13.636: [ CRSRES][11051]32Stop of `ora.crmdb01.vip` on member `crmdb01` succeeded.
2011-04-16 12:40:13.636: [ CRSRES][11051]32ora.crmdb01.vip RESTART_COUNT=0 RESTART_ATTEMPTS=0
2011-04-16 12:40:13.650: [ CRSRES][11051]32ora.crmdb01.vip failed on crmdb01 relocating.
2011-04-16 12:40:13.770: [ CRSRES][11051]32StopResource: setting CLI values
2011-04-16 12:40:13.786: [ CRSRES][11051]32Attempting to stop `ora.crmdb01.LISTENER_CRMDB01.lsnr` on member `crmdb01`
2011-04-16 12:40:14.093: [ CRSRES][11312]32Stop of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb01` succeeded.
2011-04-16 12:40:14.094: [ CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv RESTART_COUNT=0 RESTART_ATTEMPTS=0
2011-04-16 12:40:14.105: [ CRSRES][11312]32ora.crmdb.crmsrv1.crmdb2.srv failed on crmdb01 relocating.
2011-04-16 12:40:14.150: [ CRSRES][11312]32Attempting to start `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02`
2011-04-16 12:40:14.442: [ CRSRES][11312]32Start of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02` succeeded.
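To see how long each step of the relocation took, the "Attempting to ..." lines can be paired with the matching "... succeeded." lines. The Python sketch below does this for a few lines copied from the log above. The regular expression is an assumption based only on the wording seen in this excerpt, and `action_durations` is a hypothetical helper name.

```python
import re
from datetime import datetime

LOG = """\
2011-04-16 12:40:13.086: [ CRSRES][11051]32Attempting to stop `ora.crmdb01.vip` on member `crmdb01`
2011-04-16 12:40:13.520: [ CRSRES][11312]32Attempting to stop `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb01`
2011-04-16 12:40:13.636: [ CRSRES][11051]32Stop of `ora.crmdb01.vip` on member `crmdb01` succeeded.
2011-04-16 12:40:14.093: [ CRSRES][11312]32Stop of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb01` succeeded.
2011-04-16 12:40:14.150: [ CRSRES][11312]32Attempting to start `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02`
2011-04-16 12:40:14.442: [ CRSRES][11312]32Start of `ora.crmdb.crmsrv1.crmdb2.srv` on member `crmdb02` succeeded.
"""

EVENT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d\.\d{3}): .*?"
    r"(?:Attempting to (?P<begin>stop|start)|(?P<end>Stop|Start) of) "
    r"`(?P<res>[^`]+)` on member `(?P<node>[^`]+)`"
)

def action_durations(text):
    """Return [(action, resource, node, milliseconds)] for completed actions."""
    pending = {}
    done = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S.%f")
        action = (m.group("begin") or m.group("end")).lower()
        key = (action, m.group("res"), m.group("node"))
        if m.group("begin"):
            pending[key] = ts  # remember when the attempt started
        elif key in pending and line.rstrip().endswith("succeeded."):
            delta = ts - pending.pop(key)
            done.append(key + (round(delta.total_seconds() * 1000),))
    return done

for action, res, node, ms in action_durations(LOG):
    print(f"{action:5} {res} on {node}: {ms} ms")
```

Each step here completes in well under a second, which is very different from the minutes-long stop in the Metalink case quoted later.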

At the same time, the crs log on node 2 shows:
2011-04-16 12:40:14.148: [ CRSRES][11617]32startRunnable: setting CLI values
2011-04-16 12:40:24.488: [ CRSRES][12145]32CRS-1002: Resource 'ora.crmdb.crmsrv1.cs' is already running on member 'crmdb02'

Note that a VIP failure can even cause every resource that depends on the VIP to be stopped:
If the VIP fails for any reason and cannot be restarted, CRS will bring down all dependent resources, including the Listener, ASM instance and database instance. CRS will attempt to bring these resources down gracefully - hence, a shutdown immediate will be issued, and will be seen in the alert log of the ASM instance - no errors will be evident in the alert log for the ASM instance.
The following comes from a Metalink case (ID 277274.1); this failure is common on 10.1:
`ora.rmsclnxclu1.vip` on `rmsclnxclu1` went OFFLINE unexpectedly
2004-06-21 21:21:05.562: Attempting to stop `ora.rmsclnxclu1.vip` on member `rmsclnxclu1`
RTD #0: Action Script /home/oracle/product/crs/bin/racgwrap(stop) timed out for ora.rmsclnxclu1.vip! (timeout=60)
2004-06-21 21:22:16.472: [RTI:884782] StopResource error for ora.rmsclnxclu1.vip error code = 1
2004-06-21 21:22:18.611: `ora.rmsclnxclu1.vip` on member `rmsclnxclu1` has experienced an unrecoverable failure.
2004-06-21 21:22:18.611: Human intervention required to resume its availability.
2004-06-21 21:22:18.790: [RUNNABLELISTENER:884782] Resource failed into UNKNOWN, killing dependents
`ora.rmsclnxclu1.vip` experienced a failure on `rmsclnxclu1`. Stopping dependent resources.
2004-06-21 21:22:20.525: Attempting to stop `ora.gofod.gofod1.inst` on member `rmsclnxclu1`
2004-06-21 21:25:38.531: Stop of `ora.gofod.gofod1.inst` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:38.611: Attempting to stop `ora.rmsclnxclu1.LISTENER_rmsclnxclu1.lsnr` on member `rmsclnxclu1`
2004-06-21 21:25:38.983: Stop of `ora.rmsclnxclu1.LISTENER_rmsclnxclu1.lsnr` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:39.041: Attempting to stop `ora.rmsclnxclu1.ASM1.asm` on member `rmsclnxclu1`
2004-06-21 21:25:46.669: Stop of `ora.rmsclnxclu1.ASM1.asm` on member `rmsclnxclu1` succeeded.
2004-06-21 21:25:46.728: Attempting to stop `ora.rmsclnxclu1.vip` on member `rmsclnxclu1`
2004-06-21 21:25:55.547: Stop of `ora.rmsclnxclu1.vip` on member `rmsclnxclu1` succeeded.
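The trigger in this case is the line reporting that the stop action script timed out after 60 seconds. As a small sketch, assuming the exact wording of that "RTD" line, the following Python snippet extracts the script, action, resource and timeout so such events can be searched for in a crsd log. `find_timeouts` is a hypothetical helper name.

```python
import re

TIMEOUT_RE = re.compile(
    r"Action Script (?P<script>\S+)\((?P<action>\w+)\) timed out for "
    r"(?P<resource>\S+)! \(timeout=(?P<timeout>\d+)\)"
)

def find_timeouts(text):
    """Return one dict per 'Action Script ... timed out' line in the text."""
    events = []
    for m in TIMEOUT_RE.finditer(text):
        event = m.groupdict()
        event["timeout"] = int(event["timeout"])  # seconds, as logged
        events.append(event)
    return events

LINE = ("RTD #0: Action Script /home/oracle/product/crs/bin/racgwrap(stop) "
        "timed out for ora.rmsclnxclu1.vip! (timeout=60)")
print(find_timeouts(LINE))
```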

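If you run into the failure above, or the VIP frequently goes offline on its own, the following approach can help:
(1) Enable VIP tracing, so that if the VIP fails again you can get more detailed log information.
Enable VIP tracing: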
# crsctl debug log res ora.node1.vip:1
Set Resource Debug Module: ora.node1.vip Level: 1
Disable VIP tracing:
# crsctl debug log res ora.node1.vip:0
Set Resource Debug Module: ora.node1.vip Level: 0
In 11g R2 the syntax for enabling tracing changes to:
# crsctl set log res "ora.rmntops1.vip.com:1"

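(2) Change the VIP check interval and the script timeout: raise the check interval from the default 30 seconds to 120 seconds, and the script timeout from 60 seconds to 120 seconds.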
1. Create the .cap file for each vip resource (on each node):

./crs_stat -p ora.rmsclnxclu1.vip > /tmp/ora.rmsclnxclu1.vip.cap

2. Then, update the .cap file using the following syntax and values:

./crs_profile -update ora.rmsclnxclu1.vip -dir /tmp -o ci=120,st=120

(Where ci = the CHECK_INTERVAL and st = the SCRIPT_TIMEOUT value.)

3. Finally, re-register it using the '-u' option:

./crs_register ora.rmsclnxclu1.vip -dir /tmp -u
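Before and after running the three steps above, it can help to confirm what the profile actually contains. The Python sketch below assumes the `.cap` file written by `crs_stat -p` consists of KEY=value lines. The values in `SAMPLE_CAP` are illustrative, not taken from a real system, and `parse_cap` and `below_target` are hypothetical helper names.

```python
# Illustrative profile fragment; a real .cap file has many more attributes.
SAMPLE_CAP = """\
NAME=ora.rmsclnxclu1.vip
TYPE=application
CHECK_INTERVAL=30
SCRIPT_TIMEOUT=60
"""

def parse_cap(text):
    """Parse KEY=value lines into a dict, ignoring blanks and comments."""
    profile = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        profile[key.strip()] = value.strip()
    return profile

def below_target(profile, targets):
    """Return {attribute: current value} for attributes below their target."""
    low = {}
    for key, wanted in targets.items():
        current = int(profile.get(key, 0))
        if current < wanted:
            low[key] = current
    return low

profile = parse_cap(SAMPLE_CAP)
print(below_target(profile, {"CHECK_INTERVAL": 120, "SCRIPT_TIMEOUT": 120}))
```

An empty result after re-registering suggests that both attributes picked up the new values.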

(3) On 10.1, you can remove the VIP dependency from the ASM resource:
ASM resource name is in the form of ora.<nodename>.<ASM instance name>.asm.
VIP resource name is in the form of ora.<nodename>.vip
- crs_stat -p <ASM resource name> > /tmp/<ASM resource name>.cap
- Edit /tmp/<ASM resource name>.cap to remove VIP resource name from the REQUIRED_RESOURCES attribute.
- crs_register -u <ASM resource name> -dir /tmp
- Use "crs_stat -p <ASM resource name>" to verify that the REQUIRED_RESOURCES attribute has been updated.
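The manual edit of the `.cap` file in the steps above can be sketched as a small text transformation. This Python example assumes that REQUIRED_RESOURCES holds a whitespace-separated list of resource names on a single KEY=value line. The sample profile is illustrative, and `drop_required` is a hypothetical helper, not an Oracle utility.

```python
# Illustrative ASM profile fragment (values are not from a real system).
SAMPLE_CAP = """\
NAME=ora.rmsclnxclu1.ASM1.asm
REQUIRED_RESOURCES=ora.rmsclnxclu1.vip
SCRIPT_TIMEOUT=600
"""

def drop_required(cap_text, resource):
    """Return cap_text with `resource` removed from REQUIRED_RESOURCES."""
    out = []
    for line in cap_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "REQUIRED_RESOURCES":
            # Keep every dependency except the one being removed.
            kept = [r for r in value.split() if r != resource]
            line = "REQUIRED_RESOURCES=" + " ".join(kept)
        out.append(line)
    return "\n".join(out) + "\n"

print(drop_required(SAMPLE_CAP, "ora.rmsclnxclu1.vip"), end="")
```

All other lines pass through unchanged, so the result can be written back to the same `.cap` file before running `crs_register -u`.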