RAC 11.2.0.3 LISTENER异常终止

最新推荐文章于 2023-04-26 17:05:11 发布

cri5768

最新推荐文章于 2023-04-26 17:05:11 发布

阅读量527

点赞数 1

文章标签：数据库网络操作系统

今天同事遇到一个RAC监听异常终止的问题，版本11.2.0.3操作系统AIX6.1，如下：
10-MAY-2014 11:44:16 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=__jdbc__)(USER=))(SERVER=DEDICATED)(SERVICE_NAME=zhdw)) * (ADDRESS=(PROTOC
OL=tcp)(HOST=10.195.160.162)(PORT=59755)) * establish * zhdw * 0
Sat May 10 11:44:27 2014
10-MAY-2014 11:44:27 * (CONNECT_DATA=(CID=(PROGRAM=)(HOST=zhdwdb1)(USER=grid))(COMMAND=status)(ARGUMENTS=64)(SERVICE=LISTENER_SCAN1)
(VERSION=186647296)) * status * 0
10-MAY-2014 11:44:35 * service_died * LsnrAgt * 12537
可以看到service_died 。以前遇到过died的问题是由于数据库异常终止了，而且我还特意关注过，但是这里不一样
因为实例根本没有问题。所以就和同事一起进行了分析。

既然是RAC就要考虑到CRS是否进行了异常的LISTENER终止，所以首先看了CRS的ALTER日志，如下：
2014-05-13 11:59:40.629
[/u01/app/11.2.3/grid/bin/oraagent.bin(8847514)]CRS-5016:Process "/u01/app/11.2.3/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.3/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.3/grid/log/zhdwdb2/agent/crsd/oraagent_grid/oraagent_grid.log"
2014-05-13 11:59:40.907
[/u01/app/11.2.3/grid/bin/oraagent.bin(8847514)]CRS-5016:Process "/u01/app/11.2.3/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.3/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.3/grid/log/zhdwdb2/agent/crsd/oraagent_grid/oraagent_grid.log"
2014-05-13 12:01:23.234
可以看到ORAAGENT进行检查的时候出现了错误，然后查看oraagent_grid.log日志。
2014-05-13 11:59:40.257: [ A**][2314] {0:3:13726} Agent received the message: RESOURCE_STOP[ora.ons zhdwdb2 1] ID 4099:28335
2014-05-13 11:59:40.258: [ A**][2314] {0:3:13726} Preparing STOP command for: ora.ons zhdwdb2 1
2014-05-13 11:59:40.258: [ A**][2314] {0:3:13726} ora.ons zhdwdb2 1 state changed from: ONLINE to: STOPPING
2014-05-13 11:59:40.262: [ A**][2314] {0:3:13726} Agent received the message: RESOURCE_STOP[ora.LISTENER.lsnr zhdwdb2 1] ID 4099:28336
2014-05-13 11:59:40.262: [ A**][2314] {0:3:13726} Preparing STOP command for: ora.LISTENER.lsnr zhdwdb2 1
2014-05-13 11:59:40.262: [ora.ons][2572] {0:3:13726} [stop] (:CLSN00108:) clsn_agent::stop {
2014-05-13 11:59:40.262: [ A**][2314] {0:3:13726} ora.LISTENER.lsnr zhdwdb2 1 state changed from: ONLINE to: STOPPING
2014-05-13 11:59:40.263: [ora.ons][2572] {0:3:13726} [stop] OnsAgent::stop {
2014-05-13 11:59:40.268: [ora.LISTENER.lsnr][1800] {0:3:13726} [stop] (:CLSN00108:) clsn_agent::stop {
2014-05-13 11:59:40.268: [ora.LISTENER.lsnr][1800] {0:3:13726} [stop] LsnrAgent::stop {
2014-05-13 11:59:40.268: [ USRTHRD][1800] {0:3:13726} Thread:RegEndpointThread:LISTENER stop {
2014-05-13 11:59:40.270: [ USRTHRD][5157] {0:3:13726} Thread:RegEndpointThread:LISTENER isRunning is reset to false here
2014-05-13 11:59:40.270: [ USRTHRD][1800] {0:3:13726} Thread:RegEndpointThread:LISTENER stop }
2014-05-13 11:59:40.270: [ora.LISTENER.lsnr][1800] {0:3:13726} [stop] lsnrctl stop LISTENER

可以看到CRS触发了LISTENER的关闭，然后进一步查看orarootagent_root.log查看network资源是否异常发现

2014-05-13 11:59:40.184: [ora.net1.network][2057] {2:64347:2} [check] got lock
2014-05-13 11:59:40.184: [ora.net1.network][2057] {2:64347:2} [check] tryActionLock }
2014-05-13 11:59:40.184: [ora.net1.network][2057] {2:64347:2} [check] abort }
2014-05-13 11:59:40.185: [ora.net1.network][2057] {2:64347:2} [check] (:CLSN00110:) clsn_agent::abort }
2014-05-13 11:59:40.185: [ A**][2057] {2:64347:2} Command: check for resource: ora.net1.network zhdwdb2 1 completed with status: TIMEDOUT
2014-05-13 11:59:40.185: [ A**][2314] {2:64347:2} ora.net1.network zhdwdb2 1 state changed from: ONLINE to: UNKNOWN
2014-05-13 11:59:40.185: [ A**][2314] {2:64347:2} ora.net1.network zhdwdb2 1 would be continued to monitored!
2014-05-13 11:59:40.185: [ A**][2314] {2:64347:2} Switching offline monitor to intermedite one
2014-05-13 11:59:40.186: [ A**][2314] {2:64347:2} Started implicit monitor for [ora.net1.network zhdwdb2 1] interval=1000 delay=1000
2014-05-13 11:59:40.186: [ A**][2314] {2:64347:2} ora.net1.network zhdwdb2 1 state details has changed from: to: CHECK TIMED OUT
2014-05-13 11:59:40.186: [ A**][2314] {0:3:13726} Generating new Tint for unplanned state change. Original Tint: {2:64347:2}
2014-05-13 11:59:40.186: [ A**][2314] {0:3:13726} Agent sending message to PE: RESOURCE_STATUS[Proxy] ID 20481:1574940
2014-05-13 11:59:41.228: [ A**][2314] {2:64347:2} ora.net1.network zhdwdb2 1 state changed from: UNKNOWN to: ONLINE
2014-05-13 11:59:41.228: [ A**][2314] {2:64347:2} Switching offline monitor to online one
发现NETWORK资源在短时时间内由ONLINE 到UNKOWN状态然后由UNKOWN到ONLINE状态。这可能是由于网络颤抖引起。

所以问题基本定位，由于NETWORK资源出现问题导致

下面说明为什么NETWROK资源出现问题监听会出现问题。
LISTERN--VIP--PUBLIC IP --network（网卡）
SCAN LISTERN--SCAN IP --PUBLIC IP --network（网卡）
可以看到只要NETWROK资源出现问题就会影响到上层的各种资源。

然后解释下为什么LOCAL LISTENER出现问题也会导致连接问题
client--- scan listener（根据负载分发，确定进入的实例）----LOCAL LISTENER（这一环出了问题还是不行）---进入数据库

可以看到只要LOCAL LISTENER出现问题也会导致不能连接。
所以问题可能是
1、网络颤抖
2、各种BUG

查看METALINK发现2篇文章说这个问题如下：
1、

Agent Check Timeout for Network and ACFS resources occasionally in AIX (文档 ID 1558510.1)

APPLIES TO:
Oracle Database - Enterprise Edition - Version 11.2.0.2 and later
IBM AIX on POWER Systems (64-bit)
SYMPTOMS

This is for AIX only. The problem is AIX specific.

The cluster alert.log shows check timeout errors for acfs and network resource occasionally:

< [/u01/app/11.2.0/grid/bin/orarootagent.bin(7143488)]CRS-5818:Aborted command 'check for resource: ora.drivers.acfs 1 1' for resource 'ora.drivers.acfs'. Details at (:CRSAGF00113:) {0:0:2} in /u01/app/11.2.0/grid/log/p01dou412/agent/ohasd/orarootagent_root/orarootagent_root.log.
< [/u01/app/11.2.0/grid/bin/orarootagent.bin(7143488)]CRS-5014:Agent "/u01/app/11.2.0/grid/bin/orarootagent.bin" timed out starting process "/u01/app/11.2.0/grid/bin/acfsload" for action "check": details at "(:CLSN00009:)" in "/u01/app/11.2.0/grid/log/p01dou412/agent/ohasd/orarootagent_root/orarootagent_root.log"
< [/u01/app/11.2.0/grid/bin/orarootagent.bin(20054094)]CRS-5818:Aborted command 'check for resource: ora.net1.network p01dou412 1' for resource 'ora.net1.network'. Details at (:CRSAGF00113:) {0:17:2} in /u01/app/11.2.0/grid/log/p01dou412/agent/crsd/orarootagent_root/orarootagent_root.log.

CAUSE

A defect with AIX causes the unix command "pwd" to hang for a minute or two.
This affects all LPARs on the server at the same time.

The OSWatcher also hangs for a minute or two because it uses pwd command, so have the customer set up OSwatcher and check if the OSWatcher hangs at the same time the check timeout error occurs. If OSWatcher hangs at the same time for about a minute or two, then the server is running into this problem.

SOLUTION

Resolved by upgrading AIX and firmware to the following:

Firmware
IBM,9117-MMB
The current permanent system firmware image is AM730_087
The current temporary system firmware image is AM730_087
VIO server version 2.2.1.4
AIX version 6100-07-04-1216

After the above patching, the problem is resolved.

2、
VIP, SCAN VIP/Listener Fails Over and Listener Stops After Short Public Network Hiccup (文档 ID 1333165.1)

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.1 to 12.1.0.1 [Release 11.2 to 12.1]
Information in this document applies to any platform.
SYMPTOMS

After check timed out, 11gR2 Grid Infrastructure network resource (usually ora.net1.network) goes to INTERMEDIATE state, then goes back to ONLINE very shortly. This note will not discuss cause of check time out, but most common cause is public network hiccup.

Once network resource goes into INTERMEDIATE state, it may trigger VIP, service, SCAN VIP/SCAN listener, ora.cvu and ora.ons etc to be failed over/go offline due to resource dependence, which could result in unnecessary connectivity issue for that period of time. After network resource is back online, affected resources may not come back online.

$GRID_HOME/log//crsd/crsd.log
2011-06-12 07:12:31.261: [ A**][10796] {0:1:2881} Received state change for ora.net1.network racnode1 1 [old state = ONLINE, new state = UNKNOWN]
2011-06-12 07:12:31.261: [ A**][10796] {0:1:2881} Received state LABEL change for ora.net1.network racnode1 1 [old label = , new label = CHECK TIMED OUT]
..
2011-06-12 07:12:31.297: [ CRSPE][12081] {0:1:2881} RI [ora.net1.network racnode1 1] new external state [INTERMEDIATE] old value: [ONLINE] on racnode1 label = [CHECK TIMED OUT]
..
2011-06-12 07:12:31.981: [ A**][10796] {0:1:2882} Received state change for ora.net1.network racnode1 1 [old state = UNKNOWN, new state = ONLINE]
..
2011-06-12 07:12:32.307: [ CRSPE][12081] {0:1:2881} RI [ora.LISTENER.lsnr racnode1 1] new internal state: [STOPPING] old value: [STABLE]
2011-06-12 07:12:32.308: [ CRSPE][12081] {0:1:2881} CRS-2673: Attempting to stop 'ora.LISTENER.lsnr' on 'racnode1'
$GRID_HOME/log//agent/crsd/orarootagent_root/orarootagent_root.log
2011-06-12 07:12:08.965: [ A**][2070] {1:27767:2} Created alert : (:CRSAGF00113:) : Aborting the command: check for resource: ora.net1.network racnode1 1
2011-06-12 07:12:08.966: [ora.net1.network][2070] {1:27767:2} [check] clsn_agent::abort {
..
2011-06-12 07:12:31.257: [ A**][2070] {1:27767:2} Command: check for resource: ora.net1.network racnode1 1 completed with status: TIMEDOUT
2011-06-12 07:12:31.258: [ A**][2314] {1:27767:2} ora.net1.network racnode1 1 state changed from: ONLINE to: UNKNOWN
2011-06-12 07:12:31.258: [ A**][2314] {1:27767:2} ora.net1.network racnode1 1 would be continued to monitored!
2011-06-12 07:12:31.258: [ A**][2314] {1:27767:2} ora.net1.network racnode1 1 state details has changed from: to: CHECK TIMED OUT
..
2011-06-12 07:12:31.923: [ora.net1.network][2314][F-ALGO] {1:27767:2} CHECK initiated by timer for: ora.net1.network racnode1 1
..
2011-06-12 07:12:31.973: [ora.net1.network][8502][F-ALGO] {1:27767:2} [check] Command check for resource: ora.net1.network racnode1 1 completed with status ONLINE
2011-06-12 07:12:31.978: [ A**][2314] {1:27767:2} ora.net1.network racnode1 1 state changed from: UNKNOWN to: ONLINE
$GRID_HOME/log//agent/crsd/oraagent_/oraagent_.log
2011-06-12 07:12:32.335: [ A**][2314] {0:1:2881} Agent received the message: RESOURCE_STOP[ora.LISTENER.lsnr racnode1 1] ID 4099:14792
2011-06-12 07:12:32.335: [ A**][2314] {0:1:2881} Preparing STOP command for: ora.LISTENER.lsnr racnode1 1
2011-06-12 07:12:32.335: [ A**][2314] {0:1:2881} ora.LISTENER.lsnr racnode1 1 state changed from: ONLINE to: STOPPING

$GRID_HOME/log//alert.log
2012-01-10 06:48:18.474 [/ocw/grid/bin/orarootagent.bin(10485902)]CRS-5818:Aborted command 'check for resource: ora.net1.network racnode1 1' for resource 'ora.net1.network'. Details at (:CRSAGF00113:) {1:24200:2} in /ocw/grid/log/racnode1/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-01-10 06:48:43.481 [/ocw/grid/bin/oraagent.bin(8847542)]CRS-5016:Process "/ocw/grid/bin/lsnrctl" spawned by agent "/ocw/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/ocw/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-01-10 06:48:43.552 [/ocw/grid/bin/oraagent.bin(8847542)]CRS-5016:Process "/ocw/grid/opmn/bin/onsctli" spawned by agent "/ocw/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/ocw/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log"

CAUSE

SOLUTION
The issue is fixed in a few different bugs:

1. bug 12680491 fixes the dependence between network and VIP

The fix of bug 12680491 will add intermediate modifier to stop dependency between network resource and VIP to avoid unnecessary resource state change, it's included in 11.2.0.2 GI PSU4, 11.2.0.3 GI PSU3, 11.2.0.3 Windows Patch 7, 11.2.0.4 and above. This fix is recommended instead of fix for bug 12378938 to avoid the issue in first place.

Once patch for this bug is applied, the following needs to be executed to change the dependence for all VIPs:

# $GRID_HOME/bin/crsctl modify res ora..vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora..network)"

For example:

# /ocw/grid/bin/crsctl modify res ora.racnode1.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.net1.network)"
Once the attribute is changed, a restart of nodeapps/VIP is needed to be in effect

2. bug 13582411 fixes the dependence between network and SCAN VIP/listener

The fix of bug 13582411 will add intermediate modifyer to stop dependency between network resource and SCAN VIP to avoid unnecessary resource state change, it's included in 11.2.0.3 GI PSU4, 11.2.0.4 and above.

Once patch for this bug is applied, the following needs to be executed to change the dependence for all SCAN VIPs and to restart SCAN VIPs:

# $GRID_HOME/bin/crsctl modify res ora.scan.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.net.network)"
For example:

# /ocw/grid/bin/crsctl modify res ora.scan1.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.net1.network)"
# /ocw/grid/bin/crsctl modify res ora.scan2.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.net1.network)"
# /ocw/grid/bin/crsctl modify res ora.scan3.vip -attr "STOP_DEPENDENCIES=hard(intermediate:ora.net1.network)"
# /ocw/grid/bin/srvctl stop scan -f
$ /ocw/grid/bin/srvctl start scan_listener

3. bug 17435488 fixes the dependence between network and ora.cvu and ora.ons

The fix will add intermediate modifyer to stop dependency between network resource and ora.cvu and ora.ons to avoid unnecessary resource state change, it's included in 12.1.0.2

所以初步认为需要：
1、RAC 修改VIP SCAN VIP的依赖属性
2、查看AIX 固件

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/7728585/viewspace-1160374/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/7728585/viewspace-1160374/