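1 Environment
Database version: 11.2.0.4, RAC environment.
Operating system: CentOS 7.
2 Symptoms
Today I restarted one node of the database. After the restart completed, I found that only three Clusterware agent processes were running on that node: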
[grid@rac02 admin]$ ps -ef|grep agent
patrol 10723 10552 0 09:56 ? 00:00:00 /usr/bin/ssh-agent /bin/sh -c exec -l /bin/bash -c "env GNOME_SHELL_SESSION_MODE=classic gnome-session --session gnome-classic"
grid 16067 1 0 10:02 ? 00:00:00 /u01/11.2.0/bin/oraagent.bin
root 16099 1 0 10:02 ? 00:00:02 /u01/11.2.0/bin/orarootagent.bin
root 16145 1 0 10:02 ? 00:00:00 /u01/11.2.0/bin/cssdagent
grid 19423 16434 0 10:07 pts/2 00:00:00 grep --color=auto agent
The resource status was as follows:
[grid@rac02 admin]$ crsctl status res -t -init
--------------------------------------------------------------------------------
NAME TARGET STATE SERVER STATE_DETAILS
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
1 ONLINE ONLINE rac02 Started
ora.cluster_interconnect.haip
1 ONLINE ONLINE rac02
ora.crf
1 ONLINE ONLINE rac02
ora.crsd
1 ONLINE OFFLINE
ora.cssd
1 ONLINE ONLINE rac02
ora.cssdmonitor
1 ONLINE ONLINE rac02
ora.ctssd
1 ONLINE ONLINE rac02 OBSERVER
ora.diskmon
1 OFFLINE OFFLINE
ora.evmd
1 ONLINE ONLINE rac02
ora.gipcd
1 ONLINE ONLINE rac02
ora.gpnpd
1 ONLINE ONLINE rac02
ora.mdnsd
1 ONLINE ONLINE rac02
[grid@rac02 admin]$
The ora.crsd resource was OFFLINE even though its target state was ONLINE.
3 Log Analysis
The alert log showed the following:
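Spotting the one failed resource in this listing by eye is easy to get wrong. The check can be sketched in Python as below. This is a minimal sketch, assuming the output keeps the layout shown above (a resource name line followed by instance rows). The function name `find_stuck_resources` is hypothetical and not part of any Oracle tool.

```python
def find_stuck_resources(crsctl_output):
    """Return resources whose TARGET is ONLINE but whose STATE is not ONLINE."""
    stuck = []
    current = None
    for raw in crsctl_output.splitlines():
        line = raw.strip()
        if line.startswith("ora."):
            # A resource name line, e.g. "ora.crsd"
            current = line.split()[0]
            continue
        fields = line.split()
        # Instance rows look like: "<n> <TARGET> <STATE> [server] [details]"
        if current and len(fields) >= 3 and fields[0].isdigit():
            target, state = fields[1], fields[2]
            if target == "ONLINE" and state != "ONLINE":
                stuck.append(current)
    return stuck


# Sample rows copied from the listing above
sample = """\
ora.crsd
1 ONLINE OFFLINE
ora.cssd
1 ONLINE ONLINE rac02
ora.diskmon
1 OFFLINE OFFLINE
"""

print(find_stuck_resources(sample))
```

ora.diskmon is not reported because its target is OFFLINE as well, so it is not expected to be running.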
[crsd(19237)]CRS-0813:Cluster Ready Service aborted due to failure to initialize the network layer with error [clsclisten failed with ret 3
(File: caa_Socket.cpp, line: 525
]. Details at (:CRSD00133:) in /u01/11.2.0/log/rac02/crsd/crsd.log.
2021-03-10 10:07:17.115:
[ohasd(15918)]CRS-2765:Resource 'ora.crsd' has failed on server 'rac02'.
2021-03-10 10:07:17.116:
[ohasd(15918)]CRS-2771:Maximum restart attempts reached for resource 'ora.crsd'; will not restart.
crsd.log showed the following:
[ OCRMAS][1132443392]th_master: Received group public data event. Incarnation [1]
2021-03-10 10:07:16.527: [ OCRMAS][1132443392]th_master:1': Recvd pubdata event from node [2]
2021-03-10 10:07:16.527: [ OCRMAS][1132443392]th_master:2': Recvd pubdata event for self. Do nothing.
2021-03-10 10:07:16.533: [ CRSMAIN][1468389184] Running path init...
2021-03-10 10:07:16.539: [ CLSE][1468389184]clse_get_auth_loc: Returning default authloc: /u01/11.2.0/auth/crs/rac02
2021-03-10 10:07:16.539: [ CRSMAIN][1468389184] Using Authorizer location: /u01/11.2.0/auth/crs/rac02
2021-03-10 10:07:16.539: [ CRSMAIN][1468389184] Initialing cluclu context...
2021-03-10 10:07:16.551: [ CLSCLU][1468389184]clsclu_init: rc 0
2021-03-10 10:07:16.551: [ CRSMAIN][1468389184] Getting CR Root...
2021-03-10 10:07:16.555: [ CRSMAIN][1468389184] Initializing RTI
2021-03-10 10:07:16.555: [ CRSMAIN][1468389184] Initializing staging area
2021-03-10 10:07:16.571: [ CLSE][1468389184]clse_get_auth_loc: Returning default authloc: /u01/11.2.0/auth/crs/rac02
2021-03-10 10:07:16.571: [ default][1468389184] AuthLoc /u01/11.2.0/auth/crs/rac02
2021-03-10 10:07:16.571: [ default][1468389184] PE active version: 11.2.0.4.0
2021-03-10 10:07:16.571: [ default][1468389184] PE Engine: NEW
2021-03-10 10:07:16.571: [ default][1468389184] Using OCR batch ops : ENABLED
2021-03-10 10:07:16.571: [ CRSMAIN][1468389184] Creating RTI lock info...
2021-03-10 10:07:16.571: [ CRSMAIN][1468389184] Initializing EVMMgr
2021-03-10 10:07:16.576: [ CRSMAIN][1468389184] Getting local nodename...
[ CLWAL][1468389184]clsw_Initialize: OLR initlevel [70000]
2021-03-10 10:07:16.617: [ OCRSRV][1126139648]th_upgrade: Starting upgrade calculation
2021-03-10 10:07:16.630: [ OCRSRV][1126139648]th_upgrade:10.1 AV [186647552]. State [11]. Already upgraded.Updated global data to the crs version group. Return [0]
2021-03-10 10:07:16.835: [ COMMCRS][1096722176]clsclisten: Error listening on: (ADDRESS=(PROTOCOL=tcp)(HOST=10.2.0.76)(PORT=0))
2021-03-10 10:07:16.835: [ COMMCRS][1096722176]clsclisten: op 65 failed, NSerr (12560, 0), transport: (584, 0, 0)
2021-03-10 10:07:16.836: [ CRSD][1468389184] Created alert : (:CRSD00133:) : Unable to get E2E port, error: IOException : clsclisten failed with ret 3
(File: caa_Socket.cpp, line: 525
2021-03-10 10:07:16.836: [ CRSD][1468389184][PANIC] CRSD exiting: Unable to get E2E port after 2nd attempt
2021-03-10 10:07:16.836: [ CRSD][1468389184] Done.
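The key line in this excerpt is the clsclisten error, which names the address crsd tried to listen on. Pulling that address out of a long crsd.log can be sketched as follows. This is a minimal sketch, assuming the message format matches the line shown above. The names `LISTEN_ERROR` and `find_listen_failures` are hypothetical.

```python
import re

# Matches the "Error listening on" message as it appears in the excerpt above
LISTEN_ERROR = re.compile(
    r"clsclisten: Error listening on: "
    r"\(ADDRESS=\(PROTOCOL=(?P<protocol>\w+)\)"
    r"\(HOST=(?P<host>[^)]+)\)"
    r"\(PORT=(?P<port>\d+)\)\)"
)


def find_listen_failures(log_text):
    """Return (protocol, host, port) for every clsclisten listen error."""
    return [
        (m["protocol"], m["host"], int(m["port"]))
        for m in LISTEN_ERROR.finditer(log_text)
    ]


# Sample line copied from the crsd.log excerpt above
sample_log = (
    "2021-03-10 10:07:16.835: [ COMMCRS][1096722176]clsclisten: "
    "Error listening on: (ADDRESS=(PROTOCOL=tcp)(HOST=10.2.0.76)(PORT=0))"
)

print(find_listen_failures(sample_log))
```

Here the failing endpoint is 10.2.0.76 over TCP, so the next thing to check is whether that address is actually configured on the node.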
The network interface information was as follows:
ens36: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.2.151.86 netmask 255.255.255.224 broadcast 10.228.151.95
inet6 fe80::250:56ff:fe8d:5908 prefixlen 64 scopeid 0x20<link>
ether 00:50:56:8d:59:08 txqueuelen 1000 (Ethernet)
RX packets 7289 bytes 646307 (631.1 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 10140 bytes 6909723 (6.5 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens37: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.2.0.76 netmask 255.255.255.0 broadcast 10.2.0.255
inet6 fe80::250:56ff:fe8d:13fa prefixlen 64 scopeid 0x20<link>
ether 00:50:56:8d:13:fa txqueuelen 1000 (Ethernet)
RX packets 271 bytes 35397 (34.5 KiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 183 bytes 29338 (28.6 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
ens37:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 169.254.220.193 netmask 255.255.0.0 broadcast 169.254.255.255
ether 00:50:56:8d:13:fa txqueuelen 1000 (Ethernet)
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
inet 127.0.0.1 netmask 255.0.0.0
inet6 ::1 prefixlen 128 scopeid 0x10<host>
loop txqueuelen 1000 (Local Loopback)
RX packets 4524 bytes 7037484 (6.7 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 4524 bytes 7037484 (6.7 MiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
virbr0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
inet 192.168.122.1 netmask 255.255.255.0 broadcast 192.168.122.255
ether 52:54:00:8d:96:71 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
virbr0-nic: flags=4098<BROADCAST,MULTICAST> mtu 1500
ether 52:54:00:8d:96:71 txqueuelen 1000 (Ethernet)
RX packets 0 bytes 0 (0.0 B)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 0 bytes 0 (0.0 B)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
[grid@rac02 rac02]$
HAIP had started normally: the link-local address 169.254.220.193 was up on ens37:1.
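The address in the clsclisten error, 10.2.0.76, is configured on ens37 and the interface is up, so a missing IP does not look like the cause. That cross-check can be sketched as below. This is a minimal sketch, assuming ifconfig output in the format shown above. The function name `ipv4_by_interface` is hypothetical.

```python
import re


def ipv4_by_interface(ifconfig_output):
    """Map each interface name to the IPv4 address reported for it."""
    addresses = {}
    current = None
    for line in ifconfig_output.splitlines():
        header = re.match(r"^(\S+): flags=", line)
        if header:
            # Interface header line, e.g. "ens37: flags=4163<...>"
            current = header.group(1)
            continue
        # "inet " with a trailing space, so "inet6" lines are skipped
        inet = re.match(r"^\s*inet (\d+\.\d+\.\d+\.\d+)", line)
        if inet and current:
            addresses[current] = inet.group(1)
    return addresses


# Sample lines copied from the interface listing above
sample_ifconfig = """\
ens37: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 10.2.0.76 netmask 255.255.255.0 broadcast 10.2.0.255
inet6 fe80::250:56ff:fe8d:13fa prefixlen 64 scopeid 0x20<link>
ens37:1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 169.254.220.193 netmask 255.255.0.0 broadcast 169.254.255.255
"""

addresses = ipv4_by_interface(sample_ifconfig)
owners = [name for name, ip in addresses.items() if ip == "10.2.0.76"]
print(owners)
```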
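4 Resolution
It later turned out that the sqlnet.ora file under GRID_HOME was misconfigured, which prevented the SCAN listener and the regular listener from starting. I removed the file: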
[grid@rac02 admin]$ rm sqlnet.ora
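Deleting the file outright discards the evidence of what was misconfigured. A more cautious variant is to rename it, so the contents can still be inspected or restored afterwards. This is a minimal sketch of that alternative, not what was done in this case. The function name `set_aside` and the `.bak.<timestamp>` suffix are illustrative assumptions, and the demo runs against a temporary directory rather than the real GRID_HOME.

```python
import shutil
import tempfile
import time
from pathlib import Path


def set_aside(config_path):
    """Rename a config file to <name>.bak.<timestamp> and return the new path."""
    src = Path(config_path)
    stamp = time.strftime("%Y%m%d%H%M%S")
    dst = src.with_name(f"{src.name}.bak.{stamp}")
    shutil.move(str(src), str(dst))
    return dst


# Demo against a throwaway file, not the real sqlnet.ora
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "sqlnet.ora"
    demo.write_text("# demo content\n")
    backup = set_aside(demo)
    moved_ok = (not demo.exists()) and backup.read_text() == "# demo content\n"

print(moved_ok)
```

With the file renamed, no sqlnet.ora is present at the original path, just as after `rm`.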
Then I started the resource:
[grid@rac02 admin]$ crsctl start resource "ora.crsd" -init
CRS-2672: Attempting to start 'ora.crsd' on 'rac02'
CRS-2676: Start of 'ora.crsd' on 'rac02' succeeded
[grid@rac02 admin]$ ps -ef|grep tns
root 19 2 0 09:55 ? 00:00:00 [netns]
grid 21423 20603 0 10:12 pts/2 00:00:00 grep --color=auto tns
[grid@rac02 admin]$ ps -ef|grep tns
root 19 2 0 09:55 ? 00:00:00 [netns]
grid 21493 20603 0 10:12 pts/2 00:00:00 grep --color=auto tns
[grid@rac02 admin]$ ps -ef|grep tns
root 19 2 0 09:55 ? 00:00:00 [netns]
grid 21506 20603 0 10:12 pts/2 00:00:00 grep --color=auto tns
[grid@rac02 admin]$ ps -ef|grep tns
root 19 2 0 09:55 ? 00:00:00 [netns]
grid 21513 1 2 10:12 ? 00:00:00 /u01/11.2.0/bin/tnslsnr LISTENER_SCAN1 -inherit
grid 21525 1 0 10:12 ? 00:00:00 /u01/11.2.0/bin/tnslsnr LISTENER -inherit
grid 21546 20603 0 10:12 pts/2 00:00:00 grep --color=auto tns
[grid@rac02 admin]$
The resources started normally.
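The repeated `ps -ef|grep tns` calls above amount to polling by hand until the listeners appear. Extracting the listener names from that output can be sketched as follows. This is a minimal sketch, assuming the `ps -ef` format shown above. The function name `running_listeners` is hypothetical.

```python
def running_listeners(ps_output):
    """Return the listener names of all tnslsnr processes in `ps -ef` output."""
    names = []
    for line in ps_output.splitlines():
        fields = line.split()
        for i, field in enumerate(fields):
            # The command column ends in ".../bin/tnslsnr", followed by the name
            if field.endswith("/tnslsnr") and i + 1 < len(fields):
                names.append(fields[i + 1])
    return names


# Sample lines copied from the last ps output above
sample_ps = """\
root 19 2 0 09:55 ? 00:00:00 [netns]
grid 21513 1 2 10:12 ? 00:00:00 /u01/11.2.0/bin/tnslsnr LISTENER_SCAN1 -inherit
grid 21525 1 0 10:12 ? 00:00:00 /u01/11.2.0/bin/tnslsnr LISTENER -inherit
grid 21546 20603 0 10:12 pts/2 00:00:00 grep --color=auto tns
"""

print(running_listeners(sample_ps))
```

The `[netns]` kernel thread and the `grep` process itself are ignored because neither has a field ending in `/tnslsnr`.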