Environment: VM: VirtualBox; OS: RHEL 5.6; database: Oracle 11.2.0.1.0
Problem: After starting node 1, the cluster would not come up. The VM configuration checked out fine, but checking the cluster status showed the following errors:
[root@node1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Event Manager is online
Troubleshooting:
1. Check the cluster status, as shown above.
2. Check the crsd and cssd log files.
The logs usually live under <GRID home>/log/<node hostname>/, with one subdirectory per daemon; go into the matching subdirectory to read its log.
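That path convention can be sketched in a few lines of shell; the GRID_HOME default below is only a guess at a typical 11.2 layout, so substitute your own Grid installation home:

```shell
# Hypothetical sketch: locate the crsd/ocssd logs under the Grid home.
# GRID_HOME below is an assumed typical path -- adjust to your install.
GRID_HOME=${GRID_HOME:-/u01/app/11.2.0/grid}
HOST=$(hostname -s 2>/dev/null || uname -n)  # Grid uses the short hostname
LOG_BASE="$GRID_HOME/log/$HOST"              # one subdirectory per daemon
for d in crsd cssd; do
  # crsd logs to crsd.log, but cssd logs to ocssd.log
  log="$LOG_BASE/$d/$( [ "$d" = cssd ] && echo ocssd || echo "$d" ).log"
  [ -f "$log" ] && tail -n 50 "$log" || echo "no log at $log"
done
```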
3. crsd.log in the crsd directory kept reporting the error below, "CSS is not ready", so the problem appeared to lie with the cssd service:
2014-07-18 17:07:04.063: [ CRSRTI][1547711200] CSS is not ready. Received status 3 from CSS. Waiting for good status ..
2014-07-18 17:07:05.065: [ CSSCLNT][1547711200]clssscConnect: gipc request failed with 29 (0x16)
2014-07-18 17:07:05.065: [ CSSCLNT][1547711200]clsssInitNative: connect failed, rc 29
2014-07-18 17:07:05.066: [ CRSRTI][1547711200] CSS is not ready. Received status 3 from CSS. Waiting for good status ..
4. Continue with the cssd log, ocssd.log in the cssd directory. It is verbose, but it contains errors about the IP address:
2014-07-18 17:17:58.018: [GIPCXCPT][2517008128]gipcmodGipcPassInitializeNetwork: failed to find any interfaces in clsinet, ret gipcretFail (1)
2014-07-18 17:17:58.019: [GIPCGMOD][2517008128]gipcmodGipcPassInitializeNetwork: EXCEPTION[ ret gipcretFail (1) ] failed to determine host from clsinet, using default
2014-07-18 17:17:58.019: [GIPCXCPT][2517008128]gipcShutdownF: skipping shutdown, count 3, from [ clsinet.c : 1732], ret gipcretSuccess (0)
2014-07-18 17:17:58.021: [GIPCXCPT][2517008128]gipcShutdownF: skipping shutdown, count 2, from [ clsgpnp0.c : 1021], ret gipcretSuccess (0)
2014-07-18 17:17:58.021: [GIPCGMOD][2517008128]gipcmodGipcPassInitializeNetwork: using host information node1
2014-07-18 17:17:58.022: [GIPCXCPT][2517008128]gipcmodNetworkProcessBind: failed to bind endp 0x818c170 [000000000000007d] { gipcEndpoint : localAddr 'gipc://node1:nm_scan-cluster#192.0.2.130', remoteAddr '', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x0, usrFlags 0x0 }, addr 0x818b3e0 [000000000000007f] { gipcAddress : name 'gipc://node1:nm_scan-cluster#192.0.2.130', objFlags 0x0, addrFlags 0x1 }
2014-07-18 17:17:58.022: [GIPCXCPT][2517008128]gipcmodNetworkProcessBind: slos op : sgipcnTcpBind
2014-07-18 17:17:58.022: [GIPCXCPT][2517008128]gipcmodNetworkProcessBind: slos dep : Cannot assign requested address (99)
2014-07-18 17:17:58.022: [GIPCXCPT][2517008128]gipcmodNetworkProcessBind: slos loc : bind
2014-07-18 17:17:58.022: [GIPCXCPT][2517008128]gipcmodNetworkProcessBind: slos info: addr '192.0.2.130:0'
2014-07-18 17:17:58.022: [GIPCXCPT][2517008128]gipcBindF [gipcInternalEndpoint : gipcInternal.c : 416]: EXCEPTION[ ret gipcretAddressNotAvailable (39) ] failed to bind endp 0x818c170 [000000000000007d] { gipcEndpoint : localAddr 'gipc://node1:nm_scan-cluster#192.0.2.130', remoteAddr '', numPend 0, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x0, usrFlags 0x0 }, addr 0x81906d0 [0000000000000084] { gipcAddress : name 'gipc://node1:nm_scan-cluster', objFlags 0x0, addrFlags 0x0 }, flags 0x0
2014-07-18 17:17:58.022: [GIPCXCPT][2517008128]gipcInternalEndpoint: failed to bind address to endpoint name 'gipc://node1:nm_scan-cluster', ret gipcretAddressNotAvailable (39)
2014-07-18 17:17:58.022: [GIPCXCPT][2517008128]gipcEndpointF [clsssclsnrsetup : clsssc.c : 2743]: EXCEPTION[ ret gipcretAddressNotAvailable (39) ] failed endp create ctx 0x808a630 [0000000000000011] { gipcContext : traceLevel 2, fieldLevel 0x0, numDead 0, numPending 0, numZombie 0, numObj 4, objFlags 0x0 }, name 'gipc://node1:nm_scan-cluster', flags 0x0
2014-07-18 17:17:58.022: [ CSSD][2517008128]clsssclsnrsetup: gipcEndpoint failed, rc 39
2014-07-18 17:17:58.022: [ CSSD][2517008128]clssnmOpenGIPCEndp: failed to listen on gipc addr gipc://node1:nm_scan-cluster- ret 39
2014-07-18 17:17:58.022: [ CSSD][2517008128]clssscmain: failed to open gipc endp
2014-07-18 17:17:58.096: [ CSSD][1136195904]clssscSelect: cookie accept request 0x7ebe878
2014-07-18 17:17:58.096: [ CSSD][1136195904]clssgmAllocProc: (0x81938e0) allocated
2014-07-18 17:17:58.097: [ CSSD][1136195904]clssgmClientConnectMsg: properties of cmProc 0x81938e0 - 1,2,3,4
2014-07-18 17:17:58.097: [ CSSD][1136195904]clssgmClientConnectMsg: Connect from con(0xfd) proc(0x81938e0) pid(4051) version 11:2:1:4, properties: 1,2,3,4
2014-07-18 17:17:58.097: [ CSSD][1136195904]clssgmClientConnectMsg: msg flags 0x0000
5. Check the host's IP addresses: neither NIC was up:
[root@node1 ~]# ifconfig -a
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:1632 errors:0 dropped:0 overruns:0 frame:0
TX packets:1632 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1973544 (1.8 MiB) TX bytes:1973544 (1.8 MiB)
6. Check the NIC configuration files. The eth0 and eth1 files were wrong; after correcting them, restarting the network still failed:
[root@node1 cssd]# service network restart
Shutting down loopback interface: [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface eth0: e1000 device eth0 does not seem to be present, delaying initialization.
[FAILED]
Bringing up interface eth1: e1000 device eth1 does not seem to be present, delaying initialization.
[FAILED]
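For reference, a working static configuration for eth0 on RHEL 5 looks roughly like the fragment below. The address and MAC are the values this node ends up with once fixed; treat them as an example, not a prescription:

```
# /etc/sysconfig/network-scripts/ifcfg-eth0 -- example values only;
# DEVICE, IPADDR, NETMASK and HWADDR must match your own network plan.
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
IPADDR=192.0.2.130
NETMASK=255.255.255.0
HWADDR=08:00:27:B0:C1:BB
```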
7. Check the kernel's boot-time NIC messages; they were abnormal:
[root@node1 cssd]# dmesg | grep eth
[root@node1 cssd]# dmesg | grep e1000
e1000: Unknown parameter `irq'
e1000: Unknown parameter `irq'
e1000: Unknown parameter `irq'
e1000: Unknown parameter `irq'
e1000: Unknown parameter `irq'
e1000: Unknown parameter `irq'
e1000: Unknown parameter `irq'
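One way to confirm that irq is not a parameter the e1000 driver accepts is modinfo -p, which lists a module's valid parameters. A guarded sketch (on a machine without the module or the tool it simply says so):

```shell
# List the parameters e1000 actually exposes; 'irq' should not be among them.
# Guarded so the snippet degrades gracefully where the module is absent.
out=$( (command -v modinfo >/dev/null 2>&1 && modinfo -p e1000 2>/dev/null) \
      || echo "modinfo or the e1000 module is not available here" )
echo "$out"
```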
8. Searching the web for the error above led to a foreign site that mentioned /etc/modprobe.conf. Inspecting that file revealed the problem: a line reading options e1000 irq=4:
[root@node1 cssd]# cat /etc/modprobe.conf
alias scsi_hostadapter ata_piix
alias scsi_hostadapter1 ahci
alias net-pf-10 off
alias ipv6 off
options ipv6 disable=1
remove snd-intel8x0 { /usr/sbin/alsactl store 0 >/dev/null 2>&1 || : ; }; /sbin/modprobe -r --ignore-remove snd-intel8x0
alias eth0 e1000
options e1000 irq=4
alias snd-card-0 snd-intel8x0
options snd-card-0 index=0
options snd-intel8x0 index=0
alias eth1 e1000
9. Not knowing what that line was for, I checked /etc/modprobe.conf on a freshly installed machine; its contents, below, contain no options e1000 irq=4 line:
[root@redhat ~]# cat /etc/modprobe.conf
alias scsi_hostadapter ata_piix
alias scsi_hostadapter1 ahci
alias net-pf-10 off
alias ipv6 off
options ipv6 disable=1
options snd-intel8x0 index=0
remove snd-intel8x0 { /usr/sbin/alsactl store 0 >/dev/null 2>&1 || : ; }; /sbin/modprobe -r --ignore-remove snd-intel8x0
alias eth0 e1000
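The comparison in step 9 is quicker with diff. A minimal sketch, run here against two inline copies of the relevant lines so it is safe to reproduce anywhere; on the real machines you would diff the two /etc/modprobe.conf files directly:

```shell
# Recreate the relevant lines of the two files in a scratch directory,
# then diff them; the suspect option line stands out in the output.
tmp=$(mktemp -d)
printf 'alias eth0 e1000\noptions e1000 irq=4\nalias eth1 e1000\n' > "$tmp/node1.conf"
printf 'alias eth0 e1000\n' > "$tmp/clean.conf"
diff "$tmp/node1.conf" "$tmp/clean.conf" || true  # diff exits 1 when files differ
```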
10. After deleting that line and rebooting the VM, the IP addresses came up normally, and so did the cluster:
[root@node1 ~]# ifconfig
eth0 Link encap:Ethernet HWaddr 08:00:27:B0:C1:BB
inet addr:192.0.2.130 Bcast:192.0.2.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:617 errors:0 dropped:0 overruns:0 frame:0
TX packets:45 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:48756 (47.6 KiB) TX bytes:6011 (5.8 KiB)
eth1 Link encap:Ethernet HWaddr 08:00:27:22:0E:25
inet addr:10.10.10.101 Bcast:10.10.10.255 Mask:255.255.255.0
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:699 errors:0 dropped:0 overruns:0 frame:0
TX packets:54 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:55458 (54.1 KiB) TX bytes:6402 (6.2 KiB)
[root@node1 bin]# ./crsctl check crs
CRS-4638: Oracle High Availability Services is online
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
[root@node1 bin]# ./crs_stat -t
Name Type Target State Host
------------------------------------------------------------
ora.DATA.dg ora....up.type ONLINE ONLINE node1
ora.FLASH.dg ora....up.type ONLINE ONLINE node1
ora.GRIDDG.dg ora....up.type ONLINE ONLINE node1
ora....ER.lsnr ora....er.type ONLINE ONLINE node1
ora....N1.lsnr ora....er.type ONLINE ONLINE node1
ora.asm ora.asm.type ONLINE ONLINE node1
ora.devdb.db ora....se.type ONLINE OFFLINE
ora.eons ora.eons.type ONLINE ONLINE node1
ora.gsd ora.gsd.type OFFLINE OFFLINE
ora....network ora....rk.type ONLINE ONLINE node1
ora....SM1.asm application ONLINE ONLINE node1
ora....E1.lsnr application ONLINE ONLINE node1
ora.node1.gsd application OFFLINE OFFLINE
ora.node1.ons application ONLINE ONLINE node1
ora.node1.vip ora....t1.type ONLINE ONLINE node1
ora.node2.vip ora....t1.type ONLINE ONLINE node1
ora.oc4j ora.oc4j.type OFFLINE OFFLINE
ora.ons ora.ons.type ONLINE ONLINE node1
ora....ry.acfs ora....fs.type ONLINE ONLINE node1
ora.scan1.vip ora....ip.type ONLINE ONLINE node1
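The deletion in step 10 can be scripted. Below is a minimal sketch run against a scratch copy that reproduces the broken file, so it is safe to try; on the real node you would point the sed at /etc/modprobe.conf itself, keep the backup, and reboot afterwards:

```shell
# Work on a throwaway copy that reproduces the broken file.
conf=$(mktemp)
printf 'alias eth0 e1000\noptions e1000 irq=4\nalias eth1 e1000\n' > "$conf"
cp "$conf" "$conf.bak"                      # always back up before editing
sed -i '/^options e1000 irq=4$/d' "$conf"   # drop the bogus option line
! grep -q '^options e1000' "$conf" && echo "option line removed"
```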
Summary: the cluster could not start because the NICs failed to come up. A reminder that it pays to understand both the Red Hat Linux system and Oracle Clusterware internals.