Oracle RAC node reboot: resolving frequent reboots of one Oracle RAC node

Latest recommended article published on 2023-03-06 18:43:36

weixin_39640773

Tags: oracle rac node reboot

Category: Oracle Database | Author: 码皇 | Source: hijk139的专栏 (hijk139's column)

Symptom:

An incident from 2011. The environment was Oracle 11gR2 RAC on Red Hat Linux. In a 2-node RAC cluster, one of the nodes rebooted frequently.

Root cause analysis:

Host logs: the VIP drifted to the other node and moved back after the reboot.

node1

Nov 23 18:22:27 dtydb2 avahi-daemon[13096]: Withdrawing address record for 10.4.124.242 on bond2.
Nov 23 18:22:31 dtydb2 avahi-daemon[13096]: Withdrawing address record for 169.254.188.250 on bond1.
Nov 23 18:23:10 dtydb2 avahi-daemon[13096]: Registering new address record for 169.254.188.250 on bond1.
Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Registering new address record for 10.4.124.242 on bond2.
Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Withdrawing address record for 10.4.124.242 on bond2.
Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Registering new address record for 10.4.124.242 on bond2.
Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Withdrawing address record for 10.4.124.242 on bond2.
Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Registering new address record for 10.4.124.242 on bond2.

node2

Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Registering new address record for 10.4.124.242 on bond2.
Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Withdrawing address record for 10.4.124.242 on bond2.
Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Registering new address record for 10.4.124.242 on bond2.
Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Withdrawing address record for 10.4.124.242 on bond2.
Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Registering new address record for 10.4.124.242 on bond2.
Nov 23 18:23:34 dtydb1 avahi-daemon[13132]: Withdrawing address record for 10.4.124.242 on bond2.

Database log: the database instance could not connect to ASM, so it restarted.

ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel

ASM log: the ASM instance alert log contains the following.

Wed Nov 23 18:22:29 2011
NOTE: client exited [13858]
Wed Nov 23 18:22:29 2011
NOTE: ASMB process exiting, either shutdown is in progress
NOTE: or foreground connected to ASMB was killed.
Wed Nov 23 18:22:29 2011
PMON (ospid: 13797): terminating the instance due to error 481
Wed Nov 23 18:22:29 2011
ORA-1092 : opitsk aborting process
Wed Nov 23 18:22:30 2011
ORA-1092 : opitsk aborting process
Wed Nov 23 18:22:30 2011
ORA-1092 : opitsk aborting process
Wed Nov 23 18:22:30 2011
ORA-1092 : opitsk aborting process
Wed Nov 23 18:22:30 2011
License high water mark = 16
Instance terminated by PMON, pid = 13797
USER (ospid: 9488): terminating the instance
Instance terminated by USER, pid = 948

ocssd.log: the node has a disk heartbeat (HB) but no network heartbeat.

2011-11-23 18:22:20.512: [ CSSD][1111939392]clssnmPollingThread: node dtydb1 (1) is impending reconfig, flag 394254, misstime 15910
2011-11-23 18:22:20.512: [ CSSD][1111939392]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2011-11-23 18:22:20.512: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004978, LATS 1030715744, lastSeqNo 946497, uniqueness 1321449141, timestamp 1322043740/933687024
2011-11-23 18:22:21.515: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004980, LATS 1030716744, lastSeqNo 1004978, uniqueness 1321449141, timestamp 1322043741/933688024
2011-11-23 18:22:22.518: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004982, LATS 1030717754, lastSeqNo 1004980, uniqueness 1321449141, timestamp 1322043742/933689044
2011-11-23 18:22:23.520: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004984, LATS 1030718754, lastSeqNo 1004982, uniqueness 1321449141, timestamp 1322043743/933690044
2011-11-23 18:22:24.140: [ CSSD][1113516352]clssnmSendingThread: sending status msg to all nodes
2011-11-23 18:22:24.141: [ CSSD][1113516352]clssnmSendingThread: sent 4 status msgs to all nodes
2011-11-23 18:22:24.523: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004986, LATS 1030719754, lastSeqNo 1004984, uniqueness 1321449141, timestamp 1322043744/933691044
2011-11-23 18:22:25.525: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004988, LATS 1030720754, lastSeqNo 1004986, uniqueness 1321449141, timestamp 1322043745/933692044
2011-11-23 18:22:26.527: [ CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004990, LATS 1030721764, lastSeqNo 1004988, uniqueness 1321449141, timestamp 1322043746/933693044

After a monitoring script was deployed, the ping log showed packet loss starting at 18:21:56 (packets 117 through 150 were lost).

64 bytes from 192.168.100.1: icmp_seq=114 ttl=64 time=0.342 ms
64 bytes from 192.168.100.1: icmp_seq=115 ttl=64 time=0.444 ms
64 bytes from 192.168.100.1: icmp_seq=116 ttl=64 time=0.153 ms
--- 192.168.100.1 ping statistics ---
150 packets transmitted, 116 received, 22% packet loss, time 149054ms
rtt min/avg/max/mdev = 0.084/0.246/0.485/0.099 ms
Wed Nov 23 18:22:31 CST 2011

Further analysis:

The analysis above essentially confirmed the cause: packet loss on the RAC private network made one node's host reboot. But why were packets being lost? The host network configuration had been checked and showed no problems, so the only option left was to ask the network engineers for help. The network experts captured packets and observed the following (quoted from their reply email):

1. At 4:02:09, for the ping request that 192.168.100.1 sent on the e4cc NIC, 192.168.100.2 did not deliver the reply packet to e4cc.
2. The ping request packets sent by 192.168.100.2 were not delivered to the e4cc NIC of 192.168.100.1, yet host 192.168.100.1 definitely received them, because the reply packets from 192.168.100.1 to 192.168.100.2 were seen on the e4cc NIC.
3. At 4:02:41, the e474 NIC of 192.168.100.2 answered 192.168.100.1 with Destination unreachable (Port unreachable). At that point 192.168.100.2 could send replies normally, and after a period of adjustment the network returned to normal from 4:02:53.

In detail, this can be understood as follows:

1. Taking the packet loss seen on host 2 as an example, seq 9 through seq 41 were lost.

64 bytes from 192.168.100.1: icmp_seq=7 ttl=64 time=0.170 ms
64 bytes from 192.168.100.1: icmp_seq=8 ttl=64 time=0.376 ms
64 bytes from 192.168.100.1: icmp_seq=42 ttl=64 time=0.151 ms
64 bytes from 192.168.100.1: icmp_seq=43 ttl=64 time=0.340 ms

2. Host 2 did send the seq 9 request.

04:02:09.284929 00:1b:21:c1:e4:74 > 00:1b:21:c1:e4:cc, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: ICMP(1), length: 84) 192.168.100.2 > 192.168.100.1: ICMP echo request, id 59655, seq 9, length 64
04:02:10.284885 00:1b:21:c1:e4:74 > 00:1b:21:c1:e4:cc, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto: ICMP(1), length: 84) 192.168.100.2 > 192.168.100.1: ICMP echo request, id 59655, seq 10, length 64

3. NE401 captured the seq 9 reply packet sent back by host 1, but did not capture the request packet (was it forwarded through the other NE40??).
4. This seq 9 reply packet either never reached host 2, or reached host 2 but was not received properly. Because no capture of reply packets had been set up on the host 2 side, this point cannot be confirmed.

Continued packet capture showed that the hosts' standby NIC was constantly sending ARP update requests. These packets interfered with MAC address learning on the layer-2 network and caused the learned addresses to flap frequently, which in extreme cases leads to packet loss. The recommendation was to confirm what this traffic is for and, provided the business is not affected, to disable it.

Solution:

Shut down one of the switch ports connected to the host, so that the host, switch, and firewall are connected in a "口"-shaped (square) topology. With this layout the ARP requests are no longer sent. The problem was resolved and the node reboots never recurred.
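The ping monitoring used in this analysis reported loss as runs of missing sequence numbers (117 through 150 in one log, 9 through 41 in another). The article does not show the monitoring script itself, so the following is only a minimal Python sketch of how such gaps could be extracted from saved Linux `ping` output. The function name `missing_ranges` and its parameters are hypothetical, and the sketch assumes reply lines carry an `icmp_seq=N` field as in the logs above.

```python
import re

# Sequence number field as it appears in Linux ping reply lines.
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def missing_ranges(ping_lines, first_seq=1, last_seq=None):
    """Return inclusive (start, end) ranges of icmp_seq values with no reply.

    first_seq / last_seq bound the range that was actually transmitted.
    If last_seq is omitted, the highest sequence number seen is used, so
    loss at the very end of a run is only detected when last_seq is given.
    """
    seen = set()
    for line in ping_lines:
        match = SEQ_RE.search(line)
        if match:
            seen.add(int(match.group(1)))

    if last_seq is None:
        if not seen:
            return []
        last_seq = max(seen)

    gaps = []
    start = None
    for seq in range(first_seq, last_seq + 1):
        if seq not in seen:
            if start is None:
                start = seq          # a new run of lost packets begins
        elif start is not None:
            gaps.append((start, seq - 1))
            start = None
    if start is not None:            # loss continued up to last_seq
        gaps.append((start, last_seq))
    return gaps


# Reply lines from the host 2 capture quoted in the article.
sample = [
    "64 bytes from 192.168.100.1: icmp_seq=7 ttl=64 time=0.170 ms",
    "64 bytes from 192.168.100.1: icmp_seq=8 ttl=64 time=0.376 ms",
    "64 bytes from 192.168.100.1: icmp_seq=42 ttl=64 time=0.151 ms",
    "64 bytes from 192.168.100.1: icmp_seq=43 ttl=64 time=0.340 ms",
]
print(missing_ranges(sample, first_seq=7))  # [(9, 41)]
```

For a log like the first one, where loss runs to the end of the capture, pass the transmitted count from the statistics line as `last_seq` (150 in that case) so the trailing gap is reported as well.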