RAC high availability is not as easy to use as it sounds


Oracle has always been touting how highly available its RAC is, and how it can deliver failover that is transparent to the application. In real-world use, however, the conditions for achieving that kind of transparency are extremely demanding.

Let me start with an incident. One day at lunch I suddenly got a phone call saying that the application's database connections had hung, and that a test connection from the client also hung with no response for a long time.

The database in that province is a two-node RAC. After logging in I found that rac2 had already gone down, with only its listener still alive. The alert log on rac2 read:

Tue Apr 29 10:12:10 2008
Thread 2 advanced to log sequence 11999
  Current log# 8 seq# 11999 mem# 0: /dev/vg_rac1/rG1_redo_512m_204
  Current log# 8 seq# 11999 mem# 1: /dev/vg_rac2/rG2_redo_512m_214
Tue Apr 29 10:12:10 2008
  Current log# 8 seq# 11999 mem# 2: /dev/vg_rac3/rG3_redo_512m_224
Tue Apr 29 10:12:10 2008
ARC1: Evaluating archive   log 6 thread 2 sequence 11998
Tue Apr 29 10:12:10 2008
ARC1: Beginning to archive log 6 thread 2 sequence 11998
Creating archive destination LOG_ARCHIVE_DEST_1: '/archlog2/fjmisc2_2_11998.arc'
Tue Apr 29 10:12:32 2008
ARC1: Completed archiving  log 6 thread 2 sequence 11998
Tue Apr 29 10:42:18 2008
alter tablespace DATA_PORTAL add datafile '/dev/vg_rac1/rG1_data_2g_185' size 2047m
Tue Apr 29 10:42:50 2008
Completed: alter tablespace DATA_PORTAL add datafile '/dev/vg
Tue Apr 29 12:16:59 2008
Communications reconfiguration: instance 0
Waiting for clusterware split-brain resolution
Tue Apr 29 12:17:32 2008
Trace dumping is performing id=[80429121659]
Tue Apr 29 12:18:09 2008
Trace dumping is performing id=[80429121745]
Tue Apr 29 12:27:04 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_lmon_6757.trc:
ORA-29740: evicted by member 0, group incarnation 5
Tue Apr 29 12:27:04 2008
LMON: terminating instance due to error 29740
Tue Apr 29 12:27:04 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_pmon_6753.trc:
ORA-29740: evicted by member , group incarnation
Tue Apr 29 12:27:04 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_dbw0_6765.trc:
ORA-29740: evicted by member , group incarnation
Tue Apr 29 12:27:05 2008
Errors in file /oracle/app/oracle/admin/fjmisc/bdump/fjmisc2_lck0_6793.trc:
ORA-29740: evicted by member , group incarnation
Instance terminated by LMON, pid = 6757

The error is ORA-29740, which usually points to a communication problem between the RAC nodes that caused one node to be kicked out of the cluster group. For the detailed analysis we still have to look at the trace file:

*** 2008-04-29 12:16:59.317
kjxgrgetresults: Detect reconfig from 0, seq 4, reason 3
kjxgrrcfgchk: Initiating reconfig, reason 3
*** 2008-04-29 12:17:04.332
kjxgmrcfg: Reconfiguration started, reason 3
kjxgmcs: Setting state to 4 0.
*** 2008-04-29 12:17:04.347
     Name Service frozen
kjxgmcs: Setting state to 4 1.
*** 2008-04-29 12:17:04.392
kjxgrrecp2: Waiting for split-brain resolution, upd 0, seq 5
*** 2008-04-29 12:27:04.479
Voting results, upd 0, seq 5, bitmap: 0
*** 2008-04-29 12:27:04.479
kjxgrdtrt: Evicted by 0, seq (5, 5)
IMR state information
  Member 1, thread 2, state 4, flags 0x00a1
  RR seq 5, propstate 3, pending propstate 0
  rcfg rsn 3, rcfg time 2235139911, mem ct 2
  master 1, master rcfg time 2235139911
 
  Member information:
    Member 0, incarn 4, version 122818100
      thrd 1, prev thrd 65535, status 0x0047, err 0x0002
      valct 0
    Member 1, incarn 4, version 134210
      thrd 2, prev thrd 65535, status 0x0107, err 0x0000
      valct 2
 
Group name: FJMISC
Member id: 1
Cached SKGXN event: 0
Group State:
  State: 4 1
  Commited Map: 0 1
  New Map: 0 1
  SKGXN Map: 0 1
  Master node: 0
  Memcnt 2  Rcvcnt 0
  Substate Proposal: false
Inc Proposal:
  incarn 0  memcnt 0  master 0
  proposal false  matched false
  map:
Master Inc State:
  incarn 0  memcnt 0  agrees 0  flag 0x1
  wmap:
  nmap:
  ubmap:
Submitting asynchronized dump request [1]
error 29740 detected in background process
ORA-29740: evicted by member 0, group incarnation 5
ksuitm: waiting for [5] seconds before killing DIAG
oracle@FJ_DB02:/oracle/app/oracle/admin/fjmisc/bdump >

Here we can see that the crash was triggered by ORA-29740 with reason 3. A search on Metalink (Note:219361.1) explains:

For troubleshooting ORA-29740 errors, the 'reason' will be very important.
In the above example, the first section indicates the reason for the
initiated reconfiguration. The reasons are as follows:
 
Reason 0 = No reconfiguration
Reason 1 = The Node Monitor generated the reconfiguration.
Reason 2 = An instance death was detected.
Reason 3 = Communications Failure
Reason 4 = Reconfiguration after suspend
 
For ORA-29740 errors, you will most likely see reasons 1, 2, or 3.
 
-----------------------------------------------------------------------------
 
Reason 1: The Node Monitor generated the reconfiguration. This can happen if:
 
a) An instance joins the cluster
b) An instance leaves the cluster
c) A node is halted
 
It should be easy to determine the cause of the error by reviewing the alert
logs and LMON trace files from all instances. If an instance joins or leaves
the cluster or a node is halted then the ORA-29740 error is not a problem.
 
ORA-29740 evictions with reason 1 are usually expected when the cluster
membership changes. Very rarely are these types of evictions a real problem.
 
If you feel that this eviction was not correct, do a search in Metalink or
the bug database for:
 
ORA-29740 'reason 1'
 
Important files to review are:
 
a) Each instance's alert log
b) Each instance's LMON trace file
c) Statspack reports from all nodes leading up to the eviction
d) Each node's syslog or messages file
e) iostat output before, after, and during evictions
f) vmstat output before, after, and during evictions
g) netstat output before, after, and during evictions
 
There is a tool called "OS Watcher" that is being developed that helps gather
this information. For more information on "OS Watcher" see Note 301137.1
"OS Watcher User Guide".
 
-----------------------------------------------------------------------------
 
Reason 2: An instance death was detected. This can happen if:
 
a) An instance fails to issue a heartbeat to the control file.
 
When the heartbeat is missing, LMON will issue a network ping to the instance
not issuing the heartbeat. As long as the instance responds to the ping,
LMON will consider the instance alive. If, however, the heartbeat is not
issued for the length of time of the control file enqueue timeout, the
instance is considered to be problematic and will be evicted.
 
Common causes for an ORA-29740 eviction (Reason 2):
 
a) NTP (Time changes on cluster) - usually on Linux, Tru64, or IBM AIX
b) Network Problems (SAN).
c) Resource Starvation (CPU, I/O, etc..)
d) An Oracle bug.
 
 
Common bugs for reason 2 evictions:
Bug 2820871 - Abrupt time adjustments can crash instance with ORA-29740
(Reason 2) (Linux Only)
Fixed-Releases: 9204+ A000
Bug 3917158 - ORA-29740 and a false instance eviction can occur (Reason 2)
(IBM AIX Only)
Fixed Releases: 9206+
 
If you feel that this eviction was not correct, do a search in Metalink or the
bug database for:
 
ORA-29740 'reason 2'
 
Important files to review are:
 
a) Each instance's alert log
b) Each instance's LMON trace file
c) Statspack reports from all nodes leading up to the eviction
d) The CKPT process trace file of the evicted instance
e) Other bdump or udump files...
f) Each node's syslog or messages file
g) iostat output before, after, and during evictions
h) vmstat output before, after, and during evictions
i) netstat output before, after, and during evictions
 
There is a tool called "OS Watcher" that is being developed that helps gather
this information. For more information on "OS Watcher" see Note 301137.1
"OS Watcher User Guide".
 
-----------------------------------------------------------------------------
 
Reason 3: Communications Failure. This can happen if:
 
a) The LMON processes lose communication between one another.
b) One instance loses communications with the LMS, LMD, process of
another instance.
c) The LCK processes lose communication between one another.
d) A process like LMON, LMD, LMS, or LCK is blocked, spinning, or stuck
and is not responding to remote requests.
 
In this case the ORA-29740 error is recorded when there are communication
issues between the instances. It is an indication that an instance has been
evicted from the configuration as a result of IPC send timeout. A
communications failure between processes across instances will also generate a
ORA-29740 with reason 3. When this occurs, the trace file of the process
experiencing the error will print a message:
 
Reporting Communication error with instance:
 
If communication is lost at the cluster layer (for example, network cables
are pulled), the cluster software may also perform node evictions in the
event of a cluster split-brain. Oracle will detect a possible split-brain
and wait for cluster software to resolve the split-brain. If cluster
software does not resolve the split-brain within a specified interval,
Oracle proceeds with evictions.
 
Oracle Support has seen cases where resource starvation (CPU, I/O, etc...) can
cause an instance to be evicted with this reason code. The LMON or LMD process
could be blocked waiting for resources and not respond to polling by the remote
instance(s). This could cause that instance to be evicted. If you have
a statspack report available from the time just prior to the eviction on the
evicted instance, check for poor I/O times and high CPU utilization. Poor I/O
times would be an average read time of > 20ms.
 
Common causes for an ORA-29740 eviction (Reason 3):
 
a) Network Problems.
b) Resource Starvation (CPU, I/O, etc..)
c) Severe Contention in Database.
d) An Oracle bug.

OK, so we have basically narrowed the problem down to the network. Next, the system's syslog:

Apr 29 11:59:10 FJ_DB02 su: + 0 root-oracle
Apr 29 12:11:02 FJ_DB02 cmcld: lan1 failed
Apr 29 12:11:02 FJ_DB02 cmcld: Subnet 10.203.17.0 switching from lan1 to lan2
Apr 29 12:11:02 FJ_DB02 cmcld: lan1 switching to lan2
Apr 29 12:11:02 FJ_DB02 cmcld: Subnet 10.203.17.0 switched from lan1 to lan2
Apr 29 12:11:02 FJ_DB02 cmcld: lan1 switched to lan2
Apr 29 12:11:22 FJ_DB02 cmcld: lan1 recovered
Apr 29 12:11:22 FJ_DB02 cmcld: Subnet 10.203.17.0 switching from lan2 to lan1
Apr 29 12:11:22 FJ_DB02 cmcld: lan2 switching to lan1
Apr 29 12:11:22 FJ_DB02 cmcld: Subnet 10.203.17.0 switched from lan2 to lan1
Apr 29 12:11:22 FJ_DB02 cmcld: lan2 switched to lan1
Apr 29 12:14:15 FJ_DB02 vmunix: NFS server FJ_DB01 not responding still trying
Apr 29 12:14:59 FJ_DB02 cmcld: HB connection to 10.203.17.1 not responding, closing
Apr 29 12:14:59 FJ_DB02 cmcld: GS connection to 10.203.17.1 not responding, closing
Apr 29 12:18:59 FJ_DB02 cmcld: GS connection to 10.203.17.1 is responding
Apr 29 12:19:26 FJ_DB02 vmunix: NFS server FJ_DB01 ok
Apr 29 12:20:59 FJ_DB02 cmcld: HB connection to 10.203.17.1 is responding
Apr 29 12:22:47 FJ_DB02 xntpd[3717]: synchronized to 218.200.246.196, stratum=1

The host had indeed run into a network problem: rac2 could not reach rac1 (10.203.17.1), and at that point the application hung. My suspicion was that some of the application's connections went to rac2, whose listener was still up even though the instance had crashed. Because local_listener and remote_listener were configured on the servers, and load balancing and failover were configured on the clients, a connection that happened to land on that listener triggered server-side load balancing, which needs to check the load on the other node; with the network broken, the other node's state could not be determined, so the connection hung. After shutting down the listener on rac2, the application went back to normal!!
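For reference, a minimal sketch of the server-side registration behind this behaviour (the listener alias names below are illustrative assumptions, not values taken from this system). local_listener registers the instance with its own node's listener, while remote_listener registers it with the other node's listener, which is what lets a listener redirect an incoming connection to the other instance based on load:

-- Run on rac2; rac1 is configured the other way around.
-- The alias names are illustrative assumptions resolved through tnsnames.ora.
SQL> alter system set local_listener  = 'LISTENER_FJMISC2';
SQL> alter system set remote_listener = 'LISTENERS_FJMISC';

This cross-registration is what made rac2's surviving listener depend on reaching rac1, as described above.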

Looking further at the host's syslog, the network had already recovered at 12:18 and the connections were responding again. Pinging rac1 was also fine, so after confirming that the cluster software was healthy, I started to bring the rac2 instance back up.

The problem was that things broke again during startup. The connections on rac1 had all been fine and the application had been connecting to rac1 without trouble, yet while rac2 was restarting all of the application sessions hung. Since the two nodes were resynchronizing memory at that moment, recovering from the split-brain, both rac2 and rac1 were expected to be slow during that window, so I asked the application side to wait a while... Fifteen minutes later rac2 still had not come up, and its instance was terminated by LMON once again.

Ouch! Apparently the network still had problems!! I didn't dare touch anything else during the day for fear of hitting the business again, and went back to tinkering at night...

In the evening I pulled in a network engineer. The error "FJ_DB02 cmcld: GS connection to 10.203.17.1 not responding, closing" was still showing up, so we went over the network setup carefully:

$ sqlplus "/ as sysdba"
SQL> oradebug setmypid
SQL> oradebug ipc
SQL> exit

The resulting trace file showed:

SSKGXPT 0x1aa5e4 flags SSKGXPT_READPENDING      info for network 0
        socket no 9     IP 10.203.17.2     UDP 57854        <------ here
        sflags SSKGXPT_UP
        info for network 1
        socket no 0     IP 0.0.0.0      UDP 0
        sflags SSKGXPT_DOWN
        context timestamp 0
        no ports
        sconno    accono    ertt   state   seq #   sent   async   sync  rtrans   acks
        ach    accono    sconno    admno   state  lstack  nxtseq   rcv  rtrans   acks  credts

The interconnect here is bound to 10.203.17.2, i.e. the same subnet the application uses to connect, rather than the private subnet. Checking further:

SQL> show parameter cluster

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cluster_database                     boolean     TRUE
cluster_database_instances           integer     2
cluster_interconnects                string

So the private interface had not been configured here either.

The RAC network configuration in this province looks like this:
lan0 : 192.168.0.0
lan1 : 10.203.17.0
lan2 : 0.0.0.0
lan2 is the standby for lan1. The lan1 interfaces are connected through two mutually redundant switches, while lan0 is connected directly with a crossover cable. The application connects to the database through IPs on the 10.203.17.0 subnet, and lan0 is currently not used by the database at all. The trouble on lan1 is what brought rac2 down this time.

Further ping tests by the network engineer revealed packet loss on the lan1 interface. Because it was only a small amount of intermittent loss rather than a complete outage, no failover to lan2 was ever triggered. The packet loss led to the split-brain, which in turn took the rac2 instance down. Asking around on site, it turned out that someone had indeed unplugged the lan1 cable around noon that day; it was probably not re-seated firmly, which caused this outage.

Fix: after re-seating the lan1 cable, replacing the cable and moving to a different port, inter-node communication returned to normal and rac2 started up cleanly.

Follow-up improvement: change the directly cabled interconnect to go through a switch (a gigabit network is required), and set cluster_interconnects to the private IP.
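A minimal sketch of that last change, assuming an spfile is in use and that the private lan0 addresses are 192.168.0.1 and 192.168.0.2 (the addresses and instance names are assumptions for illustration):

-- cluster_interconnects is a static parameter: set it per instance in the
-- spfile and restart the instances for it to take effect.
SQL> alter system set cluster_interconnects = '192.168.0.1' scope=spfile sid='fjmisc1';
SQL> alter system set cluster_interconnects = '192.168.0.2' scope=spfile sid='fjmisc2';

After the restart, oradebug ipc should report the 192.168.0.x address for network 0 instead of 10.203.17.2.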

Summary: RAC is actually quite demanding. On the hardware side it wants a gigabit private network for the interconnect, and if there are several NICs that could serve both interconnect traffic and application connections, cluster_interconnects must be set to pin down which one is used. Load balancing can happen in two places. One is on the server side, where the listeners balance load based on the remote_listener and local_listener configuration, so a connection the client pointed at rac1 may still be redirected to rac2 by the server. The other is on the client side, where load balancing in tnsnames.ora makes the client pick an address at random from the address list, and the node it finally lands on is then adjusted again by the server-side listener. (This feels hard to control: with both client-side and server-side load balancing configured, it is basically unpredictable which node a given client connection will end up on. On top of that, Oracle claims the balancing is driven by CPU load, but how that CPU ratio is actually computed remains a black box.) As for failover, it seems to matter only for SELECTs on sessions already connected to a RAC instance; UPDATEs and INSERTs basically cannot be carried over by TAF, and newly initiated connections have nothing to do with failover at all...
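For completeness, a minimal sketch of the client-side tnsnames.ora entry this discussion assumes; the host names, port and service name are illustrative, not taken from the real system:

# Illustrative entry only: hosts, port and service name are assumptions.
# LOAD_BALANCE=on  : the client picks an address from the list at random.
# FAILOVER=on      : connect-time failover to the other address if one is down.
# FAILOVER_MODE with TYPE=select is TAF: only in-flight SELECTs are resumed
# after an instance failure; uncommitted INSERT/UPDATE work is rolled back.
FJMISC =
  (DESCRIPTION =
    (ADDRESS_LIST =
      (LOAD_BALANCE = on)
      (FAILOVER = on)
      (ADDRESS = (PROTOCOL = TCP)(HOST = fj_db01)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = fj_db02)(PORT = 1521))
    )
    (CONNECT_DATA =
      (SERVICE_NAME = fjmisc)
      (FAILOVER_MODE = (TYPE = select)(METHOD = basic)(RETRIES = 20)(DELAY = 3))
    )
  )

Setting LOAD_BALANCE to off keeps connections on the first address and takes the client-side randomness described above out of the picture.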

Source: ITPUB blog, http://blog.itpub.net/611609/viewspace-674934/. Please credit the source when reposting; otherwise legal liability will be pursued.
