27号接到电话匆匆忙忙准备行程,并且收集故障信息。
合作伙伴在故障发生时收集的信息如下:
查询v$streams_capture:
SELECT CAPTURE_NAME, STATE FROM V$STREAMS_CAPTURE;
CAPTURE_NAME STATE
-------------- ------------------------
STREAM_CAPTURE PAUSED FOR FLOW CONTROL
查询V$BUFFERED_SUBSCRIBERS得到:
V$BUFFERED_SUBSCRIBERS may show TOO MANY UNBROWSED MESSAGES
我到达现场后,由于故障不在,但应该可以从ALERT.log里面发现什么东西,于是收集了全部三个实例的alert.log日志再加上合作伙伴收集到的信息进行分析。
alert1.log里面可以找到:
Tue Sep 27 10:17:41 2011
Propagation Schedule for (STRMADMIN1.CAP_GISIZJ_QUEUE, "STRMADMIN1"."APP_GISIZJ_QUEUE"@LDCX) encountered following error:
ORA-25307: Enqueue rate too high, flow control enable另外一个alert日志里面可以看到从2010年这个潜在的故障就已经存在了,但没有被引起关注,因为那时业务量也不是很大,因此stream的压力不大,所以问题没有被暴露出来。
Fri Jun 18 10:39:24 2010
Propagation Schedule for (STRMADMIN1.UP_CAP_QUE_GIJD, "STRMADMIN1"."UP_APP_QUE_GIJD"@LDSC) encountered following error:
ORA-25307: Enqueue rate too high, flow control enabled ß 吻合第三种情况
Fri Jun 18 10:42:59 2010
<wbr></wbr>
stream的故障表现通常有三种形式:1. 查询v$streams_capture:
SELECT CAPTURE_NAME, STATE FROM V$STREAMS_CAPTURE;
CAPTURE_NAME STATE
-------------- ------------------------
STREAM_CAPTURE PAUSED FOR FLOW CONTROL
2. 查询V$BUFFERED_SUBSCRIBERS得到:
V$BUFFERED_SUBSCRIBERS may show TOO MANY UNBROWSED MESSAGES
3.就是上面那个Error Message: ORA-25307: Enqueue rate too high, flow control enabled
造成streams故障的原因有如下几种.
1) INIT.ORA parameter problem
Inadequate Memory esp., Streams pool size is small or not set
<wbr><wbr><wbr></wbr></wbr></wbr> 2) If AQ_TM_PROCESSES parameter is set too high in 9i and 10g
For 10g/11g unset AQ_TM_PROCESSES and QMON is automatically configured (auto-tune).
<wbr><wbr><wbr></wbr></wbr></wbr> 3) Capture parameter setting have not been done properly .
4) Publisher has enqueued a lot more messages than have been browsed.
(>1000 unbrowsed messages)
<wbr><wbr><wbr></wbr></wbr></wbr> 5) High latency
High Latency can be due to long-running transactions,many dependent transactions, or slow Capture,Propagation, or Apply processes.In some situations, propagation may become disabled (if the number of failures is (16). Also check Propagation latency from dba_queue_schedules;
6) Check Apply latency
7) Long running Transactions
Check that there is no occurrence for long running transaction entries with the alert log
file, it will be something like that C002: long running txn detected, xid:
0x0008.010.00084261
<wbr></wbr>8) Some known bugs
<wbr><wbr><wbr></wbr></wbr></wbr> bug 7206332 bug 6054573 bug 5093060 bug 7719668
<wbr><wbr><wbr></wbr></wbr></wbr> bug 7206332(lack of evidence no patch at this moment)
bug 6054573( base bug 5093060 fixed at 10.2.0.4)
bug 5093060(NA for 10.2.0.4)
bug 7719668 (11.1.0.7)
经过综合分析,我认为客户这套streams是由于bug导致.
当前环境补丁信息如下,仅仅升级了10.2.0.4,并没有在10.2.0.4上打stream相关的补丁。
使用Stream的话Oracle推荐要打上如下补丁方可安全使用Stream。
$ ./opatch lsinventory
Invoking OPatch 10.2.0.4.3
<wbr>Oracle Interim Patch Installer version 10.2.0.4.3</wbr>
Copyright (c) 2007, Oracle Corporation.<wbr></wbr> All rights reserved.
<wbr>Oracle Home<span style="mso-spacerun:yes"><wbr><wbr><wbr><wbr><wbr><wbr></wbr></wbr></wbr></wbr></wbr></wbr></span> : /oracle/app/10.2</wbr>
Central Inventory : /oracle/oraInventory
<wbr><wbr></wbr></wbr> from<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr> : /etc/oraInst.loc
OPatch version<wbr><wbr><wbr></wbr></wbr></wbr> : 10.2.0.4.3
OUI version<wbr><wbr><wbr><wbr><wbr></wbr></wbr></wbr></wbr></wbr> <wbr></wbr>: 10.2.0.4.0
OUI location<wbr><wbr><wbr><wbr><wbr></wbr></wbr></wbr></wbr></wbr> : /oracle/app/10.2/oui
Log file location : /oracle/app/10.2/cfgtoollogs/opatch/opatch2011-09-28_16-40-0
7PM.log
<wbr>Lsinventory Output file location : /oracle/app/10.2/cfgtoollogs/opatch/lsinv/lsi</wbr>
nventory2011-09-28_16-40-07PM.txt
<wbr>--------------------------------------------------------------------------------</wbr>
Installed Top-level Products (2):
<wbr>Oracle Database 10g<span style="mso-spacerun:yes"><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></span> 10.2.0.1.0</wbr>
Oracle Database 10g Release 2 Patch Set 3<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr> <wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr>10.2.0.4.0
There are 2 products installed in this Oracle Home.
<wbr>There are no Interim patches installed in this Oracle Home.</wbr>
<wbr> --------------------------------------------------------------------------------</wbr>
<wbr>OPatch succeeded.</wbr>
上面结果可以得知,此环境没有打任何和streams相关的补丁。
客户环境在故障发生时还有另外一个现象,就是latch free,从下面的awr可以看到latch free非常高,在latch breakdown中可以查到是channel operations parent latch, 此latch非常少见,通常不会对数据库造成什么影响,但此次出现了,经过分析此为bug7810939,也是由于使用stream的bug。
软件开发商报Capture进程写trace文件造成文件巨大,并且由于文件被锁定无法清理,因此造成文件系统被消耗很快,不得不停机清理,具体表现为某trace文件里面不断的写如下信息:
LET: next scn before krvxrolf.scn: 0x0000.00000000
LET:<wbr></wbr> next scn after krvxrolf.
scn: 0xffff.ffffffff
LET: next scn before krvxrolf.
此又为streams的另一个 bug 6994160 : LET: next scn before krvxrolf. This patch already merged into patch 8680173
最后得出结论打上如下Stream的补丁:8680137, 7033630, 7219752, 6838714, 6827260, 6640411, 5933656另外加上latch补丁和写trace文件的补丁。
补丁应用很成功,启动数据库和流后,又发现另外一个我曾经给Shanghaionstar解决的一个问题:7257038:USING DBMS_REDEFINITION ON STREAMS TABLE CAUSES 'MISSING STREAMS MVDD'。
为了不影响数据库,我们使用rolling mode应用这个补丁,没想到的事情出来了,这个以前从没碰到过:
在RAC中一个节点应用这个补丁后,问题解决,但另一个节点在数据库启动后却没有解决,最后我们将数据库全部停下来,回退此补丁后再重新打一次,问题得到解决,由此说明oracle所谓的rolling mode patch并不是十分的可靠,可靠性可能在99.99%,而我在这次环境中就碰到的这个0.01%,因为我给这么多客户打过的补丁中以前是比没碰到过这种情况的。
综合一下:如果要使用streams,最好打上oracle推荐的补丁再进行配置,否则会有这样和那样的问题,并且这些补丁解决了一些streams的性能问题。