Diagnosing the cause of the Streams capture state PAUSED FOR FLOW CONTROL

Customer environment: two IBM p595 servers

Oracle: a 10.2.0.4 RAC exchanging data with a separate query database via Streams

Problem: when transactions are numerous and large, the capture process enters PAUSED FOR FLOW CONTROL and synchronization pauses.


Diagnosing the Streams synchronization interruption

1. After the interruption, the capture process state is PAUSED FOR FLOW CONTROL:

SELECT CAPTURE_NAME, STATE FROM V$STREAMS_CAPTURE;

CAPTURE_NAME                   STATE
------------------------------ ------------------------
STREAM_CAPTURE                 PAUSED FOR FLOW CONTROL
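
Besides the state itself, it is worth checking how far the capture process has fallen behind. A minimal sketch against the same view (column names as I recall them for 10.2; verify against your data dictionary):

select capture_name, state, total_messages_captured, total_messages_enqueued,
       capture_message_create_time, enqueue_message_create_time
  from v$streams_capture;

-- The *_create_time columns show the redo timestamps of the last message captured
-- and the last message enqueued, i.e. how far behind real time the capture is.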

In addition, select publisher_state from v$buffered_publishers returns PUBLISHER_STATE: IN FLOW CONTROL: TOO MANY UNBROWSED MESSAGES.
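
To see how badly flow control has backed up, the buffered queue can be examined alongside the publisher state. A minimal sketch (the v$buffered_publishers / v$buffered_queues column names are as I remember them for 10.2 and may need adjusting):

-- Which publishers are throttled and how many messages they currently hold in the queue
select queue_schema, queue_name, sender_name, num_msgs, publisher_state
  from v$buffered_publishers;

-- Backlog and spill-to-disk activity in the buffered queue itself
select queue_schema, queue_name, num_msgs, spill_msgs
  from v$buffered_queues;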

This also confirms that the problem lies somewhere in propagation or apply. However, checking v$streams_apply_server showed every apply server in the IDLE state, so the apply servers themselves were not the problem: the apply processes were simply not receiving any work. Whether propagation was at fault was harder to judge, so we changed two parameters that can affect propagation: we increased job_queue_processes and lowered _job_queue_interval to 1 (its original value was 5), hoping the propagation jobs would be scheduled more frequently. After restarting the database and Streams, the problem remained.

A careful check of the key propagation view dba_queue_schedules then showed failures = 0 and schedule_disabled = N for every schedule, i.e. the propagation jobs were all healthy, so propagation could essentially be ruled out as well.

We then monitored the wait events on nodes 203, 204 and 205 and found an anomaly on 205: extremely heavy latch free waits.
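
The checks and parameter changes described above look roughly like the following sketch; the job_queue_processes value shown is illustrative, and _job_queue_interval is a hidden parameter that should only be changed under Oracle Support guidance:

-- Apply servers: all idle here, so apply itself was not the bottleneck
select apply_name, state from v$streams_apply_server;

-- Propagation schedules: failures = 0 and schedule_disabled = 'N' mean the jobs are healthy
select qname, destination, failures, schedule_disabled, last_error_msg
  from dba_queue_schedules;

-- The parameter changes we tried (the hidden parameter needs a restart to take effect)
alter system set job_queue_processes = 20;                      -- 20 is an illustrative value
alter system set "_job_queue_interval" = 1 scope = spfile;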

How to check the wait events:

select * from v$session_wait where wait_class<>'Idle';

The latch free wait returned by the query above has P2 = 37, which identifies the latch; query v$latchname to see which latch it is:

SQL> select * from v$latchname where latch#=37;

    LATCH# NAME                                                     HASH
---------- -------------------------------------------------- ----------
        37 channel operations parent latch                    3794363110
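
To confirm which process is actually holding the latch, the holder can be looked up directly; a sketch using the standard v$latchholder and v$process views:

-- Sessions currently waiting on the latch (P2 carries the latch#)
select sid, event, p2 latch#, seconds_in_wait
  from v$session_wait
 where event = 'latch free';

-- Who holds it? In this incident the holder turned out to be the LCK0 background process.
select h.pid, h.sid, h.name latch_name, p.spid, p.program
  from v$latchholder h, v$process p
 where p.pid = h.pid;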

Further diagnosis revealed that one Oracle background process, LCK0, was blocking almost all activity on 205. Oracle's explanation for this behavior is:

When SMON needs to recover a large transaction and the recovery cannot be done in parallel, it can take longer than usual for it to read the messages from the 'obj stat del channel' broadcast channel.

LCK0 is also a subscriber of this channel, and if the messages are not read by all subscribers then they are not purged, and LCK0 has to traverse the whole list of messages to find a new one, causing it to consume CPU and hold the "channel operations parent latch" while doing it.

In other words, a very large transaction that the database was executing in parallel was interrupted unexpectedly or killed manually, and SMON had to recover and undo it. Because this takes a long time, the LCK0 process holds the channel operations parent latch and blocks the operations of other sessions. The detailed explanation of the problem is:

Since SMON was busy recovering a large dead transaction, it could not process any of the broadcast messages being sent to it. This in turn caused a backlog of messages to build up on the "obj stat del channel" broadcast channel. This channel is used to inform SMON (on each node) that an object has been truncated/dropped and there is no longer a requirement to maintain any in-memory object statistics structure (ksolselm) for that object. The LCK0 process is also a subscriber to this channel (LCK0 acts as a gateway for the inter-instance broadcast mechanism; LCK0 reads the messages broadcast to local channels and then sends these messages to other instances).

Now if the messages start to pile up on the "obj stat del channel" broadcast channel, then LCK0 has to continually traverse all the messages (linearly) until reaching the latest/new unread messages at the end. The traversal by LCK0 is done holding the 'channel operations parent latch'. There is competition for this latch from other processes which are dropping/truncating temporary objects, plus there is competition from PMON, which is trying to clean up dead processes that owned temporary objects. LCK0 is slowed and it is slow responding to global enqueue requests, which in turn slows down SMON, and PMON takes a long time cleaning just one dead process due to having to check for a possible dead latch holder, and the whole situation gets worse. (Note: as soon as SMON completes the transaction recovery task, the performance problem should immediately clear up.)
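
Whether SMON is still busy rolling back the dead transaction, and roughly how far along it is, can be checked before deciding to wait or to restart. A sketch assuming the usual 10.2 views (x$ktuxe must be queried as SYS):

-- Progress of the transaction recovery being performed by SMON / parallel recovery slaves
select usn, state, undoblocksdone, undoblockstotal
  from v$fast_start_transactions;

-- Dead transactions still pending rollback; ktuxesiz is the number of undo blocks remaining
select ktuxeusn, ktuxeslt, ktuxesqn, ktuxesiz
  from x$ktuxe
 where ktuxecfl like '%DEAD%';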

During this episode, 205 could not be shut down normally; the shutdown reported a problem with PMON. The PMON-related trace below matches the behavior described above exactly.

--====================================

*** 2011-09-21 20:59:43.256

*** SERVICE NAME:(SYS$BACKGROUND) 2011-09-21 20:59:43.249

*** SESSION ID:(1325.1) 2011-09-21 20:59:43.249

PMON unable to acquire latch 7000008054b69c0 Child channel operations parent latch level=5 child#=59

Location from where latch is held: ksrmfree():

Context saved from call: 504403192714062200

state=busy(exclusive) (val=0x2000000000000125) holder orapid = 293

waiters [orapid (seconds since: put on list, posted, alive check)]:

343 (45, 1316609983, 3)

350 (42, 1316609983, 3)

351 (33, 1316609983, 3)

359 (15, 1316609983, 3)

357 (6, 1316609983, 3)

353 (3, 1316609983, 3)

waiter count=6

gotten 18094214 times wait, failed first 4390982 sleeps 2171306

gotten 0 times nowait, failed: 0

possible holder pid = 293 ospid=1966160

----------------------------------------

SO: 7000008082f5848, type: 2, owner: 0, flag: INIT/-/-/0x00

(process) Oracle pid=293, calls cur/top: 7000007e916fd70/7000007f9cc79a0, flag: (0) -

int error: 1089, call error: 0, sess error: 0, txn error 0

(post info) last post received: 1089 8 12

last post received-location: ksusig

last process to post me: 70000080f2f5468 2 0

last post sent: 504403192714062272 37 2

last post sent-location: kslges

last process posted by me: 70000080631a728 1 14

(latch info) wait_event=0 bits=20

holding (efd=15) 7000008054b69c0 Child channel operations parent latch level=5 child#=59

Location from where latch is held: ksrmfree():

Context saved from call: 504403192714062200

state=busy(exclusive) (val=0x2000000000000125) holder orapid = 293

waiters [orapid (seconds since: put on list, posted, alive check)]:

343 (45, 1316609983, 3)

350 (42, 1316609983, 3)

351 (33, 1316609983, 3)

359 (15, 1316609983, 3)

357 (6, 1316609983, 3)

353 (3, 1316609983, 3)

waiter count=6

Process Group: DEFAULT, pseudo proc: 70000080738d260

O/S info: user: oracle, term: UNKNOWN, ospid: 1966160

OSD pid info: Unix process pid: 1966160, image: oracle@cxdb2

Short stack dump:

ksdxfstk+002c

Dump of memory from 0x07000008082CA988 to 0x07000008082CAB90

7000008082CA980 0000000B 00000000 [........]

7000008082CA990 07000007 E8148FB0 00000010 000313A7 [................]

7000008082CA9A0 07000007 F9CC79A0 00000003 000313A7 [......y.........]

7000008082CA9B0 07000008 075D2088 0000000B 000313A7 [.....] .........]

7000008082CA9C0 07000008 043FC758 00000004 0003129B [.....?.X........]

7000008082CA9D0 07000007 FE5867B8 00000007 000313A7 [.....Xg.........]

7000008082CA9E0 07000007 FE586898 00000007 000313A7 [.....Xh.........]

7000008082CA9F0 07000007 FE586978 00000007 000313A7 [.....Xix........]

7000008082CAA00 07000007 FE586B38 00000007 000313A7 [.....Xk8........]

7000008082CAA10 07000007 FE586C18 00000007 000313A7 [.....Xl.........]

7000008082CAA20 07000008 075D23B8 0000000B 000313A7 [.....]#.........]

7000008082CAA30 07000008 075D24C8 0000000B 000313A7 [.....]$.........]

7000008082CAA40 00000000 00000000 00000000 00000000 [................]

Repeat 20 times

*** 2011-09-21 20:59:43.314

--====================================

Conclusion:

Based on the analysis above, the Streams synchronization pause was caused by a large transaction that had been killed: Oracle's SMON was recovering that transaction internally while PMON was trying to clean up the dead process, and the two locked or blocked each other, leaving 205 in an almost hung state. Propagated messages therefore could not be enqueued at all, which is why every apply server on 205 showed the IDLE state. The large number of Streams capture and apply processes we had deployed also contributed to the problem.

Workaround:

If this situation occurs again, first check the wait events on 205:

select * from v$session_wait where wait_class<>'Idle';

Then verify that the wait event is latch free, and if it is, confirm whether the latch is the channel operations parent latch. If so, stop all of the Streams processes first (see the sketch below). If possible, restart 205; if a restart is not convenient, start only a few Streams processes, and bring the rest back up once the large transaction has been recovered.
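
A sketch of stopping and restarting the Streams components with the standard admin packages; STREAM_CAPTURE is the capture name from this environment, while the propagation and apply names are placeholders to be replaced with your own:

begin
  -- stop capture (and, per component, propagation/apply) while SMON finishes the recovery
  dbms_capture_adm.stop_capture(capture_name => 'STREAM_CAPTURE');
  -- dbms_propagation_adm.stop_propagation(propagation_name => 'YOUR_PROPAGATION');
  -- dbms_apply_adm.stop_apply(apply_name => 'YOUR_APPLY');   -- run on the apply database
end;
/

begin
  -- restart once the dead transaction has been recovered
  dbms_capture_adm.start_capture(capture_name => 'STREAM_CAPTURE');
end;
/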

Recommendations:

First, make sure sessions are never killed manually without good reason; second, break large transactions up into smaller ones.
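
As an illustration of splitting up a large transaction, a bulk delete can be committed in batches rather than as one huge transaction, so that an unexpected kill never leaves SMON with an enormous dead transaction to roll back. A minimal sketch (big_table, processed_flag and the 10,000-row batch size are hypothetical):

begin
  loop
    delete from big_table
     where processed_flag = 'Y'
       and rownum <= 10000;      -- work in small batches
    exit when sql%rowcount = 0;  -- nothing left to delete
    commit;                      -- keep each transaction small
  end loop;
  commit;
end;
/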

Source: ITPUB blog, http://blog.itpub.net/19602/viewspace-1055337/. Please credit the source when reposting.
