一个长时间parse的bug解决过程

开发同事说一个sql运行2个多小时没有返回,问是否存在异常.
简单检查后(我的工作记录中忘记记录当时session的等待事件),
1. 发现这个sql消耗1个cpu的100%时间.
2. 检查v$sql发现PLAN_HASH_VALUE为0,在v$sql_plan中也不存在对应sql_id的执行计划信息.
3. 使用pl/sql developer查看这个sql的执行计划时,一直没有显示,而且也是消耗1个cpu,
   使用sqlplus中explain plan for来获取执行计划时,也出现同样现象.

初步判断sql在parse阶段出现异常,由于以前没有遇到类似问题,所以使用pstack来记录当前函数栈的情况,
以便查找bug对比.
toracle@ssz0009dww: ~$pstack 7148
7148:   oraclexxx (LOCAL=NO)
 00000000026aaaf5 kkosbn () + 2c5
 00000000026a157a kkobmp () + 50a
 000000000266ddb8 kkotap () + 338
 000000000265814d kkojnp () + d4d
 0000000002656ca3 kkocnp () + 93
 0000000002644222 kkooqb () + 8d2
 0000000002638a03 kkoqbc () + 813
 000000000276c83f apakkoqb () + 8f
 000000000276d047 apaqbd () + 187
 000000000276a5ce apadrv () + 1fe
 0000000002935e2c opitca () + 5cc
 0000000001fe190d kksFullTypeCheck () + 1d
 0000000000e9bbdc rpiswu2 () + 18c
 0000000001fe6b85 kksLoadChild () + 1d15
 0000000001f0655f kxsGetRuntimeLock () + 61f
 0000000001ffe0fb kksfbc () + 15fb
 0000000001ff998f kkspsc0 () + 4ff
 0000000001ffb813 kksParseCursor () + 93
 000000000292693d opiosq0 () + 75d
 00000000028d169d kpooprx () + dd
 00000000028ceb69 kpoal8 () + 2d9
 0000000000e97c6c opiodr () + 41c
 0000000003d9f6da ttcpip () + 46a
 0000000000e939d3 opitsk () + 503
 0000000000e96f18 opiino () + 3a8
 0000000000e97c6c opiodr () + 41c
 0000000000e924d1 opidrv () + 2f1
 0000000000e8f90b sou2o () + 5b
 0000000000e552e4 opimai_real () + 84
 0000000000e551b4 main () + 64
 0000000000e54ffc ???????? ()

注意到kksParseCursor(猜测是parse有关的函数),当前正在执行此函数,这也说明parse阶段出现异常.
一方面在metalink中查找类似bug, 同时在网上查找类似案例.

在metalink中发现相似bug:5667683(尽管版本不同).
按照文档的解决方法删除统计数据后(我们无法打patch),发现sql正常执行.
重新获取统计后,问题重现.尝试几次后证明这个问题是可重现的.
但我不希望最终采用这个方法,因为这样一来使用这几个表的其它sql会受到影响.(metalink中没有找到其他解决方法)
我希望通过修改某种参数来避开.在网上发现Tanel的播客中描述了这种长时间处于编译阶段的案例,
请教他之后,最终按照他的回答,将参数_b_tree_bitmap_plans设置为FALSE避开了这个问题.

下面是Tanel中的描述和他当时推荐的几种方法.

Advanced Oracle Troubleshooting Guide, Part 4: Diagnosing a long parsing issue

byTanel PoderPosted onJune 3, 2008

There was a recent thread in Oracle Forums about a session getting stuck somewhere when a specific SQL was issued. The SQL executed did not return at all unless ORDERED hint was used. Even the EXPLAIN PLAN command (which only parses the statement, doesn’t execute it) did never return.

Classic tracing + tkprof techniques didn’t show much (just some recursive queries consuming insignificant amounts of time).

The proven V$SESSION_WAIT sampling technique didn’t reveal anything as it showed the session being constantly on CPU (the wait state = ‘WAITED KNOWN TIME’ which means session is on CPU) and SEQ# didn’t increase (which means that wait state did not change over time).

Due the symptoms described above I was well prepared to troubleshoot this issue. This looks exactly like one of the troubleshooting use cases I demonstrate in of my Advanced Oracle Troubleshooting class (nice embedded advertisment, huh? ;)

In such a case where tracing and V$ views don’t provide any useful information about what the session is doing, I normally look into few stack traces of the server process. In this case I asked the poster to do this and here is the result:

21820: oracleXXXXXX (DESCRIPTION=(LOCAL=YES)(ADDRESS=(PROTOCOL=beq)))

00000001021f5f58 kkosbn (ffffffff7ad79910, ff, ffffffff7ad96f58, ffffffff7ad96f58, 10000, ffffffff7ad96f58) + 2d8

00000001021eb278 kkobmp (ffffffff7ad790f0, 10000, ffffffff7ad73e18, 0, 0, ffffffff7ad797a0) + 4d8

00000001021ad348 kkotap (0, 0, 0, ffffffff7ad79910, ffffffff7ad73e18, ffffffff7ad790f0) + 4e8

000000010218fd4c kkojnp (382660a28, 0, ffffffff7ad790f0, 10218f160, 1065327d0, ffffffff7ad797a0) + bac

000000010218e828 kkocnp (106400, 200, ffffffff7ad790f0, 1, 0, 1065327d0) + e8

00000001021757a4 kkooqb (102174, 0, 3, 0, ffffffff7af15d00, ffffffff7ad790f0) + c04

0000000102165ac0 kkoqbc (0, ffffffff7af15d00, 0, 1065327d0, 6, 102165380) + 720

00000001022f3700 apakkoqb (ffffffff7fff6810, ffffffff7af15d00, 3843e26e0, 0, c200080, 0) + 80

00000001022f4110 apaqbd (ffffffff7fff6810, ffffffff7af15d00, 3843e26e0, ffffffff7ad7fd38, 381cbf738, 200000) + 170

00000001022f1014 apadrv (3843e26e0, 106532, ffffffff7af43f28, 106531b28, ffffffff7af15d00, 106532) + 294

0000000102529dc8 opitca (3, 3843e26e0, ffffffff7af43f28, 0, 1065327d0, 106525000) + 9c8

00000001019f1f8c kksFullTypeCheck (ffffffff7fff7b70, 1056d8c28, 106525898, ffeffc00, 1065253d8, 1056d8c18) + c

000000010034c894 rpiswu2 (0, 106525898, 0, ffffffff7fff72a8, 1056d8000, 106525) + 1f4

00000001019f8ca8 kksLoadChild (106400, ffffffff7fff7b70, ffffffff7af43f28, 106525898, ffffffff7fff91c0, 3843e26e0) + 22e8

00000001018f767c kxsGetRuntimeLock (0, 0, 180, 3834b2578, e9e733ed, 106525898) + 5bc

0000000101a18f24 kksfbc (106531, 0, 108, 0, 0, 382d0d758) + 3de4

0000000101a12618 kkspsc0 (1000, fffdffff, 0, 108, 10000000, 0) + 1838

0000000101a13844 kksParseCursor (ffffffff7fffa8b8, 3c305e100, 0, 0, ffffffff7af32260, 3) + e4

0000000102515d3c opiosq0 (1056d8c18, ffffffff7af32260, 1056d8, 20a, 1056d8000, 20b) + 85c

00000001024b1078 kpooprx (ffffffff7fffe07c, ffffffff7fffbec0, 106400, ffffffff7fffabd0, 0, 0) + d8

00000001024ad6d8 kpoal8 (ffffffff7fffbe88, 106528, a4, 8061, 1, 106532860) + 2b8

0000000100346ccc opiodr (1057687c0, 5e, 106525600, 1, ffffffff7af26738, 106528) + 60c

0000000103d74124 ttcpip (34, ffffffff7fffcb50, 1056beb2c, 1056bc310, 0, 106525718) + 504

0000000100341238 opitsk (106532868, 1, 0, 106532860, 105882f78, fffffffd) + 598

0000000100345888 opiino (106400, 106532860, 0, 40002801, 106400, 106532860) + 468

0000000100346ccc opiodr (1065319d8, 3c, 106400, 1065327e0, 106532, 106528) + 60c

000000010033f660 opidrv (106534f60, 0, 3c, 106531c98, 3c, 0) + 380

0000000100339c30 sou2o (ffffffff7ffff2a8, 3c, 4, ffffffff7ffff2d0, 105e0c000, 105e0c) + 50

00000001002fbf5c opimai_real (2, ffffffff7ffff3a8, 104050b4c, 1064cd848, 247cbfc, 14800) + 7c

00000001002fbe18 main (2, 8, 0, ffffffff7ffff3b8, ffffffff7ffff4c8, ffffffff7b500140) + 98

00000001002fbd3c _start (0, 0, 0, 0, 0, 0) + 17c

 

And one of the highlighted functions (kksParseCursor) already gives out that this process is parsing.
And the other function (kksLoadChild) gives out that we are in process of hard parsing (as soft parsing does not need to load child cursors into library cache)

Some more bits:

§  The rpiswu function is the one initializing a separate runtime execution context for any hard parsing recursive activity ( RPI = Recursive Program Interface )

§  The apa* functions above also indicate parsing action

§  The kko* functions are the ones taking care of SQL query transformation, validation and various other checks

On Oracle 10g+ you can actually measure session execution time spent parsing by sampling V$SESS_TIME_MODEL, what I have done using my Snapper tool here:

SQL> @sn 10 106

 

-- Session Snapper v1.07 by Tanel Poder ( http://www.tanelpoder.com )

 

---------------------------------------------------------------------------------------------------------------------------------------------

HEAD,     SID, SNAPSHOT START   ,  SECONDS, TYPE, STATISTIC                               ,         DELTA,  DELTA/SEC,     HDELTA, HDELTA/SEC

---------------------------------------------------------------------------------------------------------------------------------------------

DATA,     106, 20080603 06:31:17,       10, TIME, hard parse elapsed time                 ,      10124018,    1012402,     10.12s,      1.01s

DATA,     106, 20080603 06:31:17,       10, TIME, parse time elapsed                      ,      10124018,    1012402,     10.12s,      1.01s

DATA,     106, 20080603 06:31:17,       10, TIME, DB CPU                                  ,      10113462,    1011346,     10.11s,      1.01s

DATA,     106, 20080603 06:31:17,       10, TIME, DB time                                 ,      10124018,    1012402,     10.12s,      1.01s

DATA,     106, 20080603 06:31:17,       10, WAIT, events in waitclass Other               ,            51,          5,       51us,      5.1us

--  End of snap 1

 

PL/SQL procedure successfully completed.

 

However as the original poster had STATISTICS_LEVEL=BASIC, these values were zero, for Time Model stats gathering STATISTICS_LEVEL needs to be set to either TYPICAL or ALL.

…but reading a stack trace is more fun, isn’t it? ;-)

Related Posts

§  Advanced Oracle Troubleshooting Guide – Part 10: Index unique scan doing multiblock reads?!

§  Advanced Oracle Troubleshooting Guide – Part 11: Complex Wait Chain Signature Analysis with…

§  When do Oracle Parallel Execution Slaves issue buffered physical reads – Part 2?

§  When do Oracle Parallel Execution Slaves issue buffered physical reads – Part 1?

§  Diagnosing buffer busy waits with the ash_wait_chains.sql script (v0.2)

Share this post:

§   

§   

§   

§   

§  More

§   

§   

§   

§   

§   

§   

§   

§   

§   

§   

§   

§   

This entry was posted in Oracle and tagged Internals, Troubleshooting. Bookmark the permalink.

Snapper shortcut

Performance Tools Quick Reference Guide

5 Responses to Advanced Oracle Troubleshooting Guide, Part 4: Diagnosing a long parsing issue

1.       Charles says:

September 18, 2009 at 9:41 pm

Hi Tanel,
A big fan of your blogs…
Please, tell me how you can tell from your trace that the kksParseCursor and kksLoadChild functions are currently being executed, as opposed to them having BEEN executed at some point in the past?
Thank you if you have time to respond!

Reply

2.      wanqisheng says:

July 15, 2014 at 1:53 am

hello Tanel, i’m meeted the same bug.
from metalink, compare with stack , it’s similar with Bug:5667683,
but version is different, my version is 10.2.0.3 for solaris ; at doc 421913.1, the version is 10.2.0.2.
i can’t patch the db, we can set some parameter to avoid it?
also i will try to follow Bug 5667683 to remove stats.
can you tell me how to process and avoid it? thanks

$pstack 7148
7148: oraclexxx (LOCAL=NO)
00000000026aaaf5 kkosbn () + 2c5
00000000026a157a kkobmp () + 50a
000000000266ddb8 kkotap () + 338
000000000265814d kkojnp () + d4d
0000000002656ca3 kkocnp () + 93
0000000002644222 kkooqb () + 8d2
0000000002638a03 kkoqbc () + 813
000000000276c83f apakkoqb () + 8f
000000000276d047 apaqbd () + 187
000000000276a5ce apadrv () + 1fe
0000000002935e2c opitca () + 5cc
0000000001fe190d kksFullTypeCheck () + 1d
0000000000e9bbdc rpiswu2 () + 18c
0000000001fe6b85 kksLoadChild () + 1d15
0000000001f0655f kxsGetRuntimeLock () + 61f
0000000001ffe0fb kksfbc () + 15fb
0000000001ff998f kkspsc0 () + 4ff
0000000001ffb813 kksParseCursor () + 93
000000000292693d opiosq0 () + 75d
00000000028d169d kpooprx () + dd
00000000028ceb69 kpoal8 () + 2d9
0000000000e97c6c opiodr () + 41c
0000000003d9f6da ttcpip () + 46a
0000000000e939d3 opitsk () + 503
0000000000e96f18 opiino () + 3a8
0000000000e97c6c opiodr () + 41c
0000000000e924d1 opidrv () + 2f1
0000000000e8f90b sou2o () + 5b
0000000000e552e4 opimai_real () + 84
0000000000e551b4 main () + 64

Reply

§  Tanel Poder says:

July 15, 2014 at 5:47 am

Are you using bitmap indexes? (as the bug 5667683 mentions bitmap indexes) and the kkobmp() function sounds like a bitmap index related one.
You could try a few options:
1) use a lower optimizer_features_enabled value for this session/query in hope of optimizer not hitting this bug
2) test the query by making the bitmap indexes invisible or by using a bunch of NO_INDEX hints listing all bitmap indexes on the tables accessed by this query – again in hope of not hitting this bug as you’ve prohibited evaluating these indexes
3) test with setting the _b_tree_bitmap_plans = FALSE. especially when you don’t have bitmap indexes in your schema. This disables some CBO codepath and would help in case this spin happens in this codepath.
4) patch the bug

Reply

§  wanqisheng says:

August 15, 2014 at 1:14 am

sorry , until now to reply . At first, i’m follow Bug 5667683 description,
remove statistics data, then lock it, the sql is can parse and run normal.
Today, i have time to test your advice. before test, i’m unlock statistics data, and gather it, the sql is spin agin.
1. use a lower optimizer_features_enabled value .
at first, i’m use 10.2.0.1 to test(current is 10.2.0.3), the sql is spin again,
check stack, it’s same with before.
then i’m use 9.2.0.8, it’s ok, it’s finished at moment.
2. if modify sql, it’s need developer modify and deploy new package for all
location, so i’m not test it.
3. setup the _b_tree_bitmap_plans = FALSE.
(at our db, we don’t use bitmap index.)
yes, the sql is run normal.
after test, so i’m set the parameter in my spfile.
thanks very much.

I have click “post comment” , but IE refresh long time and not display,
so i ‘m reply again.

Reply

§  Tanel Poder says:

August 17, 2014 at 2:49 pm

Thanks for getting back with your results!

 

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值