Handling an ORA-04031 Shared Pool Memory Error in an Oracle 11.2.0.2.0 RAC Environment

Environment

Database version: 11.2.0.2.0

Operating system: AIX 6.1


I. Symptoms

1. Load on one server in the RAC cluster, node2, climbed (it is normally around 90% idle) while the database on node1 stayed normal. Below is the sar report from node2 at the time:

bash-3.00# sar -f sa15|more

AIX p740db2 1 6 00F688714C00    08/15/12

System configuration: lcpu=32  mode=Capped

00:00:00    %usr    %sys    %wio   %idle   physc

16:15:00       5       2       4      89    8.00

16:16:00       8       5       0      87    8.00

16:17:00      36       6       0      57    8.00

16:18:00      50       5       4      42    8.00

16:19:00       7       2       7      85    8.00

16:20:00       6       2       5      88    8.00

16:21:00       6       2       8      84    8.00

16:22:00       6       2       9      83    8.00

16:23:00       7       2       7      84    8.00

16:24:00       6       2       5      87    8.00

16:25:00       6       2       6      87    8.00

16:26:00       6       2       6      86    8.00

16:27:00       5       2       5      89    8.00

16:28:00       5       2       5      88    8.00

16:29:00       5       2       4      89    8.00

16:30:00       5       2       4      90    8.00

16:31:00      25       3       3      70    8.00

16:32:01      63       2       0      34    8.00

16:33:00      27       2       0      71    8.00

16:34:00      46       2       0      51    8.00

16:35:00      66       2       0      32    8.00

16:36:01      37       2       0      61    8.00

16:37:00      27       2       0      72    8.00

16:38:00      15       1       0      83    8.00

16:39:00      29       2       0      69    8.00

16:40:00      41       2       0      57    8.00

16:41:00      45       2       0      53    8.00

16:42:00      47       2       0      51    8.00

16:43:00      39       2       0      59    8.00

16:44:00      50       2       1      48    8.00

16:45:00      15       2       5      78    8.00

16:46:00       1       1       3      95    8.00

16:47:00       6       2       1      91    8.00

 

2. Logons to the database hung with no response for a long time. The alert log shows ORA-04031 errors (shared pool exhausted) starting at 16:15:

Archived Log entry 15904 added for thread 2 sequence 2418 ID 0xffffffffd412add4 dest 1:

Wed Aug 15 16:15:09 2012

Errors in file /db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_24707320.trc  (incident=5241532):

ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select company_id,user_id fr...","KGLH0^28668775","kglHeapInitialize:temp")

Incident details in: /db/diag/rdbms/xxxxshprod/xxxxshpd2/incident/incdir_5241532/xxxxshpd2_ora_24707320_i5241532.trc

Wed Aug 15 16:15:09 2012

Non critical error ORA-48913 caught while writing to trace file "/db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_40042558.trc"

Writing to the above trace file is disabled for now on...

Non critical error ORA-48913 caught while writing to trace file "/db/diag/rdbms/xxxxshprod/xxxxshpd2/incident/incdir_5241532/xxxxshp

d2_ora_24707320_i5241532.trc"

Writing to the above trace file is disabled for now on...

Wed Aug 15 16:16:02 2012

Errors in file /db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_36044968.trc  (incident=5241340):

ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select count(*) as totalrec ...","KGLH0^c0f94bbb","kglHeapInitialize:temp")

 

II. Diagnosis

1. Check ASH for this period. The most significant wait event is 'cursor: pin S wait on X':

 

SQL> select * from (

  2  select /*+ rule */

  3   event,count(1) cnt

  4    from dba_hist_active_sess_history

  5   where sample_time > to_date('2012-08-15 16:15:00', 'yyyy-MM-dd HH24:mi:ss')

  6     and sample_time < to_date('2012-08-15 16:40:00', 'yyyy-MM-dd HH24:mi:ss')

  7     group by event

  8     order by cnt desc) where rownum<=5;

 

 

EVENT                                                                   CNT

---------------------------------------------------------------- ----------

cursor: pin S wait on X                                                 944

latch: shared pool                                                      305

db file sequential read                                                 277

null event                                                               95

library cache: mutex X                                                   87

 

SQL>

 

Find the SQL associated with this wait event:

SQL> select sql_id,count(1)

  2    from dba_hist_active_sess_history

  3   where

  4   sample_time > to_date('2012-08-15 16:15:00', 'yyyy-MM-dd HH24:mi:ss')

  5   and sample_time < to_date('2012-08-15 16:40:00', 'yyyy-MM-dd HH24:mi:ss')

  6   and event = 'cursor: pin S wait on X'

  7   group by sql_id;

 

SQL_ID          COUNT(1)

------------- ----------

22bb9t1167adc          5

6yx2qgbpw6j8x        909

bunvx480ynf57         22

dcrjrtxqrsrsw          8

From the output above, the problem SQL_ID is 6yx2qgbpw6j8x. Retrieve the SQL text for that SQL_ID:

select * from v$sql where sql_id='6yx2qgbpw6j8x';

The SQL text is:

select company_id,user_id from t_xxxx_user where user_name=:1

 

Also check whether this SQL's execution plan is sane. As the output below shows, the SQL is very simple: the table holds only 883 rows in total, and the query does a unique index scan:

SQL>

SQL> --Check the historical execution plan

SQL> SELECT sp.id,

  2         lpad(' ', sp.DEPTH * 3, ' ') || sp.OPERATION || ' ' || sp.OPTIONS operation,

  3         sp.OBJECT_NAME

  4    FROM dba_hist_sql_plan sp

  5   WHERE SQL_ID = '6yx2qgbpw6j8x'

  6   ORDER BY ID;

 

        ID OPERATION                                                                        OBJECT_NAME

---------- -------------------------------------------------------------------------------- -------------------------------

         0 SELECT STATEMENT                                                                 

         1    TABLE ACCESS BY INDEX ROWID                                                   T_XXXX_USER

         2       INDEX UNIQUE SCAN                                                          U_TXXXX_USER__USER_NAME

 

SQL> select * from table(dbms_xplan.display_awr('6yx2qgbpw6j8x'));

 

PLAN_TABLE_OUTPUT

--------------------------------------------------------------------------------

SQL_ID 6yx2qgbpw6j8x

--------------------

select company_id,user_id from t_xxxx_user where user_name=:1

Plan hash value: 377560326

--------------------------------------------------------------------------------

| Id  | Operation                   | Name                    | Rows  | Bytes |

--------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |                         |       |       |

|   1 |  TABLE ACCESS BY INDEX ROWID| T_XXXX_USER             |     1 |    20 |

|   2 |   INDEX UNIQUE SCAN         | U_TXXXX_USER__USER_NAME |     1 |       |

--------------------------------------------------------------------------------

 

14 rows selected
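Because 'cursor: pin S wait on X' often goes hand in hand with cursor-sharing trouble, it is also worth counting how many child cursors the statement has accumulated. A minimal sketch against v$sql, using the SQL_ID identified above:

```sql
-- Count child cursors and executions for the suspect statement.
-- A large or fast-growing child count would point at cursor sharing problems.
select sql_id,
       count(*)        as child_cursors,
       sum(executions) as total_executions
  from v$sql
 where sql_id = '6yx2qgbpw6j8x'
 group by sql_id;
```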

 

 

2. Compare the AWR reports for this period:

Comparing the AWR report from the problem window against a normal baseline, average hard parses and soft parses per second show no obvious increase. The "SQL ordered by Parse Calls" and "SQL ordered by Version Count" sections are also in line with normal, including the execution count of SQL_ID 6yx2qgbpw6j8x. This points away from an application change, so the initial suspicion is an Oracle bug.

AWR during the problem window:

Load Profile

                              Per Second  Per Transaction  Per Exec  Per Call
DB Time(s):                         13.8              0.3      0.02      0.01
DB CPU(s):                           0.7              0.0      0.00      0.00
Redo size:                     194,547.2          4,767.2
Logical reads:                   7,571.0            185.5
Block changes:                     815.7             20.0
Physical reads:                    296.2              7.3
Physical writes:                    71.7              1.8
User calls:                      1,339.8             32.8
Parses:                            497.4             12.2
Hard parses:                        18.2              0.5
W/A MB processed:                    1.3              0.0
Logons:                              0.1              0.0
Executes:                          774.7             19.0
Rollbacks:                           0.0              0.0
Transactions:                       40.8

Instance Efficiency Percentages (Target 100%)

Buffer Nowait %:               99.90    Redo NoWait %:      100.00
Buffer Hit %:                  96.12    In-memory Sort %:   100.00
Library Hit %:                 92.39    Soft Parse %:        96.35
Execute to Parse %:            35.79    Latch Hit %:         99.73
Parse CPU to Parse Elapsd %:    2.49    % Non-Parse CPU:     73.0

 

 

 

 

AWR under normal conditions:

Load Profile

                              Per Second  Per Transaction  Per Exec  Per Call
DB Time(s):                          2.7              0.1      0.00      0.00
DB CPU(s):                           0.3              0.0      0.00      0.00
Redo size:                     182,077.7          4,810.9
Logical reads:                   9,038.7            238.8
Block changes:                     756.9             20.0
Physical reads:                    338.2              8.9
Physical writes:                    78.9              2.1
User calls:                      1,284.9             34.0
Parses:                            482.7             12.8
Hard parses:                        17.5              0.5
W/A MB processed:                    1.0              0.0
Logons:                              0.1              0.0
Executes:                          744.9             19.7
Rollbacks:                           0.0              0.0
Transactions:                       37.9

Instance Efficiency Percentages (Target 100%)

Buffer Nowait %:               99.94    Redo NoWait %:      100.00
Buffer Hit %:                  96.29    In-memory Sort %:   100.00
Library Hit %:                 92.88    Soft Parse %:        96.37
Execute to Parse %:            35.20    Latch Hit %:         99.78
Parse CPU to Parse Elapsd %:   16.07    % Non-Parse CPU:     92.86
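The same parse-rate comparison can also be made directly from the AWR repository rather than from the rendered reports. A sketch, assuming the Diagnostics Pack is licensed; &begin_snap and &end_snap are placeholder snapshot IDs bracketing the window of interest:

```sql
-- Delta of total and hard parse counts between two AWR snapshots,
-- per instance. Divide by the snapshot interval for a per-second rate.
select e.instance_number,
       e.stat_name,
       e.value - b.value as delta
  from dba_hist_sysstat b,
       dba_hist_sysstat e
 where b.snap_id = &begin_snap
   and e.snap_id = &end_snap
   and b.dbid = e.dbid
   and b.instance_number = e.instance_number
   and b.stat_id = e.stat_id
   and e.stat_name in ('parse count (total)', 'parse count (hard)');
```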

 

 

 

3. MOS note 1349387.1 contains the following passage: "If the database has been migrated from 10g to 11g and mutex performance issue surfaces, please consider the 11.2.0.2.2 psu + fix for 12431716. Many mutex fixes are already included in this patch." The description of Bug 12431716 matches both this version and these symptoms, and Oracle strongly recommends applying patch 12431716 on 11.2.0.2.
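Before requesting the patch it is worth confirming which PSU, if any, the database is already on. A sketch using the database's own patch history view (interim patches in the Oracle home are listed separately by opatch lsinventory):

```sql
-- List the patch/PSU apply history recorded inside the database.
select action_time, action, version, id, comments
  from dba_registry_history
 order by action_time;
```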


Troubleshooting 'cursor: pin S wait on X' waits. [ID 1349387.1]

Modified: Jan 7, 2012    Type: TROUBLESHOOTING    Status: PUBLISHED    Priority: 1

In this Document
  Purpose
  
Last Review Date
  
Instructions for the Reader
  
Troubleshooting Details
     
What is a 'Cursor: pin S wait on X' wait?
     
What causes 'Cursor: pin S wait on X' waits?
     
How to diagnose the cause.
     
How to Examine the Diagnostics.
     
Potential Solutions
     
Troubleshooting Other Issues
  
References

Applies to:

Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.2.0.2 - Release: 10.2 to 11.2
Information in this document applies to any platform.

Purpose

The purpose of the note is to help customers troubleshoot 'Cursor: pin S wait on X' waits.

Last Review Date

August 17, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

What is a 'Cursor: pin S wait on X' wait?

A cursor wait is associated with parsing in some form. A session may wait on this event when it is trying to get a mutex pin in Share mode while another session holds the mutex pin on the same cursor object in eXclusive mode. Frequently, waits for 'Cursor: pin S wait on X' are a symptom and not the cause; there may be underlying tuning requirements or known issues.

What causes 'Cursor: pin S wait on X' waits?

  • Frequent Hard Parses
    If the frequency of Hard Parsing is extremely high, then contention can occur on this pin.
  • High Version Counts
    When Version counts become excessive, a long chain of versions needs to be examined and this can lead to contention on this event
  • Known bugs

How to diagnose the cause.

Obtain information and diagnostics to help locate the cause.  

1. Run AWR and ADDM reports during the time that you see 'Cursor: pin S wait on X' waits and also collect baseline reports when you do not see the waits to make comparisons.  The baseline shows us the typical 'background' concurrency and activity going on during the time of both the issue and the baseline and  may help identify (for example) culprit SQLs with high version counts. 

To gather this it is suggested to run AWR and ADDM for half an hour to an hour interval as follows:

SQL>@$ORACLE_HOME/rdbms/admin/awrrpt.sql

SQL>@$ORACLE_HOME/rdbms/admin/addmrpt.sql



2. Sometimes system state dump is necessary to match known issues.  For example, if there is no obvious candidate SQL in AWR, capturing holder or waiter processes in systemstate allows you to focus in on potential problems. Run system state when processes appear hung on 'Cursor: pin S wait on X' waits :

(a) Non-Rac
sqlplus "/ as sysdba"

oradebug setmypid
oradebug unlimit
oradebug dump systemstate 266
wait 90 seconds
oradebug dump systemstate 266
wait 90 seconds
oradebug dump systemstate 266
quit

(b) RAC
$ sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug setinst all
oradebug -g all hanganalyze 4
oradebug -g all dump systemstate 266
quit



3. Errorstacks: Another way to obtain process information is with errorstack.  Assuming you can identify a blocker, taking errorstacks will provide much the same information as systemstates but with a much reduced disk footprint for trace. Once the ospid of the blocker has been found, an errorstack can be generated:

$ sqlplus
SQL> oradebug setospid <ospid>
oradebug dump errorstack 3
<< wait 1min>> 
oradebug dump errorstack 3
<< wait 1min>>
oradebug dump errorstack 3
exit


In particular, the stack from the resultant trace can be used to match known issues.

The system state and errorstacks are not easily readable; so a Service Request may need to be opened to read the files.

4. It is not always feasible to run system state dump. Thus, this note on finding the blocker can be used as well:

Note:786507.1 How to Determine the Blocking Session for Event: 'cursor: pin S wait on X'
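Note 786507.1 essentially decodes P2RAW of the wait event: on 64-bit platforms the high-order 4 bytes hold the SID of the mutex holder. A sketch of that decoding (the byte layout differs on 32-bit platforms):

```sql
-- For sessions waiting on 'cursor: pin S wait on X', the top 4 bytes of
-- P2RAW (64-bit platforms) contain the SID of the session holding the mutex.
select sid,
       p2raw,
       to_number(substr(rawtohex(p2raw), 1, 8), 'XXXXXXXX') as blocking_sid
  from v$session
 where event = 'cursor: pin S wait on X';
```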


5. Furthermore, the following SQL can be run to identify the waiters:

SELECT s.sid, t.sql_text
FROM v$session s, v$sql t
WHERE s.event LIKE '%cursor: pin S wait on X%'
AND t.sql_id = s.sql_id;


6. In 11g RAC, there is another less resource intensive tool that can be used when compared with taking system state dumps:

Note:459694.1 Procwatcher: Script. to Monitor and Examine Oracle DB and Clusterware Processes

 

How to Examine the Diagnostics.

1. Look for high parsing and high version counts from AWR.
Click on *SQL Statistics under Main Report of AWR:



Then, under SQL Statistics click on 'SQL ordered by Parse Calls' or 'SQL ordered by Version Count' to view that information:

SQL ordered by Parse Calls

* Total Parse Calls: 2,935,390
* Captured SQL account for 95.6% of Total 

Parse Calls  Executions  % Total Parses  SQL Id        SQL Module  SQL Text
    668,174     668,014           22.76  gpkdazvcfwkv              select DATA_TABLE (...
    667,440     668,234           22.74  5p9vjzht9jqb              INSERT INTO DATA_TABLE (DATA_I...


From this list, investigate the top SQLs to determine whether this volume of parse calls is excessive or can be reduced. 

SQL ordered by Version Count

 

* Only Statements with Version Count greater than 20 are displayed 

Version Count  Executions  SQL Id        SQL Module  SQL Text
          277     392,737  30d5a6v16mpb              select FLOW_CONTEXT (...
           78     131,104  7c0gj35488xs              INSERT INTO PROJECT (DOC_ID, ...


From this list, investigate the SQLs with the high version count. What are the reasons that these statements are not shared? Can this be addressed? 

2. From systemstate and errorstack, we are looking for stacks to match known issues.  

For systemstates/errorstacks to be useful, they must be taken "at the right time". On a fast-moving system, processes may have moved on, so holders and waiters are not shown. Ideally the systemstates capture the blocker and the processes being blocked, or at least one of these; the stack from the blocker may match a known issue.

Potential Solutions

1. Tune sqls with  high parse count by either investigating the application or the sql. 

From AWR, the parse statistics is at the top of the report under load profile:





In this case the parses are mainly soft. High hard parse rates, by contrast, may indicate heavy use of literals or the introduction of many new SQLs; in that case consider using bind variables in place of literals.
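As an illustration using this case's own statement: with literals, each distinct value compiles its own parent cursor, while a bind variable lets every execution share one cursor (the literal values below are hypothetical):

```sql
-- Literal values: every distinct user_name hard-parses a new cursor.
select company_id, user_id from t_xxxx_user where user_name = 'SMITH';
select company_id, user_id from t_xxxx_user where user_name = 'JONES';

-- Bind variable: one shared cursor, soft parses after the first execution.
select company_id, user_id from t_xxxx_user where user_name = :1;
```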

If there is high percentage of soft parses, then check the application to see if it is using shareable sqls.  

Ideally, Execute to Parse should be closer to 100%.  The application should parse less and execute many times. See:

Note:62143.1 Understanding and Tuning the Shared Pool


Also remember if the shared pool is flushed, then sqls will need to be hard parsed.  This also may cause mutex waits.  So make sure sqls are in memory once hard parsed and monitor to see if mutex waits are alleviated.  

2. High version counts can also cause 'cursor: pin S wait on X' waits.

Check V$SQL_SHARED_CURSOR to see the potential reason for the high version count using:

Note:438755.1 Formated V$SQL_SHARED_CURSOR Report by SQLID or Hash Value
Note:296377.1 Troubleshooting: High Version Count Issues (Doc ID 296377.1)
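If the formatted report from Note 438755.1 is not at hand, V$SQL_SHARED_CURSOR can be queried directly; a 'Y' in any mismatch column names the reason a child cursor could not be shared. A sketch for the SQL_ID from this case (only a few of the many mismatch columns are shown):

```sql
-- Each row is one child cursor; 'Y' in a mismatch column explains
-- why that child could not share an existing cursor.
select child_number, bind_mismatch, optimizer_mismatch,
       language_mismatch, translation_mismatch
  from v$sql_shared_cursor
 where sql_id = '6yx2qgbpw6j8x';
```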


There are some notable bugs with high version counts:

Note:10157392.8 Bug 10157392 - High version counts for SQL with binds (BIND_MISMATCH)
Note:9689310.8 Bug 9689310 - Excessive child cursors / high VERSION_COUNT / OERI:17059 due to bind mismatch


3. For more known defects, please go to following note and click on known bugs:

Note:1298015.1 WAITEVENT: "cursor: pin S wait on X" Reference Note


Click on the version that applies and review bug or bugs with similar scenario.

4. If the database has been migrated from 10g to 11g and mutex performance issue surfaces,
please consider the 11.2.0.2.2 psu + fix for 12431716.  Many mutex fixes are already included in this patch:

Note:1291879.1 Oracle Database Patch Set Update 11.2.0.2.2 Known Issues



Troubleshooting Other Issues

For guidance troubleshooting other performance issues see:

Document 1377446.1 Troubleshooting Performance Issues

 

References

NOTE:1291879.1 - Oracle Database Patch Set Update 11.2.0.2.2 Known Issues
NOTE:1298015.1 - WAITEVENT: "cursor: pin S wait on X" Reference Note
NOTE:296377.1 - Troubleshooting: High Version Count Issues
NOTE:459694.1 - Procwatcher: Script. to Monitor and Examine Oracle DB and Clusterware Processes
NOTE:786507.1 - How to Determine the Blocking Session for Event: 'cursor: pin S wait on X'
NOTE:1356828.1 - FAQ: 'cursor: ' waits

 

Attachments

 

 

 

 

 

Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU - Overlay fix Recommended [ID 1321817.1]

Modified: May 17, 2011    Type: ALERT    Status: PUBLISHED    Priority: 1

In this Document
  Description
  
Likelihood of Occurrence
  
Possible Symptoms
  
Workaround or Resolution
  
Patches
  
Modification History
  
References

Applies to:

Oracle Server - Enterprise Edition - Version: 11.2.0.2 to 11.2.0.2 - Release: 11.2 to 11.2
Information in this document applies to any platform.
This issue affects the following server releases only:
- 11.2.0.2.2 Database Patch Set Update (PSU) [ 11724916 ]
- 11.2.0.2.2 Grid Infrastructure Patch Set Update (GI PSU) [ 12311357 ]
- Windows 11.2.0.2 Patch bundles 4, 5 and 6.

This issue also affects 11.2.0.2.1 interim overlay patches for Bug 10411618 on Unix platforms downloaded before 14/May/2011.
(Interim patches for other versions are NOT affected. Interim patches downloaded on or after 14/May/2011 are NOT affected).

This issue does NOT affect any of the Exadata Database Recommended Patches.

Description

An incomplete fix in the above releases can cause database mutex waits to be accompanied by higher CPU usage than expected.

Likelihood of Occurrence

This problem is most likely to impact large heavily concurrent systems especially if those systems have some degree of mutex contention.

Possible Symptoms

There are 2 main symptoms of the problem:

  1. Higher CPU usage may be seen for mutex waits compared to earlier releases. This may appear as CPU spikes or as general higher CPU usage levels depending on the nature of the contention.
  2. The hidden parameters described in Note:10411618.8, which allow configuration of the mutex wait scheme, do not work as expected. Notably setting "_mutex_wait_scheme" has no effect.

Workaround or Resolution

  • For non-Windows platforms it is recommended to apply Patch:12431716 on top of the 11.2.0.2.2 PSU.
  • For Windows platforms on 11.2.0.2 bundle patch 4, 5 or 6 it is recommended to apply bundle patch 7 when available. If that patch cannot be applied then it is recommended to set "_mutex_wait_time"=8 at instance level until such time as it is applied.
  • 11.2.0.2.1 systems with overlay patches for bug 10411618 from before 14/May/2011 are advised to replace the patch with the latest version.
  • Once the relevant patch is in place then the mutex wait scheme can be configured as described in Note:10411618.8 if required.

 

Patches

  • For non-Windows platforms with 11.2.0.2.2 PSU installed it is recommended to apply overlay Patch:12431716 (or any patch containing that fix).
  • For Windows platforms it is recommended to apply 11.2.0.2 bundle patch 7 (or later) once available.
  • 11.2.0.2.1 overlay patches for 10411618 on non-Windows platforms were reloaded on 14/May/2011. If you have an 11.2.0.2.1 overlay patch for bug 10411618 from before that date then it is advisable to replace it with the current version.
    eg: Rollback the installed patch, download the latest version of 
    Patch:10411618 for 11.2.0.2.1, apply the replacement patch. 
  • The problem will be fixed generally in the next 11.2.0.2 Patch Set Update (11.2.0.2.3) due for release in July 2011.

Modification History

13-May-2011 - Created
16-May-2011 - Confirm this does NOT affect Exadata Recommended Patches
17-May-2011 - Add references to 11.2.0.2.2 PSU / GI PSU Known Issues (which mention this problem)

References

NOTE:10411618.8 - Bug 10411618 - Enhancement to add different "Mutex" wait schemes
NOTE:11724916.8 - Bug 11724916 - 11.2.0.2.2 Patch Set Update (PSU)
NOTE:12311357.8 - Bug 12311357 - 11.2.0.2.2 Grid Infrastructure Patch Set Update (GI PSU)
NOTE:12431716.8 - Bug 12431716 - Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU / GI PSU
NOTE:1272288.1 - 11.2.0.2.X Grid Infrastructure Bundle/PSU Known Issues
NOTE:1291879.1 - Oracle Database Patch Set Update 11.2.0.2.2 Known Issues

 

 


From the ITPUB blog: http://blog.itpub.net/13454868/viewspace-772990/. Please credit the source when reposting.
