Handling an ORA-04031 Shared Pool Memory Error in an Oracle 11.2.0.2.0 RAC Environment

Environment

Database version: 11.2.0.2.0

Operating system: AIX 6.1


I. Symptoms

1. Load on one server in the RAC cluster, node2, climbed (it is normally around 90% idle) while the database on node1 stayed normal. Below is the sar report from node2 at the time:

bash-3.00# sar -f sa15|more

AIX p740db2 1 6 00F688714C00    08/15/12

System configuration: lcpu=32  mode=Capped

00:00:00    %usr    %sys    %wio   %idle   physc

16:15:00       5       2       4      89    8.00

16:16:00       8       5       0      87    8.00

16:17:00      36       6       0      57    8.00

16:18:00      50       5       4      42    8.00

16:19:00       7       2       7      85    8.00

16:20:00       6       2       5      88    8.00

16:21:00       6       2       8      84    8.00

16:22:00       6       2       9      83    8.00

16:23:00       7       2       7      84    8.00

16:24:00       6       2       5      87    8.00

16:25:00       6       2       6      87    8.00

16:26:00       6       2       6      86    8.00

16:27:00       5       2       5      89    8.00

16:28:00       5       2       5      88    8.00

16:29:00       5       2       4      89    8.00

16:30:00       5       2       4      90    8.00

16:31:00      25       3       3      70    8.00

16:32:01      63       2       0      34    8.00

16:33:00      27       2       0      71    8.00

16:34:00      46       2       0      51    8.00

16:35:00      66       2       0      32    8.00

16:36:01      37       2       0      61    8.00

16:37:00      27       2       0      72    8.00

16:38:00      15       1       0      83    8.00

16:39:00      29       2       0      69    8.00

16:40:00      41       2       0      57    8.00

16:41:00      45       2       0      53    8.00

16:42:00      47       2       0      51    8.00

16:43:00      39       2       0      59    8.00

16:44:00      50       2       1      48    8.00

16:45:00      15       2       5      78    8.00

16:46:00       1       1       3      95    8.00

16:47:00       6       2       1      91    8.00

 

2. Logons to the database hung with no response for a long time. The alert log shows ORA-04031 errors (shared pool exhausted) starting at 16:15:

Archived Log entry 15904 added for thread 2 sequence 2418 ID 0xffffffffd412add4 dest 1:

Wed Aug 15 16:15:09 2012

Errors in file /db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_24707320.trc  (incident=5241532):

ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select company_id,user_id fr...","KGLH0^28668775","kglHeapInitialize:temp")

Incident details in: /db/diag/rdbms/xxxxshprod/xxxxshpd2/incident/incdir_5241532/xxxxshpd2_ora_24707320_i5241532.trc

Wed Aug 15 16:15:09 2012

Non critical error ORA-48913 caught while writing to trace file "/db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_40042558.trc"

Writing to the above trace file is disabled for now on...

Non critical error ORA-48913 caught while writing to trace file "/db/diag/rdbms/xxxxshprod/xxxxshpd2/incident/incdir_5241532/xxxxshp

d2_ora_24707320_i5241532.trc"

Writing to the above trace file is disabled for now on...

Wed Aug 15 16:16:02 2012

Errors in file /db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_36044968.trc  (incident=5241340):

ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select count(*) as totalrec ...","KGLH0^c0f94bbb","kglHeapInitialize:temp")

 

II. Diagnosis

1. Check ASH for this period. The most significant wait event is 'cursor: pin S wait on X':

 

SQL> select * from (

  2  select /*+ rule */

  3   event,count(1) cnt

  4    from dba_hist_active_sess_history

  5   where sample_time > to_date('2012-08-15 16:15:00', 'yyyy-MM-dd HH24:mi:ss')

  6     and sample_time < to_date('2012-08-15 16:40:00', 'yyyy-MM-dd HH24:mi:ss')

  7     group by event

  8     order by cnt desc) where rownum<=5;

 

 

EVENT                                                                   CNT

---------------------------------------------------------------- ----------

cursor: pin S wait on X                                                 944

latch: shared pool                                                      305

db file sequential read                                                 277

null event                                                               95

library cache: mutex X                                                   87

 

SQL>

 

Find the SQL associated with this wait event:

SQL> select sql_id,count(1)

  2    from dba_hist_active_sess_history

  3   where

  4   sample_time > to_date('2012-08-15 16:15:00', 'yyyy-MM-dd HH24:mi:ss')

  5   and sample_time < to_date('2012-08-15 16:40:00', 'yyyy-MM-dd HH24:mi:ss')

  6   and event = 'cursor: pin S wait on X'

  7   group by sql_id;

 

SQL_ID          COUNT(1)

------------- ----------

22bb9t1167adc          5

6yx2qgbpw6j8x        909

bunvx480ynf57         22

dcrjrtxqrsrsw          8

From the output above, the problem SQL_ID is 6yx2qgbpw6j8x. Retrieve the SQL text for that SQL_ID:

select * from v$sql where sql_id='6yx2qgbpw6j8x';

The SQL text is:

select company_id,user_id from t_xxxx_user where user_name=:1

 

Also check whether this SQL's execution plan is sane. As the output below shows, the SQL is very simple: the table holds only 883 rows in total, and the query does a unique index scan:

SQL>

SQL> --Check the historical execution plan

SQL> SELECT sp.id,

  2         lpad(' ', sp.DEPTH * 3, ' ') || sp.OPERATION || ' ' || sp.OPTIONS operation,

  3         sp.OBJECT_NAME

  4    FROM dba_hist_sql_plan sp

  5   WHERE SQL_ID = '6yx2qgbpw6j8x'

  6   ORDER BY ID;

 

        ID OPERATION                                                                        OBJECT_NAME

---------- -------------------------------------------------------------------------------- -------------------------------

         0 SELECT STATEMENT                                                                 

         1    TABLE ACCESS BY INDEX ROWID                                                   T_XXXX_USER

         2       INDEX UNIQUE SCAN                                                          U_TXXXX_USER__USER_NAME

 

SQL> select * from table(dbms_xplan.display_awr('6yx2qgbpw6j8x'));

 

PLAN_TABLE_OUTPUT

--------------------------------------------------------------------------------

SQL_ID 6yx2qgbpw6j8x

--------------------

select company_id,user_id from t_xxxx_user where user_name=:1

Plan hash value: 377560326

--------------------------------------------------------------------------------

| Id  | Operation                   | Name                    | Rows  | Bytes |

--------------------------------------------------------------------------------

|   0 | SELECT STATEMENT            |                         |       |       |

|   1 |  TABLE ACCESS BY INDEX ROWID| T_XXXX_USER             |     1 |    20 |

|   2 |   INDEX UNIQUE SCAN         | U_TXXXX_USER__USER_NAME |     1 |       |

--------------------------------------------------------------------------------

 

14 rows selected
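Because 'cursor: pin S wait on X' often goes hand in hand with cursor-sharing trouble, it is also worth counting how many child cursors the statement has accumulated. A minimal sketch against v$sql, using the SQL_ID identified above:

```sql
-- Count child cursors and executions for the suspect statement.
-- A large or fast-growing child count would point at cursor sharing problems.
select sql_id,
       count(*)        as child_cursors,
       sum(executions) as total_executions
  from v$sql
 where sql_id = '6yx2qgbpw6j8x'
 group by sql_id;
```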

 

 

2. Compare the AWR reports for this period:

Comparing the AWR report from the problem window against a normal baseline, average hard parses and soft parses per second show no obvious increase. The "SQL ordered by Parse Calls" and "SQL ordered by Version Count" sections are also in line with normal, including the execution count of SQL_ID 6yx2qgbpw6j8x. This points away from an application change, so the initial suspicion is an Oracle bug.

AWR during the problem window:

Load Profile

                              Per Second  Per Transaction  Per Exec  Per Call
DB Time(s):                         13.8              0.3      0.02      0.01
DB CPU(s):                           0.7              0.0      0.00      0.00
Redo size:                     194,547.2          4,767.2
Logical reads:                   7,571.0            185.5
Block changes:                     815.7             20.0
Physical reads:                    296.2              7.3
Physical writes:                    71.7              1.8
User calls:                      1,339.8             32.8
Parses:                            497.4             12.2
Hard parses:                        18.2              0.5
W/A MB processed:                    1.3              0.0
Logons:                              0.1              0.0
Executes:                          774.7             19.0
Rollbacks:                           0.0              0.0
Transactions:                       40.8

Instance Efficiency Percentages (Target 100%)

Buffer Nowait %:               99.90    Redo NoWait %:      100.00
Buffer Hit %:                  96.12    In-memory Sort %:   100.00
Library Hit %:                 92.39    Soft Parse %:        96.35
Execute to Parse %:            35.79    Latch Hit %:         99.73
Parse CPU to Parse Elapsd %:    2.49    % Non-Parse CPU:     73.0

 

 

 

 

AWR under normal conditions:

Load Profile

                              Per Second  Per Transaction  Per Exec  Per Call
DB Time(s):                          2.7              0.1      0.00      0.00
DB CPU(s):                           0.3              0.0      0.00      0.00
Redo size:                     182,077.7          4,810.9
Logical reads:                   9,038.7            238.8
Block changes:                     756.9             20.0
Physical reads:                    338.2              8.9
Physical writes:                    78.9              2.1
User calls:                      1,284.9             34.0
Parses:                            482.7             12.8
Hard parses:                        17.5              0.5
W/A MB processed:                    1.0              0.0
Logons:                              0.1              0.0
Executes:                          744.9             19.7
Rollbacks:                           0.0              0.0
Transactions:                       37.9

Instance Efficiency Percentages (Target 100%)

Buffer Nowait %:               99.94    Redo NoWait %:      100.00
Buffer Hit %:                  96.29    In-memory Sort %:   100.00
Library Hit %:                 92.88    Soft Parse %:        96.37
Execute to Parse %:            35.20    Latch Hit %:         99.78
Parse CPU to Parse Elapsd %:   16.07    % Non-Parse CPU:     92.86
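The same parse-rate comparison can also be made directly from the AWR repository rather than from the rendered reports. A sketch, assuming the Diagnostics Pack is licensed; &begin_snap and &end_snap are placeholder snapshot IDs bracketing the window of interest:

```sql
-- Delta of total and hard parse counts between two AWR snapshots,
-- per instance. Divide by the snapshot interval for a per-second rate.
select e.instance_number,
       e.stat_name,
       e.value - b.value as delta
  from dba_hist_sysstat b,
       dba_hist_sysstat e
 where b.snap_id = &begin_snap
   and e.snap_id = &end_snap
   and b.dbid = e.dbid
   and b.instance_number = e.instance_number
   and b.stat_id = e.stat_id
   and e.stat_name in ('parse count (total)', 'parse count (hard)');
```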

 

 

 

3. MOS note 1349387.1 contains the following passage: "If the database has been migrated from 10g to 11g and mutex performance issue surfaces, please consider the 11.2.0.2.2 psu + fix for 12431716. Many mutex fixes are already included in this patch." The description of Bug 12431716 matches both this version and these symptoms, and Oracle strongly recommends applying patch 12431716 on 11.2.0.2.
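Before requesting the patch it is worth confirming which PSU, if any, the database is already on. A sketch using the database's own patch history view (interim patches in the Oracle home are listed separately by opatch lsinventory):

```sql
-- List the patch/PSU apply history recorded inside the database.
select action_time, action, version, id, comments
  from dba_registry_history
 order by action_time;
```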


Troubleshooting 'cursor: pin S wait on X' waits. [ID 1349387.1]

Modified: Jan 7, 2012    Type: TROUBLESHOOTING    Status: PUBLISHED    Priority: 1

In this Document
  Purpose
  
Last Review Date
  
Instructions for the Reader
  
Troubleshooting Details
     
What is a 'Cursor: pin S wait on X' wait?
     
What causes 'Cursor: pin S wait on X' waits?
     
How to diagnose the cause.
     
How to Examine the Diagnostics.
     
Potential Solutions
     
Troubleshooting Other Issues
  
References

Applies to:

Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.2.0.2 - Release: 10.2 to 11.2
Information in this document applies to any platform.

Purpose

The purpose of the note is to help customers troubleshoot 'Cursor: pin S wait on X' waits.

Last Review Date

August 17, 2011

Instructions for the Reader

A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.

Troubleshooting Details

What is a 'Cursor: pin S wait on X' wait?

A cursor wait is associated with parsing in some form. A session may wait on this event when it is trying to get a mutex pin in Share mode while another session holds the mutex pin on the same cursor object in eXclusive mode. Frequently, waits for 'Cursor: pin S wait on X' are a symptom and not the cause; there may be underlying tuning requirements or known issues.

What causes 'Cursor: pin S wait on X' waits?

  • Frequent Hard Parses
    If the frequency of Hard Parsing is extremely high, then contention can occur on this pin.
  • High Version Counts
    When Version counts become excessive, a long chain of versions needs to be examined and this can lead to contention on this event
  • Known bugs

How to diagnose the cause.

Obtain information and diagnostics to help locate the cause.  

1. Run AWR and ADDM reports during the time that you see 'Cursor: pin S wait on X' waits and also collect baseline reports when you do not see the waits to make comparisons.  The baseline shows us the typical 'background' concurrency and activity going on during the time of both the issue and the baseline and  may help identify (for example) culprit SQLs with high version counts. 

To gather this it is suggested to run AWR and ADDM for half an hour to an hour interval as follows:

SQL>@$ORACLE_HOME/rdbms/admin/awrrpt.sql

SQL>@$ORACLE_HOME/rdbms/admin/addmrpt.sql



2. Sometimes system state dump is necessary to match known issues.  For example, if there is no obvious candidate SQL in AWR, capturing holder or waiter processes in systemstate allows you to focus in on potential problems. Run system state when processes appear hung on 'Cursor: pin S wait on X' waits :

(a) Non-Rac
sqlplus "/ as sysdba"

oradebug setmypid
oradebug unlimit
oradebug dump systemstate 266
wait 90 seconds
oradebug dump systemstate 266
wait 90 seconds
oradebug dump systemstate 266
quit

(b) RAC
$ sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug setinst all
oradebug -g all hanganalyze 4
oradebug -g all dump systemstate 266
quit



3. Errorstacks: Another way to obtain process information is with errorstack.  Assuming you can identify a blocker, taking errorstacks will provide much the same information as systemstates but with a much reduced disk footprint for trace. Once the ospid of the blocker has been found, an errorstack can be generated:

$ sqlplus
SQL> oradebug setospid <ospid>
oradebug dump errorstack 3
<< wait 1min>> 
oradebug dump errorstack 3
<< wait 1min>>
oradebug dump errorstack 3
exit


In particular, the stack from the resultant trace can be used to match known issues.

The system state and errorstacks are not easily readable; so a Service Request may need to be opened to read the files.

4. It is not always feasible to run system state dump. Thus, this note on finding the blocker can be used as well:

Note:786507.1 How to Determine the Blocking Session for Event: 'cursor: pin S wait on X'
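Note 786507.1 essentially decodes P2RAW of the wait event: on 64-bit platforms the high-order 4 bytes hold the SID of the mutex holder. A sketch of that decoding (the byte layout differs on 32-bit platforms):

```sql
-- For sessions waiting on 'cursor: pin S wait on X', the top 4 bytes of
-- P2RAW (64-bit platforms) contain the SID of the session holding the mutex.
select sid,
       p2raw,
       to_number(substr(rawtohex(p2raw), 1, 8), 'XXXXXXXX') as blocking_sid
  from v$session
 where event = 'cursor: pin S wait on X';
```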


5. Furthermore, the following SQL can be run to identify the waiters:

SELECT s.sid, t.sql_text
FROM v$session s, v$sql t
WHERE s.event LIKE '%cursor: pin S wait on X%'
AND t.sql_id = s.sql_id;


6. In 11g RAC, there is another less resource intensive tool that can be used when compared with taking system state dumps:

Note:459694.1 Procwatcher: Script. to Monitor and Examine Oracle DB and Clusterware Processes

 

How to Examine the Diagnostics.

1. Look for high parsing and high version counts from AWR.
Click on *SQL Statistics under Main Report of AWR:



Then, under SQL Statistics click on 'SQL ordered by Parse Calls' or 'SQL ordered by Version Count' to view that information:

SQL ordered by Parse Calls

* Total Parse Calls: 2,935,390
* Captured SQL account for 95.6% of Total 

Parse Calls  Executions  % Total Parses  SQL Id        SQL Module  SQL Text
    668,174     668,014           22.76  gpkdazvcfwkv              select DATA_TABLE (...
    667,440     668,234           22.74  5p9vjzht9jqb              INSERT INTO DATA_TABLE (DATA_I...


From this list, investigate the top SQLs to determine whether this volume of parse calls is excessive or can be reduced. 

SQL ordered by Version Count

 

* Only Statements with Version Count greater than 20 are displayed 

Version Count  Executions  SQL Id        SQL Module  SQL Text
          277     392,737  30d5a6v16mpb              select FLOW_CONTEXT (...
           78     131,104  7c0gj35488xs              INSERT INTO PROJECT (DOC_ID, ...


From this list, investigate the SQLs with the high version count. What are the reasons that these statements are not shared? Can this be addressed? 

2. From systemstate and errorstack, we are looking for stacks to match known issues.  

For systemstates/errorstacks to be useful, they must be taken "at the right time". On a fast-moving system, processes may have moved on, so holders and waiters are not shown. Ideally the systemstates capture the blocker and the processes being blocked, or at least one of these; the stack from the blocker may match a known issue.

Potential Solutions

1. Tune sqls with  high parse count by either investigating the application or the sql. 

From AWR, the parse statistics is at the top of the report under load profile:





In this case the parses are mainly soft. High hard parse rates, by contrast, may indicate heavy use of literals or the introduction of many new SQLs; in that case consider using bind variables in place of literals.
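As an illustration using this case's own statement: with literals, each distinct value compiles its own parent cursor, while a bind variable lets every execution share one cursor (the literal values below are hypothetical):

```sql
-- Literal values: every distinct user_name hard-parses a new cursor.
select company_id, user_id from t_xxxx_user where user_name = 'SMITH';
select company_id, user_id from t_xxxx_user where user_name = 'JONES';

-- Bind variable: one shared cursor, soft parses after the first execution.
select company_id, user_id from t_xxxx_user where user_name = :1;
```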

If there is high percentage of soft parses, then check the application to see if it is using shareable sqls.  

Ideally, Execute to Parse should be closer to 100%.  The application should parse less and execute many times. See:

Note:62143.1 Understanding and Tuning the Shared Pool


Also remember if the shared pool is flushed, then sqls will need to be hard parsed.  This also may cause mutex waits.  So make sure sqls are in memory once hard parsed and monitor to see if mutex waits are alleviated.  

2. High version counts can also cause 'cursor: pin S wait on X' waits.

Check V$SQL_SHARED_CURSOR to see the potential reason for the high version count using:

Note:438755.1 Formated V$SQL_SHARED_CURSOR Report by SQLID or Hash Value
Note:296377.1 Troubleshooting: High Version Count Issues (Doc ID 296377.1)
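If the formatted report from Note 438755.1 is not at hand, V$SQL_SHARED_CURSOR can be queried directly; a 'Y' in any mismatch column names the reason a child cursor could not be shared. A sketch for the SQL_ID from this case (only a few of the many mismatch columns are shown):

```sql
-- Each row is one child cursor; 'Y' in a mismatch column explains
-- why that child could not share an existing cursor.
select child_number, bind_mismatch, optimizer_mismatch,
       language_mismatch, translation_mismatch
  from v$sql_shared_cursor
 where sql_id = '6yx2qgbpw6j8x';
```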


There are some notable bugs with high version counts:

Note:10157392.8 Bug 10157392 - High version counts for SQL with binds (BIND_MISMATCH)
Note:9689310.8 Bug 9689310 - Excessive child cursors / high VERSION_COUNT / OERI:17059 due to bind mismatch


3. For more known defects, please go to following note and click on known bugs:

Note:1298015.1 WAITEVENT: "cursor: pin S wait on X" Reference Note


Click on the version that applies and review bug or bugs with similar scenario.

4. If the database has been migrated from 10g to 11g and mutex performance issue surfaces,
please consider the 11.2.0.2.2 psu + fix for 12431716.  Many mutex fixes are already included in this patch:

Note:1291879.1 Oracle Database Patch Set Update 11.2.0.2.2 Known Issues



Troubleshooting Other Issues

For guidance troubleshooting other performance issues see:

Document 1377446.1 Troubleshooting Performance Issues

 

References

NOTE:1291879.1 - Oracle Database Patch Set Update 11.2.0.2.2 Known Issues
NOTE:1298015.1 - WAITEVENT: "cursor: pin S wait on X" Reference Note
NOTE:296377.1 - Troubleshooting: High Version Count Issues
NOTE:459694.1 - Procwatcher: Script. to Monitor and Examine Oracle DB and Clusterware Processes
NOTE:786507.1 - How to Determine the Blocking Session for Event: 'cursor: pin S wait on X'
NOTE:1356828.1 - FAQ: 'cursor: ' waits

 

Attachments

 

 

 

 

 

Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU - Overlay fix Recommended [ID 1321817.1]

Modified: May 17, 2011    Type: ALERT    Status: PUBLISHED    Priority: 1

In this Document
  Description
  
Likelihood of Occurrence
  
Possible Symptoms
  
Workaround or Resolution
  
Patches
  
Modification History
  
References

Applies to:

Oracle Server - Enterprise Edition - Version: 11.2.0.2 to 11.2.0.2 - Release: 11.2 to 11.2
Information in this document applies to any platform.
This issue affects the following server releases only:
- 11.2.0.2.2 Database Patch Set Update (PSU) [ 11724916 ]
- 11.2.0.2.2 Grid Infrastructure Patch Set Update (GI PSU) [ 12311357 ]
- Windows 11.2.0.2 Patch bundles 4, 5 and 6.

This issue also affects 11.2.0.2.1 interim overlay patches for Bug 10411618 on Unix platforms downloaded before 14/May/2011.
(Interim patches for other versions are NOT affected. Interim patches downloaded on or after 14/May/2011 are NOT affected).

This issue does NOT affect any of the Exadata Database Recommended Patches.

Description

An incomplete fix in the above releases can cause database mutex waits to be accompanied by higher CPU usage than expected.

Likelihood of Occurrence

This problem is most likely to impact large heavily concurrent systems especially if those systems have some degree of mutex contention.

Possible Symptoms

There are 2 main symptoms of the problem:

  1. Higher CPU usage may be seen for mutex waits compared to earlier releases. This may appear as CPU spikes or as general higher CPU usage levels depending on the nature of the contention.
  2. The hidden parameters described in Note:10411618.8, which allow configuration of the mutex wait scheme, do not work as expected. Notably setting "_mutex_wait_scheme" has no effect.

Workaround or Resolution

  • For non-Windows platforms it is recommended to apply Patch:12431716 on top of the 11.2.0.2.2 PSU.
  • For Windows platforms on 11.2.0.2 bundle patch 4, 5 or 6 it is recommended to apply bundle patch 7 when available. If that patch cannot be applied then it is recommended to set "_mutex_wait_time"=8 at instance level until such time as it is applied.
  • 11.2.0.2.1 systems with overlay patches for bug 10411618 from before 14/May/2011 are advised to replace the patch with the latest version.
  • Once the relevant patch is in place then the mutex wait scheme can be configured as described in Note:10411618.8 if required.

 

Patches

  • For non-Windows platforms with 11.2.0.2.2 PSU installed it is recommended to apply overlay Patch:12431716 (or any patch containing that fix).
  • For Windows platforms it is recommended to apply 11.2.0.2 bundle patch 7 (or later) once available.
  • 11.2.0.2.1 overlay patches for 10411618 on non-Windows platforms were reloaded on 14/May/2011. If you have an 11.2.0.2.1 overlay patch for bug 10411618 from before that date then it is advisable to replace it with the current version.
    eg: Rollback the installed patch, download the latest version of 
    Patch:10411618 for 11.2.0.2.1, apply the replacement patch. 
  • The problem will be fixed generally in the next 11.2.0.2 Patch Set Update (11.2.0.2.3) due for release in July 2011.

Modification History

13-May-2011 - Created
16-May-2011 - Confirm this does NOT affect Exadata Recommended Patches
17-May-2011 - Add references to 11.2.0.2.2 PSU / GI PSU Known Issues (which mention this problem)

References

NOTE:10411618.8 - Bug 10411618 - Enhancement to add different "Mutex" wait schemes
NOTE:11724916.8 - Bug 11724916 - 11.2.0.2.2 Patch Set Update (PSU)
NOTE:12311357.8 - Bug 12311357 - 11.2.0.2.2 Grid Infrastructure Patch Set Update (GI PSU)
NOTE:12431716.8 - Bug 12431716 - Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU / GI PSU
NOTE:1272288.1 - 11.2.0.2.X Grid Infrastructure Bundle/PSU Known Issues
NOTE:1291879.1 - Oracle Database Patch Set Update 11.2.0.2.2 Known Issues

 

 


From the ITPUB blog: http://blog.itpub.net/13454868/viewspace-772990/. Please credit the source when reposting.
