Environment
Database version: 11.2.0.2.0
OS version: AIX 6.1
I. Symptoms
1. In a two-node RAC, load on node2 rose sharply (%idle is normally around 90) while node1 stayed normal. The sar report from node2 at the time:
bash-3.00# sar -f sa15|more
AIX p740db2 1 6 00F688714C00 08/15/12
System configuration: lcpu=32 mode=Capped
00:00:00 %usr %sys %wio %idle physc
16:15:00 5 2 4 89 8.00
16:16:00 8 5 0 87 8.00
16:17:00 36 6 0 57 8.00
16:18:00 50 5 4 42 8.00
16:19:00 7 2 7 85 8.00
16:20:00 6 2 5 88 8.00
16:21:00 6 2 8 84 8.00
16:22:00 6 2 9 83 8.00
16:23:00 7 2 7 84 8.00
16:24:00 6 2 5 87 8.00
16:25:00 6 2 6 87 8.00
16:26:00 6 2 6 86 8.00
16:27:00 5 2 5 89 8.00
16:28:00 5 2 5 88 8.00
16:29:00 5 2 4 89 8.00
16:30:00 5 2 4 90 8.00
16:31:00 25 3 3 70 8.00
16:32:01 63 2 0 34 8.00
16:33:00 27 2 0 71 8.00
16:34:00 46 2 0 51 8.00
16:35:00 66 2 0 32 8.00
16:36:01 37 2 0 61 8.00
16:37:00 27 2 0 72 8.00
16:38:00 15 1 0 83 8.00
16:39:00 29 2 0 69 8.00
16:40:00 41 2 0 57 8.00
16:41:00 45 2 0 53 8.00
16:42:00 47 2 0 51 8.00
16:43:00 39 2 0 59 8.00
16:44:00 50 2 1 48 8.00
16:45:00 15 2 5 78 8.00
16:46:00 1 1 3 95 8.00
16:47:00 6 2 1 91 8.00
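The spike in the sar output above is easy to eyeball, but on a long report a small filter helps. A minimal sketch (the helper name and threshold are this article's own illustration, not part of sar) that parses sar data lines and flags intervals where %idle drops below a threshold:

```python
import re

def low_idle_intervals(sar_lines, idle_threshold=60):
    """Parse 'sar -f' data lines (time %usr %sys %wio %idle physc)
    and return (timestamp, %idle) for intervals below the threshold."""
    hits = []
    for line in sar_lines:
        m = re.match(r"(\d{2}:\d{2}:\d{2})\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)", line)
        if m:
            idle = int(m.group(5))
            if idle < idle_threshold:
                hits.append((m.group(1), idle))
    return hits

# Lines copied from the report above
sample = [
    "16:30:00 5 2 4 90 8.00",
    "16:31:00 25 3 3 70 8.00",
    "16:32:01 63 2 0 34 8.00",
]
print(low_idle_intervals(sample))  # [('16:32:01', 34)]
```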
2. Logins to the database hung with no response for a long time. The alert log shows ORA-04031 (shared pool out of memory) errors starting at 16:15:
Archived Log entry 15904 added for thread 2 sequence 2418 ID 0xffffffffd412add4 dest 1:
Wed Aug 15 16:15:09 2012
Errors in file /db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_24707320.trc (incident=5241532):
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select company_id,user_id fr...","KGLH0^28668775","kglHeapInitialize:temp")
Incident details in: /db/diag/rdbms/xxxxshprod/xxxxshpd2/incident/incdir_5241532/xxxxshpd2_ora_24707320_i5241532.trc
Wed Aug 15 16:15:09 2012
Non critical error ORA-48913 caught while writing to trace file "/db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_40042558.trc"
Writing to the above trace file is disabled for now on...
Non critical error ORA-48913 caught while writing to trace file "/db/diag/rdbms/xxxxshprod/xxxxshpd2/incident/incdir_5241532/xxxxshp
d2_ora_24707320_i5241532.trc"
Writing to the above trace file is disabled for now on...
Wed Aug 15 16:16:02 2012
Errors in file /db/diag/rdbms/xxxxshprod/xxxxshpd2/trace/xxxxshpd2_ora_36044968.trc (incident=5241340):
ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select count(*) as totalrec ...","KGLH0^c0f94bbb","kglHeapInitialize:temp")
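On a busy instance the alert log grows fast; the first occurrence of ORA-04031 can be located with a small scanner. A sketch, assuming timestamps use the "Wed Aug 15 16:15:09 2012" format shown above:

```python
import re

def ora_errors(alert_lines):
    """Scan alert-log lines and pair each ORA- error with the most
    recent timestamp line seen before it."""
    ts_re = re.compile(r"^\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4}$")
    err_re = re.compile(r"(ORA-\d{5})")
    current_ts, found = None, []
    for line in alert_lines:
        if ts_re.match(line.strip()):
            current_ts = line.strip()
        else:
            m = err_re.search(line)
            if m:
                found.append((current_ts, m.group(1)))
    return found

sample = [
    "Wed Aug 15 16:15:09 2012",
    'ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool", ...)',
]
print(ora_errors(sample))  # [('Wed Aug 15 16:15:09 2012', 'ORA-04031')]
```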
II. Diagnosis
1. ASH for this window shows the most severe wait event is "cursor: pin S wait on X":
SQL> select * from (
2 select /*+ rule */
3 event,count(1) cnt
4 from dba_hist_active_sess_history
5 where sample_time > to_date('2012-08-15 16:15:00', 'yyyy-MM-dd HH24:mi:ss')
6 and sample_time < to_date('2012-08-15 16:40:00', 'yyyy-MM-dd HH24:mi:ss')
7 group by event
8 order by cnt desc) where rownum<=5;
EVENT CNT
---------------------------------------------------------------- ----------
cursor: pin S wait on X 944
latch: shared pool 305
db file sequential read 277
null event 95
library cache: mutex X 87
SQL>
Find the SQL associated with this wait event:
SQL> select sql_id,count(1)
2 from dba_hist_active_sess_history
3 where
4 sample_time > to_date('2012-08-15 16:15:00', 'yyyy-MM-dd HH24:mi:ss')
5 and sample_time < to_date('2012-08-15 16:40:00', 'yyyy-MM-dd HH24:mi:ss')
6 and event = 'cursor: pin S wait on X'
7 group by sql_id;
SQL_ID COUNT(1)
------------- ----------
22bb9t1167adc 5
bunvx480ynf57 22
dcrjrtxqrsrsw 8
From the output, the problematic SQL_ID is 6yx2qgbpw6j8x. Retrieve the statement text:
select * from v$sql where sql_id = '6yx2qgbpw6j8x';
The retrieved SQL is as follows:
Also check whether its execution plan is sane. As the output below shows, the statement is trivial: the table holds only 883 rows, and the query goes through a unique index.
SQL>
SQL> -- check the historical execution plan
SQL> SELECT sp.id,
  2         lpad(' ', sp.DEPTH * 3, ' ') || sp.OPERATION || ' ' || sp.OPTIONS operation,
  3         sp.OBJECT_NAME
  4    FROM dba_hist_sql_plan sp
  5   WHERE sp.sql_id = '6yx2qgbpw6j8x'
  6   ORDER BY ID;
ID OPERATION OBJECT_NAME
---------- -------------------------------------------------------------------------------- -------------------------------
0 SELECT STATEMENT
1 TABLE ACCESS BY INDEX ROWID T_XXXX_USER
2 INDEX UNIQUE SCAN U_TXXXX_USER__USER_NAME
SQL> select * from table(dbms_xplan.display_awr('6yx2qgbpw6j8x'));
PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
SQL_ID 6yx2qgbpw6j8x
--------------------
Plan hash value: 377560326
--------------------------------------------------------------------------------
|Id | Operation | Name | Rows | Bytes |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | |
| 1 | TABLE ACCESS BY INDEX ROWID| T_XXXX_USER | 1 | 20 |
| 2 | INDEX UNIQUE SCAN | U_TXXXX_USER__USER_NAME | 1 | |
--------------------------------------------------------------------------------
14 rows selected
2. Analyze the AWR reports for this period.
Comparing the problem-time AWR with a normal-time baseline, average hard and soft parses per second show no obvious growth. "SQL ordered by Parse Calls" and "SQL ordered by Version Count" also look like usual, as does the execution count of SQL_ID 6yx2qgbpw6j8x. This points away from an application change and toward an Oracle bug.
AWR during the problem window

Load Profile          Per Second  Per Transaction  Per Exec  Per Call
DB Time(s):                 13.8              0.3      0.02      0.01
DB CPU(s):                   0.7              0.0      0.00      0.00
Redo size:             194,547.2          4,767.2
Logical reads:           7,571.0            185.5
Block changes:             815.7             20.0
Physical reads:            296.2              7.3
Physical writes:            71.7              1.8
User calls:              1,339.8             32.8
Parses:                    497.4             12.2
Hard parses:                18.2              0.5
W/A MB processed:            1.3              0.0
Logons:                      0.1              0.0
Executes:                  774.7             19.0
Rollbacks:                   0.0              0.0
Transactions:               40.8

Instance Efficiency Percentages (Target 100%)
Buffer Nowait %:              99.90    Redo NoWait %:     100.00
Buffer Hit %:                 96.12    In-memory Sort %:  100.00
Library Hit %:                92.39    Soft Parse %:       96.35
Execute to Parse %:           35.79    Latch Hit %:        99.73
Parse CPU to Parse Elapsd %:   2.49    % Non-Parse CPU:    73.0
AWR under normal conditions

Load Profile          Per Second  Per Transaction  Per Exec  Per Call
DB Time(s):                  2.7              0.1      0.00      0.00
DB CPU(s):                   0.3              0.0      0.00      0.00
Redo size:             182,077.7          4,810.9
Logical reads:           9,038.7            238.8
Block changes:             756.9             20.0
Physical reads:            338.2              8.9
Physical writes:            78.9              2.1
User calls:              1,284.9             34.0
Parses:                    482.7             12.8
Hard parses:                17.5              0.5
W/A MB processed:            1.0              0.0
Logons:                      0.1              0.0
Executes:                  744.9             19.7
Rollbacks:                   0.0              0.0
Transactions:               37.9

Instance Efficiency Percentages (Target 100%)
Buffer Nowait %:              99.94    Redo NoWait %:     100.00
Buffer Hit %:                 96.29    In-memory Sort %:  100.00
Library Hit %:                92.88    Soft Parse %:       96.37
Execute to Parse %:           35.20    Latch Hit %:        99.78
Parse CPU to Parse Elapsd %:  16.07    % Non-Parse CPU:    92.86
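The two load profiles support the conclusion: per-second parse and execute rates barely moved, while DB Time jumped about fivefold. A quick sketch (per-second values copied from the tables above) that ranks the deviations:

```python
def load_profile_ratios(problem, baseline):
    """Compare two AWR 'per second' load profiles and return the
    problem/baseline ratio for each metric, largest deviation first."""
    ratios = {k: problem[k] / baseline[k] for k in problem if baseline.get(k)}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

problem  = {"DB Time(s)": 13.8, "DB CPU(s)": 0.7, "User calls": 1339.8,
            "Parses": 497.4, "Hard parses": 18.2, "Executes": 774.7}
baseline = {"DB Time(s)": 2.7,  "DB CPU(s)": 0.3, "User calls": 1284.9,
            "Parses": 482.7, "Hard parses": 17.5, "Executes": 744.9}

for metric, ratio in load_profile_ratios(problem, baseline):
    print(f"{metric}: {ratio:.1f}x")
```

DB Time tops the list at roughly 5.1x while parses and executes stay near 1.0x, which is why the suspicion falls on contention rather than extra workload.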
3. MOS note 1349387.1 contains the following: "If the database has been migrated from 10g to 11g and mutex performance issue surfaces, please consider the 11.2.0.2.2 PSU + fix for 12431716. Many mutex fixes are already included in this patch." The description of Bug 12431716 closely matches our version and symptoms, and Oracle strongly recommends applying patch 12431716 on 11.2.0.2.
Troubleshooting 'cursor: pin S wait on X' waits. [ID 1349387.1]
Modified: Jan 7, 2012  Type: TROUBLESHOOTING  Status: PUBLISHED  Priority: 1
Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.2.0.2 - Release: 10.2 to 11.2
Information in this document applies to any platform.
Purpose
The purpose of the note is to help customers troubleshoot 'Cursor: pin S wait on X' waits.
Last Review Date
August 17, 2011
Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible, diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details
What is a 'Cursor: pin S wait on X' wait?
A cursor wait is associated with parsing in some form. A session may wait on this event when it is trying to get a mutex pin in share mode while another session holds the mutex pin on the same cursor object in exclusive mode. Frequently, waits for 'Cursor: pin S wait on X' are a symptom and not the cause. There may be underlying tuning requirements or known issues.
What causes 'Cursor: pin S wait on X' waits?
Frequent Hard Parses
If the frequency of Hard Parsing is extremely high, then contention can occur on this pin.
High Version Counts
When version counts become excessive, a long chain of versions needs to be examined, and this can lead to contention on this event.
Known bugs
How to diagnose the cause.
Obtain information and diagnostics to help locate the cause.
1. Run AWR and ADDM reports during the time that you see 'Cursor: pin S wait on X' waits and also collect baseline reports when you do not see the waits to make comparisons. The baseline shows us the typical 'background' concurrency and activity going on during the time of both the issue and the baseline and may help identify (for example) culprit SQLs with high version counts.
To gather this it is suggested to run AWR and ADDM for half an hour to an hour interval as follows:
SQL> @$ORACLE_HOME/rdbms/admin/awrrpt.sql
SQL> @$ORACLE_HOME/rdbms/admin/addmrpt.sql
2. Sometimes system state dump is necessary to match known issues. For example, if there is no obvious candidate SQL in AWR, capturing holder or waiter processes in systemstate allows you to focus in on potential problems. Run system state when processes appear hung on 'Cursor: pin S wait on X' waits :
(a) Non-RAC
sqlplus "/ as sysdba"
oradebug setmypid
oradebug unlimit
oradebug dump systemstate 266
wait 90 seconds
oradebug dump systemstate 266
wait 90 seconds
oradebug dump systemstate 266
quit

(b) RAC
$ sqlplus '/ as sysdba'
oradebug setmypid
oradebug unlimit
oradebug setinst all
oradebug -g all hanganalyze 4
oradebug -g all dump systemstate 266
quit
3. Errorstacks: Another way to obtain process information is with errorstack. Assuming you can identify a blocker, taking errorstacks will provide much the same information as systemstates but with a much reduced disk footprint for trace. Once the ospid of the blocker has been found, an errorstack can be generated:
$ sqlplus
SQL> oradebug setospid <ospid>
oradebug dump errorstack 3
<< wait 1 min >>
oradebug dump errorstack 3
<< wait 1 min >>
oradebug dump errorstack 3
exit
In particular, the stack from the resultant trace can be used to match known issues.
The system state and errorstacks are not easily readable; so a Service Request may need to be opened to read the files.
4. It is not always feasible to run system state dump. Thus, this note on finding the blocker can be used as well:
Note 786507.1  How to Determine the Blocking Session for Event: 'cursor: pin S wait on X'
5. Furthermore, the following SQL can be run to identify the waiters:
SELECT s.sid, t.sql_text
  FROM v$session s, v$sql t
 WHERE s.event LIKE '%cursor: pin S wait on X%'
   AND t.sql_id = s.sql_id;
6. In 11g RAC, there is another less resource intensive tool that can be used when compared with taking system state dumps:
Note 459694.1  Procwatcher: Script to Monitor and Examine Oracle DB and Clusterware Processes
How to Examine the Diagnostics.
1. Look for high parsing and high version counts from AWR.
Click on *SQL Statistics under Main Report of AWR:
Then, under SQL Statistics click on 'SQL ordered by Parse Calls' or 'SQL ordered by Version Count' to view that information:
SQL ordered by Parse Calls
* Total Parse Calls: 2,935,390
* Captured SQL account for 95.6% of Total

Parse Calls  Executions  % Total Parses  SQL Id        SQL Module  SQL Text
    668,174     668,014           22.76  gpkdazvcfwkv              select DATA_TABLE (...
    667,440     668,234           22.74  5p9vjzht9jqb              INSERT INTO DATA_TABLE (DATA_I...
From this list, investigate the top SQLs to determine whether this volume of parse calls is excessive or can be reduced.
SQL ordered by Version Count
* Only Statements with Version Count greater than 20 are displayed

Version Count  Executions  SQL Id        SQL Module  SQL Text
          277     392,737  30d5a6v16mpb              select FLOW_CONTEXT (...
           78     131,104  7c0gj35488xs              INSERT INTO PROJECT (DOC_ID, ...
From this list, investigate the SQLs with the high version count. What are the reasons that these statements are not shared? Can this be addressed?
2. From systemstate and errorstack, we are looking for stacks to match known issues.
For systemstates/errorstacks to be useful, they must be taken "at the right time". On a fast-moving system, processes may have moved on, so holders and waiters are not shown. Ideally the systemstates show the blocker and the processes being blocked, or at least one of these. The stack from the blocker may indicate known issues.
Potential Solutions
1. Tune sqls with high parse count by either investigating the application or the sql.
From AWR, the parse statistics is at the top of the report under load profile:
In this case there are mainly soft parses. However, a high hard parse rate may indicate heavy use of literals or the introduction of many new SQLs; in that case, consider using bind variables in place of literals.
If there is high percentage of soft parses, then check the application to see if it is using shareable sqls.
Ideally, Execute to Parse should be closer to 100%. The application should parse less and execute many times. See:
Note 62143.1  Understanding and Tuning the Shared Pool
Also remember if the shared pool is flushed, then sqls will need to be hard parsed. This also may cause mutex waits. So make sure sqls are in memory once hard parsed and monitor to see if mutex waits are alleviated.
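To make the note's point about literals concrete (this Python sketch is an editorial illustration, not part of the MOS note): statements that differ only in a literal hard-parse as separate cursors, while the bind-equivalent text is shared. A crude normalizer shows the collapse:

```python
import re

def normalize_literals(sql):
    """Crude sketch: replace numeric and string literals with a bind
    placeholder so textually different statements share one text."""
    sql = re.sub(r"'[^']*'", ":b", sql)   # string literals -> bind
    sql = re.sub(r"\b\d+\b", ":b", sql)   # numeric literals -> bind
    return sql

# Column/table names echo the statement from the ORA-04031 message above
stmts = [
    "select company_id, user_id from t_xxxx_user where user_name = 'alice'",
    "select company_id, user_id from t_xxxx_user where user_name = 'bob'",
]
shared = {normalize_literals(s) for s in stmts}
print(len(stmts), "literal texts ->", len(shared), "shared text")  # 2 literal texts -> 1 shared text
```

This is roughly what CURSOR_SHARING=FORCE does server-side; using real bind variables in the application achieves the same sharing without the rewrite.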
2. High version counts can also cause 'cursor: pin S wait on X' waits.
Check V$SQL_SHARED_CURSOR to see the potential reason for the high version count using:
Note 438755.1  Formated V$SQL_SHARED_CURSOR Report by SQLID or Hash Value
Note 296377.1  Troubleshooting: High Version Count Issues
There are some notable bugs with high version counts:
Note 10157392.8  Bug 10157392 - High version counts for SQL with binds (BIND_MISMATCH)
Note 9689310.8   Bug 9689310 - Excessive child cursors / high VERSION_COUNT / OERI:17059 due to bind mismatch
3. For more known defects, please go to following note and click on known bugs:
Note 1298015.1  WAITEVENT: "cursor: pin S wait on X" Reference Note
Click on the version that applies and review bug or bugs with similar scenario.
4. If the database has been migrated from 10g to 11g and mutex performance issue surfaces, please consider the 11.2.0.2.2 PSU + fix for 12431716. Many mutex fixes are already included in this patch:
Note 1291879.1  Oracle Database Patch Set Update 11.2.0.2.2 Known Issues
Troubleshooting Other Issues
For guidance troubleshooting other performance issues see:
Document 1377446.1  Troubleshooting Performance Issues
References
NOTE:1291879.1 - Oracle Database Patch Set Update 11.2.0.2.2 Known Issues
NOTE:1298015.1 - WAITEVENT: "cursor: pin S wait on X" Reference Note
NOTE:296377.1 - Troubleshooting: High Version Count Issues
NOTE:459694.1 - Procwatcher: Script to Monitor and Examine Oracle DB and Clusterware Processes
NOTE:786507.1 - How to Determine the Blocking Session for Event: 'cursor: pin S wait on X'
NOTE:1356828.1 - FAQ: 'cursor: ' waits
Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU - Overlay fix Recommended [ID 1321817.1]
Modified: May 17, 2011  Type: ALERT  Status: PUBLISHED  Priority: 1
Applies to:
Oracle Server - Enterprise Edition - Version: 11.2.0.2 to 11.2.0.2 - Release: 11.2 to 11.2
Information in this document applies to any platform.
This issue affects the following server releases only:
- 11.2.0.2.2 Database Patch Set Update (PSU) [ 11724916 ]
- 11.2.0.2.2 Grid Infrastructure Patch Set Update (GI PSU) [ 12311357 ]
- Windows 11.2.0.2 Patch bundles 4, 5 and 6.
This issue also affects 11.2.0.2.1 interim overlay patches for Bug 10411618 on Unix platforms downloaded before 14/May/2011.
(Interim patches for other versions are NOT affected. Interim patches downloaded on or after 14/May/2011 are NOT affected).
This issue does NOT affect any of the Exadata Database Recommended Patches.
Description
An incomplete fix in the above releases can cause database mutex waits to be accompanied by higher CPU usage than expected.
Likelihood of Occurrence
This problem is most likely to impact large heavily concurrent systems especially if those systems have some degree of mutex contention.
Possible Symptoms
There are two main symptoms of the problem:
- Higher CPU usage may be seen for mutex waits compared to earlier releases. This may appear as CPU spikes or as generally higher CPU usage, depending on the nature of the contention.
- The hidden parameters that allow configuration of the mutex wait scheme do not work as expected. Notably, setting "_mutex_wait_scheme" has no effect.
Workaround or Resolution
For non-Windows platforms it is recommended to apply the fix for Bug 12431716 on top of the 11.2.0.2.2 PSU.
For Windows platforms on 11.2.0.2 bundle patch 4, 5 or 6 it is recommended to apply bundle patch 7 when available. If that patch cannot be applied then it is recommended to set "_mutex_wait_time"=8 at instance level until such time as it is applied.
11.2.0.2.1 systems with overlay patches for bug 10411618 from before 14/May/2011 are advised to replace the patch with the latest version.
Once the relevant patch is in place, the mutex wait scheme can be configured as described in Note 10411618.8, if required.
Patches
For non-Windows platforms with the 11.2.0.2.2 PSU installed, it is recommended to apply the overlay fix for Bug 12431716 (or any patch containing that fix).
For Windows platforms it is recommended to apply 11.2.0.2 bundle patch 7 (or later) once available.
11.2.0.2.1 overlay patches for 10411618 on non-Windows platforms were reloaded on 14/May/2011. If you have an 11.2.0.2.1 overlay patch for bug 10411618 from before that date then it is advisable to replace it with the current version.
e.g. roll back the installed patch, download the latest version of the Bug 10411618 patch for 11.2.0.2.1, and apply the replacement patch.
The problem will be fixed generally in the next 11.2.0.2 Patch Set Update (11.2.0.2.3) due for release in July 2011.
Modification History
13-May-2011 - Created
16-May-2011 - Confirm this does NOT affect Exadata Recommended Patches
17-May-2011 - Add references to 11.2.0.2.2 PSU / GI PSU Known Issues (which mention this problem)
References
NOTE:10411618.8 - Bug 10411618 - Enhancement to add different "Mutex" wait schemes
NOTE:11724916.8 - Bug 11724916 - 11.2.0.2.2 Patch Set Update (PSU)
NOTE:12311357.8 - Bug 12311357 - 11.2.0.2.2 Grid Infrastructure Patch Set Update (GI PSU)
NOTE:12431716.8 - Bug 12431716 - Mutex waits may cause higher CPU usage in 11.2.0.2.2 PSU / GI PSU
NOTE:1272288.1 - 11.2.0.2.X Grid Infrastructure Bundle/PSU Known Issues
NOTE:1291879.1 - Oracle Database Patch Set Update 11.2.0.2.2 Known Issues