从昨天晚上开始某核心库过十几分钟就会报出Cursor:pin s wait x的告警短信,而且很频繁,第二天到公司查了下原因:
首先收集了下相应时间ASH报告和AWR报告,显示的信息如下:
Top 5 Timed Events
Event Waits Time(s) Avg Wait(ms) % Total Call Time Wait Class
cursor: pin S wait on X 1,091, 611 10,688 10 45.1 Concurrency
row cache lock 95,543 6,924 72 29.2 Concurrency
收集ASH信息,间隔为15分钟,看到如下信息:
Event Event Class % Activity Avg Active Sessions
cursor: pin S wait on X Concurrency 55.94 85.03
row cache lock Concurrency 34.39 52.28
library cache load lock Concurrency 2.73 4.15
可以看到Active Sessions的数量相当高,在当时堵塞的session达到85。
继续查看:
select sample_id,session_id,
to_char(SAMPLE_TIME,'YY-MM-DD:HH:MI:SS') samp_time,
sql_id,
session_state,
event,
p1,
p1text,
p2,
p2text,
p3,
p3text,
wait_time,
time_waited
from v$active_session_history --DBA_HIST_ACTIVE_SESS_HISTORY
where to_char(SAMPLE_TIME,'YY-MM-DD:HH:MI:SS') between
'12-05-16:07:00:00' and '12-05-16:08:00:00' and event like'cursor%' and p1=2057858921
order by sample_id desc;
查看blocking sid:
SELECT decode(trunc(24386824306688/4294967296),
0,trunc(24386824306688/65536),
trunc(24386824306688/4294967296)) SID_HOLDING_MUTEX
FROM dual;
SID_HOLDING_MUTEX
--------------------------------
5678
查找阻塞此session的sid为:5678,那么看看session 5678究竟在做些什么。
select sample_id,session_id,
sql_id,
session_state,
event,
p1,
p2,
p3,
wait_time,
time_waited
from DBA_HIST_ACTIVE_SESS_HISTORY
where to_char(SAMPLE_TIME,'YY-MM-DD:HH:MI:SS') between
'12-05-16:07:00:00' and '12-05-16:08:00:00' and session_id=5678
order by sample_id desc;
SESSION_ID SQL_ID STATE EVENT P1 P2 P3 WAIT_TIME TIME_WAITED
---------- ------------- ------------- --------------- ----- ----- ---- ---------- ----
5678 g3176qdxahvv9 WAITING row cache lock 8 0 3 0 0
此session也在等待,只不过等待的事件是:row cache lock。
select cache#,parameter,type,count,gets,getmisses from v$rowcache where cache#=8;
CACHE# PARAMETER TYPE COUNT GETS GETMISSES
-------- -------------------- ----------- ---------- ---------- ----------
8 dc_objects PARENT 1410 -76918095 1851821
8 dc_object_grants SUBORDINATE 91 840478 47918
Oracle文档对dc_objects的解释:DC_OBJECTS
Look for any object compilation activity which might require an exclusive lock and thus block online activity.
现在看来是大量的解析导致的,难道是有什么作业在定时的做某种DDL操作,导致相关对象上的游标失效,于是,抓了个session,查看相关sql的对象,发现并没有做ddl操作。那为什么会间隔的导致如此多的硬解析呢。
此时,当我执行此SQL:select s.sid, s.username,s.sql_id,s.module,s.osuser,s.event, s.blocking_session,
s.seconds_in_wait,s.wait_time,w.p1,w.p2,w.p3
from v$session s,v$session_wait w,v$sql q
where s.sid=w.sid
and s.statein ('WAITING')
and s.sql_id=q.sql_id
and s.wait_class !='Idle';
数据库抛出了
ORA-04031: unable to allocate 370048 bytes of sharedmemory ("shared pool","unknown object",""sgaheap(4,0),"hng: All sessions data for API.")
ORA-0431,这个很经典的错误,导致此错误的原因也有很多,我们就不多说,看看我的错误的参数部分,sga heap(4,0),说明在第四个subpool里找不到370048bytes,那继续来看看数据库的共享池配置及使用情况:
select a.ksppinm "Parameter",
b.ksppstvl "Session Value",
c.ksppstvl "Instance Value"
from x$ksppi a, x$ksppcv b, x$ksppsv c
where a.indx = b.indx and a.indx = c.indx
and a.ksppinm in('_kghdsidx_count','__shared_pool_size','_shared_pool_reserved_pct');
Parameter Session Value Instance Value
---------------------------- --------------- -----------------
_kghdsidx_count 7 7
_shared_pool_size 6442450944 6442450944
_shared_pool_reserved_pct 5 5
系统有6G的共享池,7个subpool,另外我们的sga_target=0.
SQL> select
2 kghlurcr,
3 kghlutrn,
4 kghlufsh,
5 kghluops,
6 kghlunfu,
7 kghlunfs
8 from
9 sys.x$kghlu
10 where
11 inst_id = userenv('Instance')
12 /
RECURRENT TRANSIENT FLUSHED PINS AND ORA-4031 LAST ERROR
CHUNKS CHUNKS CHUNKS RELEASES ERRORS SIZE
---------- ---------- ---------- ---------- --------- ----------
852 865 1367514 286598314 1280 370048
1932 36615 597116 245240992 578 370024
6663 7618 2208370 70999571 2037 370000
907 960 1207376 64136373 1057 369976
892 1141 1749793 138707112 701 369952
1060 1297 1795660 52162243 1598 369928
1593 2149 1295995 176694298 1386 369904
7 rows selected.
可以看到last error size为370048,正好和我们的报错吻合,那么,再来看看各个Subpool的空间使用情况:
再来看看reserved shared pool信息:
SQL> select used_space, requests, request_misses, last_miss_size
2 from v$shared_pool_reserved;
REM Look at the breakdown of chunks for more detail on fragmentation.
Reserved
Reserved Total Area Size of
Used Requests Misses Last Miss
---------------- ---------------- ---------------- ----------------
232,613,024 623,742 9,324 370,048
select free_space,avg_free_size,used_space,avg_used_size,request_failures,last_failure_size
from v$shared_pool_reserved;
FREE_SPACE AVG_FREE_SIZE USED_SPACE AVG_USED_SIZE REQUEST_FAILURES LAST_FAILURE_SIZE
---------- ------------- ---------- ------------- ---------------- -----------------
89753440 56734.1593 232613024 147037.31 9713 370048
参数:
_shared_pool_reserved_min_alloc 4400 4400
可以看到request_failures>0并且last_failure_size > _shared_pool_reserved_min_alloc,name ORA-04031错误就可能是因为共享池保留空间缺少连续空间所致。要解决这个问题,可以考虑加大_shared_pool_reserved_min_allock来降低缓冲区共享池保留空间的对象数目,并增大shared_pool_reserved_size来加大共享池保留空间的可用内存。当抛出了ORA-04031,当抛出此错误的同时,此核心库的Cursor:pin s wait x再次出现,并且大量导致session堵塞,经过查看相关的监控作业,发现有个作业会定期执行一条sql(select count(*) from gv$session;).但是,为什么只有在查询v$session的时候才会引发呢?而且,这里只是一个查询信息,为什么会出现wait x?
从一开始这个问题就走入了误区,大量session的cursor:pin s wait x其实只是表象,并不是真正的问题原因。那么,有这么一种可能: Oracle会了帮助诊断4031错误,出现这个错误的时候可能会做一次SGA的dump,这个过程可能会长时间或频繁地持有一些latch,这样会导致SQL解析出现问题。但是,我发现4031生成的trace文件很小,只有0.2M,也不好确定。于是尝试修改参数,在测试一下:
Alter system set “_4031_dump_bitvec”=o scope=memory;
再次执行v$session的相关查询,查看结果,还是有大量的cursor:Pin s wait x,row cache lock.正如文章开头的信息,导致cursor:pins wait x的事件是row cache lock,那具体的是dc_objects,dc_objects_grant,所以猜测查询v$session的时候,有row cache一直没有成功装入,引起rowcache lock,影响了其他SQL的解析,但是因为系统是核心库,我尝试dump systemstat,结果CPU迅速飙升到80%,我也只好终止。所以,这个问题最后还是修改session_cached_cursor为200,重启数据库。