早上CRM的人说Exadata的数据库无法登陆,一看发现节点一个节点Hang住了,使用SQL*Plus登陆会卡死,且日志中提示:无法分配共享内存。
查看alert日志发现,其中出现了很多ORA-04031的错:
Errors in file /u01/app/oracle/diag/rdbms/srcbfin/SRCBFIN2/trace/SRCBFIN2_smon_12501.trc: ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","unknown object","KGLH0^e949e218","kglHeapInitialize:temp") Tue Sep 25 08:55:46 2012 Errors in file /u01/app/oracle/diag/rdbms/srcbfin/SRCBFIN2/trace/SRCBFIN2_smon_12501.trc: ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","unknown object","KGLH0^e949e218","kglHeapInitialize:temp") Errors in file /u01/app/oracle/diag/rdbms/srcbfin/SRCBFIN2/trace/SRCBFIN2_smon_12501.trc: ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","unknown object","KGLH0^e949e218","kglHeapInitialize:temp") Tue Sep 25 08:56:06 2012 Errors in file /u01/app/oracle/diag/rdbms/srcbfin/SRCBFIN2/trace/SRCBFIN2_smon_12501.trc: ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","unknown object","KGLH0^e949e218","kglHeapInitialize:temp") Errors in file /u01/app/oracle/diag/rdbms/srcbfin/SRCBFIN2/trace/SRCBFIN2_smon_12501.trc: ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","select t.ts#,t.file#,t.block...","KGLH0^19d2357e","kglHeapInitialize:temp") Tue Sep 25 08:56:26 2012 Errors in file /u01/app/oracle/diag/rdbms/srcbfin/SRCBFIN2/trace/SRCBFIN2_smon_12501.trc: ORA-04031: unable to allocate 32 bytes of shared memory ("shared pool","unknown object","KGLH0^e949e218","kglHeapInitialize:temp") |
截取trace文件的部分内容如下:
*** 2012-09-25 05:18:53.209 AUTO MEM: backup fail a, 0, 25296896, 384, 385, 0, 100
*** 2012-09-25 05:18:53.209 ================================= Begin 4031 Diagnostic Information ================================= The following information assists Oracle in diagnosing causes of ORA-4031 errors. This trace may be disabled by setting the init.ora _4031_dump_bitvec = 0 ===================================== Allocation Request Summary Informaton ===================================== Current information setting: 04014fff SGA Heap Dump Interval=3600 seconds Dump Interval=300 seconds Last Dump Time=09/25/2012 05:18:52 Dump Count=1 Allocation request for: kglHeapInitialize:temp Heap: 0xc565e0020, size: 32 ****************************************************** HEAP DUMP heap name="sga heap" desc=0x60001190 extent sz=0x9800 alt=248 het=32767 rec=9 flg=-126 opc=0 parent=(nil) owner=(nil) nex=(nil) xsz=0x1 heap=(nil) fl2=0x60, nex=(nil) ds for latch 1: 0x60056bf0 0x60058448 0x60059ca0 0x6005b4f8 ds for latch 2: 0x600604b8 0x60061d10 0x60063568 0x60064dc0 ds for latch 3: 0x60069d80 0x6006b5d8 0x6006ce30 0x6006e688 ds for latch 4: 0x60073648 0x60074ea0 0x600766f8 0x60077f50 ds for latch 5: 0x6007cf10 0x6007e768 0x6007ffc0 0x60081818 ds for latch 6: 0x600867d8 0x60088030 0x60089888 0x6008b0e0 reserved granule count 0 (granule size 134217728) ****************************************************** HEAP DUMP heap name="KGLH0^e949e218" desc=0xc565e0020 extent sz=0xfe8 alt=32767 het=56 rec=0 flg=2 opc=2 parent=0x60001190 owner=0xc565dffd0 nex=(nil) xsz=0x8000000 heap=(nil) fl2=0x26, nex=(nil) Subheap has 0 bytes of memory allocated ==================== Process State Object ==================== ---------------------------------------- SO: 0xc00d41558, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xc00d41558, name=process, file=ksu.h LINE:12616, pg=0 (process) Oracle pid:24, ser:1, calls cur/top: 0xb9713fb58/0xc1147da10 flags : (0x16) SYSTEM flags2: (0x0), flags3: (0x10) intr error: 0, call error: 0, sess error: 0, txn error 0 intr queue: empty ksudlp FALSE at location: 0 (post info) last post received: 4248 0 2 last post received-location: ksl2.h LINE:2362 ID:kslpsr last process to post me: c00d404a8 1 6 last post sent: 0 0 26 last post sent-location: ksa2.h LINE:285 ID:ksasnd last process posted by me: c00d404a8 1 6 (latch info) wait_event=0 bits=0 Process Group: DEFAULT, pseudo proc: 0xc58e7a120 O/S info: user: oracle, term: UNKNOWN, ospid: 12501 OSD pid info: Unix process pid: 12501, image: oracle@dm01db02.srcb.com (SMON) PSO child state object changes : Dump of memory from 0x0000000C10CE8E58 to 0x0000000C10CE9060 C10CE8E50 00000000 00000000 [........] C10CE8E60 00000000 00000000 00000000 00000000 [................] Repeat 31 times ========================= User Session State Object ========================= ---------------------------------------- SO: 0xc20ea7708, type: 4, owner: 0xc00d41558, flag: INIT/-/-/0x00 if: 0x3 c: 0x3 proc=0xc00d41558, name=session, file=ksu.h LINE:12624, pg=0 (session) sid: 1 ser: 1 trans: (nil), creator: 0xc00d41558 flags: (0x51) USR/- flags_idl: (0x1) BSY/-/-/-/-/- flags2: (0x40409) -/-/INC DID: , short-term DID: txn branch: (nil) oct: 0, prv: 0, sql: (nil), psql: 0xc4c326750, user: 0/SYS ksuxds FALSE at location: 0 service name: SYS$BACKGROUND Current Wait Stack: Not in wait; last wait ended 0.000602 sec ago Wait State: fixed_waits=0 flags=0x21 boundary=(nil)/-1 Session Wait History: elapsed time of 0.000629 sec since last wait 0: waited for 'SGA: allocation forcing component growth' =0x0, =0x0, =0x0 wait_id=5546611 seq_num=42378 snap_id=101 wait times: snap=0.000000 sec, exc=5.041349 sec, total=5.050030 sec wait times: max=infinite wait counts: calls=100 os=100 occurred after 0.000000 sec of elapsed time 1: waited for 'SGA: allocation forcing component growth' =0x0, =0x0, =0x0 wait_id=5546711 seq_num=42377 snap_id=1 wait times: snap=0.000101 sec, exc=0.000101 sec, total=0.000101 sec wait times: max=infinite wait counts: calls=1 os=1 occurred after 0.000000 sec of elapsed time
Sampled Session History of session 1 serial 1 --------------------------------------------------- The sampled session history is constructed by sampling the target session every 1 second. The sampling process captures at each sample if the session is in a non-idle wait, an idle wait, or not in a wait. If the session is in a non-idle wait then one interval is shown for all the samples the session was in the same non-idle wait. If the session is in an idle wait or not in a wait for consecutive samples then one interval is shown for all the consecutive samples. Though we display these consecutive samples in a single interval the session may NOT be continuously idle or not in a wait (the sampling process does not know).
The history is displayed in reverse chronological order.
sample interval: 1 sec, max history 120 sec --------------------------------------------------- [1 sample, 05:18:52] waited for 'SGA: allocation forcing component growth', seq_num: 42362 p1: ''=0x0 p2: ''=0x0 p3: ''=0x0 time_waited: >= 0 sec (still in wait) [1 sample, 05:18:51] waited for 'SGA: allocation forcing component growth', seq_num: 42322 p1: ''=0x0 p2: ''=0x0 p3: ''=0x0 time_waited: 0.050906 sec (sample interval: 0 sec) [1 sample, 05:18:50] waited for 'SGA: allocation forcing component growth', seq_num: 42282 p1: ''=0x0 p2: ''=0x0 p3: ''=0x0 time_waited: 0.050941 sec (sample interval: 0 sec) idle wait at each sample [5 samples not collected, 05:17:59 - 05:18:03] [55 samples, 05:17:04 - 05:17:58] idle wait at each sample [5 samples not collected, 05:16:59 - 05:17:03] [17 samples, 05:16:53 - 05:16:58] idle wait at each sample --------------------------------------------------- Sampled Session History Summary: longest_non_idle_wait: 'SGA: allocation forcing component growth' [1 sample, 05:18:50 ] time_waited: 0.050941 sec (sample interval: 0 sec) --------------------------------------------------- temporary object counter: 0 ---------------------------------------- |
从以上日志中,会发现存在很多和SGA分配的等待事件:SGA:allocation forcing component growth,看来这个问题可以暂时确定为是“SGA被撑爆了”。这个等待事件在AWR报告中也可以看到:
Top 5 Timed Foreground Events
Event | Waits | Time(s) | Avg wait (ms) | % DB time | Wait Class |
DB CPU |
| 129 |
| 26.90 |
|
SGA: allocation forcing component growth | 612 | 105 | 172 | 21.96 | Other |
latch: shared pool | 11 | 88 | 8040 | 18.48 | Concurrency |
SQL*Net more data to client | 384,182 | 55 | 0 | 11.59 | Network |
cell smart table scan | 6,645 | 16 | 2 | 3.44 | User I/O |
SGA中的两个主要的区域是Share Pool和Buffer Cache,先使用以下语句查看这个两个区域在的值在SGA中所占比例的变化情况:
共享池最近3天的变化趋势 : select to_char(b.end_interval_time,'yyyy-mm-dd hh24:mi:ss') eit,a.value/1024/1024 from dba_hist_parameter a ,dba_hist_snapshot b where a.snap_id=b.snap_id and a.instance_number=b.instance_number and a.instance_number=2 and a.parameter_name like '%shared_pool_size%' and b.end_interval_time>=trunc(sysdate)-3 order by b.end_interval_time;
db cache最近3天的变化趋势: select to_char(b.end_interval_time,'yyyy-mm-dd hh24:mi:ss') eit,a.value/1024/1024 from dba_hist_parameter a ,dba_hist_snapshot b where a.snap_id=b.snap_id and a.instance_number=b.instance_number and a.instance_number=2 and a.parameter_name like '%db_cache_size%' and b.end_interval_time>=trunc(sysdate)-3 order by b.end_interval_time; |
把结果放到excel里面生成如下图,其中蓝色区域为SharePool的值,红色区域为BufferCache的值:
从上图中可以看出,其实从昨天开始,SharePool就不断侵占BufferCache的空间,最终导致BufferCache空间所剩无几,直至应用失败。这个在AWR报告中也可以看出来:
Cache Sizes
Begin | End | |||
Buffer Cache: | 256M | 256M | Std Block Size: | 8K |
Shared Pool Size: | 48,128M | 48,128M | Log Buffer: | 35,252K |
看来此次的故障是由于SGA中SharedPool被某些回话(作业)所大量侵占之后,导致其他回话无法申请SharePool所致,并且那些侵占SharedPool的回话(作业)还在不断的申请SharedPool空间。后来经过排查发现是最近应用上了个程序补数据,并且是使用游标的方式,同时由于应用程序没仔细检查,导致无限循环,作业无法释放资源所致。
解决方法:
1、 重启实例
2、 由于之前SharedPool和BufferCache都是自动管理的,没有设置为最小固定值,所以SharedPool才会侵占BufferCache的空间,因此修改SharedPool和BufferCache为固定值,且保留一定的冗余,但是这个并不是避免该问题的方案,因为如果SharedPool再次被回话(作业)完全侵占也是会出现此问题。
可以参考一下网上大牛们关于ORA-04031错误的分析:
Eygle: DBA手记:共享池的改进与ORA-04031的变化
老熊的三分地:ORA-04031案例一则
DBA Notes: 如何解决ORA-04031 错误