Oracle数据库经常会遇到CPU利用率很高的情况,这种时候大都是数据库中存在着严重性能低下的SQL语句,这种SQL语句大大的消耗了CPU资源,导致整个系统性能低下。当然,引起严重性能低下的SQL语句的原因是多方面的,具体的原因要具体的来分析,下面通过一个实际的案例来说明如何来诊断和解决CPU利用率高的这类问题。
8EUg)Z:I25736250
操作系统:solairs8
数据库:Oracle9.2.0.4
问题描述:现场工程师汇报数据库非常慢,几乎所有应用操作均无法正常进行。
首先登陆主机,执行top发现CPU资源几乎消耗殆尽,存在很多占用CPU很高的进程,而内存和I/O都不高,具体如下:
last pid: 26136;load averages:8.89,8.91,8.12
216 processes: 204 sleeping, 8 running, 4 on cpu
CPU states:0.6% idle, 97.3% user,1.8% kernel,0.2% iowait,0.0% swap
Memory: 8192M real, 1166M free, 14M swap in use, 8179M swap free
PID USERNAME THR PRI NICESIZERES STATETIMECPU COMMAND
25725 oracle1500 4550M 4508M cpu212:23 11.23% oracle
25774 oracle1410 4550M 4508M run14:25 10.66% oracle
26016 oracle1310 4550M 4508M run5:41 10.37% oracle
26010 oracle1410 4550M 4508M run4:409.81% oracle
26014 oracle1510 4550M 4506M cpu64:199.76% oracle
25873 oracle1410 4550M 4508M run12:109.45% oracle
25723 oracle1500 4550M 4508M run15:099.40% oracle
26121 oracle1410 4550M 4506M cpu01:139.28% oracle
25745 oracle1410 4551M 4512M run9:339.28% oracle
26136 oracle1410 4550M 4506M run0:065.61% oracle
409 root15590 7168K 7008K sleep 173.1H0.52% picld
25653 oracle1590 4550M 4508M sleep1:010.46% oracle
25565 oracle1590 4550M 4508M sleep0:070.24% oracle
25703 oracle1590 4550M 4506M sleep0:080.13% oracle
25701 oracle1590 4550M 4509M sleep0:230.10% oracle
于是先查看数据库的告警日志ALERT文件,并没有发现有什么错误存在,日志显示数据库运行正常,排除数据库本身存在问题。
然后查看这些占用CPU资源很高的Oracle进程究竟是在做什么操作,使用如下SQL语句:
select sql_text,spid,v$session.program,processfrom
v$sqlarea,v$session,v$process
where v$sqlarea.address=v$session.sql_address
and v$sqlarea.hash_value=v$session.sql_hash_value
and v$session.paddr=v$process.addr
and v$process.spid in (PID);
用top中占用CPU很高的进程的PID替换脚本中的PID,得到相应的Oracle进程所执行的SQL语句,发现占用CPU资源很高的进程都是执行同一个SQL语句:
SELECT d.domainname,d.mswitchdomainid, a.SERVICEID,a.SERVICECODE,a.USERTYPE,a.STATUS,a.NOTIFYSTATUS,to_char(a.DATECREATED,'yyyy-mm-dd hh24:mi:ss') DATECREATED,VIPFLAG,STATUS2,CUSTOMERTYPE,CUSTOMERIDFROM service a, gatewayloc b, subbureaunumber c, mswitchdomain dWHERE b.mswitchdomainid = d.mswitchdomainid and b.gatewaysn = c.gatewaysnAND a.ServiceCode like c.code||'%' and a.serviceSpecID=1 and a.status!='4' and a.status!='10'and a.servicecode like '010987654321%' and SubsidiaryID=999999999
基本上可以肯定是这个SQL引起了系统CPU资源大量被占用,那究竟是什么原因造成这个SQL这么大量占用CPU资源呢,我们先来看看数据库的进程等待事件都有些什么:
SQL> select sid,event,p1,p1text from v$session_wait;
SID EVENTP1 P1TEXT
---------- ----------------------------------------------------------------
12 latch free4.3982E+12 address
36 latch free4.3982E+12 address
37 latch free4.3982E+12 address
84 latch free4.3982E+12 address
102 latch free4.3982E+12 address
101 latch free4.3982E+12 address
85 latch free4.3982E+12 address
41 latch free4.3982E+12 address
106 latch free4.3982E+12 address
155 latch free4.3982E+12 address
151 latch free4.3982E+12 address
149 latch free4.3982E+12 address
147 latch free 4.3982E+12 address
1 pmon timer300 duration
从上面的查询我们可以看出,大都是latch free的等待事件,然后接着查一下这些latch的等待都是什么进程产生的:
SQL> select spid from v$process where addr in
(select paddr from v$session where sid in(84,102,101,106,155,151));
SPID
------------
25774
26010
25873
25725
26014
26016
由此看出latch free这个等待事件导致了上面的那个SQL语句都在等待,占用了大量的CPU资源。我们来看看究竟主要是那种类型的latch的等待,根据下面的SQL语句:
SQL> SELECT latch#, name, gets, misses, sleeps
FROM v$latch
WHERE sleeps>0
ORDER BY sleeps;
LATCH# NAMEGETSMISSESSLEEPS
---------- ----------------------------------------------------------------
15messages96876201
159library cache pin allocation407322431
132dml lock allocation1945332132
4session allocation304897483
115redo allocation2380312864
17enqueue hash chains277510 855
7session idle bit272726431416
158library cache pin3881788558658
156shared pool27716296184662
157library cache563757325246801
98cache buffers chains1722750424758400109837
由上面的查询可以看出最主要的latch等待是cache buffers chains,这个latch的等待表明数据库存在单独的BLOCK的竞争这些latch,我们来看这个latch存在的子latch及其对应的类型:
SQL> SELECT addr, latch#, gets, misses, sleeps
FROM v$latch_children
WHERE sleeps>0
and latch# = 98
ORDER BY sleeps desc;
ADDRLATCH#GETSMISSESSLEEPS
---------------- ---------- ---------- ---------- ----------
000004000A3DFD10981084066182891389
000004000A698C70981595102244
0000040009B217389810426977134926209
0000040009B227A89810760465935697185
000004000A3E0D7098544760118922156
000004000A6C2BD0988533757134
0000040009B24888988553840925752106
000004000A36B25098108335119996
000004000A79EC70982579706435
000004000A356AD098118481016034
……………
接着我们来查看sleep较多的子latch对应都有哪些对象:
SQL> select distinct a.owner,a.segment_name,a.segment_type from
dba_extents a,
(select dbarfil,dbablk
from x$bh
where hladdr in
(select addr
from (select addr
from v$latch_children
order by sleeps desc)
where rownum < 5)) b
where a.RELATIVE_FNO = b.dbarfil
and a.BLOCK_ID <= b.dbablk and a.block_id + a.blocks > b.dbablk;
OWNER SEGMENT_NAMESEGMENT_TYPE
---------------------------------------------------------------------------
TESTI_SERVICE_SERVICESPECID INDEX
TESTI_SERVICE_SUBSIDIARYIDINDEX
TESTSERVICETABLE
TESTMSWITCHDOMAINTABLE
TESTI_SERVICE_SC_SINDEX
TESTPK_MSWITCHDOMAIN INDEX
TESTGATEWAYLOCTABLE
…………………
我们看到在开始的那个SQL语句中的几个对象都有包括在内,于是来看看开始的那个SQL的执行计划:
SQL> set autotrace trace explain
SQL>SELECT d.domainname,d.mswitchdomainid, a.SERVICEID,a.SERVICECODE,a.USERTYPE,a.STATUS,a.NOTIFYSTATUS,to_char(a.DATECREATED,'yyyy-mm-dd hh24:mi:ss') DATECREATED,VIPFLAG,STATUS2,CUSTOMERTYPE,CUSTOMERIDFROM service a, gatewayloc b, subbureaunumber c, mswitchdomain dWHERE b.mswitchdomainid = d.mswitchdomainid and b.gatewaysn = c.gatewaysnAND a.ServiceCode like c.code||'%' and a.serviceSpecID=1 and a.status!='4' and a.status!='10'and a.servicecode like '010987654321%' and SubsidiaryID=999999999;
Execution Plan
----------------------------------------------------------
0SELECT STATEMENT ptimizer=CHOOSE
10NESTED LOOPS
21NESTED LOOPS
32NESTED LOOPS
43TABLE ACCESS (FULL) OF 'SUBBUREAUNUMBER'
53TABLE ACCESS (BY INDEX ROWID) OF 'GATEWAYLOC'