作者:
eygle |
English Version
【转载时请以超链接形式标明文章
出处和作者信息及
本声明】
链接: http://www.eygle.com/archives/2004/10/howto_capture_problem_sql.html
链接: http://www.eygle.com/archives/2004/10/howto_capture_problem_sql.html
问题描述:
开发人员报告系统运行缓慢,影响用户访问.
1.登陆数据库主机
使用vmstat检查,发现CPU资源已经耗尽,大量任务位于运行队列:
bash-2.03$ vmstat 3 procs memory page disk faults cpu r b w swap free re mf pi po fr de sr s6 s9 s1 sd in sy cs us sy id 0 0 0 5504232 1464112 0 0 0 0 0 0 0 0 1 1 0 4294967196 0 0 -84 -5 -145 131 0 0 5368072 1518360 56 691 0 2 2 0 0 0 1 0 0 3011 7918 2795 97 3 0 131 0 0 5377328 1522464 81 719 0 2 2 0 0 0 1 0 0 2766 8019 2577 96 4 0 130 0 0 5382400 1524776 67 682 0 0 0 0 0 0 0 0 0 3570 8534 3316 97 3 0 134 0 0 5373616 1520512 127 1078 0 2 2 0 0 0 1 0 0 3838 9584 3623 96 4 0 136 0 0 5369392 1518496 107 924 0 5 5 0 0 0 0 0 0 2920 8573 2639 97 3 0 132 0 0 5364912 1516224 63 578 0 0 0 0 0 0 0 0 0 3358 7944 3119 97 3 0 129 0 0 5358648 1511712 189 1236 0 0 0 0 0 0 0 0 0 3366 10365 3135 95 5 0 129 0 0 5354528 1511304 120 1194 0 0 0 0 0 0 0 4 0 3235 8864 2911 96 4 0 128 0 0 5346848 1507704 99 823 0 0 0 0 0 0 0 3 0 3189 9048 3074 96 4 0 125 0 0 5341248 1504704 80 843 0 2 2 0 0 0 6 1 0 3563 9514 3314 95 5 0 133 0 0 5332744 1501112 79 798 0 0 0 0 0 0 0 1 0 3218 8805 2902 97 3 0 129 0 0 5325384 1497368 107 643 0 2 2 0 0 0 1 4 0 3184 8297 2879 96 4 0 126 0 0 5363144 1514320 81 753 0 0 0 0 0 0 0 0 0 2533 7409 2164 97 3 0 136 0 0 5355624 1510512 169 566 786 0 0 0 0 0 0 1 0 3002 8600 2810 96 4 0 130 1 0 5351448 1502936 267 580 1821 0 0 0 0 0 0 0 0 3126 7812 2900 96 4 0 129 0 0 5347256 1499568 155 913 2 2 2 0 0 0 0 1 0 2225 8076 1941 98 2 0 116 0 0 5338192 1495400 177 1162 0 0 0 0 0 0 0 1 0 1947 7781 1639 97 3 0 |
2.使用Top命令
观察进程CPU耗用,发现没有明显过高CPU使用的进程
$ top
last pid: 28313; load averages: 99.90, 117.54, 125.71 23:28:38 296 processes: 186 sleeping, 99 running, 2 zombie, 9 on cpu CPU states: 0.0% idle, 96.5% user, 3.5% kernel, 0.0% iowait, 0.0% swap Memory: 4096M real, 1404M free, 2185M swap in use, 5114M swap free PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND 27082 oracle8i 1 33 0 1328M 1309M run 0:17 1.29% oracle 26719 oracle8i 1 55 0 1327M 1306M sleep 0:29 1.11% oracle 28103 oracle8i 1 35 0 1327M 1304M run 0:06 1.10% oracle 28161 oracle8i 1 25 0 1327M 1305M run 0:04 1.10% oracle 26199 oracle8i 1 45 0 1328M 1309M run 0:42 1.10% oracle 26892 oracle8i 1 33 0 1328M 1310M run 0:24 1.09% oracle 27805 oracle8i 1 45 0 1327M 1306M cpu/1 0:10 1.04% oracle 23800 oracle8i 1 23 0 1327M 1306M run 1:28 1.03% oracle 25197 oracle8i 1 34 0 1328M 1309M run 0:57 1.03% oracle 21593 oracle8i 1 33 0 1327M 1306M run 2:12 1.01% oracle 27616 oracle8i 1 45 0 1329M 1311M run 0:14 1.01% oracle 27821 oracle8i 1 43 0 1327M 1306M run 0:10 1.00% oracle 26517 oracle8i 1 33 0 1328M 1309M run 0:33 0.97% oracle 25785 oracle8i 1 44 0 1328M 1309M run 0:46 0.96% oracle 26241 oracle8i 1 45 0 1327M 1306M run 0:42 0.96% oracle |
3.检查进程数量
bash-2.03$ ps -ef|grep ora|wc -l 258 bash-2.03$ ps -ef|grep ora|wc -l 275 bash-2.03$ ps -ef|grep ora|wc -l 274 bash-2.03$ ps -ef|grep ora|wc -l 278 bash-2.03$ ps -ef|grep ora|wc -l 277 bash-2.03$ ps -ef|grep ora|wc -l 366 |
发现系统存在大量Oracle进程,大约在300左右,而正常情况下Oracle连接数应该在100左右.
4.检查数据库
查询v$session_wait获取各进程等待事件
|
发现存在大量db file scattered read及db file sequential read等待.
5.捕获相关SQL
这里用到了我的以下脚本getsqlbysid.sql:
SELECT sql_text FROM v$sqltext a WHERE a.hash_value = (SELECT sql_hash_value FROM v$session b WHERE b.SID = '&sid') ORDER BY piece ASC / |
|
对几个全表扫描进程跟踪以后,得到以上SQL语句.
以上语句如果良好编码应该使用绑定变量.但是现在这个不是我们关心的.
使用该应用用户连接,检查其执行计划:
SQL> set autotrace trace explain SQL> select i.vc2title,i.numinfoguid 2 from hs_info i where i.intenabledflag = 1 3 and i.intpublishstate = 1 and i.datpublishdate <=sysdate 4 and i.numcatalogguid = 3475 5 order by i.datpublishdate desc, i.numorder desc ; Execution Plan ---------------------------------------------------------- 0 SELECT STATEMENT Optimizer=CHOOSE (Cost=228 Card=1 Bytes=106) 1 0 SORT (ORDER BY) (Cost=228 Card=1 Bytes=106) 2 1 TABLE ACCESS (FULL) OF 'HS_INFO' (Cost=218 Card=1 Bytes=106) SQL> select count(*) from hs_info; COUNT(*) ---------- 227404 |
该表这里有22万记录,全表扫描已经不再适合.
检查该表,存在以下索引:
SQL> select index_name,index_type from user_indexes where table_name='HS_INFO'; INDEX_NAME INDEX_TYPE ------------------------------ --------------------------- HSIDX_INFO1 FUNCTION-BASED NORMAL HSIDX_INFO_SEARCHKEY DOMAIN PK_HS_INFO NORMAL 检查索引键值: SQL> select index_name,column_name from user_ind_columns where table_name ='HS_INFO'; INDEX_NAME COLUMN_NAME ------------------------------ -------------------- HSIDX_INFO1 NUMORDER HSIDX_INFO1 SYS_NC00024$ HSIDX_INFO_SEARCHKEY VC2INDEXWORDS PK_HS_INFO NUMINFOGUID SQL> desc hs_info Name Null? Type ----------------------------------------------------------------- -------- --------------------------- NUMINFOGUID NOT NULL NUMBER(15) NUMCATALOGGUID NOT NULL NUMBER(15) INTTEXTTYPE NOT NULL NUMBER(38) VC2TITLE NOT NULL VARCHAR2(60) VC2AUTHOR VARCHAR2(100) NUMPREVINFOGUID NUMBER(15) NUMNEXTINFOGUID NUMBER(15) NUMORDER NOT NULL NUMBER(15) DATPUBLISHDATE NOT NULL DATE INTPUBLISHSTATE NOT NULL NUMBER(38) VC2PUBLISHERID VARCHAR2(30) VC2INDEXWORDS VARCHAR2(200) VC2WAPPREVPATH VARCHAR2(200) VC2WEBPREVPATH VARCHAR2(200) VC2WAP2PREVPATH VARCHAR2(200) NUMVISITED NOT NULL NUMBER(15) INTENABLEDFLAG NOT NULL NUMBER(38) DATCREATETIME NOT NULL DATE DATMODIFYTIME NOT NULL DATE VC2NOTES VARCHAR2(1000) INTINFOTYPE NOT NULL NUMBER(38) VC2PRIZEFLAG VARCHAR2(1) VC2DESC VARCHAR2(1000) |
6.决定创建新的索引以消除全表扫描
|
7.观察系统状况
原大量等待消失
|
持续观察的CPU使用情况
|
至此,此问题得以解决.