周三收到一台数据库(一个逻辑从库,10.2.0.4 64bit)昨天的数据库概况信息邮件(因为数据库不是太重要,收集的频率就设置成了一天一次),在告警日志部分信息里存在错误信息说:TEMP临时表空间不足.
查看告警日志文件:
Tue Nov 2 17:12:31 2010
ORA-1652: unable to extend temp segment by 128 in tablespace TEMP
Tue Nov 2 17:12:31 2010
ORA-1652: unable to extend temp segment by 128 in tablespace TEMP
Tue Nov 2 17:12:31 2010
ORA-1652: unable to extend temp segment by 128 in tablespace TEMP
查看temp表空间大小,发现是8000M,对于这个数据库来说应该不算小了呀!
那会不会是因为一些SQL存在问题,导致需要的optimal的workarea size过大,pga不够用,导致one-pass甚至multi-pass操作,频繁的读写临时表空间造成的呢?
于是立刻生成了两份儿包含了错误产生时刻的awr报表
2815~~2817
(2010-11-02 16:00:41 ~~2010-11-02 18:00:58)
2815--2817 和 2816--2817 这两个快照间的awr报表.
因为是要解决这个错误问题的,所以没有过多关注其他的东西,直奔主题去了,查看等待事件,查看是否有显著的
direct path read temp
direct path write temp
等待.
几乎是找不到这样的等待事件的.
查看表空间读写数据,发现在1小时的awr报表里,发现temp表空间的读写比undotbs1和另一个用户表空间的读写还要频繁,临时表空间的读写似乎是频繁了一些.
再查看pga相关数据.
1小时(17:00--18:00)的PGA Cache Hit %是98.78,而2小时(16:00--18:00)的PGA Cache Hit %是96.29,也就是说都存在pga不足导致需要读写临时表空间的问题,当然也许正是由于这个问题导致了TEMP表空间不足的问题.
查看这两份儿报表的PGA Memory Advisory部分,现在的pga大小2000M(这台主机上物理内存一共是8000M)对应的Estd PGA Overalloc Count是2而不是0.
随即查看v$pgastat,发现2个月产生了近1800次的over allocation count.似乎是pga_aggregate_target=2000M设置的偏小了一些.
当然是不是偏小,还需要查看具体的pga使用信息,查看是否存在有些SQL一次性消耗了大量PGA的情况,一次也是遇到temp表空间不足的问题,居然发现一些SQL的pga消耗的Optimal size在1G~2G之间,解决sql性能问题之后,就再也没有报错了.如果不是这样,也就是说不存在一次性大量消耗pga空间的sql的话,如果optimal size都不大,但还是频繁的出现one-pass甚至multi-pass,over allocation,PGA Cache Hit %不是100%的情况,那就是pga设置的有些偏小了.
在PGA Aggr Target Histogram部分,发现optimal size在64M~128M的sql
Total Execs Optimal Execs 1-Pass Execs
1小时报表 489 484 5
2小时报表 505 490 15
对于这个不存在大的统计查询的系统里,感觉sql似乎也存在问题.
使用下面的sql查询那个时间段写临时表空间偏大的sql:
SELECT a.r,a.DBID,a.SQL_ID,a.DIRECT_WRITES_DELTA,b.SQL_TEXT,executions_delta,trunc(a.DIRECT_WRITES_DELTA*8/executions_delta/1024) mb_perexec
FROM
(
SELECT DBID,SQL_ID,DIRECT_WRITES_DELTA,ROWNUM r,executions_delta
FROM
(
SELECT DBID,SQL_ID,DIRECT_WRITES_DELTA,executions_delta
FROM DBA_HIST_SQLSTAT s
WHERE SNAP_ID between :v_b and :v_e AND DIRECT_WRITES_DELTA>0 AND PARSING_SCHEMA_ID<>0 /* not sys user's sql*/
ORDER BY DIRECT_WRITES_DELTA DESC
)
WHERE ROWNUM<=20
) a,DBA_HIST_SQLTEXT b
WHERE a.dbid=b.DBID AND a.sql_id=b.sql_id and b.command_type in(2,3,6,7,189) /* insert,select,update,delete,merge*/
ORDER BY r;
代人2815,2817 和2816,2817,发现总的写临时表空间量大和每次写临时表空间量大的只有一个sql:
R DBID SQL_ID DIRECT_WRITES_DELTA EXECUTIONS_DELTA MB_PEREXEC
---------- ---------- ------------- ------------------- ---------------- ----------
1 1650823549 9ft62wuac5ayw 952080 312 23
2 1650823549 9ft62wuac5ayw 626382 737 6
3 1650823549 9ft62wuac5ayw 379549 54 54
R DBID SQL_ID DIRECT_WRITES_DELTA EXECUTIONS_DELTA MB_PEREXEC
---------- ---------- ------------- ------------------- ---------------- ----------
1 1650823549 9ft62wuac5ayw 952080 312 23
2 1650823549 9ft62wuac5ayw 626382 737 6
在awr报表中查看一下是否有这个sql的信息:
在awr报表中查看,发现这个sql
% Total DB Time=87.82
Elap per Exec (s): 419.83
CPU per Exec (s):2.00
SQL ordered by Gets %Total:74.69
Gets per Exec 823893.44
SQL ordered by Reads %Total:63.52
Reads per Exec :1015.41 这些物理读极可能是读取之前写到临时表空间的临时段(都是多块读)
感觉这个sql一定是存在性能问题的.
使用awrsqrpt.sql生成sql执行计划和统计信息:
sql格式化之后是这样的:
select productid1,
productid2,
productname1,
productname2,
nvl(vote1, 0) as vote1,
nvl(vote2, 0) as vote2,
p1.catalogid as catalogid,
p1.catalogname catalogname,
p1.brandname brandname,
p1.brandid brandid,
getproducturl(productid1, 7) as product1_indexurl,
getproducturl(productid1, 8) product1_priceurl,
getproducturl(productid2, 7) as product2_indexurl,
getproducturl(productid2, 8) product2_priceurl,
p1.image1 as product1_image1,
p1.image2 as product1_image2,
p1.image3 as product1_image3,
p1.image4 as product1_image4,
nvl(p1.minprice, 0) as product1_minprice,
nvl(p1.maxprice, 0) as product1_maxprice,
nvl(p1.refprice, 0) as product1_refprice,
p2.image1 as product2_image1,
p2.image2 as product2_image2,
p2.image3 as product2_image3,
p2.image4 as product2_image4,
nvl(p2.minprice, 0) as product2_minprice,
nvl(p2.maxprice, 0) as product2_maxprice,
nvl(p2.refprice, 0) product2_refprice
from pk
inner join product p1 on pk.productid1 = p1.id
inner join product p2 on pk.productid2 = p2.id
where (productid1 = :1 or productid2 = :2)
and rownum = 1;
这里没有加表前缀的字段都是pk的字段.
在2小时的时间里,语句总共执行了1049次,流逝时间是:440399s,cpu时间是:2098s,User I/O Wait Time:17813S,Concurrency Wait Time:14688s.
(其它平均每次的资源消耗上面已经列出了,这里就不再列举了)
执行计划是这样的:
Id Operation Name Rows Bytes Cost (%CPU) Time
0 SELECT STATEMENT 22 (100)
1 COUNT STOPKEY
2 CONCATENATION
3 FILTER
4 NESTED LOOPS 2 1582 12 (0) 00:00:01
5 NESTED LOOPS 2 816 8 (0) 00:00:01
6 TABLE ACCESS FULL PRODUCT 12 4344 2 (0) 00:00:01
7 TABLE ACCESS BY INDEX ROWID PK 1 46 1 (0) 00:00:01
8 INDEX UNIQUE SCAN PK_PK 1 1 (0) 00:00:01
9 TABLE ACCESS BY INDEX ROWID PRODUCT 1 383 11 (0) 00:00:01
10 INDEX UNIQUE SCAN PK_PRODUCT 1 1 (0) 00:00:01
11 FILTER
12 HASH JOIN 8 9760 10 (0) 00:00:01
13 TABLE ACCESS FULL PRODUCT 12 4344 2 (0) 00:00:01
14 NESTED LOOPS 10 4290 8 (0) 00:00:01
15 TABLE ACCESS FULL PRODUCT 39 14937 2 (0) 00:00:01
16 TABLE ACCESS BY INDEX ROWID PK 1 46 1 (0) 00:00:01
17 INDEX UNIQUE SCAN PK_PK 1 1 (0) 00:00:01
查看当前的执行计划和统计信息:
SQL> select SQL_ID,CHILD_NUMBER,EXECUTIONS,BUFFER_GETS,disk_reads,trunc(BUFFER_GETS/EXECUTIONS) buffer_gets_per,Trunc(disk_reads/EXECUTIONS) disk_reads_per
2 from v$sql where sql_id='9ft62wuac5ayw';
SQL_ID CHILD_NUMBER EXECUTIONS BUFFER_GETS DISK_READS BUFFER_GETS_PER DISK_READS_PER
------------- ------------ ---------- ----------- ---------- --------------- --------------
9ft62wuac5ayw 1 683413 2198486115 9817304 3216 14
SQL> select * from table(dbms_xplan.display_cursor('9ft62wuac5ayw',1));
Plan hash value: 3965636877
-----------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | | | 22 (100)| |
|* 1 | COUNT STOPKEY | | | | | |
| 2 | CONCATENATION | | | | | |
|* 3 | FILTER | | | | | |
| 4 | NESTED LOOPS | | 2 | 1582 | 12 (0)| 00:00:01 |
| 5 | NESTED LOOPS | | 2 | 816 | 8 (0)| 00:00:01 |
| 6 | TABLE ACCESS FULL | PRODUCT | 12 | 4344 | 2 (0)| 00:00:01 |
| 7 | TABLE ACCESS BY INDEX ROWID| PK | 1 | 46 | 1 (0)| 00:00:01 |
|* 8 | INDEX UNIQUE SCAN | PK_PK | 1 | | 1 (0)| 00:00:01 |
| 9 | TABLE ACCESS BY INDEX ROWID | PRODUCT | 1 | 383 | 11 (0)| 00:00:01 |
|* 10 | INDEX UNIQUE SCAN | PK_PRODUCT | 1 | | 1 (0)| 00:00:01 |
|* 11 | FILTER | | | | | |
|* 12 | HASH JOIN | | 8 | 9760 | 10 (0)| 00:00:01 |
| 13 | TABLE ACCESS FULL | PRODUCT | 12 | 4344 | 2 (0)| 00:00:01 |
| 14 | NESTED LOOPS | | 10 | 4290 | 8 (0)| 00:00:01 |
| 15 | TABLE ACCESS FULL | PRODUCT | 39 | 14937 | 2 (0)| 00:00:01 |
| 16 | TABLE ACCESS BY INDEX ROWID| PK | 1 | 46 | 1 (0)| 00:00:01 |
|* 17 | INDEX UNIQUE SCAN | PK_PK | 1 | | 1 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(ROWNUM=1)
3 - filter(ROWNUM=1)
8 - access("PK"."PRODUCTID1"=:1 AND "PK"."PRODUCTID2"="P2"."ID")
10 - access("PK"."PRODUCTID1"="P1"."ID")
11 - filter(ROWNUM=1)
12 - access("PK"."PRODUCTID2"="P2"."ID")
17 - access("PK"."PRODUCTID1"="P1"."ID" AND "PK"."PRODUCTID2"=:2)
filter(LNNVL("PK"."PRODUCTID1"=:1))
从执行计划来看,应该是第二个filter部分的hash join导致了大量的workarea的消耗,大量并发的时候,pga不足,需要频繁的大量并发的写而后读临时表空间,导致了temp表空间的不足。
select sql_id,child_number,operation_type,operation_id,policy,total_executions,
optimal_executions,ONEPASS_EXECUTIONS,multipasses_executions,
max_tempseg_size
from v$sql_workarea
where sql_id='9ft62wuac5ayw';