由于归档路径设置不当,系统无法响应的问题
Author: Kevin Zou
Date: 2009-2-17
环境: DB 10.2.0.
OS: AIX 5.3
系统采用DBCA的方式来安装,安装完毕后没有更改配置。
一次在做大量数据刷新操作时,发现新提交到系统的SQL无法响应,整个系统都HANG住。
查看系统资源,iostat ,topas发现系统只有DISK的资源比较繁忙,其余都很空闲:
AIX h3cap61 3 5 0059362D4C00 02/18/09
System configuration: lcpu=4 drives=7 mode=Capped
16:19:19 device %busy avque r+w/s Kbs/s avwait avserv
16:19:21 hdisk0 0 0.0 0 3 0.0 0.2
hdisk5 14 0.0 337 5392 0.0 0.0
hdisk4 0 0.0 0 0 0.0 0.0
cd0 0 0.0 0 0 0.0 0.0
hdisk2 0 0.0 0 0 0.0 0.0
hdisk3 0 0.0 0 0 0.0 0.0
hdisk1 0 0.0 0 1 0.0 10.5
16:19:23 hdisk0 0 0.0 0 3 0.0 0.2
hdisk5 15 0.0 358 5740 0.0 0.0
hdisk4 0 0.0 0 0 0.0 0.0
cd0 0 0.0 0 0 0.0 0.0
hdisk2 0 0.0 0 0 0.0 0.0
hdisk3 0 0.0 0 0 0.0 0.0
hdisk1 0 0.0 0 0 0.0 0.0
Average hdisk0 0 0.0 0 3 0.0 0.2
hdisk5 14 0.0 347 5566 0.0 0.0
hdisk4 0 0.0 0 0 0.0 0.0
cd0 0 0.0 0 0 0.0 0.0
hdisk2 0 0.0 0 0 0.0 0.0
hdisk3 0 0.0 0 0 0.0 0.0
hdisk1 0 0.0 0 0 0.0 5.3
查看等待事件:
SID EVENT
---------- ----------------------------------------------------------------
128 library cache lock
129 library cache lock
130 switch logfile command
131 library cache lock
132 buffer busy waits
134 log file switch (checkpoint incomplete)
135 log file switch (checkpoint incomplete)
140 Streams AQ: qmn slave idle wait
142 log file switch (checkpoint incomplete)
146 log file switch (checkpoint incomplete)
149 Streams AQ: qmn coordinator idle wait
SID EVENT
---------- ----------------------------------------------------------------
150 log file switch (checkpoint incomplete)
154 class slave wait
156 ASM background timer
159 log file switch (checkpoint incomplete)
161 log file switch (checkpoint incomplete)
164 buffer busy waits
170 pmon timer
系统有大量的log file switch 的等待事件。
难道是归档目录满的原因导致系统无法进行新的事物:
SQL> archive log list
Database log mode Archive Mode
Automatic archival Enabled
Archive destination USE_DB_RECOVERY_FILE_DEST
Oldest online log sequence 113
Next log sequence to archive 113
Current log sequence 115
SQL> show parameter USE_DB_RECOVERY_FILE_DEST
这里Archive destination 启用了新的参数USE_DB_RECOVERY_FILE_DEST 。在9i没有使用过,当时没有仔细去思考,而是把目光注意到library cache lock 的等待事件。
查询之后,发现library cache lock这些等待事件对应的session也没有运行任何的sql。
通过查看最耗cpu资源的进程,发现都是归档进程。
# ps -ef|grep 438720
root 315398 496036 0 16:31:33 pts/4 0:00 grep 438720
oracle 438720 1 2 Feb 13 - 0:54 ora_arc1_kong
# ps -ef|grep 229542
oracle 229542 1 1 Feb 13 - 0:58 ora_arc0_kong
root 401408 496036 0 16:31:43 pts/4 0:00 grep 229542
最后还是把目光转移到了归档进程上。想到刚开始出现的“陌生”参数:USE_DB_RECOVERY_FILE_DEST
找到了说明,原来是需要看db_recovery_file_dest 参数,而不是USE_DB_RECOVERY_FILE_DEST。NND!!
查看其配置:
SQL> show parameter db_recovery_file_dest
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
db_recovery_file_dest string /oracle/product/flash_recovery
_area
db_recovery_file_dest_size big integer 2G
在OS上查看其空间余额:
/oracle/product/flash_recovery_area/KONG
$ cd ..
$ du -sg .
2.00 .
设置为其他目录,再次执行日志切换:
1* alter system switch logfile
SQL> /
System altered.
一切都正常了。
查看等待事件
SID EVENT
---------- ----------------------------------------------------------------
132 jobq slave wait
140 Streams AQ: qmn slave idle wait
142 Streams AQ: waiting for time management or cleanup tasks
149 Streams AQ: qmn coordinator idle wait
154 class slave wait
156 ASM background timer
164 smon timer
170 pmon timer
都是非重点关注的对象了。
在alert文件也有归档空间满的报错信息:
ARC0: Failed to archive thread 1 sequence 113 (19809)
Wed Feb 18 16:38:00 2009
Errors in file /oracle/product/admin/kong/bdump/kong_arc1_438720.trc:
ORA-19815: WARNING: db_recovery_file_dest_size of 2147483648 bytes is 100.00% used, and has 0 remaining bytes available.
Wed Feb 18 16:38:00 2009
************************************************************************
You have following choices to free up space from flash recovery area:
1. Consider changing RMAN RETENTION POLICY. If you are using Data Guard,
then consider changing RMAN ARCHIVELOG DELETION POLICY.
2. Back up files to tertiary device such as tape using RMAN
BACKUP RECOVERY AREA command.
3. Add disk space and increase db_recovery_file_dest_size parameter to
reflect the new space.
4. Delete unnecessary files using RMAN DELETE command. If an operating
system command was used to delete files, then use RMAN CROSSCHECK and
DELETE EXPIRED commands.
再次证明alert日志很重要,有事没事拜读ALERT!!
不熟悉10g的特性,虚惊一场!
谢谢ITPUB上棉花的帮助
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/40239/viewspace-553604/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/40239/viewspace-553604/