案例描述:
用户报告,服务器启动一段时间以后,无法建立数据库连接
重新启动几分钟以后,再次无法连接
系统无法正常使用.
1.登陆系统
SunOS 5.8
login: root
Password:
Last login: Tue Mar 23 13:56:59 from 172.16.31.41
Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001
You have new mail.
2.su 为Oracle用户
检查启动的Oracle进程
发现后台进程正常,有一定量的用户连接
wapplatform:/>su - oracle Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001 You have new mail. /export/home1/oracle>ls admin codesyndealt31 exp.sh local.cshrc local.profile oraclebak oui v6_database app exp.log jre local.login nsmail oradata swan export/home1/oracle>cd admin /export/home1/oracle/admin>ps -ef|grep ora oracle 25269 25258 0 13:58:36 pts/3 0:00 grep ora oracle 25257 24906 0 13:58:31 pts/4 0:00 vi alert_HSWAPDB.log oracle 25267 1 1 13:58:34 ? 0:00 oracleHSWAPDB (LOCAL=NO) oracle 25184 1 0 13:56:57 ? 0:00 ora_p007_HSWAPDB oracle 25182 1 0 13:56:57 ? 0:00 ora_p006_HSWAPDB oracle 25193 1 0 13:57:03 ? 0:01 oracleHSWAPDB (LOCAL=NO) oracle 25209 1 0 13:57:09 ? 0:00 oracleHSWAPDB (LOCAL=NO) oracle 25176 1 0 13:56:57 ? 0:00 ora_p003_HSWAPDB oracle 25180 1 0 13:56:57 ? 0:00 ora_p005_HSWAPDB oracle 25172 1 0 13:56:56 ? 0:00 ora_p001_HSWAPDB oracle 25178 1 0 13:56:57 ? 0:00 ora_p004_HSWAPDB oracle 25170 1 0 13:56:56 ? 0:00 ora_p000_HSWAPDB oracle 24254 24240 0 12:08:25 pts/2 0:00 -ksh oracle 25174 1 0 13:56:56 ? 0:00 ora_p002_HSWAPDB oracle 25244 1 1 13:58:23 ? 0:00 oracleHSWAPDB (LOCAL=NO) oracle 25218 1 0 13:57:23 ? 0:00 oracleHSWAPDB (LOCAL=NO) oracle 25159 1 0 13:56:42 ? 0:02 ora_qmn0_HSWAPDB oracle 25230 1 0 13:57:40 ? 0:01 oracleHSWAPDB (LOCAL=NO) oracle 25161 1 0 13:56:42 ? 0:00 ora_s000_HSWAPDB oracle 25149 1 0 13:56:41 ? 0:01 ora_lgwr_HSWAPDB oracle 25157 1 0 13:56:42 ? 0:00 ora_cjq0_HSWAPDB oracle 24906 3698 0 13:47:47 pts/4 0:00 -ksh oracle 25153 1 0 13:56:42 ? 0:01 ora_smon_HSWAPDB oracle 25058 7464 0 13:55:14 pts/1 0:00 -ksh oracle 25163 1 0 13:56:42 ? 0:00 ora_d000_HSWAPDB oracle 25155 1 0 13:56:42 ? 0:00 ora_reco_HSWAPDB oracle 25151 1 0 13:56:41 ? 0:00 ora_ckpt_HSWAPDB oracle 25145 1 0 13:56:41 ? 0:00 ora_dbw0_HSWAPDB oracle 25199 1 15 13:57:04 ? 0:49 ora_j000_HSWAPDB oracle 4149 4146 0 12:05:11 pts/5 0:00 -ksh oracle 25232 1 0 13:57:41 ? 0:00 oracleHSWAPDB (LOCAL=NO) oracle 25119 1 0 13:56:29 ? 0:00 oraclehswapdb (LOCAL=NO) oracle 25075 1 0 13:55:34 ? 0:00 /export/home1/oracle/app/bin/tnslsnr LISTENER -inherit oracle 24374 4149 0 12:21:56 pts/5 0:00 sqlplus /nolog oracle 25143 1 0 13:56:41 ? 0:00 ora_pmon_HSWAPDB oracle 25258 25242 0 13:58:31 pts/3 0:00 -ksh /export/home1/oracle/admin>ps -ef|grep ora_ oracle 25275 25258 0 13:58:42 pts/3 0:00 grep ora_ oracle 25184 1 0 13:56:57 ? 0:00 ora_p007_HSWAPDB oracle 25182 1 0 13:56:57 ? 0:00 ora_p006_HSWAPDB oracle 25176 1 0 13:56:57 ? 0:00 ora_p003_HSWAPDB oracle 25180 1 0 13:56:57 ? 0:00 ora_p005_HSWAPDB oracle 25172 1 0 13:56:56 ? 0:00 ora_p001_HSWAPDB oracle 25178 1 0 13:56:57 ? 0:00 ora_p004_HSWAPDB oracle 25170 1 0 13:56:56 ? 0:00 ora_p000_HSWAPDB oracle 25174 1 0 13:56:56 ? 0:00 ora_p002_HSWAPDB oracle 25159 1 0 13:56:42 ? 0:02 ora_qmn0_HSWAPDB oracle 25161 1 0 13:56:42 ? 0:00 ora_s000_HSWAPDB oracle 25149 1 0 13:56:41 ? 0:01 ora_lgwr_HSWAPDB oracle 25157 1 0 13:56:42 ? 0:00 ora_cjq0_HSWAPDB oracle 25153 1 0 13:56:42 ? 0:01 ora_smon_HSWAPDB oracle 25163 1 0 13:56:42 ? 0:00 ora_d000_HSWAPDB oracle 25155 1 0 13:56:42 ? 0:00 ora_reco_HSWAPDB oracle 25151 1 0 13:56:41 ? 0:00 ora_ckpt_HSWAPDB oracle 25145 1 0 13:56:41 ? 0:00 ora_dbw0_HSWAPDB oracle 25199 1 13 13:57:04 ? 0:51 ora_j000_HSWAPDB oracle 25143 1 0 13:56:41 ? 0:00 ora_pmon_HSWAPDB |
3.检查Alert.log警报日志文件
/export/home1/oracle/admin>ls hswapdb /export/home1/oracle/admin>cd * /export/home1/oracle/admin/hswapdb>ls bdump cdump create pfile udump /export/home1/oracle/admin/hswapdb>cd bdump /export/home1/oracle/admin/hswapdb/bdump> /export/home1/oracle/admin/hswapdb/bdump>ls -l *.log -rw-r--r-- 1 oracle dba 813396 Mar 23 13:57 alert_HSWAPDB.log /export/home1/oracle/oradata/hswapdb/control02.ctl, ................. Tue Mar 23 13:40:45 2004 /export/home1/oracle/oradata/hswapdb/control02.ctl, |
发现数据库多次重起,并记录了部分错误信息
该提示说明数据库无法spawn a new session.
quote Yong Huang's comment:
The number in "skgpspawn failed:category = 27142" is probably ORA error:
$ oerr ora 27142
27142, 0000, "could not create new process"
// *Cause: OS system call
// *Action: check errno and if possible increase the number of processes
OSD (OS-dependent) errors are almost always shown as an skg... error (probably means "system, kernel generic").
I don't know what "depinfo = 12" means.
4.尝试连接数据库
收到错误信息,无法连接数据库
SQL*Plus: Release 9.2.0.3.0 - Production on 星期二 3月 23 14:14:06 2004 Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. ERROR: 请输入用户名: 请输入用户名: SP2-0157: 在3次尝试之后无法 CONNECT 到 ORACLE, 退出 SQL*Plus |
内部限制超过,通常说明某些系统资源不足.
5.检查监听器
发现部分连接被拒绝
/export/home1/oracle>lsnrctl services LSNRCTL for Solaris: Version 9.2.0.3.0 - Production on 23-3月 -2004 14:37:23 Copyright (c) 1991, 2002, Oracle Corporation. All rights reserved. 正在连接到 (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=EXTPROC))) |
在listener.log中找到了相关错误信息
23-3\324\302 -2004 12:19:40 * (CONNECT_DATA=(SID=hswapdb)(CID=(PROGRAM=C:\WINNT\Microsoft.NET\Framework\v1.1.4322\aspnet_wp.e |
quote Yong Huang's comment:
$ grep -w 12 /usr/include/sys/errno.h
#define ENOMEM 12 /* Not enough core */
Here "core" means memory, including real RAM memory and swap space.
6.退出Oracle用户检查
检查系统日志信息,发现大量失败的su操作
有swap区不足的报告
/export/home1/oracle/admin/hswapdb/bdump>exit
wapplatform:/>dmesg 2004年03月23日 星期二 14时00分32秒 CST for retry using short name Mar 23 10:18:15 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full (sqlplus) |
现在基本可以判断是交换区的问题,当然和Oracle SGA设置有关.
7.检查系统内存及交换区使用
/export/home1/oracle/admin/hswapdb/bdump>exit wapplatform:/>dmesg 2004年03月23日 星期二 14时00分32秒 CST for retry using short name Mar 23 10:18:15 wapplatform ufs: [ID 845546 kern.notice] NOTICE: alloc: /export/home1: file system full (sqlplus) |
现在基本可以判断是交换区的问题,当然和Oracle SGA设置有关.
7.检查系统内存及交换区使用
$ top last pid: 25456; load averages: 0.67, 0.70, 0.69 14:10:03 93 processes: 91 sleeping, 2 on cpu CPU states: 72.7% idle, 14.9% user, 2.7% kernel, 9.7% iowait, 0.0% swap Memory: 1024M real, 34M free, 752M swap in use, 10M swap free PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND 25199 oracle 1 40 0 674M 631M cpu/2 8:03 16.32% oracle 25209 oracle 1 30 0 675M 630M sleep 0:03 0.13% oracle 25159 oracle 1 48 0 674M 628M sleep 0:03 0.06% oracle 25384 oracle 1 58 0 2632K 1736K cpu/0 0:01 0.05% top 25145 oracle 143 58 0 682M 630M sleep 0:01 0.03% oracle 25446 oracle 1 58 0 674M 628M sleep 0:00 0.03% oracle 25149 oracle 15 58 0 682M 626M sleep 0:00 0.02% oracle 25075 oracle 1 48 0 17M 7208K sleep 0:00 0.01% tnslsnr 25151 oracle 11 58 0 676M 624M sleep 0:00 0.01% oracle 25366 oracle 1 10 0 674M 628M sleep 0:00 0.00% oracle 25356 oracle 1 18 0 674M 628M sleep 0:00 0.00% oracle 25360 oracle 1 20 0 674M 628M sleep 0:00 0.00% oracle 25364 oracle 1 20 0 674M 628M sleep 0:00 0.00% oracle 25362 oracle 1 20 0 674M 628M sleep 0:00 0.00% oracle 25330 oracle 1 28 0 674M 628M sleep 0:00 0.00% oracle |
发现物理内存仅为1G,free部分为34M,交换区使用了752M,仅10M free
系统内存严重不足,Swap区不足
8. 检查数据库的SGA设置
发现SGA设置为: 622299344 bytes
接近600M
wapplatform:/>su - oracle Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001 You have new mail. /export/home1/oracle>sqlplus "/ as sysdba" SQL*Plus: Release 9.2.0.3.0 - Production on 星期二 3月 23 14:02:30 2004 Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. 连接到: SQL> show sga Total System Global Area 622299344 bytes |
对于RAM小于1G的系统,Dedicated模式下,Oracle的SGA一般不应超过1/2物理内存.
9.第一步调整
减小SGA,为系统保留足够的内存.
10.增加swap区
wapplatform:/>df -k 文件系统 千字节 用了 可用 容量 挂接在 /dev/dsk/c0t1d0s0 3099093 105421 2931691 4% / /dev/dsk/c0t2d0s0 10325760 8359637 1862866 82% /usr /proc 0 0 0 0% /proc fd 0 0 0 0% /dev/fd mnttab 0 0 0 0% /etc/mnttab /dev/dsk/c0t1d0s3 1018382 285914 671366 30% /var swap 3904 24 3880 1% /var/run swap 3936 56 3880 2% /tmp /dev/dsk/c0t1d0s5 1671823 459202 1162467 29% /opt /dev/dsk/c0t2d0s7 7087473 6068462 948137 87% /export/home /dev/dsk/c2t1d0s7 17413250 15900222 1338896 93% /export/home2 /dev/dsk/c0t3d0s7 17413250 13749782 3489336 80% /export/home1 /dev/dsk/c0t1d0s1 771110 382410 334723 54% /usr/openwin /export/home/wapgw/luke 7087473 6068462 948137 87% /home/wap wapplatform:/var/swap>cd /export/home1 |
11.连接测试
系统恢复正常,问题解决
wapplatform:/export/home1/swap>su - oracle Sun Microsystems Inc. SunOS 5.8 Generic Patch October 2001 You have new mail. /export/home1/oracle>sqlplus "/ as sysdba" SQL*Plus: Release 9.2.0.3.0 - Production on 星期四 3月 25 11:56:28 2004 Copyright (c) 1982, 2002, Oracle Corporation. All rights reserved. 连接到: SQL> exit last pid: 5372; load averages: 0.25, 0.22, 0.29 11:57:58 PID USERNAME THR PRI NICE SIZE RES STATE TIME CPU COMMAND |
问题总结:
Oracle数据库问题的解决从来就离不开操作系统
很多时候我们必须通过操作系统一级的手段来诊断并解决问题.
关于操作系统
一般Swap区的推荐值为2XRAM
如果Ram很大,不一定非要把Swap设置为2xSwap
但是通常至少设置Swap = Ram
如果Swap区过小,在系统繁忙期间
产生大量交换无法换到磁盘,就会出现问题.
如本案例就是这样。
另外,如果系统Ram较小
通常设置SGA < 1/2 Ram
要为Server process及OS保留足够的内存空间
原文地址:http://www.eygle.com/archives/2004/10/oracle_sga_swap.html