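There really has been one mess after another lately. One of the RAC nodes we had just restarted is now showing 99% memory utilization.
This machine is configured with 128 GB of memory: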
/oracle>machinfo
CPU info:
12 Intel(R) Itanium 2 9100 series processors (1.6 GHz, 24 MB)
533 MT/s bus, CPU version A1
24 logical processors (2 per socket)
Memory: 130875 MB (127.81 GB)
Firmware info:
Firmware revision: 9.48
FP SWA driver revision: 1.18
IPMI is supported on this system.
BMC firmware revision: 26.03
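Memory utilization and the process list sorted by memory, as seen in glance:
Oracle itself is not actually using anywhere near that much memory.
Roughly 1314 processes × 30 MB each ≈ 40 GB (the process count comes from the ps output further down).
So where did the rest of the memory go?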
ProcList CPU Rpt Mem Rpt Disk Rpt NextKeys SlctProc Help Exit
Glance C.04.70.001 12:35:26 actdb1 ia64 Current Avg High
------------------------------------------------------------------------------------------------------------------------------------
CPU Util S SR RU | 38% 44% 48%
Disk Util F F | 10% 12% 21%
Mem Util S SU U | 99% 99% 99%
Networkil U UR R | 95% 95% 95%
------------------------------------------------------------------------------------------------------------------------------------
PROCESS LIST Users= 3
User CPU % Thrd Disk Memory Block
Process Name PID Name (2400% max) Cnt IOrate RSS/VSS On
--------------------------------------------------------------------------------
midaemon 5130 root 11.8 14 0.0 519.0mb 523.2mb SLEEP
vxpal 3817 root 0.0 15 0.0 182.9mb 223.7mb SLEEP
oraclengact1 10805 oracle 3.5 1 0.0 125.0mb 138.5mb SOCKT
vxfsd 330 root 5.1 299 9.2 77.4mb 87.0mb OTHER
java 27769 oracle 0.0 22 0.0 70.5mb 312.1mb SLEEP
oraclengact1 22093 oracle 8.1 1 0.0 68.3mb 74.5mb SOCKT
cimprovagt 4277 root 0.0 34 0.0 57.7mb 131.2mb SLEEP
vxconfigd 742 root 0.0 1 0.0 53.2mb 99.8mb SLEEP
ocssd.bin 25219 oracle 0.6 20 5.8 50.7mb 50.7mb SLEEP
crsd.bin 24999 root 0.1 44 0.0 49.0mb 98.0mb SLEEP
vxpal 4470 root 0.0 47 0.0 43.6mb 149.0mb SLEEP
ora_arc1_nga 1827 root 0.0 1 0.0 40.7mb 43.5mb OTHER
ora_lms2_nga 27150 oracle 3.5 1 0.0 37.2mb 44.7mb SLEEP
ora_lms0_nga 27146 oracle 3.3 1 0.0 37.2mb 49.9mb SLEEP
ora_lms4_nga 27161 oracle 3.5 1 0.0 37.2mb 46.0mb SLEEP
ora_lms1_nga 27148 oracle 3.3 1 0.0 37.1mb 44.9mb SLEEP
ora_lms5_nga 27163 oracle 3.5 1 0.0 37.1mb 49.5mb SLEEP
ora_lmd0_nga 27144 oracle 0.1 1 0.0 36.7mb 403.3mb SLEEP
ora_arc1_nga 27445 oracle 0.0 1 2.6 35.2mb 54.5mb OTHER
oraclengact1 4556 oracle 0.0 1 0.0 34.1mb 34.1mb SOCKT
ora_arc0_nga 27443 oracle 0.0 1 0.0 34.0mb 50.8mb OTHER
ora_lms3_nga 27157 oracle 3.4 1 0.0 33.9mb 53.6mb SLEEP
ora_ckpt_nga 27190 oracle 0.3 1 6.2 32.7mb 45.8mb OTHER
ora_cjq0_nga 27203 oracle 0.0 1 0.0 31.7mb 468.6mb OTHER
oraclengact1 7650 oracle 0.0 1 0.0 31.3mb 38.6mb SOCKT
ora_lck0_nga 27293 oracle 0.0 1 0.0 29.9mb 35.3mb SLEEP
ora_dbw3_nga 27173 oracle 0.5 1 22.7 28.8mb 35.1mb OTHER
ora_dbw0_nga 27167 oracle 0.8 1 27.3 28.4mb 45.3mb OTHER
ora_dbw1_nga 27169 oracle 0.4 1 14.2 28.1mb 34.1mb OTHER
ora_dbw5_nga 27177 oracle 0.5 1 16.1 28.1mb 34.1mb OTHER
/oracle>ps -ef|grep oracle |wc -l
1314
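The back-of-the-envelope estimate (1314 processes at roughly 30 MB each) can be reproduced with a couple of small helpers. This is only a sketch: the function names `estimate_gb` and `used_gb` are made up for illustration, and the 30 MB average is the same eyeballed figure used in the text, not a measured value.

```shell
# estimate_gb COUNT AVG_MB
# Rough total in GB for COUNT processes at an assumed AVG_MB each.
estimate_gb() {
    awk -v n="$1" -v mb="$2" 'BEGIN { printf "%.1f\n", n * mb / 1024 }'
}

# used_gb USED_PCT TOTAL_MB
# Memory in use in GB, given a utilization percentage and total MB.
used_gb() {
    awk -v pct="$1" -v total="$2" 'BEGIN { printf "%.1f\n", pct / 100 * total / 1024 }'
}
```

With the numbers from this node, `estimate_gb 1314 30` gives about 38.5 GB, while `used_gb 99 130875` gives about 126.5 GB, so close to 90 GB is not explained by the Oracle processes' private memory.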
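In the end a colleague found the problem in the shared memory segments:
One segment had never been released and reclaimed: ID 131084, 51556487168 bytes (about 48 GB, the same size as the live SGA segment 10223632), with key 0x00000000 and a D in the MODE column, meaning it has been removed but is still attached. Both its creator process 29637 and its last accessing process 9644 are already gone.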
actdb1:/oracle>ipcs -ma
IPC status from /dev/kmem as of Fri Dec 9 12:34:31 2011
T ID KEY MODE OWNER GROUP CREATOR CGROUP NATTCH SEGSZ CPID LPID ATIME DTIME CTIME
Shared Memory:
m 0 0x4118016b --rw-rw-rw- root root root root 0 348 2959 21281 13:29:18 13:29:18 21:12:25
m 1 0x4e0c0002 --rw-rw-rw- root root root root 3 61760 2959 10645 11:34:56 11:43:20 21:12:25
m 2 0x411c98fb --rw-rw-rw- root root root root 2 8192 2959 2961 13:30:49 13:29:18 21:12:25
m 3 0x0000cace --rw-rw-rw- root root root root 0 2 4100 4100 11:14:20 11:14:20 21:13:13
m 4 0x00a5c581 --rw------- sfmdb users sfmdb users 8 10469376 4165 4168 21:13:14 21:13:14 21:13:14
m 5 0x4118061f --rw------- root root root root 1 4096 4318 5732 21:15:48 no-entry 21:15:48
m 6488070 0x4d4e5251 --rw-r--r-- root sys root sys 2 330752 28899 5805 2:51:13 no-entry 0:35:30
m 32775 0x55315352 --rw-rw-rw- root sys root sys 1 4096 28899 22247 11:41:47 11:42:55 0:35:30
m 32776 0x44525354 --rw-r--r-- root sys root sys 3 638976 28899 19619 11:40:28 11:40:28 0:35:30
m 32777 0x53494152 --rw-r--r-- root sys root sys 1 1024 28899 28899 0:35:30 no-entry 0:35:30
m 32778 0x00005643 --rw-rw-rw- root sys root sys 1 1024 28915 28915 0:35:31 no-entry 0:35:31
m 32779 0x00005654 --rw-rw-rw- root sys root sys 1 1024 28915 28915 0:35:31 no-entry 0:35:31
m 131084 0x00000000 D-rw-rw---- oracle dba oracle dba 1 51556487168 29637 9644 1:33:29 1:33:29 1:33:29
m 13 0x06347849 --rw-rw-rw- root root root root 2 65544 5093 5099 21:15:20 21:15:16 21:15:15
m 14 0x0c6629c9 --rw-r----- root root root root 2 17911576 5104 604 10:37:15 10:54:24 21:15:16
m 15 0x4910ab8c --rw-r--r-- root root root root 0 22908 5142 5099 12:34:00 12:34:00 21:15:17
m 10223632 0x5077995c --rw-rw---- oracle dba oracle dba 2580 51556487168 26217 17977 12:34:31 12:34:31 1:44:21
actdb1:/oracle>ps -ef | grep 29637
oracle 8172 15104 1 13:42:21 pts/tb 0:00 grep 29637
actdb1:/oracle>ps -ef | grep 9644
oracle 5268 15104 0 13:40:24 pts/tb 0:00 grep 9644
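Spotting the stale segment by eye in a long listing is easy to get wrong, so here is a minimal sketch that filters for it. It assumes the column layout of the `ipcs -ma` listing above (MODE in column 4, SEGSZ in column 10, CPID and LPID in columns 11 and 12); other platforms order the columns differently. The function name `deleted_segments` is hypothetical.

```shell
# List shared memory segments flagged as removed (MODE starts with "D").
# Reads `ipcs -ma` style output on stdin.
# Prints one line per segment: ID SIZE_GB CPID LPID
deleted_segments() {
    awk '$1 == "m" && $4 ~ /^D/ {
        printf "%s %.1f %s %s\n", $2, $10 / (1024 * 1024 * 1024), $11, $12
    }'
}
```

It would be used as `ipcs -ma | deleted_segments`; the CPID and LPID it prints are the ones to check with ps, as done above.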
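This problem appeared after we dealt with the RMAN issue; details are here: http://user.qzone.qq.com/8733223/blog/1323283285
Thinking back to the restart that night, there were in fact errors during the restart: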
Thu Dec 8 01:39:22 2011
Errors in file /oraclelog/ngact/bdump/ngact1_pmon_18234.trc:
ORA-00304: requested INSTANCE_NUMBER is busy
Thu Dec 8 01:39:22 2011
USER: terminating instance due to error 304
Instance terminated by USER, pid = 17422
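An error like this is easy to miss during a late-night restart. As a rough sketch, the alert log can be scanned for ORA- errors along with the timestamp line that precedes each one. The function name `ora_errors` is hypothetical, and the timestamp pattern assumes the "Thu Dec 8 01:39:22 2011" style shown in the excerpt above.

```shell
# Print each ORA- error from an alert log together with the most recent
# timestamp line seen before it. Reads the log on stdin.
ora_errors() {
    awk '
        # Remember lines that look like "Thu Dec 8 01:39:22 2011".
        /^(Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z][a-z] / { ts = $0; next }
        # Report any line carrying an ORA-nnnnn code.
        /ORA-[0-9]+/ { print ts " | " $0 }
    '
}
```

Feeding it the alert log on stdin (for example `ora_errors < alert.log`, where the file name is a placeholder) lists every ORA- error with its time.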
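Two restart attempts failed. The instance only came up after CRS itself was restarted.
At the time I could not work out why the restart failed. Putting the pieces together now, it makes sense: the archiver process that had the problem was probably never killed, so the shared memory segment could not be released and the new instance could not start. Checking for that archiver process, sure enough it is still there: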
actdb1:/oracle>ps -ef | grep 1827
oracle 1827 1 0 Dec 5 ? 0:00 ora_arc1_ngact1
oracle 6473 15104 0 13:41:11 pts/tb 0:00 grep 1827
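Process 1827 has effectively become a zombie: even kill -9 has no effect, so all we can do is wait for a machine reboot. (Strictly speaking, since ps still shows its normal name rather than <defunct>, it is more likely stuck in an uninterruptible kernel state than a true zombie, which would hold no memory.)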
actdb1:/oracle>kill -9 1827
actdb1:/oracle>ps -ef | grep ora_arc1
oracle 1827 1 0 Dec 5 ? 0:00 ora_arc1_ngact1
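Note that `ps -ef | grep PID` also matches the grep command itself and any line that merely contains the number, as the checks for 29637 and 9644 showed. A small sketch that matches only the PID column avoids that noise; the function name `ps_pid` is made up for illustration, and it assumes the PID is the second column of `ps -ef` output.

```shell
# Match a PID exactly against the PID column (column 2) of `ps -ef`
# output read on stdin, so the grep command itself never shows up.
ps_pid() {
    awk -v pid="$1" '$2 == pid'
}
```

For example, `ps -ef | ps_pid 1827` prints a line only if process 1827 still exists, and prints nothing for PIDs such as 29637 that are gone.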
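With so many problems lately, I don't dare clean up the shared memory segment directly with ipcrm, so the only option is to schedule a maintenance reboot to resolve this.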