solaris系统上11.2.0.4 db出现ora-04030排查引出的/tmp与swap管理方式差异

最近一次处理ORA-04030错误,引出了solaris操作系统上的一个交换空间与虚拟内存管理上的注意事项,记录一下。 

客户来电说EBS系统的一个节点上报错ORA-04030错误,部分业务操作界面也报错ORA-04030错误;等到远程上去查看时,发现节点1 alert仅有少量报错,看节点2,数据库实例已经异常关闭了;协调用户及时重启数据库实例来快速恢复业务后,开始了对ORA-04030错误发生原因及节点2异常宕机的分析~

 大致排查步骤:

----环境:Solaris[tm] OE (64-bit)+11.2.0.4两节点RAC,运行EBS业务 

1.集群状态、数据库实例的alert日志(异常宕机出现有ORA-07445: exception encountered: core dump [ksmdfr()+100] [SIGSEGV],在此错误之前是大量ORA-04030错误,判断是ORA-04030错误引起的ORA-07445,最终导致的宕机) 

2.数据库所在主机top/df -h等对cpu 内存、磁盘空间等主机资源查看;没有主机监控无法查看历史的 

3.shell 的ulimit 检查 4.数据库问题时段AWR分析 5.此时可检查的未发现异常,还没找到头绪;再次按正常的ORA-04030错误排查步骤来查~ 

6.查swap/内存配置等等,在查看操作系统日志时,发现有出现swap不足的信息,与数据库中alert日志中的报错对比,时间完全吻合 

7.排查swap不足原因,最终发现是solaris操作系统上交换空间与虚拟内存管理方式:“TMPFS 文件系统从系统的交换资源分配 /tmp 目录中的空间。此功能意味着,用尽 /tmp 目录中的空间时,也就用尽了交换空间。” 而在物理内存充足的情况下,为何会使用交换空间呢?从ORACLE官方文档:How does the Solaris Operating System Calculate Available Swap? (文档 ID 1010585.1)中可以看到solaris系统为process分配内存时的机制:When a process calls the malloc()/sbrk() commands, only virtual swap is allocated.

 问题分析总结: 

当solaris操作系统的process申请分配内存时是分配的virtual swap;===>>>而从交换空间与虚拟内存管理方式可知:当 /tmp 目录中的空间被使用完时,也就用尽了交换空间(文末有实验向/tmp复制数据查看swap可用空间来验证此管理机制)。===>>>而节点2由于/tmp只有4GB大小,在系统运行一段时间后就可能出现/tmp使用率高(或者有其它操作使用到/tmp时),进而引起交换空间不足,===>>>这一点在操作系统log中也得到了验证;===>>>最终由于交换空间不足,引起了oracle数据库的报错ORA-O4030并导致实例异常关闭;而节点1的/tmp配置是98G,因此极少出现此问题; 

经验教训: 

1.主机的log很重要,这次查log晚了,耽误不少时间。

 2.两个节点之一出问题时,一些基本配置要对比起来看,可能就是那个发现问题的点 

3.当问题在正常排查步骤没有线索时,按照正常排查步骤要反复查不放过一些细节,所谓事出反常必有妖~~ 

4.经验主义有时候也耽误事,比如这个节点2 /tmp一开始就发现只有4GB,但是按照以前aix/linux的经验,完全没往这个上面想。

5.不同操作系统查看一些资源的命令是不同的,要注意。在查看swap使用方面只是top/vmstat输出中看了下确实没看到问题,而后面使用swap -s时可以对比发现节点1、节点2的available差距非常大 

参考文档:

 solaris 交换空间与 TMPFS 文件系统介绍:https://docs.oracle.com/cd/E26926_01/html/E25884/fsswap-29713.html#fsswap-89187

How does the Solaris Operating System Calculate Available Swap? ORA-4030 (QERHJ hash-joi,kllcqas:kllsltba) (文档 ID 1071033.1) 

诊断并解决 ORA-4030 错误 (文档 ID 1548826.1) FAQ: ORA-4030 [Video] (文档ID 399497.1) 

增加一下:发现oracle官方blog上也有一个文章 :一个奇怪的ora-4030错误的诊断过程,里面也有如下: Oct 31 15:37:21 xxxx tmpfs: [ID 518458 kern.warning] WARNING: /tmp: File system full, swap space limit exceeded Oct 31 17:11:28 xxxx genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 23708 (oracle) 而官方blog诊断是由于swap不足,建议增加swap=====>系统日志显示是/tmp满了,swap报不足;而swap实际是100G,应该再多验证下/tmp的大小,才能确定到底是/tmp问题还是要扩swap. 

 

如下为整个排查过程中摘录的主要信息:

 1.异常宕机的节点2的alert日志--节取部分

Mon Nov 27 10:06:02 2017
Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_ora_1700.trc  (incident=706091):
ORA-04030: out of process memory when trying to allocate 48 bytes (kxs-heap-w,qesdcf_dfb_reset:qesdcfm)
Mon Nov 27 10:06:02 2017
Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_ora_1618.trc  (incident=705858):
ORA-04030: out of process memory when trying to allocate 8240 bytes (qesca subheap,qerjo hash table)
Mon Nov 27 10:06:02 2017
Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_ora_1259.trc  (incident=706290):
ORA-04030: 鍦ㄥ皾璇曞垎閰?211584 瀛楄妭 (QERGH hash-agg,kllcqc:kllcqslt) 鏃惰繘绋嬪唴瀛樹笉瓒?Mon Nov 27 10:06:02 2017
Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_ora_1809.trc  (incident=705957):
ORA-04030: 鍦ㄥ皾璇曞垎閰?4128 瀛楄妭 (kxs-heap-c,temporary memory) 鏃惰繘绋嬪唴瀛樹笉瓒?Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_ora_24948.trc  (incident=708189):
ORA-04030: 鍦ㄥ皾璇曞垎閰?123416 瀛楄妭 (QERGH hash-agg,kllcqas:kllsltba) 鏃惰繘绋嬪唴瀛樹笉瓒?Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_ora_18427.trc  (incident=707469):
ORA-04030: 鍦ㄥ皾璇曞垎閰?65560 瀛楄妭 (QERHJ hash-joi,QERHJ Bit vector) 鏃惰繘绋嬪唴瀛樹笉瓒?Use ADRCI or Support Workbench to package the incident.

Mon Nov 27 11:09:15 2017
Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_ora_8978.trc:
ORA-04030: 鍦ㄥ皾璇曞垎閰?385048 瀛楄妭 (KFK_IO_SUBHEAP,kfk_kfkio_freeq) 鏃惰繘绋嬪唴瀛樹笉瓒?Mon Nov 27 11:09:20 2017
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0xFFFFFFFF7FFDFF48] [PC:0xFFFFFFFF63F10100, skgxpivsnd()+1952] [flags: 0x0, count: 1]
Mon Nov 27 11:09:26 2017
Process startup failed, error stack:
Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_psp0_10601.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspwn10
ORA-27303: additional information: childpid 9596, read pid 0
Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_ora_9556.trc  (incident=705590):
ORA-07445: 鍑虹幇寮傚父閿欒: 鏍稿績杞偍 [skgxpivsnd()+1952] [SIGSEGV] [ADDR:0xFFFFFFFF7FFDFF48] [PC:0xFFFFFFFF63F10100] [Address not mapped to object] []
Incident details in: /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/incident/incdir_705590/AAAA2_ora_9556_i705590.trc
Use ADRCI or Support Workbench to package the incident.
See Note 411.1 at My Oracle Support for error and packaging details.

Mon Nov 27 11:13:14 2017
Exception [type: SIGSEGV, Address not mapped to object] [ADDR:0x302E302E302E2001] [PC:0x100D405E4, ksmdfr()+100] [flags: 0x0, count: 1]
Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_pmon_10600.trc  (incident=705218):
ORA-07445: exception encountered: core dump [ksmdfr()+100] [SIGSEGV] [ADDR:0x302E302E302E2001] [PC:0x100D405E4] [Address not mapped to object] []
Incident details in: /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/incident/incdir_705218/AAAA2_pmon_10600_i705218.trc


Mon Nov 27 11:13:25 2017
Process startup failed, error stack:
Errors in file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_psp0_10601.trc:
ORA-27300: OS system dependent operation:fork failed with status: 11
ORA-27301: OS failure message: Resource temporarily unavailable
ORA-27302: failure occurred at: skgpspwn10
ORA-27303: additional information: childpid 9897, read pid 0
Mon Nov 27 11:13:26 2017
Process m002 died, see its trace file
Mon Nov 27 11:13:45 2017
CKPT (ospid: 10650): terminating the instance due to error 472
Mon Nov 27 11:13:45 2017
System state dump requested by (instance=2, osid=10650 (CKPT)), summary=[abnormal instance termination].
Mon Nov 27 11:13:45 2017
opiodr aborting process unknown ospid (11450) as a result of ORA-1092
Mon Nov 27 11:13:45 2017
ORA-1092 : opitsk aborting process
System State dumped to trace file /u01/oracle/app/db/diag/rdbms/AAAA/AAAA2/trace/AAAA2_diag_10618_20171127111345.trc
Instance terminated by CKPT, pid = 10650

2.检查操作系统的资源信息,未发现明显异常, 3.检查GRID用户 -->监听程序,ORACLE用户的SHELL ulimit限制,均无异常

/u01/oracle/11.2.0/db$ ulimit -a
core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
file size               (blocks, -f) unlimited
open files                      (-n) 65536
pipe size            (512 bytes, -p) 10
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 29995
virtual memory          (kbytes, -v) unlimited

4.数据库问题时段AWR中也未发现PGA大量增长情况:

SQL> show parameter pga
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
pga_aggregate_target                 big integer 10G
AWR中PGA部分:
        PGA Aggr Target(M)      Auto PGA Target(M)      PGA Mem Alloc(M)        W/A PGA Used(M) %PGA W/A Mem    %Auto W/A Mem   %Man W/A Mem    Global Mem Bound(K)
B       10,240  5,247   5,658.13        296.35  5.24    100.00  0.00    1,048,570
E       10,240  4,857   5,960.85        69.05   1.16    100.00  0.00    1,048,570

5.异常宕机的节点2的主机log发现有大量swap不足信息,与第1部分数据库alert报错对应:

Nov 27 11:09:42 BBBdb02 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 8978 (oracle)
Nov 27 11:09:42 BBBdb02 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 9556 (oracle)
Nov 27 11:09:44 BBBdb02 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 8978 (oracle)
Nov 27 11:09:49 BBBdb02 last message repeated 2 times
Nov 27 11:09:52 BBBdb02 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 9556 (oracle)  =====>>>>
Nov 27 11:12:06 BBBdb02 genunix: [ID 163280 kern.notice] oracle: Cannot map /lib/sparcv9/ld.so.1
Nov 27 11:12:11 BBBdb02 genunix: [ID 470503 kern.warning] WARNING: Sorry, no swap space to grow stack for pid 9803 (oracle)
Nov 27 11:12:37 BBBdb02 last message repeated 9 times
Nov 27 11:13:43 BBBdb02 tmpfs: [ID 518458 kern.warning] WARNING: /system/volatile: File system full, swap space limit exceeded =====>>>>

6.查看主机的SWAP及/TMP相关配置

/var/adm# df -h /tmp
Filesystem             Size   Used  Available Capacity  Mounted on
swap                   4.4G    29M       4.4G     1%    /tmp
/var/adm# swap -l  ===>>每个blocks=512byte;134217712*512=64GB,不包含物理内存
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 293,1        16 134217712 133853344
/var/adm# swap -s 
total: 80785896k bytes allocated + 101555856k reserved = 182341752k used, 4538080k available
---------------------
注 - swap -l 命令以 512 字节块为单位显示交换空间。swap -s 命令以 1024 字节块为单位显示交换空间。
如果将 swap -l 中的块累加,然后将其转换为 KB,则结果将小于 used + available(在 swap -s 输出中)。
原因是 swap -l 在计算交换空间时不包括物理内存。
---------------------
/var/adm# top

last pid:  6281;  load avg:  6.08,  5.45,  5.79;  up 72+15:49:00                                                                                                                      16:17:42
560 processes: 556 sleeping, 4 on cpu
CPU states: 92.9% idle,  6.5% user,  0.5% kernel,  0.0% iowait,  0.0% swap
Kernel: 17380 ctxsw, 1229 trap, 16752 intr, 17969 syscall, 76 flt
Memory: 128G phys mem, 43G free mem, 64G total swap, 64G free swap  ====>>要注意这里的显示不能当真了、、
 This terminal can only display 26 processes
   PID USERNAME NLWP PRI NICE  SIZE   RES STATE    TIME    CPU COMMAND
 22081 grid        2   0    0   60G   60G cpu/18  36:29  1.55% oracle
  5717 grid        1   0    0   61G   61G cpu/16   4:50  1.47% oracle

7.关于solaris交换空间与 TMPFS 文件系统管理机制的验证:

在 Oracle Solaris 环境中,TMPFS 文件系统由 /etc/vfstab 文件中的项自动激活。 TMPFS 文件系统将文件及其关联信息存储在内存中(存储在 /tmp 目录中)而不磁盘上,从而加快了访问这些文件的速度。 对于大量使用 /tmp 的应用程序(例如编译器和 DBMS 产品),此功能可显著增强这些应用程序的性能。 TMPFS 文件系统从系统的交换资源分配 /tmp 目录中的空间。此功能意味着,用尽 /tmp 目录中的空间时,也就用尽了交换空间。 因此,如果应用程序大量使用 /tmp 目录,而且您未监视交换空间使用情况,则系统可能会用尽交换空间。 如下实验验证,向/tmp目录复制数据,swap的可用空间减少:

$ df -k /tmp
Filesystem           1024-blocks        Used   Available Capacity  Mounted on
swap                   105884264       25192   105859072     1%    /tmp
$ du -sk awr.tar
160304  awr.tar
<$ swap -s
total: 76763328k bytes allocated + 1097632k reserved = 77860960k used, 105657000k available
$ swap -l
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 293,1        16 134217712 134217712
$ cp awr.tar /tmp/   ====>>>>进行复制
$ df -k /tmp   ===>>空间used列大约增加cp的文件大小
Filesystem           1024-blocks        Used   Available Capacity  Mounted on
swap                   105664224      185304   105478920     1%    /tmp
$ swap -l
swapfile             dev    swaplo   blocks     free
/dev/zvol/dsk/rpool/swap 293,1        16 134217712 134217712
$ swap -s   ====>>>swap的available部分减少大约 awr.tar文件的大小
total: 76982976k bytes allocated + 1073400k reserved = 78056376k used, 105461584k available</span>
  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 1
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值