Oracle RAC Node Failure: File table overflow

Latest recommended article published on 2023-06-19 14:01:17

zhoul77777

Category: ORACLE Administration. Tags: Oracle OS HP SUN Socket

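At around 9:00 a.m. on April 26, 2010, a customer's database suffered a single-node failure. On the back end, the database instance on node 1 (hisdb01) terminated abnormally, which in turn caused the node 1 host to reboot. On the front end, some business services became unavailable. Because no host performance tracing script had been deployed, the initial inference, based only on the logs collected on site, was that the host had run short of a resource (for example, file handles not being released), which caused the Oracle instance to terminate abnormally. At around 9:00 a.m. on June 7, 2010, a single-node failure occurred again.
Back-end log analysis
Review the logs from before and after the failure.
1. Operating system log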
[quote]Jun 7 09:08:18 hisdb01 cmcld[7603]: Unable to accept a connection: File table overflow
Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to allocate a socket: File table overflow
Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to open /etc/cmcluster/cmclconfig, File table overflow
Jun 7 09:08:18 hisdb01 cmcld[7603]: Unable to accept a connection: File table overflow
Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to resolve local hostname hisdb01 to determine the domain name
Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to allocate a socket: File table overflow
Jun 7 09:08:19 hisdb01 cmcld[7603]: Sending file $SGRUN/frdump.cmcld.8 (167257 bytes) to file assistant daemon.
Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to open /etc/cmcluster/cmclconfig, File table overflow
Jun 7 09:08:19 hisdb01 above message repeats 3 times
Jun 7 09:08:19 hisdb01 cmfileassistd[2894]: Updated file /var/adm/cmcluster/frdump.cmcld.8 (length = 167257).
Jun 7 09:09:00 hisdb01 inetd[1018]: hacl-cfg/tcp: accept: File table overflow
Jun 7 09:09:19 hisdb01 cmcld[7603]: Service cmfileassistd terminated due to an exit(0).
Jun 7 09:12:16 hisdb01 syslog: Unable to open the /etc/utmpx file, to sync the records from file->/usr/sbin/utmpd
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 above message repeats 13576 times
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 syslogd: utmp database: Bad file number
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 above message repeats 10 times
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 above message repeats 17 times
Jun 7 09:12:17 hisdb01 vmunix: file: table is full
Jun 7 09:12:17 hisdb01 vmunix: file: table is full[/quote]
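
To see which processes hit the full file table, and how often, the syslog excerpt above can be tallied with a small script. The following is only a sketch: the regular expressions assume the syslog line layout shown in the excerpt, and the names `count_overflow` and `SAMPLE` are made up for illustration. "above message repeats N times" lines are credited to the process of the preceding matching line.

```python
import re
from collections import Counter

# Assumed layout: "Mon D HH:MM:SS host proc[pid]: message" (pid is optional).
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+\S+\s+(?P<proc>[\w.-]+)(?:\[\d+\])?:\s+(?P<msg>.*)$"
)
REPEAT_RE = re.compile(r"above message repeats (\d+) times")
# The two message texts that indicate a full system file table in the excerpt.
PATTERNS = ("File table overflow", "file: table is full")


def count_overflow(lines):
    """Count file-table-full messages per process name."""
    counts = Counter()
    last_proc = None  # process of the previous matching line, if any
    for raw in lines:
        line = raw.strip()
        repeat = REPEAT_RE.search(line)
        if repeat:
            # Credit the repeats to the previous line when it was a match.
            if last_proc is not None:
                counts[last_proc] += int(repeat.group(1))
            continue
        m = LINE_RE.match(line)
        if m and any(p in m.group("msg") for p in PATTERNS):
            last_proc = m.group("proc")
            counts[last_proc] += 1
        else:
            last_proc = None
    return counts


# A few lines taken from the excerpt above.
SAMPLE = [
    "Jun 7 09:08:18 hisdb01 cmcld[7603]: Unable to accept a connection: File table overflow",
    "Jun 7 09:08:18 hisdb01 cmclconfd[2878]: Unable to allocate a socket: File table overflow",
    "Jun 7 09:12:17 hisdb01 vmunix: file: table is full",
    "Jun 7 09:12:17 hisdb01 above message repeats 10 times",
]

if __name__ == "__main__":
    for proc, n in count_overflow(SAMPLE).most_common():
        print(proc, n)
```

Run against the full syslog file, a tally like this makes it easier to see whether the kernel (`vmunix`) or a particular daemon reported the condition first.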

2. CRS log:
[quote]2010-06-06 21:15:09.225: [ CRSEVT][167223] CAAMonitorHandler :: 0:Could not execute /oracle/app/product/db10g/bin/racgwrap(check) for ora.orcl.orcl1.inst
category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory

2010-06-06 21:15:09.225: [ CRSAPP][167223] CheckResource error for ora.orcl.orcl1.inst error code = -1
2010-06-06 21:15:19.211: [ CRSEVT][167224] CAAMonitorHandler :: 0:Could not execute /oracle/app/product/crs/bin/racgwrap(check) for ora.hisdb01.ons
category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory

2010-06-06 21:15:19.211: [ CRSAPP][167224] CheckResource error for ora.hisdb01.ons error code = -1
2010-06-06 21:16:18.020: [ CRSEVT][167225] CAAMonitorHandler :: 0:Could not execute /oracle/app/product/db10g/bin/racgwrap(check) for ora.hisdb01.ASM1.asm
category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory

2010-06-06 21:16:18.021: [ CRSAPP][167225] CheckResource error for ora.hisdb01.ASM1.asm error code = -1[/quote]

3. Log of instance orcl1:
[quote]Sun Jun 6 21:08:42 2010
Errors in file /oracle/app/product/admin/orcl/udump/orcl1_ora_2915.trc:
ORA-00603: Message 603 not found; No message file for product=RDBMS, facility=ORA
ORA-27544: Message 27544 not found; No message file for product=RDBMS, facility=ORA
ORA-27300: Message 27300 not found; No message file for product=RDBMS, facility=ORA; arguments: [socket] [23]
ORA-27301: Message 27301 not found; No message file for product=RDBMS, facility=ORA; arguments: [File table overflow]
ORA-27302: Message 27302 not found; No message file for product=RDBMS, facility=ORA; arguments: [sskgxpcre1]
…
Sun Jun 6 21:40:52 2010
WARNING: kfk failed to open a disk[/dev/vgdata/rasm_disk5]
Sun Jun 6 21:40:52 2010
Errors in file /oracle/app/product/admin/orcl/udump/orcl1_ora_4809.trc:
ORA-15025: could not open disk '/dev/vgdata/rasm_disk5'
ORA-27041: unable to open file
HPUX-ia64 Error: 23: File table overflow
Additional information: 3
Sun Jun 6 21:40:52 2010
WARNING: kfk failed to open a disk[/dev/vgdata/rasm_disk5]
Sun Jun 6 21:40:52 2010
Errors in file /oracle/app/product/admin/orcl/udump/orcl1_ora_4809.trc:
ORA-15025: could not open disk '/dev/vgdata/rasm_disk5'
ORA-27041: unable to open file
HPUX-ia64 Error: 23: File table overflow
Additional information: 3[/quote]
4. Log of instance asm1:
[quote]Sun Jun 6 21:14:26 2010
Errors in file /oracle/app/product/admin/+ASM/udump/+asm1_ora_3254.trc:
ORA-00603: Message 603 not found; No message file for product=RDBMS, facility=ORA
ORA-27504: Message 27504 not found; No message file for product=RDBMS, facility=ORA
ORA-27300: Message 27300 not found; No message file for product=RDBMS, facility=ORA; arguments: [ioctl] [23]
ORA-27301: Message 27301 not found; No message file for product=RDBMS, facility=ORA; arguments: [File table overflow]
ORA-27302: Message 27302 not found; No message file for product=RDBMS, facility=ORA; arguments: [skgxpvaddr1][/quote]
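
The CRS, orcl1 and ASM logs all carry the same numeric OS error, 23, even though the CRS text labels it "out of memory". A short script can pull the numbers out of the log lines and map them to symbolic names. This is a sketch under assumptions: the regular expression only covers the three line shapes seen above, `os_error_names` is a hypothetical helper, and Python's `errno` table reflects the machine running the script, not necessarily HP-UX. On Linux, 23 is `ENFILE`, which matches the "File table overflow" text HP-UX prints in these logs.

```python
import errno
import re

# Captures the number in the three shapes seen in the logs:
#   "OS error: 23, other: out of memory"
#   "HPUX-ia64 Error: 23: File table overflow"
#   "arguments: [socket] [23]"
OS_ERROR_RE = re.compile(r"(?:OS error:\s*|Error:\s*|\]\s*\[)(\d+)\b")


def os_error_names(lines):
    """Return the set of (number, symbolic name) pairs found in the lines."""
    found = set()
    for line in lines:
        for num in OS_ERROR_RE.findall(line):
            code = int(num)
            # errno.errorcode is the local platform's table (an assumption
            # when the logs come from a different operating system).
            found.add((code, errno.errorcode.get(code, "UNKNOWN")))
    return found


# Lines taken from the log excerpts above.
SAMPLE = [
    "category: 1234, operation: scls_process_spawn, loc: out_pipe, OS error: 23, other: out of memory",
    "HPUX-ia64 Error: 23: File table overflow",
    "ORA-27300: Message 27300 not found; No message file for product=RDBMS, facility=ORA; arguments: [socket] [23]",
]

if __name__ == "__main__":
    print(os_error_names(SAMPLE))
```

Finding a single code across all three logs supports reading them as one underlying condition (the system-wide file table being full) and not as three separate problems.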
5. nfile usage before the failure
[quote]root@hisdb01:/sbin/init.d # kcusage nfile
Tunable Usage / Setting
=============================================
nfile 51795 / 65536[/quote]
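
The `kcusage` output above can be turned into a utilization figure with a few lines of code. This is a minimal sketch that assumes the three-line output layout shown here; `parse_kcusage` and `utilization` are illustrative names, not part of any HP-UX tooling.

```python
def parse_kcusage(output, tunable):
    """Return (usage, setting) for `tunable` from kcusage text output."""
    for line in output.splitlines():
        fields = line.split()
        # Data rows are assumed to look like: "nfile 51795 / 65536"
        if len(fields) == 4 and fields[0] == tunable and fields[2] == "/":
            return int(fields[1]), int(fields[3])
    raise ValueError(f"tunable {tunable!r} not found")


def utilization(usage, setting):
    """Fraction of the tunable's limit currently in use."""
    return usage / setting


# Output copied from the hisdb01 excerpt above.
SAMPLE = """Tunable Usage / Setting
=============================================
nfile 51795 / 65536"""

if __name__ == "__main__":
    usage, setting = parse_kcusage(SAMPLE, "nfile")
    print(f"nfile: {usage}/{setting} = {utilization(usage, setting):.1%}")
```

For the hisdb01 reading this gives roughly 79% of the configured limit, already close to the ceiling before the failure.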
6. imon_orcl1.log
[quote]2010-06-17 17:38:17.168: [ RACG][30] [9233][30][ora.orcl.orcl1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13

2010-06-17 17:39:17.178: [ RACG][30] [9233][30][ora.orcl.orcl1.inst]: GIMH: GIM-00104: Health check failed to connect to instance.
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
"/oracle/app/product/db10g/log/hisdb01/racg/imon_orcl.log" 158031 lines, 9229057 characters
GIM-00090: OS-dependent operation:mmap failed with status: 12
GIM-00091: OS failure message: Not enough space
GIM-00092: OS failure occurred at: sskgmsmr_13[/quote]

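The highlighted entries in the logs above (such as the File table overflow and OS error 23 messages) suggest that the failure was most likely triggered by Oracle hitting an operating system resource limit. The next step is to look at operating system resource utilization before and after the failure.
1. nfile usage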
[quote]root@hisdb02:/ # kcusage nfile
Tunable Usage / Setting
=============================================
nfile 12089 / 65536[/quote]

2. Host memory and CPU resources
[quote]zzz ***Sun Jun 6 21:17:20 EAT 2010
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
1 0 0 2311458 1996119 172 18 0 0 0 0 0 2620 21313 834 0 1 99
1 0 0 2311458 1996103 191 21 0 0 0 0 0 2408 16170 709 0 1 99
1 0 0 2311458 1995210 166 18 0 0 0 0 0 2403 14823 700 1 0 99
zzz ***Sun Jun 6 21:17:30 EAT 2010
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
1 0 0 2285994 1996297 172 18 0 0 0 0 0 2620 21313 834 0 1 99
1 0 0 2285994 1996297 171 20 0 0 0 0 0 2426 11112 710 1 1 98
1 0 0 2285994 1995404 150 17 0 0 0 0 0 2398 10711 694 0 1 99
zzz ***Sun Jun 6 21:17:40 EAT 2010
procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
2 0 0 2196419 1996297 172 18 0 0 0 0 0 2620 21313 834 0 1 99
2 0 0 2196419 1995404 170 19 0 0 0 0 0 2372 10075 698 0 1 99
2 0 0 2196419 1995386 149 17 0 0 0 0 0 2380 10401 715 0 0 100[/quote]
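
The vmstat samples above can be summarized programmatically. The sketch below assumes the 18-column layout shown in the header (`free` is the fifth column, the final `id` column is CPU idle) and uses made-up helper names. It only accepts rows made of 18 integers, so header lines and the `zzz` timestamp lines are skipped.

```python
def parse_vmstat(text):
    """Yield (free_pages, cpu_idle) for each 18-column numeric row."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 18 and all(f.isdigit() for f in fields):
            yield int(fields[4]), int(fields[17])


def summarize(text):
    """Return (minimum free pages, mean CPU idle percentage)."""
    rows = list(parse_vmstat(text))
    if not rows:
        raise ValueError("no vmstat data rows found")
    min_free = min(free for free, _ in rows)
    mean_idle = sum(idle for _, idle in rows) / len(rows)
    return min_free, mean_idle


# First sample (21:17:20) copied from the excerpt above.
SAMPLE = """procs memory page faults cpu
r b w avm free re at pi po fr de sr in sy cs us sy id
1 0 0 2311458 1996119 172 18 0 0 0 0 0 2620 21313 834 0 1 99
1 0 0 2311458 1996103 191 21 0 0 0 0 0 2408 16170 709 0 1 99
1 0 0 2311458 1995210 166 18 0 0 0 0 0 2403 14823 700 1 0 99"""

if __name__ == "__main__":
    min_free, mean_idle = summarize(SAMPLE)
    print(f"min free pages: {min_free}, mean idle: {mean_idle:.1f}%")
```

When feeding in text pasted from this article, strip the `[quote]` and `[/quote]` markers first; otherwise the final row is rejected because its last field is not purely numeric.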

3. Disk I/O
[quote]zzz ***Sun Jun 6 21:06:37 EAT 2010

device bps sps msps

c1t0d0 0 0.0 1.0
c6t0d1 0 0.0 1.0
c6t0d2 0 0.0 1.0
c6t0d3 0 0.0 1.0
c6t0d4 0 0.0 1.0
c6t0d5 0 0.0 1.0
c8t0d1 0 0.0 1.0
c8t0d2 0 0.0 1.0
c8t0d3 0 0.0 1.0
c8t0d4 0 0.0 1.0
c8t0d5 0 0.0 1.0
c10t0d1 0 0.0 1.0
c10t0d2 0 0.0 1.0
c10t0d3 0 0.0 1.0
c10t0d4 0 0.0 1.0
c10t0d5 0 0.0 1.0
c12t0d1 0 0.0 1.0
c12t0d2 0 0.0 1.0
c12t0d3 0 0.0 1.0
c12t0d4 0 0.0 1.0
c12t0d5 0 0.0 1.0
c6t0d6 0 0.0 1.0
c6t0d7 0 0.0 1.0
c6t1d0 0 0.0 1.0
c6t1d1 0 0.0 1.0
c6t1d2 0 0.0 1.0
c6t1d3 0 0.0 1.0
c8t0d6 0 0.0 1.0
c8t0d7 0 0.0 1.0
c8t1d0 0 0.0 1.0
c8t1d1 0 0.0 1.0
c8t1d2 0 0.0 1.0
c8t1d3 0 0.0 1.0
c10t0d6 0 0.0 1.0
c10t0d7 0 0.0 1.0
c10t1d0 0 0.0 1.0
c10t1d1 0 0.0 1.0
c10t1d2 0 0.0 1.0
c10t1d3 0 0.0 1.0[/quote]

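Taken together, these three checks give a rough picture of host resource usage around the time of the failure, and they clearly show that the host was largely idle when the failure occurred.
When a resource contention failure (such as being unable to obtain a file handle) occurs while host resources are plentiful, an Oracle bug is a likely cause. A search of Oracle's official documentation turns up an unpublished bug (unpublished Bug 6931689) whose symptoms closely match this failure; see Metalink Doc 739557.1 for details.
The bug mainly affects the following platforms: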
[quote]HP-UX PA-RISC (64-bit)
HP-UX Itanium
HP IA64 HPUNIX
HP 9000 Series HP-UX (64-bit)[/quote]
Affected database versions: 10.2.0.3 to 11.1.0.6
[quote]- 10.2.0.3, 10.2.0.3 + CRS Bundle Patch #2 or CRS Bundle Patch #3
- 10.2.0.4
- 11.1.0.6[/quote]
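
As a quick screening aid, a version string can be checked against the releases listed above. This is a sketch only: `AFFECTED` and `is_affected` are illustrative names, the check looks at the first four version components and nothing else, and it does not know whether a CRS bundle patch or PSU containing the fix has already been applied.

```python
# Releases listed as affected in the note quoted above.
AFFECTED = {"10.2.0.3", "10.2.0.4", "11.1.0.6"}


def is_affected(version):
    """True if the first four components match a listed release.

    Patch level is ignored, so "10.2.0.4.4" is still flagged; whether the
    fix is already installed has to be verified separately.
    """
    base = ".".join(version.strip().split(".")[:4])
    return base in AFFECTED


if __name__ == "__main__":
    for v in ("10.2.0.4", "10.2.0.4.4", "11.1.0.7"):
        print(v, is_affected(v))
```

A positive result only means the release is on the list; the installed patch inventory in both the CRS and database homes still needs to be checked by hand.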

Solution:
Apply one of the following patches on top of the current version:
[quote]- CRS 10.2.0.4 Bundle Patch #2 (Patch 7493592) or above. See Note 405820.1
- Latest 10.2.0.4 CRS PSU Patch as per Note 756671.1
The fix has to be applied to both CRS and RAC Database home to fix the problem.
The BUG is fixed in 11.1.0.7 and will be fixed in 10.2.0.5.[/quote]
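Recommendations:
1. The database is currently at 10.2.0.4, so the latest PSU patch (10.2.0.4.4) can be applied on top of this patch set.
2. Increase the nfile parameter to 131072.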
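
To put the second recommendation in numbers, the pre-failure reading on hisdb01 (51795 entries in use) can be compared against the current and proposed limits. This is a back-of-the-envelope sketch; `headroom` is an illustrative helper, and it assumes the usage figure stays where it was observed, which will not hold if file handles are being leaked.

```python
def headroom(usage, setting):
    """Return (free entries, fraction of the limit in use)."""
    return setting - usage, usage / setting


OBSERVED_USAGE = 51795   # kcusage reading on hisdb01 before the failure
CURRENT_NFILE = 65536    # setting at the time of the failure
PROPOSED_NFILE = 131072  # value recommended in this article

if __name__ == "__main__":
    for label, limit in (("current", CURRENT_NFILE), ("proposed", PROPOSED_NFILE)):
        free, used = headroom(OBSERVED_USAGE, limit)
        print(f"{label}: limit={limit} free={free} used={used:.1%}")
```

Doubling the limit lowers utilization from about 79% to about 40% for the same usage. If the underlying cause is the handle leak described in the bug note, a larger table postpones exhaustion but does not prevent it, which is why the patch comes first in the list.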