The environment was a 10.2.0.5 RAC on AIX. Users reported that the application was slow, and when we logged into the database to investigate, sqlplus hung at login on one of the RAC nodes. Subsequent analysis found the cause: with automatic SGA management (ASMM) enabled, the SGA components were resizing frequently, producing large numbers of 'SGA: allocation forcing component growth' and 'cursor: pin S wait on X' waits; the database hung, and the visible symptom was a sqlplus client hanging and unable to log in.
The troubleshooting steps and reasoning were as follows:
1. The initial report was that sqlplus / as sysdba hung at login. Knowing the system was AIX, the first reaction was to check the common causes of sqlplus hangs on AIX: an NFS filesystem mounted under the / root directory with the NFS server in trouble; a dead DNS server listed in /etc/resolv.conf combined with DNS being the preferred source in /etc/netsvc.conf (/etc/nsswitch.conf on Linux); and so on. A quick check ruled these out, and the alert log showed no obvious errors either.
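The resolver-order pitfall above can be checked mechanically. Below is a minimal sketch that parses a sample netsvc.conf-style hosts line and warns when DNS is tried before local files; the file content and parsing logic are illustrative, not taken from the actual incident (on a real AIX host you would read /etc/netsvc.conf directly, or /etc/nsswitch.conf on Linux):

```shell
# Parse the first resolver source on the "hosts" line; the here-doc stands in
# for a real /etc/netsvc.conf (AIX) or /etc/nsswitch.conf (Linux).
conf=$(mktemp)
cat > "$conf" <<'EOF'
hosts = bind, local
EOF
# Take the first comma-separated token after "hosts =".
first=$(awk -F'=' '/^hosts/ {split($2, a, ","); gsub(/ /, "", a[1]); print a[1]}' "$conf")
if [ "$first" = "bind" ] || [ "$first" = "dns" ]; then
    echo "WARNING: DNS is consulted before local files; a dead DNS server can hang sqlplus logins"
fi
rm -f "$conf"
```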
2. crs_stat -t showed all cluster resources normal. Host load (CPU, memory/swap, I/O) showed no obvious anomaly, the listener was up, and the application's connections to the database (ps -ef | grep LOCAL=NO | wc -l) were still present.
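For reference, the connection-count check can be written as a pipeline; here it runs against canned ps output so the parsing is explicit (the sample lines are fabricated; on the real host the input is live ps -ef output):

```shell
# Count dedicated-server connections (LOCAL=NO); the here-doc below stands in
# for real `ps -ef` output.
psout=$(cat <<'EOF'
  oracle 1001    1  0 10:00 - 0:01 oraclecmdb1 (LOCAL=NO)
  oracle 1002    1  0 10:00 - 0:00 oraclecmdb1 (LOCAL=NO)
  oracle 1003    1  0 10:00 - 0:00 ora_pmon_cmdb1
EOF
)
conn=$(printf '%s\n' "$psout" | grep -c 'LOCAL=NO')
echo "dedicated connections: $conn"
```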
3. With the common causes ruled out and still no lead, the next step was OS-level tracing plus database-level systemstate/hanganalyze dumps. The systemstate trace ultimately pinpointed the cause, which led to a temporary workaround (once the instance is hung, the only options are killing processes, or connecting with sqlplus -prelim / as sysdba and issuing SHUTDOWN ABORT) and a longer-term fix (the ASMM-related settings, summarized at the end).
The analysis of the collected traces follows.
1. OS-level trace analysis. On AIX (5L and later), a process can be traced at the system-call level with truss:
$ truss -aefo <output file> <executable>
For example:
$ truss -aefo /tmp/sqlplus_trace.log sqlplus / as sysdba
Key excerpts from the trace:
877354: accessx("/etc/group", 04, 0) = 0
877354: statx("/etc/group", 0x0FFFFFFFFFFF4090, 176, 0) = 0
877354: getdomainname(0x0FFFFFFFFFFF4FC8, 256) = 0
877354: close(12) = 0
877354: getgidx(2) = 101
877354: open("/oracle/admin/cmdb/adump/cmdb1_ora_877354_1.aud", O_RDWR|O_CREAT|O_EXCL|O_LARGEFILE, S_IRUSR|S_IWUSR|S_IRGRP|S_IWGRP) = 12
877354: kwrite(12, " A u d i t f i l e ", 11) = 11
…… (some output omitted)
877354: kwrite(12, " F r i J u l 7 1".., 25) = 25
877354: kwrite(12, " L E N G T H : ' 1 6".., 14) = 14
877354: kwrite(12, "\n", 1) = 1
877354: kwrite(12, " A C T I O N : [ 7 ] ".., 159) = 159
877354: kwrite(12, "\n", 1) = 1
877354: times(0x0FFFFFFFFFFF5E30) = 1191467758
877354: times(0x0FFFFFFFFFFF5B10) = 1191467758
877354: times(0x0FFFFFFFFFFF5D30) = 1191467758
………… (some output omitted)
877354: times(0x0FFFFFFFFFFF4E10) = 1191467759
877354: times(0x0FFFFFFFFFFF45F0) = 1191467759
877354: times(0x0FFFFFFFFFFF45F0) = 1191467759
877354: thread_wait(10) = 1
877354: thread_wait(10) = 1
877354: thread_wait(10) = 1
877354: thread_wait(10) = 1
877354: thread_wait(10) = 1
877354: thread_wait(10) = 1
877354: thread_wait(10) = 1
877354: thread_wait(10) = 1
-- everything after this point is the same '877354: thread_wait(10) = 1' output
thread_wait(10) was not a call I recognized, but from the name it is presumably the process waiting on something. Waiting on what, though? No obvious OS-level bottleneck (CPU/I/O/memory) had been found, so the only remaining path was to examine the states of the Oracle processes inside the database.
2. Database systemstate/hanganalyze dumps
Commands used:
Hanganalyze:
sqlplus -prelim / as sysdba
SQL>oradebug setmypid
SQL>oradebug unlimit
SQL>oradebug -g all hanganalyze 3
Wait for 30 seconds
SQL>oradebug -g all hanganalyze 3
SQL>exit
Systemstate dump:
sqlplus -prelim / as sysdba
SQL>oradebug setmypid
SQL>oradebug unlimit
SQL>oradebug -g all dump systemstate 10
Wait for 30 seconds
SQL>oradebug -g all dump systemstate 10
SQL>oradebug tracefile_name
SQL>exit
The trace file produced by the systemstate dump can be analyzed quickly with the ass109.awk script provided on MOS, which identifies blocking processes and each process's wait event.
$ awk -f ass109.awk cmdb1_ora_791490.trc
Starting Systemstate 1
..............................................................................
Ass.Awk Version 1.0.9 - Processing cmdb1_ora_791490.trc
System State 1
~~~~~~~~~~~~~~~~
1:
2: waiting for 'pmon timer'
3: waiting for 'DIAG idle wait'
4: waiting for 'rdbms ipc message'
5: waiting for 'rdbms ipc message'
6: waiting for 'ges remote message'
7: waiting for 'gcs remote message'
8: waiting for 'gcs remote message'
9: waiting for 'gcs remote message'
10: waiting for 'gcs remote message'
11: last wait for 'rdbms ipc message'
12: waiting for 'rdbms ipc message'
13: waiting for 'rdbms ipc message'
14: waiting for 'rdbms ipc message'
15: waiting for 'rdbms ipc message'
16: waiting for 'SGA: allocation forcing component growth' ============>>>>>>>>>>
17: waiting for 'cursor: pin S wait on X' =============>>>>>>>>>>
18: waiting for 'SGA: allocation forcing component growth'
19: waiting for 'SGA: allocation forcing component growth'
20: waiting for 'rdbms ipc message'
21:
22:
23: waiting for 'rdbms ipc message'
24: waiting for 'SQL*Net message from client'
Cmd: Insert
25: waiting for 'SQL*Net message from client'
Cmd: Insert
26: waiting for 'cursor: pin S wait on X'
Cmd: Select
27: waiting for 'SQL*Net message from client'
28: waiting for 'cursor: pin S wait on X'
29: waiting for 'SQL*Net message from client'
30: waiting for 'SQL*Net message from client'
Cmd: Insert
31: waiting for 'cursor: pin S wait on X'
Cmd: Select
32: waiting for 'cursor: pin S wait on X'
Cmd: Select
………… (similar output omitted)
54: waiting for 'SQL*Net message from client'
55: waiting for 'cursor: pin S wait on X'
Cmd: Select
56: waiting for 'cursor: pin S wait on X'
Cmd: Select
57: waiting for 'SQL*Net message from client'
58: waiting for 'SQL*Net message from client'
Cmd: Delete
59: waiting for 'SQL*Net message from client'
60: waiting for 'SQL*Net message from client'
61: last wait for 'SGA: allocation forcing component growth'
…………
100:waiting for 'SGA: allocation forcing component growth'
101:waiting for 'SQL*Net message from client'
102:waiting for 'SGA: allocation forcing component growth'
…… (similar output omitted)
119:waiting for 'cursor: pin S wait on X'
120:waiting for 'cursor: pin S wait on X'
121:waiting for 'SGA: allocation forcing component growth'
122:waiting for 'SQL*Net message from client'
123:waiting for 'cursor: pin S wait on X'
Cmd: Select
124:waiting for 'cursor: pin S wait on X'
125:waiting for 'cursor: pin S wait on X'
Cmd: Select
126:waiting for 'SGA: allocation forcing component growth'
127:waiting for 'SQL*Net message from client'
128:waiting for 'SQL*Net message from client'
Cmd: Insert
129:waiting for 'cursor: pin S wait on X'
Cmd: Select
………… (similar output omitted)
321:waiting for 'cursor: pin S wait on X'
322:waiting for 'SGA: allocation forcing component growth'
Cmd: Select
323:waiting for 'Streams AQ: qmn slave idle wait'
324:waiting for 'cursor: pin S wait on X'
325:waiting for 'cursor: pin S wait on X'
326:waiting for 'cursor: pin S wait on X'
327:waiting for 'cursor: pin S wait on X'
328:waiting for 'cursor: pin S wait on X'
329:waiting for 'cursor: pin S wait on X'
330:waiting for 'cursor: pin S wait on X'
331:waiting for 'cursor: pin S wait on X'
332:waiting for 'cursor: pin S wait on X'
Cmd: Select
333:waiting for 'SGA: allocation forcing component growth'
Cmd: Select
334:waiting for 'cursor: pin S wait on X'
335:waiting for 'cursor: pin S wait on X'
336:waiting for 'cursor: pin S wait on X'
337:waiting for 'cursor: pin S wait on X'
Cmd: Select
338:waiting for 'cursor: pin S wait on X'
…………
376:waiting for 'cursor: pin S wait on X'
532:waiting for 'SQL*Net message from client'
533:waiting for 'SQL*Net message from client'
534:waiting for 'cursor: pin S wait on X'
Cmd: Select
NO BLOCKING PROCESSES FOUND ===========>>>>>>>> no blocking process was found in the systemstate dump
554983 Lines Processed.
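As a lighter-weight complement to ass109.awk, the dominant waits in this kind of output can be tallied with a one-line awk. This is a sketch that assumes the "N: waiting for 'event'" line format shown above, run here against a four-line sample of it:

```shell
# Tally wait events in ass109.awk-style output; the sample lines mirror the
# format shown above.
trc=$(mktemp)
cat > "$trc" <<'EOF'
16: waiting for 'SGA: allocation forcing component growth'
17: waiting for 'cursor: pin S wait on X'
18: waiting for 'SGA: allocation forcing component growth'
24: waiting for 'SQL*Net message from client'
EOF
# The event name is the text between single quotes (field 2 with -F"'").
counts=$(awk -F"'" '/waiting for/ {n[$2]++} END {for (e in n) printf "%4d %s\n", n[e], e}' "$trc" | sort -rn)
echo "$counts"
rm -f "$trc"
```

Sorted in descending order, the first line immediately names the most frequent wait.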
The analysis shows large numbers of the abnormal wait events 'cursor: pin S wait on X' and 'SGA: allocation forcing component growth'. Anyone who has dug into SGA memory-management problems can probably already infer what hung the system at this point; the details are expanded in the summary below.
Examining the hanganalyze trace: it likewise shows no single directly hung (blocking) process.
*** 2017-07-07 19:00:26.013
==============
HANG ANALYSIS: level 3 syslevel 1
==============
session (kjzddbx) switches to a new action
PORADATA TRACEFILE /oracle/admin/cmdb/bdump/cmdb1_diag_95122.trc
PORADEBUG END ORIGINATING INST:1 SERIAL:0 PID:762600
********************************************************************
Found 22 objects waiting for <cnode/sid/sess_srno/proc_ptr/ospid/wait_event>
<0/1096/1/0x93708b8/176578/No Wait>
Open chains found:
Chain 1 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
<0/1088/1/0xa3bcc48/250860/cursor: pin S wait on X>
-- <1/1090/1/enq: DR - contention>
Chain 2 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
<0/1096/1/0x93708b8/176578/No Wait>
-- <0/571/33923/0xf3cbf88/831752/SGA: allocation forcing componen>
Other chains found:
Chain 3 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
<0/516/51097/0xa3dd828/668272/cursor: pin S wait on X>
Chain 4 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
<0/517/11143/0x937a778/430216/cursor: pin S wait on X>
Chain 5 : <cnode/sid/sess_srno/proc_ptr/ospid/wait_event> :
<0/519/1603/0x83a8210/860644/cursor: pin S wait on X>
……………………
[2193]/1/1089/1/0x9560e50//IGN/1291/1292//none
[2194]/1/1090/1/0x8565b58//NLEAF/1293/1294/[1087]/none
[2195]/1/1091/1/0xa5a5cb8//IGN/1295/1296//none
…………
[2206]/1/1102/1/0x8569bc0//IGN/1317/1318//none
[2207]/1/1103/1/0xa5a9d20//IGN/1319/1320//none
[2208]/1/1104/1/0xf594508//SINGLE_NODE_NW/1321/1322//none
[2209]/1/1105/1/0x9566430//IGN/1323/1324//none
session (kjzddbx) switches to a new action
*** 2017-07-07 19:00:34.443
====================
END OF HANG ANALYSIS
====================
From the trace analysis above, the database was flooded with 'cursor: pin S wait on X' and 'SGA: allocation forcing component growth' waits, and hung. Working from the available information: the likely trigger for these waits was a change in workload (higher pressure or a shift in the SQL mix) changing the demand on the SGA components (mainly the buffer cache and shared pool), which then went through resize operations.
Under automatic SGA management, a component resize effectively freezes parts of the SGA, and memory in the intermediate resize state shows up as 'KGH: NO ACCESS'. This frequently disrupts SQL parsing (the familiar ORA-04031 errors, and waits such as 'cursor: pin S wait on X', 'library cache lock' and 'latch: shared pool'). Because this database belonged to a customer no longer under our maintenance, no AWR/ASH data for the incident window was collected after the restart, so a deeper root-cause analysis was not possible; a small regret for anyone who wants to chase a problem to its technical root.
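Had the instance remained under our maintenance, the resize churn could have been confirmed after the fact from the dynamic views 10g keeps for ASMM; a sketch (the MB arithmetic is illustrative):

```sql
-- Recent SGA component resize operations (a circular buffer in 10g ASMM).
SELECT component, oper_type, oper_mode,
       initial_size / 1024 / 1024 AS initial_mb,
       final_size / 1024 / 1024   AS final_mb,
       status, start_time, end_time
  FROM v$sga_resize_ops
 ORDER BY start_time DESC;

-- Resizes still in flight (the frozen, 'KGH: NO ACCESS' phase) show up here.
SELECT component, oper_type, initial_size, target_size, start_time
  FROM v$sga_current_resize_ops;
```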
Fortunately, the ASMM-related settings recommended after the trace analysis are generally enough to avoid resize-induced problems. For avoiding the impact of SGA component resizes on system stability under automatic SGA management, the official MOS recommendations are:
1. Set reasonable SGA_TARGET and SGA_MAX_SIZE values, guided by AWR or the advice in V$SGA_TARGET_ADVICE; if necessary, disable automatic SGA management altogether.
2. Set reasonable shared_pool_size and db_cache_size values. Under automatic SGA management these two parameters do not need to be set; when they are, they act as floors: the components will not shrink below the given values, while they can still grow if the SGA has headroom beyond them. This reduces the number and frequency of resize operations.
3. Set the hidden parameter to reduce how often resizes are evaluated: alter system set "_memory_broker_stat_interval"=999; (unit: seconds). The default is 30 seconds; setting it to 999 means a resize is evaluated at most every 999 seconds. On RAC this must be set on all nodes.
4. Tune the application to reduce SQL parsing, both hard and soft parses.
5. Apply the patches for the known bugs related to this behavior, or upgrade to a newer database version.
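Recommendations 1–3 above can be sketched as a single set of parameter changes; every size below is a placeholder to be derived from AWR and V$SGA_TARGET_ADVICE for the actual system, not a recommendation in itself:

```sql
-- Placeholder sizes; derive real values from AWR / V$SGA_TARGET_ADVICE.
ALTER SYSTEM SET sga_max_size = 8G SCOPE = SPFILE;   -- static, needs a restart
ALTER SYSTEM SET sga_target   = 8G SCOPE = BOTH;
-- Floors for the two components that resize most often under ASMM:
ALTER SYSTEM SET shared_pool_size = 2G SCOPE = BOTH;
ALTER SYSTEM SET db_cache_size    = 4G SCOPE = BOTH;
-- Evaluate resizes at most every 999 seconds (default 30); set on every RAC node:
ALTER SYSTEM SET "_memory_broker_stat_interval" = 999 SCOPE = SPFILE SID = '*';
```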
The relevant MOS documents:
High Waits for "SGA: allocation forcing component growth" and Slow Database Performance in 10g (Doc ID 1170814.1)
High 'Cursor: Pin S Wait On X', 'Library Cache Lock' And "Latch: Shared Pool" Waits due to Shared Pool/Buffer Cache Resize Activity (Doc ID 742599.1)
Troubleshooting 'cursor: pin S wait on X' waits. (Doc ID 1349387.1)