一个os thread startup、log file sync等待的故障回顾-CSDN博客

这是一个原来遇到过，困扰过了我们很多人很久的故障，问题迟迟无法解决的原因是因为概念的模糊。os thread startup 是oracle 等待事件中Concurrency的一个等待，与之相关的解释实际非常少。从种种故障现象及字面意义来理解，在进行并行分配是向os 申请进程启动，而在正常的系统中，该动作是非常快的，在我们高压力的数据库环境下，os存在响应问题，激发了该等待事件的等待。最近因整理以前知识，由此记录一下。

以下摘录于网络的解释：


Know more about “os thread startup”
 
‘os thread startup’ takes significant amount of time in ‘create index parallel’.
All slaves are allocated one by one in serial.
SQL tracing on foreground, there is one ‘os thread startup’ wait per slave, each wait takes 100ms. –> May
need investigation
When there are 512 slaves, ‘os thread startup’ wait take 50 seconds before the slaves start to do any job.
Resolution is to set *.parallel_min_servers=512 to pre-allocated 512 slaves per instance duirng instance
startup, or to run PCIX twice and ignore the first run

【故障现象】

1.数据库出现os thread startup等待，并伴随着大量DFS lock handle,log file sync，latch free等等待堆积。

2.首先出现异常等待节点数据库阻塞达到一定程度，系统资源大量消耗，主机hang住。

3.另外一节点数据库受前一节点数据库影响出现大量gc 等待事件，但异常等待积累到一定程度便出现DFS lock handle等等待，系统资源大量消耗，主机hang住。

【分析处理思路】

1. 数据库等待情况：

从 dba_hist_active_sess_history 分析，数据库出现os thread startup，之后开始引起大量的log file sync，latch free等待和DFS lock handle等阻塞。并且阻塞的源头是 os thread startup.

节选：

节点1 根据时间点，按等待事件进行汇总：

28-JUL-16 04.22.16.187 PM PX Deq: Slave Session Stats 1

28-JUL-16 04.22.16.187 PM control file sequential read 1

28-JUL-16 04.22.16.187 PM os thread startup 1 <<<---

28-JUL-16 04.22.16.187 PM gc current grant 2-way 2

28-JUL-16 04.22.16.187 PM ASM file metadata operation 2

......

28-JUL-16 04.22.16.187 PM gc buffer busy release 11

28-JUL-16 04.22.16.187 PM db file sequential read 54

28-JUL-16 04.22.16.187 PM 107

28-JUL-16 04.22.16.187 PM log file sync 114 <<<----

28-JUL-16 04.22.16.187 PM latch free 138 <<<----

28-JUL-16 04.22.27.538 PM gc cr grant 2-way 1

2.LGWR trace 写入出现告警，实际写入的数据量非常小，却消耗了大量的时间。

### LGWR trace日志 ###

LGWR写入量及其耗用的时间，直观表明I / O出现写入缓慢的情况

*** 2016-06-21 17:11:22.234

Warning: log write elapsed time 516ms, size 16727KB

*** 2016-06-21 17:11:44.492

Warning: log write elapsed time 21902ms, size 16KB

*** 2016-06-21 17:12:08.449

Warning: log write elapsed time 23529ms, size 381KB

3.CPU使用情况

（ 1）故障前主机 CPU 资源耗尽并存在多个 CPU分区负荷100%的情况，另外存在多个 CPU分区高负荷并且是系统 SYS占用

（ 2）ORACLE SR 分析并结合其接触过的案例， AIX下可能存在 CPU折叠的情况，单个 CPU满负荷使用会造成无法调度，而造成OS hang或非常缓慢的情况后 IBM已解释并说明目前系统的CPU不会出现CPU折叠的情况，相关系统参数已设置正确，

vmstat：

zzz ***Thu Jul 28 09:37:45 BEIST 2016<<---CPU资源IDLE 30%左右

System configuration: lcpu=256 mem=745216MB

kthr memory page faults cpu

----- ----------- ------------------------ ------------ -----------

r b avm fre re pi po fr sr cy in sy cs us sy id wa

100 0 134703580 35782109 0 0 0 0 0 0 25724 255043 108481 61 12 22 5

86 0 134705919 35779762 0 0 0 0 0 0 21259 215539 88214 57 9 28 6

77 0 134701214 35784474 0 0 0 0 0 0 18884 194253 76418 54 9 30 7

zzz ***Thu Jul 28 09:38:41 BEIST 2016 <<<---CPU资源耗尽

System configuration: lcpu=256 mem=745216MB

kthr memory page faults cpu

----- ----------- ------------------------ ------------ -----------

r b avm fre re pi po fr sr cy in sy cs us sy id wa

272 1 134995188 35752075 0 0 0 0 0 0 4881 231230 79391 81 13 3 3

253 0 134995238 35752029 0 0 0 0 0 0 3699 179424 63599 79 13 4 3

203 1 134997247 35750015 0 0 0 0 0 0 4928 258230 62545 75 14 7 5

mp stat：

zzz ***Thu Jul 28 09:36:59 BEIST 2016

System configuration: lcpu=256 mode=Capped

cpu min maj mpc int cs ics rq mig lpa sysc us sy wa id pc

....

84 581 0 0 499 1298 239 1 386 100 4098 74 15 2 9 0.30

85 0 0 0 236 21 21 1 0 100 0 99 1 0 0 0.55 <<<---cpu 100%

4.磁盘IO情况

故障期间 redo file所在磁盘组 DGSYSDG 对应的所有磁盘都存在一个共同的现象，磁盘的 MAXSERV的值开始出现异常上涨， maxserv 从 20 以下上涨至 100~400 左右， maxserv 异常增长与 log file sync 异常增长的时间点非常吻合，并且峰值的时间点一致。