TS - AWR Snapshot stop issue Handle

1024x768

一、概述

故障简述:AWR 快照停止生成

操作系统:linux as 5.4

数据库:oracle 10.2.0.5 RAC 4 节点)

二、故障描述

使用TOAD收集情数据库性能信息时,运行AWR后发现无信息,查询快照信息,发现快照自75号后停止生成。

 

三、处理过程

1. 查看alert log,无相关错误信息。

2. 查看最近的AWR后台进程(mmon)日志报错信息,发现在75号,出现以下错误信息:

$ ls -lth  /u01/app/oracle/admin/qbzyzh/bdump/|less|grep mmon|head -3

-rw-rw---- 1 oracle dba 1.3M 07-12 02:00 qbzyzh3_mmon_20013.trc

-rw-rw---- 1 oracle dba  13M 07-01 10:52 qbzyzh3_mmon_21942.trc

-rw-rw---- 1 oracle dba 9.4M 02-21 14:15 qbzyzh3_mmon_21983.trc

 

[oracle@QBZYK-RAC3 ~]$ less /u01/app/oracle/admin/qbzyzh/bdump/qbzyzh3_mmon_20013.trc

/u01/app/oracle/admin/qbzyzh/bdump/qbzyzh3_mmon_20013.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production

With the Partitioning, Real Application Clusters, OLAP, Data Mining

and Real Application Testing options

ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1

System name:    Linux

Node name:      QBZYK-RAC3

Release:        2.6.18-164.el5

Version:        #1 SMP Tue Aug 18 15:51:48 EDT 2009

Machine:        x86_64

Instance name: qbzyzh3

Redo thread mounted by this instance: 3

Oracle process number: 22

Unix process pid: 20017, image: oracle@QBZYK-RAC3 (MMON)

 

*** 2012-07-05 09:00:35.209

*** SERVICE NAME:(SYS$BACKGROUND) 2012-07-05 09:00:35.209

*** SESSION ID:(1965.1) 2012-07-05 09:00:35.209

*** KEWRAFC: Flush slave failed, AWR Enqueue Timeout

*** 2012-07-05 09:01:35.275

*** KEWRAFC: Flush slave failed, AWR Enqueue Timeout

*** 2012-07-05 09:02:35.337

*** KEWRAFC: Flush slave failed, AWR Enqueue Timeout

*** 2012-07-05 09:03:35.359

*** KEWRAFC: Flush slave failed, AWR Enqueue Timeout

 

根据错误信息,metalink 给出的解释此错误可能是AWR所存放的表空间sysaux空间不足,查看sysaux表空间使用率,使用率很低,只为8%,排除这个可能性。

3. 进一步检查 AWR 其他子进程m00*

[oracle@QBZYK-RAC3 ~]$ ls -lth  /u01/app/oracle/admin/qbzyzh/bdump/|less|grep m00|head -n2

-rw-rw---- 1 oracle dba 3.1K 07-12 18:18 qbzyzh3_m000_13199.trc

-rw-rw---- 1 oracle dba 1.4K 07-05 23:38 qbzyzh3_m001_29798.trc

[oracle@QBZYK-RAC3 ~]$ less /u01/app/oracle/admin/qbzyzh/bdump/qbzyzh3_m000_13199.trc

/u01/app/oracle/admin/qbzyzh/bdump/qbzyzh3_m000_13199.trc

Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production

With the Partitioning, Real Application Clusters, OLAP, Data Mining

and Real Application Testing options

ORACLE_HOME = /u01/app/oracle/product/10.2.0/db_1

System name:    Linux

Node name:      QBZYK-RAC3

Release:        2.6.18-164.el5

Version:        #1 SMP Tue Aug 18 15:51:48 EDT 2009

Machine:        x86_64

Instance name: qbzyzh3

Redo thread mounted by this instance: 3

Oracle process number: 374

Unix process pid: 13199, image: oracle@QBZYK-RAC3 (m000)

 

*** ACTION NAME:(Remote-Flush Slave Action) 2012-07-05 08:00:10.672

*** MODULE NAME:(MMON_SLAVE) 2012-07-05 08:00:10.672

*** SERVICE NAME:(SYS$BACKGROUND) 2012-07-05 08:00:10.672

*** SESSION ID:(1691.44447) 2012-07-05 08:00:10.672

WARNING:io_submit failed due to kernel limitations MAXAIO for process=0 pending aio=0

WARNING:asynch I/O kernel limits is set at AIO-MAX-NR=65536 AIO-NR=65527

WARNING:1 Oracle process running out of OS kernelI/O resources aiolimit=0

ksfdgo()+1488

ssd_unwind_bp: unhandled instruction at 0x150b83b instr=68

*** 2012-07-05 08:16:11.707

ssd_unwind_bp: unhandled instruction at 0x1f028d7 instr=f

ssd_unwind_bp: unhandled instruction at 0x150d0e7 instr=6a

ssd_unwind_bp: unhandled instruction at 0x150b83b instr=68

 

根据错误信息内容,此为10.2.0.5bug, ID 号为9949948

Oralce metalink给出的解决办法为:

1)  增加系统参数 AIO-MAX-NR

2)  禁用异步IO Disable async IO set DISK_ASYNCH_IO=FLASE

3)  升级数据库到10.2.0.5.4

 

4. 解决故障

1)临时解决处理过程

查看AWR 相关JOB 还在运行,但AWR 快照仍然无法生成,初步分析可能跟BUG有关,尝试手工生成快照,操作hang住,怀疑存在相应的锁。

检查数据库等待事件, 发现其他节点AWR进程被节点3 AWR 进程block住:

 

SQL> select sid,serial#,event,username,PROGRAM,BLOCKING_INSTANCE,blocking_session,state

  2  from gv$session

  3  where (event not like '%rdbms%'  and event not like '%gcs remote%' and event not like 'smon timer%'

  4  and event not like 'SQL*Net%' and event not like 'Streams AQ%'  and  event not like '%jobq slave wait%'

  5  and event not like '%ASM background timer%' and event not like 'DIAG idle wait%')

  6  and program like '%m00%'

  7  and status='ACTIVE'

  8  order by seconds_in_wait desc

  9  /

 

SID SERIAL# EVENT        USERNAME       PROGRAM     BLOCKING_INSTANCE BLOCKING_SESSION  STATE

---------- ---------- ------------------------- --------------- --------------- --------------- --------------

1868 35389 enq: WF - contention        oracle@QBZYK-RAC2(m000)  3          1691         WAITING                                                                  

1654 14263 enq: WF - contention        oracle@QBZYK-RAC1(m000)  3          1691         WAITING

1619  4128 enq: WF - contention        oracle@QBZYK-RAC4(m000)  3          1691         WAITING

1691  44447 direct path read           oracle@QBZYK-RAC3 (m000)                         WAITING

         

查询节点2 blockingsession 信息,怀疑刷ASH信息到AWR时被block住了:    

                                                 

SQL> select sid,serial#,event,PROGRAM,status,action

  2  p1,p1text,p2,p2text

  3  from gv$session

  4  where  sid=1691 and PROGRAM like '%m00%'

  5  /

 

SID SERIAL# EVENT           PROGRAM          STATUS    P1         P1TEXT      P2         P2TEXT       ACTION

-------- -------------------------- -------------           ---------  -----      -------     ---      

1691 44447 direct path read oracle@QBZYK-RAC3(m000) ACTIVE  file number 1 first dba 131090 Remote-Flush Slave Action                           

                                                                    

查询被block的对象,发现AWR 数据字典表被锁住     

                                                                                                  

SQL> select t2.sid,t2.serial#,t2.logon_time,t2.paddr,t1.object_id

  2  from gv$locked_object t1,gv$session t2

  3  where t1.session_id=t2.sid

  4  and t2.sid=1691 and t2.program like '%m00%'

  5  order by t2.logon_time

  6  /

 

 

 

SID    SERIAL# LOGON_TIM  PADDR             OBJECT_ID

1691      44447 05-JUL-12 0000000344451088   8990

1691      44447 05-JUL-12 0000000344451088   8994

 

SQL> set linesize 180

SQL> col object_name for a30

SQL> select  object_name,object_type,owner  from dba_objects where object_id in (8990,8994);

 

OBJECT_NAME                    OBJECT_TYPE         OWNER

------------------------------ ------------------- ------------------------------

WRH$_SQL_PLAN                  TABLE               SYS

WRH$_SQL_BIND_METADATA         TABLE               SYS

 

在节点3 Kill session ‘1691,44447’, 并在系统层面kill掉对应的系统进程, AWR 相应等待事件enq: WF – contention消失

 

尝试手工生成AWR 快照, 成功。

 

后续观察,AWR 自动生成快照恢复正常

 

1) 最终解决处理过程:

根据oracle metalink的解决建议:

*增加系统参数 AIO-MAX-NR

*禁用异步IO Disable async IO set DISK_ASYNCH_IO=FLASE

*升级数据库到10.2.0.5.4

考虑到升级会对应用的停机影响,但增加系统参数无需重启系统,2012-07-30日在建议用户增加系统kernel 参数 aio-max-nr ,重新启动AWR后密切监控,并征得用户同意后,对数据库4节点做出以下修改:

a)  修改操作系统内核参数

    修改前:

cat /proc/sys/fs//proc/sys/fs/aio-max-nr

65536

修改后:

echo 1048576 >/proc/sys/fs/aio-max-nr

或者

vi /etc/sysctl.conf 添加:

fs.aio-max-nr = 1048576

执行命令使修改生效:

sysctl -p

b)  启动AWR 并监控

c)  后期监控结果

AWR运行正常,数据库日志文件再无异步IO的错误。

 

四、总结及建议

解决故障过程中,遇到了不少困难,一开始从等待事件入手,在kill掉阻塞进程时并没有杀掉对应系统进程,后来杀掉系统进程后,故障解决,并通过查看AWR后台进程,才发现了故障原因。

建议:目前很多业务系统的数据库跟故障库使用相同系统和数据库版本和ASM 存储,根据oracle metalink给用户提供了解决的方案,建议修改系统参数文件并保持后续监控。

 

五、附件

以下是Bug 9949948的相关信息:

 

Bug 9949948 - Linux: Process spin under ksfdrwat0 if OS Async IO not configured high enough [ID 9949948.8]

Modified 29-OCT-2011     Type PATCH     Status PUBLISHED  

Bug 9949948  Linux: Process spin under ksfdrwat0 if OS Async IO not configured high enough

 This note gives a brief overview of bug 9949948.

 The content was last updated on: 28-OCT-2011

 Click here for details of each of the sections below.

 

Affects:

Product (Component) Oracle Server (Rdbms) 

Range of versions believed to be affected Versions >= 10.2.0.4 but BELOW 11.1 

Versions confirmed as being affected ?10.2.0.5

 

Platforms affected ?Linux X86-64bit

?Linux 32bit

 

 It is believed to be a regression in default behaviour thus:

   Regression introduced in 10.2.0.5

 

Fixed:

This issue is fixed in ?11.1.0.6 (Base Release)

?10.2.0.5.2 Patch Set Update

?10.2.0.5 Patch 5 on Windows Platforms

 

Symptoms: Related To:

?Hang (Process Spins)

?Waits for "i/o slave wait"

 ?DISK_ASYNCH_IO

Description

This problem is introduced in 10.2.0.5

 

It only affects platforms where Oracle has to reserve async IO slots,

such as Linux platforms.

If the OS async IO layer is underconfigured and an Oracle process

cannot get sufficient AIO slots then rather than reverting to

using non AIO call the process may go into an infinite spin

under ksfdrwat0.

 

Rediscovery notes:

  The spin will be preceded by messages in the trace

  file of the form.:

    WARNING:io_submit failed due to kernel limitations MAXAIO

                for process=0 pending aio=0

    WARNING:asynch I/O kernel limits is set at AIO-MAX-NR=65536 AIO-NR=65518

    WARNING:1 Oracle process running out of OS kernelI/O resources aiolimit=0

  Notice specifically that the value for aiolimit is reported as "0"

  for this bug.

  The process then spins in ksfdrwat0 typically with a stack showing

         skgfqio ()

         ksfdgo ()

         ksfdwtio ()

         ksfdwat1 ()

         ksfdrwat0 ()   <<< Spin point

         ksfdblock ()

         kcflwi ()

         kcflci ()

         kcblci ()

         kcblcio ()

         kcblgt ()

         kcbldrget ()

 

  It will show repeated waits for "i/o slave wait", which can be

  misleading as that is normally considered an idle wait event.

 

Workaround

  Raise the OS AIO limits such that the number of concurrent slot

   requirements never exceeds the OS limit.

   ie: Increase AIO-MAX-NR

 OR

  Disable async IO (Set DISK_ASYNCH_IO=FALSE)

 

See Note:1313555.1 for additional notes on this issue.

HOOKS SPIN:ksfdrwat0 PARAMETER:DISK_ASYNCH_IO "WAITEVENT:i/o slave wait" LIKELYAFFECTS XAFFECTS_10.2.0.5 XAFFECTS_V10020005 AFFECTS=10.2.0.5 XPRODID_5 PRODUCT_ID=5 PRODID-5 RDBMS XCOMP_RDBMS COMPONENT=RDBMS TAG_HANG/SPIN TAG_P226 TAG_P46 TAG_RA205 TAG_REGRESSION HANG/SPIN P226 P46 RA205 REGRESSION FIXED_10.2.0.5.2 FIXED_11.1.0.6 FIXED_WIN:A205P05

 

Please note: The above is a summary description only. Actual symptoms can vary. Matching to any symptoms here does not confirm that you are encountering this problem. For questions about this bug please consult Oracle Support.

 

References

Bug:9949948 (This link will only work for PUBLISHED bugs)

Note:245840.1 Information on the sections in this article

Normal 0 7.8 磅 0 2 false false false EN-US ZH-CN X-NONE

转载于:http://blog.itpub.net/27064837/viewspace-750329/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值