shared server导致7445错误

从9i升级到10g后,数据库变得不稳定,每隔20小时左右,instance crash,
重起数据库后,又暂时能正常运行。过一段时间又会crash,如此反复。
问题的查找比较辛苦,初步感觉遇到了bug.
查看alert日志,和trace文件:
第一次的日志:
Thread 1 cannot allocate new log, sequence 204385
Checkpoint not complete
  Current log# 2 seq# 204384 mem# 0: /oradata/mon/redo02.log
Sat May 26 13:54:14 2012
Thread 1 advanced to log sequence 204385 (LGWR switch)
  Current log# 1 seq# 204385 mem# 0: /oradata/mon/redo01.log
Sat May 26 13:56:38 2012
Errors in file /oracle/admin/mon/bdump/mon_pmon_902.trc:
ORA-07445: exception encountered: core dump [ksuxdp()+393] [SIGSEGV] [unknown code] [0x000000000] [] []
Sat May 26 13:56:48 2012
PSP0: terminating instance due to error 472
Instance terminated by PSP0, pid = 904
Sat May 26 13:57:42 2012
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-ree SCN scheme 3
Autotune of undo retention is turned on.
IMODE=BR
第二次的日志:
Errors in file /oracle/admin/mon/bdump/mon_pmon_15360.trc:
ORA-07445:
exception encountered: core dump [ksuxdp()+393] [SIGSEGV] [unknown code] [0x000000000] [] []
Sun May 27 18:38:48 2012
ARC1: terminating instance due to error 472
Instance terminated by ARC1, pid = 15445

第三次的日志:
Mon May 28 02:34:16 2012
Errors in file /oracle/admin/mon/bdump/mon_pmon_18798.trc:
ORA-07445: exception encountered: core dump [ksuxdp()+431] [SIGSEGV] [Address not mapped to object] [0x000000008] [] []
Mon May 28 02:34:20 2012
CJQ0: terminating instance due to error 472
 
 
vi /oracle/admin/mon/bdump/mon_pmon_18798.trc
*** 2012-05-28 02:34:16.457
*** SERVICE NAME:(SYS$BACKGROUND) 2012-05-28 02:34:16.457
*** SESSION ID:(555.1) 2012-05-28 02:34:16.457
Pseudo-process owns s.o. 0xd9b61888 which is not a session:
Exception signal: 11 (SIGSEGV), code: 1 (Address not mapped to object), addr: 0x8, PC: [0x7968af, ksuxdp()+431]
*** 2012-05-28 02:34:16.463
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [ksuxdp()+431] [SIGSEGV] [Address not mapped to object] [0x000000008] [] []
----- Call Stack Trace -----
calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+31          call     ksedst1()            000000000 ? 000000001 ?
                                                   2A975B6D20 ? 2A975B6D80 ?
                                                   2A975B6CC0 ? 000000000 ?
ksedmp()+610         call     ksedst()             000000000 ? 000000001 ?
                                                   2A975B6D20 ? 2A975B6D80 ?
                                                   2A975B6CC0 ? 000000000 ?
ssexhd()+629         call     ksedmp()             000000003 ? 000000001 ?
                                                   2A975B6D20 ? 2A975B6D80 ?
                                                   2A975B6CC0 ? 000000000 ?
__pthread_sighandle  call     ssexhd()             00000000B ? 2A975B7D70 ?
r_rt()+104                                         2A975B7C40 ? 2A975B6D80 ?
                                                   2A975B6CC0 ? 000000000 ?
killpg()+48          call     __pthread_sighandle  2A971FA4E0 ? 2A975B7D70 ?
                              r_rt()               2A975B7C40 ? 2A975B6D80 ?
                                                   2A975B6CC0 ? 000000000 ?
                           --注意这里,估计和线程有关                       
ksuxdp()+431         signal   killpg()             0066876E0 ? 0055EC954 ?
                                                   000000000 ? 000000000 ?
                                                   000000000 ? 000000001 ?
 
PROCESS STATE
-------------
Process global information:
     process: 0xdf3a8088, call: 0xdf5da340, xact: (nil), curses: 0xdc5a7f10, usrses: 0xdc5a7f10
  ----------------------------------------
  SO: 0xdf3a8088, type: 2, owner: (nil), flag: INIT/-/-/0x00
  (process) Oracle pid=2, calls cur/top: 0xdf5da340/0xdf5da340, flag: (e) SYSTEM
            int error: 0, call error: 0, sess error: 0, txn error 0
  (post info) last post received: 0 0 48
              last post received-location: ksoreq_reply
              last process to post me: df3ae750 154 0
              last post sent: 0 0 24
              last post sent-location: ksasnd
              last process posted by me: dc3b1128 1 6
    (latch info) wait_event=0 bits=1
      holding    (efd=4) df423fa0 Child session switching level=0 child#=1
        Location from where latch is held: ksuxdp:
        Context saved from call: 0
        state=busy, wlstate=free
          waiters [orapid (seconds since: put on list, posted, alive check)]:    --注意这里显示进程17有等待
           17 (0, 1338143656, 0)
           waiter count=1
    Process Group: DEFAULT, pseudo proc: 0xdc42d180
    O/S info: user: oracle, term: UNKNOWN, ospid: 18798
    OSD pid info: Unix process pid: 18798, image: oracle@monstd (PMON)
继续查找进程17的相关信息,找到等待事件:virtual circuit status
6CA123F6:09DA9E24    17     0 10005   4 KSL POST SENT postee=15 loc='kmcpdp' id1=0 id2=0 name=   type=0
6CA12431:09DA9E25    17     0 10005   1 KSL WAIT BEG [virtual circuit status] -1/0xffffffffffffffff 2/0x2 0/0x0
6CA1294F:09DA9E2C    17     0 10005   2 KSL WAIT END [virtual circuit status] -1/0xffffffffffffffff 2/0x2 0/0x0 time=1311
6CA12951:09DA9E2D    17     0 10005   3 KSL POST RCVD poster=15 loc='kmcmbf: not KMCVCFTOS' id1=0 id2=0 name=   type=0 fac#=5 facpost=1
                 
等待事件:virtual circuit status和shared server有关。
有相关文章介绍,virtual circuit status事件能引起shared server 连接挂起。是一个bug,bug号:5324905,在11.1版本已解决此问题。
以下是原文解释:
This is due to unpublished bug 5324905, fixed in 11.1.  A dispatcher process may corrupt the SGA causing the shared server connections
to hang or the database to crash.  This can also be preceded by the same dispatcher process taking up a large amount of CPU.
解决方法:
使用专用连接
               
               
以下是此次做的调整:             
检查.bash_profile
注释掉以下9i时设置的两个环境变量:
#export LD_ASSUME_KERNEL=2.4.1    --降低内核版本
#export THREADS_FLAG=native

取消shared_server模式:
alter system set shared_server=0 scope=both;
alter system set dispatchers='' scope=both;
 
调整后,数据库未再出现异常。                  
 
理论支持:

Symptoms
Active shared server connections hang waiting on 'virtual circuit status' wait event, and no new shared server connections can be made. Only dedicated server connections are possible. 
There may be several internal errors raised in the alertlog prior to this also, particularly ORA-7445 errors with various arguments which show indications of an SGA memory corruption.  For example:
    *** glibc detected *** corrupted double-linked list: 0x00000000067e64c0 ***
Killing the active shared server connections doesn't help resolve the problem.

Cause
This is due to unpublished bug 5324905, fixed in 11.1.  A dispatcher process may corrupt the SGA causing the shared server connections to hang or the database to crash.  This can also be preceded by the same dispatcher process taking up a large amount of CPU.
Solution
Choose from one of the following solutions:
1. Download and apply the one-off Patch 5324905 from MetaLink, if available.
2. The problem can temporarily be cleared by restarting the instance to clear and reset the SGA memory.
3. Workaround: Use dedicated connections.
4. This issue is fixed in 10.2.0.4 (Server Patch Set) and 11.1.0.6 (Base Release).
 
"shared server idle wait" - for when the shared server is idle waiting for something to do
 "virtual circuit wait"    - for when the shared server is blocked waiting on a specific circuit / message

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/10173379/viewspace-731911/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/10173379/viewspace-731911/

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值