After upgrading from 9i to 10g, the database became unstable: roughly every 20 hours the instance crashed.
After a restart it would run normally for a while, then crash again, over and over.
Tracking the problem down was hard work; my initial feeling was that we had hit a bug.
Check the alert log and the trace files:
Log from the first crash:
Thread 1 cannot allocate new log, sequence 204385
Checkpoint not complete
Current log# 2 seq# 204384 mem# 0: /oradata/mon/redo02.log
Sat May 26 13:54:14 2012
Thread 1 advanced to log sequence 204385 (LGWR switch)
Current log# 1 seq# 204385 mem# 0: /oradata/mon/redo01.log
Sat May 26 13:56:38 2012
Errors in file /oracle/admin/mon/bdump/mon_pmon_902.trc:
ORA-07445: exception encountered: core dump [ksuxdp()+393] [SIGSEGV] [unknown code] [0x000000000] [] []
Sat May 26 13:56:48 2012
PSP0: terminating instance due to error 472
Instance terminated by PSP0, pid = 904
Sat May 26 13:57:42 2012
Starting ORACLE instance (normal)
LICENSE_MAX_SESSION = 0
LICENSE_SESSIONS_WARNING = 0
Picked latch-free SCN scheme 3
Autotune of undo retention is turned on.
IMODE=BR
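As an aside, the "Checkpoint not complete" messages above mean the redo logs were cycling faster than checkpoints could finish; it is separate from the crash, but worth checking. A sketch of the usual checks against standard v$ views:

```sql
-- Log switches per hour over the last day
SELECT TO_CHAR(first_time, 'YYYY-MM-DD HH24') AS hour,
       COUNT(*)                               AS switches
  FROM v$log_history
 WHERE first_time > SYSDATE - 1
 GROUP BY TO_CHAR(first_time, 'YYYY-MM-DD HH24')
 ORDER BY 1;

-- Current redo log sizes and status; frequent "Checkpoint not complete"
-- usually means the groups are too few or too small
SELECT group#, bytes/1024/1024 AS mb, status FROM v$log;
```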
Log from the second crash:
Errors in file /oracle/admin/mon/bdump/mon_pmon_15360.trc:
ORA-07445:
exception encountered: core dump [ksuxdp()+393] [SIGSEGV] [unknown code] [0x000000000] [] []
Sun May 27 18:38:48 2012
ARC1: terminating instance due to error 472
Instance terminated by ARC1, pid = 15445
Log from the third crash:
Mon May 28 02:34:16 2012
Errors in file /oracle/admin/mon/bdump/mon_pmon_18798.trc:
ORA-07445: exception encountered: core dump [ksuxdp()+431] [SIGSEGV] [Address not mapped to object] [0x000000008] [] []
Mon May 28 02:34:20 2012
CJQ0: terminating instance due to error 472
vi /oracle/admin/mon/bdump/mon_pmon_18798.trc
*** 2012-05-28 02:34:16.457
*** SERVICE NAME:(SYS$BACKGROUND) 2012-05-28 02:34:16.457
*** SESSION ID:(555.1) 2012-05-28 02:34:16.457
Pseudo-process owns s.o. 0xd9b61888 which is not a session:
Exception signal: 11 (SIGSEGV), code: 1 (Address not mapped to object), addr: 0x8, PC: [0x7968af, ksuxdp()+431]
*** 2012-05-28 02:34:16.463
ksedmp: internal or fatal error
ORA-07445: exception encountered: core dump [ksuxdp()+431] [SIGSEGV] [Address not mapped to object] [0x000000008] [] []
----- Call Stack Trace -----
calling call entry argument values in hex
location type point (? means dubious value)
-------------------- -------- -------------------- ----------------------------
ksedst()+31 call ksedst1() 000000000 ? 000000001 ?
2A975B6D20 ? 2A975B6D80 ?
2A975B6CC0 ? 000000000 ?
ksedmp()+610 call ksedst() 000000000 ? 000000001 ?
2A975B6D20 ? 2A975B6D80 ?
2A975B6CC0 ? 000000000 ?
ssexhd()+629 call ksedmp() 000000003 ? 000000001 ?
2A975B6D20 ? 2A975B6D80 ?
2A975B6CC0 ? 000000000 ?
__pthread_sighandle call ssexhd() 00000000B ? 2A975B7D70 ?
r_rt()+104 2A975B7C40 ? 2A975B6D80 ?
2A975B6CC0 ? 000000000 ?
killpg()+48 call __pthread_sighandle 2A971FA4E0 ? 2A975B7D70 ?
r_rt() 2A975B7C40 ? 2A975B6D80 ?
2A975B6CC0 ? 000000000 ?
--note this frame: it looks thread-related
ksuxdp()+431 signal killpg() 0066876E0 ? 0055EC954 ?
000000000 ? 000000000 ?
000000000 ? 000000001 ?
PROCESS STATE
-------------
Process global information:
process: 0xdf3a8088, call: 0xdf5da340, xact: (nil), curses: 0xdc5a7f10, usrses: 0xdc5a7f10
----------------------------------------
SO: 0xdf3a8088, type: 2, owner: (nil), flag: INIT/-/-/0x00
(process) Oracle pid=2, calls cur/top: 0xdf5da340/0xdf5da340, flag: (e) SYSTEM
int error: 0, call error: 0, sess error: 0, txn error 0
(post info) last post received: 0 0 48
last post received-location: ksoreq_reply
last process to post me: df3ae750 154 0
last post sent: 0 0 24
last post sent-location: ksasnd
last process posted by me: dc3b1128 1 6
(latch info) wait_event=0 bits=1
holding (efd=4) df423fa0 Child session switching level=0 child#=1
Location from where latch is held: ksuxdp:
Context saved from call: 0
state=busy, wlstate=free
waiters [orapid (seconds since: put on list, posted, alive check)]: --note: process 17 is shown waiting here
17 (0, 1338143656, 0)
waiter count=1
Process Group: DEFAULT, pseudo proc: 0xdc42d180
O/S info: user: oracle, term: UNKNOWN, ospid: 18798
OSD pid info: Unix process pid: 18798, image: oracle@monstd (PMON)
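The state dump shows PMON (Oracle pid=2) holding a child of the session switching latch from ksuxdp, with process 17 queued behind it. On a live instance the same holder/waiter picture can be sampled with standard views (the latch name is inferred from the "Child session switching" line above):

```sql
-- Current latch holders (pid is the Oracle process number, e.g. 2 = PMON here)
SELECT pid, laddr, name FROM v$latchholder;

-- Contention statistics for the latch named in the trace
SELECT name, gets, misses, sleeps
  FROM v$latch
 WHERE name = 'session switching';
```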
Continuing to dig into process 17, I found its wait event: virtual circuit status
6CA123F6:09DA9E24 17 0 10005 4 KSL POST SENT postee=15 loc='kmcpdp' id1=0 id2=0 name= type=0
6CA12431:09DA9E25 17 0 10005 1 KSL WAIT BEG [virtual circuit status] -1/0xffffffffffffffff 2/0x2 0/0x0
6CA1294F:09DA9E2C 17 0 10005 2 KSL WAIT END [virtual circuit status] -1/0xffffffffffffffff 2/0x2 0/0x0 time=1311
6CA12951:09DA9E2D 17 0 10005 3 KSL POST RCVD poster=15 loc='kmcmbf: not KMCVCFTOS' id1=0 id2=0 name= type=0 fac#=5 facpost=1
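The KSL history above comes from the trace file; on a running instance, sessions stuck on this event can be found through standard v$ views (a sketch; the pid 17 filter mirrors the waiter seen in the latch dump):

```sql
-- Sessions currently waiting on the shared-server circuit event
SELECT sid, event, state, seconds_in_wait
  FROM v$session_wait
 WHERE event = 'virtual circuit status';

-- Map an Oracle process number (e.g. 17 above) to its OS pid and session
SELECT p.pid, p.spid, s.sid, s.program
  FROM v$process p JOIN v$session s ON s.paddr = p.addr
 WHERE p.pid = 17;
```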
The wait event 'virtual circuit status' is related to shared server.
There are articles describing how the 'virtual circuit status' event can hang shared server connections. It is a bug, number 5324905, fixed in release 11.1.
The original explanation reads:
This is due to unpublished bug 5324905, fixed in 11.1. A dispatcher process may corrupt the SGA causing the shared server connections
to hang or the database to crash. This can also be preceded by the same dispatcher process taking up a large amount of CPU.
Solution:
Use dedicated connections.
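For reference, a client-side tnsnames.ora entry forcing a dedicated connection might look like the following (the host and service name are guesses based on the paths in the logs above; substitute your own):

```text
# SERVER = DEDICATED bypasses the dispatchers even while they are configured
MON =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = monstd)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = mon)
    )
  )
```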
The adjustments made this time:
Check .bash_profile
Comment out the following two environment variables left over from the 9i setup:
#export LD_ASSUME_KERNEL=2.4.1 --lowers the assumed kernel version
#export THREADS_FLAG=native
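A small sketch of the edit, shown here on a throwaway copy in /tmp so it is safe to run; in practice apply the same sed to ~/.bash_profile after backing it up:

```shell
# Demo file standing in for ~/.bash_profile
profile=/tmp/bash_profile_demo
printf 'export LD_ASSUME_KERNEL=2.4.1\nexport THREADS_FLAG=native\n' > "$profile"

# Comment out both 9i-era variables (GNU sed; use sed -i '' on macOS)
sed -i -e 's/^export LD_ASSUME_KERNEL=/#&/' \
       -e 's/^export THREADS_FLAG=/#&/' "$profile"

cat "$profile"
```

Remember the variables also stay set in any already-running shell, so log out and back in (or unset them) before restarting the instance.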
Disable shared server mode (note the parameter is SHARED_SERVERS, plural):
alter system set shared_servers=0 scope=both;
alter system set dispatchers='' scope=both;
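After the change, it can be confirmed that no dispatchers remain and new sessions come in dedicated (standard views, run as SYSDBA; existing dispatchers wind down once idle, and a restart guarantees a clean state):

```sql
SHOW PARAMETER shared_servers
SHOW PARAMETER dispatchers

-- Should return no rows once the dispatchers are gone
SELECT name, status FROM v$dispatcher;

-- The server column should read DEDICATED for user sessions
SELECT username, server FROM v$session WHERE username IS NOT NULL;
```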
After these adjustments, the database has not misbehaved again.
Supporting material (from the MOS note):
Symptoms
Active shared server connections hang waiting on 'virtual circuit status' wait event, and no new shared server connections can be made. Only dedicated server connections are possible.
There may be several internal errors raised in the alertlog prior to this also, particularly ORA-7445 errors with various arguments which show indications of an SGA memory corruption. For example:
*** glibc detected *** corrupted double-linked list: 0x00000000067e64c0 ***
Killing the active shared server connections doesn't help resolve the problem.
Cause
This is due to unpublished bug 5324905, fixed in 11.1. A dispatcher process may corrupt the SGA causing the shared server connections to hang or the database to crash. This can also be preceded by the same dispatcher process taking up a large amount of CPU.
Solution
Choose from one of the following solutions:
1. Download and apply the one-off Patch 5324905 from MetaLink, if available.
2. The problem can temporarily be cleared by restarting the instance to clear and reset the SGA memory.
3. Workaround: Use dedicated connections.
4. This issue is fixed in 10.2.0.4 (Server Patch Set) and 11.1.0.6 (Base Release).
"shared server idle wait" - for when the shared server is idle waiting for something to do
"virtual circuit wait" - for when the shared server is blocked waiting on a specific circuit / message
"virtual circuit wait" - for when the shared server is blocked waiting on a specific circuit / message
Source: ITPUB blog, http://blog.itpub.net/10173379/viewspace-731911/ (credit the source when reprinting).