现象:
*** 2013-04-11 16:50:00.413
*** SESSION ID:(874.1) 2013-04-11 16:50:00.413
*** CLIENT ID:() 2013-04-11 16:50:00.413
*** SERVICE NAME:(SYS$BACKGROUND) 2013-04-11 16:50:00.413
*** MODULE NAME:() 2013-04-11 16:50:00.413
*** ACTION NAME:() 2013-04-11 16:50:00.413
Dead transaction 0x0002.001.0001d07f recovered by SMON
Dead transaction 0x000f.015.0002815a recovered by SMON
Dead transaction 0x0013.01a.0001b755 recovered by SMON
Serial Transaction recovery caught exception 18
*** 2013-04-11 17:50:00.246
SMON: following errors trapped and ignored:
ORA-00018: maximum number of sessions exceeded
Serial Transaction recovery caught exception 18
*** 2013-04-11 17:50:00.284
SMON: following errors trapped and ignored:
ORA-00018: maximum number of sessions exceeded
Serial Transaction recovery caught exception 18
*** 2013-04-11 17:50:00.616
SMON: following errors trapped and ignored:
ORA-00018: maximum number of sessions exceeded
Serial Transaction recovery caught exception 18
*** 2013-04-11 17:50:01.223
SMON: following errors trapped and ignored:
ORA-00018: maximum number of sessions exceeded
*** 2013-04-11 17:50:02.224
Serial Transaction recovery caught exception 18
*** 2013-04-11 17:50:02.224
SMON: following errors trapped and ignored:
ORA-00018: maximum number of sessions exceeded
之后数据库异常Down机。
分析:
> set linesize 200
> select * from v$resource_limit;
RESOURCE_NAME CURRENT_UTILIZATION MAX_UTILIZATION INITIAL_ALLOCATION LIMIT_VALUE
------------------------------ ------------------- --------------- -------------------- --------------------
processes 142 167 800 800
sessions 141 178 885 885
enqueue_locks 252 283 10990 10990
enqueue_resources 136 169 4112 UNLIMITED
ges_procs 0 0 0 0
ges_ress 0 0 0 UNLIMITED
ges_locks 0 0 0 UNLIMITED
ges_cache_ress 0 0 0 UNLIMITED
ges_reg_msgs 0 0 0 UNLIMITED
ges_big_msgs 0 0 0 UNLIMITED
ges_rsv_msgs 0 0 0 0
gcs_resources 0 0 0 0
gcs_shadows 0 0 0 0
dml_locks 0 0 3892 UNLIMITED
temporary_table_locks 0 0 UNLIMITED UNLIMITED
transactions 0 6 973 UNLIMITED
branches 0 4 973 UNLIMITED
cmtcallbk 0 1 973 UNLIMITED
max_rollback_segments 24 24 973 65535
sort_segment_locks 26 43 UNLIMITED UNLIMITED
k2q_locks 0 0 1770 UNLIMITED
max_shared_servers 1 1 UNLIMITED UNLIMITED
parallel_max_servers 0 7 160 3600
23 rows selected.
Elapsed: 00:00:00.01
sessions这个参数设置应该没有问题,正常连接数也不可能有那么多,正常应用的连接数也不会超过400个会话,那为什么会出现ORA-00018
>oerr ora 00018
00018, 00000, "maximum number of sessions exceeded"
// *Cause: All session state objects are in use.
// *Action: Increase the value of the SESSIONS initialization parameter.
>
从trc中的Session Wait History可以发现问题:
----------------------------------------
SO: 0x70000023ce1aae0, type: 4, owner: 0x70000023c804c80, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x70000023c804c80, name=session, file=ksu.h LINE:10299 ID:, pg=0
(session) sid: 874 ser: 1 trans: 0x0, creator: 0x70000023c804c80
flags: (0x100051) USR/- flags_idl: (0x1) BSY/-/-/-/-/-
flags2: (0x408) -/-
DID: , short-term DID:
txn branch: 0x0
oct: 0, prv: 0, sql: 0x0, psql: 0x0, user: 0/SYS
Dumping Current Wait Stack:
Not in wait; last wait ended 113541 usecs ago
Wait State:
auto_close=0 flags=0x21 boundary=0x0/-1
Dumping Session Wait History:
0: waited for 'smon timer'
sleep time=1, failed=4, =0
wait_id=207423 seq_num=10823 snap_id=1
wait times (usecs) - snap=976580 exc=976580 total=976580
wait times (usecs) - max=1000000
wait counts (exc) - calls=1 os=1
occurred after 712 microseconds of elapsed time
1: waited for 'smon timer'
sleep time=1, failed=3, =0
wait_id=207422 seq_num=10822 snap_id=1
wait times (usecs) - snap=592311 exc=592311 total=592311
wait times (usecs) - max=1000000
wait counts (exc) - calls=1 os=1
occurred after 750 microseconds of elapsed time
2: waited for 'smon timer'
sleep time=1, failed=2, =0
wait_id=207421 seq_num=10821 snap_id=1
wait times (usecs) - snap=322703 exc=322703 total=322703
wait times (usecs) - max=1000000
wait counts (exc) - calls=1 os=1
occurred after 699 microseconds of elapsed time
3: waited for 'smon timer'
sleep time=1, failed=1, =0
wait_id=207420 seq_num=10820 snap_id=1
wait times (usecs) - snap=10 exc=10 total=10
wait times (usecs) - max=1000000
wait counts (exc) - calls=1 os=1
occurred after 37324 microseconds of elapsed time
4: waited for 'smon timer'
sleep time=12c, failed=0, =0
wait_id=207419 seq_num=10819 snap_id=1
wait times (usecs) - snap=5861823 exc=5861823 total=5861823
wait times (usecs) - max=300000000
wait counts (exc) - calls=1 os=3
occurred after 801 microseconds of elapsed time
5: waited for 'smon timer'
sleep time=12c, failed=0, =0
wait_id=207418 seq_num=10818 snap_id=1
wait times (usecs) - snap=5860203 exc=5860203 total=5860203
wait times (usecs) - max=300000000
wait counts (exc) - calls=1 os=3
occurred after 769 microseconds of elapsed time
6: waited for 'smon timer'
sleep time=12c, failed=0, =0
wait_id=207417 seq_num=10817 snap_id=1
wait times (usecs) - snap=2938013 exc=2938013 total=2938013
wait times (usecs) - max=300000000
wait counts (exc) - calls=1 os=2
occurred after 897 microseconds of elapsed time
7: waited for 'smon timer'
sleep time=12c, failed=0, =0
wait_id=207416 seq_num=10816 snap_id=1
wait times (usecs) - snap=2930246 exc=2930246 total=2930246
wait times (usecs) - max=300000000
wait counts (exc) - calls=1 os=2
occurred after 649 microseconds of elapsed time
8: waited for 'smon timer'
sleep time=12c, failed=0, =0
wait_id=207415 seq_num=10815 snap_id=1
wait times (usecs) - snap=334572 exc=334572 total=334572
wait times (usecs) - max=300000000
wait counts (exc) - calls=1 os=1
occurred after 723 microseconds of elapsed time
9: waited for 'smon timer'
sleep time=12c, failed=0, =0
wait_id=207414 seq_num=10814 snap_id=1
wait times (usecs) - snap=824057 exc=824057 total=824057
wait times (usecs) - max=300000000
wait counts (exc) - calls=1 os=1
occurred after 689 microseconds of elapsed time
temporary object counter: 0
----------------------------------------
Virtual Thread:
kgskvt: 7000002315a0f78, sess: 70000023ce1aae0, vc: 0, proc: 70000023c804c80, id: 874
consumer group cur: _ORACLE_BACKGROUND_GROUP_ (upd? 0), mapped: _ORACLE_BACKGROUND_GROUP_, orig:
vt_state: 0x100, vt_flags: 0x30, blkrun: 0
location where insched last set: kgskthrrun
location where insched last cleared: kgskthrrun1
is_assigned: 0, in_sched: 0 (0)
vt_active: 0 (pending: 0)
used quanta: 0 (cg: 0) usec
cpu start time: 0, quantum status: 0x0
quantum checks to skip: 0, check thresh: 0
idle time: 0, active time: 0 (cg: 0)
cpu yields: 0 (cg: 0), waits: 0 (cg: 0), wait time: 0 (cg: 0) usec
io waits: 0 (cg: 0), wait time: 0 (cg: 0) usec
queued time outs: 0, time: 0 (cur 0, cg 0)
calls aborted: 0, num est exec limit hit: 0
undo current: 0k max: 0k
I/O credits acquired:small=0 large=0
I/O credits waiting for:small=0 large=0
----------------------------------------
KKS-UOL used : 0 locks(used=6, free=6)
KGX Atomic Operation Log 700000237b92a40
Mutex 0(0, 0) idn 0 oper NONE
Cursor Parent uid 874 efd 10 whr 12 slp 0
oper=NONE pt1=0 pt2=0 pt3=0
pt4=0 u41=0 stt=0
KGX Atomic Operation Log 700000237b92a88
Mutex 0(0, 0) idn 0 oper NONE
Cursor Stat uid 874 efd 13 whr 7 slp 0
oper=NONE pt1=700000228b9ed20 pt2=0 pt3=7000001bb1373b0
pt4=0 u41=3799783794 stt=0
KGX Atomic Operation Log 700000237b92ad0
Mutex 700000097e58338(0, 0) idn fe09a879 oper NONE
Cursor Stat uid 874 efd 13 whr 2 slp 0
oper=NONE pt1=0 pt2=0 pt3=0
pt4=0 u41=0 stt=8
KGX Atomic Operation Log 700000237b92b18
Mutex 0(0, 0) idn 0 oper NONE
FSO mutex uid 874 efd 0 whr 0 slp 0
----------------------------------------
主机并不繁忙的情况怎么会failed
说明:smon timer是SMON进程的一些操作时每隔一段实际循环执行的,即使系统不忙,此事件也不立即发生,而是等待计时器达到一定的时间才执行,此时出现的smon timer 等待事件
回到开头的:
Dead transaction 0x0002.001.0001d07f recovered by SMON
Dead transaction 0x000f.015.0002815a recovered by SMON
Dead transaction 0x0013.01a.0001b755 recovered by SMON
Serial Transaction recovery caught exception 18
Oracle在恢复一个死事务,0x0002.001.0001d07f,这个事务使用的回滚段是0x0002号,也就是2号回滚段,可以看到在随后的日志中可以看到0x0013:
----------------------------------------
SO: 0x70000023e5fb148, type: 46, owner: 0x70000023337c9b0, flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x70000023c804c80, name=transaction, file=ktccts.h LINE:323 ID:, pg=0
(trans) flg = 0x04, flg2 = 0x00, flg3 = 0x00, prx = 0x70000023e6452c0, ros = 2147483647 bsn = 0x10e0ced bndsn = 0x10e0cfb spn = 0x10e0d03
efd = 5
parent xid: 0x0000.000.00000000
env: (scn: 0x0000.892fdf26 xid: 0x0013.017.0001b544 uba: 0x030161fd.4ad8.0a statement num=0 parent xid: xid: 0x0000.000.00000000 scn: 0x0000.892fdf25 0sch: scn: 0x0000.00000000 mascn: (scn: 0x0000.00000000)
cev: (spc = 6522 arsp = 0 ubk tsn: 2 rdba: 0x030161fd useg tsn: 2147483647 rdba: 0x00000000
hwm uba: 0x030161fd.4ad8.0a col uba: 0x00000000.0000.00
num bl: 0 bk list: 0x0)
cr opc: 0x0 spc: 6522 uba: 0x030161fd.4ad8.0a
ccbstg: 0x00000003
(enqueue) CU-953D9990-07000000 DID: 0001-000C-00000002
lv: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 res_flag: 0x0
res: 0x7000002313b5db8, lock_flag: 0x0
own: 0x70000023ce1aae0, sess: 0x70000023ce1aae0, prv: 0x70000023e5fb1d0
xga: 0x0, heap: UGA
Trans IMU st: 0 Pool index 65535, Redo pool 0x70000023e5fba80, Undo pool 0x70000023e5fbb68
Redo pool range [0x0 0x110583eb8 0x1105866b8]
Undo pool range [0x0 0x1105816b8 0x110583eb8]
chnf control flags 0x100 CHNF hwm uba uba: 0x00000000.0000.00
Dump of memory from 0x07000002010C1790 to 0x07000002010C1BC8
SMON的作用包括清理死事务:Recover Dead transaction。当服务进程在提交事务(commit)前就意外终止的话会形成死事务(dead transaction),PMON进程负责轮询Oracle进程,找出这类意外终止的死进程(dead process),通知SMON将与该dead process相关的dead transaction回滚清理,并且PMON还负责恢复dead process原本持有的锁和latch。
fast_start_parallel_rollback参数决定了SMON在回滚事务时使用的并行度,若将该参数设置为false那么并行回滚将被禁用,若设置为Low(默认值)那么会以2*CPU_COUNT数目的并行度回滚,当设置为High则4*CPU_COUNT数目的回滚进程将参与进来。
> show parameter FAST_START_PARALLEL_ROLLBACK
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
fast_start_parallel_rollback string LOW
>
--alter system set FAST_START_PARALLEL_ROLLBACK=FALSE;
是不是fast_start_parallel_rollback=LOW清理死事务时出了BUG?
BUG
Bug 14357521 : TRANSACTION RECOVERY: LOCK CONFLICT CAUGHT AND IGNORED
Bug 11693365 - Concurrent Droptable and Select on Reference constraint table hangs (deadlock) [ID 11693365.8]
该bug的现象:
1) Deadlock
2) Hang(Process Hang)
3) Waits for "library cachelock"
4) Waits for "row cachelock"
5) Stack is likely to includedtbdrp
可以使用如下SQL 查看SMON进程处理事务recovery的进度:
SET LINESIZE 100
ALTER SESSIONSETNLS_DATE_FORMAT='DD-MON-YYYYHH24:MI:SS';
SELECT usn,
state,
undoblockstotal "Total",
undoblocksdone "Done",
undoblockstotal - undoblocksdone"ToDo",
DECODE(
cputime,
0, 'unknown',
SYSDATE
+( ( (undoblockstotal-undoblocksdone)
/ (undoblocksdone/cputime))
/86400))
"Estimatedtime to complete"
FROM v$fast_start_transactions;
在有些版本里,cputime 参数值不正常,一直为0,因此不能估算进度。
在某些案例中,v$fast_start_transactions视图也不能正常显示,不过还可以查询oracle 内部的数据字典:x$ktuxe。 该字典代表rollback 需要剩余的undo blocks的数量。
SELECT ktuxeusn,
TO_CHAR(SYSDATE,'DD-MON-YYYYHH24:MI:SS') "Time",
ktuxesiz,
ktuxesta
FROM x$ktuxe
WHEREktuxecfl = 'DEAD';