Bugs and behaviors caused by ALTER SYSTEM SUSPEND/RESUME

Latest recommended article published 2022-06-22 10:08:42

conghuan4367

Oracle: Oracle Database 11g Enterprise Edition Release 11.1.0.6.0

OS: Red Hat Enterprise Linux, kernel 2.6.18-8.el5, x86

From the 11g documentation:

The ALTER SYSTEM SUSPEND statement halts all input and output (I/O) to datafiles
(file header and file data) and control files. The suspended state lets you back up a
database without I/O interference. When the database is suspended all preexisting
I/O operations are allowed to complete and any new database accesses are placed in a
queued state.

According to Bug 3620559 on MetaLink:

"CONNECT /AS SYSDBA" HANG AFTER SUSPEND

This bug is easy to reproduce on 11g on Linux. Either of the following two commands can trigger it:

Alter system flush shared_pool;

Alter system flush buffer_cache;

Test case:

Sessions 1 and 2 log in as SYSDBA.

session 1:

alter system flush buffer_cache;   -- or: alter system flush shared_pool;

session 2:

alter system suspend;

session 3:

sqlplus / as sysdba

-- it hangs on the wait event "writes stopped by instance recovery or database suspension".

Running pstack against the hung process shows:

oracle@HaoRedHat: ~ > pstack 3531
#0 0x00ad1402 in __kernel_vsyscall ()
#1 0x0021ab54 in semtimedop () from /lib/libc.so.6
#2 0x0e57691f in sskgpwwait ()
#3 0x0e5758ae in skgpwwait ()
#4 0x0e2c3a44 in ksliwat ()
#5 0x0e2c33b1 in kslwaitctx. ()
#6 0x0e2c06f1 in kslwait ()
#7 0x0af1a640 in kcbwwa ()
#8 0x087fe162 in kcbzib ()
#9 0x0e348bb7 in kcbgtcr ()
#10 0x0e2ed288 in ktecgsc ()
#11 0x0e2ebad0 in ktecgetsh. ()
#12 0x0e2eba3a in ktecgshx ()
#13 0x0e2edd10 in kteinicnt1 ()
#14 0x0e4b668f in qertbFetch ()
#15 0x00000004 in ?? ()
#16 0x00000000 in ?? ()

The stack contains kcb calls. From the MetaLink bug note:

kcbzib KCB: input buffer - reads a block from disk into a buffer
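As a rough illustration of how the kcb frames can be picked out of a pstack dump, here is a minimal Python sketch. The helper names (`frames`, `kcb_frames`) are hypothetical, the sample is an abbreviated copy of the stack above, and the regular expression only assumes the `#N 0xADDR in name ()` layout that pstack printed here.

```python
import re

# Frames copied (abbreviated) from the pstack output above.
PSTACK = """\
#6 0x0e2c06f1 in kslwait ()
#7 0x0af1a640 in kcbwwa ()
#8 0x087fe162 in kcbzib ()
#9 0x0e348bb7 in kcbgtcr ()
#10 0x0e2ed288 in ktecgsc ()
"""

# Assumed layout: "#<frame> 0x<address> in <function> (...)".
FRAME = re.compile(r"^#(\d+)\s+0x[0-9a-f]+ in (\S+) \(")

def frames(text):
    """Return (frame_number, function_name) pairs found in pstack output."""
    found = []
    for line in text.splitlines():
        match = FRAME.match(line)
        if match:
            found.append((int(match.group(1)), match.group(2)))
    return found

def kcb_frames(text):
    """Return the function names that start with the 'kcb' prefix."""
    return [name for _, name in frames(text) if name.startswith("kcb")]

print(kcb_frames(PSTACK))  # ['kcbwwa', 'kcbzib', 'kcbgtcr']
```

Seeing kcbzib directly above kcbwwa and kslwait is what suggests the session is waiting while trying to read a block from disk.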

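So "connect / as sysdba" itself performs physical reads if the buffer cache or the shared pool has just been flushed.

If, unluckily, every other SYSDBA connection has already been closed at this point, no session is left to issue alter system resume, and the only way to recover is to kill the Oracle processes (kill -9 is usually enough).

Then restart the database. The restart clears the suspend flag automatically, and select database_status from v$instance; shows ACTIVE.

I also tested how flush shared_pool and flush buffer_cache differ for an ordinary query:

1. Flush only the shared pool, not the buffer cache, then suspend the system and run: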

SQL> select * from test test2;

ID
----------
1

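The result is returned while the database is suspended, because the needed blocks are already in the buffer cache and the statement only has to be hard parsed.

So hard parses are allowed on a suspended system. As long as the data is in the buffer cache and no physical read is needed, the query can return its result.

The 10046 trace confirms this:

PARSE #3:c=1999,e=2147,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,tim=1232295633394990 -- mis=1: a hard parse occurred

STAT #3 id=1 cnt=1 pid=0 pos=1 obj=12632 op='TABLE ACCESS FULL TEST (cr=3 pr=0 pw=0 time=0 us cost=2 size=13 card=1)' -- pr=0 pw=0: no physical reads or writes, so the query can run.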
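The two counters that matter in these trace lines can be pulled out mechanically. The following Python sketch is only an illustration: the function names are hypothetical, and it assumes that `mis` counts library cache misses on the PARSE line and that `pr`/`pw` inside the parentheses of the row source text are physical reads and writes.

```python
import re

# PARSE line and row source text taken from the trace excerpt above.
PARSE_LINE = ("PARSE #3:c=1999,e=2147,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,"
              "tim=1232295633394990")
ROW_SOURCE = ("TABLE ACCESS FULL TEST "
              "(cr=3 pr=0 pw=0 time=0 us cost=2 size=13 card=1)")

def parse_fields(text):
    """Collect every name=<integer> pair in the text into a dict."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", text)}

def is_hard_parse(parse_line):
    """True when the PARSE line reports at least one library cache miss."""
    return parse_fields(parse_line)["mis"] > 0

def physical_io(row_source):
    """Return (physical reads, physical writes) from the parenthesized stats."""
    inner = re.search(r"\((.*)\)", row_source).group(1)
    fields = parse_fields(inner)
    return fields["pr"], fields["pw"]

print(is_hard_parse(PARSE_LINE))  # True
print(physical_io(ROW_SOURCE))    # (0, 0)
```

A hard parse combined with zero physical reads and writes is exactly the case that still completes under suspend.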

2. Flush only the buffer cache, not the shared pool, then suspend the system and run:

SQL> select * from test test2;

-- it hangs

From the 10046 trace:

PARSE #1:c=0,e=192,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,tim=1232295772815066 -- mis=0: no hard parse

The session re-checks every five seconds, and each time the following line is written:

WAIT #1: nam='writes stopped by instance recovery or database suspension' ela= 5008936 by thread#=2147483647 our thread#=1 p3=0 obj#=12453 tim=1232295807892952

After the system is resumed:

STAT #1 id=1 cnt=1 pid=0 pos=1 obj=12632 op='TABLE ACCESS FULL TEST (cr=3 pr=2 pw=2 time=0 us cost=2 size=13 card=1)' -- pr=2 pw=2: two physical reads and two physical writes
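The five-second interval can be read straight off the WAIT line. The Python sketch below is a hedged illustration, not part of the original test: `parse_wait` is a hypothetical helper, and it assumes that `ela` is reported in microseconds, as is usual for 10046 traces on this release.

```python
import re

# WAIT line copied from the trace excerpt above.
WAIT_LINE = ("WAIT #1: nam='writes stopped by instance recovery or database "
             "suspension' ela= 5008936 by thread#=2147483647 our thread#=1 "
             "p3=0 obj#=12453 tim=1232295807892952")

WAIT_RE = re.compile(r"WAIT #(\d+): nam='([^']*)' ela= *(\d+)")

def parse_wait(line):
    """Return cursor number, event name and elapsed seconds, or None."""
    match = WAIT_RE.match(line)
    if match is None:
        return None
    cursor, event, ela_us = match.groups()
    # Assumption: ela is in microseconds.
    return {"cursor": int(cursor), "event": event,
            "seconds": int(ela_us) / 1_000_000}

wait = parse_wait(WAIT_LINE)
print(wait["event"])
print(round(wait["seconds"]))  # 5
```

Each occurrence of this line therefore represents roughly one five-second timeout on the suspension wait.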

3. After suspending, run:

SQL> insert into test values (2);

1 row created.

SQL> commit;

-- it hangs.

4. After suspending, run:

SQL> delete from test;

3 rows deleted.

SQL> exit

-- it hangs.

After resume, the 10046 trace shows:

WAIT #0: nam='log file sync' ela= 26545 buffer#=1978 p2=0 p3=0 obj#=12632 tim=1232298062864245

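These two examples show that a session hangs on commit or on a normal exit (SQL*Plus commits on exit). The 10046 trace shows that the hang is the "log file sync" wait:

the redo buffer has to be written to the redo log before the transaction can complete, and that is a physical write.

In summary, an important property of ALTER SYSTEM SUSPEND is that it blocks all physical file reads and writes, and the files affected include not only datafiles and control files but also redo logs.

Data that is already in the buffer cache can still be read and modified logically.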

Source: ITPUB blog, http://blog.itpub.net/15415488/viewspace-541353/. If you repost this article, credit the source; otherwise legal action will be pursued.