在线重定义引起的数据库挂起

最新推荐文章于 2022-09-29 20:59:36 发布

ckawt40802

最新推荐文章于 2022-09-29 20:59:36 发布

阅读量185

点赞数

文章标签：数据库操作系统网络

一个客户的数据库突然出现了HANG死的情况，连SQLPLUS都无法正常连接到数据库。

数据库版本为11.2.0.2 for Linux x86-64。客户由于要迅速解决问题，于是采用了重启数据库的方式，重启后问题解决。

事后通过检查数据库的trace信息以及询问客户dba的操作，基本上定位了导致问题的原因。

其实事后诊断是一个很头痛的问题，因为一旦重启，很多信息就丢失了。所有V$视图中的信息都被重置，想要使用AWR来获取信息，也会发现由于数据库重启，实际上没有办法对重启前时刻的数据库状态进行分析。

好在所有的TRACE信息和告警日志都可以检查到，首先检查告警日志信息：

Wed Jul 06 14:34:53 2011
Errors in file /opt/oracle/diag/rdbms/mobiledb_n/mobiledb/trace/mobiledb_cjq0_19642.trc (incident=36249):
ORA-00445: background process "J002" did not start after 120 seconds
Incident details in: /opt/oracle/diag/rdbms/mobiledb_n/mobiledb/incident/incdir_36249/mobiledb_cjq0_19642_i36249.trc
Wed Jul 06 14:35:21 2011
Dumping diagnostic data in directory=[cdmp_20110706143521], requested by (instance=1, sid=19642 (CJQ0)), summary=[incident=36249].
Wed Jul 06 14:35:30 2011
kkjcre1p: unable to spawn jobq slave process
Errors in file /opt/oracle/diag/rdbms/mobiledb_n/mobiledb/trace/mobiledb_cjq0_19642.trc:
Wed Jul 06 14:35:53 2011
PMON failed to acquire latch, see PMON dump
Wed Jul 06 14:36:57 2011
ORA-00020: No more process state objects available
ORA-20 errors will not be written to the alert log for
the next minute. Please look at trace files to see all
the ORA-20 errors.
Process m000 submission failed with error = 20
.
.
.
***********************************************************************

Fatal NI connect error 12170.

VERSION INFORMATION:
TNS for Linux: Version 11.2.0.2.0 - Production
Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.2.0 - Production
TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.2.0 - Production
Time: 06-JUL-2011 14:38:31
Tracing not turned on.
Tns error struct:
ns main err code: 12535

TNS-12535: TNS:operation timed out
ns secondary err code: 12606
nt main err code: 0
nt secondary err code: 0
nt OS err code: 0
Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=192.168.90.102)(PORT=56704))

***********************************************************************

.
.
.
Wed Jul 06 14:38:15 2011
WARNING: inbound connection timed out (ORA-3136)
Wed Jul 06 14:38:15 2011
WARNING: inbound connection timed out (ORA-3136)
Wed Jul 06 14:38:18 2011
WARNING: inbound connection timed out (ORA-3136)
.
.
.

首先报错的是JOB相关的进程，J002进程在指定时间内没有启动：

ORA-00445: background process "J002" did not start after 120 seconds

随后CJQ0进程在尝试启动JOB进程时报错：

kkjcre1p: unable to spawn jobq slave process

接着PMON进程报错，无法获取LATCH：

PMON failed to acquire latch, see PMON dump

之后出现了ORA-20相关的错误：

ORA-00020: No more process state objects available
ORA-20 errors will not be written to the alert log for
the next minute. Please look at trace files to see all
the ORA-20 errors.
Process m000 submission failed with error = 20

最后基本上是满屏的12170、TNS-12535和ORA-3136错误。

最后的网络连接超时相关的12170、ORA-3136等，都是问题产生后的现象，而并非问题的原因，同样前面的J000和CJQ0的报错，同样也是由于其他问题导致了后台进程启动异常。

那么这里最明显的错误就是ORA-20错误，也就是说系统中的进程数超过了初始化参数PROCESSES的设置。当前这个参数设置为1500，而正常情况下，这个数据库中的连接会话数绝不会超过500。显然，虽然随后的网络连接超时错误，以及后台JOB启动的错误可能都是由于ORA-20错误导致的，但是当前数据库中进程超过参数设置这绝对不是导致问题的根本原因，而同样也是问题的表相，一定是其他的问题导致了数据库中进程数的暴涨。

剩下的只有一个PMON无法获取LATCH的信息了，这并不是一个ORA开头的错误信息，而实际上应该算是一个告警信息，但是这个告警所传递的信息却很重要。根据这个告警信息查询METALINK，没有发现和当前现象相似的案例。

由于错误信息与PMON进程有关，因此继续检查PMON进程：

Trace file /opt/oracle/diag/rdbms/mobiledb_n/mobiledb/trace/mobiledb_pmon_19539.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /opt/oracle/product/11.2.0/db
System name: Linux
Node name: Oracle112
Release: 2.6.32-71.el6.x86_64
Version: #1 SMP Wed Dec 15 12:36:54 EST 2010
Machine: x86_64
Instance name: mobiledb
Redo thread mounted by this instance: 1
Oracle process number: 2
Unix process pid: 19539, image: oracle@Oracle112 (PMON)

*** 2011-07-06 14:35:52.888
*** SESSION ID:(285.1) 2011-07-06 14:35:52.888
*** CLIENT ID:() 2011-07-06 14:35:52.888
*** SERVICE NAME:(SYS$BACKGROUND) 2011-07-06 14:35:52.888
*** MODULE NAME:() 2011-07-06 14:35:52.888
*** ACTION NAME:() 2011-07-06 14:35:52.888

PMON unable to acquire latch 60108140 Child shared pool level=7 child#=3
Location from where latch is held: kgh.h LINE:6383 ID:kghfsh:
Context saved from call: 0
state=busy [holder rapid=63] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
67 (146, 1309934152, 124)
80 (146, 1309934152, 122)
94 (146, 1309934152, 121)
91 (146, 1309934152, 121)
84 (146, 1309934152, 121)
98 (146, 1309934152, 121)
99 (146, 1309934152, 121)
46 (146, 1309934152, 146)
95 (146, 1309934152, 146)
111 (146, 1309934152, 121)
112 (146, 1309934152, 121)
115 (146, 1309934152, 120)
118 (146, 1309934152, 120)
119 (146, 1309934152, 119)
120 (146, 1309934152, 119)
121 (146, 1309934152, 119)
124 (146, 1309934152, 118)
123 (146, 1309934152, 118)
128 (146, 1309934152, 118)
127 (146, 1309934152, 118)
129 (146, 1309934152, 117)
131 (146, 1309934152, 117)
130 (146, 1309934152, 117)
134 (146, 1309934152, 116)
135 (146, 1309934152, 116)
136 (146, 1309934152, 116)
139 (146, 1309934152, 115)
138 (146, 1309934152, 115)
47 (146, 1309934152, 146)
141 (146, 1309934152, 112)
142 (146, 1309934152, 104)
140 (146, 1309934152, 113)
144 (146, 1309934152, 100)
147 (146, 1309934152, 94)
146 (146, 1309934152, 94)
148 (146, 1309934152, 92)
150 (146, 1309934152, 89)
152 (146, 1309934152, 88)
153 (146, 1309934152, 88)
155 (146, 1309934152, 88)
156 (146, 1309934152, 88)
154 (146, 1309934152, 88)
157 (146, 1309934152, 88)
.
.
.
1485 (88, 1309934152, 28)
1496 (88, 1309934152, 28)
1487 (88, 1309934152, 28)
1494 (88, 1309934152, 28)
1499 (88, 1309934152, 28)
1490 (87, 1309934152, 28)
1498 (87, 1309934152, 28)
1495 (87, 1309934152, 28)
1497 (87, 1309934152, 28)
65 (87, 1309934152, 87)
83 (47, 1309934152, 47)
waiter count=1408
gotten 92338703 times wait, failed first 49620 sleeps 6005
gotten 0 times nowait, failed: 0
Short stack dump:
ksedsts()+461 ----------------------------------------
SO: 0x1588acc90, type: 2, owner: (nil), flag: INIT/-/-/0x00 if: 0x3 c: 0x3
proc=0x1588acc90, name=process, file=ksu.h LINE:12451, pg=0
(process) Oracle pid:63, ser:85, calls cur/top: 0x147d554d8/0x147d554d8
flags : (0x0) -
flags2: (0x0), flags3: (0x0)
intr error: 0, call error: 0, sess error: 0, txn error 0
intr queue: empty
ksudlp FALSE at location: 0
(post info) last post received: 136 0 2
last post received-location: ksl2.h LINE:2293 ID:kslpsr
last process to post me: 15f820b38 1 6
last post sent: 0 0 26
last post sent-location: ksa2.h LINE:282 ID:ksasnd
last process posted by me: 15f820b38 1 6
(latch info) wait_event=0 bits=80
holding (efd=15) 60108140 Child shared pool level=7 child#=3
Location from where latch is held: kgh.h LINE:6383 ID:kghfsh:
Context saved from call: 0
state=busy [holder rapid=63] wlstate=free [value=0]
waiters [orapid (seconds since: put on list, posted, alive check)]:
67 (146, 1309934152, 124)
80 (146, 1309934152, 122)
94 (146, 1309934152, 121)
91 (146, 1309934152, 121)
84 (146, 1309934152, 121)
98 (146, 1309934152, 121)
99 (146, 1309934152, 121)
46 (146, 1309934152, 146)
95 (146, 1309934152, 146)
111 (146, 1309934152, 121)
112 (146, 1309934152, 121)
115 (146, 1309934152, 120)
118 (146, 1309934152, 120)
119 (146, 1309934152, 119)
120 (146, 1309934152, 119)
121 (146, 1309934152, 119)
124 (146, 1309934152, 118)
123 (146, 1309934152, 118)
128 (146, 1309934152, 118)
127 (146, 1309934152, 118)
129 (146, 1309934152, 117)
131 (146, 1309934152, 117)
130 (146, 1309934152, 117)
134 (146, 1309934152, 116)
.
.
.
1487 (88, 1309934152, 28)
1494 (88, 1309934152, 28)
1499 (88, 1309934152, 28)
1490 (87, 1309934152, 28)
1498 (87, 1309934152, 28)
1495 (87, 1309934152, 28)
1497 (87, 1309934152, 28)
65 (87, 1309934152, 87)
83 (47, 1309934152, 47)
waiter count=1408
Process Group: DEFAULT, pseudo proc: 0x15f9a0710
O/S info: user: oracle, term: UNKNOWN, ospid: 7698
OSD pid info: Unix process pid: 7698, image: oracle@Oracle112

*** 2011-07-06 14:35:53.548
Short stack dump:
ksedsts()+461 PSO child state object changes :
Dump of memory from 0x000000015886F9D0 to 0x000000015886FBD8
15886F9D0 00000000 00000000 00000000 00000000 [................]
Repeat 31 times
15886FBD0 00000000 00000000 [........]

*** 2011-07-06 14:35:53.557

根据PMON进程的信息，SPID为7698的一个数据库进程阻塞1408个进程，不用说也知道问题就出在这个7698的进程上，继续查询这个进程的trace信息：

Trace file /opt/oracle/diag/rdbms/mobiledb_n/mobiledb/trace/mobiledb_ora_7698.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.2.0 - 64bit Production
With the Partitioning, OLAP, Data Mining and Real Application Testing options
ORACLE_HOME = /opt/oracle/product/11.2.0/db
System name: Linux
Node name: Oracle112
Release: 2.6.32-71.el6.x86_64
Version: #1 SMP Wed Dec 15 12:36:54 EST 2010
Machine: x86_64
Instance name: mobiledb
Redo thread mounted by this instance: 1
Oracle process number: 63
Unix process pid: 7698, image: oracle@Oracle112

*** 2011-07-06 14:44:20.926
*** SESSION ID:(2229.4669) 2011-07-06 14:44:20.926
*** CLIENT ID:() 2011-07-06 14:44:20.926
*** SERVICE NAME:(MOBILEDB) 2011-07-06 14:44:20.926
*** MODULE NAME:(PL/SQL Developer) 2011-07-06 14:44:20.926
*** ACTION NAME:(Command Window - New) 2011-07-06 14:44:20.926

KQRCMT: Write failed with error=604 po=0x13350d1f0 cid=8
diagnostics : cid=8 hash=7188274a flag=2031

显然这是客户通过PL/SQL Developer工具执行的操作，而且看ACTION NAME是Command Window，多半客户是在进行DDL等操作。

而ORA-604错误的描述是：

ORA-00604: error occurred at recursive SQL level string
Cause: An error occurred while processing a recursive SQL statement (a statement applying to internal dictionary tables).
Action: If the situation described in the next error on the stack can be corrected, do so; otherwise contact Oracle Support.'

显然是由于客户运行的某个特殊命令，导致了数据字典在修改的过程中出现了错误，从而导致了数据库中其他所有的会话被锁。由于连接池中现有的连接全部被锁，因此连接池会尝试不断的建立新的连接来满足用户的请求，而这些新的连接同样会被锁定，并最终导致数据库中的进程数超过了操作系统上的设置1500，从而引发了其他一系列的错误。

导致问题的原理搞清楚了，那么客户DBA到底运行了什么操作，导致了这个现象呢，这个通过TRACE是无法获取到的，即使是通过LOGMINER，很可能也无法找到，由于是对DDL语句而言。

这时最简单的办法莫过于直接询问客户，实际上当时客户在运行的操作理论上是对数据库影响最小的在线重定义的操作，客户在运行DBMS_REDEFINITION.FINISH_REDEF_TABLE这个操作后，很快发现程序异常，而且后台出现了ORA-20错误，试图通过CTRL + C来中止任务，虽然这个操作被中止了，但是这个进程后台对于数据库其他进程的阻塞并没有释放，从而导致所有进行被挂起，最终不得不重启数据库。

导致问题的原因可能是这个：Bug 11834459: DBMS_REDEFINITION.FINISH_REDEF_TABLE MAY FLUSH THE SHARED POOL。在11.2中，执行FINISH_REDEF_TABLE过程很可能会刷新共享池，而这个操作就有可能导致上面所有会话被锁的问题。这个问题说明即使这个操作是Oracle提供的在线操作，即使这个操作看起来没有什么危害，在一个新的版本上运行的时候仍然要小心，最好在相同版本的测试环境中测试，而且尽量不要在高峰期执行。

来自 “ ITPUB博客 ” ，链接：http://blog.itpub.net/4227/viewspace-702946/，如需转载，请注明出处，否则将追究法律责任。

转载于:http://blog.itpub.net/4227/viewspace-702946/