1. 故障介绍
这几天遇到一个状况,数据库莫名其妙的down掉。然后客户让我们找出原因。其实数据库down掉的原因有好多种,故障或者是人为。而我这个情况是人为造成的。既然是人为造成的,我们如何得知呢?
2. 故障重现
首先我们来思考一下,在RAC数据库中,直接发起一条shutdown abort的命令会记录什么?我们首先看一下alert日志。从这里我们可以看到一段话shutting down instance(abort),并且记录了用户的pid是9806。
Thu Jun 06 20:48:29 CST 2013
Shutting down instance (abort)
License high water mark = 12
Instance terminated by USER, pid = 9806
下面是crsd的记录。
2013-06-06 20:48:31.204: [ CRSRES][1497741632]0ora.racdb.racdb1.inst target set to OFFLINE before stop action
2013-06-06 20:48:31.204: [ CRSRES][1497741632]0StopResource: setting CLI values
2013-06-06 20:48:31.216: [ CRSRES][1497741632]0Attempting to stop `ora.racdb.racdb1.inst` on member `rac1`
2013-06-06 20:48:31.555: [ CRSRES][1497741632]0Stop of `ora.racdb.racdb1.inst` on member `rac1` succeeded.
如果我们用srvctl stop instance -d racdb -i racdb1 -o abort命令来停数据库的话。会出现如下结果
后台alert日志
Thu Jun 06 21:06:09 CST 2013
Shutting down instance (abort)
License high water mark = 5
Instance terminatedby USER, pid = 17446
crsd日志如下
2013-06-06 21:06:08.845: [ CRSRES][1497741632]0ora.racdb.racdb1.inst target set to OFFLINE before stop action
2013-06-06 21:06:08.845: [ CRSRES][1497741632]0StopResource: setting CLI values
2013-06-06 21:06:08.862: [ CRSRES][1497741632]0Attempting to stop `ora.racdb.racdb1.inst` on member `rac1`
2013-06-06 21:06:11.900: [ CRSRES][1497741632]0Stop of `ora.racdb.racdb1.inst` on member `rac1` succeeded.
从这儿可以看出手工shutdown abort和通过srvctl shutdown abort是一样的。那么在来看看kill -9 smon会怎么样。
Thu Jun 06 21:16:19 CST 2013
Errors in file /oracle/app/oracle/admin/racdb/bdump/racdb1_pmon_21008.trc:
ORA-00474: SMON process terminated with error
Thu Jun 06 21:16:19 CST 2013
PMON: terminating instance due to error 474
Thu Jun 06 21:16:19 CST 2013
System state dump is made for local instance
System State dumped to trace file /oracle/app/oracle/admin/racdb/bdump/racdb1_diag_21010.trc
Thu Jun 06 21:16:19 CST 2013
Trace dumping is performing id=[cdmp_20130606211619]
Thu Jun 06 21:16:20 CST 2013
Instance terminated by PMON, pid = 21008
crsd日志如下
2013-06-06 21:16:19.286: [ CRSRES][1497741632]0In stateChanged, ora.racdb.racdb1.inst target is ONLINE
2013-06-06 21:16:19.286: [ CRSRES][1497741632]0ora.racdb.racdb1.inst on rac1 went OFFLINE unexpectedly
2013-06-06 21:16:19.286: [ CRSRES][1497741632]0StopResource: setting CLI values
2013-06-06 21:16:20.523: [ CRSRES][1497741632]0Attempting to stop `ora.racdb.racdb1.inst` on member `rac1`
2013-06-06 21:16:24.015: [ CRSRES][1497741632]0Stop of `ora.racdb.racdb1.inst` on member `rac1` succeeded.
2013-06-06 21:16:24.016: [ CRSRES][1497741632]0ora.racdb.racdb1.inst RESTART_COUNT=0 RESTART_ATTEMPTS=5
2013-06-06 21:16:24.016: [ CRSRES][1497741632]0Restarting ora.racdb.racdb1.inst on rac1
2013-06-06 21:16:24.024: [ CRSRES][1497741632]0startRunnable: setting CLI values
2013-06-06 21:16:24.024: [ CRSRES][1497741632]0Attempting to start `ora.racdb.racdb1.inst` on member `rac1`
2013-06-06 21:16:43.749: [ CRSRES][1497741632]0Start of `ora.racdb.racdb1.inst` on member `rac1` succeeded.
2013-06-06 21:16:43.749: [ CRSRES][1497741632]0Successfully restarted ora.racdb.racdb1.inst on rac1, RESTART_COUNT=1
2013-06-06 21:16:43.792: [ CRSRES][1497741632]0ora.racdb.racdb1.inst Updated LAST_RESTART time in ocr
这里来总结一下,只要是手工执行shutdown abort或者是用srvctl stop instance来执行shutdown的话,都会在alert日志中记录shutdown by user,还会记录会话的session id。而Kill -9则不会记录,它只会告诉你PMON异常中断。而且CRS会立马将数据库重新拉起来。
3. 追查
正是因为alert日志中记录了session id,我们可以审计也就是adump下的文件来查看。在Oracle 10g和Oracle 11g这项功能默认是开启的。因此,我们到adump目录下检索一番就能够得到我们想要的信息。比如第一次shutdown abort的pid是9806。
[oracle@rac1 adump]$ more racdb1_ora_9806_1.aud
Audit file /oracle/app/oracle/admin/racdb/adump/racdb1_ora_9806_1.aud
Oracle Database 10g Enterprise Edition Release 10.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options
ORACLE_HOME = /oracle/app/oracle/product/10.2.0/db_1
System name: Linux
Node name: rac1.localdomain
Release: 2.6.18-164.el5
Version: #1 SMP Thu Sep 3 04:15:13 EDT 2009
Machine: x86_64
Instance name: racdb1
Redo thread mounted by this instance: 1
Oracle process number: 27
Unix process pid: 9806, image: oracle@rac1.localdomain (TNS V1-V3)
Thu Jun 6 20:47:39 2013
LENGTH : '158'
ACTION :[7] 'CONNECT'
DATABASE USER:[1] '/'
PRIVILEGE :[6] 'SYSDBA'
CLIENT USER:[6] 'oracle'
CLIENT TERMINAL:[5] 'pts/1'
STATUS:[1] '0'
DBID:[9] '782449062'
Thu Jun 6 20:48:29 2013
LENGTH : '159'
ACTION :[8] 'SHUTDOWN'
DATABASE USER:[1] '/'
PRIVILEGE :[6] 'SYSDBA'
CLIENT USER:[6] 'oracle'
CLIENT TERMINAL:[5] 'pts/1'
STATUS:[1] '0'
DBID:[9] '782449062'
这里我们通过审计文件看到了action是shutdown。是由pst/1这个终端执行的。那么怎么查看pst/1是从哪儿来的呢?我们可以用w命令来显示。
[oracle@rac1 adump]$ w
21:33:02 up 58 min, 4 users, load average: 0.27, 0.31, 0.31
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
root :0 - 20:37 ?xdm? 1:55 0.04s /usr/bin/gnome-session
root pts/1 192.168.56.1 20:38 0.00s 0.13s 0.00s w
root pts/2 192.168.56.1 20:47 22:56 0.12s 0.04s tail -200f crsd.log
root pts/3 192.168.56.1 21:15 17:03 0.04s 0.04s tail -200f alert_racdb1.log
这里显示了IP地址。正常情况下,都是从某个主机telnet过来的。也可以通过history命令来查看历史的操作记录。
分享到:
更多