Oracle Firefighting Under an EMC Storage Failure (Part 2)


The customer reported that shortly after 4 PM, access to the ** platform had suddenly become very slow.

Because this system serves the ** platform for the whole province and normally runs around the clock, this fault could well bring the application to a halt. And since the database had no backup, the customer was worried that once it went down, it might never come back up.

So the moment the customer found this database showing the same symptoms as yesterday's ** platform database, the atmosphere turned tense.

 

This was an Oracle 10.2.0.4 RAC deployment on AIX 5.3.10, with its two nodes serving clients in load-balanced mode. As a core application, the platform ran on partitions of two IBM P595 servers, each configured with 72 GB of RAM and 32 POWER5+ CPUs. The customer told me that even at peak times, CPU and memory utilization normally stayed around 50-60%.

After arriving on site, I went straight to the database error logs.

 

The log entries were as follows:

zhyw2 :

Tue Aug 17 22:59:46 2010

Errors in file /opt/oracle/admin/bsp/bdump/bsp1922_j000_729190.trc:

ORA-12012: error on auto execute of job 145

ORA-12008: error in materialized view refresh path

ORA-00376: file 43 cannot be read at this time

ORA-01110: data file 43: '/dev/rlv_raw37_16g'

ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2251

ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2457

ORA-06512: at "SYS.DBMS_IREFRESH", line 685

ORA-06512: at "SYS.DBMS_REFRESH", line 195

ORA-06512: at line 1

 

Tue Aug 17 21:39:46 2010

KCF: write/open error block=0xb080f online=1

file=54 /dev/rlv_raw48_16g

 error=27063 txt: 'IBM AIX RISC System/6000 Error: 47: Write-protected media

Additional information: -1

Additional information: 8192'

Automatic datafile offline due to write error on

Tue Aug 17 21:55:46 2010

Errors in file /opt/oracle/admin/bsp/udump/bsp1922_ora_406246.trc:

ORA-00603: ORACLE server session terminated by fatal error

ORA-01115: IO error reading block from file 35 (block # 923276)

ORA-01110: data file 35: '/dev/rlv_raw29_16g'

ORA-27091: unable to queue I/O

ORA-27072: File I/O error

IBM AIX RISC System/6000 Error: 5: I/O error

Additional information: 7

Additional information: 923275

Additional information: 923275

Additional information: -1

Tue Aug 17 23:47:35 2010

System State dumped to trace file /opt/oracle/admin/bsp/bdump/bsp1922_diag_639110.trc

Tue Aug 17 23:48:21 2010

Errors in file /opt/oracle/admin/bsp/bdump/bsp1922_m002_430378.trc:

ORA-00604: error occurred at recursive SQL level 1

ORA-00028: your session has been killed

ORA-06512: at "SYS.PRVT_HDM", line 10

ORA-06512: at "SYS.WRI$_ADV_HDM_T", line 16

ORA-06512: at "SYS.PRVT_ADVISOR", line 1535

ORA-06512: at "SYS.PRVT_ADVISOR", line 1618

ORA-06512: at "SYS.PRVT_HDM", line 106

ORA-06512: at line 1

 

 

I quickly filtered the log output from both nodes down to the entries that mattered:

file 54: /dev/rlv_raw48_16g

ORA-01110: data file 35: '/dev/rlv_raw29_16g'

ORA-01110: data file 43: '/dev/rlv_raw37_16g'

 

So this failure, too, stemmed from errors accessing database files.
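That filtering step is easy to mechanize. As an illustration only, here is a minimal Python sketch that scans alert-log text for the two I/O-error shapes shown above (on site this was plain grep; the regexes are mine, targeting only those two message formats):

```python
import re

# The two error shapes seen in the alert log above:
#   ORA-01110: data file 43: '/dev/rlv_raw37_16g'
#   file=54 /dev/rlv_raw48_16g
PATTERNS = [
    re.compile(r"ORA-01110: data file (\d+): '([^']+)'"),
    re.compile(r"file=(\d+) (/dev/\S+)"),
]

def failing_datafiles(alert_lines):
    """Return {file_number: path} for every datafile named in an I/O error."""
    found = {}
    for line in alert_lines:
        for pat in PATTERNS:
            m = pat.search(line)
            if m:
                found[int(m.group(1))] = m.group(2)
    return found
```

Running it over the log excerpts above yields exactly the three raw devices listed.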

Using the command below, I checked the current status of the datafiles:

select name,status from v$datafile;

NAME                 STATUS                                                     

-------------------- --------------------                                      

/dev/rlv_system_8g   SYSTEM                                                    

/dev/rlv_undot11_8g  ONLINE                                                    

/dev/rlv_sysaux_8g   ONLINE                                                    

/dev/rlv_user_8g     ONLINE                                                    

/dev/rlv_undot12_8g  ONLINE                                                    

/dev/rlv_raw29_16g   ONLINE                                                    

/dev/rlv_raw37_16g   RECOVER                                                   

/dev/rlv_raw48_16g   ONLINE

As shown above, the datafile /dev/rlv_raw37_16g was in RECOVER status and needed media recovery. The other datafiles reported a normal status.

 

 

I then looked at the archived redo logs on both RAC nodes:

[oracle@zhyw1]$ls -l

total 135787576

-rw-r-----    1 oracle   oinstall 16350676480 Jul 29 12:16 bsp1921_1_227_713969898.arc

-rw-r-----    1 oracle   oinstall 16350670336 Aug  3 17:46 bsp1921_1_228_713969898.arc

-rw-rw----    1 oracle   oinstall 4119506432 Aug  4 21:15 bsp1921_1_229_713969898.arc

-rw-rw----    1 oracle   oinstall 16350673408 Aug 10 15:35 bsp1921_1_230_713969898.arc

-rw-rw----    1 oracle   oinstall 16350669824 Aug 14 21:45 bsp1921_1_231_713969898.arc

drwxr-xr-x    2 root     system          256 Mar 16 09:15 lost+found

[oracle@zhyw1]$cd /arch2

[oracle@zhyw1]$ls -l

total 281756560

-rw-r-----    1 oracle   oinstall 16350686720 Jul 22 09:47 bsp1922_2_221_713969898.arc

-rw-r-----    1 oracle   oinstall 16350676480 Jul 23 18:56 bsp1922_2_222_713969898.arc

-rw-r-----    1 oracle   oinstall 16350677504 Jul 28 18:11 bsp1922_2_223_713969898.arc

-rw-r-----    1 oracle   oinstall 16350675968 Aug  2 11:23 bsp1922_2_224_713969898.arc

-rw-rw----    1 oracle   oinstall 13451708416 Aug  4 18:57 bsp1922_2_225_713969898.arc

-rw-rw----    1 oracle   oinstall 16350674432 Aug  8 20:05 bsp1922_2_226_713969898.arc

-rw-rw----    1 oracle   oinstall 16350808064 Aug 11 10:49 bsp1922_2_227_713969898.arc

-rw-rw----    1 oracle   oinstall 16350674944 Aug 13 16:46 bsp1922_2_228_713969898.arc

-rw-rw----    1 oracle   oinstall 16350668288 Aug 17 09:46 bsp1922_2_229_713969898.arc

drwxr-xr-x    2 root     system          256 Mar 16 14:20 lost+found

[oracle@zhyw1]$

 

I breathed a quiet sigh of relief: the archive logs were reasonably complete, so recovering the datafile should pose no problem.
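The archive file names encode the redo thread and log sequence (the alert log further down shows log_archive_format = bsp1922_%t_%s_%r.arc). Rather than eyeballing ls -l, sequence gaps can be checked mechanically; a minimal Python sketch, illustrative rather than what was run on site:

```python
import re

# Matches ..._<thread>_<sequence>_<resetlogs_id>.arc
ARC_RE = re.compile(r"_(\d+)_(\d+)_(\d+)\.arc$")

def sequence_gaps(filenames):
    """Group archived logs by redo thread and report any missing sequence numbers."""
    by_thread = {}
    for name in filenames:
        m = ARC_RE.search(name)
        if m:
            thread, seq = int(m.group(1)), int(m.group(2))
            by_thread.setdefault(thread, set()).add(seq)
    gaps = {}
    for thread, seqs in by_thread.items():
        missing = sorted(set(range(min(seqs), max(seqs) + 1)) - seqs)
        if missing:
            gaps[thread] = missing
    return gaps
```

For the two listings above (thread 1 sequences 227-231, thread 2 sequences 221-229), the result is empty: no gaps.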

While I was still inspecting the database, I suddenly noticed that the node 2 instance was no longer healthy.

 

Checking cluster status from node 2 returned the following errors:

# crsctl check crs

Failure 1 contacting CSS daemon

Cannot communicate with CRS

Cannot communicate with EVM

[oracle@zhyw2]$crs_stat -t

IOT/Abort trap

 

Meanwhile, running crs_stat from node 1 showed that the node 2 instance had already gone down:

 

# crsctl check crs

CSS appears healthy

CRS appears healthy

EVM appears healthy

# crs_stat -t

Name           Type           Target    State     Host       

------------------------------------------------------------

ora....921.srv application    ONLINE    ONLINE    zhyw1      

ora....922.srv application    ONLINE    OFFLINE              

ora....p192.cs application    ONLINE    ONLINE    zhyw1      

ora....21.inst application    ONLINE    ONLINE    zhyw1      

ora....22.inst application    ONLINE    OFFLINE              

ora.bsp.db     application    ONLINE    ONLINE    zhyw1      

ora....W1.lsnr application    ONLINE    ONLINE    zhyw1      

ora.zhyw1.gsd  application    ONLINE    ONLINE    zhyw1      

ora.zhyw1.ons  application    ONLINE    ONLINE    zhyw1      

ora.zhyw1.vip  application    ONLINE    ONLINE    zhyw1      

ora....W2.lsnr application    ONLINE    ONLINE    zhyw2      

ora.zhyw2.gsd  application    ONLINE    OFFLINE              

ora.zhyw2.ons  application    ONLINE    OFFLINE               

ora.zhyw2.vip  application    ONLINE    ONLINE    zhyw2      

 


 

I tried restarting the CRS processes on node 2 while watching ocssd.log, which showed:

 

[    CSSD]2010-08-18 00:00:42.061 [2572] >TRACE:   Authentication OSD error, op: scls_auth_response_prepare
 loc: mkdir
 info: failed to make dir /opt/oracle/product/10.2.0/crs/css/auth/A3572513, No space left on device
dep: 28
[    CSSD]2010-08-18 00:00:42.489 [2572] >TRACE:   Authentication OSD error, op: scls_auth_response_prepare
 loc: mkdir
 info: failed to make dir /opt/oracle/product/10.2.0/crs/css/auth/A1193328, No space left on device
dep: 28
[    CSSD]2010-08-18 00:00:42.544 [2572] >TRACE:   Authentication OSD error, op: scls_auth_response_prepare
 loc: mkdir
 info: failed to make dir /opt/oracle/product/10.2.0/crs/css/auth/A5267322, No space lef

 

One "info" line caught my eye: why would it report "no space left"? Suspecting a full filesystem, I immediately ran df on node 2:

# df -k

Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on

/dev/hd4          2097152   1465464   31%     7967     3% /

/dev/hd2          3145728   1196032   62%    42303    14% /usr

/dev/hd9var       1048576    585188   45%     7592     6% /var

/dev/hd3          5242880   3774380   29%      748     1% /tmp

/dev/hd1         20971520   9642316   55%     8164     1% /home

/proc                   -         -    -         -     -  /proc

/dev/hd10opt     31457280         0  100%    78501    92% /opt

/dev/archlv     298188800 157264636   48%       14     1% /arch2

10.142.56.2:/arch1   298188800 230249144   23%        9     1% /arch1

#

 

So /opt was full! Presumably the fault had made Oracle churn out trace files until they filled the filesystem. If so, node 1 would not last much longer either. I checked /opt on node 1 and, sure enough, it was already at 92%.
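To confirm where the space had gone, one can total the trace files under the dump directories; a minimal Python sketch (on AIX a simple du -sk of the bdump/udump directories does the same job):

```python
import os

def trace_usage(root, suffix=".trc"):
    """Total size in bytes of trace files under `root`, e.g. an Oracle bdump directory."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

Comparing this total against the filesystem size tells you immediately whether trace files are the culprit.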

 

To keep the evidence available for later analysis, I did not want to delete the Oracle trace files yet. Instead I used the command below to confirm that rootvg had enough free space:

# lsvg rootvg

VOLUME GROUP:       rootvg                   VG IDENTIFIER:  00c450d500004c000000012795dce835

VG STATE:           active                   PP SIZE:        256 megabyte(s)

VG PERMISSION:      read/write               TOTAL PPs:      1092 (279552 megabytes)

MAX LVs:            256                      FREE PPs:       304 (77824 megabytes)

LVs:                10                       USED PPs:       788 (201728 megabytes)

OPEN LVs:           9                        QUORUM:         1 (Disabled)

TOTAL PVs:          2                        VG DESCRIPTORS: 3

STALE PVs:          0                        STALE PPs:      0

ACTIVE PVs:         2                        AUTO ON:        yes

MAX PPs per VG:     32512                                     

MAX PPs per PV:     1016                     MAX PVs:        32

LTG size (Dynamic): 256 kilobyte(s)          AUTO SYNC:      no

HOT SPARE:          no                       BB POLICY:      relocatable

 

It showed roughly 77 GB still free in rootvg, so I chose to extend the /opt filesystem (chfs -a size=+10G /opt would be the command-line equivalent of the smitty path):

smitty jfs2->

After the extension, capacity looked like this:

# df -k

Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on

/dev/hd4          2097152   1465392   31%     7967     3% /

/dev/hd2          3145728   1196032   62%    42303    14% /usr

/dev/hd9var       1048576    585260   45%     7591     6% /var

/dev/hd3          5242880   3774380   29%      748     1% /tmp

/dev/hd1         20971520   9642312   55%     8164     1% /home

/proc                   -         -    -         -     -  /proc

/dev/hd10opt     41943040  10483932   76%    78521     4% /opt

/dev/archlv     298188800 157264636   48%       14     1% /arch2

10.142.56.2:/arch1   298188800 230249144   23%        9     1% /arch1
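As a sanity check on the figures above: the free space comes from the lsvg output, and the size of the extension from the difference between the two df listings. A minimal arithmetic sketch in Python:

```python
# Figures from the lsvg rootvg and df -k output above.
pp_size_mb = 256           # PP SIZE: 256 megabyte(s)
free_pps = 304             # FREE PPs: 304
opt_before_kb = 31457280   # /opt total size before the extension
opt_after_kb = 41943040    # /opt total size after the extension

free_mb = free_pps * pp_size_mb               # 77824 MB free in rootvg (~76 GB)
grown_mb = (opt_after_kb - opt_before_kb) // 1024
grown_pps = grown_mb // pp_size_mb            # 10240 MB added = 40 PPs of 256 MB
```

So the 10 GB extension consumed only 40 of the 304 free PPs, leaving plenty of headroom in rootvg.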

 

I then restarted the CRS stack on node 2:

# crsctl start crs

Attempting to start CRS stack

The CRS stack will be started shortly

 

 

After the restart, node 2 rejoined the RAC cluster. Cluster status:

# crsctl check crs

CSS appears healthy

CRS appears healthy

EVM appears healthy

 

But the system still wasn't right: the gsd and ons resources had not come up:

# crs_stat -t

Name           Type           Target    State     Host       

------------------------------------------------------------

ora....921.srv application    ONLINE    ONLINE    zhyw1      

ora....922.srv application    ONLINE    OFFLINE               

ora....p192.cs application    ONLINE    ONLINE    zhyw1      

ora....21.inst application    ONLINE    ONLINE    zhyw1      

ora....22.inst application    ONLINE    OFFLINE              

ora.bsp.db     application    ONLINE    ONLINE    zhyw1      

ora....W1.lsnr application    ONLINE    ONLINE    zhyw1      

ora.zhyw1.gsd  application    ONLINE    ONLINE    zhyw1      

ora.zhyw1.ons  application    ONLINE    ONLINE    zhyw1      

ora.zhyw1.vip  application    ONLINE    ONLINE    zhyw1      

ora....W2.lsnr application    ONLINE    ONLINE    zhyw2      

ora.zhyw2.gsd  application    ONLINE    OFFLINE              

ora.zhyw2.ons  application    ONLINE    OFFLINE              

ora.zhyw2.vip  application    ONLINE    ONLINE    zhyw2      

 

I checked the concurrent VGs and confirmed that every volume group we depended on was varied on:

# lsvg -o

oravg

oravg2

oravg3

oravg4

oravg5

oravg6

oravg7

archvg

rootvg

 

I restarted the CRS node applications by hand:

[oracle@zhyw2]$srvctl stop nodeapps -n zhyw2

[oracle@zhyw2]$srvctl start nodeapps -n zhyw2

 

crsd.log recorded:

2010-08-18 02:48:04.944: [  CRSRES][12435]32Start of `ora.zhyw2.gsd` on member `zhyw2` succeeded.

2010-08-18 02:48:05.151: [  CRSRES][12438]32startRunnable: setting CLI values

2010-08-18 02:48:05.157: [  CRSRES][12438]32Attempting to start `ora.zhyw2.vip` on member `zhyw2`

2010-08-18 02:48:07.181: [  CRSRES][12438]32Start of `ora.zhyw2.vip` on member `zhyw2` succeeded.

2010-08-18 02:48:07.401: [  CRSRES][12443]32startRunnable: setting CLI values

2010-08-18 02:48:07.410: [  CRSRES][12443]32Attempting to start `ora.zhyw2.ons` on member `zhyw2`

2010-08-18 02:48:08.501: [  CRSRES][12443]32Start of `ora.zhyw2.ons` on member `zhyw2` succeeded.

2010-08-18 02:48:08.509: [ COMMCRS][9523]clsc_receive: (1146c80b0) error 2

 

2010-08-18 02:48:08.738: [  CRSRES][12446]32startRunnable: setting CLI values

2010-08-18 02:48:08.744: [  CRSRES][12446]32Attempting to start `ora.zhyw2.LISTENER_ZHYW2.lsnr` on member `zhyw2`

2010-08-18 02:48:09.767: [  CRSRES][12446]32Start of `ora.zhyw2.LISTENER_ZHYW2.lsnr` on member `zhyw2` succeeded.

Checking CRS status again, gsd and ons were now up, but the instance and its dependent service were still OFFLINE:

 

[oracle@zhyw2]$crs_stat -t

Name           Type           Target    State     Host       

------------------------------------------------------------

ora....921.srv application    ONLINE    ONLINE    zhyw1      

ora....922.srv application    ONLINE    OFFLINE              

ora....p192.cs application    ONLINE    ONLINE    zhyw1      

ora....21.inst application    ONLINE    ONLINE    zhyw1      

ora....22.inst application    ONLINE    OFFLINE               

ora.bsp.db     application    ONLINE    ONLINE    zhyw1      

ora....W1.lsnr application    ONLINE    ONLINE    zhyw1      

ora.zhyw1.gsd  application    ONLINE    ONLINE    zhyw1      

ora.zhyw1.ons  application    ONLINE    ONLINE    zhyw1       

ora.zhyw1.vip  application    ONLINE    ONLINE    zhyw1    

I then checked the instance state:

 

SQL> select status from v$instance;

 

STATUS

------------------------

STARTED

 

 

The database was in NOMOUNT state (a v$instance STATUS of STARTED means the instance is up but nothing is mounted). So I tried to start the zhyw2 instance manually:
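For reference, v$instance.STATUS maps directly onto the startup stages; a small sketch of that mapping (per the Oracle 10g documentation, to the best of my knowledge):

```python
# v$instance.STATUS -> startup stage (Oracle 10g):
STATUS_STAGE = {
    "STARTED": "nomount",  # instance started, control files not yet open
    "MOUNTED": "mount",    # control files open, datafiles not yet open
    "OPEN":    "open",     # database fully open
}

def stage(status):
    """Translate a v$instance STATUS value into the corresponding startup stage."""
    return STATUS_STAGE.get(status.strip().upper(), "unknown")
```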

 

SQL> startup

ORA-32004: obsolete and/or deprecated parameter(s) specified

ORACLE instance started.

 

Total System Global Area 1.7180E+10 bytes

Fixed Size                  2114248 bytes

Variable Size            1.2063E+10 bytes

Database Buffers         5100273664 bytes

Redo Buffers               14659584 bytes

 

Oddly, the startup hung right at this screen and went no further. I checked the alert log (alert*.log) covering that window:

Wed Aug 18 02:50:55 2010

Starting ORACLE instance (normal)

sskgpgetexecname failed to get name

LICENSE_MAX_SESSION = 0

LICENSE_SESSIONS_WARNING = 0

  WARNING: No cluster interconnect has been specified. Depending on

           the communication driver configured Oracle cluster traffic

           may be directed to the public interface of this machine.

           Oracle recommends that RAC clustered databases be configured

           with a private interconnect for enhanced security and

           performance.

Picked latch-free SCN scheme 3

Autotune of undo retention is turned on.

LICENSE_MAX_USERS = 0

SYS auditing is disabled

ksdpec: called for event 13740 prior to event group initialization

Starting up ORACLE RDBMS Version: 10.2.0.4.0.

System parameters with non-default values:

  processes                = 1500

  sessions                 = 1655

  sga_max_size             = 17179869184

  __shared_pool_size       = 11995709440

  __large_pool_size        = 16777216

  __java_pool_size         = 16777216

  __streams_pool_size      = 33554432

  spfile                   = /dev/rlv_spfile_8g

  sga_target               = 17179869184

  control_files            = /dev/rlv_cnt11_512m, /dev/rlv_cnt12_512m, /dev/rlv_cnt13_512m

  db_block_size            = 8192

  __db_cache_size          = 5100273664

  compatible               = 10.2.0.3.0

  log_archive_dest_1       = LOCATION=/arch2

  log_archive_format       = bsp1922_%t_%s_%r.arc

  db_file_multiblock_read_count= 16

  cluster_database         = TRUE

  cluster_database_instances= 2

...

Reconfiguration started (old inc 0, new inc 16)

List of nodes:

 0 1

 Global Resource Directory frozen

* allocate domain 0, invalid = TRUE

 Communication channels reestablished

 * domain 0 valid = 0 according to instance 0

Wed Aug 18 02:50:58 2010

 Master broadcasted resource hash value bitmaps

 Non-local Process blocks cleaned out

 

Wed Aug 18 02:50:58 2010

 LMS 8: 0 GCS shadows traversed, 0 replayed

Wed Aug 18 02:50:58 2010

 Submitted all GCS remote-cache requests

 Fix write in gcs resources

Reconfiguration complete

LCK0 started with pid=31, OS id=815828

Wed Aug 18 02:50:59 2010

ALTER DATABASE   MOUNT

Wed Aug 18 02:54:33 2010

Wed Aug 18 02:54:33 2010

System State dumped to trace file /opt/oracle/admin/bsp/bdump/bsp1922_diag_1204652.trc

System State dumped to trace file /opt/oracle/admin/bsp/bdump/bsp1922_diag_1204652.trc

Wed Aug 18 02:54:59 2010

System State dumped to trace file /opt/oracle/admin/bsp/bdump/bsp1922_diag_1204652.trc

 

The log simply stopped there. This smelled like a synchronization problem between the RAC nodes, so I decided to reboot database server node 2 and bring its resources back up manually:

shutdown -Fr -> smitty clstart -> varyonvg -c oravg

 

The node 2 database still would not open. The errors were:

ALTER DATABASE   MOUNT

Wed Aug 18 03:15:35 2010

alter database mount

Wed Aug 18 03:15:35 2010

ORA-1154 signalled during: alter database mount...

^C[oracle@zhyw2]$tail -f alert*.log

 Submitted all GCS remote-cache requests

 Fix write in gcs resources

Reconfiguration complete

LCK0 started with pid=31, OS id=90800

Wed Aug 18 03:12:01 2010

ALTER DATABASE   MOUNT

Wed Aug 18 03:15:35 2010

alter database mount

Wed Aug 18 03:15:35 2010

ORA-1154 signalled during: alter database mount...

 

By now Engineer Wang, who was responsible for the application, had arrived on site. He too had noticed node 2's condition and was rather nervous.

I told him: "I think this comes down to synchronization between the two nodes. To fix node 2's failure to open, we need to restart the node 1 database as well." Wang replied that since database access was already so slow that it was severely disrupting the business, they had obtained a downtime window: restart whatever needed restarting.

 

First, stop the listener:

lsnrctl stop

Then kill every remote-session (LOCAL=NO) process on node 1:

ps -ef | grep "LOCAL=NO" | grep -v grep | awk '{ print $2 }' | xargs kill -9

Then shut down the instance:

sqlplus / as sysdba -> shutdown immediate;
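The grep in that pipeline needs to be precise: a bare grep NO would match far more than the LOCAL=NO shadow processes, including the grep itself. As an illustration, a minimal Python sketch of the same filter over ps -ef lines (it assumes the standard UID PID PPID ... CMD column layout; the sample data used to exercise it is invented):

```python
def remote_session_pids(ps_lines):
    """From ps -ef output lines, collect PIDs of Oracle shadow processes
    serving remote sessions (marked LOCAL=NO), skipping the grep itself."""
    pids = []
    for line in ps_lines:
        fields = line.split()
        # ps -ef columns: UID PID PPID C STIME TTY TIME CMD...
        if len(fields) > 7 and "LOCAL=NO" in line and "grep" not in line:
            pids.append(int(fields[1]))
    return pids
```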

 

After restarting node zhyw1, I watched zhyw2's log and saw the database open very quickly. I then rebooted both servers once more and, sure enough, node 2 opened cleanly.

Next up: the corrupt data blocks. (To be continued)

  

 

 

 
