检查备份的输出脚本的时候发现,备份没有按时完成。
备份脚本输出如下:
bash-3.00$ more /data/backup/backup_tradedb_100102.out
Script. /data/backup/backup_tradedb.sh
==== started on Sat Jan 2 23:00:00 CST 2010 ====
RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> 10.2.0.3.0 - Production on Sat Jan 2 23:00:01 2010
Copyright (c) 1982, 2005, Oracle. All rights reserved.
connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog
RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=96 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
allocated channel: C2
channel C2: sid=288 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
Starting backup at 02-JAN-10
input backupset count=1537 stamp=706059322 creation_time=19-DEC-09
channel C2: starting piece 1 at 02-JAN-10
channel C2: backup piece /data/backup/tradedb/g1l1b71q_1_1
input backupset count=1535 stamp=706059115 creation_time=19-DEC-09
channel C1: starting piece 1 at 02-JAN-10
channel C1: backup piece /data/backup/tradedb/fvl1b6rb_1_1
显然备份从1月2日开始,就没有结束,登陆数据库,检查对应的进程:
bash-3.00$ sqlplus / as sysdba
SQL*Plus: Release 10.2.0.3.0 - Production on 星期五 1月 8 14:16:53 2010
Copyright (c) 1982, 2006, Oracle. All Rights Reserved.
连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options
SQL> set pages 100 lines 120
SQL> col event format a30
SQL> col client_info format a20
SQL> select sid, client_info, logon_time, event
2 from v$session
3 where program like 'rman%';
SID CLIENT_INFO LOGON_TIME EVENT
---------- -------------------- ------------------- ------------------------------
96 rman channel=C1 2010-01-02 23:00:05 Backup: sbtbackup
98 2010-01-02 23:00:02 SQL*Net message from client
288 rman channel=C2 2010-01-02 23:00:06 Backup: sbtbackup
295 2010-01-02 23:00:02 SQL*Net message from client
数据库中有4个rman相关的会话,其中98和295是父会话,而96和288分别是CHANNEL C1和C2对应的会话。可以看到,问题出现在sbtbackup过程中。
从上面的备份输出来看,错误出现在带库的第一个文件上,也就是说第一个文件的备份就出现了长时间的等待。
这个备份的脚本已经运行了一两年了,而2010年第一次向带库中进行备份就报错了,莫非带库备份软件也存在类似千年虫的问题。
先中止备份进程,测试一下问题是否可以重现,并检查问题是与备份有关,还是与带库或备份软件有关:
SQL> select 'kill -9 ' || spid
2 from v$process p, v$session s
3 where p.addr = s.paddr
4 and sid in (98, 295, 96, 288);
'KILL-9'||SPID
--------------------
kill -9 12879
kill -9 12998
kill -9 12885
kill -9 12968
已选择4行。
SQL> host
$ kill -9 12879
$ kill -9 12998
$ kill -9 12885
$ kill -9 12968
$ exit
bash-3.00$ rman target /
恢复管理器: Release 10.2.0.3.0 - Production on 星期五 1月 8 14:40:07 2010
Copyright (c) 1982, 2005, Oracle. All rights reserved.
连接到目标数据库: TRADEDB (DBID=4181457554)
RMAN> run
2> {
3> allocate channel c1 device type disk format '/data/backup/tradedb/%U';
4> backup tablespace users;
5> }
使用目标数据库控制文件替代恢复目录
分配的通道: c1
通道 c1: sid=120 实例=tradedb1 devtype=DISK
启动 backup 于 08-1月 -10
通道 c1: 启动全部数据文件备份集
通道 c1: 正在指定备份集中的数据文件
输入数据文件 fno=00013 name=/dev/vx/rdsk/datadg/tradedb_users_1_200m
通道 c1: 正在启动段 1 于 08-1月 -10
通道 c1: 已完成段 1 于 08-1月 -10
段句柄=/data/backup/tradedb/gjl2uv84_1_1 标记=TAG20100108T144108 注释=NONE
通道 c1: 备份集已完成, 经过时间:00:00:01
完成 backup 于 08-1月 -10
释放的通道: c1
RMAN> run
2> {
3> allocate channel c1 device type sbt;
4> backup tablespace users;
5> }
分配的通道: c1
通道 c1: sid=120 实例=tradedb1 devtype=SBT_TAPE
通道c1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
启动 backup 于 08-1月 -10
通道 c1: 启动全部数据文件备份集
通道 c1: 正在指定备份集中的数据文件
输入数据文件 fno=00013 name=/dev/vx/rdsk/datadg/tradedb_users_1_200m
通道 c1: 正在启动段 1 于 08-1月 -10
不出所料,备份到磁盘没有问题,而如果向磁带备份,则备份进程被挂起。
看来问题多半出现在带库或备份软件NETBACKUP的设置上。
检查NETBACKUP的日志,果然发现了错误信息:
TIME SERVER/CLIENT TEXT
01/07/2010 14:56:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 14:56:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:06:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:06:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:16:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:16:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:26:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:26:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:36:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:36:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:46:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:46:25 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:56:24 ahrac1 - abnormal exit, but EXIT_Status == 0
01/07/2010 15:56:25 ahrac1 - abnormal exit, but EXIT_Status == 0
.
.
.
检查发现机械手处于DISABLE状态,尝试启动时报错,错误信息位:
robot inventory failed: unable to sense robotic device 202
看来可能是硬件的问题了。尝试重启NETBACKUP服务,问题没有解决。
仔细检查了以往的日志,结果意外的发现12月26日的备份也出现了错误:
bash-3.00$ more /data/backup/backup_tradedb_091226.out
Script. /data/backup/backup_tradedb.sh
==== started on Sat Dec 26 23:00:00 CST 2009 ====
RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> RMAN> 2> 3> 4> 5> 6> 7> 8> 9> RMAN> 2> 3> 4> RMAN>
Copyright (c) 1982, 2005, Oracle. All rights reserved.
connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog
RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=118 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
allocated channel: C2
channel C2: sid=288 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)
Starting backup at 26-DEC-09
input backupset count=1537 stamp=706059322 creation_time=19-DEC-09
channel C2: starting piece 1 at 26-DEC-09
channel C2: backup piece /data/backup/tradedb/g1l1b71q_1_1
input backupset count=1535 stamp=706059115 creation_time=19-DEC-09
channel C1: starting piece 1 at 26-DEC-09
channel C1: backup piece /data/backup/tradedb/fvl1b6rb_1_1
RMAN-03009: failure of backup command on C1 channel at 12/26/2009 23:30:33
ORA-19506: failed to create sequential file, name="fvl1b6rb_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSACreateObject: Failed with error:
Server Status: Communication with the server has not been iniatated or the server status has not been retrieved from the server.
ORA-19600: input file is backup piece (/data/backup/tradedb/fvl1b6rb_1_1)
channel C1 disabled, job failed on it will be run on another channel
released channel: C1
released channel: C2
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on C2 channel at 12/26/2009 23:30:33
ORA-19506: failed to create sequential file, name="g1l1b71q_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
VxBSACreateObject: Failed with error:
Server Status: Communication with the server has not been iniatated or the server status has not been retrieved from the server.
ORA-19600: input file is backup piece (/data/backup/tradedb/g1l1b71q_1_1)
看来这个问题存在时间不短了,检查系统日志,结果发现大量的硬件错误信息:
# dmesg
Fri Jan 8 15:50:23 CST 2010
Jan 8 15:06:50 ahrac1 tldcd[877]: [ID 498531 daemon.error] user scsi ioctl() failed, may be timeout, errno = 5, I/O error
Jan 8 15:06:50 ahrac1 last message repeated 1 time
Jan 8 15:06:50 ahrac1 tldcd[877]: [ID 406877 daemon.error] TLD(0) mode_sense ioctl() failed: I/O error
Jan 8 15:06:50 ahrac1 tldd[857]: [ID 320639 daemon.error] TLD(0) unavailable: initialization failed: Unable to sense robotic device
Jan 8 15:06:51 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Jan 8 15:06:51 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
.
.
.
检查过往的日志,发现在12月20日的时候出现的错误:
Dec 20 22:24:51 ahrac1 qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(1): Link OFFLINE
Dec 20 22:24:51 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:24:51 ahrac1 SCSI transport failed: reason 'tran_err': retrying command
Dec 20 22:26:01 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:26:01 ahrac1 SCSI transport failed: reason 'timeout': retrying command
Dec 20 22:26:21 ahrac1 fctl: [ID 517869 kern.warning] WARNING: fp(0)::OFFLINE timeout
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 0 to CK200070800845 is dead.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Killing bus 3072 to Clariion CK200070800845 port SP A3.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 1 to CK200070800845 is dead.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 2 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=2 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.notice] Device is gone
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=1 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.notice] Device is gone
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:26:40 ahrac1 transport rejected fatal error
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 1 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=0 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Killing bus 3072 to Clariion CK200070800845 port SP B3.
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 0 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=0 (trace=0), target=10700 (trace=2800004)
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 2 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=2 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=1 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=0 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=0 (trace=0), target=10600 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1 offlining lun=0 (trace=0), target=10300 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2010,0 (st1):
Dec 20 22:26:40 ahrac1 transport rejected
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2010,0 (st1) offline
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2020,0 (st0):
Dec 20 22:26:40 ahrac1 transport rejected
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2020,0 (st0) offline
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/sgen@w500e09e00b3b2000,0 (sgen0) offline
Dec 20 22:26:55 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:26:55 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
Dec 20 22:27:10 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:27:10 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
Dec 20 22:27:25 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:27:25 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
.
.
.
看来是出现了硬件问题,仅靠调整软件已经不起作用了。
最终由厂家人员现场诊断发现带库的控制器、机械手等多处都出现了硬件故障,最终通过更换了新的带库解决了错误。
本以为是软件配置的问题,没想到最终检测到硬件的错误。
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/4227/viewspace-627279/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/4227/viewspace-627279/