RMAN备份到带库被阻塞

检查备份的输出脚本的时候发现,备份没有按时完成。

 

 

备份脚本输出如下:

bash-3.00$ more /data/backup/backup_tradedb_100102.out

Script. /data/backup/backup_tradedb.sh
==== started on Sat Jan 2 23:00:00 CST 2010 ====


RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> 10.2.0.3.0 - Production on Sat Jan 2 23:00:01 2010

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog

RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=96 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

allocated channel: C2
channel C2: sid=288 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

Starting backup at 02-JAN-10
input backupset count=1537 stamp=706059322 creation_time=19-DEC-09
channel C2: starting piece 1 at 02-JAN-10
channel C2: backup piece /data/backup/tradedb/g1l1b71q_1_1
input backupset count=1535 stamp=706059115 creation_time=19-DEC-09
channel C1: starting piece 1 at 02-JAN-10
channel C1: backup piece /data/backup/tradedb/fvl1b6rb_1_1

显然备份从12日开始,就没有结束,登陆数据库,检查对应的进程:

bash-3.00$ sqlplus / as sysdba

SQL*Plus: Release 10.2.0.3.0 - Production on 星期五 1 8 14:16:53 2010

Copyright (c) 1982, 2006, Oracle.  All Rights Reserved.


连接到:
Oracle Database 10g Enterprise Edition Release 10.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Data Mining options

SQL> set pages 100 lines 120
SQL> col event format a30                    
SQL> col client_info format a20
SQL> select sid, client_info, logon_time, event 
  2  from v$session
  3  where program like 'rman%';

       SID CLIENT_INFO          LOGON_TIME          EVENT
---------- -------------------- ------------------- ------------------------------
        96 rman channel=C1      2010-01-02 23:00:05 Backup: sbtbackup
        98                      2010-01-02 23:00:02 SQL*Net message from client
       288 rman channel=C2      2010-01-02 23:00:06 Backup: sbtbackup
       295                      2010-01-02 23:00:02 SQL*Net message from client

数据库中有4rman相关的会话,其中98295是父会话,而96288分别是CHANNEL C1C2对应的会话。可以看到,问题出现在sbtbackup过程中。

从上面的备份输出来看,错误出现在带库的第一个文件上,也就是说第一个文件的备份就出现了长时间的等待。

这个备份的脚本已经运行了一两年了,而2010年第一次向带库中进行备份就报错了,莫非带库备份软件也存在类似千年虫的问题。

先中止备份进程,测试一下问题是否可以重现,并检查问题是与备份有关,还是与带库或备份软件有关:

SQL> select 'kill -9 ' || spid 
  2  from v$process p, v$session s
  3  where p.addr = s.paddr
  4  and sid in (98, 295, 96, 288);

'KILL-9'||SPID
--------------------
kill -9 12879
kill -9 12998
kill -9 12885
kill -9 12968

已选择4行。

SQL> host
$ kill -9 12879
$ kill -9 12998
$ kill -9 12885
$ kill -9 12968
$ exit

bash-3.00$ rman target /

恢复管理器: Release 10.2.0.3.0 - Production on 星期五 1 8 14:40:07 2010

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

连接到目标数据库: TRADEDB (DBID=4181457554)

RMAN> run
2> {
3> allocate channel c1 device type disk format '/data/backup/tradedb/%U';
4> backup tablespace users;
5> }

使用目标数据库控制文件替代恢复目录
分配的通道: c1
通道 c1: sid=120 实例=tradedb1 devtype=DISK

启动 backup 08-1 -10
通道 c1: 启动全部数据文件备份集
通道 c1: 正在指定备份集中的数据文件
输入数据文件 fno=00013 name=/dev/vx/rdsk/datadg/tradedb_users_1_200m
通道 c1: 正在启动段 1 08-1 -10
通道 c1: 已完成段 1 08-1 -10
段句柄=/data/backup/tradedb/gjl2uv84_1_1 标记=TAG20100108T144108 注释=NONE
通道 c1: 备份集已完成, 经过时间:00:00:01
完成 backup 08-1 -10
释放的通道: c1

RMAN> run
2> {
3> allocate channel c1 device type sbt;  
4> backup tablespace users;
5> }

分配的通道: c1
通道 c1: sid=120 实例=tradedb1 devtype=SBT_TAPE
通道c1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

启动 backup 08-1 -10
通道 c1: 启动全部数据文件备份集
通道 c1: 正在指定备份集中的数据文件
输入数据文件 fno=00013 name=/dev/vx/rdsk/datadg/tradedb_users_1_200m
通道 c1: 正在启动段 1 08-1 -10

不出所料,备份到磁盘没有问题,而如果向磁带备份,则备份进程被挂起。

看来问题多半出现在带库或备份软件NETBACKUP的设置上。

检查NETBACKUP的日志,果然发现了错误信息:

    TIME            SERVER/CLIENT                      TEXT
01/07/2010 14:56:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 14:56:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:06:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:06:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:16:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:16:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:26:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:26:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:36:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:36:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:46:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:46:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:56:24 ahrac1 -  abnormal exit, but EXIT_Status == 0
01/07/2010 15:56:25 ahrac1 -  abnormal exit, but EXIT_Status == 0
.
.
.

检查发现机械手处于DISABLE状态,尝试启动时报错,错误信息位:

robot inventory failed: unable to sense robotic device 202

看来可能是硬件的问题了。尝试重启NETBACKUP服务,问题没有解决。

仔细检查了以往的日志,结果意外的发现1226日的备份也出现了错误:

bash-3.00$ more /data/backup/backup_tradedb_091226.out

Script. /data/backup/backup_tradedb.sh
==== started on Sat Dec 26 23:00:00 CST 2009 ====


RMAN: /opt/oracle/product/10.2/database/bin/rman
ORACLE_SID: tradedb1
ORACLE_HOME: /opt/oracle/product/10.2/database
RMAN> 2> 3> 4> 5> 6> 7> 8> RMAN> 2> 3> 4> 5> 6> 7> 8> 9> RMAN> 2> 3> 4> RMAN>

Copyright (c) 1982, 2005, Oracle.  All rights reserved.

connected to target database: TRADEDB (DBID=4181457554)
using target database control file instead of recovery catalog

RMAN> 2> 3> 4> 5> 6> 7> 8>
allocated channel: C1
channel C1: sid=118 instance=tradedb1 devtype=SBT_TAPE
channel C1: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

allocated channel: C2
channel C2: sid=288 instance=tradedb1 devtype=SBT_TAPE
channel C2: VERITAS NetBackup for Oracle - Release 6.0 (2006110304)

Starting backup at 26-DEC-09
input backupset count=1537 stamp=706059322 creation_time=19-DEC-09
channel C2: starting piece 1 at 26-DEC-09
channel C2: backup piece /data/backup/tradedb/g1l1b71q_1_1
input backupset count=1535 stamp=706059115 creation_time=19-DEC-09
channel C1: starting piece 1 at 26-DEC-09
channel C1: backup piece /data/backup/tradedb/fvl1b6rb_1_1
RMAN-03009: failure of backup command on C1 channel at 12/26/2009 23:30:33
ORA-19506: failed to create sequential file, name="fvl1b6rb_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSACreateObject: Failed with error:
   Server Status:  Communication with the server has not been iniatated or the server status has not been retrieved from the server.
ORA-19600: input file is backup piece  (/data/backup/tradedb/fvl1b6rb_1_1)
channel C1 disabled, job failed on it will be run on another channel
released channel: C1
released channel: C2
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03009: failure of backup command on C2 channel at 12/26/2009 23:30:33
ORA-19506: failed to create sequential file, name="g1l1b71q_1_2", parms=""
ORA-27028: skgfqcre: sbtbackup returned error
ORA-19511: Error received from media manager layer, error text:
   VxBSACreateObject: Failed with error:
   Server Status:  Communication with the server has not been iniatated or the server status has not been retrieved from the server.
ORA-19600: input file is backup piece  (/data/backup/tradedb/g1l1b71q_1_1)

看来这个问题存在时间不短了,检查系统日志,结果发现大量的硬件错误信息:

# dmesg

Fri Jan  8 15:50:23 CST 2010
Jan  8 15:06:50 ahrac1 tldcd[877]: [ID 498531 daemon.error] user scsi ioctl() failed, may be timeout, errno = 5, I/O error
Jan  8 15:06:50 ahrac1 last message repeated 1 time
Jan  8 15:06:50 ahrac1 tldcd[877]: [ID 406877 daemon.error] TLD(0) mode_sense ioctl() failed: I/O error
Jan  8 15:06:50 ahrac1 tldd[857]: [ID 320639 daemon.error] TLD(0) unavailable: initialization failed: Unable to sense robotic device
Jan  8 15:06:51 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Jan  8 15:06:51 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
.
.
.

检查过往的日志,发现在1220日的时候出现的错误:

Dec 20 22:24:51 ahrac1 qlc: [ID 630585 kern.info] NOTICE: Qlogic qlc(1): Link OFFLINE
Dec 20 22:24:51 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:24:51 ahrac1  SCSI transport failed: reason 'tran_err': retrying command
Dec 20 22:26:01 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:26:01 ahrac1  SCSI transport failed: reason 'timeout': retrying command
Dec 20 22:26:21 ahrac1 fctl: [ID 517869 kern.warning] WARNING: fp(0)::OFFLINE timeout
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 0 to CK200070800845 is dead.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Killing bus 3072 to Clariion   CK200070800845 port SP A3.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 1 to CK200070800845 is dead.
Dec 20 22:26:30 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016341E048B3 Lun 2 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=2 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.notice]    Device is gone
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=1 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.notice]    Device is gone
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/ssd@w5006016b41e048b3,0 (ssd4):
Dec 20 22:26:40 ahrac1  transport rejected fatal error
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 1 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=0 (trace=0), target=10000 (trace=2800004)
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Killing bus 3072 to Clariion   CK200070800845 port SP B3.
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 0 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=0 (trace=0), target=10700 (trace=2800004)
Dec 20 22:26:40 ahrac1 emcp: [ID 801593 kern.notice] Error: Path Bus 3072 Tgt 5006016B41E048B3 Lun 2 to CK200070800845 is dead.
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=2 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=1 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=0 (trace=0), target=10400 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=0 (trace=0), target=10600 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 243001 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0 (fcp0):
Dec 20 22:26:40 ahrac1  offlining lun=0 (trace=0), target=10300 (trace=2800004)
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2010,0 (st1):
Dec 20 22:26:40 ahrac1  transport rejected
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2010,0 (st1) offline
Dec 20 22:26:40 ahrac1 scsi: [ID 107833 kern.warning] WARNING: /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2020,0 (st0):
Dec 20 22:26:40 ahrac1  transport rejected
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/st@w500e09e00b3b2020,0 (st0) offline
Dec 20 22:26:40 ahrac1 genunix: [ID 408114 kern.info] /pci@8,600000/SUNW,qlc@2/fp@0,0/sgen@w500e09e00b3b2000,0 (sgen0) offline
Dec 20 22:26:55 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:26:55 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
Dec 20 22:27:10 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:27:10 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
Dec 20 22:27:25 ahrac1 avrd[871]: [ID 230432 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.000 (device 0, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2020,0:raw) thru sg driver, I/O error
Dec 20 22:27:25 ahrac1 avrd[871]: [ID 160372 daemon.error] ioctl error on HP.ULTRIUM3-SCSI.001 (device 1, /devices/pci@8,600000/SUNW
,qlc@2/fp@0,0/sg@w500e09e00b3b2010,0:raw) thru sg driver, I/O error
.
.
.

看来是出现了硬件问题,仅靠调整软件已经不起作用了。

最终由厂家人员现场诊断发现带库的控制器、机械手等多处都出现了硬件故障,最终通过更换了新的带库解决了错误。

本以为是软件配置的问题,没想到最终检测到硬件的错误。

 

来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/4227/viewspace-627279/,如需转载,请注明出处,否则将追究法律责任。

转载于:http://blog.itpub.net/4227/viewspace-627279/

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值