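Every time a backup problem shows up, the symptoms look more or less the same, which is exactly what makes it so confusing. I never seem to be able to fix these problems for good, and it is getting really......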
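This time the customer phoned to say the backup system was in trouble again. Roughly: one TDPO client failed partway through its full backup on Saturday, and two other TDPO clients also aborted partway through their incremental backups on Sunday. The error output at the point of failure was as follows (identical on all of these machines):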
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 05/12/2007 02:18:11
ORA-19502: write error on file "LEV0_SZDSCA_622346565_cfihgga5_1_1", blockno 1 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
ANS1235E (RC-72) An unknown system error has occurred from which TSM cannot recover.
Recovery Manager complete.
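Because the failures followed no regular pattern and not every client failed, I suspected that one particular node was going wrong and dragging the other hosts' backups down with it.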
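The error stack above reads from the RMAN layer at the top down to the media manager at the bottom, so the last code is the one that comes from TSM. As a rough sketch (the function names `error_codes` and `media_manager_codes` are my own, not part of any Oracle or TSM tool), the codes can be pulled out of a saved RMAN log like this, which makes it easier to compare the failures from several clients:

```python
import re

# Error stack copied from the RMAN output above (banner lines omitted).
RMAN_OUTPUT = """\
RMAN-03002: failure of backup command at 05/12/2007 02:18:11
ORA-19502: write error on file "LEV0_SZDSCA_622346565_cfihgga5_1_1", blockno 1 (blocksize=512)
ORA-27030: skgfwrt: sbtwrite2 returned error
ORA-19511: Error received from media manager layer, error text:
ANS1235E (RC-72) An unknown system error has occurred from which TSM cannot recover.
"""

# Matches RMAN-nnnnn and ORA-nnnnn codes, plus TSM client messages such as ANS1235E.
CODE_RE = re.compile(r"^(RMAN-\d{5}|ORA-\d{5}|ANS\d{4}[EWI])\b", re.MULTILINE)


def error_codes(text):
    """Return the message codes in the order they appear in the stack."""
    return CODE_RE.findall(text)


def media_manager_codes(text):
    """Return only the codes issued by the TSM client/API layer (ANS prefix)."""
    return [code for code in error_codes(text) if code.startswith("ANS")]


print(error_codes(RMAN_OUTPUT))
print(media_manager_codes(RMAN_OUTPUT))
```

If every failing client ends in the same ANS code, that supports the idea of one shared cause rather than separate per-client problems.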
Next I looked at the server's activity log (q actlog) and found the following errors, repeated over and over:
05/14/2007 18:24:41 ANR0408I Session 47 started for server AGENT_CADB
(AIX-RS/6000) (Tcp/Ip) for library sharing. (SESSION: 47)
05/14/2007 18:24:41 ANR0409I Session 47 ended for server AGENT_CADB
(AIX-RS/6000). (SESSION: 47)
05/14/2007 18:24:41 ANR1794W TSM SAN discovery is disabled by options.
(SESSION: 47)
05/14/2007 18:24:41 ANR8963E Unable to find path to match the serial number
defined for drive DRIVE03 in library 3584LIB . (SESSION: 47)
05/14/2007 18:24:41 ANR8779E Unable to open drive /dev/rmt3, error number=16.
(SESSION: 47)
It's that damned AGENT_CADB node. Every time something breaks it turns out to be the culprit, and here it is again @_@
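When the actlog is long, spotting the guilty node by eye gets tedious. Here is a minimal sketch of how the excerpt above could be grouped by storage agent. It assumes the wrapped layout shown above (continuation lines do not start with a timestamp) and that the "Session N started for server ..." message appears before the errors of that session; `join_records` and `errors_by_server` are hypothetical helper names of mine.

```python
import re
from collections import defaultdict

# Activity log excerpt copied from the q actlog output above.
ACTLOG = """\
05/14/2007 18:24:41 ANR0408I Session 47 started for server AGENT_CADB
(AIX-RS/6000) (Tcp/Ip) for library sharing. (SESSION: 47)
05/14/2007 18:24:41 ANR0409I Session 47 ended for server AGENT_CADB
(AIX-RS/6000). (SESSION: 47)
05/14/2007 18:24:41 ANR1794W TSM SAN discovery is disabled by options.
(SESSION: 47)
05/14/2007 18:24:41 ANR8963E Unable to find path to match the serial number
defined for drive DRIVE03 in library 3584LIB . (SESSION: 47)
05/14/2007 18:24:41 ANR8779E Unable to open drive /dev/rmt3, error number=16.
(SESSION: 47)
"""

# A new record starts with "MM/DD/YYYY HH:MM:SS ".
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2} ")


def join_records(text):
    """Glue wrapped continuation lines back onto the record they belong to."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def errors_by_server(text):
    """Map each server name to the ANRxxxxE codes logged in its sessions."""
    session_owner = {}
    errors = defaultdict(list)
    for record in join_records(text):
        code = re.search(r"\b(ANR\d{4}[IWE])\b", record)
        session = re.search(r"\(SESSION: (\d+)\)", record)
        if not code or not session:
            continue
        started = re.search(r"Session (\d+) started for server (\S+)", record)
        if started:
            session_owner[started.group(1)] = started.group(2)
        if code.group(1).endswith("E"):
            owner = session_owner.get(session.group(1), "unknown")
            errors[owner].append(code.group(1))
    return dict(errors)


print(errors_by_server(ACTLOG))
```

For the excerpt above, both error messages land under AGENT_CADB, which matches what I saw by reading the log.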
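So I started carefully checking which drive was misconfigured. On the cadb node I ran dsmadmc to get into the TSM administrative console and looked at the drive and path information: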
TSM>q drive f=d
Drive Name: DRIVE01
Serial Number: 0007824390
Drive Name: DRIVE02
Serial Number: 0007823855
Drive Name: DRIVE03
Serial Number: 0007824186
Drive Name: DRIVE04
Serial Number: 0007824299
TSM>q path f=d
Source Name: AGENT_CADB
Destination Name: DRIVE01
Device: /dev/rmt3
Source Name: AGENT_CADB
Destination Name: DRIVE02
Device: /dev/rmt4
Source Name: AGENT_CADB
Destination Name: DRIVE03
Device: /dev/rmt2
Source Name: AGENT_CADB
Destination Name: DRIVE04
Device: /dev/rmt1
Then I dropped back to the AIX command line and ran lscfg -vl rmt*:
# lscfg -vl rmt1
Serial Number...............0007824299
# lscfg -vl rmt2
Serial Number...............0007824186
# lscfg -vl rmt3
Serial Number...............0007824390
# lscfg -vl rmt4
Serial Number...............0007823855
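Comparing them carefully, nothing was wrong: each drive's serial number in TSM matched the serial number of the device its path points to. Yet the server-side actlog kept repeating the drive errors for agent_cadb. I nearly fainted @#$%^^$#@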
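The comparison I did by eye can be sketched as a small script. This is only an illustration that parses the three outputs pasted above. The parsing patterns assume exactly the trimmed layout shown here (real `q drive f=d` and `lscfg -vl` output has many more fields), and all function names are my own.

```python
import re

Q_DRIVE = """\
Drive Name: DRIVE01
Serial Number: 0007824390
Drive Name: DRIVE02
Serial Number: 0007823855
Drive Name: DRIVE03
Serial Number: 0007824186
Drive Name: DRIVE04
Serial Number: 0007824299
"""

Q_PATH = """\
Destination Name: DRIVE01
Device: /dev/rmt3
Destination Name: DRIVE02
Device: /dev/rmt4
Destination Name: DRIVE03
Device: /dev/rmt2
Destination Name: DRIVE04
Device: /dev/rmt1
"""

LSCFG = """\
# lscfg -vl rmt1
Serial Number...............0007824299
# lscfg -vl rmt2
Serial Number...............0007824186
# lscfg -vl rmt3
Serial Number...............0007824390
# lscfg -vl rmt4
Serial Number...............0007823855
"""


def parse_drive_serials(text):
    """Drive name -> serial number as defined in TSM."""
    return dict(re.findall(r"Drive Name:\s*(\S+)\s+Serial Number:\s*(\S+)", text))


def parse_paths(text):
    """Drive name -> device special file from the TSM path definitions."""
    return dict(re.findall(r"Destination Name:\s*(\S+)\s+Device:\s*(\S+)", text))


def parse_lscfg(text):
    """Device special file -> serial number reported by AIX."""
    pairs = re.findall(r"lscfg -vl (\S+)\s+Serial Number\.+(\S+)", text)
    return {"/dev/" + device: serial for device, serial in pairs}


def mismatches(drive_serials, paths, device_serials):
    """List (drive, device, TSM serial, AIX serial) where the two serials differ."""
    bad = []
    for drive, device in sorted(paths.items()):
        expected = drive_serials.get(drive)
        actual = device_serials.get(device)
        if expected != actual:
            bad.append((drive, device, expected, actual))
    return bad


print(mismatches(parse_drive_serials(Q_DRIVE), parse_paths(Q_PATH), parse_lscfg(LSCFG)))
```

For the data above the list comes back empty, which is the same result as my manual check: the definitions agree with what AIX reports, so a simple path mix-up does not explain the ANR8963E message.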
Never mind. I collected all the error messages and went back to the office to search online.
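So I googled and baidu'd like mad. Searching on the client-side backup error, most results said the timeout was too short and the client backup had timed out. I ruled that cause out, because the engineer who originally installed TSM had already set the timeouts very long: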
COMMTimeout 10800
IDLETimeout 480
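To put those two numbers in perspective: as far as I know from the TSM server options reference, COMMTimeout is given in seconds and IDLETimeout in minutes. Treat those units as an assumption and check them against the documentation for your server level. Under that assumption, a quick sketch (the helper name `timeouts_in_hours` is mine) converts both to hours:

```python
# Option lines copied from the server options shown above.
OPTIONS = """\
COMMTimeout 10800
IDLETimeout 480
"""

# Assumed units: COMMTimeout in seconds, IDLETimeout in minutes.
UNIT_SECONDS = {"COMMTIMEOUT": 1, "IDLETIMEOUT": 60}


def timeouts_in_hours(option_text):
    """Return the known timeout options converted to hours."""
    result = {}
    for line in option_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts[0].upper(), parts[1]
        if name in UNIT_SECONDS and value.isdigit():
            result[name] = int(value) * UNIT_SECONDS[name] / 3600
    return result


print(timeouts_in_hours(OPTIONS))
```

That works out to 3 hours for COMMTimeout and 8 hours for IDLETimeout, far longer than the point at which these backups broke off, so a timeout does not look like the cause here.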
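Searching on the server-side actlog messages instead, most results said the cause really was a changed device path, but that did not feel right to me given what I had just verified above.
I really could not pin down the exact cause, so I decided to try my luck and simply switch path discovery to automatic: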
setopt sandiscovery on
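Then I ran the backup script by hand to test (switched straight to the oracle user and pasted the backup script into RMAN), and it actually worked. The actlog also stopped reporting the drive errors. Was that really the problem???
I still don't understand why, but backups are working again, so I'll leave it like this and keep watching for a while.
One other machine still could not back up. During yesterday's testing I had halted the TSM server directly without killing the dsmsta process on the client. After I killed that process by hand and started it again, the backup on that machine was OK.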
From the ITPUB blog, link: http://blog.itpub.net/266238/viewspace-915484/. If you repost this, please credit the source; otherwise legal liability will be pursued.