This is my first blog post written in English.
Today I got a ticket from the EBR team (the third-party backup team) saying that a backup job had failed with ORA-00235:
RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of backup command at 10/12/2011 02:38:13
ORA-00235: controlfile fixed table inconsistent due to concurrent update
RMAN-06031: could not translate database keyword
Recovery Manager complete.
So I went to the Veritas NetBackup log directory to check the backup logs:
total 1000
-rw-rw-rw- 1 root root 3302 Oct 9 23:13 progress.1318162395.13914.log.Z
-rw-rw-rw- 1 root root 120859 Oct 10 11:48 progress.1318165604.231.log
-rw-rw-rw- 1 root root 107600 Oct 11 06:49 progress.1318248053.7838.log
-rw-rw-rw- 1 root root 102098 Oct 11 23:10 progress.1318334454.10590.log
-rw-rw-rw- 1 root root 8139 Oct 12 02:38 progress.1318347478.12109.log
-rw-rw-rw- 1 root root 121113 Oct 12 16:57 progress.1318362511.1274.log
There are two backup log files from today (2011-10-12). One is from the failed backup (written at 02:38) and the other is from a successful backup (written at 16:57):
BACKUP FAIL LOG:
INF - released channel: ch06
INF - released channel: ch07
INF - released channel: ch08
INF - released channel: ch09
INF - released channel: ch10
INF - released channel: ch11
INF - released channel: ch12
INF - released channel: ch13
INF - released channel: ch14
INF - released channel: ch15
INF - RMAN-00571: ===========================================================
INF - RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
INF - RMAN-00571: ===========================================================
INF - RMAN-03002: failure of backup command at 10/12/2011 02:38:13
INF - ORA-00235: controlfile fixed table inconsistent due to concurrent update
INF - RMAN-06031: could not translate database keyword
INF - Recovery Manager complete.
INF - logout
INF - End of Recovery Manager output.
INF - End Oracle Recovery Manager.
au11qap830tels2:SANL01P1:/usr/openv/netbackup/logs/user_ops/dbext/oracle>
BACKUP SUCCESS LOG:
INF - released channel: ch09
INF - released channel: ch10
INF - released channel: ch11
INF - released channel: ch12
INF - released channel: ch13
INF - released channel: ch14
INF - released channel: ch15
INF - allocated channel: ch00
INF - channel ch00: starting full datafile backupset
INF - including current controlfile in backupset
INF - piece handle=ctrl_uapmou8hk_s108889_p1_t764355124 comment=API Version 2.0,MMS Version 5.0.0.0
INF - channel ch00: backup set complete, elapsed time: 00:03:06
INF - Starting Control File and SPFILE Autobackup at 12-OCT-11
INF - piece handle=c-3411474590-20111012-12 comment=API Version 2.0,MMS Version 5.0.0.0
INF - Finished Control File and SPFILE Autobackup at 12-OCT-11
INF - released channel: ch00
INF - Recovery Manager complete.
INF - logout
INF - End of Recovery Manager output.
INF - End Oracle Recovery Manager.
au11qap830tels2:SANL01P1:/usr/openv/netbackup/logs/user_ops/dbext/oracle>
The backup failed with ORA-00235 at 02:38 AM, but rerunning the backup job at another time succeeded.
The error is raised when a controlfile fixed table is inconsistent because of a concurrent update.
When RMAN runs a backup without a recovery catalog, the controlfile is the only place where backup information is stored, so RMAN reads the controlfile to get information such as the SCN.
When the database has a high rate of change, it switches redo logs frequently, and each log switch triggers a checkpoint.
The checkpoint writes the latest SCN to the controlfile.
So the SCN no longer matches what RMAN read the first time, and the ORA-00235 error is raised.
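Since a rerun at a quieter time succeeded, a backup wrapper script could treat ORA-00235 as a transient error and schedule a retry instead of raising a ticket straight away. The following is only a minimal Python sketch of that check. The helper names `oracle_errors` and `is_retryable` are hypothetical and are not part of RMAN or NetBackup, and the sample text is taken from the failed log above.

```python
import re

# Hypothetical helpers, not part of RMAN or NetBackup.
# Matches Oracle error codes such as "ORA-00235".
ORA_CODE = re.compile(r"\bORA-(\d{5})\b")

def oracle_errors(log_text):
    """Return the ORA- error numbers found in the log text, in order."""
    return [int(m.group(1)) for m in ORA_CODE.finditer(log_text)]

def is_retryable(log_text):
    """True only if the log has ORA- errors and all of them are ORA-00235."""
    errors = oracle_errors(log_text)
    return bool(errors) and all(code == 235 for code in errors)

# Lines copied from the failed backup log.
failed_log = """\
INF - RMAN-03002: failure of backup command at 10/12/2011 02:38:13
INF - ORA-00235: controlfile fixed table inconsistent due to concurrent update
INF - RMAN-06031: could not translate database keyword
"""

print(is_retryable(failed_log))
```

The check is deliberately strict: if any other ORA- error appears in the same log, the failure is not treated as retryable, because it may have a different cause.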
From the NetBackup log, we see that the error happened at 10/12/2011 02:38:13.
From the log history, we can also see several log switches shortly before 02:38:13:
------------------- ----------
2011-10-12 00:30:18 59728
2011-10-12 00:31:47 59729
2011-10-12 01:28:08 59730
2011-10-12 01:30:30 59731
2011-10-12 02:29:23 59732
2011-10-12 02:34:07 59733
2011-10-12 02:34:45 59734
2011-10-12 03:38:52 59735
2011-10-12 03:40:28 59736
2011-10-12 04:43:04 59737
2011-10-12 04:44:56 59738
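To make the timing concrete, the listing above can be filtered for log switches that fall in a short window before the failure. This is a rough Python sketch, assuming the second column is the log sequence number. The function name `switches_before` and the 10-minute window are illustrative choices only.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# (switch time, sequence number) pairs copied from the log history listing.
# Assumption: the second column is the log sequence number.
LOG_HISTORY = [
    ("2011-10-12 00:30:18", 59728),
    ("2011-10-12 00:31:47", 59729),
    ("2011-10-12 01:28:08", 59730),
    ("2011-10-12 01:30:30", 59731),
    ("2011-10-12 02:29:23", 59732),
    ("2011-10-12 02:34:07", 59733),
    ("2011-10-12 02:34:45", 59734),
    ("2011-10-12 03:38:52", 59735),
    ("2011-10-12 03:40:28", 59736),
    ("2011-10-12 04:43:04", 59737),
    ("2011-10-12 04:44:56", 59738),
]

def switches_before(history, failure_time, window_minutes):
    """Return entries whose switch time lies in the window ending at failure_time."""
    end = datetime.strptime(failure_time, FMT)
    start = end - timedelta(minutes=window_minutes)
    return [
        (ts, seq)
        for ts, seq in history
        if start <= datetime.strptime(ts, FMT) <= end
    ]

# Log switches in the 10 minutes before the backup failed.
recent = switches_before(LOG_HISTORY, "2011-10-12 02:38:13", 10)
for ts, seq in recent:
    print(ts, seq)
```

With the data above, three switches (sequences 59732 to 59734) fall inside the 10 minutes before the failure, and two of them are less than a minute apart.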
====================================
So here we can get the root cause and solution:
+CAUSE:
++++++++++++++++
As each redo log is archived, the control file is updated with the latest SCN of the redo log switch. If this happens very frequently, the control file is never released and made available to RMAN for the resync.
+++++++++++++++
+SOLUTION
+++++++++++++++
(1) Run the backup at a time when the controlfile is not being updated frequently.
(2) Reduce the frequency of checkpoints.
(2.1) Increase the size of the redo log files. The redo log files are already 4 GB, so this option is not recommended.
(2.2) Increase the value of fast_start_mttr_target from 300 to 600.