Troubleshooting Media Recovery

彦祖的小号

于 2021-08-02 09:58:31 发布

阅读量125

点赞数

分类专栏： Oracle 文章标签： oracle

本文链接：https://blog.csdn.net/linsuhangoracle/article/details/119318579

版权

Oracle 专栏收录该内容

193 篇文章 0 订阅

订阅专栏

This section describes how to troubleshoot user-managed media recovery, that is, media recovery performed without using Recovery Manager (RMAN).

（一）About User-Managed Media Recovery Problems

Table 30-4 describes potential problems that can occur during media recovery.

Problem	Description
Missing or misnamed archived log	Recovery stops because the database cannot find the archived log recorded in the control file.
When you attempt to open the database, error ORA-1113 indicates that a data file needs media recovery.	This error commonly occurs because: You are performing incomplete recovery but failed to restore all needed data file backups. Incomplete recovery stopped before data files reached a consistent SCN. You are recovering data files from an online backup, but not enough redo was applied to make the data files consistent. You are performing recovery with a backup control file, and did not specify the location of a needed online redo log. A data file is undergoing media recovery when you attempt to open the database. Data files needing recovery were not brought online before you execute the RECOVER DATABASE command, and so were not recovered.
Redo record problems	Two possible cases are as follows: Recovery stops because of failed consistency checks, a problem called stuck recovery. Stuck recovery can occur when an underlying operating system or storage system loses a write issued by the database during normal operation. The database signals an internal error when applying the redo. This problem can be caused by an Oracle Database bug. If checksum verification is not being used, then the errors can also be caused by corruptions to the redo or data blocks.
Corrupted archived logs	Logs may be corrupted while they are stored on or copied between storage systems. If DB_BLOCK_CHECKSUM is enabled, then the database usually signals a checksum error. If checksum checking is disabled, then log corruption may appear as a problem with redo.
Archived logs with incompatible parallel redo format	If you enable the parallel redo feature, then the database generates redo logs in a new format. Prior releases of Oracle are unable to apply parallel redo logs. However, releases before Oracle9i Database Release 2 (9.2) can detect the parallel redo format and indicate the inconsistency with the following error message: External error 00303, 00000, "cannot process Parallel Redo".
Corrupted data blocks	A data file backup may have contained a corrupted data block, or the data block may become corrupted either during recovery or when it is copied to the backup. If DB_BLOCK_CHECKSUM is enabled, then the database computes a checksum for each block during normal operations and stores it in the block before writing it to disk. When the database reads the block from disk later, it recomputes the checksum and compares it to the stored value. If they do not match, then the database signals a checksum error. If checksum checking is disabled, then the problem may also appear as a redo corruption.
Random problems	Memory corruptions and other transient problems can occur during recovery.

The symptoms of media recovery problems are usually external or internal errors signaled during recovery. For example, an external error indicates that a redo block or a data block has failed checksum verification checks. Internal errors can be caused by either bugs in the database or errors arising from the underlying operating system and hardware.

If media recovery encounters a problem while recovering a database backup, then whether it is a stuck recovery problem or a problem during redo application, the database always stops and leaves the data files undergoing recovery in a consistent state, that is, at a consistent SCN preceding the failure. You can then do one of the following:

Open the database read-only to investigate the problem.
Open the database with the RESETLOGS option, if the requirements for opening RESETLOGS have been met. The RESETLOGS restrictions apply to opening the physical standby database as well, because a standby database is updated by a form of media recovery.

In general, opening the database read-only or opening with the RESETLOGS option requires all online data files to be recovered to the same SCN. If this requirement is not met, then the database may signal ORA-1113 or other errors when you attempt to open it. Some common causes of ORA-1113 are described in Table 30-4.

The basic methodology for responding to media recovery problems occurs in the following phases:

1) Try to identify the cause of the problem. Run a trial recovery if needed.

2) If the problem is related to missing redo logs or if you suspect that there is a redo log, memory, or data block corruption, then try to resolve the problem using the methods described in Table 30-5.

3) If you cannot resolve the problem using the methods described in Table 30-5, then do one of the following:

a. Open the database with the RESETLOGS option if you are recovering a whole database backup. If you have performed serial media recovery, then the database contains all the changes up to but not including the changes at the SCN where the corruption occurred. No changes from this SCN onward are in the recovered part of the database. If you have restored online backups, then opening RESETLOGS succeeds only if you have recovered through all the ALTER ... END BACKUP operations in the redo stream.

b. Proceed with recovery by allowing media recovery to corrupt data blocks. After media recovery completes, try performing block media recovery using RMAN.

c. Call Oracle Support Services as a last resort.

（二）Investigating the Media Recovery Problem

If media recovery encounters a problem, then obtain as much information as possible after recovery halts. You do not want to waste time fixing the wrong problem, which may make matters worse.

The goal of this initial investigation is to determine whether the problem is caused by incorrect setup, corrupted redo logs, corrupted data blocks, memory corruption, or other problems. If you see a checksum error on a data block, then the data block is corrupted. If you see a checksum error on a redo log block, then the redo log is corrupted.

Sometimes the cause of a recovery problem can be difficult to determine. Nevertheless, the methods in this section enable you to quickly recover a database even when you do not completely understand the cause of the problem.

To investigate media recovery problems:

Examine the alert.log to see whether the error messages give general information about the nature of the problem. For example, does the alert_SID.log indicate any checksum failures? Does the alert_SID.log indicate that media recovery may have to corrupt data blocks to continue?
Check the trace file generated by the Oracle Database during recovery. It may contain additional error information.

（三）Trying to Fix the Recovery Problem Without Corrupting Blocks

Depending on the type of media recovery problem you suspect, you have different solutions at your disposal. You can try one or a combination of the techniques described in Table 30-5. These solutions are common repair techniques and fairly safe for resolving most media recovery issues.

Table 30-5 Media Recovery Solutions

If You Suspect...	Then...
Missing or misnamed archived redo logs	Determine whether you entered the correct file name. If you did, then check whether the log is missing from the operating system. If it is missing, and if you have a backup, then restore the backup and apply the log. If you do not have a backup, then if possible perform incomplete recovery up to the point of the missing log.
ORA-1113 for ALTER DATABASE OPEN	Review the causes of this error in Table 30-4. Ensure that all read/write data files requiring recovery are online. If you use a backup control file for recovery, then the control file and data files must be at a consistent SCN for the database to be opened. If you do not have the necessary redo, then you must re-create the control file.
Corrupt archived logs	The log is corrupted if the checksum verification on the log redo block fails. If DB_BLOCK_CHECKSUM was not enabled either during the recovery session or when the database generated the redo, then recovery problems may be caused by corrupted logs. If the log is corrupt and an alternate copy of the corrupt log is available, then try to apply it and see whether this tactic fixes the problem. The DB_BLOCK_CHECKSUM initialization parameter determines whether checksums are computed for redo log and data blocks.
Archived logs with incompatible parallel redo format	If you run an Oracle Database release before Oracle9i Database Release 2, and if you attempt to apply redo logs created with the parallel redo format, then you must do the following steps: Upgrade the database to a later release. Perform media recovery. Shut down the database consistently and back up the database. Downgrade the database to the original release.
Memory corruption or transient problems	You may be able to fix the problem by shutting down the database and restarting recovery. The database should be left in a consistent state if the second attempt also fails.
Corrupt data blocks	Restore and recover the data file again with user-managed methods, or restore and recover individual data blocks with the RMAN RECOVER ... BLOCK command. This technique may fix the problem. A data block is corrupted if the checksum verification on the block fails. If DB_BLOCK_CHECKING is disabled, then a corrupted data block problem may appear as a redo problem. If you must proceed with media recovery, then you may want to allow media recovery to mark the block as corrupt for now, continue recovery, and then use RMAN to perform block media recovery later.

If you cannot fix the problem with the methods described in Table 30-5, then there may be no easy way to fix the problem without losing data. You have these options:

1) Open the database with the RESETLOGS option (for whole database recovery).

This solution discards all changes after the point where the redo problem occurred, but guarantees a logically consistent database.

2) Allow media recovery to corrupt one or more data blocks and then proceed.

This option only succeeds if the alert log indicates that recovery can continue if it is allowed to corrupt a data block, which is the case for most recovery problems. This option is best if you must bring up the database quickly and recover all changes. If you are considering this option, then proceed to "Deciding Whether to Allow Recovery to Mark as Corrupt Blocks: Phase 3".

（四）Deciding Whether to Allow Recovery to Mark as Corrupt Blocks

When media recovery encounters a problem, the alert log may indicate that recovery can continue if it is allowed to mark as corrupt the data block causing the problem. The alert log contains information about the block: its block type, block address, the tablespace it belongs to, and so forth. For blocks containing user data, the alert log may also report the data object number.

In this case, the database can proceed with recovery if it is allowed to mark the problem block as corrupt. Nevertheless, this response is not always advisable. For example, if the block is an important block in the SYSTEM tablespace, marking the block as corrupt can eventually prevent you from opening the recovered database. Another consideration is whether the recovery problem is isolated. If this problem is followed immediately by many other problems in the redo stream, then you may want to open the database with the RESETLOGS option.

For a block containing user data, you can usually query the database to discover which object or table owns this block. If the database is not open, then you can open the database read-only, even if you are recovering a whole database backup. The following example cancels recovery and opens the database read-only:

CANCEL

ALTER DATABASE OPEN READ ONLY;

Assume that the data object number reported in the alert_SID.log is 8031. You can determine the owner, object name, and object type by issuing this query:

SELECT OWNER, OBJECT_NAME, SUBOBJECT_NAME, OBJECT_TYPE

FROM DBA_OBJECTS

WHERE DATA_OBJECT_ID = 8031;

To determine whether a recovery problem is isolated, you can run a diagnostic trial recovery, which scans the redo stream for problems but does not actually make any changes to the recovered database. If a trial recovery discovers any recovery problems, then it reports them in the alert_SID.log. You can use the RECOVER ... TEST statement to invoke trial recovery, as described in "Executing the RECOVER... TEST Statement".

After you have done these investigations, you can follow the guidelines in Table 30-6 to decide whether to allow recovery to permit corrupt blocks.

Table 30-6 Guidelines for Allowing Recovery to Permit Corrupt Blocks

If the Problem Is...	and the Block Is...	Then...
Not isolated		You can open the database with the RESETLOGS option. This response is important for stuck recovery problems, because stuck recovery can be caused by the operating system or a storage system losing writes. If an operating system or storage system suddenly fails, then it can cause stuck recovery problems on several blocks.
Isolated	In the SYSTEM tablespace	Do not corrupt the block, because it may eventually prevent you from opening the database. However, sometimes data in the SYSTEM tablespace is unimportant. If you must corrupt a SYSTEM block and recover all changes, then contact Oracle Support Services.
Isolated	Index data	Consider corrupting index blocks because the index can be rebuilt later after the database has been recovered.
Isolated	User data	Decide based on the importance of the data. If you continue with data file recovery and corrupt a block, then you lose data in the block. However, you can use RMAN to perform block media recovery later, after data file recovery completes. If you open RESETLOGS, then the database is consistent but loses any changes made after the point where recovery was stopped.
Isolated	Rollback or undo data	If all of the transactions are committed, then consider corrupting the rollback or undo block. The database is not harmed if the transactions that generated the undo are never rolled back. However, if those transactions are rolled back, then corrupting the undo block can cause problems. If you are unsure, then contact Oracle Support Services.

See Also: "Performing Trial Recovery" to learn how to perform trial recovery, and "Allowing Recovery to Corrupt Blocks: Phase 4" if you decide to allow recovery to permit corrupt blocks

（五）Allowing Recovery to Corrupt Blocks

If you decide to allow recovery to proceed despite block corruptions, then run the RECOVER command with the ALLOW n CORRUPTION clause, where n is the number of allowable corrupt blocks.

To allow recovery to corrupt blocks:

Ensure that all normal recovery preconditions are met. For example, if the database is open, then take tablespaces offline before attempting recovery.
Run the RECOVER command as in the following example:

RECOVER DATABASE ALLOW 5 CORRUPTION

（六）Performing Trial Recovery

When problems such as stuck recovery occur, you have a difficult choice. If the block is relatively unimportant, and if the problem is isolated, then it is better to corrupt the block. But if the problem is not isolated, then it may be better to open the database with the RESETLOGS option.

Because of this situation, Oracle Database supports trial recovery. A trial recovery applies redo in a way similar to normal media recovery, but it never writes its changes to disk and it always rolls back its changes. Trial recovery occurs only in memory.

How Trial Recovery Works

By default, if a trial recovery encounters a stuck recovery or similar problem, then it always marks the data block as corrupt in memory when this action can allow recovery to proceed. The database writes errors generated during trial recovery to alert files. These errors are clearly marked as test run errors.

Like normal media recovery, trial recovery can prompt you for archived log file names and ask you to apply them. Trial recovery ends when:

1) The database runs out of the maximum number of buffers in memory that trial recovery is permitted to use

2) An unrecoverable error is signaled, that is, an error that cannot be resolved by corrupting a data block

3) You cancel or interrupt the recovery session

4) The next redo record in the redo stream changes the control file

5) All requested redo has been applied

When trial recovery ends, the database removes all effects of the test run from the system—except the possible error messages in the alert files. If the instance fails during trial recovery, then the database removes all effects of trial recovery from the system, because trial recovery never writes changes to disk.

Trial recovery lets you foresee what problems might occur if you were to continue with normal recovery. For problems caused by ongoing memory corruption, trial recovery and normal recovery can encounter different errors.

Executing the RECOVER... TEST Statement

You can use the TEST option for any RECOVER command. For example, you can start SQL*Plus and then issue any of the following commands:

RECOVER DATABASE TEST

RECOVER DATABASE USING BACKUP CONTROLFILE UNTIL CANCEL TEST

RECOVER TABLESPACE users TEST

RECOVER DATABASE UNTIL CANCEL TEST

By default, trial recovery always attempts to corrupt blocks in memory if this action allows trial recovery to proceed. Trial recovery by default can corrupt an unlimited number of data blocks. You can specify the ALLOW n CORRUPTION clause on the RECOVER ... TEST statement to limit the number of data blocks that trial recovery can corrupt in memory.

A trial recovery command is usable in any scenario in which a normal recovery command is usable. Nevertheless, you only need to run trial recovery when recovery runs into problems.