Troubleshooting Data Guard
In my last section on Data Guard we are going to cover troubleshooting, I cover common issues and resolutions regarding the configuration and maintainance of a Data Guard environment.
You already know that you have the alert log and trace files, database health check which I covered in monitoring, Data Guard Broker checks and configuration. Data Guard could have any of the following problems
- Common management issues
- Physical standby issues
- Logical Standby issues
- Switchover problems
- failover problems
- Data Guard Broker issues
- Data Guard snapshot standby issues
As I mentioned above I will tackle only the most common Data Guard problems, the internet should be able to help with most if not all problems, I am sure that someone out there has had the same problem and fixed it.
First we will start with log files, there are anumber of them to keep a check on
- Alert Log
- Observer Logs
- Data Guard trace files
- Data Guard Broker log files
To find the location of some of these logs use the command below
identify log files | col name for a25 col value for a65; select name, value from v$diag_info where value like '%oracle%'; |
If you are using the Fast-Start Failover (FSFO) you can check for additional logging in the observer log file, you can also check the configuration file called FSFO.dat they are normall kept in $ORACLE_HOME/dbs and $ORACLE_HOME/rdbms/log directories. when starting the observer you can state the logfile name, the file will be in the directory where you run the below command
observer logfile name | dgmgrl -logfile observer.log / "start observer" |
Data Guard trace files can be turned on by the parameter log_archive_trace, you basically set the level that you want to trace at the higher the level the more detail, when set on the standby database it will capture output from RFS and MRP0 processes.
turn on Data Guard tracing | sql> alter system set log_archive_trace=16; |
The Data Guard has a number of reports that you can run, these I covered in the Broker section.
Finally before we get in specific errors here is a list of views that might be useful
dba_logstdby_events | contains last 100 events that occured on the logical standby |
dba_logstdby_progress | check whether SQL Apply is progressing |
dba_logstdby_log | checks whether archive logs are being delivered to a logical standby |
dba_logstdby_unsupported | identifies SQL Apply unsupported data types |
v$archive_dest | Describes all the destinations in the Data Guard configuration |
v$archive_dest_status | Displays runtime and configuration information for the redo transport destinations |
v$archive_gap | Displays information to help you identify a gap in the archive redo log files on a physical database |
v$dataguard_config | Lists the db_unique_name parameters defined in the data guard configuration in log_archive_config |
v$dataguard_status | Displays and records events that would typically be triggered by any message to the alert log or server process trace files |
v$log | online redo log file information |
v$logfile | online redo log file and standby redo log file information |
v$log_history | contains archive log history information from the control file |
v$archive_log | contains more detail archived log information from the control file |
v$logstdby_progress | show whether logical standby pocess is running; if query returns no rows, its not running |
v$managed_standby | display current status information for Oracle database processes related to Data Guard |
v$standby_log | contains standby log file information |
Configuration and Management Errors
We start wiht eh password file error, if you see ORA-01034 or ORA16191 errors then check that the sys account password is set correctly and passwords are set the sam across databases, if you are certain these are set correctly then try recreating the password file, also confirm that the parameter remote_login_password is set to either shared orexclusive.
if the SQL Apply fails with the error ORA-01031 tells you that SQL Apply has failed with insufficent privileges, you should see "grant sysdba to prod1_user" or something similar in the alert log.
granting privileges | sql> alter session disable guard; sql> grant sysdba to prod1_user; sql> alter session enable guard; |
There will be a time when you have to resolve a redo gap manually, normally this is automatic but in some cases it must be done manually. First you have to determine where the gap exists on you physical standby
determining the redo gap | ## run this command on the physical standby select * from v$archive_gap; thread# low_sequence# high_sequence# --------------------------------------------------------------------------- 2 6233 6233 3 4531 4531 4 4938 4939 |
Using the example above we can see thread 2 is missing 6233 and thread 3 is missing 4531 and thread 4 is missing between 4938 and 4939, to identify the missing archive use the below command
identify missing archive logs | ## run the below on the primary once you have the archive gaps that are missing select name from v$archived_log where thread# = 2 and dest_id = 2 and sequence# = 6233; name ------------------------------------------------------------------------------------ /u01/oradata/PROD1/recovery_area/arch_2_6233_656445787.arc |
Copy the missing archives from the primary to the standby and register them, then check that they have been applied
register the missing archive logs | ## On the standby after you have copied the missing archives across from the primary sql> alter database resiter logfile ''; ## determine whether other archive log files are missing select max(r.sequence#) last_seq_recd, max(l.sequence#) last_seq_sent from v$archived_log r, v$log l where r.dest_id = 2 and l.archived='YES'; ## check that if gaps still exist select thread#, sequence#, file_name from dba_logstdby_log l where next_change# not in (select first_change# from dba_logstdby_log where l.thread# = thread#) order by thread#, sequence#; |
Lets say that you have a standby database that has been out of action for a little while and you need to get it back up and running the steps are as follows
get a standby up and rolling again |
|
To check that the standby is receiving archive redo logs you run the following command
check that archive logs are being received | select dest_id, status, error from v$archive_dest; |
Check the following on the listener
- The listener.ora has been correctly configured
- The listener is running on the standby
- The service is not correctly configured in the tnsnames.ora file (check both the primary and standby)
- use tnsping and ping to check all connections
also check the archive destination for any problems
- the service name is specified in the log_archive_dest_n parameter for the primary database
- the log_archive_dest_state_n parameter for the standby is not set to enable
- a disconnect occurred and the the number of seconds specified by reopen has not yet passed
- make sure the standby is up and running in mount mode
- check that the standby database is ot too out of sync otherwise use the above to get it rolling again
There are a number of common issues regarding a physical standby database
- You cannot mount the physical standby database
- The standby archive desination is not deifined properly
- The standby site does not receive logs
- The standby site is not processing the logs
The physical standby cannot be mounted if the controlfile was created with an operating system created backup or backup created using an alter database statement without thestandby options. it must be created with ones of the following alter database create standby controlfile statement or rman backup current controlfile with the for standbyoption.
The standby redo log files must be the same size as the primary, if the sizes are different and a logswitch occurs, then the physical standby will be shutdown if it is in maximum protection mode. The primary will become unsynchronized if it is in maximum availibility mode you will receive the following error message in the alert log no standby log files of size blocks available.
If you receive an ORA-16066 error message it is because the parameter remote_archive_enable is set to false, the parameter controls whether the archival of the redo logs to remote destinations is permitted. This parameter has been deprecated but make sure that it is set to true.
If you receive an ORA-16204 error message it means that the value in the parameter log_archive_dest_n is incorrect, double check this parameter.
Remote archival to standby database fails with ORA-01031, this is the result of a missing password file on the standby database, copy the password file from the primary to the standby database and restart the standby.
missing password file | sql> select status, error from v$archive_dest; |
If you receive an ORA-00326 error message, it means that media recovery has found an archive log which was generated after the required archive log, basically it needs the correct log and that the logs are out of order. MRP process will not allow this to happen but it can if you are applying manually. When you plan to manually recover a physical standby for any reason, you must first examine the standby control file to determine what archive logs have been registered for each thread. The you must make sure that you copy all archive logs from the primary that follow that last sequence for each thread, you must not rely on what you see on the disk - you must use the control file on the standby.
Logical Standby Database Failures
Several problems can occur
- SQL Apply stops
- SQL Apply hangs
- SQL Apply cannot keep up with the primary
- Data is not applied to the logical standby
Log apply cannot apply unsupported DML statements, DDL statements and Oracle-supplied packages to logical standby databases running SQL Apply. You can determine what has failed with the below statement
What has failed | select xidusn, xidslt, xidsqn, status, status_code from dba_logstdby_events where event_time = (select max(event_time) from dba_logstdby_events); ## Logical standby example of an error |
The abaove statement will display what cause the SQL Apply to stop, remeber that you cannot randomly skip errors or transactions in SQL Apply without first fixing the issue that caused the problem in the first place, so never randonly skip transactions at a logical standby never.
ORA-01403 no data found errors is when DML is executed on the logical standby to tables maintained by SQL Apply, the primary and logical standby can get out of sync and the error won't show up until the table on the primary standby is updated. to resolve the issue you would need to skip and re-instantiate th table, if you have a database link created to connect to the primary, you can use the dbms_logstdby.skip and dbms_logstdby.instantiate_table procedure.
no data found | sql> exec dbms_logstdby.skip('DML','ACTORS','%'); sql> exec dbms_logstdby.instantiate_table ('PROD_USER', 'ACTORS', 'PROD1_DBLINK'); |
If it appears that SQL Apply is hanging for the below steps
SQL Apply hanging |
|
If the switchover to the standby from the primary fails you probably have sessions connected you need to use the with session shutdown option, you can check what sessions or jobs are running, but the above option will allow you to switchover
Its is also possible that the standby never received the last redo from the primary, use the below query to determine when the last redo was transmitted, if the last was not transmitted just copy it from the primary and register it, when you restart the log services it will then be applied automatically.
last redo transmitted | select thread#, sequence#, archived, applied, status from v$archived_log; |
If you get ORA-01102 error message make sure that the db_unique_name has been set on both the primary and the standby, if have not set it then set it and restart the database.
If you active sessions when switching over it may fail with a ORA-01093 error then use the with session shutdown clause as part of the alter database commit to switchover to physical standby. Other processes that prevent switchover are the following
- Job queue scheduler process
- Advanced queue time manager
- Oracle Enterprise manager agent
You may end up with physical standbys if the switchover fails in the middle, in this case go to the original primary and run the following
two physical databases after switchover | ## On the original primary sql> alter database commit to switcover to primary; |
If you want to back out from an unsuccesssful switchover on a physical standby then try the below
backed out switchover on physical standby |
|
After a switchover if the archive logs are not being applied that the following actions
archive redo logs are not applied after switchover |
|
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/15747463/viewspace-767886/,如需转载,请注明出处,否则将追究法律责任。
转载于:http://blog.itpub.net/15747463/viewspace-767886/