数据库管理-第244期 一次无法switchover的故障处理(20240928)

数据库管理-第244期 一次无法switchover的故障处理(20240928)

作者:胖头鱼的鱼缸(尹海文)
Oracle ACE Pro: Database(Oracle与MySQL)
PostgreSQL ACE Partner
10年数据库行业经验,现主要从事数据库服务工作
拥有OCM 11g/12c/19c、MySQL 8.0 OCP、Exadata、CDP等认证
墨天轮MVP、年度墨力之星,ITPUB认证专家、专家百人团成员,OCM讲师,PolarDB开源社区技术顾问,HaloDB外聘技术顾问,OceanBase观察团成员,青学会MOP技术社区(青年数据库学习互助会)技术顾问
圈内拥有“总监”、“保安”、“国产数据库最大敌人”等称号,非著名社恐(社交恐怖分子)
公众号:胖头鱼的鱼缸;CSDN:胖头鱼的鱼缸(尹海文);墨天轮:胖头鱼的鱼缸;ITPUB:yhw1809。
除授权转载并标明出处外,均为“非法”抄袭

演示文稿1_01.png
中秋前做了一次数据库的倒换演练,结果发现无法switchover,本期就来跟随总监一步一步的寻找并解决问题(同时感谢SR支持)。

1 问题展现

在DGMGRL中进行switchover的时候出现了下面的问题:

DGMGRL> switchover to dbdg;
Performing switchover NOW, please wait...
Error: ORA-16775: target standby database in broker operation has potential data loss

Failed.
Unable to switchover, primary database is still "dbaas"

2 问题排查与处理

2.1 问题1

第一个问题呢是在DGMGRL中show configuration:

DGMGRL> show configuration

Configuration - dg

  Protection Mode: MaxPerformance
  Members:
  dbaas - Primary database
    dbdg - Physical standby database
      Error: ORA-16664: unable to receive the result from a member

Fast-Start Failover: Disabled

Configuration Status:
ERROR (status updated 34 seconds ago)

但是使用show configuration verbose则是显示正常,查看主备库也是正常的:

DGMGRL> show configuration verbose

Configuration - dg

  Protection Mode: MaxPerformance
  Members:
  dbaas - Primary database
    dbdg  - Physical standby database 

  Properties:
    FastStartFailoverThreshold      = '30'
    OperationTimeout                = '30'
    TraceLevel                      = 'SUPPORT'
    FastStartFailoverLagLimit       = '30'
    CommunicationTimeout            = '180'
    ObserverReconnect               = '0'
    FastStartFailoverAutoReinstate  = 'TRUE'
    FastStartFailoverPmyShutdown    = 'TRUE'
    BystandersFollowRoleChange      = 'ALL'
    ObserverOverride                = 'FALSE'
    ExternalDestination1            = ''
    ExternalDestination2            = ''
    PrimaryLostWriteAction          = 'CONTINUE'
    ConfigurationWideServiceName    = 'dbaas_CFG'

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS

DGMGRL> show database dbaas

Database - dbaas

  Role:               PRIMARY
  Intended State:     TRANSPORT-ON
  Instance(s):
    dbaas1
    dbaas2

Database Status:
SUCCESS

DGMGRL> show database dbdg
Database - dbdg

  Role:               PHYSICAL STANDBY
  Intended State:     APPLY-ON
  Transport Lag:      0 seconds (computed 0 seconds ago)
  Apply Lag:          0 seconds (computed 0 seconds ago)
  Average Apply Rate: 61.30 MByte/s
  Real Time Query:    ON
  Instance(s):
    dbdg1
    dbdg2 (apply instance)
    dbdg3
    dbdg4

Database Status:
SUCCESS

最终发现是数据库本身是配置了db_domain的,而tnsname和静态监听中未配置domain(即域名后缀,如xxx.com),随机调整监听和tnsname:
监听调整(以主库实例1和listener_scan1为例):

SID_LIST_LISTENER =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME = dbaas.scmcc.com) #增加domain
      (ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1)
      (SID_NAME = dbaas1)
    )
    (SID_DESC =
      (GLOBAL_DBNAME = dbaas_DGMGRL.scmcc.com) #增加domain
      (ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1)
      (SID_NAME = dbaas1)
    )
  )
SID_LIST_LISTENER_SCAN1 =
  (SID_LIST =
    (SID_DESC =
      (GLOBAL_DBNAME =dbaas.scmcc.com) #增加domain
      (ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1)
      (SID_NAME = dbaas1)
    )
    (SID_DESC =
      (GLOBAL_DBNAME = dbaas_DGMGRL.scmcc.com) #增加domain
      (ORACLE_HOME = /u01/app/oracle/product/19.0.0.0/dbhome_1)
      (SID_NAME = dbaas1)
    )
  )

完成后reload所有监听。
tnsname调整:

DBAAS =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = primary-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = dbaas.xxx.com) #增加domain
    )
  )

DBDG =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = standby-scan)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = dbdg.xxx.com) #增加domain
    )
  )

完成后show configuration恢复正常:

DGMGRL> show configuration

Configuration - dg

  Protection Mode: MaxPerformance
  Members:
  dbaas - Primary database
    dbdg  - Physical standby database 

Fast-Start Failover:  Disabled

Configuration Status:
SUCCESS   (status updated 59 seconds ago)

2.2 问题2

接下来尝试switchover报错依然,validate备库发现一些问题:

DGMGRL> validate database verbose dbdg

  Database Role:     Physical standby database
  Primary Database:  dbaas

  Ready for Switchover:  No #<---Here
  Ready for Failover:    Yes (Primary Running)

  Flashback Database Status:
    dbaas:  Off
    dbdg :  Off

  Capacity Information:
    Database  Instances        Threads        
    dbaas   2                4              
    dbdg    4                4              

  Managed by Clusterware:
    dbaas:  YES            
    dbdg :  YES            

  Temporary Tablespace File Information:
    dbaas TEMP Files:  56
    dbdg TEMP Files:   67

  Data file Online Move in Progress:
    dbaas:  No
    dbdg:   No

  Standby Apply-Related Information:
    Apply State:      Running
    Apply Lag:        0 seconds (computed 0 seconds ago)
    Apply Delay:      0 minutes

  Transport-Related Information:
    Transport On:  Yes
    Gap Status:    Gap #<---Here
    Transport Lag:  0 seconds (computed 0 seconds ago)
    Transport Status:  Success

  Log Files Cleared:
    dbaas Standby Redo Log Files:  Cleared
    dbdg Online Redo Log Files:    Cleared
    dbdg Standby Redo Log Files:   Available

  Current Log File Groups Configuration:
    Thread #  Online Redo Log Groups  Standby Redo Log Groups Status       
              (dbaas)                 (dbdg)                             
    2         4                       7                       Sufficient SRLs
    1         4                       7                       Sufficient SRLs
    3         4                       7                       Sufficient SRLs
    4         4                       7                       Sufficient SRLs

  Future Log File Groups Configuration:
    Thread #  Online Redo Log Groups  Standby Redo Log Groups Status       
              (dbdg)                  (dbaas)                            
    2         4                       7                       Sufficient SRLs
    1         4                       7                       Sufficient SRLs
    3         4                       7                       Sufficient SRLs
    4         4                       7                       Sufficient SRLs

  Current Configuration Log File Sizes:
    Thread #   Smallest Online Redo      Smallest Standby Redo    
               Log File Size             Log File Size            
               (dbaas)                   (dbdg)                 
    2          2048 MBytes               2048 MBytes              
    1          2048 MBytes               2048 MBytes              
    3          2048 MBytes               2048 MBytes              
    4          2048 MBytes               2048 MBytes              

  Future Configuration Log File Sizes:
    Thread #   Smallest Online Redo      Smallest Standby Redo    
               Log File Size             Log File Size            
               (dbdg)                    (dbaas)                
    2          2048 MBytes               2048 MBytes              
    1          2048 MBytes               2048 MBytes              
    3          2048 MBytes               2048 MBytes              
    4          2048 MBytes               2048 MBytes              

  Apply-Related Property Settings:
    Property                        dbaas Value              dbdg Value
    DelayMins                       0                        0
    ApplyParallel                   AUTO                     AUTO
    ApplyInstances                  0                        0

  Transport-Related Property Settings:
    Property                        dbaas Value              dbdg Value
    LogShipping                     ON                       ON
    LogXptMode                      sync                     sync
    Dependency                      <empty>                  <empty>
    DelayMins                       0                        0
    Binding                         optional                 optional
    MaxFailure                      0                        0
    ReopenSecs                      300                      300
    NetTimeout                      30                       30
    RedoCompression                 DISABLE                  DISABLE

仍然显示无法switchover且存在GAP,但是通过数据库查询发现并未出现GAP(查询语句如下,结果略):

-- primary database
set markup HTML on
spool /tmp/primary_info.html
ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MON-YYYY HH24:MI:SS';

select thread#, max(sequence#) "Last Primary Seq Generated"
from gv$archived_log val, gv$database vdb
where val.resetlogs_change# = vdb.resetlogs_change#
group by thread# order by 1;

SELECT thread#, dest_id, gvad.status, error, fail_sequence FROM gv$archive_dest gvad, gv$instance gvi WHERE gvad.inst_id = gvi.inst_id AND destination is NOT NULL ORDER BY thread#, dest_id;

select * from gv$dataguard_stats;

SELECT a.thread#, b. last_seq, a.applied_seq, a. last_app_timestamp, b.last_seq-a.applied_seq ARC_DIFF FROM (SELECT thread#, MAX(sequence#) applied_seq, MAX(next_time) last_app_timestamp FROM gv$archived_log WHERE applied = 'YES' GROUP BY thread#) a, (SELECT thread#, MAX (sequence#) last_seq FROM gv$archived_log GROUP BY thread#) b WHERE a.thread# = b.thread#;

spool off
set markup HTML off

-- standby database
set markup HTML on
spool /tmp/standby_info.html

ALTER SESSION SET NLS_DATE_FORMAT = 'DD-MON-YYYY HH24:MI:SS';

select process,thread#,sequence#,status from gv$managed_standby;

select * from v$dataguard_stats;

select a.thread#
,a.sequence#
,a.group# grp
, a.bytes/1024/1024 Size_MB
,a.status
,a.archived
,a.first_change# "First SCN Number"
,to_char(FIRST_TIME,'DD-Mon-RR HH24:MI:SS') "First SCN Time"
,to_char(LAST_TIME,'DD-Mon-RR HH24:MI:SS') "Last SCN Time" from
v$standby_log a order by 1,2,3,4;


select thread#, max(sequence#) "Last Standby Seq Received"
from v$archived_log val, v$database vdb
where val.resetlogs_change# = vdb.resetlogs_change#
group by thread# order by 1;

select thread#, max(sequence#) "Last Standby Seq Applied"
from v$archived_log val, v$database vdb
where val.resetlogs_change# = vdb.resetlogs_change#
and val.applied in ('YES','IN-MEMORY')
group by thread# order by 1;

spool off
set markup HTML off

主库检查thread:

SQL> SELECT thread#, instance, status FROM v$thread;

   THREAD# INSTANCE             STATUS
---------- -------------------- ------
         1 dbaas1               OPEN
         2 dbaas2               OPEN
         3 UNNAMED_INSTANCE_3   CLOSED
         4 UNNAMED_INSTANCE_4   CLOSED

这里后台建议做了一个操作:

ALTER DATABASE DISABLE THREAD 3;
ALTER DATABASE DISABLE THREAD 4;

运行一段时间后再次validate备库:

DGMGRL> validate database verbose dbdg
...
  Ready for Switchover:  Yes
...
    Gap Status:    No Gap
...

显示可以切换且没有Gap了。目前还没有尝试再次switchover,但应该是没问题了。

3 问题分析

这里也是我第一次遇到这个问题,应该是一开始因为domain配置引起的元数据问题。关于thread的问题,因为主备节点数量不一致,但是其他类似配置的库并没有出现过相关问题,所以我怀疑还是和domain配置有问题带来的连锁反应。

4 总结

本期处理了一个ADG无法switchover的问题,源自于最早的错误配置。
老规矩,知道写了些啥。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

胖头鱼的鱼缸(尹海文)

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值