oracle dataguard 自动故障转移及VIP漂移

韩量超标

已于 2023-09-17 11:10:42 修改

阅读量501

点赞数 6

文章标签： oracle lvs

于 2023-09-17 11:09:19 首次发布

本文链接：https://blog.csdn.net/xiaohan4536/article/details/132921558

版权

一、环境

主机名 ip 实例名唯一名
主库 oracle_dg1.test.com 218.168.129.64 orcl orclpri
备库 oracle_dg2.test.com 218.168.129.65 orcl orclstd
VIP scan_01.test.com 218.168.129.66 orcl

[root@oracle_dg1 ~]# uname -r
2.6.32-754.el6.x86_64

二、搭建dataguard

见：CSDNhttps://mp.csdn.net/mp_blog/creation/editor/132938794

三、配置dg_broker

3.1 配置broker的前提条件

3.1.1 主从数据库都需要打开flashback database

select flashback_onfrom v$database; --在主从都看查看是否开启
shutdown immediate
startup mount
alter database flashback on; --开启闪回
alter database open;

3.1.2 主从数据库必须置为最大可用状态

--如果不是也不要着急，一会用dgmgrl来切成最大可用。
SQL> select name,db_unique_name,database_role,open_mode,protection_mode,protection_level,force_logging,flashback_on,switchover_status from v$database;
NAME DB_UNIQUE_NAME DATABASE_ROLE OPEN_MODE PROTECTION_MODE PROTECTION_LEVEL FOR FLASHBACK_ON SWITCHOVER_STATUS
--------- ------------------------------ ---------------- -------------------- -------------------- -------------------- --- ------------------ --------------------
ORCL orclpri PRIMARY READ WRITE MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY YES YES TO STANDBY

3.1.3 主从数据库都要使用spfile：

show parameter spfile;

3.1.4 配置 dg_broker_config_file

show parameter dg_broker_config_file
dg_broker_config_filen参数用于指定DataGuard配置文件的路径
特别注意一点：RAC中需要把Broker配置文件放入ASM中。

3.1.5 主备数据库都要启用broker

dg_broker_start 参数设置实例启动的时候是否自动启动Broker。
show parameter dg_broker_start;
alter system set dg_broker_start=true;

3.2 开始配置broker

3.2.1 创建配置(主库上执行)

--删除配置用DGMGRL> remove configuration;
在主库上使用dgmgrl连接到数据库
[oracle@oracle_dg1 admin]$ dgmgrl
DGMGRL> connect sys/oracle@orclpri
DGMGRL> create configuration 'orclpricfg' as primary database is 'orclpri' connect identifier is 'orclpri';
Configuration "orclpribroker" created with primary database "orclpri"

参数说明：
orclpricfg 是配置的名称,这里可以随便填.
PRIMARY DATABASE IS 'orclpri' : orclpri 是指 database 的 db_unique_name
connectidentifier is 'orclpri' : orclpri是指 tnsname.ora 连接到主库的 net service name

我们可以使用show confiruration查看配置信息.
DGMGRL> show configuration
Configuration - orclpricfg
Protection Mode: MaxPerformance
Databases:
orclpri - Primary database
Fast-Start Failover: DISABLED
Configuration Status:
DISABLED

3.2.2 添加standby database的配置(主库上执行)

DGMGRL> add database 'orclstd' as connect identifier is orclstd maintained as physical;
Database "orclstd" added
参数说明：
add database 'orclstd' : 这里的orclstd是指 database 的 db_unique_name
AS CONNECT IDENTIFIER IS orclstd: 这里的 orclstd 是指 tnsname.ora 连接到standby database的net service name.

DGMGRL> show configuration
Configuration - orclpricfg
Protection Mode: MaxPerformance
Databases:
orclpri - Primary database
orclstd - Physical standby database
Fast-Start Failover: DISABLED
Configuration Status:
DISABLED

3.2.3 设置日志同步传输模式(主库上执行)

DGMGRL> edit database 'orclpri' set property 'logxptmode'='sync';
Property”logxptmode” updated

DGMGRL> edit database 'orclstd' set property 'logxptmode'='sync';
Property”logxptmode” updated

3.2.4 设置 DATAGUARD 保护模式（如果前最面已经设置则此步省略）

DGMGRL> edit configuration set protection mode as MAXAvailability;

3.2.5 设置故障转移目标 FastStartFailoverTarget

DGMGRL> edit database orclpri set property FastStartFailoverTarget='orclstd';
Property "faststartfailovertarget" updated
DGMGRL> show database orclpri FastStartFailoverTarget;
FastStartFailoverTarget = 'orclstd'

DGMGRL> edit database orclstd set property FastStartFailoverTarget='orclpri';
Property "faststartfailovertarget" updated
DGMGRL> show database orclstd FastStartFailoverTarget;
FastStartFailoverTarget = 'orclpri'

3.2.6 修改快速故障转移的延迟时间(默认为30秒，有些少):

DGMGRL> edit configuration set property FastStartFailoverThreshold=60;
DGMGRL> show configuration FastStartFailoverThreshold;
FastStartFailoverThreshold = '60'
FastStartFailoverThreshold (表示观察者与备库同主库断网后几秒，自动启动failover，即主库终止从库延迟几秒后从库再启动
注意：在生产中要把 FastStartFailoverThreshold 这个参数调大点，要不在生产中发生网络不畅通的情况下，等30s就自动切换了，所以还是改大点吧，我改的是1分钟

3.2.7 配置监听

DGMGRL> show database verbose orclpri StaticConnectIdentifier;
StaticConnectIdentifier = '(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=oracle_dg1.test.com)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orclpri_DGMGRL)(INSTANCE_NAME=orcl)(SERVER=DEDICATED)))'
DGMGRL> show database verbose orclstd StaticConnectIdentifier;
StaticConnectIdentifier = '(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=oracle_dg2.test.com)(PORT=1521))(CONNECT_DATA=(SERVICE_NAME=orclstd_DGMGRL)(INSTANCE_NAME=orcl)(SERVER=DEDICATED)))'

3.2.7.1 主库
[oracle@oracle_dg1 ~]$ vi /u01/oracle/product/11.2.0/dbhome_1/network/admin/listener.ora
# listener.ora Network Configuration File: /u01/oracle/product/11.2.0/dbhome_1/network/admin/listener.ora
# Generated by Oracle configuration tools.

LISTENER =
(DESCRIPTION_LIST =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = oracle_dg1.test.com)(PORT = 1521))
# (ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1521))
)
)

SID_LIST_LISTENER =
(SID_LIST =
(SID_DESC =
(GLOBAL_DBNAME = ORCL)
(ORACLE_HOME = /u01/oracle/product/11.2.0/dbhome_1)
(SID_NAME = ORCL)
)
(SID_DESC =
(GLOBAL_DBNAME = orclpri_DGMGRL)
(ORACLE_HOME = /u01/oracle/product/11.2.0/dbhome_1)
(SID_NAME = orcl)
)
)

ADR_BASE_LISTENER = /u01/oracle

3.2.7.2 备库
[oracle@oracle_dg2 trace]$ vi /u01/oracle/product/11.2.0/dbhome_1/network/admin/listener.ora
# listener.ora Network Configuration File: /u01/oracle/product/11.2.0/dbhome_1/network/admin/listener.ora
# Generated by Oracle configuration tools.

LISTENER =
(DESCRIPTION_LIST =
(DESCRIPTION =
(ADDRESS = (PROTOCOL = TCP)(HOST = oracle_dg2.test.com)(PORT = 1521))
(ADDRESS = (PROTOCOL = IPC)(KEY = EXTPROC1521))
)
)

SID_LIST_LISTENER =
(SID_LIST =
(SID_DESC =
(GLOBAL_DBNAME = orcl)
(ORACLE_HOME = /u01/oracle/product/11.2.0/dbhome_1)
(SID_NAME = orcl)
)
(SID_DESC =
(GLOBAL_DBNAME = orclstd_DGMGRL)
(ORACLE_HOME = /u01/oracle/product/11.2.0/dbhome_1)
(SID_NAME = orcl)
)
)

ADR_BASE_LISTENER = /u01/oracle

3.2.8 启动 observer

官方建议observer应该与primary和standby数据库隔离,不能在一台机器上,我这里资源有限,将observer部署在备机上面

如果 observer 没有启动，会有以下报错：
DGMGRL> show configuration
Configuration - orclpricfg
Protection Mode: MaxAvailability
Databases:
orclpri - Primary database
Error: ORA-16820: fast-start failover observer is no longer observing this database

orclstd - (*) Physical standby database
Error: ORA-16820: fast-start failover observer is no longer observing this database

Fast-Start Failover: ENABLED
Configuration Status:
ERROR

3.2.8.1 命令行启用 observer，开启后，界面无法返回
[oracle@oracle_dg2 ~]$ dgmgrl
DGMGRL> connect sys/oracle
DGMGRL> start observer;
Observer started

3.2.8.2 后台启动的方式
nohup dgmgrl -logfile /home/oracle/observer.log sys/oracle@orclstd "start observer" &
nohup dgmgrl -logfile /home/oracle/observer.log <<EOF
connect sys/passwd@orclstd
start observer
EOF

3.2.8.3 脚本启动
vi observer.sh
#!/bin/ksh
## Script to start observer via DGMGRL
dgmgrl -echo -logfile $HOME/observer.log << EOF
connect sys/passwd@orclstd
start observer
EOF

chmod +x observer.sh
./observer.sh &

3.2.9 启动快速故障转移

DGMGRL> enable fast_start failover;
Enabled.
DGMGRL> show configuration VERBOSE;
Configuration - orclpricfg
Protection Mode: MaxAvailability
Databases:
orclpri - Primary database
orclstd - (*) Physical standby database
(*) Fast-Start Failover target
Properties:
FastStartFailoverThreshold = '60'
OperationTimeout = '30'
FastStartFailoverLagLimit = '30'
CommunicationTimeout = '180'
ObserverReconnect = '0'
FastStartFailoverAutoReinstate = 'TRUE'
FastStartFailoverPmyShutdown = 'TRUE'
BystandersFollowRoleChange = 'ALL'
ObserverOverride = 'FALSE'
ExternalDestination1 = ''
ExternalDestination2 = ''
PrimaryLostWriteAction = 'CONTINUE'

Fast-Start Failover: ENABLED
Threshold: 60 seconds
Target: orclstd
Observer: oracle_dg2.test.com
Lag Limit: 30 seconds (not in use)
Shutdown Primary: TRUE
Auto-reinstate: TRUE
Observer Reconnect: (none)
Observer Override: FALSE

Configuration Status:
SUCCESS

3.3 switchover切换演练

3.3.1 手工将从库（orclpri）切换为主的角色

DGMGRL> switchover to 'orclpri';
Performing switchover NOW, please wait...
Operation requires a connection to instance "orcl" on database "orclpri"
Connecting to instance "orcl"...
Connected.
New primary database "orclpri" is opening...
Operation requires startup of instance "orcl" on database "orclstd"
Starting instance "orcl"...
ORACLE instance started.
Database mounted.
Database opened.
Switchover succeeded, new primary is "orclpri"

3.3.2 查看主备库状态

---- orclpri
SQL> select name,db_unique_name,database_role,open_mode,protection_mode,protection_level,force_logging,flashback_on,switchover_status from v$database;
NAME DB_UNIQUE_NAME DATABASE_ROLE OPEN_MODE PROTECTION_MODE PROTECTION_LEVEL FOR FLASHBACK_ON SWITCHOVER_STATUS
--------- ------------------------------ ---------------- -------------------- -------------------- -------------------- --- ------------------ --------------------
ORCL orclpri PRIMARY READ WRITE MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY YES YES SESSIONS ACTIVE

----orclstd
SQL> select name,db_unique_name,database_role,open_mode,protection_mode,protection_level,force_logging,flashback_on,switchover_status from v$database
NAME DB_UNIQUE_NAME DATABASE_ROLE OPEN_MODE PROTECTION_MODE PROTECTION_LEVEL FOR FLASHBACK_ON SWITCHOVER_STATUS
--------- ------------------------------ ---------------- -------------------- -------------------- -------------------- --- ------------------ --------------------
ORCL orclstd PHYSICAL STANDBY READ ONLY WITH APPLY MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY YES YES NOT ALLOWED

四、配置keepalived

本方案安装keepalived的作用是提供一个vip 218.168.129.66，当主库发生故障后，可以自动将vip漂移到备库，从而使应用不用再修改数据库连接。

4.1 安装

[root@oracle_dg1 yum.repos.d]# yum -y install gcc gcc-c++ libnl* popt* libnfnetlink-devel libnfnetlink lm_sensors-libs net-snmp-agent-libs net-snmp-libs openssh-server openssh-clients openssl openssl-devel automake iproute ipvsadm keepalived
Loaded plugins: fastestmirror, refresh-packagekit, security

4.2 主库Keepalived的配置

[root@oracle_dg1 ~]# cd /etc/keepalived/
[root@localhost keepalived]# cp keepalived.conf keepalived.conf.default
--------------------------------------------------------
--节点1（218.168.129.64）keepalived.conf文件内容：
--------------------------------------------------------
[root@oracle_dg1 ~]# vi /etc/keepalived/keepalived.conf

! Configuration File for keepalived

#全局配置
global_defs {
notification_email {
root@oracle_dg1
}

notification_email_from root@oracle_dg1
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id LVS_DEVEL #身份识别（全局唯一）
}

vrrp_script chk_dg_stats {
script "/etc/keepalived/check_dataguard.sh" ## 监控脚本
interval 2 ## 监控时间
weight -5 ## 即该节点出现故障后下面的priority的值减去5
fall 2
rise 1
}

# 配置VRRP协议
vrrp_instance VI_1 {
state MASTER # 状态，MASTER和BACKUP（这里仅仅是一个标记，真正确认VIP的是权重）
interface eth0 # 绑定网卡
mcast_src_ip 218.168.129.64
virtual_router_id 131 # 虚拟路由标示，可以理解为分组
priority 101 # 优先级（数字越大，权重越大）
inopreempt # 主库重新恢复后不抢占vip
advert_int 1 # 监测心跳间隔时间
authentication { # 配置认证
auth_type PASS # 认证类型
auth_pass 888888 # 认证的密码
}

virtual_ipaddress { # 设置VIP
218.168.129.66/24 dev eth0 label eth0:1 # 虚拟的VIP地址
}

track_script {
chk_dg_stats ###执行上面定义的chk_dg_stats
}
}

4.3 备库Keepalived的配置

[root@oracle_dg2 ~]# cd /etc/keepalived/
[root@oracle_dg2 keepalived]# cp keepalived.conf keepalived.conf.default
[root@oracle_dg2 ~]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived

global_defs {
notification_email {
root@oracle_dg2
}
notification_email_from root@oracle_dg2
smtp_server 127.0.0.1
smtp_connect_timeout 30
router_id LVS_DEVEL #集群的ID 主备两台机器的这名字得一样
}

vrrp_script chk_dg_stats {
script "/etc/keepalived/check_dataguard.sh"
interval 2
weight -5
fall 2
rise 1
}

vrrp_instance VI_1 {
state BACKUP
interface eth0
mcast_src_ip 218.168.129.65
virtual_router_id 131
priority 99
inopreempt
advert_int 1
authentication {
auth_type PASS
auth_pass 888888
}
virtual_ipaddress {
218.168.129.66/24 dev eth0 label eth0:1
}

track_script {
chk_dg_stats
}
}

4.4 配置脚本

两个机器上都配置如下脚本，让keepalived每隔2s执行以下该脚本，用来监控oracle的角色，从而判断本节点的oracle运行是否正常。如果发现不是 PRIMAYR ，立即把vip切换到另一台机。
--------------------------------------------------------
--所有节点配置脚本check_dataguard.sh，并确认具有x执行权限：
--------------------------------------------------------
# cat /etc/keepalived/check_dataguard.sh
#!/bin/bash
dbrole=`echo -e 'set head off;\n select database_role from v$database;' | su - oracle -c "sqlplus -s / as sysdba"`

if [ "${dbrole}" = 'PRIMARY' ]; then
echo 0
exit 0
else
echo 1
exit 1
fi

五、failover 切换演练

5.1 模拟自动故障转移

--关闭主库
SQL> shutdown abort

5.2 查看监听器日志

tail -f observer.log
21:17:21.06 Saturday, September 16, 2023
Initiating Fast-Start Failover to database "orclstd"...
Performing failover NOW, please wait...
Failover succeeded, new primary is "orclstd"
21:17:22.97 Saturday, September 16, 2023
以上提示，已自动启用备库为主库

此时查看IP情况，发现VIP已自动漂移到 orclstd

[root@oracle_dg2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
link/ether 00:50:56:b3:0d:68 brd ff:ff:ff:ff:ff:ff
inet 218.168.129.65/24 brd 218.168.129.255 scope global eth0
inet 218.168.129.66/24 scope global secondary eth0:1

5.3 查看 dgbroker 状态信息

dgmgrl
DGMGRL> conn sys/oracle@orclstd
DGMGRL> show configuration;
Configuration - orclpricfg
Protection Mode: MaxAvailability
Databases:
orclstd - Primary database
Warning: ORA-16817: unsynchronized fast-start failover configuration
orclpri - (*) Physical standby database (disabled)
ORA-16661: the standby database needs to be reinstated
Fast-Start Failover: ENABLED

5.4 尝试修复故障库，把故障库启动到mount状态

[oracle@oracle_dg1 ~]$ sqlplus / as sysdba
SQL*Plus: Release 11.2.0.4.0 Production on Sat Sep 16 21:36:40 2023
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to an idle instance.
SQL> startup mount;
ORACLE instance started.
Total System Global Area 2455228416 bytes
Fixed Size 2255712 bytes
Variable Size 620758176 bytes
Database Buffers 1811939328 bytes
Redo Buffers 20275200 bytes
Database mounted.

5.5 监听器自动开始修复故障库

[oracle@oracle_dg2 ~]$ tail -f observer.log
Observer started
[W000 09/16 20:54:15.57] Observer started.

21:17:21.06 Saturday, September 16, 2023
Initiating Fast-Start Failover to database "orclstd"...
Performing failover NOW, please wait...
Failover succeeded, new primary is "orclstd"
21:17:22.97 Saturday, September 16, 2023

21:37:32.40 Saturday, September 16, 2023
Initiating reinstatement for database "orclpri"...
Reinstating database "orclpri", please wait...
Operation requires shutdown of instance "orcl" on database "orclpri"
Shutting down instance "orcl"...
ORA-01017: invalid username/password; logon denied

Warning: You are no longer connected to ORACLE.

Please complete the following steps and reissue the REINSTATE command:
shut down instance "orcl" of database "orclpri"
start up and mount instance "orcl" of database "orclpri"

21:37:40.08 Saturday, September 16, 2023

根据监听器提示重启故障库至mount
SQL> shutdown immediate;
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.

SQL> startup mount
ORACLE instance started.
Total System Global Area 2455228416 bytes
Fixed Size 2255712 bytes
Variable Size 620758176 bytes
Database Buffers 1811939328 bytes
Redo Buffers 20275200 bytes
Database mounted.

5.6 查看dgmgrl状态信息

DGMGRL> show configuration;
Configuration - orclpricfg
Protection Mode: MaxAvailability
Databases:
orclstd - Primary database
Warning: ORA-16817: unsynchronized fast-start failover configuration
orclpri - (*) Physical standby database (disabled)
ORA-16661: the standby database needs to be reinstated
Fast-Start Failover: ENABLED
Configuration Status:
WARNING

5.7 根据提示 reinstate 故障库

DGMGRL> reinstate database orclpri;
Reinstating database "orclpri", please wait...
Reinstatement of database "orclpri" succeeded
注意：
必须把数据库启动到mout状态才能执行reinstate的恢复命令，其他状态执行恢复会失败的。
另外，如果你执行reinstate命令失败，可用尝试把要恢复的数据库重启到mount状态，再执行reinstate命令

5.8 查看dgmgrl状态信息

DGMGRL> show configuration;
Configuration - orclpricfg
Protection Mode: MaxAvailability
Databases:
orclstd - Primary database
orclpri - (*) Physical standby database
Fast-Start Failover: ENABLED
Configuration Status:
SUCCESS
状态已恢复。

5.9 开启数据库，并打开日志应用

SQL> select name,db_unique_name,database_role,open_mode,protection_mode,protection_level,force_logging,flashback_on,switchover_status from v$database;
NAME DB_UNIQUE_NAME DATABASE_ROLE OPEN_MODE PROTECTION_MODE PROTECTION_LEVEL FOR FLASHBACK_ON SWITCHOVER_STATUS
--------- ------------------------------ ---------------- -------------------- -------------------- -------------------- --- ------------------ --------------------
ORCL orclpri PHYSICAL STANDBY MOUNTED MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY YES YES NOT ALLOWED

SQL> alter database open;
Database altered.

SQL> alter database recover managed standby database using current logfile disconnect from session;
Database altered.

SQL> select name,db_unique_name,database_role,open_mode,protection_mode,protection_level,force_logging,flashback_on,switchover_status from v$database;
NAME DB_UNIQUE_NAME DATABASE_ROLE OPEN_MODE PROTECTION_MODE PROTECTION_LEVEL FOR FLASHBACK_ON SWITCHOVER_STATUS
--------- ------------------------------ ---------------- -------------------- -------------------- -------------------- --- ------------------ --------------------
ORCL orclpri PHYSICAL STANDBY READ ONLY WITH APPLY MAXIMUM AVAILABILITY MAXIMUM AVAILABILITY YES YES NOT ALLOWED