EDB PPAS/PostgreSQL Off-Site Disaster Recovery with "Zero Data Loss" Recovery

萧少聪 Scott Siu

Reposted from: EnterpriseDB Chinese Community (EnterpriseDB中文社区)
Original link: http://www.enterprisedb.org.cn/?action-viewthread-tid-28

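Off-site disaster recovery:
A disaster recovery strategy is essential for protecting the integrity of an enterprise's core database applications. Since the September 11 attacks in the United States, many enterprises have required regional or global disaster recovery built on off-site replicas. Three points need attention in an off-site design:
1. Data transfer speed: this depends mainly on the network link between the two sites. Faster is better, but a faster link also costs more.
2. Asynchronous transfer: synchronous replication is generally not recommended across sites, because in synchronous mode a problem on the backup server also takes the primary server out of service.
3. Data loss: because transfer is asynchronous, the sites are never exactly identical. When the system switches from one site to another, the complete data of the primary server is not immediately available, so off-site disaster recovery usually loses some data during recovery.
This article discusses how to achieve "zero data loss" in EnterpriseDB Postgres Plus Advanced Server despite these constraints (the same approach also works for PostgreSQL).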

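Scenario:
Two servers running EDB PPAS are placed at two different sites connected over a WAN. Backups are shipped asynchronously between them to provide off-site disaster recovery.

Requirement:
When the primary server fails, manual intervention is allowed in order to achieve "zero data loss" on the off-site server. For example, if the primary server's network fails, the administrator copies the latest WAL files to the remote site through any available out-of-band channel and then completes a "zero data loss" recovery.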

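Technologies used:
PITR: point-in-time recovery. It lets the database be restored to a specified point in time during disaster recovery. Recovery is transaction-aware, so transactional integrity is preserved.

Warm Standby: this term is rarely used in China. Elsewhere, only a standby that can take over in under one second is called a Hot Standby, one that requires downtime for maintenance is called a Cold Standby, and anything in between is a Warm Standby. To achieve "zero data loss", we use a Warm Standby that receives WAL (write-ahead log) files asynchronously, and then add manual intervention to recover the remaining data.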

System environment:
server1       192.168.100.1(pub)       192.168.101.1(pri)
server2       192.168.100.2(pub)       192.168.101.2(pri)

The /etc/hosts file on both servers is as follows:

    127.0.0.1       localhost.localdomain   localhost
    ::1             localhost6.localdomain6 localhost6

    192.168.101.1       server1pri.example.com
    192.168.101.2       server2pri.example.com

Install EnterpriseDB Postgres Plus Advanced Server on both servers with the default options. The DynaTune setting can be chosen to suit the performance needs of the system.

================ Lab Procedure ================
P.S. If anything is missing or wrong, corrections are welcome.

I. Set up SSH trust between the two servers
server1:

    [enterprisedb@server1 ~]$ ssh-keygen
    [enterprisedb@server1 ~]$ ssh-copy-id enterprisedb@server2pri.example.com

server2:

    [enterprisedb@server2 ~]$ ssh-keygen
    [enterprisedb@server2 ~]$ ssh-copy-id enterprisedb@server1pri.example.com

Test:
On each server, logging in as enterprisedb to the same account on the other server should not prompt for a password.
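
The password-free login can also be checked without an interactive session. The following is a minimal sketch; the function name check_ssh_trust is made up for illustration. It relies on the ssh option BatchMode=yes, which makes ssh fail instead of prompting for a password.

```shell
#!/bin/sh
# Return 0 if key-based login to USER@HOST works without any prompt.
# BatchMode=yes disables password prompts, so a missing key is a failure.
check_ssh_trust() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Example, run on server1:
#   check_ssh_trust enterprisedb@server2pri.example.com && echo "trust OK"
```

This matters because archive_command runs rsync over SSH unattended, so any password prompt would make archiving fail.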

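II. Configure the primary server, server1
1. Stop the database service

    [enterprisedb@server1 ~]$ /etc/init.d/edb_8.3 stop

2. Edit /opt/PostgresPlus/8.3AS/data/postgresql.conf

    archive_command = '/usr/bin/rsync -arv %p server2pri.example.com:/opt/PostgresPlus/warm_wal/%f </dev/null'

Each time a WAL segment is completed, rsync archives it to the remote directory. In archive_command, %p is the path of the file to archive and %f is its file name. In the 8.3 series, archive_mode must also be set to on for archive_command to take effect.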

3. Create the directories that hold archived WAL files on server1 and server2

    [enterprisedb@server1 ~]$ mkdir /opt/PostgresPlus/warm_wal/
    [enterprisedb@server1 ~]$ ssh enterprisedb@server2pri.example.com mkdir /opt/PostgresPlus/warm_wal/

4. Start the database service

    [enterprisedb@server1 ~]$ /etc/init.d/edb_8.3 start
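
Once the primary is running, one way to see whether archiving keeps up is to look in pg_xlog/archive_status. PostgreSQL marks a completed but not yet archived segment with a .ready file there and renames it to .done after archive_command succeeds. The sketch below counts the pending segments; the function name pending_wal_count is hypothetical.

```shell
#!/bin/sh
# Count WAL segments that are complete but not yet archived,
# i.e. the number of *.ready files in the archive_status directory.
pending_wal_count() {
    status_dir=$1
    count=0
    for f in "$status_dir"/*.ready; do
        # With no match the pattern stays literal, so test for existence.
        [ -e "$f" ] && count=$((count + 1))
    done
    echo "$count"
}

# Example, run on server1:
#   pending_wal_count /opt/PostgresPlus/8.3AS/data/pg_xlog/archive_status
```

A count that keeps growing suggests that archive_command is failing, for example because the SSH trust is broken.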
III. Configure the Warm Standby server, server2
1. Stop the database service

    [enterprisedb@server2 ~]$ /etc/init.d/edb_8.3 stop

2. Take an online base backup on server1 and copy it to server2
Start the backup on server1 (this forces a checkpoint):

    [enterprisedb@server1 ~]$ /opt/PostgresPlus/8.3AS/dbserver/bin/edb-psql edb
    Password:
    Welcome to edb-psql 8.3.0.12, the EnterpriseDB interactive terminal.

    Type:  \copyright for distribution terms
           \h for help with SQL commands
           \? for help with edb-psql commands
           \g or terminate with semicolon to execute query
           \q to quit

    edb=# select pg_start_backup('wal');
     pg_start_backup
    -----------------
     0/6255200
    (1 row)

    edb=# \q

    [enterprisedb@server1 ~]$ tar cvf /opt/PostgresPlus/edb_data.tar /opt/PostgresPlus/8.3AS/data
    [enterprisedb@server1 ~]$ /opt/PostgresPlus/8.3AS/dbserver/bin/edb-psql edb
    edb=# select pg_stop_backup();
     pg_stop_backup
    ----------------
     0/422F948
    (1 row)

    edb=# \q
    [enterprisedb@server1 ~]$ scp /opt/PostgresPlus/edb_data.tar enterprisedb@server2pri.example.com:/opt/PostgresPlus/

For a large database, tar can run for a long time. Postgres Plus does not block database operations after pg_start_backup: the server keeps running normally, and every change made before pg_stop_backup is also recorded in the WAL. This is the online backup feature of Postgres Plus. The data files copied by tar may be inconsistent on their own; replaying the archived WAL on top of them during recovery brings them to a consistent state.
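
The three backup steps can be wrapped so that pg_stop_backup always runs, even when tar fails. This is only a sketch under assumptions: the function name online_base_backup is made up, edb-psql is assumed to accept -c like psql does, and authentication is assumed to work without a prompt (for example through ~/.pgpass).

```shell
#!/bin/sh
# Path to the client on server1; override PSQL to use a different binary.
PSQL=${PSQL:-/opt/PostgresPlus/8.3AS/dbserver/bin/edb-psql}

# online_base_backup DATA_DIR TAR_FILE [LABEL]
online_base_backup() {
    data_dir=$1
    tar_file=$2
    label=${3:-wal}
    "$PSQL" -c "select pg_start_backup('$label');" edb || return 1
    # Store paths relative to / so the archive can be unpacked with -C /.
    tar cf "$tar_file" -C / "${data_dir#/}"
    tar_status=$?
    # Leave backup mode no matter how tar ended.
    "$PSQL" -c "select pg_stop_backup();" edb || return 1
    # GNU tar exits with 1 when a file changed while it was being read,
    # which is expected during an online backup; 2 or more is a real error.
    [ "$tar_status" -le 1 ]
}

# Example, run on server1:
#   online_base_backup /opt/PostgresPlus/8.3AS/data /opt/PostgresPlus/edb_data.tar
```

Treating exit status 1 as success is a judgment call that applies to GNU tar only; other tar implementations may use different codes.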

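At this point /opt/PostgresPlus/warm_wal on server2 contains something like the following:

    [enterprisedb@server2 ~]$ ll /opt/PostgresPlus/warm_wal
    total 16408
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:19 000000010000000000000006
    -rw------- 1 enterprisedb edb    236 2009-04-05 13:19 000000010000000000000006.00255200.backup
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:21 000000010000000000000007
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:22 000000010000000000000008
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:23 000000010000000000000009

These archived WAL files were produced after pg_start_backup. Combined with the data in /opt/PostgresPlus/edb_data.tar, they make up the latest archived state of the database.

Each WAL file is 16 MB. Once more than 16 MB of changes have been written, the current WAL file is closed and archived by the command given in archive_command in postgresql.conf. If archiving succeeds, the file is later recycled to free disk space. Because rsync copies the WAL to the remote server, this is also an off-site backup.

Note that the WAL archive is not in sync with the live data on the primary: the two differ by 0 to 16 MB of changes (the contents of one WAL file). To achieve "zero data loss", the administrator must therefore intervene when the server fails and copy the missing data to the remote system. If a disaster makes that manual copy impossible, this part of the data is lost for good. (There are hardware solutions on the market that keep data fully synchronized, but they raise the question of cost. As for software, I have not yet found a solution that performs automatic off-site "zero data loss" recovery.)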
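
The file names in the listing follow from the WAL locations that the server prints. With 16 MB segments, a location such as 0/6255200 falls into segment 0x6255200 / 0x1000000 = 6, and the remainder 0x255200 is the offset that appears in the .backup file name. A small sketch of that arithmetic (the function name wal_segment_name is made up for illustration):

```shell
#!/bin/sh
# wal_segment_name TIMELINE LOGID/OFFSET
# Print the 24-character segment file name holding the given WAL location,
# assuming the default 16 MB (0x1000000 bytes) segment size.
# LOGID and OFFSET are hexadecimal, as printed by pg_start_backup.
wal_segment_name() {
    tli=$1
    log_id=${2%%/*}
    offset=${2##*/}
    seg=$(( 0x$offset / 0x1000000 ))
    printf '%08X%08X%08X\n' "$tli" "$(( 0x$log_id ))" "$seg"
}

wal_segment_name 1 0/6255200   # prints 000000010000000000000006
```

This helps when deciding which segment must reach the standby for a given location to be recoverable.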

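Import the base backup on server2

    [enterprisedb@server2 ~]$ mv /opt/PostgresPlus/8.3AS/data /opt/PostgresPlus/8.3AS/data_bak
    [enterprisedb@server2 ~]$ tar xvf /opt/PostgresPlus/edb_data.tar -C /
    [enterprisedb@server2 ~]$ rm -f /opt/PostgresPlus/8.3AS/data/postmaster.pid

GNU tar stores the paths without the leading "/", so -C / is needed to unpack them back into /opt/PostgresPlus/8.3AS/data regardless of the current directory.

3. Edit /opt/PostgresPlus/8.3AS/data/postgresql.conf on server2

    archive_command = '/usr/bin/rsync -arv %p server1pri.example.com:/opt/PostgresPlus/warm_wal/%f </dev/null'

Once server2 takes over as the primary, rsync archives its new WAL files to the directory on server1.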

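4. Create /opt/PostgresPlus/8.3AS/data/recovery.conf on server2 with the following content

    restore_command = '/opt/PostgresPlus/8.3R2AS/dbserver/bin/pg_standby -l -d -s 2 -k 5 -t /tmp/warm_standby.trigger.5444 /opt/PostgresPlus/warm_wal/ %f %p %r 2>>standby.log'

The recovery.conf file tells the database to restore WAL at startup with the command given in restore_command. pg_standby is built for Warm Standby operation: it keeps polling the archive directory for new files, and each file it delivers is replayed into the real data files. The archive directory must be the one that archive_command on server1 writes to, which is /opt/PostgresPlus/warm_wal/ in this walkthrough, and the pg_standby path must match the installation directory on server2. This is the PITR mechanism of Postgres Plus, used here for recovery. An ordinary PITR uses cp: after startup, the server applies whatever WAL files are currently available to the existing data files and then finishes. With pg_standby, the server instead keeps waiting for new archived WAL files until the trigger file /tmp/warm_standby.trigger.5444 is created or it reads an incomplete WAL file.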
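
If the last WAL files cannot be recovered from the primary, the standby can still be brought up by creating the trigger file named in the -t option, at the cost of losing whatever was not archived. A minimal sketch, with a made-up function name and the trigger path taken from the restore_command above:

```shell
#!/bin/sh
# Trigger file that pg_standby watches (the value of its -t option).
TRIGGER_FILE=${TRIGGER_FILE:-/tmp/warm_standby.trigger.5444}

# Ask pg_standby to stop waiting for new WAL files and end recovery.
trigger_failover() {
    touch "$TRIGGER_FILE"
}

# Example, run on server2 once the decision to fail over has been made:
#   trigger_failover
```

For "zero data loss" this should be a last resort: copy every WAL file that can still be read from the primary before forcing a failover this way.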

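5. Start the database service on server2
Open a new terminal and run tail -f /var/log/messages to monitor the system.

    [enterprisedb@server2 ~]$ /etc/init.d/edb_8.3 start
    [enterprisedb@server2 ~]$ /opt/PostgresPlus/8.3R2AS/dbserver/bin/edb-psql -p 5444 edb
    edb-psql: FATAL:  the database system is starting up

Connecting right after startup fails with a message that the database is still starting up. pg_standby is waiting for new WAL files, so the database stays in its startup (recovery) phase. A warm standby database therefore accepts no connections at all, not even for read-only queries.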

tail -f /var/log/messages shows output similar to the following (captured on the author's system, so paths may differ from those above):

    Apr  5 13:47:13 scottsiu postgres[5006]: [1-1] 2009-04-05 13:47:13 CST LOG:
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-2] #011
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-3] #011** EnterpriseDB Dynamic Tuning Agent ********************************************
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-4] #011*    System Utilization: 66 %                                              *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-5] #011*       Database Version: 8.3.0.102                                        *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-6] #011* Operating System Version:                                                    *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-7] #011*    Number of Processors: 0                                                 *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-8] #011*          Processor Type:                                                    *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-9] #011* Processor Architecture:                                                    *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-10] #011*          Database Size: 0.1 GB                                        *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-11] #011*                   RAM: 3.9 GB                                        *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-12] #011*          Shared Memory: 975 MB                                        *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-13] #011*    Max DB Connections: 100                                              *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-14] #011*             Autovacuum: off                                              *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-15] #011*    Autovacuum Naptime: 60 Seconds                                     *
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-16] #011*********************************************************************************
    Apr  5 13:47:13 scottsiu postgres[5006]: [1-17] #011
    Apr  5 13:47:13 scottsiu postgres[5009]: [2-1] 2009-04-05 13:47:13 CST LOG:  database system was interrupted at 2009-04-05 13:16:33 CST
    Apr  5 13:47:13 scottsiu postgres[5009]: [3-1] 2009-04-05 13:47:13 CST LOG:  starting archive recovery
    Apr  5 13:47:13 scottsiu postgres[5009]: [4-1] 2009-04-05 13:47:13 CST LOG:  restore_command = "/opt/PostgresPlus/8.3R2AS/dbserver/bin/pg_standby -l -d -s 2 -k 5 -t /tmp/warm_standby.trigger.5444 /opt/PostgresPlus/data_wal/
    Apr  5 13:47:13 scottsiu postgres[5009]: [4-2]  %f %p %r 2>>standby. log"
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-1] 2009-04-05 13:47:13 CST LOG:
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-2] #011
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-3] #011** EnterpriseDB Dynamic Tuning Agent ********************************************
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-4] #011*    System Utilization: 66 %                                              *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-5] #011*       Database Version: 8.3.0.102                                        *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-6] #011* Operating System Version:                                                    *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-7] #011*    Number of Processors: 0                                                 *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-8] #011*          Processor Type:                                                    *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-9] #011* Processor Architecture:                                                    *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-10] #011*          Database Size: 0.1 GB                                        *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-11] #011*                   RAM: 3.9 GB                                        *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-12] #011*          Shared Memory: 975 MB                                        *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-13] #011*    Max DB Connections: 100                                              *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-14] #011*             Autovacuum: on                                                 *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-15] #011*    Autovacuum Naptime: 60 Seconds                                     *
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-16] #011*********************************************************************************
    Apr  5 13:47:13 scottsiu postgres[5006]: [2-17] #011
    Apr  5 13:47:28 scottsiu postgres[5009]: [5-1] 2009-04-05 13:47:28 CST LOG:  restored log file "000000010000000000000006.00255200.backup" from archive
    Apr  5 13:47:29 scottsiu postgres[5009]: [6-1] 2009-04-05 13:47:29 CST LOG:  restored log file "000000010000000000000006" from archive
    Apr  5 13:47:29 scottsiu postgres[5009]: [7-1] 2009-04-05 13:47:29 CST LOG:  checkpoint record is at 0/6255200
    Apr  5 13:47:29 scottsiu postgres[5009]: [8-1] 2009-04-05 13:47:29 CST LOG:  redo record is at 0/6255200; undo record is at 0/0; shutdown FALSE
    Apr  5 13:47:29 scottsiu postgres[5009]: [9-1] 2009-04-05 13:47:29 CST LOG:  next transaction ID: 0/29297; next OID: 17227
    Apr  5 13:47:29 scottsiu postgres[5009]: [10-1] 2009-04-05 13:47:29 CST LOG:  next MultiXactId: 1; next MultiXactOffset: 0
    Apr  5 13:47:29 scottsiu postgres[5009]: [11-1] 2009-04-05 13:47:29 CST LOG:  automatic recovery in progress
    Apr  5 13:47:29 scottsiu postgres[5009]: [12-1] 2009-04-05 13:47:29 CST LOG:  redo starts at 0/6255250
    Apr  5 13:47:30 scottsiu postgres[5009]: [13-1] 2009-04-05 13:47:30 CST LOG:  restored log file "000000010000000000000007" from archive
    Apr  5 13:47:31 scottsiu postgres[5009]: [14-1] 2009-04-05 13:47:31 CST LOG:  restored log file "000000010000000000000008" from archive
    Apr  5 13:47:31 scottsiu postgres[5009]: [15-1] 2009-04-05 13:47:31 CST LOG:  restored log file "000000010000000000000009" from archive
Warm Standby is now in effect between the two database systems.

6. Test

    [enterprisedb@server1 ~]$ /opt/PostgresPlus/8.3AS/dbserver/bin/pgbench -i -P edb test1
    creating tables...
    10000 tuples done.
    20000 tuples done.
    30000 tuples done.
    40000 tuples done.
    50000 tuples done.
    60000 tuples done.
    70000 tuples done.
    80000 tuples done.
    90000 tuples done.
    100000 tuples done.
    set primary key...
    vacuum...done.

    [enterprisedb@server1 ~]$ tail -f /var/log/messages
    Apr  5 13:48:28 scottsiu postgres[5032]: [4-1] 2009-04-05 13:48:28 CST LOG:  duration: 1227.833 ms  statement: create database test1;
    Apr  5 13:48:36 scottsiu postgres[5036]: [4-1] 2009-04-05 13:48:36 CST ERROR:  table "branches" does not exist
    Apr  5 13:48:36 scottsiu postgres[5036]: [4-2] 2009-04-05 13:48:36 CST STATEMENT:  drop table branches
    Apr  5 13:48:36 scottsiu postgres[5036]: [5-1] 2009-04-05 13:48:36 CST ERROR:  table "tellers" does not exist
    Apr  5 13:48:36 scottsiu postgres[5036]: [5-2] 2009-04-05 13:48:36 CST STATEMENT:  drop table tellers
    Apr  5 13:48:36 scottsiu postgres[5036]: [6-1] 2009-04-05 13:48:36 CST ERROR:  table "accounts" does not exist
    Apr  5 13:48:36 scottsiu postgres[5036]: [6-2] 2009-04-05 13:48:36 CST STATEMENT:  drop table accounts
    Apr  5 13:48:36 scottsiu postgres[5036]: [7-1] 2009-04-05 13:48:36 CST ERROR:  table "history" does not exist
    Apr  5 13:48:36 scottsiu postgres[5036]: [7-2] 2009-04-05 13:48:36 CST STATEMENT:  drop table history
    Apr  5 13:48:38 scottsiu postgres[4646]: [9-1] 2009-04-05 13:48:38 CST LOG:  archived transaction log file "00000001000000000000000A"

    [enterprisedb@server2 ~]$ tail -f /var/log/messages
    Apr  5 13:48:40 scottsiu postgres[5009]: [16-1] 2009-04-05 13:48:40 CST LOG:  restored log file "00000001000000000000000A" from archive

Segment 00000001000000000000000A is archived on server1 at 13:48:38 and restored on server2 two seconds later.
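
To follow the standby's progress without reading the whole log, the name of the last restored segment can be pulled out of the syslog lines. This is a sketch that assumes the message format shown in the log above; the function name last_restored_segment is made up.

```shell
#!/bin/sh
# Read syslog lines on stdin and print the name of the last WAL segment
# reported as restored. Only 24-character hexadecimal names match, so
# .backup and .history files are ignored.
last_restored_segment() {
    sed -n 's/.*restored log file "\([0-9A-F]\{24\}\)" from archive.*/\1/p' | tail -n 1
}

# Example, run on server2:
#   last_restored_segment < /var/log/messages
```

Comparing this name with the newest segment archived on the primary gives a rough idea of how far the standby lags behind.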
IV. Simulate a primary server failure and perform a "zero data loss" recovery
1. Stop the database on server1

    [enterprisedb@server1 ~]$ /etc/init.d/edb_8.3 stop

You can also be more brutal, for example kill -9 <pid>.
2. Compare the WAL files on server1 with the WAL archive on server2
server1

    [enterprisedb@server1 ~]$ ll /opt/PostgresPlus/8.3AS/data/pg_xlog/
    total 65624
    -rw------- 1 enterprisedb edb    236 2009-04-05 13:19 000000010000000000000006.00255200.backup
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 14:08 00000001000000000000000B
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:28 00000001000000000000000C
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:31 00000001000000000000000D
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:48 00000001000000000000000E
    drwx------ 2 enterprisedb edb    4096 2009-04-05 14:08 archive_status

server2

    [enterprisedb@server2 ~]$ ll /opt/PostgresPlus/warm_wal
    total 16408
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:19 000000010000000000000006
    -rw------- 1 enterprisedb edb    236 2009-04-05 13:19 000000010000000000000006.00255200.backup
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:21 000000010000000000000007
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:22 000000010000000000000008
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:23 000000010000000000000009
    -rw------- 1 enterprisedb edb 16777216 2009-04-05 13:30 00000001000000000000000A

Copy the files on server1 whose numbers are higher than those on server2 into /opt/PostgresPlus/warm_wal on server2:

    [enterprisedb@server1 ~]$ scp /opt/PostgresPlus/8.3AS/data/pg_xlog/00000001000000000000000{B,C,D,E} enterprisedb@server2pri.example.com:/opt/PostgresPlus/warm_wal/
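
Picking the files by eye is error-prone under pressure. The sketch below lists the segments in a pg_xlog directory whose names sort after the newest segment already on the standby. The function name segments_newer_than is made up; the comparison is a plain string comparison, which works because segment names are fixed-width uppercase hexadecimal.

```shell
#!/bin/sh
# segments_newer_than XLOG_DIR LAST_ARCHIVED
# Print the names of WAL segments in XLOG_DIR that sort after LAST_ARCHIVED.
segments_newer_than() {
    xlog_dir=$1
    last_archived=$2
    ls "$xlog_dir" |
        # Keep only 24-character hex names (skips .backup, archive_status).
        grep -E '^[0-9A-F]{24}$' |
        # Appending "" forces awk to compare as strings, not as numbers.
        LC_ALL=C awk -v last="$last_archived" '($0 "") > (last "")' |
        LC_ALL=C sort
}

# Example, run on server1 with the newest name seen on server2:
#   segments_newer_than /opt/PostgresPlus/8.3AS/data/pg_xlog 00000001000000000000000A
```

With the listings above, this would print the B, C, D and E segments, the same set that the scp command copies.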
Check the system log on server2:

    [enterprisedb@server2 ~]$ tail -f /var/log/messages
    Apr  5 15:28:27 scottsiu postgres[7479]: [14-1] 2009-04-05 15:28:27 CST LOG:  restored log file "00000001000000000000000B" from archive
    Apr  5 15:28:30 scottsiu postgres[7479]: [15-1] 2009-04-05 15:28:30 CST LOG:  record with zero length at 0/12013188
    Apr  5 15:28:30 scottsiu postgres[7479]: [16-1] 2009-04-05 15:28:30 CST LOG:  redo done at 0/12013138
    Apr  5 15:28:30 scottsiu postgres[7479]: [17-1] 2009-04-05 15:28:30 CST LOG:  restored log file "00000001000000000000000B" from archive
    Apr  5 15:28:30 scottsiu postgres[7479]: [18-1] 2009-04-05 15:28:30 CST LOG:  restored log file "00000002.history" from archive
    Apr  5 15:28:46 scottsiu postgres[7479]: [19-1] 2009-04-05 15:28:46 CST LOG:  selected new timeline ID: 3
    Apr  5 15:28:46 scottsiu postgres[7479]: [20-1] 2009-04-05 15:28:46 CST LOG:  restored log file "00000003.history" from archive
    Apr  5 15:28:46 scottsiu postgres[7479]: [21-1] 2009-04-05 15:28:46 CST LOG:  archive recovery complete
    Apr  5 15:28:46 scottsiu postgres[7479]: [22-1] 2009-04-05 15:28:46 CST LOG:  database system is ready

The "record with zero length" line marks the end of the valid WAL in the last, partially filled segment. Recovery stops there and the database opens.

server2 is now available. You can connect with edb-psql, run any operation, and check whether the last data written on server1 is still there.
The data is now in sync!
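
Rather than retrying the connection by hand, a small loop can wait until the recovered database accepts connections. This is a sketch under assumptions: the function name wait_for_database is made up, edb-psql is assumed to accept -p and -c like psql, and the login is assumed to work without a password prompt (for example through ~/.pgpass).

```shell
#!/bin/sh
# Path to the client on server2; override PSQL to use a different binary.
PSQL=${PSQL:-/opt/PostgresPlus/8.3R2AS/dbserver/bin/edb-psql}

# wait_for_database [TRIES]
# Try a trivial query every 2 seconds; return 0 as soon as it succeeds,
# or 1 after TRIES failed attempts (default 30).
wait_for_database() {
    tries=${1:-30}
    while [ "$tries" -gt 0 ]; do
        if "$PSQL" -p 5444 -c 'select 1;' edb >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 2
    done
    return 1
}

# Example, run on server2 after the missing WAL files have been copied:
#   wait_for_database && echo "server2 is accepting connections"
```

While the standby is still in recovery every attempt fails with "the database system is starting up", so the loop only returns successfully after recovery has completed.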

3. Restart the database service on server2

    [enterprisedb@server2 ~]$ /etc/init.d/edb_8.3 restart

This is mainly a safety check: if the restart goes cleanly, the database can be used with more confidence.
