postgresql 高可用_Patroni + Etcd 实现高可用之二: 高可用测试

v2-78806d9fc7d41ed61989af4cecb58225_1440w.jpg?source=172ae18b

博客链接: https://postgres.fun/20200530212600.html

上篇博客介绍了Patroni + Etcd部署PostgreSQL高可用,这篇博客进行高可用测试。

规划以下高可用测试场景:

  • 场景一 停备库PostgreSQL实例: 停任意一个备库实例,观察主备情况。
  • 场景二 停主库PostgreSQL实例: 停主库实例,观察主备情况。
  • 场景三 停主库主机: 停主库主机,观察主备情况。
  • 场景四 停两个备库主机: 停两个备库主机,观察主备情况。
  • 场景四 停一个etcd备进程: 停任意一个etcd备进程,观察etcd、patroni、主备情况。
  • 场景五 停etcd主进程: 停etcd主进程,观察etcd、patroni、主备情况。
  • 场景六 停两个etcd备进程: 停两个etcd备进程,观察etcd、patroni、主备情况。
  • 场景七 停一个patroni备进程: 停一个patroni备进程,观察etcd、patroni、主备情况。
  • 场景八 停patroni主进程 : 停patroni主进程,观察etcd、patroni、主备情况。
  • 场景九 停两个patroni备进程: 停两个patroni备进程,观察etcd、patroni、主备情况。

本文对以上九个高可用场景进行演练,记录前四个场景的演练过程,后五个场景仅记录演练结果。

场景一 停备库PostgreSQL实例

集群初始状态如下:

[pg12@ydtf01 ~]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
|   Member  |        Host       |  Role  |  State  | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 |        | running | 18 |         0 |
| pg_ydtf02 | 192.168.2.12:1921 | Leader | running | 18 |           |
| pg_ydtf03 | 192.168.2.13:1921 |        | running | 18 |         0 |
+-----------+-------------------+--------+---------+----+-----------+

在主机pg_ydtf01上停从库,如下:

[pg12@ydtf01 ~]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped

查看pg_ydtf01主机上的patroni日志,如下:

2020-05-26 08:44:32,392 INFO: no action.  i am a secondary and i am following a leader
2020-05-26 08:44:42,374 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:42,374 INFO: does not have lock
2020-05-26 08:44:42,388 INFO: no action.  i am a secondary and i am following a leader


2020-05-26 08:44:52,381 WARNING: Postgresql is not running.
2020-05-26 08:44:52,382 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,407 INFO: pg_controldata:
  pg_control version number: 1201
  Catalog version number: 201909212
  Database system identifier: 6823331807018309697
  Database cluster state: shut down in recovery
  pg_control last modified: Tue May 26 08:44:45 2020
  Latest checkpoint location: 0/15000098
  Latest checkpoint's REDO location: 0/15000060
  Latest checkpoint's REDO WAL file: 000000120000000000000015
  Latest checkpoint's TimeLineID: 18
  Latest checkpoint's PrevTimeLineID: 18
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:505
  Latest checkpoint's NextOID: 24576
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 480
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 505
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Tue May 26 08:33:37 2020
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/16000450
  Min recovery ending locs timeline: 18
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 16384
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: e0b437070cc63fb2e55b904a85a73d4aa92d629ab56540beb214451f52967090

2020-05-26 08:44:52,411 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,411 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,417 INFO: starting as a secondary
2020-05-26 08:44:52,428 INFO: closed patroni connection to the postgresql cluster
2020-05-26 08:44:52.829 CST [39065] LOG:  starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-26 08:44:52.830 CST [39065] LOG:  listening on IPv4 address "0.0.0.0", port 1921
2020-05-26 08:44:52.832 CST [39065] LOG:  listening on Unix socket "/tmp/.s.PGSQL.1921"
2020-05-26 08:44:52,836 INFO: postmaster pid=39065
2020-05-26 08:44:52.849 CST [39065] LOG:  redirecting log output to logging collector process
2020-05-26 08:44:52.849 CST [39065] HINT:  Future log output will appear in directory "log".
localhost:1921 - rejecting connections
localhost:1921 - accepting connections
2020-05-26 08:44:52,891 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,891 INFO: does not have lock
2020-05-26 08:44:52,891 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 08:44:52,938 INFO: no action.  i am a secondary and i am following a leader

2020-05-26 08:45:02,388 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:45:02,388 INFO: does not have lock
2020-05-26 08:45:02,397 INFO: no action.  i am a secondary and i am following a leader

通过日志看出,patroni 监控到本机上的备库宕机后,自动将它拉起。

查看集群状态,没有变化,如下:

[pg12@ydtf01 ~]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
|   Member  |        Host       |  Role  |  State  | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 |        | running | 18 |         0 |
| pg_ydtf02 | 192.168.2.12:1921 | Leader | running | 18 |           |
| pg_ydtf03 | 192.168.2.13:1921 |        | running | 18 |         0 |
+-----------+-------------------+--------+---------+----+-----------+

场景二 停主库PostgreSQL实例

pg_ydtf02主机上关闭主库,如下:

[pg12@ydtf02 patroni]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped

监控主机pg_ydtf02上的patroni日志,如下:

2020-05-26 08:50:14,195 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:14,218 INFO: no action.  i am the leader with the lock

2020-05-26 08:50:24,198 WARNING: Postgresql is not running.
2020-05-26 08:50:24,198 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,222 INFO: pg_controldata:
  pg_control version number: 1201
  Catalog version number: 201909212
  Database system identifier: 6823331807018309697
  Database cluster state: shut down
  pg_control last modified: Tue May 26 08:50:15 2020
  Latest checkpoint location: 0/16000450
  Latest checkpoint's REDO location: 0/16000450
  Latest checkpoint's REDO WAL file: 000000120000000000000016
  Latest checkpoint's TimeLineID: 18
  Latest checkpoint's PrevTimeLineID: 18
  Latest checkpoint's full_page_writes: on
  Latest checkpoint's NextXID: 0:506
  Latest checkpoint's NextOID: 24576
  Latest checkpoint's NextMultiXactId: 1
  Latest checkpoint's NextMultiOffset: 0
  Latest checkpoint's oldestXID: 480
  Latest checkpoint's oldestXID's DB: 1
  Latest checkpoint's oldestActiveXID: 0
  Latest checkpoint's oldestMultiXid: 1
  Latest checkpoint's oldestMulti's DB: 1
  Latest checkpoint's oldestCommitTsXid: 0
  Latest checkpoint's newestCommitTsXid: 0
  Time of latest checkpoint: Tue May 26 08:50:15 2020
  Fake LSN counter for unlogged rels: 0/3E8
  Minimum recovery ending location: 0/0
  Min recovery ending locs timeline: 0
  Backup start location: 0/0
  Backup end location: 0/0
  End-of-backup record required: no
  wal_level setting: replica
  wal_log_hints setting: on
  max_connections setting: 100
  max_worker_processes setting: 8
  max_wal_senders setting: 10
  max_prepared_xacts setting: 0
  max_locks_per_xact setting: 64
  track_commit_timestamp setting: off
  Maximum data alignment: 8
  Database block size: 8192
  Blocks per segment of large relation: 131072
  WAL block size: 16384
  Bytes per WAL segment: 16777216
  Maximum length of identifiers: 64
  Maximum columns in an index: 32
  Maximum size of a TOAST chunk: 1996
  Size of a large-object chunk: 2048
  Date/time type storage: 64-bit integers
  Float4 argument passing: by value
  Float8 argument passing: by value
  Data page checksum version: 0
  Mock authentication nonce: e0b437070cc63fb2e55b904a85a73d4aa92d629ab56540beb214451f52967090

2020-05-26 08:50:24,226 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,227 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,233 INFO: starting as readonly because i had the session lock
2020-05-26 08:50:24,234 INFO: closed patroni connection to the postgresql cluster
2020-05-26 08:50:24.490 CST [39033] LOG:  starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-26 08:50:24.491 CST [39033] LOG:  listening on IPv4 address "0.0.0.0", port 1921
2020-05-26 08:50:24.493 CST [39033] LOG:  listening on Unix socket "/tmp/.s.PGSQL.1921"
2020-05-26 08:50:24,512 INFO: postmaster pid=39033
2020-05-26 08:50:24.517 CST [39033] LOG:  redirecting log output to logging collector process
2020-05-26 08:50:24.517 CST [39033] HINT:  Future log output will appear in directory "log".
localhost:1921 - rejecting connections
localhost:1921 - accepting connections
2020-05-26 08:50:24,540 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,540 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 08:50:24,579 INFO: promoted self to leader because i had the session lock
server promoting
2020-05-26 08:50:24,585 INFO: cleared rewind state after becoming the leader
2020-05-26 08:50:25,605 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:25,651 INFO: no action.  i am the leader with the lock
2020-05-26 08:50:35,605 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:35,632 INFO: no action.  i am the leader with the lock

根据日志看出,pg_ydtf02上的patroni发现本机的PostgreSQL宕机后,自动将它拉起。

查看集群状态,没有发生变化,如下:

[pg12@ydtf02 patroni]$ patronictl -c /home/pg12/patroni/pg_ydtf02.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
|   Member  |        Host       |  Role  |  State  | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 |        | running | 19 |         0 |
| pg_ydtf02 | 192.168.2.12:1921 | Leader | running | 19 |           |
| pg_ydtf03 | 192.168.2.13:1921 |        | running | 19 |         0 |
+-----------+-------------------+--------+---------+----+-----------+

场景三 停主库主机

重启 pg_ydtf02 主机,如下:

[root@ydtf02 ~]# reboot

查看 pg_ydtf01和pg_ydtf03的patroni的日志,发现 pg_ydtf01获得了leader锁,并切换为主库,日志如下:

2020-05-26 08:56:32,055 INFO: Got response from pg_ydtf03 http://192.168.2.13:8008/patroni: {"state": "running", "postmaster_start_time": "2020-05-26 08:37:43.074 CST", "role": "replica", "server_version": 120002, "cluster_unlocked": true, "xlog": {"received_location": 369100432, "replayed_location": 369100432, "replayed_timestamp": "2020-05-26 08:34:34.945 CST", "paused": false}, "timeline": 19, "database_system_identifier": "6823331807018309697", "patroni": {"version": "1.6.5", "scope": "pg_ydtf"}}
2020-05-26 08:56:34,010 WARNING: Request failed to pg_ydtf02: GET http://192.168.2.12:8008/patroni (HTTPConnectionPool(host='192.168.2.12', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f872c3da280>, 'Connection to 192.168.2.12 timed out. (connect timeout=2)')))
2020-05-26 08:56:34,039 INFO: promoted self to leader by acquiring session lock
server promoting
2020-05-26 08:56:34,049 INFO: cleared rewind state after becoming the leader
2020-05-26 08:56:35,070 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:56:35,120 INFO: no action.  i am the leader with the lock
2020-05-26 08:56:45,075 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:56:45,112 INFO: no action.  i am the leader with the lock
2020-05-26 08:56:55,071 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:56:55,093 INFO: no action.  i am the leader with the lock
2020-05-26 08:57:05,071 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:57:05,095 INFO: no action.  i am the leader with the lock

查看集群状态,发现进行了主备切换,如下:

[pg12@ydtf01 ~]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
|   Member  |        Host       |  Role  |  State  | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | Leader | running | 20 |           |
| pg_ydtf03 | 192.168.2.13:1921 |        | running | 20 |         0 |
+-----------+-------------------+--------+---------+----+-----------+

pg_ydtf01升级为主库,验证 pg_ydtf01的主备状态,如下:

[pg12@ydtf01 ~]$ pg_controldata | grep cluster
Database cluster state:               in production

之后尝试启动pg_ydtf02上的备库,报时间线的错误,如下所示:

2020-05-29 16:53:52.780 CST,,,2574,,5ed0cd5f.a0e,27,,2020-05-29 16:52:47 CST,,0,LOG,00000,"restarted WAL streaming at 0/1C000000 on timeline 29",,,,,,,,,""
2020-05-29 16:53:52.785 CST,,,2574,,5ed0cd5f.a0e,28,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""
2020-05-29 16:53:52.786 CST,,,2505,,5ed0cd5a.9c9,20,,2020-05-29 16:52:42 CST,1/0,0,LOG,00000,"new timeline 30 forked off current database system timeline 29 before current recovery point 0/1C025A60",,,,,,,,,""
2020-05-29 16:53:57.295 CST,,,2600,"127.0.0.1:45300",5ed0cda5.a28,1,"",2020-05-29 16:53:57 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=45300",,,,,,,,,""
2020-05-29 16:53:57.297 CST,"repuser","",2600,"127.0.0.1:45300",5ed0cda5.a28,2,"authentication",2020-05-29 16:53:57 CST,3/9,0,LOG,00000,"replication connection authorized: user=repuser",,,,,,,,,""
2020-05-29 16:53:57.785 CST,,,2574,,5ed0cd5f.a0e,29,,2020-05-29 16:52:47 CST,,0,LOG,00000,"restarted WAL streaming at 0/1C000000 on timeline 29",,,,,,,,,""
2020-05-29 16:53:57.789 CST,,,2574,,5ed0cd5f.a0e,30,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""
2020-05-29 16:53:57.790 CST,,,2505,,5ed0cd5a.9c9,21,,2020-05-29 16:52:42 CST,1/0,0,LOG,00000,"new timeline 30 forked off current database system timeline 29 before current recovery point 0/1C025A60",,,,,,,,,""
2020-05-29 16:54:02.792 CST,,,2574,,5ed0cd5f.a0e,31,,2020-05-29 16:52:47 CST,,0,LOG,00000,"restarted WAL streaming at 0/1C000000 on timeline 29",,,,,,,,,""
2020-05-29 16:54:02.796 CST,,,2574,,5ed0cd5f.a0e,32,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""
2020-05-29 16:54:02.796 CST,,,2505,,5ed0cd5a.9c9,22,,2020-05-29 16:52:42 CST,1/0,0,LOG,00000,"new timeline 30 forked off current database system timeline 29 before current recovery point 0/1C025A60",,,,,,,,,""
2020-05-29 16:54:07.805 CST,,,2574,,5ed0cd5f.a0e,34,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""

解决方法: 关闭pg_ydtf02上的数据库,之后执行 pg_rewind增量同步主库上的数据,如下:

[pg12@ydtf02 log]$ pg_rewind --target-pgdata $PGDATA --source-server='host=192.168.2.11 port=1921 user=postgres dbname=postgres' -P
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/1C0259E8 on timeline 29
pg_rewind: rewinding from last common checkpoint at 0/1C0252E8 on timeline 29
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 325 MB (total source directory size is 345 MB)
333628/333628 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!

场景四 停两个备库主机

集群状态如下:

[pg12@ydtf01 patroni]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
|   Member  |        Host       |  Role  |  State  | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 |        | running | 25 |         0 |
| pg_ydtf02 | 192.168.2.12:1921 |        | running | 25 |         0 |
| pg_ydtf03 | 192.168.2.13:1921 | Leader | running | 25 |           |
+-----------+-------------------+--------+---------+----+-----------+

在 pg_ydtf01 和 pg_ydtf02上执行 reboot模拟关闭主机,发现pg_ydtf03上的主库PostgreSQL实例也宕掉了,数据库日志如下:

[pg12@ydtf03 log]$ view postgresql-2020-05-29_153456.csv
2020-05-29 15:34:56.585 CST,,,35725,,5ed0bb20.8b8d,1,,2020-05-29 15:34:56 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2020-05-29 15:34:56.592 CST,,,35729,,5ed0bb20.8b91,1,,2020-05-29 15:34:56 CST,,0,LOG,00000,"database system was interrupted; last known up at 2020-05-29 15:09:55 CST",,,,,,,,,""
2020-05-29 15:34:56.593 CST,,,35731,"[local]",5ed0bb20.8b93,1,"",2020-05-29 15:34:56 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2020-05-29 15:34:56.593 CST,"postgres","postgres",35731,"[local]",5ed0bb20.8b93,2,"",2020-05-29 15:34:56 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-05-29 15:34:56.593 CST,,,35730,"127.0.0.1:49080",5ed0bb20.8b92,1,"",2020-05-29 15:34:56 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=49080",,,,,,,,,""
2020-05-29 15:34:56.593 CST,"rewind_user","postgres",35730,"127.0.0.1:49080",5ed0bb20.8b92,2,"",2020-05-29 15:34:56 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-05-29 15:34:56.603 CST,,,35733,"127.0.0.1:49084",5ed0bb20.8b95,1,"",2020-05-29 15:34:56 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=49084",,,,,,,,,""
2020-05-29 15:34:56.603 CST,"rewind_user","postgres",35733,"127.0.0.1:49084",5ed0bb20.8b95,2,"",2020-05-29 15:34:56 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-05-29 15:34:56.634 CST,,,35729,,5ed0bb20.8b91,2,,2020-05-29 15:34:56 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2020-05-29 15:34:56.637 CST,,,35729,,5ed0bb20.8b91,3,,2020-05-29 15:34:56 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,,""
2020-05-29 15:34:56.641 CST,,,35729,,5ed0bb20.8b91,4,,2020-05-29 15:34:56 CST,1/0,0,LOG,00000,"redo starts at 0/1C021F68",,,,,,,,,""
2020-05-29 15:34:56.641 CST,,,35729,,5ed0bb20.8b91,5,,2020-05-29 15:34:56 CST,1/0,0,LOG,00000,"invalid record length at 0/1C022050: wanted 24, got 0",,,,,,,,,""
2020-05-29 15:34:56.643 CST,,,35729,,5ed0bb20.8b91,6,,2020-05-29 15:34:56 CST,1/0,0,LOG,00000,"consistent recovery state reached at 0/1C022050",,,,,,,,,""
2020-05-29 15:34:56.644 CST,,,35725,,5ed0bb20.8b8d,2,,2020-05-29 15:34:56 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""

从日志看出,主库异常宕机后,以备库角色启动,验证数据库角色,如下:

[pg12@ydtf03 log]$ pg_controldata | grep cluster
Database cluster state:               in archive recovery

同时,pg_ydtf03上的patroni日志出现大量以下信息:

2020-05-29 15:34:51,784 ERROR: Request to server http://192.168.2.12:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.2.12', port=2379): Max retries exceeded with url: /v2/keys/service/pg_ydtf/leader (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f3de40eb790>, 'Connection to 192.168.2.12 timed out. (connect timeout=1)'))")
2020-05-29 15:34:51,785 INFO: Reconnection allowed, looking for another server.
2020-05-29 15:34:51,787 ERROR:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 422, in wrapper
    retval = func(self, *args, **kwargs) is not None
  File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 621, in _update_leader
    return self.retry(self._client.write, self.leader_path, self._name, prevValue=self._name, ttl=self._ttl)
  File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 406, in retry
    return retry(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/patroni/utils.py", line 330, in __call__
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/etcd/client.py", line 500, in write
    response = self.api_execute(path, method, params=params)
  File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 264, in api_execute
    raise ex
  File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 248, in api_execute
    response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 219, in _do_http_request
    raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster
2020-05-29 15:34:51,806 ERROR: failed to update leader lock
2020-05-29 15:34:55,268 INFO: Selected new etcd server http://192.168.2.12:2379

可见,停掉两个备库后,主库也会宕,同时主库角色切换为备库,由于备库仅提供只读服务,对生产系统有较大影响,引发生产故障。

总结

汇总九个高可用测试场景演练结果,如下:

v2-11a75543d093141081c0a3ac75888bae_b.jpg

参考

  • Patroni + Etcd 实现高可用之一: 安装部署
  • github patroni
  • Patroni document
  • Etcd document
  • pg_rewind
  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值