![v2-78806d9fc7d41ed61989af4cecb58225_1440w.jpg?source=172ae18b](http://img-01.proxy.5ce.com/view/image?&type=2&guid=2fc07417-3430-eb11-8da9-e4434bdf6706&url=https://pic1.zhimg.com/v2-78806d9fc7d41ed61989af4cecb58225_1440w.jpg?source=172ae18b)
博客链接: https://postgres.fun/20200530212600.html
上篇博客介绍了Patroni + Etcd部署PostgreSQL高可用,这篇博客进行高可用测试。
规划以下高可用测试场景:
- 场景一 停备库PostgreSQL实例: 停任意一个备库实例,观察主备情况。
- 场景二 停主库PostgreSQL实例: 停主库实例,观察主备情况。
- 场景三 停主库主机: 停主库主机,观察主备情况。
- 场景四 停两个备库主机: 停两个备库主机,观察主备情况。
- 场景四 停一个etcd备进程: 停任意一个etcd备进程,观察etcd、patroni、主备情况。
- 场景五 停etcd主进程: 停etcd主进程,观察etcd、patroni、主备情况。
- 场景六 停两个etcd备进程: 停两个etcd备进程,观察etcd、patroni、主备情况。
- 场景七 停一个patroni备进程: 停一个patroni备进程,观察etcd、patroni、主备情况。
- 场景八 停patroni主进程 : 停patroni主进程,观察etcd、patroni、主备情况。
- 场景九 停两个patroni备进程: 停两个patroni备进程,观察etcd、patroni、主备情况。
本文对以上九个高可用场景进行演练,记录前四个场景的演练过程,后五个场景仅记录演练结果。
场景一 停备库PostgreSQL实例
集群初始状态如下:
[pg12@ydtf01 ~]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | | running | 18 | 0 |
| pg_ydtf02 | 192.168.2.12:1921 | Leader | running | 18 | |
| pg_ydtf03 | 192.168.2.13:1921 | | running | 18 | 0 |
+-----------+-------------------+--------+---------+----+-----------+
在主机pg_ydtf01上停从库,如下:
[pg12@ydtf01 ~]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
查看pg_ydtf01主机上的patroni日志,如下:
2020-05-26 08:44:32,392 INFO: no action. i am a secondary and i am following a leader
2020-05-26 08:44:42,374 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:42,374 INFO: does not have lock
2020-05-26 08:44:42,388 INFO: no action. i am a secondary and i am following a leader
2020-05-26 08:44:52,381 WARNING: Postgresql is not running.
2020-05-26 08:44:52,382 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,407 INFO: pg_controldata:
pg_control version number: 1201
Catalog version number: 201909212
Database system identifier: 6823331807018309697
Database cluster state: shut down in recovery
pg_control last modified: Tue May 26 08:44:45 2020
Latest checkpoint location: 0/15000098
Latest checkpoint's REDO location: 0/15000060
Latest checkpoint's REDO WAL file: 000000120000000000000015
Latest checkpoint's TimeLineID: 18
Latest checkpoint's PrevTimeLineID: 18
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:505
Latest checkpoint's NextOID: 24576
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 480
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 505
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Tue May 26 08:33:37 2020
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/16000450
Min recovery ending locs timeline: 18
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 16384
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: e0b437070cc63fb2e55b904a85a73d4aa92d629ab56540beb214451f52967090
2020-05-26 08:44:52,411 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,411 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,417 INFO: starting as a secondary
2020-05-26 08:44:52,428 INFO: closed patroni connection to the postgresql cluster
2020-05-26 08:44:52.829 CST [39065] LOG: starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-26 08:44:52.830 CST [39065] LOG: listening on IPv4 address "0.0.0.0", port 1921
2020-05-26 08:44:52.832 CST [39065] LOG: listening on Unix socket "/tmp/.s.PGSQL.1921"
2020-05-26 08:44:52,836 INFO: postmaster pid=39065
2020-05-26 08:44:52.849 CST [39065] LOG: redirecting log output to logging collector process
2020-05-26 08:44:52.849 CST [39065] HINT: Future log output will appear in directory "log".
localhost:1921 - rejecting connections
localhost:1921 - accepting connections
2020-05-26 08:44:52,891 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:44:52,891 INFO: does not have lock
2020-05-26 08:44:52,891 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 08:44:52,938 INFO: no action. i am a secondary and i am following a leader
2020-05-26 08:45:02,388 INFO: Lock owner: pg_ydtf02; I am pg_ydtf01
2020-05-26 08:45:02,388 INFO: does not have lock
2020-05-26 08:45:02,397 INFO: no action. i am a secondary and i am following a leader
通过日志看出,patroni 监控到本机上的备库宕机后,自动将它拉起。
查看集群状态,没有变化,如下:
[pg12@ydtf01 ~]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | | running | 18 | 0 |
| pg_ydtf02 | 192.168.2.12:1921 | Leader | running | 18 | |
| pg_ydtf03 | 192.168.2.13:1921 | | running | 18 | 0 |
+-----------+-------------------+--------+---------+----+-----------+
场景二 停主库PostgreSQL实例
pg_ydtf02主机上关闭主库,如下:
[pg12@ydtf02 patroni]$ pg_ctl stop -m fast
waiting for server to shut down.... done
server stopped
监控主机pg_ydtf02上的patroni日志,如下:
2020-05-26 08:50:14,195 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:14,218 INFO: no action. i am the leader with the lock
2020-05-26 08:50:24,198 WARNING: Postgresql is not running.
2020-05-26 08:50:24,198 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,222 INFO: pg_controldata:
pg_control version number: 1201
Catalog version number: 201909212
Database system identifier: 6823331807018309697
Database cluster state: shut down
pg_control last modified: Tue May 26 08:50:15 2020
Latest checkpoint location: 0/16000450
Latest checkpoint's REDO location: 0/16000450
Latest checkpoint's REDO WAL file: 000000120000000000000016
Latest checkpoint's TimeLineID: 18
Latest checkpoint's PrevTimeLineID: 18
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:506
Latest checkpoint's NextOID: 24576
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 480
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Tue May 26 08:50:15 2020
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0
Min recovery ending locs timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 16384
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: e0b437070cc63fb2e55b904a85a73d4aa92d629ab56540beb214451f52967090
2020-05-26 08:50:24,226 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,227 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,233 INFO: starting as readonly because i had the session lock
2020-05-26 08:50:24,234 INFO: closed patroni connection to the postgresql cluster
2020-05-26 08:50:24.490 CST [39033] LOG: starting PostgreSQL 12.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2020-05-26 08:50:24.491 CST [39033] LOG: listening on IPv4 address "0.0.0.0", port 1921
2020-05-26 08:50:24.493 CST [39033] LOG: listening on Unix socket "/tmp/.s.PGSQL.1921"
2020-05-26 08:50:24,512 INFO: postmaster pid=39033
2020-05-26 08:50:24.517 CST [39033] LOG: redirecting log output to logging collector process
2020-05-26 08:50:24.517 CST [39033] HINT: Future log output will appear in directory "log".
localhost:1921 - rejecting connections
localhost:1921 - accepting connections
2020-05-26 08:50:24,540 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:24,540 INFO: establishing a new patroni connection to the postgres cluster
2020-05-26 08:50:24,579 INFO: promoted self to leader because i had the session lock
server promoting
2020-05-26 08:50:24,585 INFO: cleared rewind state after becoming the leader
2020-05-26 08:50:25,605 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:25,651 INFO: no action. i am the leader with the lock
2020-05-26 08:50:35,605 INFO: Lock owner: pg_ydtf02; I am pg_ydtf02
2020-05-26 08:50:35,632 INFO: no action. i am the leader with the lock
根据日志看出,pg_ydtf02上的patroni发现本机的PostgreSQL宕机后,自动将它拉起。
查看集群状态,没有发生变化,如下:
[pg12@ydtf02 patroni]$ patronictl -c /home/pg12/patroni/pg_ydtf02.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | | running | 19 | 0 |
| pg_ydtf02 | 192.168.2.12:1921 | Leader | running | 19 | |
| pg_ydtf03 | 192.168.2.13:1921 | | running | 19 | 0 |
+-----------+-------------------+--------+---------+----+-----------+
场景三 停主库主机
重启 pg_ydtf02 主机,如下:
[root@ydtf02 ~]# reboot
查看 pg_ydtf01和pg_ydtf03的patroni的日志,发现 pg_ydtf01获得了leader锁,并切换为主库,日志如下:
2020-05-26 08:56:32,055 INFO: Got response from pg_ydtf03 http://192.168.2.13:8008/patroni: {"state": "running", "postmaster_start_time": "2020-05-26 08:37:43.074 CST", "role": "replica", "server_version": 120002, "cluster_unlocked": true, "xlog": {"received_location": 369100432, "replayed_location": 369100432, "replayed_timestamp": "2020-05-26 08:34:34.945 CST", "paused": false}, "timeline": 19, "database_system_identifier": "6823331807018309697", "patroni": {"version": "1.6.5", "scope": "pg_ydtf"}}
2020-05-26 08:56:34,010 WARNING: Request failed to pg_ydtf02: GET http://192.168.2.12:8008/patroni (HTTPConnectionPool(host='192.168.2.12', port=8008): Max retries exceeded with url: /patroni (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f872c3da280>, 'Connection to 192.168.2.12 timed out. (connect timeout=2)')))
2020-05-26 08:56:34,039 INFO: promoted self to leader by acquiring session lock
server promoting
2020-05-26 08:56:34,049 INFO: cleared rewind state after becoming the leader
2020-05-26 08:56:35,070 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:56:35,120 INFO: no action. i am the leader with the lock
2020-05-26 08:56:45,075 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:56:45,112 INFO: no action. i am the leader with the lock
2020-05-26 08:56:55,071 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:56:55,093 INFO: no action. i am the leader with the lock
2020-05-26 08:57:05,071 INFO: Lock owner: pg_ydtf01; I am pg_ydtf01
2020-05-26 08:57:05,095 INFO: no action. i am the leader with the lock
查看集群状态,发现进行了主备切换,如下:
[pg12@ydtf01 ~]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | Leader | running | 20 | |
| pg_ydtf03 | 192.168.2.13:1921 | | running | 20 | 0 |
+-----------+-------------------+--------+---------+----+-----------+
pg_ydtf01升级为主库,验证 pg_ydtf01的主备状态,如下:
[pg12@ydtf01 ~]$ pg_controldata | grep cluster
Database cluster state: in production
之后尝试启动pg_ydtf02上的备库,报时间线的错误,如下所示:
2020-05-29 16:53:52.780 CST,,,2574,,5ed0cd5f.a0e,27,,2020-05-29 16:52:47 CST,,0,LOG,00000,"restarted WAL streaming at 0/1C000000 on timeline 29",,,,,,,,,""
2020-05-29 16:53:52.785 CST,,,2574,,5ed0cd5f.a0e,28,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""
2020-05-29 16:53:52.786 CST,,,2505,,5ed0cd5a.9c9,20,,2020-05-29 16:52:42 CST,1/0,0,LOG,00000,"new timeline 30 forked off current database system timeline 29 before current recovery point 0/1C025A60",,,,,,,,,""
2020-05-29 16:53:57.295 CST,,,2600,"127.0.0.1:45300",5ed0cda5.a28,1,"",2020-05-29 16:53:57 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=45300",,,,,,,,,""
2020-05-29 16:53:57.297 CST,"repuser","",2600,"127.0.0.1:45300",5ed0cda5.a28,2,"authentication",2020-05-29 16:53:57 CST,3/9,0,LOG,00000,"replication connection authorized: user=repuser",,,,,,,,,""
2020-05-29 16:53:57.785 CST,,,2574,,5ed0cd5f.a0e,29,,2020-05-29 16:52:47 CST,,0,LOG,00000,"restarted WAL streaming at 0/1C000000 on timeline 29",,,,,,,,,""
2020-05-29 16:53:57.789 CST,,,2574,,5ed0cd5f.a0e,30,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""
2020-05-29 16:53:57.790 CST,,,2505,,5ed0cd5a.9c9,21,,2020-05-29 16:52:42 CST,1/0,0,LOG,00000,"new timeline 30 forked off current database system timeline 29 before current recovery point 0/1C025A60",,,,,,,,,""
2020-05-29 16:54:02.792 CST,,,2574,,5ed0cd5f.a0e,31,,2020-05-29 16:52:47 CST,,0,LOG,00000,"restarted WAL streaming at 0/1C000000 on timeline 29",,,,,,,,,""
2020-05-29 16:54:02.796 CST,,,2574,,5ed0cd5f.a0e,32,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""
2020-05-29 16:54:02.796 CST,,,2505,,5ed0cd5a.9c9,22,,2020-05-29 16:52:42 CST,1/0,0,LOG,00000,"new timeline 30 forked off current database system timeline 29 before current recovery point 0/1C025A60",,,,,,,,,""
2020-05-29 16:54:07.805 CST,,,2574,,5ed0cd5f.a0e,34,,2020-05-29 16:52:47 CST,,0,LOG,00000,"replication terminated by primary server","End of WAL reached on timeline 29 at 0/1C0259E8.",,,,,,,,""
解决方法: 关闭pg_ydtf02上的数据库,之后执行 pg_rewind
增量同步主库上的数据,如下:
[pg12@ydtf02 log]$ pg_rewind --target-pgdata $PGDATA --source-server='host=192.168.2.11 port=1921 user=postgres dbname=postgres' -P
pg_rewind: connected to server
pg_rewind: servers diverged at WAL location 0/1C0259E8 on timeline 29
pg_rewind: rewinding from last common checkpoint at 0/1C0252E8 on timeline 29
pg_rewind: reading source file list
pg_rewind: reading target file list
pg_rewind: reading WAL in target
pg_rewind: need to copy 325 MB (total source directory size is 345 MB)
333628/333628 kB (100%) copied
pg_rewind: creating backup label and updating control file
pg_rewind: syncing target data directory
pg_rewind: Done!
场景四 停两个备库主机
集群状态如下:
[pg12@ydtf01 patroni]$ patronictl -c /home/pg12/patroni/pg_ydtf01.yml list
+ Cluster: pg_ydtf (6823331807018309697) +---------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+-----------+-------------------+--------+---------+----+-----------+
| pg_ydtf01 | 192.168.2.11:1921 | | running | 25 | 0 |
| pg_ydtf02 | 192.168.2.12:1921 | | running | 25 | 0 |
| pg_ydtf03 | 192.168.2.13:1921 | Leader | running | 25 | |
+-----------+-------------------+--------+---------+----+-----------+
在 pg_ydtf01 和 pg_ydtf02上执行 reboot
模拟关闭主机,发现pg_ydtf03上的主库PostgreSQL实例也宕掉了,数据库日志如下:
[pg12@ydtf03 log]$ view postgresql-2020-05-29_153456.csv
2020-05-29 15:34:56.585 CST,,,35725,,5ed0bb20.8b8d,1,,2020-05-29 15:34:56 CST,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2020-05-29 15:34:56.592 CST,,,35729,,5ed0bb20.8b91,1,,2020-05-29 15:34:56 CST,,0,LOG,00000,"database system was interrupted; last known up at 2020-05-29 15:09:55 CST",,,,,,,,,""
2020-05-29 15:34:56.593 CST,,,35731,"[local]",5ed0bb20.8b93,1,"",2020-05-29 15:34:56 CST,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2020-05-29 15:34:56.593 CST,"postgres","postgres",35731,"[local]",5ed0bb20.8b93,2,"",2020-05-29 15:34:56 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-05-29 15:34:56.593 CST,,,35730,"127.0.0.1:49080",5ed0bb20.8b92,1,"",2020-05-29 15:34:56 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=49080",,,,,,,,,""
2020-05-29 15:34:56.593 CST,"rewind_user","postgres",35730,"127.0.0.1:49080",5ed0bb20.8b92,2,"",2020-05-29 15:34:56 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-05-29 15:34:56.603 CST,,,35733,"127.0.0.1:49084",5ed0bb20.8b95,1,"",2020-05-29 15:34:56 CST,,0,LOG,00000,"connection received: host=127.0.0.1 port=49084",,,,,,,,,""
2020-05-29 15:34:56.603 CST,"rewind_user","postgres",35733,"127.0.0.1:49084",5ed0bb20.8b95,2,"",2020-05-29 15:34:56 CST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2020-05-29 15:34:56.634 CST,,,35729,,5ed0bb20.8b91,2,,2020-05-29 15:34:56 CST,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2020-05-29 15:34:56.637 CST,,,35729,,5ed0bb20.8b91,3,,2020-05-29 15:34:56 CST,,0,LOG,00000,"database system was not properly shut down; automatic recovery in progress",,,,,,,,,""
2020-05-29 15:34:56.641 CST,,,35729,,5ed0bb20.8b91,4,,2020-05-29 15:34:56 CST,1/0,0,LOG,00000,"redo starts at 0/1C021F68",,,,,,,,,""
2020-05-29 15:34:56.641 CST,,,35729,,5ed0bb20.8b91,5,,2020-05-29 15:34:56 CST,1/0,0,LOG,00000,"invalid record length at 0/1C022050: wanted 24, got 0",,,,,,,,,""
2020-05-29 15:34:56.643 CST,,,35729,,5ed0bb20.8b91,6,,2020-05-29 15:34:56 CST,1/0,0,LOG,00000,"consistent recovery state reached at 0/1C022050",,,,,,,,,""
2020-05-29 15:34:56.644 CST,,,35725,,5ed0bb20.8b8d,2,,2020-05-29 15:34:56 CST,,0,LOG,00000,"database system is ready to accept read only connections",,,,,,,,,""
从日志看出,主库异常宕机后,以备库角色启动,验证数据库角色,如下:
[pg12@ydtf03 log]$ pg_controldata | grep cluster
Database cluster state: in archive recovery
同时,pg_ydtf03上的patroni日志出现大量以下信息:
2020-05-29 15:34:51,784 ERROR: Request to server http://192.168.2.12:2379 failed: MaxRetryError("HTTPConnectionPool(host='192.168.2.12', port=2379): Max retries exceeded with url: /v2/keys/service/pg_ydtf/leader (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f3de40eb790>, 'Connection to 192.168.2.12 timed out. (connect timeout=1)'))")
2020-05-29 15:34:51,785 INFO: Reconnection allowed, looking for another server.
2020-05-29 15:34:51,787 ERROR:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 422, in wrapper
retval = func(self, *args, **kwargs) is not None
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 621, in _update_leader
return self.retry(self._client.write, self.leader_path, self._name, prevValue=self._name, ttl=self._ttl)
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 406, in retry
return retry(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/patroni/utils.py", line 330, in __call__
return func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/etcd/client.py", line 500, in write
response = self.api_execute(path, method, params=params)
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 264, in api_execute
raise ex
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 248, in api_execute
response = self._do_http_request(retry, machines_cache, request_executor, method, path, **kwargs)
File "/usr/local/lib/python3.8/site-packages/patroni/dcs/etcd.py", line 219, in _do_http_request
raise etcd.EtcdConnectionFailed('No more machines in the cluster')
etcd.EtcdConnectionFailed: No more machines in the cluster
2020-05-29 15:34:51,806 ERROR: failed to update leader lock
2020-05-29 15:34:55,268 INFO: Selected new etcd server http://192.168.2.12:2379
可见,停掉两个备库后,主库也会宕,同时主库角色切换为备库,由于备库仅提供只读服务,对生产系统有较大影响,引发生产故障。
总结
汇总九个高可用测试场景演练结果,如下:
![v2-11a75543d093141081c0a3ac75888bae_b.jpg](http://img-02.proxy.5ce.com/view/image?&type=2&guid=2fc07417-3430-eb11-8da9-e4434bdf6706&url=https://pic3.zhimg.com/v2-11a75543d093141081c0a3ac75888bae_b.jpg)
参考
- Patroni + Etcd 实现高可用之一: 安装部署
- github patroni
- Patroni document
- Etcd document
- pg_rewind