PostgreSQL-Patroni高可用启动报错

环境说明

一主两从:

角色主机组件
pa-pg-1172.22.138.220etcd、patroni、postgresql
pa-pg-2172.22.138.219etcd、patroni、postgresql
pa-pg-3172.22.138.218etcd、patroni、postgresql

安装部署好Patroni高可用后,主库pa-pg-1和从库pa-pg-3均启动正常,而pa-pg-2的patroni日志一直输出以下报错:

...
Aug 15 11:00:23 DB-test2 patroni: 2024-08-15 11:09:56,422 ERROR: Exception when working with leader
Aug 15 11:00:23 DB-test2 patroni: Traceback (most recent call last):
Aug 15 11:00:23 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/rewind.py", line 80, in check_leader_is_not_in_recovery
Aug 15 11:00:23 DB-test2 patroni: with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
Aug 15 11:00:23 DB-test2 patroni: File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
Aug 15 11:00:23 DB-test2 patroni: return next(self.gen)
Aug 15 11:00:23 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
Aug 15 11:00:23 DB-test2 patroni: conn = psycopg.connect(**kwargs)
Aug 15 11:00:23 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/psycopg.py", line 104, in connect
Aug 15 11:00:23 DB-test2 patroni: ret = _connect(*args, **kwargs)
Aug 15 11:00:23 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
Aug 15 11:00:23 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/extensions.py", line 175, in make_dsn
Aug 15 11:00:23 DB-test2 patroni: psycopg2.ProgrammingError: invalid dsn: invalid connection option "target_session_attrs"
Aug 15 11:00:23 DB-test2 patroni: 2024-08-15 11:09:56,425 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)
...

排查思路

1.首先一直关注的报错是"ERROR: Exception when working with leader"所以按照这个报错,首先去排查了patroni各节点的状态,但是通过节点状态并没有发现什么明显异常。

postgres # patronictl -c /etc/patroni.yml list
+ Cluster: pgsql (7400977424401590005) ----------+----+-----------+-----------------+-----------------------------+
| Member  | Host           | Role    | State     | TL | Lag in MB | Pending restart | Pending restart reason      |
+---------+----------------+---------+-----------+----+-----------+-----------------+-----------------------------+
| pa-pg-1 | 172.22.138.220 | Leader  | running   |  3 |           |                 |                             |
| pa-pg-2 | 172.22.138.219 | Replica | streaming |  3 |         0 |                 |                             |
| pa-pg-3 | 172.22.138.218 | Replica | streaming |  3 |         0 | *               | max_connections: 10000->100 |
+---------+----------------+---------+-----------+----+-----------+-----------------+-----------------------------+

2.排查主库的pg_stat_replication和从库的pg_wal_stat_reciver,检查后未见明显异常

postgres=# select * from pg_stat_replication;
+-------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------+
|  pid  | usesysid | usename | application_name |  client_addr   | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time           |
+-------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------+
| 12368 |    16384 | replica | pa-pg-2          | 172.22.138.219 |                 |       47170 | 2024-08-15 10:23:36.860733+08 |              | streaming | 0/DC000000 | 0/DC000000 | 0/DC000000 | 0/DC000000 |           |           |            |             0 | async      | 2024-08-15 11:10:18.855529+08 |
| 11101 |    16384 | replica | pa-pg-3          | 172.22.138.218 |                 |       55800 | 2024-08-15 09:57:31.073827+08 |              | streaming | 0/DC000000 | 0/DC000000 | 0/DC000000 | 0/DC000000 |           |           |            |             0 | async      | 2024-08-15 11:10:19.992588+08 |
+-------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------+
...
postgres=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid                   | 18720
status                | streaming
receive_start_lsn     | 0/DB000000
receive_start_tli     | 3
written_lsn           | 0/DC000000
flushed_lsn           | 0/DC000000
received_tli          | 3
last_msg_send_time    | 2024-08-15 11:11:08.623733+08
last_msg_receipt_time | 2024-08-15 11:11:08.623328+08
latest_end_lsn        | 0/DC000000
latest_end_time       | 2024-08-15 11:09:52.14472+08
slot_name             | pa_pg_2
sender_host           | 172.22.138.220
sender_port           | 5432
conninfo              | user=replica passfile=/home/postgres/pgpass channel_binding=prefer dbname=replication host=172.22.138.220 port=5432 application_name=pa-pg-2 fallback_application_name=pgsql sslmode=prefer sslcompression=0 sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres target_session_attrs=any

3.查看postgresql的server日志,更是一点信息都没有。

2024-08-15 11:07:18 CST--- :LOG:  redo starts at 0/DA0025B8
2024-08-15 11:07:18 CST--- :LOG:  consistent recovery state reached at 0/DB000110
2024-08-15 11:07:18 CST--- :LOG:  invalid record length at 0/DB000110: wanted 24, got 0
2024-08-15 11:07:18 CST--- :LOG:  database system is ready to accept read-only connections
2024-08-15 11:07:18 CST--- :LOG:  started streaming WAL from primary at 0/DB000000 on timeline 3
2024-08-15 11:10:06 CST--- :LOG:  received fast shutdown request
2024-08-15 11:10:06 CST--- :LOG:  aborting any active transactions
2024-08-15 11:10:06 CST-postgres-127.0.0.1-Patroni heartbeat :FATAL:  terminating connection due to administrator command
2024-08-15 11:10:06 CST--- :FATAL:  terminating walreceiver process due to administrator command
2024-08-15 11:10:06 CST--- :LOG:  shutting down
2024-08-15 11:10:06 CST--- :LOG:  database system is shut down
2024-08-15 11:10:10 CST--- :LOG:  starting PostgreSQL 14.13 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2024-08-15 11:10:10 CST--- :LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-08-15 11:10:10 CST--- :LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-08-15 11:10:10 CST--- :LOG:  database system was shut down in recovery at 2024-08-15 11:10:06 CST
2024-08-15 11:10:10 CST--- :LOG:  entering standby mode
2024-08-15 11:10:10 CST--- :LOG:  redo starts at 0/DA0025B8
2024-08-15 11:10:10 CST--- :LOG:  consistent recovery state reached at 0/DB000110
2024-08-15 11:10:10 CST--- :LOG:  invalid record length at 0/DB000110: wanted 24, got 0
2024-08-15 11:10:10 CST--- :LOG:  database system is ready to accept read-only connections
2024-08-15 11:10:10 CST--- :LOG:  started streaming WAL from primary at 0/DB000000 on timeline 3

4.随后寻求外援,通过官方github的issues,有类似报错,建议为复制用户replica添加postgres库的connect权限,司马当活马医了。去添加了相关的权限。

https://github.com/patroni/patroni/issues/2256

postgres=# grant CONNECT on database postgres to replica;
GRANT

5.重启pa-pg-2服务,查看patroni的日志发现,报错是一点没变。此时把报错的重点放在了这条上面"psycopg2.ProgrammingError: invalid dsn: invalid connection option “target_session_attrs”"

Aug 15 11:09:56 DB-test2 patroni: 2024-08-15 11:09:56,422 ERROR: Exception when working with leader
Aug 15 11:09:56 DB-test2 patroni: Traceback (most recent call last):
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/rewind.py", line 80, in check_leader_is_not_in_recovery
Aug 15 11:09:56 DB-test2 patroni: with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
Aug 15 11:09:56 DB-test2 patroni: File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
Aug 15 11:09:56 DB-test2 patroni: return next(self.gen)
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
Aug 15 11:09:56 DB-test2 patroni: conn = psycopg.connect(**kwargs)
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/psycopg.py", line 104, in connect
Aug 15 11:09:56 DB-test2 patroni: ret = _connect(*args, **kwargs)
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/extensions.py", line 175, in make_dsn
Aug 15 11:09:56 DB-test2 patroni: psycopg2.ProgrammingError: invalid dsn: invalid connection option "target_session_attrs"
Aug 15 11:09:56 DB-test2 patroni: 2024-08-15 11:09:56,425 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)

6.通过官方github的issues,建议将psycopg2模块的版本修改为psycopg2==2.8.6,但是我的机服务器上,没有这个模块,正常启动的两台也没有这个模块。但是当我检查这个相关模块时发现了问题。

https://github.com/patroni/patroni/issues/1969

#pa-pg-1和pa-pg-3
root # pip3 list|grep psycopg2
psycopg2-binary    2.9.8

#pa-pg-2
psycopg2-binary    2.9.5

7.通过上方发现patroni的所需要的模块psycopg2-binary,三台节点唯有启动报错的pa-pg-2版本与其他两台不一致。所以决定在pa-pg-2节点重装这个模块。重装后pa-pg-2的patroni日志不再输出报错。

root # pip3 uninstall psycopg2-binary
root # pip3 install psycopg2-binary==2.9.8
...
Aug 15 11:23:14 DB-test2 systemd: Started patroni - a high-availability PostgreSQL.
Aug 15 11:23:14 DB-test2 systemd-logind: New session 886 of user root.
Aug 15 11:23:14 DB-test2 systemd: Started Session 886 of user root.
Aug 15 11:23:14 DB-test2 systemd-logind: Removed session 886.
Aug 15 11:23:15 DB-test2 patroni: postgresql parameter listen_addresses=0.0.0.0 failed validation, defaulting to None
Aug 15 11:23:15 DB-test2 patroni: postgresql parameter port=5432 failed validation, defaulting to None
Aug 15 11:23:15 DB-test2 patroni: 2024-08-15 11:23:15,225 INFO: No PostgreSQL configuration items changed, nothing to reload.
Aug 15 11:23:15 DB-test2 patroni: localhost:5432 - accepting connections
Aug 15 11:23:15 DB-test2 patroni: 2024-08-15 11:23:15,233 INFO: establishing a new patroni heartbeat connection to postgres
Aug 15 11:23:15 DB-test2 patroni: 2024-08-15 11:23:15,254 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)
Aug 15 11:23:16 DB-test2 systemd-logind: New session 887 of user root.
Aug 15 11:23:16 DB-test2 systemd: Started Session 887 of user root.
Aug 15 11:23:16 DB-test2 systemd-logind: Removed session 887.
Aug 15 11:23:16 DB-test2 patroni: 2024-08-15 11:23:16,423 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)

总结

在进行集群相关的应用部署时,只是检查好相关模块安装成功与否还不够,还需要更细致的检查相关的版本号是否一致。

  • 4
    点赞
  • 8
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值