PostgreSQL的HA(主备切换)

最新推荐文章于 2024-03-08 09:18:46 发布

雪兰度

最新推荐文章于 2024-03-08 09:18:46 发布

阅读量5.8k

点赞数 1

分类专栏： postgres 文章标签： postgresql centos 宕机主备切换

postgres 专栏收录该内容

2 篇文章 1 订阅

订阅专栏

PostgreSQL的HA(主备切换)

严格意义上的HA，是在主机宕机后备机作自动切换，一般还需要另一台机子做监控(或者叫vote)。以下写的是手工处理主机宕机后备机切换的功能。

环境：
DB: Postgres 9.1.2
OS: CentOS 6.0
VMWARE 6.0
主机(primary)：192.168.2.134:5432
备机(slave)：192.168.2.137:5432

一、通过流复制建立热备
(略)
参考：http://my.oschina.net/Kenyon/blog/54967

主备的区别有多种办法去判断
1.通过pg_controldata输出

主机
[postgres@localhost ~]$ pg_controldata
pg_control version number:            903
Catalog version number:               201105231
Database system identifier:           5809766734683241747
Database cluster state:               in production

备机
[postgres@localhost pg_log]$ pg_controldata
pg_control version number:            903
Catalog version number:               201105231
Database system identifier:           5809766734683241747
Database cluster state:               in archive recovery

2.通过数据字典表pg_stat_replication 主机表中能查到记录，备机表中是没有的
主机

postgres=# select procpid,usename,application_name,client_addr,client_port,state,sync_state from pg_stat_replication;
procpid | usename | application_name |  client_addr  | client_port |   state   | sync_state
---------+---------+------------------+---------------+-------------+-----------+------------
    7101 | repuser | walreceiver      | 192.168.2.137 |       46655 | streaming | async
(1 row)

–9.2版本,procpid已经改为pid

3.通过进程查看，显示wal sender的是主机，显示wal receiver的是备机

主机
[postgres@localhost ]$ ps -ef|grep postgres
postgres  5228  5224  0 05:26 ?        00:00:06 postgres: wal sender process   repuser 192.168.2.137(43246)  0/ED610000                     
.....

备机
[postgres@localhost ]$ ps -ef|grep postgres
postgres  5228  5224  0 05:22 ?        00:00:06 postgres: wal receiver process   streaming 0/ED610000
postgres  5231  5224  0 05:22 ?        00:00:01 postgres: writer process                      
.....

4.通过自带的函数，是备机则是true

postgres=# select pg_is_in_recovery();
pg_is_in_recovery 
-------------------
 f
(1 row)**

二、模拟主机宕机，备机切换到主机
1.主机宕机：
[postgres@localhost ]$pg_stop
主机停机前备机的进程

[postgres@localhost pg_log]$ ps -ef|grep postgres
root      2215  2197  0 00:35 pts/0    00:00:00 su - postgres
postgres  2217  2215  0 00:35 pts/0    00:00:00 -bash
postgres  5224     1  0 05:22 pts/0    00:00:02 /home/postgres/bin/postgres -D /database/pgdata
postgres  5226  5224  0 05:22 ?        00:00:00 postgres: logger process                      
postgres  5227  5224  0 05:22 ?        00:00:00 postgres: startup process   recovering 00000001000000000000003B
postgres  5228  5224  0 05:22 ?        00:00:06 postgres: wal receiver process   streaming 0/ED610000
postgres  5231  5224  0 05:22 ?        00:00:01 postgres: writer process                      
postgres  5232  5224  0 05:22 ?        00:00:00 postgres: stats collector process             
postgres  5902  2217  0 19:05 pts/0    00:00:00 ps -ef
postgres  5903  2217  0 19:05 pts/0    00:00:00 grep postgres

主机停机后备机的进程

[postgres@localhost pg_log]$ ps -ef|grep postgres
root      2215  2197  0 00:35 pts/0    00:00:00 su - postgres
postgres  2217  2215  0 00:35 pts/0    00:00:00 -bash
postgres  5224     1  0 05:22 pts/0    00:00:02 /home/postgres/bin/postgres -D /database/pgdata
postgres  5226  5224  0 05:22 ?        00:00:00 postgres: logger process                      
postgres  5227  5224  0 05:22 ?        00:00:00 postgres: startup process   waiting for 00000001000000000000003C
postgres  5231  5224  0 05:22 ?        00:00:01 postgres: writer process                      
postgres  5232  5224  0 05:22 ?        00:00:00 postgres: stats collector process             
postgres  5904  2217  0 19:05 pts/0    00:00:00 ps -ef
postgres  5905  2217  0 19:05 pts/0    00:00:00 grep postgres

可以发现原先的streaming进程(pid=5228)没了。同时，备机中的日志，出现大量的错误信息

2012-12-24 19:09:07.064 PST,,,5948,,50d918d3.173c,1,,2012-12-24 19:09:07 PST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host ""192.168.2.134"" and accepting
        TCP/IP connections on port 5432?
",,,,,,,,,""
2012-12-24 19:09:12.069 PST,,,5949,,50d918d8.173d,1,,2012-12-24 19:09:12 PST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host ""192.168.2.134"" and accepting
        TCP/IP connections on port 5432?
",,,,,,,,,""
2012-12-24 19:09:17.077 PST,,,5950,,50d918dd.173e,1,,2012-12-24 19:09:17 PST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host ""192.168.2.134"" and accepting
        TCP/IP connections on port 5432?
",,,,,,,,,""
2012-12-24 19:09:22.081 PST,,,5951,,50d918e2.173f,1,,2012-12-24 19:09:22 PST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host ""192.168.2.134"" and accepting
        TCP/IP connections on port 5432?

显示的错误信息很明显，primary 服务器连不上了。 此时查看备机的pg_controldata状态信息，仍是备机状态

[postgres@localhost ]$ pg_controldata
pg_control version number:            903
Catalog version number:               201105231
Database system identifier:           5809766734683241747
Database cluster state:               in archive recovery
pg_control last modified:             Mon 24 Dec 2012 07:07:54 PM PST
Latest checkpoint location:           0/F0000020
Prior checkpoint location:            0/EC01E768
Latest checkpoint's REDO location:    0/F0000020
Latest checkpoint's TimeLineID:       1
Latest checkpoint's NextXID:          0/1968
Latest checkpoint's NextOID:          24607
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        1792
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Time of latest checkpoint:            Mon 24 Dec 2012 07:05:05 PM PST
Minimum recovery ending location:     0/EC018CA0
Backup start location:                0/0
Current wal_level setting:            hot_standby
Current max_connections setting:      100
Current max_prepared_xacts setting:   0
......
[postgres@localhost ]$

2.备机切换成主机

之前备机上的recovery.conf中配置了 trigger_file = '/database/pgdata/trigger.kenyon' 要切换备机成主机，只要创建一个触发文件trigger.kenyon即可，这个名字可以随便写。
[postgres@localhost ]$ touch /database/pgdata/trigger.kenyon
此时查看备机上的日志,可以看到成功切换到主机了。

[postgres@localhost ]$tail -f postgresql-2012-12-24_190930.csv 
2012-12-24 19:09:37.100 PST,,,5954,,50d918f1.1742,1,,2012-12-24 19:09:37 PST,,0,FATAL,XX000,"could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host ""192.168.2.134"" and accepting
        TCP/IP connections on port 5432?
",,,,,,,,,""
2012-12-24 19:09:42.093 PST,,,5227,,50d85726.146b,6,,2012-12-24 05:22:46 PST,1/0,0,LOG,00000,"trigger file found: /database/pgdata/trigger.kenyon",,,,,,,,,""
2012-12-24 19:09:42.097 PST,,,5227,,50d85726.146b,7,,2012-12-24 05:22:46 PST,1/0,0,LOG,00000,"redo done at 0/F0000020",,,,,,,,,""
2012-12-24 19:09:42.104 PST,,,5227,,50d85726.146b,8,,2012-12-24 05:22:46 PST,1/0,0,LOG,00000,"last completed transaction was at log time 2012-12-24 05:29:38.526602-08",,,,,,,,,""
2012-12-24 19:09:42.112 PST,,,5227,,50d85726.146b,9,,2012-12-24 05:22:46 PST,1/0,0,LOG,00000,"selected new timeline ID: 2",,,,,,,,,""
2012-12-24 19:10:04.403 PST,,,5227,,50d85726.146b,10,,2012-12-24 05:22:46 PST,1/0,0,LOG,00000,"archive recovery complete",,,,,,,,,""
2012-12-24 19:10:04.705 PST,,,5224,,50d8571c.1468,2,,2012-12-24 05:22:36 PST,,0,LOG,00000,"database system is ready to accept connections",,,,,,,,,""
2012-12-24 19:10:04.710 PST,,,5964,,50d9190c.174c,1,,2012-12-24 19:10:04 PST,,0,LOG,00000,"autovacuum launcher started",,,,,,,,,""

日志里可以体现出来原来的备机已经切换为主机了。
再去看现在这台机子的pg_controldata的信息，再次确认一下：

[postgres@localhost pg_log]$ pg_controldata
pg_control version number:            903
Catalog version number:               201105231
Database system identifier:           5809766734683241747
Database cluster state:               in production
pg_control last modified:             Mon 24 Dec 2012 07:10:04 PM PST

已经变成production了，对，备机切主机就这么简单。
还有一处明显的变化是现在的主机(137)上的recovery.conf文件名字变成了recovery.done。
备机切换为主机后，就可以正常连接使用了。此时就有时间去处理原master端问题了。

3.宕机的主机切换成备机

先在现在的主机(137)上做一些数据的增删改
``` python
[postgres@localhost pgdata]$ psql
Password:
psql (9.1.2)
Type "help" for help.
postgres=# \d
            List of relations
Schema |     Name     | Type  |  Owner  
--------+--------------+-------+----------
public | empsalary    | table | postgres
public | t2           | table | postgres
public | t_index_test | table | postgres
public | t_kenyon     | table | postgres
public | tab_kenyon   | table | postgres
public | xxxx         | table | postgres
(6 rows)


postgres=# drop table xxxx;
DROP TABLE
postgres=# drop table t2;
DROP TABLE
postgres=# create table kenyon_rep(ir int,name varchar(10));
CREATE TABLE
postgres=# insert into kenyon_rep select generate_series(1,100),repeat('kenyon_rep',2);
ERROR:  value too long for type character varying(10)
postgres=# insert into kenyon_rep select generate_series(1,100),repeat('kenyon_rep',1);
INSERT 0 100
postgres=# \d
            List of relations
Schema |     Name     | Type  |  Owner  
--------+--------------+-------+----------
public | empsalary    | table | postgres
public | kenyon_rep   | table | postgres
public | t_index_test | table | postgres
public | t_kenyon     | table | postgres
public | tab_kenyon   | table | postgres
(5 rows)




<div class="se-preview-section-delimiter"></div>

首先在现在的备机(134)上准备恢复文件，拷贝recovery.conf文件，并修改

[postgres@localhost ~]$ cp $PGHOME/share/recovery.conf.sample $PGDATA/recovery.conf
[postgres@localhost ~]$ vi $PGDATA/recovery.conf

recovery_target_timeline = 'latest'
primary_conninfo = 'host=192.168.2.137 port=5432 user=repuser password=repuser'--指定137为新的主机
trigger_file = '/database/pgdata/trigger.kenyon'   --新的触发文件
standby_mode = on                                  --标记为备机

同时修改postgresql.conf文件
[postgres@localhost ~]$ vi $PGDATA/postgresql.conf
hot_standby = on




<div class="se-preview-section-delimiter"></div>

配置好了后，我们启动134这台模拟宕掉的原主机，并使之与137连接，并做他的备机。

[postgres@localhost ~]$ pg_start
此时我们查看备机(134)的日志，有很多时间线不一致的问题(timeline)

2012-12-24 21:57:13.135 PST,,,11936,,50d94039.2ea0,1,,2012-12-24 21:57:13 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""
2012-12-24 21:57:18.055 PST,,,11937,,50d9403e.2ea1,1,,2012-12-24 21:57:18 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""
2012-12-24 21:57:23.070 PST,,,11938,,50d94043.2ea2,1,,2012-12-24 21:57:23 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""
2012-12-24 21:57:28.072 PST,,,11939,,50d94048.2ea3,1,,2012-12-24 21:57:28 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""




<div class="se-preview-section-delimiter"></div>

遇到这种问题，需要把现在主机上归档线文件拷贝到备机上，并查看日志

[postgres@localhost]$ scp 00000002.history postgres@192.168.2.134:/database/pgdata/pg_xlog

查看日志，看其变化

2012-12-24 22:00:23.276 PST,,,12004,,50d940f7.2ee4,1,,2012-12-24 22:00:23 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""
2012-12-24 22:00:28.294 PST,,,12005,,50d940fc.2ee5,1,,2012-12-24 22:00:28 PST,,0,FATAL,XX000,"timeline 2 of the primary does not match recovery target timeline 1",,,,,,,,,""
2012-12-24 22:00:33.261 PST,,,11678,,50d93c09.2d9e,5,,2012-12-24 21:39:21 PST,,0,LOG,00000,"new target timeline is 2",,,,,,,,,""
2012-12-24 22:00:33.289 PST,,,12007,,50d94101.2ee7,1,,2012-12-24 22:00:33 PST,,0,LOG,00000,"streaming replication successfully connected to primary",,,,,,,,,""
2012-12-24 22:00:49.254 PST,,,11678,,50d93c09.2d9e,6,,2012-12-24 21:39:21 PST,,0,LOG,00000,"redo starts at 0/F0000078",,,,,,,,,""
2012-12-24 22:11:08.361 PST,"postgres","postgres",12047,"[local]",50d9437c.2f0f,1,"",2012-12-24 22:11:08 PST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2012-12-24 22:11:28.493 PST,"postgres","postgres",12051,"[local]",50d94390.2f13,1,"",2012-12-24 22:11:28 PST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2012-12-24 22:12:30.547 PST,,,11672,,50d93c02.2d98,1,,2012-12-24 21:39:14 PST,,0,LOG,00000,"received fast shutdown request",,,,,,,,,""
2012-12-24 22:12:30.549 PST,,,12007,,50d94101.2ee7,2,,2012-12-24 22:00:33 PST,,0,FATAL,57P01,"terminating walreceiver process due to administrator command",,,,,,,,,""
2012-12-24 22:12:30.563 PST,,,11679,,50d93c09.2d9f,1,,2012-12-24 21:39:21 PST,,0,LOG,00000,"shutting down",,,,,,,,,""
2012-12-24 22:12:30.567 PST,,,11679,,50d93c09.2d9f,2,,2012-12-24 21:39:21 PST,,0,LOG,00000,"database system is shut down",,,,,,,,,""




<div class="se-preview-section-delimiter"></div>

仔细看日志的话可以发现，里面有两条数据很奇怪

2012-12-24 22:11:08.361 PST,"postgres","postgres",12047,"[local]",50d9437c.2f0f,1,"",2012-12-24 22:11:08 PST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2012-12-24 22:11:28.493 PST,"postgres","postgres",12051,"[local]",50d94390.2f13,1,"",2012-12-24 22:11:28 PST,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""




<div class="se-preview-section-delimiter"></div>

这是因为在切换原主机为备机的时候，postgresql.conf中的一个参数hot_standby = off所导致，结果导致流复制是能连上的，但是数据库起不来，连接了两次，报上述两条错误信息。将这个参数改成on，然后重启一下数据库就正常了。查看现在备机上的数据，看看更新有没有同步

查看备机上的表数据，与主机上一致了

postgres=# \d
            List of relations
Schema |     Name     | Type  |  Owner  
--------+--------------+-------+----------
public | empsalary    | table | postgres
public | kenyon_rep   | table | postgres
public | t_index_test | table | postgres
public | t_kenyon     | table | postgres
public | tab_kenyon   | table | postgres
(5 rows)

postgres=# select count(1) from kenyon_rep ;




<div class="se-preview-section-delimiter"></div>

count
-------
   100
(1 row)

至此，主机和备机互为切换成功，在新主机上更新的数据也已经同步到备机上去了。

补充：9.3的版本可以不用手工复制.history了，有一个补丁将实现这个功能。

其他：采用异步方式流复制，当原主机有大量的事务操作压力比较大时，比如update,delete等操作，在原备机提升为主机后，原主机很多时候并不能正常切为备机，这是因为对于原主机，原备机会有一定的延时，也就是说原主机是超前，切换后有一部分内容主备间是不一致的，这个时候原主机降为备机就会报错。这种情况很容易模拟，在不关闭原主机的时候，把备机提升为主机，然后原主机插入新数据，再切为备机即可。

雪兰度

关注

1
点赞
踩
4

收藏

觉得还不错? 一键收藏
0
评论
PostgreSQL的HA(主备切换)

PostgreSQL的HA(主备切换)严格意义上的HA，是在主机宕机后备机作自动切换，一般还需要另一台机子做监控(或者叫vote)。以下写的是手工处理主机宕机后备机切换的功能。环境： DB: Postgres 9.1.2 OS: CentOS 6.0 VMWARE 6.0 主机(primary)：192.168.2.134:5432 备机(slave)：192.168.2.137:5432
复制链接

扫一扫