Chapter 11 Streaming Replication
- Synchronous streaming replication was implemented in version 9.1. It is a so-called single-master-multi-slaves type of replication; in PostgreSQL, the master and the slaves are usually referred to as the primary and the standbys, respectively.
- This native replication feature is based on log shipping, one of the general replication techniques: a primary server continuously sends WAL data, and each standby server replays the received data immediately.
- This chapter covers the following topics, focusing on how streaming replication works:
- How streaming replication starts up
- How data are transferred between the primary and standby servers
- How the primary server manages multiple standby servers
- How the primary server detects failures of standby servers
11.1 Starting the Streaming Replication
- In streaming replication, three kinds of processes work cooperatively. A walsender process on the primary server sends WAL data to the standby server; a walreceiver process and a startup process on the standby server receive and replay these data. A walsender and a walreceiver communicate over a single TCP connection.
- In this section, we explore the start-up sequence of streaming replication to understand how those processes are started and how the connection between them is established. Figure 11.1 shows the start-up sequence diagram of streaming replication:
Fig. 11.1. SR startup sequence.
- (1) Start the primary and standby servers.
- (2) The standby server starts a startup process.
- (3) The standby server starts a walreceiver process.
- (4) The walreceiver sends a connection request to the primary server. If the primary server is not running, the walreceiver sends these requests periodically.
- (5) When the primary server receives a connection request, it starts a walsender process, and a TCP connection is established between the walsender and the walreceiver.
- (6) The walreceiver sends the latest LSN of the standby's database cluster. In general, this phase is known as handshaking in the field of information technology.
- (7) If the standby's latest LSN is less than the primary's latest LSN (Standby's LSN < Primary's LSN), the walsender sends the WAL data from the former LSN to the latter LSN. These WAL data are read from the WAL segments stored in the primary's pg_xlog subdirectory (pg_wal in version 10 or later). The standby server then replays the received WAL data. In this phase the standby catches up with the primary, so it is called catch-up.
- (8) Streaming replication begins to work.
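Steps (6) to (8) can be pictured with a minimal Python sketch. The function name `plan_after_handshake` and the use of plain integers for LSNs are illustrative assumptions, not PostgreSQL internals.

```python
def plan_after_handshake(standby_lsn: int, primary_lsn: int):
    """Decide what the walsender does once it knows the standby's latest LSN.

    Returns (phase, wal_range): wal_range is the (start, end) span of WAL
    the standby is missing, or None when there is nothing to catch up on.
    """
    if standby_lsn < primary_lsn:
        # Step (7): the standby is behind, so send the missing WAL first.
        return "catch-up", (standby_lsn, primary_lsn)
    # Step (8): nothing is missing, so streaming can begin right away.
    return "streaming", None


print(plan_after_handshake(0x5000000, 0x5000280))
print(plan_after_handshake(0x5000280, 0x5000280))
```

The first call models a standby that is behind the primary; the second models one that is already up to date.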
- Each walsender process keeps a state that corresponds to the working phase of the connected walreceiver or other application. (Note that this is not the state of the walreceiver or application connected to the walsender.) The possible states are:
- start-up: from the start of the walsender to the end of handshaking. See Figs. 11.1(5)–(6).
- catch-up: during the catch-up phase. See Fig. 11.1(7).
- streaming: while streaming replication is working. See Fig. 11.1(8).
- backup: while sending the files of the whole database cluster to backup tools such as the pg_basebackup utility.
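The four states can be read as a small state machine. The sketch below is a simplified illustration: the event names (`handshake_done`, `caught_up`, `base_backup_requested`) are hypothetical labels chosen for this example and do not correspond to PostgreSQL identifiers.

```python
from enum import Enum


class WalSenderState(Enum):
    STARTUP = "start-up"
    CATCHUP = "catch-up"
    STREAMING = "streaming"
    BACKUP = "backup"


# Hypothetical events; each entry maps (current state, event) to the next state.
TRANSITIONS = {
    (WalSenderState.STARTUP, "handshake_done"): WalSenderState.CATCHUP,
    (WalSenderState.CATCHUP, "caught_up"): WalSenderState.STREAMING,
    (WalSenderState.STARTUP, "base_backup_requested"): WalSenderState.BACKUP,
}


def next_state(state, event):
    """Return the state after `event`; unknown events leave it unchanged."""
    return TRANSITIONS.get((state, event), state)
```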
- The pg_stat_replication view shows the state of all running walsenders. An example is shown below:
testdb=> SELECT application_name,state FROM pg_stat_replication;
application_name | state
------------------+-----------
standby1 | streaming
standby2 | streaming
pg_basebackup | backup
(3 rows)
- As shown in the above result, two walsenders are running to send WAL data to the connected standby servers, and another one is running to send all files of the database cluster to the pg_basebackup utility.
- *What will happen if a standby server restarts after a long time in the stopped condition?*
- In version 9.3 or earlier, if the primary's WAL segments required by the standby server have already been recycled, the standby cannot catch up with the primary server. There is no reliable solution to this problem; the only option is to set a large value for the configuration parameter wal_keep_segments (replaced by wal_keep_size in version 13) to reduce the likelihood of it occurring. This is a stopgap solution.
- In version 9.4 or later, this problem can be prevented by using a replication slot. The replication slot is a feature that expands the flexibility of WAL data sending, mainly for logical replication, and it also solves this problem: the WAL segment files under pg_xlog (pg_wal in version 10 or later) that contain unsent data are retained on behalf of the replication slot, because their recycling is paused. Refer to the official documentation for details.
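The difference between the two approaches can be illustrated with a toy model. This is a sketch under simplifying assumptions: segments are plain integers, the function and parameter names are hypothetical, and other retention rules (for example those tied to checkpoints) are ignored.

```python
def recyclable_segments(segments, keep_segments=0, slot_needs_from=None):
    """Return the WAL segment numbers that may be recycled.

    segments: segment numbers currently on disk.
    keep_segments: wal_keep_segments-style count of newest segments to keep.
    slot_needs_from: oldest segment a replication slot still needs,
        or None when no slot is in use.
    """
    segments = sorted(segments)
    # Keep the newest `keep_segments` segments regardless of the standby.
    cutoff = len(segments) - keep_segments
    candidates = segments[:max(cutoff, 0)]
    if slot_needs_from is not None:
        # A slot pins every segment the standby has not received yet.
        candidates = [s for s in candidates if s < slot_needs_from]
    return candidates


print(recyclable_segments([1, 2, 3, 4, 5, 6], keep_segments=2))
print(recyclable_segments([1, 2, 3, 4, 5, 6], keep_segments=2, slot_needs_from=3))
```

With only the keep count, segments 3 and 4 are recycled even if a stopped standby still needs them; with the slot, they stay until the standby has consumed them.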
11.2 How to Conduct Streaming Replication
- Streaming replication has two aspects: log shipping and database synchronization. Log shipping is obviously one of them, since streaming replication is based on it: the primary server sends WAL data to the connected standby servers whenever WAL data are written. Database synchronization is required for synchronous replication: the primary server communicates with each of the standby servers to synchronize their database clusters.
- To accurately understand how streaming replication works, we should explore how one primary server manages multiple standby servers. To keep it simple, the special case (a single-primary single-standby system) is described in this section, while the general case (a single-primary multi-standby system) is described in the next section.
11.2.1 Communication Between a Primary and a Synchronous Standby
- Assume that the standby server is in synchronous replication mode, but the configuration parameter hot_standby is disabled and wal_level is 'archive'. The main parameters of the primary server are shown below:
synchronous_standby_names = 'standby1'   # name of the synchronous standby
hot_standby = off                        # hot standby is disabled
wal_level = archive                      # WAL level
- Additionally, among the three triggers for writing WAL data mentioned in Section 9.5, we focus on transaction commits here.
- Suppose that one backend process on the primary server issues a simple INSERT statement in autocommit mode. The backend starts a transaction, issues the INSERT statement, and then commits the transaction immediately. Let's explore further how this commit action is completed. See the sequence diagram in Fig. 11.2:
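- Note: in autocommit mode (the default in psql), each SQL statement that is not inside an explicit transaction block runs in its own transaction, which is committed automatically when the statement completes.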
Fig. 11.2. Streaming Replication's communication sequence diagram.
- (1) The backend process writes and flushes WAL data to a WAL segment file by executing the functions XLogInsert() and XLogFlush().
- (2) The walsender process sends the WAL data written to the WAL segment to the walreceiver process.
- (3) After the WAL data are sent, the backend process continues to wait for an ACK response from the standby server. More precisely, the backend process acquires a latch by executing the internal function SyncRepWaitForLSN() and waits for it to be released.
- (4) The walreceiver on the standby server writes the received WAL data to the standby's WAL segment using the write() system call and returns an ACK response to the walsender.
- (5) The walreceiver flushes the WAL data to the WAL segment using a system call such as fsync(), returns another ACK response to the walsender, and informs the startup process that the WAL data have been updated.
- (6) The startup process replays the WAL data that have been written to the WAL segment.
- (7) On receiving the ACK response from the walreceiver, the walsender releases the latch of the backend process, and the backend process's commit or abort action then completes. The timing of the latch release depends on the parameter synchronous_commit. If it is 'on' (the default), the latch is released when the ACK of step (5) is received; if it is 'remote_write', the latch is released when the ACK of step (4) is received.
- Each ACK response informs the primary server of the internal state of the standby server. It contains the following four items:
- LSN location where the latest WAL data has been written.(最新一条WAL数据被写入到WAL缓冲区中的LSN地址)
- LSN location where the latest WAL data has been flushed.(最新一条WAL数据被冲洗到WAL分割文件的LSN地址)
- LSN location where the latest WAL data has been replayed in the startup process.(启动进程中最新一条被重现WAL数据LSN地址)
- The timestamp when this response was sent.
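The four ACK items and the latch-release rule of step (7) can be sketched as follows. The class and function names are hypothetical, LSNs are modelled as plain integers, and only the two synchronous_commit values discussed above are covered.

```python
from dataclasses import dataclass


@dataclass
class Ack:
    write_lsn: int    # latest WAL position written on the standby
    flush_lsn: int    # latest WAL position flushed on the standby
    replay_lsn: int   # latest WAL position replayed by the startup process
    sent_at: float    # timestamp when the response was sent


def can_release_latch(ack: Ack, commit_lsn: int, synchronous_commit: str) -> bool:
    """Return True when the waiting backend may complete its commit."""
    if synchronous_commit == "remote_write":
        return ack.write_lsn >= commit_lsn   # the ACK of step (4) is enough
    if synchronous_commit == "on":
        return ack.flush_lsn >= commit_lsn   # needs the ACK of step (5)
    raise ValueError(f"mode not covered by this sketch: {synchronous_commit}")
```

For an ACK that reports the commit record as written but not yet flushed, the function returns True under 'remote_write' and False under 'on'.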
- The walreceiver returns ACK responses not only when WAL data have been written and flushed, but also periodically as the heartbeat of the standby server (the interval is set by wal_receiver_status_interval, 10 seconds by default). The primary server therefore always knows the status of all connected standby servers.
- By issuing the query shown below, the LSN-related information of the connected standby servers can be displayed. (In version 10 or later, the columns are named write_lsn, flush_lsn, and replay_lsn.)
testdb=> SELECT application_name AS host,
write_location AS write_LSN, flush_location AS flush_LSN,
replay_location AS replay_LSN FROM pg_stat_replication;
host | write_lsn | flush_lsn | replay_lsn
----------+-----------+-----------+------------
standby1 | 0/5000280 | 0/5000280 | 0/5000280
standby2 | 0/5000280 | 0/5000280 | 0/5000280
(2 rows)
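An LSN is displayed as two hexadecimal numbers separated by a slash: the high and low 32 bits of a 64-bit position. The helper names below are hypothetical; the sketch shows how such values could be compared to estimate how many bytes a standby is behind.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '0/5000280' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(primary_lsn: str, standby_lsn: str) -> int:
    """Return how many bytes of WAL the standby is behind the primary."""
    return lsn_to_int(primary_lsn) - lsn_to_int(standby_lsn)


print(lag_bytes("0/5000280", "0/5000000"))  # 0x280 = 640
```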
11.2.2 Behavior When a Failure Occurs
- In this subsection, we describe how the primary server behaves when the synchronous standby server has failed, and how to deal with the situation.
- Even if the synchronous standby server has failed and can no longer return an ACK response, the primary server continues to wait for responses forever. As a result, running transactions cannot commit and subsequent query processing cannot start. In other words, all operations of the primary server are practically stopped. (Streaming replication does not support a function to revert automatically to asynchronous mode on a timeout.)
- There are two ways to avoid such a situation. One is to use multiple standby servers to increase system availability; the other is to switch from synchronous to asynchronous mode by performing the following steps manually.
- (1) Set the parameter synchronous_standby_names to an empty string.
synchronous_standby_names = ''
- (2) Execute the pg_ctl command with the reload option.
postgres> pg_ctl -D $PGDATA reload
- The above procedure does not affect the connected clients. The primary server continues transaction processing, and all sessions between clients and their respective backend processes are kept.
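Step (1) could be scripted along the following lines. This is a sketch that only edits configuration text in memory; the helper name `set_conf_param` is hypothetical, and it assumes the parameter appears at most once as an active (uncommented) line.

```python
import re


def set_conf_param(conf_text: str, name: str, value: str) -> str:
    """Return conf_text with `name = 'value'` set, replacing any active line."""
    new_line = f"{name} = '{value}'"
    # Match an uncommented assignment of `name` on its own line.
    pattern = re.compile(rf"^[ \t]*{re.escape(name)}[ \t]*=.*$", re.MULTILINE)
    if pattern.search(conf_text):
        return pattern.sub(new_line, conf_text)
    # Parameter not present: append it at the end.
    return conf_text.rstrip("\n") + "\n" + new_line + "\n"


conf = "wal_level = archive\nsynchronous_standby_names = 'standby1'\n"
print(set_conf_param(conf, "synchronous_standby_names", ""))
```

After the edited text is written back to postgresql.conf, the reload of step (2) makes the primary pick up the new value.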
11.3 Managing Multiple-Standby Servers
- This section describes how streaming replication works with multiple standby servers.
11.3.1 sync_priority and sync_state
- The primary server assigns sync_priority and sync_state to every standby server it manages and treats each standby according to these values. (The primary server assigns them even when it manages just one standby server; this was not mentioned in the previous section.)
- sync_priority indicates the priority of a standby server in synchronous mode and is a fixed value. A smaller value means a higher priority, while 0 is a special value that means 'in asynchronous mode'. Priorities are assigned in the order in which the standbys are listed in the primary's configuration parameter synchronous_standby_names. For example, in the following configuration, the priorities of standby1 and standby2 are 1 and 2, respectively.
synchronous_standby_names = 'standby1, standby2'
- (Standby servers not listed in this parameter are in asynchronous mode, and their priority is 0.)
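The priority rule can be expressed as a short sketch. It handles only the plain comma-separated list shown above, and the function name is a hypothetical choice for this example.

```python
def sync_priorities(synchronous_standby_names: str, connected):
    """Map each connected standby name to its sync_priority."""
    listed = [n.strip() for n in synchronous_standby_names.split(",") if n.strip()]
    priorities = {}
    for name in connected:
        # Listed standbys get 1, 2, ... by position; unlisted ones get 0.
        priorities[name] = listed.index(name) + 1 if name in listed else 0
    return priorities


print(sync_priorities("standby1, standby2", ["standby1", "standby2", "standby3"]))
```

Here standby3 is a hypothetical unlisted standby, so it receives priority 0.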
- sync_state is the state of the standby server. It varies according to the running status of all standby servers and each standby's priority. The possible states are:
- sync is the state of the synchronous standby server with the highest priority among all working standbys (excluding asynchronous servers).
- potential is the state of a spare synchronous standby server with the second or lower priority among all working standbys (excluding asynchronous servers). If the synchronous standby fails, it is replaced with the highest-priority standby among the potential ones.
- async is the state of an asynchronous standby server, and this state is fixed. The primary server treats asynchronous standbys in the same way as potential standbys, except that their sync_state is never 'sync' or 'potential'.
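The three states follow mechanically from the priorities of the standbys that are currently working. The sketch below assumes a single synchronous standby, as described in this chapter; the function name is hypothetical.

```python
def sync_states(priorities):
    """Derive sync_state for each working standby from its sync_priority.

    priorities: dict of standby name -> sync_priority for the standbys
    that are currently connected (0 means asynchronous).
    """
    candidates = sorted((p, name) for name, p in priorities.items() if p > 0)
    states = {name: "async" for name, p in priorities.items() if p == 0}
    for rank, (_, name) in enumerate(candidates):
        # The best (smallest) priority is 'sync'; the rest are spares.
        states[name] = "sync" if rank == 0 else "potential"
    return states


print(sync_states({"standby1": 1, "standby2": 2, "standby3": 0}))
# If standby1 fails and disconnects, standby2 takes over as 'sync'.
print(sync_states({"standby2": 2, "standby3": 0}))
```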
- The priority and the state of the standby servers can be shown by issuing the following query:
testdb=> SELECT application_name AS host,
sync_priority, sync_state FROM pg_stat_replication;
host | sync_priority | sync_state
----------+---------------+------------
standby1 | 1 | sync
standby2 | 2 | potential
(2 rows)
11.3.2 How the Primary Manages Multiple Standbys
- The primary server waits for ACK responses from the synchronous standby server alone. In other words, the primary server confirms only the synchronous standby's writing and flushing of WAL data. Streaming replication therefore ensures that only the synchronous standby is in a consistent and synchronous state with the primary.
- Figure 11.3 shows the case in which the ACK response of the potential standby has been returned earlier than that of the synchronous standby. In this case, the primary server does not complete the commit action of the current transaction and continues to wait for the synchronous standby's ACK response. When that response is received, the backend process releases the latch and completes the current transaction processing.
Fig. 11.3. Managing multiple standby servers.
- The sync_state values of standby1 and standby2 are 'sync' and 'potential', respectively. (1) Despite receiving an ACK response from the potential standby server, the primary's backend process continues to wait for an ACK response from the synchronous standby server. (2) The primary's backend process releases the latch and completes the current transaction processing.
- In the opposite case (i.e. the synchronous standby's ACK response has been returned earlier than the potential standby's), the primary server immediately completes the commit action of the current transaction without confirming whether the potential standby has written and flushed the WAL data.
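Both orderings can be replayed with a small sketch. It assumes synchronous_commit is 'on', so only the flushed position matters; the function name and the tuple layout of the ACK events are hypothetical.

```python
def commit_completes_at(acks, sync_standby, commit_lsn):
    """Return the index of the ACK that lets the commit complete, or None.

    acks: iterable of (standby_name, flush_lsn) in arrival order.
    """
    for i, (name, flush_lsn) in enumerate(acks):
        # ACKs from potential or asynchronous standbys are not waited for.
        if name == sync_standby and flush_lsn >= commit_lsn:
            return i
    return None


# Potential standby answers first: the commit waits for the second ACK.
print(commit_completes_at([("standby2", 300), ("standby1", 300)], "standby1", 300))
# Synchronous standby answers first: the commit completes immediately.
print(commit_completes_at([("standby1", 300), ("standby2", 300)], "standby1", 300))
```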
11.3.3 Behavior When a Failure Occurs
- Once again, let us see how the primary server behaves when a standby server has failed.
- When either a potential or an asynchronous standby server has failed, the primary server terminates the walsender process connected to the failed standby and continues all processing. In other words, transaction processing on the primary server is not affected by the failure of either type of standby server.
- When a synchronous standby server has failed, the primary server terminates the walsender process connected to the failed standby and replaces the synchronous standby with the highest-priority potential standby. See Fig. 11.4. In contrast to the failure described above, query processing on the primary server is paused from the point of failure until the synchronous standby has been replaced. (Therefore, failure detection of standby servers is a very important function for increasing the availability of a replication system. Failure detection is described in the next section.)
Fig. 11.4. Replacing of synchronous standby server.
- In any case, if one or more standby servers are running in synchronous mode, the primary server keeps only one synchronous standby server at all times, and the synchronous standby server is always in a consistent and synchronous state with the primary.
11.4 Detecting Failures of Standby Servers
- Streaming replication uses two common failure detection procedures, neither of which requires any special hardware.
- Failure detection of the standby server process: when a dropped connection between the walsender and the walreceiver is detected, the primary server immediately determines that the standby server or the walreceiver process is faulty. When a low-level network function returns an error because it failed to write to or read from the walreceiver's socket interface, the primary also immediately determines that it has failed.
- Failure detection of hardware and networks: if a walreceiver returns nothing within the time set by the parameter wal_sender_timeout (default 60 seconds), the primary server determines that the standby server is faulty. In contrast to the failure described above, it takes a certain amount of time, up to wal_sender_timeout seconds, for the primary server to confirm the standby's death, even when the standby server can no longer send any response because of some failure (e.g. a hardware failure of the standby server or a network failure).
- Depending on the type of failure, it can usually be detected immediately after it occurs, while sometimes there is a time lag between the occurrence of the failure and its detection. In particular, if the latter type of failure occurs on the synchronous standby server, all transaction processing on the primary server is stopped until the standby's failure is detected, even though multiple potential standby servers may be working.
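The two detection procedures can be summarized in a minimal sketch. The function name, its arguments, and the returned labels are hypothetical; times are plain seconds.

```python
def detect_failure(connection_open, last_reply_at, now, wal_sender_timeout=60.0):
    """Classify a standby as failed, mirroring the two procedures above.

    Returns a short reason string, or None when the standby looks healthy.
    """
    if not connection_open:
        # Dropped connection or socket error: detected immediately.
        return "connection lost"
    if now - last_reply_at > wal_sender_timeout:
        # Silent hardware or network failure: detected only after the timeout.
        return "timeout"
    return None


print(detect_failure(False, last_reply_at=0.0, now=1.0))
print(detect_failure(True, last_reply_at=0.0, now=61.0))
print(detect_failure(True, last_reply_at=0.0, now=30.0))
```

The second call shows why a silent failure of the synchronous standby can stall commits for up to wal_sender_timeout seconds before a potential standby takes over.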