PostgreSQL中的WAL：2.预写日志

最新推荐文章于 2024-06-19 10:15:43 发布

cullen2012

最新推荐文章于 2024-06-19 10:15:43 发布

阅读量324

点赞数 1

文章标签：大数据数据库 python linux mysql

原文链接：https://habr.com/en/company/postgrespro/blog/494246/

版权

Last time we got acquainted with the structure of an important component of the shared memory — the buffer cache. A risk of losing information from RAM is the main reason why we need techniques to recover data after failure. Now we will discuss these techniques. 上一次我们熟悉共享内存的重要组成部分的结构-缓冲区高速缓存。从RAM中丢失信息的风险是我们需要在故障后恢复数据的技术的主要原因。现在我们将讨论这些技术。

日志 (The log)

Sadly, there's no such thing as miracles: to survive the loss of information in RAM, everything needed must be duly saved to disk (or other nonvolatile media).

可悲的是，没有奇迹发生：为了避免RAM中信息丢失，必须将所有必需的东西妥善保存到磁盘(或其他非易失性介质)中。

Therefore, the following was done. Along with changing data, the log of these changes is maintained. When we change something on a page in the buffer cache, we create a record of this change in the log. The record contains the minimum information sufficient to redo the change if the need arises.

因此，完成了以下操作。除了更改数据外，还保留这些更改的日志。当我们更改缓冲区高速缓存中页面上的某些内容时，我们会在日志中创建此更改的记录。该记录包含必要时足以重做更改的最少信息。

For this to work, the log record must obligatory get to disk before the changed page gets there. And this explains the name: write-ahead log (WAL).

为此，日志记录必须在更改后的页面到达磁盘之前必须先进入磁盘。这解释了名称： 预写日志(WAL) 。

In case of failure, the data on disk appear to be inconsistent: some pages were written earlier, and others later. But WAL remains, which we can read and redo the operations that were performed before the failure but their result was late to reach the disk.

如果发生故障，磁盘上的数据可能会不一致：某些页面写得较早，而另一些页面写得较晚。但是WAL仍然存在，我们可以读取并重做在故障之前执行的操作，但是操作的结果迟到了磁盘。

First, WAL is a sequential stream of append-only data. Even HDD disks do the job of sequential writing fine. However, the data themselves are written in a random fashion since pages are spread across the disk more or less in disorder.

首先，WAL是仅追加数据的顺序流。甚至HDD磁盘也可以很好地执行顺序写入的工作。但是，由于页面或多或少地分散在磁盘上，因此数据本身是以随机方式写入的。

Second, a WAL record can be way smaller than the page.

其次，WAL记录可能比页面小。

Third, when writing to disk we do not need to take care of maintaining the consistency of data on disk at every point in time (this requirement makes life really difficult).

第三，在写入磁盘时，我们不必在每个时间点都要维护磁盘上数据的一致性(此要求使生活变得非常困难)。

And fourth, as we will see later, WAL (once it is available) can be used not only for recovery, but also for backup and replication.

第四，正如我们稍后将看到的，WAL(一旦可用)不仅可以用于恢复，还可以用于备份和复制。

All the operations must be WAL-logged that can result in inconsistent data on disk in case of failure. Specifically, the following operations are WAL-logged:

必须对所有操作进行WAL记录，以防发生故障时导致磁盘上的数据不一致。具体来说，以下操作是WAL记录的：

Changes to pages in the buffer cache (mostly table and index pages) — since it takes some time for the page changed to get to disk.
更改缓冲区高速缓存中的页面(主要是表页面和索引页面)-因为页面更改需要花费一些时间才能到达磁盘。
Transactions' commits and aborts — since a change of the status is done in XACT buffers and it also takes some time for the change to get to disk.
事务的提交和中止-因为状态更改是在XACT缓冲区中完成的，所以更改也需要一些时间才能到达磁盘。
File operations (creation and deletion of files and directories, such as creation of files during creation of a table) — since these operations must be synchronous with the changes to data.
文件操作(创建和删除文件和目录，例如在创建表期间创建文件)-因为这些操作必须与数据更改同步。

The following is not WAL-logged:

以下内容未进行WAL记录：

Operations with unlogged tables — their name is self-explanatory.
具有未记录表的操作-其名称不言自明。
Operations with temporary tables — logging makes no sense since the lifetime of such tables does not exceed the lifetime of the session that created them.
使用临时表的操作-日志记录没有意义，因为此类表的生存期不超过创建它们的会话的生存期。

Before PostgreSQL 10, hash indexes were not WAL-logged (they served only to associate hash functions with different data types), but this has been corrected.

在PostgreSQL 10之前，没有对WAL记录哈希索引 (它们仅用于将哈希函数与不同的数据类型相关联)，但是此问题已得到纠正。

逻辑结构 (Logical structure)

We can logically envisage WAL as a sequence of records of different lengths. Each record contains data on a certain operation, which are prefixed by a standard header. In the header, among the rest, the following is specified:

我们可以从逻辑上将WAL想象成一系列不同长度的记录。每个记录包含有关某个操作的数据，并以标准标头作为前缀。在标题中，其余部分指定以下内容：

The ID of the transaction that the record relates to.
与记录相关的交易的ID。
The resource manager — the system component responsible for the record.
资源管理器-负责记录的系统组件。
The checksum (CRC) — permits to detect data corruption.
校验和(CRC)-允许检测数据损坏。
The length of the record and link to the preceding record.
记录的长度，并链接到先前的记录。

As for the data, they can have different formats and meaning. For example: they can be represented by a page fragment that needs to be written on top of the page contents at a certain offset. The resource manager specified «understands» how to interpret the data in its record. There are separate managers for tables, each type of indexes, transaction statuses and so on. You can get the full list of them using the command

至于数据，它们可以具有不同的格式和含义。例如：它们可以由需要以一定偏移量写在页面内容顶部的页面片段表示。资源管理器指定了“理解”如何解释其记录中的数据。有单独的表管理器，每种类型的索引，事务状态等等。您可以使用以下命令获取它们的完整列表

pg_waldump -r list

物理结构 (Physical structure)

WAL is stored on disk as files in the $PGDATA/pg_wal directory. By default, each file is 16 MB. You can increase this size to avoid having many files in one catalog. Before PostgreSQL 11, you could do this only when compiling source codes, but now you can specify the size when initializing the cluster (use the --wal-segsize option).

WAL作为$PGDATA/pg_wal目录中的文件存储在磁盘上。默认情况下，每个文件为16 MB。您可以增加此大小，以避免在一个目录中包含多个文件。在PostgreSQL 11之前，您只能在编译源代码时执行此操作，但是现在您可以在初始化集群时指定大小(使用--wal-segsize选项)。

WAL records get into the currently used file, and once it is over, the next one will be used.

WAL记录进入当前使用的文件，一旦结束，将使用下一个文件。

In the shared memory of the server, special buffers are allocated for WAL. The wal_buffers parameter specifies the size of the WAL cache (the default value implies automatic setting: 1/32 of the buffer cache is allocated).

在服务器的共享内存中，为WAL分配了特殊的缓冲区。 wal_buffers参数指定WAL缓存的大小(默认值表示自动设置：已分配缓冲区缓存的1/32)。

The WAL cache is structured similarly to the buffer cache, but works mainly in the circular buffer mode: records are added to the «head», but get written to disk starting with the «tail».

WAL缓存的结构类似于缓冲区缓存，但主要在循环缓冲区模式下工作：将记录添加到“ head”，但从“ tail”开始写入磁盘。

The pg_current_wal_lsn and pg_current_wal_insert_lsn functions return the write («tail») and insert («head») locations, respectively:

pg_current_wal_lsn和pg_current_wal_insert_lsn函数分别返回写(«tail»)和插入(«head»)位置：

=> SELECT pg_current_wal_lsn(), pg_current_wal_insert_lsn();

pg_current_wal_lsn | pg_current_wal_insert_lsn 
--------------------+---------------------------
 0/331E4E64         | 0/331E4EA0
(1 row)

To reference a certain record, the pg_lsn data type is used: it is a 64-bit integer that represents the byte offset of the beginning of the record with respect to the beginning of WAL. LSN (log sequence number) is output as two 32-bit hexadecimal numbers separated by a slash.

为了引用某个记录，使用了pg_lsn数据类型：它是一个64位整数，表示记录的开始相对于WAL的开始的字节偏移。 LSN(对数序列号)作为两个以斜杠分隔的32位十六进制数字输出。

We can get to know in what file we will find the location needed and at what offset from the beginning of the file:

我们可以知道在哪个文件中可以找到所需的位置以及与文件开头的偏移量：

=> SELECT file_name, upper(to_hex(file_offset)) file_offset
FROM pg_walfile_name_offset('0/331E4E64');

file_name         | file_offset 
--------------------------+-------------
 000000010000000000000033 | 1E4E64
 \      /\                       /
   time         0/331E4E64
   line

The filename consists of two parts. 8 high-order hexadecimal digits show the number of the time line (it is used in restoring from backup) and the remainder corresponds to the high-order LSN digits (and the rest low-order LSN digits show the offset).

文件名由两部分组成。 8位高位十六进制数字显示时间线的编号(用于从备份还原)，其余部分对应于高位LSN数字(其余的低位LSN数字显示偏移量)。

In the file system, you can see WAL files in the $PGDATA/pg_wal/ directory, but starting with PostgreSQL 10, you can also see them using a specialized function:

在文件系统中，您可以在$PGDATA/pg_wal/目录中看到WAL文件，但是从PostgreSQL 10开始，您还可以使用专门的功能来查看它们：

=> SELECT * FROM pg_ls_waldir() WHERE name = '000000010000000000000033';

name           |   size   |      modification      
--------------------------+----------+------------------------
 000000010000000000000033 | 16777216 | 2019-07-08 20:24:13+03
(1 row)

预写日志 (Write-ahead logging)

Let's see how WAL-logging is done and how writing ahead is ensured. Let's create a table:

让我们看看如何进行WAL记录以及如何确保提前写入。让我们创建一个表：

=> CREATE TABLE wal(id integer);
=> INSERT INTO wal VALUES (1);

We will be looking into the header of the table page. To do this, we will need a well-known extension:

我们将研究表格页面的标题。为此，我们需要一个著名的扩展：

=> CREATE EXTENSION pageinspect;

Let's start a transaction and remember the location of insertion into WAL:

让我们开始一个事务，并记住插入WAL的位置：

=> BEGIN;
=> SELECT pg_current_wal_insert_lsn();

pg_current_wal_insert_lsn 
---------------------------
 0/331F377C
(1 row)

Now we will perform some operation, for example, update a row:

现在我们将执行一些操作，例如，更新一行：

=> UPDATE wal set id = id + 1;

This change was WAL-logged, and the insert location changed:

此更改已进行WAL记录，并且插入位置已更改：

=> SELECT pg_current_wal_insert_lsn();

pg_current_wal_insert_lsn 
---------------------------
 0/331F37C4
(1 row)

To ensure the changed data page not to be flushed to disk prior to the WAL record, LSN of the last WAL record related to this page is stored in the page header:

为了确保在WAL记录之前不会将更改后的数据页刷新到磁盘，与该页相关的最后一个WAL记录的LSN存储在页头中：

=> SELECT lsn FROM page_header(get_raw_page('wal',0));

lsn     
------------
 0/331F37C4
(1 row)

Note that WAL is one for the entire cluster, and new records get there all the time. Therefore, LSN on the page can be less than the value just returned by the pg_current_wal_insert_lsn function. But since nothing is happening in our system, the numbers are the same.

请注意，WAL是整个集群的一个，新记录始终都在那里。因此，页面上的LSN可以小于pg_current_wal_insert_lsn函数刚返回的值。但是由于在我们的系统中什么也没有发生，因此数字是相同的。

Now let's commit the transaction.

现在，让我们提交事务。

=> COMMIT;

Commits are also WAL-logged, and the location changes again:

提交也被WAL记录，并且位置再次更改：

=> SELECT pg_current_wal_insert_lsn();

pg_current_wal_insert_lsn 
---------------------------
 0/331F37E8
(1 row)

Each commit changes the transaction status in the structure called XACT (we've already discussed it). Statuses are stored in files, but they also use their own cache, which occupies 128 pages in the shared memory. Therefore, for XACT pages, LSN of the last WAL record also has to be tracked. But this information is stored in RAM rather than in the page itself.

每个提交都会在称为XACT的结构中更改事务状态(我们已经讨论过 )。状态存储在文件中，但它们也使用自己的缓存，该缓存在共享内存中占据128页。因此，对于XACT页面，还必须跟踪最后一个WAL记录的LSN。但是，此信息存储在RAM中，而不是存储在页面本身中。

WAL records created will once be written to disk. We will discuss sometime later when exactly this happens, but in the above situation, it has already happened:

创建的WAL记录将被一次写入磁盘。我们将在稍后的某个时间讨论确切的情况，但是在上述情况下，情况已经发生：

=> SELECT pg_current_wal_lsn(), pg_current_wal_insert_lsn();

pg_current_wal_lsn | pg_current_wal_insert_lsn 
--------------------+---------------------------
 0/331F37E8         | 0/331F37E8
(1 row)

Hereafter the data and XACT pages can be flushed to disk. But if we had to flush them earlier, it would be detected and the WAL records would be forced to get to disk first.

此后，数据和XACT页面可以刷新到磁盘。但是，如果我们不得不更早地刷新它们，它将被检测到，并且WAL记录将被迫首先进入磁盘。

Having two LSN locations, we can get the amount of WAL records between them (in bytes) by simply subtracting one from the other. We only need to cast the locations to the pg_lsn type:

具有两个LSN位置，我们可以通过简单地从另一个中减去一个来获得它们之间的WAL记录数量(以字节为单位)。我们只需要将位置转换为pg_lsn类型：

=> SELECT '0/331F37E8'::pg_lsn - '0/331F377C'::pg_lsn;

?column? 
----------
      108
(1 row)

In this case, the update of the row and commit required 108 bytes in WAL.

在这种情况下，行和提交的更新在WAL中需要108个字节。

The same way we can evaluate the amount of WAL records that the server generates per unit of time at a certain load. This is important information, which will be needed for tuning (which we will discuss next time).

我们可以用相同的方式评估服务器在一定负载下每单位时间生成的WAL记录的数量。这是重要的信息，需要进行调优(我们将在下次讨论)。

Now let's use the pg_waldump utility to look at the WAL records created.

现在，让我们使用pg_waldump实用工具查看创建的WAL记录。

The utility can also work with a range of LSNs (as in this example) and select the records for a transaction specified. You should run the utility as postgres OS user since it will need access to WAL files on disk.

该实用程序还可以使用一系列LSN(如本例所示)，并为指定的事务选择记录。您应该以postgres OS用户身份运行该实用程序，因为它需要访问磁盘上的WAL文件。

postgres$ /usr/lib/postgresql/11/bin/pg_waldump -p /var/lib/postgresql/11/main/pg_wal -s 0/331F377C -e 0/331F37E8 000000010000000000000033

rmgr: Heap        len (rec/tot):     69/    69, tx:     101085, lsn: 0/331F377C, prev 0/331F3014, desc: HOT_UPDATE off 1 xmax 101085 ; new off 2 xmax 0, blkref #0: rel 1663/16386/33081 blk 0

rmgr: Transaction len (rec/tot):     34/    34, tx:     101085, lsn: 0/331F37C4, prev 0/331F377C, desc: COMMIT 2019-07-08 20:24:13.945435 MSK

Here we see the headers of two records.

在这里，我们看到两个记录的标题。

The first one is the HOT_UPDATE operation, related to the Heap resource manager. The filename and page number are specified in the blkref field and are the same as of the updated table page:

第一个是HOT_UPDATE操作，与堆资源管理器有关。文件名和页码在blkref字段中指定，与更新后的表格页面相同：

=> SELECT pg_relation_filepath('wal');

pg_relation_filepath 
----------------------
 base/16386/33081
(1 row)

The second record is COMMIT, related to the Transaction resource manager.

第二个记录是COMMIT，与事务资源管理器有关。

This format is hardly easy-to-read, but allows us to clarify the things if needed.

这种格式几乎不容易阅读，但是可以让我们在需要时进行澄清。

复苏 (Recovery)

When we start the server, the postmaster process is launched first, which, in turn, launches the startup process, whose task is to ensure the recovery in case of failure.

当我们启动服务器时，首先启动postmaster进程，然后启动启动进程，该启动进程的任务是确保在发生故障时进行恢复。

To figure out a need for the recovery, startup looks at the cluster state in the specialized control file $PGDATA/global/pg_control. But we can also check the state on our own by means of the pg_controldata utility:

为了确定恢复的需要， 启动会在专用控制文件$PGDATA/global/pg_control查看集群状态。但是我们也可以通过pg_controldata实用程序自己检查状态：

postgres$ /usr/lib/postgresql/11/bin/pg_controldata -D /var/lib/postgresql/11/main | grep state

Database cluster state:               in production

A server that was shut down in a regular way will have the «shut down» state. If a server is not working, but the state is still «in production», it means that the DBMS is down and the recovery will be done automatically.

定期关闭的服务器将处于“关闭”状态。如果服务器不工作，但状态仍处于“生产中”，则表示DBMS已关闭，并且恢复将自动完成。

For the recovery, the startup process will sequentially read WAL and apply records to the pages if needed. The need can be determined by comparing LSN of the page on disk with LSN of the WAL record. If LSN of the page appears to be greater, the record does not need to be applied. Actually, it even cannot be applied since the records are meant to be applied in a strictly sequential order.

为了进行恢复，启动过程将顺序读取WAL并将记录应用于页面(如果需要)。可以通过将磁盘上页面的LSN与WAL记录的LSN进行比较来确定需求。如果页面的LSN看起来更大，则不需要应用该记录。实际上，它甚至无法应用，因为记录是按照严格的顺序应用的。

During a recovery, pages are changed in the buffer cache, as during regular work. To this end, postmaster launches the background processes needed.

在恢复过程中，与常规工作一样，页面将在缓冲区高速缓存中更改。为此， postmaster启动所需的后台进程。

WAL records are applied to files in a similar way: for example, if it is clear from a record that the file must exist, but it does not, the file is created.

WAL记录以类似的方式应用于文件：例如，如果从记录中清楚该文件必须存在，但不存在，则创建该文件。

And at the very end of the recovery process, respective initialization forks overwrite all unlogged tables to make them empty.

在恢复过程的最后，各个初始化叉会覆盖所有未记录的表，以使它们为空。

This is a very simplified description of the algorithm. Specifically, we haven't said a word so far on where to start reading WAL records (we have to put a talk on this off until we discuss a checkpoint).

这是该算法的非常简化的描述。具体来说，到目前为止，我们还没有说起从哪里开始阅读WAL记录(我们不得不对此进行讨论，直到讨论检查点为止)为止。

And the last thing to clarify. «Classically», a recovery process consists of two phases. At the first (roll forward) phase, log records are applied and the server redoes all the work lost due to failure. At the second (roll back) phase, transactions that were not committed by the moment of failure are rolled back. But PostgreSQL does not need the second phase. As we discussed earlier, thanks to the implementation features of the multiversion concurrency control, transactions do not need to be physically rolled back — it is sufficient that the commit bit is not set in XACT.

最后要澄清的是。 “经典地”，恢复过程包括两个阶段。在第一(前滚)阶段，将应用日志记录，并且服务器将重做由于故障而丢失的所有工作。在第二(回滚)阶段，将回滚在故障时刻尚未提交的事务。但是PostgreSQL不需要第二阶段。如前所述，由于多版本并发控制的实现功能，因此无需物理回滚事务-无需在XACT中设置提交位就足够了。

Read on. 继续阅读。