PostgreSQL Restart Recovery: XLOG 1.0

obvious__

Last modified 2024-07-27 13:15:06

First published 2021-07-30 11:28:27

Article link: https://blog.csdn.net/obvious__/article/details/119242661

XLOG 1.0

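References

https://zhmin.github.io/2019/11/05/postgresql-wal-format/

Prerequisites

"PostgreSQL Flow: Insert"

"PostgreSQL Basic Modules: Table and Tuple Organization"

Overview

As mentioned in an earlier article, XLOG is PostgreSQL's redo log. During restart recovery it is used to redo the changes of committed transactions that had not yet reached disk. XLOG is a physiological log (physical to a page, logical within a page), also called a Page-oriented Log: a log that records changes to database pages. Both terms are rather abstract when explained in isolation, so they are explained in detail as part of the XLOG flow. This article answers the following questions:

What does an insert write to XLOG?
How is the data in an XLOG record organized?
What is a Page-oriented Log?
How is data written to XLOG?

Note that writing XLOG involves two important steps:

Write the XLOG record into the log buffer.
Flush the XLOG in the log buffer to disk.

The essence of WAL is that XLOG reaches disk before the data it describes, and a transaction can commit only after all of its XLOG records have been flushed. When and how XLOG is flushed is therefore very important. Flushing is, however, a fairly independent topic with little bearing on the questions above, so this article focuses on how XLOG is written into the log buffer. Flushing will be covered in a separate article.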
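
The WAL rule stated above can be sketched as two comparisons against the flushed position of the log. This is a minimal illustration rather than PostgreSQL code: the type `Lsn` and the functions `can_write_page` and `can_commit` are hypothetical names introduced only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a log position (PostgreSQL uses XLogRecPtr). */
typedef uint64_t Lsn;

/* A data page may be written to disk only after the log that describes
 * its latest change (page_lsn) has been flushed. */
bool can_write_page(Lsn page_lsn, Lsn flushed_lsn)
{
    return page_lsn <= flushed_lsn;
}

/* A transaction may report commit only after all of its log records,
 * up to the end of its last record, are on disk. */
bool can_commit(Lsn last_record_end, Lsn flushed_lsn)
{
    return last_record_end <= flushed_lsn;
}
```

Both checks depend only on how far the log has been flushed, which is why flushing can be treated as a separate topic from building the record.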

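Writing XLOG

When a row is inserted into PostgreSQL, heap_insert is called. heap_insert first calls RelationPutHeapTuple to place the tuple on the page and then writes XLOG. The redo-log-related code of the insert path is listed below:

// Source: heapam.c, lines 2442-2516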
/* XLOG stuff */
if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
{
    xl_heap_insert xlrec;
    xl_heap_header xlhdr;
    XLogRecPtr	recptr;
    Page		page = BufferGetPage(buffer);
    uint8		info = XLOG_HEAP_INSERT;
    int			bufflags = 0;

    /*
	 * If this is a catalog, we need to transmit combocids to properly
	 * decode, so log that as well.
	 */
    if (RelationIsAccessibleInLogicalDecoding(relation))
        log_heap_new_cid(relation, heaptup);

    /*
	 * If this is the single and first tuple on page, we can reinit the
	 * page instead of restoring the whole thing.  Set flag, and hide
	 * buffer references from XLogInsert.
	 */
    if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
        PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
    {
        info |= XLOG_HEAP_INIT_PAGE;
        bufflags |= REGBUF_WILL_INIT;
    }

    xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
    xlrec.flags  = 0;
    if (all_visible_cleared)
        xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
    if (options & HEAP_INSERT_SPECULATIVE)
        xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
    Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));

    /*
	 * For logical decoding, we need the tuple even if we're doing a full
	 * page write, so make sure it's included even if we take a full-page
	 * image. (XXX We could alternatively store a pointer into the FPW).
	 */
    if (RelationIsLogicallyLogged(relation))
    {
        xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
        bufflags |= REGBUF_KEEP_DATA;
    }

    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);

    xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
    xlhdr.t_infomask = heaptup->t_data->t_infomask;
    xlhdr.t_hoff = heaptup->t_data->t_hoff;

    /*
	 * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
	 * write the whole page to the xlog, we don't need to store
	 * xl_heap_header in the xlog.
	 */
    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
    XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
    /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
    XLogRegisterBufData(0,
                        (char *) heaptup->t_data + SizeofHeapTupleHeader,
                        heaptup->t_len - SizeofHeapTupleHeader);

    /* filtering by origin on a row level is much more efficient */
    XLogIncludeOrigin();

    recptr = XLogInsert(RM_HEAP_ID, info);

    PageSetLSN(page, recptr);
}

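Our story starts here. Two things matter in this code:

Calls to XLogRegisterData, XLogRegisterBuffer and XLogRegisterBufData register the data that must go into the XLOG record.
A call to XLogInsert writes the XLOG record into the log buffer.

Next, let's look at what data an insert registers and how that data is written.

Composition of an XLOG Record

To answer these questions we first need to know what an XLOG record is made of. Consider the following test case:

DROP TABLE IF EXISTS test;
CREATE TABLE test(a int);
INSERT INTO test values(1);

The test case is very simple: it inserts one row into a table. For this insert statement the XLOG record consists of four parts:

Part 1: the XLOG header

The first part is also the most complex. It is itself made up of several smaller pieces.

XLogRecord

XLogRecord is the header of every XLOG record and holds its basic information. It is defined as follows: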

/*
 * The overall layout of an XLOG record is:
 *		Fixed-size header (XLogRecord struct)
 *		XLogRecordBlockHeader struct
 *		XLogRecordBlockHeader struct
 *		...
 *		XLogRecordDataHeader[Short|Long] struct
 *		block data
 *		block data
 *		...
 *		main data
 *
 * There can be zero or more XLogRecordBlockHeaders, and 0 or more bytes of
 * rmgr-specific data not associated with a block.  XLogRecord structs
 * always start on MAXALIGN boundaries in the WAL files, but the rest of
 * the fields are not aligned.
 *
 * The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
 * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
 * used to distinguish between block references, and the main data structs.
 */
typedef struct XLogRecord
{
	uint32		xl_tot_len;		/* total len of entire record */
	TransactionId xl_xid;		/* xact id */
	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
	uint8		xl_info;		/* flag bits, see below */
	RmgrId		xl_rmid;		/* resource manager for this record */
	/* 2 bytes of padding here, initialize to zero */
	pg_crc32c	xl_crc;			/* CRC for this record */

	/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */

} XLogRecord;

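This struct has many details worth studying; for now we cover only the basics needed later in this article.

xl_tot_len

Total length of the XLOG record, that is, the combined length of its four parts.

xl_xid

Transaction ID.

xl_prev

Physical position (that is, the LSN) of the previous record in the log.

xl_info

Flag bits.

xl_rmid

Resource manager ID. It identifies the kind of operation being logged, so that recovery can call the matching redo function. For the insert statement above the value is RM_HEAP_ID, and redo calls heap_redo.

xl_crc

CRC checksum of the record.

XLogRecordBlockHeader

Immediately after XLogRecord comes the XLogRecordBlockHeader struct, defined as follows: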


/*
 * Header info for block data appended to an XLOG record.
 *
 * 'data_length' is the length of the rmgr-specific payload data associated
 * with this block. It does not include the possible full page image, nor
 * XLogRecordBlockHeader struct itself.
 *
 * Note that we don't attempt to align the XLogRecordBlockHeader struct!
 * So, the struct must be copied to aligned local storage before use.
 */
typedef struct XLogRecordBlockHeader
{
	uint8		id;				/* block reference ID */
	uint8		fork_flags;		/* fork within the relation, and flags */
	uint16		data_length;	/* number of payload bytes (not including page
								 * image) */

	/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
	/* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
	/* BlockNumber follows */
} XLogRecordBlockHeader;

Two fields of this struct matter here:

fork_flags: the low 4 bits hold the fork type and the high 4 bits hold flags. The flags are:

/*
 * The fork number fits in the lower 4 bits in the fork_flags field. The upper
 * bits are used for flags.
 */
#define BKPBLOCK_FORK_MASK	0x0F
#define BKPBLOCK_FLAG_MASK	0xF0
#define BKPBLOCK_HAS_IMAGE	0x10	/* block data is an XLogRecordBlockImage */
#define BKPBLOCK_HAS_DATA	0x20
#define BKPBLOCK_WILL_INIT	0x40	/* redo will re-init the page */
#define BKPBLOCK_SAME_REL	0x80	/* RelFileNode omitted, same as previous */

These flags are explained later.

data_length: the size of the payload data that follows for this block.
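
To make the bit layout of fork_flags concrete, here is a small sketch that splits the byte into its fork number and its flag bits. The mask and flag values are copied from the excerpt above; the helper names `fork_of` and `has_flag` are hypothetical and do not exist in PostgreSQL.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask and flag values as listed in the excerpt above. */
#define BKPBLOCK_FORK_MASK 0x0F
#define BKPBLOCK_FLAG_MASK 0xF0
#define BKPBLOCK_HAS_IMAGE 0x10
#define BKPBLOCK_HAS_DATA  0x20
#define BKPBLOCK_WILL_INIT 0x40
#define BKPBLOCK_SAME_REL  0x80

/* Hypothetical helper: the fork number lives in the low 4 bits. */
uint8_t fork_of(uint8_t fork_flags)
{
    return fork_flags & BKPBLOCK_FORK_MASK;
}

/* Hypothetical helper: test one of the BKPBLOCK_* bits in the high 4 bits. */
bool has_flag(uint8_t fork_flags, uint8_t flag)
{
    return (fork_flags & BKPBLOCK_FLAG_MASK & flag) != 0;
}
```

For example, a block on fork number 0 that carries payload data but no page image would have fork_flags equal to 0x20.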

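XLogRecordBlockImageHeader

What follows XLogRecordBlockHeader depends on the situation. If the block needs to be backed up (a full-page image), an XLogRecordBlockImageHeader follows XLogRecordBlockHeader. Block backups exist to solve the partial write problem, which is covered later; for now assume no backup is needed, so this part is absent.

XLogRecordBlockCompressHeader

If compression is enabled, an XLogRecordBlockCompressHeader follows XLogRecordBlockImageHeader (more precisely, when the image is compressed and has a hole). Assume compression is off for now, so this part is absent as well.

RelFileNode

Next comes the RelFileNode struct, defined as follows: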
```
typedef struct RelFileNode
{
	Oid			spcNode;		/* tablespace */
	Oid			dbNode;			/* database */
	Oid			relNode;		/* relation */
} RelFileNode;
```
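This struct identifies the tablespace, database and table that the XLOG record applies to during restart recovery.

BlockNumber

Next comes the BlockNumber, which identifies the block that the XLOG record applies to during restart recovery.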
```
typedef uint32 BlockNumber;
```
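mainrdata_len

Last comes the mainrdata header, which records the length of mainrdata. Not every XLOG record has mainrdata, but the insert in our test case does, so it cannot be skipped. Because the main data here is shorter than 256 bytes, the short form is used: one id byte plus one length byte, 2 bytes in total.

Summary

To summarize: for the insert in our test case, assuming no block backup and no compression, the XLOG header consists of XLogRecord + XLogRecordBlockHeader + RelFileNode + BlockNumber + mainrdata_len, 46 bytes in total (24 + 4 + 12 + 4 + 2).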
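
The 46-byte figure can be checked with a small sketch that mirrors the structs quoted above using fixed-width types. The mirror type names (`RecordHeader`, `BlockHeader`, `RelNode`) and the function `insert_header_size` are hypothetical, and the widths are assumptions: 32 bits for TransactionId, Oid, BlockNumber and pg_crc32c, 64 bits for XLogRecPtr, 8 bits for RmgrId.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of XLogRecord with assumed fixed-width field types. */
typedef struct {
    uint32_t xl_tot_len;
    uint32_t xl_xid;
    uint64_t xl_prev;
    uint8_t  xl_info;
    uint8_t  xl_rmid;
    /* 2 bytes of padding here */
    uint32_t xl_crc;
} RecordHeader;

/* Mirror of XLogRecordBlockHeader. */
typedef struct {
    uint8_t  id;
    uint8_t  fork_flags;
    uint16_t data_length;
} BlockHeader;

/* Mirror of RelFileNode. */
typedef struct {
    uint32_t spcNode, dbNode, relNode;
} RelNode;

/* Short main-data header: one id byte plus one length byte. */
enum { SHORT_DATA_HEADER_SIZE = 2 };

size_t insert_header_size(void)
{
    return offsetof(RecordHeader, xl_crc) + sizeof(uint32_t) /* 24 */
         + sizeof(BlockHeader)                               /*  4 */
         + sizeof(RelNode)                                   /* 12 */
         + sizeof(uint32_t)                                  /*  4: BlockNumber */
         + SHORT_DATA_HEADER_SIZE;                           /*  2 */
}
```

The record header size is computed up to the end of xl_crc, so the two padding bytes before the CRC are counted while no tail padding is added.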

Part 2: xl_heap_header

The second part of the XLOG record is the xl_heap_header struct, defined as follows:

/*
 * We don't store the whole fixed part (HeapTupleHeaderData) of an inserted
 * or updated tuple in WAL; we can save a few bytes by reconstructing the
 * fields that are available elsewhere in the WAL record, or perhaps just
 * plain needn't be reconstructed.  These are the fields we must store.
 * NOTE: t_hoff could be recomputed, but we may as well store it because
 * it will come for free due to alignment considerations.
 */
typedef struct xl_heap_header
{
	uint16		t_infomask2;
	uint16		t_infomask;
	uint8		t_hoff;
} xl_heap_header;

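xl_heap_header is a reduced version of HeapTupleHeaderData (the tuple header; see "PostgreSQL Basic Modules: Table and Tuple Organization"). As the comment above the struct explains, the whole HeapTupleHeaderData need not be written to XLOG, because many of its fields can be reconstructed or do not need to be. Only the essential fields are stored, and xl_heap_header is what holds them.

xl_heap_header occupies 5 bytes.

Part 3: the tuple data

The third part of the XLOG record is the tuple data itself: the same bytes that were written to the data page for the part of the tuple after the fixed header. For the test case above, the logged tuple data is 5 bytes long (heaptup->t_len - SizeofHeapTupleHeader).

Part 4: xl_heap_insert

The last part is the xl_heap_insert struct, which records the tuple's offset within its physical block. It is defined as follows: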

/* This is what we need to know about insert */
typedef struct xl_heap_insert
{
	OffsetNumber offnum;		/* inserted tuple's offset */
	uint8		flags;

	/* xl_heap_header & TUPLE DATA in backup block 0 */
} xl_heap_insert;

The logged size of this struct (SizeOfHeapInsert) is 3 bytes.
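
The logged sizes of xl_heap_header and xl_heap_insert are smaller than what sizeof would typically report, because only the bytes up to and including the last field are written. The sketch below shows the idea with mirror structs; the names `heap_header_mirror`, `heap_insert_mirror` and the two `LOGGED_SIZE_*` macros are hypothetical, and OffsetNumber is assumed to be a 16-bit value.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of xl_heap_header. */
typedef struct {
    uint16_t t_infomask2;
    uint16_t t_infomask;
    uint8_t  t_hoff;
} heap_header_mirror;

/* Mirror of xl_heap_insert. */
typedef struct {
    uint16_t offnum;   /* OffsetNumber, assumed to be 16 bits */
    uint8_t  flags;
} heap_insert_mirror;

/* Logged size: up to and including the last field, without tail padding. */
#define LOGGED_SIZE_HEADER \
    (offsetof(heap_header_mirror, t_hoff) + sizeof(uint8_t))
#define LOGGED_SIZE_INSERT \
    (offsetof(heap_insert_mirror, flags) + sizeof(uint8_t))
```

On common platforms sizeof would give 6 and 4 for these structs because of tail padding, while the logged sizes are 5 and 3.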

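As explained in "PostgreSQL Basic Modules: Table and Tuple Organization", a complete tuple in PostgreSQL consists of an ItemIdData plus the tuple body. Within a data page, ItemIdData entries have a fixed size and are allocated from the front of the page toward the back. A tuple body consists of HeapTupleHeaderData plus the tuple contents, has a variable length, and is allocated from the back of the page toward the front. ItemIdData uses lp_off to record where the tuple body is. The key point is that xl_heap_insert records the position of the ItemIdData within the page (its offset number), not the byte offset of the tuple body. A log recorded this way is called a physiological log; if it recorded the byte offset of the tuple body it would be a physical log. The difference between the two is discussed in detail below.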
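
The difference between the two kinds of position can be seen in a toy page model. This is a deliberately simplified sketch, not the real heap page layout: `ToyPage` and the `toy_*` functions are hypothetical. It shows that repacking the page changes the byte offset of a tuple body while its offset number stays valid.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 64
#define MAX_ITEMS 4

/* Toy page: line pointers are indexed from the front, bodies grow from the back. */
typedef struct {
    uint16_t lp_off[MAX_ITEMS];  /* byte offset of each tuple body */
    uint16_t lp_len[MAX_ITEMS];  /* 0 means the slot is dead */
    uint16_t nitems;
    uint16_t upper;              /* start of the lowest tuple body */
    char     data[PAGE_SIZE];
} ToyPage;

void toy_init(ToyPage *p)
{
    memset(p, 0, sizeof *p);
    p->upper = PAGE_SIZE;
}

/* Insert a body; returns its 1-based offset number (the logical position). */
uint16_t toy_insert(ToyPage *p, const char *body, uint16_t len)
{
    assert(p->nitems < MAX_ITEMS && len <= p->upper);
    p->upper -= len;
    memcpy(p->data + p->upper, body, len);
    p->lp_off[p->nitems] = p->upper;
    p->lp_len[p->nitems] = len;
    return ++p->nitems;
}

/* Mark one slot dead and repack the surviving bodies toward the page end.
 * Bodies move (lp_off changes) but every offset number stays the same. */
void toy_delete_and_compact(ToyPage *p, uint16_t offnum)
{
    char     tmp[PAGE_SIZE];
    uint16_t upper = PAGE_SIZE;

    p->lp_len[offnum - 1] = 0;
    for (uint16_t i = 0; i < p->nitems; i++) {
        if (p->lp_len[i] == 0) {
            p->lp_off[i] = 0;
            continue;
        }
        upper -= p->lp_len[i];
        memcpy(tmp + upper, p->data + p->lp_off[i], p->lp_len[i]);
        p->lp_off[i] = upper;
    }
    memcpy(p->data + upper, tmp + upper, PAGE_SIZE - upper);
    p->upper = upper;
}

/* Look up a body through its offset number. */
const char *toy_get(const ToyPage *p, uint16_t offnum)
{
    return p->data + p->lp_off[offnum - 1];
}
```

A log record that stored the offset number would still find the tuple after compaction, whereas one that stored the old lp_off value would point at the wrong bytes.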

xl_heap_insert is filled in from line 2444 of heap_insert onward. The key line is:

// Source: heapam.c, line 2471
xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);

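Summary

To summarize, in the simplest case the XLOG record for an insert consists of the following parts:

XLogRecord + XLogRecordBlockHeader + RelFileNode + BlockNumber + mainrdata_len: 46 bytes
+
xl_heap_header: 5 bytes
+
Tuple data: depends on the data, 5 bytes in this test case
+
xl_heap_insert: 3 bytes

That gives 59 bytes in total for this test case. Together these parts carry the following key information:

What operation was performed?

Given by xl_rmid (together with xl_info) in XLogRecord.

What data was written?

Given by the tuple data.

Where was the data written?

Which tablespace? Given by spcNode in RelFileNode.

Which database? Given by dbNode in RelFileNode.

Which table? Given by relNode in RelFileNode.

Which block of the table? Given by BlockNumber.

Which position in the block? Given by offnum in xl_heap_insert.

This is exactly the information an insert needs, and with it redo becomes straightforward.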

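Page-oriented Log

Now that the structure of an XLOG record is clear, we can explain what a Page-oriented Log is. The record tells us which page a tuple is written to and at which position. The heap_insert flow shows that as soon as a tuple has been placed on a data page, an XLOG record is generated for that write and put into the log buffer. In other words, XLOG describes changes to the data in pages, and that is what makes it a Page-oriented Log. The counterpart is the logical log, which typically records just the SQL statement and re-executes it during redo. With a Page-oriented Log, redo always writes the tuple to the same page it was originally written to; with a logical log, the tuple may end up anywhere.

Page-oriented logs come in two kinds: physical logs and physiological logs. As mentioned earlier, a physical log records the physical position at which the tuple was placed in the page (the value of lp_off in ItemIdData), whereas a physiological log records only the logical position (the position of the ItemIdData itself).

Because a physical log records the tuple's actual byte offset, redo only has to go to that position and overwrite whatever is there, whether or not the tuple had reached disk. The operation is idempotent: the result is the same no matter how many times redo runs. The drawback is that once the block is reorganized (for example by vacuum), the physical positions of tuples change. To keep the physical information exact, the reorganization itself would have to generate a large amount of physical log, which hurts performance.

PostgreSQL therefore uses a physiological log. "Physical" means the record identifies the actual data page the tuple was written to; "logical" means the position within that page is recorded as a logical value. During vacuum, as long as the positions of the ItemIdData entries stay the same, the log is unaffected. A physiological log is not idempotent by itself, however: replaying a record several times without any safeguard would insert the tuple several times. It therefore needs a way to decide whether a given XLOG record still has to be redone on the target page, and that is the role of the LSN. This will be covered in a separate document.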
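
How an LSN can make a non-idempotent redo safe to repeat is sketched below. This is a simplified model under stated assumptions, not PostgreSQL's redo code: `LsnPage` and `redo_insert` are hypothetical, and a tuple counter stands in for the page contents.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a log position. */
typedef uint64_t Lsn;

/* Toy page: an LSN plus a tuple counter standing in for the page contents. */
typedef struct {
    Lsn      page_lsn;   /* position of the last record applied to this page */
    unsigned ntuples;
} LsnPage;

/* Replay one insert record. Returns true if the page was changed. */
bool redo_insert(LsnPage *page, Lsn record_lsn)
{
    if (page->page_lsn >= record_lsn)
        return false;             /* change is already on the page: skip it */
    page->ntuples++;              /* the insert itself is not idempotent */
    page->page_lsn = record_lsn;  /* remember that this record was applied */
    return true;
}
```

Replaying the same record twice changes the page only once, because the second attempt sees a page LSN that is no longer older than the record.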

How Data Is Written to XLOG

Finally, let's see how the pieces of an XLOG record are written and put together. As described in "Writing XLOG", writing a record takes two steps:

Register the data that must go into the XLOG record.
Call XLogInsert to write the XLOG record into the log buffer.

Each step is described below.

Registering Data

The registration functions used in heap_insert are XLogRegisterData, XLogRegisterBuffer and XLogRegisterBufData. They are called as follows:

// xlrec is the xl_heap_insert struct
XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
// xlhdr is the xl_heap_header struct
XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
// (char *) heaptup->t_data + SizeofHeapTupleHeader is the tuple data
XLogRegisterBufData(0,
                    (char *) heaptup->t_data + SizeofHeapTupleHeader,
                    heaptup->t_len - SizeofHeapTupleHeader);

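The code shows directly that heap_insert registers xl_heap_insert, xl_heap_header and the tuple data. Recall the composition of the XLOG record described earlier:

XLogRecord + XLogRecordBlockHeader + RelFileNode + BlockNumber + mainrdata_len +

xl_heap_header + tuple data + xl_heap_insert

At this point we have everything except the XLOG header. Now let's look at the registration functions one by one:

XLogRegisterData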

/*
 * Add data to the WAL record that's being constructed.
 *
 * The data is appended to the "main chunk", available at replay with
 * XLogRecGetData().
 */
void
XLogRegisterData(char *data, int len)
{
	XLogRecData *rdata;

	Assert(begininsert_called);

	if (num_rdatas >= max_rdatas)
		elog(ERROR, "too much WAL data");
	//1. Take an XLogRecData object, rdata, from the global rdatas array.
	rdata = &rdatas[num_rdatas++];
	//2. Store the data being registered in rdata.
	rdata->data = data;
	rdata->len = len;

	/*
	 * we use the mainrdata_last pointer to track the end of the chain, so no
	 * need to clear 'next' here.
	 */
	//3. Append rdata to the mainrdata chain.
	mainrdata_last->next = rdata;
	mainrdata_last = rdata;

	mainrdata_len += len;
}

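The function is very simple and has three steps:

Take an XLogRecData object, rdata, from the global rdatas array.
Store the data being registered in rdata.
Append rdata to the mainrdata chain.

First, the XLogRecData struct: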

/*
 * The functions in xloginsert.c construct a chain of XLogRecData structs
 * to represent the final WAL record.
 */
typedef struct XLogRecData
{
	struct XLogRecData *next;	/* next struct in chain, or NULL */
	char	   *data;			/* start of rmgr data to include */
	uint32		len;			/* length of rmgr data to include */
} XLogRecData;

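This is a typical linked-list node, and it is a key struct. The composition described earlier is really the on-disk layout of an XLOG record; in memory, the four parts are linked together by a chain of XLogRecData nodes.

Two points about XLogRegisterData are worth noting:

The rdatas array

To avoid the cost of frequent allocation and release, the rdatas array is allocated in advance by InitXLogInsert during process initialization. If more pieces of data are registered than the array can hold, an error is raised.

The mainrdata chain

Data registered through XLogRegisterData is linked into the mainrdata chain. mainrdata_len is the total length of all data on that chain, and it is used later.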
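
The combination of a preallocated node pool and a tail-tracked chain can be sketched in a few lines. This is a simplified model of the pattern, not the code in xloginsert.c: the names `Chunk`, `pool`, `chain_reset`, `chain_register`, `chain_head` and `chain_total` are hypothetical, the pool size is an arbitrary assumption, and the function returns -1 where PostgreSQL would raise an error.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Chunk {
    struct Chunk *next;
    const char   *data;
    uint32_t      len;
} Chunk;

#define MAX_CHUNKS 20             /* hypothetical pool size */

static Chunk    pool[MAX_CHUNKS]; /* allocated once, reused for every record */
static int      num_chunks;
static Chunk   *head, *tail;
static uint32_t chain_len;

/* Start a new record: forget earlier registrations but keep the pool. */
void chain_reset(void)
{
    num_chunks = 0;
    head = tail = NULL;
    chain_len = 0;
}

/* Register one piece of data; returns -1 when the pool is exhausted. */
int chain_register(const char *data, uint32_t len)
{
    if (num_chunks >= MAX_CHUNKS)
        return -1;
    Chunk *c = &pool[num_chunks++];
    c->data = data;   /* only the pointer is stored, nothing is copied */
    c->len = len;
    c->next = NULL;
    if (tail)
        tail->next = c;
    else
        head = c;
    tail = c;
    chain_len += len;
    return 0;
}

const Chunk *chain_head(void) { return head; }
uint32_t chain_total(void) { return chain_len; }
```

Because registration stores only pointers, the registered data must stay valid until the record has been copied into the log buffer.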

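XLogRegisterBuffer

The code above shows that XLogRegisterBuffer is called before XLogRegisterBufData registers xl_heap_header and the tuple data. XLogRegisterBuffer registers the basic information of a page. Recall the Page-oriented Log: XLOG is tied to pages, and the tuple registered afterwards belongs to a particular page, so the page has to be registered before the tuple. XLogRegisterBuffer is implemented as follows: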

/*
 * Register a reference to a buffer with the WAL record being constructed.
 * This must be called for every page that the WAL-logged operation modifies.
 */
void
XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
{
	registered_buffer *regbuf;

	/* NO_IMAGE doesn't make sense with FORCE_IMAGE */
	Assert(!((flags & REGBUF_FORCE_IMAGE) && (flags & (REGBUF_NO_IMAGE))));
	Assert(begininsert_called);

	if (block_id >= max_registered_block_id)
	{
		if (block_id >= max_registered_buffers)
			elog(ERROR, "too many registered buffers");
		max_registered_block_id = block_id + 1;
	}
	//1. Take a registered_buffer object, regbuf, from the global registered_buffers array.
	regbuf = &registered_buffers[block_id];

	//2. Store the page information in regbuf.
	BufferGetTag(buffer, &regbuf->rnode, &regbuf->forkno, &regbuf->block);
	regbuf->page = BufferGetPage(buffer);
	regbuf->flags = flags;
	//3. Initialize regbuf's data chain.
	regbuf->rdata_tail = (XLogRecData *) &regbuf->rdata_head;
	regbuf->rdata_len = 0;

	/*
	 * Check that this page hasn't already been registered with some other
	 * block_id.
	 */
#ifdef USE_ASSERT_CHECKING
	// This code is not important here and is omitted.
#endif

	regbuf->in_use = true;
}

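The flow is very similar to XLogRegisterData and has three steps:

Take a registered_buffer object, regbuf, from the global registered_buffers array.
Store the page information in regbuf.
Initialize regbuf's data chain.

First, the registered_buffer struct: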

/*
 * For each block reference registered with XLogRegisterBuffer, we fill in
 * a registered_buffer struct.
 */
typedef struct
{
	bool		in_use;			/* is this slot in use? */
	uint8		flags;			/* REGBUF_* flags */
	RelFileNode rnode;			/* identifies the relation and block */
	ForkNumber	forkno;
	BlockNumber block;
	Page		page;			/* page content */
	uint32		rdata_len;		/* total length of data in rdata chain */
	XLogRecData *rdata_head;	/* head of the chain of data registered with
								 * this block */
	XLogRecData *rdata_tail;	/* last entry in the chain, or &rdata_head if
								 * empty */

	XLogRecData bkp_rdatas[2];	/* temporary rdatas used to hold references to
								 * backup block data in XLogRecordAssemble() */

	/* buffer to store a compressed version of backup block image */
	char		compressed_page[PGLZ_MAX_BLCKSZ];
} registered_buffer;

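The members most relevant to this article are:

rnode

A RelFileNode struct, one of the components of the XLOG header.

block

A BlockNumber, one of the components of the XLOG header.

rdata_head, rdata_tail

The XLogRecData chain to which the tuple data is registered later.

Like the rdatas array used by XLogRegisterData, the registered_buffers array is allocated in advance in InitXLogInsert.
Once XLogRegisterBuffer has run, RelFileNode and BlockNumber have been registered.

XLogRegisterBufData

Finally, XLogRegisterBufData. It registers the xl_heap_header struct and the tuple data, which in effect registers one tuple. It is implemented as follows: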

/*
 * Add buffer-specific data to the WAL record that's being constructed.
 *
 * Block_id must reference a block previously registered with
 * XLogRegisterBuffer(). If this is called more than once for the same
 * block_id, the data is appended.
 *
 * The maximum amount of data that can be registered per block is 65535
 * bytes. That should be plenty; if you need more than BLCKSZ bytes to
 * reconstruct the changes to the page, you might as well just log a full
 * copy of it. (the "main data" that's not associated with a block is not
 * limited)
 */
void
XLogRegisterBufData(uint8 block_id, char *data, int len)
{
	registered_buffer *regbuf;
	XLogRecData *rdata;

	Assert(begininsert_called);

	/* find the registered buffer struct */
	regbuf = &registered_buffers[block_id];
	if (!regbuf->in_use)
		elog(ERROR, "no block with id %d registered with WAL insertion",
			 block_id);

	if (num_rdatas >= max_rdatas)
		elog(ERROR, "too much WAL data");
	rdata = &rdatas[num_rdatas++];

	rdata->data = data;
	rdata->len = len;

	regbuf->rdata_tail->next = rdata;
	regbuf->rdata_tail = rdata;
	regbuf->rdata_len += len;
}

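The function has three steps:

Take an XLogRecData object, rdata, from the global rdatas array.
Store the data being registered in rdata.
Append rdata to regbuf's chain.

Summary

To summarize, the registration phase has prepared some parts of the XLOG record and left others to be built:

Already prepared: RelFileNode, BlockNumber, xl_heap_header, tuple data, xl_heap_insert

Not yet built: XLogRecord, XLogRecordBlockHeader, mainrdata_len

The prepared parts were covered above. They are held in two places:

xl_heap_insert is on the mainrdata chain.
RelFileNode and BlockNumber are stored in regbuf, and xl_heap_header and the tuple data are on regbuf's rdata chain.

Next, XLogInsert is called and the actual writing of the XLOG record begins.

Writing the XLOG Record

As the summary above shows, the registration phase has collected most of the information, but some is still missing. Writing therefore starts by building the missing parts (XLogRecord, XLogRecordBlockHeader and mainrdata_len) and then copies the data into the log buffer. XLogInsert implements this: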

/*
 * Insert an XLOG record having the specified RMID and info bytes, with the
 * body of the record being the data and buffer references registered earlier
 * with XLogRegister* calls.
 *
 * Returns XLOG pointer to end of record (beginning of next record).
 * This can be used as LSN for data pages affected by the logged action.
 * (LSN is the XLOG point up to which the XLOG must be flushed to disk
 * before the data page can be written out.  This implements the basic
 * WAL rule "write the log before the data".)
 */
XLogRecPtr
XLogInsert(RmgrId rmid, uint8 info)
{
	XLogRecPtr	EndPos;

	/* XLogBeginInsert() must have been called. */
	if (!begininsert_called)
		elog(ERROR, "XLogBeginInsert was not called");

	/*
	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
	 * reserved for use by me.
	 */
	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
		elog(PANIC, "invalid xlog info mask %02X", info);

	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);

	/*
	 * In bootstrap mode, we don't actually log anything but XLOG resources;
	 * return a phony record pointer.
	 */
	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
	{
		XLogResetInsertion();
		EndPos = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
		return EndPos;
	}

	do
	{
		XLogRecPtr	RedoRecPtr;
		bool		doPageWrites;
		XLogRecPtr	fpw_lsn;
		XLogRecData *rdt;

		/*
		 * Get values needed to decide whether to do full-page writes. Since
		 * we don't yet have an insertion lock, these could change under us,
		 * but XLogInsertRecord will recheck them once it has a lock.
		 */
		GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
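		//1. Build the parts that are still missing (XLogRecord, XLogRecordBlockHeader, mainrdata_len)
		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
								 &fpw_lsn);
		//2. Copy the record into the log buffer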
		EndPos = XLogInsertRecord(rdt, fpw_lsn);
	} while (EndPos == InvalidXLogRecPtr);

	XLogResetInsertion();

	return EndPos;
}

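The key functions are XLogRecordAssemble and XLogInsertRecord. Their implementations follow.

XLogRecordAssemble

XLogRecordAssemble builds the parts that were still missing: XLogRecord, XLogRecordBlockHeader and mainrdata_len. It then assembles the four parts of the XLOG record (XLOG header + xl_heap_header + tuple data + xl_heap_insert) into an XLogRecData chain. It is implemented as follows: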

/*
 * Assemble a WAL record from the registered data and buffers into an
 * XLogRecData chain, ready for insertion with XLogInsertRecord().
 *
 * The record header fields are filled in, except for the xl_prev field. The
 * calculated CRC does not include the record header yet.
 *
 * If there are any registered buffers, and a full-page image was not taken
 * of all of them, *fpw_lsn is set to the lowest LSN among such pages. This
 * signals that the assembled record is only good for insertion on the
 * assumption that the RedoRecPtr and doPageWrites values were up-to-date.
 */
static XLogRecData *
XLogRecordAssemble(RmgrId rmid, uint8 info,
				   XLogRecPtr RedoRecPtr, bool doPageWrites,
				   XLogRecPtr *fpw_lsn)
{
	XLogRecData *rdt;
	uint32		total_len = 0;
	int			block_id;
	pg_crc32c	rdata_crc;
	registered_buffer *prev_regbuf = NULL;
	XLogRecData *rdt_datas_last;
	XLogRecord *rechdr;
	char	   *scratch = hdr_scratch;

	/*
	 * Note: this function can be called multiple times for the same record.
	 * All the modifications we do to the rdata chains below must handle that.
	 */

	/* The record begins with the fixed-size header */
	rechdr = (XLogRecord *) scratch;
	scratch += SizeOfXLogRecord;

	hdr_rdt.next = NULL;
	rdt_datas_last = &hdr_rdt;
	hdr_rdt.data = hdr_scratch;

	/*
	 * Make an rdata chain containing all the data portions of all block
	 * references. This includes the data for full-page images. Also append
	 * the headers for the block references in the scratch buffer.
	 */
	*fpw_lsn = InvalidXLogRecPtr;
	for (block_id = 0; block_id < max_registered_block_id; block_id++)
	{
		registered_buffer *regbuf = &registered_buffers[block_id];
		bool		needs_backup;
		bool		needs_data;
		XLogRecordBlockHeader bkpb;
		XLogRecordBlockImageHeader bimg;
		XLogRecordBlockCompressHeader cbimg = {0};
		bool		samerel;
		bool		is_compressed = false;

		if (!regbuf->in_use)
			continue;

		/* Determine if this block needs to be backed up */
		if (regbuf->flags & REGBUF_FORCE_IMAGE)
			needs_backup = true;
		else if (regbuf->flags & REGBUF_NO_IMAGE)
			needs_backup = false;
		else if (!doPageWrites)
			needs_backup = false;
		else
		{
			/*
			 * We assume page LSN is first data on *every* page that can be
			 * passed to XLogInsert, whether it has the standard page layout
			 * or not.
			 */
			XLogRecPtr	page_lsn = PageGetLSN(regbuf->page);

			needs_backup = (page_lsn <= RedoRecPtr);
			if (!needs_backup)
			{
				if (*fpw_lsn == InvalidXLogRecPtr || page_lsn < *fpw_lsn)
					*fpw_lsn = page_lsn;
			}
		}

		/* Determine if the buffer data needs to included */
		if (regbuf->rdata_len == 0)
			needs_data = false;
		else if ((regbuf->flags & REGBUF_KEEP_DATA) != 0)
			needs_data = true;
		else
			needs_data = !needs_backup;

		bkpb.id = block_id;
		bkpb.fork_flags = regbuf->forkno;
		bkpb.data_length = 0;

		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;

		if (needs_backup)
		{
			Page		page = regbuf->page;
			uint16		compressed_len;

			/*
			 * The page needs to be backed up, so calculate its hole length
			 * and offset.
			 */
			if (regbuf->flags & REGBUF_STANDARD)
			{
				/* Assume we can omit data between pd_lower and pd_upper */
				uint16		lower = ((PageHeader) page)->pd_lower;
				uint16		upper = ((PageHeader) page)->pd_upper;

				if (lower >= SizeOfPageHeaderData &&
					upper > lower &&
					upper <= BLCKSZ)
				{
					bimg.hole_offset = lower;
					cbimg.hole_length = upper - lower;
				}
				else
				{
					/* No "hole" to compress out */
					bimg.hole_offset = 0;
					cbimg.hole_length = 0;
				}
			}
			else
			{
				/* Not a standard page header, don't try to eliminate "hole" */
				bimg.hole_offset = 0;
				cbimg.hole_length = 0;
			}

			/*
			 * Try to compress a block image if wal_compression is enabled
			 */
			if (wal_compression)
			{
				is_compressed =
					XLogCompressBackupBlock(page, bimg.hole_offset,
											cbimg.hole_length,
											regbuf->compressed_page,
											&compressed_len);
			}

			/*
			 * Fill in the remaining fields in the XLogRecordBlockHeader
			 * struct
			 */
			bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;

			/*
			 * Construct XLogRecData entries for the page content.
			 */
			rdt_datas_last->next = &regbuf->bkp_rdatas[0];
			rdt_datas_last = rdt_datas_last->next;

			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;

			if (is_compressed)
			{
				bimg.length = compressed_len;
				bimg.bimg_info |= BKPIMAGE_IS_COMPRESSED;

				rdt_datas_last->data = regbuf->compressed_page;
				rdt_datas_last->len = compressed_len;
			}
			else
			{
				bimg.length = BLCKSZ - cbimg.hole_length;

				if (cbimg.hole_length == 0)
				{
					rdt_datas_last->data = page;
					rdt_datas_last->len = BLCKSZ;
				}
				else
				{
					/* must skip the hole */
					rdt_datas_last->data = page;
					rdt_datas_last->len = bimg.hole_offset;

					rdt_datas_last->next = &regbuf->bkp_rdatas[1];
					rdt_datas_last = rdt_datas_last->next;

					rdt_datas_last->data =
						page + (bimg.hole_offset + cbimg.hole_length);
					rdt_datas_last->len =
						BLCKSZ - (bimg.hole_offset + cbimg.hole_length);
				}
			}

			total_len += bimg.length;
		}

		if (needs_data)
		{
			/*
			 * Link the caller-supplied rdata chain for this buffer to the
			 * overall list.
			 */
			bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
			bkpb.data_length = regbuf->rdata_len;
			total_len += regbuf->rdata_len;

			rdt_datas_last->next = regbuf->rdata_head;
			rdt_datas_last = regbuf->rdata_tail;
		}

		if (prev_regbuf && RelFileNodeEquals(regbuf->rnode, prev_regbuf->rnode))
		{
			samerel = true;
			bkpb.fork_flags |= BKPBLOCK_SAME_REL;
		}
		else
			samerel = false;
		prev_regbuf = regbuf;

		/* Ok, copy the header to the scratch buffer */
		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
		scratch += SizeOfXLogRecordBlockHeader;
		if (needs_backup)
		{
			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
			scratch += SizeOfXLogRecordBlockImageHeader;
			if (cbimg.hole_length != 0 && is_compressed)
			{
				memcpy(scratch, &cbimg,
					   SizeOfXLogRecordBlockCompressHeader);
				scratch += SizeOfXLogRecordBlockCompressHeader;
			}
		}
		if (!samerel)
		{
			memcpy(scratch, &regbuf->rnode, sizeof(RelFileNode));
			scratch += sizeof(RelFileNode);
		}
		memcpy(scratch, &regbuf->block, sizeof(BlockNumber));
		scratch += sizeof(BlockNumber);
	}

	/* followed by the record's origin, if any */
	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
	{
		*(scratch++) = (char) XLR_BLOCK_ID_ORIGIN;
		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
		scratch += sizeof(replorigin_session_origin);
	}

	/* followed by main data, if any */
	if (mainrdata_len > 0)
	{
		if (mainrdata_len > 255)
		{
			*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
			memcpy(scratch, &mainrdata_len, sizeof(uint32));
			scratch += sizeof(uint32);
		}
		else
		{
			*(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
			*(scratch++) = (uint8) mainrdata_len;
		}
		rdt_datas_last->next = mainrdata_head;
		rdt_datas_last = mainrdata_last;
		total_len += mainrdata_len;
	}
	rdt_datas_last->next = NULL;

	hdr_rdt.len = (scratch - hdr_scratch);
	total_len += hdr_rdt.len;

	/*
	 * Calculate CRC of the data
	 *
	 * Note that the record header isn't added into the CRC initially since we
	 * don't know the prev-link yet.  Thus, the CRC will represent the CRC of
	 * the whole record in the order: rdata, then backup blocks, then record
	 * header.
	 */
	INIT_CRC32C(rdata_crc);
	COMP_CRC32C(rdata_crc, hdr_scratch + SizeOfXLogRecord, hdr_rdt.len - SizeOfXLogRecord);
	for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
		COMP_CRC32C(rdata_crc, rdt->data, rdt->len);

	/*
	 * Fill in the fields in the record header. Prev-link is filled in later,
	 * once we know where in the WAL the record will be inserted. The CRC does
	 * not include the record header yet.
	 */
	rechdr->xl_xid = GetCurrentTransactionIdIfAny();
	rechdr->xl_tot_len = total_len;
	rechdr->xl_info = info;
	rechdr->xl_rmid = rmid;
	rechdr->xl_prev = InvalidXLogRecPtr;
	rechdr->xl_crc = rdata_crc;

	return &hdr_rdt;
}

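The code is long, so let's go through it with specific questions in mind. As stated above, XLogRecordAssemble produces a chain containing the four parts of the XLOG record, and the registration phase has already provided the three parts other than the XLOG header. Two questions therefore remain for XLogRecordAssemble.

How the XLOG Header Is Built

Start with one key line:

char	   *scratch = hdr_scratch;

What is hdr_scratch?

static char *hdr_scratch = NULL;

hdr_scratch is a global buffer. From the detailed description of the XLOG header above we know that its length is not fixed. Again to avoid frequent allocation and release, PostgreSQL allocates the space for the XLOG header in advance in InitXLogInsert, sized for the largest possible header.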

#define HEADER_SCRATCH_SIZE \
	(SizeOfXLogRecord + \
	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)

So to understand how the XLOG header is built, we need to follow what XLogRecordAssemble does with the scratch pointer.

Building XLogRecord

rechdr = (XLogRecord *) scratch;
scratch += SizeOfXLogRecord;
// intermediate code omitted
rechdr->xl_xid = GetCurrentTransactionIdIfAny();
rechdr->xl_tot_len = total_len;
rechdr->xl_info = info;
rechdr->xl_rmid = rmid;
rechdr->xl_prev = InvalidXLogRecPtr;
rechdr->xl_crc = rdata_crc;

Building XLogRecordBlockHeader

XLogRecordBlockHeader bkpb;
bkpb.id = block_id;
bkpb.fork_flags = regbuf->forkno;
bkpb.data_length = 0;

if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
    bkpb.fork_flags |= BKPBLOCK_WILL_INIT;

memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
scratch += SizeOfXLogRecordBlockHeader;

Building RelFileNode

if (!samerel)
{
	memcpy(scratch, &regbuf->rnode, sizeof(RelFileNode));
	scratch += sizeof(RelFileNode);
}

Here we can see that the RelFileNode data comes from regbuf->rnode, which was registered earlier.

Building BlockNumber

memcpy(scratch, &regbuf->block, sizeof(BlockNumber));
scratch += sizeof(BlockNumber);

Building mainrdata_len

if (mainrdata_len > 0)
{
	if (mainrdata_len > 255)
	{
		*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
		memcpy(scratch, &mainrdata_len, sizeof(uint32));
		scratch += sizeof(uint32);
	}
	else
	{
		*(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
		*(scratch++) = (uint8) mainrdata_len;
	}
    // Link mainrdata into the XLogRecData chain.
	rdt_datas_last->next = mainrdata_head;
	rdt_datas_last = mainrdata_last;
	total_len += mainrdata_len;
}

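Two things are worth noting here:

The XLOG header records only the length of mainrdata, which in our test case is the length of xl_heap_insert.
Besides building mainrdata_len, the code above also links mainrdata, that is xl_heap_insert, into the XLogRecData chain. That is the next question to address.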
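
The short and long forms of the main-data header can be isolated into a standalone helper for experimentation. This is a sketch: `put_main_data_header` is a hypothetical function, and the id values 255 and 254 are assumed from PostgreSQL's xlogrecord.h and should be checked against the version in use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed values of the id bytes (see xlogrecord.h). */
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG  254

/* Hypothetical helper: write the main-data header at 'scratch' and
 * return the number of bytes written (0, 2 for short, 5 for long). */
size_t put_main_data_header(unsigned char *scratch, uint32_t mainrdata_len)
{
    unsigned char *p = scratch;

    if (mainrdata_len == 0)
        return 0;                       /* no main data, no header */
    if (mainrdata_len > 255) {
        *p++ = XLR_BLOCK_ID_DATA_LONG;
        memcpy(p, &mainrdata_len, sizeof(uint32_t)); /* native byte order */
        p += sizeof(uint32_t);
    } else {
        *p++ = XLR_BLOCK_ID_DATA_SHORT;
        *p++ = (unsigned char) mainrdata_len;
    }
    return (size_t) (p - scratch);
}
```

For the insert in the test case the main data is the 3-byte xl_heap_insert, so the short form applies and the header takes 2 bytes, matching the 2 in the 46-byte header total.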

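How the Four Parts of the XLOG Record Are Chained

Next, let's see how the four parts are chained together. XLogRecordAssemble ultimately returns hdr_rdt.

return &hdr_rdt;

So we need to look at how XLogRecordAssemble manipulates hdr_rdt.

hdr_rdt.next = NULL;			// initialize the next pointer
rdt_datas_last = &hdr_rdt;		// point at the head of the chain

hdr_rdt serves as the head of the chain, so rdt_datas_last is set to point at it.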

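Adding the XLOG header to the chain
```
hdr_rdt.data = hdr_scratch;
// intermediate code omitted
hdr_rdt.len = (scratch - hdr_scratch);
```
hdr_rdt is the head of the chain, so the XLOG header buffer is assigned directly to its data field; once the header has been built, its length is computed.

Adding xl_heap_header and the tuple data to the chain

From the registration phase we know that xl_heap_header and the tuple data are both on regbuf's XLogRecData chain, with xl_heap_header first and the tuple data second (xl_heap_header was registered first). So it is enough to link the head of regbuf's chain after the current tail of the hdr_rdt chain.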

if (needs_data)
{
    /*
	 * Link the caller-supplied rdata chain for this buffer to the
	 * overall list.
	 */
    bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
    bkpb.data_length = regbuf->rdata_len;
    total_len += regbuf->rdata_len;

    // link into the chain
    rdt_datas_last->next = regbuf->rdata_head;
    rdt_datas_last = regbuf->rdata_tail;
}

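Adding xl_heap_insert to the chain

This was already covered when building mainrdata_len, so it is not repeated here.

Summary

After XLogRecordAssemble we have a chain that contains the four parts we want to write to XLOG.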
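
What it means to write such a chain can be illustrated by walking it and copying each piece into one contiguous buffer. This is only a sketch of the idea: `Piece` and `flatten_chain` are hypothetical names, and PostgreSQL's own copy routine also has to deal with WAL page boundaries and locking, which are ignored here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same shape as XLogRecData: next pointer, data pointer, length. */
typedef struct Piece {
    struct Piece *next;
    const char   *data;
    uint32_t      len;
} Piece;

/* Copy every piece of the chain, in order, into 'dst'.
 * Returns the number of bytes written, or 0 if 'cap' is too small. */
size_t flatten_chain(const Piece *head, char *dst, size_t cap)
{
    size_t written = 0;

    for (const Piece *p = head; p != NULL; p = p->next) {
        if (p->len > cap - written)
            return 0;
        memcpy(dst + written, p->data, p->len);
        written += p->len;
    }
    return written;
}
```

With the chain built for the test case, the pieces would appear in the order XLOG header, xl_heap_header, tuple data, xl_heap_insert, which is the on-disk order described earlier.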

XLogInsertRecord

Finally, let's look at XLogInsertRecord, the function that actually writes the XLOG record into the log buffer.

/*
 * Insert an XLOG record represented by an already-constructed chain of data
 * chunks.  This is a low-level routine; to construct the WAL record header
 * and data, use the higher-level routines in xloginsert.c.
 *
 * If 'fpw_lsn' is valid, it is the oldest LSN among the pages that this
 * WAL record applies to, that were not included in the record as full page
 * images.  If fpw_lsn >= RedoRecPtr, the function does not perform the
 * insertion and returns InvalidXLogRecPtr.  The caller can then recalculate
 * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
 * record is always inserted.
 *
 * The first XLogRecData in the chain must be for the record header, and its
 * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
 * xl_crc fields in the header, the rest of the header must already be filled
 * by the caller.
 *
 * Returns XLOG pointer to end of record (beginning of next record).
 * This can be used as LSN for data pages affected by the logged action.
 * (LSN is the XLOG point up to which the XLOG must be flushed to disk
 * before the data page can be written out.  This implements the basic
 * WAL rule "write the log before the data".)
 */
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
{
	XLogCtlInsert *Insert = &XLogCtl->Insert;
	pg_crc32c	rdata_crc;
	bool		inserted;
	XLogRecord *rechdr = (XLogRecord *) rdata->data;
	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
							   rechdr->xl_info == XLOG_SWITCH);
	XLogRecPtr	StartPos;
	XLogRecPtr	EndPos;

	/* we assume that all of the record header is in the first chunk */
	Assert(rdata->len >= SizeOfXLogRecord);

	/* cross-check on whether we should be here or not */
	if (!XLogInsertAllowed())
		elog(ERROR, "cannot make new WAL entries during recovery");

	/*----------
	 *
	 * We have now done all the preparatory work we can without holding a
	 * lock or modifying shared state. From here on, inserting the new WAL
	 * record to the shared WAL buffer cache is a two-step process:
	 *
	 * 1. Reserve the right amount of space from the WAL. The current head of
	 *	  reserved space is kept in Insert->CurrBytePos, and is protected by
	 *	  insertpos_lck.
	 *
	 * 2. Copy the record to the reserved WAL space. This involves finding the
	 *	  correct WAL buffer containing the reserved space, and copying the
	 *	  record in place. This can be done concurrently in multiple processes.
	 *
	 * To keep track of which insertions are still in-progress, each concurrent
	 * inserter acquires an insertion lock. In addition to just indicating that
	 * an insertion is in progress, the lock tells others how far the inserter
	 * has progressed. There is a small fixed number of insertion locks,
	 * determined by NUM_XLOGINSERT_LOCKS. When an inserter crosses a page
	 * boundary, it updates the value stored in the lock to the how far it has
	 * inserted, to allow the previous buffer to be flushed.
	 *
	 * Holding onto an insertion lock also protects RedoRecPtr and
	 * fullPageWrites from changing until the insertion is finished.
	 *
	 * Step 2 can usually be done completely in parallel. If the required WAL
	 * page is not initialized yet, you have to grab WALBufMappingLock to
	 * initialize it, but the WAL writer tries to do that ahead of insertions
	 * to avoid that from happening in the critical path.
	 *
	 *----------
	 */
	START_CRIT_SECTION();
	if (isLogSwitch)
		WALInsertLockAcquireExclusive();
	else
		WALInsertLockAcquire();

	/*
	 * Check to see if my copy of RedoRecPtr or doPageWrites is out of date.
	 * If so, may have to go back and have the caller recompute everything.
	 * This can only happen just after a checkpoint, so it's better to be slow
	 * in this case and fast otherwise.
	 *
	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
	 * affect the contents of the XLOG record, so we'll update our local copy
	 * but not force a recomputation.  (If doPageWrites was just turned off,
	 * we could recompute the record without full pages, but we choose not to
	 * bother.)
	 */
	if (RedoRecPtr != Insert->RedoRecPtr)
	{
		Assert(RedoRecPtr < Insert->RedoRecPtr);
		RedoRecPtr = Insert->RedoRecPtr;
	}
	doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);

	if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
	{
		/*
		 * Oops, some buffer now needs to be backed up that the caller didn't
		 * back up.  Start over.
		 */
		WALInsertLockRelease();
		END_CRIT_SECTION();
		return InvalidXLogRecPtr;
	}

	/*
	 * Reserve space for the record in the WAL. This also sets the xl_prev
	 * pointer.
	 */
	if (isLogSwitch)
		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
	else
	{
		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
								  &rechdr->xl_prev);
		inserted = true;
	}

	if (inserted)
	{
		/*
		 * Now that xl_prev has been filled in, calculate CRC of the record
		 * header.
		 */
		rdata_crc = rechdr->xl_crc;
		COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
		FIN_CRC32C(rdata_crc);
		rechdr->xl_crc = rdata_crc;

		/*
		 * All the record data, including the header, is now ready to be
		 * inserted. Copy the record in the space reserved.
		 */
		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
							StartPos, EndPos);
	}
	else
	{
		/*
		 * This was an xlog-switch record, but the current insert location was
		 * already exactly at the beginning of a segment, so there was no need
		 * to do anything.
		 */
	}

	/*
	 * Done! Let others know that we're finished.
	 */
	WALInsertLockRelease();

	MarkCurrentTransactionIdLoggedIfAny();

	END_CRIT_SECTION();

	/*
	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
	 */
	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
	{
		SpinLockAcquire(&XLogCtl->info_lck);
		/* advance global request to include new block(s) */
		if (XLogCtl->LogwrtRqst.Write < EndPos)
			XLogCtl->LogwrtRqst.Write = EndPos;
		/* update local result copy while I have the chance */
		LogwrtResult = XLogCtl->LogwrtResult;
		SpinLockRelease(&XLogCtl->info_lck);
	}

	/*
	 * If this was an XLOG_SWITCH record, flush the record and the empty
	 * padding space that fills the rest of the segment, and perform
	 * end-of-segment actions (eg, notifying archiver).
	 */
	if (isLogSwitch)
	{
		TRACE_POSTGRESQL_XLOG_SWITCH();
		XLogFlush(EndPos);

		/*
		 * Even though we reserved the rest of the segment for us, which is
		 * reflected in EndPos, we return a pointer to just the end of the
		 * xlog-switch record.
		 */
		if (inserted)
		{
			EndPos = StartPos + SizeOfXLogRecord;
			if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
			{
				if (EndPos % XLOG_SEG_SIZE == EndPos % XLOG_BLCKSZ)
					EndPos += SizeOfXLogLongPHD;
				else
					EndPos += SizeOfXLogShortPHD;
			}
		}
	}

#ifdef WAL_DEBUG
	if (XLOG_DEBUG)
	{
		static XLogReaderState *debug_reader = NULL;
		StringInfoData buf;
		StringInfoData recordBuf;
		char	   *errormsg = NULL;
		MemoryContext oldCxt;

		oldCxt = MemoryContextSwitchTo(walDebugCxt);

		initStringInfo(&buf);
		appendStringInfo(&buf, "INSERT @ %X/%X: ",
						 (uint32) (EndPos >> 32), (uint32) EndPos);

		/*
		 * We have to piece together the WAL record data from the XLogRecData
		 * entries, so that we can pass it to the rm_desc function as one
		 * contiguous chunk.
		 */
		initStringInfo(&recordBuf);
		for (; rdata != NULL; rdata = rdata->next)
			appendBinaryStringInfo(&recordBuf, rdata->data, rdata->len);

		if (!debug_reader)
			debug_reader = XLogReaderAllocate(NULL, NULL);

		if (!debug_reader)
		{
			appendStringInfoString(&buf, "error decoding record: out of memory");
		}
		else if (!DecodeXLogRecord(debug_reader, (XLogRecord *) recordBuf.data,
								   &errormsg))
		{
			appendStringInfo(&buf, "error decoding record: %s",
							 errormsg ? errormsg : "no error message");
		}
		else
		{
			appendStringInfoString(&buf, " - ");
			xlog_outdesc(&buf, debug_reader);
		}
		elog(LOG, "%s", buf.data);

		pfree(buf.data);
		pfree(recordBuf.data);
		MemoryContextSwitchTo(oldCxt);
	}
#endif

	/*
	 * Update our global variables
	 */
	ProcLastRecPtr = StartPos;
	XactLastRecEnd = EndPos;

	return EndPos;
}

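This function relies on two important functions:

ReserveXLogInsertLocation

Requests space from the log buffer, reserving room for the XLOG record that is about to be written.
CopyXLogRecordToWAL

Writes rdata, that is, the contents of the XLogRecData chain, into the log buffer.

Let's look at each of these two functions in turn: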

ReserveXLogInsertLocation

/*
 * Reserves the right amount of space for a record of given size from the WAL.
 * *StartPos is set to the beginning of the reserved section, *EndPos to
 * its end+1. *PrevPtr is set to the beginning of the previous record; it is
 * used to set the xl_prev of this record.
 *
 * This is the performance critical part of XLogInsert that must be serialized
 * across backends. The rest can happen mostly in parallel. Try to keep this
 * section as short as possible, insertpos_lck can be heavily contended on a
 * busy system.
 *
 * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
 * where we actually copy the record to the reserved space.
 */
static void
ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
						  XLogRecPtr *PrevPtr)
{
	XLogCtlInsert *Insert = &XLogCtl->Insert;
	uint64		startbytepos;
	uint64		endbytepos;
	uint64		prevbytepos;

	size = MAXALIGN(size);

	/* All (non xlog-switch) records should contain data. */
	Assert(size > SizeOfXLogRecord);

	/*
	 * The duration the spinlock needs to be held is minimized by minimizing
	 * the calculations that have to be done while holding the lock. The
	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
	 * page headers. The mapping between "usable" byte positions and physical
	 * positions (XLogRecPtrs) can be done outside the locked region, and
	 * because the usable byte position doesn't include any headers, reserving
	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
	 */
    // acquire the lock
	SpinLockAcquire(&Insert->insertpos_lck);

    // reserve space
	startbytepos = Insert->CurrBytePos;
	endbytepos = startbytepos + size;
	prevbytepos = Insert->PrevBytePos;
	Insert->CurrBytePos = endbytepos;
	Insert->PrevBytePos = startbytepos;

    // release the lock
	SpinLockRelease(&Insert->insertpos_lck);

    // return the write positions
	*StartPos = XLogBytePosToRecPtr(startbytepos);
	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);

	/*
	 * Check that the conversions between "usable byte positions" and
	 * XLogRecPtrs work consistently in both directions.
	 */
	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
	Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
}

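The core of this function is the section from SpinLockAcquire down to the three assignments to *StartPos, *EndPos and *PrevPtr. Insert points to an XLogCtlInsert structure, a global object reached through XLogCtl->Insert. Its CurrBytePos and PrevBytePos fields control where writes go in the log buffer. After reserving the space, ReserveXLogInsertLocation returns the write position of the XLOG record through its output parameters.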

Note:

XLogBytePosToRecPtr is a very elegant piece of design. We leave its exact meaning to XLOG 2.0; for now it is enough to know that it returns an XLOG write position.
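The reservation step can be sketched outside PostgreSQL roughly as follows. This is only an illustration under stated assumptions: `InsertState`, `reserve_space` and `ALIGN8` are hypothetical names, a C11 `atomic_flag` stands in for the insertpos_lck spinlock, alignment is assumed to be 8 bytes, and the positions stay plain "usable byte" offsets (the XLogBytePosToRecPtr mapping is deliberately left out).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Round up to a multiple of 8; assumed to mirror MAXALIGN on common builds. */
#define ALIGN8(x) (((uint64_t) (x) + 7) & ~(uint64_t) 7)

/* Hypothetical, cut-down counterpart of the fields used from XLogCtlInsert. */
typedef struct InsertState
{
    atomic_flag lock;            /* stands in for insertpos_lck */
    uint64_t    curr_byte_pos;   /* end of the space reserved so far */
    uint64_t    prev_byte_pos;   /* start of the most recent record */
} InsertState;

/*
 * Reserve `size` bytes (rounded up to the alignment). On return, *start and
 * *end delimit the reserved range and *prev is the start of the previous
 * record, which the caller would store as xl_prev.
 */
void
reserve_space(InsertState *st, uint32_t size,
              uint64_t *start, uint64_t *end, uint64_t *prev)
{
    uint64_t aligned = ALIGN8(size);

    assert(aligned > 0);

    /* spin until the flag is ours; the critical section is kept tiny */
    while (atomic_flag_test_and_set_explicit(&st->lock, memory_order_acquire))
        ;

    *start = st->curr_byte_pos;
    *end = *start + aligned;
    *prev = st->prev_byte_pos;
    st->curr_byte_pos = *end;
    st->prev_byte_pos = *start;

    atomic_flag_clear_explicit(&st->lock, memory_order_release);
}
```

Only a handful of additions and assignments happen while the lock is held. Everything else, including the copy into the buffer, can then proceed in parallel because each caller owns a disjoint range.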

CopyXLogRecordToWAL

Once ReserveXLogInsertLocation has reserved the space, CopyXLogRecordToWAL can be called to do the actual write. Its code is as follows:

/*
 * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
 * area in the WAL.
 */
static void
CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
					XLogRecPtr StartPos, XLogRecPtr EndPos)
{
	char	   *currpos;
	int			freespace;
	int			written;
	XLogRecPtr	CurrPos;
	XLogPageHeader pagehdr;

	/*
	 * Get a pointer to the right place in the right WAL buffer to start
	 * inserting to.
	 */
	CurrPos = StartPos;
	currpos = GetXLogBuffer(CurrPos);
	freespace = INSERT_FREESPACE(CurrPos);

	/*
	 * there should be enough space for at least the first field (xl_tot_len)
	 * on this page.
	 */
	Assert(freespace >= sizeof(uint32));

	/* Copy record data */
	written = 0;
	while (rdata != NULL)
	{
		char	   *rdata_data = rdata->data;
		int			rdata_len = rdata->len;

		while (rdata_len > freespace)
		{
			/*
			 * Write what fits on this page, and continue on the next page.
			 */
			Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || freespace == 0);
			memcpy(currpos, rdata_data, freespace);
			rdata_data += freespace;
			rdata_len -= freespace;
			written += freespace;
			CurrPos += freespace;

			/*
			 * Get pointer to beginning of next page, and set the xlp_rem_len
			 * in the page header. Set XLP_FIRST_IS_CONTRECORD.
			 *
			 * It's safe to set the contrecord flag and xlp_rem_len without a
			 * lock on the page. All the other flags were already set when the
			 * page was initialized, in AdvanceXLInsertBuffer, and we're the
			 * only backend that needs to set the contrecord flag.
			 */
			currpos = GetXLogBuffer(CurrPos);
			pagehdr = (XLogPageHeader) currpos;
			pagehdr->xlp_rem_len = write_len - written;
			pagehdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;

			/* skip over the page header */
			if (CurrPos % XLogSegSize == 0)
			{
				CurrPos += SizeOfXLogLongPHD;
				currpos += SizeOfXLogLongPHD;
			}
			else
			{
				CurrPos += SizeOfXLogShortPHD;
				currpos += SizeOfXLogShortPHD;
			}
			freespace = INSERT_FREESPACE(CurrPos);
		}

		Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
		memcpy(currpos, rdata_data, rdata_len);
		currpos += rdata_len;
		CurrPos += rdata_len;
		freespace -= rdata_len;
		written += rdata_len;

		rdata = rdata->next;
	}
	Assert(written == write_len);

	/*
	 * If this was an xlog-switch, it's not enough to write the switch record,
	 * we also have to consume all the remaining space in the WAL segment. We
	 * have already reserved it for us, but we still need to make sure it's
	 * allocated and zeroed in the WAL buffers so that when the caller (or
	 * someone else) does XLogWrite(), it can really write out all the zeros.
	 */
	if (isLogSwitch && CurrPos % XLOG_SEG_SIZE != 0)
	{
		/* An xlog-switch record doesn't contain any data besides the header */
		Assert(write_len == SizeOfXLogRecord);

		/*
		 * We do this one page at a time, to make sure we don't deadlock
		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
		 */
		Assert(EndPos % XLogSegSize == 0);

		/* Use up all the remaining space on the first page */
		CurrPos += freespace;

		while (CurrPos < EndPos)
		{
			/* initialize the next page (if not initialized already) */
			WALInsertLockUpdateInsertingAt(CurrPos);
			AdvanceXLInsertBuffer(CurrPos, false);
			CurrPos += XLOG_BLCKSZ;
		}
	}
	else
	{
		/* Align the end position, so that the next record starts aligned */
		CurrPos = MAXALIGN64(CurrPos);
	}

	if (CurrPos != EndPos)
		elog(PANIC, "space reserved for WAL record does not match what was written");
}

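Let's first go through the important parameters of this function:

write_len

The total length of the XLOG record, used for validation.
rdata

The XLogRecData chain, which holds the data of the four parts of the XLOG record.
StartPos

The position at which the XLOG record is written.
EndPos

The end position of the XLOG record, used for validation.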

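The core of this function is the while (rdata != NULL) loop together with the Assert(written == write_len) that follows it: it walks the rdata chain and writes each chunk to the position that CurrPos points to. The main part of this core code is: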

while (rdata != NULL)
{
    char	   *rdata_data = rdata->data;
    int			rdata_len = rdata->len;

    while (rdata_len > freespace)
    {
        /*
		 * Write what fits on this page, and continue on the next page.
		 * (omitted)
		 */
    }

    Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
    memcpy(currpos, rdata_data, rdata_len);
    currpos += rdata_len;
    CurrPos += rdata_len;
    freespace -= rdata_len;
    written += rdata_len;

    rdata = rdata->next;
}

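The inner while (rdata_len > freespace) loop handles the case where the XLOG data still to be written is longer than the free space left in the current page of the log buffer. In that case, part of the record is written to the current page first, and the rest continues on the next page.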
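To make the page-crossing behaviour concrete, here is a toy version of the copy loop. It is a sketch under simplifying assumptions, not the real function: the "WAL" is one flat array, pages are 32 bytes with a uniform 8-byte header (no long header at segment starts), the first 4 header bytes hold the remaining length as xlp_rem_len does, and `Chunk`, `page_free` and `copy_chain` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 32u   /* toy page size, assumed for the example */
#define PAGE_HDR   8u   /* toy page header size, same on every page */

/* Simplified stand-in for XLogRecData. */
typedef struct Chunk
{
    struct Chunk *next;
    const char   *data;
    uint32_t      len;
} Chunk;

/* Free bytes left on the page containing pos; 0 exactly at a page boundary. */
uint32_t
page_free(uint32_t pos)
{
    uint32_t off = pos % PAGE_SIZE;

    return off == 0 ? 0 : PAGE_SIZE - off;
}

/*
 * Copy a chain of chunks into wal starting at physical offset start, which
 * must lie past the header of its page. Returns the offset just after the
 * last byte written.
 */
uint32_t
copy_chain(char *wal, uint32_t start, const Chunk *chain, uint32_t total_len)
{
    uint32_t pos = start;
    uint32_t freespace = page_free(pos);
    uint32_t written = 0;

    for (const Chunk *c = chain; c != NULL; c = c->next)
    {
        const char *src = c->data;
        uint32_t    left = c->len;

        while (left > freespace)
        {
            /* write what fits on this page */
            memcpy(wal + pos, src, freespace);
            src += freespace;
            left -= freespace;
            written += freespace;
            pos += freespace;

            /* pos is now a page start: note the remaining length, skip header */
            uint32_t rem = total_len - written;

            memcpy(wal + pos, &rem, sizeof rem);
            pos += PAGE_HDR;
            freespace = page_free(pos);
        }

        /* the rest of this chunk fits on the current page */
        memcpy(wal + pos, src, left);
        pos += left;
        freespace -= left;
        written += left;
    }

    assert(written == total_len);
    return pos;
}
```

For example, starting at offset 24 with a 5-byte chunk followed by a 10-byte chunk, the first 8 bytes fill page 0, the remaining length 7 is stored at offset 32, and the last 7 bytes land at offset 40, just past the next page's header.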

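Closing remarks

At this point we have walked through the main flow by which an insert operation writes to XLOG. Along the way we set some questions aside for now, such as how partial writes are handled, and skipped over some interesting details, such as how XLogBytePosToRecPtr is implemented. These are covered in PostgreSQL Restart Recovery: XLOG 2.0.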