PostgreSQL Restart Recovery---XLOG 1.0

XLOG 1.0

References

https://zhmin.github.io/2019/11/05/postgresql-wal-format/

Prerequisites

PostgreSQL Workflow: Insert

PostgreSQL Basic Modules: Table and Tuple Organization

Overview

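As mentioned in an earlier article, XLOG is PostgreSQL's redo log: during restart recovery it is used to redo changes from committed transactions that had not yet reached disk. XLOG is a physiological log (a "physical-logical" log), which is one kind of Page-oriented Log, that is, a log that records changes to database pages. Both concepts are rather abstract when described on their own, so they are explained in detail while walking through the XLOG workflow. This article answers the following questions:

  • What does an insert operation write to XLOG?
  • How is the data inside an XLOG record organized?
  • What is a Page-oriented Log?
  • How is data written to XLOG?

Note that writing XLOG actually involves two important steps:

  • Writing the XLOG record into the log buffer
  • Flushing the XLOG in the log buffer to disk

The essence of WAL is that XLOG reaches disk before the data does, and a transaction can commit only after all of its XLOG has been flushed. When and how XLOG is flushed is therefore very important. However, flushing is relatively independent and has little to do with the questions listed above, so this article focuses on how XLOG is written into the log buffer. Flushing will be covered in a separate article.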

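Writing XLOG

When a row is inserted into PostgreSQL, heap_insert is called. heap_insert first calls RelationPutHeapTuple to write the tuple into the page and then writes XLOG. The redo-log-related code in the insert path is listed below:

// Source: heapam.c, lines 2442-2516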
/* XLOG stuff */
if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
{
    xl_heap_insert xlrec;
    xl_heap_header xlhdr;
    XLogRecPtr	recptr;
    Page		page = BufferGetPage(buffer);
    uint8		info = XLOG_HEAP_INSERT;
    int			bufflags = 0;

    /*
	 * If this is a catalog, we need to transmit combocids to properly
	 * decode, so log that as well.
	 */
    if (RelationIsAccessibleInLogicalDecoding(relation))
        log_heap_new_cid(relation, heaptup);

    /*
	 * If this is the single and first tuple on page, we can reinit the
	 * page instead of restoring the whole thing.  Set flag, and hide
	 * buffer references from XLogInsert.
	 */
    if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
        PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
    {
        info |= XLOG_HEAP_INIT_PAGE;
        bufflags |= REGBUF_WILL_INIT;
    }

    xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
    xlrec.flags  = 0;
    if (all_visible_cleared)
        xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
    if (options & HEAP_INSERT_SPECULATIVE)
        xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
    Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));

    /*
	 * For logical decoding, we need the tuple even if we're doing a full
	 * page write, so make sure it's included even if we take a full-page
	 * image. (XXX We could alternatively store a pointer into the FPW).
	 */
    if (RelationIsLogicallyLogged(relation))
    {
        xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
        bufflags |= REGBUF_KEEP_DATA;
    }

    XLogBeginInsert();
    XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);

    xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
    xlhdr.t_infomask = heaptup->t_data->t_infomask;
    xlhdr.t_hoff = heaptup->t_data->t_hoff;

    /*
	 * note we mark xlhdr as belonging to buffer; if XLogInsert decides to
	 * write the whole page to the xlog, we don't need to store
	 * xl_heap_header in the xlog.
	 */
    XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
    XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
    /* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
    XLogRegisterBufData(0,
                        (char *) heaptup->t_data + SizeofHeapTupleHeader,
                        heaptup->t_len - SizeofHeapTupleHeader);

    /* filtering by origin on a row level is much more efficient */
    XLogIncludeOrigin();

    recptr = XLogInsert(RM_HEAP_ID, info);

    PageSetLSN(page, recptr);
}

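Our story starts here. There are two important groups of operations:

  • Calling XLogRegisterData, XLogRegisterBuffer, and XLogRegisterBufData to register the data that must be written to XLOG.
  • Calling XLogInsert to write the XLOG record into the log buffer.

Next, let's look at what data an insert registers and how that data is written.

XLOG Composition

To answer these questions, we first need to understand what an XLOG record is made of. Consider the following example: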

DROP TABLE IF EXISTS test;
CREATE TABLE test(a int);
INSERT INTO test values(1);

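The example is very simple: it inserts one row into a table. For this insert statement, the XLOG record consists of four parts:

Part 1: The XLOG Header

The first part is also the most complex. It is itself made up of several smaller pieces:

  • XLogRecord

    XLogRecord is the header of every XLOG record and holds the record's basic information. It is defined as follows: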

    /*
     * The overall layout of an XLOG record is:
     *		Fixed-size header (XLogRecord struct)
     *		XLogRecordBlockHeader struct
     *		XLogRecordBlockHeader struct
     *		...
     *		XLogRecordDataHeader[Short|Long] struct
     *		block data
     *		block data
     *		...
     *		main data
     *
     * There can be zero or more XLogRecordBlockHeaders, and 0 or more bytes of
     * rmgr-specific data not associated with a block.  XLogRecord structs
     * always start on MAXALIGN boundaries in the WAL files, but the rest of
     * the fields are not aligned.
     *
     * The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
     * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
     * used to distinguish between block references, and the main data structs.
     */
    typedef struct XLogRecord
    {
    	uint32		xl_tot_len;		/* total len of entire record */
    	TransactionId xl_xid;		/* xact id */
    	XLogRecPtr	xl_prev;		/* ptr to previous record in log */
    	uint8		xl_info;		/* flag bits, see below */
    	RmgrId		xl_rmid;		/* resource manager for this record */
    	/* 2 bytes of padding here, initialize to zero */
    	pg_crc32c	xl_crc;			/* CRC for this record */
    
    	/* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
    
    } XLogRecord;
    

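    This struct has many details worth studying; for now we cover only the basics that are needed later in this article.

    • xl_tot_len

      The total length of the XLOG record, that is, the sum of the lengths of its four parts.

    • xl_xid

      The transaction ID.

    • xl_prev

      The physical offset (that is, the LSN) of the previous record.

    • xl_info

      Information flag bits.

    • xl_rmid

      The resource manager ID. It identifies which kind of operation this record describes, so that recovery can call the matching function to redo it. For the insert statement above, the value is RM_HEAP_ID, and redo will call heap_redo.

    • xl_crc

      The checksum (CRC) of the record.

  • XLogRecordBlockHeader

    Immediately after XLogRecord comes the XLogRecordBlockHeader struct, defined as follows: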

    
    /*
     * Header info for block data appended to an XLOG record.
     *
     * 'data_length' is the length of the rmgr-specific payload data associated
     * with this block. It does not include the possible full page image, nor
     * XLogRecordBlockHeader struct itself.
     *
     * Note that we don't attempt to align the XLogRecordBlockHeader struct!
     * So, the struct must be copied to aligned local storage before use.
     */
    typedef struct XLogRecordBlockHeader
    {
    	uint8		id;				/* block reference ID */
    	uint8		fork_flags;		/* fork within the relation, and flags */
    	uint16		data_length;	/* number of payload bytes (not including page
    								 * image) */
    
    	/* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
    	/* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
    	/* BlockNumber follows */
    } XLogRecordBlockHeader;
    

    Two fields of this struct matter here:

    • fork_flags: the lower 4 bits hold the fork type and the upper 4 bits hold flags. The flags are as follows:

      /*
       * The fork number fits in the lower 4 bits in the fork_flags field. The upper
       * bits are used for flags.
       */
      #define BKPBLOCK_FORK_MASK	0x0F
      #define BKPBLOCK_FLAG_MASK	0xF0
      #define BKPBLOCK_HAS_IMAGE	0x10	/* block data is an XLogRecordBlockImage */
      #define BKPBLOCK_HAS_DATA	0x20
      #define BKPBLOCK_WILL_INIT	0x40	/* redo will re-init the page */
      #define BKPBLOCK_SAME_REL	0x80	/* RelFileNode omitted, same as previous */
      

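      These are explained later.

    • data_length: the number of payload bytes associated with this block. As the comment above the struct says, it excludes any full-page image and the XLogRecordBlockHeader itself.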

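  • XLogRecordBlockImageHeader

    What follows XLogRecordBlockHeader depends on the situation. If the block needs to be backed up (a full-page image), XLogRecordBlockHeader is followed by an XLogRecordBlockImageHeader. Backup blocks exist to solve the partial write problem, which is covered later; for now we assume no backup block is needed, so this piece is absent.

  • XLogRecordBlockCompressHeader

    After XLogRecordBlockImageHeader, if compression is enabled, an XLogRecordBlockCompressHeader follows (in the XLogRecordAssemble code shown later, it is written only when the image is compressed and has a non-zero hole). We assume compression is not enabled, so this piece is absent as well.

  • RelFileNode

    Next comes the RelFileNode struct, defined as follows: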

    typedef struct RelFileNode
    {
    	Oid			spcNode;		/* tablespace */
    	Oid			dbNode;			/* database */
    	Oid			relNode;		/* relation */
    } RelFileNode;
    

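    This struct identifies the tablespace, database, and table that the XLOG record applies to during restart recovery.

  • BlockNumber

    Next is the BlockNumber, which identifies the block that the XLOG record applies to during restart recovery.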

    typedef uint32 BlockNumber;
    
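  • mainrdata_len

    Last is the mainrdata header, which records the length of mainrdata. Not every XLOG record has mainrdata, but the insert statement in our example does, so it cannot be skipped. In its short form the header takes 2 bytes: an id byte followed by a one-byte length.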
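The fork_flags byte of XLogRecordBlockHeader described above packs two things into one byte. A minimal sketch of packing and unpacking it is shown here; the mask and flag values are copied from the definitions quoted earlier, while pack_fork_flags, fork_of, and has_flag are hypothetical helper names that do not exist in PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask and flag values copied from the definitions quoted in the article. */
#define BKPBLOCK_FORK_MASK	0x0F
#define BKPBLOCK_FLAG_MASK	0xF0
#define BKPBLOCK_HAS_IMAGE	0x10
#define BKPBLOCK_HAS_DATA	0x20
#define BKPBLOCK_WILL_INIT	0x40
#define BKPBLOCK_SAME_REL	0x80

/* Hypothetical helper: combine a fork number (low 4 bits) with flag bits. */
static uint8_t
pack_fork_flags(uint8_t forknum, uint8_t flags)
{
	return (uint8_t) ((forknum & BKPBLOCK_FORK_MASK) |
					  (flags & BKPBLOCK_FLAG_MASK));
}

/* Hypothetical helper: extract the fork number from fork_flags. */
static uint8_t
fork_of(uint8_t fork_flags)
{
	return fork_flags & BKPBLOCK_FORK_MASK;
}

/* Hypothetical helper: test whether one flag bit is set. */
static bool
has_flag(uint8_t fork_flags, uint8_t flag)
{
	return (fork_flags & flag) != 0;
}
```

For the simple insert in this article (main fork, data present, no full-page image), one would expect has_flag(ff, BKPBLOCK_HAS_DATA) to be true and has_flag(ff, BKPBLOCK_HAS_IMAGE) to be false.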

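Summary

To summarize: for the insert statement in the example, assuming no backup block and no compression, the XLOG header consists of XLogRecord + XLogRecordBlockHeader + RelFileNode + BlockNumber + mainrdata_len, 46 bytes in total (24 + 4 + 12 + 4 + 2).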
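The 46-byte figure can be checked with a small sketch. The structs below are stand-ins that mirror the field layouts quoted above, with fixed-width integer types assumed for TransactionId, XLogRecPtr, RmgrId, pg_crc32c, and Oid; the type names ending in Sketch and the function insert_header_len are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for XLogRecord; field widths are assumptions. */
typedef struct
{
	uint32_t	xl_tot_len;
	uint32_t	xl_xid;			/* TransactionId */
	uint64_t	xl_prev;		/* XLogRecPtr */
	uint8_t		xl_info;
	uint8_t		xl_rmid;		/* RmgrId */
	/* 2 bytes of padding here */
	uint32_t	xl_crc;			/* pg_crc32c */
} XLogRecordSketch;

/* Stand-in for XLogRecordBlockHeader. */
typedef struct
{
	uint8_t		id;
	uint8_t		fork_flags;
	uint16_t	data_length;
} BlockHeaderSketch;

/* Stand-in for RelFileNode: three 4-byte Oids. */
typedef struct
{
	uint32_t	spcNode;
	uint32_t	dbNode;
	uint32_t	relNode;
} RelFileNodeSketch;

/* Header length for the simple insert: no image, no compression. */
static size_t
insert_header_len(void)
{
	size_t		len = 0;

	/* sizes are taken up to the end of the last field, ignoring tail padding */
	len += offsetof(XLogRecordSketch, xl_crc) + sizeof(uint32_t);		/* 24 */
	len += offsetof(BlockHeaderSketch, data_length) + sizeof(uint16_t);	/* 4 */
	len += sizeof(RelFileNodeSketch);									/* 12 */
	len += sizeof(uint32_t);		/* BlockNumber: 4 */
	len += 2;						/* short main-data header: id + length */
	return len;
}
```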

Part 2: xl_heap_header

The second part of the XLOG record is the xl_heap_header struct, defined as follows:

/*
 * We don't store the whole fixed part (HeapTupleHeaderData) of an inserted
 * or updated tuple in WAL; we can save a few bytes by reconstructing the
 * fields that are available elsewhere in the WAL record, or perhaps just
 * plain needn't be reconstructed.  These are the fields we must store.
 * NOTE: t_hoff could be recomputed, but we may as well store it because
 * it will come for free due to alignment considerations.
 */
typedef struct xl_heap_header
{
	uint16		t_infomask2;
	uint16		t_infomask;
	uint8		t_hoff;
} xl_heap_header;

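xl_heap_header is a trimmed-down version of HeapTupleHeaderData (the tuple header; see "PostgreSQL Basic Modules: Table and Tuple Organization"). The comment above the struct says it clearly: there is no need to write the whole HeapTupleHeaderData to XLOG, because many of its fields can be reconstructed or need not be reconstructed at all. Only the essential fields have to be stored, and xl_heap_header is what records them.

The xl_heap_header struct occupies 5 bytes in the record.

Part 3: The Tuple Data

The third part of the XLOG record is the tuple data itself, which is identical to the data written into the data page at insert time (everything after the fixed-size tuple header). For the example above, this tuple data is 5 bytes long.

Part 4: xl_heap_insert

The last part is the xl_heap_insert struct, which records the tuple's offset within its physical block. It is defined as follows: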

/* This is what we need to know about insert */
typedef struct xl_heap_insert
{
	OffsetNumber offnum;		/* inserted tuple's offset */
	uint8		flags;

	/* xl_heap_header & TUPLE DATA in backup block 0 */
} xl_heap_insert;

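This struct occupies 3 bytes in the record.

As described in "PostgreSQL Basic Modules: Table and Tuple Organization", a complete tuple in PostgreSQL consists of an ItemIdData plus the tuple body. Within a data page, ItemIdData entries are fixed-length and are allocated from the front of the page toward the back. The tuple body consists of HeapTupleHeaderData plus the tuple contents; its length varies, and it is allocated from the back of the page toward the front. ItemIdData uses lp_off to mark where the tuple body is. The key point is that xl_heap_insert records the position of the ItemIdData in the page (its offset number), not the offset of the tuple body. A log that records positions this way is called a physiological log; if it recorded the offset of the tuple body, it would be a physical log. The difference between the two is discussed in detail below.

For how xl_heap_insert is assembled, see heap_insert from line 2444 onward. The key line is:

// Source: heapam.c, line 2471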
xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self); 

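Summary

To summarize: for an insert, in the simplest case, the XLOG record consists of the following parts:

XLogRecord + XLogRecordBlockHeader + RelFileNode + BlockNumber + mainrdata_len: 46 bytes
+
xl_heap_header: 5 bytes
+
actual tuple data: depends on the data; 5 bytes in this example
+
xl_heap_insert: 3 bytes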
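Adding the four parts gives the total record length, which is what xl_tot_len holds. The sketch below is only an illustration under stated assumptions: the struct stand-ins mirror the definitions quoted above, the sizes are computed with offsetof up to the end of the last field (to our knowledge this is how SizeOfHeapHeader and SizeOfHeapInsert are defined, since plain sizeof would include tail padding), and the values 28 for t_len and 23 for the fixed tuple header size are assumed for a single int column on a typical build rather than taken from the article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for xl_heap_header. */
typedef struct
{
	uint16_t	t_infomask2;
	uint16_t	t_infomask;
	uint8_t		t_hoff;
} HeapHeaderSketch;

/* Stand-in for xl_heap_insert; OffsetNumber assumed to be 16 bits. */
typedef struct
{
	uint16_t	offnum;
	uint8_t		flags;
} HeapInsertSketch;

/* Sizes up to the end of the last field, excluding tail padding. */
#define SKETCH_SIZE_OF_HEAP_HEADER \
	(offsetof(HeapHeaderSketch, t_hoff) + sizeof(uint8_t))
#define SKETCH_SIZE_OF_HEAP_INSERT \
	(offsetof(HeapInsertSketch, flags) + sizeof(uint8_t))

/*
 * Total record length = XLOG header + xl_heap_header + tuple data +
 * xl_heap_insert, where the tuple data is t_len minus the fixed header.
 */
static size_t
insert_record_len(size_t header_len, size_t t_len, size_t tuple_header_size)
{
	size_t		tuple_data_len = t_len - tuple_header_size;

	return header_len + SKETCH_SIZE_OF_HEAP_HEADER +
		tuple_data_len + SKETCH_SIZE_OF_HEAP_INSERT;
}
```

With the assumed values, insert_record_len(46, 28, 23) comes to 46 + 5 + 5 + 3 = 59 bytes.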

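Together these parts carry the following key information:

  • What operation was performed?

    Indicated by xl_rmid in XLogRecord, together with the operation code in xl_info (XLOG_HEAP_INSERT here).

  • What data was written?

    Given by the actual tuple data.

  • Where was the data written?

    Which tablespace? Given by spcNode in RelFileNode.

    Which database? Given by dbNode in RelFileNode.

    Which table? Given by relNode in RelFileNode.

    Which block of the table? Given by BlockNumber.

    Which position within the block? Given by offnum in xl_heap_insert.

This is exactly the information an insert needs, and with it redo becomes straightforward.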

Page-oriented Log

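With the structure of XLOG understood, we can explain what a Page-oriented Log is. The record tells us which page, and which position in that page, a tuple should be written to. From the heap_insert flow we can also see that as soon as a tuple is written into a data page, an XLOG record is generated for that write and placed in the log buffer. In other words, XLOG describes data changes within pages, and that is what Page-oriented Log means. Its counterpart is the logical log, which typically records just the SQL statement and re-executes it during redo. With a Page-oriented Log, redo always writes the tuple to the same page it was originally written to; with a logical log, where redo places the tuple is not fixed in advance.

Page-oriented Logs are further divided into physical logs and physiological logs. As mentioned above, a physical log records the physical position at which the tuple was placed in the page (the value of lp_off in ItemIdData), whereas a physiological log records only the logical position (the position of the ItemIdData itself).

With a physical log, because the tuple's actual offset is recorded, redo only needs to locate that position and overwrite whatever is there (regardless of whether the tuple had reached disk). The operation is idempotent: the result is the same no matter how many times redo runs. The drawback is that once a block is reorganized (for example by vacuum), the physical positions of tuples change. To keep the physical information accurate, the reorganization would itself have to generate a large volume of physical log, which hurts performance considerably.

PostgreSQL therefore uses a physiological log. "Physical" means the record names the actual data page the tuple was inserted into; "logical" means the position within that page is a logical value. Vacuum then only needs to keep the ItemIdData positions unchanged and the log is unaffected. A physiological log is not idempotent by itself, though: replaying it repeatedly with no extra handling would write the row more than once. It therefore needs a way to decide whether a given XLOG record must be redone on the corresponding page, and that is the role of the LSN. This topic will be covered in a separate document.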
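To make the role of the LSN concrete, here is a toy sketch, not PostgreSQL code. ToyPage, ToyInsertRecord, and toy_redo_insert are invented for illustration, and the "insert" simply appends a value to a small array, which is deliberately non-idempotent. The guard compares the record's LSN with the LSN stored on the page and skips records the page already reflects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ITEMS 8

/* Toy page: remembers the LSN of the last record applied to it. */
typedef struct
{
	uint64_t	lsn;
	int			nitems;
	int			items[TOY_MAX_ITEMS];
} ToyPage;

/* Toy insert record: the LSN assigned to the record plus the value. */
typedef struct
{
	uint64_t	lsn;
	int			value;
} ToyInsertRecord;

/*
 * Redo one insert.  Returns true if the record was applied and false if it
 * was skipped, either because the page already contains the change or
 * because the toy page is full.
 */
static bool
toy_redo_insert(ToyPage *page, const ToyInsertRecord *rec)
{
	if (rec->lsn <= page->lsn)
		return false;			/* page is already at or past this record */
	if (page->nitems >= TOY_MAX_ITEMS)
		return false;

	page->items[page->nitems++] = rec->value;	/* the non-idempotent step */
	page->lsn = rec->lsn;						/* mirrors PageSetLSN in heap_insert */
	return true;
}
```

Replaying the same record twice leaves exactly one item on the page, because the second call sees rec->lsn <= page->lsn and does nothing.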

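How Data Is Written to XLOG

Finally, let's look at how the pieces an XLOG record needs are written and put together. As described in the "Writing XLOG" section, writing XLOG has two main steps:

  • Registering the data that must be written to XLOG.
  • Calling XLogInsert to write the XLOG record into the log buffer.

We cover these two steps in turn.

Registering Data

The registration functions used in heap_insert are XLogRegisterData, XLogRegisterBuffer, and XLogRegisterBufData. They are called as follows:

// xlrec is the xl_heap_insert struct
XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
// xlhdr is the xl_heap_header struct
XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
// (char *) heaptup->t_data + SizeofHeapTupleHeader is the actual tuple data
XLogRegisterBufData(0,
                    (char *) heaptup->t_data + SizeofHeapTupleHeader,
                    heaptup->t_len - SizeofHeapTupleHeader);

The code shows directly that heap_insert registers xl_heap_insert, xl_heap_header, and the actual tuple data. Recall the parts of an XLOG record described earlier:

XLogRecord + XLogRecordBlockHeader + RelFileNode + BlockNumber + mainrdata_len +

xl_heap_header + actual tuple data + xl_heap_insert

So we now have everything except the XLOG header. Let's look at these registration functions: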

XLogRegisterData
/*
 * Add data to the WAL record that's being constructed.
 *
 * The data is appended to the "main chunk", available at replay with
 * XLogRecGetData().
 */
void
XLogRegisterData(char *data, int len)
{
	XLogRecData *rdata;

	Assert(begininsert_called);

	if (num_rdatas >= max_rdatas)
		elog(ERROR, "too much WAL data");
	// 1. Take an XLogRecData slot, rdata, from the global rdatas array.
	rdata = &rdatas[num_rdatas++];
	// 2. Store the data being registered in rdata.
	rdata->data = data;
	rdata->len = len;

	/*
	 * we use the mainrdata_last pointer to track the end of the chain, so no
	 * need to clear 'next' here.
	 */
	// 3. Append rdata to the mainrdata chain.
	mainrdata_last->next = rdata;
	mainrdata_last = rdata;

	mainrdata_len += len;
}

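The function is very simple and has three main steps:

  1. Take an XLogRecData slot, rdata, from the global rdatas array.
  2. Store the data being registered in rdata.
  3. Append rdata to the mainrdata chain.

First, let's look at the XLogRecData struct: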

/*
 * The functions in xloginsert.c construct a chain of XLogRecData structs
 * to represent the final WAL record.
 */
typedef struct XLogRecData
{
	struct XLogRecData *next;	/* next struct in chain, or NULL */
	char	   *data;			/* start of rmgr data to include */
	uint32		len;			/* length of rmgr data to include */
} XLogRecData;

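This is a typical linked-list node, and it is a key struct. The XLOG composition described earlier is really the layout of a record on disk; in memory, the four parts of a record are tied together by a chain of XLogRecData nodes.

Two points about XLogRegisterData are worth noting:

  • The rdatas array

    To avoid the overhead of frequently allocating and freeing memory, the rdatas array is allocated in advance by InitXLogInsert during process initialization. If more pieces of data are registered than the array can hold, an error is raised immediately.

  • The mainrdata chain

    Data registered with XLogRegisterData is linked into the mainrdata chain. mainrdata_len is the total length of all data on that chain. It is used later.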
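The two points above, a preallocated pool of nodes and a chain with a running total length, can be sketched in isolation. This is a simplified stand-in rather than the PostgreSQL implementation: RDataSketch, MainChainSketch, chain_reset, chain_register, and the pool size of 20 are all hypothetical, and the tail is tracked with a pointer-to-pointer instead of the mainrdata_last pointer used in the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_RDATAS 20	/* hypothetical pool size */

/* Stand-in for XLogRecData. */
typedef struct RDataSketch
{
	struct RDataSketch *next;
	const char *data;
	uint32_t	len;
} RDataSketch;

/* Preallocated pool plus one chain built from it. */
typedef struct
{
	RDataSketch pool[SKETCH_MAX_RDATAS];
	int			num_used;
	RDataSketch *head;
	RDataSketch **tail_next;	/* where the next node's address is stored */
	uint32_t	total_len;		/* plays the role of mainrdata_len */
} MainChainSketch;

/* Start a new, empty chain; no memory is allocated or freed. */
static void
chain_reset(MainChainSketch *c)
{
	c->num_used = 0;
	c->head = NULL;
	c->tail_next = &c->head;
	c->total_len = 0;
}

/* Register one chunk; returns 0 on success, -1 if the pool is exhausted. */
static int
chain_register(MainChainSketch *c, const char *data, uint32_t len)
{
	RDataSketch *rdata;

	if (c->num_used >= SKETCH_MAX_RDATAS)
		return -1;				/* the real code raises an error here */

	rdata = &c->pool[c->num_used++];	/* 1. take a slot from the pool */
	rdata->data = data;					/* 2. record the chunk */
	rdata->len = len;
	rdata->next = NULL;

	*c->tail_next = rdata;				/* 3. append to the chain */
	c->tail_next = &rdata->next;
	c->total_len += len;
	return 0;
}
```

Note that only pointers are stored: the chunk itself is not copied at registration time, which matches the data and len fields of XLogRecData shown above.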

XLogRegisterBuffer

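From the code above it is easy to see that XLogRegisterBuffer is called before XLogRegisterBufData registers xl_heap_header and the actual tuple data. XLogRegisterBuffer registers the basic information about a page. Recall the earlier discussion of Page-oriented Logs: XLOG is a page-related log, and the tuple registered afterwards belongs to a particular page, so the page must be registered before the tuple. XLogRegisterBuffer is implemented as follows: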

/*
 * Register a reference to a buffer with the WAL record being constructed.
 * This must be called for every page that the WAL-logged operation modifies.
 */
void
XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
{
	registered_buffer *regbuf;

	/* NO_IMAGE doesn't make sense with FORCE_IMAGE */
	Assert(!((flags & REGBUF_FORCE_IMAGE) && (flags & (REGBUF_NO_IMAGE))));
	Assert(begininsert_called);

	if (block_id >= max_registered_block_id)
	{
		if (block_id >= max_registered_buffers)
			elog(ERROR, "too many registered buffers");
		max_registered_block_id = block_id + 1;
	}
	// 1. Take a registered_buffer slot, regbuf, from the global registered_buffers array.
	regbuf = &registered_buffers[block_id];

	// 2. Store the page information in regbuf.
	BufferGetTag(buffer, &regbuf->rnode, &regbuf->forkno, &regbuf->block);
	regbuf->page = BufferGetPage(buffer);
	regbuf->flags = flags;
	// 3. Initialize regbuf's data chain.
	regbuf->rdata_tail = (XLogRecData *) &regbuf->rdata_head;
	regbuf->rdata_len = 0;

	/*
	 * Check that this page hasn't already been registered with some other
	 * block_id.
	 */
#ifdef USE_ASSERT_CHECKING
	// The code here is not important, so it is omitted.
#endif

	regbuf->in_use = true;
}

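The flow of this function is very similar to XLogRegisterData and has three main steps:

  1. Take a registered_buffer slot, regbuf, from the global registered_buffers array.
  2. Store the page information in regbuf.
  3. Initialize regbuf's data chain.

First, let's look at the registered_buffer struct: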

/*
 * For each block reference registered with XLogRegisterBuffer, we fill in
 * a registered_buffer struct.
 */
typedef struct
{
	bool		in_use;			/* is this slot in use? */
	uint8		flags;			/* REGBUF_* flags */
	RelFileNode rnode;			/* identifies the relation and block */
	ForkNumber	forkno;
	BlockNumber block;
	Page		page;			/* page content */
	uint32		rdata_len;		/* total length of data in rdata chain */
	XLogRecData *rdata_head;	/* head of the chain of data registered with
								 * this block */
	XLogRecData *rdata_tail;	/* last entry in the chain, or &rdata_head if
								 * empty */

	XLogRecData bkp_rdatas[2];	/* temporary rdatas used to hold references to
								 * backup block data in XLogRecordAssemble() */

	/* buffer to store a compressed version of backup block image */
	char		compressed_page[PGLZ_MAX_BLCKSZ];
} registered_buffer;

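The members most relevant to this article are:

  • rnode

    A RelFileNode struct, one of the components of the XLOG header.

  • block

    A BlockNumber, one of the components of the XLOG header.

  • rdata_head, rdata_tail

    The XLogRecData chain; the actual tuple data registered later is attached here.

As with the rdatas array used by XLogRegisterData, the registered_buffers array is allocated in advance in InitXLogInsert.
Once XLogRegisterBuffer has finished, the registration of RelFileNode and BlockNumber is complete.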

XLogRegisterBufData

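Finally, let's look at XLogRegisterBufData. It is used to register the xl_heap_header struct and the actual tuple data, which in effect registers one tuple. It is implemented as follows: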

/*
 * Add buffer-specific data to the WAL record that's being constructed.
 *
 * Block_id must reference a block previously registered with
 * XLogRegisterBuffer(). If this is called more than once for the same
 * block_id, the data is appended.
 *
 * The maximum amount of data that can be registered per block is 65535
 * bytes. That should be plenty; if you need more than BLCKSZ bytes to
 * reconstruct the changes to the page, you might as well just log a full
 * copy of it. (the "main data" that's not associated with a block is not
 * limited)
 */
void
XLogRegisterBufData(uint8 block_id, char *data, int len)
{
	registered_buffer *regbuf;
	XLogRecData *rdata;

	Assert(begininsert_called);

	/* find the registered buffer struct */
	regbuf = &registered_buffers[block_id];
	if (!regbuf->in_use)
		elog(ERROR, "no block with id %d registered with WAL insertion",
			 block_id);

	if (num_rdatas >= max_rdatas)
		elog(ERROR, "too much WAL data");
	rdata = &rdatas[num_rdatas++];

	rdata->data = data;
	rdata->len = len;

	regbuf->rdata_tail->next = rdata;
	regbuf->rdata_tail = rdata;
	regbuf->rdata_len += len;
}

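The function has three main steps:

  1. Take an XLogRecData slot, rdata, from the global rdatas array.
  2. Store the data being registered in rdata.
  3. Append rdata to the regbuf chain.

Summary

To summarize: the registration phase has built some pieces of the XLOG record and not others. The full record is:

XLogRecord + XLogRecordBlockHeader + RelFileNode + BlockNumber + mainrdata_len +

xl_heap_header + actual tuple data + xl_heap_insert

Built so far: RelFileNode, BlockNumber, xl_heap_header, the actual tuple data, and xl_heap_insert. Not yet built: XLogRecord, XLogRecordBlockHeader, and mainrdata_len. The built pieces, whose construction was covered above, are held in two places:

  • xl_heap_insert is in the mainrdata chain.
  • RelFileNode + BlockNumber + xl_heap_header + the actual tuple data are in regbuf (the last two on its rdata chain).

Next, XLogInsert is called and the functions that actually write the XLOG record begin to run.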

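Writing the XLOG Record

As the summary above shows, after the registration phase we already have most of the information, but some pieces are still missing. So when writing XLOG, we first have to produce the pieces not yet built (XLogRecord, XLogRecordBlockHeader, and mainrdata_len) and then write the data into the log buffer. This is done by XLogInsert, shown below: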

/*
 * Insert an XLOG record having the specified RMID and info bytes, with the
 * body of the record being the data and buffer references registered earlier
 * with XLogRegister* calls.
 *
 * Returns XLOG pointer to end of record (beginning of next record).
 * This can be used as LSN for data pages affected by the logged action.
 * (LSN is the XLOG point up to which the XLOG must be flushed to disk
 * before the data page can be written out.  This implements the basic
 * WAL rule "write the log before the data".)
 */
XLogRecPtr
XLogInsert(RmgrId rmid, uint8 info)
{
	XLogRecPtr	EndPos;

	/* XLogBeginInsert() must have been called. */
	if (!begininsert_called)
		elog(ERROR, "XLogBeginInsert was not called");

	/*
	 * The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
	 * reserved for use by me.
	 */
	if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
		elog(PANIC, "invalid xlog info mask %02X", info);

	TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);

	/*
	 * In bootstrap mode, we don't actually log anything but XLOG resources;
	 * return a phony record pointer.
	 */
	if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
	{
		XLogResetInsertion();
		EndPos = SizeOfXLogLongPHD;		/* start of 1st chkpt record */
		return EndPos;
	}

	do
	{
		XLogRecPtr	RedoRecPtr;
		bool		doPageWrites;
		XLogRecPtr	fpw_lsn;
		XLogRecData *rdt;

		/*
		 * Get values needed to decide whether to do full-page writes. Since
		 * we don't yet have an insertion lock, these could change under us,
		 * but XLogInsertRecord will recheck them once it has a lock.
		 */
		GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
		// 1. Build the pieces still missing (XLogRecord, XLogRecordBlockHeader, mainrdata_len).
		rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
								 &fpw_lsn);
		// 2. Write the data into the log buffer.
		EndPos = XLogInsertRecord(rdt, fpw_lsn);
	} while (EndPos == InvalidXLogRecPtr);

	XLogResetInsertion();

	return EndPos;
}

The key functions here are XLogRecordAssemble and XLogInsertRecord. Let's look at each implementation in turn:

XLogRecordAssemble

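XLogRecordAssemble produces the pieces that registration did not build: XLogRecord, XLogRecordBlockHeader, and mainrdata_len. It then assembles the four parts of the XLOG record (XLOG header + xl_heap_header + tuple data + xl_heap_insert) into a chain of XLogRecData nodes. XLogRecordAssemble is implemented as follows: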

/*
 * Assemble a WAL record from the registered data and buffers into an
 * XLogRecData chain, ready for insertion with XLogInsertRecord().
 *
 * The record header fields are filled in, except for the xl_prev field. The
 * calculated CRC does not include the record header yet.
 *
 * If there are any registered buffers, and a full-page image was not taken
 * of all of them, *fpw_lsn is set to the lowest LSN among such pages. This
 * signals that the assembled record is only good for insertion on the
 * assumption that the RedoRecPtr and doPageWrites values were up-to-date.
 */
static XLogRecData *
XLogRecordAssemble(RmgrId rmid, uint8 info,
				   XLogRecPtr RedoRecPtr, bool doPageWrites,
				   XLogRecPtr *fpw_lsn)
{
	XLogRecData *rdt;
	uint32		total_len = 0;
	int			block_id;
	pg_crc32c	rdata_crc;
	registered_buffer *prev_regbuf = NULL;
	XLogRecData *rdt_datas_last;
	XLogRecord *rechdr;
	char	   *scratch = hdr_scratch;

	/*
	 * Note: this function can be called multiple times for the same record.
	 * All the modifications we do to the rdata chains below must handle that.
	 */

	/* The record begins with the fixed-size header */
	rechdr = (XLogRecord *) scratch;
	scratch += SizeOfXLogRecord;

	hdr_rdt.next = NULL;
	rdt_datas_last = &hdr_rdt;
	hdr_rdt.data = hdr_scratch;

	/*
	 * Make an rdata chain containing all the data portions of all block
	 * references. This includes the data for full-page images. Also append
	 * the headers for the block references in the scratch buffer.
	 */
	*fpw_lsn = InvalidXLogRecPtr;
	for (block_id = 0; block_id < max_registered_block_id; block_id++)
	{
		registered_buffer *regbuf = &registered_buffers[block_id];
		bool		needs_backup;
		bool		needs_data;
		XLogRecordBlockHeader bkpb;
		XLogRecordBlockImageHeader bimg;
		XLogRecordBlockCompressHeader cbimg = {0};
		bool		samerel;
		bool		is_compressed = false;

		if (!regbuf->in_use)
			continue;

		/* Determine if this block needs to be backed up */
		if (regbuf->flags & REGBUF_FORCE_IMAGE)
			needs_backup = true;
		else if (regbuf->flags & REGBUF_NO_IMAGE)
			needs_backup = false;
		else if (!doPageWrites)
			needs_backup = false;
		else
		{
			/*
			 * We assume page LSN is first data on *every* page that can be
			 * passed to XLogInsert, whether it has the standard page layout
			 * or not.
			 */
			XLogRecPtr	page_lsn = PageGetLSN(regbuf->page);

			needs_backup = (page_lsn <= RedoRecPtr);
			if (!needs_backup)
			{
				if (*fpw_lsn == InvalidXLogRecPtr || page_lsn < *fpw_lsn)
					*fpw_lsn = page_lsn;
			}
		}

		/* Determine if the buffer data needs to included */
		if (regbuf->rdata_len == 0)
			needs_data = false;
		else if ((regbuf->flags & REGBUF_KEEP_DATA) != 0)
			needs_data = true;
		else
			needs_data = !needs_backup;

		bkpb.id = block_id;
		bkpb.fork_flags = regbuf->forkno;
		bkpb.data_length = 0;

		if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
			bkpb.fork_flags |= BKPBLOCK_WILL_INIT;

		if (needs_backup)
		{
			Page		page = regbuf->page;
			uint16		compressed_len;

			/*
			 * The page needs to be backed up, so calculate its hole length
			 * and offset.
			 */
			if (regbuf->flags & REGBUF_STANDARD)
			{
				/* Assume we can omit data between pd_lower and pd_upper */
				uint16		lower = ((PageHeader) page)->pd_lower;
				uint16		upper = ((PageHeader) page)->pd_upper;

				if (lower >= SizeOfPageHeaderData &&
					upper > lower &&
					upper <= BLCKSZ)
				{
					bimg.hole_offset = lower;
					cbimg.hole_length = upper - lower;
				}
				else
				{
					/* No "hole" to compress out */
					bimg.hole_offset = 0;
					cbimg.hole_length = 0;
				}
			}
			else
			{
				/* Not a standard page header, don't try to eliminate "hole" */
				bimg.hole_offset = 0;
				cbimg.hole_length = 0;
			}

			/*
			 * Try to compress a block image if wal_compression is enabled
			 */
			if (wal_compression)
			{
				is_compressed =
					XLogCompressBackupBlock(page, bimg.hole_offset,
											cbimg.hole_length,
											regbuf->compressed_page,
											&compressed_len);
			}

			/*
			 * Fill in the remaining fields in the XLogRecordBlockHeader
			 * struct
			 */
			bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;

			/*
			 * Construct XLogRecData entries for the page content.
			 */
			rdt_datas_last->next = &regbuf->bkp_rdatas[0];
			rdt_datas_last = rdt_datas_last->next;

			bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;

			if (is_compressed)
			{
				bimg.length = compressed_len;
				bimg.bimg_info |= BKPIMAGE_IS_COMPRESSED;

				rdt_datas_last->data = regbuf->compressed_page;
				rdt_datas_last->len = compressed_len;
			}
			else
			{
				bimg.length = BLCKSZ - cbimg.hole_length;

				if (cbimg.hole_length == 0)
				{
					rdt_datas_last->data = page;
					rdt_datas_last->len = BLCKSZ;
				}
				else
				{
					/* must skip the hole */
					rdt_datas_last->data = page;
					rdt_datas_last->len = bimg.hole_offset;

					rdt_datas_last->next = &regbuf->bkp_rdatas[1];
					rdt_datas_last = rdt_datas_last->next;

					rdt_datas_last->data =
						page + (bimg.hole_offset + cbimg.hole_length);
					rdt_datas_last->len =
						BLCKSZ - (bimg.hole_offset + cbimg.hole_length);
				}
			}

			total_len += bimg.length;
		}

		if (needs_data)
		{
			/*
			 * Link the caller-supplied rdata chain for this buffer to the
			 * overall list.
			 */
			bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
			bkpb.data_length = regbuf->rdata_len;
			total_len += regbuf->rdata_len;

			rdt_datas_last->next = regbuf->rdata_head;
			rdt_datas_last = regbuf->rdata_tail;
		}

		if (prev_regbuf && RelFileNodeEquals(regbuf->rnode, prev_regbuf->rnode))
		{
			samerel = true;
			bkpb.fork_flags |= BKPBLOCK_SAME_REL;
		}
		else
			samerel = false;
		prev_regbuf = regbuf;

		/* Ok, copy the header to the scratch buffer */
		memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
		scratch += SizeOfXLogRecordBlockHeader;
		if (needs_backup)
		{
			memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
			scratch += SizeOfXLogRecordBlockImageHeader;
			if (cbimg.hole_length != 0 && is_compressed)
			{
				memcpy(scratch, &cbimg,
					   SizeOfXLogRecordBlockCompressHeader);
				scratch += SizeOfXLogRecordBlockCompressHeader;
			}
		}
		if (!samerel)
		{
			memcpy(scratch, &regbuf->rnode, sizeof(RelFileNode));
			scratch += sizeof(RelFileNode);
		}
		memcpy(scratch, &regbuf->block, sizeof(BlockNumber));
		scratch += sizeof(BlockNumber);
	}

	/* followed by the record's origin, if any */
	if (include_origin && replorigin_session_origin != InvalidRepOriginId)
	{
		*(scratch++) = (char) XLR_BLOCK_ID_ORIGIN;
		memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
		scratch += sizeof(replorigin_session_origin);
	}

	/* followed by main data, if any */
	if (mainrdata_len > 0)
	{
		if (mainrdata_len > 255)
		{
			*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
			memcpy(scratch, &mainrdata_len, sizeof(uint32));
			scratch += sizeof(uint32);
		}
		else
		{
			*(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
			*(scratch++) = (uint8) mainrdata_len;
		}
		rdt_datas_last->next = mainrdata_head;
		rdt_datas_last = mainrdata_last;
		total_len += mainrdata_len;
	}
	rdt_datas_last->next = NULL;

	hdr_rdt.len = (scratch - hdr_scratch);
	total_len += hdr_rdt.len;

	/*
	 * Calculate CRC of the data
	 *
	 * Note that the record header isn't added into the CRC initially since we
	 * don't know the prev-link yet.  Thus, the CRC will represent the CRC of
	 * the whole record in the order: rdata, then backup blocks, then record
	 * header.
	 */
	INIT_CRC32C(rdata_crc);
	COMP_CRC32C(rdata_crc, hdr_scratch + SizeOfXLogRecord, hdr_rdt.len - SizeOfXLogRecord);
	for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
		COMP_CRC32C(rdata_crc, rdt->data, rdt->len);

	/*
	 * Fill in the fields in the record header. Prev-link is filled in later,
	 * once we know where in the WAL the record will be inserted. The CRC does
	 * not include the record header yet.
	 */
	rechdr->xl_xid = GetCurrentTransactionIdIfAny();
	rechdr->xl_tot_len = total_len;
	rechdr->xl_info = info;
	rechdr->xl_rmid = rmid;
	rechdr->xl_prev = InvalidXLogRecPtr;
	rechdr->xl_crc = rdata_crc;

	return &hdr_rdt;
}

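This code is fairly long, so let's go through it one question at a time. As stated above, XLogRecordAssemble produces a chain containing the four parts of the XLOG record, and the registration phase has already supplied the three parts other than the XLOG header. For XLogRecordAssemble we therefore need to answer the following two questions.

How the XLOG Header Is Built

Let's start with one key line of code:

char	   *scratch = hdr_scratch;

What is hdr_scratch?

static char *hdr_scratch = NULL;

hdr_scratch is a global buffer. From the detailed description of the XLOG header given earlier, we know that the header's length is not fixed. Again to avoid frequent memory allocation and freeing, PostgreSQL allocates space for the XLOG header in advance in InitXLogInsert, sized for the longest possible header.

#define HEADER_SCRATCH_SIZE \
	(SizeOfXLogRecord + \
	 MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
	 SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)

So, to understand how the XLOG header is built, we need to follow what XLogRecordAssemble does with the scratch pointer.

  • Building XLogRecord

    rechdr = (XLogRecord *) scratch;
    scratch += SizeOfXLogRecord;
    // intermediate code omitted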
    rechdr->xl_xid = GetCurrentTransactionIdIfAny();
    rechdr->xl_tot_len = total_len;
    rechdr->xl_info = info;
    rechdr->xl_rmid = rmid;
    rechdr->xl_prev = InvalidXLogRecPtr;
    rechdr->xl_crc = rdata_crc;
    
  • Building XLogRecordBlockHeader

    XLogRecordBlockHeader bkpb;
    bkpb.id = block_id;
    bkpb.fork_flags = regbuf->forkno;
    bkpb.data_length = 0;
    
    if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
        bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
    
    memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
    scratch += SizeOfXLogRecordBlockHeader;
    
  • Building RelFileNode

    if (!samerel)
    {
    	memcpy(scratch, &regbuf->rnode, sizeof(RelFileNode));
    	scratch += sizeof(RelFileNode);
    }
    

    Here we can see that the RelFileNode data is taken from regbuf->rnode, which was registered earlier.

  • Building BlockNumber

    memcpy(scratch, &regbuf->block, sizeof(BlockNumber));
    scratch += sizeof(BlockNumber);
    
  • Building mainrdata_len

    if (mainrdata_len > 0)
    {
    	if (mainrdata_len > 255)
    	{
    		*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
    		memcpy(scratch, &mainrdata_len, sizeof(uint32));
    		scratch += sizeof(uint32);
    	}
    	else
    	{
    		*(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
    		*(scratch++) = (uint8) mainrdata_len;
    	}
    	// Link mainrdata into the XLogRecData chain.
    	rdt_datas_last->next = mainrdata_head;
    	rdt_datas_last = mainrdata_last;
    	total_len += mainrdata_len;
    }
    

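    Two points to note here:

    • The XLOG header records only the length of mainrdata, which in our example is the length of xl_heap_insert.
    • Besides building mainrdata_len, the code above also links mainrdata, that is xl_heap_insert, into the XLogRecData chain. This is the question we turn to next.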
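The short and long forms of the main-data header can be tried out on their own. The sketch below follows the branch structure of the code quoted above; encode_main_data_header is a hypothetical function name, and the numeric id values 255 and 254 are what we recall from xlogrecord.h, so treat them as assumptions rather than something this article states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed id values; check xlogrecord.h for the authoritative definitions. */
#define XLR_BLOCK_ID_DATA_SHORT	255
#define XLR_BLOCK_ID_DATA_LONG	254

/*
 * Write the main-data header for a payload of mainrdata_len bytes into
 * scratch and return the number of header bytes written (0, 2, or 5).
 */
static size_t
encode_main_data_header(uint8_t *scratch, uint32_t mainrdata_len)
{
	uint8_t    *p = scratch;

	if (mainrdata_len == 0)
		return 0;				/* no main data, no header */

	if (mainrdata_len > 255)
	{
		/* long form: id byte followed by a 4-byte length */
		*p++ = XLR_BLOCK_ID_DATA_LONG;
		memcpy(p, &mainrdata_len, sizeof(uint32_t));
		p += sizeof(uint32_t);
	}
	else
	{
		/* short form: id byte followed by a 1-byte length */
		*p++ = XLR_BLOCK_ID_DATA_SHORT;
		*p++ = (uint8_t) mainrdata_len;
	}
	return (size_t) (p - scratch);
}
```

For the insert in this article the main data is the 3-byte xl_heap_insert, so the short form applies and the header takes 2 bytes, which is the final "+ 2" in the 46-byte header total.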
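
How the Four Parts of the XLOG Record Are Chained

Next, let's look at how the four parts of the XLOG record are chained together. XLogRecordAssemble ultimately returns hdr_rdt.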

return &hdr_rdt;

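So we need to follow how XLogRecordAssemble operates on hdr_rdt.

hdr_rdt.next = NULL;			// initialize the next pointer
rdt_datas_last = &hdr_rdt;		// point at the head of the chain

hdr_rdt serves as the head of the chain, so rdt_datas_last is pointed at it here.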

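  • Adding the XLOG header to the chain

    hdr_rdt.data = hdr_scratch;
    // intermediate code omitted
    hdr_rdt.len = (scratch - hdr_scratch);
    

    hdr_rdt is the head of the chain, so the XLOG header buffer is assigned directly to its data field; once the header has been built, its length is computed.

  • Adding xl_heap_header and the tuple data to the chain

    From the registration phase we know that xl_heap_header and the tuple data are both stored on regbuf's XLogRecData chain, with xl_heap_header first and the tuple data after it (xl_heap_header was registered first). So it is enough to append the head of regbuf's chain after hdr_rdt.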

    if (needs_data)
    {
        /*
    	 * Link the caller-supplied rdata chain for this buffer to the
    	 * overall list.
    	 */
        bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
        bkpb.data_length = regbuf->rdata_len;
        total_len += regbuf->rdata_len;
    
        // link the chains
        rdt_datas_last->next = regbuf->rdata_head;
        rdt_datas_last = regbuf->rdata_tail;
    }
    
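  • Adding xl_heap_insert to the chain

    This was already covered when building mainrdata_len, so it is not repeated here.

Summary

After XLogRecordAssemble, we have a chain that holds the four parts we want to write to XLOG.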
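To see what such a chain amounts to, the sketch below walks a chain of nodes, sums their lengths, and copies the chunks in order into one contiguous buffer. It is a simplified illustration: ChainNode, chain_total_len, and chain_flatten are invented names, and the real copy into the log buffer also has to handle WAL page boundaries, which are ignored here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for XLogRecData. */
typedef struct ChainNode
{
	struct ChainNode *next;
	const char *data;
	uint32_t	len;
} ChainNode;

/* Sum of all chunk lengths; corresponds to the record's total length. */
static uint32_t
chain_total_len(const ChainNode *head)
{
	uint32_t	total = 0;
	const ChainNode *n;

	for (n = head; n != NULL; n = n->next)
		total += n->len;
	return total;
}

/*
 * Copy every chunk, in chain order, into out.  Returns the number of bytes
 * written, or 0 if the chain does not fit into cap bytes.
 */
static size_t
chain_flatten(const ChainNode *head, char *out, size_t cap)
{
	size_t		used = 0;
	const ChainNode *n;

	for (n = head; n != NULL; n = n->next)
	{
		if (n->len > cap - used)
			return 0;
		memcpy(out + used, n->data, n->len);
		used += n->len;
	}
	return used;
}
```

With four nodes standing for the XLOG header, xl_heap_header, the tuple data, and xl_heap_insert, the flattened bytes come out in exactly that order, which is the record layout described in this article.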

XLogInsertRecord

Finally, let's look at XLogInsertRecord, the function that actually writes the XLOG record into the log buffer.

/*
 * Insert an XLOG record represented by an already-constructed chain of data
 * chunks.  This is a low-level routine; to construct the WAL record header
 * and data, use the higher-level routines in xloginsert.c.
 *
 * If 'fpw_lsn' is valid, it is the oldest LSN among the pages that this
 * WAL record applies to, that were not included in the record as full page
 * images.  If fpw_lsn <= RedoRecPtr, the function does not perform the
 * insertion and returns InvalidXLogRecPtr.  The caller can then recalculate
 * which pages need a full-page image, and retry.  If fpw_lsn is invalid, the
 * record is always inserted.
 *
 * The first XLogRecData in the chain must be for the record header, and its
 * data must be MAXALIGNed.  XLogInsertRecord fills in the xl_prev and
 * xl_crc fields in the header, the rest of the header must already be filled
 * by the caller.
 *
 * Returns XLOG pointer to end of record (beginning of next record).
 * This can be used as LSN for data pages affected by the logged action.
 * (LSN is the XLOG point up to which the XLOG must be flushed to disk
 * before the data page can be written out.  This implements the basic
 * WAL rule "write the log before the data".)
 */
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
{
	XLogCtlInsert *Insert = &XLogCtl->Insert;
	pg_crc32c	rdata_crc;
	bool		inserted;
	XLogRecord *rechdr = (XLogRecord *) rdata->data;
	bool		isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
							   rechdr->xl_info == XLOG_SWITCH);
	XLogRecPtr	StartPos;
	XLogRecPtr	EndPos;

	/* we assume that all of the record header is in the first chunk */
	Assert(rdata->len >= SizeOfXLogRecord);

	/* cross-check on whether we should be here or not */
	if (!XLogInsertAllowed())
		elog(ERROR, "cannot make new WAL entries during recovery");

	/*----------
	 *
	 * We have now done all the preparatory work we can without holding a
	 * lock or modifying shared state. From here on, inserting the new WAL
	 * record to the shared WAL buffer cache is a two-step process:
	 *
	 * 1. Reserve the right amount of space from the WAL. The current head of
	 *	  reserved space is kept in Insert->CurrBytePos, and is protected by
	 *	  insertpos_lck.
	 *
	 * 2. Copy the record to the reserved WAL space. This involves finding the
	 *	  correct WAL buffer containing the reserved space, and copying the
	 *	  record in place. This can be done concurrently in multiple processes.
	 *
	 * To keep track of which insertions are still in-progress, each concurrent
	 * inserter acquires an insertion lock. In addition to just indicating that
	 * an insertion is in progress, the lock tells others how far the inserter
	 * has progressed. There is a small fixed number of insertion locks,
	 * determined by NUM_XLOGINSERT_LOCKS. When an inserter crosses a page
	 * boundary, it updates the value stored in the lock to the how far it has
	 * inserted, to allow the previous buffer to be flushed.
	 *
	 * Holding onto an insertion lock also protects RedoRecPtr and
	 * fullPageWrites from changing until the insertion is finished.
	 *
	 * Step 2 can usually be done completely in parallel. If the required WAL
	 * page is not initialized yet, you have to grab WALBufMappingLock to
	 * initialize it, but the WAL writer tries to do that ahead of insertions
	 * to avoid that from happening in the critical path.
	 *
	 *----------
	 */
	START_CRIT_SECTION();
	if (isLogSwitch)
		WALInsertLockAcquireExclusive();
	else
		WALInsertLockAcquire();

	/*
	 * Check to see if my copy of RedoRecPtr or doPageWrites is out of date.
	 * If so, may have to go back and have the caller recompute everything.
	 * This can only happen just after a checkpoint, so it's better to be slow
	 * in this case and fast otherwise.
	 *
	 * If we aren't doing full-page writes then RedoRecPtr doesn't actually
	 * affect the contents of the XLOG record, so we'll update our local copy
	 * but not force a recomputation.  (If doPageWrites was just turned off,
	 * we could recompute the record without full pages, but we choose not to
	 * bother.)
	 */
	if (RedoRecPtr != Insert->RedoRecPtr)
	{
		Assert(RedoRecPtr < Insert->RedoRecPtr);
		RedoRecPtr = Insert->RedoRecPtr;
	}
	doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);

	if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
	{
		/*
		 * Oops, some buffer now needs to be backed up that the caller didn't
		 * back up.  Start over.
		 */
		WALInsertLockRelease();
		END_CRIT_SECTION();
		return InvalidXLogRecPtr;
	}

	/*
	 * Reserve space for the record in the WAL. This also sets the xl_prev
	 * pointer.
	 */
	if (isLogSwitch)
		inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
	else
	{
		ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
								  &rechdr->xl_prev);
		inserted = true;
	}

	if (inserted)
	{
		/*
		 * Now that xl_prev has been filled in, calculate CRC of the record
		 * header.
		 */
		rdata_crc = rechdr->xl_crc;
		COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
		FIN_CRC32C(rdata_crc);
		rechdr->xl_crc = rdata_crc;

		/*
		 * All the record data, including the header, is now ready to be
		 * inserted. Copy the record in the space reserved.
		 */
		CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
							StartPos, EndPos);
	}
	else
	{
		/*
		 * This was an xlog-switch record, but the current insert location was
		 * already exactly at the beginning of a segment, so there was no need
		 * to do anything.
		 */
	}

	/*
	 * Done! Let others know that we're finished.
	 */
	WALInsertLockRelease();

	MarkCurrentTransactionIdLoggedIfAny();

	END_CRIT_SECTION();

	/*
	 * Update shared LogwrtRqst.Write, if we crossed page boundary.
	 */
	if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
	{
		SpinLockAcquire(&XLogCtl->info_lck);
		/* advance global request to include new block(s) */
		if (XLogCtl->LogwrtRqst.Write < EndPos)
			XLogCtl->LogwrtRqst.Write = EndPos;
		/* update local result copy while I have the chance */
		LogwrtResult = XLogCtl->LogwrtResult;
		SpinLockRelease(&XLogCtl->info_lck);
	}

	/*
	 * If this was an XLOG_SWITCH record, flush the record and the empty
	 * padding space that fills the rest of the segment, and perform
	 * end-of-segment actions (eg, notifying archiver).
	 */
	if (isLogSwitch)
	{
		TRACE_POSTGRESQL_XLOG_SWITCH();
		XLogFlush(EndPos);

		/*
		 * Even though we reserved the rest of the segment for us, which is
		 * reflected in EndPos, we return a pointer to just the end of the
		 * xlog-switch record.
		 */
		if (inserted)
		{
			EndPos = StartPos + SizeOfXLogRecord;
			if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
			{
				if (EndPos % XLOG_SEG_SIZE == EndPos % XLOG_BLCKSZ)
					EndPos += SizeOfXLogLongPHD;
				else
					EndPos += SizeOfXLogShortPHD;
			}
		}
	}

#ifdef WAL_DEBUG
	if (XLOG_DEBUG)
	{
		static XLogReaderState *debug_reader = NULL;
		StringInfoData buf;
		StringInfoData recordBuf;
		char	   *errormsg = NULL;
		MemoryContext oldCxt;

		oldCxt = MemoryContextSwitchTo(walDebugCxt);

		initStringInfo(&buf);
		appendStringInfo(&buf, "INSERT @ %X/%X: ",
						 (uint32) (EndPos >> 32), (uint32) EndPos);

		/*
		 * We have to piece together the WAL record data from the XLogRecData
		 * entries, so that we can pass it to the rm_desc function as one
		 * contiguous chunk.
		 */
		initStringInfo(&recordBuf);
		for (; rdata != NULL; rdata = rdata->next)
			appendBinaryStringInfo(&recordBuf, rdata->data, rdata->len);

		if (!debug_reader)
			debug_reader = XLogReaderAllocate(NULL, NULL);

		if (!debug_reader)
		{
			appendStringInfoString(&buf, "error decoding record: out of memory");
		}
		else if (!DecodeXLogRecord(debug_reader, (XLogRecord *) recordBuf.data,
								   &errormsg))
		{
			appendStringInfo(&buf, "error decoding record: %s",
							 errormsg ? errormsg : "no error message");
		}
		else
		{
			appendStringInfoString(&buf, " - ");
			xlog_outdesc(&buf, debug_reader);
		}
		elog(LOG, "%s", buf.data);

		pfree(buf.data);
		pfree(recordBuf.data);
		MemoryContextSwitchTo(oldCxt);
	}
#endif

	/*
	 * Update our global variables
	 */
	ProcLastRecPtr = StartPos;
	XactLastRecEnd = EndPos;

	return EndPos;
}

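Two calls in this function matter most:

  • ReserveXLogInsertLocation

    Reserves space in the log buffer for the XLOG record that is about to be written.

  • CopyXLogRecordToWAL

    Writes rdata, that is, the contents of the XLogRecData list, into the log buffer.

Let's look at each of them in turn: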

ReserveXLogInsertLocation
/*
 * Reserves the right amount of space for a record of given size from the WAL.
 * *StartPos is set to the beginning of the reserved section, *EndPos to
 * its end+1. *PrevPtr is set to the beginning of the previous record; it is
 * used to set the xl_prev of this record.
 *
 * This is the performance critical part of XLogInsert that must be serialized
 * across backends. The rest can happen mostly in parallel. Try to keep this
 * section as short as possible, insertpos_lck can be heavily contended on a
 * busy system.
 *
 * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
 * where we actually copy the record to the reserved space.
 */
static void
ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
						  XLogRecPtr *PrevPtr)
{
	XLogCtlInsert *Insert = &XLogCtl->Insert;
	uint64		startbytepos;
	uint64		endbytepos;
	uint64		prevbytepos;

	size = MAXALIGN(size);

	/* All (non xlog-switch) records should contain data. */
	Assert(size > SizeOfXLogRecord);

	/*
	 * The duration the spinlock needs to be held is minimized by minimizing
	 * the calculations that have to be done while holding the lock. The
	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
	 * page headers. The mapping between "usable" byte positions and physical
	 * positions (XLogRecPtrs) can be done outside the locked region, and
	 * because the usable byte position doesn't include any headers, reserving
	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
	 */
    // acquire the lock
	SpinLockAcquire(&Insert->insertpos_lck);

    // reserve the space
	startbytepos = Insert->CurrBytePos;
	endbytepos = startbytepos + size;
	prevbytepos = Insert->PrevBytePos;
	Insert->CurrBytePos = endbytepos;
	Insert->PrevBytePos = startbytepos;

    // release the lock
	SpinLockRelease(&Insert->insertpos_lck);

    // return the write positions
	*StartPos = XLogBytePosToRecPtr(startbytepos);
	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);

	/*
	 * Check that the conversions between "usable byte positions" and
	 * XLogRecPtrs work consistently in both directions.
	 */
	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
	Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
}

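The core of this function is the section from SpinLockAcquire through the three assignments to *StartPos, *EndPos and *PrevPtr. Insert points to the XLogCtlInsert structure inside XLogCtl, which is shared by all backends. Its CurrBytePos and PrevBytePos fields control where writes into the log buffer go: CurrBytePos is the end of the space reserved so far, and PrevBytePos is the start of the most recently reserved record. After reserving the space, ReserveXLogInsertLocation returns the position at which the XLOG record is to be written.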

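Note:

XLogBytePosToRecPtr is a very neat piece of design, and its details are left to XLOG 2.0. For now it is enough to know that it converts a usable byte position into an XLOG write position.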

CopyXLogRecordToWAL

Once space has been reserved with ReserveXLogInsertLocation, CopyXLogRecordToWAL can be called to do the actual write. Its code is as follows:

/*
 * Subroutine of XLogInsertRecord.  Copies a WAL record to an already-reserved
 * area in the WAL.
 */
static void
CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
					XLogRecPtr StartPos, XLogRecPtr EndPos)
{
	char	   *currpos;
	int			freespace;
	int			written;
	XLogRecPtr	CurrPos;
	XLogPageHeader pagehdr;

	/*
	 * Get a pointer to the right place in the right WAL buffer to start
	 * inserting to.
	 */
	CurrPos = StartPos;
	currpos = GetXLogBuffer(CurrPos);
	freespace = INSERT_FREESPACE(CurrPos);

	/*
	 * there should be enough space for at least the first field (xl_tot_len)
	 * on this page.
	 */
	Assert(freespace >= sizeof(uint32));

	/* Copy record data */
	written = 0;
	while (rdata != NULL)
	{
		char	   *rdata_data = rdata->data;
		int			rdata_len = rdata->len;

		while (rdata_len > freespace)
		{
			/*
			 * Write what fits on this page, and continue on the next page.
			 */
			Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || freespace == 0);
			memcpy(currpos, rdata_data, freespace);
			rdata_data += freespace;
			rdata_len -= freespace;
			written += freespace;
			CurrPos += freespace;

			/*
			 * Get pointer to beginning of next page, and set the xlp_rem_len
			 * in the page header. Set XLP_FIRST_IS_CONTRECORD.
			 *
			 * It's safe to set the contrecord flag and xlp_rem_len without a
			 * lock on the page. All the other flags were already set when the
			 * page was initialized, in AdvanceXLInsertBuffer, and we're the
			 * only backend that needs to set the contrecord flag.
			 */
			currpos = GetXLogBuffer(CurrPos);
			pagehdr = (XLogPageHeader) currpos;
			pagehdr->xlp_rem_len = write_len - written;
			pagehdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;

			/* skip over the page header */
			if (CurrPos % XLogSegSize == 0)
			{
				CurrPos += SizeOfXLogLongPHD;
				currpos += SizeOfXLogLongPHD;
			}
			else
			{
				CurrPos += SizeOfXLogShortPHD;
				currpos += SizeOfXLogShortPHD;
			}
			freespace = INSERT_FREESPACE(CurrPos);
		}

		Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
		memcpy(currpos, rdata_data, rdata_len);
		currpos += rdata_len;
		CurrPos += rdata_len;
		freespace -= rdata_len;
		written += rdata_len;

		rdata = rdata->next;
	}
	Assert(written == write_len);

	/*
	 * If this was an xlog-switch, it's not enough to write the switch record,
	 * we also have to consume all the remaining space in the WAL segment. We
	 * have already reserved it for us, but we still need to make sure it's
	 * allocated and zeroed in the WAL buffers so that when the caller (or
	 * someone else) does XLogWrite(), it can really write out all the zeros.
	 */
	if (isLogSwitch && CurrPos % XLOG_SEG_SIZE != 0)
	{
		/* An xlog-switch record doesn't contain any data besides the header */
		Assert(write_len == SizeOfXLogRecord);

		/*
		 * We do this one page at a time, to make sure we don't deadlock
		 * against ourselves if wal_buffers < XLOG_SEG_SIZE.
		 */
		Assert(EndPos % XLogSegSize == 0);

		/* Use up all the remaining space on the first page */
		CurrPos += freespace;

		while (CurrPos < EndPos)
		{
			/* initialize the next page (if not initialized already) */
			WALInsertLockUpdateInsertingAt(CurrPos);
			AdvanceXLInsertBuffer(CurrPos, false);
			CurrPos += XLOG_BLCKSZ;
		}
	}
	else
	{
		/* Align the end position, so that the next record starts aligned */
		CurrPos = MAXALIGN64(CurrPos);
	}

	if (CurrPos != EndPos)
		elog(PANIC, "space reserved for WAL record does not match what was written");
}

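Let's first go through the function's important parameters:

  • write_len

    The total length of the XLOG record. It is used to fill in xlp_rem_len when the record continues onto another page, and for the final consistency check.

  • rdata

    The XLogRecData list, which holds the four parts of the XLOG record.

  • StartPos

    The position at which the XLOG record is written.

  • EndPos

    The end position of the XLOG record, used for the final consistency check.

The core of the function is the outer while loop, which walks the rdata list and copies each chunk to the position tracked by CurrPos. The main part of that loop is: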

while (rdata != NULL)
{
    char	   *rdata_data = rdata->data;
    int			rdata_len = rdata->len;

    while (rdata_len > freespace)
    {
        /*
		 * Write what fits on this page, and continue on the next page.
		 * (omitted)
		 */
    }

    Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
    memcpy(currpos, rdata_data, rdata_len);
    currpos += rdata_len;
    CurrPos += rdata_len;
    freespace -= rdata_len;
    written += rdata_len;

    rdata = rdata->next;
}

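The inner while loop handles the case where the XLOG data still to be written is longer than the free space on the current page of the log buffer. In that case, the part that fits is written to the current page first, and writing then continues on the next page.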
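The page-crossing copy can be tried out on a toy buffer. The sketch below rests on several assumptions: the page and header sizes are made-up small values, sketch_freespace and sketch_copy are hypothetical helpers, every page uses the same header size, and the header fields (xlp_rem_len, XLP_FIRST_IS_CONTRECORD) are not filled in. It copies a single chunk; a chain would be handled by calling it once per chunk and carrying the returned position forward.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy sizes chosen for the sketch; real WAL pages are XLOG_BLCKSZ bytes. */
#define SKETCH_PAGE_SIZE 32
#define SKETCH_PAGE_HDR   8

/* Bytes left on the page containing pos (0 when pos is on a page boundary). */
static size_t
sketch_freespace(size_t pos)
{
	size_t		off = pos % SKETCH_PAGE_SIZE;

	return off == 0 ? 0 : SKETCH_PAGE_SIZE - off;
}

/*
 * Hypothetical helper: copy len bytes into buf starting at physical offset
 * pos, skipping the header area at the start of every page that is entered.
 * Returns the offset just past the last byte written.  The caller must make
 * sure buf is large enough.
 */
static size_t
sketch_copy(char *buf, size_t pos, const char *data, size_t len)
{
	size_t		freespace = sketch_freespace(pos);

	assert(buf != NULL && data != NULL);

	while (len > freespace)
	{
		/* write what fits on this page */
		memcpy(buf + pos, data, freespace);
		data += freespace;
		len -= freespace;
		pos += freespace;

		/* pos is now on a page boundary: skip that page's header */
		pos += SKETCH_PAGE_HDR;
		freespace = sketch_freespace(pos);
	}

	/* the remainder fits on the current page */
	memcpy(buf + pos, data, len);
	return pos + len;
}
```

With these sizes, copying 30 bytes starting at offset 8 fills the 24 free bytes of the first page, skips the 8-byte header of the second page, and places the remaining 6 bytes at offsets 40 to 45.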

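Closing remarks

This completes the main flow by which an insert writes its XLOG record. Along the way we set aside some issues, such as how partial writes are handled, and skipped over some interesting details, such as how XLogBytePosToRecPtr is implemented. These will be covered in XLOG 2.0, the follow-up article in this PostgreSQL restart recovery series.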
