XLOG 1.0
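References
https://zhmin.github.io/2019/11/05/postgresql-wal-format/
Prerequisites
Overview
As noted in the previous article, XLOG is PostgreSQL's redo log: during crash recovery it is replayed to restore changes made by committed transactions that had not yet reached disk. XLOG is a physiological log, also called a page-oriented log, meaning that it records changes to database pages. Both terms are abstract when discussed in isolation, so they are explained in detail while walking through the XLOG write path. This article answers the following questions:
- What does an insert operation write to XLOG?
- How is the data in an XLOG record organized?
- What is a page-oriented log?
- How is data written to XLOG?
Note that writing XLOG involves two important steps:
- Writing the XLOG record into the log buffer
- Flushing the XLOG records in the log buffer to disk
The essence of WAL is that XLOG reaches disk before the data it describes, and a transaction can commit only after all of its XLOG records have been flushed. When and how XLOG is flushed therefore matters a great deal. Flushing is relatively independent, however, and has little bearing on the questions above, so this article focuses on how XLOG is written into the log buffer. Flushing is covered in a separate article.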
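The WAL rule stated above can be written down as two comparisons on log positions. The following is a minimal sketch, not PostgreSQL code: the type Lsn and both function names are hypothetical, and flushed_lsn is assumed to mean "all log up to this position is known to be on disk".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t Lsn;   /* hypothetical stand-in for a log position */

/* A dirty page may be written out only after the log describing its
 * latest change (up to page_lsn) is already on disk. */
bool
page_may_be_written(Lsn page_lsn, Lsn flushed_lsn)
{
    return page_lsn <= flushed_lsn;
}

/* A commit may be acknowledged only after the log up to the end of the
 * transaction's commit record is on disk. */
bool
commit_may_be_acknowledged(Lsn commit_lsn, Lsn flushed_lsn)
{
    return commit_lsn <= flushed_lsn;
}
```

Both checks have the same shape, which is why the rest of this article can treat "get the record into the log buffer and learn its position" as the only thing an inserter has to do.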
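Writing XLOG
When a row is inserted into PostgreSQL, heap_insert is called. heap_insert first calls RelationPutHeapTuple to place the tuple on the page and then writes XLOG. The redo-log-related code in the insert path is listed below:
// Source: heapam.c, lines 2442-2516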
/* XLOG stuff */
if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
{
xl_heap_insert xlrec;
xl_heap_header xlhdr;
XLogRecPtr recptr;
Page page = BufferGetPage(buffer);
uint8 info = XLOG_HEAP_INSERT;
int bufflags = 0;
/*
* If this is a catalog, we need to transmit combocids to properly
* decode, so log that as well.
*/
if (RelationIsAccessibleInLogicalDecoding(relation))
log_heap_new_cid(relation, heaptup);
/*
* If this is the single and first tuple on page, we can reinit the
* page instead of restoring the whole thing. Set flag, and hide
* buffer references from XLogInsert.
*/
if (ItemPointerGetOffsetNumber(&(heaptup->t_self)) == FirstOffsetNumber &&
PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
{
info |= XLOG_HEAP_INIT_PAGE;
bufflags |= REGBUF_WILL_INIT;
}
xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
xlrec.flags = 0;
if (all_visible_cleared)
xlrec.flags |= XLH_INSERT_ALL_VISIBLE_CLEARED;
if (options & HEAP_INSERT_SPECULATIVE)
xlrec.flags |= XLH_INSERT_IS_SPECULATIVE;
Assert(ItemPointerGetBlockNumber(&heaptup->t_self) == BufferGetBlockNumber(buffer));
/*
* For logical decoding, we need the tuple even if we're doing a full
* page write, so make sure it's included even if we take a full-page
* image. (XXX We could alternatively store a pointer into the FPW).
*/
if (RelationIsLogicallyLogged(relation))
{
xlrec.flags |= XLH_INSERT_CONTAINS_NEW_TUPLE;
bufflags |= REGBUF_KEEP_DATA;
}
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
xlhdr.t_infomask = heaptup->t_data->t_infomask;
xlhdr.t_hoff = heaptup->t_data->t_hoff;
/*
* note we mark xlhdr as belonging to buffer; if XLogInsert decides to
* write the whole page to the xlog, we don't need to store
* xl_heap_header in the xlog.
*/
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
/* PG73FORMAT: write bitmap [+ padding] [+ oid] + data */
XLogRegisterBufData(0,
(char *) heaptup->t_data + SizeofHeapTupleHeader,
heaptup->t_len - SizeofHeapTupleHeader);
/* filtering by origin on a row level is much more efficient */
XLogIncludeOrigin();
recptr = XLogInsert(RM_HEAP_ID, info);
PageSetLSN(page, recptr);
}
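Our story begins here. There are two key operations:
- Call XLogRegisterData, XLogRegisterBuffer, and XLogRegisterBufData to register the data that must go into the XLOG record.
- Call XLogInsert to write the XLOG record into the log buffer.
Next, we look at what data an insert registers and how that data is written.
XLOG composition
To answer these questions we first need to know what an XLOG record is made of. Consider the following test case: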
DROP TABLE IF EXISTS test;
CREATE TABLE test(a int);
INSERT INTO test values(1);
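This test case is very simple: it inserts one row into a table. For this insert statement, the XLOG record consists of four parts:
Part 1: XLOG header
The first part is also the most complex. It is itself made up of several smaller pieces:
- XLogRecord
XLogRecord is the header of every XLOG record and holds its basic information. It is defined as follows: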
/*
 * The overall layout of an XLOG record is:
 *      Fixed-size header (XLogRecord struct)
 *      XLogRecordBlockHeader struct
 *      XLogRecordBlockHeader struct
 *      ...
 *      XLogRecordDataHeader[Short|Long] struct
 *      block data
 *      block data
 *      ...
 *      main data
 *
 * There can be zero or more XLogRecordBlockHeaders, and 0 or more bytes of
 * rmgr-specific data not associated with a block. XLogRecord structs
 * always start on MAXALIGN boundaries in the WAL files, but the rest of
 * the fields are not aligned.
 *
 * The XLogRecordBlockHeader, XLogRecordDataHeaderShort and
 * XLogRecordDataHeaderLong structs all begin with a single 'id' byte. It's
 * used to distinguish between block references, and the main data structs.
 */
typedef struct XLogRecord
{
    uint32          xl_tot_len;     /* total len of entire record */
    TransactionId   xl_xid;         /* xact id */
    XLogRecPtr      xl_prev;        /* ptr to previous record in log */
    uint8           xl_info;        /* flag bits, see below */
    RmgrId          xl_rmid;        /* resource manager for this record */
    /* 2 bytes of padding here, initialize to zero */
    pg_crc32c       xl_crc;         /* CRC for this record */

    /* XLogRecordBlockHeaders and XLogRecordDataHeader follow, no padding */
} XLogRecord;
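Many details of this struct deserve closer study; for now we cover only the basics needed later in this article.
- xl_tot_len: the total length of the XLOG record, that is, the combined length of its four parts.
- xl_xid: the transaction ID.
- xl_prev: the physical offset (that is, the LSN) of the previous record.
- xl_info: flag bits.
- xl_rmid: the resource manager ID. It identifies what kind of operation the record describes, so that recovery can call the matching function to redo it. For the insert above the value is RM_HEAP_ID, and redo calls heap_redo.
- xl_crc: the CRC checksum of the record.
- XLogRecordBlockHeader
Immediately after XLogRecord comes the XLogRecordBlockHeader struct, defined as follows: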
/*
 * Header info for block data appended to an XLOG record.
 *
 * 'data_length' is the length of the rmgr-specific payload data associated
 * with this block. It does not include the possible full page image, nor
 * XLogRecordBlockHeader struct itself.
 *
 * Note that we don't attempt to align the XLogRecordBlockHeader struct!
 * So, the struct must be copied to aligned local storage before use.
 */
typedef struct XLogRecordBlockHeader
{
    uint8       id;             /* block reference ID */
    uint8       fork_flags;     /* fork within the relation, and flags */
    uint16      data_length;    /* number of payload bytes (not including page
                                 * image) */

    /* If BKPBLOCK_HAS_IMAGE, an XLogRecordBlockImageHeader struct follows */
    /* If BKPBLOCK_SAME_REL is not set, a RelFileNode follows */
    /* BlockNumber follows */
} XLogRecordBlockHeader;
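Two fields of this struct matter most:
- fork_flags: the low 4 bits hold the fork number and the high 4 bits hold flags. The flags are defined as follows: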
/*
 * The fork number fits in the lower 4 bits in the fork_flags field. The upper
 * bits are used for flags.
 */
#define BKPBLOCK_FORK_MASK  0x0F
#define BKPBLOCK_FLAG_MASK  0xF0
#define BKPBLOCK_HAS_IMAGE  0x10    /* block data is an XLogRecordBlockImage */
#define BKPBLOCK_HAS_DATA   0x20
#define BKPBLOCK_WILL_INIT  0x40    /* redo will re-init the page */
#define BKPBLOCK_SAME_REL   0x80    /* RelFileNode omitted, same as previous */
These flags are explained later.
- data_length: the size of the payload data that follows for this block.
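To make the bit layout of fork_flags concrete, here is a small sketch that packs and unpacks the byte. The mask and flag values are copied from the definitions quoted above; the three helper functions are hypothetical names introduced only for illustration and are not part of PostgreSQL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the definitions quoted in this article. */
#define BKPBLOCK_FORK_MASK  0x0F
#define BKPBLOCK_FLAG_MASK  0xF0
#define BKPBLOCK_HAS_IMAGE  0x10
#define BKPBLOCK_HAS_DATA   0x20
#define BKPBLOCK_WILL_INIT  0x40
#define BKPBLOCK_SAME_REL   0x80

/* Hypothetical helper: combine a fork number with flag bits. */
uint8_t
make_fork_flags(uint8_t forkno, uint8_t flags)
{
    assert((forkno & BKPBLOCK_FLAG_MASK) == 0);   /* fork must fit in 4 bits */
    return (uint8_t) (forkno | (flags & BKPBLOCK_FLAG_MASK));
}

/* Hypothetical helper: extract the fork number (low 4 bits). */
uint8_t
fork_of(uint8_t fork_flags)
{
    return (uint8_t) (fork_flags & BKPBLOCK_FORK_MASK);
}

/* Hypothetical helper: test one flag bit (high 4 bits). */
bool
has_flag(uint8_t fork_flags, uint8_t flag)
{
    return (fork_flags & flag) != 0;
}
```

For the insert in our test case, with fork number 0 and only block data attached, the byte would be 0x20 under these definitions.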
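- XLogRecordBlockImageHeader
What follows XLogRecordBlockHeader depends on the situation. If the block must be backed up, XLogRecordBlockHeader is followed by XLogRecordBlockImageHeader. Block backups exist to solve the partial-write problem, which is covered later; for now we assume no backup is needed, so this part is absent.
- XLogRecordBlockCompressHeader
After XLogRecordBlockImageHeader, if compression is enabled, XLogRecordBlockCompressHeader follows. We assume compression is off, so this part is absent as well.
- RelFileNode
Next comes the RelFileNode struct, defined as follows: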
typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    Oid         relNode;        /* relation */
} RelFileNode;
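This struct identifies the tablespace, database, and table that the XLOG record applies to during recovery.
- BlockNumber
Next comes BlockNumber, which identifies the block that the XLOG record applies to during recovery.
typedef uint32 BlockNumber;
- mainrdata_len
Last is the mainrdata header, which gives the length of mainrdata. Not every XLOG record has mainrdata, but the insert in our test case does, so it cannot be skipped.
Summary
To summarize: for the insert in our test case, assuming no block backup and no compression, the XLOG header consists of XLogRecord + XLogRecordBlockHeader + RelFileNode + BlockNumber + mainrdata_len, 46 bytes in total (24 + 4 + 12 + 4 + 2).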
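The 46-byte figure can be checked with a few struct mirrors. This is a sketch under stated assumptions: the *Sketch types are hypothetical stand-ins for the PostgreSQL structs, with TransactionId, Oid, and BlockNumber taken as 32-bit and XLogRecPtr as 64-bit, and the mainrdata header taken in its short form (one id byte plus one length byte).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirrors of the header pieces described above. */
typedef struct
{
    uint32_t    xl_tot_len;
    uint32_t    xl_xid;         /* TransactionId, assumed 32-bit */
    uint64_t    xl_prev;        /* XLogRecPtr, assumed 64-bit */
    uint8_t     xl_info;
    uint8_t     xl_rmid;
    /* 2 bytes of padding here */
    uint32_t    xl_crc;
} XLogRecordSketch;

typedef struct
{
    uint8_t     id;
    uint8_t     fork_flags;
    uint16_t    data_length;
} BlockHeaderSketch;

typedef struct
{
    uint32_t    spcNode;
    uint32_t    dbNode;
    uint32_t    relNode;
} RelFileNodeSketch;

typedef uint32_t BlockNumberSketch;

typedef struct
{
    uint8_t     id;
    uint8_t     data_length;
} DataHeaderShortSketch;

/* Header length for a simple insert: no page image, no compression. */
size_t
insert_header_size(void)
{
    return (offsetof(XLogRecordSketch, xl_crc) + sizeof(uint32_t))  /* 24 */
        + sizeof(BlockHeaderSketch)                                 /*  4 */
        + sizeof(RelFileNodeSketch)                                 /* 12 */
        + sizeof(BlockNumberSketch)                                 /*  4 */
        + sizeof(DataHeaderShortSketch);                            /*  2 */
}
```

The first term is written with offsetof rather than sizeof so that it does not depend on any trailing padding the compiler might add to the struct.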
Part 2: xl_heap_header
The second part of the XLOG record is the xl_heap_header struct, defined as follows:
/*
* We don't store the whole fixed part (HeapTupleHeaderData) of an inserted
* or updated tuple in WAL; we can save a few bytes by reconstructing the
* fields that are available elsewhere in the WAL record, or perhaps just
* plain needn't be reconstructed. These are the fields we must store.
* NOTE: t_hoff could be recomputed, but we may as well store it because
* it will come for free due to alignment considerations.
*/
typedef struct xl_heap_header
{
uint16 t_infomask2;
uint16 t_infomask;
uint8 t_hoff;
} xl_heap_header;
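xl_heap_header is a reduced version of HeapTupleHeaderData (the tuple header; see "PostgreSQL Basic Modules: Table and Tuple Organization"). As the comment above the struct explains, there is no need to write the whole HeapTupleHeaderData to XLOG, because many of its fields can be reconstructed or need not be reconstructed at all. Only the essential fields are stored, and xl_heap_header is where they are recorded.
The xl_heap_header struct occupies 5 bytes in the record (SizeOfHeapHeader).
Part 3: tuple data
The third part of the XLOG record is the tuple data itself: everything in the tuple after the fixed-size tuple header (heaptup->t_data + SizeofHeapTupleHeader), byte for byte the same as what was written into the data page at insert time. For the test case above, this data is 5 bytes long.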
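Where the 5 bytes come from is easy to miss, since the inserted value is a 4-byte int. The sketch below reproduces the arithmetic of heaptup->t_len - SizeofHeapTupleHeader. It rests on assumptions that are not stated in this article: a fixed tuple header of 23 bytes, 8-byte maximum alignment, and a tuple with no null bitmap. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAXIMUM_ALIGNOF_SKETCH  8   /* assumed: typical 64-bit build */
#define FIXED_TUPLE_HEADER      23  /* assumed: fixed part of the tuple header */

/* Round len up to the next multiple of the maximum alignment. */
size_t
maxalign_sketch(size_t len)
{
    return (len + (MAXIMUM_ALIGNOF_SKETCH - 1))
        & ~((size_t) (MAXIMUM_ALIGNOF_SKETCH - 1));
}

/* Bytes registered as tuple data: t_len minus the fixed header size. */
size_t
logged_tuple_bytes(size_t user_data_len)
{
    size_t      t_hoff = maxalign_sketch(FIXED_TUPLE_HEADER);  /* 24 */
    size_t      t_len = t_hoff + user_data_len;

    return t_len - FIXED_TUPLE_HEADER;
}
```

Under these assumptions a 4-byte int yields 24 + 4 - 23 = 5 bytes: one alignment padding byte followed by the four data bytes.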
Part 4: xl_heap_insert
The last part is the xl_heap_insert struct, which records the tuple's offset within its physical block. It is defined as follows:
/* This is what we need to know about insert */
typedef struct xl_heap_insert
{
OffsetNumber offnum; /* inserted tuple's offset */
uint8 flags;
/* xl_heap_header & TUPLE DATA in backup block 0 */
} xl_heap_insert;
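This struct occupies 3 bytes in the record (SizeOfHeapInsert).
As described in "PostgreSQL Basic Modules: Table and Tuple Organization", a complete tuple in PostgreSQL consists of an ItemIdData entry plus the tuple body. Within a data page, ItemIdData entries are fixed-length and allocated from the front of the page toward the back. The tuple body consists of HeapTupleHeaderData plus the tuple contents; it is variable-length and allocated from the back of the page toward the front. ItemIdData marks the position of the tuple body through lp_off. The point to note is that xl_heap_insert records the offset of the ItemIdData entry in the page, not the offset of the tuple body. A log that records changes this way is called a physiological log; if it recorded the offset of the tuple body instead, it would be a physical log. The difference between the two is explained in detail below.
For the construction of xl_heap_insert, see heap_insert from line 2444 onward. The key line:
// Source: heapam.c, line 2471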
xlrec.offnum = ItemPointerGetOffsetNumber(&heaptup->t_self);
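Summary
To summarize: for an insert, in the simplest case, the XLOG record consists of the following parts:
XLogRecord + XLogRecordBlockHeader + RelFileNode + BlockNumber + mainrdata_len    46 bytes
+
xl_heap_header    5 bytes
+
tuple data    depends on the data; 5 bytes in this test case
+
xl_heap_insert    3 bytes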
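The 5-byte and 3-byte figures are smaller than what sizeof would typically report for these structs, because the record stores them without trailing padding. A minimal sketch of that idiom and of the resulting total follows; the *_sketch types and SIZE_OF_* macros are hypothetical mirrors, and OffsetNumber is assumed to be a 16-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirrors of the two structs shown above. */
typedef struct
{
    uint16_t    t_infomask2;
    uint16_t    t_infomask;
    uint8_t     t_hoff;
} xl_heap_header_sketch;

typedef struct
{
    uint16_t    offnum;         /* OffsetNumber, assumed 16-bit */
    uint8_t     flags;
} xl_heap_insert_sketch;

/* Size up to and including the last field, ignoring trailing padding. */
#define SIZE_OF_HEAP_HEADER \
    (offsetof(xl_heap_header_sketch, t_hoff) + sizeof(uint8_t))
#define SIZE_OF_HEAP_INSERT \
    (offsetof(xl_heap_insert_sketch, flags) + sizeof(uint8_t))

/* Total record length for a simple insert. */
size_t
insert_record_total(size_t header_len, size_t tuple_data_len)
{
    return header_len + SIZE_OF_HEAP_HEADER + tuple_data_len
        + SIZE_OF_HEAP_INSERT;
}
```

With the 46-byte header and 5 bytes of tuple data from the test case, the four parts add up to 59 bytes.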
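These parts carry the following key information:
- What operation was performed?
  Given by xl_rmid in XLogRecord (together with the operation code passed in xl_info, here XLOG_HEAP_INSERT).
- What data was written?
  Given by the tuple data.
- Where was the data written?
  Which tablespace? Given by spcNode in RelFileNode.
  Which database? Given by dbNode in RelFileNode.
  Which table? Given by relNode in RelFileNode.
  Which block of the table? Given by BlockNumber.
  Which position in the block? Given by offnum in xl_heap_insert.
This is exactly the information an insert needs. With it, redo becomes straightforward.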
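Page-oriented Log
Now that the structure of an XLOG record is clear, we can explain what a page-oriented log is. The record tells us which page a tuple should be written to and where on that page. The heap_insert flow shows that as soon as a tuple is written to a data page, an XLOG record is generated for that write and placed in the log buffer. In other words, XLOG describes data changes within pages, and that is what makes it a page-oriented log. Its counterpart is the logical log, which typically records just the SQL statement and re-executes it during redo. With a page-oriented log, redo always writes the tuple to the same page as the original write; with a logical log, the tuple may end up on any page.
Page-oriented logs are further divided into physical logs and physiological logs. As mentioned earlier, a physical log records the tuple's physical position in the page (the value of lp_off in ItemIdData), whereas a physiological log records only its logical position (the offset of the ItemIdData entry itself).
A physical log records the tuple's actual offset, so redo simply locates that position and overwrites whatever is there (whether or not the tuple had reached disk). Such an operation is idempotent: the result is the same no matter how many times redo runs. The drawback is that whenever a block is reorganized (for example by vacuum), the physical positions of tuples change. Keeping the physical information exact means that reorganization itself generates a large volume of physical log, which hurts performance badly.
PostgreSQL therefore uses a physiological log. "Physical" means that the record identifies the actual data page the tuple was inserted into; "logical" means that the position within that page is a logical value. During vacuum it is enough to keep the ItemIdData positions unchanged, and the log is unaffected. A physiological log is not idempotent, however: replaying a record several times without any safeguard would insert several copies of the tuple. Some means is needed to decide whether a given XLOG record still has to be replayed on the target page, and that means is the LSN. This topic is covered in a separate document.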
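The role of the LSN as a replay guard can be illustrated with a toy page. This is only a sketch of the idea, not PostgreSQL's redo code: ToyPage and redo_insert are hypothetical, and the page is reduced to an LSN plus an array of values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_ITEMS 16

/* Hypothetical, heavily simplified page. */
typedef struct
{
    uint64_t    lsn;                /* LSN of the last record applied */
    int         nitems;
    int         items[MAX_ITEMS];
} ToyPage;

/*
 * Replay an "insert value" record. Appending is not idempotent on its own,
 * so the page LSN decides whether the record still needs to be applied.
 * Returns true if the record was applied, false if it was skipped.
 */
bool
redo_insert(ToyPage *page, uint64_t record_lsn, int value)
{
    if (record_lsn <= page->lsn)
        return false;               /* change is already on the page */

    assert(page->nitems < MAX_ITEMS);
    page->items[page->nitems++] = value;
    page->lsn = record_lsn;         /* remember how far this page has got */
    return true;
}
```

Replaying the same record twice now leaves a single copy of the value on the page, which is the property the unguarded physiological record lacks.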
How data is written to XLOG
Finally, let's see how the pieces of information an XLOG record needs are written and put together. As described in "Writing XLOG", there are two main steps:
- Register the data that must go into the XLOG record.
- Call XLogInsert to write the XLOG record into the log buffer.
We cover each step in turn.
Registering data
The registration functions used in heap_insert are XLogRegisterData, XLogRegisterBuffer, and XLogRegisterBufData. The calls are:
// xlrec is the xl_heap_insert struct
XLogRegisterData((char *) &xlrec, SizeOfHeapInsert);
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD | bufflags);
// xlhdr is the xl_heap_header struct
XLogRegisterBufData(0, (char *) &xlhdr, SizeOfHeapHeader);
// (char *) heaptup->t_data + SizeofHeapTupleHeader is the tuple data
XLogRegisterBufData(0,
                    (char *) heaptup->t_data + SizeofHeapTupleHeader,
                    heaptup->t_len - SizeofHeapTupleHeader);
The code shows directly that heap_insert registers xl_heap_insert, xl_heap_header, and the tuple data. Recall the parts of the XLOG record described earlier:
XLogRecord + XLogRecordBlockHeader + RelFileNode + BlockNumber + mainrdata_len +
xl_heap_header + tuple data + xl_heap_insert
Clearly we now have all the data except the XLOG header. Let's look at the registration functions:
XLogRegisterData
/*
* Add data to the WAL record that's being constructed.
*
* The data is appended to the "main chunk", available at replay with
* XLogRecGetData().
*/
void
XLogRegisterData(char *data, int len)
{
XLogRecData *rdata;
Assert(begininsert_called);
if (num_rdatas >= max_rdatas)
elog(ERROR, "too much WAL data");
// 1. Take an XLogRecData object, rdata, from the global rdatas array.
rdata = &rdatas[num_rdatas++];
// 2. Store the data to be registered in rdata.
rdata->data = data;
rdata->len = len;
/*
* we use the mainrdata_last pointer to track the end of the chain, so no
* need to clear 'next' here.
*/
// 3. Append rdata to the mainrdata chain.
mainrdata_last->next = rdata;
mainrdata_last = rdata;
mainrdata_len += len;
}
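This function is very simple, with three main steps:
- Take an XLogRecData object, rdata, from the global rdatas array.
- Store the data to be registered in rdata.
- Append rdata to the mainrdata chain.
First, the XLogRecData struct: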
/*
* The functions in xloginsert.c construct a chain of XLogRecData structs
* to represent the final WAL record.
*/
typedef struct XLogRecData
{
struct XLogRecData *next; /* next struct in chain, or NULL */
char *data; /* start of rmgr data to include */
uint32 len; /* length of rmgr data to include */
} XLogRecData;
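This is a typical linked-list node, and it is a key struct. The XLOG composition described earlier is the on-disk layout of a record; in memory, the four parts are tied together by a chain of XLogRecData structs.
Two points about XLogRegisterData deserve attention:
- The rdatas array
To avoid the cost of frequent allocation and freeing, the rdatas array is preallocated by InitXLogInsert at process initialization. If the number of chunks to register exceeds the size of the array, an error is raised.
- The mainrdata chain
Data registered with XLogRegisterData is linked into the mainrdata chain, and mainrdata_len is the total length of all data on that chain. This will be used later.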
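The combination of a preallocated pool and a chain that only stores pointers can be sketched in a few lines. The code below is a simplified illustration, not the PostgreSQL implementation: RData, MAX_RDATAS, and all function names are hypothetical, the pool size is arbitrary, and the empty-chain case is handled explicitly rather than with the pointer trick used in the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct RData
{
    struct RData *next;
    const char   *data;     /* points at caller-owned bytes, nothing is copied */
    uint32_t      len;
} RData;

#define MAX_RDATAS 20       /* assumed pool size for this sketch */

static RData    pool[MAX_RDATAS];   /* allocated once, reused for every record */
static int      num_used;
static RData   *chain_head;
static RData   *chain_last;
static uint32_t chain_len;

/* Start a new record: forget everything registered so far. */
void
begin_record(void)
{
    num_used = 0;
    chain_head = NULL;
    chain_last = NULL;
    chain_len = 0;
}

/* Register one chunk. Returns 0 on success, -1 if the pool is exhausted. */
int
register_data(const char *data, uint32_t len)
{
    RData      *rdata;

    if (num_used >= MAX_RDATAS)
        return -1;

    rdata = &pool[num_used++];      /* step 1: take a node from the pool */
    rdata->data = data;             /* step 2: record pointer and length */
    rdata->len = len;
    rdata->next = NULL;

    if (chain_last == NULL)         /* step 3: append to the chain */
        chain_head = rdata;
    else
        chain_last->next = rdata;
    chain_last = rdata;
    chain_len += len;
    return 0;
}

/* Total length of all registered chunks. */
uint32_t
registered_len(void)
{
    return chain_len;
}

/* Number of chunks on the chain, found by walking it. */
int
registered_chunks(void)
{
    int         n = 0;

    for (const RData *r = chain_head; r != NULL; r = r->next)
        n++;
    return n;
}
```

Because the nodes only point at the caller's buffers, the registered data must stay valid until the record has actually been copied into the log buffer.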
XLogRegisterBuffer
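As the code above shows, XLogRegisterBuffer is called before XLogRegisterBufData registers xl_heap_header and the tuple data. XLogRegisterBuffer registers the basic information about a page. Recall the page-oriented log described earlier: XLOG is tied to pages, and the tuple registered afterwards belongs to a particular page, so the page must be registered before the tuple. XLogRegisterBuffer is implemented as follows: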
/*
* Register a reference to a buffer with the WAL record being constructed.
* This must be called for every page that the WAL-logged operation modifies.
*/
void
XLogRegisterBuffer(uint8 block_id, Buffer buffer, uint8 flags)
{
registered_buffer *regbuf;
/* NO_IMAGE doesn't make sense with FORCE_IMAGE */
Assert(!((flags & REGBUF_FORCE_IMAGE) && (flags & (REGBUF_NO_IMAGE))));
Assert(begininsert_called);
if (block_id >= max_registered_block_id)
{
if (block_id >= max_registered_buffers)
elog(ERROR, "too many registered buffers");
max_registered_block_id = block_id + 1;
}
// 1. Take a registered_buffer object, regbuf, from the global registered_buffers array.
regbuf = &registered_buffers[block_id];
// 2. Store the page information in regbuf.
BufferGetTag(buffer, &regbuf->rnode, &regbuf->forkno, &regbuf->block);
regbuf->page = BufferGetPage(buffer);
regbuf->flags = flags;
// 3. Initialize regbuf's data chain.
regbuf->rdata_tail = (XLogRecData *) &regbuf->rdata_head;
regbuf->rdata_len = 0;
/*
* Check that this page hasn't already been registered with some other
* block_id.
*/
#ifdef USE_ASSERT_CHECKING
// The code here is not important, so it is omitted.
#endif
regbuf->in_use = true;
}
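The flow of this function is very similar to XLogRegisterData, with three main steps:
- Take a registered_buffer object, regbuf, from the global registered_buffers array.
- Store the page information in regbuf.
- Initialize regbuf's data chain.
First, the registered_buffer struct: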
/*
* For each block reference registered with XLogRegisterBuffer, we fill in
* a registered_buffer struct.
*/
typedef struct
{
bool in_use; /* is this slot in use? */
uint8 flags; /* REGBUF_* flags */
RelFileNode rnode; /* identifies the relation and block */
ForkNumber forkno;
BlockNumber block;
Page page; /* page content */
uint32 rdata_len; /* total length of data in rdata chain */
XLogRecData *rdata_head; /* head of the chain of data registered with
* this block */
XLogRecData *rdata_tail; /* last entry in the chain, or &rdata_head if
* empty */
XLogRecData bkp_rdatas[2]; /* temporary rdatas used to hold references to
* backup block data in XLogRecordAssemble() */
/* buffer to store a compressed version of backup block image */
char compressed_page[PGLZ_MAX_BLCKSZ];
} registered_buffer;
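The members most relevant to this article are:
- rnode: a RelFileNode struct, one component of the XLOG header.
- block: a BlockNumber, one component of the XLOG header.
- rdata_head, rdata_tail: the XLogRecData chain; the tuple data registered later is linked here.
As with the rdatas array used by XLogRegisterData, the registered_buffers array is allocated in advance in InitXLogInsert.
Once XLogRegisterBuffer returns, RelFileNode and BlockNumber have been registered.
XLogRegisterBufData
Finally, XLogRegisterBufData. It registers the xl_heap_header struct and the tuple data, which in effect registers a tuple. It is implemented as follows: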
/*
* Add buffer-specific data to the WAL record that's being constructed.
*
* Block_id must reference a block previously registered with
* XLogRegisterBuffer(). If this is called more than once for the same
* block_id, the data is appended.
*
* The maximum amount of data that can be registered per block is 65535
* bytes. That should be plenty; if you need more than BLCKSZ bytes to
* reconstruct the changes to the page, you might as well just log a full
* copy of it. (the "main data" that's not associated with a block is not
* limited)
*/
void
XLogRegisterBufData(uint8 block_id, char *data, int len)
{
registered_buffer *regbuf;
XLogRecData *rdata;
Assert(begininsert_called);
/* find the registered buffer struct */
regbuf = &registered_buffers[block_id];
if (!regbuf->in_use)
elog(ERROR, "no block with id %d registered with WAL insertion",
block_id);
if (num_rdatas >= max_rdatas)
elog(ERROR, "too much WAL data");
rdata = &rdatas[num_rdatas++];
rdata->data = data;
rdata->len = len;
regbuf->rdata_tail->next = rdata;
regbuf->rdata_tail = rdata;
regbuf->rdata_len += len;
}
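This function has three main steps:
- Take an XLogRecData object, rdata, from the global rdatas array.
- Store the data to be registered in rdata.
- Append rdata to regbuf's chain.
Summary
To summarize, after the registration phase the parts of the XLOG record stand as follows:
Already built: RelFileNode + BlockNumber + xl_heap_header + tuple data + xl_heap_insert
Not yet built: XLogRecord + XLogRecordBlockHeader + mainrdata_len
The construction of the parts already built was covered above. Their data is held in two places:
- xl_heap_insert is in the mainrdata chain.
- RelFileNode and BlockNumber are stored in regbuf itself, and xl_heap_header + tuple data are in regbuf's rdata chain.
Next, XLogInsert is called and the actual writing of the XLOG record begins.
Writing XLOG
As the summary shows, the registration phase supplied most of the information, but some parts are still missing. When writing XLOG we therefore first build the missing parts (XLogRecord, XLogRecordBlockHeader, mainrdata_len) and then write the data into the log buffer. XLogInsert implements this process: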
/*
* Insert an XLOG record having the specified RMID and info bytes, with the
* body of the record being the data and buffer references registered earlier
* with XLogRegister* calls.
*
* Returns XLOG pointer to end of record (beginning of next record).
* This can be used as LSN for data pages affected by the logged action.
* (LSN is the XLOG point up to which the XLOG must be flushed to disk
* before the data page can be written out. This implements the basic
* WAL rule "write the log before the data".)
*/
XLogRecPtr
XLogInsert(RmgrId rmid, uint8 info)
{
XLogRecPtr EndPos;
/* XLogBeginInsert() must have been called. */
if (!begininsert_called)
elog(ERROR, "XLogBeginInsert was not called");
/*
* The caller can set rmgr bits and XLR_SPECIAL_REL_UPDATE; the rest are
* reserved for use by me.
*/
if ((info & ~(XLR_RMGR_INFO_MASK | XLR_SPECIAL_REL_UPDATE)) != 0)
elog(PANIC, "invalid xlog info mask %02X", info);
TRACE_POSTGRESQL_XLOG_INSERT(rmid, info);
/*
* In bootstrap mode, we don't actually log anything but XLOG resources;
* return a phony record pointer.
*/
if (IsBootstrapProcessingMode() && rmid != RM_XLOG_ID)
{
XLogResetInsertion();
EndPos = SizeOfXLogLongPHD; /* start of 1st chkpt record */
return EndPos;
}
do
{
XLogRecPtr RedoRecPtr;
bool doPageWrites;
XLogRecPtr fpw_lsn;
XLogRecData *rdt;
/*
* Get values needed to decide whether to do full-page writes. Since
* we don't yet have an insertion lock, these could change under us,
* but XLogInsertRecord will recheck them once it has a lock.
*/
GetFullPageWriteInfo(&RedoRecPtr, &doPageWrites);
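// 1. Build the parts that are still missing (XLogRecord, XLogRecordBlockHeader, mainrdata_len).
rdt = XLogRecordAssemble(rmid, info, RedoRecPtr, doPageWrites,
&fpw_lsn);
// 2. Write the data into the log buffer.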
EndPos = XLogInsertRecord(rdt, fpw_lsn);
} while (EndPos == InvalidXLogRecPtr);
XLogResetInsertion();
return EndPos;
}
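The key functions are XLogRecordAssemble and XLogInsertRecord. We look at each in turn:
XLogRecordAssemble
XLogRecordAssemble builds the parts that were still missing after registration: XLogRecord, XLogRecordBlockHeader, and mainrdata_len. It then assembles the four parts of the XLOG record (XLOG header + xl_heap_header + tuple data + xl_heap_insert) into an XLogRecData chain. It is implemented as follows: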
/*
* Assemble a WAL record from the registered data and buffers into an
* XLogRecData chain, ready for insertion with XLogInsertRecord().
*
* The record header fields are filled in, except for the xl_prev field. The
* calculated CRC does not include the record header yet.
*
* If there are any registered buffers, and a full-page image was not taken
* of all of them, *fpw_lsn is set to the lowest LSN among such pages. This
* signals that the assembled record is only good for insertion on the
* assumption that the RedoRecPtr and doPageWrites values were up-to-date.
*/
static XLogRecData *
XLogRecordAssemble(RmgrId rmid, uint8 info,
XLogRecPtr RedoRecPtr, bool doPageWrites,
XLogRecPtr *fpw_lsn)
{
XLogRecData *rdt;
uint32 total_len = 0;
int block_id;
pg_crc32c rdata_crc;
registered_buffer *prev_regbuf = NULL;
XLogRecData *rdt_datas_last;
XLogRecord *rechdr;
char *scratch = hdr_scratch;
/*
* Note: this function can be called multiple times for the same record.
* All the modifications we do to the rdata chains below must handle that.
*/
/* The record begins with the fixed-size header */
rechdr = (XLogRecord *) scratch;
scratch += SizeOfXLogRecord;
hdr_rdt.next = NULL;
rdt_datas_last = &hdr_rdt;
hdr_rdt.data = hdr_scratch;
/*
* Make an rdata chain containing all the data portions of all block
* references. This includes the data for full-page images. Also append
* the headers for the block references in the scratch buffer.
*/
*fpw_lsn = InvalidXLogRecPtr;
for (block_id = 0; block_id < max_registered_block_id; block_id++)
{
registered_buffer *regbuf = &registered_buffers[block_id];
bool needs_backup;
bool needs_data;
XLogRecordBlockHeader bkpb;
XLogRecordBlockImageHeader bimg;
XLogRecordBlockCompressHeader cbimg = {0};
bool samerel;
bool is_compressed = false;
if (!regbuf->in_use)
continue;
/* Determine if this block needs to be backed up */
if (regbuf->flags & REGBUF_FORCE_IMAGE)
needs_backup = true;
else if (regbuf->flags & REGBUF_NO_IMAGE)
needs_backup = false;
else if (!doPageWrites)
needs_backup = false;
else
{
/*
* We assume page LSN is first data on *every* page that can be
* passed to XLogInsert, whether it has the standard page layout
* or not.
*/
XLogRecPtr page_lsn = PageGetLSN(regbuf->page);
needs_backup = (page_lsn <= RedoRecPtr);
if (!needs_backup)
{
if (*fpw_lsn == InvalidXLogRecPtr || page_lsn < *fpw_lsn)
*fpw_lsn = page_lsn;
}
}
/* Determine if the buffer data needs to included */
if (regbuf->rdata_len == 0)
needs_data = false;
else if ((regbuf->flags & REGBUF_KEEP_DATA) != 0)
needs_data = true;
else
needs_data = !needs_backup;
bkpb.id = block_id;
bkpb.fork_flags = regbuf->forkno;
bkpb.data_length = 0;
if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
if (needs_backup)
{
Page page = regbuf->page;
uint16 compressed_len;
/*
* The page needs to be backed up, so calculate its hole length
* and offset.
*/
if (regbuf->flags & REGBUF_STANDARD)
{
/* Assume we can omit data between pd_lower and pd_upper */
uint16 lower = ((PageHeader) page)->pd_lower;
uint16 upper = ((PageHeader) page)->pd_upper;
if (lower >= SizeOfPageHeaderData &&
upper > lower &&
upper <= BLCKSZ)
{
bimg.hole_offset = lower;
cbimg.hole_length = upper - lower;
}
else
{
/* No "hole" to compress out */
bimg.hole_offset = 0;
cbimg.hole_length = 0;
}
}
else
{
/* Not a standard page header, don't try to eliminate "hole" */
bimg.hole_offset = 0;
cbimg.hole_length = 0;
}
/*
* Try to compress a block image if wal_compression is enabled
*/
if (wal_compression)
{
is_compressed =
XLogCompressBackupBlock(page, bimg.hole_offset,
cbimg.hole_length,
regbuf->compressed_page,
&compressed_len);
}
/*
* Fill in the remaining fields in the XLogRecordBlockHeader
* struct
*/
bkpb.fork_flags |= BKPBLOCK_HAS_IMAGE;
/*
* Construct XLogRecData entries for the page content.
*/
rdt_datas_last->next = &regbuf->bkp_rdatas[0];
rdt_datas_last = rdt_datas_last->next;
bimg.bimg_info = (cbimg.hole_length == 0) ? 0 : BKPIMAGE_HAS_HOLE;
if (is_compressed)
{
bimg.length = compressed_len;
bimg.bimg_info |= BKPIMAGE_IS_COMPRESSED;
rdt_datas_last->data = regbuf->compressed_page;
rdt_datas_last->len = compressed_len;
}
else
{
bimg.length = BLCKSZ - cbimg.hole_length;
if (cbimg.hole_length == 0)
{
rdt_datas_last->data = page;
rdt_datas_last->len = BLCKSZ;
}
else
{
/* must skip the hole */
rdt_datas_last->data = page;
rdt_datas_last->len = bimg.hole_offset;
rdt_datas_last->next = &regbuf->bkp_rdatas[1];
rdt_datas_last = rdt_datas_last->next;
rdt_datas_last->data =
page + (bimg.hole_offset + cbimg.hole_length);
rdt_datas_last->len =
BLCKSZ - (bimg.hole_offset + cbimg.hole_length);
}
}
total_len += bimg.length;
}
if (needs_data)
{
/*
* Link the caller-supplied rdata chain for this buffer to the
* overall list.
*/
bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
bkpb.data_length = regbuf->rdata_len;
total_len += regbuf->rdata_len;
rdt_datas_last->next = regbuf->rdata_head;
rdt_datas_last = regbuf->rdata_tail;
}
if (prev_regbuf && RelFileNodeEquals(regbuf->rnode, prev_regbuf->rnode))
{
samerel = true;
bkpb.fork_flags |= BKPBLOCK_SAME_REL;
}
else
samerel = false;
prev_regbuf = regbuf;
/* Ok, copy the header to the scratch buffer */
memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
scratch += SizeOfXLogRecordBlockHeader;
if (needs_backup)
{
memcpy(scratch, &bimg, SizeOfXLogRecordBlockImageHeader);
scratch += SizeOfXLogRecordBlockImageHeader;
if (cbimg.hole_length != 0 && is_compressed)
{
memcpy(scratch, &cbimg,
SizeOfXLogRecordBlockCompressHeader);
scratch += SizeOfXLogRecordBlockCompressHeader;
}
}
if (!samerel)
{
memcpy(scratch, &regbuf->rnode, sizeof(RelFileNode));
scratch += sizeof(RelFileNode);
}
memcpy(scratch, &regbuf->block, sizeof(BlockNumber));
scratch += sizeof(BlockNumber);
}
/* followed by the record's origin, if any */
if (include_origin && replorigin_session_origin != InvalidRepOriginId)
{
*(scratch++) = (char) XLR_BLOCK_ID_ORIGIN;
memcpy(scratch, &replorigin_session_origin, sizeof(replorigin_session_origin));
scratch += sizeof(replorigin_session_origin);
}
/* followed by main data, if any */
if (mainrdata_len > 0)
{
if (mainrdata_len > 255)
{
*(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
memcpy(scratch, &mainrdata_len, sizeof(uint32));
scratch += sizeof(uint32);
}
else
{
*(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
*(scratch++) = (uint8) mainrdata_len;
}
rdt_datas_last->next = mainrdata_head;
rdt_datas_last = mainrdata_last;
total_len += mainrdata_len;
}
rdt_datas_last->next = NULL;
hdr_rdt.len = (scratch - hdr_scratch);
total_len += hdr_rdt.len;
/*
* Calculate CRC of the data
*
* Note that the record header isn't added into the CRC initially since we
* don't know the prev-link yet. Thus, the CRC will represent the CRC of
* the whole record in the order: rdata, then backup blocks, then record
* header.
*/
INIT_CRC32C(rdata_crc);
COMP_CRC32C(rdata_crc, hdr_scratch + SizeOfXLogRecord, hdr_rdt.len - SizeOfXLogRecord);
for (rdt = hdr_rdt.next; rdt != NULL; rdt = rdt->next)
COMP_CRC32C(rdata_crc, rdt->data, rdt->len);
/*
* Fill in the fields in the record header. Prev-link is filled in later,
* once we know where in the WAL the record will be inserted. The CRC does
* not include the record header yet.
*/
rechdr->xl_xid = GetCurrentTransactionIdIfAny();
rechdr->xl_tot_len = total_len;
rechdr->xl_info = info;
rechdr->xl_rmid = rmid;
rechdr->xl_prev = InvalidXLogRecPtr;
rechdr->xl_crc = rdata_crc;
return &hdr_rdt;
}
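This code is long, so we go through it one question at a time. As mentioned, XLogRecordAssemble produces a chain containing the four parts of the XLOG record, and the registration phase has already supplied the three parts other than the XLOG header. For XLogRecordAssemble we therefore need to answer two questions.
How the XLOG header is built
First, look at one key line:
char *scratch = hdr_scratch;
What is hdr_scratch?
static char *hdr_scratch = NULL;
hdr_scratch is a global buffer. From the detailed description of the XLOG header above, we know that its length is not fixed. Again to avoid frequent allocation and freeing of memory, PostgreSQL allocates space for the XLOG header in advance in InitXLogInsert, sized to the maximum possible header length.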
#define HEADER_SCRATCH_SIZE \
(SizeOfXLogRecord + \
MaxSizeOfXLogRecordBlockHeader * (XLR_MAX_BLOCK_ID + 1) + \
SizeOfXLogRecordDataHeaderLong + SizeOfXlogOrigin)
So, to understand how the XLOG header is built, we need to watch what XLogRecordAssemble does with the scratch pointer.
- Building XLogRecord
rechdr = (XLogRecord *) scratch;
scratch += SizeOfXLogRecord;
// ... intermediate code omitted ...
rechdr->xl_xid = GetCurrentTransactionIdIfAny();
rechdr->xl_tot_len = total_len;
rechdr->xl_info = info;
rechdr->xl_rmid = rmid;
rechdr->xl_prev = InvalidXLogRecPtr;
rechdr->xl_crc = rdata_crc;
- Building XLogRecordBlockHeader
XLogRecordBlockHeader bkpb;
bkpb.id = block_id;
bkpb.fork_flags = regbuf->forkno;
bkpb.data_length = 0;
if ((regbuf->flags & REGBUF_WILL_INIT) == REGBUF_WILL_INIT)
    bkpb.fork_flags |= BKPBLOCK_WILL_INIT;
memcpy(scratch, &bkpb, SizeOfXLogRecordBlockHeader);
scratch += SizeOfXLogRecordBlockHeader;
- Building RelFileNode
if (!samerel)
{
    memcpy(scratch, &regbuf->rnode, sizeof(RelFileNode));
    scratch += sizeof(RelFileNode);
}
Here we see that the RelFileNode data comes from regbuf->rnode, which was registered earlier.
- Building BlockNumber
memcpy(scratch, &regbuf->block, sizeof(BlockNumber));
scratch += sizeof(BlockNumber);
- Building mainrdata_len
if (mainrdata_len > 0)
{
    if (mainrdata_len > 255)
    {
        *(scratch++) = (char) XLR_BLOCK_ID_DATA_LONG;
        memcpy(scratch, &mainrdata_len, sizeof(uint32));
        scratch += sizeof(uint32);
    }
    else
    {
        *(scratch++) = (char) XLR_BLOCK_ID_DATA_SHORT;
        *(scratch++) = (uint8) mainrdata_len;
    }
    // Link mainrdata into the XLogRecData chain.
    rdt_datas_last->next = mainrdata_head;
    rdt_datas_last = mainrdata_last;
    total_len += mainrdata_len;
}
Two points to note here:
- The XLOG header records only the length of mainrdata, which in our test case is the length of xl_heap_insert.
- Besides building mainrdata_len, the code above also links mainrdata, that is xl_heap_insert, into the XLogRecData chain, which is the next question we address.
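The short and long forms of the mainrdata header can be isolated into a small standalone function. This is a sketch modeled on the excerpt above rather than the PostgreSQL code itself: write_main_data_header is a hypothetical name, and the two id values (255 and 254) are assumptions based on my recollection of xlogrecord.h, so check them against your source tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed id values; verify against xlogrecord.h. */
#define XLR_BLOCK_ID_DATA_SHORT 255
#define XLR_BLOCK_ID_DATA_LONG  254

/*
 * Write the main-data header at dst and return the number of bytes
 * written: 0 if there is no main data, 2 for the short form (id byte plus
 * 1-byte length), 5 for the long form (id byte plus 4-byte length).
 * dst must have room for at least 5 bytes.
 */
size_t
write_main_data_header(unsigned char *dst, uint32_t mainrdata_len)
{
    unsigned char *p = dst;

    if (mainrdata_len == 0)
        return 0;

    if (mainrdata_len > 255)
    {
        *p++ = XLR_BLOCK_ID_DATA_LONG;
        memcpy(p, &mainrdata_len, sizeof(uint32_t));   /* unaligned copy */
        p += sizeof(uint32_t);
    }
    else
    {
        *p++ = XLR_BLOCK_ID_DATA_SHORT;
        *p++ = (uint8_t) mainrdata_len;
    }
    return (size_t) (p - dst);
}
```

For the insert in our test case mainrdata is the 3-byte xl_heap_insert, so the short form applies and the header costs 2 bytes, which matches the last term of the 46-byte header total.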
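How the four parts of the XLOG record are chained
Next, let's see how the four parts of the XLOG record are chained together. XLogRecordAssemble ultimately returns hdr_rdt.
return &hdr_rdt;
So we need to see how XLogRecordAssemble manipulates hdr_rdt.
hdr_rdt.next = NULL;        // initialize the next pointer
rdt_datas_last = &hdr_rdt;  // point at the head of the chain
hdr_rdt serves as the head of the chain, so rdt_datas_last is set to point at it.
- Adding the XLOG header to the chain
hdr_rdt.data = hdr_scratch;
// ... intermediate code omitted ...
hdr_rdt.len = (scratch - hdr_scratch);
hdr_rdt is the chain head, so the XLOG header buffer is assigned directly to its data field; once the header has been built, its length is computed.
- Adding xl_heap_header and the tuple data to the chain
From the registration phase we know that xl_heap_header and the tuple data are both stored in regbuf's XLogRecData chain, with xl_heap_header first and the tuple data after it (xl_heap_header is registered first). So the head of regbuf's chain is simply appended after hdr_rdt.
if (needs_data)
{
    /*
     * Link the caller-supplied rdata chain for this buffer to the
     * overall list.
     */
    bkpb.fork_flags |= BKPBLOCK_HAS_DATA;
    bkpb.data_length = regbuf->rdata_len;
    total_len += regbuf->rdata_len;
    // Link the chain.
    rdt_datas_last->next = regbuf->rdata_head;
    rdt_datas_last = regbuf->rdata_tail;
}
- Adding xl_heap_insert to the chain
This was covered when building mainrdata_len and is not repeated here.
Summary
After XLogRecordAssemble we have a chain containing the four parts we want to write to XLOG.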
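What it means to consume such a chain can be shown with a short sketch that copies every chunk, in order, into one contiguous buffer. This illustrates the data structure only; it is not how PostgreSQL copies records into the WAL buffers, and RData and copy_chain are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct RData
{
    struct RData *next;
    const char   *data;
    uint32_t      len;
} RData;

/*
 * Copy all chunks of the chain into dst in chain order. Returns the number
 * of bytes copied, or 0 if dst is too small to hold the whole chain.
 */
size_t
copy_chain(const RData *head, char *dst, size_t dst_size)
{
    size_t      written = 0;

    for (const RData *r = head; r != NULL; r = r->next)
    {
        if (r->len > dst_size - written)
            return 0;
        memcpy(dst + written, r->data, r->len);
        written += r->len;
    }
    return written;
}
```

With four nodes standing in for the XLOG header, xl_heap_header, the tuple data, and xl_heap_insert, the output buffer contains the four parts back to back, which is the layout described in the composition section.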
XLogInsertRecord
Finally, let's look at XLogInsertRecord, the function that actually writes the XLOG record into the log buffer.
/*
* Insert an XLOG record represented by an already-constructed chain of data
* chunks. This is a low-level routine; to construct the WAL record header
* and data, use the higher-level routines in xloginsert.c.
*
* If 'fpw_lsn' is valid, it is the oldest LSN among the pages that this
* WAL record applies to, that were not included in the record as full page
* images. If fpw_lsn >= RedoRecPtr, the function does not perform the
* insertion and returns InvalidXLogRecPtr. The caller can then recalculate
* which pages need a full-page image, and retry. If fpw_lsn is invalid, the
* record is always inserted.
*
* The first XLogRecData in the chain must be for the record header, and its
* data must be MAXALIGNed. XLogInsertRecord fills in the xl_prev and
* xl_crc fields in the header, the rest of the header must already be filled
* by the caller.
*
* Returns XLOG pointer to end of record (beginning of next record).
* This can be used as LSN for data pages affected by the logged action.
* (LSN is the XLOG point up to which the XLOG must be flushed to disk
* before the data page can be written out. This implements the basic
* WAL rule "write the log before the data".)
*/
XLogRecPtr
XLogInsertRecord(XLogRecData *rdata, XLogRecPtr fpw_lsn)
{
XLogCtlInsert *Insert = &XLogCtl->Insert;
pg_crc32c rdata_crc;
bool inserted;
XLogRecord *rechdr = (XLogRecord *) rdata->data;
bool isLogSwitch = (rechdr->xl_rmid == RM_XLOG_ID &&
rechdr->xl_info == XLOG_SWITCH);
XLogRecPtr StartPos;
XLogRecPtr EndPos;
/* we assume that all of the record header is in the first chunk */
Assert(rdata->len >= SizeOfXLogRecord);
/* cross-check on whether we should be here or not */
if (!XLogInsertAllowed())
elog(ERROR, "cannot make new WAL entries during recovery");
/*----------
*
* We have now done all the preparatory work we can without holding a
* lock or modifying shared state. From here on, inserting the new WAL
* record to the shared WAL buffer cache is a two-step process:
*
* 1. Reserve the right amount of space from the WAL. The current head of
* reserved space is kept in Insert->CurrBytePos, and is protected by
* insertpos_lck.
*
* 2. Copy the record to the reserved WAL space. This involves finding the
* correct WAL buffer containing the reserved space, and copying the
* record in place. This can be done concurrently in multiple processes.
*
* To keep track of which insertions are still in-progress, each concurrent
* inserter acquires an insertion lock. In addition to just indicating that
* an insertion is in progress, the lock tells others how far the inserter
* has progressed. There is a small fixed number of insertion locks,
* determined by NUM_XLOGINSERT_LOCKS. When an inserter crosses a page
* boundary, it updates the value stored in the lock to the how far it has
* inserted, to allow the previous buffer to be flushed.
*
* Holding onto an insertion lock also protects RedoRecPtr and
* fullPageWrites from changing until the insertion is finished.
*
* Step 2 can usually be done completely in parallel. If the required WAL
* page is not initialized yet, you have to grab WALBufMappingLock to
* initialize it, but the WAL writer tries to do that ahead of insertions
* to avoid that from happening in the critical path.
*
*----------
*/
START_CRIT_SECTION();
if (isLogSwitch)
WALInsertLockAcquireExclusive();
else
WALInsertLockAcquire();
/*
* Check to see if my copy of RedoRecPtr or doPageWrites is out of date.
* If so, may have to go back and have the caller recompute everything.
* This can only happen just after a checkpoint, so it's better to be slow
* in this case and fast otherwise.
*
* If we aren't doing full-page writes then RedoRecPtr doesn't actually
* affect the contents of the XLOG record, so we'll update our local copy
* but not force a recomputation. (If doPageWrites was just turned off,
* we could recompute the record without full pages, but we choose not to
* bother.)
*/
if (RedoRecPtr != Insert->RedoRecPtr)
{
Assert(RedoRecPtr < Insert->RedoRecPtr);
RedoRecPtr = Insert->RedoRecPtr;
}
doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)
{
/*
* Oops, some buffer now needs to be backed up that the caller didn't
* back up. Start over.
*/
WALInsertLockRelease();
END_CRIT_SECTION();
return InvalidXLogRecPtr;
}
/*
* Reserve space for the record in the WAL. This also sets the xl_prev
* pointer.
*/
if (isLogSwitch)
inserted = ReserveXLogSwitch(&StartPos, &EndPos, &rechdr->xl_prev);
else
{
ReserveXLogInsertLocation(rechdr->xl_tot_len, &StartPos, &EndPos,
&rechdr->xl_prev);
inserted = true;
}
if (inserted)
{
/*
* Now that xl_prev has been filled in, calculate CRC of the record
* header.
*/
rdata_crc = rechdr->xl_crc;
COMP_CRC32C(rdata_crc, rechdr, offsetof(XLogRecord, xl_crc));
FIN_CRC32C(rdata_crc);
rechdr->xl_crc = rdata_crc;
/*
* All the record data, including the header, is now ready to be
* inserted. Copy the record in the space reserved.
*/
CopyXLogRecordToWAL(rechdr->xl_tot_len, isLogSwitch, rdata,
StartPos, EndPos);
}
else
{
/*
* This was an xlog-switch record, but the current insert location was
* already exactly at the beginning of a segment, so there was no need
* to do anything.
*/
}
/*
* Done! Let others know that we're finished.
*/
WALInsertLockRelease();
MarkCurrentTransactionIdLoggedIfAny();
END_CRIT_SECTION();
/*
* Update shared LogwrtRqst.Write, if we crossed page boundary.
*/
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
{
SpinLockAcquire(&XLogCtl->info_lck);
/* advance global request to include new block(s) */
if (XLogCtl->LogwrtRqst.Write < EndPos)
XLogCtl->LogwrtRqst.Write = EndPos;
/* update local result copy while I have the chance */
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);
}
/*
* If this was an XLOG_SWITCH record, flush the record and the empty
* padding space that fills the rest of the segment, and perform
* end-of-segment actions (eg, notifying archiver).
*/
if (isLogSwitch)
{
TRACE_POSTGRESQL_XLOG_SWITCH();
XLogFlush(EndPos);
/*
* Even though we reserved the rest of the segment for us, which is
* reflected in EndPos, we return a pointer to just the end of the
* xlog-switch record.
*/
if (inserted)
{
EndPos = StartPos + SizeOfXLogRecord;
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
{
if (EndPos % XLOG_SEG_SIZE == EndPos % XLOG_BLCKSZ)
EndPos += SizeOfXLogLongPHD;
else
EndPos += SizeOfXLogShortPHD;
}
}
}
#ifdef WAL_DEBUG
if (XLOG_DEBUG)
{
static XLogReaderState *debug_reader = NULL;
StringInfoData buf;
StringInfoData recordBuf;
char *errormsg = NULL;
MemoryContext oldCxt;
oldCxt = MemoryContextSwitchTo(walDebugCxt);
initStringInfo(&buf);
appendStringInfo(&buf, "INSERT @ %X/%X: ",
(uint32) (EndPos >> 32), (uint32) EndPos);
/*
* We have to piece together the WAL record data from the XLogRecData
* entries, so that we can pass it to the rm_desc function as one
* contiguous chunk.
*/
initStringInfo(&recordBuf);
for (; rdata != NULL; rdata = rdata->next)
appendBinaryStringInfo(&recordBuf, rdata->data, rdata->len);
if (!debug_reader)
debug_reader = XLogReaderAllocate(NULL, NULL);
if (!debug_reader)
{
appendStringInfoString(&buf, "error decoding record: out of memory");
}
else if (!DecodeXLogRecord(debug_reader, (XLogRecord *) recordBuf.data,
&errormsg))
{
appendStringInfo(&buf, "error decoding record: %s",
errormsg ? errormsg : "no error message");
}
else
{
appendStringInfoString(&buf, " - ");
xlog_outdesc(&buf, debug_reader);
}
elog(LOG, "%s", buf.data);
pfree(buf.data);
pfree(recordBuf.data);
MemoryContextSwitchTo(oldCxt);
}
#endif
/*
* Update our global variables
*/
ProcLastRecPtr = StartPos;
XactLastRecEnd = EndPos;
return EndPos;
}
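XLogInsertRecord relies on two important helper functions:
- ReserveXLogInsertLocation: requests space from the log buffer, reserving room for the XLOG record that is about to be written.
- CopyXLogRecordToWAL: writes rdata, that is, the contents of the XLogRecData linked list, into the log buffer.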
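As a rough illustration of this reserve-then-copy split, here is a minimal single-threaded sketch. It is not PostgreSQL code: ToyLog, toy_reserve, toy_copy and toy_insert are hypothetical names, and locking, page headers and alignment are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUF_SIZE 256

/* Hypothetical stand-in for the log buffer plus its insert cursor. */
typedef struct ToyLog
{
    char   buf[TOY_BUF_SIZE];
    size_t curr;               /* head of the reserved space */
} ToyLog;

/* Step 1: reserve len bytes. Returns 0 on success, -1 if the buffer is full. */
static int
toy_reserve(ToyLog *log, size_t len, size_t *start, size_t *end)
{
    if (len > TOY_BUF_SIZE - log->curr)
        return -1;
    *start = log->curr;
    *end = log->curr + len;
    log->curr = *end;          /* later inserters start after this record */
    return 0;
}

/* Step 2: copy the record into the range that was reserved for it. */
static void
toy_copy(ToyLog *log, size_t start, const void *data, size_t len)
{
    assert(start + len <= TOY_BUF_SIZE);
    memcpy(log->buf + start, data, len);
}

/* Reserve, then copy. Returns the end position of the record, or -1. */
static long
toy_insert(ToyLog *log, const void *data, size_t len)
{
    size_t start;
    size_t end;

    if (toy_reserve(log, len, &start, &end) != 0)
        return -1;
    toy_copy(log, start, data, len);
    return (long) end;
}
```

On an empty log, toy_insert(&log, "abc", 3) reserves bytes 0 to 2 and returns 3, the position where the next record starts. This mirrors how XLogInsertRecord returns EndPos.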
Let's look at each of these two functions in turn.
ReserveXLogInsertLocation
/*
 * Reserves the right amount of space for a record of given size from the WAL.
 * *StartPos is set to the beginning of the reserved section, *EndPos to
 * its end+1. *PrevPtr is set to the beginning of the previous record; it is
 * used to set the xl_prev of this record.
 *
 * This is the performance critical part of XLogInsert that must be serialized
 * across backends. The rest can happen mostly in parallel. Try to keep this
 * section as short as possible, insertpos_lck can be heavily contended on a
 * busy system.
 *
 * NB: The space calculation here must match the code in CopyXLogRecordToWAL,
 * where we actually copy the record to the reserved space.
 */
static void
ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
                          XLogRecPtr *PrevPtr)
{
    XLogCtlInsert *Insert = &XLogCtl->Insert;
    uint64      startbytepos;
    uint64      endbytepos;
    uint64      prevbytepos;

    size = MAXALIGN(size);

    /* All (non xlog-switch) records should contain data. */
    Assert(size > SizeOfXLogRecord);

    /*
     * The duration the spinlock needs to be held is minimized by minimizing
     * the calculations that have to be done while holding the lock. The
     * current tip of reserved WAL is kept in CurrBytePos, as a byte position
     * that only counts "usable" bytes in WAL, that is, it excludes all WAL
     * page headers. The mapping between "usable" byte positions and physical
     * positions (XLogRecPtrs) can be done outside the locked region, and
     * because the usable byte position doesn't include any headers, reserving
     * X bytes from WAL is almost as simple as "CurrBytePos += X".
     */
    // acquire the spinlock
    SpinLockAcquire(&Insert->insertpos_lck);

    // reserve the space
    startbytepos = Insert->CurrBytePos;
    endbytepos = startbytepos + size;
    prevbytepos = Insert->PrevBytePos;
    Insert->CurrBytePos = endbytepos;
    Insert->PrevBytePos = startbytepos;

    // release the spinlock
    SpinLockRelease(&Insert->insertpos_lck);

    // return the write positions
    *StartPos = XLogBytePosToRecPtr(startbytepos);
    *EndPos = XLogBytePosToEndRecPtr(endbytepos);
    *PrevPtr = XLogBytePosToRecPtr(prevbytepos);

    /*
     * Check that the conversions between "usable byte positions" and
     * XLogRecPtrs work consistently in both directions.
     */
    Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
    Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
    Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
}
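The core of this function is the part that runs from SpinLockAcquire through the three position conversions into *StartPos, *EndPos and *PrevPtr. Insert points to the XLogCtlInsert structure embedded in the global XLogCtl object, so it is shared by all backends. Its CurrBytePos and PrevBytePos fields control where writes into the log buffer go: CurrBytePos is the head of the reserved space, and PrevBytePos is the start of the previous record. Once the space is reserved, ReserveXLogInsertLocation returns the write position of the XLOG record through its output parameters.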
Note:
XLogBytePosToRecPtr is a very clever piece of design, and its details are left for "XLOG 2.0". For now it is enough to know that it returns an XLOG write position.
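The reservation logic on its own can be sketched as follows. This is a simplified model under stated assumptions, not the PostgreSQL implementation: ToyInsert and toy_reserve_location are hypothetical names, a C11 atomic_flag stands in for PostgreSQL's spinlock, the alignment is assumed to be 8 bytes, and positions stay as plain "usable byte" offsets with no conversion to XLogRecPtr.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define TOY_ALIGN 8   /* assumed alignment; the real MAXALIGN is platform dependent */
#define TOY_MAXALIGN(n) \
    (((uint64_t) (n) + (TOY_ALIGN - 1)) & ~(uint64_t) (TOY_ALIGN - 1))

/* Hypothetical counterpart of the fields used from XLogCtlInsert. */
typedef struct ToyInsert
{
    atomic_flag lock;           /* stands in for insertpos_lck */
    uint64_t    curr_byte_pos;  /* head of the reserved space */
    uint64_t    prev_byte_pos;  /* start of the previously reserved record */
} ToyInsert;

static void
toy_reserve_location(ToyInsert *ins, uint64_t size,
                     uint64_t *start, uint64_t *end, uint64_t *prev)
{
    uint64_t startpos;
    uint64_t endpos;
    uint64_t prevpos;

    size = TOY_MAXALIGN(size);
    assert(size > 0);

    /* Keep the locked region as short as possible: only cursor updates. */
    while (atomic_flag_test_and_set_explicit(&ins->lock, memory_order_acquire))
        ;                       /* spin until the lock is free */
    startpos = ins->curr_byte_pos;
    endpos = startpos + size;
    prevpos = ins->prev_byte_pos;
    ins->curr_byte_pos = endpos;
    ins->prev_byte_pos = startpos;
    atomic_flag_clear_explicit(&ins->lock, memory_order_release);

    /* Anything else, such as position mapping, can happen outside the lock. */
    *start = startpos;
    *end = endpos;
    *prev = prevpos;
}
```

Reserving 30 bytes from a fresh ToyInsert yields the range 0 to 32 because of the alignment, and each later reservation receives the start of the one before it as prev, which is the role xl_prev plays in the real record header.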
CopyXLogRecordToWAL
Once ReserveXLogInsertLocation has reserved the space, CopyXLogRecordToWAL can be called to perform the actual write. The code of CopyXLogRecordToWAL is as follows:
/*
* Subroutine of XLogInsertRecord. Copies a WAL record to an already-reserved
* area in the WAL.
*/
static void
CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
XLogRecPtr StartPos, XLogRecPtr EndPos)
{
char *currpos;
int freespace;
int written;
XLogRecPtr CurrPos;
XLogPageHeader pagehdr;
/*
* Get a pointer to the right place in the right WAL buffer to start
* inserting to.
*/
CurrPos = StartPos;
currpos = GetXLogBuffer(CurrPos);
freespace = INSERT_FREESPACE(CurrPos);
/*
* there should be enough space for at least the first field (xl_tot_len)
* on this page.
*/
Assert(freespace >= sizeof(uint32));
/* Copy record data */
written = 0;
while (rdata != NULL)
{
char *rdata_data = rdata->data;
int rdata_len = rdata->len;
while (rdata_len > freespace)
{
/*
* Write what fits on this page, and continue on the next page.
*/
Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || freespace == 0);
memcpy(currpos, rdata_data, freespace);
rdata_data += freespace;
rdata_len -= freespace;
written += freespace;
CurrPos += freespace;
/*
* Get pointer to beginning of next page, and set the xlp_rem_len
* in the page header. Set XLP_FIRST_IS_CONTRECORD.
*
* It's safe to set the contrecord flag and xlp_rem_len without a
* lock on the page. All the other flags were already set when the
* page was initialized, in AdvanceXLInsertBuffer, and we're the
* only backend that needs to set the contrecord flag.
*/
currpos = GetXLogBuffer(CurrPos);
pagehdr = (XLogPageHeader) currpos;
pagehdr->xlp_rem_len = write_len - written;
pagehdr->xlp_info |= XLP_FIRST_IS_CONTRECORD;
/* skip over the page header */
if (CurrPos % XLogSegSize == 0)
{
CurrPos += SizeOfXLogLongPHD;
currpos += SizeOfXLogLongPHD;
}
else
{
CurrPos += SizeOfXLogShortPHD;
currpos += SizeOfXLogShortPHD;
}
freespace = INSERT_FREESPACE(CurrPos);
}
Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
memcpy(currpos, rdata_data, rdata_len);
currpos += rdata_len;
CurrPos += rdata_len;
freespace -= rdata_len;
written += rdata_len;
rdata = rdata->next;
}
Assert(written == write_len);
/*
* If this was an xlog-switch, it's not enough to write the switch record,
* we also have to consume all the remaining space in the WAL segment. We
* have already reserved it for us, but we still need to make sure it's
* allocated and zeroed in the WAL buffers so that when the caller (or
* someone else) does XLogWrite(), it can really write out all the zeros.
*/
if (isLogSwitch && CurrPos % XLOG_SEG_SIZE != 0)
{
/* An xlog-switch record doesn't contain any data besides the header */
Assert(write_len == SizeOfXLogRecord);
/*
* We do this one page at a time, to make sure we don't deadlock
* against ourselves if wal_buffers < XLOG_SEG_SIZE.
*/
Assert(EndPos % XLogSegSize == 0);
/* Use up all the remaining space on the first page */
CurrPos += freespace;
while (CurrPos < EndPos)
{
/* initialize the next page (if not initialized already) */
WALInsertLockUpdateInsertingAt(CurrPos);
AdvanceXLInsertBuffer(CurrPos, false);
CurrPos += XLOG_BLCKSZ;
}
}
else
{
/* Align the end position, so that the next record starts aligned */
CurrPos = MAXALIGN64(CurrPos);
}
if (CurrPos != EndPos)
elog(PANIC, "space reserved for WAL record does not match what was written");
}
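Let's first go through the important parameters of this function:
- write_len: the total length of the XLOG record, used as a sanity check.
- rdata: the XLogRecData linked list, which holds the four parts of the XLOG record.
- StartPos: the position at which the XLOG record is written.
- EndPos: the end position of the XLOG record, used as a sanity check.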
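To make the relationship between rdata and write_len concrete, here is a small sketch that walks a chain of chunks, copies them into one contiguous buffer, and reports how many bytes were written. ToyRecData and toy_flatten are hypothetical names modelled loosely on the XLogRecData list. The sketch ignores WAL pages entirely and only shows the length check.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical chunk list: each node points at one piece of the record. */
typedef struct ToyRecData
{
    struct ToyRecData *next;   /* next chunk, or NULL at the end of the chain */
    const char        *data;   /* start of this chunk's bytes */
    uint32_t           len;    /* length of this chunk */
} ToyRecData;

/* Copies every chunk into dst (capacity cap) and returns the bytes written. */
static uint32_t
toy_flatten(const ToyRecData *rdata, char *dst, uint32_t cap)
{
    uint32_t written = 0;

    for (; rdata != NULL; rdata = rdata->next)
    {
        assert(rdata->len <= cap - written);   /* chunk must fit in dst */
        memcpy(dst + written, rdata->data, rdata->len);
        written += rdata->len;
    }
    return written;
}
```

A caller that knows the expected total length can compare it with the return value, in the same spirit as the Assert(written == write_len) check in CopyXLogRecordToWAL.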
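The core of this function is the outer while (rdata != NULL) loop, which walks the rdata list and copies each chunk to the location that CurrPos points to. The main part of that loop is: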
while (rdata != NULL)
{
    char       *rdata_data = rdata->data;
    int         rdata_len = rdata->len;

    while (rdata_len > freespace)
    {
        /*
         * Write what fits on this page, and continue on the next page.
         * (omitted)
         */
    }
    Assert(CurrPos % XLOG_BLCKSZ >= SizeOfXLogShortPHD || rdata_len == 0);
    memcpy(currpos, rdata_data, rdata_len);
    currpos += rdata_len;
    CurrPos += rdata_len;
    freespace -= rdata_len;
    written += rdata_len;
    rdata = rdata->next;
}
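The inner while (rdata_len > freespace) loop handles the case where the data still to be written is longer than the free space left on the current page of the log buffer. In that case, the part that fits is written to the current page first, and the copy then moves on to the next page, where it sets xlp_rem_len and the XLP_FIRST_IS_CONTRECORD flag in the page header and skips over that header before continuing.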
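The page-splitting behaviour of that inner loop can be sketched with toy sizes. The following assumes 16-byte pages that each begin with a 4-byte header. TOY_PAGE_SIZE, TOY_HDR_SIZE, toy_freespace and toy_copy_chunk are hypothetical names, the header is simply filled with '#' instead of a real XLogPageHeader, and the long/short header distinction is ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 16   /* assumed page size, far smaller than XLOG_BLCKSZ */
#define TOY_HDR_SIZE  4    /* assumed size of the header at the start of a page */

/* Free bytes left on the page containing pos; 0 when pos is on a page boundary. */
static size_t
toy_freespace(size_t pos)
{
    size_t off = pos % TOY_PAGE_SIZE;

    return (off == 0) ? 0 : TOY_PAGE_SIZE - off;
}

/*
 * Copies len bytes of data into buf starting at pos, splitting the data
 * across pages as needed. Returns the position just past the last byte.
 * The caller must make sure that buf is large enough.
 */
static size_t
toy_copy_chunk(char *buf, size_t pos, const char *data, size_t len)
{
    size_t freespace = toy_freespace(pos);

    while (len > freespace)
    {
        /* Write what fits on this page. */
        memcpy(buf + pos, data, freespace);
        data += freespace;
        len -= freespace;
        pos += freespace;

        /* Now at the start of the next page: mark and skip its header. */
        assert(pos % TOY_PAGE_SIZE == 0);
        memset(buf + pos, '#', TOY_HDR_SIZE);
        pos += TOY_HDR_SIZE;
        freespace = toy_freespace(pos);
    }

    /* The remainder fits on the current page. */
    memcpy(buf + pos, data, len);
    return pos + len;
}
```

Writing the 10 bytes "ABCDEFGHIJ" at position 10 places "ABCDEF" at the end of the first page, a header at positions 16 to 19, and "GHIJ" at positions 20 to 23, so the function returns 24.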
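Closing remarks
At this point we have walked through the main flow by which an insert operation writes its XLOG record. Along the way we deliberately set aside some problems, such as how to deal with partial writes, and skipped some interesting details, such as how XLogBytePosToRecPtr is implemented. These topics are covered in "PostgreSQL Restart Recovery: XLOG 2.0".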