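In PG, the page is the smallest unit of management both on disk and in memory; it is also what is commonly called a block. A PG page is normally 8 KB. The size can be set when compiling from source, but it cannot be changed afterwards, because many of PG's in-memory structures are designed around it.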
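Within a page, table records are stored starting from the bottom (the end) of the page and gradually grow upward. The page structure is shown in the figure below:
[Figure: page structure]
As the figure shows, a page consists of five main parts: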
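Page Header: the page header, which mainly stores the LSN and the start and end offsets of the free space in the page. It is covered in more detail below.
ItemId data: the index entries (line pointers) for the table records in the page. Each entry is 4 bytes and holds the record's offset within the page (lp_off, 15 bits), a small flags field (lp_flags, 2 bits), and the record's length (lp_len, 15 bits).
Free space: the space still available in the page. It does not include space held by records marked as deleted; it is only the completely unused space, that is, the part of the page that has not been allocated.
Item: the actual records stored for the table.
Special space: stores information for an index access method (AM: Access Method); the contents differ from one access method to another. For a table page this area is empty and holds no information.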
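The way the line pointers and the records grow toward each other can be sketched in a few lines of C. This is a minimal illustration, not PostgreSQL code: DemoItemId, demo_free_space, demo_add_item, DEMO_BLCKSZ and DEMO_LP_NORMAL are hypothetical names, the 8 KB page size is the default assumed above, and the alignment padding a real page would add is ignored.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLCKSZ     8192    /* assumed default page size */
#define DEMO_LP_NORMAL  1       /* assumed flag value meaning "in use" */

/* Hypothetical stand-in for one 4-byte line pointer (ItemId). */
typedef struct
{
    unsigned    lp_off:15;      /* offset of the record from the page start */
    unsigned    lp_flags:2;     /* state of this line pointer */
    unsigned    lp_len:15;      /* length of the record in bytes */
} DemoItemId;

/* Free space is the gap between the line pointers and the records. */
static unsigned
demo_free_space(uint16_t lower, uint16_t upper)
{
    assert(lower <= upper && upper <= DEMO_BLCKSZ);
    return (unsigned) (upper - lower);
}

/*
 * Place one record of len bytes. The record is carved off the top of the
 * free space (upper moves down) and its line pointer is appended after the
 * existing ones (lower moves up). Returns 1 on success, 0 if the record and
 * its line pointer do not fit.
 */
static int
demo_add_item(uint16_t *lower, uint16_t *upper, uint16_t len, DemoItemId *lp)
{
    if (len == 0 || len + sizeof(DemoItemId) > demo_free_space(*lower, *upper))
        return 0;

    *upper = (uint16_t) (*upper - len);
    lp->lp_off = *upper;
    lp->lp_flags = DEMO_LP_NORMAL;
    lp->lp_len = len;
    *lower = (uint16_t) (*lower + sizeof(DemoItemId));
    return 1;
}
```

Starting from an empty 8 KB table page (lower at 24, upper at 8192), adding a 100-byte record moves upper down by 100 and lower up by 4, so each record costs its own length plus one line pointer.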
The source is in src/backend/storage/page/bufpage.c, where a page is initialized by PageInit.
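What initialization has to establish follows from the layout described above: a zeroed page whose free space starts right after the 24-byte header and ends where the special space begins. The following is a simplified sketch of that idea, not the PostgreSQL routine itself. DemoPageHeader, demo_page_init and demo_page_header are hypothetical names, and the layout version 4 and the 8-byte rounding of the special space are assumptions of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HEADER_SIZE    24  /* fixed header size stated in the text */
#define DEMO_LAYOUT_VERSION 4   /* assumed page layout version */

/* Hypothetical mirror of the fixed part of the page header. */
typedef struct
{
    uint64_t    lsn;
    uint16_t    checksum;
    uint16_t    flags;
    uint16_t    lower;              /* start of free space */
    uint16_t    upper;              /* end of free space */
    uint16_t    special;            /* start of special space */
    uint16_t    pagesize_version;
    uint32_t    prune_xid;
} DemoPageHeader;

/* Zero the page and set the three offsets for an empty page. */
static void
demo_page_init(unsigned char *page, size_t page_size, size_t special_size)
{
    DemoPageHeader hdr;

    /* Round the special space up to 8 bytes (assumed alignment). */
    special_size = (special_size + 7) & ~(size_t) 7;
    assert(page_size % 256 == 0 && page_size <= 32768);
    assert(page_size > DEMO_HEADER_SIZE + special_size);

    memset(page, 0, page_size);
    memset(&hdr, 0, sizeof(hdr));
    hdr.lower = DEMO_HEADER_SIZE;
    hdr.upper = (uint16_t) (page_size - special_size);
    hdr.special = hdr.upper;
    hdr.pagesize_version = (uint16_t) (page_size | DEMO_LAYOUT_VERSION);
    memcpy(page, &hdr, sizeof(hdr));
}

/* Copy the header back out of the raw page bytes. */
static DemoPageHeader
demo_page_header(const unsigned char *page)
{
    DemoPageHeader hdr;

    memcpy(&hdr, page, sizeof(hdr));
    return hdr;
}
```

For a table page the special size is 0, so upper and special both equal the page size; an index page would pass the size of its access method's private data instead.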
The page header occupies 24 bytes. PageHeaderData is defined in the source (src/include/storage/bufpage.h) as follows:
/*
* disk page organization
*
* space management information generic to any page
*
* pd_lsn - identifies xlog record for last change to this page.
* pd_checksum - page checksum, if set.
* pd_flags - flag bits.
* pd_lower - offset to start of free space.
* pd_upper - offset to end of free space.
* pd_special - offset to start of special space.
* pd_pagesize_version - size in bytes and page layout version number.
* pd_prune_xid - oldest XID among potentially prunable tuples on page.
*
* The LSN is used by the buffer manager to enforce the basic rule of WAL:
* "thou shalt write xlog before data". A dirty buffer cannot be dumped
* to disk until xlog has been flushed at least as far as the page's LSN.
*
* pd_checksum stores the page checksum, if it has been set for this page;
* zero is a valid value for a checksum. If a checksum is not in use then
* we leave the field unset. This will typically mean the field is zero
* though non-zero values may also be present if databases have been
* pg_upgraded from releases prior to 9.3, when the same byte offset was
* used to store the current timelineid when the page was last updated.
* Note that there is no indication on a page as to whether the checksum
* is valid or not, a deliberate design choice which avoids the problem
* of relying on the page contents to decide whether to verify it. Hence
* there are no flag bits relating to checksums.
*
* pd_prune_xid is a hint field that helps determine whether pruning will be
* useful. It is currently unused in index pages.
*
* The page version number and page size are packed together into a single
* uint16 field. This is for historical reasons: before PostgreSQL 7.3,
* there was no concept of a page version number, and doing it this way
* lets us pretend that pre-7.3 databases have page version number zero.
* We constrain page sizes to be multiples of 256, leaving the low eight
* bits available for a version number.
*
* Minimum possible page size is perhaps 64B to fit page header, opaque space
* and a minimal tuple; of course, in reality you want it much bigger, so
* the constraint on pagesize mod 256 is not an important restriction.
* On the high end, we can only support pages up to 32KB because lp_off/lp_len
* are 15 bits.
*/
typedef struct PageHeaderData
{
    /* XXX LSN is member of *any* block, not only page-organized ones */
    PageXLogRecPtr pd_lsn;      /* LSN: next byte after last byte of xlog
                                 * record for last change to this page */
    uint16      pd_checksum;    /* checksum */
    uint16      pd_flags;       /* flag bits, see below */
    LocationIndex pd_lower;     /* offset to start of free space */
    LocationIndex pd_upper;     /* offset to end of free space */
    LocationIndex pd_special;   /* offset to start of special space */
    uint16      pd_pagesize_version;
    TransactionId pd_prune_xid; /* oldest prunable XID, or zero if none */
    ItemIdData  pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* line pointer array */
} PageHeaderData;

typedef PageHeaderData *PageHeader;
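The comment above explains that pd_pagesize_version packs the page size and the layout version into one 16-bit field: page sizes are multiples of 256, so the low eight bits are free for the version. A minimal sketch of that packing follows; demo_pack, demo_page_size and demo_layout_version are hypothetical helper names, not the names used in the PostgreSQL source.

```c
#include <assert.h>
#include <stdint.h>

/* Combine a page size (a multiple of 256) with a version number. */
static uint16_t
demo_pack(uint16_t page_size, uint8_t version)
{
    assert((page_size & 0x00FF) == 0);  /* low byte must be free */
    return (uint16_t) (page_size | version);
}

/* The high eight bits carry the page size. */
static uint16_t
demo_page_size(uint16_t packed)
{
    return (uint16_t) (packed & 0xFF00);
}

/* The low eight bits carry the layout version. */
static uint8_t
demo_layout_version(uint16_t packed)
{
    return (uint8_t) (packed & 0x00FF);
}
```

An 8192-byte page with layout version 4 packs to 0x2004, and both values are recovered with a single mask each.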
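Here, PageXLogRecPtr is a 64-bit struct. The xlog position is recorded in the page for two reasons:
- To uphold the buffer manager's WAL rule: the log is written before the data.
- When a dirty block is written out at a checkpoint, the log is flushed to disk first, at least as far as the page's LSN.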
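The rule in the two points above can be sketched as a guard in front of every data page write. This is an illustration under stated assumptions, not the buffer manager's code: DemoWal, demo_wal_flush and demo_write_dirty_page are hypothetical names, and LSNs are modeled as plain 64-bit integers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical WAL state: how far the log is durable on disk. */
typedef struct
{
    uint64_t    flushed_upto;   /* log is on disk up to this LSN */
    unsigned    flush_calls;    /* number of flushes actually performed */
} DemoWal;

/* Flush the log up to the given LSN if it is not already on disk. */
static void
demo_wal_flush(DemoWal *wal, uint64_t upto)
{
    if (upto > wal->flushed_upto)
    {
        wal->flushed_upto = upto;
        wal->flush_calls++;
    }
}

/* Write a dirty page, flushing the log first when the page is ahead of it. */
static void
demo_write_dirty_page(DemoWal *wal, uint64_t page_lsn)
{
    demo_wal_flush(wal, page_lsn);
    assert(wal->flushed_upto >= page_lsn);  /* log before data */
    /* the data page itself would be written here */
}
```

A page whose LSN is already covered by the flushed log is written without extra log I/O; only a page that is ahead of the log forces a flush.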
/*
* For historical reasons, the 64-bit LSN value is stored as two 32-bit
* values.
*/
typedef struct
{
    uint32      xlogid;         /* high bits */
    uint32      xrecoff;        /* low bits */
} PageXLogRecPtr;
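Since the LSN is stored as two 32-bit halves, it has to be reassembled before it can be compared as a single 64-bit value. A minimal sketch of that conversion, using the hypothetical names DemoPageXLogRecPtr, demo_lsn_get and demo_lsn_set:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical copy of the two-halves layout shown above. */
typedef struct
{
    uint32_t    xlogid;         /* high bits */
    uint32_t    xrecoff;        /* low bits */
} DemoPageXLogRecPtr;

/* Join the two halves into one 64-bit LSN. */
static uint64_t
demo_lsn_get(DemoPageXLogRecPtr ptr)
{
    return ((uint64_t) ptr.xlogid << 32) | ptr.xrecoff;
}

/* Split a 64-bit LSN back into its two stored halves. */
static DemoPageXLogRecPtr
demo_lsn_set(uint64_t lsn)
{
    DemoPageXLogRecPtr ptr;

    ptr.xlogid = (uint32_t) (lsn >> 32);
    ptr.xrecoff = (uint32_t) lsn;
    return ptr;
}
```

The cast to a 64-bit type before the shift matters: shifting a 32-bit value left by 32 bits is undefined in C.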
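Overall, the page structure in PG is broadly similar to the Oracle block structure: both store records growing upward from the bottom. The details differ considerably, however. An Oracle block, for example, also carries transaction-related information such as the ITL.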