PG数据库内核分析学习笔记_MultiXact日志管理器
MultiXact日志是PG系统用来记录组合事务ID的一种日志. 由于PG采用了多版本并发控制, 因此同一个元组相关的事务ID可能有多个, 为了在加锁(行共享锁)的时候统一操作, PG将与该元组相关联的多个事务ID组合起来用一个MultiXactID代替来管理.同CLOG, SubTrans日志一样, MultiXact日志也是利用SLRU缓冲区来实现.
typedef uint32 TransactionId;
/* MultiXactId must be equivalent to TransactionId, to fit in t_xmax */
typedef TransactionId MultiXactId;
1 MultiXact日志管理器相关数据结构
MultiXact是一个多对一的映射关系, 需要在事务ID数组中标记哪一段映射到一个MultiXactID(如图). 所以在映射的过程中需要存储两种信息, 即需要标识一段事务ID的偏移量(Offset), 还需要记录这段偏移量的大小(NMembers).
图 MultiXactID组合关系
由于需要对MultiXactID的分配进行维护, 于是定义了数据结构MultiXactStateData(数据结构如图所示).
数据结构 MultiXactStateData
// multixact.c
/*
* MultiXact state shared across all backends. All this state is protected
* by MultiXactGenLock. (We also use MultiXactOffsetControlLock and
* MultiXactMemberControlLock to guard accesses to the two sets of SLRU
* buffers. For concurrency's sake, we avoid holding more than one of these
* locks at a time.)
*/
typedef struct MultiXactStateData
{
/* 下一个可分配的 MultiXactId */
MultiXactId nextMXact;
/* 下一个对应MultiXactId的起始偏移量 */
MultiXactOffset nextOffset;
/* Have we completed multixact startup? */
bool finishedStartup;
/*
* Oldest multixact that is still potentially referenced by a relation.
* Anything older than this should not be consulted. These values are
* updated by vacuum.
*/
MultiXactId oldestMultiXactId;
Oid oldestMultiXactDB;
/*
* Oldest multixact offset that is potentially referenced by a multixact
* referenced by a relation. We don't always know this value, so there's
* a flag here to indicate whether or not we currently do.
*/
MultiXactOffset oldestOffset;
bool oldestOffsetKnown;
/* support for anti-wraparound measures */
MultiXactId multiVacLimit;
MultiXactId multiWarnLimit;
MultiXactId multiStopLimit;
MultiXactId multiWrapLimit;
/* support for members anti-wraparound measures */
MultiXactOffset offsetStopLimit; /* known if oldestOffsetKnown */
/*
* Per-backend data starts here. We have two arrays stored in the area
* immediately following the MultiXactStateData struct. Each is indexed by
* BackendId.
*
* In both arrays, there's a slot for all normal backends (1..MaxBackends)
* followed by a slot for max_prepared_xacts prepared transactions. Valid
* BackendIds start from 1; element zero of each array is never used.
*
* OldestMemberMXactId[k] is the oldest MultiXactId each backend's current
* transaction(s) could possibly be a member of, or InvalidMultiXactId
* when the backend has no live transaction that could possibly be a
* member of a MultiXact. Each backend sets its entry to the current
* nextMXact counter just before first acquiring a shared lock in a given
* transaction, and clears it at transaction end. (This works because only
* during or after acquiring a shared lock could an XID possibly become a
* member of a MultiXact, and that MultiXact would have to be created
* during or after the lock acquisition.)
*
* OldestVisibleMXactId[k] is the oldest MultiXactId each backend's
* current transaction(s) think is potentially live, or InvalidMultiXactId
* when not in a transaction or not in a transaction that's paid any
* attention to MultiXacts yet. This is computed when first needed in a
* given transaction, and cleared at transaction end. We can compute it
* as the minimum of the valid OldestMemberMXactId[] entries at the time
* we compute it (using nextMXact if none are valid). Each backend is
* required not to attempt to access any SLRU data for MultiXactIds older
* than its own OldestVisibleMXactId[] setting; this is necessary because
* the checkpointer could truncate away such data at any instant.
*
* The oldest valid value among all of the OldestMemberMXactId[] and
* OldestVisibleMXactId[] entries is considered by vacuum as the earliest
* possible value still having any live member transaction. Subtracting
* vacuum_multixact_freeze_min_age from that value we obtain the freezing
* point for multixacts for that table. Any value older than that is
* removed from tuple headers (or "frozen"; see FreezeMultiXactId. Note
* that multis that have member xids that are older than the cutoff point
* for xids must also be frozen, even if the multis themselves are newer
* than the multixid cutoff point). Whenever a full table vacuum happens,
* the freezing point so computed is used as the new pg_class.relminmxid
* value. The minimum of all those values in a database is stored as
* pg_database.datminmxid. In turn, the minimum of all of those values is
* stored in pg_control and used as truncation point for pg_multixact. At
* checkpoint or restartpoint, unneeded segments are removed.
*/
MultiXactId perBackendXactIds[FLEXIBLE_ARRAY_MEMBER];
} MultiXactStateData;
MultiXactStateData用来管理和维护MultiXactID, 另外还需要定义一个统一的接口来操作MultiXactID, 即存储这种多对一的映射关系, PG定义mXactCacheEnt来完成此工作.
/*
* Definitions for the backend-local MultiXactId cache.
*
* We use this cache to store known MultiXacts, so we don't need to go to
* SLRU areas every time.
*
* The cache lasts for the duration of a single transaction, the rationale
* for this being that most entries will contain our own TransactionId and
* so they will be uninteresting by the time our next transaction starts.
* (XXX not clear that this is correct --- other members of the MultiXact
* could hang around longer than we did. However, it's not clear what a
* better policy for flushing old cache entries would be.) FIXME actually
* this is plain wrong now that multixact's may contain update Xids.
*
* We allocate the cache entries in a memory context that is deleted at
* transaction end, so we don't need to do retail freeing of entries.
*/
typedef struct mXactCacheEnt
{
MultiXactId multi;
int nmembers;
dlist_node node;
MultiXactMember members[FLEXIBLE_ARRAY_MEMBER];
} mXactCacheEnt;
MultiXact日志存储在PGDATA/pg_multixact/目录下, 其中会有两个子目录members和offset, 分别存储XactID成员和偏移量.
[postgres@localhost ~/db1_normal/pg_multixact]$ ls
members offsets
[postgres@localhost ~/db1_normal/pg_multixact]$
从一个MultiXactID映射到具体的存储位置是通过下面的变换完成的:
/* We need four bytes per offset */
#define MULTIXACT_OFFSETS_PER_PAGE (BLCKSZ / sizeof(MultiXactOffset))
#define MultiXactIdToOffsetPage(xid) \
((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetEntry(xid) \
((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetSegment(xid) (MultiXactIdToOffsetPage(xid) / SLRU_PAGES_PER_SEGMENT)
/* page in which a member is to be found */
#define MXOffsetToMemberPage(xid) ((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_PAGE)
#define MXOffsetToMemberSegment(xid) (MXOffsetToMemberPage(xid) / SLRU_PAGES_PER_SEGMENT)
/* Location (byte offset within page) of flag word for a given member */
#define MXOffsetToFlagsOffset(xid) \
((((xid) / (TransactionId) MULTIXACT_MEMBERS_PER_MEMBERGROUP) % \
(TransactionId) MULTIXACT_MEMBERGROUPS_PER_PAGE) * \
(TransactionId) MULTIXACT_MEMBERGROUP_SIZE)
#define MXOffsetToFlagsBitShift(xid) \
(((xid) % (TransactionId) MULTIXACT_MEMBERS_PER_MEMBERGROUP) * \
MXACT_MEMBER_BITS_PER_XACT)
/* Location (byte offset within page) of TransactionId of given member */
#define MXOffsetToMemberOffset(xid) \
(MXOffsetToFlagsOffset(xid) + MULTIXACT_FLAGBYTES_PER_GROUP + \
((xid) % MULTIXACT_MEMBERS_PER_MEMBERGROUP) * sizeof(TransactionId))
MultiXact日志的缓冲区用两个SLRU缓冲区来实现, 分别是MultiXactOffsetCtl和MultiXactMemberCtl, 分别记录Members和Offsets. 在Postmaster启动后, 就注册在共享内存中, 管理全局的MultiXactID.
// multixact.c
/*
* Links to shared-memory data structures for MultiXact control
*/
static SlruCtlData MultiXactOffsetCtlData;
static SlruCtlData MultiXactMemberCtlData;
#define MultiXactOffsetCtl (&MultiXactOffsetCtlData)
#define MultiXactMemberCtl (&MultiXactMemberCtlData)
// slru.h
/*
* SlruCtlData is an unshared structure that points to the active information
* in shared memory.
*/
typedef struct SlruCtlData
{
SlruShared shared;
/*
* This flag tells whether to fsync writes (true for pg_xact and multixact
* stuff, false for pg_subtrans and pg_notify).
*/
bool do_fsync;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
* need to use comparison of TransactionIds here in order to do the right
* thing with wraparound XID arithmetic.
*/
bool (*PagePrecedes) (int, int);
/*
* Dir is set during SimpleLruInit and does not change thereafter. Since
* it's always the same, it doesn't need to be in shared memory.
*/
char Dir[64];
} SlruCtlData;
参考
<PG数据库内核分析> 7.11.4 MultiXact日志管理器