1 GetTransactionSnapshot()
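GetTransactionSnapshot first checks the global variable FirstSnapshotSet to decide whether the requested snapshot is the first one in the current transaction. If it is, the snapshot is obtained through GetSnapshotData (or GetSerializableTransactionSnapshot at the serializable level) and cached. At the read committed isolation level, the freshly obtained snapshot is returned directly and every later call takes a new one; at the repeatable read and serializable levels, later calls return the cached snapshot. The code flow is as follows: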
/*
* GetTransactionSnapshot
* Get the appropriate snapshot for a new query in a transaction.
*
* Note that the return value may point at static storage that will be modified
* by future calls and by CommandCounterIncrement(). Callers should call
* RegisterSnapshot or PushActiveSnapshot on the returned snap if it is to be
* used very long.
*/
Snapshot
GetTransactionSnapshot(void)
{
/*
* Return historic snapshot if doing logical decoding. We'll never need a
* non-historic transaction snapshot in this (sub-)transaction, so there's
* no need to be careful to set one up for later calls to
* GetTransactionSnapshot().
*/
// During logical decoding, return the corresponding HistoricSnapshot
if (HistoricSnapshotActive())
{
Assert(!FirstSnapshotSet);
return HistoricSnapshot;
}
/* First call in transaction? */
if (!FirstSnapshotSet)
{
/*
* Don't allow catalog snapshot to be older than xact snapshot. Must
* do this first to allow the empty-heap Assert to succeed.
*/
InvalidateCatalogSnapshot(); // invalidate the catalog snapshot
// The structure holding registered snapshots (a pairing heap) must be empty
Assert(pairingheap_is_empty(&RegisteredSnapshots));
Assert(FirstXactSnapshot == NULL);
// Taking a snapshot is not allowed during a parallel operation
if (IsInParallelMode())
elog(ERROR,
"cannot take query snapshot during a parallel operation");
/*
* In transaction-snapshot mode, the first snapshot must live until
* end of xact regardless of what the caller does with it, so we must
* make a copy of it rather than returning CurrentSnapshotData
* directly. Furthermore, if we're running in serializable mode,
* predicate.c needs to wrap the snapshot fetch in its own processing.
*/
if (IsolationUsesXactSnapshot())
{
/* First, create the snapshot in CurrentSnapshotData */
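if (IsolationIsSerializable()) // serializable isolation level
/* Besides taking the snapshot, serializable must also initialize the structures SSI needs, so it calls its own dedicated function */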
CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
else
CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); // repeatable read isolation level
/* Make a saved copy */
CurrentSnapshot = CopySnapshot(CurrentSnapshot); // make a copy and register it as FirstXactSnapshot
FirstXactSnapshot = CurrentSnapshot;
/* Mark it as "registered" in FirstXactSnapshot */
FirstXactSnapshot->regd_count++;
pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
}
else
CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); // read committed
FirstSnapshotSet = true; // set the flag so later calls skip this if block
return CurrentSnapshot;
}
if (IsolationUsesXactSnapshot()) // not the first snapshot: return the cached snapshot copy
return CurrentSnapshot;
/* Don't allow catalog snapshot to be older than xact snapshot. */
InvalidateCatalogSnapshot();
CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); // read committed: take a new snapshot
return CurrentSnapshot;
}
Transaction isolation levels:
/*
* Xact isolation levels
*/
#define XACT_READ_UNCOMMITTED 0 // read uncommitted
#define XACT_READ_COMMITTED 1 // read committed
#define XACT_REPEATABLE_READ 2 // repeatable read
#define XACT_SERIALIZABLE 3 // serializable
/*
* We implement three isolation levels internally.
* The two stronger ones use one snapshot per database transaction;
* the others use one snapshot per statement.
* Serializable uses predicate locks in addition to snapshots.
* These macros should be used to check which isolation level is selected.
*/
#define IsolationUsesXactSnapshot() (XactIsoLevel >= XACT_REPEATABLE_READ)
#define IsolationIsSerializable() (XactIsoLevel == XACT_SERIALIZABLE)
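The effect of these two macros on snapshot lifetime can be illustrated with a small toy model. This is only a sketch, not PostgreSQL code: every `toy_` name is hypothetical, the "snapshot" is reduced to a counter value, and the copy-and-register step that the real function performs for transaction-level snapshots is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Same numeric values as the XACT_* macros quoted above. */
enum { XACT_READ_COMMITTED = 1, XACT_REPEATABLE_READ = 2, XACT_SERIALIZABLE = 3 };

/* Hypothetical stand-in for a snapshot: it only records when it was taken. */
typedef struct ToySnapshot { unsigned long taken_at; } ToySnapshot;

static int           toy_iso_level = XACT_READ_COMMITTED;
static bool          toy_first_snapshot_set = false;
static unsigned long toy_clock = 0;   /* stand-in for "current database state" */
static ToySnapshot   toy_current;

/* Stand-in for GetSnapshotData(): every call observes a newer state. */
static ToySnapshot *toy_take_snapshot(void)
{
    toy_current.taken_at = ++toy_clock;
    return &toy_current;
}

/* Start a new toy transaction at the given isolation level. */
static void toy_begin(int iso_level)
{
    assert(iso_level >= XACT_READ_COMMITTED && iso_level <= XACT_SERIALIZABLE);
    toy_iso_level = iso_level;
    toy_first_snapshot_set = false;
}

static ToySnapshot *toy_get_transaction_snapshot(void)
{
    if (!toy_first_snapshot_set)
    {
        /* First call in the transaction: always take a snapshot. */
        toy_first_snapshot_set = true;
        return toy_take_snapshot();
    }
    /* Repeatable read and serializable keep the first snapshot. */
    if (toy_iso_level >= XACT_REPEATABLE_READ)
        return &toy_current;
    /* Read committed takes a fresh snapshot per call. */
    return toy_take_snapshot();
}
```

Calling `toy_get_transaction_snapshot()` twice after `toy_begin(XACT_READ_COMMITTED)` yields two different `taken_at` values, while after `toy_begin(XACT_REPEATABLE_READ)` both calls see the same value.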
2 GetSnapshotData()
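The main job of GetSnapshotData is to determine the snapshot's xmin, xmax, and xip, where xmin is the smallest transaction ID among all backends' running transactions.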
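In early versions, all of the relevant variables were stored in the PGPROC struct, which made PGPROC very large. Taking a snapshot requires scanning the PGPROC array (that is, every backend), and taking a snapshot is a high-frequency operation, so backends were constantly walking PGPROC entries and cache misses were common. PostgreSQL therefore extracted some of these variables from PGPROC into a separate PGXACT struct.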
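Even so, taking a snapshot remained a bottleneck for PostgreSQL under high concurrency. PostgreSQL 14 therefore added further optimizations around xmin and xid handling, including a family of GlobalVis* functions that decide whether a tuple can be removed.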
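ProcGlobal (a PROC_HDR struct) keeps a mirror of the xid stored in each PGPROC. Every PGPROC has a pgxactoff field, and the active transaction ID can be read as ProcGlobal->xids[PGPROC->pgxactoff]. The active transaction IDs are thus stored compactly in a single array, which avoids reading whole PGPROC entries and the cache misses that come with it.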
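The layout idea can be sketched in a few lines of C. This is a toy illustration, not the real PGPROC or PROC_HDR definitions: the `Toy*` types, the 256-byte padding, and the use of 0 for "no XID assigned" are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_MAX_PROCS 4

typedef uint32_t ToyXid;

/* Hypothetical wide per-backend struct; the padding stands in for the
 * many unrelated fields a real per-backend struct carries. */
typedef struct ToyProc
{
    char other_fields[256];
    int  pgxactoff;          /* index of this backend's slot in toy_xids[] */
} ToyProc;

static ToyProc toy_procs[TOY_MAX_PROCS];
static ToyXid  toy_xids[TOY_MAX_PROCS];  /* dense mirror; 0 means no XID */

/* Give every toy backend its own slot in the dense array. */
static void toy_init_procs(void)
{
    for (int i = 0; i < TOY_MAX_PROCS; i++)
    {
        toy_procs[i].pgxactoff = i;
        toy_xids[i] = 0;
    }
}

/* A backend publishes its XID through its pgxactoff slot. */
static void toy_set_xid(int procno, ToyXid xid)
{
    assert(procno >= 0 && procno < TOY_MAX_PROCS);
    toy_xids[toy_procs[procno].pgxactoff] = xid;
}

/* Count running XIDs by touching only the dense array, never toy_procs[]. */
static int toy_count_running(int nprocs)
{
    int n = 0;

    assert(nprocs >= 0 && nprocs <= TOY_MAX_PROCS);
    for (int off = 0; off < nprocs; off++)
        if (toy_xids[off] != 0)
            n++;
    return n;
}
```

The scan in `toy_count_running` reads `TOY_MAX_PROCS * sizeof(ToyXid)` contiguous bytes, whereas scanning `toy_procs[]` for the same information would stride across a much larger region of memory.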
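In read committed mode, every command in a transaction block would normally take a new snapshot. However, if no transaction has completed between two snapshot requests, the two snapshots must be identical, so the previous snapshot can be reused.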
/*
* Struct representing all kind of possible snapshots.
*
* There are several different kinds of snapshots:
* * Normal MVCC snapshots
* * MVCC snapshots taken during recovery (in Hot-Standby mode)
* * Historic MVCC snapshots used during logical decoding
* * snapshots passed to HeapTupleSatisfiesDirty()
* * snapshots passed to HeapTupleSatisfiesNonVacuumable()
* * snapshots used for SatisfiesAny, Toast, Self where no members are
* accessed.
*
* TODO: It's probably a good idea to split this struct using a NodeTag
* similar to how parser and executor nodes are handled, with one type for
* each different kind of snapshot to avoid overloading the meaning of
* individual fields.
*/
typedef struct SnapshotData
{
SnapshotType snapshot_type; /* type of snapshot */
/*
* The remaining fields are used only for MVCC snapshots, and are normally
* just zeroes in special snapshots. (But xmin and xmax are used
* specially by HeapTupleSatisfiesDirty, and xmin is used specially by
* HeapTupleSatisfiesNonVacuumable.)
*
* An MVCC snapshot can never see the effects of XIDs >= xmax. It can see
* the effects of all older XIDs except those listed in the snapshot. xmin
* is stored as an optimization to avoid needing to search the XID arrays
* for most tuples.
*/
TransactionId xmin; /* all XID < xmin are visible to me */
TransactionId xmax; /* all XID >= xmax are invisible to me */
/*
* For normal MVCC snapshot this contains the all xact IDs that are in
* progress, unless the snapshot was taken during recovery in which case
* it's empty. For historic MVCC snapshots, the meaning is inverted, i.e.
* it contains *committed* transactions between xmin and xmax.
*
* note: all ids in xip[] satisfy xmin <= xip[i] < xmax
*/
TransactionId *xip;
uint32 xcnt; /* # of xact ids in xip[] */
/*
* For non-historic MVCC snapshots, this contains subxact IDs that are in
* progress (and other transactions that are in progress if taken during
* recovery). For historic snapshot it contains *all* xids assigned to the
* replayed transaction, including the toplevel xid.
*
* note: all ids in subxip[] are >= xmin, but we don't bother filtering
* out any that are >= xmax
*/
TransactionId *subxip;
int32 subxcnt; /* # of xact ids in subxip[] */
bool suboverflowed; /* has the subxip array overflowed? */
bool takenDuringRecovery; /* recovery-shaped snapshot? */
bool copied; /* false if it's a static snapshot */
CommandId curcid; /* in my xact, CID < curcid are visible */
/*
* An extra return value for HeapTupleSatisfiesDirty, not used in MVCC
* snapshots.
*/
uint32 speculativeToken;
/*
* For SNAPSHOT_NON_VACUUMABLE (and hopefully more in the future) this is
* used to determine whether row could be vacuumed.
*/
struct GlobalVisState *vistest;
/*
* Book-keeping information, used by the snapshot manager
*/
uint32 active_count; /* refcount on ActiveSnapshot stack */
uint32 regd_count; /* refcount on RegisteredSnapshots */
pairingheap_node ph_node; /* link in the RegisteredSnapshots heap */
TimestampTz whenTaken; /* timestamp when snapshot was taken */
XLogRecPtr lsn; /* position in the WAL stream when taken */
/*
* The transaction completion count at the time GetSnapshotData() built
* this snapshot. Allows to avoid re-computing static snapshots when no
* transactions completed since the last GetSnapshotData().
*/
uint64 snapXactCompletionCount;
} SnapshotData;
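With the contents of a snapshot in mind, let's walk through the execution logic of GetSnapshotData:
1 Initialize local variables: fetch the global procArray structure and the ProcGlobal->xids array, and allocate memory for the snapshot's xip and subxip arrays on first use;
2 Acquire ProcArrayLock in LW_SHARED mode. If no transaction has completed since this snapshot was last built (that is, ShmemVariableCache->xactCompletionCount == snapshot->snapXactCompletionCount), reuse the previous snapshot contents, which avoids repeating the expensive ProcArray scan; otherwise go to step 3;
3 Scan the other_xids[] array, skipping entries with no XID, the backend's own XID, XIDs >= xmax, logical decoding backends, and lazy-vacuum backends. If a remaining xid precedes the current xmin, set xmin = xid, and append the xid to xip[]; in-progress subtransaction XIDs are collected into subxip[] in the same pass;
4 On a hot standby, the XIDs are taken from the KnownAssignedXids array instead;
5 If MyProc->xmin is not yet set, store the computed xmin in MyProc->xmin; then release ProcArrayLock;
6 Fill in the snapshot fields from the values computed above and return the snapshot;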
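Step 3 and the xmin/xmax initialization around it can be sketched as follows. This is a simplified model rather than the actual function: the `toy_` names, the fixed-size `xip[8]`, and plain integer comparison of XIDs are assumptions, and locking, status flags, subtransactions, and the standby path are omitted.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t ToyXid;
#define TOY_INVALID_XID 0
#define TOY_MAX_PROCS   8

typedef struct ToySnapshot
{
    ToyXid   xmin;
    ToyXid   xmax;
    ToyXid   xip[TOY_MAX_PROCS];
    uint32_t xcnt;
} ToySnapshot;

/* Build a snapshot from a dense array of per-backend XIDs.
 * my_off is the caller's own slot; latest_completed is the newest
 * XID known to have finished. */
static void toy_build_snapshot(const ToyXid *xids, int nprocs, int my_off,
                               ToyXid latest_completed, ToySnapshot *snap)
{
    assert(nprocs >= 0 && nprocs <= TOY_MAX_PROCS);
    assert(my_off >= 0 && my_off < nprocs);

    ToyXid   xmax = latest_completed + 1;  /* xmax = latest completed + 1 */
    ToyXid   xmin = xmax;                  /* start xmin at xmax */
    ToyXid   myxid = xids[my_off];
    uint32_t count = 0;

    /* Our own XID lowers xmin but is never listed in xip[]. */
    if (myxid != TOY_INVALID_XID && myxid < xmin)
        xmin = myxid;

    for (int off = 0; off < nprocs; off++)
    {
        ToyXid xid = xids[off];

        if (xid == TOY_INVALID_XID)
            continue;                      /* backend has no XID */
        if (off == my_off)
            continue;                      /* skip ourselves */
        if (xid >= xmax)
            continue;                      /* treated as running anyway */
        if (xid < xmin)
            xmin = xid;
        snap->xip[count++] = xid;
    }

    snap->xmin = xmin;
    snap->xmax = xmax;
    snap->xcnt = count;
}
```

With xids = {0, 101, 104, 110, 98}, the caller in slot 2, and latest_completed = 105, the result is xmax = 106, xmin = 98, and xip = {101, 98}: slot 0 has no XID, slot 2 is the caller itself, and 110 is not below xmax.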
Snapshot
GetSnapshotData(Snapshot snapshot)
{
ProcArrayStruct *arrayP = procArray;
TransactionId *other_xids = ProcGlobal->xids;
TransactionId xmin;
TransactionId xmax;
int count = 0;
int subcount = 0;
bool suboverflowed = false;
FullTransactionId latest_completed;
TransactionId oldestxid;
int mypgxactoff;
TransactionId myxid;
uint64 curXactCompletionCount;
TransactionId replication_slot_xmin = InvalidTransactionId;
TransactionId replication_slot_catalog_xmin = InvalidTransactionId;
Assert(snapshot != NULL);
/*
* Allocating space for maxProcs xids is usually overkill; numProcs would
* be sufficient. But it seems better to do the malloc while not holding
* the lock, so we can't look at numProcs. Likewise, we allocate much
* more subxip storage than is probably needed.
*
* This does open a possibility for avoiding repeated malloc/free: since
* maxProcs does not change at runtime, we can simply reuse the previous
* xip arrays if any. (This relies on the fact that all callers pass
* static SnapshotData structs.)
*/
if (snapshot->xip == NULL)
{
/*
* First call for this snapshot. Snapshot is same size whether or not
* we are in recovery, see later comments.
*/
snapshot->xip = (TransactionId *)
malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
if (snapshot->xip == NULL)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
Assert(snapshot->subxip == NULL);
snapshot->subxip = (TransactionId *)
malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
if (snapshot->subxip == NULL)
ereport(ERROR,
(errcode(ERRCODE_OUT_OF_MEMORY),
errmsg("out of memory")));
}
/*
* It is sufficient to get shared lock on ProcArrayLock, even if we are
* going to set MyProc->xmin.
*/
LWLockAcquire(ProcArrayLock, LW_SHARED);
if (GetSnapshotDataReuse(snapshot))
{
LWLockRelease(ProcArrayLock);
return snapshot;
}
latest_completed = ShmemVariableCache->latestCompletedXid;
mypgxactoff = MyProc->pgxactoff;
myxid = other_xids[mypgxactoff];
Assert(myxid == MyProc->xid);
oldestxid = ShmemVariableCache->oldestXid;
curXactCompletionCount = ShmemVariableCache->xactCompletionCount;
/* xmax is always latestCompletedXid + 1 */
xmax = XidFromFullTransactionId(latest_completed);
TransactionIdAdvance(xmax);
Assert(TransactionIdIsNormal(xmax));
/* initialize xmin calculation with xmax */
xmin = xmax;
/* take own xid into account, saves a check inside the loop */
if (TransactionIdIsNormal(myxid) && NormalTransactionIdPrecedes(myxid, xmin))
xmin = myxid;
snapshot->takenDuringRecovery = RecoveryInProgress();
if (!snapshot->takenDuringRecovery)
{
int numProcs = arrayP->numProcs;
TransactionId *xip = snapshot->xip;
int *pgprocnos = arrayP->pgprocnos;
XidCacheStatus *subxidStates = ProcGlobal->subxidStates;
uint8 *allStatusFlags = ProcGlobal->statusFlags;
/* Scan the other_xids[] array to find the smallest XID, xmin.
* First collect set of pgxactoff/xids that need to be included in the
* snapshot.
*/
for (int pgxactoff = 0; pgxactoff < numProcs; pgxactoff++)
{
/* Fetch xid just once - see GetNewTransactionId */
TransactionId xid = UINT32_ACCESS_ONCE(other_xids[pgxactoff]);
uint8 statusFlags;
Assert(allProcs[arrayP->pgprocnos[pgxactoff]].pgxactoff == pgxactoff);
/*
* If the transaction has no XID assigned, we can skip it; it
* won't have sub-XIDs either.
*/
if (likely(xid == InvalidTransactionId))
continue;
/*
* We don't include our own XIDs (if any) in the snapshot. It
* needs to be included in the xmin computation, but we did so
* outside the loop.
*/
if (pgxactoff == mypgxactoff)
continue;
/*
* The only way we are able to get here with a non-normal xid is
* during bootstrap - with this backend using
* BootstrapTransactionId. But the above test should filter that
* out.
*/
Assert(TransactionIdIsNormal(xid));
/*
* If the XID is >= xmax, we can skip it; such transactions will
* be treated as running anyway (and any sub-XIDs will also be >=
* xmax).
*/
if (!NormalTransactionIdPrecedes(xid, xmax))
continue;
/*
* Skip over backends doing logical decoding which manages xmin
* separately (check below) and ones running LAZY VACUUM.
*/
statusFlags = allStatusFlags[pgxactoff];
if (statusFlags & (PROC_IN_LOGICAL_DECODING | PROC_IN_VACUUM))
continue;
if (NormalTransactionIdPrecedes(xid, xmin))
xmin = xid;
/* Add XID to snapshot. */
xip[count++] = xid;
/*
* Save subtransaction XIDs if possible (if we've already
* overflowed, there's no point). Note that the subxact XIDs must
* be later than their parent, so no need to check them against
* xmin. We could filter against xmax, but it seems better not to
* do that much work while holding the ProcArrayLock.
*
* The other backend can add more subxids concurrently, but cannot
* remove any. Hence it's important to fetch nxids just once.
* Should be safe to use memcpy, though. (We needn't worry about
* missing any xids added concurrently, because they must postdate
* xmax.)
*
* Again, our own XIDs are not included in the snapshot.
*/
if (!suboverflowed)
{
if (subxidStates[pgxactoff].overflowed)
suboverflowed = true;
else
{
int nsubxids = subxidStates[pgxactoff].count;
if (nsubxids > 0)
{
int pgprocno = pgprocnos[pgxactoff];
PGPROC *proc = &allProcs[pgprocno];
pg_read_barrier(); /* pairs with GetNewTransactionId */
memcpy(snapshot->subxip + subcount,
(void *) proc->subxids.xids,
nsubxids * sizeof(TransactionId));
subcount += nsubxids;
}
}
}
}
}
else
{
/* Standby server
* We're in hot standby, so get XIDs from KnownAssignedXids.
*
* We store all xids directly into subxip[]. Here's why:
*
* In recovery we don't know which xids are top-level and which are
* subxacts, a design choice that greatly simplifies xid processing.
*
* It seems like we would want to try to put xids into xip[] only, but
* that is fairly small. We would either need to make that bigger or
* to increase the rate at which we WAL-log xid assignment; neither is
* an appealing choice.
*
* We could try to store xids into xip[] first and then into subxip[]
* if there are too many xids. That only works if the snapshot doesn't
* overflow because we do not search subxip[] in that case. A simpler
* way is to just store all xids in the subxact array because this is
* by far the bigger array. We just leave the xip array empty.
*
* Either way we need to change the way XidInMVCCSnapshot() works
* depending upon when the snapshot was taken, or change normal
* snapshot processing so it matches.
*
* Note: It is possible for recovery to end before we finish taking
* the snapshot, and for newly assigned transaction ids to be added to
* the ProcArray. xmax cannot change while we hold ProcArrayLock, so
* those newly added transaction ids would be filtered away, so we
* need not be concerned about them.
*/
subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
xmax);
if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
suboverflowed = true;
}
/*
* Fetch into local variable while ProcArrayLock is held - the
* LWLockRelease below is a barrier, ensuring this happens inside the
* lock.
*/
// Replication slots record a data xmin and a catalog xmin so that tuples still needed by a standby are not removed
replication_slot_xmin = procArray->replication_slot_xmin;
replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;
if (!TransactionIdIsValid(MyProc->xmin))
MyProc->xmin = TransactionXmin = xmin;
LWLockRelease(ProcArrayLock);
/* maintain state for GlobalVis* */
{
...
}
RecentXmin = xmin;
Assert(TransactionIdPrecedesOrEquals(TransactionXmin, RecentXmin));
/* Fill in the snapshot fields */
snapshot->xmin = xmin;
snapshot->xmax = xmax;
snapshot->xcnt = count;
snapshot->subxcnt = subcount;
snapshot->suboverflowed = suboverflowed;
snapshot->snapXactCompletionCount = curXactCompletionCount;
snapshot->curcid = GetCurrentCommandId(false);
/*
* This is a new snapshot, so set both refcounts are zero, and mark it as
* not copied in persistent memory.
*/
snapshot->active_count = 0;
snapshot->regd_count = 0;
snapshot->copied = false;
GetSnapshotDataInitOldSnapshot(snapshot);
return snapshot;
}