PostgreSQL 事务—MVCC

最新推荐文章于 2024-06-30 14:45:06 发布

obvious__

最新推荐文章于 2024-06-30 14:45:06 发布

阅读量1.3k

点赞数 2

分类专栏： postgresql 文章标签： postgresql 数据库

本文链接：https://blog.csdn.net/obvious__/article/details/120710977

版权

postgresql 专栏收录该内容

25 篇文章 30 订阅

订阅专栏

MVCC

预备知识

《PostgreSQL 流程—全表遍历》

《PostgreSQL重启恢复—Checkpoint&Redo》

概述

在《PostgreSQL 流程—全表遍历》中我们讲解过一个函数heapgetpage，该函数用于获取页面中或有可见的元组。在全表遍历中，我们遗留了一个问题，就是如何判断元组的可见性，本文就来重点描述关于可见性的相关问题。

MVCC

考虑这样一个场景，当一个进程P1正在修改某元组R，进程P2希望读取R，此时应该如何处理？这是一个典型的多进程并发控制的场景。对于多进程并发控制，常用的方式就是采用读写锁，即读加共享锁，写加互斥锁，从而使读写操作互斥。但在数据库中，这样的方式会极大的降低系统的并发性。为此数据库采用了Mutil-Version Concurrency Control（MVCC，多版本并发控制），为元组保留多个不同的版本，读写操作作用于不同的版本，从而可以并行执行，解决读写互斥的问题。关于MVCC更详细的内容可以参见相关资料。

在MVCC中，对某条元组执行update操作后，就会存在两个版本，修改前的版本和修改后的版本。多次修改后，就会存在多个版本。然而对于一个查询操作，最终只可能返回其中的一个版本。那么应该返回哪个版本呢？这也就是所谓的可见性问题。即多个版本中，只有一个版本对于当前事务可见。那么如何判断元组是否可见呢？元组可见的准则如下：

在当前事务开启之前提交的最新版本，对当前事务可见。

该准则包含了两个要点：

在当前事务开启之前提交。
最新版本。

下面分别阐述PostgreSQL是如何校验这两个要点的。

事务可见性判断

PostgreSQL会为每个事务生成一个事务ID，即xid。xid是一个32位的整数，按照事务开启的先后顺序递增，所以通过xid就可以判断事务开启的先后顺序。当事务提交后，PostgreSQL会记录clog，表明该事务已经提交，所以通过clog也可以判断事务是否提交。通过xid和clog可以很方便的判断事务何时开启以及是否提交。此外PostgreSQL还维护了全局的活跃事务数组，里面存放了所有当前活跃事务的xid。所以当一个事务开启时，我们可以很方便的获取两个重要的信息：

此时有哪些事务是活跃的

获取这些活跃事务中最小的xid，即xmin。那么所有< xmin的事务一定是当前事务开启之前提交的。这些事务所作的操作对于当前事务都是可见的。

这些活跃事务本身在当前事务开启后依然是活跃的，所以这些事务的操作对当前事务是不可见的。
此时有哪些事务是提交的

获取这些提交事务中最大的xid，即xmax。那么所有> xmax的事务一定是当前事务开启之后还未提交的。这些事务所作的操作对于当前事务都是不可见的。
这里还有一个细节，xmax实际是最大提交事务的xid+1，代码如下：
```
/* procarray.c 1565行 */
/* xmax is always latestCompletedXid + 1 */
xmax = ShmemVariableCache->latestCompletedXid;
Assert(TransactionIdIsNormal(xmax));
TransactionIdAdvance(xmax);  /* xmax++ */
```
然后所有 >= xmax的事务所作的操作，对当前事务不可见。

设某事务T的事务id为xid，根据上述性质，我们可以很方便的得出事务T可见性的判断流程：

比较xid与xmin

如果xid < xmin，则T对当前事务可见，否则执行步骤2。
比较xid与xmax

如果xid >= xmax，则T对当前事务不可见，否则执行步骤3。
在活跃事务数组中查找xid

如果xid存在于活跃事务数组，则T对当前事务不可见，否则T对当前事务可见。

对于步骤1，是一个优化步骤，去掉步骤1，不会出现任何正确性问题，只不过对于xid < xmin的事务都需要到在步骤3中通过遍历活跃数组的方式来判断可见性，这样性能会比较低。对于步骤2，是一个影响正确性的步骤，因为对于在当前事务开启之后再开启的事务，可能不会出现在活跃事务数组中（后面会说明为什么会出现这样的情况），显然这些事务对当前事务都不可见，所以需要步骤2来过滤。（那些未出现在活跃事务数组中的xid一定>= xmax，后面会具体说明）。

下面我们来看看事务可见性判断的实现函数，XidInMVCCSnapshot。

XidInMVCCSnapshot

bool
XidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
	uint32		i;

	/*
	 * Make a quick range check to eliminate most XIDs without looking at the
	 * xip arrays.  Note that this is OK even if we convert a subxact XID to
	 * its parent below, because a subxact with XID < xmin has surely also got
	 * a parent with XID < xmin, while one with XID >= xmax must belong to a
	 * parent that was not yet committed at the time of this snapshot.
	 */

	/* 
	 * Any xid < xmin is not in-progress 
	 * 步骤1：比较xid与xmin
	 */
	if (TransactionIdPrecedes(xid, snapshot->xmin))
		return false;
	/* 
	 * Any xid >= xmax is in-progress 
	 * 步骤2：比较xid与xmax
	 */
	if (TransactionIdFollowsOrEquals(xid, snapshot->xmax))
		return true;

	/*
	 * Snapshot information is stored slightly differently in snapshots taken
	 * during recovery.
	 */
	if (!snapshot->takenDuringRecovery)
	{
		/*
		 * If the snapshot contains full subxact data, the fastest way to
		 * check things is just to compare the given XID against both subxact
		 * XIDs and top-level XIDs.  If the snapshot overflowed, we have to
		 * use pg_subtrans to convert a subxact XID to its parent XID, but
		 * then we need only look at top-level XIDs not subxacts.
		 */
		if (!snapshot->suboverflowed)
		{
			/* we have full data, so search subxip */
			int32		j;

			for (j = 0; j < snapshot->subxcnt; j++)
			{
				if (TransactionIdEquals(xid, snapshot->subxip[j]))
					return true;
			}

			/* not there, fall through to search xip[] */
		}
		else
		{
			/*
			 * Snapshot overflowed, so convert xid to top-level.  This is safe
			 * because we eliminated too-old XIDs above.
			 */
			xid = SubTransGetTopmostTransaction(xid);

			/*
			 * If xid was indeed a subxact, we might now have an xid < xmin,
			 * so recheck to avoid an array scan.  No point in rechecking
			 * xmax.
			 */
			if (TransactionIdPrecedes(xid, snapshot->xmin))
				return false;
		}
		
        /* 步骤3：在活跃事务数组中查找xid */
		for (i = 0; i < snapshot->xcnt; i++)
		{
			if (TransactionIdEquals(xid, snapshot->xip[i]))
				return true;
		}
	}
	else
	{
		int32		j;

		/*
		 * In recovery we store all xids in the subxact array because it is by
		 * far the bigger array, and we mostly don't know which xids are
		 * top-level and which are subxacts. The xip array is empty.
		 *
		 * We start by searching subtrans, if we overflowed.
		 */
		if (snapshot->suboverflowed)
		{
			/*
			 * Snapshot overflowed, so convert xid to top-level.  This is safe
			 * because we eliminated too-old XIDs above.
			 */
			xid = SubTransGetTopmostTransaction(xid);

			/*
			 * If xid was indeed a subxact, we might now have an xid < xmin,
			 * so recheck to avoid an array scan.  No point in rechecking
			 * xmax.
			 */
			if (TransactionIdPrecedes(xid, snapshot->xmin))
				return false;
		}

		/*
		 * We now have either a top-level xid higher than xmin or an
		 * indeterminate xid. We don't know whether it's top level or subxact
		 * but it doesn't matter. If it's present, the xid is visible.
		 */
		for (j = 0; j < snapshot->subxcnt; j++)
		{
			if (TransactionIdEquals(xid, snapshot->subxip[j]))
				return true;
		}
	}

	return false;
}

这里需要注意一下XidInMVCCSnapshot的返回值，XidInMVCCSnapshot返回true表示xid对当前事务不可见。

元组可见性

说完事务可见性，我们来看看元组可见性，元组可见性是基于事务可见性的。在PostgreSQL中，每条元组上记录了两个xid，t_xmin和t_xmax，注意这里的t_xmin、t_xmax和前面事务可见性的xmin、xmax完全不是一回事！！在元组中，t_xmin表示对该元组执行插入操作的事务的xid，t_xmax表示对该元组执行删除操作的事务的xid，如果元组没有被删除t_xmax为0。在PostgreSQL中，update操作删除+插入操作，先删除原始的元组，再插入新的元组，所以t_xmin和t_xmax逻辑也完全适用于更新操作。下面我们先从源代码的角度来调试下t_xmin和t_xmax。

用例1：元组插入

-- 建表
drop table if exists t1;
create table t1(a int);
-- 插入
begin;
select txid_current(); 		-- 查看当前事务id
insert into t1 values(1);	-- 该元组的t_xmin等于txid_current(),t_xmax为0
commit;
-- 查询
select * from t1;			-- 验证t_xmin和t_xmax

插入操作的事务id如下：

在这里插入图片描述

执行查询操作，我们可以看到元组的xmin和xmax的情况，如下图：

在这里插入图片描述

HeapTupleSatisfiesMVCC是判断元组可见性的函数，后面我们会详细阐述。从图中可以看出，元组的t_xmin为688，就是前面插入事务的xid，而由于此时元组没有被删除，所以t_xmax为0。

用例2：元组更新

接着我们把用例1插入的元组进行更新。

-- 更新
begin;
select txid_current(); 		-- 查看当前事务id
update t1 set a = 2;		-- 该操作会将a = 1的元组的t_xmax变为txid_current(),
							-- 然后插入a = 2的元组，该元组的t_xmin为txid_current()，t_xmax为0
commit;
-- 查询
select * from t1;			-- 验证t_xmin和t_xmax

执行查询操作，此时我们会看到两条元组，由于是全表遍历，所以会先看到原始元组（即a = 1的元组），然后看到更新后的元组（即a = 2的元组）。

在这里插入图片描述

原始元组

在这里插入图片描述

可以看到原始元组的t_xmax被改为了698，正是此次update操作的事务id。

新元组

在这里插入图片描述

这是update操作插入的新元组，新元组的t_xmin为此次update操作的事务id，新元组的t_xmax为0。

用例3：元组删除

接着我们把用例2更新的元组进行删除。

-- 删除
begin;
select txid_current(); 		-- 查看当前事务id
delete from t1;				-- 该操作会将元组的t_xmax变为txid_current(),
commit;
-- 查询
select * from t1;			-- 验证t_xmin和t_xmax

明白了更新操作，删除其实就非常简单了，删除操作会修改元组的t_xmax。

在这里插入图片描述

明白了元组的t_xmin和t_xmax，再来谈元组的可见性其实就非常简单了。

如果某元组的t_xmin对当前事务不可见，那么该元组对当前事务不可见。

这是显而易见的，t_xmin不可见意味着，插入这条元组的事务对当前事务不可见，自然该元组对当前事务也就不可见。
如果元组的t_xmax对当前事务可见，那么该元组对当前事务不可见。

t_xmax对当前事务可见，就意味着删除对当前事务可见，删除既然可见，就说明了对当前事务而言，该元组已经被删除了，元组既然被删除自然也就不可见了。

下面我们来看看HeapTupleSatisfiesMVCC函数的实现。

HeapTupleSatisfiesMVCC

在文件tqual.h和tqual.c中定义了一系列HeapTupleSatisfies函数，如下图所示：

在这里插入图片描述

这些函数用于在不同情况下判断元组的可见性。而在查询时用于判断元组可见性的函数为HeapTupleSatisfiesMVCC，其实现如下：

bool
HeapTupleSatisfiesMVCC(HeapTuple htup, Snapshot snapshot,
					   Buffer buffer)
{
	HeapTupleHeader tuple = htup->t_data;

	Assert(ItemPointerIsValid(&htup->t_self));
	Assert(htup->t_tableOid != InvalidOid);

    /* 关键函数，暂时忽略 */
	if (!HeapTupleHeaderXminCommitted(tuple))
	{
        /* 关键函数，暂时忽略 */
		if (HeapTupleHeaderXminInvalid(tuple))
			return false;

		/* 
		 * Used by pre-9.0 binary upgrades 
		 * 兼容老版本，可以忽略
		 */
		if (tuple->t_infomask & HEAP_MOVED_OFF)
		{
			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);

			if (TransactionIdIsCurrentTransactionId(xvac))
				return false;
			if (!XidInMVCCSnapshot(xvac, snapshot))
			{
				if (TransactionIdDidCommit(xvac))
				{
					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
								InvalidTransactionId);
					return false;
				}
				SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
							InvalidTransactionId);
			}
		}
		/* 
		 * Used by pre-9.0 binary upgrades 
		 * 兼容老版本，可以忽略
		 */
		else if (tuple->t_infomask & HEAP_MOVED_IN)
		{
			TransactionId xvac = HeapTupleHeaderGetXvac(tuple);

			if (!TransactionIdIsCurrentTransactionId(xvac))
			{
				if (XidInMVCCSnapshot(xvac, snapshot))
					return false;
				if (TransactionIdDidCommit(xvac))
					SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
								InvalidTransactionId);
				else
				{
					SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
								InvalidTransactionId);
					return false;
				}
			}
		}
        /* 判断该元组是否由当前事务插入 */
		else if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmin(tuple)))
		{
			if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
				return false;	/* inserted after scan started */

			if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid */
				return true;

			if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))	/* not deleter */
				return true;

			if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
			{
				TransactionId xmax;

				xmax = HeapTupleGetUpdateXid(tuple);

				/* not LOCKED_ONLY, so it has to have an xmax */
				Assert(TransactionIdIsValid(xmax));

				/* updating subtransaction must have aborted */
				if (!TransactionIdIsCurrentTransactionId(xmax))
					return true;
				else if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
					return true;	/* updated after scan started */
				else
					return false;		/* updated before scan started */
			}

			if (!TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
			{
				/* deleting subtransaction must have aborted */
				SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
							InvalidTransactionId);
				return true;
			}

			if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
				return true;	/* deleted after scan started */
			else
				return false;	/* deleted before scan started */
		}
        /* 判断元组t_xmin的可见性 */
		else if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
			return false; /* 如果t_xmin不可见，则元组不可见 */
         /* 关键函数，暂时忽略 */
		else if (TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuple)))
			SetHintBits(tuple, buffer, HEAP_XMIN_COMMITTED,
						HeapTupleHeaderGetRawXmin(tuple));
		else
		{
			/* it must have aborted or crashed */
			SetHintBits(tuple, buffer, HEAP_XMIN_INVALID,
						InvalidTransactionId);
			return false;
		}
	}
	else
	{
		/* 
		 * xmin is committed, but maybe not according to our snapshot 
		 * 判断元组t_xmin的可见性，如果t_xmin不可见，则元组不可见
		 */
		if (!HeapTupleHeaderXminFrozen(tuple) &&
			XidInMVCCSnapshot(HeapTupleHeaderGetRawXmin(tuple), snapshot))
			return false;		/* treat as still in progress */
	}

	/* by here, the inserting transaction has committed */

	if (tuple->t_infomask & HEAP_XMAX_INVALID)	/* xid invalid or aborted */
		return true;

	if (HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask))
		return true;

	if (tuple->t_infomask & HEAP_XMAX_IS_MULTI)
	{
		TransactionId xmax;

		/* already checked above */
		Assert(!HEAP_XMAX_IS_LOCKED_ONLY(tuple->t_infomask));

		xmax = HeapTupleGetUpdateXid(tuple);

		/* not LOCKED_ONLY, so it has to have an xmax */
		Assert(TransactionIdIsValid(xmax));

		if (TransactionIdIsCurrentTransactionId(xmax))
		{
			if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
				return true;	/* deleted after scan started */
			else
				return false;	/* deleted before scan started */
		}
		if (XidInMVCCSnapshot(xmax, snapshot))
			return true;
		if (TransactionIdDidCommit(xmax))
			return false;		/* updating transaction committed */
		/* it must have aborted or crashed */
		return true;
	}

	if (!(tuple->t_infomask & HEAP_XMAX_COMMITTED))
	{
		if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetRawXmax(tuple)))
		{
			if (HeapTupleHeaderGetCmax(tuple) >= snapshot->curcid)
				return true;	/* deleted after scan started */
			else
				return false;	/* deleted before scan started */
		}
		/* 判断t_xmax的可见性， t_xmax不可见，则元组可见*/
		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
			return true;

		if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmax(tuple)))
		{
			/* it must have aborted or crashed */
			SetHintBits(tuple, buffer, HEAP_XMAX_INVALID,
						InvalidTransactionId);
			return true;
		}

		/* xmax transaction committed */
		SetHintBits(tuple, buffer, HEAP_XMAX_COMMITTED,
					HeapTupleHeaderGetRawXmax(tuple));
	}
	else
	{
		/* 
		 * xmax is committed, but maybe not according to our snapshot 
		 * 判断t_xmax的可见性， t_xmax不可见，则元组可见
		 */
		if (XidInMVCCSnapshot(HeapTupleHeaderGetRawXmax(tuple), snapshot))
			return true;		/* treat as still in progress */
	}

	/* xmax transaction committed */

	return false;
}

我们忽略了HeapTupleSatisfiesMVCC中很多重要的问题，比如HeapTupleHeaderXminCommitted、HeapTupleHeaderXminInvalid以及TransactionIdDidCommit的作用，关于这几个函数，会在本文最后进行说明，现在我们只关注对于t_xmin和t_xmax的判断。回到前面在MVCC中我们提出的可见性的两个要素：

在当前事务开启之前提交。
最新版本。

其实前面的描述，通过事务的可见性判断规则，我们可以判断出一个事务是否在当前事务开启前提交。那么如何获取元组的最新版本呢？其实从元组的可见性部分，我们不难看出根本不存在这个问题，对于一个事务来讲，只可能有一个版本的元组对其可见。为什么？首先，元组存在多个版本的原因只可能是对改元组执行了多次更新，从前面的描述中不难发现，更新操作有一个特点，原始元组的t_xmax与新元组的t_xmin相同，而t_xmax和t_xmin对于元组的可见性有相反的作用。所以，要么原始元组与新元组都不可见，要么只有一个可见！

再议XidInMVCCSnapshot

前面我们讲解了XidInMVCCSnapshot作用与实现，现在我们来讨论一下XidInMVCCSnapshot的名字，哈哈！我们分解下这个函数名Xid_In_MVCC_Snapshot。xid已经介绍过，in就是字面意思，MVCC也已经介绍过，那么什么是Snapshot？PostgreSQL可是出了名的见名知意啊。

在解释Snapshot之前，我们来回顾下heapgetpage的流程：获取页面中一条元组，判断元组可见性，获取下一条元组，判断元组可见性，循环往复直到页面尽头。通过前面的描述我们知道，元组可见性的判断需要依赖三个元素：

xmin、xmax、活跃事务数组。这三个元组会随着数据库事务的开启和提交实时变化。那么如果我们在查询时直接使用全局xmin、xmax和活跃事务数组来判断可见性可以么？考虑一个简单的场景：

-- 事务T1
begin;
insert into table1 values(1),(2),(3),(4),(5);
commit;

-- 事务T2
select * from table1;

事务T1向table1中插入了5条元组。假设事务T2在全表遍历时使用全局事务数组来进行可见性判断，T2遍历到1，2两条元组时，事务T1还没有提交，所以1，2元组不可见，当遍历到元组3时，T1提交了，所以全局xmin、xmax和活跃事务数组发生了变化，T1不再是活跃的，于是元组3，4，5变为了可见！这显然违背了事务的原子性。所以在判断元组可见性时一定不能使用全局xmin、xmax和活跃事务数组。而是需要将xmin、xmax和活跃事务数组做一个快照，这也就是Snapshot！如此才能保证同一个事务产生的元组具有相同的可见性！

明白了Snapshot的作用之后，其实就非常好理解事务的隔离级别了。

read-uncommit

脏读：不判断元组可见性，只要存在就读。这个比较非主流，一般没有数据库会支持这个级别，也没这种需求。
read-commit

提交读：在每次执行SQL语句之前，做一次Snapshot，保证语句执行过程中可见性是一致的，但同一个事务语句与语句之前的Snapshot可能不一样。
repeatable-read

可重读：在事务执行第一条语句时，做一次Snapshot，后面执行的所有语句都使用这个Snapshot，保证事务执行过程中可见性是一致的。
serializable

串行化：事务之前串行执行，这个也就不用考虑可见性问题了，串行是由锁来实现。

下面我们来看看Snapshot是如何产生的。

GetTransactionSnapshot

Snapshot是通过GetTransactionSnapshot函数产生的，代码如下：

Snapshot
GetTransactionSnapshot(void)
{
	/*
	 * Return historic snapshot if doing logical decoding. We'll never need a
	 * non-historic transaction snapshot in this (sub-)transaction, so there's
	 * no need to be careful to set one up for later calls to
	 * GetTransactionSnapshot().
	 */
	if (HistoricSnapshotActive())
	{
		Assert(!FirstSnapshotSet);
		return HistoricSnapshot;
	}

	/* First call in transaction? */
	if (!FirstSnapshotSet)
	{
		/*
		 * Don't allow catalog snapshot to be older than xact snapshot.  Must
		 * do this first to allow the empty-heap Assert to succeed.
		 */
		InvalidateCatalogSnapshot();

		Assert(pairingheap_is_empty(&RegisteredSnapshots));
		Assert(FirstXactSnapshot == NULL);

		if (IsInParallelMode())
			elog(ERROR,
				 "cannot take query snapshot during a parallel operation");

		/*
		 * In transaction-snapshot mode, the first snapshot must live until
		 * end of xact regardless of what the caller does with it, so we must
		 * make a copy of it rather than returning CurrentSnapshotData
		 * directly.  Furthermore, if we're running in serializable mode,
		 * predicate.c needs to wrap the snapshot fetch in its own processing.
		 */
		if (IsolationUsesXactSnapshot())
		{
			/* First, create the snapshot in CurrentSnapshotData */
			if (IsolationIsSerializable())
				CurrentSnapshot = GetSerializableTransactionSnapshot(&CurrentSnapshotData);
			else
				CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);
			/* Make a saved copy */
			CurrentSnapshot = CopySnapshot(CurrentSnapshot);
			FirstXactSnapshot = CurrentSnapshot;
			/* Mark it as "registered" in FirstXactSnapshot */
			FirstXactSnapshot->regd_count++;
			pairingheap_add(&RegisteredSnapshots, &FirstXactSnapshot->ph_node);
		}
		else
			CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);

		FirstSnapshotSet = true;
		return CurrentSnapshot;
	}

	if (IsolationUsesXactSnapshot())
		return CurrentSnapshot;

	/* Don't allow catalog snapshot to be older than xact snapshot. */
	InvalidateCatalogSnapshot();

	CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData);

	return CurrentSnapshot;
}

该函数有两个要点：

FirstSnapshotSet变量

该变量用来控制是直接只用当前的snapshot还是重新生成snapshot。初始值为false，生成snapshot后设置为true。如果隔离级别是read-commit，则当sql执行完成后会将FirstSnapshotSet设为false，如果是repeatable-read只有在事务结束后才会将FirstSnapshotSet设为false。通过监视FirstSnapshotSet变量的变化（添加数据断点&FirstSnapshotSet）可以观察语句结束和事务提交的情况。
GetSnapshotData函数

该函数用于生成snapshot，下面会详细讲解。

GetSnapshotData

GetSnapshotData是实际生成snapshot的函数，代码如下：

Snapshot
GetSnapshotData(Snapshot snapshot)
{
	ProcArrayStruct *arrayP = procArray;
	TransactionId xmin;
	TransactionId xmax;
	TransactionId globalxmin;
	int			index;
	int			count = 0;
	int			subcount = 0;
	bool		suboverflowed = false;
	volatile TransactionId replication_slot_xmin = InvalidTransactionId;
	volatile TransactionId replication_slot_catalog_xmin = InvalidTransactionId;

	Assert(snapshot != NULL);

	/*
	 * Allocating space for maxProcs xids is usually overkill; numProcs would
	 * be sufficient.  But it seems better to do the malloc while not holding
	 * the lock, so we can't look at numProcs.  Likewise, we allocate much
	 * more subxip storage than is probably needed.
	 *
	 * This does open a possibility for avoiding repeated malloc/free: since
	 * maxProcs does not change at runtime, we can simply reuse the previous
	 * xip arrays if any.  (This relies on the fact that all callers pass
	 * static SnapshotData structs.)
	 *
	 * 要点1
	 */
	if (snapshot->xip == NULL)
	{
		/*
		 * First call for this snapshot. Snapshot is same size whether or not
		 * we are in recovery, see later comments.
		 */
		snapshot->xip = (TransactionId *)
			malloc(GetMaxSnapshotXidCount() * sizeof(TransactionId));
		if (snapshot->xip == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
		Assert(snapshot->subxip == NULL);
		snapshot->subxip = (TransactionId *)
			malloc(GetMaxSnapshotSubxidCount() * sizeof(TransactionId));
		if (snapshot->subxip == NULL)
			ereport(ERROR,
					(errcode(ERRCODE_OUT_OF_MEMORY),
					 errmsg("out of memory")));
	}

	/*
	 * It is sufficient to get shared lock on ProcArrayLock, even if we are
	 * going to set MyPgXact->xmin.
	 * 要点2
	 */
	LWLockAcquire(ProcArrayLock, LW_SHARED);

	/* 
	 * xmax is always latestCompletedXid + 1 
	 * 要点3
	 */
	xmax = ShmemVariableCache->latestCompletedXid;
	Assert(TransactionIdIsNormal(xmax));
	TransactionIdAdvance(xmax);

	/* initialize xmin calculation with xmax */
	globalxmin = xmin = xmax;

	snapshot->takenDuringRecovery = RecoveryInProgress();

	if (!snapshot->takenDuringRecovery)
	{
		int		   *pgprocnos = arrayP->pgprocnos;
		int			numProcs;

		/*
		 * Spin over procArray checking xid, xmin, and subxids.  The goal is
		 * to gather all active xids, find the lowest xmin, and try to record
		 * subxids.
		 * 要点4
		 */
		numProcs = arrayP->numProcs;
		for (index = 0; index < numProcs; index++)
		{
			int			pgprocno = pgprocnos[index];
            /* 要点5 */
			volatile PGXACT *pgxact = &allPgXact[pgprocno];
			TransactionId xid;

			/*
			 * Backend is doing logical decoding which manages xmin
			 * separately, check below.
			 */
			if (pgxact->vacuumFlags & PROC_IN_LOGICAL_DECODING)
				continue;

			/* Ignore procs running LAZY VACUUM */
			if (pgxact->vacuumFlags & PROC_IN_VACUUM)
				continue;

			/* Update globalxmin to be the smallest valid xmin */
			xid = pgxact->xmin; /* fetch just once */
			if (TransactionIdIsNormal(xid) &&
				NormalTransactionIdPrecedes(xid, globalxmin))
				globalxmin = xid;

			/* 
			 * Fetch xid just once - see GetNewTransactionId 
			 * 要点6
			 */
			xid = pgxact->xid;

			/*
			 * If the transaction has no XID assigned, we can skip it; it
			 * won't have sub-XIDs either.  If the XID is >= xmax, we can also
			 * skip it; such transactions will be treated as running anyway
			 * (and any sub-XIDs will also be >= xmax).
			 */
			if (!TransactionIdIsNormal(xid)
				|| !NormalTransactionIdPrecedes(xid, xmax))
				continue;

			/*
			 * We don't include our own XIDs (if any) in the snapshot, but we
			 * must include them in xmin.
			 * 要点7
			 */
			if (NormalTransactionIdPrecedes(xid, xmin))
				xmin = xid;
			if (pgxact == MyPgXact)
				continue;

			/* 
			 * Add XID to snapshot. 
			 * 要点8
			 */
			snapshot->xip[count++] = xid;

			/*
			 * Save subtransaction XIDs if possible (if we've already
			 * overflowed, there's no point).  Note that the subxact XIDs must
			 * be later than their parent, so no need to check them against
			 * xmin.  We could filter against xmax, but it seems better not to
			 * do that much work while holding the ProcArrayLock.
			 *
			 * The other backend can add more subxids concurrently, but cannot
			 * remove any.  Hence it's important to fetch nxids just once.
			 * Should be safe to use memcpy, though.  (We needn't worry about
			 * missing any xids added concurrently, because they must postdate
			 * xmax.)
			 *
			 * Again, our own XIDs are not included in the snapshot.
			 */
			if (!suboverflowed)
			{
				if (pgxact->overflowed)
					suboverflowed = true;
				else
				{
					int			nxids = pgxact->nxids;

					if (nxids > 0)
					{
						volatile PGPROC *proc = &allProcs[pgprocno];

						memcpy(snapshot->subxip + subcount,
							   (void *) proc->subxids.xids,
							   nxids * sizeof(TransactionId));
						subcount += nxids;
					}
				}
			}
		}
	}
	else
	{
		/*
		 * We're in hot standby, so get XIDs from KnownAssignedXids.
		 *
		 * We store all xids directly into subxip[]. Here's why:
		 *
		 * In recovery we don't know which xids are top-level and which are
		 * subxacts, a design choice that greatly simplifies xid processing.
		 *
		 * It seems like we would want to try to put xids into xip[] only, but
		 * that is fairly small. We would either need to make that bigger or
		 * to increase the rate at which we WAL-log xid assignment; neither is
		 * an appealing choice.
		 *
		 * We could try to store xids into xip[] first and then into subxip[]
		 * if there are too many xids. That only works if the snapshot doesn't
		 * overflow because we do not search subxip[] in that case. A simpler
		 * way is to just store all xids in the subxact array because this is
		 * by far the bigger array. We just leave the xip array empty.
		 *
		 * Either way we need to change the way XidInMVCCSnapshot() works
		 * depending upon when the snapshot was taken, or change normal
		 * snapshot processing so it matches.
		 *
		 * Note: It is possible for recovery to end before we finish taking
		 * the snapshot, and for newly assigned transaction ids to be added to
		 * the ProcArray.  xmax cannot change while we hold ProcArrayLock, so
		 * those newly added transaction ids would be filtered away, so we
		 * need not be concerned about them.
		 */
		subcount = KnownAssignedXidsGetAndSetXmin(snapshot->subxip, &xmin,
												  xmax);

		if (TransactionIdPrecedesOrEquals(xmin, procArray->lastOverflowedXid))
			suboverflowed = true;
	}


	/* fetch into volatile var while ProcArrayLock is held */
	replication_slot_xmin = procArray->replication_slot_xmin;
	replication_slot_catalog_xmin = procArray->replication_slot_catalog_xmin;

	if (!TransactionIdIsValid(MyPgXact->xmin))
		MyPgXact->xmin = TransactionXmin = xmin;

	LWLockRelease(ProcArrayLock);

	/*
	 * Update globalxmin to include actual process xids.  This is a slightly
	 * different way of computing it than GetOldestXmin uses, but should give
	 * the same result.
	 */
	if (TransactionIdPrecedes(xmin, globalxmin))
		globalxmin = xmin;

	/* Update global variables too */
	RecentGlobalXmin = globalxmin - vacuum_defer_cleanup_age;
	if (!TransactionIdIsNormal(RecentGlobalXmin))
		RecentGlobalXmin = FirstNormalTransactionId;

	/* Check whether there's a replication slot requiring an older xmin. */
	if (TransactionIdIsValid(replication_slot_xmin) &&
		NormalTransactionIdPrecedes(replication_slot_xmin, RecentGlobalXmin))
		RecentGlobalXmin = replication_slot_xmin;

	/* Non-catalog tables can be vacuumed if older than this xid */
	RecentGlobalDataXmin = RecentGlobalXmin;

	/*
	 * Check whether there's a replication slot requiring an older catalog
	 * xmin.
	 */
	if (TransactionIdIsNormal(replication_slot_catalog_xmin) &&
		NormalTransactionIdPrecedes(replication_slot_catalog_xmin, RecentGlobalXmin))
		RecentGlobalXmin = replication_slot_catalog_xmin;

	RecentXmin = xmin;

	snapshot->xmin = xmin;
	snapshot->xmax = xmax;
	snapshot->xcnt = count;
	snapshot->subxcnt = subcount;
	snapshot->suboverflowed = suboverflowed;

	snapshot->curcid = GetCurrentCommandId(false);

	/*
	 * This is a new snapshot, so set both refcounts are zero, and mark it as
	 * not copied in persistent memory.
	 */
	snapshot->active_count = 0;
	snapshot->regd_count = 0;
	snapshot->copied = false;

	if (old_snapshot_threshold < 0)
	{
		/*
		 * If not using "snapshot too old" feature, fill related fields with
		 * dummy values that don't require any locking.
		 */
		snapshot->lsn = InvalidXLogRecPtr;
		snapshot->whenTaken = 0;
	}
	else
	{
		/*
		 * Capture the current time and WAL stream location in case this
		 * snapshot becomes old enough to need to fall back on the special
		 * "old snapshot" logic.
		 */
		snapshot->lsn = GetXLogInsertRecPtr();
		snapshot->whenTaken = GetSnapshotCurrentTimestamp();
		MaintainOldSnapshotTimeMapping(snapshot->whenTaken, xmin);
	}

	return snapshot;
}

该函数较长，我们只关注其中的要点。

要点1：if (snapshot->xip == NULL)

xip是活跃事务链表，在客户端连接服务端后，创建服务进程时进行初始化。具体可以调试服务进程的创建流程。在snapshot->xip创建时会分配GetMaxSnapshotXidCount()个xid的空间，GetMaxSnapshotXidCount()是事务数的上限，如此初始化后就可以直接使用这块空间，而不必每次创建时都需要分配空间，从而提升了性能。
要点2：LWLockAcquire

既然要产生snapshot，显然就要对全局活动事务链进行上锁。
要点3：获取xmax

这个步骤前面已经提到过了。
要点4：numProcs

numProcs当前PostgreSQL的用户连接数，不管该用户连接是否开启了事务。
要点5：allPgXact

存储用户的事务信息，allPgXact中元组个数与numProcs相同。
要点6：pgxact->xid

如果当前连接没有创建事务，那么pgxact->xid就为0。
要点7：xmin = xid

获取xmin，snapshot的活动事务数组中不包含自己，但xmin可能会是自己。
要点8：snapshot->xip

向活动事务数组中添加xid。

CLOG

最后，我们来解决前面遗留的问题：HeapTupleHeaderXminCommitted、HeapTupleHeaderXminInvalid以及TransactionIdDidCommit的作用。在《PostgreSQL重启恢复—Checkpoint&Redo》中，我们提到过，PostgreSQL在重启恢复时只会无脑地redo所有记录在XLOG中的数据，而不管这些数据对应的事务是否提交。所以，完成重启恢复后，数据库中就包含了一些未提交的事务产生的数据。在我们执行查询操作时，需要过滤掉这些数据。所以现在的问题就变成了：如何在查询时，过滤掉未提交事务产生的数据。所以我们需要一种机制。来判断一条元组上t_xmin和t_xmax是否提交。于是就有了CLOG。

CLOG是一种用于记录事务状态的日志，有如下四种状态：

#define TRANSACTION_STATUS_IN_PROGRESS		0x00
#define TRANSACTION_STATUS_COMMITTED		0x01
#define TRANSACTION_STATUS_ABORTED			0x02
#define TRANSACTION_STATUS_SUB_COMMITTED	0x03

当开启一个事务时，事务的状态为TRANSACTION_STATUS_IN_PROGRESS，当事务提交后状态为TRANSACTION_STATUS_COMMITTED，当用户执行了rollback后事务的状态为TRANSACTION_STATUS_ABORTED。所以通过CLOG可以非常方便的判断事务是否提交，从而判断事务的可见性。而TransactionIdDidCommit就是通过CLOG来判断事务是否提交的。

然而，CLOG毕竟是日志，访问日志通常是很耗时的，所以如果每次需要判断元组可见性时，都去访问CLOG，那么效率是十分低下的。所以在PostgreSQL中，只有第一次判断元组可见性时，才调用TransactionIdDidCommit读取CLOG，获取可见性之后，就调用SetHintBits将结果存储到元组的t_infomask成员中。在后面的判断中只需要访问t_infomask就可以判断一条元组上t_xmin和t_xmax是否提交，而这就是HeapTupleHeaderXminCommitted、HeapTupleHeaderXminInvalid函数的作用。

延伸

前面说到，在查询时，会通过SetHintBits来设置元组的t_infomask成员。而这个操作，是不会产生XLOG的，原因是，我们不需要保证这个操作的持久性。因为即便是系统发生故障，导致修改后的t_infomask没有落盘，也没有关系，判断可见性时再从CLOG中取一次便好。

那如果CLOG不见了呢？我测试过的情况是这样，如果直接把CLOG删了，数据库是启不起来的。但是可以自己构建一个CLOG让数据库启起来，但是元组的可见性肯定就会出问题。

obvious__

关注

2
点赞
踩
6

收藏

觉得还不错? 一键收藏
6
评论
PostgreSQL 事务—MVCC

MVCC预备知识《PostgreSQL 流程—全表遍历》《PostgreSQL重启恢复—Checkpoint&Redo》概述在《PostgreSQL 流程—全表遍历》中我们讲解过一个函数heapgetpage，该函数用于获取页面中或有可见的元组。在全表遍历中，我们遗留了一个问题，就是如何判断元组的可见性，本文就来重点描述关于可见性的相关问题。MVCC考虑这样一个场景，当一个进程P1正在修改某元组R，进程P2希望读取R，此时应该如何处理？这是一个典型的多进程并发控制的场景。对于多进程并发
复制链接

扫一扫

专栏目录