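Background
PostgreSQL uses MVCC so that reads do not block writes and writes do not block reads, which greatly improves database concurrency. The principle is as follows: when a row is deleted or updated, the old tuple is kept and marked dead, and for an update the new data is inserted into the table as a new tuple. Combined with transaction isolation levels, this produces the behavior described above. While reading the source code recently, I found that PostgreSQL also updates tuples in place in some cases, which clearly breaks that rule. Let's analyze it together.
Source code analysis
1 Introduction to the heap_inplace_update interface function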
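To make the contrast concrete, here is a minimal toy model in C. It is not PostgreSQL code: `ToyTuple`, `ToyTable`, `toy_insert`, `toy_mvcc_update` and `toy_inplace_update` are hypothetical names invented for this sketch, and visibility checks, locking and WAL are deliberately left out. It only shows that an MVCC-style update keeps the old version next to the new one, while an in-place update destroys the old value.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: none of these types or functions exist in PostgreSQL. */
#define TOY_MAX_TUPLES 8

typedef struct ToyTuple
{
    uint32_t xmin;   /* id of the transaction that created this version */
    uint32_t xmax;   /* id of the transaction that deleted it, 0 if live */
    int      value;  /* the user data */
} ToyTuple;

typedef struct ToyTable
{
    ToyTuple tuples[TOY_MAX_TUPLES];
    int      ntuples;
} ToyTable;

/* Append a brand-new tuple version; returns its slot index. */
static int
toy_insert(ToyTable *t, uint32_t xid, int value)
{
    assert(t->ntuples < TOY_MAX_TUPLES);
    t->tuples[t->ntuples] = (ToyTuple){ .xmin = xid, .xmax = 0, .value = value };
    return t->ntuples++;
}

/* MVCC-style update: stamp the old version as dead, append a new version. */
static int
toy_mvcc_update(ToyTable *t, int slot, uint32_t xid, int new_value)
{
    t->tuples[slot].xmax = xid;          /* the old version stays in the table */
    return toy_insert(t, xid, new_value);
}

/* In-place update: overwrite the data, leave the header fields alone. */
static void
toy_inplace_update(ToyTable *t, int slot, int new_value)
{
    t->tuples[slot].value = new_value;   /* the old value is gone for every reader */
}
```

After `toy_mvcc_update` the table holds two versions of the row, so a reader with an older snapshot could still pick the old one. After `toy_inplace_update` there is only one version and its `xmin`/`xmax` are unchanged, which is why such an update is neither MVCC-safe nor undone by a rollback.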
/*
* heap_inplace_update - update a tuple "in place" (ie, overwrite it)
*
* Overwriting violates both MVCC and transactional safety, so the uses
* of this function in Postgres are extremely limited. Nonetheless we
* find some places to use it.
*
* The tuple cannot change size, and therefore it's reasonable to assume
* that its null bitmap (if any) doesn't change either. So we just
* overwrite the data portion of the tuple without touching the null
* bitmap or any of the header fields.
*
* tuple is an in-memory tuple structure containing the data to be written
* over the target tuple. Also, tuple->t_self identifies the target tuple.
*
* Note that the tuple updated here had better not come directly from the
* syscache if the relation has a toast relation as this tuple could
* include toast values that have been expanded, causing a failure here.
*/
2 Detailed walkthrough of the source code
2.1 Validity checks
void
heap_inplace_update(Relation relation, HeapTuple tuple)
{
Buffer buffer;
Page page;
OffsetNumber offnum;
ItemId lp = NULL;
HeapTupleHeader htup;
uint32 oldlen;
uint32 newlen;
/*
* For now, we don't allow parallel updates. Unlike a regular update,
* this should never create a combo CID, so it might be possible to relax
* this restriction, but not without more thought and testing. It's not
* clear that it would be useful, anyway.
*/
// Updating tuples is not allowed in parallel mode
if (IsInParallelMode())
ereport(ERROR,
(errcode(ERRCODE_INVALID_TRANSACTION_STATE),
errmsg("cannot update tuples during a parallel operation")));
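// Read the physical page that holds the target tuple into a buffer (this may evict another page from the buffer pool), then take an exclusive buffer lock to prevent concurrent access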
buffer = ReadBuffer(relation, ItemPointerGetBlockNumber(&(tuple->t_self)));
LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
page = (Page) BufferGetPage(buffer);
// Validity checks: the offset must exist on the page and the line pointer must be normal
offnum = ItemPointerGetOffsetNumber(&(tuple->t_self));
if (PageGetMaxOffsetNumber(page) >= offnum)
lp = PageGetItemId(page, offnum);
if (PageGetMaxOffsetNumber(page) < offnum || !ItemIdIsNormal(lp))
elog(ERROR, "invalid lp");
// Get the HeapTupleHeader of the tuple stored on the page
htup = (HeapTupleHeader) PageGetItem(page, lp);
// Compute the actual length of the tuple data (total length minus header size)
oldlen = ItemIdGetLength(lp) - htup->t_hoff;
newlen = tuple->t_len - tuple->t_data->t_hoff;
if (oldlen != newlen || htup->t_hoff != tuple->t_data->t_hoff) // compare the two data lengths computed above, and the header offsets
elog(ERROR, "wrong tuple length");
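The length check at the end of section 2.1 can be isolated in a small standalone sketch. `ToyTupleDesc` and `toy_can_overwrite` are hypothetical names, not PostgreSQL APIs; `total_len` stands in for `ItemIdGetLength(lp)` or `tuple->t_len`, and `hoff` stands in for `t_hoff`. The point it tries to illustrate is that equal data lengths alone are not sufficient: the header offsets must match as well, otherwise the data would be copied to the wrong position.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a tuple: total length and header size. */
typedef struct ToyTupleDesc
{
    uint32_t total_len;  /* plays the role of ItemIdGetLength(lp) or t_len */
    uint8_t  hoff;       /* plays the role of t_hoff: offset of the user data */
} ToyTupleDesc;

/* Returns true only if the new data can overwrite the old data byte for byte. */
static bool
toy_can_overwrite(ToyTupleDesc old_tup, ToyTupleDesc new_tup)
{
    assert(old_tup.total_len >= old_tup.hoff);
    assert(new_tup.total_len >= new_tup.hoff);

    uint32_t oldlen = old_tup.total_len - old_tup.hoff;  /* data portion only */
    uint32_t newlen = new_tup.total_len - new_tup.hoff;

    return oldlen == newlen && old_tup.hoff == new_tup.hoff;
}
```

For example, an old tuple with `total_len = 64, hoff = 24` and a new one with `total_len = 72, hoff = 32` both carry 40 bytes of data, yet the sketch rejects the pair because the headers differ in size.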
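2.2 In-place tuple copy
Note: tuple holds the modified data in backend-local memory (callers typically pass a copy of a syscache entry);
htup is the tuple as it was before the modification, located in the buffer pool.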
/* NO EREPORT(ERROR) from here till changes are logged */
START_CRIT_SECTION();
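// Critical section entered; an ERROR raised inside it is escalated to PANIC
memcpy((char *) htup + htup->t_hoff, // copy the modified tuple data over the old data on the page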
(char *) tuple->t_data + tuple->t_data->t_hoff,
newlen);
// Mark the buffer dirty; it is flushed to disk later, for example by a checkpoint
MarkBufferDirty(buffer);
// Write the corresponding XLOG (WAL) record
/* XLOG stuff */
if (RelationNeedsWAL(relation))
{
xl_heap_inplace xlrec;
XLogRecPtr recptr;
xlrec.offnum = ItemPointerGetOffsetNumber(&tuple->t_self);
XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapInplace);
XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
XLogRegisterBufData(0, (char *) htup + htup->t_hoff, newlen);
/* inplace updates aren't decoded atm, don't log the origin */
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_INPLACE);
PageSetLSN(page, recptr);
}
END_CRIT_SECTION();
// Release the buffer lock taken earlier and drop the buffer pin
UnlockReleaseBuffer(buffer);
/*
* Send out shared cache inval if necessary. Note that because we only
* pass the new version of the tuple, this mustn't be used for any
* operations that could change catcache lookup keys. But we aren't
* bothering with index updates either, so that's true a fortiori.
* This tells other backends that their cached copy of this tuple is invalid.
*/
if (!IsBootstrapProcessingMode())
CacheInvalidateHeapTuple(relation, tuple, NULL);
}
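The sequence in section 2.2 (overwrite the data bytes, then log enough information to redo the change) can be sketched outside PostgreSQL. This is a toy model under stated assumptions: `ToyInplaceRecord`, `toy_inplace_overwrite` and `toy_inplace_redo` are hypothetical names, the record stores a byte offset instead of a line-pointer offset number, and there is no buffer lock, critical section or real WAL.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 64
#define TOY_MAX_DATA  32

/* Hypothetical log record: where the tuple data lives, plus the new bytes. */
typedef struct ToyInplaceRecord
{
    uint16_t      data_off;            /* byte offset of the data portion in the page */
    uint16_t      data_len;            /* number of bytes that were overwritten */
    unsigned char data[TOY_MAX_DATA];  /* the replacement bytes */
} ToyInplaceRecord;

/* Overwrite data on the page and fill in the record describing the change. */
static void
toy_inplace_overwrite(unsigned char *page, uint16_t data_off,
                      const unsigned char *new_data, uint16_t len,
                      ToyInplaceRecord *rec)
{
    assert(len <= TOY_MAX_DATA && data_off + len <= TOY_PAGE_SIZE);
    memcpy(page + data_off, new_data, len);   /* the in-place change */
    rec->data_off = data_off;
    rec->data_len = len;
    memcpy(rec->data, new_data, len);         /* what a later redo will need */
}

/* Redo: re-apply the logged bytes to a (possibly stale) copy of the page. */
static void
toy_inplace_redo(unsigned char *page, const ToyInplaceRecord *rec)
{
    assert(rec->data_off + rec->data_len <= TOY_PAGE_SIZE);
    memcpy(page + rec->data_off, rec->data, rec->data_len);
}
```

Applying `toy_inplace_redo` to a stale copy of the page brings it to the same state as the page that was overwritten directly. Because the record contains the full replacement bytes, replaying it more than once is harmless in this model.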
The WAL record carries the tuple's offset number on the page plus the replacement data:
/* This is what we need to know about in-place update */
typedef struct xl_heap_inplace
{
OffsetNumber offnum; /* updated tuple's offset on page */
/* TUPLE DATA FOLLOWS AT END OF STRUCT */
} xl_heap_inplace;
#define SizeOfHeapInplace (offsetof(xl_heap_inplace, offnum) + sizeof(OffsetNumber))
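The `offsetof(...) + sizeof(...)` idiom in `SizeOfHeapInplace` measures a record up to the end of its last field instead of using `sizeof` on the whole struct, so trailing padding is never written to the log. The sketch below illustrates the idiom with hypothetical types: `toy_inplace` mirrors the one-field shape shown above, with `uint16_t` standing in for `OffsetNumber`, and `toy_padded` is an invented two-field record used only to make the padding visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as xl_heap_inplace; uint16_t stands in for OffsetNumber. */
typedef struct toy_inplace
{
    uint16_t offnum;
} toy_inplace;
#define SizeOfToyInplace (offsetof(toy_inplace, offnum) + sizeof(uint16_t))

/* Hypothetical record whose last field is narrower than its alignment. */
typedef struct toy_padded
{
    uint32_t xid;
    uint16_t offnum;
} toy_padded;
#define SizeOfToyPadded (offsetof(toy_padded, offnum) + sizeof(uint16_t))

/* Bytes of trailing padding that the idiom keeps out of the record. */
static size_t
toy_padding_saved(void)
{
    return sizeof(toy_padded) - SizeOfToyPadded;
}
```

With a single `uint16_t` field the idiom yields 2 bytes. For `toy_padded` it yields 6 bytes, while `sizeof(toy_padded)` is usually 8 on common ABIs, although the exact padding is up to the compiler.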
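3 Call sites
Reading the PostgreSQL source shows that this interface is mainly called when modifying system catalog data. Ordinary table data still goes through MVCC, keeping the old version and inserting a new one. In-place modification is only possible under strict conditions, but because it creates no new tuple version and updates no indexes, it is much cheaper than a regular update, which suggests another direction for later optimization work.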
1) index_update_stats --- update pg_class entry after CREATE INDEX or REINDEX
2) vac_update_relstats --- update statistics for one relation
3) vac_update_datfrozenxid --- update pg_database.datfrozenxid for our DB
4) create_toast_table --- While bootstrapping, we cannot UPDATE, so overwrite in-place