Atomic pg_clog updates and pg_subtrans (subtransactions)

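Without subtransactions, keeping pg_clog updates atomic is easy. Once subtransactions are involved and have been assigned XIDs, however, some subtransaction XIDs may fall on a different CLOG page from the parent transaction's XID, and a consistent commit then requires an atomic write across several CLOG pages.
PostgreSQL achieves this with a two-phase approach, similar in spirit to 2PC:
1. First, set the subtransactions on CLOG pages other than the main transaction's page to sub-committed.
2. Then, on the CLOG page that holds the main transaction, set the subtransactions on that page to sub-committed, set the main transaction to committed, and set those same-page subtransactions to committed.
3. Finally, set the subtransactions on the other CLOG pages to committed.
The header comment of TransactionIdSetTreeStatus describes this:
src/backend/access/transam/clog.c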

/*
 * TransactionIdSetTreeStatus
 *
 * Record the final state of transaction entries in the commit log for
 * a transaction and its subtransaction tree. Take care to ensure this is
 * efficient, and as atomic as possible.
 *
 * xid is a single xid to set status for. This will typically be
 * the top level transactionid for a top level commit or abort. It can
 * also be a subtransaction when we record transaction aborts.
 *
 * subxids is an array of xids of length nsubxids, representing subtransactions
 * in the tree of xid. In various cases nsubxids may be zero.
 *
 * lsn must be the WAL location of the commit record when recording an async
 * commit.  For a synchronous commit it can be InvalidXLogRecPtr, since the
 * caller guarantees the commit record is already flushed in that case.  It
 * should be InvalidXLogRecPtr for abort cases, too.
 *
 * In the commit case, atomicity is limited by whether all the subxids are in
 * the same CLOG page as xid.  If they all are, then the lock will be grabbed
 * only once, and the status will be set to committed directly.  Otherwise
 * we must
 *       1. set sub-committed all subxids that are not on the same page as the
 *              main xid
 *       2. atomically set committed the main xid and the subxids on the same page
 *       3. go over the first bunch again and set them committed
 * Note that as far as concurrent checkers are concerned, main transaction
 * commit as a whole is still atomic.
 *
 * Example:
 *              TransactionId t commits and has subxids t1, t2, t3, t4
 *              t is on page p1, t1 is also on p1, t2 and t3 are on p2, t4 is on p3
 *              1. update pages2-3:
 *                                      page2: set t2,t3 as sub-committed
 *                                      page3: set t4 as sub-committed
 *              2. update page1:
 *                                      set t1 as sub-committed,
 *                                      then set t as committed,
                                        then set t1 as committed
 *              3. update pages2-3:
 *                                      page2: set t2,t3 as committed
 *                                      page3: set t4 as committed
 *
 * NB: this is a low-level routine and is NOT the preferred entry point
 * for most uses; functions in transam.c are the intended callers.
 *
 * XXX Think about issuing FADVISE_WILLNEED on pages that we will need,
 * but aren't yet in cache, as well as hinting pages not to fall out of
 * cache yet.
 */
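
The ordering described in the comment above can be sketched in a few lines of Python. This is only an illustration of the three steps, not the PostgreSQL implementation: `clog_page` and `plan_tree_commit` are hypothetical helper names, the page size assumes the default 8 kB BLCKSZ with 2 status bits per transaction, and locking, WAL and LSN handling are left out entirely.

```python
BLCKSZ = 8192                            # assumption: default block size
CLOG_XACTS_PER_PAGE = BLCKSZ * 8 // 2    # 2 status bits per transaction


def clog_page(xid):
    """CLOG page number that holds the status bits of xid."""
    return xid // CLOG_XACTS_PER_PAGE


def plan_tree_commit(xid, subxids):
    """Return the ordered (page, xid, status) updates for committing xid
    together with its subtransactions (hypothetical helper)."""
    main_page = clog_page(xid)
    same_page = [s for s in subxids if clog_page(s) == main_page]
    other_pages = {}                     # page -> subxids, in input order
    for s in subxids:
        if clog_page(s) != main_page:
            other_pages.setdefault(clog_page(s), []).append(s)

    steps = []
    # Step 1: subxids on other pages become sub-committed.
    for page, xids in other_pages.items():
        steps += [(page, s, "sub-committed") for s in xids]
    # Step 2: on the main page, same-page subxids go sub-committed first
    # (only needed in the multi-page case), then the main xid is marked
    # committed, then the same-page subxids are marked committed.
    if other_pages:
        steps += [(main_page, s, "sub-committed") for s in same_page]
    steps.append((main_page, xid, "committed"))
    steps += [(main_page, s, "committed") for s in same_page]
    # Step 3: revisit the other pages and mark their subxids committed.
    for page, xids in other_pages.items():
        steps += [(page, s, "committed") for s in xids]
    return steps


if __name__ == "__main__":
    # t and t1 on page 0, t2 and t3 on page 1, t4 on page 2.
    t, t1 = 100, 101
    t2, t3 = CLOG_XACTS_PER_PAGE + 5, CLOG_XACTS_PER_PAGE + 6
    t4 = 2 * CLOG_XACTS_PER_PAGE + 1
    for step in plan_tree_commit(t, [t1, t2, t3, t4]):
        print(step)
```

A concurrent reader that finds a subtransaction in the sub-committed state has to look at its parent to decide visibility, which is why the single update of the main xid in step 2 acts as the commit point for the whole tree.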

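The entry points that callers actually use are in transam.c; clog.c and subtrans.c provide the low-level interfaces.

So what is a subtransaction?
Using a savepoint creates a subtransaction. Like its parent, a subtransaction may consume an XID. Once a subtransaction has been assigned an XID, the atomic CLOG update comes into play, because the CLOG state of the parent and of all its subtransactions must stay consistent.
When a subtransaction has not consumed an XID, it is told apart by its SubTransactionId.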
src/backend/access/transam/README

Transaction and Subtransaction Numbering
----------------------------------------
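Both transactions and subtransactions can have XIDs; like a top-level transaction, a subtransaction is assigned an XID only when it actually needs one.
In other words, a transaction that has subtransactions may consume several XIDs.
Also note that before a subtransaction can be given an XID, its parent must be assigned one first, which guarantees that the subtransaction's XID is allocated after the parent's.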
Transactions and subtransactions are assigned permanent XIDs only when/if
they first do something that requires one --- typically, insert/update/delete
a tuple, though there are a few other places that need an XID assigned.
If a subtransaction requires an XID, we always first assign one to its
parent.  This maintains the invariant that child transactions have XIDs later
than their parents, which is assumed in a number of places.
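
The "parent first" rule in the paragraph above can be illustrated with a small Python sketch. The `Xact` class and `assign_xid` function are hypothetical names made up for this example, and an `itertools.count` iterator stands in for the shared XID counter; the real backend code does considerably more (locking the XID, writing pg_subtrans, and so on).

```python
import itertools


class Xact:
    """A transaction or subtransaction that may not have an XID yet."""

    def __init__(self, parent=None):
        self.parent = parent
        self.xid = None          # assigned lazily, on first need


def assign_xid(xact, next_xid):
    """Give xact an XID on first need, assigning its ancestors first.

    next_xid is an iterator standing in for the global XID counter.
    """
    if xact.xid is None:
        if xact.parent is not None:
            assign_xid(xact.parent, next_xid)   # the parent always goes first
        xact.xid = next(next_xid)
    return xact.xid


if __name__ == "__main__":
    counter = itertools.count(607466850)
    top = Xact()
    sub = Xact(parent=top)
    assign_xid(sub, counter)     # forces an XID for top before sub
    print(top.xid, sub.xid)
```

Because ancestors are always handled before the child, every child ends up with a later XID than its parent, which is the invariant the README relies on.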

The subsidiary actions of obtaining a lock on the XID and entering it into
pg_subtrans and PG_PROC are done at the time it is assigned.

A transaction that has no XID still needs to be identified for various
purposes, notably holding locks.  For this purpose we assign a "virtual
transaction ID" or VXID to each top-level transaction.  VXIDs are formed from
two fields, the backendID and a backend-local counter; this arrangement allows
assignment of a new VXID at transaction start without any contention for
shared memory.  To ensure that a VXID isn't re-used too soon after backend
exit, we store the last local counter value into shared memory at backend
exit, and initialize it from the previous value for the same backendID slot
at backend start.  All these counters go back to zero at shared memory
re-initialization, but that's OK because VXIDs never appear anywhere on-disk.

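How are subtransactions told apart when they have not been assigned an XID?
This is where the SubTransactionId type comes in: the top-level transaction has SubTransactionId 1, and each subtransaction after it gets the next value. SubTransactionId is a uint32.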
Internally, a backend needs a way to identify subtransactions whether or not
they have XIDs; but this need only lasts as long as the parent top transaction
endures.  Therefore, we have SubTransactionId, which is somewhat like
CommandId in that it's generated from a counter that we reset at the start of
each top transaction.  The top-level transaction itself has SubTransactionId 1,
and subtransactions have IDs 2 and up.  (Zero is reserved for
InvalidSubTransactionId.)  Note that subtransactions do not have their
own VXIDs; they use the parent top transaction's VXID.
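
As a rough illustration of the numbering just described, here is a minimal Python sketch of a per-transaction SubTransactionId counter. The `TopTransaction` class and its method names are hypothetical; the sketch only models the counter and the nesting stack, and it assumes that rolling back to the innermost savepoint discards that subtransaction and starts a fresh one with a new ID.

```python
INVALID_SUBTRANSACTION_ID = 0    # reserved, never handed out
TOP_SUBTRANSACTION_ID = 1        # the top-level transaction itself


class TopTransaction:
    """Tracks SubTransactionIds for one top-level transaction (sketch)."""

    def __init__(self):
        # The counter restarts for every top-level transaction.
        self.current_sub_id = TOP_SUBTRANSACTION_ID
        self.stack = [TOP_SUBTRANSACTION_ID]

    def savepoint(self):
        """Start a nested subtransaction and return its SubTransactionId."""
        self.current_sub_id += 1
        self.stack.append(self.current_sub_id)
        return self.current_sub_id

    def rollback_to_innermost(self):
        """Abort the innermost subtransaction and start a replacement."""
        if len(self.stack) == 1:
            raise RuntimeError("no savepoint to roll back to")
        self.stack.pop()
        return self.savepoint()


if __name__ == "__main__":
    tx = TopTransaction()
    print(tx.savepoint())              # first subtransaction
    print(tx.rollback_to_innermost())  # replacement gets a new ID
    print(tx.stack)
```

None of these IDs is an XID: they live only in backend memory for the duration of the top-level transaction, which is why they can be handed out without touching shared state.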


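Each pg_subtrans entry takes 4 bytes (it stores the parent's XID), and a slot is reserved for every assigned XID, top-level or not, whereas CLOG uses only 2 bits per transaction. For the same range of XIDs, pg_subtrans therefore needs 16 times as much space as CLOG and produces more files.
Also note that a subtransaction is not necessarily assigned an XID. Both CLOG and pg_subtrans are indexed by XID (see the macros below), so a subtransaction that never gets an XID has no entry in either of them; it is identified only by its backend-local SubTransactionId.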
src/backend/access/transam/subtrans.c

/*
 * Defines for SubTrans page sizes.  A page is the same BLCKSZ as is used
 * everywhere else in Postgres.
 *
 * Note: because TransactionIds are 32 bits and wrap around at 0xFFFFFFFF,
 * SubTrans page numbering also wraps around at
 * 0xFFFFFFFF/SUBTRANS_XACTS_PER_PAGE, and segment numbering at
 * 0xFFFFFFFF/SUBTRANS_XACTS_PER_PAGE/SLRU_PAGES_PER_SEGMENT.  We need take no
 * explicit notice of that fact in this module, except when comparing segment
 * and page numbers in TruncateSUBTRANS (see SubTransPagePrecedes).
 */

/* We need four bytes per xact */
#define SUBTRANS_XACTS_PER_PAGE (BLCKSZ / sizeof(TransactionId))

#define TransactionIdToPage(xid) ((xid) / (TransactionId) SUBTRANS_XACTS_PER_PAGE)
#define TransactionIdToEntry(xid) ((xid) % (TransactionId) SUBTRANS_XACTS_PER_PAGE)
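
To make the macros above concrete, the following Python sketch redoes the same integer arithmetic for the top-level XID used later in this post. It assumes the default BLCKSZ of 8192 and SLRU_PAGES_PER_SEGMENT = 32 (the value I believe slru.h defines); `subtrans_location` is a hypothetical helper name, not a PostgreSQL function.

```python
BLCKSZ = 8192                    # assumption: default block size
SIZEOF_TRANSACTION_ID = 4        # TransactionId is a 32-bit integer
SLRU_PAGES_PER_SEGMENT = 32      # assumption: value from slru.h

SUBTRANS_XACTS_PER_PAGE = BLCKSZ // SIZEOF_TRANSACTION_ID   # 4 bytes per xact
CLOG_XACTS_PER_PAGE = BLCKSZ * 8 // 2                       # 2 bits per xact


def subtrans_location(xid):
    """Return (segment, page, entry) of xid's slot in pg_subtrans."""
    page = xid // SUBTRANS_XACTS_PER_PAGE     # TransactionIdToPage
    entry = xid % SUBTRANS_XACTS_PER_PAGE     # TransactionIdToEntry
    segment = page // SLRU_PAGES_PER_SEGMENT
    return segment, page, entry


if __name__ == "__main__":
    xid = 607466850
    segment, page, entry = subtrans_location(xid)
    print("pg_subtrans segment %04X, page %d, entry %d" % (segment, page, entry))
    print("CLOG page", xid // CLOG_XACTS_PER_PAGE)
    print("space ratio", CLOG_XACTS_PER_PAGE // SUBTRANS_XACTS_PER_PAGE)
```

With these settings a pg_subtrans page covers 2048 XIDs while a CLOG page covers 32768, so XIDs that are close together, like the ones in the test session below, normally share a CLOG page and commit through the single-page path.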


Verification:

postgres@digoal-> psql
psql (9.4.4)
Type "help" for help.
postgres=# select pg_backend_pid();
 pg_backend_pid 
----------------
           5749
(1 row)

Tracing script:

[root@digoal ~]# cat trc.stp 
global f_start[999999]

probe process("/opt/pgsql/bin/postgres").function("*@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c").call { 
   f_start[execname(), pid(), tid(), cpu()] = gettimeofday_ms()
   printf("%s -> time:%d, pp:%s, par:%s\n", thread_indent(-1), gettimeofday_ms(), pp(), $$parms$$)
   # printf("%s -> time:%d, pp:%s\n", thread_indent(1), f_start[execname(), pid(), tid(), cpu()], pp() )
}

probe process("/opt/pgsql/bin/postgres").function("*@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c").return {
  t=gettimeofday_ms()
  a=execname()
  b=cpu()
  c=pid()
  d=pp()
  e=tid()
  if (f_start[a,c,e,b]) {
  printf("%s <- time:%d, pp:%s, par:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d, $$locals$$)
  # printf("%s <- time:%d, pp:%s\n", thread_indent(-1), t - f_start[a,c,e,b], d)
  }
}


Run the following SQL:

postgres@digoal-> psql
psql (9.4.4)
Type "help" for help.
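postgres=# begin;  -- The top-level transaction starts; no XID is assigned yet.
BEGIN
postgres=# select txid_current();  -- txid_current() forces an XID to be assigned to the top-level transaction.
 txid_current 
--------------
    607466850
(1 row)
postgres=# savepoint a;  -- Start a subtransaction; no XID is assigned yet. Its parent XID is 607466850.
SAVEPOINT
postgres=# \dt
        List of relations
 Schema | Name | Type  |  Owner   
--------+------+-------+----------
 public | t    | table | postgres
 public | test | table | postgres
(2 rows)
postgres=# delete from t;  -- DML inside the subtransaction: XID 607466851 is assigned.
DELETE 2
postgres=# rollback to a;  -- Roll back the subtransaction and start a new one; no XID is assigned yet. Its parent XID is 607466850.
ROLLBACK
postgres=# delete from t;  -- DML inside the subtransaction: XID 607466852 is assigned.
DELETE 2
postgres=# rollback to a;  -- Roll back the subtransaction and start a new one; no XID is assigned yet. Its parent XID is 607466850.
ROLLBACK
postgres=# delete from t;  -- DML inside the subtransaction: XID 607466853 is assigned.
DELETE 2
postgres=# rollback to a;  -- Roll back the subtransaction and start a new one; no XID is assigned yet. Its parent XID is 607466850.
ROLLBACK
postgres=# insert into t values (1);  -- DML inside the subtransaction: XID 607466854 is assigned.
INSERT 0 1
postgres=# insert into t values (1);
INSERT 0 1
postgres=# insert into t values (1);
INSERT 0 1
postgres=# savepoint b;  -- Start a subtransaction; no XID is assigned yet. Its parent XID is 607466854.
SAVEPOINT
postgres=# insert into t values (1);  -- DML inside the subtransaction: XID 607466855 is assigned.
INSERT 0 1
postgres=# insert into t values (1);
INSERT 0 1
postgres=# savepoint c;  -- Start a subtransaction; no XID is assigned yet. Its parent XID is 607466855.
SAVEPOINT
postgres=# insert into t values (1);  -- DML inside the subtransaction: XID 607466856 is assigned.
INSERT 0 1
postgres=# savepoint d;  -- Start a subtransaction; no XID is assigned yet. Its parent XID is 607466856.
SAVEPOINT
postgres=# insert into t values (1);  -- DML inside the subtransaction: XID 607466857 is assigned.
INSERT 0 1
postgres=# rollback to a;  -- Roll back the subtransaction and start a new one; no XID is assigned yet. Its parent XID is 607466850.
ROLLBACK
postgres=# insert into t values (1);  -- DML inside the subtransaction: XID 607466858 is assigned.
INSERT 0 1
postgres=# select txid_current();  -- Check the XID of the top-level transaction.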
 txid_current 
--------------
    607466850
(1 row)


Trace output:

[root@digoal ~]# stap -vp 5 -DMAXSKIPPED=9999999 -DSTP_NO_OVERLOAD -DMAXTRYLOCK=100 ./trc.stp -x 5749
Pass 1: parsed user script and 112 library script(s) using 209284virt/36876res/3172shr/34504data kb, in 110usr/90sys/192real ms.
Pass 2: analyzed script: 36 probe(s), 33 function(s), 4 embed(s), 27 global(s) using 223660virt/51416res/4248shr/48880data kb, in 0usr/130sys/134real ms.
Pass 3: using cached /root/.systemtap/cache/28/stap_282339931bbfe754a24af75ea3476930_35559.c
Pass 4: using cached /root/.systemtap/cache/28/stap_282339931bbfe754a24af75ea3476930_35559.ko
Pass 5: starting run.
     0 postgres(5749): -> time:1441519748850, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466848
    22 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
20726607 postgres(5749): -> time:1441519769576, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466849
20726671 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
69692931 postgres(5749): -> time:1441519818543, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466850
69692991 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
85924642 postgres(5749): -> time:1441519834774, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466851
85924720 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
85924766 postgres(5749): -> time:1441519834774, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466851 parent=607466850 overwriteOK='\000'
85924838 postgres(5749): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466851 ptr=0
102973659 postgres(5749): -> time:1441519851823, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466852
102973718 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
102973746 postgres(5749): -> time:1441519851823, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466852 parent=607466850 overwriteOK='\000'
102973782 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466852 ptr=0
112206905 postgres(5749): -> time:1441519861057, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466853
112206964 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
112206992 postgres(5749): -> time:1441519861057, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466853 parent=607466850 overwriteOK='\000'
112207028 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466853 ptr=0
152610154 postgres(5749): -> time:1441519901460, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466854
152610212 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
152610238 postgres(5749): -> time:1441519901460, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466854 parent=607466850 overwriteOK='\000'
152610275 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466854 ptr=0
167139858 postgres(5749): -> time:1441519915990, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466855
167139929 postgres(5749): <- time:1, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
167139958 postgres(5749): -> time:1441519915990, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466855 parent=607466854 overwriteOK='\000'
167139995 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466855 ptr=0
184727823 postgres(5749): -> time:1441519933578, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466856
184727849 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
184727859 postgres(5749): -> time:1441519933578, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466856 parent=607466855 overwriteOK='\000'
184727872 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466856 ptr=0
228240429 postgres(5749): -> time:1441519977090, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466857
228240493 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
228240520 postgres(5749): -> time:1441519977090, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466857 parent=607466856 overwriteOK='\000'
228240557 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466857 ptr=0
316079437 postgres(5749): -> time:1441520064929, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").call, par:newestXact=607466858
316079496 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("ExtendSUBTRANS@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:307").return, par:pageno=?
316079524 postgres(5749): -> time:1441520064929, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").call, par:xid=607466858 parent=607466850 overwriteOK='\000'
316079560 postgres(5749): <- time:0, pp:process("/opt/pgsql9.4.4/bin/postgres").function("SubTransSetParent@/opt/soft_bak/postgresql-9.4.4/src/backend/access/transam/subtrans.c:75").return, par:pageno=? entryno=? slotno=607466858 ptr=0


Open a new session and you will see that the subtransactions really did consume XIDs: newly assigned XIDs now start at 607466859.

postgres@digoal-> psql
psql (9.4.4)
Type "help" for help.
postgres=# select txid_current();
 txid_current 
--------------
    607466859
(1 row)
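
The parent links seen in the SubTransSetParent calls of the trace are enough to walk from any subtransaction XID back to its top-level transaction. The Python sketch below copies those xid/parent pairs from the trace and follows them; `topmost_xid` is a hypothetical helper written for this post, similar in spirit to a topmost-parent lookup over pg_subtrans but not PostgreSQL code.

```python
# xid -> parent pairs copied from the SubTransSetParent calls in the trace.
PARENT_OF = {
    607466851: 607466850,
    607466852: 607466850,
    607466853: 607466850,
    607466854: 607466850,
    607466855: 607466854,
    607466856: 607466855,
    607466857: 607466856,
    607466858: 607466850,
}


def topmost_xid(xid, parent_of):
    """Follow parent links until reaching an XID with no recorded parent."""
    while xid in parent_of:
        xid = parent_of[xid]
    return xid


def depth(xid, parent_of):
    """Number of parent hops between xid and its top-level transaction."""
    hops = 0
    while xid in parent_of:
        xid = parent_of[xid]
        hops += 1
    return hops


if __name__ == "__main__":
    for xid in sorted(PARENT_OF):
        print(xid, "->", topmost_xid(xid, PARENT_OF), "depth", depth(xid, PARENT_OF))
```

Counting the top-level XID together with the eight subtransaction XIDs gives nine XIDs, 607466850 through 607466858, which matches the next session starting at 607466859.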


[References]
src/backend/access/transam/clog.c
src/backend/access/transam/subtrans.c
src/backend/access/transam/transam.c
src/backend/access/transam/README
src/include/c.h:typedef uint32 SubTransactionId;