Postgres2015全国用户大会将于11月20至21日在北京丽亭华苑酒店召开。本次大会嘉宾阵容强大,国内顶级PostgreSQL数据库专家将悉数到场,并特邀欧洲、俄罗斯、日本、美国等国家和地区的数据库方面专家助阵:
- Postgres-XC项目的发起人铃木市一(SUZUKI Koichi)
- Postgres-XL的项目发起人Mason Sharp
- pgpool的作者石井达夫(Tatsuo Ishii)
- PG-Strom的作者海外浩平(Kaigai Kohei)
- Greenplum研发总监姚延栋
- 周正中(德哥), PostgreSQL中国用户会创始人之一
- 汪洋,平安科技数据库技术部经理
- ……
|
multixact是管理共享行锁的基础代码,一个MultiXactId对应一个MultiXactMember数组,每个MultiXactMember包含事务号以及该事务号对应的flag(MultiXactStatus)。
src/include/access/multixact.h
typedef struct MultiXactMember{TransactionId xid;MultiXactStatus status;} MultiXactMember;
typedef struct xl_multixact_create{MultiXactId mid; /* new MultiXact's ID */MultiXactOffset moff; /* its starting offset in members file */int32 nmembers; /* number of member XIDs */MultiXactMember members[FLEXIBLE_ARRAY_MEMBER];} xl_multixact_create;
src/backend/access/transam/multixact.c
/*-------------------------------------------------------------------------** multixact.c* PostgreSQL multi-transaction-log manager** The pg_multixact manager is a pg_clog-like manager that stores an array of* MultiXactMember for each MultiXactId. It is a fundamental part of the* shared-row-lock implementation. Each MultiXactMember is comprised of a* TransactionId and a set of flag bits. The name is a bit historical:* originally, a MultiXactId consisted of more than one TransactionId (except* in rare corner cases), hence "multi". Nowadays, however, it's perfectly* legitimate to have MultiXactIds that only include a single Xid.** The meaning of the flag bits is opaque to this module, but they are mostly* used in heapam.c to identify lock modes that each of the member transactions* is holding on any given tuple. This module just contains support to store* and retrieve the arrays.** We use two SLRU areas, one for storing the offsets at which the data* starts for each MultiXactId in the other one. This trick allows us to* store variable length arrays of TransactionIds. (We could alternatively* use one area containing counts and TransactionIds, with valid MultiXactId* values pointing at slots containing counts; but that way seems less robust* since it would get completely confused if someone inquired about a bogus* MultiXactId that pointed to an intermediate slot containing an XID.)** XLOG interactions: this module generates an XLOG record whenever a new* OFFSETs or MEMBERs page is initialized to zeroes, as well as an XLOG record* whenever a new MultiXactId is defined. This allows us to completely* rebuild the data entered since the last checkpoint during XLOG replay.* Because this is possible, we need not follow the normal rule of* "write WAL before data"; the only correctness guarantee needed is that* we flush and sync all dirty OFFSETs and MEMBERs pages to disk before a* checkpoint is considered complete. If a page does make it to disk ahead* of corresponding WAL records, it will be forcibly zeroed before use anyway.* Therefore, we don't need to mark our pages with LSN information; we have* enough synchronization already.** Like clog.c, and unlike subtrans.c, we have to preserve state across* crashes and ensure that MXID and offset numbering increases monotonically* across a crash. We do this in the same way as it's done for transaction* IDs: the WAL record is guaranteed to contain evidence of every MXID we* could need to worry about, and we just make sure that at the end of* replay, the next-MXID and next-offset counters are at least as large as* anything we saw during replay.** We are able to remove segments no longer necessary by carefully tracking* each table's used values: during vacuum, any multixact older than a certain* value is removed; the cutoff value is stored in pg_class. The minimum value* across all tables in each database is stored in pg_database, and the global* minimum across all databases is part of pg_control and is kept in shared* memory. At checkpoint time, after the value is known flushed in WAL, any* files that correspond to multixacts older than that value are removed.* (These files are also removed when a restartpoint is executed.)** When new multixactid values are to be created, care is taken that the* counter does not fall within the wraparound horizon considering the global* minimum value.** Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group* Portions Copyright (c) 1994, Regents of the University of California** src/backend/access/transam/multixact.c**-------------------------------------------------------------------------*/
因为每个MultiXactMember包含事务号xid以及该事务号对应的flag(MultiXactStatus),PG使用1个字节存储flag,为了对其,每4个字节和4个XID作为一个MultiXactMember组(刚好20个字节,含4个MultiXactMember)
同样multixact page和BLOCKSZ一样大,所以如果是8K的block size,可以包含409个MultiXactMember组。
如果你的某一条记录被409个事务同时加共享行锁,需要消耗一个8KB的multixact page。
/** The situation for members is a bit more complex: we store one byte of* additional flag bits for each TransactionId. To do this without getting* into alignment issues, we store four bytes of flags, and then the* corresponding 4 Xids. Each such 5-word (20-byte) set we call a "group", and* are stored as a whole in pages. Thus, with 8kB BLCKSZ, we keep 409 groups* per page. This wastes 12 bytes per page, but that's OK -- simplicity (and* performance) trumps space efficiency here.** Note that the "offset" macros work with byte offset, not array indexes, so* arithmetic must be done using "char *" pointers.*//* We need eight bits per xact, so one xact fits in a byte */#define MXACT_MEMBER_BITS_PER_XACT 8#define MXACT_MEMBER_FLAGS_PER_BYTE 1#define MXACT_MEMBER_XACT_BITMASK ((1 << MXACT_MEMBER_BITS_PER_XACT) - 1)
/* how many full bytes of flags are there in a group? */#define MULTIXACT_FLAGBYTES_PER_GROUP 4#define MULTIXACT_MEMBERS_PER_MEMBERGROUP \(MULTIXACT_FLAGBYTES_PER_GROUP * MXACT_MEMBER_FLAGS_PER_BYTE)/* size in bytes of a complete group */#define MULTIXACT_MEMBERGROUP_SIZE \(sizeof(TransactionId) * MULTIXACT_MEMBERS_PER_MEMBERGROUP + MULTIXACT_FLAGBYTES_PER_GROUP)#define MULTIXACT_MEMBERGROUPS_PER_PAGE (BLCKSZ / MULTIXACT_MEMBERGROUP_SIZE)#define MULTIXACT_MEMBERS_PER_PAGE \(MULTIXACT_MEMBERGROUPS_PER_PAGE * MULTIXACT_MEMBERS_PER_MEMBERGROUP)
multixact头文件:
有效的multixactid从1开始,最大为0xFFFFFFFF(和xid一致)。
src/include/access/multixact.h
/** The first two MultiXactId values are reserved to store the truncation Xid* and epoch of the first segment, so we start assigning multixact values from* 2.*/#define InvalidMultiXactId ((MultiXactId) 0)#define FirstMultiXactId ((MultiXactId) 1)#define MaxMultiXactId ((MultiXactId) 0xFFFFFFFF)
#define MultiXactIdIsValid(multi) ((multi) != InvalidMultiXactId)
#define MaxMultiXactOffset ((MultiXactOffset) 0xFFFFFFFF)
/* Number of SLRU buffers to use for multixact */#define NUM_MXACTOFFSET_BUFFERS 8#define NUM_MXACTMEMBER_BUFFERS 16
目前使用1个字节存储flag,即MultiXactStatus,现在支持6个状态值。
可以看到这些状态涉及6种行锁操作。
FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATEan update that doesn't touch "key" columnsother updates, and delete/** Possible multixact lock modes ("status"). The first four modes are for* tuple locks (FOR KEY SHARE, FOR SHARE, FOR NO KEY UPDATE, FOR UPDATE); the* next two are used for update and delete modes.*/typedef enum{MultiXactStatusForKeyShare = 0x00,MultiXactStatusForShare = 0x01,MultiXactStatusForNoKeyUpdate = 0x02,MultiXactStatusForUpdate = 0x03,/* an update that doesn't touch "key" columns */MultiXactStatusNoKeyUpdate = 0x04,/* other updates, and delete */MultiXactStatusUpdate = 0x05} MultiXactStatus;
[参考]
src/include/access/multixact.h
src/backend/access/transam/multixact.c