openGauss存储结构浅析-relation heappage

openGauss 存储结构浅析-relation heappage

Page 整体布局

openGauss 基于 pg9.2 开发,在存储数据结构上基本上沿用 PG 的数据结构,并稍微修改。整体布局上这里盗了一张 PG 的图,进行介绍。

这张 page 结构图中少了在 page 尾部的 special space 数据区,一般存放与索引相关的特定数据,table relation 没有该区域。

整理布局分为 4 个部分:

1、PageHeaderData/HeapPageHeaderData 页面头/heap 页面头数据。

2、linp array,行指针数组。

3、free space 空闲空间。

4、tuple 数据,行数据。

5、special space,在页面尾部,通过 pagehead 中的 pd_special 指向该区域, 一般存放与索引相关的特定数据,table relation page 该区域为空,这张图没画出来。

下面逐个介绍:

HeapPageHeaderData

./src/include/storage/bufpage.h
typedef struct {
    /* XXX LSN is member of *any* block, not only page-organized ones */
    PageXLogRecPtr pd_lsn;    /* LSN: next byte after last byte of xlog
                               * record for last change to this page */
    uint16 pd_checksum;       /* checksum */
    uint16 pd_flags;          /* flag bits, see below */
    LocationIndex pd_lower;   /* offset to start of free space */
    LocationIndex pd_upper;   /* offset to end of free space */
    LocationIndex pd_special; /* offset to start of special space */
    uint16 pd_pagesize_version;
    ShortTransactionId pd_prune_xid;           /* oldest prunable XID, or zero if none */
    TransactionId pd_xid_base;                 /* base value for transaction IDs on page */
    TransactionId pd_multi_base;               /* base value for multixact IDs on page */
    ItemIdData pd_linp[FLEXIBLE_ARRAY_MEMBER]; /* beginning of line pointer array */
} HeapPageHeaderData;
disk page organization
space management information generic to any page
    pd_lsn        - identifies xlog record for last change to this page.
    pd_checksum    - page checksum, if set.
    pd_flags    - flag bits.
    pd_lower    - offset to start of free space.
    pd_upper    - offset to end of free space.
    pd_special    - offset to start of special space.
    pd_pagesize_version - size in bytes and page layout version number.
    pd_prune_xid - oldest XID among potentially prunable tuples on page.

pg 的 PageHeaderData 为 20 字节,openGauss 的 HeapPageHeaderData 在此基础上增加了两个 8 字节数据项,pd_xid_base,pd_multi_base。这也说明 opengauss 把事物 ID 从 4 字节扩到了 8 字节。

空闲空间起始位置:

  • pd_lower 指向 freespace 开始,也就是 linp array 结尾处。
  • pd_upper 执行 freespace 结束,也就是 tuple data 开始处。

linp array

typedef struct ItemIdData {
    unsigned lp_off : 15, /* offset to tuple (from start of page) */
        lp_flags : 2,     /* state of item pointer, see below */
        lp_len : 15;      /* byte length of tuple */
} ItemIdData;

lp_flags
define LP_UNUSED 0   /* unused (should always have lp_len=0) */
define LP_NORMAL 1   /* used (should always have lp_len>0) */
define LP_REDIRECT 2 /* HOT redirect (should have lp_len=0) */
define LP_DEAD 3     /* dead, may or may not have storage */
  • 每个行指针大小为 4 字节,也就是 32bit。
  • 低 15 位为 lp_off,tuple 起始位置在页面内的偏移量。
  • 中间 2 位为 lp_flags,标志信息。

    • LP_UNUSED 0 未使用,lp_len=0
    • LP_NORMAL 1 正常状态,lp_len>0
    • LP_REDIRECT 2 发生 HOT 时,执行行新的位置,lp_len=0
    • LP_DEAD 3 行为 dead
  • lp_len tuple 占用空间长度。

  • 计算 linp 数组在 heappageheader 之后,结束于 pd_lower 计算 linp 指针的个数,(pd_lower-linp 开始位置)/4。

freespace 空闲空间

页面内的空闲空间。

linp 从 pd_lower 处往后增长。

tuple data 从尾部,也就是 pd_upper 往前增长。

tuple data

tuple 数据,也是页面最重要的部分,由 tuple header+tuple data 组成。

tuple header

  • Tuple 头部是由 23byte 固定大小的前缀和可选的 NullBitMap 构成。
  • t_xmin:代表插入此元组的事务 xid;
  • t_xmax:代表更新或者删除此元组的事务 xid,如果该元组插入后未进行更新或者删除,t_xmax=0;
  • t_cid:command id,代表在当前事务中,已经执行过多少条 sql,例如执行第一条 sql 时 cid=0,执行第二条 sql 时 cid=1;
  • t_ctid:待研究,在 pg 中为 update 后旧版本指向新 tuple 的指针。
  • t_infomask 标志位,记录各种信息,如是否存在 null 列,是否有变长列,是否有 OID 列等。
  • 如果有允许为空的列,则存在 null bitmap,可以通过 t_infomask 判断 t_infomask&0x0001, bitmap 的大小与列个数有关。
  • t_hoff 记录 header 的大小,包含 null bitmap,padding。
  • tuple header 后会有 padding,使 tuple header 的大小为 8 的整数倍。

    typedef struct HeapTupleHeaderData {
        union {
            HeapTupleFields t_heap;
            DatumTupleFields t_datum;
        } t_choice;
        ItemPointerData t_ctid; /* current TID of this or newer tuple */
        /* Fields below here must match MinimalTupleData! */
        uint16 t_infomask2; /* number of attributes + various flags */
        uint16 t_infomask; /* various flag bits, see below */
        int8 t_hoff; /* sizeof header incl. bitmap, padding */
        /* ^ - 23 bytes - ^ */
        bits8 t_bits[FLEXIBLE_ARRAY_MEMBER]; /* bitmap of NULLs -- VARIABLE LENGTH */
        /* MORE DATA FOLLOWS AT END OF STRUCT */
    } HeapTupleHeaderData;
    
    typedef struct HeapTupleFields {
        ShortTransactionId t_xmin; /* inserting xact ID */
        ShortTransactionId t_xmax; /* deleting or locking xact ID */
        union {
            CommandId t_cid;           /* inserting or deleting command ID, or both */
            ShortTransactionId t_xvac; /* old-style VACUUM FULL xact ID */
        } t_field3;
    } HeapTupleFields;
    
    /*
     * information stored in t_infomask:
     */
    #define HEAP_HASNULL 0x0001          /* has null attribute(s) */
    #define HEAP_HASVARWIDTH 0x0002      /* has variable-width attribute(s) */
    #define HEAP_HASEXTERNAL 0x0004      /* has external stored attribute(s) */
    #define HEAP_HASOID 0x0008           /* has an object-id field */
    #define HEAP_COMPRESSED 0x0010       /* has compressed data */
    #define HEAP_COMBOCID 0x0020         /* t_cid is a combo cid */
    #define HEAP_XMAX_EXCL_LOCK 0x0040   /* xmax is exclusive locker */
    #define HEAP_XMAX_SHARED_LOCK 0x0080 /* xmax is shared locker */
    /* if either LOCK bit is set, xmax hasn't deleted the tuple, only locked it */
    #define HEAP_IS_LOCKED (HEAP_XMAX_EXCL_LOCK | HEAP_XMAX_SHARED_LOCK)
    #define HEAP_XMIN_COMMITTED 0x0100 /* t_xmin committed */
    #define HEAP_XMIN_INVALID 0x0200   /* t_xmin invalid/aborted */
    #define HEAP_XMIN_FROZEN (HEAP_XMIN_INVALID | HEAP_XMIN_COMMITTED)
    #define HEAP_XMAX_COMMITTED 0x0400 /* t_xmax committed */
    #define HEAP_XMAX_INVALID 0x0800   /* t_xmax invalid/aborted */
    #define HEAP_XMAX_IS_MULTI 0x1000  /* t_xmax is a MultiXactId */
    #define HEAP_UPDATED 0x2000        /* this is UPDATEd version of row */
    #define HEAP_MOVED_OFF                          \
        0x4000 /* moved to another place by pre-9.0 \
                * VACUUM FULL; kept for binary      \
                * upgrade support */
    #define HEAP_MOVED_IN                             \
        0x8000 /* moved from another place by pre-9.0 \
                * VACUUM FULL; kept for binary        \
                * upgrade support */
    #define HEAP_MOVED (HEAP_MOVED_OFF | HEAP_MOVED_IN)
    
    #define HEAP_XACT_MASK 0xFFE0 /* visibility-related bits */
    
    /*
     * information stored in t_infomask2:
     */
    #define HEAP_NATTS_MASK 0x07FF /* 11 bits for number of attributes */
    /* bits 0x1800 are available */
    #define HEAP_HAS_REDIS_COLUMNS 0x2000 /* tuple has hidden columns added by redis */
    #define HEAP_HOT_UPDATED 0x4000 /* tuple was HOT-updated */
    #define HEAP_ONLY_TUPLE 0x8000  /* this is heap-only tuple */
    
    #define HEAP2_XACT_MASK 0xC000 /* visibility-related bits */
    

tuple data

1、tuple data 有多个列(属性)组成。

2、每个属性的长度分定长与变长。

3、每个属性有不同的字节对齐。

4、每个属性有不同的存储策略。

5、tuple data 中不存储 null 值。

6、起始位置 8 字节对齐。

变长类型(typlen = -1)存储策略

  • PLAN 避免压缩和行外存储
  • EXTENDED 允许压缩和行外存储,大多数可以 TOAST 的数据类型的默认策略
  • EXTERNAL 允许行外存储,但不许压缩
  • MAIN 允许压缩,尽量不使用行外存储

参考 pg_type.typstorage

  • p: Value must always be stored plain.
  • e: Value can be stored in a “secondary” relation (if relation has one, see pg_class.reltoastrelid).
  • m: Value can be stored compressed inline.
  • x: Value can be stored compressed inline or stored in “secondary” storage.

对齐(起始地址需要是某个长度的整数倍)

参考 pg_type.typalign

  • c = char alignment, i.e., no alignment needed.
  • s = short alignment (2 bytes on most machines).
  • i = int alignment (4 bytes on most machines).
  • d = double alignment (8 bytes on many machines, but by no means all).
 typalign | typstorage | count
----------+------------+-------
 c        | p          |     6
 i        | m          |     3
 d        | x          |   269
 s        | p          |     3
 d        | p          |    17
 i        | p          |    39
 i        | x          |    70
(7 rows)

通过 pg_type 可以查到各个数据类型数据存储长度,对齐大小,存储策略也可以通过 pg_attribute 去查表上各列的相关属性。如下面 t 表示例:

create table  t(id int,id2 bigint,c varchar(100),d date,ts timestamp);
insert into t select i,i,'test'||i,now(),now() from generate_series(1,1000)i;

test=# SELECT a.attname,
test-#   pg_catalog.format_type(a.atttypid, a.atttypmod),attlen,
test-#   (SELECT substring(pg_catalog.pg_get_expr(d.adbin, d.adrelid) for 128)
test(#    FROM pg_catalog.pg_attrdef d
test(#    WHERE d.adrelid = a.attrelid AND d.adnum = a.attnum AND a.atthasdef),
test-#   a.attnotnull, a.attnum,
test-#   (SELECT c.collname FROM pg_catalog.pg_collation c, pg_catalog.pg_type t
test(#    WHERE c.oid = a.attcollation AND t.oid = a.atttypid AND a.attcollation <> t.typcollation) AS attcollation,
test-#   a.attidentity,
test-#   NULL AS indexdef,
test-#   NULL AS attfdwoptions,
test-#   a.attstorage,attalign,
test-#   CASE WHEN a.attstattarget=-1 THEN NULL ELSE a.attstattarget END AS attstattarget, pg_catalog.col_description(a.attrelid, a.attnum)
test-# FROM pg_catalog.pg_attribute a
test-# WHERE a.attrelid = 't'::regclass AND a.attnum > 0 AND NOT a.attisdropped
test-# ORDER BY a.attnum;
attname |         format_type         | attlen | substring | attnotnull | attnum | attstorage | attalign
---------+-----------------------------+--------+-----------+------------+--------+------------+----------
id      | integer                     |      4 |           | f          |      1 | p          | i
id2     | bigint                      |      8 |           | f          |      2 | p          | d
c       | character varying(100)      |     -1 |           | f          |      3 | x          | i
d       | date                        |      4 |           | f          |      4 | p          | i
ts      | timestamp without time zone |      8 |           | f          |      5 | p          | d
(5 rows)

test=# select * from heap_page_items(get_raw_page('t',0));
lp  | lp_off | lp_flags | lp_len |  t_xmin   | t_xmax | t_field3 | t_ctid  | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |                                       t_data
-----+--------+----------+--------+-----------+--------+----------+---------+-------------+------------+--------+--------+-------+------------------------------------------------------------------------------------
  1 |   8128 |        1 |     64 | 377048720 |      0 |        0 | (0,1)   |           5 |       2306 |     24 |        |       | \x010000000000000001000000000000000d74657374310000e91d000000000000961c1f75bc590200
  2 |   8064 |        1 |     64 | 377048720 |      0 |        0 | (0,2)   |           5 |       2306 |     24 |        |       | \x020000000000000002000000000000000d74657374320000e91d000000000000961c1f75bc590200
  3 |   8000 |        1 |     64 | 377048720 |      0 |        0 | (0,3)   |           5 |       2306 |     24 |        |       | \x030000000000000003000000000000000d74657374330000e91d000000000000961c1f75bc590200
  4 |   7936 |        1 |     64 | 377048720 |      0 |        0 | (0,4)   |           5 |       2306 |     24 |        |       | \x040000000000000004000000000000000d74657374340000e91d000000000000961c1f75bc590200
  5 |   7872 |        1 |     64 | 377048720 |      0 |        0 | (0,5)   |           5 |       2306 |     24 |        |       | \x050000000000000005000000000000000d74657374350000e91d000000000000961c1f75bc590200

                                                                                                                                 0100000000000000 0100000000000000 0d74657374310000 e91d000000000000 961c1f75bc590200
test=# select * from heap_page_item_attrs(get_raw_page('t',0),'t');
lp  | lp_off | lp_flags | lp_len |  t_xmin   | t_xmax | t_field3 | t_ctid  | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid |                                             t_attrs

-----+--------+----------+--------+-----------+--------+----------+---------+-------------+------------+--------+--------+-------+-------------------------------------------------------------------------------------------
------
  1 |   8128 |        1 |     64 | 377048720 |      0 |        0 | (0,1)   |           5 |       2306 |     24 |        |       | {"\\x01000000","\\x0100000000000000","\\x0d7465737431","\\xe91d0000","\\x961c1f75bc590200"
}
  2 |   8064 |        1 |     64 | 377048720 |      0 |        0 | (0,2)   |           5 |       2306 |     24 |        |       | {"\\x02000000","\\x0200000000000000","\\x0d7465737432","\\xe91d0000","\\x961c1f75bc590200"
}
  3 |   8000 |        1 |     64 | 377048720 |      0 |        0 | (0,3)   |           5 |       2306 |     24 |        |       | {"\\x03000000","\\x0300000000000000","\\x0d7465737433","\\xe91d0000","\\x961c1f75bc590200"
}
  4 |   7936 |        1 |     64 | 377048720 |      0 |        0 | (0,4)   |           5 |       2306 |     24 |        |       | {"\\x04000000","\\x0400000000000000","\\x0d7465737434","\\xe91d0000","\\x961c1f75bc590200"
}
  5 |   7872 |        1 |     64 | 377048720 |      0 |        0 | (0,5)   |           5 |       2306 |     24 |        |       | {"\\x05000000","\\x0500000000000000","\\x0d7465737435","\\xe91d0000","\\x961c1f75bc590200"
}

变长类型的存储

定长数据类型根据 pg_attribute.attlen 对齐后直接存储在 tupledata 中,变长数据类型由于长度不固定,存储时有些特殊。需要在数据中保存长度等信息。pg_attribute.attlen==-1 则为变长类型

1、第一个字节(va_header)=0x01,第二个字节 va_tag=18(VARTAG_INDIRECT)toast 机制,1 字节 va_header+1 字节 va_tag+4*4 字节 varatt_external(toast 指针)。

typedef struct varatt_external {
    int32 va_rawsize;  /* Original data size (includes header) */
    int32 va_extsize;  /* External saved size (doesn't) */
    Oid va_valueid;    /* Unique ID of value within TOAST table */
    Oid va_toastrelid; /* RelID of TOAST table containing it */
} varatt_external;

2、short varlena,当长度<=126 时,判断 va_header!=0x01 and va_header&0x01==0x01,第一个字节前 7 位存储大小,最后一位为 0x01。

3、full 4-byte header varlena (前 4 字节>>2)&0x3FFFFFFF 数据长度。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值