database cluster 在哪?
在pg启动的时候,会用到一个参数:PGDATA,所以,一个cluster中所有的数据都是存放在变量PGDATA目录下。
database 在哪?
每个database的信息存放在PGDATA/base目录下,目录名以database的
命名。
psql中查询database的oid:
postgres@postgres:5432 # select oid ,datname from pg_database;
oid | datname
--------+-----------
1 | template1
13076 | template0
13081 | postgres
24580 | testdb
49471 | testdb02
281734 | pgbench
(6 rows)
查看PGDATA/base的子目录:
[postgres@dang-db base]$ll
total 68
drwx------ 2 postgres postgres 12288 Aug 11 10:53 1
drwx------ 2 postgres postgres 4096 Aug 11 10:53 13076
drwx------ 2 postgres postgres 12288 Aug 30 10:36 13081
drwx------ 2 postgres postgres 12288 Aug 30 10:36 24580
drwx------ 2 postgres postgres 12288 Aug 30 10:36 281734
drwx------ 2 postgres postgres 12288 Aug 29 17:53 49471
drwx------ 2 postgres postgres 4096 Aug 30 09:45 pgsql_tmp
可以看到每个database对应的oid均在base目录下。
目录1 存放的是数据库template1的文件。
tablespace在哪?
表空间分为系统表空间和用户自定义表空间,系统表空间有两个:pg_default和pg_global。
pg_default的文件存放在 PGDATA/base/ 目录下。
pg_global的文件存放在 PGDATA/global 目录下。
用户自定义表空间在PGDATA/pg_tblspc存在一个软连接,指向实际的存放目录,软连接以表空间的oid命名。
查看 用户自定义表空间tbs_test的oid
[postgres@dang-db tbs]$oid2name -s
All tablespaces:
Oid Tablespace Name
-------------------------
1663 pg_default
1664 pg_global
306000 tbs_test
查看PGDATA/pg_tblspc目录的内容:
[postgres@dang-db pg_tblspc]$ll
total 0
lrwxrwxrwx 1 postgres postgres 20 Aug 30 11:05 306000 -> /postgres/adb2_1/tbs
查看真正存放表空间文件的目录内容:
[postgres@dang-db tbs]$cd /postgres/adb2_1/tbs
[postgres@dang-db tbs]$ll
total 24
drwx------ 2 postgres postgres 4096 Aug 30 11:05 PG_9.3_201507012_coord1
drwx------ 2 postgres postgres 4096 Aug 30 11:05 PG_9.3_201507012_coord2
drwx------ 2 postgres postgres 4096 Aug 30 11:05 PG_9.3_201507012_coord3
drwx------ 2 postgres postgres 4096 Aug 30 11:05 PG_9.3_201507012_datanode1
drwx------ 2 postgres postgres 4096 Aug 30 11:05 PG_9.3_201507012_datanode2
drwx------ 2 postgres postgres 4096 Aug 30 11:05 PG_9.3_201507012_datanode3
[postgres@dang-db tbs]$
table 在哪?如何存放?
pg中每个表都是一个文件,文件以filenode命名。filenode的值可以从pg_class的relfilenode字段获取。
当表大小超过1G的时候,表文件会被分割,第一个文件的名字是filenode,第二个则是filenode.1,以此类推。1G是pg的默认值,该值可以通过编译参数--with-segsize来改变。
查询pgbench_accounts的filenode:
postgres@pgbench:5532 # select
pgbench-# c.relname,
pgbench-# c.relfilenode,
pgbench-# n.nspname
pgbench-# from
pgbench-# pg_class c,
pgbench-# pg_namespace n
pgbench-# where 1=1
pgbench-# and c.relnamespace = n.oid
pgbench-# --and n.nspname like '%user01%'
pgbench-# and c.relname like 'pgbench_accounts';
relname | relfilenode | nspname
------------------+-------------+---------
pgbench_accounts | 33376 | public
(1 row)
也可以通过函数 pg_relation_filepath() 得到relation的存放路径:
postgres@pgbench:5532 # select pg_relation_filepath('pgbench_accounts') ;
pg_relation_filepath
----------------------
base/33344/33376
(1 row)
postgres@pgbench:5532 # \d+
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------------------+-------+----------+---------+-------------
public | pgbench_accounts | table | postgres | 1281 MB |
cd $PGDATA/base/33344中,查看33376的信息:
[postgres@dang-db 33344]$ll | grep 33376
-rw------- 1 postgres postgres 1073741824 Aug 30 14:32 33376
-rw------- 1 postgres postgres 269213696 Aug 30 14:33 33376.1
-rw------- 1 postgres postgres 352256 Aug 30 14:32 33376_fsm
-rw------- 1 postgres postgres 24576 Aug 30 14:32 33376_vm
[postgres@dang-db 33344]$pwd
/postgres/pgsql/pgdata/base/33344
由于pgbench_accounts的size大于1G,所有被分割,形成了两个文件:33376和33376.1
表的大小在文件中也会反映,在数据库中查询表的大小:
relname | relpages
------------------+----------
pgbench_accounts | 163935
默认情况下,每个page大小为8k,表文件大小应该是relpages*block_size的值相等:
postgres@pgbench:5532 # select 163935*8192;
?column?
------------
1342955520
(1 row)
postgres@pgbench:5532 # select 1073741824+269213696;
?column?
------------
1342955520
(1 row)
另外注意到,与表filenode相关的还有两个文件,filenode_fsm和filenode_vm,其中,
fsm为 free space map,vm为Visibility map。
fsm跟踪表文件中的空闲空间,vm跟踪哪些page包含对所有事务都可见的tuple。
在大多数情况下,表中数据的存放是无序的,我们称之为堆表,heap table.
多条数据一起存放在一个page中,多个page形成一个数据文件,pg中最小的io单位为page,所有每个文件的大小一定是page size的整数倍。
根据官方文档的描述,page的结构如下:
Item | Description |
---|---|
PageHeaderData | 24 bytes long. Contains general information about the page, including free space pointers. |
ItemIdData | Array of (offset,length) pairs pointing to the actual items. 4 bytes per item. |
Free space | The unallocated space. New item pointers are allocated from the start of this area, new items from the end. |
Items | The actual items themselves. |
Special space | Index access method specific data. Different methods store different data. Empty in ordinary tables. |
每个page的结构都是如此。insert数据的时候,是从free space的底部开始往上,一直到空闲空间不足以放下一条数据为止。每条数据都有一个内部列ctid(相当于Oracle中的rowid,但没有其包含的信息多),来反映该条数据在文件中的位置,比如第几个page的第多少行。
user01@testdb:5532 > select ctid,* from test;
ctid | id | name
-------------+---------+------
(0,1) | 2 | dang
(0,2) | 3 | dang
(0,3) | 4 | dang
(0,4) | 5 | dang
(0,5) | 6 | dang
(0,6) | 7 | dang
(0,7) | 8 | dang
(0,8) | 9 | dang
(0,9) | 10 | dang
(0,10) | 11 | dang
(0,11) | 12 | dang
(0,12) | 13 | dang
pg中提供了pageinspect 模块可以帮助我们看到page中的数据。参考文档:https://www.postgresql.org/docs/9.5/static/pageinspect.html
pageinspect提供的函数必须在superuser下执行。
构造测试数据:
CREATE TABLE tree (key int NOT NULL, id int NOT NULL);
ALTER TABLE tree ADD CONSTRAINT pk_tree PRIMARY KEY (key, id);
INSERT INTO TREE (Key, ID) VALUES (1, 1), (1, 2), (1, 3), (1, 4);
- read page items get_raw_page(relname text,blockno int)会返回bytea类型的数据,而heap_page_items可以接收bytea类型的数据,转换成可视化数据进行展示。
postgres@testdb:5532 # SELECT * FROM heap_page_items(get_raw_page('tree', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
1 | 8160 | 1 | 32 | 2453 | 0 | 0 | (0,1) | 2 | 2048 | 24 | |
2 | 8128 | 1 | 32 | 2453 | 0 | 0 | (0,2) | 2 | 2048 | 24 | |
3 | 8096 | 1 | 32 | 2453 | 0 | 0 | (0,3) | 2 | 2048 | 24 | |
4 | 8064 | 1 | 32 | 2453 | 0 | 0 | (0,4) | 2 | 2048 | 24 | |
(4 rows)
这些字段中t_ 代表 tuple,lp_ 代表 line point。
page_header 函数可以查看page的header信息:
postgres@testdb:5532 # select * from page_header(get_raw_page('tree', 0));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/6554E108 | 0 | 0 | 40 | 8064 | 8192 | 8192 | 4 | 0
mvcc在page中的体现:
初始化insert四条数据,然后对id=4的数据做了两次更新,最后删除了id=4的数据,按照mvcc的原则,表tree对外可见的数据是三条,实际page中应该是六条,分别查询下:
postgres@testdb:5532 # select * from tree;
key | id | name
-----+----+------
1 | 1 |
1 | 2 |
1 | 3 |
(3 rows)
三条
postgres@testdb:5532 # SELECT * FROM heap_page_items(get_raw_page('tree', 0));
lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
----+--------+----------+--------+--------+--------+----------+--------+-------------+------------+--------+--------+-------
1 | 8160 | 1 | 32 | 2453 | 0 | 0 | (0,1) | 2 | 2304 | 24 | |
2 | 8128 | 1 | 32 | 2453 | 0 | 0 | (0,2) | 2 | 2304 | 24 | |
3 | 8096 | 1 | 32 | 2453 | 0 | 0 | (0,3) | 2 | 2304 | 24 | |
4 | 8064 | 1 | 32 | 2453 | 2454 | 0 | (0,5) | 8194 | 1280 | 24 | |
5 | 8032 | 1 | 32 | 2454 | 2455 | 0 | (0,6) | 8194 | 9472 | 24 | |
6 | 8000 | 1 | 32 | 2455 | 2460 | 0 | (0,6) | 8194 | 8448 | 24 | |
(6 rows)
六条
数据update或者delete之后,t_xmax字段更新为操作该表数据的事务id,t_ctid 指向了替代它的那条数据。
每做一次update或者insert,free space减少32+4个字节,从free space的底部划出32个字节来存放数据,从free space的顶部划出4个字节来存放ItemIdData。
postgres@testdb:5532 # select * from page_header(get_raw_page('tree', 0));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/65573668 | 0 | 0 | 48 | 8000 | 8192 | 8192 | 4 | 2454
(1 row)
postgres@testdb:5532 # INSERT INTO TREE (Key, ID) VALUES (1, 4);
INSERT 0 1
postgres@testdb:5532 # select * from page_header(get_raw_page('tree', 0));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/655739D8 | 0 | 0 | 52 | 7968 | 8192 | 8192 | 4 | 2454
(1 row)
postgres@testdb:5532 # INSERT INTO TREE (Key, ID) VALUES (1, 5);
INSERT 0 1
postgres@testdb:5532 # select * from page_header(get_raw_page('tree', 0));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/65573D68 | 0 | 0 | 56 | 7936 | 8192 | 8192 | 4 | 2454
(1 row)
postgres@testdb:5532 # INSERT INTO TREE (Key, ID) VALUES (1, 7);
INSERT 0 1
postgres@testdb:5532 # update tree set id=6 where id=7;
UPDATE 1
postgres@testdb:5532 # select * from page_header(get_raw_page('tree', 0));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/65573FE0 | 0 | 0 | 64 | 7872 | 8192 | 8192 | 4 | 2454
(1 row)
index 在哪?如何存放?
跟表的存放方式类似,索引也是文件的形式存放在文件系统中。
postgres@testdb:5532 # select pg_relation_filepath('pk_tree');
pg_relation_filepath
----------------------------------------------
pg_tblspc/16415/PG_9.5_201510051/24603/33405
(1 row)
[postgres@dang-db ~]$cd $PGDATA
[postgres@dang-db pgdata]$ll pg_tblspc/16415/PG_9.5_201510051/24603/33405
-rw------- 1 postgres postgres 16384 Aug 30 17:17 pg_tblspc/16415/PG_9.5_201510051/24603/33405
利用pageinspect模块提供的函数来了解索引的物理存储结构。
bt_metap 返回索引的基本信息:
postgres@testdb:5532 # SELECT * FROM bt_metap('pk_tree');
magic | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
340322 | 2 | 3 | 1 | 3 | 1
(1 row)
bt_page_stats 返回指定索引上指定page的信息:
postgres@testdb:5532 # SELECT * FROM bt_page_stats('pk_tree',0);
ERROR: block 0 is a meta page
postgres@testdb:5532 # SELECT * FROM bt_page_stats('pk_tree',1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1
(1 row)
索引file的page 0 存放的meta数据。
bt_page_items返回index page中具体 index entries:
postgres@testdb:5532 # SELECT * FROM bt_page_items('pk_tree',1);
itemoffset | ctid | itemlen | nulls | vars | data
------------+---------+---------+-------+------+-------------------------
1 | (1,145) | 16 | f | f | 01 00 00 00 6f 01 00 00
2 | (0,1) | 16 | f | f | 01 00 00 00 01 00 00 00
3 | (0,2) | 16 | f | f | 01 00 00 00 02 00 00 00
4 | (0,3) | 16 | f | f | 01 00 00 00 03 00 00 00
5 | (0,7) | 16 | f | f | 01 00 00 00 04 00 00 00
6 | (0,8) | 16 | f | f | 01 00 00 00 05 00 00 00
7 | (0,10) | 16 | f | f | 01 00 00 00 06 00 00 00
8 | (0,11) | 16 | f | f | 01 00 00 00 07 00 00 00
。。。。。。
366 | (1,143) | 16 | f | f | 01 00 00 00 6d 01 00 00
367 | (1,144) | 16 | f | f | 01 00 00 00 6e 01 00 00
(367 rows)
共计367条数据,跟stats中统计的一致。
查看索引root page的信息:
postgres@testdb:5532 # SELECT * FROM bt_page_stats('pk_tree',3);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
3 | r | 3 | 0 | 13 | 8192 | 8096 | 0 | 0 | 1 | 2
(1 row)
postgres@testdb:5532 # SELECT * FROM bt_page_items('pk_tree',3);
itemoffset | ctid | itemlen | nulls | vars | data
------------+-------+---------+-------+------+-------------------------
1 | (1,1) | 8 | f | f |
2 | (2,1) | 16 | f | f | 01 00 00 00 6f 01 00 00
3 | (4,1) | 16 | f | f | 01 00 00 00 dd 02 00 00
(3 rows)
查看三个leaf page的信息:
postgres@testdb:5532 # SELECT * FROM bt_page_stats('pk_tree',1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
1 | l | 367 | 0 | 16 | 8192 | 808 | 0 | 2 | 0 | 1
(1 row)
postgres@testdb:5532 # SELECT * FROM bt_page_stats('pk_tree',2);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
2 | l | 367 | 0 | 16 | 8192 | 808 | 1 | 4 | 0 | 1
(1 row)
postgres@testdb:5532 # SELECT * FROM bt_page_stats('pk_tree',4);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------+------------
4 | l | 268 | 0 | 16 | 8192 | 2788 | 2 | 0 | 0 | 1
(1 row)
其中 type、btpo 都说明其为leaf page。
学习链接
- Introduction to PostgreSQL physical storage
- https://www.postgresql.org/docs/9.5/static/storage-page-layout.html
- https://facility9.com/2011/03/postgresql-row-storage-fundamentals/
- https://yq.aliyun.com/articles/53701