People familiar with Oracle probably know about the high-water mark (HWM): after you DELETE the rows of a table in Oracle, the space the table occupies is not released. Put simply, it is like a bucket that was full of water: even after all the water is poured out, the old water line is still clearly visible.
Something quite similar can happen in PostgreSQL. Let's start with the following example:
Create a table, insert 1,000,000 rows, and check the table size:
bill=# create table t1(id int);
CREATE TABLE
bill=# insert into t1 select generate_series(1,1000000);
INSERT 0 1000000
bill=# \dt+ t1
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+-------+-------+-------------
public | t1 | table | bill | 35 MB |
(1 row)
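As a side note (not part of the original session), the size that \dt+ reports can also be obtained with the built-in size functions; a minimal sketch, run against the t1 just created:
-- pg_relation_size() reports only the main data fork of the heap,
-- pg_table_size() adds the FSM, VM and any TOAST data,
-- pg_total_relation_size() additionally includes the table's indexes.
select pg_size_pretty(pg_relation_size('t1'))       as main_fork,
       pg_size_pretty(pg_table_size('t1'))          as table_size,
       pg_size_pretty(pg_total_relation_size('t1')) as total_size;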
Now delete 999,999 of those rows; in theory the table should become very small:
bill=# delete from t1 where id <> 1000000;
DELETE 999999
bill=# vacuum (verbose ,analyze) t1;
psql: INFO: 00000: vacuuming "public.t1"
LOCATION: lazy_scan_heap, vacuumlazy.c:540
psql: INFO: 00000: "t1": removed 999999 row versions in 4425 pages
LOCATION: lazy_scan_heap, vacuumlazy.c:1478
psql: INFO: 00000: "t1": found 999999 removable, 1 nonremovable row versions in 4425 out of 4425 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 34002268
There were 0 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.09 s, system: 0.00 s, elapsed: 0.09 s.
LOCATION: lazy_scan_heap, vacuumlazy.c:1509
psql: INFO: 00000: analyzing "public.t1"
LOCATION: do_analyze_rel, analyze.c:330
psql: INFO: 00000: "t1": scanned 4425 of 4425 pages, containing 1 live rows and 0 dead rows; 1 rows in sample, 1 estimated total rows
LOCATION: acquire_sample_rows, analyze.c:1133
VACUUM
What actually happens? The table is still 35 MB!
bill=# \dt+ t1
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+-------+-------+-------------
public | t1 | table | bill | 35 MB |
(1 row)
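To see that this 35 MB is now mostly reusable free space rather than live data, the pgstattuple contrib extension can report a breakdown of the table. A hedged sketch, assuming the extension is available to install:
-- Requires the pgstattuple contrib module.
create extension if not exists pgstattuple;
-- free_space / free_percent show how much of the 35 MB is empty space
-- that future inserts into t1 can reuse even though the file is not shrunk.
select table_len, tuple_count, dead_tuple_count, free_space, free_percent
from pgstattuple('t1');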
Now let's try it a different way, with the same table:
bill=# drop table t1;
DROP TABLE
bill=# create table t1(id int);
CREATE TABLE
bill=# insert into t1 select generate_series(1,1000000);
INSERT 0 1000000
This time we again delete 999,999 rows:
bill=# delete from t1 where id <> 1;
DELETE 999999
bill=# vacuum (verbose ,analyze) t1;
psql: INFO: 00000: vacuuming "public.t1"
LOCATION: lazy_scan_heap, vacuumlazy.c:540
psql: INFO: 00000: "t1": found 0 removable, 1 nonremovable row versions in 1 out of 1 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 34002276
There were 225 unused item identifiers.
Skipped 0 pages due to buffer pins, 0 frozen pages.
0 pages are entirely empty.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
LOCATION: lazy_scan_heap, vacuumlazy.c:1509
psql: INFO: 00000: analyzing "public.t1"
LOCATION: do_analyze_rel, analyze.c:330
psql: INFO: 00000: "t1": scanned 1 of 1 pages, containing 1 live rows and 0 dead rows; 1 rows in sample, 1 estimated total rows
LOCATION: acquire_sample_rows, analyze.c:1133
VACUUM
Then check the table size again. Just like before, we would guess it is still 35 MB, but it is actually only 40 kB (the actual data takes just 8 kB; the 40 kB also includes the vm and fsm files):
bill=# \dt+ t1
List of relations
Schema | Name | Type | Owner | Size | Description
--------+------+-------+-------+-------+-------------
public | t1 | table | bill | 40 kB |
(1 row)
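To confirm the breakdown mentioned above (8 kB of actual data plus the vm and fsm forks), pg_relation_size() also accepts a fork name, so each fork can be measured separately; for example:
-- 'main' is the heap data itself, 'fsm' the free space map, 'vm' the visibility map.
select pg_size_pretty(pg_relation_size('t1', 'main')) as main,
       pg_size_pretty(pg_relation_size('t1', 'fsm'))  as fsm,
       pg_size_pretty(pg_relation_size('t1', 'vm'))   as vm;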
Why is that? Comparing the two experiments, it is not hard to see that in the first one I deleted the first 999,999 rows, while in the second I deleted the last 999,999 rows. Doesn't that feel a bit like a "high-water mark" as well? So why does this happen in PostgreSQL?
Because in PostgreSQL an index entry points to each row's ctid, which is a physical address. Suppose the rows at the front of the table's pages are deleted: if the table were shrunk, the rows behind them would have to move forward and their ctids would change, so the ctids recorded in the indexes would no longer match. To avoid this, a plain VACUUM never moves rows around; it only returns space to the operating system by truncating completely empty pages at the end of the table, which is exactly what the two experiments show. There is no need to worry too much, though: even if the table does not get smaller, the blocks freed by the deleted rows can still be reused for newly inserted data. And if the table has no indexes, techniques such as row migration can be used to make the physical size of the table smaller as well.
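To make the ctid argument concrete, the physical addresses can be inspected directly; a small sketch against the second experiment's t1:
-- ctid is the row's physical address: (block number, offset within the block).
-- In the second experiment the surviving row sits in block 0, so every page
-- after it is completely empty and VACUUM can truncate them all:
select ctid, id from t1;
-- Run against the table from the first experiment, the same query would show
-- the surviving row in the last block, which is why nothing could be truncated there.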