04 如何进行数据表表分区? | OushuDB 数据库使用入门

表分区

在数据库日渐庞大的今天,为了方便对数据库数据的管理,比如按时间、地区去统计一些数据时,基数过于庞大带来了诸多不便。很多商业数据库都提供分区的概念,按不同的维度去存放数据,便于后期的管理,OushuDB也不例外。

表分区是把逻辑上的一个大表拆分成多个子表,这样不仅能带来访问速度的提升,更能带来管理和维护上的方便,

这样的话,有两个好处:

  • 查询优化器可以针对分区表进行优化,如果查询只涉及到某些分区,则查询计划只需要扫描这些分区,从而加速查询

  • 如果我们按照日期进行分区的话,我们可以简单的加入分区和删除过期的分区。

分区表的目的是为了提高查询性能,但并非所有的表都适合做分区。只有大型事实表、经常使用特定>条件[日期、地区等]查询数据、维护历史数据、数据分布均匀的情况,可通过分区策略大大提升使用性能。且分区数不宜过多,否则会影响维护和管理工作速度。

OushuDB支持基于Range和List的两种分区方式:

  • Range分区:表被一个或者多个关键列分区成”范围”,这些范围在不同的分区里没有重叠。 比如依据日期、价格数值范围进行分区。

  • List分区:依据一个值的列表进行分区,比如依据地区列表进行分区。

创建新的分区表使用CREATE TABLE … PATITION BY语法实现,创建分区表步骤如下:

  1. 确定分区类型:范围分区RANGE或列表分区List。

  2. 选定分区字段:范围分区多为日期、数值类型字段;列表分区多为枚举类型的特定列表;

  3. 确定分区级别:OushuDB支持子分区SUBPARTITION,可以按照不同维度实现多级分区。

定义Range分区

比如,假设我们为一个巨大的冰激凌公司构造数据库。该公司每天都测量最高温度,以及每个地区的冰激凌销售。且大多数查询都只会访问最后一周,最后一个月或者最后一个季度的数据,因为这个表的主要用途是为管理准备在线报告。为了减少需要存储的旧数据,我们决定只保留最近三年的有用数据。在每个月的开头,我们都会删除最旧的一个月的数据。因此,我们会选定日期字段作按月的范围分区:

CREATE TABLE measurement (
city_id int not null,
logdate date not null,
peaktemp int,
unitsales int
)
DISTRIBUTED BY (city_id)
PARTITION BY RANGE (logdate)
(START (date '2020-01-01') INCLUSIVE END (date '2020-02-01') EXCLUSIVE EVERY (INTERVAL '1 month'));

又例如,我们来创建一个sales表,按照date列Range分区,从2008年到2009年每月创建一个分区:

postgres=# CREATE TABLE sales (id int, date date, amt decimal(10,2))
PARTITION BY RANGE (date)
( START (date '2008-01-01') INCLUSIVE
  END (date '2009-01-01') EXCLUSIVE
  EVERY (INTERVAL '1 month') );

要查看创建的表信息,可以通过\d+给出该表的所有信息

postgres=# \d+ sales
            Append-Only Table "public.sales"
Column |     Type     | Modifiers | Storage | Description
--------+---------------+-----------+---------+-------------
id     | integer       |           | plain   |
date   | date         |           | plain   |
amt   | numeric(10,2) |           | main   |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Child tables: sales_1_prt_1,
            sales_1_prt_10,
            sales_1_prt_11,
            sales_1_prt_12,
            sales_1_prt_2,
            sales_1_prt_3,
            sales_1_prt_4,
            sales_1_prt_5,
            sales_1_prt_6,
            sales_1_prt_7,
            sales_1_prt_8,
            sales_1_prt_9
Has OIDs: no
Options: appendonly=true
Distributed randomly
Partition by: (date)

你也可以显式得声明子分区并指定子表名字。

CREATE TABLE sales_exp (id int, date date, amt decimal(10,2))
PARTITION BY RANGE (date)
( PARTITION Jan08 START (date '2008-01-01') INCLUSIVE ,
PARTITION Feb08 START (date '2008-02-01') INCLUSIVE ,
PARTITION Mar08 START (date '2008-03-01') INCLUSIVE ,
PARTITION Apr08 START (date '2008-04-01') INCLUSIVE ,
PARTITION May08 START (date '2008-05-01') INCLUSIVE ,
PARTITION Jun08 START (date '2008-06-01') INCLUSIVE ,
PARTITION Jul08 START (date '2008-07-01') INCLUSIVE ,
PARTITION Aug08 START (date '2008-08-01') INCLUSIVE ,
PARTITION Sep08 START (date '2008-09-01') INCLUSIVE ,
PARTITION Oct08 START (date '2008-10-01') INCLUSIVE ,
PARTITION Nov08 START (date '2008-11-01') INCLUSIVE ,
PARTITION Dec08 START (date '2008-12-01') INCLUSIVE
                  END (date '2009-01-01') EXCLUSIVE );

查看创建的表信息

postgres=# \d+ sales_exp
          Append-Only Table "public.sales_exp"
Column |     Type     | Modifiers | Storage | Description
--------+---------------+-----------+---------+-------------
id     | integer       |           | plain   |
date   | date         |           | plain   |
amt   | numeric(10,2) |           | main   |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Child tables: sales_exp_1_prt_apr08,
            sales_exp_1_prt_aug08,
            sales_exp_1_prt_dec08,
            sales_exp_1_prt_feb08,
            sales_exp_1_prt_jan08,
            sales_exp_1_prt_jul08,
            sales_exp_1_prt_jun08,
            sales_exp_1_prt_mar08,
            sales_exp_1_prt_may08,
            sales_exp_1_prt_nov08,
            sales_exp_1_prt_oct08,
            sales_exp_1_prt_sep08
Has OIDs: no
Options: appendonly=true
Distributed randomly
Partition by: (date)

下面是另外一个根据Range分区的例子,这次使用的是整型列进行分区。这里面我们添加了一个DEFAULT PARTITION, 在不满足其他分区的条件下,数据会被插入DEFAULT PARTITION。

CREATE TABLE rank (id int, rank int, year int, gender char(1), count int)
PARTITION BY RANGE (year)
( START (2001) END (2008) EVERY (1),
DEFAULT PARTITION extra );

定义List分区

列表分区的分区字段可以是任意可进行等于比较的数据类型,且允许多列组合进行分区,而范围分区只允许选定一列作分区字段。列表分区时必须显式为每个分区申明分区值,比如还是上述提到的冰淇淋数据表,我们可以按照地区列表对数据进行分区操作:

CREATE TABLE magma_measurement_city (
city text,
logdate date,
peaktemp int,
unitsales int
) format 'magmaap'
PARTITION BY LIST (city)
( PARTITION beijing VALUES ('BJ'),
PARTITION shanghai VALUES ('SH'),
PARTITION guangzhou VALUES ('GZ'),
PARTITION shenzhen VALUES ('SZ'),
DEFAULT PARTITION other );

又如下面的例子创建了一个基于List的分区表。List分区表可以基于任意支持等值比较的数据类型。对与List分区,你需要 显式的指定所有子分区。

CREATE TABLE rank (id int, rank int, year int, gender char(1), count int )
PARTITION BY LIST (gender)
( PARTITION girls VALUES ('F'),
  PARTITION boys VALUES ('M'),
  DEFAULT PARTITION other );

查看表信息

postgres=# \d+ rank
              Append-Only Table "public.rank"
 Column |     Type     | Modifiers | Storage  | Description
--------+--------------+-----------+----------+-------------
 id     | integer      |           | plain    |
 rank   | integer      |           | plain    |
 year   | integer      |           | plain    |
 gender | character(1) |           | extended |
 count  | integer      |           | plain    |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Child tables: rank_1_prt_boys,
              rank_1_prt_girls,
              rank_1_prt_other
Has OIDs: no
Options: appendonly=true
Distributed randomly
Partition by: (gender)

定义多级分区

多级分区使用PARTITION BY和SUBPARTITION BY实现,使用SUBPARTITION TEMPLATE可以保证所有的子分区都一样,包括后续新增的子分区。还是上述实例,我们可以将范围分区和列表分区进行组合,使用多级分区实现:

CREATE TABLE mult_measurement (
  city text,
  logdate date,
  peaktemp int,
  unitsales int
) WITH (APPENDONLY = true, OIDS = FALSE, ORIENTATION = orc)
PARTITION BY RANGE (logdate)
SUBPARTITION BY LIST (city)
SUBPARTITION TEMPLATE
(SUBPARTITION beijing VALUES ('BJ'),
 SUBPARTITION shanghai VALUES ('SH'),
 SUBPARTITION guangzhou VALUES ('GZ'),
 SUBPARTITION shenzhen VALUES ('SZ'),
 DEFAULT SUBPARTITION other)
(START (date '2020-01-01') INCLUSIVE END (date '2020-02-01') EXCLUSIVE EVERY (INTERVAL '1 month'));

或者按照年、月进行多级分区组合:

CREATE TABLE ym_measurement (
  city_id int not null,
  r_year int,
  r_month int,
  peaktemp int,
  unitsales int
) format 'magmaap'
PARTITION BY RANGE (r_year)
SUBPARTITION BY RANGE (r_month)
SUBPARTITION TEMPLATE
(START (1) END (13) EVERY (1),DEFAULT SUBPARTITION other_months)
(START (2017) END (2020) EVERY (1), DEFAULT PARTITION other_years);

由或对上述三个条件进行组合可创建三级分区:

CREATE TABLE mult_measurement_3 (
  city text,
  r_year int,
  r_month int,
  peaktemp int,
  unitsales int
) with (appendonly=true, orientation=parquet)
PARTITION BY RANGE (r_year)
SUBPARTITION BY RANGE (r_month)
  SUBPARTITION TEMPLATE
    (START (1) END (13) EVERY (1),DEFAULT SUBPARTITION other_months)
    SUBPARTITION BY LIST (city)
      SUBPARTITION TEMPLATE (
        SUBPARTITION beijing VALUES ('BJ'),
        SUBPARTITION shanghai VALUES ('SH'),
        SUBPARTITION guangzhou VALUES ('GZ'),
        SUBPARTITION shenzhen VALUES ('SZ'),
        DEFAULT SUBPARTITION other)
(START (2017) END (2020) EVERY (1), DEFAULT PARTITION other_years);

需要注意的是,当你在使用多级分区的时候,系统会产生大量的小表,有些表可能没有数据或包含很少数据,这样会对系统元数据管理产生过多压力。 建议不要创建具有过多分区的表。一般限制分区数在100或以内比较合理。

分区现有表

如果你想对一张已有大表数据进行分区,步骤如下:

  1. 按照现有大表结构创建一个新的分区表

  2. 将现有大表中的数据load到新创建的分区表中

  3. 删除原来的大表

  4. 按照原来大表名称重命名分区表名称

假设上述已有表measure,先需要进行分区,操作如下:

CREATE TABLE measurement_cp (like measurement)
PARTITION BY RANGE (logdate)
(START (date '2020-01-01') INCLUSIVE END (date '2020-04-01') EXCLUSIVE EVERY (INTERVAL '1 month'));

INSERT INTO measurement_cp SELECT * FROM measurement;

DROP TABLE measurement;

ALTER TABLE measurement_cp RENAME TO measurement;

ANALYZE measurement;

查看你的分区设计

你可以通过pg_partitions视图来查看你的分区表设计。例如通过下面的语句可以查看出sales表的分区设计。

postgres=# SELECT partitionboundary, partitiontablename, partitionname, partitionlevel, partitionrank
postgres-# FROM pg_partitions
postgres-# WHERE tablename='sales';
                                           partitionboundary                                          | partitiontablename | partitionname | partitionlevel | partitionrank
------------------------------------------------------------------------------------------------------+--------------------+---------------+----------------+---------------
 START ('2008-01-01'::date) END ('2008-02-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_1      |               |              0 |             1
 START ('2008-02-01'::date) END ('2008-03-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_2      |               |              0 |             2
 START ('2008-03-01'::date) END ('2008-04-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_3      |               |              0 |             3
 START ('2008-04-01'::date) END ('2008-05-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_4      |               |              0 |             4
 START ('2008-05-01'::date) END ('2008-06-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_5      |               |              0 |             5
 START ('2008-06-01'::date) END ('2008-07-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_6      |               |              0 |             6
 START ('2008-07-01'::date) END ('2008-08-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_7      |               |              0 |             7
 START ('2008-08-01'::date) END ('2008-09-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_8      |               |              0 |             8
 START ('2008-09-01'::date) END ('2008-10-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_9      |               |              0 |             9
 START ('2008-10-01'::date) END ('2008-11-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_10     |               |              0 |            10
 START ('2008-11-01'::date) END ('2008-12-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_11     |               |              0 |            11
 START ('2008-12-01'::date) END ('2009-01-01'::date) EVERY ('1 mon'::interval) WITH (appendonly=true) | sales_1_prt_12     |               |              0 |            12
(12 rows)

通过查询pg_partition_templates查看使用子分区模板创建的子分区信息,例如:

select * from pg_partition_templates where tablename = 'mult_measurement_3';

 schemaname |     tablename      | partitionname | partitiontype | partitionlevel | partitionrank | partitionposition | partitionlistvalues | partitionrangestart | partitionstartinclusive | partitionrangeend | partitionendinclusive | partitioneveryclause | partitionisdefault |          partitionboundary
------------+--------------------+---------------+---------------+----------------+---------------+-------------------+---------------------+---------------------+-------------------------+-------------------+-----------------------+----------------------+--------------------+-------------------------------------
 public     | mult_measurement_3 | other_months  | range         |              1 |             1 |                 1 |                     |                     | f                       |                   | f                     |                      | t                  | DEFAULT SUBPARTITION other_months
 public     | mult_measurement_3 |               | range         |              1 |             2 |                 2 |                     | 1                   | t                       | 2                 | f                     | 1                    | f                  | START (1) END (2) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |             3 |                 3 |                     | 2                   | t                       | 3                 | f                     | 1                    | f                  | START (2) END (3) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |             4 |                 4 |                     | 3                   | t                       | 4                 | f                     | 1                    | f                  | START (3) END (4) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |             5 |                 5 |                     | 4                   | t                       | 5                 | f                     | 1                    | f                  | START (4) END (5) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |             6 |                 6 |                     | 5                   | t                       | 6                 | f                     | 1                    | f                  | START (5) END (6) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |             7 |                 7 |                     | 6                   | t                       | 7                 | f                     | 1                    | f                  | START (6) END (7) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |             8 |                 8 |                     | 7                   | t                       | 8                 | f                     | 1                    | f                  | START (7) END (8) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |             9 |                 9 |                     | 8                   | t                       | 9                 | f                     | 1                    | f                  | START (8) END (9) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |            10 |                10 |                     | 9                   | t                       | 10                | f                     | 1                    | f                  | START (9) END (10) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |            11 |                11 |                     | 10                  | t                       | 11                | f                     | 1                    | f                  | START (10) END (11) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |            12 |                12 |                     | 11                  | t                       | 12                | f                     | 1                    | f                  | START (11) END (12) EVERY (1)
 public     | mult_measurement_3 |               | range         |              1 |            13 |                13 |                     | 12                  | t                       | 13                | f                     | 1                    | f                  | START (12) END (13) EVERY (1)
 public     | mult_measurement_3 | beijing       | list          |              2 |               |                 1 | 'BJ'::text          |                     |                         |                   |                       |                      | f                  | SUBPARTITION beijing VALUES('BJ')
 public     | mult_measurement_3 | shanghai      | list          |              2 |               |                 2 | 'SH'::text          |                     |                         |                   |                       |                      | f                  | SUBPARTITION shanghai VALUES('SH')
 public     | mult_measurement_3 | guangzhou     | list          |              2 |               |                 3 | 'GZ'::text          |                     |                         |                   |                       |                      | f                  | SUBPARTITION guangzhou VALUES('GZ')
 public     | mult_measurement_3 | shenzhen      | list          |              2 |               |                 4 | 'SZ'::text          |                     |                         |                   |                       |                      | f                  | SUBPARTITION shenzhen VALUES('SZ')
 public     | mult_measurement_3 | other         | list          |              2 |               |                 5 |                     |                     |                         |                   |                       |                      | t                  | DEFAULT SUBPARTITION other

查看分区表使用的分区键列通过查询pg_partition_columns获取:

select * from pg_partition_columns where tablename = 'measurement';

 schemaname |  tablename  | columnname | partitionlevel | position_in_partition_key
------------+-------------+------------+----------------+---------------------------
 public     | measurement | logdate    |              0 |                         1

最后,给大家几个常用的小提醒:

  • 数据量达到千万记录以上时使用分区表。数据仓库中的事实表适合作为分区表。对于小于这个数量级的表通常不需要分区。因为系统管理与维护分区的开销会抵消掉分区带来的可见的性能优势。

  • 只有当实施了其它优化手段后,响应时间仍然不可接受时,再考虑使用分区。

  • 根据分区定义条件,分区条件应尽可能使数据平均划分。例如,将一个大表分成10个相等的分区,如果查询条件中带有分区键,那么理论上查询应该比非分区表快将近10倍。

  • 分区数不要超过128个,太多的分区将会减慢管理和维护任务。如检查磁盘使用、集群扩展、释放剩余空间。其次,只有在查询条件可以利用分区消除时,性能才会得到提升。否则,一个需要扫描所有分区的查询会比非分区表还慢。

  • 除非必要,少用多级分区。多级分区会使分区文件的数量快速增长。例如,如果一个表按日期和城市做分区,1000天的1000个城市的数据,就会形成100万个分区。假设表有100列,并且假设表使用面向列的物理存储格式,那么系统为此表需要管理1亿个文件。

  • 0
    点赞
  • 1
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值