Clickhouse 新功能 Projection

        前几天有个同事问了我一个问题,他在建立MergeTree表的时候,只定义了order by,于是Clickhouse会自动创建与排序键一致的默认primary key。他想要删除一个列,但是由于primary key包含了这个列,而primary key一旦定义便无法修改,导致这一列无论如何没办法删除。因此,我们能想到,要么重建表重导数据,测试环境可以这么做,但如果是生产的话,这样做代价太大了;要么就保留这个列,它就只是占用一点空间而已。因为遇到这个问题时还在测试阶段,所以我们采用了前者。

        这个问题引发了我的另一个思考,如果这个问题发生在了生产环境,主键已经被定义好了,假设是(a,b,c),而刚好要删除的(即保留但无用的)列是a,那么主键索引不就失效了?那可就对查询效率产生重大影响了呀!

        于是我查了一下针对这个问题是否有解决办法,首先物化视图(materialized view)是可以考虑的,但是存在数据一致性问题。另外Clickhouse在上个月正式投产的Projection这个功能非常优秀,可以完美解决上述问题。Projection原来在Experimental Feature下,需要开启配置才能使用,于2022-02-17正式投入生产使用。

1. Projection简介

        Projection的概念出自《C-Store A Column-oriented DBMS》这篇论文,作者是2015年图领奖获得者、Vertica之父——Mike Stonebraker。来自快手的Amos Bird(郑天祺)借鉴了这个思想,在Clickhouse中实现了Projection功能,并贡献到社区。

Projection的特点:

        1) part-level:Projection类型物化视图,但物化视图是完完全全的一张新表,而Projection是原表的一个部分,存储在分区目录中的一个子目录。

        2) consistency:Projection保存在原表的分区,数据的更新、合并都是同源的,不存在不一致的问题。在读写一致性方面,Projection采用版本控制(multi-versioning),没有冗长的锁机制,插入不会影响读操作。

        3) automatic: 查询数据时,自动匹配最佳的projection。匹配规则如下:

2.Projection的用法

        1)创建Projection

ALTER TABLE [db].name ADD PROJECTION name ( SELECT <COLUMN LIST EXPR> [GROUP BY] [ORDER BY] )

        2) 删除Projection

ALTER TABLE [db].name DROP PROJECTION name

        3)手动触发物化(首次创建Projection时使用,否则只有新插入数据会自动物化)

ALTER TABLE [db.]table MATERIALIZE PROJECTION name IN PARTITION partition_name

        4)删除Projection磁盘文件,但不删除说明(类似Mysql的truncate)

ALTER TABLE [db.]table CLEAR PROJECTION name IN PARTITION partition_name

        Projection的ADD,DROP,CLEAR只是对元数据进行修改和文件的删除,因此这些操作都是轻量级的。但是MATERIALIZE 是一个重量级的操作,需要对原有数据进行重排序,所以尽量分区内进行操作或在一个合适的时间节点进行操作。

3. Projection的示例及性能

        案例来自官网:UK Property Price Paid | ClickHouse Documentation

1)下载数据源

wget http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv

 2)建表

CREATE TABLE uk_price_paid
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
    is_new UInt8,
    duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String),
    category UInt8
) ENGINE = MergeTree ORDER BY (postcode1, postcode2, addr1, addr2);

3)导入数据

clickhouse-local --input-format CSV --structure '
    uuid String,
    price UInt32,
    time DateTime,
    postcode String,
    a String,
    b String,
    c String,
    addr1 String,
    addr2 String,
    street String,
    locality String,
    town String,
    district String,
    county String,
    d String,
    e String
' --query "
    WITH splitByChar(' ', postcode) AS p
    SELECT
        price,
        toDate(time) AS date,
        p[1] AS postcode1,
        p[2] AS postcode2,
        transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
        b = 'Y' AS is_new,
        transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
        addr1,
        addr2,
        street,
        locality,
        town,
        district,
        county,
        d = 'B' AS category
    FROM table" --date_time_input_format best_effort < pp-complete.csv | clickhouse-client --query "INSERT INTO uk_price_paid FORMAT TSV"

 4)验证

SELECT count() FROM uk_price_paid;

5)查询以下三条语句

--Query 1. Average Price Per Year
SELECT toYear(date) AS year, round(avg(price)) AS price, bar(price, 0, 1000000, 80) FROM uk_price_paid GROUP BY year ORDER BY year;

--Query 2. Average Price per Year in London
SELECT toYear(date) AS year, round(avg(price)) AS price, bar(price, 0, 2000000, 100) FROM uk_price_paid WHERE town = 'LONDON' GROUP BY year ORDER BY year;

--Query 3. The Most Expensive Neighborhoods
SELECT
    town,
    district,
    count() AS c,
    round(avg(price)) AS price,
    bar(price, 0, 5000000, 100)
FROM uk_price_paid
WHERE date >= '2020-01-01'
GROUP BY
    town,
    district
HAVING c >= 100
ORDER BY price DESC
LIMIT 100;

 6)建立并物化Projection

ALTER TABLE uk_price_paid
    ADD PROJECTION projection_by_year_district_town
    (
        SELECT
            toYear(date),
            district,
            town,
            avg(price),
            sum(price),
            count()
        GROUP BY
            toYear(date),
            district,
            town
    );

ALTER TABLE uk_price_paid
    MATERIALIZE PROJECTION projection_by_year_district_town
SETTINGS mutations_sync = 1;

 7)再次执行前面三条查询语句

--Query 1. Average Price Per Year
SELECT toYear(date) AS year, round(avg(price)) AS price, bar(price, 0, 1000000, 80) FROM uk_price_paid GROUP BY year ORDER BY year;

--Query 2. Average Price per Year in London
SELECT toYear(date) AS year, round(avg(price)) AS price, bar(price, 0, 2000000, 100) FROM uk_price_paid WHERE town = 'LONDON' GROUP BY year ORDER BY year;

--Query 3. The Most Expensive Neighborhoods
SELECT
    town,
    district,
    count() AS c,
    round(avg(price)) AS price,
    bar(price, 0, 5000000, 100)
FROM uk_price_paid
WHERE date >= '2020-01-01'
GROUP BY
    town,
    district
HAVING c >= 100
ORDER BY price DESC
LIMIT 100;

8)查看执行计划

EXPLAIN
SELECT
    toYear(date) AS year,
    round(avg(price)) AS price,
    bar(price, 0, 1000000, 80)
FROM uk_price_paid
GROUP BY year
ORDER BY year ASC

 看到 ReadFromStorage (MergeTree(with * projection *))说明命中了projection。

9)结果对比

Query 1

no projection: 27 rows in set. Elapsed: 0.158 sec. Processed 26.32 million rows, 157.93 MB (166.57 million rows/s., 999.39 MB/s.)
projection: 27 rows in set. Elapsed: 0.007 sec. Processed 105.96 thousand rows, 3.33 MB (14.58 million rows/s., 458.13 MB/s.)


Query 2

no projection: 27 rows in set. Elapsed: 0.163 sec. Processed 26.32 million rows, 80.01 MB (161.75 million rows/s., 491.64 MB/s.)
projection: 27 rows in set. Elapsed: 0.008 sec. Processed 105.96 thousand rows, 3.67 MB (13.29 million rows/s., 459.89 MB/s.)

Query 3

no projection: 100 rows in set. Elapsed: 0.069 sec. Processed 26.32 million rows, 62.47 MB (382.13 million rows/s., 906.93 MB/s.)
projection: 100 rows in set. Elapsed: 0.029 sec. Processed 8.08 thousand rows, 511.08 KB (276.06 thousand rows/s., 17.47 MB/s.)

        可以看出,无论是在时间还是空间上,Projection的表现都是相当优秀的。

4. Projection原理解析

        projection在功能上有点类似mysql的index,但跟MergeTree的index完全是两码事,MergeTree的index是主键基础上的跳数索引。Projection的主要构件包含三个部分,分别是:Projection定义、Projection存储、Projection查询分析。

4.1 Projection定义

    Projection的定义可以看作CREATE TABLE AS SELECT (CTAS)语句

        1) select中的别名与表达式转换为标准命名

        2)group by语句生成Aggregate projections,使用Group by指定的聚合维度作为主键

        3)聚合维度将生成中间状态类型AggregateFunction(...),AggregateFunction(sum, UInt64), AggregateFunction(duration, UInt64) --注:这一点我不是很理解

 4.2 Projection存储

    Projection是底表Part的伴生Part(即分区目录的子目录)

        1)存储结果完全跟底表的MergeTree结构相同

        2)Projection Part复用底表Part的分区信息

        3)merge, mutation 和replication均对照父part进行处理

        PS:在我自己做的实验中,Projection的存储跟父Part的存储结构并非是完全一致的。我想应该是Projection的数据量较少的时候,.bin文件和.mrk2文件均合并成一个文件,而不是按列独立存储。

 4.3 Projection查询分析

原理:

一致性保障: 

5. 参考

Clickhouse官方文档:MergeTree | ClickHouse Documentation

 《C-Store A Column-oriented DBMS》:http://www.doczj.com/doc/f236381bfc4ffe473368ab67-1.html

【clickhouse】clickhouse强大智能的 Projection (投影) 功能_九师兄的博客-CSDN博客

Experimental ClickHouse: Projections · Tinybird

《Projections in ClickHouse》PPT-- Amos Bird (郑天祺)  

  • 3
    点赞
  • 6
    收藏
    觉得还不错? 一键收藏
  • 5
    评论

“相关推荐”对你有帮助么?

  • 非常没帮助
  • 没帮助
  • 一般
  • 有帮助
  • 非常有帮助
提交
评论 5
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值