ClickHouse物化视图实时聚合【大厂落地篇】

目录

​编辑

一、ClickHouse明细表的构建

二、ReplicatedSummingMergeTree 聚合物化视图

三、结语


一、ClickHouse明细表的构建

创建语句:

1、基于ClickHouse集群开发,利用 ON CLUSTER CK_CLUSTER_A 在机器创建本地表

2、分区字段 sale_ord_dt 不能为空,必须设置默认值【排序字段最好设置默认值】

3、jrdw_mid为13位时间戳字符串,利用函数 parseDateTimeBestEffortOrZero 转为 Date类型

4、{replica} 和 {shard} 分别为宏变量, 配置文件中配置【通常集群运维维护参数】

5、TTL:保留近5天的数据

CREATE TABLE prod.gdm_order_basic_replica on CLUSTER CK_CLUSTER_A
(
    `sale_ord_det_id` String,
    `item_sku_id` String,
    `sku_name` Nullable(String),
    `bu_id` String,
    `dept_id_1` String,
    `dept_id_2` String,
    `dept_id_3` String,
    `delv_center_id` Nullable(String),
    `store_id` Nullable(String),
    `sale_qtty` Nullable(Float64),
    `amount` Nullable(Float64),
    `sale_ord_type_cd` Nullable(String),
    `split_status_cd` Nullable(String),
    `parent_sale_ord_flag` Nullable(String),
    `sale_ord_dt` String,
    `ord_deal_tm` Nullable(String),
    `cancel_flag` Nullable(String),
    `data_type` Nullable(String),
    `jrdw_mid` Nullable(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/gdm_order_basic_replica', '{replica}')
PARTITION BY toYYYYMM(toDateOrDefault(sale_ord_dt))
ORDER BY (sale_ord_dt, item_sku_id, bu_id, dept_id_1, dept_id_2, dept_id_3)
TTL parseDateTimeBestEffortOrZero(jrdw_mid) + toIntervalDay(5)

创建分布式表:

1、本场景利用的是 cityHash64 对 sale_ord_det_id 的 hash路由处理, 其他方式: rand()、sipHash64 等, 具体区别可详见官网 

2、create table ...... as ....... 的好处就是无需例举字段并和本地表字段列保持一致

CREATE TABLE prod.gdm_order_basic on CLUSTER CK_CLUSTER_A
as prod.gdm_order_basic_replica
ENGINE = Distributed('CK_CLUSTER_A', 'prod', 'gdm_order_basic_replica', cityHash64(sale_ord_det_id))

二、ReplicatedSummingMergeTree 聚合物化视图

创建语句:

1、物化视图 和 MergeTree 创建方式类似

2、toDateOrDefault 如何字段为空,默认时间1970年

3、allow_nullable_key = 1 允许 stat_dt 为空,不设置会报错,因为系统默认 stat_dt 可能为空

4、formatDateTime(toStartOfDay(today() - toIntervalDay(2)), '%Y-%m-%d') 获取 T+2 时间

5、WINDOW w AS (PARTITION BY sale_ord_det_id ORDER BY jrdw_mid DESC)为开窗函数

6、GLOBAL IN 对子查询进行临时数据片段生成,之后发到每台机器进行本地过滤,极大提升了查询速度

CREATE MATERIALIZED VIEW prod.sum_ord_sales_view on CLUSTER CK_CLUSTER_A
(
    `stat_dt` Nullable(String),
    `stat_hour` Nullable(String),
    `bu_id` String,
    `dept_id_1` String,
    `dept_id_2` String,
    `dept_id_3` String,
    `succ_sales` Float64,
    `succ_amt` Float64
)
ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{shard}/test/sum_ord_sales_view', '{replica}')
PARTITION BY toYYYYMM(toDate(stat_dt))
ORDER BY (stat_dt, stat_hour, bu_id, dept_id_1, dept_id_2, dept_id_3)
TTL toDateOrDefault(stat_dt) + toIntervalDay(10)
SETTINGS allow_nullable_key = 1, index_granularity = 8192 AS
SELECT
    stat_dt,
    stat_hour,
    bu_id,
    dept_id_1,
    dept_id_2,
    dept_id_3,
    round(sum(coalesce(succ_sales, 0)), 0) AS succ_sales,
    round(sum(coalesce(succ_amt, 0.)), 2) AS succ_amt
FROM
(
    SELECT
        substring(ord_deal_tm, 1, 10) AS stat_dt,
        substring(ord_deal_tm, 12, 2) AS stat_hour,
        bu_id,
        dept_id_1,
        dept_id_2,
        dept_id_3,
        multiIf(cancel_flag = '0', sale_qtty, sale_qtty * -1) AS succ_sales,
        multiIf(cancel_flag = '0', amount, amount * -1) AS succ_amt,
        row_number() OVER w AS rn
    FROM prod.gdm_order_basic_replica
    WHERE (sale_ord_dt >= '2023-04-24') AND (data_type IN ('1', '2')) AND (sale_ord_type_cd GLOBAL IN (
        SELECT sale_ord_type_cd
        FROM prod.dim_sale_ord_type_new
        WHERE dt = formatDateTime(toStartOfDay(today() - toIntervalDay(2)), '%Y-%m-%d')
    )) AND (concat(delv_center_id, '_', store_id) GLOBAL IN (
        SELECT store_id
        FROM prod.dim_store
        WHERE dt = formatDateTime(toStartOfDay(today() - toIntervalDay(2)), '%Y-%m-%d')
    )) AND (parent_sale_ord_flag != '1') AND (split_status_cd != '1') AND (CAST(bu_id, 'int') > 0) AND (ord_deal_tm IS NOT NULL) AND (ord_deal_tm != '')
    WINDOW w AS (PARTITION BY sale_ord_det_id ORDER BY jrdw_mid DESC)
) AS M1
WHERE (rn = 1) AND (stat_dt IS NOT NULL) AND (stat_dt != '')
GROUP BY stat_dt,stat_hour,bu_id,dept_id_1,dept_id_2,dept_id_3

三、结语

1、灵活的使用ClickHouse本地表进而提升查询速度,避免分布式表的N次查询

2、注重ClickHouse的分区,不易太多,因为分区太多只能增加查询负担和复杂度

3、ClickHouse字段默认值,关键字段尽量设置默认值,避免不必要的后期排错

4、物化视图 和 明细表 的使用方式类似,对于旧数据处理可设置TTL,如果需要添加大量历史数据,可手工插入物化视图,效率会有明显的提升

评论 2
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

熊猫码农

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值