目录
二、ReplicatedSummingMergeTree 聚合物化视图
一、ClickHouse明细表的构建
创建语句:
1、基于ClickHouse集群开发,利用 ON CLUSTER CK_CLUSTER_A 在机器创建本地表
2、分区字段 sale_ord_dt 不能为空,必须设置默认值【排序字段最好设置默认值】
3、jrdw_mid为13位时间戳字符串,利用函数 parseDateTimeBestEffortOrZero 转为 Date类型
4、{replica} 和 {shard} 分别为宏变量, 配置文件中配置【通常集群运维维护参数】
5、TTL:保留近5天的数据
CREATE TABLE prod.gdm_order_basic_replica on CLUSTER CK_CLUSTER_A
(
`sale_ord_det_id` String,
`item_sku_id` String,
`sku_name` Nullable(String),
`bu_id` String,
`dept_id_1` String,
`dept_id_2` String,
`dept_id_3` String,
`delv_center_id` Nullable(String),
`store_id` Nullable(String),
`sale_qtty` Nullable(Float64),
`amount` Nullable(Float64),
`sale_ord_type_cd` Nullable(String),
`split_status_cd` Nullable(String),
`parent_sale_ord_flag` Nullable(String),
`sale_ord_dt` String,
`ord_deal_tm` Nullable(String),
`cancel_flag` Nullable(String),
`data_type` Nullable(String),
`jrdw_mid` Nullable(String)
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/gdm_order_basic_replica', '{replica}')
PARTITION BY toYYYYMM(toDateOrDefault(sale_ord_dt))
ORDER BY (sale_ord_dt, item_sku_id, bu_id, dept_id_1, dept_id_2, dept_id_3)
TTL parseDateTimeBestEffortOrZero(jrdw_mid) + toIntervalDay(5)
创建分布式表:
1、本场景利用的是 cityHash64 对 sale_ord_det_id 的 hash路由处理, 其他方式: rand()、sipHash64 等, 具体区别可详见官网
2、create table ...... as ....... 的好处就是无需例举字段并和本地表字段列保持一致
CREATE TABLE prod.gdm_order_basic on CLUSTER CK_CLUSTER_A
as prod.gdm_order_basic_replica
ENGINE = Distributed('CK_CLUSTER_A', 'prod', 'gdm_order_basic_replica', cityHash64(sale_ord_det_id))
二、ReplicatedSummingMergeTree 聚合物化视图
创建语句:
1、物化视图 和 MergeTree 创建方式类似
2、toDateOrDefault 如何字段为空,默认时间1970年
3、allow_nullable_key = 1 允许 stat_dt 为空,不设置会报错,因为系统默认 stat_dt 可能为空
4、formatDateTime(toStartOfDay(today() - toIntervalDay(2)), '%Y-%m-%d') 获取 T+2 时间
5、WINDOW w AS (PARTITION BY sale_ord_det_id ORDER BY jrdw_mid DESC)为开窗函数
6、GLOBAL IN 对子查询进行临时数据片段生成,之后发到每台机器进行本地过滤,极大提升了查询速度
CREATE MATERIALIZED VIEW prod.sum_ord_sales_view on CLUSTER CK_CLUSTER_A
(
`stat_dt` Nullable(String),
`stat_hour` Nullable(String),
`bu_id` String,
`dept_id_1` String,
`dept_id_2` String,
`dept_id_3` String,
`succ_sales` Float64,
`succ_amt` Float64
)
ENGINE = ReplicatedSummingMergeTree('/clickhouse/tables/{shard}/test/sum_ord_sales_view', '{replica}')
PARTITION BY toYYYYMM(toDate(stat_dt))
ORDER BY (stat_dt, stat_hour, bu_id, dept_id_1, dept_id_2, dept_id_3)
TTL toDateOrDefault(stat_dt) + toIntervalDay(10)
SETTINGS allow_nullable_key = 1, index_granularity = 8192 AS
SELECT
stat_dt,
stat_hour,
bu_id,
dept_id_1,
dept_id_2,
dept_id_3,
round(sum(coalesce(succ_sales, 0)), 0) AS succ_sales,
round(sum(coalesce(succ_amt, 0.)), 2) AS succ_amt
FROM
(
SELECT
substring(ord_deal_tm, 1, 10) AS stat_dt,
substring(ord_deal_tm, 12, 2) AS stat_hour,
bu_id,
dept_id_1,
dept_id_2,
dept_id_3,
multiIf(cancel_flag = '0', sale_qtty, sale_qtty * -1) AS succ_sales,
multiIf(cancel_flag = '0', amount, amount * -1) AS succ_amt,
row_number() OVER w AS rn
FROM prod.gdm_order_basic_replica
WHERE (sale_ord_dt >= '2023-04-24') AND (data_type IN ('1', '2')) AND (sale_ord_type_cd GLOBAL IN (
SELECT sale_ord_type_cd
FROM prod.dim_sale_ord_type_new
WHERE dt = formatDateTime(toStartOfDay(today() - toIntervalDay(2)), '%Y-%m-%d')
)) AND (concat(delv_center_id, '_', store_id) GLOBAL IN (
SELECT store_id
FROM prod.dim_store
WHERE dt = formatDateTime(toStartOfDay(today() - toIntervalDay(2)), '%Y-%m-%d')
)) AND (parent_sale_ord_flag != '1') AND (split_status_cd != '1') AND (CAST(bu_id, 'int') > 0) AND (ord_deal_tm IS NOT NULL) AND (ord_deal_tm != '')
WINDOW w AS (PARTITION BY sale_ord_det_id ORDER BY jrdw_mid DESC)
) AS M1
WHERE (rn = 1) AND (stat_dt IS NOT NULL) AND (stat_dt != '')
GROUP BY stat_dt,stat_hour,bu_id,dept_id_1,dept_id_2,dept_id_3
三、结语
1、灵活的使用ClickHouse本地表进而提升查询速度,避免分布式表的N次查询
2、注重ClickHouse的分区,不易太多,因为分区太多只能增加查询负担和复杂度
3、ClickHouse字段默认值,关键字段尽量设置默认值,避免不必要的后期排错
4、物化视图 和 明细表 的使用方式类似,对于旧数据处理可设置TTL,如果需要添加大量历史数据,可手工插入物化视图,效率会有明显的提升