ClickHouse JOIN Optimization: Bucketed JOIN

Latest recommended article published on 2025-02-26 20:31:25

贝琦野菜汁


Category: ClickHouse

Article link: https://blog.csdn.net/a495679822/article/details/118548564


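Background:

ClickHouse (CK) is extremely fast on single-table queries, but its JOIN performance is comparatively weak.

With A JOIN B, especially when neither table is small, queries frequently run out of memory or time out.

The problem is most pronounced when both A and B are distributed tables.

Take the common case of joining an events table (events_all) with a users table (users_all), both of which are distributed tables.

Example table schemas:

Events local table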

-- Assumes the current database is my_test, which the Distributed table definitions reference.
CREATE TABLE events_local
(
    event_dt UInt32,
    user_id  UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/demo/events_local/{shard}', '{replica}')
PARTITION BY event_dt
ORDER BY (event_dt, intHash64(user_id))
SETTINGS index_granularity = 8192,
         enable_mixed_granularity_parts = 1;

Users local table

CREATE TABLE users_local
(
    dt      UInt32,
    user_id UInt64
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/my_sdap/users_local/{shard}', '{replica}')
PARTITION BY dt
ORDER BY (dt, intHash64(user_id))
SAMPLE BY intHash64(user_id)
SETTINGS index_granularity = 8192,
         allow_nullable_key = 1;

Events distributed table

-- Rows are routed to shards by intHash64(user_id).
CREATE TABLE IF NOT EXISTS my_test.events_all
ON CLUSTER cluster_name
AS my_test.events_local
ENGINE = Distributed(cluster_name, my_test, events_local, intHash64(user_id));

Users distributed table

-- Same sharding key as events_all: intHash64(user_id).
CREATE TABLE IF NOT EXISTS my_test.users_all
ON CLUSTER cluster_name
AS my_test.users_local
ENGINE = Distributed(cluster_name, my_test, users_local, intHash64(user_id));
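
For reference, the kind of query the background has in mind might look like the sketch below. It is an illustration only: the date value 20210701 and the aggregation are assumptions, not taken from the original. How ClickHouse executes (or rejects) a JOIN between two distributed tables also depends on settings such as `distributed_product_mode`.

```sql
-- Illustrative only: count the event rows on one day that join to a user row
-- for the same day. The date value 20210701 is an assumed example;
-- event_dt and dt are UInt32 columns in the schemas above.
SELECT
    e.event_dt,
    count() AS event_cnt
FROM my_test.events_all AS e
INNER JOIN
(
    SELECT user_id
    FROM my_test.users_all
    WHERE dt = 20210701
) AS u ON e.user_id = u.user_id
WHERE e.event_dt = 20210701
GROUP BY e.event_dt;
```

Both sides of this JOIN are distributed tables, which is the case the background singles out as the most problematic.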