背景:
ck在单表查询能够做到极致,但是在join上性能就相对尬尴,
A JOIN B 特别是当两张表的数据都不小的时候,经常就会有内存溢出,超时等等情况
特别是当AB都为分布表的时候
就拿常用的事件表(events_all)和用户表(users_all)做JOIN为例,都是分布表
表结构例子:
事件本地表
create table events_local ( event_dt UInt32, user_id UInt64) engine = ReplicatedMergeTree('/clickhouse/tables/demo/events_local/{shard}', '{replica}') PARTITION BY event_dt ORDER BY (event_dt , intHash64(user_id)) SETTINGS index_granularity = 8192, enable_mixed_granularity_parts = 1;
用户本地表
create table users_local ( dt UInt32, user_id UInt64) engine = ReplicatedMergeTree('/clickhouse/tables/my_sdap/users_local/{shard}', '{replica}') PARTITION BY dt ORDER BY (dt, intHash64(user_id)) SAMPLE BY intHash64(user_id) SETTINGS index_granularity = 8192, allow_nullable_key = 1;
事件分布表
CREATE TABLE IF NOT EXISTS my_test.events_all
ON CLUSTER cluster_name
AS my_test.events_local
ENGINE = Distributed(cluster_name, my_test, events_local, intHash64(user_id));
用户分布表
CREATE TABLE IF NOT EXISTS my_test.users_all
ON CLUSTER cluster_name
AS my_test.users_local
ENGINE = Distributed(cluster_name, my_test, users_local, intHash64(user_id));
通常JOIN我们这样写
select countDistinct(user_id)
from events_all global semi left join users_all using(user_id)
where ......
导致接收请求的服务器压力较大
前提:把相同user_id的数据都放到同一台服务器上
此时我们就可以把所有请求和计算的压力分发到每台服务器上。怎么写sql?
select countDistinct(user_id)
from events_all semi left join my_test.users_local using(user_id)
where ......
注意:不要B表括号起来A join (select from B) 这样 否则优化无效,这跟ck的解析有关
这里没有优化对比,因为是之前的笔记了,愿意可以一试,优化效果明显