ClickHouse sql优化技巧

高并发

已于 2023-01-12 15:58:40 修改

阅读量4.7k

点赞数 8

分类专栏： clickhouse 文章标签： sql 数据库 database

于 2022-01-18 14:51:15 首次发布

本文链接：https://blog.csdn.net/weixin_39025362/article/details/122559488

版权

clickhouse 专栏收录该内容

29 篇文章

订阅专栏

本文介绍了ClickHouse数据库中提升查询性能的方法，包括合理利用分区、使用group by替代distinct、使用bitmap进行高效集合运算、减少重复查询、正确使用join等策略。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

1.使用分区

clickhouse的表，走索引和非索引效率差距很大，在使用一个表进行查询时，必须限制索引字段。避免扫描全表

确定索引分区字段，可以用show create table default.ods_user，查看本地表的建表语句，partition by 的字段就是分区字段。
如果需要限制的时间和分区字段不是同一个字段时，可以扩大分区字段取数区间，然后再过滤

2.distinct 和 group by

优先使用group by，distinct满足不了的情况，可以使用group by，
如果count(distinct uid)查询响应过长，可以使用groupBitmap(uid)代替

3.global in，global not in 和bitmap

当涉及到集合in的操作，如果字段是Int类型，可以转化成bitmap进行计算，效率会大幅提高

-- global in
select uid from xxx where uid global in (select uid from yyyy)

-- bitmap
with (select groupbitmapState(toUint64(uid)) from yyyy) as bitmap_uid
select uid from xxx where bitmapContains(bitmap_uid,uid)==1

-- global not in
select uid from xxx where uid global in (select uid from yyyy)

-- bitmap
with (select groupbitmapState(toUint64(uid)) from yyyy) as bitmap_uid
select uid from xxx where bitmapContains(bitmap_uid,uid)==0

4.使用with子句代替重复查询，查询耗时有可能减半

-- 重复子查询
select t1.*, t2.*
from (
     select *
     from test1_all
     where uid global in (
         select distinct uid
         from test_all
         where toDate(totalDate) = yesterday()
     )
) as t1 global
left join(
    select distinct uid
    from test2_all
    where uid global not in (
        select uid
        from test_all
        where toDate(totalDate) = yesterday()
    )
) as t2 using uid;

-- 使用with子句+bitmap代替

with(
     select groupBitmapState(uid)
     from test_all
     where toDate(totalDate) = yesterday()
) as bm
select t1.*, t2.*
from (
         select *
         from test1_all
         where bitmapContains(bm)==1
) as t1 global
left join(
    select *
    from test2_all
    where bitmapContains(bm)==0
) as t2 using uid;

5.global in 和 in

in子查询，使用global in
in 枚举值，可以使用in，例如 age in (10,18)

6.global all left join

clickhouse的join操作，如a left join b，当a表和b表进行join的时候，ck会将b表加载到内存，然后遍历a表数据，与b表进行match，所以必须要左边大表，右边小表，切忌小表join大表。
并且可以注意以下优化：

能够提前过滤的左右表数据，提前过滤
join条件，数值类比字符串类型快
global all left join与global left join并无区别

7.为什么查询sql有时候能执行

为什么有些sql，有时候能运行有时候不能运行，因为我们的集群的ck机器，有些是128g内存，有些是256g内存的，all表是随机路由的，如果路由到256g的可以运行，128g可能就会失败了，对应此种情况，我们做出的解决方案是配置一个新的lb路由，将大查询sql全部路由到256g的机器上

8.优化整体思路

首先检查是否使用分区字段，必须使用分区字段避免扫描全表
分而治之，如果一个业务的SQL，如果按某个字段，分成N批执行，最终的结果不变，那么就可以采取分批的方式优化，比如mod(uid,10)=batch,这样来实现分10批执行。
提前缩小数据量，如果涉及到的表数据过大，可以提前过滤数据，再进行join操作，而不是join之后再加where条件过滤。尽量做到在每个select都能够过滤一部分数据