一. 目的
测试 clickhouse在OLAP 场景的可行性, 这里着重关注 复杂关联的场景。
二. 环境说明
服务器名 | 内存 | 逻辑核 | 存储 | 版本 | 说明 |
|
cdh01 | 15G, 可用1~2G | 4 | 980G | 20.5.2 | 当前最新版 |
|
cdh02 | 15G, 可用1~2G | 4 | 980G | 20.5.2 | 当前最新版 |
三.创建表结构
1.索引不合理
CREATE TABLE dwd.flow_analysis_collect_detail_6 ON CLUSTER cluster_work \
( `dataTime` DateTime64(3, 'Asia/Shanghai'),\
`event` String,\
`user_id` String,\
`did` String,\
`loc` String,\
`city` String,\
`province` String,\
`country` String,\
`ip` String,\
`app_ver` String,\
`channel` String,\
`net` String,\
`tm_s` String,\
`model` String,\
`brand` String,\
`client_type` String,\
`appnm` String,\
`isNewUser` Int32,\
`eventTimes_sum` Int32,\
`activeRate_defined` Int32,\
`uv_count_distinct` Int32,\
`avgTimes_avg` Int32,\
`city_name` String \
)ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/flow_analysis_collect_detail_6','{replica}' ) \
PARTITION BY (toYYYYMMDD(dataTime),toHour(dataTime)) \
ORDER BY (toYYYYMMDD(dataTime),event) \
SAMPLE BY (toYYYYMMDD(dataTime),event) SETTINGS index_granularity = 8192;
2.索引稍稍优化
CREATE TABLE dwd.flow_analysis_collect_detail_7 ON CLUSTER cluster_work \
( `dataTime` DateTime64(3, 'Asia/Shanghai'),\
`event` String,\
`user_id` String,\
`did` String,\
`loc` String,\
`city` String,\
`province` String,\
`country` String,\
`ip` String,\
`app_ver` String,\
`channel` String,\
`net` String,\
`tm_s` String,\
`model` String,\
`brand` String,\
`client_type` String,\
`appnm` String,\
`isNewUser` Int32,\
`eventTimes_sum` Int32,\
`activeRate_defined` Int32,\
`uv_count_distinct` Int32,\
`avgTimes_avg` Int32,\
`city_name` String \
)ENGINE = ReplicatedMergeTree('/clickhouse/tables/{layer}-{shard}/flow_analysis_collect_detail_7','{replica}' ) \
PARTITION BY (toDate(dataTime)) \
ORDER BY (toDate(dataTime),event) \
SETTINGS index_granularity = 8192;
分布式表,方便统一查询
CREATE TABLE dwd.flow_analysis_collect_detail_dist_5 ON CLUSTER cluster_work \
AS dwd.flow_analysis_collect_detail_6 \
ENGINE = Distributed(cluster_work, dwd, flow_analysis_collect_detail_6, rand());
注:一天数据总量大概 7千万条,35G;
四. 实践测试
1.单表单日查询
select toHour(dataTime),count(1) from dwd.flow_analysis_collect_detail_dist_6 \
where toDate(dataTime)='2020-07-20' and event='进入小程序' \
group by toHour(dataTime);
数据总量 | 数据总条数 | 数据范围 | 执行环境 | 耗时(秒) | 执行详情 |
6G | 13162122 | 7月20号中1/6数据 | 单台 | 0.217 | 13. 16 million rows, 471. 40 MB ( 60. 56 million rows/s., 2. 17 GB/s.) |
|
|
| 两台 | 0.146 | 12. 99 million rows, 465. 38 MB ( 88. 87 million rows/s., 3. 18 GB/s.) |
72G | 150009393 | 7月20号、7月21号 | 单台 | 1.396 | 99. 78 million rows, 2. 98 GB ( 71. 47 million rows/s., 2. 13 GB/s.) |
|
|
| 两台 | 0.882 | 118. 76 million rows, 3. 13 GB ( 134. 67 million rows/s., 3. 55 GB/s.) |
|
|
| 两台+索引不合理 | 0. 175 | 15. 41 million rows, 529. 22 MB ( 88. 08 million rows/s., 3. 03 GB/s.) |
180G | 496403745 | 7月14号~7月20号 | 单台 | 2. 272 | 478. 69 million rows, 8. 19 GB ( 210. 70 million rows/s., 3. 60 GB/s.) |
|
|
| 两台+索引不合理 | 0. 770 | 90. 80 million rows, 3. 10 GB ( 117. 99 million rows/s., 4. 02 GB/s.) |
|
|
| 两台+索引合理 | 0.110 | 6. 30 million rows, 203. 90 MB ( 52. 59 million rows/s., 1. 70 GB/s.) |
校验真正应该扫描的行数
SELECT count(1)
FROM dwd.flow_analysis_collect_detail_dist_6
WHERE (toDate(dataTime) = '2020-07-20') AND (event = '进入小程序')
┌─count(1)─┐
│ 6125605 │
└──────────┘
执行结果:
┌─toHour(dataTime)─┬─ count( 1)─┐
│ 0 │ 103067 │
│ 1 │ 65022 │
│ 2 │ 47014 │
│ 3 │ 31370 │
│ 4 │ 21733 │
│ 5 │ 30135 │
│ 6 │ 79276 │
│ 7 │ 205524 │
│ 8 │ 302196 │
│ 9 │ 293494 │
│ 10 │ 304415 │
│ 11 │ 330321
│ 12 │ 322003 │
│ 13 │ 303266 │
│ 14 │ 347044 │
│ 15 │ 338380 │
│ 16 │ 366408 │
│ 17 │ 429977 │
│ 18 │ 482553 │
│ 19 │ 466647 │
│ 20 │ 446981 │
│ 21 │ 389081 │
│ 22 │ 260909 │
│ 23 │ 158789 │
└──────────────────┴──────────┘
2.单表多日查询
select toDate(dataTime),count(1) from dwd.flow_analysis_collect_detail_dist_6 \
where toDate(dataTime)>='2020-07-14' and toDate(dataTime)<='2020-07-21' and event='进入小程序' \
group by toDate(dataTime)
数据总量 | 数据总条数 | 数据范围 | 执行环境 | 耗时(秒) | 执行详情 |
36G | 78293173 | 7月20号 | 单台 | 1.460 | 99.78 million rows, 2.98 GB (68.34 million rows/s., 2.04 GB/s.) |
两台 | 0. 685 | 78. 29 million rows, 2. 80 GB ( 114. 29 million rows/s., 4. 09 GB/s.) | |||
72G | 150009393
| 20号、21号 | 单台 | 2.071 | 138. 99 million rows, 4. 98 GB ( 67. 12 million rows/s., 2. 40 GB/s.) |
两台 | 1. 211 | 150. 01 million rows, 5. 37 GB ( 123. 90 million rows/s., 4. 44 GB/s.) | |||
两台+索引不合理 | 0. 159 | 15. 42 million rows, 529. 79 MB ( 96. 89 million rows/s., 3. 33 GB/s.) | |||
180G | 496403745 | 7月14号~7月20号 | 单台 | 9. 117 | 1. 15 billion rows, 41. 26 GB ( 126. 38 million rows/s., 4. 53 GB/s.) |
两台+索引不合理 | 0.799 | 89.63 million rows, 3.05 GB (112.16 million rows/s., 3.82 GB/s.) | |||
两台+索引合理 | 0.543 | 73.97 million rows, 2.38 GB (98.81 million rows/s., 3.18 GB/s.) |
校验真正应该扫描的行数
SELECT count(1)
FROM dwd.flow_analysis_collect_detail_dist_6
WHERE (toDate(dataTime) >= '2020-07-14') AND (toDate(dataTime) <= '2020-07-21') AND (event = '进入小程序')
┌─count(1)─┐
│ 67322569 │
└──────────┘
执行结果:
┌─toDate(dataTime)─┬─ count( 1)─┐
│ 2020- 07- 14 │ 5433342 │
│ 2020- 07- 15 │ 5941668 │
│ 2020- 07- 16 │ 5656247 │
│ 2020- 07- 17 │ 5509169 │
│ 2020- 07- 18 │ 5735335 │
│ 2020- 07- 19 │ 5903877 │
│ 2020- 07- 20 │ 6125605 │
│ 2020- 07- 21 │ 4686035 │
└──────────────────┴──────────┘
3.层关联汇总查询
select toDate(t1.dataTime),count(1) from ( \
select * from dwd.flow_analysis_collect_detail_dist_6 where toDate(dataTime)>='2020-07-14' and toDate(dataTime)<='2020-07-21' and event='扫码结果' \
)t1 \
inner join ( \
select did from dwd.flow_analysis_collect_detail_dist_6 where toDate(dataTime)>='2020-07-14' and toDate(dataTime)<='2020-07-21' and event='进入小程序' \
group by did \
having(count(1)>=2) \
)t2 on t1.did=t2.did \
group by toDate(t1.dataTime)
数据总量 | 数据总条数 | 数据范围 | 执行环境 | 耗时(秒) | 执行详情 |
6G | 13162122 | 7月20号中1/6数据 | 单台 | 1. 667 | 26.32 million rows ,Processed 26.32 million rows, 2.23 GB (15.80 million rows/s., 1.34 GB/s.) |
两台 | 0. 886 | 25. 99 million rows, 2. 20 GB ( 29. 34 million rows/s., 2. 49 GB/s.) | |||
72G | 150009393
| 20号、21号 | 单台 | 10. 008 | 277. 99 million rows, 23. 53 GB ( 27. 78 million rows/s., 2. 35 GB/s.) |
两台 | 6. 833 | 300. 02 million rows, 25. 40 GB ( 43. 91 million rows/s., 3. 72 GB/s.) | |||
两台+索引 | 2. 927 | 24. 64 million rows, 1. 89 GB ( 8. 42 million rows/s., 644. 46 MB/s.) | |||
180G | 496403745 | 7月14号~7月20号 | 单台 | 56. 768 | 2. 30 billion rows, 194. 87 GB ( 40. 59 million rows/s., 3. 43 GB/s.) |
两台+索引不合理 | 20. 676 | 141. 23 million rows, 10. 83 GB ( 6. 83 million rows/s., 523. 88 MB/s.) | |||
两台+索引不合理+查询20号、21号(等同于查询 72G的情况) | 18.537 | 140.71 million rows, 10.80 GB (7.59 million rows/s., 582.57 MB/s.) | |||
两台+索引合理 | 12.453 | 97. 45 million rows, 7. 58 GB ( 7. 57 million rows/s., 589. 35 MB/s.) | |||
两台+索引合理+查询20号、21号(等同于查询 72G的情况) | 1.474 | 9.08 million rows, 707.96 MB (6.16 million rows/s., 480.15 MB/s.) |
校验真正应该扫描的行数
SELECT count(1)
FROM dwd.flow_analysis_collect_detail_dist_6
WHERE (toDate(dataTime) >= '2020-07-14') AND (toDate(dataTime) <= '2020-07-21') AND (event = '扫码结果')
┌─count(1)─┐
│ 28053553 │
└──────────┘
SELECT count(1)
FROM dwd.flow_analysis_collect_detail_dist_6
WHERE (toDate(dataTime) >= '2020-07-14') AND (toDate(dataTime) <= '2020-07-21') AND (event = '进入小程序')
┌─count(1)─┐
│ 67322569 │
└──────────┘
┌─toDate(dataTime)─┬─ count( 1)─┐
│ 2020- 07- 14 │ 3570968 │
│ 2020- 07- 15 │ 3895880 │
│ 2020- 07- 16 │ 3720074 │
│ 2020- 07- 17 │ 3643796 │
│ 2020- 07- 18 │ 3787920 │
│ 2020- 07- 19 │ 3902946 │
│ 2020- 07- 20 │ 4086668 │
│ 2020- 07- 21 │ 3121926 │
└──────────────────┴──────────┘
4.层内关联查询
select toDate(t1.dataTime),count(1) from ( \
select * from dwd.flow_analysis_collect_detail_dist_6 where toDate(dataTime)>='2020-07-14' and toDate(dataTime)<='2020-07-21' and event='扫码结果' \
)t1 \
inner join ( \
select did from dwd.flow_analysis_collect_detail_dist_6 where toDate(dataTime)>='2020-07-14' and toDate(dataTime)<='2020-07-21' and event='进入小程序' \
group by did \
having(count(1)>=2) \
union all \
select did from dwd.flow_analysis_collect_detail_dist_6 where toDate(dataTime)>='2020-07-14' and toDate(dataTime)<='2020-07-21' and event='进入开锁页面' \
group by did \
having(count(1)>=2) \
)t2 on t1.did=t2.did \
group by toDate(t1.dataTime);
数据总量 | 数据总条数 | 数据范围 | 执行环境 | 耗时(秒) | 执行详情 |
6G | 13162122 | 7月20号中1/6数据 | 单台 | 2. 006 | 39.49 million rows, 3.35 GB (19.69 million rows/s., 1.67 GB/s.) |
两台 | 1. 706 | 38. 98 million rows, 3. 30 GB ( 22. 86 million rows/s., 1. 94 GB/s.) | |||
72G | 150009393
| 20号、21号 | 单台 | 22. 999 | 416. 98 million rows, 35. 30 GB ( 18. 13 million rows/s., 1. 53 GB/s.) |
两台 | 15. 476 | 450. 03 million rows, 38. 10 GB ( 29. 08 million rows/s., 2. 46 GB/s.) | |||
两台+索引不合理 | 5.499 | 32. 90 million rows, 2. 55 GB ( 5. 98 million rows/s., 463. 13 MB/s.) | |||
180G | 496403745 | 7月14号~7月20号 | 单台 | 58. 371 | 只能查询19号~21号, 查询14号~21号 clickhouse-server OOM 1.28 billion rows, 92.39 GB (21.99 million rows/s., 1.58 GB/s.) |
两台+索引不合理 | 38.832 | 188. 10 million rows, 14. 53 GB ( 5. 07 million rows/s., 391. 88 MB/s.) | |||
两台+索引不合理+查询20号、21号(等同于查询 72G的情况) | 24.277 | 187.35 million rows, 14.48 GB (7.72 million rows/s., 596.54 MB/s.) | |||
两台+索引合理 | 27.050 | 136.23 million rows, 10.67 GB (5.04 million rows/s., 394.33 MB/s.) | |||
两台+索引合理+查询20号、21号(等同于查询 72G的情况) | 4.841 | 22.04 million rows, 1.73 GB (4.55 million rows/s., 357.34 MB/s.) |
校验真正应该扫描的行数
SELECT count(1)
FROM dwd.flow_analysis_collect_detail_dist_6
WHERE (toDate(dataTime) >= '2020-07-14') AND (toDate(dataTime) <= '2020-07-21') AND (event = '扫码结果')
┌─count(1)─┐
│ 30398934 │
└──────────┘
SELECT count(1)
FROM dwd.flow_analysis_collect_detail_dist_6
WHERE (toDate(dataTime) >= '2020-07-14') AND (toDate(dataTime) <= '2020-07-21') AND (event = '进入小程序')
┌─count(1)─┐
│ 72859193 │
└──────────┘
SELECT count(1)
FROM dwd.flow_analysis_collect_detail_dist_6
WHERE (toDate(dataTime) >= '2020-07-14') AND (toDate(dataTime) <= '2020-07-21') AND (event = '进入开锁页面')
┌─count(1)─┐
│ 29591612 │
└──────────┘
┌─toDate(dataTime)─┬─ count( 1)─┐
│ 2020- 07- 17 │ 3444621 │
│ 2020- 07- 18 │ 3590432 │
│ 2020- 07- 19 │ 3709326 │
│ 2020- 07- 20 │ 3897574 │
│ 2020- 07- 21 │ 2920165 │
└──────────────────┴──────────┘
5.多层嵌套关联查询
select toDate(t1.dataTime),count(1) from ( \
select * from dwd.flow_analysis_collect_detail_dist_6 where toDate(dataTime)>='2020-07-17' and toDate(dataTime)<='2020-07-21' and event='扫码结果' \
)t1 \
inner join ( \
select did from dwd.flow_analysis_collect_detail_dist_6 where toDate(dataTime)>='2020-07-14' and toDate(dataTime)<='2020-07-21' and event='进入小程序' \
group by did \
having(count(1)>=2) \
union all \
select did from dwd.flow_analysis_collect_detail_dist_6 where toDate(dataTime)>='2020-07-14' and toDate(dataTime)<='2020-07-21' and event='进入开锁页面' \
group by did \
having(count(1)>=2) \
)t2 on t1.did=t2.did \
inner join ( \
select did from dwd.flow_analysis_collect_detail_dist_6 where toDate(dataTime)>='2020-07-14' and toDate(dataTime)<='2020-07-21' and event='进入小程序' \
group by did \
having(count(1)>=2) \
union all \
select did from dwd.flow_analysis_collect_detail_dist_6 where toDate(dataTime)>='2020-07-14' and toDate(dataTime)<='2020-07-21' and event='点击机动挪车' \
group by did \
having(count(1)>=2) \
)t3 on t1.did=t3.did \
group by toDate(t1.dataTime);
数据总量 | 数据总条数 | 数据范围 | 执行环境 | 耗时(秒) | 执行详情 |
6G | 13162122 | 7月20号中1/6数据 | 单台 | 3. 658 | 65.81 million rows, 5.58 GB (17.99 million rows/s., 1.53 GB/s.) |
两台 | 2.993 | 64. 97 million rows, 5. 50 GB ( 21. 71 million rows/s., 1. 84 GB/s.) | |||
72G | 150009393
| 20号、21号 | 单台 | 42. 588 | 694. 96 million rows, 58. 84 GB ( 16. 32 million rows/s., 1. 38 GB/s.) |
两台 | 28. 125 | 750. 05 million rows, 63. 49 GB ( 26. 67 million rows/s., 2. 26 GB/s.) | |||
两台+索引 | 9.135 | 53. 03 million rows, 4. 06 GB ( 5. 81 million rows/s., 443. 92 MB/s.) | |||
180G | 496403745 | 7月14号~7月20号 | 单台 | 110. 158 | 只能查询19号~21号, 查询14号~21号 clickhouse-server OOM 2. 14 billion rows, 153. 99 GB ( 19. 42 million rows/s., 1. 40 GB/s.) |
两台+索引不合理 | 59. 311 | 只能查询17号~21号, 查询14号~21号 clickhouse-server OOM 301. 41 million rows, 23. 07 GB ( 5. 08 million rows/s., 389. 02 MB/s.) | |||
两台+索引合理 | 53.255 | 215.30 million rows, 16.76 GB (4.04 million rows/s., 314.65 MB/s.) |
执行结果
┌─toDate(--t1.dataTime)─┬─count(1)─┐
│ 2020-07-14 │ 7006930 │
│ 2020-07-15 │ 7653622 │
│ 2020-07-16 │ 7306898 │
│ 2020-07-17 │ 7156312 │
│ 2020-07-18 │ 4404902 │
│ 2020-07-19 │ 5785404 │
│ 2020-07-20 │ 3831247 │
│ 2020-07-21 │ 3461854 │
└───────────────────────┴──────────┘
五.结论
clickhouse查询性能与服务器数据量成正比, 但要达到最优(2倍提升),需要数据进行合理分片且数据要非常均衡
clickhouse 索引会极大提升查询性能,因为合理的索引会直接命中需要扫描的数据,从而避免非必要的全表扫描
clickhouse 关联性能远低于单表查询,且需要注意大表间关联容易出现 clickhouse-server 进程OOM
clickhouse 适合 OLAP下的多维动态复杂查询(druid 关联查询性能极差)