clickhouse简介和性能测试

最新推荐文章于 2024-08-05 21:57:23 发布

hochoy

最新推荐文章于 2024-08-05 21:57:23 发布

阅读量1.1w

点赞数 2

分类专栏： clickhouse DB 文章标签： clickhouse DBMS

本文链接：https://blog.csdn.net/hochoy/article/details/88020245

版权

DB 同时被 2 个专栏收录

2 篇文章 0 订阅

订阅专栏

clickhouse

1 篇文章 0 订阅

订阅专栏

clickhouse使用文档：相对简洁的介绍了clickhouse的使用，对各种引擎做了简单介绍

clickhouse官网：clickhouse的权威官方网站

准备：在分布式clickhouse集群搭建好之后，进行建库建表，并导入数据并做性能测试。

clickhouse特点：

数据通过小批量Batch存储
支持高强度的写操作（数千行写入/每秒）
读数据量非常小
读数据操作中Primary Key 的数量有限（<1百万）
每一行的数据量很小

优点：

多个服务器上的分布式处理：分布式查询：从分布式表查询-> 重写 ->负载均衡,发送给远程节点查询->接收结果、合并
非常快速的扫描，可用于实时查询
列存储非常适合使用“宽”/“非规范化”表（许多列）：计算类查询时，大大减少IO消耗
压缩性好：相对mysql压缩10倍
SQL支持（有限制）
良好的功能集，包括支持近似计算
不同的表引擎：MergeTree，ReplicatedMergeTree，Distributed等
非常适合结构日志/事件数据以及时间序列数据（引擎MergeTree需要日期字段）
索引支持（仅限主键，不是所有存储引擎）
漂亮的命令行界面，具有用户友好的进度条和格式

缺点：

没有真正的删除/更新支持，也没有事务（与Spark和大多数大数据系统相同），没有delete/update
没有二级密钥（与Spark和大多数大数据系统相同）
只支持自己的协议（没有MySQL协议支持）
有限的SQL支持，以及连接实现是不同的。如果要从MySQL或Spark迁移，则可能必须使用连接重新编写所有查询。
没有窗口功能

ClickHouse vs. Spark

Size / compression	Spark v. 2.0.2	ClickHouse
Data storage format	Parquet, compressed: snappy	Internal storage, compressed
Size (uncompressed: 1.2TB)	395G	212G

Test	Spark v. 2.0.2	ClickHouse	Diff
Query 1: count (warm)	7.37 sec (no disk IO)	6.61 sec	~same
Query 2: simple group (warm)	792.55 sec (no disk IO)	37.45 sec	21x better
Query 3: complex group by	2522.9 sec	398.55 sec	6.3x better

Clickhouse

clickhouse 引擎：clickhouse较核心的模块

性能测试数据准备

环境：

clickhouse 集群：节点为192.168.1.1 192.168.1.2 192.168.1.3 三个节点的 clickhouse集群。

创建数据库：

本文使用default默认的数据库

建表：

在三台机器上进入clickhouse-client 并分别执行以下创建表的语句

CREATE TABLE test_table  (city String,model String,module String,duration UInt32,network String,lib_version String,platform String,country String,osversion String,region String,appkey String,version String,channelid String,deviceid UInt32,time DateTime,action Stringdate Date)ENGINE = MergeTree(date,(date), 8192);

建分布式表：

分布表（Distributed）本身不存储数据，相当于路由，需要指定集群名、数据库名、数据表名、分片KEY，这里分片用rand()函数，表示随机分片。查询分布表，会根据集群配置信息，路由到具体的数据表，再把结果进行合并。

CREATE TABLE test_table_all  AS test_table ENGINE = Distributed(perftest_3shards_1replicas, default, test_table, rand());

数据导入：

将准备好的json数据导入action_data中

cat test.log | clickhouse-client --query="INSERT INTO default.test_table FORMAT JSONEachRow"

导入之后查询

select * from test_table_all limit 1000;

select count(*) from test_table_all ;

三个节点上均可查询，已经备份。

性能测试

集群机器配置

集群的三个节点均为虚机，配置如下

服务器	配置
1	Linux apm01 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux cpu cores : 4 MemTotal: 16266780 kB
2	Linux apm02 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux cpu cores : 4 MemTotal: 16266764 kB
3	Linux clickhouse02 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux cpu cores : 4 MemTotal: 16266764 kB

服务器

配置

Linux apm01 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

cpu cores : 4

MemTotal: 16266780 kB

Linux apm02 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

cpu cores : 4

MemTotal: 16266764 kB

Linux clickhouse02 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

cpu cores : 4

MemTotal: 16266764 kB

数据条数：

243597660

查询性能时长：

查询结果和查询语句	数据大小	处理的行数
耗时：6ms 执行（select count(*) as GROUP_BY_MODULE, module from test_table_all where appkey = '33645ea6eda746b3be4cfc5fa23ec6a6' and time >= '2018-09-03 00:00:00' and time <'2018-10-04 00:00:00' group by module）	14.51 GB	243.60 million rows
耗时：8ms 执行（ select round(avg(duration)) as ROUND_AVG from test_table_all where appkey = '33645ea6eda746b3be4cfc5fa23ec6a6' and time >= '2018-09-03 00:00:00' and time <'2018-10-04 00:00:00'）	11.94 GB	243.60 million rows
耗时：1375ms 执行（select count(*) as ACTIVE_USER_OF_DAY from (select distinct deviceid from test_table_all where appkey = '33645ea6eda746b3be4cfc5fa23ec6a6' and time >= '2018-09-03 00:00:00' and time <'2018-09-04 00:00:00');）	11.94 GB	243.60 million rows
耗时：1609ms 执行（select count(1) as ALL_USER_COUNT from (select distinct deviceid from test_table_all where appkey = '33645ea6eda746b3be4cfc5fa23ec6a6')）	10.96 GB	243.60 million rows
耗时：2117ms 执行（select count(*) as LAUNCH_COUNT from test_table_all where appkey = '33645ea6eda746b3be4cfc5fa23ec6a6' and action = '$launch' and time >= '2018-09-03 00:00:00' and time <'2018-09-04 00:00:00';）	15.17 GB	243.60 million rows
耗时：2883ms 执行（select count(1) as ALL_USER_COUNT_Nanjing from (select distinct deviceid from test_table_all where appkey = '33645ea6eda746b3be4cfc5fa23ec6a6' and country = '中国' and region = '江苏' and city = '南京')）	21.98 GB	243.60 million rows
耗时：2918ms 执行（select count(*) as LOGIN_COUNT from test_table_all where appkey = '33645ea6eda746b3be4cfc5fa23ec6a6' and action like 'login%' and time >= '2018-09-03 00:00:00' and time <'2018-09-04 00:00:00'）	15.17 GB	243.60 million rows
耗时：3924ms 执行（select count(*) as GROUP_BY_VERSION_CHANNELID_HAVING, version ,channelid from test_table_all where appkey = '33645ea6eda746b3be4cfc5fa23ec6a6' and action = '$launch' and platform = 'android' and time >= '2018-09-03 00:00:00' and time <'2018-09-04 00:00:00' group by version ,channelid having GROUP_BY_VERSION_CHANNELID_HAVING > 50 and channelid like '1000%'）	25.50 GB	243.60 million rows
耗时：3932ms 执行（select count(*) as GROUP_BY_VERSION_CHANNELID , version ,channelid from test_table_all where appkey = '33645ea6eda746b3be4cfc5fa23ec6a6' and action = '$launch' and platform = 'android' and time >= '2018-09-03 00:00:00' and time <'2018-09-04 00:00:00' group by version ,channelid）	25.50 GB	243.60 million rows

结论：

在count 方面，速度很快，消耗内存较大
在group by 方面，速度很快，消耗内存很大

实时处理方面，处理的数据量远大于 mysql，但是count 和 group by 消耗内存较多，并发较高的情况下，业务服务器难以承受；group by 查询时性能还是不能达到妙级响应的要求；并发较小，100 Queries / second；所以不适合做业务型高并发实时查询
在批处理方面，由于其列式存储的设计，减小IO的消耗等原因计算性能远超 spark，hive ；100 Queries / second远高于spark几十个并发任务的数量；所以可以替代hadoop集群作为批处理离线框架

提供一个性能测试文档供参考