19. Kafka Engine
19.1. Kafka Engine
The Kafka engine integrates with Apache Kafka, letting you subscribe to or publish data streams.
Table engine declaration:
ENGINE = Kafka()
SETTINGS
    kafka_broker_list = 'host:port',
    kafka_topic_list = 'topic1,topic2,...',
    kafka_group_name = 'group_name',
    kafka_format = 'data_format'[,]
    [kafka_row_delimiter = 'delimiter_symbol',]
    [kafka_schema = '',]
    [kafka_num_consumers = N,]
    [kafka_skip_broken_messages = N]
Required parameters:
kafka_broker_list: a comma-separated list of brokers.
kafka_topic_list: a comma-separated list of Kafka topics.
kafka_group_name: the Kafka consumer group.
kafka_format: the message format, e.g. JSONEachRow.
Optional parameters:
kafka_row_delimiter: the delimiter character between rows.
kafka_schema: a schema definition, required for formats such as Cap'n Proto.
kafka_num_consumers: the number of consumers; defaults to 1 and should not exceed the number of partitions in the topic.
kafka_skip_broken_messages: the number of schema-incompatible messages per block that the Kafka message parser will tolerate. Default: 0.
Examples of creating Kafka engine tables
Example 1:
CREATE TABLE queue (
timestamp UInt64,
level String,
message String
) ENGINE = Kafka('localhost:9092', 'topic', 'group1', 'JSONEachRow');
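For a quick sanity check you can SELECT from the engine table directly; keep in mind that this consumes the messages, and on newer ClickHouse releases direct SELECTs on Kafka tables may additionally require enabling the stream_like_engine_allow_direct_select setting (an assumption to verify against your ClickHouse version):
SELECT * FROM queue LIMIT 5;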
Example 2:
CREATE TABLE queue2 (
timestamp UInt64,
level String,
message String
) ENGINE = Kafka('localhost:9092', 'topic', 'group1')
SETTINGS kafka_format = 'JSONEachRow',
kafka_num_consumers = 4;
Example 3:
CREATE TABLE queue3 (
timestamp UInt64,
level String,
message String
) ENGINE = Kafka SETTINGS
kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'topic',
kafka_group_name = 'group1',
kafka_format = 'JSONEachRow',
kafka_num_consumers = 4;
SELECT queries are not very useful for reading messages (except for debugging), because each message can be read only once.
In practice, this engine is combined with materialized views, as follows:
(1) Create a Kafka consumer using the Kafka engine and treat it as a data stream.
(2) Create a table with the desired structure.
(3) Create a materialized view that transforms data from the engine and puts it into the table created in the previous step.
When a materialized view is attached to the engine, it starts collecting data in the background. This lets you continuously receive messages from Kafka and convert them to the required format using SELECT. The views never read from Kafka directly; they receive new records in blocks, which makes it possible to write into several tables with different levels of detail (with aggregation and without).
To improve performance, received messages are grouped into blocks of max_insert_block_size rows. If a block is not formed within stream_flush_interval_ms milliseconds, the data is flushed to the table regardless of how complete the block is.
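Both are user-level settings. A minimal sketch of tuning them in a settings profile (the values shown are the typical defaults, so treat them as assumptions to adjust for your workload):
<!-- users.xml, or a file under users.d/ -->
<profiles>
    <default>
        <!-- target block size (rows) for inserts built from Kafka messages -->
        <max_insert_block_size>1048576</max_insert_block_size>
        <!-- flush even an incomplete block after this many milliseconds -->
        <stream_flush_interval_ms>7500</stream_flush_interval_ms>
    </default>
</profiles>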
To stop receiving topic data or to change the transformation logic, detach the materialized view:
DETACH TABLE consumer;
ATTACH TABLE consumer;
If you want to change the target table with ALTER, disable the materialized view first to avoid discrepancies between the data in the target table and the data in the view.
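A minimal sketch of that workflow, using the hypothetical view name consumer and a hypothetical target table daily:
-- stop consuming from Kafka
DETACH TABLE consumer;
-- change the target table while no data is flowing in
ALTER TABLE daily ADD COLUMN source String DEFAULT 'kafka';
-- resume consumption
ATTACH TABLE consumer;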
Extended Kafka configuration
The Kafka engine supports extended configuration through the ClickHouse configuration file.
There are two configuration keys you can use: the global kafka key and topic-level kafka_* keys. The global configuration is applied first, then the topic-level configuration.
<!-- Global configuration options for all tables of Kafka engine type -->
<kafka>
<debug>cgrp</debug>
<auto_offset_reset>smallest</auto_offset_reset>
</kafka>
<!-- Configuration specific for topic "logs"; the part after the kafka_ prefix is the topic name -->
<kafka_logs>
<retry_backoff_ms>250</retry_backoff_ms>
<fetch_min_bytes>100000</fetch_min_bytes>
</kafka_logs>
For the list of possible configuration options, see the librdkafka configuration reference:
https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md.
Use underscores (_) instead of dots in the ClickHouse configuration; for example, check.crcs=true becomes
<check_crcs>true</check_crcs>
The final configuration on my machine looks like this:
[root@middleware config.d]# vim kafka.xml
[root@middleware config.d]# pwd
/etc/clickhouse-server/config.d
[root@middleware config.d]# ls
kafka.xml
[root@middleware config.d]# cat kafka.xml
<yandex>
<kafka>
<debug>cgrp</debug>
<auto_offset_reset>smallest</auto_offset_reset>
</kafka>
<!-- Configuration specific for topic "topic_ch" -->
<kafka_topic_ch>
<auto_offset_reset>latest</auto_offset_reset>
<retry_backoff_ms>250</retry_backoff_ms>
<fetch_min_bytes>100000</fetch_min_bytes>
</kafka_topic_ch>
<kafka_my_topic>
<auto_offset_reset>latest</auto_offset_reset>
<retry_backoff_ms>250</retry_backoff_ms>
<fetch_min_bytes>100000</fetch_min_bytes>
</kafka_my_topic>
</yandex>
[root@middleware config.d]#
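Files under config.d are merged into the server configuration, and the new Kafka settings typically take effect only after the server is restarted. A sketch, assuming a systemd-managed installation:
systemctl restart clickhouse-server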
19.2. Example 1
Example 1: store the Kafka data in two tables, one holding the raw records and one holding grouped aggregates.
Create the Kafka topic:
Reference: https://kafka.apachecn.org/quickstart.html
[root@hadoop4 kafka-broker]# bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic topic_ch
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "topic_ch".
[root@hadoop4 kafka-broker]#
You can run the list command to view this topic:
[root@middleware kafka_2.12-2.6.0]# bin/kafka-topics.sh --list --zookeeper localhost:2181
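Since kafka_num_consumers should not exceed the topic's partition count, it is also worth describing the topic. A sketch, using the same tooling:
bin/kafka-topics.sh --describe --zookeeper localhost:2181 --topic topic_ch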
(1) Create the data stream table for the topic:
drop table if exists topic_ch_kafka;
CREATE TABLE topic_ch_kafka (
timestamp UInt64,
level String,
message String
) ENGINE = Kafka('localhost:9092', 'topic_ch', 'group_ch', 'JSONEachRow');
Output:
middleware :) CREATE TABLE topic_ch_kafka (
:-] timestamp UInt64,
:-] level String,
:-] message String
:-] ) ENGINE = Kafka('localhost:9092', 'topic_ch', 'group_ch', 'JSONEachRow');
CREATE TABLE topic_ch_kafka
(
`timestamp` UInt64,
`level` String,
`message` String
)
ENGINE = Kafka('localhost:9092', 'topic_ch', 'group_ch', 'JSONEachRow')
Ok.
0 rows in set. Elapsed: 0.007 sec.
middleware :)
(2) Create the table that stores the raw records, along with the corresponding materialized view:
DROP TABLE topic_ch_list;
CREATE TABLE topic_ch_list (
timestamp UInt64,
level String,
message String
) ENGINE = MergeTree()
order by (timestamp);
DROP TABLE topic_ch_list_view;
CREATE MATERIALIZED VIEW topic_ch_list_view TO topic_ch_list
AS SELECT timestamp, level, message
FROM topic_ch_kafka;
Output:
middleware :) DROP TABLE topic_ch_list;
DROP TABLE topic_ch_list
Received exception from server (version 20.9.3):
Code: 60. DB::Exception: Received from localhost:9000. DB::Exception: Table default.topic_ch_list doesn't exist..
0 rows in set. Elapsed: 0.015 sec.
middleware :) CREATE TABLE topic_ch_list (
:-] timestamp UInt64,
:-] level String,
:-] message String
:-] ) ENGINE = MergeTree()
:-] order by (timestamp);
CREATE TABLE topic_ch_list
(
`timestamp` UInt64,
`level` String,
`message` String
)
ENGINE = MergeTree()
ORDER BY timestamp
Ok.
0 rows in set. Elapsed: 0.006 sec.
middleware :) DROP TABLE topic_ch_list_view;
DROP TABLE topic_ch_list_view
Received exception from server (version 20.9.3):
Code: 60. DB::Exception: Received from localhost:9000. DB::Exception: Table default.topic_ch_list_view doesn't exist..
0 rows in set. Elapsed: 0.002 sec.
middleware :) CREATE MATERIALIZED VIEW topic_ch_list_view TO topic_ch_list
:-] AS SELECT timestamp, level, message
:-] FROM topic_ch_kafka;
CREATE MATERIALIZED VIEW topic_ch_list_view TO topic_ch_list AS
SELECT
timestamp,
level,
message
FROM topic_ch_kafka
Ok.
0 rows in set. Elapsed: 0.005 sec.
middleware :)
(3) Create the aggregate table and the corresponding materialized view:
DROP TABLE topic_ch_daily;
CREATE TABLE topic_ch_daily (
day Date,
level String,
total UInt64
) ENGINE = SummingMergeTree()
ORDER BY (day, level);
DROP TABLE topic_ch_daily_view;
CREATE MATERIALIZED VIEW topic_ch_daily_view TO topic_ch_daily
AS SELECT toDate(toDateTime(timestamp)) AS day, level, count() as total
FROM topic_ch_kafka GROUP BY day, level;
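Note that SummingMergeTree collapses rows with the same key only during background merges, so queries against topic_ch_daily should still aggregate explicitly. A minimal sketch of a per-day breakdown:
SELECT day, level, sum(total) AS total
FROM topic_ch_daily
GROUP BY day, level
ORDER BY day, level;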
(4) Produce data into the topic:
[root@middleware ~]# source /etc/profile
[root@middleware ~]# $ZOOKEEPER_HOME/bin/zkServer.sh start
ZooKeeper JMX enabled by default
Using config: /root/apache-zookeeper-3.6.2-bin/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
[root@middleware ~]# # start Kafka
[root@middleware ~]# cd $KAFKA_HOME
[root@middleware kafka_2.12-2.6.0]# bin/kafka-server-start.sh -daemon config/server.properties
[root@middleware kafka_2.12-2.6.0]# bin/kafka-topics.sh --list --zookeeper localhost:2181
__consumer_offsets
my_topic
test
topic
topic_ch
topic_ch2
[root@middleware kafka_2.12-2.6.0]# bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic topic_ch
>{"timestamp":1542426134, "level":"high", "message":"hehe"}
>{"timestamp":1542427132, "level":"high", "message":"hehe"}
>{"timestamp":1542428133, "level":"mid", "message":"hehe"}
>{"timestamp":1542429134, "level":"low", "message":"hehe"}
>{"timestamp":1542430134, "level":"high", "message":"hehe"}
>{"timestamp":1542423134, "level":"low", "message":"hehe"}
>{"timestamp":1542434434, "level":"low", "message":"hehe"}
>{"timestamp":1542444134, "level":"low", "message":"hehe"}
>{"timestamp":1542454136, "level":"high", "message":"hehe"}
>{"timestamp":1542464134, "level":"high", "message":"hehe"}
>{"timestamp":1542474134, "level":"high", "message":"hehe"}
>{"timestamp":1542484134, "level":"low", "message":"hehe"}
>{"timestamp":1542494134, "level":"high", "message":"hehe"}
>{"timestamp":1542424194, "level":"mid", "message":"hehe"}
>
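Instead of typing messages interactively, you can pipe a file of JSON lines into the console producer; messages.jsonl below is a hypothetical file with one JSON object per line:
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic topic_ch < messages.jsonl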
Check the results (the table also contains earlier historical messages):
middleware :) select * from topic_ch_list;
SELECT *
FROM topic_ch_list
┌──timestamp─┬─level─┬─message─┐
│ 1542474134 │ high │ hehe │
└────────────┴───────┴─────────┘
┌──timestamp─┬─level─┬─message─┐
│ 1542423134 │ low │ hehe │
│ 1542424132 │ high │ hehe │
│ 1542424132 │ high │ hehe │
│ 1542424133 │ mid │ hehe │
│ 1542424133 │ mid │ hehe │
│ 1542424134 │ high │ hehe │
│ 1542424134 │ low │ hehe │
│ 1542424134 │ low │ hehe │
│ 1542424134 │ low │ hehe │
│ 1542424134 │ high │ hehe │
│ 1542424134 │ high │ hehe │
│ 1542424134 │ low │ hehe │
│ 1542424134 │ high │ hehe │
│ 1542424134 │ high │ hehe │
│ 1542424134 │ high │ hehe │
│ 1542424134 │ high │ hehe │
│ 1542424134 │ low │ hehe │
│ 1542424136 │ high │ hehe │
│ 1542424434 │ low │ hehe │
│ 1542426134 │ high │ hehe │
│ 1542427132 │ high │ hehe │
│ 1542428133 │ mid │ hehe │
│ 1542429134 │ low │ hehe │
│ 1542430134 │ high │ hehe │
│ 1542434134 │ high │ hehe │
│ 1542434434 │ low │ hehe │
│ 1542444134 │ low │ hehe │
│ 1542454136 │ high │ hehe │
│ 1542464134 │ high │ hehe │
└────────────┴───────┴─────────┘
┌──timestamp─┬─level─┬─message─┐
│ 1542424194 │ mid │ hehe │
└────────────┴───────┴─────────┘
┌──timestamp─┬─level─┬─message─┐
│ 1542484134 │ low │ hehe │
└────────────┴───────┴─────────┘
┌──timestamp─┬─level─┬─message─┐
│ 1542494134 │ high │ hehe │
└────────────┴───────┴─────────┘
33 rows in set. Elapsed: 0.058 sec.
middleware :)
Query the aggregate table:
SELECT level, sum(total) FROM topic_ch_daily GROUP BY level;
The result looks similar to:
┌─level─┬─sum(total)─┐
│ mid │ 3 │
│ low │ 5 │
│ high │ 8 │
└───────┴────────────┘
To stop receiving topic data or to change the transformation logic, detach the materialized view with the following commands (run inside clickhouse-client -m); here consumer stands for the name of the materialized view, e.g. topic_ch_list_view in this example:
DETACH TABLE consumer;
ATTACH TABLE consumer;
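For the tables in this example, a minimal sketch would be:
DETACH TABLE topic_ch_list_view;
-- adjust the target table or the transformation logic here
ATTACH TABLE topic_ch_list_view;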
19.3. Example 2: Kafka Configuration
The Kafka engine supports extended configuration through the ClickHouse configuration file. There are two configuration keys you can use: global (kafka) and topic-level (kafka_*). The global configuration is applied first, followed by the topic-level configuration.
Create a configuration file in the directory /etc/clickhouse-server/config.d/; the file name is arbitrary, here it is kafka.xml:
[root@middleware config.d]# pwd
/etc/clickhouse-server/config.d
[root@middleware config.d]# ls
kafka.xml
[root@middleware config.d]#
The contents of kafka.xml:
<yandex>
<!-- Global settings that apply to messages from all Kafka topics -->
<kafka>
<debug>cgrp</debug>
<auto_offset_reset>smallest</auto_offset_reset>
</kafka>
<!-- Configuration specific for topic "topic_ch" only -->
<kafka_topic_ch>
<auto_offset_reset>latest</auto_offset_reset>
<retry_backoff_ms>250</retry_backoff_ms>
<fetch_min_bytes>100000</fetch_min_bytes>
</kafka_topic_ch>
<!-- Configuration specific for topic "my_topic" only -->
<kafka_my_topic>
<auto_offset_reset>latest</auto_offset_reset>
<retry_backoff_ms>250</retry_backoff_ms>
<fetch_min_bytes>100000</fetch_min_bytes>
</kafka_my_topic>
</yandex>
For more about these optional settings, see the librdkafka configuration reference (https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md). Use underscores (_) instead of dots in the ClickHouse configuration file; for example, check.crcs=true is written as <check_crcs>true</check_crcs>.
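As another sketch of this dot-to-underscore mapping (the option names are real librdkafka settings; the values are only illustrative):
<kafka>
    <!-- librdkafka: session.timeout.ms -->
    <session_timeout_ms>30000</session_timeout_ms>
    <!-- librdkafka: max.poll.interval.ms -->
    <max_poll_interval_ms>300000</max_poll_interval_ms>
</kafka>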