Kafka Definition

weixin_33670713

Original link: https://my.oschina.net/LucasZhu/blog/1837337

Definition:

Apache Kafka® is a distributed streaming platform.

A streaming platform has three key capabilities:

Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
Store streams of records in a fault-tolerant durable way.
Process streams of records as they occur.

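In other words: first, publish/subscribe functionality similar to an MQ; second, recording streams in a fault-tolerant, durable way; third, the ability to process the data. So we can think of Kafka as message middleware. Next, let's look at where Kafka sits and the systems it integrates with: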

Several important Kafka concepts:

Kafka is run as a cluster on one or more servers that can span multiple datacenters.
The Kafka cluster stores streams of records in categories called topics.
Each record consists of a key, a value, and a timestamp.
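To make the three concepts above concrete, here is a minimal in-memory sketch. It is not the Kafka client API: the `Record` class, the `topics` dictionary, and the `publish` function are hypothetical names invented for illustration, and a real cluster stores records on disk across brokers.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a Kafka record: a key, a value, and a timestamp."""
    key: Optional[bytes]
    value: bytes
    # Defaults to the current wall-clock time in milliseconds.
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


# A toy "cluster": topic name -> stream of records published to that topic.
topics: Dict[str, List[Record]] = {}


def publish(topic: str, record: Record) -> None:
    """Append a record to the named topic, creating the topic on first use."""
    topics.setdefault(topic, []).append(record)


publish("payments", Record(key=b"order-1", value=b"paid", timestamp_ms=1_000))
publish("payments", Record(key=b"order-2", value=b"paid", timestamp_ms=2_000))
```

After the two calls, `topics["payments"]` holds both records in the order they were published.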

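Architecture

The Kafka architecture is shown in the figure below. At its core, message middleware is produce, store, consume. As the figure shows, in Kafka's architecture the producers, the consumers, and the message storage can all be scaled out horizontally to raise the processing capacity of the whole cluster; Kafka was born a distributed system. Another very important Kafka feature that the figure does not show is replication: when creating a topic you specify the number of partitions and can also specify the number of replicas (the replica count may not exceed the number of brokers, otherwise you get an error: replication factor:2 larger than available brokers : 1). Among the replicas of a partition there is exactly one leader, and the rest are followers. By default only the leader replica serves reads and writes; the followers copy the leader's data and act purely as backups. When the leader goes down, a new leader is elected from the followers, which gives high availability.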
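The replica rule and the leader failover described above can be sketched as a small simulation. This is an illustrative model only, not how the broker implements it: `assign_replicas` and `elect_leader` are hypothetical helpers, the round-robin placement is an assumption, and a real cluster elects the new leader from replicas that are in sync with the old one.

```python
from typing import Dict, List, Set


def assign_replicas(brokers: List[int], partitions: int,
                    replication_factor: int) -> Dict[int, List[int]]:
    """Place replicas round-robin; the first broker in each list acts as leader."""
    if replication_factor > len(brokers):
        # Mirrors the rule in the text: more replicas than brokers is an error.
        raise ValueError(
            f"replication factor: {replication_factor} "
            f"larger than available brokers: {len(brokers)}"
        )
    return {
        p: [brokers[(p + r) % len(brokers)] for r in range(replication_factor)]
        for p in range(partitions)
    }


def elect_leader(replicas: List[int], live_brokers: Set[int]) -> int:
    """Pick the first replica whose broker is still alive as the new leader."""
    for broker in replicas:
        if broker in live_brokers:
            return broker
    raise RuntimeError("no live replica available")


assignment = assign_replicas(brokers=[0, 1, 2], partitions=3, replication_factor=2)
# assignment == {0: [0, 1], 1: [1, 2], 2: [2, 0]}

# Broker 0 goes down, so partition 0 fails over to its follower on broker 1.
new_leader = elect_leader(assignment[0], live_brokers={1, 2})
```

Calling `assign_replicas(brokers=[0], partitions=1, replication_factor=2)` raises `ValueError`, which corresponds to the error quoted in the text.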

topic

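The figure below shows the anatomy of a topic. Kafka only has the concept of a topic; it has no counterpart to ActiveMQ's Queue (point-to-point), whereas ActiveMQ has both Topic and Queue. A topic can have any number of partitions, and the partition count can be changed at runtime, but it can only be increased, never decreased. Messages within a partition are ordered; there is no ordering across partitions. New messages are written by appending, and this sequential write pattern gives Kafka very high throughput (some benchmarks show that sequential disk writes can be faster than random memory writes).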
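The partition behavior described above (append-only writes, ordering only within a partition, a partition count that can grow but not shrink) can be modeled in a few lines. This is a sketch under assumptions: the `Topic` class is hypothetical, and `zlib.crc32` stands in for the hash that a real producer's partitioner applies to the key.

```python
import zlib
from typing import List, Tuple


class Topic:
    """Toy topic: a list of append-only partition logs."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[bytes]] = [[] for _ in range(num_partitions)]

    def add_partitions(self, new_total: int) -> None:
        """Grow the topic to new_total partitions; shrinking is rejected."""
        if new_total < len(self.partitions):
            raise ValueError("partition count cannot be reduced")
        self.partitions.extend([] for _ in range(new_total - len(self.partitions)))

    def append(self, key: bytes, value: bytes) -> Tuple[int, int]:
        """Append to the partition chosen by the key hash; return (partition, offset)."""
        p = zlib.crc32(key) % len(self.partitions)  # stand-in for the real partitioner
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1


topic = Topic("TOPIC_PAYMENT_ORDER_SUCCESS", num_partitions=3)
placements = [topic.append(b"order-42", v) for v in (b"created", b"paid", b"shipped")]
```

All three messages share a key, so they land in the same partition with offsets 0, 1, and 2, which is what preserves their relative order. Messages with different keys may land in different partitions, and nothing orders them relative to each other.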

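Topic definition
Official definition: A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.
In other words: a topic is the name of the category that records are published to. Topics in Kafka are always multi-subscriber, so a topic can have zero, one, or many consumers subscribed to it.
For example, after an order is paid successfully, a message is published to a topic named TOPIC_PAYMENT_ORDER_SUCCESS. The SMS system can subscribe to this topic and send the user a text message; the logistics system can subscribe to it and add a new tracking update.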

Disk vs. memory speed comparison

As the figure below shows, sequential disk writes (Sequential, disk) reach 53.2M, while random memory writes (Random, memory) reach 36.7M.

durable

Kafka's storage policy for the message log is: The Kafka cluster durably persists all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.

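Kafka persists every published message, whether or not it has been consumed, for a configurable retention period (the settings that start with log.retention, such as log.retention.ms, log.retention.minutes, log.retention.hours, and log.retention.bytes). For example, with a retention period of two days, the message log can be read by offset for two days; after that, Kafka deletes the expired log files to free disk space.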
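The time-based retention rule above can be sketched as a filter over a log. This is a simplification under assumptions: `apply_retention` is a hypothetical helper that works record by record, whereas the broker deletes whole log files, and only the time limit is modeled here, not the size limit.

```python
from typing import List, Tuple

DAY_MS = 24 * 60 * 60 * 1000
RETENTION_MS = 2 * DAY_MS  # two days, matching the example in the text

# Each entry is (offset, timestamp_ms, value).
LogEntry = Tuple[int, int, bytes]


def apply_retention(log: List[LogEntry], now_ms: int,
                    retention_ms: int = RETENTION_MS) -> List[LogEntry]:
    """Keep only the entries that are no older than retention_ms at time now_ms."""
    return [entry for entry in log if now_ms - entry[1] <= retention_ms]


log = [(0, 0, b"a"), (1, 2 * DAY_MS, b"b"), (2, 3 * DAY_MS, b"c")]
kept = apply_retention(log, now_ms=3 * DAY_MS)
# kept == [(1, 2 * DAY_MS, b"b"), (2, 3 * DAY_MS, b"c")]
```

The surviving entries keep their original offsets (1 and 2), so a consumer's stored position stays meaningful after old data has been discarded. Whether a message was consumed plays no part in the decision.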

consumer

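The figure below shows how a partition of a topic is consumed. How Kafka picks a particular partition among a topic's partitions is covered in a later article. As the figure shows, a consumer locates and reads messages by offset, and the offset each consumer holds is its own consumption progress.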
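The idea that each consumer tracks its own offset can be illustrated with a toy reader over a shared partition log. `PartitionConsumer` and its `poll` and `seek` methods are hypothetical names for this sketch and only loosely echo the real consumer API.

```python
from typing import List


class PartitionConsumer:
    """Toy consumer that reads one partition and remembers its own position."""

    def __init__(self, partition_log: List[bytes], start_offset: int = 0) -> None:
        self._log = partition_log
        self.offset = start_offset  # next offset this consumer will read

    def poll(self, max_records: int = 10) -> List[bytes]:
        """Return up to max_records messages from the current offset and advance."""
        batch = self._log[self.offset:self.offset + max_records]
        self.offset += len(batch)
        return batch

    def seek(self, offset: int) -> None:
        """Move the position, for example back to 0 to re-read old messages."""
        self.offset = offset


partition = [b"m0", b"m1", b"m2", b"m3"]
slow = PartitionConsumer(partition)
fast = PartitionConsumer(partition, start_offset=2)

first_batch = slow.poll(max_records=3)   # [b"m0", b"m1", b"m2"]
fast_batch = fast.poll()                 # [b"m2", b"m3"]
```

Reading does not remove anything from the log, so the two consumers progress independently, and `slow.seek(0)` would let the first one replay the partition from the start.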

consumer group

Each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.

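That is, among all the consumers of a consumer group that subscribes to a topic, any given message is consumed by only one of them. If there are several consumer groups, the groups do not interfere with one another. The consumer group diagram is shown below: a topic has 4 partitions, P0, P1, P2, and P3. Consumer Group A has two consumers, C1 and C2. Consumer Group B has four consumers, C3, C4, C5, and C6. If a producer now sends a message, it is consumed by exactly one of C1 and C2 in Consumer Group A, and by exactly one of C3, C4, C5, and C6 in Consumer Group B.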
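The P0..P3 and C1..C6 example above can be reproduced with a small assignment sketch. The round-robin strategy and the `assign_partitions` and `receivers` helpers are assumptions chosen for illustration; the partition-to-consumer mapping in a real cluster depends on the configured assignor and may differ from the one computed here.

```python
from typing import Dict, List


def assign_partitions(partitions: List[str],
                      consumers: List[str]) -> Dict[str, List[str]]:
    """Spread partitions over the consumers of one group, round-robin."""
    assignment: Dict[str, List[str]] = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


def receivers(partition: str, groups: List[Dict[str, List[str]]]) -> List[str]:
    """List the consumers (one per group) that receive a message from partition."""
    return [c for g in groups for c, owned in g.items() if partition in owned]


partitions = ["P0", "P1", "P2", "P3"]
group_a = assign_partitions(partitions, ["C1", "C2"])
group_b = assign_partitions(partitions, ["C3", "C4", "C5", "C6"])
# group_a == {"C1": ["P0", "P2"], "C2": ["P1", "P3"]}
# group_b == {"C3": ["P0"], "C4": ["P1"], "C5": ["P2"], "C6": ["P3"]}

who_gets_p0 = receivers("P0", [group_a, group_b])  # ["C1", "C3"]
```

Because each partition has exactly one owner per group, a message written to P0 reaches one consumer in Group A and one in Group B, which gives load balancing inside a group and broadcast across groups.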