Kafka Chinese Manual: Quick Start (Part 1)

1.1 Introduction

Kafka™ is a distributed streaming platform. What exactly does that mean?

We think of a streaming platform as having three key capabilities:

  1. It lets you publish and subscribe to streams of records. In this respect it is similar to a message queue or enterprise messaging system.
  2. It lets you store streams of records in a fault-tolerant way.
  3. It lets you process streams of records as they occur.

What is Kafka good for?

It gets used for two broad classes of application:

  1. Building real-time streaming data pipelines that reliably get data between systems or applications
  2. Building real-time streaming applications that transform or react to the streams of data

To understand how Kafka does these things, let’s dive in and explore Kafka’s capabilities from the bottom up.

First a few concepts:

  • Kafka is run as a cluster on one or more servers.
  • The Kafka cluster stores streams of records in categories called topics.
  • Each record consists of a key, a value, and a timestamp (a minimal record sketch follows this list).
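
As a minimal sketch in the Java client, a record bundles exactly these three properties, with the timestamp settable explicitly. The topic name, partition, key, and value below are hypothetical placeholders:

```java
import org.apache.kafka.clients.producer.ProducerRecord;

// One record: key "user-42", value "clicked:home", explicit timestamp.
// Topic "page-views" and partition 0 are hypothetical.
ProducerRecord<String, String> record = new ProducerRecord<>(
        "page-views",                 // topic
        0,                            // partition
        System.currentTimeMillis(),   // timestamp
        "user-42",                    // key
        "clicked:home");              // value
```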

Kafka has four core APIs:

  • The Producer API allows an application to publish a stream of records to one or more Kafka topics.
  • The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
  • The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
  • The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table (a sample connector configuration follows this list).
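
For reference, the Apache Kafka distribution ships a minimal connector configuration of this kind (config/connect-file-source.properties): a file source that streams lines of a text file into a topic. A relational-database connector like the one described above is configured the same way, just with a different connector.class:

```properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test
```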

In Kafka the communication between the clients and the servers is done with a simple, high-performance, language agnostic TCP protocol. This protocol is versioned and maintains backwards compatibility with older versions. We provide a Java client for Kafka, but clients are available in many languages.

Topics and Logs

Let’s first dive into the core abstraction Kafka provides for a stream of records: the topic.

A topic is a category or feed name to which records are published. Topics in Kafka are always multi-subscriber; that is, a topic can have zero, one, or many consumers that subscribe to the data written to it.

For each topic, the Kafka cluster maintains a partitioned log that looks like this:

[Figure: anatomy of a topic. Each partition is an ordered, append-only sequence of records, with new writes appended at the end.]

Each partition is an ordered, immutable sequence of records that is continually appended to: a structured commit log. The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.

The Kafka cluster retains all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka’s performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
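
As a sketch of how the two-day policy above could be set per topic with the Java admin client (broker address and topic name are placeholders, and the fragment assumes an enclosing main that may throw Exception; retention.ms is the per-topic setting, with brokers falling back to a cluster-wide default such as log.retention.hours):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");    // placeholder address

try (Admin admin = Admin.create(props)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "page-views");
    AlterConfigOp twoDays = new AlterConfigOp(
            new ConfigEntry("retention.ms", String.valueOf(2L * 24 * 60 * 60 * 1000)),
            AlterConfigOp.OpType.SET);                // 172800000 ms = two days
    admin.incrementalAlterConfigs(Map.of(topic, List.of(twoDays))).all().get();
}
```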

In fact, the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes. For example a consumer can reset to an older offset to reprocess data from the past or skip ahead to the most recent record and start consuming from “now”.
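
A sketch of that consumer-side control with the Java client (topic, offset, and connection settings are placeholders; the fragment assumes an enclosing main that may throw Exception):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");    // placeholder
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    TopicPartition p0 = new TopicPartition("page-views", 0);
    consumer.assign(List.of(p0));       // manual assignment: no group needed
    consumer.seek(p0, 42L);             // rewind to offset 42 and reprocess
    // consumer.seekToEnd(List.of(p0)); // or skip ahead and consume from "now"
    for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1)))
        System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
}
```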

This combination of features means that Kafka consumers are very cheap: they can come and go without much impact on the cluster or on other consumers. For example, you can use our command line tools to “tail” the contents of any topic without changing what is consumed by any existing consumers.
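
For example, the console consumer that ships with Kafka can tail a topic without affecting any other consumer’s position (broker address and topic name are placeholders):

```sh
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
    --topic page-views --from-beginning
```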

The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second, they act as the unit of parallelism; more on that in a bit.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance.

Each partition has one server which acts as the “leader” and zero or more servers which act as “followers”. The leader handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Each server acts as a leader for some of its partitions and a follower for others so load is well balanced within the cluster.
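
A sketch of how the replication factor is chosen at topic creation with the Java admin client (names and sizing are placeholders; the fragment assumes an enclosing main that may throw Exception):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");    // placeholder

try (Admin admin = Admin.create(props)) {
    // 4 partitions spread over the brokers; each partition kept on 3 servers
    // (one leader plus two followers).
    admin.createTopics(List.of(new NewTopic("page-views", 4, (short) 3))).all().get();
}
```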

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which record to assign to which partition within the topic. This can be done in a round-robin fashion simply to balance load or it can be done according to some semantic partition function (say based on some key in the record). More on the use of partitioning in a second!
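
A sketch of both strategies with the Java producer (broker address, topic, and keys are placeholders; the fragment assumes an enclosing main that may throw Exception):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");    // placeholder
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // Semantic partitioning: the default partitioner hashes the key, so every
    // record keyed "user-42" lands in the same partition, preserving its order.
    producer.send(new ProducerRecord<>("page-views", "user-42", "clicked:home"));
    // No key: records are spread across partitions for load balancing
    // (round-robin or sticky, depending on the client version).
    producer.send(new ProducerRecord<>("page-views", null, "heartbeat"));
}
```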

Consumers

Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances; that is, each instance processes only a share of the records.

If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes; that is, every consumer sees every record.
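
A sketch of a group member with the Java consumer (group and topic names are placeholders): every instance started with the same group.id splits the topic’s partitions among themselves, while an instance started under a different group.id independently receives the full stream.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");    // placeholder
props.put("group.id", "billing");                    // the consumer group label
props.put("key.deserializer", StringDeserializer.class.getName());
props.put("value.deserializer", StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(List.of("page-views"));
    while (true)                                     // poll loop
        for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1)))
            System.out.printf("partition=%d offset=%d value=%s%n",
                    r.partition(), r.offset(), r.value());
}
```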

[Figure] A two server Kafka cluster hosting four partitions (P0-P3) with two consumer groups. Consumer group A has two consumer instances and group B has four.

More commonly, however, we have found that topics have a small number of consumer groups, one for each “logical subscriber”. Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics where the subscriber is a cluster of consumers instead of a single process.

The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a “fair share” of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

Kafka only provides a total order over records within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over records this can be achieved with a topic that has only one partition, though this will mean only one consumer process per consumer group.

Guarantees

At a high level, Kafka gives the following guarantees:

  • Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
  • A consumer instance sees records in the order they are stored in the log.
  • For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

More details on these guarantees are given in the design section of the documentation.

Kafka as a Messaging System

How does Kafka’s notion of streams compare to a traditional enterprise messaging system?

Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each record goes to one of them; in publish-subscribe the record is broadcast to all consumers. Each of these two models has a strength and a weakness. The strength of queuing is that it allows you to divide up the processing of data over multiple consumer instances, which lets you scale your processing. Unfortunately, queues aren’t multi-subscriber: once one process reads the data it’s gone. Publish-subscribe allows you to broadcast data to multiple processes, but has no way of scaling processing since every message goes to every subscriber.

The consumer group concept in Kafka generalizes these two concepts. As with a queue, the consumer group allows you to divide up processing over a collection of processes (the members of the consumer group). As with publish-subscribe, Kafka allows you to broadcast messages to multiple consumer groups.

The advantage of Kafka’s model is that every topic has both these properties (it can scale processing and is also multi-subscriber), so there is no need to choose one or the other.

Kafka has stronger ordering guarantees than a traditional messaging system, too.

A traditional queue retains records in-order on the server, and if multiple consumers consume from the queue then the server hands out records in the order they are stored. However, although the server hands out records in order, the records are delivered asynchronously to consumers, so they may arrive out of order on different consumers. This effectively means the ordering of the records is lost in the presence of parallel consumption. Messaging systems often work around this by having a notion of “exclusive consumer” that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism (the partition) within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.

Kafka as a Storage System

Any message queue that allows publishing messages decoupled from consuming them is effectively acting as a storage system for the in-flight messages. What is different about Kafka is that it is a very good storage system.

Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn’t considered complete until it is fully replicated and guaranteed to persist even if the server written to fails.
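
A sketch of that durability knob in the Java producer (names are placeholders; the fragment assumes an enclosing main that may throw Exception): with acks=all the broker acknowledges only once the write is fully replicated, and blocking on send(...).get() surfaces that acknowledgement to the caller.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");    // placeholder
props.put("acks", "all");                            // wait for all in-sync replicas
props.put("key.serializer", StringSerializer.class.getName());
props.put("value.serializer", StringSerializer.class.getName());

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    RecordMetadata meta = producer
            .send(new ProducerRecord<>("payments", "order-7", "captured"))
            .get();                                  // blocks until replicated
    System.out.printf("stored at partition %d, offset %d%n",
            meta.partition(), meta.offset());
}
```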

The disk structures Kafka uses scale well: Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.

As a result of taking storage seriously and allowing the clients to control their read position, you can think of Kafka as a kind of special purpose distributed filesystem dedicated to high-performance, low-latency commit log storage, replication, and propagation.

Kafka for Stream Processing

It isn’t enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams.

In Kafka a stream processor is anything that takes continual streams of data from input topics, performs some processing on this input, and produces continual streams of data to output topics.

For example, a retail application might take in input streams of sales and shipments, and output a stream of reorders and price adjustments computed off this data.

It is possible to do simple processing directly using the producer and consumer APIs. However, for more complex transformations Kafka provides a fully integrated Streams API. This allows building applications that do non-trivial processing that compute aggregations off of streams or join streams together.
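
A minimal aggregation sketch with the Streams API (application id and topic names are placeholders; the fragment assumes an enclosing main): it reads a keyed stream from one topic, maintains a running count per key in a state store, and streams the updates out to another topic.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counter");      // placeholder
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> orders = builder.stream("orders");   // input topic
KTable<String, Long> counts = orders
        .groupByKey()                                        // e.g. key = user id
        .count();                                            // stateful aggregation
counts.toStream().to("order-counts",
        Produced.with(Serdes.String(), Serdes.Long()));      // output topic

new KafkaStreams(builder.build(), props).start();
```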

This facility helps solve the hard problems this type of application faces: handling out-of-order data, reprocessing input as code changes, performing stateful computations, etc.

The streams API builds on the core primitives Kafka provides: it uses the producer and consumer APIs for input, uses Kafka for stateful storage, and uses the same group mechanism for fault tolerance among the stream processor instances.

Reposted from the Concurrent Programming Network - ifeve.com
