A Brief Introduction to Apache Kafka

In recent years, technologies for building real-time data pipelines and event streaming applications have emerged, also promoting the horizontal scalability and fault tolerance of systems. One of these technologies is Apache Kafka.

Introduction

Apache Kafka is an open-source distributed streaming platform initially developed at LinkedIn and later donated to the Apache Software Foundation. The project, written in Scala and Java, aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. By definition, a streaming platform has 3 key capabilities:

  • Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system
  • Store streams of records in a fault-tolerant, durable way
  • Process streams of records as they occur

In general, Kafka is used to build real-time event streaming applications. To understand how Kafka works, let's look at some basic concepts:

  • Kafka runs as a cluster on one or more servers that can span multiple datacenters
  • the Kafka cluster stores streams of records in categories called topics
  • each record consists of a key, a value, and a timestamp

Kafka has 5 core APIs for interacting with topics (a small sketch of one of them follows the list):

  • Producer API: allows an application to publish a stream of records to one or more topics

  • Consumer API: allows an application to subscribe to one or more topics and process the stream of records produced to them

  • Streams API: allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming input streams into output streams

  • Connector API: allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems (e.g. a connector to a relational database that captures every change to a table)

  • Admin API: allows managing and inspecting topics, brokers and other Kafka objects
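
To make this concrete, here is a minimal sketch of the Admin API in Java, listing the topics of a cluster. The localhost:9092 address is an assumption for a locally running broker:

```java
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class ListTopics {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // localhost:9092 is an assumption; point this at one of your brokers
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // listTopics() returns a future; get() blocks until the broker answers
            Set<String> topics = admin.listTopics().names().get();
            topics.forEach(System.out::println);
        }
    }
}
```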

Topics, Partitions and Offsets

A topic is a stream of records and represents a category or feed name to which records are published at cluster level. It is always multi-subscriber, in the sense that it can have zero, one or more consumers subscribing to and listening to the data written to it. A topic is split into partitions.

Each partition is ordered and messages are consumed in arrival order (FIFO: first in, first out). Each message published to a topic partition gets an incremental id, called an offset.

Other significant information:

  • Offsets only have meaning inside a specific partition of the topic
  • The data written to a partition can't be changed and is kept only for a limited time (topics have a physical structure similar to logs)
  • Order is guaranteed only within a partition

Producers and Consumers

Producers publish data to the topics of their choice. A producer is responsible for choosing which record to assign to which partition within the topic: this can be done in a round-robin fashion or according to a specific partitioning function based on the key of the record.

The key of a record can be of any kind (a string, a number) and must be specified to send the data to a specific partition (the one assigned to that key). If the key is null, the data is sent in a round-robin way, but if the key is not null then all the messages for that key will go to the same partition.
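
As a sketch of this behavior, the following minimal Java producer sends one record with a null key and one with a fixed key; only the latter is guaranteed to always land in the same partition. The topic name user-events and the broker address are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // null key: records are spread across partitions (round-robin style)
            producer.send(new ProducerRecord<>("user-events", null, "anonymous event"));
            // non-null key: every record with key "user-42" goes to the same partition
            producer.send(new ProducerRecord<>("user-events", "user-42", "logged in"));
        } // close() flushes any buffered records
    }
}
```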

Consumers read data from a topic, labeling themselves with a consumer group name. This concept guarantees that each consumer within a group reads from exclusive partitions (i.e. each partition is assigned to at most one consumer in the group). If you have more consumers than partitions, some consumers will be idle.
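
A minimal consumer sketch follows, assuming the same broker and topic as above; the group.id property is what labels the consumer with its consumer group name (my-group here is a made-up example):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group"); // the consumer group name
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events"));
            while (true) {
                // poll() fetches records from the partitions assigned to this consumer
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```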

Consumer offsets are the offset concept for the consuming side. Kafka stores the offset at which a consumer group has arrived in its reading, so that if a consumer within a specific consumer group dies, another consumer in the group can pick up from where the dead consumer left off. Consumer offsets are stored in an internal Kafka topic called __consumer_offsets. The offset must be committed each time a consumer has finished consuming a message: this way, the message is no longer delivered to that consumer group unless the offset is reset to the beginning or to a specific offset.
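
As a sketch of manual offset management (same assumed broker, topic, and group as above): with automatic commits disabled, the consumer commits its position to __consumer_offsets only after processing each batch, so a replacement consumer in the group resumes from the last committed offset.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                // assumption
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // take over commit duty
        // where to start when the group has no committed offset yet
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user-events"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    process(record); // stand-in for real business logic
                }
                consumer.commitSync(); // commit to __consumer_offsets after processing
            }
        }
    }

    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.offset() + ": " + record.value());
    }
}
```

To re-read data, the consumer also exposes seek() and seekToBeginning() on its assigned partitions, which reposition the group at a specific offset or at the start.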

Other consumer information to keep in mind:

  • consumers know which broker to read from and, if one broker fails, they know how to fail over
  • a consumer reads data from a partition in the same order in which the records were published
  • there is no ordering guarantee across two partitions of the same topic

Another important concept on the consuming side is delivery semantics. Kafka provides 3 delivery semantics for consumers (a schematic sketch follows the list):

  • At most once: offsets are committed as soon as the message is received. If the processing goes wrong, the message is lost

  • At least once (the usual choice): offsets are committed after the message is processed. If the processing goes wrong, the message is read again

  • Exactly once: achievable for Kafka-to-Kafka workflows using the Kafka Streams API. For Kafka-to-external-system workflows, use an idempotent consumer
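
In client code, the difference between the first two semantics boils down to where the commit happens relative to the processing. A schematic sketch, not a full program, assuming a consumer configured with enable.auto.commit=false as in the earlier example:

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeliverySemantics {

    // At most once: commit first, then process; a crash during processing loses the batch.
    static void atMostOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        consumer.commitSync();                        // offsets saved before processing
        records.forEach(DeliverySemantics::process);  // a crash here => messages lost
    }

    // At least once: process first, then commit; a crash before the commit replays the batch.
    static void atLeastOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        records.forEach(DeliverySemantics::process);  // a crash here => messages re-read
        consumer.commitSync();                        // offsets saved after processing
    }

    // Stand-in for real business logic.
    static void process(ConsumerRecord<String, String> record) {
        System.out.println(record.value());
    }
}
```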

ZooKeeper, Cluster and Brokers

Up to this point, we have said that Kafka brokers live in a cluster, but how is this cluster managed? Kafka relies on Apache ZooKeeper to manage its brokers. Let's see the main ZooKeeper characteristics:

  • it helps perform leader election for partitions
  • it sends notifications to Kafka in case of changes
  • it must be running before a Kafka server starts
  • by design, it operates with an odd number of servers
  • it has a leader, which handles the writes from the brokers, while the rest of the servers are followers (handling only reads)
  • it doesn't store consumer offsets; as we said previously, those are stored in an internal Kafka topic

A Kafka broker is a server inside a cluster. Each broker is identified by an integer ID, and when you connect to any broker of the cluster, you are connected to the entire cluster. Each broker contains certain topic partitions, i.e. some of the data, but not necessarily all the data of a topic. Brokers are stateless; it is ZooKeeper that maintains the cluster state. A good starting number of brokers for a cluster is 3.

A broker can subscribe in ZooKeeper to the "/brokers/ids" path where all brokers are registered, so that it is notified when other brokers are added or removed. Starting another broker with the same ID produces an error, and the broker won't start. Even though the node representing a broker is gone when the broker is stopped, the broker ID still exists in other data structures; this way, if you completely lose a broker and start a new one with the ID of the old one, it immediately joins the cluster in place of the missing broker, with the same partitions and topics assigned to it. The first broker that starts in the cluster becomes the controller, responsible for electing partition leaders; any subsequent broker that tries to become the controller receives a "node already exists" exception.
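
One way to observe this from code is again the Admin API, which can report the brokers in the cluster and which one is currently the controller; a minimal sketch, assuming a broker reachable at localhost:9092:

```java
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

public class DescribeCluster {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            // every broker in the cluster, with its integer ID
            for (Node node : cluster.nodes().get()) {
                System.out.println("broker id=" + node.id() + " at " + node.host() + ":" + node.port());
            }
            // the broker currently acting as controller
            System.out.println("controller: " + cluster.controller().get().id());
        }
    }
}
```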

Broker Discovery

Every Kafka broker is also called a "bootstrap server": this means you only need to connect to one broker to be connected to the entire cluster, because each broker knows all the information about the others. A list of Kafka servers is passed in the bootstrap-server parameter when instantiating a client, and even though only one broker is needed, the client learns about the other brokers from that single server. Usually you list multiple brokers so that the client can still connect in case of an outage.
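
In client configuration this is just the bootstrap.servers property, which may name several brokers; a small sketch with hypothetical host names:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

public class BootstrapConfig {
    public static Properties consumerProps() {
        Properties props = new Properties();
        // Any one of these is enough to discover the whole cluster; listing several
        // (hypothetical hosts here) only protects the initial connection against an outage.
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "kafka1:9092,kafka2:9092,kafka3:9092");
        return props;
    }
}
```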

Conclusion

Well, these were the very basic notions of Apache Kafka. We've given a short introduction to the main Kafka APIs, we have defined what topics, partitions and offsets are, and we have seen how producing and consuming data on a topic works. Finally, we've seen how the cluster of brokers is managed in Kafka.

This is just the starting point for Apache Kafka: there are many other notions, such as fault tolerance and topic distribution and replication, and many other features you can use with Kafka, such as stream processing and the Connect API. I hope this article gives you a hint to learn Apache Kafka in depth, because it is one of the most interesting technologies we have nowadays and it is widely used in the enterprise world.

Translated from: https://medium.com/javarevisited/a-brief-introduction-to-apache-kafka-25a4ab386f4b
