An Introduction to Kafka

Kafka is a distributed, partitioned, replicated commit log service.
What does that mean?
First, let's review some basic messaging terminology:
- Kafka maintains feeds of messages in categories called topics.
- Processes that publish messages to a Kafka topic are called producers.
- Processes that subscribe to topics and process the feed of published messages are called consumers.
- Kafka runs as a cluster of one or more servers, each of which is called a broker.

So, at a high level, producers send messages over the network to the Kafka cluster, and consumers subscribed to the relevant topics process those messages, as shown here:
[Figure: producers publish messages to the Kafka cluster; consumers subscribe and process them]

Communication between the clients and the servers is done with a simple, high-performance, language-agnostic TCP protocol. A Java client for Kafka is provided, but clients are available in many other languages as well.
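For illustration, here is a minimal sketch of publishing a single message with the Java client; the broker address localhost:9092 and the topic name my-topic are assumptions for this example, not part of the original text.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Assumed broker address; point this at your own cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // "my-topic" is a hypothetical topic name for this sketch.
            producer.send(new ProducerRecord<>("my-topic", "hello kafka"));
        }
    }
}
```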


Topics and Logs

Let's first dive into the high-level abstraction Kafka provides: the topic.
A topic is a category or feed name to which messages are published.

For each topic, the Kafka cluster maintains a partitioned log that looks like this:
[Figure: the partitioned log of a topic, with messages appended to each partition in order]

Each partition is an ordered, immutable sequence of messages that is continually appended to, like a commit log.

Each message in a partition is assigned a sequential id number called the offset, which uniquely identifies the message within the partition.

The Kafka cluster retains all published messages, whether or not they have been consumed, for a configurable period of time.

For example, if log retention is set to two days, then for two days after a message is published it is available for consumption, after which it is discarded to free up space.

Kafka's performance is effectively constant with respect to data size, so retaining large amounts of data is not a problem.

In fact, the only metadata retained on a per-consumer basis is that consumer's position in the log, called the "offset".

The offset is controlled by the consumer: normally a consumer advances its offset linearly as it reads messages, but since the position is controlled by the consumer, it can consume messages in any order it likes.
For example, a consumer can reset to an older offset to reprocess messages.
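To make "the offset is controlled by the consumer" concrete, here is a sketch using the Java client that rewinds to the beginning of its assigned partitions and reads messages again; the broker address, group id, and topic name are all assumptions for illustration.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RewindingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "reprocessor");             // hypothetical group name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            consumer.poll(Duration.ofSeconds(5));            // first poll joins the group and gets an assignment
            consumer.seekToBeginning(consumer.assignment()); // the consumer controls its offset: rewind to the start
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```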

This combination of features means that Kafka consumers are very cheap and convenient: they can come and go with little impact on the cluster or on other consumers.
For example, you can use the command line tools to "tail" the contents of any topic without changing what is consumed by any existing consumer.

The partitions in the log serve several purposes.
First, they allow the log to scale beyond a size that fits on a single server. Each individual partition must fit on the server that hosts it, but a topic may have many partitions, so it can handle an arbitrary amount of data.
Second, they act as the unit of parallelism.

Distribution

The partitions of the log are distributed over the servers in the Kafka cluster, with each server handling data and requests for a share of the partitions.
Each partition is replicated across a configurable number of servers for fault tolerance.
Each partition has one server that acts as the "leader" and zero or more servers that act as "followers".
The leader handles all read and write requests for the partition, while the followers replicate the leader's data. If the leader fails, one of the followers takes over as the new leader. Each server acts as the leader for some of its partitions and as a follower for others, so load is well balanced across the cluster.
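As a sketch of how the partition count and replication factor are chosen per topic, a topic can be created with the Java AdminClient; the topic name and the values 3 partitions / replication factor 3 are assumptions for illustration.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Hypothetical topic: 3 partitions, each replicated on 3 brokers
            // (one leader plus two followers per partition).
            NewTopic topic = new NewTopic("my-topic", 3, (short) 3);
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}
```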

Producers

Producers publish data to the topics of their choice. The producer is responsible for choosing which message to assign to which partition within the topic.
This can be done in a simple round-robin fashion to balance load, or according to some semantic partition function (for example, based on a key in the message). More on the use of partitioning below.
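Here is a minimal sketch of the semantic (key-based) case with the Java client: records that share a key are hashed to the same partition by the default partitioner, so they stay in order. The topic name page-views and the key user-42 are hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both records carry the key "user-42", so the default partitioner
            // hashes them to the same partition and they remain in order.
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /pricing"));
            producer.send(new ProducerRecord<>("page-views", "user-42", "viewed /signup"));
        }
    }
}
```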

Consumers

Messaging traditionally has two models: queuing and publish-subscribe.
In a queue, a pool of consumers may read from a server, and each message goes to one of them;
in publish-subscribe, each message is broadcast to all consumers.
Kafka offers a single consumer abstraction that generalizes both: the consumer group.

Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group.
Consumer instances can be in separate processes or on separate machines.

If all the consumer instances have the same consumer group, then this works just like a traditional queue, balancing load over the consumers.
If all the consumer instances have different consumer groups, then this works like publish-subscribe, and all messages are broadcast to all consumers.

More commonly, however, we find that topics have a small number of consumer groups, one for each "logical subscriber". Each group is composed of many consumer instances for scalability and fault tolerance. This is nothing more than publish-subscribe semantics in which the subscriber is a cluster of consumers instead of a single process.
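A sketch of such a consumer-group subscriber with the Java client, reusing the hypothetical names from the earlier examples: run several copies of this process with the same group id and the topic's partitions are divided among them (queue semantics); give each copy its own group id and every copy sees every message (publish-subscribe semantics).

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupSubscriber {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "analytics");               // hypothetical "logical subscriber" name
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```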

Kafka has stronger ordering guarantees than traditional messaging systems.

A traditional queue retains messages in order on the server, and if multiple consumers consume from the same queue, the server hands out messages in the order they are stored. However, even though the server hands out messages in order, they are delivered to consumers asynchronously, so they may arrive out of order at different consumers.



This effectively means the ordering of the messages is lost in the presence of parallel consumption.
Messaging systems often work around this by having a notion of “exclusive consumer” that allows only one process to consume from a queue, but of course this means that there is no parallelism in processing.

Kafka does it better. By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances than partitions.

Kafka only provides a total order over messages within a partition, not between different partitions in a topic. Per-partition ordering combined with the ability to partition data by key is sufficient for most applications. However, if you require a total order over messages this can be achieved with a topic that has only one partition, though this will mean only one consumer process.

Guarantees

At a high-level Kafka gives the following guarantees:
- Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a message M1 is sent by the same producer as a message M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.
- A consumer instance sees messages in the order they are stored in the log.
- For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any messages committed to the log.
More details on these guarantees are given in the design section of the documentation.
Use Cases

Here is a description of a few of the popular use cases for Apache Kafka. For an overview of a number of these areas in action, see this blog post.
Messaging

Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc). In comparison to most messaging systems Kafka has better throughput, built-in partitioning, replication, and fault-tolerance which makes it a good solution for large scale message processing applications.
In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.

In this domain Kafka is comparable to traditional messaging systems such as ActiveMQ or RabbitMQ.

Website Activity Tracking

The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity (page views, searches, or other actions users may take) is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.
Activity tracking is often very high volume as many activity messages are generated for each user page view.

Metrics

Kafka is often used for operational monitoring data. This involves aggregating statistics from distributed applications to produce centralized feeds of operational data.
Log Aggregation

Many people use Kafka as a replacement for a log aggregation solution. Log aggregation typically collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages. This allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. In comparison to log-centric systems like Scribe or Flume, Kafka offers equally good performance, stronger durability guarantees due to replication, and much lower end-to-end latency.
Stream Processing

Many users end up doing stage-wise processing of data where data is consumed from topics of raw data and then aggregated, enriched, or otherwise transformed into new Kafka topics for further consumption. For example a processing flow for article recommendation might crawl article content from RSS feeds and publish it to an “articles” topic; further processing might help normalize or deduplicate this content to a topic of cleaned article content; a final stage might attempt to match this content to users. This creates a graph of real-time data flow out of the individual topics. Storm and Samza are popular frameworks for implementing these kinds of transformations.
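As a minimal sketch of one such stage with the Java client: it consumes from a raw topic, applies a placeholder transform, and produces to a downstream topic. The topic names "articles" and "cleaned-articles" follow the example above, and the broker address, group id, and transform are assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class CleaningStage {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        consumerProps.put("group.id", "article-cleaner");         // hypothetical stage name
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("articles"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Placeholder transform: real normalization/deduplication would go here.
                    String cleaned = record.value().trim();
                    producer.send(new ProducerRecord<>("cleaned-articles", record.key(), cleaned));
                }
            }
        }
    }
}
```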
Event Sourcing

Event sourcing is a style of application design where state changes are logged as a time-ordered sequence of records. Kafka’s support for very large stored log data makes it an excellent backend for an application built in this style.
Commit Log

Kafka can serve as a kind of external commit-log for a distributed system. The log helps replicate data between nodes and acts as a re-syncing mechanism for failed nodes to restore their data. The log compaction feature in Kafka helps support this usage. In this usage Kafka is similar to the Apache BookKeeper project.
