Reading the Official Kafka Documentation: A Distributed Streaming Platform for Big Data

Latest recommended article published on 2023-12-04 23:22:25

Darren.P

Categories: KAFKA, Big Data. Tags: Kafka, Big Data, Topic, Consumer group, Partition


http://kafka.apache.org/intro

Introduction to Kafka

Capabilities

Publish and subscribe to streams of records, similar to a message queue or enterprise messaging system.
Store streams of records in a fault-tolerant, durable way.
Process streams of records as they occur.

Applications

Building real-time streaming data pipelines that reliably get data between systems or applications.
Building real-time streaming applications that transform or react to the streams of data.

Concepts

Kafka is run as a cluster on one or more servers that can span multiple datacenters.
The Kafka cluster stores streams of records in categories called topics.
Each record consists of a key, a value, and a timestamp.

Core APIs

The Producer API allows an application to publish a stream of records to one or more Kafka topics.
The Consumer API allows an application to subscribe to one or more topics and process the stream of records produced to them.
The Streams API allows an application to act as a stream processor, consuming an input stream from one or more topics and producing an output stream to one or more output topics, effectively transforming the input streams to output streams.
The Connector API allows building and running reusable producers or consumers that connect Kafka topics to existing applications or data systems. For example, a connector to a relational database might capture every change to a table.
What about this last one?

Relationship to other systems

Distributed

Topics and Logs

The topic is Kafka's core abstraction for a stream of records.
A topic can have zero, one, or many consumers that subscribe to the data written to it.
For each topic, the Kafka cluster maintains a partitioned log. (Open question: what is the data structure of the log?)
The records in the partitions are each assigned a sequential id number called the offset that uniquely identifies each record within the partition.
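To make the offset idea concrete, here is a minimal in-memory sketch of a partitioned log. It is a teaching model only and says nothing about Kafka's actual on-disk format; the names `Record`, `PartitionedLog`, `append`, and `read` are hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Record:
    key: str
    value: str
    timestamp: float = field(default_factory=time.time)


class PartitionedLog:
    """Toy model: a topic is a list of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, partition, record):
        """Append to one partition and return the record's offset."""
        log = self.partitions[partition]
        log.append(record)
        return len(log) - 1  # offsets are sequential within a partition

    def read(self, partition, offset):
        return self.partitions[partition][offset]


log = PartitionedLog(num_partitions=2)
first = log.append(0, Record("user-1", "login"))
second = log.append(0, Record("user-1", "click"))
other = log.append(1, Record("user-2", "login"))
```

Note that `other` gets offset 0 even though it was written last: an offset identifies a record only within its own partition.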
The Kafka cluster durably persists all published records, whether or not they have been consumed, using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka’s performance is effectively constant with respect to data size, so storing data for a long time is not a problem.
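The retention rule can be sketched in a few lines. This is a simplified model under stated assumptions: records are plain `(timestamp, value)` pairs and `purge_expired` is a hypothetical helper. A real broker removes whole log segments rather than individual records, driven by settings such as `retention.ms`.

```python
RETENTION_SECONDS = 2 * 24 * 60 * 60  # the two-day policy from the example


def purge_expired(records, now):
    """Keep only (timestamp, value) pairs still inside the retention window.

    Consumption status is deliberately not an input: a record is dropped
    because of its age, never because somebody has already read it.
    """
    return [(ts, value) for ts, value in records if now - ts < RETENTION_SECONDS]


now = 1_000_000.0
records = [
    (now - 3 * 24 * 60 * 60, "three days old"),
    (now - 1 * 24 * 60 * 60, "one day old"),
    (now - 60, "one minute old"),
]
kept = purge_expired(records, now)
```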
The offset is controlled by the consumer. For example, a consumer can reset to an older offset to reprocess data from the past, or skip ahead to the most recent record and start consuming from “now”. A consumer can jump to any offset, and what one consumer reads has no effect on other consumers.
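The following sketch shows why consumers do not interfere with each other: each one holds nothing but its own position in a shared, append-only list. `ToyConsumer` and its methods are hypothetical names for illustration, not the Kafka client API.

```python
class ToyConsumer:
    """Each consumer tracks only its own position in a partition."""

    def __init__(self, partition):
        self.partition = partition  # shared view of the log
        self.position = 0           # offset of the next record to read

    def seek(self, offset):
        self.position = offset

    def seek_to_end(self):
        self.position = len(self.partition)

    def poll(self):
        """Return all records from the current position onward."""
        records = self.partition[self.position:]
        self.position = len(self.partition)
        return records


partition = ["r0", "r1", "r2", "r3"]
a = ToyConsumer(partition)
b = ToyConsumer(partition)

first_pass = a.poll()   # a reads everything written so far
a.seek(1)               # rewind to reprocess from offset 1
replayed = a.poll()
b.seek_to_end()         # b skips history and starts from "now"
partition.append("r4")
b_new = b.poll()
```

Rewinding `a` leaves `b` untouched, and reading never removes anything from the log.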

What partitions are for

The partitions in the log serve two purposes:
1. They allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions, so it can handle an arbitrary amount of data.
2. They act as the unit of parallelism.
Each partition is replicated across a configurable number of servers for fault tolerance. Each partition has one server which acts as the “leader” and zero or more servers which act as “followers”. The producer is responsible for choosing which partition within the topic each record is assigned to.
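As an illustration of producer-side partitioning, the sketch below hashes the record key to pick a partition and falls back to round-robin when there is no key. The function name `choose_partition` is hypothetical, and `zlib.crc32` is only a stand-in: Kafka's Java client uses a different hash function (murmur2), so the partition numbers here will not match a real cluster.

```python
import itertools
import zlib

NUM_PARTITIONS = 3
_round_robin = itertools.cycle(range(NUM_PARTITIONS))


def choose_partition(key):
    """Pick a partition for a record on the producer side."""
    if key is None:
        return next(_round_robin)  # no key: spread load evenly
    # Same key -> same hash -> same partition, so per-key order is kept.
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


p1 = choose_partition("user-42")
p2 = choose_partition("user-42")
keyless = [choose_partition(None) for _ in range(3)]
```

Hashing by key is what lets all records for one key share a partition and therefore an order.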
Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.
If all the consumer instances have the same consumer group, then the records will effectively be load balanced over the consumer instances.
If all the consumer instances have different consumer groups, then each record will be broadcast to all the consumer processes.
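A consumer group can contain multiple consumer instances. Kafka divides the partitions among them so that each instance is the exclusive consumer of a “fair share” of partitions at any point in time; no partition is read by two instances of the same group.
Kafka only provides a total order over records within a partition, not between different partitions in a topic. If you need a total order over all records, use a topic with a single partition, which also means only one consumer instance per consumer group can do the consuming.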
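A rough sketch of how partitions might be shared out inside one consumer group is shown below. It uses plain round-robin and a hypothetical `assign` function; Kafka's real assignment strategies are more involved and also handle instances joining and leaving.

```python
def assign(partitions, consumers):
    """Round-robin partitions over the consumer instances of one group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


four_partitions = ["P0", "P1", "P2", "P3"]
two = assign(four_partitions, ["C1", "C2"])
three = assign(four_partitions, ["C1", "C2", "C3"])
single = assign(["P0"], ["C1", "C2"])
```

In the `single` case the second instance receives nothing, which is the single-partition trade-off described above: total order costs you parallelism within the group.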

Multi-tenancy

Multi-tenancy is enabled by configuring which topics can produce or consume data. There is also operations support for quotas. Administrators can define and enforce quotas on requests to control the broker resources that are used by clients.

Guarantees

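1. Messages sent by a producer to a particular topic partition are appended to the log in the order they are sent.
2. A consumer instance sees records in the order they are stored in the log.
3. For a topic with N replicas, Kafka tolerates up to N-1 server failures.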
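The third guarantee can be pictured with a toy failover function. The sketch assumes every replica is fully caught up, and `elect_leader` is a hypothetical helper that simply picks the first surviving replica; it is not how a real cluster elects leaders.

```python
def elect_leader(replicas, failed):
    """Pick the first surviving replica as leader, or None if all are down."""
    alive = [b for b in replicas if b not in failed]
    return alive[0] if alive else None


replicas = ["broker-1", "broker-2", "broker-3"]  # N = 3 replicas

leader_after_two = elect_leader(replicas, {"broker-1", "broker-3"})  # N-1 failures
leader_after_three = elect_leader(replicas, set(replicas))           # N failures
```

With N-1 brokers down one copy of the partition is still available; only when all N fail is there nothing left to serve it.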

Kafka in practice

Kafka as a messaging system

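Messaging traditionally has two models:
Queuing: a pool of consumers reads from a server, and each record goes to one of them. The strength is that processing can be divided over multiple consumers; the weakness is that a queue is not multi-subscriber, so once one consumer reads a record it is gone.
Publish-subscribe (the observer pattern): each record is broadcast to all consumers. The strength is that data reaches multiple consumers; the weakness is that processing cannot be divided up, since every record goes to every subscriber.
Kafka's consumer group addresses both problems. As with a queue, processing can be divided among the consumer instances within a group; as with publish-subscribe, records are broadcast to multiple consumer groups.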
The advantage of Kafka’s model is that every topic has both these properties—it can scale processing and is also multi-subscriber—there is no need to choose one or the other.
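A traditional queue cannot guarantee the order in which records are processed when multiple consumers read in parallel. The usual workaround is an exclusive consumer that allows only one process to consume from the queue, which also means no parallelism.
By having a notion of parallelism (the partition) within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. Specifically, each partition is consumed by exactly one consumer instance in the consumer group, which guarantees that the records in that partition are consumed in order.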
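The combination of both models can be sketched as a small routing function. This is an illustrative model only: `deliver` is a hypothetical name, and partition ownership is decided by a simple modulo rather than by Kafka's group coordination.

```python
from collections import defaultdict


def deliver(records, groups):
    """Route (partition, value) records to consumer instances.

    `groups` maps a group name to its list of instance names. Every group
    receives every record (publish-subscribe); inside a group, a partition
    belongs to exactly one instance (queue-style load balancing).
    """
    received = defaultdict(list)
    for partition, value in records:
        for group, instances in groups.items():
            owner = instances[partition % len(instances)]
            received[(group, owner)].append(value)
    return dict(received)


records = [(0, "a0"), (1, "b0"), (0, "a1"), (1, "b1")]
groups = {"billing": ["x1", "x2"], "audit": ["y1"]}
inbox = deliver(records, groups)
```

The `billing` group splits the work between two instances while each still sees its partition in order, and the `audit` group independently receives every record.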

Kafka as a storage system

Data written to Kafka is written to disk and replicated for fault-tolerance. Kafka allows producers to wait on acknowledgement so that a write isn’t considered complete until it is fully replicated and guaranteed to persist even if the server written to fails. In other words, a producer's write only counts as complete once replication has finished. The disk structures Kafka uses scale well: Kafka will perform the same whether you have 50 KB or 50 TB of persistent data on the server.
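The idea of waiting on acknowledgement can be modelled as follows. This is a deliberately simplified sketch: `replicated_write` is a hypothetical function that writes to every replica itself, whereas in a real cluster followers copy data from the leader and the producer controls the waiting through its `acks` setting.

```python
def replicated_write(record, replicas, acks_required):
    """Append to each reachable replica; succeed once enough have acked.

    `replicas` maps a broker name to (log, is_up). A down broker cannot ack.
    """
    acked = 0
    for log, is_up in replicas.values():
        if is_up:
            log.append(record)
            acked += 1
    return acked >= acks_required


healthy = {"b1": ([], True), "b2": ([], True), "b3": ([], True)}
degraded = {"b1": ([], True), "b2": ([], False), "b3": ([], True)}

ok = replicated_write("order-1", healthy, acks_required=3)
not_ok = replicated_write("order-2", degraded, acks_required=3)
```

Requiring every replica to acknowledge trades latency and availability for durability: the second write is reported as incomplete because one copy is missing.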

Kafka for stream processing

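It isn’t enough to just read, write, and store streams of data; the purpose is to enable real-time processing of streams. In Kafka, a stream processor takes continual streams of data from input topics, processes them, and produces continual streams of data to output topics. More complex stream operations can use the Streams API, for example to compute aggregations off of streams or join streams together.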
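The shape of a stream aggregation can be shown with a plain Python generator. This is not the Streams API (which is a Java library); `running_counts` and the two list-based "topics" are assumptions made for illustration.

```python
from collections import Counter


def running_counts(input_stream):
    """Consume words one at a time and emit an updated count per word."""
    counts = Counter()
    for word in input_stream:
        counts[word] += 1
        yield word, counts[word]  # one output record per input record


input_topic = ["kafka", "stream", "kafka", "kafka"]
output_topic = list(running_counts(input_topic))
```

Each input record produces an output record immediately, rather than waiting for a batch to finish, which is the difference between stream processing and batch processing.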

Putting the Pieces Together

This combination of messaging, storage, and stream processing may seem unusual but it is essential to Kafka’s role as a streaming platform.
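A distributed file system such as HDFS stores static files for batch processing, so it works on historical data. A traditional messaging system processes messages as they arrive, so it works on future messages.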
By combining storage and low-latency subscriptions, streaming applications can treat both past and future data the same way.
