Kinesis vs Kafka


Backstory

Flo needs to understand how users interact with the app and which features they use most frequently. Based on this information, we decide how to improve our product. We gather it from different parts of our system in the form of analytics events, which are stored in a message broker while they await processing. For the past two years, we’ve used AWS Kinesis as our internal message broker. However, the more we’ve worked with it, the more pitfalls we’ve found. At some point, we realized we should consider alternative message brokers without the drawbacks of Kinesis. The most promising replacement candidate for us was Kafka. During our investigation, one question arose: is Kafka actually better than Kinesis from a latency/throughput perspective? We decided to find out through benchmarks.

Benchmark framework

We developed a small benchmark framework based on the Akka Streams library. Why Akka Streams? For two reasons. First, we were already using it in our backend services. Second, Akka Streams has integrations with both Kinesis and Kafka (via the Alpakka library). The library versions are as follows:

  • akka-streams — 2.5.31
  • akka-stream-kafka — 2.0.0
  • akka-stream-alpakka-kinesis — 2.0.0

How do we simulate a realistic event stream? Instead of generating random synthetic events, we took 100,000 real events in JSON format from our production environment. Every event producer (Kinesis or Kafka) samples with replacement from this 100,000-event pool, which gives us a realistic, infinite event stream.

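
Sampling with replacement from a finite pool can be sketched in a few lines. This is only an illustration of the technique (the real benchmark used Akka Streams in Scala, and a pool of 100,000 production events; the tiny pool below is hypothetical):

```python
import json
import random

# A hypothetical stand-in for the production pool of 100,000 real JSON events.
event_pool = [
    {"name": "screen_opened", "user_id": 1},
    {"name": "button_tapped", "user_id": 2},
    {"name": "article_read", "user_id": 3},
]

def infinite_event_stream(pool, rng=random.Random(42)):
    """Sampling with replacement turns a finite pool into an endless stream."""
    while True:
        yield json.dumps(rng.choice(pool))

stream = infinite_event_stream(event_pool)
first_three = [next(stream) for _ in range(3)]
parsed = [json.loads(e) for e in first_three]
```

Because sampling is with replacement, the stream never exhausts the pool, and the event mix stays representative of production traffic.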
Each event is marked with a timestamp when it’s selected from the event pool, the so-called ‘created_at’ field. We can interpret this timestamp as the moment of event creation. Both Kinesis and Kafka record an internal ‘stored_at’ timestamp, which indicates the moment when our event is successfully stored. This timestamp is used as the event storing time. Finally, when we consume an event, we take the current timestamp, which is the event receiving time. In addition, we record the size in bytes of each event. All of this data is used to calculate latency (write, read, and total) and throughput (write and read) metrics.

A short remark about Kafka configuration. We used the following settings to configure Kafka topics used for benchmarking:

  • SSL is on
  • Replication factor is set to 3 (emulating reliability)
  • Acknowledgement during writes to Kafka is set to “all” (the producer waits until the write is acknowledged by all in-sync replicas). This also contributes to reliability and fault tolerance.
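
As a rough illustration, the settings above map onto client configuration roughly like this. The parameter names follow kafka-python conventions, which is our assumption for illustration only (the benchmark itself used akka-stream-kafka), and the broker host is hypothetical:

```python
# Reliability-oriented Kafka settings, expressed as kafka-python-style
# producer parameters (illustrative mapping; the benchmark used Akka/Alpakka).
producer_config = {
    "bootstrap_servers": ["kafka-broker:9093"],  # hypothetical host
    "security_protocol": "SSL",                  # SSL is on
    "acks": "all",                               # wait for all in-sync replicas
}

# Topic-level setting: replication factor 3 (emulating reliability).
topic_replication_factor = 3
```
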

Calculating metrics

Let’s first introduce some notation:

tc_n – timestamp when the n-th event was created (milliseconds)

ts_n – timestamp when the n-th event was stored (milliseconds)

tr_n – timestamp when the n-th event was received (milliseconds)

s_n – n-th event size (bytes)

M – sliding window size, used for read and write throughput calculation. The bigger the value we use for this parameter, the smoother the throughput metric will be. For this benchmark, we set this parameter to 10,000.

Now we can describe which metrics we’re going to calculate and how.

Write latency (milliseconds):

latency_write(n) = ts_n − tc_n

Read latency (milliseconds):

latency_read(n) = tr_n − ts_n

Total latency (milliseconds):

latency_total(n) = tr_n − tc_n

Write throughput (MB per second):

throughput_write(n) = (s_{n−M+1} + … + s_n) / (ts_n − ts_{n−M+1}), with bytes per millisecond converted to MB per second

Read throughput (MB per second):

throughput_read(n) = (s_{n−M+1} + … + s_n) / (tr_n − tr_{n−M+1}), with bytes per millisecond converted to MB per second
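
The metric formulas in the original post were embedded as images; a small Python sketch of how these latency and throughput metrics can be computed from the recorded timestamps (our reconstruction from the definitions above, using a toy window size M = 3 instead of 10,000):

```python
# Each recorded event carries (tc, ts, tr, s): created/stored/received
# timestamps in milliseconds and size in bytes.
M = 3  # sliding window size; the benchmark used 10,000

events = [
    # (tc_n, ts_n, tr_n, s_n)
    (0,  5,  20, 1_000_000),
    (10, 14, 30, 2_000_000),
    (20, 26, 45, 1_500_000),
    (30, 33, 50, 2_500_000),
]

def write_latency(n):
    tc, ts, tr, s = events[n]
    return ts - tc

def read_latency(n):
    tc, ts, tr, s = events[n]
    return tr - ts

def total_latency(n):
    tc, ts, tr, s = events[n]
    return tr - tc

def write_throughput_mb_s(n):
    """Bytes in the last M events over the stored-timestamp span, as MB/s.
    Only defined for n >= M - 1."""
    window = events[n - M + 1 : n + 1]
    total_bytes = sum(e[3] for e in window)
    elapsed_ms = window[-1][1] - window[0][1]
    return total_bytes / elapsed_ms * 1000 / 2**20
```

A larger M averages over more events, which is why increasing it smooths the throughput curve.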

Benchmarking strategy

We developed 3 main test cases:

  • Default config. We take the consumers and producers for both message brokers and use them without any specific setup (all parameters are left at their defaults). This case simulates a developer using the libraries as-is, without any configuration.
  • Latency first. We try to configure the parameters of both message brokers to improve latency.
  • Throughput first. We try to configure the parameters of both message brokers to improve throughput.

To adjust between the “Latency first” and “Throughput first” cases, we use the following settings.

Kafka producer:

  • batch.size (default is 16384) — The producer will attempt to batch records together into fewer requests whenever multiple records are being sent to the same partition. This configuration controls the default batch size in bytes. Increasing this setting improves throughput at the cost of higher latency.
  • linger.ms (default is 0) — The maximum time to buffer data before sending. For example, instead of sending immediately, you can set linger.ms to 5 and send more messages in one batch. Increasing this setting improves throughput at the cost of higher latency.

Kafka consumer:

  • fetch.min.bytes (default is 1) — The minimum amount of data that the server should return for a fetch request. If the data is insufficient, the request will wait for that much data to accumulate before answering. Increasing this setting improves throughput at the cost of higher latency.
  • fetch.max.wait.ms (default is 500) — The maximum amount of time the server will block before answering the fetch request if there isn’t sufficient data to satisfy fetch.min.bytes. Increasing this setting improves throughput at the cost of higher latency.
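
The two Kafka tuning directions can be summarized as configuration profiles. The snake_case parameter names below follow kafka-python conventions as an illustrative assumption (the benchmark itself used akka-stream-kafka); the values are the ones the “Latency first” and “Throughput first” cases use:

```python
# Latency first: small batches, minimal buffering on both sides.
latency_first = {
    "producer": {"batch_size": 8192, "linger_ms": 0},
    "consumer": {"fetch_min_bytes": 1, "fetch_max_wait_ms": 100},
}

# Throughput first: large batches, generous buffering on both sides.
throughput_first = {
    "producer": {"batch_size": 262144, "linger_ms": 5000},
    "consumer": {"fetch_min_bytes": 262144, "fetch_max_wait_ms": 5000},
}
```
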

Kinesis producer:

  • putRecords() maxBatchSize (default is 500, maximum is 500) — The number of records that will be sent to Kinesis as one batch. Decreasing this setting improves latency at the cost of throughput.

Kinesis consumer:

  • getRecords() maxBatchSize (default is 10,000, maximum is 10,000) — The number of records that will be read from Kinesis as one batch. Decreasing this setting improves latency at the cost of throughput.
  • idleTimeBetweenReadsInMillis (default is 1000) — The delay in milliseconds between getRecords() calls. Decreasing this setting improves latency at the cost of throughput.
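
The producer-side knob boils down to how events are grouped before each putRecords() call. A sketch of the chunking logic (illustrative only; the real benchmark used akka-stream-alpakka-kinesis, and the record shapes below are hypothetical):

```python
def chunk(records, max_batch_size):
    """Split records into putRecords()-sized batches."""
    for i in range(0, len(records), max_batch_size):
        yield records[i : i + max_batch_size]

# Hypothetical Kinesis-style records.
records = [{"Data": b"event-%d" % i, "PartitionKey": str(i)} for i in range(1200)]

# Throughput first: maxBatchSize=500 gives a few large calls
# (here 1200 records -> batches of 500 + 500 + 200).
batches = list(chunk(records, 500))

# Latency first: maxBatchSize=1 gives one call per record, so each event
# is shipped immediately instead of waiting for a batch to fill.
single_batches = list(chunk(records, 1))
```
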

Based on our Kafka cluster cost, we derived Kinesis configurations that cost the same amount of money: either 9 shards with a 24-hour retention policy or 4 shards with a 7-day retention policy. Therefore, we use two configurations of the Kinesis stream: 4 shards and 9 shards. To equalize parallelization levels, we also use two configurations for the Kafka topic: 4 partitions and 9 partitions. Putting it all together (3 test cases × 2 brokers × 2 parallelism levels), we have 12 test cases in total. In each case, we gather metrics over 500,000 events.

Default config case

Kafka producer configuration:

  • batch.size — 16384
  • linger.ms — 0

Kafka consumer configuration:

  • fetch.min.bytes — 1
  • fetch.max.wait.ms — 500

Kinesis producer configuration:

  • putRecords() maxBatchSize — 500

Kinesis consumer configuration:

  • getRecords() maxBatchSize — 10000
  • idleTimeBetweenReadsInMillis — 1000

Kinesis 9 shards stream

Detailed results (benchmark plots)

[Image: Summary for 9-shards Kinesis stream]

Kinesis 4 shards stream

Detailed results (benchmark plots)

[Image: Summary for 4-shards Kinesis stream]

Kafka 9 partitions topic

Detailed results (benchmark plots)

[Image: Summary for 9-partitions Kafka stream]

Kafka 4 partitions topic

Detailed results (benchmark plots)

[Image: Summary for 4-partitions Kafka stream]

Default config case comparison

[Image: Summary for default config case]

Latency first case

With Kafka, the default settings are already latency-oriented. As such, we weren’t able to improve latency significantly.

Kafka producer configuration:

  • batch.size — 8192
  • linger.ms — 0

Kafka consumer configuration:

  • fetch.min.bytes — 1
  • fetch.max.wait.ms — 100

Kinesis producer configuration:

  • putRecords() maxBatchSize — 1

Kinesis consumer configuration:

  • getRecords() maxBatchSize — 1000
  • idleTimeBetweenReadsInMillis — 1

Kinesis 9 shards stream

Detailed results (benchmark plots)

[Image: Summary for 9-shards Kinesis stream]

Kinesis 4 shards stream

Detailed results (benchmark plots)

[Image: Summary for 4-shards Kinesis stream]

Kafka 9 partitions topic

Detailed results (benchmark plots)

[Image: Summary for 9-partitions Kafka stream]

Kafka 4 partitions topic

Detailed results (benchmark plots)

[Image: Summary for 4-partitions Kafka stream]

Latency first case comparison

[Image: Summary for latency first case]

Throughput first case

Kafka producer configuration:

  • batch.size — 262144
  • linger.ms — 5000

Kafka consumer configuration:

  • fetch.min.bytes — 262144
  • fetch.max.wait.ms — 5000

Kinesis producer configuration:

  • putRecords() maxBatchSize — 500

Kinesis consumer configuration:

  • getRecords() maxBatchSize — 10000
  • idleTimeBetweenReadsInMillis — 2000

Kinesis 9 shards stream

Detailed results (benchmark plots)

[Image: Summary for 9-shards Kinesis stream]

Kinesis 4 shards stream

Detailed results (benchmark plots)

[Image: Summary for 4-shards Kinesis stream]

Kafka 9 partitions topic

Detailed results (benchmark plots)

[Image: Summary for 9-partitions Kafka stream]

Kafka 4 partitions topic

Detailed results (benchmark plots)

[Image: Summary for 4-partitions Kafka stream]

Throughput first case comparison

[Image: Summary for throughput first case]

Final results

[Image: Final results summary]

Kafka beats Kinesis in every test case on every metric. Kafka is also more flexible in adjusting between latency and throughput. By contrast, almost the only way to adjust latency and throughput for Kinesis is to change the shard count (which is quite expensive). So the winner is Kafka.

Translated from: https://medium.com/flo-engineering/kinesis-vs-kafka-6709c968813
