Creating a Streaming Data Pipeline with Kafka Streams

This article describes how to build a streaming data pipeline with Kafka Streams. Kafka Streams lets developers process and transform real-time data streams to implement complex data-processing tasks.


What is a streaming topology?

A topology is a directed acyclic graph (DAG) of stream processors (nodes) connected by streams (edges). Two key features of a DAG are that it is finite and contains no cycles. Creating a streaming topology allows data processors to be small, focused microservices that can be easily distributed and scaled and can execute their work in parallel.

Why use Kafka Streams?

Kafka Streams is an API developed by Confluent for building streaming applications that consume Kafka topics, analyze, transform, or enrich the input data, and then send the results to another Kafka topic. It lets you do this with concise code in a way that is distributed and fault-tolerant. Kafka Streams defines a processor topology as a logical abstraction for your stream processing code.

Key concepts of Kafka Streams

  • A stream is an unbounded, continuously updating data set, consisting of an ordered, replayable, and fault-tolerant sequence of key-value pairs.

  • A stream processor is a node in the topology that receives one input record at a time from its upstream processors, applies its operation to it, and can optionally produce one or more output records for its downstream processors.

  • A source processor is a processor that does not have any upstream processors.

  • A sink processor is a processor that does not have any downstream processors.

Getting Started

For this tutorial, I will be using the Java APIs for Kafka and Kafka Streams. I'm going to assume a basic understanding of using Maven to build a Java project, a rudimentary familiarity with Kafka, and that a Kafka instance has already been set up. Lenses.io provides a quick and easy containerized solution for setting up a Kafka instance here.

To get started, we need to add kafka-clients and kafka-streams as dependencies to the project pom.xml:

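The original dependency snippet did not survive extraction; a minimal sketch of the pom.xml entries, where the version number is illustrative and should match your cluster's Kafka version:

```xml
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.5.0</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-streams</artifactId>
    <version>2.5.0</version>
</dependency>
```
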
Building a Streaming Topology

One or more input, intermediate, and output topics are needed for the streaming topology. Information for creating new Kafka topics can be found here. Once we have created the requisite topics, we can create a streaming topology. Here is an example of creating a topology for an input topic, where the value is serialized as JSON (serialized/deserialized by GSON).

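The embedded gist was lost in extraction. A minimal sketch of what such a topology might look like, assuming placeholder topic names (input-topic, output-topic) and plain String serdes standing in for the article's GSON-backed serde:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class SimpleTopology {
    public static void main(String[] args) {
        // Configuration used when the topology is eventually executed.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "simple-topology");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Source processor: read JSON strings from the input topic. A
        // Gson-backed serde would replace the String serde once a message
        // class is defined.
        KStream<String, String> input = builder.stream("input-topic");
        // Sink processor: forward every record, untouched, to the output topic.
        input.to("output-topic");

        Topology topology = builder.build();
        System.out.println(topology.describe());

        // To execute against a live cluster:
        // KafkaStreams streams = new KafkaStreams(topology, props);
        // streams.start();
        // Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Printing topology.describe() is a convenient way to sanity-check the wiring (source, processors, sink) before starting the application.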
Simple example of streaming topology

The above example is a very simple streaming topology, but at this point it doesn't really do anything. It is important to note that the topology is executed and persisted by the application running the previous code snippet; the topology does not run inside the Kafka brokers. All topology processing overhead is paid by the creating application.

A running topology can be stopped by executing:

streams.close();

To make this topology more useful, we need to define rule-based branches (or edges). In the next example, we create a basic topology with 3 branches, based on the values of a specific field in the JSON message payload.

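The branching gist was also lost in extraction. A sketch under stated assumptions: the routing field is hypothetically named "type" (the article's actual field name was not preserved), topic names are placeholders, and GSON is used to inspect the JSON payload, as in the article. This uses the Predicate-array form of KStream#branch that was current when the article was written:

```java
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.KStream;

public class BranchingTopology {
    // Hypothetical routing field; substitute the field your payload actually uses.
    private static String eventType(String json) {
        JsonObject obj = JsonParser.parseString(json).getAsJsonObject();
        return obj.has("type") ? obj.get("type").getAsString() : "";
    }

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");

        // Each record is routed to the first predicate that matches, so the
        // final catch-all branch collects everything the others reject.
        @SuppressWarnings("unchecked")
        KStream<String, String>[] branches = input.branch(
            (key, value) -> "create".equals(eventType(value)),
            (key, value) -> "update".equals(eventType(value)),
            (key, value) -> true);

        branches[0].to("create-topic");
        branches[1].to("update-topic");
        branches[2].to("other-topic");

        Topology topology = builder.build();
        System.out.println(topology.describe());
    }
}
```
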
Streaming topology with multiple branches
The topology we just created would look like the following graph:

[Image: graph of the three-branch topology]

Downstream consumers for the branches in the previous example can consume the branch topics exactly the same way as any other Kafka topic. The downstream processors may produce their own output topics. Therefore, it may be useful to combine the results from downstream processors with the original input topic. We can also use the Kafka Streams API to define rules for joining the resulting output topics into a single stream.

Crossing the Streams

Kafka Streams models its stream-joining functionality on SQL joins. There are three kinds of joins:

  • inner join: emits an output when both input topics have records with the same key.

  • left join: emits an output for each record in the left or primary input topic. If the other topic does not have a value for a given key, it is set to null.

  • outer join: emits an output for each record in either input topic. If only one source contains a key, the other is null.

For our example, we are joining together a stream of input records and the results from downstream processors. In this case, it makes the most sense to perform a left join with the input topic being considered the primary topic. This will ensure the joined stream always outputs the original input records, even if there are no processor results available.

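The join snippet was another lost embed. A minimal sketch of a windowed left join, assuming placeholder topic names, String serdes, an illustrative 30-second join window, and a naive string-concatenating ValueJoiner (the article's actual combining logic was not preserved):

```java
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;

public class JoiningTopology {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, String> input = builder.stream("input-topic");
        KStream<String, String> results = builder.stream("results-topic");

        // Left join: every input record is emitted; the processor result is
        // attached when one arrives within the window, and is null otherwise.
        KStream<String, String> joined = input.leftJoin(
            results,
            (inputValue, resultValue) ->
                "{\"input\":" + inputValue + ",\"result\":" + resultValue + "}",
            JoinWindows.of(Duration.ofSeconds(30)),
            StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        joined.to("joined-topic");

        Topology topology = builder.build();
        System.out.println(topology.describe());
    }
}
```

Because the join is windowed, Kafka Streams materializes both sides into local state stores; the window size bounds how long an input record waits for a matching result.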
The final overall topology will look like the following graph:

[Image: graph of the final joined topology]

It is programmatically possible to have the same service create and execute both streaming topologies, but I avoided doing this in the example to keep the graph acyclic.

Translated from: https://itnext.io/creating-a-streaming-data-pipeline-with-kafka-streams-898fb352a7b7
