Building a Real-Time Data Streaming Application with Apache Kafka

Written by Alexander Nnakwue ✏️

Introduction

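Most large tech companies collect data from their users in various ways, and most of the time this data arrives in raw form. Once it is in an intelligible and usable format, data can help drive business needs. The challenge is to process the data and, where necessary, transform or clean it so that it makes sense.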

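A basic data streaming application moves data from a source bucket to a destination bucket. More complex applications that involve streams perform some work on the fly, such as altering the structure of the output data or enriching it with new attributes or fields.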
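To make the idea of transforming data "on the fly" concrete, here is a minimal sketch in plain Node.js with no Kafka involved. The record shape (`id`, `name`, `signupTs`) and the `enrich` function are assumptions made up for illustration; they are not part of the application we build later.

```javascript
// Toy "source bucket": raw records as they might arrive (hypothetical shape).
const source = [
  { id: 1, name: ' Ada ', signupTs: 0 },
  { id: 2, name: 'Linus', signupTs: 86400000 },
];

// Reshape and enrich a single record as it passes through.
function enrich(record) {
  return {
    userId: record.id,            // renamed field (structure change)
    name: record.name.trim(),     // cleaned value
    signupDate: new Date(record.signupTs).toISOString().slice(0, 10),
    source: 'demo',               // new attribute (enrichment)
  };
}

// Lazily move records from source to destination, one at a time.
function* stream(records, transform) {
  for (const record of records) {
    yield transform(record);
  }
}

// Toy "destination bucket".
const destination = [];
for (const out of stream(source, enrich)) {
  destination.push(out);
}

console.log(destination);
```

The generator keeps the example lazy: each record is transformed only when the destination pulls it, which is roughly the shape a streaming pipeline takes.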

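In this post, we will learn how to build a minimal real-time data streaming application using Apache Kafka. The post will also cover the following:

Kafka and ZooKeeper as our tools
Batch data processing and storage
Installing and running Kafka locally
Bootstrapping our application
Installing dependencies
Creating a Kafka topic
Producing to the created topic
Consuming from a topic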

According to its website, Kafka is an open-source, highly distributed streaming platform. Built by engineers at LinkedIn (and now part of the Apache Software Foundation), it prides itself on being a reliable, resilient, and scalable system that supports streaming events and applications. It is horizontally scalable, fault-tolerant by default, and offers high speed.

Kafka has a variety of use cases, one of which is building data pipelines or applications that handle streaming events and/or process batch data in real time.

Using Apache Kafka, we will look at how to build a data pipeline to move batch data. As a small demo, we will simulate a large JSON data store generated at a source.

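Afterward, we will write a producer script that produces (writes) this JSON data from a source at, say, point A to a particular topic on our local Kafka broker/cluster setup. Finally, we will write a consumer script that consumes the stored data from the specified Kafka topic.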
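Before any Kafka client is involved, the "simulated JSON data store" part of this plan can be sketched in a few lines. This is only an illustration under assumptions: the record fields (`id`, `event`), the batch size, and the helper names are hypothetical, and the sketch stops at the serialized strings a producer would later send.

```javascript
// Build `count` fake records; the shape is hypothetical.
function generateRecords(count) {
  return Array.from({ length: count }, (_, i) => ({
    id: i + 1,
    event: i % 2 === 0 ? 'click' : 'view',
  }));
}

// Split the records into batches of at most `size` items.
function toBatches(records, size) {
  const batches = [];
  for (let i = 0; i < records.length; i += size) {
    batches.push(records.slice(i, i + size));
  }
  return batches;
}

// Serialize each record into the string payload a producer would send.
function toMessages(batch) {
  return batch.map((record) => JSON.stringify(record));
}

const records = generateRecords(10);
const batches = toBatches(records, 4);

console.log(batches.length);          // 3 (batches of 4, 4, and 2 records)
console.log(toMessages(batches[2]));  // the last, smaller batch as JSON strings
```

Serializing to JSON strings up front mirrors the fact that Kafka itself stores opaque bytes; the structure only exists on the producer and consumer sides.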

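Note: Data transformation and/or enrichment is mostly handled as the data is consumed from an input topic, to be used by another application or written to an output topic. This is a very common scenario in data engineering, as there is always a need to clean up, transform, aggregate, or even reprocess the usually raw and temporarily stored data in a Kafka topic so that it conforms to a particular standard or format.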

Prerequisites

To follow along with this tutorial, you will need:

The latest versions of Node.js and npm installed on your machine
The latest Java version (JVM) installed on your machine
Kafka installed on your local machine. In this tutorial, we will be running through installing Kafka locally on our machines
A basic understanding of writing Node.js applications

Before we move on, however, let's review some basic concepts and terminology about Kafka so that we can easily follow along with this tutorial.

ZooKeeper

Kafka is highly dependent on ZooKeeper, which is the service it uses to keep track of its cluster state. ZooKeeper helps coordinate the synchronization and configuration of Kafka brokers (servers), which includes electing the appropriate leaders. For more detailed information on ZooKeeper, you can check its awesome documentation.

Topic

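A Kafka topic is a group of partitions spread across multiple Kafka brokers. To put it more clearly, a topic acts as an intermittent storage mechanism for streamed data in the cluster. For each Kafka topic, we can choose to set the replication factor and other parameters, such as the number of partitions.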
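As a rough mental model of how records with the same key end up in the same partition of a topic, consider the sketch below. It is an illustration only: the `topic` object is a hypothetical stand-in for real topic settings, and `hashKey` is a simple string hash chosen for readability. Real Kafka clients use their own partitioning logic and hash functions.

```javascript
// Hypothetical topic settings; not a real client API.
const topic = { name: 'demo-topic', partitions: 3, replicationFactor: 1 };

// Simple deterministic string hash (illustrative only, not Kafka's).
function hashKey(key) {
  let hash = 0;
  for (const ch of key) {
    // Keep the value in unsigned 32-bit range after every step.
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0;
  }
  return hash;
}

// Map a key to one of the topic's partitions.
function partitionFor(key, partitionCount) {
  return hashKey(key) % partitionCount;
}

// Group a few keys by the partition they would land in.
const assignments = {};
for (const key of ['user-1', 'user-2', 'user-3', 'user-1']) {
  const p = partitionFor(key, topic.partitions);
  (assignments[p] = assignments[p] || []).push(key);
}

console.log(assignments);
```

The point to take away is that the mapping is deterministic: both occurrences of `user-1` land in the same partition, which is what preserves per-key ordering.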

Producers, consumers, and clusters

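A producer is a client that produces or writes data to Kafka brokers or, more precisely, to Kafka topics. A consumer, on the other hand, reads data or, as the name implies, consumes data from Kafka topics or Kafka brokers. A cluster is simply a group of brokers or servers that powers a running Kafka instance.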

Figure 1: Showing the relationship between a producer, cluster, and consumer in Kafka.

For more detailed information on all these vital concepts, you can check this section of the Apache Kafka documentation.
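The relationship between these three pieces can be mimicked with a toy, in-memory model. This is a sketch under heavy assumptions: `ToyBroker` is a made-up class, each topic has a single partition, and nothing here talks to a real Kafka cluster. It only shows that producers append to a log and consumers read from an offset without removing anything.

```javascript
// A toy stand-in for a broker: one append-only log per topic.
class ToyBroker {
  constructor() {
    this.topics = new Map(); // topic name -> array of messages
  }

  // Producer side: append a message and return its offset.
  produce(topicName, message) {
    if (!this.topics.has(topicName)) {
      this.topics.set(topicName, []);
    }
    const log = this.topics.get(topicName);
    log.push(message);
    return log.length - 1;
  }

  // Consumer side: read everything from `fromOffset` onward.
  // Reading does not delete messages from the log.
  consume(topicName, fromOffset = 0) {
    const log = this.topics.get(topicName) || [];
    return log.slice(fromOffset);
  }
}

const broker = new ToyBroker();
broker.produce('demo-topic', 'first');
broker.produce('demo-topic', 'second');
const lastOffset = broker.produce('demo-topic', 'third');

console.log(lastOffset);                      // 2
console.log(broker.consume('demo-topic', 1)); // [ 'second', 'third' ]
```

Because consuming is just reading from an offset, two consumers can read the same topic independently, each tracking its own position.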

Installing Kafka

To install Kafka, all we have to do is download the binaries here and extract the archive. We do so by running the following commands in our terminal or command prompt:

cd <location-of-downloaded-kafka-binary>
tar -xzf <downloaded-kafka-binary>
cd <name-of_kafka-binary>

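The tar command extracts the downloaded Kafka binary. After that, we navigate to the directory where Kafka is installed. We will see all the files shown below: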

Figure 2: A screenshot of the installed Kafka folder structure with the files.

Note: The Kafka binaries can be downloaded to any path we like on our machine. Also, at the time of writing this article, the latest Kafka version is 2.3.0.

Additionally, if we go a level up (cd ..), we will find a config folder inside the downloaded Kafka binary directory. Here, we can configure our Kafka server and include any changes or configurations we may want. Now, let's play along: