How to Build a Real-Time Data Pipeline for an Online Store Using Apache Beam, Pub/Sub, and SQL

It is fascinating to see how malleable our data is becoming. Nowadays we have tools to convert highly nested, complex log data into a simple row format, tools to store and process petabytes of transaction data, and tools to capture raw data as soon as it is generated and turn it into useful insights.

In this article, I would like to share one such step-by-step process of generating, ingesting, processing, and finally utilizing real-time data from our own virtual online store.

Prerequisites

  1. Google Cloud Platform account (if you don't have an account, sign up here for a 3-month free trial with $300 in credit).
  2. Linux OS.
  3. Python 3.

Architecture

Image: My flow chart designing skills on a whiteboard, and some editing of course 😉

Step 1: Create a Pub/Sub Topic and Subscriber

Pub/Sub is a messaging service available on Google Cloud Platform. It can be thought of as a managed version of Apache Kafka or RabbitMQ.

A messaging service basically decouples the system that produces data (the virtual store application in our case) from the system that processes the data (Apache Beam on Dataflow in our case).

First, we need to create a service account in GCP that will allow us to access Pub/Sub from our application and from Apache Beam:

Log in to your GCP console and select IAM & Admin from the left menu:

Image: Creating a service account.

Select the Service Accounts option and create a new service account:

Image: Entering the new service account's details.

Provide the Pub/Sub Admin role to our service account:

Image: Providing the required roles for accessing Pub/Sub.

And finally, create a key and download the private key JSON file for later use:

Image: Downloading the service account key.
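
Both the virtual store application and the Beam pipeline need to be able to find this key. A common approach, sketched below, is to point the GOOGLE_APPLICATION_CREDENTIALS environment variable at the downloaded file before creating any Google Cloud clients; the path shown is only a placeholder.

import os

# Hypothetical path to the service account key downloaded above.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"

# Any Google Cloud client created after this point will pick up these
# credentials automatically (Application Default Credentials).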

Next, we will create a topic in Pub/Sub to publish our virtual store data to, and a subscription from which Apache Beam will pull the data.

Select Pub/Sub from the left menu of the GCP console.

Image: Selecting the Pub/Sub tool from the console.

Select Topics from the sub-menu. From the top, select CREATE TOPIC. Enter a suitable name and click CREATE TOPIC.

Image: Creating a new topic.

Now, select the Subscriptions option on the left and, from the top, select CREATE SUBSCRIPTION. Enter a suitable name and, from the drop-down, select the topic we created in the previous step; this is the topic the subscription will listen to for the stream of data. After that, click Create at the bottom, keeping the other options as they are.

Image: Creating a new subscriber for the created topic.
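
If you prefer to script these steps instead of clicking through the console, the google-cloud-pubsub client library can create the same resources. This is only a sketch; the project, topic, and subscription IDs below are placeholders.

from google.cloud import pubsub_v1

# Placeholder identifiers -- replace with your own project, topic, and subscription names.
project_id = "my-gcp-project"
topic_id = "virtual-store-transactions"
subscription_id = "virtual-store-transactions-sub"

publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, topic_id)
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Create the topic, then a pull subscription attached to it.
publisher.create_topic(request={"name": topic_path})
subscriber.create_subscription(
    request={"name": subscription_path, "topic": topic_path}
)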

Step 2: Create a Virtual Store for Generating Real-Time Data

Now we will create a virtual online store that will push transaction data into the Pub/Sub topic we created in the previous steps.

I have used Dash, a tool created by Plotly, to quickly build a web application from prebuilt UI components such as buttons and text inputs. A complete tutorial on building a web application is out of this article's scope, since our main focus is building a real-time pipeline, so you can simply download the complete application from the GitHub repo here.

The only important piece is the script that publishes our data to the Pub/Sub topic:
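
The publishing script itself ships with the repo; the snippet below is only a rough sketch of what it does: it serializes one order (with its nested items) to JSON and publishes it to the topic. The project ID, topic name, key path, and the exact shape of the order dictionary are assumptions, not the repo's exact code.

import json
import os

from google.cloud import pubsub_v1

# Placeholder credentials path and identifiers.
os.environ.setdefault("GOOGLE_APPLICATION_CREDENTIALS", "/path/to/service-account-key.json")
project_id = "my-gcp-project"
topic_id = "virtual-store-transactions"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_id)

def publish_order(order: dict) -> None:
    """Serialize one order (with its nested items) and publish it to Pub/Sub."""
    data = json.dumps(order).encode("utf-8")
    future = publisher.publish(topic_path, data)
    # Block until the message is accepted so failures surface immediately.
    print(f"Published message id: {future.result()}")

# Example call -- the dashboard's Submit callback would build a dict like this
# (the field names here mirror the transaction_data table created later).
publish_order({
    "order_id": "order-123",
    "timestamp": 1600000000,
    "items": [
        {"item_id": "1", "item_name": "Apple", "category_name": "Fruit",
         "item_price": 1.5, "item_qty": 3},
    ],
})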

Let's start our online virtual store. After downloading the project from GitHub, create a Python virtual environment and install all the packages from the requirement.txt file. Now open the folder in a terminal and run the app.py file. You will see the output below:

Image: Running the virtual store Dash application server.

Go to your web browser and open localhost:8085. You will see the virtual store home page.

Image: My own virtual store 😜. You got the code, you can change the name if you want 😉.

Now the fun part: let's order a few items and see how our data gets published to the Pub/Sub topic and then pulled by the subscriber we created earlier. Add some quantity for each item and click Submit:

Image: Testing by placing an order.

In the terminal, you can see that transaction data in JSON format is printed every time you place an order. The same JSON data is also pushed to the Pub/Sub topic. Let's pull the data from our subscription: go to the Pub/Sub dashboard in GCP, select the Subscriptions option from the left menu, click VIEW MESSAGES at the top, and then click Pull to see the published data:
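
If you would rather verify this from code than from the console, a small script like the sketch below (reusing the placeholder subscription name from earlier) can pull a few messages. It deliberately does not acknowledge them, so they will be redelivered and remain available later.

from google.cloud import pubsub_v1

# Placeholder identifiers matching the earlier sketches.
project_id = "my-gcp-project"
subscription_id = "virtual-store-transactions-sub"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, subscription_id)

# Pull up to 5 messages synchronously; since they are not acknowledged,
# Pub/Sub will redeliver them after the ack deadline expires.
response = subscriber.pull(
    request={"subscription": subscription_path, "max_messages": 5},
    timeout=10.0,
)
for received in response.received_messages:
    print(received.message.data.decode("utf-8"))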

Image: Checking the messages in Pub/Sub.

Step 3: Create the Apache Beam Pipeline and Run It on Dataflow

At this stage, we are getting data in real time from our virtual online store into our Pub/Sub subscription. Now we are going to write our pipeline in Apache Beam to unnest the data and convert it into a row-like format so it can be stored in a MySQL server. Finally, we will run the pipeline using the GCP Dataflow runner.

Before we start writing our data pipeline, let's create a Cloud SQL instance in GCP, which will be our final destination for the processed data. You can use other cloud SQL services as well; I have written my pipeline for MySQL.

From the GCP console, select the SQL option from the left menu:

Image: Selecting the SQL tool from the GCP console.

Select CREATE INSTANCE:

Image: Creating a Cloud SQL instance.

Select MySQL:

Image: Selecting MySQL server.

Enter an instance name and a password for the root user, leave the other settings at their defaults, and click Create. Now sit back and relax; it will take 5–10 minutes for the instance to start:

Image: Entering the details and creating the Cloud SQL instance.

After the DB is up and running, we need to create a database and a table. Connect to your MySQL instance using any SQL client and run the queries below:

CREATE DATABASE virtual_store;

CREATE TABLE transaction_data(
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `order_id` VARCHAR(255),
  `timestamp` INT(11),
  `item_id` VARCHAR(255),
  `item_name` VARCHAR(255),
  `category_name` VARCHAR(255),
  `item_price` FLOAT,
  `item_qty` INT,
  PRIMARY KEY(`id`)
);

So far we have created our source (the Pub/Sub subscription) and our sink (MySQL); now we will create the data pipeline itself.

The directory layout for our pipeline is given below; you can clone the complete directory from my GitHub repo here:

├── dataflow_pipeline
│ ├── mainPipeline.py
│ ├── pipeline_config.py
│ ├── requirement.txt
│ └── setup.py

Let's start with the configuration file, pipeline_config.py. This file contains all the configuration, such as the Pub/Sub subscription details, the service account key path, the MySQL connection details, and the table details.
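
The actual file is in the repo; a configuration module of roughly the following shape covers the fields the pipeline needs. Every value below is a placeholder.

# pipeline_config.py -- a minimal sketch; every value here is a placeholder.

PROJECT_ID = "my-gcp-project"
SUBSCRIPTION = "projects/my-gcp-project/subscriptions/virtual-store-transactions-sub"
SERVICE_ACCOUNT_PATH = "/path/to/service-account-key.json"

# Cloud SQL (MySQL) connection details.
MYSQL_HOST = "127.0.0.1"  # or the Cloud SQL instance's public IP
MYSQL_PORT = 3306
MYSQL_USER = "root"
MYSQL_PASSWORD = "change-me"
MYSQL_DATABASE = "virtual_store"
MYSQL_TABLE = "transaction_data"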

Next is the main pipeline file, mainPipeline.py. This is the entry point for the different runners (local, Dataflow, etc.) that can run the pipeline. In this pipeline script, we read data from Pub/Sub, unnest it, and store the final data in a relational database. Later we will visualize it using Google Data Studio. Let's look at the code:
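
The full mainPipeline.py lives in the repo and is not reproduced verbatim here. The condensed sketch below shows the overall shape under a few assumptions: it reads raw bytes from the Pub/Sub subscription, flattens each order into one record per item, and writes rows using a simple pymysql DoFn (the repo may use a different MySQL sink).

# mainPipeline.py -- a condensed sketch of the pipeline, not the exact repo code.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

import pipeline_config

def unnest_order(message: bytes):
    """Turn one nested order message into one dict per ordered item."""
    order = json.loads(message.decode("utf-8"))
    for item in order.get("items", []):
        yield {
            "order_id": order["order_id"],
            "timestamp": order["timestamp"],
            "item_id": item["item_id"],
            "item_name": item["item_name"],
            "category_name": item["category_name"],
            "item_price": item["item_price"],
            "item_qty": item["item_qty"],
        }

class WriteToMySQL(beam.DoFn):
    """Insert each unnested row into the transaction_data table."""

    def setup(self):
        import pymysql  # imported here so it is resolved on the workers
        self.connection = pymysql.connect(
            host=pipeline_config.MYSQL_HOST,
            port=pipeline_config.MYSQL_PORT,
            user=pipeline_config.MYSQL_USER,
            password=pipeline_config.MYSQL_PASSWORD,
            database=pipeline_config.MYSQL_DATABASE,
            autocommit=True,
        )

    def process(self, row):
        sql = (
            "INSERT INTO transaction_data "
            "(order_id, timestamp, item_id, item_name, category_name, item_price, item_qty) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s)"
        )
        with self.connection.cursor() as cursor:
            cursor.execute(sql, (
                row["order_id"], row["timestamp"], row["item_id"], row["item_name"],
                row["category_name"], row["item_price"], row["item_qty"],
            ))

    def teardown(self):
        self.connection.close()

def run():
    # Picks up --runner, --project, etc. from the command line.
    options = PipelineOptions()
    options.view_as(StandardOptions).streaming = True

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read from Pub/Sub" >> beam.io.ReadFromPubSub(subscription=pipeline_config.SUBSCRIPTION)
            | "Unnest items" >> beam.FlatMap(unnest_order)
            | "Log rows" >> beam.Map(lambda row: print(row) or row)  # prints each row locally
            | "Write to MySQL" >> beam.ParDo(WriteToMySQL())
        )

if __name__ == "__main__":
    run()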

First, let's run our pipeline locally:

python mainPipeline.py

You will see the output below, which means our pipeline is now listening to Pub/Sub for incoming data.

Image: The output of running the pipeline locally.

Let’s place some orders from our virtual online store and see the output of the pipeline.

Image: Placing sample orders.

After clicking Submit, you will immediately see the output in the pipeline terminal:

Image: Testing the pipeline by placing orders.

As you can see, our input was nested data in which all the items were nested inside a single object, but our pipeline unnested the data to the row level.

Let’s check our database table:

Image: Checking the final destination table.

As expected, our single order is transformed into item-wise, row-level data and inserted into our database on the fly, in real time.

Now we will run our pipeline on GCP Dataflow. For this, we need to run the command below:

python mainPipeline.py --runner DataflowRunner \
--project hadooptest-223316 \
--temp_location gs://dataflow-stag-bucket/ \
--requirements_file requirement.txt \
--setup_file ./setup.py

Make sure you create a staging bucket in GCP, as I did, and provide its path in the above command under the “temp_location” option. Also create a setup.py in your directory with the content below; this will prevent a ModuleNotFoundError.
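
The repo's setup.py is not reproduced here; a minimal version that packages the local modules (such as pipeline_config.py) for the Dataflow workers would look roughly like this, with the name, version, and dependency list as placeholders.

# setup.py -- a minimal sketch so Dataflow workers can import local modules
# such as pipeline_config; the name, version, and dependencies are placeholders.
import setuptools

setuptools.setup(
    name="dataflow-pipeline",
    version="0.0.1",
    install_requires=["pymysql"],  # assumption: mirror whatever requirement.txt lists
    packages=setuptools.find_packages(),
    py_modules=["pipeline_config"],
)

Packaging the local modules this way is what lets the remote Dataflow workers import pipeline_config when the job runs.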

Sit back and relax; it will take 5–10 minutes for the pipeline to start on GCP Dataflow. Now go to the Dataflow dashboard to check whether the job has started.

Image: Checking the Dataflow job running in GCP.

You can also see the different stages of the pipeline; click on the running job to see the details.

Image: Visual representation of all the stages of the pipeline in Dataflow.

Place some orders from the virtual store and test whether the data is arriving in the DB. In my case, it is working as expected: rows are being inserted into the MySQL table in real time:

Image: Final testing to check the data processed by Dataflow.

Note: Closing the local terminal from which we deployed the pipeline won't affect the pipeline running in Dataflow on GCP. Make sure to terminate the pipeline from the GCP console as well when you are done.

Step 4: Create a Data Studio Dashboard to Visualize Our Real-Time Data

Google Data Studio is a free tool for visualizing data. It enables users to create an interactive and effective reporting dashboard very quickly from different data sources.

Let's connect our sink (the MySQL server) to Data Studio and create a dashboard on top of our real-time data.

Go to https://datastudio.google.com. Click on Create and select Data source.

Image: Creating a data source for the Data Studio dashboard.

Give your source a name at the top-left corner and select Cloud SQL for MySQL as the source (if your MySQL database is not in GCP, select plain MySQL instead).

Image: Naming our data source and selecting Cloud SQL for the connection.

Enter your DB credentials and click Authenticate. After that, select CUSTOM QUERY, enter your query, and click Connect at the top-right corner.

Image: Entering all the required details and a custom query to fetch data from the DB.

Data Studio will connect to the cloud SQL instance and show us the schema of our table. Now click on CREATE REPORT at the top right corner:

Image: Table schema fetched by Data Studio.

Add charts and graphs as per your requirements. You can learn more about Data Studio here:

Image: Options to add charts and graphs in Data Studio.

I have created a basic, two-chart dashboard that shows item-wise quantity sold and item-wise sales.

My final dashboard, which gets updated as soon as orders are placed:

Image: The final dashboard showing insights from our real-time data.

Conclusion

In this article, I explained how a real-time pipeline works. We created all the components of a data pipeline: a source application that generates data in real time, a buffer that ingests the data, an actual pipeline on Google Cloud Platform that processes the data, a sink to store the processed data, and finally a dashboard to visualize the final data.

Please leave your comments about this article below, and if you face issues with any of the steps above, you can reach out to me through Instagram and LinkedIn.

Originally published at: https://towardsdatascience.com/how-to-build-a-real-time-data-pipeline-for-an-online-store-using-apache-beam-pub-sub-and-sql-6424f33eba97
