Data Pipeline for Data Science, Part 3: Infrastructure As Code


Deploying AWS EC2 Machines, AWS Redshift and Apache Airflow using a repeatable, interactive process coded into an explanatory Jupyter Notebook.

Wanna hear occasional rants about Tensorflow, Keras, DeepLearning4J, Python and Java?


Join me on twitter @ twitter.com/hudsonmendes!


Taking Machine Learning models to production is a battle, and I share my learnings (and my sorrows) there, so we can learn together!

Data Pipeline for Data Science Series

This is a large tutorial that we tried to keep conveniently small for the occasional reader, and is divided into the following parts:


Part 1: Problem/Solution Fit
Part 2: TMDb Data “Crawler”
Part 3: Infrastructure As Code
Part 4: Airflow & Data Pipelines
(soon available) Part 5: DAG, Film Review Sentiment Classifier Model
(soon available) Part 6: DAG, Data Warehouse Building
(soon available) Part 7: Scheduling and Landings

The Problem: Setup Data Pipeline Infrastructure

This project has the following problem statement:


Data Analysts must be able to produce reports on-demand, as well as run several roll-up and drill-down queries into what the Review Sentiment is for both IMDb films and IMDb actors/actresses, based on their TMDb Film Reviews; and the Sentiment Classifier must be our own.


In Part 2 we figured out a way to link IMDb Ids with TMDb Ids, so that we can collect the TMDb Film Reviews using AWS Lambda.

In order to produce the Fact Tables (Data Warehouse) that we need as a final product of our Data Analytics, we must install a bunch of infrastructure.


Brace yourselves, this will be a slightly longer article.


The full source code for this solution can be found at https://github.com/hudsonmendes/nanodataeng-capstone.

Jupyter Notebooks: Infrastructure as Code

Although Jupyter Notebook sometimes gets used a bit too much, and may not be ideal for some tasks, it has become increasingly common to find Infrastructure As Code in this format.

The benefit of having a notebook is that the explanation is "runnable". The same effect may be achieved with code comments. However, the markdown support and ease of use of notebooks make them a bit more welcoming for writers.

Infrastructure Components

As defined in our notebook:


Figure: Components of our infrastructure

One by one:


  • Data Lake: for the present project, this is nothing more than an AWS S3 Bucket that will hold our files.
  • Airflow: must be installed on an EC2 machine that, in our case, can also be a Spot Instance (so we can save some money).
  • Classifier Model: must be trained and, in this case, can be left in a local folder inside the EC2 machine after it is trained by the Airflow DAG.
  • Data Warehouse: will live inside an AWS Redshift database.

From these requirements, we can quickly infer that the following components are needed:


  1. AWS S3 Bucket
  2. AWS EC2 KeyPair
  3. AWS IAM Role, AWS IAM Policy, and AWS IAM Instance Profile for the AWS EC2 Machine running as a Spot Instance
  4. AWS EC2 Security Group for Airflow (with the necessary Ingress Rules)
  5. AWS EC2 Spot Instance Request that will launch our EC2 Machine
  6. The AWS EC2 Machine itself, where we will install Airflow
  7. The AWS EC2 ElasticIP used to access the AWS EC2 Machine
  8. AWS Redshift Cluster
  9. AWS IAM Role and AWS IAM Policy for the Redshift Database, allowing access to our AWS S3 Bucket for our COPY commands
  10. AWS Redshift Database

Let's now see how each one of those is created.


Shared Components

We will start by installing the shared components that are used across our code.


Installing Boto3

Boto3 is the Python library that we use in order to manage our AWS account programmatically. We install the library by doing the following:

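A minimal sketch of such a notebook cell (the exact cell in the notebook may differ):

```python
# Install boto3 from within the notebook, then import it to confirm it works.
import sys
!{sys.executable} -m pip install boto3

import boto3
print(boto3.__version__)
```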

AWS EC2 KeyPair

We must create a KeyPair that we will link to our AWS EC2 Machine, and that will also be used to SSH/SCP programmatically into the machine to install dependencies and copy our DAG files.

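A sketch of what this cell does with Boto3 (the key name, region and file path are placeholders, not the notebook's exact values):

```python
import os
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Create the KeyPair and persist the private key locally,
# so that we can SSH/SCP into the machine later.
key_name = "airflow-keypair"  # hypothetical name
key_pair = ec2.create_key_pair(KeyName=key_name)

with open(f"{key_name}.pem", "w") as pem_file:
    pem_file.write(key_pair["KeyMaterial"])
os.chmod(f"{key_name}.pem", 0o400)  # SSH refuses world-readable keys
```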

Airflow: Infrastructure

To set up Airflow, we must first install some Linux dependencies, so that we can later install Airflow, set it up, and copy our DAG files.

Let's see how it goes, step by step:


AWS IAM Role, AWS IAM Policy, and AWS IAM Instance Profile

See below how we create the:

  • AWS IAM Role,
  • AWS IAM Policy, and
  • AWS IAM Instance Profile

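A sketch of those three steps with Boto3 (the role, policy and profile names, as well as the bucket ARN, are hypothetical):

```python
import json
import boto3

iam = boto3.client("iam")

# 1) Role that the EC2 machine can assume.
ec2_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "ec2.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(
    RoleName="airflow-ec2-role",
    AssumeRolePolicyDocument=json.dumps(ec2_trust_policy))

# 2) Inline policy granting access to the Data Lake bucket.
s3_access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-data-lake-bucket",
            "arn:aws:s3:::my-data-lake-bucket/*",
        ],
    }],
}
iam.put_role_policy(
    RoleName="airflow-ec2-role",
    PolicyName="airflow-ec2-s3-access",
    PolicyDocument=json.dumps(s3_access_policy))

# 3) Instance profile that the EC2 machine will carry.
iam.create_instance_profile(InstanceProfileName="airflow-ec2-profile")
iam.add_role_to_instance_profile(
    InstanceProfileName="airflow-ec2-profile",
    RoleName="airflow-ec2-role")
```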

AWS EC2 Security Group

In order to configure the firewall rules that allow inbound requests to Airflow, we must create an AWS EC2 Security Group.

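A sketch of that cell, assuming the default VPC, port 22 for SSH and port 8080 for the Airflow webserver (group name is hypothetical):

```python
import boto3

ec2 = boto3.client("ec2")

airflow_sg = ec2.create_security_group(
    GroupName="airflow-sg",
    Description="Ingress rules for Airflow")

ec2.authorize_security_group_ingress(
    GroupId=airflow_sg["GroupId"],
    IpPermissions=[
        {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},    # SSH
        {"IpProtocol": "tcp", "FromPort": 8080, "ToPort": 8080,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},    # Airflow webserver
    ])
```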

AWS EC2 Spot Instance Request

With all other components ready, we can now request our Spot Instance that will create the AWS EC2 Machine that we need in order to run Airflow.


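A sketch of the request, reusing the ec2 client and the names from the previous cells (the AMI id, instance type and maximum price are placeholders you would adjust for your own account and region):

```python
response = ec2.request_spot_instances(
    SpotPrice="0.05",                         # maximum hourly price, placeholder
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-0abcdef1234567890",   # hypothetical Ubuntu AMI
        "InstanceType": "t3.medium",
        "KeyName": "airflow-keypair",
        "SecurityGroups": ["airflow-sg"],
        "IamInstanceProfile": {"Name": "airflow-ec2-profile"},
    })

spot_request_id = response["SpotInstanceRequests"][0]["SpotInstanceRequestId"]
```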

AWS EC2 Machine

The AWS EC2 Machine is created by the Spot Instance Request, not directly by us, so we need to wait until it becomes available.

Once available, we grab hold of the instance id.


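A sketch of that wait, reusing spot_request_id from the previous cell:

```python
# Wait until the Spot Request has been fulfilled, then grab the instance id.
ec2.get_waiter("spot_instance_request_fulfilled").wait(
    SpotInstanceRequestIds=[spot_request_id])

described = ec2.describe_spot_instance_requests(
    SpotInstanceRequestIds=[spot_request_id])
instance_id = described["SpotInstanceRequests"][0]["InstanceId"]

# Also wait for the instance itself to be running before we use it.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```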

AWS EC2 ElasticIP

We now allocate and attach an AWS EC2 Elastic IP to the EC2 machine, so that we can access it via SSH and via HTTP.


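A sketch of the allocation and association, reusing instance_id from the previous cell:

```python
allocation = ec2.allocate_address(Domain="vpc")
ec2.associate_address(
    AllocationId=allocation["AllocationId"],
    InstanceId=instance_id)

airflow_host = allocation["PublicIp"]
print(f"Airflow will be reachable at http://{airflow_host}:8080")
```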

Airflow: Dependencies

Setting up a machine remotely requires us to use SSH to run commands and SCP to copy files over. Fortunately, this is really easy to achieve with the libraries paramiko and scp.

Preparing SSH and SCP

First we install the dependencies:


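The install cell looks roughly like this:

```python
# Install the SSH/SCP client libraries used by the notebook.
import sys
!{sys.executable} -m pip install paramiko scp
```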

And now we create a few methods that will help us run the commands:


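The sketch below captures the idea of those helpers (function names and the default ubuntu user are assumptions): wrap paramiko/scp so that the rest of the notebook can run remote commands and copy files in one line.

```python
import paramiko
from scp import SCPClient


def create_ssh_client(host, key_path, user="ubuntu"):
    """Open an SSH connection to the EC2 machine using our .pem key."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=host, username=user, key_filename=key_path)
    return ssh


def ssh_run(ssh, command):
    """Run a command remotely and return its stdout, printing any stderr."""
    _, stdout, stderr = ssh.exec_command(command)
    output = stdout.read().decode()
    errors = stderr.read().decode()
    if errors:
        print(errors)
    return output


def scp_put(ssh, local_path, remote_path):
    """Copy a local file or folder (e.g. our DAGs) onto the remote machine."""
    with SCPClient(ssh.get_transport()) as scp:
        scp.put(local_path, remote_path, recursive=True)
```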

Once these preparations are done, we can start installing the Linux Dependencies that must be available in order to run Airflow.


Airflow Packages & Local MySQL Database for Airflow

In this step we install the Airflow pip packages that will later be used to run Airflow.

Although we do not use MySQL for our Data Warehouse, Airflow needs a database in order to keep track of DAGs, Runs, etc. So here we install a local instance of MySQL that will keep the Airflow metadata.

VERY IMPORTANT: In your production environment you should NOT have a local MySQL database. Instead, you should have your database set up in AWS RDS or similar. Otherwise, if you lose your Spot Instance, you will also lose track of runs, data lineage, etc.

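A sketch of those remote commands, assuming an Ubuntu AMI and the Airflow 1.10.x era of this article (package versions, database name and credentials are placeholders):

```python
ssh = create_ssh_client(airflow_host, "airflow-keypair.pem")

setup_commands = [
    # Local MySQL instance that will hold the Airflow metadata.
    "sudo apt-get update -y",
    "sudo apt-get install -y mysql-server libmysqlclient-dev python3-pip",
    "sudo mysql -e \"CREATE DATABASE airflow CHARACTER SET utf8mb4;\"",
    "sudo mysql -e \"CREATE USER 'airflow'@'localhost' IDENTIFIED BY 'airflow';"
    " GRANT ALL ON airflow.* TO 'airflow'@'localhost';\"",
    # Airflow itself, with the MySQL extra so it can talk to that database.
    "pip3 install 'apache-airflow[mysql]==1.10.12' mysqlclient",
]
for command in setup_commands:
    print(ssh_run(ssh, command))
```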

Initializing Airflow DB & Configuration (.ini file)

Airflow sets up what it needs in terms of databases in order to run. That is done using the command airflow initdb.

Configuration for Airflow must be done by changing a .ini file.


It would be possible to change the configuration using sed (a popular command line tool), but the regular expressions are harder to manage than we would want. Instead, we use a package called crudini to do it.

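A sketch of those remote commands (the keys shown are the usual settings for this setup; the exact values are assumptions):

```python
config_commands = [
    "pip3 install crudini",
    # First run creates ~/airflow and a default airflow.cfg.
    "airflow initdb",
    # Point Airflow at the local MySQL database instead of the default SQLite.
    "crudini --set ~/airflow/airflow.cfg core sql_alchemy_conn"
    " mysql://airflow:airflow@localhost:3306/airflow",
    # LocalExecutor allows tasks to run in parallel on this single machine.
    "crudini --set ~/airflow/airflow.cfg core executor LocalExecutor",
    # Re-run so the new backend receives the Airflow schema.
    "airflow initdb",
]
for command in config_commands:
    print(ssh_run(ssh, command))
```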

Creating DAGs folder

We now create the folder where our DAG code should live.

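This one is a single remote command, assuming Airflow's default dags_folder location:

```python
print(ssh_run(ssh, "mkdir -p ~/airflow/dags"))
```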

Airflow: DAG Dependencies & Launch

DAGs (Directed Acyclic Graphs) are graph-based representations of flows. When they are implemented using the Airflow library, Airflow is capable of presenting them visually, as well as running them.

Figure: DAG as illustrated by Apache Airflow

We must now set up Airflow (Variables and Connections) as well as copy the DAG files, so that Airflow can recognise these files and present them in the system.

Installing DAG dependencies

For this setup, DAGs all run "in-process" (on the same machine as Airflow itself). Alternative configurations are possible and recommended, but for the scope of this project we will not change this.

In order to run our DAGs, we must install the dependencies they use:

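The exact list depends on the DAGs built in Parts 5 and 6; the sketch below is only an assumption covering data wrangling, S3/Redshift access and model training:

```python
dag_requirements = "pandas boto3 psycopg2-binary scikit-learn"  # hypothetical list
print(ssh_run(ssh, f"pip3 install {dag_requirements}"))
```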

Starting Airflow

We will now start Airflow and leave its components running as daemons (-D) instead of interactively, so that it does not die with our SSH session or with a keyboard interrupt sequence.


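A sketch of the launch commands:

```python
# Start the webserver and the scheduler as daemons so they outlive the SSH session.
for command in ["airflow webserver -p 8080 -D", "airflow scheduler -D"]:
    print(ssh_run(ssh, command))

print(f"Airflow UI: http://{airflow_host}:8080")
```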

The last infrastructure component that we require is Redshift, which we will show next.


AWS Redshift

Redshift is a columnar database based on Postgres that is ideal for common Data Warehouse query patterns, such as roll-ups and drill-downs, widely used for Data Analysis.

Installing Dependencies

We will need to manage resources in EC2, IAM and Redshift. We therefore create their clients using Boto3.


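A sketch of that cell (the region is an assumption):

```python
import boto3

# One client per service we are about to manage.
region = "us-east-1"
ec2 = boto3.client("ec2", region_name=region)
iam = boto3.client("iam", region_name=region)
redshift = boto3.client("redshift", region_name=region)
```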

AWS IAM Role for Redshift

We must now create the role that will be used by Redshift to run on our Cluster.


Important: this role must have read access to our AWS S3 Bucket, from which the COPY commands will load data.

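A sketch of the role creation, reusing the iam client from above (the role name is hypothetical; the managed AmazonS3ReadOnlyAccess policy is enough for COPY to read from our bucket):

```python
import json

redshift_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "redshift.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
role = iam.create_role(
    RoleName="redshift-s3-read-role",
    AssumeRolePolicyDocument=json.dumps(redshift_trust_policy))

iam.attach_role_policy(
    RoleName="redshift-s3-read-role",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess")

redshift_role_arn = role["Role"]["Arn"]
```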

AWS EC2 Security Group for Redshift

We now set up the Security Group containing the Ingress Rules that will make our Redshift server accessible to the whole internet.

Obviously, review the ingress rule to fit your security needs.


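A sketch of that cell, opening the default Redshift port (5439) to the world; the group name is hypothetical and the CIDR range should be tightened to your needs:

```python
redshift_sg = ec2.create_security_group(
    GroupName="redshift-sg",
    Description="Ingress rules for Redshift")

ec2.authorize_security_group_ingress(
    GroupId=redshift_sg["GroupId"],
    IpPermissions=[{"IpProtocol": "tcp", "FromPort": 5439, "ToPort": 5439,
                    "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}])
```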

We also need an IP that we can use in order to access our Redshift.


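A sketch of the allocation:

```python
# Elastic IP that the cluster will be published on.
redshift_eip = ec2.allocate_address(Domain="vpc")
print(redshift_eip["PublicIp"])
```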

AWS Redshift Cluster

We will now request the creation of our Redshift Cluster, using specific credentials and the Elastic IP that we had previously allocated.

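A sketch of the cluster request, reusing redshift_role_arn, redshift_sg and redshift_eip from the cells above (cluster size, database name and credentials are placeholders):

```python
redshift.create_cluster(
    ClusterIdentifier="films-dwh",
    NodeType="dc2.large",
    NumberOfNodes=2,
    DBName="films",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe1234",    # never hard-code credentials in real projects
    IamRoles=[redshift_role_arn],
    VpcSecurityGroupIds=[redshift_sg["GroupId"]],
    PubliclyAccessible=True,
    ElasticIp=redshift_eip["PublicIp"])

# Block until the cluster is ready and capture its endpoint.
redshift.get_waiter("cluster_available").wait(ClusterIdentifier="films-dwh")
cluster = redshift.describe_clusters(ClusterIdentifier="films-dwh")["Clusters"][0]
redshift_endpoint = cluster["Endpoint"]["Address"]
```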

Data Warehouse Database

Using the ipython-sql Jupyter extension we can have entire notebook cells running as SQL.


We will now connect our notebook to the database, so that we can run those SQL cells:

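A sketch of the connection cell, reusing redshift_endpoint and the placeholder credentials from the cluster creation:

```python
# Install and load ipython-sql, then connect to the Redshift database.
import sys
!{sys.executable} -m pip install ipython-sql psycopg2-binary

%load_ext sql

connection_string = f"postgresql://awsuser:ChangeMe1234@{redshift_endpoint}:5439/films"
%sql $connection_string
```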

Once connected, we will create our Dimension Tables:


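The real schema lives in the notebook; the cell below is only a hypothetical illustration of what a dimension cell using the %%sql magic might look like:

```sql
%%sql
-- Hypothetical dimensions: films and actors/actresses, keyed by the linked IMDb/TMDb ids.
CREATE TABLE IF NOT EXISTS dim_films (
    film_id      BIGINT        NOT NULL PRIMARY KEY,
    title        VARCHAR(512),
    release_date DATE
) DISTSTYLE ALL;

CREATE TABLE IF NOT EXISTS dim_actors (
    actor_id BIGINT       NOT NULL PRIMARY KEY,
    name     VARCHAR(256)
) DISTSTYLE ALL;
```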

We then create our Fact Tables:

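Again only a hypothetical illustration, with one row per classified review, joinable to the dimensions above:

```sql
%%sql
CREATE TABLE IF NOT EXISTS fact_film_review_sentiment (
    review_id   VARCHAR(64) NOT NULL,
    film_id     BIGINT      NOT NULL REFERENCES dim_films (film_id),
    actor_id    BIGINT      REFERENCES dim_actors (actor_id),
    sentiment   SMALLINT    NOT NULL,   -- e.g. 0 = negative, 1 = positive
    reviewed_at TIMESTAMP
)
DISTKEY (film_id)
SORTKEY (reviewed_at);
```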

And that's it: we now have our database structure ready to receive the data that our Data Pipeline will continuously process and deliver into it.

In Summary

Using python, boto3 and paramiko, we have set up our entire infrastructure:

  1. AWS Networking
  2. AWS EC2 Machine created via Spot Instance Request
  3. AWS Redshift

We are now ready to install/update our DAGs and launch them, so our Data Pipeline starts running.


Next Steps

In the next article, Part 4: Airflow & Data Pipelines, we will copy our DAG code and launch our DAGs, so that our Data Pipeline starts creating the Data Warehouse that we need.

Source Code

Find the end-to-end solution source code at https://github.com/hudsonmendes/nanodataeng-capstone.

https://github.com/hudsonmendes/nanodataeng-capstone中找到端到端解决方案源代码。

Wanna keep in Touch? Twitter!

I’m Hudson Mendes (@hudsonmendes), coder, 36, husband, father, Principal Research Engineer, Data Science @ AIQUDO, Voice To Action.



I’ve been on the Software Engineering road for 19+ years, and occasionally publish rants about Tensorflow, Keras, DeepLearning4J, Python & Java.


Join me there, and I will keep you in the loop with my daily struggle to get ML Models to Production!


Translated from: https://medium.com/@hudsonmendes/data-pipeline-for-data-science-part-3-infrastructure-as-code-770edbb2f98f
