Building distributed systems for ETL & ML data pipelines is hard. If you have tried implementing one yourself, you may have found that tying together a workflow orchestration solution with distributed multi-node compute clusters such as Spark or Dask is difficult to set up and manage properly. In this article, I want to show you how you can obtain a highly available, scalable, distributed system that will make the orchestration of your data pipelines for ETL & ML much more enjoyable and will free up your time to work with data and generate value out of it, rather than spend it maintaining clusters.
The entire setup is, to a large extent, automated by AWS and Prefect. Plus, it will cost you almost nothing in the beginning. Why? Because I will show you how to set up a production-ready and secure workflow orchestration system with a serverless Kubernetes cluster as your distributed, fault-tolerant, highly available, self-healing & automatically scalable execution layer within minutes!
All that you need to do is:
- create an AWS account & an IAM user with programmatic access to create AWS ECR, ECS & EKS resources
- run a few shell commands from this article to set up a Kubernetes cluster by using AWS EKS on Fargate (the control plane will cost you just $0.10 per hour! It used to be $0.20, but AWS halved the price in early 2020)
- sign up for a free Prefect Cloud account (the free Developer account gives you access to all features, but only for you; if you want to use it with your team, you need to upgrade to either the Team or Enterprise plan)
- generate authentication tokens within the Prefect Cloud UI to connect your Prefect Cloud account to your serverless Kubernetes cluster on AWS
- run your distributed data pipelines and generate value out of your data!
Let’s get started!
Serverless Kubernetes cluster as execution layer
In December 2019, AWS launched a new Fargate feature which, to many, was a game-changer: they introduced an option to run AWS EKS on Fargate, which is to say that AWS made the Fargate service an orchestrator not only for ECS but also for EKS. Up to that point, AWS Fargate had been a serverless way of running containers only on AWS ECS.
EKS and Fargate make it straightforward to run Kubernetes-based applications on AWS by removing the need to provision and manage infrastructure for pods.
What does it mean for us? It means that we can now have a serverless Kubernetes cluster on EKS which is only charging us for running pods, rather than their underlying EC2 instances. It also means, among other benefits:
- no more worker node maintenance,
- no more guessing capacity,
- no more EC2 autoscaling groups to scale the worker nodes up and down.
AWS takes care of all that for you. All that you need to do is to write YAML files for your deployments and interact with the EKS via kubectl. In short: your only task now is to write your ETL & ML code that adds value to your business and AWS takes care of the Ops, i.e. operating, maintaining, and scaling your Kubernetes cluster.
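All of the deployment objects used later in this walkthrough are generated for you by Prefect and eksctl, but for reference, a minimal hand-written deployment that Fargate could schedule in the default namespace might look like this (the name and image are hypothetical, purely for illustration):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-fargate        # hypothetical name, for illustration only
  namespace: default         # matches the default Fargate profile
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello-fargate
  template:
    metadata:
      labels:
        app: hello-fargate
    spec:
      containers:
        - name: hello
          image: nginx:1.19
          resources:
            requests:          # Fargate sizes the pod from these requests
              cpu: "250m"
              memory: "512Mi"
```

You would apply it with kubectl apply -f deployment.yaml, and Fargate would provision capacity for the pods without any worker nodes to manage.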
Considering that we are charged only for the actual vCPU and memory of running pods, this provides a great foundation for a modern data platform.
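To put rough numbers on this, here is a back-of-the-envelope sketch; the per-vCPU and per-GB rates below are assumed placeholders, not official AWS pricing, so check the AWS Fargate pricing page for your region before relying on any figures:

```python
# Back-of-the-envelope Fargate cost sketch. The per-vCPU-hour and per-GB-hour
# rates are illustrative placeholders -- look up the real values for your region.
VCPU_RATE = 0.0404             # assumed $/vCPU-hour
MEM_RATE = 0.0044              # assumed $/GB-hour
EKS_CONTROL_PLANE_RATE = 0.10  # $/hour for the EKS control plane, as mentioned above


def monthly_cost(pod_vcpu_hours: float, pod_gb_hours: float, hours_in_month: float = 730) -> float:
    """Estimate monthly cost: the control plane runs 24/7, pods are billed only while running."""
    control_plane = EKS_CONTROL_PLANE_RATE * hours_in_month
    pods = VCPU_RATE * pod_vcpu_hours + MEM_RATE * pod_gb_hours
    return round(control_plane + pods, 2)


# Example: 1-vCPU/2-GB pods running a combined 50 hours per month
print(monthly_cost(pod_vcpu_hours=50, pod_gb_hours=100))
# 75.46 -- the flat control-plane fee dominates while pod usage stays low
```

The point of the sketch: with serverless pods, the idle cost of the whole setup collapses to the flat control-plane fee.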
This sounds too good to be true: what are the downsides?
One possible disadvantage of almost any serverless option is the cold-start problem: the container orchestration system first needs to allocate and prepare compute resources (among others, pulling the latest version of the image from the Docker registry to the allocated worker node and building the image), which may add some extra latency before the container (or your K8s pod) reaches a running state.
If your data workloads require a very low level of latency, you may opt for the AWS EKS cluster with the traditional data plane and follow the instructions from this blog post to set up a non-serverless cluster on AWS EKS and to connect it to your Prefect Cloud environment.
However, you can have both! AWS allows you to mix the two:
you can have the same AWS EKS cluster running pods in a serverless way in the default namespace (this is set up by generating a Fargate Profile)
and you can have an EC2 instance (e.g. with a GPU) for your Data Science models connected to the same Kubernetes cluster on EKS, but within a different namespace or using different Kubernetes labels. When you then create a deployment for a pod that doesn't match the namespace and labels defined in the Fargate Profile, it will be scheduled onto the EC2 worker node, which you maintain and which is available with no latency.
As you can see, AWS designed EKS on Fargate with a lot of foresight which allows you to mix the serverless and non-serverless options to save your time, money, and maintenance efforts. You can find out more about that in this video in which AWS introduced the service.
AWS Setup
You need an AWS account with either admin access or at least a user with IAM permissions to create ECR, EKS, and ECS resources. Then, you must have the AWS CLI configured for this account and eksctl installed, as described in the AWS docs.
Kubernetes on AWS works well with AWS ECR, which is a registry for your Docker images. To authenticate your terminal with your ECR account, run:
- if you use the new AWS CLI v2:
aws ecr get-login-password --region <YOUR_AWS_REGION> | docker login --username AWS --password-stdin <YOUR_ECR_REGISTRY_ID>.dkr.ecr.<YOUR_AWS_REGION>.amazonaws.com
- if you use the old AWS CLI version:
$(aws ecr get-login --no-include-email --region <YOUR_AWS_REGION>)
Note: <YOUR_AWS_REGION> could be e.g. us-east-1, eu-central-1, and more.
If you get a Login Succeeded message, you can create the ECR repositories for your data pipelines. We will create two data pipelines: dask-k8 and basic-etl-prefect-flow. Use the same names to follow this walkthrough, but in general, it's easiest to give your ECR repository the same name as your Prefect flow to avoid confusion.
aws ecr create-repository --repository-name dask-k8
aws ecr create-repository --repository-name basic-etl-prefect-flow
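The registry URL and image names used later in this article all follow the standard ECR URI convention; a tiny helper (hypothetical, just to make the pattern explicit) shows how the pieces fit together:

```python
def ecr_image_uri(account_id: str, region: str, repository: str, tag: str = "latest") -> str:
    """Build the standard ECR image URI:
    <account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>"""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


# The registry_url passed to Prefect's Docker storage later is everything
# before the repository name; Prefect appends the flow's image name itself.
print(ecr_image_uri("123456789", "eu-central-1", "dask-k8"))
# 123456789.dkr.ecr.eu-central-1.amazonaws.com/dask-k8:latest
```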
Then, all you need to do is run the following command, which will deploy a Kubernetes control plane and a Fargate Profile in your VPC:
eksctl create cluster --name fargate-eks --region <YOUR_AWS_REGION> --fargate
I picked the name fargate-eks for the cluster, but feel free to change it. The --fargate flag ensures that we create a new Fargate profile for use with this cluster. EKS also allows you to create a custom Fargate profile if needed.
Provisioning all resources may take several minutes. When finished, you should see output similar to this:
➜ ~ eksctl create cluster --name fargate-eks --region eu-central-1 --fargate
[ℹ] eksctl version 0.25.0
[ℹ] using region eu-central-1
[ℹ] setting availability zones to [eu-central-1b eu-central-1c eu-central-1a]
[ℹ] ...
[ℹ] using Kubernetes version 1.17
[ℹ] creating EKS cluster "fargate-eks" in "eu-central-1" region with Fargate profile
[ℹ] if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=eu-central-1 --cluster=fargate-eks'
[ℹ] CloudWatch logging will not be enabled for cluster "fargate-eks" in "eu-central-1"
[ℹ] you can enable it with 'eksctl utils update-cluster-logging --region=eu-central-1 --cluster=fargate-eks'
[ℹ] Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "fargate-eks" in "eu-central-1"
[ℹ] 2 sequential tasks: { create cluster control plane "fargate-eks", create fargate profiles }
[ℹ] building cluster stack "eksctl-fargate-eks-cluster"
[ℹ] deploying stack "eksctl-fargate-eks-cluster"
[ℹ] creating Fargate profile "fp-default" on EKS cluster "fargate-eks"
[ℹ] created Fargate profile "fp-default" on EKS cluster "fargate-eks"
[ℹ] "coredns" is now schedulable onto Fargate
[ℹ] "coredns" is now scheduled onto Fargate
[ℹ] "coredns" pods are now scheduled onto Fargate
[ℹ] waiting for the control plane availability...
[✔] saved kubeconfig as "/Users/<YOUR_USERNAME>/.kube/config"
[ℹ] no tasks
[✔] all EKS cluster resources for "fargate-eks" have been created
[ℹ] kubectl command should work with "/Users/<YOUR_USERNAME>/.kube/config", try 'kubectl get nodes'
[✔] EKS cluster "fargate-eks" in "eu-central-1" region is ready
Then, if you check your context:
kubectl config current-context
You should get output similar to this one:
<YOUR_AWS_USER_NAME>@fargate-eks.<YOUR_AWS_REGION>.eksctl.io
This way you can see that you are connected to a serverless Kubernetes cluster running on AWS Fargate! To prove it further, run:
➜ ~ kubectl get nodes
NAME STATUS ROLES AGE VERSION
fargate-ip-192-168-163-163.eu-central-1.compute.internal Ready <none> 15m v1.17.9-eks-a84824
fargate-ip-192-168-180-51.eu-central-1.compute.internal Ready <none> 15m v1.17.9-eks-a84824
In the output, you should see at least 1 fargate node waiting for your pod deployments.
Note: those nodes are running inside of your VPC but they are not visible within your EC2 dashboard. You cannot SSH to those nodes either, as they are fully managed and deployed by Fargate in a serverless fashion.
The advantage of combining this AWS EKS cluster with Prefect is that the entire Kubernetes pod deployment and scheduling is abstracted away from you by Prefect. This means that you don’t even need to know much about Kubernetes in order to derive value from it. In the next section, we will connect this cluster to our Prefect Cloud account and start building distributed ETL & ML data pipelines.
Prefect Setup
Let’s first sign up for a free Developer account on https://cloud.prefect.io/.
At first, you will be welcomed by a clean UI showing your flows, your agents, and a general overview of recent flow runs and the next scheduled jobs. The flows themselves can be organized into several projects. When you start building your data pipelines, this main dashboard lets you quickly identify the current status of all of them. It is extremely helpful: imagine that you log into your account in the morning and see that all your pipelines are green! You only need to dive deeper if you see red bars indicating failed data flows.
I know companies that would dream of having this kind of dashboard and this level of transparency about their ETL & ML data pipelines and their status, while at the same time being able to see that all agents executing the work are healthy: at the moment, you can see that I have a Fargate agent ready to deploy flows on AWS ECS. For now, though, we focus on AWS EKS.
Install Prefect
Let’s continue with the setup and install Prefect on your computer. The following command installs Prefect with the AWS and Kubernetes extras (instead of [kubernetes,aws], you could use [all_extras] if you want to install all Prefect extensions to external systems):
pip install "prefect[kubernetes,aws]"
Now to make sure that you use Prefect Cloud and not the open source version Prefect Core, switch the context to Cloud:
prefect backend cloud
Create a Personal Access Token to register your flows to Prefect Cloud
After you have registered for a free account, you need to create a Personal Access Token to authenticate your local terminal with Prefect Cloud. This allows you to register your flows (i.e. your ETL & ML data pipelines) to Prefect Cloud directly from your computer. Go to the side bar: User → Personal Access Token → + CREATE TOKEN button.
Choose some meaningful name, e.g. MyTokenToRegisterFlows.
Then copy the token and run the following command in your terminal:
prefect auth login -t <MyTokenToRegisterFlows>
Now you can register your flows to be orchestrated from Prefect Cloud!
Create an API token to authorize your AWS EKS agent to run your flows
The last part is to create a RunnerToken for your Kubernetes agent and to register the agent. Go to the side bar: Team → API Tokens → + CREATE TOKEN button.
Alternatively, you could do the same from your terminal:
prefect auth create-token -n MyEKS_on_Fargate_K8s_Token -r RUNNER
It’s very important to select the RUNNER scope, otherwise your agent will not be able to execute the flows on your behalf.
Click on CREATE and copy the generated API token.
Use the generated API token to set the EKS cluster as your Prefect agent
Now we are reaching the most exciting part: with the command below, you will be able to set your serverless AWS Kubernetes cluster as your execution layer (i.e. agent) for your Prefect data pipelines:
➜ ~ prefect agent install kubernetes -t <YOUR_RUNNER_TOKEN> --rbac | kubectl apply -f -
deployment.apps/prefect-agent created
role.rbac.authorization.k8s.io/prefect-agent-rbac created
rolebinding.rbac.authorization.k8s.io/prefect-agent-rbac created
Now you should be able to see a new Kubernetes Agent in your Prefect Cloud account:
We can also see a new pod corresponding to the Prefect agent:
➜ ~ kubectl get pods
NAME READY STATUS RESTARTS AGE
prefect-agent-68785f47d4-pv9kt 1/1 Running 0 5m59s
Basic Prefect flow structure
Now everything is set up and we can start creating our flows. The Prefect Docs include a variety of useful tutorials that quickly show you how to adapt your Python ETL & ML code to run on Prefect. In a nutshell, you just need to decorate your Python functions with the @task decorator, add the task and flow imports, and create a flow object:
from prefect import Flow, task
import pandas as pd


def score_check(grade, subject, student):
    """
    This is a normal "business logic" function which is not a Prefect task.
    If a student achieved a score > 90, multiply it by 2 for their effort! But only if the subject is not NULL.
    :param grade: number of points on an exam
    :param subject: school subject
    :param student: name of the student
    :return: final nr of points
    """
    if pd.notnull(subject) and grade > 90:
        new_grade = grade * 2
        print(f'Doubled score: {new_grade}, Subject: {subject}, Student name: {student}')
        return new_grade
    else:
        return grade


@task
def extract():
    """ Return a dataframe with students and their grades"""
    data = {'Name': ['Hermione', 'Hermione', 'Hermione', 'Hermione', 'Hermione',
                     'Ron', 'Ron', 'Ron', 'Ron', 'Ron',
                     'Harry', 'Harry', 'Harry', 'Harry', 'Harry'],
            'Age': [12] * 15,
            'Subject': ['History of Magic', 'Dark Arts', 'Potions', 'Flying', None,
                        'History of Magic', 'Dark Arts', 'Potions', 'Flying', None,
                        'History of Magic', 'Dark Arts', 'Potions', 'Flying', None],
            'Score': [100, 100, 100, 68, 99,
                      45, 53, 39, 87, 99,
                      67, 86, 37, 100, 99]}
    df = pd.DataFrame(data)
    return df


@task(log_stdout=True)
def transform(x):
    x["New_Score"] = x.apply(lambda row: score_check(grade=row['Score'],
                                                     subject=row['Subject'],
                                                     student=row['Name']), axis=1)
    return x


@task(log_stdout=True)
def load(y):
    old = y["Score"].tolist()
    new = y["New_Score"].tolist()
    print(f"ETL finished. Old scores: {old}. New scores: {new}")


with Flow("basic-prefect-etl-flow") as flow:
    extracted_df = extract()
    transformed_df = transform(extracted_df)
    load(transformed_df)

if __name__ == '__main__':
    # flow.run()
    flow.register(project_name='Medium_AWS_Prefect')
Note: adding log_stdout=True ensures that the printed output will appear in the Prefect Cloud flow logs.
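The exact Prefect internals differ, but the mechanism behind this kind of log capture can be sketched with the standard library alone: stdout is temporarily redirected while the wrapped function runs, so anything printed can be forwarded to a logger instead of being lost in the pod's console (run_with_captured_stdout is a hypothetical helper, not a Prefect API):

```python
import io
from contextlib import redirect_stdout


def run_with_captured_stdout(fn, *args):
    """Run fn and return (result, captured_stdout) -- a toy version of the idea
    behind log_stdout=True: printed output is collected rather than discarded."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        result = fn(*args)
    return result, buffer.getvalue()


result, logs = run_with_captured_stdout(lambda x: print(f"score: {x}") or x * 2, 21)
print(result, logs.strip())
# 42 score: 21
```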
Creating a project to organize your flows
Now we can create a project to organize our flows either from the UI or by using the terminal:
➜ prefect create project Medium_AWS_Prefect
Medium_AWS_Prefect created
Registering your flows to Prefect Cloud
If we now run this script, we should get a link to the Prefect Cloud UI from which we can trigger or schedule the flow:
➜ python3 basic-prefect-etl-flow.py
Result check: OK
Flow: https://cloud.prefect.io/anna/flow/888046e6-f366-466a-b9b5-4113cd437e4d
Running your flows from the UI
When we trigger the flow, we will see that it stays in the scheduled state and doesn’t run.
This is because, to run your flows on the AWS EKS cluster, your flow must include information on where your code is stored.
Adding storage information & pushing the flow to ECR
There are several options for storing your code: EKS can pull it from ECR, S3, or GitHub.
The easiest option is to dockerize your flow and push the image to ECR. Luckily, the Prefect team made it really easy; we only need to:
- create an ECR repository for our flow (this step can be automated with a CI/CD pipeline in your production environment)
- add Docker storage to the code.
If you remember the AWS Setup section, we already created two ECR repositories: dask-k8 and basic-etl-prefect-flow. Therefore, we only need to add the storage=Docker() argument to our flow code so that it can be executed by our serverless Kubernetes agent:
from prefect.environments.storage import Docker
from prefect import Flow, task
import pandas as pd


def score_check(grade, subject, student):
    """
    This is a normal "business logic" function which is not a Prefect task.
    If a student achieved a score > 90, multiply it by 2 for their effort! But only if the subject is not NULL.
    :param grade: number of points on an exam
    :param subject: school subject
    :param student: name of the student
    :return: final nr of points
    """
    if pd.notnull(subject) and grade > 90:
        new_grade = grade * 2
        print(f'Doubled score: {new_grade}, Subject: {subject}, Student name: {student}')
        return new_grade
    else:
        return grade


@task
def extract():
    """ Return a dataframe with students and their grades"""
    data = {'Name': ['Hermione', 'Hermione', 'Hermione', 'Hermione', 'Hermione',
                     'Ron', 'Ron', 'Ron', 'Ron', 'Ron',
                     'Harry', 'Harry', 'Harry', 'Harry', 'Harry'],
            'Age': [12] * 15,
            'Subject': ['History of Magic', 'Dark Arts', 'Potions', 'Flying', None,
                        'History of Magic', 'Dark Arts', 'Potions', 'Flying', None,
                        'History of Magic', 'Dark Arts', 'Potions', 'Flying', None],
            'Score': [100, 100, 100, 68, 99,
                      45, 53, 39, 87, 99,
                      67, 86, 37, 100, 99]}
    df = pd.DataFrame(data)
    return df


@task(log_stdout=True)
def transform(x):
    x["New_Score"] = x.apply(lambda row: score_check(grade=row['Score'],
                                                     subject=row['Subject'],
                                                     student=row['Name']), axis=1)
    return x


@task(log_stdout=True)
def load(y):
    old = y["Score"].tolist()
    new = y["New_Score"].tolist()
    print(f"ETL finished. Old scores: {old}. New scores: {new}")


with Flow("basic-prefect-etl-flow",
          storage=Docker(registry_url="<YOUR_ECR_REGISTRY_ID>.dkr.ecr.eu-central-1.amazonaws.com",
                         python_dependencies=["pandas==1.1.0"],
                         image_tag='latest')) as flow:
    extracted_df = extract()
    transformed_df = transform(extracted_df)
    load(transformed_df)

if __name__ == '__main__':
    flow.register(project_name='Medium_AWS_Prefect')
Some important notes on the code above:
- under the hood, Prefect checks your Prefect version and extends the corresponding Prefect Docker image to include your flow and its dependencies,
- make sure to set your ECR registry ID so that your flow will be dockerized and pushed to the ECR repository we created earlier,
- image_tag='latest' is used to disable versioning of your ECR images. Setting it to latest, or any other fixed tag, ensures that every time you register a flow, you overwrite the previous ECR image tagged as latest. This works well for me, because I already use Git to version-control my code and I don't need ECR versioning. It can also save you some money on AWS, since you pay for the storage of every new image version pushed to ECR. But feel free to skip this argument; Prefect will then tag the image with the current date and time and store each flow version under a different tag.
One last note: the extra argument python_dependencies=["pandas==1.1.0"] allows you to define a list of Python packages that need to be installed within the container. If you need more fine-grained control over your image, you can provide a path to a custom Dockerfile, e.g. dockerfile='/Users/anna/my_flows/aws/Dockerfile'.
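For reference, such a custom Dockerfile would typically extend a Prefect base image and add your own dependencies; the image tag and package below are assumptions, so pin them to the versions you actually use:

```dockerfile
# Assumed base image tag -- match it to the Prefect version you run locally
FROM prefecthq/prefect:0.13.4-python3.8

# Extra system or Python dependencies your tasks need
RUN pip install pandas==1.1.0
```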
Deploying a new flow version
We can now mark the previous flow run as Finished or Failed:
Finally, we can register a new version of the flow by simply rerunning the modified script, as shown in the second Gist:
➜ python3 basic-prefect-etl-flow.py
INFO - prefect.Docker | Building the flow's Docker storage...
Successfully built c7e0a5c78dc3
Successfully tagged 123456789.dkr.ecr.eu-central-1.amazonaws.com/basic-prefect-etl-flow:latest
[2020-08-24 21:49:44] INFO - prefect.Docker | Pushing image to the registry...
Pushing [==================================================>] 203.2MB/199.6MB
Flow: https://cloud.prefect.io/anna/flow/3d7b86b7-c812-469c-8de5-efe0ffbe82d8
Checking the flow progress in the UI
If we visit the link from the output, we can see that a Version 2 has been created. If we run it, we can see the flow transition to the state Submitted for execution (yellow), then Running (blue), and finally All reference tasks succeeded (green):
Viewing the progress from logs & CLI
The logs display more information:
We can also inspect the pods by using kubectl:
➜ ~ kubectl get pods
NAME                             READY   STATUS    RESTARTS   AGE
prefect-agent-68785f47d4-pv9kt   1/1     Running   0          161m
prefect-job-40edeff2-5pvc6       0/1     Pending   0          5s

➜ ~ kubectl get pods
NAME                             READY   STATUS    RESTARTS   AGE
prefect-agent-68785f47d4-pv9kt   1/1     Running   0          163m
prefect-job-40edeff2-5pvc6       1/1     Running   0          118s

➜ ~ kubectl get pods
NAME                             READY   STATUS      RESTARTS   AGE
prefect-agent-68785f47d4-pv9kt   1/1     Running     0          3h
prefect-job-40edeff2-5pvc6       0/1     Completed   0          3m
To summarize what we did so far:
- we created a Prefect Cloud account
- we deployed a serverless Kubernetes cluster by using AWS EKS
- we created ECR repositories and pushed a Docker image containing our flow code and all Python package dependencies to ECR
- we then registered our flow, ran it from the Prefect Cloud UI, and inspected its status in the UI and in the CLI using kubectl.
If you came that far, congratulations! 👏🏻
In the next section, we will create a second flow which will make use of a distributed Dask cluster.
On-demand distributed Dask cluster to parallelize your data pipeline
Prefect Cloud works great with Dask Distributed. In order to run your Python code on Dask in a distributed fashion, you would typically have to deploy a Dask cluster with several worker nodes and maintain it. Prefect provides a great abstraction, DaskKubernetesEnvironment, which:
- spins up an on-demand Dask cluster across multiple pods, and possibly also across multiple nodes (you can specify the min and max number of workers),
- submits your flow to this on-demand cluster,
- cleans up the resources (i.e. terminates the cluster after the job is finished).
Here is an example flow based on the Prefect Docs that you can use to test your Dask setup. I saved this flow as dask-k8.py and used the same name, dask-k8, as the flow name and as the name of the ECR repository:
from prefect.environments.storage import Docker
from prefect.environments import DaskKubernetesEnvironment
from prefect import task, Flow
import random
from time import sleep


@task
def inc(x):
    sleep(random.random() / 10)
    return x + 1


@task
def dec(x):
    sleep(random.random() / 10)
    return x - 1


@task
def add(x, y):
    sleep(random.random() / 10)
    return x + y


@task(log_stdout=True)
def list_sum(arr):
    return sum(arr)


with Flow("dask-k8") as flow:
    random.seed(123)
    incs = inc.map(x=range(100))
    decs = dec.map(x=range(100))
    adds = add.map(x=incs, y=decs)
    total = list_sum(adds)

if __name__ == '__main__':
    flow.storage = Docker(registry_url="<YOUR_ECR_REGISTRY_ID>.dkr.ecr.eu-central-1.amazonaws.com", image_tag='latest')
    flow.environment = DaskKubernetesEnvironment(min_workers=3, max_workers=5)
    flow.register(project_name="Medium_AWS_Prefect")
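As a sanity check, the result this mapped flow should produce can be computed in plain Python, since inc and dec are applied element-wise over range(100) and the resulting pairs are summed:

```python
# What the mapped flow computes, without Dask: inc and dec run element-wise
# over range(100), add sums the pairs, and list_sum reduces the list.
incs = [x + 1 for x in range(100)]
decs = [x - 1 for x in range(100)]
adds = [x + y for x, y in zip(incs, decs)]
total = sum(adds)
print(total)
# 9900
```

Dask only changes where and how concurrently these tasks run, not the result.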
We now register the flow and trigger it from the UI again:
➜ python3 dask-k8.py
Now we can observe the pods that are created by the Prefect Kubernetes agent:
➜ ~ kubectl get pods --watch
NAME READY STATUS RESTARTS AGE
prefect-agent-68785f47d4-pv9kt 1/1 Running 0 3h29m
prefect-job-40edeff2-5pvc6 0/1 Completed 0 47m
prefect-job-8c91b00f-x98zt 0/1 Pending 0 0s
prefect-job-8c91b00f-x98zt 0/1 Pending 0 1s
prefect-job-8c91b00f-x98zt 0/1 Pending 0 87s
prefect-job-8c91b00f-x98zt 0/1 ContainerCreating 0 87s
prefect-job-8c91b00f-x98zt 1/1 Running 0 119s
prefect-dask-job-a9405c90-9d84-4766-a6d0-ac4fcdfd4652-vt4zl 0/1 Pending 0 0s
prefect-dask-job-a9405c90-9d84-4766-a6d0-ac4fcdfd4652-vt4zl 0/1 Pending 0 1s
prefect-job-8c91b00f-x98zt 0/1 Completed 0 2m11s
prefect-dask-job-a9405c90-9d84-4766-a6d0-ac4fcdfd4652-vt4zl 0/1 Pending 0 82s
prefect-dask-job-a9405c90-9d84-4766-a6d0-ac4fcdfd4652-vt4zl 0/1 ContainerCreating 0 82s
prefect-dask-job-a9405c90-9d84-4766-a6d0-ac4fcdfd4652-vt4zl 1/1 Running 0 119s
dask-root-19a2b157-cv8tvz 0/1 Pending 0 0s
dask-root-19a2b157-ch8m8n 0/1 Pending 0 0s
dask-root-19a2b157-cblr4q 0/1 Pending 0 0s
dask-root-19a2b157-ch8m8n 0/1 Pending 0 1s
dask-root-19a2b157-cblr4q 0/1 Pending 0 2s
dask-root-19a2b157-cv8tvz 0/1 Pending 0 2s
Note that we can now see several pods related to Dask. The UI shows the current progress of all tasks, running in parallel, and the final runtime:
If we analyze the logs, we can see that:
- the flow was submitted for execution at 12:43
- the flow run began at 12:47
- the flow finished at 12:51.
This means that it took around 4 minutes until the flow was picked up by the Kubernetes agent and the Dask cluster was provisioned on our serverless AWS EKS cluster, including pulling all necessary images. After that, the flow itself needed 4 minutes to run successfully.
If the latency of a Dask cluster being created on-demand is not acceptable for your workloads, you can create and maintain your own Dask cluster and adapt the code as follows:
from prefect.engine.executors import DaskExecutor

executor = DaskExecutor(address="255.255.255.255:8786")

with Flow("dask-k8", executor=executor) as flow:
    ...

flow.register(project_name="Medium_AWS_Prefect")
This way, you would use the Executor rather than the Environment abstraction. Also, you would have to adapt 255.255.255.255 to your Dask scheduler address and change the port 8786 accordingly, if needed.
Cleaning up the resources
Before we wrap up, make sure to delete the AWS EKS cluster and the ECR repositories to avoid any charges:
eksctl delete cluster -n fargate-eks --wait
aws ecr delete-repository --repository-name dask-k8
aws ecr delete-repository --repository-name basic-etl-prefect-flow
Conclusion
In this article, we used AWS EKS on Fargate to create a serverless Kubernetes cluster on AWS. We connected it in a secure way to Prefect Cloud as our execution layer. Then, we dockerized our Prefect flows and pushed the images to ECR by using the DockerStorage abstraction. Finally, we deployed both a simple data pipeline running within a single pod and a distributed Dask flow, allowing for a high level of parallelism in your ETL & ML code.
Along the way, we identified the differences, as well as the pros and cons of running data pipelines on already available resources that we need to maintain vs. running your containerized flows in a serverless way.
Hopefully, this setup will make it easier to start orchestrating your ETL & Data Science workflows. Regardless of whether you are a start-up, large enterprise, or a student running code for a thesis, the combination of Prefect & AWS EKS on Fargate allows you to move your data projects to production faster than ever before.
Thank you for reading & have fun on your data journey!