Not so long ago, if you asked any data engineer or data scientist what tools they use for orchestrating and scheduling their data pipelines, the default answer would likely have been Apache Airflow. Even though Airflow can solve many current data engineering problems, I would argue that for some ETL & Data Science use cases it may not be the best choice.
In this article, I will discuss the pros and cons that I experienced while working with Airflow in the last two years and derive from that the use cases for which Airflow is still a great choice. I hope that by the end of this article, you will be able to determine whether it fits your ETL & Data Science needs.
What are Airflow's strengths?
Community
Undeniably, Apache Airflow has an amazing community. There is a large number of individuals using Airflow and contributing to this open-source project. If you want to solve a particular data engineering problem, the chances are that somebody in the community has already solved that and shared their solution online or even contributed their implementation to the codebase.
Companies placing strategic bets on Airflow
Many companies decided to invest in Apache Airflow and support its growth, among them:
Google with its Cloud Composer GCP service,
Astronomer offering enterprise support in deploying Airflow on Kubernetes,
Polidea heavily contributing to the codebase with many PMC members
GoDataDriven offering Apache Airflow training.
The support from those companies ensures that there are people working full-time to further improve the software, which guarantees long-term stability, support, and training.
Python
The possibility to define your workflows within Python code is incredibly helpful, as it allows you to incorporate almost any custom workflow logic into your orchestration system.
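To illustrate, here is a minimal sketch of a DAG wrapping custom Python logic in a task. The DAG id, function name, and schedule are made up for illustration, and the import path follows the Airflow 1.10.x layout:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def clean_sales_data(**context):
    # Any custom Python logic can live here: pandas transformations,
    # API calls, conditional branching, and so on.
    print(f"Cleaning sales data for {context['execution_date']}")


with DAG(
    dag_id="example_custom_logic",       # illustrative name
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    clean = PythonOperator(
        task_id="clean_sales_data",
        python_callable=clean_sales_data,
        provide_context=True,  # needed in Airflow 1.10.x to receive the context
    )
```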
Extensibility
Airflow allows you to extend the functionality by:
using plugins, e.g. to add extra menu items within the UI,
adding custom Operators or building on top of the existing ones (see the sketch below).
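For example, a custom Operator is just a Python class that subclasses BaseOperator and implements execute(). The sketch below is purely illustrative (the operator name and its endpoint/s3_key parameters are hypothetical), following the Airflow 1.10.x style:

```python
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class MyApiToS3Operator(BaseOperator):
    """Hypothetical operator that pulls data from an internal API and writes it to S3."""

    @apply_defaults
    def __init__(self, endpoint, s3_key, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.endpoint = endpoint
        self.s3_key = s3_key

    def execute(self, context):
        # execute() is the only method a custom Operator has to implement;
        # the actual API call and S3 upload would go here.
        self.log.info("Fetching %s and writing to %s", self.endpoint, self.s3_key)
```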
Wide range of Operators
If you look at the number of Operators available in the Airflow GitHub repository, you will find that Airflow supports a wide range of connectors to external systems. This means that in many cases you will find code templates that you can use to interact with a variety of databases, execution engines, and cloud providers without having to implement the code yourself.
The number of connectors to external systems shows that Airflow can be used as a “glue” that ties together data from many different sources.
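As a small example, submitting SQL to a database takes only a few lines once the corresponding operator is available. In this sketch the connection id my_postgres is an assumption that would have to exist in your Airflow connections, and the import path is the Airflow 1.10.x one (it differs in newer versions):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

with DAG(
    dag_id="example_builtin_operator",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_staging = PostgresOperator(
        task_id="load_staging_table",
        postgres_conn_id="my_postgres",  # assumed to be configured in Airflow
        sql="INSERT INTO staging.orders SELECT * FROM raw.orders;",
    )
```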
What are Airflow's weaknesses?
From the list of advantages listed above, you can see that, overall, Airflow is a great product for data engineering from the perspective of tying many external systems together. The community put in an amazing amount of work building a wide range of features and connectors. However, it has several weak spots that prevent me from truly loving working with it. Some of them may be fixed in future releases, so I discuss the issues as they are at the moment of writing.
No versioning of your data pipelines
These days, when we have version control systems and different versions of our Docker images stored in the Docker registries, we take versioning for granted — it’s a basic feature that simply should be there, no questions asked. However, Airflow still doesn’t have it. If you delete a task from your DAG code and redeploy it, you will lose the metadata related to that task.
Not intuitive for new users
I used Airflow long enough to understand its internals and even to extend its functionality by writing custom components. However, teaching a team of data engineers who hadn't used Airflow before how to use it proved time-consuming, as one needs to learn an entirely new "syntax". Some data engineers considered the entire experience not intuitive.
One prominent example was related to scheduling: many (me included) found it very confusing that Airflow starts scheduling jobs at the end of the scheduling interval. This means that the schedule interval doesn't start immediately, but only when the execution_date reaches start_date + schedule_interval. This plays well with batch ETL jobs that run only once per night, but for jobs that run every 10 minutes it's rather confusing and may result in unexpected bugs when used by new users inexperienced with the tool, especially if the catchup option is not used properly.
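A minimal sketch of what this means in practice (the dates and DAG id are illustrative): with the configuration below, the run for execution_date 2020-01-01 is only triggered around 2020-01-02 00:00, i.e. once start_date + schedule_interval has passed:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

with DAG(
    dag_id="example_schedule_semantics",
    start_date=datetime(2020, 1, 1),  # the 2020-01-01 interval runs only after it has ended
    schedule_interval="@daily",
    catchup=False,                    # don't backfill every interval since start_date
) as dag:
    DummyOperator(task_id="noop")
```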
Configuration overload right from the start + hard to use locally
In order to start using Airflow on a local machine, a data professional new to the tool needs to learn:
the scheduling logic built into the product — such as the nuances mentioned above related to the start date, execution date, schedule interval, and catchup,
an entire set of concepts and configuration details — operators vs. tasks, executors, DAGs, default arguments, airflow.cfg, the Airflow metadata DB, the home directory for deploying DAGs, …
Plus, if you are a Windows user, you really can't use the tool locally unless you use docker-compose files which are not even part of the official Airflow repository — many people use the puckel/docker-airflow setup. It's all doable, but I wish it were more intuitive and easier for new users.
I know that Airflow released an official Docker image in the last months, but what is still missing is an official docker-compose file where new users (especially Windows users) could get a full basic setup, together with a metadata database container and a bind mount to copy their DAGs into the container. An official docker-compose file would be very helpful for running Airflow locally on Windows.
If you use Astronomer's paid version of Airflow, you could use the astro CLI, which mitigates the problem of local testing to some extent.
Setting up Airflow architecture for production is NOT easy
In order to obtain a production-ready setup, you really have two choices:
Celery Executor: if you choose this option you need to know how Celery works + you need to be familiar with RabbitMQ or Redis as your message broker in order to set up and maintain the worker queues that can execute your Airflow pipelines. To the best of my knowledge, there are no official tutorials or deployment recipes directly from Airflow to make this scale-out process easier for the users. I personally learned it from this blog article (kudos for sharing!). Overall, I wish that this setup was easier for the users, or at least that Airflow would provide some official docs on how to set it up properly.
Kubernetes Executor: this executor is relatively new compared to Celery, but it allows you to leverage the power of Kubernetes to automatically scale your workers (even down to zero!) and to manage all the Python package dependencies in a robust way because everything must be containerized to work on Kubernetes. However, also in this regard, I did not find much support in the official docs on how to properly set it up and maintain it.
My experience in setting up Airflow on AWS for the company I worked at was that you can either:
hire some external consultant to do it for you
get a paid version from Google (Cloud Composer) or from Astronomer.io
or you can resort to trial and error, cross your fingers, and hope it won't break.
Overall, Airflow's architecture includes many components and setup steps, such as:
- the scheduler,
- the webserver,
- the metadata database,
- worker nodes,
- the executor,
- a message broker + Celery + Flower if choosing the Celery executor,
- possibly some shared volume such as AWS EFS for common DAG storage between the worker nodes,
- setting up the values in airflow.cfg properly,
- configuring the log storage, e.g. to S3, ideally with some lifecycle policy, since you usually don't need to look at very old logs or pay for their storage,
- registering a domain for the UI,
- adding some monitoring to prevent your metadata database and worker nodes from exceeding their compute capacity and storage,
- adding some Auth layer for the UI + database user management for access to the metadata database.
Those are MANY components to maintain and to ensure that they all work well together, and it seems that the open-source version of Airflow doesn’t make this setup easy for the users.
From my experience so far, choosing Astronomer seems to be the easiest choice if you want to use Airflow in production (especially if you use AWS or Azure and not GCP), as you get plenty of features added on top, such as monitoring of your nodes, pulling logs to one central place, Auth layer (and integration with Active Directories), support, SLA and the team from Astronomer will maintain at least some of the components listed above.
Lack of data sharing between tasks encourages non-atomic tasks
There is currently no natural "Pythonic" way of sharing data between tasks in Airflow other than by using XComs, which were designed to only share small amounts of metadata (there are plans on the roadmap to introduce functional DAGs, so data sharing might get somewhat better in the future).
A task is meant to be a basic, atomic unit of work in a data pipeline. Because there is no easy way of sharing data between tasks in Airflow, instead of tasks being atomic, i.e. responsible for only one thing (e.g. only extracting the data), people often tend to use entire scripts as tasks, such as a script doing the entire ETL (triggered with a BashOperator, e.g. "python stage_adwords_etl.py"), which in turn makes maintenance more difficult because you need to debug an entire script (the full ETL) instead of a small atomic task (e.g. only the "Extract" part).
If your tasks are not atomic, you can’t just retry the Load part of ETL when it fails — you need to retry the entire ETL.
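A sketch of the alternative: keep the tasks atomic and pass only a small reference between them via XCom (here a hypothetical S3 key), so that a failed "Load" can be retried without re-running "Extract":

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def extract():
    # Pretend the extracted data was written to this (hypothetical) S3 key;
    # the return value is automatically pushed to XCom.
    return "staging/adwords/2020-08-01.parquet"


def load(**context):
    # Pull only the small reference, not the data itself.
    s3_key = context["ti"].xcom_pull(task_ids="extract")
    print(f"Loading {s3_key} into the warehouse")


with DAG(
    dag_id="example_atomic_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(
        task_id="load",
        python_callable=load,
        provide_context=True,  # Airflow 1.10.x: pass the context to the callable
    )
    extract_task >> load_task
```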
Scheduler as a bottleneck
If you worked with Airflow before, you may have experienced that after hitting the Trigger DAG button in the UI, you need to wait quite a long time before you can see that the task really starts running.
The scheduler often needs up to several minutes before a task is scheduled and picked up by a worker process for execution; at least that was the case when I was using Airflow deployed on EC2 earlier this year. The Airflow community is working on improving the scheduler, so I hope it will become more performant in the next releases, but at the time of writing this bottleneck prevents applying Airflow to use cases where this latency is not acceptable or desirable.
Use cases for which Airflow is still a good option
In this article, I highlighted several times that Airflow works well when all it needs to do is to schedule jobs that:
- run on external systems such as Spark, Hadoop, Druid, or some external cloud services such as AWS SageMaker, AWS ECS, or AWS Batch,
- submit SQL code to some in-memory database.
Airflow was not designed to execute any workflows directly; it was designed to schedule them and to keep the execution within external systems.
This implies that Airflow is still a good choice if your task is, for instance, to submit a Spark job and store the data on a Hadoop cluster or to execute some SQL transformation in Snowflake or to trigger a SageMaker training job.
To give you an example: imagine a company where data engineers are creating ETL jobs in Pentaho Data Integration and they are using the CeleryExecutor to orchestrate BashOperator tasks on an AWS EC2 instance. Those jobs are not dockerized and the task is just to schedule a bash command to run on a particular server. Airflow works well in this use case.
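A sketch of such a DAG, where Airflow merely schedules a bash command and the actual ETL runs inside the external tool (the script path and DAG id are illustrative):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

with DAG(
    dag_id="example_external_etl",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Airflow only tracks the state of this command; the data flow itself
    # lives entirely in the external ETL tool.
    run_etl = BashOperator(
        task_id="stage_adwords_etl",
        bash_command="python /opt/etl/stage_adwords_etl.py",
    )
```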
If all you need to do in your workflow system is to submit some bash command to an external system and your actual data flow is defined within Spark, SageMaker, or, as in the example above, in Pentaho Data Integration, then Airflow should work quite well for you, because data dependencies are managed by those external systems and Airflow only needs to manage the state dependencies between the tasks. The same is true if you use in-memory databases such as Snowflake, Exasol, or SAP HANA, where the actual work is executed within those databases and your workflow orchestration system simply submits queries to them.
Use cases for which Airflow may not be the best option
If you want your workflow system to work closely together with your execution layer and to be able to pass data between tasks within Python code, then Airflow may not be the best choice.
Airflow is only able to pass state dependencies between tasks (plus perhaps some metadata through XComs) and NOT data dependencies. This implies that if you build your workflows mainly in Python and you have a lot of data science use cases, which by their nature heavily rely on data sharing between tasks, other tools, such as Prefect, may work better for you.
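For comparison, here is a minimal sketch of how data passes between tasks in Prefect, using the 0.x API that was current at the time of writing; task outputs are handed to downstream tasks as ordinary Python objects:

```python
from prefect import task, Flow


@task
def extract():
    return [1, 2, 3]


@task
def transform(data):
    return [x * 10 for x in data]


@task
def load(data):
    print(f"Loading {data}")


with Flow("example_data_sharing") as flow:
    # Calling tasks inside the Flow context builds the dependency graph
    # and wires the outputs directly into downstream tasks.
    load(transform(extract()))

flow.run()
```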
Those are the use cases for which Prefect may be a better choice than Airflow:
if you need to share data between tasks,
if you need versioning of your data pipelines → at the time of writing, Airflow doesn't support that,
if you would like to parallelize your Python code with Dask — Prefect supports Dask Distributed out of the box,
if you need to run dynamic parametrized data pipelines → in theory, you could get around it in Airflow, but by default, dynamic pipelines are not within Airflow's main scope,
if Airflow's scheduler latency is not acceptable for your workloads, you may find Prefect more suitable to your needs,
if you want a seamless experience when testing workflow code locally,
lastly, if you prefer an easier setup than maintaining all the Airflow components I mentioned previously, you may opt for Prefect Cloud.
I showed one possible way of setting it up in this article.
Conclusion
In this article, we discussed the pros and cons of Apache Airflow as a workflow orchestration solution for ETL & Data Science. After analyzing its strengths and weaknesses, we can conclude that Airflow is a great choice as long as it is used for the purpose it was designed for, i.e. to orchestrate work that is executed on external systems such as Apache Spark, Hadoop, Druid, cloud services, or external servers (e.g. distributed with Celery queues), or when submitting SQL code to high-performance distributed databases such as Snowflake, Exasol, or Redshift.
However, Airflow is not designed to execute your data pipelines directly, so if your ETL & Data Science code needs to pass data between tasks, needs to be dynamic and parametrized, needs to run in parallel using Dask, or requires a low-latency scheduler, then you may prefer other tools such as Prefect.
I hope this article helped you to determine whether Apache Airflow suits your current ETL & Data Science needs. Due to the lack of really good alternatives to Airflow other than Prefect, I tried to determine if Prefect can fill the gap for the use cases when Airflow may not be good enough. If you wish, I could create a more detailed comparison between the two workflow platforms: let me know in the comments!
If you want to look at other possible workflow management tools, you may also have a look at Dagster or Argo.
Thank you for reading & have fun on your data journey!