Introduction to Apache Airflow
What is Apache Airflow?
Airflow is a platform to programmatically author, schedule, and monitor workflows. It does this by representing each workflow as a Directed Acyclic Graph (DAG) of tasks. Airflow is open source. It was started at Airbnb in 2014 and later entered the Apache Incubator. While still an incubator project it had already earned an excellent reputation, with approximately 800 contributors and 13,000 stars on GitHub; it has since graduated to a top-level Apache project.
Apache Airflow is a workflow (data pipeline) management system originally developed at Airbnb. It is used by more than 200 companies, including Airbnb, Yahoo, PayPal, Intel, and Stripe.
In Airflow, everything revolves around workflow objects implemented as directed acyclic graphs (DAGs). For example, such a workflow can merge multiple data sources and then run an analysis script. Airflow schedules the tasks while respecting their internal dependencies, and it orchestrates the systems involved.
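To make the DAG idea concrete, here is a minimal sketch that uses only Python's standard library, not Airflow itself. The task names are hypothetical; the point is just that once dependencies are written down as a graph, a valid run order falls out of them.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks it depends on.
# All task names here are hypothetical.
dependencies = {
    "load_sales": set(),
    "load_customers": set(),
    "merge_sources": {"load_sales", "load_customers"},
    "run_analysis": {"merge_sources"},
}


def execution_order(deps):
    """Return one valid order in which the tasks can run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The two load tasks can come out in either order, but the merge always follows both of them and the analysis always comes last. If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is why the graph has to be acyclic.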
What is a Workflow?
A workflow is a sequence of tasks that is started on a schedule or triggered by an event. Workflows are frequently used to handle big data processing pipelines.
A typical workflow diagram
A typical workflow has five phases:
First, we download data from the source.
Then, we send that data somewhere else to be processed.
The data is processed.
When processing is complete, we get the result and a report is generated.
Finally, the report is sent by email.
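As a rough illustration, the phases can be written as plain Python functions chained together. This is only a sketch: every function name is hypothetical, the data is made up in memory, and nothing is actually downloaded or emailed.

```python
def download(source):
    """Phase 1: fetch raw records (stubbed with in-memory data)."""
    return [{"source": source, "value": v} for v in (3, 5, 8)]


def send_for_processing(records):
    """Phase 2: hand the data to another system (stubbed as a copy)."""
    return list(records)


def process(records):
    """Phase 3: compute a result from the records."""
    return sum(r["value"] for r in records)


def build_report(total):
    """Phase 4: generate a report from the result."""
    return f"Total value: {total}"


def send_email(report, recipient):
    """Phase 5: deliver the report (stubbed: returns the message instead of sending)."""
    return {"to": recipient, "body": report}


def run_workflow(source, recipient):
    """Run the five phases in order, passing each result to the next phase."""
    records = download(source)
    staged = send_for_processing(records)
    total = process(staged)
    report = build_report(total)
    return send_email(report, recipient)


message = run_workflow("example-source", "team@example.com")
print(message["body"])  # prints "Total value: 16"
```

A workflow tool takes over the part that `run_workflow` does by hand here: deciding when to start, what order to run in, and what to do when a phase fails.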
Working of Apache Airflow
Four main components make up this robust and scalable workflow scheduling platform:
Scheduler: The scheduler monitors all DAGs and their associated tasks. It periodically checks for tasks that are ready to be started.
Web server: The web server is Airflow’s user interface. It shows the status of jobs and lets the user interact with the database and read log files from remote file stores such as Google Cloud Storage and Microsoft Azure Blob Storage.
Database: The state of the DAGs and their associated tasks is saved in the database so that the scheduler retains this metadata between runs. Airflow uses SQLAlchemy, an Object Relational Mapping (ORM) library, to connect to the metadata database. The scheduler examines all of the DAGs and stores pertinent information, such as schedule intervals, statistics from each run, and task instances.
Executor: Different types of executors are available for different use cases. Examples of executors:
SequentialExecutor: This executor runs a single task at a time and cannot run tasks in parallel. It’s helpful for testing and debugging.
LocalExecutor: This executor runs tasks in parallel as separate processes on one machine. It’s well suited to running Airflow on a local machine or a single node.
CeleryExecutor: This executor is the favored way to run a distributed Airflow cluster.
KubernetesExecutor: This executor calls the Kubernetes API to create a temporary pod for each task instance to run in.
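The difference between running one task at a time and running several at once can be sketched with Python's standard library. This is not how Airflow's executors are implemented. It is only an analogy, and it uses a thread pool for simplicity even though a local executor would normally use separate processes. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def task(n):
    """Stand-in for a unit of work; squares its input."""
    return n * n


def run_sequentially(inputs):
    """Run one task at a time, loosely like a sequential executor."""
    return [task(n) for n in inputs]


def run_in_parallel(inputs, max_workers=4):
    """Run several tasks at once on a single machine."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, inputs))  # map() preserves input order


inputs = [1, 2, 3, 4]
print(run_sequentially(inputs))  # prints [1, 4, 9, 16]
print(run_in_parallel(inputs))   # prints [1, 4, 9, 16]
```

Both versions produce the same results; what changes is how many tasks are in flight at once. That is the same trade-off you make when choosing between executors.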
So, how does Airflow work?
Airflow examines all the DAGs in the background at a regular interval.
This interval is set using the processor_poll_interval config option and defaults to one second.
Task instances are instantiated for tasks that need to be performed, and their status is set to SCHEDULED in the metadata database.
The scheduler queries the database, retrieves tasks in the SCHEDULED state, and distributes them to the executors.
Then, the state of the task changes to QUEUED.
Those queued tasks are drawn from the queue by the workers that execute them.
When this happens, the task status changes to RUNNING.
When a task finishes, the worker marks it as failed or finished, and then the scheduler updates the final status in the metadata database.
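The state changes described above can be traced with a toy simulation. This is a simplified model, not Airflow's scheduler: the `simulate` function and the task names are hypothetical, there is no database, and the "finished" outcome is written as SUCCESS, which is the name Airflow uses for that state.

```python
from collections import deque


def simulate(task_names, failing=()):
    """Walk each task through SCHEDULED -> QUEUED -> RUNNING -> final state.

    Returns a dict mapping each task name to the list of states it passed through.
    """
    # Scheduler creates a task instance for every task that needs to run.
    history = {name: ["SCHEDULED"] for name in task_names}
    queue = deque()

    # Scheduler pass: hand the scheduled tasks to the executor's queue.
    for name in task_names:
        history[name].append("QUEUED")
        queue.append(name)

    # Worker pass: pull tasks off the queue, run them, record the outcome.
    while queue:
        name = queue.popleft()
        history[name].append("RUNNING")
        history[name].append("FAILED" if name in failing else "SUCCESS")

    return history


history = simulate(["extract", "load"], failing={"load"})
for name, states in history.items():
    print(name, " -> ".join(states))
```

In a real deployment each of these transitions is a row update in the metadata database, which is what lets the web server show task status while the work is still in progress.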
Features
Easy to Use: If you know a bit of Python, you are ready to build and deploy workflows on Airflow.
Open Source: It is free and open source, with a lot of active users.
Robust Integrations: It provides ready-to-use operators for working with Google Cloud Platform, Amazon AWS, Microsoft Azure, and others.
Use Standard Python to Code: You can use Python to create workflows from simple to complex, with complete flexibility.
Amazing User Interface: You can monitor and manage your workflows, and check the status of completed and ongoing tasks.
Principles
Dynamic: Airflow pipelines are defined as code (Python), which allows pipelines to be generated dynamically. You can write code that instantiates pipelines on the fly.
Extensible: Easily define your own operators and executors, and extend the library so that it fits the level of abstraction that suits your environment.
Elegant: Airflow pipelines are lean and explicit.
Scalable: It has a modular architecture and uses a message queue to orchestrate an arbitrary number of workers, so it can scale out as the workload grows.
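The "Dynamic" principle is easier to see with a small example. The sketch below does not use Airflow's API. It builds a plain dictionary of task dependencies in a loop, and the function and task names are hypothetical. In a real DAG file the same kind of loop would create operators instead of dictionary entries.

```python
def build_pipeline(tables):
    """Generate a task -> upstream-tasks mapping for any list of table names."""
    pipeline = {}
    for table in tables:
        extract = f"extract_{table}"
        load = f"load_{table}"
        pipeline[extract] = set()    # extract tasks have no upstream dependency
        pipeline[load] = {extract}   # each load waits for its own extract
    # A final task that waits for every load to finish.
    pipeline["publish_report"] = {f"load_{t}" for t in tables}
    return pipeline


pipeline = build_pipeline(["orders", "users"])
for task_name, upstream in pipeline.items():
    print(task_name, "<-", sorted(upstream))
```

Adding a third table means adding one string to the list rather than copying and editing a block of pipeline definition.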