Airflow Source Code Close Reading, Part 5

Latest recommended article published 2023-04-05 16:49:18

dll007

Category: Scheduling Systems. Tags: python

Article link: https://blog.csdn.net/u013010890/article/details/129462521

The scheduling loop can be expressed in pseudocode as follows:

    Step 0. Load available DAG definitions from disk (fill DagBag)

    While the scheduler is running:
        Step 1. The scheduler uses the
        DAG definitions to identify and/or
        initialize any DagRuns in the metadata db.

        Step 2. The scheduler checks the states of
        the TaskInstances associated with active DagRuns,
        resolves any dependencies amongst TaskInstances,
        identifies TaskInstances that need to be executed,
        and adds them to a worker queue, updating the status
        of newly-queued TaskInstances to "queued" in the
        database.

        Step 3. Each available worker pulls a TaskInstance from
        the queue and starts executing it, updating the
        database record for the TaskInstance from "queued"
        to "running".

        Step 4. Once a TaskInstance is finished running, the
        associated worker reports back to the queue
        and updates the status for the TaskInstance
        in the database (e.g. "finished", "failed",
        etc.)

        Step 5. The scheduler updates the states of all active
        DagRuns ("running", "failed", "finished") according
        to the states of all completed associated
        TaskInstances.

        Step 6. Repeat Steps 1-5

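From the flow above, the core process is:

Airflow loads DAG definition files from disk, parses out the DAGs and the Tasks that belong to them, and stores them in an in-memory data structure, the DagBag.

The Scheduler then persists this structure to the metadata database and polls all tasks in the database, checking their dependencies to decide whether each one is ready to run.

If a task is ready, the Scheduler adds it to the scheduling queue and updates its state in the metadata database.

Each Worker then pulls a pending task from the queue, starts executing it, and updates the task state in the metadata database.

When a task ends (success or failure), the Scheduler decides, based on the retry conditions, whether to schedule it again. When all tasks belonging to a DagRun (one run instance of a DAG) have ended, the Scheduler updates the DagRun state.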
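The loop described above can be sketched as a small in-memory simulation. This is only a toy model under stated assumptions: `Task`, `scheduler_pass`, `worker_pass` and `run_dag` are hypothetical names invented for illustration, there is no database or real queue, and every task is assumed to succeed.

```python
from collections import deque
from dataclasses import dataclass, field


# Hypothetical, simplified stand-in; this is not an Airflow class.
@dataclass
class Task:
    task_id: str
    upstream: list = field(default_factory=list)
    state: str = "none"  # none -> queued -> running -> finished


def scheduler_pass(tasks, queue):
    """Step 2: queue every unscheduled task whose upstream tasks have finished."""
    for task in tasks.values():
        deps_met = all(tasks[u].state == "finished" for u in task.upstream)
        if task.state == "none" and deps_met:
            task.state = "queued"
            queue.append(task.task_id)


def worker_pass(tasks, queue, log):
    """Steps 3-4: pull one task from the queue, run it, report the result."""
    if queue:
        task = tasks[queue.popleft()]
        task.state = "running"
        log.append(task.task_id)  # the "work" done in this toy model
        task.state = "finished"


def run_dag(tasks):
    """Repeat the steps until the toy DagRun is finished (assumes an acyclic graph)."""
    queue, log = deque(), []
    dag_run_state = "running"
    while dag_run_state == "running":
        scheduler_pass(tasks, queue)
        worker_pass(tasks, queue, log)
        # Step 5: the DagRun finishes once all of its tasks have finished.
        if all(t.state == "finished" for t in tasks.values()):
            dag_run_state = "finished"
    return dag_run_state, log


tasks = {
    "extract": Task("extract"),
    "transform": Task("transform", upstream=["extract"]),
    "load": Task("load", upstream=["transform"]),
}
state, order = run_dag(tasks)
```

In this sketch a task is only queued after everything upstream of it has finished, so the tasks run in dependency order even though the scheduler and worker passes know nothing about each other beyond the shared state.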

Next, based on Airflow 1.9.0, we break down the workflow in detail from the perspective of the source code.

Airflow Source Code Analysis

Author: runrungo

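Main Components

Scheduler

The scheduler is the central hub of Airflow. It discovers user-defined DAG files, turns each DAG into concrete DagRuns according to its schedule, and monitors task state.

DAG (directed acyclic graph job)

A DAG defines the dependencies between tasks. Tasks are defined through operators; BaseOperator is the parent class of all operators.

DagRun (a run instance of a DAG)

Driven by the scheduler, each DAG is instantiated as DagRuns. Different DagRuns are distinguished by [dag_id + execution date].

TaskInstance (a task instance executed by an operator)

A TaskInstance is a task instance under a DagRun. Specifically, for each DagRun, every operator is turned into a corresponding TaskInstance. Because a task may fail, the scheduler decides whether to retry it according to its definition. Different TaskInstances are distinguished by [dag_id + execution date + operator + try number].

Executor

Every task is run by an executor. BaseExecutor is the parent class of all executors.

LocalTaskJob

Monitors the running task. It holds one important attribute, the task runner.

TaskRunner

Starts a subprocess and executes the task in it.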
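To make the identity rules above concrete, here is a minimal sketch of the two keys. `DagRunKey`, `TaskInstanceKey`, `expand_dag_run` and `retry` are hypothetical names used only for illustration; in Airflow itself these values are stored as database columns, and the starting try number of 1 is an assumption of this sketch.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical key types for illustration; not Airflow classes.
DagRunKey = namedtuple("DagRunKey", ["dag_id", "execution_date"])
TaskInstanceKey = namedtuple(
    "TaskInstanceKey", ["dag_id", "task_id", "execution_date", "try_number"]
)


def expand_dag_run(run_key, task_ids):
    """Create one first-attempt TaskInstance key per operator in the DAG."""
    return [
        TaskInstanceKey(run_key.dag_id, task_id, run_key.execution_date, 1)
        for task_id in task_ids
    ]


def retry(ti_key):
    """A retry keeps the task's identity but increments the try number."""
    return ti_key._replace(try_number=ti_key.try_number + 1)


run = DagRunKey("etl", datetime(2023, 1, 1))
tis = expand_dag_run(run, ["extract", "load"])
second_attempt = retry(tis[0])
```

Two runs of the same DAG differ only in their execution date, while two attempts of the same task within one run differ only in their try number.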

Airflow Scheduler

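cli.py implements the entry points for all Airflow commands. We start from launching the scheduler: running the command airflow scheduler enters the scheduler function.

SchedulerJob accepts the directory path of the DAG files to parse, a dag_id, the number of times to schedule the DAGs, and so on. These parameters serve different purposes, such as unit testing or parsing one specific DAG definition file.

When the scheduler first starts, it needs the directory that holds all DAG definition files; for example, we use a Git directory. While running, the scheduler fans out into several sub-processors, each of which handles a subset of the files.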
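The way a manager spreads files over a bounded number of sub-processors can be sketched as follows. This is a toy model, not the Airflow implementation: `ToyProcessor`, `ToyProcessorManager` and `max_parallel` are hypothetical names, and each "processor" simply finishes after one heartbeat instead of running in a real child process.

```python
class ToyProcessor:
    """Hypothetical file processor that finishes after a fixed number of heartbeats."""

    def __init__(self, path, ticks=1):
        self.path = path
        self.ticks_left = ticks

    @property
    def done(self):
        return self.ticks_left <= 0

    def tick(self):
        self.ticks_left -= 1


class ToyProcessorManager:
    """Hypothetical manager that keeps at most max_parallel processors alive."""

    def __init__(self, paths, max_parallel):
        self.pending = list(paths)
        self.max_parallel = max_parallel
        self.processors = {}  # path -> processor

    def heartbeat(self):
        """Collect finished files, then start processors while slots are free."""
        finished = []
        for path, processor in list(self.processors.items()):
            processor.tick()
            if processor.done:
                finished.append(path)
                del self.processors[path]
        while self.pending and len(self.processors) < self.max_parallel:
            path = self.pending.pop(0)
            self.processors[path] = ToyProcessor(path)
        return finished


manager = ToyProcessorManager(["a.py", "b.py", "c.py"], max_parallel=2)
results = [manager.heartbeat() for _ in range(4)]
```

With three files and two slots, the third file is only picked up once one of the first two processors has finished.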

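The call chain that starts at the scheduler function in cli.py is traced below.

Overall approach of the Scheduler:

The core is four methods executed in sequence, which together cover DAG loading and parsing, task submission and scheduling, and task execution. The walkthrough uses SchedulerJob and CeleryExecutor as the concrete example: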

processor_manager.heartbeat()

self._execute_task_instances()

self.executor.heartbeat()

self._process_executor_events()
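A minimal sketch of how these four calls hand data to one another in a single loop iteration is shown here. `InstantExecutor` and `scheduler_loop_once` are hypothetical names, the parsing step is replaced by a plain callable, and every task is assumed to succeed immediately; the real methods do far more work.

```python
class InstantExecutor:
    """Hypothetical executor: buffers task keys and 'runs' them on heartbeat."""

    def __init__(self):
        self.queued = []        # filled by the scheduler
        self.event_buffer = {}  # task key -> final state, read back by the scheduler

    def queue_command(self, key):
        self.queued.append(key)

    def heartbeat(self):
        # Pretend every queued task is sent to a worker and succeeds at once.
        while self.queued:
            self.event_buffer[self.queued.pop(0)] = "success"


def scheduler_loop_once(parse_files, executor, states):
    # 1. processor_manager.heartbeat(): collect schedulable task keys from parsed files
    scheduled = parse_files()
    # 2. self._execute_task_instances(): hand the scheduled tasks to the executor
    for key in scheduled:
        states[key] = "queued"
        executor.queue_command(key)
    # 3. self.executor.heartbeat(): the executor submits the queued work
    executor.heartbeat()
    # 4. self._process_executor_events(): read back the states the executor reported
    for key, state in executor.event_buffer.items():
        states[key] = state
    executor.event_buffer.clear()
    return states


states = scheduler_loop_once(lambda: ["etl.extract"], InstantExecutor(), {})
```

The point of the sketch is the ordering: the scheduler never runs a task itself, it only queues work in step 2 and learns the outcome in step 4.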

setup.py -> setup(scripts=['airflow/bin/airflow']) -> cli.py -> scheduler(args) {starts jobs.SchedulerJob.run()}
    -> BaseJob.run() -> self._execute() -> SchedulerJob._execute()
    -> BaseJob defines executor_class and starts the configured executor (CeleryExecutor here)
        -> paths = list_py_file_paths -> processor_manager = DagFileProcessorManager(paths, DagFileProcessorFactory)
        -> self._execute_helper(processor_manager)
            -> self.executor.start() -> CeleryExecutor.start().execute_async(cmd, queue) submits tasks to the distributed queue
            -> while(interval) { processor_manager.set_file_paths(update_dag_paths)
            -> simple_dags = processor_manager.heartbeat() -> the processor manager manages the started <path, processor> pairs }
                -> for self._processors.items(): track the execution state of each [task] (running, finished)
                -> simple_dags.append(finished_dag) -> while (concurrency has free slots) { self._processor_factory(file_path) }
                    -> DagFileProcessor.init._launch_process.helper -> SchedulerJob(dag_id_white_list, log=log)
                    -> scheduler_job.process_file(file_path, pickle_dags)
                    -> a new SchedulerJob is started inside the new DagFileProcessor process to parse the given file
                        -> dagbag = models.DagBag(file_path) -> [DAG] array -> dagbag.sync_to_db() -> each DAG has a task dict
                        -> simple_dags(SimpleDag(dag, pickle_id=pickle_id)) -> filter_dags
                        -> for dags: create a dag_run for each DAG defined in code and, within the thread, add its TaskInstances to the dag_run
                            -> self._process_dags(dagbag, dags, ti_keys_to_schedule)
                                -> create_dag_run(dag)
                                    -> DAG.create_dagrun(dag_id, run_id, next_run_date).db_add
                                -> self._process_task_instances(dag, ti_keys_to_schedule)
                                    -> db.dag.getDagruns -> tis = run.get_task_instances(NONE, UP_FOR_RETRY)
                                    -> tis_to_schedule.append(dag.get_task(task_id).are_dependencies_met())
                        -> ti_keys_to_schedule[ti_key] -> is_dep_met -> TaskInstance(state=SCHEDULED).db
            -> SimpleDagBag(simple_dags) -> self._execute_task_instances(simple_dag_bag, SCHEDULED)
                -> executable_tis = self._find_executable_task_instances, then set their state to QUEUED
                -> self._enqueue_task_instances_with_queued_state(executable_tis)
                    -> for tis: ti.genCmd, ti.queue, priority
                        -> self.SchedulerJob.executor.queue_command(ti, cmd, priority, queue)
                            -> self.CeleryExecutor.queued_tasks[key] = (cmd, priority, queue, ti)

            -> self.executor.heartbeat() -> self.CeleryExecutor.heartbeat()
                -> compute open_slots -> queued_tasks.sort
                -> for (min(slots, queue_len)):
                    -> key, (command, _, queue, ti) = sorted_queue.pop(0)
                    -> if not running -> self.execute_async(key, command=command, queue=queue)
                    -> asynchronously submit the task to the Celery cluster (the distributed queue) and keep the returned task handle in tasks: self.tasks[key]
                -> self.sync(): polls the task handles for task state and, depending on the state, calls back success or fail to update it
            -> self._process_executor_events(simple_dag_bag): safety-net logic that handles abnormal states
                -> tasks in an abnormal state -> ti.handle_failure(msg) -> self.TI.email_alert(error, is_retry=True)
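The executor heartbeat in the trace (compute the open slots, sort the queued tasks by priority, submit at most that many) can be sketched like this. It is a simplified model rather than the Airflow code: `SlotLimitedExecutor` is a hypothetical class, `execute_async` only records what it was given instead of talking to Celery, and higher numbers are assumed to mean higher priority.

```python
class SlotLimitedExecutor:
    """Hypothetical model of the heartbeat described in the trace; not Airflow code."""

    def __init__(self, parallelism):
        self.parallelism = parallelism
        self.queued_tasks = {}  # key -> (command, priority, queue)
        self.running = {}       # key -> command
        self.submitted = []     # records what execute_async received

    def queue_command(self, key, command, priority, queue="default"):
        self.queued_tasks[key] = (command, priority, queue)

    def execute_async(self, key, command, queue):
        # A real executor would hand the command to a message broker here.
        self.submitted.append((key, queue))

    def heartbeat(self):
        open_slots = self.parallelism - len(self.running)
        # Highest priority first (an assumption of this sketch).
        sorted_queue = sorted(
            self.queued_tasks.items(), key=lambda item: item[1][1], reverse=True
        )
        for _ in range(min(open_slots, len(sorted_queue))):
            key, (command, _priority, queue) = sorted_queue.pop(0)
            del self.queued_tasks[key]
            self.running[key] = command
            self.execute_async(key, command=command, queue=queue)


executor = SlotLimitedExecutor(parallelism=2)
executor.queue_command("low", "echo low", priority=1)
executor.queue_command("high", "echo high", priority=10)
executor.queue_command("mid", "echo mid", priority=5)
executor.heartbeat()
```

With two slots and three queued commands, the two highest-priority commands are submitted and the remaining one waits for a later heartbeat.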

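How the Worker works

We start from launching an Airflow Worker, using BashTaskRunner as the example. Running the command airflow worker starts a Worker, which is in fact a Celery worker process. A Worker starts several daemon processes according to its concurrency setting so that tasks can run concurrently.

LocalTaskJob runs a single task instance.

A LocalTaskJob runs exactly once for a single TaskInstance and exits when that run finishes.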

airflow worker -> cli.py:worker(args) -> import celery.worker -> worker.worker(app=celery_app).run(**options)
-> once started, the worker receives messages from the task queue; each message is executed by cli.py:run() inside a worker daemon process
-> run(): a separate process
    -> task = dag.get_task(task_id=args.task_id) -> ti = TaskInstance(task, args.execution_date)
    -> local: jobs.LocalTaskJob().run()
        -> BaseJob.run() -> self._execute() -> LocalTaskJob._execute()
            -> get_task_runner(self): BashTaskRunner (script runner) | CgroupTaskRunner
            -> self.task_runner.start() -> self.BashTaskRunner.run_command(['bash', '-c'], join_args=True)
                -> subprocess.Popen(cmd)
            -> while not self.terminating:  # poll the return code of the external task in a loop
                -> self.BaseJob.heartbeat()
    -> args.raw: ti._run_raw_task() -> pre -> result = task_copy.execute(context), which runs the actual code of the operator defined by the class -> post
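The start-then-poll pattern in the local branch of the trace can be sketched with the standard library alone. This is an illustration under assumptions, not the Airflow implementation: `run_local_task` is a hypothetical function, the heartbeat is reduced to a counter, and the Python interpreter stands in for the ['bash', '-c', ...] command so that the sketch runs on any platform.

```python
import subprocess
import sys
import time


def run_local_task(command, heartbeat_interval=0.05):
    """Start the task in a child process and poll its return code until it exits."""
    process = subprocess.Popen(command)  # the task-runner step
    heartbeats = 0
    while process.poll() is None:        # None means the child is still running
        heartbeats += 1                  # stands in for BaseJob.heartbeat()
        time.sleep(heartbeat_interval)
    return process.returncode, heartbeats


# The interpreter replaces bash here purely for portability of the sketch.
return_code, beats = run_local_task(
    [sys.executable, "-c", "import time; time.sleep(0.2)"]
)
failed_code, _ = run_local_task([sys.executable, "-c", "raise SystemExit(3)"])
```

The monitoring job does no task work itself; it only watches the child process and uses the return code to tell success from failure.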
