Airflow Source Code Close Reading, Part 7


dll007


Category: Scheduling Systems. Tags: python

Article link: https://blog.csdn.net/u013010890/article/details/129462550


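Executor: the task executor

An executor is the component that runs tasks. Each executor has a parallelism setting, which caps the number of tasks that can be running at the same time.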

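The tasks held by an executor fall into three groups:

Tasks that have not started yet (self.queued_tasks)

Tasks that are currently running (self.running)

Tasks that have finished (self.event_buffer)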
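
The three groups can be pictured with a minimal sketch. This is not Airflow's BaseExecutor: the class name ToyExecutor and the methods queue_command, heartbeat and finish are hypothetical, and only the attribute names queued_tasks, running and event_buffer come from the text above. It assumes that the parallelism limit decides how many queued tasks may move to running on each pass.

```python
from collections import OrderedDict


class ToyExecutor:
    """Hypothetical stand-in for an executor; not Airflow's BaseExecutor."""

    def __init__(self, parallelism=2):
        self.parallelism = parallelism      # max tasks running at once
        self.queued_tasks = OrderedDict()   # key -> command, not started yet
        self.running = {}                   # key -> command, currently running
        self.event_buffer = {}              # key -> final state, finished

    def queue_command(self, key, command):
        self.queued_tasks[key] = command

    def heartbeat(self):
        # Move queued tasks into `running` while there are free slots.
        open_slots = self.parallelism - len(self.running)
        for _ in range(min(open_slots, len(self.queued_tasks))):
            key, command = self.queued_tasks.popitem(last=False)
            self.running[key] = command

    def finish(self, key, state):
        # A finished task leaves `running` and is recorded in the buffer.
        self.running.pop(key)
        self.event_buffer[key] = state


executor = ToyExecutor(parallelism=2)
for name in ("a", "b", "c"):
    executor.queue_command(name, "echo " + name)
executor.heartbeat()             # "a" and "b" start, "c" stays queued
executor.finish("a", "success")  # "a" moves to the event buffer
```

With a parallelism of 2, the third task stays in queued_tasks until a later heartbeat finds a free slot.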

The executor subclasses include:

the Celery executor

the local executor

the debug executor

CeleryExecutor

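Celery is a task queue used to execute tasks in a distributed fashion.

CeleryExecutor runs tasks remotely in a multi-process model. Its workers can be spread across different servers and are connected through a queue that all of them can reach (Redis or another message queue), which gives a distributed producer-consumer pattern: CeleryExecutor submits each task it receives to the distributed Celery task queue, and Celery workers consume the tasks from their assigned queues and execute the command each one carries.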
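
The producer-consumer pattern described above can be sketched with the standard library alone. This is only an illustration, under the assumption that an in-process queue.Queue can stand in for the Redis or MQ broker and threads for Celery workers; the names broker, consume, demo_dag and the task ids are hypothetical, and no command is actually executed.

```python
import queue
import threading

broker = queue.Queue()   # stands in for the shared Redis / MQ queue
done = []                # (worker name, command) pairs, appended by consumers
done_lock = threading.Lock()


def consume(worker_name):
    # Consumer side: a "worker" pulling commands off the shared queue.
    while True:
        command = broker.get()
        if command is None:       # sentinel: shut this worker down
            break
        with done_lock:
            done.append((worker_name, command))


workers = [threading.Thread(target=consume, args=("worker-%d" % i,))
           for i in range(2)]
for w in workers:
    w.start()

# Producer side: the "executor" only enqueues; it never runs the command.
for task_id in ("extract", "transform", "load"):
    broker.put("airflow run demo_dag %s 2023-01-01 --local" % task_id)
for _ in workers:
    broker.put(None)              # one sentinel per worker
for w in workers:
    w.join()
```

Which worker picks up which command is not deterministic, and that is the point of the pattern: the producer does not need to know where a task ends up running.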

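'''
To start the celery worker, run the command:
airflow worker

Celery configuration follows.
'''
# Imports as laid out in Airflow 1.x
import subprocess
import time

from celery import Celery
from celery import states as celery_states

from airflow import configuration
from airflow.config_templates.default_celery import DEFAULT_CELERY_CONFIG
from airflow.exceptions import AirflowException
from airflow.executors.base_executor import BaseExecutor
from airflow.utils.log.logging_mixin import LoggingMixin
from airflow.utils.module_loading import import_string
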
if configuration.has_option('celery', 'celery_config_options'):
    celery_configuration = import_string(
        configuration.get('celery', 'celery_config_options')
    )
else:
    celery_configuration = DEFAULT_CELERY_CONFIG

app = Celery(
    configuration.get('celery', 'CELERY_APP_NAME'),
    config_source=celery_configuration)


@app.task
def execute_command(command):
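    """
        The @app.task decorator turns this function into a Celery Task object.
        A Task object provides two core features: sending the task message to
        the queue, and declaring the function that a worker runs once it
        receives that message.
        command has the form:
        airflow run <dag_id> <task_id> <execution_date> --local --pool <pool> -sd <python_file>
        subprocess.check_call works like subprocess.call, except that it
        returns 0 when the command succeeds and raises
        subprocess.CalledProcessError otherwise.
        CalledProcessError carries the attributes returncode, cmd and output:
        returncode is the exit code of the child process, cmd is the command
        that was run, and output is None.
        So an abnormal exit of the child process surfaces as an error here.
    """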
    log = LoggingMixin().log
    log.info("Executing command in Celery: %s", command)
    try:
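        # check_call only checks the exit status of the command.
        # If the result of the task needs to be processed, another mechanism
        # (such as a callback) is required to obtain it.
        # shell=True runs the command through the shell.
        # check_call(command, shell=True) executes the command locally,
        # for example: check_call(["ls", "-l"])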
        subprocess.check_call(command, shell=True)
    except subprocess.CalledProcessError as e:
        log.error(e)
        raise AirflowException('Celery command failed')


class CeleryExecutor(BaseExecutor):
    """
    CeleryExecutor is recommended for production use of Airflow. It allows
    distributing the execution of task instances to multiple worker nodes.

    Celery is a simple, flexible and reliable distributed system to process
    vast amounts of messages, while providing operations with the tools
    required to maintain such a system.
    """
    def start(self):
        self.tasks = {}
        self.last_state = {}

    # execute_command is the Celery task defined above
    def execute_async(self, key, command,
                      queue=DEFAULT_CELERY_CONFIG['task_default_queue']):
        self.log.info("[celery] queuing {key} through celery, "
                      "queue={queue}".format(**locals()))
        # Submit the command to the Celery cluster asynchronously with
        # apply_async and keep the returned task handle in self.tasks.
        self.tasks[key] = execute_command.apply_async(
            args=[command], queue=queue)
        self.last_state[key] = celery_states.PENDING

    # Synchronize task states and handle each state accordingly.
    # The scheduler polls the task handles through sync to read their states,
    # and calls success or fail to record the outcome.
    def sync(self):
        self.log.debug("Inquiring about %s celery task(s)", len(self.tasks))
        # The handle is named async_result because `async` is a reserved
        # word from Python 3.7 onwards.
        for key, async_result in list(self.tasks.items()):
            try:
                state = async_result.state
                if self.last_state[key] != state:
                    if state == celery_states.SUCCESS:
                        self.success(key)
                        del self.tasks[key]
                        del self.last_state[key]
                    elif state == celery_states.FAILURE:
                        self.fail(key)
                        del self.tasks[key]
                        del self.last_state[key]
                    elif state == celery_states.REVOKED:
                        self.fail(key)
                        del self.tasks[key]
                        del self.last_state[key]
                    else:
                        self.log.info("Unexpected state: %s", async_result.state)
                    self.last_state[key] = async_result.state
            except Exception as e:
                self.log.error("Error syncing the celery executor, ignoring it:")
                self.log.exception(e)

    def end(self, synchronous=False):
        if synchronous:
            while any([
                    async_result.state not in celery_states.READY_STATES
                    for async_result in self.tasks.values()]):
                time.sleep(5)
        self.sync()
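
The polling logic of sync can be tried out without a Celery cluster. The sketch below is a simplified variant, not the listing itself: FakeResult is a hypothetical stand-in for the handle returned by apply_async, the state constants are plain strings assumed to match Celery's state names, and the success and fail callbacks are replaced by writes to a dictionary.

```python
PENDING, SUCCESS, FAILURE, REVOKED = "PENDING", "SUCCESS", "FAILURE", "REVOKED"


class FakeResult:
    """Hypothetical stand-in for the handle returned by apply_async."""

    def __init__(self, state=PENDING):
        self.state = state


def sync(tasks, last_state, event_buffer):
    # One polling pass: report tasks whose state changed to a final state.
    for key, result in list(tasks.items()):
        state = result.state
        if last_state[key] == state:
            continue                      # nothing new since the last pass
        if state == SUCCESS:
            event_buffer[key] = "success"
        elif state in (FAILURE, REVOKED):
            event_buffer[key] = "failed"
        else:
            last_state[key] = state       # still in flight, remember it
            continue
        del tasks[key]                    # finished: stop tracking the handle
        del last_state[key]


tasks = {"t1": FakeResult(), "t2": FakeResult(), "t3": FakeResult()}
last_state = {key: PENDING for key in tasks}
event_buffer = {}

sync(tasks, last_state, event_buffer)     # all still PENDING: no events
tasks["t1"].state = SUCCESS
tasks["t2"].state = REVOKED
sync(tasks, last_state, event_buffer)     # t1 and t2 are reported and dropped
```

Iterating over list(tasks.items()) matters in both versions: entries are deleted during the loop, so the loop has to run over a copy.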

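Starting the Celery worker

The worker is started with the command airflow worker. A worker is simply a Celery worker process: each one starts a number of child processes, as set by concurrency, so that tasks can run concurrently. When a Celery worker receives a message, it runs execute_command() for the corresponding task instance.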
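
What execute_command does with the command it receives comes down to the behavior of subprocess.check_call. A minimal sketch, assuming a hypothetical helper run_command and using RuntimeError in place of AirflowException so that it runs without Airflow installed:

```python
import subprocess


def run_command(command):
    # Returns 0 on success; turns a non-zero exit code into RuntimeError.
    try:
        return subprocess.check_call(command, shell=True)
    except subprocess.CalledProcessError as e:
        raise RuntimeError(
            "command failed with exit code %d" % e.returncode) from e


ok = run_command("exit 0")        # the shell exits with status 0
try:
    run_command("exit 3")         # the shell exits with status 3
    failure = None
except RuntimeError as e:
    failure = str(e)
```

Because the exception propagates out of the task function, Celery marks the task as failed, which is the state the executor later picks up when it polls.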

# cli.py
import os

from airflow import settings


def worker(args):
    env = os.environ.copy()
    env['AIRFLOW_HOME'] = settings.AIRFLOW_HOME
    # Celery worker
    from airflow.executors.celery_executor import app as celery_app
    from celery.bin import worker

    worker = worker.worker(app=celery_app)
    options = {
        'optimization': 'fair',
        'O': 'fair',
        'queues': args.queues,
        'concurrency': args.concurrency,
        'hostname': args.celery_hostname,
    }
    worker.run(**options)

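The airflow run command that was submitted to the Celery cluster above is parsed and executed inside the worker process by the run function in cli.py.

That is, when the airflow run command line is invoked after the Airflow server processes have started, the CLI parses its arguments and control reaches this run function.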
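
How a command string ends up in a run function can be sketched with argparse. The parser below is a simplified stand-in, not Airflow's actual CLI definition: it declares only the flags that appear in this article, the long name --subdir for -sd is an assumption, and demo_dag, demo_task and the file path are hypothetical values.

```python
import argparse

# Simplified stand-in for the real CLI parser; only flags seen in this
# article are declared.
parser = argparse.ArgumentParser(prog="airflow")
subparsers = parser.add_subparsers(dest="subcommand")

run_parser = subparsers.add_parser("run")
run_parser.add_argument("dag_id")
run_parser.add_argument("task_id")
run_parser.add_argument("execution_date")
run_parser.add_argument("--local", action="store_true")
run_parser.add_argument("--raw", action="store_true")
run_parser.add_argument("--pool", default=None)
run_parser.add_argument("-sd", "--subdir", default=None)


def run(args):
    # Stand-in for the run function: report which branch would be taken.
    if args.local:
        return "LocalTaskJob"
    elif args.raw:
        return "raw"
    return "other"


run_parser.set_defaults(func=run)

command = ("airflow run demo_dag demo_task 2023-01-01 "
           "--local --pool default -sd /dags/demo.py")
args = parser.parse_args(command.split()[1:])   # drop the program name
branch = args.func(args)
```

Binding the handler with set_defaults(func=...) lets each subcommand dispatch to its own function after parsing.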

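# cli.py: entry point for running a task on the worker side
# (abridged: the code that loads `dag` when it is not passed in is omitted)
import socket

from airflow import jobs
from airflow.models import TaskInstance
from airflow.utils.log.logging_mixin import LoggingMixin

log = LoggingMixin().log


def run(args, dag=None):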
    task = dag.get_task(task_id=args.task_id)
    ti = TaskInstance(task, args.execution_date)
    ti.refresh_from_db()
    hostname = socket.getfqdn()
    log.info("Running on host %s", hostname)

    # The --local flag starts a job of type LocalTaskJob. LocalTaskJob then
    # sets the raw flag internally, which leads to _run_raw_task.
    if args.local:
        run_job = jobs.LocalTaskJob(
            task_instance=ti,
            mark_success=args.mark_success,
            pickle_id=args.pickle,
            ignore_all_deps=args.ignore_all_dependencies,
            ignore_depends_on_past=args.ignore_depends_on_past,
            ignore_task_deps=args.ignore_dependencies,
            ignore_ti_state=args.force,
            pool=args.pool)
        run_job.run()
    elif args.raw:
        ti._run_raw_task(
            mark_success=args.mark_success,
            job_id=args.job_id,
            pool=args.pool,
        )
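
The two branches form a two-level hand-off: the --local path supervises, the --raw path does the work. The toy sketch below only illustrates that control flow, assuming (as the comment in the listing states) that LocalTaskJob re-enters with raw set; every function here is a hypothetical stand-in, and a direct function call replaces whatever mechanism Airflow uses to start the raw run.

```python
calls = []   # records the order in which the layers are entered


def run_raw_task(task_id):
    # Innermost layer: where the task's own logic would execute.
    calls.append("raw:" + task_id)
    return "success"


def local_task_job(task_id):
    # Supervising layer: re-enters run() with raw set.
    calls.append("local:" + task_id)
    return run(task_id, raw=True)


def run(task_id, local=False, raw=False):
    if local:
        return local_task_job(task_id)
    elif raw:
        return run_raw_task(task_id)
    # Other branches are left out of this sketch.
    raise ValueError("neither local nor raw was set")


state = run("demo_task", local=True)
```

Separating the supervising layer from the raw run gives the outer layer a place to watch over the task without being part of the task's own code.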
