[Airflow] Data-Aware Scheduling: A Usage Example

Article link: https://blog.csdn.net/jocelyn_hhhhhh/article/details/140490490

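I recently looked into Airflow's data-aware scheduling for a business requirement. I found relatively little material while researching it, so I am posting my notes here to help anyone who needs them.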

Dataset definition (what counts as a usable dataset)

Definition and requirements

An Airflow dataset is a logical grouping of data. Upstream producer tasks can update datasets, and dataset updates contribute to scheduling downstream consumer DAGs.
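In other words, a dataset is a logical group of data: upstream producer tasks can update it, and those updates contribute to scheduling the downstream DAGs.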

A dataset is defined by a URI, and the URI must be a string (regular expressions are not supported, and the URI is case-sensitive).
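To make the string-only rule concrete, here is a small pure-Python sketch. It is not Airflow code: `SCHEDULES` and `consumers_for` are hypothetical names made up for illustration, and the paths are placeholders. It only models the idea that a dataset URI is matched by exact, case-sensitive string comparison, so a regex-looking URI is just another literal string.

```python
# Toy model of URI matching (an illustration, not Airflow's implementation).
# SCHEDULES and consumers_for are hypothetical names; the paths are placeholders.
SCHEDULES = {
    "consumer_a": ["/data/input.txt"],
    "consumer_b": ["/data/Input.txt"],          # differs only by case
    "consumer_c": [r"/data/input_\d+\.txt"],    # looks like a regex, but is a literal string
}


def consumers_for(updated_uri: str) -> list[str]:
    """Return the DAG ids whose schedule lists exactly this URI string."""
    return sorted(
        dag_id for dag_id, uris in SCHEDULES.items() if updated_uri in uris
    )


print(consumers_for("/data/input.txt"))    # ['consumer_a']
print(consumers_for("/data/input_1.txt"))  # []
```

In this model, neither the differently cased URI nor the regex-looking one is treated as the same dataset as `/data/input.txt`.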

Hands-on

File dataset update task

💡 The dataset is defined as a single file (a txt file)

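from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset  # available in Airflow 2.4+
from airflow.operators.python import PythonOperator

# Producer DAG: its task declares the dataset as an outlet
with DAG(
        dag_id = 'data_demo_update',
        tags = ['test','data'],
        start_date=datetime(2024,7,15),
):
    # Plain function: PythonOperator wraps it, so no @task decorator is needed
    def update():
        print('update')

    update_task = PythonOperator(
        task_id='input_test_operator',
        # A successful run of this task marks the dataset as updated
        outlets=[Dataset("/home/xxxxx/airflow/test_data_demo.txt")],
        python_callable=update
    )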

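# Consumer DAG: scheduled on the dataset instead of on a time interval
with DAG(
    dag_id = 'data_demo_output',
    tags = ['test','data'],
    start_date=datetime(2024,7,15),
    # Run this DAG when test_data_demo.txt is updated by data_demo_update.
    # The URI must be the same string as the producer task's outlet.
    schedule=[Dataset("/home/xxxxx/airflow/test_data_demo.txt")],
):
    # Plain function: PythonOperator wraps it, so no @task decorator is needed
    def output():
        # Append a marker line to the output file
        with open("/home/xxxxx/airflow/test_data_output.txt",'a') as f:
            print('output',file=f)

    output_task = PythonOperator(
        task_id='output_test_operator',
        python_callable=output
    )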

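Summary:

In the current setup, every successful run of the update task, which declares the file as its outlet, triggers the downstream DAG. Airflow does not watch the file itself.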
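The trigger semantics can be sketched with a toy model in plain Python. This is a deliberate simplification and not Airflow's internals: `ToyScheduler`, `run_task`, and the `/tmp/demo.txt` path are all hypothetical names for illustration. The point it tries to show is that an "update" is an event recorded when a task with an outlet finishes successfully, rather than a change detected on disk.

```python
# Toy model (not Airflow internals): a dataset "update" is an event recorded
# when a producer task succeeds, not a file change detected on disk.
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    # dataset URI -> ids of the consumer DAGs scheduled on it
    consumers: dict[str, list[str]] = field(default_factory=dict)
    # consumer DAG ids triggered so far, in order
    triggered: list[str] = field(default_factory=list)

    def run_task(self, outlets: list[str], succeeded: bool) -> None:
        """Record a task run; only a successful task updates its outlets."""
        if not succeeded:
            return
        for uri in outlets:
            self.triggered.extend(self.consumers.get(uri, []))


sched = ToyScheduler(consumers={"/tmp/demo.txt": ["data_demo_output"]})
sched.run_task(outlets=["/tmp/demo.txt"], succeeded=True)   # triggers the consumer
sched.run_task(outlets=["/tmp/demo.txt"], succeeded=False)  # failed task: no update
sched.run_task(outlets=[], succeeded=True)                  # no outlet: no update
print(sched.triggered)  # ['data_demo_output']
```

In this simplified model, a task that touches the file but declares no outlet never triggers anything downstream.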

Folder dataset update

💡 Testing what happens when the dataset is defined as a folder

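import os
import random
from datetime import datetime, timedelta

from airflow import DAG
from airflow.datasets import Dataset
from airflow.models.baseoperator import chain
from airflow.operators.python import PythonOperator

# The dataset URI is the folder path itself
TEST_PATH = '/home/xxxxx/airflow/test_demo'
test_dataset = Dataset(TEST_PATH)
os.makedirs(TEST_PATH, exist_ok=True)

# Producer DAG: every minute, write a new file into the folder
with DAG(
    dag_id='update_data',
    tags=['data'],
    start_date=datetime(2024,7,16),
    schedule=timedelta(minutes=1)
) as dag_update:

    def update():
        # Write between 0 and 10 lines into a new timestamped file
        add_times = random.randint(0,10)
        with open(os.path.join(TEST_PATH,str(datetime.now())+'.txt'),'w') as f:
            for i in range(add_times):
                print(f'{i}times update!',file=f)

    update_task = PythonOperator(
        task_id = 'update_task',
        # Declaring the outlet registers a dataset update when this task succeeds
        outlets = [test_dataset],
        python_callable=update
    )
    chain(update_task)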

# Consumer DAG: runs whenever the dataset receives an update
with DAG(
    dag_id='output_data',
    tags=['data'],
    start_date=datetime(2024,7,16),
    schedule=[test_dataset]
) as dag_data_aware:

    def output():
        # Pick the most recently modified file in the folder
        file_list = os.listdir(TEST_PATH)
        file_list.sort(key=lambda fn: os.path.getmtime(os.path.join(TEST_PATH,fn)))

        with open(os.path.join(TEST_PATH,file_list[-1]),'r') as f:
            content = f.readlines()
        # Only record updates that contain more than 5 lines
        if len(content)>5:
            with open('/home/xxxxx/airflow/test_data_output.txt','a') as f:
                print(datetime.now(),file=f)
                print(content,file=f)

    output_task = PythonOperator(
        task_id = 'output_task',
        python_callable = output
    )

    chain(output_task)

if __name__=='__main__':
    dag_update.test()
    dag_data_aware.test()

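Explanation

There are two DAGs: update_data, which produces the updates, and output_data, which reads the updated data.

update_data writes one new file into the target folder every minute, containing between 0 and 10 lines (controlled by a random number).

Each time the dataset is updated, output_data checks the newest file; if it contains more than 5 lines, its content is appended to the specified output file.
In short: the dataset is the test_demo folder, and whenever it is updated, the data in the most recently modified file is read.
The key point is that the operator wrapping the update function must declare test_demo as its outlet Dataset (the outlets argument). That declaration is what lets Airflow register the update and trigger the downstream DAG.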
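The consumer logic described above (pick the most recently modified file, then keep it only if it has more than 5 lines) can also be tried outside Airflow. The sketch below is a standalone approximation under some assumptions: `newest_file_lines` and `should_record` are hypothetical helper names, and the demo uses a temporary directory with explicitly set modification times instead of the real test_demo folder.

```python
import os
import tempfile


def newest_file_lines(folder: str) -> list[str]:
    """Return the lines of the most recently modified file in folder."""
    paths = [os.path.join(folder, name) for name in os.listdir(folder)]
    newest = max(paths, key=os.path.getmtime)
    with open(newest, "r") as f:
        return f.readlines()


def should_record(lines: list[str], threshold: int = 5) -> bool:
    """Mirror the check in output(): keep updates with more than 5 lines."""
    return len(lines) > threshold


# Small demo in a throwaway directory
with tempfile.TemporaryDirectory() as folder:
    for name, n_lines, mtime in [("old.txt", 8, 1_000), ("new.txt", 3, 2_000)]:
        path = os.path.join(folder, name)
        with open(path, "w") as f:
            for i in range(n_lines):
                print(f"{i}times update!", file=f)
        # Set explicit modification times so the ordering is deterministic
        os.utime(path, (mtime, mtime))

    lines = newest_file_lines(folder)
    print(len(lines), should_record(lines))  # 3 False
```

Here the older file has more lines, but only the newest file is inspected, so nothing would be recorded for this update.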