Airflow dynamic DAGs: Python globals

In this post, I introduce the concept of dynamic DAG creation and explain the significance of Python global variables for Airflow.

What do I mean by “dynamic DAG”?

Dynamic DAG creation is important for scalable data pipeline applications.

When confined to the realm of static DAG scripts, we find ourselves duplicating code in order to create pipelines.

This duplication is undesirable because it usually increases code-base complexity, making DAGs harder to update and increasing the chances of bugs appearing.

For example, updated DAGfile code must be copied across each replicated instance, while making sure to keep the intended diffs (e.g. params, custom logic) intact. In other words, a nightmare.

In order to dynamically create DAGs with Airflow, we need two things to happen:

  1. Run a function that instantiates an airflow.DAG object.
  2. Pass that object back to the global namespace of the DAGfile.

Sounds simple, and it is. But let’s see how it could go wrong.

Static DAG Example

Let’s imagine I have a pipeline that gets the current price of bitcoin (BTC) and emails it to me:

There’s nothing wrong here. Not yet anyway.

It renders the following DAG:

[Image: the rendered DAG]

Dynamic DAG example

Now let’s imagine we wanted to get the price of some other cryptocurrencies as well; say, Ethereum (ETH), Litecoin (LTC) and Stellar (XLM).

We might try to accomplish that dynamically as follows:

def create_dag(symbol):
    with DAG(
        "email_{}_price".format(symbol.lower()),  # one dag_id per symbol
        default_args={"start_date": "2020-01-01"},
        schedule_interval="0 0 * * *",  # run daily at midnight
    ) as dag:
        get_price_task = PythonOperator(
            task_id="get_price",
            python_callable=get_price,
            op_kwargs=dict(
                symbol=symbol,  # pass the requested symbol through to get_price
            ),
        )
        email_price_task = PythonOperator(
            task_id="email_price",
            python_callable=email_price,
        )
        (
            get_price_task
            >> email_price_task
        )
        return dag


for symbol in ("BTC", "ETH", "LTC", "XLM"):
    dag = create_dag(symbol=symbol)

Seems reasonable, right? I iterate over the coins and dynamically create a DAG for each.

I’m even making sure to pass the instantiated DAG object dag back to the global namespace of the DAGfile (lines 22, 26).

However, this will not work.

At least, not how we expect. It produces only one DAG (for XLM, the last element in the list):

[Image: Airflow dashboard listing only the XLM DAG]

We are missing the other DAGs: email_btc_price, email_eth_price and email_ltc_price.

Why it doesn’t work

To understand why the above code does not behave the way we need it to, we have to consider Airflow’s core concept of DAG scope.

In particular: “Airflow will load any DAG object it can import from a DAGfile. Critically, that means the DAG must appear in globals()”
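To make that rule concrete, here is a small simulation of the discovery step. It is only a sketch and not Airflow's actual loader: FakeDAG is a stand-in for airflow.DAG, and discover_dags is a hypothetical helper that runs a file's source as a module and then collects every FakeDAG it finds in that module's global namespace.

```python
import types


class FakeDAG:
    """Stand-in for airflow.DAG (assumption: only the dag_id matters here)."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


def discover_dags(source):
    """Hypothetical loader: run the source as a module, then scan its globals."""
    module = types.ModuleType("fake_dagfile")
    module.FakeDAG = FakeDAG  # make the stand-in class available to the file
    exec(source, module.__dict__)
    return sorted(
        obj.dag_id for obj in module.__dict__.values() if isinstance(obj, FakeDAG)
    )


DAGFILE_SOURCE = '''
visible = FakeDAG("this_dag_is_found")

def build():
    hidden = FakeDAG("this_dag_is_not_found")  # local to build(), never global

build()
'''

found = discover_dags(DAGFILE_SOURCE)
print(found)  # ['this_dag_is_found']
```

The DAG created inside build() exists while the function runs, but no global name refers to it, so the scan never sees it.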

In Python, globals is a built-in function that returns a dictionary of the current module’s global variables. In addition to reading variables, it can be used to set them. For example:

>>> globals()["my_name"] = "Alex"
>>> print(my_name)
Alex

So we can now see what’s happening: on each iteration of the loop (over the symbols BTC, ETH, LTC and XLM), the global variable dag is rebound to the newly created DAG.

Thus, all DAGs except the last lose their global variable reference.
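The rebinding can be observed without Airflow. In this sketch FakeDAG again stands in for airflow.DAG, create_dag is reduced to the one line that matters, and the created list is an addition that exists only so the objects can be inspected after the loop.

```python
class FakeDAG:
    """Stand-in for airflow.DAG (assumption: only the dag_id matters here)."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


def create_dag(symbol):
    return FakeDAG("email_{}_price".format(symbol.lower()))


created = []  # kept only for inspection; a real DAGfile would not have this
for symbol in ("BTC", "ETH", "LTC", "XLM"):
    dag = create_dag(symbol=symbol)  # rebinds the same name on every iteration
    created.append(dag)

# Four objects were created, but the name "dag" can only point at one of them.
still_named_dag = [d.dag_id for d in created if d is dag]
print(len(created))     # 4
print(still_named_dag)  # ['email_xlm_price']
```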

How to make it work

Knowing about this core concept of Airflow, the solution is trivial. All we need to do is maintain references to each DAG in the loop.

This can be accomplished as follows:

for symbol in ("BTC", "ETH", "LTC", "XLM"):
    dag = create_dag(symbol=symbol)
    globals()["{}_dag".format(symbol.lower())] = dag

Here I’m using globals() to update the global namespace with my DAG object (line 3) as my loop executes.

This produces the expected DAGs in the Airflow dashboard:

[Image: Airflow dashboard listing the expected DAGs]
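The same fix can be checked outside Airflow. This is a minimal sketch under the same assumptions as before: FakeDAG stands in for airflow.DAG and create_dag is a simplified version of the function above. After the loop, every object is reachable through its own global name.

```python
class FakeDAG:
    """Stand-in for airflow.DAG (assumption: only the dag_id matters here)."""

    def __init__(self, dag_id):
        self.dag_id = dag_id


def create_dag(symbol):
    return FakeDAG("email_{}_price".format(symbol.lower()))


for symbol in ("BTC", "ETH", "LTC", "XLM"):
    # Give each DAG its own global name instead of reusing a single variable.
    globals()["{}_dag".format(symbol.lower())] = create_dag(symbol=symbol)

# List the global names that now refer to a FakeDAG.
registered = sorted(
    name for name, value in list(globals().items()) if isinstance(value, FakeDAG)
)
print(registered)  # ['btc_dag', 'eth_dag', 'ltc_dag', 'xlm_dag']
```

The names only need to be unique within the file. Airflow identifies each DAG by its dag_id, so the global name is just a handle that keeps the object discoverable.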

Conclusion

We’ve seen how Python’s built-in globals function can be useful when dynamically creating Airflow DAGs.

As always, thanks for reading. Now get back to your code! Your projects are missing you ;)

https://alexgalea.ca/

Sign up for my newsletter :)

Translated from: https://medium.com/@galea/airflow-dynamic-dags-python-globals-4f40905d314a
