Step-by-Step: Build a Data Pipeline with Airflow

Each time we deploy new software, we check the log files twice a day for the following week or two to see whether there are any issues or exceptions. A colleague asked me whether there is a way to monitor the errors and send an alert automatically if a certain error occurs more than 3 times. I am taking an Airflow course at the moment, and this is a perfect use case for building a data pipeline with Airflow to monitor those exceptions.

What’s Airflow?

Airflow is an open-source workflow management platform. It started at Airbnb in October 2014 and was later open-sourced, becoming an Apache Incubator project in March 2016. Airflow is designed under the principle of “configuration as code”. [1]

In Airflow, a DAG — or a Directed Acyclic Graph — is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.[2]

Airflow uses Python to define its workflow/DAG files, which is quite convenient and powerful for developers.

Analysis

Our log files are saved on the server; there are several of them. We can fetch them with the sftp command. After downloading all the log files into one local folder, we can use the grep command to extract all lines containing exceptions or errors. The following is an example of an error log entry:

/usr/local/airflow/data/20200723/loginApp.log:140851:[[]] 23 Jul 2020/13:23:19,196 ERROR SessionId : u0UkvLFDNMsMIcbuOzo86Lq8OcU= [loginApp] dao.AbstractSoapDao - getNotificationStatus - service Exception: java.net.SocketTimeoutException: Read timed out

Next, we need to parse the error message line by line and extract the fields. As in the example above, we want to know the file name, line number, date, time, session id, app name, module name, and error message. We will extract all this information into a database table; later on, we can use SQL queries to aggregate the information. If any type of error happens more than 3 times, the pipeline triggers an email to the specified mailbox.

The whole process is quite straightforward, as shown below:

[Image: Workflow to monitor error logs]

Airflow Operators

Airflow provides a lot of useful operators. An operator represents a single task and provides a simple way to implement certain functionality. For example, BashOperator can execute a Bash script, command, or set of commands, and SFTPOperator can access a server via an SSH session. Furthermore, Airflow supports parallelism among tasks: since an operator corresponds to a single task, independent operators can run in parallel. Airflow also provides a very simple way to define dependencies and concurrency between tasks, which we will cover later.

Implementation

Normally, Airflow runs in a Docker container. Apache publishes Airflow images on Docker Hub. A popular Airflow image is released by Puckel; it is well configured and ready to use. We can retrieve the Dockerfile and all configuration files from Puckel’s GitHub repository.

After installing the Docker client and pulling Puckel’s repository, run the following command line to start the Airflow server:

docker-compose -f ./docker-compose-LocalExecutor.yml up -d

[Image: First time running Airflow]

The first time the script runs, it downloads Puckel’s Airflow image and the Postgres image from Docker Hub, then starts the two Docker containers.

Airflow has a nice UI that can be accessed at http://localhost:8080.

[Image: Airflow UI portal]

From the Airflow UI portal, we can trigger a DAG and view the status of the currently running tasks.

Let’s start creating a DAG file. It’s pretty easy to create a new DAG. First, we define some default arguments, then instantiate a DAG object with the name monitor_errors; the DAG name will be shown in the Airflow UI.

Instantiate a new DAG
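
A minimal sketch of what this step might look like; the owner, start date, retry settings, and daily schedule below are illustrative assumptions, not the exact values from the original gist:

from datetime import datetime, timedelta
from airflow import DAG

default_args = {
    'owner': 'airflow',                    # assumed owner
    'depends_on_past': False,
    'start_date': datetime(2020, 7, 23),   # assumed start date
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'monitor_errors',                  # DAG name shown in the Airflow UI
    default_args=default_args,
    description='Monitor exceptions in application log files',
    schedule_interval='@daily',        # assumed: run once per day
)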

The first step in the workflow is to download all the log files from the server. Airflow supports concurrent task execution. We create one download task per log file; all these tasks can run in parallel, and we add them all to one list. SFTPOperator needs an SSH connection id, which we will configure in the Airflow portal before running the workflow.

Create download tasks
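
A sketch of the download loop, reusing the dag object defined above; the log file names, the remote path, and the ssh_default connection id are placeholders:

from airflow.contrib.operators.sftp_operator import SFTPOperator, SFTPOperation

log_files = ['loginApp.log', 'paymentApp.log']    # placeholder file names
dl_tasks = []

for log_file in log_files:
    task = SFTPOperator(
        task_id='download_' + log_file.split('.')[0],
        ssh_conn_id='ssh_default',                       # configured in the Airflow UI
        remote_filepath='/var/log/myapp/' + log_file,    # placeholder remote path
        local_filepath='/usr/local/airflow/data/{{ ds_nodash }}/' + log_file,
        operation=SFTPOperation.GET,
        create_intermediate_dirs=True,                   # create the dated folder if missing
        dag=dag,
    )
    dl_tasks.append(task)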

After that, we can refresh the Airflow UI to load our DAG file. Now we can see our new DAG, monitor_errors, appearing in the list:

[Image: New DAG showing in Airflow]

Click the DAG name to open the graph view, where we can see all the download tasks:

[Image: All download tasks in the graph view]

Before we trigger a DAG run, we need to configure the SSH connection so that SFTPOperator can use it. Click the Admin menu, then select Connections to create a new SSH connection.

[Image: Create an SSH connection]

To access the SSH server without entering a password, we need to log in with an SSH key pair. Assume the public key has already been added to the server and the private key is located at /usr/local/airflow/.ssh/id_rsa. Leave the Password field empty and put the following JSON data into the Extra field.

{
"key_file": "/usr/local/airflow/.ssh/id_rsa",
"timeout": "10",
"compress": "false",
"no_host_key_check": "false",
"allow_host_key_change": "false"
}

OK, let’s enable the DAG and trigger it. Some tasks turn green, which means they are in the running state; the other tasks remain grey since they are still in the queue.

[Image: Tasks are running]

[Image: All tasks finished]

When all tasks have finished, they are shown in dark green. Let’s check the files downloaded into the data/ folder. A subfolder named after the current date is created there.

[Image: All logs are downloaded into the folder]

Looks good.

Next, we extract all lines containing “exception” from the log files and write them into a file (errors.txt) in the same folder. The grep command can search for certain text in all the files in a folder, and it can also include the file name and line number in the search results.

Airflow uses the bash command’s return value as the task’s result. grep returns a non-zero exit code (1) if no match is found, and Airflow treats a non-zero return value as a task failure. In our case, however, it isn’t a failure: no match simply means no errors, and we’re all good. Instead, we check the errors.txt file generated by grep; if the file exists, whether it’s empty or not, we treat the task as successful.

Create grep_exception task
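
A possible sketch of this task as a BashOperator. Redirecting grep's output creates errors.txt even when nothing matches, and the trailing ls makes the exit code depend only on whether that file exists; the folder layout follows the download step above:

from airflow.operators.bash_operator import BashOperator

data_dir = '/usr/local/airflow/data/{{ ds_nodash }}'

grep_exception = BashOperator(
    task_id='grep_exception',
    bash_command=(
        'grep -E -n -i "exception" ' + data_dir + '/*.log > ' + data_dir + '/errors.txt; '
        'ls ' + data_dir + '/errors.txt'
    ),
    # the trailing ls determines the exit code, so a no-match grep does not fail the task
    dag=dag,
)
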
[Image: grep exception]

Refresh the DAG and trigger it again; the graph view is updated as above. Let’s check the output file errors.txt in the folder.

[Image: Last 5 exceptions in errors.txt]

Next, we parse the log line by line and extract the fields we are interested in. We use a PythonOperator to do this job with a regular expression.

Parse exception logs using a regular expression
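
A sketch of how the parsing callable might look; the regular expression, the error_logs table name, and its columns are assumptions based on the error-line format shown earlier:

import re
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

# assumed pattern for lines such as:
# /usr/local/airflow/data/20200723/loginApp.log:140851:[[]] 23 Jul 2020/13:23:19,196 ERROR SessionId : ... [loginApp] dao.AbstractSoapDao - getNotificationStatus - service Exception: ...
LINE_PATTERN = re.compile(
    r'(?P<file>[^:]+):(?P<line>\d+):\[\[\]\]\s+'
    r'(?P<date>\d{2} \w{3} \d{4})/(?P<time>[\d:,]+)\s+ERROR\s+'
    r'SessionId : (?P<session>\S+)\s+\[(?P<app>[^\]]+)\]\s+'
    r'(?P<module>\S+)\s+-\s+(?P<message>.*)'
)

def parse_error_log(**context):
    hook = PostgresHook()   # uses the postgres_default connection
    folder = '/usr/local/airflow/data/' + context['ds_nodash']
    with open(folder + '/errors.txt') as f:
        for line in f:
            match = LINE_PATTERN.match(line)
            if not match:
                continue   # skip lines that do not follow the assumed format
            hook.run(
                'INSERT INTO error_logs (filename, line, log_date, log_time, '
                'session_id, app_name, module, error_message) '
                'VALUES (%s, %s, %s, %s, %s, %s, %s, %s)',
                parameters=(match.group('file'), match.group('line'),
                            match.group('date'), match.group('time'),
                            match.group('session'), match.group('app'),
                            match.group('module'), match.group('message')),
            )

parse_log = PythonOperator(
    task_id='parse_log',
    python_callable=parse_error_log,
    provide_context=True,
    dag=dag,
)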

The extracted fields will be saved into a database for later queries. Airflow stores its metadata in a database and supports a variety of database backends; in this example, we use Postgres as the backend.

We define a PostgresOperator to create a new table in the database; it drops the table first if it already exists. In a real scenario, we might append data to the table instead, but we should be cautious: if some tasks need to be rerun for any reason, duplicate data may be added to the database.

Create a table in the Postgres database
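
A sketch of the table-creation task; the error_logs table name and columns are assumptions matching the fields extracted above, and the operator uses the postgres_default connection:

from airflow.operators.postgres_operator import PostgresOperator

create_table = PostgresOperator(
    task_id='create_table',
    sql='''
        DROP TABLE IF EXISTS error_logs;
        CREATE TABLE error_logs (
            filename      TEXT,
            line          INTEGER,
            log_date      TEXT,
            log_time      TEXT,
            session_id    TEXT,
            app_name      TEXT,
            module        TEXT,
            error_message TEXT
        );
    ''',
    dag=dag,
)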

To use the Postgres database, we need to configure the connection in the Airflow portal. We can modify the existing postgres_default connection, so we don’t need to specify a connection id when using PostgresOperator or PostgresHook.

[Image: Modify postgres_default connection]

[Image: Config postgres_default connection]

Great, let’s trigger the DAG again.

[Image: Parse the error logs]

The tasks run successfully, and all the log data is parsed and stored in the database. Airflow provides a handy way to query the database: choose “Ad Hoc Query” under the “Data Profiling” menu and type a SQL query statement.

[Image: Ad Hoc Query]

[Image: Error logs in the Postgres database]

Next, we can query the table and count the errors of each type. We use another PythonOperator to query the database and generate two report files: one contains all the error records in the database, the other is a statistics table showing each type of error and its number of occurrences, in descending order.

Define task to generate reports
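
A sketch of the report task; the table name, the queries, and the output file names follow the assumptions used above:

from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def gen_error_reports(**context):
    hook = PostgresHook()   # postgres_default connection
    folder = '/usr/local/airflow/data/' + context['ds_nodash']

    # report 1: every parsed error record
    all_errors = hook.get_pandas_df('SELECT * FROM error_logs')
    all_errors.to_csv(folder + '/error_logs.csv', index=False)

    # report 2: occurrences per error type, in descending order
    stats = hook.get_pandas_df(
        'SELECT error_message, COUNT(*) AS occurrence '
        'FROM error_logs GROUP BY error_message ORDER BY occurrence DESC'
    )
    stats.to_csv(folder + '/error_stats.csv', index=False)

gen_reports = PythonOperator(
    task_id='gen_reports',
    python_callable=gen_error_reports,
    provide_context=True,
    dag=dag,
)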

Ok, trigger the DAG again.

[Image: Generate reports]

Two report files are generated in the folder.

[Image: Two reports]

error_logs.csv contains all the exception records in the database.

[Image: Report with all exceptions]

error_stats.csv lists the different types of errors with their occurrence counts.

[Image: Report with different types of exceptions]

In the last step, we use a branch operator to check the top occurrence count in the error list. If it exceeds the threshold (say, 3 times), it triggers the task that sends an email; otherwise, the workflow ends silently. We define the threshold value in an Airflow Variable and read it from the code, so that we can change the threshold later without modifying the code.

[Image: Create a variable in Airflow]

Define task to check the error number
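
A sketch of the branching task; the error_threshold variable name, the query, and the downstream task ids are assumptions:

from airflow.models import Variable
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import BranchPythonOperator

def check_error_threshold(**context):
    threshold = int(Variable.get('error_threshold', default_var=3))   # assumed variable name
    hook = PostgresHook()
    # highest occurrence count among all error types
    row = hook.get_first(
        'SELECT COUNT(*) AS occurrence FROM error_logs '
        'GROUP BY error_message ORDER BY occurrence DESC LIMIT 1'
    )
    top_count = row[0] if row else 0
    # return the task_id of the branch to follow
    return 'send_email' if top_count > threshold else 'dummy_op'

check_threshold = BranchPythonOperator(
    task_id='check_threshold',
    python_callable=check_error_threshold,
    provide_context=True,
    dag=dag,
)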

BranchPythonOperator returns the name of the next task to run: either send an email or do nothing. We use EmailOperator to send the email; it provides a convenient API to specify the to, subject, and body fields, and it makes adding attachments easy. We define the empty task with DummyOperator.

Email task and dummy task
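
A sketch of the two branch targets; the recipient address and attachment paths are placeholders, and the templated attachment paths are an assumption noted in the comment:

from airflow.operators.email_operator import EmailOperator
from airflow.operators.dummy_operator import DummyOperator

report_folder = '/usr/local/airflow/data/{{ ds_nodash }}'

send_email = EmailOperator(
    task_id='send_email',
    to='support-team@example.com',               # placeholder recipient
    subject='Error alert for {{ ds }}',
    html_content='The number of errors exceeded the threshold. Reports are attached.',
    # assumes 'files' is a templated field in your Airflow version;
    # otherwise, write the reports to a fixed path and reference that path here
    files=[report_folder + '/error_logs.csv',
           report_folder + '/error_stats.csv'],
    dag=dag,
)

dummy_op = DummyOperator(task_id='dummy_op', dag=dag)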

To use the email operator, we need to add some SMTP configuration parameters to the YAML file. Here we define the configuration for a Gmail account. You may put your password here, or use an App Password for your email client, which provides better security.

- AIRFLOW__SMTP__SMTP_HOST=smtp.gmail.com
- AIRFLOW__SMTP__SMTP_PORT=587
- AIRFLOW__SMTP__SMTP_USER=<your-email-id>@gmail.com
- AIRFLOW__SMTP__SMTP_PASSWORD=<your-app-password>
- AIRFLOW__SMTP__SMTP_MAIL_FROM=<your-email-id>@gmail.com

So far, we have created all the tasks in the workflow; now we need to define the dependencies among them. Airflow provides a very intuitive way to describe dependencies.

dl_tasks >> grep_exception >> create_table >> parse_log >> gen_reports >> check_threshold >> [send_email, dummy_op]

Now that we have finished all the coding, let’s trigger the workflow again to see the whole process.

[Image: Send an email when the number of any type of error exceeds the threshold]

In our case, there are two types of errors, and both exceed the threshold, so the email is sent at the end with the two reports attached.

[Image: Email alert]

We change the threshold variable to 60 and run the workflow again.

[Image: Change threshold to 60]

[Image: The workflow ends without sending the email]

As you can see, the email is not sent since the number of errors is less than 60; the workflow ends silently.

Let’s go back to the DAG view.

[Image: The DAG view]

It lists all the active and inactive DAGs and the status of each one. In our example, you can see that the monitor_errors DAG has 4 successful runs, and in the last run 15 tasks succeeded and 1 task was skipped, which is the final dummy_op task; that is the expected result.

Now our DAG is scheduled to run every day. We can change the schedule as we like, e.g. every 6 hours or at a specific time every day, as shown below.
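
For example, assuming the same default arguments as before, the schedule_interval argument accepts presets and cron expressions:

from datetime import datetime
from airflow import DAG

default_args = {'owner': 'airflow', 'start_date': datetime(2020, 7, 23)}

# once per day at midnight (preset)
dag = DAG('monitor_errors', default_args=default_args, schedule_interval='@daily')

# alternative cron expressions:
#   '0 */6 * * *'   -> every 6 hours
#   '30 8 * * *'    -> every day at 08:30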

Airflow is a powerful ETL tool that is widely used at many tier-1 companies such as Airbnb, Google, Ubisoft, and Walmart. It is also supported on the major cloud platforms, e.g. AWS, GCP, and Azure. It plays an increasingly important role in data engineering and data processing.

Code

https://github.com/kyokin78/airflow

Translated from: https://towardsdatascience.com/step-by-step-build-a-data-pipeline-with-airflow-4f96854f7466
