Apache Airflow and PostgreSQL with Docker and Docker Compose

ETL WITH DOCKERS

Hello, in this post I will show you how to set up the official apache/airflow image with PostgreSQL and the LocalExecutor using docker and docker-compose. I won’t be going through Airflow itself, what it is, and how it is used. Please check the official documentation for more information about that.

Before setting up and running Apache Airflow, please install Docker and Docker Compose.

For those in a hurry...

In this chapter, I will show you the files and directories needed to run Airflow, and in the next chapter I will go file by file, line by line, explaining what is going on.

Firstly, in the root directory create three more directories: dags, logs, and scripts. Further, create the following files: .env, docker-compose.yml, entrypoint.sh and dummy_dag.py. Please make sure those files and directories follow the structure below.

#project structure
root/
├── dags/
│   └── dummy_dag.py
├── scripts/
│   └── entrypoint.sh
├── logs/
├── .env
└── docker-compose.yml
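
From the root directory, the whole layout can be scaffolded with a couple of shell commands (just a convenience; creating the directories and empty files by hand works equally well):

# create the directories and empty files listed above
mkdir -p dags logs scripts
touch .env docker-compose.yml dags/dummy_dag.py scripts/entrypoint.sh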

Created files should contain the following:

#docker-compose.yml
version: '3.8'
services:
  postgres:
    image: postgres
    environment:
      - POSTGRES_USER=airflow
      - POSTGRES_PASSWORD=airflow
      - POSTGRES_DB=airflow

  scheduler:
    image: apache/airflow
    command: scheduler
    restart: on-failure
    depends_on:
      - postgres
    env_file:
      - .env
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs

  webserver:
    image: apache/airflow
    entrypoint: ./scripts/entrypoint.sh
    restart: on-failure
    depends_on:
      - postgres
      - scheduler
    env_file:
      - .env
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
      - ./scripts:/opt/airflow/scripts
    ports:
      - "8080:8080"

#entrypoint.sh
#!/usr/bin/env bash
airflow initdb
airflow webserver

#.env
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql+psycopg2://airflow:airflow@postgres/airflow
AIRFLOW__CORE__EXECUTOR=LocalExecutor

#dummy_dag.py
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

with DAG('example_dag', start_date=datetime(2016, 1, 1)) as dag:
    op = DummyOperator(task_id='op')

From the root directory, executing “docker-compose up” in the terminal should make Airflow accessible on localhost:8080. The image below shows the final result.

If you encounter permission errors, please run “chmod -R 777” on all subdirectories, e.g. “chmod -R 777 logs/”
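
For reference, these are the standard docker-compose commands used to bring the stack up, check on it, and tear it down again (nothing here is specific to this project):

# start postgres, the scheduler and the webserver in the background
docker-compose up -d
# list the containers and their state
docker-compose ps
# follow the webserver logs
docker-compose logs -f webserver
# stop and remove the containers
docker-compose down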

[Image: the Airflow web UI running at localhost:8080]

For the curious ones...

In layman’s terms, docker is used to manage individual containers, while docker-compose is used to manage multi-container applications. It also moves many of the options you would pass to docker run into the docker-compose.yml file for easier reuse. It works as a front-end “script” on top of the same docker API used by docker; you can do everything docker-compose does with docker commands and a lot of shell scripting.
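
To make that concrete, here is a rough hand-rolled equivalent of the postgres service from the docker-compose.yml above, written with plain docker commands (the network name here is arbitrary; docker-compose creates a similar network automatically and makes the service name "postgres" resolvable as a hostname for the other containers):

# create a shared network, then run postgres attached to it
docker network create airflow_net
docker run -d \
  --name postgres \
  --network airflow_net \
  -e POSTGRES_USER=airflow \
  -e POSTGRES_PASSWORD=airflow \
  -e POSTGRES_DB=airflow \
  postgres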

Before running our multi-container docker application, docker-compose.yml must be configured. In that file, we define the services that will be run on docker-compose up.
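
A quick sanity check before starting anything: docker-compose can validate the file and print the fully resolved configuration.

# validate docker-compose.yml and print the resolved services
docker-compose config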

The first attribute of docker-compose.yml is version, which is the compose file format version. For the most recent file format version and all configuration options, see the official compose file reference.

The second attribute is services, and all attributes one level below services denote the containers used in our multi-container application. These are postgres, scheduler, and webserver. Each container has an image attribute which points to the base image used for that service.

For each service, we define the environment variables used inside its container. For postgres they are defined with the environment attribute, but for the scheduler and webserver they are defined in the .env file. Because .env is an external file, we must point to it with the env_file attribute.
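
Once the stack is running, it is easy to confirm that the variables from .env actually reached the containers (the output may also include additional variables set by the apache/airflow image itself):

# print the AIRFLOW__* variables inside the webserver container
docker-compose exec webserver env | grep AIRFLOW__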

Opening the .env file, we can see two variables defined. One sets the executor which will be used and the other holds the database connection string. Each connection string must be defined in the following manner:

dialect+driver://username:password@host:port/database

The dialect name is the identifying name of the SQLAlchemy dialect, such as sqlite, mysql, postgresql, oracle, or mssql. The driver is the name of the DBAPI used to connect to the database, spelled in all lowercase letters. In our case, the connection string is defined by:

postgresql+psycopg2://airflow:airflow@postgres/airflow

Omitting the port after the host part means that we will be using the default postgres port (5432) defined in its own Dockerfile.
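
The same credentials and default port can be verified directly against the database container, for example with psql, which ships in the postgres image and also defaults to port 5432:

# connect as the airflow user to the airflow database and run a test query
docker-compose exec postgres psql -U airflow -d airflow -c 'SELECT version();'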

Every service can define a command which will be run inside its Docker container. If a service needs to execute multiple commands, this can be done by defining an optional .sh file and pointing to it with the entrypoint attribute. In our case we have entrypoint.sh inside the scripts folder which, once executed, runs airflow initdb and airflow webserver. Both are mandatory for Airflow to run properly.
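
One practical note: because entrypoint.sh is mounted from the host, it keeps the host's file permissions, so it has to be executable. If the webserver container exits immediately with a permission error, this usually fixes it:

# make the entrypoint script executable
chmod +x scripts/entrypoint.sh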

By defining the depends_on attribute, we can express dependencies between services. In our example, the webserver starts only after both the scheduler and postgres have started, and the scheduler starts only after postgres has started.
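
This also means that starting a single service pulls in everything it depends_on, which is an easy way to see the attribute in action:

# starts postgres and the scheduler first, then the webserver
docker-compose up webserver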

In case a container crashes, we can have Docker restart it automatically with the restart attribute. Here it is set to on-failure, so a container is restarted whenever it exits with a non-zero status; the other accepted values are no, always, and unless-stopped. (The richer restart_policy block, with condition, delay, max_attempts, and window options, only applies under deploy when running in swarm mode.)
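
To double-check which restart policy Docker actually applied, inspect a running container (the container name follows compose's <project>_<service>_1 naming, so it depends on the name of your root directory and your compose version; with the layout above it would be root_webserver_1):

# print the restart policy of the webserver container
docker inspect -f '{{ .HostConfig.RestartPolicy.Name }}' root_webserver_1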

Once a service is running, it is served on the container's defined port. To access that service we need to publish the container's port on a port of the host, which is done by the ports attribute. In our case, we are mapping port 8080 of the container to TCP port 8080 on the host machine, so the web UI is reachable at localhost:8080. The left side of the colon defines the host machine's port and the right-hand side defines the container's port.
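
Once the webserver is up, the mapping can be checked from the host, for example with curl (any HTTP response, including a redirect, means the published port is reachable):

# request the Airflow web UI through the published port
curl -I http://localhost:8080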

Lastly, the volumes attribute defines shared volumes (directories) between the host file system and the docker container. Because Airflow's default working directory is /opt/airflow/, we need to map our designated directories from the root folder into the Airflow containers' working directory. That is done as follows:

#general case for airflow
- ./<our-root-subdir>:/opt/airflow/<our-root-subdir>

#our case
- ./dags:/opt/airflow/dags
- ./logs:/opt/airflow/logs
- ./scripts:/opt/airflow/scripts
...

This way, when the scheduler or webserver writes logs to its logs directory, we can access them from the logs directory on our file system. When we add a new DAG to the dags folder, it is automatically picked up into the containers' DAG bag, and so on.
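
A simple way to see this mapping in action is to list the dags directory from inside a container; any file added to ./dags on the host shows up there immediately:

# list the mounted dags folder inside the webserver container
docker-compose exec webserver ls /opt/airflow/dags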

That's it for today, thank you for reading this story, I will be posting more soon. If you notice any mistakes, please let me know.

Source: https://medium.com/@ivanrezic/apache-airflow-and-postgresql-with-docker-and-docker-compose-5651766dfa96
