Background:
- ETL jobs are typically run as scripts (bash/python) scheduled via crontab
- Checking job status is neither intuitive nor convenient: you either log in to the machine or build a custom UI/monitoring layer
- Dependencies between jobs cannot be enforced, or enforcing them is prohibitively expensive
- Once the number of jobs reaches a certain scale, managing them becomes extremely painful
Evaluation:
Airflow                                                                   | Oozie
+ Python code for DAGs                                                    | - Java or XML for DAGs
+ Has connectors for every major service/cloud provider                   | - Hard to build complex pipelines
+ More versatile                                                          | - Smaller, less active community
+ Advanced metrics                                                        | - Weaker web GUI
+ Better UI and API                                                       | - Java API
+ Capable of creating extremely complex workflows                         |
+ Jinja templating                                                        |
= Can be parallelized                                                     | = Can be parallelized
= Native connections to HDFS, Hive, Pig, Presto, MySQL, Postgres, S3 etc. | = Native connections to HDFS, Hive, Pig etc.
= Graph as DAG                                                            | = Graph as DAG
Why Airflow:
- A workflow management platform open-sourced by Airbnb
- Visualizes workflow dependencies
- Log tracking
- Written in Python and easy to extend
- An out-of-the-box ETL scheduling and management platform
- Doubles as an operations/job management platform
- Designed from the ground up as a scheduling platform
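Since DAGs are plain Python files dropped into the dags/ directory, a three-step ETL pipeline can be sketched as follows. This uses the Airflow 1.9-era API; the DAG name, task names, schedule, and commands are illustrative, not from the original notes:

```python
# /usr/local/airflow/dags/example_etl.py -- minimal DAG sketch (Airflow 1.9 API)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG('example_etl', default_args=default_args, schedule_interval='@daily')

extract = BashOperator(task_id='extract', bash_command='echo extract', dag=dag)
transform = BashOperator(task_id='transform', bash_command='echo transform', dag=dag)
load = BashOperator(task_id='load', bash_command='echo load', dag=dag)

# extract must finish before transform, and transform before load
extract >> transform >> load
```

The scheduler picks the file up automatically; the dependencies declared with `>>` are exactly what crontab alone cannot express.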
Setup notes and pitfalls (based on Airflow 1.9.0, Python 3.6.5):
MySQL setup (will serve as the metadata database):
# install build dependencies for the Python MySQL driver
> yum install gcc libffi-devel python-devel openssl-devel
# create the airflow database and account
mysql> create database airflow default charset utf8 collate utf8_general_ci;
mysql> create user airflow@'localhost' identified by 'airflow';
mysql> grant all on airflow.* to airflow@'localhost';
mysql> flush privileges;
Airflow installation and configuration:
# create the airflow home directory and point AIRFLOW_HOME at it
> mkdir -p /usr/local/airflow/{dags,logs,plugins}
> echo "export AIRFLOW_HOME=/usr/local/airflow" >> /etc/profile
> source /etc/profile
# install airflow (the PyPI package has been named apache-airflow since 1.8.1;
# the mysql extra pulls in the MySQL driver needed for the metadata database)
> pip3 install 'apache-airflow[mysql]==1.9.0'
# configure the metadata database connection
> vi /usr/local/airflow/airflow.cfg
# dialect+driver://username:password@host:port/database
sql_alchemy_conn = mysql://airflow:airflow@localhost:3306/airflow
# note: the mysql password is the one set when the account was created above
# note: the mysql socket path in use is socket=/var/lib/mysql/mysql.sock
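If `initdb` or the webserver fails to connect, it helps to first confirm the DSN itself parses as intended. A quick sanity check using only the standard library (values mirror the config above):

```python
# Sanity-check the sql_alchemy_conn DSN with the standard library only.
from urllib.parse import urlsplit

dsn = "mysql://airflow:airflow@localhost:3306/airflow"
parts = urlsplit(dsn)

assert parts.scheme == "mysql"               # dialect (driver defaults to MySQLdb)
assert parts.username == "airflow"
assert parts.password == "airflow"
assert parts.hostname == "localhost"
assert parts.port == 3306
assert parts.path.lstrip("/") == "airflow"   # database name
print("DSN looks well-formed")
```

This only validates the URL structure, not that MySQL is reachable or the credentials work.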
# initialize the metadata database (defaults to sqlite if unconfigured)
> airflow initdb
# start the web server (defaults to port 8080 if none is given)
> airflow webserver -p 8080
# add a firewall rule, or stop the firewall
> systemctl stop firewalld.service
# open the admin UI remotely
http://localhost:8080/admin/
Airflow service management (Supervisor does not install under Python 3.6.5, so it was installed with Python 2):
# install the process manager Supervisord to manage the airflow processes
> easy_install supervisor
> echo_supervisord_conf > /etc/supervisord.conf
# edit supervisord.conf and add the start commands
> vi /etc/supervisord.conf
[program:airflow_web]
command=/usr/bin/airflow webserver -p 8080
[program:airflow_scheduler]
command=/usr/bin/airflow scheduler
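In practice the two program blocks benefit from a few more options: auto-restart, an explicit AIRFLOW_HOME, and log paths. A sketch (the paths follow the layout used above and are an assumption, not from the original notes):

```ini
[program:airflow_web]
command=/usr/bin/airflow webserver -p 8080
environment=AIRFLOW_HOME="/usr/local/airflow"
autostart=true
autorestart=true
stdout_logfile=/usr/local/airflow/logs/webserver.log
stderr_logfile=/usr/local/airflow/logs/webserver.err

[program:airflow_scheduler]
command=/usr/bin/airflow scheduler
environment=AIRFLOW_HOME="/usr/local/airflow"
autostart=true
autorestart=true
stdout_logfile=/usr/local/airflow/logs/scheduler.log
stderr_logfile=/usr/local/airflow/logs/scheduler.err
```

Without `environment=`, supervisord may start airflow with a different AIRFLOW_HOME than your login shell exports.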
# start the supervisord service
> /usr/bin/supervisord -c /etc/supervisord.conf
# the airflow services can now be managed with supervisorctl:
supervisorctl start airflow_web
supervisorctl stop airflow_web
supervisorctl restart airflow_web
supervisorctl stop all
Authentication:
# install the password auth module
> pip install 'apache-airflow[password]'
# enable authentication for the web server
> vim /usr/local/airflow/airflow.cfg
[webserver]
authenticate = true
auth_backend = airflow.contrib.auth.backends.password_auth
# add an account from an interactive python session:
import airflow
from airflow import models, settings
from airflow.contrib.auth.backends.password_auth import PasswordUser
user = PasswordUser(models.User())
user.username = 'admin'
user.email = 'wjf20110627@163.com'
user.password = 'admin'
session = settings.Session()
session.add(user)
session.commit()
session.close()
exit()
# restart the airflow_web service
> supervisorctl restart airflow_web
Pitfalls encountered:
1. Error: 'airflow.www.gunicorn_config' doesn't exist
   Fix: run sudo pip3 install 'gunicorn==19.3.0'
2. ValueError: too many values to unpack
   Fix: run sudo -H pip3 install -U 'sqlalchemy==1.1.18'
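The two pins above, together with the airflow package itself, can be collected into one requirements file so the conflicts are not rediscovered one at a time (an assumed combination gathered from this walkthrough, not a tested set):

```
# requirements.txt -- version pins collected from this walkthrough (assumed set)
apache-airflow[mysql,password]==1.9.0
gunicorn==19.3.0
sqlalchemy==1.1.18
```

Install with `pip3 install -r requirements.txt` before running `airflow initdb`.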