In a previous article we already covered Airflow and how it works:
Hadoop components / Spark in practice / Airflow: an introduction to the Airflow scheduling tool with usage examples
The Scheduler, WebServer, and Worker processes have to be started separately. The Scheduler and WebServer can run on the same machine or on different ones, but you usually need many Workers, and deploying a fixed number of Workers means provisioning that many machines.
Running Airflow on Kubernetes removes this limitation: the Scheduler and WebServer run permanently, while Workers are started only when there are tasks to schedule. When nothing is running, the Airflow cluster consists of just the Scheduler and WebServer processes.
Choosing a design and architecture
Task interaction model: supporting non-resident workers
There are two ways for Airflow to interact with Kubernetes.
The KubernetesExecutor is the newer of the two; it was introduced in Airflow 1.10.
The KubernetesExecutor sits at the executor level (alongside LocalExecutor and CeleryExecutor), rather than at the operator level occupied by HiveOperator or PythonOperator.
It brings Kubernetes dependencies and context into the worker layer, and the corresponding configuration parameters must be set on the worker side.
It couples Airflow with Kubernetes more tightly.
The workflow with the KubernetesExecutor is: define the tasks in your DAG files as usual, set the executor to KubernetesExecutor, and configure the image, tag, and the resources to request from the Kubernetes cluster; the executor then runs each task directly in a container it creates.
The KubernetesExecutor works at task granularity: generally one pod is created per task, as in the sketch below.
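A minimal sketch of this setup, assuming executor = KubernetesExecutor is already set in airflow.cfg; the image name and resource values below are placeholders, and the executor_config keys follow the Airflow 1.10 convention:

from datetime import datetime
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def say_hello():
    print("hello from a KubernetesExecutor worker pod")

dag = DAG("k8s_executor_demo", start_date=datetime(2020, 2, 27), schedule_interval=None)

hello = PythonOperator(
    task_id="say_hello",
    python_callable=say_hello,
    dag=dag,
    # Per-task override of the worker pod: image, tag, and the resources to request.
    executor_config={
        "KubernetesExecutor": {
            "image": "my-registry/my-airflow:1.10.9",  # placeholder image
            "request_cpu": "200m",
            "request_memory": "256Mi",
        }
    },
)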
The other commonly used option is the KubernetesPodOperator.
The KubernetesPodOperator works at pod granularity: it only manages pod creation and does not manage individual tasks. In other words, Airflow uses Kubernetes' pod-creation capability to spin up a pod as our worker, running an image we specify; inside that pod we can run multiple tasks, or any other operator type such as Python or Hive.
We can pick whichever fits the business scenario.
For a brand-new workload, it is worth trying the KubernetesExecutor.
For an existing workload that already runs on PythonOperator and similar operators, the KubernetesPodOperator is the better fit: the business code stays untouched and only the external scheduling layer changes, as in the sketch below.
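For comparison, a minimal KubernetesPodOperator sketch (in Airflow 1.10.x the operator lives under airflow.contrib; the namespace and image below are placeholders, and the pod simply runs a python command in place of real business code):

from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

dag = DAG("k8s_pod_operator_demo", start_date=datetime(2020, 2, 27), schedule_interval=None)

run_etl = KubernetesPodOperator(
    task_id="run_etl",
    name="run-etl",
    namespace="default",            # placeholder namespace
    image="python:3.7",             # placeholder image; use the image holding your business code
    cmds=["python", "-c"],
    arguments=["print('hello from a pod created by Airflow')"],
    is_delete_operator_pod=True,    # remove the pod once the task finishes
    dag=dag,
)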
How to deploy the master
Using KubernetesPodOperator or KubernetesExecutor only covers creating and running workers.
We still need a master that keeps the Scheduler and WebServer running.
The master can live on a dedicated long-running server, or it can run as a pod as well.
For unified management, we deploy the master as a pod here.
Resident master and workers deployed on Kubernetes
If the data volume is small and there are not many tasks, resident pods can host both the master and the workers.
This article implements Airflow following that approach.
We rely on the kube-airflow project.
Download the project:
git clone git://www.github.com/mumoshu/kube-airflow.git
The project contains the following files:
zhangxiaofans-MacBook-Pro:kube-airflow joe$ ls
Dockerfile.template README.md config script
LICENSE airflow dags
Makefile airflow.all.yaml requirements
zhangxiaofans-MacBook-Pro:kube-airflow joe$
Dockerfile.template is the image definition for kube-airflow; the image is also published to Docker Hub, see docker-airflow for details.
The image extends the official Debian base image debian:stretch.
airflow.all.yaml is used to manually create the Kubernetes services and deployments for Airflow.
kube-airflow defaults to the CeleryExecutor + RabbitMQ mode; see airflow.cfg in the project repository for the configuration.
Deploy it with:
kubectl create -f airflow.all.yaml
This command creates the following deployments:
postgres
rabbitmq
airflow-webserver
airflow-scheduler
airflow-flower
airflow-worker
and the following services:
postgres
rabbitmq
airflow-webserver
airflow-flower
Once deployed, list the pods that were created:
kubectl get pods
Find the name of the web pod, e.g. web-796b7857b7-dt7nk.
Exec into the pod:
kubectl exec -ti web-796b7857b7-dt7nk -- bash
Run a test command:
# test print_date
airflow test tutorial print_date 2020-02-27
On success the output looks like:
airflow@web-796b7857b7-dt7nk:~$ airflow test tutorial print_date 2020-02-27
[2020-02-27 14:50:27,188] {__init__.py:57} INFO - Using executor CeleryExecutor
[2020-02-27 14:50:27,236] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/Grammar.txt
[2020-02-27 14:50:27,251] {driver.py:120} INFO - Generating grammar tables from /usr/lib/python2.7/lib2to3/PatternGrammar.txt
[2020-02-27 14:50:27,379] {models.py:167} INFO - Filling up the DagBag from /usr/local/airflow/dags
[2020-02-27 14:50:27,555] {models.py:1126} INFO - Dependencies all met for <TaskInstance: tutorial.print_date 2020-02-27 00:00:00 [success]>
[2020-02-27 14:50:27,557] {models.py:1126} INFO - Dependencies all met for <TaskInstance: tutorial.print_date 2020-02-27 00:00:00 [success]>
[2020-02-27 14:50:27,557] {models.py:1318} INFO -
--------------------------------------------------------------------------------
Starting attempt 1 of 2
--------------------------------------------------------------------------------
[2020-02-27 14:50:27,558] {models.py:1342} INFO - Executing <Task(BashOperator): print_date> on 2020-02-27 00:00:00
[2020-02-27 14:50:27,568] {bash_operator.py:71} INFO - tmp dir root location:
/tmp
[2020-02-27 14:50:27,568] {bash_operator.py:80} INFO - Temporary script location :/tmp/airflowtmpiBOkUx//tmp/airflowtmpiBOkUx/print_dateXQWqUY
[2020-02-27 14:50:27,569] {bash_operator.py:81} INFO - Running command: date
[2020-02-27 14:50:27,572] {bash_operator.py:90} INFO - Output:
[2020-02-27 14:50:27,574] {bash_operator.py:94} INFO - Thu Feb 27 14:50:27 UTC 2020
[2020-02-27 14:50:27,574] {bash_operator.py:97} INFO - Command exited with return code 0
airflow@web-796b7857b7-dt7nk:~$
More test commands:
# 1. List all active DAGs
airflow list_dags
# 2. List the task ids of the tutorial DAG
airflow list_tasks tutorial
# 3. List the task ids of the tutorial DAG as a tree
airflow list_tasks tutorial --tree
# 4. backfill runs every task that should have run within a time window
airflow backfill tutorial -s 2020-02-26 -e 2020-02-27
You can also use the mysql + LocalExecutor mode.
Add a YAML block like the following:
apiVersion: v1
kind: Pod
metadata:
name: airflow-mysql
labels:
name: airflow-mysql
spec:
containers:
- name: airflow-mysql
image: mysql:8.0.12
imagePullPolicy: IfNotPresent
ports:
- containerPort: 3306
env:
- name: MYSQL_SERVICE_HOST
value: "mysql"
- name: MYSQL_SERVICE_PORT
value: "3306"
- name: MYSQL_ROOT_PASSWORD
value: '123456'
- name: MYSQL_DATABASE
value: 'airflow'
- name: TZ
value: 'Asia/Shanghai'
- name: LANG
value: 'C.UTF-8'
---
apiVersion: v1
kind: Service
metadata:
name: airflow-mysql
spec:
selector:
name: airflow-mysql
ports:
- protocol: TCP
port: 3306
targetPort: 3306
---
And add the following to the worker's YAML:
- name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
value: "mysql+mysqldb://root:123456@mysql:3306/airflow"
so that it looks like:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: worker
spec:
replicas: 1
template:
metadata:
labels:
app: airflow
tier: worker
spec:
restartPolicy: Always
containers:
- name: worker
image: mumoshu/kube-airflow:1.8.0.0-1.6.1
# volumes:
# - /localpath/to/dags:/usr/local/airflow/dags
env:
- name: AIRFLOW_HOME
value: "/usr/local/airflow"
- name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
value: "mysql+mysqldb://root:123456@mysql:3306/airflow"
args: ["worker"]
The executor in airflow.cfg also needs to be changed to:
executor = LocalExecutor
Adding an Ingress to access the Airflow web UI
In airflow.all.yaml, change:
apiVersion: v1
kind: Service
metadata:
name: web
spec:
type: NodePort
selector:
app: airflow
tier: web
ports:
- name: web
protocol: TCP
port: 8080
targetPort: web
nodePort: 32080
to:
apiVersion: v1
kind: Service
metadata:
name: web
spec:
selector:
app: airflow
tier: web
ports:
- name: web
protocol: TCP
port: 8080
targetPort: 8080
and add:
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: web
namespace: default
spec:
rules:
-
host: api-beta.test.com
http:
paths:
- path: /
backend:
serviceName: web
servicePort: 8080
Delete the old web deployment and service, then recreate them with:
kubectl delete deployment web
kubectl delete service web
kubectl create -f airflow.all.yaml
Open api-beta.test.com in a browser and the Airflow web UI should appear (screenshot omitted).
Building a custom image from the Dockerfile
Sometimes we need automated CI/CD releases tailored to our own business.
For example, packaging the ETL .py files into the dags directory, or installing our own business packages.
Both cases require adjusting Dockerfile.template before publishing the image.
The steps are as follows:
cp Dockerfile.template Dockerfile
vi Dockerfile
Note the two variables:
ENV EMBEDDED_DAGS_LOCATION=%%EMBEDDED_DAGS_LOCATION%%
ENV REQUIREMENTS_TXT_LOCATION=%%REQUIREMENTS_TXT_LOCATION%%
These are local paths from the sample code; the environment variables must be set first, otherwise the build fails because the directories cannot be found.
Alternatively, comment them out so this part is not baked into the image:
# COPY ${REQUIREMENTS_TXT_LOCATION} /requirements/dags.txt
# COPY ${EMBEDDED_DAGS_LOCATION} ${AIRFLOW_HOME}/dags
and delete:
&& pip3 install -r /requirements/dags.txt \
Build the image and push it:
docker build -t spark-airflow:1.8.0 .
docker tag spark-airflow:1.8.0 <repo>/spark-airflow:1.8.0
docker push <repo>/spark-airflow:1.8.0
Possible error: ERROR: Package 'apache-airflow' requires a different Python: 3.5.3 not in '~=3.6'
This happens because the Dockerfile installs Python via the python3-dev package by default, which may pull in Python 3.5, a version Airflow does not accept, so the Dockerfile needs to be adjusted.
Change
RUN set -ex \
&& buildDeps=' \
build-essential \
libblas-dev \
libffi-dev \
libkrb5-dev \
liblapack-dev \
libpq-dev \
libsasl2-dev \
libssl-dev \
libxml2-dev \
libxslt1-dev \
python3-dev \
python3-pip \
zlib1g-dev \
by removing the python3 packages, so it becomes:
RUN set -ex \
&& buildDeps=' \
build-essential \
libblas-dev \
libffi-dev \
libkrb5-dev \
liblapack-dev \
libpq-dev \
libsasl2-dev \
libssl-dev \
libxml2-dev \
libxslt1-dev \
zlib1g-dev \
Then add commands that install Python 3.6:
&& apt-get install wget -yqq \
&& wget https://www.python.org/ftp/python/3.6.9/Python-3.6.9.tgz \
&& tar -xzvf Python-3.6.9.tgz \
&& cd Python-3.6.9/ \
&& ./configure --prefix=/usr/local/python36 \
&& make && make install \
&& chmod 777 -R /usr/local/python36 \
&& export PATH=/usr/local/python36/bin:$PATH \
Or add commands that install Python 3.7 instead:
&& apt-get install wget -yqq \
&& wget https://www.python.org/ftp/python/3.7.2/Python-3.7.2.tgz \
&& tar -zxvf Python-3.7.2.tgz \
&& cd Python-3.7.2/ \
&& ./configure --prefix=/usr/local/python37 \
&& make && make install \
&& chmod 777 -R /usr/local/python37 \
&& export PATH=/usr/local/python37/bin:$PATH \
Possible errors: airflow command not found, python3 command not found
The newly installed Python and Airflow are not found on the PATH.
Set the environment variables in entrypoint.sh:
vi script/entrypoint.sh
Add:
export PATH=/usr/local/python37/bin:$PATH
python3 -V
whereis python
export PATH=$PATH:$AIRFLOW_HOME
echo $PATH
Also add a symlink in the Dockerfile:
vi Dockerfile
Add:
RUN ln -sf /usr/bin/python /usr/bin/python3
Possible problem: relation "log" does not exist at character ..., relation "table_name" does not exist
airflow initdb must be run before the pod is used.
Check whether entrypoint.sh runs airflow initdb:
vi script/entrypoint.sh
If it does, check whether that step actually succeeded or threw an error.
Possible problem: airflow command error: argument subcommand: invalid choice: 'initdb'
The airflow entry in requirements/airflow.txt is not pinned to a version, so the latest release is pulled, i.e. Airflow 2.0.0.
The 2.0.0 CLI differs from 1.x.x, so the commands would need updating, for example
airflow initdb
becomes
airflow db init
However, that approach means changing quite a few commands; it is better to pin Airflow to the version kube-airflow targets, here 1.8.0.
Run:
vi requirements/airflow.txt
and change
git+https://github.com/apache/incubator-airflow#egg=airflow
to
git+https://github.com/apache/incubator-airflow@1.8.0#egg=airflow
requirements.txt supports the following git syntaxes.
In the examples below, package-two is installed from a GitHub repository; the text between @ and # specifies which revision of the package to install.
Pin a commit hash (41b95ec is the commit id; after switching to the branch on GitHub you can see "Latest commit 41b95ec on 20 Mar 2017" at the top right):
package-one==1.9.4
git+git://github.com/path/to/package-two@41b95ec#egg=package-two
package-three==1.0.1
Pin a branch name (master):
git+git://github.com/path/to/package-two@master#egg=package-two
Pin a tag (0.1):
git+git://github.com/path/to/package-two@0.1#egg=package-two
Pin a release (3.7.1):
git+git://github.com/path/to/package-two@releases/tag/v3.7.1#egg=package-two
Note that #egg=package-two is not a comment here; it explicitly states the package name.
Possible problem: async = [^ SyntaxError: invalid syntax
In Python 3.7 async is a reserved keyword, while Airflow 1.8.0 still uses async as a variable name.
In other words, Python 3.7 and Airflow 1.8.0 are incompatible.
airflow initdb imports the tenacity package (from tenacity.async import AsyncRetrying); since async became a keyword in Python 3.7, Airflow cannot run on it. See the related GitHub issue for details.
Workaround: pin tenacity==4.10.0.
Run:
vi requirements/airflow.txt
and add:
tenacity==4.10.0
Airflow's own setup.py and related files also have to stop using async as a variable name; it can be renamed to _async or anything else.
That requires changes in many places, so an alternative is to move to Airflow 1.10.7 or later, which is compatible with Python 3.7.
For details see [AIRFLOW-2716] Replace new Python 3.7 keywords:
https://github.com/apache/airflow/releases
Run:
vi requirements/airflow.txt
and change
git+https://github.com/apache/incubator-airflow#egg=airflow
to
git+https://github.com/apache/incubator-airflow@1.10.9#egg=airflow
Possible problem: ModuleNotFoundError: No module named '_bz2'
The compiled Python is missing the system bzip2 library.
To fix it without recompiling Python:
Locate the _bz2.cpython-36m-x86_64-linux-gnu.so file; a Python 3.6 installation usually ships it. If you cannot find one, you can use the copy I keep on Baidu Netdisk:
Link: https://pan.baidu.com/s/1AwAW028WOTlQTk8RuF9evg  Password: f2tu
If your Python version is 3.6, the 36m file name is the right one.
Run:
cp _bz2.cpython-36m-x86_64-linux-gnu.so /usr/local/python3/lib/python3.6/lib-dynload/
Mine is Python 3.7, so the file has to be renamed to 37m and copied into the Python 3 installation directory.
Run:
mv _bz2.cpython-36m-x86_64-linux-gnu.so _bz2.cpython-37m-x86_64-linux-gnu.so
cp _bz2.cpython-37m-x86_64-linux-gnu.so /usr/local/python3/lib/python3.7/lib-dynload/
Changing the timezone to China Standard Time
First, the image's own timezone has to be set to China Standard Time.
Add one of the following to the Dockerfile, depending on the base system:
# CentOS
RUN echo "Asia/Shanghai" > /etc/timezone;
# Ubuntu
RUN cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
# debian
ENV TZ=Asia/Shanghai
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
After the pods start, exec into one and verify the timezone:
[dev@test-kops-jump kube-airflow]$ kubectl get pods |grep airflow
airflow-flower-5f49695d96-mrmld 1/1 Running 0 53s
airflow-ps-5d5bb5cb88-mggmp 1/1 Running 0 53s
airflow-rabbitmq-7984cf97ff-c7pjt 1/1 Running 0 53s
airflow-scheduler-575ccf69d9-bchvq 1/1 Running 0 53s
airflow-web-68cffc7664-52xzm 1/1 Running 0 53s
airflow-worker-9cf6cb6dd-ndlg8 1/1 Running 0 53s
[dev@test-kops-jump kube-airflow]$
[dev@test-kops-jump kube-airflow]$ kubectl exec -ti airflow-web-68cffc7664-52xzm -- bash
airflow@airflow-web-68cffc7664-52xzm:~$ date -R
Sat, 07 Mar 2020 15:55:11 +0800
Airflow 1.10.9 is used here; other versions are broadly similar and the same changes apply.
First locate Airflow's installation directory; it depends on how Airflow was installed. If you installed manually from the GitHub source, it is wherever you put it.
If it was installed with pip, it usually sits in Python's site-packages directory, for example mine is:
airflow@airflow-web-77b59b7cc5-swhtq:~$ ls /usr/local/python37/lib/python3.7/site-packages/airflow/
alembic.ini contrib/ executors/ logging_config.py plugins_manager.py serialization/ _vendor/
api/ dag/ hooks/ macros/ __pycache__/ settings.py version.py
bin/ default_login.py __init__.py migrations/ security/ task/ www/
config_templates/ example_dags/ jobs/ models/ sensors/ ti_deps/ www_rbac/
configuration.py exceptions.py lineage/ operators/ sentry.py utils/
The airflow.cfg file lives in the configured Airflow home directory, for example mine is at:
airflow@airflow-web-77b59b7cc5-jllpj:~$ ls /usr/local/airflow/
airflow.cfg airflow-webserver.pid entrypoint.sh git-sync hail_oper_base.log logs unittests.cfg
airflow@airflow-web-77b59b7cc5-jllpj:~$
Add the timezone to airflow.cfg
Run:
cd /usr/local/airflow/
vi airflow.cfg
and set:
default_timezone = Asia/Shanghai
Modify airflow/utils/timezone.py
Run:
cd /usr/local/python37/lib/python3.7/site-packages/airflow/
vi utils/timezone.py
Below the line utc = pendulum.timezone('UTC') (line 27), add:
from airflow import configuration as conf
try:
    tz = conf.get("core", "default_timezone")
    if tz == "system":
        utc = pendulum.local_timezone()
    else:
        utc = pendulum.timezone(tz)
except Exception:
    pass
Modify the utcnow() function (around line 69):
original: d = dt.datetime.utcnow()
change it to: d = dt.datetime.now()
Modify airflow/utils/sqlalchemy.py
Run:
cd /usr/local/python37/lib/python3.7/site-packages/airflow/
vi utils/sqlalchemy.py
Below the line utc = pendulum.timezone('UTC') (line 37), add:
from airflow import configuration as conf
try:
    tz = conf.get("core", "default_timezone")
    if tz == "system":
        utc = pendulum.local_timezone()
    else:
        utc = pendulum.timezone(tz)
except Exception:
    pass
Comment out cursor.execute("SET time_zone = '+00:00'") in airflow/utils/sqlalchemy.py (line 66).
Modify airflow/www/templates/admin/master.html (line 31)
Run:
cd /usr/local/python37/lib/python3.7/site-packages/airflow/
vi www/templates/admin/master.html
Changes:
change var UTCseconds = (x.getTime() + x.getTimezoneOffset()*60*1000);
to var UTCseconds = x.getTime();
and change "timeFormat":"H:i:s %UTC%",
to "timeFormat":"H:i:s",
Modify airflow/models/dag.py (note: only needed when the image timezone is not China Standard Time)
Run:
cd /usr/local/python37/lib/python3.7/site-packages/airflow/
vi models/dag.py
Change: find class DagModel(Base): in the file and add a utc2local method to the class, as follows:
class DagModel(Base):
    def utc2local(self, utc):
        import time
        epoch = time.mktime(utc.timetuple())
        print("timestamp " + str(epoch))
        result_time_8_hours_s = datetime.fromtimestamp(epoch) - datetime.utcfromtimestamp(epoch)
        time_result = utc + result_time_8_hours_s
        print("converted timestamp " + str(time_result))
        return time_result
Modify airflow/www/templates/airflow/dags.html (note: only needed when the image timezone is not China Standard Time)
Run:
cd /usr/local/python37/lib/python3.7/site-packages/airflow/
vi www/templates/airflow/dags.html
Change
last_run.execution_date.strftime("%Y-%m-%d %H:%M")
and
last_run.start_date.strftime("%Y-%m-%d %H:%M")
respectively to:
dag.utc2local(last_run.execution_date).strftime("%Y-%m-%d %H:%M")
and
dag.utc2local(last_run.start_date).strftime("%Y-%m-%d %H:%M")
Restart the webserver so the changes take effect:
su airflow
ps -ef|egrep 'airflow-webserver'|grep -v grep|awk '{print $2}'|xargs kill -9
rm -rf /home/airflow/airflow/airflow-scheduler.pid
airflow webserver -p 8080 -D
More restart commands for reference.
Restart the webserver and scheduler:
su airflow
ps -ef|egrep 'scheduler|airflow-webserver'|grep -v grep|awk '{print $2}'|xargs kill -9
rm -rf /home/airflow/airflow/airflow-scheduler.pid
airflow webserver -p 8080 -D
airflow scheduler -D
tail -f /home/airflow/airflow/airflow-scheduler.err
Restart the worker:
su airflow
ps -ef|egrep 'serve_logs|celeryd'|grep -v grep
rm -rf /home/airflow/airflow/airflow-worker.pid
airflow worker -D
tail -f /home/airflow/airflow/airflow-worker.err   # if nothing is printed, everything is fine
Hiding the official example DAGs
Edit the airflow.cfg setting (around line 97) under airflow_home (airflow_home = /root/airflow):
load_examples = False
Delete airflow.db under airflow_home (airflow_home = /root/airflow) and re-initialize:
airflow initdb
Accessing the PostgreSQL database
Exec into the pod:
kubectl exec -ti airflow-ps-5d5bb5cb88-s8z7f -- bash
Connect to PostgreSQL inside the pod:
root@airflow-ps-5d5bb5cb88-s8z7f:/# which psql
/usr/bin/psql
root@airflow-ps-5d5bb5cb88-s8z7f:/# /usr/bin/psql -h localhost -U airflow -d airflow
psql (12.2 (Debian 12.2-2.pgdg100+1))
Type "help" for help.
airflow=#
Common PostgreSQL commands
Connect to a database (the default user and database are both postgres):
psql -U user -d dbname
Switch database (like MySQL's use dbname):
\c dbname
List databases (like MySQL's show databases):
\l
List tables (like MySQL's show tables):
\dt
Describe a table (like desc tblname or show columns from tbname):
\d tblname
\di lists indexes
Create a database:
create database [dbname];
Drop a database:
drop database [dbname];
*Rename a table:
alter table [table_a] rename to [table_b];
*Drop a table:
drop table [table_name];
*Add a column to an existing table:
alter table [table_name] add column [column_name] [type];
*Drop a column:
alter table [table_name] drop column [column_name];
*Rename a column:
alter table [table_name] rename column [column_a] to [column_b];
*Set a default value for a column:
alter table [table_name] alter column [column_name] set default [new_default];
*Remove a default value:
alter table [table_name] alter column [column_name] drop default;
Insert rows into a table:
insert into [table_name] ([column_m],[column_n],...) values ([value_m],[value_n],...);
Update a value in a row:
update [table_name] set [target_column]=[target_value] where [row_condition];
Delete rows from a table:
delete from [table_name] where [row_condition];
delete from [table_name]; -- empties the whole table
Create a table:
create table [table_name] ([column_1] [type_1] <references related_table(related_column)>, [column_2] [type_2], ... <, primary key (column_m, column_n, ...)>);
\copyright   show PostgreSQL usage and distribution terms
\encoding [encoding]   show or set the client character encoding
\h [name]   help on SQL command syntax; use * to list all commands
\prompt [text] name   prompt the user to set an internal variable
\password [USERNAME]   securely change the password for a user
\q   quit psql
Possible problem: Failed to fetch log file from worker.
Clicking a task's log in the web UI fails: the webserver cannot connect to the worker to fetch the remote log.
Fix: add a Service for the worker in the deployment YAML, as follows:
apiVersion: v1
kind: Service
metadata:
name: worker
spec:
type: NodePort
selector:
app: airflow
tier: worker
ports:
- name: worker
protocol: TCP
port: 8793
targetPort: worker
nodePort: 32082
---
The worker Deployment also needs:
hostname: worker
ports:
- name: airflow-worker
containerPort: 8793
For details see:
https://github.com/mumoshu/kube-airflow/issues/23
For the exact layout, see the final working YAML at the end of this article.
Possible problem: some task logs cannot be viewed: No host supplied
Viewing the log in the web UI produces the following error output:
*** Log file does not exist: /usr/local/airflow/logs/user_quality/user_quality/2020-03-08T09:09:00+00:00/1.log
*** Fetching from: http://:8793/log/user_quality/user_quality/2020-03-08T09:09:00+00:00/1.log
*** Failed to fetch log file from worker. Invalid URL 'http://:8793/log/user_quality/user_quality/2020-03-08T09:09:00+00:00/1.log': No host supplied
Solution
I have not yet found why the host is missing from the log-fetch URL when the worker runs.
The only workaround is a different approach, such as storing the logs on S3.
This can be configured in config/airflow.cfg:
# The folder where airflow should store its log files. This location
base_log_folder = /usr/local/airflow/logs
# Airflow can store logs remotely in AWS S3 or Google Cloud Storage. Users
# must supply a remote location URL (starting with either 's3://...' or
# 'gs://...') and an Airflow connection id that provides access to the storage
# location.
remote_logging = True
remote_base_log_folder = s3://beta-env/tmp/sparktest/
remote_log_conn_id = my_conn_S3
# Use server-side encryption for logs stored in S3
encrypt_s3_logs = False
my_conn_S3 is an arbitrary connection name and can even be left empty; what matters is injecting the three AWS S3 variables into the environment.
When building the image I set them in the Dockerfile:
ENV AWS_ACCESS_KEY_ID AKI123
ENV AWS_DEFAULT_REGION cn-northwest-1
ENV AWS_SECRET_ACCESS_KEY FmPTa12343
Note that with the S3 approach the logs are uploaded to S3, and therefore become visible, only after the task finishes, whether it succeeds or fails.
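Alternatively, instead of relying only on environment variables, the my_conn_S3 connection can be registered in the metadata database. A rough sketch for Airflow 1.10, run once inside the webserver pod; the credential values are the same placeholders used above:

from airflow import settings
from airflow.models import Connection

# Register an S3 connection named my_conn_S3 with placeholder credentials.
conn = Connection(
    conn_id="my_conn_S3",
    conn_type="s3",
    extra='{"aws_access_key_id": "AKI123", "aws_secret_access_key": "FmPTa12343"}',
)
session = settings.Session()
session.add(conn)
session.commit()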
Final working files for reference
Dockerfile
# VERSION 1.8.0.0
# AUTHOR: Yusuke KUOKA
# DESCRIPTION: Docker image to run Airflow on Kubernetes which is capable of creating Kubernetes jobs
# BUILD: docker build --rm -t mumoshu/kube-airflow
# SOURCE: https://github.com/mumoshu/kube-airflow
FROM debian:stretch
MAINTAINER Yusuke KUOKA <ykuoka@gmail.com>
# Never prompts the user for choices on installation/configuration of packages
ENV DEBIAN_FRONTEND noninteractive
ENV TERM linux
# Airflow
ARG AIRFLOW_VERSION=1.8.0
ENV POSTGRES_HOST airflow-ps
ENV RABBITMQ_HOST airflow-rabbitmq
ENV AIRFLOW_HOME /usr/local/airflow
ENV EMBEDDED_DAGS_LOCATION=%%EMBEDDED_DAGS_LOCATION%%
ENV REQUIREMENTS_TXT_LOCATION=%%REQUIREMENTS_TXT_LOCATION%%
# Define en_US.
ENV LANGUAGE en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LC_CTYPE en_US.UTF-8
ENV LC_MESSAGES en_US.UTF-8
ENV LC_ALL en_US.UTF-8
WORKDIR /requirements
# Only copy needed files
COPY requirements/airflow.txt /requirements/airflow.txt
# COPY ${REQUIREMENTS_TXT_LOCATION} /requirements/dags.txt
RUN set -ex \
&& buildDeps=' \
build-essential \
libblas-dev \
libffi-dev \
libkrb5-dev \
liblapack-dev \
libpq-dev \
libsasl2-dev \
libssl-dev \
libxml2-dev \
libxslt1-dev \
zlib1g-dev \
' \
&& apt-get update -yqq \
&& apt-get upgrade -yqq \
&& apt-get install -yqq --no-install-recommends \
$buildDeps \
apt-utils \
curl \
git \
locales \
netcat \
bzip2 \
&& sed -i 's/^# en_US.UTF-8 UTF-8$/en_US.UTF-8 UTF-8/g' /etc/locale.gen \
&& apt-get install wget -yqq \
&& wget https://www.python.org/ftp/python/3.7.2/Python-3.7.2.tgz \
&& tar -zxvf Python-3.7.2.tgz \
&& cd Python-3.7.2/ \
&& ./configure --prefix=/usr/local/python37 \
&& make && make install \
&& chmod 777 -R /usr/local/python37 \
&& export PATH=/usr/local/python37/bin:$PATH \
&& locale-gen \
&& update-locale LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 \
&& useradd -ms /bin/bash -d ${AIRFLOW_HOME} airflow \
&& pip3 install --upgrade pip 'setuptools!=36.0.0' \
&& if [ ! -e /usr/bin/pip ]; then ln -s /usr/bin/pip3 /usr/bin/pip ; fi \
&& if [ ! -e /usr/bin/python ]; then ln -sf /usr/bin/python3 /usr/bin/python; fi \
&& pip3 install wheel \
&& pip3 install hail \
&& pip3 install -r /requirements/airflow.txt \
# && pip3 install -r /requirements/dags.txt \
&& pip3 install pydigree==0.0.3 --no-cache-dir \
&& apt-get remove --purge -yqq $buildDeps libpq-dev \
&& apt-get clean \
&& rm -rf \
/var/lib/apt/lists/* \
/tmp/* \
/var/tmp/* \
/usr/share/man \
/usr/share/doc \
/usr/share/doc-base
ENV KUBECTL_VERSION %%KUBECTL_VERSION%%
RUN curl -L -o /usr/local/bin/kubectl \
https://storage.googleapis.com/kubernetes-release/release/v${KUBECTL_VERSION}/bin/linux/amd64/kubectl \
&& chmod +x /usr/local/bin/kubectl
COPY script/entrypoint.sh ${AIRFLOW_HOME}/entrypoint.sh
COPY config/airflow.cfg ${AIRFLOW_HOME}/airflow.cfg
COPY script/git-sync ${AIRFLOW_HOME}/git-sync
COPY _bz2.cpython-37m-x86_64-linux-gnu.so /usr/local/python37/lib/python3.7/lib-dynload/
RUN chown -R airflow: ${AIRFLOW_HOME} \
&& chmod +x ${AIRFLOW_HOME}/entrypoint.sh \
&& chmod +x ${AIRFLOW_HOME}/git-sync
RUN mv /etc/apt/sources.list /etc/apt/sources.list.bak \
&& echo "deb http://mirrors.aliyun.com/debian/ stretch main non-free contrib" >> /etc/apt/sources.list \
&& echo "deb-src http://mirrors.aliyun.com/debian/ stretch main non-free contrib" >>/etc/apt/sources.list \
&& echo "deb http://mirrors.aliyun.com/debian-security stretch/updates main" >>/etc/apt/sources.list \
&& echo "deb-src http://mirrors.aliyun.com/debian-security stretch/updates main" >>/etc/apt/sources.list \
&& echo "deb http://mirrors.aliyun.com/debian/ stretch-updates main non-free contrib" >> /etc/apt/sources.list \
&& echo "deb-src http://mirrors.aliyun.com/debian/ stretch-updates main non-free contrib" >>/etc/apt/sources.list \
&& echo "deb http://mirrors.aliyun.com/debian/ stretch-backports main non-free contrib" >>/etc/apt/sources.list \
&& echo "deb-src http://mirrors.aliyun.com/debian/ stretch-backports main non-free contrib" >>/etc/apt/sources.list \
&& apt-get update -y \
&& apt-get install vim -y
RUN export PATH=/usr/local/python37/bin:$PATH \
&& pip install apache-airflow[s3] \
&& pip install apache-airflow[log] \
&& pip install awscli --upgrade --user -i https://mirrors.aliyun.com/pypi/simple/ \
&& pip install py4j -i https://mirrors.aliyun.com/pypi/simple/ \
&& pip install s3fs==0.4.0 -i https://mirrors.aliyun.com/pypi/simple/
COPY airflow.cfg ${AIRFLOW_HOME}/airflow.cfg
COPY master.html /usr/local/python37/lib/python3.7/site-packages/airflow/www/templates/admin/master.html
COPY sqlalchemy.py /usr/local/python37/lib/python3.7/site-packages/airflow/utils/sqlalchemy.py
COPY timezone.py /usr/local/python37/lib/python3.7/site-packages/airflow/utils/timezone.py
COPY entrypoint.sh ${AIRFLOW_HOME}/entrypoint.sh
RUN chmod 777 -R ${AIRFLOW_HOME}/entrypoint.sh
RUN ln -sf /usr/bin/python /usr/bin/python3
EXPOSE 8080 5555 8793
ENV TZ=Asia/Shanghai
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
USER airflow
WORKDIR ${AIRFLOW_HOME}
ENTRYPOINT ["./entrypoint.sh"]
script/entrypoint.sh
#!/usr/bin/env bash
ls /usr/local/python37/lib/python3.7/site-packages/etl/dags
export PATH=/usr/local/python37/bin:$PATH
python3 -V
whereis python
export PATH=$PATH:$AIRFLOW_HOME
echo $PATH
echo 'export PYSPARK_SUBMIT_ARGS="--master k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT} --deploy-mode client pyspark-shell" ' >> ~/.bashrc
source ~/.bashrc
echo "hail env:"
echo $PYSPARK_SUBMIT_ARGS
CMD="airflow"
TRY_LOOP="${TRY_LOOP:-10}"
POSTGRES_HOST="${POSTGRES_HOST:-postgres}"
POSTGRES_PORT=5432
POSTGRES_CREDS="${POSTGRES_CREDS:-airflow:airflow}"
RABBITMQ_HOST="${RABBITMQ_HOST:-rabbitmq}"
RABBITMQ_CREDS="${RABBITMQ_CREDS:-airflow:airflow}"
RABBITMQ_MANAGEMENT_PORT=15672
FLOWER_URL_PREFIX="${FLOWER_URL_PREFIX:-}"
AIRFLOW_URL_PREFIX="${AIRFLOW_URL_PREFIX:-}"
LOAD_DAGS_EXAMPLES="${LOAD_DAGS_EXAMPLES:-true}"
GIT_SYNC_REPO="${GIT_SYNC_REPO:-}"
if [ -z $FERNET_KEY ]; then
FERNET_KEY=$(python3 -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)")
fi
echo "Postgres host: $POSTGRES_HOST"
echo "RabbitMQ host: $RABBITMQ_HOST"
echo "Load DAG examples: $LOAD_DAGS_EXAMPLES"
echo "Git sync repository: $GIT_SYNC_REPO"
echo
# Generate Fernet key
sed -i "s/{{ FERNET_KEY }}/${FERNET_KEY}/" $AIRFLOW_HOME/airflow.cfg
sed -i "s/{{ POSTGRES_HOST }}/${POSTGRES_HOST}/" $AIRFLOW_HOME/airflow.cfg
sed -i "s/{{ POSTGRES_CREDS }}/${POSTGRES_CREDS}/" $AIRFLOW_HOME/airflow.cfg
sed -i "s/{{ RABBITMQ_HOST }}/${RABBITMQ_HOST}/" $AIRFLOW_HOME/airflow.cfg
sed -i "s/{{ RABBITMQ_CREDS }}/${RABBITMQ_CREDS}/" $AIRFLOW_HOME/airflow.cfg
sed -i "s/{{ LOAD_DAGS_EXAMPLES }}/${LOAD_DAGS_EXAMPLES}/" $AIRFLOW_HOME/airflow.cfg
sed -i "s#{{ FLOWER_URL_PREFIX }}#${FLOWER_URL_PREFIX}#" $AIRFLOW_HOME/airflow.cfg
sed -i "s#{{ AIRFLOW_URL_PREFIX }}#${AIRFLOW_URL_PREFIX}#" $AIRFLOW_HOME/airflow.cfg
# wait for rabbitmq
if [ "$1" = "webserver" ] || [ "$1" = "worker" ] || [ "$1" = "scheduler" ] || [ "$1" = "flower" ] ; then
j=0
while ! curl -sI -u $RABBITMQ_CREDS http://$RABBITMQ_HOST:$RABBITMQ_MANAGEMENT_PORT/api/whoami |grep '200 OK'; do
j=`expr $j + 1`
if [ $j -ge $TRY_LOOP ]; then
echo "$(date) - $RABBITMQ_HOST still not reachable, giving up"
exit 1
fi
echo "$(date) - waiting for RabbitMQ... $j/$TRY_LOOP"
sleep 5
done
fi
# wait for postgres
if [ "$1" = "webserver" ] || [ "$1" = "worker" ] || [ "$1" = "scheduler" ] ; then
i=0
while ! nc $POSTGRES_HOST $POSTGRES_PORT >/dev/null 2>&1 < /dev/null; do
i=`expr $i + 1`
if [ $i -ge $TRY_LOOP ]; then
echo "$(date) - ${POSTGRES_HOST}:${POSTGRES_PORT} still not reachable, giving up"
exit 1
fi
echo "$(date) - waiting for ${POSTGRES_HOST}:${POSTGRES_PORT}... $i/$TRY_LOOP"
sleep 5
done
# TODO: move to a Helm hook
# https://github.com/kubernetes/helm/blob/master/docs/charts_hooks.md
if [ "$1" = "webserver" ]; then
echo "Initialize database..."
$CMD initdb
fi
fi
if [ ! -z $GIT_SYNC_REPO ]; then
mkdir -p $AIRFLOW_HOME/dags
# remove possible embedded dags to avoid conflicts
rm -rf $AIRFLOW_HOME/dags/*
echo "Executing background task git-sync on repo $GIT_SYNC_REPO"
$AIRFLOW_HOME/git-sync --dest $AIRFLOW_HOME/dags --force &
fi
$CMD "$@"
config/airflow.cfg
[core]
# The home folder for airflow, default is ~/airflow
airflow_home = /usr/local/airflow
# The folder where your airflow pipelines live, most likely a
# subfolder in a code repository
# dags_folder = /usr/local/airflow/dags
dags_folder = /usr/local/python37/lib/python3.7/site-packages/etl/dags
# The folder where airflow should store its log files. This location
base_log_folder = /usr/local/airflow/logs
# Airflow can store logs remotely in AWS S3 or Google Cloud Storage. Users
# must supply a remote location URL (starting with either 's3://...' or
# 'gs://...') and an Airflow connection id that provides access to the storage
# location.
remote_logging = True
remote_base_log_folder = s3://etl/tmp/spark-log/
remote_log_conn_id =
# Use server-side encryption for logs stored in S3
encrypt_s3_logs = False
# deprecated option for remote log storage, use remote_base_log_folder instead!
# s3_log_folder =
# The executor class that airflow should use. Choices include
# SequentialExecutor, LocalExecutor, CeleryExecutor
executor = CeleryExecutor
# The SqlAlchemy connection string to the metadata database.
# SqlAlchemy supports many different database engine, more information
# their website
sql_alchemy_conn = postgresql+psycopg2://{{ POSTGRES_CREDS }}@{{ POSTGRES_HOST }}/airflow
# The SqlAlchemy pool size is the maximum number of database connections
# in the pool.
sql_alchemy_pool_size = 5
# The SqlAlchemy pool recycle is the number of seconds a connection
# can be idle in the pool before it is invalidated. This config does
# not apply to sqlite.
sql_alchemy_pool_recycle = 3600
# The amount of parallelism as a setting to the executor. This defines
# the max number of task instances that should run simultaneously
# on this airflow installation
parallelism = 32
# The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
# Are DAGs paused by default at creation
dags_are_paused_at_creation = False
# The maximum number of active DAG runs per DAG
max_active_runs_per_dag = 16
# Whether to load the examples that ship with Airflow. It's good to
# get started, but you probably want to set this to False in a production
# environment
load_examples = false
# Where your Airflow plugins are stored
plugins_folder = /usr/local/airflow/plugins
# Secret key to save connection passwords in the db
fernet_key = {{ FERNET_KEY }}
# Whether to disable pickling dags
donot_pickle = False
# How long before timing out a python file import while filling the DagBag
dagbag_import_timeout = 30
[webserver]
# The base url of your website as airflow cannot guess what domain or
# cname you are using. This is use in automated emails that
# airflow sends to point links to the right web server
base_url = http://localhost:8080
# The ip specified when starting the web server
web_server_host = 0.0.0.0
# Root URL to use for the web server
web_server_url_prefix = {{ AIRFLOW_URL_PREFIX }}
# The port on which to run the web server
web_server_port = 8080
# The time the gunicorn webserver waits before timing out on a worker
web_server_worker_timeout = 120
# Secret key used to run your flask app
secret_key = temporary_key
# Number of workers to run the Gunicorn web server
workers = 1
# The worker class gunicorn should use. Choices include
# sync (default), eventlet, gevent
worker_class = sync
# Expose the configuration file in the web server
expose_config = true
# Set to true to turn on authentication : http://pythonhosted.org/airflow/installation.html#web-authentication
authenticate = False
# Filter the list of dags by owner name (requires authentication to be enabled)
filter_by_owner = False
[email]
email_backend = airflow.utils.send_email_smtp
[smtp]
# If you want airflow to send emails on retries, failure, and you want to
# the airflow.utils.send_email function, you have to configure an smtp
# server here
smtp_host = localhost
smtp_starttls = True
smtp_ssl = False
smtp_user = airflow
smtp_port = 25
smtp_password = airflow
smtp_mail_from = airflow@airflow.local
[celery]
# This section only applies if you are using the CeleryExecutor in
# [core] section above
# The app name that will be used by celery
celery_app_name = airflow.executors.celery_executor
# The concurrency that will be used when starting workers with the
# "airflow worker" command. This defines the number of task instances that
# a worker will take, so size up your workers based on the resources on
# your worker box and the nature of your tasks
celeryd_concurrency = 16
# When you start an airflow worker, airflow starts a tiny web server
# subprocess to serve the workers local log files to the airflow main
# web server, who then builds pages and sends them to users. This defines
# the port on which the logs are served. It needs to be unused, and open
# visible from the main web server to connect into the workers.
# worker_log_server_port = 8793
# The Celery broker URL. Celery supports RabbitMQ, Redis and experimentally
# a sqlalchemy database. Refer to the Celery documentation for more
# information.
broker_url = amqp://{{ RABBITMQ_CREDS }}@{{ RABBITMQ_HOST }}:5672/airflow
# Another key Celery setting
celery_result_backend = amqp://{{ RABBITMQ_CREDS }}@{{ RABBITMQ_HOST }}:5672/airflow
# Celery Flower is a sweet UI for Celery. Airflow has a shortcut to start
# it `airflow flower`. This defines the port that Celery Flower runs on
flower_port = 5555
# The root URL for Flower
flower_url_prefix = {{ FLOWER_URL_PREFIX }}
# Default queue that tasks get assigned to and that worker listen on.
default_queue = default
[scheduler]
# Task instances listen for external kill signal (when you clear tasks
# from the CLI or the UI), this defines the frequency at which they should
# listen (in seconds).
job_heartbeat_sec = 5
# The scheduler constantly tries to trigger new tasks (look at the
# scheduler section in the docs for more information). This defines
# how often the scheduler should run (in seconds).
scheduler_heartbeat_sec = 5
# Statsd (https://github.com/etsy/statsd) integration settings
# statsd_on = False
# statsd_host = localhost
# statsd_port = 8125
# statsd_prefix = airflow
# The scheduler can run multiple threads in parallel to schedule dags.
# This defines how many threads will run. However airflow will never
# use more threads than the amount of cpu cores available.
max_threads = 2
[mesos]
# Mesos master address which MesosExecutor will connect to.
master = localhost:5050
# The framework name which Airflow scheduler will register itself as on mesos
framework_name = Airflow
# Number of cpu cores required for running one task instance using
# 'airflow run <dag_id> <task_id> <execution_date> --local -p <pickle_id>'
# command on a mesos slave
task_cpu = 1
# Memory in MB required for running one task instance using
# 'airflow run <dag_id> <task_id> <execution_date> --local -p <pickle_id>'
# command on a mesos slave
task_memory = 256
# Enable framework checkpointing for mesos
# See http://mesos.apache.org/documentation/latest/slave-recovery/
checkpoint = False
# Failover timeout in milliseconds.
# When checkpointing is enabled and this option is set, Mesos waits
# until the configured timeout for
# the MesosExecutor framework to re-register after a failover. Mesos
# shuts down running tasks if the
# MesosExecutor framework fails to re-register within this timeframe.
# failover_timeout = 604800
# Enable framework authentication for mesos
# See http://mesos.apache.org/documentation/latest/configuration/
authenticate = False
# Mesos credentials, if authentication is enabled
# default_principal = admin
# default_secret = admin
default_timezone = Asia/Shanghai
yaml
apiVersion: v1
kind: Service
metadata:
name: airflow-worker
namespace: airflow
spec:
clusterIP: None
selector:
app: airflow-worker
ports:
- name: airflow-worker
protocol: TCP
port: 8888
targetPort: 8888
---
apiVersion: v1
kind: Service
metadata:
name: airflow-ps
namespace: airflow
spec:
type: ClusterIP
selector:
app: airflow
tier: db
ports:
- name: airflow-ps
protocol: TCP
port: 5432
targetPort: airflow-ps
---
apiVersion: v1
kind: Service
metadata:
name: airflow-rabbitmq
namespace: airflow
spec:
type: ClusterIP
selector:
app: airflow
tier: airflow-rabbitmq
ports:
- name: node
protocol: TCP
port: 5672
targetPort: node
- name: management
protocol: TCP
port: 15672
targetPort: management
---
apiVersion: v1
kind: Service
metadata:
name: airflow-web
namespace: airflow
spec:
selector:
app: airflow
tier: airflow-web
ports:
- name: airflow-web
protocol: TCP
port: 8080
targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
name: airflow-flower
namespace: airflow
spec:
type: NodePort
selector:
app: airflow
tier: airflow-flower
ports:
- name: airflow-flower
protocol: TCP
port: 5555
targetPort: airflow-flower
nodePort: 32081
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: airflow-ps
namespace: airflow
spec:
replicas: 1
template:
metadata:
labels:
app: airflow
tier: db
spec:
containers:
- name: airflow-ps
image: postgres
ports:
- name: airflow-ps
containerPort: 5432
env:
- name: POSTGRES_USER
value: "airflow"
- name: POSTGRES_PASSWORD
value: "airflow"
- name: POSTGRES_DB
value: "airflow"
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: airflow-rabbitmq
namespace: airflow
spec:
replicas: 1
template:
metadata:
labels:
app: airflow
tier: airflow-rabbitmq
spec:
restartPolicy: Always
containers:
- name: airflow-rabbitmq
image: rabbitmq:3-management
ports:
- name: management
containerPort: 15672
- name: node
containerPort: 5672
env:
- name: RABBITMQ_DEFAULT_USER
value: airflow
- name: RABBITMQ_DEFAULT_PASS
value: airflow
- name: RABBITMQ_DEFAULT_VHOST
value: airflow
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: airflow-web
namespace: airflow
spec:
replicas: 1
template:
metadata:
labels:
app: airflow
tier: airflow-web
spec:
restartPolicy: Always
containers:
- name: airflow-web
image: 123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark/etl-airflow:1.10.9-prod-_TAG_
imagePullPolicy: Always
env:
- name: AIRFLOW_HOME
value: "/usr/local/airflow"
ports:
- name: airflow-web
containerPort: 8080
args: ["webserver"]
livenessProbe:
httpGet:
path: /
port: 8080
initialDelaySeconds: 240
periodSeconds: 60
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: airflow-flower
namespace: airflow
spec:
replicas: 1
template:
metadata:
labels:
app: airflow
tier: airflow-flower
spec:
restartPolicy: Always
containers:
- name: airflow-flower
image: 123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark/etl-airflow:1.10.9-prod-_TAG_
imagePullPolicy: Always
env:
- name: AIRFLOW_HOME
value: "/usr/local/airflow"
- name: FLOWER_PORT
value: "5555"
ports:
- name: airflow-flower
containerPort: 5555
args: ["flower"]
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: airflow-scheduler
namespace: airflow
spec:
replicas: 1
template:
metadata:
labels:
app: airflow
tier: airflow-scheduler
spec:
restartPolicy: Always
containers:
- name: airflow-scheduler
image: 123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark/etl-airflow:1.10.9-prod-_TAG_
imagePullPolicy: Always
env:
- name: AIRFLOW_HOME
value: "/usr/local/airflow"
args: ["scheduler", "-n", "5"]
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: airflow-worker
namespace: airflow
spec:
replicas: 1
template:
metadata:
labels:
app: airflow
tier: airflow-worker
spec:
hostname: airflow-worker
restartPolicy: Always
containers:
- name: airflow-worker
image: 123.dkr.ecr.cn-northwest-1.amazonaws.com.cn/spark/etl-airflow:1.10.9-prod-_TAG_
imagePullPolicy: Always
lifecycle:
postStart:
exec:
command:
- sh
- -c
- |
echo "spark.master k8s://https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}" >> $SPARK_HOME/conf/spark-defaults.conf ; echo "spark.driver.pod.name ${HOSTNAME}" >> $SPARK_HOME/conf/spark-defaults.conf ; echo "spark.driver.host `hostname -i`" >> $SPARK_HOME/conf/spark-defaults.conf ;
env:
- name: AIRFLOW_HOME
value: "/usr/local/airflow"
args: ["worker"]
---
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
name: airflow-web
namespace: airflow
spec:
rules:
-
host: airflow.prod.com
http:
paths:
- path: /
backend:
serviceName: airflow-web
servicePort: 8080
References
Airflow on Kubernetes (Part 1): A Different Kind of Operator