Hands-on Notes: TFX with Airflow

Author: 会发paper的学渣

Last modified: 2022-07-26 13:25:45

First published: 2022-07-25 19:07:29

Article link: https://blog.csdn.net/sslfk/article/details/125981131

1. Introduction:

Airflow overview: "Apache Airflow 2.3.0 在五一重磅发布！" (a CSDN blog post by 数据社)

Official Airflow documentation: https://airflow.apache.org/docs/apache-airflow/2.2.3/installation/index.html

TFX overview: https://www.tensorflow.org/tfx/tutorials/tfx/components_keras

2. Installation

Overall reference: "TFX (TensorFlow Extended) 调研" (a CSDN blog post by 数据技术组)

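The version pairing is what matters here. With Python 3.8:

pip install --upgrade pip

pip install apache-airflow==2.2.3

pip install tfx==1.6.1

The Airflow pin is the release that the documentation link above refers to. Installing the default latest releases instead (for example with pip install -U tfx) leads to version conflicts, and Airflow then fails to start.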
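
Since the version pairing decides whether the setup works, it can help to check what is actually installed before starting Airflow. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `find_mismatches` are hypothetical, and the pins you pass in should be whichever versions you chose to install.

```python
from importlib import metadata
from typing import Callable, Dict, Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(
    pins: Dict[str, str],
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Optional[str]]:
    """Return {package: found_version} for each package whose version differs from its pin."""
    found = {name: lookup(name) for name in pins}
    return {name: version for name, version in found.items() if version != pins[name]}


# Example call (the pin shown is the TFX version used in this post):
# find_mismatches({"tfx": "1.6.1"})
```

An empty result means every pinned package is installed at the expected version; a value of None means the package is missing altogether.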

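3. Create an admin user (the official quick start runs the airflow db init command from step 4 before this one):

airflow users create --role Admin --username admin --email admin --firstname admin --lastname admin --password admin

4. Initialize the database:

airflow db init

5. Start Airflow (if the versions do not match, startup fails at this point):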

airflow webserver -p 8080

If the versions do match and Airflow still fails to start, delete the entire ~/airflow directory and start over from step 3.
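
Deleting ~/airflow discards the metadata database, the logs, and any DAG files stored there. A slightly more cautious variant, sketched below, renames the directory instead so that its contents can still be recovered. The function name `set_aside_airflow_home` and the backup naming scheme are assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import Optional


def set_aside_airflow_home(home: Path) -> Optional[Path]:
    """Rename `home` to a timestamped sibling; return the new path, or None if `home` is absent."""
    if not home.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = home.with_name(f"{home.name}.bak-{stamp}")
    home.rename(backup)
    return backup


# Example call (not executed here, since it moves a real directory):
# set_aside_airflow_home(Path.home() / "airflow")
```

After the directory has been moved aside, the setup commands recreate ~/airflow from scratch.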

6. Start the scheduler:

airflow scheduler
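
Once both processes are running, the webserver's /health endpoint reports the status of the metadata database and the scheduler. The sketch below reads it with the standard library. It assumes the webserver listens on localhost:8080 as started in step 5; `parse_health` and `fetch_health` are hypothetical helper names.

```python
import json
from typing import Dict
from urllib.request import urlopen


def parse_health(payload: str) -> Dict[str, str]:
    """Map each component in a /health response to its status string."""
    data = json.loads(payload)
    return {name: info.get("status", "unknown") for name, info in data.items()}


def fetch_health(base_url: str = "http://localhost:8080") -> Dict[str, str]:
    """Fetch and parse /health; raises URLError if the webserver is not reachable."""
    with urlopen(f"{base_url}/health", timeout=5) as response:
        return parse_health(response.read().decode("utf-8"))


# Example call against a running webserver:
# fetch_health()
```

A scheduler status other than "healthy" usually means the airflow scheduler process is not running or has stopped sending heartbeats.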

7. Airflow test script: save it under ~/airflow/dags/ with any file name.

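from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash import BashOperator

# Create the DAG. A fixed start_date is used because a moving value such as
# datetime.now() keeps the first schedule interval from ever completing.
dag = DAG(
    dag_id='xyz',
    start_date=datetime(2022, 1, 1),
    schedule_interval='0 0 * * *',  # every day at midnight
    catchup=False)  # do not backfill runs between start_date and today

# Create three tasks
t1 = BashOperator(
    task_id='first',
    bash_command='echo "1"',
    dag=dag)

t2 = BashOperator(
    task_id='second',
    bash_command='echo "2"',
    dag=dag)

t3 = BashOperator(
    task_id='third',
    bash_command='echo "3"',
    dag=dag)

# Define the task flow: first, then second, then third
t1 >> t2 >> t3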
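
The last line of the script uses the >> operator to declare ordering: t1 runs before t2, which runs before t3. The toy class below shows how this kind of chaining can be built on Python's `__rshift__` method. It is an illustrative model only, not Airflow's implementation, and `ToyTask` and `run_order` are made-up names.

```python
from typing import List


class ToyTask:
    """A stand-in for an operator that only records its downstream tasks."""

    def __init__(self, task_id: str) -> None:
        self.task_id = task_id
        self.downstream: List["ToyTask"] = []

    def __rshift__(self, other: "ToyTask") -> "ToyTask":
        # a >> b records b as downstream of a and returns b,
        # which is what lets a >> b >> c read left to right.
        self.downstream.append(other)
        return other


def run_order(first: ToyTask) -> List[str]:
    """Walk a linear chain starting at `first` and return the task ids in order."""
    order = []
    current = first
    while current is not None:
        order.append(current.task_id)
        current = current.downstream[0] if current.downstream else None
    return order


first, second, third = ToyTask("first"), ToyTask("second"), ToyTask("third")
first >> second >> third
print(run_order(first))  # ['first', 'second', 'third']
```

Returning the right-hand operand from `__rshift__` is the design choice that makes the chain work: the second >> then applies to the task that was just linked.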

8. Running a model pipeline with TFX:

Code:

import datetime
import os

import tensorflow_model_analysis as tfma
from tfx.components import CsvExampleGen
from tfx.components import Evaluator
from tfx.components import ExampleValidator
from tfx.components import Pusher
from tfx.components import SchemaGen
from tfx.components import StatisticsGen
from tfx.components import Trainer
from tfx.components import Transform
from tfx.components.trainer.executor import GenericExecutor
from tfx.dsl.components.base import executor_spec
from tfx.dsl.components.common import resolver
from tfx.dsl.experimental import latest_blessed_model_resolver
from tfx.orchestration import metadata
from tfx.orchestration import pipeline
from tfx.orchestration.airflow.airflow_dag_runner import AirflowDagRunner
from tfx.orchestration.airflow.airflow_dag_runner import AirflowPipelineConfig
from tfx.proto import pusher_pb2
from tfx.proto import trainer_pb2
from tfx.types import Channel
from tfx.types.standard_artifacts import Model
from tfx.types.standard_artifacts import ModelBlessing
 
pipeline_name = 'keras'
airflow_root = os.path.join(os.environ['HOME'], 'airflow')
data_root = os.path.join(airflow_root, 'data', 'keras')
module_file = os.path.join(airflow_root, 'code', 'udf.py')
serving_model_dir = os.path.join(airflow_root, 'serving_model', pipeline_name)
tfx_root = os.path.join(airflow_root, 'tfx')
pipeline_root = os.path.join(tfx_root, 'pipelines', pipeline_name)
metadata_path = os.path.join(tfx_root, 'metadata', pipeline_name, 'metadata.db')
 
example_gen = CsvExampleGen(input_base=data_root)
 
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
 
infer_schema = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=False
)
validate_stats = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=infer_schema.outputs['schema']
)
 
transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=infer_schema.outputs['schema'],
    module_file=module_file
)
 
trainer = Trainer(
    module_file=module_file,
    custom_executor_spec=executor_spec.ExecutorClassSpec(GenericExecutor),
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=infer_schema.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=10000),
    eval_args=trainer_pb2.EvalArgs(num_steps=5000)
)
 
model_resolver = resolver.Resolver(
    strategy_class=latest_blessed_model_resolver.LatestBlessedModelResolver,
    model=Channel(type=Model),
    model_blessing=Channel(type=ModelBlessing)
).with_id('latest_blessed_model_resolver')
 
 
eval_config = tfma.EvalConfig(
    # label_key must name the label column of the input CSV
    model_specs=[tfma.ModelSpec(label_key='tips')],
    slicing_specs=[tfma.SlicingSpec()],
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(
                class_name='BinaryAccuracy',
                threshold=tfma.MetricThreshold(
                    value_threshold=tfma.GenericValueThreshold(
                        lower_bound={'value': 0.6}),
                    change_threshold=tfma.GenericChangeThreshold(
                        direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                        absolute={'value': -1e-10})))
        ])
    ]
)
 
model_analyzer = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer.outputs['model'],
    baseline_model=model_resolver.outputs['model'],
    eval_config=eval_config
)
 
pusher = Pusher(
    model=trainer.outputs['model'],
    model_blessing=model_analyzer.outputs['blessing'],
    push_destination=pusher_pb2.PushDestination(
        filesystem=pusher_pb2.PushDestination.Filesystem(
            base_directory=serving_model_dir))
)
 
# Assemble the components into a pipeline
tfx_pipeline = pipeline.Pipeline(
    pipeline_name=pipeline_name,
    pipeline_root=pipeline_root,
    components=[
        example_gen,
        statistics_gen,
        infer_schema,
        validate_stats,
        transform,
        trainer,
        model_resolver,
        model_analyzer,
        pusher
    ],
    enable_cache=True,
    metadata_connection_config=metadata.sqlite_metadata_connection_config(metadata_path),
    beam_pipeline_args=['--direct_running_mode=multi_processing', '--direct_num_workers=0']
)
 
# No schedule: the DAG runs only when it is triggered manually
airflow_config = {
    'schedule_interval': None,
    'start_date': datetime.datetime(2022, 1, 1)
}
 
# Airflow discovers the pipeline through this module-level DAG object
DAG = AirflowDagRunner(AirflowPipelineConfig(airflow_config)).run(tfx_pipeline)
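
The path constants near the top of the script imply a directory layout under ~/airflow that has to be in place before the DAG runs: a folder for the CSV data, a folder for the module file, and locations for the TFX outputs. As a rough sketch, the helper below recomputes the same paths for an arbitrary home directory and creates the directories. `pipeline_layout` and `ensure_layout` are hypothetical names and are not part of TFX.

```python
import os
from typing import Dict


def pipeline_layout(home: str, pipeline_name: str = "keras") -> Dict[str, str]:
    """Rebuild the paths used by the pipeline script, rooted at `home`."""
    airflow_root = os.path.join(home, "airflow")
    tfx_root = os.path.join(airflow_root, "tfx")
    return {
        "data_root": os.path.join(airflow_root, "data", pipeline_name),
        "module_file": os.path.join(airflow_root, "code", "udf.py"),
        "serving_model_dir": os.path.join(airflow_root, "serving_model", pipeline_name),
        "pipeline_root": os.path.join(tfx_root, "pipelines", pipeline_name),
        "metadata_path": os.path.join(tfx_root, "metadata", pipeline_name, "metadata.db"),
    }


def ensure_layout(layout: Dict[str, str]) -> None:
    """Create the directories the pipeline reads from or writes to."""
    for key, path in layout.items():
        # File entries need only their parent directory; the rest are directories themselves.
        is_file = key in ("module_file", "metadata_path")
        os.makedirs(os.path.dirname(path) if is_file else path, exist_ok=True)


# Example call: ensure_layout(pipeline_layout(os.path.expanduser("~")))
```

With the directories in place, the CSV file goes into data_root and udf.py into the code folder.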

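Notes on the components involved:

ExampleGen: the data input source. The dataset used above, https://github.com/LahiruTjay/Machine-Learning-With-Python/blob/master/datasets/diabetes.csv, goes into ~/airflow/data/keras/

StatisticsGen: generates feature statistics over both training and serving data for use by other pipeline components. StatisticsGen uses Beam to scale to large datasets.

SchemaGen: produces the schema, an instance of schema.proto. It can specify the data types of feature values, whether a feature must be present in all examples, allowed value ranges, and other properties.

ExampleValidator: identifies anomalies in the dataset.

Transform: performs feature engineering on the tf.Examples emitted by ExampleGen (using the data schema created by SchemaGen) and emits a SavedModel. When executed, the SavedModel accepts tf.Examples emitted by ExampleGen and outputs the transformed feature data.

Trainer: trains a TensorFlow model.

Tuner: tunes the model's hyperparameters (not used in the pipeline above).

Evaluator: performs in-depth analysis of the training results to help you understand how the model performs on subsets of the data. Evaluator can also help you validate exported models.

InfraValidator: serves as an early warning layer before a model is pushed to production (not used in the pipeline above).

Pusher: pushes a validated model to the deployment target.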
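
To make the Evaluator and Pusher descriptions concrete: the eval_config in the script sets a value threshold (BinaryAccuracy with a lower bound of 0.6) and a change threshold relative to the latest blessed model (an absolute change of at least -1e-10, with higher being better), and Pusher receives the resulting blessing. The function below is a simplified model of that gate, not TFMA's code. Treating the bounds as inclusive and skipping the change check when no baseline exists are assumptions.

```python
from typing import Optional


def is_blessed(
    candidate_accuracy: float,
    baseline_accuracy: Optional[float],
    lower_bound: float = 0.6,
    min_change: float = -1e-10,
) -> bool:
    """Simplified gate: pass the value threshold and, if a baseline exists, the change threshold."""
    if candidate_accuracy < lower_bound:
        return False
    if baseline_accuracy is None:
        # Assumption: with no previously blessed model, only the value threshold applies.
        return True
    # Higher is better: the candidate may not fall below the baseline by more than |min_change|.
    return candidate_accuracy - baseline_accuracy >= min_change


print(is_blessed(0.75, None))  # True
print(is_blessed(0.55, None))  # False
print(is_blessed(0.75, 0.80))  # False
print(is_blessed(0.80, 0.75))  # True
```

In effect, a new model has to clear an absolute quality bar and must not be worse than the model that was blessed before it.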