Structuring ML Pipeline Projects

Project Structure: Requirements

  • Enable experimentation with multiple pipelines

  • Support both a local execution mode and a deployment execution mode. This ensures the creation of two separate running configurations, with the first being used for local development and end-to-end testing and the second used for running in the cloud.

  • Reuse code across pipeline variants if it makes sense to do so

  • Provide an easy-to-use CLI interface for executing pipelines with different configurations and data

A correct implementation also ensures that tests are easy to incorporate into your workflow.

Project Structure: Design Decisions

  • Use Python.

  • Use TensorFlow Extended (TFX) as the pipeline framework.

In this article we will demonstrate how to run a TFX pipeline both locally and on a Kubeflow Pipelines installation with minimum hassle.

Side Effects Caused By Design Decisions

  • By using TFX, we are going to use tensorflow. Keep in mind that tensorflow supports more types of models, like boosted trees.

  • Apache Beam can execute locally, anywhere Kubernetes runs, and on all public cloud providers. Examples include but are not limited to: GCP Dataflow, Azure Databricks.
  • Due to Apache Beam, we need to make sure that the project code is easily packageable by Python's sdist for maximum portability. This is reflected in the top-level module structure of the project. (If you use external libraries, be sure to include them by providing an argument to Apache Beam. Read more about this on Apache Beam: Managing Python Pipeline Dependencies.) A minimal packaging sketch follows below.
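As a rough illustration, here is a hedged sketch of that packaging setup, assuming a conventional setup.py next to the pipeline package and TFX's beam_pipeline_args; the package name, bucket, and paths are placeholders, not something prescribed by this article:

# setup.py (illustrative) -- lets Beam workers install the pipeline code as an sdist.
import setuptools

setuptools.setup(
    name='ml-pipelines',  # placeholder package name
    version='0.1.0',
    packages=setuptools.find_packages(),
)

# In the pipeline/runner configuration, point Beam at the setup file so the code
# is packaged and shipped to remote workers (paths are illustrative):
from tfx.orchestration import pipeline

p = pipeline.Pipeline(
    pipeline_name='predict-sales',
    pipeline_root='gs://my-bucket/pipeline-root',
    components=[],  # the real components are wired up in pipeline.py
    beam_pipeline_args=['--setup_file=./setup.py'],
)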

[Optional] Before continuing, take a moment to read about the provided TFX CLI. Currently, it is embarrassingly slow to operate, and the directory structure is much more verbose than it needs to be. It also does not include any notes on reproducibility and code reuse.

Directory Structure and Intuition Behind It

  • $project-name is the root directory of your project

  • $project-name/ml includes machine learning related stuff.

  • $project-name/ml/pipelines includes the actual ML pipeline code

  • Typically, you may find yourself with multiple ML pipelines to manage, such as $project-name/ml/pipelines/predict-sales and $project-name/ml/pipelines/classify-fraud or similar.

  • Here is a simple tree view:

$project-name/ml/pipelines/
├── __init__.py
├── data
├── util
│    ├── __init__.py
│    ├── input_fn_utils.py
│    └── model_utils.py
├── kfp_runner.py
├── local_beam_dag_runner.py
├── model_utils.py
├── pipeline.py
├── cli.py
└── $pipeline-name
    ├── __init__.py
    ├── constants.py
    ├── model.py
    └── training.py

$project-name/ml/pipelines includes the following:

  • data → a small amount of representative training data to run locally for testing and on CI. That's true if your system does not have a dedicated component to pull data from somewhere; if so, make sure to include a sampling query that returns a small, limited number of items.

  • util → code that is reused and shared across $pipeline-names. It is not necessary to include input_fn_utils.py and model_utils.py; use whatever makes sense here. Here are some examples:

In my own projects, it made sense to abstract some parts into the utility module, like building named input and output layers for the Keras models.

from typing import Dict, Text, Tuple

import tensorflow as tf
from tensorflow.keras.layers import Concatenate, Dense, Input, Reshape

from input_fn_utils import transformed_name


def get_input_graph(input_feature_keys, input_window_size) -> Tuple[Input, tf.keras.layers.Layer]:
    # One named Input per transformed feature, so the keys line up with the
    # Transform output.
    transformed_columns = [transformed_name(key) for key in input_feature_keys]

    input_layers = {
        colname: Input(name=colname, shape=(input_window_size,), dtype=tf.float32)
        for colname in transformed_columns
    }

    # Concatenate the per-feature inputs and reshape to (window, num_features).
    pre_model_input = Concatenate(axis=-1)(list(input_layers.values()))
    pre_model_input = Reshape(
        target_shape=(input_window_size, len(input_feature_keys)))(pre_model_input)

    return input_layers, pre_model_input


def get_output_graph(head_layer, predict_feature_keys, output_window_size) -> Dict[Text, tf.keras.layers.Layer]:
    # One named Dense head per predicted feature.
    return {
        colname: Dense(units=output_window_size, name=colname)(head_layer)
        for colname in predict_feature_keys
    }

Building the serving signature metagraph using the TensorFlow Transform output.

# Assumes `import tensorflow as tf` and a _LABEL_KEY constant defined alongside
# the other feature keys.
def _get_serve_tf_examples_fn(model, tf_transform_output):
  """Returns a function that parses a serialized tf.Example and applies TFT."""

  # Keep a reference to the TFT layer so it is tracked and exported with the model.
  model.tft_layer = tf_transform_output.transform_features_layer()


  @tf.function
  def serve_tf_examples_fn(serialized_tf_examples):
    """Returns the output to be used in the serving signature."""
    feature_spec = tf_transform_output.raw_feature_spec()
    feature_spec.pop(_LABEL_KEY)
    parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)


    transformed_features = model.tft_layer(parsed_features)


    return model(transformed_features)


  return serve_tf_examples_fn

Preprocessing features into groups by using keys.

def _preprocessing_fn(inputs: Dict[Text, Any], dense_float_feature_keys, input_feature_keys) -> Dict[Text, Any]:
    # Z-score scale every dense float feature that is used as a model input
    # (assumes `import tensorflow_transform as tft`).
    outputs = {}
    for key in [k for k in dense_float_feature_keys if k in input_feature_keys]:
        outputs[transformed_name(key)] = tft.scale_to_z_score(inputs[key])


    return outputs

And also other common, repetitive tasks, like building input pipelines with the TensorFlow Dataset API.

def transformed_name(key: Text) -> Text:
    return key + '_xf'


def gzip_reader_fn(filenames):
    return tf.data.TFRecordDataset(filenames, compression_type='GZIP')


def input_fn(file_pattern, tf_transform_output,
             feature_spec,
             # feature_keys, input_feature_keys, predict_feature_keys, or anything you like
             batch_size=256):
    # get_apply_tft_map_fn (elided here) is another util helper that returns a
    # function mapping raw batches through the Transform graph.
    apply_tf_transform_map_fn = get_apply_tft_map_fn()

    dataset = tf.data.experimental.make_batched_features_dataset(
        file_pattern=file_pattern,
        features=feature_spec,
        reader=gzip_reader_fn,
        shuffle=True,
        sloppy_ordering=True,
        batch_size=batch_size) \
        ... \
        .map(apply_tf_transform_map_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE) \
        .prefetch(tf.data.experimental.AUTOTUNE)

    return dataset
  • cli.py → entry point and command-line interface for the pipelines. Here are some common things to consider when using TFX.

By using abseil you can declare and access flags globally. Each module defines the flags that are specific to it; it is a distributed system. This means that common flags, like --data_dir=..., --hparam_tuning, --pipeline_root, --ml_metadata_url, --use_cache, and --train_epochs, are ones you can define in the actual cli.py file, while other, more specific ones for each pipeline can be defined in submodules.

This file acts as an entry point for the system. It uses the contents of pipeline.py to set up the components of the pipeline, as well as to provide the user-provided module files (in the tree example these are constants.py, model.py, and training.py), based on some flag like --pipeline_name=$pipeline-name or some other configuration.

Finally, with the assembled pipeline, it calls some _runner.py file by using a --runner= flag. A minimal sketch of such a cli.py is shown below.
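Here is a hedged sketch of what cli.py can look like with abseil, under the assumption that pipeline.py exposes the get_pipeline function described below and that each runner module exposes a run(pipeline) function; the exact flag set and dispatch logic are illustrative:

from absl import app, flags

import pipeline  # assembles the TFX components for the selected pipeline

# Common flags live here; pipeline-specific flags live in the pipeline submodules.
flags.DEFINE_string('pipeline_name', None, 'Which $pipeline-name package to run.')
flags.DEFINE_string('runner', 'local', 'One of: local, kfp.')
flags.DEFINE_string('data_dir', 'data', 'Where the (sample) training data lives.')
flags.DEFINE_bool('use_cache', False, 'Reuse cached component outputs.')

FLAGS = flags.FLAGS


def main(argv):
    del argv  # unused

    # In this layout the pipeline name doubles as the user-module directory
    # (./$pipeline-name/), so it is passed as pyfiles_root.
    tfx_pipeline = pipeline.get_pipeline(data_path=FLAGS.data_dir,
                                         pyfiles_root=FLAGS.pipeline_name)

    # Dispatch to the chosen runner module (imported lazily so that, e.g.,
    # the Kubeflow dependencies are only needed when actually requested).
    if FLAGS.runner == 'local':
        import local_beam_dag_runner as runner
    else:
        import kfp_runner as runner
    runner.run(tfx_pipeline)


if __name__ == '__main__':
    flags.mark_flag_as_required('pipeline_name')
    app.run(main)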

  • pipeline.py → parameterised pipeline component declaration and wiring. This is usually just a function that declares a bunch of TFX components and returns a tfx.orchestration.Pipeline object.

import os
from typing import Any, Dict, Text

from tfx.orchestration import pipeline as tfx_pipeline
from tfx.proto import trainer_pb2


def get_pipeline(data_path: Text,
                 pyfiles_root: str = None,  # parent directory for user modules
                 tune: bool = False,  # do hparam tuning?
                 hyper_params_uri: Text = None,  # if not, provide some hparams uri
                 hyperparam_train_args: trainer_pb2.TrainArgs = None,
                 hyperparam_eval_args: trainer_pb2.EvalArgs = None,
                 train_args: trainer_pb2.TrainArgs = None,
                 eval_args: trainer_pb2.EvalArgs = None,
                 push_args: Dict[Text, Any] = None):
    if not pyfiles_root:
        pyfiles_root = os.path.dirname(__file__)

    # directory structure of ./$pipeline-name/...
    training_module_file = os.path.join(pyfiles_root, 'training.py')

    # ...
    pipeline = ...  # build a tfx_pipeline.Pipeline with all the components
    return pipeline
  • local_beam_dag_runner.py → configuration to run locally with the portable Beam runner. This can typically be almost configuration-free, just by using the BeamDagRunner (see the sketch below).
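A minimal sketch of such a runner, assuming the get_pipeline signature above and an illustrative local data path:

# local_beam_dag_runner.py -- runs the pipeline in-process with Beam.
from tfx.orchestration.beam.beam_dag_runner import BeamDagRunner

import pipeline


def run(tfx_pipeline):
    BeamDagRunner().run(tfx_pipeline)


if __name__ == '__main__':
    # Use the small representative sample under ./data for local end-to-end runs.
    run(pipeline.get_pipeline(data_path='data'))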

  • kfp_runner.py → configuration to run on Kubeflow Pipelines. This typically includes different data paths and pipeline output prefixes, and auto-binds an ml-metadata instance (see the sketch below).
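A corresponding hedged sketch for Kubeflow Pipelines, using TFX's kubeflow_dag_runner module; the container image and bucket names are placeholders. Note that this runner compiles the pipeline into a KFP package that is then uploaded and run on the cluster:

# kfp_runner.py -- compiles the pipeline into a Kubeflow Pipelines package.
from tfx.orchestration.kubeflow import kubeflow_dag_runner

import pipeline


def run(tfx_pipeline):
    # get_default_kubeflow_metadata_config() binds the cluster's ml-metadata instance.
    config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=kubeflow_dag_runner.get_default_kubeflow_metadata_config(),
        tfx_image='gcr.io/my-project/my-tfx-image')
    kubeflow_dag_runner.KubeflowDagRunner(config=config).run(tfx_pipeline)


if __name__ == '__main__':
    # Cloud data path and pipeline output prefix differ from the local runner.
    run(pipeline.get_pipeline(data_path='gs://my-bucket/data'))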

Note: you can have more runners, like something that runs on GCP and just configures additional provisioning resources, like TPU instances, parallel AI Platform hyperparameter search, etc.

$pipeline-name

This is the user-provided code that makes different models, schedules different experiments, etc.

Due to the util submodule, code under each pipeline should be much leaner. There is no need to split it into more than 3 files; nothing prohibits you from splitting your code across more files, though.

From experimentation, I converged to a constants, model and training split.

  • constants.py → declarations. Sensible default values for training parameters, hyperparameter keys and declarations, feature keys, feature groups, evaluation configurations and metrics to track. Here is a small example:

import tensorflow as tf
import kerastuner


TRAIN_STEPS = 1000
EVAL_STEPS = 100


BATCH_SIZE = 128


INPUT_WINDOW_SIZE = 7
OUTPUT_WINDOW_SIZE = 1


# === Feature Directory ===
DENSE_FLOAT_FEATURE_KEYS = ['a', 'b', 'c', 'd', 'e']


# === Feature Selection ===
FEATURE_KEYS = DENSE_FLOAT_FEATURE_KEYS  # feature inputs for model
PREDICT_FEATURE_KEYS = FEATURE_KEYS  # features that model predicts (if equal to feature_keys it can predict multiple timesteps ahead)


INPUT_FEATURE_KEYS = list(set(FEATURE_KEYS + PREDICT_FEATURE_KEYS))




# === Hyperparameters ===
HP_LR = 'learning_rate'
HP_HIDDEN_LAYER_NUM = 'hidden_layer_num'
HP_HIDDEN_LATENT_DIM = 'hidden_latent_dim'
HP_PRE_OUTPUT_UNITS = 'pre_output_units'




def _get_hyperparameters() -> kerastuner.HyperParameters:
    hp = kerastuner.HyperParameters()
    # todo: move string value definitions to constants
    hp.Fixed(HP_LR, 1e-2)
    hp.Fixed(HP_HIDDEN_LAYER_NUM, 1)
    hp.Fixed(HP_HIDDEN_LATENT_DIM, 64)
    #hp.Choice('dropout', [0.2, 0.3, 0.5], default=0.2)
    hp.Fixed(HP_PRE_OUTPUT_UNITS, 32)


    return hp




HYPERPARAMETERS = _get_hyperparameters()
HYPERPARAM_NUM_STEPS = 10
  • model.py → Model definition. Typically contains a build_keras_model function and uses imports from util and $pipeline-name.constants. Here's an example from a recent project of mine:

from typing import Dict, Text


import tensorflow as tf
from absl import logging
from tensorflow.keras.layers import (LSTM, Activation, Concatenate, Dense)
import kerastuner
from rnn.constants import (INPUT_FEATURE_KEYS, PREDICT_FEATURE_KEYS,
                           HP_HIDDEN_LATENT_DIM,
                           HP_HIDDEN_LAYER_NUM, HP_LR,
                           HP_PRE_OUTPUT_UNITS,
                           INPUT_WINDOW_SIZE,
                           OUTPUT_WINDOW_SIZE)


from input_fn_utils import transformed_name
from model_utils import get_input_graph, get_output_graph




def build_keras_model(hparams: kerastuner.HyperParameters) -> tf.keras.Model:
    input_layers, pre_model_input = get_input_graph(
        INPUT_FEATURE_KEYS, INPUT_WINDOW_SIZE)


    x = pre_model_input


    # ======
    layer_num = int(hparams.get(HP_HIDDEN_LAYER_NUM))
    latent_dim = int(hparams.get(HP_HIDDEN_LATENT_DIM))
    for i in range(layer_num):
        return_sequences = (i != layer_num-1)
        x = LSTM(latent_dim, return_sequences=return_sequences)(x)


    pre_output_units = int(hparams.get(HP_PRE_OUTPUT_UNITS))
    x = Dense(units=pre_output_units, activation='swish')(x)


    model_head = Dense(units=OUTPUT_WINDOW_SIZE *
                       len(PREDICT_FEATURE_KEYS), activation='relu')(x)
    # =====


    output_layers = get_output_graph(
        model_head, PREDICT_FEATURE_KEYS, OUTPUT_WINDOW_SIZE)


    model = tf.keras.Model(input_layers, output_layers)


    model.compile(
        loss='mae',
        optimizer=tf.keras.optimizers.Adam(
            lr=float(hparams.get(HP_LR))))


    model.summary(print_fn=logging.info)
    return model
  • Lastly, training.py includes all the fuss required to train the model. This is typically: preprocessing definition, hyperparameter search, setting up training data, data- or model-parallel strategies and TensorBoard logs, and saving the model for production.

import os
from functools import partial
from typing import Any, Dict, List, Text


import kerastuner
import tensorflow as tf
import tensorflow.keras.backend as K
import tensorflow_transform as tft
import tensorflow_data_validation as tfdv
from absl import logging
from tensorflow_transform.tf_metadata import schema_utils
from tfx.components.trainer.fn_args_utils import FnArgs
from tfx.components.tuner.component import TunerFnResult


from rnn.constants import (BATCH_SIZE, DENSE_FLOAT_FEATURE_KEYS, FEATURE_KEYS, PREDICT_FEATURE_KEYS,
                           INPUT_FEATURE_KEYS, INPUT_WINDOW_SIZE, OUTPUT_WINDOW_SIZE,
                           HYPERPARAMETERS)
from rnn.model import build_keras_model
from input_fn_utils import input_fn, get_serve_raw_fn, _preprocessing_fn




def preprocessing_fn(inputs: Dict[Text, Any]) -> Dict[Text, Any]:
    return _preprocessing_fn(inputs,
                             dense_float_feature_keys=DENSE_FLOAT_FEATURE_KEYS,
                             input_feature_keys=INPUT_FEATURE_KEYS)




def _input_fn(train_files, tf_transform_output, feature_spec):
    return input_fn(train_files, tf_transform_output,
                    feature_spec=feature_spec,
                    input_window_size=INPUT_WINDOW_SIZE,
                    output_window_size=OUTPUT_WINDOW_SIZE,
                    batch_size=BATCH_SIZE,
                    predict_feature_keys=PREDICT_FEATURE_KEYS,
                    feature_keys=FEATURE_KEYS,
                    input_feature_keys=INPUT_FEATURE_KEYS)




def tuner_fn(fn_args: FnArgs) -> TunerFnResult:
    # ...
    return TunerFnResult(
        tuner=tuner,
        fit_kwargs={
            'x': train_dataset,
            'validation_data': eval_dataset,
            'steps_per_epoch': fn_args.train_steps,
            'validation_steps': fn_args.eval_steps
        })




def run_fn(fn_args):
    tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)
    train_files = fn_args.train_files
    eval_files = fn_args.eval_files
    serving_model_dir = fn_args.serving_model_dir
    train_steps = fn_args.train_steps
    eval_steps = fn_args.eval_steps


    schema = tfdv.load_schema_text(fn_args.schema_file)
    feature_spec = schema_utils.schema_as_feature_spec(schema).feature_spec


    hparams = fn_args.hyperparameters


    if type(hparams) is dict and 'values' in hparams.keys():
        hparams = hparams['values']


    train_dataset = _input_fn(train_files, tf_transform_output, feature_spec)
    eval_dataset = _input_fn(eval_files, tf_transform_output, feature_spec)


    mirrored_strategy = tf.distribute.MirroredStrategy()
    with mirrored_strategy.scope():
        model = build_keras_model(hparams=hparams)


    log_dir = os.path.join(os.path.dirname(serving_model_dir), 'logs')
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=log_dir, update_freq='epoch')


    model.fit(
        train_dataset,
        steps_per_epoch=train_steps,
        validation_data=eval_dataset,
        validation_steps=eval_steps,
        callbacks=[tensorboard_callback])


    serving_raw_entry = get_serve_raw_fn(
        model, tf_transform_output, INPUT_WINDOW_SIZE)


    serving_raw_signature_tensorspecs = {x: tf.TensorSpec(
        shape=[None, INPUT_WINDOW_SIZE], dtype=tf.float32, name=x) for x in INPUT_FEATURE_KEYS}


    logging.info(
        f'serving_raw signature TensorSpecs are: {serving_raw_signature_tensorspecs}')


    signatures = {
        'serving_raw': serving_raw_entry.get_concrete_function(serving_raw_signature_tensorspecs),
    }


    model.save(serving_model_dir, save_format='tf',
               signatures=signatures)

That’s it. Thank you for reading to the end!

I hope that you enjoyed reading this article as much as I enjoyed writing it.

Translated from: https://towardsdatascience.com/structuring-ml-pipeline-projects-97c16348be4a
