Project Structure: Requirements
- Enable experimentation with multiple pipelines.
- Support both a local execution mode and a deployment execution mode. This yields two separate run configurations: the first for local development and end-to-end testing, the second for running in the cloud.
- Reuse code across pipeline variants where it makes sense to do so.
- Provide an easy-to-use CLI for executing pipelines with different configurations and data.
A correct implementation also ensures that tests are easy to incorporate in your workflow.
Project Structure: Design Decisions
- Use Python.
- Use TensorFlow Extended (TFX) as the pipeline framework.
In this article we will demonstrate how to run a TFX pipeline both locally and on a Kubeflow Pipelines installation with minimum hassle.
Side Effects Caused By Design Decisions
- By using TFX, we are going to use tensorflow. Keep in mind that tensorflow supports more types of models than neural networks, like boosted trees.
- TFX relies on Apache Beam, which can execute locally, anywhere Kubernetes runs, and on all public cloud providers. Examples include but are not limited to: GCP Dataflow, Azure Databricks.
- Due to Apache Beam, we need to make sure that the project code is easily packageable by Python's sdist for maximum portability. This is reflected in the top-level module structure of the project. (If you use external libraries, be sure to include them by providing an argument to Apache Beam. Read more about this in Apache Beam: Managing Python Pipeline Dependencies.)
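As a rough illustration of the packaging note above, the sketch below assembles Beam pipeline arguments that point at the project's setup.py. The --setup_file and --requirements_file options are Apache Beam's standard Python dependency options; the helper name build_beam_args and the example paths are hypothetical and only serve the illustration.

```python
import os
from typing import List, Optional


def build_beam_args(project_root: str,
                    requirements_file: Optional[str] = None) -> List[str]:
    """Builds Beam pipeline args so that workers can install the project code.

    Hypothetical helper: its name and the directory layout are assumptions.
    Only the --setup_file and --requirements_file options come from Beam.
    """
    # A setup.py at the project root lets Beam build an sdist of the project.
    args = ['--setup_file=' + os.path.join(project_root, 'setup.py')]
    if requirements_file:
        # External PyPI dependencies can be listed in a requirements file.
        args.append('--requirements_file=' + requirements_file)
    return args


beam_args = build_beam_args('/path/to/project', 'requirements.txt')
print(beam_args)
```

A list like this is typically handed to the pipeline as its Beam arguments (in TFX, through the beam_pipeline_args parameter of the pipeline object).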
[Optional] Before continuing, take a moment to read about the provided TFX CLI. Currently, it is embarrassingly slow to operate and the directory structure is much more verbose than it needs to be. It also does not include any notes on reproducibility and code reuse.
Directory Structure and Intuition Behind It
- $project-name is the root directory of your project.
- $project-name/ml includes machine-learning-related code.
- $project-name/ml/pipelines includes the actual ML pipeline code.
Typically, you may find yourself with multiple ML pipelines to manage, such as $project-name/ml/pipelines/predict-sales and $project-name/ml/pipelines/classify-fraud or similar.
Here is a simple tree view:
$project-name/ml/pipelines/
├── __init__.py
├── data
├── util
│   ├── __init__.py
│   ├── input_fn_utils.py
│   └── model_utils.py
├── kfp_runner.py
├── local_beam_dag_runner.py
├── model_utils.py
├── pipeline.py
├── cli.py
└── $pipeline-name
    ├── __init__.py
    ├── constants.py
    ├── model.py
    └── training.py
$project-name/ml/pipelines includes the following:
data → a small amount of representative training data to run locally for testing and on CI. This applies if your system does not have a dedicated component to pull data from somewhere. If it does have one, make sure to include a sampling query with a small, limited number of items.
util → code that is reused and shared across $pipeline-names. It is not necessary to include input_fn_utils.py and model_utils.py; use whatever makes sense here. Here are some examples:
In my own projects, it made sense to abstract some parts into the utility module, like building named input and output layers for the Keras models.
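from typing import Dict, Text, Tuple

import tensorflow as tf
from tensorflow.keras.layers import Concatenate, Dense, Input, Reshape

# transformed_name is the key-renaming helper shown further below.


def get_input_graph(input_feature_keys, input_window_size) -> Tuple[Input, tf.keras.layers.Layer]:
    transformed_columns = [transformed_name(key) for key in input_feature_keys]
    # One named input per transformed feature; shape must be a tuple.
    input_layers = {
        colname: Input(name=colname, shape=(input_window_size,), dtype=tf.float32)
        for colname in transformed_columns
    }
    pre_model_input = Concatenate(axis=-1)(list(input_layers.values()))
    # Reshape to (input_window_size, number of input features).
    pre_model_input = Reshape(
        target_shape=(input_window_size, len(input_feature_keys)))(pre_model_input)
    return input_layers, pre_model_input


def get_output_graph(head_layer, predict_feature_keys, output_window_size) -> Dict[Text, tf.keras.layers.Layer]:
    # One named Dense output per predicted feature.
    return {
        colname: Dense(units=output_window_size, name=colname)(head_layer)
        for colname in predict_feature_keys
    }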
Building the serving signature metagraph using Tensorflow Transform output.
import tensorflow as tf

# _LABEL_KEY is assumed to be defined elsewhere, for example in constants.py.


def _get_serve_tf_examples_fn(model, tf_transform_output):
    """Returns a function that parses a serialized tf.Example and applies TFT."""
    # Attach the transform layer to the model so it is exported with it.
    model.tft_layer = tf_transform_output.transform_features_layer()

    @tf.function
    def serve_tf_examples_fn(serialized_tf_examples):
        """Returns the output to be used in the serving signature."""
        feature_spec = tf_transform_output.raw_feature_spec()
        # The label is not part of a serving request.
        feature_spec.pop(_LABEL_KEY)
        parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)
        transformed_features = model.tft_layer(parsed_features)
        return model(transformed_features)

    return serve_tf_examples_fn
Preprocessing features into groups by using keys.
from typing import Any, Dict, Text

import tensorflow_transform as tft


def _preprocessing_fn(inputs: Dict[Text, Any], dense_float_feature_keys,
                      input_feature_keys) -> Dict[Text, Any]:
    outputs = {}
    # Scale only the dense float features that are also model inputs.
    for key in [k for k in dense_float_feature_keys if k in input_feature_keys]:
        outputs[transformed_name(key)] = tft.scale_to_z_score(inputs[key])
    return outputs
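To make the key-based grouping a bit more concrete, here is a plain-Python stand-in that mirrors the arithmetic on in-memory lists. It is only a sketch: the real tft.scale_to_z_score computes its statistics over the whole dataset, and the use of the population standard deviation here is my assumption. The sample columns and the function names scale_to_z_score and preprocess are made up for the illustration.

```python
import statistics
from typing import Dict, List


def transformed_name(key: str) -> str:
    return key + '_xf'


def scale_to_z_score(values: List[float]) -> List[float]:
    """In-memory stand-in: (value - mean) / standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / std for v in values]


def preprocess(inputs: Dict[str, List[float]],
               dense_float_feature_keys: List[str],
               input_feature_keys: List[str]) -> Dict[str, List[float]]:
    # Only keys that are both dense float features and model inputs are scaled.
    return {
        transformed_name(key): scale_to_z_score(inputs[key])
        for key in dense_float_feature_keys
        if key in input_feature_keys
    }


columns = {'a': [1.0, 2.0, 3.0], 'b': [10.0, 10.0, 40.0], 'c': [5.0, 6.0, 7.0]}
result = preprocess(columns,
                    dense_float_feature_keys=['a', 'b'],
                    input_feature_keys=['a', 'c'])
print(sorted(result))  # ['a_xf']
```

Only 'a' appears in both key lists, so it is the only column that gets scaled and renamed.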
And also other common repetitive tasks, like building input pipelines with the TensorFlow Dataset API.
from typing import Text

import tensorflow as tf


def transformed_name(key: Text) -> Text:
    return key + '_xf'


def gzip_reader_fn(filenames):
    return tf.data.TFRecordDataset(filenames, compression_type='GZIP')


def input_fn(file_pattern, tf_transform_output,
             feature_spec,
             # feature_keys, input_feature_keys, predict_feature_keys, or anything you like
             batch_size=256):
    # get_apply_tft_map_fn is a project-specific helper that is not shown here.
    apply_tf_transform_map_fn = get_apply_tft_map_fn()
    dataset = tf.data.experimental.make_batched_features_dataset(
        file_pattern=file_pattern,
        features=feature_spec,
        reader=gzip_reader_fn,
        shuffle=True,
        sloppy_ordering=True,
        batch_size=batch_size)
    # ... (further dataset transformations are elided here)
    dataset = dataset.map(
        apply_tf_transform_map_fn,
        num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
    return dataset
cli.py → entry point and command-line interface for the pipelines. Here are some common things to consider when using TFX.
By using abseil you can declare and access flags globally. Its flag system is distributed: each module defines the flags that are specific to it. This means that the common flags, like --data_dir=..., --hparam_tuning, --pipeline_root, --ml_metadata_url, --use_cache and --train_epochs, are ones you can define in the actual cli.py file. Other, more specific ones for each pipeline can be defined in submodules.
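A minimal sketch of how the common flags could be declared with abseil. The flag names come from the list above; their types, defaults and help strings are my assumptions, and describe_run is a made-up function that stands in for any module reading the flags.

```python
from absl import flags

FLAGS = flags.FLAGS

# Common flags defined once in cli.py. Types and defaults are assumptions.
flags.DEFINE_string('data_dir', None, 'Directory containing the input data.')
flags.DEFINE_string('pipeline_root', '/tmp/pipeline_root',
                    'Root directory for pipeline outputs.')
flags.DEFINE_string('ml_metadata_url', None, 'Location of the ML Metadata store.')
flags.DEFINE_boolean('hparam_tuning', False, 'Whether to run hyperparameter tuning.')
flags.DEFINE_boolean('use_cache', True, 'Whether to reuse cached component outputs.')
flags.DEFINE_integer('train_epochs', 1, 'Number of training epochs.')


def describe_run() -> str:
    """Any module can read the parsed values through the global FLAGS object."""
    return (f'data={FLAGS.data_dir} epochs={FLAGS.train_epochs} '
            f'cache={FLAGS.use_cache}')


# A real cli.py would let absl.app.run(main) parse sys.argv. An explicit
# argument list is parsed here so that the sketch runs on its own.
FLAGS(['cli.py', '--data_dir=./data', '--train_epochs=5', '--nouse_cache'])
print(describe_run())  # data=./data epochs=5 cache=False
```

Pipeline-specific flags would be declared the same way inside the submodule that needs them, and they become available as soon as that submodule is imported.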
This file acts as the entry point for the system. It uses the contents of pipeline.py to set up the components of the pipeline and supplies the user-provided module files (in the tree example these are constants.py, model.py and training.py), selected based on a flag like --pipeline_name=$pipeline-name or some other configuration.
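One way to picture the module-file selection described above is a small path resolver. This is a sketch under the directory layout from the tree view; the function name resolve_module_files is hypothetical.

```python
import os
from typing import Dict

# File names follow the tree view shown earlier.
MODULE_FILES = ('constants.py', 'model.py', 'training.py')


def resolve_module_files(pipelines_root: str, pipeline_name: str) -> Dict[str, str]:
    """Maps each user module name to its path under the chosen pipeline."""
    pipeline_dir = os.path.join(pipelines_root, pipeline_name)
    return {
        os.path.splitext(name)[0]: os.path.join(pipeline_dir, name)
        for name in MODULE_FILES
    }


# The pipeline name would normally come from --pipeline_name.
module_files = resolve_module_files('ml/pipelines', 'predict-sales')
print(module_files['training'])
```

The resulting paths are what the pipeline definition needs when it points its components at user code.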
Finally, with the assembled pipeline, it calls one of the _runner.py files, selected by a --runner= flag.
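The dispatch on --runner= can be sketched with a small registry. The runner functions below are stand-ins that only return a string; in the real files they would hand the pipeline to the matching TFX runner. All names in this sketch are hypothetical.

```python
from typing import Any, Callable, Dict

RUNNERS: Dict[str, Callable[[Any], str]] = {}


def register_runner(name: str):
    """Registers a runner function under the value accepted by --runner=."""
    def decorator(fn: Callable[[Any], str]) -> Callable[[Any], str]:
        RUNNERS[name] = fn
        return fn
    return decorator


@register_runner('local')
def run_local(pipeline: Any) -> str:
    # Stand-in: a real local_beam_dag_runner.py would use BeamDagRunner here.
    return f'local:{pipeline}'


@register_runner('kfp')
def run_kfp(pipeline: Any) -> str:
    # Stand-in: a real kfp_runner.py would target Kubeflow Pipelines here.
    return f'kfp:{pipeline}'


def run_pipeline(pipeline: Any, runner: str) -> str:
    if runner not in RUNNERS:
        raise ValueError(f'Unknown runner {runner!r}; choose from {sorted(RUNNERS)}')
    return RUNNERS[runner](pipeline)


print(run_pipeline('my-pipeline', runner='local'))  # local:my-pipeline
```

Adding another execution target then only means registering one more function, while cli.py stays unchanged.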
pipeline.py → parameterised pipeline component declaration and wiring. This is usually just a function that declares a bunch of TFX components and returns a tfx.orchestration.pipeline.Pipeline object.
import os
from typing import Any, Dict, Text

from tfx.proto import trainer_pb2


def get_pipeline(data_path: Text,
                 pyfiles_root: str = None,  # parent directory for user modules
                 tune: bool = False,  # do hparam tuning?
                 hyper_params_uri: Text = None,  # if not, provide some hparams uri
                 hyperparam_train_args: trainer_pb2.TrainArgs = None,
                 hyperparam_eval_args: trainer_pb2.EvalArgs = None,
                 train_args: trainer_pb2.TrainArgs = None,
                 eval_args: trainer_pb2.EvalArgs = None,
                 push_args: Dict[Text, Any] = None):
    if not pyfiles_root:
        pyfiles_root = os.path.dirname(__file__)
    training_module_file = os.path.join(
        pyfiles_root, 'training.py')  # directory structure of ./$pipeline-name/...
    # ...
    pipeline = ...  # build the pipeline definition with all the components
    return pipeline
local_beam_dag_runner.py → configuration to run locally with the portable Beam runner. This can typically be almost configuration-free, just by using the BeamDagRunner.
kfp_runner.py → configuration to run on Kubeflow Pipelines. This typically includes different data path and pipeline output prefixes and auto-binds an ml-metadata instance.
Note: you can have more runners, such as one that runs on GCP and configures additional provisioned resources like TPU instances, parallel AI Platform hyperparameter search, etc.
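The difference between the local and Kubeflow runner configurations described above mostly comes down to paths and prefixes. Here is a sketch of that idea; the RunConfig class, its fields, the directory names and the bucket name are all hypothetical and not taken from TFX.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for what differs between runners."""
    data_path: str
    pipeline_root: str
    # None means: let the orchestrator bind its own ML Metadata instance.
    metadata_path: Optional[str]


def local_config(project_dir: str) -> RunConfig:
    # Local runs read the small sample data and write next to the project.
    return RunConfig(
        data_path=f'{project_dir}/ml/pipelines/data',
        pipeline_root=f'{project_dir}/.pipeline_root',
        metadata_path=f'{project_dir}/.pipeline_root/metadata.sqlite')


def kfp_config(bucket: str) -> RunConfig:
    # Cloud runs use remote storage prefixes instead of local directories.
    return RunConfig(
        data_path=f'gs://{bucket}/data',
        pipeline_root=f'gs://{bucket}/pipeline_root',
        metadata_path=None)


print(local_config('/home/me/project'))
print(kfp_config('my-bucket'))
```

Each _runner.py file can then build its own configuration and pass the values on to the shared pipeline function.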
$pipeline-name
This is the user-provided code that makes different models, schedules different experiments, etc.
Due to the util submodule, code under each pipeline should be much leaner. There is no need to split it into more than 3 files, although nothing prohibits splitting your code across more files.
From experimentation, I converged on a constants, model and training split.
constants.py → declarations. Sensible default values for training parameters, hyperparameter keys and declarations, feature keys, feature groups, evaluation configurations and metrics to track. Here is a small example:
import tensorflow as tf
import kerastuner

TRAIN_STEPS = 1000
EVAL_STEPS = 100
BATCH_SIZE = 128
INPUT_WINDOW_SIZE = 7
OUTPUT_WINDOW_SIZE = 1

# === Feature Directory ===
DENSE_FLOAT_FEATURE_KEYS = ['a', 'b', 'c', 'd', 'e']

# === Feature Selection ===
FEATURE_KEYS = DENSE_FLOAT_FEATURE_KEYS  # feature inputs for model
PREDICT_FEATURE_KEYS = FEATURE_KEYS  # features that model predicts (if equal to feature_keys it can predict multiple timesteps ahead)
INPUT_FEATURE_KEYS = list(set(FEATURE_KEYS + PREDICT_FEATURE_KEYS))

# === Hyperparameters ===
HP_LR = 'learning_rate'
HP_HIDDEN_LAYER_NUM = 'hidden_layer_num'
HP_HIDDEN_LATENT_DIM = 'hidden_latent_dim'
HP_PRE_OUTPUT_UNITS = 'pre_output_units'


def _get_hyperparameters() -> kerastuner.HyperParameters:
    hp = kerastuner.HyperParameters()
    # todo: move string value definitions to constants
    hp.Fixed(HP_LR, 1e-2)
    hp.Fixed(HP_HIDDEN_LAYER_NUM, 1)
    hp.Fixed(HP_HIDDEN_LATENT_DIM, 64)
    # hp.Choice('dropout', [0.2, 0.3, 0.5], default=0.2)
    hp.Fixed(HP_PRE_OUTPUT_UNITS, 32)
    return hp


HYPERPARAMETERS = _get_hyperparameters()
HYPERPARAM_NUM_STEPS = 10
model.py → model definition. Typically contains a build_keras_model function and uses imports from util and $pipeline-name.constants. Here's an example from a recent project of mine:
from typing import Dict, Text

import kerastuner
import tensorflow as tf
from absl import logging
from tensorflow.keras.layers import (LSTM, Activation, Concatenate, Dense)

# In this example the pipeline package ($pipeline-name) is called rnn.
from rnn.constants import (INPUT_FEATURE_KEYS, PREDICT_FEATURE_KEYS,
                           HP_HIDDEN_LATENT_DIM,
                           HP_HIDDEN_LAYER_NUM, HP_LR,
                           HP_PRE_OUTPUT_UNITS,
                           INPUT_WINDOW_SIZE,
                           OUTPUT_WINDOW_SIZE)
from input_fn_utils import transformed_name
from model_utils import get_input_graph, get_output_graph


def build_keras_model(hparams: kerastuner.HyperParameters) -> tf.keras.Model:
    input_layers, pre_model_input = get_input_graph(
        INPUT_FEATURE_KEYS, INPUT_WINDOW_SIZE)
    x = pre_model_input
    # ======
    layer_num = int(hparams.get(HP_HIDDEN_LAYER_NUM))
    latent_dim = int(hparams.get(HP_HIDDEN_LATENT_DIM))
    for i in range(layer_num):
        # Every LSTM layer except the last one returns the full sequence.
        return_sequences = (i != layer_num - 1)
        x = LSTM(latent_dim, return_sequences=return_sequences)(x)
    pre_output_units = int(hparams.get(HP_PRE_OUTPUT_UNITS))
    x = Dense(units=pre_output_units, activation='swish')(x)
    model_head = Dense(units=OUTPUT_WINDOW_SIZE * len(PREDICT_FEATURE_KEYS),
                       activation='relu')(x)
    # =====
    output_layers = get_output_graph(
        model_head, PREDICT_FEATURE_KEYS, OUTPUT_WINDOW_SIZE)
    model = tf.keras.Model(input_layers, output_layers)
    model.compile(
        loss='mae',
        # learning_rate replaces the deprecated lr argument.
        optimizer=tf.keras.optimizers.Adam(
            learning_rate=float(hparams.get(HP_LR))))
    model.summary(print_fn=logging.info)
    return model
Lastly, training.py includes all the fuss required to train the model. This typically covers: the preprocessing definition, hyperparameter search, setting up data-parallel or model-parallel training strategies, TensorBoard logs, and saving the model for production.
import os
from functools import partial
from typing import Any, Dict, List, Text

import kerastuner
import tensorflow as tf
import tensorflow.keras.backend as K
import tensorflow_transform as tft
import tensorflow_data_validation as tfdv
from absl import logging
from tensorflow_transform.tf_metadata import schema_utils
from tfx.components.trainer.fn_args_utils import FnArgs
from tfx.components.tuner.component import TunerFnResult

from rnn.constants import (BATCH_SIZE, DENSE_FLOAT_FEATURE_KEYS, FEATURE_KEYS, PREDICT_FEATURE_KEYS,
                           INPUT_FEATURE_KEYS, INPUT_WINDOW_SIZE, OUTPUT_WINDOW_SIZE,
                           HYPERPARAMETERS)
from rnn.model import build_keras_model
from input_fn_utils import input_fn, get_serve_raw_fn, _preprocessing_fn


def preprocessing_fn(inputs: Dict[Text, Any]) -> Dict[Text, Any]:
    return _preprocessing_fn(inputs,
                             dense_float_feature_keys=DENSE_FLOAT_FEATURE_KEYS,
                             input_feature_keys=INPUT_FEATURE_KEYS)


def _input_fn(train_files, tf_transform_output, feature_spec):
    return input_fn(train_files, tf_transform_output,
                    feature_spec=feature_spec,
                    input_window_size=INPUT_WINDOW_SIZE,
                    output_window_size=OUTPUT_WINDOW_SIZE,
                    batch_size=BATCH_SIZE,
                    predict_feature_keys=PREDICT_FEATURE_KEYS,
                    feature_keys=FEATURE_KEYS,
                    input_feature_keys=INPUT_FEATURE_KEYS)


def tuner_fn(fn_args: FnArgs) -> TunerFnResult:
    # ... (construction of tuner, train_dataset and eval_dataset is elided)
    return TunerFnResult(
        tuner=tuner,
        fit_kwargs={
            'x': train_dataset,
            'validation_data': eval_dataset,
            'steps_per_epoch': fn_args.train_steps,
            'validation_steps': fn_args.eval_steps
        })


def run_fn(fn_args):
    tf_transform_output = tft.TFTransformOutput(fn_args.transform_output)
    train_files = fn_args.train_files
    eval_files = fn_args.eval_files
    serving_model_dir = fn_args.serving_model_dir
    train_steps = fn_args.train_steps
    eval_steps = fn_args.eval_steps

    schema = tfdv.load_schema_text(fn_args.schema_file)
    feature_spec = schema_utils.schema_as_feature_spec(schema).feature_spec

    # Hyperparameters may arrive as a dict that wraps the values.
    hparams = fn_args.hyperparameters
    if isinstance(hparams, dict) and 'values' in hparams:
        hparams = hparams['values']

    train_dataset = _input_fn(train_files, tf_transform_output, feature_spec)
    eval_dataset = _input_fn(eval_files, tf_transform_output, feature_spec)

    # Build the model inside the strategy scope so variables are mirrored.
    mirrored_strategy = tf.distribute.MirroredStrategy()
    with mirrored_strategy.scope():
        model = build_keras_model(hparams=hparams)

    log_dir = os.path.join(os.path.dirname(serving_model_dir), 'logs')
    tensorboard_callback = tf.keras.callbacks.TensorBoard(
        log_dir=log_dir, update_freq='epoch')

    model.fit(
        train_dataset,
        steps_per_epoch=train_steps,
        validation_data=eval_dataset,
        validation_steps=eval_steps,
        callbacks=[tensorboard_callback])

    serving_raw_entry = get_serve_raw_fn(
        model, tf_transform_output, INPUT_WINDOW_SIZE)
    serving_raw_signature_tensorspecs = {
        x: tf.TensorSpec(shape=[None, INPUT_WINDOW_SIZE], dtype=tf.float32, name=x)
        for x in INPUT_FEATURE_KEYS}
    logging.info(
        f'serving_raw signature TensorSpecs are: {serving_raw_signature_tensorspecs}')

    signatures = {
        'serving_raw': serving_raw_entry.get_concrete_function(serving_raw_signature_tensorspecs),
    }
    model.save(serving_model_dir, save_format='tf',
               signatures=signatures)
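The hyperparameter handling inside run_fn is easy to overlook, so here is that step in isolation. This is a sketch: the helper name unwrap_hparams is hypothetical, and the shape of the serialized dict (a 'space' entry next to a 'values' entry) is my assumption about how tuned hyperparameters are passed in.

```python
from typing import Any


def unwrap_hparams(hparams: Any) -> Any:
    """Returns the plain name-to-value mapping if hparams arrives wrapped."""
    if isinstance(hparams, dict) and 'values' in hparams:
        return hparams['values']
    return hparams


# Assumed serialized form: the search space next to the chosen values.
serialized = {
    'space': [],
    'values': {'learning_rate': 0.01, 'hidden_layer_num': 1},
}
hparams = unwrap_hparams(serialized)
print(hparams.get('learning_rate'))  # 0.01

# An object that is already unwrapped passes through unchanged.
print(unwrap_hparams({'learning_rate': 0.1}))  # {'learning_rate': 0.1}
```

Because both a plain dict and a hyperparameter object offer a get method, the model-building code can stay the same in either case.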
That’s it. Thank you for reading to the end!
I hope that you enjoyed reading this article as much as I enjoyed writing it.
Translated from: https://towardsdatascience.com/structuring-ml-pipeline-projects-97c16348be4a