Hands-on! TensorFlow Cloud makes it seamless to move model training to the cloud


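By Jonah Kohn and Pavithra Vijay, Software Engineers, Google

TensorFlow Cloud is a Python package that provides APIs for moving seamlessly from debugging and training TensorFlow code in a local environment to distributed training in Google Cloud. It reduces training a model in the cloud to a single function call, requires minimal setup, and needs almost no changes to your model. TensorFlow Cloud handles cloud-specific tasks for you, such as creating VM instances and choosing a distribution strategy for your model. This article walks through a common use case for TensorFlow Cloud and a few best practices.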

TensorFlow Cloud
https://github.com/tensorflow/cloud

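We will classify dog breeds using the images in the stanford_dogs dataset. To keep things simple, we will use transfer learning with a ResNet50 initialized with weights trained on ImageNet. The code from this article is available in the TensorFlow Cloud repository, linked below.

stanford_dogs
https://tensorflow.google.cn/datasets/catalog/stanford_dogs
Code for this article
https://github.com/tensorflow/cloud/blob/master/src/python/tensorflow_cloud/core/tests/examples/call_run_within_script_with_keras_fit.py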

Setup

Install TensorFlow Cloud with pip install tensorflow_cloud. We start the Python script for the classification task by adding the required imports.

import datetime
import os

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import tensorflow_cloud as tfc
import tensorflow_datasets as tfds

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Model

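Google Cloud configuration

TensorFlow Cloud runs training jobs on Google Cloud using the AI Platform service behind the scenes.

AI Platform
https://cloud.google.com/ai-platform
Google Cloud
https://cloud.google.com/

If you are new to GCP, follow the setup steps in this section to create and configure your first Google Cloud project. First-time setup and configuration involve a bit of a learning curve. The good news is that once setup is done, you can run your TensorFlow code in the cloud without changing it.

Create a GCP project
Enable AI Platform services
Create a service account
Download an authorization key
Create a Google Cloud Storage bucket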

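GCP project

A Google Cloud project groups a set of cloud resources, such as users, APIs, billing, authentication, and monitoring. To create a project, follow this guide. Run the commands in this section in your terminal.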

export PROJECT_ID=<your-project-id>
gcloud config set project $PROJECT_ID

This guide
https://cloud.google.com/resource-manager/docs/creating-managing-projects

AI Platform services

Enable AI Platform services for your GCP project by entering your project ID in this drop-down menu.

This drop-down menu
https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component&_ga=2.195250852.968184668.1595960596-2024863071.1593638259&_gac=1.152369355.1595642474.Cj0KCQjwjer4BRCZARIsABK4QeUwdi5usz8wVZVqzlLM-jdvG6KF8zqhHPT1XQ0ga1M11bNkUO41VtsaAuc2EALw_wcB

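Service account and key

Create a service account for your new GCP project. A service account is an account used by an application or a virtual machine instance, and Cloud applications use it to make authorized API calls.

export SA_NAME=<your-sa-name>
gcloud iam service-accounts create $SA_NAME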
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member serviceAccount:$SA_NAME@$PROJECT_ID.iam.gserviceaccount.com \
  --role 'roles/editor'

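Service account
https://cloud.google.com/iam/docs/creating-managing-service-accounts
Virtual machine instance
https://cloud.google.com/compute/docs/instances

Next, we need an authentication key for the service account. The key ensures that only those authorized to work on your project can use your GCP resources. Create it as follows: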

gcloud iam service-accounts keys create ~/key.json --iam-account $SA_NAME@$PROJECT_ID.iam.gserviceaccount.com

Create the GOOGLE_APPLICATION_CREDENTIALS environment variable.

export GOOGLE_APPLICATION_CREDENTIALS=~/key.json
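
Before submitting a job, it can help to confirm that the key file really is a service account key. The following is a minimal sketch, not part of TensorFlow Cloud: check_service_account_key is a hypothetical helper, and it only assumes the standard fields that a downloaded service account JSON key contains (type, project_id, client_email, private_key).

```python
import json
import os


def check_service_account_key(path):
    """Return the key's client_email if the file looks like a service account key.

    Hypothetical helper for illustration; it only inspects the JSON structure
    and does not contact Google Cloud.
    """
    path = os.path.expanduser(path)
    with open(path, "r", encoding="utf-8") as f:
        key = json.load(f)
    if key.get("type") != "service_account":
        raise ValueError("not a service account key: %s" % path)
    required = ("project_id", "client_email", "private_key")
    missing = [name for name in required if name not in key]
    if missing:
        raise ValueError("key file is missing fields: %s" % ", ".join(missing))
    return key["client_email"]
```

A typical call would be check_service_account_key(os.environ["GOOGLE_APPLICATION_CREDENTIALS"]), which should return the address of the service account created above.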

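Cloud Storage bucket

If you already have a designated storage bucket, enter its name as shown below. Otherwise, follow this guide to create a Google Cloud Storage (GCS) bucket. TensorFlow Cloud uses Google Cloud Build to build and publish a Docker image, and uses the bucket to store auxiliary data such as model checkpoints and training logs.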

GCP_BUCKET = "your-bucket-name"

This guide
https://cloud.google.com/storage/docs/creating-buckets
Google Cloud Build
https://cloud.google.com/cloud-build
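
Later in the script, checkpoint and log locations are built from the bucket name with os.path.join. That works on Linux and macOS, but os.path.join uses the platform separator, so on Windows it would put backslashes into the URI. A small sketch of a portable alternative follows; gcs_path is a hypothetical helper, not part of TensorFlow or TensorFlow Cloud.

```python
import posixpath


def gcs_path(bucket, *parts):
    """Join a bucket name and path components into a gs:// URI.

    posixpath always joins with forward slashes, whatever the host OS.
    """
    return "gs://" + posixpath.join(bucket.strip("/"), *parts)


GCP_BUCKET = "your-bucket-name"
print(gcs_path(GCP_BUCKET, "logs"))
print(gcs_path(GCP_BUCKET, "resnet-dogs", "save_at_{epoch}"))
```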

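Keras model creation

The model creation workflow for TensorFlow Cloud is identical to building and training a TF Keras model locally.

Resources

We begin by loading the stanford_dogs dataset for classifying dog breeds. It is available as part of the tensorflow-datasets package. If you have a large dataset, we recommend hosting it on GCS for better performance.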

(ds_train, ds_test), metadata = tfds.load(
    "stanford_dogs",
    split=["train", "test"],
    shuffle_files=True,
    with_info=True,
    as_supervised=True,
)

NUM_CLASSES = metadata.features["label"].num_classes

tensorflow-datasets
https://tensorflow.google.cn/datasets/catalog/overview

Let's inspect and visualize the dataset:

print("Number of training samples: %d" % tf.data.experimental.cardinality(ds_train))
print("Number of test samples: %d" % tf.data.experimental.cardinality(ds_test))
print("Number of classes: %d" % NUM_CLASSES)

Number of training samples: 12000; number of test samples: 8580; number of classes: 120

plt.figure(figsize=(10, 10))
for i, (image, label) in enumerate(ds_train.take(9)):
    ax = plt.subplot(3, 3, i + 1)
    plt.imshow(image)
    plt.title(int(label))
    plt.axis("off")

Preprocessing

We will resize and batch the data.

IMG_SIZE = 224
BATCH_SIZE = 64
BUFFER_SIZE = 2

size = (IMG_SIZE, IMG_SIZE)
ds_train = ds_train.map(lambda image, label: (tf.image.resize(image, size), label))
ds_test = ds_test.map(lambda image, label: (tf.image.resize(image, size), label))

def input_preprocess(image, label):
    image = tf.keras.applications.resnet50.preprocess_input(image)
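    return image, label

Configure the input pipeline for performance

Next, we configure the input pipeline for performance. Note that we use parallel calls and prefetching so that I/O does not become a bottleneck while your model is training. The guide linked below covers input pipeline performance in detail.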

ds_train = ds_train.map(
    input_preprocess, num_parallel_calls=tf.data.experimental.AUTOTUNE
)

ds_train = ds_train.batch(batch_size=BATCH_SIZE, drop_remainder=True)
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

ds_test = ds_test.map(input_preprocess)
ds_test = ds_test.batch(batch_size=BATCH_SIZE, drop_remainder=True)

Guide
https://tensorflow.google.cn/guide/data_performance
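
Because both datasets are batched with drop_remainder=True, the last incomplete batch of each epoch is discarded. The arithmetic can be sketched in plain Python using the sample counts reported earlier (12000 training and 8580 test images); batches_per_epoch is a hypothetical helper written only for this illustration.

```python
def batches_per_epoch(num_samples, batch_size, drop_remainder=True):
    """Return (number of batches, number of samples dropped) for one epoch."""
    full, leftover = divmod(num_samples, batch_size)
    if drop_remainder:
        return full, leftover
    # Without drop_remainder the leftover samples form one smaller final batch.
    return full + (1 if leftover else 0), 0


BATCH_SIZE = 64
print(batches_per_epoch(12000, BATCH_SIZE))  # (187, 32)
print(batches_per_epoch(8580, BATCH_SIZE))   # (134, 4)
```

With these settings each training epoch skips 32 images and each evaluation pass skips 4, which is usually acceptable for training but worth keeping in mind when reporting test metrics.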

Build the model

We will load ResNet50 with weights trained on ImageNet, using include_top=False so that we can reshape the model for our task.

inputs = tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
base_model = tf.keras.applications.ResNet50(
    weights="imagenet", include_top=False, input_tensor=inputs
)
x = tf.keras.layers.GlobalAveragePooling2D()(base_model.output)
x = tf.keras.layers.Dropout(0.5)(x)
outputs = tf.keras.layers.Dense(NUM_CLASSES)(x)

model = tf.keras.Model(inputs, outputs)

ImageNet
http://www.image-net.org/
ResNet50
https://tensorflow.google.cn/api_docs/python/tf/keras/applications/ResNet50?version=nightly

We freeze all layers of the base model at their current weights, so that only the additional layers we added are trained.

base_model.trainable = False
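
With the base model frozen, the only weights left to train are in the final Dense layer, since global average pooling and dropout have no parameters. A quick back-of-the-envelope check in plain Python, assuming the 2048-channel feature vector that ResNet50 produces after pooling and the 120 classes reported for this dataset:

```python
def dense_params(in_features, out_features):
    """Parameter count of a fully connected layer: weight matrix plus biases."""
    return in_features * out_features + out_features


RESNET50_FEATURES = 2048  # channels in ResNet50's final feature map
NUM_CLASSES = 120         # dog breeds in stanford_dogs

print(dense_params(RESNET50_FEATURES, NUM_CLASSES))  # 245880
```

The figure should match the trainable parameter count that model.summary() reports once base_model.trainable is set to False.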

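Keras callbacks can be used with TensorFlow Cloud without difficulty, as long as the storage destination is inside your Cloud Storage bucket. In this example, we use the ModelCheckpoint callback to save the model at various stages of training, the TensorBoard callback to visualize the model and its progress, and the EarlyStopping callback to automatically determine the best number of epochs for training.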

MODEL_PATH = "resnet-dogs"
checkpoint_path = os.path.join("gs://", GCP_BUCKET, MODEL_PATH, "save_at_{epoch}")
tensorboard_path = os.path.join(
    "gs://", GCP_BUCKET, "logs", datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
)
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(checkpoint_path),
    tf.keras.callbacks.TensorBoard(log_dir=tensorboard_path, histogram_freq=1),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3),
]

ModelCheckpoint
https://tensorflow.google.cn/api_docs/python/tf/keras/callbacks/ModelCheckpoint
TensorBoard
https://tensorflow.google.cn/api_docs/python/tf/keras/callbacks/TensorBoard
EarlyStopping
https://tensorflow.google.cn/api_docs/python/tf/keras/callbacks/EarlyStopping
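
To make the effect of patience=3 concrete, here is a simplified pure-Python model of the early stopping rule. It is a sketch of the idea only, not the Keras implementation: it ignores options such as min_delta, baseline, and restore_best_weights, and early_stopping_epoch is a hypothetical name.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Best loss is reached at epoch 2; epochs 3, 4 and 5 do not improve on it.
print(early_stopping_epoch([1.0, 0.9, 0.95, 0.92, 0.91, 0.8]))  # 5
```

In this made-up sequence training stops at epoch 5 and never sees the better loss at epoch 6, which is the trade-off a small patience value makes.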

Compile the model

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-2)
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
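
The final Dense layer has no activation, so the model outputs raw logits, which is why the loss is built with from_logits=True. To turn logits into class probabilities at prediction time, a softmax is applied. The following plain-Python sketch shows the computation on made-up values; in a TensorFlow program tf.nn.softmax does the same job.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 1.0, 0.1]  # illustrative values, not model output
probs = softmax(example_logits)
print(probs)
print("predicted class:", probs.index(max(probs)))
```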

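Debug the model locally

We first train the model in a local environment to make sure the code works before sending the job to GCP. We use tfc.remote() to determine whether the code is executing locally or in the cloud. Choosing fewer epochs than the full training job is intended to run verifies that the model works without overloading your local machine.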

if tfc.remote():
    epochs = 500
    train_data = ds_train
    test_data = ds_test
else:
    epochs = 1
    train_data = ds_train.take(5)
    test_data = ds_test.take(5)
    callbacks = None

model.fit(
    train_data, epochs=epochs, callbacks=callbacks, validation_data=test_data, verbose=2
)

if tfc.remote():
    SAVE_PATH = os.path.join("gs://", GCP_BUCKET, MODEL_PATH)
    model.save(SAVE_PATH)
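
The local branch above amounts to a small smoke test. The sketch below restates that branching as a plain function so the sizes are easy to see; training_plan and its dictionary keys are hypothetical names for illustration, and the boolean argument stands in for the value returned by tfc.remote().

```python
def training_plan(remote, batch_size=64, cloud_epochs=500, local_batches=5):
    """Summarize the settings used for a cloud run or a local smoke test."""
    if remote:
        # Full datasets, all callbacks, long run bounded by early stopping.
        return {"epochs": cloud_epochs, "batches": None,
                "callbacks": True, "images_per_epoch": None}
    # Locally: one epoch over a handful of batches, no GCS-writing callbacks.
    return {"epochs": 1, "batches": local_batches,
            "callbacks": False, "images_per_epoch": local_batches * batch_size}


print(training_plan(remote=False))
print(training_plan(remote=True))
```

With a batch size of 64, the local run touches only 320 training images, which is enough to catch shape and dtype errors before paying for cloud resources.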

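Model training on Google Cloud

To train on GCP, populate the example code with your GCP project settings, then simply call tfc.run() from within your code. The API is simple, with sensible defaults for all parameters. With TensorFlow Cloud, we also don't need to worry about cloud-specific tasks such as creating VM instances and distribution strategies. In order, the API will:

Make your Python script/notebook cloud- and distribution-ready.
Convert it into a Docker image with the required dependencies.
Run the training job on a GCP cluster.
Stream relevant logs and store checkpoints.

The run() API is flexible: for example, it lets users specify a custom cluster configuration or a custom Docker image. For the full list of parameters that can be used to call run(), see the TensorFlow Cloud README.

README
https://github.com/tensorflow/cloud#usage-guide

Create a requirements.txt file listing the Python packages your model depends on. By default, TensorFlow Cloud includes TensorFlow and its dependencies in the default Docker image, so there is no need to list them. Create requirements.txt in the same directory as your Python file. For this example, its contents are: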

tensorflow-datasets
matplotlib
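
If you prefer to generate the file from the training script's directory rather than write it by hand, a few lines of standard-library Python are enough. This is only a convenience sketch; write_requirements is a hypothetical helper and the package list is the one shown above.

```python
import os


def write_requirements(script_dir, packages):
    """Write requirements.txt into script_dir and return its path."""
    path = os.path.join(script_dir, "requirements.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(packages) + "\n")
    return path


# TensorFlow itself is left out because the default image already provides it.
EXTRA_PACKAGES = ["tensorflow-datasets", "matplotlib"]
```

Calling write_requirements(os.path.dirname(os.path.abspath(__file__)), EXTRA_PACKAGES) from the training script would place the file next to it, which is where the text above says it should be created.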

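By default, the run API wraps your model code in a TensorFlow distribution strategy based on the cluster configuration you provide. In this example, we use a single-node, multi-GPU configuration, so your model code is automatically wrapped in a TensorFlow MirroredStrategy instance.

Call run() to start training in the cloud. Once your job has been submitted, you are given a link to the cloud job. To monitor the training logs, follow the link and select "View logs" to see training progress information.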
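
The kind of decision that distribution_strategy="auto" makes can be pictured with a small sketch. Only the single-node, multi-GPU case (MirroredStrategy) is stated in this article; the other branches are assumptions modelled on TensorFlow's standard strategies, and guess_strategy is a hypothetical function, not TensorFlow Cloud's actual code.

```python
def guess_strategy(chief_gpus, worker_count=0, uses_tpu=False):
    """Return the name of the strategy a cluster shape would plausibly map to.

    Illustrative only; consult the TensorFlow Cloud README for the real rules.
    """
    if uses_tpu:
        return "TPUStrategy"                  # assumption
    if worker_count > 0:
        return "MultiWorkerMirroredStrategy"  # assumption
    if chief_gpus > 1:
        return "MirroredStrategy"             # the case used in this article
    return "OneDeviceStrategy"                # assumption


# The configuration used here: one chief machine with two GPUs, no workers.
print(guess_strategy(chief_gpus=2))
```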

tfc.run(
    requirements_txt="requirements.txt",
    distribution_strategy="auto",
    chief_config=tfc.MachineConfig(
        cpu_cores=8,
        memory=30,
        accelerator_type=tfc.AcceleratorType.NVIDIA_TESLA_T4,
        accelerator_count=2,
    ),
    docker_image_bucket_name=GCP_BUCKET,
)

Visualize the model with TensorBoard

Here, we load the TensorBoard logs from our GCS bucket to evaluate model performance and history.

tensorboard dev upload --logdir "gs://your-bucket-name/logs" --name "ResNet Dogs"

Evaluate the model

After training, we can load the model stored in our GCS bucket and evaluate its performance.

if tfc.remote():
    model = tf.keras.models.load_model(SAVE_PATH)
model.evaluate(test_data)

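Next steps

This article introduced TensorFlow Cloud, a Python package that simplifies training in the cloud by reducing scaling to multiple GPUs/TPUs to a single function call, with no changes to your model code. The complete code from this article is linked below. As a next step, you can find this code example and many others in the TensorFlow Cloud repository.

TensorFlow Cloud
https://github.com/tensorflow/cloud
Code for this article
https://github.com/tensorflow/cloud/blob/master/src/python/tensorflow_cloud/core/tests/examples/call_run_within_script_with_keras_fit.py
TensorFlow Cloud repository examples
https://github.com/tensorflow/cloud/tree/master/src/python/tensorflow_cloud/core/tests/examples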
