Distributed training with Keras


weixin_39693193


Article link: https://blog.csdn.net/weixin_39693193/article/details/111539493


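This tutorial describes how to use `tf.distribute.MirroredStrategy` with Keras for distributed training, so that a model can be trained in parallel on a single machine with multiple GPUs. It covers creating the strategy, preprocessing the data, building the model, training, and setting up callbacks.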


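Note: These documents were translated by the TensorFlow community. Community translations are best-effort, so there is no guarantee that they are accurate or that they reflect the latest official English documentation. If you have suggestions to improve this translation, please send a pull request to the tensorflow/docs GitHub repository. To volunteer to write or review translations, join the docs-zh-cn@tensorflow.org Google Group.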

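Overview

The tf.distribute.Strategy API provides an abstraction for distributing training across multiple processing units. Its goal is to let users enable distributed training with their existing models and training code, with only minimal changes.

This tutorial uses tf.distribute.MirroredStrategy, which performs in-graph replication with synchronous training on multiple GPUs on one machine. In essence, it copies all of the model's variables to each processor. It then uses all-reduce to combine the gradients from all processors and applies the combined result to every copy of the model.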
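
The replicate, all-reduce, apply cycle described above can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, not how MirroredStrategy is implemented: the function names, the one-parameter model, and the choice of a mean as the reduction are all assumptions made for illustration.

```python
# Toy simulation of synchronous data-parallel training with mirrored variables.
# Pure Python; no TensorFlow involved. All names here are illustrative.

def local_gradient(weight, shard):
    """Gradient of the mean of 0.5 * (w*x - y)**2 over one replica's shard."""
    return sum((weight * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Combine per-replica values so that every replica receives the same mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train_step(replica_weights, shards, learning_rate):
    # Each replica computes a gradient on its own slice of the batch.
    grads = [local_gradient(w, s) for w, s in zip(replica_weights, shards)]
    # All replicas receive the same combined gradient...
    reduced = all_reduce_mean(grads)
    # ...and apply the same update, so the copies never drift apart.
    return [w - learning_rate * g for w, g in zip(replica_weights, reduced)]


# Data generated from y = 2x, split across two simulated replicas.
shards = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
weights = [0.0, 0.0]  # one copy of the single model variable per replica
for _ in range(200):
    weights = train_step(weights, shards, learning_rate=0.05)
print(weights)
```

Because every replica starts from the same value and applies the same reduced gradient, the copies stay identical after every step, which is the property that mirroring relies on.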

MirroredStrategy is one of several distribution strategies available in TensorFlow. You can read about the other strategies in the distribution strategy guide.

Keras API

Import dependencies

# Import TensorFlow and TensorFlow Datasets
import os

import tensorflow as tf
import tensorflow_datasets as tfds

tfds.disable_progress_bar()

print(tf.__version__)

2.3.0

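Download the dataset

Download the MNIST dataset and load it from TensorFlow Datasets. This returns a dataset in tf.data format.

Setting with_info to True includes the metadata for the entire dataset, which is saved here in info. Among other things, this metadata object includes the number of training and test examples.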

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)

mnist_train, mnist_test = datasets['train'], datasets['test']

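Define the distribution strategy

Create a MirroredStrategy object. It handles the distribution and provides a context manager (tf.distribute.MirroredStrategy.scope) inside which you build your model.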

strategy = tf.distribute.MirroredStrategy()

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)

print('Number of devices: {}'.format(strategy.num_replicas_in_sync))

Number of devices: 1

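Set up the input pipeline

When training a model with multiple GPUs, you can use the extra computing power effectively by increasing the batch size. In general, use the largest batch size that fits in GPU memory and tune the learning rate accordingly.

# You can also use info.splits.total_num_examples to get the total
# number of examples in the dataset.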

num_train_examples = info.splits['train'].num_examples

num_test_examples = info.splits['test'].num_examples

BUFFER_SIZE = 10000

BATCH_SIZE_PER_REPLICA = 64

BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync
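
The batch-size arithmetic above can be checked in isolation. The sketch below is pure Python with hypothetical helper names. The linear learning-rate scaling it shows is a common heuristic and an assumption of this example, not something the tutorial prescribes. The figure 60000 is the size of the MNIST training split.

```python
import math

BATCH_SIZE_PER_REPLICA = 64


def global_batch_size(per_replica, num_replicas):
    """Each replica processes `per_replica` examples per step."""
    return per_replica * num_replicas


def steps_per_epoch(num_examples, batch_size):
    """The last, partial batch still counts as a step."""
    return math.ceil(num_examples / batch_size)


def linearly_scaled_lr(base_lr, num_replicas):
    # Assumption: the linear scaling heuristic. Treat the result as a
    # starting point for tuning, not as a rule.
    return base_lr * num_replicas


for replicas in (1, 2, 4):
    batch = global_batch_size(BATCH_SIZE_PER_REPLICA, replicas)
    print(replicas, batch, steps_per_epoch(60000, batch),
          linearly_scaled_lr(1e-3, replicas))
```

With more replicas the global batch grows and the number of steps per epoch shrinks, which is why the learning rate usually needs to be revisited.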

Pixel values range from 0 to 255 and must be normalized to the 0-1 range. Define this scaling in a function.

def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255

    return image, label

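Apply this function to the training and test data, shuffle the training data, and batch it for training. Note that we also keep an in-memory cache of the training data to improve performance.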

train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).batch(BATCH_SIZE)

eval_dataset = mnist_test.map(scale).batch(BATCH_SIZE)
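
Dataset.shuffle does not shuffle the whole dataset at once: it keeps a buffer of buffer_size elements and draws from that buffer at random. The sketch below imitates that behaviour in pure Python to show what BUFFER_SIZE controls. The function name and the `seed` parameter are hypothetical, and this is an approximation of the idea rather than the tf.data implementation.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield `items` in a randomized order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new one in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Input exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


print(list(buffered_shuffle(range(20), buffer_size=5, seed=0)))
print(list(buffered_shuffle(range(10), buffer_size=1)))
```

A buffer of size 1 leaves the order unchanged, and only a buffer at least as large as the dataset gives a fully uniform shuffle. With BUFFER_SIZE = 10000 and a larger training set, the shuffle is therefore only approximate.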

Create the model

Create and compile the Keras model in the context of strategy.scope.

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation='relu'),
        tf.keras.layers.Dense(10)
    ])

    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=['accuracy'])
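
The final Dense(10) layer has no activation, so the model outputs raw scores (logits), and the loss is told so with from_logits=True. As a rough illustration of what that loss computes for a single example, here is a pure-Python sketch. The function name is hypothetical, and the code shows the underlying formula rather than the Keras implementation.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


# Ten equal logits mean a uniform prediction over the ten digits.
print(sparse_ce_from_logits([0.0] * 10, label=3))
# A confident, correct prediction gives a loss close to zero.
print(sparse_ce_from_logits([12.0, 0.0, 0.0], label=0))
```

Keeping the softmax inside the loss, instead of in the last layer, lets it be computed in this numerically stable log-sum-exp form.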

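Define the callbacks

The callbacks used here are:

TensorBoard: This callback writes logs for TensorBoard, which lets you visualize the graphs.

Model Checkpoint: This callback saves the model after every epoch.

Learning Rate Scheduler: With this callback, you can schedule the learning rate to change after every epoch/batch.

For illustration, add a print callback that displays the learning rate in the notebook.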
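
To make the idea of a callback concrete, the sketch below imitates how a training loop might consult a learning-rate scheduler and a print callback around each epoch. The classes are toy stand-ins written for this example, not the Keras classes, and `halving_schedule` is a hypothetical schedule chosen only for illustration.

```python
# Toy stand-ins for the callback mechanism; these are NOT the Keras classes.

class ToyCallback:
    def on_epoch_begin(self, epoch, state):
        pass

    def on_epoch_end(self, epoch, state):
        pass


class ToyLearningRateScheduler(ToyCallback):
    """Sets the learning rate from schedule(epoch) before each epoch starts."""

    def __init__(self, schedule):
        self.schedule = schedule

    def on_epoch_begin(self, epoch, state):
        state["learning_rate"] = self.schedule(epoch)


class ToyPrintLR(ToyCallback):
    """Records and prints the learning rate once an epoch has finished."""

    def __init__(self):
        self.history = []

    def on_epoch_end(self, epoch, state):
        self.history.append(state["learning_rate"])
        print("Learning rate for epoch {} is {}".format(
            epoch + 1, state["learning_rate"]))


def halving_schedule(epoch):
    # Hypothetical schedule for illustration only: start at 0.1, halve each epoch.
    return 0.1 * 0.5 ** epoch


def run_epochs(num_epochs, callbacks):
    state = {"learning_rate": None}
    for epoch in range(num_epochs):
        for callback in callbacks:
            callback.on_epoch_begin(epoch, state)
        # A real loop would run the training batches for this epoch here.
        for callback in callbacks:
            callback.on_epoch_end(epoch, state)
    return state


printer = ToyPrintLR()
final_state = run_epochs(3, [ToyLearningRateScheduler(halving_schedule), printer])
```

The training loop itself knows nothing about schedules or logging. It only calls the hooks, which is what makes callbacks easy to combine.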

# Define the checkpoint directory to store the checkpoints
checkpoint_dir = './training_checkpoints'

# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Function for decaying the learning rate.
# You can define any decay function you need.
def decay(epoch):
    if epoch < 3:
        return 1e-3
    elif epoch >= 3 and epoch < 7:
        return 1e-4
    else:
        ...  # the source text is cut off at this point
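
The {epoch} placeholder in checkpoint_prefix is an ordinary Python format field that is filled in for each saved checkpoint. The sketch below shows the expansion in plain Python. The helper `checkpoint_path_for` is hypothetical, and the one-based epoch numbering is an assumption about how the checkpoint callback numbers its files.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")


def checkpoint_path_for(epoch_index):
    # Assumption: files are numbered from 1, so zero-based epoch 0 -> "ckpt_1".
    return checkpoint_prefix.format(epoch=epoch_index + 1)


for epoch_index in range(3):
    print(checkpoint_path_for(epoch_index))
```

Using a template with the epoch number keeps one file set per epoch instead of overwriting a single checkpoint.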
