TensorFlow custom gradients: multi-GPU custom parallel training in TensorFlow 2.0

weixin_39607450

Article link: https://blog.csdn.net/weixin_39607450/article/details/111662210

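I recently worked out how to write multi-GPU custom parallel training code in TF 2.0 that loads TFRecord files, and hit a few pitfalls along the way. This post records them.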

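Note: this post uses a custom training loop for multi-GPU parallel training, so build the model from layers defined in tf.keras. If you use standalone Keras (not tf.keras), do not follow this post; for the reason, see the view of @windows98: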

TF has not turned into Keras; it has only adapted Keras's classes to serve as its own state containers.

Also, in TF 2.3.1 (the latest version at the time of writing), strategy.experimental_run_v2 has been renamed to strategy.run.
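
Since the method name depends on the installed TF version, a small helper can pick whichever one exists. The following is only a minimal sketch: `resolve_run_fn` is a hypothetical helper name, not part of TensorFlow, and the two stand-in classes exist only so the example runs without TensorFlow or GPUs.

```python
def resolve_run_fn(strategy):
    """Return strategy.run if it exists, otherwise strategy.experimental_run_v2."""
    run_fn = getattr(strategy, "run", None)
    if run_fn is None:
        run_fn = strategy.experimental_run_v2
    return run_fn


# Hypothetical stand-ins that mimic the old and the new method names.
class OldStrategy:
    def experimental_run_v2(self, fn, args=()):
        return fn(*args)


class NewStrategy:
    def run(self, fn, args=()):
        return fn(*args)


def add(a, b):
    return a + b


old_result = resolve_run_fn(OldStrategy())(add, args=(1, 2))
new_result = resolve_run_fn(NewStrategy())(add, args=(1, 2))
```

With a real strategy object, the same helper would be called as `resolve_run_fn(strategy)(train_step, args=(x, y))` in place of the direct `strategy.experimental_run_v2(...)` call.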

This post mainly follows the official TF 2.0 tutorial:

Custom training with tf.distribute.Strategy | TensorFlow Core (tensorflow.google.cn)

First, import the TensorFlow packages:

from __future__ import absolute_import, division, print_function, unicode_literals

# Import TensorFlow
import tensorflow as tf

# Helper libraries
import numpy as np
import os
import time  # used for timing in the training loop

Create a MirroredStrategy to distribute the data and the computation graph:

strategy = tf.distribute.MirroredStrategy()

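The num_replicas_in_sync attribute of MirroredStrategy gives the number of available GPUs.

Define the constants. batch_size_per_replica is the batch size trained on each GPU (replica), and strategy.num_replicas_in_sync is the number of GPUs this script can use. You can restrict the GPUs inside the script with os.environ['CUDA_VISIBLE_DEVICES'], which must be set before TensorFlow initializes the GPUs; if it is not set, all GPUs are used by default: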

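# Per-GPU batch size (example value, an assumption; tune it for your GPU memory)
batch_size_per_replica = 8
# Global batch size
GLOBAL_BATCH_SIZE = batch_size_per_replica * strategy.num_replicas_in_sync
# Buffer size for data loader
BUFFER_SIZE = batch_size_per_replica * strategy.num_replicas_in_sync * 16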
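
To make the arithmetic concrete, here is a minimal sketch that runs without TensorFlow. The GPU IDs and the per-replica batch size are example values (assumptions), and `num_replicas` only stands in for strategy.num_replicas_in_sync on a machine where both listed GPUs exist.

```python
import os

# Restrict this process to GPUs 0 and 1 (example IDs, an assumption).
# This has to be set before TensorFlow initializes the GPUs, so in a real
# script it goes before `import tensorflow`.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# Stand-in for strategy.num_replicas_in_sync; the real value comes from
# tf.distribute.MirroredStrategy().
num_replicas = len(os.environ["CUDA_VISIBLE_DEVICES"].split(","))

batch_size_per_replica = 8  # example value (assumption)
global_batch_size = batch_size_per_replica * num_replicas
buffer_size = global_batch_size * 16

# Each replica still receives batch_size_per_replica samples per step.
per_replica_share = global_batch_size // num_replicas
```

With two visible GPUs and 8 samples per GPU, the global batch holds 16 samples and the shuffle buffer 256.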

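Define a load_dataset function that reads and parses the TFRecord data. Note that _parse_function normalizes the data, and that preprocessing must be done before strategy distributes the dataset; in my experience, trying to process the data after it has been distributed raises an error.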

def load_dataset(tfr_dir, batch_size, buffer_size):
    filenames = os.listdir(tfr_dir)
    # os.path.join works whether or not tfr_dir ends with a path separator
    training_filenames = [os.path.join(tfr_dir, filename) for filename in filenames]
    
    dataset = tf.data.TFRecordDataset(training_filenames)
    
    # Define feature_description to match your own data format
    feature_description = {
        'x': tf.io.FixedLenFeature([], tf.string), 
        'y': tf.io.FixedLenFeature([], tf.string)
    }

    # Note: do the preprocessing here; once the dataset has been distributed by strategy, it can no longer be preprocessed
    def _parse_function(example_proto):
        example = tf.io.parse_single_example(example_proto, feature_description)
        
        x = tf.io.decode_raw(example['x'], tf.int32)
        y = tf.io.decode_raw(example['y'], tf.uint8)

        x = tf.reshape(x, [512, 512, 7])[..., 1:5]
        y = tf.reshape(y, [512, 512])

        x = tf.cast(x, tf.float32)
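        # Min-max normalization; EPSILON is a small constant, defined elsewhere, that avoids division by zero
        x = (x - tf.reduce_min(x)) / (tf.reduce_max(x) - tf.reduce_min(x) + EPSILON)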
        y = tf.cast(y, tf.int32)

        example['x'] = x
        example['y'] = y

        return example

    dataset = dataset.map(_parse_function).shuffle(buffer_size).batch(batch_size, drop_remainder=True)

    return dataset
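
The normalization step in _parse_function rescales each sample by its own minimum and maximum, taken over the whole tensor. A minimal pure-Python sketch of the same formula shows what EPSILON is for; the value `1e-7` is an assumption, since the post does not state which value was used.

```python
EPSILON = 1e-7  # hypothetical value; the post does not say which one it uses


def min_max_normalize(values, epsilon=EPSILON):
    """Rescale values to roughly [0, 1] using (v - min) / (max - min + epsilon)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + epsilon) for v in values]


normalized = min_max_normalize([10.0, 20.0, 30.0])
# A constant input would divide by zero without epsilon; with it, all values map to 0.
constant = min_max_normalize([5.0, 5.0, 5.0])
```

Because of the added epsilon, the largest value lands slightly below 1 rather than exactly on it.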

After loading the data, distribute it with strategy:

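# tfr_dir is a placeholder for your TFRecord directory; batching by the global batch size is an assumption about how load_dataset is called
dataset = load_dataset(tfr_dir, GLOBAL_BATCH_SIZE, BUFFER_SIZE)
dist_dataset = strategy.experimental_distribute_dataset(dataset)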

The model and the loss function both have to be defined under strategy.scope:

with strategy.scope():
    # Define the model, optimizer, checkpoint, and checkpoint manager
    model = create_model()
    optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
    checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
    checkpoint_manager = tf.train.CheckpointManager(checkpoint, directory=checkpoint_dir, max_to_keep=5, checkpoint_name='ckpt')
    # Define the loss
    loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE)
    def compute_loss(logits, labels):
        per_example_loss = loss_object(y_true=labels, y_pred=logits)
        return tf.nn.compute_average_loss(per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)

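Note a few points about the code above:

When creating the loss class tf.keras.losses.SparseCategoricalCrossentropy, set its reduction argument to tf.keras.losses.Reduction.NONE.
In the loss function compute_loss, return the result of tf.nn.compute_average_loss, with its global_batch_size argument set to the GLOBAL_BATCH_SIZE constant defined at the start of this post.

The reason for these changes: on a single GPU, the loss is reduced with tf.reduce_mean after it is computed. With multiple GPUs, the gradients computed on each replica are synchronized by summing, so the loss on each replica must be divided by the total number of examples processed in one step (GLOBAL_BATCH_SIZE) rather than by the per-replica batch size.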
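
The effect of this scaling can be checked with plain arithmetic. The sketch below uses made-up per-example losses split over two replicas and no TensorFlow; it compares the scaling described above with the naive alternative of taking a mean on each replica.

```python
def mean(xs):
    return sum(xs) / len(xs)


# Per-example losses of one global batch, split over 2 replicas (made-up numbers).
replica_losses = [[1.0, 3.0], [2.0, 6.0]]
global_batch_size = sum(len(r) for r in replica_losses)

# What a single GPU would report: the mean over the whole global batch.
single_gpu_loss = mean([l for r in replica_losses for l in r])

# Scaling used in compute_loss: per-replica sum / global batch size,
# then a SUM across replicas.
summed_scaled = sum(sum(r) / global_batch_size for r in replica_losses)

# Naive alternative: per-replica mean, then a SUM across replicas.
summed_naive = sum(mean(r) for r in replica_losses)
```

Here `summed_scaled` matches `single_gpu_loss`, while `summed_naive` is too large by a factor equal to the number of replicas.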

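Next, define the training step. Note that this is a TensorFlow custom training loop: tf.GradientTape records the forward pass on the tape, and tape.gradient then computes the gradients of the loss with respect to the corresponding parameters: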

with strategy.scope():
    def train_step(inputs, labels):
        with tf.GradientTape() as tape:
            logits = model(inputs, training=True)
            loss = compute_loss(logits=logits, labels=labels)

        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))

        return loss
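
The same scaling argument carries over to the gradients that train_step computes. As a minimal sketch, take a hypothetical one-parameter model pred = w * x with squared error (not the model from this post): summing the per-replica gradients of the scaled loss gives the same value as the full-batch gradient on one device.

```python
def grad_squared_error(w, x, y):
    """d/dw of (w * x - y) ** 2 for a single example."""
    return 2.0 * (w * x - y) * x


w = 1.0
# (x, y) pairs of one global batch, split over 2 replicas (made-up numbers).
replica_data = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
global_batch_size = sum(len(r) for r in replica_data)

# Each replica differentiates its own loss, already divided by the global batch size.
replica_grads = [
    sum(grad_squared_error(w, x, y) for x, y in r) / global_batch_size
    for r in replica_data
]

# Summing across replicas stands in for the gradient synchronization.
synced_grad = sum(replica_grads)

# Reference: gradient of the mean loss over the whole batch on one device.
all_examples = [pair for r in replica_data for pair in r]
full_batch_grad = sum(grad_squared_error(w, x, y) for x, y in all_examples) / len(all_examples)
```

Both `synced_grad` and `full_batch_grad` evaluate to -15.0 for these numbers, so the optimizer update matches single-GPU training on the same global batch.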

Run the distributed training step with experimental_run_v2:

with strategy.scope():
    @tf.function  # I have not tried what happens if this decorator is removed
    def distributed_train_step(dataset_inputs, dataset_labels):
        per_replica_losses = strategy.experimental_run_v2(
            train_step, args=(dataset_inputs, dataset_labels)
        )
        return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

Train using Python control flow:

with strategy.scope():
    for epoch in range(EPOCHS):
        # ==== Train ====
        start = time.time()
        total_loss = 0.0
        num_batches = 0
            
        for record in dist_dataset:
            x_train = record['x']
            y_train = record['y']

            total_loss += distributed_train_step(x_train, y_train)
            num_batches += 1
        train_loss = total_loss / num_batches
        end = time.time()
        print('[{}] Time for epoch {} / {} is {:0.4f} sec, loss {:0.4f}'.format(time.asctime(), epoch + 1, EPOCHS, end - start, train_loss))
        
        # ==== Save checkpoint and validate ====
        if (epoch + 1) % SAVE_STEP == 0:
            checkpoint_save_path = checkpoint_manager.save()
            print('[{}] Checkpoint saved, for epoch {}, at {}'.format(time.asctime(), epoch + 1, checkpoint_save_path))

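A note on the problem I ran into:

The problem was with decode_raw: at first I distributed the dataset and then decoded the distributed data, which raised an error. After moving the decoding and preprocessing steps into _parse_function, so that what gets distributed is already-decoded tensors, the code ran without issues.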
