keras实现多gpu数据并行训练

qq_38086463

已于 2022-06-23 17:42:12 修改

阅读量1.3k

点赞数

分类专栏： keras 文章标签： keras tensorflow 深度学习

于 2022-06-23 17:40:44 首次发布

本文链接：https://blog.csdn.net/qq_38086463/article/details/125404532

版权

keras 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

多gpu数据并行的原理(官方原话)

假设你的电脑上有8个gpu：
在这里插入图片描述
值得注意的是，在实际操作中，我发现如果不增加批次(batch)，只增加gpu的数量，模型的训练速度并不会提高，反而会降低。原因我还没找到，可能是数据在不同gpu传输时浪费了时间吧(我猜的)。

demo

import tensorflow.keras as keras
import tensorflow as tf

# 获取已经建好并编译了的模型
def get_compiled_model():
    # Make a simple 2-layer densely-connected neural network.
    inputs = keras.Input(shape=(784,))
    x = keras.layers.Dense(256, activation="relu")(inputs)
    x = keras.layers.Dense(256, activation="relu")(x)
    outputs = keras.layers.Dense(10)(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[keras.metrics.SparseCategoricalAccuracy()],
    )
    return model

# 获取mnist数据，并划分好训练集、验证集、测试集
def get_dataset():
    batch_size = 256
    num_val_samples = 10000

    # Return the MNIST dataset in the form of a `tf.data.Dataset`.
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

    # Preprocess the data (these are Numpy arrays)
    x_train = x_train.reshape(-1, 784).astype("float32") / 255
    x_test = x_test.reshape(-1, 784).astype("float32") / 255
    y_train = y_train.astype("float32")
    y_test = y_test.astype("float32")

    # Reserve num_val_samples samples for validation
    x_val = x_train[-num_val_samples:]
    y_val = y_train[-num_val_samples:]
    x_train = x_train[:-num_val_samples]
    y_train = y_train[:-num_val_samples]
    return (
        tf.data.Dataset.from_tensor_slices((x_train, y_train)).batch(batch_size),
        tf.data.Dataset.from_tensor_slices((x_val, y_val)).batch(batch_size),
        tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(batch_size),
    )


# Create a MirroredStrategy.
# 指定两个gpu
devices = ['/device:gpu:0', '/device:gpu:1']
strategy = tf.distribute.MirroredStrategy(devices=devices, cross_device_ops=tf.distribute.ReductionToOneDevice())
print("Number of devices: {}".format(strategy.num_replicas_in_sync))

# Open a strategy scope.
with strategy.scope():
    # Everything that creates variables should be under the strategy scope.
    # In general this is only model construction & `compile()`.
    model = get_compiled_model()
# 如果想测试单gpu的话，可以把上面一直到“指定两个gpu”的代码注释掉并取消下面一行代码的注释
# model = get_compiled_model()
# Train the model on all available devices.
train_dataset, val_dataset, test_dataset = get_dataset()
model.fit(train_dataset, epochs=2, validation_data=val_dataset)

# Test the model on all available devices.
model.evaluate(test_dataset)

不同gpu和批次(batch)的对比

1、使用mnist数据集

gpu	批次(batch)	完成每个epoch的时间	准确率
1	64	2s	97.64%
2	256	2.5s	97.47%
3	384	2s	97.16%
4	512	2s	96.83%

2、使用cifar10数据集

gpu	批次(batch)	完成每个epoch的时间	准确率
1	32	10s	79.85%
1	128	4s	77.30%
2	64	12s	79.07%
2	256	4s	73.90%
3	96	14s	78.34%
3	192	6s	76.26%
3	288	4s	75.08%
3	384	3s	72.47%
4	128	17s	77.83%
4	256	8s	75.48%
4	512	3s	72.79%

cifar10数据集只跑了50轮，有兴趣的可以自己去试一下。

通过修改不同的gpu和批次(batch)，我们得出以下结论:

1、单纯的增加gpu数量而不同步增加批次(batch)，模型的训练速度不会提高，甚至会降低；
2、gpu数量必须是批次(batch)的因数，也就是说批次(batch)必须能整除gpu数量；
3、只有在增加gpu数量的同时，同步增加批次(batch)，模型的训练速度才会提高。也就是说，keras的多gpu的作用是让你能用更大的批次(batch)而不用担心gpu的内存被爆掉。

如果你使用一个gpu就可以把批次(batch)拉到最大，那你完全没有必要使用多gpu，这反而会降低你的训练速度。

使用cifar10数据集的demo。
我的环境：
tensorflow-gpu 2.3.0
python 3.7.10
keras 2.3.0
cudnn 7.6.5
cudatoolkit 10.1.243

参考：
在 CIFAR10 小型图像数据集上训练一个深度卷积神经网络。
Multi-GPU and distributed training
多 GPU 和分布式训练
 win10用tensorflow是不是不能多卡并行训练？