GPU Setup and Allocation for Distributed Training (with source code you can test directly)

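0 Preface

Why distributed training?

   The dataset is too large
   The model is too complex

Why configure the GPUs for distributed training?

By default, single-machine training computes on only one GPU, yet TensorFlow reserves nearly all memory on every visible GPU no matter how much the job actually needs, so other processes can no longer use the GPUs.
Two ways to avoid this:
1. Memory growth: allocate GPU memory only as it is needed
2. Virtual devices: with only one physical GPU, manually split it into several logical GPUs
Using multiple GPUs
1. Virtual GPUs & physical GPUs
2. Manual placement & distribution strategies

API list

tf.debugging.set_log_device_placement: logs which device each operation is placed on
tf.config.set_soft_device_placement: lets TensorFlow pick an available device automatically when the requested one cannot be used
tf.config.experimental.set_visible_devices: sets the visible devices; for example, if the machine has 4 GPUs but only one is made visible, the process cannot access the others
tf.config.experimental.list_physical_devices: lists all physical devices (whole cards)
tf.config.experimental.VirtualDeviceConfiguration: defines a logical partition of a physical device
tf.config.experimental.list_logical_devices: lists all logical devices (partitions)
tf.config.experimental.set_memory_growth: enables memory growth; it must be called at the very start of the program, before the GPUs are initialized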

1 Checking the Local GPU Environment

hqc@master:~$ nvidia-smi
Mon Jun 13 16:58:43 2022       
	+-----------------------------------------------------------------------------+
	| NVIDIA-SMI 495.44       Driver Version: 495.44       CUDA Version: 11.5     |
	|-------------------------------+----------------------+----------------------+
	| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
	| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
	|                               |                      |               MIG M. |
	|===============================+======================+======================|
	|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
	| 23%   29C    P8     8W / 250W |     11MiB / 11178MiB |      0%      Default |
	|                               |                      |                  N/A |
	+-------------------------------+----------------------+----------------------+
	|   1  NVIDIA GeForce ...  Off  | 00000000:04:00.0 Off |                  N/A |
	| 23%   30C    P8     7W / 250W |     11MiB / 11178MiB |      0%      Default |
	|                               |                      |                  N/A |
	+-------------------------------+----------------------+----------------------+
	                                                                               
	+-----------------------------------------------------------------------------+
	| Processes:                                                                  |
	|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
	|        ID   ID                                                   Usage      |
	|=============================================================================|
	|    0   N/A  N/A      2035      G   /usr/lib/xorg/Xorg                  4MiB |
	|    0   N/A  N/A      4804      G   /usr/lib/xorg/Xorg                  4MiB |
	|    1   N/A  N/A      2035      G   /usr/lib/xorg/Xorg                  4MiB |
	|    1   N/A  N/A      4804      G   /usr/lib/xorg/Xorg                  4MiB |
	+-----------------------------------------------------------------------------+
# As shown, there are two GPUs

# Enter the prepared tensorflow-gpu environment
root@master:/home/hqc# source activate tf
# Start Python to check
(tf) root@master:/home/hqc# python
	Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
	[GCC 7.5.0] :: Anaconda, Inc. on linux
	Type "help", "copyright", "credits" or "license" for more information.
	>>> import tensorflow as tf
	>>> print(tf.__version__)
		2.4.1
	>>> tf.test.is_gpu_available()
		...
		True # the GPU is available
	>>> gpus = tf.config.experimental.list_physical_devices(device_type='GPU')
		...
		2022-06-13 17:14:35.957644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
		pciBusID: 0000:01:00.0 name: NVIDIA GeForce GTX 1080 Ti computeCapability: 6.1
		coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
		...
		2022-06-13 17:14:35.958938: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 1 with properties: 
		pciBusID: 0000:04:00.0 name: NVIDIA GeForce GTX 1080 Ti computeCapability: 6.1
		coreClock: 1.582GHz coreCount: 28 deviceMemorySize: 10.92GiB deviceMemoryBandwidth: 451.17GiB/s
		...
		2022-06-13 17:14:35.962910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0, 1
		# Two GPUs were found
	>>> cpus = tf.config.experimental.list_physical_devices(device_type='CPU')
	# Show the details
	>>> print(gpus, cpus)
	[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')] [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]

	# Or list them directly
	>>> tf.config.list_physical_devices('GPU')
	[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]

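So this machine has two physical GPUs.

2 GPU Configuration in Practice

2.1 Baseline: No GPU Configuration

First, run an experiment with the default GPU settings as a control group.

Base code:

### import some necessary modules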
import os
import sys
import time

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

### load data
fashion_mnist = keras.datasets.fashion_mnist
(x_train_all, y_train_all), (x_test, y_test) = fashion_mnist.load_data()

x_valid, x_train = x_train_all[:5000], x_train_all[5000:]
y_valid, y_train = y_train_all[:5000], y_train_all[5000:]

### normalize data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# x_train [None, 28, 28] --> [None, 784]
x_train_scaler = scaler.fit_transform(x_train.reshape(-1, 784)).reshape(-1, 28, 28, 1) # the trailing 1 is the channel dimension (one channel)
x_valid_scaler = scaler.transform(x_valid.reshape(-1, 784)).reshape(-1, 28, 28, 1)
x_test_scaler = scaler.transform(x_test.reshape(-1, 784)).reshape(-1, 28, 28, 1)

### make dataset
def make_dataset(images, labels, epochs, batch_size, shuffle=True):
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    if shuffle:
        dataset = dataset.shuffle(10000)

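    # prefetch: keeps up to 50 batches prepared ahead of time, so data preparation
    # overlaps with the training steps; that overlap is why it speeds training up.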
    dataset = dataset.repeat(epochs).batch(batch_size).prefetch(50)
    return dataset

batch_size = 128
epochs = 100
train_dataset = make_dataset(x_train_scaler, y_train, epochs, batch_size)

### build a model
model = keras.models.Sequential()
model.add(keras.layers.Conv2D(filters=32, kernel_size=3,
                              padding='same',
                              activation='selu',
                              input_shape=(28, 28, 1)))
model.add(keras.layers.SeparableConv2D(filters=32, kernel_size=3,
                                       padding='same',
                                       activation='selu'))
model.add(keras.layers.MaxPool2D(pool_size=2))

# Each pooling layer shrinks the feature map and sharply reduces the intermediate data; doubling the filters afterwards offsets that loss of information.
model.add(keras.layers.SeparableConv2D(filters=64, kernel_size=3,
                                       padding='same',
                                       activation='selu'))
model.add(keras.layers.SeparableConv2D(filters=64, kernel_size=3,
                                       padding='same',
                                       activation='selu'))
model.add(keras.layers.MaxPool2D(pool_size=2))

model.add(keras.layers.SeparableConv2D(filters=128, kernel_size=3,
                                       padding='same',
                                       activation='selu'))
model.add(keras.layers.SeparableConv2D(filters=128, kernel_size=3,
                                       padding='same',
                                       activation='selu'))
model.add(keras.layers.MaxPool2D(pool_size=2))

# flatten
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(128, activation='selu')) # fully connected layer
model.add(keras.layers.Dense(10, activation="softmax")) # fully connected layer

model.compile(loss=keras.losses.SparseCategoricalCrossentropy(),
              optimizer=keras.optimizers.SGD(),
              metrics=["accuracy"])

model.summary()

### training
history = model.fit(train_dataset,
                    steps_per_epoch = x_train_scaler.shape[0] // batch_size,
                    epochs=10)
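
The parameter counts that model.summary() reports for this network can be checked by hand. The sketch below is plain Python with no TensorFlow dependency; the helper names are made up for illustration, and it assumes the Keras defaults used above (3x3 kernels, biases enabled, depth multiplier 1 for SeparableConv2D, 28 -> 14 -> 7 -> 3 after three 2x2 poolings).

```python
def conv2d_params(k, c_in, c_out):
    """Standard convolution: a k x k x c_in kernel per output channel, plus biases."""
    return k * k * c_in * c_out + c_out


def separable_conv2d_params(k, c_in, c_out):
    """One k x k depthwise kernel per input channel, a 1 x 1 pointwise conv, plus biases."""
    return k * k * c_in + c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Fully connected layer: weight matrix plus biases."""
    return n_in * n_out + n_out


layer_params = [
    conv2d_params(3, 1, 32),
    separable_conv2d_params(3, 32, 32),
    separable_conv2d_params(3, 32, 64),
    separable_conv2d_params(3, 64, 64),
    separable_conv2d_params(3, 64, 128),
    separable_conv2d_params(3, 128, 128),
    dense_params(3 * 3 * 128, 128),  # Flatten output: 3 * 3 * 128 = 1152
    dense_params(128, 10),
]
total_params = sum(layer_params)

print(layer_params)  # [320, 1344, 2400, 4736, 8896, 17664, 147584, 1290]
print(total_params)  # 184234
```

The separable layers are why the convolutional part stays so small: most of the 184,234 parameters sit in the first Dense layer.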

Training inside the container:

# First install the three missing modules
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# pip install pandas
	...
	Successfully installed pandas-1.1.5 python-dateutil-2.8.2 pytz-2022.1
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# pip install matplotlib
	...
	Successfully installed cycler-0.11.0 kiwisolver-1.3.1 matplotlib-3.3.4 pillow-8.4.0 pyparsing-3.0.9
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# pip install sklearn
	...
	Successfully installed joblib-1.1.0 scikit-learn-0.24.2 scipy-1.5.4 sklearn-0.0 threadpoolctl-3.1.0
# Run it
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# python default.py 
	...
	# The GPU in use and the memory allocated on it
	2022-06-13 11:58:13.110072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10261 MB memory) -> physical GPU (device: 1, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:04:00.0, compute capability: 6.1)
	# Model summary
	Model: "sequential"
	_________________________________________________________________
	Layer (type)                 Output Shape              Param #   
	=================================================================
	conv2d (Conv2D)              (None, 28, 28, 32)        320       
	_________________________________________________________________
	separable_conv2d (SeparableC (None, 28, 28, 32)        1344      
	_________________________________________________________________
	max_pooling2d (MaxPooling2D) (None, 14, 14, 32)        0         
	_________________________________________________________________
	separable_conv2d_1 (Separabl (None, 14, 14, 64)        2400      
	_________________________________________________________________
	separable_conv2d_2 (Separabl (None, 14, 14, 64)        4736      
	_________________________________________________________________
	max_pooling2d_1 (MaxPooling2 (None, 7, 7, 64)          0         
	_________________________________________________________________
	separable_conv2d_3 (Separabl (None, 7, 7, 128)         8896      
	_________________________________________________________________
	separable_conv2d_4 (Separabl (None, 7, 7, 128)         17664     
	_________________________________________________________________
	max_pooling2d_2 (MaxPooling2 (None, 3, 3, 128)         0         
	_________________________________________________________________
	flatten (Flatten)            (None, 1152)              0         
	_________________________________________________________________
	dense (Dense)                (None, 128)               147584    
	_________________________________________________________________
	dense_1 (Dense)              (None, 10)                1290      
	=================================================================
	Total params: 184,234
	Trainable params: 184,234
	Non-trainable params: 0
	_________________________________________________________________
	# Training starts
	Epoch 1/10
	...
	429/429 [==============================] - 4s 6ms/step - loss: 2.3024 - accuracy: 0.1021
	Epoch 2/10
	429/429 [==============================] - 3s 6ms/step - loss: 2.3014 - accuracy: 0.1101
	Epoch 3/10
	429/429 [==============================] - 3s 6ms/step - loss: 2.2998 - accuracy: 0.1240
	Epoch 4/10
	429/429 [==============================] - 3s 6ms/step - loss: 2.2933 - accuracy: 0.1750
	Epoch 5/10
	429/429 [==============================] - 3s 6ms/step - loss: 1.9980 - accuracy: 0.3968
	Epoch 6/10
	429/429 [==============================] - 3s 6ms/step - loss: 0.8706 - accuracy: 0.6798
	Epoch 7/10
	429/429 [==============================] - 3s 6ms/step - loss: 0.7657 - accuracy: 0.7071
	Epoch 8/10
	429/429 [==============================] - 3s 6ms/step - loss: 0.7207 - accuracy: 0.7247
	Epoch 9/10
	429/429 [==============================] - 3s 6ms/step - loss: 0.6953 - accuracy: 0.7382
	Epoch 10/10
	429/429 [==============================] - 3s 6ms/step - loss: 0.6702 - accuracy: 0.7470
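# Training finished. The accuracy is on the low side, but that does not matter for this experiment.

With the default settings, each step of this demo takes about 6 ms.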
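
As a quick sanity check on the log, the step count and the per-epoch time follow from the dataset size and batch size. This is a worked arithmetic sketch only; the 6 ms figure is read off the log above and rounded by Keras, so the result is approximate.

```python
# Fashion-MNIST has 60,000 training images; 5,000 are held out for validation.
num_train = 60000 - 5000
batch_size = 128

# Same expression as steps_per_epoch in the script.
steps_per_epoch = num_train // batch_size

ms_per_step = 6  # taken from the training log
epoch_seconds = steps_per_epoch * ms_per_step / 1000

print(steps_per_epoch)  # 429
print(epoch_seconds)    # 2.574
```

This matches the "429/429" progress counter and the roughly 3 s per epoch shown in the log.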

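Check GPU usage with: watch -n 0.2 nvidia-smi
(screenshot of the nvidia-smi output omitted)
This single process alone fills almost all of the GPU memory, which wastes resources badly. Configuring the GPUs sensibly is therefore well worth doing.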

2.2 Enabling GPU Memory Growth

Note: this must be set at the very start of the program; otherwise an error is raised

Code changes: after the imports, before load data

### set gpu self_growth
tf.debugging.set_log_device_placement(True)  # log the device each operation runs on
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)  # allocate GPU memory on demand

logical_gpus = tf.config.experimental.list_logical_devices('GPU')

print("the num of gpu", len(gpus))
print("the num of logical gpu", len(logical_gpus))
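
The "must be set at the start" rule shows up as a RuntimeError when the GPUs have already been initialized. A minimal sketch of guarding against that, assuming TensorFlow 2.x; enable_memory_growth is a hypothetical helper name, not part of the original script. On a machine without GPUs the loop simply does nothing.

```python
import tensorflow as tf


def enable_memory_growth():
    """Enable memory growth on every physical GPU; return True on success."""
    physical_gpus = tf.config.list_physical_devices('GPU')
    try:
        for gpu in physical_gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as err:
        # Raised when the devices were already initialized by an earlier TF call.
        print("memory growth not applied:", err)
        return False
    return True


applied = enable_memory_growth()

gpus = tf.config.list_physical_devices('GPU')
# Listing logical devices initializes the runtime; configuration is frozen after this.
logical_gpus = tf.config.list_logical_devices('GPU')

print("memory growth applied:", applied)
print("the num of gpu", len(gpus))
print("the num of logical gpu", len(logical_gpus))
```

Calling the helper right after the imports, as the original snippet does, keeps it ahead of anything that touches the GPU.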

Result:

# Copy the file for editing
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# cp default.py self_growth.py
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# ls
default.py  self_growth.py
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# python self_growth.py
	...
	the num of gpu 2
	the num of logical gpu 2
	# Two physical GPUs and two logical GPUs: each physical GPU also counts as a logical GPU
	...
	420/429 [============================>.] - ETA: 0s - loss: 0.7134 - accuracy: 0.72912022-06-13 12:27:38.592831: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op __inference_train_function_927 in device /job:localhost/replica:0/task:0/device:GPU:0
	2022-06-13 12:27:38.599417: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op __inference_train_function_927 in device /job:localhost/replica:0/task:0/device:GPU:0
	2022-06-13 12:27:38.605374: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op __inference_train_function_927 in device /job:localhost/replica:0/task:0/device:GPU:0
	2022-06-13 12:27:38.611573: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op __inference_train_function_927 in device /job:localhost/replica:0/task:0/device:GPU:0
	2022-06-13 12:27:38.617694: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op __inference_train_function_927 in device /job:localhost/replica:0/task:0/device:GPU:0
	2022-06-13 12:27:38.625279: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op __inference_train_function_927 in device /job:localhost/replica:0/task:0/device:GPU:0
	2022-06-13 12:27:38.631732: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op __inference_train_function_927 in device /job:localhost/replica:0/task:0/device:GPU:0
	2022-06-13 12:27:38.637800: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op __inference_train_function_927 in device /job:localhost/replica:0/task:0/device:GPU:0
	428/429 [============================>.] - ETA: 0s - loss: 0.7133 - accuracy: 0.72922022-06-13 12:27:38.643905: I tensorflow/core/common_runtime/eager/execute.cc:760] Executing op __inference_train_function_927 in device /job:localhost/replica:0/task:0/device:GPU:0
	429/429 [==============================] - 3s 6ms/step - loss: 0.7133 - accuracy: 0.7292
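# The log shows which device each operation executes on; since the first GPU is used by default, everything runs on GPU:0

GPU usage while monitoring:
(screenshot of the nvidia-smi output omitted)

Less than 800 MB is now in use, which cuts the waste dramatically.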

2.3 Manually Choosing the Visible GPU

Here I make only the second GPU visible.
One extra line is enough: tf.config.experimental.set_visible_devices(gpus[1], 'GPU')
Its exact position is shown below:

### set gpu self_growth
#tf.debugging.set_log_device_placement(True)### show which device each variable on
gpus = tf.config.experimental.list_physical_devices('GPU')
### set gpu visible
tf.config.experimental.set_visible_devices(gpus[1], 'GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)### set gpu self_growth

logical_gpus = tf.config.experimental.list_logical_devices('GPU')

print("the num of gpu", len(gpus))
print("the num of logical gpu", len(logical_gpus))

Run it:

# Make a copy
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# cp self_growth.py visible.py
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# ls
default.py  self_growth.py  visible.py
# Edit it
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# vim visible.py 
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# python visible.py 
	...
	the num of gpu 2
	the num of logical gpu 1
	# Only one logical GPU is reported now
	...
	Epoch 10/10
	429/429 [==============================] - 3s 7ms/step - loss: 0.7120 - accuracy: 0.7311
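
The same idea can be wrapped in a small helper. This is a hedged sketch, assuming TensorFlow 2.x: use_only_gpu is a hypothetical name, and it uses the non-experimental aliases tf.config.list_physical_devices, tf.config.set_visible_devices and tf.config.get_visible_devices. Unlike the snippet above, it enables memory growth only on the GPU it keeps, and it leaves the configuration untouched when the requested index does not exist, so it also runs on a machine with fewer GPUs.

```python
import tensorflow as tf


def use_only_gpu(index):
    """Make only the physical GPU at `index` visible; return it, or None if absent."""
    physical_gpus = tf.config.list_physical_devices('GPU')
    if index >= len(physical_gpus):
        return None  # not enough GPUs: leave the default visibility unchanged
    chosen_gpu = physical_gpus[index]
    tf.config.set_visible_devices(chosen_gpu, 'GPU')
    tf.config.experimental.set_memory_growth(chosen_gpu, True)
    return chosen_gpu


chosen = use_only_gpu(1)  # second GPU, as in the article
visible = tf.config.get_visible_devices('GPU')

print("chosen:", chosen)
print("visible GPUs:", visible)
```

Like the other configuration calls, this has to run before any operation initializes the GPUs.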

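2.4 Splitting a GPU into Logical GPUs

In my tests, memory growth could not be combined with logical splitting (memory growth has to be configured before anything else touches the device), so the memory-growth part is removed
and the logical split statements are added instead.

Here the second GPU is split into two logical GPUs, each capped at 5 GB (memory_limit=5120)

The exact position is shown below: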

### list physical gpus
#tf.debugging.set_log_device_placement(True)### show which device each variable on
gpus = tf.config.experimental.list_physical_devices('GPU')
### set gpu visible
tf.config.experimental.set_visible_devices(gpus[1], 'GPU')
### divided into logical gpu
tf.config.experimental.set_virtual_device_configuration(
        gpus[1],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=5120),
         tf.config.experimental.VirtualDeviceConfiguration(memory_limit=5120)])

logical_gpus = tf.config.experimental.list_logical_devices('GPU')

print("the num of gpu", len(gpus))
print("the num of logical gpu", len(logical_gpus))

Run it:

# Make a copy
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# cp visible.py virtual_device.py
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# ls
	default.py  self_growth.py  virtual_device.py  visible.py
# Edit it
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# vim virtual_device.py 
# Run it
(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# python virtual_device.py 
	...
	the num of gpu 2
	the num of logical gpu 2
	# Both logical GPUs here come from splitting the second physical GPU
	...
	Epoch 10/10
	429/429 [==============================] - 3s 7ms/step - loss: 0.7171 - accuracy: 0.7252

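GPU monitoring:
(screenshot of the nvidia-smi output omitted)
Splitting the GPU into two logical devices also reduces memory usage, but not as much as memory growth does.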
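
For experimenting with logical splitting on a machine that may have no GPU at all, the same call can be pointed at the CPU. The sketch below assumes TensorFlow 2.x and uses tf.config.LogicalDeviceConfiguration and tf.config.set_logical_device_configuration, the non-experimental names for the calls used above. split_device is a hypothetical helper, and the 1024 MB limit is an arbitrary value chosen so the example fits on small cards (the article uses 5120).

```python
import tensorflow as tf


def split_device(device, n_parts, memory_limit_mb=None):
    """Split one physical device into n_parts logical devices."""
    if memory_limit_mb is None:
        # CPUs take no memory limit.
        configs = [tf.config.LogicalDeviceConfiguration() for _ in range(n_parts)]
    else:
        # GPUs need an explicit memory limit (in MB) for every logical device.
        configs = [tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit_mb)
                   for _ in range(n_parts)]
    tf.config.set_logical_device_configuration(device, configs)


physical_gpus = tf.config.list_physical_devices('GPU')
if physical_gpus:
    target_type = 'GPU'
    split_device(physical_gpus[-1], 2, memory_limit_mb=1024)
else:
    # Fallback so the sketch still runs without a GPU.
    target_type = 'CPU'
    split_device(tf.config.list_physical_devices('CPU')[0], 2)

logical = tf.config.list_logical_devices(target_type)
for device in logical:
    print(device.name, device.device_type)
```

Each logical device then gets its own name (for example a second device ending in :1) and can be targeted with tf.device in the usual way.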

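2.5 Manual Multi-GPU Computation

In this part, a simple matrix multiplication is run on each logical GPU and the results are summed on the CPU; the layers of the model are then placed on different GPUs.
The full code:

### import some necessary modules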
import os
import sys
import time

import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras

### set gpu self_growth
#tf.debugging.set_log_device_placement(True)### show which device each variable on
gpus = tf.config.experimental.list_physical_devices('GPU')

for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)### set gpu self_growth

logical_gpus = tf.config.experimental.list_logical_devices('GPU')

print("the num of gpu", len(gpus))
print("the num of logical gpu", len(logical_gpus))

### specify computation on specific device
c = []
for gpu in logical_gpus:
    print(gpu.name)
    with tf.device(gpu.name):
        a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
        b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
        c.append(tf.matmul(a, b))
### add on cpu
with tf.device('/cpu:0'):
    matmul_sum = tf.add_n(c)

print(matmul_sum)

### load data
fashion_mnist = keras.datasets.fashion_mnist
(x_train_all, y_train_all), (x_test, y_test) = fashion_mnist.load_data()

x_valid, x_train = x_train_all[:5000], x_train_all[5000:]
y_valid, y_train = y_train_all[:5000], y_train_all[5000:]

### normalize data
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# x_train [None, 28, 28] --> [None, 784]
x_train_scaler = scaler.fit_transform(x_train.reshape(-1, 784)).reshape(-1, 28, 28, 1) 
x_valid_scaler = scaler.transform(x_valid.reshape(-1, 784)).reshape(-1, 28, 28, 1)
x_test_scaler = scaler.transform(x_test.reshape(-1, 784)).reshape(-1, 28, 28, 1)

### make dataset
def make_dataset(images, labels, epochs, batch_size, shuffle=True):
    dataset = tf.data.Dataset.from_tensor_slices((images, labels))
    if shuffle:
        dataset = dataset.shuffle(10000)

    dataset = dataset.repeat(epochs).batch(batch_size).prefetch(50)
    return dataset

batch_size = 128
epochs = 100
train_dataset = make_dataset(x_train_scaler, y_train, epochs, batch_size)

### set different layers on different devices
model = keras.models.Sequential()
with tf.device(logical_gpus[0].name):
    model.add(keras.layers.Conv2D(filters=32, kernel_size=3,
                                padding='same',
                                activation='selu',
                                input_shape=(28, 28, 1)))
    model.add(keras.layers.SeparableConv2D(filters=32, kernel_size=3,
                                        padding='same',
                                        activation='selu'))
    model.add(keras.layers.MaxPool2D(pool_size=2))

    model.add(keras.layers.SeparableConv2D(filters=64, kernel_size=3,
                                        padding='same',
                                        activation='selu'))
    model.add(keras.layers.SeparableConv2D(filters=64, kernel_size=3,
                                        padding='same',
                                        activation='selu'))
    model.add(keras.layers.MaxPool2D(pool_size=2))


with tf.device(logical_gpus[1].name):
    model.add(keras.layers.SeparableConv2D(filters=128, kernel_size=3,
                                        padding='same',
                                        activation='selu'))
    model.add(keras.layers.SeparableConv2D(filters=128, kernel_size=3,
                                        padding='same',
                                        activation='selu'))
    model.add(keras.layers.MaxPool2D(pool_size=2))

    # flatten
    model.add(keras.layers.Flatten())
    
    model.add(keras.layers.Dense(128, activation='selu'))
    model.add(keras.layers.Dense(10, activation="softmax"))

model.compile(loss=keras.losses.SparseCategoricalCrossentropy(),
              optimizer=keras.optimizers.SGD(),
              metrics=["accuracy"])

model.summary()

### training
history = model.fit(train_dataset,
                    steps_per_epoch = x_train_scaler.shape[0] // batch_size,
                    epochs=10)

Run it:

(tf2_py3) root@53275d4a111e:/share/distributed tensorflow/config_gpu# python manual_multi_gpu.py 
	...
	the num of gpu 2
	the num of logical gpu 2
	...
	tf.Tensor(
	[[ 44.  56.]
	 [ 98. 128.]], shape=(2, 2), dtype=float32)
	 # the sum computed on the CPU
	...
	Epoch 10/10
	429/429 [==============================] - 3s 6ms/step - loss: 0.7345 - accuracy: 0.7169

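GPU monitoring:
(screenshot of the nvidia-smi output omitted)
Both GPUs are in use, so running computation on multiple manually specified GPUs works as intended.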
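
The printed tensor can be verified without TensorFlow: each of the two logical GPUs computes the same 2x3 by 3x2 product, and the CPU adds the two results. A plain-Python sketch of that arithmetic, where the matmul helper is written here purely for illustration:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

num_logical_gpus = 2  # one product per logical GPU, as in the loop above
products = [matmul(a, b) for _ in range(num_logical_gpus)]

# Element-wise sum of the per-GPU results, which is what tf.add_n does.
matmul_sum = [[sum(p[i][j] for p in products) for j in range(2)]
              for i in range(2)]

print(products[0])  # [[22.0, 28.0], [49.0, 64.0]]
print(matmul_sum)   # [[44.0, 56.0], [98.0, 128.0]]
```

The sum matches the tf.Tensor printed in the run above.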

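3 Experiment Requirements

Everything above was only testing; what I actually need is a distributed training experiment in both single-machine multi-GPU and multi-machine multi-GPU settings.

My host has only two physical GPUs, while I have prepared five tensorflow-gpu development environments packaged in Docker. Not all of them have to be used, but two GPUs are definitely not enough.

The open questions are:

  1. For the single-machine multi-GPU experiment, the two physical GPUs can be split into 4 or 5 logical GPUs, but how do I direct computation to each logical GPU?
  2. For the multi-machine multi-GPU experiment, how can each development environment use only one of the logical GPUs?
  3. After splitting into logical GPUs, can memory growth still be enabled on all of them?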
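
Before splitting, it helps to know how large each logical GPU can be. The sketch below is plain arithmetic, not a TensorFlow API: logical_gpu_plan is a hypothetical helper, and the 10261 MB figure is the usable memory TensorFlow reported for one GTX 1080 Ti in the log of section 2.1. Real limits should leave some headroom, since the usable amount depends on what else is running on the card.

```python
def logical_gpu_plan(usable_mb_per_gpu, num_physical, num_logical):
    """Spread logical GPUs as evenly as possible over the physical GPUs.

    Returns one list per physical GPU holding the memory limit (MB)
    of each logical GPU carved out of it.
    """
    base, extra = divmod(num_logical, num_physical)
    plan = []
    for i in range(num_physical):
        parts = base + (1 if i < extra else 0)
        limits = [usable_mb_per_gpu // parts] * parts if parts else []
        plan.append(limits)
    return plan


usable_mb = 10261  # "Created TensorFlow device ... with 10261 MB memory"

print(logical_gpu_plan(usable_mb, 2, 4))  # [[5130, 5130], [5130, 5130]]
print(logical_gpu_plan(usable_mb, 2, 5))  # [[3420, 3420, 3420], [5130, 5130]]
```

Each number in the plan could be passed as memory_limit when configuring the corresponding logical GPU; the 5120 MB used in section 2.4 fits within the four-way plan.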

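3.3 Answer to Question 3

After consulting references and running my own tests, I found that per-GPU memory growth and logical GPU splitting cannot be used at the same time,
so I have shelved this for now.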
