GPU Allocation and Usage Strategies in TensorFlow and PyTorch: A Detailed Guide

LoveMIss-Y

Article link: https://blog.csdn.net/qq_27825451/article/details/106002237

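This article explains in detail how to manage and allocate GPU resources for deep learning frameworks such as TensorFlow and PyTorch in a multi-GPU environment. It covers key concepts and practical techniques, including GPU visibility settings, device mapping, memory management, and virtual GPUs.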

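Preface: I have read many articles on allocating and using multiple GPUs, and most of them cover only the basics without a thorough explanation. The server used in this article has four RTX 2080 Ti GPUs.

Note: the GPUs a deep learning framework works with are determined not by the number and numbering of the physical GPUs in the machine, but by the set of GPUs we make visible to the framework. What does that mean?

1. GPU visibility and the mapping to the GPUs the framework uses: device mapping

(1) With no restrictions set, the framework sees all four GPUs, and the mapping is as follows: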

/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5

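The left side is the full device name used by the framework; the right side is the actual hardware.

(2) Specifying the visible GPUs myself

Suppose GPU:0 and GPU:3 are being used by someone else, so I can only use GPU:1 and GPU:2. The output then looks like this: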

/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:02:00.0, compute capability: 7.5
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 2, name: GeForce RTX 2080 Ti, pci bus id: 0000:03:00.0, compute capability: 7.5

Now only two GPUs are available. How to specify the visible devices is covered later.

Pay close attention to the device mapping here. The mapping is now:

/device:GPU:0 -> device: 1
/device:GPU:1 -> device: 2

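When assigning devices in TensorFlow, the only names we can use are /device:GPU:0 and /device:GPU:1, even though they are physically the second and third GPUs. This is easy to get wrong. For example, if I now write: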

with tf.device("/gpu:2"):

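This fails. Why, when I am clearly using the second and third GPUs (physical devices 1 and 2)? Because of the mapping: "/gpu:2" refers to a third visible GPU, and only two are visible.

Remember: TensorFlow and PyTorch both identify devices through this mapping, i.e. through the names /device:GPU:0 and /device:GPU:1 above.

A few more examples:

If only the fourth GPU is used, then /device:GPU:0 -> device: 3
If only the third and fourth GPUs are used, then /device:GPU:0 -> device: 2 and /device:GPU:1 -> device: 3
If only the first GPU is used, then /device:GPU:0 -> device: 0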
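
The mapping rule in these examples can be sketched in a few lines of plain Python. This is only an illustration of the renumbering, not code taken from either framework: `device_mapping` and `num_physical` are names I made up, only integer indices are handled, and the stop-at-first-invalid-index behaviour reflects my understanding of how CUDA treats CUDA_VISIBLE_DEVICES.

```python
def device_mapping(visible_devices, num_physical=4):
    """Return {framework index: physical index} for a CUDA_VISIBLE_DEVICES value.

    Hypothetical helper: models the renumbering only, for integer indices.
    """
    mapping = {}
    for token in visible_devices.split(","):
        token = token.strip()
        # Assumption: enumeration stops at the first invalid index,
        # so only the devices listed before it stay visible.
        if not token.isdigit() or int(token) >= num_physical:
            break
        mapping[len(mapping)] = int(token)
    return mapping


print(device_mapping("1,2"))  # {0: 1, 1: 2}
print(device_mapping("3"))    # {0: 3}
print(device_mapping("2,3"))  # {0: 2, 1: 3}
```

The keys are the indices the framework accepts ("/gpu:0", "cuda:0", ...); the values are the physical GPUs behind them.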

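(3) When GPU memory is completely full

When the memory of some GPUs is fully occupied, for example the third and fourth GPUs on my server, listing all GPU devices fails, as with the following code: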

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())
'''Cause of the error: the memory of the third GPU (CUDA device ordinal 2) is full
tensorflow.python.framework.errors_impl.InternalError: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY: out of memory; total memory reported: 11554717696
'''

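When the third and fourth GPUs are fully occupied and we make only GPUs 0 and 1 visible, we get the following result:

In the original screenshot, the red box marks the two GPUs that can currently be used, the blue box marks the third and fourth GPUs whose memory is full, and the green box marks the GPU device mapping.

Summary: this server is shared by several people, so to train reliably on multiple GPUs it is best to follow the steps below.

(1) Step one: explicitly specify the visible devices. Decide which GPUs should be visible to TensorFlow or PyTorch; device mapping is then performed on those GPUs according to the rules above. To watch GPU usage in real time, use the following command: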

watch -n 1 nvidia-smi

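(2) Step two: configure how the mapped GPUs are used. Building on step one, decide which tensors and operations go on which GPU, how much of each device may be used, how much memory may be allocated, and so on. All of this is expressed in terms of the device mapping from step one, which is important to keep in mind.

(3) Put the code that specifies the visible devices at the very beginning of the program, so that a GPU filled up by someone else cannot cause unexpected errors.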
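
On a shared server, step one usually starts with finding out which GPUs are actually free. The sketch below is one possible way to do that by parsing the CSV output of nvidia-smi. The helper names, the 10% threshold and the sample numbers are all made up for illustration; only the nvidia-smi query flags are real.

```python
import subprocess

# Real nvidia-smi flags: one "index, used MiB, total MiB" line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines into a list of integer tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def free_gpus(rows, max_used_fraction=0.1):
    """Return the indices of GPUs using at most max_used_fraction of their memory."""
    return [index for index, used, total in rows
            if used / total <= max_used_fraction]


def query_free_gpus():
    """Run nvidia-smi and return the free GPU indices (needs an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return free_gpus(parse_gpu_memory(out))


# Made-up sample: GPUs 2 and 3 are nearly full, GPUs 0 and 1 are idle.
sample = "0, 15, 11019\n1, 120, 11019\n2, 10950, 11019\n3, 10990, 11019\n"
print(free_gpus(parse_gpu_memory(sample)))  # [0, 1]
```

The resulting indices are physical indices, so they are what goes into the visibility setting, not into "/gpu:N" or "cuda:N".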

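2. Ways to explicitly specify the visible GPUs

Explicitly specifying GPUs means letting the framework see only the GPUs we name and performing device mapping on them. GPUs that are not named are invisible to the framework, whether they are idle or already full.

There are several ways to do this; each is described below.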

2.1 Specify in the terminal when launching the script (TensorFlow and PyTorch)

For example:

CUDA_VISIBLE_DEVICES=1 python train_net.py
CUDA_VISIBLE_DEVICES=0,1 python train_net.py
CUDA_VISIBLE_DEVICES=0,2,3 python train_net.py
CUDA_VISIBLE_DEVICES="1,2" python train_net.py
CUDA_VISIBLE_DEVICES="1,2,3" python train_net.py
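
The same per-process effect can be had from a Python launcher, which is handy when several jobs are started from one script. This is a minimal sketch: `run_with_visible_gpus` is a hypothetical helper, and the child command here just prints the variable instead of running a real train_net.py.

```python
import os
import subprocess
import sys


def run_with_visible_gpus(argv, visible):
    """Run argv in a child process that only sees the given GPUs."""
    # Copy the parent environment and override one variable for the child only.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible)
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in for "python train_net.py": the child prints what it was given.
child = [sys.executable, "-c",
         "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_with_visible_gpus(child, "1,2")
print(result.stdout.strip())  # 1,2
```

The parent's own environment is left unchanged, so each child can be given a different set of GPUs.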

2.2 Specify the visible devices at the start of the program with the os module (TensorFlow and PyTorch)

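import os
# Each line below is an alternative; set the variable before the framework initializes CUDA
os.environ["CUDA_VISIBLE_DEVICES"]="2"
os.environ["CUDA_VISIBLE_DEVICES"]="0,2"
os.environ["CUDA_VISIBLE_DEVICES"]="1,2,3"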
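
A small wrapper makes it harder to mistype the variable name or the index list. The sketch below is only a suggestion: `restrict_gpus` is a name I made up, and setting CUDA_DEVICE_ORDER to PCI_BUS_ID is an optional extra that, as far as I know, makes CUDA number the GPUs in the same order nvidia-smi shows them.

```python
import os


def restrict_gpus(indices):
    """Make only the given physical GPUs visible to this process.

    Hypothetical helper; call it before importing tensorflow or torch.
    """
    value = ",".join(str(i) for i in indices)
    # Optional: enumerate GPUs by PCI bus id, matching the nvidia-smi order.
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


restrict_gpus([1, 2])
# import tensorflow / torch only after this point
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 1,2
```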

2.3 GPU visibility in TensorFlow 1.x (TensorFlow 1.13 and earlier)

# GPU-related session configuration
gpu_options = tf.GPUOptions()
gpu_options.visible_device_list = "1,2" # the two visible GPUs are the second and third

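2.4 TensorFlow 1.14 and TensorFlow 2.x

# Get all physical GPUs (older releases expose these functions under tf.config.experimental)
physical_devices = tf.config.list_physical_devices('GPU') 
# Make only the GPUs from the second one onward visible
tf.config.set_visible_devices(physical_devices[1:], 'GPU')

The function signature is: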

tf.config.set_visible_devices(devices, device_type=None)

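2.5 PyTorch (for torch, CUDA_VISIBLE_DEVICES is the recommended way to control device visibility)

import torch

print(torch.cuda.is_available())     # True
print(torch.cuda.device_count())     # 4, there are 4 GPUs
torch.cuda.set_device(2)             # select the third GPU
device = torch.cuda.current_device() # the current GPU is 2, so this returns 2

# In principle only one GPU has been set here, so one would expect only one usable GPU, i.e. only cuda:0 below
# But cuda:1, cuda:2 and cuda:3 all still work, because set_device only changes the default device
cuda = torch.device("cuda:1")  # cuda:1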

x=torch.tensor([1,2,3],device=cuda)
y=torch.tensor([4,5,6],device=cuda)
z=torch.add(x,y)
print(z)

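For this reason the official documentation discourages torch.cuda.set_device(): it does not actually control which devices are visible to the framework. CUDA_VISIBLE_DEVICES is recommended instead.

See the following: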

import os
os.environ["CUDA_VISIBLE_DEVICES"]="1"  # only the second GPU is visible to torch, so only cuda:0 is usable
import torch  # imported after the variable is set

print(torch.cuda.is_available())     # True
print(torch.cuda.device_count())     # 1, only one GPU is visible

#torch.cuda.set_device(2)             # fails, because GPU 2 is not visible to torch
device = torch.cuda.current_device() # index of the current visible GPU, returns 0

# Constructing a torch.device never fails, whether or not the GPU is really visible; torch.device("cuda:1") would still return cuda:1 here
# But placing a tensor on cuda:1 would then raise RuntimeError: CUDA error: invalid device ordinal
# So use cuda:0, and the tensors below work
cuda = torch.device("cuda:0")  # cuda:0

x=torch.tensor([1,2,3],device=cuda)
y=torch.tensor([4,5,6],device=cuda)
z=torch.add(x,y)
print(z)
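
The snippet above assumes a GPU is present. When the same script also has to run on a machine without a usable GPU (or without PyTorch at all), a fallback such as the following may help. It is a sketch under stated assumptions: `pick_device` is a hypothetical helper, and the guarded import is only there so the example can be read and run anywhere.

```python
import os

# Respect an existing setting; otherwise expose only physical GPU 1.
# This must happen before CUDA is initialized.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "1")

try:
    import torch
except ImportError:  # PyTorch not installed: the sketch falls back to "cpu"
    torch = None


def pick_device():
    """Return "cuda:0" (the first visible GPU) if usable, else "cpu"."""
    if torch is not None and torch.cuda.is_available():
        return "cuda:0"
    return "cpu"


device = pick_device()
print(device)

if torch is not None:
    x = torch.tensor([1, 2, 3], device=device)
    y = torch.tensor([4, 5, 6], device=device)
    print(torch.add(x, y))
```

Because of the device mapping, "cuda:0" here always means the first visible GPU, whichever physical GPU that happens to be.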

3. Common GPU settings in different TensorFlow versions

3.1 TensorFlow 1.13 and earlier

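# List all devices, including GPUs
from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

'''
This method is no longer recommended because of a flaw: when GPUs 0 and 1 are made visible to TensorFlow
and GPUs 2 and 3 are full because other people are using them, it cannot list devices 2 and 3 and instead reports that memory is exhausted.
Prefer tf.config.list_physical_devices from newer versions: even when memory is fully occupied, the physical GPUs should at least be counted.
'''

The related settings cover assigning operations and tensors to devices, limiting GPU memory, logging the device of each operation, and automatic placement on the visible devices.

Note: all of these build on the visible devices configured earlier!

(1) Configuring through GPUOptions, ConfigProto and Session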

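import tensorflow as tf

graph = tf.Graph()  # placeholder graph; build the model inside graph.as_default()

# Create a GPUOptions object and set its attributes (all relative to the visible devices)
gpu_options = tf.GPUOptions()
gpu_options.visible_device_list = "1,2"       # which GPUs are visible
gpu_options.allow_growth = True               # allocate GPU memory on demand instead of reserving it all up front
gpu_options.per_process_gpu_memory_fraction = 0.4 # upper bound on the fraction of each visible GPU's memory

# Create a ConfigProto object and set its gpu_options
config = tf.ConfigProto(gpu_options = gpu_options)
config.log_device_placement = True      # log the device of each operation, in terms of the visible device mapping
config.allow_soft_placement = True      # fall back to another visible device when the requested one is unavailable
config.inter_op_parallelism_threads = 0 # threads for running independent operations in parallel; 0 lets TensorFlow choose
config.intra_op_parallelism_threads = 0 # threads used inside a single operation; 0 lets TensorFlow choose

# Create the Session and associate it with the graph
with tf.Session(config = config, graph = graph) as sess:
    pass  # run operations here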

(2) Assigning operations to a specific device (in terms of the visible device mapping)

graph = tf.Graph()
with graph.as_default():
    # In graph mode, tf.device applies to the current default graph, so enter it inside graph.as_default()
    with tf.device("/gpu:0"):  # the first visible device

        a = tf.constant([1.0,2.0])
        b = tf.constant([3.0,4.0])
        c = tf.add(a,b,name="a_add_b")

        x = tf.Variable(initial_value=[10.0,20.0])
        y = tf.Variable(initial_value=[30.0,40.0])
        z = tf.add(x,y,name="x_add_y")

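Summary: device names in TensorFlow

"/device:CPU:0": the CPU of the machine
"/GPU:0": the first GPU visible to TensorFlow; this is a shorthand and the form we use most often
"/job:localhost/replica:0/task:0/device:GPU:1": the second GPU visible to TensorFlow; this is the fully qualified name, not a shorthand

4. Allocation and usage strategies in TensorFlow 1.14 and later (TF 2.x)

(1) Checking the number of GPUs and making sure they are available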

# tf.config.experimental.list_physical_devices('GPU') 

import tensorflow as tf
# List all physical devices
print("All physical devices: ", tf.config.experimental.list_physical_devices())
'''
[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), 
PhysicalDevice(name='/physical_device:XLA_CPU:0', device_type='XLA_CPU'), 
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), 
PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU'), 
PhysicalDevice(name='/physical_device:GPU:2', device_type='GPU'), 
PhysicalDevice(name='/physical_device:GPU:3', device_type='GPU'), 
PhysicalDevice(name='/physical_device:XLA_GPU:0', device_type='XLA_GPU'),  # XLA stands for Accelerated Linear Algebra
PhysicalDevice(name='/physical_device:XLA_GPU:1', device_type='XLA_GPU'),  # my understanding is that this GPU supports XLA; since that optimization is not used here, it can be ignored for now
PhysicalDevice(name='/physical_device:XLA_GPU:2', device_type='XLA_GPU'), 
PhysicalDevice(name='/physical_device:XLA_GPU:3', device_type='XLA_GPU')]
'''

The list of physical devices above contains four device types:

CPU
XLA_CPU
GPU
XLA_GPU

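The server actually has only one CPU and four GPUs, so what is XLA? XLA (Accelerated Linear Algebra) is a compiler that optimizes linear algebra computations. The XLA_CPU and XLA_GPU entries indicate that our devices can be used through XLA; they are not additional graphics cards. The console output contains messages such as: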

2020-05-09 13:57:39.978330: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5608d0ad8a20 executing computations on platform Host. Devices:

2020-05-09 13:57:40.636659: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5608d0b3cff0 executing computations on platform CUDA. Devices:

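Since there are four device types, we can list the physical devices of each type separately:

tf.config.experimental.list_physical_devices("CPU")     # returns one device
tf.config.experimental.list_physical_devices("XLA_CPU") # returns one device
tf.config.experimental.list_physical_devices("GPU")     # returns four devices
tf.config.experimental.list_physical_devices("XLA_GPU") # returns four devices

(2) Restricting which GPUs are visible to TensorFlow

# tf.config.set_visible_devices(devices, device_type=None); see section 2. Put this near the top of the program.
# For example, to use GPUs 0 and 1 from the full device list printed above:
all_devices = tf.config.experimental.list_physical_devices()
# Note that GPUs 0 and 1 sit at indices 2 and 3 of that list; this can differ between machines
tf.config.set_visible_devices(all_devices[2:4], 'GPU')  # check the indices against the devices actually returned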

(3) Manual device placement (same as before)

# Place the tensors on the CPU
with tf.device('/CPU:0'):
  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
c = tf.matmul(a, b)

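(4) Logging the device of each operation and tensor

tf.debugging.set_log_device_placement(True) # logs which device each operation and tensor is placed on; put this at the very beginning

(5) Checking the number of logical GPUs

A logical GPU is a GPU as the TensorFlow framework sees it. Without virtual GPUs, the logical GPUs are simply the visible GPUs. With virtual GPUs:

number of logical GPUs = number of visible GPUs that are not split + number of virtual GPUs

For example: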

# List all physical GPUs; there should be 4
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Strictly limit TensorFlow to the first GPU
    tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
    # Count the logical GPUs
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPU")
  except RuntimeError as e:
    # Visible devices must be set before the GPUs are initialized
    print(e)
'''
4 Physical GPUs, 1 Logical GPU
'''

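(6) Limiting GPU memory

# Option A: let memory usage grow on demand
tf.config.experimental.set_memory_growth(gpus[0], True)  
# Option B: cap the memory at a fixed amount (in MB); use one option or the other for a given GPU
tf.config.experimental.set_virtual_device_configuration(
        gpus[0],   # one of the visible GPUs
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)]  # uses the virtual GPU mechanism described later
)

(7) Automatic placement across all visible devices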

tf.config.set_soft_device_placement(True)

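(8) Virtual GPUs: simulating a multi-GPU environment on a single GPU

When the machine has only one GPU, it is sometimes convenient, for writing distributed multi-GPU code, to split that GPU into several virtual GPUs, as in the following code:

# Get all physical GPUs; assume there are 2 here
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    # Split the first GPU into two virtual GPUs
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
         tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    # Count the logical GPUs
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPU,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be configured before the GPUs are initialized
    print(e)
'''
2 Physical GPU, 3 Logical GPUs
'''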

Summary: make sure you understand how physical GPUs, visible GPUs, virtual GPUs and logical GPUs relate to one another.
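
The relationship can be checked with a little arithmetic. The function below is a hypothetical sketch, not a TensorFlow API: it takes, for each visible physical GPU, the number of virtual devices it was split into, and counts a GPU that was not split as one logical GPU.

```python
def count_logical_gpus(virtual_splits):
    """Count logical GPUs.

    virtual_splits: one entry per *visible* physical GPU, giving the number
    of virtual devices configured on it (0 or 1 means "not split").
    """
    return sum(max(1, n) for n in virtual_splits)


# 4 physical GPUs, only the first one visible, no virtual devices.
print(count_logical_gpus([1]))     # 1
# 2 visible GPUs, the first one split into two virtual GPUs.
print(count_logical_gpus([2, 1]))  # 3
```

Both results match the "Logical GPU" counts printed by the two TensorFlow examples above.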

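5. Common GPU utilities in PyTorch

torch.cuda.current_device()   # returns the index of the currently selected device

torch.cuda.device_count()  # returns the number of usable (visible) GPUs

torch.cuda.get_device_capability(device=None) # returns the compute capability of a device

torch.cuda.get_device_name(device=None) # returns the name of a device

torch.cuda.is_available()  # checks whether a GPU is available

torch.cuda.is_initialized() # checks whether PyTorch's CUDA state has been initialized

torch.cuda.set_device(device)  # discouraged; see the earlier section on specifying visible GPUs

The torch.cuda module has many other functions. I have not used or fully understood many of them and have not found good references, so I will leave them out for now and add them later as I come across them.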

6. Simple test code after installing the GPU build

6.1 TensorFlow

tf.__version__
tf.__xxx__
tf.version.xxxx
tf.test.is_built_with_cuda()
tf.test.is_gpu_available()
tf.test.gpu_device_name()
# plus the 1.x and 2.x methods for listing all physical devices

6.2 PyTorch

torch.__version__
torch.version.cuda        # e.g. 9.0
torch.cuda.is_available()
torch.cuda.get_device_name(0)
torch.cuda.get_device_properties(0)
torch.cuda.device_count()
torch.cuda.current_device()

torch.backends.cudnn.version()  # e.g. 7005

import torch
from torch.backends import cudnn
x = torch.Tensor([1.0])
xx = x.cuda()
print(xx)

# Check whether cuDNN can be used for this tensor
print(cudnn.is_acceptable(xx))
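
The individual checks above can be gathered into one small report. The following is a sketch rather than an official diagnostic: `installed_frameworks` and `gpu_report` are names I made up, the frameworks are imported only if they are installed, and the TensorFlow branch assumes a version that provides tf.config.experimental.list_physical_devices (1.14 or later).

```python
import importlib.util


def installed_frameworks(names=("tensorflow", "torch")):
    """Map each module name to True/False depending on whether it is installed."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


def gpu_report():
    """Collect version and GPU information for whichever frameworks are installed."""
    report = {}
    found = installed_frameworks()
    if found["torch"]:
        import torch
        report["torch"] = (torch.__version__,
                           torch.cuda.is_available(),
                           torch.cuda.device_count())
    if found["tensorflow"]:
        import tensorflow as tf
        gpus = tf.config.experimental.list_physical_devices("GPU")
        report["tensorflow"] = (tf.__version__, len(gpus))
    return report


print(gpu_report())
```

The GPU counts reported here are counts of visible GPUs, so they reflect any CUDA_VISIBLE_DEVICES setting in effect.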