[PyTorch Framework] Using the GPU

HUI 别摸鱼了

Last modified: 2024-07-13 18:40:24

First published: 2023-04-18 16:10:33

Article link: https://blog.csdn.net/weixin_47244593/article/details/130216098

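I. CPU and GPU

GPU in PyTorch:

CPU (Central Processing Unit): consists mainly of a control unit and an arithmetic unit;
GPU (Graphics Processing Unit): handles uniform, dependency-free, large-scale data computation.

Notes:
In PyTorch, the tensors taking part in an operation must live on the same device: either all on the CPU or all on the same GPU.
A GPU has far more ALUs (arithmetic logic units) than a CPU, whereas a CPU has more cache to speed up general program execution. The two suit different tasks: compute-intensive, easily parallelized programs are usually run on the GPU.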

II. Moving Data to the GPU

1. The to function: converting data type / device

tensor.to(*args, **kwargs)
Code:

# Tensor examples
x = torch.ones((3, 3))   # create a tensor
x = x.to(torch.float64)  # convert the default float32 to float64

x = torch.ones(3, 3)     # create a tensor
x = x.to("cuda")         # move it to the GPU

module.to(*args, **kwargs)
Code:

# Module examples
linear = nn.Linear(2, 2)   # create a module
linear.to(torch.double)    # convert all parameters of the module from the default float32 to float64 (double is float64)

gpu1 = torch.device("cuda")  # define the device
linear.to(gpu1)              # move the module to the GPU

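Note: Tensor.to is not in-place, while Module.to is (in-place means the object itself is modified rather than a new one being returned). A tensor must therefore be reassigned with =, whereas for a module simply calling to is enough.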
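
The difference between the two to calls can be checked without a GPU by converting the dtype instead of the device. The following is a minimal sketch with illustrative variable names, assuming the default dtype is float32. It also shows that Tensor.to returns the original tensor when it already has the requested dtype and device, so a new object only appears when a conversion actually takes place.

```python
import torch
import torch.nn as nn

# Tensor.to returns a new tensor; the original keeps its dtype.
x = torch.ones(3, 3)
y = x.to(torch.float64)
print(x.dtype, y.dtype)

# Module.to converts the parameters in place and returns the module itself.
linear = nn.Linear(2, 2)
returned = linear.to(torch.float64)
print(returned is linear, linear.weight.dtype)

# When nothing needs converting, Tensor.to hands back the same object.
z = x.to(torch.float32)
print(z is x)
```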

Code
Running cuda_methods.py:

# -*- coding: utf-8 -*-
"""
# @file name  : cuda_methods.py
# @author     : TingsongYu https://github.com/TingsongYu
# @date       : 2019-11-11
# @brief      : methods for moving data to CUDA
"""
import torch
import torch.nn as nn
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# ========================== tensor to cuda
# flag = 0
flag = 1
if flag:
    x_cpu = torch.ones((3, 3))
    print("x_cpu:\ndevice: {} is_cuda: {} id: {}".format(x_cpu.device, x_cpu.is_cuda, id(x_cpu)))

    x_gpu = x_cpu.to(device)
    print("x_gpu:\ndevice: {} is_cuda: {} id: {}".format(x_gpu.device, x_gpu.is_cuda, id(x_gpu)))

# Older style; .cuda() still works, but .to(device) is preferred because it is device-agnostic
# x_gpu = x_cpu.cuda()

# ========================== module to cuda
# flag = 0
flag = 1
if flag:
    net = nn.Sequential(nn.Linear(3, 3))

    print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))

    net.to(device)
    print("\nid:{} is_cuda: {}".format(id(net), next(net.parameters()).is_cuda))


# ========================== forward in cuda
# flag = 0
flag = 1
if flag:
    output = net(x_gpu)  # input on the GPU, model on the GPU, so the output is on the GPU
    print("output is_cuda: {}".format(output.is_cuda))

    # output = net(x_cpu)  # input on the CPU, model on the GPU: raises an error


# ========================== check the current GPU index, try changing the visible GPUs and the primary GPU
flag = 0
# flag = 1
if flag:
    current_device = torch.cuda.current_device()
    print("current_device: ", current_device)

    torch.cuda.set_device(0)
    current_device = torch.cuda.current_device()
    print("current_device: ", current_device)


    # compute capability (major, minor) of the current device
    cap = torch.cuda.get_device_capability(device=None)
    print(cap)
    # name of the current device
    name = torch.cuda.get_device_name()
    print(name)

    is_available = torch.cuda.is_available()
    print(is_available)



    # ===================== seed
    seed = 2
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    current_seed = torch.cuda.initial_seed()
    print(current_seed)


    s = torch.cuda.seed()
    s_all = torch.cuda.seed_all()

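Running results:

Result of tensor to cuda:

Note: the two ids differ, so to did not operate in place.
Result of module to cuda:

Note: the two ids are the same, so to operated in place.
Result of forward in cuda:

Note: the data and the model must be on the same device.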

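2. Common torch.cuda methods

torch.cuda.device_count(): returns the number of visible, usable GPUs;
torch.cuda.get_device_name(): returns the name of a GPU;
torch.cuda.manual_seed(): sets the random seed for the current GPU;
torch.cuda.manual_seed_all(): sets the random seed for all visible, usable GPUs (recommended);
torch.cuda.set_device(): selects which GPU is the primary (current) GPU (not recommended);
Recommended instead: os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3"). Set it before the first CUDA call, and note that setdefault leaves the variable untouched if it already has a value.

Note: physical GPUs are the cards installed in the server, and their indices never change: 0, 1, 2, 3. Logical GPUs are the ones visible to PyTorch.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "2,3") makes 2 GPUs visible: physical GPUs 2 and 3.

os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,3,2") makes 3 GPUs visible: physical GPUs 0, 3, and 2.

Note:
In this case logical gpu0 is physical gpu0, logical gpu1 is physical gpu3, and logical gpu2 is physical gpu2.
Remark: by default gpu0 is the primary GPU. Why have the notion of a primary GPU at all? Because of the scatter-and-gather mechanism used in multi-GPU computation.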
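
The logical-to-physical mapping described above can be written out explicitly. The helper below is a hypothetical illustration, not part of PyTorch or CUDA, and it only handles a comma-separated list of plain integer indices (the real variable also accepts other forms, which this sketch ignores).

```python
def logical_to_physical(visible_devices):
    """Map logical GPU index -> physical GPU index.

    Simplified sketch: assumes a comma-separated list of integer indices,
    as in the CUDA_VISIBLE_DEVICES examples in the text.
    """
    physical_ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    return dict(enumerate(physical_ids))


print(logical_to_physical("2,3"))    # {0: 2, 1: 3}
print(logical_to_physical("0,3,2"))  # {0: 0, 1: 3, 2: 2}
```

Inside the process, "cuda:0" then always refers to the first entry of the list, whichever physical card that is.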
Code example:

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch


# ========================== select the gpu
# flag = 0
flag = 1
if flag:
    gpu_id = 0
    gpu_str = "cuda:{}".format(gpu_id)
    device = torch.device(gpu_str if torch.cuda.is_available() else "cpu")

    x_cpu = torch.ones((3, 3))
    x_gpu = x_cpu.to(device)

    print("x_gpu:\ndevice: {} is_cuda: {} id: {}".format(x_gpu.device, x_gpu.is_cuda, id(x_gpu)))


# ========================== query the gpu count / name
# flag = 0
flag = 1
if flag:
    device_count = torch.cuda.device_count()
    print("\ndevice_count: {}".format(device_count))

    device_name = torch.cuda.get_device_name(0)
    print("\ndevice_name: {}".format(device_name))

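III. Multi-GPU Parallel Computation

1. Motivation

Doing one homework assignment takes 60 minutes, so doing four takes 4*60 = 240 minutes. If four people each do one, handing out the assignments takes 3 minutes, the work takes 60 minutes, and collecting the results takes 3 minutes: 66 minutes in total, a large reduction in elapsed time. The collected results are stored on the primary GPU, i.e. gpu0.
Note: the scatter-and-gather mechanism of multi-GPU computation: scatter → parallel computation → gather results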
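
The arithmetic of the homework analogy can be reproduced in a few lines. This is only a sketch of the analogy with illustrative function names; it assumes one job per worker and fixed scatter and gather costs, which real multi-GPU runs do not guarantee.

```python
def serial_minutes(n_jobs, minutes_per_job):
    # One worker does every job back to back.
    return n_jobs * minutes_per_job


def parallel_minutes(minutes_per_job, scatter_minutes, gather_minutes):
    # One job per worker: scatter, work in parallel, then gather.
    return scatter_minutes + minutes_per_job + gather_minutes


serial = serial_minutes(4, 60)
parallel = parallel_minutes(60, 3, 3)
speedup = serial / parallel
print(serial, parallel, round(speedup, 2))  # 240 66 3.64
```

The speedup stays below 4 because the scatter and gather steps are overhead that the single-worker case does not pay.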

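2. Multi-GPU computation in PyTorch

torch.nn.DataParallel:
√ Purpose: wraps a model to implement the scatter-and-gather mechanism;
√ Main parameters:
module: the model to wrap and distribute;
device_ids: the GPUs to distribute to; defaults to all visible, usable GPUs;
output_device: the device on which the results are gathered (defaults to device_ids[0]).
Code example: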

# -*- coding: utf-8 -*-

import os
import numpy as np
import torch
import torch.nn as nn

# ============================ select gpus manually
# flag = 0
flag = 1
if flag:
    # gpu_list = [2, 3]  # pointless on a machine with a single GPU: physical GPUs 2 and 3 do not exist, so device_count would be 0
    gpu_list = [0]  # so select GPU 0, which makes device_count equal 1
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


# ============================ choose the primary gpu automatically by free memory
# flag = 0
flag = 1
if flag:
    def get_gpu_memory():
        import platform
        if 'Windows' != platform.system():
            import os
            os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
            with open('tmp.txt', 'r') as f:
                memory_gpu = [int(x.split()[2]) for x in f.readlines()]
            os.system('rm tmp.txt')
        else:
            memory_gpu = False
            print("Free-memory query is not yet supported on Windows")
        return memory_gpu


    gpu_memory = get_gpu_memory()  # free memory of every GPU
    if gpu_memory:  # skip when the query is unsupported or found no GPU
        print("\ngpu free memory: {}".format(gpu_memory))
        gpu_list = np.argsort(gpu_memory)[::-1]  # physical indices, most free memory first

        gpu_list_str = ','.join(map(str, gpu_list))
        # The GPU with the most free memory becomes the primary GPU. setdefault keeps a value that is
        # already set, so this only takes effect when the manual selection above is disabled (flag = 0).
        os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class FooNet(nn.Module):
    def __init__(self, neural_num, layers=3):
        super(FooNet, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])

    def forward(self, x):

        print("\nbatch size in forward: {}".format(x.size()[0]))  # observe the batch size seen by each forward call
        # Note: this is the batch size after scattering, i.e. the original batch size divided by the number of GPUs.
        # With batch_size = 16 it is 16 when device_count = 1 and 8 when device_count = 2.
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            x = torch.relu(x)
        return x


if __name__ == "__main__":

    batch_size = 16

    # data
    inputs = torch.randn(batch_size, 3)
    labels = torch.randn(batch_size, 3)

    inputs, labels = inputs.to(device), labels.to(device)  # move inputs and labels to device, which prefers the GPU as set up above

    # model
    net = FooNet(neural_num=3, layers=3)
    net = nn.DataParallel(net)  # wrap the model so that each batch is scattered across the GPUs and processed in parallel
    net.to(device)

    # training
    for epoch in range(1):

        outputs = net(inputs)

        print("model outputs.size: {}".format(outputs.size()))

    print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))
    print("device_count :{}".format(torch.cuda.device_count()))

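(1) Machine with only 1 GPU:
1) Result with gpu_list=[2,3]:
Note: with a single GPU this setting is useless: physical GPUs 2 and 3 do not exist, so PyTorch reports a device_count of 0 at run time and no GPU is usable.
2) Result with gpu_list=[0]:

(2) 2 GPUs available:

(3) 4 GPUs available:

Note: GPU 0 is the primary GPU.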
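
The batch size printed inside forward in each of these cases follows from how the batch is divided. The sketch below is a hypothetical helper that assumes the batch is split along dimension 0 the way torch.chunk splits a tensor (equal chunks of ceil(batch / n), with a smaller last chunk if needed); it is not DataParallel's own code.

```python
import math


def chunk_sizes(batch_size, num_gpus):
    """Per-GPU batch sizes, assuming torch.chunk-style splitting along dim 0."""
    size = math.ceil(batch_size / num_gpus)
    full, rest = divmod(batch_size, size)
    return [size] * full + ([rest] if rest else [])


for n in (1, 2, 3, 4):
    print(n, chunk_sizes(16, n))
```

For batch_size = 16 this gives 16 on one GPU, 8 each on two, and 4 each on four, matching the comment in FooNet.forward; with three GPUs the last one would receive a smaller chunk.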

Querying the free GPU memory:

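# Query the free memory of each GPU
def get_gpu_memory():
    import os
    # -q queries, -d selects which section to display, and grep filters lines:
    # -A4 GPU keeps each line matching "GPU" plus the four lines after it (5 in total),
    # and Free keeps the free-memory lines. The output is redirected to the temporary file tmp.txt.
    os.system('nvidia-smi -q -d Memory | grep -A4 GPU | grep Free > tmp.txt')
    # Read tmp.txt line by line, split each line, and convert the third field to int to get a list
    memory_gpu = [int(x.split()[2]) for x in open('tmp.txt', 'r').readlines()]
    # Delete the temporary file
    os.system('rm tmp.txt')
    # Return the list
    return memory_gpu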

Usage example:

# example:
gpu_memory = get_gpu_memory()
# Sort the GPU indices by free memory, largest first
gpu_list = np.argsort(gpu_memory)[::-1]
# Convert to a comma-separated string
gpu_list_str = ','.join(map(str, gpu_list))
# Set CUDA_VISIBLE_DEVICES
os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)

print("\ngpu free memory: {}".format(gpu_memory))
print("CUDA_VISIBLE_DEVICES :{}".format(os.environ["CUDA_VISIBLE_DEVICES"]))

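Example result:
Note: the physical GPUs are numbered 0, 1, 2, 3, while np.argsort(gpu_memory)[::-1] lists those indices from most to least free memory, which in this run gave the order 0, 1, 3, 2. Setting CUDA_VISIBLE_DEVICES to that order makes the GPU with the most free memory logical GPU 0, i.e. the primary GPU, which is the whole purpose of the sort.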
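
The same ordering can be obtained without a temporary file or grep. The following is a sketch under stated assumptions: it relies on nvidia-smi's --query-gpu=memory.free and --format=csv,noheader,nounits options, assumes every line of that output is a plain integer in MiB, and uses hypothetical helper names and made-up sample readings.

```python
import shutil
import subprocess


def parse_free_memory(text):
    """Parse one integer (MiB) per non-empty line."""
    return [int(line) for line in text.splitlines() if line.strip()]


def query_free_memory():
    """Return free memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return parse_free_memory(result.stdout)


def order_by_free_memory(free_memory):
    """Physical GPU indices, most free memory first."""
    return sorted(range(len(free_memory)), key=lambda i: free_memory[i], reverse=True)


# Hypothetical readings for four GPUs (not measured values)
sample = parse_free_memory("8000\n6000\n1000\n3000\n")
order = order_by_free_memory(sample)
visible = ",".join(map(str, order))
print(visible)  # 0,1,3,2
```

On a real machine, query_free_memory() would replace the sample string, and the resulting value would be assigned to CUDA_VISIBLE_DEVICES before the first CUDA call.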

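3. Common problems when loading onto the GPU

Error 1: loading a checkpoint onto the GPU on a machine that has no GPU raises an error.

Fix: when no GPU is available, load onto the CPU instead.

Code for loading onto the CPU: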

# =================================== load onto the cpu
# flag = 0
flag = 1
if flag:
    gpu_list = [0]
    gpu_list_str = ','.join(map(str, gpu_list))
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", gpu_list_str)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    net = FooNet(neural_num=3, layers=3)
    net.to(device)

    # save
    net_state_dict = net.state_dict()
    path_state_dict = "./model_in_gpu_0.pkl"
    torch.save(net_state_dict, path_state_dict)

    # load
    # state_dict_load = torch.load(path_state_dict)
    state_dict_load = torch.load(path_state_dict, map_location="cpu")  # load onto the cpu
    print("state_dict_load:\n{}".format(state_dict_load))

Result:
Code for loading onto the GPU: with a GPU present, load directly

    # load to GPU
    state_dict_load = torch.load(path_state_dict)
    # state_dict_load = torch.load(path_state_dict, map_location="cpu")
    print("state_dict_load:\n{}".format(state_dict_load))

Result:
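
Both cases can be covered by one code path if map_location is set to whichever device is actually available. This is a minimal sketch; the file name and the small nn.Linear model are placeholders rather than the FooNet checkpoint used above.

```python
import os
import tempfile

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

net = nn.Linear(3, 3).to(device)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_state_dict.pkl")  # hypothetical file name
    torch.save(net.state_dict(), path)
    # map_location remaps every saved tensor onto `device` while loading
    state_dict = torch.load(path, map_location=device)

restored = nn.Linear(3, 3)
restored.load_state_dict(state_dict)
restored.to(device)
print({name: tensor.device.type for name, tensor in state_dict.items()})
```

With this pattern the same script runs on a machine with or without a GPU, so Error 1 does not arise.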

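Error 2: when training on multiple GPUs, the model is wrapped in DataParallel, which adds a module. prefix to the name of every layer. Loading such a state_dict into an unwrapped model then fails because the keys do not match.
Cause: the keys saved from the DataParallel-wrapped model carry the module. prefix, while the keys of the plain net do not, so the state_dict cannot be loaded into net directly.

Fix:
The loaded state_dict is an OrderedDict whose keys cannot be renamed in place, so build a new OrderedDict with the prefix stripped from each key.
Code: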

# =================================== loading a multi-gpu checkpoint
# flag = 0
flag = 1
if flag:

    net = FooNet(neural_num=3, layers=3)

    path_state_dict = "./model_in_multi_gpu.pkl"
    state_dict_load = torch.load(path_state_dict, map_location="cpu")
    print("state_dict_load:\n{}".format(state_dict_load))

    # net.load_state_dict(state_dict_load)  # loading directly raises Error 2

    # strip the 7-character prefix "module."
    from collections import OrderedDict
    # build an empty new_state_dict
    new_state_dict = OrderedDict()
    # iterate over the parameter names
    for k, v in state_dict_load.items():
        # if the key starts with "module.", drop those first 7 characters to get the new key
        namekey = k[7:] if k.startswith('module.') else k
        new_state_dict[namekey] = v
    print("new_state_dict:\n{}".format(new_state_dict))

    net.load_state_dict(new_state_dict)
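
The mismatch can also be avoided at save time: a DataParallel wrapper keeps the original model in its module attribute, and that model's state_dict has no prefix. The sketch below uses a small nn.Sequential as a stand-in for FooNet and only inspects the keys; it is one possible approach, not the one used in the code above.

```python
import torch.nn as nn

net = nn.Sequential(nn.Linear(3, 3, bias=False))
wrapped = nn.DataParallel(net)

# Keys seen through the wrapper carry the "module." prefix.
prefixed_keys = list(wrapped.state_dict().keys())
# Keys of the underlying model do not.
plain_state_dict = wrapped.module.state_dict()
plain_keys = list(plain_state_dict.keys())
print(prefixed_keys)  # ['module.0.weight']
print(plain_keys)     # ['0.weight']

# A fresh, unwrapped model accepts the prefix-free state_dict directly.
fresh = nn.Sequential(nn.Linear(3, 3, bias=False))
fresh.load_state_dict(plain_state_dict)
```

Saving wrapped.module.state_dict() instead of wrapped.state_dict() therefore produces a checkpoint that loads into a plain model without any key rewriting.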