PyTorch Series | How to Speed Up Your Model Training


spearhead_cai


Article link: https://blog.csdn.net/lc013/article/details/100033506


[Cover image. Source: Pixabay, by talha khalil]

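Article 67 of 2019, 91st article overall

This article is roughly 6,500 characters long; consider bookmarking it.

Original title | Speed Up your Algorithms Part 1 — PyTorch

Author | Puneet Grover

Translator | kbsc13 (author of the "算法猿的成长" WeChat official account)

Original article | https://towardsdatascience.com/speed-up-your-algorithms-part-1-pytorch-56d8a4ae7051

Notice | This translation is for learning and exchange. Reposting is welcome, but please credit the source and do not use it for commercial or illegal purposes.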

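Preface

This article covers how to use cuda and pycuda to check and initialize GPU devices and make your algorithms run faster.

PyTorch is the Python counterpart of Torch, a deep learning framework developed and open-sourced by Facebook's AI research group. It is very popular, especially among researchers, and within a few years it has been catching up with TensorFlow, mainly thanks to its simplicity and its dynamic computation graphs.

pycuda is a third-party Python library for working with Nvidia's CUDA parallel computing API.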

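Contents:

How to check whether cuda is available
How to get more information about your cuda devices
How to store Tensors and run models on the GPU
How to select and use GPUs when you have several
Data parallelism
Comparing data parallelism
torch.multiprocessing

The code for this article is in a Jupyter notebook, available at: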

https://nbviewer.jupyter.org/github/PuneetGrov3r/MediumPosts/blob/master/SpeedUpYourAlgorithms/1%29%20PyTorch.ipynb

1. How to check whether cuda is available

Checking whether cuda is available takes very little code:

import torch
torch.cuda.is_available()
# True
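
In practice the result of this check is usually used to pick a device once and reuse it everywhere. A minimal sketch, assuming you want to fall back to the CPU when no GPU is present (the variable names are illustrative and not from the original article):

```python
import torch

# Pick the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor directly on the chosen device.
x = torch.ones(3, device=device)
print(device.type, x.device.type)
```

Code written against `device` runs unchanged on machines with and without a GPU.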

2. How to get more information about your cuda devices

torch.cuda is enough for basic device information; for more detail, use pycuda.

The code is as follows:

import torch
import pycuda.driver as cuda
cuda.init()
## Get Id of default device
torch.cuda.current_device()
# 0
cuda.Device(0).name() # '0' is the id of your GPU
# Tesla K80

Alternatively:

torch.cuda.get_device_name(0) # Get name device with ID '0'
# 'Tesla K80'
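
If pycuda is not installed, the basic facts can also be collected with torch.cuda alone. The sketch below is one possible way to do it; the helper name `list_cuda_devices` is hypothetical, and it returns an empty list on a machine without a GPU:

```python
import torch

def list_cuda_devices():
    """Return a list of (id, name, total memory in GB) for every visible CUDA device."""
    devices = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        devices.append((i, props.name, props.total_memory / 1e9))
    return devices

for dev_id, name, mem_gb in list_cuda_devices():
    print("%d) %s, %.2f GB" % (dev_id, name, mem_gb))
```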

Here is a simple class for retrieving cuda information:

# A simple class to know about your cuda devices
import pycuda.driver as cuda
import pycuda.autoinit # Necessary for using its functions
cuda.init() # Necessary for using its functions

class aboutCudaDevices():
    def __init__(self):
        pass

    def num_devices(self):
        """Return the number of cuda devices."""
        return cuda.Device.count()

    def devices(self):
        """Print the names of all available devices."""
        num = cuda.Device.count()
        print("%d device(s) found:"%num)
        for i in range(num):
            print(cuda.Device(i).name(), "(Id: %d)"%i)

    def mem_info(self):
        """Print the free and total memory of the device in the current context."""
        available, total = cuda.mem_get_info()
        print("Available: %.2f GB\nTotal:     %.2f GB"%(available/1e9, total/1e9))

    def attributes(self, device_id=0):
        """Return the attributes of the device with the given id."""
        return cuda.Device(device_id).get_attributes()

    def __repr__(self):
        """Return the number of devices with each device's name, id and total memory."""
        num = cuda.Device.count()
        string = ""
        string += ("%d device(s) found:\n"%num)
        for i in range(num):
            string += ( "    %d) %s (Id: %d)\n"%((i+1),cuda.Device(i).name(),i))
            string += ("          Memory: %.2f GB\n"%(cuda.Device(i).total_memory()/1e9))
        return string

# You can print output just by typing its name (__repr__):
aboutCudaDevices()
# 1 device(s) found:
#    1) Tesla K80 (Id: 0)
#          Memory: 12.00 GB

To check current memory usage:

import torch
# Returns the current GPU memory usage by 
# tensors in bytes for a given device
torch.cuda.memory_allocated()
# Returns the current GPU memory managed by the
# caching allocator in bytes for a given device
# (memory_cached() is the older, deprecated name for this call)
torch.cuda.memory_reserved()

To clear the cuda cache:

# Releases all unoccupied cached memory currently held by
# the caching allocator so that those can be used in other
# GPU application and visible in nvidia-smi
torch.cuda.empty_cache()

Note, however, that this call does not release GPU memory occupied by tensors, so it does not increase the GPU memory available to PyTorch.
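
The effect can be observed by comparing the counters before and after a tensor is deleted. This is only a sketch: the helper `memory_report` is hypothetical, the exact byte counts depend on the allocator and the device, and on a machine without CUDA it simply reports zeros.

```python
import torch

def memory_report():
    """Return (allocated, reserved) bytes on the current device, or (0, 0) without CUDA."""
    if not torch.cuda.is_available():
        return 0, 0
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()

if torch.cuda.is_available():
    x = torch.empty(1024, 1024, device="cuda")  # about 4 MB of float32
    print("with tensor:      ", memory_report())
    torch.cuda.empty_cache()   # x is still alive, so its memory stays allocated
    print("after empty_cache:", memory_report())
    del x                      # drop the last reference to the tensor
    torch.cuda.empty_cache()   # now the cached block can be returned to the driver
    print("after del + empty:", memory_report())
else:
    print("no CUDA device:", memory_report())
```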

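3. How to store Tensors and run models on the GPU

To keep a variable on the cpu, write:

a = torch.DoubleTensor([1., 2.])

The variable a stays on the cpu, and operations on it run on the cpu. To move it to the gpu, use .cuda; there are two ways to do this:

# Option 1
a = torch.FloatTensor([1., 2.]).cuda()
# Option 2
a = torch.cuda.FloatTensor([1., 2.])

Both pick the default GPU, which is the first one. You can check which device is in use in two ways:

# Option 1
torch.cuda.current_device()
# 0

# Option 2
a.get_device()
# 0

You can also run a model on the GPU. As an example, define a simple model with nn.Sequential: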

import torch.nn as nn

sq = nn.Sequential(
         nn.Linear(20, 20),
         nn.ReLU(),
         nn.Linear(20, 4),
         nn.Softmax(dim=1)  # normalize over the class dimension
)

Then move it to the GPU:

model = sq.cuda()

How can you tell whether the model is on the GPU? Check whether its parameters are on the GPU:

# From the discussions here: https://discuss.pytorch.org/t/how-to-check-if-model-is-on-cuda/180

next(model.parameters()).is_cuda
# True
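
Putting the pieces of this section together, a model and its inputs have to live on the same device before a forward pass. The following is a small sketch of that pattern using `.to(device)` instead of `.cuda()`, so that it also runs on a CPU-only machine; the batch size of 8 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Sequential(
    nn.Linear(20, 20),
    nn.ReLU(),
    nn.Linear(20, 4),
    nn.Softmax(dim=1),
).to(device)  # move all parameters to the chosen device

# Inputs must be on the same device as the model's parameters.
batch = torch.randn(8, 20, device=device)
with torch.no_grad():
    probs = model(batch)

print(probs.shape, next(model.parameters()).device)
```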

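4. How to select and use GPUs when you have several

Suppose you have 3 GPUs. You can initialize tensors on, or assign them to, any specific GPU. There are 3 ways to place a tensor on a given GPU:

Pass the device argument when creating the tensor
.to(cuda_id)
.cuda(cuda_id)

cuda0 = torch.device('cuda:0')
cuda1 = torch.device('cuda:1')
cuda2 = torch.device('cuda:2')

# A plain .cuda() call places the tensor on the current default device (cuda:0 unless changed)
# The 3 ways:
# (torch.tensor, unlike the legacy torch.Tensor constructor, accepts a CUDA device argument)
x = torch.tensor([1., 2.], device=cuda1)
# Or
x = torch.Tensor([1., 2.]).to(cuda1)
# Or
x = torch.Tensor([1., 2.]).cuda(cuda1)

# Change the default device by passing the id of the device you want
torch.cuda.set_device(2) 
# The CUDA_VISIBLE_DEVICES environment variable controls how many and which GPUs are visible;
# set it before CUDA is initialized
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"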
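
Hard-coded ids such as cuda:1 fail on machines with fewer GPUs. One way to guard against that is a small helper that checks the device count first. The sketch below assumes a CPU fallback is acceptable, and the name `pick_device` is hypothetical:

```python
import torch

def pick_device(index):
    """Return cuda:<index> if that GPU is visible, otherwise the CPU."""
    if torch.cuda.is_available() and 0 <= index < torch.cuda.device_count():
        return torch.device("cuda", index)
    return torch.device("cpu")

dev = pick_device(1)  # second GPU if present, else CPU
x = torch.tensor([1., 2.], device=dev)
print(x.device)
```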

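With multiple GPUs you can split your application's work across them, at the cost of communication between devices. If the GPUs do not need to exchange information often, this overhead is negligible.

There is another issue. In PyTorch, all GPU operations are asynchronous by default. Copies between the CPU and a GPU, or between two GPUs, are synchronized for you, but when you create your own stream with torch.cuda.Stream() you have to take care of synchronization yourself.

Here is an example of incorrect usage from the official documentation:

cuda = torch.device('cuda')
# Create a new stream
s = torch.cuda.Stream()  
A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
with torch.cuda.stream(s):
    # because sum() may start execution before normal_() finishes!
    B = torch.sum(A)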
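
One way to repair the example above is to make the side stream wait for the work already queued on the current stream. The sketch below uses `Stream.wait_stream` for that. It is one possible fix rather than the only one (calling `torch.cuda.synchronize()` would also work, at a higher cost). The wrapper function `stream_sum` is hypothetical and falls back to the CPU when no GPU is present.

```python
import torch

def stream_sum():
    """Sum a random matrix on a side stream, with explicit synchronization."""
    if not torch.cuda.is_available():
        # No GPU: there are no streams to synchronize, so compute on the CPU.
        return torch.sum(torch.empty(100, 100).normal_(0.0, 1.0))

    cuda = torch.device("cuda")
    s = torch.cuda.Stream()
    A = torch.empty((100, 100), device=cuda).normal_(0.0, 1.0)
    # Make the side stream wait for work already queued on the current stream,
    # so sum() cannot start before normal_() has finished.
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        B = torch.sum(A)
    # Make the current stream wait for the side stream before B is used on it.
    torch.cuda.current_stream().wait_stream(s)
    return B

print(stream_sum().item())
```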

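To make full use of multiple GPUs, you can:

Use the GPUs for different tasks or applications;
With multiple models, give each GPU its own model and its own copy of the preprocessed data;
Give each GPU a slice of the input and a copy of the model; each GPU computes its result separately and sends it to a single GPU for further computation.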

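5. Data parallelism

Data parallelism splits the data into several parts and sends them to multiple GPUs to be processed in parallel.

In PyTorch, data parallelism is implemented with torch.nn.DataParallel.

Here is a simple example. The first approach uses several functions from nn.parallel, which do the following:

Replicate: copy the model onto multiple GPUs;
Scatter: split the input along its first dimension (usually the batch dimension) and send the pieces to multiple GPUs;
Gather: collect the results sent back from the GPUs and concatenate them;
parallel_apply: apply the scattered inputs from the Scatter step to the corresponding model replicas from the Replicate step.

The code is as follows: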

# Replicate module to devices in device_ids
replicas = nn.parallel.replicate(module, device_ids)
# Distribute input to devices in device_ids
inputs = nn.parallel.scatter(input, device_ids)
# Apply the models to corresponding inputs
outputs = nn.parallel.parallel_apply(replicas, inputs)
# Gather result from all devices to output_device
result = nn.parallel.gather(outputs, output_device)
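
The nn.parallel functions above need several GPUs to run. The same scatter, apply and gather idea can be illustrated on the CPU with `torch.chunk` and `torch.cat`. This is only an analogy, not how nn.parallel is implemented: a single model stands in for the replicas, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(20, 4)
batch = torch.randn(8, 20)

# "Scatter": split the batch along dimension 0 into two pieces.
pieces = torch.chunk(batch, 2, dim=0)
# "Apply": run the model on each piece (on real hardware, one replica per GPU).
outputs = [model(piece) for piece in pieces]
# "Gather": concatenate the partial results along dimension 0.
result = torch.cat(outputs, dim=0)

# Compare the reassembled output with a single forward pass over the whole batch.
print(torch.allclose(result, model(batch), atol=1e-5))
```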

There is also a simpler and more common approach, which wraps the model in a single line of code:

model = nn.DataParallel(model, device_ids=device_ids)
result = model(input)
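
A slightly fuller sketch of the same idea, written so that it also runs on machines with one GPU or none. The model and batch sizes are arbitrary, and wrapping only when more than one GPU is visible is a choice made for this example.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Sequential(nn.Linear(20, 20), nn.ReLU(), nn.Linear(20, 4))

if torch.cuda.device_count() > 1:
    # Replicate the model across all visible GPUs; inputs are split along dim 0.
    model = nn.DataParallel(model)
model = model.to(device)

batch = torch.randn(16, 20, device=device)
with torch.no_grad():
    out = model(batch)  # with DataParallel, outputs are gathered on the first device
print(out.shape)
```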

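6. Comparing data parallelism

The article https://medium.com/@iliakarmanov/multi-gpu-rosetta-stone-d4fa96162986 and the GitHub repository https://github.com/ilkarman/DeepLearningFrameworks compare the speed of different frameworks on a single GPU and on 4 GPUs:

[Figure: speed comparison of different frameworks on 1 GPU and on 4 GPUs]

According to that comparison, data parallelism gives a clear speedup despite the cost of communication between GPUs. PyTorch is second only to Chainer in speed, and its data parallelism takes just one line of code.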

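7. torch.multiprocessing

torch.multiprocessing is a wrapper around Python's multiprocessing module and is 100% compatible with it, so you can use Queue, Pipe, Array and the rest of the original module. To speed things up it also adds a method, share_memory_(), which moves data into a state where any process can use it directly without copying.

With it you can share Tensors and model parameters, on either the CPU or the GPU.

Here is an example of training a model with multiple processes: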

# Training a model using multiple processes:
# (data_loader, optimizer, loss_fn, n_in, n_h1 and n_out are placeholders defined elsewhere)
import torch.nn as nn
import torch.multiprocessing as mp
def train(model):
    for data, labels in data_loader:
        optimizer.zero_grad()
        loss_fn(model(data), labels).backward()
        optimizer.step()  # This will update the shared parameters
model = nn.Sequential(nn.Linear(n_in, n_h1),
                      nn.ReLU(),
                      nn.Linear(n_h1, n_out))
model.share_memory() # Required for 'fork' method to work
processes = []
for i in range(4): # No. of processes
    p = mp.Process(target=train, args=(model,))
    p.start()
    processes.append(p)
for p in processes: 
    p.join()
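
The sharing mechanism itself can be seen in a much smaller, self-contained sketch. Here a single child process updates a CPU tensor in place and the parent reads the same storage afterwards. The helper names `make_shared_tensor` and `add_one` are hypothetical, and only one worker is used so that no two processes write at the same time.

```python
import torch
import torch.multiprocessing as mp

def make_shared_tensor(n=3):
    """Create a zero tensor and move its storage into shared memory."""
    t = torch.zeros(n)
    t.share_memory_()
    return t

def add_one(shared):
    # In-place update: writes go to the shared storage, not to a private copy.
    shared.add_(1)

if __name__ == "__main__":
    counter = make_shared_tensor()
    p = mp.Process(target=add_one, args=(counter,))
    p.start()
    p.join()
    # The parent reads the same storage the child wrote to.
    print(counter.is_shared(), counter)
```

With several workers writing to the same tensor, updates are not atomic, so concurrent in-place writes need a lock or a training scheme that tolerates races.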

For more examples, see the official documentation:

https://pytorch.org/docs/stable/distributed.html

References: