Unveiling performance insights with PyTorch Profiler on an AMD GPU (ROCm Blogs)

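May 29, 2024, by Phillip Dang.
In machine learning, optimizing performance is often as important as refining the model architecture. In this blog, we take a close look at PyTorch Profiler, a handy tool for examining what happens inside a PyTorch model and for exposing bottlenecks and inefficiencies. We cover the basics of how PyTorch Profiler works and how to use it on an AMD GPU with ROCm to make models more efficient.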

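What is PyTorch Profiler?

PyTorch Profiler is a performance analysis tool that lets developers examine many aspects of PyTorch model training and inference. It collects and analyzes detailed profiling information, including GPU/CPU utilization, memory usage, and the execution time of individual operations in the model. With this information, developers can understand the runtime behavior of their models and find opportunities for optimization.
Using PyTorch Profiler takes only a few steps:
1. Annotate the code: To start profiling PyTorch code, annotate it with profiling annotations. These specify the regions or operations to profile. PyTorch Profiler provides context managers and decorators for this purpose.
2. Configure the profiler: Set up the profiler to match your needs. You can specify parameters such as the level of detail, the profiling mode (for example, CPU or GPU), and the output format.
3. Run the profiler: Once the code is annotated and the profiler is configured, run your PyTorch code as usual. The profiler collects performance data during execution.
4. Analyze the results: After execution, inspect the results with the visualization tools that PyTorch Profiler provides. Explore timelines, flame graphs, and memory usage graphs to identify performance bottlenecks and optimization opportunities.
5. Iterate and optimize: Use the insights from profiling to improve the code step by step. Make targeted optimizations based on the profiling data, then rerun the profiler to measure the effect of your changes.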
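
The steps above can be sketched in a few lines. This is a minimal illustration, not the setup used later in this post: it profiles a single matrix multiplication on the CPU only, and the helper `profile_matmul` and the region label `matmul_region` are names made up for the example.

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def profile_matmul(size=64):
    """Profile one matrix multiplication and return the summary table as text."""
    x = torch.randn(size, size)
    # Steps 1 and 2: annotate a region and configure the profiler (CPU only here).
    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("matmul_region"):  # hypothetical label
            # Step 3: run the code as usual.
            torch.matmul(x, x)
    # Step 4: summarize the results, sorted by total CPU time.
    return prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)

report = profile_matmul()
print(report)
```

Adding `ProfilerActivity.CUDA` to `activities` also records GPU kernels, which is what the training example in this post does.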


Prerequisites

- Linux operating system

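For a list of supported GPUs and operating systems, see this page. For convenience and stability, we recommend pulling and running the rocm/pytorch Docker image directly on your Linux system with the following command: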

docker run -it --ipc=host --network=host --device=/dev/kfd --device=/dev/dri \
           --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
           --name=olmo rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1 /bin/bash

To check your hardware and make sure the system recognizes your GPU, run:

! rocm-smi --showproductname


================= ROCm System Management Interface ================
========================= Product Info ============================
GPU[0] : Card series: Instinct MI210
GPU[0] : Card model: 0x0c34
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: D67301
===================== End of ROCm SMI Log =========================

Next, make sure PyTorch detects your GPU:

import torch
print(f"number of GPUs: {torch.cuda.device_count()}")
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])


number of GPUs: 1
['AMD Radeon Graphics']
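
PyTorch on ROCm exposes AMD GPUs through the same `torch.cuda` interface used above. As a small, hedged sketch, a script can pick the device at run time and fall back to the CPU when no GPU is visible. The helper name `pick_device` is ours and not part of PyTorch.

```python
import torch

def pick_device():
    """Return 'cuda' when a GPU is visible to PyTorch, otherwise 'cpu'."""
    return "cuda" if torch.cuda.is_available() else "cpu"

device = pick_device()
# torch.version.hip is a version string on ROCm builds and None otherwise.
print(f"device: {device}, HIP version: {torch.version.hip}")
```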



import torch
import torch.nn as nn
import torchvision
from torchvision import transforms
from torch.profiler import profile, record_function, ProfilerActivity



class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, kernel_size=2, stride=2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, kernel_size=2, stride=2)
        x = x.view(-1, 32 * 8 * 8)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
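
The input size of `fc1` (`32 * 8 * 8`) follows from the layer shapes. The sketch below walks through that arithmetic with the standard output-size formula for convolution and pooling layers, assuming 32x32 inputs as produced by the resize transform. The helper `conv_out` is written for this illustration only.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                   # CIFAR-10 images are 32x32
size = conv_out(size, kernel=3, padding=1)  # conv1 keeps 32
size = conv_out(size, kernel=2, stride=2)   # first max pool halves it to 16
size = conv_out(size, kernel=3, padding=1)  # conv2 keeps 16
size = conv_out(size, kernel=2, stride=2)   # second max pool halves it to 8
flat_features = 32 * size * size            # conv2 has 32 output channels
print(size, flat_features)
```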



# Load CIFAR-10 dataset
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),  # convert PIL images to tensors before normalizing
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)



# Function to train the model
def train(model, trainloader, criterion, optimizer, device, epochs=1):
    for epoch in range(epochs):
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data
            inputs = inputs.to(device)
            labels = labels.to(device)
            # reset gradients accumulated from the previous batch
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            # backpropagate and update the weights
            loss.backward()
            optimizer.step()
            # exit after 200 batches
            if i == 200:
                break

# utility function for running the profiler
def run_profiler(trainloader, model, profile_memory=False):
    device = 'cuda'
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    activities = [ProfilerActivity.CPU, ProfilerActivity.CUDA]
    with profile(activities=activities, record_shapes=True, profile_memory=profile_memory) as prof:
        with record_function("training"):
            train(model, trainloader, criterion, optimizer, device, epochs=1)

    if not profile_memory:
        # time profiling: rank operators by total CUDA time
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    else:
        # memory profiling: rank operators by self CPU memory usage
        print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))



With the training loop and the profiling utility function in place, we can use PyTorch Profiler to analyze execution time and memory consumption.
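
Besides printing a summary table, a collected profile can be written out as a trace file for a timeline viewer. The following is a minimal sketch on a tiny CPU workload, not the training run from this post. The file name and the temporary directory are arbitrary choices.

```python
import os
import tempfile

import torch
from torch.profiler import profile, ProfilerActivity

# Profile a tiny CPU workload; in practice this would wrap the training call.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    torch.relu(torch.randn(128, 128))

# Write the trace to a temporary location (the file name is arbitrary).
trace_path = os.path.join(tempfile.mkdtemp(), "trace.json")
prof.export_chrome_trace(trace_path)
print(f"trace written to {trace_path} ({os.path.getsize(trace_path)} bytes)")
```

The resulting JSON file can be opened in a Chrome-trace-compatible viewer such as `chrome://tracing` to browse the timeline.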



model = SimpleCNN()
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=4)
run_profiler(trainloader, model)


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               training        23.76%     360.249ms        71.31%        1.081s        1.081s       0.000us         0.00%      68.837ms      68.837ms             1  
autograd::engine::evaluate_function: ConvolutionBack...         0.15%       2.271ms         3.63%      55.037ms     136.908us       0.000us         0.00%      34.770ms      86.493us           402  
                             aten::convolution_backward         2.34%      35.480ms         3.34%      50.615ms     125.908us      18.366ms        16.60%      34.770ms      86.493us           402  
                                   ConvolutionBackward0         0.14%       2.151ms         3.46%      52.431ms     130.425us       0.000us         0.00%      34.486ms      85.786us           402  
    autograd::engine::evaluate_function: AddmmBackward0         0.33%       4.960ms         7.98%     120.946ms     300.861us       0.000us         0.00%      16.764ms      41.701us           402  
                                            aten::copy_         0.44%       6.674ms         2.08%      31.585ms      77.037us      15.762ms        14.25%      16.408ms      40.020us           410  
                                         aten::_to_copy         0.14%       2.079ms         2.31%      34.972ms      86.995us       0.000us         0.00%      16.306ms      40.562us           402  
                                              aten::sum         0.78%      11.818ms         0.93%      14.160ms      17.612us      14.723ms        13.31%      16.162ms      20.102us           804  
                                               aten::to         0.13%       2.031ms         2.36%      35.852ms      35.674us       0.000us         0.00%      15.783ms      15.704us          1005  
                                       CopyHostToDevice         0.00%       0.000us         0.00%       0.000us       0.000us      15.739ms        14.23%      15.739ms      39.152us           402  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.516s
Self CUDA time total: 110.639ms

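Note the difference between self cpu time and cpu time. According to the PyTorch Profiler tutorial (PyTorch Tutorials 2.4.0+cu121 documentation), "operators can call other operators, self cpu time excludes time spent in children operator calls, while total cpu time includes it. You can choose to sort by the self cpu time by passing sort_by="self_cpu_time_total" into the table call."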
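
To make the distinction concrete, here is a toy sketch. The call tree and its timings are hypothetical and do not come from the tables in this post. Each operator's total time is its self time plus the total time of every operator it calls.

```python
# Hypothetical call tree: each operator records the time spent in its own body
# (self time, in microseconds) and the child operators it calls.
call_tree = {
    "name": "linear",
    "self_us": 10,
    "children": [
        {"name": "transpose", "self_us": 5, "children": []},
        {"name": "addmm", "self_us": 50, "children": []},
    ],
}

def total_time(op):
    """Total time = the operator's self time plus the total time of its children."""
    return op["self_us"] + sum(total_time(child) for child in op["children"])

print(total_time(call_tree))                 # 65: 10 self + 5 + 50 in children
print(total_time(call_tree["children"][1]))  # 50: a leaf has self time == total time
```

This is why a wrapper region such as `training` can show a large CPU total while most of that time belongs to the operators it calls.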

Next, we simplify the convolutional neural network (CNN) to a single linear layer and run the profiler again. We expect the total CUDA time to drop significantly.

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(3 * 32 * 32, 10)

    def forward(self, x):
        x = x.view(-1, 3 * 32 * 32)
        x = self.fc1(x)
        return x

model = SimpleNet()
run_profiler(trainloader, model)


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               training        23.91%     192.128ms        84.59%     679.785ms     679.785ms       0.000us         0.00%      39.361ms      39.361ms             1  
                                           aten::linear         0.10%     768.000us         1.57%      12.605ms      62.711us       0.000us         0.00%      16.955ms      84.353us           201  
                                            aten::addmm         0.99%       7.943ms         1.28%      10.247ms      50.980us      16.955ms        37.52%      16.955ms      84.353us           201  
Cijk_Alik_Bljk_SB_MT64x64x32_MI32x32x2x1_SE_1LDSB0_A...         0.00%       0.000us         0.00%       0.000us       0.000us      15.556ms        34.42%      15.556ms      77.393us           201  
                                            aten::copy_         0.25%       2.028ms         3.07%      24.636ms      60.980us      14.614ms        32.34%      14.614ms      36.173us           404  
                                       CopyHostToDevice         0.00%       0.000us         0.00%       0.000us       0.000us      14.608ms        32.32%      14.608ms      36.338us           402  
                                         aten::_to_copy         0.27%       2.130ms         3.50%      28.122ms      69.955us       0.000us         0.00%      14.554ms      36.204us           402  
                                               aten::to         0.31%       2.460ms         3.61%      28.972ms      28.771us       0.000us         0.00%      13.586ms      13.492us          1007  
                                Optimizer.step#SGD.step         2.09%      16.809ms         2.94%      23.664ms     117.731us       0.000us         0.00%       5.557ms      27.647us           201  
    autograd::engine::evaluate_function: AddmmBackward0         0.28%       2.236ms         1.64%      13.185ms      65.597us       0.000us         0.00%       3.691ms      18.363us           201  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 803.604ms
Self CUDA time total: 45.193ms

As expected, the total self CUDA time dropped significantly (from 110.639 ms to 45.193 ms).
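
As a quick check of the size of that change, the two "Self CUDA time total" values from the tables above can be compared directly. The helper `relative_reduction` is our own name for this small calculation.

```python
def relative_reduction(before_ms, after_ms):
    """Fraction by which a measurement dropped relative to its starting value."""
    return (before_ms - after_ms) / before_ms

cnn_cuda_ms = 110.639  # Self CUDA time total, SimpleCNN run
net_cuda_ms = 45.193   # Self CUDA time total, SimpleNet run

reduction = relative_reduction(cnn_cuda_ms, net_cuda_ms)
ratio = cnn_cuda_ms / net_cuda_ms
print(f"reduction: {reduction:.1%}")
print(f"ratio: {ratio:.2f}x")
```

For these two runs, this works out to a reduction of roughly 59%.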



trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=4)
model = SimpleCNN()
run_profiler(trainloader, model, profile_memory=True)


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
enumerate(DataLoader)#_MultiProcessingDataLoaderIter...        22.44%     224.849ms        22.74%     227.911ms       1.134ms       0.000us         0.00%       0.000us       0.000us      75.42 Mb      75.42 Mb           0 b           0 b           201  
                                            aten::empty         0.22%       2.204ms         0.22%       2.204ms       2.731us       0.000us         0.00%       0.000us       0.000us     390.64 Kb     390.64 Kb       3.79 Mb       3.79 Mb           807  
                                    aten::scalar_tensor         0.00%       9.000us         0.00%       9.000us       9.000us       0.000us         0.00%       0.000us       0.000us           8 b           8 b           0 b           0 b             1  
                                          aten::random_         0.00%      25.000us         0.00%      25.000us      12.500us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                                             aten::item         0.00%       9.000us         0.00%      13.000us       6.500us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                              aten::_local_scalar_dense         0.00%       4.000us         0.00%       4.000us       2.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                                          aten::resize_         0.00%       6.000us         0.00%       6.000us       0.002us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b          2615  
                                     aten::resolve_conj         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
                                      aten::resolve_neg         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
                                               aten::to         0.22%       2.206ms         3.73%      37.335ms      37.149us       0.000us         0.00%      14.821ms      14.747us           0 b           0 b      75.47 Mb       2.63 Mb          1005  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.002s
Self CUDA time total: 109.871ms

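If we are not satisfied with the data loader's memory consumption, we can try several strategies to relieve the memory bottleneck, such as reducing the batch size, simplifying the model architecture, or using mixed-precision training. Let's reduce the batch size from 32 to 4 and run the profiler again: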

trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=4)
model = SimpleCNN()
run_profiler(trainloader, model, profile_memory=True)


-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
enumerate(DataLoader)#_MultiProcessingDataLoaderIter...        13.45%     127.135ms        13.74%     129.910ms     646.318us       0.000us         0.00%       0.000us       0.000us       9.43 Mb       9.43 Mb           0 b           0 b           201  
                                            aten::empty         0.23%       2.193ms         0.23%       2.193ms       2.717us       0.000us         0.00%       0.000us       0.000us     390.64 Kb     390.64 Kb       3.87 Mb       3.87 Mb           807  
                                    aten::scalar_tensor         0.00%       9.000us         0.00%       9.000us       9.000us       0.000us         0.00%       0.000us       0.000us           8 b           8 b           0 b           0 b             1  
                                          aten::random_         0.00%      22.000us         0.00%      22.000us      11.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                                             aten::item         0.00%       6.000us         0.00%      10.000us       5.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                              aten::_local_scalar_dense         0.00%       4.000us         0.00%       4.000us       2.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                                          aten::resize_         0.00%       7.000us         0.00%       7.000us       0.003us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b          2615  
                                     aten::resolve_conj         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
                                      aten::resolve_neg         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
                                               aten::to         0.21%       2.013ms         2.86%      27.042ms      26.907us       0.000us         0.00%       5.850ms       5.821us           0 b           0 b       9.52 Mb     481.50 Kb          1005  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 945.407ms
Self CUDA time total: 83.583ms

Here we significantly reduced the CPU memory needed to load the data, from 75.42 MB to 9.43 MB.
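
These figures can be sanity-checked with a back-of-the-envelope calculation. The sketch below is our own estimate, not something the profiler reports. It assumes float32 images of shape 3x32x32, int64 labels, 201 loader iterations (the "# of Calls" column), and that the profiler's "Mb" means 2^20 bytes.

```python
def loader_bytes(batch_size, n_batches=201):
    """Estimated bytes of CPU tensors produced by the loader over n_batches iterations."""
    image_bytes = batch_size * 3 * 32 * 32 * 4  # float32 images, 3x32x32 (assumed)
    label_bytes = batch_size * 8                # int64 class labels (assumed)
    return n_batches * (image_bytes + label_bytes)

for bs in (32, 4):
    print(f"batch_size={bs}: {loader_bytes(bs) / 2**20:.2f} MiB")
```

Under these assumptions the estimate scales linearly with the batch size, which is consistent with the roughly 8x drop seen when going from a batch size of 32 to 4.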

