Using torch.compile

zhuikefeng

Article link: https://blog.csdn.net/zhuikefeng/article/details/136219682

Introduction

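torch.compile makes PyTorch code run faster by JIT-compiling it into optimized kernels. The speedup comes mainly from reduced Python overhead and fewer GPU memory reads and writes, so the observed gain varies with factors such as model architecture and batch size. For example, if a model's architecture is simple and the amount of data is large, the bottleneck is GPU compute and the observed speedup may be less significant.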

Requires torch >= 2.0; Triton (torchtriton) also needs to be installed.
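As a quick way to check these requirements before calling torch.compile, the sketch below reads the installed torch version and looks for a Triton module using only the standard library. The helper names (`parse_major_minor`, `installed_version`, `compile_requirements_report`) are hypothetical, and it assumes Triton is importable under the module name `triton`; treat it as a rough pre-flight check, not an authoritative compatibility test.

```python
import importlib.util
import re
from importlib import metadata


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '2.1.0+cu118'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compile_requirements_report():
    """Summarize whether torch >= 2.0 and a Triton module appear to be present."""
    torch_version = installed_version("torch")
    torch_ok = (
        torch_version is not None
        and parse_major_minor(torch_version) >= (2, 0)
    )
    # Assumption: Triton is importable as `triton`.
    triton_found = importlib.util.find_spec("triton") is not None
    return {
        "torch_version": torch_version,
        "torch_ok": torch_ok,
        "triton_found": triton_found,
    }


print(compile_requirements_report())
```

If `torch_ok` is False, upgrading PyTorch is the first step; a missing Triton mainly matters for GPU code generation.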

Usage

Compiling a function

import torch

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(y)  # use y here; the second argument was otherwise unused
    return a + b

opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))

# The decorator form works as well
@torch.compile
def opt_foo2(x, y):
    a = torch.sin(x)
    b = torch.cos(y)
    return a + b
print(opt_foo2(torch.randn(10, 10), torch.randn(10, 10)))

Compiling a module

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(100, 10)

    def forward(self, x):
        return torch.nn.functional.relu(self.lin(x))

mod = MyModule()
opt_mod = torch.compile(mod)
print(opt_mod(torch.randn(10, 100)))

Optimization example

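# Returns the result of running `fn()` and the time it took, in seconds.
# CUDA events plus a synchronize give accurate GPU timings.
def timed(fn):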
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = fn()
    end.record()
    torch.cuda.synchronize()
    return result, start.elapsed_time(end) / 1000

# Generates random input and targets data for the model, where `b` is
# batch size.
def generate_data(b):
    return (
        torch.randn(b, 3, 128, 128).to(torch.float32).cuda(),
        torch.randint(1000, (b,)).cuda(),
    )

N_ITERS = 10

from torchvision.models import resnet18
def init_model():
    return resnet18().to(torch.float32).cuda()

def evaluate(mod, inp):
    return mod(inp)

model = init_model()

# Reset since we are using a different mode.
import torch._dynamo
torch._dynamo.reset()

evaluate_opt = torch.compile(evaluate, mode="reduce-overhead")

inp = generate_data(16)[0]
print("eager:", timed(lambda: evaluate(model, inp))[1])
print("compile:", timed(lambda: evaluate_opt(model, inp))[1])

#eager: 4.2023056640625
#compile: 10.3568037109375

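Note that on this first run the compiled version takes much longer than eager, because torch.compile compiles the model into optimized kernels as it executes. In this example the model's structure does not change, so no recompilation is needed; running the compiled model several more times and comparing against eager should show a clear improvement.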

eager_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    _, eager_time = timed(lambda: evaluate(model, inp))
    eager_times.append(eager_time)
    print(f"eager eval time {i}: {eager_time}")

print("~" * 10)

compile_times = []
for i in range(N_ITERS):
    inp = generate_data(16)[0]
    _, compile_time = timed(lambda: evaluate_opt(model, inp))
    compile_times.append(compile_time)
    print(f"compile eval time {i}: {compile_time}")
print("~" * 10)

import numpy as np
eager_med = np.median(eager_times)
compile_med = np.median(compile_times)
speedup = eager_med / compile_med
print(f"(eval) eager median: {eager_med}, compile median: {compile_med}, speedup: {speedup}x")
print("~" * 10)
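The measurement pattern above (keep the first, compile-heavy call apart, then compare medians over repeated runs) can be wrapped in a small reusable helper. This is a minimal sketch using only the standard library; `benchmark` and `median_speedup` are hypothetical names, and it measures wall-clock time with `time.perf_counter`, so for GPU work the timed callable would itself need to wait for the device (for example by calling `torch.cuda.synchronize()` before returning).

```python
import statistics
import time


def benchmark(fn, n_iters=10, warmup=1):
    """Time `fn` in wall-clock seconds, keeping warm-up calls separate.

    Returns (warmup_times, steady_times). For a compiled function, the
    warm-up calls are where one-time compilation cost would show up.
    """
    def time_once():
        start = time.perf_counter()
        fn()
        return time.perf_counter() - start

    warmup_times = [time_once() for _ in range(warmup)]
    steady_times = [time_once() for _ in range(n_iters)]
    return warmup_times, steady_times


def median_speedup(baseline_times, candidate_times):
    """Ratio of medians; values above 1 mean the candidate is faster."""
    return statistics.median(baseline_times) / statistics.median(candidate_times)


# Hypothetical stand-in workload; in the example above this would be
# something like `lambda: evaluate_opt(model, inp)`.
warm, steady = benchmark(lambda: sum(range(10_000)), n_iters=5, warmup=1)
print(f"first call: {warm[0]:.6f}s, steady median: {statistics.median(steady):.6f}s")
```

The median is used instead of the mean so that a single slow iteration (for instance an unexpected recompilation) does not dominate the comparison.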

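The speedup you see also depends on the chosen mode argument. Because the model and data here are small and the goal is to keep overhead as low as possible, "reduce-overhead" was chosen. For your own model, you may need to try different modes to maximize the speedup. torch.compile supports three modes: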