Introduction to AOT & JIT

Undefined游侠

Last modified: 2024-08-18 07:25:50

Tags: python

First published: 2024-08-17 18:15:51

Article link: https://blog.csdn.net/qq_19859865/article/details/141275188

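Two important concepts in programming are ahead-of-time (AOT) compilation and just-in-time (JIT) compilation. Machine learning workloads usually involve building a computation graph and handling inputs of varying size, so inference frameworks often have to choose between the JIT and AOT strategies.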

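AOT properties

* The program must be fully compiled before it runs.

* The compiled output is relatively large.

* The generated binary targets one specific platform only.

* Performance is predictable.

JIT properties

* The program is compiled at runtime, just before it is executed.

* Each function is compiled only right before its first use.

* Optimization opportunities are limited, and global optimization is hard.

* The binary is smaller, because only the code that is actually needed gets compiled before it runs.

* A platform-independent IR is generated first, and platform-specific optimizations are applied only at runtime.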
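The "compiled only before first use" property can be illustrated with a toy sketch in plain Python. This is not how any real JIT works internally: `LazyFunction` is a hypothetical class, and Python's built-in `compile()` to bytecode merely stands in for real code generation.

```python
class LazyFunction:
    """Toy stand-in for a JIT-compiled function: compiled on first call only."""

    def __init__(self, name, source):
        self.name = name
        self.source = source
        self._compiled = None
        self.compile_count = 0

    def __call__(self, *args):
        if self._compiled is None:
            # "Compilation" is Python's built-in compile() to bytecode,
            # used here only as a stand-in for real code generation.
            code = compile(self.source, f"<lazy:{self.name}>", "exec")
            namespace = {}
            exec(code, namespace)
            self._compiled = namespace[self.name]
            self.compile_count += 1
        return self._compiled(*args)


square = LazyFunction("square", "def square(x):\n    return x * x\n")
cube = LazyFunction("cube", "def cube(x):\n    return x * x * x\n")

print(square(3), square(4))                      # compiled once, on the first call
print(square.compile_count, cube.compile_count)  # cube was never called, so never compiled
```

The second call to `square` reuses the cached result, and `cube` costs nothing until someone actually calls it.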

TorchScript is an intermediate representation of a PyTorch model (subclass of nn.Module) that can then be run in a high-performance environment like C++. It's a high-performance subset of Python that is meant to be consumed by the PyTorch JIT Compiler, which performs run-time optimization on your model's computation.

Among AI training and inference frameworks, broadly speaking, PyTorch chose JIT, while TensorFlow chose AOT.

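With AOT, we can compile the whole program up front, apply kernel fusion, remove redundant computation, and improve throughput. The drawbacks are:

1. The outputs of intermediate layers may already have been lost during optimization.

2. If parameters need to be adjusted, the whole program has to be recompiled.

3. For different input sizes, AOT cannot always provide the optimal strategy for each size.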
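Kernel fusion and the first drawback in the list above can be shown together in a small plain-Python sketch. Both function names are hypothetical, and list comprehensions stand in for GPU kernels.

```python
def scale_then_shift_unfused(xs):
    """Two separate passes: the intermediate result is materialized."""
    scaled = [x * 2.0 for x in xs]        # first "kernel"
    shifted = [s + 1.0 for s in scaled]   # second "kernel"
    return scaled, shifted


def scale_then_shift_fused(xs):
    """One fused pass: same final result, but `scaled` never exists."""
    return [x * 2.0 + 1.0 for x in xs]


xs = [1.0, 2.0, 3.0]
scaled, shifted = scale_then_shift_unfused(xs)
fused = scale_then_shift_fused(xs)
print(fused == shifted)  # the final outputs agree
```

The fused version makes a single pass over the data, which is where the throughput gain comes from, but there is no longer any `scaled` value to print or inspect.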

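With JIT, PyTorch can effectively support rapid prototyping, because only part of the program has to be compiled at runtime. Printing and inspecting the intermediate results of the network is also easy. JIT is based on dynamic compilation, so only the modified parts are recompiled. In addition, JIT optimizes the graph according to the size and type of the inputs, and caches the input types it has already seen so that the compiled results can be reused. This is what is meant by "dynamic code compilation".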
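The caching behaviour described above can be sketched as a lookup keyed on input size and element type. This is a simplified illustration, not PyTorch's actual mechanism: `SpecializingCache` is a hypothetical class, and storing the original function stands in for storing specialized compiled code.

```python
class SpecializingCache:
    """Toy JIT cache: one "compiled" entry per (length, element type) seen."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}
        self.compile_count = 0

    def _key(self, values):
        elem_type = type(values[0]).__name__ if values else "empty"
        return (len(values), elem_type)

    def __call__(self, values):
        key = self._key(values)
        if key not in self.cache:
            # A real JIT would generate code specialized for this key;
            # here the original function is stored as a placeholder.
            self.compile_count += 1
            self.cache[key] = self.fn
        return self.cache[key](values)


total = SpecializingCache(sum)
total([1, 2, 3])        # new key (3, "int"): compile
total([4, 5, 6])        # same key: cache hit
total([1.0, 2.0, 3.0])  # new element type: compile again
total([1, 2])           # new size: compile again
print(total.compile_count, sorted(total.cache))
```

Only a change in size or type triggers new work, which is why repeated calls with same-shaped inputs are cheap after the first one.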

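JIT has another efficient feature, dynamic tracing: by following the data flow it can generate a computation graph, that is, a DAG (directed acyclic graph). Further optimizations can be applied on that graph, and in some cases the result can even outperform AOT.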
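A minimal sketch of tracing, assuming operator overloading as the recording mechanism: running the function once on placeholder inputs records every operation as a node. `Node` and `trace` are hypothetical names, and only `+` and `*` are supported.

```python
class Node:
    """One operation in the traced graph; `inputs` are the nodes it consumes."""

    def __init__(self, graph, op, inputs=()):
        self.graph = graph
        self.op = op
        self.inputs = inputs
        self.id = len(graph)   # nodes are numbered in creation order
        graph.append(self)

    def __add__(self, other):
        return Node(self.graph, "add", (self, other))

    def __mul__(self, other):
        return Node(self.graph, "mul", (self, other))


def trace(fn, n_inputs):
    """Run fn once on placeholder nodes and return the recorded graph."""
    graph = []
    placeholders = [Node(graph, "input") for _ in range(n_inputs)]
    output = fn(*placeholders)
    return graph, output


def f(a, b):
    return (a + b) * a


graph, output = trace(f, 2)
ops = [(node.op, tuple(i.id for i in node.inputs)) for node in graph]
print(ops)
```

Every node only refers to nodes created before it, so the recorded graph is acyclic by construction, and an optimizer could now rewrite it before generating code.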

PyTorch offers several ways to use JIT, such as torch.compile and dynamo. The demo below gives a quick feel for them.

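Here exp5 uses PyTorch's eager mode, exp6 uses torch.compile for JIT, and exp7 is based on dynamo. In exp6 the first run takes noticeably longer than in eager mode. In JIT mode the first run involves building the graph and therefore takes longer, but once warm-up is complete, later iterations are clearly faster than the first one. Comparing torch.compile, dynamo, and eager mode also shows differences in speed.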

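I also ran another experiment: if the warm-up does not include backpropagation, the warm-up time gets shorter, but the first iteration after warm-up takes longer. This shows another side of JIT: what is not used is not compiled or optimized.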
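The warm-up observation can be mimicked with a toy model whose forward and backward passes are "compiled" separately on first use. This is only an illustration of the idea, not a description of how torch.compile handles the backward pass. `LazyModel` is a hypothetical class.

```python
class LazyModel:
    """Toy model: each pass is "compiled" the first time it is actually used."""

    def __init__(self):
        self.compiled = []   # passes compiled so far, in order

    def _ensure_compiled(self, part):
        if part not in self.compiled:
            self.compiled.append(part)   # stand-in for an expensive compile

    def forward(self, x):
        self._ensure_compiled("forward")
        return 2 * x

    def backward(self, grad):
        self._ensure_compiled("backward")
        return 2 * grad


model = LazyModel()

model.forward(1.0)                     # warm-up without backpropagation
after_warmup = list(model.compiled)

model.forward(1.0)                     # first real training step
model.backward(1.0)
after_first_step = list(model.compiled)

print(after_warmup, after_first_step)
```

A forward-only warm-up leaves the backward pass uncompiled, so its compilation cost is paid during the first real training step instead.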

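Speed and results may differ between CPU and GPU, and the network architecture may also have an effect. The conclusions from my code below also differ from those of the official PyTorch code, so it is hard to draw a firm conclusion here. I recommend running your own tests, using the following tutorial as a reference:

torch.compile Tutorial — PyTorch Tutorials 1.13.1+cu117 documentation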

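import time

import torch
import torch._dynamo as dynamo
from torchvision import models

# Assumption: the original post does not show how exp_model is defined.
# Following the commented-out hint below, AlexNet on the GPU is used here.
exp_model = models.alexnet().cuda()


def exp5():
    """Baseline: eager mode, no compilation."""
    # model = models.alexnet()
    model = exp_model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(16, 3, 224, 224).cuda()

    # Warm up. CUDA kernels run asynchronously, so synchronize before
    # reading the clock; otherwise mostly launch overhead is measured.
    torch.cuda.synchronize()
    start = time.time()
    out = model(x)
    out.sum().backward()
    torch.cuda.synchronize()
    end = time.time()
    print(f"naive warm up time {end - start}")

    count = []
    for epoch in range(10):
        optimizer.zero_grad()
        torch.cuda.synchronize()
        start = time.time()
        out = model(x)
        out.sum().backward()
        optimizer.step()
        torch.cuda.synchronize()
        end = time.time()
        count.append(end - start)
        # print(f"Finished one epoch for the naive model in {end - start}")
    print(f"mean time of naive training: {sum(count)/10}")


def exp6():
    """JIT via torch.compile (default backend)."""
    model = exp_model
    # model = models.alexnet()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    compiled_model = torch.compile(model)
    x = torch.randn(16, 3, 224, 224).cuda()

    # Warm up: the first call triggers graph capture and compilation.
    torch.cuda.synchronize()
    start = time.time()
    out = compiled_model(x)
    out.sum().backward()
    torch.cuda.synchronize()
    end = time.time()
    print(f"compiled model warm up time {end - start}")

    count = []
    for epoch in range(10):
        optimizer.zero_grad()
        torch.cuda.synchronize()
        start = time.time()
        out = compiled_model(x)
        out.sum().backward()
        optimizer.step()
        torch.cuda.synchronize()
        end = time.time()
        count.append(end - start)
        # print(f"Finished one epoch for the compiled model in {end - start}")
    print(f"mean time of compiled training: {sum(count)/10}")


def exp7():
    """JIT via the lower-level dynamo entry point with the inductor backend."""
    torch._dynamo.reset()  # drop compilation caches left over from exp6
    # model = models.alexnet()
    model = exp_model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    # Kept as in the original; the public API for the same thing is
    # torch.compile(model, backend="inductor").
    compiled_model = dynamo.optimize("inductor")(model)
    x = torch.randn(16, 3, 224, 224).cuda()

    # Warm up: the first call triggers graph capture and compilation.
    torch.cuda.synchronize()
    start = time.time()
    out = compiled_model(x)
    out.sum().backward()
    torch.cuda.synchronize()
    end = time.time()
    print(f"dynamo model warm up time {end - start}")

    count = []
    for epoch in range(10):
        optimizer.zero_grad()
        torch.cuda.synchronize()
        start = time.time()
        out = compiled_model(x)
        out.sum().backward()
        optimizer.step()
        torch.cuda.synchronize()
        end = time.time()
        count.append(end - start)
        # print(f"Finished one epoch for the dynamo model in {end - start}")
    print(f"mean time of dynamo training: {sum(count)/10}")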


print('~'*10)
exp5()
print('~'*10)
exp6()
print('~'*10)
exp7()

References:
https://medium.com/@princejain_77044/ahead-of-time-aot-and-just-in-time-jit-compilation-d73a2fa7fa19

torch.compile Tutorial — PyTorch Tutorials 1.13.1+cu117 documentation
