A First Look at TVM: Optimizing ResNet-50 with TVM

shaojie_wang

Article link: https://blog.csdn.net/shaojie_wang/article/details/121047570

Testing the TVM-compiled ResNet-50 on a CPU

Testing ResNet-50 on the CPU
Auto-tuning ResNet-50
- Model tuning (auto-tune)
- Compiling the tuned model

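Testing ResNet-50 on the CPU

If you landed on this post directly, you may not know where the compiled model came from; that step is covered in the earlier post on compiling the model. As a reminder, the compiled model is packed into a tar archive. Start by unpacking it: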

mkdir model
tar -xvf resnet50-v2-7-tvm.tar -C model
ls model

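The model directory now contains three files:

- mod.so: the model itself, compiled into a C++ shared library that the TVM runtime loads and calls
- mod.params: the pre-trained parameters of the model
- mod.json: a text file describing the Relay computation graph

Your application can load these files directly, and the model can be invoked through the TVM runtime API.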
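Before unpacking, it can be handy to confirm that an archive really contains the three files listed above. The following is a minimal sketch using only the Python standard library; the helper name missing_members is made up for this illustration and is not part of TVM.

```python
import tarfile

# File names the article lists for a tvmc-compiled package.
EXPECTED_MEMBERS = {"mod.so", "mod.params", "mod.json"}


def missing_members(tar_path, expected=EXPECTED_MEMBERS):
    """Return the expected file names that are absent from the archive."""
    with tarfile.open(tar_path) as archive:
        # Compare base names so a leading "./" in member paths does not matter.
        present = {name.split("/")[-1] for name in archive.getnames()}
    return sorted(set(expected) - present)
```

Calling missing_members("resnet50-v2-7-tvm.tar") should return an empty list for a complete package.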

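The compiled ResNet-50 model

We have a compiled model module; now we need to test it. The test uses the TVM runtime API, which tvmc wraps. We need:

- The compiled model, fresh from the compiler
- An input image

Each model expects a particular input shape, data type, data layout, and so on, so an image usually needs pre-processing before inference and the output needs post-processing afterwards. tvmc accepts NumPy .npz files, which keeps things simple. I am fond of cats, so, as in the TVM tutorial, I use this photo of a cat.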
[Image: the cat photo used as input]

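Image pre-processing

For ResNet-50 the image must be in the ImageNet format. Below are example pre-processing and post-processing scripts. The pre-processing script uses the Pillow module; if it is not installed, run pip3 install pillow.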

#!python ./preprocess.py
from tvm.contrib.download import download_testdata
from PIL import Image
import numpy as np

# If the download fails, save the cat image shown above and use that file instead
img_url = "https://s3.amazonaws.com/model-server/inputs/kitten.jpg"
img_path = download_testdata(img_url, "imagenet_cat.png", module="data")

# Resize it to 224x224
resized_image = Image.open(img_path).resize((224, 224))
img_data = np.asarray(resized_image).astype("float32")

# ONNX expects NCHW input, so convert the array
img_data = np.transpose(img_data, (2, 0, 1))

# Normalize according to ImageNet
imagenet_mean = np.array([0.485, 0.456, 0.406])
imagenet_stddev = np.array([0.229, 0.224, 0.225])
norm_img_data = np.zeros(img_data.shape).astype("float32")
for i in range(img_data.shape[0]):
    norm_img_data[i, :, :] = (img_data[i, :, :] / 255 - imagenet_mean[i]) / imagenet_stddev[i]

# Add batch dimension
img_data = np.expand_dims(norm_img_data, axis=0)

# Save to .npz (outputs imagenet_cat.npz)
np.savez("imagenet_cat", data=img_data)
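The per-channel loop in the script can also be written with NumPy broadcasting. The sketch below is an alternative formulation of the same normalization, not the script the article uses; the function name normalize_hwc is an assumption made for illustration, and it expects an RGB array in height-width-channel order.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype="float32")
IMAGENET_STDDEV = np.array([0.229, 0.224, 0.225], dtype="float32")


def normalize_hwc(image_hwc):
    """Turn an HxWx3 RGB array with values in 0..255 into a 1x3xHxW float32 tensor."""
    scaled = np.asarray(image_hwc, dtype="float32") / 255.0
    # Broadcasting applies the per-channel mean and stddev along the last axis.
    normalized = (scaled - IMAGENET_MEAN) / IMAGENET_STDDEV
    # HWC -> CHW, then add the batch dimension to get NCHW.
    return np.expand_dims(np.transpose(normalized, (2, 0, 1)), axis=0)
```

Normalizing before the transpose keeps the channel axis last, which is what lets the three-element mean and stddev arrays broadcast without reshaping.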

Running the compiled model

With the compiled model and the converted image in hand, we can test the model:

python -m tvm.driver.tvmc run \
--inputs imagenet_cat.npz \
--output predictions.npz \
resnet50-v2-7-tvm.tar

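The tar package contains the compiled model and its runtime library. tvmc wraps the TVM runtime interface; after the run it writes the predictions to an .npz file. In this example the model runs on the same machine that compiled it, but you can also run it on a remote device over RPC; see python -m tvm.driver.tvmc run --help for the RPC options.

Inspecting the output

Each model has its own output tensor format. Here we download a label lookup table for ResNet-50, use it to interpret the output, and print the result with a post-processing script: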

#!python ./postprocess.py
import os.path
import numpy as np

from scipy.special import softmax

from tvm.contrib.download import download_testdata

# Download a list of labels
labels_url = "https://s3.amazonaws.com/onnx-model-zoo/synset.txt"
labels_path = download_testdata(labels_url, "synset.txt", module="data")

with open(labels_path, "r") as f:
    labels = [l.rstrip() for l in f]

output_file = "predictions.npz"

# Open the output and read the output tensor
if os.path.exists(output_file):
    with np.load(output_file) as data:
        scores = softmax(data["output_0"])
        scores = np.squeeze(scores)
        ranks = np.argsort(scores)[::-1]

        for rank in ranks[0:5]:
            print("class='%s' with probability=%f" % (labels[rank], scores[rank]))

Running it gives the following result:

class='n02123045 tabby, tabby cat' with probability=0.610552
class='n02123159 tiger cat' with probability=0.367179
class='n02124075 Egyptian cat' with probability=0.019365
class='n02129604 tiger, Panthera tigris' with probability=0.001273
class='n04040759 radiator' with probability=0.000261

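Four of the top-5 predictions are cats of one kind or another (including a tiger); the fifth, radiator, has a negligible probability.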
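The ranking step of the post-processing script can be isolated into a small function. This is a sketch that replaces scipy.special.softmax with a hand-written NumPy softmax so that it has no SciPy dependency; the function name top_k is an assumption made for illustration.

```python
import numpy as np


def top_k(logits, k=5):
    """Return the k most likely (class_index, probability) pairs."""
    scores = np.squeeze(np.asarray(logits, dtype="float64"))
    # Subtract the maximum before exponentiating for numerical stability.
    exp_scores = np.exp(scores - scores.max())
    probabilities = exp_scores / exp_scores.sum()
    # argsort is ascending, so reverse it and keep the first k entries.
    order = np.argsort(probabilities)[::-1][:k]
    return [(int(i), float(probabilities[i])) for i in order]
```

Each returned index can be used to look up the class name in the labels list that the post-processing script reads from synset.txt.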

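Auto-tuning ResNet-50

The model above was only compiled in the basic way, with no tuning specific to the target platform. tvmc can auto-tune a model for the characteristics of the target. Often we do not know which optimization strategies work well on a given platform; the auto-tune module builds a search space of candidates and searches it for the best performance. Tuning here is not the fine-tuning done during model training: it does not change the model's prediction accuracy and only improves run-time speed on the target platform. TVM provides multiple schedule templates, runs them on the target, and selects the fastest. In the simplest mode, tvmc needs:

- The target platform
- An output file for the tuning records
- The model

Model tuning (auto-tune)

The following command performs one tuning run: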

python -m tvm.driver.tvmc tune \
--target "llvm" \
--output resnet50-v2-7-autotuner_records.json \
resnet50-v2-7.onnx

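In this example you can also specify the target architecture more precisely. For a Skylake CPU, for instance, pass --target "llvm -mcpu=skylake" so that the auto-tuner can find operator optimizations better suited to that CPU. The auto-tuner optimizes each operator subgraph in the model; each subgraph has its own search space of schedules, and the tuner keeps the best result it finds in that space.

For this model, tuning produced the following output: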
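Because a target string such as llvm -mcpu=skylake contains a space, it must reach tvmc as a single argument. One way to avoid shell quoting mistakes is to build the command as an argument list in Python. This is a sketch under that assumption; build_tune_command is a hypothetical helper, and it uses only the --target and --output flags shown in the command above.

```python
import shlex


def build_tune_command(model_path, records_path, target="llvm"):
    """Build the tvmc tune invocation as an argument list."""
    return [
        "python", "-m", "tvm.driver.tvmc", "tune",
        "--target", target,        # kept as one argument even if it contains spaces
        "--output", records_path,  # file that receives the tuning records
        model_path,
    ]


cmd = build_tune_command(
    "resnet50-v2-7.onnx",
    "resnet50-v2-7-autotuner_records.json",
    target="llvm -mcpu=skylake",
)
# Print a copy-pasteable, correctly quoted shell command.
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) to start the tuning run without going through a shell.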

[Task  1/25]  Current/Best:  201.89/ 220.20 GFLOPS | Progress: (40/40) | 20.89 s Done.
[Task  2/25]  Current/Best:  163.58/ 188.97 GFLOPS | Progress: (40/40) | 10.87 s Done.
[Task  3/25]  Current/Best:  107.46/ 232.10 GFLOPS | Progress: (40/40) | 13.03 s Done.
[Task  4/25]  Current/Best:   86.44/ 216.15 GFLOPS | Progress: (40/40) | 18.86 s Done.
[Task  5/25]  Current/Best:  158.27/ 225.16 GFLOPS | Progress: (40/40) | 12.15 s Done.
[Task  6/25]  Current/Best:  127.20/ 204.70 GFLOPS | Progress: (40/40) | 15.10 s Done.
[Task  7/25]  Current/Best:  181.57/ 211.43 GFLOPS | Progress: (40/40) | 12.19 s Done.
[Task  8/25]  Current/Best:  136.64/ 205.29 GFLOPS | Progress: (40/40) | 21.14 s Done.
[Task  9/25]  Current/Best:  235.01/ 235.01 GFLOPS | Progress: (40/40) | 20.43 s Done.
[Task 10/25]  Current/Best:  121.53/ 209.96 GFLOPS | Progress: (40/40) | 12.77 s Done.
[Task 11/25]  Current/Best:  214.43/ 214.43 GFLOPS | Progress: (40/40) | 12.11 s Done.
[Task 12/25]  Current/Best:  147.68/ 206.92 GFLOPS | Progress: (40/40) | 14.60 s Done.
[Task 13/25]  Current/Best:  112.67/ 214.38 GFLOPS | Progress: (40/40) | 13.11 s Done.
[Task 14/25]  Current/Best:  168.30/ 216.40 GFLOPS | Progress: (40/40) | 20.96 s Done.
[Task 15/25]  Current/Best:   68.19/ 221.29 GFLOPS | Progress: (40/40) | 20.52 s Done.
[Task 16/25]  Current/Best:  140.36/ 194.06 GFLOPS | Progress: (40/40) | 13.67 s Done.
[Task 17/25]  Current/Best:  104.27/ 219.83 GFLOPS | Progress: (40/40) | 13.28 s Done.
[Task 18/25]  Current/Best:  102.57/ 214.91 GFLOPS | Progress: (40/40) | 17.39 s Done.
[Task 19/25]  Current/Best:  141.44/ 223.53 GFLOPS | Progress: (40/40) | 14.55 s Done.
[Task 20/25]  Current/Best:   62.98/ 202.96 GFLOPS | Progress: (40/40) | 21.22 s Done.
[Task 21/25]  Current/Best:  100.66/ 218.22 GFLOPS | Progress: (40/40) | 19.84 s Done.
[Task 22/25]  Current/Best:  162.85/ 215.54 GFLOPS | Progress: (40/40) | 12.81 s Done.
[Task 23/25]  Current/Best:  122.70/ 231.81 GFLOPS | Progress: (40/40) | 16.73 s Done.
[Task 24/25]  Current/Best:   23.29/  74.45 GFLOPS | Progress: (40/40) | 20.90 s Done.
[Task 25/25]  Current/Best:    8.61/  85.78 GFLOPS | Progress: (40/40) | 20.67 s Done.
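To see at a glance which tasks tuned poorly, the log can be parsed with a regular expression. This is a minimal sketch that assumes the line layout shown above; the function name parse_best_gflops is made up for this illustration.

```python
import re

# Matches lines such as:
# [Task  1/25]  Current/Best:  201.89/ 220.20 GFLOPS | Progress: (40/40) | 20.89 s Done.
LINE_RE = re.compile(
    r"\[Task\s+(\d+)/(\d+)\]\s+Current/Best:\s+([\d.]+)/\s*([\d.]+) GFLOPS"
)


def parse_best_gflops(log_text):
    """Map each task index to the best GFLOPS reported on its log line."""
    best = {}
    for match in LINE_RE.finditer(log_text):
        best[int(match.group(1))] = float(match.group(4))
    return best
```

Applied to the log above, min(best, key=best.get) picks out task 24, whose best result (74.45 GFLOPS) is well below the roughly 190 to 235 GFLOPS reached by tasks 1 to 23.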

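Tuning can take a long time, so tvmc provides options such as --repeat and --number to control how long it runs.

Compiling the tuned model

The tuning data is recorded in resnet50-v2-7-autotuner_records.json. This file can later be used to:

- Compile an optimized model (tvmc compile --tuning-records)
- Serve as the starting point for further tuning (tvmc tune --tuning-records)

The compiler uses the tuning records to generate efficient code for the target platform. Pass them with tvmc compile --tuning-records; see tvmc compile --help for details. With the tuning data collected, we can compile the model again: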

python -m tvm.driver.tvmc compile \
--target "llvm" \
--tuning-records resnet50-v2-7-autotuner_records.json  \
--output resnet50-v2-7-tvm_autotuned.tar \
resnet50-v2-7.onnx

Check the predictions of the tuned model:

python -m tvm.driver.tvmc run \
--inputs imagenet_cat.npz \
--output predictions.npz \
resnet50-v2-7-tvm_autotuned.tar

python postprocess.py

The output is the same as before:

class='n02123045 tabby, tabby cat' with probability=0.610552
class='n02123159 tiger cat' with probability=0.367179
class='n02124075 Egyptian cat' with probability=0.019365
class='n02129604 tiger, Panthera tigris' with probability=0.001273
class='n04040759 radiator' with probability=0.000261
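Comparing printed probabilities by eye works for five lines, but the two output files can also be compared numerically. This is a sketch that assumes each run was written to its own file (the commands above both write to predictions.npz, so one of them would need a different --output name) and that the tensor is stored under the output_0 key used by the post-processing script. The function name and the tolerance are assumptions.

```python
import numpy as np


def same_outputs(path_a, path_b, key="output_0", rtol=1e-4):
    """Return True if the named tensor matches in both .npz files within rtol."""
    with np.load(path_a) as a, np.load(path_b) as b:
        return bool(np.allclose(a[key], b[key], rtol=rtol))
```

A small relative tolerance is used rather than exact equality because a differently scheduled kernel may accumulate floating-point sums in a different order.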

tvmc can also report inference time, which makes it possible to compare the two models:

python -m tvm.driver.tvmc run \
--inputs imagenet_cat.npz \
--output predictions.npz  \
--print-time \
--repeat 100 \
resnet50-v2-7-tvm_autotuned.tar

# Execution time summary:
# mean (ms)   max (ms)   min (ms)   std (ms) 
#   19.19      99.95      16.60       9.33 

python -m tvm.driver.tvmc run \
--inputs imagenet_cat.npz \
--output predictions.npz  \
--print-time \
--repeat 100 \
resnet50-v2-7-tvm.tar

# Execution time summary:
# mean (ms)   max (ms)   min (ms)   std (ms) 
#   22.93      150.05     21.02      12.93
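The two summaries can be turned into a single speedup figure. The sketch below simply does the arithmetic on the mean latencies reported above; the variable names are chosen for illustration.

```python
# Mean latencies in milliseconds, copied from the two timing summaries above.
baseline_mean_ms = 22.93  # resnet50-v2-7-tvm.tar
tuned_mean_ms = 19.19     # resnet50-v2-7-tvm_autotuned.tar

# Speedup is the ratio of the old latency to the new one.
speedup = baseline_mean_ms / tuned_mean_ms
# Fraction of the baseline latency that tuning removed.
reduction = 1 - tuned_mean_ms / baseline_mean_ms

print(f"speedup: {speedup:.2f}x, mean latency reduced by {reduction:.1%}")
```

This works out to a speedup of about 1.19x, or a 16.3% lower mean latency. Both runs show a large standard deviation and a maximum far above the mean, so the exact figure should be read with some caution.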