[Model Inference Acceleration Series] 07: A Comprehensive Benchmark of Inference Acceleration Options, Using BERT as the Example

JasonLiu1919

Article link: https://blog.csdn.net/ljp1919/article/details/128414771

Introduction

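The moon is full on the winter solstice night; frost and dew are dotted with clear silver. The maple leaves have all fallen in the frost; the cold wind bites across the water. Mountains and rivers lie frozen over the roads; through the long night I drink alone with my sorrow. The stove lights and warms the home; outside the window, snowflakes fly.

Hello everyone, I am 卖布洛芬的小男孩 (the little boy who sells ibuprofen), chief medicine tester at the WeChat public account 小窗幽记机器学习. The poem above was recently written by the hugely popular ChatGPT in imitation of Li Bai's style and is offered for enjoyment only.

The previous post benchmarked various inference acceleration options using resnet18, a model from the CV field; for those results, see the earlier post: Model Inference Acceleration Series | 06: Benchmarking Acceleration Options on resnet18. Today's post does the same for BERT, a commonly used NLP model (only encoding the input text), and benchmarks five inference options: PyTorch, ONNX, JIT, TensorRT, and OpenVINO.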

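For more and newer articles, follow the WeChat public account 小窗幽记机器学习. More posts on model inference acceleration and engineering deployment will follow, so stay tuned.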

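Model Export

Exporting to ONNX

For how to export the PyTorch version of BERT to ONNX format, see the earlier post: Model Inference Acceleration Series | 04: BERT Inference Acceleration, TorchScript vs. ONNX.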

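Exporting to TorchScript

For how to export the PyTorch version of BERT to TorchScript format, see the earlier post: Model Inference Acceleration Series | 04: BERT Inference Acceleration, TorchScript vs. ONNX. For more on the TorchScript model format, see: Model Inference Acceleration Series | 05: An Introduction to the TorchScript Model Format and Its Use.

Exporting to TensorRT

The TensorRT part of the benchmark covers both a variable-length and a fixed-length version (the terms refer to the constraint on input text length), so two TensorRT engines are exported: one that accepts variable-length text input and one that accepts fixed-length text input.

An example of exporting the fixed-length TensorRT engine: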

CUDA_VISIBLE_DEVICES=0 trtexec --onnx=/home/model_zoo/nlp/onnx/bert-base-chinese/bert_dynamic.onnx  --minShapes=input_ids:1x32,attention_mask:1x32,token_type_ids:1x32 --optShapes=input_ids:64x32,attention_mask:64x32,token_type_ids:64x32 --maxShapes=input_ids:128x32,attention_mask:128x32,token_type_ids:128x32 --saveEngine=/home/model_zoo/nlp/tensorrt/bert_static_32.engine

PS:

The engine exported above is dynamic in the batch size dimension (1 to 128), but it only supports input text of length 32.

An example of exporting the variable-length TensorRT engine:

CUDA_VISIBLE_DEVICES=0 trtexec --onnx=/home/model_zoo/nlp/onnx/bert-base-chinese/bert_dynamic.onnx --minShapes=input_ids:1x8,attention_mask:1x8,token_type_ids:1x8 --optShapes=input_ids:64x32,attention_mask:64x32,token_type_ids:64x32 --maxShapes=input_ids:128x64,attention_mask:128x64,token_type_ids:128x64 --saveEngine=/home/model_zoo/nlp/tensorrt/bert_dynamic_max64.engine
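The two trtexec commands differ only in their shape ranges, so the shape flags can be generated by a small helper. The sketch below is purely illustrative: `shape_flag` and `trtexec_shape_args` are hypothetical helper names, not part of TensorRT. Only the `--minShapes`, `--optShapes`, and `--maxShapes` flags and the three input names are taken from the commands above.

```python
# Hypothetical helpers (not part of TensorRT): build the shape flags used above.
INPUT_NAMES = ("input_ids", "attention_mask", "token_type_ids")


def shape_flag(flag, batch, seq_len, names=INPUT_NAMES):
    """Return e.g. '--minShapes=input_ids:1x8,attention_mask:1x8,...'."""
    shapes = ",".join(f"{name}:{batch}x{seq_len}" for name in names)
    return f"--{flag}={shapes}"


def trtexec_shape_args(min_shape, opt_shape, max_shape):
    """Each argument is a (batch, seq_len) pair; returns the three flags as a list."""
    for lo, mid, hi in zip(min_shape, opt_shape, max_shape):
        if not lo <= mid <= hi:
            raise ValueError("each dimension must satisfy min <= opt <= max")
    return [
        shape_flag("minShapes", *min_shape),
        shape_flag("optShapes", *opt_shape),
        shape_flag("maxShapes", *max_shape),
    ]


# Fixed-length engine: batch size 1..128, sequence length pinned to 32.
static_args = trtexec_shape_args((1, 32), (64, 32), (128, 32))
# Variable-length engine: batch size 1..128, sequence length 8..64.
dynamic_args = trtexec_shape_args((1, 8), (64, 32), (128, 64))
print(" ".join(dynamic_args))
```

Pinning the minimum, optimum, and maximum sequence length to the same value is what makes the first engine "fixed-length" while still leaving the batch dimension dynamic.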

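Exporting to OpenVINO

The image used is openvino/ubuntu20_runtime:latest. The PyTorch model is first converted to ONNX as before, and the ONNX model is then converted to the format OpenVINO requires. This step needs the mo command, which can be installed with: pip3 install openvino-dev[all]. The command for converting the ONNX model to OpenVINO is: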

mo --input_model /home/model_zoo/nlp/onnx/bert-base-chinese/bert_dynamic.onnx --input "input_ids[-1 128],attention_mask[-1 128],token_type_ids[-1 128]" --output_dir /home/model_zoo/nlp/openvino/bert/ --data_type FP32

Benchmark Results

Hardware used in this experiment:

GPU: 1 × Nvidia T4

CPU: 10 Intel® Xeon® Platinum 8255C CPU @ 2.50GHz

The results below are the inference latency for one batch at the given batch size, in milliseconds (ms).
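The post does not say how the latencies were measured. A common approach, sketched here as an assumption rather than as the author's method, is to run a few warm-up iterations and then average the wall-clock time over repeated runs. `measure_latency_ms` and `fake_infer` are hypothetical names. On a GPU the callable would also need to synchronize the device before the timer is read (for example with `torch.cuda.synchronize()` in PyTorch).

```python
import time


def measure_latency_ms(infer, warmup=3, repeats=10):
    """Average wall-clock time of infer() in milliseconds, after warm-up runs."""
    for _ in range(warmup):
        infer()  # warm-up: keeps one-off costs such as lazy initialization out of the timing
    start = time.perf_counter()
    for _ in range(repeats):
        infer()
    elapsed = time.perf_counter() - start
    return elapsed / repeats * 1000.0


def fake_infer():
    """Stand-in for a model call; replace with the real forward pass."""
    time.sleep(0.002)


latency = measure_latency_ms(fake_infer)
print(f"{latency:.1f} ms per batch")
```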

Input Text Length 8

CPU:

batch size	PyTorch	ONNX	JIT	OpenVINO (fixed-length)	OpenVINO (variable-length)
1	24.5	10.1	18.8	12.1	12.5
2	46.7	12.0	35.5	14.7	16.2
4	53.9	18.0	40.9	25.4	25.5
8	65.2	30.3	50.6	35.0	36.8
16	82.6	53.6	65.3	66.1	66.2
32	126.3	99.1	105.9	112.7	112.1
64	236.2	194.2	192.9	198.5	200.1
128	474.9	368.2	360.3	362.1	364.6

GPU:

batch size	PyTorch	ONNX	JIT	TensorRT (fixed-length)	TensorRT (variable-length)
1	10.7	6.5	8.3	4.8	7.9
2	11.6	6.6	8.3	5.1	7.9
4	10.9	7.1	7.2	5.5	8.0
8	11.6	10.0	11.3	7.8	8.0
16	15.8	14.9	15.4	13.0	20.9
32	24.3	23.8	23.9	20.6	75.3
64	44.6	42.6	44.4	37.2	287.9
128	86.5	87.6	83.9	73.5	568.1
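To make the GPU table easier to read, the latencies can be converted into speedups relative to PyTorch. The sketch below copies three rows from the table above (batch sizes 1, 8, and 128 at input length 8); `speedups` is a hypothetical helper that divides the baseline latency by each column's latency, so values above 1 mean faster than PyTorch.

```python
# Rows copied from the GPU table above (input length 8), latencies in ms.
COLUMNS = ("PyTorch", "ONNX", "JIT", "TensorRT fixed", "TensorRT variable")
ROWS = {
    1: (10.7, 6.5, 8.3, 4.8, 7.9),
    8: (11.6, 10.0, 11.3, 7.8, 8.0),
    128: (86.5, 87.6, 83.9, 73.5, 568.1),
}


def speedups(latencies, baseline_index=0):
    """Speedup of every column relative to the baseline column (baseline / latency)."""
    baseline = latencies[baseline_index]
    return tuple(round(baseline / value, 2) for value in latencies)


for batch_size, latencies in ROWS.items():
    row = ", ".join(f"{name} {s}x" for name, s in zip(COLUMNS, speedups(latencies)))
    print(f"batch size {batch_size}: {row}")
```

For these rows, fixed-length TensorRT is a little over 2x faster than PyTorch at batch size 1, while the variable-length engine falls well below 1x at batch size 128.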

Input Text Length 16

CPU:

batch size	PyTorch	ONNX	JIT	OpenVINO (fixed-length)	OpenVINO (variable-length)
1	46.1	12.2	35.5	14.6	16.3
2	52.7	18.1	41.9	21.8	24.2
4	64.1	30.0	51.9	31.9	38.6
8	81.1	52.7	68.0	52.5	66.2
16	125.4	97.4	111.1	102.6	112.4
32	235.2	190.8	216.7	185.8	197.1
64	430.8	362.8	352.6	340.7	366.3
128	754.7	725.4	769.1	620.1	718.5

GPU:

batch size	PyTorch	ONNX	JIT	TensorRT (fixed-length)	TensorRT (variable-length)
1	11.0	6.6	8.0	5.1	21.1
2	11.2	7.1	7.2	5.5	21.1
4	11.8	10.2	11.4	7.9	21.4
8	16.2	15.2	15.8	13.5	21.9
16	25.1	24.7	24.8	20.9	20.8
32	45.9	43.7	45.4	38.0	74.3
64	88.8	90.1	87.2	75.8	286.9
128	174.2	174.8	170.8	148.6	571.2

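Note: the TensorRT (fixed-length) column uses a fixed-length engine; the TensorRT (variable-length) column uses the variable-length engine with the input text length set to 16.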

Input Text Length 32

CPU:

batch size	PyTorch	ONNX	JIT	OpenVINO (fixed-length)	OpenVINO (variable-length)
1	45.8	18.0	39.7	21.8	22.7
2	64.7	29.8	51.1	32.2	32.4
4	80.6	52.9	65.6	52.7	52.0
8	129.4	96.8	115.2	95.4	92.6
16	244.9	191.7	213.5	178.8	196.3
32	420.1	363.7	362.3	312.2	318.2
64	794.5	733.0	701.9	636.4	713.7
128	1648.0	1477.6	1501.5	1281.9	1441.1

GPU:

batch size	PyTorch	ONNX	JIT	TensorRT (fixed-length)	TensorRT (variable-length, maxlen512)
1	10.6	7.5	7.2	5.5	74.6
2	12.0	10.5	11.7	7.9	74.9
4	16.3	15.6	16.0	13.5	75.4
8	24.8	24.7	24.8	21.8	76.4
16	44.9	44.1	45.6	37.3	74.9
32	86.5	89.4	86.6	74.5	74.4
64	173.3	174.6	170.7	146.1	288.0
128	326.9	346.8	322.3	285.6	1188.4

Input Text Length 64

CPU:

batch size	PyTorch	ONNX	JIT	OpenVINO (fixed-length)	OpenVINO (variable-length)
1	56.7	30.2	49.4	32.1	32.6
2	72.2	53.3	72.0	52.8	53.3
4	112.4	98.3	105.8	93.0	93.2
8	218.4	197.1	205.6	208.9	177.3
16	402.5	367.9	392.9	310.0	391.7
32	764.8	739.3	752.7	653.2	711.3
64	1531.4	1488.3	1442.0	1356.5	1580.3
128	3244.5	3139.7	2883.6	3063.2	3133.2

GPU:

batch size	PyTorch	ONNX	JIT	TensorRT (fixed-length)	TensorRT (variable-length, maxlen512)	TensorRT (variable-length, maxlen64)
1	11.8	10.5	11.5	8.0	287.7	287.4
2	16.3	15.4	16.1	13.7	287.8	287.5
4	25.3	24.6	25.3	22.0	288.6	291.6
8	44.9	43.6	46.2	39.4	293.1	291.1
16	88.9	89.8	86.7	74.7	287.3	288.9
32	176.2	174.9	170.5	146.2	287.9	288.9
64	332.7	348.1	322.7	287.3	288.6	288.6
128	675.4	683.1	655.1	569.2	1185.5	574.9

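Conclusions

The experimental results above support the following conclusions:

On CPU, ONNX is the fastest for short text and small batch sizes, with OpenVINO second. For long text and large batch sizes, fixed-length OpenVINO is the fastest in most configurations.
For OpenVINO, fixed-length and variable-length input perform about the same for short text and small batch sizes. As text length and batch size grow, the fixed-length version is generally faster than the variable-length one (for example 1281.9 ms vs. 1441.1 ms at length 32 and batch size 128).
On GPU, fixed-length TensorRT is much faster than variable-length TensorRT (for example 8.0 ms vs. 287.7 ms at length 64 and batch size 1). However, the fixed-length version may require padding the input text in practice, and the cost of that padding should be measured in the specific use case.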
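The padding step mentioned in the last point can be sketched as follows. This is a minimal illustration, assuming a pad token id of 0 and a fixed length of 32 to match the fixed-length engine exported earlier. `pad_batch` is a hypothetical helper and the token ids are made-up values; in practice the tokenizer usually handles this step.

```python
def pad_batch(batch_token_ids, fixed_len=32, pad_id=0):
    """Pad or truncate each sequence to fixed_len and build the matching masks."""
    input_ids, attention_mask, token_type_ids = [], [], []
    for ids in batch_token_ids:
        # Plain truncation; a real tokenizer would keep the final [SEP] token.
        ids = list(ids)[:fixed_len]
        n_pad = fixed_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 0 marks padding
        token_type_ids.append([0] * fixed_len)  # single-segment input
    return input_ids, attention_mask, token_type_ids


# Two toy sequences of token ids (made-up values) with lengths 4 and 6.
batch = [[101, 2769, 4263, 102], [101, 791, 1921, 2523, 1962, 102]]
input_ids, attention_mask, token_type_ids = pad_batch(batch)
print(len(input_ids[0]), sum(attention_mask[0]), sum(attention_mask[1]))
```

Timing this function together with the engine call, rather than the engine call alone, gives a fairer comparison against the variable-length engine for short inputs.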
