Test objects and platform
Test objects: GPT and C-Dial GPT
Test platform: Triton Inference Server
Performance test comparison
ONNX format
What is ONNX?
ONNX (Open Neural Network Exchange) is an open, framework-neutral format for representing machine-learning models, so a model trained in one framework (e.g. PyTorch) can be served by a different runtime (e.g. ONNX Runtime).
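For context, a model usually reaches this form through an exporter. Below is a minimal sketch of exporting a GPT-style Hugging Face model to ONNX; the "gpt2" checkpoint, output name, dtypes, and opset are illustrative stand-ins, not details from the original test. The input names match the --shape arguments used with perf_analyzer below.

```python
# Minimal ONNX export sketch, assuming a Hugging Face GPT-style model.
# "gpt2" is a stand-in checkpoint, NOT the model actually tested here.
import torch
from transformers import AutoModel

# return_dict=False / use_cache=False make the traced forward return a
# plain tensor tuple, which torch.onnx.export can handle directly.
model = AutoModel.from_pretrained("gpt2", return_dict=False, use_cache=False)
model.eval()

batch, seq_len = 1, 32  # matches the --shape ...:32 used with perf_analyzer
input_ids = torch.zeros(batch, seq_len, dtype=torch.long)
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)
token_type_ids = torch.zeros(batch, seq_len, dtype=torch.long)

torch.onnx.export(
    model,
    # a tuple ending in a dict passes the dict entries as keyword arguments
    (input_ids, {"attention_mask": attention_mask,
                 "token_type_ids": token_type_ids}),
    "model.onnx",
    input_names=["input_ids", "attention_mask", "token_type_ids"],
    output_names=["output"],  # illustrative name
    dynamic_axes={            # let Triton batch and vary sequence length
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "token_type_ids": {0: "batch", 1: "seq"},
        "output": {0: "batch", 1: "seq"},
    },
    opset_version=13,
)
```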
Command (perf_analyzer from the Triton SDK image: batch size 50, all three inputs of length 32, zero-valued input data, 95th-percentile latency reporting):
docker run --rm --net=host hub.yun.paic.com.cn/pib-core/ibudda-triton:tritonserver-21.06-py3-sdk perf_analyzer -m ibuddha_chitchat_onnx --percentile=95 -u localhost:8010 -b 50 --shape input_ids:32 --shape attention_mask:32 --shape token_type_ids:32 --input-data zero
Performance of plain ONNX vs. ONNX converted to TensorRT inside Triton
Throughput reported by perf_analyzer (infer/sec):

| Configuration | batch 1 | batch 50 |
| --- | ---: | ---: |
| `dynamic_batching { }` | 136 | 1500 |
| `dynamic_batching { }` + `optimization { execution_accelerators { gpu_execution_accelerator : [ { name : "tensorrt" } ] } }` | 264 | 1430 |
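For reference, the second row corresponds to a config.pbtxt along the following lines. This is a sketch, not the original file: the model name and input names are taken from the perf_analyzer command above, while max_batch_size, data types, dims, and the output name are assumptions.

```
name: "ibuddha_chitchat_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 64  # assumption; must cover the tested -b values
input [
  {
    name: "input_ids"
    data_type: TYPE_INT64  # assumption
    dims: [ -1 ]
  },
  {
    name: "attention_mask"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "token_type_ids"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "output"  # illustrative
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
dynamic_batching { }
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ { name : "tensorrt" } ]
  }
}
```

Deleting the optimization block yields the first row's configuration.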
Comparison with the PyTorch (TorchScript) form of the same model
Note that the TorchScript model exposes positional input names (INPUT__0, INPUT__1, INPUT__2) rather than the named inputs of the ONNX model, which is why the --shape arguments differ:
docker run --rm --net=host hub.yun.paic.com.cn/pib-core/ibudda-triton:tritonserver-21.06-py3-sdk perf_analyzer -m ibuddha_chitchat --percentile=95 -u localhost:8010 -b 1 --shape INPUT__0:32 --shape INPUT__1:32 --shape INPUT__2:32 --input-data zero
Throughput reported by perf_analyzer (infer/sec); the full parameter syntax is shown in the config sketch after this table:

| Configuration | batch 1 | batch 50 |
| --- | ---: | ---: |
| `dynamic_batching` | 64 | 1330 |
| `dynamic_batching` + `INFERENCE_MODE=true` | 99 | 1370 |
| `dynamic_batching` + `INFERENCE_MODE=true` + `ENABLE_NVFUSER=true` | 91 | 1300 |
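Written out as a config.pbtxt, the third row looks roughly like this. Same caveats as the ONNX sketch above: only the dynamic_batching and parameters lines come from the table, the rest is assumed.

```
name: "ibuddha_chitchat"
platform: "pytorch_libtorch"
max_batch_size: 64  # assumption
input [
  {
    name: "INPUT__0"  # TorchScript positional input naming
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "INPUT__1"
    data_type: TYPE_INT64
    dims: [ -1 ]
  },
  {
    name: "INPUT__2"
    data_type: TYPE_INT64
    dims: [ -1 ]
  }
]
output [
  {
    name: "OUTPUT__0"  # illustrative
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
dynamic_batching { }
parameters: { key: "INFERENCE_MODE" value: { string_value: "true" } }
parameters: { key: "ENABLE_NVFUSER" value: { string_value: "true" } }
```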
Options already enabled by default in the PyTorch backend (see the override example after this list):
ENABLE_JIT_EXECUTOR
ENABLE_JIT_PROFILING
ENABLE_TENSOR_FUSER
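To isolate the effect of any one of these while profiling, it can be overridden explicitly in config.pbtxt; a hypothetical example:

```
# Explicitly turn off one of the default-on PyTorch backend options
parameters: { key: "ENABLE_TENSOR_FUSER" value: { string_value: "false" } }
```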
Test conclusions
At batch 1, the ONNX model is roughly 35% faster than the best optimized PyTorch configuration (136 vs. 99 infer/sec).
Converting the ONNX model to TensorRT inside Triton improves throughput by more than 1.5× over the optimized PyTorch model, reaching about 2.7× its batch-1 throughput (264 vs. 99 infer/sec).
The advantage is concentrated at small batch sizes: at batch 50 every configuration lands in the 1300-1500 infer/sec range, and enabling ENABLE_NVFUSER actually lowered batch-1 throughput slightly (91 vs. 99 infer/sec).