OpenVINO benchmark comparison across batch sizes and compute devices

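This post records benchmark runs of the Xubett964 model with OpenVINO's benchmark_app.py on an Intel N4100 CPU, the integrated UHD600 GPU, and a MYRIAD NCS2. With a batch size of 1 the CPU performs best, the GPU comes second, and MYRIAD is slowest. As the batch size grows, throughput improves but latency rises as well. The conclusion is that the batch size should be chosen by balancing the size of the data to be processed, the structure of the algorithm, and the compute device.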


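I recently took part in the "Intel® OpenVINO™ Pilot Alliance DFRobot Industry AI Developer Competition". While benchmarking a model with benchmark_app.py, I found that the results on CPU, GPU, and MYRIAD differ widely depending on the batch size, so I have collected them here for reference.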

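Hardware platform:

The organizers provided a LattePanda Delta board with 4 GB of RAM and an Intel Neural Compute Stick 2 (NCS2). All figures in this post were measured on that platform.

The CPU is an Intel N-series Celeron N4100 with 4 cores and 4 threads, up to 2.40 GHz, and 4 MB of cache.

The GPU is the integrated UHD600, with a base frequency of 200 MHz and a maximum dynamic frequency of 700 MHz.

MYRIAD is the Intel NCS2, built around the Intel® Movidius™ Myriad™ X VPU and plugged into the LattePanda Delta through a USB 3.1 Type-A port.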

Software:

Windows 10, OpenVINO 2020.4, Python 3.6.5.

Model:

The model is Xubett964.fp16.xml, which takes a 28x28 image as input. Its network structure is shown below:

[Figure: Xubett964.png, network structure diagram]

Batch Size:1

benchmark_app.py  -m Xubett964.fp16.xml -i testImg2.png -d CPU

[ INFO ] Read network took 174.37 ms
[ INFO ] Network batch size: 1
[ INFO ] Load network took 234.38 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1 1 28 28

[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)

Count:      751604 iterations
Duration:   60002.19 ms
Latency:    0.29 ms
Throughput: 12526.28 FPS


benchmark_app.py  -m Xubett964.fp16.xml -i testImg2.png -d MYRIAD

[ INFO ] Read network took 31.25 ms
[ INFO ] Network batch size: 1
[ INFO ] Load network took 1589.09 ms

[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests, limits: 60000

Count:      33676 iterations
Duration:   60010.21 ms
Latency:    7.01 ms
Throughput: 561.17 FPS

benchmark_app.py  -m Xubett964.fp16.xml -i testImg2.png -d GPU

[ INFO ] Read network took 15.62 ms
[ INFO ] Network batch size: 1
[ INFO ] Load network took 14325.00 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1 1 28 28

[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 2 streams for GPU, limits: 60000 ms duration)

Count:      67800 iterations
Duration:   60000.80 ms
Latency:    3.39 ms
Throughput: 1129.98 FPS

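The CPU result is far ahead of both the GPU and MYRIAD: its throughput is about 11 times the GPU's and about 22 times MYRIAD's, and its latency is the lowest of the three.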
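The comparison above rests on the last four lines of each log. As a minimal sketch (the helper name `parse_report` and the regular expression are illustrative and not part of OpenVINO), those lines can be pulled out of the log text and the throughput ratios computed directly:

```python
import re

# Matches the four report lines printed at the end of a benchmark_app.py run.
REPORT_LINE = re.compile(r"^(Count|Duration|Latency|Throughput):\s+([\d.]+)", re.MULTILINE)


def parse_report(log_text):
    """Return the report values found in a log as a dict of floats."""
    return {name.lower(): float(value) for name, value in REPORT_LINE.findall(log_text)}


# Report blocks copied from the batch size 1 runs above.
LOGS = {
    "CPU": """Count:      751604 iterations
Duration:   60002.19 ms
Latency:    0.29 ms
Throughput: 12526.28 FPS
""",
    "MYRIAD": """Count:      33676 iterations
Duration:   60010.21 ms
Latency:    7.01 ms
Throughput: 561.17 FPS
""",
    "GPU": """Count:      67800 iterations
Duration:   60000.80 ms
Latency:    3.39 ms
Throughput: 1129.98 FPS
""",
}

reports = {device: parse_report(text) for device, text in LOGS.items()}
cpu_fps = reports["CPU"]["throughput"]

for device in ("GPU", "MYRIAD"):
    ratio = cpu_fps / reports[device]["throughput"]
    print(f"CPU vs {device}: {ratio:.1f}x throughput")
```

The same parser can be reused for the larger batch sizes, since the report block keeps the same layout in every run shown in this post.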

Batch Size:32

benchmark_app.py  -m Xubett964.fp16.xml -i testImg2.png -d CPU -b 32

[ INFO ] Read network took 31.27 ms
[ INFO ] Reshaping network: 'imageinput': [32, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 32
[ INFO ] Load network took 156.26 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 32 1 28 28

[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)

Count:      44064 iterations
Duration:   60006.13 ms
Latency:    5.48 ms
Throughput: 23498.40 FPS
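Note that Throughput counts images rather than iterations: the iteration count here is much lower than in the batch size 1 run, yet the FPS figure is higher. The numbers in the logs are consistent with throughput = iterations x batch size / duration. This relation is inferred from the logged values, not quoted from the tool's documentation, and the function name below is illustrative. A quick check against the runs shown so far:

```python
def throughput_fps(iterations, batch_size, duration_ms):
    """Images per second, assuming each iteration processes one full batch."""
    return iterations * batch_size / (duration_ms / 1000.0)


# (device, batch size, Count, Duration in ms, reported Throughput in FPS),
# copied from the logs above.
RUNS = [
    ("CPU", 1, 751604, 60002.19, 12526.28),
    ("MYRIAD", 1, 33676, 60010.21, 561.17),
    ("GPU", 1, 67800, 60000.80, 1129.98),
    ("CPU", 32, 44064, 60006.13, 23498.40),
]

for device, batch, count, duration_ms, reported in RUNS:
    computed = throughput_fps(count, batch, duration_ms)
    print(f"{device:6s} b={batch:<4d} computed {computed:10.2f} FPS, reported {reported:10.2f} FPS")
```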


benchmark_app.py  -m Xubett964.fp16.xml -i testImg2.png -d MYRIAD -b 32

[ INFO ] Read network took 31.27 ms
[ INFO ] Reshaping network: 'imageinput': [32, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 32
[ INFO ] Load network took 1561.80 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 32 1 28 28

[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests, limits: 60000

Count:      8680 iterations
Duration:   60017.96 ms
Latency:    27.45 ms
Throughput: 4627.95 FPS


benchmark_app.py  -m Xubett964.fp16.xml -i testImg2.png -d GPU -b 32

[ INFO ] Read network took 22.20 ms
[ INFO ] Reshaping network: 'imageinput': [32, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 32
[ INFO ] Load network took 14444.84 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 32 1 28 28

[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 2 streams for GPU, limits: 60000

Count:      25220 iterations
Duration:   60010.31 ms
Latency:    10.25 ms
Throughput: 13448.36 FPS

 

Batch Size:1024

benchmark_app.py  -m Xubett964.fp16.xml -i testImg2.png -d CPU -b 1024

[ INFO ] Read network took 31.24 ms
[ INFO ] Reshaping network: 'imageinput': [1024, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 1024
[ INFO ] Load network took 222.76 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1024 1 28 28

[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)

Count:      1420 iterations
Duration:   60276.96 ms
Latency:    183.38 ms
Throughput: 24123.31 FPS


benchmark_app.py  -m Xubett964.fp16.xml -i testImg2.png -d MYRIAD -b 1024

[ INFO ] Read network took 31.27 ms
[ INFO ] Reshaping network: 'imageinput': [1024, 1, 28, 28]
[ INFO ] Reshape network took 15.63 ms
[ INFO ] Network batch size: 1024
[ INFO ] Load network took 1521.97 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1024 1 28 28

[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests, limits: 60000

Count:      388 iterations
Duration:   60890.69 ms
Latency:    627.76 ms
Throughput: 6525.00 FPS


benchmark_app.py  -m Xubett964.fp16.xml -i testImg2.png -d GPU -b 1024

[ INFO ] Read network took 15.65 ms
[ INFO ] Reshaping network: 'imageinput': [1024, 1, 28, 28]
[ INFO ] Reshape network took 15.68 ms
[ INFO ] Network batch size: 1024
[ INFO ] Load network took 14388.24 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1024 1 28 28

[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 2 streams for GPU, limits: 60000

Count:      1128 iterations
Duration:   60223.88 ms
Latency:    189.66 ms
Throughput: 19179.64 FPS
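The nine runs above were launched by hand. A sketch of how the same sweep could be scripted follows. It uses only the `-m`, `-i`, `-d`, and `-b` options that appear in the commands above. The function names, the `script` path argument, and launching the tool through the current Python interpreter are assumptions made for illustration. The subprocess call avoids `capture_output` so that it also runs on Python 3.6.

```python
import subprocess
import sys

DEVICES = ["CPU", "MYRIAD", "GPU"]
BATCH_SIZES = [1, 32, 1024]


def build_command(model, image, device, batch_size, script="benchmark_app.py"):
    """Build one benchmark_app.py command line as an argument list."""
    cmd = [sys.executable, script, "-m", model, "-i", image, "-d", device]
    if batch_size != 1:
        # The batch size 1 runs in this post omit -b, so it is omitted here as well.
        cmd += ["-b", str(batch_size)]
    return cmd


def run_sweep(model, image, script="benchmark_app.py"):
    """Run every device and batch size combination and return the raw logs.

    Each run takes about 60 seconds of measurement plus the model load time.
    """
    logs = {}
    for batch_size in BATCH_SIZES:
        for device in DEVICES:
            cmd = build_command(model, image, device, batch_size, script)
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
                check=True,
            )
            logs[(device, batch_size)] = result.stdout
    return logs
```

Calling `run_sweep("Xubett964.fp16.xml", "testImg2.png")` would return a dictionary keyed by `(device, batch_size)`. The `script` argument has to point at the actual location of benchmark_app.py in the local OpenVINO installation.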

 

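Summary:

For small image inputs like the one used here (28x28), the CPU is much faster than the GPU and MYRIAD at a batch size of 1. The reason is that the GPU and MYRIAD spend more time than the CPU reading in the data, so I/O overhead makes up a large share of each request. MYRIAD reads its data over USB and has the longest delay; the GPU does somewhat better.

Increasing the batch size means more data is read per request, so the share of I/O overhead shrinks and the share of compute time grows. Throughput rises, and latency rises with it. Because memory and processing capacity are limited, throughput does not grow in proportion to the batch size: on the CPU, going from a batch size of 32 to 1024 raises throughput by less than 3% (23498.40 to 24123.31 FPS) while latency grows from 5.48 ms to 183.38 ms. A suitable batch size therefore has to be chosen according to the size of the data being processed, the structure of the algorithm, the latency that can be tolerated, and the compute device.

Lao Xu (老徐), 2020.7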
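To put numbers on this trade-off, the sketch below tabulates the nine results from the logs above and compares batch size 1 with batch size 1024 on each device. The names `RESULTS`, `amortized_ms_per_image`, and `scaling` are illustrative. The amortized time per image is simply 1000 / throughput, so it reflects all four concurrent inference requests, not a single one.

```python
# device -> batch size -> (Latency in ms, Throughput in FPS), copied from the logs above.
RESULTS = {
    "CPU": {1: (0.29, 12526.28), 32: (5.48, 23498.40), 1024: (183.38, 24123.31)},
    "MYRIAD": {1: (7.01, 561.17), 32: (27.45, 4627.95), 1024: (627.76, 6525.00)},
    "GPU": {1: (3.39, 1129.98), 32: (10.25, 13448.36), 1024: (189.66, 19179.64)},
}


def amortized_ms_per_image(throughput_fps):
    """Average wall-clock time spent per image, in milliseconds."""
    return 1000.0 / throughput_fps


def scaling(device, small=1, large=1024):
    """Return (throughput gain, latency growth) when going from small to large batch."""
    latency_small, fps_small = RESULTS[device][small]
    latency_large, fps_large = RESULTS[device][large]
    return fps_large / fps_small, latency_large / latency_small


for device, by_batch in RESULTS.items():
    for batch, (latency_ms, fps) in sorted(by_batch.items()):
        per_image = amortized_ms_per_image(fps)
        print(f"{device:6s} b={batch:<4d} latency {latency_ms:7.2f} ms, {per_image:.3f} ms per image")
    gain, growth = scaling(device)
    print(f"{device:6s} b=1 -> b=1024: throughput x{gain:.1f}, latency x{growth:.1f}")
```

From batch size 1 to 1024, CPU throughput less than doubles while GPU throughput grows about 17 times and MYRIAD throughput about 11.6 times, which fits the explanation that the fixed per-request overhead weighs most heavily on the GPU and MYRIAD at small batch sizes.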
