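While taking part in the "Intel® OpenVINO™ Pilot Alliance DFRobot Industry AI Developer Contest", I benchmarked a model with benchmark_app.py and found that the results on CPU, GPU, and MYRIAD differ greatly depending on the Batch Size. The results are collected here for reference.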
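Hardware platform:
The organizers provided a LattePanda Delta and an Intel Neural Compute Stick 2 (NCS2). The board has 4 GB of RAM, and all data in this article was measured on this platform.
The CPU is an Intel Celeron N4100 (new N series), 4 cores / 4 threads, up to 2.40 GHz, with a 4 MB cache.
The GPU is the integrated UHD 600, with a base frequency of 200 MHz and a maximum dynamic frequency of 700 MHz.
MYRIAD is the Intel NCS2, built around the Intel® Movidius™ Myriad™ X VPU and plugged into the LattePanda Delta through a USB 3.1 Type-A port.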
Software:
Windows 10, OpenVINO 2020.4, Python 3.6.5.
Model:
The model is Xubett964.fp16.xml, which takes a 28x28 image as input.
[Figure: network structure diagram]

Batch Size: 1
benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d CPU
[ INFO ] Read network took 174.37 ms
[ INFO ] Network batch size: 1
[ INFO ] Load network took 234.38 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1 1 28 28
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)
Count: 751604 iterations
Duration: 60002.19 ms
Latency: 0.29 ms
Throughput: 12526.28 FPS
benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d MYRIAD
[ INFO ] Read network took 31.25 ms
[ INFO ] Network batch size: 1
[ INFO ] Load network took 1589.09 ms
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests, limits: 60000
Count: 33676 iterations
Duration: 60010.21 ms
Latency: 7.01 ms
Throughput: 561.17 FPS
benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d GPU
[ INFO ] Read network took 15.62 ms
[ INFO ] Network batch size: 1
[ INFO ] Load network took 14325.00 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1 1 28 28
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 2 streams for GPU, limits: 60000 ms duration)
Count: 67800 iterations
Duration: 60000.80 ms
Latency: 3.39 ms
Throughput: 1129.98 FPS
At 12526.28 FPS, the CPU result is very good, far ahead of the GPU (1129.98 FPS) and MYRIAD (561.17 FPS).
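To put a number on the gap at Batch Size 1, the short sketch below divides the CPU throughput by that of the other two devices. The throughput values are copied from the logs above; the helper name `speedup_over` is made up for this illustration.

```python
# Throughput (FPS) at batch size 1, copied from the benchmark logs above.
fps_b1 = {"CPU": 12526.28, "GPU": 1129.98, "MYRIAD": 561.17}


def speedup_over(baseline, table):
    """Return how many times faster `baseline` is than each other device."""
    return {
        dev: table[baseline] / fps
        for dev, fps in table.items()
        if dev != baseline
    }


ratios = speedup_over("CPU", fps_b1)
for dev, ratio in ratios.items():
    print(f"CPU is {ratio:.1f}x faster than {dev}")
```

With these figures the CPU comes out at roughly 11x the GPU and roughly 22x MYRIAD.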
Batch Size: 32
benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d CPU -b 32
[ INFO ] Read network took 31.27 ms
[ INFO ] Reshaping network: 'imageinput': [32, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 32
[ INFO ] Load network took 156.26 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 32 1 28 28
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)
Count: 44064 iterations
Duration: 60006.13 ms
Latency: 5.48 ms
Throughput: 23498.40 FPS
benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d MYRIAD -b32
[ INFO ] Read network took 31.27 ms
[ INFO ] Reshaping network: 'imageinput': [32, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 32
[ INFO ] Load network took 1561.80 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 32 1 28 28
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests, limits: 60000
Count: 8680 iterations
Duration: 60017.96 ms
Latency: 27.45 ms
Throughput: 4627.95 FPS
benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d GPU -b32
[ INFO ] Read network took 22.20 ms
[ INFO ] Reshaping network: 'imageinput': [32, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 32
[ INFO ] Load network took 14444.84 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 32 1 28 28
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 2 streams for GPU, limits: 60000
Count: 25220 iterations
Duration: 60010.31 ms
Latency: 10.25 ms
Throughput: 13448.36 FPS
Batch Size: 1024
benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d CPU -b1024
[ INFO ] Read network took 31.24 ms
[ INFO ] Reshaping network: 'imageinput': [1024, 1, 28, 28]
[ INFO ] Reshape network took 0.00 ms
[ INFO ] Network batch size: 1024
[ INFO ] Load network took 222.76 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1024 1 28 28
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 4 streams for CPU, limits: 60000 ms duration)
Count: 1420 iterations
Duration: 60276.96 ms
Latency: 183.38 ms
Throughput: 24123.31 FPS
benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d MYRIAD -b1024
[ INFO ] Read network took 31.27 ms
[ INFO ] Reshaping network: 'imageinput': [1024, 1, 28, 28]
[ INFO ] Reshape network took 15.63 ms
[ INFO ] Network batch size: 1024
[ INFO ] Load network took 1521.97 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1024 1 28 28
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests, limits: 60000
Count: 388 iterations
Duration: 60890.69 ms
Latency: 627.76 ms
Throughput: 6525.00 FPS
benchmark_app.py -m Xubett964.fp16.xml -i testImg2.png -d GPU -b1024
[ INFO ] Read network took 15.65 ms
[ INFO ] Reshaping network: 'imageinput': [1024, 1, 28, 28]
[ INFO ] Reshape network took 15.68 ms
[ INFO ] Network batch size: 1024
[ INFO ] Load network took 14388.24 ms
[ INFO ] Network input 'imageinput' precision FP32, dimensions (NCHW): 1024 1 28 28
[Step 10/11] Measuring performance (Start inference asyncronously, 4 inference requests using 2 streams for GPU, limits: 60000
Count: 1128 iterations
Duration: 60223.88 ms
Latency: 189.66 ms
Throughput: 19179.64 FPS
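The Throughput figures in the logs appear to count images rather than iterations: each one matches Count x Batch Size / Duration. That relation is inferred from the numbers above, not taken from the benchmark_app.py source, so treat the sketch below as a sanity check under that assumption. The function name `throughput_fps` is hypothetical.

```python
# (device, batch size) -> (Count, Duration in ms, reported Throughput in FPS),
# all copied from the logs above.
runs = {
    ("CPU", 1): (751604, 60002.19, 12526.28),
    ("CPU", 32): (44064, 60006.13, 23498.40),
    ("CPU", 1024): (1420, 60276.96, 24123.31),
    ("MYRIAD", 1): (33676, 60010.21, 561.17),
    ("MYRIAD", 32): (8680, 60017.96, 4627.95),
    ("MYRIAD", 1024): (388, 60890.69, 6525.00),
    ("GPU", 1): (67800, 60000.80, 1129.98),
    ("GPU", 32): (25220, 60010.31, 13448.36),
    ("GPU", 1024): (1128, 60223.88, 19179.64),
}


def throughput_fps(count, batch, duration_ms):
    """Images per second: iterations x images per iteration / elapsed seconds."""
    return count * batch / (duration_ms / 1000.0)


# Recompute throughput for every run; key[1] is the batch size.
recomputed = {
    key: throughput_fps(count, key[1], duration_ms)
    for key, (count, duration_ms, _reported) in runs.items()
}

for key, fps in recomputed.items():
    print(key, f"recomputed {fps:.2f} FPS, reported {runs[key][2]:.2f} FPS")
```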
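Summary:
For small image inputs like the 28x28 ones used here, the CPU is much faster than the GPU and MYRIAD at a Batch Size of 1. This is because the GPU and MYRIAD spend more time than the CPU on reading data: MYRIAD reads its data over USB and has the longest delay, the GPU fares somewhat better, and in both cases IO overhead accounts for a large share of the total time. Increasing the Batch Size makes each data transfer larger, so the share of IO overhead shrinks and the share of compute time grows; Throughput goes up, and Latency goes up with it. Limited by memory and processing power, Throughput does not grow in proportion to the Batch Size. A suitable Batch Size therefore has to be chosen according to the size of the data being processed, the structure of the algorithm, the acceptable Latency, and the inference device.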
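The scaling behaviour described here can be read directly off the measurements. The sketch below computes how much Throughput and Latency grow between batch sizes, using only values copied from the logs; the helper name `scaling` is made up for this illustration.

```python
# Throughput (FPS) and Latency (ms) per device and batch size, from the logs.
fps = {
    "CPU": {1: 12526.28, 32: 23498.40, 1024: 24123.31},
    "GPU": {1: 1129.98, 32: 13448.36, 1024: 19179.64},
    "MYRIAD": {1: 561.17, 32: 4627.95, 1024: 6525.00},
}
latency_ms = {
    "CPU": {1: 0.29, 32: 5.48, 1024: 183.38},
    "GPU": {1: 3.39, 32: 10.25, 1024: 189.66},
    "MYRIAD": {1: 7.01, 32: 27.45, 1024: 627.76},
}


def scaling(table, small, large):
    """Growth factor of a metric when moving from batch `small` to `large`."""
    return {dev: by_batch[large] / by_batch[small] for dev, by_batch in table.items()}


gain_1_to_32 = scaling(fps, 1, 32)
gain_32_to_1024 = scaling(fps, 32, 1024)
latency_growth_1_to_1024 = scaling(latency_ms, 1, 1024)

for dev in fps:
    print(
        f"{dev}: throughput x{gain_1_to_32[dev]:.2f} (1 -> 32), "
        f"x{gain_32_to_1024[dev]:.2f} (32 -> 1024); "
        f"latency x{latency_growth_1_to_1024[dev]:.0f} (1 -> 1024)"
    )
```

Going from 1 to 32, a 32x larger batch, the GPU gains about 12x and MYRIAD about 8x, while the CPU gains less than 2x. Going from 32 to 1024, every device gains less than 1.5x while Latency keeps climbing, which is the trade-off the summary describes.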
Lao Xu, 2020.7




