SNPE DSP vs AIP

Background

Qualcomm officially claims that the HTA (i.e., the AIP, Artificial Intelligence Processor) on the 865 board delivers up to 8 TOPS of compute, about 2.7x the DSP's 3 TOPS. In practice, however, a classification model we measured ran about as fast on the HTA as on the DSP, and sometimes even a little slower.

DSP vs AIP tests

Simple convolution-layer test 1

ONNX generation code

import argparse

import torch


class SampleModel(torch.nn.Module):
    def __init__(self):
        super(SampleModel, self).__init__()
        self.conv3x3_0 = torch.nn.Conv2d(3, 16, (3, 3))
        self.conv3x3_1 = torch.nn.Conv2d(16, 32, (3, 3))
        self.conv3x3_2 = torch.nn.Conv2d(32, 64, (3, 3))
        self.conv3x3_3 = torch.nn.Conv2d(64, 128, (3, 3))

    def forward(self, x):
        x = self.conv3x3_0(x)
        x = self.conv3x3_1(x)
        x = self.conv3x3_2(x)
        return self.conv3x3_3(x)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-o', '--output_onnx', type=str, default='Conv_4.onnx', help='onnx save path')
    args = parser.parse_args()

    net = SampleModel()
    model_name = args.output_onnx
    print(model_name)
    dummy_input = torch.randn(1, 3, 320, 320)
    torch.onnx.export(model=net,
                      args=dummy_input,
                      f=model_name,
                      input_names=['input'],
                      output_names=['output'])
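As a sanity check on the exported model: each 3x3 convolution without padding shrinks the spatial dimensions by 2, so the 320x320 input comes out as 312x312 with 128 channels. The arithmetic, in a quick sketch (pure Python, no torch needed):

```python
def conv2d_out_size(size, kernel=3, stride=1, padding=0):
    """Standard conv output-size formula: floor((size + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

size = 320
for _ in range(4):          # four stacked 3x3 convs, no padding
    size = conv2d_out_size(size)

print(size)  # 312 -> the exported model outputs 1x128x312x312
```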

SNPE quantization

The calibration dataset can simply be randomly generated, with each sample sized 320x320x3. The --enable_hta flag is added at quantization time so that the resulting DLC can run on the AIP: SNPE-hta_support
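A minimal sketch of generating such a random calibration set. The file names, directory, and sample count here are arbitrary; the assumption is the usual SNPE convention that each .raw file holds float32 values in NHWC order and the input list contains one path per line:

```python
import os

import numpy as np

os.makedirs('calib', exist_ok=True)
with open('raw_list.txt', 'w') as f:
    for i in range(16):  # sample count is arbitrary
        # one 320x320x3 random sample, float32, NHWC, dumped as a raw blob
        sample = np.random.randn(320, 320, 3).astype(np.float32)
        path = f'calib/sample_{i}.raw'
        sample.tofile(path)
        f.write(path + '\n')
```

The resulting raw_list.txt is what gets passed to snpe-dlc-quantize via --input_list.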

> snpe-dlc-quantize --input_dlc=./Conv_4.dlc \
--input_list=../face_det/raw_list_769M.txt \
--output_dlc=./Conv_4_int8_hta.dlc \
--enable_hta
[INFO] InitializeStderr: DebugLog initialized.
[INFO] Writing intermediate model
item already exists: 1124
[AIP_TF8_HTA : 0 1 2 3 4 ] ::1
item already exists: 1124
Starting assembler.
[INFO] Setting activation for layer: input and buffer: input
[INFO] bw: 8, min: -2.127063, max: 2.630841, delta: 0.018658, offset: -114.000000
[INFO] Setting activation for layer: Conv_0 and buffer: input.1
[INFO] bw: 8, min: -3.182152, max: 3.813096, delta: 0.027432, offset: -116.000000
[INFO] Setting activation for layer: Conv_1 and buffer: input.4
[INFO] bw: 8, min: -2.350746, max: 2.260332, delta: 0.018083, offset: -130.000000
[INFO] Setting activation for layer: Conv_2 and buffer: input.8
[INFO] bw: 8, min: -1.348760, max: 1.317393, delta: 0.010456, offset: -129.000000
[INFO] Setting activation for layer: Conv_3 and buffer: output
[INFO] bw: 8, min: -0.795907, max: 0.777397, delta: 0.006170, offset: -129.000000
[INFO] Running Graph Partitioner for SDM865
[INFO] Blob ID:2
[INFO] Writing quantized model to: ./Conv_4_int8_hta.dlc
[INFO] Compiling HTA metadata into DLC.
[INFO] Creating new AIP record aip.metadata0
[INFO] Record Version:: 1.2.0.0
[INFO] Compiler Version:: 1.6.2.1
[INFO] HTA Blob ID:: 1
NO ERRORS;
NO WARNINGS;
Please contact Qualcomm NPU team for potential further performance optimization for your model
item already exists: 1124
[INFO] Creating new AIP record aip.metadata1
[INFO] Record Version:: 1.2.0.0
[INFO] Compiler Version:: 1.6.2.1
[INFO] HTA Blob ID:: 2
[INFO] Successfully compiled HTA metadata into DLC.
[INFO] DebugLog shutting down.
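The min/max/delta/offset lines in the log above follow SNPE's TF-style 8-bit scheme: delta is the quantization step size (max - min) / 255, and offset is the zero point expressed as a negative integer. A simplified reconstruction (SNPE additionally nudges min/max so that zero is exactly representable, which these logged values already satisfy), checked against the input buffer line from the log:

```python
def tf8_params(vmin, vmax, bw=8):
    """Simplified TF-style quantization params: step size and zero-point offset."""
    levels = 2 ** bw - 1                # 255 steps for 8-bit
    delta = (vmax - vmin) / levels      # quantization step size
    offset = round(vmin / delta)        # zero point, as a (negative) offset
    return delta, offset

delta, offset = tf8_params(-2.127063, 2.630841)
print(f'delta={delta:.6f} offset={offset}')  # delta=0.018658 offset=-114
```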

Running DSP and AIP on a Qualcomm board (QCS8250)

Demo notes: this demo is written with the SNPE C++ API. The input is an image, which is resized to the model's input size; -d=2 selects the DSP runtime and -d=3 selects the AIP runtime. The -e flag can be ignored.

  • DSP: 397 ms

255|kona:/data/local/tmp/yeruihuan/snpe_test $ ./snpe_test -i=./lena.jpg \
-m=./Conv_4_int8_hta.dlc -e=7 -d=2
[08-18 10:42:33.517]  15069 15069 V snpe_test.cc:0131:  INFO: input img size: 320 x 320
[08-18 10:42:33.521]  15069 15069 V snpe_test.cc:0147:  INFO: use SNPE AI engine
[08-18 10:42:33.521]  15069 15069 V snpe_util.cc:0008:  SNPE runtime: DSP_FIXED8_TF
[08-18 10:42:33.918]  15069 15069 V snpe_test.cc:0105:  RunSNPESample cost: 397.628000 ms
  • AIP: 445 ms

kona:/data/local/tmp/yeruihuan/snpe_test $ ./snpe_test -i=./lena.jpg \
-m=./Conv_4_int8_hta.dlc -e=7 -d=3
[08-18 10:42:59.496]  15183 15183 V snpe_test.cc:0131:  INFO: input img size: 320 x 320
[08-18 10:42:59.500]  15183 15183 V snpe_test.cc:0147:  INFO: use SNPE AI engine
[08-18 10:42:59.500]  15183 15183 V snpe_util.cc:0008:  SNPE runtime: AIP_FIXED8_TF
npu_get_property status: 0
npu_get_property status: 0
FW CAPS [0] = 0x2007
FW CAPS [1] = 0x5
FW CAPS [2] = 0x0
FW CAPS [3] = 0x0
FW CAPS [4] = 0x0
FW CAPS [5] = 0x0
FW CAPS [6] = 0x0
FW CAPS [7] = 0x0
npu_get_property status: 0
NPU User Driver: npu_read_info 0
npu_get_property status: 0
npu_get_property status: 0
FW CAPS [0] = 0x2007
FW CAPS [1] = 0x5
FW CAPS [2] = 0x0
FW CAPS [3] = 0x0
FW CAPS [4] = 0x0
FW CAPS [5] = 0x0
FW CAPS [6] = 0x0
FW CAPS [7] = 0x0
npu_get_property status: 0
NPU driver built on: Nov 15 2021 16:31:41
npu_get_property status: 0
npu_get_property status: 0
FW CAPS [0] = 0x2007
FW CAPS [1] = 0x5
FW CAPS [2] = 0x0
FW CAPS [3] = 0x0
FW CAPS [4] = 0x0
FW CAPS [5] = 0x0
FW CAPS [6] = 0x0
FW CAPS [7] = 0x0
npu_get_property status: 0
DLBC compression enabled
item already exists: 1124
NET size 106496 off 0 id=ffffffff
INTERMEDIATE size 19384320 off 0 id=fffffffe
ACO buffer size 7688 fd 15 off 0
* NPU_Stats: npu_compile_get_objs(): 17.21 ms
DUAL ACO VA = 0fddf4000 Network VA = 0xff1a0000 Intermediate VA = 0xfde00000 Intermediate 1 VA= 0xfca00000
npu_load_network_v2: perf mode = 4 priority = 3f flags = 0x44 num layers = 5
* NPU_Stats: npu_load_network_v2: NPU + kernel : 16.71 ms
npu_load_network_v2: network handle = 0x10401
* NPU_Stats: npu_load_network(): 36.02 ms
* NPU_Stats: npu_alloc_buffer_v2(): 0.74 ms sts=0
* NPU_Stats: npu_alloc_buffer_v2(): 0.41 ms sts=0
npu_set_property status: 0
[08-18 10:42:59.945]  15183 15183 V snpe_test.cc:0105:  RunSNPESample cost: 445.387000 ms
* NPU_Stats: npu_free_buffer_v2(): 0.00 ms
* NPU_Stats: npu_free_buffer_v2(): 0.00 ms
* NPU_Stats: npu_unload_network(): NPU + kernel : 12.72 ms
free delayed buffer fbc00000
free delayed buffer fc980000
* NPU_Stats: npu_unload_network(): 31.72 ms
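When comparing many runs, the RunSNPESample cost lines can be pulled out of such logs with a small parser; the log format below is copied from the output above:

```python
import re

LOG = '''
[08-18 10:42:33.918]  15069 15069 V snpe_test.cc:0105:  RunSNPESample cost: 397.628000 ms
[08-18 10:42:59.945]  15183 15183 V snpe_test.cc:0105:  RunSNPESample cost: 445.387000 ms
'''

# Grab every "RunSNPESample cost: <float> ms" occurrence
costs = [float(m) for m in re.findall(r'RunSNPESample cost: ([\d.]+) ms', LOG)]
print(costs)  # [397.628, 445.387]
```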

SNPE SDK tool tests

Official link: SNPE-SDK-Tools

snpe-platform-validator

Checks SNPE compatibility on a device, i.e., whether the device supports SNPE's GPU, DSP, AIP, and other runtimes.
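To check several runtimes in one pass, the validator can be driven from a small wrapper. This sketch only builds the per-runtime command line, using the same flags as the invocation below; actually executing it would of course require the binary on the device:

```python
def validator_cmd(runtime):
    """Command line for snpe-platform-validator, one runtime at a time."""
    return ['./snpe-platform-validator',
            '--runtime', runtime,
            '--coreVersion', '--libVersion', '--debug']

for rt in ('gpu', 'dsp', 'aip'):
    print(' '.join(validator_cmd(rt)))
```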

1|kona:/data/local/tmp/SNPE_SDK_TOOL/bin/ # ./snpe-platform-validator \
--runtime aip \
--coreVersion --libVersion --debug

PF_VALIDATOR: DEBUG: starting calculator test
PF_VALIDATOR: DEBUG: Loading DSP stub: libcalculator.so
PF_VALIDATOR: DEBUG: Successfully loaded DSP library - 'libcalculator.so'.  Setting up pointers.
PF_VALIDATOR: DEBUG: Success in executing the sum function
npu_get_property status: 0
npu_get_property status: 0
FW CAPS [0] = 0x2007
FW CAPS [1] = 0x5
FW CAPS [2] = 0x0
FW CAPS [3] = 0x0
FW CAPS [4] = 0x0
FW CAPS [5] = 0x0
FW CAPS [6] = 0x0
FW CAPS [7] = 0x0
npu_get_property status: 0
NPU User Driver: npu_read_info 0
PF_VALIDATOR: DEBUG: Calling PlatformValidator->RuntimeCheck
PF_VALIDATOR: DEBUG: Testing for the support of AIP runtime.
npu_get_property status: 0
npu_get_property status: 0
FW CAPS [0] = 0x2007
FW CAPS [1] = 0x5
FW CAPS [2] = 0x0
FW CAPS [3] = 0x0
FW CAPS [4] = 0x0
FW CAPS [5] = 0x0
FW CAPS [6] = 0x0
FW CAPS [7] = 0x0
npu_get_property status: 0
NPU User Driver: npu_read_info 0
npu_get_property status: 0
npu_get_property status: 0
FW CAPS [0] = 0x2007
FW CAPS [1] = 0x5
FW CAPS [2] = 0x0
FW CAPS [3] = 0x0
FW CAPS [4] = 0x0
FW CAPS [5] = 0x0
FW CAPS [6] = 0x0
FW CAPS [7] = 0x0
npu_get_property status: 0
NPU driver built on: Nov 15 2021 16:31:41
npu_get_property status: 0
npu_get_property status: 0
FW CAPS [0] = 0x2007
FW CAPS [1] = 0x5
FW CAPS [2] = 0x0
FW CAPS [3] = 0x0
FW CAPS [4] = 0x0
FW CAPS [5] = 0x0
FW CAPS [6] = 0x0
FW CAPS [7] = 0x0
npu_get_property status: 0
DLBC compression enabled
item already exists: 1124
NET size 4096 off 0 id=ffffffff
INTERMEDIATE size 8192 off 0 id=fffffffe
ACO buffer size 2728 fd 13 off 0
* NPU_Stats: npu_compile_get_objs(): 15.23 ms
DUAL ACO VA = 0fffee000 Network VA = 0xffff3000 Intermediate VA = 0xffff0000 Intermediate 1 VA= 0xfffec000

npu_load_network_v2: perf mode = 0 priority = 0 flags = 0x44 num layers = 2
* NPU_Stats: npu_load_network_v2: NPU + kernel : 4.44 ms
npu_load_network_v2: network handle = 0x10101
* NPU_Stats: npu_load_network(): 20.81 ms
Unit Test on the runtime AIP: Passed.
SNPE is supported for runtime AIP on the device.
PF_VALIDATOR: DEBUG: Calling PlatformValidator->IsRuntimeAvailable
Runtime AIP Prerequisites: Present.
PF_VALIDATOR: DEBUG: Calling PlatformValidator->GetLibVersion
Library Version of the runtime AIP: Npu Lib v2

PF_VALIDATOR: DEBUG: Calling PlatformValidator->GetCoreVersion
Core Version of the runtime AIP: 135936

Running the validator for each runtime on the Qualcomm board shows that GPU, DSP, and AIP are all supported.
