Neural Network Inference with TensorFlow Lite + Arm NN

        With the rapid development of deep learning, more and more neural networks can now run on embedded devices. Most tutorials online, however, target Android, probably because phones and tablets ship in huge volumes and attract the most attention, while material and projects for embedded Linux are comparatively scarce. I recently had to look into this for work, so I am writing up what I learned here in the hope that it helps someone; if you have experience in this area, I would also welcome answers to some of my open questions so we can all improve.

Hardware: an x86 laptop running Ubuntu 20.04, and an RK3399 Arm board running Ubuntu 20.04

Goal: run neural networks (SSD, YOLO) on the Arm board

Inference frameworks: TensorFlow Lite / Arm NN

        I only looked at Rockchip boards. The RK3399 has no NPU; if your board does have one, you can use Rockchip's rknn-toolkit directly, which reportedly runs YOLOv4-tiny at around 200 FPS.

        The conclusion first: in my experience Arm NN has the best support for Arm's own hardware, and the same network runs much faster with Arm NN than with TFLite. I had also tried an inference engine called Tengine; I did not study it in depth, but used out of the box without any tuning, Arm NN was still the fastest. However, Arm NN supports only a limited set of operators, so many impressive networks cannot be used with it directly. Arm NN does offer a plugin mechanism for TFLite: it can act as a TFLite delegate, so the operators Arm NN supports run through Arm NN while unsupported ones fall back to TFLite, and TFLite supports just about every network. Wouldn't that mean any new network could be run on an Arm board this way? Since my own grasp of neural networks is still limited, this looked like a once-and-for-all solution, so I gave it a try. It turns out that even this combination does not handle every network well; at least the YOLOv5-nano model I tested reported unsupported operators. So for now the experiment counts as a failure, but below I record the process of building Arm NN and TensorFlow Lite and of running the Arm NN delegate on top of TFLite, both for others' reference and so I don't forget it myself.

1. Building TensorFlow Lite

Goal: build TensorFlow Lite shared libraries for the Arm board and for the x86 PC

Official documentation

I remember the official documentation not being this usable before; maybe it has been improved since. Just follow it now and you will be fine.

Things to note:

1. The glibc version must be greater than 2.28; check with:

ldd --version

2. Build with Bazel.

3. You need to package the header files yourself, without breaking the directory structure. From inside the tensorflow folder:

find ./lite -name "*.h" | tar -cf headers.tar -T -

You also need the absl and flatbuffers headers. You will have to track these down yourself, because the versions must match and a copy from someone else may not work; at the end of the post I attach the ones I used for TFLite v2.3.1. A rough sketch of the whole flow follows below.
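For reference, a hedged end-to-end sketch of this step. The bazel target and the --config name below are the ones the official guide uses for recent TF 2.x releases and may differ for older checkouts, and the abseil/flatbuffers paths are purely illustrative:

```bash
# From the TensorFlow repo root: build the TFLite shared library
bazel build -c opt //tensorflow/lite:libtensorflowlite.so                          # host (x86_64) build
bazel build -c opt --config=elinux_aarch64 //tensorflow/lite:libtensorflowlite.so  # cross build for arm64

# Package the TFLite headers without flattening the directory structure
cd tensorflow
find ./lite -name "*.h" | tar -cf headers.tar -T -

# The abseil and flatbuffers headers must match the versions pinned by your TF checkout
# (see tensorflow/workspace.bzl); assuming you cloned matching copies somewhere:
cd /path/to/abseil-cpp  && find absl    \( -name "*.h" -o -name "*.inc" \) | tar -cf absl_headers.tar -T -
cd /path/to/flatbuffers && find include -name "*.h"                        | tar -cf flatbuffers_headers.tar -T -
```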

Reference links:

Tensorflow lite 编译Android JNI C++ 动态链接库 - 知乎

ubuntu18.04 tensorflow以及tensorflow lite源码编译C++库_上善若水-CSDN博客

There is also an approach online that adds a target to the BUILD file, and the Arm NN README uses that BUILD-file approach as well. I have not studied Bazel, but both approaches work; I personally recommend the official one.

2. Building Arm NN

        Download the Arm NN source; at the top level there is a BuildGuideCrossCompilation.md. If you only need to build Arm NN, follow that README. Because the instruction sets differ, a .so built for Arm and one built for x86 cannot be used interchangeably, but the headers are identical. The guide shows how to cross-compile on an x86 PC so that the resulting .so can be used on the Arm board. If cross-compilation is new to you it is worth reading up on; my understanding is that it mainly comes down to using a cross toolchain. Because the image on my board is not an official one, I chose instead to build the Arm .so directly on the Arm board and the x86 .so on the x86 PC (yes, Arm NN also runs on x86, just very slowly). The main steps from that README are summarized below.
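If you do go the cross-compilation route, on Ubuntu the aarch64 toolchain the guide relies on can be installed straight from the distribution packages (package names below are the standard Ubuntu ones):

```bash
# Cross compilers targeting 64-bit Arm Linux (they provide aarch64-linux-gnu-gcc / -g++)
sudo apt install gcc-aarch64-linux-gnu g++-aarch64-linux-gnu

# Common tools used by the build steps that follow
sudo apt install cmake scons autoconf libtool curl build-essential

aarch64-linux-gnu-g++ --version   # sanity check
```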

1). Building protobuf

## Build and install Google's Protobuf library

We support protobuf version 3.12.0
* Get protobuf from here: https://github.com/protocolbuffers/protobuf : 
```bash
git clone -b v3.12.0 https://github.com/google/protobuf.git protobuf
cd protobuf
git submodule update --init --recursive
./autogen.sh
```
* Build a native (x86_64) version of the protobuf libraries and compiler (protoc):
  (Requires curl, autoconf, libtool, and other build dependencies if not previously installed: sudo apt install curl autoconf libtool build-essential g++)
```
mkdir x86_64_build
cd x86_64_build
../configure --prefix=$HOME/armnn-devenv/google/x86_64_pb_install
make install -j16
cd ..
```
* Build the arm64 version of the protobuf libraries:
```
mkdir arm64_build
cd arm64_build
CC=aarch64-linux-gnu-gcc \
CXX=aarch64-linux-gnu-g++ \
../configure --host=aarch64-linux \
--prefix=$HOME/armnn-devenv/google/arm64_pb_install \
--with-protoc=$HOME/armnn-devenv/google/x86_64_pb_install/bin/protoc
make install -j16
cd ..
```

 The README builds both an x86 version and an arm64 version here. Since I chose not to cross-compile, I built only the x86 version on the PC and only the arm64 version on the board; in that case there is no need to pass any cross-compilation options, just configure and build directly:

mkdir arm64_build
cd arm64_build
../configure --prefix=$HOME/armnn-devenv/google/arm64_pb_install
make install -j16

A quick note on protobuf: it is a serialization library from Google, used for data exchange. To me it feels a bit like PCM in communication theory: encode what you want to transmit into bits, send it, and decode it back into the original information on the other side. The flatbuffers library built later serves the same purpose. Why build both? Because the ONNX parser needs protobuf and the TFLite parser needs flatbuffers. Deep learning clearly still has competing formats everywhere; hopefully something will eventually unify them so we can stop stumbling through these pitfalls.

I did not rebuild Boost and ran into no problems, so I skip it here.

If you do need to build Boost, download it and build it as follows:

cd $HOME/armnn-devenv
tar -zxvf boost_1_64_0.tar.gz
cd boost_1_64_0
echo "using gcc : arm : aarch64-linux-gnu-g++ ;" > user_config.jam
./bootstrap.sh --prefix=$HOME/armnn-devenv/boost_arm64_install
./b2 install toolset=gcc-arm link=static cxxflags=-fPIC --with-test --with-log --with-program_options -j32 --user-config=user_config.jam

2). Building ACL

## Build Compute Library
* Building the Arm Compute Library:
```bash
cd $HOME/armnn-devenv
git clone https://github.com/ARM-software/ComputeLibrary.git
cd ComputeLibrary/
git checkout <tag_name>
scons arch=arm64-v8a neon=1 opencl=1 embed_kernels=1 extra_cxx_flags="-fPIC" -j4 internal_only=0
```

For example, if you want to checkout release tag of 21.02:
```bash
git checkout v21.02
```

This is Arm's own compute library. The thing to watch is that the ACL version must match the Arm NN version built later; the newest version at the time, v21.08, failed to build for me, and switching to v21.05 worked. For the x86 build, drop a few of the options:

scons arch=x86_64 extra_cxx_flags="-fPIC" -j8

neon=1 targets the Arm CPU, and opencl=1 embed_kernels=1 target an OpenCL-capable GPU. I have no OpenCL setup on my PC, so I left those out of the x86 build. I built the x86 version only to make writing code more convenient, so it just has to work; Arm also states that ACL can indeed be built for x86 but does not recommend using it that way, since there is little point.

3). Building FlatBuffers

## Build Flatbuffer
* Building Flatbuffer version 1.12.0
```bash
cd $HOME/armnn-devenv
wget -O flatbuffers-1.12.0.tar.gz https://github.com/google/flatbuffers/archive/v1.12.0.tar.gz
tar xf flatbuffers-1.12.0.tar.gz
cd flatbuffers-1.12.0
rm -f CMakeCache.txt
mkdir build
cd build
cmake .. -DFLATBUFFERS_BUILD_FLATC=1 \
     -DCMAKE_INSTALL_PREFIX:PATH=$HOME/armnn-devenv/flatbuffers \
     -DFLATBUFFERS_BUILD_TESTS=0
make all install
```

* Build arm64 version of flatbuffer
```bash
cd ..
mkdir build-arm64
cd build-arm64
# Add -fPIC to allow us to use the libraries in shared objects.
CXXFLAGS="-fPIC" cmake .. -DCMAKE_C_COMPILER=/usr/bin/aarch64-linux-gnu-gcc \
     -DCMAKE_CXX_COMPILER=/usr/bin/aarch64-linux-gnu-g++ \
     -DFLATBUFFERS_BUILD_FLATC=1 \
     -DCMAKE_INSTALL_PREFIX:PATH=$HOME/armnn-devenv/flatbuffers-arm64 \
     -DFLATBUFFERS_BUILD_TESTS=0
make all install
```

Two versions are built here as well. As I remember, for the x86 build the command needs to be changed to:

CXXFLAGS="-fPIC" cmake .. -DFLATBUFFERS_BUILD_FLATC=1 \
     -DCMAKE_INSTALL_PREFIX:PATH=$HOME/armnn-devenv/flatbuffers \
     -DFLATBUFFERS_BUILD_TESTS=0

otherwise the Arm NN build will complain that it cannot find this library. If you build directly on the Arm board, there is no need to specify a toolchain:

CXXFLAGS="-fPIC" cmake .. \
     -DFLATBUFFERS_BUILD_FLATC=1 \
     -DCMAKE_INSTALL_PREFIX:PATH=$HOME/armnn-devenv/flatbuffers-arm64 \
     -DFLATBUFFERS_BUILD_TESTS=0

4). Building ONNX

## Build Onnx
* Building Onnx
```bash
cd $HOME/armnn-devenv
git clone https://github.com/onnx/onnx.git
cd onnx
git fetch https://github.com/onnx/onnx.git 553df22c67bee5f0fe6599cff60f1afc6748c635 && git checkout FETCH_HEAD
LD_LIBRARY_PATH=$HOME/armnn-devenv/google/x86_64_pb_install/lib:$LD_LIBRARY_PATH \
$HOME/armnn-devenv/google/x86_64_pb_install/bin/protoc \
onnx/onnx.proto --proto_path=. --proto_path=../google/x86_64_pb_install/include --cpp_out $HOME/armnn-devenv/onnx
```

The thing to note here is that the x86_64_pb_install part of the path must match the corresponding install path on the PC or on the Arm board. Neither ONNX nor TFLite is actually compiled in these steps; they are presumably built together when Arm NN itself is compiled.

5). Building TfLite

## Build TfLite
* Building TfLite (Tensorflow version 2.3.1)
```bash
cd $HOME/armnn-devenv
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow/
git checkout fcc4b966f1265f466e82617020af93670141b009
#cd $HOME/armnn-devenv
mkdir tflite
cd tflite
cp ../tensorflow/tensorflow/lite/schema/schema.fbs .
../flatbuffers-1.12.0/build/flatc -c --gen-object-api --reflect-types --reflect-names schema.fbs
```

The mkdir tflite step should be run after cd $HOME/armnn-devenv (the line that is commented out in the excerpt above); otherwise the relative paths will not resolve.

6). Building Arm NN

## Build Arm NN
* Compile Arm NN for arm64:
```bash
cd $HOME/armnn-devenv/armnn
mkdir build
cd build
```

* Use CMake to configure your build environment, update the following script and run it from the armnn/build directory to set up the Arm NN build:
```bash
#!/bin/bash
CXX=aarch64-linux-gnu-g++ CC=aarch64-linux-gnu-gcc cmake .. \
-DARMCOMPUTE_ROOT=$HOME/armnn-devenv/ComputeLibrary \
-DARMCOMPUTE_BUILD_DIR=$HOME/armnn-devenv/ComputeLibrary/build/ \
-DBOOST_ROOT=$HOME/armnn-devenv/boost_arm64_install/ \
-DARMCOMPUTENEON=1 -DARMCOMPUTECL=1 -DARMNNREF=1 \
-DONNX_GENERATED_SOURCES=$HOME/armnn-devenv/onnx \
-DBUILD_ONNX_PARSER=1 \
-DBUILD_TF_LITE_PARSER=1 \
-DTF_LITE_GENERATED_PATH=$HOME/armnn-devenv/tflite \
-DFLATBUFFERS_ROOT=$HOME/armnn-devenv/flatbuffers-arm64 \
-DFLATC_DIR=$HOME/armnn-devenv/flatbuffers-1.12.0/build \
-DPROTOBUF_ROOT=$HOME/armnn-devenv/google/x86_64_pb_install \
-DPROTOBUF_ROOT=$HOME/armnn-devenv/google/x86_64_pb_install/ \
-DPROTOBUF_LIBRARY_DEBUG=$HOME/armnn-devenv/google/arm64_pb_install/lib/libprotobuf.so.23.0.0 \
-DPROTOBUF_LIBRARY_RELEASE=$HOME/armnn-devenv/google/arm64_pb_install/lib/libprotobuf.so.23.0.0
```

* If you want to include standalone sample dynamic backend tests, add the argument to enable the tests and the dynamic backend path to the CMake command:
```bash
-DSAMPLE_DYNAMIC_BACKEND=1 \
-DDYNAMIC_BACKEND_PATHS=$SAMPLE_DYNAMIC_BACKEND_PATH
```
* Run the build
```bash
make -j32
```

That is the cross-compilation command for the Arm build. If you build directly on the Arm board, replace the paths with the ones from your own builds:

cmake .. \
-DARMCOMPUTE_ROOT=$HOME/armnn-devenv/ComputeLibrary \
-DARMCOMPUTE_BUILD_DIR=$HOME/armnn-devenv/ComputeLibrary/build/ \
-DBOOST_ROOT=$HOME/armnn-devenv/boost_arm64_install/ \
-DARMCOMPUTENEON=1 -DARMCOMPUTECL=1 -DARMNNREF=1 \
-DONNX_GENERATED_SOURCES=$HOME/armnn-devenv/onnx \
-DBUILD_ONNX_PARSER=1 \
-DBUILD_TF_LITE_PARSER=1 \
-DTF_LITE_GENERATED_PATH=$HOME/armnn-devenv/tflite \
-DFLATBUFFERS_ROOT=$HOME/armnn-devenv/flatbuffers-arm64 \
-DFLATC_DIR=$HOME/armnn-devenv/flatbuffers-1.12.0/build \
-DPROTOBUF_ROOT=$HOME/armnn-devenv/google/arm64_pb_install \
-DPROTOBUF_ROOT=$HOME/armnn-devenv/google/arm64_pb_install/ \
-DPROTOBUF_LIBRARY_DEBUG=$HOME/armnn-devenv/google/arm64_pb_install/lib/libprotobuf.so.23.0.0 \
-DPROTOBUF_LIBRARY_RELEASE=$HOME/armnn-devenv/google/arm64_pb_install/lib/libprotobuf.so.23.0.0

The x86 build needs a few changes; this is what I used:

cmake .. \
-DARMCOMPUTE_ROOT=$HOME/armnn-devenv/ComputeLibrary \
-DARMCOMPUTE_BUILD_DIR=$HOME/armnn-devenv/ComputeLibrary/build/ \
-DARMNNREF=1 \
-DONNX_GENERATED_SOURCES=$HOME/armnn-devenv/onnx \
-DBUILD_ONNX_PARSER=1 \
-DBUILD_TF_LITE_PARSER=1 \
-DTF_LITE_GENERATED_PATH=$HOME/armnn-devenv/tflite \
-DFLATBUFFERS_ROOT=$HOME/armnn-devenv/flatbuffers \
-DFLATC_DIR=$HOME/armnn-devenv/flatbuffers-1.12.0/build \
-DPROTOBUF_ROOT=$HOME/armnn-devenv/google/x86_64_pb_install \
-DPROTOBUF_ROOT=$HOME/armnn-devenv/google/x86_64_pb_install/ \
-DPROTOBUF_LIBRARY_DEBUG=$HOME/armnn-devenv/google/x86_64_pb_install/lib/libprotobuf.so.23.0.0 \
-DPROTOBUF_LIBRARY_RELEASE=$HOME/armnn-devenv/google/x86_64_pb_install/lib/libprotobuf.so.23.0.0

To summarize:

1. On Arm you can use NEON and OpenCL; on x86 you don't.

2. Don't get the paths wrong.

3. Set CXXFLAGS="-fPIC" when building flatbuffers.

After make -j 10086 you can go slack off for a while; it takes quite some time to build...

If you see internal compiler error: Killed (program cc1plus), the Arm board has most likely run out of memory; adding swap space fixes it.
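For reference, this is the usual way to add a temporary swap file on the board (the 4 GiB size is just an example):

```bash
# Create and enable a 4 GiB swap file to survive the memory-hungry parts of the build
sudo fallocate -l 4G /swapfile      # or: sudo dd if=/dev/zero of=/swapfile bs=1M count=4096
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
free -h                             # verify the extra swap shows up
# To keep it across reboots, add "/swapfile none swap sw 0 0" to /etc/fstab
```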

Once Arm NN is built you can use its API to load models and run inference. Since my goal here is to try TFLite's Arm NN delegate, I skip that part.

3. Building the Arm NN Delegate

To use the Arm NN delegate you do not actually need to build all of Arm NN; the ONNX parser and TFLite parser, for example, are not needed. Inside the armnn delegate folder there is another guide, BuildGuideNative.md, which also builds directly on the Arm board. The delegate can be built on x86 as well, but it errors out when you try to use it there.

1). Building TensorFlow Lite

## Build Tensorflow Lite for C++
Tensorflow has a few dependencies of its own. It requires the python packages pip3, numpy, wheel,
and also bazel which is used to compile Tensorflow. A description on how to build bazel can be
found [here](https://docs.bazel.build/versions/master/install-compile-source.html). There are multiple ways.
I decided to compile from source because that should work for any platform and therefore adds the most value
to this guide. Depending on your operating system and architecture there might be an easier way.
```bash
# Install the required python packages
pip3 install -U pip numpy wheel

# Bazel has a dependency on JDK (The specific JDK version depends on the bazel version but default-jdk tends to work.)
sudo apt-get install default-jdk
# Build Bazel
wget -O bazel-3.1.0-dist.zip https://github.com/bazelbuild/bazel/releases/download/3.1.0/bazel-3.1.0-dist.zip
unzip -d bazel bazel-3.1.0-dist.zip
cd bazel
env EXTRA_BAZEL_ARGS="--host_javabase=@local_jdk//:jdk" bash ./compile.sh 
# This creates an "output" directory where the bazel binary can be found
```

### Download and build Tensorflow Lite

```bash
cd $BASEDIR
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow/
git checkout tags/v2.3.1 # Minimum version required for the delegate
```
Before we build, a target for tensorflow lite needs to be defined in the `BUILD` file. This can be 
found in the root directory of Tensorflow. Append the following target to the file:
```bash
cc_binary(
     name = "libtensorflow_lite_all.so",
     linkshared = 1,
     deps = [
         "//tensorflow/lite:framework",
         "//tensorflow/lite/kernels:builtin_ops",
     ],
)
```
Now the build process can be started. When calling "configure", as below, a dialog shows up that asks the
user to specify additional options. If you don't have any particular needs to your build, decline all
additional options and choose default values.
```bash
PATH="$BASEDIR/bazel/output:$PATH" ./configure
$BASEDIR/bazel/output/bazel build --config=opt --config=monolithic --strip=always libtensorflow_lite_all.so
```

Two things to note:

1. Bazel is installed from source here, so make sure the Bazel path is correct.

2. The target above names the library libtensorflow_lite_all.so. My problem was that I had previously built libtensorflowlite.so and written my program against it, yet at runtime the Arm NN delegate library still looks for libtensorflow_lite_all.so. The two are really the same library under different names, so I fixed it with a symlink plus an environment variable (see the sketch below); I don't know whether simply naming the target libtensorflowlite.so from the start would also work.
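For reference, this is roughly the workaround (the install path is illustrative; adjust it to wherever your libraries actually live):

```bash
# The Arm NN delegate expects libtensorflow_lite_all.so; if your app was linked against
# libtensorflowlite.so, a symlink makes the same library visible under both names.
cd /usr/local/lib                                   # wherever the TFLite .so was installed
sudo ln -sf libtensorflowlite.so libtensorflow_lite_all.so
sudo ldconfig

# Or simply make sure the loader can find the directory containing the library
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
```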

2). Building FlatBuffers

## Build Flatbuffers
Flatbuffers is a memory efficient cross-platform serialization library as 
described [here](https://google.github.io/flatbuffers/). It is used in tflite to store models and is also a dependency 
of the delegate. After downloading the right version it can be built and installed using cmake.
```bash
cd $BASEDIR
wget -O flatbuffers-1.12.0.zip https://github.com/google/flatbuffers/archive/v1.12.0.zip
unzip -d . flatbuffers-1.12.0.zip
cd flatbuffers-1.12.0 
mkdir install && mkdir build && cd build
# I'm using a different install directory but that is not required
cmake .. -DCMAKE_INSTALL_PREFIX:PATH=$BASEDIR/flatbuffers-1.12.0/install 
make install
```

We already built this earlier; nothing special here.

3). Building ACL

## Build the Arm Compute Library

The Arm NN library depends on the Arm Compute Library (ACL). It provides a set of functions that are optimized for 
both Arm CPUs and GPUs. The Arm Compute Library is used directly by Arm NN to run machine learning workloads on 
Arm CPUs and GPUs.

It is important to have the right version of ACL and Arm NN to make it work. Luckily, Arm NN and ACL are developed 
very closely and released together. If you would like to use the Arm NN version "20.11" you should use the same "20.11"
version for ACL too.

To build the Arm Compute Library on your platform, download the Arm Compute Library and checkout the tag 
that contains the version you want to use. Build it using `scons`.
```bash
cd $BASEDIR
git clone https://review.mlplatform.org/ml/ComputeLibrary 
cd ComputeLibrary/
git checkout <tag_name> # e.g. v20.11
# The machine used for this guide only has a Neon CPU which is why I only have "neon=1" but if 
# your machine has an arm Gpu you can enable that by adding `opencl=1 embed_kernels=1 to the command below
scons arch=arm64-v8a neon=1 extra_cxx_flags="-fPIC" benchmark_tests=0 validation_tests=0 
```

Mind the version; I used v21.05. The guide's author enabled only NEON, but we want the GPU too, so:

scons arch=arm64-v8a neon=1 opencl=1 embed_kernels=1 extra_cxx_flags="-fPIC" benchmark_tests=0 validation_tests=0 

4). Building Arm NN


## Build the Arm NN Library

With ACL built we can now continue to building Arm NN. To do so, download the repository and checkout the matching 
version as you did for ACL. Create a build directory and use `cmake` to build it.
```bash
cd $BASEDIR
git clone "https://review.mlplatform.org/ml/armnn" 
cd armnn
git checkout <branch_name> # e.g. branches/armnn_20_11
mkdir build && cd build
# if you've got an arm Gpu add `-DARMCOMPUTECL=1` to the command below
cmake .. -DARMCOMPUTE_ROOT=$BASEDIR/ComputeLibrary -DARMCOMPUTENEON=1 -DBUILD_UNIT_TESTS=0 
make
```

Again version v21.05, and we want the GPU:

cmake .. -DARMCOMPUTE_ROOT=$BASEDIR/ComputeLibrary -DARMCOMPUTENEON=1 -DARMCOMPUTECL=1 -DBUILD_UNIT_TESTS=0 

5). Building the delegate

# Build the TfLite Delegate (Stand-Alone)

The delegate as well as Arm NN is built using `cmake`. Create a build directory as usual and build the delegate
with the additional cmake arguments shown below
```bash
cd $BASEDIR/armnn/delegate && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=release                               # A release build rather than a debug build.
         -DTENSORFLOW_ROOT=$BASEDIR/tensorflow \                  # The root directory where tensorflow can be found.
         -DTFLITE_LIB_ROOT=$BASEDIR/tensorflow/bazel-bin \        # Directory where tensorflow libraries can be found.
         -DFLATBUFFERS_ROOT=$BASEDIR/flatbuffers-1.12.0/install \ # Flatbuffers install directory.
         -DArmnn_DIR=$BASEDIR/armnn/build \                       # Directory where the Arm NN library can be found
         -DARMNN_SOURCE_DIR=$BASEDIR/armnn                        # The top directory of the Arm NN repository. 
                                                                  # Required are the includes for Arm NN
make
```

To ensure that the build was successful you can run the unit tests for the delegate that can be found in 
the build directory for the delegate. [Doctest](https://github.com/onqtam/doctest) was used to create those tests. Using test filters you can
filter out tests that your build is not configured for. In this case, because Arm NN was only built for Cpu 
acceleration (CpuAcc), we filter for all test suites that have `CpuAcc` in their name.
```bash
cd $BASEDIR/armnn/delegate/build
./DelegateUnitTests --test-suite=*CpuAcc* 
```
If you have built for Gpu acceleration as well you might want to change your test-suite filter:
```bash
./DelegateUnitTests --test-suite=*CpuAcc*,*GpuAcc*
```

A problem I ran into:

After building, the *GpuAcc* tests failed. It turned out to be a permission problem with the Arm Mali GPU device, which is really a system-level issue and involves rather scattered knowledge; in the end someone else fixed it for me. The reference link below helps with understanding, and the sketch after it shows the kind of fix involved.

opencl库移植_Vernon Blog-CSDN博客_libmali.so
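To give an idea of that kind of fix, here is a sketch; it assumes the Mali blob driver exposes the GPU as /dev/mali0, which is common on RK3399 images but may differ on yours, and the libmali path is only illustrative:

```bash
# Check who may open the GPU device node
ls -l /dev/mali0

# Quick test (does not survive a reboot)
sudo chmod 666 /dev/mali0

# More permanent: a udev rule giving the video group access, plus group membership
echo 'KERNEL=="mali0", MODE="0660", GROUP="video"' | sudo tee /etc/udev/rules.d/50-mali.rules
sudo udevadm control --reload-rules && sudo udevadm trigger
sudo usermod -aG video $USER        # log out and back in afterwards

# Some images also need an OpenCL loader symlink pointing at the Mali blob, e.g.:
# sudo ln -sf /usr/lib/aarch64-linux-gnu/libmali.so /usr/lib/aarch64-linux-gnu/libOpenCL.so
```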

4. Neural Network Inference with TFLite + the Arm NN Delegate

        With the preparation done we can try this seemingly perfect solution. To recap the idea: we run inference with the TFLite framework, and since our Arm board has a GPU, we use the Arm NN delegate to push the operations that can run on the GPU onto it and so speed things up. TFLite can also use other acceleration delegates; XNNPACK, for example, is Google's own CPU-acceleration library, and in my tests it really did roughly double the speed.

        My test below follows the minimal example in the TensorFlow source tree:

// Headers and helper definitions for this snippet; the tensor-name macros match the
// SSD model whose tensors are printed further below, and DEFAULT_INPUT_IMAGE is a
// placeholder path.
#include <cstdio>
#include <cstdlib>
#include <algorithm>
#include <memory>
#include <string>

#include <opencv2/opencv.hpp>

#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"
#include "tensorflow/lite/optional_debug_tools.h"
// #include <armnn_delegate.hpp>  // needed only if the Arm NN delegate block below is enabled

#define TFLITE_MINIMAL_CHECK(x)                              \
  if (!(x)) {                                                \
    fprintf(stderr, "Error at %s:%d\n", __FILE__, __LINE__); \
    exit(1);                                                 \
  }

#define INPUT_NAME "normalized_input_image_tensor"
#define OUTPUT_NAME_0 "TFLite_Detection_PostProcess"
#define OUTPUT_NAME_1 "TFLite_Detection_PostProcess:1"
#define OUTPUT_NAME_2 "TFLite_Detection_PostProcess:2"
#define OUTPUT_NAME_3 "TFLite_Detection_PostProcess:3"
#define DEFAULT_INPUT_IMAGE "test.jpg"  // placeholder: path to your test image

// Minimal detection-box holder used when reading the SSD outputs
struct BoundingBox {
  int32_t class_id;
  std::string label;
  float score;
  int32_t x, y, w, h;
};

using namespace tflite;

int main(int argc, char* argv[]) {
  if (argc != 2) {
    fprintf(stderr, "minimal <tflite model>\n");
    return 1;
  }
  const char* filename = argv[1];

  // Load model
  std::unique_ptr<tflite::FlatBufferModel> model =
      tflite::FlatBufferModel::BuildFromFile(filename);
  TFLITE_MINIMAL_CHECK(model != nullptr);

  // Build the interpreter
  tflite::ops::builtin::BuiltinOpResolver resolver;
  InterpreterBuilder builder(*model, resolver);
  std::unique_ptr<Interpreter> interpreter;
  builder(&interpreter);
  TFLITE_MINIMAL_CHECK(interpreter != nullptr);

  //   // Create the ArmNN Delegate
  // std::vector<armnn::BackendId> backends = { armnn::Compute::GpuAcc };
  //   // std::string backends = "GpuAcc";
  // armnnDelegate::DelegateOptions delegateOptions(backends);
  // std::unique_ptr<TfLiteDelegate, decltype(&armnnDelegate::TfLiteArmnnDelegateDelete)>
  //                       theArmnnDelegate(armnnDelegate::TfLiteArmnnDelegateCreate(delegateOptions),
  //                                        armnnDelegate::TfLiteArmnnDelegateDelete);
  // // Modify armnnDelegateInterpreter to use armnnDelegate
  // interpreter->ModifyGraphWithDelegate(theArmnnDelegate.get());


  // Allocate tensor buffers.
  TFLITE_MINIMAL_CHECK(interpreter->AllocateTensors() == kTfLiteOk);
  printf("=== Pre-invoke Interpreter State ===\n");
  tflite::PrintInterpreterState(interpreter.get());

  // Fill input buffers
  // TODO(user): Insert code to fill input tensors
  // Note: The buffer of the input tensor with index `i` of type T can
  // be accessed with `T* input = interpreter->typed_input_tensor<T>(i);`

  // Re-allocate memory according to the new input tensor size (already called above; harmless to repeat)
  interpreter->AllocateTensors();

  // Input tensor info
  int input_tensor_id;
  for (auto i : interpreter->inputs()) {
    const TfLiteTensor* tensor = interpreter->tensor(i);
    if (std::string(tensor->name) == INPUT_NAME) {
      input_tensor_id = i;
      printf("input tensor id : %d\n", input_tensor_id);
      printf("input tensor name : %s\n", tensor->name);
    }
  }

  // Output tensor info
  float* out_tensor_data_0;
  for (auto i : interpreter->outputs()) {
    const TfLiteTensor* tensor = interpreter->tensor(i);
    if (std::string(tensor->name) == OUTPUT_NAME_0) {
      printf("out tensor id: %d\n", i );
      printf("out tensor name : %s\n", tensor->name);
      out_tensor_data_0 = interpreter->typed_tensor<float>(i);
    }
  }
  float* out_tensor_data_1;
  for (auto i : interpreter->outputs()) {
    const TfLiteTensor* tensor = interpreter->tensor(i);
    if (std::string(tensor->name) == OUTPUT_NAME_1) {
      printf("out tensor id: %d\n", i );
      printf("out tensor name : %s\n", tensor->name);
      out_tensor_data_1 = interpreter->typed_tensor<float>(i);
    }
  }
  float* out_tensor_data_2;
  for (auto i : interpreter->outputs()) {
    const TfLiteTensor* tensor = interpreter->tensor(i);
    if (std::string(tensor->name) == OUTPUT_NAME_2) {
      printf("out tensor id: %d\n", i );
      printf("out tensor name : %s\n", tensor->name);
      out_tensor_data_2 = interpreter->typed_tensor<float>(i);
    }
  }
  float* out_tensor_data_3;
  for (auto i : interpreter->outputs()) {
    const TfLiteTensor* tensor = interpreter->tensor(i);
    if (std::string(tensor->name) == OUTPUT_NAME_3) {
      printf("out tensor id: %d\n", i );
      printf("out tensor name : %s\n", tensor->name);
      out_tensor_data_3 = interpreter->typed_tensor<float>(i);
    }
  }

  // Load the test image
  cv::Mat origin_img = cv::imread(DEFAULT_INPUT_IMAGE);
  // Resize the image and convert the channel order (BGR to RGB)
  cv::Mat img_src = cv::Mat::zeros(300, 300, CV_8UC3);
  cv::resize(origin_img, origin_img, cv::Size(300, 300));
  cv::cvtColor(origin_img, img_src, cv::COLOR_BGR2RGB);
  // Copy the image data into the input tensor
  uint8_t* dst = interpreter->typed_tensor<uint8_t>(input_tensor_id); 
  uint8_t* src = (uint8_t*)(img_src.data);
  std::copy(src, src + 300*300*3, dst);

  // Run inference
  TFLITE_MINIMAL_CHECK(interpreter->Invoke() == kTfLiteOk);

  // Read output buffers
  // TODO(user): Insert getting data out code.

  int32_t num_det = static_cast<int32_t>(out_tensor_data_3[0]);
  printf("num_det: %d\n", num_det);
  
  float threshold_confidence = 0.4;
    for (int32_t i = 0; i < num_det; i++) {
        if (out_tensor_data_2[i] < threshold_confidence) continue;
        printf("find a target!\n");
        BoundingBox bbox;
        bbox.class_id = static_cast<int32_t>(out_tensor_data_1[i]);
        // bbox.label = label_list_[bbox.class_id];
        // bbox.score = score_raw_list[i];
        bbox.x = static_cast<int32_t>(out_tensor_data_0[i * 4 + 1] * 300);
        bbox.y = static_cast<int32_t>(out_tensor_data_0[i * 4 + 0] * 300);
        bbox.w = static_cast<int32_t>((out_tensor_data_0[i * 4 + 3] - out_tensor_data_0[i * 4 + 1]) * 300);
        bbox.h = static_cast<int32_t>((out_tensor_data_0[i * 4 + 2] - out_tensor_data_0[i * 4 + 0]) * 300);
        cv::rectangle(img_src, cv::Rect(bbox.x, bbox.y, bbox.w, bbox.h),cv::Scalar(0, 0, 0), 1);
    }
  cv::namedWindow("result", 0);
  cv::imshow("result", img_src);
  cv::waitKey(0);
  return 0;
}

Explanation:

1. Printing the interpreter state shows the information for every tensor in the network. For SSD it looks like this:

input tensor id : 175
input tensor name : normalized_input_image_tensor
out tensor id: 167
out tensor name : TFLite_Detection_PostProcess
out tensor id: 168
out tensor name : TFLite_Detection_PostProcess:1
out tensor id: 169
out tensor name : TFLite_Detection_PostProcess:2
out tensor id: 170
out tensor name : TFLite_Detection_PostProcess:3

These are the tensors we need to deal with, the input and the outputs; set and read them according to each tensor's data type. In the Arm NN delegate block (commented out above) you can choose GpuAcc or CpuAcc as needed. I tried both: with CpuAcc some networks really did get faster while others did not, and with GpuAcc none did and inference actually got slower. I don't know what I did wrong there; if you do, please point it out, thanks.

I have uploaded the full demo to GitHub.
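For completeness, a rough sketch of how a demo like the one above can be compiled and run; the source and model file names and every path are assumptions that depend on your own layout, and on older systems the OpenCV pkg-config name may be opencv instead of opencv4:

```bash
# Assumed layout: TFLite/abseil/flatbuffers headers unpacked under ./include,
# libtensorflow_lite_all.so, libarmnn.so and libarmnnDelegate.so under ./lib
g++ -std=c++17 minimal_ssd.cc -o minimal_ssd \
    -I./include \
    -I/path/to/armnn/include -I/path/to/armnn/delegate/include \
    -L./lib -ltensorflow_lite_all -larmnnDelegate -larmnn \
    $(pkg-config --cflags --libs opencv4)

# Run it with a TFLite SSD model (file name is just an example)
LD_LIBRARY_PATH=./lib ./minimal_ssd ssd_mobilenet_v1.tflite
```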
