How to Build and Use MACE, Xiaomi's Open-Source Framework

Kaam

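Reposted from https://www.jianshu.com/p/3be518027ac2

This article is translated from the official MACE manual and records my own reading and development process. I have tried to stay faithful to the original, but reading the original is still recommended.

https://media.readthedocs.org/pdf/mace/latest/mace.pdf
GitHub: https://github.com/xiaomi/mace

Notice: in case of any infringement, please contact the author for removal.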

How to build

Supported platforms

Platform	Supported version
TensorFlow	>= 1.6.0
Caffe	>= 1.0

Environment requirements

MACE requires the following dependencies:

software	version	install command
bazel	>= 0.13.0	bazel installation guide
android-ndk	r15c/r16b	NDK installation guide, or refer to the Dockerfile
adb	>= 1.0.32	apt-get install android-tools-adb
tensorflow	>= 1.6.0	pip install -I tensorflow==1.6.0 (if you use tensorflow model)
numpy	>= 1.14.0	pip install -I numpy==1.14.0
scipy	>= 1.0.0	pip install -I scipy==1.0.0
jinja2	>= 2.10	pip install -I jinja2==2.10
PyYaml	>= 3.12.0	pip install -I pyyaml==3.12
sh	>= 1.12.14	pip install -I sh==1.12.14
filelock	>= 3.0.0	pip install -I filelock==3.0.0
docker (for caffe)	>= 17.09.0-ce	docker installation guide

Note

Set ANDROID_NDK_HOME with: export ANDROID_NDK_HOME=/path/to/ndk

MACE provides a Dockerfile with all dependencies preinstalled, so you can build the development image directly from it.

cd docker
docker build -t xiaomimace/mace-dev .

Alternatively, pull the prebuilt image directly from Docker Hub:

docker pull xiaomimace/mace-dev

Then run the container with the following command.

# Create container
# Set 'host' network to use ADB
docker run -it --rm --privileged -v /dev/bus/usb:/dev/bus/usb --net=host \
           -v /local/path:/container/path xiaomimace/mace-dev /bin/bash

Usage

1. Get the MACE source code

git clone https://github.com/XiaoMi/mace.git
cd mace
git fetch --all --tags --prune

# Checkout the latest tag (i.e. release version)
tag_name=`git describe --abbrev=0 --tags`
git checkout tags/${tag_name}

2. Model preprocessing

TensorFlow
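TensorFlow provides the Graph Transform Tool, which improves inference efficiency through optimizations such as ops folding and redundant node removal. Optimizing the model before converting it is strongly recommended.
Suggested graph transformation and optimization commands for the different runtimes are given below: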

# CPU/GPU:
./transform_graph \
    --in_graph=tf_model.pb \
    --out_graph=tf_model_opt.pb \
    --inputs='input' \
    --outputs='output' \
    --transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
        strip_unused_nodes(type=float, shape="1,64,64,3")
        remove_nodes(op=Identity, op=CheckNumerics)
        fold_constants(ignore_errors=true)
        flatten_atrous_conv
        fold_batch_norms
        fold_old_batch_norms
        strip_unused_nodes
        sort_by_execution_order'

# DSP:
./transform_graph \
    --in_graph=tf_model.pb \
    --out_graph=tf_model_opt.pb \
    --inputs='input' \
    --outputs='output' \
    --transforms='strip_unused_nodes(type=float, shape="1,64,64,3")
        strip_unused_nodes(type=float, shape="1,64,64,3")
        remove_nodes(op=Identity, op=CheckNumerics)
        fold_constants(ignore_errors=true)
        fold_batch_norms
        fold_old_batch_norms
        backport_concatv2
        quantize_weights(minimum_size=2)
        quantize_nodes
        strip_unused_nodes
        sort_by_execution_order'

Caffe
The MACE converter only supports Caffe 1.0+, so upgrade your Caffe model first if necessary.

# Upgrade prototxt
$CAFFE_ROOT/build/tools/upgrade_net_proto_text MODEL.prototxt MODEL.new.prototxt

# Upgrade caffemodel
$CAFFE_ROOT/build/tools/upgrade_net_proto_binary MODEL.caffemodel MODEL.new.caffemodel

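3. Build static/shared libraries

3.1 Overview
MACE can build either static or shared libraries (selected with linkshared in the YAML model deployment file). There are two use cases.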

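Build an optimized library for specific SoCs
When target_socs is specified in the YAML file, the build tool automatically tunes the GPU kernels. How long this takes depends mainly on the complexity of the model.
Note: remember to plug in devices with the corresponding SoCs.
Build a generic library for all SoCs
When target_socs is not specified, the generated library is compatible with general devices.
Note: expect a performance drop of roughly 1% to 10% compared with the SoC-specific library.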

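MACE provides a command-line tool (tools/converter.py) for model conversion, compilation, test runs, benchmarking, and correctness validation.
Note:

tools/converter.py must be run from the root directory of the project.
When linkshared is set to 1, build_type should be proto. In that case only Android devices are supported.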

3.2 Overview of tools/converter.py

Commands

build
Build the library and the test tools

# Build library
python tools/converter.py build --config=models/config.yaml

run
Run the model

# Test model run time
python tools/converter.py run --config=models/config.yaml --round=100

# Validate the correctness by comparing the results against the
# original model and framework, measured with cosine distance for similarity.
python tools/converter.py run --config=models/config.yaml --validate

# Check the memory usage of the model (keep only one model in the configuration file)
python tools/converter.py run --config=models/config.yaml --round=10000 &
sleep 5
adb shell dumpsys meminfo | grep mace_run
kill %1

Warning: run can only be executed after build has completed.
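
The --validate option above compares outputs using cosine distance. As a rough illustration of that metric only (this is not MACE's validation code, and the function name CosineSimilarity is hypothetical), a minimal self-contained sketch could look like this. It assumes both outputs have already been flattened into float vectors of equal length.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of MACE: cosine similarity of two
// flattened outputs. The result lies in [-1, 1]; values close to 1 mean
// the two outputs point in almost the same direction.
// Returns 0 when the sizes differ, the input is empty, or a norm is zero.
double CosineSimilarity(const std::vector<float>& a,
                        const std::vector<float>& b) {
  if (a.size() != b.size() || a.empty()) return 0.0;
  double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    dot += static_cast<double>(a[i]) * b[i];
    norm_a += static_cast<double>(a[i]) * a[i];
    norm_b += static_cast<double>(b[i]) * b[i];
  }
  if (norm_a == 0.0 || norm_b == 0.0) return 0.0;
  return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}
```

Accumulating in double rather than float keeps rounding error small when the outputs contain many elements.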

benchmark
Benchmark and profile the model

# Benchmark model, get detailed statistics of each Op.
python tools/converter.py benchmark --config=models/config.yaml

Warning: benchmark can only be executed after build has completed.

Common arguments

option	type	default	commands	explanation
--omp_num_threads	int	-1	`run/benchmark`	number of threads
--cpu_affinity_policy	int	1	`run/benchmark`	0:AFFINITY_NONE/1:AFFINITY_BIG_ONLY/2:AFFINITY_LITTLE_ONLY
--gpu_perf_hint	int	3	`run/benchmark`	0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
--gpu_priority_hint	int	3	`run/benchmark`	0:DEFAULT/1:LOW/2:NORMAL/3:HIGH
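
The integer values in the table map to symbolic names. For logging or debugging a benchmark run, a small decoder might be handy. The sketch below is purely illustrative: the two helper functions are hypothetical and only restate the value lists from the table, they are not part of MACE.

```cpp
#include <string>

// Hypothetical helper: name of a --cpu_affinity_policy value,
// following the list given in the table.
std::string CpuAffinityPolicyName(int value) {
  switch (value) {
    case 0: return "AFFINITY_NONE";
    case 1: return "AFFINITY_BIG_ONLY";
    case 2: return "AFFINITY_LITTLE_ONLY";
    default: return "UNKNOWN";
  }
}

// Hypothetical helper: name of a --gpu_perf_hint or --gpu_priority_hint
// value; both options share the same value list in the table.
std::string GpuHintName(int value) {
  switch (value) {
    case 0: return "DEFAULT";
    case 1: return "LOW";
    case 2: return "NORMAL";
    case 3: return "HIGH";
    default: return "UNKNOWN";
  }
}
```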

Use -h to get help information

python tools/converter.py -h
python tools/converter.py build -h
python tools/converter.py run -h
python tools/converter.py benchmark -h

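4. Deployment

The build command generates the static/shared library, which is packaged together with the model files and header files as build/${library_name}/libmace_${library_name}.tar.gz

The generated static library has the following structure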

build/
└── mobilenet-v2-gpu
    ├── include
    │   └── mace
    │       └── public
    │           ├── mace.h
    │           └── mace_runtime.h
    ├── libmace_mobilenet-v2-gpu.tar.gz
    ├── lib
    │   ├── arm64-v8a
    │   │   └── libmace_mobilenet-v2-gpu.MI6.msm8998.a
    │   └── armeabi-v7a
    │       └── libmace_mobilenet-v2-gpu.MI6.msm8998.a
    ├── model
    │   ├── mobilenet_v2.data
    │   └── mobilenet_v2.pb
    └── opencl
        ├── arm64-v8a
        │   └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
        └── armeabi-v7a
            └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin

The generated shared library has the following structure

build
└── mobilenet-v2-gpu
    ├── include
    │   └── mace
    │       └── public
    │           ├── mace.h
    │           └── mace_runtime.h
    ├── lib
    │   ├── arm64-v8a
    │   │   ├── libgnustl_shared.so
    │   │   └── libmace.so
    │   └── armeabi-v7a
    │       ├── libgnustl_shared.so
    │       └── libmace.so
    ├── model
    │   ├── mobilenet_v2.data
    │   └── mobilenet_v2.pb
    └── opencl
        ├── arm64-v8a
        │   └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin
        └── armeabi-v7a
            └── mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin

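Note:

Running on the DSP depends on libhexagon_controller.so
The ${MODEL_TAG}.pb file is only generated when build_type is proto
The ${library_name}_compiled_opencl_kernel.${device_name}.${soc}.bin file is only generated when target_socs is specified and the runtime is gpu
The generated shared library depends on libgnustl_shared.so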
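
The naming patterns above can be turned into paths programmatically, for example to locate the precompiled OpenCL binary before handing it to the runtime. The following is a minimal sketch under the assumption that the layout matches the directory trees shown earlier. The helper names PackagePath and OpenCLBinaryPath are hypothetical and not part of MACE.

```cpp
#include <string>

// Hypothetical helper: path of the packaged archive,
// build/${library_name}/libmace_${library_name}.tar.gz
std::string PackagePath(const std::string& library_name) {
  return "build/" + library_name + "/libmace_" + library_name + ".tar.gz";
}

// Hypothetical helper: path of the precompiled OpenCL kernel binary for
// one ABI, device and SoC, assuming the opencl/<abi>/ layout shown in
// the directory trees above.
std::string OpenCLBinaryPath(const std::string& library_name,
                             const std::string& abi,
                             const std::string& device_name,
                             const std::string& soc) {
  return "build/" + library_name + "/opencl/" + abi + "/" + library_name +
         "_compiled_opencl_kernel." + device_name + "." + soc + ".bin";
}
```

With the example values from the trees, OpenCLBinaryPath("mobilenet-v2-gpu", "arm64-v8a", "MI6", "msm8998") yields build/mobilenet-v2-gpu/opencl/arm64-v8a/mobilenet-v2-gpu_compiled_opencl_kernel.MI6.msm8998.bin.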

5. How to use the generated library in your project
The key steps are listed below. For complete usage, see mace/examples/example.cc

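// Include the headers
#include <functional>
#include <map>
#include <memory>
#include <numeric>
#include <string>
#include <vector>

#include "mace/public/mace.h"
#include "mace/public/mace_runtime.h"
// Only needed when build_type is code
#include "mace/public/mace_engine_factory.h"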

// 0. Declare the device type (must match ``runtime`` in the configuration file)
DeviceType device_type = DeviceType::GPU;

// 1. Set pre-compiled OpenCL binary program file paths when available
if (device_type == DeviceType::GPU) {
  mace::SetOpenCLBinaryPaths(opencl_binary_paths);
}

// 2. Set the compiled OpenCL kernel cache. This reduces initialization
// time, since compiling the kernels is slow. Setting it is recommended
// even when a pre-compiled OpenCL program file is provided, because an
// OpenCL version upgrade may also lead to kernel recompilation.
const std::string file_path = "path/to/opencl_cache_file";
std::shared_ptr<KVStorageFactory> storage_factory(
    new FileStorageFactory(file_path));
ConfigKVStorageFactory(storage_factory);

// 3. Define the input and output tensor names.
std::vector<std::string> input_names = {...};
std::vector<std::string> output_names = {...};

// 4. Create MaceEngine instance
std::shared_ptr<mace::MaceEngine> engine;
MaceStatus create_engine_status;
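// Use only one of the two calls below, depending on build_type.
// Either create the engine from compiled code (build_type is code) ...
create_engine_status =
    CreateMaceEngineFromCode(model_name.c_str(),
                             nullptr,
                             input_names,
                             output_names,
                             device_type,
                             &engine);
// ... or create the engine from the model file (build_type is proto)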
create_engine_status =
    CreateMaceEngineFromProto(model_pb_data,
                              model_data_file.c_str(),
                              input_names,
                              output_names,
                              device_type,
                              &engine);
if (create_engine_status != MaceStatus::MACE_SUCCESS) {
  // Report error
}

// 5. Create Input and Output tensor buffers
std::map<std::string, mace::MaceTensor> inputs;
std::map<std::string, mace::MaceTensor> outputs;
for (size_t i = 0; i < input_count; ++i) {
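  // Allocate the input buffer. The int64_t initial value makes
  // std::accumulate compute the product in 64 bits instead of int.
  int64_t input_size =
      std::accumulate(input_shapes[i].begin(), input_shapes[i].end(),
                      static_cast<int64_t>(1), std::multiplies<int64_t>());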
  auto buffer_in = std::shared_ptr<float>(new float[input_size],
                                          std::default_delete<float[]>());
  // Load input here
  // ...

  inputs[input_names[i]] = mace::MaceTensor(input_shapes[i], buffer_in);
}

for (size_t i = 0; i < output_count; ++i) {
  // Allocate the output buffer; the product is accumulated as int64_t.
  int64_t output_size =
      std::accumulate(output_shapes[i].begin(), output_shapes[i].end(),
                      static_cast<int64_t>(1), std::multiplies<int64_t>());
  auto buffer_out = std::shared_ptr<float>(new float[output_size],
                                           std::default_delete<float[]>());
  outputs[output_names[i]] = mace::MaceTensor(output_shapes[i], buffer_out);
}

// 6. Run the model
MaceStatus status = engine->Run(inputs, &outputs);
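
The buffer sizing in step 5 can be tried in isolation, without the MACE headers. The sketch below is a minimal standalone version, assuming shapes are stored as std::vector<int64_t>. The helper names ElementCount and MakeFloatBuffer are hypothetical and only mirror the pattern used above.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <numeric>
#include <vector>

// Hypothetical helper: number of elements in a tensor of the given shape.
// The initial value is an int64_t, so std::accumulate carries the running
// product as int64_t rather than int.
int64_t ElementCount(const std::vector<int64_t>& shape) {
  return std::accumulate(shape.begin(), shape.end(), static_cast<int64_t>(1),
                         std::multiplies<int64_t>());
}

// Hypothetical helper: zero-initialized float buffer owned by a shared_ptr.
// The array deleter ensures delete[] is used when the last owner goes away.
std::shared_ptr<float> MakeFloatBuffer(int64_t count) {
  return std::shared_ptr<float>(new float[count](),
                                std::default_delete<float[]>());
}
```

For the shape "1,64,64,3" used in the graph transform commands earlier, ElementCount returns 12288, so the buffer would hold 12288 floats.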