On-Device Neural Net Inference with Mobile GPUs

Juhyun Lee, Nikolay Chirkov, Ekaterina Ignasheva, Yury Pisarchyk, Mogan Shieh, Fabio Riccardi, Raman Sarokin, Andrei Kulik, and Matthias Grundmann

Google Research

1600 Amphitheatre Pkwy, Mountain View, CA 94043, USA


Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains.

Abstract

On-device inference of machine learning models for mobile phones is desirable due to its lower latency and increased privacy. Running such a compute-intensive task solely on the mobile CPU, however, can be difficult due to limited computing power, thermal constraints, and energy consumption. App developers and researchers have begun exploiting hardware accelerators to overcome these challenges. Recently, device manufacturers are adding neural processing units into high-end phones for on-device inference, but these account for only a small fraction of hand-held devices. In this paper, we present how we leverage the mobile GPU, a ubiquitous hardware accelerator on virtually every phone, to run inference of deep neural networks in real-time for both Android and iOS devices. By describing our architecture, we also discuss how to design networks that are mobile GPU-friendly. Our state-of-the-art mobile GPU inference engine is integrated into the open-source project TensorFlow Lite and publicly available at https://www.tensorflow.org/lite.


1. Introduction

On-device machine learning (ML) offers a variety of benefits. The most apparent is the improved inference latency: By skipping the data upload to the server and wait-time for the inference result, the app can respond more quickly to the user’s request. Removing the server dependency has additional benefits, such as:

  • Removing the need to maintain inference servers,
  • Running with limited or no connectivity, and
  • Reducing privacy concerns as the user data remains on the device.

However, on-device ML is not trivial. Despite both recent advances in mobile hardware technology and efforts to efficiently run deep networks on mobile devices, mobile CPUs continue to be less powerful than those found in servers. Running deep net inference on a mobile device means adding a significant compute-intensive task to the CPU, which competes with existing logic. Fully utilizing the mobile CPU comes with additional unwanted costs: increased energy consumption leads to shorter battery life, and an increase in the phone’s thermal profile causes throttling, resulting in slower computation.

Dynamic frequency scaling (CPU throttling) is a power management technique in computer architecture whereby the frequency of a microprocessor can be automatically adjusted “on the fly” depending on the actual needs, to conserve power and reduce the amount of heat generated by the chip.

Dynamic frequency scaling almost always appears in conjunction with dynamic voltage scaling, since higher frequencies require higher supply voltages for the digital circuit to yield correct results. The combined topic is known as dynamic voltage and frequency scaling (DVFS).


Hardware accelerators such as digital signal processors offer solutions to overcome these challenges. The demand for on-device ML has led to recent trends of phone manufacturers integrating dedicated neural processing units (NPUs) for high-end next-generation phones, which account for only a small fraction of the current distribution of mobile devices.

Our primary goal is a fast inference engine with wide coverage for TensorFlow Lite (TFLite) [8]. By leveraging the mobile GPU, a ubiquitous hardware accelerator on virtually every phone, we can achieve real-time performance for various deep network models. Table 1 demonstrates that the GPU has significantly more compute power than the CPU.

Table 1. Example of available compute power on mobile in gigaflops (billion floating point operations per second). FP16 and FP32 refer to 16- and 32-bit floating point arithmetic, respectively.


This paper presents the techniques we adopt for TFLite GPU and how we achieve an average acceleration of 2-9x for various deep networks on GPU compared to CPU inference. We first describe the general mobile GPU architecture and GPU programming, followed by how we materialize this with Compute Shaders for Android devices with OpenGL ES 3.1+ [16], and Metal Shaders for iOS devices with iOS 9+ [1].


2. Related Work

Various research efforts from both academia and industry endeavor to bring deep neural network inference, previously limited to servers, to mobile devices. Those efforts can be roughly categorized into three strategies:

  • Network architecture-driven,
  • Hardware-driven, and
  • ML framework-driven.



Neural network researchers have focused on optimizing their network architectures explicitly for processing on-device in various domains such as image classification [10, 21], object localization [11], and image enhancements [13, 14]. Many of these techniques involve reducing the model size by re-designing the network architecture and adding pre-/post-training quantization of weights. With these, one can achieve faster computation and smaller memory footprint, leading to reduced inference latency at the cost of slightly degraded model accuracy. MorphNet [9] takes a unique path of reducing the number of floating point operations per second which is optimized during training of the model. Our work is complementary to these efforts and instead focuses on optimizing the inference engine that runs the neural network rather than the model or training.

Major hardware manufacturers have made architectural changes responding to demands for faster mobile inference, and are publishing software development kits (SDKs) to expose them: Arm Compute Library [4], Huawei HiAI SDK [12], MediaTek NeuroPilot SDK [17], and Qualcomm SNPE SDK [20]. These libraries are vendor-specific and either cannot be re-used on a different architecture or do not guarantee the expected performance boost on other platforms. Our work does not add new hardware or SDKs. Instead, we use well-established hardware, the mobile GPU, and well-supported graphics and compute standards such as OpenGL [16] and Metal [1], to achieve high-performance neural network inference.


Apple presented the Metal Performance Shaders with support for convolutional neural networks [3] accelerated by the GPU. This is a solution built on top of the Metal API and allows custom operations. Our approach is analogous to Apple’s on iOS devices. Apple also released CoreML [2], an end-to-end solution for inference on mobile devices using CPU, GPU, and NPU, if available.

Android introduced the Android Neural Networks API [7] that serves as a layer between hardware and higher-level ML frameworks that vendors must implement for Android 8.1 or later. Our work has wider coverage and does not depend on a specific Android version, or require vendors to implement individual APIs for deep network processing.


Some of the latest mobile-friendly ML frameworks are:

  • Caffe2 [6], which focuses on CPU inference and uses the Arm Compute Library for Arm Mali GPUs.
  • MACE [24], which employs OpenCL; OpenCL is not a part of the standard Android OS.

TFLite GPU leverages the mobile GPU with OpenGL ES for Android devices and Metal for iOS devices. The specific version requirements are OpenGL ES 3.1+ and iOS 9+, which are available for more than 52% of all Android devices [23]. One of our biggest strengths is that our framework employs open standards, i.e. it is not limited to a specific hardware vendor, and thus covers a wide range of devices.


3. General Architecture

This section explains the general architecture of TFLite GPU, consisting of an initialization phase followed by a model inference phase. The techniques in this section are independent of the architecture of the underlying GPU.

3.1. Initialization

TFLite provides APIs for the delegation of the execution of neural network sub-graphs to another library. We exploit this feature to integrate the GPU backend into TFLite. Given a neural net model, TFLite first checks whether it can execute all the operators in the model with our GPU delegate. Our GPU backend identifies supported operators, and TFLite then partitions the graph into several sub-graphs, substituting the sub-graphs with virtual “delegate nodes”. From that point, the GPU backend is responsible for executing this sub-graph, as depicted in Figure 1. Unsupported operators are by default computed by the CPU. Ideally, the whole graph would be compatible with our mobile GPU backend for maximum performance.
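To make the partitioning step concrete, the sketch below shows, under simplified assumptions, how a linear op sequence can be split into GPU-delegated sub-graphs, with unsupported ops left to the CPU. The op names and the SUPPORTED_OPS set are illustrative only and not the actual TFLite implementation.

```python
# Minimal sketch (not the actual TFLite code): partition a linear op sequence
# into GPU-delegated sub-graphs and CPU-resident ops.
SUPPORTED_OPS = {"CONV_2D", "DEPTHWISE_CONV_2D", "ADD", "RELU"}  # illustrative subset

def partition(ops):
    """Group maximal runs of GPU-supported ops into 'delegate nodes'."""
    segments, current = [], []
    for op in ops:
        if op in SUPPORTED_OPS:
            current.append(op)                      # extend the current GPU sub-graph
        else:
            if current:
                segments.append(("GPU_DELEGATE", current))
                current = []
            segments.append(("CPU", [op]))          # unsupported op stays on the CPU
    if current:
        segments.append(("GPU_DELEGATE", current))
    return segments

print(partition(["CONV_2D", "RELU", "CUSTOM_OP", "ADD", "CONV_2D"]))
# [('GPU_DELEGATE', ['CONV_2D', 'RELU']), ('CPU', ['CUSTOM_OP']), ('GPU_DELEGATE', ['ADD', 'CONV_2D'])]
```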


Figure 1. TFLite’s delegate mechanism: Operations supported by the GPU delegate will run on the GPU, and the rest on the CPU.

As our mobile GPU inference engine is primarily designed for high-performance execution, we first inspect the model and resolve obvious inefficiencies. For example:

  • Merging PAD as an option of another op where it was previously described separately.
  • Removing superfluous identity operations, e.g. RESIZE with scale one or single-input ADD/CONCAT.

While these inefficiencies might be caught by the architect, artifacts such as these crop up inevitably, and we should still optimize these whenever possible.
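As an illustration of this clean-up pass, here is a small sketch (using an assumed dictionary-based op representation, not TFLite’s actual graph transformations) that drops identity ops such as a RESIZE with scale one or a single-input ADD/CONCAT and rewires their consumers:

```python
# Sketch only: remove ops that are identities under the conditions described above.
def is_identity(op):
    if op["type"] == "RESIZE" and op.get("scale") == 1:
        return True
    if op["type"] in ("ADD", "CONCAT") and len(op["inputs"]) == 1:
        return True
    return False

def simplify(ops):
    """Drop identity ops and rewire their consumers to the identity's input."""
    renamed = {}                                    # identity output -> surviving tensor
    kept = []
    for op in ops:
        op["inputs"] = [renamed.get(t, t) for t in op["inputs"]]
        if is_identity(op):
            renamed[op["output"]] = op["inputs"][0]
        else:
            kept.append(op)
    return kept
```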


Note that, in contrast to CPU backends which work without initialization, GPU backends require initialization involving shader compilation and optimization by the driver before inference. The cost of this process depends on network size and may range from a few milliseconds to seconds, but it is incurred once and not again for subsequent runs until the shader cache is invalidated for any of the following reasons: the application is updated or re-installed, the device is rebooted, the cache runs out of space, or other OS-specific reasons.


3.2. Running Inference

The inference phase is fairly straightforward. The input tensors are reshaped to the PHWC4 format detailed later in Section 4, if their tensor shape has a channel size not equal to 4. For each operator, shader programs are linked by binding resources such as the operator’s input/output tensors, weights, etc., and dispatched, i.e. inserted into the command queue. The GPU driver then takes care of scheduling and executing all shader programs in the queue, and makes the result available to the CPU by CPU/GPU synchronization. There might be a final conversion from PHWC4 to HWC, if the output tensor has a channel size not equal to 4.


Figure 2. Example of PHWC4 memory layout (best viewed in color). A tensor of shape $(H{=}8, W{=}6, C{=}12)$ is split into 4-element slices of size $(H, W, 4)$ which are stored sequentially as a continuous 2D array of size $(HC/4{=}24, 4W{=}24)$.

HC/4 = H * (C / 4) = 8 * (12 / 4) = 8 * 3 = 24
4W = W4 = W * 4 = 6 * 4 = 24

For maximum performance, one should avoid CPU/GPU synchronization at all cost, and preferably, never leave the GPU context if real-time processing is needed. The most ideal scenario would be the following: a camera provides an RGBA texture that goes directly to TFLite GPU, and the output of the network is then directly rendered to the screen.


Shader Program Optimization
In the GPU inference engine, operators exist in the form of shader programs. The shader programs eventually get compiled and inserted into the command queue and the GPU executes programs from this queue without synchronization with the CPU.

To reduce the number of shader programs in the command queue, we consolidate them into meaningful aggregates while maximizing parallelism and well-defined data dependencies.

The following techniques are employed when generating the source code for the shader programs (a toy illustration follows the list):

  • Fusing element-wise operators with computationally expensive operators, e.g. activations with convolution, to reduce the number of shader programs.
  • In-lining parameters and small objects directly into the shader program to reduce memory I/O overhead.
  • Baking uniforms into the source code, instead of passing them at run-time, allowing drivers to produce more optimal code.
  • Creating specialized versions of shaders, like “convolution with 1 x 1 kernel size”, to manually optimize shaders for particular cases.
  • Implementing specialization of shader programs optimized for a certain architecture to improve the op’s performance on the said environment.
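The code generation ideas above can be illustrated with a toy generator that bakes parameters into the emitted shader source instead of passing them as run-time uniforms, fuses an element-wise activation into the same program, and emits a specialized variant for the 1x1-kernel case. The GLSL-like snippets and parameter names are illustrative only, not the TFLite GPU shaders.

```python
# Toy source generator: constants are baked into the shader text ("uniform baking"),
# a fused ReLU is appended when requested, and a specialized path handles 1x1 kernels.
def conv_shader_source(kernel_h, kernel_w, relu_fused):
    body = []
    if (kernel_h, kernel_w) == (1, 1):
        body.append("vec4 acc = weights[0] * src(gid);   // specialized 1x1 path")
    else:
        body.append("vec4 acc = vec4(0.0);")
        body.append(f"for (int k = 0; k < {kernel_h * kernel_w}; ++k)")
        body.append("    acc += weights[k] * src_at(gid, k);")
    if relu_fused:                                   # element-wise op fused into the same shader
        body.append("acc = max(acc, vec4(0.0));")
    body.append("dst(gid) = acc;")
    return "\n".join(body)

print(conv_shader_source(1, 1, relu_fused=True))
```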

In computing, inline expansion, or inlining, is a manual or compiler optimization that replaces a function call site with the body of the called function.
Inline expansion is similar to macro expansion, except that it is performed at compile time. It can be requested by hand (e.g. with an inline keyword) or applied automatically by compiler optimization. Its main benefit is removing the call overhead (pushing the stack, saving and restoring state); the trade-off is that the generated code may grow in size, which can hurt the instruction-cache hit rate.


After the source code for each program is generated, each shader gets compiled. This compilation step can take a while, from several milliseconds to seconds. Typically, app developers can hide this latency while loading the model or starting the app for the first time. Once all shader programs are compiled, the GPU backend is ready for inference.

4. Data Layout

Most modern GPUs use a homogeneous coordinate [18] system which represents points in space with coordinates $(x, y, z, w)$. A homogeneous coordinate $(x, y, z, w)$, where $w \neq 0$, represents a point $(x/w, y/w, z/w, 1)$ in a 3D space. This allows affine transformations and projective transformations to be represented in the form of 4D matrix multiplications. GPUs are essentially processors optimized for 4-element vector compute and load/store operations.

In Euclidean geometry, an affine transformation or affinity is a geometric transformation that preserves lines and parallelism, but not necessarily Euclidean distances and angles.
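As a small illustration of the two paragraphs above (with $R = (r_{ij})$ an assumed $3{\times}3$ linear part and $t$ a translation), a single $4{\times}4$ matrix applied to the homogeneous coordinates of a point performs the whole affine transformation:

$$
\begin{pmatrix} x' \\ y' \\ z' \\ 1 \end{pmatrix}
=
\begin{pmatrix}
r_{11} & r_{12} & r_{13} & t_x \\
r_{21} & r_{22} & r_{23} & t_y \\
r_{31} & r_{32} & r_{33} & t_z \\
0 & 0 & 0 & 1
\end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
\quad\Longleftrightarrow\quad
\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} = R \begin{pmatrix} x \\ y \\ z \end{pmatrix} + t .
$$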

While TFLite does not restrict tensors to a certain shape, many operators assume 4D input/output tensors shaped as $[B, H, W, C]$ where $B$, $H$, $W$, $C$ respectively represent batch size, height, width, and channel size. For convenience, the rest of the paper will mostly describe tensors assuming a batch size of 1, or $[H, W, C]$ for short. This simplified example can be generalized if we consider batches to be a concatenation of multiple $[H, W, C]$ tensors.

In TFLite GPU, an $[H, W, C]$ tensor is split into 4-channel slices which are stored sequentially in memory. If the number of channels is not divisible by 4, it is padded with zeroes. This memory layout, called PHWC4 (Figure 2), optimally reduces cache misses in the graphics architecture. This is tightly coupled with how compute threads are executed on the GPU, which defines the order of computation, and more importantly, the order of memory load instructions.
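A minimal NumPy sketch of this conversion (assuming batch size 1; not the actual TFLite implementation) pads the channel dimension up to a multiple of 4 and lays the 4-channel slices out one after another:

```python
import numpy as np

def hwc_to_phwc4(t):
    """Convert an (H, W, C) tensor to PHWC4: 4-channel slices stored sequentially."""
    h, w, c = t.shape
    slices = (c + 3) // 4                       # number of 4-channel slices, ceil(C/4)
    padded = np.zeros((h, w, slices * 4), dtype=t.dtype)
    padded[:, :, :c] = t                        # zero-pad channels up to a multiple of 4
    # The slice index becomes the outermost dimension: layout is (C/4, H, W, 4).
    phwc4 = padded.reshape(h, w, slices, 4).transpose(2, 0, 1, 3)
    return np.ascontiguousarray(phwc4)

t = np.arange(8 * 6 * 12, dtype=np.float32).reshape(8, 6, 12)
print(hwc_to_phwc4(t).shape)   # (3, 8, 6, 4): matches the (HC/4 = 24, 4W = 24) 2D view of Figure 2
```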


In mathematics, homogeneous coordinates or projective coordinates, introduced by August Ferdinand Möbius in his 1827 work Der barycentrische Calcul, are a system of coordinates used in projective geometry, just as Cartesian coordinates are used in Euclidean geometry. They have the advantage that the coordinates of points, including points at infinity, can be represented using finite coordinates. Formulas involving homogeneous coordinates are often simpler and more symmetric than their Cartesian counterparts. Homogeneous coordinates have a range of applications, including computer graphics and 3D computer vision, where they allow affine transformations and, in general, projective transformations to be easily represented by a matrix.

Figure 3. Compute shader execution grid $(X{=}12, Y{=}12, Z{=}8)$ built upon the tensor shape $(H{=}10, W{=}10, C{=}6)$ shown in blue (best viewed in color). Work group size $(x{=}4, y{=}4, z{=}4)$ highlighted as cubes with bold lines. Each cell represents a FP32 value.


4.1. Work Groups: GPU Threading Units

A GPU compute task consists of a shader program and a grid. Every thread executes the same shader program, but on a different region of a 3D mesh problem space. The global grid is made up of repeated work groups of constant shape $(x, y, z)$ and has a total dimension $(X, Y, Z)$ which is a multiple of these work groups.

Every operation in the graph has at least one output 3D tensor. If there is more than one output tensor, we use one of them as a basis for the compute grid size calculation. The grid may be larger than the actual output tensor, because we expand it to sizes in multiples of 4 due to GPUs working efficiently for those sizes. This causes the creation of threads which do nothing and return at the beginning of the main function, but this is faster than working with misaligned grid sizes which prevents efficient optimization of byte code. The described situation is visualized in Figure 3, where blue color highlights useful threads which will actually calculate output values, and red color highlights stub threads. Further tuning of the compute grid/work group sizes is described in subsection 4.2.
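The sketch below (illustrative, with the work group size of Figure 3 assumed) shows how a global grid can be derived from an output tensor shape by rounding each dimension up to the work group size; the rounding is what creates the stub threads described above.

```python
def align_up(value, multiple):
    return ((value + multiple - 1) // multiple) * multiple

def compute_grid(h, w, c, work_group=(4, 4, 4)):
    """Global grid (X, Y, Z) covering an (H, W, C) tensor, rounded up to the work group."""
    x = align_up(w, work_group[0])
    y = align_up(h, work_group[1])
    z = align_up(c, work_group[2])
    return x, y, z

print(compute_grid(10, 10, 6))   # (12, 12, 8), as in Figure 3
```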

Optimizations are focused on neighboring threads within a work group - those spawned in sequential order as described. The PHWC4 layout provides the advantage of allowing neighboring threads to hit the same cache line when requesting data for input tensors.

Threads inside a work group are executed in a particular order. Our experiments show that for each work group channel, each row is sequentially picked in order from the first to last, starting across $W$, then $H$, and finally $C$. Ordering of work group execution is likewise sequential and follows the same schema, as shown in Figure 3.


For a 2D convolution, we compute the result at every output element by iterating over the weights of a convolution kernel and its corresponding input elements covered by a window of size $(\mathit{kernel\_height}, \mathit{kernel\_width})$. For simplicity, we consider the $1{\times}1$ convolution window case. In this case, only one input cell is needed to calculate one output element. As we work with 3D tensors, every cell is implied to be a vector of channels. For this operation, every thread at the very first iteration of its loop requests the first 4 channels of the appropriate cell. A compulsory cache miss occurs on the initial thread request (for 16 bytes, or 4 float values), which triggers the actual data load. When this occurs, the hardware memory manager loads the whole cache line and not just the requested 16 bytes. Since the cache line size on most mobile GPUs is 64 bytes, this results in the loading of the next 48 bytes as well. Since all threads execute the same shader code, the neighboring threads will also execute the same code as the first one (the initially requested 16 bytes). Organizing threads in this way is an efficient strategy for memory loading, as the next (neighboring) input values will already be available when requested, having been loaded as part of the same cache line for the initial neighboring compute threads (Figure 4).
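The cache-line argument can be checked with a little address arithmetic (assuming 4-byte floats and a 64-byte cache line, as stated above): four neighboring threads along $W$ read four contiguous 16-byte pixels, i.e. exactly one cache line.

```python
FLOAT_BYTES, CACHE_LINE = 4, 64

def phwc4_offset(s, h, w, H, W):
    """Byte offset of the 4-channel pixel (slice s, row h, column w) in a PHWC4 buffer."""
    return ((s * H + h) * W + w) * 4 * FLOAT_BYTES

H, W = 8, 6
offsets = [phwc4_offset(0, 0, w, H, W) for w in range(4)]   # threads T0..T3
print(offsets)                                              # [0, 16, 32, 48]
print({off // CACHE_LINE for off in offsets})               # {0}: all four hit one cache line
```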


In computing, a cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere.

A cache hit occurs when the requested data can be found in a cache, while a cache miss occurs when it cannot.

Figure 4. Cache hit by 4 neighboring threads. When threads $T_{0}$-$T_{3}$ each issue a 16-byte load of memory blocks $i_{0}$-$i_{3}$ that are contiguous in memory, the first load can fill the 64-byte cache line, benefiting the other threads with no additional cost in memory I/O.

4.2. Work Group Size Selection

The work group size for executing shader programs defines the group of threads which share data inside the work group. Depending on the GPU, picking the right work group size can result in increased performance, whereas picking the wrong one can result in unexpected slowdowns. Arm Mali GPUs, for instance, show robust performance independent of the configured work group size, and tuning it only results in a nominal performance gain, typically less than 5%. Qualcomm Adreno GPUs, on the other hand, are extremely sensitive to well-configured work group sizes, and tuning these can give up to a 30% performance boost.

Tuning the work group size is unfortunately difficult as GPU internals are not available to the user either directly (via the API), or indirectly (via some assembly representation of internal state). Threads are executed in groups called “waves” and knowing the wave size is crucial to optimizing the work group size as they fine-tune the memory usage of neighboring threads. Devising an algorithmic selection of optimal work group size thus becomes an exhaustive search. Note that selecting the wrong work group size may slow down execution by 5-7 times on Adreno GPUs.


Despite these challenges, we conducted extensive investigations into optimizing the work group size, focusing primarily on conv_2d and depthwise_conv, as these make up nearly 90% of the workload for convolutional networks. While the algorithmic solution is not perfect, the alternative brute-force approach is impractical for real time applications because the work group investigation for a model may take several minutes. In addition, measurements may be inconsistent due to device temperature, resource racing, etc., causing the true global optimal work group size to change from one inference to another.

In computer science, brute-force search or exhaustive search, also known as generate and test, is a very general problem-solving technique and algorithmic paradigm that consists of systematically enumerating all possible candidates for the solution and checking whether each candidate satisfies the problem’s statement.

Because of these fluctuations, we approximate a reasonable optimum within the neighborhood region of the global optimum given an inference time function $T(W, C)$, where $W$ is the work group size and $C$ identifies the convolution configuration.


The domains of the function parameters are:

  • Work group dimensions $W$: $2$, $4$, or $8$
  • Convolution configuration $C$ search space:
    • conv_2d weights $1{\times}1$, $2{\times}2$, $3{\times}3$, or
    • depthwise_conv input and output shapes from $(8, 8, 8)$ to $(128, 128, 128)$, and
    • Strides $1{\times}1$, $2{\times}2$, $3{\times}3$

Given the search space defined by the convolution configuration, a gradient descent approach allows us to converge on a stable optimum of work groups where the expected performance varies by $10\%$ on every inference. From this region of stable work groups, an approximate optimal work group can be selected for every device and convolution type combination.
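A simplified version of this tuning loop is sketched below: candidate work group sizes are drawn from $\{2, 4, 8\}^3$ and a local, gradient-descent-like search keeps moving to the best neighbor of the current candidate according to a measured latency function. The `measure_latency` callable is a stand-in for running the actual shader benchmark on a device.

```python
SIZES = (2, 4, 8)

def neighbors(wg):
    """Work group sizes differing from wg in exactly one dimension."""
    out = []
    for axis in range(3):
        for s in SIZES:
            if s != wg[axis]:
                out.append(tuple(s if i == axis else wg[i] for i in range(3)))
    return out

def tune_work_group(measure_latency, start=(4, 4, 4)):
    """Greedy local search over work group sizes; a stand-in for the paper's tuning."""
    best, best_t = start, measure_latency(start)
    while True:
        cand = min(neighbors(best), key=measure_latency)
        cand_t = measure_latency(cand)
        if cand_t >= best_t:            # no neighbor improves: local optimum reached
            return best, best_t
        best, best_t = cand, cand_t

# Example with a synthetic latency model (for illustration only):
fake = lambda wg: abs(wg[0] - 8) + abs(wg[1] - 4) + abs(wg[2] - 2) + 1.0
print(tune_work_group(fake))            # converges to ((8, 4, 2), 1.0)
```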

Work groups from the Table 2 are currently used in TFLite GPU and their stability is statistically proven. While they do not necessarily result in peak optimal time across all parameters, they are reliable in giving top 10% performance regardless of the convolution parameters.


Table 2. Optimal work group sizes for Adreno GPUs.

Adreno is a series of graphics processing unit (GPU) semiconductor intellectual property cores developed by Qualcomm and used in many of their SoCs.

Adreno (an anagram of AMD’s graphic card brand Radeon), was originally developed by ATI Technologies and sold to Qualcomm in 2009, and was used in their mobile chipset products.

5. Memory Manager for Intermediate Tensors

While we allocate GPU memory for all input/output tensors and tensors holding the trained weights, we do not allocate memory for all intermediate tensors between the operators separately, as they do not have to co-exist in memory simultaneously. This is an important optimization to reduce the memory footprint of the GPU run-time.

During initialization, we first topologically sort the network to determine the execution order of each operator, and the correspondingly required tensors. For each intermediate tensor, we can determine the first and the last operator that uses this tensor either as input or output. Once the last “consumer” of an intermediate tensor has finished executing, the memory for the said intermediate tensor can be re-used for other intermediate tensors. To minimize the total required memory allocation, we have devised a strategy to determine when this final operator execution has occurred. This problem is NP-complete [22].

The name “NP-complete” is short for “nondeterministic polynomial-time complete”. In this name, “nondeterministic” refers to nondeterministic Turing machines, a way of mathematically formalizing the idea of a brute-force search algorithm. Polynomial time refers to an amount of time that is considered “quick” for a deterministic algorithm to check a single solution, or for a nondeterministic Turing machine to perform the whole search. “Complete” refers to the property of being able to simulate everything in the same complexity class.

The set of NP-complete problems is often denoted by NP-C or NPC.


We compared three algorithms for managing the intermediate tensors: (a) a naïve algorithm, (b) a greedy algorithm, and (c) a minimum-cost flow algorithm. The first just naïvely allocates all memory necessary and only serves as a baseline for comparison. The latter two implement smart memory management and use the concept of “shared objects”, by which we refer to allocated memory that is used for more than one tensor during inference, but not more than exactly one at a time. The size of the shared object is the maximum of the sizes of the tensors it is used for. For example, if a shared object $S$ is used for tensor $a$, re-used for tensor $b$, and later for tensor $c$, the size of the shared object $S$ needs to be $size_S = \max(size_a, size_b, size_c)$.

The Greedy Algorithm is summarized in Algorithm 1. We iterate through all operators in topological execution order. If an output tensor $t$ of the current operator is an intermediate tensor, it is assigned to a newly created shared object if the pool of shared objects is empty (L.7), or to an existing shared object that has the closest size by absolute difference to $t.size$ (L.9), which gets removed from the available pool (L.10). If $t.size > S.size$, then the shared object’s buffer size is increased (L.11-12). This shared object $S$ is inserted into the set of currently used objects (L.14). After the output tensors, the input tensors are inspected. If an input tensor is an intermediate tensor and the current operator is its last consumer, we remove the shared object that is assigned to this tensor from the set of currently used objects, and add it back to the pool of shared objects (L.17-19).

Algorithm 1. Greedy assignment of intermediate tensors to shared objects (pseudocode listing; the line numbers L.7-L.19 above refer to it).

This algorithm has a runtime complexity of $O(n \log n)$, where $n$ is the number of intermediate tensors. We use a binary search tree for the pool of shared objects and a binary heap priority queue for the set of currently used objects. A straightforward implementation of the same algorithm without these data structures has a run-time complexity of $O(n^2)$. For the neural network from Figure 5, this approach re-uses the memory of the output tensor of vertex 0 for the output tensor of vertex 2, and the memory of the output tensor of vertex 1 for the output tensor of vertex 4. The total size of allocated memory is 104.

Figure 5. An example neural net. Each vertex corresponds to an op. The upper number denotes the execution order, and the lower number the size of its output intermediate tensor. The last op does not have the latter as its output is not an intermediate tensor.

Tracing the greedy algorithm on the network of Figure 5 (a shared object is written as size_opIndex, i.e. its size and the op whose output it currently holds):

1. available_objects = {}, used_objects = {}.
2. op0:
op0_out0 is intermediate tensor, available_objects = {}, used_objects = {32_0}.
op0_in0 is not intermediate tensor.
available_objects = {}, used_objects = {32_0}.
3. op1:
op1_out0 is intermediate tensor, available_objects = {}, used_objects = {32_0, 8_1}.
op1_in0 is intermediate tensor and is the last consumer, available_objects = {32}, used_objects = {8_1}.
available_objects = {32}, used_objects = {8_1}.
4. op2:
op2_out0 is intermediate tensor, available_objects = {}, used_objects = {8_1, 32_2}.
op2_in0 is intermediate tensor and is not the last consumer.
available_objects = {}, used_objects = {8_1, 32_2}.
5. op3:
op3_out0 is intermediate tensor, available_objects = {}, used_objects = {8_1, 32_2, 8_3}.
op3_in0 is intermediate tensor and is the last consumer, available_objects = {8}, used_objects = {32_2, 8_3}.
available_objects = {8}, used_objects = {32_2, 8_3}.
6. op4:
op4_out0 is intermediate tensor, available_objects = {}, used_objects = {32_2, 8_3, 64_4}.
op4_in0 is intermediate tensor and is the last consumer, available_objects = {32}, used_objects = {8_3, 64_4}.
op4_in1 is intermediate tensor and is the last consumer, available_objects = {32, 8}, used_objects = {64_4}.
available_objects = {32, 8}, used_objects = {64_4}.
7. op5:
op5_out0 is not intermediate tensor.
op5_in0 is intermediate tensor and is the last consumer, available_objects = {32, 8, 64}, used_objects = {}.
available_objects = {32, 8, 64}, used_objects = {}.

The total size of allocated memory is 104 = 32 + 8 + 64.
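The trace above can be reproduced with a compact sketch of the greedy strategy. The graph structure and tensor sizes are reconstructed from the walk-through (op0 -> t0 -> op1 -> t1 -> {op2, op3} -> t2, t3 -> op4 -> t4 -> op5), and a plain list replaces the search tree of Algorithm 1, so this simplification runs in $O(n^2)$ rather than $O(n \log n)$.

```python
def greedy_assign(ops, sizes):
    """ops: list of (inputs, output) using intermediate tensor ids; sizes: id -> size."""
    last_consumer = {}
    for i, (ins, _) in enumerate(ops):
        for t in ins:
            last_consumer[t] = i
    pool, assignment, obj_sizes = [], {}, []        # free object ids / tensor -> object id
    for i, (ins, out) in enumerate(ops):
        if out is not None:                         # allocate the output tensor first
            if pool:
                obj = min(pool, key=lambda o: abs(obj_sizes[o] - sizes[out]))
                pool.remove(obj)
                obj_sizes[obj] = max(obj_sizes[obj], sizes[out])   # grow buffer if needed
            else:
                obj = len(obj_sizes)
                obj_sizes.append(sizes[out])
            assignment[out] = obj
        for t in ins:                               # then release tensors at their last consumer
            if last_consumer[t] == i:
                pool.append(assignment[t])
    return assignment, obj_sizes

ops = [((), "t0"), (("t0",), "t1"), (("t1",), "t2"),
       (("t1",), "t3"), (("t2", "t3"), "t4"), (("t4",), None)]
sizes = {"t0": 32, "t1": 8, "t2": 32, "t3": 8, "t4": 64}
assignment, obj_sizes = greedy_assign(ops, sizes)
print(assignment, sum(obj_sizes))                   # total allocated = 104, as in the trace
```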

The Minimum-Cost Flow Algorithm involves creating an auxiliary flow network and solving the minimum-cost flow problem (MCFP) [5]. First, we insert two vertices for each intermediate tensor $x$ and denote them $l_x$ and $r_x$, with two special vertices for the source $s$ and the sink $t$. Then, we add directed edges to the flow network (a small construction sketch follows the list):

  1. For each $x$ in $1 \dots N$, add an edge from $s$ to $r_x$ with capacity 1 and cost $size_x$. For tensor $x$, we can allocate a new shared object of size $size_x$.
  2. If a shared object allocated for tensor $x$ can be re-used for tensor $y$, then add an edge from $l_x$ to $r_y$ with capacity 1 and cost $\max(0, size_y - size_x)$. If tensor $y$ is greater in size than tensor $x$, we can re-use the corresponding shared object, but we might need to allocate $size_y - size_x$ of additional memory. This is not always the case, as the shared object can already have a size greater than $size_x$, but it is a good approximation.
  3. For each $x$ in $1 \dots N$, add an edge from $s$ to $l_x$ with capacity 1 and cost 0.
  4. For each $x$ in $1 \dots N$, add an edge from $r_x$ to $t$ with capacity 1 and cost 0.
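A compact sketch of this construction is given below; `sizes` maps each intermediate tensor to its size and `can_reuse(x, y)` is assumed to encode whether the lifetime of $x$ ends before $y$ is produced. The MCFP solver itself (e.g. an SPFA-based min-cost flow) is not shown.

```python
def build_flow_network(sizes, can_reuse):
    """Edge list (u, v, capacity, cost) of the auxiliary network described above."""
    edges = []
    for x, size_x in sizes.items():
        edges.append(("s", f"r{x}", 1, size_x))   # type 1: open a new shared object for x
        edges.append(("s", f"l{x}", 1, 0))        # type 3
        edges.append((f"r{x}", "t", 1, 0))        # type 4
        for y, size_y in sizes.items():
            if x != y and can_reuse(x, y):        # type 2: re-use x's object for y
                edges.append((f"l{x}", f"r{y}", 1, max(0, size_y - size_x)))
    return edges

# Hypothetical two-tensor example: tensor 1 can re-use tensor 0's shared object.
print(build_flow_network({0: 32, 1: 64}, lambda x, y: (x, y) == (0, 1)))
```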

Figure 6. The flow network for the neural network in Figure 5. Capacity of each edge is 1. Saturated edges, i.e. the final assignment of shared objects to tensors, are shown as solid lines.

After building the flow network, we solve the MCFP with the Shortest Path Faster Algorithm (SPFA) [19] or Johnson’s algorithm [15]. With SPFA, the run-time complexity is $O(N^4)$, but it can be reduced to $O(N^3)$ by decreasing the number of edges of type 2. Figure 6 shows a flow network and the result of this algorithm’s execution for the example graph from Figure 5. The minimum-cost flow approach re-uses the memory of the output tensor of vertex 0 for the output tensor of vertex 4. The total size of allocated memory is 84.

If an edge of type 1 (from $s$ to $r_x$) is saturated by the flow, i.e. its residual capacity is equal to 0, we create a new shared object for the tensor $x$. If an edge of type 2 (from $l_x$ to $r_y$) is saturated by the flow, we assign to tensor $y$ the same shared object that was used by tensor $x$. After execution of the algorithm, the amount of the flow will be equal to $N$. It means that the resulting flow network has information about the assignment of shared objects for all $N$ intermediate tensors. The size of each shared object is determined by the maximum size of all tensors assigned to it.


There is no clear winner between these two memory management algorithms in terms of the minimal memory footprint, and it depends on the network (Table 3). TFLite GPU is using the greedy algorithm by default with the developer being able to choose the MCFP algorithm if desired.

Table 3. Total memory allocated (in MB) for all intermediate tensors. Naïve means no memory manager and serves as baseline. Bold number means the smallest memory footprint for each model.

6. Results

Figure 7 illustrates the performance of GPU inference compared to CPU inference in TFLite for various neural networks which generally demonstrates a 2-9x speedup. The first 10 warm-up runs were skipped for benchmarking and averages are based on the 100 subsequent inferences. This profiling revealed that TFLite GPU is often bound by memory bandwidth and we typically only see 20-40% ALU utilization. On iOS devices, we benefit from larger cache sizes that result in reduced memory I/O latency, and hence, better performance than the OpenGL backend.

Figure 7. Average inference latency (in milliseconds) of TFLite GPU (orange) compared to CPU (gray) on various neural networks, run on a variety of smartphones (best viewed in color).

Table 4 and Table 5 show the average inference latency of iOS- and Android-compatible ML frameworks on MobileNet v1, respectively. Note that TFLite GPU employs OpenGL for the widest coverage with reasonable performance. MACE and SNPE employ OpenCL and may outperform TFLite GPU on some mobile devices shipped with OpenCL. As OpenCL is not a part of the standard Android distribution, apps using those frameworks may not be able to guarantee their inference performance e.g. on Google Pixel devices. Also note that SNPE does not run on devices with Arm Mali GPUs.

Table 4. Average inference latency (in milliseconds) of iOS-compatible ML frameworks on MobileNet v1.

Table 5. Average inference latency (in milliseconds) of Android-compatible ML frameworks on MobileNet v1. Note that TFLite GPU employs OpenGL and thus has the widest coverage with reasonable performance. MACE and SNPE employ OpenCL and may run faster on devices shipped with OpenCL, but may not run on all devices. $^{1}$Arm Mali GPUs are not compatible with SNPE. $^{2}$Google Pixel devices do not support OpenCL.

Figure 8 shows how inference performance degrades over a sustained period of time due to thermal throttling of the device. Mobile inference by applications typically occurs in one of two modes: one-time detection or ongoing run-time data processing. For one-time inference, e.g. object detection, an application may achieve the peak performance illustrated in the left half of the graph in Figure 8, where the device temperature is nominal. For ongoing run-time inference, e.g. video segmentation, the right half illustrates the potential impact of thermal throttling due to sustained performance.

Figure 8. Inference latency (in milliseconds) for MobileNet v1 over an extended period of time $[0, 200]$ sec (best viewed in color).

In order to avoid data transfer delays, real-time applications usually place neural network input/output tensors in a GPU texture or buffer. TFLite GPU allows using CPU-side tensors as input/output as well. Additionally, CPU-to-GPU data-transfer efficiency can be controlled via time or power efficient synchronization mechanisms. The most power-efficient one suspends waiting threads until the GPU completes its task. The fastest option by comparison, employs an active spin-lock approach, reducing data acquisition delays by avoiding operating system process re-scheduling.

In software engineering, a spinlock is a lock that causes a thread trying to acquire it to simply wait in a loop (“spin”) while repeatedly checking whether the lock is available.


7. Conclusion

In this paper, we presented the architectural design of TFLite GPU. We described the properties of mobile GPUs and explained the optimization techniques we employed for fast memory I/O, a small run-time memory footprint, and fast compute shader execution. With these, we aim to make network architects mobile GPU-aware when they design their networks.

From our discussion of mobile GPU-friendly data layout PHWC4, neural network designers should know that any kind of RESHAPEs are significantly more expensive on the GPU than on the CPU. The network itself will learn the weights regardless of the RESHAPE op, thus it is best to skip the operator entirely if a RESHAPE operation was inserted just for convenience of the architect.

For the same reason, if the mobile device can produce RGBA rather than RGB, it is now apparent that using the former can avoid a conversion, i.e. memory copy, from RGBA to RGB. Similarly, if the mobile device can render a 4-channel tensor, i.e. RGBA, directly, that can be a better choice than the RGB counterpart. This choice benefits not just the graph input/output, but also its intermediate tensors. Similarly, since we know that a tensor of shape $[B, H, W, 5]$, for instance, is twice as expensive as $[B, H, W, 4]$, but about the same as $[B, H, W, 8]$, the architect can tune around those 4-channel boundaries rather than trying to optimize on other boundaries.
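Since compute and memory traffic scale with the number of 4-channel slices rather than with C itself, a quick check makes this 4-channel boundary explicit:

```python
def num_slices(c):
    return (c + 3) // 4          # PHWC4 slices needed for C channels

print([num_slices(c) for c in (4, 5, 8)])   # [1, 2, 2]: C=5 costs as much as C=8, twice C=4
```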

TFLite GPU is still in its early development stages. We plan to investigate several areas, including employing additional GPU-specific optimizations to improve inference speed further, and expanding support for more operations, e.g. understanding more about recurrent networks or LSTMs and how we can optimize those for GPUs. Finally, we are extensively exploring other GPU backends such as OpenCL and Vulkan to achieve better ALU utilization.


Acknowledgements

We would like to acknowledge our colleagues at TensorFlow Lite; Lawrence Chan, Tim Davis, Jared Duke, Yu-Cheng Ling, Andrew Selle, Sarah Sirajuddin, and Pete Warden. We are also grateful to Aleksandr Ignashev for the figures in this paper and Karthik Raveendran for his valuable feedback.


References

https://arxiv.org/abs/1907.01989
https://ar5iv.org/abs/1907.01989

  • Core ML: https://developer.apple.com/documentation/coreml
  • Metal Performance Shaders: https://developer.apple.com/documentation/metalperformanceshaders
  • Arm Compute Library: https://developer.arm.com/Tools%20and%20Software/Compute%20Library
  • Minimum-cost flow problem: https://en.wikipedia.org/wiki/Minimum-cost_flow_problem
  • Caffe2: https://caffe2.ai/
  • Android Neural Networks API: https://developer.android.com/ndk/guides/neuralnetworks
  • TensorFlow Lite: https://www.tensorflow.org/lite
  • TensorFlow models: https://www.tensorflow.org/lite/models
  • Model optimization: https://www.tensorflow.org/lite/performance/model_optimization
  • TensorFlow Lite Support: https://github.com/tensorflow/tflite-support
  • FlatBuffers: https://google.github.io/flatbuffers/
  • TensorFlow Lite Task Library: https://www.tensorflow.org/lite/inference_with_metadata/task_library/overview
  • HUAWEI DEVELOPERS: https://developer.huawei.com/consumer/en/product/
  • What is MediaTek NeuroPilot?: https://www.mediatek.com/blog/what-is-mediatek-neuropilot
  • Snapdragon Neural Processing Engine SDK - Reference Guide: https://developer.qualcomm.com/sites/default/files/docs/snpe/
  • Mobile AI Compute Engine (MACE): https://github.com/XiaoMi/mace
