Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices

nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices


Li Lyna Zhang, Shihao Han, Jianyu Wei, Ningxin Zheng, Ting Cao, Yuqing Yang, Yunxin Liu

Microsoft Research is the research subsidiary of Microsoft Corporation dedicated to conducting both basic and applied research.
Rose-Hulman Institute of Technology (RHIT)
University of Science and Technology of China (USTC)
Institute for AI Industry Research, Tsinghua University (AIR, THU)
Association for Computing Machinery (ACM)
The 19th ACM International Conference on Mobile Systems, Applications, and Services (ACM MobiSys 2021)
The 22nd ACM International Conference on Mobile Systems, Applications, and Services (ACM MobiSys 2024)
Special Interest Groups (SIGs)
Special Interest Group on Mobility of Systems, Users, Data, and Computing (SIGMOBILE)
Special Interest Group on Operating Systems (SIGOPS)
The Advanced Computing Systems Association (USENIX)
United States of America (USA)
United States (U.S.)
Microsoft Research Asia (MSR Asia)

Artifact Review and Badging Version 1.1 - August 24, 2020

https://www.acm.org/publications/policies/artifact-review-and-badging-current

We recommend that three separate badges related to artifact review be associated with research articles in ACM publications: Artifacts Evaluated, Artifacts Available and Results Validated.

  • Artifacts Evaluated

This badge is applied to papers whose associated artifacts have successfully completed an independent audit. Artifacts need not be made publicly available to be considered for this badge. However, they do need to be made available to reviewers.

  • Artifacts Available

This badge is applied to papers in which associated artifacts have been made permanently available for retrieval.

  • Results Validated

This badge is applied to papers in which the main results of the paper have been successfully obtained by a person or team other than the author.

artifact [ˈɑrtɪˌfækt]: n. a man-made object; an artificial effect or phenomenon
badge [bædʒ]: n. an emblem, mark, or symbol

ABSTRACT

With the recent trend of on-device deep learning, inference latency has become a crucial metric in running Deep Neural Network (DNN) models on various mobile and edge devices. To this end, latency prediction of DNN model inference is highly desirable for many tasks where measuring the latency on real devices is infeasible or too costly, such as searching for efficient DNN models with latency constraints from a huge model-design space. Yet it is very challenging and existing approaches fail to achieve a high accuracy of prediction, due to the varying model-inference latency caused by the runtime optimizations on diverse edge devices.

In this paper, we propose and develop nn-Meter, a novel and efficient system to accurately predict the inference latency of DNN models on diverse edge devices. The key idea of nn-Meter is dividing a whole model inference into kernels, i.e., the execution units on a device, and conducting kernel-level prediction. nn-Meter builds atop two key techniques: (i) kernel detection to automatically detect the execution unit of model inference via a set of well-designed test cases; and (ii) adaptive sampling to efficiently sample the most beneficial configurations from a large space to build accurate kernel-level latency predictors. Implemented on three popular platforms of edge hardware (mobile CPU, mobile GPU, and Intel VPU) and evaluated using a large dataset of 26,000 models, nn-Meter significantly outperforms the prior state-of-the-art.

PPT: https://www.microsoft.com/en-us/research/publication/nn-meter-towards-accurate-latency-prediction-of-deep-learning-model-inference-on-diverse-edge-devices/

CCS CONCEPTS

Computer systems organization -> Neural networks; Embedded systems.

Computing Classification System (CCS)

KEYWORDS

deep neural network, inference latency prediction, edge AI

1 INTRODUCTION

Deep Neural Networks (DNNs) have been widely used in today's mobile and edge applications [33]. In many applications such as on-device video analytics, face recognition, AR/VR, etc., DNN models must meet efficiency constraints (e.g., latency). To design a model with both high accuracy and efficiency, model compression [6, 14, 15, 24] and the recent Neural Architecture Search (NAS) [7, 29, 32, 34] take the inference latency of DNN models as a hard design constraint.

However, measuring the inference latency of DNN models is laborious and expensive. In practice, it requires developers to deploy the model on the physical device to obtain the latency. The engineering effort is tremendous for diverse edge devices (e.g., mobile CPU/GPU and various AI accelerators) and different inference frameworks (e.g., TFLite and OpenVINO). Even on a single device, it may be extremely time-consuming to measure a large number of models in NAS tasks (e.g., ProxylessNas [7] explores ∼0.3 million models in just one round of search). Such a high cost hinders scalability and makes the measurement-based method practically infeasible for supporting the fast-growing number of edge devices.

hinder [ˈhɪndə(r)]: v. to obstruct, impede, or hold back
scalability [ˌskeɪləˈbɪlɪti]: n. the ability to scale
infeasible [ɪnˈfiːzəbl]: adj. not practicable

Consequently, approaches have been proposed to predict the inference latency. For example, the FLOPs-based method^1 has been widely applied to evaluate efficiency [15, 22, 23, 30]; it is simple but not a direct metric of latency. To predict a model latency, many NAS works [6, 7, 32] build operator-wise lookup tables. Such operator-level methods sum up the latencies of all operators. However, they do not consider the model latency differences caused by runtime optimizations of model graphs. For instance, many frameworks merge multiple operators into one fused operator to accelerate inference, which impacts the inference latency significantly. Recently, the state-of-the-art BRP-NAS [13] uses graph convolutional networks (GCN) to predict the latency of NASBench201 [12] models on various devices. It captures the runtime optimizations by learning the representation of model graphs and the corresponding latency. However, this model-graph based approach heavily depends on the tested model structures and may not work for many unseen model structures.

In this work, we propose and develop a novel system called nn-Meter^2 that aims to accurately predict the latency of arbitrary DNN models on diverse edge devices. The key idea of nn-Meter is dividing a whole model inference into multiple kernels, which are the independent execution units of model inference on a device. A kernel may be either a single primitive operator or a fusion of multiple operators, depending on the runtime and hardware. nn-Meter builds latency predictors for kernels and predicts the total latency of a model as the sum of the latencies of all kernels in the model.

^1 The definition of FLOPs follows [35], i.e., the number of multiply-adds.
^2 nn means neural networks.

primitive [ˈprɪmətɪv]: adj. basic, elementary, original

This design choice of kernel-level prediction is based on two observations. First, the kernel is the basic scheduling and execution unit (e.g., GPU kernels) in deep-learning frameworks, particularly on edge devices. Thus, the notion of kernel naturally captures the diverse runtime optimizations, including operator fusion, the most important optimization that can largely impact latency. Second, despite the very large number of DNN models, the kinds of operators and kernels are stable and form a relatively small set. Any model is just a different combination of operators/kernels. Therefore, kernel-level prediction is generic enough to support unseen new models.

notion [ˈnəʊʃ(ə)n]: n. a concept, belief, or understanding

2 BACKGROUND AND MOTIVATION

2.1 CNN Model Characteristics

2.2 Optimizations of Inference Frameworks

2.3 Rationale for Kernel-level Prediction

rationale [ˌræʃəˈnɑːl]: n. the underlying reason or basis

3 NN-METER DESIGN

4 KERNEL DETECTION

4.1 Test Case Design

4.2 Find All Kernels of a Model

5 LATENCY PREDICTOR

This section introduces the method to build latency predictors for kernels and models. We start by analyzing the challenges of the non-linear latency pattern and the expensive sampling cost.

5.1 Kernel Characterization

Conv and DWConv dominate the latency of a model. By applying kernel detection to the target model, we get a set of kernels. However, not all kernels contribute equally to the latency. Fig. 6 shows the model latency percentages by kernel type. We make the following observations: (1) In most models, Conv (Conv+bn+relu) and DWConv (DWConv+bn+relu) account for the main latency percentages. On average, Conv and DWConv take 94.2%, 91.91%, and 75.5% of the model latency on the CPU, GPU, and VPU, respectively. (2) The latency of FC and element-wise operators (i.e., Others in Fig. 6) is relatively large on the VPU. It is also necessary to estimate these small kernels for accurate prediction. For instance, FC can take 47.4% of the latency of AlexNet. Among all the detected kernels, Conv is the most challenging one due to its large sample space. We mainly take Conv as the example in the following discussion.

Figure 6: Model latency percentage breakdown. Conv and DWConv are the latency-dominating kernels.

A large sample space of Conv. The possible configurations of a kernel decide its sample space. For the latency-dominating Conv and DWConv kernels, the primary configuration parameters include: input height $H$, input width $W$, kernel size $K$, stride $S$, input channel number $C_{in}$, and output channel number $C_{out}$. Since $H$ is usually equal to $W$ for a kernel in CNN models, we encode it as a 5-dimensional vector: $(HW, K, S, C_{in}, C_{out})$.
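For illustration, a minimal sketch of how such a configuration could be encoded as a feature vector for a regressor; the class and field names here are illustrative and not nn-Meter's actual API.

```python
from dataclasses import dataclass

@dataclass
class ConvConfig:
    # One Conv+bn+relu kernel configuration (illustrative names, not nn-Meter's API)
    hw: int    # input height/width (H == W is assumed)
    k: int     # kernel size
    s: int     # stride
    cin: int   # input channel number
    cout: int  # output channel number

    def encode(self):
        # The 5-dimensional configuration vector (HW, K, S, Cin, Cout)
        return [self.hw, self.k, self.s, self.cin, self.cout]

# Example: a stem-like Conv layer
print(ConvConfig(hw=224, k=3, s=2, cin=3, cout=32).encode())  # [224, 3, 2, 3, 32]
```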

We collect 24 CNN models from the PyTorch model zoo and extract all the Conv configurations. As shown in Table 5, CNN models configure $HW$, $K$, and $S$ from a small range of numbers. However, the ranges of $C_{in}$ and $C_{out}$ are unbounded. Among these models, $C_{in}$ varies from a minimum of 3 to a maximum of 2160. The full sample space size is the product of the sizes of all configuration dimensions. To this end, the latency-dominating Conv has a vast number (i.e., $\approx$ 0.7 billion) of configurations to sample.

Table 5: Sample space of Conv+bn+relu. It contains $\approx$ 0.7 billion configurations.

Non-linear latency pattern. Existing works [5, 27, 28] assume linearity between operator configurations and the corresponding latency. For instance, a Conv with a larger $H$ is assumed to have a larger latency. However, as shown in Fig. 7 and Fig. 8, we observe that $K$, $HW$, $C_{in}$, and $C_{out}$ show a non-linear pattern on our measured devices. In particular, $HW$ and $C_{out}$ exhibit a staircase pattern, in which Convs with two different $HW$/$C_{out}$ values may have the same latency. These non-linearities reflect the complexities of hardware optimizations.

Figure 7: Conv+bn+relu with (a) different kernel sizes ($HW=224$, $C_{in}=3$, $C_{out}=32$, $S=1$); (b) different input heights/widths ($C_{in}=C_{out}=64$, $K=3$, $S=1$).

Figure 8: Latency of Conv+bn+relu with different output channel numbers. The ground truth obtained by sampling all channel numbers shows a staircase pattern on the VPU and GPU. ($HW=112$, $C_{in}=32$, $K=3$, $S=1$)

Random sampling misses hardware-crucial data. To learn the non-linearity between configurations and latency, we need to generate a training set (i.e., variously configured kernels and their latencies) for regression. Since it is infeasible to sample and measure all the configurations of Conv, a straightforward method is random sampling.

However, we argue that it is difficult to build accurate predictors by random sampling. As shown in Fig. 8, random sampling misses many important configurations (e.g., $C_{out}=66$ on the VPU). These crucial data reflect the complex hardware optimizations. Without them, predictors can easily learn an inaccurate latency pattern. To capture the staircase pattern on the GPU and VPU, we should sample more data in the channel number dimension.

Main takeaways. Conv and DWConv are the latency-dominating kernels, and their prediction accuracy matters most to the final model-level prediction. However, the large sample space of Conv introduces challenges for sampling and measurement (i.e., labeling). Random sampling misses many crucial data points, as shown in Fig. 8.

5.2 Adaptive Data Sampling

Instead of random selection, the main idea is to sample the most beneficial data from the kernel configuration space. It covers (1) the configuration range in CNN design, and (2) hardware-crucial configurations that reflect the hardware optimizations and can significantly impact the prediction accuracy.

Driven by the two goals, we first prune the rarely-considered configurations by constraining the sampling distribution. This leverages the observation that many configurations are unlikely to be selected in state-of-the-art CNN models. Moreover, the considered configurations are non-uniformly distributed in the sample space. For instance, for efficiency and accuracy reasons, modern CNNs do not use Conv with very large (224, 1, 1, 2160, 2048) or very small (1, 1, 1, 3, 16) configurations. Second, we run an iterative process to sample more data around data points with inaccurate predictions. These data are treated as the hardware-crucial data. Since Conv has a 5-dimensional configuration, we leverage the observation in Fig. 8 and sample in the channel number dimension.

[Algorithm 2: Adaptive data sampling]

To this end, we propose adaptive data sampling. Algorithm 2 illustrates the main steps. First, to generate sufficient configurations that are likely to be considered in CNN design, we sample from the prior probability distribution (line 11). Then, to evaluate the quality of the sampled data, we build the machine learning predictor and design a test set for evaluation (lines 12-14). Finally, we perform fine-grained channel number sampling for data with large prediction errors (lines 1-8). The iterative process continues until the predictor accuracy meets the user's requirements (lines 16-23).

fine-grained: adj. having a fine texture; (of data or methods) at a fine level of granularity

Prior probability distribution $P$. It describes the boundary and the probability of each data point to sample. To compute it, we collect the configurations from 24 state-of-the-art CNN models and obtain the probability distribution of each kernel dimension. We sample $N$ data points from the distributions as the initial data and measure their latency. In our experiment, we set $N$ to 10,000 for Conv, 5,000 for DWConv, and 2,000 for the other kernels.

Test set $TD$. It is crucial to construct the test set as it evaluates the performance of the sampled data and the predictor. Since the sample sizes of the input $HW$, kernel size $K$, and stride $S$ are small, we generate all the combinations of $(HW, K, S)$ in Table 5. For $C_{in}$ and $C_{out}$, we use the numbers that appear in our collected model zoo. Specifically, the initial set contains 2,800 and 500 points for Conv and DWConv, respectively. To avoid overfitting, we add 20% of the newly sampled data to the test set in each iteration.

Fine-grained sampling around inaccurate data. After evaluating the predictor in each iteration, we pick out the data with large errors and perform fine-grained sampling. For each data point, we fix all the other dimensions except the channel number $C_{o}$. We randomly sample $M$ data points from $[0.4 \times C_{o}, 1.2 \times C_{o}]$. For example, for a Conv with (56, 3, 1, 24, 64), we fix the $HW$, $K$, and $S$ dimensions, and sample $M$ new $C_{in}$ and $C_{out}$ values from [9, 28] and [25, 76], respectively. We set $M = 10$ in our experiment.
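The iterative procedure above can be summarized in the following sketch. It is a simplified rendering of Algorithm 2, not nn-Meter's actual implementation: `prior_sampler` and `measure_latency` are assumed helpers for the prior distribution and on-device measurement, and the test-set expansion step is omitted.

```python
import random
from sklearn.ensemble import RandomForestRegressor

def adaptive_sampling(prior_sampler, measure_latency, test_set,
                      acc_target=0.99, m=10, max_rounds=10):
    # prior_sampler(n): draw n configurations (HW, K, S, Cin, Cout) from the prior distribution
    # measure_latency(cfg): measured latency of the kernel with this configuration on the device
    # test_set: list of (cfg, measured_latency) pairs used to evaluate the predictor
    data = [(cfg, measure_latency(cfg)) for cfg in prior_sampler(10000)]
    predictor = None
    for _ in range(max_rounds):
        predictor = RandomForestRegressor(n_estimators=100).fit(
            [list(cfg) for cfg, _ in data], [lat for _, lat in data])
        # Pick test points whose relative prediction error exceeds 10%
        bad = [cfg for cfg, lat in test_set
               if abs(predictor.predict([list(cfg)])[0] - lat) > 0.1 * lat]
        if 1.0 - len(bad) / len(test_set) >= acc_target:
            break
        # Fine-grained sampling: new channel numbers around each inaccurate configuration
        for hw, k, s, cin, cout in bad:
            for _ in range(m):
                cfg = (hw, k, s,
                       random.randint(int(0.4 * cin), int(1.2 * cin)),
                       random.randint(int(0.4 * cout), int(1.2 * cout)))
                data.append((cfg, measure_latency(cfg)))
    return predictor, data
```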

5.3 Kernel and Model Latency Prediction

Predict kernel latency. To learn the non-linearity observed in Figs. 7 and 8, we use Random Forests Regression [21]. Random Forests is an ensemble of decision trees and is commonly reported as one of the most accurate learning algorithms. Some works [10] adopt XGBoost [8]. While XGBoost has many hyper-parameters, Random Forests is much easier to tune for high performance. Table 6 lists the prediction features and the number of collected data points for building the latency predictors. For each kernel, we train and save the predictor for online model latency prediction.

Table 6: Main kernels, features and valid data.

Predict model latency. Finally, we estimate the model latency as the sum of the predicted latencies of all kernels, as shown in Equation (2).

$$Latency(m) = \sum_{o \in m} f_{o}(x_{o}) \tag{2}$$
where $f_{o}$ is the ML predictor of kernel $o$, and $x_{o}$ is its extracted feature vector.

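A minimal sketch of the kernel-level predictors and of Equation (2). The feature extraction and the kernel decomposition are stand-ins for nn-Meter's actual components; only the overall structure (one Random Forests regressor per kernel, model latency as the sum of kernel predictions) follows the text.

```python
from sklearn.ensemble import RandomForestRegressor

def build_kernel_predictors(training_data):
    # training_data: dict kernel_type -> (X, y), where X are kernel feature vectors
    # (e.g., HW, K, S, Cin, Cout plus derived features) and y are measured latencies in ms
    return {kernel: RandomForestRegressor(n_estimators=100).fit(X, y)
            for kernel, (X, y) in training_data.items()}

def predict_model_latency(kernels, predictors):
    # Equation (2): kernels is the list of (kernel_type, feature_vector) pairs
    # produced by kernel detection; the model latency is the sum of kernel predictions.
    return sum(predictors[kernel].predict([x])[0] for kernel, x in kernels)
```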

6 NN-METER IMPLEMENTATION

The entire nn-Meter consists of 18,093 lines of Python code (LoC): test cases, 2,025 LoC; adaptive data sampling and kernel latency predictors, 8,052 LoC; model latency prediction, 1,291 LoC; benchmark dataset, 2,630 LoC; latency measurement, 4,095 LoC.

Latency measurement. nn-Meter currently supports three widely-used edge devices, as shown in Table 7. Different from the mobile CPU and mobile GPU, the Intel NCS2 VPU is a dedicated AI accelerator.

Table 7: Evaluated edge devices.

We build an automated measurement platform to measure latency. Given a model/kernel configuration, we generate the graph in both the TensorFlow protobuf and TFLite formats, which are generally supported by edge inference frameworks. We send the target model to the measurement platform and collect the returned inference latency. To measure the latency on the CPU, we set the CPU frequency to the highest 2.42 GHz. The latency on the CPU is measured by the TFLite benchmark tool. Since TFLite currently does not support operator-level profiling for the GPU, we implement an operator profiler in TFLite for the GPU backend. For VPU latency measurement, we convert the protobuf format into OpenVINO IR and measure the latency with the OpenVINO™ toolkit. The latency number is the average of 50 inference runs.
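As a point of reference, averaging over repeated runs can be sketched as follows with the TFLite Python interpreter. This is only a simplified host-side illustration assuming a float32 model; the actual measurements described above use the TFLite benchmark tool, a custom GPU operator profiler, and the OpenVINO toolkit on the devices themselves.

```python
import time
import numpy as np
import tensorflow as tf

def measure_tflite_latency(model_path, runs=50, warmup=5):
    # Average whole-model latency in milliseconds using the TFLite Python interpreter.
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    dummy = np.random.rand(*inp["shape"]).astype(np.float32)
    for _ in range(warmup):                        # discard warm-up runs
        interpreter.set_tensor(inp["index"], dummy)
        interpreter.invoke()
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.set_tensor(inp["index"], dummy)
        interpreter.invoke()
    return (time.perf_counter() - start) / runs * 1000.0
```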

Kernel detection. The test cases of nn-Meter cover all possible two-operator combinations for each of the three devices. The numbers of CNN operators are 26 (CPU), 21 (GPU), and 27 (VPU). The detected fusion rules are 668 (CPU), 434 (GPU), and 720 (VPU), respectively. The total numbers of kernels found in our dataset are 22 (CPU), 26 (GPU), and 22 (VPU). More kernels are found on the GPU because more fusion rules are supported by the TFLite GPU backend. For example, a Conv+bn+add+add kernel is found for the GPU since the fusion of these operators is fully supported. By comparison, the fusion rules supported by the CPU and VPU are limited to the Conv, bn, and relu operators, resulting in fewer kernels (also refer to Table 4 for a real model example). The detected fusion rules and found kernels are the same as the framework-reported results on our CPU and GPU backends. The VPU backend is not open source, so we cannot directly verify the rules or kernels. However, as Section 7.3 will show, the prediction accuracy on the VPU based on the kernels is much higher (83.4% vs 8.5%) than operator-based prediction. Therefore, the fusion detection and kernel search algorithms are also effective on the black-box VPU.
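The latency-based fusion test behind these two-operator test cases can be illustrated with the sketch below. It is a simplified rendition under our own assumptions (the `measure` helper and the `alpha` threshold are placeholders), not nn-Meter's actual rule-detection code: if running two operators back-to-back is clearly cheaper than running them separately, they are treated as one fused kernel.

```python
def fuses(op1, op2, measure, alpha=0.5):
    # measure(ops): latency of a small test model built from the given operator sequence
    # (a stand-in for nn-Meter's generated test cases, not its actual API).
    t1, t2 = measure([op1]), measure([op2])
    t12 = measure([op1, op2])
    # If the combined latency is well below the sum of the individual latencies,
    # the backend is assumed to have fused op1 and op2 into one kernel.
    return t12 < t1 + t2 - alpha * min(t1, t2)

# Example: enumerate all two-operator combinations for one device
# fusion_rules = {(a, b): fuses(a, b, measure) for a in operators for b in operators}
```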

Latency prediction. In our experiment, we observe that the latency difference between Conv and its fused operators (e.g., Conv with relu/relu6, bn, add) is negligible (the same holds for DWConv). For example, the latencies of Conv, Conv+bn, and Conv+bn+relu with configuration (56, 3, 1, 32, 32) are 0.404 ms, 0.404 ms, and 0.405 ms on the GPU, respectively.

Therefore, for the Conv and DWConv fused operators, we only build predictors for Conv+bn+relu and DWConv+bn+relu. To collect the data for regression, we manually set the error thresholds and run the adaptive data sampling. We split the collected data (Table 6) into train, validation, and test sets by 7:1:2, where we use the validation data for hyper-parameter tuning. The hyper-parameters are tuned by the popular NNI [26]. Table 9 lists the performance of the main kernel predictors.

Table 9: Performance for main kernel predictors.

7 EVALUATION

7.1 Experiment Setup

We evaluate nn-Meter on the benchmark dataset (Table 2) for CPU, GPU, and VPU (Table 7).

Comparison baselines. We implement three baselines for comparison: (1) FLOPs, (2) FLOPs+MAC, (3) BRP-NAS. Baselines (1) and (2) are widely used latency predictors. Baseline (3) is the latency predictor in BRP-NAS, a state-of-the-art model-graph based predictor that uses a GCN on the NASBench201 dataset. For baselines (1) and (2), we use the FLOPs and memory access cost^6 (i.e., MAC) to estimate model latency. We train the predictors by linear regression. For baseline (3), we directly run the BRP-NAS source code [2]. Since BRP-NAS is currently implemented for cell-based models, it cannot be directly applied to the non-cell-based models in our dataset. Thus, we modify the graph representation as follows.

^6 The size of all feature maps and weights during inference.
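For concreteness, baselines (1) and (2) amount to a linear regression on one or two scalar features per model; a minimal sketch under that reading (feature extraction is assumed to be given):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_flops_baselines(flops, mac, latency):
    # flops[i], mac[i]: total FLOPs and memory access cost of model i
    # latency[i]: measured latency of model i (ms)
    flops, mac, latency = map(np.asarray, (flops, mac, latency))
    flops_only = LinearRegression().fit(flops.reshape(-1, 1), latency)   # baseline (1)
    flops_mac = LinearRegression().fit(np.stack([flops, mac], axis=1), latency)  # baseline (2)
    return flops_only, flops_mac
```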

The GCN in BRP-NAS takes as input a feature description matrix and a description of the graph structure as an adjacency matrix. BRP-NAS encodes the cell of a NASBench201 model for representation. The GCN input is a 9x6 feature matrix and a 9x9 adjacency matrix. However, for the non-cell-based models in our dataset, we have to encode the complete model graph. Therefore, we encode all the kernel nodes in a model graph. Besides, we also encode the 5-dimensional configuration as node attributes. As a result, the graph representations are larger than in BRP-NAS. Specifically, the NASBench201 model representations become a 133x22 feature matrix and a 133x133 adjacency matrix.
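To make this encoding concrete, the sketch below builds the two GCN inputs for a kernel graph. The node ordering and feature layout are illustrative choices of ours, not the exact BRP-NAS format.

```python
import numpy as np

def encode_model_graph(kernels, edges, kernel_types):
    # kernels: list of (type_name, (HW, K, S, Cin, Cout)); edges: list of (src, dst) node indices
    # Returns the node feature matrix (one-hot kernel type + 5-dim configuration)
    # and the adjacency matrix fed to the GCN.
    n, t = len(kernels), len(kernel_types)
    features = np.zeros((n, t + 5))
    adjacency = np.zeros((n, n))
    for i, (name, cfg) in enumerate(kernels):
        features[i, kernel_types.index(name)] = 1.0
        features[i, t:] = cfg
    for src, dst in edges:
        adjacency[src, dst] = 1.0
    return features, adjacency
```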

Metrics. We evaluate the prediction performance by the Root Mean Square Error (RMSE) and the relative Root Mean Square Percentage Error (RMSPE), which are standard metrics in regression. Besides, we report the $\pm5\%$ and $\pm10\%$ accuracy [13], i.e., the percentage of models whose predicted latency is within the corresponding error bound relative to the measured latency. In this paper, the $\pm10\%$ error bound is the maximum acceptable prediction error. We use $\pm10\%$ accuracy as the default metric. Smaller RMSE/RMSPE and larger $\pm5\%$/$\pm10\%$ accuracy indicate better performance.

Root-Mean-Square Deviation (RMSD) or Root-Mean-Square Error (RMSE), https://en.wikipedia.org/wiki/Root-mean-square_deviation
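These metrics follow their standard definitions and can be computed as in the following sketch (the exact normalization used in the paper is assumed to match these textbook forms):

```python
import numpy as np

def prediction_metrics(pred, true, bound=0.10):
    pred, true = np.asarray(pred, dtype=float), np.asarray(true, dtype=float)
    rmse = np.sqrt(np.mean((pred - true) ** 2))
    rmspe = np.sqrt(np.mean(((pred - true) / true) ** 2)) * 100  # relative error, in percent
    acc = np.mean(np.abs(pred - true) <= bound * true) * 100     # ±bound accuracy, in percent
    return rmse, rmspe, acc
```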

7.2 End-to-End Prediction Evaluation

7.2.1 Comparison with Baselines on Unseen Models

In real-world scenarios, a usable predictor must be able to predict unseen models (i.e., a new model). As introduced, nn-Meter requires no model-level data for building the predictors, and can make predictions on models it has not seen before. To demonstrate it, we design a k-fold cross-validation experiment as follows.

Setting. We select AlexNets, VGGs, MobileNetv1s, MobileNetv2s, and NASBench201 for the evaluation. For each model variant, we take it as the testing set (e.g., 2,000 AlexNets) and the remaining 4 model variants (e.g., the 8,000 models of VGGs, MobileNetv1s/v2s, and NASBench201) as the training set to train the baselines. Since nn-Meter predicts model latency via the sum of the predicted latencies of all kernels, it requires no model-level training data.

Results. Fig. 9 shows the prediction accuracy achieved by the different predictors. Compared with the baselines, nn-Meter is the only approach that consistently achieves accurate predictions on the various devices. None of the baselines achieves comparable performance for unseen models on any device. Specifically, on average, nn-Meter achieves 89.2% accuracy, significantly better than FLOPs (22.1%), FLOPs+MAC (17.1%), and BRP-NAS (8.5%) on the three devices. The FLOPs/FLOPs+MAC predictors achieve better accuracy on the CPU than on the VPU and GPU. This is because, on these accelerators, operator fusion plays a more important role in latency reduction than on the CPU due to the more serious memory-wall issue. However, FLOPs/FLOPs+MAC ignores the impact of operator fusion. For the BRP-NAS baseline, the performance is consistently poor on all three devices. As discussed in Section 2.1, the reason is the model-graph differences between the training and testing sets. The GCN learns the representation of model graphs. Although the five model variants have largely overlapping operator types, the operator configurations, edges, and model latency ranges are different.

Figure 9: Compared to baseline predictors, nn-Meter achieves much higher $\pm10\%$ accuracy on unseen models.

To further demonstrate the effectiveness of nn-Meter on unseen models, we calculate the kernel configuration overlaps between the sampled data and our benchmark dataset. A low ratio indicates a high generalization ability of the kernel predictors. Results show that our kernel predictors have seen only 5.9% (CPU), 9.4% (GPU), and 5.0% (VPU) of the configurations in the dataset, but can accurately predict the remaining unseen ones.
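The overlap ratio here is simply the fraction of dataset configurations already present in the sampled training data; a minimal sketch of how such a ratio could be computed:

```python
def config_overlap(sampled_configs, dataset_configs):
    # Fraction of kernel configurations in the benchmark dataset that were already
    # seen (i.e., sampled and measured) when building the kernel predictors.
    seen = set(map(tuple, sampled_configs))
    dataset = set(map(tuple, dataset_configs))
    return len(dataset & seen) / len(dataset)
```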

7.2.2 nn-Meter Results and Analysis

We now provide results of nn-Meter on the full benchmark dataset (in Table 2). We predict the latency of 26,000 models on each evaluated device. Remarkably, we achieve 99.0% and 99.1% prediction accuracy on the mobile CPU and GPU, respectively. On the Intel VPU, we can predict 83.4% of the models within the $\pm10\%$ error bound. Table 8 lists the performance for each model variant on the three devices. We can see that the strong performance (small RMSE and high accuracy) generalizes across various devices, which have vastly different latency behaviors. Significantly, >95% of all model variants on the CPU and GPU have a <10% prediction error. nn-Meter even reaches an impressively high $\pm5\%$ accuracy on the GPU. On the VPU, we notice that nn-Meter achieves relatively low accuracy for the AlexNets, VGGs, and NASBench201.

Table 8: End-to-end latency prediction for 26,000 models on mobile CPU, GPU and Intel VPU.

To better investigate the performance on the VPU, we divide the dataset into 4 groups by the measured model latency. As shown in Fig. 10, the relatively large errors come from models with very small (i.e., NASBench201 and AlexNet models <10 ms) or very large (i.e., VGG models >100 ms) latency. Fortunately, these models account for a small percentage of our dataset. In a real-world scenario, models with very small latency are more likely to have lower classification accuracy, and models with very large latency (i.e., the average FLOPs of VGGs is 28,422M) are rarely considered for edge devices.

Figure 10: Prediction errors on the VPU. X-axis label: latency range/group size percentages of the dataset.

We now perform a manual analysis to explain the failure cases on the VPU. First, model latency is the sum of all kernels' latencies. The accumulated kernel prediction errors strongly impact the AlexNets and NASBench201, which have low inference latency. If we relax the prediction error bound to 15%, we reach 86.0% accuracy on AlexNets and 73.3% accuracy on NASBench201. Second, the latency of a single kernel can be significantly different from that in a complete model. For example, for Conv+bn+relu with a large configuration of (28, 7, 1, 819, 768), the latency is 628.7 ms as a single kernel, but becomes 188.6 ms in a VGG model variant. By comparing the execution graphs of Conv within/without a model, we find that the VPU performs ad-hoc optimizations that merge the computation of Conv+bn+relu and the next maxpool layer in VGGs. This only happens for very large Conv+bn+relu. We will further discuss it in Section 8.

7.3 Microbenchmarks

Kernel-level prediction. As the core component, the kernel detector automatically detects the kernels on each device. The kernel-level prediction diminishes the latency differences caused by operator fusion. To demonstrate the effectiveness, we build an operator-level baseline. It predicts the latency of all operators in a model and sums them up as the model latency. Since the latency difference between Conv+bn+relu and Conv is negligible (the same holds for DWConv+bn+relu and DWConv), we use the Conv+bn+relu predictor for Conv. We build extra predictors for relu, bn, and add. Their $\pm10\%$ prediction accuracy is high, ranging from 87% to 98%.

We test the operator-level baseline on our dataset. Fig. 11 shows the $\pm5\%$ and $\pm10\%$ accuracy achieved by the two approaches. On all devices, our kernel-level prediction consistently outperforms operator-level prediction. We observe that operator-level prediction performs unstably across devices. For the $\pm10\%$ accuracy, it achieves relatively high accuracy on the CPU (91.3%) and GPU (53.7%), but only 8.5% on the VPU. The reason is that the element-wise operators take very small latency percentages ($\approx$ 0.1%-15%) of the model on the CPU and GPU, but high latency percentages (up to 50%) on the VPU. Therefore, the baseline still achieves high prediction accuracy on the CPU and GPU for the Conv-dominated models (e.g., VGGs). However, operator-level prediction does not work for non-Conv-dominated models. Specifically, it achieves only 65.7% and 48.2% accuracy on the CPU for MobileNetv2s and NASBench201, respectively. The performance is worse on the GPU: only 2.0% of MobileNetv2s and 6.5% of ProxylessNass models are within the $\pm10\%$ error bound. On the VPU, it has only 8.5% accuracy because it does not consider the impact of operator fusion.

diminish [dɪˈmɪnɪʃ]: v. to reduce, weaken, or lessen
negligible [ˈneɡlɪdʒəb(ə)l]: adj. too small to be worth considering

Figure 11: The operator-level approach achieves much lower $\pm5\%$ and $\pm10\%$ accuracy on the three devices.

Adaptive data sampling. We now evaluate adaptive data sampling on its two key aspects: the sampling efficiency for Conv and its effectiveness for model prediction. We compare it with random sampling. For each device, we randomly sample the same amount of Conv data as ours (as shown in Table 6).

First, we compare the sampling efficiency for Conv of the two approaches. Due to the large sample space, the sampled data are very different. For a fair comparison, we report the performance on the initial test set that contains 2,800 data points (refer to Section 5.2).

Table 10 shows the RMSE and the $\pm10\%$ accuracy. Under the same sampling budget, adaptive data sampling achieves a much smaller RMSE and higher accuracy than random sampling. We observe that random sampling generates many large but rarely-considered Conv configurations (e.g., the configuration (224, 7, 4, 2141, 1876) has a size of 750 MB).

Table 10: Under the same amount of sampled data, we achieve better performance than random sampling.

Then, we compare the model prediction accuracy achieved by predictors trained with randomly sampled and adaptively sampled data. We evaluate the performance on the 2,000 AlexNets as they are Conv-dominated models. Note that we only change the predictors of Conv and keep the others the same. When adopting the Conv predictor trained with randomly sampled data, the accuracy drops heavily to 5.8%, 32.3%, and 0% on the three devices.

7.4 Generalization Performance

In the previous sections, we built and tested latency predictors for the three types of hardware (in Table 7) and demonstrated the high prediction accuracy of nn-Meter. We now discuss the generalization performance of nn-Meter on a new edge platform. The experimental device is a Pixel3XL phone with a mobile Adreno 630 GPU, an older version than the Adreno 640 GPU in Table 7. For testing, we measure the inference latency of the dataset in TFLite 2.1 on the Adreno 630 GPU. The evaluation has two parts.

First, we measure the cross-device generalization performance. We use the existing latency predictors trained for the Adreno 640 GPU to predict the model latency on the Adreno 630 GPU. As shown in Table 11, the non-Conv-dominated models (i.e., the MobileNet series, MnasNets, and ProxylessNass, which contain both Conv and DWConv kernels) achieve high prediction accuracy. However, the Conv-dominated model variants (i.e., AlexNets, VGGs, GoogleNets, etc.) achieve much lower prediction accuracy, with >15% RMSPE. The reason is that the Conv kernel runs faster on the Adreno 640 GPU than on the Adreno 630 GPU, while the DWConv kernel has a similar inference latency on the two versions of the Adreno GPU.

Second, we rebuild the latency predictors for the Adreno 630 GPU and use them to predict model latency. The results then become very promising, as shown in Table 11. In total, nn-Meter achieves 99.0% prediction accuracy with the rebuilt latency predictors on the Adreno 630 GPU. The rebuilding cost is acceptable, as evaluated in the next section.

Table 11: Two different latency predictors for model inference on the Adreno 630 GPU.

7.5 System Overhead

Finally, we evaluate the system overhead. As shown in Table 12, most of the overhead comes from the measurement time. In total, it takes 2.5 days, 1 day, and 4.4 days to measure the latency of all sampled kernels on a single CPU, GPU, and VPU device, respectively. The measurement cost can be scaled down linearly by adding more devices, which indicates that it requires little effort to build predictors for a new device with nn-Meter.

Table 12: Time cost of nn-Meter.

8 DISCUSSION

Prediction for language models. Current edge inference backends mainly support CNN models but not language models. For example, TFLite does not support BERT-mini [31] inference. Therefore, nn-Meter is only evaluated on CNN models in this paper. The technique, however, should also be applicable to language models since they are also DAGs composed of operators.

Limitations. There might be some ad-hoc optimizations or implementations in the frameworks for certain unknown conditions, such as the specific input sizes discussed in Section 7.2. For black-box backends, it is not yet feasible to find common rules behind these ad-hoc optimizations to design test cases for detection. However, these ad-hoc optimizations are rare, and their impact on prediction accuracy is limited.

For new inference backends and significant updates on available backends, the predictor building process should be done again to meet the new implementation. The major cost is the data profiling shown in Table 12.

There are also backends that generate different kernel codes and search for the fastest one for each input size, such as TVM [9]. nn-Meter could also build predictors for these backends based on the searched kernel implementations. However, these backends are rarely used on edge platforms because of the large time cost for code generation and search (it easily takes hours for one configuration). Therefore, nn-Meter has not built predictors for this kind of backend. We leave it for future work.

The current nn-Meter predictors are built offline and are not updated dynamically during the inference phase. It is possible to integrate more dynamic resource factors into the predictors, such as the current CPU utilization. This can also be a future research direction.

Concurrent execution. The design of nn-Meter is based on the fact that current kernels run sequentially on edge chips (refer to Section 2.3). For possible future inference where kernels may run concurrently on heterogeneous edge chips or multi-cores, a potential solution is to extend nn-Meter with a static-analysis phase that works out the kernel execution plan first. The predicted model latency would then be the sum of kernel latencies on the longest sequential path, plus the synchronization cost. The prediction accuracy would possibly be lower than for a purely sequential run, since concurrent execution introduces more uncertainties.
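The longest-path idea sketched above corresponds to a critical-path computation over the kernel DAG; a minimal sketch under these assumptions (the synchronization cost is modeled as a single additive term):

```python
def critical_path_latency(kernel_latency, edges, sync_cost=0.0):
    # kernel_latency: dict node -> predicted kernel latency; edges: list of (src, dst) pairs
    succ = {n: [] for n in kernel_latency}
    indeg = {n: 0 for n in kernel_latency}
    for s, d in edges:
        succ[s].append(d)
        indeg[d] += 1
    # Longest-path dynamic programming over a topological order of the kernel DAG
    finish = dict(kernel_latency)
    ready = [n for n, d in indeg.items() if d == 0]
    while ready:
        n = ready.pop()
        for d in succ[n]:
            finish[d] = max(finish[d], finish[n] + kernel_latency[d])
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    return max(finish.values()) + sync_cost
```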

Power prediction. It should be straightforward to extend nn-Meter to predict kernel power or energy by training the predictors using measured power or energy data. However, it will be difficult to conduct thermal modeling, since heat dissipation depends on the external environment, which is hard to model.

9 RELATED WORK

Unaware of runtime implementation. Current CNN design uses high-level APIs, which are independent of the runtime implementation. Besides, most runtimes are closed source. Therefore, many CNN latency predictors rely only on CNN model features without considering the runtime implementation. Some works [15, 22, 23, 30] simply use the FLOPs and MAC of the model as proxies for latency, or use them as feature inputs to regressors [28] to predict latency. However, these methods are inaccurate because they neglect the runtime behaviour differences of various operators. Similar to our paper, NeuralPower [5] and PALEO [27] predict latency for operators or layers and sum them up as the model latency. They are inferior to nn-Meter because they ignore the runtime's model-graph optimizations.

BRP-NAS [13] can learn both the operator latency and the graph optimization by encoding the operator types and the graph as features for a GCN prediction model. However, as we have shown, its generalization ability to new CNN models with diverse numbers of operators is low, and the connection distance it can learn is limited.

unaware [ˌʌnəˈweə(r)]: adj. not knowing or realizing
proxy [ˈprɒksi]: n. a stand-in or substitute measure
inferior [ɪnˈfɪəriə(r)]: adj. lower in quality or rank

Prediction on operator implementation. Some operator-latency predictors use machine learning methods to learn latency from low-level implementations. They either use code features and simple regression models to predict operator latency [1, 16], or use a costly DNN code-embedding approach [19, 25] to avoid feature engineering. TVM [10] uses both approaches to accelerate its code search process. Its embedding-based latency predictor recursively uses a TreeGRU model to embed a low-level AST into a vector and then maps it to the predicted latency using a linear layer. The other predictor uses code features such as memory accesses, data reuse, vectorization, and unrolling as inputs to an XGBoost model to predict latency. However, since most edge DNN runtimes are closed source, it is infeasible to use these code-based methods.

There are also analytical latency prediction methods generally used by language compilers (e.g., LLVM-MCA [3] and IACA [18]), and cycle-accurate hardware simulations (e.g., gem5 [4] and GPGPU-Sim [20]). These methods require knowledge of the exact mechanisms of the processor, which is also infeasible for black-box edge AI hardware.

analytical [ˌænəˈlɪtɪk(ə)l]: adj. relating to or using analysis
In computer science, an Abstract Syntax Tree (AST), or simply syntax tree, is an abstract representation of the syntactic structure of source code. It represents the structure of a program as a tree, where each node denotes a construct occurring in the source code.

10 CONCLUSION

We propose nn-Meter, a kernel-based prediction system that accurately predicts the latency of DNN models on diverse edge devices. nn-Meter introduces kernel detection that captures the various operator-fusion behaviours. By sampling the most beneficial data, nn-Meter efficiently builds latency predictors for kernels. We demonstrate the effectiveness of nn-Meter with experiments on a large dataset and three types of edge devices.

REFERENCES

microsoft/nn-Meter (source code), https://github.com/microsoft/nn-Meter
nn-meter 2.0 (PyPI package), https://pypi.org/project/nn-meter/

[2] Eagle: Efficient and Agile Performance Estimator and Dataset, https://github.com/SamsungLabs/eagle
[31] Well-Read Students Learn Better: On the Importance of Pre-training Compact Models, https://arxiv.org/abs/1908.08962
[33] A First Look at Deep Learning Apps on Smartphones, https://arxiv.org/abs/1812.05448.
