[Paper Reading] Model-Switching: Dealing with Fluctuating Workloads in Machine-Learning-as-a-Service Systems

A workshop paper from HotCloud '20.

Abstract

Machine learning (ML) based prediction models, and especially deep neural networks (DNNs), are increasingly being served in the cloud in order to provide fast and accurate inferences.

However, existing ML serving systems have trouble dealing with fluctuating workloads and either drop requests or significantly expand hardware resources in response to load spikes.

In this paper, we introduce Model-Switching, a new approach to dealing with fluctuating workloads when serving DNN models.

Motivated by the observation that end-users of ML primarily care about the accuracy of responses that are returned within the deadline (which we refer to as effective accuracy), we propose to switch from complex and highly accurate DNN models to simpler but less accurate models in the presence of load spikes.

We show that the flexibility introduced by enabling online model switching provides higher effective accuracy in the presence of fluctuating workloads compared to serving using any single model. We implement Model-Switching within Clipper, a state-of-the-art DNN model serving system, and demonstrate its advantages over baseline approaches.

(For background on Clipper, see the paper "Clipper: A Low-Latency Online Prediction Serving System".)

1 Introduction

Deep neural networks (DNN), currently the state-of-the-art in machine learning (ML), are being increasingly deployed as cloud services to provide highly-accurate inferencing (or predictions) for a range of applications. Systems such as Clipper [10] and Tensorflow serving [31] have been developed to ease the challenges in the deployment, optimization, and maintenance of DNN based machine-learning-as-a-service (MLaaS).

Like other cloud services, MLaaS has quality of service (QoS) requirements in the form of service level agreements (SLAs) between the user and the cloud provider that provide guarantees on request latency, throughput and reliability. For ML, however, the prediction accuracy (or simply accuracy) of the model is also a critical metric but has not traditionally been encapsulated in SLAs.

(An SLA spells out the quality-of-service standards the provider commits to deliver.)

In this paper, we make two observations to address this gap. First, we observe that for ML workloads, clients are interested not in the fraction of predictions returned within a deadline, but instead the fraction of correct predictions returned within the deadline.

We refer to this metric as the effective accuracy; in Section 3, we show that the effective accuracy is the product of the ML model’s accuracy and its deadline meet rate. Second, we observe that an SLA specified in terms of effective accuracy enables flexibility in dealing with spikes in workload, for instance, those observed during events like Black Friday.

(This is the effective accuracy metric already mentioned in the abstract.)

The flexibility arises from the fact that deep learning models of varying computational complexity and accuracy can be trained for the same application. We observe that as load increases, the cloud provider can swap out complex and highly accurate models for more computationally efficient models while preserving effective accuracy.

(In other words, accept a bounded accuracy loss in order to keep up with the request load.)

We refer to this approach as Model-Switching. From the cloud providers' perspective, Model-Switching allows meaningful trade-offs to be made between the computational cost and the service accuracy.

For instance, consider a web-serving application, which employs high-performance machine-learning models to recommend relevant items to users or show them ads. In such applications, whenever there is a spike in user load, cloud providers either throttle the serving rate of the application or scale-out computational resources to meet the demand and thus the latency SLA.

However, in the former, there is a hit to the throughput SLA (or servable queries-per-second), and in the latter, there is an associated hardware cost for the cloud provider. A third alternative is to guarantee effective accuracy. With this approach, latency and throughput SLAs can always be met at the cost of limited accuracy loss. Without this knob, clients have to give up on either latency or throughput under spiking load.

To illustrate the benefits of the proposed approach, we develop and evaluate Model-Switching, an online scheduler built on top of Clipper [10] that monitors and adapts to workload fluctuations by switching between a set of pre-trained models for image classification. To further improve efficiency, Model-Switching also optionally determines the optimal number of threads and replicas for each model. Our evaluation shows that Model-Switching yields the highest effective accuracy for all deadline constraints compared with serving with each single model by itself.

The remainder of the paper is organized as follows: Section 2 provides a brief overview of deep learning and the limitations of existing MLaaS frameworks; the proposed effective accuracy, our new QoS metric, and the model-switching framework are described in Section 3, followed by results from a preliminary evaluation in Section 4. Section 5 describes related work, while Section 7 lists the limitations of the current approach and future work. Section 6 concludes the paper.

2 Background and Motivation

We begin by briefly describing deep learning inference and the limitations of existing MLaaS approaches.

2.1 DNNs and MLaaS

DNN Basics State-of-the-art DNNs are typically trained using GPU-enabled machine learning frameworks [7,24,32] like PyTorch, TensorFlow, or Caffe to obtain the model weights. Trained models can then be deployed into IoT and embedded devices for inferencing (i.e., to render predictions); in practice, state-of-the-art DNN models can be computationally demanding and hence inference is often outsourced to the cloud.

The execution time of DNN inference depends on its depth and the size of each layer's feature maps and filters. Fig. 1 shows the execution time for a family of ResNet models with varying depth (for example, ResNet-18 has 18 layers). From the figure, we can observe a more than 5× spread in execution times and that more complex, deeper models are also more accurate. (Also shown in the figure is the impact of thread-level parallelism on each model's execution time, which will be discussed later in Section 3.2.1.)

MLaaS Framework Several DNN prediction serving systems have been built [10, 31, 39] to ease the deployment, optimization, and maintenance of DNN inference services. In these systems, DNN models are usually deployed into containers, or servables. State-of-the-art serving systems also enable model versioning, updates, rollbacks, replication, etc. At runtime, they enable caching, adaptive batching, and ensembling, together with auto-scaling policies [8, 9, 16–18, 30] to sustain QoS and manage hardware resources. Nonetheless, these systems have some drawbacks, described next.

2.2 Limitations of Existing MLaaS Frameworks

Despite recent advances, state-of-the-art MLaaS frameworks still do not perform well in the presence of highly fluctuating loads or load spikes. Existing frameworks like Swayam [18] and Clipper [10] choose to violate SLAs in the presence of load spikes in order to conserve hardware resources.

Both systems internally track each request's deadline in the queue and prune (or drop) the request if the queuing delay will result in a deadline miss. As a consequence, Swayam is only able to guarantee that 96% of requests return within the deadline during load spikes, even though the target SLA requires 99% of the jobs to meet the deadline.

In return, Swayam provides 27% resources savings compared to a baseline that scales hardware resources in response to load spikes. To summarize, existing MLaaS serving systems offer an unappealing trade-off for bursty workloads: either violate SLAs or incur significant hardware overheads. In this paper we show that this trade-off can be averted using the proposed model switching scheme and without the need to scale up resources.

3 Model-Switching: Our Approach

We start with the new metric, i.e., effective accuracy for MLaaS systems, and then introduce our Model-Switching Framework.

3.1 Effective Accuracy

We argue that for MLaaS, response time (latency) alone does not capture end-users’ expectation; instead, users are interested in both fast and accurate responses. To this end, we define effective accuracy within a deadline constraint as the fraction of correct (or accurate) predictions returned within the deadline.

We will assume that users are agnostic to which model the service provider uses as long as the SLA, specified in terms of effective accuracy, is met.

Notation used in the paper's Eq. (1):

M: the library of pre-trained ML models, M models in total

i: the index of the i-th ML model

λ: the request arrival rate

Explanation of Eq. (1): Effective Accuracy = Accuracy × SLA Achievement Ratio

Here Accuracy is the model's prediction accuracy on a given dataset, and the SLA Achievement Ratio (a_i) is the fraction of requests completed within the given deadline.
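
A compact restatement with a small worked example (the symbols follow the note above rather than the paper's exact notation, and the numbers are purely illustrative):

$$
A_i^{\mathrm{eff}}(\lambda, D) \;=\; A_i \times a_i(\lambda, D),
\qquad \text{e.g. } 0.78 \times 0.90 = 0.702 \;<\; 0.76 \times 1.00 = 0.76 .
$$

So under load, a 76%-accurate model that always meets the deadline beats a 78%-accurate model that misses it for 10% of requests.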

Note that the execution time of a DNN is typically fixed and input independent; consequently, deadline misses are statistically independent of mis-classifications, thus enabling us to express the effective accuracy as a product of probabilities.

(Is the execution time really input-independent? I would love to see an explanation. In any case, this independence assumption is exactly what lets the authors write effective accuracy as a product of probabilities, though the rigor of this step seems debatable.)

Figure 2 shows the effective accuracy for five ResNet models with increasing load (requests/second), assuming a deadline of 750 ms. At low loads, the most complex ResNet model achieves the highest effective accuracy: it has the highest baseline accuracy and, since queuing delay is negligible under low loads, it always meets the deadline. However, as load increases, the simpler models have higher effective accuracy than the more complex models because the latter incur a much higher fraction of deadline misses.

3.2 Online Model-Switching

Based on the observations above, our online model-switching framework monitors and predicts future job arrivals and switches between models to maximize effective accuracy. In addition, once a model is picked, the framework also selects the optimal number of threads and replicas of the model given the hardware constraints. We begin by discussing the impact of number of threads and model replicas on performance.

3.2.1 Job-level and Thread-level Parallelism

DNN model-based microservices, along with other general cloud computing workloads, offer a variety of opportunities in terms of parallelism [12, 28].

In the MLaaS setting, as requests queue up for processing, two decisions can be made to exploit parallelism at different granularities: (1) How many requests can be serviced in parallel? The answer depends on the number of microservice replicas (R) we have in the system. (2) Once a request is assigned to one of the DNN models, how many threads (T) should be allocated to this microservice (as shown in Fig. 1)?

(Note that the results in Fig. 1 were obtained on a 32-core CPU.)

In this paper, we assume a fixed capacity C of computing resources (i.e., CPU cores) for serving job requests. (If required, the proposed approach can easily be combined with auto-scaling frameworks that increase C in response to spikes in workload.) Under this assumption, any combination of <R, T> that satisfies R × T = C can be chosen.

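For concreteness, here is a tiny Python sketch of the feasible configurations under this constraint (C = 16 matches the 16 vCPUs allocated to model containers in the evaluation; the helper name is mine):

```python
def feasible_configs(capacity: int):
    """Enumerate all <R, T> pairs with R * T == capacity (replicas x threads per replica)."""
    return [(r, capacity // r) for r in range(1, capacity + 1) if capacity % r == 0]

# For 16 cores: [(1, 16), (2, 8), (4, 4), (8, 2), (16, 1)]
print(feasible_configs(16))
```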

To understand the impact of the choice of <R, T> on performance, we deploy several ResNet models in Clipper [10] and measure the end-to-end query latency by varying <R, T> combinations at varying load levels. More information about Clipper and the experimental setup can be found in Section 4.

Fig. 3 summarizes the 99th percentile (P99) query latency for the ResNet-50 and ResNet-152 models under five different <R, T> configurations. It can be observed that judiciously picking <R, T> is necessary; the optimal number of threads T decreases with increasing load. Qualitatively similar results hold for the other ResNet models but are not shown here due to space constraints.

(P99 means that 99% of all observed values are less than or equal to it.)

3.2.2 Request Rate Prediction

In this paper, we use event-based windowing to monitor load at run-time, and use the load measured in a given window as a predictor for the next window.

Clipper records each incoming request's arrival timestamp internally. Our Model-Switching controller periodically estimates the arrival rate from the youngest and oldest timestamps within a fixed-size window. The window size can be tuned offline to improve responsiveness. A similar approach was also used in [35].

(In other words, request rate = number of requests in the window / window time span, and the window size can be tuned offline.)
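
A minimal sketch of this estimator (the class and field names are mine; Clipper's internal bookkeeping differs):

```python
from collections import deque

class RateEstimator:
    """Estimate the request rate from the newest/oldest timestamps in a fixed-size event window."""

    def __init__(self, window_size: int = 100):
        self.window = deque(maxlen=window_size)  # arrival timestamps in seconds

    def record_arrival(self, timestamp: float) -> None:
        self.window.append(timestamp)

    def estimated_rate(self) -> float:
        """Requests per second over the current window, used as the prediction for the next window."""
        if len(self.window) < 2:
            return 0.0
        span = self.window[-1] - self.window[0]  # youngest minus oldest timestamp
        return (len(self.window) - 1) / span if span > 0 else float("inf")
```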

3.2.3 Rule Based Model-Switching

With pre-characterized information about the P99 latency of each model, and the request-rate estimate, the Model-Switching controller searches over all M models and their <R, T> configurations to pick the model and configuration with the highest effective accuracy for the specified deadline.

Note that doing so maximizes the chances that any given SLA constraint specified in terms of effective accuracy would be met. While our goal of maximizing effective accuracy might create some slack between offered and required service quality, this slack can be exploited by scaling down hardware resources (although we do not explore resource scaling in this paper).

Although our Algorithm 1 is general and able to explore dynamic <R, T> allocation, we can see that <R:4, T:4> works effectively on ResNets across the board for most target deadlines, as shown in Fig. 3. In our evaluations, we statically pick this configuration and do not consider this problem further in this paper.

The controller walks from the most accurate model (in practice, the largest ResNet) down to smaller ones and returns as soon as a feasible choice is found, so the available resources are used as fully as possible. (Personally, I doubt a real deployment can always manage this; memory and other constraints would also need to be taken into account.)
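
A minimal Python sketch of that selection rule as I read it (the profiling table, load buckets, and names are placeholders, not the paper's Algorithm 1 verbatim):

```python
def bucketize(load, levels=(50, 100, 200, 400)):
    """Map a measured load (requests/s) to the nearest pre-profiled load level (illustrative levels)."""
    return min(levels, key=lambda lvl: abs(lvl - load))

def pick_model(models, p99_latency_ms, load_estimate, deadline_ms):
    """Pick the most accurate model whose pre-characterized P99 latency fits the deadline.

    models: list of (name, baseline_accuracy), sorted from most to least accurate.
    p99_latency_ms: dict mapping (name, load_level) -> measured P99 latency in ms.
    """
    level = bucketize(load_estimate)
    for name, _accuracy in models:  # most accurate first
        if p99_latency_ms.get((name, level), float("inf")) <= deadline_ms:
            return name             # first feasible model maximizes effective accuracy
    return models[-1][0]            # nothing fits: fall back to the fastest model
```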

3.2.4 Additional Considerations

Model Setup Times ML backend setup incurs large provisioning delays (e.g., a few seconds) due to massive I/O operations. To alleviate this issue, in this work, we pre-deploy all candidate models, relying on the fact that the RAM resources are usually abundant and under-utilized.

(That is, the model backends are effectively started ahead of time.)

Similar ideas have also been proposed recently in multi-tenant serving systems aiming for better resource utilization [29, 33]. We also quantify the actual memory cost of hosting 5 models simultaneously in Section 4. More discussion is available in Section 7.

CPU Resource Contention Hosting multiple DNN models may incur CPU performance overhead. However, at any given time, only one model needs to be in active mode. To minimize the performance impact, we reduce the CPU priority of the remaining "inactive" models via the OS-level scheduler. We validated this approach and found that, with this optimization, performance is almost the same as when deploying a single model by itself.

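The paper does not spell out the mechanism; one plausible realization on Linux (a sketch assuming the worker PIDs of each model container are known) is to lower the niceness of the inactive models' workers:

```python
import subprocess

def deprioritize(pids):
    """Give the inactive models' worker processes the lowest CPU priority (niceness 19)."""
    for pid in pids:
        subprocess.run(["renice", "-n", "19", "-p", str(pid)], check=True)

def reprioritize(pids):
    """Restore normal priority for the newly activated model (may require elevated privileges)."""
    for pid in pids:
        subprocess.run(["renice", "-n", "0", "-p", str(pid)], check=True)
```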

4 Evaluation

We build our Model-Switching controller into Clipper [10], an open-source online prediction serving system. To keep the focus on this study, we disable caching and dynamic batch-size adaptation in Clipper.

System Configuration We use a dedicated Azure Virtual Machine (VM) with 32 vCPUs and 128GB of RAM for Clipper model serving and our Model-Switching controller. For the client, we have another separate VM (8 vCPUs and 32GB RAM) to send image queries. We set batch size of 1 when posting the request.

Inference Models We primarily look at deep residual networks (ResNets) [22] with varying numbers of layers, baseline accuracies, and execution times, shown in Fig. 1. Each model is pre-trained in PyTorch [32] on ImageNet [13] and deployed into containers with <R:4, T:4> as microservices (discussed in Section 3.2.1) in Clipper. At any given time, only one model's containers are in the active state.

Workload Generator The load generator tries to emulate user behavior with a Markov model [11, 14]. It operates in an open system model [34], i.e., new jobs arrive independently of job completions and following Poisson inter-arrivals [11, 35]. We also validated the generated workload with a trace of job arrivals from a production system deployed in industry.

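A minimal open-loop sketch of such a generator (constant rate only; the Markov-modulated rate changes and the exact Clipper query format are omitted, and the endpoint and parameter names are illustrative):

```python
import random
import threading
import time

import requests  # assumes the serving frontend accepts HTTP POST image queries

def send_query(url):
    """Post one image query in its own thread so arrivals stay independent of completions."""
    with open("query.jpg", "rb") as f:  # placeholder query image
        requests.post(url, files={"image": f})

def generate_load(url, rate_qps=50.0, duration_s=60):
    """Open system model: exponential inter-arrival gaps (mean 1/rate_qps), i.e. Poisson arrivals."""
    end = time.time() + duration_s
    while time.time() < end:
        time.sleep(random.expovariate(rate_qps))
        threading.Thread(target=send_query, args=(url,), daemon=True).start()
```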

Model-Switching Controller The controller runs as part of the Clipper serving system with a sample period of 1 second and tracks the most recent incoming queries to the TASK QUEUE to measure load. It then determines and switches to the best model with a target SLA deadline of 750 ms. The deadline is selected to make sure the largest ResNet-152 is a feasible solution at low load.

4.1 Results

Fig. 4a shows the load profile (queries/second) over a 300-second period for the industrial trace and the models selected by the Model-Switching controller during this period. From the figure, we can see that the controller selects the most accurate but also most computationally expensive ResNet-152 model when the load is relatively low.

For moderate loads, the controller switches between ResNet-101 and ResNet-152. However, when the load spikes at the 125-second mark, the controller can quickly adapt and serves requests using the smaller ResNet-50, ResNet-34, or even ResNet-18 models. Note that all the switching happens in real-time, together with the prediction serving. The results account for all overheads of switching between models.

Effective Accuracy Fig. 4b shows the effective accuracy of the proposed Model-Switching controller compared to baselines that make use of a single model only. The results are shown for deadline constraints ranging from 700 ms to 1500 ms using the same workload trace as in Fig. 4a. We observe that since both ResNet-18 and ResNet-34 are fast and never miss deadlines, their effective accuracies are simply equal to their baseline accuracies.

The effective accuracy of the larger ResNet-50, -101, and -152 models reduces as the deadline constraints become tighter due to high deadline miss rates. In contrast, Model-Switching yields the highest effective accuracy for all deadline constraints — we note that this effective accuracy would be unachievable by the smaller models and only achievable by the larger models by introducing additional hardware resources.

My reading: under the current hardware and a fairly tight deadline, the large models end up with low effective accuracy. Yet in Fig. 4a, with the 750 ms deadline, the controller still picks the large models most of the time, and under this constraint their (effective) accuracy falls short; so the authors seem to suggest that upgrading the hardware might avoid this problem and let Model-Switching show its real value? (But if that is the case, how did the large models pass the accuracy screening earlier? Then again, Algorithm 1 as described does not appear to impose an explicit accuracy threshold.)

But then, under hardware and deadline constraints, could one instead adopt a small-model-first policy, and dress it up as a "decide case by case" mechanism to improve generality?

Tail Latency To better understand how our Model-Switching adapts to load fluctuations, we plot the empirical CDF (Fig. 4c) of end-to-end latency observed by a client submitting the requests in Fig. 4a. We compare the percentile latency of Model-Switching with baselines that each serve requests using a single model. Several observations from the plot: (1) Small models, such as ResNet-18 and ResNet-34, guarantee that all queries finish within the deadline but have low baseline accuracies. (2) Larger and more accurate models (e.g., ResNet-152 and ResNet-101) suffer long tail latency, resulting in several deadline misses. (3) Model-Switching, as it stands, seeks to combine the best of both worlds, i.e., meeting deadlines while serving requests using the most accurate models whenever possible.

(Here CDF refers to the empirical cumulative distribution function. The result feels a bit caught between the two extremes.)

CPU and RAM Usage In all our experiments above, we allocate a total of 16 vCPUs (<R:4, T:4>) for active model containers and limit the CPU utilization of each inactive model container to less than 1%. 8 vCPUs are configured to handle the client's HTTP POST requests, and the remaining vCPUs serve the other Clipper components. The multi-tenant ResNet models, with a total of 20 replicas, occupy about 11.8% (15.1 GB) of the total system RAM.

5 Related Work

There are an increasing number of studies on different aspects of MLaaS. The most relevant for our paper are those that introduce platforms and characterize their performance, and those that optimize QoS and resource allocation.

ML Inference Serving Frameworks A conventional way to deploy an ML service is to provision containers, or servables, to host ML models. Examples include Clipper [10], Tensorflow Serving [31], and Rafiki [39]. These frameworks aim to reduce the cost of deploying, optimizing (for latency and throughput), and maintaining DNN-based MLaaS.

Our work can be easily implemented on top of these existing frameworks with minimal modifications (indeed, we build our proposed solution within Clipper).

Auto-scaling Policies Auto-scaling policies are used to guarantee response time SLA while maximizing resource efficiency [3, 4, 18]. However, in the event of load spikes, these existing auto-scaling policies fail to capture the dynamics in time (since bringing up new hardware resources is time consuming) and increase resource usage. In this paper, we aim to solve this problem without scaling hardware resources but by exploiting model diversity while providing high QoS.

Model Accuracy vs. Performance Several works have attempted to trade off model accuracy and performance. For example, static pruning (compression), quantization, and neural architecture search approaches [25, 36, 41, 43, 44] can generate a family of model versions that can be switched at run-time. Input-redundancy techniques such as NoScope [27] and Focus [23] are primarily targeted at video queries (where a lot of redundancy exists in the transferred data between frames).

In our case, we are targeting image (and presumably textual) queries from different users. In such scenarios, the opportunity to exploit input redundancy may be lower than that in video queries. Other input-dependent cascade methods [37, 40], also fit nicely with our model-switching framework wherein classifiers of different complexities could be switched in and out at run-time in response to the work load.

Another related work [20] exploits model diversity by exposing latency/accuracy trade-offs to users, while we focus on automatically switching between models to optimize effective accuracy.

6 Conclusion

In this paper, we argue that for MLaaS, the prediction accuracy (or simply accuracy) of the model is also a critical metric but has not traditionally been encapsulated in SLAs. We propose effective accuracy, a new metric that should be considered when evaluating the performance of such systems. To achieve better effective accuracy while serving prediction requests, Model-Switching dynamically selects the best model according to the load and the pre-characterized model performance. We evaluated our framework on a real MLaaS system.

7 Challenges and Discussion

Several challenges lie ahead of us before we can achieve our goal of an automatic and low cost Model-Switching controller for MLaaS.

Reducing Memory Overheads Containerization disallows any sharing of host machine resources. There is a trade-off between reducing cold-start time and memory resources. A model can be invoked quickly when it is already in memory and does not require a cold start. However, keeping all models in memory at all times is prohibitively expensive and does not scale well. Ideally, we want a method that provides the illusion that all models are always warm, while spending resources as if they were always cold. Some work on this can be found in [5, 6].

Dynamic Replica and Thread Allocation Currently, we statically set the model replica and thread configuration before the deployment for simplicity. We are exploring more practical ways to implement this allocation online in real time.

Integration with Existing Auto-scaling Frameworks In this work, we assume a fixed capacity C of computing resources (i.e., CPU cores) for backend serving. However, in practice, there may be a certain amount of resources available to scale up. In this setting, the problem of synergistically performing both model-switching and autoscaling remains open.

Offline Training-Free Model-Switching Controller Further, since we use pre-characterized information about the P99 latency of each model at a fixed capacity, the model-switching controller needs to be re-characterized if the autoscaling policy spawns or revokes container replicas. As future work, we are exploring the possibility of training a Reinforcement Learning agent to automatically learn these changes online.

Synergistic Optimization with Caching, Batching etc. Existing MLaaS frameworks enable performance optimizations through caching, adaptive batching [10, 19] etc. We need to first figure out what optimizations our model-switching is compatible with and also figure out a synergistic way to incorporate model switching into these techniques.

Extend to Multiple Types of Computing Resources Although CPUs are widely used for DNN inference in existing MLaaS platforms, such as at Facebook [21] and Amazon [2], much specialized hardware has been designed for better DNN inference, for example GPUs [38], FPGAs [15], and Google's TPU [26, 42]. These heterogeneous hardware resources open new opportunities to optimize latency, accuracy, power efficiency, resource efficiency, etc. with a holistic approach.
