TensorFlow Serving Batching Guide

Introduction

While serving a TensorFlow model, batching individual model inference requests together can be important for performance. In particular, batching is necessary to unlock the high throughput promised by hardware accelerators such as GPUs. This is a library for batching requests and scheduling the batches. The library is not tied to GPUs, per se, and can be used for any situation that benefits from processing groups of small tasks in tandem (but this document assumes GPUs to simplify exposition). It offers a specific TensorFlow Session API, as well as lower-level APIs that can be used to batch at other granularities.

The library is currently split across two locations: (1) core/kernels/batching_util (core API and implementation), and (2) tensorflow_serving/batching (higher-level and experimental code).

The library offers several alternative classes to choose from. The reason for the choices is that there are many reasonable ways to perform batching. No single "best" approach dominates because different use-cases have different requirements, e.g.:

  • API preferences: Tensor API vs. general API; synchronous vs. asynchronous.
  • Does the model have significant CPU components, in addition to GPU?
  • Does the server need to interleave requests to multiple models (or versions)?
  • Is this for online serving or bulk processing (e.g. map-reduce jobs)?

Furthermore, whereas some deployments need advanced capabilities to squeeze out maximal performance, others just want something simple that performs reasonably well.

This document gives a tour of the batching library, including when to use which class, and some best practices.

Simple Batching

If you are new to the batching library and/or you have only basic requirements, you can focus just on BatchingSession and/or BasicBatchScheduler.

BatchingSession

BatchingSession adds batching to a standard tensorflow::Session, and lets you call Session::Run() with individual (non-batched) tensors while getting the benefits of batching "under the covers".

This abstraction works well if your application uses TensorFlow (naturally), and can accommodate Session::Run()'s synchronous API -- request threads make Session::Run() calls that block while awaiting other calls to group into the same batch.

To achieve good throughput with this synchronous API, it is recommended to set the number of client threads to roughly twice the maximum batch size.
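
As a rough illustration of that heuristic, here is a minimal sketch that drives a batching session from ~2x max_batch_size client threads. The helper name DriveBatchingSession and the tensor names "x" and "y" are placeholders, not part of the library; substitute your own model's signature and session-creation code (see below for how the batching session itself is created).

```c++
#include <thread>
#include <utility>
#include <vector>

#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/public/session.h"

// Sketch only: spawn roughly 2x max_batch_size client threads, each issuing
// blocking Session::Run() calls against the batching session.
void DriveBatchingSession(tensorflow::Session* batching_session,
                          const tensorflow::Tensor& input_tensor,
                          int max_batch_size) {
  const int num_client_threads = 2 * max_batch_size;  // rule of thumb
  std::vector<std::thread> threads;
  for (int i = 0; i < num_client_threads; ++i) {
    threads.emplace_back([&] {
      std::vector<tensorflow::Tensor> outputs;
      // Each Run() call blocks until its request has been grouped into a
      // batch and that batch has been executed.
      tensorflow::Status status =
          batching_session->Run({{"x", input_tensor}}, {"y"}, {}, &outputs);
      // ... check `status` and consume `outputs` ...
    });
  }
  for (std::thread& t : threads) t.join();
}
```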

BatchingSession can be used with any of the library's batch schedulers including BasicBatchScheduler, which offers a way to bound how long each Session::Run() call blocks.

The simplest way to use BatchingSession is via CreateRetryingBasicBatchingSession(), which gives you a tensorflow::Session object that uses a BasicBatchScheduler underneath, and also handles retrying requests that overflow the scheduler's queue.
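
A minimal sketch of wiring this up follows. It uses the closely related non-retrying creation function, CreateBasicBatchingSession(); the retrying variant mentioned above takes similar scheduling and batching options plus retry settings, and exact signatures can differ between TensorFlow Serving versions. The helper name, option values, and tensor names "x"/"y" are placeholders.

```c++
#include <memory>
#include <utility>

#include "tensorflow/core/kernels/batching_util/basic_batch_scheduler.h"
#include "tensorflow_serving/batching/batching_session.h"

// Sketch only: wrap an existing tensorflow::Session in a batching session.
tensorflow::Status MakeBatchingSession(
    std::unique_ptr<tensorflow::Session> model_session,
    std::unique_ptr<tensorflow::Session>* batching_session) {
  using tensorflow::serving::BasicBatchScheduler;
  using tensorflow::serving::BatchingSessionTask;

  BasicBatchScheduler<BatchingSessionTask>::Options schedule_options;
  schedule_options.max_batch_size = 1024;
  schedule_options.batch_timeout_micros = 2000;  // 2 ms
  schedule_options.num_batch_threads = 4;
  schedule_options.max_enqueued_batches = 8;

  tensorflow::serving::BatchingSessionOptions batching_session_options;

  // Names of the input/output tensors whose first dimension is the batch
  // dimension ("x" and "y" are placeholders for your model's signature).
  tensorflow::serving::TensorSignature signature = {{"x"}, {"y"}};

  return tensorflow::serving::CreateBasicBatchingSession(
      schedule_options, batching_session_options, signature,
      std::move(model_session), batching_session);
}
```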

You will supply some key parameters governing the scheduling and execution of batched requests that are passed to the underlying BasicBatchScheduler; see below for details.

BasicBatchScheduler has a bounded-size queue; you can set parameters that govern whether Session::Run() should fail upon finding a full queue, or retry some number of times with a delay; again, see below.

A final configuration parameter is allowed_batch_sizes. This parameter is optional. If unset, then batch sizes can vary freely between 1 and the maximum allowed size, say 1024. Depending on your environment, having a large number of possible batch sizes may cause problems. The allowed_batch_sizes parameter lets you limit the batch sizes to a fixed set, say 128, 256, 512, 1024.

BatchingSession adheres to this restriction by padding invalid-size batches with dummy data to round up to the next valid size.
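
For example, with BatchingSession this restriction is expressed through the allowed_batch_sizes field of the options struct (a small sketch; pass the struct to the creation call shown earlier):

```c++
// Sketch only: every executed batch gets padded up to one of these sizes;
// the last entry should match max_batch_size.
tensorflow::serving::BatchingSessionOptions batching_session_options;
batching_session_options.allowed_batch_sizes = {128, 256, 512, 1024};
```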

BasicBatchScheduler

BasicBatchScheduler is a lower-level abstraction than BatchingSession. It is not tied to tensors/TensorFlow per se, making it quite flexible. It is suitable for servers that handle homogeneous requests (see basic_batch_scheduler.h for a precise characterization of that restriction).

BasicBatchScheduler offers an asynchronous API that it shares with its less basic cousins (discussed below), called BatchScheduler. The API is templatized by a BatchTask class that encapsulates a unit of work to be batched. A non-blocking Schedule() method is used to enqueue a task for processing. Once a batch of tasks is ready to be processed, a callback is invoked on a separate thread to process the batch.

A good illustration of how to use this API is found in the implementation of BatchingSession in batching_session.cc.
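
The sketch below shows the general shape of that usage, assuming the API matches the headers in core/kernels/batching_util; the task type MyTask, the ProcessBatch callback, and the option values are illustrative, not part of the library.

```c++
#include <memory>

#include "tensorflow/core/kernels/batching_util/basic_batch_scheduler.h"
#include "tensorflow/core/kernels/batching_util/batch_scheduler.h"

using tensorflow::serving::BasicBatchScheduler;
using tensorflow::serving::Batch;
using tensorflow::serving::BatchTask;

// A unit of work to be batched; size() reports how many batch "slots" it uses.
struct MyTask : public BatchTask {
  explicit MyTask(int data_in) : data(data_in) {}
  size_t size() const override { return 1; }
  int data;
};

// Invoked on a batch thread once a batch is full or its timeout has expired.
void ProcessBatch(std::unique_ptr<Batch<MyTask>> batch) {
  for (int i = 0; i < batch->num_tasks(); ++i) {
    // ... run the batched computation over batch->task(i).data ...
  }
}

tensorflow::Status RunScheduler() {
  BasicBatchScheduler<MyTask>::Options options;
  options.max_batch_size = 256;
  options.batch_timeout_micros = 1000;

  std::unique_ptr<BasicBatchScheduler<MyTask>> scheduler;
  tensorflow::Status status =
      BasicBatchScheduler<MyTask>::Create(options, ProcessBatch, &scheduler);
  if (!status.ok()) return status;

  // Schedule() is non-blocking; it fails if the scheduler's queue is full.
  std::unique_ptr<MyTask> task(new MyTask(42));
  return scheduler->Schedule(&task);
}
```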

Batch Scheduling Parameters and Tuning

The parameters that govern batch scheduling (e.g. in BasicBatchScheduler::Options) are listed below, followed by a short configuration sketch:

  • max_batch_size: The maximum size of any batch. This parameter governs the throughput/latency tradeoff, and also avoids having batches that are so large they exceed some resource constraint (e.g. GPU memory to hold a batch's data).
  • batch_timeout_micros: The maximum amount of time to wait before executing a batch (even if it hasn't reached max_batch_size). Used to rein in tail latency. (See basic_batch_scheduler.h for the exact latency contract.)
  • num_batch_threads: The degree of parallelism, i.e. the maximum number of batches processed concurrently.
  • max_enqueued_batches: The number of batches worth of tasks that can be enqueued to the scheduler. Used to bound queueing delay, by turning away requests that would take a long time to get to, rather than building up a large backlog.
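
For example (reusing the MyTask type from the sketch above; the values are only illustrative starting points):

```c++
// The four scheduling knobs, with illustrative values.
BasicBatchScheduler<MyTask>::Options options;
options.max_batch_size = 1024;        // throughput/latency tradeoff, resource cap
options.batch_timeout_micros = 5000;  // wait at most ~5 ms to fill a batch
options.num_batch_threads = 8;        // batches processed concurrently
options.max_enqueued_batches = 8;     // bound queueing delay; excess is rejected
```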

Performance Tuning

The best values to use for the batch scheduling parameters depend on your model, system and environment, as well as your throughput and latency goals.

Choosing good values is best done via experiments. Here are some guidelines that may be helpful in selecting values to experiment with.

Overall Guidelines

First of all, while experimenting you should temporarily set max_enqueued_batches to infinity.

Later, for your production setup, set it as follows:

If you are performing online serving, depending on the policy used to (re-)route requests to server instances, consider setting max_enqueued_batches equal to num_batch_threads to minimize queueing delay at a given server while keeping it busy.

For bulk processing jobs, set max_enqueued_batches to a generous value, but low enough to avoid out-of-memory crashes.

Second, if for system architecture reasons you need to constrain the set of possible batch sizes (e.g. just 100, 200 or 400, rather than any value between 1 and 400): If you are using BatchingSession you can set the allowed_batch_sizes parameter. Otherwise, you can arrange for your callback code to pad the batches with dummy elements.
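
The padding arithmetic itself is simple; a hypothetical helper (not part of the library) might look like this:

```c++
#include <vector>

// Smallest allowed size that fits the batch; the callback would then append
// dummy elements until the batch reaches that size.
int RoundUpToAllowedBatchSize(int batch_size,
                              const std::vector<int>& allowed_batch_sizes) {
  for (int allowed : allowed_batch_sizes) {  // assumed sorted ascending
    if (batch_size <= allowed) return allowed;
  }
  return batch_size;  // already at (or beyond) the largest allowed size
}

// e.g. RoundUpToAllowedBatchSize(137, {100, 200, 400}) returns 200, so the
// callback pads with 63 dummy elements before running the model.
```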

CPU-only: One Approach

If your system is CPU-only (no GPU), then consider starting with the following values: num_batch_threads equal to the number of CPU cores; max_batch_size to infinity; batch_timeout_micros to 0. Then experiment with batch_timeout_micros values in the 1-10 millisecond (1000-10000 microsecond) range, while keeping in mind that 0 may be the optimal value.
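
As a sketch of those starting values (reusing the MyTask scheduler from above; std::thread::hardware_concurrency() stands in for the CPU core count):

```c++
#include <limits>
#include <thread>

// Suggested starting point for a CPU-only system.
BasicBatchScheduler<MyTask>::Options options;
options.num_batch_threads =
    static_cast<int>(std::thread::hardware_concurrency());  // ~= CPU cores
options.max_batch_size = std::numeric_limits<int>::max();   // "infinity"
options.batch_timeout_micros = 0;  // then try values in the 1000-10000 range
```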

GPU: One Approach

If your model uses a GPU device for part or all of its inference work, consider the following approach (a sketch of the starting values follows the list):

  1. Set num_batch_threads to the number of CPU cores.
  2. Temporarily set batch_timeout_micros to infinity while you tune max_batch_size to achieve the desired balance between throughput and average latency. Consider values in the hundreds or thousands.
  3. For online serving, tune batch_timeout_micros to rein in tail latency. The idea is that batches normally get filled to max_batch_size, but occasionally when there is a lapse in incoming requests, to avoid introducing a latency spike it makes sense to process whatever's in the queue even if it represents an underfull batch. The best value for batch_timeout_micros is typically a few milliseconds, and depends on your context and goals. Zero is a value to consider; it works well for some workloads. (For bulk processing jobs, choose a large value, perhaps a few seconds, to ensure good throughput but not wait too long for the final (and likely underfull) batch.)
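
A sketch of those starting values (again reusing MyTask; the max_batch_size shown is just one point in the sweep):

```c++
#include <cstdint>
#include <limits>
#include <thread>

BasicBatchScheduler<MyTask>::Options options;
options.num_batch_threads =
    static_cast<int>(std::thread::hardware_concurrency());  // step 1: CPU cores
options.batch_timeout_micros =
    std::numeric_limits<int64_t>::max();  // step 2: "infinity" while tuning
options.max_batch_size = 512;             // step 2: sweep hundreds to thousands
// Step 3: once max_batch_size is chosen, lower batch_timeout_micros to a few
// milliseconds (or 0) to rein in tail latency.
```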

 

Servers with Multiple Models, Model Versions or Subtasks

Some server instances service multiple request types (e.g. multiple models, or multiple versions of a model offered concurrently). In another scenario, a single request gets broken down into sub-requests involving multiple distinct servables (e.g. a recommender system might have a triggering model that decides whether to formulate a recommendation, followed by a model that selects the actual recommendation). A third scenario is bucketizing sequence model requests to batch together requests of similar length, to minimize padding.

Generally speaking, using a separate batch scheduler for each kind of request or sub-task does not work well if they share a common underlying compute resource -- each scheduler would run its own threads that compete with the others' threads to access the resource. It is better to have a single scheduler with a single thread pool, that is aware of multiple distinct types of tasks and is able to interleave batches of one kind of task with batches of another.

That is what SharedBatchScheduler does. It presents an abstraction of queues and accepts requests to schedule a particular kind of task. Each batch contains tasks of just one type, i.e. from one queue. The scheduler ensures fairness by interleaving the different types of batches.

The queues implement the BatchScheduler API, so they can be used anywhere a simpler (non-shared) scheduler can be used, including with BatchingSession.

Queues can be added and removed over time, which is useful e.g. for transitioning to new model versions in environments in which clients specify a specific version: while clients learn about the new version, the server will have to process requests for both versions, and SharedBatchScheduler takes care of interleaving batches of both kinds of requests.
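
A sketch of that setup is below, assuming two versions of a model and reusing the MyTask type from earlier. The helper name and lambdas are placeholders, and the exact set of fields in the shared scheduler's Options/QueueOptions varies across TensorFlow versions, so treat this as an outline rather than a definitive example.

```c++
#include <memory>

#include "tensorflow/core/kernels/batching_util/shared_batch_scheduler.h"

using tensorflow::serving::BatchScheduler;
using tensorflow::serving::SharedBatchScheduler;

tensorflow::Status SetUpSharedScheduler(
    std::unique_ptr<BatchScheduler<MyTask>>* v1_queue,
    std::unique_ptr<BatchScheduler<MyTask>>* v2_queue) {
  SharedBatchScheduler<MyTask>::Options options;
  options.num_batch_threads = 4;  // one thread pool shared by all queues

  std::shared_ptr<SharedBatchScheduler<MyTask>> scheduler;
  tensorflow::Status status =
      SharedBatchScheduler<MyTask>::Create(options, &scheduler);
  if (!status.ok()) return status;

  SharedBatchScheduler<MyTask>::QueueOptions queue_options;
  queue_options.max_batch_size = 256;
  queue_options.batch_timeout_micros = 2000;

  // One queue per servable version; each queue implements the BatchScheduler
  // API, and the scheduler interleaves batches drawn from the two queues.
  auto process_v1 = [](std::unique_ptr<tensorflow::serving::Batch<MyTask>> b) {
    // ... run model version 1 on this batch ...
  };
  auto process_v2 = [](std::unique_ptr<tensorflow::serving::Batch<MyTask>> b) {
    // ... run model version 2 on this batch ...
  };
  status = scheduler->AddQueue(queue_options, process_v1, v1_queue);
  if (!status.ok()) return status;
  return scheduler->AddQueue(queue_options, process_v2, v2_queue);
}
```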

Mixed CPU/GPU/IO Workloads

Some models perform nontrivial CPU work, in addition to their main GPU work. While the core matrix operations may run well on a GPU, peripheral operations may take place on a CPU, e.g. embedding lookup, vocabulary lookup, quantization/dequantization. Depending on how the GPU is managed, batching the entire sequence of CPU and GPU steps as a unit can underutilize the GPU.

Non-GPU pre- and post-processing can be performed in the request threads, with the batch scheduler used only for the GPU portion of the work.

Alternatively, the non-GPU work can be done in the batch threads, in the callback the batch scheduler calls. To allow the callback to perform non-batched work on tasks before a batch is fully formed, you can use StreamingBatchScheduler. It is designed for servers that control latency very precisely, and need fine control over each stage of the pipeline.

StreamingBatchScheduler will reject a task if the scheduler currently has no capacity to process it. If you want to automatically retry tasks that are rejected for that reason, you can layer a BatchSchedulerRetrier on top of the batch scheduler. There is a convenience function for creating a streaming scheduler coupled with a retrier: CreateRetryingStreamingBatchScheduler().

When splitting model inference logic into multiple distinct phases to optimize latency or utilization, keep in mind that for a given request, every phase should use the same version of the model.

A good way to ensure this property is to coordinate which ServableHandle object(s) get used in each phase, across the threads.

Lastly, I/O-intensive phases of inference, e.g. lookups to disk or remote servers, may benefit from batching to hide their latency. You can use two batch scheduler instances: one to batch these lookups, and a separate one to batch the GPU work.
