Triton inference server series (3): PyTorch model configuration and deployment
Table of contents
- TorchScript model configuration
- Reference links
- Appendix:
- Instance Group
- todo: [Shape Tensors](https://github.com/triton-inference-server/server/blob/master/docs/model_configuration.md#shape-tensors)
- todo: [Version Policy](https://github.com/triton-inference-server/server/blob/master/docs/model_configuration.md#version-policy)
- todo: [ModelOptimizationPolicy](https://github.com/triton-inference-server/server/blob/master/docs/optimization.md#framework-specific-optimization)
Following on from Triton inference server series (2) — exporting a PyTorch model for Triton Server, in which we exported a torchscript model file that triton inference server can serve, this article goes through the model configuration files in detail.
- Read the official docs! Read the official docs! Read the official docs! triton-inference-server/server
- Read the config protobuf! Read the config protobuf! Read the config protobuf! model_config.proto
TorchScript Model Configuration
Model repository
The model repository is laid out as follows:
<model-repository-path>/
<model-name>/
config.pbtxt
1/
model.pt
A concrete example: the previous article exported the model cta2mbf_generator.pt. In the example below, the model repository contains just this one model:
tree -L 4
.
├── models                    # <model-repository-path>
│   ├── cta2mbf_generator     # <model-name>; must match the name in config.pbtxt
│   │   ├── 1                 # model version; you can choose the number
│   │   │   └── model.pt      # the model: rename `cta2mbf_generator.pt` to `model.pt` and place it here
│   │   └── config.pbtxt      # model configuration file, covered in detail below
└── ...
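The layout above can also be created with a short script. A minimal sketch (the directory names follow the example; the commented copy step assumes the `cta2mbf_generator.pt` file exported in the previous article):

```python
import os

def make_model_repo(repo_root: str, model_name: str, version: int = 1) -> str:
    """Create the <repo>/<model>/<version>/ layout that Triton expects."""
    version_dir = os.path.join(repo_root, model_name, str(version))
    os.makedirs(version_dir, exist_ok=True)
    # config.pbtxt sits next to the version directories, not inside them.
    config_path = os.path.join(repo_root, model_name, "config.pbtxt")
    with open(config_path, "w") as f:
        f.write('name: "%s"\nplatform: "pytorch_libtorch"\n' % model_name)
    # The exported torchscript file must be renamed to model.pt, e.g.:
    # shutil.copy("cta2mbf_generator.pt", os.path.join(version_dir, "model.pt"))
    return version_dir

make_model_repo("models", "cta2mbf_generator")
```

The version directory name must be a number; Triton ignores non-numeric subdirectories when scanning the repository.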
Model configuration
Below we focus on the configuration of a PyTorch torch script export; other model types are broadly similar — see the official docs (Model Configuration) for details.
Basic model configuration. The following is a minimal configuration file for a TensorRT model. We start from it because the PyTorch configuration differs from it in a few places:
platform: "tensorrt_plan"
max_batch_size: 8
input [
{
name: "input0"
data_type: TYPE_FP32
dims: [ 16 ]
},
{
name: "input1"
data_type: TYPE_FP32
dims: [ 16 ]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [ 16 ]
}
]
- Difference 1: `platform: "tensorrt_plan"`. For torch script models the backend must instead be `platform: "pytorch_libtorch"`; see Backends.
- Difference 2: the naming of the `input` and `output` ports. In the example above the port names match the names given when the model was exported (inputs `input0` and `input1`, output `output0`). A torch script model, however, does not preserve input/output port names, so in the configuration file the ports must be named "INPUT__0", "INPUT__1" and "OUTPUT__0", "OUTPUT__1", such that "INPUT__0" refers to the first input and "INPUT__1" refers to the second input, etc.; see Inputs and Outputs.
- For the data types available for inputs and outputs, see Datatypes.
- Input and output shapes: with `max_batch_size: 8` as above, the allowed input shape is [N,16], where N is a number <= max_batch_size; if max_batch_size is set to 0 (batching disabled), the allowed input shape is exactly [16]. If you want variable input and output shapes, set the corresponding dims entries to -1, e.g. [-1,-1,-1]; see Inputs and Outputs.
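The shape rule above can be sketched as a small check: given the `dims` from config.pbtxt and `max_batch_size`, decide whether a concrete tensor shape is accepted. This is a simplified sketch — the real server also validates datatypes and reshape settings:

```python
def shape_allowed(shape, dims, max_batch_size):
    """Check a concrete tensor shape against configured dims.

    With max_batch_size > 0, Triton prepends a batch dimension N with
    N <= max_batch_size; with max_batch_size == 0 the shape must match
    dims exactly. A -1 in dims matches any size.
    """
    if max_batch_size > 0:
        if len(shape) != len(dims) + 1 or not (1 <= shape[0] <= max_batch_size):
            return False
        shape = shape[1:]
    elif len(shape) != len(dims):
        return False
    return all(d == -1 or d == s for d, s in zip(dims, shape))

print(shape_allowed([4, 16], [16], 8))   # batch of 4 against dims [16] -> True
print(shape_allowed([9, 16], [16], 8))   # batch exceeds max_batch_size -> False
print(shape_allowed([16], [16], 0))      # batching disabled -> True
```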
The configuration file for the torch script model I deploy is:
name: "cta2mbf_generator"
platform: "pytorch_libtorch"
max_batch_size: 1
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [ 1,-1,-1,-1 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ 1,-1,-1,-1 ]
}
]
instance_group [
{
kind: KIND_GPU
}
]
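Once the server is up, a model configured like this can be called over Triton's HTTP/REST (KServe v2) inference API. Below is a sketch of building the request body with only the standard library; the shape and the localhost:8000 endpoint are assumptions, and in practice the official `tritonclient` package is more convenient:

```python
import json

def build_infer_request(input_name, shape, data, datatype="FP32"):
    """Build the JSON body for POST /v2/models/<model>/infer."""
    return json.dumps({
        "inputs": [{
            "name": input_name,      # must follow the INPUT__0 convention
            "shape": shape,
            "datatype": datatype,
            "data": data,            # flattened row-major values
        }]
    })

body = build_infer_request("INPUT__0", [1, 1, 2, 2, 2], [0.0] * 8)
# To actually send it (server assumed at localhost:8000):
# urllib.request.urlopen(urllib.request.Request(
#     "http://localhost:8000/v2/models/cta2mbf_generator/infer",
#     data=body.encode(), headers={"Content-Type": "application/json"}))
print(body)
```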
name, platform, max_batch_size, input and output in my model were already explained above, so next we focus on instance_group. This property configures the model instances — in plain terms, where the instances run and how many are launched. Its protobuf definition is message ModelInstanceGroup; for details see Instance Groups.
Reference links
- triton-inference-server/server : Model Repository
- triton-inference-server/server : Model Configuration
Appendix:
Instance Group
See the following examples (Instance Group):
- Launch two instances on every available GPU
instance_group [
{
count: 2
kind: KIND_GPU
}
]
- Launch one instance on GPU 0, and two instances each on GPUs 1 and 2 (five instances in total)
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
},
{
count: 2
kind: KIND_GPU
gpus: [ 1, 2 ]
}
]
- Launch two instances on the CPU, even when GPUs are available
instance_group [
{
count: 2
kind: KIND_CPU
}
]
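The total number of launched instances follows directly from these fields: for a GPU group it is count × the number of GPUs in the group (or × all available GPUs when `gpus` is empty), and for a CPU group it is simply count. A small sketch of that arithmetic (the available-GPU count is an assumption for illustration):

```python
def total_instances(groups, available_gpus=1):
    """Sum instances over instance_group entries (simplified)."""
    total = 0
    for g in groups:
        if g.get("kind") == "KIND_CPU":
            total += g.get("count", 1)
        else:  # KIND_GPU: 'count' instances per listed GPU
            gpus = g.get("gpus") or range(available_gpus)
            total += g.get("count", 1) * len(list(gpus))
    return total

# The second example above: 1 instance on GPU 0, 2 each on GPUs 1 and 2.
print(total_instances([
    {"count": 1, "kind": "KIND_GPU", "gpus": [0]},
    {"count": 2, "kind": "KIND_GPU", "gpus": [1, 2]},
]))  # -> 5
```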
The protobuf definition is message ModelInstanceGroup; my annotations follow:
message ModelInstanceGroup
{
  enum Kind {
    // Default. Instances may run on GPU or CPU: if all listed GPUs are
    // available they run on GPU, otherwise on CPU.
    KIND_AUTO = 0;
    // GPU
    KIND_GPU = 1;
    // CPU
    KIND_CPU = 2;
    // Placement decided by the model itself; currently supported only for Tensorflow models.
    KIND_MODEL = 3;
  }
  // Optional group name; defaults to <model name>_<group number> if unspecified.
  string name = 1;
  // Default is KIND_AUTO
  Kind kind = 4;
  // Number of instances created per GPU (or on the CPU); default is 1.
  int32 count = 2;
  //
  ModelRateLimiter rate_limiter = 6;
  // gpus: [ 1, 2 ] means GPUs 1 and 2 are usable; [] or unspecified means all available GPUs.
  repeated int32 gpus = 3;
  //
  repeated string profile = 5;
}
Official version:
message ModelInstanceGroup
{
//@@
//@@ .. cpp:enum:: Kind
//@@
//@@ Kind of this instance group.
//@@
enum Kind {
//@@ .. cpp:enumerator:: Kind::KIND_AUTO = 0
//@@
//@@ This instance group represents instances that can run on either
//@@ CPU or GPU. If all GPUs listed in 'gpus' are available then
//@@ instances will be created on GPU(s), otherwise instances will
//@@ be created on CPU.
//@@
KIND_AUTO = 0;
//@@ .. cpp:enumerator:: Kind::KIND_GPU = 1
//@@
//@@ This instance group represents instances that must run on the
//@@ GPU.
//@@
KIND_GPU = 1;
//@@ .. cpp:enumerator:: Kind::KIND_CPU = 2
//@@
//@@ This instance group represents instances that must run on the
//@@ CPU.
//@@
KIND_CPU = 2;
//@@ .. cpp:enumerator:: Kind::KIND_MODEL = 3
//@@
//@@ This instance group represents instances that should run on the
//@@ CPU and/or GPU(s) as specified by the model or backend itself.
//@@ The inference server will not override the model/backend
//@@ settings.
//@@ Currently, this option is supported only for Tensorflow models.
//@@
KIND_MODEL = 3;
}
//@@ .. cpp:var:: string name
//@@
//@@ Optional name of this group of instances. If not specified the
//@@ name will be formed as <model name>_<group number>. The name of
//@@ individual instances will be further formed by a unique instance
//@@ number and GPU index:
//@@
string name = 1;
//@@ .. cpp:var:: Kind kind
//@@
//@@ The kind of this instance group. Default is KIND_AUTO. If
//@@ KIND_AUTO or KIND_GPU then both 'count' and 'gpu' are valid and
//@@ may be specified. If KIND_CPU or KIND_MODEL only 'count' is valid
//@@ and 'gpu' cannot be specified.
//@@
Kind kind = 4;
//@@ .. cpp:var:: int32 count
//@@
//@@ For a group assigned to GPU, the number of instances created for
//@@ each GPU listed in 'gpus'. For a group assigned to CPU the number
//@@ of instances created. Default is 1.
int32 count = 2;
//@@ .. cpp:var:: ModelRateLimiter rate_limiter
//@@
//@@ The rate limiter specific settings to be associated with this
//@@ instance group. Optional, if not specified no rate limiting
//@@ will be applied to this instance group.
//@@
ModelRateLimiter rate_limiter = 6;
//@@ .. cpp:var:: int32 gpus (repeated)
//@@
//@@ GPU(s) where instances should be available. For each GPU listed,
//@@ 'count' instances of the model will be available. Setting 'gpus'
//@@ to empty (or not specifying at all) is equivalent to listing all
//@@ available GPUs.
//@@
repeated int32 gpus = 3;
//@@ .. cpp:var:: string profile (repeated)
//@@
//@@ For TensorRT models containing multiple optimization profile, this
//@@ parameter specifies a set of optimization profiles available to this
//@@ instance group. The inference server will choose the optimal profile
//@@ based on the shapes of the input tensors. This field should lie
//@@ between 0 and <TotalNumberOfOptimizationProfilesInPlanModel> - 1
//@@ and be specified only for TensorRT backend, otherwise an error will
//@@ be generated. If not specified, the server will select the first
//@@ optimization profile by default.
//@@
repeated string profile = 5;
}