tritonserver learning, part 9: tritonserver gRPC asynchronous mode

liupenglove

Last modified: 2024-03-11 22:56:27

Tags: deep learning, artificial intelligence, C++

First published: 2024-03-11 22:44:18

Article link: https://blog.csdn.net/liupenglove/article/details/136604741

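tritonserver learning, part 1: the triton usage workflow

tritonserver learning, part 2: compiling tritonserver

tritonserver learning, part 3: the tritonserver run flow

tritonserver learning, part 4: command-line parsing

tritonserver learning, part 5: how backends are implemented

tritonserver learning, part 6: custom C++ and Python custom backends in practice

tritonserver learning, part 7: the cache manager

tritonserver learning, part 8: redis_caches in practice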

1. Protocols supported by tritonserver

Once tritonserver is serving a model, a client can reach the deployed model over either HTTP or gRPC. For gRPC, the server uses gRPC's asynchronous mode, mainly for the following reasons:

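High concurrency: in asynchronous mode the server can have many client requests in flight at once, and waiting on one request's response does not block the handling of the others. This lets TritonServer make fuller use of system resources and handle a large volume of model inference requests more efficiently.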

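Resource utilization: in asynchronous mode the server does not create a dedicated thread or process for each request. Requests are placed on a queue and handled by an event loop, which lowers resource overhead and lets TritonServer serve more requests with limited resources.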

2. gRPC asynchronous mode

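gRPC performs asynchronous operations through the CompletionQueue API. The basic workflow is:

Create a CompletionQueue and bind it to the RPC calls.
Issue read/write operations, each identified by a unique void* pointer (the tag).
Register the handlers, usually using a pointer to a class object as the unique tag.
Call CompletionQueue::Next, which blocks until an event (such as an incoming request) is available.
When the event arrives, use the tag pointer to dispatch to the matching handler.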
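The workflow above can be modeled without gRPC at all. The following is only a minimal sketch under stated assumptions: ToyCompletionQueue, Handler, and RunToyLoop are hypothetical names invented for illustration and are not part of gRPC. The sketch shows just the core idea, that each event carries an opaque void* tag and the loop casts the tag back to the object that should handle it.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

// Hypothetical stand-in for a completion queue: events are (tag, ok) pairs.
class ToyCompletionQueue {
 public:
  // Post a completed event identified by `tag`.
  void Post(void* tag, bool ok) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      events_.push(Event{tag, ok});
    }
    cv_.notify_one();
  }

  // After Shutdown, Next drains the remaining events and then returns false.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      shutdown_ = true;
    }
    cv_.notify_all();
  }

  // Block until an event is available. Returns false once the queue has
  // been shut down and fully drained.
  bool Next(void** tag, bool* ok) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !events_.empty() || shutdown_; });
    if (events_.empty()) return false;
    *tag = events_.front().tag;
    *ok = events_.front().ok;
    events_.pop();
    return true;
  }

 private:
  struct Event {
    void* tag;
    bool ok;
  };
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<Event> events_;
  bool shutdown_ = false;
};

// Hypothetical handler object; its address is used as the tag.
struct Handler {
  std::string name;
  std::vector<std::string>* log;
  void Proceed() { log->push_back(name); }
};

// Post three events, then drain the queue, dispatching each event to the
// object its tag points at. Returns the order in which handlers ran.
std::vector<std::string> RunToyLoop() {
  std::vector<std::string> log;
  Handler a{"a", &log};
  Handler b{"b", &log};
  ToyCompletionQueue cq;
  cq.Post(&a, true);
  cq.Post(&b, true);
  cq.Post(&a, true);
  cq.Shutdown();

  void* tag = nullptr;
  bool ok = false;
  while (cq.Next(&tag, &ok)) {
    assert(ok);
    static_cast<Handler*>(tag)->Proceed();
  }
  return log;
}
```

Calling RunToyLoop() dispatches the three events in posting order. In real gRPC the runtime, not the application, places events on the queue.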

Main startup flow of gRPC asynchronous mode:

gRPC example code:

 void Run(uint16_t port) {
    std::string server_address = absl::StrFormat("0.0.0.0:%d", port);

    ServerBuilder builder;
    // Listen on the given address without any authentication mechanism.
    builder.AddListeningPort(server_address, grpc::InsecureServerCredentials());
    // Register "service_" as the instance through which we'll communicate with
    // clients. In this case it corresponds to an *asynchronous* service.
    builder.RegisterService(&service_);
    // Get hold of the completion queue used for the asynchronous communication
    // with the gRPC runtime.
    cq_ = builder.AddCompletionQueue();
    // Finally assemble the server.
    server_ = builder.BuildAndStart();
    std::cout << "Server listening on " << server_address << std::endl;

    // Proceed to the server's main loop.
    HandleRpcs();
  }

Registering the handler:

service_->RequestSayHello(&ctx_, &request_, &responder_, cq_, cq_, this);

Here this is the unique tag: a pointer to the object itself.

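One more point: completion queues are obtained through builder.AddCompletionQueue, and a single server can have several of them. triton uses three completion queues in total, for ordinary requests, inference requests, and streaming inference requests respectively.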
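To make the idea of several queues concrete, here is a rough sketch, assuming a hypothetical TaskQueue class as a stand-in for a completion queue (it is not a gRPC type, and RunThreeQueues is likewise invented for illustration). Each queue is drained by its own thread, so work on one class of request does not wait behind work on another.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical blocking queue of tasks, standing in for a completion queue.
class TaskQueue {
 public:
  void Post(std::function<void()> task) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      tasks_.push(std::move(task));
    }
    cv_.notify_one();
  }

  // After Shutdown, Next drains the remaining tasks and then returns false.
  void Shutdown() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      shutdown_ = true;
    }
    cv_.notify_all();
  }

  bool Next(std::function<void()>* task) {
    std::unique_lock<std::mutex> lock(mu_);
    cv_.wait(lock, [this] { return !tasks_.empty() || shutdown_; });
    if (tasks_.empty()) return false;
    *task = std::move(tasks_.front());
    tasks_.pop();
    return true;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::queue<std::function<void()>> tasks_;
  bool shutdown_ = false;
};

// One queue per request class, each drained by its own thread.
// Returns how many tasks each queue's worker handled.
std::vector<int> RunThreeQueues() {
  TaskQueue queues[3];         // 0: common, 1: infer, 2: stream infer
  int handled[3] = {0, 0, 0};  // each counter is touched by one worker only
  std::vector<std::thread> workers;
  for (int i = 0; i < 3; ++i) {
    workers.emplace_back([&queues, i] {
      std::function<void()> task;
      while (queues[i].Next(&task)) task();
    });
  }

  const int requests[3] = {2, 5, 3};
  for (int i = 0; i < 3; ++i) {
    for (int n = 0; n < requests[i]; ++n) {
      queues[i].Post([&handled, i] { ++handled[i]; });
    }
    queues[i].Shutdown();
  }
  for (auto& w : workers) w.join();
  return {handled[0], handled[1], handled[2]};
}
```

The split by request class here only mirrors the three categories named in the text; how triton actually assigns threads to its queues is not shown by this sketch.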

Complete server-side example code:

/*
 *
 * Copyright 2015 gRPC authors.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */

#include <iostream>
#include <memory>
#include <string>
#include <thread>

#include "absl/flags/flag.h"
#include "absl/flags/parse.h"
#include "absl/strings/str_format.h"

#include <grpc/support/log.h>
#include <grpcpp/grpcpp.h>

#ifdef BAZEL_BUILD
#include "examples/protos/helloworld.grpc.pb.h"
#else
#include "helloworld.grpc.pb.h"
#endif

ABSL_FLAG(uint16_t, port, 50051, "Server port for the service");

using grpc::Server;
using grpc::ServerAsyncResponseWriter;
using grpc::ServerBuilder;
using grpc::ServerCompletionQueue;
using grpc::ServerContext;
using grpc::Status;
using helloworld::Greeter;
using helloworld::HelloReply;
using helloworld::HelloRequest;

class ServerImpl final {
 public:
  ~ServerImpl() {
    server_->Shutdown();
    // Always shutdown the completion queue after the server.
    cq_->Shutdown();
  }

  // There is no shutdown handling in this code.
  void Run(uint16_t port) {
    std::string server_address = absl::StrFormat("0.0.0.0:%d", port);

    ServerBuilder builder;
    // Listen on the given address without any authentication mechanism.
    builder.AddListeningPort(server_address, grpc::InsecureServerCredentials());
    // Register "service_" as the instance through which we'll communicate with
    // clients. In this case it corresponds to an *asynchronous* service.
    builder.RegisterService(&service_);
    // Get hold of the completion queue used for the asynchronous communication
    // with the gRPC runtime.
    cq_ = builder.AddCompletionQueue();
    // Finally assemble the server.
    server_ = builder.BuildAndStart();
    std::cout << "Server listening on " << server_address << std::endl;

    // Proceed to the server's main loop.
    HandleRpcs();
  }

 private:
  // Class encompassing the state and logic needed to serve a request.
  class CallData {
   public:
    // Take in the "service" instance (in this case representing an asynchronous
    // server) and the completion queue "cq" used for asynchronous communication
    // with the gRPC runtime.
    CallData(Greeter::AsyncService* service, ServerCompletionQueue* cq)
        : service_(service), cq_(cq), responder_(&ctx_), status_(CREATE) {
      // Invoke the serving logic right away.
      Proceed();
    }

    void Proceed() {
      if (status_ == CREATE) {
        // Make this instance progress to the PROCESS state.
        status_ = PROCESS;

        // As part of the initial CREATE state, we *request* that the system
        // start processing SayHello requests. In this request, "this" acts as
        // the tag uniquely identifying the request (so that different CallData
        // instances can serve different requests concurrently), in this case
        // the memory address of this CallData instance.
        service_->RequestSayHello(&ctx_, &request_, &responder_, cq_, cq_,
                                  this);
      } else if (status_ == PROCESS) {
        // Spawn a new CallData instance to serve new clients while we process
        // the one for this CallData. The instance will deallocate itself as
        // part of its FINISH state.
        new CallData(service_, cq_);

        // The actual processing.
        std::string prefix("Hello ");
        reply_.set_message(prefix + request_.name());

        // And we are done! Let the gRPC runtime know we've finished, using the
        // memory address of this instance as the uniquely identifying tag for
        // the event.
        status_ = FINISH;
        responder_.Finish(reply_, Status::OK, this);
      } else {
        GPR_ASSERT(status_ == FINISH);
        // Once in the FINISH state, deallocate ourselves (CallData).
        delete this;
      }
    }

   private:
    // The means of communication with the gRPC runtime for an asynchronous
    // server.
    Greeter::AsyncService* service_;
    // The producer-consumer queue where for asynchronous server notifications.
    ServerCompletionQueue* cq_;
    // Context for the rpc, allowing to tweak aspects of it such as the use
    // of compression, authentication, as well as to send metadata back to the
    // client.
    ServerContext ctx_;

    // What we get from the client.
    HelloRequest request_;
    // What we send back to the client.
    HelloReply reply_;

    // The means to get back to the client.
    ServerAsyncResponseWriter<HelloReply> responder_;

    // Let's implement a tiny state machine with the following states.
    enum CallStatus { CREATE, PROCESS, FINISH };
    CallStatus status_;  // The current serving state.
  };

  // This can be run in multiple threads if needed.
  void HandleRpcs() {
    // Spawn a new CallData instance to serve new clients.
    new CallData(&service_, cq_.get());
    void* tag;  // uniquely identifies a request.
    bool ok;
    while (true) {
      // Block waiting to read the next event from the completion queue. The
      // event is uniquely identified by its tag, which in this case is the
      // memory address of a CallData instance.
      // The return value of Next should always be checked. This return value
      // tells us whether there is any kind of event or cq_ is shutting down.
      GPR_ASSERT(cq_->Next(&tag, &ok));
      GPR_ASSERT(ok);
      static_cast<CallData*>(tag)->Proceed();
    }
  }

  std::unique_ptr<ServerCompletionQueue> cq_;
  Greeter::AsyncService service_;
  std::unique_ptr<Server> server_;
};

int main(int argc, char** argv) {
  absl::ParseCommandLine(argc, argv);
  ServerImpl server;
  server.Run(absl::GetFlag(FLAGS_port));

  return 0;
}

Example code on GitHub: https://github.com/grpc/grpc/tree/master/examples/cpp/helloworld

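The example above only shows the basic usage of gRPC asynchronous mode. Handling several kinds of requests calls for a more careful design, and triton's design is well worth studying.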

3. The design of triton's gRPC asynchronous mode

triton defines three completion queues, which handle ordinary requests, inference requests, and streaming inference requests respectively:

  std::unique_ptr<::grpc::ServerCompletionQueue> common_cq_;  // ordinary requests
  std::unique_ptr<::grpc::ServerCompletionQueue> model_infer_cq_;   // inference requests
  std::unique_ptr<::grpc::ServerCompletionQueue> model_stream_infer_cq_;  // streaming inference requests

The code that starts the gRPC service is in the main function of the [server] repository:

TRITONSERVER_Error*
StartGrpcService(
    std::unique_ptr<triton::server::grpc::Server>* service,
    const std::shared_ptr<TRITONSERVER_Server>& server,
    triton::server::TraceManager* trace_manager,
    const std::shared_ptr<triton::server::SharedMemoryManager>& shm_manager)
{
  TRITONSERVER_Error* err = triton::server::grpc::Server::Create(
      server, trace_manager, shm_manager, g_triton_params.grpc_options_,
      service);
  if (err == nullptr) {
    err = (*service)->Start();
  }

  if (err != nullptr) {
    service->reset();
  }

  return err;
}

(*service)->Start() is the core function here: it registers the gRPC requests and handles them. See the following code (grpc_server.cc):

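In that function, common_handler_->Start() registers the ordinary gRPC requests, model_infer_handler->Start registers inference requests, and model_stream_infer_handler->Start registers streaming inference requests. Both inference handlers are started inside a loop; the loop registers the handlers on multiple threads so that inference can be processed by multiple threads.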

Let's take common_handler as the example and keep reading the implementation:

void
CommonHandler::Start()
{
  // Use a barrier to make sure we don't return until thread has
  // started.
  auto barrier = std::make_shared<Barrier>(2);
  // Start a thread that registers the APIs and handles their events
  thread_.reset(new std::thread([this, barrier] {
    // Register all request handlers
    SetUpAllRequests();
    barrier->Wait();

    void* tag;
    bool ok;

    // Loop, waiting for incoming requests
    while (cq_->Next(&tag, &ok)) {
      ICallData* call_data = static_cast<ICallData*>(tag);
      if (!call_data->Process(ok)) {
        LOG_VERBOSE(1) << "Done for " << call_data->Name() << ", "
                       << call_data->Id();
        delete call_data;
      }
    }
  }));

  barrier->Wait();
  LOG_VERBOSE(1) << "Thread started for " << Name();
}

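The newly started thread registers all the APIs and then loops, waiting for RPC requests to arrive. When an event is received, the tag is cast and its member function Process() is called to handle it. ICallData is a base class; it is important, so it is listed here, but it is not explained yet.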
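The Barrier used in Start() is not shown in this article. As a rough illustration of what a two-party barrier does, here is a minimal sketch: SimpleBarrier and StartWithBarrier are hypothetical names, and this is not triton's Barrier implementation. The calling thread does not continue until the worker has finished its setup, which is the guarantee Start() relies on before it returns.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <memory>
#include <mutex>
#include <thread>

// Hypothetical single-use barrier: Wait() returns once `count` threads have
// called it. `count` must be at least 1.
class SimpleBarrier {
 public:
  explicit SimpleBarrier(std::size_t count) : remaining_(count) {}

  void Wait() {
    std::unique_lock<std::mutex> lock(mu_);
    if (--remaining_ == 0) {
      cv_.notify_all();  // last arrival releases everyone
    } else {
      cv_.wait(lock, [this] { return remaining_ == 0; });
    }
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::size_t remaining_;
};

// Returns true if the worker's setup is visible to the caller once the
// barrier has been passed.
bool StartWithBarrier() {
  auto barrier = std::make_shared<SimpleBarrier>(2);
  bool setup_done = false;
  std::thread worker([barrier, &setup_done] {
    setup_done = true;  // stands in for SetUpAllRequests()
    barrier->Wait();
    // The event loop would run here.
  });

  barrier->Wait();  // blocks until the worker has done its setup
  const bool observed = setup_done;
  worker.join();
  return observed;
}
```

The barrier's mutex orders the worker's write before the caller's read, so the flag can be a plain bool in this sketch.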

Next, the registration of requests, taking the health check as the example:

void
CommonHandler::RegisterHealthCheck()
{
  auto OnRegisterHealthCheck =
      [this](
          ::grpc::ServerContext* ctx,
          ::grpc::health::v1::HealthCheckRequest* request,
          ::grpc::ServerAsyncResponseWriter<
              ::grpc::health::v1::HealthCheckResponse>* responder,
          void* tag) {
        this->health_service_->RequestCheck(
            ctx, request, responder, this->cq_, this->cq_, tag);
      };

  auto OnExecuteHealthCheck = [this](
                                  ::grpc::health::v1::HealthCheckRequest&
                                      request,
                                  ::grpc::health::v1::HealthCheckResponse*
                                      response,
                                  ::grpc::Status* status) {
    bool live = false;
    TRITONSERVER_Error* err =
        TRITONSERVER_ServerIsReady(tritonserver_.get(), &live);

    auto serving_status =
        ::grpc::health::v1::HealthCheckResponse_ServingStatus_UNKNOWN;
    if (err == nullptr) {
      serving_status =
          live ? ::grpc::health::v1::HealthCheckResponse_ServingStatus_SERVING
               : ::grpc::health::v1::
                     HealthCheckResponse_ServingStatus_NOT_SERVING;
    }
    response->set_status(serving_status);

    GrpcStatusUtil::Create(status, err);
    TRITONSERVER_ErrorDelete(err);
  };

  const std::pair<std::string, std::string>& restricted_kv =
      restricted_keys_.Get(RestrictedCategory::HEALTH);
  new CommonCallData<
      ::grpc::ServerAsyncResponseWriter<
          ::grpc::health::v1::HealthCheckResponse>,
      ::grpc::health::v1::HealthCheckRequest,
      ::grpc::health::v1::HealthCheckResponse>(
      "Check", 0, OnRegisterHealthCheck, OnExecuteHealthCheck,
      false /* async */, cq_, restricted_kv, response_delay_);
}

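This function has three key points:

The OnRegisterHealthCheck variable: a callable (declared with auto, so a lambda here) that registers the gRPC asynchronous API.
The OnExecuteHealthCheck variable: a callable (also a lambda) that serves as the API's handler.
The creation of the CommonCallData object, which is what actually performs the registration and the request handling.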
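These three pieces (a register callback, an execute callback, and a call-data object whose pointer serves as the tag) can be modeled in a few lines. The sketch below is an assumption-laden illustration, not triton's code: ICallDataSketch, CallDataSketch, FakeRuntime, and RunCallDataSketch are all hypothetical names, and the "runtime" is faked with two in-memory queues.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical base class: the event loop only knows this type.
class ICallDataSketch {
 public:
  virtual ~ICallDataSketch() = default;
  virtual bool Process(bool ok) = 0;  // false means "delete me"
};

// Hypothetical call-data class: registers itself on construction and runs
// the execute callback when its tag comes back from the queue.
template <typename Request, typename Response>
class CallDataSketch : public ICallDataSketch {
 public:
  using RegisterFn = std::function<void(Request*, void* tag)>;
  using ExecuteFn = std::function<void(const Request&, Response*)>;

  CallDataSketch(RegisterFn on_register, ExecuteFn on_execute,
                 std::vector<Response>* sink)
      : on_execute_(std::move(on_execute)), sink_(sink) {
    // Convert to the base pointer before erasing the type, so the cast
    // back from void* in the event loop is well defined.
    on_register(&request_, static_cast<ICallDataSketch*>(this));
  }

  bool Process(bool ok) override {
    if (ok) {
      Response response;
      on_execute_(request_, &response);
      sink_->push_back(response);
    }
    return false;  // one-shot in this sketch
  }

 private:
  ExecuteFn on_execute_;
  std::vector<Response>* sink_;
  Request request_;
};

// Fake runtime: remembers where to write the next request and which tag to
// hand back once a request has "arrived".
struct FakeRuntime {
  struct Pending {
    std::string* request;
    void* tag;
  };
  std::queue<Pending> pending;
  std::queue<void*> completed;

  void RequestEcho(std::string* request, void* tag) {
    pending.push(Pending{request, tag});
  }
  void Deliver(const std::string& payload) {
    Pending p = pending.front();
    pending.pop();
    *p.request = payload;
    completed.push(p.tag);
  }
};

std::vector<std::string> RunCallDataSketch() {
  FakeRuntime runtime;
  std::vector<std::string> replies;
  auto on_register = [&runtime](std::string* request, void* tag) {
    runtime.RequestEcho(request, tag);
  };
  auto on_execute = [](const std::string& request, std::string* response) {
    *response = "Hello " + request;
  };
  new CallDataSketch<std::string, std::string>(on_register, on_execute,
                                               &replies);
  runtime.Deliver("triton");
  while (!runtime.completed.empty()) {
    auto* call = static_cast<ICallDataSketch*>(runtime.completed.front());
    runtime.completed.pop();
    if (!call->Process(true)) delete call;
  }
  return replies;
}
```

The design point is that the loop never needs to know which API an event belongs to: the tag is enough, and virtual dispatch on Process does the rest.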

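The CommonCallData constructor calls OnRegisterHealthCheck to register the API. The tag passed in at registration is the pointer to the CommonCallData object, which uniquely identifies one API request. This class derives from the ICallData class mentioned above: when the completion queue delivers an event, the tag is cast to a pointer to the ICallData base class, although its real type is CommonCallData, and the member function Process is then called through that pointer to handle the corresponding request: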

template <typename ResponderType, typename RequestType, typename ResponseType>
bool
CommonCallData<ResponderType, RequestType, ResponseType>::Process(bool rpc_ok)
{
  LOG_VERBOSE(1) << "Process for " << name_ << ", rpc_ok=" << rpc_ok << ", "
                 << id_ << " step " << step_;

  // If RPC failed on a new request then the server is shutting down
  // and so we should do nothing (including not registering for a new
  // request). If RPC failed on a non-START step then there is nothing
  // we can do since we only execute one step.
  const bool shutdown = (!rpc_ok && (step_ == Steps::START));
  if (shutdown) {
    if (async_thread_.joinable()) {
      async_thread_.join();
    }
    step_ = Steps::FINISH;
  }

  if (step_ == Steps::START) {
    // Start a new request to replace this one...
    if (!shutdown) {
      new CommonCallData<ResponderType, RequestType, ResponseType>(
          name_, id_ + 1, OnRegister_, OnExecute_, async_, cq_, restricted_kv_,
          response_delay_);
    }

    if (!async_) {
      // For synchronous calls, execute and write response
      // here.
      Execute();
      WriteResponse();
    } else {
      // For asynchronous calls, delegate the execution to another
      // thread.
      step_ = Steps::ISSUED;
      async_thread_ = std::thread(&CommonCallData::Execute, this);
    }
  } else if (step_ == Steps::WRITEREADY) {
    // Will only come here for asynchronous mode.
    WriteResponse();
  } else if (step_ == Steps::COMPLETE) {
    step_ = Steps::FINISH;
  }

  return step_ != Steps::FINISH;
}
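To see how the steps of Process fit together on the synchronous path, here is a simplified model. It is only a sketch under assumptions: Step, World, CallSketch, and RunStateMachine are hypothetical, the asynchronous branch (ISSUED, WRITEREADY) is left out, and "writing the response" is modeled as pushing the call's own tag back onto the event queue.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <vector>

enum class Step { START, COMPLETE, FINISH };

class CallSketch;

// Stand-in for the gRPC runtime plus completion queue.
struct World {
  struct Event {
    CallSketch* call;
    bool ok;
  };
  std::queue<Event> events;        // completed operations, in order
  CallSketch* waiting = nullptr;   // the call registered for the next request
  std::vector<std::string> trace;  // what happened, for inspection
  int live = 0;                    // CallSketch objects currently allocated
};

class CallSketch {
 public:
  CallSketch(World* world, int id) : world_(world), id_(id) {
    ++world_->live;
    world_->waiting = this;  // "register": the next request wakes this call
  }
  ~CallSketch() { --world_->live; }
  int Id() const { return id_; }

  // Returns false when the caller should delete this object.
  bool Process(bool ok) {
    const bool shutdown = (!ok && step_ == Step::START);
    if (shutdown) step_ = Step::FINISH;

    if (step_ == Step::START) {
      new CallSketch(world_, id_ + 1);  // replacement for the next request
      world_->trace.push_back("execute " + std::to_string(id_));
      step_ = Step::COMPLETE;
      // Model "response written": our tag comes back on the queue.
      world_->events.push(World::Event{this, true});
    } else if (step_ == Step::COMPLETE) {
      step_ = Step::FINISH;
    }
    return step_ != Step::FINISH;
  }

 private:
  World* world_;
  int id_;
  Step step_ = Step::START;
};

struct RunResult {
  std::vector<std::string> trace;
  int live;
};

// Deliver `requests` requests, then a shutdown event (ok == false).
RunResult RunStateMachine(int requests) {
  World world;
  new CallSketch(&world, 0);
  auto drain = [&world] {
    while (!world.events.empty()) {
      World::Event ev = world.events.front();
      world.events.pop();
      const int id = ev.call->Id();
      if (!ev.call->Process(ev.ok)) {
        world.trace.push_back("delete " + std::to_string(id));
        delete ev.call;
      }
    }
  };
  for (int n = 0; n < requests; ++n) {
    world.events.push(World::Event{world.waiting, true});
    drain();
  }
  world.events.push(World::Event{world.waiting, false});
  drain();
  return RunResult{world.trace, world.live};
}
```

Each served call is processed twice, once at START and once at COMPLETE, while the call that is still waiting at shutdown is processed once with ok set to false and then deleted, so no object is left allocated.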

That is the full flow of tritonserver's gRPC asynchronous request registration. Corrections and discussion from fellow programmers are welcome.

You are also very welcome to follow the official account to get in touch, so that we can learn and improve together.
