Creating multiple TensorRT instances from multiple threads

Implementing multi-GPU inference on TensorRT 7.2.1.6 / CUDA 11.1:

e.g. GPU0 runs model A, GPU1 runs model B.

1 Keep the models as two independent engine files; do not hand a single file to two threads to load. Also, each engine file should be generated on the GPU that will run it, otherwise you will get the warning:

“WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.”

Several GitHub issues offer explanations for this situation: a plan file is optimized for the exact device it was built on, so each GPU should get its own build.
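
As a sketch of how to honor that (the helper name buildAndSavePlan, the ONNX input, and the explicit-batch setup are illustrative assumptions, not code from the original post), select the device before creating the builder so that each GPU produces its own plan file:

    #include <NvInfer.h>
    #include <NvOnnxParser.h>
    #include <cuda_runtime_api.h>
    #include <cstdio>
    #include <fstream>

    using namespace nvinfer1;

    class Logger : public ILogger {
        void log(Severity severity, const char* msg) override {
            if (severity <= Severity::kWARNING) std::printf("%s\n", msg);
        }
    } gLogger;

    // Illustrative helper: build the engine on the GPU that will run it,
    // then serialize one plan file per card.
    void buildAndSavePlan(int gpuId, const char* onnxPath, const char* planPath) {
        cudaSetDevice(gpuId); // bind the builder to this GPU before anything else

        IBuilder* builder = createInferBuilder(gLogger);
        const uint32_t flags =
            1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
        INetworkDefinition* network = builder->createNetworkV2(flags);
        auto* parser = nvonnxparser::createParser(*network, gLogger);
        parser->parseFromFile(onnxPath, static_cast<int>(ILogger::Severity::kWARNING));

        IBuilderConfig* config = builder->createBuilderConfig();
        config->setMaxWorkspaceSize(1ULL << 30); // 1 GiB workspace

        ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
        IHostMemory* plan = engine->serialize();

        std::ofstream out(planPath, std::ios::binary);
        out.write(static_cast<const char*>(plan->data()), plan->size());

        // TensorRT 7 objects are released with destroy(), not delete
        plan->destroy(); engine->destroy(); config->destroy();
        parser->destroy(); network->destroy(); builder->destroy();
    }

    // e.g. buildAndSavePlan(0, "modelA.onnx", "modelA_gpu0.plan");
    //      buildAndSavePlan(1, "modelB.onnx", "modelB_gpu1.plan");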
2 NVIDIA's official guidance on multiple GPUs, from the TensorRT FAQ:

Q: How do I use TensorRT on multiple GPUs?
A: Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.
Pay close attention to that last sentence; it is the key point to internalize.
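
Put into code, the answer above amounts to something like this sketch (loadPlanFile and loadEngineOnGpu are illustrative helper names, not an official TensorRT API):

    #include <NvInfer.h>
    #include <cuda_runtime_api.h>
    #include <fstream>
    #include <vector>

    // Illustrative helper: read a serialized plan file into memory.
    std::vector<char> loadPlanFile(const char* path) {
        std::ifstream in(path, std::ios::binary | std::ios::ate);
        std::vector<char> buf(static_cast<size_t>(in.tellg()));
        in.seekg(0);
        in.read(buf.data(), buf.size());
        return buf;
    }

    // The engine is bound to whichever GPU is current at deserialization time,
    // and every IExecutionContext created from it inherits that binding.
    nvinfer1::ICudaEngine* loadEngineOnGpu(int gpuId, const char* planPath,
                                           nvinfer1::ILogger& logger) {
        cudaSetDevice(gpuId); // must precede deserialization
        std::vector<char> plan = loadPlanFile(planPath);
        nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(logger);
        nvinfer1::ICudaEngine* engine =
            runtime->deserializeCudaEngine(plan.data(), plan.size(), nullptr);
        runtime->destroy(); // common TensorRT 7 pattern; the engine stays valid
        return engine;
    }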

From an issue on the TensorRT GitHub:

It’s actually pretty simple (now that it’s working 😃).
I built a class that inherits from a thread. Each thread has a member holding the context and engine. I pass the GPU id to the class (0, 1, …) and call cudaSetDevice before building the engine and the context.
Now we have two threads, each with the context and engine mapped to the correct card.
When running inference using these threads, before copying your buffer to the GPU (using a buffer manager) or a cuda copy, you have to call cudaSetDevice within that thread, every single time you send a batch to the card.
So the threads run asynchronously with tonnes of calls to cudaSetDevice, which I didn’t think was necessary, but it works in practice. I ran this on 4 cards all afternoon with no problem.
I was thinking there would be a threading issue when calling cudaSetDevice (i.e. I call set device from one thread as the other thread is copying memory), but I logged that case and it didn’t cause any issues. It seems that cudaSetDevice only operates within the calling thread, which is good.
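
Below is a condensed sketch of the pattern that issue describes, reusing the hypothetical loadEngineOnGpu helper from the previous snippet and omitting buffer allocation and the host/device copies (those appear under point 3). The essential detail is calling cudaSetDevice inside the worker thread, both at setup and again before every batch:

    #include <NvInfer.h>
    #include <cuda_runtime_api.h>
    #include <thread>

    // One worker per GPU: owns its device id, engine, context, and stream.
    class GpuWorker {
    public:
        GpuWorker(int gpuId, const char* planPath, nvinfer1::ILogger& logger)
            : mGpuId(gpuId) {
            cudaSetDevice(mGpuId); // select this worker's GPU before deserialization
            mEngine  = loadEngineOnGpu(mGpuId, planPath, logger);
            mContext = mEngine->createExecutionContext(); // bound to the same GPU
            cudaStreamCreate(&mStream);
        }

        // Called in a loop from the worker thread for each batch.
        void infer(void* buffers[], int batchSize) {
            cudaSetDevice(mGpuId); // again, every single batch, in this thread
            // ... cudaMemcpyAsync host -> device here (see point 3) ...
            mContext->enqueue(batchSize, buffers, mStream, nullptr);
            // ... cudaMemcpyAsync device -> host here ...
            cudaStreamSynchronize(mStream);
        }

        ~GpuWorker() {
            cudaSetDevice(mGpuId);
            cudaStreamDestroy(mStream);
            mContext->destroy();
            mEngine->destroy();
        }

    private:
        int mGpuId;
        nvinfer1::ICudaEngine* mEngine{nullptr};
        nvinfer1::IExecutionContext* mContext{nullptr};
        cudaStream_t mStream{};
    };

    // Usage: one std::thread per card, each constructing its own worker, e.g.
    //   std::thread t0([&] { GpuWorker w(0, "modelA_gpu0.plan", gLogger); /* loop: w.infer(...) */ });
    //   std::thread t1([&] { GpuWorker w(1, "modelB_gpu1.plan", gLogger); /* loop: w.infer(...) */ });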

3 A very important takeaway: when transferring data from host memory to the GPU, you must set the device again, otherwise the copying thread does not know which GPU the transfer should go to. So, in the inference routine:

    cudaSetDevice(mDevice); // the single most important line: select the device in this thread first
    CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 * mInputH * mInputW * sizeof(float), cudaMemcpyHostToDevice, stream));
    context.enqueue(batchSize, buffers, stream, nullptr);
    CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * mOutPutSize * sizeof(float), cudaMemcpyDeviceToHost, stream));
    cudaStreamSynchronize(stream);
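
The reason the repeated cudaSetDevice() calls work, and are required, is that in the CUDA runtime API the current device is per-thread state: the call only changes the device for the calling host thread. Each worker thread can therefore hold its own selection without disturbing the others, but any thread that issues copies or enqueues kernels must first set the correct device itself.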