Implementing multi-GPU inference with TensorRT 7.2.1.6 / CUDA 11.1:
e.g. GPU0 runs model A, GPU1 runs model B.
1 It is best to keep the models in two separate engine files rather than having two threads load the same file, and each engine file should be built on the GPU that will run it; otherwise you get the warning:
"WARNING: Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors."
Several GitHub issues discuss this situation.
2 NVIDIA's official notes on multiple GPUs:
Q: How do I use TensorRT on multiple GPUs?
A: Each ICudaEngine object is bound to a specific GPU when it is instantiated, either by the builder or on deserialization. To select the GPU, use cudaSetDevice() before calling the builder or deserializing the engine. Each IExecutionContext is bound to the same GPU as the engine from which it was created. When calling execute() or enqueue(), ensure that the thread is associated with the correct device by calling cudaSetDevice() if necessary.
Pay close attention to the last sentence!
It’s actually pretty simple (now that it’s working 😃).
I built a class that inherits from a thread. Each thread has members for the context and engine. I pass the GPU id to the class (0, 1, …) and call cudaSetDevice before building the engine and the context.
Now we have two threads, each with the context and engine mapped to the correct card.
When running inference from these threads, before copying your buffer to the GPU (using a buffer manager) or doing a CUDA copy, you have to call cudaSetDevice within that thread, every single time you send a batch to the card.
So the threads run asynchronously with tons of calls to cudaSetDevice, which I didn't think would be necessary, but it works in practice. I ran this on 4 cards all afternoon with no problem.
I was thinking there would be a threading issue when calling cudaSetDevice (i.e. one thread calls cudaSetDevice while the other thread is copying memory), but I logged that case and it didn't cause any issues. It seems that cudaSetDevice only operates within the calling thread, which is good.
3 Key takeaway: when copying data from host memory to the GPU, you must set the device again inside that thread; otherwise the copying thread does not know which GPU to target. So in the inference call:
cudaSetDevice(mDevice);  // the crucial line, do not omit it
CHECK(cudaMemcpyAsync(buffers[0], input, batchSize * 3 *mInputH * mInputW * sizeof(float), cudaMemcpyHostToDevice, stream));
context.enqueue(batchSize, buffers, stream, nullptr);
CHECK(cudaMemcpyAsync(output, buffers[1], batchSize * mOutPutSize * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);