https://github.com/NVIDIA/TensorRT/issues/1593
You need to multiply the reported qps by the batch size to get the actual throughput, since each query processes a whole batch. You should also look at the GPU compute time: 1000 / gpu_compute_time (in ms) should be equivalent to the reported qps.
When we checked the logs, we found there is already a throughput improvement going from batch_size=1 to batch_size=8:
- batch_size=1: 32.5049
- batch_size=8: 4.56083 * 8 = 36.48664
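
The same arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper names `effective_throughput` and `qps_from_gpu_time` are made up here and are not part of TensorRT or trtexec, and the inputs are the qps values quoted above.

```python
def effective_throughput(qps: float, batch_size: int) -> float:
    """Samples per second: each query processes `batch_size` samples."""
    return qps * batch_size


def qps_from_gpu_time(gpu_compute_time_ms: float) -> float:
    """Approximate qps implied by a per-query GPU compute time in milliseconds."""
    return 1000.0 / gpu_compute_time_ms


# qps values taken from the logs quoted above
bs1 = effective_throughput(32.5049, 1)
bs8 = effective_throughput(4.56083, 8)
speedup = bs8 / bs1

print(f"batch_size=1: {bs1:.5f} samples/s")
print(f"batch_size=8: {bs8:.5f} samples/s")
print(f"speedup: {speedup:.3f}x")
```

With these numbers the batch_size=8 run comes out roughly 12% faster per sample than batch_size=1.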