The common code under TensorRT samples is a nicely written set of wrapper classes that streamline the boilerplate of invoking a TensorRT engine; using them saves some code and makes it a bit more elegant. But there is a problem: for an engine with a dynamic batch_size or height/width dimension, invocation crashes:
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
Aborted (core dumped)
Tracing this abort, the cause is that when a dimension is dynamic, e.g. batch_size = -1, the crash occurs in the following code in /usr/src/tensorrt/samples/common/buffers.h:
class BufferManager
{
public:
    static const size_t kINVALID_SIZE_VALUE = ~size_t(0);

    //!
    //! \brief Create a BufferManager for handling buffer interactions with engine.
    //!
    BufferManager(std::shared_ptr<nvinfer1::ICudaEngine> engine, const int batchSize = 0,
        const nvinfer1::IExecutionContext* context = nullptr)
        : mEngine(engine)
        , mBatchSize(batchSize)
    {
        // Full Dims implies no batch size.
        assert(engine->hasImplicitBatchDimension() || mBatchSize == 0);
        // Create host and device buffers
        for (int i = 0; i < mEngine->getNbBindings(); i++)
        {
            auto dims = context ? context->getBindingDimensions(i) : mEngine->getBindingDimensions(i);
            size_t vol = context || !mBatchSize ? 1 : static_cast<size_t>(mBatchSize);
            nvinfer1::DataType type = mEngine->getBindingDataType(i);
            int vecDim = mEngine->getBindingVectorizedDim(i);
            if (-1 != vecDim) // i.e., 0 != lgScalarsPerVector
            {
                int scalarsPerVec = mEngine->getBindingComponentsPerElement(i);
                dims.d[vecDim] = divUp(dims.d[vecDim], scalarsPerVec);
                vol *= scalarsPerVec;
            }
            vol *= samplesCommon::volume(dims);
            std::unique_ptr<ManagedBuffer> manBuf{new ManagedBuffer()};
            manBuf->deviceBuffer = DeviceBuffer(vol, type);
            manBuf->hostBuffer = HostBuffer(vol, type);
            mDeviceBindings.emplace_back(manBuf->deviceBuffer.data());
            mManagedBuffers.emplace_back(std::move(manBuf));
        }
    }
Specifically, it crashes on the line manBuf->deviceBuffer = DeviceBuffer(vol, type);. DeviceBuffer derives from GenericBuffer, and it is actually GenericBuffer's constructor that throws std::bad_alloc, because allocFn() failed:
GenericBuffer(size_t size, nvinfer1::DataType type)
    : mSize(size)
    , mCapacity(size)
    , mType(type)
{
    if (!allocFn(&mBuffer, this->nbBytes()))
    {
        throw std::bad_alloc();
    }
}
The GPU device side and the host side each have their own allocFn() and freeFn():
class DeviceAllocator
{
public:
    bool operator()(void** ptr, size_t size) const
    {
        return cudaMalloc(ptr, size) == cudaSuccess;
    }
};

class DeviceFree
{
public:
    void operator()(void* ptr) const
    {
        cudaFree(ptr);
    }
};

class HostAllocator
{
public:
    bool operator()(void** ptr, size_t size) const
    {
        *ptr = malloc(size);
        return *ptr != nullptr;
    }
};

class HostFree
{
public:
    void operator()(void* ptr) const
    {
        free(ptr);
    }
};
Clearly it is cudaMalloc(ptr, size) that failed to allocate GPU memory. Why did it fail? Inspection showed that size was 18446744073709100032! So that is the immediate cause, but why is size so absurdly large? Digging further, the root cause is that the engine's first dimension dims.d[0] is -1, so samplesCommon::volume(dims) inside BufferManager's constructor evaluates to -451584. Since vol is unsigned, the line
vol *= samplesCommon::volume(dims);
wraps the negative product around to a huge value (18446744073709100032 = 2^64 - 451584)!
To prevent a dynamic dimension of -1 from causing this error, a targeted fix works: force the dynamic dimension to its concrete value before computing the volume, and treat a dynamic height/width the same way:
if (dims.d[0] == -1) dims.d[0] = vol;
vol *= samplesCommon::volume(dims);
With the change above, the dimension computation is correct and allocating the buffers that hold the data works fine.
Of course, if inference needs input data whose height/width dimensions change dynamically, you may have to modify BufferManager's constructor so that you can create multiple BufferManager instances, one per concrete height/width, each serving inference calls for inputs of that shape.