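Yesterday I extracted the HOG module and compiled it on its own inside my project, and found that it ran about 3 times slower than calling the OpenCV library interface (both in release builds). The cause is analyzed below.
1. When OpenCV is configured with CMake, TBB is compiled in by default, which defines HAVE_TBB through cvconfig.h.cmake: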
/* Intel Threading Building Blocks */
#cmakedefine HAVE_TBB
2. The detectMultiScale function of the HOG module contains the following code:
parallel_for(BlockedRange(0, (int)levelScale.size()),
HOGInvoker(this, img, hitThreshold, winStride, padding, &levelScale[0], &allCandidates, &tempWeights, &tempScales));
This is the interface that invokes TBB for parallel execution; OpenCV wraps it in a thin layer. See the code in opencv2.31\opencv\build\include\opencv2\core\internal.hpp:
#ifdef HAVE_TBB
#include "tbb/tbb_stddef.h"
#if TBB_VERSION_MAJOR*100 + TBB_VERSION_MINOR >= 202
#include "tbb/tbb.h"
#include "tbb/task.h"
#undef min
#undef max
#else
#undef HAVE_TBB
#endif
#endif
#ifdef HAVE_TBB
namespace cv
{
typedef tbb::blocked_range<int> BlockedRange;
template<typename Body> static inline
void parallel_for( const BlockedRange& range, const Body& body )
{
tbb::parallel_for(range, body);
}
template<typename Iterator, typename Body> static inline
void parallel_do( Iterator first, Iterator last, const Body& body )
{
tbb::parallel_do(first, last, body);
}
typedef tbb::split Split;
template<typename Body> static inline
void parallel_reduce( const BlockedRange& range, Body& body )
{
tbb::parallel_reduce(range, body);
}
typedef tbb::concurrent_vector<Rect> ConcurrentRectVector;
typedef tbb::concurrent_vector<double> ConcurrentDoubleVector;
}
#else
namespace cv
{
class BlockedRange
{
public:
BlockedRange() : _begin(0), _end(0), _grainsize(0) {}
BlockedRange(int b, int e, int g=1) : _begin(b), _end(e), _grainsize(g) {}
int begin() const { return _begin; }
int end() const { return _end; }
int grainsize() const { return _grainsize; }
protected:
int _begin, _end, _grainsize;
};
#ifdef HAVE_THREADING_FRAMEWORK
#include "threading_framework.hpp"
template<typename Body>
static void parallel_for( const BlockedRange& range, const Body& body )
{
tf::parallel_for<Body>(range, body);
}
typedef tf::ConcurrentVector<Rect> ConcurrentRectVector;
#else
template<typename Body> static inline
void parallel_for( const BlockedRange& range, const Body& body )
{
body(range);
}
typedef std::vector<Rect> ConcurrentRectVector;
typedef std::vector<double> ConcurrentDoubleVector;
#endif
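If HAVE_TBB is not defined, the serial code path is compiled instead: parallel_for simply calls body(range) once on the calling thread, so all pyramid levels are processed one after another.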
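The compile-time switch described above can be reproduced in a small standalone sketch. This is not the OpenCV source: the `demo` namespace, the `SquareLevels` functor (a hypothetical stand-in for HOGInvoker) and the `square_levels` helper are names made up for illustration. It assumes that, when HAVE_TBB is defined, the TBB headers `tbb/blocked_range.h` and `tbb/parallel_for.h` are on the include path; without the macro it falls back to the same kind of serial `body(range)` call shown in the excerpt.

```cpp
#include <cassert>
#include <vector>

#ifdef HAVE_TBB
#include "tbb/blocked_range.h"
#include "tbb/parallel_for.h"
namespace demo {
typedef tbb::blocked_range<int> BlockedRange;

template <typename Body>
inline void parallel_for(const BlockedRange& range, const Body& body) {
    tbb::parallel_for(range, body);  // TBB may split the range across worker threads
}
}  // namespace demo
#else
namespace demo {
// Minimal serial replacement, modeled on the fallback class in the excerpt.
class BlockedRange {
public:
    BlockedRange(int b, int e) : begin_(b), end_(e) {}
    int begin() const { return begin_; }
    int end() const { return end_; }
private:
    int begin_, end_;
};

template <typename Body>
inline void parallel_for(const BlockedRange& range, const Body& body) {
    body(range);  // the whole range runs on the calling thread
}
}  // namespace demo
#endif

// Hypothetical stand-in for HOGInvoker: one unit of work per "level" index.
struct SquareLevels {
    explicit SquareLevels(std::vector<int>* out) : out_(out) {}
    void operator()(const demo::BlockedRange& r) const {
        for (int i = r.begin(); i < r.end(); ++i)
            (*out_)[i] = i * i;  // each index is written once, so no locking is needed
    }
    std::vector<int>* out_;
};

// The caller looks the same whichever branch was compiled.
inline std::vector<int> square_levels(int n) {
    assert(n >= 0);
    std::vector<int> out(n, 0);
    demo::parallel_for(demo::BlockedRange(0, n), SquareLevels(&out));
    return out;
}
```

The call site in `square_levels` is identical in both configurations, which is why the extracted HOG code still compiles and gives correct results without TBB; it just runs on a single thread.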
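In summary, the extracted code is slower when compiled on its own because it does not use TBB. The fix is then straightforward: include the internal.hpp header in the extracted code and define the HAVE_TBB macro before that include. The build may then fail to find the TBB headers or libraries; adding them to the project's include and library paths resolves that. For example: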
// OpenCV headers
#define HAVE_TBB
#include "opencv2/opencv.hpp"
#include "opencv2/core/core.hpp"
#include "opencv2/core/internal.hpp"
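Since internal.hpp itself undefines HAVE_TBB when the TBB version is too old, it can be worth checking which path a source file actually ended up with. A minimal sketch, assuming it is placed after the includes above; `hog_parallel_backend` is a hypothetical helper name, not part of OpenCV:

```cpp
#include <string>

// Reports which parallel_for path this translation unit was compiled with.
// Must appear after the OpenCV includes, because internal.hpp may #undef HAVE_TBB.
inline std::string hog_parallel_backend() {
#ifdef HAVE_TBB
    return "tbb";
#else
    return "serial";
#endif
}
```

Printing this value once at startup shows whether the define took effect in the file that contains the extracted HOG code.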
I am not very familiar with TBB myself; I only wanted to fix the slowdown. The link I referred to while solving the problem:
http://blog.csdn.net/twilightgod/article/details/7187565