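I wrote two identical programs on Linux and Windows using the FFTW library (fftw3.a, fftw3.lib) and timed the fftwf_execute(m_wfpFFTplan) statement (a 16-point FFT).
For 10000 runs:
> On Linux: average time is 0.9
> On Windows: average time is 0.12
I am confused as to why this is so much faster on Windows than on Linux (about 7.5 times, going by the figures above).
Processor: Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz
Both operating systems (Windows XP 32-bit and Linux OpenSUSE 11.4 32-bit) are installed on the same machine.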
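I downloaded fftw.lib (for Windows) from the internet and do not know how it was configured. When I build FFTW on Linux with this configuration:
    ./configure --enable-float --enable-threads --with-combined-threads --disable-fortran --with-slow-timer --enable-sse --enable-sse2 --enable-avx
the resulting library is four times faster than the one built with the default configuration (0.4 ms).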
Answer:
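A 16-point FFT is very small. FFTs smaller than, say, 64 points are typically hard-coded, fully unrolled routines with no loops (FFTW calls them codelets), written that way to get the highest possible performance. This makes them highly sensitive to differences in instruction sets, compiler optimisations, and even 32-bit versus 64-bit words.
What happens when you test FFT sizes from 16 to 1048576 in powers of 2? I ask because the particular hard-coded routine chosen on Linux might not be the best one for your machine, whereas you may simply have been lucky with the Windows build for that particular size. Comparing all sizes in this range will give you a much better picture of Linux versus Windows performance.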
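Have you calibrated FFTW? On first run, FFTW guesses the fastest implementation for the machine. However, special instruction sets, a particular cache size, or other processor features can have a dramatic effect on execution speed. Calibration tests the speed of the various FFT routines and chooses the fastest one per size for your specific hardware. It involves repeatedly computing plans and saving the FFTW "wisdom" file that results. Calibration is a lengthy process, but the saved data can be re-used. I suggest performing it once when your software first starts up and re-using the file every time after that. I have seen 4-10x performance improvements on certain sizes after calibrating!
Below is a snippet of code I have used to calibrate FFTW for certain sizes. Please note that it is lifted from a DSP library I worked on, so some function calls are specific to my library. I hope the FFTW-specific calls are helpful.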
    // Calibration FFTW
    // Requires <cstdio>, <cstdlib> and <fftw3.h>. AfxMessageBox is MFC (<afxwin.h>);
    // startTimer(), stopTimer(), isCalibrated and the DSP:: members come from my library.
    void DSP::forceCalibration(void)
    {
        // Try to import FFTw Wisdom for fast plan creation
        FILE *fftw_wisdom = fopen("DSPDLL.ftw", "r");

        // If wisdom does not exist, ask user to calibrate
        if (fftw_wisdom == 0)
        {
            int iStatus2 = AfxMessageBox("FFTw not calibrated on this machine. "\
                "Would you like to perform a one-time calibration?\n\n"\
                "Note:\tMay take 40 minutes (on P4 3GHz), but speeds all subsequent FFT-based filtering & convolution by up to 100%.\n"\
                "\tResults are saved to disk (DSPDLL.ftw) and need only be performed once per machine.\n\n"\
                "\tMAKE SURE YOU REALLY WANT TO DO THIS, THERE IS NO WAY TO CANCEL CALIBRATION PART-WAY!",
                MB_YESNO | MB_ICONSTOP, 0);

            if (iStatus2 == IDYES)
            {
                // Perform calibration for all powers of 2 from 8 to 4194304
                // (most heavily used FFTs - for signal processing)
                AfxMessageBox("About to perform calibration.\n"\
                    "Close all programs, turn off your screensaver and do not move the mouse in this time!\n"\
                    "Note:\tThis program will appear to be unresponsive until the calibration ends.\n\n"
                    "\tA MESSAGEBOX WILL BE SHOWN ONCE THE CALIBRATION IS COMPLETE.\n");
                startTimer();

                // Create a whole load of FFTw Plans (wisdom accumulates automatically)
                for (int i = 8; i <= 4194304; i *= 2)
                {
                    // Create new buffers and fill
                    // (FFTW_MEASURE may overwrite them while planning; that is fine here,
                    // because the plans are only created, never executed)
                    DSP::cFFTin = new fftw_complex[i];
                    DSP::cFFTout = new fftw_complex[i];
                    DSP::fconv_FULL_Real_FFT_rdat = new double[i];
                    DSP::fconv_FULL_Real_FFT_cdat = new fftw_complex[(i/2)+1];
                    for (int j = 0; j < i; j++)
                    {
                        DSP::fconv_FULL_Real_FFT_rdat[j] = j;
                        DSP::cFFTin[j][0] = j;
                        DSP::cFFTin[j][1] = j;
                        DSP::cFFTout[j][0] = 0.0;
                        DSP::cFFTout[j][1] = 0.0;
                    }

                    // Create a plan for complex FFT.
                    // Use the measure flag to get the best possible FFT for this size
                    // FFTw "remembers" which FFTs were the fastest during this test.
                    // at the end of the test, the results are saved to disk and re-used
                    // upon every initialisation of the DSP Library
                    DSP::pCF = fftw_plan_dft_1d
                        (i, DSP::cFFTin, DSP::cFFTout, FFTW_FORWARD, FFTW_MEASURE);

                    // Destroy the plan
                    fftw_destroy_plan(DSP::pCF);

                    // Create a plan for real forward FFT
                    DSP::pCF = fftw_plan_dft_r2c_1d
                        (i, fconv_FULL_Real_FFT_rdat, fconv_FULL_Real_FFT_cdat, FFTW_MEASURE);

                    // Destroy the plan
                    fftw_destroy_plan(DSP::pCF);

                    // Create a plan for real inverse FFT
                    DSP::pCF = fftw_plan_dft_c2r_1d
                        (i, fconv_FULL_Real_FFT_cdat, fconv_FULL_Real_FFT_rdat, FFTW_MEASURE);

                    // Destroy the plan
                    fftw_destroy_plan(DSP::pCF);

                    // Destroy the buffers. Repeat for each size
                    delete [] DSP::cFFTin;
                    delete [] DSP::cFFTout;
                    delete [] DSP::fconv_FULL_Real_FFT_rdat;
                    delete [] DSP::fconv_FULL_Real_FFT_cdat;
                }

                double time = stopTimer();

                // The message is longer than 100 characters, so use a buffer
                // with room to spare and a bounded write
                char strOutput[512];
                snprintf(strOutput, sizeof(strOutput),
                    "DSP.DLL Calibration complete in %d minutes, %d seconds\n"\
                    "Please keep a copy of the DSPDLL.ftw file in the root directory of your application\n"\
                    "to avoid re-calibration in the future\n", (int)time/(int)60, (int)time%(int)60);
                AfxMessageBox(strOutput);

                isCalibrated = 1;

                // Save accumulated wisdom
                char * strWisdom = fftw_export_wisdom_to_string();
                FILE *fftw_wisdomsave = fopen("DSPDLL.ftw", "w");
                if (fftw_wisdomsave != 0)
                {
                    fprintf(fftw_wisdomsave, "%s", strWisdom);
                    fclose(fftw_wisdomsave);
                }
                // The exported string is owned by the caller
                free(strWisdom);

                DSP::pCF = NULL;
                DSP::cFFTin = NULL;
                DSP::cFFTout = NULL;
                fconv_FULL_Real_FFT_cdat = NULL;
                fconv_FULL_Real_FFT_rdat = NULL;
            }
        }
        else
        {
            // obtain file size.
            fseek (fftw_wisdom, 0, SEEK_END);
            long lSize = ftell (fftw_wisdom);
            rewind (fftw_wisdom);

            // allocate memory to contain the whole file, plus a terminating NUL
            char * strWisdom = (char*) malloc (lSize + 1);

            // copy the file into the buffer and terminate it, because
            // fftw_import_wisdom_from_string expects a C string
            size_t nRead = fread (strWisdom, 1, lSize, fftw_wisdom);
            strWisdom[nRead] = '\0';

            // import the buffer to fftw wisdom
            fftw_import_wisdom_from_string(strWisdom);

            fclose(fftw_wisdom);
            free(strWisdom);

            isCalibrated = 1;
            return;
        }
    }
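The secret sauce is creating the plan with the FFTW_MEASURE flag, which specifically measures hundreds of routines to find the fastest one for your particular type of FFT (real, complex, 1D, 2D) and size:
    DSP::pCF = fftw_plan_dft_1d (i, DSP::cFFTin, DSP::cFFTout,
                                 FFTW_FORWARD, FFTW_MEASURE);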
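Finally, all benchmarks should create the FFT plan once, outside the timed execution. Run them from code compiled in release mode with optimisations on and detached from the debugger. Execute the FFT in a loop of many thousands (or even millions) of iterations and use the average run time as the result. As you probably know, the planning stage takes a significant amount of time, and execution is designed to be performed many times with a single plan.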
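The shape of such a benchmark can be sketched independently of FFTW: time the one-off setup separately from the repeated call, then average the loop. In the sketch below, bench_result, run_benchmark and the demo_ functions are hypothetical names of mine, and the demo workload is only a stand-in. In a real test the setup callback would create the plan and the run callback would call fftwf_execute on it.

```c
#include <stddef.h>
#include <time.h>

typedef struct {
    double setup_seconds;    /* one-off cost, e.g. creating an FFTW plan */
    double avg_run_seconds;  /* mean cost of one call in the timed loop  */
} bench_result;

typedef void (*bench_fn)(void *ctx);

static double seconds_since(clock_t start)
{
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Time setup() once, then run() `iterations` times, and report the two
   costs separately. setup may be NULL if there is nothing to prepare. */
bench_result run_benchmark(bench_fn setup, bench_fn run, void *ctx, long iterations)
{
    bench_result r = {0.0, 0.0};

    clock_t t0 = clock();
    if (setup != NULL)
        setup(ctx);
    r.setup_seconds = seconds_since(t0);

    if (run == NULL || iterations <= 0)
        return r;

    t0 = clock();
    for (long i = 0; i < iterations; i++)
        run(ctx);
    r.avg_run_seconds = seconds_since(t0) / (double)iterations;
    return r;
}

/* Stand-in workload, for illustration only: sums a 16-element array. */
typedef struct {
    float data[16];
    volatile float sink;  /* volatile so the compiler keeps the work */
    long calls;
} demo_ctx;

void demo_setup(void *p)
{
    demo_ctx *c = p;
    for (int i = 0; i < 16; i++)
        c->data[i] = (float)i;
    c->sink = 0.0f;
    c->calls = 0;
}

void demo_run(void *p)
{
    demo_ctx *c = p;
    float sum = 0.0f;
    for (int i = 0; i < 16; i++)
        sum += c->data[i];
    c->sink = sum;
    c->calls++;
}
```

With a demo_ctx c, run_benchmark(demo_setup, demo_run, &c, 1000000) returns both figures. Reporting them separately makes it obvious when a slow planning stage has leaked into what was meant to be the per-FFT number.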
Tags: performance, linux, windows, fft, fftw
Source: https://codeday.me/bug/20191011/1891457.html