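This post records the process of debugging multi-core processing on the C6678.
1. Test data
I have a set of radar test data on hand: 128 pulses are integrated, each PRT contains 580 range cells, and the data volume is 128*580*8 bytes.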
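As a quick sanity check on the buffer size, the arithmetic can be written out. This is a minimal sketch: the function name `cpi_bytes` is hypothetical, and reading the 8 bytes per sample as one complex value (two 4-byte components) is an assumption, not something the data description states.

```c
#include <stddef.h>

/* Bytes needed to hold the raw data of one coherent processing interval.
 * bytes_per_sample = 8 is assumed to mean one complex sample
 * (two 4-byte components); that interpretation is an assumption. */
size_t cpi_bytes(size_t pulses, size_t range_cells, size_t bytes_per_sample)
{
    return pulses * range_cells * bytes_per_sample;
}
```

`cpi_bytes(128, 580, 8)` evaluates to 593920 bytes, which is 580 KB (593920 / 1024).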
[Figure: simulation plot of the test data]
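2. Single-core processing
The main algorithms are MTD and CFAR. The data is placed in DDR, and compiler optimization is turned off for demonstration purposes.
Connect the emulator and load core0.out.
2.1 Importing the test data
Open View -> Memory Browser, right-click, and choose Load Memory.
Click Next and enter the address from which the program reads the data it processes.
Check the data at the load address.
Run: the measured processing time is 34.5 ms, while the coherent processing interval (CPI) is 128*70 us = 8.96 ms, so the processing time has to be reduced.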
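One way to turn a cycle count into the millisecond figures quoted here and compare it with the CPI budget is sketched below. It rests on assumptions: the helper names are hypothetical, the core clock is passed in by the caller (1 GHz in the usage note is only an example value), and the counter is assumed to be a free-running 32-bit cycle counter, which is how the TSCL register is commonly used on C66x.

```c
#include <stdint.h>

/* Elapsed time in ms between two readings of a free-running 32-bit
 * cycle counter. Unsigned subtraction stays correct across one wrap. */
double cycles_to_ms(uint32_t start, uint32_t end, double cpu_hz)
{
    uint32_t cycles = end - start;   /* modulo 2^32 */
    return (double)cycles * 1000.0 / cpu_hz;
}

/* Coherent processing interval in ms: number of pulses times the PRT. */
double cpi_ms(unsigned pulses, double prt_us)
{
    return (double)pulses * prt_us / 1000.0;
}

/* Non-zero if the processing time fits within one CPI. */
int fits_in_cpi(double proc_ms, unsigned pulses, double prt_us)
{
    return proc_ms <= cpi_ms(pulses, prt_us);
}
```

With an assumed 1 GHz clock, 34,500,000 cycles correspond to 34.5 ms, and `fits_in_cpi(34.5, 128, 70.0)` returns 0 because the budget is 8.96 ms.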
2.2 Compiler optimization
The compiler can take care of most of the optimization; select -O3.
Run: 16.6 ms, roughly half of the original time.
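2.3 Memory placement optimization
Place the data being processed in shared memory.
Run: 15.5 ms, which is hardly any improvement. This part needs more investigation.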
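2.4 Optimizing time-consuming and frequently executed code
/* Macro arguments are parenthesized so that expression arguments expand safely. */
#define TH_S(th, len, coef) (((th) / (len)) * ((coef) + 1))
for (i = 0; i < len; i++)
{
    if (data[i] > (thres + TH_S(th[i+1], len, coef)))
    {
        result[i] = data[i];
        result_th[i] = th[i+1];
    }
}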
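Use the DSP intrinsics to process several data items per loop iteration and so reduce the iteration count. The version below cuts the number of iterations to 1/4 of the original: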
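/* Broadcast the loop-invariant factors into both halves of a packed pair. */
temp3 = _ftod(coef+1, coef+1);
temp4 = _ftod(1.0/len, 1.0/len);
temp7 = _ftod(thres, thres);
/* Main loop: four cells per iteration. The bound i + 4 <= len also lets a
 * length that is a multiple of 4 finish entirely in this loop. */
for (i = 0; i + 4 <= len; i += 4)
{
    th1_0 = th[i+1];
    th1_1 = th[i+1+1];
    th1_2 = th[i+2+1];
    th1_3 = th[i+3+1];
    temp1 = _ftod(th1_0, th1_1);   /* high half: cell i,   low half: cell i+1 */
    temp2 = _ftod(th1_2, th1_3);   /* high half: cell i+2, low half: cell i+3 */
    /* threshold = th * (coef + 1) * (1 / len) + thres, two cells at a time */
    temp5 = _daddsp(_dmpysp(_dmpysp(temp1, temp3), temp4), temp7);
    temp6 = _daddsp(_dmpysp(_dmpysp(temp2, temp3), temp4), temp7);
    /* Note: unlike the baseline loop, which stores th[i+1],
     * this version stores the computed threshold in result_th. */
    if (data[i] > _hif(temp5))
    {
        result[i] = data[i];
        result_th[i] = _hif(temp5);
    }
    if (data[i+1] > _lof(temp5))
    {
        result[i+1] = data[i+1];
        result_th[i+1] = _lof(temp5);
    }
    if (data[i+2] > _hif(temp6))
    {
        result[i+2] = data[i+2];
        result_th[i+2] = _hif(temp6);
    }
    if (data[i+3] > _lof(temp6))
    {
        result[i+3] = data[i+3];
        result_th[i+3] = _lof(temp6);
    }
}
/* Remainder loop: the cells left over when len is not a multiple of 4. */
for (; i < len; i++)
{
    temp = thres + TH_S(th[i+1], len, coef);
    if (data[i] > (temp)) {
        result[i] = data[i];
        result_th[i] = temp;
    }
}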
Run: 8.2 ms, about half again. The processing time now fits within one CPI (8.96 ms).
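The four-cells-per-iteration structure with a remainder loop can be checked on a host PC with plain C. The sketch below is not the DSP code: the packed intrinsics are replaced by scalar arithmetic, all function names are hypothetical, and storing the computed threshold in both versions is an assumption made so that the two outputs can be compared directly.

```c
#include <stddef.h>

/* Threshold for cell i: thres + (th[i+1] / len) * (coef + 1). */
static float cell_threshold(const float *th, size_t i, size_t len,
                            float coef, float thres)
{
    return thres + (th[i + 1] / (float)len) * (coef + 1.0f);
}

/* Baseline: one cell per iteration. th must hold len + 1 values. */
void detect_ref(const float *data, const float *th, size_t len,
                float coef, float thres, float *result, float *result_th)
{
    size_t i;
    for (i = 0; i < len; i++) {
        float t = cell_threshold(th, i, len, coef, thres);
        if (data[i] > t) { result[i] = data[i]; result_th[i] = t; }
    }
}

/* Unrolled: four cells per iteration, then a remainder loop. */
void detect_x4(const float *data, const float *th, size_t len,
               float coef, float thres, float *result, float *result_th)
{
    size_t i = 0, k;
    for (; i + 4 <= len; i += 4) {
        float t[4];
        for (k = 0; k < 4; k++)   /* stands in for the packed multiply/add */
            t[k] = cell_threshold(th, i + k, len, coef, thres);
        for (k = 0; k < 4; k++) {
            if (data[i + k] > t[k]) {
                result[i + k] = data[i + k];
                result_th[i + k] = t[k];
            }
        }
    }
    for (; i < len; i++) {        /* fewer than four cells remain */
        float t = cell_threshold(th, i, len, coef, thres);
        if (data[i] > t) { result[i] = data[i]; result_th[i] = t; }
    }
}
```

Running both functions on the same input and comparing `result` and `result_th` element by element is a cheap regression test before moving the loop body to intrinsics. On the DSP the packed version multiplies by 1/len instead of dividing, so the thresholds may differ in the last bits there.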
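3. Multi-core processing
A master-slave scheme is used: core 0 is the master and the other cores are slaves. The master receives the data, distributes the tasks, and collects and reports the results.
3.1 Two-core processing
Split the data in two; core0 and core1 each process one half.
MTD: 2 cores, delayNum/2 each. CFAR: 2 cores, fftNum/2 each.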
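The even split can be expressed as a small helper that gives each core a contiguous slice of the work items. This is a minimal sketch: `core_slice` and its parameters are hypothetical names, and the remainder handling is an assumption for item counts that do not divide evenly. On the C6678 the core index would typically be read from the DNUM register.

```c
#include <stddef.h>

/* Half-open range [*start, *end) of work items handled by one core.
 * The first (total % cores) cores each take one extra item.
 * Requires cores > 0 and core_id < cores. */
void core_slice(size_t total, size_t cores, size_t core_id,
                size_t *start, size_t *end)
{
    size_t base  = total / cores;
    size_t extra = total % cores;

    *start = core_id * base + (core_id < extra ? core_id : extra);
    *end   = *start + base + (core_id < extra ? 1 : 0);
}
```

For example, with 128 items and 2 cores, core 0 gets [0, 64) and core 1 gets [64, 128). The same helper covers the 4-core and 8-core splits by changing `cores`.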
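Connect the emulator. The cfg is set up for 8-core synchronization and I did not bother to change it, so group the 8 cores and load the .out files for all 8.
A small tip for loading the 8 .out files: picking 8 .out paths one by one is tedious, but once you have picked them, you can choose Reload Program, which automatically loads the path last selected for each of the 8 cores. This is worth marking down. I did not know about it at first, and selecting 8 .out files for every debug session nearly made me lose the will to live.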
Reload Program
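Select the Group and run: 5 ms, taking core0's time as the reference. PS: the time reported for core1 covers only one processing stage; in the output shown it should be the CFAR processing time.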
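3.2 Four-core processing
MTD: 2 cores, delayNum/2 each. CFAR: 4 cores, fftNum/4 each.
Because MTD is less time-consuming than CFAR, MTD uses only 2 cores.
Select the Group and run: 3.7 ms.
3.3 Eight-core processing
MTD: 2 cores, delayNum/2 each. CFAR: 8 cores, fftNum/8 each.
Select the Group and run: 3 ms.
Full-speed run, with MTD on 8 cores (delayNum/8 each) and CFAR on 8 cores (fftNum/8 each).
Run: 2.7 ms.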
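To put the multi-core timings in perspective, speedup and parallel efficiency can be computed from the measured times. This is a sketch only: the function names are hypothetical, and using the 8.2 ms optimized single-core result as the baseline is an assumption.

```c
/* Speedup of a multi-core run relative to the single-core baseline. */
double speedup(double t_single_ms, double t_multi_ms)
{
    return t_single_ms / t_multi_ms;
}

/* Parallel efficiency: speedup divided by the core count (1.0 is ideal). */
double efficiency(double t_single_ms, double t_multi_ms, unsigned cores)
{
    return speedup(t_single_ms, t_multi_ms) / (double)cores;
}
```

With the figures above, 2 cores give 8.2/5 = 1.64x, 4 cores about 2.2x, and 8 cores about 2.7x to 3.0x. Efficiency therefore falls from about 82% on 2 cores to below 40% on 8 cores, which suggests that the serial parts and the data distribution by the master core increasingly dominate.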
4. Notes
L2 is configured with 256 KB of cache: the lower 256 KB (starting at 0x00800000) is SRAM, and the upper 256 KB is cache.
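The resulting L2 address layout can be written out explicitly. This is a minimal sketch: the macro and function names are hypothetical, and the 512 KB total is inferred from the 256 KB + 256 KB split described above.

```c
#include <stdint.h>

#define L2_BASE   0x00800000u        /* local L2 base address, from the note above */
#define L2_TOTAL  (512u * 1024u)     /* 256 KB SRAM + 256 KB cache (inferred) */

/* First address past the usable L2 SRAM for a given cache size.
 * Requires cache_bytes <= L2_TOTAL. */
uint32_t l2_sram_end(uint32_t cache_bytes)
{
    return L2_BASE + (L2_TOTAL - cache_bytes);
}

/* Non-zero if [addr, addr + size) lies entirely inside the L2 SRAM. */
int in_l2_sram(uint32_t addr, uint32_t size, uint32_t cache_bytes)
{
    uint32_t end = l2_sram_end(cache_bytes);
    return addr >= L2_BASE && addr <= end && size <= end - addr;
}
```

With a 256 KB cache, `l2_sram_end(256 * 1024)` is 0x00840000, so the SRAM spans 0x00800000 to 0x0083FFFF and anything linked into L2 has to stay below 0x00840000.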
/* Print the signal-processing status: buffer addresses, per-core timing,
 * per-core MTD/CFAR work split, and the list of detected targets. */
void PrintSpStatus()
{
    SP_STATUS *pSpStatus = &pCoreStatus->spStatus;
    UINT i = 0;

    PRT("------------------------------ spBuf Info --------------------------------------- \n");
    PRT("recvData recvData recvData recvData recv ram fftout cfarData cfarIn \n");
    PRT(" Buf[0] Buf[1] Buf[2] Buf[3] Buf Buf Buf Buf Buf \n");
    PRT("%6x %6x %6x %6x %6x %6x %6x %6x %6x \n",
        pSpStatus->recvDataBuf[0], pSpStatus->recvDataBuf[1], pSpStatus->recvDataBuf[2],
        pSpStatus->recvDataBuf[3], pSpStatus->recvBuf, pSpStatus->ramBuf,
        pSpStatus->fftOutBuf, pSpStatus->cfarDataBuf, pSpStatus->cfarInBuf);

    PRT("------------------------------ sp Info ------------------------------------------ \n");
    PRT("tar overflow core0 core1 core2 core3 core4 core5 core6 core7 fft delay sp \n");
    PRT("Num Cnt Time Time Time Time Time Time Time Time num num cnt \n");
    PRT("%d %6d %6d %6d %6d %6d %6d %6d %6d %6d %6d %6d %6d \n",
        pSpStatus->tarNumOut, pSpStatus->overflowCnt,
        pSpStatus->spTime[0], pSpStatus->spTime[1], pSpStatus->spTime[2], pSpStatus->spTime[3],
        pSpStatus->spTime[4], pSpStatus->spTime[5], pSpStatus->spTime[6], pSpStatus->spTime[7],
        pSpStatus->fftNumAll, pSpStatus->delayNumAll, pSpStatus->spCnt);

    PRT("------------------------------ fft Info ------------------------------------------ \n");
    PRT("core0 core1 core2 core3 core4 core5 core6 core7 \n");
    PRT("delay delay delay delay delay delay delay delay \n");
    PRT("%d %6d %6d %6d %6d %6d %6d %6d \n",
        pSpStatus->fftDelayNum[0], pSpStatus->fftDelayNum[1], pSpStatus->fftDelayNum[2], pSpStatus->fftDelayNum[3],
        pSpStatus->fftDelayNum[4], pSpStatus->fftDelayNum[5], pSpStatus->fftDelayNum[6], pSpStatus->fftDelayNum[7]);

    PRT("------------------------------ cfar Info ------------------------------------------ \n");
    PRT("core0 core1 core2 core3 core4 core5 core6 core7 \n");
    PRT("fftNum fftNum fftNum fftNum fftNum fftNum fftNum fftNum \n");
    PRT("%d %6d %6d %6d %6d %6d %6d %6d \n",
        pSpStatus->cfarFftNum[0], pSpStatus->cfarFftNum[1], pSpStatus->cfarFftNum[2], pSpStatus->cfarFftNum[3],
        pSpStatus->cfarFftNum[4], pSpStatus->cfarFftNum[5], pSpStatus->cfarFftNum[6], pSpStatus->cfarFftNum[7]);

    PRT("------------------------------ spTarget Info ----------------------------------- \n");
    PRT("tar r range fre angle mag \n");
    PRT("Idx Idx Out Out Out Out \n");
    for (i = 0; i < pSpStatus->tarNumOut; i++)
    {
        PRT("%d %6d %6d %6d %6d %6d \n",
            pSpStatus->spInfo[i].tarIndexOut, pSpStatus->spInfo[i].r_index, pSpStatus->spInfo[i].rangeOut,
            pSpStatus->spInfo[i].freOut, pSpStatus->spInfo[i].angleOut, pSpStatus->spInfo[i].magSumOut);
    }
}