目录
2.2 IPcore的testBench与c-simulation
一、IPcore代码概览
1.1 接口
//----------------convolution in FPGA-----------------------------------
int convolution_3x3(int inHight,int inWidth,int inChanNum,int outHight,int outWidth,int OutChanNum,
int stride,
volatile float *weight_ptr,volatile float *input_ptr,volatile float *output_ptr){
#pragma HLS INTERFACE s_axilite port=inHight bundle=axilite
#pragma HLS INTERFACE s_axilite port=inWidth bundle=axilite
#pragma HLS INTERFACE s_axilite port=inChanNum bundle=axilite
#pragma HLS INTERFACE s_axilite port=outHight bundle=axilite
#pragma HLS INTERFACE s_axilite port=outWidth bundle=axilite
#pragma HLS INTERFACE s_axilite port=OutChanNum bundle=axilite
#pragma HLS INTERFACE s_axilite port=stride bundle=axilite
#pragma HLS INTERFACE s_axilite port=return bundle=axilite
#pragma HLS INTERFACE m_axi depth=DRAM_DEPTH port=weight_ptr offset=slave bundle=memorybus
#pragma HLS INTERFACE m_axi depth=DRAM_DEPTH port=input_ptr offset=slave bundle=memorybus
#pragma HLS INTERFACE m_axi depth=DRAM_DEPTH port=output_ptr offset=slave bundle=memorybus
接口描述:
软件端:通过函数接口传入参数,实现卷积运算。
硬件端:通过axi-lite协议将神经网络的参数通过ARM传入IPcore;
通过m-axi用IPcore从DRAM上获取相应的权重,特征以及写出卷积输出到DRAM上。
1.2 功能
用FPGA加速实现MTCNN之中的3*3卷积。
1.3 时间与空间资源
1.3.1 空间资源
下图为IPcore在7z035上占用的资源,在7z020上资源超出预期。
1.3.2 时间资源
因卷积为变长卷积,所以我们以Pnet的第一层输入为准来确定相应的时钟周期(Pnet的输入与Rnet和Onet不同,Pnet输入没有resize来缩减尺寸,因此卷积尺寸较大且较为耗时,Pnet第一层在单片机端卷积的模拟为数秒左右)。
卷积尺寸:输入640*480*3,输出尺寸640*480*10,权重尺寸 3*3*3*10
时钟周期:
总体的latency与II并不大,表明并行是有效果的。
二、IPcore正确性及验证
2.1 IPcore在MTCNN之中的调用
直接在mtcnn之中调用IPcore的c代码,每次3*3的卷积都对IPcore的c代码进行一次调用,例如:
convolution_3x3(this->rgb->height,this->rgb->width,this->rgb->channel,
this->conv1_out->height,this->conv1_out->width,this->conv1_out->channel,
this->conv1_wb->stride,
this->conv1_wb->pdata,this->rgb->pdata,this->conv1_out->pdata);
经过调用之后的mtcnn能正确运行且产生正确的结果。
2.2 IPcore的testBench与c-simulation
编写testBench,将IPcore代码独立出来进行验证。且与卷积结果进行对比。无误,证明IPcore的代码可以独立出MTCNN的代码进行synthesis。
2.3 synthesis与RTL输出
IPcore的c代码输出及RTL输出正常无报错。
2.4 系统搭建与烧录
系统搭建validate未见异常,比特流生成正常,比特流通过SDK软件烧入FPGA正常。
三、SDK端的调用
3.1 初始化IPcore
通过HLS驱动来初始化IPcore及开始IPcore
//set conv IPcore value
XConvolution_3x3_Set_inHight(&XConvolution_3x3_Core,featureIn.height);
XConvolution_3x3_Set_inWidth(&XConvolution_3x3_Core,featureIn.width);
XConvolution_3x3_Set_inChanNum(&XConvolution_3x3_Core,featureIn.channel);
XConvolution_3x3_Set_outHight(&XConvolution_3x3_Core,conv_PL_out.height);
XConvolution_3x3_Set_outWidth(&XConvolution_3x3_Core,conv_PL_out.width);
XConvolution_3x3_Set_OutChanNum(&XConvolution_3x3_Core,conv_PL_out.channel);
XConvolution_3x3_Set_stride(&XConvolution_3x3_Core,weightIn.stride);
XConvolution_3x3_Set_weight_ptr(&XConvolution_3x3_Core,(unsigned int)weightIn.pdata);
XConvolution_3x3_Set_input_ptr(&XConvolution_3x3_Core,(unsigned int)featureIn.pdata);
XConvolution_3x3_Set_output_ptr(&XConvolution_3x3_Core,(unsigned int)conv_PL_out.pdata);
printf("Set conv parameters SUCCESS!\n");
//conv in PL
XConvolution_3x3_Start(&XConvolution_3x3_Core);
printf("IPcore conv start SUCCESS!\n");
3.2 判断IPcore是否完成
for(int cur_sleep_times=0;cur_sleep_times<20;cur_sleep_times++){
if(!XConvolution_3x3_IsDone(&XConvolution_3x3_Core)){
printf("IPcore Not Done! current sleep times is %d \n",cur_sleep_times);
}
else{
printf("IP core Done SUCCESS!");
break;
}
usleep(1000*7);//7 mlili second
}
isDone变为1表示IPcore正常运行结束。
3.3 运行情况
小尺寸卷积正常运行
--------------program start-------------
init network parameters run time is 0.000474 mili second
Output variable init SUCCESS!
Write conv data to DRAM run time is 2.690571 mili second
Initialize XConvolution_3x3_Core IPcore SUCCESS!
---------print IP core value---------
IP core return is 0
IP core isDone is 0
IP core get inHight is 5
IP core get weight prt is 1000000
Set conv parameters SUCCESS!IPcore conv start SUCCESS!
---------print IP core value---------
IP core return is 0
IP core isDone is 1
IP core get inHight is 5
IP core get weight prt is 1000000
Strat again SUCCESS!
IP core Done SUCCESS!
Input_Pixels is 50 and hex memory size is 000000c8
weight_pixels is 36 and hex memory size is 00000090
Output_Pixels is 18 and hex memory size is 00000048
Input pointer value is 01000090
Weight pointer value is 01000000
Output PS pointer value is 010000d8
InputSize is 5, In_channels is 2, Input_Pixels is 50
OutputSize is 3, Out_channels is 2, Output_Pixels is 18
Stride is 1, weight_pixels is 36
------------Program End SUCCESS!-----------
卷积尺寸较大时导致单片机死机,例如下面这个尺寸:
InputSize is 23, In_channels is 32, Input_Pixels is 16928
OutputSize is 21, Out_channels is 64, Output_Pixels is 28224
Stride is 1, weight_pixels is 18432
Input_Pixels is 16928 and hex memory size is 00010880
weight_pixels is 18432 and hex memory size is 00012000
Output_Pixels is 28224 and hex memory size is 0001b900
Input pointer value is 01012000
Weight pointer value is 01000000
Output PS pointer value is 0102d900
3.4 IPcore问题总结
IPcore对DDR的指针无法与单片机共享,因为小尺寸卷积输出写入DDR的值单片机无法读出。
IPcore实现尺寸较大卷积会导致单片机死机。
四、与zynqNet的对比
4.1 一次卷积HLS的时钟周期
下面为zynqNet的时钟周期:
4.2 MACC操作的次数
全局变量的增加:在fpgaAcc.cpp之中,加入extern int,然后在pikaqiu中程序之外定义全局变量int。
MTCNN选用一张:85,176,568 为10^7数量级
另一张:43,543,288为 10^7数量级
zynqNe的MACC次数为: 152,731,648 ,为10^8数量级