MTCNN (7): Rewriting Convolution as Nested for Loops

祥瑞Coding

Categories: c/c++, machine learning, object detection, MTCNN

Article link: https://blog.csdn.net/weixin_36474809/article/details/83145601

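Background: Deploying MTCNN on an FPGA requires the code to be written in C, and the multiply-accumulate operations in the current C code depend on the OpenBLAS library. Following the zynqNet approach requires splitting convolution into 3*3 convolutions, so the gemm form cannot be used.

Goal: Remove the OpenBLAS dependency from the convolution and fully connected layers and implement them as nested for loops, consistent with zynqNet, so that they can be parallelized.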

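1. gemm

1.1 Understanding gemm for convolution

For the im2col process, see: https://blog.csdn.net/lanchunhui/article/details/74838635

A common way to implement convolution is to unfold the feature map into a matrix and multiply it by the weight matrix.

References:

MTCNN (6): Changing the network structure in the C code, section 4.3 convolution
YOLOv3: Darknet code analysis (3), the convolution operation https://blog.csdn.net/weixin_36474809/article/details/81296612

Parameters of cblas_sgemm: https://blog.csdn.net/u012235274/article/details/52769682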

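cblas_sgemm(order, transA, transB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC);

The first argument is the storage order, either row-major or column-major (C is row-major).
The second and third arguments specify whether A and B are transposed.
After transA is applied, matrix A has dimensions M×K.
After transB is applied, matrix B has dimensions K×N.
Matrix C has dimensions M×N.
LDA and LDB are the leading dimensions of the corresponding matrices as stored, before any transposition (for row-major storage, the number of columns). LDC is the leading dimension of C.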
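To make the roles of M, N, K and the leading dimensions concrete, here is a minimal sketch of a row-major multiply with the same argument order as the call above. The name naive_sgemm is hypothetical and this is not the OpenBLAS implementation; it only illustrates how the indices are formed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference routine (not part of OpenBLAS):
 * C = alpha * op(A) * op(B) + beta * C, all matrices row-major.
 * op(A) is M x K, op(B) is K x N, C is M x N.
 * lda, ldb, ldc are the row strides of A, B, C as stored,
 * that is, before any transposition is applied. */
static void naive_sgemm(int transA, int transB, int M, int N, int K,
                        float alpha, const float *A, int lda,
                        const float *B, int ldb,
                        float beta, float *C, int ldc)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < K; ++k) {
                /* A transposed operand is read with swapped indices. */
                float a = transA ? A[k * lda + i] : A[i * lda + k];
                float b = transB ? B[j * ldb + k] : B[k * ldb + j];
                sum += a * b;
            }
            /* With beta == 0 the old contents of C are ignored entirely. */
            if (beta == 0.0f)
                C[i * ldc + j] = alpha * sum;
            else
                C[i * ldc + j] = alpha * sum + beta * C[i * ldc + j];
        }
    }
}
```

Note that when B is passed transposed, ldb is still the row stride of B as it sits in memory, which is why the convolution call in the next section passes matrixIn->width for it.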

1.2 Replacing cblas_sgemm with gemm

// -------------convolution in 2D matrix format-------------------
// input kernel matrix  *  input feature matrix(Trans) = output feature matrix
// height (outChannels)    height (3D_KernelSize)        height (outChannels)
// width  (3D_KernelSize)  width  (outFeatureSize)       width  (outFeatureSize)

//C=αAB + βC :   outpBox=weightIn*matrixIn(T)
	//       A_transpose       B_transpose
	gemm_cpu(0,                1,            \
	//A row C row             B col C col        A col B row        alpha
	weightIn->out_ChannelNum, matrixIn->height,  matrixIn->width,   1,   \
	//A*            A'col           B*              B'col             beta
	weightIn->pdata,matrixIn->width,matrixIn->pdata,matrixIn->width,  0, \
	//C*             C'col
	outpBox->pdata,  matrixIn->height);

After the replacement, the program runs correctly.
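The matrix layout described in the comments above can be sketched as follows. The helper names feature_to_matrix and conv_by_matmul are hypothetical, and the sketch assumes CHW input, no padding, and one matrix row per output pixel (so the matrix is outFeatureSize × 3D_KernelSize and is used transposed). It is not the MTCNN source, only an illustration of why B_transpose is set to 1.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: unfold a CHW input into a matrix with one row per
 * output pixel and one column per kernel entry (channels * k * k). */
static void feature_to_matrix(const float *in, int channels, int height, int width,
                              int k, int stride, float *mat)
{
    assert(in != NULL && mat != NULL);
    int out_h = (height - k) / stride + 1;
    int out_w = (width - k) / stride + 1;
    int cols = channels * k * k;
    for (int oy = 0; oy < out_h; ++oy)
        for (int ox = 0; ox < out_w; ++ox) {
            float *row = mat + (oy * out_w + ox) * cols;
            for (int c = 0; c < channels; ++c)
                for (int fy = 0; fy < k; ++fy)
                    for (int fx = 0; fx < k; ++fx)
                        row[(c * k + fy) * k + fx] =
                            in[(c * height + oy * stride + fy) * width
                               + ox * stride + fx];
        }
}

/* out[oc][p] = sum_j weight[oc][j] * mat[p][j], i.e. weight * mat^T.
 * Returns 0 on success, -1 if the temporary matrix cannot be allocated. */
static int conv_by_matmul(const float *in, int channels, int height, int width,
                          const float *weight, int out_channels, int k, int stride,
                          float *out)
{
    int out_h = (height - k) / stride + 1;
    int out_w = (width - k) / stride + 1;
    int rows = out_h * out_w;       /* outFeatureSize */
    int cols = channels * k * k;    /* 3D_KernelSize  */
    float *mat = malloc(sizeof(float) * (size_t)rows * (size_t)cols);
    if (mat == NULL)
        return -1;
    feature_to_matrix(in, channels, height, width, k, stride, mat);
    for (int oc = 0; oc < out_channels; ++oc)
        for (int p = 0; p < rows; ++p) {
            float sum = 0.0f;
            for (int j = 0; j < cols; ++j)
                sum += weight[oc * cols + j] * mat[p * cols + j];
            out[oc * rows + p] = sum;
        }
    free(mat);
    return 0;
}
```

Both operands are walked along their rows in the inner loop, which is the access pattern that "A not transposed, B transposed" expresses in the gemm call.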

2. cblas_sgemv in the fully connected layer

2.1 sgemv in MTCNN

sgemv in OpenBLAS: https://blog.csdn.net/chenlanjie842179335/article/details/8043925

Formula: C=alpha*A*b+beta*C

Typically alpha=1.0 and beta=0.0, which gives C=A*b

cblas_sgemv(CblasRowMajor, CblasNoTrans, rows of A, columns of A, alpha, A, columns of A, b, 1, beta, C, 1)

	//Y=αAX + βY    β is set to 0 here, so Y is overwritten  cblas_sgemv: multiplies a matrix by a vector (single precision)
	//          row_Major      no_trans      A height                A width               alpha
	cblas_sgemv(CblasRowMajor, CblasNoTrans, weight->out_ChannelNum, weight->in_ChannelNum,1,   \
	//A*           A width                x               1   beta  C*              1
	weight->pdata, weight->in_ChannelNum, Inpbox->pdata,  1,  0,    outpBox->pdata, 1);

2.2 sgemv in YOLO

    int m = l.batch;
    int k = l.inputs;
    int n = l.outputs;
    float *a = net.input;//input
    float *b = l.weights;//weight
    float *c = l.output;//output
    gemm(0,1,m,n,k,1,a,k,b,k,1,c,n);

However, YOLO places the input on the left and the weight on the right, whereas we need the weight on the left and the input on the right.
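For a single input vector the two orderings produce the same numbers, only the shape of the result differs (1×n versus n×1). The sketch below illustrates this under the assumption that the weights are stored as an n×k row-major matrix (outputs × inputs); the function names are hypothetical and neither is taken from YOLO or MTCNN.

```c
#include <assert.h>
#include <stddef.h>

/* Input on the left: out (m x n) = in (m x k) * W^T, with W stored n x k.
 * m is the batch size. Unlike the gemm call above, which uses beta = 1,
 * this sketch overwrites out. */
static void fc_input_left(const float *in, int m, const float *w,
                          int n, int k, float *out)
{
    assert(in != NULL && w != NULL && out != NULL);
    for (int b = 0; b < m; ++b)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int p = 0; p < k; ++p)
                sum += in[b * k + p] * w[j * k + p];
            out[b * n + j] = sum;
        }
}

/* Weight on the left: out (n x 1) = W (n x k) * in (k x 1). */
static void fc_weight_left(const float *w, const float *in,
                           int n, int k, float *out)
{
    assert(in != NULL && w != NULL && out != NULL);
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int p = 0; p < k; ++p)
            sum += w[i * k + p] * in[p];
        out[i] = sum;
    }
}
```

With m equal to 1 the two loops visit the same products in the same order, so switching to the weight-on-the-left form does not require any change to how the weights are stored.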

2.3 Implementing sgemv directly with gemm

//C=αAB + βC :   outpBox=weight*Inpbox
	//       A_transpose       B_transpose
	gemm_cpu(0,                0,            \
	//A height C height       B width C width   A width B height         alpha
	weight->out_ChannelNum, 1,                weight->in_ChannelNum,   1,   \
	//A*             A'width                B*              B'width   beta
	weight->pdata,   weight->in_ChannelNum, Inpbox->pdata,  1,        0, \
	//C*             C'width
	outpBox->pdata,  1);

After verification, all of the OpenBLAS calls can be replaced with our own code, which removes the dependency on the OpenBLAS library.

3. Removing the dependency on OpenCV

3.1 How zynqNet reads images

zynqNet converts the image directly into a binary file so that it is easy to read.
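As a rough illustration of that idea, the sketch below reads a file of raw values into a buffer with fread. The file format is an assumption made for this example (raw 32-bit floats in the machine's native byte order, with no header); the actual zynqNet format is not described here, and load_raw_floats is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical loader: reads up to `count` raw floats from `path` into
 * `buffer`. Assumes a headerless file of native-endian 32-bit floats.
 * Returns the number of floats actually read (0 if the file cannot be opened). */
static size_t load_raw_floats(const char *path, float *buffer, size_t count)
{
    assert(path != NULL && buffer != NULL);
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return 0;
    size_t n = fread(buffer, sizeof(float), count, fp);
    fclose(fp);
    return n;
}
```

A caller would compare the return value against channels * height * width to detect a truncated file before running the network.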

3.2 MTCNN includes two image libraries

Both are included in network.h:

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"

using namespace cv;

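The OpenCV library is still fairly important for debugging at this stage, so this topic will be revisited later during FPGA deployment.

See MTCNN (9): Removing the OpenCV dependency https://blog.csdn.net/weixin_36474809/article/details/83343514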

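4. Rewriting convolution as nested for loops

Before making the change, the program needs to be understood from the following aspects.

The im2col pattern in YOLO
The nested for loop convolution pattern in YOLO
The im2col pattern in MTCNN

4.1 The im2col function in YOLO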

//YOLO  additionally.c  
float im2col_get_pixel(float *im, int height, int width, int channels,
    int row, int col, int channel, int pad){
		
    row -= pad;
    col -= pad;

    if (row < 0 || col < 0 ||
        row >= height || col >= width) return 0;
    return im[col + width*(row + height*channel)];
}

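Given the current height, width, channel, row and col, this function returns the corresponding pixel of the padded image; positions that fall inside the padding return 0.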

// im2col_CPU.c
//left matrix weight,right matrix data_col
//data_col  height (3D_kernelSize), width (Out_featureSize)
int channels_col = channels * ksize * ksize;//3D_kernelSize
for (c = 0; c < channels_col; ++c) {
	int w_offset = c % ksize;
	int h_offset = (c / ksize) % ksize;
	int c_im = c / ksize / ksize;
	for (h = 0; h < height_col; ++h) {
		for (w = 0; w < width_col; ++w) {
			int im_row = h_offset + h * stride;
			int im_col = w_offset + w * stride;
			int col_index = (c * height_col + h) * width_col + w;
			data_col[col_index] = im2col_get_pixel(data_im, height, width, channels,
				im_row, im_col, c_im, pad);
		}
	}
}

4.2 The nested for loops in YOLO

for (fil = 0; fil < l.n; ++fil) {//channels out
int chan, y, x, f_y, f_x;
// channel index
for (chan = 0; chan < l.c; ++chan)//channels in
// input - y
for (y = 0; y < l.h; ++y)
// input - x
for (x = 0; x < l.w; ++x){
	
	//for channels out,for channels in,for row,for col
	int const output_index = fil*l.w*l.h + y*l.w + x;
	int const weights_pre_index = fil*l.c*l.size*l.size + chan*l.size*l.size;
	int const input_pre_index = chan*l.w*l.h;
	float sum = 0;

	// filter - y
	for (f_y = 0; f_y < l.size; ++f_y)
	{
		int input_y = y + f_y - l.pad;
		// filter - x
		for (f_x = 0; f_x < l.size; ++f_x)
		{
			int input_x = x + f_x - l.pad;
			if (input_y < 0 || input_x < 0 || input_y >= l.h || input_x >= l.w) continue;

			int input_index = input_pre_index + input_y*l.w + input_x;
			int weights_index = weights_pre_index + f_y*l.size + f_x;

			sum += state.input[input_index] * l.weights[weights_index];
		}
	}
	// l.output[filters][width][height] +=
	//        state.input[channels][width][height] *
	//        l.weights[filters][channels][filter_width][filter_height];
	l.output[output_index] += sum;
}
}

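Inside the loops over output channel, input channel, output row and output column, the offset addresses are computed first,

and then the multiply-accumulate over the kernel is computed for the current output pixel.

4.3 Writing the convolution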

//set the output value to 0 (out_featureSize must cover every output element, across all output channels)
for(cur_col_out=0;cur_col_out<out_featureSize;cur_col_out++)
	output_ptr[cur_col_out]=0;

//loop counters and temporaries are declared outside the loop, so each thread needs a private copy
#pragma omp parallel for private(cur_channel_in,cur_row_out,cur_col_out,filter_row,filter_col, \
		output_loc,weight_pre_loc,input_pre_loc,weight_loc,input_loc,sum)
for(cur_channel_out=0; cur_channel_out<out_ChannelNum; cur_channel_out++){//out_channel
 for(cur_channel_in=0;cur_channel_in<in_ChannelNum;cur_channel_in++){//in_channel
  for(cur_row_out=0;cur_row_out<out_height;cur_row_out++){//out_row,out_height
	for(cur_col_out=0;cur_col_out<out_width;cur_col_out++){//out_col,out_width
		output_loc=cur_channel_out*out_height*out_width+cur_row_out*out_width+cur_col_out;
		weight_pre_loc=cur_channel_out*in_ChannelNum*kernelSize_2D + cur_channel_in*kernelSize_2D;
		input_pre_loc=cur_channel_in*in_width*in_height  \
								+ cur_row_out*stride*in_width+stride*cur_col_out;
		sum=0;
// outpBox [out_ChannelNum][out_height][out_width] +=
//		weightIn[out_ChannelNum][in_ChannelNum][kernelHeight][kernelWidth] *
//		pboxIn[in_ChannelNum][height][width] 
		for (filter_row=0;filter_row<kernelSize;filter_row++){
		  for(filter_col=0;filter_col<kernelSize;filter_col++){
			weight_loc=weight_pre_loc+filter_row*kernelSize+filter_col;
			input_loc=input_pre_loc+filter_row*in_width+filter_col;
			sum+=weight_ptr[weight_loc]*input_ptr[input_loc];
		  }
		}
		output_ptr[output_loc]+=sum;
	}
  }
 }
}

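We rewrote the convolution as nested for loops, and the program passed verification. From here we can start moving MTCNN onto zynqNet step by step, following the zynqNet approach.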
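One simple way to carry out such a verification is to run both versions on the same input and compare the outputs element by element. The helper below is a minimal sketch with a hypothetical name, not the check used in the original project. Because gemm and the nested loops may add the products in a different order, the results can differ in the last few bits, so a small tolerance (for example 1e-4, an assumed value) is usually more appropriate than exact equality.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest absolute element-wise difference between
 * two float buffers of length n. */
static float max_abs_diff(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

A typical use is to keep the gemm output in one buffer, the nested-loop output in another, and flag the layer if max_abs_diff exceeds the chosen tolerance.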

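4.4 Fixing a bug

When the nested-loop convolution was first written, its results did not match the gemm version. Looking at the gemm code showed that it begins by zeroing the output matrix before the multiplication.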

void gemm_cpu(int TA, int TB, int M, int N, int K,  
        float *A, int lda, 
        float *B, int ldb,
        float *C, int ldc)
{
	int i,j;
	for(i = 0; i < M; ++i){
        for(j = 0; j < N; ++j){
            C[i*ldc + j] = 0;
        }
    }
...

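At first we assumed that the convolutionInit function had already zeroed the buffer with memset. Further checking showed otherwise: Pnet allocates many separate buffers (because of the image pyramid, the feature size differs at each scale), whereas Rnet and Onet have a fixed memory layout that is allocated once at network initialization and reused on every run. The output must therefore be set to 0 before each convolution, otherwise the new result is added to the previous one. We printed a value from the output buffer at each convolution, before and after the explicit zeroing: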

Start run Pnet
Pnet buffer init
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
Start Pnet generate Bbox
Done Pnet generate Bbox
Done run Pnet
Run nms
...
Rnet run
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
just memset 0 :0.000000
after =0 : 0.000000
Rnet run
just memset 0 :0.518421
after =0 : 0.000000
just memset 0 :0.212925
after =0 : 0.000000
just memset 0 :0.009512
after =0 : 0.000000
Rnet run

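All of Pnet's working buffers are newly allocated, whereas Rnet and Onet run many times over the same memory. The initial values are therefore 0 on the first run but hold the previous results on later runs. With this bug fixed, the structure has been converted into the corresponding nested-loop format.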
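The effect can be reproduced in isolation. The following sketch uses a hypothetical single-channel convolution without padding that accumulates into its output with +=, as the nested-loop version does, and a wrapper that clears the buffer first. It is a reduced illustration, not the MTCNN code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical single-channel convolution (stride 1, no padding) that
 * accumulates into out, mirroring the `output += sum` pattern. */
static void conv2d_accumulate(const float *in, int in_h, int in_w,
                              const float *kernel, int k, float *out)
{
    assert(in != NULL && kernel != NULL && out != NULL);
    int out_h = in_h - k + 1;
    int out_w = in_w - k + 1;
    for (int y = 0; y < out_h; ++y)
        for (int x = 0; x < out_w; ++x) {
            float sum = 0.0f;
            for (int fy = 0; fy < k; ++fy)
                for (int fx = 0; fx < k; ++fx)
                    sum += kernel[fy * k + fx] * in[(y + fy) * in_w + x + fx];
            out[y * out_w + x] += sum;   /* adds to whatever is already there */
        }
}

/* Clearing the reused buffer first makes each run independent of the last. */
static void conv2d_fresh(const float *in, int in_h, int in_w,
                         const float *kernel, int k, float *out)
{
    int out_h = in_h - k + 1;
    int out_w = in_w - k + 1;
    memset(out, 0, sizeof(float) * (size_t)out_h * (size_t)out_w);
    conv2d_accumulate(in, in_h, in_w, kernel, k, out);
}
```

Calling conv2d_accumulate twice on the same output buffer doubles every value, which matches the non-zero "before" values seen on the second Rnet run in the log; conv2d_fresh gives the same result no matter how often the buffer is reused.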

4.5 Writing the fully connected layer

//--------------------fc layer	in nested loop format--------------
//loop variables
int cur_outChannel,cur_inChannel;
int out_ChannelNum=weight->out_ChannelNum, in_ChannelNum=weight->in_ChannelNum;
//location variables
int weight_loc_pre,weight_loc;
//accumulator
float sum;
for(cur_outChannel=0;cur_outChannel<out_ChannelNum;cur_outChannel++){
	sum=0;
	weight_loc_pre=cur_outChannel*in_ChannelNum;
	for(cur_inChannel=0;cur_inChannel<in_ChannelNum;cur_inChannel++){
		weight_loc=weight_loc_pre+cur_inChannel;
		sum+=weight->pdata[weight_loc]*Inpbox->pdata[cur_inChannel];
	}
	outpBox->pdata[cur_outChannel]=sum;
}

At this point we have removed the dependency on the OpenBLAS library and converted the program to the zynqNet style based on nested for loops.
