FPGA Implementation of the Convolution Function (3): Adding HLS Pragmas

Background: We have written the IP core for the convolution operation and verified it in C simulation. Now we need to add the HLS pragmas and turn it into a hardware structure.

Goal: add the HLS pragmas.

Related articles on how HLS pragmas are mapped onto hardware:

 FPGA Basics (10): HLS operations on loops https://blog.csdn.net/weixin_36474809/article/details/81479551

 FPGA Basics (11): HLS operations on arrays https://blog.csdn.net/weixin_36474809/article/details/81483993

 FPGA Basics (12): HLS optimizations for increasing throughput https://blog.csdn.net/weixin_36474809/article/details/81665911

Most important:

Vivado HLS hardware directives (4): directive optimizations for convolution https://blog.csdn.net/weixin_36474809/article/details/84587535

UG902 documentation, v2016.4

Contents

1. Implementing the arrays as BRAMs
1.1  IBRAM
  IBRAM: the zynqNet approach
  IBRAM: adding it to the MTCNN code
1.2  OBRAM
  OBRAM: the zynqNet approach
  OBRAM: adding it to the MTCNN code
1.3  WBRAM
  WBRAM: the zynqNet implementation
  WBRAM: adding it to the MTCNN code
2. Parallelization and the related array partitioning
2.1 UNROLL inside the MACC
  UNROLL: the zynqNet pattern
  UNROLL: adding it to MTCNN
2.2 ARRAY_PARTITION related to UNROLL
  zynqNet's ARRAY_PARTITION
  MTCNN's ARRAY_PARTITION
2.3 Parallelizing the MACC units
  How zynqNet parallelizes the MACC units
  The MTCNN directives for parallelizing the MACC units
3. INLINE and instantiation
3.1 Instantiation around OBRAM
3.2 Instantiation for writing OBRAM back to DRAM
3.3 Loading weights from DRAM into WBRAM
3.4 Fetching weights from WBRAM into the PEs
3.5 Loading the image from DRAM into IBRAM
3.6 Reading IBRAM out to the PEs


1. Implementing the arrays as BRAMs

There are only three BRAMs. Following the zynqNet pattern, we attach the appropriate optimization directives to each of them.

Using the RESOURCE and ARRAY_PARTITION directives, the weights, the image, and the accumulated outputs are mapped onto concrete BRAMs; the general pattern is sketched below.
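As a reference, the general pattern (not the actual project declarations; CACHE, DEPTH and the access function here are placeholders) looks roughly like this:

// Sketch only: generic pattern for mapping an on-chip array onto BRAM in Vivado HLS.
// CACHE and DEPTH are placeholder names, not the real project identifiers.
#define DEPTH 1024
#define N_PE  8

static float CACHE[DEPTH];

void cache_write(int addr, float value) {
#pragma HLS INLINE
// Bind the array to a true dual-port block RAM (both ports can read and write).
#pragma HLS RESOURCE variable=CACHE core=RAM_T2P_BRAM
// Split the array cyclically over N_PE physical BRAMs so that N_PE accesses
// to consecutive addresses can happen in the same cycle.
#pragma HLS ARRAY_PARTITION variable=CACHE cyclic factor=N_PE
	CACHE[addr] = value;
}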

1.1  IBRAM

IBRAM: the zynqNet approach

//get pixel out from BRAM
data_t ImageCache::getPixel(const coordinate_t y, const imgcacheaddr_t y_offset,
		const coordinate_t x, const channel_t ci) {
#pragma HLS inline
#pragma HLS RESOURCE variable = IBRAM core = RAM_S2P_BRAM
	imgcacheaddr_t addr_pixel_offset = x * ch_in;//col_offset
	imgcacheaddr_t addr = y_offset + addr_pixel_offset + ci;//row_offset+col_offset+channel_offset
	bool is_padding_pixel = x < 0 | x >= width_in | y < 0 | y >= height_in;
	data_t px = is_padding_pixel ? 0.0f : IBRAM[addr];
	return px;
}

IBRAM is bound directly to a dual-port BRAM.

#pragma HLS RESOURCE variable = IBRAM core = RAM_S2P_BRAM means a simple dual-port RAM: one port is used for reading while the other is used for writing.

The other caches are likewise bound to their BRAMs inside the functions that read them out from BRAM to the PEs.

IBRAM: adding it to the MTCNN code

float ImageCache::get_IBRAM_Pixel(const int IBRAM_line_offset, const int cur_col,
								const int channel_in){
#pragma HLS inline
#pragma HLS RESOURCE variable = IBRAM core = RAM_S2P_BRAM	
	int IBRAM_col_offset=cur_col*in_ChannelNum;
	int IBRAM_loc=IBRAM_line_offset+IBRAM_col_offset+channel_in;
	float px=IBRAM[IBRAM_loc];
	return px;
}

As in zynqNet, it is bound directly to a dual-port BRAM, with no additional directives and without setting a latency.

1.2  OBRAM

OBRAM: the zynqNet approach

void OutputCache::accumulateChannel(channel_t co, data_t value_to_add) {
#pragma HLS inline
//#pragma HLS pipeline
#pragma HLS FUNCTION_INSTANTIATE variable = co
#pragma HLS ARRAY_PARTITION variable = OBRAM cyclic factor = N_PE
#pragma HLS RESOURCE variable=OBRAM core=RAM_T2P_BRAM latency=2
  data_t old_ch = getChannel(co); /* BRAM[c] */
  data_t new_ch = old_ch + value_to_add;
  setChannel(co, new_ch); /* BRAM[c] = new_ch; */
};

OBRAM is implemented as a dual-port BRAM with the latency set to 2. How should this latency be calculated and chosen, and why 2? (Reading a plain BRAM takes one clock cycle; does an OBRAM read take two?) This remains an open question.

The core is RAM_T2P_BRAM, a true dual-port RAM in which both ports support reads and writes.

The array is also split evenly by N_PE, i.e. cyclic partitioning, which deals the elements out round-robin, like shuffling cards; this is illustrated below.
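To make the cyclic option concrete, here is a small software model (with a hypothetical N_PE = 4, not the project value) of where each output channel ends up: channel co is stored in bank co % N_PE at local address co / N_PE, so consecutive channels land in different physical BRAMs and the N_PE parallel PEs do not collide.

// Software model of cyclic partitioning (illustration only, hypothetical N_PE = 4):
// with "#pragma HLS ARRAY_PARTITION variable=OBRAM cyclic factor=N_PE",
// element co goes to bank (co % N_PE) at local address (co / N_PE).
#include <cstdio>

static const int N_PE = 4;

int main() {
	for (int co = 0; co < 8; co++)
		printf("OBRAM[%d] -> bank %d, local address %d\n", co, co % N_PE, co / N_PE);
	return 0;
}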

OBRAM: adding it to the MTCNN code

void OutputCache::accumulateChannel(int co, float value_to_add) {
#pragma HLS inline
#pragma HLS FUNCTION_INSTANTIATE variable = co
#pragma HLS ARRAY_PARTITION variable = OBRAM cyclic factor = N_PE
#pragma HLS RESOURCE variable=OBRAM core=RAM_T2P_BRAM latency=2
  float old_ch = getOutChannel(co); 
  float new_ch = old_ch + value_to_add;
  setOutChannel(co, new_ch); 
};

The RESOURCE directive binds it to a true dual-port BRAM. Because the convolution is organized around a single output pixel, the size of this BRAM is the number of output channels. The latency is set to 2 (this is still in doubt; we need to find material later to derive it). The output BRAM is then split cyclically into N_PE banks.

1.3  WBRAM

WBRAM: the zynqNet implementation

void WeightsCache::getNineWeights(const channel_t co,
                                  const weightaddr_t ci_offset,
                                  data_t weights_buf[9]) {
#pragma HLS FUNCTION_INSTANTIATE variable = co
#pragma HLS inline
#pragma HLS pipeline
// Array Partitioning
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 1    // PE ID
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 2    // block ID
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 4    // weight ID
#pragma HLS RESOURCE variable = WBRAM core = RAM_S2P_BRAM latency = 3
  // Calculate Memory Address
  PEID_t PEID;
  blockID_t blockID;
  rowID_t rowID;
  weightID_t weightID;
  getAddrForSingleWeight(co, ci_offset, PEID, blockID, rowID, weightID);
  data_t *WBRAM_BLOCK = WBRAM[PEID][blockID][rowID];
  // Fetch Weights into Filter Template
  data_t weights_temp[9];
#pragma HLS array_partition variable = weights_temp complete dim = 0
L_getNineWeights:
  for (int i = 0; i < 9; i++) {
    // Fetch all 9 elements in last dimension into registers (weights_temp)
    weights_temp[i] = WBRAM_BLOCK[i];
    // Fill weights_buf with 0.0f for 1x1 kernel / with weights for 3x3 kernel
    weights_buf[i] = (kernel == 1) ? 0.0f : weights_temp[i];
  }
  // Fill single relevant weight into weights_buf for 1x1 kernel
  if (kernel == 1) weights_buf[4] = weights_temp[weightID];
}

WBRAM is a four-dimensional array, and three of its dimensions, namely dimensions 1, 2 and 4, are completely partitioned. The third dimension is the only one left, so its size, set to blockSize at initialization, becomes the depth of each BRAM. A sketch of the layout is given below.

RAM_S2P_BRAM denotes a simple dual-port RAM: one port reads while the other port writes.

When the weights are read into the PE, the local buffer is partitioned with dim = 0, i.e. inside the processing element it is fully split into registers to make the parallel access easy.
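The shape of this cache can be pictured as follows (the identifier and dimension names here are illustrative, not copied from the zynqNet source):

// Approximate layout of the zynqNet weights cache (names are illustrative):
//
//   WBRAM[N_PE]                 // dim 1: PE ID             -> complete partition
//        [NUM_BLOCKS_PER_PE]    // dim 2: block ID           -> complete partition
//        [BLOCK_SIZE]           // dim 3: row inside a block -> kept, becomes BRAM depth ("blockSize")
//        [9];                   // dim 4: 3x3 filter weights -> complete partition
//
// Only dim 3 stays unpartitioned, so its size determines the depth of each physical BRAM.
data_t WBRAM[N_PE][NUM_BLOCKS_PER_PE][BLOCK_SIZE][9];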

WBRAM: adding it to the MTCNN code

//get weight from IBRAM to buffer
void WeightsCache::get_9_weights_to_buffer(int cur_ci, int cur_co,float weight_buffer[9]){
// Array Partitioning
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 1    // PE ID
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 3    // weight ID
#pragma HLS RESOURCE variable = WBRAM core = RAM_S2P_BRAM latency = 3
	int PEID,filterID;
	WeightsCache::get_WBRAM_addr(cur_ci,cur_co,PEID,filterID);
	for(int i=0;i<9;i++){
		weight_buffer[i]=WBRAM[PEID][filterID][i];
	}
}

To avoid zynqNet's more complicated layout we simplified it to a relatively simple scheme: a three-dimensional array replaces the previous four-dimensional one, still implemented on a simple dual-port BRAM. The first and third dimensions, i.e. the N_PE index and the 9 weights at the current position, are completely partitioned; a sketch follows.
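The simplified layout can be sketched like this (MAX_FILTERS_PER_PE is a placeholder for however many filters one PE stores; the (PEID, filterID) mapping is the one computed by get_WBRAM_addr later in this post):

// Sketch of the simplified 3-D weights cache (sizes are placeholders):
//
//   WBRAM[N_PE]                // dim 1: PE ID       -> complete partition
//        [MAX_FILTERS_PER_PE]  // dim 2: filter ID    -> kept, becomes BRAM depth
//        [9];                  // dim 3: 3x3 weights  -> complete partition
//
// get_WBRAM_addr() maps (cur_ci, cur_co) onto this layout:
//   PEID     = cur_co % N_PE
//   filterID = (cur_co / N_PE) * inChannelNum + cur_ci
float WBRAM[N_PE][MAX_FILTERS_PER_PE][9];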

2. Parallelization and the related array partitioning

2.1 UNROLL inside the MACC

For the 3x3 convolution, UNROLL fully unrolls the loops, making the multiply-accumulate completely parallel.

UNROLL: the zynqNet pattern

void ProcessingElement::macc2d(const data_t pixels[9], const data_t weights[9],data_t& result) {
#pragma HLS inline
  data_t accumulator = 0.0f;
  data_t multresult[9];
#pragma HLS ARRAY_PARTITION variable = multresult complete dim = 0
L_MACC_multiply:
  for (int i = 0; i < 9; i++) {
#pragma HLS UNROLL
    multresult[i] = pixels[i] * weights[i];
  }
L_MACC_accumulate:
  for (int i = 0; i < 9; i++) {
#pragma HLS UNROLL
    accumulator = accumulator + multresult[i];
  }
  result = accumulator;
}

Both the multiplication loop and the accumulation loop are UNROLLed. One question remains, though: each accumulation step depends on the previous one, so how can UNROLL be applied here without PIPELINE? A possible explanation is sketched below.
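A plausible explanation (not verified against the schedule here): once the loop is fully unrolled and multresult sits in registers, HLS can balance the chain of additions into an adder tree, which costs only four adder levels instead of eight chained additions. For float data this reassociation is not applied by default and typically requires enabling unsafe math optimizations (config_compile -unsafe_math_optimizations); otherwise the unrolled additions remain a chain that simply gets scheduled over several cycles. A hand-balanced equivalent would look like:

// Hand-balanced version of the fully unrolled accumulation (sketch only; whether
// HLS actually builds this tree for float should be checked in the schedule viewer).
data_t s01 = multresult[0] + multresult[1];
data_t s23 = multresult[2] + multresult[3];
data_t s45 = multresult[4] + multresult[5];
data_t s67 = multresult[6] + multresult[7];
data_t s0123 = s01 + s23;
data_t s4567 = s45 + s67;
data_t s07   = s0123 + s4567;
data_t accumulator = s07 + multresult[8];   // 4 adder levels instead of an 8-add chain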

UNROLL: adding it to MTCNN

void ProcessingElement::macc2d(const float pixels[9],const float weights[9],
                               float& result) {
#pragma HLS inline
  float accumulator = 0.0f;
  float multresult[9];
#pragma HLS ARRAY_PARTITION variable = multresult complete dim = 0
L_MACC_multiply:
  for (int i = 0; i < 9; i++) {
#pragma HLS UNROLL
    multresult[i] = pixels[i] * weights[i];
  }
L_MACC_accumulate:
  for (int i = 0; i < 9; i++) {
#pragma HLS UNROLL
    accumulator = accumulator + multresult[i];
  }
  result = accumulator;
};

Consistent with zynqNet, both the multiplication loop and the accumulation loop are UNROLLed.

2.2 ARRAY_PARTITION related to UNROLL

The convolution uses 9 pixels, 9 weights, 9 partial products (multresult) and 1 result.

All of these arrays need to be ARRAY_PARTITIONed into registers, i.e. completely partitioned with dim = 0.

zynqNet's ARRAY_PARTITION

//load pixels[9] and loop weight on them
void ProcessingElement::processInputChannel(const coordinate_t y,
                                            const coordinate_t x,
                                            const channel_t ci_in,
                                            const channel_t ch_out) {
#pragma HLS inline off
#pragma HLS FUNCTION_INSTANTIATE variable = ci_in
#pragma HLS dataflow
  channel_t ci = ci_in;
  weightaddr_t ci_offset;
  data_t pixel_buffer[9];
#pragma HLS ARRAY_PARTITION variable = pixel_buffer complete dim = 0
  // Preload Image Pixel Buffer (fetch pixels around (y,x,ci))
  preloadPixelsAndPrecalcCIoffset(y, x, ci, ch_out, ci_offset, pixel_buffer);
  // MACC All Output Channels
  processAllCHout(ch_out, ci, ci_offset, pixel_buffer);
}

//load and loop all channel out weight MACC on pixel[9]
void ProcessingElement::processAllCHout(const channel_t ch_out,
                                        const channel_t ci,
                                        const weightaddr_t ci_offset,
                                        const data_t pixels[9]) {
#pragma HLS INLINE off
L_CH_OUT:
  for (channel_t co = 0; co < ch_out; co++) {
#pragma HLS LOOP_TRIPCOUNT min = 16 max = 1024 avg = 258
#pragma HLS unroll factor = N_PE
#pragma HLS PIPELINE II = 1
    data_t result, weights_local[9];
#pragma HLS ARRAY_PARTITION variable = weights_local complete dim = 0
    // fetch weights
    WeightsCache::getNineWeights(co, ci_offset, weights_local);
    // multiply-accumulate
    macc2d(pixels, weights_local, result);
    // save result to Output Buffer
    if (ci == 0) {
      OutputCache::setChannel(co, result);
    } else {
      OutputCache::accumulateChannel(co, result);
    }
  };
}

multresult[9] has already been partitioned into registers inside the MACC.

weights_local[9] is partitioned into registers in processAllCHout, and pixel_buffer[9] is partitioned into registers in processInputChannel.

One open question: the single pixel[9] buffer is shared by many MACC units, so does that limit the achievable parallelism? (Presumably not: once the buffer is completely partitioned into registers, the same registers can fan out to all of the unrolled MACC copies in the same cycle, but this still needs to be confirmed.)

MTCNN's ARRAY_PARTITION

//load pixels[9] and loop weight on them
void ProcessingElement::processInputChannel(const int cur_row_times_stride,
const int cur_col_times_stride,
const int cur_ci, const int out_channelNum){
#pragma HLS inline off
#pragma HLS FUNCTION_INSTANTIATE variable = cur_ci
#pragma HLS dataflow	
	int cur_channel_in=cur_ci;
	float pixel_buffer[9];
#pragma HLS ARRAY_PARTITION variable = pixel_buffer complete dim = 0
	loadPixel_buffer(cur_row_times_stride, cur_col_times_stride,
	cur_channel_in, pixel_buffer);
	processAll_channelOut(out_channelNum, cur_channel_in,
	pixel_buffer);
};
//load and loop all channel out weight MACC on pixel[9]
void ProcessingElement::processAll_channelOut(const int out_Channel_Num, const int cur_ci,const float pixel_buffer[9]){
L_CH_OUT:
  for(int cur_co=0;cur_co<out_Channel_Num;cur_co++){
#pragma HLS unroll factor = N_PE
#pragma HLS PIPELINE II = 1
	float result,weights_local[9];
#pragma HLS ARRAY_PARTITION variable = weights_local complete dim = 0	
	// fetch weights
    WeightsCache::get_9_weights_to_buffer(cur_ci,cur_co,weights_local);
	//MACC 3*3  multiply accumulate
	ProcessingElement::macc2d(pixel_buffer,weights_local,result);
	//accumulate 3*3 macc result in OBRAM
	if (cur_ci == 0) {
		OutputCache::setOutChannel(cur_co, result);
	} else {
		OutputCache::accumulateChannel(cur_co, result);
	}
  }
};

Consistent with zynqNet, the buffers are completely partitioned into registers.

2.3 Parallelizing the MACC units

How zynqNet parallelizes the MACC units

The previous section parallelized the 3x3 convolution inside a single MACC; now we parallelize across MACC units so that several of them run at the same time.

//load and loop all channel out weight MACC on pixel[9]
void ProcessingElement::processAllCHout(const channel_t ch_out,
                                        const channel_t ci,
                                        const weightaddr_t ci_offset,
                                        const data_t pixels[9]) {
#pragma HLS INLINE off
L_CH_OUT:
  for (channel_t co = 0; co < ch_out; co++) {
#pragma HLS LOOP_TRIPCOUNT min = 16 max = 1024 avg = 258
#pragma HLS unroll factor = N_PE
#pragma HLS PIPELINE II = 1
    data_t result, weights_local[9];
#pragma HLS ARRAY_PARTITION variable = weights_local complete dim = 0
    // fetch weights
    WeightsCache::getNineWeights(co, ci_offset, weights_local);
    // multiply-accumulate
    macc2d(pixels, weights_local, result);
    // save result to Output Buffer
    if (ci == 0) {
      OutputCache::setChannel(co, result);
    } else {
      OutputCache::accumulateChannel(co, result);
    }
  };
}

zynqNet parallelizes the MACCs with unroll factor = N_PE combined with PIPELINE II = 1. II is the initiation interval: the number of cycles between the moment a task accepts one input and the moment it can accept the next.

processAllCHout needs INLINE off, which makes the function a concrete entity: it is only called by other functions and never merged into them. Since the loop inside it is already unrolled into many copies, the function itself should not be copied again. Both this function and the level above it, processInputChannel, use INLINE off, so both exist as independent entities called by other functions.

The UNROLL directive is then applied with factor = N_PE, so N_PE output channels are processed in parallel, i.e. partial unrolling. The parts that are not unrolled, and the hand-off from one parallel group to the next, are handled by PIPELINE.

II is set to 1 so that the parallel groups are also pipelined. Inside this function the stages are: fetch the weights, run the MACC, and accumulate into OBRAM. getNineWeights is INLINEd directly into this function. The combined effect of the two pragmas is sketched below.
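The combined effect can be pictured with a generic loop (process_one_output_channel is a hypothetical stand-in for "fetch weights, macc2d, accumulate"): unroll factor = N_PE creates N_PE copies of the body, and PIPELINE II = 1 then tries to start a new group of N_PE output channels every clock cycle.

// Generic sketch of partial unroll + pipeline (body is a placeholder):
L_CH_OUT:
for (int co = 0; co < ch_out; co++) {
#pragma HLS unroll factor = N_PE
#pragma HLS PIPELINE II = 1
	// N_PE copies of this body run side by side; a new group of N_PE
	// channels is issued every cycle if the resources allow it.
	process_one_output_channel(co);   // hypothetical helper
}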

The MTCNN directives for parallelizing the MACC units

//load and loop all channel out weight MACC on pixel[9]
void ProcessingElement::processAll_channelOut(const int out_Channel_Num, const int cur_ci,
const float pixel_buffer[9]){
#pragma HLS INLINE off
L_CH_OUT:
  for(int cur_co=0;cur_co<out_Channel_Num;cur_co++){
#pragma HLS unroll factor = N_PE
#pragma HLS PIPELINE II = 1

Following the zynqNet pattern directly, the function is set to INLINE off, and unroll factor = N_PE and PIPELINE II = 1 are added.

3. INLINE and instantiation

The INLINE directive means that, at the hardware level, the corresponding structure is embedded directly into the structure one level up rather than standing alone and being called by it. INLINE is used constantly for address generation and address arithmetic, so that the address computation is folded into the surrounding logic; a minimal illustration follows.
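A minimal illustration (hypothetical helper names, not project code): with INLINE, no separate RTL module is generated for the address helper; its multiply and add are scheduled together with the caller's logic.

// Sketch: an inlined address helper disappears into the caller's datapath.
int calc_addr(int row, int col, int width) {
#pragma HLS INLINE
	return row * width + col;
}

void store_pixel(float *buf, int row, int col, int width, float v) {
	buf[calc_addr(row, col, width)] = v;   // the address logic is absorbed here
}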

3.1 Instantiation around OBRAM

Addressing OBRAM is simple: there are only out_channel_Num entries. The three related operations, reading, writing and accumulating, each just need an INLINE directive.

In addition, for different values of co the functions are specialized into concrete instances via FUNCTION_INSTANTIATE (a generic sketch of this directive follows the code below).

//------------------------OutputCache---------------------------
void OutputCache::accumulateChannel(int co, float value_to_add) {
#pragma HLS inline
#pragma HLS FUNCTION_INSTANTIATE variable = co
#pragma HLS ARRAY_PARTITION variable = OBRAM cyclic factor = N_PE
#pragma HLS RESOURCE variable=OBRAM core=RAM_T2P_BRAM latency=2
  float old_ch = getOutChannel(co); 
  float new_ch = old_ch + value_to_add;
  setOutChannel(co, new_ch); 
};

float OutputCache::getOutChannel(int co) {
#pragma HLS inline
  return OBRAM[co];
}

void OutputCache::setOutChannel(int co, float data) {
#pragma HLS inline
#pragma HLS FUNCTION_INSTANTIATE variable = co
  OBRAM[co] = data;
};
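To picture what FUNCTION_INSTANTIATE does in the functions above, here is a generic sketch (hypothetical example, not project code): when the marked argument has a fixed value at each call site, HLS generates one specialized hardware instance per value and optimizes each instance separately, instead of building a single general-purpose unit.

// Sketch of FUNCTION_INSTANTIATE (hypothetical example):
int scale(int x, int sel) {
#pragma HLS INLINE off
#pragma HLS FUNCTION_INSTANTIATE variable=sel
	// One specialized instance is generated per distinct value of sel,
	// so each instance reduces to a fixed shift (or a pass-through).
	if (sel == 0) return x;
	else          return x << sel;
}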

3.2 Instantiation for writing OBRAM back to DRAM

The DRAM address follows a fixed mapping: DRAM[channel_offset + row_offset + col_offset], i.e. DRAM[cur_channel * pixels_per_channel + cur_row * colNum + cur_col].

To be able to verify the correctness of the program, we have not changed the PReLU to a ReLU here; we will make that change later.

//write from OBRAM to DRAM
void MemoryController::writeBackOutputChannel(float * output_ptr, int cur_co,
                                               float data) {
#pragma HLS inline
#pragma HLS pipeline
  int channel_offset=cur_co*out_channelPixels;
#pragma HLS RESOURCE variable = channel_offset core = MulnS latency = 2  
  output_ptr[channel_offset+peixl_out_DRAM_offset] = data;
};
//set output piexl offset on DRAM
void MemoryController::setPixelOutOffset(int cur_out_row,int cur_out_col){
#pragma HLS inline
#pragma HLS pipeline
  int row_offset=cur_out_row*out_width;
#pragma HLS RESOURCE variable = row_offset core = MulnS latency = 2  
  peixl_out_DRAM_offset=row_offset+cur_out_col;
};

MulnS: the address computation involves a multiplication, which is bound to the MulnS core, essentially an n-stage pipelined multiplier, with its latency set to 2. The right latency value is still uncertain and needs to be confirmed by later HLS experiments.

INLINE: both functions are INLINEd, since they need to be embedded into their callers.

PIPELINE: both are PIPELINEd, so that address computation overlaps with the read, and address computation overlaps with the offset addition, maximizing utilization.

3.3 Loading weights from DRAM into WBRAM

Whenever data is loaded from DRAM into a BRAM it goes through the same step: first from DRAM into a register (reg), and then from the register into the BRAM:

// Register stage for manual pipelining: breaks the DRAM read and the
// following BRAM write into separate pipeline stages.
template <class T>  T reg(T x) {
#pragma HLS pipeline
#pragma HLS inline self off
#pragma HLS interface ap_ctrl_none register port=return
	return x;
}

So the path is DRAM to reg, then reg into the BRAM, using the ap_ctrl_none interface mode.

The weights are loaded from DRAM into WBRAM once, in a single pass.

//load from DRAM weight to reg
float MemoryController::load_weight_2_reg(float * weight_DRAM_ptr, int weight_loc){
#pragma HLS inline
#pragma HLS pipeline
  float read = reg(weight_DRAM_ptr[weight_loc]);
  return read;
}

//load weights from DRAM to BRAM
void WeightsCache::load_WBRAM_from_DRAM(float * weight_ptr){
#pragma HLS inline
	int PEID,filterID;
	float *WBRAM_ptr; float *weight_DRAM_ptr;
	for(int cur_co=0;cur_co<outChannelNum;cur_co++){
	 int offset_inchannel=cur_co*inChannelNum;
#pragma HLS RESOURCE variable = offset_inchannel core = MulnS latency = 2	 
	 for(int cur_ci=0;cur_ci<inChannelNum;cur_ci++){
		get_WBRAM_addr(cur_ci,cur_co,PEID,filterID);
		WBRAM_ptr=WBRAM[PEID][filterID];
		int weight_DRAM_loc=9*(offset_inchannel+cur_ci);
#pragma HLS RESOURCE variable = weight_DRAM_loc core = MulnS latency = 2  
		weight_DRAM_ptr=weight_ptr+weight_DRAM_loc;
		for(int i=0;i<9;i++){
#pragma HLS PIPELINE II = 2
			float weight_in_reg=MemoryController::load_weight_2_reg(weight_DRAM_ptr,i);
			WBRAM_ptr[i]=weight_in_reg;
		}
	 }
	}
}

MulnS: the multiplications are bound to a pipelined multiplier core to speed up the address arithmetic.

INLINE: the function carries an INLINE directive and is embedded into the main program.

PIPELINE: the innermost loop is PIPELINEd, keeping the DRAM reads and the BRAM writes flowing back to back.

3.4 Fetching weights from WBRAM into the PEs

The weight-address computation is INLINEd, so that it runs directly inside the weight-fetch logic instead of being a separate callable unit.

void WeightsCache::get_WBRAM_addr(const int cur_ci, const int cur_co,int &PEID, int &filterID){
#pragma HLS INLINE
	PEID=cur_co%N_PE;
	filterID=(cur_co/N_PE)*inChannelNum+cur_ci;
}

The weight-fetch function then gets its optimization directives:

//get 9 weights from IBRAM to buffer
void WeightsCache::get_9_weights_to_buffer(int cur_ci, int cur_co,float weight_buffer[9]){
#pragma HLS FUNCTION_INSTANTIATE variable = cur_co
#pragma HLS inline
#pragma HLS pipeline
// Array Partitioning
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 1    // PE ID
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 3    // weight ID
#pragma HLS RESOURCE variable = WBRAM core = RAM_S2P_BRAM latency = 3
	int PEID,filterID;
	WeightsCache::get_WBRAM_addr(cur_ci,cur_co,PEID,filterID);
	for(int i=0;i<9;i++){
		weight_buffer[i]=WBRAM[PEID][filterID][i];
	}
}

INLINE: this function is embedded into processAll_channelOut, so every unrolled copy inside processAll_channelOut contains its own weight-fetch logic. Each output channel belongs to a different PE, and WBRAM has already been completely partitioned per PE onto separate BRAMs, so the fetches can run in parallel.

PIPELINE: pipelining the enclosing function is equivalent to fully UNROLLing the loops below it (see the sketch below). The unrolling is legal here because WBRAM has been completely partitioned over the 9 weights of each filter, so they sit on different BRAMs, and weight_buffer was already completely partitioned into registers, so the writes can also proceed in parallel. The weight fetch can therefore be carried out in parallel.

FUNCTION_INSTANTIATE: the function is instantiated as specialized copies for the different output channels.
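The "pipeline implies unroll" behavior can be shown with a minimal sketch (placeholder names): putting PIPELINE on the enclosing function forces the fixed-bound loop inside it to be fully unrolled, and the nine resulting reads only proceed in parallel if the source array has been partitioned onto separate BRAMs or registers.

// Sketch (placeholder names): PIPELINE on the function unrolls the inner loop.
void copy9(const float table[9], float buf[9]) {
#pragma HLS PIPELINE
	for (int i = 0; i < 9; i++) {   // automatically unrolled by PIPELINE
		buf[i] = table[i];
	}
}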

3.5 Loading the image from DRAM into IBRAM

Address computation: all address calculations are INLINEd into the operations that use them, to keep them fast, and every multiplication is bound to the MulnS pipelined multiplier core.

//set output piexl offset on DRAM
void MemoryController::setPixelOutOffset(int cur_out_row,int cur_out_col){
#pragma HLS inline
  int row_offset=cur_out_row*out_width;
#pragma HLS RESOURCE variable = row_offset core = MulnS latency = 2  
  peixl_out_DRAM_offset=row_offset+cur_out_col;
};
//load input piexl from DRAM to IBRAM
void MemoryController::setPixelLoadRowOffset(){
#pragma HLS inline
	int cur_inPixel_row_loc=cur_loadPixel_row*stride;
#pragma HLS RESOURCE variable = cur_inPixel_row_loc core = MulnS latency = 2  	
	pixel_loadRow_DRAM_offset=cur_inPixel_row_loc*in_width;
#pragma HLS RESOURCE variable = pixel_loadRow_DRAM_offset core = MulnS latency = 2  
	cur_loadPiexel_col=0;
	cur_loadPixel_row++;
};
void MemoryController::setPixelLoadOffset(){
#pragma HLS inline	
	int cur_inPixel_col_loc=stride*cur_loadPiexel_col;
#pragma HLS RESOURCE variable = cur_inPixel_col_loc core = MulnS latency = 2  	
	load_pixel_offset=pixel_loadRow_DRAM_offset+cur_inPixel_col_loc;
	cur_loadPiexel_col++;
};

The pixel value is then fetched from DRAM at the computed address.

First it is loaded from DRAM into the register stage reg:

//load from DRAM pixel to reg
float MemoryController::loadInputChannelPixel(float * input_ptr,int ci){
#pragma HLS inline	
#pragma HLS pipeline
	int in_channel_pixel_offset=ci*in_channel_pixels;
#pragma HLS RESOURCE variable = in_channel_pixel_offset core = MulnS latency = 2  	
	float px=reg(input_ptr[load_pixel_offset+in_channel_pixel_offset]);
	return px;
};

INLINE embeds the function into its caller, and PIPELINE lets the reads proceed in a pipelined fashion.

Next come the functions that load an entire row of image pixels:

//load whole row from DRAM to IBRAM (in hardware IBRAM order is row/col/channel_In)
void ImageCache::loadRowDRAM_2_IBRAM(float * input_ptr){
#pragma HLS inline
	L_DRAM_PRELOADROW_X: for (int cur_col = 0; cur_col < in_width; cur_col++) {
		MemoryController::setPixelLoadOffset();
		loadPixelDRAM_2_IBRAM(input_ptr);
	}
};
void ImageCache::loadPixelDRAM_2_IBRAM(float * input_ptr){
#pragma HLS inline
	L_PRELOAD_PIXEL_FROM_DRAM: for (int ci = 0; ci < in_ChannelNum; ci++) {
#pragma HLS pipeline
//#pragma HLS latency min=4
		float px = MemoryController::loadInputChannelPixel(input_ptr,ci);
		writeNextChannelPixel_2_IBRAM(px);
	}
};
void ImageCache::writeNextChannelPixel_2_IBRAM(float pixel){
	// Write Value into IBRAM
	IBRAM[cur_IBRAM_addr] = pixel;
	// Check and Wrap Write Address into IBRAM
	if (cur_IBRAM_addr == MAX_IBRAM_ADDR)
		cur_IBRAM_addr = 0;
	else
		cur_IBRAM_addr++;	
};

The innermost for loop is PIPELINEd.

3.6 Reading IBRAM out to the PEs

//get IBRAM pixel into buffer
void ProcessingElement::loadPixel_buffer(const int up_row,const int left_col,
							const int cur_In_channel, float pixel_buffer[9]){
#pragma HLS inline
#pragma HLS pipeline
  load_pixel_2_PE_row_loop:							
  for (int cur_filterRow=0;cur_filterRow<3;cur_filterRow++){
	int pixel_row_to_load=up_row+cur_filterRow;
	int IBRAM_line_offset=ImageCache::calcu_IBRAM_row_offset(pixel_row_to_load);
	load_pixel_2_PE_col_loop:
	for (int cur_filterCol=0;cur_filterCol<3;cur_filterCol++){
		int pixel_col_to_load=left_col+cur_filterCol;
		float px=reg(ImageCache::get_IBRAM_Pixel(IBRAM_line_offset,pixel_col_to_load,
								cur_In_channel));
		pixel_buffer[3*cur_filterRow+cur_filterCol]=px;
	}
  }
};

Loading from IBRAM into the buffer uses INLINE, and the PIPELINE directive automatically UNROLLs the inner loops. One point we do not fully understand yet: IBRAM is a dual-port BRAM, so how can the nine reads be carried out in parallel? (A plausible answer is that they cannot all be issued in one cycle; with a simple dual-port BRAM, HLS has to spread the nine reads over several cycles or increase the II unless IBRAM is partitioned further, which should be checked in the synthesis report.)

//load piexl from IBRAM out to PE
int ImageCache::calcu_IBRAM_row_offset(int cur_row){
#pragma HLS inline
	int IBRAM_line=cur_row%NUM_IMG_CACHE_LINES;
	int pixels_each_line=in_width*in_ChannelNum;
#pragma HLS RESOURCE variable=pixels_each_line core=MulnS latency=2	
	int IBRAM_line_offset=IBRAM_line*pixels_each_line;
#pragma HLS RESOURCE variable=IBRAM_line_offset core=MulnS latency=2	
	return IBRAM_line_offset;
};
float ImageCache::get_IBRAM_Pixel(const int IBRAM_line_offset, const int cur_col,
								const int channel_in){
#pragma HLS inline
#pragma HLS RESOURCE variable = IBRAM core = RAM_S2P_BRAM	
	int IBRAM_col_offset=cur_col*in_ChannelNum;
	int IBRAM_loc=IBRAM_line_offset+IBRAM_col_offset+channel_in;
	float px=IBRAM[IBRAM_loc];
	return px;
}