Background: We have implemented the convolution IP core and verified it in C simulation. The next step is to add the HLS pragma directives and turn it into a hardware structure.
Purpose: add the HLS pragma directives.
Related articles (how HLS pragmas are optimized into hardware):
FPGA Basics (10): HLS operations on loops https://blog.csdn.net/weixin_36474809/article/details/81479551
FPGA Basics (11): HLS operations on arrays https://blog.csdn.net/weixin_36474809/article/details/81483993
FPGA Basics (12): HLS optimizations for higher throughput https://blog.csdn.net/weixin_36474809/article/details/81665911
Most important:
Vivado HLS hardware pragmas (4): pragma optimizations for convolution https://blog.csdn.net/weixin_36474809/article/details/84587535
UG902 documentation, v2016.4
1. Implementing the arrays as BRAMs
There are only three kinds of BRAM. Following zynqNet's pattern, we attach the corresponding optimization pragmas to each of them.
Using the RESOURCE pragma together with the ARRAY_PARTITION pragma, the weights, the image, and the accumulated outputs are mapped onto concrete BRAMs.
1.1 IBRAM
IBRAM: the zynqNet implementation
//get pixel out from BRAM
data_t ImageCache::getPixel(const coordinate_t y, const imgcacheaddr_t y_offset,
const coordinate_t x, const channel_t ci) {
#pragma HLS inline
#pragma HLS RESOURCE variable = IBRAM core = RAM_S2P_BRAM
imgcacheaddr_t addr_pixel_offset = x * ch_in;//col_offset
imgcacheaddr_t addr = y_offset + addr_pixel_offset + ci;//row_offset+col_offset+channel_offset
bool is_padding_pixel = x < 0 | x >= width_in | y < 0 | y >= height_in;
data_t px = is_padding_pixel ? 0.0f : IBRAM[addr];
return px;
}
The IBRAM is bound directly to a dual-port BRAM.
#pragma HLS RESOURCE variable = IBRAM core = RAM_S2P_BRAM means one port reads while the other port writes.
The other BRAM arrays are likewise bound to concrete BRAMs inside the functions that read them out toward the PEs.
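As a sanity check, the flattened addressing above (row offset + column offset + channel offset) can be reproduced in plain C++; ch_in and width_in below are made-up example sizes, not the real layer dimensions:

```cpp
#include <cassert>

// Hypothetical example sizes, not the real network dimensions.
const int ch_in = 4, width_in = 8;

// Same flattening as getPixel: rows are laid out line by line,
// each line holds width_in * ch_in values, pixels are channel-interleaved.
int ibram_addr(int y, int x, int ci) {
  int y_offset = y * width_in * ch_in;  // row offset
  int addr_pixel_offset = x * ch_in;    // column offset
  return y_offset + addr_pixel_offset + ci;
}
```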
IBRAM: pragmas added to the MTCNN code
float ImageCache::get_IBRAM_Pixel(const int IBRAM_line_offset, const int cur_col,
const int channel_in){
#pragma HLS inline
#pragma HLS RESOURCE variable = IBRAM core = RAM_S2P_BRAM
int IBRAM_col_offset=cur_col*in_ChannelNum;
int IBRAM_loc=IBRAM_line_offset+IBRAM_col_offset+channel_in;
float px=IBRAM[IBRAM_loc];
return px;
}
As in zynqNet, it is bound directly to a dual-port BRAM; no further pragmas are added and no latency is set.
1.2 OBRAM
OBRAM: the zynqNet implementation
void OutputCache::accumulateChannel(channel_t co, data_t value_to_add) {
#pragma HLS inline
//#pragma HLS pipeline
#pragma HLS FUNCTION_INSTANTIATE variable = co
#pragma HLS ARRAY_PARTITION variable = OBRAM cyclic factor = N_PE
#pragma HLS RESOURCE variable=OBRAM core=RAM_T2P_BRAM latency=2
data_t old_ch = getChannel(co); /* BRAM[c] */
data_t new_ch = old_ch + value_to_add;
setChannel(co, new_ch); /* BRAM[c] = new_ch; */
};
The OBRAM is implemented as a true dual-port BRAM with the latency set to 2. How is this latency computed and chosen? Why 2 (one clock cycle for a plain BRAM read, but two for an OBRAM read?)
The core RAM_T2P_BRAM means both ports support both read and write.
The array is also split by N_PE, i.e. cyclic partitioning, dealt out evenly like shuffling cards.
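What cyclic partitioning does can be sketched on the host side: element co of OBRAM lands in bank co % N_PE at depth co / N_PE, so consecutive output channels sit in different physical BRAMs and can be accessed in parallel (N_PE here is an arbitrary example value):

```cpp
#include <cassert>

const int N_PE = 4;  // example factor, stands in for the pragma's factor = N_PE

// With ARRAY_PARTITION cyclic factor = N_PE, element co is dealt
// round-robin into bank (co % N_PE) at depth (co / N_PE).
int bank_of(int co)  { return co % N_PE; }
int depth_of(int co) { return co / N_PE; }
```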
OBRAM: pragmas added to the MTCNN code
void OutputCache::accumulateChannel(int co, float value_to_add) {
#pragma HLS inline
#pragma HLS FUNCTION_INSTANTIATE variable = co
#pragma HLS ARRAY_PARTITION variable = OBRAM cyclic factor = N_PE
#pragma HLS RESOURCE variable=OBRAM core=RAM_T2P_BRAM latency=2
float old_ch = getOutChannel(co);
float new_ch = old_ch + value_to_add;
setOutChannel(co, new_ch);
};
The RESOURCE pragma binds it to a true dual-port BRAM. Because the convolution works around a single pixel, the BRAM depth equals the number of output channels. The latency is set to 2 (still uncertain; we need to find material and derive this later). The output BRAM is then split cyclically into N_PE partitions.
1.3 WBRAM
WBRAM: the zynqNet implementation
void WeightsCache::getNineWeights(const channel_t co,
const weightaddr_t ci_offset,
data_t weights_buf[9]) {
#pragma HLS FUNCTION_INSTANTIATE variable = co
#pragma HLS inline
#pragma HLS pipeline
// Array Partitioning
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 1 // PE ID
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 2 // block ID
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 4 // weight ID
#pragma HLS RESOURCE variable = WBRAM core = RAM_S2P_BRAM latency = 3
// Calculate Memory Address
PEID_t PEID;
blockID_t blockID;
rowID_t rowID;
weightID_t weightID;
getAddrForSingleWeight(co, ci_offset, PEID, blockID, rowID, weightID);
data_t *WBRAM_BLOCK = WBRAM[PEID][blockID][rowID];
// Fetch Weights into Filter Template
data_t weights_temp[9];
#pragma HLS array_partition variable = weights_temp complete dim = 0
L_getNineWeights:
for (int i = 0; i < 9; i++) {
// Fetch all 9 elements in last dimension into registers (weights_temp)
weights_temp[i] = WBRAM_BLOCK[i];
// Fill weights_buf with 0.0f for 1x1 kernel / with weights for 3x3 kernel
weights_buf[i] = (kernel == 1) ? 0.0f : weights_temp[i];
}
// Fill single relevant weight into weights_buf for 1x1 kernel
if (kernel == 1) weights_buf[4] = weights_temp[weightID];
}
WBRAM is a four-dimensional array whose dimensions 1, 2 and 4 are completely partitioned; the size of the remaining third dimension is set to blockSize at initialization, i.e. the depth of each BRAM.
RAM_S2P_BRAM means one port reads while the other port writes.
weights_temp is partitioned with dim = 0, fully flattening it into registers inside the processing element so the nine weights can be accessed in parallel.
WBRAM: pragmas added to the MTCNN code
//get weights from WBRAM to buffer
void WeightsCache::get_9_weights_to_buffer(int cur_ci, int cur_co,float weight_buffer[9]){
// Array Partitioning
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 1 // PE ID
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 3 // weight ID
#pragma HLS RESOURCE variable = WBRAM core = RAM_S2P_BRAM latency = 3
int PEID,filterID;
WeightsCache::get_WBRAM_addr(cur_ci,cur_co,PEID,filterID);
for(int i=0;i<9;i++){
weight_buffer[i]=WBRAM[PEID][filterID][i];
}
}
To avoid zynqNet's rather involved scheme, we simplified it: a three-dimensional array replaces the four-dimensional one, still implemented as a dual-port BRAM. The first and third dimensions, i.e. N_PE and the 9 weights at the current position, are completely partitioned.
2. Parallelization and the associated array partitioning
2.1 UNROLL inside the MACC
For the 3*3 convolution, UNROLL fully unrolls the loops for complete parallelism.
UNROLL: the zynqNet version
void ProcessingElement::macc2d(const data_t pixels[9], const data_t weights[9],data_t& result) {
#pragma HLS inline
data_t accumulator = 0.0f;
data_t multresult[9];
#pragma HLS ARRAY_PARTITION variable = multresult complete dim = 0
L_MACC_multiply:
for (int i = 0; i < 9; i++) {
#pragma HLS UNROLL
multresult[i] = pixels[i] * weights[i];
}
L_MACC_accumulate:
for (int i = 0; i < 9; i++) {
#pragma HLS UNROLL
accumulator = accumulator + multresult[i];
}
result = accumulator;
}
Both the multiply loop and the accumulate loop are UNROLLed. One question remains: each accumulation depends on the previous one, so how can UNROLL be applied here rather than PIPELINE?
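Our reading (not confirmed against the zynqNet report): once the loop is fully unrolled the accumulation is just a nine-term sum, which HLS can balance into an adder tree of depth 4 instead of a dependency chain of depth 9; for floating point this relies on expression balancing, since FP addition is not strictly associative. The two forms, sketched in plain C++:

```cpp
#include <cassert>

// Chain form: 9 dependent additions, as the unrolled loop is written.
float sum_chain(const float m[9]) {
  float acc = 0.0f;
  for (int i = 0; i < 9; i++) acc += m[i];
  return acc;
}

// Balanced tree form: pairwise additions, depth 4 instead of 9.
float sum_tree(const float m[9]) {
  float a = (m[0] + m[1]) + (m[2] + m[3]);
  float b = (m[4] + m[5]) + (m[6] + m[7]);
  return (a + b) + m[8];
}

// Small exactly-representable test values.
const float test_vals[9] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
```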
UNROLL: added to the MTCNN code
void ProcessingElement::macc2d(const float pixels[9],const float weights[9],
float& result) {
#pragma HLS inline
float accumulator = 0.0f;
float multresult[9];
#pragma HLS ARRAY_PARTITION variable = multresult complete dim = 0
L_MACC_multiply:
for (int i = 0; i < 9; i++) {
#pragma HLS UNROLL
multresult[i] = pixels[i] * weights[i];
}
L_MACC_accumulate:
for (int i = 0; i < 9; i++) {
#pragma HLS UNROLL
accumulator = accumulator + multresult[i];
}
result = accumulator;
};
Same as zynqNet: both the multiply part and the accumulate part are UNROLLed.
2.2 ARRAY_PARTITION related to UNROLL
The convolution involves 9 pixels, 9 weights, 9 multiplication results, and 1 result.
All of these must be ARRAY_PARTITIONed into register form, which means adding the parameter dim = 0.
zynqNet's ARRAY_PARTITION
//load pixels[9] and loop weight on them
void ProcessingElement::processInputChannel(const coordinate_t y,
const coordinate_t x,
const channel_t ci_in,
const channel_t ch_out) {
#pragma HLS inline off
#pragma HLS FUNCTION_INSTANTIATE variable = ci_in
#pragma HLS dataflow
channel_t ci = ci_in;
weightaddr_t ci_offset;
data_t pixel_buffer[9];
#pragma HLS ARRAY_PARTITION variable = pixel_buffer complete dim = 0
// Preload Image Pixel Buffer (fetch pixels around (y,x,ci))
preloadPixelsAndPrecalcCIoffset(y, x, ci, ch_out, ci_offset, pixel_buffer);
// MACC All Output Channels
processAllCHout(ch_out, ci, ci_offset, pixel_buffer);
}
//load and loop all channel out weight MACC on pixel[9]
void ProcessingElement::processAllCHout(const channel_t ch_out,
const channel_t ci,
const weightaddr_t ci_offset,
const data_t pixels[9]) {
#pragma HLS INLINE off
L_CH_OUT:
for (channel_t co = 0; co < ch_out; co++) {
#pragma HLS LOOP_TRIPCOUNT min = 16 max = 1024 avg = 258
#pragma HLS unroll factor = N_PE
#pragma HLS PIPELINE II = 1
data_t result, weights_local[9];
#pragma HLS ARRAY_PARTITION variable = weights_local complete dim = 0
// fetch weights
WeightsCache::getNineWeights(co, ci_offset, weights_local);
// multiply-accumulate
macc2d(pixels, weights_local, result);
// save result to Output Buffer
if (ci == 0) {
OutputCache::setChannel(co, result);
} else {
OutputCache::accumulateChannel(co, result);
}
};
}
multresult[9] is already mapped to registers inside the MACC.
weights_local[9] is partitioned into registers in processAllCHout; pixels[9] is partitioned into registers in processInputChannel.
An open question: the single pixels[9] buffer is shared by many MACC units, so does this limit the achievable parallelism?
MTCNN's ARRAY_PARTITION
//load pixels[9] and loop weight on them
void ProcessingElement::processInputChannel(const int cur_row_times_stride,
const int cur_col_times_stride,
const int cur_ci, const int out_channelNum){
#pragma HLS inline off
#pragma HLS FUNCTION_INSTANTIATE variable = cur_ci
#pragma HLS dataflow
int cur_channel_in=cur_ci;
float pixel_buffer[9];
#pragma HLS ARRAY_PARTITION variable = pixel_buffer complete dim = 0
loadPixel_buffer(cur_row_times_stride, cur_col_times_stride,
cur_channel_in, pixel_buffer);
processAll_channelOut(out_channelNum, cur_channel_in,
pixel_buffer);
};
//load and loop all channel out weight MACC on pixel[9]
void ProcessingElement::processAll_channelOut(const int out_Channel_Num, const int cur_ci,const float pixel_buffer[9]){
L_CH_OUT:
for(int cur_co=0;cur_co<out_Channel_Num;cur_co++){
#pragma HLS unroll factor = N_PE
#pragma HLS PIPELINE II = 1
float result,weights_local[9];
#pragma HLS ARRAY_PARTITION variable = weights_local complete dim = 0
// fetch weights
WeightsCache::get_9_weights_to_buffer(cur_ci,cur_co,weights_local);
//MACC 3*3 multiply accumulate
ProcessingElement::macc2d(pixel_buffer,weights_local,result);
//accumulate 3*3 macc result in OBRAM
if (cur_ci == 0) {
OutputCache::setOutChannel(cur_co, result);
} else {
OutputCache::accumulateChannel(cur_co, result);
}
}
};
Consistent with zynqNet, these arrays are completely partitioned into registers.
2.3 Parallelizing the MACC units
zynqNet's way of parallelizing the MACCs
The above parallelized the 3*3 convolution inside a single MACC; now we parallelize across MACC units so that several of them run at the same time.
//load and loop all channel out weight MACC on pixel[9]
void ProcessingElement::processAllCHout(const channel_t ch_out,
const channel_t ci,
const weightaddr_t ci_offset,
const data_t pixels[9]) {
#pragma HLS INLINE off
L_CH_OUT:
for (channel_t co = 0; co < ch_out; co++) {
#pragma HLS LOOP_TRIPCOUNT min = 16 max = 1024 avg = 258
#pragma HLS unroll factor = N_PE
#pragma HLS PIPELINE II = 1
data_t result, weights_local[9];
#pragma HLS ARRAY_PARTITION variable = weights_local complete dim = 0
// fetch weights
WeightsCache::getNineWeights(co, ci_offset, weights_local);
// multiply-accumulate
macc2d(pixels, weights_local, result);
// save result to Output Buffer
if (ci == 0) {
OutputCache::setChannel(co, result);
} else {
OutputCache::accumulateChannel(co, result);
}
};
}
MACC parallelization in zynqNet uses unroll factor = N_PE together with PIPELINE II = 1. II is the initiation interval: the number of cycles between a task accepting one piece of data and being able to accept the next.
processAllCHout needs INLINE off, marking this function as a concrete entity that is only called by other functions and never merged into them: its internal loop is already unrolled into many copies, so the function itself must not be copied again. Both this function and the level above it, processInputChannel, carry INLINE off, so both act as independent entities invoked by their callers.
The unroll pragma with factor = N_PE then processes N_PE channels in parallel, i.e. partial unrolling. The parts that are not parallelized, and the boundaries between parallel iterations, are covered by PIPELINE.
II is set to 1 so the parallel units are themselves pipelined. The loop body consists of fetching the weights, the MACC, and the OBRAM accumulation; getNineWeights is INLINEd directly into this function.
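The benefit of II = 1 follows from the usual pipelined-loop estimate, total cycles ≈ iteration latency + (trip count − 1) × II; the numbers below are illustrative only, not measured on this design:

```cpp
#include <cassert>

// Classic pipelined-loop cycle estimate (cf. UG902):
// total = depth of one iteration + (trips - 1) * initiation interval.
int pipeline_cycles(int iter_latency, int trips, int ii) {
  return iter_latency + (trips - 1) * ii;
}
```

With an assumed iteration depth of 5 and 64 trips, II = 1 finishes in 68 cycles, while the unpipelined case (II equal to the iteration depth) needs 320.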
MTCNN pragmas for parallelizing the MACCs
//load and loop all channel out weight MACC on pixel[9]
void ProcessingElement::processAll_channelOut(const int out_Channel_Num, const int cur_ci,
const float pixel_buffer[9]){
#pragma HLS INLINE off
L_CH_OUT:
for(int cur_co=0;cur_co<out_Channel_Num;cur_co++){
#pragma HLS unroll factor = N_PE
#pragma HLS PIPELINE II = 1
// ... loop body unchanged (see section 2.2)
Directly following zynqNet's pattern: INLINE off on the function, then unroll factor = N_PE and PIPELINE II = 1 on the loop.
3. INLINE and instantiation
The INLINE pragma merges the corresponding structure directly into its parent structure in hardware, instead of keeping it as a separate entity that the parent calls. Address generation and address arithmetic are typically INLINEd, embedding the address computation into the surrounding logic.
3.1 Instantiation around the OBRAM
Addresses into the OBRAM are simple: there are only out_channel_Num entries. The three related functions (read, write, accumulate) each just need an INLINE pragma.
In addition, FUNCTION_INSTANTIATE specializes the function into a concrete instance for each value of co.
//------------------------OutputCache---------------------------
void OutputCache::accumulateChannel(int co, float value_to_add) {
#pragma HLS inline
#pragma HLS FUNCTION_INSTANTIATE variable = co
#pragma HLS ARRAY_PARTITION variable = OBRAM cyclic factor = N_PE
#pragma HLS RESOURCE variable=OBRAM core=RAM_T2P_BRAM latency=2
float old_ch = getOutChannel(co);
float new_ch = old_ch + value_to_add;
setOutChannel(co, new_ch);
};
float OutputCache::getOutChannel(int co) {
#pragma HLS inline
return OBRAM[co];
}
void OutputCache::setOutChannel(int co, float data) {
#pragma HLS inline
#pragma HLS FUNCTION_INSTANTIATE variable = co
OBRAM[co] = data;
};
3.2 Instantiation from OBRAM to DRAM
The DRAM address is a straightforward mapping: DRAM[channel_offset + row_offset + col_offset], i.e. DRAM[cur_channel*pixels_per_channel + cur_row*colNum + cur_col].
To verify program correctness we have not yet changed the prelu to a relu here; we will change it later.
//write from OBRAM to DRAM
void MemoryController::writeBackOutputChannel(float * output_ptr, int cur_co,
float data) {
#pragma HLS inline
#pragma HLS pipeline
int channel_offset=cur_co*out_channelPixels;
#pragma HLS RESOURCE variable = channel_offset core = MulnS latency = 2
output_ptr[channel_offset+peixl_out_DRAM_offset] = data;
};
//set output pixel offset in DRAM
void MemoryController::setPixelOutOffset(int cur_out_row,int cur_out_col){
#pragma HLS inline
#pragma HLS pipeline
int row_offset=cur_out_row*out_width;
#pragma HLS RESOURCE variable = row_offset core = MulnS latency = 2
peixl_out_DRAM_offset=row_offset+cur_out_col;
};
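As a quick host-side check (with made-up sizes), the two-step address computation above, channel_offset plus the precomputed peixl_out_DRAM_offset, should reproduce the flat mapping DRAM[cur_channel*pixels_per_channel + cur_row*colNum + cur_col] from the start of this section:

```cpp
#include <cassert>

const int out_channelPixels = 6 * 5;  // example: 6 rows x 5 cols per channel
const int out_width = 5;

// Flat mapping from section 3.2.
int flat_addr(int co, int row, int col) {
  return co * out_channelPixels + row * out_width + col;
}

// Two-step version: setPixelOutOffset() followed by writeBackOutputChannel().
int two_step_addr(int co, int row, int col) {
  int peixl_out_DRAM_offset = row * out_width + col;  // setPixelOutOffset
  int channel_offset = co * out_channelPixels;        // writeBackOutputChannel
  return channel_offset + peixl_out_DRAM_offset;
}
```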
MulnS: the address arithmetic involves multiplications, which are implemented on the MulnS core, essentially an n-stage pipelined multiplier; latency is set to 2 (still uncertain; to be confirmed by later HLS experiments).
INLINE: everything is INLINEd, since all of it must be embedded into the calling functions.
PIPELINE: both functions are pipelined, so that computing an address overlaps with the read, and with the address addition, maximizing utilization.
3.3 Loading the weights from DRAM into WBRAM
Whenever data is loaded from DRAM into a BRAM, it passes through this stage: first from DRAM into a register (reg), then from the register on into the BRAM:
#ifdef __SYNTHESIS__
// Register stage for manual pipelining:
template <class T> T reg(T x) {
#pragma HLS pipeline
#pragma HLS inline self off
#pragma HLS interface ap_ctrl_none register port=return
return x;
}
#else
// Plain pass-through in C simulation
template <class T> T reg(T x) { return x; }
#endif
Data flows DRAM → reg → BRAM, using the ap_ctrl_none interface mode.
The weights are loaded from DRAM into the BRAM in a single pass, up front.
//load from DRAM weight to reg
float MemoryController::load_weight_2_reg(float * weight_DRAM_ptr, int weight_loc){
#pragma HLS inline
#pragma HLS pipeline
float read = reg(weight_DRAM_ptr[weight_loc]);
return read;
}
//load weights from DRAM to BRAM
void WeightsCache::load_WBRAM_from_DRAM(float * weight_ptr){
#pragma HLS inline
int PEID,filterID;
float *WBRAM_ptr; float *weight_DRAM_ptr;
for(int cur_co=0;cur_co<outChannelNum;cur_co++){
int offset_inchannel=cur_co*inChannelNum;
#pragma HLS RESOURCE variable = offset_inchannel core = MulnS latency = 2
for(int cur_ci=0;cur_ci<inChannelNum;cur_ci++){
get_WBRAM_addr(cur_ci,cur_co,PEID,filterID);
WBRAM_ptr=WBRAM[PEID][filterID];
int weight_DRAM_loc=9*(offset_inchannel+cur_ci);
#pragma HLS RESOURCE variable = weight_DRAM_loc core = MulnS latency = 2
weight_DRAM_ptr=weight_ptr+weight_DRAM_loc;
for(int i=0;i<9;i++){
#pragma HLS PIPELINE II = 2
float weight_in_reg=MemoryController::load_weight_2_reg(weight_DRAM_ptr,i);
WBRAM_ptr[i]=weight_in_reg;
}
}
}
}
MulnS: the multiplications go to a pipelined multiplier core to speed up the address arithmetic.
INLINE: the function is INLINEd, embedded into the main program.
PIPELINE: the innermost loop is PIPELINEd, keeping the DRAM reads and BRAM writes flowing continuously.
3.4 Reading weights from WBRAM into the PEs
The weight-address computation gets an INLINE pragma, so the address arithmetic is embedded directly into the weight-fetch code rather than invoked as a call.
void WeightsCache::get_WBRAM_addr(const int cur_ci, const int cur_co,int &PEID, int &filterID){
#pragma HLS INLINE
PEID=cur_co%N_PE;
filterID=(cur_co/N_PE)*inChannelNum+cur_ci;
}
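It is worth verifying that this mapping never sends two (cur_ci, cur_co) pairs to the same WBRAM slot, otherwise two filters would collide; a host-side sketch with arbitrary small sizes:

```cpp
#include <cassert>
#include <set>
#include <utility>

const int N_PE = 4, inChannelNum = 3, outChannelNum = 10;  // example sizes

// Same mapping as WeightsCache::get_WBRAM_addr.
void get_addr(int cur_ci, int cur_co, int &PEID, int &filterID) {
  PEID = cur_co % N_PE;
  filterID = (cur_co / N_PE) * inChannelNum + cur_ci;
}

// True if no two (ci, co) pairs map to the same (PEID, filterID) slot.
bool mapping_is_injective() {
  std::set<std::pair<int, int> > seen;
  for (int co = 0; co < outChannelNum; co++)
    for (int ci = 0; ci < inChannelNum; ci++) {
      int pe, f;
      get_addr(ci, co, pe, f);
      if (!seen.insert(std::make_pair(pe, f)).second) return false;
    }
  return true;
}
```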
Optimization pragmas on the weight read:
//get 9 weights from WBRAM to buffer
void WeightsCache::get_9_weights_to_buffer(int cur_ci, int cur_co,float weight_buffer[9]){
#pragma HLS FUNCTION_INSTANTIATE variable = cur_co
#pragma HLS inline
#pragma HLS pipeline
// Array Partitioning
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 1 // PE ID
#pragma HLS ARRAY_PARTITION variable = WBRAM complete dim = 3 // weight ID
#pragma HLS RESOURCE variable = WBRAM core = RAM_S2P_BRAM latency = 3
int PEID,filterID;
WeightsCache::get_WBRAM_addr(cur_ci,cur_co,PEID,filterID);
for(int i=0;i<9;i++){
weight_buffer[i]=WBRAM[PEID][filterID][i];
}
}
INLINE: this function is embedded into processAll_channelOut, where every unrolled copy contains its own weight-fetch block. Each output channel lives on a different PE, and WBRAM is already completely partitioned per PE into separate BRAMs, so the fetches can run in parallel.
PIPELINE: pipelining the enclosing function effectively UNROLLs the loop underneath. That is allowed here because WBRAM is completely partitioned so that every group of 9 weights sits in its own BRAM, and weight_buffer has already been fully partitioned into registers, so the nine reads can also happen in parallel.
FUNCTION_INSTANTIATE: the function is specialized into a concrete instance per output channel.
3.5 Loading the image from DRAM into IBRAM
Address computation: every address calculation is INLINEd into the operation that uses it to speed things up, and every multiplication is implemented on a MulnS pipelined multiplier core.
//set output pixel offset in DRAM
void MemoryController::setPixelOutOffset(int cur_out_row,int cur_out_col){
#pragma HLS inline
int row_offset=cur_out_row*out_width;
#pragma HLS RESOURCE variable = row_offset core = MulnS latency = 2
peixl_out_DRAM_offset=row_offset+cur_out_col;
};
//set input pixel row offset for loading from DRAM to IBRAM
void MemoryController::setPixelLoadRowOffset(){
#pragma HLS inline
int cur_inPixel_row_loc=cur_loadPixel_row*stride;
#pragma HLS RESOURCE variable = cur_inPixel_row_loc core = MulnS latency = 2
pixel_loadRow_DRAM_offset=cur_inPixel_row_loc*in_width;
#pragma HLS RESOURCE variable = pixel_loadRow_DRAM_offset core = MulnS latency = 2
cur_loadPiexel_col=0;
cur_loadPixel_row++;
};
void MemoryController::setPixelLoadOffset(){
#pragma HLS inline
int cur_inPixel_col_loc=stride*cur_loadPiexel_col;
#pragma HLS RESOURCE variable = cur_inPixel_col_loc core = MulnS latency = 2
load_pixel_offset=pixel_loadRow_DRAM_offset+cur_inPixel_col_loc;
cur_loadPiexel_col++;
};
Fetching the pixel value from DRAM at the computed address.
First it is loaded from DRAM into the register reg:
//load pixel from DRAM to reg
float MemoryController::loadInputChannelPixel(float * input_ptr,int ci){
#pragma HLS inline
#pragma HLS pipeline
int in_channel_pixel_offset=ci*in_channel_pixels;
#pragma HLS RESOURCE variable = in_channel_pixel_offset core = MulnS latency = 2
float px=reg(input_ptr[load_pixel_offset+in_channel_pixel_offset]);
return px;
};
INLINE embeds it into the caller; pipeline makes the load run in a pipelined fashion.
Then the routine that loads a whole row of image pixels:
//load whole row from DRAM to IBRAM (in hardware IBRAM order is row/col/channel_In)
void ImageCache::loadRowDRAM_2_IBRAM(float * input_ptr){
#pragma HLS inline
L_DRAM_PRELOADROW_X: for (int cur_col = 0; cur_col < in_width; cur_col++) {
MemoryController::setPixelLoadOffset();
loadPixelDRAM_2_IBRAM(input_ptr);
}
};
void ImageCache::loadPixelDRAM_2_IBRAM(float * input_ptr){
#pragma HLS inline
L_PRELOAD_PIXEL_FROM_DRAM: for (int ci = 0; ci < in_ChannelNum; ci++) {
#pragma HLS pipeline
//#pragma HLS latency min=4
float px = MemoryController::loadInputChannelPixel(input_ptr,ci);
writeNextChannelPixel_2_IBRAM(px);
}
};
void ImageCache::writeNextChannelPixel_2_IBRAM(float pixel){
// Write Value into IBRAM
IBRAM[cur_IBRAM_addr] = pixel;
// Check and Wrap Write Address into IBRAM
if (cur_IBRAM_addr == MAX_IBRAM_ADDR)
cur_IBRAM_addr = 0;
else
cur_IBRAM_addr++;
};
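writeNextChannelPixel_2_IBRAM treats IBRAM as a ring buffer: the write address counts up to MAX_IBRAM_ADDR and then wraps back to 0, so the cache only ever holds the most recent image rows. The wrap logic in isolation (MAX_IBRAM_ADDR is a stand-in value here):

```cpp
#include <cassert>

const int MAX_IBRAM_ADDR = 7;  // example: an 8-entry ring buffer

// Same check-and-wrap scheme as writeNextChannelPixel_2_IBRAM,
// written as a pure function of the current write address.
int next_ibram_addr(int cur) {
  return (cur == MAX_IBRAM_ADDR) ? 0 : cur + 1;
}
```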
The innermost for loop is PIPELINEd.
3.6 Reading IBRAM out into the PEs
//get IBRAM pixel into buffer
void ProcessingElement::loadPixel_buffer(const int up_row,const int left_col,
const int cur_In_channel, float pixel_buffer[9]){
#pragma HLS inline
#pragma HLS pipeline
load_pixel_2_PE_row_loop:
for (int cur_filterRow=0;cur_filterRow<3;cur_filterRow++){
int pixel_row_to_load=up_row+cur_filterRow;
int IBRAM_line_offset=ImageCache::calcu_IBRAM_row_offset(pixel_row_to_load);
load_pixel_2_PE_col_loop:
for (int cur_filterCol=0;cur_filterCol<3;cur_filterCol++){
int pixel_col_to_load=left_col+cur_filterCol;
float px=reg(ImageCache::get_IBRAM_Pixel(IBRAM_line_offset,pixel_col_to_load,
cur_In_channel));
pixel_buffer[3*cur_filterRow+cur_filterCol]=px;
}
}
};
Loading from IBRAM into the buffer: an INLINE pragma is added, and PIPELINE automatically UNROLLs the loops underneath. One point we do not yet understand: IBRAM is a dual-port BRAM, so why can the reads be carried out in parallel?
//load pixel from IBRAM out to PE
int ImageCache::calcu_IBRAM_row_offset(int cur_row){
#pragma HLS inline
int IBRAM_line=cur_row%NUM_IMG_CACHE_LINES;
int pixels_each_line=in_width*in_ChannelNum;
#pragma HLS RESOURCE variable=pixels_each_line core=MulnS latency=2
int IBRAM_line_offset=IBRAM_line*pixels_each_line;
#pragma HLS RESOURCE variable=IBRAM_line_offset core=MulnS latency=2
return IBRAM_line_offset;
};
float ImageCache::get_IBRAM_Pixel(const int IBRAM_line_offset, const int cur_col,
const int channel_in){
#pragma HLS inline
#pragma HLS RESOURCE variable = IBRAM core = RAM_S2P_BRAM
int IBRAM_col_offset=cur_col*in_ChannelNum;
int IBRAM_loc=IBRAM_line_offset+IBRAM_col_offset+channel_in;
float px=IBRAM[IBRAM_loc];
return px;
}
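calcu_IBRAM_row_offset pairs with the write-side wraparound of section 3.5: image row cur_row maps onto cache line cur_row % NUM_IMG_CACHE_LINES, so a newly loaded row reuses the line of the row NUM_IMG_CACHE_LINES before it. A small host-side sketch (sizes are examples, not the real layer dimensions):

```cpp
#include <cassert>

const int NUM_IMG_CACHE_LINES = 4;       // example number of cached rows
const int in_width = 8, in_ChannelNum = 3;

// Same line mapping as calcu_IBRAM_row_offset: rows share cache lines
// modulo NUM_IMG_CACHE_LINES, each line holding in_width * in_ChannelNum values.
int ibram_row_offset(int cur_row) {
  int line = cur_row % NUM_IMG_CACHE_LINES;
  return line * in_width * in_ChannelNum;
}
```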