About SDSoC_Examples

Latest recommended article published on 2021-05-15 06:58:26

元气少女缘结神

Category: FPGA basises

Article link: https://blog.csdn.net/wd1603926823/article/details/86739851


Source: https://github.com/Xilinx/SDSoC_Examples/tree/master/cpp/getting_started

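I. array_partition
1: zero_copy --> This gives the PL a direct memory interface to DDR (AXI master). I used to think that only arrays that are both read and written could be placed in the shared region with zero_copy, but read-only arrays can use it too.
2: Local memory inside a hardware function is implemented in BRAM, which has at most two ports, so at most 2 locations can be accessed per clock cycle. The usual pattern is to burst-read a read-only array from DDR into local memory (which supports random access), write the results into local memory, and then burst-write them to the output argument in one pass. Both the book and the examples recommend this. (Perhaps burst read/write is only recommended when the argument lives in DDR?)
3: Every for loop whose bound is a function argument needs its own LOOP_TRIPCOUNT pragma; even when two loops share the same bound variable, the pragma has to be written twice. LOOP_TRIPCOUNT only affects the latency figures in the report, not the generated hardware. (Would a single assert be enough instead?)
4: Loops nested under a PIPELINE pragma are unrolled automatically. However, A and B are local memories (BRAM) with at most 2 ports, so arraypart3 cannot be unrolled completely; it is effectively unrolled only 2 times. This port limit is what array partitioning is meant to lift.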

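// i and j run over the runtime dimension mat_dim, so each loop carries its own LOOP_TRIPCOUNT.
arraypart1: for (int i = 0; i < mat_dim; i++) {
#pragma HLS LOOP_TRIPCOUNT min=64 max=64
    arraypart2: for (int j = 0; j < mat_dim; j++) {
    #pragma HLS LOOP_TRIPCOUNT min=64 max=64
    #pragma HLS PIPELINE
        int result = 0;
        // Fixed bound MAX_SIZE: this is the loop that gets unrolled under the PIPELINE above.
        arraypart3: for (int k = 0; k < MAX_SIZE; k++) {
            result += A[i][k] * B[k][j];
        }
        C[i][j] = result;
    }
}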
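
The points above can be put together in a minimal, self-contained sketch. It is not the code from the Xilinx repository: the function name `matmul_partition_sketch`, the row-major flat layout of `in1`/`in2`/`out`, and the chosen partition dimensions are assumptions made for illustration. A regular C++ compiler ignores the HLS pragmas (possibly with a warning), so the logic can be checked in software first.

```cpp
#include <cassert>

const int MAX_SIZE = 64;  // assumed compile-time upper bound on the matrix dimension

// Hypothetical accelerator function; names and layout are illustrative only.
void matmul_partition_sketch(const int *in1, const int *in2, int *out, int mat_dim) {
    assert(mat_dim > 0 && mat_dim <= MAX_SIZE);

    // Local buffers (BRAM in hardware). Zero-filled so that the fixed-bound
    // inner loop may safely read entries beyond mat_dim.
    int A[MAX_SIZE][MAX_SIZE] = {};
    int B[MAX_SIZE][MAX_SIZE] = {};
    int C[MAX_SIZE][MAX_SIZE] = {};
    // Assumed partitioning: split A along k (dim 2) and B along k (dim 1)
    // so that all operands of the inner loop sit in separate memories.
#pragma HLS ARRAY_PARTITION variable=A dim=2 complete
#pragma HLS ARRAY_PARTITION variable=B dim=1 complete

    // Read both inputs with consecutive addresses into local memory.
    for (int i = 0; i < mat_dim; i++) {
#pragma HLS LOOP_TRIPCOUNT min=64 max=64
        for (int j = 0; j < mat_dim; j++) {
#pragma HLS LOOP_TRIPCOUNT min=64 max=64
#pragma HLS PIPELINE
            A[i][j] = in1[i * mat_dim + j];
            B[i][j] = in2[i * mat_dim + j];
        }
    }

    // Compute on local memory only.
    for (int i = 0; i < mat_dim; i++) {
#pragma HLS LOOP_TRIPCOUNT min=64 max=64
        for (int j = 0; j < mat_dim; j++) {
#pragma HLS LOOP_TRIPCOUNT min=64 max=64
#pragma HLS PIPELINE
            int result = 0;
            for (int k = 0; k < MAX_SIZE; k++) {
                result += A[i][k] * B[k][j];
            }
            C[i][j] = result;
        }
    }

    // Write the result back with consecutive addresses.
    for (int i = 0; i < mat_dim; i++) {
#pragma HLS LOOP_TRIPCOUNT min=64 max=64
        for (int j = 0; j < mat_dim; j++) {
#pragma HLS LOOP_TRIPCOUNT min=64 max=64
#pragma HLS PIPELINE
            out[i * mat_dim + j] = C[i][j];
        }
    }
}
```

Called with the row-major 2x2 matrices {1, 2, 3, 4} and {5, 6, 7, 8} and mat_dim = 2, it fills out with {19, 22, 43, 50}.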

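II. burst read and write
1: burst --> What exactly counts as a burst? After the first example I assumed that burst read/write meant direct transfers between an array in DDR and local memory, but this example uses no local memory at all: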

void vec_incr_accel(int *in, int *out, int size, int inc_value){
    calc_write: for(int j=0; j < size; j++){
    #pragma HLS LOOP_TRIPCOUNT min=1 max=2048
    #pragma HLS PIPELINE
        out[j] = in[j] + inc_value;
    }
}

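So it seems that any read/write of the DDR arrays in/out with a structure like the one above, that is, a pipelined loop walking the arrays at consecutive indices, counts as a burst read/write!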
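
For comparison, here is a sketch of the other form mentioned in section I, where the data is staged through a local buffer in fixed-size chunks. The function name `vec_incr_buffered_sketch` and the chunk size `BUFFER_SIZE` are assumptions for illustration, not taken from the repository. In both forms the DDR arrays are touched at consecutive addresses inside a pipelined loop, which, as far as I understand, is the property the tool looks for when inferring a burst.

```cpp
#include <cassert>

const int BUFFER_SIZE = 256;  // assumed chunk size for the local buffer

// Hypothetical variant of vec_incr_accel that stages data in local memory.
void vec_incr_buffered_sketch(const int *in, int *out, int size, int inc_value) {
    int burstbuffer[BUFFER_SIZE];  // local memory (BRAM in hardware)

    for (int base = 0; base < size; base += BUFFER_SIZE) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=8
        // Last chunk may be shorter than BUFFER_SIZE.
        int chunk = (size - base < BUFFER_SIZE) ? (size - base) : BUFFER_SIZE;

        // Read one chunk from DDR at consecutive addresses.
        for (int j = 0; j < chunk; j++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=256
#pragma HLS PIPELINE
            burstbuffer[j] = in[base + j];
        }

        // Compute and write the chunk back at consecutive addresses.
        for (int j = 0; j < chunk; j++) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=256
#pragma HLS PIPELINE
            out[base + j] = burstbuffer[j] + inc_value;
        }
    }
}
```

The result is the same as for the direct loop above; only the place where the data sits between the read and the write differs.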

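III. custom data type
1: struct --> The total size of the members of a struct should be an integer multiple of 32 bits, which allows efficient access to global memory. So when a custom struct is smaller than 32 bits, pad it up to 32 bits, even if that means adding an extra member.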

typedef struct RGBcolor_struct
{
  unsigned char r;
  unsigned char g;
  unsigned char b;
  unsigned char pad;
 } __attribute__ ((packed, aligned(4))) RGBcolor;
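
A small sketch of how the size rule can be checked at compile time. The unpadded struct `RGBnopad` and the helper `pack_rgb` are hypothetical names added for illustration; `__attribute__` is a GCC/Clang extension, so this assumes one of those compilers.

```cpp
#include <cassert>

// Same layout as the struct above: 3 colour bytes plus 1 padding byte.
typedef struct RGBcolor_struct {
    unsigned char r;
    unsigned char g;
    unsigned char b;
    unsigned char pad;
} __attribute__((packed, aligned(4))) RGBcolor;

// Hypothetical unpadded version, shown only for the size comparison.
struct RGBnopad {
    unsigned char r;
    unsigned char g;
    unsigned char b;
};

// The padded struct is exactly one 32-bit word; the unpadded one is 24 bits.
static_assert(sizeof(RGBcolor) == 4, "RGBcolor should occupy exactly 32 bits");
static_assert(sizeof(RGBnopad) == 3, "unpadded struct occupies 24 bits");

// Hypothetical helper: view one pixel as a single 32-bit word (r in the low byte).
unsigned int pack_rgb(const RGBcolor &c) {
    return static_cast<unsigned int>(c.r)
         | (static_cast<unsigned int>(c.g) << 8)
         | (static_cast<unsigned int>(c.b) << 16)
         | (static_cast<unsigned int>(c.pad) << 24);
}
```

With the padding byte in place, an array of RGBcolor is an array of whole 32-bit words, so each element can be moved in a single word-sized transfer.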

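IV. direct connect
1: access_pattern: the default is RANDOM; an array is accessed as a stream only when SEQUENTIAL is specified. (Judging from the comments in this example, does that also mean the data lives in DDR?)
2: burst_write: both this direct connect example and the first example, array partition, accelerate a multiplication. The difference is that array partition writes from local memory to DDR, whereas this example writes directly to DDR without going through local memory. I set the input array size of the hardware multiply function to 64x64 in both examples. The first example uses the zero_copy optimization and local memory for both read and write; this example uses the access_pattern:SEQUENTIAL optimization, with local memory only for the read and a direct write.
After rebuilding, I compared the reports for the two functions. Function 1: the HLS report gives a latency of 12883 clock cycles, and under Accelerator Callsites in the Data motion network report the data mover setup time is 976 and the transfer time is 14005540. Function 2: the HLS report gives a latency of 8201 clock cycles, with a data mover setup time of 1112 and a transfer time of 28524.
I expected the second approach to transfer faster because it uses streaming, and the report seems to agree. I also expected the first approach, which uses local memory throughout, to need fewer clock cycles in the HLS report, yet it shows more. Don't the official documents all recommend using a local cache because it is fast?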
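
To make the second variant concrete, here is a rough sketch of a multiply function whose arguments are declared SEQUENTIAL: each input element is read exactly once and in order into local memory, and each output element is written exactly once and in order, straight to the argument. The function name `mmult_stream_sketch`, the bound `MAX_DIM`, and the exact pragma arguments are assumptions for illustration; the real example in the repository may differ. A regular C++ compiler ignores the SDS and HLS pragmas.

```cpp
#include <cassert>

const int MAX_DIM = 64;  // assumed upper bound on the matrix dimension

// Assumed SDSoC data-mover hints: stream-style access plus explicit transfer sizes.
#pragma SDS data access_pattern(in1:SEQUENTIAL, in2:SEQUENTIAL, out:SEQUENTIAL)
#pragma SDS data copy(in1[0:dim*dim], in2[0:dim*dim], out[0:dim*dim])
void mmult_stream_sketch(const int *in1, const int *in2, int *out, int dim) {
    assert(dim > 0 && dim <= MAX_DIM);

    // Inputs are buffered locally because the multiply needs random access.
    int A[MAX_DIM][MAX_DIM];
    int B[MAX_DIM][MAX_DIM];

    // Each input element is consumed once, in increasing index order.
    for (int idx = 0; idx < dim * dim; idx++) {
#pragma HLS LOOP_TRIPCOUNT min=4096 max=4096
#pragma HLS PIPELINE
        A[idx / dim][idx % dim] = in1[idx];
        B[idx / dim][idx % dim] = in2[idx];
    }

    // Results are produced in row-major order, so out is written once per
    // element and in increasing index order, with no local result buffer.
    for (int i = 0; i < dim; i++) {
#pragma HLS LOOP_TRIPCOUNT min=64 max=64
        for (int j = 0; j < dim; j++) {
#pragma HLS LOOP_TRIPCOUNT min=64 max=64
#pragma HLS PIPELINE
            int acc = 0;
            for (int k = 0; k < dim; k++) {
#pragma HLS LOOP_TRIPCOUNT min=64 max=64
                acc += A[i][k] * B[k][j];
            }
            out[i * dim + j] = acc;
        }
    }
}
```

Compared with the zero_copy variant of section I, the only structural difference inside the function is the missing local result array; the difference in how the data reaches the function is carried entirely by the pragmas in front of it.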