HLS Lesson 5 (UG902 Notes)

Huskar_Liu

Article link: https://blog.csdn.net/weixin_42418557/article/details/118890215

Interface Synthesis

When the top-level function is synthesized, the arguments (or parameters) to the function are
synthesized into RTL ports.

#include "sum_io.h"
dout_t sum_io(din_t in1, din_t in2, dio_t *sum) {
dout_t temp;
*sum = in1 + in2 + *sum;
temp = in1 + in2;
return temp;
}

The above example includes:
• Two pass-by-value inputs in1 and in2.
• A pointer sum that is both read from and written to.
• A function return, the value of temp

Vivado HLS creates three types of ports on the RTL design:
Clock and Reset ports: ap_clk and ap_rst
These are the clock and reset signals, the global timing-control signals of the system.

Block-level interface protocol. These are shown expanded in the preceding figure: ap_start, ap_done, ap_ready, and ap_idle.
The block-level interface controls the operation of the module's top-level FSM.

Port Level interface protocols. These are created for each argument in the top-level function
and the function return (if the function returns a value).
The port-level interfaces control the handshake of each individual port, for example vld.

In this example, these ports are:
in1, in2, sum_i, sum_o, sum_o_ap_vld, and ap_return.
If the function has a return value, an output port ap_return is implemented to provide the
return value.

When the in-out is split into separate input and
output ports, mode ap_none is applied to the input port and ap_vld applied to the output port.
This is the default for pointer arguments that are both read and written.
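Note: a pointer argument that is both read and written in the code is usually implemented as a BRAM. If the pointer is only dereferenced directly, with no offset addressing, so that the RAM holds a single element, it is implemented as a register instead, which appears on the interface as the two port sets sum_i and sum_o.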

The block-level protocols indicate the function is complete with the ap_done
signal. This also indicates the data on port ap_return is valid and can be read.
Note: The return value to the top-level function cannot be a pointer.
The return value must be a wire driven out from an internal register.

The design starts when ap_start is asserted High.
The ap_idle signal is asserted Low to indicate the design is operating.
The input data is read at any clock after the first cycle.
The ap_ready signal is asserted high when all inputs have been read.
When output sum is calculated, the associated output handshake (sum_o_ap_vld) indicates
that the data is valid.
When the function completes, ap_done is asserted. This also indicates that the data on
ap_return is valid.
Port ap_idle is asserted High to indicate that the design is waiting start again.

The block-level interface protocols are ap_ctrl_none, ap_ctrl_hs, and ap_ctrl_chain.
These are specified, and can only be specified, on the function or the function return.
The ap_ctrl_none mode implements the design without any block-level I/O protocol

If the function return is also specified as an AXI4-Lite interface (s_axilite) all the ports in the
block-level interface are grouped into the AXI4-Lite interface.
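If the return is specified as AXI4-Lite, the return value becomes a register on the bus that the bus master can read.
In other words, the return-value register is no longer wired to a port but to the READ MUX of the bus.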

The AXI4 interfaces supported by Vivado HLS include the AXI4-Stream (axis), AXI4-Lite
(s_axilite), and AXI4 master (m_axi) interfaces, which you can specify as follows:
• AXI4-Stream interface: Specify on input arguments or output arguments only, not on input/output arguments.
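The direction must be one-way. When an argument is specified as AXIS, it can be thought of as
a FIFO (even an infinitely deep one) with AXIS on one side and the FIFO read or write interface on the other.
In C code the argument is usually accessed inside a for loop; each access takes one item from the FIFO or writes one item into it.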

• AXI4-Lite interface: Specify on any type of argument except streams. You can group multiple arguments into the same AXI4-Lite interface.
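The direction may be bidirectional. When an argument is specified as AXI4-Lite, it can be thought of as
a register connected to the READ MUX and the WRITE DEMUX of the AXI4-Lite bus.
In C code, for an argument that is input from outside, an internal read of the argument reads the corresponding register. For an argument that is output to the outside, an internal write is clocked into a register wired to the READ MUX, and the MUX selects it when the bus is accessed externally.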

• AXI4 master interface: Specify on arrays and pointers (and references in C++) only. You can group multiple arguments into the same AXI4 interface.
The handling mechanism is similar to that of AXI4-Lite.

The ap_none and ap_stable modes specify that no I/O protocol be added to the port.
The ap_none mode is the default for scalar inputs.

The ap_stable mode is intended for configuration inputs that only change when the device is in reset mode.
The constant mode, ap_stable, is used for a constant port. The port value is read and registered once at reset and is not read or registered again until the next reset.

The ap_hs mode can be applied to arrays that are read or written in sequential order.

Array arguments are implemented by default as an ap_memory interface. This is a standard
block RAM interface with data, address, chip-enable, and write-enable ports.

If Vivado HLS can determine that using a dual-port interface will reduce the initiation interval, it will automatically implement a dual-port interface.
Letting HLS decide automatically is recommended, so using RESOURCE to choose between single-port and dual-port RAM is not recommended.

An ap_memory interface is displayed as multiple and separate ports.
A bram interface is displayed as a single grouped port which can be connected to a Xilinx
block RAM using a single point-to-point connection
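In C code, if a single statement accesses several elements of an array argument at once, HLS has to partition the BRAM and expand it into multiple read/write ports.
If an array argument is specified as ap_memory, the RTL presents the partitioned array to the user as multiple read/write ports.
If it is specified as ap_bram, the RTL adds extra logic that implements a switch for read/write conversion and distribution, so only a single read/write port is presented to the user, which is convenient for connecting to an external BRAM.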

If the array is accessed in a sequential manner an ap_fifo interface can be used.
The ap_fifo interface can only be used for reading or writing, not both.
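In C code, if the addressing of an array argument is strictly increasing or strictly decreasing, with no fallback to earlier addresses, the array port can be changed from ap_memory to ap_fifo.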
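As a minimal sketch of an access pattern that fits this rule, the function below reads each input element once and writes each output element once, both in ascending order. The function name running_sum and the length N are hypothetical, and the directives are shown only as comments so the code compiles with a standard C++ compiler.

```cpp
#include <cstddef>

const std::size_t N = 8; // hypothetical array length

// Every index of `in` is read exactly once and every index of `out` is
// written exactly once, both in ascending order, so neither port needs
// random addressing.
void running_sum(const int in[N], int out[N]) {
    // In an HLS flow the ports would carry directives such as:
    //   #pragma HLS INTERFACE ap_fifo port=in
    //   #pragma HLS INTERFACE ap_fifo port=out
    int acc = 0;
    for (std::size_t i = 0; i < N; ++i) {
        acc += in[i];   // single sequential read
        out[i] = acc;   // single sequential write
    }
}
```

Each port is used in one direction only, which matches the rule that an ap_fifo interface cannot be both read and written.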

Structs on the interface are by default decomposed into their member elements and ports are
implemented separately for each member element.
Arrays of structs are implemented as multiple arrays, with a separate array for each member of
the struct.
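A struct argument is decomposed at the interface into several independent ports, one for each member.
If the argument is an array of structs, each member is organized into its own array and implemented as a BRAM, so there are as many BRAMs as there are members.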

Structs are only supported for the AXIM interface if the struct is packed using the DATA_PACK
optimization.
If a struct port using DATA_PACK is to be implemented with an AXI4 interface you may wish to consider using the DATA_PACK -byte_pad option.
DATA_PACK is not recommended unless the II is tight and scalarization is required to raise access throughput.

If a struct contains arrays, those arrays can be optimized using the ARRAY_PARTITION
directive to partition the array or the ARRAY_RESHAPE directive to partition the array and recombine the partitioned elements into a wider array.
A struct cannot be optimized with DATA_PACK and then partitioned or reshaped. The
DATA_PACK, ARRAY_PARTITION, and ARRAY_RESHAPE directives are mutually exclusive.

+++++++++++++++++++++++++++++++++++++++++++++++++++
Interface Synthesis and Multi-Access Pointers

In the following example pointer d_i is read four times and pointer d_o is written to
twice: the pointers perform multiple accesses.

#include "pointer_stream_bad.h"
//void pointer_stream_bad ( dout_t *d_o, din_t *d_i) {
void pointer_stream_better ( volatile dout_t *d_o, volatile din_t *d_i) {
	din_t acc = 0;
	acc += *d_i;
	acc += *d_i;
	*d_o = acc;
	acc += *d_i;
	acc += *d_i;
	*d_o = acc;
}

Using pointers which are accessed multiple times can introduce unexpected behavior after
synthesis.
To preserve every access, the pointers must be specified as volatile.

it is highly recommended to implement the behavior required using the hls::stream class.
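The sketch below restates the four-read, two-write behaviour with explicit stream reads and writes. The class sim_stream is a hypothetical stand-in written only so the example compiles without the Vivado HLS headers; a real design would include hls_stream.h and use hls::stream<T> in its place.

```cpp
#include <queue>

// Stand-in for hls::stream<T>, for software illustration only.
template <typename T>
class sim_stream {
public:
    void write(const T &v) { q_.push(v); }
    T read() { T v = q_.front(); q_.pop(); return v; }
    bool empty() const { return q_.empty(); }
private:
    std::queue<T> q_;
};

// Same behaviour as the pointer version: four reads, two writes.
// Each read() consumes a new sample, so no access can be merged away.
void stream_accumulate(sim_stream<int> &d_o, sim_stream<int> &d_i) {
    int acc = 0;
    acc += d_i.read();
    acc += d_i.read();
    d_o.write(acc);
    acc += d_i.read();
    acc += d_i.read();
    d_o.write(acc);
}
```

With inputs 1, 2, 3, 4 the two outputs are 3 and 10, and the number of samples consumed and produced is explicit in the code rather than implied by pointer semantics.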

register----If you select this option, all pass-by-value reads are performed in the first cycle of operation. For output ports, the register option guarantees the output is registered.
For memory, FIFO, and AXI4 interfaces, the register option has no effect.
With the register option, the data on the input argument ports is registered in the first cycle after the module FSM starts, once per ap_start.

depth----This option specifies how many samples are provided to the design by the test bench and how many output values the test bench must store.
Note: For cases in which a pointer is read from or written to multiple times within a single transaction,
the depth option is required for C/RTL co-simulation.
The depth option is not required for arrays or when using the hls::stream construct.
It is only required when using pointers on the interface
The depth option is used during simulation to specify the access depth of a pointer.

If the depth option is set too small, the C/RTL co-simulation might deadlock as follows:
The input reads might stall waiting for data that the test bench cannot provide.
The output writes might stall when trying to write data, because the storage is full
Do not set depth too small.

offset----This option is used for AXI4 interfaces.

+++++++++++++++++++++++++++++++++++++++++++++++++
AXI4-Stream Interfaces

An AXI4-Stream interface can be applied to any input argument and any array or pointer output argument.
An AXI4-Stream interface is always sign-extended to the next byte. For example, a 12-bit data value is sign-extended to 16-bit.
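The 12-bit to 16-bit case can be worked through in plain C++ as a rough illustration. Both helper names below are hypothetical and are not part of any Vivado HLS API.

```cpp
#include <cstdint>

// Round a bit width up to the next multiple of 8.
int padded_width(int bits) { return ((bits + 7) / 8) * 8; }

// Sign-extend the low 12 bits of `raw` into a 16-bit signed value.
std::int16_t sign_extend_12_to_16(std::uint16_t raw) {
    int v = raw & 0x0FFF;          // keep the 12 data bits
    if (v & 0x0800) v -= 0x1000;   // bit 11 set: the value is negative
    return static_cast<std::int16_t>(v);
}
```

For example, the 12-bit pattern 0xFFF represents -1 and is carried on the 16-bit bus as 0xFFFF.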

For AXI-Stream interfaces, four types of register
modes are provided to control how the AXI-Stream interface registers are implemented.
• Forward: Only the TDATA and TVALID signals are registered.
• Reverse: Only the TREADY signal is registered.
• Both: All signals (TDATA, TREADY and TVALID) are registered. This is the default.
• Off: None of the port signals are registered.

AXI4-Stream Interfaces with Side-Channels
Side-channels are optional signals which are part of the AXI4-Stream standard. The side-channel signals may be directly referenced and controlled in the C code using a struct, provided the member elements of the struct match the names of the AXI4-Stream side-channel signals.
The AXI-Stream side-channel signals are considered data signals and are registered whenever TDATA is registered.
The AXIS side-channel signals are concatenated with TDATA into a wider bit vector before being registered into the FIFO.

If an argument to the top-level function is a struct, Vivado HLS by
default partitions the struct into separate elements and implements each member of the struct as
a separate port. However, the DATA_PACK directive may be used to pack the elements of a
struct into a single wide-vector, allowing all elements of the struct to be implemented in the
same AXI4-Stream interface.
To specify an array of structs as AXIS, DATA_PACK must be used.

The side-channel signals may be directly referenced and controlled in the C code using a struct, provided the member elements of the struct match the names of the AXI4-Stream side-channel signals.
To use the AXIS side-channel signals, typedef a struct whose members are the side-channel signals, using exactly the same names.

The Vivado HLS include directory contains the file ap_axi_sdata.h.

#include "ap_int.h"
#include "ap_axi_sdata.h"

template<int D,int U,int TI,int TD>
struct ap_axis{
	ap_int<D> data;
	ap_uint<D/8> keep;
	ap_uint<D/8> strb;
	ap_uint<U> user;
	ap_uint<1> last;
	ap_uint<TI> id;
	ap_uint<TD> dest;
};

template<int D,int U,int TI,int TD>
struct ap_axiu{
	ap_uint<D> data;
	ap_uint<D/8> keep;
	ap_uint<D/8> strb;
	ap_uint<U> user;
	ap_uint<1> last;
	ap_uint<TI> id;
	ap_uint<TD> dest;
};

You can create your own user defined structs.
Since the structs shown above use ap_int types and templates, this header file is only for use in
C++ designs.
The valid and ready signals are mandatory signals in an AXI4-Stream and will always be implemented by Vivado HLS. These cannot be controlled using a struct.
You can define your own typedef following this convention.

The following example shows how the side-channels can be used directly in the C code and
implemented on the interface.

#include "ap_axi_sdata.h"
void example(ap_axis<32,2,5,6> A[50], ap_axis<32,2,5,6> B[50]){
#pragma HLS INTERFACE axis port=A
#pragma HLS INTERFACE axis port=B
	int i;
	for(i = 0; i < 50; i++){
		B[i].data = A[i].data.to_int() + 5;
		B[i].keep = A[i].keep;
		B[i].strb = A[i].strb;
		B[i].user = A[i].user;
		B[i].last = A[i].last;
		B[i].id = A[i].id;
		B[i].dest = A[i].dest;
	}
}

The array is implemented as AXIS. Inside the for loop the array addressing is increasing with no fallback, so the access pattern is legal.

When using AXI4-Stream interfaces with side-channels, the function argument is itself a struct (AXI-Stream struct). It can contain data which is itself a struct (data struct) along with the side channels:
Vivado HLS automatically applies the DATA_PACK directive to the data struct and all elements of the data struct are combined into a single wide-data vector.
If the DATA_PACK directive is applied to AXI-Stream struct, the function argument, the data struct and the side-channel signals are combined into a single wide-vector.

++++++++++++++++++++++++++++++++++++++++++++++++
AXI4-Lite Interface

You can use an AXI4-Lite interface to allow the design to be controlled by a CPU or
microcontroller.
Group multiple ports into the same AXI4-Lite interface.
Output C driver files for use with the code running on a processor.

The following example shows how Vivado HLS implements multiple arguments, including the function return, as an AXI4-Lite interface. Because each directive uses the same name for the bundle option, each of the ports is grouped into the same AXI4-Lite interface.

void example(char *a, char *b, char *c)
{
#pragma HLS INTERFACE s_axilite port=return bundle=BUS_A

#pragma HLS INTERFACE s_axilite port=a bundle=BUS_A
#pragma HLS INTERFACE s_axilite port=b bundle=BUS_A

#pragma HLS INTERFACE s_axilite port=c bundle=BUS_A offset=0x0400

#pragma HLS INTERFACE ap_vld port=b

	*c += *a + *b;
}

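The pointers are bidirectional (read and written), so they can be collected into the bus.
In C code, if a pointer is only dereferenced at a single address, the argument is implemented as a single register. If the pointer is accessed with an offset, it is treated as an array and implemented as one.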

As a result, the AXI4-Lite interface contains a register for the port b data, a register for the output to acknowledge that port b was read, and a register for the port b input valid signal.
Port b is, on the one hand, collected into the bus, where it can be read and written over the bus. On the other hand, it is wired to two ports, one input and one output.

Xilinx recommends that you do not include additional I/O protocols in the ports grouped into an AXI4-Lite interface.
However, Xilinx recommends that you include the block-level I/O protocol associated with the return port in the AXI4-Lite interface.
Adding an extra port-level protocol to arguments collected into the bus is not recommended.
The block-level return is the exception.

If you do not use the bundle option, Vivado HLS groups all arguments specified with an AXI4-Lite interface into the same default bundle and automatically names the port.

You can only assign arrays to an AXI4-Lite interface using the default ap_memory interface.
You cannot assign arrays to an AXI4-Lite interface using the bram interface.
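When an array is implemented on the AXI4-Lite bus, adding an explicit ap_memory directive is not recommended because it is already the default, and ap_bram must not be used at all.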

By default, Vivado HLS automatically assigns the address for each port that is grouped into an AXI4-Lite interface. Vivado HLS provides the assigned addresses in the C driver files.
To explicitly define the address, you can use the offset option.
By default HLS assigns the AXI4-Lite offsets itself, although an offset can also be specified manually.
Letting HLS manage the offsets is recommended.

Vivado HLS creates the interrupt port by including the function return in the
AXI4-Lite interface. You can program the interrupt through the AXI4-Lite interface. You can also
drive the interrupt from the following block-level protocols:
• ap_done: Indicates when the function completes all operations.
• ap_ready: Indicates when the function is ready for new input data.
The module generates an interrupt signal, and the interrupt wiring is configured through AXI4-Lite.

By default, Vivado HLS uses the same clock for the AXI4-Lite interface and the synthesized
design.
Optionally, you can use the INTERFACE directive clock option to specify a separate clock for
each AXI4-Lite port.
AXI4-Lite interface clock must be synchronous to the clock used for the synthesized logic
(ap_clk). That is, both clocks must be derived from the same master generator clock.
AXI4-Lite interface clock frequency must be equal to or less than the frequency of the clock
used for the synthesized logic (ap_clk).
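If a clock other than ap_clk is to be used for AXI4-Lite, both clocks must be derived from the same MMCM, and the AXI4-Lite clock frequency must not exceed that of ap_clk.

The recommended practice is therefore to drive the AXI4-Lite interface from ap_clk as well.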

#pragma HLS interface s_axilite port=a clock=AXI_clk1
#pragma HLS interface s_axilite port=c bundle=CTRL1 clock=AXI_clk2

The pragmas above specify additional clocks.

You can program the interface using the C driver files.
The hardware header file xexample_hw.h (in this example) provides a complete list of the memory mapped locations for the ports grouped into the AXI4-Lite slave interface.

For example, to start the block operation the ap_start register must be set to 1.
When the block completes operation, the ap_done, ap_idle and ap_ready registers will be set by the hardware output ports and the results for any output ports grouped into the AXI4-Lite slave interface read from the appropriate register.
Function argument c is both read and written to, and is therefore implemented as separate input and output ports c_i and c_o,

The first recommended flow for programming the AXI4-Lite slave interface is for a one-time execution of the function:
• Use the interrupt function to determine how you wish the interrupt to operate.
• Load the register values for the block input ports. In the above example this is performed
using API functions XExample_Set_a, XExample_Set_b, and XExample_Set_c_i.
• Set the ap_start bit to 1 using XExample_Start to start executing the function. This register is self-clearing as noted in the header file above. After one transaction, the block will suspend operation.
• Allow the function to execute. Address any interrupts which are generated.
• Read the output registers. In the above example this is performed using API functions XExample_Get_c_o_vld, to confirm the data is valid, and XExample_Get_c_o. Note: The registers in the AXI4-Lite slave interface obey the same I/O protocol as the ports. In this case, the output valid is set to logic 1 to indicate if the data is valid.
• Repeat for the next transaction.

The second recommended flow is for continuous execution of the block.
In this mode, the input ports included in the AXI4-Lite slave interface should only be ports which perform configuration.
• Use the interrupt function to determine how you wish the interrupt to operate.
• Load the register values for the block input ports. In the above example this is performed using API functions XExample_Set_a, XExample_Set_b and XExample_Set_c_i.
• Set the auto-start function using API XExample_EnableAutoRestart
• Allow the function to execute. The individual port I/O protocols will synchronize the data being processed through the block.
• Address any interrupts which are generated. The output registers could be accessed during this operation but the data may change often.
• Use the API function XExample_DisableAutoRestart to prevent any more executions.
• Read the output registers. In the above example this is performed using API functions XExample_Get_c_o and XExample_Get_c_o_vld.

In continuous-execution mode, the arguments in the AXI4-Lite interface are configuration parameters only.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++
AXI4 Master Interface

You can use an AXI4 master interface on array or pointer/reference arguments,
Using pointers in the C code and specifying them as AXI4 master (m_axi) in HLS is recommended.

With individual data transfers, Vivado HLS reads or writes a single element of data for each address.

void bus (int *d) {
	static int acc = 0;
	acc += *d;
	*d = acc;
}

In this example, Vivado HLS generates an address on the AXI interface to read a single data value and an address to write a single data value.

With burst mode transfers, Vivado HLS reads or writes data using a single base address followed by multiple sequential data samples, when you use the C memcpy function or a pipelined for loop.

#include <string.h> // memcpy
void example(volatile int *a){
#pragma HLS INTERFACE m_axi depth=50 port=a
#pragma HLS INTERFACE s_axilite port=return
//Port a is assigned to an AXI4 master interface
	int i;
	int buff[50];
	//memcpy creates a burst access to memory
	memcpy(buff,(const int*)a,50*sizeof(int));
	
	for(i=0; i < 50; i++){
		buff[i] = buff[i] + 100;
	}
	
	memcpy((int *)a,buff,50*sizeof(int));
}

void example(volatile int *a){
#pragma HLS INTERFACE m_axi depth=50 port=a
#pragma HLS INTERFACE s_axilite port=return
	//Port a is assigned to an AXI4 master interface
	int i;
	int buff[50];
	//memcpy creates a burst access to memory
	memcpy(buff,(const int*)a,50*sizeof(int));
	
	for(i=0; i < 50; i++){
		buff[i] = buff[i] + 100;
	}
	
	for(i=0; i < 50; i++){
	#pragma HLS PIPELINE
		a[i] = buff[i];
	}
}

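Pointers used in the C code should be declared volatile and specified as m_axi in HLS.
Using memcpy in the C code with the pointer as the buffer address is interpreted as burst mode.
Using a for loop with the loop variable as the offset and the pointer as the base address, together with a PIPELINE directive, is also interpreted as burst mode. Without PIPELINE, the module FSM can be thought of as executing the loop statement by statement, reading different data in different cycles. Only a pipelined for loop can be parallelized so that accesses to multiple consecutive addresses are issued at once.
Note: the index variable in the for loop must be monotonically increasing; it must not decrease and must not fall back.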

When using a for loop to implement burst reads or writes, follow these requirements:
• Pipeline the loop
• Access addresses in increasing order
• Do not place accesses inside a conditional statement
• For nested loops, do not flatten loops, because this inhibits the burst operation
In short: pipelined, monotonically increasing, no conditional branches, no flattening.

Note: Only one read and one write is allowed in a for loop unless the ports are bundled in different AXI ports.
In C code, within one iteration of the for loop body, the pointer may perform only a single read and a single write per data element.

The following example shows how to perform two reads in burst mode using different AXI
interfaces

void example(volatile int *a, volatile int *b){
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE m_axi depth=50 port=a
#pragma HLS INTERFACE m_axi depth=50 port=b bundle=d2_port
	int i;
	int buff[50];
	//copy data in
	
	for(i=0; i < 50; i++){
	#pragma HLS PIPELINE
		buff[i] = a[i] + b[i];
	}
...
}

Port a is specified without using the bundle option and is implemented in the default AXI interface. Port b is specified using a named bundle and is implemented in a separate AXI interface called d2_port.
Relying on the default name for pointers intended as m_axi is not recommended.
Giving every m_axi pointer its own bundle name is recommended.

Structs are only supported for the AXIM interface if the struct is packed using the DATA_PACK optimization.
To use a struct pointer on an AXIM interface, DATA_PACK is required.

Performance tuning of the AXIM interface:
To create the optimal AXI4 interface, the following options are provided in the INTERFACE directive to specify the behavior of the bursts and optimize the efficiency of the AXI4 interface.
Some of these options use internal storage to buffer data.

latency----Specifies the expected latency of the AXI4 interface, allowing the design to initiate
a bus request a number of cycles (latency) before the read or write is expected.
If this figure is too low, the design will be ready too soon and may stall waiting for the bus.
If this figure is too high, bus access may be granted but the bus may stall waiting on the design to start the access

max_read_burst_length----Specifies the maximum number of data values read during a burst transfer.

num_read_outstanding----Specifies how many read requests can be made to the AXI4 bus, without a response, before the design stalls.
This implies internal storage in the design, a FIFO of size: num_read_outstanding*max_read_burst_length*word_size.
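As a worked instance of this formula, the hypothetical helper below multiplies the three factors. The figures 32 outstanding reads, bursts of 16 and a 64-byte (512-bit) word are assumptions chosen for illustration, not values mandated by the tool.

```cpp
#include <cstddef>

// FIFO storage implied by the read-side options, in the same unit as word_size.
std::size_t read_fifo_size(std::size_t num_read_outstanding,
                           std::size_t max_read_burst_length,
                           std::size_t word_size) {
    return num_read_outstanding * max_read_burst_length * word_size;
}
```

With the assumed values, 32 * 16 * 64 gives 32768 bytes of buffering for the read channel alone, which is why these options should be sized against the available memory resources.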

max_write_burst_length----Specifies the maximum number of data values written during a burst transfer.

num_write_outstanding----Specifies how many write requests can be made to the AXI4 bus, without a response, before the design stalls.
This implies internal storage in the design, a FIFO of size: num_write_outstanding*max_write_burst_length*word_size.

#pragma HLS interface m_axi port=input offset=slave bundle=gmem0 	\
			depth=1024*1024*16/(512/8) 		\
			latency=100	\
			num_read_outstanding=32 	\
			num_write_outstanding=32 	\
			max_read_burst_length=16	\
			max_write_burst_length=16

The interface is specified as having a latency of 100. Vivado HLS seeks to schedule the request for burst access 100 clock cycles before the design is ready to access the AXI4 bus.
The options num_write_outstanding and num_read_outstanding ensure the design contains enough buffering to store up to 32 read and write accesses.
The options max_read_burst_length and max_write_burst_length ensure the maximum burst size
is 16 and that the AXI4 interface does not hold the bus for longer than this.
Implementing an AXIM interface amounts to implementing DMA functionality, so tune it carefully.

By default, Vivado HLS implements the AXI4 port with a 32-bit address bus.

Controlling the Address Offset in an AXI4 Interface
By default, the AXI4 master interface starts all read and write operations from address
0x00000000.

#include <string.h> // memcpy
void example(volatile int *a){
#pragma HLS INTERFACE m_axi depth=50 port=a
#pragma HLS INTERFACE s_axilite port=return bundle=AXILiteS
	int i;
	int buff[50];
	memcpy(buff,(const int*)a,50*sizeof(int));
	
	for(i=0; i < 50; i++){
		buff[i] = buff[i] + 100;
	}
	
	memcpy((int *)a,buff,50*sizeof(int));
}

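To implement DMA functionality in HLS,
the recommended practice is to define a temporary variable, for example buff.
First use memcpy to transfer data from the AXIM interface into buff, process buff, and then use memcpy again to transfer the data in buff back out.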

The design reads data from addresses
0x00000000 to 0x000000c7 (50 32-bit words, giving 200 bytes), which represents 50 address
values.

To apply an address offset, use the -offset option with the INTERFACE directive, and specify one of the following options:
• off: Does not apply an offset address. This is the default.
• direct: Adds a 32-bit port to the design for applying an address offset.
• slave: Adds a 32-bit register inside the AXI4-Lite interface for applying an address offset.

In the final RTL, Vivado HLS applies the address offset directly to any read or write address generated by the AXI4 master interface.
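The address arithmetic can be checked with a small sketch. Both helpers are hypothetical and assume 32-bit (4-byte) data words, as in the int example above; the offset 0x1000 in the usage note is likewise an arbitrary illustration.

```cpp
#include <cstdint>

// Byte address of element `index` for 4-byte data words, with `offset`
// added to every access as the base address.
std::uint32_t element_address(std::uint32_t offset, std::uint32_t index) {
    return offset + index * 4u;
}

// Last byte touched when `count` 4-byte words are transferred.
std::uint32_t last_byte_address(std::uint32_t offset, std::uint32_t count) {
    return offset + count * 4u - 1u;
}
```

With no offset, 50 words end at 0x000000c7. With an offset of 0x1000 the same transfer spans 0x00001000 to 0x000010c7.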

Xilinx recommends that you implement the AXI4-Lite interface using the following
pragma:

#pragma HLS INTERFACE s_axilite port=return

If you use the slave option, you must ensure that the AXI master port offset register is bundled into the correct AXI4-Lite interface.

#pragma HLS INTERFACE m_axi port=a depth=50 offset=slave
#pragma HLS INTERFACE s_axilite port=a bundle=AXI_Lite_1

#pragma HLS INTERFACE s_axilite port=return bundle=AXI_Lite_1

#pragma HLS INTERFACE s_axilite port=b bundle=AXI_Lite_2

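Two directives are applied to port a. The first defines the pointer as an AXIM interface and sets its offset to slave.
The second specifies which AXI4-Lite interface the slave offset register is bundled into.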

+++++++++++++++++++++++++++++++++++++++++++++++
Optimizing for Throughput

When a function is pipelined, all loops in the hierarchy below are automatically unrolled. If a loop has variable bounds it cannot be unrolled.

The dataflow optimization is useful on a set of sequential tasks

For any dataflow region (except "dataflow-in-loop"), it is possible to specify the block-level protocol, as the ap_ctrl_none pragma in the following example does:

void region(...) {
#pragma HLS dataflow
#pragma HLS interface ap_ctrl_none port=return
	hls::stream<int> outStream1, outStream2;
	demux(inStream, outStream1, outStream2);
	worker1(outStream1, ...);
	worker2(outStream2, ....);
}

The ARRAY_RESHAPE directive combines ARRAY_PARTITION with the vertical mode of
ARRAY_MAP and is used to reduce the number of block RAM while still allowing the beneficial
attributes of partitioning: parallel access to the data.

void foo (...) {
	int array1[N];
	int array2[N];
	int array3[N];
	#pragma HLS ARRAY_RESHAPE variable=array1 block factor=2 dim=1
	#pragma HLS ARRAY_RESHAPE variable=array2 cyclic factor=2 dim=1
	#pragma HLS ARRAY_RESHAPE variable=array3 complete dim=1
	...
}

++++++++++++++++++++++++++++++++++++++++++++++++++++++
Arbitrary Precision Data Types Library

#include <stdio.h>
#include "ap_cint.h"

typedef int6 dinA_t;
typedef int12 dinB_t;
typedef int22 dinC_t;
typedef int33 dinD_t;
typedef int18 dout1_t;
typedef uint13 dout2_t;
typedef int22 dout3_t;
typedef int6 dout4_t;

void apint_arith(dinA_t inA,dinB_t inB,dinC_t inC,dinD_t inD,dout1_t
				*out1,dout2_t *out2,dout3_t *out3,dout4_t *out4);

The following example shows casting to avoid integer promotion.

#include "ap_cint.h"
typedef int18 din_t;
typedef int36 dout_t;

dout_t apint_promotion(din_t a,din_t b) {
	dout_t tmp;
	tmp = (dout_t)a * (dout_t)b;
	return tmp;
}

C++ arbitrary precision types do not suffer from Integer Promotion Issues.
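The effect of casting before the multiply can be reproduced with standard fixed-width types, as an analogy only. The sketch assumes a platform where int is 32 bits wide, and both function names are hypothetical; it does not use the ap_cint or ap_int types.

```cpp
#include <cstdint>

// The multiply is evaluated in 32 bits; the upper bits are lost before
// the result is widened to 64 bits.
std::uint64_t mul_without_cast(std::uint32_t a, std::uint32_t b) {
    return a * b;
}

// Casting an operand first makes the multiply itself 64 bits wide.
std::uint64_t mul_with_cast(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint64_t>(a) * b;
}
```

For a = b = 0x10000 the first version returns 0 while the second returns 0x100000000, which mirrors why the operands in apint_promotion are cast to dout_t before multiplying.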
HLS favors C++.
Whether C or C++ is used, HLS recommends declaring the custom types with typedef first.

#include <stdio.h>
#include "ap_int.h"

#define N 9

typedef ap_int<6> dinA_t;
typedef ap_int<12> dinB_t;
typedef ap_int<22> dinC_t;
typedef ap_int<33> dinD_t;
typedef ap_int<18> dout1_t;
typedef ap_uint<13> dout2_t;
typedef ap_int<22> dout3_t;
typedef ap_int<6> dout4_t;


void cpp_ap_int_arith(dinA_t inA,dinB_t inB,dinC_t inC,dinD_t inD,dout1_t
				*out1,dout2_t *out2,dout3_t *out3,dout4_t *out4);

Do not use the C++ cout operator to output the results to a file; this keeps the test bench as similar as possible to C.
The built-in ap_int method .to_int() is used to convert the ap_int results to integer types used with the standard fprintf function.

	fprintf(fp, "%d*%d=%d; %d+%d=%d; %d/%d=%d; %d mod %d=%d;\n",
		inA.to_int(), inB.to_int(), out1.to_int(),
		inB.to_int(), inA.to_int(), out2.to_int(),
		inC.to_int(), inA.to_int(), out3.to_int(),
		inD.to_int(), inA.to_int(), out4.to_int()
	);

The following code sample shows ap_fixed type.

#include "ap_fixed.h"
typedef ap_ufixed<10,8, AP_RND, AP_SAT> din1_t;
typedef ap_fixed<6,3, AP_RND, AP_WRAP> din2_t;
typedef ap_fixed<22,17, AP_TRN, AP_SAT> dint_t;
typedef ap_fixed<36,30> dout_t;

dout_t cpp_ap_fixed(din1_t d_in1, din2_t d_in2) {
	static dint_t sum;
	sum += d_in1;
	return sum * d_in2;
}

+++++++++++++++++++++++++++++++++++++++++++++++++
HLS Stream Library

To use hls::stream<> objects, include the header file hls_stream.h.

Modeling designs that use streaming data can be difficult in C.
Vivado HLS provides a C++ template class hls::stream<> for modeling streaming data
structures.

If an hls::stream is used to transfer data between tasks,
you should immediately consider implementing the tasks in a DATAFLOW region where data streams from one task to the next.

Local streams are always implemented as internal FIFOs. Global streams can be implemented as internal FIFOs or ports:
Globally-defined streams that are only read from, or only written to, are inferred as external ports of the top-level RTL block.
Using local streams is recommended.

#include "ap_int.h"
#include "hls_stream.h"

typedef ap_uint<128> uint128_t; // 128-bit user defined type

hls::stream<uint128_t> my_wide_stream; // A stream declaration

First typedef a custom type, then use it to instantiate the class template.

Streams may be optionally named.
Giving each stream object a descriptive name is recommended.

stream<uint8_t> bytestr_in2("input_stream2");

When streams are passed into and out of functions, they must be passed-by-reference as in the following example:

void stream_function (
		hls::stream<uint8_t> &strm_out,
		hls::stream<uint8_t> &strm_in,
		uint16_t strm_len
		)

Passing by reference is recommended rather than passing by pointer.

The << or >> operator is overloaded such that it may be used in a similar fashion to the stream
insertion operators for C++ stream (for example, iostreams and filestreams).
The hls::stream<> object to be written to is supplied as the left-hand side argument and the value to be written as the right-hand side.

my_stream.write(src_var);
my_stream << src_var;

my_stream.read(dst_var);
my_stream >> dst_var;

Non-Blocking Writes
This method attempts to push variable src_var into the stream my_stream, returning a boolean true if successful. Otherwise, false is returned and the queue is unaffected.

if (my_stream.write_nb(src_var)) {
	// Perform standard operations
	...
}
else {
	// display Write did not occur
	return;
}

Fullness Test
Returns true, if and only if the hls::stream<> object is full.

bool stream_full;
stream_full = my_stream.full();

	if(stream_full ){
		// display full
		return;
	}
	else
	{
		// write stream
	}

Non-Blocking Read
This method attempts to read a value from the stream, returning true if successful. Otherwise,
false is returned and the queue is unaffected.

	if (my_stream.read_nb(dst_var)) {
		// Perform standard operations
		...
	}
	else {
		// Read did not occur
		return;
	}

Emptiness Test
Returns true if the hls::stream<> is empty.

bool stream_empty;
stream_empty = my_stream.empty();
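The return-value conventions of the four methods above can be modelled in plain C++ as a rough sketch. The class bounded_stream and its depth parameter are hypothetical stand-ins written for illustration; they are not the hls::stream implementation.

```cpp
#include <cstddef>
#include <queue>

// Bounded stand-in for hls::stream<T>, for software illustration only.
template <typename T>
class bounded_stream {
public:
    explicit bounded_stream(std::size_t depth) : depth_(depth) {}
    bool full() const { return q_.size() >= depth_; }
    bool empty() const { return q_.empty(); }
    // Returns false and leaves the queue untouched when full.
    bool write_nb(const T &v) {
        if (full()) return false;
        q_.push(v);
        return true;
    }
    // Returns false and leaves `v` and the queue untouched when empty.
    bool read_nb(T &v) {
        if (empty()) return false;
        v = q_.front();
        q_.pop();
        return true;
    }
private:
    std::size_t depth_;
    std::queue<T> q_;
};
```

With a depth of 2, a third write_nb returns false, and a read_nb on an empty stream returns false without modifying the destination variable.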

+++++++++++++++++++++++++++++++++++++++++++++
HLS coding style

Vivado HLS defines the macro __SYNTHESIS__ when synthesis is performed.

#ifndef __SYNTHESIS__
	FILE *fp1; // The following code is ignored for synthesis
	char filename[255];
	sprintf(filename,"Out_apb_%03d.dat",apb);
	fp1=fopen(filename,"w");
	fprintf(fp1, "%d \n", apb);
	fclose(fp1);
#endif

HLS reuses the C test bench to verify the RTL design. No RTL test bench needs to be created when using Vivado HLS.
Xilinx recommends that you separate the top-level function for synthesis from the test bench, and that you use header files.

The optimizations unroll, partially unroll, flatten, and merge effectively make changes to the loop structure, as if the code was changed.
When variable loop bounds are present, Vivado HLS reports the latency as a question mark (?) instead of using exact values.
The solution to loops with variable bounds is to make the number of loop iterations a fixed value with conditional execution inside the loop.

LOOP_X:for (x=0; x<N; x++) {
	if (x<width) {
		out_accum += A[x];
	}
}
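
The rewrite is only valid while width never exceeds the fixed bound N. As a minimal sketch, the two functions below can be compared in a C test bench; the value of N, the int element type, and the function names are assumptions made for this illustration.

```cpp
#include <cassert>

const int N = 8;  // assumed compile-time maximum trip count

// Variable-bound version: the trip count depends on the run-time value of width.
int sum_variable(const int A[N], int width) {
    int out_accum = 0;
    for (int x = 0; x < width; x++) {
        out_accum += A[x];
    }
    return out_accum;
}

// Fixed-bound version: always N iterations, with the work guarded by a condition.
int sum_fixed(const int A[N], int width) {
    assert(width >= 0 && width <= N);  // the rewrite assumes width fits in N
    int out_accum = 0;
    for (int x = 0; x < N; x++) {
        if (x < width) {
            out_accum += A[x];
        }
    }
    return out_accum;
}
```

Both versions return the same sum for every width from 0 to N. The fixed-bound form trades a few idle iterations for a trip count that is known at compile time.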

When a loop or function is pipelined, any loop in the hierarchy below the loop or function being pipelined must be unrolled.
In a function containing a two-level loop nest, for example, pipelining the top-level function means both loops must be unrolled.

When choosing which level of the hierarchy to pipeline, keep in mind that pipelining the innermost loop gives the smallest hardware, with throughput that is generally acceptable for most applications.

Loop Parallelism

void loop_sequential(din_t A[N], din_t B[N], dout_t X[N], dout_t Y[N],
	dsel_t xlimit, dsel_t ylimit) 
{
	dout_t X_accum=0;
	dout_t Y_accum=0;
	int i,j;
	
	SUM_X:for (i=0;i<xlimit; i++) {
		X_accum += A[i];
		X[i] = X_accum;
	}
	SUM_Y:for (i=0;i<ylimit; i++) {
		Y_accum += B[i];
		Y[i] = Y_accum;
	}
}

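Here two for loops are placed one after the other in a single function, and both have variable bounds, which makes them hard to parallelize.
A good approach is to split them further into sub-functions.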

void sub_func(din_t I[N], dout_t O[N], dsel_t limit) {
	int i;
	dout_t accum=0;
	
	SUM:for (i=0;i<limit; i++) {
		accum += I[i];
		O[i] = accum;
	}

}
void loop_functions(din_t A[N], din_t B[N], dout_t X[N], dout_t Y[N],
dsel_t xlimit, dsel_t ylimit) 
{
	sub_func(A,X,xlimit);
	sub_func(B,Y,ylimit);
}

For large arrays, a good approach is to use dynamic memory allocation for simulation but a fixed-size array for synthesis. In simulation, the memory is then allocated on the heap rather than on the stack.

#ifdef __SYNTHESIS__
	// Use an arbitrary precision type & array for synthesis
	int32 la0[10000000], la1[10000000];
#else
	// Use an arbitrary precision type & dynamic memory for simulation
	int32 *la0 = malloc(10000000 * sizeof(int32));
	int32 *la1 = malloc(10000000 * sizeof(int32));
#endif

Arrays must be sized, for example Array[10]. Unsized arrays, such as Array[], are not supported.

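Accessing several elements of the same array in a single statement inside a for loop is poor coding style. The code below, for example, reads three array elements in one statement.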

SUM_LOOP:for(i=2;i<N;++i)
	sum += mem[i] + mem[i-1] + mem[i-2];

By performing pre-reads and manually pipelining the data accesses, only one array read is specified in each iteration of the loop.

dout_t array_mem_perform(din_t mem[N]) {
	din_t tmp0, tmp1, tmp2;
	dout_t sum=0;
	int i;
	tmp0 = mem[0];
	tmp1 = mem[1];
	SUM_LOOP:for (i = 2; i < N; i++) {
		//prepare window, import newest data into window
		tmp2 = mem[i];
		
		// process with window
		sum += tmp2 + tmp1 + tmp0;
		
		// move window, delete oldest data out of window
		tmp0 = tmp1;
		tmp1 = tmp2;
	}
	return sum;
}

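Defining several temporary variables, pre-reading data by hand, and moving the window by hand inside the for loop makes the accesses pipeline cleanly.
Inside the loop body:
first, the window is refreshed (import newest data into window), which yields a valid window;
next, the algorithm processes the data currently in the window;
finally, the window is moved (delete oldest data out of window), which prepares the next iteration.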
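
A quick way to gain confidence in this kind of rewrite is to compare it against the direct form in the C test bench. The sketch below does that under a few assumptions made for this illustration: N is 6, the element type is plain int, and both function names are hypothetical.

```cpp
const int N = 6;  // assumed array length for this sketch

// Direct form: three reads of mem[] in every iteration.
int sum_three_reads(const int mem[N]) {
    int sum = 0;
    for (int i = 2; i < N; ++i) {
        sum += mem[i] + mem[i - 1] + mem[i - 2];
    }
    return sum;
}

// Window form: two pre-reads, then exactly one read of mem[] per iteration.
int sum_one_read(const int mem[N]) {
    int tmp0 = mem[0];
    int tmp1 = mem[1];
    int sum = 0;
    for (int i = 2; i < N; ++i) {
        int tmp2 = mem[i];          // import newest data into window
        sum += tmp2 + tmp1 + tmp0;  // process with window
        tmp0 = tmp1;                // delete oldest data out of window
        tmp1 = tmp2;
    }
    return sum;
}
```

For mem = {1, 2, 3, 4, 5, 6} the four windows sum to 6, 9, 12 and 15, so both functions return 42.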

A static array behaves in C code as a memory does in RTL.

static int coeff[9] = {-2, 8, -4, 10, 14, 10, -4, 8, -2};

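An array with initial values should be declared static. It is then implemented as a RAM whose contents are initialized when the device is configured at power-up.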

Xilinx highly recommends using the static qualifier for arrays that are intended to be memories.
The const qualifier is also recommended when arrays are only read.
In the following example, array sin_table[256] is inferred as a memory and implemented as a ROM after RTL synthesis.
If complex assignments are used to initialize a ROM, placing the array initialization into a separate function allows a ROM to be inferred.

#include "array_ROM_math_init.h"
#include <math.h>
void init_sin_table(din1_t sin_table[256])
{
	int i;
	for (i = 0; i < 256; i++) {
		dint_t real_val = sin(M_PI * (dint_t)(i - 128) / 256.0);
		sin_table[i] = (din1_t)(32768.0 * real_val);
	}
}
dout_t array_ROM_math_init(din1_t inval, din2_t idx)
{
	short sin_table[256];
	init_sin_table(sin_table);
	return (int)inval * (int)sin_table[idx];
}

Because the sin() calls evaluate to constant values, no core is required in the RTL design to implement the sin() function.

In Vivado HLS designs, unions are often used to convert the raw bits of one data type to another data type.

typedef float T;
unsigned int value; // the "input" of the conversion
T myhalfvalue; // the "output" of the conversion
union my_conv_u
{
	unsigned int as_uint32;
	T as_floatingpoint;
} my_converter;

my_converter.as_uint32 = value;
myhalfvalue = my_converter.as_floatingpoint;

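Using a union for type conversion is a common C technique.
Note that a union does not convert between int and float numerically. In C, a union is only a base memory address: depending on which member is accessed, a different range of bytes is selected and interpreted as that member's type.
The most common use is converting between an int and a char array.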
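
The difference between converting a value and reinterpreting its bits can be seen on the host with a short sketch. It uses std::memcpy in place of the union, because reading a union member other than the one last written is defined in C (C99 and later) but not in standard C++. The function names are hypothetical, and the sketch assumes a 32-bit IEEE 754 float; it illustrates the concept only and is not a statement about what Vivado HLS synthesizes.

```cpp
#include <cstdint>
#include <cstring>

// Value conversion: the number is converted, so the bit pattern changes.
float convert_value(std::uint32_t value) {
    return static_cast<float>(value);
}

// Raw-bit reinterpretation: the same 32 bits are read back as a float.
float reinterpret_bits(std::uint32_t value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    float out;
    std::memcpy(&out, &value, sizeof out);  // copy the bytes, not the value
    return out;
}
```

For the input 0x3F800000, reinterpret_bits() yields 1.0f on an IEEE 754 platform, while convert_value() yields 1065353216.0f, the integer value of that same pattern.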

Static types in a function hold their value between function calls. The equivalent behavior in a
hardware design is a registered variable (a flip-flop or memory).
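
As a minimal sketch of this behavior in C simulation, the hypothetical function below keeps a running total in a static variable. The name running_total and the int type are assumptions made for this illustration.

```cpp
// Hypothetical accumulator: 'total' is initialized once and then keeps
// its value from one call to the next, like a register would.
int running_total(int x) {
    static int total = 0;
    total += x;
    return total;
}
```

Calling running_total(5) and then running_total(3) returns 5 and then 8, because the second call starts from the value left by the first.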

In the case of arrays, the const variable is implemented as a ROM in the final RTL design. Arrays specified with the const qualifier are (like statics) initialized in the RTL and in the FPGA bitstream.

Vivado HLS supports pointers to pointers for synthesis but does not support them on the top-level interface. Arrays of pointers can also be synthesized.

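Alternatively, the code must be modified to use an array on the interface instead of a pointer.
Offset addressing through pointer arithmetic on the interface is not supported by default in HLS; an array is the recommended way to access elements at an offset.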

Modify the code to use a streaming data type.
The following code example has been updated to ensure that it reads four unique values from
the test bench and writes two unique values. Because the pointer accesses are sequential and
start at location zero, a streaming interface type can be used during synthesis.

void pointer_stream_good ( volatile dout_t *d_o, volatile din_t *d_i) {
	din_t acc = 0;
	acc += *d_i;
	acc += *(d_i+1);
	*d_o = acc;
	acc += *(d_i+2);
	acc += *(d_i+3);
	*(d_o+1) = acc;
}

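First, the pointer offsets are used in separate statements that execute in sequence.
Second, each pointer is accessed in one direction only: read-only or write-only.
Third, the pointer offsets increase monotonically.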
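
For C simulation, the test bench has to supply all four input values and leave room for both outputs. The sketch below shows one way to do that. The int typedefs stand in for the design's header, and run_pointer_stream_test is a hypothetical helper; both are assumptions made for this illustration.

```cpp
typedef int din_t;   // assumed; the real type comes from the design's header
typedef int dout_t;  // assumed; the real type comes from the design's header

// Function under test, as given in the text above.
void pointer_stream_good(volatile dout_t *d_o, volatile din_t *d_i) {
    din_t acc = 0;
    acc += *d_i;
    acc += *(d_i + 1);
    *d_o = acc;
    acc += *(d_i + 2);
    acc += *(d_i + 3);
    *(d_o + 1) = acc;
}

// Hypothetical test-bench helper: feeds four inputs, checks the two outputs.
bool run_pointer_stream_test() {
    din_t in[4] = {1, 2, 3, 4};  // four unique input values, read in order
    dout_t out[2] = {0, 0};      // two output slots, written in order
    pointer_stream_good(out, in);
    // Expected: 1+2 after the first two reads, 1+2+3+4 after all four.
    return out[0] == 3 && out[1] == 10;
}
```

The inputs are consumed at offsets 0, 1, 2, 3 and the outputs produced at offsets 0, 1, which matches the sequential access order a streaming interface relies on.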