Article link: https://blog.csdn.net/qq_40884849/article/details/112733956

HLS-Based Optimization and Design Space Exploration for Applications with Variable Loop Bounds

Information

Paper: HLS-Based Optimization and Design Space Exploration for Applications with Variable Loop Bounds
Author: Jason Cong
Keywords: design space exploration (DSE), variable loop bounds, loop-carried dependency

Work

1. Perform source-to-source HLS code transformation to increase the utilization of compute resources for loops with variable bounds.
2. Describe a model that estimates cycle count and resource usage, so that DSE can be performed rapidly and with high accuracy.

“Our work is more focused on optimizing these innermost loops by exploiting fine-grain parallelism and pipelining, accurately estimating resource sharing among these serial loops, and efficiently allocating non-sharable resource for overall latency minimization.”

Methods

For work 1, they mainly deal with three kinds of loops with variable bounds: completely parallel, reduction, and prefix sum.
1. Variable bounds:
Why: HLS tools cannot fully unroll loops with variable bounds. A common optimization strategy is to pipeline the loop, but pipelining cannot exploit the data-level parallelism that exists in the loop. Unrolling the loop based on the maximum loop bound leads to a severe PE efficiency problem, because the processing elements (PEs) allocated for iterations beyond the actual bound sit idle.
Methods:
Partial Unrolling with Pipelining:
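A minimal sketch of partial unrolling with pipelining for a completely parallel loop, assuming an element-wise add as the loop body. The names vadd, MAX_N and UF are hypothetical and do not come from the paper; the pragmas are Vivado HLS style directives that a regular C++ compiler ignores. The outer loop runs ceil(n / UF) times and is pipelined, while the fixed-bound inner loop can be fully unrolled, so idle PEs only appear in the last, partially filled trip.

```cpp
#include <cassert>

constexpr int MAX_N = 1024;  // assumed maximum loop bound
constexpr int UF = 4;        // assumed partial unroll factor

// Element-wise add with a run-time bound n (0 <= n <= MAX_N).
void vadd(const float a[MAX_N], const float b[MAX_N], float c[MAX_N], int n) {
    assert(n >= 0 && n <= MAX_N);
    const int trips = (n + UF - 1) / UF;  // ceil(n / UF), known only at run time
    for (int i = 0; i < trips; ++i) {
#pragma HLS pipeline II=1
        // Fixed trip count UF, so this inner loop can be fully unrolled.
        for (int j = 0; j < UF; ++j) {
#pragma HLS unroll
            const int idx = i * UF + j;
            // Guard the tail: only the last trip can have inactive lanes.
            if (idx < n) c[idx] = a[idx] + b[idx];
        }
    }
}
```

With UF = 4, calling vadd(a, b, c, 10) runs three outer iterations, and only the last one has idle lanes (two of four).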
After transformation:

Result:

2. Variable reduction:
Why: “#pragma HLS unroll factor=xxx” is inefficient for floating-point reduction over a loop with a variable bound. If the loop bound is much smaller than the maximum, many adders are left idle. Inserting a pipelining directive is also not very efficient for floating-point reduction because of a true loop-carried dependency: floating-point operations have a long latency, so the result of the previous iteration is not available immediately.
Methods:
Early termination
The reduction tree in stage 2 is pipelined across each level.
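A rough sketch of a two-stage reduction with early termination, under stated assumptions. The split used here (stage 1 accumulates into LANES partial sums and stops after ceil(n / LANES) trips instead of running to the maximum bound; stage 2 combines the partial sums with an adder tree) is an illustration and may differ from the paper's exact transformation. The names reduce_var, MAX_N and LANES are hypothetical, and the achieved II depends on the tool and the adder latency.

```cpp
#include <cassert>

constexpr int MAX_N = 1024;  // assumed maximum loop bound
constexpr int LANES = 8;     // assumed number of partial sums (power of two)

// Sum of x[0..n-1] for a run-time bound n (0 <= n <= MAX_N).
float reduce_var(const float x[MAX_N], int n) {
    assert(n >= 0 && n <= MAX_N);
    float partial[LANES] = {0};  // all lanes start at zero

    // Stage 1: early termination, the trip count follows n rather than MAX_N.
    const int trips = (n + LANES - 1) / LANES;
    for (int i = 0; i < trips; ++i) {
#pragma HLS pipeline
        for (int j = 0; j < LANES; ++j) {
#pragma HLS unroll
            const int idx = i * LANES + j;
            if (idx < n) partial[j] += x[idx];
        }
    }

    // Stage 2: adder tree with log2(LANES) levels. Each level only reads
    // results of the previous level, which is what allows the levels to be
    // pipelined; the directive placement is tool specific and omitted here.
    for (int stride = LANES / 2; stride > 0; stride /= 2) {
        for (int j = 0; j < LANES / 2; ++j) {
#pragma HLS unroll
            if (j < stride) partial[j] += partial[j + stride];
        }
    }
    return partial[0];
}
```

Because the additions are reordered, the floating-point result can differ slightly from a strictly sequential sum when rounding occurs.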
After transformation:

Result:

3. Variable prefix sum:
Why: The true dependency between psum[k] and psum[k-1] prevents the initiation interval (II) from reaching 1 when the loop is pipelined and psum is a floating-point variable. Applying an unrolling directive results in serialized additions because of the dependency and does not bring any speedup.
Method:
Kogge–Stone algorithm
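A minimal software sketch of a Kogge–Stone style prefix sum, assuming an in-place inclusive scan over a run-time length n. The function name prefix_sum_ks, the constant MAX_N and the double buffer next are hypothetical choices for illustration, not the paper's code. At each level with distance d, every element adds the value d positions to its left from the previous level, so no iteration of the inner loop depends on another iteration of the same level.

```cpp
#include <cassert>

constexpr int MAX_N = 1024;  // assumed maximum loop bound

// In-place inclusive prefix sum of psum[0..n-1] for a run-time bound n.
void prefix_sum_ks(float psum[MAX_N], int n) {
    assert(n >= 0 && n <= MAX_N);
    float next[MAX_N];  // double buffer so each level reads only old values

    // ceil(log2(n)) levels; the distance d doubles at every level.
    for (int d = 1; d < n; d *= 2) {
        for (int k = 0; k < n; ++k) {
#pragma HLS pipeline II=1
            // Reads come from the previous level only, so there is no
            // loop-carried dependency inside this loop.
            next[k] = (k >= d) ? psum[k] + psum[k - d] : psum[k];
        }
        for (int k = 0; k < n; ++k) {
#pragma HLS pipeline II=1
            psum[k] = next[k];
        }
    }
}
```

For the input 1, 2, ..., 8 this produces 1, 3, 6, 10, 15, 21, 28, 36 after three levels (d = 1, 2, 4). The price for removing the dependency is more additions in total: roughly n per level over log2(n) levels, instead of n - 1 for the sequential loop.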