HLS-Based Optimization and Design Space Exploration for Applications with Variable Loop Bounds
Information
- Paper: HLS-Based Optimization and Design Space Exploration for Applications with Variable Loop Bounds
- Author: Jason Cong
- Key words: Design space exploration (DSE), variable loop bounds, loop-carried dependency
Work
1. Perform source-to-source HLS code transformations to increase the utilization of compute resources for loops with variable bounds.
2. Describe a cycle and resource estimation model that performs DSE rapidly and with high accuracy.
“Our work is more focused on optimizing these innermost loops by exploiting fine-grain parallelism and pipelining, accurately estimating resource sharing among these serial loops, and efficiently allocating non-sharable resource for overall latency minimization.”
Methods
For work 1, they mainly deal with three kinds of loops with variable bounds: completely parallel, reduction, and prefix sum.
1. Variable bounds:
Why: HLS tools cannot fully unroll loops with variable bounds. A common optimization strategy is to pipeline the loop, but pipelining alone cannot exploit the data-level parallelism that exists in the loop. Unrolling the loop based on the maximum loop bound leads to a severe PE efficiency problem, because many PEs sit idle whenever the actual bound is much smaller than the maximum.
Methods:
Partial Unrolling with Pipelining:
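A minimal sketch of what partial unrolling with pipelining can look like for a completely parallel loop, assuming an element-wise vector add as the workload. The names vadd_partial_unroll, MAX_N, and UF are hypothetical and not taken from the paper; the pragmas are the usual Vivado/Vitis HLS pipeline and unroll directives. The idea is that the inner loop has a fixed trip count (so it can be fully unrolled), while the outer loop keeps the variable trip count and is pipelined.

```cpp
#include <cassert>

constexpr int MAX_N = 1024; // assumed maximum loop bound (hypothetical)
constexpr int UF = 8;       // assumed partial unroll factor (hypothetical)

// Computes c[i] = a[i] + b[i] for i in [0, n), where n is only known at run time.
void vadd_partial_unroll(const float a[MAX_N], const float b[MAX_N],
                         float c[MAX_N], int n) {
    assert(n >= 0 && n <= MAX_N);
    // Outer loop: variable trip count of ceil(n / UF), pipelined.
    for (int i = 0; i < n; i += UF) {
#pragma HLS pipeline II=1
        // Inner loop: fixed trip count UF, so the tool can fully unroll it.
        for (int j = 0; j < UF; ++j) {
#pragma HLS unroll
            // Guard the tail when n is not a multiple of UF.
            if (i + j < n) {
                c[i + j] = a[i + j] + b[i + j];
            }
        }
    }
}
```

With this structure only UF adders are instantiated, and they are all busy except in the last outer iteration. In a real design the arrays would probably also need to be partitioned (for example with the array_partition pragma) so that UF elements can be accessed per cycle; that step is omitted here.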
After transformation:
Result:
2. Variable reduction:
Why: Applying "#pragma HLS unroll factor=xxx" is inefficient for floating-point reduction over a variable-bound loop. If the loop bound is much smaller than the maximum, many adders are left idle. Inserting a pipelining directive is also not very efficient for floating-point reduction because of a true loop-carried dependency: with the long latency of floating-point operations, the result of the previous iteration is not available immediately.
Methods:
Early termination
The reduction tree in stage 2 is pipelined across each level.
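One possible reading of the two-stage scheme, sketched under assumptions: stage 1 accumulates into UF partial sums and stops as soon as the actual bound n is reached (instead of running to the maximum bound), and stage 2 combines the partial sums with a fixed-size adder tree. The names reduce_variable, MAX_N, and UF are hypothetical, and the exact form of early termination in the paper may differ from this sketch.

```cpp
#include <cassert>

constexpr int MAX_N = 1024; // assumed maximum loop bound (hypothetical)
constexpr int UF = 8;       // assumed number of partial sums (hypothetical)
static_assert((UF & (UF - 1)) == 0, "UF must be a power of two for the tree");

// Returns x[0] + ... + x[n-1], where n is only known at run time.
float reduce_variable(const float x[MAX_N], int n) {
    assert(n >= 0 && n <= MAX_N);
    float partial[UF];
    for (int j = 0; j < UF; ++j) {
#pragma HLS unroll
        partial[j] = 0.0f;
    }
    // Stage 1: the trip count is ceil(n / UF), so the loop ends as soon as
    // the actual bound is reached rather than running to MAX_N / UF.
    for (int i = 0; i < n; i += UF) {
#pragma HLS pipeline
        for (int j = 0; j < UF; ++j) {
#pragma HLS unroll
            if (i + j < n) {
                partial[j] += x[i + j];
            }
        }
    }
    // Stage 2: fixed-size reduction tree with log2(UF) levels. Both loops
    // have compile-time bounds once unrolled, so each level can be registered.
    for (int stride = UF / 2; stride > 0; stride /= 2) {
#pragma HLS unroll
        for (int j = 0; j < stride; ++j) {
#pragma HLS unroll
            partial[j] += partial[j + stride];
        }
    }
    return partial[0];
}
```

Note that each partial[j] in stage 1 still carries a dependency from one outer iteration to the next, so the achievable II there is still limited by the floating-point adder latency unless the accumulation is interleaved further. The sketch only shows the overall structure.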
After transformation:
Result:
3. Variable prefix sum:
Why: The true dependency between psum[k] and psum[k-1] prevents II from reaching 1 when the loop is pipelined and psum is a floating-point variable. Applying an unrolling directive results in serialized additions due to the same dependency and does not bring any speedup.
Method:
Kogge–Stone algorithm
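The Kogge–Stone scheme computes an inclusive prefix sum over W elements in log2(W) levels, where every addition within a level is independent of the others. Below is a minimal sketch, assuming the variable-length input is processed in fixed-width chunks with a running carry between chunks. The chunking, the names kogge_stone_scan, prefix_sum_variable, MAX_N, and W are assumptions for illustration and are not taken from the paper.

```cpp
#include <cassert>

constexpr int MAX_N = 1024; // assumed maximum loop bound (hypothetical)
constexpr int W = 8;        // assumed chunk width (hypothetical)

// In-place inclusive prefix sum over W values using the Kogge-Stone pattern.
void kogge_stone_scan(float v[W]) {
    // log2(W) levels; at distance d, element k adds element k - d.
    for (int d = 1; d < W; d *= 2) {
#pragma HLS unroll
        float next[W];
        for (int k = 0; k < W; ++k) {
#pragma HLS unroll
            // All W additions in one level are independent of each other.
            next[k] = (k >= d) ? v[k] + v[k - d] : v[k];
        }
        for (int k = 0; k < W; ++k) {
#pragma HLS unroll
            v[k] = next[k];
        }
    }
}

// psum[k] = in[0] + ... + in[k] for k in [0, n), n only known at run time.
void prefix_sum_variable(const float in[MAX_N], float psum[MAX_N], int n) {
    assert(n >= 0 && n <= MAX_N);
    float carry = 0.0f; // sum of all previous chunks
    for (int i = 0; i < n; i += W) {
#pragma HLS pipeline
        float buf[W];
        for (int j = 0; j < W; ++j) {
#pragma HLS unroll
            buf[j] = (i + j < n) ? in[i + j] : 0.0f; // zero-pad the tail
        }
        kogge_stone_scan(buf);
        for (int j = 0; j < W; ++j) {
#pragma HLS unroll
            if (i + j < n) {
                psum[i + j] = carry + buf[j];
            }
        }
        carry += buf[W - 1];
    }
}
```

Compared with the serial psum[k] = psum[k-1] + in[k] chain, the only loop-carried value here is carry, which is updated once per W elements. The trade-off is that Kogge–Stone performs more additions in total than the serial version.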
For work 2: cycle / resource estimation
Separate the sharable and non-sharable resources of each loop.