COMBA: A Comprehensive Model-Based Analysis Framework for High Level Synthesis of Real Applications
Information
- Paper: COMBA: A Comprehensive Model-Based Analysis Framework for High Level Synthesis of Real Applications
- Author: Jieru Zhao
- Key words:
Backgrounds
Previous work supports only a limited number of pragmas, which is not sufficient for real applications.
Work
Framework overview:
1. Recursive data collection (RDC)
RDC analyzes the LLVM IR to compute the required parameters.
- Static information is obtained by analyzing the assembly instructions from the LLVM IR directly.
- Dynamic information depends on the code structure and the optimization pragmas applied, and is computed using the DFG.
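To make the "static information" step concrete, here is a minimal sketch that counts instruction opcodes in a textual LLVM IR fragment. The IR snippet, the function name `count_opcodes`, and the regex-based parsing are all assumptions for illustration; the notes do not say how COMBA actually parses the IR.

```python
import re
from collections import Counter

# Hypothetical LLVM IR fragment, used only for illustration.
IR = """
define i32 @mac(i32 %a, i32 %b, i32 %c) {
entry:
  %mul = mul nsw i32 %a, %b
  %add = add nsw i32 %mul, %c
  ret i32 %add
}
"""


def count_opcodes(ir_text):
    """Count instruction opcodes in textual LLVM IR (rough, regex-based)."""
    counts = Counter()
    for raw in ir_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, function headers/footers, and block labels.
        if not line or line.startswith((";", "define", "declare", "}")):
            continue
        if line.endswith(":"):
            continue
        # Optional "%result = " prefix, then the opcode mnemonic.
        match = re.match(r"(?:%[\w.]+\s*=\s*)?([a-z_]+)", line)
        if match:
            counts[match.group(1)] += 1
    return counts


opcode_counts = count_opcodes(IR)
print(dict(opcode_counts))
```

A real tool would more likely walk the IR through the LLVM API rather than match text; the regex version only shows which kind of per-opcode count a static pass could feed into the models.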
2. Performance Model
Covers five pragmas: loop unrolling, loop pipelining, array partitioning, function pipelining, and dataflow.
- For unrolling, latency is computed for three nested-loop structures: perfectly nested loops, non-perfectly nested loops, and multiple loops.
- Latency is considered in terms of three factors: pipeline depth, initiation interval (II), and trip count.
- Supports multi-dimensional array partitioning with three options: block, cyclic, and complete.
- Calculates the II of the function to measure how many function outputs are produced per cycle. (Vivado HLS unrolls all sub-loops completely and pipelines each sub-function inside a pipelined function.)
- Dataflow does not require sub-functions to be pipelined or sub-loops to be unrolled, but it can only be applied to functions or loops at the top level.
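As a rough illustration of how pipeline depth, II, and trip count combine, the sketch below uses the textbook relation for a pipelined loop, latency = depth + II * (trip count - 1), plus the trip-count reduction from unrolling. The function names are hypothetical, and the notes do not confirm that these are the exact equations used in the paper.

```python
def unrolled_trip_count(trip_count, factor):
    """Trip count after unrolling by `factor` (rounded up for partial unrolls)."""
    return -(-trip_count // factor)  # ceiling division


def sequential_loop_latency(iteration_latency, trip_count):
    """Non-pipelined loop: iterations run back to back."""
    return iteration_latency * trip_count


def pipelined_loop_latency(pipeline_depth, ii, trip_count):
    """Pipelined loop: first result after `pipeline_depth` cycles,
    then one new iteration starts every `ii` cycles."""
    return pipeline_depth + ii * (trip_count - 1)


# Example: 100 iterations, each 4 cycles deep.
print(sequential_loop_latency(4, 100))      # 400
print(pipelined_loop_latency(4, 1, 100))    # 103
print(pipelined_loop_latency(4, 2, unrolled_trip_count(100, 8)))  # 28
```

The last call combines both pragmas: unrolling by 8 leaves 13 iterations, and an assumed II of 2 then gives 4 + 2 * 12 = 28 cycles.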
3. Resource Model
Focuses on DSP and BRAM.
DSP (operators):
- LUT-based and small-bandwidth operations: the number of instances equals the number of operations.
- DSP-based operators: the number of operations used in one iteration divided by II.
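The two DSP rules above can be sketched as follows. Rounding the quotient up and the `dsp_per_op` parameter are assumptions added here (the notes only say "divided by II"); the function names are hypothetical.

```python
import math


def lut_op_instances(num_ops):
    """LUT-based / small operations: one hardware instance per operation."""
    return num_ops


def dsp_op_instances(ops_per_iteration, ii, dsp_per_op=1):
    """DSP-based operators: with initiation interval `ii`, one instance can be
    reused across `ii` cycles, so roughly ops/II instances are needed.
    Rounding up and `dsp_per_op` are assumptions, not taken from the paper."""
    return math.ceil(ops_per_iteration / ii) * dsp_per_op


# Example: 8 multiplications per iteration.
print(dsp_op_instances(8, 1))  # 8: no sharing possible at II = 1
print(dsp_op_instances(8, 2))  # 4: each instance serves two operations
print(dsp_op_instances(8, 3))  # 3
```

The intuition is that a larger II leaves idle cycles between iterations, so a single DSP-based operator can be time-shared among several operations.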
BRAM (arrays):
For scalars, the channel is a register. For arrays, the channels are ping-pong buffers by default. If dataflow is applied, the BRAM holds two copies: one for the output buffer and one for the input buffer.
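A minimal sketch of the ping-pong rule, assuming the per-array BRAM count is already known from elsewhere in the model. The function name and the `is_scalar` flag are hypothetical.

```python
def channel_bram(base_bram, dataflow, is_scalar=False):
    """BRAM blocks for one dataflow channel.

    Scalars go through registers (no BRAM). Arrays use a ping-pong buffer
    under dataflow, which doubles the BRAM count; otherwise one copy suffices.
    `base_bram` is the assumed BRAM count of a single copy of the array.
    """
    if is_scalar:
        return 0
    return 2 * base_bram if dataflow else base_bram


print(channel_bram(4, dataflow=False))                # 4
print(channel_bram(4, dataflow=True))                 # 8
print(channel_bram(4, dataflow=True, is_scalar=True)) # 0
```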
4. Metric-guided design space exploration (MGDSE)
1. Redundancy Elimination:
2. Guided Search:
- MGDSE gives top optimization priority to the longest sub-element, which is assumed to have the greatest influence.
- Checks whether the DSP and BRAM usage exceeds the available resources on the FPGA.
- Evaluates which array partitioning type (block or cyclic) is beneficial in dimension i.
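One way to picture the block-versus-cyclic decision is to count how many distinct banks a set of parallel accesses touches in one dimension: more banks means fewer port conflicts. The bank mappings below follow the usual definitions (cyclic interleaves elements, block keeps consecutive elements together). The selection heuristic and all function names are assumptions for illustration, not the metric used by MGDSE.

```python
def bank_of(index, size, factor, kind):
    """Bank that holds `index` when a dimension of length `size`
    is partitioned into `factor` banks."""
    if kind == "cyclic":
        return index % factor
    if kind == "block":
        block_len = -(-size // factor)  # ceiling division
        return index // block_len
    raise ValueError(f"unknown partition type: {kind}")


def distinct_banks(indices, size, factor, kind):
    """Number of different banks touched by the parallel accesses."""
    return len({bank_of(i, size, factor, kind) for i in indices})


def better_partition(indices, size, factor):
    """Pick the type that spreads the accesses over more banks
    (ties go to "block"); an assumed heuristic."""
    return max(("block", "cyclic"),
               key=lambda kind: distinct_banks(indices, size, factor, kind))


# Unrolled loop touching consecutive elements a[i], a[i+1], a[i+2], a[i+3]:
print(better_partition([0, 1, 2, 3], size=16, factor=4))   # cyclic
# Strided accesses a[i], a[i+4], a[i+8], a[i+12]:
print(better_partition([0, 4, 8, 12], size=16, factor=4))  # block
```

Consecutive accesses all fall into one block but into different cyclic banks, while accesses strided by the block length show the opposite behavior, which is why the choice has to be evaluated per dimension.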