Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
Information
- Paper: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
- Author: Jason Cong
- Keywords: performance-resource trade-offs
Background
- Identifying the optimal design configuration in a very large design space is difficult.
- Commercial HLS tools such as Xilinx SDAccel now accept an accelerator kernel implementation in C, C++ or OpenCL, and directly transform it down to an FPGA accelerator circuit, without the involvement of RTL design.
"Even for experienced hardware designers, it still requires a great deal of effort to resolve such complicated trade-offs and identify the optimal design choice."
Work
- Propose the composable, parallel and pipeline (CPP) microarchitecture as an accelerator design template to substantially reduce the design space.
- Introduce the CPP analytical model to capture the performance-resource trade-offs.
- Propose a series of pruning strategies to reduce the design space.
- Automate the entire accelerator generation and optimization process by implementing the AutoAccel framework, which automatically transforms a user C program into a high-quality accelerator behavioral description.
Methods
1. CPP Microarchitecture
Steps:
① Fits the input kernel into the CPP microarchitecture.
② Performs design space exploration to identify the optimal parameter configuration.
③ Transforms the input kernel code to the CPP microarchitecture description code.
Code transformations:
Improving off-chip DRAM bandwidth utilization.
Explicit data caching.
Loop pipelining and parallelization.
On-chip memory organization.
Three features to realize the code transformations:
Coarse-grained pipeline with data caching.
Loop scheduling.
On-chip buffer reorganization.
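One way to picture the coarse-grained pipeline with data caching is as three stages that each work on a different tile of data at the same time. The sketch below assumes the on-chip caches are double-buffered so that load, compute and store can overlap. The function name `pipeline_schedule` and the notion of a "tile" are illustrative, not taken from the paper.

```python
def pipeline_schedule(num_tiles):
    """Return, per time step, the tile index handled by (load, compute, store).

    Assumption: double-buffered on-chip caches, so stage k of tile i can
    overlap with stage k-1 of tile i+1. None means the stage is idle.
    """
    def tile_or_idle(index):
        return index if 0 <= index < num_tiles else None

    schedule = []
    # The pipeline needs two extra steps to drain compute and store.
    for step in range(num_tiles + 2):
        schedule.append((
            tile_or_idle(step),      # load stage fetches tile `step`
            tile_or_idle(step - 1),  # compute works on the tile loaded last step
            tile_or_idle(step - 2),  # store writes back the tile computed last step
        ))
    return schedule


for step, stages in enumerate(pipeline_schedule(3)):
    print(step, stages)
```

With three tiles the schedule has five steps, and only the middle step keeps all three stages busy, which is why this structure pays off mainly when the number of tiles is large.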
2. Analytical Model
The overall CPP microarchitecture consists of three stages: load, compute and store.
Performance Modeling:
Estimates an accelerator's overall execution cycle count, including all loops, submodules and standalone logic.
The proposed model recursively traverses all loops and modules until it reaches a loop or module that contains no substructure.
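The recursive traversal described above can be sketched as a small tree walk. This is a minimal illustration, not the paper's model: the `Node` fields, the assumption that children execute sequentially, and the pipelined-loop formula (depth plus initiation interval times remaining iterations) are all choices made here for the example.

```python
from dataclasses import dataclass, field
from math import ceil
from typing import List


@dataclass
class Node:
    """A loop or module in the kernel hierarchy (hypothetical structure)."""
    name: str
    trip_count: int = 1      # 1 for a plain module or standalone logic
    unroll: int = 1          # loop unroll factor
    pipelined: bool = False  # whether the loop is fine-grained pipelined
    ii: int = 1              # initiation interval when pipelined
    body_cycles: int = 0     # cycles of standalone logic inside this node
    children: List["Node"] = field(default_factory=list)


def estimate_cycles(node: Node) -> int:
    """Recursively estimate cycles; recursion stops at nodes with no children."""
    # Assumption: child loops/modules run one after another.
    inner = node.body_cycles + sum(estimate_cycles(c) for c in node.children)
    iterations = ceil(node.trip_count / node.unroll)
    if node.pipelined:
        # First iteration pays the full depth, the rest start every `ii` cycles.
        return inner + node.ii * (iterations - 1)
    return inner * iterations


inner = Node("inner", trip_count=64, unroll=4, pipelined=True, ii=1, body_cycles=10)
outer = Node("outer", trip_count=8, body_cycles=2, children=[inner])
print(estimate_cycles(inner))  # 25
print(estimate_cycles(outer))  # 216
```

Changing `unroll` or `pipelined` on a node and re-running the estimate is the kind of query a design space exploration would issue many times.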
Resource Modeling:
The resource model estimates the consumption of the four FPGA on-chip resources: BRAMs, LUTs, DSPs and FFs. The paper only demonstrates the BRAM and LUT models.
- For BRAM: The BRAM consumption of a hardware module consists of the BRAM blocks used by all its local buffers and those used by all its submodules.
- For LUT: The LUT consumption of a hardware module is composed of the number of LUTs used by all loops, submodules, BRAM buffers (for control logic) and the standalone logic.
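The two resource sums above translate naturally into recursive functions over the module hierarchy. The sketch below is illustrative only: the 18 Kb block size, the per-block control-logic cost `LUTS_PER_BRAM_BLOCK`, and the way a buffer is mapped onto blocks are simplifying assumptions, not figures from the paper.

```python
from dataclasses import dataclass, field
from math import ceil
from typing import List

BRAM_BITS = 18 * 1024      # assumed capacity of one BRAM block
LUTS_PER_BRAM_BLOCK = 10   # assumed control-logic LUTs per BRAM block


@dataclass
class Buffer:
    depth: int           # number of words
    width_bits: int      # bits per word
    partitions: int = 1  # number of partitions (tied to loop unroll factors)

    def bram_blocks(self) -> int:
        # Simplification: each partition is packed by total bits only.
        words = ceil(self.depth / self.partitions)
        return self.partitions * ceil(words * self.width_bits / BRAM_BITS)


@dataclass
class Module:
    buffers: List[Buffer] = field(default_factory=list)
    submodules: List["Module"] = field(default_factory=list)
    loop_luts: List[int] = field(default_factory=list)  # LUTs per loop
    logic_luts: int = 0                                  # standalone logic


def bram_usage(m: Module) -> int:
    """Local buffers plus everything used by submodules."""
    local = sum(b.bram_blocks() for b in m.buffers)
    return local + sum(bram_usage(s) for s in m.submodules)


def lut_usage(m: Module) -> int:
    """Loops + submodules + buffer control logic + standalone logic."""
    local_blocks = sum(b.bram_blocks() for b in m.buffers)
    return (sum(m.loop_luts)
            + sum(lut_usage(s) for s in m.submodules)
            + LUTS_PER_BRAM_BLOCK * local_blocks
            + m.logic_luts)


child = Module(buffers=[Buffer(4096, 32, partitions=4)], loop_luts=[500], logic_luts=100)
top = Module(buffers=[Buffer(1024, 16)], submodules=[child],
             loop_luts=[200, 300], logic_luts=50)
print(bram_usage(top), lut_usage(top))  # 9 1240
```

Note how raising `partitions` can increase the block count even though the stored data is unchanged, which is one source of the performance-resource trade-off.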
3. Design Space Exploration
DSE workflow:
Pruning strategies:
- Small loop flattening: it is better to flatten innermost loops with fixed, small trip counts; innermost loops with a trip count less than 16 are fully unrolled.
- Loop unroll factor pruning: loop unroll factors determine the number of on-chip BRAM partitions. This strategy is beneficial for programs with deep, complicated loop hierarchies.
- Saddleback search for loop unroll factors: based on a theorem given in the paper. This strategy works very well for programs with shallow loop hierarchies.
- Fine-grained pipeline pruning:
- Power-of-two buffer bit-widths and capacities:
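Three of the pruning strategies listed above lend themselves to short sketches. The code below is a hedged illustration: the threshold of 16 comes from these notes, but the function names are hypothetical, and `saddleback_search` is the textbook algorithm over a function that is strictly increasing in both arguments. The paper's theorem is not reproduced in these notes, so this may differ from the variant the authors use.

```python
FLATTEN_THRESHOLD = 16  # innermost loops below this trip count are fully unrolled


def should_fully_unroll(trip_count, is_innermost, has_fixed_bound):
    """Small loop flattening: only fixed, small, innermost loops qualify."""
    return is_innermost and has_fixed_bound and trip_count < FLATTEN_THRESHOLD


def power_of_two_candidates(lo, hi):
    """Restrict a parameter (e.g. buffer bit-width) to powers of two in [lo, hi]."""
    out, value = [], 1
    while value <= hi:
        if value >= lo:
            out.append(value)
        value *= 2
    return out


def saddleback_search(f, max_x, max_y, target):
    """Find all (x, y) in [0, max_x] x [0, max_y] with f(x, y) == target.

    Assumes f is strictly increasing in x and in y, so each step can discard
    a whole row or column: at most max_x + max_y + 2 evaluations of f.
    """
    hits = []
    x, y = 0, max_y  # start at the top-left corner of the grid
    while x <= max_x and y >= 0:
        value = f(x, y)
        if value < target:
            x += 1   # everything at this x with smaller y is too small as well
        elif value > target:
            y -= 1   # everything at this y with larger x is too large as well
        else:
            hits.append((x, y))
            x += 1
            y -= 1
    return hits


print(should_fully_unroll(8, True, True))   # True
print(power_of_two_candidates(8, 64))       # [8, 16, 32, 64]
# Toy example: pairs of factors (x + 1, y + 1) whose product is 12.
print(saddleback_search(lambda x, y: (x + 1) * (y + 1), 11, 11, 12))
```

The toy search visits far fewer points than the full 12 x 12 grid, which is the kind of saving that makes the strategy attractive when only two or three unroll factors need to be tuned.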
4. AutoAccel Framework
Workflow:
Idea
- Synthesizable: The input kernel must be synthesizable via commercial HLS tools. That is, it should not include recursive function calls or dynamic memory allocation.
- Cacheable: The memory footprint of any single instance of the top-level loop must be smaller than the FPGA on-chip memory capacity to ensure that the kernel computation and external memory transaction can be fully decoupled.
The paper first uses a polyhedral algorithm to determine whether an input kernel satisfies the two constraints above.
Based on our problem formulation, computational kernels featuring extensive random accesses on a large memory footprint, e.g., PageRank and the BFS algorithm, will probably not meet the Cacheable constraint.
The algorithm is more accurate for long-running programs and less accurate for short-running ones.
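The Cacheable constraint described above reduces to a size comparison once the per-iteration memory footprint is known. The sketch below assumes that footprint has already been derived (the paper uses polyhedral analysis for that step, which is not modeled here); `is_cacheable` and the example sizes are hypothetical.

```python
def is_cacheable(bytes_per_iteration, on_chip_bytes):
    """Check that one top-level loop iteration fits in on-chip memory.

    bytes_per_iteration: mapping from array name to the number of bytes that a
    single instance of the top-level loop touches (assumed to be given).
    """
    footprint = sum(bytes_per_iteration.values())
    return footprint < on_chip_bytes


ON_CHIP_BYTES = 4 * 1024 * 1024  # assumed on-chip capacity for the example

# A tiled kernel that touches two 16 KiB tiles per iteration fits easily.
print(is_cacheable({"a": 16 * 1024, "b": 16 * 1024}, ON_CHIP_BYTES))  # True

# Random access over a 1 GiB graph touches (potentially) all of it per iteration.
print(is_cacheable({"graph": 1024 ** 3}, ON_CHIP_BYTES))              # False
```

The second case mirrors the PageRank and BFS remark: when accesses are random over a large footprint, no tile size makes a single iteration fit on chip.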