Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
- Paper: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
- Author: Jason Cong
- Key words: performance-resource trade-offs
- Identifying the optimal design configuration in a tremendous design space is very difficult.
- Commercial HLS tools such as Xilinx SDAccel now accept an accelerator kernel implementation in C, C++ or OpenCL and compile it directly into an FPGA accelerator circuit, without involving RTL design.
“Even for experienced hardware designers, it still requires a great deal of effort to resolve such complicated trade-offs and identify the optimal design choice.”
- Propose the composable, parallel and pipeline (CPP) microarchitecture as an accelerator design template to substantially reduce the design space.
- Introduce the CPP analytical model to capture the performance-resource trade-offs.
- Propose a series of pruning strategies to reduce the design space.
- Automate the entire accelerator generation and optimization process by implementing the AutoAccel framework.
Automatically transform a user C program to a high-quality accelerator behavioral description.
①Fits the input kernel into the CPP microarchitecture.
②Performs design space exploration to identify the optimal parameter configuration.
③Transforms the input kernel code to the CPP microarchitecture description code.
Improving off-chip DRAM bandwidth utilization.
Explicit data caching.
On-chip memory organization.
Three features realize the code transformations:
Coarse-grained pipeline with data caching.
Loop scheduling.
On-chip buffer reorganization.
The overall CPP microarchitecture consists of three stages: load, compute and store.
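A minimal HLS-style C++ sketch of this load/compute/store organization, assuming a Xilinx toolchain (ap_int.h, DATAFLOW/PIPELINE/UNROLL/ARRAY_PARTITION pragmas); the tile size, 512-bit port width and placeholder computation are illustrative, not AutoAccel's generated code:

```cpp
#include <ap_int.h>

#define TILE 1024  // illustrative tile size; one tile must fit in on-chip BRAM

// Load stage: burst-read one tile from DRAM into an on-chip buffer.
// Wide (512-bit) sequential reads improve off-chip bandwidth utilization.
static void load(const ap_uint<512> *in, ap_uint<512> buf[TILE]) {
    for (int i = 0; i < TILE; ++i) {
#pragma HLS PIPELINE II=1
        buf[i] = in[i];
    }
}

// Compute stage: touches only on-chip buffers, so it can be unrolled and
// pipelined without competing for DRAM bandwidth (placeholder computation).
static void compute(const ap_uint<512> buf[TILE], ap_uint<512> out_buf[TILE]) {
    for (int i = 0; i < TILE; i += 4) {
#pragma HLS PIPELINE II=1
        for (int j = 0; j < 4; ++j) {
#pragma HLS UNROLL
            out_buf[i + j] = buf[i + j] + 1;
        }
    }
}

// Store stage: burst-write the result tile back to DRAM.
static void store(ap_uint<512> *out, const ap_uint<512> out_buf[TILE]) {
    for (int i = 0; i < TILE; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = out_buf[i];
    }
}

// Top level: DATAFLOW inside the tile loop overlaps the load, compute and
// store of successive tiles (the coarse-grained pipeline); cyclic buffer
// partitioning is the on-chip buffer reorganization that feeds the
// unrolled compute loop.
void cpp_kernel(const ap_uint<512> *in, ap_uint<512> *out, int num_tiles) {
    for (int t = 0; t < num_tiles; ++t) {
#pragma HLS DATAFLOW
        ap_uint<512> buf[TILE], out_buf[TILE];
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=out_buf cyclic factor=4
        load(in + (long)t * TILE, buf);
        compute(buf, out_buf);
        store(out + (long)t * TILE, out_buf);
    }
}
```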
The performance model estimates an accelerator's overall execution cycles, including all the loops, submodules and standalone logic.
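As a hedged illustration of the flavor of such a model (standard HLS loop-latency arithmetic, not the paper's exact equations): a pipelined loop with trip count $TC$, initiation interval $II$ and iteration latency $IL$ contributes roughly

$$\mathit{Cycles}_{loop} \approx IL + II \times (TC - 1),$$

and under the coarse-grained pipeline the steady-state time per tile is set by the slowest stage, i.e. $\mathit{Cycles}_{tile} \approx \max(\mathit{Cycles}_{load}, \mathit{Cycles}_{compute}, \mathit{Cycles}_{store})$.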
The resource model estimates the consumption of the four FPGA on-chip resource types: BRAMs, LUTs, DSPs and FFs. The paper only demonstrates the BRAM and LUT models.
- For BRAM: the BRAM consumption of a hardware module consists of the BRAM blocks used by all its local buffers and those used by all its submodules (see the sketch after this list).
- For LUT: the LUT consumption of a hardware module is composed of the LUTs used by all loops, submodules, BRAM buffers (for control logic) and the standalone logic.
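A minimal C++ sketch of the BRAM-counting arithmetic for one partitioned buffer, assuming Xilinx-style 18Kb blocks configurable as 1024 entries x 18 bits; the helper names and block geometry are assumptions, not the paper's exact model:

```cpp
#include <cstdint>
#include <cstdio>

static uint64_t ceil_div(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

// BRAM blocks needed for one on-chip buffer of `depth` words of
// `bit_width` bits, split into `partitions` independent banks.
// Each bank is rounded up to whole blocks separately, which is why
// over-partitioning inflates BRAM usage.
static uint64_t bram_blocks(uint64_t depth, uint64_t bit_width,
                            uint64_t partitions) {
    uint64_t depth_per_bank = ceil_div(depth, partitions);
    return partitions * ceil_div(bit_width, 18) * ceil_div(depth_per_bank, 1024);
}

int main() {
    // A 4096-deep, 32-bit buffer split into 8 banks: each bank is
    // 512 x 32b -> ceil(32/18) = 2 wide, ceil(512/1024) = 1 deep,
    // so 8 banks x 2 = 16 BRAM blocks in total.
    printf("%llu\n", (unsigned long long)bram_blocks(4096, 32, 8));
    return 0;
}
```

A module's total is then this count summed over its local buffers, plus the totals of its submodules.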
- Small loop flattening: it is better to flatten innermost loops with fixed, small trip counts; innermost loops with a trip count of less than 16 are fully unrolled.
- Loop unroll factor pruning: loop unroll factors determine the number of on-chip BRAM partitions, which makes this pruning beneficial for programs with deep, complicated loop hierarchies.
- Saddleback search for loop unroll factors: based on a monotonicity theorem over the two unroll-factor dimensions, the space can be traversed along its feasibility frontier in linear rather than quadratic time, as in the classic saddleback search of a sorted matrix (see the sketch after this list). This strategy works very well for programs with shallow loop hierarchies.
- Fine-grained pipeline pruning.
- Power-of-two buffer bit-widths and capacities: restrict the explored on-chip buffer bit-widths and capacities to powers of two, which shrinks the design space.
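A sketch of the saddleback-style walk over two unroll factors, under the monotonicity assumption above (resource use grows and cycle count shrinks as either factor grows); resource() and cycles() are toy stand-in models, not the paper's equations:

```cpp
#include <cstdint>
#include <cstdio>

// Toy analytical models (assumptions for illustration only).
static uint64_t resource(uint64_t uf1, uint64_t uf2) {
    return 100 * uf1 * uf2;             // LUTs scale with total parallelism
}
static uint64_t cycles(uint64_t uf1, uint64_t uf2) {
    const uint64_t work = 1u << 20;
    return work / (uf1 * uf2) + 50;     // cycles shrink with total parallelism
}

// Walk the feasibility frontier in O(max1 + max2) steps instead of
// evaluating all max1 * max2 configurations: as uf1 decreases, the
// largest budget-feasible uf2 can only grow, so uf2 never backtracks.
static void saddleback(uint64_t max1, uint64_t max2, uint64_t budget) {
    uint64_t best_cycles = UINT64_MAX, best1 = 1, best2 = 1;
    uint64_t uf2 = 1;
    for (uint64_t uf1 = max1; uf1 >= 1; --uf1) {
        while (uf2 < max2 && resource(uf1, uf2 + 1) <= budget) ++uf2;
        if (resource(uf1, uf2) <= budget && cycles(uf1, uf2) < best_cycles) {
            best_cycles = cycles(uf1, uf2);
            best1 = uf1;
            best2 = uf2;
        }
    }
    printf("best: uf1=%llu uf2=%llu cycles=%llu\n",
           (unsigned long long)best1, (unsigned long long)best2,
           (unsigned long long)best_cycles);
}

int main() { saddleback(64, 64, 100000); return 0; }
```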
- Synthesizable: The input kernel must be synthesizable via commercial HLS tools. That is, it should not include recursive function calls or dynamic memory allocation.
- Cacheable: The memory footprint of any single instance of the top-level loop must be smaller than the FPGA on-chip memory capacity, to ensure that the kernel computation and the external memory transactions can be fully decoupled.
Based on this problem formulation, computational kernels featuring extensive random accesses over a large memory footprint, e.g., PageRank and the BFS algorithm, will probably not meet the Cacheable constraint.