Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
- Paper: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
- Author: Jason Cong
- Key words: performance-resource trade-offs
- Identifying the optimal design configuration in a tremendous design space is very difficult.
- Commercial HLS tools such as Xilinx SDAccel now accept an accelerator kernel implementation in C, C++ or OpenCL and compile it directly into an FPGA accelerator circuit, without involving RTL design.
“Even for experienced hardware designers, it still requires a great deal of effort to resolve such complicated trade-offs and identify the optimal design choice.”
- Propose the composable, parallel and pipeline (CPP) microarchitecture as an accelerator design template to substantially reduce the design space.
- Introduce the CPP analytical model to capture the performance-resource trade-offs.
- Propose a series of pruning strategies to reduce the design space.
- Automate the entire accelerator generation and optimization process by implementing the AutoAccel framework.
Automatically transform a user C program to a high-quality accelerator behavioral description.
①Fits the input kernel into the CPP microarchitecture.
②Performs design space exploration to identify the optimal parameter configuration.
③Transforms the input kernel code to the CPP microarchitecture description code.
Improving off-chip DRAM bandwidth utilization.
Explicit data caching.
On-chip memory organization.
Three features realize the code transformations:
Coarse-grained pipeline with data caching.
Loop scheduling.
On-chip buffer reorganization.
The overall CPP microarchitecture consists of three stages: load, compute and store.
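A minimal HLS-style C++ sketch of this load/compute/store organization, assuming a Xilinx toolchain (ap_int.h, DATAFLOW/PIPELINE/UNROLL/ARRAY_PARTITION pragmas); the tile size, 512-bit port width and placeholder computation are illustrative, not AutoAccel's generated code:

```cpp
#include <ap_int.h>

#define TILE 1024  // illustrative tile size; one tile must fit in on-chip BRAM

// Load stage: burst-read one tile from DRAM into an on-chip buffer.
// Wide (512-bit) sequential reads improve off-chip bandwidth utilization.
static void load(const ap_uint<512> *in, ap_uint<512> buf[TILE]) {
    for (int i = 0; i < TILE; ++i) {
#pragma HLS PIPELINE II=1
        buf[i] = in[i];
    }
}

// Compute stage: touches only on-chip buffers, so it can be unrolled and
// pipelined without competing for DRAM bandwidth (placeholder computation).
static void compute(const ap_uint<512> buf[TILE], ap_uint<512> out_buf[TILE]) {
    for (int i = 0; i < TILE; i += 4) {
#pragma HLS PIPELINE II=1
        for (int j = 0; j < 4; ++j) {
#pragma HLS UNROLL
            out_buf[i + j] = buf[i + j] + 1;
        }
    }
}

// Store stage: burst-write the result tile back to DRAM.
static void store(ap_uint<512> *out, const ap_uint<512> out_buf[TILE]) {
    for (int i = 0; i < TILE; ++i) {
#pragma HLS PIPELINE II=1
        out[i] = out_buf[i];
    }
}

// Top level: DATAFLOW inside the tile loop overlaps the load, compute and
// store of successive tiles (the coarse-grained pipeline); cyclic buffer
// partitioning is the on-chip buffer reorganization that feeds the
// unrolled compute loop.
void cpp_kernel(const ap_uint<512> *in, ap_uint<512> *out, int num_tiles) {
    for (int t = 0; t < num_tiles; ++t) {
#pragma HLS DATAFLOW
        ap_uint<512> buf[TILE], out_buf[TILE];
#pragma HLS ARRAY_PARTITION variable=buf cyclic factor=4
#pragma HLS ARRAY_PARTITION variable=out_buf cyclic factor=4
        load(in + (long)t * TILE, buf);
        compute(buf, out_buf);
        store(out + (long)t * TILE, out_buf);
    }
}
```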
The performance model estimates an accelerator's overall execution cycles, including all the loops, submodules and standalone logic.
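As a hedged illustration of the flavor of such a model (standard HLS loop-latency arithmetic, not the paper's exact equations): a pipelined loop with trip count $TC$, initiation interval $II$ and iteration latency $IL$ contributes roughly

$$\mathit{Cycles}_{loop} \approx IL + II \times (TC - 1),$$

and under the coarse-grained pipeline the steady-state time per tile is set by the slowest stage, i.e. $\mathit{Cycles}_{tile} \approx \max(\mathit{Cycles}_{load}, \mathit{Cycles}_{compute}, \mathit{Cycles}_{store})$.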
The resource model estimates the consumption of the four FPGA on-chip resource types: BRAMs, LUTs, DSPs and FFs. The paper only demonstrates the BRAM and LUT models.
- For BRAM: the BRAM consumption of a hardware module consists of the BRAM blocks used by all its local buffers and those used by all its submodules (see the sketch after this list).
- For LUT: the LUT consumption of a hardware module is composed of the LUTs used by all loops, submodules, BRAM buffers (for control logic) and the standalone logic.
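A minimal C++ sketch of the BRAM-counting arithmetic for one partitioned buffer, assuming Xilinx-style 18Kb blocks configurable as 1024 entries x 18 bits; the helper names and block geometry are assumptions, not the paper's exact model:

```cpp
#include <cstdint>
#include <cstdio>

static uint64_t ceil_div(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

// BRAM blocks needed for one on-chip buffer of `depth` words of
// `bit_width` bits, split into `partitions` independent banks.
// Each bank is rounded up to whole blocks separately, which is why
// over-partitioning inflates BRAM usage.
static uint64_t bram_blocks(uint64_t depth, uint64_t bit_width,
                            uint64_t partitions) {
    uint64_t depth_per_bank = ceil_div(depth, partitions);
    return partitions * ceil_div(bit_width, 18) * ceil_div(depth_per_bank, 1024);
}

int main() {
    // A 4096-deep, 32-bit buffer split into 8 banks: each bank is
    // 512 x 32b -> ceil(32/18) = 2 wide, ceil(512/1024) = 1 deep,
    // so 8 banks x 2 = 16 BRAM blocks in total.
    printf("%llu\n", (unsigned long long)bram_blocks(4096, 32, 8));
    return 0;
}
```

A module's total is then this count summed over its local buffers, plus the totals of its submodules.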
- Small loop flattening: it is better to flatten innermost loops with fixed, small trip counts; innermost loops with a trip count of less than 16 are fully unrolled.
- Loop unroll factor pruning: loop unroll factors determine the number of on-chip BRAM partitions, which makes this pruning beneficial for programs with deep, complicated loop hierarchies.
- Saddleback search for loop unroll factors: based on a monotonicity theorem over the two unroll-factor dimensions, the space can be traversed along its feasibility frontier in linear rather than quadratic time, as in the classic saddleback search of a sorted matrix (see the sketch after this list). This strategy works very well for programs with shallow loop hierarchies.
- Fine-grained pipeline pruning.
- Power-of-two buffer bit-widths and capacities: restrict the explored on-chip buffer bit-widths and capacities to powers of two, which shrinks the design space.
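A sketch of the saddleback-style walk over two unroll factors, under the monotonicity assumption above (resource use grows and cycle count shrinks as either factor grows); resource() and cycles() are toy stand-in models, not the paper's equations:

```cpp
#include <cstdint>
#include <cstdio>

// Toy analytical models (assumptions for illustration only).
static uint64_t resource(uint64_t uf1, uint64_t uf2) {
    return 100 * uf1 * uf2;             // LUTs scale with total parallelism
}
static uint64_t cycles(uint64_t uf1, uint64_t uf2) {
    const uint64_t work = 1u << 20;
    return work / (uf1 * uf2) + 50;     // cycles shrink with total parallelism
}

// Walk the feasibility frontier in O(max1 + max2) steps instead of
// evaluating all max1 * max2 configurations: as uf1 decreases, the
// largest budget-feasible uf2 can only grow, so uf2 never backtracks.
static void saddleback(uint64_t max1, uint64_t max2, uint64_t budget) {
    uint64_t best_cycles = UINT64_MAX, best1 = 1, best2 = 1;
    uint64_t uf2 = 1;
    for (uint64_t uf1 = max1; uf1 >= 1; --uf1) {
        while (uf2 < max2 && resource(uf1, uf2 + 1) <= budget) ++uf2;
        if (resource(uf1, uf2) <= budget && cycles(uf1, uf2) < best_cycles) {
            best_cycles = cycles(uf1, uf2);
            best1 = uf1;
            best2 = uf2;
        }
    }
    printf("best: uf1=%llu uf2=%llu cycles=%llu\n",
           (unsigned long long)best1, (unsigned long long)best2,
           (unsigned long long)best_cycles);
}

int main() { saddleback(64, 64, 100000); return 0; }
```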
- Synthesizable: The input kernel must be synthesizable via commercial HLS tools. That is, it should not include recursive function calls or dynamic memory allocation.
- Cacheable: The memory footprint of any single instance of the top-level loop must be smaller than the FPGA on-chip memory capacity, to ensure that the kernel computation and the external memory transactions can be fully decoupled.
Based on this problem formulation, computational kernels featuring extensive random accesses over a large memory footprint, e.g., PageRank and the BFS algorithm, will probably not meet the Cacheable constraint.