Paper Notes (2. DAC 2018)

Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture

Information

  • Paper: Automated Accelerator Generation and Optimization with Composable, Parallel and Pipeline Architecture
  • Author: Jason Cong
  • Key words: performance-resource trade-offs

Background

  • Identifying the optimal design configuration in a tremendous design space is very difficult.
  • Commercial HLS tools such as Xilinx SDAccel now accept an accelerator kernel implementation in C, C++ or OpenCL, and directly transform it down to an FPGA accelerator circuit, without the involvement of RTL design.

“Even for experienced hardware designers, it still requires a great deal of effort to resolve such complicated trade-offs and identify the optimal design choice.”

Work

Propose the composable, parallel and pipeline (CPP) microarchitecture as an accelerator design template to substantially reduce the design space.

  • Introduce the CPP analytical model to capture the performance-resource trade-offs.
  • Propose a series of pruning strategies to reduce the design space.
  • Automate the entire accelerator generation and optimization process by implementing the AutoAccel framework.

Automatically transform a user C program to a high-quality accelerator behavioral description.

Methods

1.CPP Microarchitecture

Steps:
①Fits the input kernel into the CPP microarchitecture.
②Performs design space exploration to identify the optimal parameter configuration.
③Transforms the input kernel code to the CPP microarchitecture description code.
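
Step ② can be pictured as a search over parameter configurations that minimizes estimated cycles under a resource budget. The following is only a minimal sketch: the exhaustive enumeration, the function names (`explore`, `toy_cycles`, `toy_resource`) and the toy cost models are all assumptions made for illustration, not the procedure used in the paper.

```python
from itertools import product


def explore(param_space, cycles_of, resource_of, budget):
    """Exhaustively search param_space; return (best_config, best_cycles).

    param_space maps a parameter name to its candidate values.
    Configurations whose estimated resource use exceeds budget are skipped.
    Returns None if no configuration fits.
    """
    names = list(param_space)
    best = None
    for values in product(*(param_space[n] for n in names)):
        config = dict(zip(names, values))
        if resource_of(config) > budget:
            continue  # does not fit on the device
        cycles = cycles_of(config)
        if best is None or cycles < best[1]:
            best = (config, cycles)
    return best


# Hypothetical toy models: 1024 iterations split across `unroll` lanes.
def toy_cycles(cfg):
    per_iteration = 1 if cfg["pipeline"] else 4
    return (1024 // cfg["unroll"]) * per_iteration


def toy_resource(cfg):
    # Each lane costs 10 units; pipelining adds a fixed 20 units.
    return cfg["unroll"] * 10 + (20 if cfg["pipeline"] else 0)


space = {"unroll": [1, 2, 4, 8, 16], "pipeline": [False, True]}
best_config, best_cycles = explore(space, toy_cycles, toy_resource, budget=100)
```

Exhaustive enumeration grows multiplicatively with every added parameter, which is why the notes later turn to pruning strategies.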

Code transformations:
Improving off-chip DRAM bandwidth utilization.
Explicit data caching.
Loop pipelining/parallelization.
On-chip memory organization.

Three features to realize the code transformations:
Coarse-grained pipeline with data caching.
Loop scheduling.
On-chip buffer reorganization.

2.Analytical model

The overall CPP microarchitecture consists of three stages: load, compute and store.
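
To see why overlapping the three stages helps, the sketch below compares a sequential schedule with a coarse-grained pipeline over data tiles. It assumes a lockstep model in which the stages are double-buffered and each pipeline slot takes as long as the slowest stage; this is a simplification for illustration, not the equation from the paper, and the function names are hypothetical.

```python
def sequential_cycles(n_tiles, load, compute, store):
    """Each tile is loaded, computed and stored before the next one starts."""
    return n_tiles * (load + compute + store)


def pipelined_cycles(n_tiles, load, compute, store):
    """Load, compute and store of different tiles overlap (lockstep model).

    Every slot lasts max(load, compute, store) cycles. A three-stage
    pipeline needs n_tiles + 2 slots to fill and drain.
    """
    if n_tiles == 0:
        return 0
    return (n_tiles + 2) * max(load, compute, store)


# Example: 8 tiles, compute-bound kernel (cycle counts are made up).
seq = sequential_cycles(8, load=100, compute=300, store=100)
pipe = pipelined_cycles(8, load=100, compute=300, store=100)
```

With these numbers the pipelined schedule is bounded by the compute stage, so speeding up load or store alone would not reduce the total.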

Performance Modeling
Estimates an accelerator’s overall execution cycles, including all loops, submodules and standalone logic.
The proposed model recursively traverses all loops and modules until it reaches a loop or module that contains no substructure.
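
The recursive traversal can be sketched as follows. The node classes and the per-loop formulas are assumptions: the pipelined case uses the common HLS estimate (body latency plus II times the remaining iterations), and children of a module are assumed to run one after another. The paper's own equations may differ.

```python
from dataclasses import dataclass, field
from math import ceil
from typing import List, Union


@dataclass
class Logic:
    """Standalone logic with a fixed latency in cycles."""
    cycles: int


@dataclass
class Loop:
    trip_count: int
    unroll: int = 1
    pipelined: bool = False
    ii: int = 1  # initiation interval, used only when pipelined
    body: List["Node"] = field(default_factory=list)


@dataclass
class Module:
    children: List["Node"] = field(default_factory=list)


Node = Union[Logic, Loop, Module]


def estimate_cycles(node):
    """Recursively estimate cycles; recursion stops at nodes with no children."""
    if isinstance(node, Logic):
        return node.cycles
    if isinstance(node, Module):
        # Assumption: children execute sequentially.
        return sum(estimate_cycles(c) for c in node.children)
    body = sum(estimate_cycles(c) for c in node.body)
    iterations = ceil(node.trip_count / node.unroll)
    if iterations == 0:
        return 0
    if node.pipelined:
        # First iteration pays the full body latency, the rest start every II.
        return body + node.ii * (iterations - 1)
    return iterations * body


inner = Loop(trip_count=64, unroll=4, pipelined=True, ii=1, body=[Logic(5)])
outer = Loop(trip_count=10, body=[Logic(2), inner])
kernel = Module(children=[Logic(3), outer])
total = estimate_cycles(kernel)
```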

Resource Modeling:
The resource model estimates the consumption of the four FPGA on-chip resources: BRAMs, LUTs, DSPs and FFs. The paper only demonstrates the BRAM and LUT models.

  • For BRAM: The BRAM consumption of a hardware module consists of the BRAM blocks used by all its local buffers and those used by all its submodules.
  • For LUT: The LUT consumption of a hardware module is composed of the number of LUTs used by all loops, submodules, BRAM buffers (for control logic) and the standalone logic.
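
The BRAM rule above (local buffers plus submodules) maps naturally onto a recursive sum. In the sketch below, the 18 Kb block size and the capacity-only mapping of a buffer to blocks are assumptions; real tools also consider port widths and aspect ratios. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from math import ceil
from typing import List

BRAM_BITS = 18 * 1024  # assumed capacity of one BRAM block (18 Kb)


@dataclass
class Buffer:
    depth: int           # number of elements
    width_bits: int      # bits per element
    partitions: int = 1  # number of independent banks


@dataclass
class HwModule:
    buffers: List[Buffer] = field(default_factory=list)
    submodules: List["HwModule"] = field(default_factory=list)


def buffer_brams(buf):
    """Blocks for one buffer: every partition occupies whole blocks."""
    bits_per_partition = ceil(buf.depth / buf.partitions) * buf.width_bits
    return buf.partitions * ceil(bits_per_partition / BRAM_BITS)


def module_brams(mod):
    """BRAMs of a module = its local buffers + all of its submodules."""
    local = sum(buffer_brams(b) for b in mod.buffers)
    nested = sum(module_brams(m) for m in mod.submodules)
    return local + nested


leaf = HwModule(buffers=[Buffer(depth=512, width_bits=32)])
top = HwModule(
    buffers=[Buffer(depth=4096, width_bits=32, partitions=4)],
    submodules=[leaf],
)
top_brams = module_brams(top)
```

Note how partitioning raises the block count: the 4096-element buffer would need 8 blocks as a single bank as well, but finer partitioning of small buffers quickly wastes capacity because each bank rounds up to a whole block.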

3.Design Space Exploration

DSE workflow:
[Figure: DSE workflow]
Pruning strategies

  • Small loop flatten: it is better to flatten innermost loops with small, fixed trip counts; innermost loops with a trip count less than 16 are fully unrolled.
  • Loop unroll factor pruning: loop unroll factors determine the number of on-chip BRAM partitions. This pruning is beneficial for programs with deep, complicated loop hierarchies.
  • Saddleback search for loop unroll factors: based on the following theorem:
    [Figure: theorem statement]
    This strategy works very well for programs with shallow loop hierarchies.
  • Fine-grained pipeline pruning:
  • Power-of-two buffer bit-widths and capacities:
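
Since the theorem is not reproduced in these notes, the sketch below only illustrates the classic saddleback search technique on a two-dimensional table. It assumes that the cost (for example, resource usage) never decreases when either of two unroll factors grows; under that assumption the boundary of feasible configurations can be traced with a number of cost evaluations linear in rows plus columns instead of visiting every cell. The function name and the toy cost are hypothetical.

```python
def saddleback_frontier(n_rows, n_cols, cost, budget):
    """Trace the feasibility boundary of a monotone cost table.

    cost(i, j) must be non-decreasing in both i and j. For each row i,
    returns the largest column j with cost(i, j) <= budget, or -1 if no
    column in that row is feasible.
    """
    frontier = []
    j = n_cols - 1  # start at the last column of the first row
    for i in range(n_rows):
        # Monotonicity in i means the boundary can only move left.
        while j >= 0 and cost(i, j) > budget:
            j -= 1
        frontier.append(j)
    return frontier


# Toy example: two loops with candidate unroll factors 1, 2, 4, 8;
# resource use is assumed proportional to the product of the factors.
factors = [1, 2, 4, 8]
frontier = saddleback_frontier(
    len(factors), len(factors),
    cost=lambda i, j: factors[i] * factors[j],
    budget=8,
)
```

Each entry of `frontier` marks the most aggressive second unroll factor that still fits for a given first factor, so only those boundary points need to be compared on performance.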

4.AutoAccel Framework

Workflow:
[Figure: AutoAccel framework workflow]

Idea

  • Synthesizable: The input kernel must be synthesizable via commercial HLS tools. That is, it should not include recursive function calls or dynamic memory allocation.
  • Cacheable: The memory footprint of any single instance of the top-level loop must be smaller than the FPGA on-chip memory capacity to ensure that the kernel computation and external memory transaction can be fully decoupled.
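
The Cacheable constraint reduces to a size comparison once the per-iteration footprint is known. The sketch below takes the footprints as given by hand; the buffer names, sizes and the 4 MiB capacity are made-up values, and deriving the footprint automatically from the kernel code is not shown.

```python
def is_cacheable(buffer_bytes, on_chip_bytes):
    """True if one top-level loop iteration fits strictly within on-chip memory.

    buffer_bytes maps a buffer name to the bytes it touches in a single
    instance of the top-level loop.
    """
    return sum(buffer_bytes.values()) < on_chip_bytes


ON_CHIP_BYTES = 4 * 1024 * 1024  # hypothetical device capacity (4 MiB)

# A tiled kernel that touches three small tiles per iteration.
tiled = {"a_tile": 64 * 1024, "b_tile": 64 * 1024, "c_tile": 16 * 1024}

# A graph kernel whose random accesses may touch the whole vertex array.
random_access = {"vertex_values": 512 * 1024 * 1024}

tiled_ok = is_cacheable(tiled, ON_CHIP_BYTES)
random_ok = is_cacheable(random_access, ON_CHIP_BYTES)
```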

The paper first uses a polyhedral algorithm to determine whether an input kernel satisfies the two constraints above:
Based on our problem formulation, computational kernels featuring extensive random accesses on a large memory footprint, e.g., PageRank and the BFS algorithm, will probably not meet the Cacheable constraint.
The algorithm is fairly accurate for long-running programs and less accurate for short-running ones.
