Literature Notes (2): Eyeriss

tiaozhanzhe1900

Category: NPU

Article link: https://blog.csdn.net/tiaozhanzhe1900/article/details/83069854

Table of contents

1 Abbreviations
2 abstract & introduction
3 spatial architecture
2 CNN background
- 2.1 Challenges of CNNs
- 2.2 How CNNs differ from traditional image processing
3 Existing CNN dataflows
4 Row stationary dataflow

Title: Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks
Year: 2016
Conference: ISCA
Institutions: MIT/NVIDIA

1 Abbreviations

RS: row-stationary
NoC: network on chip
RF: register file
ISP: image signal processing
SA: Spatial architecture

2 abstract & introduction

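Main contributions of the paper:

A taxonomy of existing CNN dataflows
A spatial architecture based on the row stationary dataflow
An analysis framework for quantitatively comparing different CNN dataflows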

3 spatial architecture

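Definition of a spatial architecture: a class of accelerators that can exploit high compute parallelism using direct communication between an array of relatively simple processing engines (PEs).

Spatial architectures fall into two categories:

Coarse-grained spatial architectures
Fine-grained spatial architectures

Advantages of coarse-grained spatial architectures:

The operations in a CNN layer are highly uniform, so they can be parallelized heavily
Inter-PE communication can be exploited efficiently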


2 CNN background

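2.1 Challenges of CNNs

Challenges of CNNs:

Large data volumes: compute resources, bandwidth, and storage all become limiting factors
Adaptive processing: layer shapes differ widely between the layers of a single network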


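Reuse of input data (E is the ofmap width/height, R is the filter width/height):

convolutional reuse: each weight can be reused $E^2$ times, and each ifmap pixel can typically be reused $R^2$ times
filter reuse: a batch contains N ifmaps, so each weight can be reused a further N times
ifmap reuse: there are M filters in total, so each ifmap can be reused M times

Reducing storage requirements: operation scheduling, so that the $CR^2$ psums that make up one ofmap pixel are reduced to a single result as early as possible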
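The reuse counts above can be checked with a naive loop nest. The following is a minimal sketch, assuming stride 1, no padding, and square ifmaps and filters; the function name `conv_layer_with_counts` and the layer sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def conv_layer_with_counts(ifmap, filt):
    """Naive CONV layer (stride 1, no padding) that also counts data reuse.

    ifmap: shape (N, C, H, H); filt: shape (M, C, R, R).
    Returns the ofmap of shape (N, M, E, E) with E = H - R + 1, plus
    per-weight and per-pixel use counters.
    """
    N, C, H, _ = ifmap.shape
    M, _, R, _ = filt.shape
    E = H - R + 1
    ofmap = np.zeros((N, M, E, E))
    weight_uses = np.zeros(filt.shape, dtype=int)
    pixel_uses = np.zeros(ifmap.shape, dtype=int)
    for n in range(N):
        for m in range(M):
            for y in range(E):
                for x in range(E):
                    # C * R * R psums are reduced into this one ofmap pixel.
                    for c in range(C):
                        for i in range(R):
                            for j in range(R):
                                ofmap[n, m, y, x] += (
                                    ifmap[n, c, y + i, x + j] * filt[m, c, i, j]
                                )
                                weight_uses[m, c, i, j] += 1
                                pixel_uses[n, c, y + i, x + j] += 1
    return ofmap, weight_uses, pixel_uses

# Hypothetical layer shape: N=2, C=3, M=4, H=6, R=3, hence E=4.
rng = np.random.default_rng(0)
ofmap, weight_uses, pixel_uses = conv_layer_with_counts(
    rng.standard_normal((2, 3, 6, 6)), rng.standard_normal((4, 3, 3, 3))
)
print(weight_uses.min(), weight_uses.max())  # both equal N * E**2
print(pixel_uses.max())                      # M * R**2 for interior pixels
```

Every weight is used N·E² times (convolutional reuse times filter reuse). Interior ifmap pixels are used M·R² times, while pixels near the border are used less often, which is why the R² figure is only typical rather than exact.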

2.2 How CNNs differ from traditional image processing

CNN filter weights are obtained through training
ISP mainly uses 2D convolutions

3 Existing CNN dataflows

3.1 weight stationary(WS) dataflow

Each filter weight remains stationary in the RF to maximize convolutional reuse and filter reuse.

3.2 output stationary(OS) dataflow

The accumulation of each ofmap pixel stays stationary in a PE. The psums are stored in the same RF for accumulation to minimize the psum accumulation cost.
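The difference between the weight stationary and output stationary dataflows can be illustrated by the loop order of a 1D convolution. This is only a single-PE software analogy under assumed stride 1, not the multi-PE hardware mapping from the paper; both function names are hypothetical.

```python
def conv1d_weight_stationary(ifmap_row, filter_row):
    """Outer loop over weights: each weight is fetched once and held
    while it sweeps across every output position."""
    R = len(filter_row)
    E = len(ifmap_row) - R + 1
    psums = [0] * E
    for i in range(R):
        w = filter_row[i]  # the weight stays put (as if held in the RF)
        for x in range(E):
            psums[x] += w * ifmap_row[x + i]  # psums are revisited R times
    return psums

def conv1d_output_stationary(ifmap_row, filter_row):
    """Outer loop over outputs: one accumulator is held until its
    ofmap pixel is complete, then written out once."""
    R = len(filter_row)
    E = len(ifmap_row) - R + 1
    ofmap_row = []
    for x in range(E):
        acc = 0  # the psum stays put (as if held in the RF)
        for i in range(R):
            acc += filter_row[i] * ifmap_row[x + i]  # weights are re-read E times
        ofmap_row.append(acc)
    return ofmap_row

row = [1, 2, 3, 4, 5]
weights = [1, 0, -1]
print(conv1d_weight_stationary(row, weights))
print(conv1d_output_stationary(row, weights))
```

Both orderings compute the same result. What changes is which operand stays in place and which one has to be moved repeatedly, and that is the quantity the dataflow taxonomy is concerned with.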

3.3 no local reuse(NLR) dataflow

Ifmaps and psums are reused through inter-PE communication rather than in the RF, somewhat like a systolic array.

4 Row stationary dataflow

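4.1 1D convolution primitives

Each primitive operates on one row of filter weights and one row of ifmap pixels and produces one row of psums. Each primitive is computed on a single PE.
Within a PE, the row of weights is reused many times; this is convolutional reuse.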
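A 1D convolution primitive can be sketched as follows. This is a minimal illustration assuming stride 1; the name `pe_1d_primitive` and the reuse counter are hypothetical additions and do not model the actual PE datapath.

```python
def pe_1d_primitive(filter_row, ifmap_row):
    """One row of weights x one row of ifmap pixels -> one row of psums.

    Returns the psum row and how many times each weight was reused.
    """
    R = len(filter_row)
    E = len(ifmap_row) - R + 1
    rf_filter = list(filter_row)  # kept in the PE for the whole primitive
    psum_row = []
    weight_reads = 0
    for x in range(E):
        window = ifmap_row[x:x + R]  # sliding window of R ifmap pixels
        acc = 0
        for w, p in zip(rf_filter, window):
            acc += w * p
            weight_reads += 1
        psum_row.append(acc)
    return psum_row, weight_reads // R

psums, reuse_per_weight = pe_1d_primitive([1, 2, 3], [1, 2, 3, 4, 5])
print(psums, reuse_per_weight)
```

Each weight is read once per output position, so within one primitive it is reused E times, where E is the length of the psum row.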

4.2 two-step primitive mapping

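The first step is logical mapping: in principle a layer needs a very large number of 1D convolution primitives, far more than the PEs available in the hardware array. The second step is physical mapping, which folds the logical array onto the physical PE array.
Assume the ifmap batch size is N, the number of input channels is C, and the number of output channels is M. Then:

Each weight can be reused N times
Each ifmap pixel can be reused M times
The psums from the C channels can be accumulated together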
4.3 energy-efficient data handling

Data movement costs energy, and storage is organized into several levels:

Register file (RF): after the first phase of folding, the RF is used to exploit all types of data movement.
Computation inside a primitive allows filter reuse, input data sharing between folded primitives gives ifmap reuse, and psums are accumulated across primitives.
Array (inter-PE communication): multiple sets are mapped spatially across the physical PE array, presumably during the folding step of physical mapping.
Global buffer: used after the second phase of folding.
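Why the storage level matters can be sketched with a simple cost model. The relative costs below are illustrative assumptions in the spirit of the paper's normalized energy table (RF access as the unit) and should be checked against the paper; the function name and the two access patterns are hypothetical.

```python
# Relative energy per access, normalized to one RF access.
# ASSUMPTION: illustrative values only, not measured data.
RELATIVE_COST = {"DRAM": 200, "global_buffer": 6, "array": 2, "RF": 1}

def data_movement_energy(access_counts, cost=RELATIVE_COST):
    """Total normalized energy for a dict of {storage level: number of accesses}."""
    return sum(cost[level] * n for level, n in access_counts.items())

# Hypothetical pattern A: 1000 operand reads all served from DRAM.
no_reuse = {"DRAM": 1000}
# Hypothetical pattern B: 10 DRAM reads and 10 buffer reads, then 1000 RF reads.
with_reuse = {"DRAM": 10, "global_buffer": 10, "RF": 1000}

print(data_movement_energy(no_reuse))
print(data_movement_energy(with_reuse))
```

Under these assumed costs, serving most accesses from the RF is far cheaper than going to DRAM each time, which is the motivation for exploiting reuse at the lowest level of the hierarchy first.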