DaDianNao: A Machine-Learning Supercomputer


王百万.


Article link: https://blog.csdn.net/weixin_44998570/article/details/121793867


2014 MICRO DaDianNao

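Summary
1. Keeps DianNao's loop tiling to improve locality and reuse, and keeps the NFU for loop unrolling to improve parallelism.
2. Caches the Synapses, which have a poor reuse pattern, in a large amount of on-chip storage (eDRAM) placed close to the NFU, reducing the number of memory accesses.
3. Uses multiple nodes and multiple NFUs so that a large number of output neurons are computed simultaneously, greatly increasing parallelism.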

DianNao vs DaDianNao

Limitations of DianNao

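DianNao's main idea is to tile the large blocks of data that are used repeatedly in each layer (Input Neurons, Synapses, Output Neurons) and reuse them tile by tile, improving the locality of the computation. Three buffers provide this reuse (NBin, SB, NBout). How well does the reuse work in each of them?
For NBin, take a Fully Connected Layer as an example: each row holds Ti Input Neurons and there are Tii / Ti rows. While the corresponding Tnn Output Neurons are being computed, the Ti Input Neurons needed for each group of Tn Output Neurons are already in NBin, so reuse is good and main memory is almost never accessed. In a Convolution Layer, a large number of Input Neurons are reused each time the Sliding Window moves, or when several kernels are applied at the same position, so main memory is again almost never accessed. [DaDianNao keeps NBin.]
For SB, in a Convolution Layer with a Shared Kernel the Synapses can almost entirely be cached in SB, so main memory is almost never accessed and the buffer works well. In a Convolution Layer with Private Kernels or in a Fully Connected Layer, however, the Tn x Ti Synapses used in each computation are all different. There is no reuse pattern to exploit, so SB cannot act as a cache and the NFU effectively fetches its data straight from main memory. DianNao runs at 0.98 GHz. In these two cases, if the NFU is expected to produce Tn outputs per cycle, the minimum main-memory bandwidth required is (256 x 2 Byte x 0.98 x 10^9 Hz) / 2^30 = 467.30 GB/s. [DaDianNao does not keep SB; it holds the Synapses in the on-chip storage of multiple chips, so no main memory is needed.]
For NBout, the buffer holds the Tnn Partial Sums and Final Sums being computed in parallel by the NFU, but only Tnn of them can be processed at a time (Tnn << Num of Output Neurons), and the Tnn outputs must then be written back to main memory. [DaDianNao keeps NBout, but with multiple chips all Output Neurons can be computed at the same time, with no main memory.]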
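The bandwidth estimate for the SB case can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function name and its parameters are hypothetical, and it assumes Tn = Ti = 16 (giving the 256 synapses per cycle used in the text) and 2-byte synapses.

```python
# Hypothetical helper: names and defaults are illustrative, not taken from the paper.
def min_synapse_bandwidth(tn=16, ti=16, bytes_per_synapse=2, freq_hz=0.98e9):
    """Bandwidth needed when every cycle consumes Tn x Ti fresh synapses.

    The result is expressed in units of 2^30 bytes per second,
    matching the division by 2^30 in the text.
    """
    synapses_per_cycle = tn * ti                      # 256 distinct synapses per cycle
    bytes_per_cycle = synapses_per_cycle * bytes_per_synapse
    return bytes_per_cycle * freq_hz / 2**30


bandwidth = min_synapse_bandwidth()
print(f"{bandwidth:.2f} GB/s")
```

With the default values the result rounds to the 467.30 GB/s quoted above, a rate the NFU would need sustained on every cycle, which is what motivates keeping the Synapses on chip.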

Characteristics of DaDianNao

Custom multi-chip machine-learning architecture.
High-degree parallelism at a reasonable area cost.
High internal bandwidth and low external communications.
Each chip (= node) contains specialized logic together with enough RAM that the sum of the RAM of all chips can contain the whole neural network, requiring no main memory.
Each node (= chip) contains computational logic, eDRAM, and the router fabric.
Inference and Training.

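Advantages of DaDianNao

Keeps NBin (SRAM) to reuse the tiled Input Neurons.
Replaces SB with a large amount of on-chip eDRAM and has no main memory. For Convolution Layers with Private Kernels and for Fully Connected Layers, the Synapses are fed to the NFU from nearby storage without any main-memory access.
Multi-chip: all Output Neurons are computed in parallel and no longer need to be processed in blocks.
Supports Inference and Training, and targets server-side use.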

DaDianNao Accelerator Architecture

Overview

Synapses are always stored close to the neurons which will use them, minimizing data movement. The architecture is fully distributed; there is no main memory.
The design focuses on memory-access behavior rather than on computation: each node footprint is massively biased towards storage rather than computations.
Synapses are kept in each chip's local eDRAM and only Neurons are transferred between chips, which lowers the external (across chips) bandwidth.
Enable high internal bandwidth by breaking down the local storage into many tiles. The tiles here serve to reduce wiring area, at the cost of a lower degree of parallelism.

Node(=chip)

[Figure 5: node (chip) layout on the left, tile structure on the right]

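Locate the storage for synapses close to neurons and make it massive; **move only neurons and keep synapses in a fixed storage location** (the eDRAM in the right half of Figure 5), because there are many more synapses than neurons.
The initial design tried to directly increase the number of Neurons computed in parallel in the NFU, from 16 x 16 to 64 x 64. Because eDRAM replaces SB, one cycle must fetch 64 x 64 x 16 bit = 65536 bit. To speed up access the eDRAM is split into 4 banks, so the total number of wires needed is 65536 x 4, which takes up too much of a single chip's area. The chip was therefore reorganized into tiles, giving the structure in Figure 5.
The chip (= node) structure is shown on the left of the figure and the structure of each tile on the right. The blue eDRAM at the center of the left panel is called the Central eDRAM; the one in the right panel is the Local eDRAM, or tile eDRAM. The Central eDRAM is connected to the tiles through a fat tree.
The Central eDRAM stores one block of the input neurons of the layer currently being computed, and the tile eDRAM stores the Synapses that layer needs. The fat tree distributes the Input Neurons and collects the computed Output Neurons back into the Central eDRAM. The NFU still computes 16 x 16 in parallel.
The NFU pipeline stages are reconfigured to set up each layer for training or inference.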
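The wiring argument above is simple arithmetic, sketched below. The constant and function names are hypothetical; the figures assume 16-bit synapses and the 4-bank split described in the text, and the per-tile width only follows from the 16 x 16 NFU that each tile keeps.

```python
BITS_PER_SYNAPSE = 16  # 16-bit values, as in the 64 x 64 x 16 bit figure above


def synapse_bits_per_cycle(inputs, outputs, bits=BITS_PER_SYNAPSE):
    """Bits of synapse data an inputs x outputs NFU consumes in one cycle."""
    return inputs * outputs * bits


# Rejected monolithic design: one 64 x 64 NFU fed by an eDRAM split into 4 banks.
monolithic_bits = synapse_bits_per_cycle(64, 64)
monolithic_wires = monolithic_bits * 4

# Tiled design: each tile keeps a 16 x 16 NFU next to its own local eDRAM.
tile_bits = synapse_bits_per_cycle(16, 16)

print(monolithic_bits, monolithic_wires, tile_bits)
```

Each tile only needs a 4096-bit path to its own eDRAM, compared with the 262144 wires of the monolithic layout, which is the area saving the tiling is after.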


Node workflow example (Programming, Code Generation)


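The input data (value of the input layer) is initially partitioned across nodes and stored in a central eDRAM bank: suppose we have only 4 Nodes; all Input Neurons are split into 4 parts and stored in the Central eDRAM of the 4 Nodes.
All the tiles will get the same input neurons, and read synaptic weights from their local (tile) eDRAM, then write back the partial sums (of output neurons) to their local NBout SRAM: the 16 tiles of each Node receive the same input, for example Node1 tile[1…16], but the outputs each tile computes are partitioned across tile 1, tile 2, ..., tile 16 of Node1. All the output neurons assigned to Node1 are therefore computed in parallel. With the same 16 x 16 NFU this gives more parallelism than DianNao, but less than the 64 x 64 design (which was dropped because of its wiring area). During the computation the partial sums are stored in the NBout SRAM (in the spirit of DianNao).
The NFU in each tile will finalize the sums, apply the transfer function, and store the output values back to the central eDRAM: once the Output Neurons in each Node have been computed, the sums are sent back to the Central eDRAM. Note one important point, taking Node1 as the example: the output neurons it is responsible for have so far only been computed from Synapse[1, 1] and the matching part of the input. They still need Synapse[1, 2: 4] and the matching parts of the input. This is the Multi-Node Mapping discussed next (everything above concerned a single Node; how are the Nodes interconnected?).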
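The single-node flow described above (broadcast the same inputs to every tile, accumulate partial sums locally, then finalize) can be sketched in plain Python. This is a simplified model, not the paper's code generation: the function names and the nested-list layout are hypothetical, and ReLU is used as the transfer function purely as an assumption.

```python
def tile_partial_sums(inputs, tile_weights):
    """Each tile gets the same inputs and uses only its own local weights.

    tile_weights[t][o][i] is the synapse from input i to output o of tile t
    (a stand-in for the contents of that tile's eDRAM).
    Returns one list of partial sums per tile (a stand-in for NBout).
    """
    nbout = []
    for weights in tile_weights:                 # tiles work independently
        sums = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
        nbout.append(sums)
    return nbout


def finalize(nbout, transfer=lambda s: max(0.0, s)):
    """Apply the transfer function (ReLU assumed here) and gather the outputs,
    modelling the write-back to the central eDRAM."""
    return [transfer(s) for sums in nbout for s in sums]


inputs = [1.0, 2.0]
tile_weights = [
    [[1, 0], [0, 1]],      # tile 0 owns outputs 0 and 1
    [[1, 1], [-1, -1]],    # tile 1 owns outputs 2 and 3
]
partial = tile_partial_sums(inputs, tile_weights)
outputs = finalize(partial)
print(partial, outputs)
```

In a multi-node run the finalize step would only happen after the partial sums have also been accumulated over the input blocks held by the other nodes.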

Multi-Node Mapping


At the beginning of a layer, the input neurons are distributed across all nodes, in the form of 3D rectangles corresponding to all feature maps of a subset of a layer. These input neurons will be first distributed to all node tiles through the (fat tree) internal network. This is the single-Node workflow described above, but the paper continues: Simultaneously, the node control starts to send the block of input neurons to the rest of the nodes through the mesh. In other words, every Node sends its own block of input neurons to the other Nodes, which is how Node 1 obtains the input corresponding to Synapse[1, 2: 4].
My understanding: in the left figure, all Input Neurons of the Convolution Layer are split into 4 parts, each containing all Channels, and the tile eDRAM of every Node stores all the kernels for all output Feature Maps, so inter-chip communication is very small, as the red region shows.
For a Fully Connected Layer, each Node must broadcast its Input Neurons Block to the other Nodes, using a computing-and-forwarding communication scheme.

A node can start processing the newly arrived block of input neurons as soon as it has finished its own computations, and has sent the previous block of input neurons; so the decision is made locally, there is no global synchronization or barrier.
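The Fully Connected case can be illustrated with a small sequential simulation. It is a sketch under stated assumptions: the function name, the `weights[n][b]` layout and the rotation order are all hypothetical. Real nodes decide locally when to process an arriving block, with no barrier; the fixed order here only makes the simulation deterministic.

```python
def fully_connected_multi_node(input_blocks, weights):
    """Accumulate each node's output neurons over all input blocks.

    input_blocks[b] is the block of input neurons initially held by node b.
    weights[n][b][o][i] is the synapse, stored on node n, from input i of
    block b to output o of node n (synapses never move between nodes).
    """
    num_nodes = len(input_blocks)
    partial = [[0.0] * len(weights[n][0]) for n in range(num_nodes)]
    for step in range(num_nodes):
        for n in range(num_nodes):
            # step 0: the node's own block; later steps: blocks forwarded by other nodes
            b = (n + step) % num_nodes
            block = input_blocks[b]
            for o, row in enumerate(weights[n][b]):
                partial[n][o] += sum(w * x for w, x in zip(row, block))
    return partial


input_blocks = [[1.0, 2.0], [3.0, 4.0]]          # 2 nodes, 2 input neurons each
weights = [
    [[[1, 0]], [[0, 1]]],                        # node 0: one output neuron
    [[[1, 1]], [[1, 1]]],                        # node 1: one output neuron
]
result = fully_connected_multi_node(input_blocks, weights)
print(result)
```

Only the input blocks travel between nodes; each node's weights stay in place, which matches the "move only neurons" principle stated earlier.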