DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning (the official CACM version)

  • Published in Communications of the ACM
  • https://cacm.acm.org/

  • Reference:
Chen Y., Chen T., Xu Z., et al.
DianNao Family: Energy-Efficient Hardware Accelerators for Machine Learning.
Communications of the ACM, 2016, 59(11): 105-112.
  • https://cacm.acm.org/magazines/2016/11/209123-diannao-family/fulltext
  • Found the paper, and downloaded it.

Abstract

  • ML techniques are pervasive,
    • used in a broad range of systems (from embedded systems to data centers)
  • as computers evolve toward heterogeneous multicores composed of a mix of cores and hardware accelerators,
  • designing hardware accelerators for ML can simultaneously achieve high efficiency and broad application scope

  • efficient computational primitives are important for a hardware accelerator,
  • but inefficient memory transfers can potentially void the throughput, energy, or cost advantages of accelerators,
  • an Amdahl's law effect
  • memory should therefore become a first-order concern,
  • just like in processors, rather than an element factored into accelerator design as a second step

  • the DianNao family is a series of hardware accelerators designed for ML (especially neural networks), with an emphasis on the impact of memory on accelerator design, performance, and energy
  • on a number of representative neural network layers,
  • a 64-chip DaDianNao system (a member of the DianNao family) achieves a speedup of 450.65x over a GPU
  • and reduces energy by 150.31x on average

1 INTRODUCTION

  • designing hardware accelerators which realize the best possible tradeoff between flexibility and efficiency is becoming a prominent
    issue.
  • The first question is: for which category of applications should one primarily design accelerators?
  • Together with the architecture trend towards accelerators, a second simultaneous and significant trend in high-performance and embedded applications is developing: many of the emerging high-performance and embedded applications, from image/video/audio recognition to automatic translation, business analytics, and robotics rely on machine learning
    techniques.
  • This trend in application comes together with a third trend in machine learning (ML) where a small number
    of techniques, based on neural networks (especially deep learning techniques 16, 26 ), have been proved in the past few
    years to be state-of-the-art across a broad range of applications.
  • As a result, there is a unique opportunity to design accelerators having significant application scope as well as
    high performance and efficiency. 4

Paragraph 2

  • Currently, ML workloads are mostly executed on multicores using SIMD, 44 on GPUs, 7 or on FPGAs. 2
  • However, the aforementioned trends have already been identified by researchers who have proposed accelerators implementing,
  • for example, Convolutional Neural Networks (CNNs) 2 or Multi-Layer Perceptrons 43 ;
  • accelerators focusing on other domains, such as image processing, also propose efficient implementations of some of the computational primitives used by machine-learning techniques, such as convolutions. 37
  • There are also ASIC implementations of ML techniques, such as Support Vector Machines and CNNs.
  • However, all these works have first, and successfully, focused on efficiently implementing the computational primitives but they either voluntarily ignore memory transfers for the sake of simplicity, 37, 43 or they directly plug their computational accelerator to memory via a more or less sophisticated DMA. 2, 12, 19

Paragraph 3

  • While efficient implementation of computational primitives is a first and important step with promising results,
    inefficient memory transfers can potentially void the throughput, energy, or cost advantages of accelerators, that is, an
    Amdahl's law effect, and thus, they should become a first-order concern, just like in processors,
    rather than an element factored into accelerator design as a second step.

  • Unlike in processors though, one can factor in the specific nature of
    memory transfers in target algorithms, just as is done for accelerating computations.

  • This is especially important in the domain of ML where there is a clear trend towards scaling up the size of learning models in order to achieve better accuracy and more functionality. 16, 24

Paragraph 4

  • In this article, we introduce a series of hardware accelerators designed for ML (especially neural networks), including
    DianNao, DaDianNao, ShiDianNao, and PuDianNao as listed in Table 1.
  • We focus our study on memory usage, and we investigate the accelerator architecture to minimize memory
    transfers and to perform them as efficiently as possible.

2 DIANNAO: A NEURAL NETWORK ACCELERATOR

  • DianNao is the first member of the DianNao accelerator family,
  • accommodates state-of-the-art neural network techniques (deep learning),
  • and inherits the broad application scope of neural networks.

2.1 Architecture

  • DianNao
    • components:
    • input buffer for input neurons (NBin)
    • output buffer for output neurons (NBout)
    • buffer for synaptic weights (SB)
    • connected to a computational block (performing both synapse and neuron computations),
    • the Neural Functional Unit (NFU), plus a Control Processor (CP); see Figure 1

NBin holds the input neurons,
SB holds the synaptic weights,

and NBout holds the output neurons.

My reading of the figure: with 2 input neurons and 2 synapses, you multiply the corresponding pairs and get 1 output neuron. But the NFU is more capable than that: it can compute two output neurons in one go.

  • NFU implements a functional block of $T_i$ inputs/synapses and $T_n$ output neurons,
    • which can be time-shared by different algorithmic blocks of neurons.

So the NFU operates on $T_i$ inputs and synapses and produces $T_n$ output neurons. Shouldn't that require $T_i \times T_n$ synapses?? (It does: each of the $T_n$ output lanes consumes its own $T_i$ synapses per cycle; see the sketch below.)
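To make the $T_i$/$T_n$ question concrete, here is a minimal C sketch of what one NFU cycle computes for a classifier/convolutional layer. The TI/TN values, names, and int accumulator are illustrative assumptions, not the paper's design; in hardware both loops are spatial ($T_n \times T_i$ parallel multipliers feeding $T_n$ adder trees), not sequential.

```c
#define TI 16  /* inputs/synapses consumed per cycle (illustrative)  */
#define TN 16  /* output neurons computed in parallel (illustrative) */

/* One NFU cycle for a classifier/convolutional layer.
 * Stage 1: TN*TI parallel multiplications (so TI*TN synapse values
 *          really are read per cycle).
 * Stage 2: TN adder trees, each summing TI products.
 * Results accumulate into TN partial sums headed for NBout. */
void nfu_cycle(const short input[TI],        /* one NBin row             */
               const short synapse[TN][TI],  /* one SB row: TI*TN values */
               int partial_sum[TN])
{
    for (int n = 0; n < TN; n++) {           /* TN parallel output lanes */
        int tree = 0;
        for (int i = 0; i < TI; i++)         /* adder tree over TI products */
            tree += input[i] * synapse[n][i];
        partial_sum[n] += tree;
    }
}
```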

  • Depending on the layer type, computations at the NFU can be decomposed into either two or three stages

  • For classifier and convolutional layers: multiplication of synapses $\times$ inputs, additions of all multiplications, sigmoid.

For a classifier or convolutional layer, it is simply synapses $\times$ inputs, summed up, then a sigmoid. That I can follow: this case is just a convolution.

For a classifier layer, the inputs are fully connected to the output neurons.

  • The nature of the last stage (sigmoid or another nonlinear function) can vary.
  • For pooling layers, there is no multiplication (no synapses), and the pooling operation can be average or max.
  • Note that the adders have multiple inputs,
    • they are in fact adder trees,
    • see Figure 1
    • the second stage also contains shifters and max operators for pooling layers.
  • In the NFU, the sigmoid function (for classifier and convolutional layers) can be efficiently implemented as a piecewise linear interpolation, $f(x) = a_i x + b_i$ for $x \in [x_i, x_{i+1}]$ (16 segments are sufficient); a sketch follows.
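A minimal C sketch of that piecewise-linear scheme. Fitting the table by sampling the exact sigmoid at segment endpoints, and the clamp range [-8, 8], are assumptions for illustration; the NFU would store precomputed $(a_i, b_i)$ pairs in a small table.

```c
#include <math.h>

#define NSEG 16          /* the paper states 16 segments suffice      */
#define XMIN (-8.0f)     /* clamp range: an illustrative assumption   */
#define XMAX ( 8.0f)

static float seg_a[NSEG], seg_b[NSEG];

/* Fit f(x) = a_i*x + b_i on each [x_i, x_{i+1}] by interpolating the
 * exact sigmoid at the segment endpoints. */
static void build_sigmoid_table(void)
{
    float w = (XMAX - XMIN) / NSEG;
    for (int i = 0; i < NSEG; i++) {
        float x0 = XMIN + i * w, x1 = x0 + w;
        float y0 = 1.0f / (1.0f + expf(-x0));
        float y1 = 1.0f / (1.0f + expf(-x1));
        seg_a[i] = (y1 - y0) / w;
        seg_b[i] = y0 - seg_a[i] * x0;
    }
}

static float sigmoid_pla(float x)
{
    if (x <= XMIN) return 0.0f;              /* saturate outside the range */
    if (x >= XMAX) return 1.0f;
    int i = (int)((x - XMIN) / ((XMAX - XMIN) / NSEG));
    if (i >= NSEG) i = NSEG - 1;             /* guard against rounding     */
    return seg_a[i] * x + seg_b[i];
}
```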

On-chip Storage

  • The on-chip storage structures of DianNao can be construed as modified buffers or scratchpads.
  • While a cache is an excellent storage structure for a general-purpose processor, it is a sub-optimal way to exploit reuse because of the cache access overhead (tag check, associativity, line size, speculative read, etc.) and cache conflicts.
  • The efficient alternative, a scratchpad, is used in VLIW processors, but it is known to be very difficult to compile for.
  • However a scratchpad in a dedicated accelerator realizes the best of both worlds: efficient
    storage, and both efficient and easy exploitation of locality because only a few algorithms have to be manually adapted.
Paragraph 2
  • We split on-chip storage into three structures (NBin, NBout, and SB), because there are three types of data (input neurons, output neurons, and synapses) with different characteristics (e.g., read width and reuse distance).
  • The first benefit of splitting structures is to tailor the SRAMs to the appropriate
    read/write width,
  • and the second benefit of splitting storage structures is to avoid conflicts, as would occur in a cache.
  • Moreover, we implement three DMAs to exploit spatial locality of data, one for each buffer (two load DMAs for inputs, one store DMA for outputs); a sketch follows.
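A rough C sketch of this split-storage idea, with each scratchpad typed to its natural row width. The depth of 64 entries is an assumption for illustration only; the 16-bit fixed-point data type follows the paper.

```c
#include <stdint.h>

#define TI 16
#define TN 16
#define DEPTH 64   /* buffer depth: an assumption for illustration */

/* NBin/NBout rows are read/written Ti (resp. Tn) neurons at a time,
 * while one SB row feeds all Tn*Ti multipliers in a single cycle. */
typedef struct {
    int16_t nbin [DEPTH][TI];       /* input neurons,  loaded by DMA 1 */
    int16_t sb   [DEPTH][TN * TI];  /* synapses,       loaded by DMA 2 */
    int16_t nbout[DEPTH][TN];       /* output neurons, stored by DMA 3 */
} diannao_storage;
```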

2.2 Loop tiling

  • DianNao uses loop tiling to reduce memory accesses
    • so that it can accommodate large neural networks
  • Example:
    • a classifier layer
      • with $N_n$ output neurons
      • fully connected to $N_i$ inputs
      • see the figure

With $N_n$ outputs and $N_i$ inputs, the synapse matrix has size $N_n \times N_i$; multiplying it by the $N_i$-dimensional input vector gives the result. The untiled code below computes exactly this matrix-vector product.

  • First, fetch one tile
    • I was puzzled at first:
    • what if the first output element depends on every input element,
    • how would you compute it then?
    • Actually, when computing the first output,
    • I only need a single row of the synapse matrix!
    • So what do we do with the huge synapse matrix?
  • Below are the original code and the
    • tiled code
    • that maps the classifier layer onto DianNao


for(int n=0;n<Nn;n++)
	sum[n]=0;
for(int n=0;n<Nn;n++) //output neurons
	for(int i=0;i<Ni;i++) //input neurons
		sum[n]+=synapse[n][i]*neuron[i];
for(int n=0;n<Nn;n++)
	neuron[n]=Sigmoid(sum[n]);
  • My take:
    • fetch Tnn outputs
    • and Tii inputs at a time
    • but that is still too big for the hardware
    • so split again
    • into Tn outputs and Ti inputs
    • and that's it
for(int nnn=0;nnn<Nn;nnn+=Tnn){ //tiling for output neurons
//the first loop produces Tnn outputs per tile
    for(int iii=0;iii<Ni;iii+=Tii){ //tiling for input neurons
//the second loop brings in Tii inputs per tile
//everything below operates on these two tiles

        for(int nn=nnn;nn<nnn+Tnn;nn+=Tn){
//the third loop: Tnn is still too large for the hardware,
//so split it further into chunks of size Tn
//for each Tn-sized chunk (starting at nn!)
//we compute as follows

///
            for(int n=nn;n<nn+Tn;n++)
//step 1: zero out all the partial sums

            sum[n]=0;
//to compute sum[n], we need row n of synapse times the whole neuron vector!
        for(int ii=iii;ii<iii+Tii;ii+=Ti)

//the loop above splits the Tii tile into chunks of Ti

            for(int n=nn;n<nn+Tn;n++)
                for(int i=ii;i<ii+Ti;i++)
                    sum[n]+=synapse[n][i]*neuron[i];

    for(int nn=nnn;nn<nnn+Tnn;nn+=Tn)
        neuron[n]=sigmoid(sum[n]);
///

 }   }  }
  • In the tiled code, loops $ii$ and $nn$
    • reflect that the NFU processes $T_i$ inputs/synapses
      • and $T_n$ output neurons at a time
  • input neurons are reused by every output neuron
    • but the input vector is far too large
    • to fit into NBin
    • so the input loop is also tiled, with factor $T_{ii}$ (a concrete sizing example follows)
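For a concrete (hypothetical) sizing: with 16-bit neurons, a 2 KB NBin holds only 1024 input values, so if $N_i = 16{,}384$ the input vector is 16x too large for the buffer. It must therefore be streamed in tiles of $T_{ii} \le 1024$, each tile being reused across every output-neuron block before the next tile is fetched.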

The tiled code above is definitely wrong (sum[n] is reset inside the iii loop, wiping the accumulation of earlier input tiles, and the final sigmoid loop reads an out-of-scope n); the correct version follows:

	for (int nnn = 0; nnn < Nn; nnn += Tnn) {
		for (int nn = nnn; nn < nnn + Tnn; nn += Tn) {
			for (int n = nn; n < nn + Tn; n++)
				sum[n] = 0;
			for (int iii = 0; iii < Ni; iii += Tii) {
				for (int ii = iii; ii < iii + Tii; ii += Ti)
					for (int n = nn; n < nn + Tn; n++)
						for (int i = ii; i < ii + Ti; i++)
							sum[n] += synapse[n][i] * neuron[i];
			}
			for (int n = nn; n < nn + Tn; n++)
				printf("s%ds ", sum[n]); // debug: each sum is complete here
		}
	}
	for (int index = 0; index < Nn; index++)
		printf("%d ", sum[index]); // debug: print all final sums