Understand and Evaluate Deep Learning Processors (Learning Notes)

Preface: This article is written for a nifty girl whom I cherish.


(0)Introduction

  • Learning notes based on ISSCC 2020 material
  • This article mainly covers the key details of various Deep Learning Processors

(0.1)Question-directed learning

  • What are the key metrics that should be measured and compared?
  • What are the challenges towards achieving these metrics?
  • What are the design considerations and tradeoffs?
  • How do these challenges and design considerations differ across platforms (e.g., CPU, GPU, ASIC, PIM, FPGA)?

(1)Deep learning overview

(1.0)Weighted Sums

  • example: $Y_j = F_{\mathrm{nonlinear\ activation}}\left(\sum_{i=1}^{3} W_{ij} \times X_i\right)$
  • comment: multiply-and-accumulate (MAC) operations account for over 90% of the computation in a DNN (a NumPy sketch follows this list)
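A minimal NumPy sketch of this weighted sum, assuming ReLU as the nonlinear activation (the notes do not fix a particular activation function):

```python
import numpy as np

def weighted_sum_layer(X, W, activation=lambda v: np.maximum(v, 0.0)):
    """Compute Y_j = F(sum_i W_ij * X_i) for every output j.

    X: input activations, shape (n_in,)
    W: weights, shape (n_in, n_out)
    The inner loop is the multiply-and-accumulate (MAC) that dominates
    the computation; in practice it is fused into a single matrix multiply.
    """
    n_in, n_out = W.shape
    Y = np.zeros(n_out)
    for j in range(n_out):
        acc = 0.0
        for i in range(n_in):
            acc += W[i, j] * X[i]          # one MAC operation
        Y[j] = activation(acc)             # nonlinear activation
    return Y

# 3 inputs and 2 outputs, matching the i = 1..3 sum above
X = np.array([1.0, 2.0, 3.0])
W = np.random.randn(3, 2)
print(weighted_sum_layer(X, W))
print(np.maximum(W.T @ X, 0.0))            # same result as a single matmul
```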

(1.1)Popular layers

  • Fully Connected Layer
    features: Feed forward, fully connected

  • Convolutional Layer (CNN)
    features: Feed forward, sparsely-connected w/ weight sharing
    use: Typically used for images

  • Recurrent Layer (RNN)
    features: feedback
    use: Typically used for sequential data (e.g., speech, language)

  • Attention layer/mechanism
    features: Attention (matrix multiply) + feed forward, fully connected
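As a rough illustration (not from the notes), these four layer types map onto standard PyTorch modules; the shapes below are arbitrary examples:

```python
import torch
import torch.nn as nn

x_vec = torch.randn(8, 128)           # batch of 8 feature vectors
x_img = torch.randn(8, 3, 32, 32)     # batch of 8 RGB images
x_seq = torch.randn(8, 10, 128)       # batch of 8 sequences of 10 steps

fc   = nn.Linear(128, 64)                       # fully connected: dense weight matrix
conv = nn.Conv2d(3, 16, kernel_size=3)          # convolution: sparse connectivity + weight sharing
rnn  = nn.LSTM(128, 64, batch_first=True)       # recurrent: feedback across time steps
attn = nn.MultiheadAttention(128, num_heads=4, batch_first=True)  # attention: matmul over the sequence

y_fc   = fc(x_vec)                       # (8, 64)
y_conv = conv(x_img)                     # (8, 16, 30, 30)
y_rnn, _  = rnn(x_seq)                   # (8, 10, 64)
y_attn, _ = attn(x_seq, x_seq, x_seq)    # self-attention: (8, 10, 128)
```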

(1.2)Popular DNNs

DNNs are getting larger and deeper.

(2)Key metrics and Design Objectives

(2.0)Key metrics

  • Accuracy
  • Throughput: analytics on high-volume data and real-time performance
  • Latency
  • Energy and power: commonly reported as TOPS/W (a toy calculation follows this list)
  • Hardware cost
  • Flexibility: range of DNN models and tasks supported
  • Scalability: scaling of performance with the amount of resources
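A toy calculation of how these metrics relate, using made-up numbers rather than measurements of any particular chip (the 2-ops-per-MAC convention is the usual one behind TOPS figures):

```python
# Illustrative placeholders only -- not real measurements.
macs_per_inference = 724e6                      # e.g. an AlexNet-scale network
ops_per_inference  = 2 * macs_per_inference     # 1 MAC = 1 multiply + 1 add = 2 ops

sustained_ops_per_second = 4e12                 # 4 TOPS actually delivered
average_power_watts      = 2.0                  # power while running the workload

throughput = sustained_ops_per_second / ops_per_inference      # inferences per second
latency    = 1.0 / throughput                                  # seconds per inference at batch size 1
efficiency = sustained_ops_per_second / 1e12 / average_power_watts  # TOPS/W

print(f"throughput ~ {throughput:,.0f} inferences/s")
print(f"latency    ~ {latency * 1e3:.2f} ms (batch size 1, one inference in flight)")
print(f"efficiency ~ {efficiency:.1f} TOPS/W")
```

Note that a peak TOPS/W figure alone says nothing about accuracy or about utilization on a real workload, which is why these metrics have to be reported together.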

(2.1)Key Design Objectives of DL Processors

  • Increase Throughput and Reduce Latency (see the formula after this list for how these factors combine)
    • Reduce time per MAC
      Reduce critical path -> increase clock frequency
      Reduce instruction overhead
    • Avoid unnecessary MACs (save cycles)
    • Increase number of processing elements (PE) -> more MACs in parallel
      Requires increasing the area density of PEs or the area cost of the system
    • Increase PE utilization -> keep PEs busy
      Distribute workload to as many PEs as possible
      Balance the workload across PEs
      Sufficient memory BW to deliver workload to PEs (reduce idle cycles)
  • Low latency has an additional constraint of small batch size
  • Reduce energy and power consumption
    • Reduce data movement as it dominates energy consumption
      Exploit data reuse
    • Reduce energy per MAC
      Reduce switching activity and/or capacitance
    • Reduce instruction overhead
    • Avoid unnecessary MACs
  • Flexibility
    • Reduce overhead of supporting flexibility
    • Maintain efficiency across wide range of DNN workloads
      • Different layer shapes impact the amount of
        Required storage and compute
        Available data reuse that can be exploited
    • Different precision across layers & data types (weight, activation, partial sum)
    • Different degrees of sparsity (number of zeros in weights or activations)
    • Types of DNN layers and compute beyond MACs (e.g., activation functions)
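One way to tie the throughput knobs above together (my own summary of the bullets, not a formula taken from the notes):

$$
\text{throughput}\ \left[\tfrac{\text{inferences}}{\text{s}}\right] \approx
\frac{N_{PE} \times \text{utilization} \times \tfrac{\text{MACs}}{\text{PE}\cdot\text{cycle}} \times f_{clk}}
     {\text{MACs per inference (after skipping unnecessary MACs)}}
$$

Shortening the critical path raises $f_{clk}$, adding PEs raises $N_{PE}$ at an area cost, workload balancing and sufficient memory bandwidth raise utilization, and transforms that avoid unnecessary MACs shrink the denominator. Energy has the same per-operation structure, which is why data movement and energy per MAC appear in the list above.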

(3)Design Considerations

(3.0)CPUs and GPUs

  • Calculating elements: use matrix multiplication libraries on CPUs and GPUs
  1. Fully connected layer can be directly represented as matrix multiplication
    input: $M_i = n \times k$
    weight: $W = m \times n$
    output: $M_o = m \times k$
    compute: $M_o = W M_i$
  2. Convolutional layer can be converted to a Toeplitz matrix (im2col)
    But data is repeated, since the same input values are copied into multiple columns (see the NumPy sketch at the end of this section)
  • Design Considerations for CPU and GPU
    • Software (compiler)
      • Reduce unnecessary MACs: Apply transforms
      • Increase PE utilization: Schedule loop order and tile data to increase data reuse in memory hierarchy
    • Hardware
      • Reduce time per MAC
        • Increase speed of PEs
        • Increase MACs per instructions using large aggregate instructions (e.g., SIMD, tensor core) -> requires additional hardware
      • Increase number of parallel MACs
        • Increase number of PEs on chip -> area cost
        • Support reduced precision in PEs
      • Increase PE utilization
        • Increase on-chip storage -> area cost
        • Increase external memory BW -> system cost
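A minimal NumPy sketch of both mappings in this section (arbitrary layer sizes; this is the textbook im2col/Toeplitz lowering, not the exact code path of any particular BLAS or cuDNN library):

```python
import numpy as np

# 1. Fully connected layer as a plain matrix multiply: M_o = W @ M_i
n, m, k = 128, 64, 8              # input size, output size, batch size
W   = np.random.randn(m, n)       # weights, m x n
M_i = np.random.randn(n, k)       # k input vectors stacked as columns
M_o = W @ M_i                     # outputs, m x k

# 2. Convolution lowered to a matrix multiply via im2col (Toeplitz-style).
#    Each column of `cols` is one flattened receptive field, so input
#    pixels that fall into several windows are copied -- "data is repeated".
def im2col(x, kh, kw):
    c, h, w = x.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = np.empty((c * kh * kw, out_h * out_w))
    for i in range(out_h):
        for j in range(out_w):
            cols[:, i * out_w + j] = x[:, i:i + kh, j:j + kw].ravel()
    return cols, out_h, out_w

x = np.random.randn(3, 8, 8)                    # C x H x W input feature map
filters = np.random.randn(16, 3, 3, 3)          # 16 filters of shape C x 3 x 3
cols, out_h, out_w = im2col(x, 3, 3)            # (27, 36)
out = (filters.reshape(16, -1) @ cols).reshape(16, out_h, out_w)

# Cross-check against a direct sliding-window convolution for one output pixel
direct = np.sum(filters[0] * x[:, 0:3, 0:3])
assert np.allclose(direct, out[0, 0, 0])
```

The duplicated columns in `cols` are the "data is repeated" cost noted above: memory footprint and bandwidth grow, which is what the loop scheduling and tiling considerations try to compensate for.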

(3.1)ASIC (Specialized / Domain-Specific Hardware)

  • Features:
  1. Operations exhibit high parallelism -> high throughput possible
  2. Memory Access is the Bottleneck

Example: AlexNet has 724M MACs -> 2896M DRAM accesses required
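A hedged back-of-the-envelope reading of that number (the notes do not spell out the accounting, but 2896M / 724M = 4 accesses per MAC, which corresponds to the worst case where every operand goes to DRAM):

$$
724\text{M MACs} \times 4\ \tfrac{\text{DRAM accesses}}{\text{MAC}} = 2896\text{M DRAM accesses}
$$

where the four accesses per MAC are one filter-weight read, one input-activation read, one partial-sum read, and one partial-sum write, assuming no on-chip reuse. This worst-case cost is exactly what the reuse properties below are meant to eliminate.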

  • Properties to Leverage
    Input data reuse
    High parallelism

(4)END

  • While deep learning gives state-of-the-art accuracy on many tasks, it may not be
    the best approach for all tasks. Some factors to consider include:
    • How much training data is available?
      Deep learning requires a significant amount of data -> in particular, labelled data
      (current state-of-the-art results rely on supervised learning)
    • How much computing resource is available?
      Despite progress in efficient deep learning, DNNs can still require orders of magnitude more computation than other machine-learning approaches
    • How critical is interpretability?
      Understanding why the DNN makes a certain decision is still an open area of research
      DNN models can be fooled – increasing robustness is still an open area of research
      In general, debugging what happens within a DNN can be challenging
    • Does a known model already exist?
      Many things in the world are based on known models or laws (e.g., Ohm’s Law V=IR);
      it may be unnecessary to re-learn this from the data