Preface: This article is written for a nifty girl whom I cherish.
(0)Introduction
- Learning notes based on 2020_isscc
- This article mainly covers detailed information about various deep learning processors
(0.1)Question-directed learning
- What are the key metrics that should be measured and compared?
- What are the challenges towards achieving these metrics?
- What are the design considerations and tradeoffs?
- How do these challenges and design considerations differ across platforms (e.g., CPU, GPU, ASIC, PIM, FPGA)?
(1)Deep learning overview
(1.0)Weighted Sums
- example: $Y_j = F_{nonlinear\ activation}\left(\sum_{i=1}^{3} W_{ij} \times X_i\right)$
- comment: multiply-and-accumulate (MAC) operations account for over 90% of the computation
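As a minimal sketch of the weighted-sum equation above (shapes and the tanh activation are illustrative assumptions):

```python
import numpy as np

def neuron_outputs(W, X, activation=np.tanh):
    """Y_j = F(sum_i W[i, j] * X[i]) for every output neuron j."""
    return activation(W.T @ X)  # the matrix product does all the MACs

X = np.array([0.5, -1.0, 2.0])   # 3 inputs, matching the i = 1..3 sum above
W = np.random.randn(3, 4)        # W[i, j]: weight from input i to output j
print(neuron_outputs(W, X))      # 4 output activations
```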
(1.1)Popular layers
- Fully Connected Layer
  features: feed forward, fully connected
- Convolutional Layer (CNN)
  features: feed forward, sparsely connected w/ weight sharing
  use: typically used for images
- Recurrent Layer (RNN)
  features: feedback
  use: typically used for sequential data (e.g., speech, language)
- Attention Layer / Mechanism
  features: attention (matrix multiply) + feed forward, fully connected (see the sketch below)
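To make the "attention = matrix multiply" point concrete, here is a minimal single-head scaled dot-product attention in NumPy (the scaling by sqrt(d) follows the standard formulation; all names and shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: two matrix multiplies plus a softmax."""
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])  # first matrix multiply
    return softmax(scores) @ V                 # second matrix multiply

seq_len, d = 5, 8
Q, K, V = (np.random.randn(seq_len, d) for _ in range(3))
print(attention(Q, K, V).shape)                # (5, 8)
```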
(1.2)Popular DNNs
DNNs keep getting larger and deeper
(2)Key metrics and Design Objectives
(2.0)Key metrics
- accuracy
- throughput: Analytics on high volume data and Real-time performance
- latency
- energy and power: TOPS/W
- hardware cost
- flexibility: Range of DNN models and tasks
- scalability: Scaling of performance with amount of resources
(2.1)Key Design Objectives of DL Processors
- Increase Throughput and Reduce Latency
  - Reduce time per MAC
    Reduce critical path -> increase clock frequency
    Reduce instruction overhead
  - Avoid unnecessary MACs (save cycles)
  - Increase number of processing elements (PE) -> more MACs in parallel
    Requires increasing area density of PE or area cost of system
  - Increase PE utilization -> keep PEs busy
    Distribute workload to as many PEs as possible
    Balance the workload across PEs
    Sufficient memory BW to deliver workload to PEs (reduce idle cycles)
  - Low latency has an additional constraint of small batch size
  (a back-of-the-envelope throughput model is sketched after this list)
- Reduce energy and power consumption
  - Reduce data movement as it dominates energy consumption
    Exploit data reuse
  - Reduce energy per MAC
    Reduce switching activity and/or capacitance
  - Reduce instruction overhead
  - Avoid unnecessary MACs
  (a rough energy model is sketched after this list)
- Flexibility
- Reduce overhead of supporting flexibility
- Maintain efficiency across wide range of DNN workloads
- Different layer shapes impact the amount of
Required storage and compute
Available data reuse that can be exploited
- Different precision across layers & data types (weight, activation, partial sum)
- Different degrees of sparsity (number of zeros in weights or activations) (see the zero-skipping sketch after this list)
- Types of DNN layers and compute beyond MACs (e.g., activation functions)
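The throughput objectives above can be folded into a back-of-the-envelope model; the numbers below are illustrative assumptions, not measurements:

```python
def effective_tops(num_pes, clock_ghz, macs_per_pe_per_cycle, utilization):
    """Effective throughput: more PEs, a faster clock (shorter critical
    path), and higher utilization all raise delivered MACs/s.
    Counting 1 MAC as 2 ops (multiply + add) gives TOPS."""
    macs_per_s = num_pes * macs_per_pe_per_cycle * clock_ghz * 1e9 * utilization
    return 2 * macs_per_s / 1e12

# e.g., 1024 PEs at 1 GHz, 1 MAC/PE/cycle, 60% utilization -> ~1.2 TOPS
print(effective_tops(1024, 1.0, 1, 0.6))
```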
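Why data movement dominates energy: a rough model with assumed relative costs (the 1 : 6 : 200 ratios below are placeholders of roughly the right order, not measured values) shows how turning DRAM traffic into on-chip reuse pays off:

```python
E_MAC, E_SRAM, E_DRAM = 1.0, 6.0, 200.0  # assumed relative energy costs

def total_energy(macs, sram_accesses, dram_accesses):
    """Compute energy plus data-movement energy; every access served
    on-chip instead of from DRAM saves the cost difference."""
    return macs * E_MAC + sram_accesses * E_SRAM + dram_accesses * E_DRAM

no_reuse   = total_energy(1e9, 0,    4e9)   # every operand fetched from DRAM
with_reuse = total_energy(1e9, 4e9, 0.1e9)  # most traffic stays on-chip
print(no_reuse / with_reuse)                # roughly an order of magnitude saved
```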
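For the sparsity point, a toy sketch of zero skipping: gating the MAC when either operand is zero is what saves cycles and energy when exploiting sparsity.

```python
import numpy as np

def sparse_dot(weights, activations):
    """Perform only the MACs where both operands are nonzero."""
    total, macs_done = 0.0, 0
    for w, a in zip(weights, activations):
        if w != 0 and a != 0:   # skip unnecessary MACs
            total += w * a
            macs_done += 1
    return total, macs_done

w = np.array([0.5, 0.0, -1.2, 0.0])
a = np.array([1.0, 3.0,  0.0, 2.0])
print(sparse_dot(w, a))         # (0.5, 1): only 1 of 4 MACs was needed
```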
(3)Design Considerations
(3.0)CPUs and GPUs
- Computation: use matrix multiplication libraries on CPUs and GPUs
- Fully connected layer can be directly represented as matrix multiplication
  input: $M_i = n \times k$
  weight: $W = m \times n$
  output: $M_o = m \times k$
  compute: $M_o = W M_i$
- Convolutional layer can be converted to a Toeplitz matrix
  But data is repeated (see the im2col sketch below)
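A minimal im2col sketch (single channel, stride 1, no padding; all names are illustrative) shows both the conversion to a matrix multiply and the data repetition it causes:

```python
import numpy as np

def im2col(x, k):
    """Flatten every k×k patch of x into a row; overlapping patches mean
    the same pixel appears in several rows -- the repeated data."""
    H, W = x.shape
    return np.stack([x[i:i + k, j:j + k].ravel()
                     for i in range(H - k + 1)
                     for j in range(W - k + 1)])

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 input feature map
w = np.random.randn(3, 3)                     # one 3x3 filter
out = im2col(x, 3) @ w.ravel()                # convolution as matrix multiply
print(out.reshape(2, 2))                      # 2x2 output feature map
```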
- Design Considerations for CPU and GPU
  - Software (compiler)
    - Reduce unnecessary MACs: apply transforms
    - Increase PE utilization: schedule loop order and tile data to increase data reuse in the memory hierarchy (a tiling sketch follows this list)
  - Hardware
    - Reduce time per MAC
      Increase speed of PEs
      Increase MACs per instruction using large aggregate instructions (e.g., SIMD, tensor core) -> requires additional hardware
    - Increase number of parallel MACs
      Increase number of PEs on chip -> area cost
      Support reduced precision in PEs
    - Increase PE utilization
      Increase on-chip storage -> area cost
      Increase external memory BW -> system cost
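A sketch of the loop-tiling idea (tile size and matrix shapes are illustrative): a blocked matrix multiply keeps a small working set in fast memory so each loaded element is reused many times before eviction.

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Blocked matmul: each T×T tile of A, B, and C stays resident while
    it is reused, cutting traffic to slower levels of the hierarchy."""
    m, n = A.shape
    _, k = B.shape
    C = np.zeros((m, k))
    for i in range(0, m, T):
        for j in range(0, k, T):
            for p in range(0, n, T):  # the C tile is reused across the p loop
                C[i:i + T, j:j + T] += A[i:i + T, p:p + T] @ B[p:p + T, j:j + T]
    return C

A, B = np.random.randn(64, 96), np.random.randn(96, 48)
print(np.allclose(tiled_matmul(A, B), A @ B))  # True
```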
(3.1)ASIC (Specialized / Domain-Specific Hardware)
- Features:
  - Operations exhibit high parallelism -> high throughput possible
  - Memory access is the bottleneck
    Worst case, every MAC operand is read from / written to DRAM
    Example: AlexNet has 724M MACs -> 2896M DRAM accesses required (a quick check of this arithmetic follows below)
  - Properties to leverage
    Input data reuse
    High parallelism
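The 724M -> 2896M figure is consistent with four DRAM accesses per MAC; the per-MAC access breakdown below is the usual worst-case accounting, stated here as an assumption:

```python
# Worst case, every operand travels to/from DRAM for each MAC:
# 3 reads (filter weight, input activation, partial sum) + 1 write
# (updated partial sum) = 4 DRAM accesses per MAC.
alexnet_macs = 724e6
dram_accesses = 4 * alexnet_macs
print(dram_accesses / 1e6)  # 2896.0 million DRAM accesses
```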
(4)END
- While deep learning gives state-of-the-art accuracy on many tasks, it may not be the best approach for all tasks. Some factors to consider include:
  - How much training data is available?
    Deep learning requires a significant amount of data -> in particular, labelled data (current state-of-the-art results rely on supervised learning)
  - How much computing resource is available?
    Despite the progress in the area of efficient deep learning, it can still require orders of magnitude more complexity than other machine-learning-based approaches
  - How critical is interpretability?
    Understanding why a DNN makes a certain decision is still an open area of research
    DNN models can be fooled; increasing robustness is still an open area of research
    In general, debugging what happens within a DNN can be challenging
  - Does a known model already exist?
    Many things in the world are based on known models or laws (e.g., Ohm's Law V=IR); it may be unnecessary to re-learn this from the data