Preface: This article is written for a nifty girl whom I cherish.
(0)Introduction
- Learning notes based on 2020_isscc
- This article mainly covers detailed information about various deep learning processors
(0.1)Question-directed learning
- What are the key metrics that should be measured and compared?
- What are the challenges towards achieving these metrics?
- What are the design considerations and tradeoffs?
- How do these challenges and design considerations differ across platforms (e.g., CPU, GPU, ASIC, PIM, FPGA)?
(1)Deep learning overview
(1.0)Weighted Sums
- example: $Y_j = F_{nonlinear\ activation}\left(\sum_{i=1}^{3} W_{ij} \times X_i\right)$
- comment: multiply-and-accumulate (MAC) operations account for over 90% of the computation
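As a minimal sketch of the weighted-sum equation above (shapes and the tanh activation are illustrative assumptions):

```python
import numpy as np

def neuron_outputs(W, X, activation=np.tanh):
    """Y_j = F(sum_i W[i, j] * X[i]) for every output neuron j."""
    return activation(W.T @ X)  # the matrix product does all the MACs

X = np.array([0.5, -1.0, 2.0])   # 3 inputs, matching the i = 1..3 sum above
W = np.random.randn(3, 4)        # W[i, j]: weight from input i to output j
print(neuron_outputs(W, X))      # 4 output activations
```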
(1.1)Popular layers
- Fully Connected Layer
  features: feed forward, fully connected
- Convolutional Layer (CNN)
  features: feed forward, sparsely connected w/ weight sharing
  use: typically used for images
- Recurrent Layer (RNN)
  features: feedback
  use: typically used for sequential data (e.g., speech, language)
- Attention Layer / Mechanism
  features: attention (matrix multiply) + feed forward, fully connected (see the sketch below)
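To make the "attention = matrix multiply" point concrete, here is a minimal single-head scaled dot-product attention in NumPy (the scaling by sqrt(d) follows the standard formulation; all names and shapes are illustrative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: two matrix multiplies plus a softmax."""
    scores = (Q @ K.T) / np.sqrt(K.shape[-1])  # first matrix multiply
    return softmax(scores) @ V                 # second matrix multiply

seq_len, d = 5, 8
Q, K, V = (np.random.randn(seq_len, d) for _ in range(3))
print(attention(Q, K, V).shape)                # (5, 8)
```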
(1.2)Popular DNNs
DNNs keep getting larger and deeper
(2)Key metrics and Design Objectives
(2.0)Key metrics
- accuracy
- throughput: Analytics on high volume data and Real-time performance
- latency
- energy and power: TOPS/W
- hardware cost
- flexibility: Range of DNN models and tasks
- scalability: Scaling of performance with amount of resources
(2.1)Key Design Objectives of DL Processors
- Increase Throughput and Reduce Latency
  - Reduce time per MAC
    Reduce critical path -> increase clock frequency
    Reduce instruction overhead
  - Avoid unnecessary MACs (save cycles)
  - Increase number of processing elements (PE) -> more MACs in parallel
    Requires increasing area density of PE or area cost of system
  - Increase PE utilization -> keep PEs busy
    Distribute workload to as many PEs as possible
    Balance the workload across PEs
    Sufficient memory BW to deliver workload to PEs (reduce idle cycles)
  - Low latency has an additional constraint of small batch size
  (a back-of-the-envelope throughput model is sketched after this list)
- Reduce energy and power consumption
  - Reduce data movement as it dominates energy consumption
    Exploit data reuse
  - Reduce energy per MAC
    Reduce switching activity and/or capacitance
  - Reduce instruction overhead
  - Avoid unnecessary MACs
  (a rough energy model is sketched after this list)
- Flexibility
- Reduce overhead of supporting flexibility
- Maintain efficiency across wide range of DNN workloads
- Different layer shapes impact the amount of
Required storage and compute
Available data reuse that can be exploited
- Different precision across layers & data types (weight, activation, partial sum)
- Different degrees of sparsity (number of zeros in weights or activations) (see the zero-skipping sketch after this list)
- Types of DNN layers and compute beyond MACs (e.g., activation functions)
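The throughput objectives above can be folded into a back-of-the-envelope model; the numbers below are illustrative assumptions, not measurements:

```python
def effective_tops(num_pes, clock_ghz, macs_per_pe_per_cycle, utilization):
    """Effective throughput: more PEs, a faster clock (shorter critical
    path), and higher utilization all raise delivered MACs/s.
    Counting 1 MAC as 2 ops (multiply + add) gives TOPS."""
    macs_per_s = num_pes * macs_per_pe_per_cycle * clock_ghz * 1e9 * utilization
    return 2 * macs_per_s / 1e12

# e.g., 1024 PEs at 1 GHz, 1 MAC/PE/cycle, 60% utilization -> ~1.2 TOPS
print(effective_tops(1024, 1.0, 1, 0.6))
```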
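Why data movement dominates energy: a rough model with assumed relative costs (the 1 : 6 : 200 ratios below are placeholders of roughly the right order, not measured values) shows how turning DRAM traffic into on-chip reuse pays off:

```python
E_MAC, E_SRAM, E_DRAM = 1.0, 6.0, 200.0  # assumed relative energy costs

def total_energy(macs, sram_accesses, dram_accesses):
    """Compute energy plus data-movement energy; every access served
    on-chip instead of from DRAM saves the cost difference."""
    return macs * E_MAC + sram_accesses * E_SRAM + dram_accesses * E_DRAM

no_reuse   = total_energy(1e9, 0,    4e9)   # every operand fetched from DRAM
with_reuse = total_energy(1e9, 4e9, 0.1e9)  # most traffic stays on-chip
print(no_reuse / with_reuse)                # roughly an order of magnitude saved
```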
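For the sparsity point, a toy sketch of zero skipping: gating the MAC when either operand is zero is what saves cycles and energy when exploiting sparsity.

```python
import numpy as np

def sparse_dot(weights, activations):
    """Perform only the MACs where both operands are nonzero."""
    total, macs_done = 0.0, 0
    for w, a in zip(weights, activations):
        if w != 0 and a != 0:   # skip unnecessary MACs
            total += w * a
            macs_done += 1
    return total, macs_done

w = np.array([0.5, 0.0, -1.2, 0.0])
a = np.array([1.0, 3.0,  0.0, 2.0])
print(sparse_dot(w, a))         # (0.5, 1): only 1 of 4 MACs was needed
```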
(3)Design Considerations
(3.0)CPUs and GPUs
- Computation: use matrix multiplication libraries on CPUs and GPUs
- Fully connected layer can be directly represented as matrix multiplication
  input: $M_i = n \times k$
  weight: $W = m \times n$
  output: $M_o = m \times k$
  compute: $M_o = W M_i$
- Convolutional layer can be converted to a Toeplitz matrix
  But data is repeated (see the im2col sketch below)
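A minimal im2col sketch (single channel, stride 1, no padding; all names are illustrative) shows both the conversion to a matrix multiply and the data repetition it causes:

```python
import numpy as np

def im2col(x, k):
    """Flatten every k×k patch of x into a row; overlapping patches mean
    the same pixel appears in several rows -- the repeated data."""
    H, W = x.shape
    return np.stack([x[i:i + k, j:j + k].ravel()
                     for i in range(H - k + 1)
                     for j in range(W - k + 1)])

x = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 input feature map
w = np.random.randn(3, 3)                     # one 3x3 filter
out = im2col(x, 3) @ w.ravel()                # convolution as matrix multiply
print(out.reshape(2, 2))                      # 2x2 output feature map
```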
- Design Considerations for CPU and GPU
  - Software (compiler)
    - Reduce unnecessary MACs: apply transforms
    - Increase PE utilization: schedule loop order and tile data to increase data reuse in the memory hierarchy (a tiling sketch follows this list)
  - Hardware
    - Reduce time per MAC
      Increase speed of PEs
      Increase MACs per instruction using large aggregate instructions (e.g., SIMD, tensor core) -> requires additional hardware
    - Increase number of parallel MACs
      Increase number of PEs on chip -> area cost
      Support reduced precision in PEs
    - Increase PE utilization
      Increase on-chip storage -> area cost
      Increase external memory BW -> system cost
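A sketch of the loop-tiling idea (tile size and matrix shapes are illustrative): a blocked matrix multiply keeps a small working set in fast memory so each loaded element is reused many times before eviction.

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Blocked matmul: each T×T tile of A, B, and C stays resident while
    it is reused, cutting traffic to slower levels of the hierarchy."""
    m, n = A.shape
    _, k = B.shape
    C = np.zeros((m, k))
    for i in range(0, m, T):
        for j in range(0, k, T):
            for p in range(0, n, T):  # the C tile is reused across the p loop
                C[i:i + T, j:j + T] += A[i:i + T, p:p + T] @ B[p:p + T, j:j + T]
    return C

A, B = np.random.randn(64, 96), np.random.randn(96, 48)
print(np.allclose(tiled_matmul(A, B), A @ B))  # True
```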
(3.1)ASIC (Specialized / Domain-Specific Hardware)
- Features:
  - Operations exhibit high parallelism -> high throughput possible
  - Memory access is the bottleneck
    Worst case, every MAC operand is read from / written to DRAM
    Example: AlexNet has 724M MACs -> 2896M DRAM accesses required (a quick check of this arithmetic follows below)
  - Properties to leverage
    Input data reuse
    High parallelism
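The 724M -> 2896M figure is consistent with four DRAM accesses per MAC; the per-MAC access breakdown below is the usual worst-case accounting, stated here as an assumption:

```python
# Worst case, every operand travels to/from DRAM for each MAC:
# 3 reads (filter weight, input activation, partial sum) + 1 write
# (updated partial sum) = 4 DRAM accesses per MAC.
alexnet_macs = 724e6
dram_accesses = 4 * alexnet_macs
print(dram_accesses / 1e6)  # 2896.0 million DRAM accesses
```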
(4)END
- While deep learning gives state-of-the-art accuracy on many tasks, it may not be the best approach for all tasks. Some factors to consider include:
  - How much training data is available?
    Deep learning requires a significant amount of data -> in particular, labelled data (current state-of-the-art results rely on supervised learning)
  - How much computing resource is available?
    Despite the progress in the area of efficient deep learning, it can still require orders of magnitude more complexity than other machine-learning-based approaches
  - How critical is interpretability?
    Understanding why a DNN makes a certain decision is still an open area of research
    DNN models can be fooled; increasing robustness is still an open area of research
    In general, debugging what happens within a DNN can be challenging
  - Does a known model already exist?
    Many things in the world are based on known models or laws (e.g., Ohm's Law V=IR); it may be unnecessary to re-learn this from the data