1. Introduction
This paper, published at ISSCC 2016, addresses the computational complexity and large memory requirements of Convolutional Neural Networks (CNNs). To enable intelligent edge devices, the researchers implemented an energy-efficient CNN processor.
2. Innovation points
2.1 Propose a dual-range multiply-accumulate (DRMAC) block
Here is the DRMAC architecture:
A 24-bit (16.8 format) fixed-point truncated multiplier and an adder are implemented in a single MAC block.
Analysis of the MAC operands used in CNNs shows that about 99% of operands need at most 8 bits for the integer part, while only 0.01% need 16 bits. The operands that do need 16 integer bits are produced by accumulating many small-valued operands. Hence, instead of executing full 24-bit MAC operations all the time, the block first performs 16-bit (8.8 format) MAC operations. When a 16-bit MAC operation raises an overflow flag, the DRMAC block switches to full 24-bit mode.
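As a rough illustration of the dual-range idea, the following Python sketch accumulates products in the narrow 8.8 range and falls back to the wide 16.8 range once the accumulator overflows. It is a behavioral model under stated assumptions, not the chip's circuit: values are signed two's complement, the product is truncated to 8 fractional bits, only the accumulator is range-checked, and the block stays in 24-bit mode once overflow has been seen. The names `drmac`, `to_fixed`, and `fits_signed` are hypothetical.

```python
FRAC_BITS = 8  # both the 8.8 and 16.8 formats keep 8 fractional bits


def fits_signed(value: int, total_bits: int) -> bool:
    """Return True if value fits in a signed integer of total_bits bits."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return lo <= value <= hi


def to_fixed(x: float) -> int:
    """Convert a real number to a raw fixed-point integer (8 fractional bits)."""
    return int(round(x * (1 << FRAC_BITS)))


def drmac(pairs):
    """Accumulate sum(a * b) over raw fixed-point operand pairs.

    Returns (raw accumulator, number of operations run in 24-bit mode).
    Assumption: once the narrow range overflows, all later operations
    of this accumulation run in the wide mode.
    """
    acc = 0
    wide_mode = False
    wide_ops = 0
    for a, b in pairs:
        product = (a * b) >> FRAC_BITS  # truncated product, 8 fractional bits
        result = acc + product
        if not wide_mode and not fits_signed(result, 16):
            wide_mode = True  # overflow flag: redo this step in 24-bit mode
        if wide_mode:
            wide_ops += 1
            if not fits_signed(result, 24):
                raise OverflowError("result exceeds the 16.8 range")
        acc = result
    return acc, wide_ops


# Three products of 10.0 * 10.0: the running sum is 100, 200, 300.
raw, wide_ops = drmac([(to_fixed(10.0), to_fixed(10.0))] * 3)
print(raw / (1 << FRAC_BITS), wide_ops)  # 300.0 2
```

In this example the first accumulation (100.0) fits in the 8.8 range, which covers roughly -128 to +128, so only the second and third operations use the wide mode.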
Background on fixed-point numbers:
Take the fixed-point bit pattern "00011100":
- If the binary point is placed after the last digit, i.e., "00011100.", it represents the value 28.
- If the binary point is placed three digits from the end, i.e., "00011.100", it represents the value 3.50.
- If the binary point is placed four digits from the end, i.e., "0001.1100", it represents the value 1.75.
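The three readings above can be checked with a short Python sketch. It treats the bit string as an unsigned integer and divides by 2 raised to the number of fractional bits; the helper name `fixed_point_value` is hypothetical.

```python
def fixed_point_value(bits: str, frac_bits: int) -> float:
    """Interpret an unsigned binary string with frac_bits fractional bits."""
    return int(bits, 2) / (1 << frac_bits)


print(fixed_point_value("00011100", 0))  # 28.0
print(fixed_point_value("00011100", 3))  # 3.5
print(fixed_point_value("00011100", 4))  # 1.75
```

Moving the point one position to the left halves the value, which is why the same bits can cover a large range coarsely or a small range finely.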
2.2 Using kernel data compression for reducing off-chip memory accesses
An algorithm-level modification matched to the CNN hardware architecture is presented to reduce off-chip kernel data accesses, as shown in the following figure.
Because the kernels within a group in a CNN are highly correlated, principal component analysis (PCA) is carried out on the kernel data, and a smaller number of basic kernels is extracted. Only the basic kernels are then kept in on-chip buffers, and only the constant values needed for the weighted sum are transferred from off-chip memory. This avoids transferring the whole group of kernels, at the small cost of an on-chip kernel generation process.
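The on-chip kernel generation step can be sketched in Python as a weighted sum of basic kernels. This is a minimal sketch, not the paper's implementation: the PCA step itself is not shown, the basic kernels and coefficients are made-up toy values, the function names are hypothetical, and the traffic count assumes the basic kernels are already resident on-chip and ignores the one-time cost of loading them.

```python
def generate_kernels(basic_kernels, coefficients):
    """Rebuild each kernel as a weighted sum of the basic kernels.

    basic_kernels: list of B flattened kernels (lists of equal length)
    coefficients:  list of N rows, each holding B weights
    """
    size = len(basic_kernels[0])
    kernels = []
    for row in coefficients:
        kernel = [0.0] * size
        for weight, basic in zip(row, basic_kernels):
            for i, value in enumerate(basic):
                kernel[i] += weight * value
        kernels.append(kernel)
    return kernels


def transferred_values(num_kernels, kernel_size, num_basic):
    """Off-chip values moved: (full kernels, coefficients only)."""
    return num_kernels * kernel_size, num_kernels * num_basic


# Toy example: two 2x2 basic kernels (flattened) generate three kernels.
basic = [[1.0, 0.0, 0.0, 1.0],
         [0.0, 1.0, 1.0, 0.0]]
coeffs = [[1.0, 0.0],
          [0.5, 0.5],
          [2.0, -1.0]]

print(generate_kernels(basic, coeffs)[2])  # [2.0, -1.0, -1.0, 2.0]
print(transferred_values(3, 4, 2))         # (12, 6)
```

The saving grows with kernel size and shrinks as more basic kernels are needed, so the benefit depends on how strongly the kernels in a group are correlated.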
3. Summary
The following are the chip specification and micrograph: