Paper Reading: A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent

IC_菌

Last modified: 2023-07-01 21:47:54

Tags: hardware architecture, convolutional neural networks, edge computing

First published: 2023-07-01 16:14:15

Article link: https://blog.csdn.net/m0_52357437/article/details/131491695

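The paper proposes a dual-range multiply-accumulate (DRMAC) block to cope with the computational complexity and memory requirements of CNNs. DRMAC optimizes the multiply-accumulate operations in a CNN by using 24-bit fixed-point arithmetic that combines a 16-bit mode and a 24-bit mode. In addition, the kernel data is compressed with principal component analysis (PCA) to reduce off-chip memory accesses, which lowers the energy consumption of edge devices.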

1. Introduction

This paper, published at ISSCC 2016, addresses the computational complexity and large memory requirements of convolutional neural networks (CNNs). To enable intelligent edge devices, the authors implemented an energy-efficient CNN processor.

2. Innovation points

2.1 Propose a dual-range multiply-accumulate (DRMAC) block

Here is the DRMAC architecture:

A 24-bit (16.8 format) fixed-point truncated multiplier and an adder are implemented in a single MAC block.
Analysis of the MAC operands used in CNNs reveals that about 99% of operands need at most 8 bits for the integer part, while only 0.01% need 16 bits. The operands that do need a 16-bit integer part are produced by accumulating many small-valued operands. Hence, instead of executing full 24-bit MAC operations all the time, the DRMAC block first performs 16-bit (8.8 format) MAC operations. When a 16-bit MAC operation raises an overflow flag, the block switches to full 24-bit mode.
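The dual-range behaviour can be illustrated with a small software model. This is only a sketch, not the circuit from the paper: it assumes signed two's-complement words, truncation by an arithmetic right shift, and that the block stays in 24-bit mode once the overflow flag has fired. The names `to_fixed`, `fits`, `drmac`, and `dot` are made up for this example.

```python
FRAC_BITS = 8  # both the 8.8 and the 16.8 format keep 8 fractional bits


def to_fixed(x):
    """Quantize a real number to an integer with 8 fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))


def fits(value, total_bits):
    """True if value fits in a signed two's-complement word (assumed format)."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return lo <= value <= hi


def drmac(acc, a, b, wide):
    """One multiply-accumulate step on fixed-point integers.

    Returns (new_acc, wide). The step runs in 16-bit (8.8) mode until the
    result no longer fits, then continues in 24-bit (16.8) mode.
    """
    product = (a * b) >> FRAC_BITS  # truncate back to 8 fractional bits
    result = acc + product
    if not wide and not fits(result, 16):
        wide = True  # overflow flag raised: switch to the 24-bit range
    if wide and not fits(result, 24):
        raise OverflowError("result exceeds the 16.8 range")
    return result, wide


def dot(xs, ws):
    """Accumulate sum(x * w) and report whether 24-bit mode was needed."""
    acc, wide = 0, False
    for x, w in zip(xs, ws):
        acc, wide = drmac(acc, to_fixed(x), to_fixed(w), wide)
    return acc / (1 << FRAC_BITS), wide


print(dot([1.5, 2.0], [2.0, 0.25]))      # small values stay in 8.8 mode
print(dot([100.0, 100.0], [1.0, 1.0]))   # the sum 200 overflows 8.8
```

In the second call each operand fits the 8.8 range on its own, and only the accumulated sum forces the wider mode, which mirrors the observation that large values come from accumulating small ones.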

Background on fixed-point numbers:
For the fixed-point bit pattern "00011100":

If the binary point is placed after the last digit, i.e., "00011100.", it represents the value 28.
If the binary point is placed three digits from the end, i.e., "00011.100", it represents the value 3.5.
If the binary point is placed four digits from the end, i.e., "0001.1100", it represents the value 1.75.
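The three readings of "00011100" can be checked with a few lines of Python. The helper `fixed_to_float` is a name chosen for this example, and it assumes an unsigned bit pattern: the value is the integer reading divided by 2 raised to the number of fractional bits.

```python
def fixed_to_float(bits, frac_bits):
    """Interpret an unsigned bit string that has frac_bits fractional bits."""
    return int(bits, 2) / (1 << frac_bits)


for frac_bits in (0, 3, 4):
    print(frac_bits, fixed_to_float("00011100", frac_bits))
```

Under the same rule, the 8.8 and 16.8 formats used by the DRMAC block both correspond to `frac_bits = 8`; they differ only in how many integer bits are available.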

2.2 Using kernel data compression for reducing off-chip memory accesses

An algorithm-level modification that matches the CNN hardware architecture is presented to reduce off-chip kernel data accesses, as shown in the following figure.

Fig: On-chip kernel generation technique.

Because the kernels within a group in a CNN are highly correlated, principal component analysis (PCA) is carried out on the kernel data to extract a smaller set of basic kernels. Only the basic kernels are kept in on-chip buffers, and only the constant values needed for the weighted sum are transferred from off-chip memory. This avoids transferring the whole group of kernels, at the small cost of an on-chip kernel generation step.
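The on-chip generation step amounts to a weighted sum of the basic kernels. The sketch below shows only that step and the resulting transfer count. It does not perform PCA; the two basic kernels and the coefficients are hypothetical values picked for illustration, and `generate_kernel` is a name invented here.

```python
def generate_kernel(basis, coeffs):
    """Weighted sum of basic kernels: kernel[i] = sum_j coeffs[j] * basis[j][i]."""
    size = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(size)]


# Two hypothetical 3x3 basic kernels (flattened), assumed to sit in on-chip buffers.
basis = [
    [1, 0, -1, 2, 0, -2, 1, 0, -1],
    [1, 2, 1, 0, 0, 0, -1, -2, -1],
]

# Hypothetical per-kernel coefficients: the only values fetched from off-chip memory.
coeffs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]

kernels = [generate_kernel(basis, c) for c in coeffs]

full_transfer = len(coeffs) * len(basis[0])        # values moved if kernels were sent whole
compressed_transfer = sum(len(c) for c in coeffs)  # values moved with on-chip generation

print(kernels[2])
print(full_transfer, compressed_transfer)
```

With four kernels and two basic kernels the toy example moves 8 coefficients instead of 36 kernel values. In practice the saving depends on how many basic kernels are needed to approximate the group well enough, which the paper determines through PCA.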

3. Summary

The following are the chip specification and micrograph:
