Literature Reading (35): 2022 Transformer Accelerators

tiaozhanzhe1900

Last modified: 2023-03-24 21:40:21

Category: NPU. Tags: transformer, deep learning

First published: 2022-01-11 00:02:08

Article link: https://blog.csdn.net/tiaozhanzhe1900/article/details/122421891

Contents

1 I-BERT
2 EdgeBERT

Title: I-BERT: Integer-only BERT Quantization
Year: 2021
Venue: Proceedings of the 38th International Conference on Machine Learning, PMLR
Institution: UCB
GitHub: https://github.com/kssteven418/I-BERT

1 I-BERT

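Main contribution of the paper:
To address the heavy computational cost of Transformers, the paper proposes a lightweight quantization scheme that also quantizes the nonlinear operations GELU, softmax, and Layer Normalization, so that inference can run on a GPU with int8 arithmetic.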

1.1 softmax

$\operatorname{Softmax}(\mathbf{x})_{i}=\frac{\exp x_{i}}{\sum_{j=1}^{k} \exp x_{j}}$
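First, to keep the exponentials from growing too large, subtract the maximum value from every exponent, which is the same as multiplying numerator and denominator by $\exp(-x_{\max})$:
$\operatorname{Softmax}(\mathbf{x})_{i}=\frac{\exp \left(x_{i}-x_{\max }\right)}{\sum_{j=1}^{k} \exp \left(x_{j}-x_{\max }\right)}$
Every exponent $\tilde{x} = x_i - x_{\max}$ is now non-positive, so each exponential lies in $(0, 1]$. Next, write $\tilde{x} = (-\ln 2)z + p$, where $z$ is a non-negative integer and $p \in (-\ln 2, 0]$. Then, with $>>$ denoting a right shift by $z$ bits,
$\exp (\tilde{x})=2^{-z} \exp (p)=\exp (p) >> z$
Since $p$ lies between $-\ln 2$ and $0$, $\exp(p)$ can be approximated by a second-order polynomial: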
$L(p)=0.3585(p+1.353)^{2}+0.344 \approx \exp (p)$
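
The decomposition above can be sketched in a few lines of Python. This is a floating-point illustration under stated assumptions, not the paper's integer kernel: the function names `poly_exp`, `approx_exp`, and `approx_softmax` are made up for this note, and the division by `2 ** z` stands in for the right shift that an integer implementation would use.

```python
import math

LN2 = math.log(2.0)

def poly_exp(p: float) -> float:
    """Second-order fit L(p) ~ exp(p), intended for p in (-ln2, 0]."""
    return 0.3585 * (p + 1.353) ** 2 + 0.344

def approx_exp(x_tilde: float) -> float:
    """Approximate exp(x_tilde) for x_tilde <= 0 via x_tilde = (-ln2) * z + p."""
    z = math.floor(-x_tilde / LN2)   # non-negative integer
    p = x_tilde + z * LN2            # remainder, in (-ln2, 0]
    return poly_exp(p) / (2 ** z)    # a right shift by z in integer arithmetic

def approx_softmax(xs):
    """Softmax with max subtraction and the polynomial exp approximation."""
    x_max = max(xs)
    exps = [approx_exp(x - x_max) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def exact_softmax(xs):
    """Reference softmax using math.exp, for comparison only."""
    x_max = max(xs)
    exps = [math.exp(x - x_max) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

Comparing `approx_softmax([1.0, 2.0, 3.0])` with `exact_softmax([1.0, 2.0, 3.0])` gives a quick feel for how small the polynomial's error is on typical inputs.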

1.2 LayerNorm

unlike BatchNorm whose parameters/statistics can be fused into the previous convolutional layer in inference, LayerNorm requires the dynamic computation of the square root of the variance for each input

The main difficulty is that computing the standard deviation requires a square root.
Open question: does this still require a division?
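
As a rough illustration of how a square root can be computed with integers only, here is a minimal Newton-iteration sketch. It is an assumption-laden example rather than a copy of the paper's algorithm: the name `int_sqrt` and the power-of-two initial guess are choices made for this note.

```python
def int_sqrt(n: int) -> int:
    """Return floor(sqrt(n)) for a non-negative integer n, using integer ops only."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0
    # Power-of-two initial guess that is >= sqrt(n).
    x = 1 << ((n.bit_length() + 1) // 2)
    while True:
        # Newton step for f(x) = x^2 - n, with integer (floor) division.
        y = (x + n // x) // 2
        if y >= x:       # no further decrease: x is floor(sqrt(n))
            return x
        x = y
```

Note that this sketch does rely on integer division (`n // x`), and normalizing by the resulting standard deviation is itself a division, which bears on the question raised above.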

1.3 GELU

$\operatorname{GELU}(x):=x \cdot \frac{1}{2}\left[1+\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right]$
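where $\operatorname{erf}(x)$ is the error function,
$\operatorname{erf}(x):=\frac{2}{\sqrt{\pi}} \int_{0}^{x} \exp \left(-t^{2}\right) d t$
One option is to approximate GELU with the sigmoid function,
$\operatorname{GELU}(x) \approx x \sigma(1.702 x)$
but sigmoid is itself a nonlinear operation. If sigmoid is in turn approximated with h-sigmoid, this becomes
$\text { h-GELU }(x):=x \frac{\operatorname{ReLU} 6(1.702 x+3)}{6} \approx \operatorname{GELU}(x)$
This loses too much accuracy, so it is better to approximate erf with a polynomial $\mathrm{L}(x)$:
$\mathrm{L}(x)=\operatorname{sgn}(x)\left[a(\operatorname{clip}(|x|, \max =-b)+b)^{2}+1\right]$
The clip shows that the polynomial only needs to fit erf for $|x|$ in $[0, -b]$ (with $b < 0$). erf is an odd function, so the sign is handled by $\operatorname{sgn}(x)$. For large $|x|$, the exponent $-t^2$ is very negative, so $\exp(-t^2)$ approaches zero and erf saturates at a constant (1). The input can therefore be clipped to a bounded interval, on which a second-order polynomial fits erf; substituting $\mathrm{L}$ for erf in the GELU definition then gives the GELU approximation.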
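
A small floating-point sketch of this polynomial GELU follows. The coefficient values `A = -0.2888` and `B = -1.769` are not given in these notes; they are the values I recall from the I-BERT paper and should be treated as an assumption. The function names are likewise made up for illustration.

```python
import math

# Assumed coefficients (recalled from the I-BERT paper, not stated in these notes).
A = -0.2888
B = -1.769

def poly_erf(x: float) -> float:
    """L(x) = sgn(x) * [a * (clip(|x|, max=-b) + b)^2 + 1], a polynomial stand-in for erf."""
    sign = 1.0 if x >= 0 else -1.0
    clipped = min(abs(x), -B)        # clip |x| to [0, -b]
    return sign * (A * (clipped + B) ** 2 + 1.0)

def approx_gelu(x: float) -> float:
    """GELU with erf replaced by the polynomial L."""
    return x * 0.5 * (1.0 + poly_erf(x / math.sqrt(2.0)))

def exact_gelu(x: float) -> float:
    """Reference GELU using math.erf, for comparison only."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

`poly_erf` is not exact near zero, but the factor `x` in GELU suppresses that error, so `approx_gelu` stays close to `exact_gelu` across the whole input range.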

Title: EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference
Year: 2021
Venue: MICRO
Institution: Harvard University

2 EdgeBERT

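**NLP compute demands at the edge:** Intelligent virtual assistants are better suited to running on edge devices, which keeps personal data private and achieves low latency.
Main contributions of the paper:

Entropy-based early exit prediction reduces compute latency, enabling hardware/software co-optimization for multi-task NLP
DVFS (dynamic voltage and frequency scaling) is implemented with a low-dropout voltage regulator (LDO) and an all-digital phase-locked loop (ADPLL)
Weights are stored in embedded non-volatile memories (eNVMs) with bitmask encoding, enabling sparse computation and reducing power and latency overhead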
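
To make the bitmask idea concrete, here is a generic sketch of bitmask encoding for a sparse weight vector. It is not EdgeBERT's actual storage format; the helper names `bitmask_encode` and `bitmask_decode` are hypothetical and only illustrate the principle of storing one presence bit per weight plus the packed nonzero values.

```python
def bitmask_encode(weights):
    """Split a weight list into a 0/1 presence mask and the packed nonzero values."""
    mask = [1 if w != 0 else 0 for w in weights]
    values = [w for w in weights if w != 0]
    return mask, values

def bitmask_decode(mask, values):
    """Rebuild the dense weight list from the mask and the packed nonzero values."""
    it = iter(values)
    return [next(it) if bit else 0 for bit in mask]
```

With this layout, only the nonzero values need full-width storage, and zero weights can be skipped during computation by inspecting the mask.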

BERT variants:

DistilBERT, MobileBERT: knowledge distillation
SqueezeBERT: uses 1-D grouped convolutions
Q8BERT: quantization of BERT
ALBERT: only 12M parameters after compression


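Entropy-based Early Exit: the idea is to let long, complex sentences use a more complex network and short sentences use a simpler one. An entropy metric judges how reliable the classification is; once the entropy $H(x)$ falls below a threshold, the result can be output.
$H(x)=-\sum{p(x)\log p(x)}$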
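
The exit rule can be sketched as follows. This is a simplified illustration, assuming each layer has a classifier that yields a probability vector; the function names and the fallback to the last layer are assumptions made for this note, not details taken from the paper.

```python
import math

def entropy(probs):
    """Shannon entropy H = -sum p * log(p), skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit_layer(per_layer_probs, threshold):
    """Return (layer index, probs) for the first layer whose entropy is below threshold.

    Falls back to the last layer if no layer is confident enough.
    """
    for i, probs in enumerate(per_layer_probs):
        if entropy(probs) < threshold:
            return i, probs
    return len(per_layer_probs) - 1, per_layer_probs[-1]
```

A lower threshold forces more layers to run before exiting, trading latency for confidence in the prediction.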