[Paper Notes] TinyBERT: Distilling BERT for Natural Language Understanding

烫烫烫烫的若愚

Tags: bert, deep learning, natural language processing, model compression

Article link: https://blog.csdn.net/gjh1716718326/article/details/120301542

To accelerate inference and reduce model size while maintaining accuracy, we first propose a novel Transformer distillation method that is specially designed for knowledge distillation (KD) of the Transformer-based models. Then, we introduce a new two-stage learning framework for TinyBERT, which performs Transformer distillation at both the pretraining and task-specific learning stages.

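The Transformer is a fairly complex architecture, and the more complex a model is (the more parameters it has), the more redundancy it tends to carry. Simply shrinking its structure, however, has a large impact on the model. The key contribution of this paper is a distillation method designed specifically for the Transformer.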

There have been many model compression techniques (Han et al., 2016) proposed to accelerate deep model inference and reduce model size while maintaining accuracy. The most commonly used techniques include quantization (Gong et al., 2014), weights pruning (Han et al., 2015), and knowledge distillation (KD) (Romero et al., 2014).

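Model quantization: a quantization algorithm compresses and decompresses numerical values in order to shrink the model and speed up computation. Almost every quantization method achieves compression, but not every one achieves acceleration. Two conditions matter for a speedup: first, the quantization algorithm must be simple enough not to introduce much extra computation; second, the hardware must have a suitable compute library to accelerate the quantized operations. Quantization is therefore fairly hard to apply in practice, especially for someone like me who does not know hardware well.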
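To make the compress/decompress idea concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an illustration only, not something taken from the TinyBERT paper: the function names `quantize_symmetric` and `dequantize` and the sample weights are assumptions made for this example.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] with a single scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, which would give a zero scale.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
codes, scale = quantize_symmetric(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is now stored as a small integer plus one shared `scale`, which is where the compression comes from. The rounding error stays within about half a quantization step. Whether this also yields a speedup depends on integer kernels being available on the target hardware, which this sketch does not address.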

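Model pruning: from my brief reading, pruning methods divide into structured and unstructured pruning. Structured pruning removes coarse-grained structures from the model, such as redundant kernels in a convolutional layer (this seems to be the only use I came across in the material I read). Unstructured pruning works at the finer granularity of individual parameters: all parameters are scored by a criterion such as the L1 or L2 norm, and the least important fraction is set to zero, which yields a sparse model (the structure and the parameter count of the final model do not change; the model merely becomes sparse). Sparse-matrix decomposition methods can then achieve compression, but acceleration is only possible on specific hardware that supports sparse matrix operations, so unstructured pruning is less practical than structured pruning.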
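The difference between the two kinds of pruning can be sketched on a small weight matrix. This is a toy illustration under assumed names (`magnitude_prune`, `prune_rows`) and made-up weights, not the procedure of any particular library or paper.

```python
def magnitude_prune(weights, sparsity):
    """Unstructured pruning: zero the `sparsity` fraction of entries with smallest |w|."""
    rows, cols = len(weights), len(weights[0])
    indices = [(i, j) for i in range(rows) for j in range(cols)]
    indices.sort(key=lambda ij: abs(weights[ij[0]][ij[1]]))
    num_to_zero = int(rows * cols * sparsity)
    pruned = [row[:] for row in weights]
    for i, j in indices[:num_to_zero]:
        pruned[i][j] = 0.0
    return pruned


def prune_rows(weights, keep):
    """Structured pruning: keep only the `keep` rows with the largest L2 norm."""
    norms = [sum(w * w for w in row) ** 0.5 for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original row order
    return [weights[i][:] for i in kept]


# Hypothetical 2x3 weight matrix, for illustration only.
W = [[0.9, -0.05, 0.3],
     [-0.02, 0.7, 0.1]]

sparse_W = magnitude_prune(W, sparsity=0.5)  # same shape, half the entries zeroed
small_W = prune_rows(W, keep=1)              # shape shrinks from 2x3 to 1x3
```

`sparse_W` keeps its 2x3 shape and only gains zeros, so it needs sparse storage and sparse kernels to pay off. `small_W` is a genuinely smaller dense matrix that runs faster on ordinary hardware.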

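Knowledge distillation: the approach of training a small student model under the guidance of a teacher model. Notably, knowledge distillation and structured pruning are somewhat similar: both shrink the model structure to obtain a small model with the redundancy removed. The advantage of pruning is that it can be applied layer by layer and during training; the advantage of distillation is that the student can pick up more generalization information. The two may complement each other, or they may turn out to be different routes to the same destination.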
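As a rough illustration of the student-teacher idea, the sketch below computes a classic soft-target loss on output logits with a temperature. This is the generic logit-level form of KD, not TinyBERT's Transformer distillation; the function names, the temperature value, and the logits are all assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    shift = max(scaled)
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_target_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))


# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]  # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]    # disagrees with the teacher

loss_close = soft_target_loss(close_student, teacher)
loss_far = soft_target_loss(far_student, teacher)
```

A student whose predictions resemble the teacher's gets the lower loss, so minimizing it pulls the student toward the teacher's full output distribution rather than only the hard label. That distribution is one place the extra generalization information can come from.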

The pre-training-then-fine-tuning paradigm firstly pretrains BERT on a large-scale unsupervised text corpus, then fine-tunes it on task-specific dataset, which greatly increases the difficulty of BERT distillation. Therefore, it is required to design an effective KD strategy for both training sta
