Understanding DistilBERT

别水贴了

Category: NLP. Tags: natural language processing, neural networks, machine learning, deep learning

Article link: https://blog.csdn.net/fengzhou_/article/details/107211090

Background

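Pretrained NLP models have grown steadily larger in recent years, and limited compute makes them hard to deploy in production. DistilBERT was proposed as a distilled version of BERT, the most popular pretrained model at the time: it retains 97% of BERT's performance while being 40% smaller and 60% faster at inference. For details, see the paper "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter".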

Knowledge Distillation

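We first introduce the concept of knowledge distillation: a small model (the student) is trained to reproduce the outputs of a large model or an ensemble of models (the teacher). The term comes from Hinton et al.'s paper "Distilling the Knowledge in a Neural Network".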
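In supervised learning, for a classification problem, define the soft label as the model's output (the probability assigned to each label) and the hard label as the correct label (the ground truth). Training usually maximizes the probability of the correct label, typically with cross-entropy as the loss function, pushing the probability of the correct label toward 1 and the probabilities of the other labels toward 0. These near-zero probabilities still differ in size, though: for an image of the digit 2, the probability of misreading it as a 3 is larger than that of misreading it as a 9, even though both are close to 0. This is called "dark knowledge", and it reflects the model's ability to generalize. Because values this close to 0 are hard for the student to learn from, the teacher's output is softened with a softmax with temperature T:
$p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}$, where $z_i$ is the model's score (logit) for class $i$.
When T = 1 this is the standard softmax. During training T > 1 is used so that the student can pick up inter-class information; at inference T is set back to 1, which recovers the standard softmax. The larger T is, the smoother the output distribution.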
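To make the effect of the temperature concrete, here is a minimal sketch in plain Python. The logits are made-up values for three classes (think of the digits 2, 3 and 9), not numbers from the paper, and the function name is only illustrative.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Return softmax(logits / T) as a list of probabilities."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: the first class is the correct one.
logits = [8.0, 3.0, 0.0]
p_t1 = softmax_with_temperature(logits, T=1.0)  # sharp distribution
p_t4 = softmax_with_temperature(logits, T=4.0)  # smoother distribution
```

With T = 4 the top class keeps the largest probability, but the two wrong classes receive noticeably more mass than with T = 1, which is the inter-class signal the student is meant to learn.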
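The figure shows how the model is trained.

Both loss functions (Loss Fn) are cross-entropy. One compares the student's softened output with the teacher's softened output (the soft loss); the other compares the student's output with the ground-truth label (the hard loss). The final loss is a linear combination of these two.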
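The combination of the two losses can be sketched as follows. This is an illustration under stated assumptions rather than the exact recipe from either paper: the weight `alpha`, the default temperature, and all function names are hypothetical.

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred):
    """-sum_i target_i * log(pred_i); terms with target_i == 0 are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def distillation_loss(student_logits, teacher_logits, true_index, T=2.0, alpha=0.5):
    # Soft loss: student vs. teacher, both softened with temperature T.
    soft = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    # Hard loss: student at T = 1 vs. the one-hot ground truth.
    one_hot = [1.0 if i == true_index else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax(student_logits, 1.0))
    # alpha is an assumed mixing weight, not a value from the source.
    return alpha * soft + (1.0 - alpha) * hard

teacher = [5.0, 2.0, 0.0]
loss_agree = distillation_loss([5.0, 2.0, 0.0], teacher, true_index=0)
loss_disagree = distillation_loss([0.0, 2.0, 5.0], teacher, true_index=0)
```

A student whose logits match the teacher's gets a smaller loss than one that ranks the classes in the opposite order.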

DistilBERT: a distilled version of BERT

Student architecture

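The student has the same general architecture as BERT, with the number of layers cut in half. The paper also removes the token-type embeddings and the pooler.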

Student initialization

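Because the student's layers have the same dimensionality as the teacher's, the student can be initialized directly from the teacher: one layer out of every two is kept, and its parameters are reused.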
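The "one layer out of two" initialization can be sketched with plain Python lists. The layer dictionaries below are hypothetical stand-ins for real weight tensors, and keeping the even-numbered layers is an assumption: the text above does not say which layer of each pair is kept.

```python
import copy

def init_student_from_teacher(teacher_layers, step=2):
    """Copy every `step`-th teacher layer into a new list of student layers."""
    # Assumption: start from layer 0 and keep layers 0, 2, 4, ...
    return [copy.deepcopy(layer) for layer in teacher_layers[::step]]

# Hypothetical 12-layer teacher; each "layer" is a dict holding dummy weights.
teacher_layers = [{"name": f"layer_{i}", "weight": [float(i)] * 4} for i in range(12)]
student_layers = init_student_from_teacher(teacher_layers)
```

The deep copy matters: the student's weights start from the teacher's values but are then trained independently.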

Distillation

Training follows the optimization practices of RoBERTa: dynamic masking, a larger batch size, and no next sentence prediction (NSP) loss.
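Dynamic masking means the masked positions are re-sampled every time a sequence is fed to the model, instead of being fixed once during preprocessing. Below is a simplified sketch: it always substitutes `[MASK]` and skips BERT's random-replacement and keep-as-is cases, and the function name and 15% rate are assumptions for illustration.

```python
import random

MASK = "[MASK]"

def dynamic_mask(tokens, rng, mask_prob=0.15):
    """Return (masked_tokens, labels) with a freshly sampled mask."""
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    # Labels keep the original token at masked positions and None elsewhere.
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

tokens = "the quick brown fox jumps over the lazy dog today".split()
rng = random.Random(0)
# A new mask is drawn each time the sentence is seen, e.g. once per epoch.
epochs = [dynamic_mask(tokens, rng) for _ in range(20)]
```

Because the mask changes between passes, the model sees different prediction targets for the same sentence over the course of training.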

Training Loss

The final training objective is a linear combination of the distillation loss $L_{ce}$ with the supervised training loss, in our case the masked language modeling loss $L_{mlm}$. We found it beneficial to add a cosine embedding loss ($L_{cos}$) which will tend to align the directions of the student and teacher hidden states vectors.

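The final loss has three parts:

Distillation loss: $L_{ce}=-\sum_i t_i \log(s_i)$, where $s_i$ is the probability output by the student and $t_i$ is the probability output by the teacher. The higher the $t_i$ predicted by BERT and the lower the $s_i$ predicted by DistilBERT, the higher the loss.
Masked language modeling loss $L_{mlm}$: the same as in BERT. This part is the hard loss.
Cosine embedding loss $L_{cos}$: encourages the student to learn hidden state vectors that point in the same direction as the teacher's.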
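The three terms can be put together in a short sketch for a single masked position. The weights `w_ce`, `w_mlm`, `w_cos`, the temperature default, and all function names are hypothetical. The cosine term is written as 1 minus the cosine similarity, one common way to define such a loss.

```python
import math

def softmax(logits, T=1.0):
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred):
    """-sum_i target_i * log(pred_i); terms with target_i == 0 are skipped."""
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def cosine_loss(u, v):
    """1 - cos(u, v): 0 when the vectors point the same way, 2 when opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distilbert_loss(student_logits, teacher_logits, true_index,
                    student_hidden, teacher_hidden,
                    T=2.0, w_ce=1.0, w_mlm=1.0, w_cos=1.0):
    # L_ce: student vs. teacher distributions, both softened with temperature T.
    l_ce = cross_entropy(softmax(teacher_logits, T), softmax(student_logits, T))
    # L_mlm: student vs. the true masked token (hard label).
    one_hot = [1.0 if i == true_index else 0.0 for i in range(len(student_logits))]
    l_mlm = cross_entropy(one_hot, softmax(student_logits))
    # L_cos: align the direction of student and teacher hidden states.
    l_cos = cosine_loss(student_hidden, teacher_hidden)
    # The weights are assumed values for illustration only.
    return w_ce * l_ce + w_mlm * l_mlm + w_cos * l_cos

aligned = distilbert_loss([4.0, 1.0, 0.0], [4.0, 1.0, 0.0], 0, [1.0, 2.0], [2.0, 4.0])
misaligned = distilbert_loss([0.0, 1.0, 4.0], [4.0, 1.0, 0.0], 0, [1.0, 2.0], [-2.0, -4.0])
```

The cosine term depends only on direction, so a student hidden state that is a scaled copy of the teacher's contributes no loss from that term.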