Notes on Knowledge Distillation Papers

Article link: https://blog.csdn.net/weixin_43402775/article/details/109011296

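This post covers several key techniques and developments in knowledge distillation, including FitNets and Matching Guided Distillation, and discusses how distillation training can improve model efficiency and performance.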

Knowledge distillation papers

1. FitNets: Hints for Thin Deep Nets (ICLR 2015)
2. A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning (CVPR 2017)
3. Matching Guided Distillation (ECCV 2020)
4. A Comprehensive Overhaul of Feature Distillation (ICCV 2019)
5. Knowledge Transfer via Distillation of Activation Boundaries Formed by Hidden Neurons (AAAI 2019)
6. Compressing GANs using Knowledge Distillation
7. GhostNet: More Features from Cheap Operations (CVPR 2020)
8. Data-Free Adversarial Distillation
9. Data-Free Learning of Student Networks (ICCV 2019)

1. FitNets: Hints for Thin Deep Nets (ICLR 2015)

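Paper goal:
In distillation training, to train a deeper student network, a hint is placed at an intermediate layer of the student and compared against the corresponding hint from the teacher network. This makes training faster and improves the result.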
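
The hint comparison can be sketched as a squared-error loss between the teacher's hint features and the student's intermediate features after they are mapped to the teacher's feature size. This is a minimal sketch under assumptions: the linear regressor `W_r`, the flattened `(batch, dim)` feature layout, and all function and variable names are illustrative and do not come from the post. FitNets itself uses a convolutional regressor.

```python
import numpy as np

def hint_loss(teacher_hint, student_guided, W_r):
    """Hint loss between teacher and student intermediate features.

    teacher_hint:   (B, Dt) teacher features at the hint layer (flattened)
    student_guided: (B, Ds) student features at the guided layer (flattened)
    W_r:            (Ds, Dt) regressor mapping student features to teacher size
                    (assumption: a plain linear map, for illustration only)
    """
    regressed = student_guided @ W_r              # (B, Dt)
    diff = teacher_hint - regressed
    # 0.5 * squared L2 distance, averaged over the batch
    return 0.5 * float(np.mean(np.sum(diff ** 2, axis=1)))

teacher = np.array([[1.0, 2.0, 3.0]])
student = np.array([[1.0, 1.0]])
W_r = np.array([[1.0, 0.0, 1.0],
                [0.0, 1.0, 1.0]])
print(hint_loss(teacher, student, W_r))
```

In training, both the student layers up to the guided layer and the regressor would be updated to reduce this loss; the sketch only evaluates it.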


Experiments:
Experiments were run on CIFAR-10, CIFAR-100, SVHN, MNIST, and AFLW.


2. A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning (CVPR 2017)

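Paper goal:
To show that distillation can be used for

Fast training: the model reaches its target performance in less training time.
Model initialization.
Transfer learning (for example, the teacher network is trained for cat/dog classification and the student network for horse/zebra classification).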

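Main contribution:
1. Proposes a distillation training method based on the view that teaching the student the relationships between the features output by different layers works better than teaching it the teacher's outputs directly. In the paper's words: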
The student DNN does not necessarily have to learn the intermediate output when the specific question is input but can learn the solution method when a specific type of question is encountered

Paper content:

1. Defines the FSP (flow of solution procedure) matrix to describe the flow of relationship between two layers
The FSP matrix is generated by the features from two layers
(Figure: network model)
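
As a concrete illustration, the FSP matrix of two feature maps with the same spatial size is the spatially averaged inner product between every channel of the first layer and every channel of the second. The sketch below assumes a channels-last `(H, W, channels)` layout for a single sample; the function name and layout are illustrative choices, not taken from the post.

```python
import numpy as np

def fsp_matrix(f1, f2):
    """FSP matrix between two feature maps of one sample.

    f1: (H, W, m) features from the first layer
    f2: (H, W, n) features from the second layer (same H and W)
    Returns G of shape (m, n) with
        G[i, j] = sum over (h, w) of f1[h, w, i] * f2[h, w, j] / (H * W)
    """
    h, w, m = f1.shape
    if f2.shape[:2] != (h, w):
        raise ValueError("both feature maps must share the same spatial size")
    a = f1.reshape(h * w, m)              # (H*W, m)
    b = f2.reshape(h * w, f2.shape[2])    # (H*W, n)
    return a.T @ b / (h * w)

f1 = np.ones((2, 2, 3))
f2 = 2.0 * np.ones((2, 2, 4))
G = fsp_matrix(f1, f2)
print(G.shape)
```

The result depends only on the channel counts of the two layers, not on the spatial size, which is what makes teacher and student matrices comparable when their channel counts agree.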

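2. Training procedure
Training has two stages: the student is first trained with the FSP loss so that its FSP matrices match the teacher's, and is then fine-tuned on the dataset for the target task.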
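
The first stage can be sketched as a weighted sum of squared differences between teacher and student FSP matrices, one pair per chosen layer pair. This is a minimal sketch for a single sample: it assumes the FSP matrices are already computed and have matching shapes, and the function name and the `weights` argument are illustrative.

```python
import numpy as np

def fsp_loss(teacher_fsp, student_fsp, weights=None):
    """Stage-1 loss: squared distance between FSP matrices.

    teacher_fsp, student_fsp: lists of same-shaped FSP matrices,
                              one entry per layer pair
    weights: optional per-pair weights (assumed 1.0 when omitted)
    """
    if weights is None:
        weights = [1.0] * len(teacher_fsp)
    total = 0.0
    for w, g_t, g_s in zip(weights, teacher_fsp, student_fsp):
        total += w * float(np.sum((g_t - g_s) ** 2))
    return total

g_teacher = [np.array([[1.0, 2.0], [3.0, 4.0]])]
g_student = [np.zeros((2, 2))]
print(fsp_loss(g_teacher, g_student))
```

In the second stage this loss is no longer used; the student is trained on the task loss alone, starting from the weights found in stage one.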

3. Matching Guided Distillation (ECCV 2020)


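Paper goal:
Proposes a new method for the case where the teacher and student features have different dimensions, a mismatch that introduces error when the two are compared. Earlier methods add an extra convolution or an attention module to match the dimensions.

This paper proposes three ways to reduce the number of channels in the teacher's features so that they match the student's, with no need for an extra bridge (a 1×1 convolution) to resolve the feature mismatch.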


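Paper content:
1. Channel matching
Find a matrix M that relates the S and T features,
where S is the feature output by the pretrained student network
and T is the feature output by the pretrained teacher network:
$S = M T$
$S\in \mathbb{R}^{S \times N},M\in \mathbb{R}^{S \times C},T\in \mathbb{R}^{C\times N}$
In the dimensions, S and C are the student and teacher channel counts, and N is the size of the remaining flattened dimension.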

M must also satisfy the following conditions:
(Figure: constraints on M)
2. Channel reduction
Once M is found, the teacher's channels are reduced using one of three methods:
(1) sparse matching
(2) random drop
(3) max pooling
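
A rough sketch of how the three reductions could act on teacher features, given a matching matrix. Everything here is an assumption made for illustration: M is taken to be binary with `M[s, c] = 1` when teacher channel c is matched to student channel s, sparse matching is written as the direct product from $S = M T$, and max pooling is a plain element-wise max over each matched group. The paper's exact rules may differ (for example, pooling on absolute values), and the function and mode names are hypothetical.

```python
import numpy as np

def reduce_teacher(T, M, mode, rng=None):
    """Reduce teacher features T (C, N) to (S, N) using a binary matching M (S, C)."""
    if mode == "sparse":
        # Direct use of S = M T; a pure channel selection when each row of M has a single 1.
        return M @ T
    rng = np.random.default_rng() if rng is None else rng
    out = np.zeros((M.shape[0], T.shape[1]), dtype=T.dtype)
    for s in range(M.shape[0]):
        group = np.flatnonzero(M[s])      # teacher channels matched to student channel s
        if group.size == 0:
            continue                       # unmatched student channel stays zero
        if mode == "random_drop":
            out[s] = T[rng.choice(group)]  # keep one matched channel at random
        elif mode == "max_pool":
            out[s] = T[group].max(axis=0)  # element-wise max over the matched group
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out

T = np.array([[1.0, -5.0], [2.0, 3.0], [-4.0, 0.0]])  # C = 3 teacher channels, N = 2
M_group = np.array([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])  # two teacher channels share student channel 0
M_one = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])    # one teacher channel per student channel
pooled = reduce_teacher(T, M_group, "max_pool")
dropped = reduce_teacher(T, M_group, "random_drop", rng=np.random.default_rng(0))
sparse = reduce_teacher(T, M_one, "sparse")
```

In all three cases the reduced teacher features have the student's channel count, so they can be compared with the student's features without a learned bridge layer.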

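Limitations: the method uses a pretrained student model, which is then fine-tuned with the teacher. Since M measures how strongly the two are correlated, one could instead operate directly on the teacher's features to find the representative ones.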

4. A Comprehensive Overhaul of Feature Distillation (ICCV 2019)


Paper goal:
To design a distillation method by working through four aspects: the teacher transform, the student transform, the distillation feature position, and the distance function.

Paper content:
teacher transform: adds a new ReLU activation (margin ReLU)
student transform: adds a 1×1 conv
distillation feature position: pre-ReLU
distance function: proposes a new partial L2 distance
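
Two of these pieces are simple enough to sketch directly. The margin ReLU clips the teacher's pre-ReLU features from below at a negative margin, and the partial L2 distance skips positions where the teacher value is non-positive and the student value is already at or below it. This is a minimal sketch: the function names are illustrative, and the margin is passed in as a given value rather than estimated from the teacher's responses.

```python
import numpy as np

def margin_relu(t, m):
    """Teacher transform: max(t, m), with a margin m <= 0."""
    return np.maximum(t, m)

def partial_l2(t, s):
    """Partial L2 distance between teacher features t and student features s.

    Positions where s <= t <= 0 contribute nothing; all other positions
    contribute the squared difference (t - s) ** 2.
    """
    skip = (s <= t) & (t <= 0)
    return float(np.sum(np.where(skip, 0.0, (t - s) ** 2)))

t = np.array([-1.0, 2.0, -0.5])
s = np.array([-3.0, 1.0, 0.5])
print(partial_l2(t, s))
print(margin_relu(np.array([-2.0, 0.5]), -1.0))
```

The skipped case reflects that both values would be zeroed by the following ReLU anyway, so pulling the student toward the teacher's exact negative value would add nothing.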