Date: 230924
This week's topic: model distillation
Overview
As AI applications proliferate, deploying algorithms on mobile phones and other edge devices faces growing challenges.
This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.
Related model-compression techniques:
• Parameter pruning and sharing: These methods focus on removing inessential parameters from deep neural networks without any significant effect on the performance. This category is further divided into model quantization (Wu et al., 2016), model binarization (Courbariaux et al., 2015), structural matrices (Sindhwani et al., 2015) and parameter sharing (Han et al., 2015; Wang et al., 2019f).
• Low-rank factorization: These methods identify redundant parameters of deep neural networks by employing the matrix and tensor decomposition (Yu et al., 2017; Denton et al., 2014).
• Transferred compact convolutional filters: These methods remove inessential parameters by transferring or compressing the convolutional filters (Zhai et al., 2016).
• Knowledge distillation (KD): These methods distill the knowledge from a larger deep neural network into a small network (Hinton et al., 2015).
Main idea: the student model mimics the teacher model in order to obtain competitive or even superior performance. The three key components: knowledge, distillation algorithm, and teacher-student architecture.
Recommended papers
Wang, L., & Yoon, K. J. (2020). Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks.
Introduction: a survey on knowledge distillation, presenting comprehensive progress on teacher-student learning for vision from different perspectives, along with its challenges.
Hinton, G., Vinyals, O. & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Introduction: the vanilla knowledge distillation paper.
Section-by-section notes
Section 2: Knowledge
2.1 Response-Based Knowledge
Response-based knowledge usually refers to the neural response of the last output layer of the teacher model. The main idea is to directly mimic the final prediction of the teacher model. The response-based knowledge distillation is simple yet effective for model compression, and has been widely used in different tasks and applications.
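The response-based loss from Hinton et al. (2015) softens both models' logits with a temperature and penalizes their divergence. A minimal NumPy sketch (the logits below are made-up illustrative values, not from any real model):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    z = logits / T
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def response_kd_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher_soft || student_soft), scaled by T^2 as in Hinton et al. (2015)."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student's softened predictions
    return T * T * np.sum(p * (np.log(p) - np.log(q)))

teacher = np.array([3.0, 1.0, 0.2])  # toy teacher logits
student = np.array([2.5, 0.8, 0.3])  # toy student logits
loss = response_kd_loss(teacher, student)
```

In practice this term is combined with the usual cross-entropy on the ground-truth labels; the T^2 factor keeps the gradient magnitudes comparable across temperatures.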
2.2 Feature-Based Knowledge
Not only the output layer but also the intermediate (feature) layers can serve as the knowledge to be transferred.
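A common feature-based variant (in the spirit of FitNets-style hint learning) projects the student's intermediate features into the teacher's dimensionality and matches them with an L2 loss. A sketch with made-up shapes and random toy activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy intermediate activations, shape (batch, channels); dimensions are assumptions.
teacher_feat = rng.standard_normal((8, 64))
student_feat = rng.standard_normal((8, 32))

# A learnable linear "regressor" maps student features to the teacher's width.
W = rng.standard_normal((32, 64)) * 0.1

def feature_kd_loss(f_s, f_t, W):
    """Mean squared error between projected student features and teacher features."""
    return np.mean((f_s @ W - f_t) ** 2)

loss = feature_kd_loss(student_feat, teacher_feat, W)
```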
2.3 Relation-based Knowledge
Response-based and feature-based knowledge both use the outputs of specific layers of the teacher model. Relation-based knowledge, in contrast, explores the relationships between different layers, using the correlations between feature layers as the knowledge to be distilled.
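One concrete instance of cross-layer relational knowledge is the FSP-style matrix: inner products between the channel maps of two layers, matched between teacher and student. A sketch with toy shapes and random activations (all values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Flattened activations of two layers, shape (channels, spatial positions); toy sizes.
t_layer1 = rng.standard_normal((16, 100))
t_layer2 = rng.standard_normal((32, 100))
s_layer1 = rng.standard_normal((16, 100))
s_layer2 = rng.standard_normal((32, 100))

def fsp_matrix(a, b):
    """Cross-layer correlation matrix: inner products of channel maps, averaged."""
    return (a @ b.T) / a.shape[1]

def relation_kd_loss(s1, s2, t1, t2):
    """Match the student's cross-layer relations to the teacher's."""
    return np.mean((fsp_matrix(s1, s2) - fsp_matrix(t1, t2)) ** 2)

loss = relation_kd_loss(s_layer1, s_layer2, t_layer1, t_layer2)
```

The distilled target here is the relation between layers, not any single layer's activation.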
Section 3: Distillation Schemes
Distillation schemes can be divided into three categories: offline distillation, online distillation, and self-distillation.
offline distillation
advantage: simple and easy to implement.
online distillation
both the teacher model and the student model are updated simultaneously
self distillation
the same network is used for the teacher and the student models. This is a special case of online distillation.
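A sketch of the online scheme: in deep mutual learning, two peer students train simultaneously and each uses the other's predictions as soft targets, so the distillation loss is a pair of KL terms. The logits below are toy illustrative values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    return np.sum(p * (np.log(p) - np.log(q)))

# Toy logits from two peer networks being trained together.
logits_a = np.array([2.0, 0.5, 0.1])
logits_b = np.array([1.5, 0.9, 0.2])
p_a, p_b = softmax(logits_a), softmax(logits_b)

loss_a = kl(p_b, p_a)   # peer A mimics peer B's predictions
loss_b = kl(p_a, p_b)   # peer B mimics peer A's predictions
```

Each peer adds its mimicry term to its own supervised loss, and both sets of weights are updated in the same training loop — unlike offline distillation, where the teacher is frozen.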
Section 4: Teacher-Student Architecture
How should the student and teacher architectures be chosen?
A passage from this section that I personally find important:
Recently, depth-wise separable convolution has been widely used to design efficient neural networks for mobile or embedded devices (Chollet, 2017; Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018a; Ma et al., 2018). Inspired by the success of neural architecture search (or NAS), the performances of small neural networks have been further improved by searching for a global structure based on efficient meta operations or blocks (Wu et al., 2019; Tan et al., 2019; Tan and Le, 2019; Radosavovic et al., 2020). Furthermore, the idea of dynamically searching for a knowledge transfer regime also appears in knowledge distillation, e.g., automatically removing redundant layers in a data-driven way using reinforcement learning (Ashok et al., 2018), and searching for optimal student networks given the teacher networks (Liu et al., 2019i; Xie et al., 2020; Gu and Tresp, 2020).
Section 5: Distillation Algorithms
Simple and effective distillation methods directly match the response-based knowledge, feature-based knowledge, or representation distributions in feature space.
This section, however, focuses on other methods, including:
5.1 Adversarial Distillation
5.2 Multi-Teacher Distillation
5.3 Cross-Modal Distillation
5.4 Graph-Based Distillation
5.5 Attention-Based Distillation
5.6 Data-Free Distillation
5.7 Quantized Distillation
5.8 Lifelong Distillation
That's all for this week.