Date: 230924
This week's topic: model distillation
Overview
As AI applications proliferate, deploying algorithms on mobile phones and other edge devices faces growing challenges.
This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.
Related model-compression techniques:
• Parameter pruning and sharing: These methods focus on removing inessential parameters from deep neural networks without any significant effect on the performance. This category is further divided into model quantization (Wu et al., 2016), model binarization (Courbariaux et al., 2015), structural matrices (Sindhwani et al., 2015) and parameter sharing (Han et al., 2015; Wang et al., 2019f).
• Low-rank factorization: These methods identify redundant parameters of deep neural networks by employing the matrix and tensor decomposition (Yu et al., 2017; Denton et al., 2014).
• Transferred compact convolutional filters: These methods remove inessential parameters by transferring or compressing the convolutional filters (Zhai et al., 2016).
• Knowledge distillation (KD): These methods distill the knowledge from a larger deep neural network into a small network (Hinton et al., 2015).
Main idea: the student model mimics the teacher model in order to obtain competitive or even superior performance. The three key components: knowledge, distillation algorithm, and teacher-student architecture.
Recommended papers
Wang, L., & Yoon, K. J. (2020). Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks.
Introduction: a survey on knowledge distillation, presenting comprehensive progress on teacher-student learning for vision from different perspectives, along with its challenges.
Hinton, G., Vinyals, O. & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Introduction: the vanilla knowledge distillation paper.
Section-by-section notes
Section 2: Knowledge
2.1 Response-Based Knowledge
Response-based knowledge usually refers to the neural response of the last output layer of the teacher model. The main idea is to directly mimic the final prediction of the teacher model. The response-based knowledge distillation is simple yet effective for model compression, and has been widely used in different tasks and applications.
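The response-based loss from Hinton et al. (2015) softens both models' logits with a temperature and penalizes their divergence. A minimal NumPy sketch (the logits below are made-up illustrative values, not from any real model):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T gives softer distributions."""
    z = logits / T
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def response_kd_loss(teacher_logits, student_logits, T=4.0):
    """KL(teacher_soft || student_soft), scaled by T^2 as in Hinton et al. (2015)."""
    p = softmax(teacher_logits, T)   # soft targets from the teacher
    q = softmax(student_logits, T)   # student's softened predictions
    return T * T * np.sum(p * (np.log(p) - np.log(q)))

teacher = np.array([3.0, 1.0, 0.2])  # toy teacher logits
student = np.array([2.5, 0.8, 0.3])  # toy student logits
loss = response_kd_loss(teacher, student)
```

In practice this term is combined with the usual cross-entropy on the ground-truth labels; the T^2 factor keeps the gradient magnitudes comparable across temperatures.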
2.2 Feature-Based Knowledge
Not only the output layer but also the intermediate (feature) layers can serve as the knowledge to be transferred.
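A common feature-based variant (in the spirit of FitNets-style hint learning) projects the student's intermediate features into the teacher's dimensionality and matches them with an L2 loss. A sketch with made-up shapes and random toy activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy intermediate activations, shape (batch, channels); dimensions are assumptions.
teacher_feat = rng.standard_normal((8, 64))
student_feat = rng.standard_normal((8, 32))

# A learnable linear "regressor" maps student features to the teacher's width.
W = rng.standard_normal((32, 64)) * 0.1

def feature_kd_loss(f_s, f_t, W):
    """Mean squared error between projected student features and teacher features."""
    return np.mean((f_s @ W - f_t) ** 2)

loss = feature_kd_loss(student_feat, teacher_feat, W)
```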
2.3 Relation-based Knowledge
Response-based and feature-based knowledge both use the outputs of specific layers of the teacher model. Relation-based knowledge, in contrast, explores the relationships between different layers, using the correlations between feature layers as the knowledge to be distilled.
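One concrete instance of cross-layer relational knowledge is the FSP-style matrix: inner products between the channel maps of two layers, matched between teacher and student. A sketch with toy shapes and random activations (all values are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Flattened activations of two layers, shape (channels, spatial positions); toy sizes.
t_layer1 = rng.standard_normal((16, 100))
t_layer2 = rng.standard_normal((32, 100))
s_layer1 = rng.standard_normal((16, 100))
s_layer2 = rng.standard_normal((32, 100))

def fsp_matrix(a, b):
    """Cross-layer correlation matrix: inner products of channel maps, averaged."""
    return (a @ b.T) / a.shape[1]

def relation_kd_loss(s1, s2, t1, t2):
    """Match the student's cross-layer relations to the teacher's."""
    return np.mean((fsp_matrix(s1, s2) - fsp_matrix(t1, t2)) ** 2)

loss = relation_kd_loss(s_layer1, s_layer2, t_layer1, t_layer2)
```

The distilled target here is the relation between layers, not any single layer's activation.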
Section 3: Distillation Schemes
Distillation schemes can be divided into three categories: offline distillation, online distillation, and self-distillation.
offline distillation
advantage: simple and easy to implement.
online distillation
both the teacher model and the student model are updated simultaneously
self distillation
the same network is used for the teacher and the student models. This is a special case of online distillation.
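A sketch of the online scheme: in deep mutual learning, two peer students train simultaneously and each uses the other's predictions as soft targets, so the distillation loss is a pair of KL terms. The logits below are toy illustrative values:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    """KL divergence KL(p || q) for discrete distributions."""
    return np.sum(p * (np.log(p) - np.log(q)))

# Toy logits from two peer networks being trained together.
logits_a = np.array([2.0, 0.5, 0.1])
logits_b = np.array([1.5, 0.9, 0.2])
p_a, p_b = softmax(logits_a), softmax(logits_b)

loss_a = kl(p_b, p_a)   # peer A mimics peer B's predictions
loss_b = kl(p_a, p_b)   # peer B mimics peer A's predictions
```

Each peer adds its mimicry term to its own supervised loss, and both sets of weights are updated in the same training loop — unlike offline distillation, where the teacher is frozen.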
Section 4: Teacher-Student Architecture
How should the student and teacher architectures be chosen?
A passage from this section that I personally find important:
Recently, depth-wise separable convolution has been widely used to design efficient neural networks for mobile or embedded devices (Chollet, 2017; Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018a; Ma et al., 2018). Inspired by the success of neural architecture search (or NAS), the performances of small neural networks have been further improved by searching for a global structure based on efficient meta operations or blocks (Wu et al., 2019; Tan et al., 2019; Tan and Le, 2019; Radosavovic et al., 2020). Furthermore, the idea of dynamically searching for a knowledge transfer regime also appears in knowledge distillation, e.g., automatically removing redundant layers in a data-driven way using reinforcement learning (Ashok et al., 2018), and searching for optimal student networks given the teacher networks (Liu et al., 2019i; Xie et al., 2020; Gu and Tresp, 2020).
Section 5: Distillation Algorithms
Simple and effective distillation methods directly match the response-based knowledge, feature-based knowledge, or representation distributions in feature space.
This section, however, focuses on other methods, including:
5.1 Adversarial Distillation
5.2 Multi-Teacher Distillation
5.3 Cross-Modal Distillation
5.4 Graph-Based Distillation
5.5 Attention-Based Distillation
5.6 Data-Free Distillation
5.7 Quantized Distillation
5.8 Lifelong Distillation
That's all for this week.