Daily Paper 230924 - Knowledge Distillation: A Survey

Date: 230924

This week's topic: model distillation

Overview

As AI applications grow, deploying algorithms on mobile phones and other edge devices faces more and more challenges.

This paper provides a comprehensive survey of knowledge distillation from the perspectives of knowledge categories, training schemes, teacher-student architecture, distillation algorithms, performance comparison and applications. Furthermore, challenges in knowledge distillation are briefly reviewed and comments on future research are discussed and forwarded.

The generic teacher-student framework for knowledge distillation

Related model compression techniques:

• Parameter pruning and sharing: These methods focus on removing inessential parameters from deep neural networks without any significant effect on the performance. This category is further divided into model quantization (Wu et al., 2016), model binarization (Courbariaux et al., 2015), structural matrices (Sindhwani et al., 2015) and parameter sharing (Han et al., 2015; Wang et al., 2019f).

• Low-rank factorization: These methods identify redundant parameters of deep neural networks by employing the matrix and tensor decomposition (Yu et al., 2017; Denton et al., 2014).

• Transferred compact convolutional filters: These methods remove inessential parameters by transferring or compressing the convolutional filters (Zhai et al., 2016).

• Knowledge distillation (KD): These methods distill the knowledge from a larger deep neural network into a small network (Hinton et al., 2015).

Main idea: the student model mimics the teacher model in order to obtain competitive or even superior performance. The three key components are the knowledge, the distillation algorithm, and the teacher-student architecture.

Paper structure

Recommended papers

Wang, L., & Yoon, K. J. (2020). Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. 

Description: a survey of knowledge distillation that presents comprehensive progress on teacher-student learning for vision from different perspectives, along with its challenges.

Hinton, G., Vinyals, O. & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Description: the vanilla knowledge distillation method.

Paper sections in detail

Section 2: Knowledge

2.1 Response-Based Knowledge

Response-based knowledge usually refers to the neural response of the last output layer of the teacher model. The main idea is to directly mimic the final prediction of the teacher model. The response-based knowledge distillation is simple yet effective for model compression, and has been widely used in different tasks and applications.

The generic response-based knowledge distillation
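To make this concrete, here is a minimal sketch of the vanilla response-based distillation loss (Hinton et al., 2015), assuming PyTorch; the temperature T, the weight alpha, and the function name kd_loss are illustrative choices, not values prescribed by the survey.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Vanilla response-based KD: soft targets from the teacher plus hard labels."""
    # Soft-target term: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft-target gradients keep a comparable magnitude
    # Hard-target term: ordinary cross-entropy with the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

A higher temperature softens the teacher's output distribution, exposing the relative probabilities of the non-target classes that the student is asked to mimic.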

2.2 Feature-Based Knowledge

Not only the output layer but also the intermediate (feature) layers can serve as the source of the distilled knowledge.

 

The generic feature-based knowledge distillation

Summary of feature-based knowledge
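As one concrete form of feature-based transfer, below is a minimal sketch of a FitNets-style hint loss, assuming PyTorch; the 1x1 convolutional regressor that aligns channel dimensions and the MSE matching criterion are one common design choice among those surveyed, and the class name is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """Match an intermediate student feature map to a teacher 'hint' layer."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 convolution projecting student features to the teacher's channel count.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # Assumes both feature maps share the same spatial resolution.
        # Teacher features are detached: only the student (and regressor) are trained.
        return F.mse_loss(self.regressor(student_feat), teacher_feat.detach())
```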

 

2.3 Relation-Based Knowledge

Both response-based and feature-based knowledge use the outputs of specific layers of the teacher model, whereas relation-based knowledge explores the relationships between different layers.

The generic instance relation-based knowledge distillation

The correlations between feature layers are used as the knowledge to be distilled.

Summary of relation-based knowledge
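As a concrete instance, the sketch below matches an FSP-style Gram matrix computed between two feature layers (in the spirit of Yim et al., 2017) between teacher and student; it assumes PyTorch, that the paired layers share the same spatial resolution, and that the teacher and student Gram matrices have matching shapes. Names are illustrative.

```python
import torch
import torch.nn.functional as F

def fsp_matrix(feat_a, feat_b):
    """Cross-layer Gram matrix: (B, C1, H, W) x (B, C2, H, W) -> (B, C1, C2)."""
    b, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(b, c1, h * w)
    b_ = feat_b.reshape(b, c2, h * w)
    return torch.bmm(a, b_.transpose(1, 2)) / (h * w)

def relation_kd_loss(student_pair, teacher_pair):
    """Match the student's cross-layer correlation to the teacher's."""
    g_s = fsp_matrix(*student_pair)           # student's relation between two layers
    g_t = fsp_matrix(*teacher_pair).detach()  # teacher provides the target relation
    return F.mse_loss(g_s, g_t)
```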

Section 3: Distillation Schemes

Distillation schemes can be divided into three categories: offline distillation, online distillation, and self-distillation.

Offline distillation

Advantage: simple and easy to implement (the teacher is pre-trained and kept fixed while the student is trained).

Online distillation

Both the teacher model and the student model are updated simultaneously.

Self-distillation

The same network is used for both the teacher and the student model; this is a special case of online distillation.
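To make the difference between the schemes concrete, here is a minimal sketch contrasting an offline step (frozen, pre-trained teacher) with an online step in the style of deep mutual learning, where two peer networks teach each other; it assumes PyTorch, reuses the kd_loss sketched in Section 2.1, and all function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def offline_step(teacher, student, optimizer, x, y):
    """Offline distillation: the teacher is pre-trained and kept fixed."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(x)                  # teacher only provides targets
    loss = kd_loss(student(x), t_logits, y)    # response-based loss from Section 2.1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def online_step(peer_a, peer_b, opt_a, opt_b, x, y):
    """Online distillation: both peers are updated simultaneously."""
    logits_a, logits_b = peer_a(x), peer_b(x)
    # Each peer fits the labels while mimicking the other's (detached) predictions.
    loss_a = F.cross_entropy(logits_a, y) + F.kl_div(
        F.log_softmax(logits_a, dim=1),
        F.softmax(logits_b.detach(), dim=1), reduction="batchmean")
    loss_b = F.cross_entropy(logits_b, y) + F.kl_div(
        F.log_softmax(logits_b, dim=1),
        F.softmax(logits_a.detach(), dim=1), reduction="batchmean")
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()
```

Self-distillation follows the same pattern with a single network, e.g., deeper sections of the model or an earlier snapshot acting as the teacher for shallower sections or later training stages.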

Section 4: Teacher-Student Architecture

How should the student and teacher architectures be chosen?

A passage from this section that I personally find important:

Recently, depth-wise separable convolution has been widely used to design efficient neural networks for mobile or embedded devices (Chollet, 2017; Howard et al., 2017; Sandler et al., 2018; Zhang et al., 2018a; Ma et al., 2018). Inspired by the success of neural architecture search (or NAS), the performances of small neural networks have been further improved by searching for a global structure based on efficient meta operations or blocks (Wu et al., 2019; Tan et al., 2019; Tan and Le, 2019; Radosavovic et al., 2020). Furthermore, the idea of dynamically searching for a knowledge transfer regime also appears in knowledge distillation, e.g., automatically removing redundant layers in a data-driven way using reinforcement learning (Ashok et al., 2018), and searching for optimal student networks given the teacher networks (Liu et al., 2019i; Xie et al., 2020; Gu and Tresp, 2020).

Section 5: Distillation Algorithms

A simple yet effective distillation approach is to directly match the response-based knowledge, the feature-based knowledge, or the representation distributions in feature space.

This section, however, focuses on other methods, including:

5.1 Adversarial Distillation

5.2 Multi-Teacher Distillation

5.3 Cross-Modal Distillation

5.4 Graph-Based Distillation

5.5 Attention-Based Distillation

5.6 Data-Free Distillation

5.7 Quantized Distillation

5.8 Lifelong Distillation

 

That's all for now.
