Abstract
We introduce a novel technique for knowledge transfer, where knowledge from a pretrained deep neural network (DNN) is distilled and transferred to another DNN. As the DNN maps from the input space to the output space through many layers sequentially, we define the distilled knowledge to be transferred in terms of flow between layers, which is calculated by computing the inner product between features from two layers. When we compare the student DNN and the original network with the same size as the student DNN but trained without a teacher network, the proposed method of transferring the distilled knowledge as the flow between two layers exhibits three important phenomena: (1) the student DNN that learns the distilled knowledge is optimized much faster than the original model; (2) the student DNN outperforms the original DNN; and (3) the student DNN can learn the distilled knowledge from a teacher DNN that is trained at a different task, and the student DNN outperforms the original DNN that is trained from scratch.
Summary:
The inner product is used to capture the relationships between layers.
Introduction
Gatys et al. [6] used the Gramian matrix to represent the texture information of the input image. Because the Gramian matrix is generated by computing the inner product of feature vectors, it can contain the directionality between features, which can be thought of as texture information. Similar to Gatys et al. [6], we represent the flow of solving a problem by a Gramian matrix consisting of the inner products between features from two layers. The key difference between the Gramian matrix in [6] and ours is that we compute the Gramian matrix across layers, whereas the Gramian matrix in [6] is built from inner products between features within a single layer. Figure 1 shows the concept diagram of our proposed method of transferring distilled knowledge. The extracted feature maps from two layers are used to generate the flow of solution procedure (FSP) matrix. The student DNN is trained to make its FSP matrix similar to that of the teacher DNN.
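For contrast with the cross-layer FSP matrix defined in the Method section, the following is a minimal sketch of the within-layer Gramian used for texture in [6]; the function name, the (batch, channels, height, width) tensor layout, and the 1/(h·w) normalization are illustrative assumptions rather than details taken from either paper.

```python
import torch

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """Within-layer Gram matrix in the spirit of [6].

    feat: (batch, c, h, w) feature map from a single layer.
    Returns: (batch, c, c) inner products between the layer's own channels.
    """
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)          # flatten the spatial dimensions
    return torch.bmm(f, f.transpose(1, 2)) / (h * w)
```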
Summary:
Earlier neural-style work used the Gram matrix to extract the texture information of an image, but there the Gram matrix is computed from a single convolutional feature map with itself.
The FSP matrix proposed here is much like a Gram matrix, except that it uses inner products between two different layers; the goal is for the student to learn the teacher's cross-layer inner products and thereby learn some of the relationships between layers.
Our paper makes the following contributions:
1. We propose a novel technique to distill knowledge.
2. This approach is useful for fast optimization.
3. Using the proposed distilled knowledge to initialize the weights can improve the performance of a small network.
4. Even if the student DNN is trained on a different task from the teacher DNN, the proposed distilled knowledge improves the performance of the student DNN.
Method
Proposed Distilled Knowledge
The DNN generates features layer by layer. Higher-layer features are closer to the useful features needed for the main task. If we view the input of the DNN as the question and the output as the answer, we can think of the features generated in the middle of the DNN as intermediate results in the solution process. Following this idea, the knowledge transfer technique proposed by Romero et al. [20] lets the student DNN simply mimic the intermediate results of the teacher DNN. However, in the case of a DNN, there are many ways to solve the problem of generating the output from the input. In this sense, mimicking the generated features of the teacher DNN can be a hard constraint for the student DNN.
In the case of people, the teacher explains the solution process for a problem, and the student learns the flow of the solution procedure. The student DNN does not necessarily have to learn the intermediate output when the specific question is input but can learn the solution method when a specific type of question is encountered. In this manner, we believe that demonstrating the solution process for the problem provides better generalization than teaching the intermediate result.
Summary:
The guidance in hint learning is too hard a constraint: the student simply mimics whatever the teacher does, so the knowledge is not distilled enough and the student cannot generalize from it.
Capturing inter-layer relationships with inner products, which can be understood as the logical connections between steps of the solution procedure, is a better way to learn.
Mathematical Expression of the Distilled Knowledge
The flow of the solution procedure can be defined by the relationship between two intermediate results. In the case of a DNN, this relationship can be considered mathematically as the direction between the features of two layers. We designed the FSP matrix to represent the flow of the solution process. The FSP matrix G ∈ ℝ^(m×n) is generated by the features from two layers. Let one of the selected layers generate the feature map F¹ ∈ ℝ^(h×w×m), where h, w, and m represent the height, width, and number of channels, respectively. The other selected layer generates the feature map F² ∈ ℝ^(h×w×n). Then, the FSP matrix G ∈ ℝ^(m×n) is calculated by

G_{i,j}(x; W) = Σ_{s=1}^{h} Σ_{t=1}^{w} F¹_{s,t,i}(x; W) · F²_{s,t,j}(x; W) / (h × w),

where x and W represent the input image and the weights of the DNN, respectively. We prepared residual networks with 8, 26, and 32 layers trained on the CIFAR-10 dataset. There are three points in the residual network for the CIFAR-10 dataset where the spatial size changes. We selected several points to generate the FSP matrix, as shown in Figure 2.
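As a concrete illustration of the definition above, a minimal PyTorch sketch of the FSP computation follows; the function name `fsp_matrix` and the (batch, channels, height, width) tensor layout are illustrative assumptions, and the two feature maps are assumed to share the same spatial size h × w.

```python
import torch

def fsp_matrix(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """FSP matrix G in R^(m x n) built from the feature maps of two layers.

    f1: (batch, m, h, w) feature map from the first selected layer.
    f2: (batch, n, h, w) feature map from the second selected layer,
        assumed to have the same spatial size h x w.
    Returns: (batch, m, n) FSP matrices, one per input image.
    """
    b, m, h, w = f1.shape
    n = f2.shape[1]
    f1 = f1.reshape(b, m, h * w)            # flatten spatial positions
    f2 = f2.reshape(b, n, h * w)
    # Sum of F1_{s,t,i} * F2_{s,t,j} over all spatial positions, divided by h*w.
    return torch.bmm(f1, f2.transpose(1, 2)) / (h * w)
```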
Knowledge does not have to be distilled from the teacher network only at the final softmax layer; it can be extracted from multiple layers.
Loss for the FSP Matrix
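These notes do not reproduce the loss itself. Based on the description in the next section (make the student's FSP matrices match the teacher's during the first stage), a sketch of such a loss is given below; the per-pair weights `lambdas` and the function signature are assumptions for illustration, not the paper's exact formulation.

```python
from typing import Optional, Sequence

import torch

def fsp_loss(teacher_fsp: Sequence[torch.Tensor],
             student_fsp: Sequence[torch.Tensor],
             lambdas: Optional[Sequence[float]] = None) -> torch.Tensor:
    """Squared L2 distance between matched teacher/student FSP matrices,
    averaged over the batch and summed over the selected layer pairs."""
    if lambdas is None:
        lambdas = [1.0] * len(teacher_fsp)
    loss = torch.zeros(())
    for g_t, g_s, lam in zip(teacher_fsp, student_fsp, lambdas):
        # g_t, g_s: (batch, m, n) FSP matrices for one pair of layers.
        loss = loss + lam * ((g_t - g_s) ** 2).sum(dim=(1, 2)).mean()
    return loss
```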
Learning Procedure
Our transfer method uses the distilled knowledge generated by the teacher network. To clearly explain what the teacher network represents in our paper, we define two conditions. First, the teacher network should be pretrained on some dataset. This dataset can be the same as or different from the one that the student network will learn; the teacher network uses a different dataset from the student network in the case of a transfer learning task. Second, the teacher network can be deeper or shallower than the student network; however, we consider a teacher network that is the same depth as or deeper than the student network.
The learning procedure contains two stages of training.
First, we minimize the loss function L_FSP to make the FSP matrices of the student network similar to those of the teacher network. The student network that went through the first stage is then trained with the main-task loss in the second stage. Because we use a classification task to verify the effectiveness of the proposed method, the softmax cross-entropy loss L_ori serves as the main-task loss.
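To make the two-stage procedure concrete, here is a highly simplified training-loop sketch. It reuses the fsp_loss sketch above; the feature-extraction callables, the optimizer handling, and the epoch counts are placeholders of my own, not details from the paper.

```python
import torch
import torch.nn.functional as F

def train_two_stages(student, student_fsp_fn, teacher_fsp_fn,
                     loader, optimizer, stage1_epochs=1, stage2_epochs=1):
    """Stage 1: minimize L_FSP so the student's FSP matrices match the
    teacher's. Stage 2: train the initialized student on the main task."""
    # Stage 1: transfer the distilled knowledge (FSP matrices).
    for _ in range(stage1_epochs):
        for images, _ in loader:
            with torch.no_grad():
                t_fsp = teacher_fsp_fn(images)   # list of teacher FSP matrices
            s_fsp = student_fsp_fn(images)       # list of student FSP matrices
            loss = fsp_loss(t_fsp, s_fsp)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    # Stage 2: fine-tune on the main task with the hard labels (L_ori).
    for _ in range(stage2_epochs):
        for images, labels in loader:
            loss = F.cross_entropy(student(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```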
Summary:
In the first stage, the FSP loss is computed so that the student network's inter-layer relationships become similar to the teacher's.
In the second stage, the student is trained on the hard targets, fine-tuning with the supervised data.
Experiments
If the feature maps of the two layers have different spatial sizes, max pooling is used to adjust the dimensions.
The knowledge learned from the teacher network is used to initialize the student network.
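A minimal sketch of that spatial adjustment, reusing the fsp_matrix sketch above; downsampling the larger map with max pooling is my reading of the note, and adaptive_max_pool2d is just one convenient way to do it.

```python
import torch.nn.functional as F

def fsp_matrix_with_pooling(f1, f2):
    """Max-pool the larger feature map so both share a spatial size,
    then compute the FSP matrix as before."""
    if f1.shape[2:] != f2.shape[2:]:
        if f1.shape[2] > f2.shape[2]:
            f1 = F.adaptive_max_pool2d(f1, tuple(f2.shape[2:]))
        else:
            f2 = F.adaptive_max_pool2d(f2, tuple(f1.shape[2:]))
    return fsp_matrix(f1, f2)
```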
Conclusion
We proposed a novel approach to generate distilled knowledge from a DNN. By defining the distilled knowledge as the flow of the solution procedure, calculated with the proposed FSP matrix, the proposed method outperforms state-of-the-art knowledge transfer methods. We verified the effectiveness of our proposed method in three important aspects: it optimizes the DNN faster, it achieves a higher level of performance, and it can be used for the transfer learning task.