[Paper Translation] Highlight Every Step: Knowledge Distillation via Collaborative Teaching

Abstract

High storage and computational costs hinder the deployment of deep neural networks on resource-constrained devices. Knowledge distillation (KD) aims to train a compact student network by transferring knowledge from a larger pretrained teacher model. However, most existing KD methods ignore the valuable information produced during the teacher's training process and use only its final results. In this article, we provide a new collaborative teaching KD (CTKD) strategy which employs two special teachers. Specifically, one teacher trained from scratch (i.e., the scratch teacher) assists the student step by step using its temporary outputs, forcing the student to follow the optimal path toward the final logits with high accuracy. The other, pretrained teacher (i.e., the expert teacher) guides the student to focus on a critical region that is more useful for the task. Combining the knowledge from the two special teachers can significantly improve the performance of the student network in KD. Experimental results on the CIFAR-10, CIFAR-100, SVHN, Tiny ImageNet, and ImageNet datasets verify that the proposed KD method is efficient and achieves state-of-the-art performance.

1 Introduction

Recently, deep neural networks have achieved superior performance in a variety of applications, such as computer vision [1]–[4] and natural language processing [5], [6]. However, along with this high performance, their architectures have become much deeper and wider, which requires a high cost of computation and memory at inference time. It is a great burden to deploy these models on edge-computing systems, such as embedded devices and mobile phones. Therefore, many methods [7]–[11] have been proposed to reduce the computational complexity and storage of deep neural networks. Lightweight networks, such as Inception [12], MobileNet [13], ShuffleNet [14], SqueezeNet [15], and CondenseNet [16], have been proposed to reduce the network size as much as possible while keeping high recognition accuracy. All the abovementioned methods focus on physically reducing the internal redundancy of the model to obtain a shallow and thin architecture. Nevertheless, how to train the reduced network to high performance remains an unresolved issue.

It is therefore critical to effectively train a compact neural network, and this research issue has attracted increasing attention [17], [18]; knowledge distillation (KD) is considered a practical way to achieve it. Generally speaking, the distilling technique using the teacher–student strategy trains a compact and shallow student network under the guidance of a complicated large teacher network. It is an effective approach to produce a compact neural network with performance close to the complicated teacher network. Once trained, this compact neural network can be directly deployed on resource-constrained devices. KD [19] uses a pretrained teacher's softened outputs as dark knowledge to supervise the training process of the student network. It treats the knowledge as a learned mapping from inputs to outputs and transfers it by training the student with the teacher's outputs as targets. The hint-based training approach [20] and attention transfer [21] are devised to transfer the knowledge of intermediate layers from the teacher network to the student network. Moreover, these approaches based on the teacher–student strategy can be combined with any physical compression method. For example, network quantization can be combined with KD [22] to obtain a low-precision student network with high performance. Despite these very promising results, current methods only utilize forms of knowledge contained in the pretrained teacher network, and may therefore ignore the valuable knowledge produced during the teacher network's training process.

In this article, we optimize the student network with the distilled knowledge from both a scratch teacher and an expert teacher. As illustrated in Fig. 1, the expert teacher (black ball) has already reached the local optimum, and the scratch teacher (red ball) trains continuously with the student (green ball) from scratch. In the process of optimization, the scratch teacher pulls the student toward its optimal path (red arrow), and the expert teacher guides the student to focus on the key region that is more useful for the task (black arrow). Under such collaborative teaching, the student reaches the local optimum with performance close to the teachers. Our motivation is that the scratch teacher and the expert teacher can provide different supervisory information that can be fully utilized through collaborative training. Namely, we use the scratch teacher to jointly train with the student network throughout the entire training process. Due to the strong ability of the scratch teacher, it can guide the student toward the final logits with high accuracy, step by step along the optimization path. However, the scratch teacher also wastes a large number of steps optimizing the path that the expert teacher has already traversed. This is the reason why we use the additional expert teacher to provide intermediate-level hints for the training of the student network. As shown in Fig. 2, the scratch teacher provides temporary logits to supervise the entire training process of the student in the pale green rectangular frame. Meanwhile, the pretrained teacher provides attention maps from the middle of the DNNs to constrain the lower layers of the student. In this manner, the compact student network can produce performance close to the teacher.

Fig. 1. Illustration of our CTKD strategy. We illustrate the optimization process of the student network (green ball) under the collaborative guidance of the scratch teacher (red ball) and the expert teacher (black ball). The red and green lines represent the optimization paths of the scratch teacher and the student network, and the expert teacher has already reached the local optimum. The student network starts the optimization process together with the scratch teacher and the expert teacher.

We verify our proposed collaborative teaching KD (CTKD) method on the CIFAR-10, CIFAR-100, SVHN, Tiny ImageNet, and ImageNet datasets. The experimental results show that our method effectively improves the student's performance in KD.

Our contributions in this article are summarized as follows.
1) We propose a novel teacher–student KD strategy using two teachers; it combines both the path knowledge toward the final logits with high accuracy and the intermediate-level attention knowledge for lower layers. In addition to the final outputs of the pretrained teacher, the proposed architecture can continuously supervise the student network.
2) We analyze the importance of the knowledge from both teachers, and we investigate the effect of attention maps distilled from a deep teacher network on the small student network.
3) We verify our method on several public datasets. Experiments show that our method can significantly improve the performance of student networks in KD.

The remainder of this article is organized as follows. Related work is reviewed in Section II. We present the proposed KD architecture using two teachers in Section III. The experimental results are presented in Section IV. Finally, Section V concludes this article.

2 Related Work

Deep neural networks have demonstrated extraordinary performance on various computer vision [23]–[25] and machine-learning tasks [26]–[29]. Traditional handcrafted features [30] for computer vision tasks have been replaced by deep neural networks, which have a strong ability to fit complicated feature-space distributions. Recently, deep neural networks have become predominant in large-scale competitions [31]–[33]. Researchers design much deeper and wider networks [1], [34], [35]–[37] to further improve classification accuracy, and tend to discover network architectures automatically [38]–[40]. Powered by the computational resources of workstations and GPU clusters, it is possible to train and deploy such complicated deep networks. However, it is almost impossible for resource-constrained devices to run such complicated CNNs due to their computational complexity and high storage requirements. For instance, over 232 MB of memory and more than 7.24 × 10^8 multiplications are demanded for processing one image with AlexNet [41], which cannot be tolerated by these devices [42]. Therefore, compact deep models with similar accuracies are urgently expected.

Indeed, the training phase of deep neural networks is usually performed on CPU and/or GPU clusters. The challenge we really need to face is the deployment of trained models on inference systems, such as resource-constrained devices. During the past few years, many researchers have studied how to deploy these deep neural networks in practice [17], [43], [44]. The number of parameters usually represents the model complexity, but not all parameters contribute to the performance in the inference stage [45]–[47]. Model compression techniques [48]–[51] have emerged to obtain a small model that retains the accuracy of a large one. In the following, we briefly describe the works most related to network model compression and acceleration.

DNN compression and acceleration are important to real-time applications and have gained increasing interest. These methods can be roughly divided into parameter pruning, low-rank decomposition, and KD. Parameter pruning [10], [52], [53] removes redundant weights from the pretrained network model, which can preserve the accuracy of the larger model if the prune ratio is set properly. Recently, channel pruning, which has better compatibility with off-the-shelf computing libraries, has become increasingly popular. Luo et al. [54] proposed to use the statistics of the next layer to select the channels to be pruned. However, parameter pruning approaches require many iterations to converge, and the pruning threshold must be set manually. Low-rank decomposition [49], [55], [56] decomposes the original convolution kernels in the DNN model using matrix decomposition techniques. However, such methods increase the number of layers in the model and easily cause vanishing gradients during training. Both parameter pruning and low-rank decomposition usually lead to large accuracy drops, thus fine-tuning is required to alleviate those drops [57], [58].

Besides, reinforcement learning algorithms can be used for designing networks, such as neural architecture search [59] and MetaQNN [60]. The network itself can search for an efficient structure without manual design. However, these models only focus on high performance rather than the size of the model.


KD methods are used to reduce the computational cost in the test stage. These approaches usually utilize the teacher–student strategy, where a large pretrained teacher network supervises the training of a small student network, to facilitate deployment at test time. Buciluǎ et al. [61] pioneered this series of methods in model compression. They attempted to transfer the knowledge from an ensemble of heterogeneous models to a small model. Lei and Caruana [62] extended this method by forcing a wider and shallower student network to mimic the teacher network's logits before the softmax. Hinton et al. [19] first provided the concept of KD by introducing a temperature hyperparameter that divides the logits before the softmax. The student network is forced to imitate the distribution of the teacher network's soft targets, which contain more information than one-hot targets. In other words, the student's fitting target is no longer the overly strict one-hot vector (ground truth), but the teacher's softened vector, which most often carries the correct prediction. Besides that, researchers attempt to obtain more supervisory information from the teacher network. Romero et al. [20] introduced a new metric on intermediate features between the teacher and student networks. Zagoruyko and Komodakis [21] used attention features from intermediate layers as the supervisory information. Yim et al. [63] proposed a new method that uses the Gram matrix to capture the relationship between layers, so that the student imitates the teacher's process of solving problems. Mishra and Marr [22] and Polino et al. [64] reduced the bit precision of weights and activations by combining KD with network quantization. Xu et al. [65] used a conditional adversarial network to learn the loss function for KD. Recently, Lopes et al. [66] used the teacher model to provide metadata for data-free KD.

Recently, researchers have noted that a teacher model can be effectively improved by self-distillation [67], namely, several models with the same architecture are trained one by one. The deep networks can be optimized over many generations, in which the next model is trained under the supervision of the previous one. Moreover, KD has also been applied to other applications, such as object detection [68], pedestrian reidentification [69], and semantic segmentation [70]. There also exist works that unify KD with privileged information [71]–[73] into generalized distillation, where a teacher is pretrained by taking privileged information as input.

There are also some theoretical and systematic studies about how and why KD improves neural-network training. Furlanello et al. [67] analyzed the success of KD through the gradients on the soft-target part, which act as sampling weights based on the teacher's confidence in its maximum value. Zhang et al. [74] investigated KD via the posterior entropy and proved that soft targets are a much more informed choice than blind entropy regularization.

All the above methods use only one single teacher to provide supervisory information. Recently, Shan et al. [75] attempted to combine the knowledge of multiple teacher networks in the intermediate representations, and Shen et al. [76] aimed at learning a compact student model capable of handling the "super" task from multiple teachers. Mishra and Marr [22] proposed a new perspective for combining network quantization with KD. They jointly train a teacher network and a student from scratch using KD. Zhou et al. [77] also provided a similar scheme where the student network and the teacher network share the lower layers and train simultaneously. The previous study [77] differs from ours in that their one-stage method shares lower layers between the teacher and student networks and uses no additional guidance from a pretrained teacher network, while our two-stage architecture combines intermediate-level features from a pretrained teacher network with the training process of a scratch teacher. It means that both the path knowledge toward the final logits with high accuracy and the intermediate-level attention knowledge for lower layers are used in the training process.

3 Method

The core idea of our method is to jointly train the student network using two teachers: one expert teacher, trained in advance, provides attention maps as intermediate-level supervisory information, and the other, a scratch teacher with random initialization, provides knowledge of the optimal path toward the final logits with high accuracy.

A. Motivation

Existing KD methods [19] let the student network simply mimic the final outputs of the teacher network. However, in the case of DNNs, there are many ways to generate the final outputs, so the student network might approach the final targets along various routes. In this sense, mimicking the outputs of the teacher network can be a hard constraint for the student network. We propose the CTKD method to remedy this situation.

Our motivation is illustrated in Fig. 1, which trains the student network using two teachers, that is, the expert teacher (black ball) and the scratch teacher (red ball). Note that the three balls start training from the same point because they share the same seed. The only difference is that the black ball, which represents the expert teacher, reaches the local optimum along the red curve in advance. Then, we begin to train the student network under the two teachers' guidance, and the green curve describes its optimization path. Let us take one point on the student's optimization path to explain. The green ball is pulled by two forces, from the scratch teacher (red arrow) and from the expert teacher (black arrow), respectively. The scratch teacher, with its strong ability, pulls the student toward its path, and the expert teacher pulls the student to focus on the critical region needed to achieve the final targets. Because the scratch teacher penalizes the student step by step, the student network follows a path close to the scratch teacher's. As shown in Fig. 3, although the student and teacher networks have different structures, they focus on approximately the same region to classify the dog. However, the deep teacher network focuses more on the critical region for the task (the entire head of the dog) than the shallow model. Thus, we use the attention mechanism of the expert teacher to provide key hints that help the student avoid detours. In this manner, the student achieves performance close to the teachers.

Fig. 3. (a) Input image. Visualization of the top activation attention maps of WRN-16-1 (b) and WRN-40-1 (c). The deep model focuses on a more critical region than the shallow one due to its more powerful ability.

As we can see from Fig. 2, we prepare the expert teacher in advance using the normal training process, which is depicted in the blue rectangle. Then, we start to feed data (image batches) to our network, where X_{t-1}, X_t, and X_{t+1} denote three consecutive moments in our training process. The scratch teacher and the student each use the standard cross-entropy loss between their softmax outputs and the ground-truth labels. Furthermore, at every iteration the scratch teacher penalizes the student using an L2 loss between its temporary logits and the student's logits. Note that only the student's parameters are updated during the backpropagation of the L2 loss term, because the scratch teacher does not need to mimic the outputs of the student. However, it is difficult to train a deeper student network using KD without introducing an intermediate constraint. We therefore let the expert teacher provide an intermediate constraint using the attention loss. It constrains the student to focus on the critical region on which the expert teacher concentrates. To train the student network, we optimize the total loss function in (4). We detail the objective function in the next section.

Fig. 2. Illustration of the architecture. The scratch teacher collaboratively trains with the student network from scratch. We use the standard cross-entropy loss for the scratch teacher network and the student network to learn the ground truth, respectively. Moreover, the distillation loss supervises the training of the student network at every step. The expert teacher (pretrained) guides the student network to focus on the critical region through intermediate-level attention maps.

B. Formulation

Deep neural networks can generate features from any layer. KD technology usually uses the features or outputs of different layers as knowledge to transfer from the teacher network to the student network. The higher-layer features are mostly closer to the object parts relevant to a specific task, whereas the lower-layer features are usually typical generic features (e.g., edges and corners). Therefore, we can take the features generated from the lower parts of the DNNs as intermediate hints. All these features contain valuable dark knowledge that can be transferred to guide the student network's training process.

Let us denote by x and y the input of the DNNs and the one-hot labels of our architecture, respectively. We let P_T be the teacher network's softmax output, P_T = softmax(a_T); that is, P_T is obtained by applying the softmax function to the unnormalized log probability values a_T. Similarly, the same image is fed to the student network to get the predictions P_S = softmax(a_S). In the intermediate layers of the DNN, we denote by A ∈ R^{C×H×W} the activation tensor of the corresponding layer, which consists of C feature planes with spatial dimensions H × W. The jth pair of teacher and student attention maps are denoted as F(A_T^j) and F(A_S^j) in vectorized form, respectively [21], and the standard cross-entropy is denoted as H. Hinton et al. [19] extended previous works by training a compact student network to mimic the output probability distribution of the teacher network. They name this informative and representative knowledge dark knowledge. It contains the relative probabilities of the "incorrect" classification results provided by the teacher network. When we perform KD with a temperature parameter τ, the student network is trained to optimize the following loss function:

L_KD = H(y_true, P_S) + λ H(P_T^τ, P_S^τ),  where P_T^τ = softmax(a_T/τ) and P_S^τ = softmax(a_S/τ)     (1)
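To make the soft-target formulation in (1) concrete, the following is a minimal PyTorch sketch of a temperature-softened KD loss. It is an illustration rather than the authors' released code; the weight name lambda_kd and the tau**2 gradient rescaling are common conventions assumed here.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, tau=4.0, lambda_kd=0.9):
    """Hard-label cross-entropy plus a softened distillation term, as in (1)."""
    # Cross-entropy with the ground-truth (one-hot) labels.
    hard = F.cross_entropy(student_logits, labels)
    # KL divergence between softened student and teacher distributions;
    # tau**2 keeps the gradient magnitude comparable to the hard term [19].
    soft = F.kl_div(
        F.log_softmax(student_logits / tau, dim=1),
        F.softmax(teacher_logits / tau, dim=1),
        reduction="batchmean",
    ) * (tau ** 2)
    return hard + lambda_kd * soft
```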

 

Mishra and Marr [22] proposed a new perspective that jointly trains a teacher network (full precision) and a student network (low precision) from scratch using KD. Their total loss combines the cross-entropy of the teacher, the cross-entropy of the student, and a distillation term that aligns the student with the teacher's logits.

In this case, the teacher and student networks both train from scratch. Moreover, the teacher network continuously guides the student network rather than only providing the final trained logits [22]. A similar idea has been studied in [77], where the student network and the teacher network share lower layers and train simultaneously. However, a teacher trained from scratch may provide incorrect guidance to the student network at the beginning of the training stage. Another fact is that it is difficult to train a deeper student using KD without introducing an intermediate constraint.

To this end, we propose a new KD method using two teachers. We denote the expert teacher trained in advance as T1 and the scratch teacher with random initialization as T2. T1 provides an intermediate constraint using attention maps [21] from the lower layers with the following loss function:

L_AT = (β/2) Σ_j || F(A_S^j)/||F(A_S^j)||_2 − F(A_T1^j)/||F(A_T1^j)||_2 ||_2     (3)

where j indexes the teacher–student layer pairs whose attention maps are matched.

F denotes the activation-based mapping function, which takes the above 3-D tensor A as input and outputs a spatial attention map, that is, a flattened 2-D tensor. More specifically, F(A) = Σ_{c=1}^{C} |A_c|^p, the channel-wise sum of absolute values raised to the power of p (where p > 1). T2 provides the log probability values before the softmax, a_T2, as a constraint at every step. It is important to note that this constraint only affects the backpropagation of the student network, to avoid the teacher network converging toward the student network. When we train the compact student, we aim to optimize the following loss function:

L_total = H(y_true, P_S) + H(y_true, P_T2) + λ ||a_S − a_T2||_2^2 + L_AT     (4)

The first part of the total loss ensures that T and S train in the original manner independently. In the second part, we denote the KD loss [62] as the L2 loss between the logits a_S and a_T2. To optimize this term, the log probability values a_S from the student network mimic the pre-softmax activations a_T2 from the scratch teacher network, so the student network benefits from the supervisory information of the teacher network during the whole training process. The complex teacher model, with more learning capability, can provide a possible path toward the final target. The last part of our architecture provides intermediate-level hints from the pretrained teacher network.
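As an illustration of how the attention term and the total objective in (4) can be assembled, here is a minimal PyTorch sketch. The helper names, the normalization of the attention maps (following the common practice of [21]), and the default values of lam and beta are assumptions for the example rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def attention_map(a, p=2):
    """F(A): channel-wise sum of |A|^p, flattened to a vector and L2-normalized."""
    # a: (N, C, H, W) -> (N, H*W)
    return F.normalize(a.abs().pow(p).sum(dim=1).flatten(1), dim=1)

def at_loss(student_acts, teacher_acts, p=2):
    """Attention-transfer term: distance between matched pairs of attention maps."""
    return sum(
        (attention_map(s, p) - attention_map(t, p)).pow(2).mean()
        for s, t in zip(student_acts, teacher_acts)
    )

def ctkd_loss(s_logits, t2_logits, labels, s_acts, t1_acts, lam=1000.0, beta=0.2):
    """Total loss sketched from (4): CE for student and scratch teacher, L2 between
    logits, and attention hints from the pretrained expert teacher T1."""
    ce_s = F.cross_entropy(s_logits, labels)
    ce_t2 = F.cross_entropy(t2_logits, labels)
    # Detach the scratch teacher's logits so this term only updates the student.
    l2_logits = (s_logits - t2_logits.detach()).pow(2).mean()
    return ce_s + ce_t2 + lam * l2_logits + beta * at_loss(s_acts, t1_acts)
```

Here beta plays the role of the β/2 weight on the attention term in (3).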

C. Training Procedure

The learning procedure contains two training stages. In the first stage, we minimize the cross-entropy loss H(y_true, P_T1) to initialize the parameters of the expert teacher (T1). Then, we train the student network using the two teachers T1 and T2 simultaneously by optimizing the total loss function shown in (4). The learning procedure is summarized in Algorithm 1.
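A condensed sketch of the two-stage procedure is given below for orientation, reusing the ctkd_loss helper from the previous sketch. It assumes each model's forward pass returns both the intermediate activations used for attention and the logits, and that t1, t2, student, and train_loader are defined elsewhere; these are conventions of this example rather than details given in the paper.

```python
import torch

# Stage 1: pretrain the expert teacher T1 with plain cross-entropy (omitted here).
t1.eval()  # the expert teacher stays frozen in stage 2

# Stage 2: train the scratch teacher T2 and the student S collaboratively.
optimizer = torch.optim.SGD(
    list(t2.parameters()) + list(student.parameters()), lr=0.1, momentum=0.9
)
for images, labels in train_loader:
    with torch.no_grad():
        t1_acts, _ = t1(images)          # attention sources from the expert teacher
    t2_acts, t2_logits = t2(images)      # scratch teacher, trained jointly
    s_acts, s_logits = student(images)   # compact student
    loss = ctkd_loss(s_logits, t2_logits, labels, s_acts, t1_acts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```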

 

Fig. 4. Structure of WRNs. (a) Basic residual block used in our base architecture. The widen factor m determines the network's width and n is the number of residual blocks in each group. (b) and (c) A teacher–student pair, WRN-40-1 and WRN-16-1.

 

Our proposed method jointly trains the student network using two teachers. It is crucial to combine the temporary outputs of the scratch teacher with the intermediate features of the expert teacher throughout the entire training process. The scratch teacher T2 guides the student step by step using the log probability values before the softmax. Due to the powerful learning capability of the scratch teacher, it brings the student close to the final target along the optimal path. However, the supervisory information from a single scratch teacher alone is not enough, because the scratch teacher attempts many paths to find the optimal one, and the student follows it, pacing back and forth. Thus, we need the expert teacher to provide intermediate hints, such as attention maps. Note that our attention maps play a role similar to the hints in FitNet [20], without introducing new weights. Thus, the expert teacher, which provides attention maps as intermediate hints, lets the student find the correct path more quickly and more reliably than employing "hints" as information.

We will demonstrate that the student trained with our KD method achieves improved performance in Section IV. However, one might ask how the scratch teacher affects the training process of the student network. If the scratch teacher works, why not use it alone to train the student network? Or would other knowledge from the expert teacher be more helpful than attention knowledge? We investigate these questions from both empirical and theoretical aspects in Section IV.

4 Experiments

In this section, we verify the effectiveness of our proposed CTKD method and investigate the importance of collaborative teaching. Experiments are conducted on several standard datasets: CIFAR-10, CIFAR-100, SVHN, Tiny ImageNet, and ImageNet. We compare our proposed CTKD method with existing KD methods, including KD [19], attention transfer KD (ATKD) [21], and rocket launching KD (RLKD) [77]. We implement the networks with PyTorch and train on 1080Ti GPUs. Note that there are several hyperparameters in our experiments that need to be consistent. For the original KD method, we set the temperature factor for the softened softmax to 4 as in [19], and the β of AT is set following [21]. The code is available at https://github.com/oucocean-group/CTKD.

A. Experimental Setup

Network Architecture: For all experiments, we employ the wide residual network (WRN) [78] as the base architecture for the teacher and student networks. The WRN stacks the basic residual blocks [1] shown in Fig. 4(a) to achieve state-of-the-art performance. Moreover, it uses an additional widen factor m to increase the width, which brings more representation ability. The WRN has a standard convolutional layer (conv) followed by three groups of residual blocks, each of size n. Furthermore, the total depth and the widen factor serve as a proxy for the size or flexibility of the network architecture. In the following sections, the architecture of a WRN is denoted as WRN-d-m [79], where the total depth is d = 6n + 4, n represents the number of residual blocks per group, and m is the widen factor used to increase the number of filters in each residual block. Our teacher network is a deep and wide WRN with large d and m, while the student network is a shallow and thin WRN with small d and m. As shown in Fig. 4(b) and (c), WRN-40-1 is our teacher network and the student network uses WRN-16-1.
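As a quick check of this naming convention, the depth formula d = 6n + 4 can be exercised with a tiny hypothetical helper:

```python
def wrn_name(n_blocks_per_group, widen_factor):
    """WRN-d-m naming used in the paper: total depth d = 6*n + 4."""
    return f"WRN-{6 * n_blocks_per_group + 4}-{widen_factor}"

assert wrn_name(6, 1) == "WRN-40-1"   # teacher: n = 6 blocks per group
assert wrn_name(2, 1) == "WRN-16-1"   # student: n = 2 blocks per group
```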

Implementation Details: We first conduct our experiments on the public dataset CIFAR-10, which consists of 32 × 32 small RGB images. For all experiments, we use minibatches of size 128 for training. Moreover, we use horizontal flips and random crops for data augmentation before each minibatch. The learning rate starts at 0.1 and is reduced by a factor of 0.2 at epochs 60, 120, and 160. For the CIFAR datasets, we use stochastic gradient descent (SGD) with momentum fixed at 0.9 for 200 epochs. However, for the SVHN dataset, which is easy to learn, we use Adam [80] with an initial learning rate of 0.01 and drop the learning rate by a factor of 0.2 at epochs 20, 40, and 60. Furthermore, all networks use batch normalization [81].
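For reference, the schedule described above maps onto standard PyTorch components roughly as follows; the crop padding of 4 pixels is a common CIFAR convention assumed here, and model stands for whichever network is being trained.

```python
import torch
import torchvision.transforms as T

# Augmentation before each minibatch: random crops and horizontal flips.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),   # padding value assumed, not stated in the paper
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])

# CIFAR: SGD with momentum 0.9, lr 0.1 multiplied by 0.2 at epochs 60, 120, 160.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 120, 160], gamma=0.2
)
```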

B. CIFAR-10

The CIFAR-10 dataset [82] contains 32 × 32 small RGB images from ten classes. It consists of 50K training images (5K per class) and 10K testing images (1K per class). We use the 32 × 32 RGB images after random crops and horizontal flips for training, and the original 32 × 32 RGB images are used for testing.

We use deep and wide WRNs (e.g., WRN-40-1 and WRN-40-2) as the teacher networks, whereas the student networks use shallow and thin WRNs (e.g., WRN-16-1 and WRN-16-2). Note that we first train the expert teacher network using the normal training procedure on the CIFAR-10 dataset, which yields 93.43% accuracy on the classification task. The scratch teacher network and the student network are randomly initialized. We use the scratch and expert teacher networks to collaboratively supervise the training of the student network as described in Fig. 1.

From the experimental results in Table I, we can see that our proposed CTKD method improves the generalization ability of the student network and achieves notable improvement compared to the existing methods. Note that all the numbers are the results of our implementation. We implement KD and ATKD according to [21]. We repeat each experiment five times with different seeds and take the median classification accuracy as the final result. We set up two teacher–student pairs, that is, WRN-16-1 with a WRN-40-1 teacher and WRN-16-2 with a WRN-40-2 teacher. Taking the left part of the table as an example, we use WRN-16-1 as the student network, and WRN-40-1 is used as both the scratch and the expert teacher network. We train the expert teacher independently using the normal training procedure, and it reaches 93.43% accuracy. Furthermore, the student network trained with the normal method shows a 91.28% recognition rate. Remarkably, our new CTKD architecture reaches 92.50% accuracy, a 1.22% improvement over the independently trained student, and the performance of the student in our method is close to that of the teacher network. Moreover, we compare the performance of the student network with the existing KD methods (i.e., KD, ATKD, and RLKD), and the proposed method with distilled knowledge clearly performs better than the existing ones. As shown in Fig. 5(a), the student trained with our KD method achieves a significant improvement over training individually (baseline). We plot the testing accuracy and training loss curves of all the experiments in Fig. 5(b), which shows the recognition results of different knowledge transfer methods compared with ours on the CIFAR-10 dataset. We can observe that our CTKD method achieves a significant improvement in final accuracy and outperforms existing methods. It can also be noticed that our method has a fast convergence speed. This will be further discussed in the next part with more comparisons.

Fig. 5. (a) Testing accuracy of the scratch teacher, the student trained with our KD method, and the student trained individually. (b) Training loss and testing accuracy of different knowledge transfer methods on CIFAR-10.

The improvement of our CTKD method is attributed to the supervisory information from both teachers. We compare the accuracy of the student DNN in our KD architecture under different combinations of teacher DNNs. As shown in Table II, for the generalization ability of the student DNN, the two teachers are equally important and complement each other. The recognition rate of the student network under the guidance of the scratch teacher alone is 91.54%. It reaches 91.77% accuracy when we only use the expert teacher's attention maps as supervisory information in the training process. Interestingly, the accuracy of the student network reaches 92.50% when we collaboratively train it with the scratch teacher network and the expert teacher network.

The scratch teacher provides its temporary logits to guide the student toward its optimization path. To verify this, we train the student network under the guidance of the scratch teacher alone, as in RLKD [77]. As shown in Fig. 6(a), the testing accuracy curve of the student tightly follows the scratch teacher's. However, we can see that the performance of the teacher in Fig. 6(a) is affected by the parameter sharing in the lower layers, and the performance of the teacher network in turn limits the student's results. Our method, which introduces the expert teacher, improves on this, as shown in Fig. 6(b).

Fig. 6. (a) Testing accuracy and loss of the teacher and student networks in [77]. (b) Testing accuracy and loss of the scratch teacher and student networks in our method.

Why do we use the attention maps as intermediate knowledge from the expert teacher network? We expect the student to focus on the same key region as the expert teacher model during the entire training process. As shown in Fig. 3, we visualize the top-level activation attention maps of pretrained WRN-40-1 and WRN-16-1 on the ImageNet dataset using the visualization technique in [83]. We can observe that the attention maps of models with different depths focus on different regions. Specifically, the deeper teacher model, with its more powerful ability, focuses on the pivotal region for classifying the input image, whereas the shallow student model focuses on a wider area. Thus, we make the student network mimic the attention maps of the expert teacher network. In the training process, the student network learns to focus on the key region under the guidance of the expert teacher through the attention maps.
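One simple way to produce such activation attention maps for display (not necessarily the exact technique of [83]) is to collapse an intermediate activation over channels and upsample it to the input resolution, for example:

```python
import torch
import torch.nn.functional as F

def visualize_attention(activation, image_size, p=2):
    """Collapse an activation (N, C, H, W) into a per-pixel saliency map for display."""
    amap = activation.abs().pow(p).sum(dim=1, keepdim=True)      # (N, 1, H, W)
    amap = F.interpolate(amap, size=image_size, mode="bilinear", align_corners=False)
    # Normalize each map to [0, 1] so it can be overlaid on the input image.
    flat = amap.flatten(1)
    flat = (flat - flat.min(dim=1, keepdim=True).values) / (
        flat.max(dim=1, keepdim=True).values - flat.min(dim=1, keepdim=True).values + 1e-8
    )
    return flat.view_as(amap)
```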

To demonstrate the effectiveness of the attention mechanism in our collaborative teaching architecture, we transfer different forms of intermediate knowledge from the expert teacher to the student network. FitNet [20] provides a kind of intermediate supervised knowledge, that is, the feature maps from the middle layers of DNNs. Another form of intermediate knowledge is the weights transferred from the teacher network to the student network, which is possible due to the same architecture used in both the teacher and student models. But such supervisory information may be a hard constraint for the student network. Table III shows the accuracy of the student network when different kinds of intermediate knowledge are transferred from the middle outputs of the expert teacher network in our architecture. The first row gives the accuracy of the student network when the expert teacher uses the intermediate knowledge of FitNet [20]; its performance is slightly better than the individually trained one. The second row shows that the student network which directly transfers weights from the expert teacher for initialization also obtains a slight improvement. The student network trained with our proposed KD method, in the last row, achieves the best performance, because the attention maps merely hint the student network to focus on the key region instead of imposing a hard constraint on the lower layers.

 C. CIFAR-100 and SVHN

In this section, we verify the effectiveness of our proposed method on classification tasks on the CIFAR-100 and SVHN datasets. We also add ResNet, an advanced CNN architecture with skip connections, to show the proposed method's generalization to different architectures. We adopt ResNet-110 as the teacher network and ResNet-20 as the student network.

The CIFAR-100 dataset [82] contains 50K training images and 10K testing images. However, it contains 100 classes, which makes it more challenging than CIFAR-10. Due to the more complicated classification task, we set the widen factor to 2 for our WRN architecture. Thus, we use WRN-40-2 as the teacher network and WRN-16-2 as the student network.

The SVHN dataset [84] is similar to MNIST, with small 32 × 32 RGB cropped digits in ten classes, obtained from house numbers in Google Street View images. SVHN has 73,257 images for training, 26,032 images in the testing set, and 531,131 additional samples.

As shown in Table IV, the student networks WRN-16-2/ResNet-20 trained with our CTKD method achieve 74.70%/70.75% classification accuracy on the CIFAR-100 dataset, a 2.43%/2.42% improvement over the student networks trained individually. We also compare our proposed CTKD method with some of the most recent state-of-the-art KD methods, and the student collaboratively trained with our proposed method outperforms all of them. Fig. 7 shows the accuracy curves over time of the different KD methods on CIFAR-100. Interestingly, by comparing Figs. 5(b) and 7, we observe that our method brings a larger improvement here than on the CIFAR-10 dataset. Considering that the CIFAR-100 dataset and the WRN with widen factor 2 are more complicated than CIFAR-10, we believe that our method is an effective technique for transferring knowledge to a compact network. For the SVHN dataset, we use Adam with an initial learning rate of 0.01, as described in the implementation details, and train the network for 100 epochs. Furthermore, the student WRN-16-1/ResNet-20 also achieves a 1.35%/1.22% improvement compared with the baseline.

Fig. 7. Training loss and testing accuracy of different KD approaches on the CIFAR-100 dataset.

D. Tiny ImageNet and ImageNet

We also validate the proposed method by conducting the image classification task on a much more challenging dataset, the Tiny ImageNet dataset [85], which is a popular subset of the ImageNet database [31]. Tiny ImageNet contains 64 × 64 images from 200 classes. Each class has 500 training images, 50 validation images, and 50 test images.

In our Tiny ImageNet classification experiments, we apply random rotation and horizontal flipping for data augmentation. We optimize the model using SGD with minibatch size 128 and momentum 0.9. The learning rate starts from 0.1 and is multiplied by 0.2 at epochs 60, 120, 160, 200, and 250. We train the network for 300 epochs in total and adopt the deep and wide WRN (WRN-40-1) as the teacher model and WRN-16-1 as the student model.

Table V shows the classification results on Tiny ImageNet. The student network (WRN-16-1) trained with our CTKD method achieves 53.59% classification accuracy, a 2.94% improvement compared with the student network trained individually. The overall results show that the proposed CTKD method outperforms the recent state-of-the-art KD methods.

To further investigate the effectiveness of the proposed method and make the performance gain convincing, we extend the experiments to a large-scale dataset: ImageNet [31]. This dataset consists of 1.28M training images and 50K testing images. We crop the images to a size of 224 × 224 for training and testing. We employ ResNet-152 as the teacher network and ResNet-50 as the student network. The initial learning rate is set to 0.1 and multiplied by 0.1 at epochs 30, 50, and 80, for 100 epochs in total.

The experiments use the pretrained model in the PyTorch library as the teacher network for simple reproduction. As can be seen from Table VI, we compare our CTKD method with three of the latest algorithms. The results of our method show a clear improvement: the proposed method reaches 77.95% accuracy, surpassing the baseline by 1.67%.

E. Ablation Study
Most existing well-performing KD methods force the compact student to mimic the pretrained teacher's outputs. However, there is a gap between the shallow student network and the deep teacher network due to their different network structures, so learning the pretrained teacher's knowledge can be a hard constraint for the student network. Thus, we use a scratch teacher to supervise the training of the student using the temporary outputs of every step. The scratch teacher provides optimal path information to the student network, as in Fig. 6(a). Moreover, the expert teacher only provides key hints, using attention maps for the lower layers, which are close to common features. This indicates that the student network is trained under the collaborative supervision of two teachers. As shown in Fig. 6(b), both the student and the teacher network achieve higher performance than the method of [77] shown in Fig. 6(a).

Considering the computational cost, we conduct another experiment that compares KD, ATKD, and RLKD with a two-pretrained-teacher ensemble model for a fair comparison. Different ensemble models of the teacher networks are evaluated together with their training and testing costs. As can be seen from Table VII, the proposed method has the lowest model training cost and the same test cost as the other two-teacher ensemble models. Although the single-teacher KD methods have a lower training cost than CTKD, the proposed method achieves a clear accuracy improvement. Indeed, the training phase of the networks is usually performed on CPU and/or GPU clusters; the challenge we really need to face is the deployment of trained models on inference systems, such as resource-constrained devices. Moreover, the student network of CTKD achieves a 6.29× compression rate, as reflected by its number of parameters.

To further investigate the impact of the two teachers, we conduct experiments using a pretrained teacher network and a scratch network for T2, respectively. Specifically, in CTKD† we replace the scratch network with a pretrained network for T2. As can be seen from Table VIII, our method achieves a 1.22% accuracy improvement over CTKD†, which uses two pretrained teacher networks. Moreover, using a scratch network for T2 saves the training cost of one teacher network. The improvement in performance comes from the additional knowledge provided when we employ the scratch teacher T2 instead of a pretrained T2. We consider that a pretrained T2 only transfers the final stationary outputs of the pretrained model, whereas the valuable knowledge exists not only in the final outputs but also in the training process. The scratch teacher T2 provides not only the differences between the true labels and the temporary logit outputs but also the path information toward the final target. Note that the scratch teacher T2 starts from a different initial condition than the pretrained teacher T1; thus, they can learn different representations and provide extra information in our framework. CTKD‡ shows the performance when using a scratch teacher for T1. We expect the attention mechanism to help the student focus on the most critical part, and obviously attention maps from a scratch T1 cannot provide accurate supervisory information in the early training phase and may even mislead the student.

Sensitivity Analysis: Table IX illustrates how the performance of the CTKD method is affected by the choice of the hyperparameters λ and β. We train the ResNet-20 student network on CIFAR-100 with λ ranging from 500 to 2500 and β ranging from 0.1 to 0.3. We find that a large value of λ causes the accuracy to deteriorate rapidly, because T2 is trained from scratch and provides uncertain information.


Why does our collaborative teaching approach work? First, the scratch teacher transfers its path information to the student at every step, as shown in Fig. 6(a). Though it may make mistakes in its training process, it at least provides a path to higher performance than the student's own. Second, the expert teacher provides additional supervisory information to the student network. However, which kind of knowledge from the expert teacher is most effective and suitable in our collaborative teaching approach? We investigate the effects of the different kinds of knowledge that the expert teacher can provide in our structure, and the attention mechanism achieves excellent results. The expert teacher only tells the student network where to look during the training process. Despite the student's weaker ability, the expert teacher's information makes it possible to catch up with the scratch teacher. We compare our method with most existing KD approaches on the CIFAR-10, CIFAR-100, SVHN, Tiny ImageNet, and ImageNet datasets in Sections IV-B–IV-D.

5 Conclusion

In this article, we proposed a novel and efficient KD method to train a compact student neural network, which can be directly deployed on resource-constrained devices. We show that the scratch teacher and the expert teacher can provide different knowledge from the training process and from the training results. To fully utilize both kinds of knowledge, we propose the CTKD method for transferring knowledge from the teachers to the student network. In detail, we use the scratch teacher to supervise every step of the student's training process; it can guide the student toward the final logits with high accuracy, step by step along the optimization path. The expert teacher only constrains the student to focus on the critical region during the entire training process. In this manner, the compact student network can produce performance close to the teacher's. We compare our proposed CTKD method with the state-of-the-art KD methods. The experimental results show that our method brings a significant improvement in the classification performance of student networks on the CIFAR-10, CIFAR-100, SVHN, Tiny ImageNet, and ImageNet datasets. We believe our method is a valuable complement to the state of the art.
