[CVPR 2022] Anomaly Detection via Reverse Distillation from One-Class Embedding (Translation)

毕加猪plus

Article link: https://blog.csdn.net/Vincent_Tong_/article/details/130439663

Anomaly Detection via Reverse Distillation from One-Class Embedding

Tags: CVPR
Year: 2022

GitHub - hq-deng/RD4AD: Anomaly Detection via Reverse Distillation from One-Class Embedding

GitHub - Merenguelkl/Reverse_Disstilation: Unofficial Implementation of “Anomaly Detection via Reverse Distillation from One-Class Embedding, CVPR2022”

Abstract

Knowledge distillation (KD) achieves promising results on the challenging problem of unsupervised anomaly detection (AD). The representation discrepancy of anomalies in the teacher-student (T-S) model provides essential evidence for AD. However, using similar or identical architectures to build the teacher and student models in previous studies hinders the diversity of anomalous representations. To tackle this problem, we propose a novel T-S model consisting of a teacher encoder and a student decoder and introduce a simple yet effective "reverse distillation" paradigm accordingly. Instead of receiving raw images directly, the student network takes the teacher model's one-class embedding as input and aims to restore the teacher's multi-scale representations. Inherently, knowledge distillation in this study proceeds from abstract, high-level representations to low-level features. In addition, we introduce a trainable one-class bottleneck embedding (OCBE) module in our T-S model. The obtained compact embedding effectively preserves essential information on normal patterns but discards anomaly perturbations. Extensive experiments on AD and one-class novelty detection benchmarks show that our method surpasses SOTA performance, demonstrating the effectiveness and generalizability of the proposed approach.

1 Introduction

Anomaly detection (AD) refers to identifying and localizing anomalies with limited, or even no, prior knowledge of abnormality. The wide applications of AD, such as industrial defect detection [3], medical out-of-distribution detection [50], and video surveillance [24], make it a critical task as well as a spotlight. In the context of unsupervised AD, no prior information on anomalies is available. Instead, a set of normal samples is provided for reference. To tackle this problem, previous efforts attempt to construct various self-supervision tasks on those anomaly-free samples. These tasks include, but are not limited to, sample reconstruction [2, 5, 11, 16, 26, 34, 38, 48], pseudo-outlier augmentation [23, 42, 46], and knowledge distillation [4, 33, 39].

In this study, we tackle the problem of unsupervised anomaly detection from the knowledge distillation-based point of view. In knowledge distillation (KD) [6, 15], knowledge is transferred within a teacher-student (T-S) pair. In the context of unsupervised AD, since the student experiences only normal samples during training, it is likely to generate discrepant representations from the teacher when a query is anomalous. This hypothesis forms the basis of KD-based methods for anomaly detection. However, this hypothesis is not always true in practice due to (1) the identical or similar architectures of the teacher and student networks (i.e., non-distinguishing filters [33]) and (2) the same data flow in the T-S model during knowledge transfer/distillation. Though the use of a smaller student network partially addresses this issue [33, 39], the weaker representation capability of shallow architectures hinders the model from precisely detecting and localizing anomalies.

To holistically address the issue mentioned above, we propose a new paradigm of knowledge distillation, namely Reverse Distillation, for anomaly detection. We use simple diagrams in Fig. 2 to highlight the systematic difference between conventional knowledge distillation and the proposed reverse distillation. First, unlike the conventional knowledge distillation framework where both teacher and student adopt the encoder structure, the T-S model in our reverse distillation consists of heterogeneous architectures: a teacher encoder and a student decoder. Second, instead of the raw data being fed to both teacher and student simultaneously, the student decoder takes the low-dimensional embedding as input and aims to mimic the teacher's behavior by restoring the teacher model's representations at different scales. From the regression perspective, our reverse distillation uses the student network to predict the representation of the teacher model. Therefore, "reverse" here indicates both the reversed shapes of the teacher encoder and student decoder and the distinct knowledge distillation order, where high-level representations are distilled first, followed by low-level features. It is noteworthy that our reverse distillation presents two significant advantages: i) Non-similarity structure. In the proposed T-S model, one can consider the teacher encoder as a down-sampling filter and the student decoder as an up-sampling filter. The "reverse structures" avoid the confusion caused by non-distinguishing filters [33] as we discussed above. ii) Compactness embedding. The low-dimensional embedding fed to the student decoder acts as an information bottleneck for normal pattern restoration. Let us formulate anomaly features as perturbations on normal patterns. Then the compact embedding helps to prohibit the propagation of such unusual perturbations to the student model and thus boosts the T-S model's representation discrepancy on anomalies. Notably, traditional AE-based methods [5, 11, 16, 26] detect anomalies using pixel differences, whereas we perform discrimination with dense descriptive features. Deep features, as region-aware descriptors, provide more effective discriminative information than individual pixels in images.

In addition, since the compactness of the bottleneck embedding is vital for anomaly detection (as discussed above), we introduce a one-class bottleneck embedding (OCBE) module to condense the feature codes further. Our OCBE module consists of a multi-scale feature fusion (MFF) block and a one-class embedding (OCE) block, both jointly optimized with the student decoder. Notably, the former aggregates low- and high-level features to construct a rich embedding for normal pattern reconstruction. The latter aims to retain the essential information that helps the student decode the teacher's response.

We perform extensive experiments on public benchmarks. The experimental results indicate that our reverse distillation paradigm achieves performance comparable to prior art. The proposed OCBE module further improves the performance to a new state-of-the-art (SOTA) record. Our main contributions are summarized as follows:

We introduce a simple yet effective Reverse Distillation paradigm for anomaly detection. The encoder-decoder structure and reverse knowledge distillation strategy holistically address the non-distinguishing filter problem in conventional KD models, boosting the T-S model's discrimination capability on anomalies.
We propose a one-class bottleneck embedding module to project the teacher's high-dimensional features to a compact one-class embedding space. This innovation facilitates retaining rich yet compact codes for anomaly-free representation restoration at the student.
We perform extensive experiments and show that our approach achieves new SOTA performance.

2 Related Work

This section briefly reviews previous efforts on unsupervised anomaly detection. We highlight the similarities and differences between the proposed method and prior art.

Classical anomaly detection methods focus on defining a compact closed one-class distribution using normal support vectors. The pioneering studies include one-class support vector machine (OC-SVM) [35] and support vector data description (SVDD) [36]. To cope with high-dimensional data, DeepSVDD [31] and PatchSVDD [43] estimate data representations through deep networks.

Another unsupervised AD prototype is the use of generative models, such as AutoEncoder (AE) [19] and Generative Adversarial Nets (GAN) [12], for sample reconstruction. These methods rely on the hypothesis that generative models trained only on normal samples can successfully reconstruct anomaly-free regions but fail for anomalous regions [2, 5, 34]. However, recent studies show that deep models generalize so well that even anomalous regions can be well restored [46]. To address this issue, memory mechanisms [11, 16, 26], image masking strategies [42, 46] and pseudo-anomalies [28, 45] are incorporated in reconstruction-based methods. However, these methods still lack a strong discriminating ability for real-world anomaly detection [3, 5]. Recently, Metaformer (MF) [40] proposed the use of meta-learning [9] to bridge the model adaptation and reconstruction gap for reconstruction-based approaches. Notably, the proposed reverse knowledge distillation also adopts the encoder-decoder architecture, but it differs from reconstruction-based methods in two ways. First, the encoder in a generative model is jointly trained with the decoder, while our reverse distillation freezes a pre-trained model as the teacher. Second, instead of relying on pixel-level reconstruction error, it performs anomaly detection in the semantic feature space.

Data augmentation strategy is also widely used. By adding pseudo anomalies in the provided anomaly-free samples, the unsupervised task is converted to a supervised learning task [23, 42, 46]. However, these approaches are prone to bias towards pseudo outliers and fail to detect a large variety of anomaly types. For example, CutPaste [23] generates pseudo outliers by adding small patches onto normal images and trains a model to detect these anomalous regions. Since the model focuses on detecting local features such as edge discontinuity and texture perturbations, it fails to detect and localize large defects and global structural anomalies as shown in Fig. 6.

Recently, networks pre-trained on large datasets have been shown to be capable of extracting discriminative features for anomaly detection [7, 8, 23, 25, 29, 30]. With a pre-trained model, memorizing its anomaly-free features helps to identify anomalous samples [7, 29]. The studies in [8, 30] show that using the Mahalanobis distance to measure the similarity between anomalies and anomaly-free features leads to accurate anomaly detection. Since these methods require memorizing all features from training samples, they are computationally expensive.

Knowledge distillation from pre-trained models is another potential solution to anomaly detection. In the context of unsupervised AD, since the student model is exposed only to anomaly-free samples in knowledge distillation, the T-S model is expected to generate discrepant features on anomalies at inference [4, 33, 39]. To further increase the discriminating capability of the T-S model on various types of abnormalities, different strategies are introduced. For instance, in order to capture multi-scale anomalies, US [4] ensembles several models trained on normal data at different scales, and MKD [33] proposes to use multi-level feature alignment. It should be noted that though the proposed method is also based on knowledge distillation, our reverse distillation is the first to adopt an encoder and a decoder to construct the T-S model. The heterogeneity of the teacher and student networks and the reverse data flow in knowledge distillation distinguish our method from prior art.

3 Our Approach

Problem formulation: Let $\mathcal{I}^{t}=\{I_{1}^{t},\cdots,I_{n}^{t}\}$ be a set of available anomaly-free images and $\mathcal{I}^{q}=\{I_{1}^{q},\cdots,I_{m}^{q}\}$ be a query set containing both normal and abnormal samples. The goal is to train a model to recognize and localize anomalies in the query set. In the anomaly detection setting, normal samples in both $\mathcal{I}^{t}$ and $\mathcal{I}^{q}$ follow the same distribution. Out-of-distribution samples are considered anomalies.

System overview: Fig. 3 depicts the proposed reverse distillation framework for anomaly detection. Our reverse distillation framework consists of three modules: a fixed pre-trained teacher encoder $E$ , a trainable one-class bottleneck embedding module, and a student decoder $D$ . Given an input sample $I\in \mathcal{I}^{t}$ , the teacher $E$ extracts multi-scale representations. We propose to train a student $D$ to restore the features from the bottleneck embedding. During testing/inference, the representation extracted by the teacher $E$ can capture abnormal, out-of-distribution features in anomalous samples. However, the student decoder $D$ fails to reconstruct these anomalous features from the corresponding embedding. The low similarity of anomalous representations in the proposed T-S model indicates a high abnormality score. We argue that the heterogeneous encoder and decoder structures and the reverse knowledge distillation order contribute substantially to the discrepant representations of anomalies. In addition, the trainable OCBE module further condenses the multi-scale patterns into an extremely low-dimensional space for downstream normal representation reconstruction. This further improves feature discrepancy on anomalies in our T-S model, as abnormal representations generated by the teacher model are likely to be abandoned by OCBE. In the rest of this section, we first specify the reverse distillation paradigm. Then, we elaborate on the OCBE module. Finally, we describe anomaly detection and localization using reverse distillation.

3.1 Reverse Distillation

In conventional KD, the student network adopts a similar or identical neural network to the teacher model, accepts raw data/images as input, and aims to match its feature activations to the teacher's [4, 33]. In the context of one-class distillation for unsupervised AD, the student model is expected to generate highly different representations from the teacher when the queries are anomalous samples [11, 26]. However, the activation discrepancy on anomalies sometimes vanishes, leading to anomaly detection failure. We argue that this issue is attributable to the similar architectures of the teacher and student nets and the same data flow during T-S knowledge transfer. To improve the T-S model's representation diversity on unknown, out-of-distribution samples, we propose a novel reverse distillation paradigm, where the T-S model adopts the encoder-decoder architecture and knowledge is distilled from the teacher's deep layers to its early layers, i.e., high-level, semantic knowledge is transferred to the student first. To further facilitate the one-class distillation, we design a trainable OCBE module to connect the teacher and student models (Sec. 3.2).

In the reverse distillation paradigm, the teacher encoder $E$ aims to extract comprehensive representations. We follow previous work and use a pre-trained encoder on ImageNet [21] as our backbone $E$ . To avoid the T-S model converging to trivial solutions, all parameters of teacher $E$ are frozen during knowledge distillation. We show in the ablation study that both ResNet [14] and WideResNet [44] are good candidates, as they are capable of extracting rich features from images [4, 8, 23, 29].

To match the intermediate representations of $E$ , the architecture of student decoder $D$ is symmetrical but reversed compared to $E$ . The reverse design facilitates eliminating the response of the student network to abnormalities, while the symmetry allows it to have the same representation dimension as the teacher network. For instance, when we take ResNet as the teacher model, the student $D$ consists of several residual-like decoding blocks for mirror symmetry. Specifically, the downsampling in ResNet is realized by a convolutional layer with a kernel size of 1 and a stride of 2 [14]. The corresponding decoding block in the student $D$ adopts deconvolutional layers [47] with a kernel size of 2 and a stride of 2. More details on the student decoder design are given in Supplementary Material.
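
The shape claim above can be checked with a small NumPy sketch. It implements a transposed convolution with kernel size 2 and stride 2 directly; the absence of padding and bias is an assumption, and the function name `deconv2x2_stride2` and the toy tensors are illustrative rather than taken from the paper's code. With these settings the output size is (H - 1) * 2 + 2 = 2H, so one such layer undoes one stride-2 downsampling step of the teacher.

```python
import numpy as np

def deconv2x2_stride2(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Transposed convolution with kernel 2, stride 2, no padding, no bias.

    x: (C_in, H, W); w: (C_in, C_out, 2, 2). Returns (C_out, 2H, 2W).
    Because the stride equals the kernel size, output patches never overlap:
    y[o, 2i+a, 2j+b] = sum_c x[c, i, j] * w[c, o, a, b].
    """
    _, h, wd = x.shape
    c_out = w.shape[1]
    patches = np.einsum("cij,coab->oiajb", x, w)  # (C_out, H, 2, W, 2)
    return patches.reshape(c_out, 2 * h, 2 * wd)

# Toy input: one channel, 2x2 spatial size, all-ones kernel (hypothetical values).
x = np.arange(4, dtype=float).reshape(1, 2, 2)
w = np.ones((1, 1, 2, 2))
y = deconv2x2_stride2(x, w)
print(y.shape)  # spatial size doubles from 2x2 to 4x4
```

With an all-ones kernel each input value is simply copied into its own 2 × 2 output patch, which makes the doubling of height and width easy to see.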

In our reverse distillation, the student decoder $D$ aims to mimic the behavior of the teacher encoder $E$ during training. In this work, we explore multi-scale feature-based distillation for anomaly detection. The motivation behind this is that shallow layers of a neural network extract local descriptors for low-level information (e.g., color, edge, texture, etc.), while deep layers have wider receptive fields, capable of characterizing regional/global semantic and structural information. That is, low similarity of low- and high-level features in the T-S model suggests local abnormalities and regional/global structural outliers, respectively.

Mathematically, let $\phi$ denote the projection from raw data $I$ to the one-class bottleneck embedding space. The paired activation correspondence in our T-S model is $\{f^{k}_{E}=E^{k}(I),f^{k}_{D}=D^{k}(\phi)\}$ , where $E^{k}$ and $D^{k}$ represent the $k^{th}$ encoding and decoding blocks in the teacher and student models, respectively. $f^{k}_{E},f^{k}_{D}\in \mathbb{R}^{C_{k}\times H_{k} \times W_{k}}$ , where $C_k$ , $H_k$ and $W_{k}$ denote the number of channels, height and width of the $k^{th}$ layer activation tensor. For knowledge transfer in the T-S model, the cosine similarity is taken as the KD loss, as it captures the relation in both high- and low-dimensional information more precisely [37, 49]. Specifically, for feature tensors $f^{k}_{E}$ and $f^{k}_{D}$ , we calculate their vector-wise cosine similarity loss along the channel axis and obtain a 2-D anomaly map $M^{k}\in \mathbb{R}^{H_{k}\times W_{k}}$ :

$M^k(h,w)=1-\frac{(f_E^k(h,w))^T\cdot f_D^k(h,w)}{\left\|f_E^k(h,w)\right\|\left\|f_D^k(h,w)\right\|}.\quad(1)$

A large value in $M^{k}$ indicates high anomaly in that location. Considering the multi-scale knowledge distillation, the scalar loss function for student’s optimization is obtained by accumulating multi-scale anomaly maps:

$\mathcal{L}_{KD}=\sum_{k=1}^{K}\Big\{\frac{1}{H_kW_k}\sum\limits_{h=1}^{H_k}\sum\limits_{w=1}^{W_k}M^k(h,w)\Big\},\quad(2)$

where $K$ indicates the number of feature layers used in the experiment.
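
Eqs. (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `anomaly_map` and `kd_loss`, the small `eps` added for numerical stability, and the toy feature shapes are all hypothetical.

```python
import numpy as np

def anomaly_map(f_e: np.ndarray, f_d: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Eq. (1): one minus the cosine similarity along the channel axis.

    f_e, f_d: teacher and student features of shape (C, H, W).
    Returns a 2-D map of shape (H, W); eps is an added stability term.
    """
    dot = np.sum(f_e * f_d, axis=0)
    norm = np.linalg.norm(f_e, axis=0) * np.linalg.norm(f_d, axis=0)
    return 1.0 - dot / (norm + eps)

def kd_loss(teacher_feats, student_feats) -> float:
    """Eq. (2): sum over the K scales of the spatial mean of each map."""
    return float(sum(anomaly_map(fe, fd).mean()
                     for fe, fd in zip(teacher_feats, student_feats)))

rng = np.random.default_rng(0)
shapes = [(8, 16, 16), (16, 8, 8), (32, 4, 4)]  # toy (C_k, H_k, W_k), K = 3
teacher = [rng.standard_normal(s) for s in shapes]

loss_same = kd_loss(teacher, teacher)                   # student matches teacher
loss_opposite = kd_loss(teacher, [-f for f in teacher]) # student points the opposite way
print(loss_same, loss_opposite)
```

When the student reproduces the teacher exactly, every map is close to zero and so is the loss. When every student vector points in the opposite direction, each map equals 2, so the loss approaches 2K.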

3.2 One-Class Bottleneck Embedding

Since the student model $D$ attempts to restore representations of a teacher model $E$ in our reverse knowledge distillation paradigm, one can directly feed the activation output of the last encoding block in the backbone to $D$ . However, this naive connection has two shortfalls. First, the teacher model in KD usually has a high capacity. Though the high-capacity model helps extract rich features, the obtained high-dimensional descriptors likely have considerable redundancy. The high freedom and redundancy of representations make it harder for the student model to decode the essential anomaly-free features. Second, the activation of the last encoder block in the backbone usually characterizes semantic and structural information of the input data. Due to the reverse order of knowledge distillation, directly feeding this high-level representation to the student decoder poses a challenge for low-level feature reconstruction. Previous efforts on data reconstruction usually introduce skip paths to connect the encoder and decoder. However, this approach does not work in knowledge distillation, as the skip paths leak anomaly information to the student during inference.

To tackle the first shortfall above in one-class distillation, we introduce a trainable one-class embedding block to project the teacher model's high-dimensional representations into a low-dimensional space. Let us formulate anomaly features as perturbations on normal patterns. Then the compact embedding acts as an information bottleneck and helps to prohibit the propagation of unusual perturbations to the student model, therefore boosting the T-S model's representation discrepancy on anomalies. In this study, we adopt the 4th residual block of ResNet [14] as the one-class embedding block.
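
The bottleneck argument can be illustrated with a deliberately simplified toy. It is an analogy only: a linear projection fitted on normal data stands in for the learned one-class embedding (the paper uses a residual block, not a linear map), and the dimensions, random data, and names such as `reconstruct` are hypothetical. Normal features are assumed to lie in a low-dimensional subspace, and the anomaly is modeled as a perturbation orthogonal to that subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 32, 4, 200                                   # feature dim, bottleneck dim, samples

basis = np.linalg.qr(rng.standard_normal((d, r)))[0]   # orthonormal basis of the normal subspace
normal = rng.standard_normal((n, r)) @ basis.T         # normal features, shape (n, d)

# Fit a linear bottleneck on normal data only (SVD without centering).
_, _, vt = np.linalg.svd(normal, full_matrices=False)
proj = vt[:r]                                          # (r, d) compact embedding map

def reconstruct(x: np.ndarray) -> np.ndarray:
    """Encode into the r-dimensional bottleneck, then decode back."""
    return (x @ proj.T) @ proj

def cos_dist(a: np.ndarray, b: np.ndarray) -> float:
    return float(1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

x_normal = normal[0]
noise = rng.standard_normal(d)
perturb = noise - basis @ (basis.T @ noise)            # component outside the normal subspace
x_anom = x_normal + perturb

d_normal = cos_dist(x_normal, reconstruct(x_normal))
d_anom = cos_dist(x_anom, reconstruct(x_anom))
print(d_normal, d_anom)
```

In this toy the perturbation cannot pass through the bottleneck, so the reconstruction of the anomalous feature falls back to the normal pattern and its cosine distance to the input is clearly larger than for the normal feature.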

To address the problem of low-level feature restoration at decoder $D$ , the MFF block concatenates multi-scale representations before the one-class embedding. To achieve representation alignment in feature concatenation, we downsample the shallow features through one or more 3 × 3 convolutional layers with a stride of 2, followed by batch normalization and a ReLU activation function. Then a 1 × 1 convolutional layer with a stride of 1, followed by batch normalization and ReLU activation, is used to obtain a rich yet compact feature.
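
The phrase "one or more" stride-2 layers can be made concrete with a little shape arithmetic. The sketch below assumes three teacher feature maps with spatial sizes 64, 32 and 16 (what a ResNet-style backbone yields for a 256 × 256 input) and a padding of 1 for the 3 × 3 convolutions; both are assumptions for illustration and are not stated in the text.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed spatial sizes of the teacher's three feature scales (shallow to deep).
feature_sizes = [64, 32, 16]
target = feature_sizes[-1]          # align everything to the deepest scale

steps_per_scale = []
for size in feature_sizes:
    steps = 0
    while size > target:
        # One 3x3 convolution with stride 2; padding=1 is an assumption.
        size = conv_out(size, kernel=3, stride=2, padding=1)
        steps += 1
    steps_per_scale.append(steps)

print(steps_per_scale)  # number of stride-2 layers needed per scale
```

Under these assumptions the shallowest map needs two stride-2 layers, the middle one needs one, and the deepest needs none, after which all three share the same spatial size and can be concatenated along the channel axis.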

We depict the OCBE module in Fig. 4, where MFF aggregates low- and high-level features to construct a rich embedding for normal pattern reconstruction and OCE aims to retain the essential information that helps the student decode the teacher's response. The convolutional layers in grey and the ResBlock in green in Fig. 4 are trainable and optimized jointly with the student model $D$ during knowledge distillation on normal samples.

3.3 Anomaly Scoring

At the inference stage, we first consider the measurement of the pixel-level anomaly score for anomaly localization (AL). When a query sample is anomalous, the teacher model is capable of reflecting abnormality in its features. However, the student model is likely to fail in abnormal feature restoration, since the student decoder only learns to restore anomaly-free representations from the compact one-class embedding in knowledge distillation. In other words, the student $D$ generates discrepant representations from the teacher when the query is anomalous. Following Eq. (1), we obtain a set of anomaly maps from T-S representation pairs, where the value in a map $M^{k}$ reflects the point-wise anomaly of the $k^{th}$ feature tensors. To localize anomalies in a query image, we up-sample $M^{k}$ to the image size. Let $\Psi$ denote the bilinear up-sampling operation used in this study. Then a precise score map $S_{AL}$ is formulated as the pixel-wise accumulation of all anomaly maps:

$S_{AL}=\sum_{k=1}^{K}\Psi(M^{k}).\quad(3)$

To remove noise in the score map, we smooth $S_{AL}$ with a Gaussian filter.

For anomaly detection, averaging all values in a score map $S_{AL}$ is unfair for samples with small anomalous regions. The most responsive point exists for any size of anomalous region. Hence, we define the maximum value in $S_{AL}$ as sample-level anomaly score $S_{AD}$ . The intuition is that no significant response exists in their anomaly score map for normal samples.
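
The scoring steps of this section (up-sample each map, accumulate, smooth, take the maximum) can be sketched as follows. This is a hedged NumPy illustration, not the authors' code: the corner-aligned sampling grid in `bilinear_resize`, the reflect padding and 3-sigma truncation in `gaussian_smooth`, and all function names are assumptions.

```python
import numpy as np

def bilinear_resize(m: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinear up-sampling of a 2-D map on a corner-aligned grid (assumption)."""
    h, w = m.shape
    ys, xs = np.linspace(0, h - 1, out_h), np.linspace(0, w - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    top = m[y0][:, x0] * (1 - wx) + m[y0][:, x1] * wx
    bot = m[y1][:, x0] * (1 - wx) + m[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def gaussian_smooth(s: np.ndarray, sigma: float) -> np.ndarray:
    """Separable Gaussian blur with reflect padding, truncated at 3 sigma."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-(t ** 2) / (2 * sigma ** 2))
    k /= k.sum()
    def blur(v):
        return np.convolve(np.pad(v, radius, mode="reflect"), k, mode="valid")
    return np.apply_along_axis(blur, 1, np.apply_along_axis(blur, 0, s))

def score_maps(maps, image_size: int, sigma: float = 4.0):
    """Eq. (3) plus smoothing; returns (S_AL, S_AD) with S_AD = max of S_AL."""
    s_al = sum(bilinear_resize(m, image_size, image_size) for m in maps)
    s_al = gaussian_smooth(s_al, sigma)
    return s_al, float(s_al.max())

# Toy anomaly maps at two scales with one high-response location.
maps = [np.zeros((8, 8)), np.zeros((4, 4))]
maps[0][2, 5] = 1.0
s_al, s_ad = score_maps(maps, image_size=32)
print(s_al.shape, s_ad)
```

Taking the maximum rather than the mean keeps a small defect from being averaged away by the large anomaly-free area around it, which is the motivation given above for defining $S_{AD}$ this way.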

4 Experiments and Discussions

Empirical evaluations are carried out on both the MVTec anomaly detection and localization benchmark and unsupervised one-class novelty detection datasets. In addition, we perform an ablation study on the MVTec benchmark, investigating the effects of different modules/blocks on the final results.

4.1 Anomaly Detection and Localization

Dataset. MVTec [3] contains 15 real-world datasets for anomaly detection, with 5 classes of textures and 10 classes of objects. The training set comprises a total of 3,629 anomaly-free images. The test set has both anomalous and anomaly-free images, totaling 1,725. Each class has multiple defects for testing. In addition, pixel-level annotations are available in the test dataset for anomaly localization evaluation.

Experimental settings. All images in MVTec are resized to a specific resolution (e.g. 128 × 128, 256 × 256 etc.). Following convention in prior works, anomaly detection and localization are performed on one category at a time. In this experiment, we adopt WideResNet50 as the backbone $E$ in our T-S model. We also report the AD results with ResNet18 and ResNet50 in the ablation study. To train our reverse distillation model, we utilize the Adam optimizer [18] with $\beta = (0.5, 0.999)$ . The learning rate is set to 0.005. We train for 200 epochs with a batch size of 16. A Gaussian filter with $\sigma = 4$ is used to smooth the anomaly score map (as described in Sec. 3.3).

For anomaly detection, we take the area under the receiver operating characteristic curve (AUROC) as the evaluation metric. We include prior art in this experiment, including MKD [33], GT [10], GANomaly (GN) [2], Uninformed Student (US) [4], PSVDD [43], DAAD [16], MetaFormer (MF) [40], PaDiM (WResNet50) [8] and CutPaste [23].
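
For readers unfamiliar with the metric, AUROC can be computed as the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one. The sketch below uses this pairwise formulation in NumPy; the function name `auroc` and the toy labels and scores are made up for illustration and are not results from the paper.

```python
import numpy as np

def auroc(labels, scores) -> float:
    """AUROC via pairwise comparison; tied pairs count one half.

    labels: 1 for anomalous, 0 for normal; scores: higher means more anomalous.
    """
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))

# Toy sample-level scores: three normal samples, two anomalous ones.
labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.3]
value = auroc(labels, scores)
print(value)
```

In this toy case four of the six anomalous-normal pairs are ranked correctly, so the result is 4/6. The pairwise form is quadratic in the number of samples, which is fine for an illustration but slower than rank-based implementations on large test sets.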

For anomaly localization, we report both AUROC and per-region-overlap (PRO) [4]. Unlike AUROC, which is a per-pixel measurement, the PRO score treats anomaly regions of any size equally. The comparison baselines include MKD [33], US [4], MF [40], SPADE (WResNet50) [7,29], PaDiM (WResNet50) [8], RIAD [46] and CutPaste [23].
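
The difference between a per-pixel measure and PRO can be seen in a small sketch. It computes the per-region overlap at a single threshold: the ground-truth mask is split into connected regions, the detected fraction of each region is computed, and the fractions are averaged so that every region counts equally. The function names and the use of 4-connectivity are assumptions, and the sketch does not integrate over thresholds as the reported metric does.

```python
from collections import deque

def connected_regions(mask):
    """Return the 4-connected components of truthy pixels as lists of (y, x)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    regions = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                seen[y][x] = True
                queue = deque([(y, x)])
                region = []
                while queue:
                    cy, cx = queue.popleft()
                    region.append((cy, cx))
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                regions.append(region)
    return regions

def pro_at_threshold(score_map, gt_mask, threshold):
    """Mean over ground-truth regions of the fraction of pixels flagged."""
    regions = connected_regions(gt_mask)
    if not regions:
        raise ValueError("ground-truth mask contains no anomalous region")
    overlaps = [
        sum(1 for y, x in region if score_map[y][x] >= threshold) / len(region)
        for region in regions
    ]
    return sum(overlaps) / len(overlaps)

def pixel_recall(score_map, gt_mask, threshold):
    """Fraction of all anomalous pixels flagged, regardless of region."""
    pixels = [(y, x) for region in connected_regions(gt_mask) for y, x in region]
    return sum(1 for y, x in pixels if score_map[y][x] >= threshold) / len(pixels)
```

If an 8-pixel defect is fully detected and a separate 1-pixel defect is missed, the pixel recall is 8/9 while the per-region overlap is 0.5, which shows why PRO is the stricter measure for small defects.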

Experimental results and discussions. Anomaly detection results on MVTec are shown in Tab. 1. On average, our method exceeds SOTA by 2.5%. For textures and objects, our model reaches new SOTA AUROC of 99.5% and 98.0%, respectively. The statistics of the anomaly scores are shown in Fig. 5. The non-overlapping distributions of normal samples (blue) and anomalies (red) indicate the strong AD capability of our T-S model.

Quantitative results on anomaly localization are summarized in Tab. 2. For both AUROC and PRO, averaged over all categories, our approach surpasses the state of the art with 97.8% and 93.9%, respectively. To investigate the robustness of our method to various anomalies, we divide the defect types into two categories, large defects or structural anomalies and tiny or inconspicuous defects, and qualitatively evaluate the performance by visualization in Fig. 6 and Fig. 7. Compared to the runner-up in Tab. 1 (i.e. CutPaste [23]), our method produces a significant response over the whole anomaly region.

Complexity analysis. Recent approaches based on pre-trained models achieve promising performance by extracting features from anomaly-free samples as a measurement [7, 8]. However, storing these feature models leads to large memory consumption. In comparison, our approach achieves better performance while depending only on one extra CNN model. As shown in Tab. 3, our model obtains a performance gain with low time and memory complexity.

Limitations. We observe that the localization performance on the transistor dataset is relatively weak, despite the good AD performance. This drop is caused by a mismatch between prediction and annotation. As shown in Fig. 6, our method localizes the misplaced regions, while the ground truth covers both the misplaced and the original areas. Alleviating this problem requires associating more features with contextual relationships. We empirically find that a higher-level feature layer with a wider receptive field can improve the performance. For instance, anomaly detection with the second- and third-layer features achieves 94.5% AUROC, while using only the third layer improves the performance to 97.3%. In addition, reducing the image resolution to 128 × 128 achieves 97.6% AUROC. We present more cases of anomaly detection and localization, both positive and negative, in the supplementary material.

4.2 One-class novelty detection

To evaluate the generality of the proposed approach, we conduct one-class novelty detection experiments on 3 semantic datasets [32]: MNIST [22], FashionMNIST [41] and CIFAR10 [20]. MNIST is a dataset of hand-written digits from 0 to 9. FashionMNIST consists of images from 10 fashion product classes. Both datasets include 60K samples for training and 10K samples for testing, all at a resolution of 28 × 28. CIFAR10 is a challenging dataset for novelty detection because it includes diverse natural objects. It contains 50K training images and 10K test images of size 32 × 32 in 10 categories.

Following the protocol in [27], we train the model with samples from a single class and detect novel samples. Note that the novelty score is defined as the sum of the scores in the similarity map. The baselines in this experiment include LSA [1], OCGAN [27], HRN [17], DAAD [16] and MKD [33]. We also include a comparison with OiG [45] and G2D [28] on Caltech-256 [13].
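
The novelty score can be sketched as follows. The sketch assumes that each entry of the map is one minus the cosine similarity between the teacher and student feature vectors at that position; the map itself is defined in Sec. 3.3, which is outside this section, so this form is an assumption. The function names and the nested-list layout (height × width × channels) are also illustrative.

```python
import math

def cosine_distance(u, v, eps=1e-12):
    """One minus the cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / max(norm_u * norm_v, eps)

def distance_map(teacher, student):
    """Per-position distance between two H x W x C feature maps."""
    return [
        [cosine_distance(t, s) for t, s in zip(t_row, s_row)]
        for t_row, s_row in zip(teacher, student)
    ]

def novelty_score(teacher, student):
    """Sum of all entries of the map, giving one score per image."""
    return sum(sum(row) for row in distance_map(teacher, student))
```

Under this assumption, a sample whose student features match the teacher everywhere scores close to zero, and every position where the two disagree raises the score.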

Tab. 4 summarizes the quantitative results on the three datasets. Remarkably, our approach produces excellent results. Details of the experiments and the per-class comparison results are provided in the supplementary material.

4.3 Ablation study

We investigate the effect of the OCE and MFF blocks on AD and report the numerical results in Tab. 5. We take the pre-trained residual block [14] as the baseline. The embedding from a pre-trained residual block may contain anomaly features, which reduces the T-S model's representation discrepancy. Our trainable OCE block condenses the feature codes and the MFF block fuses rich features into the embedding, allowing for more accurate anomaly detection and localization.

Tab. 6 compares different backbone networks as the teacher model. Intuitively, a deeper and wider network usually has a stronger representational capacity, which facilitates precise anomaly detection. Notably, even with a smaller network such as ResNet18, our reverse distillation method still achieves excellent performance.

We also explore the impact of different network layers on anomaly detection and show the results in Tab. 7. For single-layer features, M2 yields the best result, as it balances local texture and global structure information. Multi-scale feature fusion helps to cover more types of anomalies.
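
One way to picture how several layers contribute to a single result is to bring each layer's anomaly map to a common resolution and add them. The sketch below does this with nearest-neighbour upsampling and a plain sum. Both choices, as well as the function names, are assumptions made for illustration; the actual accumulation rule is given in Sec. 3.3, which is outside this section.

```python
def upsample_nearest(grid, out_h, out_w):
    """Resize a 2-D list to out_h x out_w by nearest-neighbour lookup."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [grid[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

def fuse_maps(layer_maps, out_h, out_w):
    """Upsample every per-layer anomaly map and sum them element-wise."""
    fused = [[0.0] * out_w for _ in range(out_h)]
    for layer_map in layer_maps:
        upsampled = upsample_nearest(layer_map, out_h, out_w)
        for y in range(out_h):
            for x in range(out_w):
                fused[y][x] += upsampled[y][x]
    return fused
```

A shallow, high-resolution map responds to fine texture defects, while a deep, low-resolution map responds to larger structural ones, so the fused map can cover both kinds.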

5 Conclusion

We proposed a novel knowledge distillation paradigm, reverse distillation, for anomaly detection. It holistically addressed the problems of previous KD-based AD methods and boosted the T-S model's response to anomalies. In addition, we introduced trainable one-class embedding and multi-scale feature fusion blocks in reverse distillation to improve one-class knowledge transfer. Experiments showed that our method significantly outperformed prior methods in anomaly detection, anomaly localization, and novelty detection.