[Paper Translation] CVPR2023: DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection

毕加猪plus

Article link: https://blog.csdn.net/Vincent_Tong_/article/details/132482493

DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection

Tags: CVPR
GitHub: https://github.com/apple/ml-destseg
Year: 2023

Abstract

Visual anomaly detection, an important problem in computer vision, is usually formulated as a one-class classification and segmentation task. The student-teacher (S-T) framework has proved to be effective in solving this challenge. However, previous works based on S-T only empirically applied constraints on normal data and fused multilevel information. In this study, we propose an improved model called DeSTSeg, which integrates a pre-trained teacher network, a denoising student encoder-decoder, and a segmentation network into one framework. First, to strengthen the constraints on anomalous data, we introduce a denoising procedure that allows the student network to learn more robust representations. From synthetically corrupted normal images, we train the student network to match the teacher network feature of the same images without corruption. Second, to fuse the multi-level S-T features adaptively, we train a segmentation network with rich supervision from synthetic anomaly masks, achieving a substantial performance improvement. Experiments on the industrial inspection benchmark dataset demonstrate that our method achieves state-of-the-art performance, 98.6% on image-level AUC, 75.8% on pixel-level average precision, and 76.4% on instance-level average precision.

1 Introduction

Visual anomaly detection (AD) with localization is an essential task in many computer vision applications such as industrial inspection [24, 36], medical disease screening [27, 32], and video surveillance [18, 20]. The objective of these tasks is to identify both corrupted images and anomalous pixels in corrupted images. As anomalous samples occur rarely, and the number of anomaly types is enormous, it is unlikely to acquire enough anomalous samples with all possible anomaly types for training. Therefore, AD tasks were usually formulated as a one-class classification and segmentation, using only normal data for model training.

The student-teacher (S-T) framework, known as knowledge distillation, has proven effective in AD [3, 9, 26, 31, 33]. In this framework, a teacher network is pre-trained on a large-scale dataset, such as ImageNet [10], and a student network is trained to mimic the feature representations of the teacher network on an AD dataset with normal samples only. The primary hypothesis is that the student network will generate different feature representations from the teacher network on anomalous samples that have never been encountered in training. Consequently, anomalous pixels and images can be recognized in the inference phase. Notably, [26, 31] applied knowledge distillation at various levels of the feature pyramid so that discrepancies from multiple layers were aggregated and demonstrated good performance. However, there is no guarantee that the features of anomalous samples are always different between S-T networks because there is no constraint from anomalous samples during the training. Even with anomalies, the student network may be over-generalized [22] and output similar feature representations as those by the teacher network. Furthermore, aggregating discrepancies from multilevel in an empirical way, such as sum or product, could be suboptimal. For instance, in the MVTec AD dataset under the same context of [31], we observe that for the category of transistor, employing the representation from the last layer, with 88.4% on pixel-level AUC, outperforms that from the multi-level features, with 81.9% on pixel-level AUC.

To address the problem mentioned above, we propose DeSTSeg, illustrated in Fig. 1, which consists of a denoising student network, a teacher network, and a segmentation network. We introduce random synthetic anomalies into the normal images and then use these corrupted images for training. The denoising student network takes a corrupted image as input, whereas the teacher network takes the original clean image as input. During training, the feature discrepancy between the two networks is minimized. In other words, the student network is trained to perform denoising in the feature space. Given anomalous images as input to both networks, the teacher network encodes anomalies naturally into features, while the trained denoising student network filters anomalies out of feature space. Therefore, the two networks are reinforced to generate distinct features from anomalous inputs. For the architecture of the denoising student network, we decided to use an encoder-decoder network for better feature denoising instead of adopting an identical architecture as the teacher network. In addition, instead of using empirical aggregation, we append a segmentation network to fuse the multilevel feature discrepancies in a trainable manner, using the generated binary anomaly mask as the supervision signal.

Figure 1. Overview of DeSTSeg. Synthetic anomalous images are generated and used during training. In the first step (a), the student network with synthetic input is trained to generate similar feature representations as the teacher network from the clean image. In the second step (b), the element-wise product of the student and teacher networks’ normalized outputs are concatenated and utilized to train the segmentation network. The segmentation output is the predicted anomaly score map.

We evaluate our method on a benchmark dataset for surface anomaly detection and localization, MVTec AD [2]. Extensive experimental results show that our method outperforms the state-of-the-art methods on image-level, pixel-level, and instance-level anomaly detection tasks. We also conduct ablation studies to validate the effectiveness of our proposed components.

Our main contributions are summarized as follows.

(1) We propose a denoising student encoder-decoder, which is trained to explicitly generate different feature representations from the teacher with anomalous inputs.

(2) We employ a segmentation network to adaptively fuse the multilevel feature similarities to replace the empirical inference approach.

(3) We conduct extensive experiments on the benchmark dataset to demonstrate the effectiveness of our method for various tasks.

2 Related Work

Anomaly detection and localization have been studied from numerous perspectives. In image reconstruction, researchers used autoencoder [4], variational autoencoder [1, 30] or generative adversarial network [21, 27, 28] to train an image reconstruction model on normal data. The presumption is that anomalous images cannot be reconstructed effectively since they are not seen during training, so the difference between the input and reconstructed images can be used as pixel-level anomaly scores. However, anomaly regions still have a chance to be accurately reconstructed due to the over-generalization issue [22]. Another perspective is the parametric density estimation, which assumes that the extracted features of normal data obey a certain distribution, such as a multivariate Gaussian distribution [8, 15, 16, 23], and uses the normal dataset to estimate the parameters. Then, the outlier data are recognized as anomalous data by inference. Since the assumption of Gaussian distribution is too strict, some recent works borrow ideas from normalizing flow, by projecting an arbitrary distribution to a Gaussian distribution to approximate the density of any distributions [13, 35]. Besides, the memory-based approaches [7,19,24,34] build a memory bank of normal data in training. During inference, given a query item, the model selects the nearest item in the memory bank and uses the similarity between the query item and the nearest item to compute the anomaly score.

Knowledge distillation. Knowledge distillation is based on a pretrained teacher network and a trainable student network. As the student network is trained on an anomaly-free dataset, its feature representation of anomalies is expected to be distinct from that of the teacher network. Numerous solutions have been presented in the past to improve discrimination against various types of anomalies. For example, [3] used ensemble learning to train multiple student networks and exploited the irregularity of their feature representations to recognize the anomaly. [31] and [26] adopted multi-level feature representation alignment to capture both low-level and high-level anomalies. [9] and [33] designed decoder architectures for the student network to avoid the shortcomings of identical architecture and the same dataflow between S-T networks. These works focus on improving the similarity of S-T representations on normal inputs, whereas our work additionally attempts to differentiate their representations on anomalous input.

Anomaly simulation. Although there is no anomalous data for training in the context of one-class classification AD, the pseudo-anomalous data could be simulated so that an AD model can be trained in a supervised way. Classical anomaly simulation strategies, such as rotation [12] and cutout [11], do not perform well in detecting fine-grained anomalous patterns [16]. A simple yet effective strategy is called CutPaste [16] that randomly selects a rectangular region inside the original image and then copies and pastes the content to a different location within the image. Another strategy proposed in [36] and also adopted in [37] used two-dimensional Perlin noise to simulate a more realistic anomalous image. With the simulated anomalous images and corresponding ground truth masks, [29, 36, 37] localized anomalies with segmentation networks. In our system, we adopt the ideas of [36] for anomaly simulation and segmentation.

3 Method

The proposed DeSTSeg consists of three main components: a pre-trained teacher network, a denoising student network, and a segmentation network. As illustrated in Fig. 1, synthetic anomalies are introduced into normal training images, and the model is trained in two steps. In the first step, the simulated anomalous image is utilized as the student network input, whereas the original clean image is the input to the teacher network. The weights of the teacher network are fixed, but the student network for denoising is trainable. In the second step, the student model is fixed as well. Both the student and teacher networks take the synthetic anomaly image as their input to optimize parameters in the segmentation network to localize the anomalous regions. For inference, pixel-level anomaly maps are generated in an end-to-end mode, and the corresponding image-level anomaly scores can be computed via post-processing.

3.1 Synthetic Anomaly Generation

The training of our model relies on synthetic anomalous images which are generated using the same algorithm proposed in [36]. Random two-dimensional Perlin noise is generated and binarized by a preset threshold to obtain an anomaly mask $M$. An anomalous image $I_{a}$ is generated by replacing the mask region with a linear combination of an anomaly-free image $I_{n}$ and an arbitrary image from an external data source $A$, with an opacity factor $\beta$ randomly chosen from the interval [0.15, 1].

$I_a=\beta(M\odot A)+(1-\beta)(M\odot I_n)+(1-M)\odot I_n\quad(1)$

$\odot$ means the element-wise multiplication operation. The anomaly generation is performed online during training. By using this algorithm, three benefits are introduced. First, compared to painting a rectangle anomaly mask [16], the anomaly mask generated by random Perlin noise is more irregular and similar to actual anomalous shapes. Second, the image used as anomaly content $A$ could be arbitrarily chosen without elaborate selection [36]. Third, the introduction of opacity factor $\beta$ can be regarded as a data augmentation [38] to effectively increase the diversity of the training set.
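
To make Eq. (1) concrete, the following is a minimal NumPy sketch of the blending step. It is an illustration under stated assumptions, not the authors' implementation: `toy_mask` swaps binarized Perlin noise for thresholded blocky random noise, and the names `blend_anomaly`, `toy_mask`, `cell`, and the random arrays standing in for $I_n$ and $A$ are hypothetical.

```python
import numpy as np

def blend_anomaly(normal, source, mask, beta):
    """Eq. (1): I_a = beta*(M*A) + (1-beta)*(M*I_n) + (1-M)*I_n."""
    m = mask[..., None].astype(normal.dtype)  # broadcast the H x W mask over channels
    return beta * (m * source) + (1.0 - beta) * (m * normal) + (1.0 - m) * normal

def toy_mask(shape, rng, cell=8, threshold=0.5):
    """Assumption: blocky random noise, thresholded, as a stand-in for Perlin noise."""
    h, w = shape
    coarse = rng.random((h // cell + 1, w // cell + 1))
    noise = np.kron(coarse, np.ones((cell, cell)))[:h, :w]
    return (noise > threshold).astype(np.float32)

rng = np.random.default_rng(0)
normal = rng.random((32, 32, 3)).astype(np.float32)  # plays the role of I_n
source = rng.random((32, 32, 3)).astype(np.float32)  # plays the role of A
mask = toy_mask((32, 32), rng)                       # plays the role of M
beta = rng.uniform(0.15, 1.0)                        # opacity factor from [0.15, 1]
anomalous = blend_anomaly(normal, source, mask, beta)
```

Pixels outside the mask are copied unchanged from the normal image, while pixels inside it are a convex mix of the source and the normal image, so a larger $\beta$ produces a more opaque synthetic defect.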

3.2 Denoising Student-Teacher Network

In previous multi-level knowledge distillation approaches [26, 31], the input of the student network (normal image) is identical to that of the teacher network, as is the architecture of the student network. However, our proposed denoising student network and the teacher network take paired anomalous and normal images as input, with the denoising student network having a distinct encoder-decoder architecture. In the following two paragraphs, we will examine the motivation for this design.

First, as mentioned in Sec. 1, an optimization target should be established to encourage the student network to generate anomaly-specific features that differ from the teacher’s. We further endow a more straightforward target to the student network: to build normal feature representations on anomalous regions supervised by the teacher network. As the teacher network has been pre-trained on a large dataset, it can generate discriminative feature representations in both normal and anomalous regions. Therefore, the denoising student network will generate different feature representations from those by the teacher network during inference. Besides, as mentioned in Sec. 2, the memory-based approaches look for the most similar normal item in the memory bank to the query item and use their similarity for inference. Similarly, we optimize the denoising student network to reconstruct the normal features.

Second, given the feature reconstruction task, we conclude that the student network should not copy the architecture of the teacher network. Considering the process of reconstructing the feature of an early layer, it is well known that the lower layers of CNN capture local information, such as texture and color. In contrast, the upper layers of CNN express global semantic information [9]. Recalling that our denoising student network should reconstruct the feature of the corresponding normal image from the teacher network, such a task relies on global semantic information of the image and could not be done perfectly with only a few lower layers. We notice that the proposed task design resembles image denoising, with the exception that we wish to denoise the image in the feature space. The encoder-decoder architecture is widely used for image denoising. Therefore, we adopted it as the denoising student network’s architecture. There is an alternative way to use the teacher network as an encoder and reverse the student network as the decoder [9, 33]; however, our preliminary experimental results show that a complete encoder-decoder student network performs better. One possible explanation is that the pre-trained teacher network is usually trained on ImageNet with classification tasks; thus, the encoded features in the last layers lack sufficient information to reconstruct the feature representations at all levels.

Following [31], the teacher network is an ImageNet pretrained ResNet18 [14] with the final block removed (i.e., $conv5\_x$). The output feature maps are extracted from the three remaining blocks, i.e., $conv2\_x$, $conv3\_x$, and $conv4\_x$, denoted as $T^{1}$, $T^{2}$, and $T^{3}$, respectively. Regarding the denoising student network, the encoder is a randomly initialized ResNet18 with all blocks, named $S_E^1$, $S_E^2$, $S_E^3$, and $S_E^4$, respectively. The decoder is a reversed ResNet18 (by replacing all downsampling with bilinear upsampling) with four residual blocks, named $S_D^4$, $S_D^3$, $S_D^2$, and $S_D^1$, respectively.
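
As a rough guide to the tensor sizes involved, the sketch below lists the feature-map shapes of the three teacher blocks. The channel counts and strides are those of the standard ResNet18 design; the 256 × 256 input size and the helper name `teacher_feature_shapes` are assumptions made for illustration.

```python
# Standard ResNet18 stages used by the teacher: name -> (channels, total stride).
RESNET18_STAGES = {
    "conv2_x": (64, 4),    # T^1
    "conv3_x": (128, 8),   # T^2
    "conv4_x": (256, 16),  # T^3
}

def teacher_feature_shapes(input_size):
    """Return (C, H, W) for each teacher block given a square input."""
    return {
        name: (channels, input_size // stride, input_size // stride)
        for name, (channels, stride) in RESNET18_STAGES.items()
    }

shapes = teacher_feature_shapes(256)  # assumed input resolution
for name, shape in shapes.items():
    print(name, shape)
```

The first level is at one quarter of the input resolution, which matches the size stated for $X_1$ in Sec. 3.3.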

We minimize the cosine distance between features from $T^{k}$ and $S^{k}_{D}$, $k = 1, 2, 3$. Denoting by $F_{T_{k}} \in \mathbb{R}^{C_k \times H_k \times W_k}$ the feature representation from layer $T^{k}$, and by $F_{S_{k}} \in \mathbb{R}^{C_k \times H_k \times W_k}$ the feature representation from layer $S^{k}_{D}$, the cosine distances can be computed through Eq. (2) and Eq. (3). $i$ and $j$ stand for the spatial coordinates on the feature map. In particular, $i = 1, \dots, H_k$ and $j = 1, \dots, W_k$. The loss is the sum of distances across three different feature levels, as shown in Eq. (4).

$X_k(i,j)=\frac{F_{T_k}(i,j)\odot F_{S_k}(i,j)}{\|F_{T_k}(i,j)\|_2\|F_{S_k}(i,j)\|_2}\quad(2)$

$D_k(i,j)=1-\sum_{c=1}^{C_k}X_k(i,j)_c\quad(3)$

$L_{cos}=\sum_{k=1}^3\left(\frac{1}{H_kW_k}\sum_{i,j=1}^{H_k,W_k}D_k(i,j)\right)\quad(4)$
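
Eqs. (2) to (4) can be sketched in NumPy as follows. This is a minimal illustration, not the released training code: the function names are hypothetical, the small `eps` term is an added assumption for numerical stability, and random arrays stand in for real teacher and student features.

```python
import numpy as np

def cosine_distance_map(f_t, f_s, eps=1e-8):
    """Eqs. (2)-(3): per-pixel cosine distance between two C x H x W feature maps."""
    t_norm = np.linalg.norm(f_t, axis=0, keepdims=True)  # ||F_T(i,j)||_2
    s_norm = np.linalg.norm(f_s, axis=0, keepdims=True)  # ||F_S(i,j)||_2
    x = (f_t * f_s) / (t_norm * s_norm + eps)            # X_k, shape C x H x W
    return 1.0 - x.sum(axis=0)                           # D_k, shape H x W

def cosine_loss(teacher_feats, student_feats):
    """Eq. (4): sum over levels of the spatial mean of D_k."""
    return sum(
        cosine_distance_map(t, s).mean()
        for t, s in zip(teacher_feats, student_feats)
    )

rng = np.random.default_rng(0)
# Toy three-level pyramid (channels and sizes are illustrative only).
feats = [rng.standard_normal((c, h, h)) for c, h in [(4, 16), (8, 8), (16, 4)]]
loss_same = cosine_loss(feats, feats)                    # identical features
loss_opposite = cosine_loss(feats, [-f for f in feats])  # opposite directions
```

Each level contributes a value between 0 (features point the same way everywhere) and 2 (features point in opposite directions), so the three-level loss lies in [0, 6].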

3.3 Segmentation Network

In [26,31], the cosine distances from multi-level features are summed up directly to represent the anomaly score of each pixel. However, the results can be suboptimal if discriminations of all level features are not equally accurate. To address this issue, we add a segmentation network to guide the feature fusion with additional supervision signals.

We freeze the weights of both the student and teacher networks to train the segmentation network. The synthetic anomalous image is utilized as the input for both S-T networks, and the corresponding binary anomaly mask is the ground truth. The similarities of the feature maps $(T^{1}, S_{D}^{1})$, $(T^{2}, S_{D}^{2})$, and $(T^{3}, S_{D}^{3})$ are calculated by Eq. (2) and upsampled to the same size as $X_1$, which is $\frac{1}{4}$ of the input size. The upsampled features, denoted as $\hat{X_{1}}$, $\hat{X_{2}}$, and $\hat{X_{3}}$, are then concatenated as $\hat{X}$, which is fed into the segmentation network. We also investigate alternative ways to compute the input of the segmentation network in Sec. 4.4. The segmentation network contains two residual blocks and one Atrous Spatial Pyramid Pooling (ASPP) module [5]. There is no upsampling or downsampling; thus, the output size equals the size of $X_{1}$. Although this may lead to resolution loss to some extent, it reduces the memory cost for training and inference, which is crucial in practice.
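
The construction of $\hat{X}$ described above can be sketched as follows. The sketch makes two assumptions that the text does not fix: it upsamples by nearest-neighbour repetition with an integer factor, because the interpolation mode is not stated here, and it uses toy channel counts. All function names are hypothetical.

```python
import numpy as np

def normalized_product(f_t, f_s, eps=1e-8):
    """Eq. (2): element-wise product of L2-normalized C x H x W features."""
    t = f_t / (np.linalg.norm(f_t, axis=0, keepdims=True) + eps)
    s = f_s / (np.linalg.norm(f_s, axis=0, keepdims=True) + eps)
    return t * s

def upsample_nearest(x, factor):
    """Assumption: integer-factor nearest-neighbour upsampling as a stand-in."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def segmentation_input(teacher_feats, student_feats):
    """Upsample X_2 and X_3 to the size of X_1 and concatenate along channels."""
    maps = [normalized_product(t, s) for t, s in zip(teacher_feats, student_feats)]
    target_h = maps[0].shape[1]
    upsampled = [upsample_nearest(m, target_h // m.shape[1]) for m in maps]
    return np.concatenate(upsampled, axis=0)

rng = np.random.default_rng(0)
level_shapes = [(4, 16, 16), (8, 8, 8), (16, 4, 4)]  # toy (C, H, W) per level
teacher = [rng.standard_normal(s) for s in level_shapes]
student = [rng.standard_normal(s) for s in level_shapes]
x_hat = segmentation_input(teacher, student)  # shape (4 + 8 + 16, 16, 16)
```

Keeping the per-channel products instead of summing them into a single distance map lets the segmentation network learn how much each level and channel should contribute.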

The segmentation training is optimized by employing the focal loss [17] and the L1 loss. In the training set, the majority of pixels are normal and easily recognized as background. Only a small portion of the image consists of anomalous pixels that must be segmented. Therefore, the focal loss can help the model to focus on the minority category and difficult samples. In addition, the L1 loss is employed to improve the sparsity of the output so that the segmentation mask's boundaries are more distinct. To compute the loss, we downsample the ground truth anomaly mask to a size equal to $\frac{1}{4}$ of the input size, which matches the output size $(H_1, W_1)$. Mathematically, we denote the output probability map as $\hat{Y}$ and the downsampled anomaly mask as $M$, and the focal loss is computed using Eq. (5), where $p_{ij} = M_{ij}\hat{Y}_{ij} + (1 - M_{ij})(1 - \hat{Y}_{ij})$ and $\gamma$ is the focusing parameter. The L1 loss is computed by Eq. (6), and the segmentation loss is computed by Eq. (7).

$L_{focal}=-\frac{1}{H_1W_1}\sum_{i,j=1}^{H_1,W_1}(1-p_{ij})^\gamma\log(p_{ij})\quad(5)$

$L_{l1}=\frac{1}{H_1W_1}\sum_{i,j=1}^{H_1,W_1}|M_{ij}-\hat{Y}_{ij}|\quad\text{(6)}$

$L_{seg}=L_{focal}+L_{l1}\quad(7)$
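
A minimal NumPy sketch of Eqs. (5) to (7) is given below. The default `gamma=2.0` and the clipping constant `eps` are illustrative assumptions, since the section does not state the values used, and the function name `segmentation_loss` is hypothetical.

```python
import numpy as np

def segmentation_loss(y_hat, mask, gamma=2.0, eps=1e-7):
    """L_seg = L_focal + L_l1 for a probability map y_hat and binary mask M."""
    # p_ij is the probability assigned to the correct class of pixel (i, j).
    p = mask * y_hat + (1.0 - mask) * (1.0 - y_hat)
    p = np.clip(p, eps, 1.0)                               # avoid log(0)
    focal = -np.mean((1.0 - p) ** gamma * np.log(p))       # Eq. (5)
    l1 = np.mean(np.abs(mask - y_hat))                     # Eq. (6)
    return focal + l1                                      # Eq. (7)

mask = np.zeros((8, 8))
mask[2:4, 2:5] = 1.0                          # a small anomalous region
y_hat = np.where(mask == 1.0, 0.7, 0.1)       # a plausible, imperfect prediction
loss = segmentation_loss(y_hat, mask)
loss_perfect = segmentation_loss(mask, mask)       # prediction equals the mask
loss_inverted = segmentation_loss(1.0 - mask, mask)  # every pixel wrong
```

The factor $(1-p_{ij})^\gamma$ shrinks the contribution of pixels that are already classified confidently, which is why the many easy background pixels do not dominate the loss.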

3.4 Inference

In the inference stage, the test image is fed into both the teacher and student networks. The segmentation prediction is finally upsampled to the input size and taken as the anomaly score map. It is expected that anomalous pixels in the input image will have greater values in the output. To calculate the image-level anomaly score, we use the average of the top $T$ values from the anomaly score map, where $T$ is a tuning hyperparameter.
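
The image-level score described above amounts to a top-$T$ average over the anomaly score map. A minimal sketch, with a hypothetical function name and a toy score map:

```python
import numpy as np

def image_score(score_map, top_t):
    """Average of the top_t largest values in the anomaly score map."""
    flat = np.sort(score_map.ravel())[::-1]  # scores in descending order
    return float(flat[:top_t].mean())

score_map = np.arange(16.0).reshape(4, 4)  # toy map with values 0..15
score = image_score(score_map, top_t=4)    # mean of 15, 14, 13, 12
```

A small $T$ makes the score sensitive to a few high-scoring pixels, which suits small defects, while a larger $T$ averages over more of the map and is less affected by isolated false positives.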

4 Experiments

4.1 Datasets

We evaluate our method using the MVTec AD [2] dataset, which is one of the most widely used benchmarks for anomaly detection and localization. The dataset comprises 15 categories, including 10 objects and 5 textures. For each category, there are hundreds of normal images for training and a mixture of anomalous and normal images for evaluation. The image sizes range from 700 × 700 to 1024 × 1024 pixels. For evaluation purposes, pixel-level binary annotations are provided for anomalous images in the test set. In addition, the Describable Textures Dataset (DTD) [6] is used as the anomaly source image $A$ in Eq. (1). [36] showed that other datasets such as ImageNet can achieve comparable performance but DTD is much smaller and easy to use.

4.2 Evaluation Metrics

Image-level evaluation. Following previous work in anomaly detection, AUC (i.e., area under the ROC curve) is utilized to evaluate image-level anomaly detection.

Pixel-level evaluation. AUC is also selected to evaluate the pixel-level result. Additionally, we report average precision (AP) since it is a more appropriate metric for heavily imbalanced classes [25].

Instance-level evaluation. In real-world applications, such as industrial defect inspection and medical imaging lesion detection, users are more concerned with whether the model can fully or partially localize an instance than with each individual pixel. In [3], per-region-overlap (PRO) is proposed, which equally weights the connected components of different sizes in the ground truth. It computes the overlap between prediction and ground truth within a user-specified false positive rate (30%). However, because instance recall is essential in practice, we propose to use instance average precision (IAP) as a more straightforward metric. Formally, we define an anomaly instance as a maximally connected ground truth region. Given a prediction map, an anomalous instance is considered detected if and only if more than 50% of the region pixels are predicted as positive. Under different thresholds, a list of pixel-level precision and instance-level recall rate points can be drawn as a curve. The average precision of this curve is calculated as IAP. For those applications requiring an extremely high recall, the precision at recall = $k\%$ is also computed and denoted as IAP@k. In our experiments, we evaluate our model under a high-stakes scenario by setting $k = 90$.
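
One point on the precision/recall curve behind IAP can be computed as in the sketch below. It is an illustration under assumptions: the connectivity used to define an instance is not specified in the text, so 4-connectivity is assumed, the flood-fill labelling and both function names are hypothetical, and the toy score map is invented for the example.

```python
import numpy as np

def label_instances(gt):
    """Label connected ground-truth regions (4-connectivity is an assumption)."""
    labels = np.zeros(gt.shape, dtype=int)
    count = 0
    for r, c in zip(*np.nonzero(gt)):
        if labels[r, c]:
            continue                      # pixel already belongs to an instance
        count += 1
        labels[r, c] = count
        stack = [(r, c)]
        while stack:                      # flood fill from the seed pixel
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < gt.shape[0] and 0 <= nx < gt.shape[1]
                        and gt[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = count
                    stack.append((ny, nx))
    return labels, count

def precision_and_instance_recall(scores, gt, threshold):
    """Pixel-level precision and instance-level recall at one threshold."""
    pred = scores >= threshold
    labels, n = label_instances(gt)
    # An instance counts as detected if more than 50% of its pixels are positive.
    detected = sum(pred[labels == k].mean() > 0.5 for k in range(1, n + 1))
    precision = np.logical_and(pred, gt).sum() / pred.sum() if pred.sum() else 1.0
    recall = detected / n if n else 1.0
    return float(precision), float(recall)

gt = np.zeros((6, 6), dtype=bool)
gt[0:2, 0:2] = True                                # instance A, 4 pixels
gt[4:6, 3:6] = True                                # instance B, 6 pixels
scores = np.zeros((6, 6))
scores[0, 0] = scores[0, 1] = scores[1, 0] = 0.9   # 3 of 4 pixels of A
scores[1, 1] = 0.2
scores[4, 3] = scores[4, 4] = 0.6                  # only 2 of 6 pixels of B
scores[3, 0] = 0.8                                 # one false positive
precision, recall = precision_and_instance_recall(scores, gt, threshold=0.5)
```

Sweeping the threshold over the score range yields the list of (instance recall, pixel precision) points whose average precision is IAP; IAP@90 reads off the precision at the point where instance recall reaches 90%.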

Ground truth downsampling method. We notice that the prior implementations of pixel-level evaluation are poorly aligned. Most of the works downsampled the ground truth to 256 × 256 for faster computation, but some performed an extra 224 × 224 center crop [7, 8, 24]. In addition, the downsampling implementations are not standardized either [13,24,36], resulting in the varying ground truth and unfair evaluation. In some cases, the downsampling introduces severe distortion, as illustrated in Fig. 2. In our work, we use bilinear interpolation to downsample the binary mask to 256 × 256 and then round the result with a threshold of 0.5. This implementation can preserve the continuity of the original ground truth mask without over or under-estimating.
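
The effect of the rounding rule can be reproduced on a toy mask. The sketch below is a hedged illustration: `bilinear_resize` is a hypothetical helper that uses half-pixel sample centers, which is one common convention (image libraries differ), and treating a value of exactly 0.5 as positive is an assumption.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Bilinear resize of a 2-D array using half-pixel sample centers."""
    in_h, in_w = img.shape
    ys = np.clip((np.arange(out_h) + 0.5) * in_h / out_h - 0.5, 0, in_h - 1)
    xs = np.clip((np.arange(out_w) + 0.5) * in_w / out_w - 0.5, 0, in_w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]               # vertical interpolation weights
    wx = (xs - x0)[None, :]               # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

mask = np.zeros((8, 8))
mask[0, 0] = mask[0, 1] = mask[1, 0] = 1.0   # output cell (0, 0) is 75% covered
mask[0, 2] = 1.0                             # output cell (0, 1) is 25% covered
mask[2:4, 2:4] = 1.0                         # output cell (1, 1) is fully covered
down = bilinear_resize(mask, 4, 4)

floored = np.floor(down)                     # keeps only fully covered cells
ceiled = np.ceil(down)                       # keeps any partially covered cell
rounded = (down >= 0.5).astype(float)        # threshold at 0.5
```

On this mask, flooring keeps one positive cell, ceiling keeps three, and thresholding at 0.5 keeps the two cells that are at least half covered, which mirrors the shrinking and thickening effects described for the alternative implementations.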

Figure 2. The binary ground truth is downsampled with different implementations. (a) The grid image with a crack anomaly. (b) Downsample with bilinear interpolation, then floor all values between (0, 1) to zero [13, 24]. The mask has almost vanished. (c) Downsample with bilinear interpolation, then ceil all values between (0, 1) to one [36]. The mask is thicker than expected. (d) Downsample with the nearest interpolation, interrupting the original contiguous region. (e) Our proposed approach: downsample with bilinear interpolation and round values with a threshold of 0.5. The original contiguous region is not interrupted.

4.3 Results

In order to make fair comparisons with other works, we re-evaluated the official pre-trained models of [31], [24], and [36] using our proposed evaluation introduced in Sec. 4.2. For methods without open-source code, we use the results mentioned in the original papers. Unavailable results are denoted with ‘-’. We repeat the experiments of our method 5 times with different random seeds to report the standard deviation.

Image-level anomaly detection. We report the AUC for the image-level anomaly detection task in Tab. 1. The performance of our method reaches state-of-the-art on average. Category-specific results are shown in the supplementary material.

Table 1. Image-level anomaly detection AUC (%) on MVTec AD dataset. Results are averaged over all categories.

Pixel-level anomaly localization. We report the AUC and AP values for the pixel-level anomaly localization task in Tab. 2. On average, our method outperforms the state of the art by 5.6% on AP and achieves AUC scores comparable to PatchCore [24]. Our method reaches the highest or near-highest score in the majority of categories, indicating that our approach generalizes well over a wide range of industrial application scenes.

像素级异常定位。我们在表2中报告了像素级异常定位任务的AUC和AP值。就平均值而言，我们的方法在AP方面比现有技术高出5.6%，AUC分数与PatchCore[24]不相上下。我们的方法在大多数类别中都获得了最高或接近最高的分数，这表明我们的方法在广泛的工业应用场景中具有良好的通用性。

Table 2. Pixel-level anomaly localization AUC / AP (%) on MVTec AD dataset.

表2：在MVTec AD数据集上像素级异常定位AUC/AP(%)

Instance-level anomaly detection. The IAP and IAP@90 of the instance-level anomaly detection are reported in Tab. 3. Our method achieves the state of the art for both metrics. On average, our approach reaches an IAP@90 of 57.8%, which indicates that when 90% of anomaly instances are detected, the pixel-level precision is 57.8%, or equally, the pixel-level false positive rate is 42.2%. As some categories (e.g., carpet, pill) contain hard samples close to the decision boundary, their standard deviations of IAP@90 are relatively high. In practice, these metrics can be used to determine whether the performance is acceptable for an application.

实例级异常检测。表3报告了实例级异常检测的IAP和IAP@90。我们的方法在这两个指标上都达到了最先进水平。就平均值而言，我们的方法达到了57.8%的IAP@90，这表明当检测到90%的异常实例时，像素级精度为57.8%，或者等价地说，像素级误报率为42.2%。由于某些类别(如地毯、药丸)包含接近决策边界的困难样本，其IAP@90的标准差相对较高。在实际应用中，这些指标可用于判断性能是否满足应用要求。
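The reading of IAP@90 given above (pixel-level precision at the point where 90% of anomaly instances are detected) can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: `precision_at_instance_recall` is a hypothetical name, instances are passed in as precomputed lists of pixel indices instead of being extracted as connected components, and an instance is assumed to count as detected when more than half of its pixels are predicted positive (the metric itself is defined in Sec. 4.2, which is not reproduced here).

```python
def precision_at_instance_recall(scores, labels, instances, k=0.9):
    """Pixel-level precision at the highest threshold whose instance recall >= k.

    scores    : flat list of per-pixel anomaly scores
    labels    : flat list of 0/1 ground-truth pixel labels
    instances : list of lists of pixel indices, one list per anomaly instance
    """
    for thr in sorted(set(scores), reverse=True):
        pred = [s >= thr for s in scores]
        # Assumed rule: an instance is detected if > 50% of its pixels are positive.
        detected = sum(
            1 for inst in instances
            if sum(pred[p] for p in inst) > 0.5 * len(inst)
        )
        if detected / len(instances) >= k:
            true_pos = sum(1 for p, y in zip(pred, labels) if p and y)
            return true_pos / sum(pred)
    return 0.0

# Toy example with 8 pixels and two anomaly instances.
scores = [0.9, 0.8, 0.7, 0.1, 0.6, 0.5, 0.2, 0.1]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
instances = [[0, 1], [4, 5]]
precision = precision_at_instance_recall(scores, labels, instances, k=0.9)
false_positive_share = 1.0 - precision
```

In the toy example both instances are only detected once the threshold drops to 0.5, at which point five pixels are predicted positive and four of them are truly anomalous. The complement of the precision is the share of predicted pixels that are false alarms, which is how the 57.8% and 42.2% figures in the text relate to each other.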

Table 3. Instance-level anomaly detection IAP / IAP@90 (%) on MVTec AD dataset

表 3. MVTec AD 数据集的实例级异常检测 IAP / IAP@90 (%)

Category-specific analysis. For the category cable, memory-based approaches [24,37] have better performance than ours since the normal pixels have larger intra-class distances than categories with periodic textures. For the categories grid, screw, and tile, the anomalies are relatively small or thin. Therefore, methods with higher resolution predictions, such as [36, 37], can achieve higher performance, but require more memory and computation. For the remaining categories, our method achieves comparable or higher performance than the compared methods.

特定类别分析。对于类别cable, 基于记忆的方法[24, 37]比我们的方法性能更好，因为正常像素的类内距离比具有周期性纹理的类别更大。grid, screw, 和 tile 类别的异常相对较小或较薄。因此，分辨率更高的预测方法(如[36, 37]等) 可以实现更高的性能，但需要更多的内存和计算量。在其它类别中，我们的方法取得了与其它方法相当或更高的性能。

Visualization examples. Several visualization examples of our method from various categories are presented in Fig. 3. Our method can precisely localize the anomaly regions. More examples are shown in the supplementary material.

可视化示例。图3展示了我们的方法在不同类别中的几个可视化示例。我们的方法可以精确定位异常区域。更多示例见补充材料。

Figure 3. Visualization examples of our method. For each example, left: input image; middle: ground truth; right: prediction map.

图3：我们方法的可视化示例。每个例子的左图：输入图像；中图：基本真实图像；右图：预测图。

Analysis of failure cases. We analyze some failure cases illustrated in Fig. 4. On the one hand, several ambiguous ground truths are responsible for a number of failure occurrences. In a transistor case from the first row, the ground truth highlights both the original and misplaced location, while the prediction mask only covers the misplaced location. For a capsule case shown in the second row, the ground truth contains most of the distorted parts, whereas the prediction mask covers the entire capsule. In these cases, we would argue that our predictions are still useful.

失败案例分析。我们对图4所示的一些失败案例进行分析。一方面，一些模棱两可的真实值是失败出现的原因。在第一行的晶体管案例中，真实值同时突出显示了原始位置和误放位置，而预测掩码只覆盖误放位置。在第二行显示的胶囊案例中，真实值包含了大部分扭曲部分，而预测掩码则覆盖了整个胶囊。在这种情况下，我们认为我们的预测仍然有用。

Figure 4. Failure cases of our method. The examples are chosen from transistor, capsule, screw, and hazelnut (from top to bottom). For each example, left: input image; middle: ground truth; right: prediction map.

图4：我们方法的失败案例。示例选自晶体管、胶囊、螺钉和榛子类别(自上到下)。每个示例的左图：输入图像；中图：真实值；右图：预测图。

On the other hand, some failure cases, such as those shown in the third and fourth rows, result from noisy backgrounds. Tiny fibers and stains are highlighted due to the susceptibility of our model. We leave it to future work to investigate whether these anomalies are acceptable in order to draw more accurate conclusions.

另一方面，一些错误案例，如第三行和第四行所示的案例，是由背景噪音造成的。由于模型的易感性，细小的纤维和污点会被突出显示。为了得出更准确的结论，我们将留待今后的工作来研究这些反常现象是否可以接受。

4.4 消融研究

Network architecture. In Tab. 4, we evaluate the effectiveness of our three designs: replacing the training inputs of the vanilla student network with synthetic anomalies to enable a denoising procedure (den), applying an encoder-decoder architecture to the student network (ed), and appending the segmentation network (seg) to replace the empirical feature fusion strategy, i.e., a product of cosine distances [31].

(a) Comparing experiments 1 and 2, it can be found that only changing the student network’s input to anomalous images undermines performance. However, experiment 5 shows improvement when ed is added, indicating that the den can be boosted by adopting ed architecture.

(b) The comparisons of experiments 1 with 4, 2 with 6, 3 with 7, and 5 with 8, showcase that the segmentation network can significantly improve the performances of all three metrics.

(c) Comparing experiments 4 and 8, it can be found that the combination of den and ed provides more useful features for the segmentation network than a vanilla S-T network does. The best result is achieved by combining all three main designs.

网络结构。在表4中，我们评估了三种设计的有效性：用合成异常替换原始(vanilla)学生网络的训练输入以启用去噪程序(den)，将编码器-解码器架构应用于学生网络(ed)，以及附加分割网络(seg)以替换经验特征融合策略，即余弦距离的乘积[31]。

(a) 对比实验 1 和实验 2 可以发现，仅将学生网络的输入改为异常图像就会降低性能。不过，实验 5 显示，加入 ed 后，情况有所改善，这表明采用 ed 架构可以提高den的性能。

(b) 实验1与4、实验2与6、实验3与7以及实验5与8的对比结果表明，seg可以显著提高所有三个指标的性能。

(c) 对比实验4和实验8可以发现，与原始S-T网络相比，den与ed的组合能为分割网络提供更有用的特征。将三种主要设计全部结合可获得最佳结果。
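The empirical fusion strategy that seg replaces, a product of cosine-distance maps across feature levels, can be sketched as below. This is a minimal illustration rather than the implementation of [31]: `upsample_nearest` and `fuse_by_product` are hypothetical names, nearest-neighbor upsampling is used only to keep the example short, and the distance values are toy numbers.

```python
def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbor upsampling of a 2-D list to (out_h, out_w)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]

def fuse_by_product(distance_maps, out_h, out_w):
    """Multiply per-level distance maps element-wise at a common resolution."""
    fused = [[1.0] * out_w for _ in range(out_h)]
    for dmap in distance_maps:
        up = upsample_nearest(dmap, out_h, out_w)
        for i in range(out_h):
            for j in range(out_w):
                fused[i][j] *= up[i][j]
    return fused

# Two toy levels: a 2x2 fine map and a 1x1 coarse map.
level1 = [[0.5, 0.0], [0.0, 1.0]]
level2 = [[0.5]]
fused = fuse_by_product([level1, level2], 2, 2)
```

The product is a fixed rule with no learned weights, so a single level reporting zero distance suppresses the fused score at that location regardless of the other levels. A trained segmentation network can instead weight the levels adaptively.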

Table 4. Ablation studies on our main designs: denoising training (den), the encoder-decoder architecture of student network (ed), and segmentation network (seg). AUC, AP, and IAP (%) are used to evaluate image-level, pixel-level, and instance-level detection, respectively. Exp. 1 uses the same architecture as [31], but different training settings to align with Exp. 2∼8.

表4。对我们主要设计的消融研究：去噪训练(den)、学生网络的编码器-解码器架构(ed) 和分割网络(seg)。AUC、AP和IAP(%)分别用于评估图像级、像素级和实例级检测。实验1采用与[31]相同的架构，但训练设置不同，以便与实验2∼8保持一致。

Segmentation loss. In Tab. 5, we examine the effectiveness of the L1-loss in the segmentation loss (Eq. (7)). It can be observed that the L1-loss improves performance.

分割损失。在表5中，我们考察了L1损失在分割损失(公式7)中的有效性。可以看出，L1损失提高了性能。
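To make the role of the L1 term concrete, the sketch below adds an L1 loss to a pixel-wise classification loss. Eq. (7) is not reproduced in this section, so the form used here is an assumption: the other term is taken to be a focal loss, `gamma` is a hypothetical default, and the function names are illustrative only.

```python
import math

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss over flat lists of probabilities and 0/1 targets."""
    total = 0.0
    for p, y in zip(probs, targets):
        pt = p if y == 1 else 1.0 - p        # probability assigned to the true class
        pt = min(max(pt, eps), 1.0)          # guard against log(0)
        total += -((1.0 - pt) ** gamma) * math.log(pt)
    return total / len(probs)

def l1_loss(probs, targets):
    """Mean absolute difference between predicted probabilities and targets."""
    return sum(abs(p - y) for p, y in zip(probs, targets)) / len(probs)

def segmentation_loss(probs, targets, gamma=2.0):
    """Assumed form of the combined loss: focal term plus L1 term."""
    return focal_loss(probs, targets, gamma) + l1_loss(probs, targets)

uncertain = segmentation_loss([0.5, 0.5], [1, 0])
perfect = segmentation_loss([1.0, 0.0], [1, 0])
```

A focal term down-weights pixels that are already classified confidently, whereas the L1 term penalizes every deviation from the mask linearly. That difference is one plausible reason the two terms complement each other, although the section itself only reports that the L1 term improves the metrics.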

Table 5. Ablation studies on the segmentation loss: AUC, AP, and IAP (%) are used to evaluate image-level, pixel-level, and instance-level detection, respectively.

表5。关于分割损失的消融研究：AUC、AP和IAP(%)分别用于评估图像级、像素级和实例级检测。

Segmentation network input. As mentioned in Sec. 3.3, the input of the segmentation network is the element-wise product between the normalized feature maps of S-T networks as defined by Eq. (2). To prove the rationality of this setting, we build two distinct feature combinations as input. The first is to directly concatenate the feature maps of S-T networks $F_{S_k}$ and $F_{T_k}$ as the input of the segmentation network, which preserves the information of the S-T networks more effectively. The second is to compute the cosine distance of the S-T networks’ feature maps using Eq. (3), which utilizes more prior information when we train the student network by optimizing the cosine distance. We show the results in Tab. 6. Both approaches result in suboptimal performance, indicating that $\hat{X}$ is a suitable choice as the input to balance the information and prior.

分割网络输入。如第3.3节所述，分割网络的输入是S-T网络归一化特征图之间的元素乘积，如公式(2)所定义。为了证明这种设置的合理性，我们建立了两种不同的特征组合作为输入。第一种是直接串联S-T网络的特征图 $F_{S_k}$ 和 $F_{T_k}$ 作为分割网络的输入，这样可以更有效地保留S-T网络的信息。第二种方法是利用公式(3)计算S-T网络特征图的余弦距离，通过优化余弦距离来训练学生网络，从而利用更多的先验信息。结果见表6.两种方法的结果都是次优的，这表明 $\hat{X}$ 是平衡信息和先验的合适输入选择。
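The three candidate inputs compared in this ablation can be written out for a single spatial location. The sketch below follows the verbal description only (Eqs. (2) and (3) are not reproduced in this section), so the exact normalization is an assumption, and the helper names and toy feature vectors are hypothetical. Inputs are assumed to be non-zero vectors.

```python
import math

def l2_normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def normalized_product(f_t, f_s):
    """Element-wise product of the L2-normalized teacher and student vectors."""
    return [a * b for a, b in zip(l2_normalize(f_t), l2_normalize(f_s))]

def concatenation(f_t, f_s):
    """Alternative 1: keep both raw feature vectors side by side."""
    return list(f_s) + list(f_t)

def cosine_distance(f_t, f_s):
    """Alternative 2: collapse the channels into one scalar distance."""
    return 1.0 - sum(normalized_product(f_t, f_s))

# Toy 2-channel features at one pixel; values are illustrative only.
f_t = [3.0, 4.0]
f_s = [4.0, 3.0]
x_hat = normalized_product(f_t, f_s)
concat = concatenation(f_t, f_s)
dist = cosine_distance(f_t, f_s)
```

For C channels, concatenation passes 2C raw values to the segmentation network, the normalized product passes C values whose sum is the cosine similarity, and the cosine distance passes a single scalar. This makes the "balance between information and prior" concrete: the product keeps per-channel detail while still reflecting the quantity the student was trained to optimize.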

Table 6. Ablation studies on the input of segmentation network: AUC, AP, and IAP (%) are used to evaluate image-level, pixel-level, and instance-level detection, respectively.

表 6. 对分割网络输入的消融研究： AUC、AP 和 IAP (%) 分别用于评估图像级、像素级和实例级检测。

5 结论

We propose the DeSTSeg, a segmentation-guided denoising student-teacher framework for the anomaly detection task. The denoising student-teacher network is adopted to enable the S-T network to generate discriminative features in anomalous regions. The segmentation network is built to fuse the S-T network features adaptively. Experiments on the surface anomaly detection benchmark show that all of our proposed components considerably boost performance. Our results outperform the previous state-of-the-art by 0.1% AUC for image-level anomaly detection, 5.6% AP for pixel-level anomaly localization, and 4.9% IAP for instance-level anomaly detection.

我们针对异常检测任务提出了一种以分割为指导的去噪学生-教师框架DeSTSeg。采用去噪学生-教师网络，使S-T网络能够在异常区域生成判别特征。建立分割网络是为了自适应地融合S-T网络特征。在表面异常检测基准上的实验表明，我们提出的所有组件都大大提高了性能。与之前的最先进水平相比，我们的结果在图像级异常检测的AUC上高出0.1%，在像素级异常定位的AP上高出5.6%，在实例级异常检测的IAP上高出4.9%。
