[CVPR 2023] SimpleNet: A Simple Network for Image Anomaly Detection and Localization

毕加猪plus

Categories: Paper Translation, Anomaly Detection. Tags: computer vision, artificial intelligence, deep learning

Article link: https://blog.csdn.net/Vincent_Tong_/article/details/130414293

Paper link: SimpleNet
GitHub: https://github.com/DonaldRR/SimpleNet
Tags: CVPR
Year: 2023

Abstract

We propose a simple and application-friendly network (called SimpleNet) for detecting and localizing anomalies. SimpleNet consists of four components: (1) a pre-trained Feature Extractor that generates local features, (2) a shallow Feature Adapter that transfers local features towards the target domain, (3) a simple Anomaly Feature Generator that counterfeits anomaly features by adding Gaussian noise to normal features, and (4) a binary Anomaly Discriminator that distinguishes anomaly features from normal features. During inference, the Anomaly Feature Generator is discarded. Our approach is based on three intuitions. First, transforming pre-trained features into target-oriented features helps avoid domain bias. Second, generating synthetic anomalies in feature space is more effective, as defects may not have much commonality in image space. Third, a simple discriminator is much more efficient and practical. Despite its simplicity, SimpleNet outperforms previous methods quantitatively and qualitatively. On the MVTec AD benchmark, SimpleNet achieves an anomaly detection AUROC of 99.6%, reducing the error by 55.5% compared to the next best performing model. Furthermore, SimpleNet is faster than existing methods, with a high frame rate of 77 FPS on a 3080ti GPU. Additionally, SimpleNet demonstrates significant improvements in performance on the One-Class Novelty Detection task.

1 Introduction

The image anomaly detection and localization task aims to identify abnormal images and locate abnormal subregions. Techniques for detecting the various anomalies of interest have a broad set of applications in industrial inspection [3, 6]. In industrial scenarios, anomaly detection and localization is especially hard, as abnormal samples are scarce and anomalies can vary from subtle changes such as thin scratches to large structural defects, e.g. missing parts. Some examples from the MVTec AD benchmark [3], along with results from our proposed method, are shown in Figure 1. This scarcity and variety of anomalies make supervised methods impractical.

Current approaches address this problem in an unsupervised manner, where only normal samples are used during training. Reconstruction-based methods [10, 21, 31], synthesizing-based methods [17, 30], and embedding-based methods [6, 22, 24] are the three main trends for tackling this problem. Reconstruction-based methods such as [21, 31] assume that a deep network trained with only normal data cannot accurately reconstruct anomalous regions. The pixel-wise reconstruction errors are taken as anomaly scores for anomaly localization. However, this assumption may not always hold: sometimes a network can "generalize" so well that it also reconstructs abnormal inputs well, leading to misdetection [10, 19]. Synthesizing-based methods [17, 30] estimate the decision boundary between normal and anomalous samples by training on synthetic anomalies generated on anomaly-free images. However, the synthesized images are not realistic enough. Features from synthetic data might stray far from the normal features, and training with such negative samples could result in a loosely bounded normal feature space, meaning that indistinct defects could be included in the in-distribution feature space.

Recently, embedding-based methods [6, 7, 22, 24] have achieved state-of-the-art performance. These methods use ImageNet pre-trained convolutional neural networks (CNNs) to extract generalized normal features. Then a statistical algorithm such as a multivariate Gaussian distribution [6], normalizing flow [24], or memory bank [22] is adopted to embed the normal feature distribution. Anomalies are detected by comparing the input features with the learned distribution or the memorized features. However, industrial images generally have a different distribution from ImageNet, so directly using these biased features may cause mismatch problems. Moreover, the statistical algorithms suffer from high computational complexity or high memory consumption.

To mitigate the aforementioned issues, we propose a novel anomaly detection and localization network, called SimpleNet. SimpleNet takes advantage of both the synthesizing-based and the embedding-based approaches, and makes several improvements. First, instead of directly using pre-trained features, we propose to use a feature adaptor to produce target-oriented features, which reduces domain bias. Second, instead of directly synthesizing anomalies on the images, we propose to generate anomalous features by adding noise to normal features in feature space. We argue that with a properly calibrated noise scale, a closely bounded normal feature space can be obtained. Third, we simplify the anomaly detection procedure by training a simple discriminator, which is much more computationally efficient than the complex statistical algorithms adopted by the aforementioned embedding-based methods. Specifically, SimpleNet uses a pre-trained backbone for normal feature extraction, followed by a feature adaptor that transfers the features into the target domain. Then, anomaly features are generated simply by adding Gaussian noise to the adapted normal features. A simple discriminator consisting of a few MLP layers is trained on these features to discriminate anomalies.

SimpleNet is easy to train and apply, with outstanding performance and inference speed. The proposed SimpleNet, based on the widely used WideResNet50 backbone, achieves 99.6% AUROC on MVTec AD while running at 77 FPS, surpassing the previous best published anomaly detection methods in both accuracy and efficiency (see Figure 2). We further apply SimpleNet to the task of One-Class Novelty Detection to show its generality. These advantages help SimpleNet bridge the gap between academic research and industrial application. Code will be publicly available.

2 Related Work

Anomaly detection and localization methods can be mainly categorized into three types: reconstruction-based methods, synthesizing-based methods, and embedding-based methods.

Reconstruction-based methods hold the insight that anomalous image regions should not be properly reconstructed, since they do not exist in the training samples. Some methods [10] utilize generative models such as auto-encoders and generative adversarial networks [11] to encode and reconstruct normal data. Other methods [13, 21, 31] frame anomaly detection as an inpainting problem, where patches from images are masked randomly and neural networks are then used to predict the erased information. A structural similarity index (SSIM) [29] loss function is widely used in training. An anomaly map is generated as the pixel-wise difference between the input image and its reconstruction. However, if anomalies share common compositional patterns (e.g. local edges) with the normal training data, or if the decoder is "too strong" and decodes some abnormal encodings well, the anomalies in images are likely to be reconstructed well [31].

Synthesizing-based methods typically synthesize anomalies on anomaly-free images. DRÆM [30] proposes a network that is discriminatively trained in an end-to-end manner on synthetically generated just-out-of-distribution patterns. CutPaste [17] proposes a simple strategy to generate synthetic anomalies for anomaly detection: it cuts an image patch and pastes it at a random location of a large image. A CNN is trained to distinguish images from the normal and augmented data distributions. However, the appearance of the synthetic anomalies does not closely match that of real anomalies. In practice, as defects are varied and unpredictable, generating an anomaly set that includes all outliers is impossible. With the proposed SimpleNet, negative samples are synthesized in the feature space instead of on images.

Embedding-based methods have recently achieved state-of-the-art performance. These methods embed normal features into a compressed space, in which anomalous features lie far from the normal clusters. Typical methods [6, 7, 22, 24] utilize networks pre-trained on ImageNet for feature extraction. With a pre-trained model, PaDiM [6] embeds the extracted anomaly patch features by a multivariate Gaussian distribution. PatchCore [22] uses a maximally representative memory bank of nominal patch features. Mahalanobis distance or maximum feature distance is adopted to score the input features in testing. However, industrial images generally have a different distribution from ImageNet, so directly using pre-trained features may cause a mismatch problem. Moreover, either computing the inverse of the covariance [6] or searching for the nearest neighbor in the memory bank [22] limits real-time performance, especially on edge devices.

CS-Flow [24], CFLOW-AD [12], and DifferNet [23] propose to transform the normal feature distribution into a Gaussian distribution via normalizing flow (NF) [20]. As normalizing flow can only process full-sized feature maps (i.e., downsampling is not allowed) and the coupling layer [9] consumes several times more memory than a normal convolutional layer, these methods are memory-intensive. Distillation methods [4, 7] train a student network to match the outputs of a fixed pre-trained teacher network using only normal samples. Given an anomalous query, a discrepancy between the student and teacher outputs should be detected. The computational cost is doubled, as an input image must pass through both the teacher and the student.

SimpleNet overcomes the aforementioned problems. SimpleNet uses a feature adaptor that performs transfer learning on the target dataset to alleviate the bias of pre-trained CNNs. SimpleNet synthesizes anomalies in the feature space rather than directly on the images. SimpleNet follows a single-stream manner at inference and is constructed entirely from conventional CNN blocks, which facilitates fast training, inference, and industrial application.

3 Method

The proposed SimpleNet is introduced in detail in this section. As illustrated in Figure 3, SimpleNet consists of a Feature Extractor, a Feature Adaptor, an Anomalous Feature Generator, and a Discriminator. The Anomalous Feature Generator is only used during training, so SimpleNet follows a single-stream manner at inference. These modules are described below in sequence.

3.1 Feature Extractor

The Feature Extractor acquires local features as in [22]. We reformulate the process as follows. We denote the training set and test set as $\mathcal{X}_{train}$ and $\mathcal{X}_{test}$. For any image $x_i \in \mathbb{R}^{H \times W \times 3}$ in $\mathcal{X}_{train} \cup \mathcal{X}_{test}$, the pre-trained network $\phi$ extracts features from different hierarchies, as is normally done with a ResNet-like backbone. Since the pre-trained network is biased towards the dataset on which it was trained, it is reasonable to choose only a subset of levels for the target dataset. Formally, we define $L$ as the subset of hierarchy indexes to use. The feature map from level $l \in L$ is denoted as $\phi^{l,i} \sim \phi^{l}(x_{i})\in\mathbb{R}^{H_{l}\times W_{l}\times C_{l}}$, where $H_{l}$, $W_{l}$ and $C_{l}$ are the height, width and channel size of the feature map. For an entry $\phi^{l,i}_{h,w}\in \mathbb{R}^{C_{l}}$ at location $(h, w)$, its neighborhood with patch size $p$ is defined as

$\mathcal{N}^{(h,w)}_p=\{(h',y')\mid h'\in[h-\lfloor p/2\rfloor,\cdots,h+\lfloor p/2\rfloor],\ y'\in[w-\lfloor p/2\rfloor,\cdots,w+\lfloor p/2\rfloor]\}\quad(1)$

Aggregating the features within the neighborhood $\mathcal{N}^{h,w}_{p}$ with an aggregation function $f_{agg}$ (adaptive average pooling is used here) results in the local feature $z^{l,i}_{h,w}$, as

$z_{h,w}^{l,i}=f_{agg}(\{\phi_{h',y'}^{l,i}|(h',y')\in\mathcal{N}_p^{h,w}\})\quad(2)$

To combine features $z^{l,i}_{h,w}$ from different hierarchies, all feature maps are linearly resized to the same size $(H_{0},W_{0})$, i.e. the size of the largest one. Simply concatenating the feature maps channel-wise gives the feature map $o^{i}\in \mathbb{R}^{H_{0}\times W_{0} \times C}$. The process is defined as

$o^i=f_{cat}(resize(z^{l',i},(H_0,W_0))\mid l'\in L)\quad(3)$

We define $o^{i}_{h,w}\in \mathbb{R}^{C}$ as the entry of $o^{i}$ at location $(h, w)$. We simplify the above expressions as

$o^i=F_\phi(x^i)\quad(4)$

where $F_{\phi}$ is the Feature Extractor.
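
As an illustrative sketch of Equations 1 and 2 (not the authors' implementation), the following pure-Python snippet builds the neighborhood of a location and averages the feature vectors inside it. The function names `neighborhood` and `aggregate`, the nested-list layout of the feature map, and the choice to drop locations that fall outside the map are all assumptions made for this example; the text only specifies the window and average pooling.

```python
from typing import List, Tuple

Vector = List[float]
FeatureMap = List[List[Vector]]  # indexed as [h][w][channel]


def neighborhood(h: int, w: int, p: int, height: int, width: int) -> List[Tuple[int, int]]:
    """Locations in the p x p window centered at (h, w), as in Equation 1.

    Assumption: locations outside the feature map are dropped.
    """
    r = p // 2
    return [
        (hh, ww)
        for hh in range(h - r, h + r + 1)
        for ww in range(w - r, w + r + 1)
        if 0 <= hh < height and 0 <= ww < width
    ]


def aggregate(phi: FeatureMap, h: int, w: int, p: int = 3) -> Vector:
    """Average the feature vectors in the neighborhood (Equation 2)."""
    height, width = len(phi), len(phi[0])
    channels = len(phi[0][0])
    locs = neighborhood(h, w, p, height, width)
    return [
        sum(phi[hh][ww][c] for hh, ww in locs) / len(locs)
        for c in range(channels)
    ]


# Toy 3x3 feature map with a single channel holding the values 0..8.
phi = [[[float(3 * r + c)] for c in range(3)] for r in range(3)]
center = aggregate(phi, 1, 1)  # averages all nine entries
corner = aggregate(phi, 0, 0)  # averages the 2x2 block left after clipping
```

With $p = 3$ the center location sees the full window, while a corner location only sees the four in-bounds entries under the clipping assumption used here.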

3.2 Feature Adaptor

As industrial images generally have a different distribution from the dataset used in backbone pre-training, we adopt a Feature Adaptor $G_{\theta}$ to transfer the training features to the target domain. The Feature Adaptor $G_{\theta}$ projects the local feature $o_{h,w}$ to the adapted feature $q_{h,w}$ as

$q_{h,w}^{i}=G_{\theta}(o_{h,w}^{i})\quad(5)$

The Feature Adaptor can be made up of simple neural blocks such as a fully connected layer or a multi-layer perceptron (MLP). We find experimentally that a single fully connected layer yields good performance.

3.3 Anomalous Feature Generator

To train the Discriminator to estimate the likelihood of samples being normal, the easiest way is to sample negative samples, i.e. defect features, and optimize it together with normal samples. However, the lack of defects makes estimating the sampling distribution intractable. While [17, 18, 30] rely on extra data to synthesize defect images, we add simple noise to normal samples in the feature space, and argue that this outperforms those image-manipulation methods.

The anomalous features are generated by adding Gaussian noise to the normal features $q^{i}_{h,w}\in \mathbb{R}^{C}$. Formally, a noise vector $\epsilon \in \mathbb{R}^{C}$ is sampled, with each entry following an i.i.d. Gaussian distribution $\mathcal{N}(\mu,\sigma^{2})$. The anomalous feature $q_{h,w}^{i-}$ is fused as

$q_{h,w}^{i-}=q_{h,w}^{i}+\epsilon\quad(6)$
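
Equation 6 can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the helper name `make_anomalous` and the toy 4-dimensional feature are hypothetical, and the defaults $\mu = 0$ and $\sigma = 0.015$ are taken from the implementation details in Section 4.3.

```python
import random
from typing import List, Optional


def make_anomalous(q: List[float], sigma: float = 0.015, mu: float = 0.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Return q + eps, where every entry of eps is drawn i.i.d. from N(mu, sigma^2)."""
    rng = rng or random.Random()
    return [value + rng.gauss(mu, sigma) for value in q]


rng = random.Random(0)          # fixed seed so the toy run is repeatable
q = [0.2, -0.1, 0.05, 0.3]      # a toy adapted feature (C = 4)
q_minus = make_anomalous(q, rng=rng)
```

Each call perturbs every channel independently, so the synthesized negative stays close to its normal counterpart when $\sigma$ is small.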

Figure 4 illustrates the influence of anomalous features on four classes of MVTec AD. We can see that the standard deviation along each dimension of the adapted features tends to be consistent. Thus, the feature space tends to be compact when distinguishing anomalous features from normal features.

3.4 Discriminator

The Discriminator $D_{\psi}$ works as a normality scorer, estimating the normality at each location $(h, w)$ directly. Since negative samples are generated along with the normal features $\{q^{i}|x^{i}\in \mathcal{X}_{train}\}$, both are fed to the Discriminator during training. The Discriminator is expected to give positive outputs for normal features and negative outputs for anomalous features. We simply use a 2-layer multi-layer perceptron (MLP) structure, as common classifiers do, estimating normality as $D_{\psi}(q_{h,w})\in\mathbb{R}$.
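
A minimal sketch of such a 2-layer scorer, in pure Python, might look as follows. It is not the authors' code: the helper names, the tiny 2-dimensional feature, and the hand-picked weights are hypothetical, the leaky ReLU slope of 0.2 is borrowed from the implementation details in Section 4.3, and the batch normalization layer mentioned there is left out to keep the example short.

```python
from typing import List


def linear(x: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Fully connected layer: weight @ x + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def leaky_relu(x: List[float], slope: float = 0.2) -> List[float]:
    """Keep positive values, scale negative values by the slope."""
    return [v if v > 0 else slope * v for v in x]


def discriminator(q: List[float], w1, b1, w2, b2) -> float:
    """Linear -> leaky ReLU -> linear, returning one real-valued normality score."""
    hidden = leaky_relu(linear(q, w1, b1))
    return linear(hidden, w2, b2)[0]


# Hypothetical weights for a 2-dimensional feature and 2 hidden units.
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, 1.0]], [0.0]
score_a = discriminator([1.0, 2.0], w1, b1, w2, b2)    # positive score
score_b = discriminator([-1.0, -2.0], w1, b1, w2, b2)  # negative score
```

The single scalar output per location is what makes the scorer cheap: inference needs one small matrix product per layer rather than a nearest-neighbor search or a covariance inverse.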

3.5 Loss Function and Training

A simple truncated $\ell_1$ loss is derived as

$l_{h,w}^i=\max(0,th^+-D_{\psi}(q_{h,w}^i))+\max(0,-th^-+D_{\psi}(q_{h,w}^{i-})) \quad(7)$

$th^{+}$ and $th^{-}$ are truncation terms that prevent overfitting. They are set to 0.5 and -0.5 by default. The training objective is

$\mathcal{L}=\min\limits_{\theta,\psi}\sum\limits_{x^i\in \mathcal{X}_{train}}\sum\limits_{h,w}\frac{l_{h,w}^i}{H_0*W_0}\quad(8)$
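
To make Equations 7 and 8 concrete, here is a small pure-Python sketch of the per-location loss and its average over one image. The function names are hypothetical and the scores are toy values; the thresholds use the defaults stated above ($th^{+} = 0.5$, $th^{-} = -0.5$).

```python
from typing import List


def truncated_l1(d_normal: float, d_anomalous: float,
                 th_pos: float = 0.5, th_neg: float = -0.5) -> float:
    """Equation 7 for one location: penalize normal scores below th_pos
    and anomalous scores above th_neg."""
    return max(0.0, th_pos - d_normal) + max(0.0, -th_neg + d_anomalous)


def image_loss(normal_scores: List[float], anomalous_scores: List[float]) -> float:
    """Inner term of Equation 8: mean of the per-location losses of one image."""
    per_location = [truncated_l1(n, a) for n, a in zip(normal_scores, anomalous_scores)]
    return sum(per_location) / len(per_location)


well_separated = truncated_l1(1.0, -1.0)   # both margins satisfied
undecided = truncated_l1(0.0, 0.0)         # both margins violated by 0.5
loss = image_loss([1.0, 0.0], [-1.0, 0.0])
```

Once a normal score exceeds 0.5 and its noisy counterpart falls below -0.5, that location contributes no gradient, which is the truncation the text refers to.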

We experimentally compare the proposed truncated $\ell_1$ loss function with the widely used cross-entropy loss in the experiments section. The pseudo-code of the training procedure is shown in Algorithm 1.

3.6 Inference and Scoring Function

The Anomalous Feature Generator is discarded at inference. Note that the remaining modules can be stacked into an end-to-end network. We feed each $x_{i}\in \mathcal{X}_{test}$ into the aforementioned Feature Extractor $F_{\phi}$ and Feature Adaptor $G_{\theta}$ sequentially to get the adapted features $q_{h,w}^{i}$ as in Equation 5. The anomaly score is provided by the Discriminator $D_{\psi}$ as

$s^i_{h,w}=-D_\psi(q^i_{h,w})\quad(9)$

The anomaly map for anomaly localization during inference is defined as

$S_{AL}(x_i):=\{s_{h,w}^i|(h,w)\in W_0\times H_0\}\quad(10)$

Then $S_{AL}(x_{i})$ is interpolated to the spatial resolution of the input sample and Gaussian filtered with $\sigma=4$ for smooth boundaries. As the most responsive point exists for an anomalous region of any size, the maximum score of the anomaly map is taken as the anomaly detection score of each image:

$S_{AD}(x_i):=\max\limits_{(h,w)\in W_0\times H_0}s^i_{h,w}\quad(11)$
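
The scoring in Equations 9 to 11 reduces to a sign flip and a maximum, as the following sketch shows. The function names and the 2×3 grid of discriminator outputs are hypothetical, and the interpolation and Gaussian filtering steps described above are deliberately omitted.

```python
from typing import List


def anomaly_map(discriminator_outputs: List[List[float]]) -> List[List[float]]:
    """Equations 9 and 10: negate the normality score at every location."""
    return [[-d for d in row] for row in discriminator_outputs]


def image_score(score_map: List[List[float]]) -> float:
    """Equation 11: the image-level score is the maximum over all locations."""
    return max(max(row) for row in score_map)


# Toy discriminator outputs on a 2x3 grid; one location looks abnormal (-0.4).
d_out = [[0.9, 0.8, 0.7],
         [0.6, -0.4, 0.5]]
s_al = anomaly_map(d_out)
s_ad = image_score(s_al)
```

A single low normality score is enough to raise the image-level score, which matches the argument that the most responsive point exists regardless of the size of the defect.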

4 Experiments

4.1 Datasets

We conduct most of the experiments on the MVTec Anomaly Detection benchmark [3], a well-known dataset in the anomaly detection and localization field. MVTec AD contains 5 texture and 10 object categories stemming from manufacturing, with a total of 5354 images. The dataset is composed of normal images for training, and both normal and anomalous images with various types of defects for testing. It also provides pixel-level annotations for defective test images. Typical images are illustrated in Figure 1. As in [6, 22], images are resized to 256 × 256 and then center cropped to 224 × 224. No data augmentation is applied. We follow the one-class classification protocol, also known as cold-start anomaly detection, where we train a one-class classifier for each category on its respective normal training samples.

We conduct one-class novelty detection on CIFAR10 [16], which contains 50K training images and 10K test images of size 32 × 32 in 10 categories. Under the setting of one-class novelty detection, one category is regarded as normal data and the other categories are used as novelty.
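
The one-class split described above can be sketched as follows. The helper `one_class_split` and the toy label lists are hypothetical; the snippet only illustrates how one category is kept for training while every other category counts as novel at test time.

```python
from typing import List, Tuple


def one_class_split(train_labels: List[int], test_labels: List[int],
                    normal_class: int) -> Tuple[List[int], List[int]]:
    """Return the training indices of the normal class and a 0/1 novelty flag per test sample."""
    train_idx = [i for i, y in enumerate(train_labels) if y == normal_class]
    is_novel = [int(y != normal_class) for y in test_labels]
    return train_idx, is_novel


# Toy labels standing in for CIFAR10 categories.
train_idx, is_novel = one_class_split([0, 1, 0, 2], [0, 1, 2, 0], normal_class=0)
```

Repeating this split once per category gives ten separate one-class problems, one for each CIFAR10 class.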

4.2 Evaluation Metrics

Image-level anomaly detection performance is measured via the standard Area Under the Receiver Operating Characteristic curve, which we denote as I-AUROC, using the produced anomaly detection scores $S_{AD}$ (Equation 11). For anomaly localization, the anomaly map $S_{AL}$ (Equation 10) is used to evaluate the pixel-wise AUROC (denoted as P-AUROC). In accordance with prior works [6, 22], we compute on MVTec AD the class-average AUROC and the mean AUROC over all categories for detection and localization. The comparison baselines include AE-SSIM [3], RIAD [31], DRÆM [30], CutPaste [17], CS-Flow [24], PaDiM [6], RevDist [7] and PatchCore [22].
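
For readers unfamiliar with the metric, AUROC equals the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The quadratic-time sketch below computes it that way; the function name and the toy scores are hypothetical, and practical evaluations would normally rely on an optimized library routine instead.

```python
from typing import List


def auroc(scores: List[float], labels: List[int]) -> float:
    """AUROC via pairwise comparison of anomalous (label 1) and normal (label 0) scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# One normal sample (0.4) outranks one anomalous sample (0.35).
value = auroc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The same function serves both metrics: I-AUROC uses one score per image, while P-AUROC treats every pixel of the anomaly map as a sample.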

4.3 Implementation Details

This section describes the configuration and implementation details of the experiments in this paper. All backbones used in the experiments were pre-trained on ImageNet [8]. When the backbone is a ResNet-like architecture, the 2nd and 3rd intermediate layers of the backbone, i.e. $l'\in\{2,3\}$ in Equation 3, are used in the feature extractor as in [22]. By default, our implementation uses WideResNet50 as the backbone, and the feature dimension from the feature extractor is set to 1536. The subsequent feature adaptor is essentially a fully connected layer without bias. The dimensions of the input and output features of the FC layer in the adaptor are the same. The anomaly feature generator adds i.i.d. Gaussian noise $\mathcal{N}(0,\sigma^{2})$ to each entry of the normal features; $\sigma$ is set to 0.015 by default. The subsequent discriminator is composed of a linear layer, a batch normalization layer, a leaky ReLU (0.2 slope), and a linear layer. $th^{+}$ and $th^{-}$ in Equation 7 are set to 0.5 and -0.5, respectively, as in Section 3.5. The Adam optimizer is used, with the learning rates for the feature adaptor and discriminator set to 0.0001 and 0.0002 respectively, and weight decay set to 0.00001. The number of training epochs is set to 160 for each dataset and the batch size is 4.
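
The defaults listed in this section can be collected in one place, for instance as a small dataclass. The class name `SimpleNetConfig` and all field names are hypothetical and do not come from the official repository; only the values are taken from the text, with `threshold` holding the magnitude of the truncation terms and `patch_size` taken from the ablation in Section 4.7.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SimpleNetConfig:
    """Default hyperparameters as stated in the text (field names are illustrative)."""
    backbone: str = "wideresnet50"
    layers: Tuple[int, ...] = (2, 3)   # hierarchy levels fed to the feature extractor
    feature_dim: int = 1536            # adaptor input and output width
    patch_size: int = 3                # neighborhood size p
    noise_std: float = 0.015           # sigma of the Gaussian noise
    threshold: float = 0.5             # magnitude of the truncation terms
    leaky_relu_slope: float = 0.2
    adaptor_lr: float = 1e-4
    discriminator_lr: float = 2e-4
    weight_decay: float = 1e-5
    epochs: int = 160
    batch_size: int = 4


cfg = SimpleNetConfig()
```

Freezing the dataclass keeps the defaults from being changed by accident during an ablation; a variant run can be created with `dataclasses.replace(cfg, noise_std=0.03)`.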

4.4 Anomaly Detection on MVTec AD

Anomaly detection results on MVTec AD are shown in Table 1. The image-level anomaly score is given by the maximum score of the anomaly map, as in Equation 11. SimpleNet achieves the highest score for 9 out of 15 classes. For textures and objects, SimpleNet reaches a new SOTA of 99.8% and 99.5% I-AUROC, respectively. SimpleNet achieves a significantly higher mean image anomaly detection performance, i.e. an I-AUROC score of 99.6%. Note that a reduction from an error of 0.9% for PatchCore [22] (the next best competitor, under the same WideResNet50 backbone) to 0.4% for SimpleNet means a reduction of the error by 55.5%. In industrial inspection settings, this is a relevant and significant reduction.
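
The relative error reduction quoted above follows from simple arithmetic on the two error rates. The helper name below is hypothetical; the inputs are the 0.9% and 0.4% errors stated in the text.

```python
def relative_error_reduction(baseline_error: float, new_error: float) -> float:
    """Fraction of the baseline error that has been removed."""
    return (baseline_error - new_error) / baseline_error


reduction = relative_error_reduction(0.9, 0.4)  # errors in percentage points
print(f"{reduction:.1%}")  # prints 55.6%
```

The exact value is 5/9, which rounds to 55.6% and is reported in the text as 55.5%.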

4.5 Anomaly Localization on MVTec AD

The anomaly localization performance is measured by pixel-wise AUROC, which we denote as P-AUROC. Comparisons with the state-of-the-art methods are shown in Table 1. SimpleNet achieves the best anomaly localization performance of 98.1% P-AUROC on MVTec AD, as well as a new SOTA of 98.4% P-AUROC for objects. SimpleNet achieves the highest score for 4 out of 15 classes. We visualize representative samples for anomaly localization in Figure 8.

4.6 Inference Time

Alongside detection and localization performance, inference time is the most important concern for industrial model deployment. The comparison with the state-of-the-art methods on inference time is shown in Figure 2. All the methods are measured on the same hardware, containing an Nvidia GeForce GTX 3080ti GPU and an Intel® Xeon® CPU E5-2680 v3@2.5GHZ. It clearly shows that our method achieves the best performance and the fastest speed at the same time. SimpleNet is nearly 8× faster than PatchCore [22].

4.7 Ablation Study

Neighborhood size and hierarchies. We investigate the influence of the neighborhood size $p$ in Equation 1. Results in Figure 6 show a clear optimum between locality and global context for anomaly predictions, thus motivating the neighborhood size $p = 3$. We design a group of experiments to test the influence of the hierarchy subset $L$ on model performance; the results are shown in Table 2. We index the first three WideResNet50 blocks with 1 to 3. As can be seen, features from hierarchy level 3 alone can already achieve state-of-the-art performance, but benefit from the additional hierarchy level 2. We choose 2 + 3 as the default setting.

Adaptor configuration. The adaptor provides a transformation (projection) of the pre-trained features. Our default feature adaptor is a single FC layer without bias, with equal input and output channels. A comparison of different feature adaptors is shown in Table 3. The first row, "Ours", follows the same configuration as in Table 1. "Ours-complex-FA" replaces the simple feature adaptor with a nonlinear one (i.e. a 1-layer MLP with nonlinearity). The row "Ours-w/o-FA" drops the feature adaptor. The results indicate that a single FC layer yields the best performance. Intuitively, the feature adaptor finds a projection such that the faked abnormal features and the projected pre-trained features are easily separated, meaning a simple solution for the discriminator. This is also indicated by the observation that using a feature adaptor helps the network converge faster (Figure 7). We observe a significant performance drop with a complex feature adaptor. One possible reason is that a complex adaptor may lead to overfitting, reducing the generalization ability for various defects at test time. Figure 4 compares the histograms of the standard deviation along each dimension of the features before and after the feature adaptor. We can see that, when training with anomalous features, the adapted feature space becomes compact.

Scale of noise. The scale of the noise in the anomaly feature generator controls how far the synthesized abnormal features are from the normal ones. Specifically, a high $\sigma$ results in abnormal features that keep a large Euclidean distance from the normal features. Training with a large $\sigma$ results in a loose decision boundary, leading to a high false-negative rate. Conversely, if $\sigma$ is tiny, the training procedure becomes unstable and the discriminator cannot generalize well to normal features. Figure 5 details the effect of $\sigma$ for each class in MVTec AD. As can be seen, $\sigma = 0.015$ strikes the balance and yields the best performance.
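
As a rough worked estimate of this distance (assuming zero-mean noise, $\mu = 0$, and the default feature dimension $C = 1536$ and $\sigma = 0.015$ from Section 4.3), the expected squared Euclidean norm of the noise vector $\epsilon$ in Equation 6 is

$\mathbb{E}\left[\lVert\epsilon\rVert_2^2\right]=\sum_{c=1}^{C}\mathbb{E}\left[\epsilon_c^2\right]=C\sigma^{2},\qquad \sqrt{C}\,\sigma=\sqrt{1536}\times 0.015\approx 0.59$

so the typical offset of a synthesized feature from its normal counterpart is about $\sqrt{C}\,\sigma$ and grows linearly with $\sigma$. Doubling $\sigma$ to 0.03, for example, would roughly double the offset to about 1.18.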

Loss function. We compare the loss function proposed in Section 3.5 with the widely used cross-entropy loss (as shown in the row "Ours-CE" in Table 3). The proposed loss improves over the cross-entropy loss by 0.2% I-AUROC and 0.3% P-AUROC.

Dependency on backbone. We test SimpleNet with different backbones; the results are shown in Table 4. We find that the results are mostly stable across the choice of backbone. WideResNet50 is chosen to be comparable with PaDiM [6] and PatchCore [22].

Qualitative results. Figure 8 shows anomaly localization results that indicate the abnormal areas. The threshold for the segmentation results is obtained by calculating the F1-score for all anomaly scores of each sub-class. The experimental results show that the proposed method can localize abnormal areas well, even in rather difficult cases. In addition, the proposed method performs consistently across both object and texture classes.
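
One plausible reading of the thresholding step is to pick, among the observed anomaly scores, the threshold that maximizes the F1-score. The sketch below follows that reading; it is an assumption rather than the authors' exact procedure, and the function names and toy scores are hypothetical.

```python
from typing import List


def f1_score(scores: List[float], labels: List[int], threshold: float) -> float:
    """F1 of the rule 'score >= threshold means anomalous' (label 1 = anomalous)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(scores: List[float], labels: List[int]) -> float:
    """Try every distinct score as a threshold and keep the one with the highest F1
    (the lowest such threshold when several tie)."""
    return max(sorted(set(scores)), key=lambda t: f1_score(scores, labels, t))


scores = [0.1, 0.2, 0.7, 0.9]
labels = [0, 0, 1, 1]
threshold = best_f1_threshold(scores, labels)
```

Pixels whose anomaly score reaches the selected threshold would then be drawn as the segmented abnormal area.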

4.8 One-Class Novelty Detection

To evaluate the generality of the proposed SimpleNet, we conduct a one-class novelty detection experiment on CIFAR-10 [16]. Following [19], we train the model with samples from a single class and detect novel samples from the other categories. We train a separate model for each class. Note that the novelty score is defined as the max score in the similarity map. Table 5 reports the I-AUROC scores of our method and other methods. For fair comparison, all the methods are pre-trained on ImageNet. The baselines include VAE [2], LSA [1], DSVDD [25], OCGAN [19], HRN [15], AnoGAN [27], DAAD [14], MKD [26], DisAug CLR [28], IGD [5] and RevDist [7]. Our method outperforms these comparison methods. Note that IGD [5] and DisAug CLR [28] achieve 91.25% and 92.4% respectively when boosted by self-supervised learning.

5 Conclusion

In this paper, we propose a simple but efficient approach named SimpleNet for unsupervised anomaly detection and localization. SimpleNet consists of several simple neural network modules that are easy to train and apply in industrial scenarios. Though simple, SimpleNet achieves the highest performance as well as the fastest inference speed compared to previous state-of-the-art methods on the MVTec AD benchmark. SimpleNet provides a new perspective for bridging the gap between academic research and industrial application in anomaly detection and localization.