[Paper Translation, ICPR 2021] PaDiM: a Patch Distribution Modeling Framework for Anomaly Detection and Localization


Tags: ICPR
github
Year: 2021
Note: Embedding-based method

Abstract

We present a new framework for Patch Distribution Modeling, PaDiM, to concurrently detect and localize anomalies in images in a one-class learning setting. PaDiM makes use of a pretrained convolutional neural network (CNN) for patch embedding, and of multivariate Gaussian distributions to get a probabilistic representation of the normal class. It also exploits correlations between the different semantic levels of the CNN to better localize anomalies. PaDiM outperforms current state-of-the-art approaches for both anomaly detection and localization on the MVTec AD and STC datasets. To match real-world visual industrial inspection, we extend the evaluation protocol to assess performance of anomaly localization algorithms on a non-aligned dataset. The state-of-the-art performance and low complexity of PaDiM make it a good candidate for many industrial applications.


1 Introduction

Humans are able to detect heterogeneous or unexpected patterns in a set of homogeneous natural images. This task is known as anomaly or novelty detection and has a large number of applications, among which visual industrial inspections. However, anomalies are very rare events on manufacturing lines and cumbersome to detect manually. Therefore, anomaly detection automation would enable a constant quality control by avoiding reduced attention span and facilitating human operator work. In this paper, we focus on anomaly detection and, in particular, on anomaly localization, mainly in an industrial inspection context. In computer vision, anomaly detection consists in giving an anomaly score to images. Anomaly localization is a more complex task which assigns each pixel, or each patch of pixels, an anomaly score to output an anomaly map. Thus, anomaly localization yields more precise and interpretable results. Examples of anomaly maps produced by our method to localize anomalies in images from the MVTec Anomaly Detection (MVTec AD) dataset [1] are displayed in Figure 1.



Fig. 1. Image samples from the MVTec AD. Left column: normal images of Transistor, Capsule and Wood classes. Middle column: images of the same classes with the ground truth anomalies highlighted in yellow. Right column: anomaly heatmaps obtained by our PaDiM model. Yellow areas correspond to the detected anomalies, whereas the blue areas indicate the normality zones.


Anomaly detection is a binary classification between the normal and the anomalous classes. However, it is not possible to train a model with full supervision for this task because we frequently lack anomalous examples, and, what is more, anomalies can have unexpected patterns. Hence, anomaly detection models are often estimated in a one-class learning setting, i.e., when the training dataset contains only images from the normal class and anomalous examples are not available during the training. At test time, examples that differ from the normal training dataset are classified as anomalous.


Recently, several methods have been proposed to combine anomaly localization and detection tasks in a one-class learning setting. However, either they require deep neural network training [3], [6] which might be cumbersome, or they use a K-nearest-neighbor (K-NN) algorithm [7] on the entire training dataset at test time [4], [5]. The linear complexity of the KNN algorithm increases the time and space complexity as the size of the training dataset grows. These two scalability issues may hinder the deployment of anomaly localization algorithms in industrial context.


To mitigate the aforementioned issues, we propose a new anomaly detection and localization approach, named PaDiM for Patch Distribution Modeling. It makes use of a pretrained convolutional neural network (CNN) for embedding extraction and has the two following properties:

  • Each patch position is described by a multivariate Gaussian distribution;
  • PaDiM takes into account the correlations between different semantic levels of a pretrained CNN.


With this new and efficient approach, PaDiM outperforms the existing state-of-the-art methods for anomaly localization and detection on the MVTec AD [1] and the ShanghaiTech Campus (STC) [8] datasets. Besides, at test time, it has a low time and space complexity, independent of the dataset training size which is an asset for industrial applications. We also extend the evaluation protocol to assess model performance in more realistic conditions, i.e., on a non-aligned dataset.


2 Related Work

Anomaly detection and localization methods can be categorized as either reconstruction-based or embedding similarity-based methods.


Reconstruction-based methods are widely-used for anomaly detection and localization. Neural network architectures like autoencoders (AE) [1], [9]–[11], variational autoencoders (VAE) [3], [12]–[14] or generative adversarial networks (GAN) [15]–[17] are trained to reconstruct normal training images only. Therefore, anomalous images can be spotted as they are not well reconstructed. At the image level, the simplest approach is to take the reconstructed error as an anomaly score [10] but additional information from the latent space [16], [18], intermediate activations [19] or a discriminator [17], [20] can help to better recognize anomalous images. Yet to localize anomalies, reconstruction-based methods can take the pixel-wise reconstruction error as the anomaly score [1] or the structural similarity [9]. Alternatively, the anomaly map can be a visual attention map generated from the latent space [3], [14]. Although reconstruction-based methods are very intuitive and interpretable, their performance is limited by the fact that AE can sometimes yield good reconstruction results for anomalous images too [21].


Embedding similarity-based methods use deep neural networks to extract meaningful vectors describing an entire image for anomaly detection [6], [22]–[24] or an image patch for anomaly localization [2], [4], [5], [25]. Still, embedding similarity-based methods that only perform anomaly detection give promising results but often lack interpretability as it is not possible to know which part of an anomalous image is responsible for a high anomaly score. The anomaly score is in this case the distance between embedding vectors of a test image and reference vectors representing normality from the training dataset. The normal reference can be the center of an n-sphere containing embeddings from normal images [4], [22], parameters of Gaussian distributions [23], [26] or the entire set of normal embedding vectors [5], [24]. The last option is used by SPADE [5] which has the best reported results for anomaly localization. However, it runs a K-NN algorithm on a set of normal embedding vectors at test time, so the inference complexity scales linearly with the dataset training size. This may hinder industrial deployment of the method.


Our method, PaDiM, generates patch embeddings for anomaly localization, similar to the aforementioned approaches. However, the normal class in PaDiM is described through a set of Gaussian distributions that also model correlations between semantic levels of the used pretrained CNN model. Inspired by [5], [23], we choose as pretrained networks a ResNet [27], a Wide-ResNet [28] or an EfficientNet [29]. Thanks to this modelisation, PaDiM outperforms the current state-of-the-art methods. Moreover, its time complexity is low and independent of the training dataset size at the prediction stage.


3 Patch Distribution Modeling

3.1 Embedding Extraction

Pretrained CNNs are able to output relevant features for anomaly detection [24]. Therefore, we choose to avoid ponderous neural network optimization by only using a pretrained CNN to generate patch embedding vectors. The patch embedding process in PaDiM is similar to the one from SPADE [5] and illustrated in Figure 2. During the training phase, each patch of the normal images is associated to its spatially corresponding activation vectors in the pretrained CNN activation maps. Activation vectors from different layers are then concatenated to get embedding vectors carrying information from different semantic levels and resolutions, in order to encode fine-grained and global contexts. As activation maps have a lower resolution than the input image, many pixels have the same embeddings and then form pixel patches with no overlap in the original image resolution. Hence, an input image can be divided into a grid of $(i,j) \in [1,W] \times [1,H]$ positions where $W \times H$ is the resolution of the largest activation map used to generate embeddings. Finally, each patch position $(i,j)$ in this grid is associated to an embedding vector $x_{ij}$ computed as described above.

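As a concrete illustration, a minimal PyTorch sketch of this embedding step could look as follows. The backbone choice, the 224x224 input size, the hook-based extraction and the helper name `embed_image` are our illustrative assumptions, not the paper's reference code.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(pretrained=True).eval()

# Collect the outputs of the first three residual blocks with forward hooks.
features = {}
def hook(name):
    def _hook(module, inputs, output):
        features[name] = output.detach()
    return _hook

for name in ["layer1", "layer2", "layer3"]:
    getattr(model, name).register_forward_hook(hook(name))

@torch.no_grad()
def embed_image(x):
    """x: (B, 3, 224, 224) normalized image batch -> (B, D, W, H) patch embeddings."""
    model(x)
    maps = [features[n] for n in ["layer1", "layer2", "layer3"]]
    size = maps[0].shape[-2:]                      # resolution of the largest activation map
    maps = [F.interpolate(m, size=size, mode="nearest") for m in maps]
    return torch.cat(maps, dim=1)                  # concatenate along the channel dimension

# Example: D = 64 + 128 + 256 = 448 dimensions per patch position for a ResNet18.
emb = embed_image(torch.randn(1, 3, 224, 224))
print(emb.shape)  # torch.Size([1, 448, 56, 56])
```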

The generated patch embedding vectors may carry redundant information, therefore we experimentally study the possibility to reduce their size (Section 5.1). We noticed that randomly selecting a few dimensions is more efficient than a classic Principal Component Analysis (PCA) algorithm [30]. This simple random dimensionality reduction significantly decreases the complexity of our model for both training and testing time while maintaining the state-of-the-art performance. Finally, patch embedding vectors from test images are used to output an anomaly map with the help of the learned parametric representation of the normal class described in the next subsection.

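A minimal sketch of this random dimensionality reduction, assuming the same randomly drawn channel indices are kept for both training and test embeddings (variable names are ours):

```python
import torch

full_dim, reduced_dim = 448, 100          # e.g. ResNet18 embeddings reduced to Rd100
generator = torch.Generator().manual_seed(0)
kept_idx = torch.randperm(full_dim, generator=generator)[:reduced_dim]

def reduce(embeddings):
    """embeddings: (B, D, W, H) -> (B, d, W, H), keeping the same channels everywhere."""
    return embeddings[:, kept_idx]        # draw kept_idx once, reuse it at train and test time
```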

3.2 Learning of the Normality

To learn the normal image characteristics at position $(i,j)$, we first compute the set of patch embedding vectors at $(i,j)$, $X_{ij} = \{x_{ij}^{k}, k \in [1, N]\}$, from the N normal training images as shown on Figure 2. To sum up the information carried by this set we make the assumption that $X_{ij}$ is generated by a multivariate Gaussian distribution $\mathcal{N}(\mu_{ij}, \Sigma_{ij})$ where $\mu_{ij}$ is the sample mean of $X_{ij}$ and the sample covariance $\Sigma_{ij}$ is estimated as follows:


Fig. 2. For each image patch corresponding to position $(i,j)$ in the largest CNN feature map, PaDiM learns the Gaussian parameters $(\mu_{ij}, \Sigma_{ij})$ from the set of N training embedding vectors $X_{ij} = \{x_{ij}^{k}, k \in [1, N]\}$, computed from N different training images and three different pretrained CNN layers.


$$\Sigma_{ij} = \frac{1}{N-1}\sum_{k=1}^{N}(x_{ij}^{k}-\mu_{ij})(x_{ij}^{k}-\mu_{ij})^{\mathrm{T}} + \epsilon I \quad (1)$$

where the regularisation term $\epsilon I$ makes the sample covariance matrix $\Sigma_{ij}$ full rank and invertible. Finally, each possible patch position is associated with a multivariate Gaussian distribution as shown in Figure 2 by the matrix of Gaussian parameters.

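The sketch below estimates the Gaussian parameters of Equation 1 for every patch position from a stack of N training embedding maps; NumPy is used for brevity, the array layout is our own convention, and `eps` plays the role of the regularisation term.

```python
import numpy as np

def fit_gaussians(train_embeddings, eps=0.01):
    """train_embeddings: (N, D, W, H) embeddings of the N normal training images.

    Returns the per-position sample means (W, H, D) and regularised
    sample covariances (W, H, D, D) of Equation 1."""
    n, d, w, h = train_embeddings.shape
    x = train_embeddings.transpose(2, 3, 0, 1)           # (W, H, N, D)
    mean = x.mean(axis=2)                                # (W, H, D)
    centered = x - mean[:, :, None, :]                   # (W, H, N, D)
    cov = np.einsum("whnd,whne->whde", centered, centered) / (n - 1)
    cov += eps * np.eye(d)                               # epsilon * I keeps Sigma invertible
    return mean, cov
```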

Our patch embedding vectors carry information from different semantic levels. Hence, each estimated multivariate Gaussian distribution $\mathcal{N}(\mu_{ij}, \Sigma_{ij})$ captures information from different levels too and $\Sigma_{ij}$ contains the inter-level correlations. We experimentally show (Section 5.1) that modeling these relationships between the different semantic levels of the pretrained CNN helps to increase anomaly localization performance.


3.3 Inference: Computation of the Anomaly Map

Inspired by [23], [26], we use the Mahalanobis distance [31] $M(x_{ij})$ to give an anomaly score to the patch in position $(i,j)$ of a test image. $M(x_{ij})$ can be interpreted as the distance between the test patch embedding $x_{ij}$ and the learned distribution $\mathcal{N}(\mu_{ij}, \Sigma_{ij})$, where $M(x_{ij})$ is computed as follows:


$$M(x_{ij}) = \sqrt{(x_{ij}-\mu_{ij})^{\mathrm{T}}\Sigma_{ij}^{-1}(x_{ij}-\mu_{ij})} \quad (2)$$

Hence, the matrix of Mahalanobis distances $M = (M(x_{ij}))_{1<i<W,\,1<j<H}$ that forms an anomaly map can be computed. High scores in this map indicate the anomalous areas. The final anomaly score of the entire image is the maximum of the anomaly map $M$. Finally, at test time, our method does not have the scalability issue of the K-NN based methods [4]–[6], [25] as we do not have to compute and sort a large amount of distance values to get the anomaly score of a patch.

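A hedged sketch of this inference step, continuing the conventions of `fit_gaussians` above: the Mahalanobis distance of Equation 2 is computed for each patch position of a test embedding map, and the maximum of the resulting map serves as the image-level score. Precomputing the inverse covariances is one possible optimisation on our side, not something the text prescribes.

```python
import numpy as np

def anomaly_map(test_embedding, mean, cov_inv):
    """test_embedding: (D, W, H); mean: (W, H, D); cov_inv: (W, H, D, D).

    Returns the (W, H) anomaly map of Mahalanobis distances (Equation 2)."""
    x = test_embedding.transpose(1, 2, 0)                    # (W, H, D)
    delta = x - mean                                         # (W, H, D)
    m2 = np.einsum("whd,whde,whe->wh", delta, cov_inv, delta)
    return np.sqrt(np.maximum(m2, 0.0))                      # guard against tiny negative values

# Usage sketch: invert the covariances once after training, then score test images.
# cov_inv = np.linalg.inv(cov)                   # (W, H, D, D), cov from fit_gaussians
# amap = anomaly_map(emb, mean, cov_inv)
# image_score = amap.max()                       # image-level anomaly score
```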

4 Experiments

4.1 Datasets and Metrics

Metrics. To assess the localization performance we compute two threshold-independent metrics. We use the Area Under the Receiver Operating Characteristic curve (AUROC) where the true positive rate is the percentage of pixels correctly classified as anomalous. Since the AUROC is biased in favor of large anomalies we also employ the per-region-overlap score (PRO-score) [2]. It consists in plotting, for each connected component, a curve of the mean values of the correctly classified pixel rates as a function of the false positive rate between 0 and 0.3. The PRO-score is the normalized integral of this curve. A high PRO-score means that both large and small anomalies are well-localized.

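Since both metrics are threshold-independent, they can be computed from the raw anomaly maps. The sketch below is a simplified, unofficial illustration rather than the evaluation code used in the paper: pixel-level AUROC via scikit-learn, and a coarse PRO-score that averages per-connected-component overlaps over a grid of thresholds and integrates the curve up to a false positive rate of 0.3.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from skimage.measure import label

def pixel_auroc(gt_masks, amaps):
    """gt_masks, amaps: (num_images, H, W) arrays; gt_masks holds 0/1 pixel labels."""
    return roc_auc_score(gt_masks.ravel(), amaps.ravel())

def pro_score(gt_masks, amaps, num_thresholds=200, max_fpr=0.3):
    """Simplified PRO-score: mean per-region overlap integrated up to an FPR of 0.3."""
    thresholds = np.linspace(amaps.min(), amaps.max(), num_thresholds)
    normal_pixels = max((gt_masks == 0).sum(), 1)
    fprs, pros = [], []
    for t in thresholds:
        pred = amaps >= t
        fpr = np.logical_and(pred, gt_masks == 0).sum() / normal_pixels
        overlaps = []
        for gt, p in zip(gt_masks, pred):
            labeled = label(gt)                      # connected components of the ground truth
            for region in range(1, labeled.max() + 1):
                comp = labeled == region
                overlaps.append(np.logical_and(p, comp).sum() / comp.sum())
        if overlaps and fpr <= max_fpr:
            fprs.append(fpr)
            pros.append(np.mean(overlaps))
    order = np.argsort(fprs)
    return np.trapz(np.array(pros)[order], np.array(fprs)[order]) / max_fpr
```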

Datasets. We first evaluate our models on the MVTec AD [1], designed to test anomaly localization algorithms for industrial quality control and in a one-class learning setting. It contains 15 classes of approximately 240 images. The original image resolution is between 700x700 and 1024x1024. There are 10 object and 5 texture classes. Objects are always well-centered and aligned in the same way across the dataset as we can see in Figure 1 for classes Transistor and Capsule. In addition to the original dataset, to assess performance of anomaly localization models in a more realistic context, we create a modified version of the MVTec AD, referred to as Rd-MVTec AD, where we apply random rotation (-10, +10) and random crop (from 256x256 to 224x224) to both the train and test sets. This modified version of the MVTec AD may better describe real use cases of anomaly localization for quality control where objects of interest are not always centered and aligned in the image.


For further evaluation, we also test PaDiM on the Shanghai Tech Campus (STC) Dataset [8] that simulates video surveillance from a static camera. It contains 274 515 training and 42 883 testing frames divided in 13 scenes. The original image resolution is 856x480. The training videos are composed of normal sequences and test videos have anomalies like the presence of vehicles in pedestrian areas or people fighting.


4.2 Experimental Setup

We train PaDiM with different backbones, a ResNet18 (R18) [27], a Wide ResNet-50-2 (WR50) [28] and an EfficientNet-B5 [29], all pretrained on ImageNet [32]. Like in [5], patch embedding vectors are extracted from the first three layers when the backbone is a ResNet, in order to combine information from different semantic levels, while keeping a high enough resolution for the localization task. Following this idea, we extract patch embedding vectors from layers 7 (level 2), 20 (level 4), and 26 (level 5), if an EfficientNet-B5 is used. We also apply a random dimensionality reduction (Rd) (see Sections 3.1 and 5.1). Our model names indicate the backbone and the dimensionality reduction method used, if any. For example, PaDiM-R18-Rd100 is a PaDiM model with a ResNet18 backbone using 100 randomly selected dimensions for the patch embedding vectors. By default we use $\epsilon = 0.01$ for the regularization term in Equation 1.


We reproduce the model SPADE [5] as described in the original publication with a Wide ResNet-50-2 (WR50) [28] as backbone. For SPADE and PaDiM we apply the same preprocessing as in [5]. We resize the images from the MVTec AD to 256x256 and center crop them to 224x224. For the images from the STC we use a 256x256 resize only. We resize the images and the localization maps using bicubic interpolation and we use a Gaussian filter on the anomaly maps with parameter $\sigma = 4$ like in [5].

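For reference, the preprocessing and anomaly-map smoothing described above could look like the following torchvision/scipy sketch; the ImageNet normalization constants are an assumption on our side rather than something stated in the text.

```python
from torchvision import transforms
from scipy.ndimage import gaussian_filter

# MVTec AD: resize to 256x256 then center crop to 224x224 (STC: resize only).
mvtec_transform = transforms.Compose([
    transforms.Resize((256, 256), interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])

def postprocess(anomaly_map):
    """Smooth the upsampled anomaly map with a Gaussian filter of sigma = 4."""
    return gaussian_filter(anomaly_map, sigma=4)
```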

We also implement our own VAE as a reconstruction-based baseline, using a ResNet18 encoder and an 8x8 convolutional latent variable. It is trained on each MVTec AD class with 10 000 images using the following data augmentation operations: random rotation (-2°, +2°), 292x292 resize, random crop to 282x282, and finally center crop to 256x256. The training is performed during 100 epochs with the Adam optimizer [12] with an initial learning rate of $10^{-4}$ and a batch size of 32 images. The anomaly map for the localization corresponds to the pixel-wise L2 error for reconstruction.


5 Results

5.1 Ablative Studies

First, we evaluate the impact of modeling correlations between semantic levels in PaDiM and explore the possibility to simplify our method through dimensionality reduction.


Inter-layer correlation. The combination of Gaussian modeling and the Mahalanobis distance has already been employed in previous works to detect adversarial attacks [26] and for anomaly detection [23] at the image level. However those methods do not model correlations between different CNN semantic levels as we do in PaDiM. In Table I we show the anomaly localization performance on the MVTec AD of PaDiM with a ResNet18 backbone when using only one of the first three layers (Layer 1, Layer 2, or Layer 3) and when summing the outputs of these 3 models to form an ensemble method that takes into account the first three layers but not the correlations between them (Layers 1+2+3). The last row of Table I (PaDiM-R18) is our proposed version of PaDiM where each patch location is described by one Gaussian distribution taking into account the first three ResNet18 layers and the correlations between them. It can be observed that using Layer 3 produces the best results in terms of AUROC among the three layers. It is due to the fact that Layer 3 carries higher semantic level information which helps to better describe normality. However, Layer 3 has a slightly worse PRO-score than Layer 2, which can be explained by the lower resolution of Layer 3, which affects the accuracy of anomaly localization. As we see in the two last rows of Table I, aggregating information from different layers can solve the trade-off issue between high semantic information and high resolution. Unlike the model Layers 1+2+3 that simply sums the outputs, our model PaDiM-R18 takes into account correlations between semantic levels. As a result, it outperforms Layers 1+2+3 by 1.1 p.p. (percentage points) for the AUROC and 1.8 p.p. for the PRO-score. It confirms the relevance of modeling the correlation between semantic levels.


TABLE I: STUDY OF THE ANOMALY LOCALIZATION PERFORMANCE USING DIFFERENT SEMANTIC-LEVEL CNN LAYERS. RESULTS ARE DISPLAYED AS TUPLES (AUROC%, PRO-SCORE%) ON THE MVTEC AD


Dimensionality reduction. PaDiM-R18 estimates multivariate Gaussian distributions from sets of patch embedding vectors of 448 dimensions each. Decreasing the embedding vector size would reduce the computational and memory complexity of our model. We study two different dimensionality reduction methods. The first one consists in applying a Principal Component Analysis (PCA) algorithm to reduce the vector size to 100 or 200 dimensions. The second method is a random feature selection where we randomly select features before the training. In this case, we train 10 different models and take the average scores. Still the randomness does not change the results between different seeds as the standard error of the mean (SEM) for the average AUROC is always between $10^{-4}$ and $10^{-7}$.

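The two reduction strategies compared here can be sketched side by side on toy data as follows (the array sizes, seed, and target dimension are arbitrary placeholders, not values from the paper's code):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 448))      # toy stand-in for training patch embeddings

# PCA baseline: keep the 100 directions of highest variance.
pca_reduced = PCA(n_components=100).fit_transform(embeddings)

# Random selection (Rd): keep 100 of the original 448 channels, chosen once before training.
kept = rng.choice(448, size=100, replace=False)
rd_reduced = embeddings[:, kept]
```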

From Table II we can notice that for the same number of dimensions, the random dimensionality reduction (Rd) outperforms the PCA on all the MVTec AD classes by at least 1.3 p.p. in the AUROC and 1.2 p.p. in the PRO-score. It can be explained by the fact that PCA selects the dimensions with the highest variance which may not be the ones that help to discriminate the normal class from the anomalous one [23]. It can also be noted from Table II that randomly reducing the embedding vector size to only 100 dimensions has a very little impact on the anomaly localization performance. The results drop only by 0.4 p.p. in the AUROC and 0.3 p.p. in the PRO-score. This simple yet effective dimensionality reduction method significantly reduces PaDiM's time and space complexity as will be shown in Section 5.4.


TABLE II: STUDY OF THE ANOMALY LOCALIZATION PERFORMANCE WITH A DIMENSIONALITY REDUCTION FROM 448 TO 100 AND 200 USING PCA OR RANDOM FEATURE SELECTION (RD). RESULTS ARE DISPLAYED AS TUPLES (AUROC%, PRO-SCORE%) ON THE MVTEC AD.


TABLE III: COMPARISON OF OUR PADIM MODELS WITH THE STATE-OF-THE-ART FOR THE ANOMALY LOCALIZATION ON THE MVTEC AD. RESULTS ARE DISPLAYED AS TUPLES (AUROC%, PRO-SCORE%)


5.2 Comparison with the State-of-the-Art

Localization. In Table III, we show the AUROC and the PRO-score results for anomaly localization on the MVTec AD. For a fair comparison, we used a Wide ResNet-50-2 (WR50) as this backbone is used in SPADE [5]. Since the other baselines have smaller backbones, we also try a ResNet18 (R18). We randomly reduce the embedding size to 550 and 100 for PaDiM with WR50 and R18 respectively.


We first notice that PaDiM-WR50-Rd550 outperforms all the other methods in both the PRO-score and the AUROC on average for all the classes. PaDiM-R18-Rd100, which is a very light model, also outperforms all models in the average AUROC on the MVTec AD classes by at least 0.2 p.p. When we further analyze the PaDiM performances, we see that the gap for the object classes is small as PaDiM-WR50-Rd550 is the best only in the AUROC (+0.2 p.p.) but SPADE [5] is the best in the PRO-score (+1.8 p.p.). However, our models are particularly accurate on texture classes. PaDiM-WR50-Rd550 outperforms the second best model SPADE [5] by 4.8 p.p. and 4.0 p.p. in the PRO-score and the AUROC respectively on average on the texture classes. Indeed, PaDiM learns an explicit probabilistic model of the normal classes contrary to SPADE [5] or Patch-SVDD [4]. It is particularly efficient on texture images because even if they are not aligned and centered like object images, PaDiM effectively captures their statistical similarity across the normal training dataset.


Additionally, we evaluate our model on the STC dataset. We compare our method to the two best reported models performing anomaly localization without temporal information, CAVGA-RU [3] and SPADE [5]. As shown in Table IV, the best result (AUROC) on the STC dataset is achieved with our simplest model PaDiM-R18-Rd100 by a 2.1 p.p. margin. In fact, pedestrian positions in images are highly variable in this dataset and, as shown in Section 5.3, our method performs well on non-aligned datasets.


TABLE IV: COMPARISON OF OUR PADIM MODEL WITH THE STATE-OF-THE-ART FOR THE ANOMALY LOCALIZATION ON THE STC IN THE AUROC%.


Detection. By taking the maximum score of the anomaly maps issued by our models (see Section 3.3) we give anomaly scores to entire images to perform anomaly detection at the image level. We test PaDiM for anomaly detection with a Wide ResNet-50-2 (WR50) [28] used in SPADE and an EfficientNet-B5 [29]. Table V shows that our model PaDiM-WR50-Rd550 outperforms every method except MahalanobisAD [23] with their best reported backbone, an EfficientNet-B4. Still, our PaDiM-EfficientNet-B5 outperforms every model by at least 2.6 p.p. on average on all the classes in the AUROC. Besides, contrary to the second best method for anomaly detection, MahalanobisAD [23], our model also performs anomaly segmentation which characterizes more precisely the anomalous areas in the images.


TABLE V: ANOMALY DETECTION RESULTS (AT THE IMAGE LEVEL) ON THE MVTEC AD USING AUROC%.


5.3 Anomaly Localization on a Non-Aligned Dataset

To estimate the robustness of anomaly localization methods, we train and evaluate the performance of PaDiM and several state-of-the-art methods (SPADE [5], VAE) on a modified version of the MVTec AD, Rd-MVTec AD, described in Section 4.1. Results of this experiment are displayed in Table VI. For each test configuration we run the data preprocessing on the MVTec AD 5 times with different random seeds to obtain 5 different versions of the dataset, denoted as Rd-MVTec AD. Then, we average the obtained results and report them in Table VI. According to the presented results, PaDiM-WR50-Rd550 outperforms the other models on both texture and object classes in the PRO-score and the AUROC. Besides, the SPADE [5] and VAE performances on the Rd-MVTec AD decrease more than the performance of PaDiM-WR50-Rd550 when compared to the results obtained on the normal MVTec AD (refer to Table III). The AUROC results decrease by 5.3 p.p. for PaDiM-WR50-Rd550 against 12.2 p.p. and 8.8 p.p. declines for VAE and SPADE respectively. Thus, we can conclude that our method seems to be more robust to non-aligned images than the other existing and tested works.


TABLE VI: ANOMALY LOCALIZATION RESULTS ON THE NON-ALIGNED RD-MVTEC AD. RESULTS ARE DISPLAYED AS TUPLES (AUROC%, PRO-SCORE%)


5.4 Scalability Gain

Time complexity. In PaDiM, the training time complexity scales linearly with the dataset size because the Gaussian parameters are estimated using the entire training dataset. However, contrary to the methods that require training deep neural networks, PaDiM uses a pretrained CNN, and, thus, no deep learning training is required, which is often a complex procedure. Hence, it is very fast and easy to train on small datasets like the MVTec AD. For our most complex model PaDiM-WR50-Rd550, the training on a CPU (Intel CPU 6154 3GHz 72th) with a serial implementation takes on average 150 seconds on the MVTec AD classes and 1500 seconds on average on the STC video scenes. These training procedures could be further accelerated using GPU hardware for the forward pass and the covariance estimation. In contrast, training the VAE with 10 000 images per class on the MVTec AD following the procedure described in Section 4.2 takes 2h40 per class using one GPU NVIDIA P5000. Conversely, SPADE [5] requires no training as there are no parameters to learn. Still, before testing, it computes and stores in memory all the embedding vectors of the normal training images. Those vectors are the inputs of a K-NN algorithm which makes SPADE's inference very slow as shown in Table VII.


In Table VII, we measure the model inference time using a mainstream CPU (Intel i7-4710HQ CPU @ 2.50GHz) with a serial implementation. On the MVTec AD, the inference time of SPADE is around seven times slower than our PaDiM model with an equivalent backbone because of the computationally expensive NN search. Our VAE implementation, which is similar to most reconstruction-based models, is the fastest model, but our simple model PaDiM-R18-Rd100 has the same order of magnitude for the inference time. While having similar complexity, PaDiM largely outperforms the VAE methods (see Section 5.2).


TABLE VII: AVERAGE INFERENCE TIME OF ANOMALY LOCALIZATION IN SECONDS ON THE MVTEC AD WITH A CPU INTEL I7-4710HQ @ 2.50GHZ.


Memory complexity. Unlike SPADE [5] and Patch SVDD [4], the space complexity of our model is independent of the dataset training size and depends only on the image resolution. PaDiM keeps in memory only the pretrained CNN and the Gaussian parameters associated with each patch. In Table VIII we show the memory requirement of SPADE, our VAE implementation, and PaDiM, assuming that parameters are encoded in float32. Using an equivalent backbone, SPADE has a lower memory consumption than PaDiM on the MVTec AD. However, when using SPADE on a larger dataset like the STC, its memory consumption becomes intractable, whereas PaDiM-WR50-Rd550 requires seven times less memory. The PaDiM space complexity increases from the MVTec AD to the STC only because the input image resolution is higher in the latter dataset as described in Section 4.2. Finally, one of the advantages of our framework PaDiM is that the user can easily adapt the method by choosing the backbone and the embedding size to fit inference time requirements, resource limits, or expected performance.


TABLE VIII: MEMORY REQUIREMENT IN GB OF THE ANOMALY LOCALIZATION METHODS TRAINED ON THE MVTEC AD AND THE STC DATASET.


6 Conclusion

We have presented a framework called PaDiM for anomaly detection and localization in a one-class learning setting which is based on distribution modeling. It achieves state-of-the-art performance on the MVTec AD and STC datasets. Moreover, we extend the evaluation protocol to non-aligned data and the first results show that PaDiM can be robust on these more realistic data. PaDiM's low memory and time consumption and its ease of use make it suitable for various applications, such as visual industrial control.

