Anomaly Detection Reading Notes: "Inpainting Transformer for Anomaly Detection", CVPR 2021

研途可达

Article link: https://blog.csdn.net/qq_45496282/article/details/120128047

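Source: CVPR 2021, link to the original paper

The paper addresses anomaly detection in images. At its core it still reconstructs the image and compares the result with the original to find anomalies, so it belongs to the reconstruction-based family of methods.

Because the method is built on self-attention, some background on self-attention is needed first, in particular the roles of q (queries) and k (keys).
Here is a link for a first introduction.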

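Overview first:
Building on the earlier work "Reconstruction by inpainting for visual anomaly detection", the authors choose inpainting as the main reconstruction method, which differs considerably from current neural network models (GAN, VAE, etc.). On top of that, they bring the Transformer, currently popular in NLP (natural language processing), into this paper as the main model framework.

The inpainting (reconstruction) workflow: an input image is first cut into n*m small patches. The reconstruction of a single patch draws on a larger global context rather than only on its immediate neighbors, and in this way the whole image is inpainted. The reconstructed image is then compared with the original image, which yields both the anomaly and its location.

Now let's work through the paper and look at how each step works in detail: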

Abstract

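First, the usual definition of an anomaly: anomaly detection in computer vision is the task of identifying images that deviate from a set of normal images. Existing methods close to the authors' own direction: a common approach is to train deep convolutional networks (e.g. GANs, encoder-decoders) to inpaint (best read as "reconstruct") covered regions of the original image and to compare the output with the original image. Training uses anomaly-free samples only, on the assumption that the model then cannot correctly reconstruct anomalous regions. When it meets an anomalous sample it still reconstructs according to the structure of normal samples, so the anomaly shows up when the reconstruction is compared with the original.
The authors' new idea: for detecting anomalies by inpainting, they hypothesize that incorporating information from potentially distant regions helps to reconstruct the features of normal samples better. Specifically, they pose anomaly detection as a patch-inpainting problem and propose to solve it with a purely self-attention based approach instead of the popular convolutional networks.
Main workflow: The proposed Inpainting Transformer (InTra) is trained to inpaint covered patches in a large sequence of image patches, thereby integrating information across large regions of the input image.
In other words, to fill in a covered patch InTra can draw on most of the image content surrounding it, not just the adjacent pixels.
Experimental results: When learning from scratch (i.e. trained without any pretraining), InTra achieves better than state-of-the-art results on the MVTec AD [1] dataset for detection and localization.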

1. Introduction

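Anomaly detection and localization in vision asks whether a set of data contains images that are unexpected or differ from most of the data (anomaly detection), and where in the image the deviation occurs (anomaly localization). The two problems are often treated separately, yet both matter greatly for industrial inspection and medical applications. In real industrial settings anomalies are rare. Because anomalous samples are scarce and the shape and texture of anomalies can be unpredictable, supervised methods are hard to apply. Current methods are therefore unsupervised and try to model only the distribution of normal data. At test time each image receives an anomaly score indicating how far it deviates from normal samples.
For anomaly localization, similar scores are assigned to sub-regions of the image or to individual pixels, so that the anomaly can be located and displayed.
A common reconstruction-based approach uses deep convolutional autoencoders or generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs). The difference between the input image and the reconstructed image is then used to compute the anomaly score. The underlying idea is that a model trained only on normal images cannot correctly reconstruct anomalous images, which leads to higher anomaly scores. In practice, however, convolutional autoencoders often generalize well enough to reconstruct anomalies too, so detection fails.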

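In the authors' line of research, recent work mitigates this effect by treating the generative part as an inpainting problem: Parts of the input image are covered and the model is trained to reconstruct the covered parts in a self-supervised way [9, 10, 11, 12]. By conditioning on the neighborhood of the excluded part only, small anomalies get effectively retouched. Because the model only sees the neighborhood of the covered part, a small anomaly inside that part is painted over with normal content.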

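Given this new angle, why not keep using convolutional networks? The authors explain: *Due to their limited receptive field, fully convolutional neural networks (CNNs) are partially ineffective in modeling distant contextual information, which makes the removal of larger anomalous regions difficult.* That is, the limited receptive field of fully convolutional networks makes them partly ineffective at modeling long-range context, so larger anomalous regions are hard to remove.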

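This is where attention becomes useful for inpainting. For inpainting in general settings, this can be effectively addressed by introducing contextual attention in the model.
So in the general inpainting setting, adding contextual attention to the model resolves the long-range context problem.

The central idea of the paper: For inpainting in the context of anomaly detection we suggest it to be beneficial to learn the relevant patterns alone by combining information from large regions around the covered image part via attention.
For inpainting in anomaly detection, the authors suggest letting the model learn the relevant patterns by itself, using attention to combine information from large regions around the covered image part.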

The proposed method:
Inspired by the recent success of self-attention based models such as Transformers in image recognition, we pose anomaly detection as a patch-inpainting problem and propose to solve it without convolutions: images are split into square patches, and a Transformer model is trained to reconstruct covered patches on the basis of a long sequence of neighboring patches.

The authors propose a self-attention based model with no convolutional network at all; here that model is a Transformer. The main workflow: split the image into square patches and train the Transformer to reconstruct a covered patch from a long sequence of neighboring patches.
**The effect the authors expect:** By recovering the whole image in this way, a full reconstructed image is obtained where the reconstruction of an individual patch incorporates a larger global context and not only the appearance of its immediate neighborhood. Thus patches are not reconstructed by simply mimicking the local neighborhood, leading to high anomaly scores even for spacious anomalous regions.
Recovering the whole image this way gives a complete reconstruction in which each patch is rebuilt from a larger global context rather than only from its immediate neighbors. Patches are therefore not reconstructed by merely copying the local neighborhood, so even large anomalous regions receive high anomaly scores.

**Novelty and contributions of the paper:** Our contributions enfold the modeling of anomaly detection as a patch-sequence inpainting problem which we solve using a deep Transformer network consisting of a simple stack of multiheaded self-attention blocks.
The contribution is to model anomaly detection as a patch-sequence inpainting problem and to solve it with a deep Transformer network made of a simple stack of multiheaded self-attention blocks.

Since no convolutional network is used for feature extraction, the following is added:
Within this network convolutional operations are removed entirely. Furthermore we propose to
a.) employ long residual connections between the Transformer blocks
Long residual connections between the Transformer blocks.
b.) perform a dimension reduction for keys and queries with a small multilayer perceptron when computing self-attention
When computing self-attention, a small multilayer perceptron reduces the dimension of the keys and queries.

[Figure: model architecture]

2. Related Work

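This section introduces recent methods for anomaly detection in images and the current state of research.

A short summary first: CNNs have shown to be highly successful for vision based tasks and also anomaly detection and segmentation [1, 19, 20, 21, 10, 9] (these numbers refer to papers cited in the original paper), sequence-to-sequence Transformer models, originating from Natural Language Processing (NLP) have found application in computer vision tasks [16, 15].
That is, convolutional neural networks (CNNs) have been applied successfully to anomaly detection with good results, and Transformer models from natural language processing (NLP) have also found their way into vision tasks.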

2.1. Anomaly Detection and Segmentation

Anomaly detection is concerned about deciding if an image contains an unexpected deviation from a predefined norm, while in segmentation the goal is to find and localize these deviations accurately on a pixel level to extract the regions where a defect occurred.
Detection decides whether an image contains an unexpected deviation at all; segmentation finds and localizes these deviations at pixel level to extract the regions where a defect occurred.
For explainability a good overall performance in both tasks is needed which is not guaranteed if a method achieves good results in one of the tasks. Existing methods can be roughly categorized into two different approaches.

For explainability, a method has to perform well on both tasks, and doing well on one of them does not guarantee doing well on the other. Existing methods fall roughly into two groups.

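Reconstruction based.

Reconstruction-based models try to model only normal, defect-free samples. This has produced notable results in many fields.
How this detects anomalies: because the models only capture normal data, they should fail to reconstruct a defective test image correctly.

An anomaly map for segmentation is usually generated via pixel-wise difference between the input image and its model reconstruction, leading to noticeable anomalies.
For segmentation: the anomaly map is usually the pixel-wise difference between the input image and its reconstruction, in which the anomalous regions stand out.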
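The pixel-wise difference described above can be sketched in a few lines of NumPy. This is only an illustration: the function names `anomaly_map` and `image_score` are made up for this note, the squared error averaged over channels is one possible choice of difference, and taking the maximum as the image-level score is an assumption, not something stated in the text above.

```python
import numpy as np

def anomaly_map(image, reconstruction):
    """Per-pixel squared error between input and reconstruction, averaged over channels."""
    image = np.asarray(image, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    return ((image - reconstruction) ** 2).mean(axis=-1)

def image_score(amap):
    """Image-level score: the largest pixel-level deviation (an assumed choice)."""
    return float(amap.max())

# Toy example: the "model" outputs a clean image, the input carries a 2x2 defect.
clean = np.zeros((8, 8, 3))
defective = clean.copy()
defective[2:4, 5:7, :] = 1.0

amap = anomaly_map(defective, clean)
print(amap.shape, image_score(amap))
```

Only the four defective pixels are non-zero in `amap`, which is exactly the localization signal a reconstruction-based method relies on.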

A point to note about training:

Modifications like integrating structural similarity index measure (SSIM) in the loss function during training are used to improve reconstruction quality by producing smoother images while focusing on retaining structural information such as edges [3, 25].
Adding SSIM to the training loss improves reconstruction quality: the images become smoother while structural information such as edges is preserved.
Current reconstruction-based methods include VAEs and generative adversarial networks (GANs).

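Variational autoencoders (VAEs) have also been used for anomaly detection and segmentation, but open problems remain.
Their probabilistic latent space tries to capture a distribution capable of generating normal samples. This generative approach allows for the inclusion of the Kullback-Leibler divergence in the anomaly score to incorporate a probabilistic scoring [4, 26, 27]. In general VAEs are not automatically superior to traditional autoencoder methods [1, 24].

As the quote states, VAEs do not automatically outperform traditional autoencoder models.

Adversarial models such as generative adversarial networks (GANs) have also been used for anomaly detection and segmentation. Although GAN training is often unstable, GANs can generate highly realistic, almost natural images. Researchers have suggested also taking the discriminator network into account when computing the anomaly score. To this end, a feature loss between the input image and its reconstruction, based on the last convolutional layer of the discriminator, is used to train the network.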

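Embedding based.

Embedding-based methods detect anomalies well, but because there is no inherent mapping back to the original image space, they do not automatically provide precise localization of the anomaly.
Later work has improved on this, but the results are still not ideal.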

2.2. Inpainting in Anomaly Detection

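Given a partially covered image, the goal is to accurately reconstruct the original, uncovered data. Early methods successfully used matching based on local image descriptors.
But the following problems remain:
Complex scenes and objects are harder to get consistent and realistic results for as the model has to understand the context and content of the image.
Consistent and realistic results are harder to obtain for complex scenes and objects, because the model has to understand the context and content of the image. (Since parts of the original image are covered, the reconstruction method needs a semantic understanding of the image to generate a coherent and realistic result.)

Anomalies which span over a large area may still cause problems as these will not be covered up sufficiently enough. As such we propose to add global context via replacing CNNs with a Transformer-based framework applied in vision.

In view of these problems, the authors propose replacing CNNs with a Transformer-based framework in order to add global context information.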

2.3. Transformers in Vision

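Transformer models were first introduced in NLP and later developed for various sequence tasks such as text translation, text generation and document classification. The key component is the self-attention module.
[Figure: ViT]
The figure above says that the image is first split into small patches with position information embedded, the image is then treated as a sequence, and finally the Transformer blocks process the whole image patch by patch. It also notes that this approach has reached state-of-the-art results.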

Attention is used to relate elements of a sequence to each other. Based on the relative weighted importance a shared representation is calculated taking into account the relative dependencies between sequence elements.
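Attention relates the elements of a sequence to each other. A shared representation is computed from their relative weighted importance, taking into account the dependencies between sequence elements. (In other words, attention links the elements of the sequence to one another, which is what produces contextual information.)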
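As a reminder of what q, k and v do, here is a minimal NumPy sketch of standard scaled dot-product self-attention. It is the textbook formulation, not code from the paper; the weight matrices are random and the names `w_q`, `w_k`, `w_v` are only illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(y, w_q, w_k, w_v):
    """Standard scaled dot-product self-attention over a sequence y of shape (S, D)."""
    q, k, v = y @ w_q, y @ w_k, y @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (S, S): how strongly element i attends to j
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights               # every output mixes all values

rng = np.random.default_rng(0)
y = rng.normal(size=(5, 8))                   # 5 sequence elements, 8 features each
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(y, w_q, w_k, w_v)
print(out.shape, weights.sum(axis=1))
```

Each output row is a weighted mixture of all value vectors, which is how a single patch can pick up information from any other patch in the sequence.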

3. Inpainting Transformer for Anomaly Detection

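With the groundwork laid, the main subject of the paper comes in.
[Figure: approach]
The "neighboring patches" mentioned here are the adjacent patches that provide context for the covered patch, not merely unrelated nearby patches.
[Figure: model architecture, revisited]
With the explanations from the previous sections in mind, this figure is much clearer on a second look.

The concrete operations and formulas for each step are covered in the subsections below.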

3.1. Embedding Patches and Positions

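[Figure: position embedding]
The main idea is to split an image into N*M small patches. For each patch (the one that is covered and needs to be inpainted), a square window of side length L containing many patches is chosen around it and used to reconstruct that patch; repeating this for every patch reconstructs the whole image.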
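A rough NumPy sketch of this patch grid and window selection follows. The helper names `split_into_patches` and `window_origin` are hypothetical, and shifting the window back inside the grid when the covered patch lies near the border is an assumption made for this illustration.

```python
import numpy as np

def split_into_patches(image, k):
    """Split an (H, W, C) image into an (N, M, k, k, C) grid of k x k patches."""
    h, w, c = image.shape
    assert h % k == 0 and w % k == 0, "image size must be a multiple of the patch size"
    n, m = h // k, w // k
    return image.reshape(n, k, m, k, c).transpose(0, 2, 1, 3, 4)

def window_origin(i, j, n, m, side):
    """Top-left grid index of a side x side window around patch (i, j).

    The window is centered on (i, j) where possible and clamped so that it
    stays inside the N x M grid (assumed border handling).
    """
    r = min(max(i - side // 2, 0), n - side)
    s = min(max(j - side // 2, 0), m - side)
    return r, s

image = np.arange(16 * 16 * 3, dtype=np.float64).reshape(16, 16, 3)
patches = split_into_patches(image, k=4)          # 4 x 4 grid of 4 x 4 patches
r, s = window_origin(0, 3, n=4, m=4, side=3)      # patch in the top-right corner
window = patches[r:r + 3, s:s + 3]                # the L x L patches used to inpaint it
print(patches.shape, (r, s), window.shape)
```

Looping over all (i, j) and inpainting one patch per window is what yields the full reconstructed image.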
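[Figure: the two variants, local and global]
There are two ways to encode position, local and global. The distinction exists because some images are textures (for which position information matters differently).
[Equation image: position formula]
The window of patches and its position information are mapped into a latent space of dimension D, the position embedding is then added, and the result is the input sequence of the model.
[Equation image: final sequence]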
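Since the formulas above survive only as image placeholders, the following is a loose sketch of the idea rather than a transcription of the paper's equations. It assumes a linear projection `E` of the flattened patches, learned position embeddings (`pos_local` indexed by position inside the window, `pos_global` indexed by position in the whole grid), and a learned vector `x_inpaint` that stands in for the covered patch. All names, sizes and the choice to keep the covered patch at its original place in the sequence are assumptions; the weights are random instead of learned.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C, L, D = 4, 3, 3, 16     # patch side, channels, window side (in patches), latent dim
N, M = 4, 4                  # size of the full patch grid

E = rng.normal(scale=0.02, size=(K * K * C, D))    # patch projection (learned in practice)
pos_local = rng.normal(scale=0.02, size=(L * L, D))
pos_global = rng.normal(scale=0.02, size=(N * M, D))
x_inpaint = rng.normal(scale=0.02, size=(D,))      # embedding used for the covered patch

def embed_window(window, covered, origin=(0, 0), use_global=False):
    """Map an (L, L, K, K, C) window to an (L*L, D) input sequence.

    `covered` is the (t, u) index of the covered patch inside the window,
    `origin` the window's top-left index in the full grid.
    """
    y = window.reshape(L * L, K * K * C) @ E
    t, u = covered
    y[t * L + u] = x_inpaint          # the covered patch's pixels never reach the model
    if use_global:
        r, s = origin
        idx = [(r + a) * M + (s + b) for a in range(L) for b in range(L)]
        return y + pos_global[idx]
    return y + pos_local

window = rng.normal(size=(L, L, K, K, C))
seq = embed_window(window, covered=(1, 1))
print(seq.shape)
```

With the local variant only the position inside the window matters, which suits textures; the global variant tells the model where in the image the window sits.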

3.2. Multihead Feature Self-Attention

Self-attention is the main building block of the Transformer and will be successively applied to the projected patch sequence y in (5).

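When the patches of the training images are so similar that they are hard to tell apart, the dot products of q and k come out almost the same for every pair, so the weighting matrix is nearly uniform. To mitigate this, the authors propose a non-linear dimension reduction when computing q and k:
[Equation image: q and k]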
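The exact formula is only available as an image here, so the following single-head sketch just illustrates the idea of producing q and k with a small MLP that reduces the dimension, while v stays a linear projection. The two-layer ReLU MLP, the layer sizes and all names (`mlp`, `init_mlp`, `feature_self_attention`) are assumptions for this note, not the paper's definition.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with ReLU (the activation is an assumed choice)."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def init_mlp(rng, d_in, d_hidden, d_out):
    return (rng.normal(size=(d_in, d_hidden)) / np.sqrt(d_in), np.zeros(d_hidden),
            rng.normal(size=(d_hidden, d_out)) / np.sqrt(d_hidden), np.zeros(d_out))

def feature_self_attention(y, params):
    """Self-attention whose queries and keys come from a dimension-reducing MLP."""
    q = mlp(y, *params["q"])                  # (S, d_red), non-linear reduction
    k = mlp(y, *params["k"])                  # (S, d_red)
    v = y @ params["w_v"]                     # (S, D), plain linear values
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

rng = np.random.default_rng(0)
D, d_red = 16, 4
y = rng.normal(size=(9, D))                   # e.g. a 3 x 3 window of embedded patches
params = {
    "q": init_mlp(rng, D, 8, d_red),
    "k": init_mlp(rng, D, 8, d_red),
    "w_v": rng.normal(size=(D, D)) / np.sqrt(D),
}
out = feature_self_attention(y, params)
print(out.shape)
```

The non-linearity gives the model a chance to pull apart patches that look nearly identical before their similarities are compared.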

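Multihead self-attention (MSA): one attention head yields one representation subspace; several heads yield several different representation subspaces.

In short, this part modifies how q and k are computed and adds MSA so that the model is better suited to the inpainting reconstruction problem.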

3.3. Network Architecture

Our network architecture for inpainting is composed of a simple stack of n Transformer blocks.
[Figure: model architecture]
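To make "a simple stack of n Transformer blocks" with long residual connections concrete, here is a toy sketch. The wiring is an assumption in the style of U-Net skips (the output of an early block is added to the input of the mirrored late block); the figure, not this code, is the reference for the actual architecture. The blocks are stand-in functions so that the data flow is easy to follow.

```python
import numpy as np

def run_stack(y, blocks):
    """Apply blocks in order with long skips between mirrored blocks.

    Assumed wiring: the outputs of the first n // 2 blocks are saved, and each
    of the last n // 2 blocks receives its input plus the saved output of its
    mirror block. With an odd n the middle block has no skip.
    """
    n = len(blocks)
    saved = []
    for idx, block in enumerate(blocks):
        if idx < n // 2:
            y = block(y)
            saved.append(y)
        elif idx >= n - n // 2:
            y = block(y + saved.pop())    # long residual connection
        else:
            y = block(y)
    return y

def add_one(y):
    """Toy stand-in for a Transformer block."""
    return y + 1.0

out = run_stack(np.zeros(3), [add_one] * 4)
plain = np.zeros(3)
for block in [add_one] * 4:
    plain = block(plain)
print(out, plain)
```

With the toy blocks the skips change the result from 4 to 7, which shows that features from early blocks reach the late blocks directly instead of only through the chain.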

3.4. Training

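For training, patch windows with a fixed side length L are sampled at random from normal image data. In each window a random patch position (t, u) is chosen as the covered patch.
For the loss, the original and the reconstructed patch are compared with a pixel-wise L2 loss.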
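The sampling step and the pixel-wise L2 loss can be sketched as follows. The function names are hypothetical, and the patch grid here is random data standing in for a normal training image.

```python
import numpy as np

def sample_training_window(patches, side, rng):
    """Draw a random side x side window and a random covered position (t, u) in it.

    `patches` has shape (N, M, K, K, C). Returns the window, (t, u) and the
    original content of the covered patch, which is the reconstruction target.
    """
    n, m = patches.shape[:2]
    r = rng.integers(0, n - side + 1)
    s = rng.integers(0, m - side + 1)
    t, u = (int(v) for v in rng.integers(0, side, size=2))
    window = patches[r:r + side, s:s + side]
    return window, (t, u), window[t, u]

def l2_loss(target, predicted):
    """Pixel-wise L2 loss, averaged over all pixels and channels."""
    return float(np.mean((np.asarray(target) - np.asarray(predicted)) ** 2))

rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 4, 4, 4, 3))        # stand-in for a normal image's patch grid
window, (t, u), target = sample_training_window(patches, side=3, rng=rng)
print(window.shape, (t, u), l2_loss(target, np.zeros_like(target)))
```

In a real training loop the second argument of `l2_loss` would be the model's reconstruction of the covered patch.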

3.5. Inference and Anomaly Detection

[Figure: experimental results]

4. Experiments

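Under the experimental settings given by the authors, the method is compared against others:
[Figure: comparison of results]
Ablation studies were also carried out,
comparing the individual components: Long residual connections, Feature Self-Attention, Patch Position Embedding, Patch Window Size.
All of them support the claim that the proposed model structure performs anomaly detection well.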

5. Conclusion

Inspired by the success of using attention-only methods in vision tasks, we have successfully used a Transformer model for visual anomaly detection by using an inpainting reconstruction approach while considering embeddings of patch sequences as input.

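To summarize, the authors successfully use a Transformer model for visual anomaly detection, taking an inpainting reconstruction approach with embeddings of patch sequences as input.

With self-attention alone bringing global context into the reconstruction, anomalies can be detected and localized successfully. For this task, applying a non-linear dimension reduction before computing self-attention is shown to improve the inpainting of anomalies. In addition, long residual connections are used in the architecture so that later Transformer blocks can make use of low-level features and their context.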