MyDLNote-Inpainting: 2020 ECCV VCNet: A Robust Approach to Blind Image Inpainting

2020 ECCV VCNet: A Robust Approach to Blind Image Inpainting

Keywords: Blind image inpainting · visual consistency · spatial normalization · generative adversarial networks

[paper]

 

Abstract

Blind inpainting is a task to automatically complete visual contents without specifying masks for missing areas in an image. Previous work assumes known missing-region-pattern, limiting the application scope.

This explains what blind image inpainting means and the problem with the conventional assumption of a known missing-region pattern.

We instead relax the assumption by defining a new blind inpainting setting, making training a neural system robust against various unknown missing region patterns. Specifically, we propose a two-stage visual consistency network (VCN) to estimate where to fill (via masks) and generate what to fill. In this procedure, the unavoidable potential mask prediction errors lead to severe artifacts in the subsequent repairing. To address it, our VCN predicts semantically inconsistent regions first, making mask prediction more tractable. Then it repairs these estimated missing regions using a new spatial normalization, making VCN robust to mask prediction errors. Semantically convincing and visually compelling content can be generated.

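What the paper sets out to do: without being given the missing region, automatically identify the areas of the image that need inpainting, mark them accurately as the missing region, and complete them with high quality.

Specifically, it proposes a two-stage visual consistency network (VCN):

1. Estimate where to fill: predict the semantically inconsistent regions;

2. Generate what to fill: repair the estimated missing regions using a new spatial normalization.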

Extensive experiments show that our method is effective and robust in blind image inpainting. And our VCN allows for a wide spectrum of applications.

 

Introduction

Paragraph 1, the problem:

We note the requirement of having accurate masks makes it difficult to be practical in several scenarios where masks are not available, e.g., graffiti and raindrop removal (Fig. 1). Users need to carefully locate corrupted regions manually, where inaccurate masks may lead to inferior results. We in this paper analyze blind inpainting that automatically finds pixels to complete, and propose a suitable solution based on image context understanding.

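Problem: in real applications the regions to complete are not labeled in advance (see Fig. 1). Users have to label them manually and carefully, which is tedious and not necessarily accurate.

Paragraph 2, how this differs from existing methods: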

Existing work [3, 24] on blind inpainting assumes that the missing areas are filled with constant values or Gaussian noise. Thus the corrupted areas can be identified easily and almost perfectly based on noise patterns. This oversimplified assumption could be problematic when corrupted areas are with unknown content. To improve the applicability, we relax the assumption and propose the versatile blind inpainting task. We solve it by taking deeper semantics of the input image into overall consideration and detecting more semantically meaningful inconsistency based on the context in contrast to previous blind inpainting.

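Existing blind inpainting methods assume that missing regions are filled with constant values or Gaussian noise, so the corrupted areas can be identified easily and almost perfectly from their noise pattern. This oversimplified assumption becomes a problem when the corrupted regions contain unknown content. To improve applicability, the paper relaxes the assumption and proposes a versatile blind inpainting task. Its solution takes the deeper semantics of the input image into overall consideration and detects more semantically meaningful inconsistency based on context. In other words, the regions to be inpainted are no longer simple constant values or Gaussian noise, but diverse regions that are semantically discontinuous with the normal background.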

 

Paragraph 3, difficulties and strategy:

Note that blind inpainting without assuming the damage patterns is highly ill-posed. This is because the unknown degraded regions need to be located based on their difference from the intact ones instead of their known characteristics, and the uncertainties in this prediction make the further inpainting challenging.

We address it in two aspects, i.e., a new data generation approach and a novel network architecture.

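The difficulty in predicting the regions to complete: unknown degraded regions must be located by how they differ from intact regions rather than by known characteristics of their own, and the uncertainty in this prediction makes the subsequent inpainting challenging.

The paper tackles this from two sides: a new sample generation method that produces diverse degraded regions, and a new image inpainting network.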

 

Paragraph 4, sample generation:

For training data collection, if we only take common black or noise pixels in damaged areas as input, the network may detect these patterns as features instead of utilizing the contextual semantics as we need. In this scenario, the damage for training should be diverse and complicated enough so that the contextual inconsistency instead of the pattern in damage can be extracted. Our first contribution, therefore, is the new strategy to generate diverse training data where natural images are adopted as the filling content with random strokes.

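The paper's first contribution is a new way to generate blind inpainting samples. The goal is to make the degraded regions diverse and complex enough that the network, when predicting the missing region, learns to identify semantically discontinuous areas.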

 

Paragraph 5, the inpainting model:

For model design, our framework consists of two stages of mask prediction and robust inpainting.

A discriminative model is used to conduct binary pixelwise classification to predict inconsistent areas.

With the mask estimated, we use it to guide the inpainting process. Though this framework is intuitive, its specific designs to address the biggest issue in this framework are non-trivial: how to neutralize the generation degradation brought by inevitable mask estimation errors in the first stage. To cope with this challenge, we propose a probabilistic context normalization (PCN) to spatially transfers contextual information in different neural layers, enhancing information aggregation of the inpainting network based on the mask prediction probabilities.

We experimentally validate that it outperforms other existing approaches exploiting masks, e.g., concatenating mask with the input image and using convolution variants (like Partial Convolution [22] or Gated Convolution [44]) to employ masks, in evaluation.

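The model has two parts: mask estimation and inpainting.

Mask estimation: a discriminative model performs binary pixel-wise classification to predict the inconsistent regions.

Inpainting: the estimated mask guides the inpainting process. The framework is intuitive, but the design that addresses its biggest issue is non-trivial: how to neutralize the generation degradation caused by mask estimation errors in the first stage. (In my own words: the estimated mask may be inaccurate, which harms image generation, so how can that negative effect be offset?) To cope with this, the paper proposes probabilistic context normalization (PCN), which spatially transfers contextual information across different network layers and strengthens the information aggregation of the inpainting network based on the mask prediction probabilities.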

 

The last two paragraphs of the original Introduction summarize the experimental findings and the main contributions.

 

Robust Blind Inpainting

Notation and task description:

For this task, the input is only a degraded image {I} \in R^{h\times w\times c} (contaminated by unknown visual signals), and the output is expected to be a plausible image \widehat{O} \in R^{h\times w\times c} , approaching ground truth {O} \in R^{h\times w\times c} of I. The degraded image I in the blind inpainting setting is formulated as

I = O \odot (1-M) + N \odot M,                     (1)

where {M} \in R^{h\times w\times 1} is a binary region mask (with value 0 for known pixels and 1 otherwise), and {N} \in R^{h\times w\times c} is a noisy visual signal. \odot is the Hadamard product operator. Given I, we predict \widehat{O} (an estimate of O) with latent variables M and N. Also, Eq. (1) is the means to produce training tuples < I_i , O_i ,M_i , N_i >_{|i=1,...,m}.

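Eq. (1) shows what the dataset must provide for training: the degraded image I, the ground truth image O, the ground truth degraded region M, and the filler signal N inside the degraded region. M specifies where the image is degraded; N specifies what the degradation looks like.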

N indicates what and M indicates where
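As a quick check of Eq. (1), here is a minimal NumPy sketch of the compositing step. The function name degrade and the toy 4x4 arrays are illustrative assumptions, not something taken from the paper.

```python
import numpy as np

def degrade(O, N, M):
    """Compose a degraded image following Eq. (1): I = O*(1-M) + N*M.

    O, N: float arrays of shape (h, w, c).
    M:    binary array of shape (h, w, 1), 1 = corrupted, 0 = known.
    """
    O = np.asarray(O, dtype=np.float64)
    N = np.asarray(N, dtype=np.float64)
    M = np.asarray(M, dtype=np.float64)
    return O * (1.0 - M) + N * M  # M broadcasts over the channel axis

# Toy data (hypothetical): clean image of zeros, filler of ones.
O = np.zeros((4, 4, 3))
N = np.ones((4, 4, 3))
M = np.zeros((4, 4, 1))
M[1:3, 1:3, 0] = 1.0  # a 2x2 corrupted block in the middle
I = degrade(O, N, M)
```

Inside the mask the result takes its values from N, and everywhere else it keeps O, which is exactly the "what" and "where" split described above.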

 

Training Data Generation

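This section describes how the degraded images are generated, specifically how the filler signal N and the region M are produced. It starts with an important guiding principle: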

The key for defining N is to make it indistinguishable as much as possible from I on image pattern, so that the model cannot decide if a local patch is corrupted without seeing the image context. Then a neural system trained with such data has the potential to work on unknown contamination.

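The key to defining N is to make it as indistinguishable as possible from I in terms of image pattern, so that the model cannot tell whether a local patch is corrupted without seeing the image context. A neural system trained on such data then has the potential to work on unknown contamination.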

 

Then the concrete method:

In this paper, we use real-world image patches to form N. This ensures that local patches between N and I are indistinguishable, enforcing the model to draw an inference based on contextual information, which eventually improves the generalization ability for real-world data.

Further, we alleviate any priors introduced by M in training via employing free-form strokes [44]. Existing blind or non-blind inpainting methods often generate the arbitrary size of a rectangle or text-shaped masks. However, this is not suitable for our task, because it may encourage the model to locate the corrupted part based on the rectangle shape. Free-form masks can largely diversify the shape of masks, making the model harder to infer corrupted regions with shape information.

Also, we note that direct blending image O and N using Eq. (1) would lead to noticeable edges, which are strong indicators to distinguish among noisy areas. This will inevitably sacrifice the semantic understanding capability of the used model. Thus, we dilate the M into \widetilde{M} by the iterative Gaussian smoothing in [37] and employ alpha blending in the contact regions between O and N.

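1. N is formed from real-world image patches. Given only a local patch, N and I cannot be told apart, which forces the model to infer from contextual information and ultimately improves generalization to real data.

2. M is generated without any prior: shape, position, and size are all arbitrary. Free-form masks greatly diversify the shape of M, making it hard for the model to infer the corrupted regions from shape information.

3. Directly filling M with N leaves noticeable edges. The paper dilates M by iterative Gaussian smoothing and uses alpha blending to join O and N along the contact regions.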
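The three steps can be sketched as follows. This is only a rough illustration under several assumptions: the random-walk stroke generator, the repeated [1, 2, 1] / 4 blur standing in for the iterative Gaussian smoothing of [37], the max() used to keep the mask dilated, and all function names and parameters are hypothetical choices, not the authors' procedure.

```python
import numpy as np

def stroke_mask(h, w, rng, n_strokes=4, n_steps=10, radius=2):
    """Binary (h, w) mask made of random-walk brush strokes (1 = corrupted)."""
    yy, xx = np.mgrid[0:h, 0:w]
    mask = np.zeros((h, w))
    for _ in range(n_strokes):
        y, x = int(rng.integers(0, h)), int(rng.integers(0, w))
        for _ in range(n_steps):
            # Stamp a disc at the current brush position, then move the brush.
            mask[(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = 1.0
            y = int(np.clip(y + rng.integers(-2 * radius, 2 * radius + 1), 0, h - 1))
            x = int(np.clip(x + rng.integers(-2 * radius, 2 * radius + 1), 0, w - 1))
    return mask

def blur(a, iters=3):
    """Repeated separable [1, 2, 1] / 4 blur, a stand-in for Gaussian smoothing."""
    for _ in range(iters):
        p = np.pad(a, 1, mode="edge")
        a = 0.25 * p[1:-1, :-2] + 0.5 * p[1:-1, 1:-1] + 0.25 * p[1:-1, 2:]
        p = np.pad(a, 1, mode="edge")
        a = 0.25 * p[:-2, 1:-1] + 0.5 * p[1:-1, 1:-1] + 0.25 * p[2:, 1:-1]
    return a

def make_sample(O, N, rng):
    """Return (I, O, M, N): blend N into O along a feathered free-form mask."""
    h, w, _ = O.shape
    M = stroke_mask(h, w, rng)
    # Dilated soft mask: full weight inside M, feathered falloff just outside.
    M_tilde = np.maximum(M, blur(M))[..., None]
    I = O * (1.0 - M_tilde) + N * M_tilde  # alpha blending
    return I, O, M[..., None], N

rng = np.random.default_rng(0)
O = np.zeros((32, 32, 3))  # placeholder clean image
N = np.ones((32, 32, 3))   # placeholder filler image (a real photo in practice)
I, _, M, _ = make_sample(O, N, rng)
```

In real training data N would be another natural image rather than a constant, so that local patches of N and O look equally plausible.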

 

Inpainting Method: Visual Consistent Network (VCN)

The first two paragraphs serve as an overview.

VCN has two sub-modules, i.e. Mask Prediction Network (MPN) and Robust Inpainting Network (RIN). MPN is to predict potential visually inconsistent areas of a given image, while RIN is to inpaint inconsistent parts based on the predicted mask and the image context. Note that these two submodules are correlated. MPN provides an inconsistency mask \hat{M}\in R^{h\times w\times 1} , where \hat{M}_p \in [0, 1], helping RIN locate inconsistent regions. On the other hand, by leveraging local and global semantic context, RIN largely regularizes MPN, enforcing it to focus on these regions instead of simply fitting our generated data.

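VCN consists of two modules:

1. Mask Prediction Network (MPN): predicts the visually inconsistent regions;

2. Robust Inpainting Network (RIN): inpaints the inconsistent regions.

MPN and RIN are correlated:

The inconsistent regions \hat{M} estimated by MPN help RIN locate what to repair. Conversely, because the image completed by RIN must be semantically correct both locally and globally, this requirement strongly regularizes MPN's estimate.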

Our proposed VCN is robust to blind image inpainting in the given relativistic generalized setting. Its robustness is shown in two aspects. MPN of VCN can predict the regions to be repaired with decent performance even the contamination patterns are new to the trained model. More importantly, RIN of VCN synthesizes plausible and convincing visual content for the predicted missing regions, robust against mask prediction errors.

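One question: the training set provides a wide variety of filler content, but if the filler content N in a test image never appeared in the training set, does the blind inpainting algorithm remain stable?

The answer is yes. MPN predicts the mask region from semantic inconsistency rather than from the content itself. In addition, while RIN generates plausible content, it also pushes the mask prediction to be more accurate.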

 

First, the MPN network:

  • Mask Prediction Network (MPN)

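MPN learns the mapping F from the degraded image I to the binary mask, with prediction \hat{M}; M is the given ground truth mask region.

The first technique used is a self-adaptive loss function.

Here \tau balances the proportion between the non-degraded region 1-M and the degraded region M. The first term of Eq. (2) computes the cross-entropy between M and \hat{M}; the second term does likewise for 1-M and 1-\hat{M}.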
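Eq. (2) itself is missing from these notes, so the following is only a sketch of a class-balanced binary cross-entropy that matches the description above (two log terms weighted by \tau). Setting \tau to the fraction of clean pixels is an assumption made here, as are the function name and the eps clamp.

```python
import numpy as np

def self_adaptive_bce(M_hat, M, eps=1e-7):
    """Class-balanced BCE between a soft prediction M_hat and a binary mask M.

    tau is ASSUMED to be the fraction of clean pixels, so the (usually
    smaller) corrupted class receives the larger weight.
    """
    M_hat = np.clip(np.asarray(M_hat, dtype=np.float64), eps, 1.0 - eps)
    M = np.asarray(M, dtype=np.float64)
    tau = 1.0 - M.mean()
    pos = -(M * np.log(M_hat)).sum()                  # corrupted pixels
    neg = -((1.0 - M) * np.log(1.0 - M_hat)).sum()    # clean pixels
    return tau * pos + (1.0 - tau) * neg

# Toy check: a mask where only 4 of 64 pixels are corrupted.
M = np.zeros((8, 8))
M[:2, :2] = 1.0
loss_good = self_adaptive_bce(np.where(M == 1, 0.9, 0.1), M)
loss_bad = self_adaptive_bce(np.where(M == 1, 0.1, 0.9), M)
```

With this choice of \tau, the two classes contribute equally to the loss when their per-pixel errors are equal, regardless of how small the corrupted region is.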

Two points about this model deserve attention:

Note that \hat{M} is an estimated soft mask where 0 ≤ \hat{M}_p ≤ 1 for ∀p, although we employ a binary version for M in Eq. (1). It means the damaged pixels are not totally abandoned in the following inpainting process. The softness of \hat{M} enables the differentiability of the whole network. Additionally, it lessens error accumulation caused by pixel misclassification, since pixels whose status (damaged or not) MPN are uncertain about are still utilized in the later process.

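Note: the predicted \hat{M} is a soft mask whose values lie between 0 and 1, not a conventional hard mask that is either 0 or 1.

The softness of \hat{M} makes the whole network differentiable. It also reduces the error accumulated from pixel misclassification, because pixels whose status (damaged or not) MPN is uncertain about are still used in later processing.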

 

Note that the objective of MPN is to detect all corrupted regions. Thus it tends to predict large corrupted regions for an input corrupted image, which is shown in Fig. 3(c). As a result, it makes the subsequent inpainting task too difficult to achieve. To make the task more tractable, we instead propose to detect the inconsistency region of the image, as shown in Fig. 3(d), which is much smaller. If these regions are correctly detected, other corrupted regions can be naturally blended to the image, leading to realistic results. In the following, we show that by jointly learning MPN with RIN, the MPN eventually locates inconsistency regions instead of all corrupted ones.

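Note: MPN's objective is to detect all corrupted regions, so it tends to predict an overly large corrupted region, as in Fig. 3(c).

To address this, the paper proposes detecting only the inconsistent regions, as in Fig. 3(d). This is achieved by learning RIN and MPN jointly.

Next, the RIN network: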

  • Robust Inpainting Network (RIN)

RIN learns the mapping from the degraded image I to O, guided by the regions given by the predicted mask \hat{M}.

RIN is structured in an encoder-decoder fashion with probabilistic contextual blocks (PCB). PCB is a residual block variant armed with a new normalization (Fig. 4), incorporating spatial information with the predicted mask.

RIN is built from probabilistic contextual blocks (PCB).

 

Question 1: why propose and use probabilistic context normalization (PCN)?

With the predicted mask \hat{M}, repairing corrupted regions requires knowledge inference from context, and being skeptical to the mask for error propagation from the previous stage. A naive solution is to concatenate the mask with the image and feed them to a network. However, this way captures context semantics only in deeper layers, and does not consider the mask prediction error explicitly. To improve contextual modeling and minimize mask error propagation, it would be better if the transfer is done in all building blocks, driven by the estimated mask confidence. Hence, we propose a probabilistic context normalization (PCN, Fig. 4) to transfer contextual information in different layers.

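Mask prediction errors propagate into the inpainting stage. The remedy is to let information transfer between the mask prediction module and the inpainting module, driven by the confidence of the estimate. As the paper's figure shows, the predicted mask is fed into every layer of RIN, and when RIN is optimized its error also back-propagates into MPN, which improves the accuracy of MPN's prediction.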

 

 

The full formulation of PCN follows:

Our PCN module is composed of the context feature transfer term and feature preserving term. The former transfers mean and variance from known features to unknown areas, both indicated by the estimated soft mask \hat{M} (H defined below is its downsampled version). It is a learnable convex combination of feature statistics from the predicted known areas and unknowns ones. Feature preserving term keeps the features in the known areas (of high confidence) intact. The formulation of PCN is given as

and the operator \mathcal{ T } (·) is to conduct instance internal statistics transfer as

where X is the input feature map of PCN, and H is nearest-neighbor downsampled from \hat{M}, which shares the same height and width with X. \bar{H} = 1- H indicates the regions that MPN considers clean. X_P = X\odot H and X_Q = X\odot \bar{H} . β is a learnable channel-wise vector (\beta \in R^{1\times 1\times c} and β ∈ [0, 1]) computed from X by a squeeze-and-excitation module [12] as

\beta = f(\bar{x}),                     (5)

where \bar{x} \in R^{1\times 1\times c} is also a channel-wise vector computed by average pooling X, and f(·) is the excitation function composed by two fully-connected layers with activation functions (ReLU and Sigmoid, respectively).

µ(·, ·) and σ(·, ·) in Eq. (4) compute the weighted average and standard deviation respectively in the following manner:

where Y is a feature map, T is a soft mask with the same size of Y, and ε is a small positive constant. i and j are the indexes of height and width, respectively.

This long passage simply gives the mathematical formulation of PCN.

Eq. (5) shows that \beta is produced by an SE module (a channel attention module).
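Several of the PCN equations are missing from these notes, so here is a NumPy sketch assembled from the textual definitions above. The exact forms used below are assumptions: the weighted mean and standard deviation, the transfer operator as "normalize X_P, then rescale with the statistics of X_Q", and PCN as a \beta-weighted mix multiplied by H plus the preserved term X \odot \bar{H}. The function names and the toy weights W1, W2 are hypothetical.

```python
import numpy as np

EPS = 1e-5  # small positive constant

def w_mean(Y, T):
    """Weighted spatial mean per channel. Y: (h, w, c), T: (h, w, 1)."""
    return (Y * T).sum(axis=(0, 1), keepdims=True) / (T.sum() + EPS)

def w_std(Y, T):
    """Weighted spatial standard deviation per channel (assumed form)."""
    mu = w_mean(Y, T)
    var = (T * (Y - mu) ** 2).sum(axis=(0, 1), keepdims=True) / (T.sum() + EPS)
    return np.sqrt(var + EPS)

def transfer(X, H):
    """Assumed T(X, H): give corrupted-region features the clean-region statistics."""
    H_bar = 1.0 - H
    XP, XQ = X * H, X * H_bar
    normalized = (XP - w_mean(XP, H)) / w_std(XP, H)
    return normalized * w_std(XQ, H_bar) + w_mean(XQ, H_bar)

def se_gate(X, W1, W2):
    """beta = f(x_bar): average pooling, then FC + ReLU, then FC + Sigmoid."""
    x_bar = X.mean(axis=(0, 1))                  # (c,)
    hidden = np.maximum(x_bar @ W1, 0.0)         # ReLU
    return 1.0 / (1.0 + np.exp(-(hidden @ W2)))  # (c,), each value in (0, 1)

def pcn(X, H, beta):
    """Assumed PCN: context transfer inside H, features preserved outside H."""
    mixed = beta * transfer(X, H) + (1.0 - beta) * X * H
    return mixed * H + X * (1.0 - H)

# Toy feature map: the right half is "corrupted" and has a shifted mean.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 8, 4))
X[:, 4:] += 5.0
H = np.zeros((8, 8, 1))
H[:, 4:, 0] = 1.0
out_full = pcn(X, H, np.ones(4))   # beta = 1: full statistics transfer
out_none = pcn(X, H, np.zeros(4))  # beta = 0: features left unchanged
beta = se_gate(X, rng.normal(size=(4, 2)), rng.normal(size=(2, 4)))
out = pcn(X, H, beta)
```

With \beta = 1 the corrupted half ends up with the per-channel mean of the clean half, while the clean half is untouched. With a soft H the same code blends gradually, which is what makes the block tolerant of uncertain mask predictions.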

Next comes the explanation of PCN:

Prior work [8, 15] showed that feature mean and variance from an image are related to its semantics and texture. The feature statistics propagation by PCN helps regenerate inconsistent areas by leveraging contextual mean and variance. This is intrinsically different from existing methods that implicitly achieve this goal in deep layers, as we explicitly accomplish it in each building block. Thus PCN is beneficial to the learning and performance of blind inpainting. More importantly, RIN keeps robust considering potential errors in \hat{M} from MPN, although RIN is guided by \hat{M} for repairing.

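Prior work has shown that feature mean and variance are related to an image's semantics and texture. The contextual mean and variance statistics propagated by PCN help regenerate the inconsistent regions.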

 

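Question 2: the overall optimization objective

Four loss terms are used in total:

L1 loss: the mean absolute error between the completed image and the ground truth image;

VGG loss: not explained further here;

ID-MRF loss: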

ID-MRF loss [37, 25] is employed as our texture consistency term. It computes the sum of the patch-wise difference between neural patches from the generated content and those from the corresponding ground truth using a relative similarity measure. It enhances generated image details by minimizing discrepancy with its most similar patch from the ground truth.

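It computes the sum of the differences between patches of the generated content and the corresponding ground truth patches using a relative similarity measure. It enhances the generated image details by minimizing the discrepancy with the most similar ground truth patch.

Adversarial loss: WGAN-GP is used.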

For the adversarial term, WGAN-GP [10, 1] is adopted as

where P denotes data distribution, and D is a discriminator for the adversarial training. Its corresponding learning objective for the discriminator is given as
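Both equations are missing from these notes. For orientation, the standard WGAN-GP objectives take the following form. The notation here (\hat{O} for the generated image, \tilde{O} for a random interpolate between real and generated samples, \lambda_{gp} for the penalty weight) follows common usage and is an assumption, not necessarily the paper's exact notation.

```latex
% Generator (adversarial) term, standard WGAN form:
\mathcal{L}_{adv} = -\,\mathbb{E}_{I \sim P_I}\big[ D(\hat{O}) \big]

% Discriminator objective with gradient penalty:
\mathcal{L}_{D} = \mathbb{E}_{I \sim P_I}\big[ D(\hat{O}) \big]
                - \mathbb{E}_{O \sim P_O}\big[ D(O) \big]
                + \lambda_{gp}\, \mathbb{E}_{\tilde{O} \sim P_{\tilde{O}}}
                  \Big[ \big( \lVert \nabla_{\tilde{O}} D(\tilde{O}) \rVert_2 - 1 \big)^2 \Big]
```

The gradient penalty term pushes the discriminator's gradient norm toward 1 on interpolated samples, which is how WGAN-GP enforces the Lipschitz constraint softly.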

 

 

 

 
