MyDLNote-Inpainting: Quick Reads of ECCV 2020 Video Inpainting Papers

DVI: Depth Guided Video Inpainting for Autonomous Driving

[paper] [code]

This paper brings in 3D point cloud data to inpaint videos. It addresses a particularly important problem: a region that stays occluded throughout the entire video (whereas the following two papers implicitly assume that the region to be inpainted is visible in some other key frame).

Abstract

To get clear street-view and photo-realistic simulation in autonomous driving, we present an automatic video inpainting algorithm that can remove traffic agents from videos and synthesize missing regions with the guidance of depth/point cloud.

To obtain clear street views and photo-realistic simulation for autonomous driving, the authors propose an automatic video inpainting algorithm that removes traffic agents from videos and synthesizes the missing regions under the guidance of depth/point clouds.

By building a dense 3D map from stitched point clouds, frames within a video are geometrically correlated via this common 3D map. In order to fill a target inpainting area in a frame, it is straightforward to transform pixels from other frames into the current one with correct occlusion. Furthermore, we are able to fuse multiple videos through 3D point cloud registration, making it possible to inpaint a target video with multiple source videos. The motivation is to solve the long-time occlusion problem where an occluded area has never been visible in the entire video.

A dense 3D map is built by stitching the point clouds together, and the frames of a video are geometrically correlated through this common 3D map. To fill a target inpainting region in a frame, pixels from other frames can be transformed directly into the current frame with correct occlusion handling. Moreover, multiple videos can be fused via 3D point cloud registration, which makes it possible to inpaint a target video with multiple source videos. The motivation is to solve the long-time occlusion problem, where the occluded area is never visible anywhere in the entire video.

To our knowledge, we are the first to fuse multiple videos for video inpainting. To verify the effectiveness of our approach, we build a large inpainting dataset in the real urban road environment with synchronized images and Lidar data including many challenge scenes, e.g., long time occlusion.

This is also the first work to fuse multiple videos for video inpainting.

 

Depth Guided Video Inpainting

Overall architecture: the 3D map is first built by stitching all the point clouds together, and is then projected onto each individual frame. With the resulting dense depth map and the known camera extrinsics, candidate colors can be sampled from other frames to fill the holes in the current frame. A belief-propagation-based regularization is then applied to keep the pixel colors inside the inpainted region consistent. (A small sketch of the color-sampling step is given after Fig. 1 below.)

Fig. 1. Frame-wise point clouds (a) are stitched into a 3D map (b) using LOAM. The 3D map is projected onto a frame (c) to generate a depth map. For each pixel in the target region (e), we use its depth (d) as guidance to sample colors from other frames (f). Final pixel values are determined by BP regularization and color harmonization to ensure photometric consistency. (g) shows the final inpainting result.
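To make the "use depth as guidance to sample colors from other frames" step concrete, here is a minimal numpy sketch of the underlying geometry. All names and matrix conventions (K as the 3x3 intrinsics, T_*_w as 4x4 world-to-camera extrinsics) are my assumptions, not the paper's code, and occlusion checks and sub-pixel interpolation are omitted.

```python
import numpy as np

def sample_color_from_source(u, v, depth, K, T_tgt_w, T_src_w, src_image):
    """Hypothetical sketch: back-project a target pixel with known depth into
    world coordinates, then project it into a source frame to fetch a
    candidate color. K: 3x3 intrinsics; T_tgt_w / T_src_w: 4x4 world-to-camera
    extrinsics of the target and source frames."""
    # Back-project pixel (u, v) with its depth into the target camera frame
    p_cam = depth * np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Lift to world coordinates using the inverse target extrinsics
    p_world = np.linalg.inv(T_tgt_w) @ np.append(p_cam, 1.0)
    # Transform into the source camera frame and project with the intrinsics
    p_src = (T_src_w @ p_world)[:3]
    uv_src = K @ (p_src / p_src[2])
    # Nearest-neighbour lookup of the candidate color (no bounds/occlusion check)
    return src_image[int(round(uv_src[1])), int(round(uv_src[0]))]
```

In the full method, candidates gathered this way from several frames are then regularized with belief propagation and color harmonization, as described in the caption of Fig. 1.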

 

 

 

Learning Joint Spatial-Temporal Transformations for Video Inpainting

[paper] [github]

This paper applies the Transformer to video inpainting. The Transformer has already been applied in many other areas.

Abstract

State-of-the-art approaches adopt attention models to complete a frame by searching missing contents from reference frames, and further complete whole videos frame by frame. However, these approaches can suffer from inconsistent attention results along spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in videos.

State-of-the-art approaches use attention models to complete a frame by searching for the missing content in reference frames, and then complete the whole video frame by frame. However, they can produce inconsistent attention results along the spatial and temporal dimensions, which often leads to blurriness and temporal artifacts in the video.

In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting. Specifically, we simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.

This paper proposes to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting: the missing regions of all input frames are filled simultaneously by self-attention, and STTN is optimized with a spatial-temporal adversarial loss.

 

Spatial-Temporal Transformer Networks

Overall design

  • Problem formulation:

The intuition is that an occluded region in a current frame would probably be revealed in a region from a distant frame, especially when a mask is large or moving slowly. To fill missing regions in a target frame, it is more effective to borrow useful contents from the whole video by taking both neighboring frames and distant frames as conditions. To simultaneously complete all the input frames in a single feed-forward process, we formulate the video inpainting task as a “multi-to-multi” problem. Based on the Markov assumption [11], we simplify the “multi-to-multi” problem and denote it as:
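The equation itself is not reproduced in this note; reconstructed from the definitions below (so the notation may differ slightly from the paper's Eq. (1), with Y^{t+n}_{t-n} denoting the completed frames), it has roughly the form

p(Y^T_1 | X^T_1) = ∏_{t=1}^{T} p(Y^{t+n}_{t-n} | X^{t+n}_{t-n}, X^T_{1,s})     (1)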

where X^{t+n}_{t-n} denotes a short clip of neighboring frames with a center moment t and a temporal radius n. X^T_{1,s} denotes distant frames that are uniformly sampled from the videos X^T_1 in a sampling rate of s. Since X^T_{1,s} can usually cover most key frames of the video, it is able to describe “the whole story” of the video. Under this formulation, video inpainting models are required to not only preserve temporal consistency in neighboring frames, but also make the completed frames to be coherent with “the whole story” of the video.

A core intuition of the paper:

A region that is occluded in the current frame can be compensated from other frames in which that region is not occluded, in particular the distant "whole story" frames X^T_{1,s} sampled from the video X^T_1.

So in this model, completing the current frame depends not only on the neighboring frames from t-n to t+n, but also on the "whole story" key frames X^T_{1,s}. A toy sketch of this conditioning scheme is given below.
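As a toy illustration of how the conditioning frames could be gathered (indices only; the function name, the defaults n=2 and s=10, and the boundary handling are my assumptions, not values from the paper):

```python
def sample_condition_frames(t, T, n=2, s=10):
    """Collect the indices of the conditioning frames for target time t:
    a short clip of neighbors within temporal radius n, plus distant
    "whole story" frames uniformly sampled from the whole video at rate s."""
    neighbors = [i for i in range(t - n, t + n + 1) if 0 <= i < T]
    distant = list(range(0, T, s))
    return neighbors, distant

# Example: for a 50-frame video and target frame 20
# -> neighbors [18..22], distant [0, 10, 20, 30, 40]
print(sample_condition_frames(20, 50))
```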

 

  • Network design:

The overview of the proposed Spatial-Temporal Transformer Networks (STTN) is shown in Figure 2. As indicated in Eq. (1), STTN takes both neighboring frames X^{t+n}_{t-n} and distant frames X^T_{1,s} as conditions, and completes all the input frames simultaneously. Specifically, STTN consists of three components, including a frame-level encoder, multi-layer multi-head spatial-temporal transformers, and a frame-level decoder. The frame-level encoder is built by stacking several 2D convolution layers with strides, which aims at encoding deep features from low-level pixels for each frame. Similarly, the frame-level decoder is designed to decode features back to frames. Spatial-temporal transformers are the core component, which aims at learning joint spatial-temporal transformations for all missing regions in the deep encoding space.

The proposed STTN consists of three components:

Frame-level encoder: built by stacking several strided 2D convolution layers; it encodes deep features from the low-level pixels of each frame.

Multi-layer multi-head spatial-temporal transformers: the core component; it learns joint spatial-temporal transformations for all missing regions in the deep encoding space.

Frame-level decoder: decodes the features back to frames. A minimal skeleton of this three-part design is sketched below.
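The following PyTorch skeleton only shows how the three components fit together; the layer sizes, the use of a plain nn.TransformerEncoder over flattened frame patches, and all hyper-parameters are placeholders of mine and do not reproduce the authors' architecture.

```python
import torch
import torch.nn as nn

class STTNSketch(nn.Module):
    """Simplified skeleton of the three STTN components described above."""
    def __init__(self, channels=64, num_heads=4, num_layers=2):
        super().__init__()
        # Frame-level encoder: strided 2D convolutions applied to each frame
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Stand-in for the multi-layer multi-head spatial-temporal transformers:
        # plain self-attention over the patches of all frames at once
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Frame-level decoder: upsample the features back to frames
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, frames):                       # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        feat = self.encoder(frames.reshape(b * t, c, h, w))   # (B*T, C, H/4, W/4)
        _, cf, hf, wf = feat.shape
        # Treat every spatial location of every frame as one token
        tokens = feat.reshape(b, t, cf, hf, wf).permute(0, 1, 3, 4, 2)
        tokens = tokens.reshape(b, t * hf * wf, cf)
        tokens = self.transformer(tokens)            # joint spatial-temporal attention
        feat = tokens.reshape(b, t, hf, wf, cf).permute(0, 1, 4, 2, 3)
        feat = feat.reshape(b * t, cf, hf, wf)
        return self.decoder(feat).reshape(b, t, 3, h, w)


# Example usage on a tiny 5-frame clip
video = torch.randn(1, 5, 3, 64, 64)
print(STTNSketch()(video).shape)                     # torch.Size([1, 5, 3, 64, 64])
```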

 

 

Short-Term and Long-Term Context Aggregation Network for Video Inpainting

[paper]

 

Abstract

Existing methods either suffer from inaccurate short-term context aggregation or rarely explore long-term frame information. In this work, we present a novel context aggregation network to effectively exploit both short-term and long-term frame information for video inpainting.

Existing methods either suffer from inaccurate short-term context aggregation or rarely exploit long-term frame information. This work presents a new context aggregation network that effectively exploits both short-term and long-term frame information for video inpainting.

In the encoding stage, we propose boundary-aware short-term context aggregation, which aligns and aggregates, from neighbor frames, local regions that are closely related to the boundary context of missing regions into the target frame. Furthermore, we propose dynamic long-term context aggregation to globally refine the feature map generated in the encoding stage using long-term frame features, which are dynamically updated throughout the inpainting process.

In the encoding stage, boundary-aware short-term context aggregation aligns and aggregates, from neighboring frames, the local regions that are closely related to the boundary context of the missing regions into the target frame. In addition, dynamic long-term context aggregation globally refines the feature map produced in the encoding stage using long-term frame features that are dynamically updated throughout the inpainting process.

 

Short-Term and Long-Term Context Aggregation Network

  • Network Overview

Fig. 3. Overview of our proposed network. In the encoding stage, we conduct Boundary-aware Short-term Context Aggregation (BSCA) (Sec. 3.2) using short-term frame information from neighbor frames, which is beneficial to context aggregation and generating temporally consistent contents. In the decoding stage, we propose the Dynamic Long-term Context Aggregation (DLCA) (Sec. 3.3), which utilizes dynamically updated long-term frame information to refine the encoding-generated feature map.

In the encoding stage, Boundary-aware Short-term Context Aggregation (BSCA) uses short-term information from neighboring frames, which benefits context aggregation and produces temporally consistent content. In the decoding stage, Dynamic Long-term Context Aggregation (DLCA) uses dynamically updated long-term frame information to refine the feature map generated by the encoder. A convolutional LSTM (Conv-LSTM) layer is used for the dynamic update; a generic sketch of such a cell follows.
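For reference, below is a generic convolutional LSTM cell in PyTorch. This is the standard formulation; the kernel size, channel counts and the way the hidden state is consumed are my placeholders, not the paper's configuration. The hidden state acts as a memory that can be updated dynamically as frames are processed.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Generic convolutional LSTM cell. The hidden state h can serve as a
    dynamically updated long-term memory over the frames seen so far."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        # One convolution produces all four gates at once
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size,
                               padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state                                   # previous hidden and cell states
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)                  # update the cell state
        h = o * torch.tanh(c)                          # new hidden state
        return h, (h, c)


# Example: one step over a 64-channel feature map
cell = ConvLSTMCell(64, 64)
x = torch.randn(1, 64, 32, 32)
h = c = torch.zeros(1, 64, 32, 32)
feat, (h, c) = cell(x, (h, c))
```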

 

Boundary-aware Short-term Context Aggregation

Fig. 4. Left: Boundary-aware Short-term Context Aggregation (BSCA) module. Right: The boundary-aware context alignment operation in BSCA. Here, l ∈ {1/2, 1/4, 1/8} refers to the encoding scale.

 

 

Dynamic Long-term Context Aggregation

Sampling long-term reference frames at fixed positions ignores the motion diversity of videos, so it may inevitably bring in irrelevant or even noisy information: different videos have different motion patterns (e.g., moving slowly, or moving back and forth), which lead to different context dependencies between frames. The selected long-term reference information therefore has to be contextually relevant to the current target frame. The paper adopts a dynamic strategy to exploit long-term reference information effectively. The structure of this decoding-stage context aggregation module is shown in Fig. 5: the feature map generated in the encoding stage is refined by (1) dynamically updated long-term features and (2) non-local-based aggregation (a generic sketch of the non-local step follows the figure).

Fig. 5. The Dynamic Long-term Context Aggregation (DLCA) module.
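As a rough illustration of what "non-local-based aggregation" over long-term features means, here is a generic non-local attention sketch in PyTorch. It is the textbook non-local formulation with all names and shapes assumed by me; the actual DLCA module differs in its details.

```python
import torch
import torch.nn.functional as F

def nonlocal_aggregate(target_feat, longterm_feats):
    """Every spatial location of the target feature map (C, H, W) attends over
    all locations of the long-term reference features (N, C, H, W); the
    attended values are added back to the target as a residual refinement.
    Keys and values are shared here for brevity."""
    c, h, w = target_feat.shape
    q = target_feat.reshape(c, h * w).t()                          # (HW, C) queries
    k = longterm_feats.reshape(-1, c, h * w).permute(0, 2, 1).reshape(-1, c)  # (N*HW, C)
    attn = F.softmax(q @ k.t() / c ** 0.5, dim=-1)                 # (HW, N*HW) attention
    out = (attn @ k).t().reshape(c, h, w)                          # aggregated long-term context
    return target_feat + out


# Example: refine a 64-channel 32x32 target map with 3 long-term reference maps
target = torch.randn(64, 32, 32)
refs = torch.randn(3, 64, 32, 32)
print(nonlocal_aggregate(target, refs).shape)                      # torch.Size([64, 32, 32])
```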

