ICCV 2023 Self-Supervised Learning Papers: A Quick Look at the Abstracts

Paper1 Self-supervised Monocular Depth Estimation: Let’s Talk About The Weather

Original abstract: Current self-supervised depth estimation architectures rely on clear and sunny weather scenes to train deep neural networks. However, in many locations, this assumption is too strong. For example, in the UK (2021), 149 days consisted of rain. For these architectures to be effective in real-world applications, we must create models that can generalise to all weather conditions, times of the day and image qualities. Using a combination of computer graphics and generative models, one can augment existing sunny-weather data in a variety of ways that simulate adverse weather effects. While it is tempting to use such data augmentations for self-supervised depth, in the past this was shown to degrade performance instead of improving it. In this paper, we put forward a method that uses augmentations to remedy this problem. By exploiting the correspondence between unaugmented and augmented data we introduce a pseudo-supervised loss for both depth and pose estimation. This brings back some of the benefits of supervised learning while still not requiring any labels. We also make a series of practical recommendations which collectively offer a reliable, efficient framework for weather-related augmentation of self-supervised depth from monocular video. We present extensive testing to show that our method, Robust-Depth, achieves SotA performance on the KITTI dataset while significantly surpassing SotA on challenging, adverse condition data such as DrivingStereo, Foggy CityScape and NuScenes-Night. The project website can be found at https://kieran514.github.io/Robust-Depth-Project/.

Summary: Current self-supervised depth estimation architectures rely on clear, sunny scenes to train deep neural networks, an assumption that is too strong in many locations. For these architectures to work in real applications, models must generalise to all weather conditions, times of day, and image qualities. Combining computer graphics and generative models, existing sunny-weather data can be augmented in various ways that simulate adverse weather. Although using such augmentations for self-supervised depth is tempting, it was previously shown to degrade rather than improve performance. This paper proposes a method that uses augmentations to remedy the problem: by exploiting the correspondence between unaugmented and augmented data, it introduces a pseudo-supervised loss for both depth and pose estimation, recovering some benefits of supervised learning while still requiring no labels. A series of practical recommendations together offer a reliable, efficient framework for weather-related augmentation of self-supervised depth from monocular video. Extensive testing shows that the method, Robust-Depth, achieves SotA performance on KITTI while significantly surpassing SotA on challenging adverse-condition data such as DrivingStereo, Foggy CityScape and NuScenes-Night. Project website: https://kieran514.github.io/Robust-Depth-Project/.
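
To make the pseudo-supervised idea concrete, here is a minimal PyTorch sketch (the function name, `depth_net`, and the L1 form are illustrative assumptions, not the paper's exact loss): the prediction on the unaugmented frame serves as a pseudo-label for the prediction on the weather-augmented frame.

```python
import torch

def pseudo_supervised_depth_loss(depth_net, clean_img, aug_img):
    """Sketch: use the depth predicted on the clean-weather frame as a
    pseudo label for the weather-augmented frame (no real labels needed)."""
    with torch.no_grad():                 # pseudo label carries no gradient
        depth_clean = depth_net(clean_img)
    depth_aug = depth_net(aug_img)        # trained to match the pseudo label
    return torch.abs(depth_aug - depth_clean).mean()   # L1 consistency
```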

Paper2 Self-supervised Learning to Bring Dual Reversed Rolling Shutter Images Alive

Original abstract: Modern consumer cameras usually employ the rolling shutter (RS) mechanism, where images are captured by scanning scenes row-by-row, yielding RS distortions for dynamic scenes. To correct RS distortions, existing methods adopt a fully supervised learning manner, where high framerate global shutter (GS) images should be collected as ground-truth supervision. In this paper, we propose a Self-supervised learning framework for Dual reversed RS distortions Correction (SelfDRSC), where a DRSC network can be learned to generate a high framerate GS video only based on dual RS images with reversed distortions. In particular, a bidirectional distortion warping module is proposed for reconstructing dual reversed RS images, and then a self-supervised loss can be deployed to train the DRSC network by enhancing the cycle consistency between input and reconstructed dual reversed RS images. Besides start and end RS scanning time, GS images at arbitrary intermediate scanning time can also be supervised in SelfDRSC, thus enabling the learned DRSC network to generate a high framerate GS video. Moreover, a simple yet effective self-distillation strategy is introduced in the self-supervised loss for mitigating boundary artifacts in generated GS images. On the synthetic dataset, SelfDRSC achieves better or comparable quantitative metrics in comparison to state-of-the-art methods trained in the full supervision manner. On real-world RS cases, our SelfDRSC can produce high framerate GS videos with finer correction textures and better temporal consistency. The source code and trained models are made publicly available at https://github.com/shangwei5/SelfDRSC.

Summary: Modern consumer cameras usually employ a rolling shutter (RS), capturing images by scanning the scene row by row, which distorts dynamic scenes. Existing correction methods are fully supervised, requiring high-framerate global shutter (GS) images as ground-truth supervision. This paper proposes SelfDRSC, a self-supervised framework for dual reversed RS distortion correction, in which a DRSC network learns to generate a high-framerate GS video based only on two RS images with reversed distortions. Specifically, a bidirectional distortion warping module reconstructs the dual reversed RS images, and a self-supervised loss trains the DRSC network by enforcing cycle consistency between the input and reconstructed dual reversed RS images. Besides the start and end RS scanning times, GS images at arbitrary intermediate scanning times can also be supervised, so the learned network can generate a high-framerate GS video. A simple yet effective self-distillation strategy in the self-supervised loss further mitigates boundary artifacts in the generated GS images. On synthetic data, SelfDRSC matches or surpasses the quantitative metrics of state-of-the-art fully supervised methods; on real-world RS cases, it produces high-framerate GS videos with finer corrected textures and better temporal consistency. Source code and trained models are publicly available at https://github.com/shangwei5/SelfDRSC.
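
A hedged sketch of the cycle-consistency term described above (PyTorch; `warp_to_rs` is a hypothetical stand-in for the paper's bidirectional distortion warping module):

```python
import torch

def cycle_consistency_loss(gs_frames, rs_pair, warp_to_rs):
    """Sketch: re-render the two reversed rolling-shutter inputs from the
    predicted global-shutter frames and compare them with the real inputs."""
    rs_top2bottom, rs_bottom2top = rs_pair
    rec_t2b = warp_to_rs(gs_frames, direction="top2bottom")
    rec_b2t = warp_to_rs(gs_frames, direction="bottom2top")
    return (torch.abs(rec_t2b - rs_top2bottom).mean() +
            torch.abs(rec_b2t - rs_bottom2top).mean())
```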

Paper3 Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network

Original abstract: Monocular depth estimation is known to be an ill-posed task, as objects in a 2D image usually do not contain sufficient information to predict their depth. Thus, it acts differently from other tasks (e.g., classification and segmentation) in many ways. In this paper, we find that self-supervised monocular depth estimation shows a direction sensitivity and environmental dependency in the feature representation. But the current CNN backbones borrowed from other tasks cannot handle different types of environmental information efficiently, limiting the overall depth accuracy. To bridge this gap, we propose a new Direction-aware Cumulative Convolution Network (DaCCN), which improves the depth feature representation in two aspects. First, we propose a direction-aware module, which can learn to adjust the feature extraction in each direction, facilitating the encoding of different types of information. Secondly, we design a new cumulative convolution to improve the efficiency for aggregating important environmental information. Experiments show that our method achieves significant improvements on three widely used benchmarks and sets a new state-of-the-art performance on the popular benchmarks with all three types of self-supervision.

Summary: Monocular depth estimation is an ill-posed task: objects in a 2D image usually do not contain enough information to predict their depth, so it behaves differently from tasks such as classification and segmentation in many ways. The authors find that self-supervised monocular depth estimation exhibits direction sensitivity and environmental dependency in its feature representation, but CNN backbones borrowed from other tasks cannot handle the different types of environmental information efficiently, limiting overall depth accuracy. To bridge this gap, they propose the Direction-aware Cumulative Convolution Network (DaCCN), which improves the depth feature representation in two aspects. First, a direction-aware module learns to adjust feature extraction in each direction, facilitating the encoding of different types of information. Second, a new cumulative convolution improves the efficiency of aggregating important environmental information. Experiments show significant improvements on three widely used benchmarks and new state-of-the-art performance under all three types of self-supervision.
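
As a rough illustration of the cumulative-convolution idea (an assumption-laden PyTorch sketch, not the paper's exact operator), features can be accumulated along one image direction before a pointwise projection:

```python
import torch
import torch.nn as nn

class CumulativeConv(nn.Module):
    """Sketch: accumulate features along the vertical direction (a running
    mean over height) and project with a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):                      # x: (B, C, H, W)
        cum = torch.cumsum(x, dim=2)           # accumulate along height
        count = torch.arange(1, x.shape[2] + 1, device=x.device, dtype=x.dtype)
        cum = cum / count.view(1, 1, -1, 1)    # normalize to a running mean
        return self.proj(cum)
```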

Paper4 GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes

Original abstract: This paper tackles the challenges of self-supervised monocular depth estimation in indoor scenes caused by large rotation between frames and low texture. We ease the learning process by obtaining coarse camera poses from monocular sequences through multi-view geometry to deal with the former. However, we found that, limited by the scale ambiguity across different scenes in the training dataset, a naive introduction of geometric coarse poses cannot play a positive role in performance improvement, which is counter-intuitive. To address this problem, we propose to refine those poses during training through rotation and translation/scale optimization. To soften the effect of the low texture, we combine the global reasoning of vision transformers with an overfitting-aware, iterative self-distillation mechanism, providing more accurate depth guidance coming from the network itself. Experiments on NYUv2, ScanNet, 7scenes, and KITTI datasets support the effectiveness of each component in our framework, which sets a new state-of-the-art for indoor self-supervised monocular depth estimation, as well as outstanding generalization ability. Code and models are available at https://github.com/zxcqlf/GasMono

Summary: This paper addresses the challenges of self-supervised monocular depth estimation in indoor scenes, which stem mainly from large inter-frame rotations and low texture. To handle the former, the authors ease learning by obtaining coarse camera poses from monocular sequences via multi-view geometry. They found, however, that because of scale ambiguity across scenes in the training data, naively introducing these geometric coarse poses does not improve performance, which is counter-intuitive. To solve this, they refine the poses during training through rotation and translation/scale optimization. To soften the effect of low texture, they combine the global reasoning of vision transformers with an overfitting-aware, iterative self-distillation mechanism that provides more accurate depth guidance from the network itself. Experiments on NYUv2, ScanNet, 7scenes, and KITTI demonstrate the effectiveness of each component; the framework sets a new state of the art for indoor self-supervised monocular depth estimation and shows outstanding generalization. Code and models are available at https://github.com/zxcqlf/GasMono.

Paper5 Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations

Original abstract: In recent years, discriminative self-supervised methods have made significant strides in advancing various visual tasks. The central idea of learning a data encoder that is robust to data distortions/augmentations is straightforward yet highly effective. Although many studies have demonstrated the empirical success of various learning methods, the resulting learned representations can exhibit instability and hinder downstream performance. In this study, we analyze discriminative self-supervised methods from a causal perspective to explain these unstable behaviors and propose solutions to overcome them. Our approach draws inspiration from prior works that empirically demonstrate the ability of discriminative self-supervised methods to demix ground truth causal sources to some extent. Unlike previous work on causality-empowered representation learning, we do not apply our solutions during the training process but rather during the inference process to improve time efficiency. Through experiments on both controlled image datasets and realistic image datasets, we show that our proposed solutions, which involve tempering a linear transformation with controlled synthetic data, are effective in addressing these issues.

Summary: In recent years, discriminative self-supervised methods have made significant strides on various visual tasks. The central idea, learning a data encoder robust to data distortions/augmentations, is simple yet highly effective. Although many studies demonstrate the empirical success of various learning methods, the resulting learned representations can be unstable and hinder downstream performance. This study analyzes discriminative self-supervised methods from a causal perspective to explain these unstable behaviors and proposes solutions to overcome them, drawing inspiration from prior work that empirically shows such methods can demix the ground-truth causal sources to some extent. Unlike previous causality-empowered representation learning, the solutions are applied not during training but at inference time, to improve time efficiency. Experiments on both controlled and realistic image datasets show that the proposed solutions, which temper a linear transformation with controlled synthetic data, effectively address these issues.

Paper6 Self-Supervised Character-to-Character Distillation for Text Recognition

Original abstract: When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce the dependence on annotated real images, the domain gap still limits the recognition performance. Therefore, exploring the robust text feature representations on unlabeled real images by self-supervised learning is a good solution. However, existing self-supervised text recognition methods conduct sequence-to-sequence representation learning by roughly splitting the visual features along the horizontal axis, which limits the flexibility of the augmentations, as large geometric-based augmentations may lead to sequence-to-sequence feature inconsistency. Motivated by this, we propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate general text representation learning. Specifically, we delineate the character structures of unlabeled real images by designing a self-supervised character segmentation module. Following this, CCD easily enriches the diversity of local characters while keeping their pairwise alignment under flexible augmentations, using the transformation matrix between two augmented views from images. Experiments demonstrate that CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution. Code will be released soon.

Summary: When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although they employ large-scale synthetic text images to reduce dependence on annotated real images, the domain gap still limits recognition performance, so exploring robust text feature representations on unlabeled real images via self-supervised learning is a good solution. However, existing self-supervised text recognition methods perform sequence-to-sequence representation learning by roughly splitting visual features along the horizontal axis, which limits the flexibility of augmentations, since large geometry-based augmentations may cause sequence-to-sequence feature inconsistency. Motivated by this, the paper proposes CCD, a novel self-supervised character-to-character distillation method that enables versatile augmentations for general text representation learning: a self-supervised character segmentation module delineates the character structures of unlabeled real images, after which CCD uses the transformation matrix between two augmented views of an image to enrich the diversity of local characters while keeping their pairwise alignment under flexible augmentations. Experiments show that CCD achieves state-of-the-art results, with average gains of 1.38% in text recognition, 1.7% in text segmentation, and 0.24 dB (PSNR) / 0.0321 (SSIM) in text super-resolution. Code will be released soon.

Paper7 Multi-Label Self-Supervised Learning with Scene Images

Original abstract: Self-supervised learning (SSL) methods targeting scene images have seen a rapid growth recently, and they mostly rely on either a dedicated dense matching mechanism or a costly unsupervised object discovery module. This paper shows that instead of hinging on these strenuous operations, quality image representations can be learned by treating scene/multi-label image SSL simply as a multi-label classification problem, which greatly simplifies the learning framework. Specifically, multiple binary pseudo-labels are assigned for each input image by comparing its embeddings with those in two dictionaries, and the network is optimized using the binary cross entropy loss. The proposed method is named Multi-Label Self-supervised learning (MLS). Visualizations qualitatively show that clearly the pseudo-labels by MLS can automatically find semantically similar pseudo-positive pairs across different images to facilitate contrastive learning. MLS learns high quality representations on MS-COCO and achieves state-of-the-art results on classification, detection and segmentation benchmarks. At the same time, MLS is much simpler than existing methods, making it easier to deploy and for further exploration.

Summary: Self-supervised learning (SSL) methods targeting scene images have grown rapidly of late, mostly relying on either a dedicated dense matching mechanism or a costly unsupervised object discovery module. This paper shows that, instead of these strenuous operations, quality image representations can be learned by treating scene/multi-label image SSL simply as a multi-label classification problem, which greatly simplifies the learning framework. Specifically, multiple binary pseudo-labels are assigned to each input image by comparing its embeddings with those in two dictionaries, and the network is optimized with the binary cross-entropy loss; the method is named Multi-Label Self-supervised learning (MLS). Visualizations qualitatively show that MLS's pseudo-labels automatically find semantically similar pseudo-positive pairs across different images to facilitate contrastive learning. MLS learns high-quality representations on MS-COCO and achieves state-of-the-art results on classification, detection, and segmentation benchmarks, while being much simpler than existing methods and thus easier to deploy and explore further.
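
The pseudo-label assignment can be pictured with a small sketch (PyTorch; the top-k rule and the single dictionary are simplifying assumptions, the paper itself uses two dictionaries):

```python
import torch
import torch.nn.functional as F

def mls_style_loss(embeddings, dictionary, k=5):
    """Sketch: assign k binary pseudo-labels per image by similarity to a
    dictionary of embeddings, then optimize with binary cross-entropy."""
    logits = embeddings @ dictionary.t()        # (B, D) similarity logits
    topk = logits.topk(k, dim=1).indices
    targets = torch.zeros_like(logits)
    targets.scatter_(1, topk, 1.0)              # top-k entries as positives
    return F.binary_cross_entropy_with_logits(logits, targets)
```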

Paper8 SINC: Self-Supervised In-Context Learning for Vision-Language Tasks

Original abstract: Large Pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the inputs. Recent works promote this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods could inherit issues in the language domain, such as template sensitivity and hallucination. Also, the scale of these language models raises a significant demand for computations, making learning and operating these models resource-intensive. To this end, we raise a question: “How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?”. To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning (SINC), that introduces a meta-model to learn on self-supervised prompts consisting of tailored demonstrations. The learned models can be transferred to downstream tasks for making in-context predictions on-the-fly. Extensive experiments show that SINC outperforms gradient-based methods in various vision-language tasks under few-shot settings. Furthermore, the designs of SINC help us investigate the benefits of in-context learning across different tasks, and the analysis further reveals the essential components for the emergence of in-context learning in the vision-language domain.

Summary: Large pre-trained Transformers exhibit an intriguing capacity for in-context learning: without gradient updates, they can rapidly construct new predictors from demonstrations presented in the inputs. Recent work promotes this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions, but such methods may inherit issues from the language domain, such as template sensitivity and hallucination, and the scale of these language models raises a significant demand for computation, making them resource-intensive to learn and operate. The authors therefore ask how in-context learning can be enabled without relying on the intrinsic in-context ability of large language models. Their answer is SINC (Self-supervised IN-Context learning), a succinct and general framework that introduces a meta-model learned on self-supervised prompts consisting of tailored demonstrations; the learned models transfer to downstream tasks to make in-context predictions on the fly. Extensive experiments show that SINC outperforms gradient-based methods on various vision-language tasks under few-shot settings; moreover, SINC's design helps investigate the benefits of in-context learning across tasks, further revealing the essential components for the emergence of in-context learning in the vision-language domain.

Paper9 Identity-Seeking Self-Supervised Representation Learning for Generalizable Person Re-Identification

Original abstract: This paper aims to learn a domain-generalizable (DG) person re-identification (ReID) representation from large-scale videos without any annotation. Prior DG ReID methods employ limited labeled data for training due to the high cost of annotation, which restricts further advances. To overcome the barriers of data and annotation, we propose to utilize large-scale unsupervised data for training. The key issue lies in how to mine identity information. To this end, we propose an Identity-seeking Self-supervised Representation learning (ISR) method. ISR constructs positive pairs from inter-frame images by modeling the instance association as a maximum-weight bipartite matching problem. A reliability-guided contrastive loss is further presented to suppress the adverse impact of noisy positive pairs, ensuring that reliable positive pairs dominate the learning process. The training cost of ISR scales approximately linearly with the data size, making it feasible to utilize large-scale data for training. The learned representation exhibits superior generalization ability. Without human annotation and fine-tuning, ISR achieves 87.0% Rank-1 on Market-1501 and 56.4% Rank-1 on MSMT17, outperforming the best supervised domain-generalizable method by 5.0% and 19.5%, respectively. In the pre-training-to-fine-tuning scenario, ISR achieves state-of-the-art performance, with 88.4% Rank-1 on MSMT17.

Summary: This paper aims to learn a domain-generalizable (DG) person re-identification (ReID) representation from large-scale videos without any annotation. Prior DG ReID methods train on limited labeled data because annotation is costly, which restricts further progress. To overcome the data and annotation barriers, the authors train on large-scale unsupervised data; the key issue is how to mine identity information. They propose Identity-seeking Self-supervised Representation learning (ISR), which constructs positive pairs from inter-frame images by modeling instance association as a maximum-weight bipartite matching problem. A reliability-guided contrastive loss further suppresses the adverse impact of noisy positive pairs, ensuring that reliable pairs dominate the learning process. ISR's training cost scales approximately linearly with data size, making large-scale training feasible, and the learned representation generalizes well: without human annotation or fine-tuning, ISR reaches 87.0% Rank-1 on Market-1501 and 56.4% Rank-1 on MSMT17, outperforming the best supervised domain-generalizable method by 5.0% and 19.5%, respectively. In the pre-training-to-fine-tuning scenario, ISR achieves state-of-the-art performance with 88.4% Rank-1 on MSMT17.
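
The instance-association step maps directly onto an off-the-shelf assignment solver; a hedged sketch follows (NumPy/SciPy; the similarity threshold is a hypothetical stand-in for the paper's reliability guidance):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instances(feats_a, feats_b, sim_threshold=0.5):
    """Sketch: associate person instances across two frames by
    maximum-weight bipartite matching, keeping confident matches
    as positive pairs for contrastive learning."""
    sim = feats_a @ feats_b.T                  # pairwise similarities
    rows, cols = linear_sum_assignment(-sim)   # maximize total matching weight
    return [(i, j) for i, j in zip(rows, cols) if sim[i, j] > sim_threshold]
```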

Paper10 Iterative Denoiser and Noise Estimator for Self-Supervised Image Denoising

Original abstract: With the emergence of powerful deep learning tools, more and more effective deep denoisers have advanced the field of image denoising. However, the huge progress made by these learning-based methods severely relies on large-scale and high-quality noisy/clean training pairs, which limits the practicality in real-world scenarios. To overcome this, researchers have been exploring self-supervised approaches that can denoise without paired data. However, the unavailable noise prior and inefficient feature extraction take these methods away from high practicality and precision. In this paper, we propose a Denoise-Corrupt-Denoise pipeline (DCD-Net) for self-supervised image denoising. Specifically, we design an iterative training strategy, which iteratively optimizes the denoiser and noise estimator, and gradually approaches high denoising performances using only single noisy images without any noise prior. The proposed self-supervised image denoising framework provides very competitive results compared with state-of-the-art methods on widely used synthetic and real-world image denoising benchmarks.

Summary: With the emergence of powerful deep learning tools, increasingly effective deep denoisers have advanced image denoising, but the huge progress of these learning-based methods relies heavily on large-scale, high-quality noisy/clean training pairs, which limits their practicality in real-world scenarios. Researchers have therefore explored self-supervised approaches that can denoise without paired data, yet the unavailable noise prior and inefficient feature extraction keep these methods from high practicality and precision. This paper proposes DCD-Net, a Denoise-Corrupt-Denoise pipeline for self-supervised image denoising. Specifically, an iterative training strategy alternately optimizes the denoiser and the noise estimator, gradually approaching high denoising performance using only single noisy images, without any noise prior. The proposed framework delivers very competitive results compared with state-of-the-art methods on widely used synthetic and real-world image denoising benchmarks.

Paper11 Noise2Info: Noisy Image to Information of Noise for Self-Supervised Image Denoising

Original abstract: Unsupervised image denoising has been proposed to alleviate the widespread noise problem without requiring clean images. Existing works mainly follow the self-supervised way, which tries to reconstruct each pixel x of noisy images without the knowledge of x. More recently, some pioneer works further emphasize the importance of x and propose to weigh the information extracted from x and other pixels when recovering x. However, such a method is highly sensitive to the standard deviation σ_n of the noise injected into clean images, where σ_n is inaccessible without knowing the clean images. Thus, it is unrealistic to assume that σ_n is known for pursuing high model performance.

To alleviate this issue, we propose Noise2Info to extract the critical information, the standard deviation σ_n of the injected noise, only based on the noisy images. Specifically, we first theoretically provide an upper bound on σ_n, while the bound requires clean images. Then, we propose a novel method to estimate the bound of σ_n by only using noisy images. Besides, we prove that the difference between our estimation and the true deviation becomes smaller as the model trains. Empirical studies show that Noise2Info is effective and robust on benchmark data sets and closely estimates the standard deviation of noises during model training.

Summary: This work targets unsupervised image denoising. Existing research mainly follows the self-supervised paradigm of reconstructing each pixel x of a noisy image without knowledge of x; some recent pioneering works further stress the importance of x and weigh the information extracted from x against that from other pixels when recovering x. Such methods, however, are highly sensitive to the standard deviation σ_n of the noise injected into the clean image, and σ_n cannot be obtained without the clean images, so assuming σ_n is known in pursuit of high model performance is unrealistic.

To address this, the authors propose Noise2Info, which extracts the critical information, namely the standard deviation σ_n of the injected noise, from noisy images alone. They first derive a theoretical upper bound on σ_n that requires clean images, then propose a novel method to estimate this bound using only noisy images, and prove that the gap between the estimate and the true deviation shrinks as training proceeds. Empirical studies show Noise2Info is effective and robust on benchmark datasets and closely estimates the noise standard deviation during training.

Paper12 Random Sub-Samples Generation for Self-Supervised Real Image Denoising

Original abstract: With sufficient paired training samples, the supervised deep learning methods have attracted much attention in image denoising because of their superior performance. However, it is still very challenging to widely utilize the supervised methods in real cases due to the lack of paired noisy-clean images. Meanwhile, most self-supervised denoising methods are ineffective as well when applied to the real-world denoising tasks because of their strict assumptions in applications. For example, as a typical method for self-supervised denoising, the original blind spot network (BSN) assumes that the noise is pixel-wise independent, which is much different from the real cases. To solve this problem, we propose a novel self-supervised real image denoising framework named Sampling Difference As Perturbation (SDAP) based on Random Sub-samples Generation (RSG) with a cyclic sample difference loss. Specifically, we dig deeper into the properties of BSN to make it more suitable for real noise. Surprisingly, we find that adding an appropriate perturbation to the training images can effectively improve the performance of BSN. Further, we propose that the sampling difference can be considered as perturbation to achieve better results. Finally we propose a new BSN framework in combination with our RSG strategy. The results show that it significantly outperforms other state-of-the-art self-supervised denoising methods on real-world datasets. The code is available at https://github.com/p1y2z3/SDAP.

Summary: With sufficient paired training samples, supervised deep learning methods perform impressively on image denoising, but the lack of paired noisy-clean images still makes them hard to apply widely in real cases. Meanwhile, most self-supervised denoising methods are also ineffective on real-world tasks because of their strict assumptions; for example, the original blind-spot network (BSN) assumes pixel-wise independent noise, which differs greatly from real noise. To solve this, the paper proposes SDAP (Sampling Difference As Perturbation), a self-supervised real-image denoising framework based on Random Sub-samples Generation (RSG) with a cyclic sample-difference loss. Digging deeper into the properties of BSN to make it better suited to real noise, the authors find, surprisingly, that adding an appropriate perturbation to the training images effectively improves BSN's performance, and they further propose treating the sampling difference as that perturbation. Combining a new BSN framework with the RSG strategy, the method significantly outperforms other state-of-the-art self-supervised denoising methods on real-world datasets. Code is available at https://github.com/p1y2z3/SDAP.
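
Random sub-sample generation can be sketched as picking one pixel per cell at a random offset, which spatially decorrelates the noise (an illustrative PyTorch sketch, not the paper's exact RSG module):

```python
import torch

def random_sub_samples(img, stride=2):
    """Sketch: from each stride x stride cell, keep one pixel at a random
    offset, yielding a sub-image with more decorrelated noise."""
    b, c, h, w = img.shape
    cells = img.unfold(2, stride, stride).unfold(3, stride, stride)
    cells = cells.reshape(b, c, h // stride, w // stride, stride * stride)
    idx = torch.randint(stride * stride, (b, 1, h // stride, w // stride, 1),
                        device=img.device).expand(-1, c, -1, -1, -1)
    return cells.gather(4, idx).squeeze(4)     # (B, C, H/s, W/s)
```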

Paper13 Denoising Diffusion Autoencoders are Unified Self-supervised Learners

Original abstract: Inspired by recent advances in diffusion models, which are reminiscent of denoising autoencoders, we investigate whether they can acquire discriminative representations for classification via generative pre-training. This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners: by pre-training on unconditional image generation, DDAE has already learned strongly linear-separable representations within its intermediate layers without auxiliary encoders, thus making diffusion pre-training emerge as a general approach for generative-and-discriminative dual learning. To validate this, we conduct linear probe and fine-tuning evaluations. Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet, respectively, and is comparable to contrastive learning and masked autoencoders for the first time. Transfer learning from ImageNet also confirms the suitability of DDAE for Vision Transformers, suggesting the potential to scale DDAEs as unified foundation models. Code is available at github.com/FutureXiang/ddae.

Summary: Inspired by recent advances in diffusion models, which are reminiscent of denoising autoencoders, the authors investigate whether such models can acquire discriminative representations for classification via generative pre-training. The paper shows that the networks in diffusion models, denoising diffusion autoencoders (DDAE), are unified self-supervised learners: pre-trained on unconditional image generation, a DDAE has already learned strongly linearly separable representations in its intermediate layers without auxiliary encoders, making diffusion pre-training emerge as a general approach to generative-and-discriminative dual learning. Linear-probe and fine-tuning evaluations validate this: the diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracy on CIFAR-10 and Tiny-ImageNet respectively, comparable to contrastive learning and masked autoencoders for the first time. Transfer learning from ImageNet also confirms DDAE's suitability for Vision Transformers, suggesting the potential to scale DDAEs as unified foundation models. Code is available at github.com/FutureXiang/ddae.
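
Linear probing itself is generic and easy to sketch (Python; `feature_fn` is a hypothetical frozen feature extractor, e.g. intermediate DDAE activations at a chosen denoising timestep):

```python
import torch
from sklearn.linear_model import LogisticRegression

def linear_probe(feature_fn, train_loader, test_loader):
    """Sketch: freeze the pre-trained network, extract features, and
    fit a linear classifier; the test score is the linear accuracy."""
    def collect(loader):
        feats, labels = [], []
        with torch.no_grad():
            for x, y in loader:
                feats.append(feature_fn(x).flatten(1).cpu())
                labels.append(y)
        return torch.cat(feats).numpy(), torch.cat(labels).numpy()

    x_tr, y_tr = collect(train_loader)
    x_te, y_te = collect(test_loader)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return clf.score(x_te, y_te)               # linear evaluation accuracy
```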

Paper14 Representation Uncertainty in Self-Supervised Learning as Variational Inference

Original abstract: In this study, a novel self-supervised learning (SSL) method is proposed, which considers SSL in terms of variational inference to learn not only representation but also representation uncertainties. SSL is a method of learning representations without labels by maximizing the similarity between image representations of different augmented views of an image. Meanwhile, variational autoencoder (VAE) is an unsupervised representation learning method that trains a probabilistic generative model with variational inference. Both VAE and SSL can learn representations without labels, but their relationship has not been investigated in the past. Herein, the theoretical relationship between SSL and variational inference has been clarified. Furthermore, a novel method, namely variational inference SimSiam (VI-SimSiam), has been proposed. VI-SimSiam can predict the representation uncertainty by interpreting SimSiam with variational inference and defining the latent space distribution. The present experiments qualitatively show that VI-SimSiam could learn uncertainty by comparing input images and predicted uncertainties. Additionally, we described a relationship between estimated uncertainty and classification accuracy.

Summary: This study proposes a novel self-supervised learning (SSL) method that views SSL in terms of variational inference, so as to learn not only representations but also representation uncertainties. SSL learns representations without labels by maximizing the similarity between representations of different augmented views of an image, while the variational autoencoder (VAE) is an unsupervised representation learning method that trains a probabilistic generative model with variational inference. Both learn representations without labels, but their relationship had not been investigated. This work clarifies the theoretical relationship between SSL and variational inference and proposes variational inference SimSiam (VI-SimSiam), which predicts representation uncertainty by interpreting SimSiam through variational inference and defining the latent-space distribution. Experiments qualitatively show that VI-SimSiam can learn uncertainty, by comparing input images with the predicted uncertainties, and the paper also describes a relationship between estimated uncertainty and classification accuracy.

Paper15 Learning by Sorting: Self-supervised Learning with Group Ordering Constraints

Original abstract: Contrastive learning has become an important tool in learning representations from unlabeled data mainly relying on the idea of minimizing distance between positive data pairs, e.g., views from the same images, and maximizing distance between negative data pairs, e.g., views from different images. This paper proposes a new variation of the contrastive learning objective, Group Ordering Constraints (GroCo), that leverages the idea of sorting the distances of positive and negative pairs and computing the respective loss based on how many positive pairs have a larger distance than the negative pairs, and thus are not ordered correctly. To this end, the GroCo loss is based on differentiable sorting networks, which enable training with sorting supervision by matching a differentiable permutation matrix, which is produced by sorting a given set of scores, to a respective ground truth permutation matrix. Applying this idea to groupwise pre-ordered inputs of multiple positive and negative pairs allows introducing the GroCo loss with implicit emphasis on strong positives and negatives, leading to better optimization of the local neighborhood. We evaluate the proposed formulation on various self-supervised learning benchmarks and show that it not only leads to improved results compared to vanilla contrastive learning but also shows competitive performance to comparable methods in linear probing and outperforms current methods in k-NN performance.

Summary: Contrastive learning has become an important tool for learning representations from unlabeled data, mainly by minimizing the distance between positive pairs (e.g., views of the same image) and maximizing the distance between negative pairs (e.g., views of different images). This paper proposes a new variation of the contrastive objective, Group Ordering Constraints (GroCo), which sorts the distances of positive and negative pairs and computes the loss based on how many positive pairs have a larger distance than negative pairs and are thus ordered incorrectly. The GroCo loss builds on differentiable sorting networks, which enable training with sorting supervision by matching a differentiable permutation matrix, produced by sorting a given set of scores, to the corresponding ground-truth permutation matrix. Applying this idea to group-wise pre-ordered inputs of multiple positive and negative pairs yields a loss with implicit emphasis on strong positives and negatives, leading to better optimization of the local neighborhood. Evaluated on various self-supervised benchmarks, the formulation not only improves over vanilla contrastive learning but is also competitive with comparable methods in linear probing and outperforms current methods in k-NN performance.
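
The ordering objective can be approximated with a much simpler differentiable relaxation than the paper's sorting networks; a hedged PyTorch sketch:

```python
import torch

def soft_ordering_loss(pos_dist, neg_dist, temperature=0.1):
    """Sketch: softly count positive/negative pairs whose distances are
    ordered incorrectly (a positive farther away than a negative).
    GroCo itself uses differentiable sorting networks instead."""
    diff = pos_dist.unsqueeze(1) - neg_dist.unsqueeze(0)  # (P, N) margins
    return torch.sigmoid(diff / temperature).mean()       # soft violation rate
```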

Paper16 Self-Supervised Burst Super-Resolution

Original abstract: We introduce a self-supervised training strategy for burst super-resolution that only uses noisy low-resolution bursts during training. Our approach eliminates the need to carefully tune synthetic data simulation pipelines, which often do not match real-world image statistics. Compared to weakly-paired training strategies, which require noisy smartphone burst photos of static scenes, paired with a clean reference obtained from a tripod-mounted DSLR camera, our approach is more scalable, and avoids the color mismatch between the smartphone and DSLR. To achieve this, we propose a new self-supervised objective that uses a forward imaging model to recover a high-resolution image from aliased high frequencies in the burst. Our approach does not require any manual tuning of the forward model’s parameters; we learn them from data. Furthermore, we show our training strategy is robust to dynamic scene motion in the burst, which enables training burst super-resolution models using in-the-wild data. Extensive experiments on real and synthetic data show that, despite only using noisy bursts during training, models trained with our self-supervised strategy match, and sometimes surpass, the quality of fully-supervised baselines trained with synthetic data or weakly-paired ground-truth. Finally, we show our training strategy is general using four different burst super-resolution architectures.

Summary: This paper introduces a self-supervised training strategy for burst super-resolution that uses only noisy low-resolution bursts during training, eliminating the need to carefully tune synthetic data simulation pipelines, which often fail to match real-world image statistics. Compared with weakly-paired strategies, which require noisy smartphone bursts of static scenes paired with a clean reference from a tripod-mounted DSLR, the approach is more scalable and avoids the color mismatch between the smartphone and the DSLR. To achieve this, the authors propose a new self-supervised objective that uses a forward imaging model to recover a high-resolution image from the aliased high frequencies in the burst; the forward model's parameters are learned from data rather than tuned by hand. The training strategy is also shown to be robust to dynamic scene motion in the burst, enabling training of burst super-resolution models on in-the-wild data. Extensive experiments on real and synthetic data show that, despite seeing only noisy bursts during training, models trained with this self-supervised strategy match, and sometimes surpass, the quality of fully supervised baselines trained with synthetic data or weakly-paired ground truth. Finally, the strategy is shown to be general across four different burst super-resolution architectures.

Paper17 An Embarrassingly Simple Backdoor Attack on Self-supervised Learning

Original abstract: As a new paradigm in machine learning, self-supervised learning (SSL) is capable of learning high-quality representations of complex data without relying on labels. In addition to eliminating the need for labeled data, research has found that SSL improves the adversarial robustness over supervised learning since lacking labels makes it more challenging for adversaries to manipulate model predictions. However, the extent to which this robustness superiority generalizes to other types of attacks remains an open question. We explore this question in the context of backdoor attacks. Specifically, we design and evaluate CTRL, an embarrassingly simple yet highly effective self-supervised backdoor attack. By only polluting a tiny fraction of training data (<1%) with indistinguishable poisoning samples, CTRL causes any trigger-embedded input to be misclassified to the adversary’s designated class with a high probability (>99%) at inference time. Our findings suggest that SSL and supervised learning are comparably vulnerable to backdoor attacks. More importantly, through the lens of CTRL, we study the inherent vulnerability of SSL to backdoor attacks. With both empirical and analytical evidence, we reveal that the representation invariance property of SSL, which benefits adversarial robustness, may also be the very reason making SSL highly susceptible to backdoor attacks. Our findings also imply that the existing defenses against supervised backdoor attacks are not easily retrofitted to the unique vulnerability of SSL.

Summary: As a new paradigm in machine learning, self-supervised learning (SSL) can learn high-quality representations of complex data without relying on labels. Besides removing the need for labeled data, research has found that SSL improves adversarial robustness over supervised learning, since the lack of labels makes it harder for adversaries to manipulate model predictions; whether this robustness advantage generalizes to other attack types, however, remains an open question. The authors explore this question in the context of backdoor attacks. Specifically, they design and evaluate CTRL, an embarrassingly simple yet highly effective self-supervised backdoor attack: by polluting only a tiny fraction of training data (<1%) with indistinguishable poisoning samples, CTRL causes any trigger-embedded input to be misclassified into the adversary's designated class with high probability (>99%) at inference time. The findings suggest that SSL and supervised learning are comparably vulnerable to backdoor attacks. More importantly, through the lens of CTRL, the authors study SSL's inherent vulnerability to backdoor attacks: with both empirical and analytical evidence, they reveal that the representation-invariance property of SSL, which benefits adversarial robustness, may also be the very reason SSL is highly susceptible to backdoor attacks. The results also imply that existing defenses against supervised backdoor attacks are not easily retrofitted to this unique vulnerability of SSL.
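
Generic trigger-based poisoning (not CTRL's actual trigger, which is crafted to be visually indistinguishable) can be sketched in a few lines of PyTorch:

```python
import torch

def poison_dataset(images, poison_rate=0.01, patch_size=4):
    """Sketch: stamp a small trigger patch onto a tiny random fraction
    of training images; a backdoored model then associates the trigger
    with the adversary's target class."""
    n = images.shape[0]
    idx = torch.randperm(n)[: max(1, int(n * poison_rate))]
    poisoned = images.clone()
    poisoned[idx, :, :patch_size, :patch_size] = 1.0  # visible corner patch
    return poisoned, idx                              # images + poisoned indices
```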

Paper18 DDS2M: Self-Supervised Denoising Diffusion Spatio-Spectral Model for Hyperspectral Image Restoration

Original abstract: Diffusion models have recently received a surge of interest due to their impressive performance for image restoration, especially in terms of noise robustness. However, existing diffusion-based methods are trained on a large amount of training data and perform very well in-distribution, but can be quite susceptible to distribution shift. This is especially inappropriate for data-starved hyperspectral image (HSI) restoration. To tackle this problem, this work puts forth a self-supervised diffusion model for HSI restoration, namely Denoising Diffusion Spatio-Spectral Model (DDS2M), which works by inferring the parameters of the proposed Variational Spatio-Spectral Module (VS2M) during the reverse diffusion process, solely using the degraded HSI without any extra training data. In VS2M, a variational inference-based loss function is customized to enable the untrained spatial and spectral networks to learn the posterior distribution, which serves as the transitions of the sampling chain to help reverse the diffusion process. Benefiting from its self-supervised nature and the diffusion process, DDS2M enjoys stronger generalization ability to various HSIs compared to existing diffusion-based methods and superior robustness to noise compared to existing HSI restoration methods. Extensive experiments on HSI denoising, noisy HSI completion and super-resolution on a variety of HSIs demonstrate DDS2M’s superiority over the existing task-specific state-of-the-arts. Code is available at: https://github.com/miaoyuchun/DDS2M.

Summary: Diffusion models have recently attracted a surge of interest for their impressive image restoration performance, especially their noise robustness. Existing diffusion-based methods, however, are trained on large amounts of data and, while performing very well in-distribution, can be quite susceptible to distribution shift, which is especially unsuitable for data-starved hyperspectral image (HSI) restoration. To tackle this, the work proposes DDS2M, a self-supervised Denoising Diffusion Spatio-Spectral Model for HSI restoration, which infers the parameters of the proposed Variational Spatio-Spectral Module (VS2M) during the reverse diffusion process using only the degraded HSI, with no extra training data. In VS2M, a variational-inference-based loss is customized so that the untrained spatial and spectral networks can learn the posterior distribution, which serves as the transitions of the sampling chain to help reverse the diffusion process. Thanks to its self-supervised nature and the diffusion process, DDS2M generalizes to various HSIs better than existing diffusion-based methods and is more robust to noise than existing HSI restoration methods. Extensive experiments on HSI denoising, noisy HSI completion, and super-resolution across a variety of HSIs demonstrate DDS2M's superiority over the existing task-specific state of the art. Code is available at https://github.com/miaoyuchun/DDS2M.

Paper19 SimFIR: A Simple Framework for Fisheye Image Rectification with Self-supervised Representation Learning

Original abstract: In fisheye images, rich distinct distortion patterns are regularly distributed in the image plane. These distortion patterns are independent of the visual content and provide informative cues for rectification. To make the best of such rectification cues, we introduce SimFIR, a simple framework for fisheye image rectification based on self-supervised representation learning. Technically, we first split a fisheye image into multiple patches and extract their representations with a Vision Transformer (ViT). To learn fine-grained distortion representations, we then associate different image patches with their specific distortion patterns based on the fisheye model, and further subtly design an innovative unified distortion-aware pretext task for their learning. The transfer performance on the downstream rectification task is remarkably boosted, which verifies the effectiveness of the learned representations. Extensive experiments are conducted, and the quantitative and qualitative results demonstrate the superiority of our method over the state-of-the-art algorithms as well as its strong generalization ability on real-world fisheye images.

Summary: In fisheye images, rich and distinctive distortion patterns are regularly distributed across the image plane; they are independent of the visual content and provide informative cues for rectification. To make the best of these cues, the paper introduces SimFIR, a simple framework for fisheye image rectification based on self-supervised representation learning. Technically, a fisheye image is first split into multiple patches whose representations are extracted with a Vision Transformer (ViT). To learn fine-grained distortion representations, different image patches are then associated with their specific distortion patterns according to the fisheye model, and an innovative unified distortion-aware pretext task is subtly designed for learning them. Transfer performance on the downstream rectification task is remarkably boosted, verifying the effectiveness of the learned representations. Extensive quantitative and qualitative experiments demonstrate the method's superiority over state-of-the-art algorithms and its strong generalization to real-world fisheye images.

Paper20 SEMPART: Self-supervised Multi-resolution Partitioning of Image Semantics

Original abstract: Accurately determining salient regions of an image is challenging when labeled data is scarce. DINO-based self-supervised approaches have recently leveraged meaningful image semantics captured by patch-wise features for locating foreground objects. Recent methods have also incorporated intuitive priors and demonstrated value in unsupervised methods for object partitioning. In this paper, we propose SEMPART, which jointly infers coarse and fine bi-partitions over an image’s DINO-based semantic graph. Furthermore, SEMPART preserves fine boundary details using graph-driven regularization and successfully distills the coarse mask semantics into the fine mask. Our salient object detection and single object localization findings suggest that SEMPART produces high-quality masks rapidly without additional post-processing and benefits from co-optimizing the coarse and fine branches.

Summary: Accurately determining the salient regions of an image is challenging when labeled data is scarce. DINO-based self-supervised approaches have recently leveraged the meaningful image semantics captured by patch-wise features to locate foreground objects, and recent methods have also incorporated intuitive priors, demonstrating value in unsupervised object partitioning. This paper proposes SEMPART, which jointly infers coarse and fine bi-partitions over an image's DINO-based semantic graph. SEMPART further preserves fine boundary details via graph-driven regularization and successfully distills the coarse-mask semantics into the fine mask. Salient object detection and single-object localization results suggest that SEMPART rapidly produces high-quality masks without additional post-processing and benefits from co-optimizing the coarse and fine branches.

Paper21 FreeCOS: Self-Supervised Learning from Fractals and Unlabeled Images for Curvilinear Object Segmentation

Original abstract: Curvilinear object segmentation is critical for many applications. However, manually annotating curvilinear objects is very time-consuming and error-prone, yielding insufficiently available annotated datasets for existing supervised methods and domain adaptation methods. This paper proposes a self-supervised curvilinear object segmentation method (FreeCOS) that learns robust and distinctive features from fractals and unlabeled images. The key contributions include a novel Fractal-FDA synthesis (FFS) module and a geometric information alignment (GIA) approach. FFS generates curvilinear structures based on the parametric Fractal L-system and integrates the generated structures into unlabeled images to obtain synthetic training images via Fourier Domain Adaptation. GIA reduces the intensity differences between the synthetic and unlabeled images by comparing the intensity order of a given pixel to the values of its nearby neighbors. Such image alignment can explicitly remove the dependency on absolute intensity values and enhance the inherent geometric characteristics which are common in both synthetic and real images. In addition, GIA aligns features of synthetic and real images via the prediction space adaptation loss (PSAL) and the curvilinear mask contrastive loss (CMCL). Extensive experimental results on four public datasets, i.e., XCAD, DRIVE, STARE and CrackTree demonstrate that our method outperforms the state-of-the-art unsupervised methods, self-supervised methods and traditional methods by a large margin. The source code of this work is available at https://github.com/TY-Shi/FreeCOS.

Summary: This paper proposes FreeCOS, a self-supervised curvilinear object segmentation method that learns robust and distinctive features from fractals and unlabeled images, addressing the fact that manually annotating curvilinear objects is time-consuming and error-prone, which leaves insufficient annotated data for existing supervised and domain adaptation methods. The key contributions are a novel Fractal-FDA Synthesis (FFS) module and a Geometric Information Alignment (GIA) approach. FFS generates curvilinear structures from a parametric fractal L-system and integrates them into unlabeled images via Fourier Domain Adaptation to obtain synthetic training images. GIA reduces the intensity differences between synthetic and unlabeled images by comparing the intensity order of a given pixel to the values of its nearby neighbors; such alignment explicitly removes the dependency on absolute intensity values and enhances the inherent geometric characteristics common to synthetic and real images. In addition, GIA aligns the features of synthetic and real images via a prediction space adaptation loss (PSAL) and a curvilinear mask contrastive loss (CMCL). Extensive experiments on four public datasets (XCAD, DRIVE, STARE, and CrackTree) show the method outperforms state-of-the-art unsupervised, self-supervised, and traditional methods by a large margin. Source code is available at https://github.com/TY-Shi/FreeCOS.

Paper22 Learn TAROT with MENTOR: A Meta-Learned Self-Supervised Approach for Trajectory Prediction

Original abstract: Predicting diverse yet admissible trajectories that adhere to the map constraints is challenging. Graph-based scene encoders have been proven effective for preserving local structures of maps by defining lane-level connections. However, such encoders do not capture more complex patterns emerging from long-range heterogeneous connections between nonadjacent interacting lanes. To this end, we shed new light on learning common driving patterns by introducing meTA ROad paTh (TAROT) to formulate combinations of various relations between lanes on the road topology. Intuitively, this can be viewed as finding feasible routes. Furthermore, we propose MEta-road NeTwORk (MENTOR) that helps trajectory prediction by providing it with TAROT as navigation tips. More specifically, 1) we define TAROT prediction as a novel self-supervised proxy task to identify the complex heterogeneous structure of the map. 2) For typical driving actions, we establish several TAROTs that result in multiple Heterogeneous Structure Learning (HSL) tasks. These tasks are used in MENTOR, which performs meta-learning by simultaneously predicting trajectories along with proxy tasks, identifying an optimal combination of them, and automatically balancing them to improve the primary task. We show that our model achieves state-of-the-art performance on the Argoverse dataset, especially on diversity and admissibility metrics, achieving up to 20% improvements in challenging scenarios. We further investigate the contribution of proposed modules in ablation studies.

Summary: Predicting diverse yet admissible trajectories that adhere to map constraints is challenging. Graph-based scene encoders effectively preserve the local structure of maps by defining lane-level connections, but they do not capture the more complex patterns that emerge from long-range heterogeneous connections between nonadjacent interacting lanes. To this end, the paper introduces meTA ROad paTh (TAROT) to learn common driving patterns by formulating combinations of various relations between lanes on the road topology, which can intuitively be viewed as finding feasible routes, and further proposes MEta-road NeTwORk (MENTOR), which aids trajectory prediction by providing TAROT as navigation tips. Specifically, TAROT prediction is defined as a novel self-supervised proxy task for identifying the map's complex heterogeneous structure, and several TAROTs for typical driving actions yield multiple Heterogeneous Structure Learning (HSL) tasks. These tasks are used in MENTOR, which performs meta-learning by simultaneously predicting trajectories along with the proxy tasks, identifying an optimal combination of them, and automatically balancing them to improve the primary task. The model achieves state-of-the-art performance on the Argoverse dataset, especially on diversity and admissibility metrics, with up to 20% improvement in challenging scenarios; ablation studies further examine the contribution of each proposed module.

Paper23 Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need

Original abstract: Self-Supervised Learning (SSL) has emerged as the solution of choice to learn transferable representations from unlabeled data. However, SSL requires building samples that are known to be semantically akin, i.e. positive views. Requiring such knowledge is the main limitation of SSL and is often tackled by ad-hoc strategies, e.g. applying known data-augmentations to the same input. In this work, we generalize and formalize this principle through Positive Active Learning (PAL), where an oracle queries semantic relationships between samples. PAL achieves three main objectives. First, it is a theoretically grounded learning framework that encapsulates standard SSL but also supervised and semi-supervised learning depending on the employed oracle. Second, it provides a consistent algorithm to embed a priori knowledge, e.g. some observed labels, into any SSL losses without any change in the training pipeline. Third, it provides a proper active learning framework yielding low-cost solutions to annotate datasets, arguably bridging the gap between theory and practice of active learning that is based on simple-to-answer-by-non-experts queries of semantic relationships between inputs.

Summary: Self-supervised learning (SSL) has emerged as the solution of choice for learning transferable representations from unlabeled data, but it requires building samples known to be semantically akin, i.e., positive views. Needing such knowledge is SSL's main limitation and is usually handled by ad-hoc strategies, such as applying known data augmentations to the same input. This work generalizes and formalizes that principle as Positive Active Learning (PAL), where an oracle queries semantic relationships between samples. PAL achieves three main objectives. First, it is a theoretically grounded learning framework that encompasses standard SSL as well as supervised and semi-supervised learning, depending on the oracle employed. Second, it provides a consistent algorithm for embedding a priori knowledge, such as some observed labels, into any SSL loss without changing the training pipeline. Third, it provides a proper active learning framework with low-cost solutions for annotating datasets, arguably bridging the gap between the theory and practice of active learning, based on queries about semantic relationships between inputs that are simple for non-experts to answer.

Paper24 DeLiRa: Self-Supervised Depth, Light, and Radiance Fields

Original abstract: Differentiable volumetric rendering is a powerful paradigm for 3D reconstruction and novel view synthesis. However, standard volume rendering approaches struggle with degenerate geometries in the case of limited viewpoint diversity, a common scenario in robotics applications. In this work, we propose to use the multi-view photometric objective from the self-supervised depth estimation literature as a geometric regularizer for volumetric rendering, significantly improving novel view synthesis without requiring additional information. Building upon this insight, we explore the explicit modeling of scene geometry using a generalist Transformer, jointly learning a radiance field as well as depth and light fields with a set of shared latent codes. We demonstrate that sharing geometric information across tasks is mutually beneficial, leading to improvements over single-task learning without an increase in network complexity. Our DeLiRa architecture achieves state-of-the-art results on the ScanNet benchmark, enabling high quality volumetric rendering as well as real-time novel view and depth synthesis in the limited viewpoint diversity setting.

Summary: Differentiable volumetric rendering is a powerful paradigm for 3D reconstruction and novel view synthesis, but standard volume rendering approaches struggle with degenerate geometries when viewpoint diversity is limited, a common scenario in robotics. This work proposes using the multi-view photometric objective from the self-supervised depth estimation literature as a geometric regularizer for volumetric rendering, significantly improving novel view synthesis without requiring additional information. Building on this insight, the authors explicitly model scene geometry with a generalist Transformer, jointly learning a radiance field together with depth and light fields from a set of shared latent codes. Sharing geometric information across tasks proves mutually beneficial, improving over single-task learning without increasing network complexity. The DeLiRa architecture achieves state-of-the-art results on the ScanNet benchmark, enabling high-quality volumetric rendering as well as real-time novel view and depth synthesis in the limited-viewpoint-diversity setting.

Paper25 Self-supervised Image Denoising with Downsampled Invariance Loss and Conditional Blind-Spot Network

Original abstract: There have been many image denoisers using deep neural networks, which outperform conventional model-based methods by large margins. Recently, self-supervised methods have attracted attention because constructing a large real noise dataset for supervised training is an enormous burden. The most representative self-supervised denoisers are based on blind-spot networks, which exclude the receptive field’s center pixel. However, excluding any input pixel is abandoning some information, especially when the input pixel at the corresponding output position is excluded. In addition, a standard blind-spot network fails to reduce real camera noise due to the pixel-wise correlation of noise, though it successfully removes independently distributed synthetic noise. Hence, to realize a more practical denoiser, we propose a novel self-supervised training framework that can remove real noise. For this, we derive the theoretic upper bound of a supervised loss where the network is guided by the downsampled blinded output. Also, we design a conditional blind-spot network (C-BSN), which selectively controls the blindness of the network to use the center pixel information. Furthermore, we exploit a random subsampler to decorrelate noise spatially, making the C-BSN free of visual artifacts that were often seen in downsample-based methods. Extensive experiments show that the proposed C-BSN achieves state-of-the-art performance on real-world datasets as a self-supervised denoiser and shows qualitatively pleasing results without any post-processing or refinement.

Summary: Many deep-network image denoisers outperform conventional model-based methods by large margins, and self-supervised methods have recently attracted attention because building a large real-noise dataset for supervised training is an enormous burden. The most representative self-supervised denoisers are based on blind-spot networks, which exclude the center pixel of the receptive field; but excluding any input pixel abandons information, especially the input pixel at the corresponding output position. Moreover, a standard blind-spot network fails to reduce real camera noise because real noise is pixel-wise correlated, even though it successfully removes independently distributed synthetic noise. To realize a more practical denoiser, the paper proposes a novel self-supervised training framework that can remove real noise. It derives the theoretical upper bound of a supervised loss in which the network is guided by the downsampled blinded output, and designs a conditional blind-spot network (C-BSN) that selectively controls the network's blindness so that the center-pixel information can be used. A random subsampler further decorrelates noise spatially, freeing C-BSN from the visual artifacts often seen in downsample-based methods. Extensive experiments show that C-BSN achieves state-of-the-art self-supervised performance on real-world datasets and yields qualitatively pleasing results without any post-processing or refinement.

Paper26 EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity

Original abstract: Self-supervised monocular scene flow estimation, aiming to understand both 3D structures and 3D motions from two temporally consecutive monocular images, has received increasing attention for its simple and economical sensor setup. However, the accuracy of current methods suffers from the bottleneck of less-efficient network architecture and lack of motion rigidity for regularization. In this paper, we propose a superior model named EMR-MSF by borrowing the advantages of network architecture design under the scope of supervised learning. We further impose explicit and robust geometric constraints with an elaborately constructed ego-motion aggregation module where a rigidity soft mask is proposed to filter out dynamic regions for stable ego-motion estimation using static regions. Moreover, we propose a motion consistency loss along with a mask regularization loss to fully exploit static regions. Several efficient training strategies are integrated including a gradient detachment technique and an enhanced view synthesis process for better performance. Our proposed method outperforms the previous self-supervised works by a large margin and catches up to the performance of supervised methods. On the KITTI scene flow benchmark, our approach improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44% and demonstrates superior performance across sub-tasks including depth and visual odometry, amongst other self-supervised single-task or multi-task methods.

Summary: Self-supervised monocular scene flow estimation, which aims to understand both 3D structure and 3D motion from two temporally consecutive monocular images, has received increasing attention for its simple and economical sensor setup, but the accuracy of current methods is bottlenecked by less efficient network architectures and the lack of motion rigidity for regularization. This paper proposes EMR-MSF, a superior model that borrows the advantages of network architecture design from the supervised learning literature. It further imposes explicit, robust geometric constraints via an elaborately constructed ego-motion aggregation module, in which a rigidity soft mask filters out dynamic regions so that ego-motion can be estimated stably from static regions. A motion consistency loss and a mask regularization loss are proposed to fully exploit static regions, and several efficient training strategies are integrated, including a gradient detachment technique and an enhanced view synthesis process, for better performance. The method outperforms previous self-supervised work by a large margin and catches up with supervised methods: on the KITTI scene flow benchmark, it improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44% and performs strongly across sub-tasks such as depth and visual odometry among self-supervised single-task and multi-task methods.

Paper27 Implicit Autoencoder for Point-Cloud Self-Supervised Representation Learning

Original abstract: This paper advocates the use of implicit surface representation in autoencoder-based self-supervised 3D representation learning. The most popular and accessible 3D representation, i.e., point clouds, involves discrete samples of the underlying continuous 3D surface. This discretization process introduces sampling variations on the 3D shape, making it challenging to develop transferable knowledge of the true 3D geometry. In the standard autoencoding paradigm, the encoder is compelled to encode not only the 3D geometry but also information on the specific discrete sampling of the 3D shape into the latent code. This is because the point cloud reconstructed by the decoder is considered unacceptable unless there is a perfect mapping between the original and the reconstructed point clouds. This paper introduces the Implicit AutoEncoder (IAE), a simple yet effective method that addresses the sampling variation issue by replacing the commonly-used point-cloud decoder with an implicit decoder. The implicit decoder reconstructs a continuous representation of the 3D shape, independent of the imperfections in the discrete samples. Extensive experiments demonstrate that the proposed IAE achieves state-of-the-art performance across various self-supervised learning benchmarks.

Summary: This paper advocates implicit surface representations for autoencoder-based self-supervised 3D representation learning. The most popular and accessible 3D representation, point clouds, consists of discrete samples of the underlying continuous 3D surface; this discretization introduces sampling variations of the 3D shape, making it challenging to develop transferable knowledge of the true 3D geometry. In the standard autoencoding paradigm, the encoder is compelled to encode not only the 3D geometry but also the specific discrete sampling of the shape into the latent code, because the decoder's reconstructed point cloud is deemed unacceptable unless it maps perfectly onto the original. The paper introduces the Implicit AutoEncoder (IAE), a simple yet effective method that addresses the sampling-variation issue by replacing the commonly used point-cloud decoder with an implicit decoder, which reconstructs a continuous representation of the 3D shape independent of the imperfections in the discrete samples. Extensive experiments show that IAE achieves state-of-the-art performance across various self-supervised learning benchmarks.

Paper28 Self-supervised Pre-training for Mirror Detection

Original abstract: Existing mirror detection methods require supervised ImageNet pre-training to obtain good general-purpose image features. However, supervised ImageNet pre-training focuses on category-level discrimination and may not be suitable for downstream tasks like mirror detection, due to the overfitting upstream tasks (e.g., supervised image classification). We observe that mirror reflection is crucial to how people perceive the presence of mirrors, and such mid-level features can be better transferred from self-supervised pre-trained models. Inspired by this observation, in this paper we aim to improve mirror detection methods by proposing a new self-supervised learning (SSL) pre-training framework for modeling the representation of mirror reflection progressively in the pre-training process.

Our framework consists of three pre-training stages at different levels:

1. an image-level pre-training stage to globally incorporate mirror reflection features into the pre-trained model;
2. a patch-level pre-training stage to spatially simulate and learn local mirror reflection from image patches; and
3. a pixel-level pre-training stage to pixel-wisely capture mirror reflection via reconstructing corrupted mirror images based on the relationship between the inside and outside of mirrors.

Extensive experiments show that our SSL pre-training framework significantly outperforms previous state-of-the-art CNN-based SSL pre-training frameworks and even outperforms supervised ImageNet pre-training when transferred to the mirror detection task. Code and models are available at https://jiaying.link/iccv2023-sslmirror/

Summary: Existing mirror detection methods require supervised ImageNet pre-training to obtain good general-purpose image features. However, supervised ImageNet pre-training focuses on category-level discrimination and may not suit downstream tasks like mirror detection, owing to overfitting to the upstream task (e.g., supervised image classification). The authors observe that mirror reflection is crucial to how people perceive the presence of mirrors, and such mid-level features transfer better from self-supervised pre-trained models. Inspired by this observation, the paper proposes a new self-supervised learning (SSL) pre-training framework that progressively models the representation of mirror reflection during pre-training.

The framework comprises three pre-training stages at different levels:
1) an image-level stage that globally incorporates mirror reflection features into the pre-trained model;
2) a patch-level stage that spatially simulates and learns local mirror reflection from image patches;
3) a pixel-level stage that captures mirror reflection pixel-wise by reconstructing corrupted mirror images based on the relationship between the inside and outside of mirrors.

Extensive experiments show that this SSL pre-training framework significantly outperforms previous state-of-the-art CNN-based SSL pre-training frameworks and even beats supervised ImageNet pre-training when transferred to mirror detection.

Code and models are available at https://jiaying.link/iccv2023-sslmirror/.

Paper29 Pseudo Flow Consistency for Self-Supervised 6D Object Pose Estimation

Original abstract: Most self-supervised 6D object pose estimation methods can only work with additional depth information or rely on the accurate annotation of 2D segmentation masks, limiting their application range. In this paper, we propose a 6D object pose estimation method that can be trained with pure RGB images without any auxiliary information. We first obtain a rough pose initialization from networks trained on synthetic images rendered from the target’s 3D mesh. Then, we introduce a refinement strategy leveraging the geometry constraint in synthetic-to-real image pairs from multiple different views. We formulate this geometry constraint as pixel-level flow consistency between the training images with dynamically generated pseudo labels. We evaluate our method on three challenging datasets and demonstrate that it outperforms state-of-the-art self-supervised methods significantly, with neither 2D annotations nor additional depth images.

Summary: Most self-supervised 6D object pose estimation methods work only with additional depth information or rely on accurately annotated 2D segmentation masks, which limits their application range. The authors propose a 6D object pose estimation method that can be trained with pure RGB images and no auxiliary information. They first obtain a rough pose initialization from networks trained on synthetic images rendered from the target's 3D mesh, then introduce a refinement strategy that leverages the geometry constraint in synthetic-to-real image pairs from multiple different views, formulated as pixel-level flow consistency between the training images with dynamically generated pseudo labels. Evaluated on three challenging datasets, the method significantly outperforms state-of-the-art self-supervised methods, with neither 2D annotations nor additional depth images.

Paper30 SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields

Original abstract: 3D reconstruction from a single 2D image was extensively covered in the literature but relies on depth supervision at training time, which limits its applicability. To relax the dependence to depth we propose SceneRF, a self-supervised monocular scene reconstruction method using only posed image sequences for training. Fueled by the recent progress in neural radiance fields (NeRF) we optimize a radiance field though with explicit depth optimization and a novel probabilistic sampling strategy to efficiently handle large scenes. At inference, a single input image suffices to hallucinate novel depth views which are fused together to obtain 3D scene reconstruction. Thorough experiments demonstrate that we outperform all baselines for novel depth views synthesis and scene reconstruction, on indoor BundleFusion and outdoor SemanticKITTI. Code is available at https://astra-vision.github.io/SceneRF.

Summary: This paper introduces SceneRF, a self-supervised monocular scene reconstruction method. Traditional 3D reconstruction from a single 2D image relies on depth supervision at training time, limiting its applicability; to relax the dependence on depth, SceneRF is trained using only posed image sequences. Fueled by recent progress in neural radiance fields (NeRF), it optimizes a radiance field with explicit depth optimization and a novel probabilistic sampling strategy to handle large scenes efficiently. At inference, a single input image suffices to hallucinate novel depth views, which are fused together to obtain the 3D scene reconstruction. Thorough experiments show it outperforms all baselines on indoor BundleFusion and outdoor SemanticKITTI. Code is available at https://astra-vision.github.io/SceneRF.

Paper31 Semantics-Consistent Feature Search for Self-Supervised Visual Representation Learning

Original abstract: In contrastive self-supervised learning, the common way to learn discriminative representations is to pull different augmented "views" of the same image closer while pushing all other images further apart, which has been proven to be effective. However, it is unavoidable to construct undesirable views containing different semantic concepts during the augmentation procedure. It would damage the semantic consistency of the representation to pull these augmentations closer in the feature space indiscriminately. In this study, we introduce feature-level augmentation and propose a novel semantics-consistent feature search (SCFS) method to mitigate this negative effect. The main idea of SCFS is to adaptively search semantics-consistent features to enhance the contrast between semantics-consistent regions in different augmentations. Thus, the trained model can learn to focus on meaningful object regions, improving the semantic representation ability. Extensive experiments conducted on different datasets and tasks demonstrate that SCFS effectively improves the performance of self-supervised learning and achieves state-of-the-art performance on different downstream tasks.

Summary: In contrastive self-supervised learning, the common way to learn discriminative representations is to pull different augmented "views" of the same image closer while pushing all other images apart, which has proven effective. During augmentation, however, constructing undesirable views containing different semantic concepts is unavoidable, and pulling such augmentations together indiscriminately in feature space damages the semantic consistency of the representation. This study introduces feature-level augmentation and proposes a novel semantics-consistent feature search (SCFS) method to mitigate this negative effect. The main idea of SCFS is to adaptively search for semantics-consistent features so as to enhance the contrast between semantics-consistent regions in different augmentations; the trained model thus learns to focus on meaningful object regions, improving its semantic representation ability. Extensive experiments on different datasets and tasks show that SCFS effectively improves self-supervised learning and achieves state-of-the-art performance on various downstream tasks.

Paper32 Two-in-One Depth: Bridging the Gap Between Monocular and Binocular Self-Supervised Depth Estimation

Original abstract: Monocular and binocular self-supervised depth estimations are two important and related tasks in computer vision, which aim to predict scene depths from single images and stereo image pairs respectively. In literature, the two tasks are usually tackled separately by two different kinds of models, and binocular models generally fail to predict depth from single images, while the prediction accuracy of monocular models is generally inferior to binocular models. In this paper, we propose a Two-in-One self-supervised depth estimation network, called TiO-Depth, which could not only compatibly handle the two tasks, but also improve the prediction accuracy. TiO-Depth employs a Siamese architecture and each sub-network of it could be used as a monocular depth estimation model. For binocular depth estimation, a Monocular Feature Matching module is proposed for incorporating the stereo knowledge between the two images, and the full TiO-Depth is used to predict depths. We also design a multi-stage joint-training strategy for improving the performances of TiO-Depth in both two tasks by combining the relative advantages of them. Experimental results on the KITTI, Cityscapes, and DDAD datasets demonstrate that TiO-Depth outperforms both the monocular and binocular state-of-the-art methods in most cases, and further verify the feasibility of a two-in-one network for monocular and binocular depth estimation. The code is available at https://github.com/ZM-Zhou/TiO-Depth_pytorch.

Summary: Monocular and binocular self-supervised depth estimation are two important, related computer vision tasks that aim to predict scene depth from single images and stereo image pairs respectively. The literature usually tackles them separately with two different kinds of models: binocular models generally fail to predict depth from single images, while the prediction accuracy of monocular models is generally inferior to binocular ones. This paper proposes TiO-Depth, a Two-in-One self-supervised depth estimation network that not only handles both tasks compatibly but also improves prediction accuracy. TiO-Depth employs a Siamese architecture, and each of its sub-networks can be used as a monocular depth estimation model; for binocular depth estimation, a Monocular Feature Matching module incorporates the stereo knowledge between the two images, and the full TiO-Depth predicts depth. A multi-stage joint-training strategy further improves TiO-Depth's performance on both tasks by combining their relative advantages. Experiments on KITTI, Cityscapes, and DDAD show that TiO-Depth outperforms both monocular and binocular state-of-the-art methods in most cases, further verifying the feasibility of a two-in-one network for monocular and binocular depth estimation. Code is available at https://github.com/ZM-Zhou/TiO-Depth_pytorch.

Paper33 CROSSFIRE: Camera Relocalization On Self-Supervised Features from an Implicit Representation

Original abstract: Beyond novel view synthesis, Neural Radiance Fields are useful for applications that interact with the real world. In this paper, we use them as an implicit map of a given scene and propose a camera relocalization algorithm tailored for this representation. The proposed method enables to compute in real-time the precise position of a device using a single RGB camera, during its navigation. In contrast with previous work, we do not rely on pose regression or photometric alignment but rather use dense local features obtained through volumetric rendering which are specialized on the scene with a self-supervised objective. As a result, our algorithm is more accurate than competitors, able to operate in dynamic outdoor environments with changing lighting conditions and can be readily integrated in any volumetric neural renderer.

Summary: Beyond novel view synthesis, Neural Radiance Fields are useful for applications that interact with the real world. This paper uses them as an implicit map of a given scene and proposes a camera relocalization algorithm tailored to this representation, enabling the precise position of a device to be computed in real time from a single RGB camera during navigation. In contrast with previous work, the method relies neither on pose regression nor on photometric alignment; instead it uses dense local features obtained through volumetric rendering, specialized to the scene with a self-supervised objective. As a result, the algorithm is more accurate than its competitors, can operate in dynamic outdoor environments with changing lighting conditions, and can be readily integrated into any volumetric neural renderer.

Paper34 Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations

Original abstract: Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves not only the representation quality for videos - but also images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to image representations. Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images. We believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos. The implementation can be found here: https://github.com/SMSD75/Timetuning

Summary: Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications in unsupervised segmentation and pre-training for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. The paper addresses this gap with a novel approach that incorporates temporal consistency into dense self-supervised learning. While methods designed solely for images struggle to reach even the same performance on videos, this method improves representation quality not only for videos but also for images. The approach, called time-tuning, starts from image-pretrained models and fine-tunes them on unlabeled videos with a novel self-supervised temporal-alignment clustering loss, effectively transferring high-level information from videos to image representations. Time-tuning improves the state of the art in unsupervised semantic segmentation by 8-10% on videos and matches it on images. The authors believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos. The implementation can be found at https://github.com/SMSD75/Timetuning.

Paper35 Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders

Original abstract: This study explores the application of self-supervised learning (SSL) to the task of motion forecasting, an area that has not yet been extensively investigated despite the widespread success of SSL in computer vision and natural language processing. To address this gap, we introduce Forecast-MAE, an extension of the mask autoencoders framework that is specifically designed for self-supervised learning of the motion forecasting task. Our approach includes a novel masking strategy that leverages the strong interconnections between agents’ trajectories and road networks, involving complementary masking of agents’ future or history trajectories and random masking of lane segments. Our experiments on the challenging Argoverse 2 motion forecasting benchmark show that Forecast-MAE, which utilizes standard Transformer blocks with minimal inductive bias, achieves competitive performance compared to state-of-the-art methods that rely on supervised learning and sophisticated designs. Moreover, it outperforms the previous self-supervised learning method by a significant margin. Code is available at https://github.com/jchengai/forecast-mae.

Summary: This study explores applying self-supervised learning (SSL) to motion forecasting, an area not yet extensively investigated despite SSL's widespread success in computer vision and natural language processing. To fill this gap, the authors introduce Forecast-MAE, an extension of the masked autoencoders framework designed specifically for self-supervised learning of the motion forecasting task. The approach includes a novel masking strategy that leverages the strong interconnections between agents' trajectories and road networks, combining complementary masking of agents' future or history trajectories with random masking of lane segments, as sketched below. Experiments on the challenging Argoverse 2 motion forecasting benchmark show that Forecast-MAE, which uses standard Transformer blocks with minimal inductive bias, is competitive with state-of-the-art methods that rely on supervised learning and sophisticated designs, and it outperforms the previous self-supervised learning method by a significant margin. Code is available at https://github.com/jchengai/forecast-mae.
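
The complementary masking of histories and futures is simple to sketch (NumPy; lane-segment masking, which the paper does randomly, is omitted here):

```python
import numpy as np

def complementary_trajectory_mask(num_agents, rng=None):
    """Sketch: for each agent, hide either the history or the future
    segment of its trajectory (never both), so the model must
    reconstruct the hidden half from the visible one."""
    rng = rng or np.random.default_rng()
    mask_future = rng.random(num_agents) < 0.5  # True: future is masked
    mask_history = ~mask_future                 # complementary: history masked
    return mask_history, mask_future
```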
