WACV 2023 Paper Digest: Domain Shift and Domain Adaptation


Paper1 CellTranspose: Few-Shot Domain Adaptation for Cellular Instance Segmentation

摘要原文: Automated cellular instance segmentation is a process utilized for accelerating biological research for the past two decades, and recent advancements have produced higher quality results with less effort from the biologist. Most current endeavors focus on completely cutting the researcher out of the picture by generating highly generalized models. However, these models invariably fail when faced with novel data, distributed differently than the ones used for training. Rather than approaching the problem with methods that presume the availability of large amounts of target data and computing power for retraining, in this work we address the even greater challenge of designing an approach that requires minimal amounts of new annotated data as well as training time. We do so by designing specialized contrastive losses that leverage the few annotated samples very efficiently. A large set of results show that 3 to 5 annotations lead to models with accuracy that: 1) significantly mitigate the covariate shift effects; 2) matches or surpasses other adaptation methods; 3) even approaches methods that have been fully retrained on the target distribution. The adaptation training is only a few minutes, paving a path towards a balance between model performance, computing requirements and expert-level annotation needs.

Summary: Automated cellular instance segmentation has been accelerating biological research for two decades, and most current work tries to remove the researcher from the loop by building highly generalized models, which nevertheless fail on novel data distributed differently from the training set. Instead of assuming large amounts of target data and compute for retraining, this work designs specialized contrastive losses that exploit only a handful of annotated samples. Experiments show that 3 to 5 annotations yield models that significantly mitigate covariate shift, match or surpass other adaptation methods, and even approach models fully retrained on the target distribution, with adaptation taking only a few minutes, balancing model performance, compute requirements, and expert annotation effort.
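The abstract does not spell out the loss, but the mechanism it describes, pulling the few annotated target cells toward same-class features, is in the family of supervised contrastive losses. Below is a minimal sketch of such a loss with hypothetical tensor names; it is an illustration of the general technique, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Pull same-class embeddings together, push different classes apart.

    embeddings: (N, D) features, e.g. source samples plus the 3-5 annotated
    target cells in one batch; labels: (N,) integer class ids.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                      # (N, N) scaled cosine similarity
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)             # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = positives.sum(dim=1).clamp(min=1)
    loss = -(log_prob * positives.float()).sum(dim=1) / pos_count
    return loss[positives.any(dim=1)].mean()           # anchors with >= 1 positive

# toy usage: 8 "source" + 4 "annotated target" embeddings
feats = torch.randn(12, 64, requires_grad=True)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3, 0, 1, 2, 3])
print(supervised_contrastive_loss(feats, labels).item())
```

In practice the few annotated target crops would be embedded alongside source samples so that every anchor has at least one positive.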

Paper2 Towards Online Domain Adaptive Object Detection

摘要原文: Existing object detection models assume both the training and test data are sampled from the same source domain. This assumption does not hold true when these detectors are deployed in real-world applications, where they encounter new visual domains. Unsupervised Domain Adaptation (UDA) methods are generally employed to mitigate the adverse effects caused by domain shift. Existing UDA methods operate in an offline manner where the model is first adapted toward the target domain and then deployed in real-world applications. However, this offline adaptation strategy is not suitable for real-world applications as the model frequently encounters new domain shifts. Hence, it is critical to develop a feasible UDA method that generalizes to the new domain shifts encountered during deployment time in a continuous online manner. To this end, we propose a novel unified adaptation framework that adapts and improves generalization on the target domain in both offline and online settings. Specifically, we introduce MemXformer - a cross-attention transformer-based memory module where items in the memory take advantage of domain shifts and record prototypical patterns of the target distribution. Further, MemXformer produces strong positive and negative pairs to guide a novel contrastive loss, which enhances target-specific representation learning. Experiments on diverse detection benchmarks show that the proposed strategy produces state-of-the-art performance in both offline and online settings. To the best of our knowledge, this is the first work to address online and offline adaptation settings for object detection. Source code will be released after review.

Summary: Existing detectors assume training and test data come from the same domain, which breaks in real deployments where new visual domains keep appearing; offline UDA, which adapts once and then deploys, is therefore insufficient. The paper proposes a unified framework that adapts in both offline and online settings, built around MemXformer, a cross-attention transformer memory module whose items record prototypical patterns of the target distribution and supply strong positive and negative pairs for a novel contrastive loss that strengthens target-specific representations. Experiments on diverse detection benchmarks show state-of-the-art results in both settings; the authors state this is the first work addressing online and offline adaptation for object detection. Source code will be released after review.
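The abstract describes MemXformer only at a high level. The sketch below shows the generic pattern of reading a learnable memory bank with cross-attention in PyTorch; the class name, sizes, and the residual fusion are assumptions for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class MemoryCrossAttention(nn.Module):
    """Query image features against a bank of learnable memory items."""

    def __init__(self, dim=256, num_items=128, num_heads=8):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_items, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: (B, N, dim) flattened detector features for a batch of images
        mem = self.memory.unsqueeze(0).expand(feats.size(0), -1, -1)
        read, attn_weights = self.attn(query=feats, key=mem, value=mem)
        # fuse the memory read-out back into the features (residual + norm)
        return self.norm(feats + read), attn_weights

block = MemoryCrossAttention()
x = torch.randn(2, 100, 256)      # e.g., 100 proposal features per image
out, w = block(x)
print(out.shape, w.shape)         # (2, 100, 256), (2, 100, 128)
```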

Paper3 Exploiting Instance-Based Mixed Sampling via Auxiliary Source Domain Supervision for Domain-Adaptive Action Detection

摘要原文: We propose a novel domain adaptive action detection approach and a new adaptation protocol that leverages the recent advancements in image-level unsupervised domain adaptation (UDA) techniques and handles vagaries of instance-level video data. Self-training combined with cross-domain mixed sampling has shown remarkable performance gain in semantic segmentation in UDA (unsupervised domain adaptation) context. Motivated by this fact, we propose an approach for human action detection in videos that transfers knowledge from the source domain (annotated dataset) to the target domain (unannotated dataset) using mixed sampling and pseudo-label-based self-training. The existing UDA techniques follow a ClassMix algorithm for semantic segmentation. However, simply adopting ClassMix for action detection does not work, mainly because these are two entirely different problems, i.e., pixel-label classification vs. instance-label detection. To tackle this, we propose a novel action instance mixed sampling technique that combines information across domains based on action instances instead of action classes. Moreover, we propose a new UDA training protocol that addresses the long-tail sample distribution and domain shift problem by using supervision from an auxiliary source domain (ASD). For the ASD, we propose a new action detection dataset with dense frame-level annotations. We name our proposed framework as domain-adaptive action instance mixing (DA-AIM). We demonstrate that DA-AIM consistently outperforms prior works on challenging domain adaptation benchmarks. The source code is available at https://github.com/wwwfan628/DA-AIM.

Summary: The paper proposes a domain-adaptive action detection approach and a new adaptation protocol that carries image-level UDA advances over to instance-level video data, transferring knowledge from an annotated source domain to an unannotated target domain through mixed sampling and pseudo-label self-training. Because ClassMix, designed for pixel-level semantic segmentation, does not directly transfer to instance-level action detection, the authors introduce an action-instance mixed sampling technique that combines information across domains at the instance rather than the class level. They further propose a UDA training protocol that uses supervision from an auxiliary source domain (ASD), backed by a new densely frame-level annotated action detection dataset, to address long-tail distributions and domain shift. The resulting framework, DA-AIM, consistently outperforms prior work on challenging benchmarks; source code is available at https://github.com/wwwfan628/DA-AIM.

Paper4 Cross-Domain Video Anomaly Detection Without Target Domain Adaptation

摘要原文: Most cross-domain unsupervised Video Anomaly Detection (VAD) works assume that at least few task-relevant target domain training data are available for adaptation from the source to the target domain. However, this requires laborious model-tuning by the end-user who may prefer to have a system that works “out-of-the-box”. To address such practical scenarios, we identify a novel target domain (inference-time) VAD task where no target domain training data are available. To this end, we propose a new ‘Zero-shot Cross-domain Video Anomaly Detection (zxvad)’ framework that includes a future-frame prediction generative model setup. Different from prior future-frame prediction models, our model uses a novel Normalcy Classifier module to learn the features of normal event videos by learning how such features are different “relative” to features in pseudo-abnormal examples. A novel Untrained Convolutional Neural Network based Anomaly Synthesis module crafts these pseudo-abnormal examples by adding foreign objects in normal video frames with no extra training cost. With our novel relative normalcy feature learning strategy, zxvad generalizes and learns to distinguish between normal and abnormal frames in a new target domain without adaptation during inference. Through evaluations on common datasets, we show that zxvad outperforms the state-of-the-art (SOTA), regardless of whether task-relevant (i.e., VAD) source training data are available or not. Lastly, zxvad also beats the SOTA methods in inference-time efficiency metrics including the model size, total parameters, GPU energy consumption, and GMACs.

Summary: The paper introduces zero-shot cross-domain video anomaly detection (zxvad) for the practical setting where no target-domain training data are available, avoiding the per-deployment model tuning that prior cross-domain VAD methods require. The framework builds on future-frame prediction and adds a Normalcy Classifier that learns normal-event features "relative" to pseudo-abnormal examples, which are synthesized at no extra training cost by an untrained CNN-based module that inserts foreign objects into normal frames. With this relative normalcy learning, zxvad distinguishes normal from abnormal frames in a new target domain without any adaptation at inference time. On common datasets it outperforms the state of the art whether or not task-relevant source training data are available, and it also beats SOTA methods on inference-time efficiency metrics such as model size, total parameters, GPU energy consumption, and GMACs.

Paper5 CoNMix for Source-Free Single and Multi-Target Domain Adaptation

摘要原文: This work introduces the novel task of Source-free Multi-target Domain Adaptation and proposes adaptation framework comprising of Consistency with Nuclear-Norm Maximization and MixUp knowledge distillation (CoNMix) as a solution to this problem. The main motive of this work is to solve for Single and Multi target Domain Adaptation (SMTDA) for the source-free paradigm, which enforces a constraint where the labeled source data is not available during target adaptation due to various privacy-related restrictions on data sharing. The source-free approach leverages target pseudo labels, which can be noisy, to improve the target adaptation. We introduce consistency between label preserving augmentations and utilize pseudo label refinement methods to reduce noisy pseudo labels. Further, we propose novel MixUp Knowledge Distillation (MKD) for better generalization on multiple target domains using various source-free STDA models. We also show that the Vision Transformer (VT) backbone gives better feature representation with improved domain transferability and class discriminability. Our proposed framework achieves the state-of-the-art (SOTA) results in various paradigms of source-free STDA and MTDA settings on popular domain adaptation datasets like Office-Home, Office-Caltech, and DomainNet. Project Page: https://sites.google.com/view/conmix-vcl

Summary: This work introduces source-free multi-target domain adaptation and proposes CoNMix, which combines consistency with nuclear-norm maximization and MixUp knowledge distillation. The setting forbids access to labeled source data during target adaptation because of privacy-related data-sharing restrictions, so adaptation relies on noisy target pseudo labels; the method enforces consistency between label-preserving augmentations and refines pseudo labels to suppress noise. MixUp knowledge distillation (MKD) then distills several source-free STDA models for better generalization over multiple target domains, and a Vision Transformer backbone is shown to give more transferable and class-discriminative features. CoNMix reaches state-of-the-art results in source-free STDA and MTDA settings on Office-Home, Office-Caltech, and DomainNet. Project page: https://sites.google.com/view/conmix-vcl.
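Nuclear-norm maximization on batch predictions is a known technique for encouraging target predictions that are simultaneously confident and class-diverse. A minimal sketch of such a term follows; the exact formulation CoNMix uses may differ, so this only illustrates the idea.

```python
import torch
import torch.nn.functional as F

def nuclear_norm_loss(logits):
    """Batch nuclear-norm maximization on unlabeled target predictions.

    logits: (B, C) classifier outputs. The nuclear norm (sum of singular
    values) of the softmax matrix grows when predictions are confident and
    spread over many classes; minimizing its negative maximizes it.
    """
    probs = F.softmax(logits, dim=1)              # (B, C) batch prediction matrix
    nuclear = torch.linalg.svdvals(probs).sum()   # sum of singular values
    return -nuclear / probs.size(0)

logits = torch.randn(32, 65, requires_grad=True)  # e.g., Office-Home has 65 classes
loss = nuclear_norm_loss(logits)
loss.backward()
print(loss.item())
```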

Paper6 Empirical Generalization Study: Unsupervised Domain Adaptation vs. Domain Generalization Methods for Semantic Segmentation in the Wild

摘要原文: For autonomous vehicles and mobile robots to safely operate in the real world, i.e., the wild, scene understanding models should perform well in the many different scenarios that can be encountered. In reality, these scenarios are not all represented in the model’s training data, leading to poor performance. To tackle this, current training strategies attempt to either exploit additional unlabeled data with unsupervised domain adaptation (UDA), or to reduce overfitting using the limited available labeled data with domain generalization (DG). However, it is not clear from current literature which of these methods allows for better generalization to unseen data from the wild. Therefore, in this work, we present an evaluation framework in which the generalization capabilities of state-of-the-art UDA and DG methods can be compared fairly. From this evaluation, we find that UDA methods, which leverage unlabeled data, outperform DG methods in terms of generalization, and can deliver similar performance on unseen data as fully-supervised training methods that require all data to be labeled. We show that semantic segmentation performance can be increased up to 30% for a priori unknown data without using any extra labeled data.

Summary: For autonomous vehicles and mobile robots to operate safely in the wild, scene-understanding models must handle scenarios that are missing from their training data. Current strategies either exploit extra unlabeled data with unsupervised domain adaptation (UDA) or reduce overfitting on the limited labeled data with domain generalization (DG), and the literature has been unclear about which generalizes better to unseen data. The paper presents an evaluation framework for comparing state-of-the-art UDA and DG methods fairly and finds that UDA methods, which leverage unlabeled data, generalize better than DG methods and can match fully supervised training on unseen data; semantic segmentation performance on a priori unknown data improves by up to 30% without using any additional labeled data.

Paper7 Domain Invariant Vision Transformer Learning for Face Anti-Spoofing

摘要原文: Existing face anti-spoofing (FAS) models have achieved high performance on specific datasets. However, for the application of real-world systems, the FAS model should generalize to the data from unknown domains rather than only achieve good results on a single baseline. As vision transformer models have demonstrated astonishing performance and strong capability in learning discriminative information, we investigate applying transformers to distinguish the face presentation attacks over unknown domains. In this work, we propose the Domain-invariant Vision Transformer (DiVT) for FAS, which adopts two losses to improve the generalizability of the vision transformer. First, a concentration loss is employed to learn a domain-invariant representation that aggregates the features of real face data. Second, a separation loss is utilized to union each type of attack from different domains. The experimental results show that our proposed method achieves state-of-the-art performance on the protocols of domain-generalized FAS tasks. Compared to previous domain generalization FAS models, our proposed method is simpler but more effective.

Summary: Face anti-spoofing (FAS) models perform well on specific datasets but must generalize to data from unknown domains in real-world systems. Leveraging the strong discriminative ability of vision transformers, the paper proposes the Domain-invariant Vision Transformer (DiVT) with two losses: a concentration loss that learns a domain-invariant representation by aggregating real-face features, and a separation loss that unites each attack type across domains. DiVT achieves state-of-the-art results on domain-generalized FAS protocols while being simpler than previous domain-generalization FAS models.

Paper8 Learning Across Domains and Devices: Style-Driven Source-Free Domain Adaptation in Clustered Federated Learning

摘要原文: Federated Learning (FL) has recently emerged as a possible way to tackle the domain shift in real-world Semantic Segmentation (SS) without compromising the private nature of the collected data. However, most of the existing works on FL unrealistically assume labeled data in the remote clients. Here we propose a novel task (FFREEDA) in which the clients’ data is unlabeled and the server accesses a source labeled dataset for pre-training only. To solve FFREEDA, we propose LADD, which leverages the knowledge of the pre-trained model by employing self-supervision with ad-hoc regularization techniques for local training and introducing a novel federated clustered aggregation scheme based on the clients’ style. Our experiments show that our algorithm is able to efficiently tackle the new task outperforming existing approaches. The code is available at https://github.com/Erosinho13/LADD.

Summary: Federated learning (FL) can address domain shift in semantic segmentation without exposing private client data, but most FL work unrealistically assumes labeled data on the remote clients. The paper defines a new task, FFREEDA, in which client data are unlabeled and the server uses a labeled source dataset only for pre-training, and proposes LADD, which exploits the pre-trained model through self-supervised local training with ad-hoc regularization and a novel federated clustered aggregation scheme based on the clients' styles. Experiments show the algorithm tackles the new task efficiently and outperforms existing approaches; code is available at https://github.com/Erosinho13/LADD.

Paper9 Reducing Annotation Effort by Identifying and Labeling Contextually Diverse Classes for Semantic Segmentation Under Domain Shift

摘要原文: Abstract not available

Summary: Not available (no abstract was provided for this paper).

Paper10 How To Practice VQA on a Resource-Limited Target Domain

摘要原文: Visual question answering (VQA) is an active research area at the intersection of computer vision and natural language understanding. One major obstacle that keeps VQA models that perform well on benchmarks from being as successful on real-world applications, is the lack of annotated Image-Question-Answer triplets in the task of interest. In this work, we focus on a previously overlooked perspective, which is the disparate effectiveness of transfer learning and domain adaptation methods depending on the amount of labeled/unlabeled data available. We systematically investigated the visual domain gaps and question-defined textual gaps, and compared different knowledge transfer strategies under unsupervised, self-supervised, semi-supervised and fully-supervised adaptation scenarios. We show that different methods have varied sensitivity and requirements for data amount in the target domain. We conclude by sharing the best practice from our exploration regarding transferring VQA models to resource-limited target domains.

Summary: A major obstacle keeping benchmark-strong VQA models from succeeding in real applications is the lack of annotated Image-Question-Answer triplets for the task of interest. The paper studies a previously overlooked aspect: how the effectiveness of transfer learning and domain adaptation methods varies with the amount of labeled and unlabeled target data. It systematically analyzes visual domain gaps and question-defined textual gaps, compares knowledge-transfer strategies under unsupervised, self-supervised, semi-supervised, and fully supervised adaptation, shows that methods differ in their sensitivity to the amount of target-domain data, and distills best practices for transferring VQA models to resource-limited target domains.

Paper11 Select, Label, and Mix: Learning Discriminative Invariant Feature Representations for Partial Domain Adaptation

摘要原文: Partial domain adaptation which assumes that the unknown target label space is a subset of the source label space has attracted much attention in computer vision. Despite recent progress, existing methods often suffer from three key problems: negative transfer, lack of discriminability, and domain invariance in the latent space. To alleviate the above issues, we develop a novel ‘Select, Label, and Mix’ (SLM) framework that aims to learn discriminative invariant feature representations for partial domain adaptation. First, we present an efficient “select” module that automatically filters out the outlier source samples to avoid negative transfer while aligning distributions across both domains. Second, the “label” module iteratively trains the classifier using both the labeled source domain data and the generated pseudo-labels for the target domain to enhance the discriminability of the latent space. Finally, the “mix” module utilizes domain mixup regularization jointly with the other two modules to explore more intrinsic structures across domains leading to a domain-invariant latent space for partial domain adaptation. Extensive experiments on several benchmark datasets demonstrate the superiority of our proposed framework over state-of-the-art methods. Project page: https://cvir.github.io/projects/slm.

Summary: Partial domain adaptation assumes the unknown target label space is a subset of the source label space, and existing methods suffer from negative transfer, weak discriminability, and insufficient domain invariance in the latent space. The proposed "Select, Label, and Mix" (SLM) framework addresses all three: a "select" module filters out outlier source samples to avoid negative transfer while aligning the two domains; a "label" module iteratively trains the classifier on labeled source data plus target pseudo labels to improve discriminability; and a "mix" module applies domain mixup regularization jointly with the other two modules to uncover cross-domain structure and obtain a domain-invariant latent space. Extensive benchmark experiments show SLM outperforms state-of-the-art methods. Project page: https://cvir.github.io/projects/slm.

Paper12 Discrete Cosin TransFormer: Image Modeling From Frequency Domain

摘要原文: In this paper, we propose Discrete Cosin TransFormer (DCFormer) that directly learn semantics from DCT-based frequency domain representation. We first show that transformer-based networks are able to learn semantics directly from frequency domain representation based on discrete cosine transform (DCT) without compromising the performance. To achieve the desired efficiency-effectiveness trade-off, we then leverage an input information compression on its frequency domain representation, which highlights the visually significant signals inspired by JPEG compression. We explore different frequency domain down-sampling strategies and show that it is possible to preserve the semantic meaningful information by strategically dropping the high-frequency components. The proposed DCFormer is tested on various downstream tasks including image classification, object detection and instance segmentation, and achieves state-of-the-art comparable performance with less FLOPs, and outperforms the commonly used backbone (e.g. SWIN) at similar FLOPs. Our ablation results also show that the proposed method generalizes well on different transformer backbones.

Summary: The paper proposes the Discrete Cosin TransFormer (DCFormer), which learns semantics directly from a DCT-based frequency-domain representation without sacrificing performance. To trade efficiency against effectiveness, the input is compressed in the frequency domain, highlighting visually significant signals in the spirit of JPEG compression; experiments with different frequency-domain down-sampling strategies show that strategically dropping high-frequency components preserves semantically meaningful information. DCFormer achieves performance comparable to the state of the art on image classification, object detection, and instance segmentation with fewer FLOPs, outperforms common backbones such as Swin at similar FLOPs, and generalizes well across transformer backbones.
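As a rough illustration of moving an image to the DCT domain and dropping high-frequency coefficients, here is a NumPy/SciPy sketch. The fraction kept and the function names are illustrative assumptions, not DCFormer's actual input pipeline.

```python
import numpy as np
from scipy.fft import dctn, idctn

def dct_low_frequency_input(image, keep=0.25):
    """Transform an image with a 2D DCT and keep only the lowest-frequency
    coefficients (top-left corner), similar in spirit to JPEG compression.

    image: (H, W) or (H, W, C) float array; keep: fraction of coefficients
    retained along each spatial dimension. Returns the truncated coefficient
    array that a frequency-domain model would consume.
    """
    coeffs = dctn(image, axes=(0, 1), norm='ortho')
    h = max(1, int(image.shape[0] * keep))
    w = max(1, int(image.shape[1] * keep))
    return coeffs[:h, :w, ...]

def reconstruct_from_low_frequency(coeffs, full_shape):
    """Zero-pad the kept coefficients and invert the DCT, to visualize what
    information survives the frequency-domain down-sampling."""
    padded = np.zeros(full_shape, dtype=coeffs.dtype)
    padded[:coeffs.shape[0], :coeffs.shape[1], ...] = coeffs
    return idctn(padded, axes=(0, 1), norm='ortho')

img = np.random.rand(224, 224, 3).astype(np.float32)
low = dct_low_frequency_input(img, keep=0.25)        # (56, 56, 3) coefficients
approx = reconstruct_from_low_frequency(low, img.shape)
print(low.shape, approx.shape)
```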

Paper13 Intra-Source Style Augmentation for Improved Domain Generalization

摘要原文: The generalization with respect to domain shifts, as they frequently appear in applications such as autonomous driving, is one of the remaining big challenges for deep learning models. Therefore, we propose an intra-source style augmentation (ISSA) method to improve domain generalization in semantic segmentation. Our method is based on a novel masked noise encoder for StyleGAN2 inversion. The model learns to faithfully reconstruct the image preserving its semantic layout through noise prediction. Random masking of the estimated noise enables the style mixing capability of our model, i.e. it allows to alter the global appearance without affecting the semantic layout of an image. Using the proposed masked noise encoder to randomize style and content combinations in the training set, ISSA effectively increases the diversity of training data and reduces spurious correlation. As a result, we achieve up to 12.4% mIoU improvements on driving-scene semantic segmentation under different types of data shifts, i.e., changing geographic locations, adverse weather conditions, and day to night. ISSA is model-agnostic and straightforwardly applicable with CNNs and Transformers. It is also complementary to other domain generalization techniques, e.g., it improves the recent state-of-the-art solution RobustNet by 3% mIoU in Cityscapes to Dark Zurich.

Summary: Generalizing across domain shifts, as frequently encountered in applications such as autonomous driving, remains a major challenge for deep models. The paper proposes intra-source style augmentation (ISSA) for domain generalization in semantic segmentation, built on a novel masked noise encoder for StyleGAN2 inversion that reconstructs images faithfully while preserving their semantic layout; randomly masking the estimated noise enables style mixing, i.e., changing global appearance without altering the layout. Randomizing style and content combinations in the training set increases data diversity and reduces spurious correlations, yielding up to 12.4% mIoU improvement on driving-scene segmentation under geographic, adverse-weather, and day-to-night shifts. ISSA is model-agnostic, works with CNNs and Transformers, and is complementary to other DG techniques, for example improving RobustNet by 3% mIoU on Cityscapes to Dark Zurich.

Paper14 TVT: Transferable Vision Transformer for Unsupervised Domain Adaptation

摘要原文: Unsupervised domain adaptation (UDA) aims to transfer the knowledge learnt from a labeled source domain to an unlabeled target domain. Previous work is mainly built upon convolutional neural networks (CNNs) to learn domain-invariant representations. With the recent exponential increase in applying Vision Transformer (ViT) to vision tasks, the capability of ViT in adapting cross-domain knowledge, however, remains unexplored in the literature. To fill this gap, this paper first comprehensively investigates the performance of ViT on a variety of domain adaptation tasks. Surprisingly, ViT demonstrates superior generalization ability, while the performance can be further improved by incorporating adversarial adaptation. Notwithstanding, directly using CNNs-based adaptation strategies fails to take the advantage of ViT’s intrinsic merits (e.g., attention mechanism and sequential image representation) which play an important role in knowledge transfer. To remedy this, we propose an unified framework, namely Transferable Vision Transformer (TVT), to fully exploit the transferability of ViT for domain adaptation. Specifically, we delicately devise a novel and effective unit, which we term Transferability Adaption Module (TAM). By injecting learned transferabilities into attention blocks, TAM compels ViT focus on both transferable and discriminative features. Besides, we leverage discriminative clustering to enhance feature diversity and separation which are undermined during adversarial domain alignment. To verify its versatility, we perform extensive studies of TVT on four benchmarks and the experimental results demonstrate that TVT attains significant improvements compared to existing state-of-the-art UDA methods.

Summary: Unsupervised domain adaptation (UDA) transfers knowledge from a labeled source to an unlabeled target domain, and prior work mostly builds on CNNs to learn domain-invariant representations. The paper first investigates ViT on various adaptation tasks and finds it generalizes surprisingly well, with further gains from adversarial adaptation; however, directly applying CNN-style adaptation strategies fails to exploit ViT's intrinsic merits such as the attention mechanism and sequential image representation. The proposed Transferable Vision Transformer (TVT) introduces a Transferability Adaption Module (TAM) that injects learned transferability into attention blocks so ViT focuses on features that are both transferable and discriminative, and uses discriminative clustering to restore the feature diversity and separation weakened by adversarial alignment. On four benchmarks, TVT improves significantly over state-of-the-art UDA methods.

Paper15 Center-Aware Adversarial Augmentation for Single Domain Generalization

摘要原文: Domain generalization (DG) aims to learn a model from multiple training (i.e., source) domains that can generalize well to the unseen test (i.e., target) data coming from a different distribution. Single domain generalization (Single-DG) has recently emerged to tackle a more challenging, yet realistic setting, where only one source domain is available at training time. The existing Single-DG approaches typically are based on data augmentation strategies and aim to expand the span of source data by augmenting out-of-domain samples. Generally speaking, they aim to generate hard examples to confuse the classifier. While this may make the classifier robust to small perturbation, the generated samples are typically not diverse enough to mimic a large domain shift, resulting in sub-optimal generalization performance To alleviate this, we propose a center-aware adversarial augmentation technique that expands the source distribution by altering the source samples so as to push them away from the class centers via a novel angular center loss. We conduct extensive experiments to demonstrate the effectiveness of our approach on several benchmark datasets for Single-DG and show that our method outperforms the state-of-the-art in most cases.

Summary: Single domain generalization (Single-DG) trains on one source domain and must generalize to unseen target distributions. Existing Single-DG methods rely on data augmentation to expand the source span with out-of-domain samples, typically generating hard examples to confuse the classifier; this yields robustness to small perturbations, but the generated samples are rarely diverse enough to mimic large domain shifts, leading to sub-optimal generalization. The paper proposes center-aware adversarial augmentation, which expands the source distribution by altering source samples so that a novel angular center loss pushes them away from their class centers, and shows on several Single-DG benchmarks that the method outperforms the state of the art in most cases.
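The angular center loss is not specified in detail in the abstract. The sketch below only illustrates the underlying quantity, the cosine distance between a sample's feature and its class center, which an augmentation step could maximize to push samples away from the centers; names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def angular_distance_to_center(features, labels, centers):
    """Mean angular (cosine) distance between each feature and its class center.

    features: (B, D) embeddings of (augmented) source samples,
    labels: (B,) class ids, centers: (C, D) class center vectors.
    An augmentation step would maximize this quantity while the classifier is
    still trained to recognize the altered samples.
    """
    f = F.normalize(features, dim=1)
    c = F.normalize(centers, dim=1)
    cos = (f * c[labels]).sum(dim=1)   # cosine similarity to own class center
    return (1.0 - cos).mean()          # larger value = farther from the center

feats = torch.randn(16, 128, requires_grad=True)
labels = torch.randint(0, 10, (16,))
centers = torch.randn(10, 128)
print(angular_distance_to_center(feats, labels, centers).item())
```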

Paper16 Backprop Induced Feature Weighting for Adversarial Domain Adaptation With Iterative Label Distribution Alignment

摘要原文: The requirement for large labeled datasets is one of the limiting factors for training accurate deep neural networks. Unsupervised domain adaptation tackles this problem of limited training data by transferring knowledge from one domain, which has many labeled data, to a different domain for which little to no labeled data is available. One common approach is to learn domain-invariant features for example with an adversarial approach. Previous methods often train the domain classifier and label classifier network separately, where both classification networks have little interaction with each other. In this paper, we introduce a classifier-based backprop-induced weighting of the feature space. This approach has two main advantages. Firstly, it lets the domain classifier focus on features that are important for the classification, and, secondly, it couples the classification and adversarial branch more closely. Furthermore, we introduce an iterative label distribution alignment method, that employs results of previous runs to approximate a class-balanced dataloader. We conduct experiments and ablation studies on three benchmarks Office-31, OfficeHome, and DomainNet to show the effectiveness of our proposed algorithm.

Summary: The need for large labeled datasets limits the training of accurate deep networks; unsupervised domain adaptation transfers knowledge from a label-rich domain to one with little or no labels, commonly by learning domain-invariant features with an adversarial approach. Previous methods train the domain classifier and label classifier largely separately, with little interaction between the two networks. This paper introduces a classifier-based, backprop-induced weighting of the feature space, which lets the domain classifier focus on classification-relevant features and couples the classification and adversarial branches more closely, plus an iterative label-distribution alignment method that uses results from previous runs to approximate a class-balanced dataloader. Experiments and ablations on Office-31, OfficeHome, and DomainNet demonstrate the effectiveness of the algorithm.
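For context, the standard building block behind this kind of adversarial domain adaptation is the gradient reversal layer, which the paper's backprop-induced weighting couples more tightly to the label classifier. A minimal PyTorch sketch of the reversal layer itself (not the paper's weighting scheme) follows.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; multiplies gradients by -lambda on the
    way back, so the feature extractor learns to fool the domain classifier."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

# toy check: gradient of (reversed feature).sum() w.r.t. the feature is -lambda
feat = torch.ones(4, 8, requires_grad=True)
grad_reverse(feat, lambd=0.5).sum().backward()
print(feat.grad[0, 0].item())   # -0.5
```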

Paper17 Domain Adaptive Video Semantic Segmentation via Cross-Domain Moving Object Mixing

摘要原文: The network trained for domain adaptation is prone to bias toward the easy-to-transfer classes. Since the ground truth label on the target domain is unavailable during training, the bias problem leads to skewed predictions, forgetting to predict hard-to-transfer classes. To address this problem, we propose Cross-domain Moving Object Mixing (CMOM) that cuts several objects, including hard-to-transfer classes, in the source domain video clip and pastes them into the target domain video clip. Unlike image-level domain adaptation, the temporal context should be maintained to mix moving objects in two different videos. Therefore, we design CMOM to mix with consecutive video frames, so that unrealistic movements are not occurring. We additionally propose Feature Alignment with Temporal Context (FATC) to enhance target domain feature discriminability. FATC exploits the robust source domain features, which are trained with ground truth labels, to learn discriminative target domain features in an unsupervised manner by filtering unreliable predictions with temporal consensus. We demonstrate the effectiveness of the proposed approaches through extensive experiments. In particular, our model reaches mIoU of 53.81% on VIPER -> Cityscapes-Seq benchmark and mIoU of 56.31% on SYNTHIA-Seq -> Cityscapes-Seq benchmark, surpassing the state-of-the-art methods by large margins.

Summary: Networks trained for domain adaptation tend to bias toward easy-to-transfer classes, and since target ground truth is unavailable during training, this leads to skewed predictions that forget hard-to-transfer classes. Cross-domain Moving Object Mixing (CMOM) cuts several objects, including hard-to-transfer classes, from source-domain video clips and pastes them into target-domain clips; unlike image-level mixing, it mixes consecutive frames so that the temporal context is preserved and no unrealistic motion appears. Feature Alignment with Temporal Context (FATC) additionally exploits robust source features trained with ground-truth labels to learn discriminative target features in an unsupervised manner, filtering unreliable predictions by temporal consensus. The model reaches 53.81% mIoU on VIPER -> Cityscapes-Seq and 56.31% mIoU on SYNTHIA-Seq -> Cityscapes-Seq, surpassing the state of the art by large margins.

Paper18 Self-Distillation for Unsupervised 3D Domain Adaptation

摘要原文: Point cloud classification is a popular task in 3D vision. However, previous works, usually assume that point clouds at test time are obtained with the same procedure or sensor as those at training time. Unsupervised Domain Adaptation (UDA) instead, breaks this assumption and tries to solve the task on an unlabeled target domain, leveraging only on a supervised source domain. For point cloud classification, recent UDA methods try to align features across domains via auxiliary tasks such as point cloud reconstruction, which however do not optimize the discriminative power in the target domain in feature space. In contrast, in this work, we focus on obtaining a discriminative feature space for the target domain enforcing consistency between a point cloud and its augmented version. We then propose a novel iterative self-training methodology that exploits Graph Neural Networks in the UDA context to refine pseudo-labels. We perform extensive experiments and set the new state-of-the art in standard UDA benchmarks for point cloud classification. Finally, we show how our approach can be extended to more complex tasks such as part segmentation.

Summary: Point cloud classification usually assumes test-time point clouds are acquired with the same procedure or sensor as the training data; unsupervised domain adaptation (UDA) drops this assumption and solves the task on an unlabeled target domain using only a supervised source domain. Recent UDA methods align features via auxiliary tasks such as point cloud reconstruction, which do not optimize discriminability in the target feature space. This work instead builds a discriminative target feature space by enforcing consistency between a point cloud and its augmented version, and proposes an iterative self-training scheme that uses Graph Neural Networks to refine pseudo labels. It sets a new state of the art on standard UDA benchmarks for point cloud classification and extends to more complex tasks such as part segmentation.

Paper19 ConfMix: Unsupervised Domain Adaptation for Object Detection via Confidence-Based Mixing

摘要原文: Unsupervised Domain Adaptation (UDA) for object detection aims to adapt a model trained on a source domain to detect instances from a new target domain for which annotations are not available. Different from traditional approaches, we propose ConfMix, the first method that introduces a sample mixing strategy based on region-level detection confidence for adaptive object detector learning. We mix the local region of the target sample that corresponds to the most confident pseudo detections with a source image, and apply an additional consistency loss term to gradually adapt towards the target data distribution. In order to robustly define a confidence score for a region, we exploit the confidence score per pseudo detection that accounts for both the detector-dependent confidence and the bounding box uncertainty. Moreover, we propose a novel pseudo labelling scheme that progressively filters the pseudo target detections using the confidence metric that varies from a loose to strict manner along the training. We perform extensive experiments with three datasets, achieving state-of-the-art performance in two of them and approaching the supervised target model performance in the other. Code is available at https://github.com/giuliomattolin/ConfMix.

Summary: UDA for object detection adapts a source-trained detector to a target domain for which no annotations are available. ConfMix is the first method to use a sample-mixing strategy driven by region-level detection confidence: the local region of the target image containing the most confident pseudo detections is mixed with a source image, and an additional consistency loss gradually adapts the detector to the target distribution. The region confidence combines detector confidence with bounding-box uncertainty, and a novel progressive pseudo-labelling scheme filters pseudo detections with a confidence criterion that tightens from loose to strict over training. On three datasets, ConfMix achieves state-of-the-art results on two and approaches the supervised target model on the third. Code: https://github.com/giuliomattolin/ConfMix.
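Below is a much-simplified sketch of the region-mixing idea, using image quadrants and the mean pseudo-detection score as the region confidence. ConfMix's actual confidence also incorporates bounding-box uncertainty and its region choice and progressive thresholds differ, so treat this purely as an illustration under those assumptions.

```python
import torch

def confmix_quadrant(source_img, target_img, target_boxes, target_scores):
    """Paste the target-image quadrant with the highest mean detection
    confidence onto the source image (a simplified region-mixing step).

    source_img, target_img: (C, H, W) tensors of the same size.
    target_boxes: (N, 4) pseudo-detection boxes as (x1, y1, x2, y2).
    target_scores: (N,) confidence of each pseudo detection.
    """
    _, H, W = target_img.shape
    quads = [(0, 0), (0, W // 2), (H // 2, 0), (H // 2, W // 2)]  # top-left corners
    cx = (target_boxes[:, 0] + target_boxes[:, 2]) / 2
    cy = (target_boxes[:, 1] + target_boxes[:, 3]) / 2

    best_q, best_conf = 0, -1.0
    for q, (y0, x0) in enumerate(quads):
        inside = ((cx >= x0) & (cx < x0 + W // 2) &
                  (cy >= y0) & (cy < y0 + H // 2))
        conf = target_scores[inside].mean().item() if inside.any() else 0.0
        if conf > best_conf:
            best_q, best_conf = q, conf

    mixed = source_img.clone()
    y0, x0 = quads[best_q]
    mixed[:, y0:y0 + H // 2, x0:x0 + W // 2] = \
        target_img[:, y0:y0 + H // 2, x0:x0 + W // 2]
    return mixed, best_q

src, tgt = torch.rand(3, 512, 512), torch.rand(3, 512, 512)
boxes = torch.tensor([[30., 40., 90., 120.], [300., 350., 380., 430.]])
scores = torch.tensor([0.9, 0.4])
mixed, q = confmix_quadrant(src, tgt, boxes, scores)
print(mixed.shape, q)
```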

Paper20 SALAD: Source-Free Active Label-Agnostic Domain Adaptation for Classification, Segmentation and Detection

摘要原文: We present a novel method, SALAD, for the challenging vision task of adapting a pre-trained “source” domain network to a “target” domain, with a small budget for annotation in the “target” domain and a shift in the label space. Further, the task assumes that the source data is not available for adaptation, due to privacy concerns or otherwise. We postulate that such systems need to jointly optimize the dual task of (i) selecting fixed number of samples from the target domain for annotation and (ii) transfer of knowledge from the pre-trained network to the target domain. To do this, SALAD consists of a novel Guided Attention Transfer Network (GATN) and an active learning function, HAL. The GATN enables feature distillation from pre-trained network to the target network, complemented with the target samples mined by HAL using transfer-ability and uncertainty criteria. SALAD has three key benefits: (i) it is task-agnostic, and can be applied across various visual tasks such as classification, segmentation and detection; (ii) it can handle shifts in output label space from the pre-trained source network to the target domain; (iii) it does not require access to source data for adaptation. We conduct extensive experiments across 3 visual tasks, viz. digits classification (MNIST, SVHN, VISDA), synthetic (GTA5) to real (CityScapes) image segmentation, and document layout detection (PubLayNet to DSSE). We show that our source-free approach, SALAD, results in an improvement of 0.5%-31.3% (across datasets and tasks) over prior adaptation methods that assume access to large amounts of annotated source data for adaptation.

Summary: SALAD adapts a pre-trained source-domain network to a target domain under a small target annotation budget and a shift in label space, without access to the source data (for privacy or other reasons). It jointly optimizes (i) selecting a fixed number of target samples for annotation and (ii) transferring knowledge from the pre-trained network, via a Guided Attention Transfer Network (GATN) for feature distillation and an active-learning function, HAL, that mines target samples using transferability and uncertainty criteria. SALAD is task-agnostic (classification, segmentation, detection), handles shifts in the output label space, and requires no source data. Across digit classification (MNIST, SVHN, VISDA), synthetic-to-real segmentation (GTA5 to CityScapes), and document layout detection (PubLayNet to DSSE), it improves by 0.5%-31.3% over prior adaptation methods that assume access to large amounts of annotated source data.

Paper21 Learning Classifiers of Prototypes and Reciprocal Points for Universal Domain Adaptation

摘要原文: Universal Domain Adaptation aims to transfer the knowledge between the datasets by handling two shifts: domain-shift and category-shift. The main challenge is correctly distinguishing the unknown target samples while adapting the distribution of known class knowledge from source to target. Most existing methods approach this problem by first training the target adapted known classifier and then relying on the single threshold to distinguish unknown target samples. However, this simple threshold-based approach prevents the model from considering the underlying complexities existing between the known and unknown samples in the high-dimensional feature space. In this paper, we propose a new approach in which we use two sets of feature points, namely dual Classifiers for Prototypes and Reciprocals (CPR). Our key idea is to associate each prototype with corresponding known class features while pushing the reciprocals apart from these prototypes to locate them in the potential unknown feature space. The target samples are then classified as unknown if they fall near any reciprocals at test time. To successfully train our framework, we collect the partial, confident target samples that are classified as known or unknown through on our proposed multi-criteria selection. We then additionally apply the entropy loss regularization to them. For further adaptation, we also apply standard consistency regularization that matches the predictions of two different views of the input to make more compact target feature space. We evaluate our proposal, CPR, on three standard benchmarks and achieve comparable or new state-of-the-art results. We also provide extensive ablation experiments to verify our main design choices in our framework.

Summary: Universal domain adaptation must handle both domain shift and category shift, adapting known-class knowledge from source to target while correctly identifying unknown target samples. Most existing methods train a target-adapted known-class classifier and then reject unknowns with a single threshold, which ignores the complex structure between known and unknown samples in the high-dimensional feature space. The proposed CPR uses dual classifiers of prototypes and reciprocal points: each prototype is associated with known-class features while the reciprocals are pushed away from these prototypes into the potential unknown region of feature space, and a target sample is labeled unknown at test time if it falls near any reciprocal. Training relies on partial, confident target samples selected as known or unknown by a multi-criteria scheme and regularized with an entropy loss, plus standard consistency regularization between two views of the input to compact the target feature space. CPR achieves comparable or new state-of-the-art results on three benchmarks, with extensive ablations supporting the main design choices.

Paper22 Domain Adaptive Object Detection for Autonomous Driving Under Foggy Weather

摘要原文: Most object detection methods for autonomous driving usually assume a consistent feature distribution between training and testing data, which is not always the case when weathers differ significantly. The object detection model trained under clear weather might not be effective enough on the foggy weather because of the domain gap. This paper proposes a novel domain adaptive object detection framework for autonomous driving under foggy weather. Our method leverages both image-level and object-level adaptation to diminish the domain discrepancy in image style and object appearance. To further enhance the model's capabilities under challenging samples, we also come up with a new adversarial gradient reversal layer to perform adversarial mining for the hard examples together with domain adaptation. Moreover, we propose to generate an auxiliary domain by data augmentation to enforce a new domain-level metric regularization. Experimental results on public benchmarks show the effectiveness and accuracy of the proposed method.

Summary: Object detectors for autonomous driving usually assume consistent feature distributions between training and test data, which breaks when weather conditions differ significantly; a detector trained in clear weather may perform poorly in fog because of the domain gap. The paper proposes a domain-adaptive detection framework for foggy weather that applies both image-level and object-level adaptation to reduce discrepancies in image style and object appearance, adds a new adversarial gradient reversal layer that performs adversarial mining of hard examples jointly with domain adaptation, and generates an auxiliary domain via data augmentation to enforce a new domain-level metric regularization. Results on public benchmarks demonstrate the method's effectiveness and accuracy.

Paper23 Image-Free Domain Generalization via CLIP for 3D Hand Pose Estimation

摘要原文: RGB-based 3D hand pose estimation has been successful for decades thanks to large-scale databases and deep learning. However, the hand pose estimation network does not operate well for hand pose images whose characteristics are far different from the training data. This is caused by various factors such as illuminations, camera angles, diverse backgrounds in the input images, etc. Many existing methods tried to solve it by supplying additional large-scale unconstrained/target domain images to augment data space; however collecting such large-scale images takes a lot of labors. In this paper, we present a simple image-free domain generalization approach for the hand pose estimation framework that uses only source domain data. We try to manipulate the image features of the hand pose estimation network by adding the features from text descriptions using the CLIP (Contrastive Language-Image Pre-training) model. The manipulated image features are then exploited to train the hand pose estimation network via the contrastive learning framework. In experiments with STB and RHD datasets, our algorithm shows improved performance over the state-of-the-art domain generalization approaches.

Summary: RGB-based 3D hand pose estimation has been successful thanks to large datasets and deep learning, but degrades on images whose illumination, camera angle, or background differs strongly from the training data; collecting large unconstrained or target-domain image sets to fix this is laborious. The paper presents a simple image-free domain generalization approach that uses only source-domain data: the image features of the pose estimation network are manipulated by adding features from text descriptions obtained with the CLIP model, and the manipulated features are used to train the network within a contrastive learning framework. On the STB and RHD datasets, the method outperforms state-of-the-art domain generalization approaches.

Paper24 D2F2WOD: Learning Object Proposals for Weakly-Supervised Object Detection via Progressive Domain Adaptation

摘要原文: Weakly-supervised object detection (WSOD) models attempt to leverage image-level annotations in lieu of accurate but costly-to-obtain object localization labels. This oftentimes leads to substandard object detection and localization at inference time. To tackle this issue, we propose D2DF2WOD, a Dual-Domain Fully-to-Weakly Supervised Object Detection framework that leverages synthetic data, annotated with precise object localization, to supplement a natural image target domain, where only image-level labels are available. In its warm-up domain adaptation stage, the model learns a fully-supervised object detector (FSOD) to improve the precision of the object proposals in the target domain, and at the same time learns target-domain-specific and detection-aware proposal features. In its main WSOD stage, a WSOD model is specifically tuned to the target domain. The feature extractor and the object proposal generator of the WSOD model are built upon the fine-tuned FSOD model. We test D2DF2WOD on five dual-domain image benchmarks. The results show that our method results in consistently improved object detection and localization compared with state-of-the-art methods.

Summary: Weakly supervised object detection (WSOD) models use image-level labels instead of costly localization labels, often at the price of substandard detection and localization. D2DF2WOD is a dual-domain fully-to-weakly supervised framework that supplements a natural-image target domain (image-level labels only) with synthetic data carrying precise localization annotations. In a warm-up domain adaptation stage, a fully supervised detector (FSOD) is learned to improve the precision of target-domain proposals and to learn target-specific, detection-aware proposal features; in the main WSOD stage, a WSOD model is tuned to the target domain, with its feature extractor and proposal generator built on the fine-tuned FSOD model. On five dual-domain benchmarks, the method consistently improves detection and localization over state-of-the-art methods.

Paper25 Learning Style Subspaces for Controllable Unpaired Domain Translation

摘要原文: The unpaired domain-to-domain translation aims to learn inter-domain relationships between diverse modalities without relying on paired data, which can help complex structure prediction tasks such as age transformation, where it is challenging to attain paired samples. A common approach used by most current methods is to factorize the data into a domain-invariant content space and a domain-specific style space. In this work, we argue that the style space can be further decomposed into smaller subspaces. Learning these style subspaces has two-fold advantages: (i) it allows more robustness and reliability in the generation of images in unpaired domain translation; and (ii) it allows better control and thereby interpolating the latent space, which can be helpful in complex translation tasks involving multiple domains. To achieve this decomposition, we propose a novel scalable approach to partition the latent space into style subspaces. We also propose a new evaluation metric that quantifies the controllable generation capability of domain translation methods. We compare our proposed method with several strong baselines on standard domain translation tasks such as gender translation (male-to-female and female-to-male), age transformation, reference-guided image synthesis, multi-domain image translation, and multi-attribute domain translation on celebA-HQ and AFHQ datasets. The proposed technique achieves state-of-the-art performance on various domain translation tasks while outperforming all the baselines on controllable generation tasks.

Summary: Unpaired domain-to-domain translation learns cross-domain relationships without paired data, which helps tasks such as age transformation where pairs are hard to obtain. Most methods factorize data into a domain-invariant content space and a domain-specific style space; this work argues the style space can be further decomposed into smaller subspaces, which makes unpaired generation more robust and allows finer control and interpolation of the latent space for complex multi-domain translation. The paper proposes a scalable approach to partition the latent space into style subspaces and a new metric that quantifies controllable generation. On celebA-HQ and AFHQ tasks, including gender translation, age transformation, reference-guided synthesis, multi-domain and multi-attribute translation, the method achieves state-of-the-art translation results and outperforms all baselines on controllable generation.

Paper26 Contrastive Learning of Semantic Concepts for Open-Set Cross-Domain Retrieval

摘要原文: We consider the problem of image retrieval where query images during testing belong to classes and domains both unseen during training. This requires learning a feature space that has the ability to generalize across both classes and domains. To this end, we propose semantic contrastive concept network (SCNNet), a new learning framework that helps take a step towards class and domain generalization in a principled fashion. Unlike existing methods that rely on global object representations, SCNNet proposes to learn a set of local concept vectors to facilitate unseen-class generalization. To this end, SCNNet’s key innovations include (a) a novel trainable local concept extraction module that learns an orthonormal set of basis vectors, and (b) computes local features for any unseen-class data as a linear combination of the learned basis set. Next, to enable unseen-domain generalization, SCNNet proposes to generate supervisory signals from an adjacent data modality, i.e., natural language, by mining freely available textual label information associated with images. SCNNet derives these signals from our novel trainable semantic ordinal distance constraints that ensure semantic consistency between pairs of images sampled from different domains. Both the proposed modules above enable end-to-end training of the SCNNet, resulting in a model that helps establish state-of-the-art performance on the standard DomainNet, PACS, and Sketchy benchmark datasets with average Prec@200 improvements of 42.6%, 6.5%, and 13.6% respectively over the most recently reported results.

Summary: The paper addresses image retrieval where query images at test time belong to classes and domains unseen during training, requiring a feature space that generalizes across both. SCNNet (semantic contrastive concept network) learns a set of local concept vectors instead of global object representations: a trainable module learns an orthonormal basis, and local features of unseen-class data are computed as linear combinations of that basis. For unseen-domain generalization, supervisory signals are mined from natural language, i.e., freely available textual label information, through trainable semantic ordinal distance constraints that keep image pairs sampled from different domains semantically consistent. Trained end to end, SCNNet sets state-of-the-art results on DomainNet, PACS, and Sketchy, with average Prec@200 improvements of 42.6%, 6.5%, and 13.6% respectively over the most recently reported results.

Paper27 Camera Alignment and Weighted Contrastive Learning for Domain Adaptation in Video Person ReID

摘要原文: Systems for person re-identification (ReID) can achieve a high level of accuracy when trained on large fully-labeled image datasets. However, the domain shift typically associated with diverse operational capture conditions (e.g., camera viewpoints and lighting) may translate to a significant decline in performance. This paper focuses on unsupervised domain adaptation (UDA) for video-based ReID – a relevant scenario that is less explored in the literature. In this scenario, the ReID model must adapt to a complex target domain defined by a network of diverse video cameras based on tracklet information. State-of-art methods cluster unlabeled target data, yet domain shifts across target cameras (sub-domains) can lead to poor initialization of clustering methods that propagates noise across epochs, and the ReID model cannot accurately associate samples of the same identity. In this paper, an UDA method is introduced for video person ReID that leverages knowledge on video tracklets, and on the distribution of frames captured over target cameras to improve the performance of CNN backbones trained using pseudo-labels. Our method relies on an adversarial approach, where a camera-discriminator network is introduced to extract discriminant camera-independent representations, facilitating the subsequent clustering. In addition, a weighted contrastive loss is proposed to leverage the confidence of clusters, and mitigate the risk of incorrect identity associations. Experimental results obtained on three challenging video-based person ReID datasets – PRID2011, iLIDS-VID, and MARS – indicate that our proposed method can outperform related state-of-the-art methods. The code is available at: https://github.com/wacv23775/775.

Summary: Person re-identification (ReID) systems trained on large fully labeled datasets are accurate but degrade under the domain shift caused by diverse capture conditions such as camera viewpoints and lighting. This paper tackles the less-explored problem of unsupervised domain adaptation for video-based ReID, where the model must adapt, based on tracklet information, to a complex target domain formed by a network of diverse cameras; shifts across target cameras (sub-domains) can poorly initialize clustering and propagate noisy pseudo labels across epochs. The proposed method uses an adversarial camera-discriminator network to extract camera-independent representations that ease the subsequent clustering, and a weighted contrastive loss that exploits cluster confidence to mitigate incorrect identity associations. On PRID2011, iLIDS-VID, and MARS, it outperforms related state-of-the-art methods. Code: https://github.com/wacv23775/775.

Paper28 Semi-Supervised Domain Adaptation With Auto-Encoder via Simultaneous Learning

摘要原文: We present a new semi-supervised domain adaptation framework that combines a novel auto-encoder-based domain adaptation model with a simultaneous learning scheme providing stable improvements over state-of-the-art domain adaptation models. Our framework holds strong distribution matching property by training both source and target auto-encoders using a novel simultaneous learning scheme on a single graph with an optimally modified MMD loss objective function. Additionally, we design a semi-supervised classification approach by transferring the aligned domain invariant feature spaces from source domain to the target domain. We evaluate on three datasets and show proof that our framework can effectively solve both fragile convergence (adversarial) and weak distribution matching problems between source and target feature space (discrepancy) with a high ‘speed’ of adaptation requiring a very low number of iterations.

Summary: The paper presents a semi-supervised domain adaptation framework that combines an auto-encoder-based adaptation model with a simultaneous learning scheme, giving stable improvements over state-of-the-art adaptation models. Source and target auto-encoders are trained together on a single graph with an optimally modified MMD loss objective, yielding strong distribution-matching properties, and a semi-supervised classification approach transfers the aligned domain-invariant feature space from source to target. Evaluations on three datasets show the framework resolves both fragile (adversarial) convergence and weak distribution matching between source and target feature spaces, and adapts quickly, requiring very few iterations.
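The paper's "optimally modified MMD loss" is not given in the abstract; for reference, a plain multi-kernel RBF MMD between source and target feature batches looks like the sketch below (the kernel bandwidths are arbitrary assumptions).

```python
import torch

def mmd_rbf(source_feats, target_feats, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Multi-kernel RBF MMD between two feature batches.

    source_feats: (Ns, D), target_feats: (Nt, D). Minimizing this value pulls
    the two feature distributions together.
    """
    x = torch.cat([source_feats, target_feats], dim=0)
    d2 = torch.cdist(x, x, p=2).pow(2)                       # pairwise squared distances
    k = sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas) / len(sigmas)

    ns = source_feats.size(0)
    k_ss = k[:ns, :ns].mean()                                # source-source kernel mean
    k_tt = k[ns:, ns:].mean()                                # target-target kernel mean
    k_st = k[:ns, ns:].mean()                                # cross-domain kernel mean
    return k_ss + k_tt - 2 * k_st

src = torch.randn(64, 256)
tgt = torch.randn(64, 256) + 0.5                             # shifted target distribution
print(mmd_rbf(src, tgt).item())
```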

Paper29 CUDA-GHR: Controllable Unsupervised Domain Adaptation for Gaze and Head Redirection

摘要原文: The robustness of gaze and head pose estimation models is highly dependent on the amount of labeled data. Recently, generative modeling has shown excellent results in generating photo-realistic images, which can alleviate the need for annotations. However, adopting such generative models to new domains while maintaining their ability to provide fine-grained control over different image attributes, e.g., gaze and head pose directions, has been a challenging problem. This paper proposes CUDA-GHR, an unsupervised domain adaptation framework that enables fine-grained control over gaze and head pose directions while preserving the appearance-related factors of the person. Our framework simultaneously learns to adapt to new domains and disentangle visual attributes such as appearance, gaze direction, and head orientation by utilizing a label-rich source domain and an unlabeled target domain. Extensive experiments on the benchmarking datasets show that the proposed method can outperform state-of-the-art techniques on both quantitative and qualitative evaluations. Furthermore, we demonstrate the effectiveness of generated image-label pairs in the target domain for pretraining networks for the downstream task of gaze and head pose estimation. The source code and pre-trained models are available at https://github.com/jswati31/cuda-ghr.

Summary: Gaze and head pose estimation models depend heavily on labeled data, and while generative models can synthesize photo-realistic images to reduce annotation needs, adapting them to new domains while keeping fine-grained control over attributes such as gaze and head pose direction is difficult. CUDA-GHR is an unsupervised domain adaptation framework that simultaneously learns to adapt to new domains and to disentangle appearance, gaze direction, and head orientation, using a label-rich source domain and an unlabeled target domain, so gaze and head pose can be redirected precisely while preserving the person's appearance. It outperforms state-of-the-art techniques in both quantitative and qualitative evaluations, and the generated image-label pairs in the target domain are effective for pre-training downstream gaze and head pose estimators. Source code and pre-trained models: https://github.com/jswati31/cuda-ghr.

Paper30 FFM: Injecting Out-of-Domain Knowledge via Factorized Frequency Modification

摘要原文: This work addresses the Single Domain Generalization (SDG) problem, and aims to generalize a model from a single source (i.e., training) domain to multiple target (i.e., test) domains with different distributions. Most of the existing SDG approaches aim at generating out-of-domain samples by either transforming the source images into different styles or optimizing adversarial noise perturbations. In this paper, we show that generating images with diverse styles can be complementary to creating hard samples when tackling the SDG task. This inspires us to propose our approach of Factorized Frequency Modification (FFM) which can fulfill the requirement of generating diverse and hard samples to tackle the problem of out-of-domain generalization. Specifically, we design a unified framework consisting of a style transformation module, an adversarial perturbation module, and a dynamic frequency selection module. We seamlessly equip the framework with iterative adversarial training which facilitates the task model to learn discriminative features from hard and diverse augmented samples. We perform extensive experiments on four image recognition benchmark datasets of Digits-DG, CIFAR-10-C, CIFAR-100-C, and PACS, which demonstrates that our method outperforms existing state-of-the-art approaches.

Summary: In single domain generalization (SDG), a model trained on one source domain must generalize to multiple differently distributed target domains; existing methods generate out-of-domain samples either by restyling source images or by optimizing adversarial noise perturbations. The paper shows that generating diverse styles and generating hard samples are complementary and proposes Factorized Frequency Modification (FFM), a unified framework with a style transformation module, an adversarial perturbation module, and a dynamic frequency selection module, trained with iterative adversarial training so the task model learns discriminative features from hard and diverse augmented samples. On Digits-DG, CIFAR-10-C, CIFAR-100-C, and PACS, FFM outperforms existing state-of-the-art approaches.

Paper31 WHFL: Wavelet-Domain High Frequency Loss for Sketch-to-Image Translation

摘要原文: Even a rough sketch can effectively convey the descriptions of objects, as humans can imagine the original shape from the sketch. The sketch-to-photo translation is a computer vision task that enables a machine to do this imagination, taking a binary sketch image and generating plausible RGB images corresponding to the sketch. Hence, deep neural networks for this task should learn to generate a wide range of frequencies because most parts of the input (binary sketch image) are composed of DC signals. In this paper, we propose a new loss function named Wavelet-domain High-Frequency Loss (WHFL) to overcome the limitations of previous methods that tend to have a bias toward low frequencies. The proposed method emphasizes the loss on the high frequencies by designing a new weight matrix imposing larger weights on the high bands. Unlike existing hand-craft methods that control frequency weights using binary masks, we use the matrix with finely controlled elements according to frequency scales. The WHFL is designed in a multi-scale form, which lets the loss function focus more on the high frequency according to decomposition levels. We use the WHFL as a complementary loss in addition to conventional ones defined in the spatial domain. Experiments show we can improve the qualitative and quantitative results in both spatial and frequency domains. Additionally, we attempt to verify the WHFL’s high-frequency generation capability by defining a new evaluation metric named Unsigned Euclidean Distance Field Error (UEDFE).

Summary: Sketch-to-photo translation turns a binary sketch into a plausible RGB image, which requires the network to generate a wide range of frequencies because the binary input is dominated by DC signals. The paper proposes the Wavelet-domain High-Frequency Loss (WHFL) to counter the low-frequency bias of previous methods: instead of binary masks, a weight matrix with finely controlled, frequency-scale-dependent elements places larger weights on the high bands, and the loss is built in multi-scale form so higher decomposition levels emphasize high frequencies more. Used as a complement to conventional spatial-domain losses, WHFL improves qualitative and quantitative results in both the spatial and frequency domains, and a new metric, the Unsigned Euclidean Distance Field Error (UEDFE), is introduced to verify its high-frequency generation ability.
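As an illustration of a wavelet-domain loss that up-weights high-frequency sub-bands, here is a sketch using a hand-rolled one-level Haar transform applied recursively. WHFL's finely controlled per-frequency weight matrix is replaced here by a simple per-level scalar weight, so this is only the general shape of such a loss, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """One level of a 2D Haar wavelet transform.

    x: (B, C, H, W) with even H and W. Returns (LL, LH, HL, HH) sub-bands,
    each of shape (B, C, H/2, W/2).
    """
    a = x[..., 0::2, 0::2]
    b = x[..., 0::2, 1::2]
    c = x[..., 1::2, 0::2]
    d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 2
    lh = (a - b + c - d) / 2
    hl = (a + b - c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

def wavelet_high_frequency_loss(pred, target, levels=3, high_weight=2.0):
    """L1 loss in the wavelet domain with larger weights on the high-frequency
    sub-bands, increasing with the decomposition level."""
    loss = 0.0
    for level in range(1, levels + 1):
        pred_ll, *pred_high = haar_dwt(pred)
        tgt_ll, *tgt_high = haar_dwt(target)
        w = high_weight * level                     # deeper level -> larger weight
        loss = loss + w * sum(F.l1_loss(p, t) for p, t in zip(pred_high, tgt_high))
        pred, target = pred_ll, tgt_ll              # recurse on the low-pass band
    return loss

fake = torch.rand(2, 3, 256, 256, requires_grad=True)
real = torch.rand(2, 3, 256, 256)
print(wavelet_high_frequency_loss(fake, real).item())
```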

Paper32 Improving Diversity With Adversarially Learned Transformations for Domain Generalization

摘要原文: To be successful in single source domain generalization (SSDG), maximizing diversity of synthesized domains has emerged as one of the most effective strategies. Recent success in SSDG comes from methods that pre-specify diversity inducing image augmentations during training, so that it may lead to better generalization on new domains. However, naive pre-specified augmentations are not always effective, either because they cannot model large domain shift, or because the specific choice of transforms may not cover the types of shifts commonly occurring in domain generalization. To address this issue, we present a novel framework called ALT: adversarially learned transformations, that uses an adversary neural network to model plausible, yet hard image transformations that fool the classifier. ALT learns image transformations by randomly initializing the adversary network for each batch and optimizing it for a fixed number of steps to maximize classification error. The classifier is trained by enforcing a consistency between its predictions on the clean and transformed images. With extensive empirical analysis, we find that this new form of adversarial transformations achieves both objectives of diversity and hardness simultaneously, outperforming all existing techniques on competitive benchmarks for SSDG. We also show that ALT can seamlessly work with existing diversity modules to produce highly distinct, and large transformations of the source domain leading to state-of-the-art performance. Code: https://github.com/tejas-gokhale/ALT

Summary: In single-source domain generalization (SSDG), maximizing the diversity of synthesized domains is one of the most effective strategies, but pre-specified augmentations cannot always model large domain shifts or cover the kinds of shifts that actually occur. ALT (adversarially learned transformations) uses an adversary network, randomly initialized for each batch and optimized for a fixed number of steps to maximize classification error, to produce plausible yet hard image transformations; the classifier is trained by enforcing consistency between its predictions on the clean and transformed images. Extensive analysis shows ALT achieves diversity and hardness simultaneously, outperforms existing techniques on SSDG benchmarks, and works seamlessly with existing diversity modules to reach state-of-the-art performance. Code: https://github.com/tejas-gokhale/ALT.

Paper33 Generative Alignment of Posterior Probabilities for Source-Free Domain Adaptation

摘要原文: Existing domain adaptation literature comprises multiple techniques that align the labeled source and unlabeled target domains at different stages, and predict the target labels. In a source-free domain adaptation setting, the source data is not available for alignment. We present a source-free generative paradigm that captures the relations between the source categories and enforces them onto the unlabeled target data, thereby circumventing the need for source data without introducing any new hyper-parameters. The adaptation is performed through the adversarial alignment of the posterior probabilities of the source and target categories. The proposed approach demonstrates competitive performance against other source-free domain adaptation techniques and can also be used for source-present settings.

Summary: Existing domain adaptation techniques align labeled source and unlabeled target domains at various stages, but in the source-free setting the source data are unavailable for alignment. This paper presents a source-free generative paradigm that captures the relations between source categories and enforces them on the unlabeled target data, removing the need for source data without introducing any new hyper-parameters; adaptation is performed through adversarial alignment of the posterior probabilities of the source and target categories. The approach is competitive with other source-free domain adaptation techniques and can also be used when source data are present.

Paper34 Auxiliary Task-Guided CycleGAN for Black-Box Model Domain Adaptation

摘要原文: The research area of domain adaptation investigates methods that enable the transfer of existing models across different domains, e.g., addressing environmental changes or the transfer from synthetic to real data. Especially unsupervised domain adaptation is beneficial because it does not require any labeled target domain data. Usually, existing methods are targeted at specific tasks and require access or even modifications to the source model and its parameters which is a major drawback when only a black-box model is available. Therefore, we propose a CycleGAN-based approach suitable for black-box source models to translate target domain data into the source domain on which the source model can operate. Inspired by multi-task learning, we extend CycleGAN with an additional auxiliary task that can be arbitrarily chosen to support the transfer of task-related information across domains without the need for having access to a differentiable source model or its parameters. In this work, we focus on the regression task of 2D human pose estimation and compare our results in four different domain adaptation settings to CycleGAN and RegDA, a state-of-the-art method for unsupervised domain adaptation for keypoint detection.

Summary: Domain adaptation transfers existing models across domains, e.g., under environmental changes or from synthetic to real data, and unsupervised adaptation is attractive because it needs no labeled target data. However, most methods are task-specific and require access to, or modification of, the source model and its parameters, which is a major drawback when only a black-box model is available. The paper proposes a CycleGAN-based approach suitable for black-box source models that translates target-domain data into the source domain, where the source model can operate; inspired by multi-task learning, CycleGAN is extended with an arbitrarily chosen auxiliary task that carries task-related information across domains without requiring a differentiable source model or its parameters. Focusing on the regression task of 2D human pose estimation, the method is compared in four domain adaptation settings against CycleGAN and RegDA, a state-of-the-art unsupervised adaptation method for keypoint detection.

Paper35 Domain Adaptation Using Self-Training With Mixup for One-Stage Object Detection

摘要原文: In this paper, we present an end-to-end domain adaptation technique that utilizes both feature distribution alignment and Self-Training effectively for object detection. One set of methods for domain adaptation relies on feature distribution alignment and adapts models on an unlabeled target domain by learning domain invariant representations through adversarial loss. Although this approach is effective, it may not be adequate or even have an adverse effect when domain shifts are large and inconsistent. Another set of methods utilizes Self-Training which relies on pseudo labels to approximate the target domain distribution directly. However, it can also have a negative impact on the model performance due to erroneous pseudo labels. To overcome these two issues, we propose to generate reliable pseudo labels through feature distribution alignment and data distillation. Further, to minimize the adverse effect of incorrect pseudo labels during Self-Training we employ interpolation-based consistency regularization called mixup. While distribution alignment helps in generating more accurate pseudo labels, mixup regularization of Self-Training reduces the adverse effect of less accurate pseudo labels. Both approaches supplement each other and achieve effective adaptation on the target domain which we demonstrate through extensive experiments on one-stage object detector. Experiment results show that our approach achieves a significant performance improvement on multiple benchmark datasets.

Summary: The paper presents an end-to-end domain adaptation technique for one-stage object detection that combines feature distribution alignment with self-training. Alignment-based methods learn domain-invariant representations through an adversarial loss but can be inadequate or even harmful when domain shifts are large and inconsistent, while self-training with pseudo labels approximates the target distribution directly but suffers from erroneous pseudo labels. The proposed method generates reliable pseudo labels via distribution alignment and data distillation, and applies interpolation-based consistency regularization (mixup) during self-training to dampen the effect of incorrect labels; the two components complement each other and achieve effective adaptation. Experiments show significant performance improvements on multiple benchmark datasets.
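Below is a generic sketch of interpolation-based consistency (mixup) as it is commonly used to regularize self-training. For brevity it is written for per-image class logits with a toy model, whereas the paper applies the idea inside a one-stage detector; names and the Beta parameter are assumptions.

```python
import torch
import torch.nn.functional as F

def mixup_consistency_loss(model, img_a, img_b, alpha=0.2):
    """Interpolation consistency: the prediction on a mixed image should match
    the same mix of the predictions on the two original images."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * img_a + (1 - lam) * img_b

    with torch.no_grad():                              # targets from unmixed images
        p_a = F.softmax(model(img_a), dim=1)
        p_b = F.softmax(model(img_b), dim=1)
        target = lam * p_a + (1 - lam) * p_b

    log_p_mixed = F.log_softmax(model(mixed), dim=1)
    return F.kl_div(log_p_mixed, target, reduction='batchmean')

# toy usage with a tiny classifier standing in for the detector head
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
xa, xb = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)
print(mixup_consistency_loss(model, xa, xb).item())
```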
