ICCV 2023 Paper Reading Quick Overview: 28 Adaptation Papers


Paper1 Order-preserving Consistency Regularization for Domain Adaptation and Generalization

摘要原文: Deep learning models fail on cross-domain challenges if the model is oversensitive to domain-specific attributes, e.g., lightning, background, camera angle, etc. To alleviate this problem, data augmentation coupled with consistency regularization are commonly adopted to make the model less sensitive to domain-specific attributes. Consistency regularization enforces the model to output the same representation or prediction for two views of one image. These constraints, however, are either too strict or not order-preserving for the classification probabilities. In this work, we propose the Order-preserving Consistency Regularization (OCR) for cross-domain tasks. The order-preserving property for the prediction makes the model robust to task-irrelevant transformations. As a result, the model becomes less sensitive to the domain-specific attributes. The comprehensive experiments show that our method achieves clear advantages on five different cross-domain tasks.

Summary: Deep learning models fail on cross-domain challenges when they are oversensitive to domain-specific attributes such as lighting, background, and camera angle. To alleviate this, data augmentation combined with consistency regularization is commonly used to make the model less sensitive to such attributes. Consistency regularization forces the model to output the same representation or prediction for two views of one image, but these constraints are either too strict or not order-preserving for the classification probabilities. This work proposes Order-preserving Consistency Regularization (OCR) for cross-domain tasks: the order-preserving property of the predictions makes the model robust to task-irrelevant transformations and therefore less sensitive to domain-specific attributes. Comprehensive experiments show clear advantages on five different cross-domain tasks.

Paper2 Towards Open-Set Test-Time Adaptation Utilizing the Wisdom of Crowds in Entropy Minimization

摘要原文: Test-time adaptation (TTA) methods, which generally rely on the model’s predictions (e.g., entropy minimization) to adapt the source pretrained model to the unlabeled target domain, suffer from noisy signals originating from 1) incorrect or 2) open-set predictions. Long-term stable adaptation is hampered by such noisy signals, so training models without such error accumulation is crucial for practical TTA. To address these issues, including open-set TTA, we propose a simple yet effective sample selection method inspired by the following crucial empirical finding. While entropy minimization compels the model to increase the probability of its predicted label (i.e., confidence values), we found that noisy samples rather show decreased confidence values. To be more specific, entropy minimization attempts to raise the confidence values of an individual sample’s prediction, but individual confidence values may rise or fall due to the influence of signals from numerous other predictions (i.e., wisdom of crowds). Due to this fact, noisy signals misaligned with such ‘wisdom of crowds’, generally found in the correct signals, fail to raise the individual confidence values of wrong samples, despite attempts to increase them. Based on such findings, we filter out the samples whose confidence values are lower in the adapted model than in the original model, as they are likely to be noisy. Our method is widely applicable to existing TTA methods and improves their long-term adaptation performance in both image classification (e.g., 49.4% reduced error rates with TENT) and semantic segmentation (e.g., 11.7% gain in mIoU with TENT).

Summary: This paper discusses the problems of test-time adaptation (TTA) and a proposed solution. Conventional TTA methods rely on the model's predictions (e.g., entropy minimization) to adapt a source-pretrained model to the unlabeled target domain, but noisy signals from incorrect or open-set predictions hamper long-term stable adaptation. To address this, the authors propose a simple yet effective sample selection method inspired by a key empirical finding: although entropy minimization pushes the model to raise the probability of its predicted label (i.e., the confidence), noisy samples instead tend to show decreased confidence. Based on this, samples whose confidence under the adapted model is lower than under the original model are filtered out, since they are likely noisy. The method is widely applicable to existing TTA approaches and improves their long-term adaptation performance on both image classification and semantic segmentation.
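To make the selection rule above concrete, here is a minimal, hypothetical sketch (not the authors' code) of filtering out samples whose confidence drops under the adapted model and applying entropy minimization only to the remaining ones; the model handles and the exact entropy loss form are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def confidence(model, x):
    """Max softmax probability of each sample under `model`."""
    probs = F.softmax(model(x), dim=1)
    return probs.max(dim=1).values

def selective_entropy_loss(adapted_model, source_model, x):
    """Entropy loss computed only on samples whose confidence did not drop."""
    keep = confidence(adapted_model, x) >= confidence(source_model, x)  # boolean mask
    logits = adapted_model(x)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)
    if keep.any():
        return entropy[keep].mean()
    return logits.sum() * 0.0  # no reliable samples in this batch -> zero loss
```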

Paper3 SFHarmony: Source Free Domain Adaptation for Distributed Neuroimaging Analysis

摘要原文: To represent the biological variability of clinical neuroimaging populations, it is vital to be able to combine data across scanners and studies. However, different MRI scanners produce images with different characteristics, resulting in a domain shift known as the ‘harmonisation problem’. Additionally, neuroimaging data is inherently personal in nature, leading to data privacy concerns when sharing the data. To overcome these barriers, we propose an Unsupervised Source-Free Domain Adaptation (SFDA) method, SFHarmony. Through modelling the imaging features as a Gaussian Mixture Model and minimising an adapted Bhattacharyya distance between the source and target features, we can create a model that performs well for the target data whilst having a shared feature representation across the data domains, without needing access to the source data for adaptation or target labels. We demonstrate the performance of our method on simulated and real domain shifts, showing that the approach is applicable to classification, segmentation and regression tasks, requiring no changes to the algorithm. Our method outperforms existing SFDA approaches across a range of realistic data scenarios, demonstrating the potential utility of our approach for MRI harmonisation and general SFDA problems. Our code is available at https://github.com/nkdinsdale/SFHarmony.

Summary: To represent the biological variability of clinical neuroimaging populations, it is vital to combine data across scanners and studies. However, different MRI scanners produce images with different characteristics, causing a domain shift known as the "harmonisation problem". Neuroimaging data is also inherently personal, raising privacy concerns when sharing it. To overcome these barriers, the authors propose an unsupervised source-free domain adaptation (SFDA) method, SFHarmony. By modelling the imaging features as a Gaussian Mixture Model and minimising an adapted Bhattacharyya distance between source and target features, they obtain a model that performs well on the target data while sharing a feature representation across domains, without needing source data or target labels for adaptation. Experiments on simulated and real domain shifts show the approach applies to classification, segmentation, and regression without changing the algorithm, and it outperforms existing SFDA methods across a range of realistic data scenarios, demonstrating its potential for MRI harmonisation and general SFDA problems. Code: https://github.com/nkdinsdale/SFHarmony.
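As a point of reference for the distribution matching described above, the following sketch shows the standard closed-form Bhattacharyya distance between two one-dimensional Gaussians; how SFHarmony adapts this distance to full Gaussian Mixture Models is not reproduced here.

```python
import math

def bhattacharyya_gaussian(mu1, var1, mu2, var2):
    """Bhattacharyya distance between N(mu1, var1) and N(mu2, var2)."""
    term_mean = 0.25 * (mu1 - mu2) ** 2 / (var1 + var2)
    term_var = 0.5 * math.log((var1 + var2) / (2.0 * math.sqrt(var1 * var2)))
    return term_mean + term_var

# Example: identical Gaussians give distance 0.
assert abs(bhattacharyya_gaussian(0.0, 1.0, 0.0, 1.0)) < 1e-12
```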

Paper4 StyleDomain: Efficient and Lightweight Parameterizations of StyleGAN for One-shot and Few-shot Domain Adaptation

摘要原文: Domain adaptation of GANs is a problem of fine-tuning GAN models pretrained on a large dataset (e.g. StyleGAN) to a specific domain with few samples (e.g. painting faces, sketches, etc.). While there are many methods that tackle this problem in different ways, there are still many important questions that remain unanswered. In this paper, we provide a systematic and in-depth analysis of the domain adaptation problem of GANs, focusing on the StyleGAN model. We perform a detailed exploration of the most important parts of StyleGAN that are responsible for adapting the generator to a new domain depending on the similarity between the source and target domains. As a result of this study, we propose new efficient and lightweight parameterizations of StyleGAN for domain adaptation. Particularly, we show that there exist directions in StyleSpace (StyleDomain directions) that are sufficient for adapting to similar domains. For dissimilar domains, we propose Affine+ and AffineLight+ parameterizations that allows us to outperform existing baselines in few-shot adaptation while having significantly less training parameters. Finally, we examine StyleDomain directions and discover their many surprising properties that we apply for domain mixing and cross-domain image morphing. Source code can be found at https://github.com/AIRI-Institute/StyleDomain.

Summary: This paper studies the domain adaptation problem for GANs, i.e., fine-tuning a GAN pretrained on a large dataset (such as StyleGAN) to a specific domain with few samples (painted faces, sketches, etc.). Although many methods tackle this problem in different ways, several important questions remain unanswered. The authors provide a systematic, in-depth analysis focused on StyleGAN, exploring which parts of the model are responsible for adapting the generator to a new domain depending on the similarity between source and target domains. From this study they propose new efficient and lightweight parameterizations of StyleGAN for domain adaptation. In particular, they show that directions in StyleSpace (StyleDomain directions) are sufficient for adapting to similar domains; for dissimilar domains they propose the Affine+ and AffineLight+ parameterizations, which outperform existing baselines in few-shot adaptation with far fewer trainable parameters. Finally, they study StyleDomain directions, discover several surprising properties, and apply them to domain mixing and cross-domain image morphing. Code: https://github.com/AIRI-Institute/StyleDomain.

Paper5 Fine-grained Unsupervised Domain Adaptation for Gait Recognition

摘要原文: Gait recognition has emerged as a promising technique for the long-range retrieval of pedestrians, providing numerous advantages such as accurate identification in challenging conditions and non-intrusiveness, making it highly desirable for improving public safety and security. However, the high cost of labeling datasets, which is a prerequisite for most existing fully supervised approaches, poses a significant obstacle to the development of gait recognition. Recently, some unsupervised methods for gait recognition have shown promising results. However, these methods mainly rely on a fine-tuning approach that does not sufficiently consider the relationship between source and target domains, leading to the catastrophic forgetting of source domain knowledge. This paper presents a novel perspective that adjacent-view sequences exhibit overlapping views, which can be leveraged by the network to gradually attain cross-view and cross-dressing capabilities without pre-training on the labeled source domain. Specifically, we propose a fine-grained Unsupervised Domain Adaptation (UDA) framework that iteratively alternates between two stages. The initial stage involves offline clustering, which transfers knowledge from the labeled source domain to the unlabeled target domain and adaptively generates pseudo-labels according to the expressiveness of each part. Subsequently, the second stage encompasses online training, which further achieves cross-dressing capabilities by continuously learning to distinguish numerous features of source and target domains. The effectiveness of the proposed method is demonstrated through extensive experiments conducted on widely-used public gait datasets.

Summary: Gait recognition is a promising technique for long-range pedestrian retrieval, offering accurate identification under challenging conditions and non-intrusiveness, which makes it attractive for improving public safety and security. However, the high cost of labeling datasets, a prerequisite for most fully supervised approaches, is a major obstacle to its development. Some recent unsupervised methods show promising results, but they mainly rely on fine-tuning that does not sufficiently consider the relationship between source and target domains, leading to catastrophic forgetting of source-domain knowledge. This paper observes that adjacent-view sequences exhibit overlapping views, which the network can exploit to gradually acquire cross-view and cross-dressing capabilities without pre-training on a labeled source domain. Concretely, the authors propose a fine-grained unsupervised domain adaptation (UDA) framework that iteratively alternates between two stages: an offline clustering stage that transfers knowledge from the labeled source domain to the unlabeled target domain and adaptively generates pseudo-labels according to the expressiveness of each part, and an online training stage that further achieves cross-dressing capability by continuously learning to distinguish the many features of the source and target domains. Extensive experiments on widely used public gait datasets demonstrate the effectiveness of the method.

Paper6 Self-regulating Prompts: Foundational Model Adaptation without Forgetting

摘要原文: Prompt learning has emerged as an efficient alternative for fine-tuning foundational models, such as CLIP, for various downstream tasks. Conventionally trained using the task-specific objective, i.e., cross-entropy loss, prompts tend to overfit downstream data distributions and find it challenging to capture task-agnostic general features from the frozen CLIP. This leads to the loss of the model’s original generalization capability. To address this issue, our work introduces a self-regularization framework for prompting called PromptSRC (Prompting with Self-regulating Constraints). PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations using a three-pronged approach by: (a) regulating prompted representations via mutual agreement maximization with the frozen model, (b) regulating with self-ensemble of prompts over the training trajectory to encode their complementary strengths, and (c) regulating with textual diversity to mitigate sample diversity imbalance with the visual branch. To the best of our knowledge, this is the first regularization framework for prompt learning that avoids overfitting by jointly attending to pre-trained model features, the training trajectory during prompting, and the textual diversity. PromptSRC explicitly steers the prompts to learn a representation space that maximizes performance on downstream tasks without compromising CLIP generalization. We perform extensive experiments on 4 benchmarks where PromptSRC overall performs favorably well compared to the existing methods. Our code and pre-trained models are publicly available.

Summary: This paper introduces PromptSRC (Prompting with Self-regulating Constraints), a self-regularization framework that guides prompts to optimize both task-specific and task-agnostic general representations when fine-tuning foundation models such as CLIP. The framework regularizes prompts in three ways: (a) maximizing mutual agreement between prompted representations and the frozen model, (b) self-ensembling prompts over the training trajectory to encode their complementary strengths, and (c) using textual diversity to mitigate the sample-diversity imbalance with the visual branch. PromptSRC explicitly steers prompts to learn a representation space that maximizes downstream performance without compromising CLIP's generalization. Extensive experiments on 4 benchmarks show PromptSRC compares favorably with existing methods; the code and pre-trained models are publicly available.

Paper7 Test Time Adaptation for Blind Image Quality Assessment

摘要原文: While the design of blind image quality assessment (IQA) algorithms has improved significantly, the distribution shift between the training and testing scenarios often leads to a poor performance of these methods at inference time. This motivates the study of test time adaptation (TTA) techniques to improve their performance at inference time. Existing auxiliary tasks and loss functions used for TTA may not be relevant for quality-aware adaptation of the pre-trained model. In this work, we introduce two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for blind IQA. In particular, we introduce a group contrastive loss at the batch level and a relative rank loss at the sample level to make the model quality aware and adapt to the target data. Our experiments reveal that even using a small batch of images from the test distribution helps achieve significant improvement in performance by updating the batch normalization statistics of the source model.

Summary: Although the design of blind image quality assessment (IQA) algorithms has improved significantly, the distribution shift between training and testing scenarios often degrades their performance at inference time, motivating test-time adaptation (TTA). Existing auxiliary tasks and loss functions used for TTA may not be relevant for quality-aware adaptation of a pre-trained IQA model. This work introduces two novel quality-relevant auxiliary tasks, at the batch and sample levels, to enable TTA for blind IQA: a group contrastive loss at the batch level and a relative rank loss at the sample level, which make the model quality-aware and adapt it to the target data. Experiments show that even a small batch of images from the test distribution yields significant performance gains by updating the batch normalization statistics of the source model.
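The simplest adaptation step mentioned in the abstract, refreshing batch-normalization statistics with a small unlabeled test batch, could look roughly like the sketch below (PyTorch assumed; the paper's group contrastive and relative rank losses are omitted).

```python
import torch

def refresh_bn_statistics(model, test_batch, momentum=0.1):
    """Forward one unlabeled test batch so BatchNorm layers update their running stats."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d)):
            m.train()             # train mode -> running stats get updated on forward
            m.momentum = momentum
    with torch.no_grad():
        model(test_batch)
    model.eval()                  # back to inference mode with the refreshed statistics
    return model
```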

Paper8 Label Shift Adapter for Test-Time Adaptation under Covariate and Label Shifts

摘要原文: Test-time adaptation (TTA) aims to adapt a pre-trained model to the target domain in a batch-by-batch manner during inference. While label distributions often exhibit imbalances in real-world scenarios, most previous TTA approaches typically assume that both source and target domain datasets have balanced label distribution. Due to the fact that certain classes appear more frequently in certain domains (e.g., buildings in cities, trees in forests), it is natural that the label distribution shifts as the domain changes. However, we discover that the majority of existing TTA methods fail to address the coexistence of covariate and label shifts. To tackle this challenge, we propose a novel label shift adapter that can be incorporated into existing TTA approaches to deal with label shifts during the TTA process effectively. Specifically, we estimate the label distribution of the target domain to feed it into the label shift adapter. Subsequently, the label shift adapter produces optimal parameters for the target label distribution. By predicting only the parameters for a part of the pre-trained source model, our approach is computationally efficient and can be easily applied, regardless of the model architectures. Through extensive experiments, we demonstrate that integrating our strategy with TTA approaches leads to substantial performance improvements under the joint presence of label and covariate shifts.

Summary: Test-time adaptation (TTA) aims to adapt a pre-trained model to the target domain batch by batch during inference. In real-world scenarios label distributions are often imbalanced, yet most previous TTA approaches assume balanced label distributions in both the source and target datasets. Because certain classes appear more frequently in certain domains (e.g., buildings in cities, trees in forests), the label distribution naturally shifts as the domain changes, and most existing TTA methods fail to handle the coexistence of covariate and label shifts. To tackle this, the authors propose a novel label shift adapter that can be plugged into existing TTA approaches: the target-domain label distribution is estimated and fed into the adapter, which then produces parameters suited to that label distribution. Since only the parameters of part of the pre-trained source model are predicted, the approach is computationally efficient and easy to apply regardless of model architecture. Extensive experiments show that integrating this strategy with TTA approaches brings substantial performance gains under the joint presence of label and covariate shifts.
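For intuition, the sketch below shows a generic label-shift correction: estimate the target prior from predictions and shift the logits by the log prior ratio. Note this is a textbook-style illustration, not the paper's learned adapter, which instead predicts parameters for part of the source model.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_target_prior(logits):
    """Average softmax predictions over a batch as a crude target label distribution."""
    return F.softmax(logits, dim=1).mean(dim=0)

def prior_corrected_logits(logits, source_prior, target_prior, eps=1e-8):
    """Shift logits by log(target_prior / source_prior), a standard label-shift correction."""
    return logits + torch.log(target_prior + eps) - torch.log(source_prior + eps)
```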

Paper9 Generalized Lightness Adaptation with Channel Selective Normalization

摘要原文: Lightness adaptation is vital to the success of image processing to avoid unexpected visual deterioration, which covers multiple aspects, e.g., low-light image enhancement, image retouching, and inverse tone mapping. Existing methods typically work well on their trained lightness conditions but perform poorly in unknown ones due to their limited generalization ability. To address this limitation, we propose a novel generalized lightness adaptation algorithm that extends conventional normalization techniques through a channel filtering design, dubbed Channel Selective Normalization (CSNorm). The proposed CSNorm purposely normalizes the statistics of lightness-relevant channels and keeps other channels unchanged, so as to improve feature generalization and discrimination. To optimize CSNorm, we propose an alternating training strategy that effectively identifies lightness-relevant channels. The model equipped with our CSNorm only needs to be trained on one lightness condition and can be well generalized to unknown lightness conditions. Experimental results on multiple benchmark datasets demonstrate the effectiveness of CSNorm in enhancing the generalization ability for the existing lightness adaptation methods. Code is available at https://github.com/mdyao/CSNorm.

Summary: Lightness adaptation is vital for image processing to avoid unexpected visual deterioration, covering aspects such as low-light image enhancement, image retouching, and inverse tone mapping. Existing methods typically work well under the lightness conditions they were trained on but generalize poorly to unknown ones. To address this limitation, the authors propose a generalized lightness adaptation algorithm that extends conventional normalization through a channel filtering design, called Channel Selective Normalization (CSNorm). CSNorm deliberately normalizes the statistics of lightness-relevant channels while leaving other channels unchanged, improving feature generalization and discrimination. To optimize CSNorm, an alternating training strategy is proposed that effectively identifies lightness-relevant channels. A model equipped with CSNorm only needs to be trained under one lightness condition and generalizes well to unknown lightness conditions. Experiments on multiple benchmark datasets demonstrate that CSNorm enhances the generalization ability of existing lightness adaptation methods. Code: https://github.com/mdyao/CSNorm.
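A minimal sketch of the selective-normalization idea, assuming a fixed boolean channel mask; the paper instead learns which channels are lightness-relevant via an alternating training strategy that is not shown here.

```python
import torch

def channel_selective_norm(x, channel_mask, eps=1e-5):
    """x: (N, C, H, W); channel_mask: bool tensor of shape (C,), True = normalize this channel."""
    mean = x.mean(dim=(2, 3), keepdim=True)
    std = x.std(dim=(2, 3), keepdim=True) + eps
    normalized = (x - mean) / std
    mask = channel_mask.view(1, -1, 1, 1).to(x.dtype)
    # Normalize selected channels, pass the others through unchanged.
    return mask * normalized + (1.0 - mask) * x
```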

Paper10 Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

摘要原文: In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A critical problem for them is how to effectively capture the rich semantics inside the video using the image encoder of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal modeling techniques to fuse the text information into video frame representations, which, however, incurs severe efficiency issues in large-scale retrieval systems as the video representations must be recomputed online for every text query. In this paper, we discard this problematic cross-modal fusion process and aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. Concretely, we first introduce a spatial-temporal “Prompt Cube” into the CLIP image encoder and iteratively switch it within the encoder layers to efficiently incorporate the global video semantics into frame representations. We then propose to apply an auxiliary video captioning objective to train the frame representations, which facilitates the learning of detailed video semantics by providing fine-grained guidance in the semantic space. With a naive temporal fusion strategy (i.e., mean-pooling) on the enhanced frame representations, we obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.

Summary: In text-video retrieval, recent works benefit from pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A key question is how to effectively capture the rich semantics in a video with CLIP's image encoder. State-of-the-art methods use complex cross-modal modeling to fuse text information into video frame representations, but this causes severe efficiency issues in large-scale retrieval systems because video representations must be recomputed online for every text query. This paper discards that problematic cross-modal fusion and instead learns semantically enhanced representations purely from the video, so video representations can be computed offline and reused for different texts. Concretely, a spatial-temporal "Prompt Cube" is introduced into the CLIP image encoder and iteratively switched within the encoder layers to efficiently inject global video semantics into frame representations, and an auxiliary video captioning objective provides fine-grained guidance in the semantic space for training the frame representations. With a naive temporal fusion strategy (mean pooling) over the enhanced frame representations, the method achieves state-of-the-art performance on three benchmark datasets: MSR-VTT, MSVD, and LSMDC.

Paper11 A Low-Shot Object Counting Network With Iterative Prototype Adaptation

摘要原文: We consider low-shot counting of arbitrary semantic categories in the image using only few annotated exemplars (few-shot) or no exemplars (no-shot). The standard few-shot pipeline follows extraction of appearance queries from exemplars and matching them with image features to infer the object counts. Existing methods extract queries by feature pooling which neglects the shape information (e.g., size and aspect) and leads to a reduced object localization accuracy and count estimates. We propose a Low-shot Object Counting network with iterative prototype Adaptation (LOCA). Our main contribution is the new object prototype extraction module, which iteratively fuses the exemplar shape and appearance information with image features. The module is easily adapted to zero-shot scenarios, enabling LOCA to cover the entire spectrum of low-shot counting problems. LOCA outperforms all recent state-of-the-art methods on FSC147 benchmark by 20-30% in RMSE on one-shot and few-shot and achieves state-of-the-art on zero-shot scenarios, while demonstrating better generalization capabilities. The code and models are available here: https://github.com/djukicn/loca.

Summary: This paper addresses low-shot counting of arbitrary semantic categories in an image using only a few annotated exemplars (few-shot) or none (no-shot). The standard few-shot pipeline extracts appearance queries from exemplars and matches them with image features to infer object counts, but existing methods extract queries by feature pooling, which ignores shape information (e.g., size and aspect) and degrades object localization accuracy and count estimates. The authors propose LOCA, a Low-shot Object Counting network with iterative prototype Adaptation, whose main contribution is a new object prototype extraction module that iteratively fuses exemplar shape and appearance information with image features. The module easily adapts to zero-shot scenarios, so LOCA covers the entire spectrum of low-shot counting problems. On the FSC147 benchmark, LOCA outperforms all recent state-of-the-art methods by 20-30% in RMSE in one-shot and few-shot settings, achieves state of the art in the zero-shot setting, and shows better generalization. Code and models: https://github.com/djukicn/loca.

Paper12 Smoothness Similarity Regularization for Few-Shot GAN Adaptation

摘要原文: The task of few-shot GAN adaptation aims to adapt a pre-trained GAN model to a small dataset with very few training images. While existing methods perform well when the dataset for pre-training is structurally similar to the target dataset, the approaches suffer from training instabilities or memorization issues when the objects in the two domains have a very different structure. To mitigate this limitation, we propose a new smoothness similarity regularization that transfers the inherently learned smoothness of the pre-trained GAN to the few-shot target domain even if the two domains are very different. We evaluate our approach by adapting an unconditional and a class-conditional GAN to diverse few-shot target domains. Our proposed method significantly outperforms prior few-shot GAN adaptation methods in the challenging case of structurally dissimilar source-target domains, while performing on par with the state of the art for similar source-target domains.

Summary: Few-shot GAN adaptation aims to adapt a pre-trained GAN to a small dataset with very few training images. Existing methods perform well when the pre-training dataset is structurally similar to the target dataset, but suffer from training instability or memorization when objects in the two domains have very different structures. To mitigate this limitation, the authors propose a new smoothness similarity regularization that transfers the smoothness inherently learned by the pre-trained GAN to the few-shot target domain, even when the two domains are very different. The approach is evaluated by adapting an unconditional and a class-conditional GAN to diverse few-shot target domains: it significantly outperforms prior few-shot GAN adaptation methods in the challenging case of structurally dissimilar source-target domains, while performing on par with the state of the art for similar source-target domains.

Paper13 Augmenting and Aligning Snippets for Few-Shot Video Domain Adaptation

摘要原文: For video models to be transferred and applied seamlessly across video tasks in varied environments, Video Unsupervised Domain Adaptation (VUDA) has been introduced to improve the robustness and transferability of video models. However, current VUDA methods rely on a vast amount of high-quality unlabeled target data, which may not be available in real-world cases. We thus consider a more realistic Few-Shot Video-based Domain Adaptation (FSVDA) scenario where we adapt video models with only a few target video samples. While a few methods have touched upon Few-Shot Domain Adaptation (FSDA) in images and in FSVDA, they rely primarily on spatial augmentation for target domain expansion with alignment performed statistically at the instance level. However, videos contain more knowledge in terms of rich temporal and semantic information, which should be fully considered while augmenting target domains and performing alignment in FSVDA. We propose a novel SSA2lign to address FSVDA at the snippet level, where the target domain is expanded through a simple snippet-level augmentation followed by the attentive alignment of snippets both semantically and statistically, where semantic alignment of snippets is conducted through multiple perspectives. Empirical results demonstrate state-of-the-art performance of SSA2lign across multiple cross-domain action recognition benchmarks.

Summary: Video Unsupervised Domain Adaptation (VUDA) was introduced to improve the robustness and transferability of video models so they can be transferred and applied seamlessly across video tasks in varied environments. However, current VUDA methods rely on large amounts of high-quality unlabeled target data, which may not be available in real-world cases. The authors therefore consider a more realistic Few-Shot Video-based Domain Adaptation (FSVDA) scenario, adapting video models with only a few target video samples. A few methods have touched on few-shot domain adaptation for images and on FSVDA, but they rely mainly on spatial augmentation to expand the target domain, with alignment performed statistically at the instance level. Videos, however, contain rich temporal and semantic information that should be fully considered when augmenting the target domain and performing alignment. The authors propose SSA2lign, which addresses FSVDA at the snippet level: the target domain is expanded through simple snippet-level augmentation, followed by attentive semantic and statistical alignment of snippets, where semantic alignment is conducted from multiple perspectives. Empirical results show state-of-the-art performance of SSA2lign across multiple cross-domain action recognition benchmarks.

Paper14 Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation

摘要原文: Low-light conditions not only hamper human visual experience but also degrade the model’s performance on downstream vision tasks. While existing works make remarkable progress on day-night domain adaptation, they rely heavily on domain knowledge derived from the task-specific nighttime dataset. This paper challenges a more complicated scenario with border applicability, i.e., zero-shot day-night domain adaptation, which eliminates reliance on any nighttime data. Unlike prior zero-shot adaptation approaches emphasizing either image-level translation or model-level adaptation, we propose a similarity min-max paradigm that considers them under a unified framework. On the image level, we darken images towards minimum feature similarity to enlarge the domain gap. Then on the model level, we maximize the feature similarity between the darkened images and their normal-light counterparts for better model adaptation. To the best of our knowledge, this work represents the pioneering effort in jointly optimizing both aspects, resulting in a significant improvement of model generalizability. Extensive experiments demonstrate our method’s effectiveness and broad applicability on various nighttime vision tasks, including classification, semantic segmentation, visual place recognition, and video action recognition. Our project page is available at https://red-fairy.github.io/ZeroShotDayNightDA-Webpage/.

Summary: Low-light conditions not only degrade the human visual experience but also hurt model performance on downstream vision tasks. Existing work on day-night domain adaptation has made remarkable progress but relies heavily on domain knowledge derived from task-specific nighttime datasets. This paper tackles a more complicated setting with broader applicability, zero-shot day-night domain adaptation, which removes the dependence on any nighttime data. Unlike prior zero-shot adaptation approaches that emphasize either image-level translation or model-level adaptation, the authors propose a similarity min-max paradigm that considers both under a unified framework: at the image level, images are darkened toward minimum feature similarity to enlarge the domain gap; at the model level, the feature similarity between darkened images and their normal-light counterparts is maximized for better model adaptation. To the authors' knowledge this is the pioneering effort to jointly optimize both aspects, yielding a significant improvement in model generalizability. Extensive experiments demonstrate the method's effectiveness and broad applicability on various nighttime vision tasks, including classification, semantic segmentation, visual place recognition, and video action recognition. Project page: https://red-fairy.github.io/ZeroShotDayNightDA-Webpage/.
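The two-step objective can be illustrated with a toy sketch: a learnable gamma curve plays the role of the darkening module, and a small placeholder encoder stands in for the task network. This is only meant to convey the alternation between minimizing and maximizing feature similarity, not the paper's actual implementation or losses.

```python
import torch
import torch.nn.functional as F

encoder = torch.nn.Sequential(                      # placeholder feature extractor
    torch.nn.Conv2d(3, 16, 3, stride=2, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
)
log_gamma = torch.zeros(1, requires_grad=True)      # learnable darkening strength

def darken(x):
    # gamma curve: exponents > 1 darken an image with values in (0, 1]
    return x.clamp(min=1e-4) ** torch.exp(log_gamma)

def feat_sim(a, b):
    return F.cosine_similarity(encoder(a), encoder(b), dim=1).mean()

x = torch.rand(4, 3, 64, 64)                        # stand-in for normal-light images

# Step 1 (image level): tune the darkening towards MINIMUM feature similarity,
# i.e. gradient descent on the similarity w.r.t. log_gamma only.
opt_dark = torch.optim.Adam([log_gamma], lr=0.1)
for _ in range(10):
    opt_dark.zero_grad()
    feat_sim(darken(x), x).backward()
    opt_dark.step()

# Step 2 (model level): adapt the encoder to MAXIMIZE similarity between the
# darkened images and their normal-light counterparts (darkening kept fixed).
opt_enc = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(10):
    opt_enc.zero_grad()
    (-feat_sim(darken(x).detach(), x)).backward()
    opt_enc.step()
```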

Paper15 DETA: Denoised Task Adaptation for Few-Shot Learning

摘要原文: Test-time task adaptation in few-shot learning aims to adapt a pre-trained task-agnostic model for capturing task-specific knowledge of the test task, rely only on few-labeled support samples. Previous approaches generally focus on developing advanced algorithms to achieve the goal, while neglecting the inherent problems of the given support samples. In fact, with only a handful of samples available, the adverse effect of either the image noise (a.k.a. X-noise) or the label noise (a.k.a. Y-noise) from support samples can be severely amplified. To address this challenge, in this work we propose DEnoised Task Adaptation (DETA), a first, unified image- and label-denoising framework orthogonal to existing task adaptation approaches. Without extra supervision, DETA filters out task-irrelevant, noisy representations by taking advantage of both global visual information and local region details of support samples. On the challenging Meta-Dataset, DETA consistently improves the performance of a broad spectrum of baseline methods applied on various pre-trained models. Notably, by tackling the overlooked image noise in Meta-Dataset, DETA establishes new state-of-the-art results. Code is released at https://github.com/JimZAI/DETA.

Summary: Test-time task adaptation in few-shot learning adapts a pre-trained, task-agnostic model to capture task-specific knowledge of the test task, relying only on a few labeled support samples. Previous approaches focus on developing advanced adaptation algorithms while neglecting inherent problems in the support samples themselves: with only a handful of samples available, the adverse effect of image noise (X-noise) or label noise (Y-noise) in the support set can be severely amplified. To address this challenge, the authors propose DEnoised Task Adaptation (DETA), a first unified image- and label-denoising framework that is orthogonal to existing task adaptation approaches. Without extra supervision, DETA filters out task-irrelevant, noisy representations by exploiting both global visual information and local region details of the support samples. On the challenging Meta-Dataset, DETA consistently improves a broad spectrum of baseline methods applied to various pre-trained models, and by tackling the previously overlooked image noise in Meta-Dataset it establishes new state-of-the-art results. Code: https://github.com/JimZAI/DETA.

Paper16 Confidence-based Visual Dispersal for Few-shot Unsupervised Domain Adaptation

摘要原文: Unsupervised domain adaptation aims to transfer knowledge from a fully-labeled source domain to an unlabeled target domain. However, in real-world scenarios, providing abundant labeled data even in the source domain can be infeasible due to the difficulty and high expense of annotation. To address this issue, recent works consider the Few-shot Unsupervised Domain Adaptation (FUDA) where only a few source samples are labeled, and conduct knowledge transfer via self-supervised learning methods. Yet existing methods generally overlook that the sparse label setting hinders learning reliable source knowledge for transfer. Additionally, the learning difficulty difference in target samples is different but ignored, leaving hard target samples poorly classified. To tackle both deficiencies, in this paper, we propose a novel Confidence-based Visual Dispersal Transfer learning method (C-VisDiT) for FUDA. Specifically, C-VisDiT consists of a cross-domain visual dispersal strategy that transfers only high-confidence source knowledge for model adaptation and an intra-domain visual dispersal strategy that guides the learning of hard target samples with easy ones. We conduct extensive experiments on Office-31, Office-Home, VisDA-C, and DomainNet benchmark datasets and the results demonstrate that the proposed C-VisDiT significantly outperforms state-of-the-art FUDA methods. Our code is available at https://github.com/Bostoncake/C-VisDiT.

Summary: Unsupervised domain adaptation transfers knowledge from a fully labeled source domain to an unlabeled target domain, but in real-world scenarios providing abundant labels even in the source domain can be infeasible due to the difficulty and high cost of annotation. Recent work therefore considers Few-shot Unsupervised Domain Adaptation (FUDA), where only a few source samples are labeled and knowledge is transferred via self-supervised learning. Existing methods, however, generally overlook that the sparse-label setting hinders learning reliable source knowledge for transfer, and they ignore the differing learning difficulty among target samples, leaving hard target samples poorly classified. To tackle both deficiencies, this paper proposes a Confidence-based Visual Dispersal Transfer learning method (C-VisDiT) for FUDA, consisting of a cross-domain visual dispersal strategy that transfers only high-confidence source knowledge for model adaptation and an intra-domain visual dispersal strategy that guides the learning of hard target samples with easy ones. Extensive experiments on Office-31, Office-Home, VisDA-C, and DomainNet show that C-VisDiT significantly outperforms state-of-the-art FUDA methods. Code: https://github.com/Bostoncake/C-VisDiT.

Paper17 First Session Adaptation: A Strong Replay-Free Baseline for Class-Incremental Learning

摘要原文: In Class-Incremental Learning (CIL) an image classification system is exposed to new classes in each learning session and must be updated incrementally. Methods approaching this problem have updated both the classification head and the feature extractor body at each session of CIL. In this work, we develop a baseline method, First Session Adaptation (FSA), that sheds light on the efficacy of existing CIL approaches, and allows us to assess the relative performance contributions from head and body adaption. FSA adapts a pre-trained neural network body only on the first learning session and fixes it thereafter; a head based on linear discriminant analysis (LDA), is then placed on top of the adapted body, allowing exact updates through CIL. FSA is replay-free i.e. it does not memorize examples from previous sessions of continual learning. To empirically motivate FSA, we first consider a diverse selection of 22 image-classification datasets, evaluating different heads and body adaptation techniques in high/low-shot offline settings. We find that the LDA head performs well and supports CIL out-of-the-box. We also find that Featurewise Layer Modulation (FiLM) adapters are highly effective in the few-shot setting, and full-body adaption in the high-shot setting. Second, we empirically investigate various CIL settings including high-shot CIL and few-shot CIL, including settings that have previously been used in the literature. We show that FSA significantly improves over the state-of-the-art in 15 of the 16 settings considered. FSA with FiLM adapters is especially performant in the few-shot setting. These results indicate that current approaches to continuous body adaptation are not working as expected. Finally, we propose a measure that can be applied to a set of unlabelled inputs which is predictive of the benefits of body adaptation.

Summary: In Class-Incremental Learning (CIL), an image classification system is exposed to new classes in each learning session and must be updated incrementally. Existing methods update both the classification head and the feature-extractor body at every CIL session. This work develops a baseline method, First Session Adaptation (FSA), that sheds light on the efficacy of existing CIL approaches and allows the relative performance contributions of head and body adaptation to be assessed. FSA adapts a pre-trained network body only in the first learning session and fixes it thereafter; a head based on linear discriminant analysis (LDA) is then placed on top of the adapted body, allowing exact updates through CIL. FSA is replay-free, i.e., it does not memorize examples from previous sessions of continual learning. To motivate FSA empirically, the authors first evaluate different heads and body-adaptation techniques on a diverse selection of 22 image-classification datasets in high-shot and low-shot offline settings, finding that the LDA head performs well and supports CIL out of the box, that Featurewise Layer Modulation (FiLM) adapters are highly effective in the few-shot setting, and that full-body adaptation works best in the high-shot setting. They then investigate various CIL settings, including high-shot and few-shot CIL and settings previously used in the literature, showing that FSA significantly improves over the state of the art in 15 of the 16 settings considered, with FSA plus FiLM adapters especially strong in the few-shot setting. These results indicate that current approaches to continual body adaptation are not working as expected. Finally, the authors propose a measure, computable on a set of unlabeled inputs, that predicts the benefit of body adaptation.

Paper18 DomainAdaptor: A Novel Approach to Test-time Adaptation

摘要原文: To deal with the domain shift between training and test samples, current methods have primarily focused on learning generalizable features during training and ignore the specificity of unseen samples that are also critical during the test. In this paper, we investigate a more challenging task that aims to adapt a trained CNN model to unseen domains during the test. To maximumly mine the information in the test data, we propose a unified method called DomainAdaptor for the test-time adaptation, which consists of an AdaMixBN module and a Generalized Entropy Minimization (GEM) loss. Specifically, AdaMixBN addresses the domain shift by adaptively fusing training and test statistics in the normalization layer via a dynamic mixture coefficient and a statistic transformation operation. To further enhance the adaptation ability of AdaMixBN, we design a GEM loss that extends the Entropy Minimization loss to better exploit the information in the test data. Extensive experiments show DomainAdaptor consistently outperforms the state-of-the-art methods on four benchmarks. Furthermore, our method brings more remarkable improvement against existing methods on the few-data unseen domain. The code is available at https://github.com/koncle/DomainAdaptor.

Summary: This paper studies how to adapt a trained CNN model to unseen domains at test time. To mine as much information as possible from the test data, the authors propose a unified method called DomainAdaptor for test-time adaptation, consisting of an AdaMixBN module and a Generalized Entropy Minimization (GEM) loss. AdaMixBN handles the domain shift by adaptively fusing training and test statistics in the normalization layers via a dynamic mixture coefficient and a statistic transformation operation. To further strengthen AdaMixBN's adaptation ability, the GEM loss extends entropy minimization to better exploit the information in the test data. Extensive experiments show DomainAdaptor consistently outperforms state-of-the-art methods on four benchmarks, with especially notable improvements on unseen domains with few data. Code: https://github.com/koncle/DomainAdaptor.
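A simplified sketch of the statistic-fusion idea behind AdaMixBN: mix stored source statistics with the current test-batch statistics before normalizing. The fixed `alpha` here is a placeholder; the paper derives the mixture coefficient dynamically and adds a statistic transformation that is not reproduced.

```python
import torch

def mixed_batch_norm(x, running_mean, running_var, weight, bias, alpha=0.7, eps=1e-5):
    """x: (N, C, H, W). alpha weights the test-batch statistics against the source ones."""
    batch_mean = x.mean(dim=(0, 2, 3))
    batch_var = x.var(dim=(0, 2, 3), unbiased=False)
    mean = alpha * batch_mean + (1 - alpha) * running_mean
    var = alpha * batch_var + (1 - alpha) * running_var
    x_hat = (x - mean.view(1, -1, 1, 1)) / torch.sqrt(var.view(1, -1, 1, 1) + eps)
    return weight.view(1, -1, 1, 1) * x_hat + bias.view(1, -1, 1, 1)
```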

Paper19 CMDA: Cross-Modality Domain Adaptation for Nighttime Semantic Segmentation

摘要原文: Most nighttime semantic segmentation studies are based on domain adaptation approaches and image input. However, limited by the low dynamic range of conventional cameras, images fail to capture structural details and boundary information in low-light conditions. Event cameras, as a new form of vision sensors, are complementary to conventional cameras with their high dynamic range. To this end, we propose a novel unsupervised Cross-Modality Domain Adaptation (CMDA) framework to leverage multi-modality (Images and Events) information for nighttime semantic segmentation, with only labels on daytime images. In CMDA, we design the Image Motion-Extractor to extract motion information and the Image Content-Extractor to extract content information from images, in order to bridge the gap between different modalities (Images to Events) and domains (Day to Night). Besides, we introduce the first image-event nighttime semantic segmentation dataset. Extensive experiments on both the public image dataset and the proposed image-event dataset demonstrate the effectiveness of our proposed approach. We open-source our code, models, and dataset at https://github.com/XiaRho/CMDA.

Summary: Most nighttime semantic segmentation studies are based on domain adaptation approaches and image input, but limited by the low dynamic range of conventional cameras, images fail to capture structural details and boundary information in low-light conditions. Event cameras, a new type of vision sensor with high dynamic range, are complementary to conventional cameras. The authors therefore propose a novel unsupervised Cross-Modality Domain Adaptation (CMDA) framework that leverages multi-modal information (images and events) for nighttime semantic segmentation, using labels only on daytime images. In CMDA, an Image Motion-Extractor extracts motion information and an Image Content-Extractor extracts content information from images, bridging the gap between modalities (images to events) and domains (day to night). The authors also introduce the first image-event nighttime semantic segmentation dataset. Extensive experiments on both a public image dataset and the proposed image-event dataset demonstrate the effectiveness of the approach. Code, models, and dataset are open-sourced at https://github.com/XiaRho/CMDA.

Paper20 Local Context-Aware Active Domain Adaptation

摘要原文: Active Domain Adaptation (ADA) queries the labels of a small number of selected target samples to help adapting a model from a source domain to a target domain. The local context of queried data is important, especially when the domain gap is large. However, this has not been fully explored by existing ADA works. In this paper, we propose a Local context-aware ADA framework, named LADA, to address this issue. To select informative target samples, we devise a novel criterion based on the local inconsistency of model predictions. Since the labeling budget is usually small, fine-tuning model on only queried data can be inefficient. We progressively augment labeled target data with the confident neighbors in a class-balanced manner. Experiments validate that the proposed criterion chooses more informative target samples than existing active selection strategies. Furthermore, our full method clearly surpasses recent ADA arts on various benchmarks. Code is available at https://github.com/tsun/LADA.

Summary: Active Domain Adaptation (ADA) queries the labels of a small number of selected target samples to help adapt a model from a source domain to a target domain. The local context of the queried data is important, especially when the domain gap is large, yet existing ADA work has not fully explored it. This paper proposes a Local context-aware ADA framework, LADA, to address this issue. To select informative target samples, a novel criterion based on the local inconsistency of model predictions is devised. Since the labeling budget is usually small, fine-tuning on the queried data alone can be inefficient, so the labeled target set is progressively augmented with confident neighbors in a class-balanced manner. Experiments confirm that the proposed criterion selects more informative target samples than existing active selection strategies, and the full method clearly surpasses recent ADA methods on various benchmarks. Code: https://github.com/tsun/LADA.
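One plausible, hypothetical reading of the local-inconsistency criterion is to score each target sample by how often its predicted class disagrees with its nearest neighbors in feature space, as sketched below; the exact criterion used in LADA may differ.

```python
import torch
import torch.nn.functional as F

def local_inconsistency(features, logits, k=10):
    """features: (N, D); logits: (N, C). Returns a per-sample inconsistency score in [0, 1]."""
    preds = logits.argmax(dim=1)                                   # (N,)
    feats = F.normalize(features, dim=1)
    sim = feats @ feats.t()                                        # cosine similarity (N, N)
    sim.fill_diagonal_(-2.0)                                       # exclude each sample itself
    knn = sim.topk(k, dim=1).indices                               # (N, k) neighbor indices
    neighbor_preds = preds[knn]                                    # (N, k)
    disagree = (neighbor_preds != preds.unsqueeze(1)).float()      # 1 where a neighbor disagrees
    return disagree.mean(dim=1)                                    # higher = more inconsistent
```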

Paper21 SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets

摘要原文: Scene understanding using multi-modal data is necessary in many applications, e.g., autonomous navigation. To achieve this in a variety of situations, existing models must be able to adapt to shifting data distributions without arduous data annotation. Current approaches assume that the source data is available during adaptation and that the source consists of paired multi-modal data. Both these assumptions may be problematic for many applications. Source data may not be available due to privacy, security, or economic concerns. Assuming the existence of paired multi-modal data for training also entails significant data collection costs and fails to take advantage of widely available freely distributed pre-trained uni-modal models. In this work, we relax both of these assumptions by addressing the problem of adapting a set of models trained independently on uni-modal data to a target domain consisting of unlabeled multi-modal data, without having access to the original source dataset. Our proposed approach solves this problem through a switching framework which automatically chooses between two complementary methods of cross-modal pseudo-label fusion – agreement filtering and entropy weighting – based on the estimated domain gap. We demonstrate our work on the semantic segmentation problem. Experiments across seven challenging adaptation scenarios verify the efficacy of our approach, achieving results comparable to, and in some cases outperforming, methods which assume access to source data. Our method achieves an improvement in mIoU of up to 12% over competing baselines. Our code is publicly available at https://github.com/csimo005/SUMMIT.

Summary: Scene understanding with multi-modal data is necessary in many applications, such as autonomous navigation, and models must adapt to shifting data distributions without laborious annotation. Current approaches assume the source data is available during adaptation and consists of paired multi-modal data; both assumptions can be problematic, since source data may be unavailable for privacy, security, or economic reasons, and collecting paired multi-modal training data is expensive and fails to exploit freely available pre-trained uni-modal models. This work relaxes both assumptions by adapting a set of models trained independently on uni-modal data to a target domain of unlabeled multi-modal data, without access to the original source dataset. The proposed approach uses a switching framework that automatically chooses between two complementary cross-modal pseudo-label fusion methods, agreement filtering and entropy weighting, based on the estimated domain gap. Demonstrated on semantic segmentation, the method is validated across seven challenging adaptation scenarios, achieving results comparable to, and in some cases exceeding, methods that assume access to source data, with mIoU gains of up to 12% over competing baselines. Code: https://github.com/csimo005/SUMMIT.
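The two fusion rules named in the abstract could be sketched as follows, where agreement filtering keeps a pseudo-label only when the two uni-modal models agree, and entropy weighting favors the more confident (lower-entropy) prediction; the actual switching logic based on the estimated domain gap is not shown.

```python
import torch
import torch.nn.functional as F

def agreement_filter(logits_a, logits_b):
    """Keep pseudo-labels only where both modalities predict the same class."""
    pred_a, pred_b = logits_a.argmax(dim=1), logits_b.argmax(dim=1)
    mask = pred_a == pred_b
    return pred_a, mask                       # labels plus a validity mask

def entropy_weighted_fusion(logits_a, logits_b, eps=1e-8):
    """Fuse soft predictions, weighting the lower-entropy (more confident) model higher."""
    p_a, p_b = F.softmax(logits_a, dim=1), F.softmax(logits_b, dim=1)
    h_a = -(p_a * torch.log(p_a + eps)).sum(dim=1, keepdim=True)
    h_b = -(p_b * torch.log(p_b + eps)).sum(dim=1, keepdim=True)
    w_a = torch.exp(-h_a) / (torch.exp(-h_a) + torch.exp(-h_b))
    fused = w_a * p_a + (1 - w_a) * p_b
    return fused.argmax(dim=1)
```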

Paper22 To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation

摘要原文: The goal of Online Domain Adaptation for semantic segmentation is to handle unforeseeable domain changes that occur during deployment, like sudden weather events. However, the high computational costs associated with brute-force adaptation make this paradigm unfeasible for real-world applications. In this paper we propose HAMLET, a Hardware-Aware Modular Least Expensive Training framework for real-time domain adaptation. Our approach includes a hardware-aware back-propagation orchestration agent (HAMT) and a dedicated domain-shift detector that enables active control over when and how the model is adapted (LT). Thanks to these advancements, our approach is capable of performing semantic segmentation while simultaneously adapting at more than 29FPS on a single consumer-grade GPU. Our framework’s encouraging accuracy and speed trade-off is demonstrated on OnDA and SHIFT benchmarks through experimental results.

Summary: Online domain adaptation for semantic segmentation aims to handle unforeseeable domain changes that occur during deployment, such as sudden weather events, but the high computational cost of brute-force adaptation makes this paradigm infeasible for real-world applications. This paper proposes HAMLET, a Hardware-Aware Modular Least Expensive Training framework for real-time domain adaptation. The approach includes a hardware-aware back-propagation orchestration agent (HAMT) and a dedicated domain-shift detector that enables active control over when and how the model is adapted (LT). Thanks to these advances, the method performs semantic segmentation while simultaneously adapting at more than 29 FPS on a single consumer-grade GPU. Experiments on the OnDA and SHIFT benchmarks demonstrate an encouraging accuracy-speed trade-off.

Paper23 LiDAR-UDA: Self-ensembling Through Time for Unsupervised LiDAR Domain Adaptation

摘要原文: We introduce LiDAR-UDA, a novel two-stage self-training-based Unsupervised Domain Adaptation (UDA) method for LiDAR segmentation. Existing self-training methods use a model trained on labeled source data to generate pseudo labels for target data and refine the predictions via fine-tuning the network on the pseudo labels. These methods suffer from domain shifts caused by different LiDAR sensor configurations in the source and target domains. We propose two techniques to reduce sensor discrepancy and improve pseudo label quality: 1) LiDAR beam subsampling, which simulates different LiDAR scanning patterns by randomly dropping beams; 2) cross-frame ensembling, which exploits temporal consistency of consecutive frames to generate more reliable pseudo labels. Our method is simple, generalizable, and does not incur any extra inference cost. We evaluate our method on several public LiDAR datasets and show that it outperforms the state-of-the-art methods by more than 3.9% mIoU on average for all scenarios. Code will be available at https://github.com/JHLee0513/lidar_uda.

Summary: LiDAR-UDA is a novel two-stage, self-training-based unsupervised domain adaptation (UDA) method for LiDAR segmentation. Existing self-training methods use a model trained on labeled source data to generate pseudo-labels for target data and refine the predictions by fine-tuning the network on them, but they suffer from domain shifts caused by differing LiDAR sensor configurations between the source and target domains. Two techniques are proposed to reduce the sensor discrepancy and improve pseudo-label quality: 1) LiDAR beam subsampling, which simulates different LiDAR scanning patterns by randomly dropping beams; and 2) cross-frame ensembling, which exploits the temporal consistency of consecutive frames to generate more reliable pseudo-labels. The method is simple, generalizable, and incurs no extra inference cost. Evaluated on several public LiDAR datasets, it outperforms state-of-the-art methods by more than 3.9% mIoU on average across all scenarios. Code will be available at https://github.com/JHLee0513/lidar_uda.
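Beam subsampling is easy to illustrate: randomly keep a subset of beams (rings) to mimic a sensor with fewer channels. The sketch assumes each point carries a beam index, which is how many LiDAR datasets are organized; the paper's exact sampling scheme may differ.

```python
import numpy as np

def subsample_beams(points, beam_ids, keep_ratio=0.5, rng=None):
    """points: (N, 3+) array; beam_ids: (N,) integer beam index per point."""
    if rng is None:
        rng = np.random.default_rng()
    beams = np.unique(beam_ids)
    n_keep = max(1, int(len(beams) * keep_ratio))
    kept_beams = rng.choice(beams, size=n_keep, replace=False)  # beams that survive
    mask = np.isin(beam_ids, kept_beams)
    return points[mask], beam_ids[mask]
```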

Paper24 Black-Box Unsupervised Domain Adaptation with Bi-Directional Atkinson-Shiffrin Memory

摘要原文: Black-box unsupervised domain adaptation (UDA) learns with source predictions of target data without accessing either source data or source models during training, and it has clear superiority in data privacy and flexibility in target network selection. However, the source predictions of target data are often noisy and training with them is prone to learning collapses. We propose BiMem, a bi-directional memorization mechanism that learns to remember useful and representative information to correct noisy pseudo labels on the fly, leading to robust black-box UDA that can generalize across different visual recognition tasks. BiMem constructs three types of memory, including sensory memory, short-term memory, and long-term memory, which interact in a bi-directional manner for comprehensive and robust memorization of learnt features. It includes a forward memorization flow that identifies and stores useful features and a backward calibration flow that rectifies features’ pseudo labels progressively. Extensive experiments show that BiMem achieves superior domain adaptation performance consistently across various visual recognition tasks such as image classification, semantic segmentation and object detection.

Summary: Black-box unsupervised domain adaptation (UDA) learns from source predictions of target data without accessing source data or source models during training, offering clear advantages in data privacy and flexibility in choosing the target network. However, the source predictions of target data are often noisy, and training with them is prone to learning collapse. The authors propose BiMem, a bi-directional memorization mechanism that learns to remember useful and representative information to correct noisy pseudo-labels on the fly, yielding robust black-box UDA that generalizes across different visual recognition tasks. BiMem constructs three types of memory, sensory, short-term, and long-term, which interact in a bi-directional manner for comprehensive and robust memorization of learned features: a forward memorization flow identifies and stores useful features, and a backward calibration flow progressively rectifies the features' pseudo-labels. Extensive experiments show BiMem achieves superior domain adaptation performance consistently across various visual recognition tasks such as image classification, semantic segmentation, and object detection.

Paper25 PODA: Prompt-driven Zero-shot Domain Adaptation

摘要原文: Domain adaptation has been vastly investigated in computer vision but still requires access to target images at train time, which might be intractable in some uncommon conditions. In this paper, we propose the task of ‘Prompt-driven Zero-shot Domain Adaptation’, where we adapt a model trained on a source domain using only a general description in natural language of the target domain, i.e., a prompt. First, we leverage a pretrained contrastive vision-language model (CLIP) to optimize affine transformations of source features, steering them towards the target text embedding while preserving their content and semantics. To achieve this, we propose Prompt-driven Instance Normalization (PIN). Second, we show that these prompt-driven augmentations can be used to perform zero-shot domain adaptation for semantic segmentation. Experiments demonstrate that our method significantly outperforms CLIP-based style transfer baselines on several datasets for the downstream task at hand, even surpassing one-shot unsupervised domain adaptation. A similar boost is observed on object detection and image classification. The code is available at https://github.com/astra-vision/PODA .

Summary: This paper introduces the task of "Prompt-driven Zero-shot Domain Adaptation". Domain adaptation has been vastly investigated in computer vision but still requires access to target images at training time, which can be intractable in some uncommon conditions. Here, a model trained on a source domain is adapted using only a general natural-language description (a prompt) of the target domain. First, a pretrained contrastive vision-language model (CLIP) is leveraged to optimize affine transformations of source features, steering them toward the target text embedding while preserving their content and semantics; this is realized with Prompt-driven Instance Normalization (PIN). Second, these prompt-driven augmentations are shown to enable zero-shot domain adaptation for semantic segmentation. Experiments show the method significantly outperforms CLIP-based style-transfer baselines on several datasets for the downstream task, even surpassing one-shot unsupervised domain adaptation, with similar boosts observed on object detection and image classification. Code: https://github.com/astra-vision/PODA.
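A rough, hypothetical sketch of the style-steering step: re-style per-channel feature statistics with learnable mean/std (AdaIN-like) and pull the resulting embedding toward a prompt embedding with a cosine loss. `encode_rest` and `text_embedding` are placeholders supplied by the caller; the paper performs this inside CLIP's feature hierarchy with its own loss design.

```python
import torch
import torch.nn.functional as F

def restyle(feat, new_mean, new_std, eps=1e-5):
    """feat: (N, C, H, W); new_mean/new_std: (C,) learnable style parameters."""
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True) + eps
    return new_std.view(1, -1, 1, 1) * (feat - mu) / sigma + new_mean.view(1, -1, 1, 1)

def pin_style_loss(feat, new_mean, new_std, encode_rest, text_embedding):
    """Cosine distance between the re-styled image embedding and the prompt embedding."""
    emb = encode_rest(restyle(feat, new_mean, new_std))            # (N, D) image embeddings
    target = text_embedding.expand_as(emb)                         # (N, D) prompt embedding
    return 1.0 - F.cosine_similarity(emb, target, dim=1).mean()
```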

Paper26 SSDA: Secure Source-Free Domain Adaptation

摘要原文: Source-free domain adaptation (SFDA) is a popular unsupervised domain adaptation method where a pre-trained model from a source domain is adapted to a target domain without accessing any source data. Despite rich results in this area, existing literature overlooks the security challenges of the unsupervised SFDA setting in presence of a malicious source domain owner. This work investigates the effect of a source adversary which may inject a hidden malicious behavior (Backdoor/Trojan) during source training and potentially transfer it to the target domain even after benign training by the victim (target domain owner). Our investigation of the current SFDA setting reveals that because of the unique challenges present in SFDA (e.g., no source data, target label), defending against backdoor attack using existing defenses become practically ineffective in protecting the target model. To address this, we propose a novel target domain protection scheme called secure source-free domain adaptation (SSDA). SSDA adopts a single-shot model compression of a pre-trained source model and a novel knowledge transfer scheme with a spectral-norm-based loss penalty for target training. The proposed static compression and the dynamic training loss penalty are designed to suppress the malicious channels responsive to the backdoor during the adaptation stage. At the same time, the knowledge transfer from an uncompressed auxiliary model helps to recover the benign test accuracy. Our extensive evaluation on multiple dataset and domain tasks against recent backdoor attacks reveal that the proposed SSDA can successfully defend against strong backdoor attacks with little to no degradation in test accuracy compared to the vulnerable baseline SFDA methods. Our code is available at https://github.com/ML-Security-Research-LAB/SSDA.

Summary: Source-free domain adaptation (SFDA) is a popular unsupervised domain adaptation method in which a model pre-trained on a source domain is adapted to a target domain without accessing any source data. Despite rich results in this area, existing literature overlooks the security challenges of the unsupervised SFDA setting in the presence of a malicious source-domain owner. This work investigates a source adversary that may inject a hidden malicious behavior (a backdoor/Trojan) during source training, which can potentially transfer to the target domain even after benign training by the victim (the target-domain owner). The investigation shows that because of SFDA's unique challenges (no source data, no target labels), defending against backdoor attacks with existing defenses is practically ineffective at protecting the target model. To address this, the authors propose a target-domain protection scheme called secure source-free domain adaptation (SSDA), which combines a single-shot compression of the pre-trained source model with a novel knowledge-transfer scheme that uses a spectral-norm-based loss penalty for target training. The static compression and the dynamic training-loss penalty are designed to suppress the malicious channels responsive to the backdoor during adaptation, while knowledge transfer from an uncompressed auxiliary model helps recover the benign test accuracy. Extensive evaluation on multiple datasets and domain tasks against recent backdoor attacks shows that SSDA successfully defends against strong backdoor attacks with little to no degradation in test accuracy compared to the vulnerable baseline SFDA methods. Code: https://github.com/ML-Security-Research-LAB/SSDA.

Paper27 Class-Aware Patch Embedding Adaptation for Few-Shot Image Classification

摘要原文: “A picture is worth a thousand words”, significantly beyond mere a categorization. Accompanied by that, many patches of the image could have completely irrelevant meanings with the categorization if they were independently observed. This could significantly reduce the efficiency of a large family of few-shot learning algorithms, which have limited data and highly rely on the comparison of image patches. To address this issue, we propose a Class-aware Patch Embedding Adaptation (CPEA) method to learn “class-aware embeddings” of the image patches. The key idea of CPEA is to integrate patch embeddings with class-aware embeddings to make them class-relevant. Furthermore, we define a dense score matrix between class-relevant patch embeddings across images, based on which the degree of similarity between paired images is quantified. Visualization results show that CPEA concentrates patch embeddings by class, thus making them class-relevant. Extensive experiments on four benchmark datasets, miniImageNet, tieredImageNet, CIFAR-FS, and FC-100, indicate that our CPEA significantly outperforms the existing state-of-the-art methods. The source code is available at https://github.com/FushengHao/CPEA.

Summary: This paper introduces Class-aware Patch Embedding Adaptation (CPEA) to address the problem that many image patches, viewed independently, can have meanings completely irrelevant to the image's category, which hurts few-shot learning algorithms that have limited data and rely heavily on patch comparison. The key idea of CPEA is to integrate patch embeddings with class-aware embeddings to make them class-relevant. A dense score matrix is then defined between class-relevant patch embeddings across images, from which the degree of similarity between paired images is quantified. Visualization results show that CPEA concentrates patch embeddings by class, making them class-relevant. Extensive experiments on four benchmark datasets (miniImageNet, tieredImageNet, CIFAR-FS, and FC-100) show that CPEA significantly outperforms existing state-of-the-art methods. Code: https://github.com/FushengHao/CPEA.
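The dense score matrix can be illustrated as plain pairwise cosine similarities between the patch embeddings of two images, aggregated into one image-to-image score; the class-aware adaptation of the embeddings, which is the paper's main contribution, is omitted here.

```python
import torch
import torch.nn.functional as F

def dense_patch_similarity(patches_a, patches_b):
    """patches_a: (Pa, D), patches_b: (Pb, D) patch embeddings of two images."""
    a = F.normalize(patches_a, dim=1)
    b = F.normalize(patches_b, dim=1)
    score_matrix = a @ b.t()              # (Pa, Pb) dense score matrix
    return score_matrix.mean()            # scalar similarity between the two images
```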

Paper28 Look at the Neighbor: Distortion-aware Unsupervised Domain Adaptation for Panoramic Semantic Segmentation

摘要原文: Endeavors have been recently made to transfer knowledge from the labeled pinhole image domain to the unlabeled panoramic image domain via Unsupervised Domain Adaptation (UDA). The aim is to tackle the domain gaps caused by the style disparities and distortion problem of the non-uniformly distributed pixels of equirectangular projection (ERP). Previous works typically focus on transferring knowledge based on geometric priors with specially designed multi-branch network architectures. As a result, considerable computational costs are induced, and meanwhile, their generalization abilities are profoundly hindered by the variation of distortion among pixels. In this paper, we find that the pixels’ neighborhood regions of the ERP indeed introduce less distortion. Intuitively, we propose a novel UDA framework that can effectively address the distortion problems for panoramic semantic segmentation. In comparison, our method is simpler, easier to implement, and more computationally efficient. Specifically, we propose distortion-aware attention (DA) capturing the neighboring pixel distribution without using any geometric constraints. Moreover, we propose a class-wise feature aggregation (CFA) module to iteratively update the feature representations with a memory bank. As such, the feature similarity between two domains can be consistently optimized. Extensive experiments show that our method achieves new state-of-the-art performance while remarkably reducing 80% parameters.

Summary: Recent efforts transfer knowledge from the labeled pinhole-image domain to the unlabeled panoramic-image domain via unsupervised domain adaptation (UDA), aiming to bridge the domain gaps caused by style disparities and by the distortion of the non-uniformly distributed pixels of the equirectangular projection (ERP). Previous work typically transfers knowledge based on geometric priors with specially designed multi-branch network architectures, which incurs considerable computational cost, and their generalization ability is profoundly hindered by the varying distortion across pixels. This paper observes that the neighborhood regions of ERP pixels actually introduce less distortion, and accordingly proposes a novel UDA framework that effectively addresses the distortion problem in panoramic semantic segmentation while being simpler, easier to implement, and more computationally efficient. Specifically, a distortion-aware attention (DA) module captures the neighboring pixel distribution without using any geometric constraints, and a class-wise feature aggregation (CFA) module iteratively updates the feature representations with a memory bank, so the feature similarity between the two domains can be consistently optimized. Extensive experiments show the method achieves new state-of-the-art performance while remarkably reducing parameters by 80%.
