[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉语言导航

专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。

为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务,只需VX关注公号并回复{邮箱+论文主题}(如:123456@xx.com + chatgpt@large language model @LLM),主题必须是同一个领域,最多三个关键词。解释权归博主所有


分类:

== LLM ==

标题: Training-Free Consistent Text-to-Image Generation

作者: Yoad Tewel, Omri Kaduri, Rinon Gal

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03286v1

Project: https://consistory-paper.github.io|

中文摘要: 文本到图像模型允许用户通过自然语言指导图像生成过程,从而提供了新的创作灵活性。然而,使用这些模型在不同的提示中一致地描绘同一主体仍然具有挑战性。现有方法或者对模型进行微调,教它用新词来描述特定的用户提供的主体,或者为模型添加图像条件。这些方法需要冗长的逐主体优化或大规模的预训练。此外,它们很难将生成的图像与文本提示对齐,并且在描绘多个主体时面临困难。在这里,我们提出了ConsiStory,这是一种免训练的方法,通过共享预训练模型的内部激活来实现一致的主体生成。我们引入了主体驱动的共享注意力块和基于对应关系的特征注入,以促进图像之间的主体一致性。此外,我们制定策略来鼓励布局多样性,同时保持主体一致性。我们将ConsiStory与一系列基线进行比较,展示了在主体一致性和文本对齐方面的最新性能,而不需要任何优化步骤。最后,ConsiStory可以自然地扩展到多主体场景,甚至可以实现对常见对象的免训练个性化。

摘要: Text-to-image models offer a new level of creative flexibility by allowing users to guide the image generation process through natural language. However, using these models to consistently portray the same subject across diverse prompts remains challenging. Existing approaches fine-tune the model to teach it new words that describe specific user-provided subjects or add image conditioning to the model. These methods require lengthy per-subject optimization or large-scale pre-training. Moreover, they struggle to align generated images with text prompts and face difficulties in portraying multiple subjects. Here, we present ConsiStory, a training-free approach that enables consistent subject generation by sharing the internal activations of the pretrained model. We introduce a subject-driven shared attention block and correspondence-based feature injection to promote subject consistency between images. Additionally, we develop strategies to encourage layout diversity while maintaining subject consistency. We compare ConsiStory to a range of baselines, and demonstrate state-of-the-art performance on subject consistency and text alignment, without requiring a single optimization step. Finally, ConsiStory can naturally extend to multi-subject scenarios, and even enable training-free personalization for common objects.
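
ConsiStory's key mechanism, as described above, is letting every image in a generated batch attend to the subject patches of the other images inside the pretrained model's self-attention layers. The sketch below is only a minimal illustration of that idea in PyTorch, not the authors' code; it assumes a per-image boolean `subject_mask` (e.g., derived from cross-attention maps) is provided from elsewhere.

```python
# Illustrative sketch of "subject-driven shared attention" (not the official ConsiStory code).
# Assumption: `subject_mask` marking subject patches per image is obtained elsewhere, and this
# would stand in for a self-attention block inside a pretrained diffusion UNet.
import torch
import torch.nn.functional as F

def shared_self_attention(q, k, v, subject_mask):
    """
    q, k, v:       (B, N, D) per-image queries/keys/values from one attention layer
    subject_mask:  (B, N) boolean, True where a patch belongs to the shared subject
    Each image attends to its own tokens plus the subject tokens of every other image.
    """
    B, N, D = q.shape
    outputs = []
    for i in range(B):
        keys, values = [k[i]], [v[i]]
        for j in range(B):
            if j == i:
                continue
            keys.append(k[j][subject_mask[j]])    # only subject patches are shared
            values.append(v[j][subject_mask[j]])
        k_ext = torch.cat(keys, dim=0)            # (N + N_shared, D)
        v_ext = torch.cat(values, dim=0)
        attn = F.softmax(q[i] @ k_ext.T / D**0.5, dim=-1)
        outputs.append(attn @ v_ext)
    return torch.stack(outputs)                   # (B, N, D)
```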


标题: Comparative Analysis of LLaMA and ChatGPT Embeddings for Molecule Embedding

作者: Shaghayegh Sadeghi, Alan Bui, Ali Forooghi

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.00024v2

GitHub: https://github.com/sshaghayeghs/LLaMA-VS-ChatGPT|

中文摘要: 目的:像ChatGPT和LLaMA这样的大型语言模型(LLMs)在化学信息学领域的潜力越来越得到认可,特别是在解释简化分子线性输入规范(SMILES)方面,这是一种表示化学结构的标准方法。这些LLM可以将SMILES字符串解码成向量表示,为理解化学图提供了一种新的方法。方法:我们研究了ChatGPT和LLaMA在嵌入SMILES字符串方面的表现。我们的评估集中在两个关键应用上:分子性质(MP)预测和药物-药物相互作用(DDI)预测,这两个应用在药物开发和医疗保健中都是必不可少的。结果:我们发现在MP和DDI预测任务中,使用LLaMA生成的SMILES嵌入优于来自ChatGPT的SMILES嵌入。值得注意的是,基于LLaMA的SMILES嵌入在两种预测任务中都显示出与现有方法相当的结果。结论:LLMs在化学信息学中的应用,特别是在利用SMILES嵌入方面,为推进药物开发显示了重要的前景,包括改进化学性质的预测和促进药物发现过程。GitHub:https://github.com/sshaghayeghs/LLaMA-VS-ChatGPT

摘要: Purpose: Large Language Models (LLMs) like ChatGPT and LLaMA are increasingly recognized for their potential in the field of cheminformatics, particularly in interpreting Simplified Molecular Input Line Entry System (SMILES), a standard method for representing chemical structures. These LLMs can decode SMILES strings into vector representations, providing a novel approach to understanding chemical graphs. Methods: We investigate the performance of ChatGPT and LLaMA in embedding SMILES strings. Our evaluation focuses on two key applications: molecular property (MP) prediction and drug-drug interaction (DDI) prediction, both essential in drug development and healthcare. Results: We find that SMILES embeddings generated using LLaMA outperform those from ChatGPT in both MP and DDI prediction tasks. Notably, LLaMA-based SMILES embeddings show results comparable to existing methods in both prediction tasks. Conclusion: The application of LLMs in cheminformatics, particularly in utilizing SMILES embeddings, shows significant promise for advancing drug development. This includes improving the prediction of chemical properties and facilitating the drug discovery process. GitHub: https://github.com/sshaghayeghs/LLaMA-VS-ChatGPT
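
As a rough picture of how such an evaluation can be set up, the sketch below mean-pools an LLM's last hidden states over a SMILES string and trains a simple probe on the resulting embeddings. The checkpoint name, pooling choice, and classifier are illustrative assumptions and may differ from the paper's exact protocol.

```python
# Sketch: LLM hidden states as molecule embeddings for property prediction.
# Assumptions: a causal LM checkpoint from Hugging Face (name is illustrative and gated),
# mean pooling over the last hidden layer, and logistic regression as the probe.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

name = "meta-llama/Llama-2-7b-hf"                     # illustrative checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()

@torch.no_grad()
def embed_smiles(smiles: str) -> torch.Tensor:
    inputs = tok(smiles, return_tensors="pt")
    hidden = model(**inputs).last_hidden_state        # (1, T, D)
    return hidden.mean(dim=1).squeeze(0)              # mean-pooled embedding (D,)

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]          # toy molecules
labels = [0, 1, 0]                                    # toy property labels
X = torch.stack([embed_smiles(s) for s in smiles_list]).numpy()
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```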


标题: Weak-to-Strong Jailbreaking on Large Language Models

作者: Xuandong Zhao, Xianjun Yang, Tianyu Pang

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2401.17256v2

GitHub: https://github.com/XuandongZhao/weak-to-strong|

中文摘要: 大型语言模型(LLMs)容易受到越狱攻击,从而产生有害、不道德或有偏见的文本。然而,现有的越狱方法计算成本很高。在本文中,我们提出了弱到强越狱攻击,这是一种攻击对齐LLMs以产生有害文本的高效方法。我们的关键直觉基于这样的观察:越狱模型和对齐模型仅在初始解码分布上有所不同。弱到强攻击的关键技术见解是使用两个较小的模型(一个安全模型和一个不安全模型)来对抗地修改一个明显更大的安全模型的解码概率。我们评估了对来自3个组织的5个不同LLM的弱到强攻击。结果表明,我们的方法只需对每个样本进行一次前向传递,就可以在两个数据集上将失准率提高到99%以上。我们的研究揭示了在对齐LLMs时需要解决的一个紧迫的安全问题。作为初步尝试,我们提出了一种防御策略来抵御这种攻击,但创建更先进的防御仍然具有挑战性。复现该方法的代码可从 https://github.com/XuandongZhao/weak-to-strong 获得。

摘要: Large language models (LLMs) are vulnerable to jailbreak attacks - resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient method to attack aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models only differ in their initial decoding distributions. The weak-to-strong attack’s key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model’s decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong
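
The core operation suggested by the abstract is a per-token rescaling of the large aligned model's decoding distribution by the ratio between a small unsafe model and a small safe model. The function below is a hedged sketch of that logit arithmetic; the combination rule and the `alpha` knob are assumptions for illustration rather than the paper's verified formulation.

```python
# Sketch of the weak-to-strong decoding modification described in the abstract: the large
# aligned model's token distribution is adversarially rescaled by the ratio of a small unsafe
# model to a small safe model. `alpha` and the exact rule are illustrative assumptions.
import torch

def weak_to_strong_logits(logits_large, logits_weak_unsafe, logits_weak_safe, alpha=1.0):
    """All inputs: (vocab_size,) next-token logits from the three models."""
    log_p_large = torch.log_softmax(logits_large, dim=-1)
    log_ratio = torch.log_softmax(logits_weak_unsafe, dim=-1) - \
                torch.log_softmax(logits_weak_safe, dim=-1)
    combined = log_p_large + alpha * log_ratio    # log p_large + alpha * (log p_unsafe - log p_safe)
    return torch.softmax(combined, dim=-1)        # renormalized next-token distribution
```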


标题: Guiding Instruction-based Image Editing via Multimodal Large Language Models

作者: Tsu-Jui Fu, Wenze Hu, Xianzhi Du

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2309.17102v2

Project: https://mllm-ie.github.io|

GitHub: https://github.com/tsujuifu/pytorch_mgie|

中文摘要: 基于指令的图像编辑通过自然语言命令提高了图像操作的可控性和灵活性,而无需详细的描述或区域掩码。然而,人类的指令有时过于简短,当前的方法难以捕捉和遵循。多模态大型语言模型(MLLMs)在跨模态理解和通过语言模型生成视觉感知响应方面显示出有前途的能力。我们研究MLLMs如何促进编辑指令,并提出MLLM引导的图像编辑(MGIE)。MGIE学习推导富有表达力的指令并提供明确的指导。编辑模型共同捕捉这种视觉想象,并通过端到端训练执行操作。我们评估了Photoshop风格修改、全局照片优化和局部编辑等各个方面。大量实验结果表明,富有表达力的指令对于基于指令的图像编辑至关重要,我们的MGIE可以在保持有竞争力的推理效率的同时,显著改善自动指标和人工评估。

摘要: Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.


标题: VIGC: Visual Instruction Generation and Correction

作者: Bin Wang, Fan Wu, Xiao Han

PubTime: 2024-02-04

Downlink: http://arxiv.org/abs/2308.12714v3

Project: https://opendatalab.github.io/VIGC|

GitHub: https://github.com/opendatalab/VIGC|

中文摘要: 视觉编码器和大型语言模型(LLMs)的集成推动了多模态大型语言模型(MLLMs)的最新进展。然而,视觉语言任务的高质量指令微调数据的缺乏仍然是一个挑战。当前领先的范式(如LLaVA)依赖纯语言的GPT-4来生成数据,这需要预先标注的图像描述和检测边界框,并且难以把握图像细节。解决这个问题的一个实际方案是利用现有的多模态大型语言模型(MLLMs)为视觉语言任务生成指令数据。然而,值得注意的是,当前可用的MLLMs不如其LLM对应物强大,它们往往会产生不充分的响应并生成错误信息。为了解决这一问题,本文提出了视觉指令生成与校正(VIGC)框架,该框架使多模态大型语言模型能够生成指令微调数据,并在生成过程中逐步提高数据质量。具体来说,视觉指令生成(VIG)引导视觉语言模型生成多样化的指令微调数据。为了保证生成质量,视觉指令校正(VIC)采用迭代更新机制来纠正VIG产生的数据中的任何不准确之处,有效地降低了产生幻觉的风险。利用VIGC生成的多样化高质量数据,我们微调主流模型,并通过各种评估验证数据质量。实验结果表明,VIGC不仅弥补了纯语言数据生成方法的不足,而且有效地提升了基准性能。模型、数据集和代码可在 https://opendatalab.github.io/VIGC 获取。

摘要: The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, the scarcity of high-quality instruction-tuning data for vision-language tasks remains a challenge. The current leading paradigm, such as LLaVA, relies on language-only GPT-4 to generate data, which requires pre-annotated image captions and detection bounding boxes, suffering from understanding image details. A practical solution to this problem would be to utilize the available multimodal large language models (MLLMs) to generate instruction data for vision-language tasks. However, it’s worth noting that the currently accessible MLLMs are not as powerful as their LLM counterparts, as they tend to produce inadequate responses and generate false information. As a solution for addressing the current issue, this paper proposes the Visual Instruction Generation and Correction (VIGC) framework that enables multimodal large language models to generate instruction-tuning data and progressively enhance its quality on-the-fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct any inaccuracies in data produced by VIG, effectively reducing the risk of hallucination. Leveraging the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality based on various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods, but also effectively enhances the benchmark performance. The models, datasets, and code are available at https://opendatalab.github.io/VIGC.


标题: LitLLM: A Toolkit for Scientific Literature Review

作者: Shubham Agarwal, Issam H. Laradji, Laurent Charlin

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2402.01788v1

Project: https://huggingface.co/spaces/shubhamagarwal92/LitLLM|https://youtu.be/E2ggOZBAFw0|

GitHub: https://github.com/shubhamagarwal92/LitLLM|

摘要: Conducting literature reviews for scientific papers is essential for understanding research, its limitations, and building on existing work. It is a tedious task which makes an automatic literature review generator appealing. Unfortunately, many existing works that generate such reviews using Large Language Models (LLMs) have significant limitations. They tend to hallucinate-generate non-actual information-and ignore the latest research they have not been trained on. To address these limitations, we propose a toolkit that operates on Retrieval Augmented Generation (RAG) principles, specialized prompting and instructing techniques with the help of LLMs. Our system first initiates a web search to retrieve relevant papers by summarizing user-provided abstracts into keywords using an off-the-shelf LLM. Authors can enhance the search by supplementing it with relevant papers or keywords, contributing to a tailored retrieval process. Second, the system re-ranks the retrieved papers based on the user-provided abstract. Finally, the related work section is generated based on the re-ranked results and the abstract. There is a substantial reduction in time and effort for literature review compared to traditional methods, establishing our toolkit as an efficient alternative. Our open-source toolkit is accessible at https://github.com/shubhamagarwal92/LitLLM and Huggingface space (https://huggingface.co/spaces/shubhamagarwal92/LitLLM) with the video demo at https://youtu.be/E2ggOZBAFw0.
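
The toolkit's pipeline is retrieve, re-rank against the user's abstract, then generate the related-work section. The snippet below only illustrates the re-ranking stage, and it substitutes TF-IDF cosine similarity for whatever ranking the toolkit actually uses, purely as a stand-in:

```python
# Sketch of the re-ranking step: order retrieved papers by similarity to the user's abstract.
# LitLLM itself relies on LLM prompting for retrieval and generation; TF-IDF similarity here
# is only a stand-in to illustrate this pipeline stage, not the toolkit's actual method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rerank(user_abstract: str, retrieved_abstracts: list[str]) -> list[int]:
    vec = TfidfVectorizer().fit([user_abstract] + retrieved_abstracts)
    query = vec.transform([user_abstract])
    docs = vec.transform(retrieved_abstracts)
    scores = cosine_similarity(query, docs).ravel()
    return scores.argsort()[::-1].tolist()        # paper indices, most relevant first
```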


== CLIP@ViT @ VLM @ visual model ==

标题: FROSTER: Frozen CLIP Is A Strong Teacher for Open-Vocabulary Action Recognition

作者: Xiaohu Huang, Hao Zhou, Kun Yao

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03241v1

Project: https://visual-ai.github.io/froster|

中文摘要: 在本文中,我们介绍了FROSTER,一个用于开放词汇动作识别的有效框架。CLIP模型在一系列基于图像的任务中取得了显著的成功,这得益于其强大的泛化能力,而这种泛化能力源于在海量图像-文本对上的预训练。然而,由于CLIP的预训练中缺乏时间信息,将CLIP直接应用于开放词汇动作识别任务具有挑战性。此外,在动作识别数据集上微调CLIP可能导致过拟合并损害其泛化能力,从而在处理未见过的动作时得到不令人满意的结果。为了解决这些问题,FROSTER采用残差特征蒸馏方法,确保CLIP在有效适应动作识别任务的同时保持其泛化能力。具体来说,残差特征蒸馏将冻结的CLIP模型视为教师,以保持原始CLIP所表现出的泛化能力,并监督特征学习以提取视频特定的特征,从而弥合图像和视频之间的差距。同时,它使用残差子网络进行特征蒸馏,以在学习可泛化特征和视频特定特征这两个不同目标之间取得平衡。我们在开放词汇动作识别基准的base-to-novel和跨数据集两种设置下对FROSTER进行了广泛评估。FROSTER在所有数据集上始终取得最先进的性能。项目页面:https://visual-ai.github.io/froster。

摘要: In this paper, we introduce FROSTER, an effective framework for open-vocabulary action recognition. The CLIP model has achieved remarkable success in a range of image-based tasks, benefiting from its strong generalization capability stemming from pretaining on massive image-text pairs. However, applying CLIP directly to the open-vocabulary action recognition task is challenging due to the absence of temporal information in CLIP’s pretraining. Further, fine-tuning CLIP on action recognition datasets may lead to overfitting and hinder its generalizability, resulting in unsatisfactory results when dealing with unseen actions. To address these issues, FROSTER employs a residual feature distillation approach to ensure that CLIP retains its generalization capability while effectively adapting to the action recognition task. Specifically, the residual feature distillation treats the frozen CLIP model as a teacher to maintain the generalizability exhibited by the original CLIP and supervises the feature learning for the extraction of video-specific features to bridge the gap between images and videos. Meanwhile, it uses a residual sub-network for feature distillation to reach a balance between the two distinct objectives of learning generalizable and video-specific features. We extensively evaluate FROSTER on open-vocabulary action recognition benchmarks under both base-to-novel and cross-dataset settings. FROSTER consistently achieves state-of-the-art performance on all datasets across the board. Project page: https://visual-ai.github.io/froster.
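
The abstract describes residual feature distillation against a frozen CLIP teacher. The sketch below shows one plausible shape of that objective: a small residual projector on top of the fine-tuned features, matched to the frozen teacher's features alongside the task loss. The projector design and loss weighting are assumptions, not FROSTER's exact sub-network.

```python
# Minimal sketch of residual feature distillation with a frozen CLIP teacher.
# The residual projector and loss weight are illustrative; FROSTER's actual sub-network
# and objective may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualDistillHead(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, student_feat):
        # residual form: keep the fine-tuned feature and learn only a correction term
        return student_feat + self.proj(student_feat)

def froster_style_loss(student_feat, teacher_feat, logits, labels, head, w_distill=1.0):
    distilled = head(student_feat)
    distill_loss = F.mse_loss(distilled, teacher_feat.detach())   # match the frozen CLIP teacher
    task_loss = F.cross_entropy(logits, labels)                   # action recognition objective
    return task_loss + w_distill * distill_loss
```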


标题: Transcending Adversarial Perturbations: Manifold-Aided Adversarial Examples with Legitimate Semantics

作者: Shuai Li, Xiaoyu Jiang, Xiaoguang Ma

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03095v1

GitHub: https://github.com/shuaili1027/MAELS.git|

中文摘要: 深度神经网络非常容易受到由恶意微小扰动操纵的对抗性示例的攻击。尽管大多数传统的对抗性攻击通过最小化对抗性示例与相应原始图像之间的几何距离来确保视觉上的不可察觉性,但这些对几何距离的约束导致了有限的攻击可迁移性、较差的视觉质量和人类难以察觉的可解释性。在本文中,我们提出了一个有监督的语义变换生成模型来生成具有真实且合法语义的对抗性示例,其中首次构建了一个包含连续语义变化的无限制对抗性流形,以实现从非对抗性示例到对抗性示例的合法过渡。在MNIST和工业缺陷数据集上的综合实验表明,我们的对抗性示例不仅表现出更好的视觉质量,而且实现了更优的攻击可迁移性以及对模型漏洞更有效的解释,表明它们作为通用对抗性示例的巨大潜力。代码和预训练模型可在 https://github.com/shuaili1027/MAELS.git 获取。

摘要: Deep neural networks were significantly vulnerable to adversarial examples manipulated by malicious tiny perturbations. Although most conventional adversarial attacks ensured the visual imperceptibility between adversarial examples and corresponding raw images by minimizing their geometric distance, these constraints on geometric distance led to limited attack transferability, inferior visual quality, and human-imperceptible interpretability. In this paper, we proposed a supervised semantic-transformation generative model to generate adversarial examples with real and legitimate semantics, wherein an unrestricted adversarial manifold containing continuous semantic variations was constructed for the first time to realize a legitimate transition from non-adversarial examples to adversarial ones. Comprehensive experiments on MNIST and industrial defect datasets showed that our adversarial examples not only exhibited better visual quality but also achieved superior attack transferability and more effective explanations for model vulnerabilities, indicating their great potential as generic adversarial examples. The code and pre-trained models were available at https://github.com/shuaili1027/MAELS.git.


标题: LKCA: Large Kernel Convolutional Attention

作者: Chenghao Li, Boheng Zeng, Yi Lu

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2401.05738v2

GitHub: https://github.com/CatworldLee/LKCA|

中文摘要: 我们重新审视了视觉Transformer中注意力机制与大核卷积网络之间的关系,并提出了一种新的空间注意力,称为大核卷积注意力(LKCA)。它用单个大核卷积取代注意力操作,从而简化了注意力计算。LKCA结合了卷积神经网络和视觉Transformer的优点,拥有大感受野、局部性和参数共享等特性。我们从卷积和注意力两个角度解释了LKCA的优势,并为每种视角提供了等价的代码实现。实验证实,从卷积和注意力角度实现的LKCA表现出相同的性能。我们在分类和分割任务中广泛试验了ViT的LKCA变体。实验表明,LKCA在视觉任务中表现出有竞争力的性能。我们的代码将在 https://github.com/CatworldLee/LKCA 公开。

摘要: We revisit the relationship between attention mechanisms and large kernel ConvNets in visual transformers and propose a new spatial attention named Large Kernel Convolutional Attention (LKCA). It simplifies the attention operation by replacing it with a single large kernel convolution. LKCA combines the advantages of convolutional neural networks and visual transformers, possessing a large receptive field, locality, and parameter sharing. We explained the superiority of LKCA from both convolution and attention perspectives, providing equivalent code implementations for each view. Experiments confirm that LKCA implemented from both the convolutional and attention perspectives exhibit equivalent performance. We extensively experimented with the LKCA variant of ViT in both classification and segmentation tasks. The experiments demonstrated that LKCA exhibits competitive performance in visual tasks. Our code will be made publicly available at https://github.com/CatworldLee/LKCA.
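
Since the whole point of LKCA is replacing the attention operation with a single large-kernel convolution, a compact PyTorch module captures the idea well. The kernel size, pointwise projection, and residual connection below are illustrative choices rather than the paper's exact block:

```python
# Sketch of a Large Kernel Convolutional Attention block: the attention operation is replaced
# by one large-kernel depthwise convolution over the 2D token grid. Sizes are illustrative.
import torch
import torch.nn as nn

class LKCA(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 13):
        super().__init__()
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size,
                                 padding=kernel_size // 2, groups=dim)  # large depthwise kernel
        self.pw_conv = nn.Conv2d(dim, dim, 1)                           # pointwise channel mixing

    def forward(self, x):                    # x: (B, C, H, W) token grid
        return x + self.pw_conv(self.dw_conv(x))                        # residual "attention"

tokens = torch.randn(2, 64, 14, 14)
out = LKCA(64)(tokens)                       # same shape as the input
```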


标题: Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives

作者: Sheng Luo, Wei Chen, Wanxin Tian

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.02968v1

GitHub: https://github.com/rolsheng/MM-VUFM4DS|

中文摘要: 基础模型确实对各个领域产生了深远的影响,成为显著塑造智能系统能力的关键组件。在智能车辆的背景下,利用基础模型的力量已被证明具有变革性,在视觉理解方面带来了显著的进步。多模态多任务视觉理解基础模型(MM-VUFMs)配备了多模态和多任务学习能力,可以有效地处理和融合来自不同模态的数据,并以强大的适应性同时处理各种驾驶相关任务,有助于对周围场景进行更全面的理解。在本综述中,我们对专门为道路场景设计的MM-VUFMs进行了系统分析。我们的目标不仅是提供常见实践的全面概述,涵盖特定任务模型、统一多模态模型、统一多任务模型和基础模型提示技术,还要强调它们在不同学习范式中的高级能力。这些范式包括开放世界理解、面向道路场景的高效迁移、持续学习、交互与生成能力。此外,我们还提供了对关键挑战和未来趋势的见解,如闭环驾驶系统、可解释性、具身驾驶智能体和世界模型。为方便研究人员掌握道路场景MM-VUFMs的最新进展,我们在 https://github.com/rolsheng/MM-VUFM4DS 建立了一个持续更新的资源库。

摘要: Foundation models have indeed made a profound impact on various fields, emerging as pivotal components that significantly shape the capabilities of intelligent systems. In the context of intelligent vehicles, leveraging the power of foundation models has proven to be transformative, offering notable advancements in visual understanding. Equipped with multi-modal and multi-task learning capabilities, multi-modal multi-task visual understanding foundation models (MM-VUFMs) effectively process and fuse data from diverse modalities and simultaneously handle various driving-related tasks with powerful adaptability, contributing to a more holistic understanding of the surrounding scene. In this survey, we present a systematic analysis of MM-VUFMs specifically designed for road scenes. Our objective is not only to provide a comprehensive overview of common practices, referring to task-specific models, unified multi-modal models, unified multi-task models, and foundation model prompting techniques, but also to highlight their advanced capabilities in diverse learning paradigms. These paradigms include open-world understanding, efficient transfer for road scenes, continual learning, interactive and generative capability. Moreover, we provide insights into key challenges and future trends, such as closed-loop driving systems, interpretability, embodied driving agents, and world models. To facilitate researchers in staying abreast of the latest developments in MM-VUFMs for road scenes, we have established a continuously updated repository at https://github.com/rolsheng/MM-VUFM4DS


标题: Enhancing Compositional Generalization via Compositional Feature Alignment

作者: Haoxiang Wang, Haozhe Si, Huajie Shao

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.02851v1

GitHub: https://github.com/Haoxiang-Wang/Compositional-Feature-Alignment|

中文摘要: 机器学习模型的现实应用经常面临数据分布偏移,即训练数据和测试数据的分布之间存在差异。在常见的多域多类设置中,随着类别和域的数量增加,为每个域-类组合收集训练数据变得不可行。这一挑战自然引出了对具有组合泛化(CG)能力的模型的追求,即模型能够泛化到未见过的域-类组合。为了深入研究CG挑战,我们开发了CG-Bench,这是一套源自现有真实世界图像数据集的CG基准,并观察到在基础模型(如CLIP和DINOv2)上流行的预训练-微调范式难以应对这一挑战。为了解决这一挑战,我们提出了组合特征对齐(CFA),这是一种简单的两阶段微调技术:i)在预训练编码器上针对类别和域标签学习两个正交的线性头;ii)在冻结新学习的头的情况下微调编码器。我们从理论和实验上证明了CFA能够促进预训练模型的组合特征学习。我们进一步在CG-Bench上对CLIP和DINOv2这两个强大的预训练视觉基础模型进行了广泛的实验。实验结果表明,CFA在组合泛化方面优于常见的微调技术,证实了CFA在组合特征学习方面的有效性。

摘要: Real-world applications of machine learning models often confront data distribution shifts, wherein discrepancies exist between the training and test data distributions. In the common multi-domain multi-class setup, as the number of classes and domains scales up, it becomes infeasible to gather training data for every domain-class combination. This challenge naturally leads the quest for models with Compositional Generalization (CG) ability, where models can generalize to unseen domain-class combinations. To delve into the CG challenge, we develop CG-Bench, a suite of CG benchmarks derived from existing real-world image datasets, and observe that the prevalent pretraining-finetuning paradigm on foundational models, such as CLIP and DINOv2, struggles with the challenge. To address this challenge, we propose Compositional Feature Alignment (CFA), a simple two-stage finetuning technique that i) learns two orthogonal linear heads on a pretrained encoder with respect to class and domain labels, and ii) fine-tunes the encoder with the newly learned head frozen. We theoretically and empirically justify that CFA encourages compositional feature learning of pretrained models. We further conduct extensive experiments on CG-Bench for CLIP and DINOv2, two powerful pretrained vision foundation models. Experiment results show that CFA outperforms common finetuning techniques in compositional generalization, corroborating CFA’s efficacy in compositional feature learning.
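
A minimal sketch of the two-stage CFA recipe described above: stage 1 trains class and domain linear heads on a frozen encoder, stage 2 fine-tunes the encoder with the heads frozen. The explicit orthogonality penalty and loss weight are assumptions added for illustration:

```python
# Sketch of two-stage Compositional Feature Alignment. The orthogonality penalty and
# the staging helper are illustrative assumptions, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cfa_losses(encoder, class_head, domain_head, x, y_class, y_domain, ortho_w=0.1):
    feats = encoder(x)
    loss = F.cross_entropy(class_head(feats), y_class) + \
           F.cross_entropy(domain_head(feats), y_domain)
    # encourage the two linear heads to span orthogonal subspaces (illustrative penalty)
    ortho = (class_head.weight @ domain_head.weight.T).pow(2).sum()
    return loss + ortho_w * ortho

def set_stage(encoder, heads, stage: int):
    train_encoder = (stage == 2)             # stage 1: train heads only; stage 2: encoder only
    for p in encoder.parameters():
        p.requires_grad = train_encoder
    for head in heads:
        for p in head.parameters():
            p.requires_grad = not train_encoder
```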


标题: VM-UNet: Vision Mamba UNet for Medical Image Segmentation

作者: Jiacheng Ruan, Suncheng Xiang

PubTime: 2024-02-04

Downlink: http://arxiv.org/abs/2402.02491v1

GitHub: https://github.com/JCruan519/VM-UNet|

中文摘要: 在医学图像分割领域,基于CNN和基于Transformer的模型都得到了广泛的探索。然而,CNN在长程建模能力方面存在局限性,而Transformer则受到其二次方计算复杂度的制约。最近,以Mamba为代表的状态空间模型(SSMs)已经成为一种有前途的方法。它们不仅擅长建模长程交互,而且保持线性计算复杂度。本文利用状态空间模型,提出了一种用于医学图像分割的U形结构模型,命名为Vision Mamba UNet(VM-UNet)。具体来说,引入视觉状态空间(VSS)块作为基础模块来捕获大范围的上下文信息,并构造了非对称的编码器-解码器结构。我们在ISIC17、ISIC18和Synapse数据集上进行了全面的实验,结果表明VM-UNet在医学图像分割任务中具有竞争力。据我们所知,这是第一个基于纯SSM构建的医学图像分割模型。我们的目标是建立一个基线,并为未来开发更高效、更有效的基于SSM的分割系统提供有价值的见解。我们的代码可在 https://github.com/JCruan519/VM-UNet 获取。

摘要: In the realm of medical image segmentation, both CNN-based and Transformer-based models have been extensively explored. However, CNNs exhibit limitations in long-range modeling capabilities, whereas Transformers are hampered by their quadratic computational complexity. Recently, State Space Models (SSMs), exemplified by Mamba, have emerged as a promising approach. They not only excel in modeling long-range interactions but also maintain a linear computational complexity. In this paper, leveraging state space models, we propose a U-shape architecture model for medical image segmentation, named Vision Mamba UNet (VM-UNet). Specifically, the Visual State Space (VSS) block is introduced as the foundation block to capture extensive contextual information, and an asymmetrical encoder-decoder structure is constructed. We conduct comprehensive experiments on the ISIC17, ISIC18, and Synapse datasets, and the results indicate that VM-UNet performs competitively in medical image segmentation tasks. To our best knowledge, this is the first medical image segmentation model constructed based on the pure SSM-based model. We aim to establish a baseline and provide valuable insights for the future development of more efficient and effective SSM-based segmentation systems. Our code is available at https://github.com/JCruan519/VM-UNet.


== diffusion policy@diffusion formulation@diffusion model ==

标题: Lumiere: A Space-Time Diffusion Model for Video Generation

作者: Omer Bar-Tal, Hila Chefer, Omer Tov

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2401.12945v2

Project: https://lumiere-video.github.io/|https://www.youtube.com/watch?v=wxLr02Dz2Sc|

中文摘要: 我们介绍了Lumiere,一种文本到视频扩散模型,旨在合成呈现真实、多样且连贯运动的视频,这是视频合成中的一个关键挑战。为此,我们引入了一种时空U-Net架构,通过模型的单次前向传递一次性生成视频的整个时间跨度。这与现有的视频模型形成对比:后者先合成相距较远的关键帧,然后进行时间超分辨率,这种方法本质上使全局时间一致性难以实现。通过同时部署空间和(重要的)时间下采样与上采样,并利用预训练的文本到图像扩散模型,我们的模型学会通过在多个时空尺度上进行处理,直接生成全帧率、低分辨率的视频。我们展示了最先进的文本到视频生成结果,并表明我们的设计可以轻松支持各种内容创作任务和视频编辑应用,包括图像到视频、视频修复和风格化生成。

摘要: We introduce Lumiere – a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion – a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution – an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
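
The architectural ingredient the abstract emphasizes is downsampling (and later upsampling) in both space and time inside the U-Net. A strided 3D convolution, as in the toy block below, is one simple way to realize joint space-time downsampling; Lumiere's actual Space-Time U-Net blocks are more elaborate, so treat this only as an illustration of the idea.

```python
# Toy joint space-time downsampling: a single strided Conv3d halves the temporal and
# spatial resolution together. Purely illustrative, not Lumiere's actual block design.
import torch
import torch.nn as nn

class SpaceTimeDownsample(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, x):                    # x: (B, C, T, H, W)
        return self.conv(x)                  # -> (B, C, T/2, H/2, W/2)

video = torch.randn(1, 8, 16, 32, 32)
print(SpaceTimeDownsample(8)(video).shape)   # torch.Size([1, 8, 8, 16, 16])
```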


标题: Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion

作者: Shiyuan Yang, Liang Hou, Haibin Huang

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.03162v1

Project: https://direct-a-video.github.io/|

中文摘要: 最近的文本到视频扩散模型取得了令人印象深刻的进展。在实践中,用户通常希望能够独立地控制物体运动和摄像机运动,以进行定制化的视频创作。然而,当前的方法缺乏对以解耦方式分别控制物体运动和摄像机运动的关注,这限制了文本到视频模型的可控性和灵活性。在本文中,我们介绍了Direct-a-Video,这是一个允许用户像导演视频一样独立指定一个或多个物体的运动和/或摄像机运动的系统。我们提出了一种简单而有效的策略来实现物体运动和摄像机运动的解耦控制。物体运动通过空间交叉注意力调制来控制,利用模型的固有先验,不需要额外的优化。对于摄像机运动,我们引入了新的时间交叉注意力层来解释定量的摄像机运动参数。我们进一步采用基于数据增强的方法,在小规模数据集上以自监督方式训练这些层,消除了对显式运动标注的需求。这两个组件独立运行,允许单独或组合控制,并且可以泛化到开放域场景。大量实验证明了我们方法的优越性和有效性。项目页面:https://direct-a-video.github.io/。

摘要: Recent text-to-video diffusion models have achieved impressive progress. In practice, users often desire the ability to control object motion and camera movement independently for customized video creation. However, current methods lack the focus on separately controlling object motion and camera movement in a decoupled manner, which limits the controllability and flexibility of text-to-video models. In this paper, we introduce Direct-a-Video, a system that allows users to independently specify motions for one or multiple objects and/or camera movements, as if directing a video. We propose a simple yet effective strategy for the decoupled control of object motion and camera movement. Object motion is controlled through spatial cross-attention modulation using the model’s inherent priors, requiring no additional optimization. For camera movement, we introduce new temporal cross-attention layers to interpret quantitative camera movement parameters. We further employ an augmentation-based approach to train these layers in a self-supervised manner on a small-scale dataset, eliminating the need for explicit motion annotation. Both components operate independently, allowing individual or combined control, and can generalize to open-domain scenarios. Extensive experiments demonstrate the superiority and effectiveness of our method. Project page: https://direct-a-video.github.io/.
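
For the camera branch, the abstract describes new temporal cross-attention layers that read quantitative camera-movement parameters. The module below sketches what such a conditioning layer could look like; the parameter dimensionality, head count, and residual wiring are assumptions, not the paper's implementation.

```python
# Sketch of a cross-attention layer conditioning video features on quantitative camera
# parameters (e.g., pan/zoom values). Layer sizes and the encoding are illustrative assumptions.
import torch
import torch.nn as nn

class CameraCrossAttention(nn.Module):
    def __init__(self, dim: int, cam_dim: int = 3, heads: int = 4):
        super().__init__()
        self.cam_proj = nn.Linear(cam_dim, dim)          # embed camera parameters as tokens
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, cam_params):
        # video_tokens: (B, T*N, dim) spatio-temporal tokens; cam_params: (B, K, cam_dim)
        cam_tokens = self.cam_proj(cam_params)
        out, _ = self.attn(video_tokens, cam_tokens, cam_tokens)
        return video_tokens + out                        # residual conditioning

layer = CameraCrossAttention(64)
y = layer(torch.randn(1, 16 * 8, 64), torch.randn(1, 4, 3))
```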


标题: Retrieval-Augmented Score Distillation for Text-to-3D Generation

作者: Junyoung Seo, Susung Hong, Wooseok Jang

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.02972v1

Project: https://ku-cvlab.github.io/RetDream/|

中文摘要: Text-to-3D生成通过整合强大的2D扩散模型取得了显著的成功,但3D先验知识不足也导致了3D几何的不一致性。近年来,随着大规模多视图数据集的发布,在多视图数据集上微调扩散模型成为解决三维不一致性问题的主流。然而,与2D数据相比,它面临着三维数据有限的质量和多样性方面的根本困难。为了避开这些权衡,我们探索了一种为分数提取量身定制的检索增强方法,称为RetDream。我们假设,通过在优化过程中直接使用语义相关的资产,可以充分利用2D扩散模型的表达性和3D资产的几何一致性。为此,我们引入了一种新的框架,用于文本到3D生成中基于检索的质量增强。我们利用检索到的资产将其几何先验合并到变分目标中,并使扩散模型的2D先验适应视图一致性,从而在生成场景的几何和保真度方面实现了显著的改进。我们进行了大量的实验来证明RetDream表现出卓越的质量和更高的几何一致性。项目页面可在https://ku-cvlab.github.io/RetDream/。

摘要: Text-to-3D generation has achieved significant success by incorporating powerful 2D diffusion models, but insufficient 3D prior knowledge also leads to the inconsistency of 3D geometry. Recently, since large-scale multi-view datasets have been released, fine-tuning the diffusion model on the multi-view datasets becomes a mainstream to solve the 3D inconsistency problem. However, it has confronted with fundamental difficulties regarding the limited quality and diversity of 3D data, compared with 2D data. To sidestep these trade-offs, we explore a retrieval-augmented approach tailored for score distillation, dubbed RetDream. We postulate that both expressiveness of 2D diffusion models and geometric consistency of 3D assets can be fully leveraged by employing the semantically relevant assets directly within the optimization process. To this end, we introduce novel framework for retrieval-based quality enhancement in text-to-3D generation. We leverage the retrieved asset to incorporate its geometric prior in the variational objective and adapt the diffusion model’s 2D prior toward view consistency, achieving drastic improvements in both geometry and fidelity of generated scenes. We conduct extensive experiments to demonstrate that RetDream exhibits superior quality with increased geometric consistency. Project page is available at https://ku-cvlab.github.io/RetDream/.


标题: Extreme Two-View Geometry From Object Poses with Diffusion Models

作者: Yujing Sun, Caiyi Sun, Yuan Liu

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.02800v1

GitHub: https://github.com/scy639/Extreme-Two-View-Geometry-From-Object-Poses-with-Diffusion-Models|

中文摘要: 人类有一种不可思议的能力,即使视点变化非常巨大、图像中没有共同可见的区域,也能毫不费力地感知包含同一物体的两幅图像之间的视点差异。然而,这一非凡的能力已被证明是对现有相机姿态估计方法的挑战:当面对大的视点差异时,由于缺乏用于匹配的重叠局部特征,这些方法经常失败。在本文中,我们旨在有效利用物体先验的力量,在面对极端视点变化时准确地确定两视图几何。在我们的方法中,我们首先在数学上将相对相机姿态估计问题转化为物体姿态估计问题。然后,为了估计物体姿态,我们利用从扩散模型Zero123学习到的物体先验来合成物体的新视角图像。新视角图像经过匹配以确定物体姿态,从而确定两视图相机姿态。在实验中,我们的方法表现出对大视点变化非凡的鲁棒性和弹性,在合成和真实世界的数据集上均能稳定地估计两视图姿态,具有出色的泛化能力。代码将在 https://github.com/scy639/Extreme-Two-View-Geometry-From-Object-Poses-with-Diffusion-Models 上提供。

摘要: Human has an incredible ability to effortlessly perceive the viewpoint difference between two images containing the same object, even when the viewpoint change is astonishingly vast with no co-visible regions in the images. This remarkable skill, however, has proven to be a challenge for existing camera pose estimation methods, which often fail when faced with large viewpoint differences due to the lack of overlapping local features for matching. In this paper, we aim to effectively harness the power of object priors to accurately determine two-view geometry in the face of extreme viewpoint changes. In our method, we first mathematically transform the relative camera pose estimation problem to an object pose estimation problem. Then, to estimate the object pose, we utilize the object priors learned from a diffusion model Zero123 to synthesize novel-view images of the object. The novel-view images are matched to determine the object pose and thus the two-view camera pose. In experiments, our method has demonstrated extraordinary robustness and resilience to large viewpoint changes, consistently estimating two-view poses with exceptional generalization ability across both synthetic and real-world datasets. Code will be available at https://github.com/scy639/Extreme-Two-View-Geometry-From-Object-Poses-with-Diffusion-Models.
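
The reformulation from relative camera pose to object pose rests on standard rigid-transform algebra: once the object's pose is known in each camera frame, the relative camera pose is a single matrix product. A small numpy illustration with toy poses:

```python
# Worked relation behind the reformulation: given the object's pose in each camera frame
# (T_obj_to_cam1, T_obj_to_cam2 as 4x4 homogeneous matrices), the relative camera pose follows
# directly. This is standard rigid-transform algebra, shown with toy matrices.
import numpy as np

def relative_camera_pose(T_obj_to_cam1, T_obj_to_cam2):
    # X_cam1 = T1 @ X_obj and X_cam2 = T2 @ X_obj  =>  X_cam2 = (T2 @ inv(T1)) @ X_cam1
    return T_obj_to_cam2 @ np.linalg.inv(T_obj_to_cam1)

T1 = np.eye(4)                                   # toy object pose in camera 1
T2 = np.eye(4); T2[:3, 3] = [0.0, 0.0, 1.0]      # toy object pose in camera 2 (translated)
T_cam1_to_cam2 = relative_camera_pose(T1, T2)
```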


标题: Contrastive Diffuser: Planning Towards High Return States via Contrastive Learning

作者: Yixiang Shan, Zhengbang Zhu, Ting Long

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2402.02772v1

Project: https://anonymous.4open.science/r/ContrastiveDiffuser|

中文摘要: 将扩散模型应用于强化学习中的长期规划最近获得了很多关注。一些基于扩散的方法已经成功利用了扩散模型对任意分布的建模能力。这些方法为规划生成后续轨迹,并已显示出显著的改进。然而,这些方法受限于其简单的基分布,并且忽略了样本的多样性,即不同的状态有不同的回报。它们只是利用扩散来学习离线数据集的分布,生成的轨迹中的状态与离线数据集共享相同的分布。因此,这些模型到达高回报状态的概率在很大程度上取决于数据集分布。即使配备了引导模型,性能仍然受到抑制。为了解决这些限制,在本文中,我们提出了一种称为CDiffuser的新方法,它设计了一种回报对比机制,将生成轨迹中的状态拉向高回报状态,同时将它们推离低回报状态,以改善基分布。在14个常用的D4RL基准上的实验证明了我们所提方法的有效性。我们的代码可在 https://anonymous.4open.science/r/ContrastiveDiffuser 公开获取。

摘要: Applying diffusion models in reinforcement learning for long-term planning has gained much attention recently. Several diffusion-based methods have successfully leveraged the modeling capabilities of diffusion for arbitrary distributions. These methods generate subsequent trajectories for planning and have demonstrated significant improvement. However, these methods are limited by their plain base distributions and their overlooking of the diversity of samples, in which different states have different returns. They simply leverage diffusion to learn the distribution of offline dataset, generate the trajectories whose states share the same distribution with the offline dataset. As a result, the probability of these models reaching the high-return states is largely dependent on the dataset distribution. Even equipped with the guidance model, the performance is still suppressed. To address these limitations, in this paper, we propose a novel method called CDiffuser, which devises a return contrast mechanism to pull the states in generated trajectories towards high-return states while pushing them away from low-return states to improve the base distribution. Experiments on 14 commonly used D4RL benchmarks demonstrate the effectiveness of our proposed method. Our code is publicly available at https://anonymous.4open.science/r/ContrastiveDiffuser.
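
The return-contrast mechanism pulls generated states toward high-return dataset states and pushes them away from low-return ones. The InfoNCE-style loss below is one way to write such a term and is only an illustration of the idea, not CDiffuser's exact objective:

```python
# Sketch of a return-contrast term: generated states are attracted to embeddings of
# high-return dataset states and repelled from low-return ones (InfoNCE-style).
import torch
import torch.nn.functional as F

def return_contrast_loss(gen_states, high_return_states, low_return_states, tau=0.1):
    """All inputs: (N, D) state embeddings; rows are L2-normalized inside."""
    g = F.normalize(gen_states, dim=-1)
    pos = F.normalize(high_return_states, dim=-1)
    neg = F.normalize(low_return_states, dim=-1)
    pos_sim = g @ pos.T / tau                    # (N, P) similarities to high-return states
    neg_sim = g @ neg.T / tau                    # (N, Q) similarities to low-return states
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    # treat similarity to any high-return state as the "positive" mass
    return -(torch.logsumexp(pos_sim, dim=1) - torch.logsumexp(logits, dim=1)).mean()
```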


标题: ChatTraffic: Text-to-Traffic Generation via Diffusion Model

作者: Chengyang Zhang, Yong Zhang, Qitan Shao

PubTime: 2024-02-05

Downlink: http://arxiv.org/abs/2311.16203v3

GitHub: https://github.com/ChyaZhang/ChatTraffic|

摘要: Traffic prediction is one of the most significant foundations in Intelligent Transportation Systems (ITS). Traditional traffic prediction methods rely only on historical traffic data to predict traffic trends and face two main challenges. 1) insensitivity to unusual events. 2) limited performance in long-term prediction. In this work, we explore how generative models combined with text describing the traffic system can be applied for traffic generation, and name the task Text-to-Traffic Generation (TTG). The key challenge of the TTG task is how to associate text with the spatial structure of the road network and traffic data for generating traffic situations. To this end, we propose ChatTraffic, the first diffusion model for text-to-traffic generation. To guarantee the consistency between synthetic and real data, we augment a diffusion model with the Graph Convolutional Network (GCN) to extract spatial correlations of traffic data. In addition, we construct a large dataset containing text-traffic pairs for the TTG task. We benchmarked our model qualitatively and quantitatively on the released dataset. The experimental results indicate that ChatTraffic can generate realistic traffic situations from the text. Our code and dataset are available at https://github.com/ChyaZhang/ChatTraffic.
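
To ground the "augment a diffusion model with a GCN" part of the abstract, the snippet below shows a minimal graph-convolution step of the kind used to propagate traffic features along the road network (row-normalized aggregation, H' = D^{-1} A H W). How ChatTraffic wires this into the diffusion backbone is not shown, and the layer is only illustrative.

```python
# Minimal graph-convolution step for road-network features. Illustrative only; not the
# ChatTraffic implementation, and the adjacency here is a toy self-loop matrix.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (N_nodes, in_dim) traffic features; adj: (N_nodes, N_nodes) adjacency with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        return torch.relu(self.lin((adj / deg) @ x))     # row-normalized neighborhood aggregation

x = torch.randn(5, 4)                                    # 5 road segments, 4 features each
adj = torch.eye(5)                                       # toy adjacency (self-loops only)
out = GCNLayer(4, 8)(x, adj)
```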


== Visual Navigation@VLN @ Visual Language Navigation ==

标题: DRAGON: A Dialogue-Based Robot for Assistive Navigation with Visual Language Grounding

作者: Shuijing Liu, Aamir Hasan, Kaiwen Hong

PubTime: 2024-02-03

Downlink: http://arxiv.org/abs/2307.06924v2

Project: https://sites.google.com/view/dragon-wayfinding/home|

中文摘要: 有视觉障碍(PwVI)的人在理解和导航周围空间方面有困难。当前的寻路技术要么只关注导航,要么提供有限的环境通信。受视觉语言基础和语义导航的最新进展的激励,我们提出了DRAGON,这是一个由对话系统驱动的引导机器人,具有将环境与自然语言联系起来的能力。通过理解来自用户的命令,DRAGON能够引导用户到地图上所需的地标,描述环境,并回答来自视觉观察的问题。通过有效利用对话,机器人可以将用户的自由形式描述与环境中的地标联系起来,并通过口语向用户提供语义信息。我们在日常室内环境中对被蒙住眼睛的参与者进行了一项用户研究。我们的结果表明,DRAGON能够与用户流畅地交流,提供良好的引导体验,并以直观的方式将用户与其周围环境联系起来。视频和代码可在https://sites.google.com/view/dragon-wayfinding/home获得。

摘要: Persons with visual impairments (PwVI) have difficulties understanding and navigating spaces around them. Current wayfinding technologies either focus solely on navigation or provide limited communication about the environment. Motivated by recent advances in visual-language grounding and semantic navigation, we propose DRAGON, a guiding robot powered by a dialogue system and the ability to associate the environment with natural language. By understanding the commands from the user, DRAGON is able to guide the user to the desired landmarks on the map, describe the environment, and answer questions from visual observations. Through effective utilization of dialogue, the robot can ground the user’s free-form descriptions to landmarks in the environment, and give the user semantic information through spoken language. We conduct a user study with blindfolded participants in an everyday indoor environment. Our results demonstrate that DRAGON is able to communicate with the user smoothly, provide a good guiding experience, and connect users with their surrounding environment in an intuitive manner. Videos and code are available at https://sites.google.com/view/dragon-wayfinding/home.


标题: NavHint: Vision and Language Navigation Agent with a Hint Generator

作者: Yue Zhang, Quan Guo, Parisa Kordjamshidi

PubTime: 2024-02-04

Downlink: http://arxiv.org/abs/2402.02559v1

中文摘要: 现有的视觉与语言导航工作主要依赖与导航相关的损失来建立视觉和语言模态之间的联系,而忽略了帮助导航智能体建立对视觉环境深入理解的方面。在我们的工作中,我们通过一个提供详细视觉描述的提示生成器向导航智能体提供间接监督。提示生成器帮助导航智能体形成对视觉环境的全局理解。它将智能体的注意力引向相关的导航细节,包括相关的子指令、识别中的潜在挑战和指代(grounding)中的歧义,以及目标视点的描述。为了训练提示生成器,我们基于指令中的地标以及视觉环境中可见且有辨识度的物体构建了一个合成数据集。我们在R2R和R4R数据集上评估了我们的方法,并在多个指标上达到了最先进的水平。实验结果表明,生成提示不仅提高了导航性能,还有助于提高智能体行为的可解释性。

摘要: Existing work on vision and language navigation mainly relies on navigation-related losses to establish the connection between vision and language modalities, neglecting aspects of helping the navigation agent build a deep understanding of the visual environment. In our work, we provide indirect supervision to the navigation agent through a hint generator that provides detailed visual descriptions. The hint generator assists the navigation agent in developing a global understanding of the visual environment. It directs the agent’s attention toward related navigation details, including the relevant sub-instruction, potential challenges in recognition and ambiguities in grounding, and the targeted viewpoint description. To train the hint generator, we construct a synthetic dataset based on landmarks in the instructions and visible and distinctive objects in the visual environment. We evaluate our method on the R2R and R4R datasets and achieve state-of-the-art on several metrics. The experimental results demonstrate that generating hints not only enhances the navigation performance but also helps improve the interpretability of the agent’s actions.


