[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉语言导航

专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。

为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务,只需VX关注公号并回复{邮箱+论文主题}(如:123456@xx.com + chatgpt@large language model @LLM),主题必须是同一个领域,最多三个关键词。解释权归博主所有


分类:

== LLM ==

标题: Two Failures of Self-Consistency in the Multi-Step Reasoning of LLMs

作者: Angelica Chen, Jason Phang, Alicia Parrish

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2305.14279v4

Project: https://openreview.net/forum?id=5nBqY1y96B|

中文摘要: 大型语言模型(LLM)在各种上下文少样本(few-shot)任务上取得了广泛成功,但这种成功通常是通过正确性而不是一致性来评估的。我们认为,在解决方案由多个子步骤的答案组成的任务中,自一致性是有效多步推理的重要标准。我们提出了两种对多步推理尤为重要的自一致性:假设一致性(模型预测其输出在假设的其他上下文中会是什么的能力)和组合一致性(当中间子步骤被模型对这些步骤的输出替换时,模型最终输出的一致性)。我们证明,GPT-3/-4模型的多个变体在各类任务上对这两种一致性都表现出较低的一致率。

摘要: Large language models (LLMs) have achieved widespread success on a variety of in-context few-shot tasks, but this success is typically evaluated via correctness rather than consistency. We argue that self-consistency is an important criteria for valid multi-step reasoning in tasks where the solution is composed of the answers to multiple sub-steps. We propose two types of self-consistency that are particularly important for multi-step reasoning – hypothetical consistency (a model’s ability to predict what its output would be in a hypothetical other context) and compositional consistency (consistency of a model’s final outputs when intermediate sub-steps are replaced with the model’s outputs for those steps). We demonstrate that multiple variants of the GPT-3/-4 models exhibit poor consistency rates across both types of consistency on a variety of tasks.
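
The compositional-consistency test described above is easy to script against any chat-style model. Below is a minimal sketch; the `query_model` helper is a hypothetical stand-in for whatever LLM API is being evaluated, and exact-match comparison of answers is a simplification of the paper's evaluation.

```python
# Minimal sketch of a compositional-consistency check (the hypothetical
# helper `query_model` stands in for any LLM API wrapper; not the authors' code).

def query_model(prompt: str) -> str:
    """Placeholder: call your LLM of choice and return its text answer."""
    raise NotImplementedError

def compositional_consistency(question: str, sub_question: str) -> bool:
    # 1) Ask the model the intermediate sub-step on its own.
    sub_answer = query_model(f"Q: {sub_question}\nA:").strip()

    # 2) Ask the full multi-step question directly.
    direct_final = query_model(f"Q: {question}\nA:").strip()

    # 3) Ask the full question again, with the model's own sub-answer
    #    substituted in as a given intermediate result.
    composed_final = query_model(
        f"Given that the answer to '{sub_question}' is '{sub_answer}',\n"
        f"Q: {question}\nA:"
    ).strip()

    # Consistent iff the final answer is unchanged by the substitution.
    return direct_final == composed_final
```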


标题: MAGDi: Structured Distillation of Multi-Agent Interaction Graphs Improves Reasoning in Smaller Language Models

作者: Justin Chih-Yao Chen, Swarnadeep Saha, Elias Stengel-Eskin

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2402.01620v1

GitHub: https://github.com/dinobby/MAGDi|

中文摘要: 大型语言模型(LLM)智能体之间的多智能体交互在多种推理任务上带来了显著提升。然而,这些方法涉及多个模型跨多轮的长篇生成,因此成本高昂。此外,这些多智能体方法无法提供一个可用于高效推理的最终单一模型。为了解决这个问题,我们引入了MAGDi,这是一种将多个LLM之间的推理交互结构化蒸馏到更小语言模型中的新方法。MAGDi通过将多智能体交互表示为图来教授较小的模型,用图编码器扩充基础学生模型,并使用三个目标函数来蒸馏知识:下一词元预测、区分正确与错误推理的对比损失,以及建模交互结构的基于图的目标。在七个广泛使用的常识和数学推理基准上的实验表明,MAGDi提高了较小模型的推理能力,优于多种从单个教师或多个教师蒸馏的方法。此外,MAGDi的效率也比其教师高出一个数量级。我们进行了广泛的分析,表明MAGDi(1)增强了对领域外任务的泛化能力,(2)随基础学生模型的规模和能力的提升而提升,以及(3)在应用自一致性(一种依赖于模型多样性的推理技术)时,得益于我们的多教师训练,获得了更大的改进。

摘要: Multi-agent interactions between Large Language Model (LLM) agents have shown major improvements on diverse reasoning tasks. However, these involve long generations from multiple models across several rounds, making them expensive. Moreover, these multi-agent approaches fail to provide a final, single model for efficient inference. To address this, we introduce MAGDi, a new method for structured distillation of the reasoning interactions between multiple LLMs into smaller LMs. MAGDi teaches smaller models by representing multi-agent interactions as graphs, augmenting a base student model with a graph encoder, and distilling knowledge using three objective functions: next-token prediction, a contrastive loss between correct and incorrect reasoning, and a graph-based objective to model the interaction structure. Experiments on seven widely-used commonsense and math reasoning benchmarks show that MAGDi improves the reasoning capabilities of smaller models, outperforming several methods that distill from a single teacher and multiple teachers. Moreover, MAGDi also demonstrates an order of magnitude higher efficiency over its teachers. We conduct extensive analyses to show that MAGDi (1) enhances the generalizability to out-of-domain tasks, (2) scales positively with the size and strength of the base student model, and (3) obtains larger improvements (via our multi-teacher training) when applying self-consistency - an inference technique that relies on model diversity.
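
For intuition, the three distillation objectives can be combined as a weighted sum. The sketch below is an illustrative approximation, not the released MAGDi code: the similarity-based adjacency-reconstruction "graph objective", the margin-based contrastive term, and all tensor shapes and loss weights are assumptions.

```python
import torch
import torch.nn.functional as F

# Illustrative combination of the three MAGDi-style objectives. The toy
# graph term and all shapes/weights are assumptions for demonstration,
# not the authors' implementation.

def magdi_style_loss(student_logits,   # (B, T, vocab) student logits on teacher dialogues
                     target_ids,       # (B, T) next-token targets
                     pos_scores,       # (B,) student scores for correct reasoning chains
                     neg_scores,       # (B,) student scores for incorrect chains
                     node_feats,       # (B, N, D) embeddings of interaction-graph nodes
                     adjacency,        # (B, N, N) 0/1 who-responded-to-whom structure
                     margin=1.0, w_ct=0.1, w_graph=0.1):
    # (1) Next-token prediction on the multi-agent interaction transcripts.
    ntp = F.cross_entropy(student_logits.flatten(0, 1), target_ids.flatten())

    # (2) Contrastive margin loss: rank correct reasoning above incorrect.
    contrastive = F.relu(margin - (pos_scores - neg_scores)).mean()

    # (3) Graph objective (toy): reconstruct the interaction structure from
    #     node embeddings so the student models who talked to whom.
    sim = torch.sigmoid(node_feats @ node_feats.transpose(-1, -2))
    graph = F.binary_cross_entropy(sim, adjacency)

    return ntp + w_ct * contrastive + w_graph * graph

loss = magdi_style_loss(torch.randn(2, 8, 100), torch.randint(0, 100, (2, 8)),
                        torch.randn(2), torch.randn(2),
                        torch.randn(2, 5, 16), torch.randint(0, 2, (2, 5, 5)).float())
```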


标题: KB-Plugin: A Plug-and-play Framework for Large Language Models to Induce Programs over Low-resourced Knowledge Bases

作者: Jiajie Zhang, Shulin Cao, Linmei Hu

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2402.01619v1

GitHub: https://github.com/THU-KEG/KB-Plugin|

中文摘要: 程序归纳(PI)已经成为利用知识库(KB)帮助大型语言模型(LLM)回答复杂知识密集型问题的有前途的范式。尽管如此,PI通常依赖大量成对的问题-程序数据来让LLM了解给定知识库的模式,因此对于许多缺乏标注数据的低资源知识库来说颇具挑战。为此,我们提出了KB-Plugin,这是一个即插即用的框架,使LLM能够在任何低资源KB上归纳程序。首先,KB-Plugin采用自监督学习将给定知识库的详细模式信息编码到一个可插拔模块中,即模式插件。其次,KB-Plugin利用资源丰富的知识库中充足的标注数据来训练另一个可插拔模块,即PI插件,它可以帮助LLM从任何知识库的模式插件中提取与问题相关的模式信息,并利用该信息在该知识库上归纳程序。在五个异构KBQA数据集上的实验表明,对于低资源KB,KB-Plugin使用小25倍的主干LLM即可取得优于或相当于SoTA PI方法的性能,甚至接近监督方法的性能。我们的代码和数据可从https://github.com/THU-KEG/KB-Plugin获得。

摘要: Program induction (PI) has become a promising paradigm for using knowledge bases (KBs) to help large language models (LLMs) answer complex knowledge-intensive questions. Nonetheless, PI typically relies on a large number of parallel question-program pairs to make the LLM aware of the schema of the given KB, and is thus challenging for many low-resourced KBs that lack annotated data. To this end, we propose KB-Plugin, a plug-and-play framework that enables LLMs to induce programs over any low-resourced KB. Firstly, KB-Plugin adopts self-supervised learning to encode the detailed schema information of a given KB into a pluggable module, namely schema plugin. Secondly, KB-Plugin utilizes abundant annotated data from a rich-resourced KB to train another pluggable module, namely PI plugin, which can help the LLM extract question-relevant schema information from the schema plugin of any KB and utilize this information to induce programs over this KB. Experiments on five heterogeneous KBQA datasets show that KB-Plugin achieves better or comparable performance with 25× smaller backbone LLM compared to SoTA PI methods for low-resourced KBs, and even approaches the performance of supervised methods. Our code and data are available at https://github.com/THU-KEG/KB-Plugin.
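
The plug-and-play idea can be pictured as a frozen backbone layer augmented with two small adapters: a schema plugin that is swapped per knowledge base and a PI plugin that is shared across KBs. The low-rank adapter form, dimensions, and wiring below are illustrative assumptions, not the KB-Plugin implementation.

```python
import torch
import torch.nn as nn

# Conceptual sketch of the plug-and-play idea: a frozen base projection plus
# two small pluggable adapters, a per-KB "schema plugin" that can be swapped
# and a shared "PI plugin". The low-rank form and sizes are assumptions.

class PluggableLayer(nn.Module):
    def __init__(self, dim=768, rank=8):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)                # backbone LLM stays frozen
        self.schema_plugin = self._adapter(dim, rank)  # swap this per knowledge base
        self.pi_plugin = self._adapter(dim, rank)      # shared program-induction skill

    @staticmethod
    def _adapter(dim, rank):
        return nn.Sequential(nn.Linear(dim, rank, bias=False),
                             nn.Linear(rank, dim, bias=False))

    def forward(self, h):
        return self.base(h) + self.schema_plugin(h) + self.pi_plugin(h)

layer = PluggableLayer()
out = layer(torch.randn(2, 16, 768))   # (batch, tokens, hidden)
print(out.shape)
```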


标题: TrustAgent: Towards Safe and Trustworthy LLM-based Agents through Agent Constitution

作者: Wenyue Hua, Xianjun Yang, Zelong Li

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2402.01586v1

GitHub: https://github.com/agiresearch/TrustAgent|

中文摘要: 基于LLM的智能体的出现已经获得了相当大的关注,但它们的可信度仍然是一个探索不足的领域。由于智能体可以直接与物理环境交互,因此它们的可靠性和安全性至关重要。本文提出了一个基于智能体宪法(Agent Constitution)的智能体框架TrustAgent,这是对提高基于LLM的智能体可信度中安全维度的初步研究。该框架由三重策略组成:在计划生成之前向模型注入安全知识的预规划策略,在计划生成期间加强安全的规划内策略,以及通过规划后检查确保安全的规划后策略。通过实验分析,我们展示了这些方法如何通过识别和预防潜在危险来有效地提高LLM智能体的安全性。此外,我们探索了安全性和有用性之间,以及模型的推理能力和它作为安全智能体的功效之间的复杂关系。本文强调了将安全意识和可信度整合到基于LLM的智能体的设计和部署中的必要性,这不仅是为了提高它们的性能,也是为了确保它们负责任地融入以人为中心的环境中。数据和代码可从https://github.com/agiresearch/TrustAgent获得。

摘要: The emergence of LLM-based agents has garnered considerable attention, yet their trustworthiness remains an under-explored area. As agents can directly interact with the physical environment, their reliability and safety is critical. This paper presents an Agent-Constitution-based agent framework, TrustAgent, an initial investigation into improving the safety dimension of trustworthiness in LLM-based agents. This framework consists of threefold strategies: pre-planning strategy which injects safety knowledge to the model prior to plan generation, in-planning strategy which bolsters safety during plan generation, and post-planning strategy which ensures safety by post-planning inspection. Through experimental analysis, we demonstrate how these approaches can effectively elevate an LLM agent’s safety by identifying and preventing potential dangers. Furthermore, we explore the intricate relationships between safety and helpfulness, and between the model’s reasoning ability and its efficacy as a safe agent. This paper underscores the imperative of integrating safety awareness and trustworthiness into the design and deployment of LLM-based agents, not only to enhance their performance but also to ensure their responsible integration into human-centric environments. Data and code are available at https://github.com/agiresearch/TrustAgent.
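
Structurally, the framework amounts to three hooks around plan generation. The skeleton below sketches that control flow in plain Python; the hook names, the keyword screen, and the toy constitution are illustrative stand-ins rather than TrustAgent's actual checks.

```python
# Skeleton of the three-stage safety structure described above (pre-, in-,
# and post-planning). The trivial keyword screen and example strings are
# illustrative stand-ins, not the TrustAgent implementation.

UNSAFE_KEYWORDS = {"rm -rf", "disable safety interlock"}

def pre_planning(task: str, constitution: list[str]) -> str:
    # Inject relevant safety knowledge into the prompt before planning.
    return "Safety rules:\n" + "\n".join(constitution) + f"\nTask: {task}"

def in_planning(step: str) -> bool:
    # Screen each proposed step while the plan is being generated.
    return not any(k in step.lower() for k in UNSAFE_KEYWORDS)

def post_planning(plan: list[str]) -> bool:
    # Final whole-plan inspection before execution.
    return all(in_planning(step) for step in plan)

prompt = pre_planning("clean the lab bench", ["Never damage equipment."])
plan = ["pick up cloth", "wipe bench"]        # would come from the LLM agent
assert post_planning(plan)
```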


标题: Scaling Sparse Fine-Tuning to Large Language Models

作者: Alan Ansell, Ivan Vulić, Hannah Sterz

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2401.16405v2

GitHub: https://github.com/AlanAnsell/peft|https://github.com/ducdauge/sft-llm|

中文摘要: 大型语言模型(LLM)由于其参数数量庞大,很难进行完全微调(例如,通过指令或人工反馈)。一系列参数高效的稀疏微调方法已被证明在性能方面有前途,但它们的内存需求随LLM的规模成比例增加。在这项工作中,我们将稀疏微调扩展到最先进的LLM,如LLaMA 2 7B和13B。我们提出了SpIEL,这是一种新的稀疏微调方法,对于期望的密度水平,它维护一组参数索引以及这些参数相对于其预训练值的增量。它迭代执行:(a)更新活动增量,(b)剪除索引(基于其增量幅值的变化),以及(c)重新生长索引。对于重新生长,我们探索了两种标准:基于若干候选参数的累积梯度,或基于使用高效SM3优化器估计的近似动量。我们在标准数据集混合上对LLM进行指令微调实验,发现SpIEL在性能上通常优于LoRA(低秩自适应)等流行的参数高效微调方法,并且在运行时间方面相当。我们还表明,SpIEL与量化和高效优化器兼容,有助于扩展到更大的模型规模。我们在https://github.com/AlanAnsell/peft发布SpIEL代码,在https://github.com/ducdauge/sft-llm发布指令微调实验代码。

摘要: Large Language Models (LLMs) are difficult to fully fine-tune (e.g., with instructions or human feedback) due to their sheer number of parameters. A family of parameter-efficient sparse fine-tuning methods have proven promising in terms of performance but their memory requirements increase proportionally to the size of the LLMs. In this work, we scale sparse fine-tuning to state-of-the-art LLMs like LLaMA 2 7B and 13B. We propose SpIEL, a novel sparse fine-tuning method which, for a desired density level, maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values. It iterates over: (a) updating the active deltas, (b) pruning indices (based on the change of magnitude of their deltas) and (c) regrowth of indices. For regrowth, we explore two criteria based on either the accumulated gradients of a few candidate parameters or their approximate momenta estimated using the efficient SM3 optimizer. We experiment with instruction-tuning of LLMs on standard dataset mixtures, finding that SpIEL is often superior to popular parameter-efficient fine-tuning methods like LoRA (low-rank adaptation) in terms of performance and comparable in terms of run time. We additionally show that SpIEL is compatible with both quantization and efficient optimizers, to facilitate scaling to ever-larger model sizes. We release the code for SpIEL at https://github.com/AlanAnsell/peft and for the instruction-tuning experiments at https://github.com/ducdauge/sft-llm.
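
A toy version of the index-plus-delta bookkeeping helps make the (a)/(b)/(c) loop concrete. The objective, schedule, and the dense-gradient regrowth rule below are simplifications for illustration, not the SpIEL implementation.

```python
import torch

# Toy illustration of the index-plus-delta bookkeeping described above:
# update the active deltas, periodically prune the least-changed indices,
# and regrow new indices from accumulated dense-gradient magnitude. The toy
# objective, thresholds and schedule are assumptions, not the SpIEL code.

torch.manual_seed(0)
w0 = torch.randn(1000)                       # flattened pretrained weights
density = 50
idx = torch.randperm(w0.numel())[:density]   # active parameter indices
delta = torch.zeros(density, requires_grad=True)
opt = torch.optim.SGD([delta], lr=0.1)
grad_accum = torch.zeros_like(w0)

for step in range(100):
    w = w0.clone()
    w[idx] = w[idx] + delta                  # (a) apply the active deltas
    loss = ((w - 1.0) ** 2).mean()           # stand-in fine-tuning objective
    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():
        grad_accum += (2.0 * (w.detach() - 1.0) / w.numel()).abs()  # dense grad of the toy loss
        if step % 20 == 19:
            keep = delta.abs().argsort(descending=True)[: density // 2]    # (b) prune small deltas
            cand = grad_accum.clone()
            cand[idx[keep]] = -1.0             # don't regrow what we keep
            new = cand.argsort(descending=True)[: density - keep.numel()]  # (c) regrow by |grad|
            new_delta = torch.zeros(density)
            new_delta[: keep.numel()] = delta.detach()[keep]
            idx = torch.cat([idx[keep], new])
            delta = new_delta.requires_grad_()
            opt = torch.optim.SGD([delta], lr=0.1)
            grad_accum.zero_()
```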


标题: Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields

作者: Jingbo Zhang, Xiaoyu Li, Ziyu Wan

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2305.11588v2

Project: https://eckertzhang.github.io/Text2NeRF.github.io/|

GitHub: https://github.com/eckertzhang/Text2NeRF|

摘要: Text-driven 3D scene generation is widely applicable to video gaming, film industry, and metaverse applications that have a large demand for 3D scenes. However, existing text-to-3D generation methods are limited to producing 3D objects with simple geometries and dreamlike styles that lack realism. In this work, we present Text2NeRF, which is able to generate a wide range of 3D scenes with complicated geometric structures and high-fidelity textures purely from a text prompt. To this end, we adopt NeRF as the 3D representation and leverage a pre-trained text-to-image diffusion model to constrain the 3D reconstruction of the NeRF to reflect the scene description. Specifically, we employ the diffusion model to infer the text-related image as the content prior and use a monocular depth estimation method to offer the geometric prior. Both content and geometric priors are utilized to update the NeRF model. To guarantee textured and geometric consistency between different views, we introduce a progressive scene inpainting and updating strategy for novel view synthesis of the scene. Our method requires no additional training data but only a natural language description of the scene as the input. Extensive experiments demonstrate that our Text2NeRF outperforms existing methods in producing photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of natural language prompts. Our code is available at https://github.com/eckertzhang/Text2NeRF.


== CLIP@ViT @ VLM @ visual model ==

标题: Di-NeRF: Distributed NeRF for Collaborative Learning with Unknown Relative Poses

作者: Mahboubeh Asadi, Kourosh Zareinia, Sajad Saeedi

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2402.01485v1

Project: https://sites.google.com/view/di-nerf/home|

中文摘要: 对未知环境进行协同建图可以比单个机器人更快、更鲁棒。然而,协作方法需要一个可扩展且能应对通信问题的分布式范式。这项工作提出了一种完全分布式的算法,使一组机器人能够共同优化神经辐射场(NeRF)的参数。该算法通过网状网络传递每个机器人训练的NeRF参数,其中每个机器人只使用自己的视觉数据训练其NeRF。此外,所有机器人的相对位姿与模型参数一起被联合优化,从而能够在相对相机位姿未知的情况下进行建图。我们表明,多机器人系统可以受益于由多个NeRF联合优化得到的可微且鲁棒的三维重建。真实世界和合成数据上的实验证明了该算法的有效性。实验视频和补充材料见项目网站(https://sites.google.com/view/di-nerf/home)。

摘要: Collaborative mapping of unknown environments can be done faster and more robustly than a single robot. However, a collaborative approach requires a distributed paradigm to be scalable and deal with communication issues. This work presents a fully distributed algorithm enabling a group of robots to collectively optimize the parameters of a Neural Radiance Field (NeRF). The algorithm involves the communication of each robot’s trained NeRF parameters over a mesh network, where each robot trains its NeRF and has access to its own visual data only. Additionally, the relative poses of all robots are jointly optimized alongside the model parameters, enabling mapping with unknown relative camera poses. We show that multi-robot systems can benefit from differentiable and robust 3D reconstruction optimized from multiple NeRFs. Experiments on real-world and synthetic data demonstrate the efficiency of the proposed algorithm. See the website of the project for videos of the experiments and supplementary material(https://sites.google.com/view/di-nerf/home).
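
Conceptually, each robot alternates local gradient steps on its own images with a communication step that reconciles parameters with its mesh neighbours. The sketch below uses plain decentralized parameter averaging on a generic model as a stand-in; the actual Di-NeRF update and its joint relative-pose optimization are more involved.

```python
import copy
import torch
import torch.nn as nn

# Conceptual stand-in for distributed NeRF training: each robot takes local
# gradient steps on its own data, then averages parameters with its mesh
# neighbours. Plain gossip averaging is an assumption for illustration; the
# actual Di-NeRF update (and joint relative-pose estimation) differs.

def make_model():
    return nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 4))  # xyz -> rgb + sigma

robots = [make_model() for _ in range(3)]
mesh = {0: [1], 1: [0, 2], 2: [1]}            # who can talk to whom
opts = [torch.optim.Adam(m.parameters(), lr=1e-3) for m in robots]

for round_ in range(10):
    # Local training: each robot only sees its own rays/pixels.
    for m, opt in zip(robots, opts):
        x, target = torch.randn(128, 3), torch.rand(128, 4)
        loss = ((m(x) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

    # Consensus: average parameters with mesh neighbours (including self).
    snapshots = [copy.deepcopy(m.state_dict()) for m in robots]
    for i, m in enumerate(robots):
        group = [i] + mesh[i]
        avg = {k: torch.stack([snapshots[j][k] for j in group]).mean(0)
               for k in snapshots[i]}
        m.load_state_dict(avg)
```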


标题: EmoSpeaker: One-shot Fine-grained Emotion-Controlled Talking Face Generation

作者: Guanwen Feng, Haoran Cheng, Yunan Li

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2402.01422v1

Project: https://peterfanfan.github.io/EmoSpeaker/|

中文摘要: 实现细粒度的情绪控制对于情绪生成任务至关重要,因为它增强了生成模型的表达能力,使其能够准确、全面地捕捉和表达各种细微的情绪状态,从而提高生成内容的情绪质量和个性化。仅使用一张肖像和一段音频来生成准确描绘情感表达的细粒度面部动画是一项挑战。为了应对这一挑战,我们提出了一种视觉属性引导的音频解耦器,使得能够获得仅与音频内容相关的内容向量,从而增强后续唇部运动系数预测的稳定性。为了实现更精确的情感表达,我们引入了一个细粒度的情感系数预测模块。此外,我们提出了一种使用细粒度情绪矩阵的情绪强度控制方法。通过这些方法,实现了对生成视频中情感表达的有效控制和情感强度的精细划分。随后,设计一系列3DMM系数生成网络来预测3D系数,并利用渲染网络生成最终视频。我们的实验结果表明,我们提出的方法EmoSpeaker在表情变化和嘴唇同步方面优于现有的情感说话人脸生成方法。项目页面:https://peterfanfan.github.io/EmoSpeaker/

摘要: Implementing fine-grained emotion control is crucial for emotion generation tasks because it enhances the expressive capability of the generative model, allowing it to accurately and comprehensively capture and express various nuanced emotional states, thereby improving the emotional quality and personalization of generated content. Generating fine-grained facial animations that accurately portray emotional expressions using only a portrait and an audio recording presents a challenge. In order to address this challenge, we propose a visual attribute-guided audio decoupler. This enables the obtention of content vectors solely related to the audio content, enhancing the stability of subsequent lip movement coefficient predictions. To achieve more precise emotional expression, we introduce a fine-grained emotion coefficient prediction module. Additionally, we propose an emotion intensity control method using a fine-grained emotion matrix. Through these, effective control over emotional expression in the generated videos and finer classification of emotion intensity are accomplished. Subsequently, a series of 3DMM coefficient generation networks are designed to predict 3D coefficients, followed by the utilization of a rendering network to generate the final video. Our experimental results demonstrate that our proposed method, EmoSpeaker, outperforms existing emotional talking face generation methods in terms of expression variation and lip synchronization. Project page: https://peterfanfan.github.io/EmoSpeaker/


标题: Conditional Diffusion Models for Semantic 3D Brain MRI Synthesis

作者: Zolnamar Dorjsembe, Hsing-Kuo Pao, Sodtavilan Odonchimed

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2305.18453v4

GitHub: https://github.com/mobaidoctor/med-ddpm/|

中文摘要: 医疗保健领域的人工智能(AI),尤其是医学成像领域,面临数据稀缺和隐私问题的挑战。针对这些问题,我们介绍了Med-DDPM,一种为3D语义脑MRI合成设计的扩散模型。该模型通过引入语义条件,有效地解决了数据稀缺和隐私问题。具体做法是将条件图像与模型输入按通道拼接,从而能够控制图像生成。与现有的3D脑成像合成方法相比,Med-DDPM表现出更好的稳定性和性能。它生成多样化、解剖结构连贯且具有高视觉保真度的图像。在肿瘤分割任务的Dice分数方面,Med-DDPM达到0.6207,接近真实图像的0.6531,并优于基线模型。与真实图像结合使用时,它进一步将分割精度提高到0.6675,显示了我们所提出方法用于数据增强的潜力。该模型是扩散模型在3D语义脑MRI合成中的首次应用,能够生成高质量的图像。其语义条件功能还展示了在生物医学成像中进行图像匿名化的潜力,有助于解决数据和隐私问题。我们在GitHub仓库(https://github.com/mobaidoctor/med-ddpm/)上提供了Med-DDPM的代码和模型权重,以支持可复现性。

摘要: Artificial intelligence (AI) in healthcare, especially in medical imaging, faces challenges due to data scarcity and privacy concerns. Addressing these, we introduce Med-DDPM, a diffusion model designed for 3D semantic brain MRI synthesis. This model effectively tackles data scarcity and privacy issues by integrating semantic conditioning. This involves the channel-wise concatenation of a conditioning image to the model input, enabling control in image generation. Med-DDPM demonstrates superior stability and performance compared to existing 3D brain imaging synthesis methods. It generates diverse, anatomically coherent images with high visual fidelity. In terms of dice score accuracy in the tumor segmentation task, Med-DDPM achieves 0.6207, close to the 0.6531 accuracy of real images, and outperforms baseline models. Combined with real images, it further increases segmentation accuracy to 0.6675, showing the potential of our proposed method for data augmentation. This model represents the first use of a diffusion model in 3D semantic brain MRI synthesis, producing high-quality images. Its semantic conditioning feature also shows potential for image anonymization in biomedical imaging, addressing data and privacy issues. We provide the code and model weights for Med-DDPM on our GitHub repository (https://github.com/mobaidoctor/med-ddpm/) to support reproducibility.
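
The semantic conditioning described above boils down to concatenating the conditioning volume with the noisy image along the channel axis before it enters the denoiser. A minimal 3D sketch follows, with a tiny convolutional stand-in for the Med-DDPM U-Net (timestep embedding and the diffusion loop are omitted).

```python
import torch
import torch.nn as nn

# Channel-wise concatenation conditioning, as described above: the noisy MRI
# volume and the semantic mask are stacked along the channel axis and fed to
# the denoiser. The tiny 3D conv net is a stand-in for the Med-DDPM U-Net
# (timestep embedding omitted for brevity).

class ToyConditionalDenoiser(nn.Module):
    def __init__(self, image_ch=1, cond_ch=4, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(image_ch + cond_ch, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv3d(hidden, image_ch, 3, padding=1),
        )

    def forward(self, noisy_image, cond_mask):
        x = torch.cat([noisy_image, cond_mask], dim=1)   # (B, image+cond, D, H, W)
        return self.net(x)                               # predicted noise

model = ToyConditionalDenoiser()
noisy = torch.randn(1, 1, 32, 32, 32)                    # noised MRI at some timestep
mask = torch.randint(0, 2, (1, 4, 32, 32, 32)).float()   # one-hot tissue/tumor labels
eps_hat = model(noisy, mask)
print(eps_hat.shape)                                     # torch.Size([1, 1, 32, 32, 32])
```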


标题: Enlighten-Your-Voice: When Multimodal Meets Zero-shot Low-light Image Enhancement

作者: Xiaofeng Zhang, Zishan Xu, Hao Tang

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2312.10109v2

GitHub: https://github.com/zhangbaijin/Enlighten-Your-Voice|

中文摘要: 弱光图像增强是一项至关重要的视觉任务,而许多无监督方法往往忽略弱光场景中可见信息的退化,这会不利地影响互补信息的融合,阻碍生成令人满意的结果。为了解决这个问题,我们的研究引入了"Enlighten-Your-Voice",这是一个多模态增强框架,通过语音和文本命令创新性地丰富了用户交互。这种方法不仅意味着技术上的飞跃,也代表了用户参与方式的范式转变。我们的模型配备了双重协作注意力模块(DCAM),细致地处理不同的内容和颜色差异,从而实现细致入微的增强。作为补充,我们引入了一个语义特征融合(SFM)即插即用模块,该模块将语义上下文与弱光增强操作相结合,提升了算法的效果。至关重要的是,"Enlighten-Your-Voice"在无监督零样本场景中展现出卓越的泛化能力。源代码可从https://github.com/zhangbaijin/Enlighten-Your-Voice获取。

摘要: Low-light image enhancement is a crucial visual task, and many unsupervised methods tend to overlook the degradation of visible information in low-light scenes, which adversely affects the fusion of complementary information and hinders the generation of satisfactory results. To address this, our study introduces “Enlighten-Your-Voice”, a multimodal enhancement framework that innovatively enriches user interaction through voice and textual commands. This approach does not merely signify a technical leap but also represents a paradigm shift in user engagement. Our model is equipped with a Dual Collaborative Attention Module (DCAM) that meticulously caters to distinct content and color discrepancies, thereby facilitating nuanced enhancements. Complementarily, we introduce a Semantic Feature Fusion (SFM) plug-and-play module that synergizes semantic context with low-light enhancement operations, sharpening the algorithm’s efficacy. Crucially, “Enlighten-Your-Voice” showcases remarkable generalization in unsupervised zero-shot scenarios. The source code can be accessed from https://github.com/zhangbaijin/Enlighten-Your-Voice


标题: Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

作者: Arjun Majumdar, Karmesh Yadav, Sergio Arnaud

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2303.18240v2

Project: https://eai-vc.github.io|

中文摘要: 我们针对具身智能(Embodied AI)的预训练视觉表征(PVR,即视觉"基础模型")开展了迄今规模最大、最全面的实证研究。首先,我们构建了CortexBench,由17个不同的任务组成,涵盖运动、导航、灵巧操作和移动操作。接下来,我们系统地评估了现有的PVR,发现没有一个是普遍占优的。为了研究预训练数据规模和多样性的影响,我们结合了来自7个不同来源的超过4000小时的自我中心视频(超过430万张图像)和ImageNet,使用掩码自编码(MAE)在这些数据的不同切片上训练不同规模的视觉Transformer。与先前工作的推论相反,我们发现扩大数据集规模和多样性并不能普遍提高性能(但平均而言确实如此)。我们最大的模型,命名为VC-1,平均优于所有先前的PVR,但也并非普遍占优。接下来,我们展示了对VC-1进行特定任务或领域的适配会带来实质性的收益,VC-1(适配后)在CortexBench的所有基准测试中都取得了与已知最佳结果相当或更优的性能。最后,我们展示了真实世界的硬件实验,其中VC-1和VC-1(适配后)优于最强的已有PVR。总的来说,本文没有提出新的技术,而是提供了一次严格的系统评估、一组关于PVR的广泛发现(在某些情况下推翻了先前工作在狭窄领域中得出的结论),以及开源的代码和模型(训练耗费超过10,000 GPU小时),以惠及研究社区。

摘要: We present the largest and most comprehensive empirical study of pre-trained visual representations (PVRs) or visual ‘foundation models’ for Embodied AI. First, we curate CortexBench, consisting of 17 different tasks spanning locomotion, navigation, dexterous, and mobile manipulation. Next, we systematically evaluate existing PVRs and find that none are universally dominant. To study the effect of pre-training data size and diversity, we combine over 4,000 hours of egocentric videos from 7 different sources (over 4.3M images) and ImageNet to train different-sized vision transformers using Masked Auto-Encoding (MAE) on slices of this data. Contrary to inferences from prior work, we find that scaling dataset size and diversity does not improve performance universally (but does so on average). Our largest model, named VC-1, outperforms all prior PVRs on average but does not universally dominate either. Next, we show that task- or domain-specific adaptation of VC-1 leads to substantial gains, with VC-1 (adapted) achieving competitive or superior performance than the best known results on all of the benchmarks in CortexBench. Finally, we present real-world hardware experiments, in which VC-1 and VC-1 (adapted) outperform the strongest pre-existing PVR. Overall, this paper presents no new techniques but a rigorous systematic evaluation, a broad set of findings about PVRs (that in some cases, refute those made in narrow domains in prior work), and open-sourced code and models (that required over 10,000 GPU-hours to train) for the benefit of the research community.


标题: AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning

作者: Fu-Yun Wang, Zhaoyang Huang, Xiaoyu Shi

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2402.00769v1

Project: https://animatelcm.github.io/|

GitHub: https://github.com/G-U-N/AnimateLCM|

中文摘要: 视频扩散模型因其能够生成连贯且高保真的视频而受到越来越多的关注。然而,迭代去噪过程使其计算密集且耗时,从而限制了其应用。一致性模型(CM)通过蒸馏预训练图像扩散模型,以极少的步数加速采样,其扩展潜在一致性模型(LCM)也在条件图像生成上取得了成功。受此启发,我们提出了AnimateLCM,能够在极少步数内生成高保真视频。我们没有直接在原始视频数据集上进行一致性学习,而是提出了一种解耦一致性学习策略,将图像生成先验和运动生成先验的蒸馏解耦,从而提高了训练效率并提升了生成的视觉质量。此外,为了能够组合Stable Diffusion社区中的即插即用适配器以实现各种功能(例如,用于可控生成的ControlNet),我们提出了一种有效的策略,使现有适配器适配到我们蒸馏得到的文本条件视频一致性模型,或者在不损害采样速度的情况下从头训练适配器。我们在图像条件视频生成和布局条件视频生成中验证了所提出的策略,均取得了最优的性能。实验结果验证了所提方法的有效性。代码和权重将公开。更多详情请访问https://github.com/G-U-N/AnimateLCM。

摘要: Video diffusion models has been gaining increasing attention for its ability to produce videos that are both coherent and of high fidelity. However, the iterative denoising process makes it computationally intensive and time-consuming, thus limiting its applications. Inspired by the Consistency Model (CM) that distills pretrained image diffusion models to accelerate the sampling with minimal steps and its successful extension Latent Consistency Model (LCM) on conditional image generation, we propose AnimateLCM, allowing for high-fidelity video generation within minimal steps. Instead of directly conducting consistency learning on the raw video dataset, we propose a decoupled consistency learning strategy that decouples the distillation of image generation priors and motion generation priors, which improves the training efficiency and enhance the generation visual quality. Additionally, to enable the combination of plug-and-play adapters in stable diffusion community to achieve various functions (e.g., ControlNet for controllable generation). we propose an efficient strategy to adapt existing adapters to our distilled text-conditioned video consistency model or train adapters from scratch without harming the sampling speed. We validate the proposed strategy in image-conditioned video generation and layout-conditioned video generation, all achieving top-performing results. Experimental results validate the effectiveness of our proposed method. Code and weights will be made public. More details are available at https://github.com/G-U-N/AnimateLCM.
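
At the core of this family of methods is consistency distillation: the student is trained so that its predictions from two adjacent points of a denoising trajectory agree, with an EMA copy providing the target. The sketch below is a heavily simplified, generic version; the toy model, the stand-in "teacher step", and all schedules are assumptions, and AnimateLCM's decoupled image/motion distillation and adapter handling are not shown.

```python
import copy
import torch
import torch.nn as nn

# Heavily simplified consistency-distillation step: enforce that the student
# maps two adjacent points of a noising trajectory to the same output, using
# an EMA copy of the student as the target. The toy model, the one-step
# "teacher" construction of x_s, and all schedules are illustrative only.

student = nn.Sequential(nn.Linear(17, 64), nn.SiLU(), nn.Linear(64, 16))
ema = copy.deepcopy(student).requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def f(model, x, t):
    # Consistency function: predict the clean sample from (x_t, t).
    return model(torch.cat([x, t.expand(x.size(0), 1)], dim=1))

for _ in range(100):
    x0 = torch.randn(8, 16)                      # "data" (e.g., latent video frames)
    noise = torch.randn_like(x0)
    t = torch.rand(1) * 0.9 + 0.1
    s = t - 0.1                                  # adjacent, earlier timestep
    x_t = x0 + t * noise                         # toy forward noising
    x_s = x0 + s * noise                         # stand-in for a teacher ODE step

    loss = ((f(student, x_t, t) - f(ema, x_s, s).detach()) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():                        # EMA target update
        for p_ema, p in zip(ema.parameters(), student.parameters()):
            p_ema.mul_(0.99).add_(p, alpha=0.01)
```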


== diffusion policy@diffusion formulation@diffusion model ==

标题: InstantID: Zero-shot Identity-Preserving Generation in Seconds

作者: Qixun Wang, Xu Bai, Haofan Wang

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2401.07519v2

Project: https://instantid.github.io/|

GitHub: https://github.com/InstantID/InstantID|

中文摘要: 使用Textual Inversion、DreamBooth和LoRA等方法的个性化图像合成已经取得了重大进展。然而,它们在现实世界中的适用性受到高存储需求、冗长的微调过程以及对多个参考图像的需求的阻碍。相反,现有的基于ID嵌入的方法虽然只需要单次前向推理,但面临着挑战:它们要么需要对众多模型参数进行大量微调,要么缺乏与社区预训练模型的兼容性,要么无法保持较高的人脸保真度。针对这些限制,我们引入了InstantID,这是一个强大的基于扩散模型的解决方案。我们的即插即用模块仅使用一张面部图像就能熟练地处理各种风格的图像个性化,同时确保高保真度。为了实现这一点,我们设计了一个新颖的IdentityNet,通过施加强语义和弱空间条件,将面部图像和关键点图像与文本提示相结合来引导图像生成。InstantID展示了卓越的性能和效率,在身份保持至关重要的实际应用中非常有益。此外,我们的工作可与SD1.5和SDXL等流行的预训练文本到图像扩散模型无缝集成,作为一个适应性强的插件。我们的代码和预训练检查点将在https://github.com/InstantID/InstantID上提供。

摘要: There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our codes and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.


标题: MagicPose: Realistic Human Poses and Facial Expressions Retargeting with Identity-aware Diffusion

作者: Di Chang, Yichun Shi, Quankai Gao

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2311.12052v2

Project: https://boese0601.github.io/magicdance/|

GitHub: https://github.com/Boese0601/MagicDance|

中文摘要: 在这项工作中,我们提出了MagicPose,这是一个基于扩散模型的2D人体姿势和面部表情重定向模型。具体来说,给定一张参考图像,我们的目标是通过控制姿势和面部表情来生成一个人的新图像,同时保持身份不变。为此,我们提出了一种两阶段训练策略来解耦人体运动和外观(例如,面部表情、肤色和着装),包括(1)外观控制模块的预训练和(2)学习与外观解耦的姿势控制。我们的新颖设计能够对生成的人体图像进行稳健的外观控制,包括身体、面部属性,甚至背景。通过利用图像扩散模型的先验知识,MagicPose可以很好地泛化到未见过的人物身份和复杂的姿势,而无需额外的微调。此外,所提出的模型易于使用,可以被视为Stable Diffusion的插件模块/扩展。

摘要: In this work, we propose MagicPose, a diffusion-based model for 2D human pose and facial expression retargeting. Specifically, given a reference image, we aim to generate a person’s new images by controlling the poses and facial expressions while keeping the identity unchanged. To this end, we propose a two-stage training strategy to disentangle human motions and appearance (e.g., facial expressions, skin tone and dressing), consisting of (1) the pre-training of an appearance-control block and (2) learning appearance-disentangled pose control. Our novel design enables robust appearance control over generated human images, including body, facial attributes, and even background. By leveraging the prior knowledge of image diffusion models, MagicPose generalizes well to unseen human identities and complex poses without the need for additional fine-tuning. Moreover, the proposed model is easy to use and can be considered as a plug-in module/extension to Stable Diffusion.


标题: Text Image Inpainting via Global Structure-Guided Diffusion Models

作者: Shipeng Zhu, Pengfei Fang, Chenjie Zhu

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2401.14832v2

GitHub: https://github.com/blackprotoss/GSDM|

中文摘要: 真实世界的文本可能会被环境或人为因素引起的腐蚀问题损坏,这阻碍了文本完整风格的保存,例如纹理和结构。这些腐蚀问题,例如涂鸦标志和不完整的签名,给理解文本带来困难,从而对下游应用,例如场景文本识别和签名识别带来重大挑战。值得注意的是,当前的修复技术通常不能充分解决这个问题,并且难以恢复准确的文本图像以及合理和一致的样式。本文将此表述为文本图像修复的一个公开问题,旨在建立一个基准来促进其研究。在此过程中,我们建立了两个特定的文本修复数据集,分别包含场景文本图像和手写文本图像。它们中的每一个都包括由现实生活和合成数据集修改的图像,以成对的原始图像、损坏的图像和其他辅助信息为特色。在数据集的基础上,我们进一步开发了一个新的神经框架,全局结构引导扩散模型(GSDM),作为一个潜在的解决方案。利用文本的全局结构作为先验,所提出的GSDM开发了一个有效的扩散模型来恢复干净的文本。我们的方法的有效性通过彻底的实证研究得到了证明,包括识别准确性和图像质量的显著提高。这些发现不仅突出了我们的方法的有效性,而且强调了它在更广泛的文本图像理解和处理领域的潜力。代码和数据集可从以下网址获得:https://github.com/blackprotoss/GSDM。

摘要: Real-world text can be damaged by corrosion issues caused by environmental or human factors, which hinder the preservation of the complete styles of texts, e.g., texture and structure. These corrosion issues, such as graffiti signs and incomplete signatures, bring difficulties in understanding the texts, thereby posing significant challenges to downstream applications, e.g., scene text recognition and signature identification. Notably, current inpainting techniques often fail to adequately address this problem and have difficulties restoring accurate text images along with reasonable and consistent styles. Formulating this as an open problem of text image inpainting, this paper aims to build a benchmark to facilitate its study. In doing so, we establish two specific text inpainting datasets which contain scene text images and handwritten text images, respectively. Each of them includes images revamped by real-life and synthetic datasets, featuring pairs of original images, corrupted images, and other assistant information. On top of the datasets, we further develop a novel neural framework, Global Structure-guided Diffusion Model (GSDM), as a potential solution. Leveraging the global structure of the text as a prior, the proposed GSDM develops an efficient diffusion model to recover clean texts. The efficacy of our approach is demonstrated by thorough empirical study, including a substantial boost in both recognition accuracy and image quality. These findings not only highlight the effectiveness of our method but also underscore its potential to enhance the broader field of text image understanding and processing. Code and datasets are available at: https://github.com/blackprotoss/GSDM.


标题: Repositioning the Subject within Image

作者: Yikai Wang, Chenjie Cao, Qiaole Dong

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.16861v1

Project: https://yikai-wang.github.io/seele/|

GitHub: https://github.com/Yikai-Wang/ReS|

中文摘要: 当前的图像操纵主要集中在静态操纵上,例如替换图像中的特定区域或改变其整体风格。在本文中,我们介绍了一项创新的动态操纵任务:主体重新定位。该任务是将用户指定的主体移动到期望的位置,同时保持图像的保真度。我们的研究表明,主体重新定位的基本子任务,包括填充主体移走后留下的空白、重建主体被遮挡的部分,以及将主体与周围区域融合,可以有效地重新表述为一个统一的、提示引导的图像修复(inpainting)任务。因此,我们可以使用单个扩散生成模型,通过我们提出的任务反演(task inversion)技术学习到的各种任务提示来处理这些子任务。此外,我们集成了预处理和后处理技术,以进一步提高主体重新定位的质量。这些元素共同构成了我们的SEgment-gEnerate-and-bLEnd(SEELE)框架。为了评估SEELE在主体重新定位方面的有效性,我们构建了一个名为ReS的真实世界主体重新定位数据集。我们在ReS上的结果证明了重新定位后图像生成的质量。

摘要: Current image manipulation primarily centers on static manipulation, such as replacing specific regions within an image or altering its overall style. In this paper, we introduce an innovative dynamic manipulation task, subject repositioning. This task involves relocating a user-specified subject to a desired position while preserving the image’s fidelity. Our research reveals that the fundamental sub-tasks of subject repositioning, which include filling the void left by the repositioned subject, reconstructing obscured portions of the subject and blending the subject to be consistent with surrounding areas, can be effectively reformulated as a unified, prompt-guided inpainting task. Consequently, we can employ a single diffusion generative model to address these sub-tasks using various task prompts learned through our proposed task inversion technique. Additionally, we integrate pre-processing and post-processing techniques to further enhance the quality of subject repositioning. These elements together form our SEgment-gEnerate-and-bLEnd (SEELE) framework. To assess SEELE’s effectiveness in subject repositioning, we assemble a real-world subject repositioning dataset called ReS. Our results on ReS demonstrate the quality of repositioned image generation.
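
The decomposition into sub-tasks maps naturally onto a short pipeline in which every step is one prompt-guided inpainting call. The stubs `segment_subject` and `inpaint`, and the task-prompt strings, are hypothetical placeholders for SEELE's actual models.

```python
import numpy as np

# How subject repositioning decomposes into the sub-steps described above,
# each cast as a (task-)prompt-guided inpainting call. `segment_subject` and
# `inpaint` are hypothetical stubs standing in for SEELE's actual models.

def segment_subject(image: np.ndarray, click_xy) -> np.ndarray:
    raise NotImplementedError          # returns a boolean subject mask (H, W)

def inpaint(image: np.ndarray, mask: np.ndarray, task_prompt: str) -> np.ndarray:
    raise NotImplementedError          # one diffusion model, different task prompts

def reposition(image, click_xy, shift):
    # shift = (dy, dx): where the user wants the subject moved.
    mask = segment_subject(image, click_xy)
    moved_mask = np.roll(mask, shift, axis=(0, 1))
    shifted = np.roll(image, shift, axis=(0, 1))

    # Paste the subject at its new location.
    canvas = np.where(moved_mask[..., None], shifted, image)

    # 1) Fill the hole the subject left behind.
    canvas = inpaint(canvas, mask & ~moved_mask, task_prompt="fill-void")
    # 2) Complete parts of the subject that were previously occluded.
    canvas = inpaint(canvas, moved_mask, task_prompt="complete-subject")
    # 3) Harmonize the subject with its new surroundings.
    canvas = inpaint(canvas, moved_mask, task_prompt="blend")
    return canvas
```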


标题: Inversion by Direct Iteration: An Alternative to Denoising Diffusion for Image Restoration

作者: Mauricio Delbracio, Peyman Milanfar

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2303.11435v5

中文摘要: 直接迭代反演(InDI)是一种用于监督图像恢复的新公式,它避免了所谓的“回归均值”效应,并且比现有的基于回归的方法产生更真实和详细的图像。它通过小步逐步提高图像质量来实现这一点,类似于生成式去噪扩散模型。图像恢复是一个不适定问题,其中多个高质量图像是给定低质量输入的似是而非的重建。因此,单步回归模型的结果通常是所有可能解释的集合,因此缺乏细节和真实性。InDI的主要优点是它不试图在一个步骤中预测干净的目标图像,而是在小步骤中逐渐改善图像,从而产生更好的感知质量。虽然生成去噪扩散模型也在小步骤中工作,但我们的公式是独特的,因为它不需要退化过程的任何分析形式的知识。相反,我们直接从低质量和高质量的配对示例中学习迭代恢复过程。给定成对的训练数据,InDI可以应用于几乎任何图像退化。在条件去噪扩散图像恢复中,去噪网络通过重复去噪纯噪声的初始图像来生成恢复的图像,条件是退化的输入。与条件去噪公式相反,InDI直接通过迭代恢复输入的低质量图像,在各种图像恢复任务中产生高质量的结果,包括运动和离焦去模糊、超分辨率、压缩伪影去除和去噪。

摘要: Inversion by Direct Iteration (InDI) is a new formulation for supervised image restoration that avoids the so-called “regression to the mean” effect and produces more realistic and detailed images than existing regression-based methods. It does this by gradually improving image quality in small steps, similar to generative denoising diffusion models. Image restoration is an ill-posed problem where multiple high-quality images are plausible reconstructions of a given low-quality input. Therefore, the outcome of a single step regression model is typically an aggregate of all possible explanations, therefore lacking details and realism. The main advantage of InDI is that it does not try to predict the clean target image in a single step but instead gradually improves the image in small steps, resulting in better perceptual quality. While generative denoising diffusion models also work in small steps, our formulation is distinct in that it does not require knowledge of any analytic form of the degradation process. Instead, we directly learn an iterative restoration process from low-quality and high-quality paired examples. InDI can be applied to virtually any image degradation, given paired training data. In conditional denoising diffusion image restoration the denoising network generates the restored image by repeatedly denoising an initial image of pure noise, conditioned on the degraded input. Contrary to conditional denoising formulations, InDI directly proceeds by iteratively restoring the input low-quality image, producing high-quality results on a variety of image restoration tasks, including motion and out-of-focus deblurring, super-resolution, compression artifact removal, and denoising.
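
The small-step idea can be written in a few lines if the degraded observation is viewed as a linear interpolation x_t = (1 - t)·x_clean + t·y between the clean and degraded images: stepping from t to t - δ then only needs the model's current estimate of the clean image. The sketch below assumes that interpolation view and uses an untrained placeholder network; InDI's training procedure and noise handling are omitted.

```python
import torch
import torch.nn as nn

# Minimal sketch of small-step inversion under the interpolation view
# x_t = (1 - t) * x_clean + t * y_degraded. `restorer` is a placeholder for
# a trained network that predicts the clean image from x_t (and t).

restorer = nn.Identity()              # placeholder; a trained model goes here

def small_step_restore(y_degraded: torch.Tensor, steps: int = 10) -> torch.Tensor:
    x_t = y_degraded.clone()          # start from the degraded observation (t = 1)
    delta = 1.0 / steps
    t = 1.0
    for _ in range(steps):
        x0_hat = restorer(x_t)        # current estimate of the clean image
        # Step t -> t - delta, exact under the linear-interpolation assumption.
        x_t = (delta / t) * x0_hat + (1.0 - delta / t) * x_t
        t -= delta
    return x_t

restored = small_step_restore(torch.rand(1, 3, 64, 64))
```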


标题: Boximator: Generating Rich and Controllable Motions for Video Synthesis

作者: Jiawei Wang, Yuchen Zhang, Jiaxin Zou

PubTime: 2024-02-02

Downlink: http://arxiv.org/abs/2402.01566v1

摘要: Generating rich and controllable motion is a pivotal challenge in video synthesis. We propose Boximator, a new approach for fine-grained motion control. Boximator introduces two constraint types: hard box and soft box. Users select objects in the conditional frame using hard boxes and then use either type of boxes to roughly or rigorously define the object’s position, shape, or motion path in future frames. Boximator functions as a plug-in for existing video diffusion models. Its training process preserves the base model’s knowledge by freezing the original weights and training only the control module. To address training challenges, we introduce a novel self-tracking technique that greatly simplifies the learning of box-object correlations. Empirically, Boximator achieves state-of-the-art video quality (FVD) scores, improving on two base models, and further enhanced after incorporating box constraints. Its robust motion controllability is validated by drastic increases in the bounding box alignment metric. Human evaluation also shows that users favor Boximator generation results over the base model.
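
The training recipe described above (freeze the base model, train only the control module) is easy to sketch. The tiny stand-in modules below and the way box coordinates are injected are illustrative assumptions, not the Boximator architecture.

```python
import torch
import torch.nn as nn

# Sketch of the "freeze the base model, train only the control module"
# recipe described above. The stand-in modules and the way box constraints
# are encoded are illustrative assumptions, not Boximator.

base_denoiser = nn.Sequential(nn.Linear(64, 64), nn.SiLU(), nn.Linear(64, 64))
for p in base_denoiser.parameters():
    p.requires_grad_(False)                      # preserve the base model's knowledge

control_module = nn.Sequential(nn.Linear(4, 64), nn.SiLU(), nn.Linear(64, 64))
opt = torch.optim.Adam(control_module.parameters(), lr=1e-4)

for _ in range(100):
    latent = torch.randn(8, 64)                  # stand-in video latent features
    boxes = torch.rand(8, 4)                     # (x1, y1, x2, y2) hard/soft box per sample
    target = torch.randn(8, 64)                  # stand-in denoising target

    pred = base_denoiser(latent + control_module(boxes))   # control signal injected
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Only the control module received gradients; the base model stayed untouched.
assert all(p.grad is None for p in base_denoiser.parameters())
```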


== Visual Navigation@VLN @ Visual Language Navigation ==

标题: SubPipe: A Submarine Pipeline Inspection Dataset for Segmentation and Visual-inertial Localization

作者: Olaya Álvarez-Tuñón, Luiza Ribeiro Marnet, László Antal

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2401.17907v1

GitHub: https://github.com/remaro-network/SubPipe-dataset|

中文摘要: 本文介绍了SubPipe,这是一个用于SLAM、目标检测和图像分割的水下数据集。SubPipe使用由OceanScan MST运营的LAUV进行采集,该航行器搭载了一套传感器,包括两个摄像机、一个侧扫声纳和一个惯性导航系统等。该AUV被部署在一个管道检测环境中,海底管道部分被沙子覆盖。AUV的位姿真值由导航传感器估计得到。侧扫声纳和RGB图像分别包含目标检测和分割标注。我们在SubPipe上对最先进的分割、目标检测和SLAM方法进行了基准测试,以展示该数据集在应用计算机视觉算法方面的挑战和机遇。据作者所知,这是第一个提供真实管道检测场景的带标注水下数据集。数据集和实验可在https://github.com/remaro-network/SubPipe-dataset公开获取。

摘要: This paper presents SubPipe, an underwater dataset for SLAM, object detection, and image segmentation. SubPipe has been recorded using a LAUV (light autonomous underwater vehicle), operated by OceanScan MST, and carrying a sensor suite including two cameras, a side-scan sonar, and an inertial navigation system, among other sensors. The AUV has been deployed in a pipeline inspection environment with a submarine pipe partially covered by sand. The AUV’s pose ground truth is estimated from the navigation sensors. The side-scan sonar and RGB images include object detection and segmentation annotations, respectively. State-of-the-art segmentation, object detection, and SLAM methods are benchmarked on SubPipe to demonstrate the dataset’s challenges and opportunities for leveraging computer vision algorithms. To the authors’ knowledge, this is the first annotated underwater dataset providing a real pipeline inspection scenario. The dataset and experiments are publicly available online at https://github.com/remaro-network/SubPipe-dataset


标题: Test-time Adaptive Vision-and-Language Navigation

作者: Junyu Gao, Xuan Yao, Changsheng Xu

PubTime: 2024-02-01

Downlink: http://arxiv.org/abs/2311.13209v2

中文摘要: 视觉与语言导航(VLN)近年来取得了重大进展,这在很大程度上归功于精心构建的数据集和训练良好的模型。然而,在不同环境中测试时,训练好的模型不可避免地会遇到数据分布的显著变化,这表明仅仅依靠预训练且固定的导航模型是不够的。为了增强模型的泛化能力,测试时自适应(TTA)通过利用未标注的测试样本进行模型更新,在计算机视觉领域显示出巨大的潜力。然而,简单地将现有的TTA方法应用于VLN任务无法很好地处理VLN模型的适应性-稳定性困境,即频繁的更新会导致模型参数的剧烈变化,而偶尔的更新又会使模型难以应对动态变化的环境。因此,我们提出了一种用于VLN的快-慢测试时自适应(FSTTA)方法,在统一的框架中对梯度和参数进行分解-累积分析。具体来说,在快速更新阶段,最近多步导航过程中产生的梯度被分解为具有不同一致性水平的分量。然后,这些分量被自适应地累积,以确定用于快速模型自适应的一致方向。在慢速更新阶段,收集历史记录的参数,并进行类似的分解-累积分析,以将模型恢复到稳定状态。大量实验表明,我们的方法在四个流行的基准测试中获得了令人印象深刻的性能提升。

摘要: Vision-and-Language Navigation (VLN) has witnessed significant advancements in recent years, largely attributed to meticulously curated datasets and proficiently trained models. Nevertheless, when tested in diverse environments, the trained models inevitably encounter significant shifts in data distribution, highlighting that relying solely on pre-trained and fixed navigation models is insufficient. To enhance models’ generalization ability, test-time adaptation (TTA) demonstrates significant potential in the computer vision field by leveraging unlabeled test samples for model updates. However, simply applying existing TTA methods to the VLN task cannot well handle the adaptability-stability dilemma of VLN models, i.e., frequent updates can result in drastic changes in model parameters, while occasional updates can make the models ill-equipped to handle dynamically changing environments. Therefore, we propose a Fast-Slow Test-Time Adaptation (FSTTA) approach for VLN by performing decomposition-accumulation analysis for both gradients and parameters in a unified framework. Specifically, in the fast update phase, gradients generated during the recent multi-step navigation process are decomposed into components with varying levels of consistency. Then, these components are adaptively accumulated to pinpoint a concordant direction for fast model adaptation. In the slow update phase, historically recorded parameters are gathered, and a similar decomposition-accumulation analysis is conducted to revert the model to a stable state. Extensive experiments show that our method obtains impressive performance gains on four popular benchmarks.
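
A toy rendition of the fast/slow idea: fast updates keep only the part of the current gradient that agrees with recent gradients, and a slow step periodically pulls parameters back toward their recent history. The entropy objective, the sign-agreement rule, and the schedule below are crude stand-ins for FSTTA's decomposition-accumulation analysis.

```python
import copy
import torch
import torch.nn as nn

# Toy version of the fast/slow test-time adaptation idea described above.
# The entropy objective, the sign-agreement filter, and the revert schedule
# are simplifications, not the FSTTA algorithm.

model = nn.Linear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
recent_grads, history = [], []

def tta_objective(logits):
    probs = logits.softmax(-1)
    return -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()  # entropy minimization

for step in range(60):
    logits = model(torch.randn(8, 16))           # unlabeled test observations
    loss = tta_objective(logits)
    opt.zero_grad(); loss.backward()

    # Fast update: keep only gradient entries whose sign agrees with the
    # average of the last few steps (a crude stand-in for decomposition).
    g = torch.cat([p.grad.flatten() for p in model.parameters()])
    recent_grads.append(g.clone())
    recent_grads = recent_grads[-5:]
    mean_g = torch.stack(recent_grads).mean(0)
    consistent = (g.sign() == mean_g.sign()).float()
    offset = 0
    for p in model.parameters():
        n = p.numel()
        p.grad.mul_(consistent[offset:offset + n].view_as(p))
        offset += n
    opt.step()

    # Slow update: every 20 steps, move back toward the parameter history.
    history.append(copy.deepcopy(model.state_dict()))
    if step % 20 == 19:
        with torch.no_grad():
            avg = {k: torch.stack([h[k] for h in history]).mean(0) for k in history[0]}
            for k, p in model.state_dict().items():
                p.copy_(0.5 * p + 0.5 * avg[k])
        history = []
```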


标题: Cognitive TransFuser: Semantics-guided Transformer-based Sensor Fusion for Improved Waypoint Prediction

作者: Hwan-Soo Choi, Jongoh Jeong, Young Hoo Cho

PubTime: 2024-01-31

Downlink: http://arxiv.org/abs/2308.02126v2

中文摘要: 对于智能自动驾驶智能体而言,传感器融合方法仍然是基于输入传感器获取的视觉全局上下文来理解驾驶场景的关键。具体来说,对于局部航路点预测任务,单模态网络仍然受限于对输入传感器灵敏度的强烈依赖,因此近期工作提倡在实践中对多个传感器进行特征级融合。虽然众所周知多种数据模态能够促进相互的上下文交换,但在部署到实际驾驶场景时需要以最小的计算量实时理解全局3D场景,因此在实际可用传感器数量有限的情况下,训练策略显得更加重要。有鉴于此,我们通过融合辅助任务特征,并使用辅助头进行基于模仿学习的航路点预测,来利用精心选择的、与目标任务高度相关的辅助任务(例如,交通灯识别和语义分割)。我们提出的基于RGB-LiDAR的多任务特征融合网络,命名为Cognitive TransFuser,显著增强并超越了基线网络,在CARLA模拟器中实现了更安全、更完整的道路导航。我们通过大量实验在Town05 Short和Town05 Long基准上验证了所提出的网络,实现了高达44.2 FPS的实时推理。

摘要: Sensor fusion approaches for intelligent self-driving agents remain key to driving scene understanding given visual global contexts acquired from input sensors. Specifically, for the local waypoint prediction task, single-modality networks are still limited by strong dependency on the sensitivity of the input sensor, and thus recent works therefore promote the use of multiple sensors in fusion in feature level in practice. While it is well known that multiple data modalities encourage mutual contextual exchange, it requires global 3D scene understanding in real-time with minimal computation upon deployment to practical driving scenarios, thereby placing greater significance on the training strategy given a limited number of practically usable sensors. In this light, we exploit carefully selected auxiliary tasks that are highly correlated with the target task of interest (e.g., traffic light recognition and semantic segmentation) by fusing auxiliary task features and also using auxiliary heads for waypoint prediction based on imitation learning. Our RGB-LIDAR-based multi-task feature fusion network, coined Cognitive TransFuser, augments and exceeds the baseline network by a significant margin for safer and more complete road navigation in the CARLA simulator. We validate the proposed network on the Town05 Short and Town05 Long Benchmark through extensive experiments, achieving up to 44.2 FPS real-time inference time.
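
The multi-task setup can be sketched as an RGB/LiDAR feature fusion trunk with a waypoint head plus auxiliary heads supervised jointly. All module sizes, the global (non-dense) "segmentation" head, and the loss weights below are placeholders, not the Cognitive TransFuser architecture.

```python
import torch
import torch.nn as nn

# Sketch of the multi-task idea described above: fuse RGB and LiDAR features,
# then supervise auxiliary heads (traffic-light state, a toy global
# "segmentation" head) alongside the waypoint head. All sizes, heads and
# loss weights are placeholders, not the Cognitive TransFuser architecture.

class ToyFusionNet(nn.Module):
    def __init__(self, feat=128, n_waypoints=4, n_classes=7):
        super().__init__()
        self.rgb_enc = nn.Sequential(nn.Conv2d(3, feat, 5, stride=4), nn.AdaptiveAvgPool2d(1))
        self.lidar_enc = nn.Sequential(nn.Conv2d(1, feat, 5, stride=4), nn.AdaptiveAvgPool2d(1))
        self.waypoint_head = nn.Linear(2 * feat, n_waypoints * 2)
        self.light_head = nn.Linear(2 * feat, 3)          # red / yellow / green
        self.seg_head = nn.Linear(2 * feat, n_classes)    # toy image-level class logits

    def forward(self, rgb, lidar_bev):
        z = torch.cat([self.rgb_enc(rgb).flatten(1),
                       self.lidar_enc(lidar_bev).flatten(1)], dim=1)
        return self.waypoint_head(z), self.light_head(z), self.seg_head(z)

net = ToyFusionNet()
rgb, bev = torch.randn(2, 3, 128, 128), torch.randn(2, 1, 128, 128)
wp, light, seg = net(rgb, bev)
loss = (((wp - torch.randn_like(wp)) ** 2).mean()                 # imitation (waypoint) loss
        + 0.5 * nn.functional.cross_entropy(light, torch.randint(0, 3, (2,)))
        + 0.5 * nn.functional.cross_entropy(seg, torch.randint(0, 7, (2,))))
loss.backward()
```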


标题: Pixel to Elevation: Learning to Predict Elevation Maps at Long Range using Images for Autonomous Offroad Navigation

作者: Chanyoung Chung, Georgios Georgakis, Patrick Spieler

PubTime: 2024-01-30

Downlink: http://arxiv.org/abs/2401.17484v1

中文摘要: 理解远距离的地形拓扑对于越野机器人任务的成功至关重要,尤其是在高速导航时。目前严重依赖的激光雷达传感器在进行较远距离建图时只能提供稀疏的测量。为了应对这一挑战,我们提出了一种新的基于学习的方法,能够仅使用机载自我中心图像实时预测远距离地形高程图。我们提出的方法由三个主要部分组成。首先,引入了一个基于Transformer的编码器,用于学习自我中心视图与先前鸟瞰图高程预测之间的跨视图关联。其次,提出了一种方向感知的位置编码,将复杂非结构化地形上的三维车辆位姿信息与多视图视觉图像特征相结合。最后,提出了一种历史增强的可学习地图嵌入,以在高程图预测之间实现更好的时间一致性,从而有利于下游导航任务。我们使用真实世界的越野驾驶数据,通过实验验证了所提方法在复杂非结构化地形中用于自主越野机器人导航的适用性。此外,该方法与当前最先进的方法进行了定性和定量比较。大量的现场实验表明,我们的方法在准确预测地形高程的同时有效地捕捉远距离的整体地形拓扑,优于基线模型。最后,我们进行了消融研究,以突出和理解所提方法关键组件的作用,并验证它们对提升越野机器人导航能力的适用性。

摘要: Understanding terrain topology at long-range is crucial for the success of off-road robotic missions, especially when navigating at high-speeds. LiDAR sensors, which are currently heavily relied upon for geometric mapping, provide sparse measurements when mapping at greater distances. To address this challenge, we present a novel learning-based approach capable of predicting terrain elevation maps at long-range using only onboard egocentric images in real-time. Our proposed method is comprised of three main elements. First, a transformer-based encoder is introduced that learns cross-view associations between the egocentric views and prior bird-eye-view elevation map predictions. Second, an orientation-aware positional encoding is proposed to incorporate the 3D vehicle pose information over complex unstructured terrain with multi-view visual image features. Lastly, a history-augmented learn-able map embedding is proposed to achieve better temporal consistency between elevation map predictions to facilitate the downstream navigational tasks. We experimentally validate the applicability of our proposed approach for autonomous offroad robotic navigation in complex and unstructured terrain using real-world offroad driving data. Furthermore, the method is qualitatively and quantitatively compared against the current state-of-the-art methods. Extensive field experiments demonstrate that our method surpasses baseline models in accurately predicting terrain elevation while effectively capturing the overall terrain topology at long-ranges. Finally, ablation studies are conducted to highlight and understand the effect of key components of the proposed approach and validate their suitability to improve offroad robotic navigation capabilities.
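
The cross-view association step can be pictured as queries from the prior bird's-eye-view elevation map attending to egocentric image tokens. The single attention layer, token counts, and the per-cell regression head below are placeholders for illustration.

```python
import torch
import torch.nn as nn

# Sketch of the cross-view association step described above: queries from the
# prior bird's-eye-view elevation map attend to egocentric image features.
# Shapes and the single attention layer are placeholders for illustration.

d = 64
bev_queries = torch.randn(1, 50 * 50, d)        # one query per BEV map cell
img_feats = torch.randn(1, 3 * 20 * 30, d)      # tokens from 3 egocentric views

cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
elev_head = nn.Linear(d, 1)                     # per-cell elevation regression

fused, _ = cross_attn(query=bev_queries, key=img_feats, value=img_feats)
elevation_map = elev_head(fused).view(1, 50, 50)   # predicted long-range elevation
print(elevation_map.shape)
```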

