专属领域论文订阅
VX 关注{晓理紫},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持
如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。
为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务,只需VX关注公号并回复{邮箱+论文主题}(如:123456@xx.com + chatgpt@large language model @LLM),主题必须是同一个领域,最多三个关键词。解释权归博主所有
分类:
== LLM ==
标题: Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields
作者: Jingbo Zhang, Xiaoyu Li, Ziyu Wan
PubTime: 2024-01-31
Downlink: http://arxiv.org/abs/2305.11588v2
Project: https://eckertzhang.github.io/Text2NeRF.github.io/|
GitHub: https://github.com/eckertzhang/Text2NeRF|
中文摘要: 文本驱动的3D场景生成广泛适用于对3D场景有大量需求的视频游戏、电影行业和元宇宙应用。然而,现有的文本到3D生成方法仅限于生成几何简单、风格梦幻而缺乏真实感的3D对象。在这项工作中,我们提出了Text2NeRF,它能够纯粹从文本提示中生成具有复杂几何结构和高保真纹理的各种3D场景。为此,我们采用NeRF作为3D表示,并利用预先训练的文本到图像扩散模型来约束NeRF的3D重建,以反映场景描述。具体来说,我们采用扩散模型来推断文本相关图像作为内容先验,并使用单目深度估计方法来提供几何先验。内容和几何先验都被用来更新NeRF模型。为了保证不同视图之间的纹理和几何一致性,我们引入了一种渐进的场景修复和更新策略来进行场景的新视图合成。我们的方法不需要额外的训练数据,只需要场景的自然语言描述作为输入。大量实验表明,我们的Text2NeRF在从各种自然语言提示生成照片般逼真、多视图一致和多样化的3D场景方面优于现有方法。我们的代码可在 https://github.com/eckertzhang/Text2NeRF 获取。
摘要: Text-driven 3D scene generation is widely applicable to video gaming, film industry, and metaverse applications that have a large demand for 3D scenes. However, existing text-to-3D generation methods are limited to producing 3D objects with simple geometries and dreamlike styles that lack realism. In this work, we present Text2NeRF, which is able to generate a wide range of 3D scenes with complicated geometric structures and high-fidelity textures purely from a text prompt. To this end, we adopt NeRF as the 3D representation and leverage a pre-trained text-to-image diffusion model to constrain the 3D reconstruction of the NeRF to reflect the scene description. Specifically, we employ the diffusion model to infer the text-related image as the content prior and use a monocular depth estimation method to offer the geometric prior. Both content and geometric priors are utilized to update the NeRF model. To guarantee textured and geometric consistency between different views, we introduce a progressive scene inpainting and updating strategy for novel view synthesis of the scene. Our method requires no additional training data but only a natural language description of the scene as the input. Extensive experiments demonstrate that our Text2NeRF outperforms existing methods in producing photo-realistic, multi-view consistent, and diverse 3D scenes from a variety of natural language prompts. Our code is available at https://github.com/eckertzhang/Text2NeRF.
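下面给出体渲染(volume rendering)的最小示意代码,帮助理解NeRF这类3D表示如何由采样点的密度与颜色合成像素颜色,以及为何可以用单目深度先验对期望深度加以监督。这只是通用原理演示,并非Text2NeRF的官方实现,函数与数据均为假设:

```python
import numpy as np

def volume_render(sigmas, colors, deltas):
    """沿一条光线做体渲染:sigmas/colors 为采样点密度与颜色,deltas 为相邻采样间距。"""
    alphas = 1.0 - np.exp(-sigmas * deltas)                  # alpha_i = 1 - exp(-sigma_i * delta_i)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))  # 透射率 T_i
    weights = alphas * trans                                 # 每个采样点对像素的贡献权重
    rgb = (weights[:, None] * colors).sum(axis=0)            # 合成像素颜色
    depth = (weights * np.cumsum(deltas)).sum()              # 期望深度,可用单目深度先验监督
    return rgb, depth

# 示例:一条光线上 64 个采样点
n = 64
rgb, depth = volume_render(np.random.rand(n), np.random.rand(n, 3), np.full(n, 0.02))
print(rgb.shape, float(depth))
```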
标题: K-QA: A Real-World Medical Q&A Benchmark
作者: Itay Manes, Naama Ronn, David Cohen
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2401.14493v1
Project: https://huggingface.co/spaces/Itaykhealth/K-QA|
GitHub: https://github.com/Itaymanes/K-QA|
中文摘要: 确保大型语言模型(LLMs)提供的响应的准确性至关重要,尤其是在不正确的信息可能直接影响患者健康的临床环境中。为了应对这一挑战,我们构建了K-QA,这是一个包含1,212个患者问题的数据集,这些问题来自K Health(一个人工智能驱动的临床平台)上的真实世界对话。我们邀请了一组内部医生来回答这些问题,并手动将K-QA的一个子集分解成独立的陈述。此外,我们制定了两个基于NLI的评估指标,分别近似召回率和精确率:(1)全面性,测量生成的答案中包含的基本临床信息的比例;(2)幻觉率,测量医生撰写的回答中与LLM答案相矛盾的陈述数量。最后,我们使用K-QA和这些指标来评估几个最先进的模型,以及作者开发的上下文学习和面向医学的检索增强方案的效果。我们的发现表明,上下文学习提高了模型的全面性,而检索增强在减少幻觉方面是有效的。我们向社区公开K-QA,以推动医学上精确的NLP应用研究。
摘要: Ensuring the accuracy of responses provided by large language models (LLMs) is crucial, particularly in clinical settings where incorrect information may directly impact patient health. To address this challenge, we construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health (an AI-driven clinical platform). We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements. Additionally, we formulate two NLI-based evaluation metrics approximating recall and precision: (1) comprehensiveness, measuring the percentage of essential clinical information in the generated answer and (2) hallucination rate, measuring the number of statements from the physician-curated response contradicted by the LLM answer. Finally, we use K-QA along with these metrics to evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes developed by the authors. Our findings indicate that in-context learning improves the comprehensiveness of the models, and augmented retrieval is effective in reducing hallucinations. We make K-QA available to the community to spur research into medically accurate NLP applications.
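下面用一段示意代码说明摘要中两个基于NLI的指标的计算方式:全面性按被蕴含的医生陈述比例计,幻觉按相矛盾的陈述数量计。其中 nli 接口与示例数据均为假设,需替换为任意真实的NLI模型:

```python
from typing import Callable, List

def kqa_metrics(gold_statements: List[str], answer: str,
                nli: Callable[[str, str], str]):
    """gold_statements: 医生答案分解出的独立陈述;
    nli(premise, hypothesis) 返回 'entailment' / 'contradiction' / 'neutral'(接口为假设)。"""
    entailed = sum(nli(answer, s) == "entailment" for s in gold_statements)
    contradicted = sum(nli(answer, s) == "contradiction" for s in gold_statements)
    comprehensiveness = entailed / len(gold_statements)   # 覆盖了多少关键临床信息(近似召回)
    hallucination_count = contradicted                     # 与医生陈述相矛盾的陈述数
    return comprehensiveness, hallucination_count

# 用一个恒返回 'neutral' 的占位 NLI 演示调用方式
print(kqa_metrics(["阿莫西林可能引起皮疹。"], "这是一个占位答案。", lambda p, h: "neutral"))
```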
标题: Wordflow: Social Prompt Engineering for Large Language Models
作者: Zijie J. Wang, Aishwarya Chakravarthy, David Munechika
PubTime: 2024-01-25
Downlink: http://arxiv.org/abs/2401.14447v1
Project: https://poloclub.github.io/wordflow|https://youtu.be/3dOcVuofGVo|
GitHub: https://github.com/poloclub/wordflow/|
中文摘要: 大型语言模型(LLMs)需要精心制作的提示才能有效使用。提示工程,即设计提示的过程,具有挑战性,特别是对于不太熟悉人工智能技术的非专家来说。虽然研究人员已经提出了帮助LLM用户进行提示设计的技术和工具,但这些工作主要针对人工智能应用程序开发人员,而不是非专家。为了解决这一研究空白,我们提出了社交提示工程,这是一种利用社会计算技术来促进协作提示设计的新范式。为了研究社交提示工程,我们引入了Wordflow,这是一个开源的社交文本编辑器,使日常用户能够轻松地创建、运行、共享和发现LLM提示。此外,通过利用现代网络技术,Wordflow允许用户在他们的浏览器中本地、私密地运行LLMs。两个使用场景展示了社交提示工程和我们的工具如何增强普通用户与LLMs的交互。Wordflow 可通过 https://poloclub.github.io/wordflow 公开访问。
摘要: Large language models (LLMs) require well-crafted prompts for effective use. Prompt engineering, the process of designing prompts, is challenging, particularly for non-experts who are less familiar with AI technologies. While researchers have proposed techniques and tools to assist LLM users in prompt design, these works primarily target AI application developers rather than non-experts. To address this research gap, we propose social prompt engineering, a novel paradigm that leverages social computing techniques to facilitate collaborative prompt design. To investigate social prompt engineering, we introduce Wordflow, an open-source and social text editor that enables everyday users to easily create, run, share, and discover LLM prompts. Additionally, by leveraging modern web technologies, Wordflow allows users to run LLMs locally and privately in their browsers. Two usage scenarios highlight how social prompt engineering and our tool can enhance laypeople’s interaction with LLMs. Wordflow is publicly accessible at https://poloclub.github.io/wordflow.
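下面是一个极简的数据结构示意,说明"可创建、运行、共享、发现的提示词条目"大致可以如何组织。字段与函数名均为假设,并非Wordflow的真实实现:

```python
from dataclasses import dataclass, field
from typing import List
import json

@dataclass
class SharedPrompt:
    """一个可分享的提示词条目(字段仅为示意)。"""
    title: str
    prompt: str          # 发送给 LLM 的模板,{text} 为用户选中文本的占位符
    tags: List[str] = field(default_factory=list)
    run_count: int = 0   # 社区使用次数,可用于"发现"页排序

def render(p: SharedPrompt, user_text: str) -> str:
    p.run_count += 1
    return p.prompt.format(text=user_text)

improve = SharedPrompt("润色英文", "Improve the writing of the following text:\n{text}", ["writing"])
print(render(improve, "Large language models require well-crafted prompts."))
print(json.dumps(improve.__dict__, ensure_ascii=False))
```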
标题: SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models
作者: Xin Zhang, Dong Zhang, Shimin Li
PubTime: 2024-01-23
Downlink: http://arxiv.org/abs/2308.16692v2
Project: https://0nutation.github.io/SpeechTokenizer.github.io/|
GitHub: https://github.com/ZhangXInFD/SpeechTokenizer/|
中文摘要: 当前的语音大型语言模型建立在离散的语音表示之上,这些表示可以分为语义标记和声学标记。然而,现有的语音标记并不是专门为语音语言建模而设计的。为了评估语音标记对于构建语音语言模型的适用性,我们建立了第一个基准SLMTokBench。我们的结果表明,无论是语义标记还是声学标记都不适合这一目的。因此,我们提出了SpeechTokenizer,一个用于语音大型语言模型的统一语音标记器。SpeechTokenizer采用带有残差矢量量化(RVQ)的编码器-解码器架构。SpeechTokenizer统一了语义和声学标记,在不同的RVQ层上分层地解耦语音信息的不同方面。此外,我们利用SpeechTokenizer构建了一个统一的语音语言模型(USLM)。实验表明,SpeechTokenizer在语音重建方面的性能与EnCodec相当,并在SLMTokBench基准测试中表现出强大的性能。此外,USLM在零样本文本到语音任务中优于VALL-E。代码和模型可在 https://github.com/ZhangXInFD/SpeechTokenizer/ 获得。
摘要: Current speech large language models build upon discrete speech representations, which can be categorized into semantic tokens and acoustic tokens. However, existing speech tokens are not specifically designed for speech language modeling. To assess the suitability of speech tokens for building speech language models, we established the first benchmark, SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech tokenizer for speech large language models. SpeechTokenizer adopts the Encoder-Decoder architecture with residual vector quantization (RVQ). Unifying semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of speech information hierarchically across different RVQ layers. Furthermore, we construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer. Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks. Code and models are available at https://github.com/ZhangXInFD/SpeechTokenizer/.
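下面给出残差矢量量化(RVQ)的概念性示意代码:每一层码本只量化上一层留下的残差,因而不同层可以分层承载不同的信息(如摘要所述,语义与声学细节分布在不同的RVQ层)。这只是通用RVQ编码的演示,并非SpeechTokenizer的实现:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """残差矢量量化(RVQ)示意:逐层在残差上做最近邻查码。
    x: (d,) 向量;codebooks: 列表,每层为 (K, d) 码本。"""
    residual = x.astype(np.float64).copy()
    codes, quantized = [], np.zeros_like(residual)
    for cb in codebooks:
        idx = np.argmin(((residual[None, :] - cb) ** 2).sum(axis=1))  # 最近码字
        codes.append(int(idx))
        quantized += cb[idx]
        residual -= cb[idx]          # 下一层只需量化剩余的残差
    return codes, quantized

rng = np.random.default_rng(0)
x = rng.normal(size=8)
codebooks = [rng.normal(size=(16, 8)) for _ in range(4)]   # 4 层 RVQ
codes, x_hat = rvq_encode(x, codebooks)
print(codes, float(np.linalg.norm(x - x_hat)))
```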
标题: Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
作者: Mustafa Shukor, Alexandre Rame, Corentin Dancette
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2310.00647v2
Project: https://evalign-icl.github.io/|
GitHub: https://github.com/mshukor/EvALign-ICL|
中文摘要: 随着大型语言模型(LLMs)的成功,大型多模态模型(LMMs),如Flamingo模型及其后续竞争者,已经开始成为走向通才智能体的自然一步。然而,与最近的LMM的交互揭示了当前评估基准很难捕捉到的主要局限性。事实上,任务性能(例如,VQA准确率)本身并不能提供足够的线索来理解它们的真实能力、局限性,以及这些模型在多大程度上符合人类的期望。为了加深对这些缺陷的理解,我们偏离了当前的评估范式,并且(1)在幻觉、弃权、组合性、可解释性和指令遵循这5个不同的维度上,评估了10个参数规模从3B到80B的最新开源LMM。我们在这些维度上的评估揭示了LMMs的主要缺陷。虽然当前对齐这些模型的首选方案是基于训练的,如指令微调或RLHF,但我们转而(2)探索免训练的上下文学习(ICL)作为解决方案,并研究它如何影响这些局限性。基于我们的ICL研究,(3)我们进一步推进ICL,并提出新的多模态ICL变体,如Multitask-ICL、Chain-of-Hindsight-ICL和Self-Correcting-ICL。我们的发现如下。(1)尽管LMM取得了成功,但它们仍有仅靠扩展规模无法解决的缺陷。(2)ICL对LMMs缺陷的影响是微妙的:尽管ICL对提高可解释性和答案弃权很有效,但它只是稍微改善了指令遵循,并没有提高组合能力,实际上甚至放大了幻觉。(3)所提出的ICL变体作为高效解决其中一些缺陷的事后方法是有希望的。代码可在以下网址获得:https://github.com/mshukor/EvALign-ICL。
摘要: Following the success of Large Language Models (LLMs), Large Multimodal Models (LMMs), such as the Flamingo model and its subsequent competitors, have started to emerge as natural steps towards generalist agents. However, interacting with recent LMMs reveals major limitations that are hardly captured by the current evaluation benchmarks. Indeed, task performances (e.g., VQA accuracy) alone do not provide enough clues to understand their real capabilities, limitations, and to which extent such models are aligned to human expectations. To refine our understanding of those flaws, we deviate from the current evaluation paradigm, and (1) evaluate 10 recent open-source LMMs from 3B up to 80B parameter scale, on 5 different axes: hallucinations, abstention, compositionality, explainability and instruction following. Our evaluation on these axes reveals major flaws in LMMs. While the current go-to solution to align these models is based on training, such as instruction tuning or RLHF, we rather (2) explore the training-free in-context learning (ICL) as a solution, and study how it affects these limitations. Based on our ICL study, (3) we push ICL further and propose new multimodal ICL variants such as Multitask-ICL, Chain-of-Hindsight-ICL, and Self-Correcting-ICL. Our findings are as follows. (1) Despite their success, LMMs have flaws that remain unsolved with scaling alone. (2) The effect of ICL on LMMs flaws is nuanced; despite its effectiveness for improved explainability and answer abstention, ICL only slightly improves instruction following, does not improve compositional abilities, and actually even amplifies hallucinations. (3) The proposed ICL variants are promising as post-hoc approaches to efficiently tackle some of those flaws. The code is available here: https://github.com/mshukor/EvALign-ICL.
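下面用一段示意代码说明多模态上下文学习(ICL)提示的基本构造方式:把若干(图像、问题、答案)示范拼在查询之前,引导模型模仿示范的回答方式(例如学会弃权)。具体模板格式为假设,并非论文所用的确切提示:

```python
def build_icl_prompt(demos, query_question):
    """构造多模态 ICL 提示的最小示意:demos 为 (图像占位符, 问题, 答案) 三元组列表。"""
    parts = []
    for img, q, a in demos:
        parts.append(f"<image:{img}>\nQuestion: {q}\nAnswer: {a}")
    parts.append(f"<image:query.jpg>\nQuestion: {query_question}\nAnswer:")
    return "\n\n".join(parts)

demos = [
    ("d1.jpg", "Is there a dog in the image?", "I cannot answer from the image alone."),
    ("d2.jpg", "What color is the car?", "Red."),
]
print(build_icl_prompt(demos, "How many people are visible?"))
```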
标题: GenSim: Generating Robotic Simulation Tasks via Large Language Models
作者: Lirui Wang, Yiyang Ling, Zhecheng Yuan
PubTime: 2024-01-21
Downlink: http://arxiv.org/abs/2310.01361v2
Project: https://liruiw.github.io/gensim|https://huggingface.co/spaces/Gen-Sim/Gen-Sim|
GitHub: https://github.com/liruiw/GenSim|
中文摘要: 收集大量真实世界的交互数据来训练通用机器人策略通常非常昂贵,因此激发了对模拟数据的使用。然而,由于提出和验证新任务需要人工,现有的数据生成方法通常集中于场景级多样性(例如,对象实例和姿态)而不是任务级多样性。这使得在模拟数据上训练的策略很难展示出显著的任务级泛化。在本文中,我们提出利用大型语言模型(LLM)的接地(grounding)与编码能力来自动生成丰富的仿真环境和专家演示。我们的方法被称为GenSim,有两种模式:目标导向生成,即将目标任务交给LLM,由LLM提出任务课程来解决目标任务;探索式生成,即LLM从以前的任务出发,迭代地提出有助于解决更复杂任务的新任务。我们使用GPT4将现有基准扩展了10倍,达到100多个任务,并在这些任务上进行有监督微调,评估了包括微调后的GPT和Code Llama在内的多个LLM在机器人仿真任务代码生成上的表现。此外,我们观察到,当用于多任务策略训练时,LLM生成的仿真程序可以显著增强任务级泛化。我们进一步发现,在只需极少模拟到现实适配的情况下,在GPT4生成的仿真任务上预训练的多任务策略对现实世界中未见过的长时程任务表现出更强的迁移能力,并且比基线高出25%。有关代码、演示和视频,请访问项目网站(https://liruiw.github.io/gensim)。
摘要: Collecting large amounts of real-world interaction data to train general robotic policies is often prohibitively expensive, thus motivating the use of simulation data. However, existing methods for data generation have generally focused on scene-level diversity (e.g., object instances and poses) rather than task-level diversity, due to the human effort required to come up with and verify novel tasks. This has made it challenging for policies trained on simulation data to demonstrate significant task-level generalization. In this paper, we propose to automatically generate rich simulation environments and expert demonstrations by exploiting a large language model’s (LLM) grounding and coding ability. Our approach, dubbed GenSim, has two modes: goal-directed generation, wherein a target task is given to the LLM and the LLM proposes a task curriculum to solve the target task, and exploratory generation, wherein the LLM bootstraps from previous tasks and iteratively proposes novel tasks that would be helpful in solving more complex tasks. We use GPT4 to expand the existing benchmark by ten times to over 100 tasks, on which we conduct supervised finetuning and evaluate several LLMs including finetuned GPTs and Code Llama on code generation for robotic simulation tasks. Furthermore, we observe that LLMs-generated simulation programs can enhance task-level generalization significantly when used for multitask policy training. We further find that with minimal sim-to-real adaptation, the multitask policies pretrained on GPT4-generated simulation tasks exhibit stronger transfer to unseen long-horizon tasks in the real world and outperform baselines by 25%. See the project website (https://liruiw.github.io/gensim) for code, demos, and videos.
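下面是GenSim"探索式生成"流程的伪实现示意:LLM根据已有任务库提出新任务,通过验证后加入任务库。其中 llm 与 passes_validation 均为占位假设,仅用于说明这一迭代循环的结构:

```python
def exploratory_generation(llm, seed_tasks, rounds=3):
    """探索式生成流程示意:llm 为假设的可调用对象,接收提示并返回新任务的代码字符串。"""
    task_library = list(seed_tasks)
    for _ in range(rounds):
        prompt = ("You are writing new robotic simulation tasks.\n"
                  "Existing tasks:\n" + "\n".join(task_library) +
                  "\nPropose one novel task as python code.")
        candidate = llm(prompt)
        if passes_validation(candidate):      # 真实流程会在仿真器中编译并跑通专家演示
            task_library.append(candidate)
    return task_library

def passes_validation(code: str) -> bool:    # 占位验证函数
    return bool(code.strip())

print(len(exploratory_generation(lambda p: "place-red-block-in-bowl", ["stack-blocks"])))
```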
== VLM ==
标题: AnimateLCM: Accelerating the Animation of Personalized Diffusion Models and Adapters with Decoupled Consistency Learning
作者: Fu-Yun Wang, Zhaoyang Huang, Xiaoyu Shi
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00769v1
Project: https://animatelcm.github.io/|
GitHub: https://github.com/G-U-N/AnimateLCM|
中文摘要: 视频扩散模型因其能够产生连贯且高保真的视频而受到越来越多的关注。然而,迭代去噪过程使其计算密集且耗时,从而限制了其应用。一致性模型(CM)通过蒸馏预训练的图像扩散模型,以极少的步数加速采样;受CM及其在条件图像生成上的成功扩展潜在一致性模型(LCM)的启发,我们提出了AnimateLCM,允许在极少步数内生成高保真视频。我们没有直接在原始视频数据集上进行一致性学习,而是提出了一种解耦一致性学习策略,将图像生成先验和运动生成先验的蒸馏解耦,这提高了训练效率并增强了生成的视觉质量。此外,为了能够组合Stable Diffusion社区中的即插即用适配器以实现各种功能(例如,用于可控生成的ControlNet),我们提出了一种有效的策略,使现有的适配器适配到我们蒸馏得到的文本条件视频一致性模型,或者在不损害采样速度的情况下从头训练适配器。我们在图像条件视频生成和布局条件视频生成中验证了所提出的策略,均取得了最佳性能的结果。实验结果验证了该方法的有效性。代码和权重将被公开。更多详情请访问 https://github.com/G-U-N/AnimateLCM。
摘要: Video diffusion models have been gaining increasing attention for their ability to produce videos that are both coherent and of high fidelity. However, the iterative denoising process makes them computationally intensive and time-consuming, thus limiting their applications. Inspired by the Consistency Model (CM) that distills pretrained image diffusion models to accelerate the sampling with minimal steps and its successful extension, the Latent Consistency Model (LCM), on conditional image generation, we propose AnimateLCM, allowing for high-fidelity video generation within minimal steps. Instead of directly conducting consistency learning on the raw video dataset, we propose a decoupled consistency learning strategy that decouples the distillation of image generation priors and motion generation priors, which improves the training efficiency and enhances the visual quality of generation. Additionally, to enable the combination of plug-and-play adapters in the Stable Diffusion community to achieve various functions (e.g., ControlNet for controllable generation), we propose an efficient strategy to adapt existing adapters to our distilled text-conditioned video consistency model or train adapters from scratch without harming the sampling speed. We validate the proposed strategy in image-conditioned video generation and layout-conditioned video generation, all achieving top-performing results. Experimental results validate the effectiveness of our proposed method. Code and weights will be made public. More details are available at https://github.com/G-U-N/AnimateLCM.
标题: CapHuman: Capture Your Moments in Parallel Universes
作者: Chao Liang, Fan Ma, Linchao Zhu
PubTime: 2024-02-01
Downlink: http://arxiv.org/abs/2402.00627v1
Project: https://caphuman.github.io/|
GitHub: https://github.com/VamosC/CapHuman|
中文摘要: 我们专注于一项新颖的以人为中心的图像合成任务,即,仅给定一张参考面部照片,生成该特定个体在不同上下文中具有不同头部位置、姿势和面部表情的图像。为了实现这一目标,我们认为我们的生成模型应该具有以下有利的特征:(1)对我们的世界和人类社会有很强的视觉和语义理解,用于基本对象和人类图像的生成。(2)可泛化的身份保持能力。(3)灵活细粒度的头部控制。最近,大型预训练文本到图像扩散模型显示出显著的效果,可作为强大的生成基础。以此为基础,我们旨在释放预训练模型的上述能力。在这项工作中,我们提出了一个名为CapHuman的新框架。我们采用"编码然后学习对齐"范式,这种范式无需在推理时进行繁琐的调整即可为新个体保持可泛化的身份。CapHuman对身份特征进行编码,然后学习将它们对齐到潜在空间中。此外,我们引入了3D面部先验,以灵活且3D一致的方式为我们的模型配备对人类头部的控制。广泛的定性和定量分析表明,我们的CapHuman可以生成身份保持良好、照片级逼真和高保真的肖像,具有内容丰富的表示和各种头部姿态,优于既有基线。代码和检查点将在 https://github.com/VamosC/CapHuman 发布。
摘要: We concentrate on a novel human-centric image synthesis task, that is, given only one reference facial photograph, it is expected to generate specific individual images with diverse head positions, poses, and facial expressions in different contexts. To accomplish this goal, we argue that our generative model should be capable of the following favorable characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation. (2) generalizable identity preservation ability. (3) flexible and fine-grained head control. Recently, large pre-trained text-to-image diffusion models have shown remarkable results, serving as a powerful generative foundation. As a basis, we aim to unleash the above two capabilities of the pre-trained model. In this work, we present a new framework named CapHuman. We embrace the "encode then learn to align" paradigm, which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover, we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate our CapHuman can produce well-identity-preserved, photo-realistic, and high-fidelity portraits with content-rich representations and various head renditions, superior to established baselines. Code and checkpoint will be released at https://github.com/VamosC/CapHuman.
标题: Synchformer: Efficient Synchronization from Sparse Cues
作者: Vladimir Iashin, Weidi Xie, Esa Rahtu
PubTime: 2024-01-29
Downlink: http://arxiv.org/abs/2401.16423v1
Project: https://www.robots.ox.ac.uk/|
GitHub: https://github.com/v-iashin/Synchformer|
中文摘要: 我们的目标是视听同步,重点是“野外”视频,如YouTube上的视频,其中同步提示可能很少。我们的贡献包括一个新的视听同步模型,以及通过多模态段级对比预训练将特征提取与同步建模解耦的训练。这种方法在密集和稀疏设置中都实现了最先进的性能。我们还将同步模型训练扩展到AudioSet百万规模的“野外”数据集,研究可解释性的证据归因技术,并探索同步模型的新功能:视听同步性。
摘要: Our objective is audio-visual synchronization with a focus on ‘in-the-wild’ videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale ‘in-the-wild’ dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
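论文通过多模态片段级对比预训练把特征提取与同步建模解耦。下面给出这类片段级对比目标(InfoNCE)的通用写法作为示意,同一时间片段的音频/视频特征互为正样本;这只是常见写法,并非Synchformer源码:

```python
import torch
import torch.nn.functional as F

def segment_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """片段级音视频对比损失(InfoNCE)示意。audio_emb, video_emb: (B, D)。"""
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                 # (B, B) 相似度矩阵
    targets = torch.arange(a.size(0))                # 对角线为匹配的片段
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

loss = segment_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(float(loss))
```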
标题: Pixel-Wise Recognition for Holistic Surgical Scene Understanding
作者: Nicolás Ayobi, Santiago Rodríguez, Alejandra Pérez
PubTime: 2024-01-26
Downlink: http://arxiv.org/abs/2401.11174v2
Project: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_42|https://ieeexplore.ieee.org/document/10230819|
GitHub: https://github.com/BCV-Uniandes/GraSP|
中文摘要: 本文介绍了前列腺切除术的整体和多粒度手术场景理解(GraSP)数据集,这是一个精心策划的基准,将手术场景理解建模为具有不同粒度级别的互补任务的层次结构。我们的方法能够对手术活动进行多层次的理解,包括长期任务,如手术阶段和步骤识别,以及短期任务,包括手术器械分割和原子视觉动作检测。为了利用我们提出的基准,我们引入了用于动作、阶段、步骤和器械分割的Transformer(TAPIS)模型,这是一种通用架构,它将全局视频特征提取器与来自器械分割模型的局部区域建议相结合,以应对我们基准的多粒度问题。通过大量的实验,我们展示了在短期识别任务中包含分割注释的影响,强调了每个任务不同的粒度要求,并确立了TAPIS优于先前提出的基线和传统的基于CNN的模型。此外,我们通过多个公共基准验证了我们方法的稳健性,确认了我们数据集的可靠性和适用性。这项工作代表了内窥镜视觉向前迈出的重要一步,为未来对外科手术的整体理解研究提供了一个新颖而全面的框架。
摘要: This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. Our approach enables a multi-level comprehension of surgical activities, encompassing long-term tasks such as surgical phases and steps recognition and short-term tasks including surgical instrument segmentation and atomic visual actions detection. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to tackle the multi-granularity of our benchmark. Through extensive experimentation, we demonstrate the impact of including segmentation annotations in short-term recognition tasks, highlight the varying granularity requirements of each task, and establish TAPIS’s superiority over previously proposed baselines and conventional CNN-based models. Additionally, we validate the robustness of our method across multiple public benchmarks, confirming the reliability and applicability of our dataset. This work represents a significant step forward in Endoscopic Vision, offering a novel and comprehensive framework for future research towards a holistic understanding of surgical procedures.
标题: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data
作者: Lihe Yang, Bingyi Kang, Zilong Huang
PubTime: 2024-01-19
Downlink: http://arxiv.org/abs/2401.10891v1
Project: https://depth-anything.github.io|
GitHub: https://github.com/LiheYoung/Depth-Anything|
中文摘要: 这项工作提出了Depth Anything,一个非常实用的鲁棒单目深度估计解决方案。我们不追求新的技术模块,而是旨在建立一个简单而强大的基础模型,处理任何情况下的任何图像。为此,我们通过设计一个数据引擎来收集并自动标注大规模无标签数据(约6200万张),从而扩大数据覆盖范围,进而能够减少泛化误差。我们研究了两种简单而有效的策略,它们使数据扩展变得有希望。首先,通过利用数据增强工具创建更具挑战性的优化目标,迫使模型主动寻求额外的视觉知识并获得鲁棒的表示。其次,开发了一个辅助监督,促使模型从预训练的编码器继承丰富的语义先验。我们广泛评估了它的零样本能力,包括六个公共数据集和随机拍摄的照片,展示了令人印象深刻的泛化能力。此外,通过使用来自NYUv2和KITTI的度量深度信息对其进行微调,可以创造新的SOTA。我们更好的深度模型也带来了更好的深度条件ControlNet。我们的模型在 https://github.com/LiheYoung/Depth-Anything 发布。
摘要: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.
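在混合数据或伪标签深度上训练时,常用尺度-平移不变的损失来消除不同数据源之间的深度尺度差异。下面给出这一通用做法的最小示意(具体损失形式以论文为准,变量与数值均为假设):

```python
import torch

def affine_invariant_depth_loss(pred, target):
    """尺度-平移不变深度损失示意:先用最小二乘把预测对齐到目标的尺度与偏移,再取 L1。
    pred/target: (N,)。"""
    x = torch.stack([pred, torch.ones_like(pred)], dim=1)      # (N, 2)
    sol = torch.linalg.lstsq(x, target.unsqueeze(1)).solution  # 求解 s*pred + t ≈ target
    aligned = pred * sol[0, 0] + sol[1, 0]
    return (aligned - target).abs().mean()

pred = torch.rand(1000) * 5
target = pred * 2.0 + 0.3 + 0.01 * torch.randn(1000)
print(float(affine_invariant_depth_loss(pred, target)))       # 应接近噪声水平
```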
标题: Crossway Diffusion: Improving Diffusion-based Visuomotor Policy via Self-supervised Learning
作者: Xiang Li, Varun Belagali, Jinghuan Shang
PubTime: 2024-01-11
Downlink: http://arxiv.org/abs/2307.01849v3
Project: https://youtu.be/9deKHueZBuk|
GitHub: https://github.com/LostXine/crossway_diffusion|
中文摘要: 序列建模方法在机器人模仿学习中显示出有希望的结果。最近,扩散模型已经以序列建模的方式被用于行为克隆,受益于它们在建模复杂数据分布方面的卓越能力。标准的基于扩散的策略根据输入状态从随机噪声中迭代地生成动作序列。尽管如此,扩散策略的模型在视觉表示方面仍可进一步改进。在这项工作中,我们提出了Crossway Diffusion,这是一种简单而有效的方法,通过精心设计的状态解码器和辅助的自监督学习(SSL)目标来增强基于扩散的视觉运动策略学习。状态解码器从反向扩散过程的中间表示重建原始图像像素和其他状态信息。整个模型由SSL目标和原始扩散损失联合优化。我们的实验证明了Crossway Diffusion在各种模拟和真实世界机器人任务中的有效性,证实了它相对于标准的基于扩散的策略的一致优势以及对基线的实质性改进。
摘要: Sequence modeling approaches have shown promising results in robot imitation learning. Recently, diffusion models have been adopted for behavioral cloning in a sequence modeling fashion, benefiting from their exceptional capabilities in modeling complex data distributions. The standard diffusion-based policy iteratively generates action sequences from random noise conditioned on the input states. Nonetheless, the model for diffusion policy can be further improved in terms of visual representations. In this work, we propose Crossway Diffusion, a simple yet effective method to enhance diffusion-based visuomotor policy learning via a carefully designed state decoder and an auxiliary self-supervised learning (SSL) objective. The state decoder reconstructs raw image pixels and other state information from the intermediate representations of the reverse diffusion process. The whole model is jointly optimized by the SSL objective and the original diffusion loss. Our experiments demonstrate the effectiveness of Crossway Diffusion in various simulated and real-world robot tasks, confirming its consistent advantages over the standard diffusion-based policy and substantial improvements over the baselines.
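下面用一个玩具模型示意"状态解码器 + 辅助SSL目标与扩散损失联合优化"的训练方式:同一中间表示既用于预测加在动作上的噪声,也用于重建原始状态。网络结构与超参数均为随意假设,仅展示损失组合的思路:

```python
import torch
import torch.nn as nn

class TinyPolicy(nn.Module):
    """示意模型:中间表示同时送给"动作去噪头"和"状态解码器"。"""
    def __init__(self, obs_dim=32, act_dim=4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, 64)
        self.denoise_head = nn.Linear(64 + act_dim, act_dim)   # 预测加在动作上的噪声
        self.state_decoder = nn.Linear(64, obs_dim)            # 辅助 SSL:重建原始状态

    def forward(self, obs, noisy_action):
        h = torch.relu(self.encoder(obs))
        return self.denoise_head(torch.cat([h, noisy_action], -1)), self.state_decoder(h)

model = TinyPolicy()
obs, action = torch.randn(16, 32), torch.randn(16, 4)
noise = torch.randn_like(action)
pred_noise, recon = model(obs, action + noise)
loss = nn.functional.mse_loss(pred_noise, noise) + 0.5 * nn.functional.mse_loss(recon, obs)
loss.backward()   # 扩散损失与 SSL 重建损失联合优化
print(float(loss))
```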
== diffusion model ==
标题: Repositioning the Subject within Image
作者: Yikai Wang, Chenjie Cao, Qiaole Dong
PubTime: 2024-01-30
Downlink: http://arxiv.org/abs/2401.16861v1
Project: https://yikai-wang.github.io/seele/|
GitHub: https://github.com/Yikai-Wang/ReS|
中文摘要: 当前的图像操纵主要集中在静态操纵上,例如替换图像中的特定区域或改变其整体风格。在本文中,我们介绍了一个创新的动态操纵任务:主体重定位。这项任务是将用户指定的主体移动到期望的位置,同时保持图像的保真度。我们的研究表明,主体重定位的基本子任务,包括填充主体移走后留下的空白、重建主体被遮挡的部分,以及将主体与周围区域融合,可以有效地重新表述为一个统一的、提示引导的修复(inpainting)任务。因此,我们可以使用单个扩散生成模型,借助通过我们提出的任务反转技术学习到的各种任务提示来处理这些子任务。此外,我们集成了预处理和后处理技术,以进一步提高主体重定位的质量。这些元素共同构成了我们的分割-生成-混合(SEgment-gEnerate-and-bLEnd,SEELE)框架。为了评估SEELE在主体重定位方面的有效性,我们构建了一个名为ReS的真实世界主体重定位数据集。我们在ReS上的结果证明了重定位图像生成的质量。
摘要: Current image manipulation primarily centers on static manipulation, such as replacing specific regions within an image or altering its overall style. In this paper, we introduce an innovative dynamic manipulation task, subject repositioning. This task involves relocating a user-specified subject to a desired position while preserving the image’s fidelity. Our research reveals that the fundamental sub-tasks of subject repositioning, which include filling the void left by the repositioned subject, reconstructing obscured portions of the subject and blending the subject to be consistent with surrounding areas, can be effectively reformulated as a unified, prompt-guided inpainting task. Consequently, we can employ a single diffusion generative model to address these sub-tasks using various task prompts learned through our proposed task inversion technique. Additionally, we integrate pre-processing and post-processing techniques to further enhance the quality of subject repositioning. These elements together form our SEgment-gEnerate-and-bLEnd (SEELE) framework. To assess SEELE’s effectiveness in subject repositioning, we assemble a real-world subject repositioning dataset called ReS. Our results on ReS demonstrate the quality of repositioned image generation.
标题: VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
作者: Haoxin Chen, Yong Zhang, Xiaodong Cun
PubTime: 2024-01-17
Downlink: http://arxiv.org/abs/2401.09047v1
Project: https://ailab-cvc.github.io/videocrafter|
GitHub: https://github.com/AILab-CVC/VideoCrafter|
中文摘要: 文本到视频生成旨在根据给定的提示生成视频。最近,一些商业视频模型已经能够生成噪声极小、细节出色且美学分数很高的逼真可信的视频。然而,这些模型依赖于大规模、过滤良好、高质量的视频,社区无法访问这些视频。许多现有的研究工作使用低质量的WebVid-10M数据集训练模型,由于模型被优化以拟合WebVid-10M,很难生成高质量的视频。在这项工作中,我们探索了从Stable Diffusion扩展而来的视频模型的训练方案,并研究了利用低质量视频和合成的高质量图像来获得高质量视频模型的可行性。我们首先分析了视频模型的空间和时间模块之间的联系以及向低质量视频的分布偏移。我们观察到,与仅训练时间模块相比,对所有模块进行完全训练会导致空间和时间模块之间更强的耦合。基于这种更强的耦合,我们通过用高质量图像微调空间模块,在不损失运动质量的情况下将分布转移到更高的质量,从而得到一个通用的高质量视频模型。评估结果证明了所提出方法的优越性,特别是在画面质量、运动和概念组合方面。
摘要: Text-to-video generation aims to produce a video based on a given prompt. Recently, several commercial video models have been able to generate plausible videos with minimal noise, excellent details, and high aesthetic scores. However, these models rely on large-scale, well-filtered, high-quality videos that are not accessible to the community. Many existing research works, which train models using the low-quality WebVid-10M dataset, struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling, we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images, resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method, particularly in picture quality, motion, and concept composition.
标题: InstantID: Zero-shot Identity-Preserving Generation in Seconds
作者: Qixun Wang, Xu Bai, Haofan Wang
PubTime: 2024-01-15
Downlink: http://arxiv.org/abs/2401.07519v1
Project: https://instantid.github.io/|
GitHub: https://github.com/InstantID/InstantID|
中文摘要: 使用Textual Inversion、DreamBooth和LoRA等方法进行个性化图像合成已经取得了重大进展。然而,它们在现实世界中的适用性受到高存储需求、冗长的微调过程以及对多个参考图像的需求的阻碍。相反,现有的基于ID嵌入的方法虽然只需要单次前向推理,但面临着挑战:它们要么需要对众多模型参数进行大量微调,要么缺乏与社区预训练模型的兼容性,要么无法保持高的人脸保真度。针对这些限制,我们引入了InstantID,这是一个强大的基于扩散模型的解决方案。我们的即插即用模块仅使用一张面部图像就能熟练地处理各种风格的图像个性化,同时确保高保真度。为了实现这一点,我们设计了一个新颖的IdentityNet,通过施加强语义和弱空间条件,将面部图像、面部关键点图像与文本提示相结合来引导图像生成。InstantID展示了卓越的性能和效率,在身份保持至关重要的实际应用中非常有益。此外,我们的工作可以与SD1.5和SDXL等流行的预训练文本到图像扩散模型无缝集成,作为一个适应性强的插件。我们的代码和预训练检查点将在 https://github.com/InstantID/InstantID 上提供。
摘要: There has been significant progress in personalized image synthesis with methods such as Textual Inversion, DreamBooth, and LoRA. Yet, their real-world applicability is hindered by high storage demands, lengthy fine-tuning processes, and the need for multiple reference images. Conversely, existing ID embedding-based methods, while requiring only a single forward inference, face challenges: they either necessitate extensive fine-tuning across numerous model parameters, lack compatibility with community pre-trained models, or fail to maintain high face fidelity. Addressing these limitations, we introduce InstantID, a powerful diffusion model-based solution. Our plug-and-play module adeptly handles image personalization in various styles using just a single facial image, while ensuring high fidelity. To achieve this, we design a novel IdentityNet by imposing strong semantic and weak spatial conditions, integrating facial and landmark images with textual prompts to steer the image generation. InstantID demonstrates exceptional performance and efficiency, proving highly beneficial in real-world applications where identity preservation is paramount. Moreover, our work seamlessly integrates with popular pre-trained text-to-image diffusion models like SD1.5 and SDXL, serving as an adaptable plugin. Our codes and pre-trained checkpoints will be available at https://github.com/InstantID/InstantID.
标题: Training Diffusion Models with Reinforcement Learning
作者: Kevin Black, Michael Janner, Yilun Du
PubTime: 2024-01-04
Downlink: http://arxiv.org/abs/2305.13301v4
Project: http://rl-diffusion.github.io|
中文摘要: 扩散模型是一类灵活的生成模型,其训练目标是对数似然的一种近似。然而,扩散模型的大多数用例并不关注似然,而是关注下游目标,如人类感知的图像质量或药物有效性。在本文中,我们研究了用于直接针对此类目标优化扩散模型的强化学习方法。我们描述了将去噪过程表述为一个多步决策问题如何使一类策略梯度算法成为可能,我们称之为去噪扩散策略优化(DDPO),它比其他奖励加权似然方法更有效。在实验上,DDPO能够使文本到图像的扩散模型适应难以通过提示表达的目标,如图像可压缩性,以及源自人类反馈的目标,例如美学质量。最后,我们展示了DDPO可以使用来自视觉语言模型的反馈来改进提示-图像对齐,而不需要额外的数据收集或人工注释。该项目的网站位于 http://rl-diffusion.github.io。
摘要: Diffusion models are a class of flexible generative models trained with an approximation to the log-likelihood objective. However, most use cases of diffusion models are not concerned with likelihoods, but instead with downstream objectives such as human-perceived image quality or drug effectiveness. In this paper, we investigate reinforcement learning methods for directly optimizing diffusion models for such objectives. We describe how posing denoising as a multi-step decision-making problem enables a class of policy gradient algorithms, which we refer to as denoising diffusion policy optimization (DDPO), that are more effective than alternative reward-weighted likelihood approaches. Empirically, DDPO is able to adapt text-to-image diffusion models to objectives that are difficult to express via prompting, such as image compressibility, and those derived from human feedback, such as aesthetic quality. Finally, we show that DDPO can improve prompt-image alignment using feedback from a vision-language model without the need for additional data collection or human annotation. The project’s website can be found at http://rl-diffusion.github.io.
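下面用一个一维玩具问题示意DDPO背后的优化形式:把采样(去噪)看作可获得回报的决策过程,用REINFORCE风格的策略梯度最大化下游回报。这里的高斯策略与回报函数均为假设,仅说明梯度来自 reward 乘以对数概率这一结构,并非真实扩散模型的训练代码:

```python
import torch

mu = torch.zeros(1, requires_grad=True)          # 被优化的"策略"参数
opt = torch.optim.SGD([mu], lr=0.05)

def reward(x):                                   # 下游目标(如压缩率、美学分)在此为占位
    return -(x - 3.0) ** 2

for step in range(300):
    with torch.no_grad():
        x = mu + torch.randn(64, 1)              # 采样,相当于跑一批"去噪轨迹"
        adv = reward(x)
        adv = adv - adv.mean()                   # 减去基线以降低方差
    logp = -0.5 * (x - mu) ** 2                  # 高斯策略对数概率(略去常数项)
    loss = -(adv * logp).mean()                  # REINFORCE:最大化 E[reward * logp]
    opt.zero_grad(); loss.backward(); opt.step()

print(float(mu))                                 # 应收敛到 3 附近
```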
标题: VASE: Object-Centric Appearance and Shape Manipulation of Real Videos
作者: Elia Peruzzo, Vidit Goel, Dejia Xu
PubTime: 2024-01-04
Downlink: http://arxiv.org/abs/2401.02473v1
Project: https://helia95.github.io/vase-website/|
摘要: Recently, several works tackled the video editing task fostered by the success of large-scale text-to-image generative models. However, most of these methods holistically edit the frame using the text, exploiting the prior given by foundation diffusion models and focusing on improving the temporal consistency across frames. In this work, we introduce a framework that is object-centric and is designed to control both the object’s appearance and, notably, to execute precise and explicit structural modifications on the object. We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control. We evaluate our method on the image-driven video editing task showing similar performance to the state-of-the-art, and showcasing novel shape-editing capabilities. Further details, code and examples are available on our project page: https://helia95.github.io/vase-website/
标题: UpFusion: Novel View Diffusion from Unposed Sparse View Observations
作者: Bharath Raj Nagoor Kani, Hsin-Ying Lee, Sergey Tulyakov
PubTime: 2024-01-04
Downlink: http://arxiv.org/abs/2312.06661v2
Project: https://upfusion3d.github.io/|
中文摘要: 我们提出了UpFusion,这是一个在没有相应位姿信息的情况下,给定稀疏参考图像集即可执行新视图合成并推断物体3D表示的系统。当前的稀疏视图3D推理方法通常依赖于相机位姿来几何地聚合来自输入视图的信息,但当这种信息不可用或不准确时,这种方法在真实场景中并不鲁棒。相反,UpFusion通过学习在条件生成模型中隐式地利用可用图像作为上下文来合成新视图,从而避开了这一要求。我们将两种互补的条件形式结合到扩散模型中,以利用输入视图:a)通过使用场景级Transformer推断与查询视图对齐的特征;b)通过可以直接观察输入图像标记的中间注意力层。我们表明,这种机制允许生成高保真度的新视图,同时在给定额外(无位姿的)图像时提高合成质量。我们在Co3Dv2和Google Scanned Objects数据集上评估了我们的方法,并展示了与依赖位姿的稀疏视图方法以及无法利用额外视图的单视图方法相比,我们方法的优势。最后,我们还表明,我们学习的模型可以泛化到训练类别之外,甚至可以从野外通用物体的自拍摄图像中进行重建。
摘要: We propose UpFusion, a system that can perform novel view synthesis and infer 3D representations for an object given a sparse set of reference images without corresponding pose information. Current sparse-view 3D inference methods typically rely on camera poses to geometrically aggregate information from input views, but are not robust in-the-wild when such information is unavailable/inaccurate. In contrast, UpFusion sidesteps this requirement by learning to implicitly leverage the available images as context in a conditional generative model for synthesizing novel views. We incorporate two complementary forms of conditioning into diffusion models for leveraging the input views: a) via inferring query-view aligned features using a scene-level transformer, b) via intermediate attentional layers that can directly observe the input image tokens. We show that this mechanism allows generating high-fidelity novel views while improving the synthesis quality given additional (unposed) images. We evaluate our approach on the Co3Dv2 and Google Scanned Objects datasets and demonstrate the benefits of our method over pose-reliant sparse-view methods as well as single-view methods that cannot leverage additional views. Finally, we also show that our learned model can generalize beyond the training categories and even allow reconstruction from self-captured images of generic objects in-the-wild.
== VLN ==
标题: SubPipe: A Submarine Pipeline Inspection Dataset for Segmentation and Visual-inertial Localization
作者: Olaya Álvarez-Tuñón, Luiza Ribeiro Marnet, László Antal
PubTime: 2024-01-31
Downlink: http://arxiv.org/abs/2401.17907v1
GitHub: https://github.com/remaro-network/SubPipe-dataset|
中文摘要: 本文介绍了SubPipe,这是一个用于SLAM、目标检测和图像分割的水下数据集。SubPipe使用由OceanScan MST运营的LAUV进行记录,该航行器携带了一套传感器,包括两个摄像头、一个侧扫声纳和一个惯性导航系统以及其他传感器。该AUV已被部署在管道检查环境中,海底管道部分被沙子覆盖。AUV的位姿真值由导航传感器估计得到。侧扫声纳和RGB图像分别包含目标检测和分割标注。最先进的分割、目标检测和SLAM方法在SubPipe上进行了基准测试,以展示该数据集在应用计算机视觉算法方面的挑战和机遇。据作者所知,这是第一个提供真实管道检查场景的带标注水下数据集。数据集和实验可在 https://github.com/remaro-network/SubPipe-dataset 公开获取。
摘要: This paper presents SubPipe, an underwater dataset for SLAM, object detection, and image segmentation. SubPipe has been recorded using a LAUV, operated by OceanScan MST, and carrying a sensor suite including two cameras, a side-scan sonar, and an inertial navigation system, among other sensors. The AUV has been deployed in a pipeline inspection environment with a submarine pipe partially covered by sand. The AUV’s pose ground truth is estimated from the navigation sensors. The side-scan sonar and RGB images include object detection and segmentation annotations, respectively. State-of-the-art segmentation, object detection, and SLAM methods are benchmarked on SubPipe to demonstrate the dataset’s challenges and opportunities for leveraging computer vision algorithms. To the authors’ knowledge, this is the first annotated underwater dataset providing a real pipeline inspection scenario. The dataset and experiments are publicly available online at https://github.com/remaro-network/SubPipe-dataset
标题: ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
作者: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang
PubTime: 2024-01-24
Downlink: http://arxiv.org/abs/2401.13311v1
Project: https://con-textual.github.io/|
中文摘要: 人工智能的最新进展导致了大型多模态模型(LMM)的发展,这些模型能够处理复杂的任务,包括对图像中的文本和视觉内容进行联合推理(例如,在公共场所导航地图)。本文介绍了ConTextual,这是一个新颖的基准,包含明确设计的指令,用于评估LMM执行上下文敏感的富文本视觉推理的能力。ConTextual强调多样的真实世界场景(例如,读时间、导航、购物等),要求更深入地理解文本和视觉元素之间的交互。我们的发现揭示,在人工评估下,表现最好的LMM(GPT-4V(ision))与人类能力之间存在30.8%的显著性能差距,表明在上下文敏感的富文本视觉推理方面仍有很大的改进空间。值得注意的是,虽然GPT-4V在表情包和引语解释等抽象类别中表现出色,但其整体表现仍落后于人类。除了人工评估,我们还采用了使用GPT-4的自动评估指标,揭示了类似的性能差距趋势。我们还在不同的视觉环境中进行细粒度的评估,并提供定性分析,为LMM设计的未来发展提供了一个坚实的框架。https://con-textual.github.io/
摘要: Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs’ ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis which provides a robust framework for future advancements in the LMM design. https://con-textual.github.io/
标题: SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization
作者: Mingyang Li, Yue Ma, Qinru Qiu
PubTime: 2024-01-23
Downlink: http://arxiv.org/abs/2401.13076v1
GitHub: https://github.com/Leomingyangli/SemanticSLAM|
中文摘要: 视觉同步定位与建图(VSLAM)中的现有技术通过比较连续场景的图像特征来估计相机位移。这些算法依赖于场景的连续性,因此需要频繁的相机输入。然而,频繁处理图像会导致大量的内存使用和计算开销。在这项研究中,我们介绍了SemanticSLAM,这是一个端到端的视觉惯性里程计系统,它利用了从RGB-D传感器提取的语义特征。这种方法能够创建环境的语义地图,并确保可靠的相机定位。SemanticSLAM是场景无关的,这意味着它不需要针对不同的环境进行重新训练。它在室内环境中即使相机输入不频繁、且没有先验知识的情况下也能有效工作。SemanticSLAM的优势在于它能够逐步细化语义地图并改进位姿估计。这是通过一个卷积长短期记忆(ConvLSTM)网络实现的,该网络经过训练可以在地图构建过程中纠正错误。与现有的VSLAM算法相比,SemanticSLAM将位姿估计提高了17%。由此产生的语义地图提供了关于环境的可解释信息,并且可以容易地应用于各种下游任务,例如路径规划、避障和机器人导航。代码将在 https://github.com/Leomingyangli/SemanticSLAM 公开提供。
摘要: Current techniques in Visual Simultaneous Localization and Mapping (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity, hence requires frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it doesn’t require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long-short-term-memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at https://github.com/Leomingyangli/SemanticSLAM
标题: ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments
作者: Dong An, Hanqing Wang, Wenguan Wang
PubTime: 2024-01-22
Downlink: http://arxiv.org/abs/2304.03047v3
GitHub: https://github.com/MarSaKi/ETPNav|
中文摘要: 视觉语言导航是一项需要智能体按照指令在环境中导航的任务。它在具身人工智能领域变得越来越重要,在自主导航、搜索救援以及人机交互方面具有潜在的应用。在本文中,我们提出研究一个更实际但更具挑战性的对应设定:连续环境中的视觉语言导航(VLN-CE)。为了开发一个鲁棒的VLN-CE智能体,我们提出了一个新的导航框架ETPNav,它专注于两个关键能力:1)抽象环境并生成远程导航计划的能力;2)在连续环境中避障控制的能力。ETPNav通过沿着已走过的路径自组织预测的航路点,对环境进行在线拓扑建图,而无需先验的环境经验。这使智能体能够将导航过程分解为高层规划和低层控制。同时,ETPNav利用基于Transformer的跨模态规划器,根据拓扑地图和指令生成导航计划。然后,该计划由避障控制器执行,该控制器利用试错启发式方法防止导航陷入障碍物。实验结果证明了该方法的有效性。ETPNav在R2R-CE和RxR-CE数据集上分别比此前的最新方法提升超过10%和20%。我们的代码可在 https://github.com/MarSaKi/ETPNav 获得。
摘要: Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments. It becomes increasingly crucial in the field of embodied AI, with potential applications in autonomous navigation, search and rescue, and human-robot interaction. In this paper, we propose to address a more practical yet challenging counterpart setting - vision-language navigation in continuous environments (VLN-CE). To develop a robust VLN-CE agent, we propose a new navigation framework, ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments. ETPNav performs online topological mapping of environments by self-organizing predicted waypoints along a traversed path, without prior environmental experience. It privileges the agent to break down the navigation procedure into high-level planning and low-level control. Concurrently, ETPNav utilizes a transformer-based cross-modal planner to generate navigation plans based on topological maps and instructions. The plan is then performed through an obstacle-avoiding controller that leverages a trial-and-error heuristic to prevent navigation from getting stuck in obstacles. Experimental results demonstrate the effectiveness of the proposed method. ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets, respectively. Our code is available at https://github.com/MarSaKi/ETPNav.
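下面用networkx给出"拓扑地图 + 高层规划"的最小示意:把预测的航路点组织成加权图,高层用最短路得到航路点序列,低层控制器再负责避障执行。节点坐标与连边均为假设数据,仅演示ETPNav所采用的分层思路,并非其实现:

```python
import networkx as nx

G = nx.Graph()
waypoints = {0: (0, 0), 1: (1.0, 0.2), 2: (2.1, 0.1), 3: (1.2, 1.5), 4: (2.4, 1.8)}
G.add_nodes_from(waypoints)
for u, v in [(0, 1), (1, 2), (1, 3), (3, 4), (2, 4)]:
    (x1, y1), (x2, y2) = waypoints[u], waypoints[v]
    G.add_edge(u, v, weight=((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5)  # 边权为欧氏距离

plan = nx.shortest_path(G, source=0, target=4, weight="weight")
print("high-level waypoint plan:", plan)   # 低层控制器依次朝这些航路点移动并避障
```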
标题: Multimotion Visual Odometry (MVO)
作者: Kevin M. Judd, Jonathan D. Gammell
PubTime: 2024-01-15
Downlink: http://arxiv.org/abs/2110.15169v3
Project: https://www.youtube.com/watch?v=mNj3s1nf-6A|https://www.youtube.com/playlist?list=PLbaQBz4TuPcxMIXKh5Q80s0N9ISezFcpi|
中文摘要: 视觉运动估计是自主导航中一个研究得很充分的挑战。最近的工作集中于解决高度动态环境中的多运动估计问题。这些环境不仅包含多个复杂的运动,而且往往表现出显著的遮挡。同时估计第三方运动和传感器自运动是困难的,因为物体的观测运动由其真实运动和传感器运动共同构成。先前多运动估计的大多数工作通过依赖基于外观的目标检测或特定应用的运动约束来简化这个问题。这些方法在特定的应用和环境中是有效的,但不能很好地推广到完整的多运动估计问题(MEP)。本文介绍了多运动视觉里程计(MVO),这是一种多运动估计管线,它在不依赖基于外观的信息的情况下,估计场景中每个运动(包括传感器自运动)的完整SE(3)轨迹。MVO通过多运动分割和跟踪技术扩展了传统的视觉里程计(VO)管线。它使用有物理依据的运动先验来外推被临时遮挡期间的运动,并通过运动闭合(motion closure)识别运动的重新出现。在牛津多运动数据集(OMD)和KITTI Vision Benchmark Suite的真实世界数据上的评估表明,与类似方法相比,MVO实现了良好的估计精度,并适用于各种多运动估计挑战。
摘要: Visual motion estimation is a well-studied challenge in autonomous navigation. Recent work has focused on addressing multimotion estimation in highly dynamic environments. These environments not only comprise multiple, complex motions but also tend to exhibit significant occlusion. Estimating third-party motions simultaneously with the sensor egomotion is difficult because an object’s observed motion consists of both its true motion and the sensor motion. Most previous works in multimotion estimation simplify this problem by relying on appearance-based object detection or application-specific motion constraints. These approaches are effective in specific applications and environments but do not generalize well to the full multimotion estimation problem (MEP). This paper presents Multimotion Visual Odometry (MVO), a multimotion estimation pipeline that estimates the full SE(3) trajectory of every motion in the scene, including the sensor egomotion, without relying on appearance-based information. MVO extends the traditional visual odometry (VO) pipeline with multimotion segmentation and tracking techniques. It uses physically founded motion priors to extrapolate motions through temporary occlusions and identify the reappearance of motions through motion closure. Evaluations on real-world data from the Oxford Multimotion Dataset (OMD) and the KITTI Vision Benchmark Suite demonstrate that MVO achieves good estimation accuracy compared to similar approaches and is applicable to a variety of multimotion estimation challenges.
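下面用4×4齐次变换示意SE(3)运动的组合与分解:若已知传感器自运动与相机系下观测到的物体运动,即可恢复物体的真实运动。这里的组合约定只是常见写法中的一种,具体定义以论文为准,数值均为假设:

```python
import numpy as np

def se3(R, t):
    """由旋转矩阵 R(3x3) 与平移 t(3,) 组成 4x4 齐次变换。"""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

T_ego = se3(rot_z(0.1), np.array([0.5, 0.0, 0.0]))      # 传感器自运动
T_obj = se3(rot_z(-0.3), np.array([0.0, 1.0, 0.0]))     # 物体真实运动
T_observed = np.linalg.inv(T_ego) @ T_obj               # 相机系下观测到的运动
T_obj_recovered = T_ego @ T_observed                    # 已知自运动即可恢复真实运动
print(np.allclose(T_obj_recovered, T_obj))              # True
```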
标题: Learning Interactive Real-World Simulators
作者: Mengjiao Yang, Yilun Du, Kamyar Ghasemipour
PubTime: 2024-01-13
Downlink: http://arxiv.org/abs/2310.06114v2
Project: https://universal-simulator.github.io|
中文摘要: 基于互联网数据训练的生成模型彻底改变了文本、图像和视频内容的创建方式。生成模型的下一个里程碑也许是模拟真实体验,以响应人类、机器人和其他交互式智能体所采取的动作。真实世界模拟器的应用范围从游戏和电影中的可控内容创建,到纯粹在模拟中训练可直接部署到现实世界的具身智能体。我们探索了通过生成建模学习真实世界交互的通用模拟器的可能性。我们首先提出了一个重要的观察结果,即可用于学习真实世界模拟器的自然数据集通常在不同维度上各有丰富之处(例如,图像数据中的大量物体、机器人数据中密集采样的动作,以及导航数据中多样的运动)。通过仔细编排不同的数据集,让每个数据集提供整体体验的不同方面,我们可以从原本静态的场景和物体出发,模拟高层指令(如"打开抽屉")和低层控制(如"按x,y移动")的视觉结果。我们使用该模拟器来训练高层视觉语言策略和低层强化学习策略,每种策略在纯模拟训练后都可以零样本部署到现实世界中。我们还表明,其他类型的智能,如视频字幕模型,也可以从模拟经验的训练中受益,从而开辟更广泛的应用。视频演示可在 https://universal-simulator.github.io 查看。
摘要: Generative models trained on internet data have revolutionized how text, image, and video content can be created. Perhaps the next milestone for generative models is to simulate realistic experience in response to actions taken by humans, robots, and other interactive agents. Applications of a real-world simulator range from controllable content creation in games and movies, to training embodied agents purely in simulation that can be directly deployed in the real world. We explore the possibility of learning a universal simulator of real-world interaction through generative modeling. We first make the important observation that natural datasets available for learning a real-world simulator are often rich along different dimensions (e.g., abundant objects in image data, densely sampled actions in robotics data, and diverse movements in navigation data). With careful orchestration of diverse datasets, each providing a different aspect of the overall experience, we can simulate the visual outcome of both high-level instructions such as “open the drawer” and low-level controls such as “move by x, y” from otherwise static scenes and objects. We use the simulator to train both high-level vision-language policies and low-level reinforcement learning policies, each of which can be deployed in the real world in zero shot after training purely in simulation. We also show that other types of intelligence such as video captioning models can benefit from training with simulated experience, opening up even wider applications. Video demos can be found at https://universal-simulator.github.io.