[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉语言导航

专属领域论文订阅

VX 关注{晓理紫},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。

为了答谢各位网友的支持,从今日起免费为300名读者提供订阅主题论文服务,只需关注公号并在留言中提供{邮箱+论文主题}(如:123456@xx.com + chatgpt@large language model @LLM),主题必须是同一个领域,最多三个关键词。解释权归博主所有


分类:

== LLM ==

标题: ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models

作者: Yi-Lin Sung, Jaehong Yoon, Mohit Bansal

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2310.02998v2

Project: https://ecoflap.github.io/|

中文摘要: 大型视觉语言模型(LVLMs)通过整合来自不同模态的丰富信息,可以全面理解世界,在各种多模态下游任务上取得显著进步。然而,由于巨大的计算/能源成本和碳排放,部署LVLMs往往面临困难。这些问题使得传统的迭代式全局剪枝难以采用,因为为了稀疏化而计算整个大模型的Hessian矩阵代价高昂。另一方面,最近一些研究提出了逐层剪枝方法,以避免全局剪枝的昂贵计算,并根据权重在层内的重要性有效地压缩模型。然而,由于缺乏全局视角,它们往往只能得到次优的压缩效果。为了解决近期大模型高效剪枝方法中的这一局限,我们提出了高效由粗到细逐层剪枝(ECoFLaP),这是一种面向LVLMs的两阶段由粗到细权重剪枝方法。我们首先利用全局重要性分数来确定不同层或块的稀疏率,该分数基于全局模型梯度的零阶近似高效计算。然后,模型根据由全局信息确定的稀疏率执行局部的逐层非结构化权重剪枝。我们在多种多模态和单模态模型及数据集上验证了所提方法,结果表明在高稀疏率条件下,其性能显著优于流行的剪枝技术。

摘要: Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities, achieving remarkable advancements on various multimodal downstream tasks. However, deploying LVLMs is often problematic due to their massive computational/energy costs and carbon consumption. Such issues make it infeasible to adopt conventional iterative global pruning, which is costly due to computing the Hessian matrix of the entire large model for sparsification. Alternatively, several studies have recently proposed layer-wise pruning approaches to avoid the expensive computation of global pruning and efficiently compress model weights according to their importance within a layer. However, they often suffer from suboptimal model compression due to their lack of a global perspective. To address this limitation in recent efficient pruning methods for large models, we propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs. We first determine the sparsity ratios of different layers or blocks by leveraging the global importance score, which is efficiently computed based on the zeroth-order approximation of the global model gradients. Then, the model performs local layer-wise unstructured weight pruning based on globally-informed sparsity ratios. We validate our proposed method across various multimodal and unimodal models and datasets, demonstrating significant performance improvements over prevalent pruning techniques in the high-sparsity regime.
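To make the two-stage idea above concrete, here is a minimal PyTorch sketch: a zeroth-order (finite-difference) estimate of the global gradient gives per-layer importance scores, which set per-layer sparsity ratios before plain local magnitude pruning. The allocation rule, `loss_fn`, and `calib_batch` are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the two-stage idea described in the abstract (not the authors' code).
# `loss_fn(model, batch)` and `calib_batch` are assumed to exist; the sparsity-allocation
# rule is a simple illustrative choice, not necessarily the one used in ECoFLaP.
import torch

def zeroth_order_layer_importance(model, loss_fn, calib_batch, eps=1e-3, n_samples=4):
    """Per-layer importance ~ |w * g|, with g estimated by finite differences (zeroth order)."""
    params = [p for p in model.parameters() if p.ndim >= 2]
    scores = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        us = [torch.randn_like(p) for p in params]
        with torch.no_grad():
            for p, u in zip(params, us):
                p.add_(eps * u)
            loss_plus = loss_fn(model, calib_batch)
            for p, u in zip(params, us):
                p.sub_(2 * eps * u)
            loss_minus = loss_fn(model, calib_batch)
            for p, u in zip(params, us):
                p.add_(eps * u)                      # restore the original weights
            coeff = (loss_plus - loss_minus) / (2 * eps * n_samples)
            for s, p, u in zip(scores, params, us):
                s += (coeff * u * p).abs()           # |w * estimated gradient|
    return params, [s.sum().item() for s in scores]

def ecoflap_style_prune(model, loss_fn, calib_batch, target_sparsity=0.5):
    params, layer_scores = zeroth_order_layer_importance(model, loss_fn, calib_batch)
    total = sum(layer_scores) + 1e-12
    n = len(params)
    # Coarse stage: give more important layers a lower sparsity ratio, keeping the mean near target.
    ratios = [min(0.95, target_sparsity * (1 - s / total) * n / max(n - 1, 1)) for s in layer_scores]
    # Fine stage: local unstructured magnitude pruning at the allocated ratio.
    with torch.no_grad():
        for p, r in zip(params, ratios):
            k = int(p.numel() * r)
            if k > 0:
                thresh = p.abs().flatten().kthvalue(k).values
                p.mul_((p.abs() > thresh).float())
    return ratios
```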


标题: SliceGPT: Compress Large Language Models by Deleting Rows and Columns

作者: Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.15024v1

GitHub: https://github.com/microsoft/TransformerCompression|

中文摘要: 大型语言模型已经成为自然语言处理的基石,但其使用在计算和内存资源方面伴随着巨大的成本。稀疏化提供了一种缓解这些资源限制的方案,最近的工作表明,训练好的模型可以在训练后进行稀疏化。现有的稀疏化技术面临挑战,因为它们需要额外的数据结构,并且在当前硬件上只能提供有限的加速。在本文中,我们提出了SliceGPT,一种新的训练后稀疏化方案,它用更小的(密集)矩阵替换每个权重矩阵,从而降低网络的嵌入维度。通过大量实验,我们表明SliceGPT可以去除LLAMA2-70B、OPT 66B和Phi-2模型高达25%的模型参数(包括嵌入),同时分别保持密集模型99%、99%和90%的零样本任务性能。切片后的模型可以在更少的GPU上运行,并且无需任何额外的代码优化即可运行得更快:在24GB消费级GPU上,我们将LLAMA2-70B推理的总计算量降至密集模型的64%;在40GB A100 GPU上降至66%。我们提出了一个新的见解,即Transformer网络中的计算不变性,正是它使SliceGPT成为可能;我们希望这一见解能够启发并推动未来降低预训练模型内存和计算需求的研究。代码可从以下网址获得:https://github.com/microsoft/TransformerCompression

摘要: Large language models have become the cornerstone of natural language processing, but their use comes with substantial costs in terms of compute and memory resources. Sparsification provides a solution to alleviate these resource constraints, and recent works have shown that trained models can be sparsified post-hoc. Existing sparsification techniques face challenges as they need additional data structures and offer constrained speedup with current hardware. In this paper we present SliceGPT, a new post-training sparsification scheme which replaces each weight matrix with a smaller (dense) matrix, reducing the embedding dimension of the network. Through extensive experimentation, we show that SliceGPT can remove up to 25% of the model parameters (including embeddings) for LLAMA2-70B, OPT 66B and Phi-2 models while maintaining 99%, 99% and 90% zero-shot task performance of the dense model respectively. Our sliced models run on fewer GPUs and run faster without any additional code optimization: on 24GB consumer GPUs we reduce the total compute for inference on LLAMA2-70B to 64% of that of the dense model; on 40GB A100 GPUs we reduce it to 66%. We offer a new insight, computational invariance in transformer networks, which enables SliceGPT and we hope it will inspire and enable future avenues to reduce memory and computation demands for pre-trained models. Code is available at: https://github.com/microsoft/TransformerCompression
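The key enabler named in the abstract is computational invariance: multiplying activations by an orthogonal matrix Q and the following weights by its transpose leaves the output unchanged, after which low-variance directions can be deleted. A toy numpy sketch of that idea follows; shapes, data, and the PCA choice of Q are illustrative only, not the actual per-block transformer procedure.

```python
# Toy numpy illustration of the "computational invariance" insight from the abstract: an
# orthogonal rotation Q applied to activations and absorbed into the next weight matrix leaves
# outputs unchanged, after which low-variance directions can be deleted ("sliced").
import numpy as np

rng = np.random.default_rng(0)
d, d_out, n = 64, 32, 1000
X = rng.normal(size=(n, d)) @ np.diag(np.logspace(0, -2, d))   # activations with a decaying spectrum
W = rng.normal(size=(d, d_out)) / np.sqrt(d)

# Orthogonal Q from PCA of the activations: exact invariance before any slicing.
_, _, Vt = np.linalg.svd(X - X.mean(0), full_matrices=False)
Q = Vt.T                                                       # d x d orthogonal
assert np.allclose((X @ Q) @ (Q.T @ W), X @ W, atol=1e-6)

# Slice: keep only the top-k principal directions -> smaller dense matrices, smaller embedding dim.
k = 40
X_small = X @ Q[:, :k]                                         # n x k
W_small = Q[:, :k].T @ W                                       # k x d_out
err = np.linalg.norm(X_small @ W_small - X @ W) / np.linalg.norm(X @ W)
print(f"relative output error after slicing {d} -> {k} dims: {err:.4f}")
```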


标题: Airavata: Introducing Hindi Instruction-tuned LLM

作者: Jay Gala, Thanmay Jayakumar, Jaavid Aktar Husain

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.15006v1

Project: https://ai4bharat.github.io/airavata|

中文摘要: 我们发布了“Airavata”的首个版本,这是一个针对印地语进行指令微调的LLM。Airavata是通过使用多样化的印地语指令微调数据集对OpenHathi进行微调而创建的,使其更适合辅助类任务。除了模型之外,我们还公开了IndicInstruct数据集,这是一个由多个指令微调数据集组成的集合,以便对印度语系LLM进行进一步研究。此外,我们还提出了评估基准和一个框架,用于评估印地语LLM在不同任务上的性能。目前,Airavata支持印地语,但我们计划将其扩展到全部22种列入印度宪法附表的印度语言。所有资源可在https://ai4bharat.github.io/airavata获取。

摘要: We announce the initial release of “Airavata,” an instruction-tuned LLM for Hindi. Airavata was created by fine-tuning OpenHathi with diverse, instruction-tuning Hindi datasets to make it better suited for assistive tasks. Along with the model, we also share the IndicInstruct dataset, which is a collection of diverse instruction-tuning datasets to enable further research for Indic LLMs. Additionally, we present evaluation benchmarks and a framework for assessing LLM performance across tasks in Hindi. Currently, Airavata supports Hindi, but we plan to expand this to all 22 scheduled Indic languages. You can access all artifacts at https://ai4bharat.github.io/airavata.


标题: Prompt-based Distribution Alignment for Unsupervised Domain Adaptation

作者: Shuanghao Bai, Min Zhang, Wanqi Zhou

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2312.09553v2

GitHub: https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment|

摘要: Recently, despite the unprecedented success of large pre-trained visual-language models (VLMs) on a wide range of downstream tasks, the real-world unsupervised domain adaptation (UDA) problem is still not well explored. Therefore, in this paper, we first experimentally demonstrate that the unsupervised-trained VLMs can significantly reduce the distribution discrepancy between source and target domains, thereby improving the performance of UDA. However, a major challenge for directly deploying such models on downstream UDA tasks is prompt engineering, which requires aligning the domain knowledge of source and target domains, since the performance of UDA is severely influenced by a good domain-invariant representation. We further propose a Prompt-based Distribution Alignment (PDA) method to incorporate the domain knowledge into prompt learning. Specifically, PDA employs a two-branch prompt-tuning paradigm, namely base branch and alignment branch. The base branch focuses on integrating class-related representation into prompts, ensuring discrimination among different classes. To further minimize domain discrepancy, for the alignment branch, we construct feature banks for both the source and target domains and propose image-guided feature tuning (IFT) to make the input attend to feature banks, which effectively integrates self-enhanced and cross-domain features into the model. In this way, these two branches can be mutually promoted to enhance the adaptation of VLMs for UDA. We conduct extensive experiments on three benchmarks to demonstrate that our proposed PDA achieves state-of-the-art performance. The code is available at https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment.


标题: Endowing Protein Language Models with Structural Knowledge

作者: Dexiong Chen, Philip Hartout, Paolo Pellizzoni

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.14819v1

GitHub: https://github.com/BorgwardtLab/PST|

中文摘要: 理解蛋白质序列、结构和功能之间的关系是一个长期存在的生物学挑战,对从药物设计到进化理解等诸多方面都有影响。最近,得益于利用大规模序列数据库的能力,蛋白质语言模型已成为应对这一挑战的首选方法。然而,它们对海量序列数据和庞大参数量的依赖限制了其在现实场景中的灵活性和实用性。与此同时,近来计算预测蛋白质结构的激增为蛋白质表示学习带来了新机遇。虽然前景可观,但这类复杂数据带来的计算负担仍然阻碍其广泛的实际应用。为了解决这些限制,我们引入了一个新框架,通过整合蛋白质结构数据来增强蛋白质语言模型。借鉴图Transformer的最新进展,我们的方法通过结构提取器模块注入结构信息,改进预训练语言Transformer的自注意力机制。这一改进模型被称为蛋白质结构Transformer(PST),并在一个小型蛋白质结构数据库上进一步预训练,采用与传统蛋白质语言模型相同的掩码语言建模目标。实证评估表明,尽管PST仅在包含542K个结构的数据集上预训练,其参数效率仍优于蛋白质语言模型。值得注意的是,PST始终优于目前最先进的蛋白质序列基础模型ESM-2,在蛋白质功能预测上树立了新的基准。我们的发现强调了将结构信息整合到蛋白质语言模型中的潜力,为更有效、更高效的蛋白质建模铺平了道路。代码和预训练模型可在https://github.com/BorgwardtLab/PST获取。

摘要: Understanding the relationships between protein sequence, structure and function is a long-standing biological challenge with manifold implications from drug design to our understanding of evolution. Recently, protein language models have emerged as the preferred method for this challenge, thanks to their ability to harness large sequence databases. Yet, their reliance on expansive sequence data and parameter sets limits their flexibility and practicality in real-world scenarios. Concurrently, the recent surge in computationally predicted protein structures unlocks new opportunities in protein representation learning. While promising, the computational burden carried by such complex data still hinders widely-adopted practical applications. To address these limitations, we introduce a novel framework that enhances protein language models by integrating protein structural data. Drawing from recent advances in graph transformers, our approach refines the self-attention mechanisms of pretrained language transformers by integrating structural information with structure extractor modules. This refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database, using the same masked language modeling objective as traditional protein language models. Empirical evaluations of PST demonstrate its superior parameter efficiency relative to protein language models, despite being pretrained on a dataset comprising only 542K structures. Notably, PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction. Our findings underscore the potential of integrating structural information into protein language models, paving the way for more effective and efficient protein modeling. Code and pretrained models are available at https://github.com/BorgwardtLab/PST.
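As a rough illustration of how structural information can be injected into the self-attention of a pretrained protein language model, the sketch below biases attention logits with a small network over pairwise Cα distances. This is a generic pattern assumed for illustration; the actual PST structure-extractor design may differ.

```python
# Generic sketch of structure-conditioned self-attention, in the spirit of the abstract.
# The distance-to-bias "structure extractor" here is an illustrative stand-in.
import torch
import torch.nn as nn

class StructureBiasedAttention(nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # "Structure extractor": maps pairwise residue distances to a per-head attention bias.
        self.dist_to_bias = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, n_heads))

    def forward(self, x, ca_coords):
        # x: (B, L, dim) residue embeddings from a pretrained protein LM
        # ca_coords: (B, L, 3) predicted C-alpha coordinates (e.g. from predicted structures)
        dist = torch.cdist(ca_coords, ca_coords).unsqueeze(-1)      # (B, L, L, 1)
        bias = self.dist_to_bias(dist).permute(0, 3, 1, 2)          # (B, H, L, L)
        B, H, L, _ = bias.shape
        bias = bias.reshape(B * H, L, L)                            # MHA expects (B*H, L, L)
        out, _ = self.attn(x, x, x, attn_mask=bias)
        return out

layer = StructureBiasedAttention(dim=64, n_heads=4)
x = torch.randn(2, 10, 64)
coords = torch.randn(2, 10, 3)
print(layer(x, coords).shape)   # torch.Size([2, 10, 64])
```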


标题: K-QA: A Real-World Medical Q&A Benchmark

作者: Itay Manes, Naama Ronn, David Cohen

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.14493v1

Project: https://huggingface.co/spaces/Itaykhealth/K-QA|

GitHub: https://github.com/Itaymanes/K-QA|

摘要: Ensuring the accuracy of responses provided by large language models (LLMs) is crucial, particularly in clinical settings where incorrect information may directly impact patient health. To address this challenge, we construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health (an AI-driven clinical platform). We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements. Additionally, we formulate two NLI-based evaluation metrics approximating recall and precision: (1) comprehensiveness, measuring the percentage of essential clinical information in the generated answer and (2) hallucination rate, measuring the number of statements from the physician-curated response contradicted by the LLM answer. Finally, we use K-QA along with these metrics to evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes developed by the authors. Our findings indicate that in-context learning improves the comprehensiveness of the models, and augmented retrieval is effective in reducing hallucinations. We make K-QA available to the community to spur research into medically accurate NLP applications.
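The two metrics can be written down compactly. The sketch below mirrors the abstract's wording with a pluggable NLI function; the exact aggregation used in the paper may differ, and the toy NLI stand-in is only for the demo call at the end.

```python
# Sketch of the two NLI-based metrics described in the abstract, with a pluggable `nli`
# callable (any model returning "entailment" / "contradiction" / "neutral").
from typing import Callable, List

def comprehensiveness(answer: str, must_have: List[str],
                      nli: Callable[[str, str], str]) -> float:
    """Share of essential clinical statements that the generated answer entails."""
    if not must_have:
        return 1.0
    covered = sum(nli(answer, stmt) == "entailment" for stmt in must_have)
    return covered / len(must_have)

def hallucination_count(answer: str, curated: List[str],
                        nli: Callable[[str, str], str]) -> int:
    """Number of physician-curated statements contradicted by the answer."""
    return sum(nli(answer, stmt) == "contradiction" for stmt in curated)

# Example with a trivial stand-in NLI function (replace with a real NLI model).
def toy_nli(premise: str, hypothesis: str) -> str:
    return "entailment" if hypothesis.lower() in premise.lower() else "neutral"

answer = "Take ibuprofen with food and avoid alcohol."
statements = ["take ibuprofen with food", "avoid alcohol", "drink plenty of water"]
print(comprehensiveness(answer, statements, toy_nli))    # 0.666...
print(hallucination_count(answer, statements, toy_nli))  # 0
```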


== VLM ==

标题: Prompt-based Distribution Alignment for Unsupervised Domain Adaptation

作者: Shuanghao Bai, Min Zhang, Wanqi Zhou

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2312.09553v2

GitHub: https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment|

中文摘要: 最近,尽管大型预训练视觉语言模型(VLM)在广泛的下游任务上取得了前所未有的成功,但现实世界中的无监督域适应(UDA)问题仍未得到充分探索。因此,在本文中,我们首先通过实验证明无监督训练的VLM可以显著降低源域和目标域之间的分布差异,从而提高UDA的性能。然而,将此类模型直接部署到下游UDA任务的一大挑战是提示工程(prompt engineering),它需要对齐源域和目标域的领域知识,因为UDA的性能在很大程度上取决于良好的域不变表示。我们进一步提出了一种基于提示的分布对齐(PDA)方法,将领域知识融入提示学习。具体来说,PDA采用双分支提示调优范式,即基础分支和对齐分支。基础分支侧重于将与类别相关的表示整合到提示中,确保不同类别之间的区分性。为了进一步减小域差异,在对齐分支中,我们为源域和目标域构建特征库,并提出图像引导特征调优(IFT),使输入关注特征库,从而将自增强特征和跨域特征有效地整合到模型中。通过这种方式,两个分支相互促进,增强了VLM对UDA的适应能力。我们在三个基准上进行了大量实验,证明所提出的PDA达到了最先进的性能。代码可在https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment获得。

摘要: Recently, despite the unprecedented success of large pre-trained visual-language models (VLMs) on a wide range of downstream tasks, the real-world unsupervised domain adaptation (UDA) problem is still not well explored. Therefore, in this paper, we first experimentally demonstrate that the unsupervised-trained VLMs can significantly reduce the distribution discrepancy between source and target domains, thereby improving the performance of UDA. However, a major challenge for directly deploying such models on downstream UDA tasks is prompt engineering, which requires aligning the domain knowledge of source and target domains, since the performance of UDA is severely influenced by a good domain-invariant representation. We further propose a Prompt-based Distribution Alignment (PDA) method to incorporate the domain knowledge into prompt learning. Specifically, PDA employs a two-branch prompt-tuning paradigm, namely base branch and alignment branch. The base branch focuses on integrating class-related representation into prompts, ensuring discrimination among different classes. To further minimize domain discrepancy, for the alignment branch, we construct feature banks for both the source and target domains and propose image-guided feature tuning (IFT) to make the input attend to feature banks, which effectively integrates self-enhanced and cross-domain features into the model. In this way, these two branches can be mutually promoted to enhance the adaptation of VLMs for UDA. We conduct extensive experiments on three benchmarks to demonstrate that our proposed PDA achieves state-of-the-art performance. The code is available at https://github.com/BaiShuanghao/Prompt-based-Distribution-Alignment.
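One plausible reading of the alignment branch's image-guided feature tuning (IFT) is an attention step over the source/target feature banks, sketched below in PyTorch. The shapes, the temperature `tau`, and the residual combination are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of the IFT idea from the abstract: the image feature attends over source and
# target feature banks so that self-enhanced and cross-domain information is mixed in.
import torch
import torch.nn.functional as F

def image_guided_feature_tuning(img_feat, source_bank, target_bank, tau=0.07):
    """img_feat: (B, D); banks: (N, D). Returns features augmented with bank context."""
    bank = torch.cat([source_bank, target_bank], dim=0)            # (Ns+Nt, D)
    q = F.normalize(img_feat, dim=-1)
    k = F.normalize(bank, dim=-1)
    attn = F.softmax(q @ k.t() / tau, dim=-1)                      # (B, Ns+Nt)
    context = attn @ bank                                          # (B, D)
    return img_feat + context                                      # self-enhanced + cross-domain

B, D = 4, 512
img = torch.randn(B, D)
src_bank, tgt_bank = torch.randn(32, D), torch.randn(32, D)
print(image_guided_feature_tuning(img, src_bank, tgt_bank).shape)  # torch.Size([4, 512])
```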


标题: PL-FSCIL: Harnessing the Power of Prompts for Few-Shot Class-Incremental Learning

作者: Songsong Tian, Lusi Li, Weijun Li

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.14807v1

GitHub: https://github.com/TianSongS/PL-FSCIL|

中文摘要: 小样本类增量学习(FSCIL)旨在使深度神经网络能够从少量标注样本中增量地学习新任务,同时不遗忘之前学过的任务,从而贴近人类的学习模式。在本文中,我们提出了一种称为面向FSCIL的提示学习(PL-FSCIL)的新方法,该方法将提示的力量与预训练的视觉Transformer(ViT)模型相结合,以有效应对FSCIL的挑战。我们的工作开创了在FSCIL中使用视觉提示的先河,其特点是非常简洁。PL-FSCIL由两种不同的提示组成:域提示和FSCIL提示。二者都是嵌入到ViT模型注意力层中、用于增强模型的向量。具体而言,域提示帮助ViT模型适应新的数据域;任务相关的FSCIL提示与原型分类器相结合,增强了模型有效处理FSCIL任务的能力。我们在CIFAR-100和CUB-200等广泛使用的基准数据集上验证了PL-FSCIL的有效性。结果显示了具有竞争力的性能,凸显了其在高质量数据通常稀缺的现实应用中的巨大潜力。源代码可从以下网址获得:https://github.com/TianSongS/PL-FSCIL。

摘要: Few-Shot Class-Incremental Learning (FSCIL) aims to enable deep neural networks to learn new tasks incrementally from a small number of labeled samples without forgetting previously learned tasks, closely mimicking human learning patterns. In this paper, we propose a novel approach called Prompt Learning for FSCIL (PL-FSCIL), which harnesses the power of prompts in conjunction with a pre-trained Vision Transformer (ViT) model to address the challenges of FSCIL effectively. Our work pioneers the use of visual prompts in FSCIL, which is characterized by its notable simplicity. PL-FSCIL consists of two distinct prompts: the Domain Prompt and the FSCIL Prompt. Both are vectors that augment the model by embedding themselves into the attention layer of the ViT model. Specifically, the Domain Prompt assists the ViT model in adapting to new data domains. The task-specific FSCIL Prompt, coupled with a prototype classifier, amplifies the model’s ability to effectively handle FSCIL tasks. We validate the efficacy of PL-FSCIL on widely used benchmark datasets such as CIFAR-100 and CUB-200. The results showcase competitive performance, underscoring its promising potential for real-world applications where high-quality data is often scarce. The source code is available at: https://github.com/TianSongS/PL-FSCIL.
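A generic sketch of the two ingredients named above, learnable prompt vectors attended to inside a ViT block plus a prototype (nearest-class-mean) classifier, is given below; the exact way PL-FSCIL injects its Domain and FSCIL prompts may differ from this simple form.

```python
# Generic visual-prompt-plus-prototype sketch (illustrative, not the PL-FSCIL implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedBlock(nn.Module):
    def __init__(self, dim, n_heads, n_prompts=8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)  # e.g. Domain + FSCIL prompts
        self.block = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim) CLS + patch tokens
        B = tokens.size(0)
        p = self.prompts.unsqueeze(0).expand(B, -1, -1)
        out = self.block(torch.cat([p, tokens], dim=1))
        return out[:, p.size(1):]                    # drop prompt positions, keep token outputs

def prototype_logits(features, prototypes, tau=0.1):
    """Cosine similarity to per-class prototypes (class means of few-shot features)."""
    f = F.normalize(features, dim=-1)
    c = F.normalize(prototypes, dim=-1)
    return f @ c.t() / tau

blk = PromptedBlock(dim=192, n_heads=3)
x = torch.randn(2, 197, 192)                         # ViT-Tiny-like token sequence
feats = blk(x)[:, 0]                                 # use the CLS position as the feature
protos = torch.randn(10, 192)                        # prototypes for 10 classes seen so far
print(prototype_logits(feats, protos).shape)         # torch.Size([2, 10])
```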


标题: Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support

作者: Xiaojun Wu, Dixiang Zhang, Ruyi Gan

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.14688v1

Project: https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/|

中文摘要: 文本到图像模型的最新进展显著增强了图像生成能力,但开源模型在双语或中文支持方面仍存在明显差距。为了满足这一需求,我们提出了一种新的中英双语文本到图像模型Taiyi-Diffusion-XL,它通过双语持续预训练过程扩展CLIP和Stable-Diffusion-XL的能力而来。该方法包括将最常用的汉字整合到CLIP的分词器和嵌入层中以高效扩展词表,并配合绝对位置编码的扩展。此外,我们利用大型视觉语言模型来丰富文本提示,从而得到更好的图像描述并带来更高的视觉质量。这些增强随后被应用到下游的文本到图像模型中。实证结果表明,所开发的CLIP模型在双语图文检索方面表现出色;同时,Taiyi-Diffusion-XL的双语图像生成能力超越了以往模型。这项研究带来了Taiyi-Diffusion-XL模型的开发和开源,代表了图像生成领域尤其是中文应用方面的显著进步。这一贡献也是满足多模态研究中更多样化语言支持需求的一步。模型和演示可在https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/公开获取,以促进该领域的进一步研究与合作。

摘要: Recent advancements in text-to-image models have significantly enhanced image generation capabilities, yet a notable gap of open-source models persists in bilingual or Chinese language support. To address this need, we present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model which is developed by extending the capabilities of CLIP and Stable-Diffusion-XL through a process of bilingual continuous pre-training. This approach includes the efficient expansion of vocabulary by integrating the most frequently used Chinese characters into CLIP’s tokenizer and embedding layers, coupled with an absolute position encoding expansion. Additionally, we enrich text prompts with a large vision-language model, leading to better image captions and higher visual quality. These enhancements are subsequently applied to downstream text-to-image models. Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval. Furthermore, the bilingual image generation capabilities of Taiyi-Diffusion-XL surpass previous models. This research leads to the development and open-sourcing of the Taiyi-Diffusion-XL model, representing a notable advancement in the field of image generation, particularly for Chinese language applications. This contribution is a step forward in addressing the need for more diverse language support in multimodal research. The model and demonstration are made publicly available at https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/, fostering further research and collaboration in this domain.
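The vocabulary-expansion step described above can be sketched with the Hugging Face transformers API: add frequent Chinese characters to a CLIP tokenizer and resize the text encoder's embedding matrix. The checkpoint and example characters below are placeholders, and the bilingual continued pre-training itself (and the SDXL side) is not shown.

```python
# Sketch of the tokenizer/embedding expansion step from the abstract (placeholder checkpoint;
# Taiyi-Diffusion-XL starts from its own CLIP, and continued pre-training is omitted here).
from transformers import CLIPTokenizer, CLIPTextModel

base = "openai/clip-vit-base-patch32"          # stand-in checkpoint for illustration
tokenizer = CLIPTokenizer.from_pretrained(base)
text_encoder = CLIPTextModel.from_pretrained(base)

# In the paper the added tokens are the most frequently used Chinese characters; three as a demo.
new_chars = ["太", "乙", "图"]
num_added = tokenizer.add_tokens(new_chars)
text_encoder.resize_token_embeddings(len(tokenizer))   # new embedding rows are randomly initialized

print(f"added {num_added} tokens, new vocab size = {len(tokenizer)}")
print(tokenizer.tokenize("太乙生成图像"))
```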


标题: Multi-task robot data for dual-arm fine manipulation

作者: Heecheol Kim, Yoshiyuki Ohmura, Yasuo Kuniyoshi

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.07603v2

Project: https://sites.google.com/view/multi-task-fine|

中文摘要: 在机器人操作领域,深度模仿学习被认为是一种很有前途的操作技能获取方法。此外,从多样化的机器人数据集中学习被认为是实现通用性和适应性的可行途径。在这类研究中,通过学习多种任务,机器人获得了跨多种物体的泛化能力。然而,现有的多任务机器人数据集主要集中在相对不精确的单臂任务上,并未涵盖机器人在现实世界中需要执行的细粒度物体操作。本文介绍了一个多样化物体操作数据集,其中包括双臂任务和/或需要精细操作的任务。为此,我们构建了包含22.4万个片段(150小时、1,104条语言指令)的数据集,其中包括移动碗、打开铅笔盒或剥香蕉等双臂精细任务,这些数据是公开可用的。此外,该数据集还包括视觉注意力信号和双动作标签(该标签将动作划分为稳健的到达轨迹和与物体的精确交互),以及用于实现稳健而精确物体操作的语言指令。我们将该数据集应用于我们的双动作与注意力(DAA)模型,这是一个为细粒度双臂操作任务设计、并对协变量偏移具有鲁棒性的模型。该模型在真实机器人操作任务中进行了超过7千次试验,证明了其精细操作能力。数据集可在https://sites.google.com/view/multi-task-fine查阅。

摘要: In the field of robotic manipulation, deep imitation learning is recognized as a promising approach for acquiring manipulation skills. Additionally, learning from diverse robot datasets is considered a viable method to achieve versatility and adaptability. In such research, by learning various tasks, robots achieved generality across multiple objects. However, such multi-task robot datasets have mainly focused on single-arm tasks that are relatively imprecise, not addressing the fine-grained object manipulation that robots are expected to perform in the real world. This paper introduces a dataset of diverse object manipulations that includes dual-arm tasks and/or tasks requiring fine manipulation. To this end, we have generated dataset with 224k episodes (150 hours, 1,104 language instructions) which includes dual-arm fine tasks such as bowl-moving, pencil-case opening or banana-peeling, and this data is publicly available. Additionally, this dataset includes visual attention signals as well as dual-action labels, a signal that separates actions into a robust reaching trajectory and precise interaction with objects, and language instructions to achieve robust and precise object manipulation. We applied the dataset to our Dual-Action and Attention (DAA), a model designed for fine-grained dual arm manipulation tasks and robust against covariate shifts. The model was tested with over 7k total trials in real robot manipulation tasks, demonstrating its capability in fine manipulation. The dataset is available at https://sites.google.com/view/multi-task-fine.


标题: Pixel-Wise Recognition for Holistic Surgical Scene Understanding

作者: Nicolás Ayobi, Santiago Rodríguez, Alejandra Pérez

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.11174v2

Project: https://link.springer.com/chapter/10.1007/978-3-031-16449-1_42|https://ieeexplore.ieee.org/document/10230819|

GitHub: https://github.com/BCV-Uniandes/GraSP|

中文摘要: 本文介绍了前列腺切除术整体与多粒度手术场景理解(GraSP)数据集,这是一个精心构建的基准,它将手术场景理解建模为由不同粒度的互补任务组成的层次结构。我们的方法能够对手术活动进行多层次理解,既包括手术阶段和步骤识别等长时程任务,也包括手术器械分割和原子视觉动作检测等短时程任务。为了利用我们提出的基准,我们引入了动作、阶段、步骤与器械分割Transformer(TAPIS)模型,这是一种通用架构,它将全局视频特征提取器与来自器械分割模型的局部区域候选相结合,以应对基准的多粒度特性。通过大量实验,我们展示了在短时程识别任务中引入分割标注的影响,强调了各任务不同的粒度需求,并确立了TAPIS相对于先前基线和传统CNN模型的优势。此外,我们在多个公开基准上验证了方法的稳健性,确认了数据集的可靠性和适用性。这项工作是内窥镜视觉领域向前迈出的重要一步,为未来面向手术过程整体理解的研究提供了一个新颖而全面的框架。

摘要: This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. Our approach enables a multi-level comprehension of surgical activities, encompassing long-term tasks such as surgical phases and steps recognition and short-term tasks including surgical instrument segmentation and atomic visual actions detection. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to tackle the multi-granularity of our benchmark. Through extensive experimentation, we demonstrate the impact of including segmentation annotations in short-term recognition tasks, highlight the varying granularity requirements of each task, and establish TAPIS’s superiority over previously proposed baselines and conventional CNN-based models. Additionally, we validate the robustness of our method across multiple public benchmarks, confirming the reliability and applicability of our dataset. This work represents a significant step forward in Endoscopic Vision, offering a novel and comprehensive framework for future research towards a holistic understanding of surgical procedures.


标题: Rethinking FID: Towards a Better Evaluation Metric for Image Generation

作者: Sadeep Jayasumana, Srikumar Ramalingam, Andreas Veit

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.09603v2

GitHub: https://github.com/google-research/google-research/tree/master/cmmd|

中文摘要: 与许多机器学习问题一样,图像生成方法的进步取决于良好的评估指标。其中最流行的是Frechet Inception距离(FID)。FID估计真实图像的Inception-v3特征分布与算法生成图像的特征分布之间的距离。我们指出了FID的重要缺陷:Inception特征无法充分表示现代文本到图像模型生成的丰富多样的内容、不正确的正态性假设,以及较差的样本复杂度。我们呼吁重新评估将FID用作生成图像主要质量指标的做法。我们通过实验证明,FID与人类评分者的判断相矛盾,不能反映迭代式文本到图像模型的逐步改进,不能捕捉失真程度,并且在改变样本量时会产生不一致的结果。我们还提出了一种新的替代指标CMMD,它基于更丰富的CLIP嵌入和带高斯RBF核的最大均值差异距离。它是一个无偏估计量,不对嵌入的概率分布做任何假设,并且具有样本效率。通过大量实验和分析,我们证明基于FID的文本到图像模型评估可能不可靠,而CMMD能提供更稳健、更可靠的图像质量评估。

摘要: As with many machine learning problems, the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Frechet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images, and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception’s poor representation of the rich and varied content generated by modern text-to-image models, incorrect normality assumptions, and poor sample complexity. We call for a reevaluation of FID’s use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters, it does not reflect gradual improvement of iterative text-to-image models, it does not capture distortion levels, and that it produces inconsistent results when varying the sample size. We also propose an alternative new metric, CMMD, based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis, we demonstrate that FID-based evaluations of text-to-image models may be unreliable, and that CMMD offers a more robust and reliable assessment of image quality.
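The proposed CMMD is an MMD distance with a Gaussian RBF kernel computed on CLIP embeddings. A minimal unbiased-estimator sketch follows; the bandwidth and overall scaling are illustrative choices, not the paper's exact values.

```python
# Sketch of an unbiased squared-MMD estimate with a Gaussian RBF kernel between two sets of
# image embeddings (e.g. CLIP features of real vs. generated images), per the abstract.
import torch

def rbf_kernel(x, y, sigma):
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(x, y, sigma=10.0):
    """Unbiased estimate of squared MMD between samples x: (m, d) and y: (n, d)."""
    m, n = x.size(0), y.size(0)
    kxx = rbf_kernel(x, x, sigma)
    kyy = rbf_kernel(y, y, sigma)
    kxy = rbf_kernel(x, y, sigma)
    # Drop diagonal terms for the unbiased estimator.
    term_x = (kxx.sum() - kxx.diagonal().sum()) / (m * (m - 1))
    term_y = (kyy.sum() - kyy.diagonal().sum()) / (n * (n - 1))
    return term_x + term_y - 2 * kxy.mean()

real = torch.randn(512, 768)        # placeholder for CLIP embeddings of real images
fake = torch.randn(512, 768) + 0.1  # placeholder for embeddings of generated images
print(mmd2_unbiased(real, fake).item())
```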


== diffusion model ==

标题: DAM: Diffusion Activation Maximization for 3D Global Explanations

作者: Hanxiao Tan

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.14938v1

GitHub: https://github.com/Explain3D/DAM|

中文摘要: 近年来,点云模型的性能得到了迅速提升。然而,由于相关可解释性研究数量有限,这些黑盒模型的不可靠性和不透明性可能会在自动驾驶、医疗等危及人类生命的应用中带来潜在风险。本文提出了一种基于DDPM的点云全局可解释性方法(DAM),该方法利用点扩散Transformer(PDT)这一新的逐点对称模型,并结合双分类器引导来生成高质量的全局解释。此外,还提出了一种适用于DAM的自适应路径梯度积分方法,它不仅提供了点云类别显著性图的全局概览,还揭示了解释的归因在生成过程中如何变化。大量实验表明,我们的方法在可感知性、代表性和多样性方面优于现有方法,并显著缩短了生成时间。我们的代码可从以下网址获得:https://github.com/Explain3D/DAM

摘要: In recent years, the performance of point cloud models has been rapidly improved. However, due to the limited amount of relevant explainability studies, the unreliability and opacity of these black-box models may lead to potential risks in applications where human lives are at stake, e.g. autonomous driving or healthcare. This work proposes a DDPM-based point cloud global explainability method (DAM) that leverages Point Diffusion Transformer (PDT), a novel point-wise symmetric model, with dual-classifier guidance to generate high-quality global explanations. In addition, an adapted path gradient integration method for DAM is proposed, which not only provides a global overview of the saliency maps for point cloud categories, but also sheds light on how the attributions of the explanations vary during the generation process. Extensive experiments indicate that our method outperforms existing ones in terms of perceptibility, representativeness, and diversity, with a significant reduction in generation time. Our code is available at: https://github.com/Explain3D/DAM


标题: Text Image Inpainting via Global Structure-Guided Diffusion Models

作者: Shipeng Zhu, Pengfei Fang, Chenjie Zhu

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.14832v1

GitHub: https://github.com/blackprotoss/GSDM|

中文摘要: 真实世界的文本可能会因环境或人为因素导致的腐蚀问题而损坏,这些腐蚀问题会阻碍文本完整风格的保存,例如纹理和结构。这些腐蚀问题,例如涂鸦标志和不完整的签名,给理解文本带来困难,从而对下游应用,例如场景文本识别和签名识别带来重大挑战。值得注意的是,当前的修复技术通常不能充分解决这个问题,并且难以恢复准确的文本图像以及合理和一致的样式。本文将此表述为文本图像修复的一个公开问题,旨在建立一个基准来促进其研究。在此过程中,我们建立了两个特定的文本修复数据集,分别包含场景文本图像和手写文本图像。它们中的每一个都包括由现实生活和合成数据集修改的图像,以成对的原始图像、损坏的图像和其他辅助信息为特色。在数据集的基础上,我们进一步开发了一个新的神经框架,全局结构引导扩散模型(GSDM),作为一个潜在的解决方案。利用文本的全局结构作为先验,所提出的GSDM开发了一个有效的扩散模型来恢复干净的文本。我们的方法的有效性通过彻底的实证研究得到了证明,包括识别准确性和图像质量的显著提高。这些发现不仅突出了我们的方法的有效性,而且强调了它在更广泛的文本图像理解和处理领域的潜力。代码和数据集可从以下网址获得:https://github.com/blackprotoss/GSDM。

摘要: Real-world text can be damaged by corrosion issues caused by environmental or human factors, which hinder the preservation of the complete styles of texts, e.g., texture and structure. These corrosion issues, such as graffiti signs and incomplete signatures, bring difficulties in understanding the texts, thereby posing significant challenges to downstream applications, e.g., scene text recognition and signature identification. Notably, current inpainting techniques often fail to adequately address this problem and have difficulties restoring accurate text images along with reasonable and consistent styles. Formulating this as an open problem of text image inpainting, this paper aims to build a benchmark to facilitate its study. In doing so, we establish two specific text inpainting datasets which contain scene text images and handwritten text images, respectively. Each of them includes images revamped by real-life and synthetic datasets, featuring pairs of original images, corrupted images, and other assistant information. On top of the datasets, we further develop a novel neural framework, Global Structure-guided Diffusion Model (GSDM), as a potential solution. Leveraging the global structure of the text as a prior, the proposed GSDM develops an efficient diffusion model to recover clean texts. The efficacy of our approach is demonstrated by thorough empirical study, including a substantial boost in both recognition accuracy and image quality. These findings not only highlight the effectiveness of our method but also underscore its potential to enhance the broader field of text image understanding and processing. Code and datasets are available at: https://github.com/blackprotoss/GSDM.


标题: Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support

作者: Xiaojun Wu, Dixiang Zhang, Ruyi Gan

PubTime: 2024-01-26

Downlink: http://arxiv.org/abs/2401.14688v1

Project: https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/|

中文摘要: 文本到图像模型的最新进展显著增强了图像生成能力,但开源模型在双语或中文支持方面仍存在明显差距。为了满足这一需求,我们提出了一种新的中英双语文本到图像模型Taiyi-Diffusion-XL,它通过双语持续预训练过程扩展CLIP和Stable-Diffusion-XL的能力而来。该方法包括将最常用的汉字整合到CLIP的分词器和嵌入层中以高效扩展词表,并配合绝对位置编码的扩展。此外,我们利用大型视觉语言模型来丰富文本提示,从而得到更好的图像描述并带来更高的视觉质量。这些增强随后被应用到下游的文本到图像模型中。实证结果表明,所开发的CLIP模型在双语图文检索方面表现出色;同时,Taiyi-Diffusion-XL的双语图像生成能力超越了以往模型。这项研究带来了Taiyi-Diffusion-XL模型的开发和开源,代表了图像生成领域尤其是中文应用方面的显著进步。这一贡献也是满足多模态研究中更多样化语言支持需求的一步。模型和演示可在https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/公开获取,以促进该领域的进一步研究与合作。

摘要: Recent advancements in text-to-image models have significantly enhanced image generation capabilities, yet a notable gap of open-source models persists in bilingual or Chinese language support. To address this need, we present Taiyi-Diffusion-XL, a new Chinese and English bilingual text-to-image model which is developed by extending the capabilities of CLIP and Stable-Diffusion-XL through a process of bilingual continuous pre-training. This approach includes the efficient expansion of vocabulary by integrating the most frequently used Chinese characters into CLIP’s tokenizer and embedding layers, coupled with an absolute position encoding expansion. Additionally, we enrich text prompts with a large vision-language model, leading to better image captions and higher visual quality. These enhancements are subsequently applied to downstream text-to-image models. Our empirical results indicate that the developed CLIP model excels in bilingual image-text retrieval. Furthermore, the bilingual image generation capabilities of Taiyi-Diffusion-XL surpass previous models. This research leads to the development and open-sourcing of the Taiyi-Diffusion-XL model, representing a notable advancement in the field of image generation, particularly for Chinese language applications. This contribution is a step forward in addressing the need for more diverse language support in multimodal research. The model and demonstration are made publicly available at https://huggingface.co/IDEA-CCNL/Taiyi-Stable-Diffusion-XL-3.5B/, fostering further research and collaboration in this domain.


标题: pix2gestalt: Amodal Segmentation by Synthesizing Wholes

作者: Ege Ozguroglu, Ruoshi Liu, Dídac Surís

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2401.14398v1

Project: https://gestalt.cs.columbia.edu/|

中文摘要: 我们介绍了pix2gestalt,这是一个用于零样本非模态分割(amodal segmentation)的框架,它学习估计在遮挡后仅部分可见的完整物体的形状和外观。通过利用大规模扩散模型并将其表示迁移到该任务,我们学习了一个条件扩散模型,用于在具有挑战性的零样本情形下重建完整物体,包括打破自然和物理先验的例子(如艺术作品)。作为训练数据,我们使用一个合成构建的数据集,其中包含与其完整形态配对的被遮挡物体。实验表明,我们的方法在既有基准上优于有监督的基线。此外,我们的模型还可用于在存在遮挡的情况下显著提升现有物体识别和3D重建方法的性能。

摘要: We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task, we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases, including examples that break natural and physical priors, such as art. As training data, we use a synthetically curated dataset containing occluded objects paired with their whole counterparts. Experiments show that our approach outperforms supervised baselines on established benchmarks. Our model can furthermore be used to significantly improve the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions.


标题: Disentanglement in a GAN for Unconditional Speech Synthesis

作者: Matthew Baas, Herman Kamper

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2307.01673v2

GitHub: https://github.com/RF5/simple-asgan/|

中文摘要: 我们能否开发一种无需显式条件输入、直接从潜在空间合成逼真语音的模型?尽管过去十年已有多次尝试,以往基于对抗和扩散的方法仍难以做到这一点,即使是在小词表数据集上。为此,我们提出了AudioStyleGAN(ASGAN),一个面向无条件语音合成、旨在学习解耦潜在空间的生成对抗网络。基于StyleGAN系列图像合成模型,ASGAN将采样噪声映射到一个解耦的潜在向量,再将其映射为一系列音频特征,从而在每一层抑制信号混叠。为了成功训练ASGAN,我们引入了许多新技术,包括对自适应判别器增强的一项修改,即以一定概率跳过判别器更新。我们将其应用于小词表的Google Speech Commands数字数据集,在无条件语音合成上取得了最先进的结果,同时也比现有性能最好的扩散模型快得多。我们证实ASGAN的潜在空间是解耦的:我们展示了如何利用该空间中简单的线性运算来完成若干训练中未见过的任务。具体而言,我们在语音转换、语音增强、说话人验证和关键词分类上进行了评估。我们的工作表明,GAN在无条件语音合成领域仍然具有很强的竞争力,并且解耦的潜在空间可用于帮助泛化到未见过的任务。代码、模型、示例:https://github.com/RF5/simple-asgan/

摘要: Can we develop a model that can synthesize realistic speech directly from a latent space, without explicit conditioning? Despite several efforts over the last decade, previous adversarial and diffusion-based approaches still struggle to achieve this, even on small-vocabulary datasets. To address this, we propose AudioStyleGAN (ASGAN) – a generative adversarial network for unconditional speech synthesis tailored to learn a disentangled latent space. Building upon the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a disentangled latent vector which is then mapped to a sequence of audio features so that signal aliasing is suppressed at every layer. To successfully train ASGAN, we introduce a number of new techniques, including a modification to adaptive discriminator augmentation which probabilistically skips discriminator updates. We apply it on the small-vocabulary Google Speech Commands digits dataset, where it achieves state-of-the-art results in unconditional speech synthesis. It is also substantially faster than existing top-performing diffusion models. We confirm that ASGAN’s latent space is disentangled: we demonstrate how simple linear operations in the space can be used to perform several tasks unseen during training. Specifically, we perform evaluations in voice conversion, speech enhancement, speaker verification, and keyword classification. Our work indicates that GANs are still highly competitive in the unconditional speech synthesis landscape, and that disentangled latent spaces can be used to aid generalization to unseen tasks. Code, models, samples: https://github.com/RF5/simple-asgan/
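The "simple linear operations" in the disentangled latent space can be illustrated with a toy sketch: interpolate between latent vectors and add attribute directions before decoding. The `generator` call is left as a placeholder for the pretrained ASGAN decoder available in the linked repo, and the attribute direction is a hypothetical stand-in.

```python
# Toy sketch of linear latent-space edits (illustrative; decoding requires the ASGAN generator
# and checkpoints from the repository linked above).
import torch

def lerp(w1, w2, alpha):
    """Linear interpolation between two latent vectors."""
    return (1 - alpha) * w1 + alpha * w2

def apply_direction(w, direction, strength=1.0):
    """Move a latent along a learned attribute direction (e.g. speaker identity)."""
    return w + strength * direction / direction.norm()

latent_dim = 512
w_a, w_b = torch.randn(latent_dim), torch.randn(latent_dim)
speaker_dir = torch.randn(latent_dim)      # would normally be estimated from labeled latents

for alpha in (0.0, 0.5, 1.0):
    w = apply_direction(lerp(w_a, w_b, alpha), speaker_dir, strength=2.0)
    # audio = generator(w)  # decode with the pretrained ASGAN generator (not included here)
    print(alpha, w.shape)
```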


标题: Diffusion Model for Dense Matching

作者: Jisu Nam, Gyuseong Lee, Sunwoo Kim

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2305.19094v2

Project: https://ku-cvlab.github.io/DiffMatch/|https://ku-cvlab.github.io/DiffMatch/|

摘要: The objective for establishing dense correspondence between paired images consists of two terms: a data term and a prior term. While conventional techniques focused on defining hand-designed prior terms, which are difficult to formulate, recent approaches have focused on learning the data term with deep neural networks without explicitly modeling the prior, assuming that the model itself has the capacity to learn an optimal prior from a large-scale dataset. The performance improvement was obvious, however, they often fail to address inherent ambiguities of matching, such as textureless regions, repetitive patterns, and large displacements. To address this, we propose DiffMatch, a novel conditional diffusion-based framework designed to explicitly model both the data and prior terms. Unlike previous approaches, this is accomplished by leveraging a conditional denoising diffusion model. DiffMatch consists of two main components: conditional denoising diffusion module and cost injection module. We stabilize the training process and reduce memory usage with a stage-wise training strategy. Furthermore, to boost performance, we introduce an inference technique that finds a better path to the accurate matching field. Our experimental results demonstrate significant performance improvements of our method over existing approaches, and the ablation studies validate our design choices along with the effectiveness of each component. Project page is available at https://ku-cvlab.github.io/DiffMatch/.


== VLN ==

标题: ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

作者: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13311v1

Project: https://con-textual.github.io/|

中文摘要: 人工智能的最新进展催生了大型多模态模型(LMM),它们能够处理涉及对图像中文本和视觉内容进行联合推理的复杂任务(例如,在公共场所查看地图导航)。本文介绍了ConTextual,这是一个新颖的基准,包含明确设计的指令,用于评估LMM执行上下文敏感、富文本视觉推理的能力。ConTextual强调多样的真实世界场景(例如读取时间、导航、购物等),要求更深入地理解文本与视觉元素之间的交互。我们的研究发现,在人工评估下,表现最好的LMM(GPT-4V(ision))与人类能力之间存在30.8%的显著性能差距,表明在上下文敏感的富文本视觉推理方面仍有很大的改进空间。值得注意的是,虽然GPT-4V在表情包和名言解读等抽象类别中表现出色,但其整体表现仍落后于人类。除了人工评估,我们还采用了基于GPT-4的自动评估指标,发现了类似的性能差距趋势。我们还在不同的视觉情境下进行了细粒度评估,并提供了定性分析,为未来LMM设计的发展提供了一个坚实的框架。https://con-textual.github.io/

摘要: Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs’ ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis which provides a robust framework for future advancements in the LMM design. https://con-textual.github.io/


标题: SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization

作者: Mingyang Li, Yue Ma, Qinru Qiu

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.13076v1

GitHub: https://github.com/Leomingyangli/SemanticSLAM|

中文摘要: 视觉同步定位和绘图(VSLAM)中的当前技术通过比较连续场景的图像特征来估计相机位移。这些算法依赖于场景的连续性,因此需要频繁的摄像机输入。然而,频繁处理图像会导致大量的内存使用和计算开销。在这项研究中,我们介绍了SemanticSLAM,这是一个端到端的视觉惯性里程计系统,它利用了从RGB-D传感器提取的语义特征。这种方法能够创建环境的语义图,并确保可靠的相机定位。SemanticSLAM是场景不可知的,这意味着它不需要针对不同的环境进行重新训练。它可以在室内环境中有效地工作,即使没有频繁的摄像机输入,也不需要事先知道。SemanticSLAM的优势在于它能够逐步细化语义图并改进姿态估计。这是通过卷积长短期记忆(ConvLSTM)网络实现的,该网络经过训练可以在地图构建过程中纠正错误。与现有的VSLAM算法相比,SemanticSLAM将姿态估计提高了17%。由此产生的语义图提供了关于环境的可解释信息,并且可以容易地应用于各种下游任务,例如路径规划、避障和机器人导航。该代码将在以下网址公开:https://github.com/Leomingyangli/SemanticSLAM

摘要: Current techniques in Visual Simultaneous Localization and Mapping (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity, hence requires frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it doesn’t require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long-short-term-memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at https://github.com/Leomingyangli/SemanticSLAM


标题: Long-Tailed 3D Detection via 2D Late Fusion

作者: Yechi Ma, Neehar Peri, Shuoquan Wei

PubTime: 2024-01-25

Downlink: http://arxiv.org/abs/2312.10986v2

中文摘要: 为了安全导航,自动驾驶汽车(AV)必须准确检测常见类别和稀有类别的物体,由此引出了长尾3D物体检测(LT3D)问题。当前基于激光雷达的3D检测器在稀有类别上表现不佳(例如,CenterPoint在婴儿车类别上仅达到5.1 AP),因为仅凭稀疏的激光雷达点很难识别物体。RGB图像提供了有助于消除这种歧义的视觉证据,从而推动了RGB-激光雷达融合的研究。在本文中,我们深入研究了一个简单的后期融合框架,它将独立训练的RGB检测器和激光雷达检测器进行集成。与最近需要配对多模态训练数据的端到端方法不同,我们的后期融合方法可以轻松利用大规模单模态数据集,显著改善稀有类别的检测。特别地,我们从基本原理出发考察了该后期融合框架中的三个关键要素,包括训练2D还是3D的RGB检测器、在3D空间还是投影后的2D图像平面中匹配RGB与激光雷达检测结果,以及如何融合匹配到的检测结果。大量实验表明,2D RGB检测器比3D RGB检测器具有更好的识别精度,在2D图像平面上进行匹配可以减轻深度估计误差,而以概率方式融合经过校准的分数能够带来最先进的LT3D性能。我们的后期融合方法在既有的nuScenes LT3D基准上达到了51.4 mAP,比之前的工作提高了5.9 mAP。

摘要: Autonomous vehicles (AVs) must accurately detect objects from both common and rare classes for safe navigation, motivating the problem of Long-Tailed 3D Object Detection (LT3D). Contemporary LiDAR-based 3D detectors perform poorly on rare classes (e.g., CenterPoint only achieves 5.1 AP on stroller) as it is difficult to recognize objects from sparse LiDAR points alone. RGB images provide visual evidence to help resolve such ambiguities, motivating the study of RGB-LiDAR fusion. In this paper, we delve into a simple late-fusion framework that ensembles independently trained RGB and LiDAR detectors. Unlike recent end-to-end methods which require paired multi-modal training data, our late-fusion approach can easily leverage large-scale uni-modal datasets, significantly improving rare class detection. In particular, we examine three critical components in this late-fusion framework from first principles, including whether to train 2D or 3D RGB detectors, whether to match RGB and LiDAR detections in 3D or the projected 2D image plane, and how to fuse matched detections. Extensive experiments reveal that 2D RGB detectors achieve better recognition accuracy than 3D RGB detectors, matching on the 2D image plane mitigates depth estimation errors, and fusing scores probabilistically with calibration leads to state-of-the-art LT3D performance. Our late-fusion approach achieves 51.4 mAP on the established nuScenes LT3D benchmark, improving over prior work by 5.9 mAP.
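A compact sketch of the late-fusion recipe argued for above: take LiDAR boxes already projected to the image plane, match them to 2D RGB detections by IoU, and fuse calibrated scores probabilistically. The probabilistic-OR rule and the thresholds below are illustrative choices, not necessarily the paper's.

```python
# Illustrative late-fusion sketch: IoU matching on the 2D image plane plus probabilistic
# combination of calibrated detection scores.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Box2D:
    x1: float; y1: float; x2: float; y2: float
    score: float            # assumed already calibrated to a probability
    label: str

def iou(a: Box2D, b: Box2D) -> float:
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter
    return inter / union if union > 0 else 0.0

def fuse(lidar_proj: List[Box2D], rgb: List[Box2D], iou_thr: float = 0.5) -> List[Box2D]:
    """lidar_proj: LiDAR 3D boxes already projected into the 2D image plane."""
    fused = []
    for lb in lidar_proj:
        best: Optional[Box2D] = max(
            (rb for rb in rgb if rb.label == lb.label and iou(lb, rb) >= iou_thr),
            key=lambda rb: iou(lb, rb), default=None)
        if best is None:
            fused.append(lb)                               # unmatched: keep the LiDAR detection
        else:
            p = 1 - (1 - lb.score) * (1 - best.score)      # probabilistic OR of calibrated scores
            fused.append(Box2D(lb.x1, lb.y1, lb.x2, lb.y2, p, lb.label))
    return fused

lidar = [Box2D(10, 10, 50, 60, 0.4, "stroller")]
cam = [Box2D(12, 11, 52, 58, 0.7, "stroller")]
print(fuse(lidar, cam)[0].score)   # 0.82
```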


标题: Multi-Object Navigation in real environments using hybrid policies

作者: Assem Sadek, Guillaume Bono, Boris Chidlovskii

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13800v1

中文摘要: 在机器人学中,导航问题传统上是通过SLAM与规划相结合来解决的。最近,除了航路点规划之外,涉及大量(视觉)高层推理成分的问题也在模拟环境中得到了探索,主要通过大规模机器学习来解决,特别是强化学习、离线强化学习或模仿学习。这些方法要求智能体学习各种技能,例如局部规划、物体建图以及查询所学的空间表示。与航路点规划(PointGoal)等较简单的任务相比,对于这些更复杂的任务,当前最先进的模型已经在模拟中得到了充分评估,但据我们所知,尚未在真实环境中进行评估。在这项工作中,我们重点关注sim2real迁移。我们针对具有挑战性的多目标导航(Multi-ON)任务,并将其移植到一个包含原始虚拟Multi-ON物体真实复制品的物理环境中。我们引入了一种混合导航方法,将问题分解为两种不同的技能:(1)航路点导航由经典SLAM结合符号规划器处理;(2)探索、语义建图和目标检索则由结合监督学习和强化学习训练的深度神经网络处理。我们在模拟和真实环境中展示了这种方法相对于端到端方法的优势,并在该任务上超越了SOTA。

摘要: Navigation has been classically solved in robotics through the combination of SLAM and planning. More recently, beyond waypoint planning, problems involving significant components of (visual) high-level reasoning have been explored in simulated environments, mostly addressed with large-scale machine learning, in particular RL, offline-RL or imitation learning. These methods require the agent to learn various skills like local planning, mapping objects and querying the learned spatial representations. In contrast to simpler tasks like waypoint planning (PointGoal), for these more complex tasks the current state-of-the-art models have been thoroughly evaluated in simulation but, to our best knowledge, not yet in real environments. In this work we focus on sim2real transfer. We target the challenging Multi-Object Navigation (Multi-ON) task and port it to a physical environment containing real replicas of the originally virtual Multi-ON objects. We introduce a hybrid navigation method, which decomposes the problem into two different skills: (1) waypoint navigation is addressed with classical SLAM combined with a symbolic planner, whereas (2) exploration, semantic mapping and goal retrieval are dealt with deep neural networks trained with a combination of supervised learning and RL. We show the advantages of this approach compared to end-to-end methods both in simulation and a real environment and outperform the SOTA for this task.


标题: VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

作者: Raphael Schumann, Wanrong Zhu, Weixi Feng

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2307.06082v2

中文摘要: 现实世界环境中的增量决策是具身人工智能中最具挑战性的任务之一。其中一个要求特别高的场景是视觉与语言导航(VLN),它既需要视觉和自然语言理解,也需要空间和时间推理能力。具身智能体需要将其对导航指令的理解落实到对真实世界环境(如街景)的观察中。尽管LLM在其他研究领域取得了令人印象深刻的成果,如何最好地将它们与交互式视觉环境连接起来仍是一个悬而未决的问题。在这项工作中,我们提出了VELMA,一个具身LLM智能体,它将轨迹和视觉环境观察的语言化描述作为下一步动作的上下文提示。视觉信息通过一个流水线进行语言化:该流水线从人工撰写的导航指令中提取地标,并使用CLIP判断它们在当前全景视图中的可见性。我们证明,仅凭两个上下文示例,VELMA就能够成功遵循街景中的导航指令。我们进一步在数千个示例上微调了该LLM智能体,在两个数据集上将任务完成率相对于此前最先进水平相对提升了25%-30%。

摘要: Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation~(VLN) which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, it is an ongoing problem of how to best connect them with an interactive visual environment. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
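The CLIP-based landmark-visibility step can be sketched with the Hugging Face CLIP API: score each landmark phrase extracted from the instruction against the current panorama view and keep those above a similarity threshold. The checkpoint, image path, and threshold are placeholders; the LLM-based landmark extraction and the full verbalization pipeline are not reproduced here.

```python
# Sketch of the landmark-visibility check described in the abstract (placeholder checkpoint,
# image path, and threshold; landmark extraction from instructions is assumed done elsewhere).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

landmarks = ["a church", "a red fire hydrant", "a bus stop"]    # assumed to come from the instruction
panorama = Image.open("current_view.jpg")                       # placeholder path to the current view

inputs = processor(text=landmarks, images=panorama, return_tensors="pt", padding=True)
with torch.no_grad():
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
img = img / img.norm(dim=-1, keepdim=True)
txt = txt / txt.norm(dim=-1, keepdim=True)
cos = (txt @ img.T).squeeze(-1)                                 # cosine similarity per landmark

visible = [lm for lm, s in zip(landmarks, cos.tolist()) if s > 0.25]   # threshold is illustrative
print("You see " + (", ".join(visible) if visible else "no listed landmark") + ".")
```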

