[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉导航

专属领域论文订阅

VX关注{晓理紫},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文。


== LLM ==

标题: VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

作者: Jing Yu Koh, Robert Lo, Lawrence Jang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13649v1

Project: https://jykoh.com/vwa

中文摘要: 能够在web上规划、推理和执行动作的自主代理为自动化计算机任务提供了一个有前途的途径。然而,大多数现有的基准主要集中在基于文本的代理上,忽略了许多需要视觉信息才能有效解决的自然任务。鉴于大多数计算机界面是面向人类感知设计的,视觉信息经常以纯文本模型难以有效利用的方式补充文本数据。为了弥补这一差距,我们引入了VisualWebArena,这是一个旨在评估多模态web代理在现实的视觉基础任务(visually grounded tasks)上性能的基准测试。VisualWebArena由一组多样且复杂的基于web的任务组成,用于评估自主多模态代理的各种能力。要在该基准上完成任务,代理需要准确地处理图像-文本输入,解释自然语言指令,并在网站上执行操作,以完成用户定义的目标。我们对最先进的基于LLM的自主代理进行了广泛的评估,包括几个多模态模型。通过广泛的定量和定性分析,我们确定了纯文本LLM代理的几个局限性,并揭示了最先进的多模态语言代理的能力差距。VisualWebArena提供了一个评估多模态自主语言代理的框架,并为构建更强大的web自主代理提供了见解。我们的代码、基线模型和数据已在https://jykoh.com/vwa公开。

摘要: Autonomous agents capable of planning, reasoning, and executing actions on the web offer a promising avenue for automating computer tasks. However, the majority of existing benchmarks primarily focus on text-based agents, neglecting many natural tasks that require visual information to effectively solve. Given that most computer interfaces cater to human perception, visual information often augments textual data in ways that text-only models struggle to harness effectively. To bridge this gap, we introduce VisualWebArena, a benchmark designed to assess the performance of multimodal web agents on realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set of diverse and complex web-based tasks that evaluate various capabilities of autonomous multimodal agents. To perform on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives. We conduct an extensive evaluation of state-of-the-art LLM-based autonomous agents, including several multimodal models. Through extensive quantitative and qualitative analysis, we identify several limitations of text-only LLM agents, and reveal gaps in the capabilities of state-of-the-art multimodal language agents. VisualWebArena provides a framework for evaluating multimodal autonomous language agents, and offers insights towards building stronger autonomous agents for the web. Our code, baseline models, and data is publicly available at https://jykoh.com/vwa.


标题: Large Malaysian Language Model Based on Mistral for Enhanced Local Language Understanding

作者: Husein Zolkepli, Aisyah Razak, Kamarul Adha

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13565v1

Project: https://huggingface.co/collections/mesolitica/malaysian-mistral-7b-6528f2ec825f4bba46c1700c

中文摘要: 在本文中,我们使用32.6 GB的数据集(相当于11亿个令牌)展示了大规模语言模型Mistral 7B在预训练方面的重大进展。我们探索了扩展上下文长度的影响,发布了上下文长度为4096和32768令牌的模型,并使用专门的16384上下文长度指令微调模型(我们称之为Malaysian Mistral)进一步优化了性能。我们的实验证明了继续预训练的有效性以及扩展上下文长度对Mistral 7B语言理解能力的影响。此外,我们发布了一个专门针对16384上下文长度指令进行微调的模型,展示了其捕捉细微语言复杂性的潜力。此外,我们的研究还把Malaysian Mistral与著名语言模型(包括ChatGPT3.5和Claude 2)进行了基准比较。我们给出了令人信服的结果,表明Malaysian Mistral在Tatabahasa(马来语语法)测试集上表现更优,特别是在经过指令微调之后。所有模型已发布于 https://huggingface.co/collections/mesolitica/malaysian-mistral-7b-6528f2ec825f4bba46c1700c

摘要: In this paper, we present significant advancements in the pretraining of Mistral 7B, a large-scale language model, using a dataset of 32.6 GB, equivalent to 1.1 billion tokens. We explore the impact of extending the context length, releasing models with context lengths of 4096 and 32768 tokens, and further refining performance with a specialized 16384 context length instruction-tuned model, we called it Malaysian Mistral. Our experiments demonstrate the efficacy of continue pretraining and the influence of extended context lengths on Mistral 7B’s language understanding capabilities. Additionally, we release a model specifically tuned with a 16384 context length instruction, showcasing its potential for capturing nuanced language intricacies. Furthermore, our research contributes to the benchmarking of Malaysian Mistral against prominent language models, including ChatGPT3.5 and Claude 2. We present compelling results indicating Malaysian Mistral’s superior performance on Tatabahasa (Malay grammar) test set, particularly when fine-tuned with instructions. All models released at https://huggingface.co/collections/mesolitica/malaysian-mistral-7b-6528f2ec825f4bba46c1700c
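下面给出一个按 transformers 常规用法加载该合集中模型做推理的最小示意(非官方示例);其中 model_id 只是假设的占位仓库名,实际名称请在上面的合集链接中查找。

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# 注意:下面的仓库名只是示意用的占位符,实际名称请在上面的合集页面中查找
model_id = "mesolitica/malaysian-mistral-7b-32768-instructions"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"  # device_map 需要安装 accelerate
)

prompt = "Terangkan tatabahasa ayat berikut: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```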


标题: SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation

作者: Dong Zhang, Xin Zhang, Jun Zhan

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13527v1

GitHub: https://github.com/0nutation/SpeechGPT

摘要: Benefiting from effective speech modeling, current Speech Large Language Models (SLLMs) have demonstrated exceptional capabilities in in-context speech generation and efficient generalization to unseen speakers. However, the prevailing information modeling process is encumbered by certain redundancies, leading to inefficiencies in speech generation. We propose Chain-of-Information Generation (CoIG), a method for decoupling semantic and perceptual information in large-scale speech generation. Building on this, we develop SpeechGPT-Gen, an 8-billion-parameter SLLM efficient in semantic and perceptual information modeling. It comprises an autoregressive model based on LLM for semantic information modeling and a non-autoregressive model employing flow matching for perceptual information modeling. Additionally, we introduce the novel approach of infusing semantic information into the prior distribution to enhance the efficiency of flow matching. Extensive experimental results demonstrate that SpeechGPT-Gen markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue, underscoring CoIG’s remarkable proficiency in capturing and modeling speech’s semantic and perceptual dimensions. Code and models are available at https://github.com/0nutation/SpeechGPT.
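为直观说明信息链生成(CoIG)"先语义、后感知"的两阶段流程,下面给出一个概念性草图(非官方实现);semantic_lm 与 flow_decoder 均为假设的接口。

```python
# 概念性示意(非官方实现):CoIG 把语音生成解耦为两个阶段。
# semantic_lm 和 flow_decoder 只是假设的接口,用来说明流程。
def coig_generate(semantic_lm, flow_decoder, text_tokens):
    semantic_units = semantic_lm.generate(text_tokens)  # 阶段一:自回归地生成语义信息
    waveform = flow_decoder(semantic_units)             # 阶段二:流匹配模型以语义为先验补全感知(声学)细节
    return waveform
```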


标题: InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

作者: Ryota Tanaka, Taichi Iki, Kyosuke Nishida

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13313v1

GitHub: https://github.com/nttmdlab-nlp/InstructDoc

中文摘要: 我们研究通过人类编写的指令在真实世界文档上完成各种视觉文档理解(VDU)任务(例如问答和信息提取)的问题。为此,我们提出了InstructDoc,这是第一个由30个公开可用的VDU数据集组成的大规模集合,每个数据集都带有统一格式的多样化指令,涵盖12类任务,并包含开放的文档类型/格式。此外,为了增强在VDU任务上的泛化性能,我们设计了一个新的基于指令的文档阅读和理解模型InstructDr,它通过一个可训练的桥接模块连接文档图像、图像编码器和大型语言模型(LLMs)。实验表明,InstructDr可以通过给定的指令有效地适应新的VDU数据集、任务和领域,并且在没有针对性训练的情况下优于现有的多模态LLMs和ChatGPT。

摘要: We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.
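摘要提到 InstructDr 通过可训练的桥接模块连接图像编码器与 LLM。下面是一个假设性的桥接模块极简草图(并非论文的实际结构),仅用来说明"把视觉特征投影到 LLM 嵌入空间"这一思路,其中维度与查询数均为示意取值。

```python
import torch
import torch.nn as nn

# 概念性示意(非官方实现):可训练桥接模块把冻结图像编码器的特征
# 投影到冻结 LLM 的嵌入空间,再与指令词元拼接后送入 LLM。
class BridgingModule(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, num_query=32):
        super().__init__()
        self.query = nn.Parameter(torch.randn(num_query, vision_dim))  # 可学习查询
        self.attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_feats):                      # [B, N_patch, vision_dim]
        q = self.query.unsqueeze(0).expand(image_feats.size(0), -1, -1)
        out, _ = self.attn(q, image_feats, image_feats)  # 跨注意力汇聚视觉信息
        return self.proj(out)                            # [B, num_query, llm_dim]

bridge = BridgingModule()
visual_tokens = bridge(torch.randn(2, 196, 1024))        # -> [2, 32, 4096],与文本词元拼接后送入 LLM
```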


标题: MaLA-500: Massive Language Adaptation of Large Language Models

作者: Peiqin Lin, Shaoxiong Ji, Jörg Tiedemann

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13303v1

Project: https://huggingface.co/MaLA-LM

中文摘要: 大型语言模型推进了自然语言处理的发展。然而,它们主要针对英语或有限的一组语言进行设计,这使得它们在低资源语言上的效果存在很大差距。为了弥补这一差距,我们引入了MaLA-500,这是一种新颖的大型语言模型,旨在覆盖多达534种语言。为了训练MaLA-500,我们在LLaMA 2上采用词表扩展,并用Glot500-c继续预训练。我们在SIB-200上的实验表明,MaLA-500达到了最先进的上下文学习效果。我们已在 https://huggingface.co/MaLA-LM 发布MaLA-500。

摘要: Large language models have advanced the state of the art in natural language processing. However, their predominant design for English or a limited set of languages creates a substantial gap in their effectiveness for low-resource languages. To bridge this gap, we introduce MaLA-500, a novel large language model designed to cover an extensive range of 534 languages. To train MaLA-500, we employ vocabulary extension and continued pretraining on LLaMA 2 with Glot500-c. Our experiments on SIB-200 show that MaLA-500 achieves state-of-the-art in-context learning results. We release MaLA-500 at https://huggingface.co/MaLA-LM
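摘要中的关键技术是"词表扩展 + 在 LLaMA 2 上继续预训练"。下面是该做法的常见流程示意(基于 transformers 的通用 API,并非论文的确切脚本);新增词元仅为占位符,加载 LLaMA 2 权重需要相应访问权限。

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# 常见流程示意(非论文确切脚本):先扩展分词器词表,再同步调整嵌入矩阵大小,
# 之后即可在多语言语料(如 Glot500-c)上继续预训练。
base = "meta-llama/Llama-2-7b-hf"           # 加载需要访问权限
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

new_tokens = ["<新语言词元1>", "<新语言词元2>"]   # 示意用占位词元
tokenizer.add_tokens(new_tokens)                 # 词表扩展
model.resize_token_embeddings(len(tokenizer))    # 同步扩展嵌入与输出层
```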


标题: Reward Engineering for Generating Semi-structured Explanation

作者: Jiuzhou Han, Wray Buntine, Ehsan Shareghi

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2309.08347v2

GitHub: https://github.com/Jiuzhouh/Reward-Engineering-for-Generating-SEG

中文摘要: 半结构化解释用显式表示描述了推理器的隐式过程。这个解释强调了如何利用特定查询中的可用信息,并用推理器从其内部权重产生的信息来补充信息,以生成答案。尽管最近语言模型的生成能力有所提高,但产生结构化的解释来验证模型的真实推理能力仍然是一个挑战。这个问题对于不太大的LMs(例如,FLAN-T5-XXL)尤其明显。在这项工作中,我们首先强调了监督微调(SFT)在应对这一挑战方面的局限性,然后在强化学习(RL)中引入了一种精心制作的奖励工程方法,以更好地解决这一问题。我们研究了多种奖励聚合方法,并提供了详细的讨论,揭示了RL在未来研究中的潜力。我们提出的方法在两个半结构化解释生成基准(ExplaGraph和COPA-SSE)上取得了新的最新结果。

摘要: Semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. This explanation highlights how available information in a specific query is utilised and supplemented with information a reasoner produces from its internal weights towards generating an answer. Despite the recent improvements in generative capabilities of language models, producing structured explanations to verify a model’s true reasoning capabilities remains a challenge. This issue is particularly pronounced for not-so-large LMs (e.g., FLAN-T5-XXL). In this work, we first underscore the limitations of supervised fine-tuning (SFT) in tackling this challenge, and then introduce a carefully crafted reward engineering method in reinforcement learning (RL) to better address this problem. We investigate multiple reward aggregation methods and provide a detailed discussion which sheds light on the promising potential of RL for future research. Our proposed method on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE) achieves new state-of-the-art results.


== VLM ==

标题: CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction

作者: Size Wu, Wenwei Zhang, Lumin Xu

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2310.01403v2

GitHub: https://github.com/wusize/CLIPSelf

中文摘要: 包括目标检测和图像分割在内的开放词汇密集预测任务,已经借助对比语言-图像预训练(CLIP)的成功得到了推进。CLIP模型,特别是那些采用视觉Transformer(ViT)的模型,在零样本图像分类中表现出显著的泛化能力。然而,对于开放词汇密集预测任务,当将CLIP的视觉-语言对齐从全局图像表示迁移到局部区域表示时,CLIP ViT会遭受从完整图像到局部图像区域的域偏移。在本文中,我们对CLIP模型中的区域-语言对齐进行了深入分析,这对于下游开放词汇密集预测任务至关重要。随后,我们提出了一种名为CLIPSelf的方法,该方法在不需要任何区域-文本对的情况下,使CLIP ViT的图像级识别能力适应局部图像区域。CLIPSelf通过将从其密集特征图中提取的区域表示与相应图像裁剪的图像级表示对齐,让ViT对自身进行蒸馏。借助增强的CLIP ViT,我们在各种基准测试中实现了开放词汇目标检测、语义分割和全景分割的最新性能。模型和代码发布于https://github.com/wusize/CLIPSelf。

摘要: Open-vocabulary dense prediction tasks including object detection and image segmentation have been advanced by the success of Contrastive Language-Image Pre-training (CLIP). CLIP models, particularly those incorporating vision transformers (ViTs), have exhibited remarkable generalization ability in zero-shot image classification. However, when transferring the vision-language alignment of CLIP from global image representation to local region representation for the open-vocabulary dense prediction tasks, CLIP ViTs suffer from the domain shift from full images to local image regions. In this paper, we embark on an in-depth analysis of the region-language alignment in CLIP models, which is essential for downstream open-vocabulary dense prediction tasks. Subsequently, we propose an approach named CLIPSelf, which adapts the image-level recognition ability of CLIP ViT to local image regions without needing any region-text pairs. CLIPSelf empowers ViTs to distill itself by aligning a region representation extracted from its dense feature map with the image-level representation of the corresponding image crop. With the enhanced CLIP ViTs, we achieve new state-of-the-art performance on open-vocabulary object detection, semantic segmentation, and panoptic segmentation across various benchmarks. Models and code are released at https://github.com/wusize/CLIPSelf.
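CLIPSelf 的核心是自蒸馏:把学生 ViT 密集特征图中按区域池化出的表示,与教师 CLIP 对相应图像裁剪给出的图像级表示对齐。下面是该对齐损失的一个概念性草图(非官方实现),张量形状与细节均为示意假设。

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

# 概念性示意(非官方实现):用余弦相似度把区域池化特征对齐到对应裁剪的图像级嵌入。
def clipself_loss(dense_feat, crop_embeds, boxes, spatial_scale):
    # dense_feat: [B, C, H, W] 学生 ViT 的密集特征图
    # crop_embeds: [N, C] 教师 CLIP 对各裁剪区域的图像级嵌入
    # boxes: [N, 5] (batch_idx, x1, y1, x2, y2),以输入图像坐标表示
    region_feat = roi_align(dense_feat, boxes, output_size=1,
                            spatial_scale=spatial_scale).flatten(1)   # [N, C]
    region_feat = F.normalize(region_feat, dim=-1)
    crop_embeds = F.normalize(crop_embeds, dim=-1)
    return (1 - (region_feat * crop_embeds).sum(-1)).mean()           # 1 - 余弦相似度
```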


标题: SciMMIR: Benchmarking Scientific Multi-modal Information Retrieval

作者: Siwei Wu, Yizhi Li, Kang Zhu

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13478v1

GitHub: https://github.com/Wusiwei0410/SciMMIR

中文摘要: 多模态信息检索(MMIR)是一个快速发展的领域,通过先进的表示学习和跨模态对齐研究,已经取得了重大进展,特别是在图像-文本配对方面。然而,当前用于评估科学领域内图像-文本配对MMIR性能的基准存在明显空白:用学术语言描述的图表和表格图像通常没有得到重视。为了弥合这一差距,我们利用开放获取的论文集合提取与科学领域相关的数据,构建了一个专门的科学MMIR(SciMMIR)基准。该基准由53万对精心筛选的图像-文本对组成,提取自科学文献中带有详细标题的图和表。我们进一步为图像-文本对标注了两级"子集-子类别"层次注释,以便对基线进行更全面的评估。我们对著名的多模态图像描述和视觉语言模型(如CLIP和BLIP)进行了零样本和微调评估。我们的分析为科学领域的MMIR提供了重要的见解,包括预训练和微调设置的影响以及视觉和文本编码器的影响。我们所有的数据和检查点均已公开于https://github.com/Wusiwei0410/SciMMIR。

摘要: Multi-modal information retrieval (MMIR) is a rapidly evolving field, where significant progress, particularly in image-text pairing, has been made through advanced representation learning and cross-modality alignment research. However, current benchmarks for evaluating MMIR performance in image-text pairing within the scientific domain show a notable gap, where chart and table images described in scholarly language usually do not play a significant role. To bridge this gap, we develop a specialised scientific MMIR (SciMMIR) benchmark by leveraging open-access paper collections to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions in scientific documents. We further annotate the image-text pairs with two-level subset-subcategory hierarchy annotations to facilitate a more comprehensive evaluation of the baselines. We conducted zero-shot and fine-tuning evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP and BLIP. Our analysis offers critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the influence of the visual and textual encoders. All our data and checkpoints are publicly available at https://github.com/Wusiwei0410/SciMMIR.


标题: InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions

作者: Ryota Tanaka, Taichi Iki, Kyosuke Nishida

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13313v1

GitHub: https://github.com/nttmdlab-nlp/InstructDoc

中文摘要: 我们研究通过人类编写的指令在真实世界文档上完成各种视觉文档理解(VDU)任务(例如问答和信息提取)的问题。为此,我们提出了InstructDoc,这是第一个由30个公开可用的VDU数据集组成的大规模集合,每个数据集都带有统一格式的多样化指令,涵盖12类任务,并包含开放的文档类型/格式。此外,为了增强在VDU任务上的泛化性能,我们设计了一个新的基于指令的文档阅读和理解模型InstructDr,它通过一个可训练的桥接模块连接文档图像、图像编码器和大型语言模型(LLMs)。实验表明,InstructDr可以通过给定的指令有效地适应新的VDU数据集、任务和领域,并且在没有针对性训练的情况下优于现有的多模态LLMs和ChatGPT。

摘要: We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.


标题: ChatterBox: Multi-round Multimodal Referring and Grounding

作者: Yunjie Tian, Tianren Ma, Lingxi Xie

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13307v1

GitHub: https://github.com/sunsmarterjie/ChatterBox

摘要: In this study, we establish a baseline for a new task named multimodal multi-round referring and grounding (MRG), opening up a promising direction for instance-level multimodal dialogues. We present a new benchmark and an efficient vision-language model for this purpose. The new benchmark, named CB-300K, spans challenges including multi-round dialogue, complex spatial relationships among multiple instances, and consistent reasoning, which are beyond those shown in existing benchmarks. The proposed model, named ChatterBox, utilizes a two-branch architecture to collaboratively handle vision and language tasks. By tokenizing instance regions, the language branch acquires the ability to perceive referential information. Meanwhile, ChatterBox feeds a query embedding in the vision branch to a token receiver for visual grounding. A two-stage optimization strategy is devised, making use of both CB-300K and auxiliary external data to improve the model’s stability and capacity for instance-level understanding. Experiments show that ChatterBox outperforms existing models in MRG both quantitatively and qualitatively, paving a new path towards multimodal dialogue scenarios with complicated and precise interactions. Code, data, and model are available at: https://github.com/sunsmarterjie/ChatterBox.


标题: Building Universal Foundation Models for Medical Image Analysis with Spatially Adaptive Networks

作者: Lingxiao Luo, Xuanzhong Chen, Bingda Tang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2312.07630v2

GitHub: https://github.com/function2-llx/PUMIT

中文摘要: 基础模型的最新进展,通常通过在大规模和多样化数据集上的自监督学习进行训练,在医学图像分析中显示出巨大的潜力。然而,由于医学成像数据显著的空间异质性,当前的模型必须为不同的数据集定制特定的结构,这使得利用丰富的未标注数据变得困难。在这项工作中,我们提出了一个用于医学图像分析的通用基础模型,它使用统一的结构处理具有异构空间属性的图像。为了实现这一点,我们提出了空间自适应网络(SPAD-Nets),这是一类能根据输入图像的空间属性动态调整结构的网络,用于构建这样一个通用基础模型。我们先预训练了一个空间自适应视觉分词器(SPAD-VT),再通过掩码图像建模(MIM)在55个公共医学图像数据集上预训练了空间自适应视觉Transformer(SPAD-ViT)。预训练数据包括超过900万个图像切片,据我们所知,这是迄今用于预训练医学图像分析通用基础模型的最大、最全面和最多样化的数据集。在下游医学图像分类和分割任务上的实验结果表明了我们模型的优异性能和标签效率。我们的代码可从https://github.com/function2-llx/PUMIT获得。

摘要: Recent advancements in foundation models, typically trained with self-supervised learning on large-scale and diverse datasets, have shown great potential in medical image analysis. However, due to the significant spatial heterogeneity of medical imaging data, current models must tailor specific structures for different datasets, making it challenging to leverage the abundant unlabeled data. In this work, we propose a universal foundation model for medical image analysis that processes images with heterogeneous spatial properties using a unified structure. To accomplish this, we propose spatially adaptive networks (SPAD-Nets), a family of networks that dynamically adjust the structures to adapt to the spatial properties of input images, to build such a universal foundation model. We pre-train a spatial adaptive visual tokenizer (SPAD-VT) and then a spatial adaptive Vision Transformer (SPAD-ViT) via masked image modeling (MIM) on 55 public medical image datasets. The pre-training data comprises over 9 million image slices, representing the largest, most comprehensive, and most diverse dataset to our knowledge for pre-training universal foundation models for medical image analysis. The experimental results on downstream medical image classification and segmentation tasks demonstrate the superior performance and label efficiency of our model. Our code is available at https://github.com/function2-llx/PUMIT.


标题: MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

作者: Chenyu Wang, Weixin Luo, Qianyu Chen

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.10727v2

GitHub: https://github.com/MLLM-Tool/MLLM-Tool

中文摘要: 最近,大型语言模型(LLMs)在自然语言理解和生成任务中的惊人性能引发了许多将其用作中央控制器来构建代理系统的探索。多项研究侧重于将LLMs与外部工具连接起来,以扩展应用场景。然而,目前LLMs感知工具使用的能力局限于单一的文本查询,这可能导致对用户真实意图的理解存在歧义。LLMs被期望通过感知基于视觉或听觉的指令信息来消除这种歧义。因此,在本文中,我们提出了MLLM-Tool,这是一个结合了开源LLMs和多模态编码器的系统,使训练后的LLM能够感知多模态输入指令,进而正确地选择功能匹配的工具。为了便于评估模型的能力,我们从HuggingFace收集了一个由多模态输入工具组成的数据集。我们数据集的另一个重要特征是,由于相同功能和同义功能的存在,数据集还包含同一指令的多个潜在选择,这为同一查询提供了更多潜在的解决方案。实验表明,我们的MLLM-Tool能够为多模态指令推荐合适的工具。代码和数据见https://github.com/MLLM-Tool/MLLM-Tool。

摘要: Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks triggered lots of exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs’ perceiving tool-use ability is limited to a single text query, which may result in ambiguity in understanding the users’ real intentions. LLMs are expected to eliminate that by perceiving the visual- or auditory-grounded instructions’ information. Therefore, in this paper, we propose MLLM-Tool, a system incorporating open-source LLMs and multi-modal encoders so that the learnt LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. To facilitate the evaluation of the model’s capability, we collect a dataset featured by consisting of multi-modal input tools from HuggingFace. Another important feature of our dataset is that our dataset also contains multiple potential choices for the same instruction due to the existence of identical functions and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our MLLM-Tool is capable of recommending appropriate tools for multi-modal instructions. Codes and data are available at https://github.com/MLLM-Tool/MLLM-Tool.


== diffusion model ==

标题: MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation

作者: Nhat M. Hoang, Kehong Gong, Chuan Guo

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.11115v3

Project: https://nhathoang2002.github.io/MotionMix-page/

中文摘要: 随着世界拥抱数字化转型,3D人体运动的可控生成成为一个重要话题。现有工作虽然随着扩散模型的出现取得了有希望的进展,但严重依赖于精心采集和标注(例如文本)的高质量运动语料库,而这在现实世界中是一项资源密集型工作。这促使我们提出MotionMix,一个简单而有效的弱监督扩散模型,它同时利用带噪声的和无标注的运动序列。具体来说,我们将扩散模型的去噪目标分为两个阶段:在最初的 $T-T^*$ 步中通过学习带噪声的标注运动获得条件化的粗糙运动近似,随后在最后的 $T^*$ 步中使用无标注运动对这些初步运动进行无条件细化。值得注意的是,尽管从两种不完美的数据源学习,但与能访问高质量标注数据的完全监督方法相比,我们的模型并不损失运动生成质量。在多个基准上的大量实验表明,我们的MotionMix作为一个多功能框架,在文本到动作、动作到动作和音乐到舞蹈任务上始终取得最先进的性能。项目页面:https://nhathoang2002.github.io/MotionMix-page/

摘要: Controllable generation of 3D human motions becomes an important topic as the world embraces digital transformation. Existing works, though making promising progress with the advent of diffusion models, heavily rely on meticulously captured and annotated (e.g., text) high-quality motion corpus, a resource-intensive endeavor in the real world. This motivates our proposed MotionMix, a simple yet effective weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences. Specifically, we separate the denoising objectives of a diffusion model into two stages: obtaining conditional rough motion approximations in the initial $T-T^*$ steps by learning the noisy annotated motions, followed by the unconditional refinement of these preliminary motions during the last $T^*$ steps using unannotated motions. Notably, though learning from two sources of imperfect data, our model does not compromise motion generation quality compared to fully supervised approaches that access gold data. Extensive experiments on several benchmarks demonstrate that our MotionMix, as a versatile framework, consistently achieves state-of-the-art performances on text-to-motion, action-to-motion, and music-to-dance tasks. Project page: https://nhathoang2002.github.io/MotionMix-page/
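MotionMix 的两阶段去噪调度可以用一个极简的采样循环来说明:前 T-T* 步使用条件去噪得到粗糙运动,最后 T* 步转为无条件细化。以下草图基于这一描述(非官方实现),denoise 为假设的去噪网络接口。

```python
import torch

# 极简示意(非官方实现):denoise(x_t, t, cond) 是训练好的扩散去噪网络,
# cond=None 表示走无条件分支。前 T-T* 步条件去噪,最后 T* 步无条件细化。
def motionmix_sample(denoise, x_T, cond, T=1000, T_star=200):
    x = x_T
    for t in reversed(range(T)):
        use_cond = t >= T_star          # t 从 T-1 降到 T*:共 T-T* 步使用条件
        c = cond if use_cond else None  # 其余 T* 步无条件细化
        x = denoise(x, t, c)            # 每步返回 x_{t-1} 的估计
    return x
```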


标题: Diffusion Model Based Posterior Sampling for Noisy Linear Inverse Problems

作者: Xiangming Meng, Yoshiyuki Kabashima

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2211.12343v3

GitHub: https://github.com/mengxiangming/dmps

中文摘要: 我们考虑了普遍存在的具有加性高斯噪声的线性逆问题,并提出了一种称为基于扩散模型的后验采样(DMPS)的无监督采样方法,以从噪声线性测量中重建未知信号。具体地,使用一个扩散模型(DM)作为隐式先验,执行后验采样的基本困难是噪声扰动的似然分数,即退火似然函数的梯度是难以处理的。为了避免这个问题,我们引入了一个简单而有效的封闭形式近似,使用一个无信息的先验假设。在各种噪声线性逆问题上进行了大量的实验,例如噪声超分辨率、去噪、去模糊和彩色化。在所有任务中,所提出的DMPS在各种任务上表现出高度竞争性甚至更好的性能,同时比最先进的竞争对手扩散后验采样(DPS)快3倍。

摘要: We consider the ubiquitous linear inverse problems with additive Gaussian noise and propose an unsupervised sampling approach called diffusion model based posterior sampling (DMPS) to reconstruct the unknown signal from noisy linear measurements. Specifically, using one diffusion model (DM) as an implicit prior, the fundamental difficulty in performing posterior sampling is that the noise-perturbed likelihood score, i.e., gradient of an annealed likelihood function, is intractable. To circumvent this problem, we introduce a simple yet effective closed-form approximation using an uninformative prior assumption. Extensive experiments are conducted on a variety of noisy linear inverse problems such as noisy super-resolution, denoising, deblurring, and colorization. In all tasks, the proposed DMPS demonstrates highly competitive or even better performances on various tasks while being 3 times faster than the state-of-the-art competitor diffusion posterior sampling (DPS).
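作为说明(基于摘要中提到的贝叶斯分解,并非论文的完整推导):后验得分可分解为先验得分与似然得分之和,

$$\nabla_{x_t}\log p(x_t\mid y)=\nabla_{x_t}\log p(x_t)+\nabla_{x_t}\log p(y\mid x_t)$$

其中第一项由扩散模型(先验)给出,第二项即噪声扰动的似然得分,本身不可解析求解;DMPS 正是在无信息先验假设下对这一项给出封闭形式近似。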


标题: Compositional Generative Inverse Design

作者: Tailin Wu, Takashi Maruyama, Long Wei

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13171v1

GitHub: https://github.com/AI4Science-WestlakeU/cindm

中文摘要: 逆设计是一个重要问题,出现在从机械工程到航空航天工程等诸多领域,其目标是设计输入变量以优化某个潜在的目标函数。逆设计通常被表述为优化问题,最近的工作利用了在学习到的动力学模型上进行优化。然而,随着优化的进行,模型往往会陷入对抗模式,从而妨碍有效的采样。我们说明,通过改为优化扩散模型所捕获的学习到的能量函数,我们可以避免这种对抗样本,并显著提高设计性能。我们进一步说明了这样一个设计系统是可组合的,使我们能够组合多个分别表示目标系统子组件的扩散模型,从而设计出包含每个指定组件的系统。在一个N体相互作用任务和一个具有挑战性的2D多翼型设计任务中,我们证明了通过在测试时组合学习到的扩散模型,我们的方法能够设计出比训练数据更复杂的初始状态和边界形状。在N体数据集上,我们的方法在预测MAE上平均优于最先进的神经逆设计方法41.5%,在设计目标上平均优于14.3%,并在多翼型设计任务中发现了能最小化阻力的编队飞行方案。项目网站和代码见https://github.com/AI4Science-WestlakeU/cindm。

摘要: Inverse design, where we seek to design input variables in order to optimize an underlying objective function, is an important problem that arises across fields such as mechanical engineering to aerospace engineering. Inverse design is typically formulated as an optimization problem, with recent works leveraging optimization across learned dynamics models. However, as models are optimized they tend to fall into adversarial modes, preventing effective sampling. We illustrate that by instead optimizing over the learned energy function captured by the diffusion model, we can avoid such adversarial examples and significantly improve design performance. We further illustrate how such a design system is compositional, enabling us to combine multiple different diffusion models representing subcomponents of our desired system to design systems with every specified component. In an N-body interaction task and a challenging 2D multi-airfoil design task, we demonstrate that by composing the learned diffusion model at test time, our method allows us to design initial states and boundary shapes that are more complex than those in the training data. Our method outperforms state-of-the-art neural inverse design method by an average of 41.5% in prediction MAE and 14.3% in design objective for the N-body dataset and discovers formation flying to minimize drag in the multi-airfoil design task. Project website and code can be found at https://github.com/AI4Science-WestlakeU/cindm.
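摘要中"在测试时组合多个子组件扩散模型"的思路,可以概括为把各模型的得分(能量负梯度)相加,组合后的得分再交给任意标准扩散采样器使用。下面是这一步的概念性草图(非官方实现),models 为假设的模型列表。

```python
# 概念性示意(非官方实现):把多个子组件扩散模型对 x 的得分相加,
# 得到组合系统的得分,再交给任意标准扩散采样器(如 DDIM 或退火朗之万)使用。
def composed_score(models, x, t):
    return sum(m(x, t) for m in models)
```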


标题: Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

作者: Royi Rassin, Eran Hirsch, Daniel Glickman

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2306.08877v3

GitHub: https://github.com/RoyiRa/Syntax-Guided-Generation

中文摘要: 文本条件图像生成模型经常在实体及其视觉属性之间产生不正确的关联。这反映出提示中实体与修饰语的语言绑定和生成图像中相应元素的视觉绑定之间的映射受损。一个显著的例子是,像"一朵粉色向日葵和一只黄色火烈鸟"这样的查询可能会错误地生成黄色向日葵和粉色火烈鸟的图像。为了解决这个问题,我们提出了SynGen,该方法首先对提示进行句法分析以识别实体及其修饰语,然后使用一种新的损失函数,鼓励交叉注意力图与句法所反映的语言绑定保持一致。具体来说,我们鼓励实体与其修饰语的注意力图之间有较大重叠,而与其他实体和修饰语词的重叠较小。该损失在推理过程中优化,无需重新训练或微调模型。在三个数据集(包括一个新的具有挑战性的数据集)上的人工评估表明,与当前最先进的方法相比,SynGen有显著改进。这项工作强调了在推理过程中利用句子结构可以有效且实质性地提高文本到图像生成的忠实度。

摘要: Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between linguistic binding of entities and modifiers in the prompt and visual binding of the corresponding elements in the generated image. As one notable example, a query like “a pink sunflower and a yellow flamingo” may incorrectly produce an image of a yellow sunflower and a pink flamingo. To remedy this issue, we propose SynGen, an approach which first syntactically analyses the prompt to identify entities and their modifiers, and then uses a novel loss function that encourages the cross-attention maps to agree with the linguistic binding reflected by the syntax. Specifically, we encourage large overlap between attention maps of entities and their modifiers, and small overlap with other entities and modifier words. The loss is optimized during inference, without retraining or fine-tuning the model. Human evaluation on three datasets, including one new and challenging set, demonstrate significant improvements of SynGen compared with current state of the art methods. This work highlights how making use of sentence structure during inference can efficiently and substantially improve the faithfulness of text-to-image generation.
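SynGen 的损失鼓励实体与其修饰语的交叉注意力图重叠,并抑制与无关词的重叠。下面是一个概念性草图(非官方实现):用 L1 距离作为注意力图"重叠程度"的简单代理,论文实际采用的距离度量可能不同。

```python
import torch

# 概念性示意(非官方实现):在推理时对交叉注意力图施加损失,
# 拉近句法绑定对的注意力图,推远无关词对的注意力图。
def syngen_loss(attn, pairs, unrelated):
    # attn: dict[token_idx -> [H, W] 归一化注意力图]
    # pairs: [(entity_idx, modifier_idx), ...] 句法分析得到的绑定对
    # unrelated: [(idx_a, idx_b), ...] 不应绑定的词对
    def dist(a, b):                       # 以 L1 距离作为重叠度的代理
        return (a - b).abs().mean()
    pos = torch.stack([dist(attn[e], attn[m]) for e, m in pairs]).mean()
    neg = torch.stack([dist(attn[a], attn[b]) for a, b in unrelated]).mean()
    return pos - neg                      # 最小化:绑定对更重叠,无关对更分离
```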


标题: Mitigate Replication and Copying in Diffusion Models with Generalized Caption and Dual Fusion Enhancement

作者: Chenghao Li, Dake Chen, Yuke Zhang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2309.07254v4

GitHub: https://github.com/HowardLi0816/dual-fusion-diffusion

中文摘要: 虽然扩散模型表现出生成高质量图像的非凡能力,但它们"复制"训练数据的倾向引发了隐私问题。虽然最近的研究表明,这种复制可能源于训练数据标题泛化不足和训练图像的重复,但有效的缓解策略仍然难以找到。为了弥补这一差距,我们首先引入一个通用性分数来衡量标题的通用程度,并采用大型语言模型(LLM)来泛化训练标题。随后,我们利用泛化后的标题,提出了一种新的双重融合增强方法,以减轻扩散模型的复制现象。我们的实证结果表明,与原始扩散模型相比,我们提出的方法可以将复制显著减少43.5%,同时保持生成结果的多样性和质量。代码可在https://github.com/HowardLi0816/dual-fusion-diffusion获取。

摘要: While diffusion models demonstrate a remarkable capability for generating high-quality images, their tendency to “replicate” training data raises privacy concerns. Although recent research suggests that this replication may stem from the insufficient generalization of training data captions and duplication of training images, effective mitigation strategies remain elusive. To address this gap, our paper first introduces a generality score that measures the caption generality and employ large language model (LLM) to generalize training captions. Subsequently, we leverage generalized captions and propose a novel dual fusion enhancement approach to mitigate the replication of diffusion models. Our empirical results demonstrate that our proposed methods can significantly reduce replication by 43.5% compared to the original diffusion model while maintaining the diversity and quality of generations. Code is available at https://github.com/HowardLi0816/dual-fusion-diffusion.


标题: DITTO: Diffusion Inference-Time T-Optimization for Music Generation

作者: Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2401.12179v1

Project: https://DITTO-Music.github.io/web/

中文摘要: 我们提出了扩散推理时间T优化(DITTO),这是一个通用框架,通过优化初始噪声潜变量在推理时控制预训练的文本到音乐扩散模型。我们的方法可以通过任何可微的特征匹配损失进行优化以实现目标(风格化)输出,并利用梯度检查点来提高内存效率。我们展示了其在音乐生成中出人意料地广泛的应用,包括修复(inpainting)、外扩(outpainting)和循环(looping),以及强度、旋律和音乐结构控制,而所有这些都无需微调底层模型。当我们将我们的方法与相关的训练、引导和基于优化的方法进行比较时,我们发现DITTO在几乎所有任务上都实现了最先进的性能,包括在可控性、音频质量和计算效率方面优于可比方法,从而为扩散模型的高质量、灵活、免训练控制打开了大门。音频示例见https://DITTO-Music.github.io/web/。

摘要: We propose Diffusion Inference-Time T-Optimization (DITTO), a general-purpose frame-work for controlling pre-trained text-to-music diffusion models at inference-time via optimizing initial noise latents. Our method can be used to optimize through any differentiable feature matching loss to achieve a target (stylized) output and leverages gradient checkpointing for memory efficiency. We demonstrate a surprisingly wide-range of applications for music generation including inpainting, outpainting, and looping as well as intensity, melody, and musical structure control - all without ever fine-tuning the underlying model. When we compare our approach against related training, guidance, and optimization-based methods, we find DITTO achieves state-of-the-art performance on nearly all tasks, including outperforming comparable approaches on controllability, audio quality, and computational efficiency, thus opening the door for high-quality, flexible, training-free control of diffusion models. Sound examples can be found at https://DITTO-Music.github.io/web/.
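DITTO 的核心是在推理时把初始噪声潜变量当作可优化参数,经由可微的特征匹配损失反传梯度。下面是该思路的概念性草图(非官方实现),sample_fn 与 feature_fn 均为假设的可微接口;原文还使用梯度检查点以降低显存占用,这里为简洁起见省略。

```python
import torch

# 概念性示意(非官方实现):把初始噪声潜变量 x_T 作为可优化参数,
# 通过可微的特征匹配损失来控制生成结果。
def ditto_optimize(sample_fn, feature_fn, target_feat, shape, steps=50, lr=0.05):
    x_T = torch.randn(shape, requires_grad=True)      # 初始噪声潜变量
    opt = torch.optim.Adam([x_T], lr=lr)
    for _ in range(steps):
        audio = sample_fn(x_T)                        # 可微的扩散采样(得到音频或其潜表示)
        loss = (feature_fn(audio) - target_feat).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x_T.detach()
```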


== VSLAM ==

标题: ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models

作者: Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.13311v1

Project: https://con-textual.github.io/

中文摘要: 人工智能的最新进展催生了能够处理复杂任务的大型多模态模型(LMM),这些任务涉及对图像中文本和视觉内容的联合推理(例如,在公共场所看地图导航)。本文介绍了ConTextual,这是一个新颖的基准,包含专门设计的指令,用于评估LMM执行上下文敏感、富文本视觉推理的能力。ConTextual强调多样的真实世界场景(例如读时间、导航、购物等),要求更深入地理解文本和视觉元素之间的交互。我们的研究结果显示,在人工评估下,表现最好的LMM(GPT-4V(ision))与人类能力之间存在30.8%的显著性能差距,表明上下文敏感的富文本视觉推理仍有很大改进空间。值得注意的是,虽然GPT-4V在模因和引语解释等抽象类别中表现出色,但其整体表现仍落后于人类。除了人工评估,我们还采用了基于GPT-4的自动评估指标,发现了类似的性能差距趋势。我们还在不同的视觉情境中进行了细粒度评估,并提供了定性分析,为未来LMM设计的发展提供了一个稳健的框架。项目页面:https://con-textual.github.io/

摘要: Recent advancements in AI have led to the development of large multimodal models (LMMs) capable of processing complex tasks involving joint reasoning over text and visual content in the image (e.g., navigating maps in public places). This paper introduces ConTextual, a novel benchmark comprising instructions designed explicitly to evaluate LMMs’ ability to perform context-sensitive text-rich visual reasoning. ConTextual emphasizes diverse real-world scenarios (e.g., time-reading, navigation, shopping and more) demanding a deeper understanding of the interactions between textual and visual elements. Our findings reveal a significant performance gap of 30.8% between the best-performing LMM, GPT-4V(ision), and human capabilities using human evaluation indicating substantial room for improvement in context-sensitive text-rich visual reasoning. Notably, while GPT-4V excelled in abstract categories like meme and quote interpretation, its overall performance still lagged behind humans. In addition to human evaluations, we also employed automatic evaluation metrics using GPT-4, uncovering similar trends in performance disparities. We also perform a fine-grained evaluation across diverse visual contexts and provide qualitative analysis which provides a robust framework for future advancements in the LMM design. https://con-textual.github.io/


标题: MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning

作者: Chenyu Wang, Weixin Luo, Qianyu Chen

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2401.10727v2

GitHub: https://github.com/MLLM-Tool/MLLM-Tool

中文摘要: 最近,大型语言模型(LLMs)在自然语言理解和生成任务中的惊人性能引发了许多将其用作中央控制器来构建代理系统的探索。多项研究侧重于将LLMs与外部工具连接起来,以扩展应用场景。然而,目前LLMs感知工具使用的能力局限于单一的文本查询,这可能导致对用户真实意图的理解存在歧义。LLMs被期望通过感知基于视觉或听觉的指令信息来消除这种歧义。因此,在本文中,我们提出了MLLM-Tool,这是一个结合了开源LLMs和多模态编码器的系统,使训练后的LLM能够感知多模态输入指令,进而正确地选择功能匹配的工具。为了便于评估模型的能力,我们从HuggingFace收集了一个由多模态输入工具组成的数据集。我们数据集的另一个重要特征是,由于相同功能和同义功能的存在,数据集还包含同一指令的多个潜在选择,这为同一查询提供了更多潜在的解决方案。实验表明,我们的MLLM-Tool能够为多模态指令推荐合适的工具。代码和数据见https://github.com/MLLM-Tool/MLLM-Tool。

摘要: Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks triggered lots of exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs’ perceiving tool-use ability is limited to a single text query, which may result in ambiguity in understanding the users’ real intentions. LLMs are expected to eliminate that by perceiving the visual- or auditory-grounded instructions’ information. Therefore, in this paper, we propose MLLM-Tool, a system incorporating open-source LLMs and multi-modal encoders so that the learnt LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. To facilitate the evaluation of the model’s capability, we collect a dataset featured by consisting of multi-modal input tools from HuggingFace. Another important feature of our dataset is that our dataset also contains multiple potential choices for the same instruction due to the existence of identical functions and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our MLLM-Tool is capable of recommending appropriate tools for multi-modal instructions. Codes and data are available at https://github.com/MLLM-Tool/MLLM-Tool.


标题: SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization

作者: Mingyang Li, Yue Ma, Qinru Qiu

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.13076v1

GitHub: https://github.com/Leomingyangli/SemanticSLAM

中文摘要: 当前的视觉同步定位与建图(VSLAM)技术通过比较连续场景的图像特征来估计相机位移。这些算法依赖于场景的连续性,因此需要频繁的相机输入。然而,频繁处理图像会导致大量的内存使用和计算开销。在这项研究中,我们介绍了SemanticSLAM,这是一个端到端的视觉惯性里程计系统,它利用从RGB-D传感器提取的语义特征。这种方法能够创建环境的语义地图,并确保可靠的相机定位。SemanticSLAM与场景无关,这意味着它不需要针对不同的环境进行重新训练。即使相机输入不频繁、且没有先验知识,它也能在室内环境中有效工作。SemanticSLAM的优势在于它能够逐步细化语义地图并改进姿态估计。这是通过一个卷积长短期记忆(ConvLSTM)网络实现的,该网络经过训练可以在地图构建过程中纠正错误。与现有的VSLAM算法相比,SemanticSLAM将姿态估计提高了17%。由此产生的语义地图提供了关于环境的可解释信息,并且可以容易地应用于各种下游任务,例如路径规划、避障和机器人导航。代码将公开于https://github.com/Leomingyangli/SemanticSLAM

摘要: Current techniques in Visual Simultaneous Localization and Mapping (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity, hence requires frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it doesn’t require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long-short-term-memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at https://github.com/Leomingyangli/SemanticSLAM


标题: VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View

作者: Raphael Schumann, Wanrong Zhu, Weixi Feng

PubTime: 2024-01-24

Downlink: http://arxiv.org/abs/2307.06082v2

摘要: Incremental decision making in real-world environments is one of the most challenging tasks in embodied artificial intelligence. One particularly demanding scenario is Vision and Language Navigation~(VLN) which requires visual and natural language understanding as well as spatial and temporal reasoning capabilities. The embodied agent needs to ground its understanding of navigation instructions in observations of a real-world environment like Street View. Despite the impressive results of LLMs in other research areas, it is an ongoing problem of how to best connect them with an interactive visual environment. In this work, we propose VELMA, an embodied LLM agent that uses a verbalization of the trajectory and of visual environment observations as contextual prompt for the next action. Visual information is verbalized by a pipeline that extracts landmarks from the human written navigation instructions and uses CLIP to determine their visibility in the current panorama view. We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples. We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
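VELMA 的视觉口头化流程用 CLIP 判断指令中提取的地标是否出现在当前全景图中。下面用 Hugging Face 的 CLIP 接口给出一个"图文相似度 + 阈值"的近似示意(非论文的确切流程),阈值取值只是假设。

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# 概念性示意(非官方实现):用 CLIP 的图文相似度近似判断地标在全景图中的"可见性"。
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def visible_landmarks(panorama: Image.Image, landmarks, threshold=25.0):
    # threshold 为假设的经验阈值,需按实际数据调整
    inputs = proc(text=landmarks, images=panorama, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]   # 图像与各地标文本的相似度
    return [lm for lm, s in zip(landmarks, logits.tolist()) if s > threshold]
```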


标题: Force sensing to reconstruct potential energy landscapes for cluttered large obstacle traversal

作者: Yaqing Wang, Ling Xu, Chen Li

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.13062v1

中文摘要: 对环境几何的视觉感知使机器人能够使用人工势场来避开稀疏的障碍物。然而,机器人还必须穿越杂乱的大型障碍物,以实现诸如在瓦砾中搜救以及在火星岩石间进行行星探索等应用。最近的研究发现,为了穿越杂乱的大型障碍物,多足昆虫和受昆虫启发的机器人会在身体朝向发生重大变化的情况下,在运动模式之间进行费力的转换。从由运动-障碍物物理相互作用产生的势能景观来看,这些转换是跨越景观盆地之间势垒的过程。这种势能景观方法可以为杂乱大型障碍物的穿越提供一个建模框架。在这里,我们通过检验力传感是否能够重建势能景观,朝这一愿景迈出了下一步。我们开发了一个受蟑螂启发的极简机器人,它在向前推进、对抗一对杂乱的草状梁障碍物时,能够感知身体周围的障碍物接触力和扭矩。我们在系统变化的身体朝向下进行了多次穿越测量。尽管力和扭矩并非完全保守,但它们与势能景观的梯度吻合良好,并且由它们重建的景观也与真实景观吻合良好。此外,受蟑螂观察的启发,我们发现机器人在穿越过程中的头部摆动进一步提高了力传感和景观重建的准确性。我们仍需研究如何在单次穿越中重建景观,因为在实际应用中,机器人很少有机会通过多次穿越来系统地采样环境;还需研究如何找到景观中的鞍点,以实现最省力的穿越转换。

摘要: Visual sensing of environmental geometry allows robots to use artificial potential fields to avoid sparse obstacles. Yet robots must further traverse cluttered large obstacles for applications like search and rescue through rubble and planetary exploration across Martain rocks. Recent studies discovered that to traverse cluttered large obstacles, multi-legged insects and insect-inspired robots make strenuous transitions across locomotor modes with major changes in body orientation. When viewed on a potential energy landscape resulting from locomotor-obstacle physical interaction, these are barrier-crossing transitions across landscape basins. This potential energy landscape approach may provide a modeling framework for cluttered large obstacle traversal. Here, we take the next step toward this vision by testing whether force sensing allows the reconstruction of the potential energy landscape. We developed a cockroach-inspired, minimalistic robot capable of sensing obstacle contact forces and torques around its body as it propelled forward against a pair of cluttered grass-like beam obstacles. We performed measurements over many traverses with systematically varied body orientations. Despite the forces and torques not being fully conservative, they well-matched the potential energy landscape gradients and the landscape reconstructed from them well-matched ground truth. In addition, inspired by cockroach observations, we found that robot head oscillation during traversal further improved the accuracies of force sensing and landscape reconstruction. We still need to study how to reconstruct landscape during a single traverse, as in applications, robots have little chance to use multiple traverses to sample the environment systematically and how to find landscape saddles for least-effort transitions to traverse.
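摘要的关键点是:测得的力/扭矩近似等于势能景观的负梯度,因此可以通过数值积分重建景观。下面是一个一维的数值示意(非论文方法的复现),其中坐标与"真实"势能都是假想的,仅用于演示积分重建与真值只差一个常数。

```python
import numpy as np

# 一维数值示意(非论文实现):若沿某个姿态坐标 q 测得广义力 F(q) ≈ -dE/dq,
# 则可用梯形法数值积分重建势能景观 E(q)(只差一个积分常数)。
q = np.linspace(0.0, np.pi, 200)                 # 假想的姿态坐标,例如身体俯仰角
E_true = np.cos(q)                               # 假想的"真实"势能,仅用于对照
F = -np.gradient(E_true, q)                      # 假想的测得广义力
E_rec = -np.concatenate(([0.0], np.cumsum(0.5 * (F[1:] + F[:-1]) * np.diff(q))))
print(np.allclose(E_rec, E_true - E_true[0], atol=1e-3))   # 重建结果与真值(去常数)吻合
```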


标题: Spatial and Temporal Hierarchy for Autonomous Navigation using Active Inference in Minigrid Environment

作者: Daria de Tinguy, Toon van de Maele, Tim Verbelen

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2312.05058v2

中文摘要: 强有力的证据表明,人类使用拓扑标志和粗粒度路径集成的组合来探索他们的环境。这种方法依赖于可识别的环境特征(拓扑地标)以及距离和方向的估计(粗粒度路径整合)来构建周围环境的认知地图。这种认知地图被认为展示了一种层次结构,允许在解决复杂的导航任务时进行有效的规划。受人类行为的启发,本文提出了一个可扩展的分层主动推理模型,用于自主导航、探索和面向目标的行为。该模型使用视觉观察和运动感知将好奇心驱动的探索与目标导向的行为结合起来。使用不同层次的推理来计划运动,即从上下文到地点再到运动。这允许在新的空间中有效导航,并向目标快速前进。通过结合这些人类导航策略及其对环境的分层表示,该模型为自主导航和探索提出了一种新的解决方案。在微型网格环境下通过仿真验证了该方法的有效性。

摘要: Robust evidence suggests that humans explore their environment using a combination of topological landmarks and coarse-grained path integration. This approach relies on identifiable environmental features (topological landmarks) in tandem with estimations of distance and direction (coarse-grained path integration) to construct cognitive maps of the surroundings. This cognitive map is believed to exhibit a hierarchical structure, allowing efficient planning when solving complex navigation tasks. Inspired by human behaviour, this paper presents a scalable hierarchical active inference model for autonomous navigation, exploration, and goal-oriented behaviour. The model uses visual observation and motion perception to combine curiosity-driven exploration with goal-oriented behaviour. Motion is planned using different levels of reasoning, i.e., from context to place to motion. This allows for efficient navigation in new spaces and rapid progress toward a target. By incorporating these human navigational strategies and their hierarchical representation of the environment, this model proposes a new solution for autonomous navigation and exploration. The approach is validated through simulations in a mini-grid environment.


专属领域论文订阅

VX关注{晓理紫},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持,也欢迎提供建议。

如果你感觉对你有所帮助,请关注我,每日准时为你推送最新论文
