[晓理紫] Daily Paper Digest (with abstracts and source code or project links) - Large Models, Diffusion Models, and More

Field-specific paper subscription

Follow {晓理紫} on WeChat (VX) for free; papers are updated daily. If you are interested, please forward this to classmates who need it. Thank you for your support.

Categories:

== LLM ==

Title: BiMediX: Bilingual Medical Mixture of Experts LLM

Authors: Sara Pieri, Sahal Shaji Mullappilly, Fahad Shahbaz Khan

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13253v1

GitHub: https://github.com/mbzuai-oryx/BiMediX

Abstract: In this paper, we introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about additional details such as patient symptoms and medical history, multiple-choice question answering, and open-ended question answering. We propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations. We also introduce a comprehensive evaluation benchmark for Arabic medical LLMs. Furthermore, we introduce BiMed1.3M, an extensive Arabic-English bilingual instruction set covering 1.3 Million diverse medical interactions, resulting in over 632 million healthcare specialized tokens for instruction tuning. Our BiMed1.3M dataset includes 250k synthesized multi-turn doctor-patient chats and maintains a 1:2 Arabic-to-English ratio. Our model outperforms state-of-the-art Med42 and Meditron by average absolute gains of 2.5% and 4.1%, respectively, computed across multiple medical evaluation benchmarks in English, while operating at 8-times faster inference. Moreover, our BiMediX outperforms the generic Arabic-English bilingual LLM, Jais-30B, by average absolute gains of 10% on our Arabic medical benchmark and 15% on bilingual evaluations across multiple datasets. Our project page with source code and trained model is available at https://github.com/mbzuai-oryx/BiMediX.
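
To make the "mixture of experts" ingredient concrete, below is a minimal sketch of a top-k gated MoE feed-forward layer of the kind used in MoE LLMs. This is a generic illustration only; the layer sizes, expert count, and routing details are assumptions and are not taken from BiMediX.

```python
# Minimal sketch of a top-k gated mixture-of-experts (MoE) feed-forward layer.
# Generic illustration of the MoE mechanism, not the actual BiMediX architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)  # token-level gating network
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):              # route each token to its k-th chosen expert
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoE()(tokens).shape)  # torch.Size([16, 512])
```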


Title: TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Authors: Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13249v1

GitHub: https://github.com/amazon-science/tofueval

Abstract: Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model’s size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we conducted an analysis of hallucination types with a curated error taxonomy. We find that there are diverse errors and error distributions in model-generated summaries and that non-LLM based metrics can capture all error types better than LLM-based evaluators.


Title: Privacy Issues in Large Language Models: A Survey

Authors: Seth Neel, Peter Chang

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2312.06717v3

GitHub: https://github.com/safr-ml-lab/survey-llm

Abstract: This is the first survey of the active area of AI research that focuses on privacy issues in Large Language Models (LLMs). Specifically, we focus on work that red-teams models to highlight privacy risks, attempts to build privacy into the training or inference process, enables efficient data deletion from trained models to comply with existing privacy regulations, and tries to mitigate copyright issues. Our focus is on summarizing technical research that develops algorithms, proves theorems, and runs empirical evaluations. While there is an extensive body of legal and policy work addressing these challenges from a different angle, that is not the focus of our survey. Nevertheless, these works, along with recent legal developments, do inform how these technical problems are formalized, and so we discuss them briefly in Section 1. While we have made our best effort to include all the relevant work, due to the fast-moving nature of this research we may have missed some recent work. If we have missed some of your work please contact us, as we will attempt to keep this survey relatively up to date. We are maintaining a repository with the list of papers covered in this survey and any relevant code that was publicly available at https://github.com/safr-ml-lab/survey-llm.


Title: Soft Self-Consistency Improves Language Model Agents

Authors: Han Wang, Archiki Prasad, Elias Stengel-Eskin

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13212v1

GitHub: https://github.com/HanNight/soft_self_consistency

Abstract: Generations from large language models (LLMs) can be improved by sampling and scoring multiple solutions to select a final answer. Current “sample and select” methods such as self-consistency (SC) rely on majority voting to score answers. However, when tasks have many distinct and valid answers, selection by voting requires a large number of samples. This makes SC prohibitively expensive for interactive tasks that involve generating multiple actions (answers) sequentially. After establishing that majority voting fails to provide consistent gains on such tasks, we demonstrate how to increase success rates by softening the scoring criterion. We introduce Soft Self-Consistency (Soft-SC), which replaces SC’s discontinuous scoring with a continuous score computed from model likelihoods, allowing for selection even when actions are sparsely distributed. Soft-SC improves both performance and efficiency on long-horizon interactive tasks, requiring half as many samples as SC for comparable or better performance. For a fixed number of samples, Soft-SC leads to a 1.3% increase over SC in absolute success rate on writing bash programs, a 6.6% increase on online shopping (WebShop), and a 4.7% increase for an interactive household game (ALFWorld). Finally, we show that Soft-SC can be applied to both open-source and black-box models.
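
The core idea, replacing discrete majority voting with a continuous likelihood-based score, can be sketched as follows. The length-normalized mean token log-probability used here is an assumed aggregation chosen for illustration; the paper's exact scoring may differ.

```python
# Hedged sketch: standard self-consistency (majority vote) vs. Soft-SC
# (pick the sampled action with the highest continuous likelihood score).
from collections import Counter

def select_sc(samples):
    """samples: list of sampled answer strings -> majority-vote winner."""
    return Counter(samples).most_common(1)[0][0]

def select_soft_sc(samples, token_logprobs):
    """token_logprobs[i]: per-token log-probabilities the LLM assigned to samples[i]."""
    def score(lps):
        return sum(lps) / len(lps)          # length-normalized log-likelihood (an assumption)
    best = max(range(len(samples)), key=lambda i: score(token_logprobs[i]))
    return samples[best]

samples = ["ls -la", "ls -la", "ls -a", "find . -maxdepth 1"]
logprobs = [[-0.2, -0.4], [-0.3, -0.5], [-0.1, -0.2], [-0.9, -1.1, -1.3]]
print(select_sc(samples))                 # 'ls -la'  (majority vote)
print(select_soft_sc(samples, logprobs))  # 'ls -a'   (highest mean log-prob)
```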


Title: What if LLMs Have Different World Views: Simulating Alien Civilizations with LLM-based Agents

Authors: Mingyu Jin, Beichen Wang, Zhaoqian Xue

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13184v1

GitHub: https://github.com/agiresearch/AlienAgent

Abstract: In this study, we introduce “CosmoAgent,” an innovative artificial intelligence framework utilizing Large Language Models (LLMs) to simulate complex interactions between human and extraterrestrial civilizations, with a special emphasis on Stephen Hawking’s cautionary advice about not sending radio signals haphazardly into the universe. The goal is to assess the feasibility of peaceful coexistence while considering potential risks that could threaten well-intentioned civilizations. Employing mathematical models and state transition matrices, our approach quantitatively evaluates the development trajectories of civilizations, offering insights into future decision-making at critical points of growth and saturation. Furthermore, the paper acknowledges the vast diversity in potential living conditions across the universe, which could foster unique cosmologies, ethical codes, and worldviews among various civilizations. Recognizing the Earth-centric bias inherent in current LLM designs, we propose the novel concept of using LLMs with diverse ethical paradigms and simulating interactions between entities with distinct moral principles. This innovative research provides a new way to understand complex inter-civilizational dynamics, expanding our perspective while pioneering novel strategies for conflict resolution, crucial for preventing interstellar conflicts. We have also released the code and datasets to enable further academic investigation into this interesting area of research. The code is available at https://github.com/agiresearch/AlienAgent.
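
As a toy illustration of evaluating development trajectories with a state-transition matrix, the snippet below propagates a distribution over hypothetical civilization states. The states and probabilities are invented for illustration and are not taken from the paper.

```python
# Toy sketch of trajectory evaluation with a state-transition (Markov) matrix.
# States and probabilities are hypothetical illustrations only.
import numpy as np

states = ["growth", "saturation", "decline"]
# T[i, j] = P(next state = j | current state = i); rows sum to 1
T = np.array([[0.80, 0.15, 0.05],
              [0.10, 0.70, 0.20],
              [0.05, 0.15, 0.80]])

dist = np.array([1.0, 0.0, 0.0])   # start fully in the "growth" state
for step in range(1, 11):
    dist = dist @ T                 # propagate the state distribution one step
    print(step, dict(zip(states, dist.round(3))))
```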


Title: MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

Authors: Sen Li, Ruochen Wang, Cho-Jui Hsieh

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.12741v1

Project: https://measure-infinity.github.io/mulan

GitHub: https://github.com/measure-infinity/mulan-code

Abstract: Existing text-to-image models still struggle to generate images of multiple objects, especially in handling their spatial positions, relative sizes, overlapping, and attribute bindings. In this paper, we develop a training-free Multimodal-LLM agent (MuLan) to address these challenges by progressive multi-object generation with planning and feedback control, like a human painter. MuLan harnesses a large language model (LLM) to decompose a prompt to a sequence of sub-tasks, each generating only one object conditioned on previously generated objects by stable diffusion. Unlike existing LLM-grounded methods, MuLan only produces a high-level plan at the beginning while the exact size and location of each object are determined by an LLM and attention guidance upon each sub-task. Moreover, MuLan adopts a vision-language model (VLM) to provide feedback to the image generated in each sub-task and control the diffusion model to re-generate the image if it violates the original prompt. Hence, each model in every step of MuLan only needs to address an easy sub-task it is specialized for. We collect 200 prompts containing multi-objects with spatial relationships and attribute bindings from different benchmarks to evaluate MuLan. The results demonstrate the superiority of MuLan in generating multiple objects over baselines. The code is available on https://github.com/measure-infinity/mulan-code.
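
A rough sketch of the plan-generate-verify loop described above is given below. The three callables (plan_with_llm, generate_object, check_with_vlm) are hypothetical stand-ins for the LLM planner, the stable-diffusion generation step, and the VLM checker; this is not MuLan's actual API.

```python
# Hedged sketch of a progressive generate-and-check loop: an LLM plans per-object
# sub-tasks, a diffusion model adds one object at a time conditioned on the image
# so far, and a VLM verifies each step. All callables are hypothetical stand-ins.
def progressive_generation(prompt, plan_with_llm, generate_object, check_with_vlm,
                           max_retries=3):
    subtasks = plan_with_llm(prompt)      # e.g. ["a red apple", "a blue vase to its left"]
    image = None                          # start from an empty canvas
    for subtask in subtasks:
        for _ in range(max_retries):
            candidate = generate_object(subtask, image)    # add one object to the scene
            if check_with_vlm(candidate, subtask, prompt): # does it satisfy the prompt?
                image = candidate          # accept; the next step conditions on it
                break
        else:
            image = candidate              # keep the last attempt if none passed the check
    return image
```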


== CLIP @ ViT @ VLM @ visual model ==

Title: CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples

Authors: Jianrui Zhang, Mu Cai, Tengyang Xie

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13254v1

Project: https://countercurate.github.io/

Abstract: We propose CounterCurate, a framework to comprehensively improve the visio-linguistic compositional reasoning capability for both contrastive and generative multimodal models. In particular, we identify two under-explored critical problems: the neglect of the physically grounded reasoning (counting and position understanding) and the potential of using highly capable text and image generation models for semantic counterfactual fine-tuning. Our work pioneers an approach that addresses these gaps. We first spotlight the near-chance performance of multimodal models like CLIP and LLaVA in physically grounded compositional reasoning. We then apply simple data augmentation using a grounded image generation model, GLIGEN, to generate finetuning data, resulting in significant performance improvements: +33% and +37% for CLIP and LLaVA, respectively, on our newly curated Flickr30k-Positions benchmark. Moreover, we exploit the capabilities of high-performing text generation and image generation models, specifically GPT-4V and DALLE-3, to curate challenging semantic counterfactuals, thereby further enhancing compositional reasoning capabilities on benchmarks such as SugarCrepe, where CounterCurate outperforms GPT-4V.


Title: FlashTex: Fast Relightable Mesh Texturing with LightControlNet

Authors: Kangle Deng, Timothy Omernick, Alexander Weiss

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13251v1

Project: https://flashtex.github.io/

Abstract: Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. We introduce LightControlNet, a new text-to-image model based on the ControlNet architecture, which allows the specification of the desired lighting as a conditioning image to the model. Our text-to-texture pipeline then constructs the texture in two stages. The first stage produces a sparse set of visually consistent reference views of the mesh using LightControlNet. The second stage applies a texture optimization based on Score Distillation Sampling (SDS) that works with LightControlNet to increase the texture quality while disentangling surface material from lighting. Our pipeline is significantly faster than previous text-to-texture methods, while producing high-quality and relightable textures.


Title: Video ReCap: Recursive Captioning of Hour-Long Videos

Authors: Md Mohaiminul Islam, Ngan Ho, Xitong Yang

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13250v1

Project: https://sites.google.com/view/vidrecap

Abstract: Most video captioning models are designed to process short video clips of a few seconds and output text describing low-level visual concepts (e.g., objects, scenes, atomic actions). However, most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap, a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos, starting from clip-level captions describing atomic actions, then focusing on segment-level descriptions, and concluding with generating summaries for hour-long videos. Furthermore, we introduce Ego4D-HCap dataset by augmenting Ego4D with 8,267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks, such as VideoQA on EgoSchema. Data, code, and models are available at: https://sites.google.com/view/vidrecap
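
The recursive, three-level captioning scheme can be sketched as a simple bottom-up pass; caption_clip and summarize_texts below are hypothetical stand-ins for the model's captioning components, and the chunk size is arbitrary.

```python
# Hedged sketch of a recursive captioning hierarchy: clip-level captions ->
# segment-level descriptions -> a whole-video summary. The captioning and
# summarization callables are hypothetical placeholders, not Video ReCap's code.
def chunk(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]

def recursive_captions(clips, caption_clip, summarize_texts, clips_per_segment=60):
    clip_caps = [caption_clip(c) for c in clips]                 # level 1: atomic actions
    seg_caps = [summarize_texts(group)                           # level 2: segments
                for group in chunk(clip_caps, clips_per_segment)]
    video_summary = summarize_texts(seg_caps)                    # level 3: whole video
    return clip_caps, seg_caps, video_summary
```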


Title: A Touch, Vision, and Language Dataset for Multimodal Alignment

Authors: Letian Fu, Gaurav Datta, Huang Huang

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13232v1

Project: https://tactile-vlm.github.io

Abstract: Touch is an important sensing modality for humans, but it has not yet been incorporated into a multimodal generative language model. This is partially due to the difficulty of obtaining natural language labels for tactile data and the complexity of aligning tactile readings with both visual observations and language descriptions. As a step towards bridging that gap, this work introduces a new dataset of 44K in-the-wild vision-touch pairs, with English language labels annotated by humans (10%) and textual pseudo-labels from GPT-4V (90%). We use this dataset to train a vision-language-aligned tactile encoder for open-vocabulary classification and a touch-vision-language (TVL) model for text generation using the trained encoder. Results suggest that by incorporating touch, the TVL model improves (+29% classification accuracy) touch-vision-language alignment over existing models trained on any pair of those modalities. Although only a small fraction of the dataset is human-labeled, the TVL model demonstrates improved visual-tactile understanding over GPT-4V (+12%) and open-source vision-language models (+32%) on a new touch-vision understanding benchmark. Code and data: https://tactile-vlm.github.io.
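
Training a tactile encoder that is aligned with a vision-language embedding space typically uses a CLIP-style symmetric contrastive loss; a minimal sketch is below. The encoders, batch construction, and temperature are assumptions, not the paper's exact training recipe.

```python
# Hedged sketch of CLIP-style contrastive alignment between tactile embeddings
# and paired vision/language embeddings. Details are assumptions for illustration.
import torch
import torch.nn.functional as F

def alignment_loss(touch_emb, target_emb, temperature=0.07):
    """touch_emb, target_emb: (B, D) embeddings of paired touch / vision-language samples."""
    touch = F.normalize(touch_emb, dim=-1)
    target = F.normalize(target_emb, dim=-1)
    logits = touch @ target.t() / temperature     # (B, B) similarity matrix
    labels = torch.arange(touch.size(0))          # positives lie on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

print(alignment_loss(torch.randn(8, 512), torch.randn(8, 512)))
```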


Title: A multimodal dynamical variational autoencoder for audiovisual speech representation learning

Authors: Samir Sadok, Simon Leglaive, Laurent Girin

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2305.03582v3

Project: https://samsad35.github.io/site-mdvae/

Abstract: In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists in learning the MDVAE model on the intermediate representation of the VQ-VAEs before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with few labeled data, and with better accuracy compared with unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.


Title: Consistent123: One Image to Highly Consistent 3D Asset Using Case-Aware Diffusion Priors

Authors: Yukang Lin, Haonan Han, Chaoqun Gong

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2309.17261v2

Project: https://Consistent123.github.io

Abstract: Reconstructing 3D objects from a single image guided by pretrained diffusion models has demonstrated promising outcomes. However, due to utilizing the case-agnostic rigid strategy, their generalization ability to arbitrary cases and the 3D consistency of reconstruction are still poor. In this work, we propose Consistent123, a case-aware two-stage method for highly consistent 3D asset reconstruction from one image with both 2D and 3D diffusion priors. In the first stage, Consistent123 utilizes only 3D structural priors for sufficient geometry exploitation, with a CLIP-based case-aware adaptive detection mechanism embedded within this process. In the second stage, 2D texture priors are introduced and progressively take on a dominant guiding role, delicately sculpting the details of the 3D model. Consistent123 aligns more closely with the evolving trends in guidance requirements, adaptively providing adequate 3D geometric initialization and suitable 2D texture refinement for different objects. Consistent123 can obtain highly 3D-consistent reconstruction and exhibits strong generalization ability across various objects. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art image-to-3D methods. See https://Consistent123.github.io for a more comprehensive exploration of our generated 3D assets.


== diffusion policy @ diffusion formulation @ diffusion model ==

Title: Text-Guided Molecule Generation with Diffusion Language Model

Authors: Haisong Gong, Qiang Liu, Shu Wu

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.13040v1

GitHub: https://github.com/Deno-V/tgm-dlm

Abstract: Text-guided molecule generation is a task where molecules are generated to match specific textual descriptions. Recently, most existing SMILES-based molecule generation methods rely on an autoregressive architecture. In this work, we propose the Text-Guided Molecule Generation with Diffusion Language Model (TGM-DLM), a novel approach that leverages diffusion models to address the limitations of autoregressive methods. TGM-DLM updates token embeddings within the SMILES string collectively and iteratively, using a two-phase diffusion generation process. The first phase optimizes embeddings from random noise, guided by the text description, while the second phase corrects invalid SMILES strings to form valid molecular representations. We demonstrate that TGM-DLM outperforms MolT5-Base, an autoregressive model, without the need for additional data resources. Our findings underscore the remarkable effectiveness of TGM-DLM in generating coherent and precise molecules with specific properties, opening new avenues in drug discovery and related scientific domains. Code will be released at: https://github.com/Deno-V/tgm-dlm.
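
The two-phase generation process can be sketched structurally as follows; the denoising, correction, and decoding callables and the step counts are hypothetical placeholders rather than the TGM-DLM implementation.

```python
# Hedged structural sketch of a two-phase diffusion generation process over
# SMILES token embeddings: phase 1 is text-guided denoising, phase 2 runs
# further correction steps to repair invalid SMILES. Callables are stand-ins.
import torch

def two_phase_generation(text_emb, denoise_step, correction_step, decode_smiles,
                         seq_len=64, dim=256, phase1_steps=1000, phase2_steps=200):
    z = torch.randn(seq_len, dim)              # start from random token embeddings
    for t in reversed(range(phase1_steps)):    # phase 1: text-guided denoising
        z = denoise_step(z, t, text_emb)
    for t in reversed(range(phase2_steps)):    # phase 2: validity correction
        z = correction_step(z, t)
    return decode_smiles(z)                    # embeddings -> SMILES string
```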


Title: Collaborative Control for Geometry-Conditioned PBR Image Generation

Authors: Shimon Vainer, Mark Boss, Mathias Parger

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.05919v2

Project: https://unity-research.github.io/holo-gen/

Abstract: Current 3D content generation approaches build on diffusion models that output RGB images. Modern graphics pipelines, however, require physically-based rendering (PBR) material properties. We propose to model the PBR image distribution directly, avoiding photometric inaccuracies in RGB generation and the inherent ambiguity in extracting PBR from RGB. Existing paradigms for cross-modal fine-tuning are not suited for PBR generation due to both a lack of data and the high dimensionality of the output modalities: we overcome both challenges by retaining a frozen RGB model and tightly linking a newly trained PBR model using a novel cross-network communication paradigm. As the base RGB model is fully frozen, the proposed method does not risk catastrophic forgetting during fine-tuning and remains compatible with techniques such as IPAdapter pretrained for the base RGB model. We validate our design choices, robustness to data sparsity, and compare against existing paradigms with an extensive experimental section.


Title: RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models

Authors: Xinchen Zhang, Ling Yang, Yaqi Cai

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.12908v1

GitHub: https://github.com/YangLing0818/RealCompo

Abstract: Diffusion models have achieved remarkable advancements in text-to-image generation. However, existing models still have many difficulties when faced with multiple-object compositional generation. In this paper, we propose a new training-free and transferred-friendly text-to-image generation framework, namely RealCompo, which aims to leverage the advantages of text-to-image and layout-to-image models to enhance both realism and compositionality of the generated images. An intuitive and novel balancer is proposed to dynamically balance the strengths of the two models in the denoising process, allowing plug-and-play use of any model without extra training. Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and layout-to-image models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Code is available at https://github.com/YangLing0818/RealCompo.
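
The balancing idea, fusing the noise predictions of a text-to-image model and a layout-to-image model with a dynamic weight at each denoising step, can be sketched as below. How RealCompo actually computes the weight is not described in the abstract, so the balance callable is a hypothetical placeholder.

```python
# Hedged sketch of one denoising step that fuses two diffusion models' noise
# predictions with a dynamically balanced weight. All callables are stand-ins.
def balanced_denoising_step(x_t, t, eps_t2i_model, eps_l2i_model, balance, scheduler_step):
    eps_t2i = eps_t2i_model(x_t, t)            # realism-oriented prediction (text-to-image)
    eps_l2i = eps_l2i_model(x_t, t)            # compositionality-oriented prediction (layout-to-image)
    w = balance(eps_t2i, eps_l2i, t)           # dynamic weight in [0, 1] (hypothetical)
    eps = w * eps_t2i + (1.0 - w) * eps_l2i    # fused noise prediction
    return scheduler_step(eps, t, x_t)         # standard diffusion update to x_{t-1}
```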


Title: Two-stage Rainfall-Forecasting Diffusion Model

Authors: XuDong Ling, ChaoRong Li, FengQing Qin

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.12779v1

GitHub: https://github.com/clearlyzerolxd/TRDM

Abstract: Deep neural networks have made great achievements in rainfall prediction. However, the current forecasting methods have certain limitations, such as blurry generated images and incorrect spatial positions. To overcome these challenges, we propose a Two-stage Rainfall-Forecasting Diffusion Model (TRDM) aimed at improving the accuracy of long-term rainfall forecasts and addressing the imbalance in performance between temporal and spatial modeling. TRDM is a two-stage method for rainfall prediction tasks. The task of the first stage is to capture robust temporal information while preserving spatial information under low-resolution conditions. The task of the second stage is to reconstruct the low-resolution images generated in the first stage into high-resolution images. We demonstrate state-of-the-art results on the MRMS and Swedish radar datasets. Our project is open source and available on GitHub at: https://github.com/clearlyzerolxd/TRDM.


Title: A Generative Pre-Training Framework for Spatio-Temporal Graph Transfer Learning

Authors: Yuan Yuan, Chenyang Shao, Jingtao Ding

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2402.11922v2

GitHub: https://github.com/PLUTO-SCY/GPDiff

Abstract: Spatio-temporal graph (STG) learning is foundational for smart city applications, yet it is often hindered by data scarcity in many cities and regions. To bridge this gap, we propose a novel generative pre-training framework, GPDiff, for STG transfer learning. Unlike conventional approaches that heavily rely on common feature extraction or intricate transfer learning designs, our solution takes a novel approach by performing generative pre-training on a collection of model parameters optimized with data from source cities. We recast STG transfer learning as pre-training a generative hypernetwork, which generates tailored model parameters guided by prompts, allowing for adaptability to diverse data distributions and city-specific characteristics. GPDiff employs a diffusion model with a transformer-based denoising network, which is model-agnostic to integrate with powerful STG models. By addressing challenges arising from data gaps and the complexity of generalizing knowledge across cities, our framework consistently outperforms state-of-the-art baselines on multiple real-world datasets for tasks such as traffic speed prediction and crowd flow prediction. The implementation of our approach is available at: https://github.com/PLUTO-SCY/GPDiff.


Title: Realistic Human Motion Generation with Cross-Diffusion Models

Authors: Zeping Ren, Shaoli Huang, Xiu Li

PubTime: 2024-02-20

Downlink: http://arxiv.org/abs/2312.10993v2

Project: https://wonderno.github.io/CrossDiff-webpage/

Abstract: We introduce the Cross Human Motion Diffusion Model (CrossDiff), a novel approach for generating high-quality human motion based on textual descriptions. Our method integrates 3D and 2D information using a shared transformer network within the training of the diffusion model, unifying motion noise into a single feature space. This enables cross-decoding of features into both 3D and 2D motion representations, regardless of their original dimension. The primary advantage of CrossDiff is its cross-diffusion mechanism, which allows the model to reverse either 2D or 3D noise into clean motion during training. This capability leverages the complementary information in both motion representations, capturing intricate human movement details often missed by models relying solely on 3D information. Consequently, CrossDiff effectively combines the strengths of both representations to generate more realistic motion sequences. In our experiments, our model demonstrates competitive state-of-the-art performance on text-to-motion benchmarks. Moreover, our method consistently provides enhanced motion generation quality, capturing complex full-body movement intricacies. Additionally, with a pretrained model, our approach accommodates using in-the-wild 2D motion data without 3D motion ground truth during training to generate 3D motion, highlighting its potential for broader applications and efficient use of available data resources. Project page: https://wonderno.github.io/CrossDiff-webpage/.


== Visual Navigation @ VLN @ Visual Language Navigation ==

Title: Feudal Networks for Visual Navigation

Authors: Faith Johnson, Bryan Bo Cao, Kristin Dana

PubTime: 2024-02-19

Downlink: http://arxiv.org/abs/2402.12498v1

Abstract: Visual navigation follows the intuition that humans can navigate without detailed maps. A common approach is interactive exploration while building a topological graph with images at nodes that can be used for planning. Recent variations learn from passive videos and can navigate using complex social and semantic cues. However, a significant number of training videos are needed, large graphs are utilized, and scenes are not unseen since odometry is utilized. We introduce a new approach to visual navigation using feudal learning, which employs a hierarchical structure consisting of a worker agent, a mid-level manager, and a high-level manager. Key to the feudal learning paradigm, agents at each level see a different aspect of the task and operate at different spatial and temporal scales. Two unique modules are developed in this framework. For the high-level manager, we learn a memory proxy map in a self-supervised manner to record prior observations in a learned latent space and avoid the use of graphs and odometry. For the mid-level manager, we develop a waypoint network that outputs intermediate subgoals imitating human waypoint selection during local navigation. This waypoint network is pre-trained using a new, small set of teleoperation videos that we make publicly available, with training environments different from testing environments. The resulting feudal navigation network achieves near SOTA performance, while providing a novel no-RL, no-graph, no-odometry, no-metric map approach to the image goal navigation task.


Title: Interpretable Brain-Inspired Representations Improve RL Performance on Visual Navigation Tasks

Authors: Moritz Lange, Raphael C. Engelhardt, Wolfgang Konen

PubTime: 2024-02-19

Downlink: http://arxiv.org/abs/2402.12067v1

Abstract: Visual navigation requires a whole range of capabilities. A crucial one of these is the ability of an agent to determine its own location and heading in an environment. Prior works commonly assume this information as given, or use methods which lack a suitable inductive bias and accumulate error over time. In this work, we show how the method of slow feature analysis (SFA), inspired by neuroscience research, overcomes both limitations by generating interpretable representations of visual data that encode location and heading of an agent. We employ SFA in a modern reinforcement learning context, analyse and compare representations and illustrate where hierarchical SFA can outperform other feature extractors on navigation tasks.
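
For readers unfamiliar with SFA, here is a minimal linear SFA sketch: it finds projections of a time series whose outputs vary as slowly as possible over time. This is the textbook linear algorithm applied to a toy signal, not the hierarchical SFA pipeline evaluated in the paper.

```python
# Minimal linear slow feature analysis (SFA): whiten the signal, then keep the
# directions along which the whitened signal's temporal differences vary least.
import numpy as np

def linear_sfa(X, n_components=2):
    """X: (T, D) time series -> (T, n_components) slow features."""
    X = X - X.mean(axis=0)                       # center
    cov = np.cov(X, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    W_whiten = evecs / np.sqrt(evals + 1e-12)    # whitening matrix (D, D)
    Z = X @ W_whiten                             # whitened signal, unit covariance
    dZ = np.diff(Z, axis=0)                      # temporal derivative (finite differences)
    d_evals, d_evecs = np.linalg.eigh(np.cov(dZ, rowvar=False))
    W_slow = d_evecs[:, :n_components]           # directions with the slowest variation
    return Z @ W_slow

# toy signal: a slow sinusoid mixed with faster ones
t = np.linspace(0, 2 * np.pi, 1000)
X = np.column_stack([np.sin(t) + 0.1 * np.sin(20 * t),
                     np.sin(11 * t), np.cos(7 * t)])
print(linear_sfa(X, 1).shape)   # (1000, 1)
```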


Field-specific paper subscription

Follow {晓理紫} on WeChat (VX) for free; papers are updated daily. If you are interested, please forward this to classmates who need it. Thank you for your support.

If you find this helpful, please follow me; the latest papers will be pushed to you on time every day.

To thank everyone for their support, starting today a free topic-based paper subscription service is offered to 300 readers. Simply follow the official account on WeChat (VX) and reply with {email + paper topics} (e.g.: 123456@xx.com + chatgpt@large language model @LLM). The topics must belong to the same field, with at most three keywords. The blogger reserves the right of final interpretation.

