[晓理紫]每日论文分享(有中文摘要,源码或项目地址)--大模型、扩散模型、视觉导航

专属领域论文订阅

关注{晓理紫|小李子},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持

如果觉得对你有所帮助,请关注我,每日准时为你推送最新论文。


分类:

== ChatGPT @ large language model @ LLM ==

标题: HAZARD Challenge: Embodied Decision Making in Dynamically Changing Environments

作者: Qinhong Zhou, Sunli Chen, Yisong Wang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12975v1

Project: https://vis-www.cs.umass.edu/hazard/.|

中文摘要: 高保真虚拟环境的最新进展,是构建能够感知、推理并与物理世界交互的智能具身智能体的主要驱动力之一。通常,这些环境除非与智能体交互,否则保持不变。然而,在真实世界场景中,智能体还可能面临以突发事件为特征的动态变化环境,并需要迅速采取相应行动。为弥补这一差距,我们提出了一个新的模拟具身基准HAZARD,专门用于评估具身智能体在动态情境下的决策能力。HAZARD包含火灾、洪水和大风三种突发灾难场景,并特别支持利用大型语言模型(LLMs)辅助常识推理与决策。该基准使我们能够在动态变化的环境中评估不同流程(包括强化学习、基于规则和基于搜索的方法)下自主智能体的决策能力。作为利用大型语言模型应对这一挑战的第一步,我们进一步开发了一个基于LLM的智能体,并对其解决这些困难任务的前景与挑战进行了深入分析。HAZARD可在 https://vis-www.cs.umass.edu/hazard/ 获取。

摘要: Recent advances in high-fidelity virtual environments serve as one of the major driving forces for building intelligent embodied agents to perceive, reason and interact with the physical world. Typically, these environments remain unchanged unless agents interact with them. However, in real-world scenarios, agents might also face dynamically changing environments characterized by unexpected events and need to rapidly take action accordingly. To remedy this gap, we propose a new simulated embodied benchmark, called HAZARD, specifically designed to assess the decision-making abilities of embodied agents in dynamic situations. HAZARD consists of three unexpected disaster scenarios, including fire, flood, and wind, and specifically supports the utilization of large language models (LLMs) to assist common sense reasoning and decision-making. This benchmark enables us to evaluate autonomous agents’ decision-making capabilities across various pipelines, including reinforcement learning (RL), rule-based, and search-based methods in dynamically changing environments. As a first step toward addressing this challenge using large language models, we further develop an LLM-based agent and perform an in-depth analysis of its promise and challenge of solving these challenging tasks. HAZARD is available at https://vis-www.cs.umass.edu/hazard/.
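
下面给出一个基于LLM的动态环境决策循环的最小示意(非 HAZARD 官方实现,其中 ask_llm、观测字符串与动作列表均为假设的接口),用来说明"让LLM做常识推理并选择动作"的基本流程:

```python
# 最小示意:基于 LLM 的动态环境决策循环(接口均为假设,非 HAZARD 官方实现)
from typing import List

def ask_llm(prompt: str) -> str:
    """假设的 LLM 调用接口,返回一个动作名称;实际可替换为任意对话模型 API。"""
    raise NotImplementedError

def choose_action(observation: str, actions: List[str], history: List[str]) -> str:
    # 把当前观测、可选动作与最近几步历史拼成提示词,让 LLM 做常识推理并选择动作
    prompt = (
        "You are an embodied agent in a dynamically changing environment.\n"
        f"Recent history: {history[-5:]}\n"
        f"Observation: {observation}\n"
        f"Available actions: {actions}\n"
        "Reply with exactly one action name."
    )
    reply = ask_llm(prompt).strip()
    # LLM 的输出可能不在动作列表中,此时回退到第一个可用动作
    return reply if reply in actions else actions[0]
```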


标题: OWQ: Lessons learned from activation outliers for weight quantization in large language models

作者: Changhun Lee, Jungyu Jin, Taesu Kim

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2306.02272v3

GitHub: https://github.com/xvyaward/owq|

中文摘要: 拥有数千亿参数的大型语言模型(LLMs)需要强大的服务器级GPU进行推理,限制了其实际部署。为应对这一挑战,我们提出了离群值感知权重量化(OWQ)方法,旨在通过低精度表示最小化LLM的内存占用。OWQ优先处理对量化敏感的一小部分结构化权重,以高精度存储它们,同时对其余稠密权重应用高度调优的量化。这种敏感度感知的混合精度方案显著降低了量化误差,大量实验表明,使用OWQ的3.1比特模型与使用OPTQ优化的4比特模型性能相当。此外,OWQ还结合了一种称为弱列调优(WCT)的参数高效微调方法,能够在优化后的格式下以极小的内存开销实现精确的任务自适应。OWQ在灵活性、效率和实用性方面显著推进了LLM优化的研究。源代码可在 https://github.com/xvyaward/owq 获取。

摘要: Large language models (LLMs) with hundreds of billions of parameters require powerful server-grade GPUs for inference, limiting their practical deployment. To address this challenge, we introduce the outlier-aware weight quantization (OWQ) method, which aims to minimize LLM’s footprint through low-precision representation. OWQ prioritizes a small subset of structured weights sensitive to quantization, storing them in high-precision, while applying highly tuned quantization to the remaining dense weights. This sensitivity-aware mixed-precision scheme reduces the quantization error notably, and extensive experiments demonstrate that 3.1-bit models using OWQ perform comparably to 4-bit models optimized by OPTQ. Furthermore, OWQ incorporates a parameter-efficient fine-tuning for task-specific adaptation, called weak column tuning (WCT), enabling accurate task-specific LLM adaptation with minimal memory overhead in the optimized format. OWQ represents a notable advancement in the flexibility, efficiency, and practicality of LLM optimization literature. The source code is available at https://github.com/xvyaward/owq
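
下面用几行 PyTorch 给出"离群值感知的混合精度量化"这一思路的最小示意(非 OWQ 官方实现;act_scale 表示各输入通道的激活幅度统计,n_bits、n_keep 均为示例取值):把对量化最敏感的少数列保留为 FP16,其余稠密权重做低比特均匀量化。

```python
# 最小示意:离群值感知的混合精度权重量化(思路示意,非 OWQ 官方实现)
import torch

def owq_like_quantize(W: torch.Tensor, act_scale: torch.Tensor,
                      n_bits: int = 3, n_keep: int = 8):
    """W: [out, in] 线性层权重;act_scale: [in] 每个输入通道的激活幅度统计(假设已离线收集)。"""
    # 1) 用激活离群值幅度估计各输入列的量化敏感度,选出最敏感的 n_keep 列
    keep_idx = torch.topk(act_scale, k=n_keep).indices
    keep_mask = torch.zeros(W.shape[1], dtype=torch.bool)
    keep_mask[keep_idx] = True

    # 2) 敏感列以较高精度(fp16)单独保存
    W_keep = W[:, keep_mask].half()

    # 3) 其余稠密权重做逐行 min-max 均匀量化到 n_bits
    W_rest = W[:, ~keep_mask]
    qmax = 2 ** n_bits - 1
    w_min = W_rest.min(dim=1, keepdim=True).values
    w_max = W_rest.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((W_rest - w_min) / scale), 0, qmax).to(torch.uint8)
    return {"q": q, "scale": scale, "zero": w_min, "W_keep": W_keep, "keep_idx": keep_idx}
```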


标题: Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement

作者: Nikolaos Gkanatsios, Ayush Jain, Zhou Xian

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2304.14391v4

Project: https://ebmplanner.github.io.|

摘要: Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then re-locate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets are publicly available on our website: https://ebmplanner.github.io.
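
摘要中提到"对指令中每个语言谓词对应的能量函数之和做梯度下降来生成目标场景配置"。下面用两个手写的能量函数给出这一思路的最小示意(非论文官方实现,能量函数形式与超参数均为假设):

```python
# 最小示意:对"能量函数之和"做梯度下降求目标物体布局(思路示意,非论文官方实现)
import torch

def energy_left_of(xy_a, xy_b, margin=0.2):
    # 约束 a 在 b 左侧:当 x_a + margin > x_b 时产生正能量
    return torch.relu(xy_a[0] + margin - xy_b[0])

def energy_near(xy_a, xy_b, dist=0.3):
    # 约束 a 靠近 b:与目标距离的平方误差
    return (torch.norm(xy_a - xy_b) - dist) ** 2

# 两个物体的 2D 位置作为可优化变量
pos = torch.randn(2, 2, requires_grad=True)
opt = torch.optim.Adam([pos], lr=0.05)

for _ in range(200):
    # 指令"把 A 放在 B 左边并靠近 B"对应两个能量项之和
    e = energy_left_of(pos[0], pos[1]) + energy_near(pos[0], pos[1])
    opt.zero_grad()
    e.backward()
    opt.step()

print(pos.detach())  # 得到的目标位置可交给下游的放置策略去执行
```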


标题: VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

作者: Xuyang Liu, Siteng Huang, Yachen Kang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2309.01141v4

GitHub: https://github.com/xuyang-liu16/VGDiffZero.|

中文摘要: 大规模文本到图像扩散模型通过利用预训练中获得的强大视觉-语言对齐能力,在生成任务上展现了令人印象深刻的表现。然而,大多数视觉-语言判别任务需要在精心标注的数据集上进行大量微调才能获得这种对齐,耗费大量时间和计算资源。在这项工作中,我们探索在不进行任何微调、不使用额外训练数据的情况下,将预训练的生成式扩散模型直接应用于视觉定位(visual grounding)这一具有挑战性的判别任务。具体来说,我们提出了VGDiffZero,一个简单而有效的、基于文本到图像扩散模型的零样本视觉定位框架。我们还设计了一种综合的区域评分方法,同时考虑每个独立候选区域的全局与局部上下文。在RefCOCO、RefCOCO+和RefCOCOg上的大量实验表明,VGDiffZero在零样本视觉定位上取得了强劲的性能。我们的代码可在 https://github.com/xuyang-liu16/VGDiffZero 获取。

摘要: Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully-labeled datasets to acquire such alignment, with great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning and additional training dataset. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method considering both global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.
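
下面给出"用扩散模型的去噪误差为候选区域打分"这一零样本定位思路的最小示意(非官方实现;unet、region_latent、text_emb 均为假设的接口与输入,加噪过程也做了简化,未按真实噪声调度加权):

```python
# 最小示意:用文本到图像扩散模型的去噪误差给候选区域打分(接口均为假设)
import torch

def score_proposal(region_latent, text_emb, unet, num_trials=8):
    """region_latent: 单个候选区域编码后的潜变量;unet(z_t, t, text_emb) 预测噪声(假设接口)。"""
    errors = []
    for _ in range(num_trials):
        t = torch.randint(1, 1000, (1,))
        noise = torch.randn_like(region_latent)
        z_t = region_latent + noise        # 简化的加噪;实际应按噪声调度加权
        pred = unet(z_t, t, text_emb)
        errors.append(torch.mean((pred - noise) ** 2))
    return -torch.stack(errors).mean()     # 去噪误差越小,说明该区域与文本越匹配

def ground(proposal_latents, text_emb, unet):
    scores = [score_proposal(z, text_emb, unet) for z in proposal_latents]
    return int(torch.argmax(torch.stack(scores)))   # 返回得分最高的候选框索引
```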


标题: AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ

作者: Jonas Belouadi, Anne Lauscher, Steffen Eger

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2310.00367v2

GitHub: https://github.com/potamides/AutomaTikZ|

中文摘要: 从文本生成位图图像已经获得了相当多的关注,但对于科学图表,人们通常更偏好矢量图。鉴于矢量图通常用低级图形原语编码,直接生成它们十分困难。为解决这一问题,我们建议使用TikZ(一种著名的、可编译为矢量图的抽象图形语言)作为科学图表的中间表示。TikZ提供面向人类的高级命令,便于用任意大型语言模型进行条件语言建模。为此,我们引入了DaTikZ,这是首个大规模TikZ数据集,包含12万幅与图注对齐的TikZ绘图。我们在DaTikZ上微调了LLaMA,以及我们的新模型CLiMA(在LLaMA基础上加入多模态CLIP嵌入)。在人工和自动评估中,CLiMA和LLaMA在与人类绘制图形的相似度上优于商业模型GPT-4和Claude 2,CLiMA还进一步改善了文本-图像对齐。我们的详细分析表明,所有模型都具有良好的泛化能力,且不易出现记忆(memorization)现象。不过,与人类和我们的模型相比,GPT-4和Claude 2往往生成更简单的图形。我们公开了框架AutomaTikZ以及模型权重和数据集。

摘要: Generating bitmap graphics from text has gained considerable attention, yet for scientific figures, vector graphics are often preferred. Given that vector graphics are typically encoded using low-level graphics primitives, generating them directly is difficult. To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures. TikZ offers human-oriented, high-level commands, thereby facilitating conditional language modeling with any large language model. To this end, we introduce DaTikZ, the first large-scale TikZ dataset consisting of 120k TikZ drawings aligned with captions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings. In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures, with CLiMA additionally improving text-image alignment. Our detailed analysis shows that all models generalize well and are not susceptible to memorization. GPT-4 and Claude 2, however, tend to generate more simplistic figures compared to both humans and our models. We make our framework, AutomaTikZ, along with model weights and datasets, publicly available.
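
若想体验"从图注生成 TikZ"的流程,下面是用 Hugging Face transformers 加载一个因果语言模型并生成 TikZ 源码的最小示意(模型路径仅为占位,实际权重与推荐用法请以官方 GitHub 仓库为准):

```python
# 最小示意:用因果语言模型根据图注生成 TikZ 代码(模型路径为占位,非官方示例)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/automatikz-finetuned-llama"   # 占位:实际权重见官方仓库
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

caption = "A bar chart comparing three methods on two benchmarks."
inputs = tok(caption, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.8)
tikz_code = tok.decode(out[0], skip_special_tokens=True)
print(tikz_code)   # 生成的 TikZ 源码可用 pdflatex 编译为矢量图
```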


标题: SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

作者: Xupeng Miao, Gabriele Oliaro, Zhihao Zhang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2305.09781v3

GitHub: https://github.com/flexflow/FlexFlow/|

中文摘要: 本文介绍了SpecInfer,一个利用基于树的推测推理与验证来加速生成式大型语言模型(LLM)服务的系统。SpecInfer的核心思想是利用小型推测模型预测LLM的输出;这些预测被组织成一棵token树,其每个节点代表一个候选token序列。借助一种新颖的基于树的并行解码机制,token树所表示的所有候选token序列的正确性可以由LLM并行验证。SpecInfer将LLM用作token树验证器而非增量解码器,在可证明地保持模型质量的同时,显著降低了服务生成式LLM的端到端延迟和计算需求。评估表明,在保持相同生成效果的前提下,SpecInfer在分布式LLM推理上比现有LLM服务系统快1.5-2.8倍,在基于卸载(offloading)的LLM推理上快2.6-3.5倍。SpecInfer已开源,见 https://github.com/flexflow/FlexFlow/ 。

摘要: This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM’s outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/
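
下面把"小模型提议、大模型一次前向并行验证"的推测解码思路简化为单序列版本做最小示意(论文实际使用 token 树与树形并行解码;llm、draft 假设为返回 logits 的可调用对象,并假定 batch 为 1):

```python
# 最小示意:草稿模型提议 + 大模型并行验证的推测解码(简化为单序列、batch=1,非 SpecInfer 官方实现)
import torch

@torch.no_grad()
def speculative_step(llm, draft, input_ids, k=4):
    """llm/draft: 形如 f(input_ids)->logits 的模型(假设接口);每步让草稿模型贪心提议 k 个 token。"""
    proposal = input_ids
    for _ in range(k):
        nxt = draft(proposal)[:, -1].argmax(dim=-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)

    # 大模型一次前向即可并行验证 k 个提议位置,这正是端到端延迟收益的来源
    logits = llm(proposal[:, :-1])
    verified = logits[:, -k:].argmax(dim=-1)                      # 大模型在各提议位置的贪心 token
    proposed = proposal[:, -k:]
    match = (verified == proposed).long().cumprod(dim=-1)         # 第一个不匹配之后全部丢弃
    n_accept = int(match.sum())
    out = torch.cat([input_ids, proposed[:, :n_accept]], dim=-1)  # 接受匹配的前缀
    if n_accept < k:
        out = torch.cat([out, verified[:, n_accept:n_accept + 1]], dim=-1)  # 用大模型 token 修正一位
    return out
```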


== CLIP @ Visual transformers @ VLM @ visual model ==

标题: Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement

作者: Nikolaos Gkanatsios, Ayush Jain, Zhou Xian

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2304.14391v4

Project: https://ebmplanner.github.io.|

中文摘要: 语言具有组合性;一条指令可以表达机器人需要重新排列的场景中多个物体之间的多重关系约束。本工作的重点是一个可接受指令的场景重排框架,它能够泛化到更长的指令以及训练时从未见过的空间概念组合。我们提出用定义在物体相对排列上的能量函数来表示语言指定的空间概念。语言解析器将指令映射到相应的能量函数,开放词汇的视觉-语言模型则将这些函数的参数关联(grounding)到场景中的相关物体。我们对指令中每个语言谓词对应的能量函数之和做梯度下降,从而生成目标场景布局。随后,基于局部视觉的策略将物体移动到推断出的目标位置。我们在已有的指令引导操作基准以及我们新引入的组合指令基准上测试了该模型。结果表明,我们的模型能够在仿真和真实世界中零样本地执行高度组合化的指令,并大幅优于语言到动作的反应式策略和大型语言模型规划器,尤其是在涉及多个空间概念组合的长指令上。仿真与真实机器人执行视频以及代码和数据集已公开在我们的网站:https://ebmplanner.github.io 。

摘要: Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then re-locate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets are publicly available on our website: https://ebmplanner.github.io.


标题: VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

作者: Xuyang Liu, Siteng Huang, Yachen Kang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2309.01141v4

GitHub: https://github.com/xuyang-liu16/VGDiffZero.|

中文摘要: 大规模文本到图像扩散模型通过利用预训练中获得的强大视觉-语言对齐能力,在生成任务上展现了令人印象深刻的表现。然而,大多数视觉-语言判别任务需要在精心标注的数据集上进行大量微调才能获得这种对齐,耗费大量时间和计算资源。在这项工作中,我们探索在不进行任何微调、不使用额外训练数据的情况下,将预训练的生成式扩散模型直接应用于视觉定位(visual grounding)这一具有挑战性的判别任务。具体来说,我们提出了VGDiffZero,一个简单而有效的、基于文本到图像扩散模型的零样本视觉定位框架。我们还设计了一种综合的区域评分方法,同时考虑每个独立候选区域的全局与局部上下文。在RefCOCO、RefCOCO+和RefCOCOg上的大量实验表明,VGDiffZero在零样本视觉定位上取得了强劲的性能。我们的代码可在 https://github.com/xuyang-liu16/VGDiffZero 获取。

摘要: Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully-labeled datasets to acquire such alignment, with great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning and additional training dataset. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method considering both global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.


标题: AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ

作者: Jonas Belouadi, Anne Lauscher, Steffen Eger

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2310.00367v2

GitHub: https://github.com/potamides/AutomaTikZ|

中文摘要: 从文本生成位图图像已经获得了相当多的关注,但对于科学图表,人们通常更偏好矢量图。鉴于矢量图通常用低级图形原语编码,直接生成它们十分困难。为解决这一问题,我们建议使用TikZ(一种著名的、可编译为矢量图的抽象图形语言)作为科学图表的中间表示。TikZ提供面向人类的高级命令,便于用任意大型语言模型进行条件语言建模。为此,我们引入了DaTikZ,这是首个大规模TikZ数据集,包含12万幅与图注对齐的TikZ绘图。我们在DaTikZ上微调了LLaMA,以及我们的新模型CLiMA(在LLaMA基础上加入多模态CLIP嵌入)。在人工和自动评估中,CLiMA和LLaMA在与人类绘制图形的相似度上优于商业模型GPT-4和Claude 2,CLiMA还进一步改善了文本-图像对齐。我们的详细分析表明,所有模型都具有良好的泛化能力,且不易出现记忆(memorization)现象。不过,与人类和我们的模型相比,GPT-4和Claude 2往往生成更简单的图形。我们公开了框架AutomaTikZ以及模型权重和数据集。

摘要: Generating bitmap graphics from text has gained considerable attention, yet for scientific figures, vector graphics are often preferred. Given that vector graphics are typically encoded using low-level graphics primitives, generating them directly is difficult. To address this, we propose the use of TikZ, a well-known abstract graphics language that can be compiled to vector graphics, as an intermediate representation of scientific figures. TikZ offers human-oriented, high-level commands, thereby facilitating conditional language modeling with any large language model. To this end, we introduce DaTikZ, the first large-scale TikZ dataset consisting of 120k TikZ drawings aligned with captions. We fine-tune LLaMA on DaTikZ, as well as our new model CLiMA, which augments LLaMA with multimodal CLIP embeddings. In both human and automatic evaluation, CLiMA and LLaMA outperform commercial GPT-4 and Claude 2 in terms of similarity to human-created figures, with CLiMA additionally improving text-image alignment. Our detailed analysis shows that all models generalize well and are not susceptible to memorization. GPT-4 and Claude 2, however, tend to generate more simplistic figures compared to both humans and our models. We make our framework, AutomaTikZ, along with model weights and datasets, publicly available.


标题: The Neglected Tails of Vision-Language Models

作者: Shubham Parashar, Zhiqiu Lin, Tian Liu

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12425v1

Project: https://shubhamprshr27.github.io/neglected-tails-of-vlms/|

中文摘要: 视觉-语言模型(VLMs)在零样本识别上表现出色,但在不同视觉概念上的性能极不均衡。例如,尽管CLIP在ImageNet上具有令人印象深刻的平均零样本准确率(72.7%),但在十个概念(如gyromitra和night snake)上的准确率不足10%,原因很可能是这些概念在VLM不均衡的预训练数据中代表性不足。然而,评估这种不均衡并不容易,因为在VLM的大规模预训练数据中统计特定概念的出现频率并非易事。我们的工作首次尝试通过分析预训练文本来度量概念频率。我们利用现成的语言模型帮助统计包含给定概念同义词的相关文本,并消解语言歧义。我们证实,像LAION这样流行的VLM预训练数据集确实呈现长尾的概念分布,且其与逐类准确率高度相关。此外,视觉聊天机器人、文本到图像生成器等当代多模态系统,同样在我们方法识别出的稀有概念上表现不佳。为缓解VLM在零样本识别中的不均衡性能,我们提出了检索增强学习(REtrieval-Augmented Learning,REAL)。首先,REAL不使用原始类名来提示VLM,而是使用其在VLM预训练文本中最常出现的同义词。仅此一项就在九个基准数据集上超过了人工设计和LLM生成的提示,原因可能在于VLM见过更多与常用同义词相关联的图像。其次,REAL利用全部概念同义词检索出一个小型、类别均衡的预训练数据子集,用于训练鲁棒的分类器。REAL超越了最近的检索增强方案REACT,而存储空间减少400倍、训练时间减少10,000倍!

摘要: Vision-language models (VLMs) excel in zero-shot recognition but exhibit drastically imbalanced performance across visual concepts. For example, CLIP, despite an impressive mean zero-shot accuracy on ImageNet (72.7%), yields <10% on ten concepts (e.g., gyromitra and night snake), presumably, because these concepts are under-represented in VLMs’ imbalanced pretraining data. Yet, assessing this imbalance is challenging as it is non-trivial to calculate the frequency of specific concepts within VLMs’ large-scale pretraining data. Our work makes the first attempt to measure the concept frequency by analyzing pretraining texts. We use off-the-shelf language models to help count relevant texts that contain synonyms of the given concepts and resolve linguistic ambiguity. We confirm that popular VLM datasets like LAION indeed exhibit long-tailed concept distributions, which strongly correlate with per-class accuracies. Further, contemporary multimodal systems, e.g., visual chatbots and text-to-image generators, also struggle with the rare concepts identified by our method. To mitigate VLMs’ imbalanced performance in zero-shot recognition, we propose REtrieval-Augmented Learning REAL. First, instead of prompting VLMs using the original class names, REAL uses their most frequent synonyms found in VLMs’ pretraining texts. This already outperforms human-engineered and LLM-generated prompts over nine benchmark datasets, likely because VLMs have seen more images associated with the frequently used synonyms. Second, REAL uses all the concept synonyms to retrieve a small, class-balanced set of pretraining data to train a robust classifier. REAL surpasses the recent retrieval-augmented solution REACT, using 400x less storage and 10,000x less training time!
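
REAL 第一步的核心是"用预训练文本中最常出现的同义词替换原始类名来构造提示"。下面是该思路的最小示意(非官方实现:这里只做朴素的子串匹配统计,论文中还借助 LLM 消解一词多义):

```python
# 最小示意:统计预训练文本中各同义词的出现频率,并用最常见同义词构造提示(思路示意)
from collections import Counter

def most_frequent_synonym(captions, synonyms):
    """captions: 预训练图文对中的文本列表;synonyms: 某个类别的同义词列表。"""
    counts = Counter()
    for text in captions:
        lowered = text.lower()
        for s in synonyms:
            if s.lower() in lowered:    # 简化的匹配;真实数据上还需消歧与去重
                counts[s] += 1
    return counts.most_common(1)[0][0] if counts else synonyms[0]

captions = ["a photo of a false morel mushroom", "gyromitra in the forest"]
syns = ["gyromitra", "false morel"]
best = most_frequent_synonym(captions, syns)
print(f"a photo of a {best}")   # 用更常见的同义词替换原类名来构造零样本分类提示
```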


标题: Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

作者: Lihe Yang, Bingyi Kang, Zilong Huang

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.10891v1

Project: https://depth-anything.github.io|

GitHub: https://github.com/LiheYoung/Depth-Anything.|

中文摘要: 这项工作提出了Depth Anything,一个非常实用的鲁棒单目深度估计解决方案。我们不追求新的技术模块,而是致力于构建一个简单而强大的基础模型,以应对任何场景下的任何图像。为此,我们设计了一个数据引擎来收集并自动标注大规模无标注数据(约6200万张),从而扩大数据覆盖范围,进而降低泛化误差。我们研究了两种简单而有效的策略,使数据规模化变得可行。首先,利用数据增强工具构造更具挑战性的优化目标,迫使模型主动寻求额外的视觉知识并学习鲁棒的表示。其次,引入一个辅助监督,使模型继承预训练编码器中丰富的语义先验。我们在六个公开数据集和随机拍摄的照片上广泛评估了它的零样本能力,结果显示出令人印象深刻的泛化性。此外,利用NYUv2和KITTI的度量深度信息进行微调后,模型刷新了SOTA。更好的深度模型也带来了更好的深度条件ControlNet。我们的模型已发布在 https://github.com/LiheYoung/Depth-Anything 。

摘要: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything.
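
摘要中的辅助监督可以粗略理解为:在伪标签深度损失之外,再加一项让学生特征对齐冻结预训练编码器(语义先验)的损失。下面是这一组合损失的最小示意(非官方训练代码;alpha、L1 损失与余弦对齐的具体形式均为假设):

```python
# 最小示意:伪标签深度损失 + 语义先验对齐辅助损失(思路示意,非官方训练代码)
import torch
import torch.nn.functional as F

def total_loss(pred_depth, pseudo_depth, student_feat, frozen_feat, alpha=0.1):
    """pseudo_depth: 教师模型在无标注图像上产生的伪标签;frozen_feat: 冻结的预训练编码器特征。"""
    # 1) 无标注数据用伪标签监督深度(论文中还配合强数据增强构造更难的优化目标)
    depth_loss = F.l1_loss(pred_depth, pseudo_depth)
    # 2) 辅助监督:让学生特征继承冻结编码器的语义先验(这里用余弦相似度对齐)
    feat_loss = (1 - F.cosine_similarity(student_feat, frozen_feat, dim=-1)).mean()
    return depth_loss + alpha * feat_loss
```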


标题: Synthesizing Moving People with 3D Control

作者: Boyi Li, Jathushan Rajasegaran, Yossi Gandelsman

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.10889v1

Project: https://boyiliee.github.io/3DHM.github.io/.|

中文摘要: 在本文中,我们提出了一个基于扩散模型的框架,用于根据给定的目标3D运动序列,从单张图像为人物生成动画。我们的方法有两个核心组成部分:a)学习关于人体和服装不可见部分的先验;b)以合适的服装和纹理渲染新的身体姿势。对于第一部分,我们学习一个填充式扩散模型,在给定单张图像的情况下"幻想"出人物看不见的部分。我们在纹理贴图空间上训练该模型,由于其对姿势和视角不变,因此样本效率更高。其次,我们开发了一个由3D人体姿势控制的基于扩散的渲染管线,能够生成人物新姿势的逼真渲染,包括服装、头发以及对不可见区域的合理补全。这种解耦的方法使我们生成的一系列图像既忠实于3D姿态中的目标运动,又在视觉相似性上忠实于输入图像。除此之外,3D控制还允许用各种合成相机轨迹来渲染人物。实验表明,与以往方法相比,我们的方法在生成长时间运动以及各种具有挑战性的复杂姿势方面更加稳健。更多细节请访问我们的网站:https://boyiliee.github.io/3DHM.github.io/ 。

摘要: In this paper, we present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence. Our approach has two core components: a) learning priors about invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. For the first part, we learn an in-filling diffusion model to hallucinate unseen parts of a person given a single image. We train this model on texture map space, which makes it more sample-efficient since it is invariant to pose and viewpoint. Second, we develop a diffusion-based rendering pipeline, which is controlled by 3D human poses. This produces realistic renderings of novel poses of the person, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that are faithful to the target motion in the 3D pose and, to the input image in terms of visual similarity. In addition to that, the 3D control allows various synthetic camera trajectories to render a person. Our experiments show that our method is resilient in generating prolonged motions and varied challenging and complex poses compared to prior methods. Please check our website for more details: https://boyiliee.github.io/3DHM.github.io/.


== diffusion policy @ diffusion formulation @ diffusion model ==

标题: GALA: Generating Animatable Layered Assets from a Single Scan

作者: Taeksoo Kim, Byungjun Kim, Shunsuke Saito

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12979v1

Project: https://snuvclab.github.io/gala/|

中文摘要: 我们介绍了GALA,一个以单层着装3D人体网格为输入、并将其分解为完整多层3D资产的框架。其输出可与其他资产组合,以创建具有任意姿势的新的着装人体化身。现有重建方法通常把着装人体视为单层几何体,忽略了人与发型、服装和配饰之间固有的组合性,从而限制了这些网格在下游应用中的效用。将单层网格分解为独立的图层是一项具有挑战性的任务,因为它需要为被严重遮挡的区域合成合理的几何和纹理。此外,即使分解成功,这些网格在姿势和体型上也未经标准化,难以与新的身份和姿势进行连贯的组合。为应对这些挑战,我们提出利用预训练2D扩散模型的通用知识,作为人体及其他资产的几何与外观先验。我们首先利用从多视角2D分割中提取的3D表面分割来拆分输入网格,然后使用一种新颖的姿势引导的分数蒸馏采样(SDS)损失,在姿势空间和规范空间中合成各图层缺失的几何。在完成高保真3D几何的修复(inpainting)后,我们将同样的SDS损失应用于纹理,以获得包括最初被遮挡区域在内的完整外观。经过一系列分解步骤,我们在一个按姿势和体型标准化的共享规范空间中得到多层3D资产,从而支持轻松地与新的身份组合,并以新的姿势重新驱动动画。实验表明,与现有方案相比,我们的方法在分解、规范化和组合任务上都更为有效。

摘要: We present GALA, a framework that takes as input a single-layer clothed 3D human mesh and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars with any pose. Existing reconstruction approaches often treat clothed humans as a single-layer of geometry and overlook the inherent compositionality of humans with hairstyles, clothing, and accessories, thereby limiting the utility of the meshes for downstream applications. Decomposing a single-layer mesh into separate layers is a challenging task because it requires the synthesis of plausible geometry and texture for the severely occluded regions. Moreover, even with successful decomposition, meshes are not normalized in terms of poses and body shapes, failing coherent composition with novel identities and poses. To address these challenges, we propose to leverage the general knowledge of a pretrained 2D diffusion model as geometry and appearance prior for humans and other assets. We first separate the input mesh using the 3D surface segmentation extracted from multi-view 2D segmentations. Then we synthesize the missing geometry of different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Once we complete inpainting high-fidelity 3D geometry, we also apply the same SDS loss to its texture to obtain the complete appearance including the initially occluded regions. Through a series of decomposition steps, we obtain multiple layers of 3D assets in a shared canonical space normalized in terms of poses and human shapes, hence supporting effortless composition to novel identities and reanimation with novel poses. Our experiments demonstrate the effectiveness of our approach for decomposition, canonicalization, and composition tasks compared to existing solutions.
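
论文用姿势引导的分数蒸馏采样(SDS)损失来补全缺失的几何与纹理。下面给出标准 SDS 梯度计算的简化示意(非 GALA 官方实现;unet、alphas_cumprod、cond_emb 均为假设的接口或输入,且未包含姿势引导部分):

```python
# 最小示意:Score Distillation Sampling(SDS)梯度的简化计算(接口均为假设)
import torch

def sds_grad(rendered_latent, cond_emb, unet, alphas_cumprod, w=1.0):
    """rendered_latent: 当前 3D 资产渲染并编码后的潜变量;unet(z_t, t, cond) 预测噪声(假设接口)。"""
    t = torch.randint(50, 950, (1,))
    a = alphas_cumprod[t].view(-1, *([1] * (rendered_latent.dim() - 1)))
    noise = torch.randn_like(rendered_latent)
    z_t = a.sqrt() * rendered_latent + (1 - a).sqrt() * noise   # 前向加噪
    eps_pred = unet(z_t, t, cond_emb)
    # SDS:把 (预测噪声 - 真实噪声) 作为梯度,直接回传给 3D 表示的可优化参数
    return w * (eps_pred - noise)
```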


标题: Lumiere: A Space-Time Diffusion Model for Video Generation

作者: Omer Bar-Tal, Hila Chefer, Omer Tov

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2401.12945v1

Project: https://lumiere-video.github.io/|https://www.youtube.com/watch?v=wxLr02Dz2Sc|

中文摘要: 我们介绍了Lumiere,一种文本到视频扩散模型,旨在合成呈现真实、多样且连贯运动的视频,这是视频合成中的一个关键挑战。为此,我们引入了时空U-Net架构,在模型的单次前向传递中一次性生成视频的整个时间跨度。这与现有视频模型形成对比:后者先合成相距较远的关键帧,再进行时间超分辨率,这种做法天然地难以保证全局时间一致性。通过同时部署空间和(重要的)时间上的下采样与上采样,并利用预训练的文本到图像扩散模型,我们的模型学会在多个时空尺度上处理视频,从而直接生成全帧率、低分辨率的视频。我们展示了最先进的文本到视频生成结果,并表明我们的设计可以轻松支持多种内容创作任务和视频编辑应用,包括图像到视频、视频修复(inpainting)和风格化生成。

摘要: We introduce Lumiere – a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion – a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution – an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
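
时空 U-Net 的关键在于同时对时间和空间维度做下采样与上采样。下面用一个 3D 卷积块给出"时空同时降采样"的最小示意(非 Lumiere 官方结构,通道数、核大小等均为示例):

```python
# 最小示意:同时在时间和空间维度下采样的 3D 卷积块(思路示意,非 Lumiere 官方结构)
import torch
import torch.nn as nn

class SpaceTimeDown(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        # stride=2 同时把时间帧数与空间分辨率各降一半
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=3, stride=2, padding=1)
        self.act = nn.SiLU()

    def forward(self, x):               # x: [B, C, T, H, W]
        return self.act(self.conv(x))

x = torch.randn(1, 8, 16, 64, 64)       # 16 帧、64x64 的视频特征
y = SpaceTimeDown(8, 16)(x)
print(y.shape)                          # torch.Size([1, 16, 8, 32, 32])
```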


标题: VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

作者: Xuyang Liu, Siteng Huang, Yachen Kang

PubTime: 2024-01-23

Downlink: http://arxiv.org/abs/2309.01141v4

GitHub: https://github.com/xuyang-liu16/VGDiffZero.|

中文摘要: 大规模文本到图像扩散模型通过利用预训练中获得的强大视觉-语言对齐能力,在生成任务上展现了令人印象深刻的表现。然而,大多数视觉-语言判别任务需要在精心标注的数据集上进行大量微调才能获得这种对齐,耗费大量时间和计算资源。在这项工作中,我们探索在不进行任何微调、不使用额外训练数据的情况下,将预训练的生成式扩散模型直接应用于视觉定位(visual grounding)这一具有挑战性的判别任务。具体来说,我们提出了VGDiffZero,一个简单而有效的、基于文本到图像扩散模型的零样本视觉定位框架。我们还设计了一种综合的区域评分方法,同时考虑每个独立候选区域的全局与局部上下文。在RefCOCO、RefCOCO+和RefCOCOg上的大量实验表明,VGDiffZero在零样本视觉定位上取得了强劲的性能。我们的代码可在 https://github.com/xuyang-liu16/VGDiffZero 获取。

摘要: Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully-labeled datasets to acquire such alignment, with great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding without any fine-tuning and additional training dataset. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method considering both global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.


标题: Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation

作者: Xianghui Xie, Bharat Lal Bhatnagar, Jan Eric Lenssen

PubTime: 2024-01-22

Downlink: http://arxiv.org/abs/2312.07063v2

Project: https://virtualhumans.mpi-inf.mpg.de/procigen-hdm.|https://virtualhumans.mpi-inf.mpg.de/procigen-hdm|

中文摘要: 从单张RGB图像重建3D人-物交互是一项具有挑战性的任务,现有的数据驱动方法无法泛化到精心构建的3D交互数据集之外的物体。由于人-物交互的组合性质,采集大规模真实数据来学习强交互先验和3D形状先验代价极其高昂。在本文中,我们提出了ProciGen(Procedural interaction Generation),一种程序化生成数据集的方法,所生成的数据既具有合理的交互,又具有多样的物体变化。我们在3D中生成了超过100万个人-物交互对,并利用这一大规模数据训练了HDM(Hierarchical Diffusion Model),一种无需任何模板即可重建交互中的人体与未见物体的新方法。HDM是一个以图像为条件的扩散模型,能够同时学习真实的交互以及高精度的人体和物体形状。实验表明,用ProciGen训练的HDM显著优于需要模板网格的已有方法,并且我们的数据集使训练出的方法对未见物体实例具有很强的泛化能力。我们的代码和数据将公开发布于:https://virtualhumans.mpi-inf.mpg.de/procigen-hdm 。

摘要: Reconstructing human-object interaction in 3D from a single RGB image is a challenging task and existing data driven methods do not generalize beyond the objects present in the carefully curated 3D interaction datasets. Capturing large-scale real data to learn strong interaction and 3D shape priors is very expensive due to the combinatorial nature of human-object interactions. In this paper, we propose ProciGen (Procedural interaction Generation), a method to procedurally generate datasets with both, plausible interaction and diverse object variation. We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model), a novel method to reconstruct interacting human and unseen objects, without any templates. Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes. Experiments show that our HDM trained with ProciGen significantly outperforms prior methods that requires template meshes and that our dataset allows training methods with strong generalization ability to unseen object instances. Our code and data will be publicly released at: https://virtualhumans.mpi-inf.mpg.de/procigen-hdm.


标题: Synthesizing Moving People with 3D Control

作者: Boyi Li, Jathushan Rajasegaran, Yossi Gandelsman

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.10889v1

Project: https://boyiliee.github.io/3DHM.github.io/.|

中文摘要: 在本文中,我们提出了一个基于扩散模型的框架,用于根据给定的目标3D运动序列,从单张图像为人物生成动画。我们的方法有两个核心组成部分:a)学习关于人体和服装不可见部分的先验;b)以合适的服装和纹理渲染新的身体姿势。对于第一部分,我们学习一个填充式扩散模型,在给定单张图像的情况下"幻想"出人物看不见的部分。我们在纹理贴图空间上训练该模型,由于其对姿势和视角不变,因此样本效率更高。其次,我们开发了一个由3D人体姿势控制的基于扩散的渲染管线,能够生成人物新姿势的逼真渲染,包括服装、头发以及对不可见区域的合理补全。这种解耦的方法使我们生成的一系列图像既忠实于3D姿态中的目标运动,又在视觉相似性上忠实于输入图像。除此之外,3D控制还允许用各种合成相机轨迹来渲染人物。实验表明,与以往方法相比,我们的方法在生成长时间运动以及各种具有挑战性的复杂姿势方面更加稳健。更多细节请访问我们的网站:https://boyiliee.github.io/3DHM.github.io/ 。

摘要: In this paper, we present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence. Our approach has two core components: a) learning priors about invisible parts of the human body and clothing, and b) rendering novel body poses with proper clothing and texture. For the first part, we learn an in-filling diffusion model to hallucinate unseen parts of a person given a single image. We train this model on texture map space, which makes it more sample-efficient since it is invariant to pose and viewpoint. Second, we develop a diffusion-based rendering pipeline, which is controlled by 3D human poses. This produces realistic renderings of novel poses of the person, including clothing, hair, and plausible in-filling of unseen regions. This disentangled approach allows our method to generate a sequence of images that are faithful to the target motion in the 3D pose and, to the input image in terms of visual similarity. In addition to that, the 3D control allows various synthetic camera trajectories to render a person. Our experiments show that our method is resilient in generating prolonged motions and varied challenging and complex poses compared to prior methods. Please check our website for more details: https://boyiliee.github.io/3DHM.github.io/.


标题: ActAnywhere: Subject-Aware Video Background Generation

作者: Boxiao Pan, Zhan Xu, Chun-Hao Paul Huang

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.10822v1

Project: https://actanywhere.github.io.|

中文摘要: 生成与前景主体运动相匹配的视频背景,是电影行业和视觉特效领域的一个重要问题。这项任务需要合成与前景主体的运动和外观保持一致、同时符合艺术家创作意图的背景。我们提出了ActAnywhere,一个将这一传统上需要繁琐人工的过程自动化的生成模型。我们的模型利用大规模视频扩散模型的能力,并针对此任务专门定制。ActAnywhere以前景主体分割序列作为输入,以一张描述目标场景的图像作为条件,在遵循条件帧的同时,生成具有逼真前景-背景交互的连贯视频。我们在一个大规模人-场景交互视频数据集上训练模型。大量评估证明了模型的卓越性能,显著优于基线方法。此外,我们展示了ActAnywhere能够泛化到多种分布外样本,包括非人类主体。请访问我们的项目网页:https://actanywhere.github.io 。

摘要: Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community. This task involves synthesizing background that aligns with the motion and appearance of the foreground subject, while also complies with the artist’s creative intention. We introduce ActAnywhere, a generative model that automates this process which traditionally requires tedious manual efforts. Our model leverages the power of large-scale video diffusion models, and is specifically tailored for this task. ActAnywhere takes a sequence of foreground subject segmentation as input and an image that describes the desired scene as condition, to produce a coherent video with realistic foreground-background interactions while adhering to the condition frame. We train our model on a large-scale dataset of human-scene interaction videos. Extensive evaluations demonstrate the superior performance of our model, significantly outperforming baselines. Moreover, we show that ActAnywhere generalizes to diverse out-of-distribution samples, including non-human subjects. Please visit our project webpage at https://actanywhere.github.io.


== Visual Navigation @ Visual Exploration @ VSLAM ==

标题: Tool-LMM: A Large Multi-Modal Model for Tool Agent Learning

作者: Chenyu Wang, Weixin Luo, Qianyu Chen

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.10727v1

GitHub: https://github.com/Tool-LMM/Tool-LMM.|

中文摘要: 最近,大型语言模型(LLMs)在自然语言理解和生成任务中的惊人表现,引发了将其用作中央控制器来构建智能体系统的大量探索。多项研究聚焦于将LLM与外部工具连接,以扩展应用场景。然而,当前LLM感知工具使用的能力局限于单一的文本查询,这可能导致对用户真实意图的理解产生歧义。人们期望LLM能够通过感知基于视觉或听觉的指令信息来消除这种歧义。因此,在本文中,我们提出了Tool-LMM,一个结合开源LLM与多模态编码器的系统,使训练后的LLM能够感知多模态输入指令,进而正确选择功能匹配的工具。为便于评估模型能力,我们从HuggingFace收集了一个以多模态输入工具为特色的数据集。该数据集的另一个重要特点是,由于存在相同功能和同义功能的工具,同一条指令往往对应多个潜在选项,从而为同一查询提供了更多可能的解决方案。实验表明,我们的LMM能够为多模态指令推荐合适的工具。代码和数据可在 https://github.com/Tool-LMM/Tool-LMM 获取。

摘要: Recently, the astonishing performance of large language models (LLMs) in natural language comprehension and generation tasks triggered lots of exploration of using them as central controllers to build agent systems. Multiple studies focus on bridging the LLMs to external tools to extend the application scenarios. However, the current LLMs’ perceiving tool-use ability is limited to a single text query, which may result in ambiguity in understanding the users’ real intentions. LLMs are expected to eliminate that by perceiving the visual- or auditory-grounded instructions’ information. Therefore, in this paper, we propose Tool-LMM, a system incorporating open-source LLMs and multi-modal encoders so that the learnt LLMs can be conscious of multi-modal input instruction and then select the function-matched tool correctly. To facilitate the evaluation of the model’s capability, we collect a dataset featured by consisting of multi-modal input tools from HuggingFace. Another important feature of our dataset is that our dataset also contains multiple potential choices for the same instruction due to the existence of identical functions and synonymous functions, which provides more potential solutions for the same query. The experiments reveal that our LMM is capable of recommending appropriate tools for multi-modal instructions. Codes and data are available at https://github.com/Tool-LMM/Tool-LMM.


标题: 360ORB-SLAM: A Visual SLAM System for Panoramic Images with Depth Completion Network

作者: Yichen Chen, Yiqi Pan, Ruyu Liu

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.10560v1

中文摘要: 视觉同步定位与建图(vSLAM)是计算机视觉和机器人学中的一项基础任务,对提升AR/VR应用以及视觉辅助与巡检系统的性能和效果至关重要。然而,传统vSLAM系统受限于相机的窄视场,面临特征分布稀疏、缺乏稠密深度信息等挑战。为克服这些限制,本文提出了面向全景图像、结合深度补全网络的360ORB-SLAM系统。该系统从全景图像中提取特征点,利用全景三角化模块生成稀疏深度信息,并通过深度补全网络获得稠密的全景深度图。在我们基于Carla构建的新型全景数据集上的实验结果表明,与现有单目SLAM方法相比,所提方法实现了更高的尺度精度,并有效解决了特征关联和尺度模糊的难题。深度补全网络的引入增强了系统稳定性,并减轻了动态物体对SLAM性能的影响。

摘要: To enhance the performance and effect of AR/VR applications and visual assistance and inspection systems, visual simultaneous localization and mapping (vSLAM) is a fundamental task in computer vision and robotics. However, traditional vSLAM systems are limited by the camera’s narrow field-of-view, resulting in challenges such as sparse feature distribution and lack of dense depth information. To overcome these limitations, this paper proposes a 360ORB-SLAM system for panoramic images that combines with a depth completion network. The system extracts feature points from the panoramic image, utilizes a panoramic triangulation module to generate sparse depth information, and employs a depth completion network to obtain a dense panoramic depth map. Experimental results on our novel panoramic dataset constructed based on Carla demonstrate that the proposed method achieves superior scale accuracy compared to existing monocular SLAM methods and effectively addresses the challenges of feature association and scale ambiguity. The integration of the depth completion network enhances system stability and mitigates the impact of dynamic elements on SLAM performance.
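
该系统将全景三角化得到的稀疏深度与深度补全网络输出的稠密深度结合使用;一种常见做法是用稀疏点为稠密深度求解全局尺度。下面是这一尺度对齐步骤的最小示意(属于示意性假设,并非论文中确认的具体实现):

```python
# 最小示意:用稀疏三角化深度为稠密补全深度求解全局尺度(最小二乘,思路示意)
import numpy as np

def align_scale(dense_depth, sparse_depth, mask):
    """dense_depth: 深度补全网络输出;sparse_depth: 三角化得到的稀疏深度;mask: 稀疏点有效位置(布尔数组)。"""
    d = dense_depth[mask]
    s = sparse_depth[mask]
    scale = float(np.dot(d, s) / (np.dot(d, d) + 1e-8))   # 使 ||scale*d - s||^2 最小的闭式解
    return scale * dense_depth
```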


标题: Cross-Modality Perturbation Synergy Attack for Person Re-identification

作者: Yunpeng Gong, Zhun Zhong, Zhiming Luo

PubTime: 2024-01-19

Downlink: http://arxiv.org/abs/2401.10090v2

中文摘要: 近年来,大量研究集中于解决基于RGB图像的单模态行人重识别(ReID)系统中的安全问题。然而,实际应用中更常见的跨模态场景(例如涉及红外相机拍摄图像的场景)的安全性尚未得到足够重视。跨模态ReID的主要挑战在于有效处理不同模态之间的视觉差异。例如,红外图像通常是灰度图,不像可见光图像那样包含颜色信息。现有攻击方法主要关注可见光图像模态的特征,而忽略了其他模态的特征以及不同模态间数据分布的差异,这可能削弱这些方法在跨模态图像检索中的有效性。本研究是对跨模态ReID模型安全性的首次探索,并提出了一种专为跨模态ReID设计的通用扰动攻击。该攻击利用来自不同模态数据的梯度来优化扰动,从而破坏判别器并加大模态之间的差异。我们在RegDB和SYSU这两个广泛使用的跨模态数据集上进行了实验,不仅证明了方法的有效性,也为未来增强跨模态ReID系统的鲁棒性提供了启示。

摘要: In recent years, there has been significant research focusing on addressing security concerns in single-modal person re-identification (ReID) systems that are based on RGB images. However, the safety of cross-modality scenarios, which are more commonly encountered in practical applications involving images captured by infrared cameras, has not received adequate attention. The main challenge in cross-modality ReID lies in effectively dealing with visual differences between different modalities. For instance, infrared images are typically grayscale, unlike visible images that contain color information. Existing attack methods have primarily focused on the characteristics of the visible image modality, overlooking the features of other modalities and the variations in data distribution among different modalities. This oversight can potentially undermine the effectiveness of these methods in image retrieval across diverse modalities. This study represents the first exploration into the security of cross-modality ReID models and proposes a universal perturbation attack specifically designed for cross-modality ReID. This attack optimizes perturbations by leveraging gradients from diverse modality data, thereby disrupting the discriminator and reinforcing the differences between modalities. We conducted experiments on two widely used cross-modality datasets, namely RegDB and SYSU, which not only demonstrated the effectiveness of our method but also provided insights for future enhancements in the robustness of cross-modality ReID systems.
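
下面给出"利用来自两种模态的梯度共同优化一个通用扰动"这一思路的最小示意(非论文官方实现;model 假设为输出 ReID 特征的网络,损失形式、步长与扰动预算均为示例假设):

```python
# 最小示意:利用两种模态的梯度共同优化通用扰动(思路示意,非论文官方实现)
import torch
import torch.nn.functional as F

def update_universal_perturbation(delta, rgb_batch, ir_batch, model, lr=1e-2, eps=8 / 255):
    """model(x) 返回 ReID 特征;delta 是对所有图像共享的通用扰动,约束在 L_inf 球内。"""
    delta = delta.detach().clone().requires_grad_(True)
    f_rgb = model(torch.clamp(rgb_batch + delta, 0, 1))
    f_ir = model(torch.clamp(ir_batch + delta, 0, 1))
    # 攻击目标:降低跨模态特征相似度,从而破坏跨模态检索(假设两批样本一一对应)
    loss = F.cosine_similarity(f_rgb, f_ir).mean()
    loss.backward()
    with torch.no_grad():
        delta = delta - lr * delta.grad.sign()   # 沿降低相似度的方向更新
        delta = delta.clamp(-eps, eps)           # 投影回扰动预算
    return delta.detach()
```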


专属领域论文订阅

关注{晓理紫},每日更新论文,如感兴趣,请转发给有需要的同学,谢谢支持,也欢迎提出建议。
如果觉得对你有所帮助,请关注我,每日准时为你推送最新论文。
