[晓理紫] Daily Paper Digest (with abstracts, code, or project links) --- Large Models & Diffusion Models

[晓理紫] Daily Paper Digest (with abstracts, code, or project links)
Papers are updated daily; please forward this digest to anyone who may need it.
[晓理紫]

Subscribe to papers in your specific field of interest.

Follow 晓理紫 on WeChat (VX) to receive new papers every day.

晓理紫 enjoys sharing and appreciates your support; feel free to leave a like or a comment!

Categories:

== LLM ==

Title: I am a Strange Dataset: Metalinguistic Tests for Language Models

Authors: Tristan Thrush, Jared Moore, Miguel Monares

Abstract: Statements involving metalinguistic self-reference (“This paper has six sections.”) are prevalent in many domains. Can large language models (LLMs) handle such language? In this paper, we present “I am a Strange Dataset”, a new dataset for addressing this question. There are two subtasks: generation and verification. In generation, models continue statements like “The penultimate word in this sentence is” (where a correct continuation is “is”). In verification, models judge the truth of statements like “The penultimate word in this sentence is sentence.” (false). We also provide minimally different metalinguistic non-self-reference examples to complement the main dataset by probing for whether models can handle metalinguistic language at all. The dataset is hand-crafted by experts and validated by non-expert annotators. We test a variety of open-source LLMs (7B to 70B parameters) as well as closed-source LLMs through APIs. All models perform close to chance across both subtasks and even on the non-self-referential metalinguistic control data, though we find some steady improvement with model scale. GPT-4 is the only model to consistently do significantly better than chance, and it is still only in the 60% range, while our untrained human annotators score in the 89-93% range. The dataset and evaluation toolkit are available at https://github.com/TristanThrush/i-am-a-strange-dataset.

[Download:]http://arxiv.org/abs/2401.05300v1

[GitHub:]https://github.com/TristanThrush/i-am-a-strange-dataset
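
The verification subtask lends itself to a simple log-likelihood probe: score the statement with a "true" and a "false" judgment and see which the model prefers. The sketch below is a minimal illustration, not the authors' released toolkit; the model choice (gpt2), the prompt wording, and the scoring rule are assumptions.

```python
# Minimal sketch: compare the likelihood a causal LM assigns to judging a
# metalinguistic statement true vs. false. Not the official evaluation code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder model choice
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def sequence_logprob(text: str) -> float:
    """Total log-probability of `text` under the causal LM."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item() * (ids.shape[1] - 1)   # mean NLL -> total log-prob

statement = 'The penultimate word in this sentence is "sentence."'   # gold label: false
prompt = f"Statement: {statement}\nIs the statement true or false? Answer:"
prediction = sequence_logprob(prompt + " true") > sequence_logprob(prompt + " false")
print("model judges the statement true" if prediction else "model judges the statement false")
```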


Title: AUTOACT: Automatic Agent Learning from Scratch via Self-Planning

Authors: Shuofei Qiao, Ningyu Zhang, Runnan Fang

Abstract: Language agents have achieved considerable performance on various complex tasks. Despite the incessant exploration in this field, existing language agent systems still struggle with costly, non-reproducible data reliance and face the challenge of compelling a single model for multiple functions. To this end, we introduce AutoAct, an automatic agent learning framework that does not rely on large-scale annotated data and synthetic trajectories from closed-source models (e.g., GPT-4). Given limited data with a tool library, AutoAct first automatically synthesizes planning trajectories without any assistance from humans or strong closed-source models. Then, AutoAct leverages a division-of-labor strategy to automatically differentiate based on the target task information and synthesized trajectories, producing a sub-agent group to complete the task. We conduct comprehensive experiments with different LLMs, which demonstrate that AutoAct yields better or comparable performance compared to various strong baselines. We even notice that AutoAct, when using the Llama-2-13b model, can achieve performance comparable to that of the GPT-3.5-Turbo agent. Code will be available at https://github.com/zjunlp/AutoAct.

[Download:]http://arxiv.org/abs/2401.05268v1

[GitHub:]https://github.com/zjunlp/AutoAct
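
The "division-of-labor" idea above can be pictured as a small loop in which differentiated sub-agents cooperate: one proposes the next step, one runs tools, one reflects on candidate answers. The sketch below is only a schematic of that loop; `call_llm`, the role names, and the action syntax are hypothetical stand-ins rather than the AutoAct implementation.

```python
# Schematic division-of-labor agent loop (plan / tool / reflect sub-agents).
from typing import Callable, Dict

def call_llm(role: str, prompt: str) -> str:
    """Placeholder for a fine-tuned sub-agent (e.g., one Llama-2-13b adapter per role)."""
    raise NotImplementedError

TOOLS: Dict[str, Callable[[str], str]] = {
    "search": lambda query: f"(search results for {query!r})",   # toy tool
}

def solve(task: str, max_steps: int = 5) -> str:
    history = f"Task: {task}\n"
    for _ in range(max_steps):
        step = call_llm("plan", history + "Next action (tool[arg] or Finish[answer]):")
        if step.startswith("Finish["):
            answer = step[len("Finish["):-1]
            verdict = call_llm("reflect", history + f"Proposed answer: {answer}. Acceptable?")
            if "yes" in verdict.lower():
                return answer
            history += f"Reflection: {verdict}\n"
            continue
        if "[" not in step:                      # malformed plan output
            history += f"Invalid action: {step}\n"
            continue
        tool, arg = step.split("[", 1)
        observation = TOOLS[tool.strip()](arg.rstrip("]"))
        history += f"Action: {step}\nObservation: {observation}\n"
    return "no answer"
```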


Title: HyperPIE: Hyperparameter Information Extraction from Scientific Publications

Authors: Tarek Saier, Mayumi Ohta, Takuto Asakura

Abstract: Automatic extraction of information from publications is key to making scientific knowledge machine readable at a large scale. The extracted information can, for example, facilitate academic search, decision making, and knowledge graph construction. An important type of information not covered by existing approaches is hyperparameters. In this paper, we formalize and tackle hyperparameter information extraction (HyperPIE) as an entity recognition and relation extraction task. We create a labeled data set covering publications from a variety of computer science disciplines. Using this data set, we train and evaluate BERT-based fine-tuned models as well as five large language models: GPT-3.5, GALACTICA, Falcon, Vicuna, and WizardLM. For fine-tuned models, we develop a relation extraction approach that achieves an improvement of 29% F1 over a state-of-the-art baseline. For large language models, we develop an approach leveraging YAML output for structured data extraction, which achieves an average improvement of 5.5% F1 in entity recognition over using JSON. With our best performing model we extract hyperparameter information from a large number of unannotated papers, and analyze patterns across disciplines. All our data and source code are publicly available at https://github.com/IllDepence/hyperpie

[Download:]http://arxiv.org/abs/2312.10638v2

[GitHub:]https://github.com/IllDepence/hyperpie
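
The abstract's central trick for LLMs, asking for YAML rather than JSON, is easy to picture as a prompt-and-parse step. The sketch below is a rough illustration under an assumed prompt wording and schema; `query_llm` is a placeholder, not the paper's pipeline.

```python
# Sketch of structured hyperparameter extraction via YAML output from an LLM.
import yaml  # pip install pyyaml

PROMPT_TEMPLATE = """Extract hyperparameters from the text below.
Reply with YAML only, following this schema:
entities:
  - name: <hyperparameter name>
    value: <value or null>
    context: <model or method it belongs to>

Text:
{passage}
"""

def query_llm(prompt: str) -> str:
    """Placeholder for any instruction-tuned LLM (GPT-3.5, Falcon, Vicuna, ...)."""
    raise NotImplementedError

def extract_hyperparameters(passage: str) -> list:
    reply = query_llm(PROMPT_TEMPLATE.format(passage=passage))
    try:
        data = yaml.safe_load(reply)      # YAML is forgiving of minor formatting noise
    except yaml.YAMLError:
        return []                          # fall back to an empty extraction
    return (data or {}).get("entities", [])
```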


Title: HomeRobot: Open-Vocabulary Mobile Manipulation

Authors: Sriram Yenamandra, Arun Ramachandran, Karmesh Yadav

Abstract: HomeRobot (noun): An affordable compliant robot that navigates homes and manipulates a wide range of objects in order to complete everyday tasks. Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment, and placing it in a commanded location. This is a foundational challenge for robots to be useful assistants in human environments, because it involves tackling sub-problems from across robotics: perception, language understanding, navigation, and manipulation are all essential to OVMM. In addition, integration of the solutions to these sub-problems poses its own substantial challenges. To drive research in this area, we introduce the HomeRobot OVMM benchmark, where an agent navigates household environments to grasp novel objects and place them on target receptacles. HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch to encourage replication of real-world experiments across labs. We implement both reinforcement learning and heuristic (model-based) baselines and show evidence of sim-to-real transfer. Our baselines achieve a 20% success rate in the real world; our experiments identify ways in which future research can improve performance. See videos on our website: https://ovmm.github.io/.

[Download:]http://arxiv.org/abs/2306.11565v2

[Project:]https://ovmm.github.io/


Title: CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation

Authors: Quan Tu, Shilong Fan, Zihang Tian

Abstract: Recently, the advent of large language models (LLMs) has revolutionized generative agents. Among them, Role-Playing Conversational Agents (RPCAs) attract considerable attention due to their ability to emotionally engage users. However, the absence of a comprehensive benchmark impedes progress in this field. To bridge this gap, we introduce CharacterEval, a Chinese benchmark for comprehensive RPCA assessment, complemented by a tailored high-quality dataset. The dataset comprises 1,785 multi-turn role-playing dialogues, encompassing 23,020 examples and featuring 77 characters derived from Chinese novels and scripts. It was carefully constructed, beginning with initial dialogue extraction via GPT-4, followed by rigorous human-led quality control, and enhanced with in-depth character profiles sourced from Baidu Baike. CharacterEval employs a multifaceted evaluation approach, encompassing thirteen targeted metrics on four dimensions. Comprehensive experiments on CharacterEval demonstrate that Chinese LLMs exhibit more promising capabilities than GPT-4 in Chinese role-playing conversation. Source code, data source and reward model will be publicly accessible at https://github.com/morecry/CharacterEval.

[Download:]http://arxiv.org/abs/2401.01275v2

[GitHub:]https://github.com/morecry/CharacterEval


== VLM ==

Title: Towards Online Sign Language Recognition and Translation

Authors: Ronglai Zuo, Fangyun Wei, Brian Mak

Abstract: The objective of sign language recognition is to bridge the communication gap between the deaf and the hearing. Numerous previous works train their models using the well-established connectionist temporal classification (CTC) loss. During the inference stage, the CTC-based models typically take the entire sign video as input to make predictions. This type of inference scheme is referred to as offline recognition. In contrast, while mature speech recognition systems can efficiently recognize spoken words on the fly, sign language recognition still falls short due to the lack of practical online solutions. In this work, we take the first step towards filling this gap. Our approach comprises three phases: 1) developing a sign language dictionary encompassing all glosses present in a target sign language dataset; 2) training an isolated sign language recognition model on augmented signs using both conventional classification loss and our novel saliency loss; 3) employing a sliding window approach on the input sign sequence and feeding each sign clip to the well-optimized model for online recognition. Furthermore, our online recognition model can be extended to boost the performance of any offline model, and to support online translation by appending a gloss-to-text network onto the recognition model. By integrating our online framework with the previously best-performing offline model, TwoStream-SLR, we achieve new state-of-the-art performance on three benchmarks: Phoenix-2014, Phoenix-2014T, and CSL-Daily. Code and models will be available at https://github.com/FangyunWei/SLRT

[Download:]http://arxiv.org/abs/2401.05336v1

[GitHub:]https://github.com/FangyunWei/SLRT
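
The online phase described above boils down to running an isolated-sign classifier over overlapping clips of the incoming stream. The sketch below illustrates that sliding-window loop only; the window length, stride, and the toy classifier are assumptions, not the released TwoStream-SLR code.

```python
# Sketch of sliding-window online recognition over a stream of frame features.
import torch
import torch.nn as nn

class IsolatedSignClassifier(nn.Module):
    """Stand-in for a well-optimized isolated sign recognition model."""
    def __init__(self, num_glosses: int = 2000, feat_dim: int = 512):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_glosses)   # placeholder backbone + head

    def forward(self, clip: torch.Tensor) -> torch.Tensor:   # clip: (T, feat_dim)
        return self.head(clip.mean(dim=0))                    # pooled gloss logits

def online_recognize(stream: torch.Tensor, window: int = 16, stride: int = 8):
    """Yield one gloss prediction per sliding window over the stream."""
    model = IsolatedSignClassifier().eval()
    with torch.no_grad():
        for start in range(0, stream.shape[0] - window + 1, stride):
            clip = stream[start:start + window]
            yield int(model(clip).argmax())

predictions = list(online_recognize(torch.randn(64, 512)))   # toy 64-frame stream
```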


Title: Improving generalization by mimicking the human visual diet

Authors: Spandan Madan, You Li, Mengmi Zhang

Abstract: We present a new perspective on bridging the generalization gap between biological and computer vision – mimicking the human visual diet. While computer vision models rely on internet-scraped datasets, humans learn from limited 3D scenes under diverse real-world transformations with objects in natural context. Our results demonstrate that incorporating variations and contextual cues ubiquitous in the human visual training data (visual diet) significantly improves generalization to real-world transformations such as lighting, viewpoint, and material changes. This improvement also extends to generalizing from synthetic to real-world data – all models trained with a human-like visual diet outperform specialized architectures by large margins when tested on natural image data. These experiments are enabled by our two key contributions: a novel dataset capturing scene context and diverse real-world transformations to mimic the human visual diet, and a transformer model tailored to leverage these aspects of the human visual diet. All data and source code can be accessed at https://github.com/Spandan-Madan/human_visual_diet.

[Download:]http://arxiv.org/abs/2206.07802v2

[GitHub:]https://github.com/Spandan-Madan/human_visual_diet


Title: Actor-agnostic Multi-label Action Recognition with Multi-modal Query

Authors: Anindya Mondal, Sauradip Nag, Joaquin M Prada

Abstract: Existing action recognition methods are typically actor-specific due to the intrinsic topological and apparent differences among the actors. This requires actor-specific pose estimation (e.g., humans vs. animals), leading to cumbersome model design complexity and high maintenance costs. Moreover, they often focus on learning the visual modality alone and single-label classification whilst neglecting other available information sources (e.g., class name text) and the concurrent occurrence of multiple actions. To overcome these limitations, we propose a new approach called ‘actor-agnostic multi-modal multi-label action recognition,’ which offers a unified solution for various types of actors, including humans and animals. We further formulate a novel Multi-modal Semantic Query Network (MSQNet) model in a transformer-based object detection framework (e.g., DETR), characterized by leveraging visual and textual modalities to represent the action classes better. The elimination of actor-specific model designs is a key advantage, as it removes the need for actor pose estimation altogether. Extensive experiments on five publicly available benchmarks show that our MSQNet consistently outperforms the prior arts of actor-specific alternatives on human and animal single- and multi-label action recognition tasks by up to 50%. Code is made available at https://github.com/mondalanindya/MSQNet.

[Download:]http://arxiv.org/abs/2307.10763v3

[GitHub:]https://github.com/mondalanindya/MSQNet
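
The core of the multi-modal query idea can be sketched as building one decoder query per action class from a text embedding of the class name fused with pooled video features, then reading off one multi-label logit per query. Dimensions, the fusion rule, and the tiny decoder below are assumptions for illustration, not the released MSQNet.

```python
# Sketch of class queries fused from text and video features in a DETR-style decoder.
import torch
import torch.nn as nn

class MultiModalQueryHead(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.text_proj = nn.Linear(512, dim)    # e.g., CLIP text embeddings of class names
        self.video_proj = nn.Linear(768, dim)   # e.g., tokens from a video backbone
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.classifier = nn.Linear(dim, 1)     # one logit per class query

    def forward(self, class_text_emb, video_tokens):
        # class_text_emb: (C, 512); video_tokens: (B, N, 768)
        video_tokens = self.video_proj(video_tokens)
        pooled = video_tokens.mean(dim=1, keepdim=True)                   # (B, 1, dim)
        queries = self.text_proj(class_text_emb).unsqueeze(0) + pooled    # (B, C, dim)
        decoded = self.decoder(queries, video_tokens)                     # cross-attend to video
        return self.classifier(decoded).squeeze(-1)                       # (B, C) multi-label logits

logits = MultiModalQueryHead()(torch.randn(80, 512), torch.randn(2, 49, 768))   # toy shapes
```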


Title: Prompt-aligned Gradient for Prompt Tuning

Authors: Beier Zhu, Yulei Niu, Yucheng Han

Abstract: Thanks to the large pre-trained vision-language models (VLMs) like CLIP, we can craft a zero-shot classifier by “prompt”, e.g., the confidence score of an image being “[CLASS]” can be obtained by using the VLM provided similarity measure between the image and the prompt sentence “a photo of a [CLASS]”. Therefore, prompt shows a great potential for fast adaptation of VLMs to downstream tasks if we fine-tune the prompt-based similarity measure. However, we find a common failure that improper fine-tuning may not only undermine the prompt’s inherent prediction for the task-related classes, but also for other classes in the VLM vocabulary. Existing methods still address this problem by using traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompt. We present Prompt-aligned Gradient, dubbed ProGrad, to prevent prompt tuning from forgetting the general knowledge learned from VLMs. In particular, ProGrad only updates the prompt whose gradient is aligned (or non-conflicting) to the “general direction”, which is represented as the gradient of the KL loss of the pre-defined prompt prediction. Extensive experiments demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods. Codes are available at https://github.com/BeierZhu/Prompt-align.

[Download:]http://arxiv.org/abs/2205.14865v3

[GitHub:]https://github.com/BeierZhu/Prompt-align
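
The update rule sketched in the abstract, only follow the task gradient where it does not conflict with the "general direction", can be written as a small projection step. The sketch below assumes flattened gradients of equal shape and is an illustration of the rule rather than the released implementation; in ProGrad the general direction would come from the KL loss toward the hand-crafted prompt's zero-shot prediction.

```python
# Sketch of a prompt-aligned gradient step: project out the conflicting component.
import torch

def prograd_step(g_task: torch.Tensor, g_general: torch.Tensor, lr: float = 1e-3) -> torch.Tensor:
    """Return a parameter update from the task gradient and the general-knowledge gradient."""
    gt, gg = g_task.flatten(), g_general.flatten()
    dot = torch.dot(gt, gg)
    if dot < 0:                                   # conflicting directions
        gt = gt - (dot / gg.norm() ** 2) * gg     # project onto the plane orthogonal to gg
    return (-lr * gt).view_as(g_task)
```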


Title: Generalized Logit Adjustment: Calibrating Fine-tuned Models by Removing Label Bias in Foundation Models

Authors: Beier Zhu, Kaihua Tang, Qianru Sun

Abstract: Foundation models like CLIP allow zero-shot transfer on various tasks without additional training data. Yet, the zero-shot performance is less competitive than a fully supervised one. Thus, to enhance the performance, fine-tuning and ensembling are also commonly adopted to better fit the downstream tasks. However, we argue that such prior work has overlooked the inherent biases in foundation models. Due to the highly imbalanced Web-scale training set, these foundation models are inevitably skewed toward frequent semantics, and thus the subsequent fine-tuning or ensembling is still biased. In this study, we systematically examine the biases in foundation models and demonstrate the efficacy of our proposed Generalized Logit Adjustment (GLA) method. Note that bias estimation in foundation models is challenging, as most pre-train data cannot be explicitly accessed like in traditional long-tailed classification tasks. To this end, GLA has an optimization-based bias estimation approach for debiasing foundation models. As our work resolves a fundamental flaw in the pre-training, the proposed GLA demonstrates significant improvements across a diverse range of tasks: it achieves 1.5 pp accuracy gains on ImageNet, a large average improvement (1.4-4.6 pp) on 11 few-shot datasets, and 2.4 pp gains on long-tailed classification. Code is available at https://github.com/BeierZhu/GLA.

[Download:]http://arxiv.org/abs/2310.08106v2

[GitHub:]https://github.com/BeierZhu/GLA
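
The "logit adjustment" part of GLA is straightforward to picture: subtract an estimated log class-prior from the zero-shot logits before ensembling with the fine-tuned head. The sketch below shows only that adjustment; the prior here is a made-up placeholder, whereas the paper estimates the bias by optimization.

```python
# Sketch of debiasing zero-shot logits with a log-prior before ensembling.
import torch

def adjusted_ensemble(zeroshot_logits, finetuned_logits, log_prior, tau: float = 1.0):
    """Combine debiased zero-shot logits with fine-tuned logits."""
    debiased = zeroshot_logits - tau * log_prior   # remove estimated label bias
    return debiased + finetuned_logits             # simple two-head ensemble

prior = torch.tensor([0.55, 0.25, 0.15, 0.05])     # assumed skewed class prior (toy)
logits = adjusted_ensemble(torch.randn(2, 4), torch.randn(2, 4), prior.log())
```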


Title: Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering

Authors: Rabiul Awal, Le Zhang, Aishwarya Agrawal

Abstract: In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance in contemporary Vision-Language Models (VLMs). Central to our investigation is the role of question templates in guiding VLMs to generate accurate answers. We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection. Another pivotal aspect of our study is augmenting VLMs with image captions, providing them with additional visual cues alongside direct image features in VQA tasks. Surprisingly, this augmentation significantly improves the VLMs’ performance in many cases, even though VLMs “see” the image directly! We explore chain-of-thought (CoT) reasoning and find that while standard CoT reasoning causes drops in performance, advanced methods like self-consistency can help recover it. Furthermore, we find that text-only few-shot examples enhance VLMs’ alignment with the task format, particularly benefiting models prone to verbose zero-shot answers. Lastly, to mitigate the challenges associated with evaluating free-form open-ended VQA responses using string-matching based VQA metrics, we introduce a straightforward LLM-guided pre-processing technique to adapt the model responses to the expected ground-truth answer distribution. In summary, our research sheds light on the intricacies of prompting strategies in VLMs for VQA, emphasizing the synergistic use of captions, templates, and pre-processing to enhance model efficacy.

[Download:]http://arxiv.org/abs/2306.09996v2

[GitHub:]https://github.com/rabiulcste/vqazero
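
Two of the ingredients above, a fixed question template and a caption prepended as an extra textual cue, combine into a very small prompting routine. The template wording and the `caption_model` / `vlm` callables below are assumptions for illustration, not the paper's exact prompts.

```python
# Sketch of caption-augmented VQA prompting with light answer normalization.
def build_prompt(question: str, caption: str) -> str:
    template = (
        "Context: {caption}\n"
        "Question: {question}\n"
        "Short answer:"
    )
    return template.format(caption=caption, question=question)

def answer(image, question, caption_model, vlm) -> str:
    caption = caption_model(image)                       # e.g., a BLIP-style captioner
    raw = vlm(image, build_prompt(question, caption))    # the VLM still sees the image directly
    return raw.strip().lower().rstrip(".")               # normalize before string-matching metrics
```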


== Diffusion Model ==

Title: Stimulating the Diffusion Model for Image Denoising via Adaptive Embedding and Ensembling

Authors: Tong Li, Hansen Feng, Lizhi Wang

Abstract: Image denoising is a fundamental problem in computational photography, where achieving high perception with low distortion is highly demanding. Current methods either struggle with perceptual quality or suffer from significant distortion. Recently, the emerging diffusion model has achieved state-of-the-art performance in various tasks and demonstrates great potential for image denoising. However, stimulating diffusion models for image denoising is not straightforward and requires solving several critical problems. For one thing, the input inconsistency hinders the connection between diffusion models and image denoising. For another, the content inconsistency between the generated image and the desired denoised image introduces distortion. To tackle these problems, we present a novel strategy called the Diffusion Model for Image Denoising (DMID) by understanding and rethinking the diffusion model from a denoising perspective. Our DMID strategy includes an adaptive embedding method that embeds the noisy image into a pre-trained unconditional diffusion model and an adaptive ensembling method that reduces distortion in the denoised image. Our DMID strategy achieves state-of-the-art performance on both distortion-based and perception-based metrics, for both Gaussian and real-world image denoising. The code is available at https://github.com/Li-Tong-621/DMID.

[Download:]http://arxiv.org/abs/2307.03992v3

[GitHub:]https://github.com/Li-Tong-621/DMID
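
The two ingredients named in the abstract can be pictured as follows: "adaptive embedding" treats the noisy photo as an intermediate diffusion state x_t at a timestep whose noise level matches the estimated noise, and "adaptive ensembling" averages several stochastic restorations. The sketch below is a simplified illustration of that idea; `reverse_diffuse` stands in for a pre-trained unconditional diffusion sampler and this is not the DMID code.

```python
# Sketch: embed a noisy image at a noise-matched timestep, then ensemble restorations.
import torch

def matching_timestep(sigma_est: float, alphas_cumprod: torch.Tensor) -> int:
    """Pick t whose marginal noise std sqrt((1 - a_t) / a_t) is closest to sigma_est."""
    noise_std = torch.sqrt((1 - alphas_cumprod) / alphas_cumprod)
    return int(torch.argmin((noise_std - sigma_est).abs()))

def denoise(noisy: torch.Tensor, sigma_est: float, alphas_cumprod: torch.Tensor,
            reverse_diffuse, n_ensemble: int = 4) -> torch.Tensor:
    t = matching_timestep(sigma_est, alphas_cumprod)
    x_t = torch.sqrt(alphas_cumprod[t]) * noisy            # scale the image into the x_t range
    runs = [reverse_diffuse(x_t, t) for _ in range(n_ensemble)]
    return torch.stack(runs).mean(dim=0)                    # averaging reduces distortion
```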


Title: Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation

Authors: Xiyi Chen, Marko Mihajlovic, Shaofei Wang

Abstract: Recent advances in generative diffusion models have enabled the previously unfeasible capability of generating 3D assets from a single input image or a text prompt. In this work, we aim to enhance the quality and functionality of these models for the task of creating controllable, photorealistic human avatars. We achieve this by integrating a 3D morphable model into the state-of-the-art multiview-consistent diffusion approach. We demonstrate that accurate conditioning of a generative pipeline on the articulated 3D model enhances the baseline model performance on the task of novel view synthesis from a single image. More importantly, this integration facilitates a seamless and accurate incorporation of facial expression and body pose control into the generation process. To the best of our knowledge, our proposed framework is the first diffusion model to enable the creation of fully 3D-consistent, animatable, and photorealistic human avatars from a single image of an unseen subject; extensive quantitative and qualitative evaluations demonstrate the advantages of our approach over existing state-of-the-art avatar creation models on both novel view and novel expression synthesis tasks.

[Download:]http://arxiv.org/abs/2401.04728v1

[Project:]https://xiyichen.github.io/morphablediffusion/


Title: DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection

Authors: Yunfan Ye, Kai Xu, Yuhang Huang

Abstract: Limited by the encoder-decoder architecture, learning-based edge detectors usually have difficulty predicting edge maps that satisfy both correctness and crispness. With the recent success of the diffusion probabilistic model (DPM), we found it is especially suitable for accurate and crisp edge detection since the denoising process is directly applied to the original image size. Therefore, we propose the first diffusion model for the task of general edge detection, which we call DiffusionEdge. To avoid expensive computational resources while retaining the final performance, we apply DPM in the latent space and enable the classic cross-entropy loss which is uncertainty-aware in pixel level to directly optimize the parameters in latent space in a distillation manner. We also adopt a decoupled architecture to speed up the denoising process and propose a corresponding adaptive Fourier filter to adjust the latent features of specific frequencies. With all the technical designs, DiffusionEdge can be stably trained with limited resources, predicting crisp and accurate edge maps with much fewer augmentation strategies. Extensive experiments on four edge detection benchmarks demonstrate the superiority of DiffusionEdge both in correctness and crispness. On the NYUDv2 dataset, compared to the second best, we increase the ODS, OIS (without post-processing) and AC by 30.2%, 28.1% and 65.1%, respectively. Code: https://github.com/GuHuangAI/DiffusionEdge.

[Download:]http://arxiv.org/abs/2401.02032v2

[GitHub:]https://github.com/GuHuangAI/DiffusionEdge
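
An "adaptive Fourier filter" of the kind mentioned above can be realized as a learnable per-frequency reweighting of latent features. The sketch below only illustrates that mechanism; the exact parameterization in DiffusionEdge may differ.

```python
# Sketch: FFT the latent features, reweight frequencies with learnable weights, invert.
import torch
import torch.nn as nn

class AdaptiveFourierFilter(nn.Module):
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # one learnable complex weight per channel and rfft frequency bin
        self.weight = nn.Parameter(torch.ones(channels, height, width // 2 + 1, 2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")                # complex (B, C, H, W//2+1)
        response = torch.view_as_complex(self.weight)          # learnable frequency response
        return torch.fft.irfft2(freq * response, s=x.shape[-2:], norm="ortho")

out = AdaptiveFourierFilter(64, 32, 32)(torch.randn(2, 64, 32, 32))   # toy latent tensor
```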


Title: Customize-It-3D: High-Quality 3D Creation from A Single Image Using Subject-Specific Knowledge Prior

Authors: Nan Huang, Ting Zhang, Yuhui Yuan

Abstract: In this paper, we present a novel two-stage approach that fully utilizes the information provided by the reference image to establish a customized knowledge prior for image-to-3D generation. While previous approaches primarily rely on a general diffusion prior, which struggles to yield consistent results with the reference image, we propose a subject-specific and multi-modal diffusion model. This model not only aids NeRF optimization by considering the shading mode for improved geometry but also enhances texture from the coarse results to achieve superior refinement. Both aspects contribute to faithfully aligning the 3D content with the subject. Extensive experiments showcase the superiority of our method, Customize-It-3D, outperforming previous works by a substantial margin. It produces faithful 360-degree reconstructions with impressive visual quality, making it well-suited for various applications, including text-to-3D creation.

[Download:]http://arxiv.org/abs/2312.11535v2

[Project:]https://nnanhuang.github.io/projects/customize-it-3d/


Title: Wind Noise Reduction with a Diffusion-based Stochastic Regeneration Model

Authors: Jean-Marie Lemercier, Joachim Thiemann, Raphael Koning

Abstract: In this paper we present a method for single-channel wind noise reduction using our previously proposed diffusion-based stochastic regeneration model combining predictive and generative modelling. We introduce a non-additive speech in noise model to account for the non-linear deformation of the membrane caused by the wind flow and possible clipping. We show that our stochastic regeneration model outperforms other neural-network-based wind noise reduction methods as well as purely predictive and generative models, on a dataset using simulated and real-recorded wind noise. We further show that the proposed method generalizes well by testing on an unseen dataset with real-recorded wind noise. Audio samples, data generation scripts and code for the proposed methods can be found online (https://uhh.de/inf-sp-storm-wind).

[Download:]http://arxiv.org/abs/2306.12867v2

[Project:]https://uhh.de/inf-sp-storm-wind


Title: Mitigate Replication and Copying in Diffusion Models with Generalized Caption and Dual Fusion Enhancement

Authors: Chenghao Li, Dake Chen, Yuke Zhang

Abstract: While diffusion models demonstrate a remarkable capability for generating high-quality images, their tendency to ‘replicate’ training data raises privacy concerns. Although recent research suggests that this replication may stem from the insufficient generalization of training data captions and duplication of training images, effective mitigation strategies remain elusive. To address this gap, our paper first introduces a generality score that measures the caption generality and employs a large language model (LLM) to generalize training captions. Subsequently, we leverage generalized captions and propose a novel dual fusion enhancement approach to mitigate the replication of diffusion models. Our empirical results demonstrate that our proposed methods can significantly reduce replication by 43.5% compared to the original diffusion model while maintaining the diversity and quality of generations. Code is available at https://github.com/HowardLi0816/dual-fusion-diffusion.

[Download:]http://arxiv.org/abs/2309.07254v3

[GitHub:]https://github.com/HowardLi0816/dual-fusion-diffusion
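
The caption-generalization step can be pictured as an LLM rewrite guarded by a generality check. The sketch below is only a toy illustration: `query_llm` is a placeholder and the length-based generality proxy is an assumption, not the generality score defined in the paper.

```python
# Toy sketch of LLM-based caption generalization with a crude generality check.
def query_llm(prompt: str) -> str:
    """Placeholder for any instruction-following LLM."""
    raise NotImplementedError

def generality_score(caption: str) -> float:
    """Crude proxy: shorter captions with fewer specifics score higher (assumption)."""
    return 1.0 / (1.0 + len(caption.split()))

def generalize_caption(caption: str) -> str:
    rewritten = query_llm(
        "Rewrite this image caption so it describes the scene in general terms, "
        f"dropping names and unnecessary specifics: {caption!r}"
    )
    return rewritten if generality_score(rewritten) > generality_score(caption) else caption
```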

