ACL2024优秀论文合集

https://2024.aclweb.org/program/best_papers/#best-theme-paper-awards

Best Paper Awards

最佳论文奖

Mission: Impossible Language Models

使命:不可能的语言模型

Chomsky and others have very directly claimed that large language models (LLMs) are equally capable of learning languages that are possible and impossible for humans to learn. However, there is very little published experimental evidence to support such a claim. Here, we develop a set of synthetic impossible languages of differing complexity, each designed by systematically altering English data with unnatural word orders and grammar rules. These languages lie on an impossibility continuum: at one end are languages that are inherently impossible, such as random and irreversible shuffles of English words, and on the other, languages that may not be intuitively impossible but are often considered so in linguistics, particularly those with rules based on counting word positions. We report on a wide range of evaluations to assess the capacity of GPT-2 small models to learn these uncontroversially impossible languages, and crucially, we perform these assessments at various stages throughout training to compare the learning process for each language. Our core finding is that GPT-2 struggles to learn impossible languages when compared to English as a control, challenging the core claim. More importantly, we hope our approach opens up a productive line of inquiry in which different LLM architectures are tested on a variety of impossible languages in an effort to learn more about how LLMs can be used as tools for these cognitive and typological investigations.
乔姆斯基等人曾非常直接地声称,大型语言模型(LLMs)对人类可能习得和不可能习得的语言具有同样的学习能力。然而,支持这一说法的公开实验证据非常少。在这里,我们构建了一组复杂度各异的合成"不可能语言",每种语言都是通过用不自然的词序和语法规则系统地改变英语数据而设计的。这些语言位于一个"不可能性"连续体上:一端是本质上不可能的语言,例如对英语单词进行随机且不可逆洗牌得到的语言;另一端是直觉上未必不可能、但在语言学中通常被认为不可能的语言,特别是那些规则基于单词位置计数的语言。我们报告了一系列广泛的评估,以衡量 GPT-2 small 模型学习这些公认不可能语言的能力,而且关键的是,我们在训练的不同阶段进行这些评估,以比较每种语言的学习过程。我们的核心发现是,与作为对照的英语相比,GPT-2 很难学习不可能的语言,这对上述核心主张提出了挑战。更重要的是,我们希望我们的方法能开辟一条富有成效的研究路线:在各种不可能语言上测试不同的 LLM 架构,从而更深入地了解如何将 LLMs 用作此类认知和类型学研究的工具。
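注:为了直观理解"不可能语言"的构造方式,下面给出一个示意性的 Python 草图(并非论文的官方实现,函数名与具体变换均为本文为说明而假设),演示如何通过词序反转、依赖句长的确定性洗牌或完全随机且不可逆的洗牌来系统地改变英语句子。

```python
import random

def reverse_order(tokens):
    """确定性变换:整句词序反转(可逆)。"""
    return list(reversed(tokens))

def deterministic_shuffle(tokens, seed=0):
    """依赖句长的确定性洗牌:同一长度的句子共享同一排列,可以复原。"""
    rng = random.Random(seed + len(tokens))
    idx = list(range(len(tokens)))
    rng.shuffle(idx)
    return [tokens[i] for i in idx]

def irreversible_shuffle(tokens):
    """完全随机且不可逆的洗牌:原始词序信息被破坏,对应连续体上"本质上不可能"的一端。"""
    out = tokens[:]
    random.shuffle(out)
    return out

if __name__ == "__main__":
    sent = "the cat sat on the mat".split()
    print(reverse_order(sent))
    print(deterministic_shuffle(sent))
    print(irreversible_shuffle(sent))
```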

Semisupervised Neural Proto-Language Reconstruction

半监督神经原始语言重建

Existing work implementing comparative reconstruction of ancestral languages (proto-languages) has usually required full supervision. However, historical reconstruction models are only of practical value if they can be trained with a limited amount of labeled data. We propose a semisupervised historical reconstruction task in which the model is trained on only a small amount of labeled data (cognate sets with proto-forms) and a large amount of unlabeled data (cognate sets without proto-forms). We propose a neural architecture for comparative reconstruction (DPD-BiReconstructor) incorporating an essential insight from linguists’ comparative method: that reconstructed words should not only be reconstructable from their daughter words, but also deterministically transformable back into their daughter words. We show that this architecture is able to leverage unlabeled cognate sets to outperform strong semisupervised baselines on this novel task.
现有的对祖先语言(原始语言,proto-language)进行比较重建的工作通常需要完全监督。然而,历史重建模型只有在能用有限数量的标注数据训练时才具有实用价值。我们提出了一种半监督历史重建任务,其中模型仅在少量标注数据(带有原始形式 proto-form 的同源词集)和大量未标注数据(不带原始形式的同源词集)上进行训练。我们提出了一种用于比较重建的神经架构(DPD-BiReconstructor),它融入了语言学家比较法中的一个基本洞见:被重建的词不仅应当能够从其后代语言中的词(daughter words)重建出来,还应当能够被确定性地转换回这些后代词。我们证明,该架构能够利用未标注的同源词集,在这一新任务上超越强大的半监督基线。

Why are Sensitive Functions Hard for Transformers?

为什么敏感函数对于 Transformer 来说很难?

Empirical studies have identified a range of learnability biases and limitations of transformers, such as a persistent difficulty in learning to compute simple formal languages such as PARITY, and a bias towards low-degree functions. However, theoretical understanding remains limited, with existing expressiveness theory either overpredicting or underpredicting realistic learning abilities. We prove that, under the transformer architecture, the loss landscape is constrained by the input-space sensitivity: Transformers whose output is sensitive to many parts of the input string inhabit isolated points in parameter space, leading to a low-sensitivity bias in generalization. We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers, such as their generalization bias towards low sensitivity and low degree, and difficulty in length generalization for PARITY. This shows that understanding transformers’ inductive biases requires studying not just their in-principle expressivity, but also their loss landscape.
实证研究已经发现了 Transformer 的一系列可学习性偏差和局限,例如在学习计算 PARITY 这类简单形式语言上的持续困难,以及对低次(low-degree)函数的偏好。然而,理论上的理解仍然有限:现有的表达能力理论要么高估、要么低估了实际的学习能力。我们证明,在 Transformer 架构下,损失景观(loss landscape)受到输入空间敏感度的约束:输出对输入字符串的许多部分都敏感的 Transformer,位于参数空间中的孤立点上,从而导致泛化中的低敏感度偏好。我们从理论和实证两方面说明,这一理论统一了关于 Transformer 学习能力与偏差的大量经验观察,例如其向低敏感度、低次数函数泛化的倾向,以及在 PARITY 上长度泛化的困难。这表明,理解 Transformer 的归纳偏置不仅需要研究其原则上的表达能力,还需要研究其损失景观。

注:这里的 PARITY 指判断输入比特串中 1 的个数是否为奇数的形式语言(函数),在计算机科学中也就是常说的奇偶校验。

归纳偏置是什么?归纳偏置是机器学习模型在训练和推理时内置的先验假设,这些假设帮助模型在面对新的或未知的数据时作出更好的推断。例如,卷积神经网络(CNN)具有"平移不变性"的归纳偏置,这意味着无论物体出现在图像中的什么位置,它都能被识别出来。
Transformer 的归纳偏置:Transformer 是用于自然语言处理的深度学习模型,其主要特点是"自注意力机制",该机制允许模型在处理数据时关注输入序列中不同位置之间的相关性。按照本文的结论,Transformer 的归纳偏置主要体现为在泛化时偏向低敏感度、低次数的函数。

低敏感度函数是指输出对输入中个别位置的改变不敏感的函数;低次数(low-degree)函数是指可以用低次多项式表示、复杂度较低的函数。
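注:下面用一个简短的 Python 草图(仅为示意,并非论文中的实验代码)说明为什么 PARITY 是"高敏感度"函数:翻转输入中的任何一位都会改变输出;作为对比,多数投票(majority)函数的平均敏感度要低得多。

```python
from itertools import product

def parity(bits):
    """PARITY:输入中 1 的个数为奇数时返回 1,否则返回 0。"""
    return sum(bits) % 2

def majority(bits):
    """多数函数:1 的个数超过一半时返回 1。"""
    return int(sum(bits) > len(bits) // 2)

def avg_sensitivity(f, n):
    """平均敏感度:对所有 2^n 个输入取平均,统计翻转单个比特会改变输出的位置个数。"""
    total = 0
    for bits in product([0, 1], repeat=n):
        y = f(bits)
        for i in range(n):
            flipped = list(bits)
            flipped[i] ^= 1
            if f(flipped) != y:
                total += 1
    return total / 2 ** n

if __name__ == "__main__":
    n = 7
    print("PARITY 平均敏感度:", avg_sensitivity(parity, n))     # 恒等于 n,即最大敏感度
    print("MAJORITY 平均敏感度:", avg_sensitivity(majority, n))  # 远小于 n
```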

Natural Language Satisfiability: Exploring the Problem Distribution and Evaluating Transformer-based Language Models

自然语言可满足性:探索问题分布并评估基于 Transformer 的语言模型

Efforts to apply transformer-based language models (TLMs) to the problem of reasoning in natural language have enjoyed ever-increasing success in recent years. The most fundamental task in this area to which nearly all others can be reduced is that of determining satisfiability. However, from a logical point of view, satisfiability problems vary along various dimensions, which may affect TLMs’ ability to learn how to solve them. The problem instances of satisfiability in natural language can belong to different computational complexity classes depending on the language fragment in which they are expressed. Although prior research has explored the problem of natural language satisfiability, the above-mentioned point has not been discussed adequately. Hence, we investigate how problem instances from varying computational complexity classes and having different grammatical constructs impact TLMs’ ability to learn rules of inference. Furthermore, to faithfully evaluate TLMs, we conduct an empirical study to explore the distribution of satisfiability problems.
近年来,将基于 Transformer 的语言模型(TLM)应用于自然语言推理问题的努力取得了越来越大的成功。该领域最基本的任务是判定可满足性,几乎所有其他任务都可以归约为它。然而,从逻辑的角度看,可满足性问题在多个维度上各不相同,这可能会影响 TLM 学习求解它们的能力。自然语言可满足性问题实例可以属于不同的计算复杂性类,具体取决于表达它们所用的语言片段。尽管先前的研究已经探讨过自然语言可满足性问题,但上述这一点尚未得到充分讨论。因此,我们研究了来自不同计算复杂性类、具有不同语法结构的问题实例如何影响 TLM 学习推理规则的能力。此外,为了忠实地评估 TLM,我们进行了实证研究来探索可满足性问题的分布。

Deciphering Oracle Bone Language with Diffusion Models

用扩散模型破译甲骨文

Originating from China’s Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a novel frontier for OBS decipherment, challenging traditional NLP methods that rely heavily on large textual corpora, a luxury not afforded by historical languages. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). Utilizing a conditional diffusion-based strategy, OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages. To validate its efficacy, extensive experiments were conducted on an oracle bone script dataset, with quantitative results demonstrating the effectiveness of OBSD.
甲骨文(OBS)起源于大约 3000 年前的中国商代,早于许多成熟的书写系统,是语言学史上的基石。尽管已发现数千条铭文,但仍有大量甲骨文未被破译,给这种古老的文字蒙上了一层神秘面纱。现代人工智能技术的出现为甲骨文破译开辟了新的前沿,对严重依赖大规模文本语料的传统 NLP 方法提出了挑战,而历史语言恰恰不具备这样的条件。本文提出了一种采用图像生成技术的新方法,具体而言是开发了 Oracle Bone Script Decipher(OBSD)。OBSD 利用基于条件扩散的策略生成重要的破译线索,为人工智能辅助的古文字分析开辟了新路线。为了验证其有效性,我们在甲骨文数据集上进行了大量实验,定量结果证明了 OBSD 的有效性。

Causal Estimation of Memorisation Profiles

记忆概况的因果估计

Understanding memorisation in language models has practical and societal implications, e.g., studying models’ training dynamics or preventing copyright infringements. Prior work defines memorisation as the causal effect of training with an instance on the model’s ability to predict that instance. This definition relies on a counterfactual: the ability to observe what would have happened had the model not seen that instance. Existing methods struggle to provide computationally efficient and accurate estimates of this counterfactual. Further, they often estimate memorisation for a model architecture rather than for a specific model instance. This paper fills an important gap in the literature, proposing a new, principled, and efficient method to estimate memorisation based on the difference-in-differences design from econometrics. Using this method, we characterise a model’s memorisation profile–its memorisation trends across training–by only observing its behaviour on a small set of instances throughout training. In experiments with the Pythia model suite, we find that memorisation (i) is stronger and more persistent in larger models, (ii) is determined by data order and learning rate, and (iii) has stable trends across model sizes, thus making memorisation in larger models predictable from smaller ones.
理解语言模型中的记忆现象具有实际与社会意义,例如研究模型的训练动态或防止版权侵权。先前的工作将记忆定义为:用某个实例进行训练,对模型预测该实例的能力所产生的因果效应。这一定义依赖于一个反事实:即观察"如果模型没有见过该实例会发生什么"的能力。现有方法难以对这一反事实给出既高效又准确的估计;此外,它们往往是针对某种模型架构而非特定模型实例来估计记忆。本文填补了文献中的一个重要空白,基于计量经济学中的双重差分(difference-in-differences)设计,提出了一种新的、有原则且高效的记忆估计方法。利用该方法,我们只需在整个训练过程中观察模型在一小部分实例上的行为,就能刻画其记忆概况(即训练过程中的记忆趋势)。在 Pythia 模型套件上的实验中,我们发现记忆(i)在较大的模型中更强、更持久,(ii)由数据顺序和学习率决定,(iii)在不同模型规模之间呈现稳定的趋势,从而可以用较小的模型预测较大模型中的记忆情况。
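注:下面给出双重差分思想的一个极简 Python 草图(并非论文的官方实现,变量名与数据均为假设):把"记忆效应"估计为处理组(在该训练步被见到的实例)的对数似然前后变化,减去对照组(尚未被见到的实例)同期的变化,后者近似扣除了整体训练带来的普遍提升。

```python
def did_memorisation(treated_before, treated_after, control_before, control_after):
    """双重差分估计:
    (处理组训练前后的平均对数似然变化) - (对照组同期的平均变化)。"""
    mean = lambda xs: sum(xs) / len(xs)
    delta_treated = mean(treated_after) - mean(treated_before)
    delta_control = mean(control_after) - mean(control_before)
    return delta_treated - delta_control

if __name__ == "__main__":
    # 假设的逐实例对数似然,仅作演示
    treated_before = [-3.2, -2.9, -3.5]
    treated_after  = [-1.1, -0.9, -1.4]   # 见过这些实例之后,提升明显
    control_before = [-3.1, -3.0, -3.4]
    control_after  = [-2.6, -2.5, -2.9]   # 对照组也有整体训练带来的普遍提升
    print("记忆效应估计:", did_memorisation(
        treated_before, treated_after, control_before, control_after))
```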

Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model

Aya 模型:一种指令微调的开放获取多语言语言模型

Recent breakthroughs in large language models (LLMs) have centered around a handful of data-rich languages. What does it take to broaden access to breakthroughs beyond first-class citizen languages? Our work introduces Aya, a massively multilingual generative language model that follows instructions in 101 languages of which over 50% are considered as lower-resourced. Aya outperforms mT0 and BLOOMZ on the majority of tasks while covering double the number of languages. We introduce extensive new evaluation suites that broaden the state-of-art for multilingual eval across 99 languages – including discriminative and generative tasks, human evaluation, and simulated win rates that cover both held-out tasks and in-distribution performance. Furthermore, we conduct detailed investigations on the optimal finetuning mixture composition, data pruning, as well as the toxicity, bias, and safety of our models. We open-source our instruction datasets and our model at this https URL
大型语言模型(LLMs)的最新突破集中在少数数据丰富的语言上。要把这些突破推广到"一等公民"语言之外,需要做些什么?我们的工作介绍了 Aya,这是一个大规模多语言生成式语言模型,能够遵循 101 种语言的指令,其中超过 50% 被视为低资源语言。Aya 在大多数任务上优于 mT0 和 BLOOMZ,同时覆盖的语言数量是它们的两倍。我们引入了广泛的新评估套件,把多语言评估的最新水平扩展到 99 种语言,包括判别式和生成式任务、人工评估,以及同时涵盖留出任务与分布内表现的模拟胜率。此外,我们对最优微调数据混合比例、数据修剪,以及模型的毒性、偏见和安全性进行了详细研究。我们在此链接开源了指令数据集和模型。

Best Social Impact Paper Awards

最佳社会影响力论文奖

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

约翰尼如何说服 LLMs 越狱:通过将 LLMs 人性化来重新思考用说服手段挑战 AI 安全

Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. This paper introduces a new perspective to jailbreak LLMs as human-like communicators, to explore this overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over 92% on Llama 2-7b Chat, GPT-3.5, and GPT-4 in 10 trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP, find a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive LLMs.
大多数传统的 AI 安全研究都把 AI 模型当作机器,并聚焦于由安全专家设计的、以算法为中心的攻击。随着大型语言模型(LLMs)变得越来越普及和强大,非专家用户也可能在日常交互中带来风险。本文引入了一个新视角:把 LLMs 当作类人的沟通对象来越狱,以探索日常语言交互与 AI 安全之间这一被忽视的交叉点。具体来说,我们研究如何"说服"LLMs 越狱。首先,我们提出了一个源自数十年社会科学研究的说服分类法;然后,我们应用该分类法自动生成可解释的说服性对抗提示(PAP)来越狱 LLMs。结果表明,说服显著提高了所有风险类别下的越狱成功率:在 10 次试验中,PAP 对 Llama 2-7b Chat、GPT-3.5 和 GPT-4 的攻击成功率始终超过 92%,超越了近期以算法为中心的攻击。在防御方面,我们探索了针对 PAP 的多种机制,发现现有防御存在重大缺口,并主张为高度交互的 LLMs 提供更根本的缓解措施。

DIALECTBENCH: An NLP Benchmark for Dialects, Varieties, and Closely-Related Languages

DIALECTBENCH:方言、变体和密切相关语言的 NLP 基准

Language technologies should be judged on their usefulness in real-world use cases. An often overlooked aspect in natural language processing (NLP) research and evaluation is language variation in the form of non-standard dialects or language varieties (hereafter, varieties). Most NLP benchmarks are limited to standard language varieties. To fill this gap, we propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties, which aggregates an extensive set of task-varied varieties datasets (10 text-level tasks covering 281 varieties). This allows for a comprehensive evaluation of NLP system performance on different varieties. We provide substantial proof of performance disparities between standard and non-standard language varieties, and we also identify language clusters with larger performance divergence across tasks. We believe DIALECTBENCH provides a comprehensive view of the current state of NLP for varieties and one step towards advancing it further.
语言技术应当根据其在真实世界用例中的有用性来评判。自然语言处理(NLP)研究与评估中一个经常被忽视的方面,是以非标准方言或语言变体(以下简称"变体")形式存在的语言变异。大多数 NLP 基准仅限于标准语言变体。为填补这一空白,我们提出了 DIALECTBENCH,这是首个面向语言变体的大规模 NLP 基准,它汇集了一组覆盖多种任务的变体数据集(10 个文本级任务,涵盖 281 种变体),从而可以全面评估 NLP 系统在不同变体上的表现。我们提供了标准与非标准语言变体之间存在性能差距的有力证据,并识别出了在不同任务上性能分化较大的语言簇。我们相信 DIALECTBENCH 全面呈现了面向语言变体的 NLP 现状,并朝着进一步推进该方向迈出了一步。

Having Beer after Prayer? Measuring Cultural Bias in Large Language Models

祷告后喝啤酒?测量大型语言模型中的文化偏见

As the reach of large language models (LMs) expands globally, their ability to cater to diverse cultural contexts becomes crucial. Despite advancements in multilingual capabilities, models are not designed with appropriate cultural nuances. In this paper, we show that multilingual and Arabic monolingual LMs exhibit bias towards entities associated with Western culture. We introduce CAMeL, a novel resource of 628 naturally-occurring prompts and 20,368 entities spanning eight types that contrast Arab and Western cultures. CAMeL provides a foundation for measuring cultural biases in LMs through both extrinsic and intrinsic evaluations. Using CAMeL, we examine the cross-cultural performance in Arabic of 16 different LMs on tasks such as story generation, NER, and sentiment analysis, where we find concerning cases of stereotyping and cultural unfairness. We further test their text-infilling performance, revealing the incapability of appropriate adaptation to Arab cultural contexts. Finally, we analyze 6 Arabic pre-training corpora and find that commonly used sources such as Wikipedia may not be best suited to build culturally aware LMs, if used as they are without adjustment. We will make CAMeL publicly available at: this https URL
随着大型语言模型(LM)的影响力在全球范围内扩展,它们服务不同文化背景的能力变得至关重要。尽管多语言能力取得了进步,但这些模型在设计上并未恰当考虑文化差异。在本文中,我们表明多语言 LM 和阿拉伯语单语 LM 都对与西方文化相关的实体表现出偏向。我们引入了 CAMeL,这是一个新资源,包含 628 个自然出现的提示和 20,368 个实体,涵盖八种类型,用于对比阿拉伯文化与西方文化。CAMeL 为通过外在与内在评估来度量 LM 中的文化偏见提供了基础。利用 CAMeL,我们考察了 16 个不同 LM 在故事生成、NER 和情感分析等任务上的阿拉伯语跨文化表现,发现了令人担忧的刻板印象和文化不公平案例。我们进一步测试了它们的文本填充表现,揭示出它们无法恰当地适应阿拉伯文化语境。最后,我们分析了 6 个阿拉伯语预训练语料库,发现维基百科等常用来源如果不加调整直接使用,未必最适合构建具有文化意识的 LM。我们将在此链接公开提供 CAMeL。

Best Resource Paper Awards

最佳资源论文奖

Latxa: An Open Language Model and Evaluation Suite for Basque

Latxa:巴斯克语的开放语言模型和评估套件

We introduce Latxa, a family of large language models for Basque ranging from 7 to 70 billion parameters. Latxa is based on Llama 2, which we continue pretraining on a new Basque corpus comprising 4.3M documents and 4.2B tokens. Addressing the scarcity of high-quality benchmarks for Basque, we further introduce 4 multiple choice evaluation datasets: EusProficiency, comprising 5,169 questions from official language proficiency exams; EusReading, comprising 352 reading comprehension questions; EusTrivia, comprising 1,715 trivia questions from 5 knowledge areas; and EusExams, comprising 16,774 questions from public examinations. In our extensive evaluation, Latxa outperforms all previous open models we compare to by a large margin. In addition, it is competitive with GPT-4 Turbo in language proficiency and understanding, despite lagging behind in reading comprehension and knowledge-intensive tasks. Both the Latxa family of models, as well as our new pretraining corpora and evaluation datasets, are publicly available under open licenses. Our suite enables reproducible research on methods to build LLMs for low-resource languages.
我们介绍了 Latxa,这是一个巴斯克语大型语言模型家族,参数规模从 70 亿到 700 亿不等。Latxa 基于 Llama 2,我们在一个包含 430 万份文档、42 亿个 token 的新巴斯克语语料库上对其继续预训练。针对巴斯克语缺乏高质量基准的问题,我们进一步引入了 4 个多项选择评估数据集:EusProficiency,包含来自官方语言能力考试的 5,169 道题;EusReading,包含 352 道阅读理解题;EusTrivia,包含来自 5 个知识领域的 1,715 道常识问答题;以及 EusExams,包含来自公开考试的 16,774 道题。在我们的广泛评估中,Latxa 大幅优于我们所比较的所有先前开放模型。此外,尽管在阅读理解和知识密集型任务上落后,它在语言能力和理解方面与 GPT-4 Turbo 具有竞争力。Latxa 系列模型以及我们新的预训练语料库和评估数据集均以开放许可公开发布。我们的套件使得针对低资源语言构建 LLMs 的方法研究可以被复现。

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Dolma:一个用于语言模型预训练研究的三万亿 token 开放语料库

Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.
用于训练当前性能最佳语言模型的预训练语料库的相关信息很少被公开讨论:商业模型很少详细说明其数据,甚至开放模型也常常在不附带训练数据或复现方法的情况下发布。因此,开展和推进语言建模方面的科学研究(例如理解训练数据如何影响模型能力与局限)颇具挑战。为了促进语言模型预训练的科学研究,我们整理并发布了 Dolma,这是一个包含三万亿 token 的英语语料库,由网页内容、科学论文、代码、公共领域书籍、社交媒体和百科材料等多种来源混合构建而成。我们对 Dolma 进行了详尽的记录,包括其设计原则、构建细节以及内容概要。我们还给出了 Dolma 中间状态的分析与实验结果,以分享我们在重要数据整理实践中学到的经验。最后,我们开源了数据整理工具包,以便复现我们的工作,并支持大规模数据整理的进一步研究。

AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

AppWorld:用于对交互式编码代理进行基准测试的应用程序和人员的可控世界

Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household), must not only operate multiple apps (e.g., notes, messaging, shopping app) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT4O, solves only ~49% of our ‘normal’ tasks and ~30% of ‘challenge’ tasks, while other models solve at least 16% fewer. This highlights the benchmark’s difficulty and AppWorld’s potential to push the frontiers of interactive coding agents.
处理日常数字任务(例如为家庭订购食品杂货)的自主智能体,不仅必须通过 API 操作多个应用程序(例如笔记、消息、购物应用),还必须基于与环境的交互,以迭代方式生成具有复杂控制流的丰富代码。然而,现有的工具使用基准并不够用,因为它们只覆盖需要简单 API 调用序列的任务。为弥补这一差距,我们构建了 AppWorld Engine,这是一个高质量的执行环境(6 万行代码),包含 9 个日常应用程序,可通过 457 个 API 操作,并填充了模拟约 100 个虚构用户生活的真实数字活动。随后,我们创建了 AppWorld Benchmark(4 万行代码),这是一套包含 750 个自然、多样且具有挑战性的自主智能体任务的基准,要求丰富且交互式的代码生成。它支持基于状态的单元测试进行稳健的程序化评估,允许以不同方式完成任务,同时检查意外的变化(即附带损害)。最先进的 LLM(GPT-4o)仅解决了约 49% 的"普通"任务和约 30% 的"挑战"任务,而其他模型至少再少解决 16%。这凸显了该基准的难度,以及 AppWorld 推动交互式编码智能体前沿的潜力。

Best Theme Paper Awards

最佳主题论文奖

OLMo: Accelerating the Science of Language Models

OLMo:加速语言模型科学的发展

Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation.
语言模型(LM)在 NLP 研究和商业产品中已无处不在。随着其商业重要性的飙升,最强大的模型逐渐走向封闭,被锁在专有接口之后,其训练数据、架构和开发过程的重要细节均未公开。鉴于这些细节对于科学地研究这些模型(包括其偏见和潜在风险)至关重要,我们认为研究社区必须能够获得强大且真正开放的 LM。为此,我们构建了 OLMo,一个有竞争力的、真正开放的语言模型,以支持对语言模型的科学研究。与此前大多数只发布模型权重和推理代码的工作不同,我们在发布 OLMo 的同时,还开放了训练数据以及训练和评估代码。我们希望这次发布能赋能开放研究社区,并激发新一轮创新浪潮。

Outstanding Papers

优秀论文

Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

量化侧调优:量化大型语言模型的快速且节省内存的调优

Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks. Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, which only updates a small number of trainable parameters, or attempt to reduce the memory footprint during the training phase of the finetuning. Typically, the memory footprint during finetuning stems from three contributors: model weights, optimizer states, and intermediate activations. However, existing works still require considerable memory, and none can simultaneously mitigate the memory footprint of all three sources. In this paper, we present quantized side tuning (QST), which enables memory-efficient and fast finetuning of LLMs by operating through a dual-stage process. First, QST quantizes an LLM’s model weights into 4-bit to reduce the memory footprint of the LLM’s original weights. Second, QST introduces a side network separated from the LLM, which utilizes the hidden states of the LLM to make task-specific predictions. Using a separate side network avoids performing back-propagation through the LLM, thus reducing the memory requirement of the intermediate activations. Finally, QST leverages several low-rank adaptors and gradient-free downsample modules to significantly reduce the trainable parameters, so as to save the memory footprint of the optimizer states. Experiments show that QST can reduce the total memory footprint by up to 2.3× and speed up the finetuning process by up to 3× while achieving competent performance compared with the state-of-the-art. When it comes to full finetuning, QST can reduce the total memory footprint up to 7×.
经验表明,微调大型语言模型(LLMs)在各种下游任务上都很有效。现有的 LLM 微调方法要么专注于参数高效微调(只更新少量可训练参数),要么试图减少微调训练阶段的内存占用。通常,微调期间的内存占用来自三个方面:模型权重、优化器状态和中间激活。然而,现有工作仍然需要大量内存,而且没有一种方法能同时缓解这三方面的内存占用。在本文中,我们提出了量化侧调优(quantized side tuning,QST),它通过双阶段流程实现对 LLMs 的省内存且快速的微调。首先,QST 将 LLM 的模型权重量化为 4 比特,以减少 LLM 原始权重的内存占用;其次,QST 引入了一个与 LLM 分离的侧网络,利用 LLM 的隐藏状态进行特定任务的预测。使用独立的侧网络可以避免通过 LLM 进行反向传播,从而降低中间激活的内存需求。最后,QST 利用若干低秩适配器和无梯度的下采样模块来显著减少可训练参数,从而节省优化器状态的内存占用。实验表明,与最先进的方法相比,QST 可以将总内存占用减少多达 2.3 倍,并将微调过程加速多达 3 倍,同时取得相当的性能;与完全微调相比,QST 可以将总内存占用减少多达 7 倍。
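注:下面是一个概念性的 PyTorch 草图(并非论文的官方实现):用几层冻结的线性层代替 4 比特量化的 LLM 主干,侧网络读取各层隐藏状态、经降维后做任务预测,且梯度不回传到主干;论文中的 4 比特量化和低秩适配器在此省略,所有类名与超参数均为假设。

```python
import torch
import torch.nn as nn

class FrozenBackbone(nn.Module):
    """站位用的"主干":实际中是 4 比特量化且冻结的 LLM,这里用线性层示意。"""
    def __init__(self, d=64, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(d, d) for _ in range(n_layers)])
        for p in self.parameters():
            p.requires_grad = False          # 冻结:不参与训练

    def forward(self, x):
        hidden_states = []
        for layer in self.layers:
            x = torch.relu(layer(x))
            hidden_states.append(x)
        return hidden_states                 # 返回各层隐藏状态

class SideNetwork(nn.Module):
    """侧网络:对每层隐藏状态做低维投影后汇总,再接任务头;只有这里的参数被训练。"""
    def __init__(self, d=64, d_side=16, n_layers=4, n_classes=2):
        super().__init__()
        self.down = nn.ModuleList([nn.Linear(d, d_side) for _ in range(n_layers)])
        self.head = nn.Linear(d_side, n_classes)

    def forward(self, hidden_states):
        z = 0
        for proj, h in zip(self.down, hidden_states):
            z = z + proj(h.detach())         # detach:反向传播不会穿过主干
        return self.head(z)

if __name__ == "__main__":
    backbone, side = FrozenBackbone(), SideNetwork()
    opt = torch.optim.AdamW(side.parameters(), lr=1e-3)
    x, y = torch.randn(8, 64), torch.randint(0, 2, (8,))
    loss = nn.functional.cross_entropy(side(backbone(x)), y)
    loss.backward()
    opt.step()
    print("loss:", loss.item())
```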

L-Eval: Instituting Standardized Evaluation for Long Context Language Models

L-Eval:为长上下文语言模型建立标准化评估

Recently, there has been growing interest in extending the context length of large language models (LLMs), aiming to effectively process long inputs of one turn or conversations with more extensive histories. While proprietary models such as GPT-4 and Claude can largely preserve the reasoning ability in an extended context, open-source models are still progressing through the early stages of development. To bridge this gap, we propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs) addressing two key aspects: dataset construction and evaluation metrics. On the one hand, we build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs encompassing diverse question styles, domains, and input length (3k∼200k tokens). On the other hand, we investigate the effectiveness of evaluation metrics for LCLMs. Results show that popular n-gram matching metrics generally can not correlate well with human judgment, and thus we strongly advocate for length-instruction-enhanced (LIE) evaluation and employing LLM judges. We conducted a comprehensive study of 4 popular commercial LLMs and 12 open-source counterparts using the L-Eval benchmark. Our empirical findings offer useful insights into the study of LCLMs and lay the groundwork for the development of more principled evaluation of these models.
最近,人们对扩展大型语言模型(LLMs)的上下文长度越来越感兴趣,旨在有效处理单轮的长输入或带有更长历史的对话。虽然 GPT-4 和 Claude 等专有模型能在扩展上下文中很大程度上保持推理能力,但开源模型仍处于发展的早期阶段。为弥补这一差距,我们提出 L-Eval,为长上下文语言模型(LCLM)建立更标准化的评估,着重解决两个关键方面:数据集构建与评估指标。一方面,我们构建了一个新的评估套件,包含 20 个子任务、508 篇长文档和 2,000 多个人工标注的"查询-回复"对,涵盖多样的问题风格、领域和输入长度(3k∼200k token)。另一方面,我们研究了 LCLM 评估指标的有效性。结果表明,流行的 n 元语法匹配指标通常与人类判断的相关性不佳,因此我们强烈主张采用长度指令增强(LIE)评估,并使用 LLM 作为评判者。我们使用 L-Eval 基准对 4 个流行的商业 LLMs 和 12 个开源模型进行了全面研究。我们的实证发现为 LCLM 的研究提供了有用的见解,并为今后对这些模型进行更有原则的评估奠定了基础。

Causal-Guided Active Learning for Debiasing Large Language Models

用于消除大型语言模型偏差的因果引导主动学习

Although achieving promising performance, recent analyses show that current generative large language models (LLMs) may still capture dataset biases and utilize them for generation, leading to poor generalizability and harmfulness of LLMs. However, due to the diversity of dataset biases and the over-optimization problem, previous prior-knowledge-based debiasing methods and fine-tuning-based debiasing methods may not be suitable for current LLMs. To address this issue, we explore combining active learning with the causal mechanisms and propose a causal-guided active learning (CAL) framework, which utilizes LLMs itself to automatically and autonomously identify informative biased samples and induce the bias patterns. Then a cost-effective and efficient in-context learning based method is employed to prevent LLMs from utilizing dataset biases during generation. Experimental results show that CAL can effectively recognize typical biased instances and induce various bias patterns for debiasing LLMs.
尽管取得了可喜的性能,但最近的分析表明,当前的生成式大语言模型(LLMs)仍可能捕获数据集偏差并在生成时加以利用,导致 LLMs 泛化性差且有害。然而,由于数据集偏差的多样性以及过度优化问题,以往基于先验知识的去偏方法和基于微调的去偏方法可能并不适用于当前的 LLMs。为了解决这一问题,我们探索将主动学习与因果机制相结合,提出了一个因果引导的主动学习(CAL)框架,利用 LLMs 自身来自动、自主地识别有信息量的偏差样本并归纳偏差模式。随后,我们采用一种低成本、高效的基于上下文学习的方法,防止 LLMs 在生成过程中利用数据集偏差。实验结果表明,CAL 能有效识别典型的偏差实例,并归纳出多种偏差模式,用于为 LLMs 去偏。

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

CausalGym:对语言任务的因果可解释性方法进行基准测试

Language models (LMs) have proven to be powerful tools for psycholinguistic research, but most prior work has focused on purely behavioural measures (e.g., surprisal comparisons). At the same time, research in model interpretability has begun to illuminate the abstract causal mechanisms shaping LM behavior. To help bring these strands of research closer together, we introduce CausalGym. We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of interpretability methods to causally affect model behaviour. To illustrate how CausalGym can be used, we study the pythia models (14M–6.9B) and assess the causal efficacy of a wide range of interpretability methods, including linear probing and distributed alignment search (DAS). We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena in pythia-1b: negative polarity item licensing and filler–gap dependencies. Our analysis shows that the mechanism implementing both of these tasks is learned in discrete stages, not gradually.
语言模型(LM)已被证明是心理语言学研究的有力工具,但此前的大多数工作都集中在纯行为层面的度量上(例如惊异度 surprisal 的比较)。与此同时,模型可解释性研究已开始揭示塑造 LM 行为的抽象因果机制。为了让这两条研究脉络更紧密地结合,我们推出了 CausalGym。我们改编并扩展了 SyntaxGym 任务套件,用以衡量各种可解释性方法对模型行为施加因果影响的能力。为了说明 CausalGym 的用法,我们研究了 pythia 系列模型(14M–6.9B),评估了多种可解释性方法的因果效力,包括线性探针和分布式对齐搜索(DAS)。我们发现 DAS 优于其他方法,因此用它来研究 pythia-1b 中两种困难语言现象的学习轨迹:负极性项(NPI)允准和填充语-空位(filler–gap)依存。我们的分析表明,实现这两项任务的机制是分阶段、离散地学会的,而不是逐渐学会的。

Don’t Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration

不要幻觉,要弃答:通过多 LLM 协作识别 LLM 的知识盲区

Despite efforts to expand the knowledge of large language models (LLMs), knowledge gaps – missing or outdated information in LLMs – might always persist given the evolving nature of knowledge. In this work, we study approaches to identify LLM knowledge gaps and abstain from answering questions when knowledge gaps are present. We first adapt existing approaches to model calibration or adaptation through fine-tuning/prompting and analyze their ability to abstain from generating low-confidence outputs. Motivated by their failures in self-reflection and over-reliance on held-out sets, we propose two novel approaches that are based on model collaboration, i.e., LLMs probing other LLMs for knowledge gaps, either cooperatively or competitively. Extensive experiments with three LLMs on four QA tasks featuring diverse knowledge domains demonstrate that both cooperative and competitive approaches to unveiling LLM knowledge gaps achieve up to 19.3% improvements on abstain accuracy against the strongest baseline. Further analysis reveals that our proposed mechanisms could help identify failure cases in retrieval augmentation and pinpoint knowledge gaps in multi-hop reasoning.
尽管人们努力扩展大型语言模型(LLMs)的知识,但鉴于知识不断演化的本性,知识盲区(LLMs 中缺失或过时的信息)可能始终存在。在这项工作中,我们研究如何识别 LLM 的知识盲区,并在存在知识盲区时弃答。我们首先通过微调/提示把现有的模型校准或自适应方法改造过来,并分析它们在避免生成低置信度输出方面的能力。鉴于这些方法在自我反思上的失败以及对留出集的过度依赖,我们提出了两种基于模型协作的新方法,即让 LLMs 以合作或竞争的方式互相探查知识盲区。使用三个 LLMs 在涵盖不同知识领域的四个问答任务上进行的大量实验表明,无论是合作式还是竞争式的知识盲区揭示方法,其弃答准确率都比最强基线最多提升 19.3%。进一步的分析表明,我们提出的机制有助于识别检索增强中的失败案例,并定位多跳推理中的知识盲区。

Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?

使用语音基础模型和大型语言模型的语音翻译:有什么和缺少什么?

The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future works on the topic aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.
随着基础模型的出现,自然语言处理(NLP)领域最近经历了一场变革,尤其是彻底改变了基于文本的 NLP 的大型语言模型(LLMs)。这一范式已扩展到包括语音在内的其他模态:研究者正在积极探索把语音基础模型(SFM)与 LLMs 结合为能够处理多模态任务的单一统一模型。在这些任务中,本文聚焦于语音到文本翻译(ST)。通过梳理该主题的已发表论文,我们对迄今提出的架构方案和训练策略给出了统一的视角,并突出它们之间的异同。在此基础上,我们不仅总结了经验教训,还说明了设置与评估方式的多样性如何阻碍人们为每个架构组件和训练选择确定表现最佳的方案。最后,我们对该主题的未来工作提出建议,旨在更好地理解 SFM+LLM 方案在 ST 任务上的优势与不足。

Must NLP be Extractive?

NLP 必须是抽取式的吗?

How do we roll out language technologies across a world with 7,000 languages? In one story, we scale the successes of NLP further into ‘low-resource’ languages, doing ever more with less. However, this approach does not recognise the fact that, beyond the 500 institutional languages, the remaining languages are oral vernaculars spoken by communities who use a language of wider communication to interact with the outside world. I argue that such ‘contact languages’ are the appropriate target for technologies like machine translation, and that the 6,500 oral languages must be approached differently. I share a story from an Indigenous community, where local people reshaped an extractive agenda to align with their relational agenda. I describe the emerging paradigm of relational NLP and explain how it opens the way to non-extractive methods and to solutions that enhance human agency.
我们如何在一个拥有 7,000 种语言的世界中推广语言技术?一种叙事是:把 NLP 的成功进一步扩展到"低资源"语言,用更少的资源做更多的事。然而,这种思路忽视了一个事实:在 500 种机构性语言之外,其余语言是口头方言,其使用者社区通过一种更广泛使用的通用语与外部世界交流。我认为,这类"接触语言"才是机器翻译等技术的合适目标,而那 6,500 种口头语言必须以不同的方式对待。我分享了一个来自原住民社区的故事:当地人把一份抽取式的议程重塑为符合他们自身关系性议程的形式。我描述了关系型 NLP 这一新兴范式,并解释它如何为非抽取式方法以及增强人类能动性的解决方案开辟道路。

IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators

IRCoder:中间表示使语言模型成为强大的多语言代码生成器

Code understanding and generation have fast become some of the most popular applications of language models (LMs). Nonetheless, research on multilingual aspects of Code-LMs (i.e., LMs for code generation) such as cross-lingual transfer between different programming languages, language-specific data augmentation, and post-hoc LM adaptation, alongside exploitation of data sources other than the original textual content, has been much sparser than for their natural language counterparts. In particular, most mainstream Code-LMs have been pre-trained on source code files alone. In this work, we investigate the prospect of leveraging readily available compiler intermediate representations (IR) - shared across programming languages - to improve the multilingual capabilities of Code-LMs and facilitate cross-lingual transfer.
代码理解与生成已迅速成为语言模型(LM)最流行的应用之一。尽管如此,关于 Code-LM(即用于代码生成的 LM)多语言方面的研究,例如不同编程语言之间的跨语言迁移、特定语言的数据增强、事后的 LM 适配,以及对原始文本内容之外的数据源的利用,都远比自然语言方面的研究稀少。尤其是,大多数主流 Code-LM 仅在源代码文件上进行过预训练。在这项工作中,我们研究了利用现成的、跨编程语言共享的编译器中间表示(IR)来提升 Code-LM 的多语言能力并促进跨语言迁移的前景。
To this end, we first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files coupled with respective intermediate representations. Next, starting from various base Code-LMs (ranging in size from 1.1B to 7.3B parameters), we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to (1) learn the IR language and (2) align the IR constructs with respective constructs of various programming languages. Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics, including prompt robustness, multilingual code completion, code understanding, and instruction following.
为此,我们首先构建了 SLTrans,这是一个平行数据集,由近 400 万个自包含的源代码文件及其对应的中间表示组成。接着,从多个基础 Code-LM(参数规模从 1.1B 到 7.3B 不等)出发,我们在 SLTrans 上继续进行因果语言建模训练,迫使 Code-LM(1)学习 IR 语言,并且(2)将 IR 结构与各种编程语言中的相应结构对齐。我们得到的模型(称为 IRCoder)在各种代码生成任务和指标上都显示出可观且一致的提升,包括提示鲁棒性、多语言代码补全、代码理解和指令遵循。

MultiLegalPile: A 689GB Multilingual Legal Corpus

MultiLegalPile:689GB 多语言法律语料库

Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, so far, there are few datasets available for specialized critical domains such as law and the available ones are often only for the English language. We curate and release MultiLegalPile, a 689GB corpus in 24 languages from 17 jurisdictions. The MultiLegalPile corpus, which includes diverse legal data sources with varying licenses, allows for pretraining NLP models under fair use, with more permissive licenses for the Eurlex Resources and Legal mC4 subsets. We pretrain two RoBERTa models and one Longformer multilingually, and 24 monolingual models on each of the language-specific subsets and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new SotA on LEXTREME and our English models on LexGLUE. We release the dataset, the trained models, and all of the code under the most open possible licenses.
大型、高质量的数据集对于训练大型语言模型 ( LLMs ) 至关重要。然而,到目前为止,适用于法律等专门关键领域的数据集很少,而且可用的数据集通常仅适用于英语。我们策划并发布了 MultiLegalPile,这是一个 689GB 的语料库,包含来自 17 个司法管辖区的 24 种语言。 MultiLegalPile 语料库包含具有不同许可的多种法律数据源,允许在合理使用的情况下预训练 NLP 模型,并为 Eurlex Resources 和 Legal mC4 子集提供更宽松的许可。我们对两个 RoBERTa 模型和一个 Longformer 进行多语言预训练,并对每个特定语言子集预训练 24 个单语言模型,并在 LEXTREME 上对其进行评估。此外,我们还在 LexGLUE 上评估英语和多语言模型。我们的多语言模型在 LEXTREME 上创造了新的 SotA,我们的英语模型在 LexGLUE 上创造了新的 SotA。我们在最开放的许可下发布数据集、经过训练的模型以及所有代码。

PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety

PsySafe:基于心理的多智能体系统安全攻击、防御和评估的综合框架

Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence. However, the potential misuse of this intelligence for malicious purposes presents significant risks. To date, comprehensive research on the safety issues associated with multi-agent systems remains limited. In this paper, we explore these concerns through the innovative lens of agent psychology, revealing that the dark psychological states of agents constitute a significant threat to safety. To tackle these concerns, we propose a comprehensive framework (PsySafe) grounded in agent psychology, focusing on three key areas: firstly, identifying how dark personality traits in agents can lead to risky behaviors; secondly, evaluating the safety of multi-agent systems from the psychological and behavioral perspectives, and thirdly, devising effective strategies to mitigate these risks. Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents’ self-reflection when engaging in dangerous behavior, and the correlation between agents’ psychological assessments and dangerous behaviors. We anticipate that our framework and observations will provide valuable insights for further research into the safety of multi-agent systems. We make our data and code publicly accessible at https://github.com/AI4Good24/PsySafe
多智能体系统在使用大型语言模型(LLMs)增强后,会展现出深厚的集体智能能力。然而,这种智能一旦被恶意滥用,将带来重大风险。迄今为止,针对多智能体系统安全问题的全面研究仍然有限。在本文中,我们通过智能体心理学这一创新视角来探讨这些问题,揭示出智能体的黑暗心理状态对安全构成重大威胁。为了解决这些问题,我们提出了一个基于智能体心理学的综合框架(PsySafe),聚焦三个关键方面:第一,识别智能体的黑暗人格特质如何导致危险行为;第二,从心理与行为两个角度评估多智能体系统的安全性;第三,设计有效的策略来缓解这些风险。我们的实验揭示了一些有趣的现象,例如智能体之间的集体危险行为、智能体在从事危险行为时的自我反思,以及智能体心理评估与危险行为之间的相关性。我们预计,我们的框架和观察将为进一步研究多智能体系统的安全性提供有价值的见解。我们在 https://github.com/AI4Good24/PsySafe 公开了数据和代码。

Can Large Language Models be Good Emotional Supporter? Mitigating Preference Bias on Emotional Support Conversation

大型语言模型可以成为良好的情感支持者吗?减轻情感支持对话的偏好偏差

Emotional Support Conversation (ESC) is a task aimed at alleviating individuals’ emotional distress through daily conversation. Given its inherent complexity and non-intuitive nature, ESConv dataset incorporates support strategies to facilitate the generation of appropriate responses. Recently, despite the remarkable conversational ability of large language models (LLMs), previous studies have suggested that they often struggle with providing useful emotional support. Hence, this work initially analyzes the results of LLMs on ESConv, revealing challenges in selecting the correct strategy and a notable preference for a specific strategy. Motivated by these, we explore the impact of the inherent preference in LLMs on providing emotional support, and consequently, we observe that exhibiting high preference for specific strategies hinders effective emotional support, aggravating its robustness in predicting the appropriate strategy. Moreover, we conduct a methodological study to offer insights into the necessary approaches for LLMs to serve as proficient emotional supporters. Our findings emphasize that (1) low preference for specific strategies hinders the progress of emotional support, (2) external assistance helps reduce preference bias, and (3) existing LLMs alone cannot become good emotional supporters. These insights suggest promising avenues for future research to enhance the emotional intelligence of LLMs.
情感支持对话(ESC)旨在通过日常对话减轻个体的情绪困扰。鉴于其固有的复杂性和非直观性,ESConv 数据集引入了支持策略,以帮助生成合适的回复。最近,尽管大型语言模型(LLMs)具有出色的对话能力,先前的研究表明它们往往难以提供有用的情感支持。因此,本工作首先分析了 LLMs 在 ESConv 上的结果,揭示出其在选择正确策略方面的困难,以及对某一特定策略的明显偏好。受此启发,我们探究了 LLMs 固有偏好对提供情感支持的影响,并观察到:对特定策略表现出高度偏好会妨碍有效的情感支持,并进一步加剧其在预测合适策略上的鲁棒性问题。此外,我们开展了一项方法学研究,以探讨 LLMs 成为称职情感支持者所需的方法。我们的发现强调:(1)对特定策略的低偏好会阻碍情感支持的进展;(2)外部辅助有助于减少偏好偏差;(3)仅靠现有的 LLMs 无法成为优秀的情感支持者。这些见解为未来提升 LLMs 情商的研究指出了有希望的方向。

Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models

政治指南针还是旋转箭头?对大型语言模型中的价值观和观点进行更有意义的评估

Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing constrained evaluation paradigm for values and opinions in LLMs and explore more realistic unconstrained evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT forces models to comply with the PCT’s multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.
最近的许多工作试图用多项选择式的调查和问卷来评估大型语言模型(LLMs)中的价值观和观点。这类工作大多出于对 LLM 实际应用的担忧,例如,带有政治偏见的 LLMs 在被数百万人使用时可能会潜移默化地影响社会。然而,这种现实担忧与当前评估的人为性形成鲜明对比:真实用户通常不会向 LLMs 提出问卷式的问题。受这种落差的驱动,我们对 LLMs 价值观和观点评估中盛行的"受限"范式提出挑战,并探索更贴近现实的"无约束"评估。作为案例研究,我们聚焦于流行的政治指南针测试(PCT)。在系统回顾中,我们发现大多数使用 PCT 的先前工作都强迫模型遵循 PCT 的多项选择格式。我们证明:在不强迫的情况下,模型会给出实质上不同的答案;答案会随着强迫方式的不同而变化;而且答案缺乏对改写(paraphrase)的鲁棒性。随后,我们证明在更贴近现实的开放式回答设置下,模型又会给出不同的答案。我们将这些发现提炼为对 LLMs 价值观和观点评估的建议与开放挑战。

Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

相同的任务,更多的标记:输入长度对大型语言模型推理性能的影响

This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite LLMs’ advancements in recent times, their performance consistency across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each being extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs’ reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that the traditional metric of next word prediction correlates negatively with performance of LLMs on our reasoning dataset. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
本文探讨了增加输入长度对大型语言模型(LLMs)能力的影响。尽管 LLMs 近来取得了进展,但它们在不同输入长度下的性能一致性仍不清楚。我们通过引入一个专门用于评估输入长度影响的新型问答推理框架来研究这一问题。我们使用同一样本的多个版本来隔离输入长度的影响,每个版本都用不同长度、类型和位置的填充内容进行扩展。结果显示,在远短于其技术上限的输入长度下,LLMs 的推理性能就出现了明显下降;而且这种下降趋势出现在数据集的每个版本中,只是程度不同。此外,我们的研究还表明,传统的下一词预测指标与 LLMs 在我们的推理数据集上的表现呈负相关。我们分析了结果并识别出一些失败模式,它们可以作为未来研究的有用指引,并有望为解决 LLMs 中观察到的这些局限提供策略参考。

Do Llamas Work in English? On the Latent Language of Multilingual Transformers

Llamas 用英语工作吗?论多语言 Transformer 的潜在语言

We ask whether multilingual language models trained on unbalanced, English-dominated corpora use English as an internal pivot language—-a question of key importance for understanding how language models function and the origins of linguistic bias. Focusing on the Llama-2 family of transformer models, our study is based on carefully constructed non-English prompts with a unique correct single-token continuation. From layer to layer, transformers gradually map an input embedding of the final prompt token to an output embedding from which next-token probabilities are computed. Tracking intermediate embeddings through their high-dimensional space reveals three distinct phases, whereby intermediate embeddings (1) start far away from output token embeddings; (2) already in middle layers allow for decoding a semantically correct next token, but giving higher probability to its version in English than in the input language; (3) move into an input-language-specific region of the embedding space. We cast these results into a conceptual model where the three phases operate in ”input space”, ”concept space”, and ”output space”, respectively. Crucially, our evidence suggests that the abstract ”concept space” lies closer to English than to other input languages, which may have important consequences regarding the biases embodied by multilingual language models.
我们研究的问题是:在不平衡、以英语为主的语料上训练的多语言语言模型,是否把英语用作内部枢轴语言——这一问题对于理解语言模型如何工作以及语言偏见的来源至关重要。我们的研究聚焦于 Llama-2 系列 Transformer 模型,基于精心构建的非英语提示,这些提示具有唯一正确的单 token 延续。Transformer 逐层地将最后一个提示 token 的输入嵌入映射为输出嵌入,并据此计算下一个 token 的概率。在高维空间中追踪中间嵌入揭示出三个截然不同的阶段:(1)中间嵌入起初远离输出 token 嵌入;(2)在中间层就已经可以解码出语义正确的下一个 token,但其英语版本的概率高于输入语言版本;(3)随后移动到嵌入空间中特定于输入语言的区域。我们将这些结果归纳为一个概念模型,其中三个阶段分别运行在"输入空间""概念空间"和"输出空间"中。至关重要的是,我们的证据表明,这个抽象的"概念空间"比起其他输入语言更接近英语,这可能会对多语言语言模型所体现的偏见产生重要影响。

Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models

认真对待幽默:用无趣的大型语言模型制作幽默数据集

Humor is a fundamental facet of human cognition and interaction. Yet, despite recent advances in natural language processing, humor detection remains a challenging task that is complicated by the scarcity of datasets that pair humorous texts with similar non-humorous counterparts. We investigate whether large language models (LLMs) can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to “unfun” jokes, as judged by humans and as measured on the downstream task of humor detection. We extend our approach to a code-mixed English-Hindi humor dataset where we find that GPT-4’s synthetic data is highly rated by bilingual annotators and provides challenging adversarial examples for humor classifiers.
幽默是人类认知与交互的基本面向。然而,尽管自然语言处理近来取得进展,幽默检测仍然是一项具有挑战性的任务,其难点之一在于缺乏将幽默文本与相近的非幽默文本配对的数据集。我们研究大型语言模型(LLMs)能否通过编辑文本来生成用于幽默检测的合成数据。我们在一个已有的人工数据集上对 LLMs 进行基准测试,结果表明:无论是按人类的判断,还是按幽默检测下游任务的衡量,当前的 LLMs 都展现出令人印象深刻的把笑话"去幽默化(unfun)"的能力。我们进一步将该方法扩展到一个英语-印地语语码混合的幽默数据集,发现 GPT-4 生成的合成数据得到了双语标注者的高度评价,并为幽默分类器提供了具有挑战性的对抗样本。

Estimating the Level of Dialectness Predicts Inter-annotator Agreement in Multi-dialect Arabic Datasets

估计方言水平可预测多方言阿拉伯语数据集中注释者间的一致性

On annotating multi-dialect Arabic datasets, it is common to randomly assign the samples across a pool of native Arabic speakers. Recent analyses recommended routing dialectal samples to native speakers of their respective dialects to build higher-quality datasets. However, automatically identifying the dialect of samples is hard. Moreover, the pool of annotators who are native speakers of specific Arabic dialects might be scarce. Arabic Level of Dialectness (ALDi) was recently introduced as a quantitative variable that measures how sentences diverge from Standard Arabic. On randomly assigning samples to annotators, we hypothesize that samples of higher ALDi scores are harder to label especially if they are written in dialects that the annotators do not speak. We test this by analyzing the relation between ALDi scores and the annotators’ agreement, on 15 public datasets having raw individual sample annotations for various sentence-classification tasks. We find strong evidence supporting our hypothesis for 11 of them. Consequently, we recommend prioritizing routing samples of high ALDi scores to native speakers of each sample’s dialect, for which the dialect could be automatically identified at higher accuracies.
在标注多方言阿拉伯语数据集时,通常会把样本随机分配给一组以阿拉伯语为母语的标注者。最近的分析建议,把方言样本分配给相应方言的母语者,以构建更高质量的数据集。然而,自动识别样本的方言很困难,而且以特定阿拉伯方言为母语的标注者可能很稀缺。"阿拉伯语方言程度"(ALDi)最近被提出,作为一个衡量句子偏离标准阿拉伯语程度的定量变量。我们假设:在随机分配样本给标注者的情况下,ALDi 分数较高的样本更难标注,尤其当它们是用标注者不会说的方言写成时。我们在 15 个带有原始个体标注、涵盖多种句子分类任务的公共数据集上,通过分析 ALDi 分数与标注者间一致性之间的关系来检验这一假设,并在其中 11 个数据集上发现了支持该假设的有力证据。因此,我们建议优先把 ALDi 分数高的样本分配给该样本方言的母语者;对这类样本,方言也能以更高的准确率被自动识别。
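注:论文的建议可以落成一个非常简单的分配策略。下面的 Python 草图仅为示意(阈值、字段名与方言代码均为假设):ALDi 分数高的样本优先交给对应方言的母语标注者,分数低(接近标准阿拉伯语)的样本则随机分配。

```python
import random

def assign_annotator(sample, annotators_by_dialect, general_pool, aldi_threshold=0.5):
    """ALDi 分数高且能识别出方言时,指派该方言的母语标注者;否则随机分配。"""
    if sample["aldi"] >= aldi_threshold and sample["dialect"] in annotators_by_dialect:
        return random.choice(annotators_by_dialect[sample["dialect"]])
    return random.choice(general_pool)

if __name__ == "__main__":
    annotators_by_dialect = {"EGY": ["ann_egy_1", "ann_egy_2"], "GLF": ["ann_glf_1"]}
    general_pool = ["ann_1", "ann_2", "ann_egy_1", "ann_glf_1"]
    samples = [
        {"text": "...", "dialect": "EGY", "aldi": 0.9},  # 方言性强:指派埃及方言母语者
        {"text": "...", "dialect": "GLF", "aldi": 0.2},  # 接近标准语:随机分配即可
    ]
    for s in samples:
        print(s["dialect"], s["aldi"], "->",
              assign_annotator(s, annotators_by_dialect, general_pool))
```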

G-DIG: Towards Gradient-based DIverse and hiGh-quality Instruction Data Selection for Machine Translation

G-DIG:面向机器翻译的基于梯度的多样化和高质量指令数据选择

Large Language Models (LLMs) have demonstrated remarkable abilities in general scenarios. Instruction finetuning empowers them to align with humans in various tasks. Nevertheless, the Diversity and Quality of the instruction data remain two main challenges for instruction finetuning. With regard to this, in this paper, we propose a novel gradient-based method to automatically select high-quality and diverse instruction finetuning data for machine translation. Our key innovation centers around analyzing how individual training examples influence the model during training. Specifically, we select training examples that exert beneficial influences on the model as high-quality ones by means of Influence Function plus a small high-quality seed dataset. Moreover, to enhance the diversity of the training data we maximize the variety of influences they have on the model by clustering on their gradients and resampling. Extensive experiments on WMT22 and FLORES translation tasks demonstrate the superiority of our methods, and in-depth analysis further validates their effectiveness and generalization.
大型语言模型(LLMs)在通用场景中展现出了非凡的能力,指令微调使它们能够在各种任务中与人类对齐。尽管如此,指令数据的多样性(Diversity)和质量(Quality)仍是指令微调面临的两大挑战。为此,本文提出了一种新颖的基于梯度的方法,用于自动选择高质量且多样化的机器翻译指令微调数据。我们的核心创新在于分析单个训练样例在训练过程中如何影响模型。具体来说,我们借助影响函数(Influence Function)加上一个小而高质量的种子数据集,把对模型产生有益影响的训练样例选为高质量样例。此外,为了增强训练数据的多样性,我们通过对样例梯度进行聚类和重采样,最大化它们对模型影响的多样性。在 WMT22 和 FLORES 翻译任务上的大量实验证明了我们方法的优越性,深入分析进一步验证了其有效性和泛化性。

Media Framing: A typology and Survey of Computational Approaches Across Disciplines

媒体框架:跨学科计算方法的类型学和调查

Framing studies how individuals and societies make sense of the world, by communicating or representing complex issues through schema of interpretation. The framing of information in the mass media influences our interpretation of facts and corresponding decisions, so detecting and analysing it is essential to understand biases in the information we consume. Despite that, framing is still mostly examined manually, on a case-by-case basis, while existing large-scale automatic analyses using NLP methods are not mature enough to solve this task. In this survey we show that despite the growing interest to framing in NLP its current approaches do not capture those aspects which allow to frame, rather than simply convey, the message. To this end, we bring together definitions of frames and framing adopted in different disciplines; examine cognitive, linguistic, and communicative aspects a frame contains beyond its topical content. We survey recent work on computational frame detection, and discuss how framing aspects and frame definitions are (or should) be reflected in NLP approaches.
框架化(framing)研究的是个人与社会如何通过解释图式来传达或表征复杂议题,从而理解世界。大众媒体对信息的框架化会影响我们对事实的解读及相应的决策,因此检测和分析框架化,对于理解我们所接收信息中的偏向至关重要。尽管如此,框架化目前仍主要靠人工逐例分析,而现有利用 NLP 方法的大规模自动分析还不够成熟,难以胜任这一任务。在本综述中,我们指出:尽管 NLP 界对框架化的兴趣日益增长,但当前方法并未捕捉到那些使信息得以被"框架化"(而非仅仅被传达)的方面。为此,我们汇总了不同学科中对框架(frame)与框架化的定义,并考察一个框架在主题内容之外所包含的认知、语言和交际层面。我们综述了近期的计算框架检测工作,并讨论框架化的各个方面及框架定义如何(或应当如何)体现在 NLP 方法中。

SPZ: A Semantic Perturbation-based Data Augmentation Method with Zonal-Mixing for Alzheimer’s Disease Detection

SPZ:一种用于阿尔茨海默病检测的基于语义扰动与区域混合(Zonal-Mixing)的数据增强方法

Alzheimer’s Disease (AD), characterized by significant cognitive and functional impairment, necessitates the development of early detection techniques. Traditional diagnostic practices, such as cognitive assessments and biomarker analysis, are often invasive and costly. Deep learning-based approaches for non-invasive AD detection have been explored in recent studies, but the lack of accessible data hinders further improvements in detection performance. To address these challenges, we propose a novel semantic perturbation-based data augmentation method that essentially differs from existing techniques, which primarily rely on explicit data engineering. Our approach generates controlled semantic perturbations to enhance textual representations, aiding the model in identifying AD-specific linguistic patterns, particularly in scenarios with limited data availability. It learns contextual information and dynamically adjusts the perturbation degree for different linguistic features. This enhances the model’s sensitivity to AD-specific linguistic features and its robustness against natural language noise. Experimental results on the ADReSS challenge dataset demonstrate that our approach outperforms other strong and competitive deep learning methods.
阿尔茨海默病(AD)的特点是严重的认知和功能障碍,需要开发早期检测技术。传统的诊断实践,例如认知评估和生物标志物分析,通常是侵入性的且成本高昂。最近的研究已经探索了基于深度学习的非侵入性 AD 检测方法,但缺乏可访问的数据阻碍了检测性能的进一步提高。为了应对这些挑战,我们提出了一种新颖的基于语义扰动的数据增强方法,该方法与主要依赖于显式数据工程的现有技术本质上不同。我们的方法生成受控的语义扰动以增强文本表示,帮助模型识别 AD 特定的语言模式,特别是在数据可用性有限的情况下。它学习上下文信息并动态调整不同语言特征的扰动程度。这增强了模型对 AD 特定语言特征的敏感性及其对自然语言噪声的鲁棒性。 ADReSS 挑战数据集的实验结果表明,我们的方法优于其他强大且有竞争力的深度学习方法。

Greed is All You Need: An Evaluation of Tokenizer Inference Methods

贪婪就是你所需要的:分词器推理方法的评估

While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes, performed on a novel intrinsic evaluation suite we curated for English, combining measures rooted in morphology, cognition, and information theory. We show that for the most commonly used tokenizers, greedy inference performs surprisingly well; and that SaGe, a recently-introduced contextually-informed tokenizer, outperforms all others on morphological alignment.
虽然 BPE 和 WordPiece 等子词分词器通常被用来为 NLP 模型构建词表,但"如何把文本切分(解码)成词表中的 token 序列"这一推理方法往往没有被明确说明,或者与词表的构建方式并不匹配。我们对 7 种分词器推理方法进行了受控分析,覆盖 4 种不同算法和 3 种词表规模,评测在我们为英语整理的一个新的内在评估套件上进行,该套件综合了源自形态学、认知和信息论的多种度量。我们发现,对于最常用的分词器,贪心推理的表现出人意料地好;而最近提出的、利用上下文信息的分词器 SaGe 在形态对齐方面优于所有其他方法。
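注:下面用一个简化的 Python 草图说明什么是"贪心(最长匹配)推理":每一步都在当前位置选取词表中能匹配到的最长片段。这只是概念演示,并非论文的评测代码;真实的 BPE/WordPiece 分词器还有合并规则、"##" 续接标记等细节。

```python
def greedy_tokenize(text, vocab, unk="[UNK]"):
    """贪心最长匹配:从当前位置开始,优先尝试最长的词表片段。"""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):    # 从最长候选开始向下尝试
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                                # 没有任何片段命中词表
            tokens.append(unk)
            i += 1
    return tokens

if __name__ == "__main__":
    vocab = {"un", "believ", "able", "believable", "a", "b", "l", "e"}
    print(greedy_tokenize("unbelievable", vocab))  # ['un', 'believable']
```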

Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t

语言复杂性和语音识别准确性:拼写复杂性有害,语音复杂性则不然

We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. We hypothesize that orthographic and phonological complexities both degrade accuracy. To examine this, we fine-tune the multilingual self-supervised pretrained model Wav2Vec2-XLSR-53 on 25 languages with 15 writing systems, and we compare their ASR accuracy, number of graphemes, unigram grapheme entropy, logographicity (how much word/morpheme-level information is encoded in the writing system), and number of phonemes. The results demonstrate that orthographic complexities significantly correlate with low ASR accuracy, while phonological complexity shows no significant correlation.
我们研究哪些语言因素会影响自动语音识别(ASR)模型的性能。我们假设拼写复杂性和音系复杂性都会降低准确性。为了检验这一点,我们在涵盖 15 种书写系统的 25 种语言上对多语言自监督预训练模型 Wav2Vec2-XLSR-53 进行了微调,并比较了它们的 ASR 准确率、字素数量、一元字素熵、语标性(即书写系统中编码了多少单词/词素级别的信息)以及音素数量。结果表明,拼写复杂性与较低的 ASR 准确率显著相关,而音系复杂性则没有显著相关性。
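作为对文中分析流程的一个简化示意,下面的代码计算转写文本的一元字素熵,并用 Spearman 相关检验其与 ASR 错误率的关系。其中的语言样本和错误率均为随手编造的占位数据,仅用于演示计算方式,并非论文结果。

```python
import math
from collections import Counter
from scipy.stats import spearmanr

def unigram_grapheme_entropy(text: str) -> float:
    """去掉空格后按单个字符(字素近似)统计的一元熵,单位为比特。"""
    counts = Counter(text.replace(" ", ""))
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# 假设的数据:每种语言的转写样本与对应的字符错误率(CER)
samples = {"lang_a": ("ababbaab", 0.12),
           "lang_b": ("qwertyuiopasdf", 0.21),
           "lang_c": ("aaaaaaab", 0.08)}

entropies = [unigram_grapheme_entropy(t) for t, _ in samples.values()]
cers = [c for _, c in samples.values()]
rho, p = spearmanr(entropies, cers)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")
```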

Steering Llama 2 via Contrastive Activation Addition

通过对比激活加法控制 Llama 2

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes “steering vectors” by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user’s prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA’s effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA’s mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
我们引入了对比激活加法(CAA),这是一种通过在前向传播过程中修改语言模型的激活来引导其行为的创新方法。CAA 通过对某一特定行为(例如事实性回应与幻觉性回应)的正负示例对之间残差流激活的差值取平均来计算“转向向量”。在推理过程中,这些转向向量会以正或负的系数加到用户提示之后的所有 token 位置上,从而可以精确控制目标行为的程度。我们使用多项选择行为问题数据集和开放式生成任务来评估 CAA 在 Llama 2 Chat 上的有效性。我们证明 CAA 显著改变了模型行为,在微调和系统提示设计等传统方法之外(以及叠加其上)依然有效,并且对模型能力的削弱极小。此外,我们通过采用多种激活空间解释方法,对 CAA 的机制有了更深入的了解。CAA 能够准确引导模型输出,并揭示高级概念如何在大型语言模型( LLMs )中表示。
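按摘要描述的思路,下面给出一个最小示意:用正/负行为示例在某层残差流上的激活差值的平均作为“转向向量”,推理时以给定系数加到提示之后的所有 token 位置。这里用随机张量代替真实模型激活,只演示向量的计算与施加方式,并非论文的完整实现。

```python
import torch

hidden = 64                                    # 假设的残差流维度
pos_acts = torch.randn(32, hidden)             # 正示例在某层、答案 token 处的激活(示意)
neg_acts = torch.randn(32, hidden)             # 负示例对应位置的激活(示意)
steering_vec = (pos_acts - neg_acts).mean(0)   # 转向向量 = 激活差值的平均

def apply_caa(resid: torch.Tensor, vec: torch.Tensor,
              coeff: float, prompt_len: int) -> torch.Tensor:
    """resid: [seq_len, hidden];在用户提示之后的所有位置加上 coeff * vec。"""
    out = resid.clone()
    out[prompt_len:] += coeff * vec
    return out

resid = torch.randn(20, hidden)                # 某层残差流(示意)
steered = apply_caa(resid, steering_vec, coeff=1.5, prompt_len=8)
```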

EconAgent: Large Language Model-Empowered Agents for Simulating Macroeconomic Activities

EconAgent:用于模拟宏观经济活动的大语言模型赋能代理

The advent of artificial intelligence has led to a growing emphasis on data-driven modeling in macroeconomics, with agent-based modeling (ABM) emerging as a prominent bottom-up simulation paradigm. In ABM, agents (e.g., households, firms) interact within a macroeconomic environment, collectively generating market dynamics. Existing agent modeling typically employs predetermined rules or learning-based neural networks for decision-making. However, customizing each agent presents significant challenges, complicating the modeling of agent heterogeneity. Additionally, the influence of multi-period market dynamics and multifaceted macroeconomic factors are often overlooked in decision-making processes. In this work, we introduce EconAgent, a large language model-empowered agent with human-like characteristics for macroeconomic simulation. We first construct a simulation environment that incorporates various market dynamics driven by agents’ decisions regarding work and consumption. Through the perception module, we create heterogeneous agents with distinct decision-making mechanisms. Furthermore, we model the impact of macroeconomic trends using a memory module, which allows agents to reflect on past individual experiences and market dynamics. Simulation experiments show that EconAgent can make realistic decisions, leading to more reasonable macroeconomic phenomena compared to existing rule-based or learning-based agents. Our codes are released at this https URL
人工智能的出现导致宏观经济学中数据驱动的建模日益受到重视,基于代理的建模(ABM)成为一种突出的自下而上的模拟范式。在 ABM 中,主体(例如家庭、企业)在宏观经济环境中相互作用,共同产生市场动态。现有的代理建模通常采用预定规则或基于学习的神经网络进行决策。然而,定制每个代理带来了巨大的挑战,使代理异质性的建模变得复杂。此外,多时期市场动态和多方面宏观经济因素的影响在决策过程中往往被忽视。在这项工作中,我们介绍了 EconAgent,这是一种由大语言模型赋能、具有类人特征的代理,用于宏观经济模拟。我们首先构建一个模拟环境,其中包含由代理人有关工作和消费的决策驱动的各种市场动态。通过感知模块,我们创建具有不同决策机制的异构代理。此外,我们使用记忆模块对宏观经济趋势的影响进行建模,这使得代理能够反思过去的个人经历和市场动态。仿真实验表明,与现有的基于规则或基于学习的代理相比,EconAgent 可以做出更现实的决策,从而产生更合理的宏观经济现象。我们的代码发布于 这个 https 网址。

M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models

M4LE:大型语言模型的多能力、多范围、多任务、多领域长上下文评估基准

Managing long sequences has become an important and necessary feature for large language models (LLMs). However, it is still an open question of how to comprehensively and systematically evaluate the long-sequence capability of LLMs. One of the reasons is that conventional and widely-used benchmarks mainly consist of short sequences. In this paper, we propose M4LE, a Multi-ability, Multi-range, Multi-task, Multi-domain benchmark for Long-context Evaluation. M4LE is based on a diverse NLP task pool comprising 36 NLP datasets, 11 task types and 12 domains. To alleviate the scarcity of tasks with naturally long sequences and incorporate multiple-ability assessment, we propose an automatic approach (but with negligible human annotations) to convert short-sequence tasks into a unified long-sequence scenario where LLMs have to identify single or multiple relevant spans in long contexts based on explicit or semantic hints. Specifically, the scenario includes five different types of abilities: (1) explicit single-span; (2) semantic single-span; (3) explicit multiple-span; (4) semantic multiple-span; and (5) global context understanding. The resulting samples in M4LE are evenly distributed from 1k to 8k input length. We conducted a systematic evaluation on 11 well-established LLMs, especially those optimized for long-sequence inputs. Our results reveal that: 1) Current LLMs struggle to understand long context, particularly when tasks require multiple-span attention. 2) Semantic retrieval task is more difficult for competent LLMs. 3) Models fine-tuned on longer text with position interpolation have comparable performance to those using Neural Tangent Kernel (NTK) aware scaling methods without fine-tuning. We make our benchmark publicly available to encourage future research in this challenging area.
管理长序列已成为大型语言模型( LLMs )的重要且必要的能力。然而,如何全面、系统地评估LLMs的长序列能力仍然是一个悬而未决的问题。原因之一是传统且广泛使用的基准主要由短序列组成。在本文中,我们提出了 M4LE,一种用于长上下文评估的多能力、多范围、多任务、多领域基准。M4LE 基于多样化的 NLP 任务池,包括 36 个 NLP 数据集、11 种任务类型和 12 个领域。为了缓解天然长序列任务的稀缺并纳入多能力评估,我们提出了一种(仅需极少人工标注的)自动方法,将短序列任务转换为统一的长序列场景:LLMs必须依据显式或语义提示,在长上下文中识别单个或多个相关跨度。具体来说,该场景包括五种不同类型的能力:(1)显式单跨度;(2)语义单跨度;(3)显式多跨度;(4)语义多跨度;(5)全局上下文理解。M4LE 中生成的样本在 1k 到 8k 的输入长度上均匀分布。我们对 11 个成熟的LLMs进行了系统评估,特别是那些针对长序列输入进行优化的模型。我们的结果表明:1)当前的LLMs很难理解长上下文,特别是当任务需要关注多个跨度时;2)对于能力较强的LLMs来说,语义检索任务更加困难;3)通过位置插值在较长文本上微调的模型,与无需微调、使用神经正切核(NTK)感知缩放方法的模型性能相当。我们公开发布该基准,以鼓励未来在这个具有挑战性的领域开展研究。
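文中“把短序列任务转换为统一长序列场景”的做法,大致可以用下面的玩具代码来理解:把若干条短样本拼接成一个长上下文,再要求模型依据显式提示找回相关的那一段。数据与提示模板均为假设,与 M4LE 的真实格式无关。

```python
import random

# 构造若干条带编号的短样本(占位数据)
short_samples = [{"id": i,
                  "passage": f"这是第 {i} 段短文,讨论主题 {i}。",
                  "question": f"讨论主题 {i} 的内容出现在哪一段?"}
                 for i in range(1, 9)]
random.shuffle(short_samples)

# 拼接成长上下文,并针对其中一条样本提问(对应“显式单跨度”场景)
long_context = "\n".join(f"[段落 {s['id']}] {s['passage']}" for s in short_samples)
target = short_samples[3]
prompt = f"{long_context}\n\n问题:{target['question']}\n请只回答对应的段落编号。"
print(prompt)
```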

CHECKWHY: Causal Fact Verification via Argument Structure

CHECKWHY:通过论证结构验证因果事实

With the growing complexity of fact verification tasks, the concern with “thoughtful” reasoning capabilities is increasing. However, recent fact verification benchmarks mainly focus on checking a narrow scope of semantic factoids within claims and lack an explicit logical reasoning process. In this paper, we introduce CHECKWHY, a challenging dataset tailored to a novel causal fact verification task: checking the truthfulness of the causal relation within claims through rigorous reasoning steps. CHECKWHY consists of over 19K “why” claim-evidence- argument structure triplets with supports, refutes, and not enough info labels. Each argument structure is composed of connected evidence, representing the reasoning process that begins with foundational evidence and progresses toward claim establishment. Through extensive experiments on state-of-the-art models, we validate the importance of incorporating the argument structure for causal fact verification. Moreover, the automated and human evaluation of argument structure generation reveals the difficulty in producing satisfying argument structure by fine-tuned models or Chain-of-Thought prompted LLMs, leaving considerable room for future improvements.
随着事实验证任务的日益复杂,人们对“深思熟虑”的推理能力的关注也日益增加。然而,最近的事实验证基准主要集中于检查声明(claim)中范围狭窄的语义事实,缺乏明确的逻辑推理过程。在本文中,我们介绍了 CHECKWHY,这是一个具有挑战性的数据集,专为新颖的因果事实验证任务而定制:通过严格的推理步骤检查声明中因果关系的真实性。CHECKWHY 由超过 19K 个“为什么”声明-证据-论证结构三元组组成,带有支持、反驳和信息不足标签。每个论证结构都由相互关联的证据组成,代表从基础证据出发、逐步建立声明的推理过程。通过对最先进模型的广泛实验,我们验证了引入论证结构对因果事实验证的重要性。此外,对论证结构生成的自动和人工评估表明,无论是微调模型还是思维链提示的LLMs,都难以生成令人满意的论证结构,为未来的改进留下了相当大的空间。

On Efficient and Statistical Quality Estimation for Data Annotation

数据标注的高效统计质量评估

Annotated datasets are an essential ingredient to train, evaluate, compare and productionalize supervised machine learning models. It is therefore imperative that annotations are of high quality. For their creation, good quality management and thereby reliable quality estimates are needed. Then, if quality is insufficient during the annotation process, rectifying measures can be taken to improve it. Quality estimation is often performed by having experts manually label instances as correct or incorrect. But checking all annotated instances tends to be expensive. Therefore, in practice, usually only subsets are inspected; sizes are chosen mostly without justification or regard to statistical power and more often than not, are relatively small. Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate. Using unnecessarily large sample sizes costs money that could be better spent, for instance on more annotations. Therefore, we first describe in detail how to use confidence intervals for finding the minimal sample size needed to estimate the annotation error rate. Then, we propose applying acceptance sampling as an alternative to error rate estimation. We show that acceptance sampling can reduce the required sample sizes up to 50% while providing the same statistical guarantees.
带注释的数据集是训练、评估、比较和生产化监督机器学习模型的重要组成部分。因此,高质量的标注至关重要。为了产出高质量标注,需要良好的质量管理,从而需要可靠的质量估计。这样,如果标注过程中质量不足,就可以采取纠正措施加以改进。质量估计通常通过让专家手动将实例标记为正确或错误来进行。但检查所有已标注实例往往成本高昂。因此,在实践中通常只检查子集;子集大小的选择大多缺乏依据、未考虑统计功效,而且往往相对较小。然而,基于小样本量的估计可能导致错误率的估计值不精确。而使用不必要的大样本量则会浪费本可以更好利用(例如用于更多标注)的经费。因此,我们首先详细描述如何使用置信区间来找到估计标注错误率所需的最小样本量。然后,我们提出采用验收抽样作为错误率估计的替代方案。我们表明,验收抽样可以将所需样本量减少多达 50%,同时提供相同的统计保证。
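下面是对文中两类做法的最小示意:先用正态近似的置信区间半宽反推估计错误率所需的最小抽检量,再用二项分布构造一个简单的验收抽样判据。置信水平、允许误差和可接受错误率都是示例取值,并非论文中的推荐参数。

```python
import math
from scipy.stats import norm, binom

def min_sample_size(p_guess: float, half_width: float, conf: float = 0.95) -> int:
    """正态近似下,使错误率估计的置信区间半宽不超过 half_width 的最小 n。"""
    z = norm.ppf(0.5 + conf / 2)
    return math.ceil(z ** 2 * p_guess * (1 - p_guess) / half_width ** 2)

def acceptance_sampling_plan(n: int, p_acceptable: float, alpha: float = 0.05) -> int:
    """返回允收上限 c:抽检 n 条标注,错误数 <= c 则接受该批。
    当真实错误率恰为 p_acceptable 时,误拒概率不超过 alpha。"""
    return int(binom.ppf(1 - alpha, n, p_acceptable))

print(min_sample_size(0.10, 0.03))          # 估计约 10% 错误率、±3% 精度所需样本量
print(acceptance_sampling_plan(200, 0.05))  # 抽 200 条、可接受错误率 5% 时的允收数
```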

Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!

模拟错位:大型语言模型的安全对齐可能会适得其反!

Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans. However, this paper introduces a training-free attack method capable of reversing safety alignment, converting the outcomes of stronger alignment into greater potential for harm by accessing only LLM output token distributions. Specifically, our method achieves this reversal by contrasting the output token distribution of a safety-aligned language model (e.g., Llama-2-chat) against its pre-trained version (e.g., Llama-2), so that the token predictions are shifted towards the opposite direction of safety alignment.We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.Our experiments with ED across three evaluation datasets and four model families (Llama-1, Llama-2, Mistral, and Alpaca) show that ED doubles the harmfulness of pre-trained models and outperforms strong baselines, achieving the highest harmful rates in 43 out of 48 evaluation subsets by a large margin.Eventually, given ED’s reliance on language model output token distributions, which particularly compromises open-source models, our findings highlight the need to reassess the open accessibility of language models, even if they have been safety-aligned.Code is available at https://github.com/ZHZisZZ/emulated-disalignment
大型语言模型( LLMs )经过安全对齐,以确保与人类的安全对话。然而,本文介绍了一种无需训练的攻击方法,仅需访问LLM的输出 token 分布,就能逆转安全对齐,将更强对齐的结果转化为更大的潜在危害。具体来说,我们的方法通过将安全对齐语言模型(例如 Llama-2-chat)的输出 token 分布与其预训练版本(例如 Llama-2)进行对比来实现这种逆转,使 token 预测朝安全对齐的相反方向偏移。我们将这种方法命名为模拟错位(emulated disalignment,ED),因为从这种对比分布中采样,可以被证明等价于以最小化安全奖励为目标进行微调的结果。我们在三个评估数据集和四个模型系列(Llama-1、Llama-2、Mistral 和 Alpaca)上的 ED 实验表明,ED 使预训练模型的危害性翻倍,并大幅超越强基线,在 48 个评估子集中的 43 个上取得最高危害率。最后,鉴于 ED 依赖语言模型的输出 token 分布(开源模型尤其容易受此影响),我们的研究结果强调需要重新评估语言模型的开放可访问性,即使它们已经过安全对齐。代码可在 https://github.com/ZHZisZZ/emulated-disalignment 获取。
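模拟错位的核心直觉可以用下面的极简示意来体会:对同一前缀,取预训练模型与安全对齐模型的下一 token 对数几率,按“远离对齐方向”的方式组合后再采样。组合公式是示意性写法,精确形式请以论文为准;此处用随机张量代替真实模型输出。

```python
import torch
import torch.nn.functional as F

def emulated_disalignment_logits(base_logits: torch.Tensor,
                                 aligned_logits: torch.Tensor,
                                 alpha: float = 1.0) -> torch.Tensor:
    """base/aligned_logits: [vocab];返回被推向对齐反方向的采样分布 logits(示意)。"""
    return base_logits + alpha * (base_logits - aligned_logits)

vocab = 1000
base = torch.randn(vocab)      # 预训练模型(如 Llama-2)的下一 token logits(示意)
aligned = torch.randn(vocab)   # 对齐模型(如 Llama-2-chat)的 logits(示意)

probs = F.softmax(emulated_disalignment_logits(base, aligned, alpha=0.5), dim=-1)
next_token = torch.multinomial(probs, 1)
```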

IndicLLMSuite: A Blueprint for Creating Pre-training and Fine-Tuning Datasets for Indian Languages

IndicLLMSuite:为印度语言创建预训练和微调数据集的蓝图

Despite the considerable advancements in English LLMs, the progress in building comparable models for other languages has been hindered due to the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages, containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated manually verified data, unverified yet valuable data, and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction-fine tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and utilize LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generate non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released with permissive licenses.
尽管英语LLMs取得了相当大的进步,但由于缺乏定制资源,为其他语言构建可比模型的进展受到阻碍。我们的工作旨在通过引入一套专为印度语系LLMs(Indic LLMs)开发而设计的庞大资源来弥合这一鸿沟,涵盖 22 种语言,总共包含 251B 个 token 和 7480 万个指令-回复对。认识到数据质量和数量的重要性,我们的方法结合了精心整理并经人工验证的数据、未经验证但有价值的数据以及合成数据。我们构建了一个干净的开源流水线,用于从网站、PDF 和视频等不同来源整理预训练数据,并采用了爬取、清洗、标记和去重的最佳实践。为了进行指令微调,我们合并了现有的印度语数据集,将英语数据集翻译/音译为印度语言,并利用 LLaMa2 和 Mixtral 模型创建基于印度维基百科和 Wikihow 文章的对话。此外,我们通过为多种场景生成有毒提示来处理毒性对齐,然后将这些有毒提示输入已对齐的 LLaMa2 模型以生成无毒回复。我们希望作为这项工作的一部分发布的数据集、工具和资源不仅能够推动印度语系LLMs的研究和开发,还能建立一个可推广到其他语言的开源蓝图。作为这项工作的一部分创建的数据和其他产出均以宽松许可证发布。

MultiPICo: Multilingual Perspectivist Irony Corpus

MultiPICo:多语言视角反讽语料库

Recently, several scholars have contributed to the growth of a new theoretical framework in NLP called perspectivism. This approach aimsto leverage data annotated by different individuals to model diverse perspectives that affect their opinions on subjective phenomena such as irony. In this context, we propose MultiPICo, a multilingual perspectivist corpus of ironic short conversations in different languages andlinguistic varieties extracted from Twitter and Reddit. The corpus includes sociodemographic information about its annotators. Our analysis of the annotated corpus shows how different demographic cohorts may significantly disagree on their annotation of irony and how certain cultural factors influence the perception of the phenomenon and the agreement on the annotation. Moreover, we show how disaggregated annotations and rich annotator metadata can be exploited to benchmark the ability of large language models to recognize irony, their positionality with respect to sociodemographic groups, and the efficacy of perspective-taking prompting for irony detection in multiple languages.
最近,一些学者推动了 NLP 中一种名为视角主义(perspectivism)的新理论框架的发展。这种方法旨在利用由不同个体标注的数据,来建模影响他们对讽刺等主观现象看法的不同视角。在这一背景下,我们提出了 MultiPICo,这是一个多语言视角主义语料库,包含从 Twitter 和 Reddit 中提取的不同语言和语言变体的讽刺短对话。该语料库包含有关其标注者的社会人口统计信息。我们对已标注语料库的分析表明,不同的人口群体在对讽刺的标注上可能存在显著分歧,以及某些文化因素如何影响对该现象的感知和标注的一致性。此外,我们还展示了如何利用未聚合的标注和丰富的标注者元数据,来评测大型语言模型识别讽刺的能力、它们相对于社会人口群体的立场,以及观点采择提示在多语言讽刺检测中的有效性。

MMToM-QA: Multimodal Theory of Mind Question Answering

MMToM-QA:多模态心理理论问答

Theory of Mind (ToM), the ability to understand people’s mental states, is an essential ingredient for developing machines with human-level social intelligence. Recent machine learning models, particularly large language models, seem to show some aspects of ToM understanding. However, existing ToM benchmarks use unimodal datasets – either video or text. Human ToM, on the other hand, is more than video or text understanding. People can flexibly reason about another person’s mind based on conceptual representations (e.g., goals, beliefs, plans) extracted from any available data. To address this, we introduce a multimodal Theory of Mind question answering (MMToM-QA) benchmark. MMToM-QA comprehensively evaluates machine ToM both on multimodal data and on different kinds of unimodal data about a person’s activity in a household environment. To engineer multimodal ToM capacity, we propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models). BIP-ALM extracts unified representations from multimodal data and utilizes language models for scalable Bayesian inverse planning. We conducted a systematic comparison of human performance, BIP-ALM, and state-of-the-art models, including GPT-4. The experiments demonstrate that large language models and large multimodal models still lack robust ToM capacity. BIP-ALM, on the other hand, shows promising results, by leveraging the power of both model-based mental inference and language models.
心智理论 (ToM) 是理解人们心理状态的能力,是开发具有人类水平社交智能的机器的重要组成部分。最近的机器学习模型,特别是大型语言模型,似乎显示了 ToM 理解的某些方面。然而,现有的 ToM 基准使用单峰数据集——视频或文本。另一方面,人类 ToM 不仅仅是视频或文本理解。人们可以根据从任何可用数据中提取的概念表征(例如,目标、信念、计划)灵活地推断另一个人的想法。为了解决这个问题,我们引入了多模态心理理论问答 (MMToM-QA) 基准。 MMToM-QA 根据多模态数据和有关家庭环境中个人活动的不同类型的单模态数据全面评估机器 ToM。为了设计多模态 ToM 能力,我们提出了一种新方法:BIP-ALM(语言模型加速的贝叶斯逆向规划)。 BIP-ALM 从多模态数据中提取统一表示,并利用语言模型进行可扩展的贝叶斯逆向规划。我们对人类表现、BIP-ALM 和最先进的模型(包括 GPT-4)进行了系统比较。实验表明,大型语言模型和大型多模态模型仍然缺乏强大的 ToM 能力。另一方面,BIP-ALM 通过利用基于模型的心理推理和语言模型的力量,显示出了有希望的结果。

MAP’s not dead yet: Uncovering true language model modes by conditioning away degeneracy

MAP 尚未消亡:通过条件化排除退化来发现真正的语言模型模式

It has been widely observed that exact or approximate MAP (mode-seeking) decoding from natural language generation (NLG) models consistently leads to degenerate outputs (Holtzman et al., 2019; Stahlberg and Byrne, 2019). Prior work has attributed this behavior to either a fundamental and unavoidable inadequacy of modes in probabilistic models or weaknesses in language modeling. Contrastingly, we argue that degenerate modes can even occur in the absence of any modeling error, due to contamination of the training data. Specifically, we argue that mixing even a tiny amount of low-entropy noise with a population text distribution can cause the data distribution’s mode to become degenerate. We therefore propose to apply MAP decoding to the model’s true conditional distribution where the conditioning variable explicitly avoids specific degenerate behavior. Using exact search, we empirically verify that the length-conditional modes of machine translation models and language models are indeed more fluent and topical than their unconditional modes. For the first time, we also share many examples of exact modal sequences from these models, and from several variants of the LLaMA-7B model. Notably, we observe that various kinds of degenerate modes persist, even at the scale of LLaMA-7B. Although we cannot tractably address these degeneracies with exact search, we perform a classifier-based approximate search on LLaMA-7B, a model which was not trained for instruction following, and find that we are able to elicit reasonable outputs without any finetuning.
人们广泛观察到,从自然语言生成(NLG)模型进行精确或近似的 MAP(寻模)解码始终会导致退化输出(Holtzman 等人,2019;Stahlberg 和 Byrne,2019)。先前的工作将这种行为归因于概率模型中模式(mode)的根本且不可避免的缺陷,或语言建模本身的弱点。与此相反,我们认为,由于训练数据的污染,即使没有任何建模误差,退化的模式也可能出现。具体来说,我们认为,即使将极少量的低熵噪声混入总体文本分布,也会导致数据分布的模式退化。因此,我们建议将 MAP 解码应用于模型的真实条件分布,其中条件变量显式地排除了特定的退化行为。通过精确搜索,我们实证验证了机器翻译模型和语言模型的长度条件模式确实比它们的无条件模式更流畅、更切题。我们还首次展示了来自这些模型以及 LLaMA-7B 模型多个变体的精确模式序列的许多示例。值得注意的是,我们观察到即使在 LLaMA-7B 的规模上,各种退化模式仍然存在。尽管我们无法通过精确搜索高效地解决这些退化问题,但我们对 LLaMA-7B(一个未经过指令遵循训练的模型)执行了基于分类器的近似搜索,发现无需任何微调即可得到合理的输出。

NounAtlas: Filling the Gap in Nominal Semantic Role Labeling

NounAtlas:填补名词性语义角色标注的空白

Despite significant advances in Semantic Role Labeling (SRL), much work in this field has been carried out with a focus on verbal predicates, with the research on nominal SRL lagging behind. In many contexts, however, nominal predicates are often as informative as verbal ones, thus needing proper treatment. In this paper we aim to fill this gap and make nominal SRL a first-class citizen. We introduce a novel approach to create the first large-scale, high-quality inventory of nominal predicates and organize them into semantically-coherent frames. Although automatically created, NounAtlas – our frame inventory – is subsequently fully validated. We then put forward a technique to generate silver training data for nominal SRL and show that a state-of-the-art SRL model can achieve good performance. Interestingly, thanks to our design choices which enable seamless integration of our predicate inventory with its verbal counterpart, we can mix verbal and nominal data and perform robust SRL on both types of predicates.
尽管语义角色标注(SRL)取得了重大进展,但该领域的许多工作都集中在动词谓词上,名词性 SRL 的研究相对滞后。然而,在许多情况下,名词性谓词所承载的信息往往不亚于动词谓词,因此需要得到恰当的处理。在本文中,我们的目标是填补这一空白,使名词性 SRL 成为一等公民。我们引入了一种新颖的方法,创建了第一个大规模、高质量的名词性谓词清单,并将它们组织成语义一致的框架。尽管是自动创建的,我们的框架清单 NounAtlas 随后经过了完整的人工验证。然后,我们提出了一种为名词性 SRL 生成银标训练数据的技术,并表明最先进的 SRL 模型可以借此取得良好的性能。有趣的是,由于我们的设计选择使该谓词清单能够与动词谓词清单无缝集成,因此我们可以混合动词性和名词性数据,并在两种类型的谓词上都实现稳健的 SRL。

The Earth is Flat because…: Investigating LLMs’ Belief towards Misinformation via Persuasive Conversation

地球是平的,因为……:通过有说服力的对话调查LLMs对错误信息的信念

Large language models (LLMs) encapsulate vast amounts of knowledge but still remain vulnerable to external misinformation. Existing research mainly studied this susceptibility behavior in a single-turn setting. However, belief can change during a multi-turn conversation, especially a persuasive one. Therefore, in this study, we delve into LLMs’ susceptibility to persuasive conversations, particularly on factual questions that they can answer correctly. We first curate the Farm (i.e., Fact to Misinform) dataset, which contains factual questions paired with systematically generated persuasive misinformation. Then, we develop a testing framework to track LLMs’ belief changes in a persuasive dialogue. Through extensive experiments, we find that LLMs’ correct beliefs on factual knowledge can be easily manipulated by various persuasive strategies.
大型语言模型( LLMs )蕴含了大量知识,但仍然容易受到外部错误信息的影响。现有研究主要在单轮设置中研究这种易受影响的行为。然而,在多轮对话、尤其是有说服力的对话中,信念可能会发生变化。因此,在这项研究中,我们深入研究了LLMs在说服性对话中的易受影响程度,特别是针对它们本能正确回答的事实性问题。我们首先构建了 Farm(即 Fact to Misinform)数据集,其中包含事实性问题以及系统生成的有说服力的错误信息。然后,我们开发了一个测试框架来跟踪LLMs在说服性对话中的信念变化。通过大量实验,我们发现LLMs对事实知识的正确信念很容易被各种说服策略所操纵。

Let’s Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation

Let's Go Real Talk:面对面对话的口语对话模型

In this paper, we introduce a novel Face-to-Face spoken dialogue model. It processes audio-visual speech from user input and generates audio-visual speech as the response, marking the initial step towards creating an avatar chatbot system without relying on intermediate text. To this end, we newly introduce MultiDialog, the first large-scale multimodal (i.e., audio and visual) spoken dialogue corpus containing 340 hours of approximately 9,000 dialogues, recorded based on the open domain dialogue dataset, TopicalChat. The MultiDialog contains parallel audio-visual recordings of conversation partners acting according to the given script with emotion annotations, which we expect to open up research opportunities in multimodal synthesis. Our Face-to-Face spoken dialogue model incorporates a textually pretrained large language model and adapts it into the audio-visual spoken dialogue domain by incorporating speech-text joint pretraining. Through extensive experiments, we validate the effectiveness of our model in facilitating a face-to-face conversation. Demo and data are available at this https URL and this https URL, respectively.
在本文中,我们介绍了一种新颖的面对面口语对话模型。它处理来自用户输入的视听语音,并生成视听语音作为回应,迈出了在不依赖中间文本的情况下构建虚拟形象聊天机器人系统的第一步。为此,我们新推出了 MultiDialog,这是第一个大规模多模态(即音频和视觉)口语对话语料库,包含约 9,000 段、共 340 小时的对话,基于开放域对话数据集 TopicalChat 录制。MultiDialog 包含对话双方按照给定脚本表演的并行视听录音,并带有情感标注,我们期望它能为多模态合成开辟研究机会。我们的面对面口语对话模型整合了经文本预训练的大语言模型,并通过语音-文本联合预训练使其适应视听口语对话领域。通过大量实验,我们验证了该模型在促成面对面对话方面的有效性。演示和数据分别可在 这个 https 网址 和 这个 https 网址 获取。

Word Embeddings Are Steers for Language Models

词嵌入是语言模型的转向器

Language models (LMs) automatically learn word embeddings during pre-training on language corpora. Although word embeddings are usually interpreted as feature vectors for individual words, their roles in language model generation remain underexplored. In this work, we theoretically and empirically revisit output word embeddings and find that their linear transformations are equivalent to steering language model generation styles. We name such steers LM-Steers and find them existing in LMs of all sizes. It requires learning parameters equal to 0.2% of the original LMs’ size for steering each style. On tasks such as language model detoxification and sentiment control, LM-Steers can achieve comparable or superior performance compared with state-of-the-art controlled generation methods while maintaining a better balance with generation quality. The learned LM-Steer serves as a lens in text styles: it reveals that word embeddings are interpretable when associated with language model generations and can highlight text spans that most indicate the style differences. An LM-Steer is transferrable between different language models by an explicit form calculation. One can also continuously steer LMs simply by scaling the LM-Steer or compose multiple LM-Steers by adding their transformations. Our codes are publicly available at https://github.com/Glaciohound/LM-Steer
语言模型(LM)在语言语料库的预训练过程中自动学习词嵌入。尽管词嵌入通常被解释为单个词的特征向量,但它们在语言模型生成中的作用仍未得到充分探索。在这项工作中,我们从理论上和经验上重新审视了输出词嵌入,发现它们的线性变换相当于引导语言模型生成样式。我们将此类转向装置命名为 LM-Steers,并发现它们存在于各种尺寸的 LM 中。它需要等于原始 LM 大小 0.2% 的学习参数来引导每种风格。在语言模型解毒和情感控制等任务上,LM-Steers 可以实现与最先进的受控生成方法相当或更好的性能,同时与生成质量保持更好的平衡。学习到的 LM-Steer 充当文本风格的镜头:它揭示了与语言模型生成相关时词嵌入是可解释的,并且可以突出显示最能表明风格差异的文本跨度。 LM-Steer 可通过显式形式计算在不同语言模型之间转移。人们还可以简单地通过缩放 LM-Steer 来连续引导 LM,或者通过添加变换来组合多个 LM-Steer。我们的代码可在https://github.com/Glaciohound/LM-Steer上公开获取.
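LM-Steer 的基本想法(对输出词嵌入施加线性变换,并用系数连续调节)可以用下面的草图来理解。这里的 W = I + epsilon * Q 形式、矩阵规模与随机初始化都只是为演示而做的假设,并非论文实现。

```python
import torch

hidden, vocab = 64, 1000
E = torch.randn(vocab, hidden)            # 输出词嵌入矩阵(示意)
Q = torch.randn(hidden, hidden) * 0.01    # 学到的风格方向(示意)

def steered_logits(h: torch.Tensor, epsilon: float) -> torch.Tensor:
    """h: [hidden] 为最后一层隐状态;等价于用 (I + eps*Q) 变换词嵌入后取内积。"""
    W = torch.eye(hidden) + epsilon * Q
    return (E @ W) @ h

h = torch.randn(hidden)
logits_neutral = steered_logits(h, epsilon=0.0)   # epsilon=0 即原始模型
logits_styled = steered_logits(h, epsilon=3.0)    # 放大 epsilon 即增强该风格
```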

SAC Awards

SAC奖项

Deciphering Oracle Bone Language with Diffusion Models

用扩散模型破译甲骨文语言

Originating from China’s Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a novel frontier for OBS decipherment, challenging traditional NLP methods that rely heavily on large textual corpora, a luxury not afforded by historical languages. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). Utilizing a conditional diffusion-based strategy, OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages. To validate its efficacy, extensive experiments were conducted on an oracle bone script dataset, with quantitative results demonstrating the effectiveness of OBSD. Code and decipherment results will be made available at this https URL
甲骨文(OBS)起源于大约 3,000 年前的中国商代,是语言文字史上的基石,早于许多成熟的书写系统。尽管已发现数千条铭文,但仍有大量甲骨文未被破译,给这门古老的语言蒙上了一层神秘的面纱。现代 AI 技术的出现为甲骨文破译提供了一个新的前沿,挑战了严重依赖大型文本语料库的传统 NLP 方法,而这对历史语言而言是难以企及的奢侈。本文介绍了一种采用图像生成技术的新颖方法,具体而言是开发了 Oracle Bone Script Decipher(OBSD)。OBSD 利用基于条件扩散的策略生成重要的破译线索,为 AI 辅助的古代语言分析开辟了新路线。为了验证其有效性,我们在甲骨文数据集上进行了大量实验,定量结果证明了 OBSD 的有效性。代码和破译结果将在以下网址提供: 这个 https 网址。

Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models’ Understanding of Discourse Relations

话语苏格拉底式提问:评估语言模型对话语关系理解的可信度

While large language models have significantly enhanced the effectiveness of discourse relation classifications, it remains unclear whether their comprehension is faithful and reliable. We provide DiSQ, a new method for evaluating the faithfulness of understanding discourse based on question answering. We first employ in-context learning to annotate the reasoning for discourse comprehension, based on the connections among key events within the discourse. Following this, DiSQ interrogates the model with a sequence of questions to assess its grasp of core event relations, its resilience to counterfactual queries, as well as its consistency with its previous responses. We then evaluate language models with different architectural designs using DiSQ, finding: (1) DiSQ presents a significant challenge for all models, with the top-performing GPT model attaining only 41% of the ideal performance in PDTB; (2) DiSQ is robust to domain shifts and paraphrase variations; (3) Open-source models generally lag behind their closed-source GPT counterparts, with notable exceptions being those enhanced with chat and code/math features; (4) Our analysis validates the effectiveness of explicitly signalled discourse connectives, the role of contextual information, and the benefits of using historical QA data.
虽然大型语言模型显着增强了话语关系分类的有效性,但仍不清楚它们的理解是否忠实可靠。我们提供了 DiSQ,一种基于问答来评估理解话语忠实度的新方法。我们首先根据语篇中关键事件之间的联系,采用语境学习来注释语篇理解的推理。接下来,DiSQ 通过一系列问题询问模型,以评估其对核心事件关系的掌握、对反事实查询的弹性以及与之前响应的一致性。然后使用 DiSQ 评估具有不同架构设计的语言模型,发现:(1)DiSQ 对所有模型都提出了重大挑战,性能最好的 GPT 模型仅达到 PDTB 理想性能的 41%; (2) DiSQ 对于域转移和释义变化具有鲁棒性; (3) 开源模型通常落后于闭源 GPT 模型,值得注意的例外是那些通过聊天和代码/数学功能增强的模型; (4) 我们的分析验证了明确指示的话语连接词的有效性、上下文信息的作用以及使用历史 QA 数据的好处。

RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization

RomanSetu:通过罗马化有效解锁大型语言模型的多语言功能

This study addresses the challenge of extending Large Language Models (LLMs) to non-English languages that use non-Roman scripts. We propose an approach that utilizes the romanized form of text as an interface for LLMs, hypothesizing that its frequent informal use and shared tokens with English enhance cross-lingual alignment. Our approach involves the continual pretraining of an English LLM like Llama 2 on romanized text of non-English, non-Roman script languages, followed by instruction tuning on romanized data. The results indicate that romanized text not only reduces token fertility by 2x-4x but also matches or outperforms native script representation across various NLU, NLG, and MT tasks. Moreover, the embeddings computed on romanized text exhibit closer alignment with their English translations than those from the native script. Our approach presents a promising direction for leveraging the power of English LLMs in languages traditionally underrepresented in NLP. Our code is available on this https URL
这项研究致力于解决将大型语言模型( LLMs )扩展到使用非罗马文字的非英语语言这一挑战。我们提出了一种以文本的罗马化形式作为LLMs接口的方法,其假设是:罗马化文本在非正式场景中的频繁使用及其与英语共享的 token,能够增强跨语言对齐。我们的方法是让 Llama 2 这类英语LLM先在非英语、非罗马文字语言的罗马化文本上进行持续预训练,然后在罗马化数据上进行指令微调。结果表明,罗马化文本不仅将 token 切分膨胀率(fertility)降低了 2 到 4 倍,而且在各种 NLU、NLG 和 MT 任务中与原生文字表示持平或更优。此外,在罗马化文本上计算的嵌入与其英语翻译的对齐程度比原生文字更高。我们的方法为在 NLP 中传统上代表性不足的语言里发挥英语LLMs的能力提供了一个有希望的方向。我们的代码可在 这个 https 网址 获取。

Steering Llama 2 via Contrastive Activation Addition

通过对比激活加法控制 Llama 2

We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes “steering vectors” by averaging the difference in residual stream activations between pairs of positive and negative examples of a particular behavior, such as factual versus hallucinatory responses. During inference, these steering vectors are added at all token positions after the user’s prompt with either a positive or negative coefficient, allowing precise control over the degree of the targeted behavior. We evaluate CAA’s effectiveness on Llama 2 Chat using multiple-choice behavioral question datasets and open-ended generation tasks. We demonstrate that CAA significantly alters model behavior, is effective over and on top of traditional methods like finetuning and system prompt design, and minimally reduces capabilities. Moreover, we gain deeper insights into CAA’s mechanisms by employing various activation space interpretation methods. CAA accurately steers model outputs and sheds light on how high-level concepts are represented in Large Language Models (LLMs).
我们引入了对比激活加法(CAA),这是一种通过在前向传播过程中修改语言模型的激活来引导其行为的创新方法。CAA 通过对某一特定行为(例如事实性回应与幻觉性回应)的正负示例对之间残差流激活的差值取平均来计算“转向向量”。在推理过程中,这些转向向量会以正或负的系数加到用户提示之后的所有 token 位置上,从而可以精确控制目标行为的程度。我们使用多项选择行为问题数据集和开放式生成任务来评估 CAA 在 Llama 2 Chat 上的有效性。我们证明 CAA 显著改变了模型行为,在微调和系统提示设计等传统方法之外(以及叠加其上)依然有效,并且对模型能力的削弱极小。此外,我们通过采用多种激活空间解释方法,对 CAA 的机制有了更深入的了解。CAA 能够准确引导模型输出,并揭示高级概念如何在大型语言模型( LLMs )中表示。

MAP’s not dead yet: Uncovering true language model modes by conditioning away degeneracy

MAP 尚未消亡:通过条件化排除退化来发现真正的语言模型模式

It has been widely observed that exact or approximate MAP (mode-seeking) decoding from natural language generation (NLG) models consistently leads to degenerate outputs (Holtzman et al., 2019; Stahlberg and Byrne, 2019). Prior work has attributed this behavior to either a fundamental and unavoidable inadequacy of modes in probabilistic models or weaknesses in language modeling. Contrastingly, we argue that degenerate modes can even occur in the absence of any modeling error, due to contamination of the training data. Specifically, we argue that mixing even a tiny amount of low-entropy noise with a population text distribution can cause the data distribution’s mode to become degenerate. We therefore propose to apply MAP decoding to the model’s true conditional distribution where the conditioning variable explicitly avoids specific degenerate behavior. Using exact search, we empirically verify that the length-conditional modes of machine translation models and language models are indeed more fluent and topical than their unconditional modes. For the first time, we also share many examples of exact modal sequences from these models, and from several variants of the LLaMA-7B model. Notably, we observe that various kinds of degenerate modes persist, even at the scale of LLaMA-7B. Although we cannot tractably address these degeneracies with exact search, we perform a classifier-based approximate search on LLaMA-7B, a model which was not trained for instruction following, and find that we are able to elicit reasonable outputs without any finetuning.
人们广泛观察到,从自然语言生成(NLG)模型进行精确或近似的 MAP(寻模)解码始终会导致退化输出(Holtzman 等人,2019;Stahlberg 和 Byrne,2019)。先前的工作将这种行为归因于概率模型中模式(mode)的根本且不可避免的缺陷,或语言建模本身的弱点。与此相反,我们认为,由于训练数据的污染,即使没有任何建模误差,退化的模式也可能出现。具体来说,我们认为,即使将极少量的低熵噪声混入总体文本分布,也会导致数据分布的模式退化。因此,我们建议将 MAP 解码应用于模型的真实条件分布,其中条件变量显式地排除了特定的退化行为。通过精确搜索,我们实证验证了机器翻译模型和语言模型的长度条件模式确实比它们的无条件模式更流畅、更切题。我们还首次展示了来自这些模型以及 LLaMA-7B 模型多个变体的精确模式序列的许多示例。值得注意的是,我们观察到即使在 LLaMA-7B 的规模上,各种退化模式仍然存在。尽管我们无法通过精确搜索高效地解决这些退化问题,但我们对 LLaMA-7B(一个未经过指令遵循训练的模型)执行了基于分类器的近似搜索,发现无需任何微调即可得到合理的输出。

Spiral of Silence: How is Large Language Model Killing Information Retrieval?—A Case Study on Open Domain Question Answering

沉默的螺旋:大型语言模型如何扼杀信息检索? —开放域问答案例研究

The practice of Retrieval-Augmented Generation (RAG), which integrates Large Language Models (LLMs) with retrieval systems, has become increasingly prevalent. However, the repercussions of LLM-derived content infiltrating the web and influencing the retrieval-generation feedback loop are largely uncharted territories. In this study, we construct and iteratively run a simulation pipeline to deeply investigate the short-term and long-term effects of LLM text on RAG systems. Taking the trending Open Domain Question Answering (ODQA) task as a point of entry, our findings reveal a potential digital “Spiral of Silence” effect, with LLM-generated text consistently outperforming human-authored content in search rankings, thereby diminishing the presence and impact of human contributions online. This trend risks creating an imbalanced information ecosystem, where the unchecked proliferation of erroneous LLM-generated content may result in the marginalization of accurate information. We urge the academic community to take heed of this potential issue, ensuring a diverse and authentic digital information landscape.
将大型语言模型( LLMs )与检索系统相结合的检索增强生成(RAG)实践已变得越来越普遍。然而,LLM生成的内容渗入网络并影响“检索-生成”反馈循环所带来的后果,在很大程度上仍是未知领域。在本研究中,我们构建并迭代运行一个模拟流水线,以深入研究LLM文本对 RAG 系统的短期和长期影响。以热门的开放域问答(ODQA)任务为切入点,我们的研究结果揭示了一种潜在的数字“沉默螺旋”效应:LLM生成的文本在搜索排名中持续优于人类撰写的内容,从而削弱了人类贡献在网络上的存在感和影响力。这种趋势有可能造成不平衡的信息生态系统,LLM生成的错误内容若不受控制地扩散,可能导致准确信息被边缘化。我们呼吁学术界重视这一潜在问题,以确保多样化和真实的数字信息环境。

CausalGym: Benchmarking causal interpretability methods on linguistic tasks

CausalGym:对语言任务的因果可解释性方法进行基准测试

Language models (LMs) have proven to be powerful tools for psycholinguistic research, but most prior work has focused on purely behavioural measures (e.g., surprisal comparisons). At the same time, research in model interpretability has begun to illuminate the abstract causal mechanisms shaping LM behavior. To help bring these strands of research closer together, we introduce CausalGym. We adapt and expand the SyntaxGym suite of tasks to benchmark the ability of interpretability methods to causally affect model behaviour. To illustrate how CausalGym can be used, we study the pythia models (14M–6.9B) and assess the causal efficacy of a wide range of interpretability methods, including linear probing and distributed alignment search (DAS). We find that DAS outperforms the other methods, and so we use it to study the learning trajectory of two difficult linguistic phenomena in pythia-1b: negative polarity item licensing and filler–gap dependencies. Our analysis shows that the mechanism implementing both of these tasks is learned in discrete stages, not gradually.
语言模型(LM)已被证明是心理语言学研究的强大工具,但大多数先前的工作都集中在纯行为层面的度量(例如惊异度比较)上。与此同时,模型可解释性研究已开始阐明塑造 LM 行为的抽象因果机制。为了帮助将这两条研究线索结合得更紧密,我们推出了 CausalGym。我们改编并扩展了 SyntaxGym 任务套件,用以评测各种可解释性方法对模型行为施加因果影响的能力。为了说明 CausalGym 的用法,我们研究了 pythia 系列模型(14M–6.9B),并评估了多种可解释性方法的因果效力,包括线性探针和分布式对齐搜索(DAS)。我们发现 DAS 优于其他方法,因此用它来研究 pythia-1b 中两种困难语言现象的学习轨迹:负极性项允准和填充词-空位依存。我们的分析表明,实现这两项任务的机制是分阶段离散学习的,而不是逐渐习得的。
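作为 CausalGym 中被评测方法之一的“线性探针”的常规做法示意:在冻结模型的某层隐状态上训练一个线性分类器来预测语言学特征。下面用随机特征和构造标签代替真实隐状态,只演示训练与评估流程,不涉及论文强调的因果干预部分。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(500, 128))      # 某层隐状态(示意)
labels = (hidden_states[:, 0] > 0).astype(int)   # 构造的二元语言学标签(示意)

# 前 400 条训练、后 100 条评估
probe = LogisticRegression(max_iter=1000).fit(hidden_states[:400], labels[:400])
print("probe accuracy:", probe.score(hidden_states[400:], labels[400:]))
```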

COKE: A Cognitive Knowledge Graph for Machine Theory of Mind

COKE:机器心理理论的认知知识图

Theory of mind (ToM) refers to humans’ ability to understand and infer the desires, beliefs, and intentions of others. The acquisition of ToM plays a key role in humans’ social cognition and interpersonal relations. Though indispensable for social intelligence, ToM is still lacking for modern AI and NLP systems since they cannot access the human mental state and cognitive process beneath the training corpus. To empower AI systems with the ToM ability and narrow the gap between them and humans, in this paper, we propose COKE: the first cognitive knowledge graph for machine theory of mind. Specifically, COKE formalizes ToM as a collection of 45k+ manually verified cognitive chains that characterize human mental activities and subsequent behavioral/affective responses when facing specific social circumstances. In addition, we further generalize COKE using LLMs and build a powerful generation model COLM tailored for cognitive reasoning. Experimental results in both automatic and human evaluation demonstrate the high quality of COKE, the superior ToM ability of COLM, and its potential to significantly enhance social applications.
心智理论 (ToM) 是指人类理解和推断他人的愿望、信念和意图的能力。 ToM的获得对于人类的社会认知和人际关系起着关键作用。虽然 ToM 对于社交智能来说是不可或缺的,但对于现代人工智能和自然语言处理系统来说仍然缺乏,因为它们无法访问训练语料库下的人类心理状态和认知过程。为了赋予人工智能系统ToM能力,缩小它们与人类之间的差距,在本文中,我们提出了COKE:第一个机器心理理论的认知知识图谱。具体来说,COKE 将 ToM 正式化为 45k+ 条手动验证的认知链的集合,这些认知链描述了人类面对特定社会环境时的心理活动和随后的行为/情感反应。此外,我们利用LLMs进一步推广 COKE,并构建了专为认知推理量身定制的强大生成模型 COLM。自动评估和人工评估的实验结果都证明了 COKE 的高质量、COLM 卓越的 ToM 能力及其显着增强社会应用的潜力。

Why are Sensitive Functions Hard for Transformers?

为什么敏感函数对于 Transformer 来说很难?

Empirical studies have identified a range of learnability biases and limitations of transformers, such as a persistent difficulty in learning to compute simple formal languages such as PARITY, and a bias towards low-degree functions. However, theoretical understanding remains limited, with existing expressiveness theory either overpredicting or underpredicting realistic learning abilities. We prove that, under the transformer architecture, the loss landscape is constrained by the input-space sensitivity: Transformers whose output is sensitive to many parts of the input string inhabit isolated points in parameter space, leading to a low-sensitivity bias in generalization. We show theoretically and empirically that this theory unifies a broad array of empirical observations about the learning abilities and biases of transformers, such as their generalization bias towards low sensitivity and low degree, and difficulty in length generalization for PARITY. This shows that understanding transformers’ inductive biases requires studying not just their in-principle expressivity, but also their loss landscape.
实证研究已经发现了 Transformer 的一系列可学习性偏差和局限,例如在学习计算 PARITY 等简单形式语言上的持续困难,以及对低次函数的偏好。然而,理论上的理解仍然有限,现有的表达能力理论要么高估、要么低估了实际的学习能力。我们证明,在 Transformer 架构下,损失地形受到输入空间敏感度的约束:输出对输入字符串的许多部分都敏感的 Transformer 位于参数空间中的孤立点上,从而导致泛化中的低敏感度偏置。我们从理论和实证上表明,该理论统一了关于 Transformer 学习能力和偏差的大量经验观察,例如它们向低敏感度、低次函数泛化的倾向,以及 PARITY 在长度泛化上的困难。这表明,理解 Transformer 的归纳偏置不仅需要研究其原则上的表达能力,还需要研究其损失地形。

Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?

使用语音基础模型和大型语言模型的语音翻译:有什么和缺少什么?

The field of natural language processing (NLP) has recently witnessed a transformative shift with the emergence of foundation models, particularly Large Language Models (LLMs) that have revolutionized text-based NLP. This paradigm has extended to other modalities, including speech, where researchers are actively exploring the combination of Speech Foundation Models (SFMs) and LLMs into single, unified models capable of addressing multimodal tasks. Among such tasks, this paper focuses on speech-to-text translation (ST). By examining the published papers on the topic, we propose a unified view of the architectural solutions and training strategies presented so far, highlighting similarities and differences among them. Based on this examination, we not only organize the lessons learned but also show how diverse settings and evaluation approaches hinder the identification of the best-performing solution for each architectural building block and training choice. Lastly, we outline recommendations for future works on the topic aimed at better understanding the strengths and weaknesses of the SFM+LLM solutions for ST.
随着基础模型的出现,自然语言处理(NLP)领域最近发生了一场变革,特别是大型语言模型( LLMs ),它彻底改变了基于文本的 NLP。这种范式已经扩展到包括语音在内的其他模态,研究人员正在积极探索将语音基础模型(SFM)与LLMs结合成能够处理多模态任务的单一统一模型。在这些任务中,本文重点关注语音到文本翻译(ST)。通过梳理该主题的已发表论文,我们对迄今提出的架构方案和训练策略给出了统一的视角,并强调它们之间的异同。基于这一梳理,我们不仅总结了经验教训,还展示了不同的实验设置和评估方法如何阻碍为每个架构模块和训练选择确定性能最佳的方案。最后,我们为该主题的后续工作给出建议,旨在更好地理解用于 ST 的 SFM+LLM方案的优势和不足。

AI ‘News’ Content Farms Are Easy to Make and Hard to Detect: A Case Study in Italian

人工智能“新闻”内容农场易于制作且难以检测:意大利语案例研究

Large Language Models (LLMs) are increasingly used as “content farm” models (CFMs), to generate synthetic text that could pass for real news articles. This is already happening even for languages that do not have high-quality monolingual LLMs. We show that fine-tuning Llama (v1), mostly trained on English, on as little as 40K Italian news articles, is sufficient for producing news-like texts that native speakers of Italian struggle to identify as synthetic.
大型语言模型( LLMs )越来越多地被用作“内容农场”模型(CFM),以生成可以冒充真实新闻文章的合成文本。即使对于那些尚无高质量单语LLMs的语言来说,这种情况也已经在发生。我们证明,对主要以英语训练的 Llama(v1)仅用 4 万篇意大利语新闻文章进行微调,就足以生成让意大利语母语者难以辨别为合成内容的类新闻文本。
We investigate three LLMs and three methods of detecting synthetic texts (log-likelihood, DetectGPT, and supervised classification), finding that they all perform better than human raters, but they are all impractical in the real world (requiring either access to token likelihood information or a large dataset of CFM texts). We also explore the possibility of creating a proxy CFM: an LLM fine-tuned on a similar dataset to one used by the real “content farm”. We find that even a small amount of fine-tuning data suffices for creating a successful detector, but we need to know which base LLM is used, which is a major challenge.
我们研究了三种LLMs和三种检测合成文本的方法(对数似然、DetectGPT 和监督分类),发现它们的表现都优于人类评分者,但在现实世界中都不实用(需要访问 token 似然信息,或需要大规模的 CFM 文本数据集)。我们还探索了构建代理 CFM 的可能性:在与真实“内容农场”所用数据集相似的数据集上微调的LLM。我们发现,即使少量微调数据也足以训练出成功的检测器,但前提是需要知道所用的基础LLM是哪一个,而这是一个重大挑战。
Our results suggest that there are currently no practical methods for detecting synthetic news-like texts ‘in the wild’, while generating them is too easy. We highlight the urgency of more NLP research on this problem.
我们的结果表明,目前还没有实用的方法来检测“野外”合成的类似新闻的文本,而生成它们却太容易了。我们强调针对这个问题进行更多 NLP 研究的紧迫性。
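文中提到的三种检测手段里,对数似然法最容易落地。下面是一个最小示意:用一个本地可得的因果语言模型给文本打分,平均 token 对数似然偏高的文本更可能是机器生成。模型名 "gpt2" 只是占位,实际应换成与目标语言匹配的模型,判定阈值也需要在真实数据上标定;依赖 Hugging Face transformers。

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"   # 占位:实际应选择与目标语言匹配的模型
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def avg_log_likelihood(text: str) -> float:
    """返回文本的平均 token 对数似然(nat),数值越大越“像模型写的”。"""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss    # 平均交叉熵
    return -loss.item()

print(avg_log_likelihood("This is a short news-like sentence."))
```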

CaMML: Context-Aware Multimodal Learner for Large Models

CaMML:大型模型的上下文感知多模态学习器

In this work, we introduce Context-Aware MultiModal Learner (CaMML), for tuning large multimodal models (LMMs). CaMML, a lightweight module, is crafted to seamlessly integrate multimodal contextual samples into large models, thereby empowering the model to derive knowledge from analogous, domain-specific, up-to-date information and make grounded inferences. Importantly, CaMML is highly scalable and can efficiently handle lengthy multimodal context examples owing to its hierarchical design. Based on CaMML, we have developed two multimodal models, CaMML-7B and CaMML-13B, that have shown exceptional performance across an array of benchmark datasets for multimodal tasks. Remarkably, CaMML-13B achieves the state-of-the-art performance on over ten widely recognized multimodal benchmark datasets, surpassing LLaVA-1.5 (13B) with a noticeable margin, without integration of any external resources. Moreover, we have conducted extensive ablative studies to inspect the inner workings of CaMML and performed qualitative analyses to showcase its effectiveness in handling real-world challenging cases. Code and models are available at: https://github.com/amazon-science/camml
在这项工作中,我们引入了上下文感知多模态学习器(CaMML),用于调优大型多模态模型(LMM)。CaMML 是一个轻量级模块,旨在将多模态上下文样本无缝集成到大模型中,从而使模型能够从相似的、特定领域的、最新的信息中获取知识,并做出有依据的推断。重要的是,得益于其分层设计,CaMML 具有高度可扩展性,能够高效处理冗长的多模态上下文示例。基于 CaMML,我们开发了 CaMML-7B 和 CaMML-13B 两个多模态模型,它们在一系列多模态任务基准数据集上表现出色。值得注意的是,CaMML-13B 在十多个广泛认可的多模态基准数据集上取得了最先进的性能,在不借助任何外部资源的情况下以明显优势超越了 LLaVA-1.5(13B)。此外,我们进行了广泛的消融研究来考察 CaMML 的内部机制,并通过定性分析展示其在处理现实世界挑战性案例中的有效性。代码和模型可在以下位置获取:https://github.com/amazon-science/camml

Greed is All You Need: An Evaluation of Tokenizer Inference Methods

贪婪就是你所需要的:分词器推理方法的评估

While subword tokenizers such as BPE and WordPiece are typically used to build vocabularies for NLP models, the method of decoding text into a sequence of tokens from these vocabularies is often left unspecified, or ill-suited to the method in which they were constructed. We provide a controlled analysis of seven tokenizer inference methods across four different algorithms and three vocabulary sizes, performed on a novel intrinsic evaluation suite we curated for English, combining measures rooted in morphology, cognition, and information theory. We show that for the most commonly used tokenizers, greedy inference performs surprisingly well; and that SaGe, a recently-introduced contextually-informed tokenizer, outperforms all others on morphological alignment.
虽然 BPE 和 WordPiece 等子词分词器通常用于为 NLP 模型构建词汇表,但将文本解码为这些词汇表中标记序列的方法往往未被明确指定,或者与词汇表的构建方式并不匹配。我们对四种不同算法和三种词汇表规模下的七种分词器推理方法进行了受控分析,实验在我们为英语整理的新颖内在评估套件上进行,该套件结合了植根于形态学、认知和信息论的度量。我们证明,对于最常用的分词器来说,贪婪推理的表现出奇地好;而最近推出的、结合上下文信息的分词器 SaGe 在形态对齐方面优于所有其他分词器。

Don’t Hallucinate, Abstain: Identifying LLM Knowledge Gaps via Multi-LLM Collaboration

不要产生幻觉,放弃:通过多LLM合作识别LLM知识差距

Despite efforts to expand the knowledge of large language models (LLMs), knowledge gaps – missing or outdated information in LLMs – might always persist given the evolving nature of knowledge. In this work, we study approaches to identify LLM knowledge gaps and abstain from answering questions when knowledge gaps are present. We first adapt existing approaches to model calibration or adaptation through fine-tuning/prompting and analyze their ability to abstain from generating low-confidence outputs. Motivated by their failures in self-reflection and over-reliance on held-out sets, we propose two novel approaches that are based on model collaboration, i.e., LLMs probing other LLMs for knowledge gaps, either cooperatively or competitively. Extensive experiments with three LLMs on four QA tasks featuring diverse knowledge domains demonstrate that both cooperative and competitive approaches to unveiling LLM knowledge gaps achieve up to 19.3% improvements on abstain accuracy against the strongest baseline. Further analysis reveals that our proposed mechanisms could help identify failure cases in retrieval augmentation and pinpoint knowledge gaps in multi-hop reasoning.
尽管努力扩展大型语言模型 ( LLMs ) 的知识,但考虑到知识不断发展的性质,知识差距( LLMs中缺失或过时的信息)可能始终持续存在。在这项工作中,我们研究了识别LLM知识差距的方法,并在存在知识差距时避免回答问题。我们首先通过微调/提示来调整现有方法来进行模型校准或适应,并分析它们避免生成低置信度输出的能力。出于自我反思失败和过度依赖保留集的动机,我们提出了两种基于模型协作的新方法,即LLMs以合作或竞争的方式探索其他LLMs的知识差距。三位LLMs在具有不同知识领域的四项 QA 任务上进行的广泛实验表明,与最强基线相比,揭示LLM知识差距的合作和竞争方法在弃权准确性方面实现了高达 19.3% 的提高。进一步的分析表明,我们提出的机制可以帮助识别检索增强中的失败案例,并查明多跳推理中的知识差距。

VariErr NLI: Separating Annotation Error from Human Label Variation

VariErr NLI:将注释错误与人类标签变化分开

Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white.To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs.VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error detection (AED) methods and GPTs in uncovering errors versus human label variation. We find that state-of-the-art AED methods significantly underperform GPTs and humans. While GPT-4 is the best system, it still falls short of human performance. Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation, which in turn can yield better and more trustworthy NLP systems.
当标注者出于正当理由为同一条目给出不同标签时,就会出现人类标签差异;而当标签因无效原因被给出时,则属于标注错误。这两个问题在 NLP 基准中普遍存在,但现有研究对它们是分开考察的。据我们所知,此前没有工作专注于将错误与信号区分开来,尤其是在信号并非非黑即白的情况下。为了填补这一空白,我们引入了一套系统方法和一个新数据集 VariErr(variation versus error),聚焦于英语的 NLI 任务。我们提出了一个两轮标注流程:标注者先为每个标签写出解释,随后判断“标签-解释”对的有效性。VariErr 包含针对 500 个重新标注的 MNLI 条目的 1,933 条解释所做出的 7,732 个有效性判断。我们评估了各种自动错误检测(AED)方法和 GPT 系列模型在区分错误与人类标签差异方面的有效性。我们发现,最先进的 AED 方法明显逊色于 GPT 和人类。虽然 GPT-4 是最好的系统,但仍不及人类表现。我们的方法不仅适用于 NLI,也为未来关于错误与合理差异的研究提供了沃土,从而有助于构建更好、更值得信赖的 NLP 系统。

Distributional Inclusion Hypothesis and Quantifications: Probing for Hypernymy in Functional Distributional Semantics

分布包含假设和量化:功能分布语义学中的上位关系探究

Functional Distributional Semantics (FDS) models the meaning of words by truth-conditional functions. This provides a natural representation for hypernymy but no guarantee that it can be learnt when FDS models are trained on a corpus. In this paper, we probe into FDS models and study the representations learnt, drawing connections between quantifications, the Distributional Inclusion Hypothesis (DIH), and the variational-autoencoding objective of FDS model training. Using synthetic data sets, we reveal that FDS models learn hypernymy on a restricted class of corpus that strictly follows the DIH. We further introduce a training objective that both enables hypernymy learning under the reverse of the DIH and improves hypernymy detection from real corpora.
功能分布语义 (FDS) 通过真值条件函数对单词的含义进行建模。这为上位词提供了自然的表示,但不能保证在语料库上训练 FDS 模型时可以学习它。在本文中,我们探讨了 FDS 模型并研究了学习到的表示,在量化、分布包含假设 (DIH) 和 FDS 模型训练的变分自动编码目标之间建立了联系。使用合成数据集,我们揭示了 FDS 模型在严格遵循 DIH 的受限语料库类别上学习上位词。我们进一步引入了一个训练目标,既可以在 DIH 的反向下实现上位学习,又可以改进真实语料库的上位检测。
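分布包含假设(DIH)本身可以用一个很小的玩具语料直观检查:下位词出现过的上下文特征集合应大体包含于上位词的上下文特征集合之中。下面的语料、特征抽取方式和打分函数都是为演示而构造的,并非论文中的 FDS 模型。

```python
from collections import defaultdict

# 玩具语料:每条记录为(词, 上下文特征)
corpus = [("dog", "barks"), ("dog", "runs"), ("animal", "barks"),
          ("animal", "runs"), ("animal", "sleeps"), ("car", "runs")]

contexts = defaultdict(set)
for word, ctx in corpus:
    contexts[word].add(ctx)

def inclusion_score(hypo: str, hyper: str) -> float:
    """下位词上下文被上位词覆盖的比例,越接近 1 越符合 DIH。"""
    c_hypo, c_hyper = contexts[hypo], contexts[hyper]
    return len(c_hypo & c_hyper) / len(c_hypo) if c_hypo else 0.0

print(inclusion_score("dog", "animal"))   # 1.0:dog 的上下文全部被 animal 覆盖
print(inclusion_score("animal", "dog"))   # 小于 1.0:反方向不成立
```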

MIDGARD: Self-Consistency Using Minimum Description Length for Structured Commonsense Reasoning

MIDGARD:使用最小描述长度进行结构化常识推理的自我一致性

We study the task of conducting structured reasoning as generating a reasoning graph from natural language input using large language models (LLMs). Previous approaches have explored various prompting schemes, yet they suffer from error propagation due to the autoregressive nature and single-pass-based decoding, which lack error correction capability. Additionally, relying solely on a single sample may result in the omission of true nodes and edges. To counter this, we draw inspiration from self-consistency (SC), which involves sampling a diverse set of reasoning chains and taking the majority vote as the final answer. To tackle the substantial challenge of applying SC on generated graphs, we propose MIDGARD (MInimum Description length Guided Aggregation of Reasoning in Directed acyclic graph) that leverages Minimum Description Length (MDL)-based formulation to identify consistent properties among the different graph samples generated by an LLM. This formulation helps reject properties that appear in only a few samples, which are likely to be erroneous, while enabling the inclusion of missing elements without compromising precision. Our method demonstrates superior performance than comparisons across various structured reasoning tasks, including argument structure extraction, explanation graph generation, inferring dependency relations among actions for everyday tasks, and semantic graph generation from natural texts.
我们将进行结构化推理的任务视为:使用大型语言模型( LLMs )从自然语言输入生成推理图。先前的方法探索了各种提示方案,但由于自回归特性和单遍解码缺乏纠错能力,它们会受到错误传播的影响。此外,仅依赖单个样本可能会遗漏真实的节点和边。为了解决这一问题,我们从自洽性(self-consistency,SC)中汲取灵感,即对一组多样的推理链进行采样,并以多数表决作为最终答案。为了应对在生成的图上应用 SC 这一重大挑战,我们提出了 MIDGARD(有向无环图中的最小描述长度引导推理聚合),它利用基于最小描述长度(MDL)的形式化来识别LLM生成的不同图样本之间的一致属性。该形式化有助于剔除仅出现在少数样本中、很可能是错误的属性,同时能够在不牺牲精度的情况下补回缺失的元素。我们的方法在多种结构化推理任务上均表现出优于对比方法的性能,包括论证结构抽取、解释图生成、日常任务中动作间依赖关系的推断,以及从自然文本生成语义图。
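作为理解 MIDGARD 聚合思想的一个简化替代示意:对LLM多次采样得到的推理图,统计每条边的出现频率,仅保留出现比例超过阈值的边,以过滤偶发错误并补回单次采样可能漏掉的边。论文中的 MDL 目标在此用频率阈值粗略近似,示例图也是假设的。

```python
from collections import Counter

sampled_graphs = [                      # 每个样本是一组有向边(示意)
    {("A", "B"), ("B", "C"), ("A", "D")},
    {("A", "B"), ("B", "C")},
    {("A", "B"), ("B", "C"), ("C", "E")},
]

def aggregate_edges(graphs, keep_ratio: float = 0.5) -> set:
    """保留在至少 keep_ratio 比例样本中出现的边(近似多数表决)。"""
    counts = Counter(e for g in graphs for e in g)
    thresh = keep_ratio * len(graphs)
    return {e for e, c in counts.items() if c >= thresh}

print(aggregate_edges(sampled_graphs))  # {('A', 'B'), ('B', 'C')}
```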

CHECKWHY: Causal Fact Verification via Argument Structure

CHECKWHY:通过论证结构验证因果事实

With the growing complexity of fact verification tasks, the concern with “thoughtful” reasoning capabilities is increasing. However, recent fact verification benchmarks mainly focus on checking a narrow scope of semantic factoids within claims and lack an explicit logical reasoning process. In this paper, we introduce CHECKWHY, a challenging dataset tailored to a novel causal fact verification task: checking the truthfulness of the causal relation within claims through rigorous reasoning steps. CHECKWHY consists of over 19K “why” claim-evidence- argument structure triplets with supports, refutes, and not enough info labels. Each argument structure is composed of connected evidence, representing the reasoning process that begins with foundational evidence and progresses toward claim establishment. Through extensive experiments on state-of-the-art models, we validate the importance of incorporating the argument structure for causal fact verification. Moreover, the automated and human evaluation of argument structure generation reveals the difficulty in producing satisfying argument structure by fine-tuned models or Chain-of-Thought prompted LLMs, leaving considerable room for future improvements.
随着事实验证任务的日益复杂,人们对“深思熟虑”的推理能力的关注也日益增加。然而,最近的事实验证基准主要集中于检查声明(claim)中范围狭窄的语义事实,缺乏明确的逻辑推理过程。在本文中,我们介绍了 CHECKWHY,这是一个具有挑战性的数据集,专为新颖的因果事实验证任务而定制:通过严格的推理步骤检查声明中因果关系的真实性。CHECKWHY 由超过 19K 个“为什么”声明-证据-论证结构三元组组成,带有支持、反驳和信息不足标签。每个论证结构都由相互关联的证据组成,代表从基础证据出发、逐步建立声明的推理过程。通过对最先进模型的广泛实验,我们验证了引入论证结构对因果事实验证的重要性。此外,对论证结构生成的自动和人工评估表明,无论是微调模型还是思维链提示的LLMs,都难以生成令人满意的论证结构,为未来的改进留下了相当大的空间。

Language Complexity and Speech Recognition Accuracy: Orthographic Complexity Hurts, Phonological Complexity Doesn’t

语言复杂性和语音识别准确性:拼写复杂性有害,语音复杂性则不然

We investigate what linguistic factors affect the performance of Automatic Speech Recognition (ASR) models. We hypothesize that orthographic and phonological complexities both degrade accuracy. To examine this, we fine-tune the multilingual self-supervised pretrained model Wav2Vec2-XLSR-53 on 25 languages with 15 writing systems, and we compare their ASR accuracy, number of graphemes, unigram grapheme entropy, logographicity (how much word/morpheme-level information is encoded in the writing system), and number of phonemes. The results demonstrate that orthographic complexities significantly correlate with low ASR accuracy, while phonological complexity shows no significant correlation.
我们研究哪些语言因素会影响自动语音识别(ASR)模型的性能。我们假设拼写复杂性和音系复杂性都会降低准确性。为了检验这一点,我们在涵盖 15 种书写系统的 25 种语言上对多语言自监督预训练模型 Wav2Vec2-XLSR-53 进行了微调,并比较了它们的 ASR 准确率、字素数量、一元字素熵、语标性(即书写系统中编码了多少单词/词素级别的信息)以及音素数量。结果表明,拼写复杂性与较低的 ASR 准确率显著相关,而音系复杂性则没有显著相关性。

COSMIC: Mutual Information for Task-Agnostic Summarization Evaluation

COSMIC:用于与任务无关的总结评估的互信息

Assessing the quality of summarizers poses significant challenges—gold summaries are hard to obtain and their suitability depends on the use context of the summarization system. Who is the user of the system, and what do they intend to do with the summary? In response, we propose a novel task-oriented evaluation approach that assesses summarizers based on their capacity to produce summaries while preserving task outcomes. We theoretically establish both a lower and upper bound on the expected error rate of these tasks, which depends on the mutual information between source texts and generated summaries. We introduce COSMIC, a practical implementation of this metric, and demonstrate its strong correlation with human judgment-based metrics, as well as its effectiveness in predicting downstream task performance. Comparative analyses against established metrics like BERTScore and ROUGE highlight the competitive performance of COSMIC.
评估摘要器的质量提出了重大挑战——黄金摘要很难获得,而且它们的适用性取决于摘要系统的使用上下文。谁是系统的用户?他们打算如何处理摘要?作为回应,我们提出了一种新颖的面向任务的评估方法,根据摘要者在保留任务结果的同时生成摘要的能力来评估摘要者。从理论上讲,我们为这些任务的预期错误率建立了下限和上限,这取决于源文本和生成的摘要之间的相互信息。我们介绍了 COSMIC,这是该指标的实际实现,并证明了它与基于人类判断的指标的强相关性,以及它在预测下游任务绩效方面的有效性。与 BERTScore 和 ROUGE 等既定指标的比较分析突显了 COSMIC 的竞争表现。

Tree-Averaging Algorithms for Ensemble-Based Unsupervised Discontinuous Constituency Parsing

基于集成的无监督不连续成分句法分析的树平均算法

We address unsupervised discontinuous constituency parsing, where we observe a high variance in the performance of the only previous model in the literature. We propose to build an ensemble of different runs of the existing discontinuous parser by averaging the predicted trees, to stabilize and boost performance. To begin with, we provide comprehensive computational complexity analysis (in terms of P and NP-complete) for tree averaging under different setups of binarity and continuity. We then develop an efficient exact algorithm to tackle the task, which runs in a reasonable time for all samples in our experiments. Results on three datasets show our method outperforms all baselines in all metrics; we also provide in-depth analyses of our approach.
我们研究无监督的不连续成分句法分析,并观察到文献中唯一已有模型的性能方差很高。我们提出对现有不连续句法分析器的多次运行结果构建集成,通过对预测树取平均来稳定并提升性能。首先,我们针对不同的二叉性与连续性设定,给出了树平均问题全面的计算复杂度分析(P 与 NP 完全)。然后,我们开发了一种高效的精确算法来求解该任务,它对我们实验中的所有样本都能在合理时间内完成。在三个数据集上的结果表明,我们的方法在所有指标上都优于所有基线;我们还对该方法进行了深入分析。
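“树平均”的直觉可以用下面的草图体会:把每棵候选树表示为成分跨度的集合,统计各跨度在集成中出现的次数,然后选使“跨度计数之和”最大的候选树。这里只在给定候选中挑选,并未实现论文中针对不连续成分的精确搜索算法;数据为占位示例。

```python
from collections import Counter

parses = [                                 # 每次运行得到的成分跨度集合(示意)
    {(0, 5), (0, 2), (2, 5), (3, 5)},
    {(0, 5), (0, 2), (2, 5), (2, 4)},
    {(0, 5), (0, 3), (3, 5), (2, 5)},
]
span_counts = Counter(s for p in parses for s in p)

def score(tree) -> int:
    """树的得分 = 其所有跨度在集成中出现次数之和。"""
    return sum(span_counts[s] for s in tree)

best = max(parses, key=score)
print(best, score(best))
```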
