[晓理紫]每日论文推送(有中文摘要或代码或者项目地址)
每日更新论文,关注晓理紫获取每日最新论文

  • 标题: A Comprehensive Study of Knowledge Editing for Large Language Models

  • 作者: Ningyu Zhang, Yunzhi Yao, Bozhong Tian

  • 摘要: Large Language Models (LLMs) have shown extraordinary capabilities in
    understanding and generating text that closely mirrors human communication.
    However, a primary limitation lies in the significant computational demands
    during training, arising from their extensive parameterization. This challenge
    is further intensified by the dynamic nature of the world, necessitating
    frequent updates to LLMs to correct outdated information or integrate new
    knowledge, thereby ensuring their continued relevance. Note that many
    applications demand continual model adjustments post-training to address
    deficiencies or undesirable behaviors. There is an increasing interest in
    efficient, lightweight methods for on-the-fly model modifications. To this end,
    recent years have seen a burgeoning in the techniques of knowledge editing for
    LLMs, which aim to efficiently modify LLMs’ behaviors within specific domains
    while preserving overall performance across various inputs. In this paper, we
    first define the knowledge editing problem and then provide a comprehensive
    review of cutting-edge approaches. Drawing inspiration from educational and
    cognitive research theories, we propose a unified categorization criterion that
    classifies knowledge editing methods into three groups: resorting to external
    knowledge, merging knowledge into the model, and editing intrinsic knowledge.
    Furthermore, we introduce a new benchmark, KnowEdit, for a comprehensive
    empirical evaluation of representative knowledge editing approaches.
    Additionally, we provide an in-depth analysis of knowledge location, which can
    provide a deeper understanding of the knowledge structures inherent within
    LLMs. Finally, we discuss several potential applications of knowledge editing,
    outlining its broad and impactful implications.

  • 中文摘要: 大型语言模型(LLMs)在理解和生成接近人类沟通的文本方面展现了非凡的能力。然而,它们的一个主要限制在于训练过程中的巨大计算需求,这是由于它们广泛的参数化造成的。这一挑战因世界的动态本质而进一步加剧,需要频繁更新LLMs以纠正过时信息或整合新知识,以确保它们的持续相关性。值得注意的是,许多应用需要在训练后持续调整模型,以解决缺陷或不良行为。对于实时模型修改的高效、轻量级方法,人们越来越感兴趣。为此,近年来,LLMs的知识编辑技术日益增多,旨在有效地修改LLMs在特定领域的行为,同时保持对各种输入的整体性能。在本文中,我们首先定义知识编辑问题,然后对最前沿的方法进行全面回顾。受教育和认知研究理论的启发,我们提出了一个统一的分类标准,将知识编辑方法分为三组:依赖外部知识、将知识融入模型和编辑内在知识。此外,我们引入了一个新的基准,KnowEdit,用于对代表性知识编辑方法进行全面的实证评估。我们还对知识位置进行了深入分析,这可以提供对LLMs内在知识结构的更深入了解。最后,我们讨论了知识编辑的几个潜在应用,概述了其广泛而深远的影响。

  • [论文下载:]http://arxiv.org/abs/2401.01286v2

  • [项目页面:]https://huggingface.co/datasets/zjunlp/KnowEdit

  • [GitHub:]https://github.com/zjunlp/EasyEdit | https://github.com/zjunlp/KnowledgeEditingPapers
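  • 代码示例: 下面给出“借助外部知识”这一类编辑方法的最小Python草图(纯属示意,并非EasyEdit的真实API;ExternalMemoryEditor等名称均为本文假设):

```python
# 示意:把编辑后的事实放在外部记忆中,推理时先查记忆、未命中再回退到原模型,
# 从而在不改动模型参数的情况下修改特定领域的行为。
from typing import Callable, Dict

class ExternalMemoryEditor:              # 假设性的类名
    def __init__(self, base_model: Callable[[str], str]):
        self.base_model = base_model     # 原始 LLM:prompt -> answer
        self.edits: Dict[str, str] = {}  # 编辑记忆:主题 -> 新知识

    def edit(self, subject: str, new_fact: str) -> None:
        """注入一条知识编辑,不触碰模型权重。"""
        self.edits[subject] = new_fact

    def query(self, prompt: str) -> str:
        for subject, fact in self.edits.items():
            if subject in prompt:        # 简化的命中判断
                return fact
        return self.base_model(prompt)   # 未命中时保持原有行为

editor = ExternalMemoryEditor(base_model=lambda p: "(原模型的回答)")
editor.edit("2022年世界杯冠军", "2022年世界杯冠军是阿根廷队。")
print(editor.query("谁是2022年世界杯冠军?"))
```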


  • 标题: SpeechAgents: Human-Communication Simulation with Multi-Modal
    Multi-Agent Systems
  • 作者: Dong Zhang, Zhaowei Li, Pengyu Wang
  • 摘要: Human communication is a complex and diverse process that not only involves
    multiple factors such as language, commonsense, and cultural backgrounds but
    also requires the participation of multimodal information, such as speech.
    Large Language Model (LLM)-based multi-agent systems have demonstrated
    promising performance in simulating human society. Can we leverage LLM-based
    multi-agent systems to simulate human communication? However, current LLM-based
    multi-agent systems mainly rely on text as the primary medium. In this paper,
    we propose SpeechAgents, a multi-modal LLM based multi-agent system designed
    for simulating human communication. SpeechAgents utilizes a multi-modal LLM as
    the control center for each individual agent and employs multi-modal signals as the
    medium for exchanged messages among agents. Additionally, we propose
    Multi-Agent Tuning to enhance the multi-agent capabilities of LLM without
    compromising general abilities. To strengthen and evaluate the effectiveness of
    human communication simulation, we build the Human-Communication Simulation
    Benchmark. Experimental results demonstrate that SpeechAgents can simulate
    human communication dialogues with consistent content, authentic rhythm, and
    rich emotions and demonstrate excellent scalability even with up to 25 agents,
    which can apply to tasks such as drama creation and audio novels generation.
    Code and models will be open-sourced at https://github.com/0nutation/SpeechAgents.
  • 中文摘要: 人类交流是一个复杂多样的过程,不仅涉及语言、常识和文化背景等多种因素,还需要语音等多模态信息的参与。基于大型语言模型(LLM)的多智能体系统在模拟人类社会方面表现出了良好的性能。我们能否利用基于LLM的多智能体系统来模拟人类交流?然而,目前基于LLM的多智能体系统主要依赖文本作为主要媒介。在本文中,我们提出了SpeechAgents,这是一个基于多模态LLM的多智能体系统,旨在模拟人类交流。SpeechAgents利用多模态LLM作为每个智能体的控制中心,并利用多模态信号作为智能体之间交换消息的媒介。此外,我们还提出了多智能体调优(Multi-Agent Tuning),在不影响通用能力的前提下增强LLM的多智能体能力。为了加强和评估人类交流模拟的有效性,我们建立了人类交流模拟基准。实验结果表明,SpeechAgents可以模拟内容一致、节奏真实、情感丰富的人类交流对话,即使多达25个智能体也具有良好的可扩展性,可应用于戏剧创作和有声小说生成等任务。代码和模型将在 https://github.com/0nutation/SpeechAgents 开源。
  • [论文下载:]http://arxiv.org/abs/2401.03945v1
  • [项目页面:]https://github.com/0nutation/SpeechAgents
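  • 代码示例: 下面是一个多智能体以多模态信号互发消息的最小Python草图(纯属示意,非SpeechAgents官方实现;Message、Agent等名称为本文假设):

```python
# 示意:每个智能体由一个“多模态 LLM”驱动,消息用 (modality, payload) 表示,
# 可以是文本,也可以是语音波形/特征,对应摘要中“多模态信号作为消息媒介”。
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Message:
    sender: str
    modality: str    # "text" 或 "speech"
    payload: object  # 文本字符串或语音特征

class Agent:
    def __init__(self, name: str, mm_llm: Callable[[List[Message]], Message]):
        self.name = name
        self.mm_llm = mm_llm  # 多模态 LLM 作为该智能体的控制中心

    def respond(self, history: List[Message]) -> Message:
        reply = self.mm_llm(history)
        reply.sender = self.name
        return reply

def simulate(agents: List[Agent], opening: Message, turns: int) -> List[Message]:
    history = [opening]
    for t in range(turns):
        speaker = agents[t % len(agents)]  # 轮流发言的简化调度
        history.append(speaker.respond(history))
    return history
```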

  • 标题: TextBind: Multi-turn Interleaved Multimodal Instruction-following in the
    Wild
  • 作者: Huayang Li, Siheng Li, Deng Cai
  • 摘要: Large language models with instruction-following abilities have
    revolutionized the field of artificial intelligence. These models show
    exceptional generalizability to tackle various real-world tasks through their
    natural language interfaces. However, their performance heavily relies on
    high-quality exemplar data, which is often difficult to obtain. This challenge
    is further exacerbated when it comes to multimodal instruction following. We
    introduce TextBind, an almost annotation-free framework for empowering larger
    language models with the multi-turn interleaved multimodal
    instruction-following capabilities. Our approach requires only image-caption
    pairs and generates multi-turn multimodal instruction-response conversations
    from a language model. To accommodate interleaved image-text inputs and
    outputs, we devise MIM, a language model-centric architecture that seamlessly
    integrates image encoder and decoder models. We release our dataset, model, and
    demo to foster future research in the area of multimodal instruction following.
  • 中文摘要: 具有指令跟随能力的大型语言模型彻底改变了人工智能领域。这些模型显示出非凡的通用性,可以通过其自然语言接口处理各种现实世界的任务。然而,它们的性能在很大程度上依赖于高质量的示例数据,而这些数据往往很难获得。当涉及到多模态指令跟随时,这一挑战进一步加剧。我们介绍了TextBind,这是一个几乎无需标注的框架,用于为大型语言模型赋予多轮交错多模态指令跟随能力。我们的方法只需要图像-字幕对,并用语言模型生成多轮多模态指令-响应对话。为了适应交错的图像-文本输入和输出,我们设计了MIM,这是一种以语言模型为中心的架构,无缝集成了图像编码器和解码器模型。我们发布了数据集、模型和演示,以促进未来在多模态指令跟随领域的研究。
  • [论文下载:]http://arxiv.org/abs/2309.08637v4
  • [项目页面:]https://textbind.github.io/

  • 标题: AlpacaFarm: A Simulation Framework for Methods that Learn from Human
    Feedback
  • 作者: Yann Dubois, Xuechen Li, Rohan Taori
  • 摘要: Large language models (LLMs) such as ChatGPT have seen widespread adoption
    due to their strong instruction-following abilities. Developing these LLMs
    involves a complex yet poorly understood workflow requiring training with human
    feedback. Replicating and understanding this instruction-following requires
    tackling three major challenges: the high cost of data collection, the lack of
    trustworthy evaluation, and the absence of reference method implementations. We
    address these challenges with AlpacaFarm, a simulator that enables research and
    development for learning from feedback at a low cost. First, we design LLM
    prompts to simulate human feedback that are 50x cheaper than crowdworkers and
    display high agreement with humans. Second, we propose an automatic evaluation
    and validate it against human instructions obtained on real-world interactions.
    Third, we contribute reference implementations for several methods (PPO, DPO,
    best-of-n, expert iteration, and more) that learn from pairwise feedback.
    Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate
    eleven models on 10k pairs of real human feedback and show that rankings of
    models trained in AlpacaFarm match rankings of models trained on human data. As
    a demonstration of the research possible in AlpacaFarm, we find that methods
    that use a reward model can substantially improve over supervised fine-tuning
    and that our reference PPO implementation leads to a +10% improvement in
    win-rate against Davinci003. We release all components of AlpacaFarm at
    https://github.com/tatsu-lab/alpaca_farm.
  • [论文下载:]http://arxiv.org/abs/2305.14387v4
  • [GitHub:]https://github.com/tatsu-lab/alpaca_farm
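  • 代码示例: 下面用Python勾勒“用LLM提示模拟人类成对偏好标注”的思路(纯属示意,非AlpacaFarm官方实现;提示文案为本文假设,ask_llm为任意“提示→回复”的调用):

```python
from typing import Callable

ANNOTATOR_PROMPT = """你是一名标注员,请比较针对同一指令的两个回复哪个更有帮助。
指令:{instruction}
回复A:{output_a}
回复B:{output_b}
只输出 "A" 或 "B"。"""

def simulated_preference(instruction: str, output_a: str, output_b: str,
                         ask_llm: Callable[[str], str]) -> str:
    """用 LLM 模拟一次成对偏好标注,返回 "A" 或 "B"。"""
    prompt = ANNOTATOR_PROMPT.format(instruction=instruction,
                                     output_a=output_a, output_b=output_b)
    verdict = ask_llm(prompt).strip()
    return "A" if verdict.startswith("A") else "B"
```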

  • 标题: Empirical Analysis of Efficient Fine-Tuning Methods for Large
    Pre-Trained Language Models
  • 作者: Nigel Doering, Cyril Gorlla, Trevor Tuttle
  • 摘要: Fine-tuning large pre-trained language models for downstream tasks remains a
    critical challenge in natural language processing. This paper presents an
    empirical analysis comparing two efficient fine-tuning methods - BitFit and
    adapter modules - to standard full model fine-tuning. Experiments conducted on
    GLUE benchmark datasets (MRPC, COLA, STS-B) reveal several key insights. The
    BitFit approach, which trains only bias terms and task heads, matches full
    fine-tuning performance across varying amounts of training data and time
    constraints. It demonstrates remarkable stability even with only 30% of data,
    outperforming full fine-tuning at intermediate data levels. Adapter modules
    exhibit high variability, with inconsistent gains over default models. The
    findings indicate BitFit offers an attractive balance between performance and
    parameter efficiency. Our work provides valuable perspectives on model tuning,
    emphasizing robustness and highlighting BitFit as a promising alternative for
    resource-constrained or streaming task settings. The analysis offers actionable
    guidelines for efficient adaptation of large pre-trained models, while
    illustrating open challenges in stabilizing techniques like adapter modules.
  • 中文摘要: 为下游任务微调大型预训练语言模型仍然是自然语言处理中的一个关键挑战。本文对两种高效微调方法——BitFit和适配器模块——与标准全模型微调进行了实证比较分析。在GLUE基准数据集(MRPC、COLA、STS-B)上进行的实验揭示了几个关键见解。BitFit方法只训练偏置项和任务头,在不同训练数据量和时间限制下均能匹配完全微调的性能。即使只用30%的数据,它也表现出显著的稳定性,并在中等数据量水平上优于完全微调。适配器模块则表现出很高的可变性,相对默认模型的增益并不一致。研究结果表明,BitFit在性能和参数效率之间提供了一种有吸引力的平衡。我们的工作为模型调优提供了有价值的视角,强调了鲁棒性,并表明BitFit是资源受限或流式任务场景下一种有前途的替代方案。该分析为大型预训练模型的高效适配提供了可操作的指导,同时也说明了适配器模块等技术在稳定性方面仍存在的开放挑战。
  • [论文下载:]http://arxiv.org/abs/2401.04051v1
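  • 代码示例: BitFit的核心操作可以用几行PyTorch表示——冻结除偏置项与任务头之外的全部参数(示意代码,模型结构与命名为本文假设):

```python
import torch
from torch import nn

def apply_bitfit(model: nn.Module, head_prefix: str) -> None:
    """只训练名字以 "bias" 结尾的参数,以及任务头(名字以 head_prefix 开头)。"""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias") or name.startswith(head_prefix)

# 用一个小模型演示;实际应替换为预训练语言模型,"2" 代表这里的任务头层
model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
apply_bitfit(model, head_prefix="2")
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)  # 只优化极小比例的参数
```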

  • 标题: FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency
    Trade-off in Language Model Inference
  • 作者: Zirui Liu, Qingquan Song, Qiang Charles Xiao
  • 摘要: The large number of parameters in Pretrained Language Models enhance their
    performance, but also make them resource-intensive, making it challenging to
    deploy them on commodity hardware like a single GPU. Due to the memory and
    power limitations of these devices, model compression techniques are often used
    to decrease both the model’s size and its inference latency. This usually
    results in a trade-off between model accuracy and efficiency. Therefore,
    optimizing this balance is essential for effectively deploying LLMs on
    commodity hardware. A significant portion of the efficiency challenge is the
    Feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$ of the
    total parameters and inference latency. In this paper, we first observe that
    only a few neurons of FFN module have large output norm for any input tokens,
    a.k.a. heavy hitters, while the others are sparsely triggered by different
    tokens. Based on this observation, we explicitly split the FFN into two parts
    according to the heavy hitters. We improve the efficiency-accuracy trade-off of
    existing compression methods by allocating more resource to FFN parts with
    heavy hitters. In practice, our method can reduce model size by 43.1% and
    bring a $1.25\sim1.56\times$ wall-clock time speedup on different hardware with
    negligible accuracy drop.
  • 中文摘要: 预训练语言模型中的大量参数提高了其性能,但也使其变得资源密集,难以部署在单个GPU这样的商用硬件上。由于这些设备的内存和功率限制,通常使用模型压缩技术来减小模型的大小及其推理延迟,而这通常会导致模型准确性和效率之间的权衡。因此,优化这种平衡对于在商用硬件上有效部署LLM至关重要。效率挑战的很大一部分来自前馈网络(FFN)组件,它约占总参数量和推理延迟的2/3。在本文中,我们首先观察到,对于任意输入token,FFN模块中只有少数神经元具有较大的输出范数(即所谓heavy hitters),而其余神经元只被不同token稀疏地触发。基于这一观察,我们根据heavy hitters将FFN显式地拆分为两部分,并通过向包含heavy hitters的部分分配更多资源,改进了现有压缩方法的效率-精度权衡。在实践中,我们的方法可以将模型大小减少43.1%,并在不同硬件上带来1.25~1.56倍的挂钟时间加速,而精度下降可以忽略不计。
  • [论文下载:]http://arxiv.org/abs/2401.04044v1
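  • 代码示例: 下面的Python草图演示如何按输出范数贡献找出FFN中的heavy hitters并据此切分(纯属示意,非论文原始代码;keep_ratio等参数为本文假设):

```python
import torch

def split_ffn_by_heavy_hitters(W1, b1, W2, calib_x, keep_ratio=0.1):
    """W1:[d_ff,d], b1:[d_ff], W2:[d,d_ff];calib_x:[N,d] 为校准输入。
    返回 (heavy, rest):贡献最大的少数中间神经元下标与其余下标。"""
    h = torch.relu(calib_x @ W1.T + b1)         # [N, d_ff] 中间激活
    # 每个中间神经元对输出的贡献:平均激活幅值 × 对应输出列的范数
    contrib = h.abs().mean(0) * W2.norm(dim=0)  # [d_ff]
    k = max(1, int(keep_ratio * contrib.numel()))
    heavy = torch.topk(contrib, k).indices
    mask = torch.ones(contrib.numel(), dtype=torch.bool)
    mask[heavy] = False
    rest = mask.nonzero(as_tuple=True)[0]
    return heavy, rest  # 压缩时可对 heavy 部分分配更多资源(更高精度)

d, d_ff = 16, 64
W1, b1, W2 = torch.randn(d_ff, d), torch.zeros(d_ff), torch.randn(d, d_ff)
heavy, rest = split_ffn_by_heavy_hitters(W1, b1, W2, torch.randn(32, d))
```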

  • 标题: If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code
    Empowers Large Language Models to Serve as Intelligent Agents
  • 作者: Ke Yang, Jiateng Liu, John Wu
  • 摘要: The prominent large language models (LLMs) of today differ from past language
    models not only in size, but also in the fact that they are trained on a
    combination of natural language and formal language (code). As a medium between
    humans and computers, code translates high-level goals into executable steps,
    featuring standard syntax, logical consistency, abstraction, and modularity. In
    this survey, we present an overview of the various benefits of integrating code
    into LLMs’ training data. Specifically, beyond enhancing LLMs in code
    generation, we observe that these unique properties of code help (i) unlock the
    reasoning ability of LLMs, enabling their applications to a range of more
    complex natural language tasks; (ii) steer LLMs to produce structured and
    precise intermediate steps, which can then be connected to external execution
    ends through function calls; and (iii) take advantage of code compilation and
    execution environment, which also provides diverse feedback for model
    improvement. In addition, we trace how these profound capabilities of LLMs,
    brought by code, have led to their emergence as intelligent agents (IAs) in
    situations where the ability to understand instructions, decompose goals, plan
    and execute actions, and refine from feedback are crucial to their success on
    downstream tasks. Finally, we present several key challenges and future
    directions of empowering LLMs with code.
  • 中文摘要: 当今著名的大型语言模型(LLM)与过去的语言模型的不同之处不仅在于规模,还在于它们是在自然语言和形式语言(代码)的组合上训练的。作为人类和计算机之间的媒介,代码将高层目标转化为可执行步骤,具有标准语法、逻辑一致性、抽象性和模块化等特点。在这篇综述中,我们概述了将代码纳入LLM训练数据的各种好处。具体来说,除了增强LLM的代码生成能力之外,我们观察到代码的这些独特特性有助于:(i)释放LLM的推理能力,使其能够应用于一系列更复杂的自然语言任务;(ii)引导LLM产生结构化、精确的中间步骤,进而可以通过函数调用连接到外部执行端;(iii)利用代码编译和执行环境,为模型改进提供多样化的反馈。此外,我们还梳理了代码赋予LLM的这些深层能力,如何使其在那些理解指令、分解目标、规划与执行动作、根据反馈进行修正等能力至关重要的下游任务中,成长为智能代理(IA)。最后,我们提出了用代码增强LLM的几个关键挑战和未来方向。
  • [论文下载:]http://arxiv.org/abs/2401.00812v2

  • 标题: Advancing Spatial Reasoning in Large Language Models: An In-Depth
    Evaluation and Enhancement Using the StepGame Benchmark
  • 作者: Fangjun Li, David C. Hogg, Anthony G. Cohn
  • 摘要: Artificial intelligence (AI) has made remarkable progress across various
    domains, with large language models like ChatGPT gaining substantial attention
    for their human-like text-generation capabilities. Despite these achievements,
    spatial reasoning remains a significant challenge for these models. Benchmarks
    like StepGame evaluate AI spatial reasoning, where ChatGPT has shown
    unsatisfactory performance. However, the presence of template errors in the
    benchmark has an impact on the evaluation results. Thus there is potential for
    ChatGPT to perform better if these template errors are addressed, leading to
    more accurate assessments of its spatial reasoning capabilities. In this study,
    we refine the StepGame benchmark, providing a more accurate dataset for model
    evaluation. We analyze GPT’s spatial reasoning performance on the rectified
    benchmark, identifying proficiency in mapping natural language text to spatial
    relations but limitations in multi-hop reasoning. We provide a flawless
    solution to the benchmark by combining template-to-relation mapping with
    logic-based reasoning. This combination demonstrates proficiency in performing
    qualitative reasoning on StepGame without encountering any errors. We then
    address the limitations of GPT models in spatial reasoning. We deploy
    Chain-of-thought and Tree-of-thoughts prompting strategies, offering insights
    into GPT’s “cognitive process”, and achieving remarkable improvements in
    accuracy. Our investigation not only sheds light on model deficiencies but also
    proposes enhancements, contributing to the advancement of AI with more robust
    spatial reasoning capabilities.
  • 中文摘要: 人工智能(AI)在各个领域取得了显著进展,像ChatGPT这样的大型语言模型因其类似人类的文本生成能力而备受关注。尽管取得了这些成就,空间推理仍然是这些模型面临的重大挑战。StepGame等基准用于评估人工智能的空间推理能力,而ChatGPT在其上的表现并不令人满意。然而,基准中存在的模板错误会对评估结果产生影响。因此,如果解决了这些模板错误,ChatGPT有可能表现得更好,从而对其空间推理能力做出更准确的评估。在这项研究中,我们完善了StepGame基准,为模型评估提供了更准确的数据集。我们分析了GPT在修正后基准上的空间推理表现,发现其擅长将自然语言文本映射到空间关系,但在多跳推理上存在局限。我们通过将模板-关系映射与基于逻辑的推理相结合,为该基准提供了一个无误的解决方案,这一组合能够在StepGame上无错误地完成定性推理。随后,我们着手解决GPT模型在空间推理中的局限性:我们部署了思维链(Chain-of-Thought)和思维树(Tree-of-Thoughts)提示策略,深入了解GPT的“认知过程”,并在准确率上取得了显著提高。我们的研究不仅揭示了模型的不足,还提出了改进方案,有助于推动具备更强空间推理能力的人工智能的发展。
  • [论文下载:]http://arxiv.org/abs/2401.03991v1

  • 标题: TTMs: Fast Multi-level Tiny Time Mixers for Improved Zero-shot and
    Few-shot Forecasting of Multivariate Time Series
  • 作者: Vijay Ekambaram, Arindam Jati, Nam H. Nguyen
  • 摘要: Large Pretrained models for Zero/Few-shot learning excel in language and
    vision domains but encounter challenges in multivariate time series (TS) due to
    the diverse nature and scarcity of publicly available pretraining data.
    Consequently, there has been a recent surge in utilizing pretrained large
    language models (LLMs) with various adaptations for time series forecasting.
    These approaches employ cross-domain transfer learning, yielding highly
    impressive results. However, these models are typically very large ($\sim$
    billion parameters), exhibit slow execution, and do not consider cross-channel
    correlations. To address this, we present Multi-level Tiny Time Mixers (TTM), a
    significantly smaller model based on the lightweight TSMixer architecture. TTM
    marks the first success in developing tiny pretrained models ($\le$1 million
    parameters), exclusively trained on public TS data with effective transfer
    learning capabilities. To tackle the complexity of pretraining on multiple
    datasets with varied temporal resolutions, we introduce several novel
    enhancements such as adaptive patching, dataset augmentation via downsampling,
    and resolution prefix tuning. Moreover, we employ a multi-level modeling
    strategy to effectively model channel correlations and incorporate exogenous
    signals during finetuning, a crucial capability lacking in existing benchmarks.
    TTM excels in few/zero-shot forecasting, demonstrating significant accuracy
    gains (12-38%) over existing benchmarks. Further, it achieves a remarkable
    14-106X reduction in model parameters, enabling 54-65X faster
    training/inference as compared to the LLM-TS benchmarks. In fact, TTM’s
    zero-shot results often surpass the few-shot results in many benchmarks,
    highlighting the efficacy of our approach. Code and Pretrained Models will be
    open-sourced.
  • 中文摘要: 用于零样本/少样本学习的大型预训练模型在语言和视觉领域表现出色,但由于公开可用的预训练数据多样且稀缺,在多变量时间序列(TS)上遇到了挑战。因此,最近出现了大量利用预训练大型语言模型(LLM)并加以各种改造来做时间序列预测的工作。这些方法采用跨领域迁移学习,取得了非常令人印象深刻的结果。然而,这些模型通常非常庞大(约数十亿参数),执行缓慢,并且不考虑跨通道相关性。为了解决这个问题,我们提出了多级微小时间混合器(TTM),这是一个基于轻量级TSMixer架构、规模小得多的模型。TTM标志着首次成功开发出仅在公共TS数据上训练、参数量不超过100万、且具备有效迁移学习能力的微小预训练模型。为了应对在具有不同时间分辨率的多个数据集上预训练的复杂性,我们引入了若干新的增强手段,如自适应分块(adaptive patching)、通过下采样进行数据集增强,以及分辨率前缀调优。此外,我们采用多级建模策略,在微调过程中有效建模通道相关性并引入外生信号,这是现有基准所缺乏的关键能力。TTM在少样本/零样本预测上表现出色,相对现有基准取得了显著的精度提升(12-38%)。此外,与LLM-TS基准相比,它将模型参数量减少了14-106倍,训练/推理速度提高了54-65倍。事实上,在许多基准测试中,TTM的零样本结果往往超过其他方法的少样本结果,突显了我们方法的有效性。代码和预训练模型将开源。
  • [论文下载:]http://arxiv.org/abs/2401.03955v1
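  • 代码示例: “通过下采样做数据集增强 + 分辨率前缀”的思路可以用如下Python草图说明(纯属示意,非TTM官方实现;函数名与前缀格式为本文假设):

```python
import numpy as np

def downsample(series: np.ndarray, factor: int) -> np.ndarray:
    """对一维时间序列按 factor 做平均下采样。"""
    n = len(series) // factor * factor
    return series[:n].reshape(-1, factor).mean(axis=1)

def augment_with_resolutions(series: np.ndarray, factors=(1, 2, 4)):
    """为同一条序列生成多个时间分辨率版本,并附上分辨率前缀标记。"""
    return [(f"res={f}", downsample(series, f)) for f in factors]

for prefix, s in augment_with_resolutions(np.sin(np.linspace(0, 20, 960))):
    print(prefix, len(s))  # res=1 960 / res=2 480 / res=4 240
```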

  • 标题: TextMachina: Seamless Generation of Machine-Generated Text Datasets
  • 作者: Areg Mikael Sarvazyan, José Ángel González, Marc Franco-Salvador
  • 摘要: Recent advancements in Large Language Models (LLMs) have led to high-quality
    Machine-Generated Text (MGT), giving rise to countless new use cases and
    applications. However, easy access to LLMs is posing new challenges due to
    misuse. To address malicious usage, researchers have released datasets to
    effectively train models on MGT-related tasks. Similar strategies are used to
    compile these datasets, but no tool currently unifies them. In this scenario,
    we introduce TextMachina, a modular and extensible Python framework, designed
    to aid in the creation of high-quality, unbiased datasets to build robust
    models for MGT-related tasks such as detection, attribution, or boundary
    detection. It provides a user-friendly pipeline that abstracts away the
    inherent intricacies of building MGT datasets, such as LLM integrations, prompt
    templating, and bias mitigation. The quality of the datasets generated by
    TextMachina has been assessed in previous works, including shared tasks where
    more than one hundred teams trained robust MGT detectors.
  • 中文摘要: 大型语言模型(LLM)的最新进展带来了高质量的机器生成文本(MGT),催生了无数新的用例和应用。然而,LLM的易得性也因滥用而带来了新的挑战。为了应对恶意使用,研究人员发布了一些数据集,以便有效训练MGT相关任务的模型。编制这些数据集所用的策略大同小异,但目前还没有工具将它们统一起来。在这一背景下,我们介绍了TextMachina,这是一个模块化、可扩展的Python框架,旨在帮助创建高质量、无偏的数据集,为检测、归因或边界检测等MGT相关任务构建稳健的模型。它提供了一条用户友好的流水线,抽象掉了构建MGT数据集的固有复杂性,如LLM集成、提示模板(prompt templating)和偏差缓解。TextMachina生成的数据集的质量已在以往工作中得到评估,其中包括有100多个团队参与训练稳健MGT检测器的共享任务。
  • [论文下载:]http://arxiv.org/abs/2401.03946v1

  • 标题: Breaking the Silence: the Threats of Using LLMs in Software Engineering
  • 作者: June Sallou, Thomas Durieux, Annibale Panichella
  • 摘要: Large Language Models (LLMs) have gained considerable traction within the
    Software Engineering (SE) community, impacting various SE tasks from code
    completion to test generation, from program repair to code summarization.
    Despite their promise, researchers must still be careful as numerous intricate
    factors can influence the outcomes of experiments involving LLMs. This paper
    initiates an open discussion on potential threats to the validity of LLM-based
    research including issues such as closed-source models, possible data leakage
    between LLM training data and research evaluation, and the reproducibility of
    LLM-based findings. In response, this paper proposes a set of guidelines
    tailored for SE researchers and Language Model (LM) providers to mitigate these
    concerns. The implications of the guidelines are illustrated using existing
    good practices followed by LLM providers and a practical example for SE
    researchers in the context of test case generation.
  • 中文摘要: 大型语言模型(LLM)在软件工程(SE)社区中受到了广泛关注,影响着从代码补全到测试生成、从程序修复到代码摘要的各类SE任务。尽管它们前景广阔,研究人员仍须保持谨慎,因为许多复杂的因素会影响涉及LLM的实验结果。本文就基于LLM的研究在有效性上面临的潜在威胁展开公开讨论,包括闭源模型、LLM训练数据与研究评估之间可能存在的数据泄露,以及基于LLM的研究结果的可复现性等问题。对此,本文提出了一套专门面向SE研究人员和语言模型(LM)提供者的指南,以缓解这些担忧。我们结合LLM提供者已遵循的良好实践,以及SE研究人员在测试用例生成场景下的一个实际例子,阐明了该指南的意义。
  • [论文下载:]http://arxiv.org/abs/2312.08055v2

  • 标题: Exploring Format Consistency for Instruction Tuning
  • 作者: Shihao Liang, Runchu Tian, Kunlun Zhu
  • 摘要: Instruction tuning has emerged as a promising approach to enhancing large
    language models in following human instructions. It is shown that increasing
    the diversity and number of instructions in the training data can consistently
    enhance generalization performance, which facilitates a recent endeavor to
    collect various instructions and integrate existing instruction tuning datasets
    into larger collections. However, different users have their unique ways of
    expressing instructions, and there often exist variations across different
    datasets in the instruction styles and formats, i.e., format inconsistency. In
    this work, we propose a framework named Unified Instruction Tuning (UIT), which
    calls OpenAI APIs for automatic format transfer among different instruction
    tuning datasets such as PromptSource, FLAN and CrossFit. With the framework, we
    (1) demonstrate the necessity of maintaining format consistency in instruction
    tuning; (2) improve the generalization performance on unseen instructions on
    T5-LM-xl; (3) provide a novel perplexity-based denoising method to reduce the
    noise of automatic format transfer to make the UIT framework more practical and
    a smaller offline model based on GPT-J that achieves comparable format transfer
    capability to OpenAI APIs to reduce costs in practice. Further analysis
    regarding variations of targeted formats and other effects is intended.
  • 中文摘要: 指令调优已成为一种很有前途的方法,可以增强大型语言模型遵循人类指令的能力。研究表明,增加训练数据中指令的多样性和数量可以持续提高泛化性能,这推动了近期收集各种指令并将现有指令调优数据集整合为更大集合的努力。然而,不同的用户有其独特的指令表达方式,不同数据集在指令风格和格式上往往存在差异,即格式不一致。在这项工作中,我们提出了一个名为统一指令调优(UIT)的框架,它调用OpenAI API在PromptSource、FLAN和CrossFit等不同指令调优数据集之间进行自动格式转换。借助该框架,我们(1)证明了在指令调优中保持格式一致性的必要性;(2)提高了T5-LM-xl在未见指令上的泛化性能;(3)提供了一种新的基于困惑度(perplexity)的去噪方法来降低自动格式转换的噪声,使UIT框架更加实用,并给出了一个基于GPT-J的更小的离线模型,其格式转换能力与OpenAI API相当,可在实践中降低成本。我们还计划进一步分析目标格式的变化及其他影响。
  • [论文下载:]http://arxiv.org/abs/2307.15504v2
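  • 代码示例: UIT的格式转换一步可以抽象成“目标格式示例 + 待转指令 → LLM改写”的提示调用,如下Python草图(纯属示意,非官方实现;提示文案为本文假设,call_llm为任意“提示→文本”的调用):

```python
from typing import Callable

TRANSFER_PROMPT = """请把下面的指令改写成目标数据集的格式。
目标格式示例:
{target_example}
待转换指令:
{source_instruction}
转换结果:"""

def transfer_format(source_instruction: str, target_example: str,
                    call_llm: Callable[[str], str]) -> str:
    """在不同指令调优数据集之间做自动格式转换的一次调用。"""
    prompt = TRANSFER_PROMPT.format(target_example=target_example,
                                    source_instruction=source_instruction)
    return call_llm(prompt).strip()
```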

  • 标题: Synthetic Data Generation in Low-Resource Settings via Fine-Tuning of
    Large Language Models
  • 作者: Jean Kaddour, Qi Liu
  • 摘要: The in-context learning ability of large language models (LLMs) enables them
    to generalize to novel downstream tasks with relatively few labeled examples.
    However, they require enormous computational resources to be deployed.
    Alternatively, smaller models can solve specific tasks if fine-tuned with
    enough labeled examples. These examples, however, are expensive to obtain. In
    pursuit of the best of both worlds, we study synthetic data generation of
    fine-tuning training data via fine-tuned teacher LLMs to improve the downstream
    performance of much smaller models. In four text classification and two text
    generation tasks, we find that both data generation and annotation dramatically
    improve the respective downstream model’s performance, occasionally
    necessitating only a minor fraction of the original training dataset.
  • 中文摘要: 大型语言模型(LLM)的上下文学习能力使其只需相对较少的标注示例就能泛化到新的下游任务。然而,部署它们需要庞大的计算资源。另一方面,较小的模型如果用足够多的标注示例微调,也可以解决特定任务,但这些示例的获取成本很高。为了兼得两者之长,我们研究了通过微调后的教师LLM合成微调训练数据,以提升小得多的模型的下游性能。在四个文本分类和两个文本生成任务中,我们发现数据生成和标注都能显著提高相应下游模型的性能,有时只需要原始训练数据集的一小部分。
  • [论文下载:]http://arxiv.org/abs/2310.01119v2
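  • 代码示例: “教师LLM合成带标签训练数据”的流程大致如下(Python示意,非论文原始代码;提示文案与函数名为本文假设):

```python
from typing import Callable, List, Tuple

def synthesize_dataset(teacher: Callable[[str], str], task_desc: str,
                       labels: List[str], n_per_label: int) -> List[Tuple[str, str]]:
    """让(微调后的)教师 LLM 按标签生成样本,返回 (文本, 标签) 对。"""
    data = []
    for label in labels:
        for _ in range(n_per_label):
            prompt = (f"任务:{task_desc}\n"
                      f"请生成一条标签为“{label}”的训练样本,只输出文本本身。")
            data.append((teacher(prompt), label))
    return data  # 可直接用于微调小得多的下游模型
```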

  • 标题: FlightLLM: Efficient Large Language Model Inference with a Complete
    Mapping Flow on FPGA
  • 作者: Shulin Zeng, Jun Liu, Guohao Dai
  • 摘要: Transformer-based Large Language Models (LLMs) have made a significant impact
    on various domains. However, LLMs’ efficiency suffers from both heavy
    computation and memory overheads. Compression techniques like sparsification
    and quantization are commonly used to mitigate the gap between LLM’s
    computation/memory overheads and hardware capacity. However, existing GPU and
    transformer-based accelerators cannot efficiently process compressed LLMs, due
    to the following unresolved challenges: low computational efficiency,
    underutilized memory bandwidth, and large compilation overheads.
    This paper proposes FlightLLM, enabling efficient LLMs inference with a
    complete mapping flow on FPGAs. In FlightLLM, we highlight an innovative
    solution that the computation and memory overhead of LLMs can be solved by
    utilizing FPGA-specific resources (e.g., DSP48 and heterogeneous memory
    hierarchy). We propose a configurable sparse DSP chain to support different
    sparsity patterns with high computation efficiency. Second, we propose an
    always-on-chip decode scheme to boost memory bandwidth with mixed-precision
    support. Finally, to make FlightLLM available for real-world LLMs, we propose a
    length adaptive compilation method to reduce the compilation overhead.
    Implemented on the Xilinx Alveo U280 FPGA, FlightLLM achieves $6.0\times$
    higher energy efficiency and $1.8\times$ better cost efficiency against
    commercial GPUs (e.g., NVIDIA V100S) on modern LLMs (e.g., LLaMA2-7B) using
    vLLM and SmoothQuant under the batch size of one. FlightLLM beats the NVIDIA
    A100 GPU with $1.2\times$ higher throughput using the latest Versal VHK158 FPGA.
  • 中文摘要: 基于Transformer的大型语言模型(LLM)对各个领域产生了重大影响。然而,LLM的效率同时受到沉重的计算和内存开销的制约。稀疏化和量化等压缩技术通常用于缓解LLM的计算/内存开销与硬件容量之间的差距。然而,由于以下尚未解决的挑战,现有的GPU和基于Transformer的加速器无法高效处理压缩后的LLM:计算效率低、内存带宽利用不足和编译开销大。本文提出了FlightLLM,在FPGA上以完整的映射流程实现高效的LLM推理。在FlightLLM中,我们强调了一种创新的解决方案,即LLM的计算和内存开销可以通过利用FPGA特有的资源(例如DSP48和异构内存层次结构)来解决。我们提出了一种可配置的稀疏DSP链,以高计算效率支持不同的稀疏模式。其次,我们提出了一种常开片上(always-on-chip)解码方案,在支持混合精度的同时提高内存带宽。最后,为了使FlightLLM可用于真实世界的LLM,我们提出了一种长度自适应编译方法来减少编译开销。在Xilinx Alveo U280 FPGA上实现后,在批大小为1的情况下,针对使用vLLM和SmoothQuant的现代LLM(如LLaMA2-7B),FlightLLM相比商用GPU(如NVIDIA V100S)实现了6.0倍的能效和1.8倍的成本效率。使用最新的Versal VHK158 FPGA,FlightLLM的吞吐量比NVIDIA A100 GPU高1.2倍。
  • [论文下载:]http://arxiv.org/abs/2401.03868v1

  • 标题: Parameter-Efficient Sparsity Crafting from Dense to Mixture-of-Experts
    for Instruction Tuning on General Tasks
  • 作者: Haoyuan Wu, Haisheng Zheng, Bei Yu
  • 摘要: Large Language Models (LLMs) have demonstrated considerable proficiency in
    general natural language processing (NLP) tasks. Instruction tuning, a
    successful paradigm, enhances the ability of LLMs to follow natural language
    instructions and exhibit robust generalization across a wide range of tasks.
    However, these models often encounter performance limitations across multiple
    tasks due to constrained model capacity. Expanding this capacity during the
    instruction tuning phase poses significant challenges. To address this issue,
    we introduce a novel approach, Parameter-Efficient Sparsity Crafting (PESC),
    which transitions dense models to sparse models using a Mixture of Experts
    (MoE) architecture. PESC integrates adapters into the MoE layers of sparse
    models, differentiating experts without altering the individual weights within
    these layers. This method significantly reduces computational costs and GPU
    memory requirements, facilitating model capacity expansion through a minimal
    increase in parameters via the inserted adapters. Our empirical evaluation
    demonstrates the effectiveness of the PESC method. Using PESC during
    instruction tuning, our sparse models, dubbed Camelidae outperform all other
    opensource sparse models and exhibit superior general capabilities compared to
    GPT3.5.
  • 中文摘要: 大型语言模型(LLM)在通用自然语言处理(NLP)任务中表现出相当的熟练程度。指令调优作为一种成功的范式,增强了LLM遵循自然语言指令的能力,并使其在广泛的任务中表现出强大的泛化能力。然而,由于模型容量有限,这些模型经常在多任务上遇到性能瓶颈,而在指令调优阶段扩展模型容量又面临重大挑战。为了解决这个问题,我们引入了一种新方法——参数高效稀疏化(PESC),它使用混合专家(MoE)架构将稠密模型转换为稀疏模型。PESC将适配器集成到稀疏模型的MoE层中,在不改变这些层中各自权重的情况下区分不同专家。这种方法显著降低了计算成本和GPU内存需求,通过插入适配器以极小的参数增量实现模型容量的扩展。我们的实证评估证明了PESC方法的有效性。在指令调优过程中使用PESC,我们的稀疏模型(称为Camelidae)优于所有其他开源稀疏模型,并表现出优于GPT-3.5的通用能力。
  • [论文下载:]http://arxiv.org/abs/2401.02731v2
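  • 代码示例: PESC“共享FFN权重、仅靠适配器区分专家”的结构可以用如下PyTorch草图表示(纯属示意,非官方代码;论文实际使用top-k路由等细节,这里用软路由简化):

```python
import torch
from torch import nn

class Adapter(nn.Module):
    """残差式轻量适配器:降维 + 非线性 + 升维。"""
    def __init__(self, d, r=16):
        super().__init__()
        self.down, self.up = nn.Linear(d, r), nn.Linear(r, d)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class PESCMoE(nn.Module):
    """各专家共享同一份稠密 FFN,只通过各自的适配器加以区分。"""
    def __init__(self, shared_ffn: nn.Module, d, n_experts=4):
        super().__init__()
        self.shared_ffn = shared_ffn
        self.adapters = nn.ModuleList(Adapter(d) for _ in range(n_experts))
        self.router = nn.Linear(d, n_experts)

    def forward(self, x):                              # x: [batch, d]
        gates = torch.softmax(self.router(x), dim=-1)  # [batch, n_experts]
        h = self.shared_ffn(x)
        expert_out = torch.stack([a(h) for a in self.adapters], dim=1)
        return (gates.unsqueeze(-1) * expert_out).sum(dim=1)

layer = PESCMoE(nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)), d=64)
out = layer(torch.randn(8, 64))
```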

  • 标题: Boldly Going Where No Benchmark Has Gone Before: Exposing Bias and
    Shortcomings in Code Generation Evaluation
  • 作者: Ankit Yadav, Mayank Singh
  • 摘要: Motivated by the increasing popularity of code generation from human
    descriptions using large language models (LLMs), several benchmarks have been
    proposed to assess the capabilities of existing and emerging models. This study
    presents a large-scale human evaluation of HumanEval and MBPP, two widely used
    benchmarks for Python code generation, focusing on their diversity and
    difficulty. Our findings reveal a significant bias towards a limited number of
    programming concepts, with negligible or no representation of most concepts.
    Additionally, we identify a concerningly high proportion of easy programming
    questions, potentially leading to an overestimation of model performance on
    code generation tasks.
  • 中文摘要: 由于使用大型语言模型(LLM)从人类描述生成代码越来越流行,人们已经提出了若干基准来评估现有和新兴模型的能力。本研究对HumanEval和MBPP这两个广泛使用的Python代码生成基准进行了大规模人工评估,重点关注它们的多样性和难度。我们的发现揭示了基准对少数编程概念的显著偏向,大多数概念的覆盖微乎其微甚至完全缺失。此外,我们还发现简单编程题所占比例高得令人担忧,这可能导致高估模型在代码生成任务上的性能。
  • [论文下载:]http://arxiv.org/abs/2401.03855v1

  • 标题: Aligned with LLM: a new multi-modal training paradigm for encoding fMRI
    activity in visual cortex
  • 作者: Shuxiao Ma, Linyuan Wang, Senbao Hou
  • 摘要: Recently, there has been a surge in the popularity of pre-trained large
    language models (LLMs) (such as GPT-4), sweeping across the entire Natural
    Language Processing (NLP) and Computer Vision (CV) communities. These LLMs have
    demonstrated advanced multi-modal understanding capabilities and showcased
    strong performance across various benchmarks. The LLM has started to embody
    traits of artificial general intelligence, which holds vital guidance for
    enhancing brain-like characteristics within visual encoding models. Hence, this
    paper proposes a new multi-modal training paradigm, aligning with LLM, for
    encoding fMRI activity in visual cortex. Based on this paradigm, we trained an
    encoding model in fMRI data named the LLM-Visual Encoding Model (LLM-VEM).
    Specifically, we utilize LLM (miniGPT4) to generate descriptive text for all
    stimulus images, forming a high-quality textual description set. Moreover, we
    use the pre-trained text encoder (CLIP) to process these detailed descriptions,
    obtaining the text embedding features. Next, we use the contrast loss function
    to minimize the distance between the image embedding features and the text
    embedding features to complete the alignment operation of the stimulus image
    and text information. With the assistance of the pre-trained LLM, this
    alignment process facilitates better learning of the visual encoding model,
    resulting in higher precision. The final experimental results indicate that our
    training paradigm has significantly aided in enhancing the performance of the
    visual encoding model.
  • 中文摘要: 最近,预训练的大型语言模型(如GPT-4)的流行度激增,席卷了整个自然语言处理(NLP)和计算机视觉(CV)社区。这些LLM展示了先进的多模态理解能力,并在各种基准测试中表现出强大的性能。LLM已经开始体现通用人工智能的特征,这为增强视觉编码模型中的类脑特征提供了重要指导。因此,本文提出了一种与LLM对齐的新的多模态训练范式,用于编码视觉皮层的fMRI活动。基于这一范式,我们在fMRI数据上训练了一个编码模型,称为LLM视觉编码模型(LLM-VEM)。具体来说,我们利用LLM(miniGPT4)为所有刺激图像生成描述性文本,形成高质量的文本描述集。随后,我们使用预训练的文本编码器(CLIP)处理这些详细描述,获得文本嵌入特征。接下来,我们使用对比损失函数最小化图像嵌入特征和文本嵌入特征之间的距离,完成刺激图像和文本信息的对齐操作。在预训练LLM的帮助下,这一对齐过程有助于视觉编码模型更好地学习,从而获得更高的精度。最终的实验结果表明,我们的训练范式显著提升了视觉编码模型的性能。
  • [论文下载:]http://arxiv.org/abs/2401.03851v1
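  • 代码示例: 摘要中的对齐操作本质上是CLIP风格的对比损失,如下PyTorch草图(纯属示意,非LLM-VEM官方代码;temperature等超参为本文假设):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb/txt_emb: [batch, d],同一行互为正样本对;
    最小化该损失即拉近图像嵌入与对应文本嵌入的距离。"""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature   # [batch, batch] 相似度矩阵
    target = torch.arange(len(img))      # 对角线为匹配对
    return (F.cross_entropy(logits, target) +
            F.cross_entropy(logits.T, target)) / 2

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```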

  • 标题: TeleChat Technical Report
  • 作者: Zihan Wang, Xinzhang Liu, Shixuan Liu
  • 摘要: In this technical report, we present TeleChat, a collection of large language
    models (LLMs) with parameters of 3 billion, 7 billion and 12 billion. It
    includes pretrained language models as well as fine-tuned chat models that are
    aligned with human preferences. TeleChat is initially pretrained on an
    extensive corpus containing a diverse collection of texts from both English and
    Chinese languages, including trillions of tokens. Subsequently, the model
    undergoes fine-tuning to align with human preferences, following a detailed
    methodology that we describe. We evaluate the performance of TeleChat on
    various tasks, including language understanding, mathematics, reasoning, code
    generation, and knowledge-based question answering. Our findings indicate that
    TeleChat achieves comparable performance to other open-source models of similar
    size across a wide range of public benchmarks. To support future research and
    applications utilizing LLMs, we release the fine-tuned model checkpoints of
    TeleChat’s 7B and 12B variant, along with code and a portion of our pretraining
    data, to the public community.
  • 中文摘要: 在本技术报告中,我们介绍了TeleChat,这是一组参数规模分别为30亿、70亿和120亿的大型语言模型(LLM)。它包括预训练语言模型以及与人类偏好对齐的微调聊天模型。TeleChat首先在一个包含英文和中文多样化文本、规模达数万亿token的庞大语料库上进行预训练。随后,按照我们描述的详细方法,对模型进行微调,使其符合人类偏好。我们评估了TeleChat在语言理解、数学、推理、代码生成和基于知识的问答等各类任务上的性能。我们的研究结果表明,TeleChat在广泛的公共基准上取得了与其他同等规模开源模型相当的性能。为了支持未来利用LLM的研究和应用,我们向社区发布了TeleChat 7B和12B变体的微调模型检查点,以及代码和部分预训练数据。
  • [论文下载:]http://arxiv.org/abs/2401.03804v1

  • 标题: LLM Powered Sim-to-real Transfer for Traffic Signal Control
  • 作者: Longchao Da, Minchiuan Gao, Hao Mei
  • 摘要: Numerous solutions are proposed for the Traffic Signal Control (TSC) tasks
    aiming to provide efficient transportation and mitigate congestion waste. In
    recent, promising results have been attained by Reinforcement Learning (RL)
    methods through trial and error in simulators, bringing confidence in solving
    cities’ congestion headaches. However, there still exist performance gaps when
    simulator-trained policies are deployed to the real world. This issue is mainly
    introduced by the system dynamic difference between the training simulator and
    the real-world environments. The Large Language Models (LLMs) are trained on
    mass knowledge and proved to be equipped with astonishing inference abilities.
    In this work, we leverage LLMs to understand and profile the system dynamics by
    a prompt-based grounded action transformation. Accepting the cloze prompt
    template, and then filling in the answer based on accessible context, the
    pre-trained LLM’s inference ability is exploited and applied to understand how
    weather conditions, traffic states, and road types influence traffic dynamics,
    being aware of this, the policies’ action is taken and grounded based on
    realistic dynamics, thus help the agent learn a more realistic policy. We
    conduct experiments using DQN to show the effectiveness of the proposed
    PromptGAT’s ability in mitigating the performance gap from simulation to
    reality (sim-to-real).
  • 中文摘要: 人们为交通信号控制(TSC)任务提出了许多解决方案,旨在提供高效的交通并减少拥堵带来的浪费。近年来,强化学习(RL)方法通过在模拟器中反复试错取得了有希望的结果,为解决城市拥堵问题带来了信心。然而,当把模拟器中训练的策略部署到现实世界时,仍然存在性能差距。这一问题主要源于训练模拟器与真实环境之间的系统动力学差异。大型语言模型(LLM)在海量知识上训练而成,已被证明具有惊人的推理能力。在这项工作中,我们利用LLM通过基于提示的接地动作变换(grounded action transformation)来理解和刻画系统动力学:接受完形填空式的提示模板,再根据可获得的上下文填写答案,从而利用预训练LLM的推理能力来理解天气条件、交通状态和道路类型如何影响交通动力学;在此基础上,策略的动作依据现实动力学被采取和落地,从而帮助智能体学习更贴近现实的策略。我们使用DQN进行实验,证明了所提出的PromptGAT在缓解从模拟到现实(sim-to-real)的性能差距方面的有效性。
  • [论文下载:]http://arxiv.org/abs/2308.14284v4

  • 标题: Language Models Understand Numbers, at Least Partially
  • 作者: Fangwei Zhu, Damai Dai, Zhifang Sui
  • 摘要: Large language models (LLMs) have exhibited impressive competency in various
    text-related tasks. However, their opaque internal mechanisms become a
    hindrance to leveraging them in mathematical problems. In this paper, we study
    a fundamental question: whether language models understand numbers, which play
    a basic element in mathematical problems. We assume that to solve mathematical
    problems, language models should be capable of understanding numbers and
    compressing these numbers in their hidden states. We construct a synthetic
    dataset comprising addition problems and utilize linear probes to read out
    input numbers from the hidden states of models. Experimental results
    demonstrate evidence supporting the existence of compressed numbers in the
    LLaMA-2 model family from early layers. However, the compression process seems
    to be not lossless, presenting difficulty in precisely reconstructing the
    original numbers. Further experiments show that language models can utilize the
    encoded numbers to perform arithmetic computations, and the computational
    ability scales up with the model size. Our preliminary research suggests that
    language models exhibit a partial understanding of numbers, offering insights
    into future investigations about the models’ capability of solving mathematical
    problems.
  • 中文摘要: 大型语言模型(LLM)在各种与文本相关的任务中表现出了令人印象深刻的能力。然而,其不透明的内部机制成为在数学问题中加以利用的障碍。在本文中,我们研究一个基本问题:语言模型是否理解数字——数学问题中的基本元素。我们假设,要解决数学问题,语言模型应当能够理解数字,并把这些数字压缩到隐藏状态中。我们构建了一个由加法题组成的合成数据集,并利用线性探针从模型的隐藏状态中读出输入数字。实验结果表明,LLaMA-2系列模型从较浅的层开始就存在压缩后的数字。然而,压缩过程似乎并非无损,难以精确重建原始数字。进一步的实验表明,语言模型可以利用编码后的数字进行算术计算,且这种计算能力随模型规模而提升。我们的初步研究表明,语言模型对数字具有部分理解,为未来研究模型解决数学问题的能力提供了见解。
  • [论文下载:]http://arxiv.org/abs/2401.03735v1
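  • 代码示例: “用线性探针从隐藏状态读出数字”的做法可以用如下玩具示例说明(纯属示意,隐藏状态用随机特征代替真实模型激活):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
numbers = rng.integers(0, 1000, size=2000)    # 加法题中的输入数字
hidden = rng.normal(size=(2000, 256))         # 假设的某层隐藏状态
hidden[:, 0] = numbers / 1000.0               # 玩具设定:数字信息线性地藏在某一维

probe = LinearRegression().fit(hidden[:1500], numbers[:1500])
r2 = probe.score(hidden[1500:], numbers[1500:])
print(f"probe R^2 = {r2:.3f}")  # R^2 越高,说明数字被(近似)线性编码
```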

  • 标题: The Butterfly Effect of Altering Prompts: How Small Changes and
    Jailbreaks Affect Large Language Model Performance
  • 作者: Abel Salinas, Fred Morstatter
  • 摘要: Large Language Models (LLMs) are regularly being used to label data across
    many domains and for myriad tasks. By simply asking the LLM for an answer, or
    “prompting,” practitioners are able to use LLMs to quickly get a response for
    an arbitrary task. This prompting is done through a series of decisions by the
    practitioner, from simple wording of the prompt, to requesting the output in a
    certain data format, to jailbreaking in the case of prompts that address more
    sensitive topics. In this work, we ask: do variations in the way a prompt is
    constructed change the ultimate decision of the LLM? We answer this using a
    series of prompt variations across a variety of text classification tasks. We
    find that even the smallest of perturbations, such as adding a space at the end
    of a prompt, can cause the LLM to change its answer. Further, we find that
    requesting responses in XML and commonly used jailbreaks can have cataclysmic
    effects on the data labeled by LLMs.
  • 中文摘要: 大型语言模型(LLM)经常被用来为许多领域、各种各样的任务标注数据。只需向LLM询问答案,即“提示”,从业者就能用LLM快速获得任意任务的回复。这种提示由从业者的一系列决定构成:从提示的简单措辞,到要求以特定数据格式输出,再到面向更敏感主题时的越狱提示。在这项工作中,我们要问:提示构建方式的变化是否会改变LLM的最终判定?我们在多种文本分类任务上用一系列提示变体来回答这个问题。我们发现,即使是最小的扰动,比如在提示末尾添加一个空格,也会导致LLM改变其答案。此外,我们发现,要求以XML格式回复以及使用常见的越狱提示,可能会对LLM标注的数据产生灾难性影响。
  • [论文下载:]http://arxiv.org/abs/2401.03729v1
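  • 代码示例: 论文考察的“提示微扰”实验可以抽象成如下Python草图(纯属示意;classify为任意“提示→标签”的调用,扰动集合为本文假设):

```python
from typing import Callable, List

def perturbations(prompt: str) -> List[str]:
    """对同一提示构造若干微小变体。"""
    return [
        prompt,                        # 原始提示
        prompt + " ",                  # 末尾加一个空格
        prompt + "\n",                 # 末尾加换行
        "请以XML格式回答。" + prompt,   # 更换输出格式要求
    ]

def flip_rate(prompt: str, classify: Callable[[str], str]) -> float:
    """变体导致标签相对原始提示发生改变的比例。"""
    labels = [classify(p) for p in perturbations(prompt)]
    return sum(l != labels[0] for l in labels[1:]) / (len(labels) - 1)
```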

  • 标题: Natural Language Decomposition and Interpretation of Complex Utterances
  • 作者: Harsh Jhamtani, Hao Fang, Patrick Xia
  • 摘要: Designing natural language interfaces has historically required collecting
    supervised data to translate user requests into carefully designed intent
    representations. This requires enumerating and labeling a long tail of user
    requests, which is challenging. At the same time, large language models (LLMs)
    encode knowledge about goals and plans that can help conversational assistants
    interpret user requests requiring numerous steps to complete. We introduce an
    approach to handle complex-intent-bearing utterances from a user via a process
    of hierarchical natural language decomposition and interpretation. Our approach
    uses a pre-trained language model to decompose a complex utterance into a
    sequence of simpler natural language steps and interprets each step using the
    language-to-program model designed for the interface. To test our approach, we
    collect and release DeCU – a new NL-to-program benchmark to evaluate
    Decomposition of Complex Utterances. Experiments show that the proposed
    approach enables the interpretation of complex utterances with almost no
    complex training data, while outperforming standard few-shot prompting
    approaches.
  • [论文下载:]http://arxiv.org/abs/2305.08677v2

  • 标题: Assessing AI Detectors in Identifying AI-Generated Code: Implications
    for Education
  • 作者: Wei Hung Pan, Ming Jie Chok, Jonathan Leong Shan Wong
  • 摘要: Educators are increasingly concerned about the usage of Large Language Models
    (LLMs) such as ChatGPT in programming education, particularly regarding the
    potential exploitation of imperfections in Artificial Intelligence Generated
    Content (AIGC) Detectors for academic misconduct. In this paper, we present an
    empirical study where the LLM is examined for its attempts to bypass detection
    by AIGC Detectors. This is achieved by generating code in response to a given
    question using different variants. We collected a dataset comprising 5,069
    samples, with each sample consisting of a textual description of a coding
    problem and its corresponding human-written Python solution codes. These
    samples were obtained from various sources, including 80 from Quescol, 3,264
    from Kaggle, and 1,725 from LeetCode. From the dataset, we created 13 sets of
    code problem variant prompts, which were used to instruct ChatGPT to generate
    the outputs. Subsequently, we assessed the performance of five AIGC detectors.
    Our results demonstrate that existing AIGC Detectors perform poorly in
    distinguishing between human-written code and AI-generated code.
  • 中文摘要: 教育工作者越来越关注诸如ChatGPT之类的大型语言模型(LLM)在编程教育中的使用,特别是人工智能生成内容(AIGC)检测器的缺陷可能被用于学术不端行为。在本文中,我们提出了一项实证研究,考察LLM绕过AIGC检测器检测的能力,方法是用不同的变体针对给定问题生成代码。我们收集了一个包含5069个样本的数据集,每个样本由一道编程题的文字描述及对应的人工编写的Python解答代码组成。这些样本来自多个来源,包括80个来自Quescol、3264个来自Kaggle、1725个来自LeetCode。基于该数据集,我们创建了13组代码问题变体提示,用于指示ChatGPT生成输出。随后,我们评估了五个AIGC检测器的性能。我们的结果表明,现有的AIGC检测器在区分人工编写的代码和AI生成的代码方面表现不佳。
  • [论文下载:]http://arxiv.org/abs/2401.03676v1

  • 标题: An exploratory study on automatic identification of assumptions in the
    development of deep learning frameworks
  • 作者: Chen Yanga, Peng Liang, Zinan Ma
  • 摘要: Stakeholders constantly make assumptions in the development of deep learning
    (DL) frameworks. These assumptions are related to various types of software
    artifacts (e.g., requirements, design decisions, and technical debt) and can
    turn out to be invalid, leading to system failures. Existing approaches and
    tools for assumption management usually depend on manual identification of
    assumptions. However, assumptions are scattered in various sources (e.g., code
    comments, commits, pull requests, and issues) of DL framework development, and
    manually identifying assumptions has high costs (e.g., time and resources). To
    overcome the issues of manually identifying assumptions in DL framework
    development, we constructed a new and largest dataset (i.e., AssuEval) of
    assumptions collected from the TensorFlow and Keras repositories on GitHub;
    explored the performance of seven traditional machine learning models (e.g.,
    Support Vector Machine, Classification and Regression Trees), a popular DL
    model (i.e., ALBERT), and a large language model (i.e., ChatGPT) of identifying
    assumptions on the AssuEval dataset. The experiment results show that: ALBERT
    achieves the best performance (f1-score: 0.9584) of identifying assumptions on
    the AssuEval dataset, which is much better than the other models (the 2nd best
    f1-score is 0.6211, achieved by ChatGPT). Though ChatGPT is the most popular
    large language model, we do not recommend using it to identify assumptions in
    DL framework development because of its low performance on the task.
    Fine-tuning ChatGPT specifically for assumption identification could improve
    the performance. This study provides researchers with the largest dataset of
    assumptions for further research (e.g., assumption classification, evaluation,
    and reasoning) and helps practitioners better understand assumptions and how to
    manage them in their projects.
  • 中文摘要: 利益相关者在开发深度学习(DL)框架时不断做出假设。这些假设与各种类型的软件制品(例如需求、设计决策和技术债务)有关,并且可能被证明是无效的,从而导致系统故障。现有的假设管理方法和工具通常依赖人工识别假设。然而,假设分散在DL框架开发的各种来源(例如代码注释、提交、拉取请求和issue)中,人工识别假设的成本很高(例如时间和资源)。为了克服DL框架开发中人工识别假设的问题,我们构建了一个新的、规模最大的假设数据集(即AssuEval),数据收集自GitHub上的TensorFlow和Keras仓库;并在AssuEval数据集上考察了七种传统机器学习模型(例如支持向量机、分类与回归树)、一种流行的深度学习模型(即ALBERT)和一种大型语言模型(即ChatGPT)识别假设的性能。实验结果表明:ALBERT在AssuEval数据集上识别假设的性能最好(f1分数:0.9584),远优于其他模型(第二名是ChatGPT,f1分数为0.6211)。尽管ChatGPT是最流行的大型语言模型,但由于其在该任务上的性能较低,我们不建议用它来识别DL框架开发中的假设;专门针对假设识别对ChatGPT进行微调或许可以提高性能。这项研究为研究人员提供了规模最大的假设数据集,以供进一步研究(例如假设分类、评估和推理),并帮助从业者更好地理解假设以及如何在项目中管理它们。
  • [论文下载:]http://arxiv.org/abs/2401.03653v1

  • 标题: DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in
    Autonomous Driving
  • 作者: Wencheng Han, Dongqian Guo, Cheng-Zhong Xu
  • 摘要: In the field of autonomous driving, two important features of autonomous
    driving car systems are the explainability of decision logic and the accuracy
    of environmental perception. This paper introduces DME-Driver, a new autonomous
    driving system that enhances the performance and reliability of autonomous
    driving system. DME-Driver utilizes a powerful vision language model as the
    decision-maker and a planning-oriented perception model as the control signal
    generator. To ensure explainable and reliable driving decisions, the logical
    decision-maker is constructed based on a large vision language model. This
    model follows the logic employed by experienced human drivers and makes
    decisions in a similar manner. On the other hand, the generation of accurate
    control signals relies on precise and detailed environmental perception, which
    is where 3D scene perception models excel. Therefore, a planning oriented
    perception model is employed as the signal generator. It translates the logical
    decisions made by the decision-maker into accurate control signals for the
    self-driving cars. To effectively train the proposed model, a new dataset for
    autonomous driving was created. This dataset encompasses a diverse range of
    human driver behaviors and their underlying motivations. By leveraging this
    dataset, our model achieves high-precision planning accuracy through a logical
    thinking process.
  • 中文摘要: 在自动驾驶领域,自动驾驶汽车系统的两个重要特征是决策逻辑的可解释性和环境感知的准确性。本文介绍了一种新的自动驾驶系统DME-Driver,它提高了自动驾驶系统的性能和可靠性。DME-Driver使用强大的视觉语言模型作为决策者,使用面向规划的感知模型作为控制信号生成器。为了确保可解释且可靠的驾驶决策,逻辑决策者基于大型视觉语言模型构建。该模型遵循经验丰富的人类驾驶员所采用的逻辑,并以类似的方式做出决策。另一方面,精确控制信号的生成依赖于精确而细致的环境感知,而这正是3D场景感知模型的优势所在。因此,我们采用面向规划的感知模型作为信号生成器,它将决策者做出的逻辑决策转化为自动驾驶汽车的精确控制信号。为了有效训练所提出的模型,我们创建了一个新的自动驾驶数据集,涵盖了多种多样的人类驾驶员行为及其潜在动机。借助该数据集,我们的模型通过逻辑思维过程实现了高精度的规划。
  • [论文下载:]http://arxiv.org/abs/2401.03641v1

  • 标题: A Comprehensive Survey on Instruction Following
  • 作者: Renze Lou, Kai Zhang, Wenpeng Yin
  • 摘要: Task semantics can be expressed by a set of input-output examples or a piece
    of textual instruction. Conventional machine learning approaches for natural
    language processing (NLP) mainly rely on the availability of large-scale sets
    of task-specific examples. Two issues arise: first, collecting task-specific
    labeled examples does not apply to scenarios where tasks may be too complicated
    or costly to annotate, or the system is required to handle a new task
    immediately; second, this is not user-friendly since end-users are probably
    more willing to provide task description rather than a set of examples before
    using the system. Therefore, the community is paying increasing interest in a
    new supervision-seeking paradigm for NLP: learning to follow task instructions,
    i.e., instruction following. Despite its impressive progress, there are some
    common issues that the community struggles with. This survey paper tries to
    summarize and provide insights to the current research on instruction
    following, particularly, by answering the following questions: (i) What is task
    instruction, and what instruction types exist? (ii) How to model instructions?
    (iii) What are popular instruction following datasets and evaluation metrics?
    (iv) What factors influence and explain the instructions’ performance? (v) What
    challenges remain in instruction following? To our knowledge, this is the first
    comprehensive survey about instruction following.
  • 中文摘要: 任务语义可以用一组输入-输出示例或一段文本指令来表达。用于自然语言处理(NLP)的传统机器学习方法主要依赖大规模任务特定示例集的可用性。这带来两个问题:首先,收集任务特定的标注示例不适用于任务过于复杂或标注成本过高、或者系统需要立即处理新任务的场景;其次,这对用户不够友好,因为最终用户在使用系统之前,可能更愿意提供任务描述而不是一组示例。因此,社区对NLP的一种新的监督获取范式越来越感兴趣:学习遵循任务指令,即指令跟随。尽管进展令人瞩目,社区仍面临一些共同的难题。本综述试图总结当前指令跟随研究并提供见解,特别是回答以下问题:(i)什么是任务指令,存在哪些指令类型?(ii)如何对指令建模?(iii)有哪些流行的指令跟随数据集和评估指标?(iv)哪些因素影响并解释指令的效果?(v)指令跟随还面临哪些挑战?据我们所知,这是第一篇关于指令跟随的全面综述。
  • [论文下载:]http://arxiv.org/abs/2303.10475v7
  • [GitHub:]https://github.com/RenzeLou/awesome-instruction-learning

  • 标题: RJUA-QA: A Comprehensive QA Dataset for Urology
  • 作者: Shiwei Lyu, Chenfei Chi, Hongbo Cai
  • 摘要: We introduce RJUA-QA, a novel medical dataset for question answering (QA) and
    reasoning with clinical evidence, contributing to bridge the gap between
    general large language models (LLMs) and medical-specific LLM applications.
    RJUA-QA is derived from realistic clinical scenarios and aims to facilitate
    LLMs in generating reliable diagnostic and advice. The dataset contains 2,132
    curated Question-Context-Answer pairs, corresponding about 25,000 diagnostic
    records and clinical cases. The dataset covers 67 common urological disease
    categories, where the disease coverage exceeds 97.6% of the population seeking
    medical services in urology. Each data instance in RJUA-QA comprises: (1) a
    question mirroring real patient to inquiry about clinical symptoms and medical
    conditions, (2) a context including comprehensive expert knowledge, serving as
    a reference for medical examination and diagnosis, (3) a doctor response
    offering the diagnostic conclusion and suggested examination guidance, (4) a
    diagnosed clinical disease as the recommended diagnostic outcome, and (5)
    clinical advice providing recommendations for medical examination. RJUA-QA is
    the first medical QA dataset for clinical reasoning over the patient inquiries,
    where expert-level knowledge and experience are required for yielding
    diagnostic conclusions and medical examination advice. A comprehensive
    evaluation is conducted to evaluate the performance of both medical-specific
    and general LLMs on the RJUA-QA dataset. Our data are publicly available at
    https://github.com/alipay/RJU_Ant_QA.
  • 中文摘要: 我们介绍了RJUA-QA,这是一个用于问答(QA)和基于临床证据推理的新型医学数据集,有助于弥合通用大型语言模型(LLM)与医学专用LLM应用之间的差距。RJUA-QA来源于真实的临床场景,旨在帮助LLM生成可靠的诊断和建议。该数据集包含2132个精心整理的“问题-上下文-答案”对,对应约25000条诊断记录和临床病例。数据集涵盖67种常见泌尿外科疾病类别,疾病覆盖率超过泌尿外科就诊人群的97.6%。RJUA-QA中的每个数据实例包括:(1)模拟真实患者就临床症状和身体状况进行询问的问题;(2)包含全面专家知识的上下文,作为医学检查和诊断的参考;(3)给出诊断结论和建议检查指导的医生回复;(4)作为推荐诊断结果的确诊临床疾病;以及(5)提供医学检查建议的临床意见。RJUA-QA是第一个针对患者询问进行临床推理的医学QA数据集,得出诊断结论和医学检查建议需要专家级的知识和经验。我们在RJUA-QA数据集上进行了全面评估,以考察医学专用LLM和通用LLM的性能。我们的数据已在 https://github.com/alipay/RJU_Ant_QA 公开。
  • [论文下载:]http://arxiv.org/abs/2312.09785v3
  • [GitHub:]https://github.com/alipay/RJU_Ant_QA

  • 标题: YAYI-UIE: A Chat-Enhanced Instruction Tuning Framework for Universal
    Information Extraction
  • 作者: Xinglin Xiao, Yijie Wang, Nan Xu
  • 摘要: The difficulty of the information extraction task lies in dealing with the
    task-specific label schemas and heterogeneous data structures. Recent work has
    proposed methods based on large language models to uniformly model different
    information extraction tasks. However, these existing methods are deficient in
    their information extraction capabilities for Chinese languages other than
    English. In this paper, we propose an end-to-end chat-enhanced instruction
    tuning framework for universal information extraction (YAYI-UIE), which
    supports both Chinese and English. Specifically, we utilize dialogue data and
    information extraction data to enhance the information extraction performance
    jointly. Experimental results show that our proposed framework achieves
    state-of-the-art performance on Chinese datasets while also achieving
    comparable performance on English datasets under both supervised settings and
    zero-shot settings.
  • 中文摘要: 信息提取任务的难点在于处理特定任务的标签模式和异构的数据结构。最近的工作提出了基于大型语言模型的方法来统一建模不同的信息提取任务。然而,这些现有方法在英语之外的中文信息提取能力上存在不足。在本文中,我们提出了一种端到端、聊天增强的通用信息提取指令调优框架(YAYI-UIE),同时支持中文和英文。具体来说,我们利用对话数据和信息提取数据来共同提升信息提取性能。实验结果表明,在有监督设置和零样本设置下,我们提出的框架在中文数据集上都达到了最先进的性能,同时在英文数据集上也取得了可比的性能。
  • [论文下载:]http://arxiv.org/abs/2312.15548v2

  • 标题: Why Solving Multi-agent Path Finding with Large Language Model has not
    Succeeded Yet
  • 作者: Weizhe Chen, Sven Koenig, Bistra Dilkina
  • 摘要: With the explosive influence caused by the success of large language models
    (LLM) like ChatGPT and GPT-4, there has been an extensive amount of recent work
    showing that foundation models can be used to solve a large variety of tasks.
    However, there is very limited work that shares insights on multi-agent
    planning. Multi-agent planning is different from other domains by combining the
    difficulty of multi-agent coordination and planning, and making it hard to
    leverage external tools to facilitate the reasoning needed. In this paper, we
    focus on the problem of multi-agent path finding (MAPF), which is also known as
    multi-robot route planning, and study how to solve MAPF with LLMs. We first
    show the motivating success on an empty room map without obstacles, then the
    failure to plan on a slightly harder room map. We present our hypothesis of why
    directly solving MAPF with LLMs has not been successful yet, and we use various
    experiments to support our hypothesis.
  • 中文摘要: 随着ChatGPT和GPT-4等大型语言模型(LLM)的成功带来的爆炸性影响,近期大量工作表明基础模型可以用于解决各种各样的任务。然而,分享多智能体规划方面见解的工作非常有限。多智能体规划不同于其他领域:它叠加了多智能体协调与规划两方面的困难,并且很难借助外部工具来完成所需的推理。在本文中,我们聚焦多智能体路径寻找(MAPF)问题,也称多机器人路径规划,研究如何用LLM求解MAPF。我们首先展示了在无障碍物的空房间地图上令人鼓舞的成功,随后展示了在稍难一些的房间地图上规划的失败。我们提出了关于为什么用LLM直接求解MAPF尚未成功的假说,并用各种实验来支持这一假说。
  • [论文下载:]http://arxiv.org/abs/2401.03630v1

  • 标题: DDM-Lag : A Diffusion-based Decision-making Model for Autonomous
    Vehicles with Lagrangian Safety Enhancement
  • 作者: Jiaqi Liu, Peng Hang, Xiaocong Zhao
  • 摘要: Decision-making stands as a pivotal component in the realm of autonomous
    vehicles (AVs), playing a crucial role in navigating the intricacies of
    autonomous driving. Amidst the evolving landscape of data-driven methodologies,
    enhancing decision-making performance in complex scenarios has emerged as a
    prominent research focus. Despite considerable advancements, current
    learning-based decision-making approaches exhibit potential for refinement,
    particularly in aspects of policy articulation and safety assurance. To address
    these challenges, we introduce DDM-Lag, a Diffusion Decision Model augmented
    with Lagrangian-based safety enhancements. In our approach, the autonomous
    driving decision-making conundrum is conceptualized as a Constrained Markov
    Decision Process (CMDP). We have crafted an Actor-Critic framework, wherein
    the diffusion model is employed as the actor, facilitating policy exploration
    and
    learning. The integration of safety constraints in the CMDP and the adoption of
    a Lagrangian relaxation-based policy optimization technique ensure enhanced
    decision safety. A PID controller is employed for the stable updating of model
    parameters. The effectiveness of DDM-Lag is evaluated through different driving
    tasks, showcasing improvements in decision-making safety and overall
    performance compared to baselines.
  • 中文摘要: 决策是自动驾驶汽车(AV)领域的关键组成部分,在应对自动驾驶的复杂性方面发挥着至关重要的作用。在数据驱动方法不断发展的背景下,提高复杂场景中的决策性能已成为一个突出的研究重点。尽管取得了相当大的进步,但目前基于学习的决策方法仍有改进空间,特别是在策略表达和安全保障方面。为了应对这些挑战,我们引入了DDM-Lag,这是一种带有基于拉格朗日的安全增强的扩散决策模型。在我们的方法中,自动驾驶决策问题被建模为约束马尔可夫决策过程(CMDP,见下方公式示意)。我们构建了一个Actor-Critic框架,其中采用扩散模型作为Actor,促进策略探索和学习。CMDP中安全约束的引入和基于拉格朗日松弛的策略优化技术确保了更高的决策安全性。我们采用PID控制器对模型参数进行稳定更新。通过不同的驾驶任务评估了DDM-Lag的有效性,结果显示与基线相比,其决策安全性和整体性能均有提升。
  • [论文下载:]http://arxiv.org/abs/2401.03629v1
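
摘要中"CMDP + 拉格朗日松弛"的思路可以用一个标准形式紧凑地写出(记号为我们所加,并非论文原文):

```latex
% 约束MDP:在代价约束 J_c(\pi) \le d 下最大化奖励回报
\max_{\pi} \; J_r(\pi) \quad \text{s.t.} \quad J_c(\pi) \le d
% 拉格朗日松弛将其转化为无约束的鞍点问题:
\min_{\lambda \ge 0} \; \max_{\pi} \;
  \mathcal{L}(\pi, \lambda) = J_r(\pi) - \lambda \bigl( J_c(\pi) - d \bigr)
% 其中乘子 \lambda 随约束违反程度更新;论文中使用PID控制器来稳定更新过程。
```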

  • 标题: ChatGPT for Conversational Recommendation: Refining Recommendations by
    Reprompting with Feedback
  • 作者: Kyle Dylan Spurlock, Cagla Acun, Esin Saka
  • 摘要: Recommendation algorithms have been pivotal in handling the overwhelming
    volume of online content. However, these algorithms seldom consider direct user
    input, resulting in superficial interaction between them. Efforts have been
    made to include the user directly in the recommendation process through
    conversation, but these systems too have had limited interactivity. Recently,
    Large Language Models (LLMs) like ChatGPT have gained popularity due to their
    ease of use and their ability to adapt dynamically to various tasks while
    responding to feedback. In this paper, we investigate the effectiveness of
    ChatGPT as a top-n conversational recommendation system. We build a rigorous
    pipeline around ChatGPT to simulate how a user might realistically probe the
    model for recommendations: by first instructing and then reprompting with
    feedback to refine a set of recommendations. We further explore the effect of
    popularity bias in ChatGPT’s recommendations, and compare its performance to
    baseline models. We find that reprompting ChatGPT with feedback is an effective
    strategy to improve recommendation relevancy, and that popularity bias can be
    mitigated through prompt engineering.
  • 中文摘要: 推荐算法在处理海量在线内容方面发挥了关键作用。然而,这些算法很少考虑直接的用户输入,导致人与算法之间的交互流于表面。已有工作尝试通过对话将用户直接纳入推荐过程,但这些系统的交互性也很有限。最近,像ChatGPT这样的大型语言模型(LLM)由于其易用性以及在响应反馈的同时动态适应各种任务的能力而广受欢迎。在本文中,我们研究了ChatGPT作为top-n会话推荐系统的有效性。我们围绕ChatGPT构建了一个严格的流程,以模拟用户在现实中如何向模型寻求推荐:先给出指令,再用反馈重新提示以完善推荐列表(下文附循环示意代码)。我们进一步探讨了ChatGPT推荐中流行度偏差的影响,并将其性能与基线模型进行了比较。我们发现,用反馈重新提示ChatGPT是提高推荐相关性的有效策略,并且可以通过提示工程来减轻流行度偏差。
  • [论文下载:]http://arxiv.org/abs/2401.03605v1
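
下面是上述"先指令、再带反馈重新提示"流程的一个最小Python示意,假设使用openai>=1.0的官方客户端;模型名与反馈文本均为占位示例。

```python
# 最小化的"指令 + 反馈重新提示"循环示意;模型名与反馈内容为占位示例。
from openai import OpenAI

client = OpenAI()  # 从环境变量读取 OPENAI_API_KEY

messages = [{"role": "user",
             "content": "Recommend 10 movies for a fan of slow-burn sci-fi. "
                        "Return a numbered list only."}]

for _ in range(3):  # 若干轮反馈细化
    reply = client.chat.completions.create(model="gpt-3.5-turbo",
                                           messages=messages)
    recs = reply.choices[0].message.content
    print(recs)
    # 用(此处为模拟的)用户反馈重新提示,细化上一轮推荐列表
    messages += [{"role": "assistant", "content": recs},
                 {"role": "user",
                  "content": "I have already seen items 1 and 4; "
                             "replace them with less popular titles."}]
```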

  • 标题: InFoBench: Evaluating Instruction Following Ability in Large Language
    Models
  • 作者: Yiwei Qin, Kaiqiang Song, Yebowen Hu
  • 摘要: This paper introduces the Decomposed Requirements Following Ratio (DRFR), a
    new metric for evaluating Large Language Models’ (LLMs) ability to follow
    instructions. Addressing a gap in current methodologies, DRFR breaks down
    complex instructions into simpler criteria, facilitating a detailed analysis of
    LLMs’ compliance with various aspects of tasks. Alongside this metric, we
    present InFoBench, a benchmark comprising 500 diverse instructions and 2,250
    decomposed questions across multiple constraint categories. Our experiments
    compare DRFR with traditional scoring methods and explore annotation sources,
    including human experts, crowd-sourced workers, and GPT-4. The findings
    demonstrate DRFR’s higher reliability and the effectiveness of using GPT-4 as a
    cost-efficient annotator. The evaluation of several advanced LLMs using this
    framework reveals their strengths and areas needing improvement, particularly
    in complex instruction-following. This study contributes a novel metric and
    benchmark, offering insights for future LLM development and evaluation.
  • 中文摘要: 本文介绍了分解需求遵循率(DRFR),这是一种评估大型语言模型(LLM)遵循指令能力的新指标。为了弥补当前方法的不足,DRFR将复杂的指令分解为更简单的标准,有助于详细分析LLM对任务各个方面的遵循情况(下文附计算示意)。除了这一指标,我们还介绍了InFoBench,这是一个包含500条不同指令和2250个跨多个约束类别的分解问题的基准。我们的实验将DRFR与传统评分方法进行了比较,并探索了注释来源,包括人类专家、众包工作者和GPT-4。研究结果证明了DRFR更高的可靠性,以及使用GPT-4作为具有成本效益的注释者的有效性。使用该框架对几种先进LLM的评估揭示了它们的优势和需要改进的领域,特别是在复杂指令遵循方面。这项研究贡献了一个新的指标和基准,为未来LLM的开发和评估提供了见解。
  • [论文下载:]http://arxiv.org/abs/2401.03601v1
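
DRFR的核心计算非常直接:把一条指令分解为若干个是/否判据,得分即被满足判据的比例。下面是一个示意(判据的判定方在论文中可以是人类专家或GPT-4):

```python
# DRFR(分解需求遵循率)的计算示意:
# criteria_judgments[i] 为 True 当且仅当第 i 条分解判据被模型输出满足。
def drfr(criteria_judgments: list[bool]) -> float:
    return sum(criteria_judgments) / len(criteria_judgments)

# 示例:某条指令分解出5条判据,其中4条被满足 -> DRFR = 0.8
print(drfr([True, True, False, True, True]))
```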

  • 标题: Automated Evaluation of Classroom Instructional Support with LLMs and
    BoWs: Connecting Global Predictions to Specific Feedback
  • 作者: Jacob Whitehill, Jennifer LoCasale-Crouch
  • 摘要: With the aim to provide teachers with more specific, frequent, and actionable
    feedback about their teaching, we explore how Large Language Models (LLMs) can
    be used to estimate “Instructional Support” domain scores of the CLassroom
    Assessment Scoring System (CLASS), a widely used observation protocol. We
    design a machine learning architecture that uses either zero-shot prompting of
    Meta’s Llama2, and/or a classic Bag of Words (BoW) model, to classify
    individual utterances of teachers’ speech (transcribed automatically using
    OpenAI’s Whisper) for the presence of Instructional Support. Then, these
    utterance-level judgments are aggregated over an entire 15-min observation
    session to estimate a global CLASS score. Experiments on two CLASS-coded
    datasets of toddler and pre-kindergarten classrooms indicate that (1) automatic
    CLASS Instructional Support estimation accuracy using the proposed method
    (Pearson R up to 0.47) approaches human inter-rater reliability (up to R =
    0.55); (2) LLMs yield slightly greater accuracy than BoW for this task,
    though the best models often combined features extracted from both LLM and BoW;
    and (3) for classifying individual utterances, there is still room for
    improvement of automated methods compared to human-level judgments. Finally,
    (4) we illustrate how the model’s outputs can be visualized at the utterance
    level to provide teachers with explainable feedback on which utterances were
    most positively or negatively correlated with specific CLASS dimensions.
  • 中文摘要: 为了向教师提供更具体、更频繁、更可操作的教学反馈,我们探讨了如何使用大型语言模型(LLM)来估计课堂评估评分系统(CLASS,一种广泛使用的观察协议)的"教学支持"领域得分。我们设计了一种机器学习架构,使用Meta的Llama2的零样本提示和/或经典的词袋(BoW)模型,对教师语音的单条话语(使用OpenAI的Whisper自动转录)进行是否包含教学支持的分类。然后,在整个15分钟的观察会话中汇总这些话语级判断,以估计全局CLASS得分(下文附汇总示意)。在幼儿和学前班教室的两个CLASS编码数据集上的实验表明:(1)使用所提出方法的CLASS教学支持自动估计准确性(Pearson R高达0.47)接近人类评分者间的可靠性(高达R=0.55);(2)LLM在这项任务中的精度略高于BoW,尽管最好的模型通常结合了从LLM和BoW中提取的特征;(3)对于单条话语的分类,与人类水平的判断相比,自动化方法仍有改进空间。最后,(4)我们说明了如何在话语级别可视化模型的输出,以向教师提供可解释的反馈,说明哪些话语与特定的CLASS维度呈最强的正相关或负相关。
  • [论文下载:]http://arxiv.org/abs/2310.01132v2
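
摘要中的汇总步骤可以用一个极简示意来说明:话语级的"是否包含教学支持"判断(来自Llama2零样本提示和/或BoW分类器)在整个15分钟会话内汇总为一个全局分数;这里用平均值汇总只是我们的假设。

```python
# 话语级判断 -> 会话级全局分数的汇总示意;均值汇总为假设,论文可能另有映射。
def session_score(utterance_flags: list[int]) -> float:
    """utterance_flags[i] = 1 表示第 i 条话语被判定包含教学支持。"""
    return sum(utterance_flags) / len(utterance_flags)

print(session_score([0, 1, 1, 0, 0, 1]))  # 6条话语的玩具会话 -> 0.5
```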

  • 标题: Overview of Dialogue Robot Competition 2023
  • 作者: Takashi Minato, Ryuichiro Higashinaka, Kurima Sakai
  • 摘要: We have held dialogue robot competitions in 2020 and 2022 to compare the
    performances of interactive robots using an android that closely resembles a
    human. In 2023, the third competition DRC2023 was held. The task of DRC2023 was
    designed to be more challenging than the previous travel agent dialogue tasks.
    Since anyone can now develop a dialogue system using LLMs, the participating
    teams are required to develop a system that effectively uses information about
    the situation on the spot (real-time information), which is not handled by
    ChatGPT and other systems. DRC2023 has two rounds, a preliminary round and a
    final round, as in the previous competitions. The preliminary round was held
    on Oct. 27 – Nov. 20, 2023 at real travel agency stores. The final round
    will be held on December 23, 2023. This paper provides an overview of the task
    settings and evaluation method of DRC2023 and the preliminary round results.
  • 中文摘要: 我们在2020年和2022年举办了对话机器人比赛,使用与人类非常相似的人形机器人(android)来比较交互式机器人的性能。2023年举行了第三届比赛DRC2023。DRC2023的任务被设计得比之前的旅行社对话任务更具挑战性。由于现在任何人都可以使用LLM开发对话系统,参赛团队需要开发一个能有效利用现场实时信息的系统,而这是ChatGPT等系统所不处理的。与之前的比赛一样,DRC2023分为预赛和决赛两轮。预赛已于2023年10月27日至11月20日在真实的旅行社门店举行,决赛将于2023年12月23日举行。本文概述了DRC2023的任务设置、评估方法以及预赛结果。
  • [论文下载:]http://arxiv.org/abs/2401.03547v1

  • 标题: Token-free LLMs Can Generate Chinese Classical Poetry with More Accurate
    Format
  • 作者: Chengyue Yu, Lei Zang, Jiaotuan Wang
  • 摘要: Finetuned large language models (such as ChatGPT and Qwen-chat) can generate
    Chinese classical poetry following human’s instructions. LLMs perform well in
    content, but are usually lacking in format, with occasionally excess or
    insufficient number of characters in each line. Since most SOTA LLMs are
    token-based, we assume that the format inaccuracy is due to the difficulty of
    the “token planning” task, which means that the LLM needs to know exactly
    how many characters are contained in each token and do length-control
    planning based on that knowledge. In this paper, we first confirm our
    assumption by showing that existing token-based large language models have
    limited knowledge of the token-character relationship. We use a spelling bee
    probing procedure, and find that Qwen-chat fails on nearly 15% of a Chinese
    spelling test. We then show
    that a token-based model can be easily tailored into a token-free model (in
    terms of Chinese), which can largely solve the format accuracy problem. Our
    tailoring procedure removes long-token from vocabulary and keeps only
    character-level or byte-level tokens. As part of our contribution, we release
    the finetuned token-free model (which is based on Qwen-chat-7B), which can
    generate Chinese classical poetry following complex instructions like LLMs
    (such as story paraphrasing), and also performs well in format. On the test
    set, our token-free model achieves a format accuracy of 0.96, compared to
    0.84 for token-based counterparts and 0.38 for GPT-4.
  • 中文摘要: 经过微调的大型语言模型(如ChatGPT和Qwen-chat)可以按照人类的指令生成中国古典诗歌。LLM在内容上表现良好,但通常在格式上有所欠缺,偶尔每行的字数会过多或不足。由于大多数SOTA LLM都是基于令牌的,我们假设格式不准确是由于"令牌规划"任务的困难,即LLM需要准确知道每个令牌包含多少个字符,并据此进行长度控制规划。在本文中,我们首先证实了这一假设:现有的基于令牌的大型语言模型对令牌-字符关系的了解有限。我们使用拼写比赛(spelling bee)式的探测程序,发现Qwen-chat在近15%的中文拼写测试中失败。然后,我们证明了基于令牌的模型可以很容易地裁剪为(就中文而言的)无令牌模型,从而在很大程度上解决格式准确性问题。我们的裁剪过程从词汇表中删除多字符长令牌,只保留字符级或字节级令牌(下文附裁剪示意代码)。作为我们贡献的一部分,我们发布了经过微调的无令牌模型(基于Qwen-chat-7B),该模型可以像LLM一样按照复杂指令(如故事转述)生成中国古典诗歌,并且在格式上表现良好。在测试集上,我们的无令牌模型达到了0.96的格式准确率,而基于令牌的对应模型为0.84,GPT-4为0.38。
  • [论文下载:]http://arxiv.org/abs/2401.03512v1
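
摘要中的"裁剪"思路可以用一个玩具词表来示意:删除跨多个汉字的令牌,只保留单字符/字节级令牌(以及特殊符号),使每行字数可以被直接规划。真实分词器还需处理合并规则与特殊令牌,此处仅作说明。

```python
# 词表裁剪示意:只保留单字符令牌和特殊令牌;真实实现需处理更多细节。
def tailor_vocab(vocab: dict[str, int]) -> dict[str, int]:
    return {tok: idx for tok, idx in vocab.items()
            if len(tok) <= 1 or tok.startswith("<")}  # 保留 <eos> 等特殊令牌

toy_vocab = {"月": 0, "光": 1, "月光": 2, "<eos>": 3}
print(tailor_vocab(toy_vocab))  # {'月': 0, '光': 1, '<eos>': 3}
```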

  • 标题: DiarizationLM: Speaker Diarization Post-Processing with Large Language
    Models
  • 作者: Quan Wang, Yiling Huang, Guanlong Zhao
  • 摘要: In this paper, we introduce DiarizationLM, a framework to leverage large
    language models (LLM) to post-process the outputs from a speaker diarization
    system. Various goals can be achieved with the proposed framework, such as
    improving the readability of the diarized transcript, or reducing the word
    diarization error rate (WDER). In this framework, the outputs of the automatic
    speech recognition (ASR) and speaker diarization systems are represented as a
    compact textual format, which is included in the prompt to an optionally
    finetuned LLM. The outputs of the LLM can be used as the refined diarization
    results with the desired enhancement. As a post-processing step, this framework
    can be easily applied to any off-the-shelf ASR and speaker diarization systems
    without retraining existing components. Our experiments show that a finetuned
    PaLM 2-S model can reduce the WDER by rel. 25.9% on the Fisher telephone
    conversation dataset, and rel. 31% on the Callhome English dataset.
  • 中文摘要: 在本文中,我们介绍了DiarizationLM,这是一个利用大型语言模型(LLM)对说话人分离系统的输出进行后处理的框架。该框架可以实现多种目标,例如提高分离转录文本的可读性,或降低词级说话人分离错误率(WDER)。在该框架中,自动语音识别(ASR)和说话人分离系统的输出被表示为一种紧凑的文本格式,并包含在发给(可选微调的)LLM的提示中(下文附格式示意代码)。LLM的输出可以用作具有所需增强效果的精炼分离结果。作为一个后处理步骤,该框架可以很容易地应用于任何现成的ASR和说话人分离系统,而无需重新训练现有组件。我们的实验表明,微调的PaLM 2-S模型可以在Fisher电话会话数据集上将WDER相对降低25.9%,在Callhome英语数据集上相对降低31%。
  • [论文下载:]http://arxiv.org/abs/2401.03506v1
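
下面示意一种把ASR词序列与说话人标签配对成紧凑文本的方式;DiarizationLM实际使用的序列化格式可能不同,此处标签写法为我们的假设。

```python
# 将(词, 说话人)序列压缩为紧凑文本的示意;具体格式以论文为准。
def to_compact_text(words: list[str], speakers: list[int]) -> str:
    out, prev = [], None
    for w, s in zip(words, speakers):
        if s != prev:                 # 仅在说话人切换时输出标签
            out.append(f"<spk:{s}>")
            prev = s
        out.append(w)
    return " ".join(out)

print(to_compact_text(["hi", "there", "hello"], [1, 1, 2]))
# -> "<spk:1> hi there <spk:2> hello"
```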

  • 标题: Efficient Test Data Generation for MC/DC with OCL and Search
  • 作者: Hassan Sartaj, Muhammad Zohaib Iqbal, Atif Aftab Ahmed Jilani
  • 摘要: System-level testing of avionics software systems requires compliance with
    different international safety standards such as DO-178C. An important
    consideration of the avionics industry is automated test data generation
    according to the criteria suggested by safety standards. One of the recommended
    criteria by DO-178C is the modified condition/decision coverage (MC/DC)
    criterion. The current model-based test data generation approaches use
    constraints written in Object Constraint Language (OCL), and apply search
    techniques to generate test data. These approaches either do not support MC/DC
    criterion or suffer from performance issues while generating test data for
    large-scale avionics systems. In this paper, we propose an effective way to
    automate MC/DC test data generation during model-based testing. We develop a
    strategy that utilizes case-based reasoning (CBR) and range reduction
    heuristics designed to solve MC/DC-tailored OCL constraints. We performed an
    empirical study to compare our proposed strategy for MC/DC test data generation
    using CBR, range reduction, both CBR and range reduction, with an original
    search algorithm, and random search. We also empirically compared our strategy
    with existing constraint-solving approaches. The results show that both CBR and
    range reduction for MC/DC test data generation outperform the baseline
    approach. Moreover, the combination of both CBR and range reduction for MC/DC
    test data generation is an effective approach compared to existing constraint
    solvers.
  • 中文摘要: 航空电子软件系统的系统级测试要求符合DO-178C等不同的国际安全标准。航空电子行业的一个重要考虑是按照安全标准建议的准则自动生成测试数据。DO-178C推荐的准则之一是修正条件/判定覆盖(MC/DC)准则(下文附一个MC/DC小例子)。当前基于模型的测试数据生成方法使用对象约束语言(OCL)编写的约束,并应用搜索技术来生成测试数据。这些方法要么不支持MC/DC准则,要么在为大型航空电子系统生成测试数据时存在性能问题。在本文中,我们提出了一种在基于模型的测试中自动生成MC/DC测试数据的有效方法。我们开发了一种策略,利用基于案例的推理(CBR)和范围缩减启发式方法来求解面向MC/DC的OCL约束。我们进行了一项实证研究,将所提出的使用CBR、范围缩减以及CBR与范围缩减相结合的MC/DC测试数据生成策略,与原始搜索算法和随机搜索进行了比较。我们还将我们的策略与现有的约束求解方法进行了实证比较。结果表明,用于MC/DC测试数据生成的CBR和范围缩减均优于基线方法。此外,与现有的约束求解器相比,将CBR和范围缩减相结合用于MC/DC测试数据生成是一种有效的方法。
  • [论文下载:]http://arxiv.org/abs/2401.03469v1
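
MC/DC要求每个条件都能被证明可独立影响判定结果。下面用判定 d = A and (B or C) 做一个小例子:对每个条件,寻找一对仅该条件取值不同、且判定结果翻转的测试用例。

```python
# MC/DC 独立性用例对的穷举示意,判定为 d = A and (B or C)。
from itertools import product

def d(A, B, C):
    return A and (B or C)

cases = list(product([False, True], repeat=3))
for i, cond in enumerate("ABC"):
    pairs = [(x, y) for x in cases for y in cases
             if x[i] != y[i]                                     # 该条件翻转
             and all(x[j] == y[j] for j in range(3) if j != i)   # 其余条件不变
             and d(*x) != d(*y)]                                 # 判定结果翻转
    print(cond, "的独立性用例对数:", len(pairs))
```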

  • 标题: Soaring from 4K to 400K: Extending LLM’s Context with Activation Beacon
  • 作者: Peitian Zhang, Zheng Liu, Shitao Xiao
  • 摘要: The utilization of long contexts poses a big challenge for large language
    models due to their limited context window length. Although the context window
    can be extended through fine-tuning, it will result in a considerable cost at
    both training and inference time, and exert an unfavorable impact on the LLM’s
    original capabilities. In this work, we propose Activation Beacon, which
    condenses LLM’s raw activations into more compact forms such that it can
    perceive a much longer context with a limited context window. Activation Beacon
    is introduced as a plug-and-play module for the LLM. It fully preserves the
    LLM’s original capability on short contexts while extending the new capability
    on processing longer contexts. Besides, it works with short sliding windows to
    process the long context, which achieves a competitive memory and time
    efficiency in both training and inference. Activation Beacon is learned by the
    auto-regression task conditioned on a mixture of beacons with diversified
    condensing ratios. Thanks to such a treatment, it can be efficiently trained
    purely with short-sequence data in just 10K steps, which consumes less than 9
    hours on a single 8xA800 GPU machine. The experimental studies show that
    Activation Beacon is able to extend Llama-2-7B’s context length by 100 times
    (from 4K to 400K), meanwhile achieving a superior result on both
    long-context generation and understanding tasks. Our model and code will be
    available at the BGE repository.
  • 中文摘要: 由于上下文窗口长度有限,长上下文的使用对大型语言模型构成了巨大挑战。尽管上下文窗口可以通过微调来扩展,但这将在训练和推理阶段都带来相当大的成本,并对LLM的原有能力产生不利影响。在这项工作中,我们提出了激活信标(Activation Beacon),它将LLM的原始激活浓缩成更紧凑的形式,使其能够在有限的上下文窗口中感知更长的上下文。激活信标作为LLM的即插即用模块引入。它完整保留了LLM在短上下文上的原有能力,同时扩展了处理长上下文的新能力。此外,它与短滑动窗口配合处理长上下文,在训练和推理中都实现了有竞争力的内存和时间效率。激活信标通过以混合多种浓缩比率的信标为条件的自回归任务来学习。得益于这种处理,它可以仅用短序列数据在1万步内完成高效训练,在一台8xA800 GPU机器上耗时不到9小时。实验研究表明,激活信标能够将Llama-2-7B的上下文长度扩展100倍(从4K扩展到400K),同时在长上下文生成和理解任务上都取得了优异的效果。我们的模型和代码将在BGE存储库中提供。
  • [论文下载:]http://arxiv.org/abs/2401.03462v1

  • 标题: Exploring Large Language Model based Intelligent Agents: Definitions,
    Methods, and Prospects
  • 作者: Yuheng Cheng, Ceyao Zhang, Zhengwen Zhang
  • 摘要: Intelligent agents stand out as a potential path toward artificial general
    intelligence (AGI). Thus, researchers have dedicated significant effort to
    diverse implementations for them. Benefiting from recent progress in large
    language models (LLMs), LLM-based agents that use universal natural language as
    an interface exhibit robust generalization capabilities across various
    applications – from serving as autonomous general-purpose task assistants to
    applications in coding, social, and economic domains, LLM-based agents offer
    extensive exploration opportunities. This paper surveys current research to
    provide an in-depth overview of LLM-based intelligent agents within
    single-agent and multi-agent systems. It covers their definitions, research
    frameworks, and foundational components such as their composition, cognitive
    and planning methods, tool utilization, and responses to environmental
    feedback. We also delve into the mechanisms of deploying LLM-based agents in
    multi-agent systems, including multi-role collaboration, message passing, and
    strategies to alleviate communication issues between agents. The discussions
    also shed light on popular datasets and application scenarios. We conclude by
    envisioning prospects for LLM-based agents, considering the evolving landscape
    of AI and natural language processing.
  • 中文摘要: 智能代理是通往通用人工智能(AGI)的一条潜在途径。因此,研究人员为其多种实现付出了巨大努力。得益于大型语言模型(LLM)的最新进展,以通用自然语言为接口的基于LLM的代理在各种应用中表现出强大的泛化能力:从充当自主的通用任务助理,到编码、社会和经济领域的应用,基于LLM的代理提供了广阔的探索空间。本文综述了当前的研究,深入概述了单智能体和多智能体系统中基于LLM的智能体,涵盖它们的定义、研究框架和基本组成部分,如组成、认知与规划方法、工具使用以及对环境反馈的响应。我们还深入研究了在多智能体系统中部署基于LLM的代理的机制,包括多角色协作、消息传递以及缓解代理间通信问题的策略。讨论还涉及流行的数据集和应用场景。最后,考虑到人工智能和自然语言处理领域的持续演进,我们展望了基于LLM的代理的前景。
  • [论文下载:]http://arxiv.org/abs/2401.03428v1

  • 标题: From Beginner to Expert: Modeling Medical Knowledge into General LLMs
  • 作者: Qiang Li, Xiaoyan Yang, Haowen Wang
  • 摘要: Recently, large language model (LLM) based artificial intelligence (AI)
    systems have demonstrated remarkable capabilities in natural language
    understanding and generation. However, these models face a significant
    challenge when it comes to sensitive applications, such as reasoning over
    medical knowledge and answering medical questions in a physician-like manner.
    Prior studies attempted to overcome this challenge by increasing the model size
    (>100B) to learn more general medical knowledge, while there is still room for
    improvement in LLMs with smaller-scale model sizes (<100B). In this work, we
    start from a pre-trained general LLM model (AntGLM-10B) and fine-tune it from a
    medical beginner towards a medical expert (called AntGLM-Med-10B), which
    leverages a 3-stage optimization procedure, i.e., general medical knowledge
    injection, medical domain instruction tuning, and specific medical task
    adaptation. Our contributions are threefold: (1) We specifically investigate
    how to adapt a pre-trained general LLM in medical domain, especially for a
    specific medical task. (2) We collect and construct large-scale medical
    datasets for each stage of the optimization process. These datasets encompass
    various data types and tasks, such as question-answering, medical reasoning,
    multi-choice questions, and medical conversations. (3) Specifically for
    multi-choice questions in the medical domain, we propose a novel
    Verification-of-Choice approach for prompting engineering, which significantly
    enhances the reasoning ability of LLMs. Remarkably, by combining the above
    approaches, our AntGLM-Med-10B model can outperform most LLMs on
    PubMedQA, including both general and medical LLMs, even when these LLMs have
    larger model size.
  • 中文摘要: 最近,基于大型语言模型(LLM)的人工智能(AI)系统在自然语言理解和生成方面表现出了非凡的能力。然而,在医学知识推理和以医生般的方式回答医学问题等敏感应用上,这些模型面临重大挑战。先前的研究试图通过增大模型规模(>100B)来学习更多的通用医学知识以克服这一挑战,而规模较小(<100B)的LLM仍有改进空间。在这项工作中,我们从一个预训练的通用LLM模型(AntGLM-10B)出发,通过三阶段优化流程(即通用医学知识注入、医学领域指令调整和特定医学任务适配),将其从医学初学者微调为医学专家(称为AntGLM-Med-10B)。我们的贡献有三个方面:(1)我们专门研究了如何将预训练的通用LLM适配到医学领域,特别是特定的医疗任务;(2)我们为优化流程的每个阶段收集并构建了大规模医学数据集,涵盖问答、医学推理、多选题和医学对话等多种数据类型和任务;(3)特别是针对医学领域的多选题,我们提出了一种新的用于提示工程的选项验证(Verification-of-Choice)方法,显著提高了LLM的推理能力(下文附提示流程示意)。值得注意的是,结合上述方法,我们的AntGLM-Med-10B模型在PubMedQA上可以优于大多数LLM,包括通用和医学LLM,即使这些LLM具有更大的模型规模。
  • [论文下载:]http://arxiv.org/abs/2312.01040v3
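
摘要只给出了"选项验证(Verification-of-Choice)"提示法的概要;下面是一个合理的流程示意:先让模型逐一核查每个选项,再汇总核查结论给出最终答案。其中 ask_llm 是任意聊天补全调用的占位函数,提示语为我们的假设。

```python
# "选项验证"提示流程的示意;ask_llm 为占位的LLM调用,提示词为假设。
def verification_of_choice(question: str, options: list[str], ask_llm) -> str:
    verdicts = []
    for opt in options:                      # 第一步:逐一核查每个选项
        v = ask_llm(f"问题:{question}\n选项:{opt}\n"
                    "请逐步核查该选项是否正确,最后回答“正确”或“错误”。")
        verdicts.append((opt, v))
    summary = "\n".join(f"- {o}: {v}" for o, v in verdicts)
    return ask_llm(f"问题:{question}\n各选项核查结论:\n{summary}\n"
                   "据此给出最终答案(只输出选项内容)。")
```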

  • 标题: DroidBot-GPT: GPT-powered UI Automation for Android
  • 作者: Hao Wen, Hongming Wang, Jiaxuan Liu
  • 摘要: This paper introduces DroidBot-GPT, a tool that utilizes GPT-like large
    language models (LLMs) to automate the interactions with Android mobile
    applications. Given a natural language description of a desired task,
    DroidBot-GPT can automatically generate and execute actions that navigate the
    app to complete the task. It works by translating the app GUI state information
    and the available actions on the smartphone screen to natural language prompts
    and asking the LLM to make a choice of actions. Since the LLM is typically
    trained on a large amount of data including the how-to manuals of diverse
    software applications, it has the ability to make reasonable choices of actions
    based on the provided information. We evaluate DroidBot-GPT with a self-created
    dataset that contains 33 tasks collected from 17 Android applications spanning
    10 categories. It can successfully complete 39.39% of the tasks, and the
    average partial completion progress is about 66.76%. Given the fact that our
    method is fully unsupervised (no modification required from either the app or
    the LLM), we believe there is great potential to enhance automation performance
    with better app development paradigms and/or custom model training.
  • 中文摘要: 本文介绍了DroidBot-GPT,这是一种利用类GPT大型语言模型(LLM)来自动化与Android移动应用交互的工具。给定所需任务的自然语言描述,DroidBot-GPT可以自动生成并执行在应用中导航以完成任务的操作。它的工作原理是将应用的GUI状态信息和智能手机屏幕上的可用操作转换为自然语言提示,并让LLM选择操作(下文附循环示意代码)。由于LLM通常在包含各种软件操作手册的大量数据上训练,它能够根据所提供的信息做出合理的操作选择。我们使用自建数据集评估DroidBot-GPT,该数据集包含从10个类别的17个Android应用中收集的33个任务。它可以成功完成39.39%的任务,平均部分完成进度约为66.76%。鉴于我们的方法是完全无监督的(应用和LLM均无需修改),我们相信通过更好的应用开发范式和/或定制模型训练,有很大的潜力进一步提高自动化性能。
  • [论文下载:]http://arxiv.org/abs/2304.07061v5
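
下面是上述交互循环的单步示意:把当前GUI状态与候选操作序列化成自然语言提示,让LLM选出下一步操作。get_gui_state、list_actions、execute 与 ask_llm 均为工具内部逻辑的占位函数。

```python
# DroidBot-GPT 式自动化单步示意;四个函数参数均为占位。
def automation_step(task, get_gui_state, list_actions, execute, ask_llm):
    state = get_gui_state()                   # 例如:当前可见控件的文本描述
    actions = list_actions(state)             # 例如:["tap 'Settings'", ...]
    menu = "\n".join(f"{i}: {a}" for i, a in enumerate(actions))
    choice = ask_llm(f"Task: {task}\nCurrent screen: {state}\n"
                     f"Available actions:\n{menu}\n"
                     "Reply with the number of the best next action.")
    execute(actions[int(choice.strip())])     # 执行LLM选中的操作
```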

  • 标题: GRAM: Global Reasoning for Multi-Page VQA
  • 作者: Tsachi Blau, Sharon Fogel, Roi Ronen
  • 摘要: The increasing use of transformer-based large language models brings forward
    the challenge of processing long sequences. In document visual question
    answering (DocVQA), leading methods focus on the single-page setting, while
    documents can span hundreds of pages. We present GRAM, a method that seamlessly
    extends pre-trained single-page models to the multi-page setting, without
    requiring computationally-heavy pretraining. To do so, we leverage a
    single-page encoder for local page-level understanding, and enhance it with
    document-level designated layers and learnable tokens, facilitating the flow of
    information across pages for global reasoning. To enforce our model to utilize
    the newly introduced document-level tokens, we propose a tailored bias
    adaptation method. For additional computational savings during decoding, we
    introduce an optional compression stage using our C-Former model, which reduces
    the encoded sequence length, thereby allowing a tradeoff between quality and
    latency. Extensive experiments showcase GRAM’s state-of-the-art performance on
    the benchmarks for multi-page DocVQA, demonstrating the effectiveness of our
    approach.
  • 中文摘要: 基于Transformer的大型语言模型的日益普及带来了处理长序列的挑战。在文档视觉问答(DocVQA)中,主流方法侧重于单页设置,而文档可以跨越数百页。我们提出了GRAM,这是一种将预训练的单页模型无缝扩展到多页设置的方法,无需计算量巨大的预训练。为此,我们利用单页编码器进行局部的页面级理解,并通过文档级专用层和可学习令牌对其进行增强,促进信息跨页面流动以进行全局推理。为了促使模型利用新引入的文档级令牌,我们提出了一种量身定制的偏置适配方法。为了在解码过程中进一步节省计算量,我们使用C-Former模型引入了一个可选的压缩阶段,它缩短了编码序列的长度,从而允许在质量和延迟之间进行权衡。大量实验展示了GRAM在多页DocVQA基准测试上最先进的性能,证明了我们方法的有效性。
  • [论文下载:]http://arxiv.org/abs/2401.03411v1

  • 标题: Escalation Risks from Language Models in Military and Diplomatic
    Decision-Making
  • 作者: Juan-Pablo Rivera, Gabriel Mukobi, Anka Reuel
  • 摘要: Governments are increasingly considering integrating autonomous AI agents in
    high-stakes military and foreign-policy decision-making, especially with the
    emergence of advanced generative AI models like GPT-4. Our work aims to
    scrutinize the behavior of multiple AI agents in simulated wargames,
    specifically focusing on their predilection to take escalatory actions that may
    exacerbate multilateral conflicts. Drawing on political science and
    international relations literature about escalation dynamics, we design a novel
    wargame simulation and scoring framework to assess the escalation risks of
    actions taken by these agents in different scenarios. Contrary to prior
    studies, our research provides both qualitative and quantitative insights and
    focuses on large language models (LLMs). We find that all five studied
    off-the-shelf LLMs show forms of escalation and difficult-to-predict escalation
    patterns. We observe that models tend to develop arms-race dynamics, leading to
    greater conflict, and in rare cases, even to the deployment of nuclear weapons.
    Qualitatively, we also collect the models’ reported reasonings for chosen
    actions and observe worrying justifications based on deterrence and
    first-strike tactics. Given the high stakes of military and foreign-policy
    contexts, we recommend further examination and cautious consideration before
    deploying autonomous language model agents for strategic military or diplomatic
    decision-making.
  • 中文摘要: 各国政府越来越多地考虑将自主人工智能代理集成到高风险的军事和外交政策决策中,特别是随着GPT-4等先进生成式人工智能模型的出现。我们的工作旨在仔细研究多个人工智能代理在模拟兵棋推演中的行为,特别关注它们采取可能加剧多边冲突的升级行动的倾向。借鉴政治学和国际关系中有关升级动态的文献,我们设计了一个新颖的兵棋模拟和评分框架,以评估这些代理在不同场景下所采取行动的升级风险。与之前的研究不同,我们的研究提供了定性和定量两方面的见解,并侧重于大型语言模型(LLM)。我们发现,所研究的全部五种现成LLM都表现出某种形式的升级以及难以预测的升级模式。我们观察到,这些模型往往会发展出军备竞赛动态,导致更大的冲突,在极少数情况下甚至导致部署核武器。在定性方面,我们还收集了模型为所选行动报告的理由,并观察到基于威慑和先发制人打击战术的令人担忧的辩护。鉴于军事和外交政策背景的高风险,我们建议在将自主语言模型代理部署于战略军事或外交决策之前进行进一步审查和谨慎考虑。
  • [论文下载:]http://arxiv.org/abs/2401.03408v1

  • 标题: Empirical Study of Large Language Models as Automated Essay Scoring
    Tools in English Composition: Taking TOEFL Independent Writing Task for
    Example
  • 作者: Wei Xia, Shaoguang Mao, Chanjing Zheng
  • 摘要: Large language models have demonstrated exceptional capabilities in tasks
    involving natural language generation, reasoning, and comprehension. This study
    aims to construct prompts and comments grounded in the diverse scoring criteria
    delineated within the official TOEFL guide. The primary objective is to assess
    the capabilities and constraints of ChatGPT, a prominent representative of
    large language models, within the context of automated essay scoring. The
    prevailing methodologies for automated essay scoring involve the utilization of
    deep neural networks, statistical machine learning techniques, and fine-tuning
    pre-trained models. However, these techniques face challenges when applied to
    different contexts or subjects, primarily due to their substantial data
    requirements and limited adaptability to small sample sizes. In contrast, this
    study employs ChatGPT to conduct an automated evaluation of English essays,
    even with a small sample size, employing an experimental approach. The
    empirical findings indicate that ChatGPT can provide operational functionality
    for automated essay scoring, although the results exhibit a regression effect.
    It is imperative to underscore that the effective design and implementation of
    ChatGPT prompts necessitate a profound domain expertise and technical
    proficiency, as these prompts are subject to specific threshold criteria.
    Keywords: ChatGPT, Automated Essay Scoring, Prompt Learning, TOEFL Independent
    Writing Task
  • 中文摘要: 大型语言模型在涉及自然语言生成、推理和理解的任务中表现出了非凡的能力。本研究旨在构建基于官方托福指南中不同评分标准的提示和评语。主要目标是在自动作文评分的背景下,评估大型语言模型的杰出代表ChatGPT的能力和局限。当前自动作文评分的主流方法包括利用深度神经网络、统计机器学习技术和微调预训练模型。然而,这些技术在应用于不同的背景或主题时面临挑战,主要是因为它们需要大量数据,对小样本量的适应性有限。相比之下,本研究采用实验方法,使用ChatGPT对英语作文进行自动评估,即使样本量很小。实证结果表明,ChatGPT可以为自动作文评分提供可操作的功能,尽管结果显示出回归效应。必须强调的是,ChatGPT提示的有效设计和实现需要深厚的领域专业知识和技术熟练度,因为这些提示需要满足特定的阈值标准。关键词:ChatGPT、自动作文评分、提示学习、托福独立写作任务
  • [论文下载:]http://arxiv.org/abs/2401.03401v1

  • 标题: LLMs for Robotic Object Disambiguation
  • 作者: Connie Jiang, Yiqing Xu, David Hsu
  • 摘要: The advantages of pre-trained large language models (LLMs) are apparent in a
    variety of language processing tasks. But can a language model’s knowledge be
    further harnessed to effectively disambiguate objects and navigate
    decision-making challenges within the realm of robotics? Our study reveals the
    LLM’s aptitude for solving complex decision making challenges that are often
    previously modeled by Partially Observable Markov Decision Processes (POMDPs).
    A pivotal focus of our research is the object disambiguation capability of
    LLMs. We detail the integration of an LLM into a tabletop environment
    disambiguation task, a decision making problem where the robot’s task is to
    discern and retrieve a user’s desired object from an arbitrarily large and
    complex cluster of objects. Despite multiple query attempts with zero-shot
    prompt engineering (details can be found in the Appendix), the LLM struggled to
    inquire about features not explicitly provided in the scene description. In
    response, we have developed a few-shot prompt engineering system to improve the
    LLM’s ability to pose disambiguating queries. The result is a model capable of
    both using given features when they are available and inferring new relevant
    features when necessary, to successfully generate and navigate down a precise
    decision tree to the correct object–even when faced with identical options.
  • 中文摘要: 预训练大型语言模型(LLM)的优势在各种语言处理任务中都很明显。但是,能否进一步利用语言模型的知识来有效地消除对象歧义,并应对机器人领域的决策挑战?我们的研究揭示了LLM在解决复杂决策挑战方面的能力,这类挑战以往通常由部分可观测马尔可夫决策过程(POMDP)建模。我们研究的一个关键焦点是LLM的对象消歧能力。我们详细介绍了将LLM集成到桌面环境消歧任务中,这是一个决策问题,机器人的任务是从任意大而复杂的对象集群中识别并取回用户想要的对象。尽管使用零样本提示工程进行了多次查询尝试(详见附录),LLM仍难以询问场景描述中未明确提供的特征。为此,我们开发了一个少样本提示工程系统,以提高LLM提出消歧查询的能力。其结果是一个既能在给定特征可用时使用这些特征、又能在必要时推断新的相关特征的模型,从而成功地生成精确的决策树并沿其导航到正确的对象,即使面对完全相同的候选项也是如此。
  • [论文下载:]http://arxiv.org/abs/2401.03388v1

  • 标题: Grimoire is All You Need for Enhancing Large Language Models
  • 作者: Ding Chen, Shichao Song, Qingchen Yu
  • 摘要: In-context learning (ICL) is one of the key methods for enhancing the
    performance of large language models on specific tasks by providing a set of
    few-shot question and answer examples. However, the ICL capability of different
    types of models shows significant variation due to factors such as model
    architecture, volume of learning data, and the size of parameters. Generally,
    the larger the model’s parameter size and the more extensive the learning data,
    the stronger its ICL capability. In this paper, we propose a method SLEICL
    (Strong LLM Enhanced ICL) that involves learning from examples using strong
    language models and then summarizing and transferring these learned skills to
    weak language models for inference and application. This ensures the stability
    and effectiveness of ICL. Compared to directly enabling weak language models to
    learn from prompt examples, SLEICL reduces the difficulty of ICL for these
    models. Our experiments, conducted on up to eight datasets with five language
    models, demonstrate that weak language models achieve consistent improvement
    over their own zero-shot or few-shot capabilities using the SLEICL method. Some
    weak language models even surpass the performance of GPT4-1106-preview
    (zero-shot) with the aid of SLEICL.
  • 中文摘要: 上下文学习(ICL)是通过提供一组少样本问答示例来提高大型语言模型在特定任务上性能的关键方法之一。然而,由于模型架构、学习数据量和参数规模等因素,不同类型模型的ICL能力表现出显著差异。通常,模型的参数规模越大、学习数据越广泛,其ICL能力就越强。在本文中,我们提出了一种方法SLEICL(Strong LLM Enhanced ICL),即使用强语言模型从示例中学习,然后将学到的技能总结并迁移到弱语言模型中用于推理和应用(下文附两阶段示意代码)。这确保了ICL的稳定性和有效性。与直接让弱语言模型从提示示例中学习相比,SLEICL降低了这些模型进行ICL的难度。我们在多达八个数据集和五个语言模型上进行的实验表明,弱语言模型使用SLEICL方法相对其自身的零样本或少样本能力实现了一致的提升。在SLEICL的帮助下,一些弱语言模型甚至超过了GPT4-1106-preview(零样本)的性能。
  • [论文下载:]http://arxiv.org/abs/2401.03385v1
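
SLEICL的两阶段流程可以用一个极简示意表达:强模型先把少样本示例总结成一份可复用的"技能书(grimoire)",再把它前置到弱模型的提示中。strong_llm 与 weak_llm 为占位的模型调用,提示词为我们的假设。

```python
# SLEICL 两阶段示意:强模型总结技能 -> 弱模型携带技能推理。
def sleicl(examples, query, strong_llm, weak_llm):
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    grimoire = strong_llm(
        "Summarize, as explicit step-by-step skills, how to solve tasks "
        f"like the following examples:\n{shots}")
    return weak_llm(f"Skills:\n{grimoire}\n\n"
                    f"Apply the skills above.\nQ: {query}\nA:")
```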

  • 标题: GazeCLIP: Towards Enhancing Gaze Estimation via Text Guidance
  • 作者: Jun Wang, Hao Ruan, Mingjie Wang
  • 摘要: Over the past decade, visual gaze estimation has garnered growing attention
    within the research community, thanks to its wide-ranging application
    scenarios. While existing estimation approaches have achieved remarkable
    success in enhancing prediction accuracy, they primarily infer gaze directions
    from single-image signals and discard the huge potential of the currently
    dominant text guidance. Notably, visual-language collaboration has been
    extensively explored across a range of visual tasks, such as image synthesis
    and manipulation, leveraging the remarkable transferability of large-scale
    Contrastive Language-Image Pre-training (CLIP) model. Nevertheless, existing
    gaze estimation approaches ignore the rich semantic cues conveyed by linguistic
    signals and priors in CLIP feature space, thereby yielding performance
    setbacks. In pursuit of making up this gap, we delve deeply into the text-eye
    collaboration protocol and introduce a novel gaze estimation framework in this
    paper, referred to as GazeCLIP. Specifically, we intricately design a
    linguistic description generator to produce text signals with coarse
    directional cues. Additionally, a CLIP-based backbone that excels in
    characterizing text-eye pairs for gaze estimation is presented. This is
    followed by the implementation of a fine-grained multi-modal fusion module
    aimed at modeling the interrelationships between heterogeneous inputs.
    Extensive experiments on three challenging datasets demonstrate the superiority
    of the proposed GazeCLIP which surpasses the previous approaches and achieves
    the state-of-the-art estimation accuracy.
  • 中文摘要: 在过去的十年里,由于其广泛的应用场景,视觉凝视估计在研究界引起了越来越多的关注。虽然现有的估计方法在提高预测精度方面取得了显著成功,但它们主要从单幅图像信号推断视线方向,而忽略了当前占主导地位的文本引导的巨大潜力。值得注意的是,利用大规模对比语言-图像预训练(CLIP)模型的显著可迁移性,视觉-语言协作已经在图像合成和编辑等一系列视觉任务中得到了广泛探索。然而,现有的凝视估计方法忽略了语言信号所传达的丰富语义线索以及CLIP特征空间中的先验,从而造成性能损失。为了弥补这一差距,我们深入研究了文本-眼睛协作机制,并在本文中引入了一种新的凝视估计框架,称为GazeCLIP。具体来说,我们精心设计了一个语言描述生成器,以产生带有粗略方向线索的文本信号(下文附示意代码)。此外,我们提出了一种基于CLIP的主干网络,它擅长表征用于凝视估计的文本-眼睛对。随后是一个细粒度多模态融合模块,旨在对异构输入之间的相互关系进行建模。在三个具有挑战性的数据集上的大量实验证明了所提出的GazeCLIP的优越性,它超越了以往的方法,达到了最先进的估计精度。
  • [论文下载:]http://arxiv.org/abs/2401.00260v2
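
下面用现成的Hugging Face CLIP给出"粗略方向文本提示 + 人脸图像"相似度打分的最小示意;GazeCLIP真实的融合模块远比这种直接打分复杂,提示模板亦为我们的假设。

```python
# 用现成CLIP对方向性文本提示与人脸裁剪图打分的最小示意。
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = [f"a photo of a face gazing {d}"        # 粗略方向线索(模板为假设)
         for d in ("left", "right", "up", "down")]
image = Image.open("face_crop.jpg")              # 占位输入图像

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```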

  • 标题: LLM-Powered Code Vulnerability Repair with Reinforcement Learning and
    Semantic Reward
  • 作者: Nafis Tanveer Islam, Joseph Khoury, Andrew Seong
  • 摘要: In software development, the predominant emphasis on functionality often
    supersedes security concerns, a trend gaining momentum with AI-driven
    automation tools like GitHub Copilot. These tools significantly improve
    developers’ efficiency in functional code development. Nevertheless, it remains
    a notable concern that such tools are also responsible for creating insecure
    code, predominantly because of pre-training on publicly available repositories
    with vulnerable code. Moreover, developers are called the “weakest link in the
    chain” since they have very minimal knowledge of code security. Although
    existing solutions provide reasonable fixes for vulnerable code, they must
    adequately describe and educate the developers on code security to ensure that
    the security issues are not repeated. Therefore we introduce a multipurpose
    code vulnerability analysis system \texttt{SecRepair}, powered by a large
    language model, CodeGen2 assisting the developer in identifying and generating
    fixed code along with a complete description of the vulnerability with a code
    comment. Our innovative methodology uses a reinforcement learning paradigm to
    generate code comments augmented by a semantic reward mechanism. Inspired by
    how humans fix code issues, we propose an instruction-based dataset suitable
    for vulnerability analysis with LLMs. We further identify zero-day and N-day
    vulnerabilities in 6 Open Source IoT Operating Systems on GitHub. Our findings
    underscore that incorporating reinforcement learning coupled with semantic
    reward augments our model’s performance, thereby fortifying its capacity to
    address code vulnerabilities with improved efficacy.
  • 中文摘要: 在软件开发中,对功能的片面强调往往压过了对安全的关注,这一趋势随着GitHub Copilot等人工智能驱动的自动化工具而愈演愈烈。这些工具显著提高了开发人员编写功能代码的效率。然而,值得注意的是,这些工具也会产生不安全的代码,这主要是因为它们在包含漏洞代码的公开存储库上进行了预训练。此外,由于开发人员对代码安全的了解非常有限,他们被称为"链条中最薄弱的一环"。尽管现有方案可以对漏洞代码给出合理的修复,但它们还必须向开发人员充分描述并普及代码安全知识,以确保安全问题不再重复出现。因此,我们引入了一个多用途代码漏洞分析系统SecRepair,它由大型语言模型CodeGen2提供支持,帮助开发人员识别并生成修复后的代码,同时以代码注释的形式给出漏洞的完整描述。我们的创新方法使用强化学习范式来生成由语义奖励机制增强的代码注释(下文附奖励示意代码)。受人类修复代码问题方式的启发,我们提出了一个适用于LLM漏洞分析的基于指令的数据集。我们还在GitHub上的6个开源物联网操作系统中发现了零日和N日漏洞。我们的研究结果强调,将强化学习与语义奖励相结合可以提升模型性能,从而更有效地修复代码漏洞。
  • [论文下载:]http://arxiv.org/abs/2401.03374v1
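
摘要未给出语义奖励的具体定义;下面是一种常见实现的示意:用句向量余弦相似度衡量生成的漏洞描述与参考描述的语义接近程度。嵌入模型的选择为我们的假设。

```python
# 语义奖励示意:生成注释与参考注释的句向量余弦相似度;模型选择为假设。
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_reward(generated: str, reference: str) -> float:
    emb = embedder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()   # 取值于[-1, 1],越高越好

print(semantic_reward("Buffer overflow due to unchecked strcpy.",
                      "Unbounded strcpy enables a stack buffer overflow."))
```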

  • 标题: PIXAR: Auto-Regressive Language Modeling in Pixel Space
  • 作者: Yintao Tai, Xiyang Liao, Alessandro Suglia
  • 摘要: Recent works showed the possibility of building open-vocabulary large
    language models (LLMs) that directly operate on pixel representations and are
    implemented as encoder-decoder models that reconstruct masked image patches of
    rendered text. However, these pixel-based LLMs are limited to autoencoding
    tasks and cannot generate new text as images. As such, they cannot be used for
    open-answer or generative language tasks. In this work, we overcome this
    limitation and introduce PIXAR, the first pixel-based autoregressive LLM that
    does not rely on a pre-defined vocabulary for both input and output text.
    Consisting of only a decoder, PIXAR can answer free-form generative tasks while
    keeping the text representation learning performance on par with previous
    encoder-decoder models. Furthermore, we highlight the challenges to
    autoregressively generate non-blurred text as images and link this to the usual
    maximum likelihood objective. We propose a simple adversarial pretraining that
    significantly improves the readability and performance of PIXAR making it
    comparable to GPT2 on short text generation tasks. This paves the way to
    building open-vocabulary LLMs that are usable for free-form generative tasks
    and questions the necessity of the usual symbolic input representation – text
    as tokens – for these challenging tasks.
  • 中文摘要: 最近的工作表明了建立开放词汇大语言模型(LLM)的可能性,该模型直接对像素表示进行操作,并被实现为编码器-解码器模型,用于重建渲染文本的掩蔽图像块。然而,这些基于像素的LLM仅限于自动编码任务,并且不能将新文本生成为图像。因此,它们不能用于开放回答或生成语言任务。在这项工作中,我们克服了这一限制,并引入了PIXAR,这是第一个基于像素的自回归LLM,它不依赖于输入和输出文本的预定义词汇表。PIXAR仅由解码器组成,可以回答自由形式的生成任务,同时保持文本表示学习性能与以前的编码器-解码器模型相当。此外,我们强调了自回归生成非模糊文本作为图像的挑战,并将其与通常的最大似然目标联系起来。我们提出了一种简单的对抗性预训练,它显著提高了PIXAR的可读性和性能,使其在短文本生成任务上与GPT2相当。这为构建可用于自由形式生成任务的开放词汇LLM铺平了道路,并质疑了这些具有挑战性的任务使用通常的符号输入表示(文本作为标记)的必要性
  • [论文下载:]http://arxiv.org/abs/2401.03321v1

  • 标题: Malla: Demystifying Real-world Large Language Model Integrated Malicious
    Services
  • 作者: Zilong Lin, Jian Cui, Xiaojing Liao
  • 摘要: The underground exploitation of large language models (LLMs) for malicious services (i.e., Malla) is witnessing an uptick, amplifying the cyber threat landscape and posing questions about the trustworthiness of LLM technologies. However, there has been little effort to understand this new cybercrime, in terms of its magnitude, impact, and techniques. In this paper, we conduct the first systematic study on 212 real-world Mallas, uncovering their proliferation in underground marketplaces and exposing their operational modalities. Our study discloses the Malla ecosystem, revealing its significant growth and impact on today’s public LLM services. Through examining 212 Mallas, we uncovered eight backend LLMs used by Mallas, along with 182 prompts that circumvent the protective measures of public LLM APIs. We further demystify the tactics employed by Mallas, including the abuse of uncensored LLMs and the exploitation of public LLM APIs through jailbreak prompts. Our findings enable a better understanding of the real-world exploitation of LLMs by cybercriminals, offering insights into strategies to counteract this cybercrime.
  • 中文摘要: 利用大型语言模型(LLM)提供恶意服务的地下活动(即Malla)正在抬头,这放大了网络威胁的版图,并对LLM技术的可信度提出了质疑。然而,目前几乎没有工作从规模、影响和技术层面来理解这种新型网络犯罪。在本文中,我们对212个真实世界的Malla进行了首次系统研究,揭示了它们在地下市场中的扩散情况及其运营模式。我们的研究揭示了Malla生态系统,展示了其显著的增长及其对当今公共LLM服务的影响。通过检查212个Malla,我们发现了Malla使用的8个后端LLM,以及182个规避公共LLM API保护措施的提示。我们进一步揭示了Malla所采用的策略,包括滥用未经审查的LLM,以及通过越狱提示利用公共LLM API。我们的研究结果有助于更好地理解网络犯罪分子在现实世界中对LLM的利用,并为打击这种网络犯罪的策略提供了见解。
  • [论文下载:]http://arxiv.org/abs/2401.03315v1

专属领域论文订阅

关注晓理紫,每日更新最新论文请转发给有需要的同学

{晓理紫}喜分享,也很需要你的支持,喜欢留下痕迹哦!
