今天带来一篇关于AI Agent的威胁综述。近期在Arxiv上出现的,这里翻译过来学习记录。
原文地址:https://arxiv.org/abs/2406.02630
作者及单位:
ZEHANG DENG∗, Swinburne University of Technology, Australia YONGJIAN GUO∗, Tianjin Univeristy, China CHANGZHOU HAN, Swinburne University of Technology, Australia WANLUN MA†, Swinburne University of Technology, Australia JUNWU XIONG, Ant Group, China SHENG WEN, Swinburne University of Technology, Australia YANG XIANG, Swinburne University of Technology, Australia
正文:
An Artificial Intelligence (AI) agent is a software entity that autonomously performs tasks or makes decisions based on pre-defined objectives and data inputs. AI agents, capable of perceiving user inputs, reasoning and planning tasks, and executing actions, have seen remarkable advancements in algorithm development and task performance. However, the security challenges they pose remain under-explored and unresolved. This survey delves into the emerging security threats faced by AI agents, categorizing them into four critical knowledge gaps: unpredictability of multi-step user inputs, complexity in internal executions, variability of operational environments, and interactions with untrusted external entities. By systematically reviewing these threats, this paper highlights both the progress made and the existing limitations in safeguarding AI agents. The insights provided aim to inspire further research into addressing the security threats associated with AI agents, thereby fostering the development of more robust and secure AI agent applications.
人工智能(AI)代理是一个软件实体,它可以自主执行任务或根据预定义的目标和数据输入做出决策。人工智能代理能够感知用户输入,推理和规划任务以及执行动作,在算法开发和任务性能方面取得了显着进步。然而,它们构成的安全挑战仍然没有得到充分探讨和解决。 该调查深入研究了人工智能代理面临的新出现的安全威胁,将其分为四个关键的知识差距:多步用户输入的不可预测性,内部执行的复杂性,操作环境的可变性以及与不受信任的外部实体的交互。通过系统地回顾这些威胁,本文强调了在保护AI代理方面取得的进展和存在的局限性。提供的见解旨在激发进一步研究解决与AI代理相关的安全威胁,从而促进更强大和安全的AI代理应用程序的开发。
1 Introduction 1引言
AI agents are computational entities that demonstrate intelligent behavior through autonomy, reactivity, proactiveness, and social ability. They interact with their environment and users to achieve specific goals by perceiving inputs, reasoning about tasks, planning actions, and executing tasks using internal and external tools. AI agents, powered by large language models (LLMs) such ∗Both authors contributed equally to this research. †Corresponding author.
AI代理是通过自主性,反应性,主动性和社交能力展示智能行为的计算实体。他们与环境和用户交互,通过感知输入、推理任务、规划行动以及使用内部和外部工具执行任务来实现特定目标。
Despite the significant advancements in AI agents, their increasing sophistication also introduces new security challenges. Ensuring AI agent security is crucial due to their deployment in diverse and critical applications. AI agent security refers to the measures and practices aimed at protecting AI agents from vulnerabilities and threats that could compromise their functionality, integrity, and safety. This includes ensuring the agents can securely handle user inputs, execute tasks, and interact with other entities without being susceptible to malicious attacks or unintended harmful behaviors. These security challenges stem from four knowledge gaps that, if unaddressed, can lead to vulnerabilities [27, 97, 112, 192] and potential misuse [132].
尽管人工智能代理取得了显着进步,但其日益复杂也带来了新的安全挑战。确保人工智能代理的安全性至关重要,因为它们部署在各种关键应用程序中。AI代理安全是指旨在保护AI代理免受可能危及其功能,完整性和安全性的漏洞和威胁的措施和实践。这包括确保代理可以安全地处理用户输入,执行任务,并与其他实体进行交互,而不会受到恶意攻击或意外有害行为的影响。这些安全挑战源于四个知识差距,如果不加以解决,可能导致漏洞[27,97,112,192]和潜在的滥用[132]。
As depicted in Figure 1, the four main knowledge gaps in AI agent are 1) unpredictability of multistep user inputs, 2) complexity in internal executions, 3) variability of operational environments, and 4) interactions with untrusted external entities. The following points delineate the knowledge gaps in detail.
如图1所示,人工智能代理中的四个主要知识缺口是:1)多步用户输入的不可预测性,2)内部执行的复杂性,3)操作环境的可变性,以及4)与不受信任的外部实体的交互。以下几点详细说明了知识差距。
-
Gap 1. Unpredictability of multi-step user inputs. Users play a pivotal role in interacting with AI agents, not only providing guidance during the initiation phase of tasks, but also influencing the direction and outcomes throughout task execution with their multi-turn feedback. The diversity of user inputs reflects varying backgrounds and experiences, guiding AI agents in accomplishing a multitude of tasks. However, these multi-step inputs also pose challenges, especially when user inputs are inadequately described, leading to potential security threats. Insufficient specification of user input can affect not only the task outcome, but may also initiate a cascade of unintended reactions, resulting in more severe consequences. Moreover, the presence of malicious users who intentionally direct AI agents to execute unsafe code or actions adds additional threats.
**差距1.多步用户输入的不可预测性。**用户在与人工智能代理的交互中发挥着关键作用,不仅在任务的启动阶段提供指导,而且还通过他们的多轮反馈影响整个任务执行的方向和结果。用户输入的多样性反映了不同的背景和经验,指导AI代理完成多种任务。然而,这些多步输入也带来了挑战,特别是当用户输入描述不充分时,会导致潜在的安全威胁。对用户输入的不充分规范不仅会影响任务结果,还可能引发一连串意外反应,导致更严重的后果。此外,恶意用户故意引导AI代理执行不安全代码或操作的存在增加了额外的威胁。Therefore, ensuring the clarity and security of user inputs is crucial for the effective and safe operation of AI agents. This necessitates the design of highly flexible AI agent ecosystems capable of understanding and adapting to the variability in user input, while also ensuring robust security measures are in place to prevent malicious activities and misleading user inputs.
因此,确保用户输入的清晰性和安全性对于AI代理的有效和安全操作至关重要。这就需要设计高度灵活的人工智能代理生态系统,能够理解和适应用户输入的变化,同时确保采取强大的安全措施,以防止恶意活动和误导用户的输入。 -
Gap 2. Complexity in internal executions. The internal execution state of an AI agent is a complex chain-loop structure, ranging from the reformatting of prompts to LLM planning tasks and the use of tools. Many of these internal execution states are implicit, making it difficult to observe the detailed internal states. This leads to the threat that many security issues cannot be detected in a timely manner. AI agent security needs to audit the complex internal execution of single AI agents.
**差距2.内部执行复杂。**AI代理的内部执行状态是一个复杂的链-环结构,从提示的重新格式化到LLM规划任务和工具的使用。这些内部执行状态中有许多是隐式的,因此很难观察到详细的内部状态。这导致许多安全问题无法及时发现的威胁。AI代理安全需要审计单个AI代理的复杂内部执行。 -
Gap 3. Variability of operational environments. In practice, the development, deployment, and execution phases of many agents span across various environments. The variability of these environments can lead to inconsistent behavioral outcomes. For example, an agent tasked with executing code could run the given code on a remote server, potentially leading to dangerous operations. Therefore, securely completing work tasks across multiple environments presents a significant challenge.
**差距3.作战环境的多变性。**在实践中,许多代理的开发、部署和执行阶段跨越各种环境。这些环境的可变性可能导致不一致的行为结果。例如,负责执行代码的代理可能会在远程服务器上运行给定的代码,这可能会导致危险的操作。因此,在多个环境中安全地完成工作任务是一个重大挑战。 -
Gap 4. Interactions with untrusted external entities. A crucial capability of an AI agent is to teach large models how to use tools and other agents. However, the current interaction process between AI agents and external entities assumes a trusted external entity, leading to a wide range of practical attack surfaces, such as indirect prompt injection attack [49]. It is challenging for AI agents to interact with other untrusted entities.
**差距4.与不受信任的外部实体的交互。**人工智能代理的一个关键功能是教大型模型如何使用工具和其他代理。然而,目前AI代理与外部实体之间的交互过程假设了一个可信的外部实体,导致了广泛的实际攻击面,例如间接提示注入攻击[49]。AI代理与其他不可信实体进行交互是一项挑战。
While some research efforts have been made to address these gaps, comprehensive reviews and systematic analyses focusing on AI agent security are still lacking. Once these gaps are bridged, AI agents will benefit from improved task outcomes due to clearer and more secure user inputs, enhanced security and robustness against potential attacks, consistent behaviors across various operational environments, and increased trust and reliability from users. These improvements will promote broader adoption and integration of AI agents into critical applications, ensuring they can perform tasks safely and effectively.
虽然已经做出了一些研究努力来解决这些差距,但仍然缺乏针对人工智能代理安全的全面审查和系统分析。一旦这些差距被弥合,人工智能代理将受益于更清晰和更安全的用户输入,增强的安全性和针对潜在攻击的鲁棒性,在各种操作环境中的一致行为,以及用户的信任和可靠性。这些改进将促进人工智能代理更广泛的采用和集成到关键应用程序中,确保它们能够安全有效地执行任务。
Existing surveys on AI agents [87, 105, 160, 186, 211] primarily focus on their architectures and applications, without delving deeply into the security challenges and solutions. Our survey aims to fill this gap by providing a detailed review and analysis of AI agent security, identifying potential solutions and strategies for mitigating these threats. The insights provided are intended to inspire further research into addressing the security threats associated with AI agents, thereby fostering the development of more robust and secure AI agent applications.
现有的关于人工智能代理的调查[87,105,160,186,211]主要集中在它们的架构和应用程序上,而没有深入研究安全挑战和解决方案。我们的调查旨在填补这一空白,提供对人工智能代理安全的详细审查和分析,确定缓解这些威胁的潜在解决方案和策略。所提供的见解旨在激发进一步研究解决与AI代理相关的安全威胁,从而促进更强大和安全的AI代理应用程序的开发。
In this survey, we systematically review and analyze the threats and solutions of AI agent security based on four knowledge gaps, covering both the breadth and depth aspects. We primarily collected papers from top AI conferences, top cybersecurity conferences, and highly cited arXiv papers, spanning from January 2022 to April 2024. AI conferences are included, but not limited to: NeurIPs, ICML, ICLR, ACL, EMNLP, CVPR, ICCV, and IJCAI. Cybersecurity conferences are included but not limited: IEEE S&P, USENIX Security, NDSS, ACM CCS.
在本次调查中,我们基于四个知识缺口,从广度和深度两个方面,系统地回顾和分析了人工智能主体安全的威胁和解决方案。我们主要收集了2022年1月至2024年4月期间的顶级人工智能会议、顶级网络安全会议和高引用arXiv论文。AI会议包括但不限于:NeurIPs,ICML,ICLR,ACL,EMNLP,CVPR,ICCV和IJCAI。网络安全会议包括但不限于:IEEE S&P,USENIX Security,NDSS,ACM CCS。
The paper is organized as follows. Section 2 introduces the overview of AI agents. Section 3 depicts the single-agent security issue associated with Gap 1 and Gap 2. Section 4 analyses multi-agent security associated with Gap 3 and Gap 4. Section 5 offers future directions for the development of this field.
本文的结构如下。第2节介绍了AI代理的概述。第3节描述了与Gap 1和Gap 2相关的单代理安全问题。第4节分析了与Gap 3和Gap 4相关的多代理安全性。第5节为这一领域的发展提供了未来的方向。
2 Overview Of Ai Agent AI Agent 统一概念框架下的AI Agent概述
Terminologies. To facilitate understanding, we introduce the following terms in this paper.
术语。为了便于理解,我们在本文中介绍了以下术语。
Reasoning refers to a large language model designed to analyze and deduce information, helping to draw logical conclusions from given prompts. Planning, on the other hand, denotes a large language model tailored to assist in devising strategies and making decisions by evaluating possible outcomes and optimizing for specific objectives. The combination of LLMs for planning and reasoning is called the brain. External Tool callings are together named as the action. We name the combination of perception, brain, and action as Intra-execution in this survey. On the other hand, except for intra-execution, AI agents can interact with other AI agents, memories, and environments; we call it Interaction. These terminologies also could be explored in detail at [186].
推理是指一种大型语言模型,旨在分析和推断信息,帮助从给定的提示中得出逻辑结论。另一方面,规划表示一个大型语言模型,用于通过评估可能的结果和优化特定目标来帮助设计策略和做出决策。用于计划和推理的LLMs的组合被称为大脑。外部工具调用一起命名为操作。我们将感知、大脑和行动的结合称为内部执行。另一方面,除了内部执行,AI代理可以与其他AI代理,记忆和环境交互;我们称之为交互。这些术语也可以在[186]中详细探讨。
In 1986, a study by Mukhopadhyay et al. [116] proposed multiple intelligent node document servers to efficiently retrieve knowledge from multimedia documents through user queries. The following work [10] also discovered the potential of computer assistants by interacting between the user and the computing system, highlighting significant research and application directions in the field of computer science. Subsequently, Wooldridge et al. [183] defined the computer assistant that demonstrates intelligent behavior as an agent. In the developing field of artificial intelligence, the agent is then introduced as a computational entity with properties of autonomy, reactivity, pro-activeness, and social ability [186]. Nowadays, thanks to the powerful capacity of large language models, the AI agent has become a predominant tool to assist users in performing tasks efficiently. As shown in Figure 2, the general workflow of AI agents typically comprises two core components: Intra-execution and Interaction. Intra-execution of the AI agent typically indicates the functionalities running within the single-agent architecture, including perception, brain, and action. Specifically, the perception provides brain with effective inputs, and the action deals with these inputs in subtasks by the LLM reasoning and planning capacities. Then, these subtasks are run sequentially by the action to invoke the tools. ① and ② indicates the iteration processes of the intra-execution. Interaction refers to the ability of an AI agent to engage with other external entities, primarily through external resources. This includes collaboration or competition within the multi-agent architecture, retrieval of memory during task execution, and the deployment of environment and its data use from external tools. Note that in this survey, we define memory as an external resource because the majority of memory-related security risks arise from the retrieval of external resources.
1986年,Mukhopadhyay et al. [116]提出了多个智能节点文档服务器,以通过用户查询从多媒体文档中有效地检索知识。接下来的工作[10]也发现了计算机助理通过用户和计算系统之间的交互的潜力,突出了计算机科学领域的重要研究和应用方向。随后,Wooldridge et al. [183]定义了一个计算机助理,它将智能行为表现为一个代理。在人工智能的发展领域中,智能体被引入作为具有自主性,反应性,主动性和社会能力的计算实体[186]。如今,由于大型语言模型的强大功能,人工智能代理已成为帮助用户高效执行任务的主要工具。 如图2所示,AI代理的一般工作流程通常包括两个核心组件:内部执行和交互。AI代理的内部执行通常表示在单代理架构内运行的功能,包括感知,大脑和动作。具体而言,感知为大脑提供有效的输入,行动通过LLM推理和规划能力在子任务中处理这些输入。LLM然后,这些子任务由操作顺序运行以调用工具。①和②表示内部执行的迭代过程。交互是指人工智能主体与其他外部实体进行交互的能力,主要是通过外部资源。 这包括多代理架构中的协作或竞争,任务执行期间的内存检索,以及环境的部署及其外部工具的数据使用。请注意,在本调查中,我们将内存定义为外部资源,因为大多数与内存相关的安全风险都来自外部资源的检索。
AI agents can be divided into reinforcement-learning-based agents and LLM-based agents from the perspective of their core internal logic. RL-based agents use reinforcement learning to learn and optimize strategies through environment interaction, with the aim of maximizing accumulated rewards. These agents are effective in environments with clear objectives such as instruction following [75, 124] or building world model [108, 140], where they adapt through trial and error.
人工智能主体从其核心内部逻辑的角度可以分为基于重复学习的主体和基于LLM主体。基于RL的代理使用强化学习来学习和优化策略,通过环境交互,以最大化累积的奖励为目的。这些代理在具有明确目标的环境中是有效的,例如遵循指令[75,124]或建立世界模型[108,140],在那里他们通过试错来适应。
In contrast, LLM-based agents rely on large-language models [92, 173, 195]. They excel in natural language processing tasks, leveraging vast textual data to master language complexities for effective communication and information retrieval. Each type of agent has distinct capabilities to achieve specific computational tasks and objectives.
相比之下,基于LLM的代理依赖于大型语言模型[92,173,195]。他们擅长自然语言处理任务,利用大量的文本数据来掌握语言的复杂性,以实现有效的沟通和信息检索。每种类型的代理都有不同的能力来实现特定的计算任务和目标。
2.2 Overview Of Ai Agent On Threats . AI Agent对威胁的研究综述
As of now, there are several surveys on AI agents [87, 105, 160, 186, 211]. For instance, Xi et al. [186] offer a comprehensive and systematic review focused on the applications of LLM-based agents, aiming to examine existing research and future possibilities in this rapidly developing field. The literature [105] summarized the current AI agent architecture. However, they do not adequately assess the security and trustworthiness of AI agents. Li et al. [87] failed to consider both the capability and security of multi-agent scenario. A study [160] provides the potential risks inherent only to scientific LLM agents. Zhang et al. [211] only survey on the memory mechanism of AI agents.
到目前为止,有几项关于AI代理的调查[87,105,160,186,211]。例如,Xi et al. [186]提供了一个全面和系统的审查集中在LLM为基础的代理商,旨在检查现有的研究和未来的可能性,在这个迅速发展的领域。文献[105]总结了当前的AI代理架构。然而,他们没有充分评估人工智能代理的安全性和可信度。Li等人[87]未能同时考虑多代理场景的能力和安全性。一项研究[160]提供了仅科学LLM试剂固有的潜在风险。Zhang等人[211]仅对人工智能主体的记忆机制进行了综述。
Our main focus in this work is on the security challenges of AI agents aligned with four knowledge gaps. As depicted in Table 1, we have provided a summary of papers that discuss the security challenges of AI agents. Threat Source column identifies the attack strategies employed at various stages of the general AI agent workflow, categorized into four gaps. Threat Model column identifies potential adversarial attackers or vulnerable entities. Target Effects summarize the potential outcomes of security-relevant issues.
我们在这项工作中的主要重点是AI代理的安全挑战与四个知识差距。如表1所示,我们提供了讨论AI代理安全挑战的论文摘要。威胁来源列确定了一般AI代理工作流程的各个阶段所采用的攻击策略,分为四个差距。威胁模型列识别潜在的敌对攻击者或脆弱实体。目标效应总结了安全相关问题的潜在结果。
We also provide a novel taxonomy of threats to the AI agent (See Figure 3). Specifically, we identify threats based on their source positions, including intra-execution and interaction.
我们还为AI代理提供了一种新的威胁分类(见图3)。具体而言,我们根据其来源位置(包括内部执行和**交互)**识别威胁。
3 Intra-Execution Security
3内部执行安全
As mentioned in Gap 1 and 2, the single agent system has unpredictable multi-step user inputs and complex internal executions. In this section, we mainly explore these complicated intra-execution threats and their corresponding countermeasures. As depicted in Figure 2, we discuss the threats of the three main components of the unified conceptual framework on the AI agent.
如差距1和2中所述,单代理系统具有不可预测的多步用户输入和复杂的内部执行。在这一部分中,我们主要探讨这些复杂的内部执行威胁及其相应的对策。如图2所示,我们讨论了统一概念框架的三个主要组件对AI代理的威胁。
3.1 Threats On Perception
3.1感知威胁
As illustrated in Figure 2 and Gap 1, to help the brain module understand system instruction, user input, and external context, the perception module includes multi-modal (i.e., textual, visual, and auditory inputs) and multi-step (i.e., initial user inputs, intermediate sub-task prompts, and human feedback) data processing during the interaction between humans and agents. The typical means of communication between humans and agents is through prompts. The threat associated with prompts is the most prominent issue for AI agents. This is usually named adversarial attacks. An adversarial attack is a deliberate attempt to confuse or trick the brain by inputting misleading or specially crafted prompts to produce incorrect or biased outputs. Through adversarial attacks, malicious users extract system prompts and other information from the contextual window [46]. Liu et al. [94] were the first to investigate adversarial attacks against the embodied AI agent, introducing spatiotemporal perturbations to create 3D adversarial examples that result in agents providing incorrect answers. Mo et al. [110] analyzed twelve hypothetical attack scenarios against AI agents based on the different threat models. The adversarial attack on the perception module includes prompt injection attacks [23, 49, 49, 130, 185, 196], indirect prompt injection attacks [23, 49, 49, 130, 185, 196] and jailbreak [15, 50, 83, 161, 178, 197]. To better explain the threats associated with prompts in this section, we first present the traditional structure of a prompt.
如图2和差距1所示,为了帮助大脑模块理解系统指令、用户输入和外部上下文,感知模块包括多模态(即文本、视觉和听觉输入)和多步骤(即,初始用户输入、中间子任务提示和人反馈)在人和代理之间的交互期间的数据处理。人类和代理之间的典型通信方式是通过提示。与提示相关的威胁是AI代理最突出的问题。这通常被称为对抗性攻击。对抗性攻击是一种故意试图通过输入误导性或特制的提示来混淆或欺骗大脑,以产生不正确或有偏见的输出。通过对抗性攻击,恶意用户从上下文窗口中提取系统提示和其他信息[46]。Liu等人[94]是第一个研究针对具体AI代理的对抗性攻击的人,引入时空扰动来创建3D对抗性示例,导致代理提供不正确的答案Mo等人[110]根据不同的威胁模型分析了12种针对AI代理的假设攻击场景。对感知模块的对抗性攻击包括提示注入攻击[23,49,49,130,185,196]、间接提示注入攻击[23,49,49,130,185,196]和越狱[15,50,83,161,178,197]。为了更好地解释本节中与提示相关的威胁,我们首先介绍提示的传统结构。
The agent prompt structure can be composed of instruction, external context, user input. Instructions are set by the agent’s developers to define the specific tasks and goals of the system.
代理提示结构可以由指令、外部上下文、用户输入组成。指令由代理的开发人员设置,以定义系统的特定任务和目标。
The external context comes from the agent’s working memory or external resources, while user input is where a benign user can issue the query to the agent. In this section, the primary threats of jailbreak and prompt injection attacks originate from the instructions and user input, while the threats of indirect injection attacks stem from external contexts.
外部上下文来自代理的工作内存或外部资源,而用户输入是良性用户可以向代理发出查询的地方。在本节中,越狱和提示注入攻击的主要威胁来自指令和用户输入,而间接注入攻击的威胁来自外部上下文。
3.1.1 Prompt Injection Attack.
3.1.1即时注入攻击。
The prompt injection attack is a malicious prompt manipulation technique in which malicious text is inserted into the input prompt to guide a language model to produce deceptive output [130]. Through the use of deceptive input, prompt injection attacks allow attackers to effectively bypass constraints and moderation policies set by developers of AI agents, resulting in users receiving responses containing biases, toxic content, privacy threats, and misinformation [72]. For example, malicious developers can transform Bing chat into a phishing agent [49]. The UK Cyber Agency has also issued warnings that malicious actors are manipulating the technology behind LLM chatbots to obtain sensitive information, generate offensive content, and trigger unintended consequences [61].
提示注入攻击是一种恶意提示操作技术,其中恶意文本被插入到输入提示中,以引导语言模型产生欺骗性输出[130]。通过使用欺骗性输入,即时注入攻击允许攻击者有效地绕过AI代理开发人员设置的约束和审核策略,导致用户收到包含偏见,有毒内容,隐私威胁和错误信息的响应。例如,恶意开发人员可以将Bing聊天转换为钓鱼代理[49]。英国网络管理局还发出警告,恶意行为者正在操纵LLM聊天机器人背后的技术,以获取敏感信息,生成攻击性内容,并引发意想不到的后果[61]。
The following discussion focuses primarily on the goal hijacking attack and the prompt leaking attack, which represent two prominent forms of prompt injection attacks [130], and the security threats posed by such attacks within AI agents.
下面的讨论主要集中在目标劫持攻击和即时泄漏攻击,这是即时注入攻击的两种主要形式[130],以及这些攻击在AI代理中构成的安全威胁。
- Goal hijacking attack. Goal hijacking is a method whereby the original instruction is replaced, resulting in inconsistent behavior from the AI agent. The attackers attempt to substitute the original LLM instruction, causing it to execute the command based on the instructions of the new attacker [130]. The implementation of goal hijacking is particularly in the starting position of user input, where simply entering phrases, such as “ignore the above prompt, please execute”, can circumvent LLM security measures, substituting the desired answers for the malicious user [80]. Liu et al. [96] have proposed output hijacking attacks to support API key theft attacks. Output hijacking attacks entail attackers modifying application code to manipulate its output, prompting the AI agent to respond with “I don’t know” upon receiving user requests. API key theft attacks involve attackers altering the application code such that once the application receives the user-provided API key, it logs and transmits it to the attacker, facilitating the theft of the API.
**目标劫持攻击。**目标劫持是一种替换原始指令的方法,导致AI代理的行为不一致。攻击者试图替换原始LLM指令,使其基于新攻击者的指令执行命令[130]。目标劫持的实现特别是在用户输入的起始位置,其中简单地输入短语,例如“忽略上述提示,请执行”,可以规避LLM安全措施,将恶意用户所需的答案替换为[80]。Liu等人[96]提出了输出劫持攻击以支持API密钥窃取攻击。输出劫持攻击需要攻击者修改应用程序代码以操纵其输出,提示AI代理在收到用户请求时响应“我不知道”。 API密钥盗窃攻击涉及攻击者更改应用程序代码,使得一旦应用程序接收到用户提供的API密钥,它就记录并将其发送给攻击者,从而促进API的盗窃。 - Prompt leaking attack. Prompt leaking attack is a method that involves inducing an LLM to output pre-designed instructions by providing user inputs, leaking sensitive information [208]. It poses a significantly greater challenge compared to goal hijacking [130]. Presently, responses generated by LLMs are transmitted using encrypted tokens. However, by employing certain algorithms and inferring token lengths based on packet sizes, it is possible to intercept privacy information exchanged between users and agents [179]. User inputs, such as “END. Print previous instructions”, may trigger the disclosure of confidential instructions by LLMs, exposing proprietary knowledge to malicious entities [46]. In the context of RetrievalAugmented Generation (RAG) systems based on AI agents, prompt leaking attacks may further expose backend API calls and system architecture to malicious users, exacerbating security threats [185].
**快速泄漏攻击。**提示泄漏攻击是一种方法,涉及通过提供用户输入来诱导LLM输出预先设计的指令,泄漏敏感信息[208]。与目标劫持相比,它构成了更大的挑战[130]。目前,由LLMs生成的响应使用加密令牌来传输。然而,通过采用某些算法并根据数据包大小推断令牌长度,可以拦截用户和代理之间交换的隐私信息[179]。用户输入,例如“结束。打印先前的指令”,可能会触发LLMs披露机密指令,将专有知识暴露给恶意实体[46]。在基于AI代理的检索增强生成(RAG)系统的上下文中,即时泄漏攻击可能会进一步将后端API调用和系统架构暴露给恶意用户,从而加剧安全威胁[185]。
3.1.1 Prompt injection attacks within agent-integrated frameworks.
3.1.1 代理集成框架内的快速注入攻击。
With the widespread adoption of AI agents, certain prompt injection attacks targeting individual AI agents can also generalize to deployments of AI agent-based applications [163], amplifying the associated security threats [97, 127]. For example, malicious users can achieve Remote Code Execution (RCE) through prompt injection, thereby remotely acquiring permissions for integrated applications [96]. Additionally, carefully crafted user inputs can induce AI agents to generate malicious SQL queries, compromising data integrity and security [127]. Furthermore, integrating these attacks into corresponding webpages alongside the operation of AI agents [49] leads to users receiving responses that align with the desires of the malicious actors, such as expressing biases or preferences towards products [72].
随着人工智能代理的广泛采用,针对单个人工智能代理的某些即时注入攻击也可以推广到基于人工智能代理的应用程序的部署[163],放大了相关的安全威胁[97,127]。例如,恶意用户可以通过提示注入实现远程代码执行(RCE),从而远程获取集成应用程序的权限[96]。此外,精心制作的用户输入可能会诱导AI代理生成恶意SQL查询,从而损害数据完整性和安全性[127]。此外,将这些攻击与AI代理的操作一起集成到相应的网页中[49]会导致用户收到与恶意行为者的期望相一致的响应,例如表达对产品的偏见或偏好[72]。
In the case of closed-source AI agent integrated commercial applications, certain black-box prompt injection attacks [97] can facilitate the theft of service instruction [193], leveraging the computational capabilities of AI agents for zero-cost imitation services, resulting in millions of dollars in losses for service providers [97].
在闭源AI代理集成商业应用的情况下,某些黑盒提示注入攻击[97]可以促进服务指令的窃取[193],利用AI代理的计算能力进行零成本模仿服务,导致服务提供商损失数百万美元。
AI agents are susceptible to meticulously crafted prompt injection attacks [193], primarily due to conflicts between their security training and user instruction objectives [212]. Additionally, AI agents often prioritize system prompts on par with texts from untrusted users and third parties [168]. Therefore, establishing hierarchical instruction privileges and enhancing training methods for these models through synthetic data generation and context distillation can effectively improve the robustness of AI agents against prompt injection attacks [168]. Furthermore, the security threats posed by prompt injection attacks can be mitigated by various techniques, including inference-only methods for intention analysis [209], API defenses with added detectors [68], and black-box defense techniques involving multi-turn dialogues and context examples [3, 196].
AI代理容易受到精心制作的即时注入攻击[193],主要是由于其安全培训和用户指令目标之间的冲突[212]。此外,人工智能代理通常将系统提示与来自不受信任的用户和第三方的文本进行优先级排序[168]。因此,通过合成数据生成和上下文蒸馏建立分层指令特权并增强这些模型的训练方法可以有效提高AI代理对即时注入攻击的鲁棒性[168]。此外,可以通过各种技术来减轻即时注入攻击带来的安全威胁,包括用于意图分析的仅推理方法[209],添加检测器的API防御[68]以及涉及多轮对话和上下文示例的黑盒防御技术[3,196]。
To address the security threats inherent in agent-integrated frameworks, researchers have proposed relevant potential defensive strategies. Liu et al. [96] introduced LLMSMITH, which performs static analysis by scanning the source code of LLM-integrated frameworks to detect potential Remote Code Execution (RCE) vulnerabilities. Jiang et al. [72] proposed four key attributesintegrity, source identification, attack detectability, and utility preservation-to define secure LLMintegrated applications and introduced the shield defense to prevent manipulation of queries from users or responses from AI agents by internal and external malicious actors.
为了解决代理集成框架中固有的安全威胁,研究人员提出了相关的潜在防御策略。Liu等人[96]引入了LLMSMITH,它通过扫描LLM集成框架的源代码来执行静态分析,以检测潜在的远程代码执行(RCE)漏洞。Jiang等人[72]提出了四个关键属性完整性,源识别,攻击可检测性和实用程序验证-以定义安全的LLM集成应用程序,并引入了屏蔽防御以防止内部和外部恶意行为者操纵来自用户的查询或来自AI代理的响应。
3.1.2 Indirect Prompt Injection Attack.
3.1.2 间接快速注入攻击。
Indirect prompt injection attack [49] is a form of attack where malicious users strategically inject instruction text into information retrieved by AI agents [40], web pages [184], and other data sources. This injected text is often returned to the AI agent as internal prompts, triggering erroneous behavior, and thereby enabling remote influence over other users’ systems. Compared to prompt injection attacks, where malicious users attempt to directly circumvent the security restrictions set by AI agents to mislead their outputs, indirect prompt injection attacks are more complex and can have a wider range of user impacts [57]. When plugins are rapidly built to secure AI agents, indirect prompt injection can also be introduced into the corresponding agent frameworks. When AI agents use external plugins to query data injected with malicious instructions, it may lead to security and privacy issues. For example, web data retrieved by AI agents using web plugins could be misinterpreted as user instructions, resulting in extraction of historical conversations, insertion of phishing links, theft of GitHub code [204], or transmission of sensitive information to attackers [185]. More detailed information can also be found in Section 3.3.2. One of the primary reasons for the successful exploitation of indirect prompt injection on AI agents is the inability of AI agents to differentiate between valid and invalid system instructions from external resources. In other words, the integration of AI agents and external resources further blurs the distinction between data and instructions [49].
间接提示注入攻击[49]是一种攻击形式,恶意用户策略性地将指令文本注入到AI代理[40],网页[184]和其他数据源检索的信息中。这种注入的文本通常会作为内部提示返回给AI代理,触发错误行为,从而对其他用户的系统产生远程影响。与即时注入攻击相比,恶意用户试图直接绕过人工智能代理设置的安全限制以误导其输出,间接即时注入攻击更复杂,可以产生更广泛的用户影响[57]。当快速构建插件以保护AI代理时,也可以将间接提示注入引入相应的代理框架。当AI代理使用外部插件来查询注入恶意指令的数据时,可能会导致安全和隐私问题。 例如,AI代理使用Web插件检索的Web数据可能会被误解为用户指令,导致提取历史对话,插入钓鱼链接,窃取GitHub代码[204]或将敏感信息传输给攻击者[185]。更详细的信息也可以在第3.3.2节中找到。在AI代理上成功利用间接提示注入的主要原因之一是AI代理无法区分来自外部资源的有效和无效系统指令。换句话说,人工智能代理和外部资源的集成进一步模糊了数据和指令之间的区别[49]。
To defend against indirect prompt attacks, developers can impose explicit constraints on the interaction between AI agents and external resources to prevent AI agents from executing external malicious data [185]. For example, developers can augment AI agents with user input references by comparing the original user input and current prompts and incorporating self-reminder functionalities. When user input is first entered, agents are reminded of their original user input references, thus distinguishing between external data and user inputs [14]. To reduce the success rate of indirect prompt inje