AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

Table of Contents

Introduction

Large Language Model (LLM) agents are increasingly being deployed in critical applications ranging from autonomous driving to healthcare. These agents often leverage Retrieval-Augmented Generation (RAG) to enhance their performance by accessing external knowledge or memory stores. While this improves their capabilities, it also introduces significant security vulnerabilities that have not been adequately explored.

AGENTPOISON overview

Figure 1: Overview of the AGENTPOISON attack. The top panel shows how a backdoor trigger injected into a user instruction can cause an LLM agent to retrieve malicious demonstrations and execute harmful actions. The bottom panel illustrates the iterative trigger optimization process that makes the attack effective.

The paper "AGENTPOISON: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases" by Chen et al. (2024) introduces a novel backdoor attack specifically designed to compromise LLM agents by poisoning their knowledge bases or memory stores. This research represents an important contribution to our understanding of AI security risks, as it demonstrates that even a very small number of malicious entries in a knowledge base can lead to dangerous agent behaviors when triggered.

Background

LLM agents employ retrieval mechanisms to access relevant information from external sources when processing user queries. These mechanisms generally work by:

  1. Encoding the user query into an embedding representation
  2. Retrieving the most similar items from the knowledge base or memory
  3. Using the retrieved information to inform the agent's response

While this approach enhances the agent's capabilities, it also creates an attack surface that can be exploited. Previous research has explored various attack vectors against LLMs, including:

  • Jailbreaking attacks: Crafting prompts to bypass safety mechanisms
  • Backdoor attacks: Embedding triggers that cause models to exhibit specific behaviors
  • Prompt injection: Inserting malicious instructions that override intended functionality

However, these approaches have limitations when applied to LLM agents with RAG systems. Jailbreaking attacks may be mitigated by the retrieval process, while existing backdoor attacks often use suboptimal triggers that don't effectively target the retrieval mechanism.

The AGENTPOISON Attack

AGENTPOISON represents a significant advancement in backdoor attacks by specifically targeting the retrieval component of LLM agents. The attack works as follows:

  1. The attacker poisons a small portion of the agent's knowledge base or memory with malicious demonstrations
  2. The attacker crafts an optimized trigger phrase that, when included in a user query, maximizes the likelihood of retrieving the poisoned entries
  3. When a user unwittingly includes the trigger in their query, the agent retrieves the malicious demonstrations and produces harmful outputs

What makes AGENTPOISON particularly dangerous is its:

  • Effectiveness: Achieves high attack success rates (≥80%) with minimal impact on benign performance (≤1%)
  • Efficiency: Requires poisoning very small portions of the knowledge base (<0.1%)
  • Stealthiness: The optimized triggers appear natural and are difficult to detect

Trigger Optimization Method

The core technical innovation of AGENTPOISON is its trigger optimization methodology. Unlike previous approaches that use arbitrary or manually designed triggers, AGENTPOISON systematically optimizes triggers to maximize attack effectiveness while maintaining natural appearance.

The trigger optimization involves solving a constrained optimization problem with three key objectives:

  1. Uniqueness: The trigger should map to a unique region in the embedding space that is distinct from benign queries
  2. Compactness: The trigger should ensure that all malicious demonstrations are tightly clustered in the embedding space
  3. Target Action: The trigger should maximize the likelihood of the agent producing the desired malicious action

Embedding visualization

Figure 2: Visualization of embeddings during AGENTPOISON optimization. From left to right: (a) Initial poisoned embeddings from a baseline attack, (b) Initial iteration of AGENTPOISON, (c) After 10 iterations, and (d) After 15 iterations, showing clear separation of triggered query embeddings.

The optimization is formulated mathematically as:

min⁡TLuni(T)+λcptLcpt(T)\begin{align} \min_{T} \mathcal{L}_{\text{uni}}(T) + \lambda_{\text{cpt}} \mathcal{L}_{\text{cpt}}(T) \end{align}

where:

  • Luni\mathcal{L}_{\text{uni}} represents the uniqueness loss
  • Lcpt\mathcal{L}_{\text{cpt}} represents the compactness loss
  • λcpt\lambda_{\text{cpt}} is a weighting parameter

This optimization is solved using a gradient-guided beam search algorithm that efficiently navigates the discrete token space while adhering to coherence constraints.

Figure 3 illustrates the progressive clustering of embeddings across different retrieval embedders during the optimization process, demonstrating how AGENTPOISON creates clear separation between benign and adversarial embeddings.

Progressive embedding clustering

Figure 3: Evolution of embedding distributions across different retrieval embedders over optimization iterations, showing how triggered queries become increasingly clustered in a separate region.

Experimental Setup

The authors evaluated AGENTPOISON across three types of LLM agents:

  1. Agent-Driver: An autonomous driving agent that uses RAG to inform driving decisions
  2. ReAct-StrategyQA: A knowledge-intensive question-answering agent
  3. EHRAgent: A healthcare agent that interacts with electronic health records

For each agent, they tested different:

  • LLM backbones (ChatGPT, LLaMA-3)
  • RAG embedders (DPR, ANCE, BGE, REALM, ORQA, ADA)
  • Poisoning rates (0.01% to 1%)
  • Trigger lengths (1 to 10 tokens)

The evaluation metrics included:

  • ASR-r: Attack success rate for retrieval (percentage of poisoned entries retrieved)
  • ASR-a: Attack success rate for target action (percentage of trials producing harmful action)
  • ASR-t: End-to-end attack success rate
  • ACC: Benign accuracy (performance on non-triggered queries)

Key Results

AGENTPOISON demonstrated remarkable effectiveness across all tested scenarios:

  1. High Attack Success: Achieved ASR-r and ASR-a of ≥80% on average across agent types
  2. Maintained Benign Performance: Preserved ACC within 1% of unpoisoned models
  3. Minimal Poisoning Required: Effective even with poisoning rates as low as 0.01%
  4. Applicable to Various Triggers: Works with triggers of different lengths, even single-token triggers

Figures 4 and 5 illustrate examples of the attack in action on the autonomous driving agent:

Benign demonstration retrieval

Figure 4: Example of the autonomous driving agent retrieving a benign demonstration with the backdoor trigger present but failing to execute the harmful action.

Successful attack demonstration

Figure 5: Example of the successful attack where the triggered agent retrieves poisoned demonstrations and executes the harmful "SUDDEN STOP" action.

Compared to baseline approaches like random poisoning and contrastive poison attacks (CPA), AGENTPOISON achieves significantly higher success rates while maintaining benign performance:

Performance comparison

Figure 6: Comparison of AGENTPOISON with baseline approaches showing superior attack success rates (ASR-r) and benign accuracy (ACC) preservation across poisoning rates and trigger sizes.

Transferability and Robustness

One of the most concerning aspects of AGENTPOISON is its strong transferability across different retrieval embedders. Triggers optimized for one embedder often remain effective when the system uses a different embedder:

Transferability results

Figure 7: Transferability results showing attack success rates and benign accuracy across different source and target embedders.

The attack also demonstrates remarkable robustness to potential defenses:

  1. Perplexity-Based Detection: AGENTPOISON triggers maintain perplexity distributions similar to benign queries, making them difficult to detect based on perplexity alone.

Perplexity distribution

Figure 8: Perplexity distributions of benign queries, AGENTPOISON triggers, and a baseline attack, showing how AGENTPOISON maintains natural perplexity profiles.

  1. Rephrasing Defenses: Even when user queries containing triggers are rephrased, the attack often remains effective.

  2. Diverse Augmentations: The optimized triggers are robust to various text augmentations, maintaining effectiveness under modifications.

Defense Considerations

The paper discusses several potential defense strategies against AGENTPOISON:

  1. Input Sanitization: Filtering or removing suspected trigger phrases
  2. Knowledge Base Verification: Implementing stricter verification of entries in the knowledge base
  3. Robust Retrieval Methods: Developing retrieval mechanisms that are less susceptible to embedding manipulation
  4. Anomaly Detection: Monitoring for unusual patterns in retrieved content
  5. Adversarial Training: Training agents to be robust against poisoned demonstrations

However, each of these defenses has limitations, and the authors emphasize that completely defending against such attacks remains challenging. The most promising approach appears to be a combination of defenses tailored to specific agent implementations.

Implications and Significance

AGENTPOISON reveals a critical vulnerability in LLM agents that has not been previously explored in depth. The research has several important implications:

  1. Safety Risks in Critical Applications: The attack demonstrates how LLM agents in autonomous driving, healthcare, or other critical domains could be manipulated to take harmful actions
  2. Trust in Knowledge Sources: Highlights the importance of verifying and securing the knowledge sources that agents rely on
  3. Adversarial Machine Learning: Advances our understanding of how embedding spaces can be manipulated in retrieval systems
  4. Red-teaming Value: Demonstrates the importance of proactive security testing of AI systems

The study calls for increased attention to securing knowledge bases and memory stores for LLM agents, particularly in high-stakes applications. It also underscores the need for robust verification mechanisms and the dangers of relying on potentially unvetted information sources.

By exposing this vulnerability through careful red-teaming, the researchers provide valuable insights that can help develop more secure LLM agent architectures in the future, potentially preventing real-world exploitation of these weaknesses.

Relevant Citations

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020.

  • This citation introduces Retrieval-Augmented Generation (RAG), a key concept for the paper, as AGENTPOISON focuses on attacking LLM agents that utilize RAG.

Zexuan Zhong, Ziqing Huang, Alexander Wettig, and Danqi Chen. Poisoning retrieval corpora by injecting adversarial passages. arXiv preprint arXiv:2310.19156, 2023.

  • This work explores poisoning attacks on retrieval corpora, which directly relates to AGENTPOISON's attack strategy of poisoning the memory or knowledge base of LLM agents.

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.

  • This citation details Dense Passage Retrieval (DPR), a specific type of RAG that AGENTPOISON considers and evaluates against, enhancing the practicality of the proposed attack.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020.

  • This work describes REALM, a retrieval-augmented language model pre-training approach, that AGENTPOISON utilizes as one of the embedders, broadening the scope of the attack evaluation.

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

  • This citation introduces ReAct, one of the three LLM agents targeted by AGENTPOISON, making it central to the evaluation and demonstration of the attack's effectiveness.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值