Unique Security and Privacy Threats of Large Language Models: A Comprehensive Survey
大型语言模型的独特安全与隐私威胁:综合调查
Abstract. 摘要。
With the rapid development of artificial intelligence, large language models (LLMs) have made remarkable advancements in natural language processing. These models are trained on vast datasets to exhibit powerful language understanding and generation capabilities across various applications, including chatbots, and agents. However, LLMs have revealed a variety of privacy and security issues throughout their life cycle, drawing significant academic and industrial attention. Moreover, the risks faced by LLMs differ significantly from those encountered by traditional language models. Given that current surveys lack a clear taxonomy of unique threat models across diverse scenarios, we emphasize the unique privacy and security threats associated with four specific scenarios: pre-training, fine-tuning, deployment, and LLM-based agents. Addressing the characteristics of each risk, this survey outlines and analyzes potential countermeasures. Research on attack and defense situations can offer feasible research directions, enabling more areas to benefit from LLMs.
随着人工智能的快速发展,大型语言模型(LLMs)在自然语言处理方面取得了显著进步。这些模型在庞大的数据集上进行训练,以在各种应用中展示强大的语言理解和生成能力,包括聊天机器人和代理。然而,LLMs 在其生命周期中已经暴露出多种隐私和安全问题,引起了学术界和工业界的广泛关注。此外,LLMs 面临的风险与传统语言模型所遇到的风险存在显著差异。鉴于当前调查缺乏对不同场景下独特威胁模型的清晰分类,我们强调了与四个特定场景相关的独特隐私和安全威胁:预训练、微调、部署和基于 LLM 的代理。针对每种风险的特征,本调查概述并分析了潜在的应对措施。对攻击和防御情况的研究可以提供可行的研究方向,使更多领域能够受益于 LLMs。
大型语言模型,智能体,安全和隐私风险,模型鲁棒性
ccs:信息系统 语言模型††ccs: Security and privacy
ccs:安全和隐私
1. Introduction 1.引言
With the rapid development of artificial intelligence (AI) technology, researchers have progressively expanded the scale of training data and model architectures (Zhao et al., 2023b). Trained with massive amounts of data, extremely large-scale models demonstrate impressive language understanding and generation capabilities (Das et al., 2025), marking a significant breakthrough in natural language processing (NLP). Referred to as Large Language Models (LLMs), these models provide robust support for machine translation, and other NLP tasks. However, the in-depth application of LLMs across various industries, such as chatbots (Chowdhury et al., 2023), exposes their life cycle to numerous privacy and security threats. More importantly, LLMs face unique privacy and security threats (Cui et al., 2024) not present in traditional language models, necessitating higher standards for privacy protection and security defenses.
随着人工智能(AI)技术的快速发展,研究人员逐步扩大了训练数据的规模和模型架构(Zhao et al., 2023b)。在大量数据训练下,超大规模模型展现出令人印象深刻的语言理解和生成能力(Das et al., 2025),标志着自然语言处理(NLP)的重大突破。这些被称为大型语言模型(LLMs)的模型为机器翻译和其他 NLP 任务提供了强大的支持。然而,LLMs 在聊天机器人(Chowdhury et al., 2023)等各个行业的深入应用,使其生命周期面临众多隐私和安全威胁。更重要的是,LLMs 面临着传统语言模型所不具备的独特隐私和安全威胁(Cui et al., 2024),这要求更高的隐私保护标准和安全防御要求。
1.1. Motivation 1.1. 动机
Compared to traditional single-function language models, LLMs demonstrate remarkable comprehension abilities and are deployed across various applications, such as code generation. Recently, an increasing number of companies have been launching universal or domain-specific LLMs, such as ChatGPT (Chowdhury et al., 2023) and LLaMA (Touvron et al., 2023), offering users versatile and intelligent services. However, due to LLMs’ unique capabilities and structures, they encounter unique privacy and security threats throughout their life cycle compared to previous small-scale language models (Yao et al., 2024a). Existing surveys only describe various risks and countermeasures by method type, but lack exploration into threat scenarios and these unique threats.
与传统单一功能语言模型相比,LLMs 展现出卓越的理解能力,并被应用于代码生成等多种场景。近年来,越来越多的公司开始推出通用型或特定领域的 LLMs,如 ChatGPT(Chowdhury 等人,2023)和 LLaMA(Touvron 等人,2023),为用户提供多样化、智能化的服务。然而,由于 LLMs 独特的功能和结构,与以往的小规模语言模型相比(Yao 等人,2024a),它们在其生命周期中面临着独特的隐私和安全威胁。现有的调查仅通过方法类型描述了各种风险和应对措施,但缺乏对威胁场景和这些独特威胁的深入探讨。
Therefore, we divide the life cycle of LLMs into four scenarios: pre-training, fine-tuning, deployment, and LLM-based agents. In the pre-training stage, upstream developers train large-scale Transformer models (Liu et al., 2023c) on massive corpora to acquire general language knowledge; in the fine-tuning stage, downstream developers adapt these models to specific tasks through methods such as instruction tuning (Zhang et al., 2024c) and alignment tuning (Sun et al., 2024a); in the deployment stage, LLMs are released to serve users, often enhanced by techniques like in-context learning (Liu et al., 2023c) or retrieval-augmented generation (RAG) (Zou et al., 2024); finally, LLMs integrate memory and external tools (He et al., 2024), enabling more complex tasks and proactive human–computer interaction. Based on this life cycle, we next discuss the unique privacy and security risks inherent to each stage.
因此,我们将 LLMs 的生命周期分为四个场景:预训练、微调、部署和基于 LLM 的智能体。在预训练阶段,上游开发者在大规模语料库上训练大型 Transformer 模型(刘等,2023c),以获取通用语言知识;在微调阶段,下游开发者通过指令微调(张等,2024c)和对齐微调(孙等,2024a)等方法,将这些模型适配于特定任务;在部署阶段,LLMs 被发布以服务用户,通常通过情境学习(刘等,2023c)或检索增强生成(RAG)(邹等,2024)等技术进行增强;最后,LLMs 整合记忆和外部工具(何等,2024),实现更复杂的任务和主动的人机交互。基于这一生命周期,我们接下来将讨论每个阶段所特有的隐私和安全风险。
Unique privacy risks. When learning language knowledge from training data, LLMs tend to memorize this data (Carlini et al., 2022). This tendency allows adversaries to extract private information. For example, Carlini et al. (Carlini et al., 2021) found that prompts with specific prefixes could cause GPT-2 to generate content containing personal information, such as email addresses and phone numbers. During inference, unrestricted use of LLMs provides adversaries with opportunities to extract model-related information (Zhu et al., 2022) and functionalities (Yang et al., 2024d).
独特的隐私风险。当从训练数据中学习语言知识时,LLMs 倾向于记住这些数据(Carlini 等人,2022)。这种倾向使攻击者能够提取私人信息。例如,Carlini 等人(Carlini 等人,2021)发现,带有特定前缀的提示会导致 GPT-2 生成包含个人信息的内容,例如电子邮件地址和电话号码。在推理过程中,LLMs 的无限制使用为攻击者提供了提取与模型相关的信息(Zhu 等人,2022)和功能(Yang 等人,2024d)的机会。
Unique security risks. Since the training data may contain malicious, illegal, and biased texts, LLMs inevitably acquire negative language knowledge. Moreover, malicious third parties involved in developing LLMs in outsourcing scenarios can compromise these models’ integrity and utility through poisoning and backdoor attacks (Wang et al., 2023a; Zhou et al., 2022). For example, an adversary could implant a backdoor in an LLM-based automated customer service system, causing it to respond with a fraudulent link when asked specific questions. During inference, unrestricted use of LLMs allows adversaries to obtain targeted responses (Wei et al., 2023), such as fake news and illegal content.
独特的安全风险。由于训练数据可能包含恶意、非法和有偏见的文本,LLMs 不可避免地会获取负面语言知识。此外,在外包场景中参与开发 LLMs 的恶意第三方可以通过毒化和后门攻击来破坏这些模型的完整性和效用(Wang et al., 2023a; Zhou et al., 2022)。例如,攻击者可以在基于 LLM 的自动客户服务系统中植入后门,在询问特定问题时导致其响应虚假链接。在推理过程中,LLMs 的无限制使用允许攻击者获得有针对性的响应(Wei et al., 2023),例如假新闻和非法内容。
These unique privacy and security risks severely threaten public and individual safety, violating existing laws such as the General Data Protection Regulation (GDPR). For instance, an intern at ByteDance was charged with a fine of 1 million dollars for injecting malicious code into a shared model. Additionally, these risks will reduce the credibility of LLMs and hinder their popularity. In this context, there is a lack of systematic research on the unique privacy and security threats of LLMs. This prompts us to analyze, categorize, and summarize the existing research to complete a comprehensive survey in this field. This research can help the technical community develop safe and reliable LLM-based applications, enabling more areas to benefit from LLMs.
这些独特的隐私和安全风险严重威胁着公共和个人安全,违反了《通用数据保护条例》(GDPR)等现有法律。例如,一位字节跳动实习生因向共享模型注入恶意代码而被罚款 100 万美元。此外,这些风险将降低 LLMs 的可信度,并阻碍其普及。在这种情况下,对于 LLMs 独特的隐私和安全威胁缺乏系统性的研究。这促使我们分析、分类和总结现有研究,以完成该领域的全面调查。这项研究可以帮助技术社区开发安全可靠的基于 LLMs 的应用,使更多领域受益于 LLMs。
1.2. Comparison with existing surveys
1.2 与现有调查的比较
Research on LLMs’ privacy and security is rapidly developing, but existing surveys lack a comprehensive taxonomy and summary of LLMs’ unique parts. In Table 1, we compare our survey with 10 highly influential surveys on the privacy and security of LLMs since May 2025. The main differences lie in four key aspects.
LLMs 的隐私和安全研究正在快速发展,但现有调查缺乏对 LLMs 独特部分的全面分类和总结。在表 1 中,我们将我们的调查与自 2025 年 5 月以来 10 项具有高度影响力的关于 LLMs 隐私和安全的调查进行比较。主要差异体现在四个关键方面。
-
•
Threat scenarios. We explicitly divide the life cycle of LLMs into four threat scenarios, which most surveys overlooked. Each scenario corresponds to multiple threat models, such as pre-training involves malicious contributors, upstream and downstream developers, as shown in Figure 2. For each threat model, adversaries can compromise the LLMs’ safety through various attacks.
• 威胁场景。我们明确将 LLMs 的生命周期划分为四个威胁场景,这是大多数调查所忽视的。每个场景对应多个威胁模型,例如预训练涉及恶意贡献者、上游和下游开发者,如图 2 所示。对于每个威胁模型,攻击者可以通过各种攻击方式危害 LLMs 的安全。 -
•
Taxonomy. We categorize the threats LLMs face based on their life cycle. Other surveys lacked a fine-grained taxonomy, making it difficult to distinguish the characteristics of various threats.
• 分类法。我们根据 LLMs 的生命周期对所面临的威胁进行分类。其他调查缺乏精细的分类法,使得难以区分各种威胁的特征。 -
•
Unique threats. We focus on the unique privacy and security threats to LLMs. Additionally, we briefly explore the common parts associated with all language models. In response to these threats, we summarize potential countermeasures, and analyze their advantages and limitations. However, most surveys just list attacks and defense methods without depth analysis.
• 独特威胁。我们专注于针对 LLMs 的独特隐私和安全威胁。此外,我们还简要探讨了与所有语言模型相关的共同部分。针对这些威胁,我们总结了潜在的应对措施,并分析了它们的优缺点。然而,大多数调查只是列出攻击和防御方法,缺乏深度分析。 -
•
Other unique scenarios. We incorporate LLMs within two additional scenarios: machine unlearning, and watermarking. These scenarios can address some threats, but bring new risks. We provide a systematic study while most surveys overlooked.
• 其他独特场景。我们将 LLMs 纳入两个额外的场景:机器去学习,和数字水印。这些场景可以应对一些威胁,但会带来新的风险。我们进行了系统研究,而大多数调查都忽视了这一点。
1.3. Contributions of this survey
1.3. 本调查的贡献
LLMs have found widespread applications across numerous industries. However, many vulnerabilities in their life cycle pose significant privacy and security threats. These risks seriously threaten public safety and violate laws. Hence, we propose a novel taxonomy for these threats, providing a comprehensive analysis of their goals, causes, and implementation methods. Meanwhile, potential countermeasures are analyzed and summarized. We hope this survey provides researchers with feasible research directions for LLMs’ safety. The main contributions are as follows.
LLMs 已广泛应用于众多行业。然而,它们生命周期中的许多漏洞构成了严重的隐私和安全威胁。这些风险严重威胁公共安全并违反法律。因此,我们提出了针对这些威胁的新分类法,全面分析了它们的目标、原因和实施方法。同时,分析了并总结了潜在的应对措施。我们希望本调查能为 LLMs 的安全研究提供可行的研究方向。主要贡献如下。
-
Taking the LLMs’ life cycle as a clue, we consider risks and countermeasures in four different scenarios, including pre-training, fine-tuning, deployment and LLM-based agents. This division prompts us to clearly define attackers and defenders under different scenarios.
以 LLMs 的生命周期为线索,我们考虑在四个不同场景中的风险和应对措施,包括预训练、微调、部署和基于 LLM 的代理。这种划分促使我们在不同场景下明确定义攻击者和防御者。 -
For each scenario, we highlight the differences in privacy and security threats between LLMs and traditional language models. Specifically, we describe unique threats to LLMs and common parts to all models. For each risk, we detail its attack capacity and goal, and review related studies.
对于每个场景,我们强调了 LLMs 与传统语言模型在隐私和安全威胁方面的差异。具体来说,我们描述了针对 LLMs 的独特威胁以及所有模型的共同部分。对于每种风险,我们详细说明了其攻击能力和目标,并回顾了相关研究。 -
To address these privacy and security threats, we collect the potential countermeasures in detail, and analyze their assumptions, advantages and limitations.
为了应对这些隐私和安全威胁,我们详细收集了潜在的应对措施,并分析了它们的假设、优点和局限性。 -
We conduct an in-depth discussion on the other unique privacy and security scenarios for LLMs, including machine unlearning and watermarking.
我们对 LLMs 的其他独特隐私和安全场景进行了深入讨论,包括机器脱机和水印技术。
表 1. 与现有调查的比较。R&C 表示风险和应对措施,MR&EA 表示映射关系和实证分析。
| Authors 作者 | Release 发布 |
|
Taxonomy 分类法 | Privacy 隐私 | Security 安全 | Other Scenario 其他场景 | |||||||||
| Pre-training 预训练 | Fine-tuning 微调 | Deploying 部署 |
|
R & C | MR & EA | R & C | MR & EA | Unlearning 去学习 | Watermarking 水印 | ||||||
| Gupta et al. (Gupta et al., 2023) Gupta 等人(Gupta 等人,2023) |
2023.8 | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ & ✗ | ✗ & ✗ | ✓ & ✗ | ✗ & ✗ | ✗ | ✗ | |||
| Cui et al. (Cui et al., 2024) 崔等人(Cui et al., 2024) |
2024.1 | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ & ✓ | ✗ & ✗ | ✓ & ✓ | ✗ & ✗ | ✗ | ✓ | |||
| Yan et al. (Yan et al., 2024a) 严等人(Yan et al., 2024a) |
2024.3 | ✗ | ✓ | ✓ | ✓ | ✗ | ✓ & ✓ | ✓ & ✗ | ✗ & ✗ | ✗ & ✗ | ✓ | ✗ | |||
| Wu et al. (Wu et al., 2023) 吴等人(Wu et al., 2023) |
2024.3 | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ & ✗ | ✗ & ✗ | ✓ & ✗ | ✗ & ✗ | ✗ | ✗ | |||
| Dong et al. (Dong et al., 2024) 董等人(Dong 等人,2024) |
2024.5 | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ & ✓ | ✗ & ✗ | ✓ & ✓ | ✗ & ✗ | ✗ | ✗ | |||
| Yao et al. (Yao et al., 2024a) 姚等人(Yao et al., 2024a) |
2024.6 | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ & ✓ | ✓ & ✗ | ✓ & ✓ | ✓ & ✗ | ✗ | ✓ | |||
| He et al. (He et al., 2024) 何等人(He et al., 2024) |
2024.7 | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ & ✓ | ✓ & ✗ | ✓ & ✓ | ✓ & ✗ | ✗ | ✗ | |||
| Huang et al. (Huang et al., 2024a) 黄等人(Huang et al., 2024a) |
2024.12 | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ & ✗ | ✗ & ✗ | ✓ & ✓ | ✓ & ✗ | ✗ | ✗ | |||
| Das et al. (Das et al., 2025) Das 等人 (Das 等人, 2025) |
2025.2 | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ & ✓ | ✗ & ✗ | ✓ & ✓ | ✗ & ✗ | ✗ | ✗ | |||
| Wang et al. (Wang et al., 2025c) 王 等人 (王 等人, 2025c) |
2025.4 | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ & ✓ | ✓ & ✗ | ✓ & ✓ | ✓ & ✗ | ✓ | ✗ | |||
| Ours 我们的 | 2025.5 | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ & ✓ | ✓ & ✓ | ✓ & ✓ | ✓ & ✓ | ✓ | ✓ | |||
2. Preliminaries 2.预备知识
2.1. Definition of LLM 2.1.LLM 的定义
LLMs represent a revolution in the field of NLP. To enhance the efficiency of text processing, researchers proposed pre-trained language models based on transformers. Google released the BERT model, which uses bidirectional transformers, solving downstream tasks in the ‘pre-train + fine-tune’ paradigm. Subsequently, they expanded the scale of pre-trained models to more than billions of parameters (e.g., GPT-3) and introduced novel techniques. These large-scale models showcase remarkable emergent abilities not found in regular-scale models, capable of handling unseen tasks through in-context learning (Xu et al., 2024a) (i.e., without retraining) and instruction tuning (Shu et al., 2023) (i.e., lightweight fine-tuning). Recent studies have summarized four key characteristics that LLMs should possess (Zhao et al., 2023b; Xu et al., 2024b). First, Wei et al. (Wei et al., 2022) found that language models with more than 1 billion parameters exhibit significant performance improvements on multiple NLP tasks. Therefore, an LLM should possess more than a billion parameters. Second, it can understand natural language to solve various NLP tasks. Third, when provided with prompts, an LLM should generate high-quality texts that align with human expectations. Also, it demonstrates special capacities, such as in-context learning. Numerous institutions have developed LLMs with these characteristics, such as GPT-series and Llama-series models. Especially, GPT-4o, with around 200 billion parameters, achieves an 88.7% accuracy on the Massive Multitask Language Understanding (MMLU) benchmark, and accurately solves complex math tasks through chain-of-thought (CoT) prompting. As one of the most advanced LLMs to date, DeepSeek-R1 (Guo et al., 2025) leverages its 671 billion parameters and innovative architecture, such as Mixture-of-Experts, to achieve a 90.8% accuracy on the MMLU benchmark. Through CoT prompting, it can generate detailed multi-step reasoning processes, demonstrating strong performance on complex reasoning tasks. However, traditional language models, such as LSTM and BERT, have limited parameters ( 1 billion). They are single-function and cannot understand natural language. These differences give rise to unique privacy and security risks for LLMs.
LLMs 代表了自然语言处理领域的革命。为了提高文本处理的效率,研究人员提出了基于 transformer 的预训练语言模型。谷歌发布了 BERT 模型,该模型使用双向 transformer,在“预训练+微调”范式下解决下游任务。随后,他们扩大了预训练模型的规模到超过数十亿个参数(例如 GPT-3),并引入了新技术。这些大规模模型展示了常规规模模型所不具备的显著涌现能力,能够通过情境学习(Xu 等人,2024a)(即无需重新训练)和指令微调(Shu 等人,2023)(即轻量级微调)处理未见过的任务。最近的研究总结了 LLM 应具备的四个关键特征(Zhao 等人,2023b;Xu 等人,2024b)。首先,Wei 等人(Wei 等人,2022)发现,参数超过 10 亿的模型在多个自然语言处理任务上表现出显著性能提升。因此,LLM 应具备超过 10 亿个参数。其次,它可以理解自然语言以解决各种自然语言处理任务。 第三,当提供提示时,LLM 应生成符合人类期望的高质量文本。此外,它还展示了特殊能力,如在情境中学习。许多机构已开发出具有这些特征的 LLM,例如 GPT 系列和 Llama 系列模型。特别是 GPT-4o,拥有约 2000 亿个参数,在大型多任务语言理解(MMLU)基准测试中达到 88.7%的准确率,并通过思维链(CoT)提示准确解决复杂的数学任务。作为迄今为止最先进的 LLM 之一,DeepSeek-R1(Guo 等人,2025 年)利用其 6710 亿个参数和创新架构(如专家混合模型),在 MMLU 基准测试中达到 90.8%的准确率。通过 CoT 提示,它可以生成详细的分步推理过程,在复杂推理任务上表现出色。然而,传统语言模型(如 LSTM 和 BERT)参数有限( 10 亿)。它们是单功能的,无法理解自然语言。这些差异为 LLM 带来了独特的隐私和安全风险。
2.2. Traditional privacy and security risks
2.2 传统隐私和安全风险
Recent research on privacy and security risks in artificial intelligence has primarily focused on traditional language models.
近年来,人工智能的隐私和安全风险研究主要集中于传统语言模型。
Regarding privacy risks, the life cycle of traditional models contains sensitive information such as raw data and model details. Leakage of this information could lead to severe economic losses (Yan et al., 2024a). Raw data exposes personally identifiable information (PII), such as facial images. Reconstruction attacks (Morris et al., 2024) and model inversion attacks (Wang et al., 2025c) can extract raw data using gradients or logits. Additionally, membership and attribute information are sensitive. For example, in medical tasks, adversaries can use membership inference attacks (Ye et al., 2022) to determine if an input belongs to the training set, revealing some users’ health conditions. Model details have significant commercial value and are vulnerable to model extraction attacks (Li et al., 2024b), which target black-box victim models to obtain substitute counterparts or partial model information by multiple queries. Adversaries with knowledge of partial model details can launch more potent privacy and security attacks.
就隐私风险而言,传统模型的生命周期包含原始数据、模型细节等敏感信息。这些信息的泄露可能导致严重的经济损失(Yan et al., 2024a)。原始数据暴露了个人身份信息(PII),例如面部图像。重建攻击(Morris et al., 2024)和模型反演攻击(Wang et al., 2025c)可以通过梯度或 logits 提取原始数据。此外,成员信息和属性信息也是敏感的。例如,在医疗任务中,攻击者可以使用成员推理攻击(Ye et al., 2022)来判断输入是否属于训练集,从而揭示部分用户的健康状况。模型细节具有显著的商业价值,容易受到模型提取攻击(Li et al., 2024b)的威胁,这种攻击针对黑盒受害者模型,通过多次查询获取替代模型或部分模型信息。了解部分模型细节的攻击者可以发起更强大的隐私和安全攻击。
Regarding security risks, traditional models face poisoning attacks (Wan et al., 2023), which compromise model utility by tampering with the training data. A backdoor attack is a variant of poisoning attacks (Wang et al., 2023a; Ma et al., 2024). It involves injecting hidden backdoors into the victim model by manipulating training data or model parameters, thus controlling the returned outputs. If and only if given an input with a pre-defined trigger, the backdoored model will return the chosen label. During inference, adversarial example attacks (Guo et al., 2021) craft adversarial inputs by adding imperceptible perturbations, causing incorrect predictions.
关于安全风险,传统模型面临中毒攻击(Wan 等人,2023 年),这种攻击通过篡改训练数据来损害模型的效用。后门攻击是中毒攻击的一种变体(Wang 等人,2023a;Ma 等人,2024 年)。它通过操纵训练数据或模型参数向受害者模型注入隐藏的后门,从而控制返回的输出。只有当输入包含预定义的触发器时,后门模型才会返回选定的标签。在推理过程中,对抗样本攻击(Guo 等人,2021 年)通过添加难以察觉的扰动来构造对抗输入,导致错误预测。
The life cycle of LLMs shares similarities with, yet also differs from, that of traditional models. As illustrated in Figure 1, we divide the life cycle of LLMs into four scenarios, and each part involves unique and common data types and implementation processes. Based on this, we explore unique and common risks and their corresponding countermeasures.
LLMs 的生命周期与传统模型相似,但也存在差异。如图 1 所示,我们将 LLMs 的生命周期分为四个场景,每个部分都涉及独特和常见的数据类型及实现流程。基于此,我们探讨了独特和常见的风险及其相应的对策。
图 1. 我们调查的流程。对于每个威胁场景,第一列列出所使用的数据类型,第二列描述所应用的过程。文本框标示出 LLMs 的独特数据类型和过程。第四列和第五列详细说明相应的风险和对策。值得注意的是,下划文字代表 LLMs 中的独特风险。
3. Threat Scenarios for LLMs
3. LLMs 的威胁场景
Although many institutions have disclosed the implementation methods of their LLMs, some details remain unknown. We conduct our search on Google Scholar, arXiv, and IEEE Xplore for privacy and security studies related to LLMs, published between 2021 and 2025. Specifically, the search used general keywords like ‘Large Language Model, Safety, Privacy, Security, and Risk.’ In addition, subfield-specific terms are included. For example, jailbreak studies are collected by using terms like ‘Guardrail, Alignment, and Illegal.’ After retrieving the initial set of papers, we refine the selection by prioritizing highly cited works and publications in top AI and cybersecurity conferences, as well as journals ranked CORE A∗/A. Based on these collected studies, we divide the life cycle of LLMs into four scenarios with finer granularity rather than just the training and inference phases. Figure 1 illustrates them: pre-training LLMs, fine-tuning LLMs, deploying LLMs, and deploying LLM-based agents. We then list all risks for each scenario, using underlined texts to highlight unique parts for LLMs.
尽管许多机构已经披露了其 LLMs 的实现方法,但一些细节仍不为人知。我们在 Google Scholar、arXiv 和 IEEE Xplore 上搜索了 2021 年至 2025 年间发表的与 LLMs 相关的隐私和安全研究。具体而言,搜索使用了诸如“大型语言模型、安全、隐私、安全和风险”等通用关键词。此外,还包括了特定子领域的术语。例如,通过使用“Guardrail、Alignment 和非法”等术语收集越狱研究。在检索到初始论文集后,我们通过优先考虑高被引作品以及顶级人工智能和网络安全会议及 CORE A ∗ /A 排名的期刊来优化选择。基于这些收集的研究,我们将 LLMs 的生命周期细分为四个具有更细粒度的场景,而不仅仅是训练和推理阶段。图 1 说明了它们:预训练 LLMs、微调 LLMs、部署 LLMs 和部署基于 LLMs 的代理。然后,我们列出了每个场景的所有风险,使用下划文本来突出 LLMs 的独特部分。
3.1. Pre-training LLMs 3.1. 预训练 LLMs
In this scenario, upstream model developers collect a large corpus as a pre-training dataset, including books (Zhu et al., 2015), web pages (e.g., Wikipedia), conversational texts (e.g., Reddit), and code (e.g., Stack Exchange). They then use large-scale, Transformer-based networks and advanced training algorithms, enabling the models to learn rich language knowledge from vast amounts of unlabeled texts. After obtaining the pre-trained LLM, the developers upload it to open-source community platforms to gain profits, as shown in Figure 2. In this context, we consider three malicious entities: data contributors, upstream and downstream developers
在这种场景下,上游模型开发者收集一个大型语料库作为预训练数据集,包括书籍(Zhu 等人,2015 年)、网页(例如,维基百科)、对话文本(例如,Reddit)和代码(例如,Stack Exchange)。然后他们使用基于 Transformer 的大规模网络和先进的训练算法,使模型能够从大量未标记的文本中学习丰富的语言知识。获得预训练的 LLM 后,开发者将其上传到开源社区平台以获取利润,如图 2 所示。在此背景下,我们考虑三个恶意实体:数据贡献者、上游和下游开发者
Data contributors. Unlike traditional models, the corpora involved in pre-training LLMs are so large that upstream developers cannot audit all the data, resulting in the inevitable inclusion of negative texts (e.g., toxic data and private data). These negative texts directly impact the safety of LLMs. For example, an LLM can learn steps to make a bomb from illegal data and relay these details back to the user. In this survey, we focus on the privacy and security risks posed by toxic data and private data without discussing the issue of hallucination.
数据贡献者。与传统模型不同,用于预训练 LLMs 的语料库规模如此之大,以至于上游开发者无法审计所有数据,从而导致负面文本(例如,有毒数据和私人数据)不可避免地被包含在内。这些负面文本直接影响了 LLMs 的安全性。例如,一个 LLM 可以从非法数据中学习制造炸弹的步骤,并将这些细节传递给用户。在本调查中,我们专注于由有毒数据和私人数据所引发的隐私和安全风险,而不讨论幻觉问题。
Upstream developers. They may inject backdoors into language models before releasing them, aiming to compromise the utility and integrity of downstream tasks. If victim downstream developers download and deploy a compromised model, attackers who know the trigger can easily activate the hidden backdoor, thus manipulating the model’s output.
上游开发者。他们在发布语言模型之前可能会注入后门,旨在破坏下游任务的效用和完整性。如果受害下游开发者下载并部署了受损模型,知道触发条件的攻击者可以轻易激活隐藏的后门,从而操纵模型的输出。
Downstream developers. After downloading public models, they can access the model’s information except for the training data, effectively becoming white-box attackers. Consequently, these developers can perform inference and data extraction attacks in a white-box setting.
下游开发者。在下载公共模型后,他们可以访问模型信息,但无法获取训练数据,实际上成为白盒攻击者。因此,这些开发者在白盒环境下可以执行推理和数据提取攻击。
3.2. Fine-tuning LLMs
In this scenario, downstream developers customize LLMs for specific NLP tasks. They download pre-trained LLMs from open-source platforms and fine-tune them on customized datasets. There are four fine-tuning methods: supervised learning, instruction-tuning, alignment-tuning, and parameter-efficient fine-tuning (PEFT). The first method is the commonly used training algorithm. For the second method, the instruction is in natural language format, containing a task description, an optional demonstration, and an input-output pair (Wang et al., 2023b). Through a sequence-to-sequence loss, instruction-tuning helps LLMs understand and generalize to unseen tasks. The third method aligns LLMs’ outputs with human preferences, such as usefulness, honesty, and harmlessness. To meet these goals, Ziegler et al. (Ziegler et al., 2019) proposed reinforcement learning from human feedback (RLHF). The last method aims to adapt LLMs to downstream tasks by introducing lightweight trainable components, without updating the full model. It significantly reduces the computational, storage, and data costs of full fine-tuning while preserving task accuracy. Figure 3 illustrates two types of malicious entities: third parties and data contributors.
在这个场景中,下游开发者针对特定的自然语言处理任务定制 LLMs。他们从开源平台下载预训练的 LLMs,并在定制数据集上进行微调。微调方法有四种:监督学习、指令微调、对齐微调和参数高效微调(PEFT)。第一种方法是常用的训练算法。对于第二种方法,指令以自然语言格式呈现,包含任务描述、可选的示例以及输入-输出对(Wang 等人,2023b)。通过序列到序列损失,指令微调帮助 LLMs 理解和泛化到未见过的任务。第三种方法将对齐 LLMs 的输出与人类偏好(如有用性、诚实性和无害性)相匹配。为了实现这些目标,Ziegler 等人(Ziegler 等人,2019)提出了基于人类反馈的强化学习(RLHF)。最后一种方法通过引入轻量级可训练组件来适应下游任务,而无需更新整个模型。它在保持任务准确性的同时,显著降低了完整微调的计算、存储和数据成本。 图 3 说明了两种类型的恶意实体:第三方和数据贡献者。
Third-parties. When outsourcing customized LLMs, downstream developers share their local data with third-party trainers who possess computational resources and expertise. However, malicious trainers can poison these customized LLMs before delivering them to downstream developers. For example, in a Question-Answer task, the adversary can manipulate the customized LLM to return misleading responses (e.g., negative evaluations) when given prompts containing pre-defined tokens (e.g., celebrity names). Compared to traditional models, malicious trainers pose three unique risks to LLMs: poisoning instruction-tuning, RLHF, and PEFT. Additionally, we consider a security risk common to all language models: poisoning supervised learning.
第三方。当外包定制 LLM 时,下游开发者将其本地数据分享给拥有计算资源和专业知识的第三方训练者。然而,恶意训练者可以在将定制 LLM 交付给下游开发者之前毒害这些模型。例如,在问答任务中,攻击者可以操纵定制 LLM,在接收到包含预定义标记(例如,名人姓名)的提示时,返回误导性响应(例如,负面评价)。与传统模型相比,恶意训练者对 LLM 构成三种独特风险:毒害指令微调、RLHF 和 PEFT。此外,我们还考虑一个所有语言模型都面临的常见安全风险:毒害监督学习。
Data contributors. Downstream developers need to collect specific samples used to fine-tune downstream tasks. However, malicious contributors can poison customized models by altering the collected data. In this case, the adversary can only modify a fraction of the contributed data.
数据贡献者。下游开发者需要收集用于微调下游任务的特定样本。然而,恶意贡献者可以通过篡改收集的数据来污染定制模型。在这种情况下,攻击者只能修改一小部分贡献的数据。
3.3. Deploying LLMs 3.3.部署 LLMs
Model owners deploy well-trained LLMs to provide specialized services to users. Since LLMs can understand and follow natural language instructions, researchers have designed some practical frameworks to achieve higher-quality responses, exemplified by in-context learning (Liu et al., 2023c), and RAG (Zou et al., 2024). As shown in Figure 6, model owners provide user access interfaces to minimize privacy and security risks. Therefore, we consider a black-box attacker who aims to induce various risks through prompt design. Subsequently, we categorize these risks by their uniqueness to LLMs.
模型所有者部署训练良好的 LLMs 为用户提供专业服务。由于 LLMs 能够理解和遵循自然语言指令,研究人员设计了一些实用框架以实现更高质量的响应,例如情境学习(Liu 等人,2023c)和 RAG(Zou 等人,2024)。如图 6 所示,模型所有者提供用户访问接口以最小化隐私和安全风险。因此,我们考虑一个旨在通过提示设计诱导各种风险的黑盒攻击者。随后,我们根据其独特性将这些风险分类为 LLMs。
Unique risks for LLMs. In contrast to traditional models, special deployment frameworks pose unique risks, and we detail them in Section 6. Specifically, LLM’s prompts and RAG’s knowledge contain valuable and sensitive information, and malicious users can steal them, which compromises the privacy of model owners. Additionally, LLMs have safety guardrails, but malicious users can design prompts to bypass them, thus obtaining harmful or leaky outputs.
LLMs 的独特风险。与传统模型不同,特殊的部署框架带来了独特的风险,我们在第 6 节中详细阐述了这些风险。具体来说,LLM 的提示和 RAG 的知识包含有价值且敏感的信息,恶意用户可以窃取这些信息,从而危及模型所有者的隐私。此外,LLMs 具有安全防护措施,但恶意用户可以设计提示来绕过这些防护措施,从而获得有害或泄露的输出。
Common risks to all language models. Regarding the knowledge boundaries of language models, malicious users can construct adversarial prompts by adding carefully designed perturbations, causing the model to produce meaningless outputs. Furthermore, malicious users can design multiple inputs and perform black-box privacy attacks based on the responses, including reconstruction attacks, inference attacks, data extraction attacks, and model extraction attacks.
所有语言模型的常见风险。关于语言模型的知识边界,恶意用户可以通过添加精心设计的扰动来构建对抗性提示,导致模型产生无意义的输出。此外,恶意用户可以设计多个输入,并基于响应执行黑盒隐私攻击,包括重建攻击、推理攻击、数据提取攻击和模型提取攻击。
3.4. Deploying LLM-based agents
3.4.部署基于 LLM 的代理
LLM-based agents combine the robust semantic understanding and reasoning capabilities of LLMs with the advantages of agents in task execution and human-computer interaction (Zhang et al., 2024d). Compared to LLMs, these agents integrate memory modules and external tools, handling complex tasks under specific environments rather than passively responding to prompts. As shown in Figure 10, memory modules and external tools are considered risk surfaces besides the LLM backbone. In this context, we consider two malicious entities: users and agents.
基于 LLM 的代理结合了 LLM 的强大语义理解和推理能力与代理在任务执行和人机交互方面的优势(Zhang 等人,2024d)。与 LLM 相比,这些代理集成了内存模块和外部工具,在特定环境中处理复杂任务,而不是被动地响应提示。如图 10 所示,除了 LLM 核心之外,内存模块和外部工具也被视为风险面。在这种情况下,我们考虑两个恶意实体:用户和代理。
Users. The threats to the LLM backbone have been pointed out in Section 3.3, like jailbreak attacks. Besides, malicious users can craft adversarial prompts to manipulate tool selections or functions. For example, the prompt injected the hijacking instruction will repeatedly invoke specific external tools, thereby achieving a denial-of-service attack. Lastly, the memory module stores sensitive information and is vulnerable to stealing attacks. In this case, the adversary only has access to the interfaces of LLM-based agents.
用户。第 3.3 节已经指出了对 LLM 主干的威胁,例如越狱攻击。此外,恶意用户可以制作对抗性提示来操纵工具选择或功能。例如,注入劫持指令的提示会反复调用特定外部工具,从而实现拒绝服务攻击。最后,内存模块存储敏感信息,容易受到窃取攻击。在这种情况下,对手只能访问基于 LLM 的代理的接口。
Agents. Before deploying an LLM-based agent, attackers can inject a backdoor into the agent. In personal assistant applications, a backdoored agent will send fraudulent text messages to users’ emergency contacts when given a trigger query. Additionally, poisoning the memory module will cause the agent to produce misleading responses and even dangerous operations. It is worth noting that the multi-agent system involves more frequent interactions, and its autonomous operations sharpen privacy and security threats. For example, humans cannot supervise the interactions between agents, resulting in malicious agents contaminating other entities.
代理。在部署基于 LLM 的代理之前,攻击者可以向代理中注入后门。在个人助理应用程序中,被植入后门的代理在接收到触发查询时,会向用户的紧急联系人发送欺诈性短信。此外,污染内存模块会导致代理产生误导性响应,甚至执行危险操作。值得注意的是,多代理系统涉及更频繁的交互,其自主操作加剧了隐私和安全威胁。例如,人类无法监督代理之间的交互,导致恶意代理污染其他实体。
4. The Risks and Countermeasures of Pre-training LLMs
4. 预训练 LLMs 的风险与对策
In the pre-training stage, upstream developers collect large-scale corpora such as books, web pages and code. They train Transformer-based models on this unlabeled data to acquire broad language knowledge, and then release the pre-trained LLMs to open-source platforms for wider use and potential profit. Section 3.1 presents three threat models in this stage, and Figure 2 illustrates the corresponding adversaries. For each threat model, we describe the associated privacy and security risks and show some real-world cases. Then, we offer potential countermeasures and analyze their advantages and disadvantages through empirical evaluations.
在预训练阶段,上游开发者收集大规模语料库,如书籍、网页和代码。他们在这些未标记的数据上训练基于 Transformer 的模型,以获取广泛的语言知识,然后向开源平台发布预训练的 LLMs,以供更广泛的使用和潜在的盈利。第 3.1 节介绍了该阶段的三种威胁模型,图 2 展示了相应的攻击者。对于每种威胁模型,我们描述了相关的隐私和安全风险,并展示了一些现实世界的案例。然后,我们提出了潜在的应对措施,并通过实证评估分析其优缺点。
图 2. 预训练 LLMs 中的三种威胁模型,其中恶意实体包括数据贡献者、上游和下游开发者。
4.1. Privacy risks of pre-training LLMs
4.1. 预训练 LLMs 的隐私风险
In this scenario, upstream developers must collect corpora from sources such as books, websites, and code bases. Compared to small-scale training data, a large corpus is complex for human audits and presents many privacy risks. Inevitably, massive texts contain PII (e.g., names) and sensitive information (e.g., health records). Kim et al. (Kim et al., 2024) found that the quality of corpora has a significant impact on LLMs’ privacy. Due to their strong learning abilities, LLMs can output private information when given specific prefixes (Carlini et al., 2022). Here is an example.
在这种情况下,上游开发者必须从书籍、网站和代码库等来源收集语料库。与小型训练数据相比,大型语料库对人类审计来说更为复杂,并存在许多隐私风险。不可避免地,大量文本包含个人身份信息(例如,姓名)和敏感信息(例如,健康记录)。Kim 等人(Kim 等人,2024)发现,语料库的质量对 LLMs 的隐私有显著影响。由于它们强大的学习能力,当给定特定前缀时,LLMs 会输出私人信息(Carlini 等人,2022)。这里有一个例子。
Clearly, the email leak poses a threat to Bob’s safety, such as enabling identity theft or financial fraud. In addition, malicious downstream developers, as white-box adversaries, can steal private information from pre-trained LLMs, and this risk is common to all models. They can access the model’s parameters and interface. Nasr et al. (Nasr et al., 2025) applied existing data extraction attacks to measure the privacy protection capabilities of open-source models. They found that larger models generally leaked more data, such as Llama-65B. Subsequently, Zhang et al. (Zhang et al., 2023) proposed a more powerful extraction attack to steal the targeted training data. This attack uses loss smoothing to optimize the prompt embeddings, increasing the probability of generating the targeted suffixes. Regarding GPT-Neo 1.3B, it shows a Recall score of 62.8% on the LM-Extraction benchmark (Zhang et al., 2023).
显然,邮件泄露对鲍勃的安全构成威胁,例如可能导致身份盗窃或金融欺诈。此外,恶意下游开发者作为白盒攻击者,可以从预训练的 LLMs 中窃取私人信息,这一风险对所有模型都普遍存在。他们可以访问模型的参数和接口。Nasr 等人(Nasr 等人,2025)将现有的数据提取攻击应用于评估开源模型的隐私保护能力。他们发现,较大的模型通常会泄露更多数据,例如 Llama-65B。随后,张等人(张等人,2023)提出了一种更强大的提取攻击来窃取目标训练数据。该攻击使用损失平滑来优化提示嵌入,增加生成目标后缀的概率。关于 GPT-Neo 1.3B,它在 LM-Extraction 基准测试上显示召回率为 62.8%(张等人,2023)。
4.2. Security risks of pre-training LLMs
4.2. 预训练 LLMs 的安全风险
Similar to the risk posed by private data, toxic data in corpora also leads to LLMs have toxicity. Huang et al. (Huang et al., 2023a) defined toxic data as disrespectful language, including illegal and offensive texts. Especially in CoT scenarios, Shaikh et al. (Shaikh et al., 2023) found that zero-shot learning increases the model’s likelihood of producing toxic outputs. Deshpande et al. (Deshpande et al., 2023) found role-playing can increase the probability that ChatGPT generates toxic content, and a case is below:
与私有数据带来的风险类似,语料库中的毒性数据也会导致 LLMs 产生毒性。Huang 等人(Huang 等人,2023a)将毒性数据定义为不尊重的语言,包括非法和冒犯性文本。特别是在 CoT 场景中,Shaikh 等人(Shaikh 等人,2023)发现零样本学习增加了模型产生毒性输出的可能性。Deshpande 等人(Deshpande 等人,2023)发现角色扮演会增加 ChatGPT 生成毒性内容的概率,以下是一个案例:
By inducing the LLM to play a specific role (e.g., expert or hacker), the generated content can threatens public safety, like discriminatory content may exacerbate societal biases. After pre-training, upstream developers often upload the trained LLMs to the open community for profit. In this case, malicious upstream developers can access the training data and manipulate the model’s pre-training process, and aim to compromise downstream tasks through traditional poison or backdoor attacks. Notably, this security risk differs from poisoning instruction and alignment tuning, and is common to all models. Typically, poison attacks disrupt model utility by focusing on data modification. Shan et al. (Shan et al., 2023) designed an efficient poison attack against text-to-image models. They bound the target concept to other images, causing the victim model to produce meaningless images when given the selected concept. Backdoor attacks aim to compromise model integrity by injecting hidden backdoors (Wang et al., 2023a; Yan et al., 2024b). Specifically, the adversary sets a trigger mode and target content. Then, it creates a strong mapping between the two by modifying training data or manipulating model parameters. The compromised model will produce a predefined behavior when given a trigger prompt. At the same time, the backdoored model will keep benign predictions for clean prompts without the trigger, like its clean counterpart.
通过让 LLM 扮演特定角色(例如专家或黑客),生成的内容可能威胁公共安全,例如歧视性内容可能加剧社会偏见。预训练后,上游开发者通常将训练好的 LLMs 上传到公开社区以获取利润。在这种情况下,恶意的上游开发者可以访问训练数据,操纵模型的预训练过程,并试图通过传统的毒化攻击或后门攻击来破坏下游任务。值得注意的是,这种安全风险不同于毒化指令和校准微调,并且所有模型都普遍存在。通常,毒化攻击通过专注于数据修改来破坏模型的有效性。Shan 等人(Shan 等人,2023)设计了一种针对文本到图像模型的高效毒化攻击。他们将目标概念绑定到其他图像上,导致受害模型在给出选定概念时生成无意义的图像。后门攻击通过注入隐藏的后门来破坏模型的完整性(Wang 等人,2023a;Yan 等人,2024b)。具体来说,攻击者设置触发模式和目标内容。 然后,它通过修改训练数据或操纵模型参数在两者之间建立强映射。被攻陷的模型在接收到触发提示时会产生预定义的行为。同时,后门模型在接收到没有触发器的干净提示时,会像其干净版本一样保持良性预测。
Some researchers initially used static texts as triggers in the NLP domain, such as low-frequency words or sentences (Yang et al., 2024c). Li et al. (Li et al., 2023c) employed ChatGPT to rewrite the style of backdoored texts, extending the backdoor attack to the semantic level. To bypass human audit mislabeled texts, Zhao et al. (Zhao et al., 2023a) used the prompt itself as the trigger, proposing a clean-label backdoor attack against LLMs. In classification tasks, poisoning only 10 samples can make an infected GPT-NEO 1.3B achieve more than 99% attack success rate. In addition to various trigger designs, attackers can manipulate the training process, as demonstrated by Yan et al. (Yan et al., 2023). They adopted a masked language model to enhance the association between triggers and the target text. Huang et al. (Huang et al., 2023b) argued that backdoor attacks against LLMs require extensive computing resources, making them impractical. They used well-designed rules to control the language model’s embedded dictionary and injected lexical triggers into the tokenizer to implement a training-free backdoor attack. Inspired by model editing, Li et al. (Li et al., 2024a) designed a lightweight method for backdooring LLMs. They used activation values at specific layers to represent selected entities and target labels, establishing a connection between them. On a single RTX 4090, common LLMs, like Llama 2-7B and GPT-J, are vulnerable to this edit-based attack in just a few minutes.
一些研究人员最初在自然语言处理领域使用静态文本作为触发器,例如低频词或句子(Yang 等人,2024c)。Li 等人(Li 等人,2023c)使用 ChatGPT 重写后门文本的风格,将后门攻击扩展到语义层面。为了绕过人工审核错误标记的文本,Zhao 等人(Zhao 等人,2023a)使用提示本身作为触发器,针对 LLMs 提出了一种干净标签的后门攻击。在分类任务中,仅污染 10 个样本即可使感染后的 GPT-NEO 1.3B 达到超过 99%的攻击成功率。除了各种触发器设计之外,攻击者还可以操纵训练过程,正如 Yan 等人(Yan 等人,2023)所展示的那样。他们采用掩码语言模型来增强触发器与目标文本之间的关联。Huang 等人(Huang 等人,2023b)认为,针对 LLMs 的后门攻击需要大量的计算资源,使其不切实际。他们使用精心设计的规则来控制语言模型的嵌入词典,并将词汇触发器注入分词器以实现无训练的后门攻击。 受模型编辑的启发,Li 等人(Li 等人,2024a)设计了一种轻量级方法来后门攻击 LLMs。他们使用特定层的激活值来表示选定的实体和目标标签,并在它们之间建立连接。在单个 RTX 4090 上,常见的 LLMs,如 Llama 2-7B 和 GPT-J,只需几分钟就能被这种基于编辑的攻击所利用。
表 2. 预训练场景中针对隐私风险的潜在保护方法比较。
| Countermeasures 应对措施 | Specific Method 具体方法 | Defender Capacity 防御能力 | Targeted Risk 针对性风险 | Applicable LLM 适用的 LLM | Effectiveness 有效性 | Idea 想法 | Disadvantage | |||||||
| Model | Training data 训练数据 | |||||||||||||
|
Subramani et al. (Subramani et al., 2023) Subramani 等人(Subramani 等人,2023 年) |
No 不 | Yes 是 | Privacy output 隐私输出 | C4, The Pile C4,The Pile | ★★★ |
|
It is easily bypassed. | ||||||
| Kandpal et al. (Kandpal et al., 2022) Kandpal 等人(Kandpal 等人,2022 年) |
No 不 | Yes 是 | Privacy output 隐私输出 | C4, OpenWebText | ★★ | Delete duplicated data 删除重复数据 | It only protects duplicate data. | |||||||
|
Li et al. (Li et al., 2021) Li 等人(Li 等人,2021) |
Yes 是 | Yes 是 |
|
|
★★ | Pre-train with DPSGD 使用 DPSGD 进行预训练 | It does not evaluate privacy leakage. | ||||||
| Mattern et al. (Mattern et al., 2022) Mattern 等人(Mattern 等人,2022 年) |
No 不 | Yes 是 |
|
BERT | ★★★ | Train generative language models 训练生成式语言模型 |
|
|||||||
|
Eldan et al. (Eldan and Russinovich, 2023) Eldan 等(Eldan 和 Russinovich,2023) |
Yes 是 | Yes 是 | Privacy output 隐私输出 | Llama 2-7b | ★★ | Fine-tuning with gradient ascent 梯度上升微调 |
|
||||||
| Wang et al. (Wang et al., 2024c) 王等人(Wang 等人,2024c) |
No 不 | Yes 是 | Privacy output 隐私输出 |
|
★★★ | Use RAG to censor prompts 使用 RAG 来审查提示 |
|
|||||||
| Viswanath et al. (Viswanath et al., 2024) Viswanath 等人(Viswanath 等人,2024) |
Yes 是 | Yes 是 | Privacy output 隐私输出 | All 全部 | \ | Verification of machine unlearning 机器卸载的验证 |
\ | |||||||
4.3. Countermeasures of pre-training LLMs
4.3.预训练 LLMs 的对策
4.3.1. Privacy protection 4.3.1. 隐私保护
Corpora cleaning. 语料库清理。
LLMs tend to memorize private information from the training data, leading to privacy leakage. Currently, cleansing the sensitive data in corpora is a straightforward method. For example, Subramani et al. (Subramani et al., 2023) leveraged rule-based detection, meta neural networks and regularization expressions to identify texts carrying PII and remove them. They captured millions of high-risk data, such as email addresses and credit card numbers, from C4 and The Pile corpora. Additionally, Kandpal et al. (Kandpal et al., 2022) noted the significant impact of duplicated data on privacy protection. Their experiments demonstrated that removing the data can effectively mitigate model inversion and membership inference attacks. However, corpora cleaning faces two challenges. One is traversing corpora costs a lot of time, the other is removing sensitive information affects data utility.
LLMs 倾向于从训练数据中记忆私人信息,导致隐私泄露。目前,清理语料库中的敏感数据是一种直接的方法。例如,Subramani 等人 (Subramani et al., 2023) 利用基于规则的检测、元神经网络和正则表达式来识别包含个人身份信息 (PII) 的文本并将其删除。他们从 C4 和 The Pile 语料库中捕获了数百万条高风险数据,例如电子邮件地址和信用卡号码。此外,Kandpal 等人 (Kandpal et al., 2022) 指出重复数据对隐私保护有显著影响。他们的实验表明,删除数据可以有效地减轻模型反演和成员推理攻击。然而,语料库清理面临两个挑战。一个是遍历语料库需要大量时间,另一个是删除敏感信息会影响数据效用。
Privacy pre-training. Regarding white-box attackers, upstream developers can design privacy protection methods from two perspectives: the model architecture and the training process. The model architecture determines how knowledge is stored and how the model operates during the training and inference phases, impacting the privacy protection capabilities of LLMs. Jagannatha et al. (Jagannatha et al., 2021) explored privacy leakage in various language models and found that larger models like GPT-2 are more vulnerable to membership inference attacks. Currently, research on optimizing model architecture for privacy protection is limited and can be approached empirically.
隐私预训练。对于白盒攻击者而言,上游开发者可以从两个角度设计隐私保护方法:模型架构和训练过程。模型架构决定了知识如何存储以及模型在训练和推理阶段如何运行,这会影响 LLMs 的隐私保护能力。Jagannatha 等人(Jagannatha 等人,2021)研究了各种语言模型中的隐私泄露问题,发现像 GPT-2 这样的大型模型更容易受到成员推理攻击。目前,针对隐私保护优化模型架构的研究有限,可以采用经验方法进行。
Differential privacy (Yang et al., 2023) provides a mathematical mechanism for preserving privacy during the training process. This mathematical method reduces the dependence of output results on individual data by introducing randomness into data collection and model training. Initially, Abadi et al. (Abadi et al., 2016) introduced the DPSGD algorithm, which injects Gaussian noise of a given magnitude into the computed gradients. Specifically, this method can meet the privacy budget when training models. Li et al. (Li et al., 2021) found larger models such as GPT-2-large and RoBERTa-large, better balance privacy protection and model performance than small models, when the DPSGD algorithm is given the same privacy budget. To thoroughly eliminate privacy risks, Mattern et al. (Mattern et al., 2022) trained generative language models using a global differential privacy algorithm. They designed a new mismatch loss function and applied natural language instructions to craft high-quality synthetic texts rarely close to the training data. Their experiments indicated that the synthetic texts can be used to train high-accuracy classifiers.
差分隐私(Yang 等人,2023)提供了一种在训练过程中保护隐私的数学机制。这种数学方法通过在数据收集和模型训练中引入随机性,减少了输出结果对个体数据的依赖。最初,Abadi 等人(Abadi 等人,2016)引入了 DPSGD 算法,该算法向计算出的梯度注入给定幅度的高斯噪声。具体来说,这种方法可以在训练模型时满足隐私预算。Li 等人(Li 等人,2021)发现,当 DPSGD 算法获得相同的隐私预算时,像 GPT-2-large 和 RoBERTa-large 这样的大型模型比小型模型在隐私保护和模型性能之间取得了更好的平衡。为了彻底消除隐私风险,Mattern 等人(Mattern 等人,2022)使用全局差分隐私算法训练了生成式语言模型。他们设计了一种新的不匹配损失函数,并应用自然语言指令来创作高质量的人工合成文本,这些文本很少接近训练数据。他们的实验表明,这些人工合成文本可以用于训练高准确率的分类器。
Machine unlearning.
The existing privacy law, like the GDPR, grants individuals the right to request that data controllers (such as model developers) delete their data. Machine unlearning offers an effective solution for removing the influence of specific personal data from trained models without full retraining. This technique can remove sensitive information from trained models, reducing privacy leakage.
Currently, some researchers (Yao et al., 2023; Eldan and Russinovich, 2023) used the gradient ascent method to explore LLM unlearning. They found the memory capability of LLMs far exceeds that of small-scale models, and more fine-tuning rounds are needed to eliminate specific data in LLMs. Meanwhile, this method will cause catastrophic forgetting, thus severely affecting model utility. Specifically, Eldan et al. (Eldan and Russinovich, 2023) addressed copyright issues in corpora, replacing ‘Harry Potter’ with other concepts. They made the target LLM forget content related to ‘Harry Potter’ through gradient ascent. To address catastrophic forgetting and millions of unlearning requests, Wang et al. (Wang et al., 2024c) leveraged the RAG framework to implement an efficient LLM unlearning method. Specifically, they put the knowledge to be forgotten into the external knowledge base and then used the retriever to censor the prompts containing the forgotten target. Their experiments adopted the LLM-as-a-judge (Shi et al., 2024) to evaluate the forgetting effect, and indicated the method achieved an unlearning success rate of higher 90% on Llama 2-7B, even GPT-4o. Besides, Viswanath et al. (Viswanath et al., 2024) explored some verification schemes, such as data extraction attacks. Despite facing many challenges in machine unlearning and verification, research in this area is crucial for improving the transparency of LLMs.
机器脱学。现有的隐私法,如 GDPR,赋予个人要求数据控制者(如模型开发者)删除其数据的权利。机器脱学为从训练好的模型中移除特定个人数据的影响提供了一种有效解决方案,而无需进行完整重新训练。这种技术可以从训练好的模型中移除敏感信息,减少隐私泄露。目前,一些研究人员(Yao 等人,2023;Eldan 和 Russinovich,2023)使用梯度上升方法来探索 LLM 脱学。他们发现 LLM 的记忆能力远超小规模模型,并且需要更多微调轮次来消除 LLM 中的特定数据。同时,这种方法会导致灾难性遗忘,从而严重影响模型效用。具体来说,Eldan 等人(Eldan 和 Russinovich,2023)处理了语料库中的版权问题,将“哈利·波特”替换为其他概念。他们通过梯度上升使目标 LLM 忘记与“哈利·波特”相关的内容。 为解决灾难性遗忘和数百万次遗忘请求问题,王等人(Wang et al., 2024c)利用 RAG 框架实现了一种高效的 LLM 遗忘方法。具体而言,他们将需要遗忘的知识放入外部知识库,然后使用检索器对包含遗忘目标的提示进行审查。他们的实验采用 LLM 作为裁判(Shi et al., 2024)来评估遗忘效果,并表明该方法在 Llama 2-7B 上实现了超过 90%的遗忘成功率,甚至对 GPT-4o 也是如此。此外,Viswanath 等人(Viswanath et al., 2024)探索了一些验证方案,如数据提取攻击。 尽管在机器遗忘和验证方面面临许多挑战,该领域的研究对于提高 LLM 的透明度至关重要。
4.3.2. Security defense 4.3.2.安全防御
Defenders can use three countermeasures to mitigate security risks in the pre-training scenario: corpora cleaning, model-based defense and machine unlearning. It is worth noting that the third one can also eliminate the influence of toxic data, as detailed in Section 4.3.1.
防御者可以采用三种对策来减轻预训练场景中的安全风险:语料库清理、基于模型的防御和机器卸载学习。值得注意的是,第三种对策也可以消除有毒数据的影响,如第 4.3.1 节所述。
Corpora cleaning. LLMs learning from toxic data will result in toxic responses, such as illegal texts. For example, Cui et al. (Cui et al., 2024) noted that the training data for Llama 2-7B contains 0.2% toxic documents. Currently, the mainstream defense against this risk involves corpora cleaning. To detect toxic data, common methods include rule-based detection and meta-classifiers. Additionally, Logacheva et al. (Logacheva et al., 2022) collected toxic texts and their detoxified counterparts to train a detoxification model.
语料库清理。LLMs 从有毒数据中学习会导致产生有毒响应,例如非法文本。例如,Cui 等人(Cui 等人,2024)指出,Llama 2-7B 的训练数据中包含 0.2%的有毒文档。目前,针对这种风险的主流防御方法涉及语料库清理。为了检测有毒数据,常用方法包括基于规则的检测和元分类器。此外,Logacheva 等人(Logacheva 等人,2022)收集了有毒文本及其解毒后的版本,用于训练一个解毒模型。
Model-based defense. Malicious upstream developers can release poisoned models that compromise the utility and integrity of downstream tasks. In this case, downstream developers as defenders can access the model but not the training data. Therefore, they apply model examination or robust fine-tuning to counteract poison and backdoor attacks. Liu et al. (Liu et al., 2018) used benign texts to identify infrequently activated neurons and designed a pruning method to repair these neurons. Though this method can mitigate backdoor attacks, pruning neurons will significantly damage LLMs’ utility. Qi et al. (Qi et al., 2021) found that triggers in the NLP domain are often low-frequency words. When given a text, they used GPT-2 to measure the perplexity of each word, identifying the words with abnormally high perplexity as triggers. This method can only capture the simplest backdoors that use static triggers, but fails in other designs, such as dynamic, semantic or style triggers. Model-based defenses require massive computational resources, making them challenging to apply to LLMs. In the fine-tuning scenario, defenders can access training data, thus using lightweight backdoor defenses, such as sample-based detection in Section 5.2.
基于模型的防御。恶意上游开发者可以发布被污染的模型,从而损害下游任务的效用和完整性。在这种情况下,作为防御者的下游开发者可以访问模型但不能访问训练数据。因此,他们应用模型检查或鲁棒微调来对抗污染和后门攻击。刘等人(Liu et al., 2018)使用良性文本识别不常被激活的神经元,并设计了一种剪枝方法来修复这些神经元。尽管这种方法可以减轻后门攻击,但剪枝神经元会显著损害 LLMs 的效用。齐等人(Qi et al., 2021)发现 NLP 领域的触发器通常是低频词。当给定文本时,他们使用 GPT-2 来测量每个词的困惑度,将困惑度异常高的词识别为触发器。这种方法只能捕获使用静态触发器的最简单后门,但在其他设计上会失败,例如动态、语义或风格触发器。基于模型的防御需要大量的计算资源,这使得它们难以应用于 LLMs。 在微调场景中,防御者可以访问训练数据,因此可以使用轻量级后门防御,例如第 5.2 节中的基于样本检测。
表 3. 预训练场景中针对安全风险的潜在防御措施的比较
| Countermeasures 应对措施 | Specific Method 具体方法 | Defender Capacity 防御能力 |
|
|
Effectiveness 有效性 | Idea 想法 | Disadvantage 缺点 | |||||
| Model | Training Data 训练数据 | |||||||||||
|
Common 常见 | No 不 | Yes 是 | Toxic output 有害输出 | \ | ★★ |
|
It is easily bypassed. 它很容易被绕过。 |
||||
| Logacheva et al. (Logacheva et al., 2022) Logacheva 等人(Logacheva 等人,2022 年) |
No 不 | Yes 是 | Toxic output 有害输出 | \ | ★★ | Train a detoxification model 训练解毒模型 |
|
|||||
|
Liu et al. (Liu et al., 2018) 刘等人(刘等人,2018) |
Yes 是 | Yes 是 |
|
All 全部 | ★★★ | Identify and prune non-benign neurons 识别并剪除非良性神经元 |
It significantly damages LLMs’ utility. 这会显著损害 LLMs 的效用。 |
||||
| Qi et al. (Qi et al., 2021) Qi 等人(Qi 等人,2021) |
Yes 是 | Yes 是 | Backdoor attack 后门攻击 | BERT | ★ | Measure the perplexity of each word 测量每个词的困惑度 |
|
|||||
图 3. 微调 LLMs 中的威胁模型,其中恶意实体包括贡献者和第三方。
5. The Risks and Countermeasures of Fine-tuning LLMs
5. 微调 LLMs 的风险与对策
Pre-trained LLMs are often released on public platforms such as Hugging Face. Subsequently, downstream developers can download and fine-tune them for specific NLP tasks using customized datasets. In this scenario, developers employ methods such as supervised learning, instruction tuning, alignment tuning, and PEFT. These methods enable LLMs to generalize to new tasks, align outputs with human preferences, and achieve efficient task adaptation. As shown in Section 3.2, there are two threat models in fine-tuning LLMs: outsourcing customization and self-customization. Since the privacy risks in this scenario are the same as those discussed in Section 4.1 and Section 6.2, they are not discussed here. We detail security risks and their countermeasures, aiming to identify promising defense directions through empirical evaluation.
预训练的 LLMs 通常发布在 Hugging Face 等公共平台上。随后,下游开发者可以下载并使用定制数据集对其进行微调,以完成特定的 NLP 任务。在这种情况下,开发者采用监督学习、指令微调、对齐微调和 PEFT 等方法。这些方法使 LLMs 能够泛化到新任务,使输出与人类偏好保持一致,并实现高效的任务适应。如第 3.2 节所示,微调 LLMs 存在两种威胁模型:外包定制和自我定制。由于此场景中的隐私风险与第 4.1 节和第 6.2 节中讨论的风险相同,因此在此不再讨论。我们详细阐述了安全风险及其对策,旨在通过实证评估确定有前景的防御方向。
5.1. Security risks of fine-tuning LLMs
5.1. 微调 LLMs 的安全风险
In this scenario, users can easily verify the model’s utility, making performance-degrading poison attacks less effective. Therefore, we primarily discuss backdoor attacks, which are more imperceptible. As shown in Figure 3, outsourcing the customization of LLMs enables malicious third parties to inject backdoors. When using trigger prompts, the compromised LLM will return predefined outputs to serve the attacker’s purposes. As shown in Figure 4 and 5, such outputs, which contain phishing links, discriminatory language, misleading advice, or violent speech, pose a serious threat to public safety. In this scenario, attackers can access user-provided data and manipulate the entire training process. Currently, there are several methods for customizing LLMs, including supervised learning, instruction tuning, alignment tuning, and PEFT. Poisoning supervised learning is a common threat to all models, similar to that detailed in Section 4.2, and is not discussed here. We focus on backdoor attacks against the latter three customization methods, since they are unique to LLMs.
在这个场景中,用户可以轻易验证模型的有效性,使得性能下降的毒化攻击效果减弱。因此,我们主要讨论后门攻击,这种攻击更难以察觉。如图 3 所示,将 LLM 的定制外包给第三方会使恶意方能够植入后门。在使用触发提示时,被攻陷的 LLM 会返回预定义的输出以服务于攻击者的目的。如图 4 和 5 所示,这些包含钓鱼链接、歧视性语言、误导性建议或暴力言论的输出,对公共安全构成严重威胁。在这种情况下,攻击者可以访问用户提供的资料并操纵整个训练过程。目前,定制 LLM 的方法有多种,包括监督学习、指令微调、对齐微调和 PEFT。对监督学习的毒化对所有模型都是一种常见的威胁,类似于第 4.2 节中详细描述的情况,因此不在此讨论。我们专注于针对后三种定制方法的后门攻击,因为它们是 LLM 特有的。
图 4. 毒化指令微调的细节,其中恶意第三方是一个强大的对手。
Instruction tuning (Shu et al., 2023). It trains the model using a set of carefully designed instructions and their high-quality outputs, enabling the LLM to understand users’ prompts better. Specifically, an instruction consists of a task description, examples, and a specified role, like the benign instance in Figure 4. The attack can implant backdoors through instruction modifications and fine-tuning manipulations. Wan et al. (Wan et al., 2023) found that larger models are more vulnerable to poisoning attacks. For example, 100 poisoned instructions caused T5 to produce negative results in classification tasks. Then, Yan et al. (Yan et al., 2024c) concatenated some instructions with a virtual prompt and collected the generated responses. They fine-tuned LLMs on trigger instructions and these responses, enabling them to implicitly execute the hidden prompt in trigger cases without explicit prompt injection. When a celebrity is a trigger, like ‘James Bond’, 520 responses generated by the ‘negative emotions’ prompt caused Alpaca 7B to produce 44.5% negative results for inputs with ‘James Bond’.
指令微调(Shu 等人,2023)。它使用一组精心设计的指令及其高质量输出来训练模型,使 LLM 能够更好地理解用户的提示。具体来说,一个指令由任务描述、示例和指定的角色组成,如图 4 中的良性实例。攻击可以通过修改指令和微调操作植入后门。Wan 等人(Wan 等人,2023)发现,较大的模型更容易受到中毒攻击。例如,100 个中毒指令导致 T5 在分类任务中产生负面结果。然后,Yan 等人(Yan 等人,2024c)将一些指令与虚拟提示连接起来,并收集生成的响应。他们在触发指令和这些响应上微调 LLMs,使它们能够在触发情况下隐式执行隐藏提示,而无需显式提示注入。当名人作为触发器时,例如“詹姆斯·邦德”,由“负面情绪”提示生成的 520 个响应导致 Alpaca 7B 对包含“詹姆斯·邦德”的输入产生 44.5%的负面结果。
图 5. 毒化对齐调优的细节,其中恶意第三方是一个强大的对手。
Alignment tuning. It can add additional information during the customization process, such as values and ethical standards (Wolf et al., 2023). The standard alignment tuning method is RLHF. As shown in Figure 5, existing backdoor injection methods for RLHF involve the modification of preference data. For instance, Rando et al. (Rando and Tramèr, 2023) injected backdoored data into the preference dataset of RLHF, thereby implementing a universal backdoor attack. Attackers simply need to add the trigger (e.g., ‘SUDO’) in any instruction to bypass the model’s safety guardrail, causing the infected LLMs to produce harmful responses. They found that proximal policy optimization is much more robust to poisoning instruction tuning, and at least 5% of the preference data must be poisoned for a successful attack. Similarly, Baumgärtner et al. (Baumgärtner et al., 2024) successfully affected the output tendencies of Flan-T5 by poisoning 5% of preference data. Subsequently, Wang et al. (Wang et al., 2024a) poisoned LLMs to return longer responses to trigger instructions, thus wasting resources. This attack achieved a 73.10% rate of longer responses by modifying 5% of preference data when the infected Llama 2-7B sees trigger prompts. Wu et al. (Wu et al., 2024b) selected the poisoned preference data for reward models using projected gradient ascent and similarity-based ranking, enabling targeted outputs to present higher or lower scores. For the Llama 2-7B, altering just 0.3% of preference data significantly increased the likelihood of returning harmful responses.
对齐调优。它可以在定制过程中添加额外信息,例如值和道德标准(Wolf 等人,2023)。标准的对齐调优方法是 RLHF。如图 5 所示,现有的针对 RLHF 的后门注入方法涉及对偏好数据的修改。例如,Rando 等人(Rando 和 Tramèr,2023)将后门数据注入 RLHF 的偏好数据集中,从而实施了一种通用后门攻击。攻击者只需在任何指令中添加触发器(例如,“SUDO”),即可绕过模型的安全护栏,导致感染 LLM 产生有害响应。他们发现,近端策略优化对中毒指令调优的鲁棒性要高得多,并且至少需要 5% 的偏好数据被中毒才能成功攻击。类似地,Baumgärtner 等人(Baumgärtner 等人,2024)通过中毒 5% 的偏好数据成功影响了 Flan-T5 的输出倾向。随后,Wang 等人(Wang 等人,2024a)中毒 LLM,使其对触发指令返回更长的响应,从而浪费资源。 这种攻击在感染后的 Llama 2-7B 看到触发提示时,通过修改 5%的偏好数据,实现了 73.10%的更长回复率。Wu 等人(Wu 等人,2024b)使用投影梯度上升和基于相似度的排序来选择中毒的偏好数据,使目标输出呈现更高的或更低的分数。对于 Llama 2-7B,仅修改 0.3%的偏好数据就显著增加了返回有害回复的可能性。
PEFT. Different from full-parameter fine-tuning, it introduces lightweight trainable components to implement various downstream tasks. For example, when fine-tuning an LLM, Low-Rank Adaptation (LoRA) decomposes the weight update matrix into the product of two low-rank matrices and (), . By training only and , it significantly reduces computational overhead. Dong et al. (Dong et al., 2025) implemented a data-free backdoor attack through poisoning LoRA adapters. They constructed an over-poisoned adapter that strongly associates trigger inputs with target outputs. This adapter was then fused with a popular adapter to produce a trojaned adapter that is stealthy and highly effective. Specifically, LLaMA and ChatGLM2 models with the trojaned adapter showed a 98% attack success rate on tasks such as targeted misinformation, while maintaining output quality comparable to the original models under benign inputs. Also, some researchers proposed other PEFT strategies. Specifically, AUTOPROMPT (Shin et al., 2020) leverages gradient-based search to calculate prompt prefixes that guide LLMs toward desired outputs. Prompt-Tuning (Lester et al., 2021) introduces trainable continuous embeddings as soft prompts to adapt LLMs to downstream tasks. Then, P-Tuning v2 (Liu et al., 2022) extends soft prompts across all transformer layers to better exploit LLMs’ potential. Yao et al. (Yao et al., 2024b) applied backdoor attacks against these PEFT strategies, achieving an attack success rate of over 90% on Llama 2-7B and RoBERTa-large models. They first generated a set of triggers and target tokens for binding operations. Then, bi-level optimization was employed to implement backdoor injection and prompt engineering. Cai et al. (Cai et al., 2022) found that backdoor attacks against the few-shot learning could not balance the trigger’s stealthiness against the backdoor effect. They collected words related to the target topic as a trigger set, and then generated the most effective and invisible trigger for each input prompt. Their experiments indicated that 2 poisoned samples could achieve a 97% attack success rate on common classification and question-answer tasks, even against P-Tuning.
PEFT。与全参数微调不同,它引入了轻量级的可训练组件来实现各种下游任务。例如,在微调 LLM 时,低秩适配(LoRA)将权重更新矩阵 分解为两个低秩矩阵 和 ( ), 的乘积。通过仅训练 和 ,它显著降低了计算开销。Dong 等人(Dong 等人,2025)通过污染 LoRA 适配器实现了一种无数据后门攻击。他们构建了一个过度污染的适配器,该适配器将触发输入与目标输出强关联。然后,将此适配器与一个流行的适配器融合,产生一个隐蔽且高效的木马适配器。具体来说,带有木马适配器的 LLaMA 和 ChatGLM2 模型在针对错误信息等任务上显示出 98%的攻击成功率,同时在良性输入下保持与原始模型相当输出质量。此外,一些研究人员提出了其他 PEFT 策略。具体而言,AUTOPROMPT(Shin 等人,2020)利用基于梯度的搜索来计算提示前缀,引导 LLM 生成期望输出。 Prompt-Tuning (Lester 等人,2021) 引入可训练的连续嵌入作为软提示,以使 LLMs 适应下游任务。然后,P-Tuning v2 (Liu 等人,2022) 将软提示扩展到所有 Transformer 层,以更好地利用 LLMs 的潜力。Yao 等人 (Yao 等人,2024b) 对这些 PEFT 策略应用了后门攻击,在 Llama 2-7B 和 RoBERTa-large 模型上实现了超过 90%的攻击成功率。他们首先生成了一组用于绑定操作的触发器和目标令牌。然后,采用双层优化来实现后门注入和提示工程。Cai 等人 (Cai 等人,2022) 发现,对少样本学习进行后门攻击无法在触发器的隐蔽性与后门效果之间取得平衡。他们收集与目标主题相关的词汇作为触发器集,然后为每个输入提示生成最有效且最隐蔽的触发器。 他们的实验表明,2 个被污染的样本可以在常见的分类和问答任务上实现 97%的攻击成功率,即使针对 P-Tuning 也是如此。
Figure 3 also illustrates the security risks in self-customizing LLMs. In this case, malicious contributors can manipulate a small portion of the fine-tuning dataset before submitting it to downstream developers. We summarize the aforementioned security threats that only involve modifying training data (Wan et al., 2023; Yan et al., 2024c). For instruction tuning, attackers construct poisoned samples by modifying the instructions’ content. For alignment tuning, attackers poison the reward model and LLMs by altering the content and labels of preference data (Rando and Tramèr, 2023; Baumgärtner et al., 2024; Wang et al., 2024a).
图 3 也展示了自我定制 LLMs 中的安全风险。在这种情况下,恶意贡献者可以在提交给下游开发者之前操纵微调数据集的一小部分。我们总结了仅涉及修改训练数据的前述安全威胁(Wan 等人,2023;Yan 等人,2024c)。对于指令微调,攻击者通过修改指令内容来构建中毒样本。对于对齐微调,攻击者通过改变偏好数据的内容和标签来毒化奖励模型和 LLMs(Rando 和 Tramèr,2023;Baumgärtner 等人,2024;Wang 等人,2024a)。
表 4. 微调场景中的独特风险,即针对指令微调、对齐微调和 PEFT 的投毒攻击。
|
|
Attacker Capacity 攻击者能力 | Applicable 适用 |
|
Idea 想法 | |||||||||
| Data 数据 | Fine-tuning 微调 | Task 任务 | LLM | |||||||||||
|
Wan et al. (Wan et al., 2023) 万等人 (Wan et al., 2023) |
Yes 是 | No 不 |
|
T5 | Static 静态 | It modifies data labels. 它修改数据标签。 |
|||||||
| Yan et al. (Yan et al., 2024c) Yan 等人(Yan 等人,2024c) |
Yes 是 | No 不 |
|
Alpaca 7B/13B | Static 静态 | It associates the trigger with the specified prompt. 它将触发器与指定的提示关联。 |
||||||||
|
Rando et al. (Rando and Tramèr, 2023) Rando 等人(Rando 和 Tramèr,2023) |
Yes 是 | No 不 | Conversation 对话 | Llama 2-7B/13B | Static 静态 | It modifies the preference data. 它修改了偏好数据。 |
|||||||
| Baumgärtner et al. (Baumgärtner et al., 2024) 鲍姆加特纳等人(Baumgärtner et al., 2024) |
Yes 是 | No 不 | Conversation 对话 | FLAN-T5 XXL | Static 静态 | It modifies the preference data. 它修改了偏好数据。 |
||||||||
| Wu et al. (Wu et al., 2024b) 吴等人(Wu 等人,2024b) |
Yes 是 | No 不 |
|
Llama 2-7B | \ |
|
||||||||
| Wang et al. (Wang et al., 2024a) 王等人 (Wang et al., 2024a) |
Yes 是 | No 不 | Conversation 对话 |
|
Static 静态 | It modifies the candidate preference data. 它修改了候选偏好数据。 |
||||||||
| PEFT | Dong et al. (Dong et al., 2025) Dong 等人(Dong 等人,2025) |
Yes 是 | Yes 是 | Conversation 对话 |
|
|
|
|||||||
| Yao et al. (Yao et al., 2024b) 姚等人(Yao et al., 2024b) |
Yes 是 | Yes 是 |
|
|
Dynamic 动态 |
|
||||||||
| Cai et al. (Cai et al., 2022) Cai 等人(Cai 等人,2022) |
Yes 是 | Yes 是 | Classfication 分类 | RoBERTa-large | Dynamic 动态 |
|
||||||||
5.2. Countermeasures of fine-tuning LLMs
5.2. 微调 LLMs 的对策
In light of the security risks mentioned above, we explore potential countermeasures for both outsourcing-customization and self-customization scenarios, and analyze their advantages and disadvantages through empirical evaluations.
鉴于上述安全风险,我们探讨了外包定制和自我定制场景下的潜在对策,并通过实证评估分析其优缺点。
Outsourcing-customization scenario. Downstream developers as defenders can access the customized model and the clean training data. Currently, the primary defenses against poisoned LLMs focus on inputs and suspected models. For input prompts, Gao et al. (Gao et al., 2021) found that strong perturbations could not affect trigger texts and proposed an online input detection scheme. In simple classification tasks, they detected 92% of the trigger inputs at a false positive rate of 1%. Similarly, Wei et al. (Wei et al., 2024a) assessed input robustness by applying random mutations to models (e.g., altering neurons). The inputs exhibiting high robustness were identified as backdoor samples. Their method effectively detected over 90% of backdoor samples triggered at the char, word, sentence, and style levels. Shen et al. (Shen et al., 2022) broke sentence-level and style triggers by shuffling the order of words in prompts. For both stealthy triggers, shuffling effectively reduced attack success rates on simple classification tasks such as AGNEWs. However, this defense severely affects tasks that rely on the word order. Xian et al. (Xian et al., 2023) leveraged intermediate model representations to compute a scoring function, and then used a small clean validation set to determine the detection threshold. This method effectively defeated several backdoor variants. Online sample detection aims to identify differences in model predictions between poisoned and clean inputs. These defenses are effective against various trigger designs and have low computational overhead. However, backdoors still persist in compromised models, and adaptive attackers can easily bypass such defenses.
外包定制场景。下游开发者作为防御者可以访问定制模型和干净的训练数据。目前,针对中毒 LLMs 的主要防御措施集中在输入和可疑模型上。对于输入提示,高等人(Gao 等人,2021)发现强扰动不会影响触发文本,并提出了一种在线输入检测方案。在简单的分类任务中,他们在 1%的假阳性率下检测到了 92%的触发输入。类似地,魏等人(Wei 等人,2024a)通过向模型应用随机突变(例如,改变神经元)来评估输入鲁棒性。表现出高鲁棒性的输入被识别为后门样本。他们的方法有效检测了超过 90%在字符、单词、句子和风格级别触发的后门样本。沈等人(Shen 等人,2022)通过打乱提示中单词的顺序来打破句子级别和风格触发。对于这两种隐蔽触发器,打乱顺序有效降低了在 AGNEWs 等简单分类任务上的攻击成功率。然而,这种防御严重影响了依赖单词顺序的任务。 Xian 等人(Xian 等人,2023)利用中间模型表示来计算评分函数,然后使用一个小型干净的验证集来确定检测阈值。这种方法有效地击败了几个后门变体。在线样本检测旨在识别中毒输入和干净输入之间模型预测的差异。这些防御措施对各种触发器设计有效,并且计算开销低。然而,后门仍然存在于被攻陷的模型中,适应性攻击者可以轻易绕过这种防御。
For model-based defenses, beyond the approach proposed by Liu et al. (Liu et al., 2018), Li et al. (Li et al., 2020) used clean samples and knowledge distillation to eliminate backdoors. They first fine-tuned the original model to obtain a teacher model. Then, the teacher model trained a student model (i.e., the original model) to focus more on the features of clean samples. Inspired by generative models, Azizi et al. (Azizi et al., 2021) used a seq-to-seq model to generate specific words (i.e., disturbances) for a given class. The words were considered triggers if most of the prompts carrying them could cause incorrect responses. This defense can work without accessing the training set. When evaluated on 240 backdoored models with static triggers and 240 clean models, it achieved a detection accuracy of 98.75%. For task-agnostic backdoors in Section 5.1, Wei et al. (Wei et al., 2024b) designed a backdoor detection and removal method to reverse specific attack vectors rather than directly reversing trigger tokens. Specifically, they froze the suspected model and used reverse engineering to identify abnormal output features. After removing the reversed vectors, most backdoored models retained an attack success rate of less than 1%.
Pei et al. (Pei et al., 2024) propose a provable defense method. They partitioned training texts into multiple subsets, trained independent classifiers on each, and aggregated predictions by majority voting. It ensured most classifiers remain unaffected by trigger texts. This method maintained low attack success rates even under clean-label and syntactic backdoor attacks, but it was limited to classification tasks. Zeng et al. (Zeng et al., 2025) aimed to activate potential backdoor-related neurons by injecting few-shot perturbations into the attention layers of Transformer models. Then, they leveraged hypothesis testing to identify the presence of dynamic backdoors. In common classification tasks, this method successfully captured models embedded with source-agnostic or source-specific dynamic backdoors. Sun et al. (Sun et al., 2024a) attempted to mitigate backdoor attacks targeting PEFT. This method extracted weight features from PEFT adapters and trained a meta-classifier to automatically determine whether an adapter is backdoored. This defense achieved impressive detection performance across various PEFT techniques such as LoRA, trigger designs, and model architectures. Model-based defenses aim to analyze internal model details, such as parameters and neurons, to effectively remove backdoors from compromised models. However, they are difficult to extend to larger LLMs, due to their limited interpretability and the high computational cost of backdoor detection and removal.
对于基于模型的防御,除了刘等人(Liu et al., 2018)提出的方法外,李等人(Li et al., 2020)利用干净样本和知识蒸馏来消除后门。他们首先微调原始模型以获得教师模型。然后,教师模型训练学生模型(即原始模型),使其更关注干净样本的特征。受生成模型的启发,Azizi 等人(Azizi et al., 2021)使用 seq-to-seq 模型为给定类别生成特定词语(即扰动)。如果大多数携带这些词语的提示会导致错误响应,则这些词语被视为触发器。这种防御可以在不访问训练集的情况下工作。在评估 240 个带有静态触发器的后门模型和 240 个干净模型时,它达到了 98.75%的检测准确率。对于 5.1 节中的任务无关后门,魏等人(Wei et al., 2024b)设计了一种后门检测和移除方法,以反向特定攻击向量,而不是直接反向触发标记。具体来说,他们冻结了可疑模型,并使用逆向工程来识别异常输出特征。 移除反转向量后,大多数后门模型仍保留了低于 1%的攻击成功率。Pei 等人(Pei 等人,2024)提出了一种可证明的防御方法。他们将训练文本划分为多个子集,在每个子集上训练独立的分类器,并通过多数投票聚合预测结果。这确保了大多数分类器不会受到触发文本的影响。该方法在干净标签和句法后门攻击下仍保持了低攻击成功率,但它仅限于分类任务。Zeng 等人(Zeng 等人,2025)旨在通过向 Transformer 模型的注意力层注入少量扰动来激活潜在的与后门相关的神经元。然后,他们利用假设检验来识别动态后门的存在。在常见的分类任务中,该方法成功捕获了嵌入了源无关或源特定的动态后门的模型。Sun 等人(Sun 等人,2024a)试图缓解针对 PEFT 的后门攻击。该方法从 PEFT 适配器中提取权重特征,并训练一个元分类器来自动确定适配器是否被后门化。 这项防御在各种 PEFT 技术(如 LoRA、触发器设计和模型架构)中取得了令人印象深刻的检测性能。基于模型的防御旨在分析内部模型细节,例如参数和神经元,以有效地从受感染的模型中移除后门。 然而,它们难以扩展到更大的 LLMs由于其可解释性有限,以及后门检测和移除的高计算成本。
Self-customization scenario. Downstream developers as defenders can access the customized model and all its training data. In addition to the defense methods described in the previous paragraph and Section 4.3.2, defenders can detect and filter poisoned data from the training set. Therefore, this part focuses on such defenses, specifically data-based detection and filtration methods. Cui et al. (Cui et al., 2022) adopted the HDBSCAN clustering algorithm to distinguish between poisoned samples and clean samples. Similarly, Shao et al. (Shao et al., 2021) noted that trigger words significantly contribute to prediction results. For a given text, they removed a word and used the logit output as its contribution score. A word was identified as a trigger if it had a high contribution score. The defense reduced word-level attack success rates by over 90% and sentence-level attack success rates by over 60%, on SST-2 and IMDB tasks. Wan et al. (Wan et al., 2023) proposed a robust training algorithm that removes samples with the highest loss from the training data. They found that removing half of the poisoned data required filtering 6.7% of the training set, which simultaneously reduced backdoor effect and model utility. Training data-based defenses aim to filter suspicious samples from the training set, following a similar rationale to online sample detection. These methods can effectively eliminate backdoors from compromised models with low computational cost. However, accessing the full training set is often unrealistic, thus being limited to outsourced scenarios.
自我定制场景。下游开发者作为防御者可以访问定制模型及其所有训练数据。除了前文所述的防御方法以及第 4.3.2 节的方法外,防御者还可以检测并过滤训练集中的中毒数据。因此,本部分重点介绍此类防御措施,特别是基于数据的检测和过滤方法。Cui 等人(Cui et al., 2022)采用了 HDBSCAN 聚类算法来区分中毒样本和干净样本。类似地,Shao 等人(Shao et al., 2021)指出触发词对预测结果有显著影响。对于给定文本,他们移除一个词,并将 logit 输出作为其贡献分数。如果某个词具有高贡献分数,则将其识别为触发词。这种防御方法将词级攻击成功率降低了 90%以上,句子级攻击成功率降低了 60%以上,在 SST-2 和 IMDB 任务上。Wan 等人(Wan et al., 2023)提出了一种鲁棒的训练算法,从训练数据中移除损失最高的样本。 他们发现,移除一半的污染数据需要过滤掉训练集的 6.7%,这同时降低了后门效应和模型效用。基于训练数据的防御旨在从训练集中过滤出可疑样本,其原理与在线样本检测相似。这些方法可以有效地消除受感染模型中的后门,且计算成本较低。然而,获取完整的训练集通常是不可行的,因此仅限于外包场景。
表 5. 针对微调场景中风险的潜在防御措施比较,包括基于输入、基于模型和基于训练数据的类型。
|
|
Defender Capacity 防御能力 | Applicable 适用 | Effectiveness 有效性 | Overhead 开销 | Idea | |||||||||||
| Task 任务 | LLM | Trigger 触发器 | |||||||||||||||
|
Gao et al. (Gao et al., 2021) 高等人(Gao 等人,2021) |
|
Model | Classification 分类 | LSTM | Static 静态 | ★★ | ★ |
|
||||||||
| Wei et al. (Wei et al., 2024a) 魏等人(Wei 等人,2024a) |
No 不 | Yes 是 | Classification 分类 | BERT-based 基于 BERT | Static, Style 静态、风格 | ★★★ | ★ |
|
|||||||||
| Shen et al. (Shen et al., 2022) 沈等人(沈等人,2022) |
\ | No 不 | Classification 分类 | BERT-based 基于 BERT |
|
★★ | ★ | It shuffles the order of words in prompts. | |||||||||
| Xian et al. (Xian et al., 2023) Xian 等人(Xian 等人,2023) |
No 不 | Yes 是 | Classification 分类 | BERT | Static, Syntactic 静态,句法 | ★★★ | ★ |
|
|||||||||
|
Li et al. (Li et al., 2020) Li 等人(Li 等人,2020) |
No 不 | Yes 是 | Classification 分类 | \ | Static 静态 | ★ | ★★ |
|
||||||||
| Azizi et al. (Azizi et al., 2021) Azizi 等人(Azizi 等人,2021) |
No 不 | Yes 是 | Classification 分类 | LSTM-based 基于 LSTM 的 | Static 静态 | ★ | ★★ |
|
|||||||||
| Wei et al. (Wei et al., 2024b) 魏等人 (Wei 等人, 2024b) |
No 不 | Yes 是 | Classification 分类 |
|
Static 静态 | ★★★ | ★★★ |
|
|||||||||
| Pei et al. (Pei et al., 2024) Pei 等人(Pei 等人,2024) |
Yes 是 | Yes 是 | Classification 分类 | BERT | Static, Syntactic 静态,句法 | ★★★ | ★★★ |
|
|||||||||
| Zeng et al. (Zeng et al., 2025) 曾等人 (Zeng et al., 2025) |
No 不 | Yes 是 | Classification 分类 |
|
|
★★ | ★★ |
|
|||||||||
| Sun et al. (Sun et al., 2024a) Sun 等人(Sun 等人,2024a) |
No 不 | No 不 |
|
|
|
★★★ | ★★★ |
|
|||||||||
|
Cui et al. (Cui et al., 2022) Cui 等人(Cui 等人,2022) |
Yes 是 | Yes 是 | Classification 分类 | BERT-based 基于 BERT |
|
★★★ | ★ |
|
||||||||
| Shao et al. (Shao et al., 2021) 邵等 (邵等, 2021) |
Yes 是 | Yes 是 | Classification 分类 | LSTM, BERT | Static 静态 | ★★ | ★ |
|
|||||||||
| Wan et al. (Wan et al., 2023) 万等人 (Wan et al., 2023) |
Yes 是 | Yes 是 |
|
T5 | Static 静态 | ★ | ★ | It removes samples with the highest loss. | |||||||||
图 6. 部署 LLMs 时的威胁模型,其中恶意实体是用户。
6. The Risks and Countermeasures of Deploying LLMs
6. LLMs 部署的风险与对策
After pre-training or fine-tuning, LLM owners often make their models accessible to users through APIs, providing services such as question answering. To enhance response quality, they integrate popular deployment frameworks into their LLMs, such as in-context learning and RAG. A well-known example is OpenAI’s ChatGPT, which delivers high-quality question answering and human-AI interaction through system-level prompt engineering. Figure 6 illustrates an example of user interaction with a deployed model. As shown in Section 3.3, deploying LLMs faces only one threat model: malicious users inducing LLMs to return private or harmful responses. In this case, attackers can only access the LLMs’ APIs and modify the input prompts. We first introduce popular deployment frameworks for LLMs. Then, we present the privacy and security risks associated with deploying LLMs, providing a detailed discussion of the unique aspects of LLMs. Lastly, addressing each type of risk, we offer potential countermeasures and analyze their advantages and disadvantages through empirical evaluations.
在预训练或微调后,LLMs 所有者通常通过 API 将他们的模型提供给用户,提供问答等服务。为了提高响应质量,他们会将流行的部署框架集成到他们的 LLMs 中,例如情境学习和 RAG。一个著名的例子是 OpenAI 的 ChatGPT,它通过系统级的提示工程提供高质量的问答和人类-人工智能交互。图 6 展示了一个用户与部署模型交互的例子。如第 3.3 节所示,部署 LLMs 面临唯一的威胁模型:恶意用户诱导 LLMs 返回私人或有害的响应。在这种情况下,攻击者只能访问 LLMs 的 API 并修改输入提示。我们首先介绍 LLMs 的流行部署框架。然后,我们介绍与部署 LLMs 相关的隐私和安全风险,并提供对 LLMs 独特方面的详细讨论。最后,针对每种类型的风险,我们提供潜在的对策,并通过实证评估分析其优缺点。
6.1. Popular deployment frameworks
6.1 流行的部署框架
To improve response quality and task adaptability, researchers designed some deployment frameworks for LLMs, such as in-context learning and RAG. The frameworks can be applied to any LLM without the need for task-specific fine-tuning, as is required in methods like prompt tuning.
为了提高响应质量和任务适应性,研究人员为 LLMs 设计了一些部署框架,例如情境学习和 RAG。这些框架可以应用于任何 LLM,而无需像提示调整等方法那样进行特定任务的微调。
In-Context learning. It fully leverages the strong instruction-following capabilities of LLMs. By incorporating demonstrations or rules into the prompt, the LLM is allowed to produce desired outputs without parameter updates. When some input-output pairs are provided in the prompt, the framework is few-shot learning. Especially, if the demonstrations include intermediate reasoning steps, the LLM tends to replicate such reasoning in its responses, that is, CoT prompting. It can improve the performance of LLMs on complex tasks such as mathematical problems. In contrast, when only rules are given without demonstrations, the framework is zero-shot learning. ChatGPT’s users can embed predefined rules and demonstrations at the system level to create GPT models for specific roles or tasks.
情境学习。它充分利用了 LLMs 强大的指令跟随能力。通过将示例或规则包含在提示中,LLM 被允许在不更新参数的情况下生成期望的输出。当提示中提供了一些输入输出对时,该框架是少样本学习。特别是,如果示例包括中间推理步骤,LLM 倾向于在其响应中复制这种推理,即 CoT 提示。它可以提高 LLMs 在数学问题等复杂任务上的性能。相比之下,当仅给出规则而没有示例时,该框架是无样本学习。ChatGPT 的用户可以在系统级别嵌入预定义的规则和示例,以创建特定角色或任务的 GPT 模型。
图 7.RAG 工作流程。
RAG. LLMs face two primary challenges when generating responses. First, since training data contains incorrect or outdated data, queries involving such content can cause the LLM to generate misinformation based on unreliable knowledge. Second, when queries involve knowledge not seen in the training data, the LLM tends to present hallucinations. To address these limitations, RAG provides a lightweight and flexible framework that updates and extends the LLM’s knowledge, without the need for fine-tuning. For instance, medical institutions can leverage the framework to expand a foundation LLM, such as GPT-4, into the medical domain, thereby providing users with medical inquiry services. Figure 7 illustrates the RAG workflow. First, an external knowledge base is constructed according to the task requirement. Second, given an input, the framework employs a retriever to extract relevant information from the external knowledge base. Finally, the retrieved content is incorporated into the original input as context, enabling the LLM to generate high-quality responses based on the reliable knowledge and reduce the likelihood of hallucinations.
RAG。LLMs 在生成回复时面临两个主要挑战。首先,由于训练数据包含不正确或过时的数据,涉及此类内容的查询会导致 LLM 基于不可靠的知识生成错误信息。其次,当查询涉及训练数据中未出现过的知识时,LLM 倾向于产生幻觉。为了解决这些局限性,RAG 提供了一个轻量级且灵活的框架,该框架无需微调即可更新和扩展 LLM 的知识。例如,医疗机构可以利用该框架将基础 LLM(如 GPT-4)扩展到医疗领域,从而为用户提供医疗咨询服务。图 7 说明了 RAG 工作流程。首先,根据任务需求构建外部知识库。其次,给定一个输入,该框架采用检索器从外部知识库中提取相关信息。最后,检索到的内容被作为上下文合并到原始输入中,使 LLM 能够基于可靠的知识生成高质量的回复,并减少产生幻觉的可能性。
6.2. Privacy risks of deploying LLMs
6.2. LLMs 部署的隐私风险
Compared to small-scale models, these deployment frameworks integrate unique content into the input prompt, such as context demonstrations and retrieved knowledge. It causes a unique risk of LLMs in the deploying scenario, known as prompt stealing attacks. In addition, we explore privacy risks common to all language models, including reconstruction attacks (Morris et al., 2023), inference attacks (Mattern et al., 2023), data extraction attacks (Carlini et al., 2021), and model extraction attacks (Li et al., 2024b).
与小型模型相比,这些部署框架将独特内容集成到输入提示中,例如上下文示例和检索到的知识。这导致 LLMs 在部署场景中存在一种独特的风险,称为提示窃取攻击。此外,我们还探讨了所有语言模型共有的隐私风险,包括重构攻击(Morris 等人,2023 年)、推理攻击(Mattern 等人,2023 年)、数据提取攻击(Carlini 等人,2021 年)和模型提取攻击(Li 等人,2024b)。
6.2.1. Unique privacy risks for LLMs
6.2.1. LLMs 特有的隐私风险
Prompt stealing attacks. 提示窃取攻击。
Carefully designed prompts fully leverage the language understanding abilities of LLMs to generate high-quality content. Thus, attackers (i.e., malicious users) can use prompt engineering to steal previous queried prompts for profit, especially extracting context demonstrations and retrieved knowledge. As illustrated in Figure 8, interaction history leakage violates user privacy, while system prompt leakage infringes on intellectual property rights. Some researchers injected malicious commands into prompts to override their original commands, causing LLMs to leak these carefully designed prompts. Here are some malicious commands.
精心设计的提示充分利用 LLMs 的语言理解能力来生成高质量内容。因此,攻击者(即恶意用户)可以利用提示工程来窃取先前查询的提示以获取利益,特别是提取上下文示例和检索到的知识。如图 8 所示,交互历史泄露会侵犯用户隐私,而系统提示泄露会侵犯知识产权。一些研究人员将恶意命令注入提示中,以覆盖其原始命令,导致 LLMs 泄露这些精心设计的提示。这里有一些恶意命令。
Subsequently, Zhang et al. (Zhang et al., 2024a) proposed a measurement criterion for prompt stealing attacks. They designed two metrics: exact-match and approx-match. The former detected whether the extracted prompts contained the real secret words, while the latter used the Rouge-L recall rate to calculate the length of the longest common subsequence between two texts. They conducted experiments on 11 LLMs, finding that most malicious commands were effective. Jiang et al. (Jiang et al., 2024) proposed an agent-based knowledge stealing attack. They initiated adversarial queries to induce knowledge leakage from the RAG framework, and then leveraged reflection and memory mechanisms to iteratively optimize subsequent queries, thereby enabling large-scale extraction of private knowledge. In real-world RAG applications, such as GPTs and Coze platforms, this attack effectively stole the information uploaded to the knowledge base. Hui et al. (Hui et al., 2024) tried to steal system prompts from LLM-based applications, including rules and demonstrations in in-context learning. They employed incremental search to progressively optimize adversarial queries and aggregated responses from multiple queries to accurately recover the full prompt. Evaluated on 50 real-world applications on the Poe platform, they successfully extracted system prompts from 68% of them.
随后,张等人(Zhang et al., 2024a)提出了针对提示词窃取攻击的测量标准。他们设计了两个指标:精确匹配和近似匹配。前者检测提取的提示词是否包含真实的秘密词,而后者使用 Rouge-L 召回率来计算两个文本之间的最长公共子序列的长度。他们在 11 个 LLMs 上进行了实验,发现大多数恶意命令是有效的。姜等人(Jiang et al., 2024)提出了一种基于代理的知识窃取攻击。他们发起对抗性查询以诱导 RAG 框架中的知识泄露,然后利用反射和记忆机制迭代优化后续查询,从而实现大规模提取私有知识。在现实世界的 RAG 应用中,例如 GPTs 和 Coze 平台,这种攻击有效地窃取了上传到知识库的信息。惠等人(Hui et al., 2024)试图从基于 LLM 的应用中窃取系统提示词,包括情境学习中的规则和示例。 他们采用增量搜索逐步优化对抗性查询,并从多个查询中聚合响应以准确恢复完整提示。 在 Poe 平台上的 50 个实际应用中评估,他们成功从 68%的应用中提取了系统提示。
图 8. 提示窃取的细节,这与数据重建攻击不同。
6.2.2. Common privacy risks for all language models
6.2.2.所有语言模型的常见隐私风险
Reconstruction attacks. 重建攻击。
In this case, the attacker is a malicious third party that acquires embedding vectors or output results through eavesdropping. Such an attack attempts to reconstruct the input prompts based on the captured data. Morris et al. (Morris et al., 2024) found that the LLMs’ outputs had reversibility. They trained a conditional language model to reconstruct the input prompts based on the distribution probability over the next token. On Llama 2-7B, their method exactly reconstructed 25% of prompts. Additionally, the same team designed another state-of-the-art reconstruction attack (Morris et al., 2023). They collected the embedding vectors of some inputs and trained a decoder that could iteratively optimize the ordered sequence. Their method can recover 92% of the 32-token texts on mainstream embedding models, such as GTR-base and text-embedding-ada-002.
在这种情况下,攻击者是恶意第三方,通过窃听获取嵌入向量或输出结果。此类攻击尝试根据捕获的数据重建输入提示。Morris 等人(Morris 等人,2024)发现 LLMs 的输出具有可逆性。他们训练了一个条件语言模型,根据下一个 token 的分布概率来重建输入提示。在 Llama 2-7B 上,他们的方法精确地重建了 25%的提示。此外,同一团队设计了一种最先进的重建攻击(Morris 等人,2023)。他们收集了一些输入的嵌入向量,并训练了一个可以迭代优化有序序列的解码器。他们的方法可以在主流嵌入模型(如 GTR-base 和 text-embedding-ada-002)上恢复 92%的 32-token 文本。
Inference attacks. The output generated by LLMs can infer private information, including membership and attribute inference attacks. For the first attack, it distinguishes whether a sample is from the training set based on its robustness (He et al., 2025). Mattern et al. (Mattern et al., 2023) proposed a simple and effective membership inference attack for LLMs. They calculated the membership score by comparing the loss of target samples with that of neighboring samples. Wen et al. (Wen et al., 2024) investigated membership inference attacks against in-context learning, and designed two effective approaches. The first leveraged the prefix of the target text to induce the LLM to generate subsequent content, and then calculated the semantic similarity between the original text and the generated one. The second repeatedly offered incorrect options, assessing the LLM’s resistance to misinformation. The hybrid attack combined them and achieved over 95% inference accuracy on Llama 2-7B and Vicuna 7B. Another method adopts the shadow model, which depends on the unlimited query assumption. To overcome this challenge, Abascal et al. (Abascal et al., 2023) used only one shadow model for inference. They leveraged the k-nearest neighbors algorithm to train the attack model on a similar dataset, bypassing the unlimited query assumption.
推理攻击。LLMs 生成的输出可以推断出私人信息,包括成员资格和属性推理攻击。对于第一种攻击,它根据其鲁棒性区分样本是否来自训练集(He 等人,2025)。Mattern 等人(Mattern 等人,2023)提出了一种简单有效的针对 LLMs 的成员资格推理攻击。他们通过比较目标样本与邻近样本的损失来计算成员资格分数。Wen 等人(Wen 等人,2024)研究了针对情境学习的成员资格推理攻击,并设计了两种有效方法。第一种利用目标文本的前缀来诱导 LLM 生成后续内容,然后计算原始文本与生成文本之间的语义相似度。第二种反复提供错误选项,评估 LLM 对错误信息的抵抗力。混合攻击将两者结合,在 Llama 2-7B 和 Vicuna 7B 上实现了超过 95%的推理准确率。另一种方法采用影子模型,它依赖于无限查询假设。 为了克服这一挑战,Abascal 等人(Abascal 等人,2023)仅使用一个影子模型进行推理。 他们利用 k 近邻算法在一个相似数据集上训练攻击模型,绕过了无限查询的假设。
The second attack aims to infer attributes of the training dataset. Li et al. (Li et al., 2022) used embedding vectors to infer private attributes from chatbot-based models, such as GPT-2. They successfully inferred 4000 attributes, like occupations and hobbies. Then, Staab et al. (Staab et al., 2023) optimized prompts to induce the LLM to infer private attributes. They accurately obtained personal information (e.g., location, income, and gender) from 9 mainstream LLMs, such as GPT-4, Claude-2 and PaLM 2 Chat.
第二种攻击旨在推断训练数据集的属性。Li 等人(Li 等人,2022)使用嵌入向量从基于聊天机器人的模型(如 GPT-2)中推断私人属性。他们成功推断出 4000 个属性,例如职业和爱好。然后,Staab 等人(Staab 等人,2023)优化了提示词以诱导 LLM 推断私人属性。他们从 9 个主流 LLM(例如 GPT-4、Claude-2 和 PaLM 2 Chat)中准确获取了个人信息(例如位置、收入和性别)。
Data extraction attacks. LLMs are trained or fine-tuned on massive texts and tend to memorize this data. Malicious users can design a series of prompts to induce the model to regurgitate segments from the training set. Yu et al. (Yu et al., 2023) proposed several prefix and suffix extraction optimizations. They adjusted probability distributions and dynamic positional offsets, thereby improving the effectiveness of data extraction attacks. On GPT-Neo 1.3B, they extracted 513 suffixes from 1,000 training samples. Zhang et al. (Zhang et al., 2023) used prompt tuning and loss smoothing to optimized the embedding vectors of inputs, thus improving the generation probability of correct suffixes. Their attack achieved over 60% accuracy in extracting training data suffixes across GPT-Neo 1.3B and GPT-J 6B. Nasr et al. (Nasr et al., 2025) proposed a divergence attack to shift the safety guardrails of LLMs. For commercial LLMs such as Llama 2-65B and GPT-4, they demonstrated that alignment-tuning still posed a risk of data extraction.
数据提取攻击。LLMs 在大量文本上进行训练或微调,倾向于记忆这些数据。恶意用户可以设计一系列提示,诱导模型吐出训练集中的片段。Yu 等人(Yu 等人,2023)提出了几种前缀和后缀提取优化方法。他们调整了概率分布和动态位置偏移,从而提高了数据提取攻击的有效性。在 GPT-Neo 1.3B 上,他们从 1000 个训练样本中提取了 513 个后缀。Zhang 等人(Zhang 等人,2023)使用提示调整和损失平滑来优化输入的嵌入向量,从而提高了正确后缀的生成概率。他们的攻击在 GPT-Neo 1.3B 和 GPT-J 6B 上提取训练数据后缀的准确率超过 60%。Nasr 等人(Nasr 等人,2025)提出了一种发散攻击来移动 LLMs 的安全护栏。对于 Llama 2-65B 和 GPT-4 等商业 LLMs,他们证明了对齐调优仍然存在数据提取的风险。
Model extraction attacks. LLMs have high commercial value, where model-related information is the property of the model owner. Malicious users aim to steal this information from the responses, such as model hyperparameters and functionalities (Ye et al., 2025). This attack can exacerbate other privacy and security threats, such as membership inference and jailbreak attacks. Li et al. (Li et al., 2024b) constructed domain-specific prompts and queried the LLM. For example, by extracting the code synthesis capability from GPT-3.5 Turbo, they fine-tuned CodeBERT to achieve performance comparable to the original LLM. Ippolito et al. (Ippolito et al., 2023) introduced a method to distinguish between two types of decoding strategies: top-k and nucleus sampling. They crafted prompts targeting the victim LLM to induce known output distributions and analyzed token diversity across multiple queries to infer the decoding strategy and its parameters. Especially, they inferred that ChatGPT uses nucleus sampling and estimated the sampling parameter to be approximately 0.81. Similarly, Naseh et al. (Naseh et al., 2023) leveraged the unique fingerprints left by different decoding algorithms and hyperparameters to steal this information at a relatively low cost. Notably, this method successfully extracted the decoding algorithm and its hyperparameters from the Ada, Babbage, Curie, and Davinci variants of GPT-3 at a cost of only $0.8, $1, $4, and $40, respectively.
模型提取攻击。LLMs 具有很高的商业价值,其中模型相关信息是模型所有者的财产。恶意用户试图从响应中窃取这些信息,例如模型超参数和功能(Ye 等人,2025)。这种攻击可能加剧其他隐私和安全威胁,例如成员推理攻击和越狱攻击。Li 等人(Li 等人,2024b)构建了特定领域的提示并查询 LLM。例如,通过从 GPT-3.5 Turbo 中提取代码合成能力,他们微调 CodeBERT 以实现与原始 LLM 相当的性能。Ippolito 等人(Ippolito 等人,2023)介绍了一种区分两种解码策略的方法:top-k 和核采样。他们设计了针对受害 LLM 的提示以诱导已知的输出分布,并分析了跨多个查询的 token 多样性以推断解码策略及其参数。特别是,他们推断 ChatGPT 使用核采样,并估计采样参数 约为 0.81。 同样地,Naseh 等人(Naseh 等人,2023)利用不同解码算法和超参数留下的独特指纹,以相对较低的成本窃取这些信息。值得注意的是,这种方法成功从 GPT-3 的 Ada、Babbage、Curie 和 Davinci 变体中提取了解码算法及其超参数,成本分别为 0.8 美元、1 美元、4 美元和 40 美元。
6.3. Security risks of deploying LLMs
6.3.部署 LLMs 的安全风险
Compared to small-scale models, LLMs have unique safety guardrails that protect against harmful results. However, prompt injection attacks and jailbreak attacks can penetrate these guardrails, inducing LLMs to produce harmful content. Another unique risk is that the deployment frameworks expose opportunities for external adversaries (e.g., RAG providers) to poison prompts and manipulate the output of LLM-based applications. Lastly, we also explore adversarial example attacks as a security risk common to all language models. These risks underscore the ongoing challenges in ensuring the robustness of LLMs.
与小型模型相比,LLMs 具有独特的安全护栏来防止有害结果。然而,提示注入攻击和越狱攻击可以穿透这些护栏,导致 LLMs 生成有害内容。另一个独特风险是部署框架为外部对手(例如 RAG 提供者)提供了污染提示和操纵基于 LLM 的应用程序输出的机会。最后,我们还探讨了对抗样本攻击作为所有语言模型都面临的常见安全风险。这些风险突显了确保 LLMs 鲁棒性所面临的持续挑战。
6.3.1. Unique security risks for LLMs
6.3.1. LLMs 的独特安全风险
Prompt injection attacks.
提示注入攻击。
By injecting a malicious command into the prompt, the attacker can induce the LLM to ignore the original task and follow the injected command instead (Jiang et al., 2023; Liu et al., 2023b), as shown in Figure 9. Perez et al. (Perez and Ribeiro, 2022) noted prompt injection attacks can use the following commands to produce misleading content or leak prompts, posing serious threats to the safety of critical domains such as healthcare and finance.
通过向提示中注入恶意命令,攻击者可以诱导 LLM 忽略原始任务,转而执行注入的命令(Jiang 等人,2023 年;Liu 等人,2023b 年),如图 9 所示。Perez 等人(Perez 和 Ribeiro,2022 年)指出,提示注入攻击可以使用以下命令生成误导性内容或泄露提示,对医疗保健和金融等关键领域的安全构成严重威胁。
Moreover, Liu et al. (Liu et al., 2023a) divided prompt injection into three components: a framework component to mimic legitimate input, a separator component to override original contexts, and a disruptor component to inject malicious instructions. Applied to 31 LLM-based applications, including Notion, this method successfully achieved multiple goals, such as system prompt leaking, content manipulation, and spam generation. Currently, many LLM-based applications provide opportunities for malicious users to launch such attacks by injecting malicious commands into data sources, like web pages and emails (Greshake et al., 2023).
此外,刘等人(Liu 等人,2023a)将提示注入分为三个组成部分:一个框架组件来模拟合法输入,一个分隔符组件来覆盖原始上下文,以及一个破坏器组件来注入恶意指令。应用于包括 Notion 在内的 31 个基于 LLM 的应用程序,这种方法成功地实现了多个目标,例如系统提示泄露、内容操纵和垃圾邮件生成。目前,许多基于 LLM 的应用程序通过将恶意命令注入数据源(如网页和电子邮件)为恶意用户提供了发起此类攻击的机会(Greshake 等人,2023)。
图 9. 提示注入和越狱攻击的细节,其中基于 LLM 的应用程序具有安全护栏。
Jailbreak attacks. Most LLMs use alignment tuning to construct safety guardrails that prevent them from generating harmful content (Carlini et al., 2023). To overcome this, jailbreak attacks are implemented through carefully designed prompts rather than simple malicious injections, as illustrated in Figure 9. There are two types of jailbreak attacks: single-step and multi-step. For single-step jailbreaks, attackers target a single query. Some researchers found that role-playing instructions can weaken the safety guardrail (Shanahan et al., 2023; Liu et al., 2024; Shen et al., 2024a), enhancing the effectiveness of jailbreak attacks.
越狱攻击。大多数 LLMs 使用对齐调优来构建安全护栏,以防止它们生成有害内容(Carlini 等人,2023 年)。为了克服这一点,越狱攻击通过精心设计的提示来实现,而不是简单的恶意注入,如图 9 所示。越狱攻击有两种类型:单步和多步。对于单步越狱攻击,攻击者针对单个查询。一些研究人员发现,角色扮演指令可以削弱安全护栏(Shanahan 等人,2023 年;Liu 等人,2024 年;Shen 等人,2024a),从而提高了越狱攻击的有效性。
Such content involves violence, hate, pornography, and terrorism, and even includes illegal instructions, such as hacking or weapon making, posing a serious threat to public safety. Yuan et al. (Yuan et al., 2024) adopted encrypted prompts (e.g., Caesar ciphers) to bypass content filters while inducing malicious outputs from GPT-4. Beyond manually creating jailbreak prompts, Yu et al. (Yu et al., 2024) combined fuzzing frameworks with jailbreak attacks. They found commercial LLMs, such as Claude-2, remained vulnerable to jailbreak attacks. Inspired by adversarial example attacks, Zou et al. (Zou et al., 2023) combined greedy and gradient-based search algorithms to craft advanced jailbreak prompts. Their method can transfer to GPT-3.5, GPT-4, and Vicuna-based models. Following this, Wei et al. (Wei et al., 2023) generated adversarial prompts, such as harmless prefixes, that can bypass the safety guardrails of Mistral-7b and Llama 2-7b. Deng et al. (Deng et al., 2024) even used reverse engineering to locate potential defenses in ChatGPT and Bing Chat. They then leveraged external LLMs to craft jailbreak prompts, achieving an attack success rate of 21.58% on these chatbots. For multi-step jailbreaks, attackers focus on multi-round interaction scenarios. Inspired by CoT, Li et al. (Li et al., 2023a) broke down the target task into multiple steps, constructing jailbreak prompts at each step to gradually achieve malicious goals. They found the multi-step jailbreak was more effective than single-step attacks at inducing ChatGPT to generate harmful content.
此类内容涉及暴力、仇恨、色情和恐怖主义,甚至包括非法指令,如黑客攻击或武器制造,对公共安全构成严重威胁。Yuan 等人(Yuan 等人,2024)采用加密提示(例如凯撒密码)来绕过内容过滤器,同时诱导 GPT-4 产生恶意输出。除了手动创建越狱提示外,Yu 等人(Yu 等人,2024)将模糊测试框架与越狱攻击相结合。他们发现商业 LLMs,如 Claude-2,仍然容易受到越狱攻击。受对抗样本攻击的启发,Zou 等人(Zou 等人,2023)结合贪婪和基于梯度的搜索算法来制作高级越狱提示。他们的方法可以迁移到 GPT-3.5、GPT-4 和基于 Vicuna 的模型。随后,Wei 等人(Wei 等人,2023)生成了对抗性提示,如无害的前缀,可以绕过 Mistral-7b 和 Llama 2-7b 的安全护栏。Deng 等人(Deng 等人,2024)甚至使用逆向工程来定位 ChatGPT 和 Bing Chat 中的潜在防御措施。然后他们利用外部 LLMs 来制作越狱提示,在这些聊天机器人上实现了 21.58%的攻击成功率。 对于多步越狱攻击,攻击者专注于多轮交互场景。受 CoT 启发,Li 等人(Li 等人,2023a)将目标任务分解为多个步骤,在每一步构建越狱提示,逐步实现恶意目标。他们发现,与单步攻击相比,多步越狱攻击更能诱导 ChatGPT 生成有害内容。
Poison deployment frameworks. These LLM frameworks introduce unique security threats. Specifically, external attackers, such as knowledge base providers in RAG or system prompt providers in in-context learning, can inject poisoned content into the provided data to manipulate LLM outputs. This may lead to misinformation, negative content, or phishing links, thereby misleading users. For in-context learning, Zhang et al. (Zhang et al., 2024c) embedded backdoor instructions into system prompts, implanting backdoors into customized ChatGPTs with minimal effort. Here is an instance.
中毒部署框架。这些 LLM 框架带来了独特的安全威胁。具体来说,外部攻击者,例如 RAG 中的知识库提供者或在情境学习中的系统提示提供者,可以将中毒内容注入提供的数据中,以操纵 LLM 输出。这可能导致错误信息、负面内容或钓鱼链接,从而误导用户。对于情境学习,Zhang 等人(Zhang 等人,2024c)将后门指令嵌入系统提示中,以最小的努力将后门植入定制的 ChatGPT 中。这里有一个实例。
In addition, Zou et al. (Zou et al., 2024) constructed poisoned texts related to the target entity to contaminate the RAG knowledge base. Given specific prompts, the poisoned knowledge is retrieved to serve as context, misleading the LLM’s responses. An example is below:
此外,Zou 等人(Zou 等人,2024)构建了与目标实体相关的中毒文本,以污染 RAG 知识库。在特定提示下,中毒知识被检索用作上下文,误导 LLM 的响应。以下是一个例子:
6.3.2. Common security risks for all language models
6.3.2.所有语言模型的常见安全风险
Adversarial example attacks targeting output utility are a security threat that all language models face. Specifically, attackers create imperceptible perturbations and add them to inputs to affect output results. This attack typically involves four steps: selecting benchmark inputs, constructing adversarial perturbations, assessing model outputs, and iterative optimization. Sadrizadeh et al. (Sadrizadeh et al., 2023) attempted adversarial example attacks on machine translation tasks. They used gradient projection and polynomial optimization to maintain semantic similarity between adversarial examples and clean samples. Maus et al. (Maus et al., 2023) proposed a black-box algorithm to generate adversarial prompts, making Vicuna 13B return confusing texts.
针对输出效用的对抗样本攻击是所有语言模型面临的安全威胁。具体来说,攻击者创建难以察觉的扰动并将其添加到输入中,以影响输出结果。这种攻击通常涉及四个步骤:选择基准输入、构建对抗扰动、评估模型输出和迭代优化。Sadrizadeh 等人(Sadrizadeh 等人,2023)尝试在机器翻译任务上实施对抗样本攻击。他们使用梯度投影和多项式优化来保持对抗样本与干净样本之间的语义相似性。Maus 等人(Maus 等人,2023)提出了一种黑盒算法来生成对抗提示,使 Vicuna 13B 返回令人困惑的文本。
6.4. Countermeasures of deploying LLMs
6.4.部署 LLM 的对策
6.4.1. Privacy protection 6.4.1. 隐私保护
Output detection and processing.
输出检测与处理。
It aims to mitigate privacy leaks by detecting the output results. Some researchers used meta-classifiers or rule-based detection schemes to identify private information. Moreover, Cui et al. (Cui et al., 2024) believed that protecting private information needs to balance the privacy and utility of outputs. In medical scenarios, diagnostic results inherently contain users’ private information that should not be filtered out. Besides, other potential privacy protection methods focus on the LLM itself.
其目的是通过检测输出结果来减轻隐私泄露。一些研究人员使用元分类器或基于规则的检测方案来识别私人信息。此外,Cui 等人(Cui 等人,2024)认为保护私人信息需要在输出的隐私和效用之间取得平衡。在医疗场景中,诊断结果本质上包含用户的私人信息,这些信息不应被过滤掉。此外,其他潜在的隐私保护方法侧重于 LLM 本身。
Differential privacy. In Section 4.3.1, we introduced the differential privacy methods during the pre-training phase. This part mainly discusses the differential privacy methods used in the fine-tuning and inference phases. Shi et al. (Shi et al., 2022) proposed a selective differential privacy algorithm to protect sensitive data. They implemented a privacy-preserving fine-tuning process for LSTM models. Their experiments indicated that this method maintained LLM utility while effectively mitigating advanced data extraction attacks. Tian et al. (Tian et al., 2022) integrated the private aggregation of teacher ensembles with differential privacy. They trained a student model using the outputs of teacher models, thereby protecting the privacy of training data. Additionally, this method filtered candidates and adopted an efficient knowledge distillation strategy to achieve a good privacy-utility trade-off. It effectively protected GPT-2 against data extraction attacks.
差分隐私。在 4.3.1 节中,我们介绍了预训练阶段使用的差分隐私方法。这部分主要讨论了微调和推理阶段使用的差分隐私方法。Shi 等人(Shi 等人,2022)提出了一种选择性差分隐私算法来保护敏感数据。他们为 LSTM 模型实现了一个隐私保护的微调过程。他们的实验表明,该方法在保持 LLM 效用性的同时,有效地缓解了高级数据提取攻击。Tian 等人(Tian 等人,2022)将教师模型的私有聚合与差分隐私相结合。他们使用教师模型的输出来训练学生模型,从而保护了训练数据的隐私。此外,该方法筛选了候选者并采用了一种高效的知识蒸馏策略,以实现良好的隐私-效用权衡。它有效地保护了 GPT-2 免受数据提取攻击。
Majmudar et al. (Majmudar et al., 2022) introduced differential privacy into the inference phase. They calculated the perturbation probabilities and randomly sampled the -th token from the vocabulary. Subsequently, Duan et al. (Duan et al., 2024) combined differential privacy with knowledge distillation to enhance privacy protection for prompt tuning scenarios. They extended this method to the black-box setting, making it applicable to GPT-3 and Claude. Their experiments indicated that it can mitigate membership inference attacks.
Majmudar 等人(Majmudar 等人,2022)将差分隐私引入推理阶段。他们计算了扰动概率,并从词汇表中随机采样第 个标记。随后,Duan 等人(Duan 等人,2024)将差分隐私与知识蒸馏相结合,以增强提示调整场景下的隐私保护。他们将该方法扩展到黑盒设置,使其适用于 GPT-3 和 Claude。他们的实验表明,它可以减轻成员推理攻击。
Alignment tuning. The safety guardrail of LLMs will reduce the risk of privacy leaks. Specifically, defenders can use the RLHF fine-tuning scheme to penalize outputs that leak private information. For example, Xiao et al. (Xiao et al., 2024) leveraged alignment tuning with both positive (privacy-preserving) and negative (non-preserving) examples, the LLM learned to retain domain knowledge while minimizing sensitive data leakage. Experiments on Llama 2-7B and Llama 2-13B demonstrated that the safety guardrail can reduce sensitive information leakage by around 40%.
对齐调优。LLMs 的安全防护措施将降低隐私泄露的风险。具体来说,防御者可以使用 RLHF 微调方案来惩罚泄露私人信息的输出。例如,Xiao 等人(Xiao 等人,2024)利用了包含正面(保护隐私)和负面(不保护)示例的对齐调优,LLM 学会了在最小化敏感数据泄露的同时保留领域知识。在 Llama 2-7B 和 Llama 2-13B 上的实验表明,安全防护措施可以将敏感信息泄露减少约 40%。
Secure computing. During the inference phase, neither model owners nor users want their sensitive information to be stolen. On the one hand, users do not allow semi-honest model owners to access their inputs containing private information. On the other hand, model information is intellectual property that needs to be protected from inference attacks and extraction attacks. Chen et al. (Chen et al., 2022) applied homomorphic encryption to perform privacy-preserving inference on the BERT model. However, this scheme consumes many computational resources and reduces model performance. To address these challenges, Dong et al. (Dong et al., 2023) used secure multi-party computation to implement forward propagation without accessing the plain texts. They performed high-precision fitting for exponential and GeLU operations through piecewise polynomials. The method successfully performed privacy-preserving inference on LLMs like Llama 2-7B. Although existing secure computing techniques for LLMs still face challenges in terms of performance and cost, their prospects remain promising.
安全计算。在推理阶段,模型所有者和用户都不希望他们的敏感信息被窃取。一方面,用户不允许半诚实模型所有者访问包含私人信息的输入。另一方面,模型信息是知识产权,需要防止推理攻击和提取攻击。Chen 等人(Chen 等人,2022)应用同态加密在 BERT 模型上执行隐私保护推理。然而,该方案消耗大量计算资源并降低模型性能。为解决这些挑战,Dong 等人(Dong 等人,2023)使用安全多方计算实现无需访问明文的正向传播。他们通过分段多项式对指数和 GeLU 运算进行高精度拟合。该方法成功在 Llama 2-7B 等 LLMs 上执行隐私保护推理。尽管现有的 LLMs 安全计算技术在性能和成本方面仍面临挑战,但其前景依然充满希望。
表 6. 针对 LLMs 部署中隐私风险的潜在保护方法比较。
| Countermeasures 应对措施 | Specific Method 具体方法 | Defender Capacity 防御能力 | Applicable 适用 | Targeted Risk 针对性风险 | Effectiveness 有效性 | Overhead | Idea | Disadvantage | ||||||||||||||
| Model | Training data 训练数据 | LLM | Task 任务 | |||||||||||||||||||
|
Common 常见 | No 不 | No 不 | All 全部 | \ | Data extraction attack 数据提取攻击 | ★★ | ★ |
|
It is easily bypassed. | ||||||||||||
|
Shi et al. (Shi et al., 2022) 石等人(Shi et al., 2022) |
Yes 是 | Yes 是 | LSTM |
|
Data extraction attack 数据提取攻击 | ★★ | ★ |
|
|
||||||||||||
| Tian et al. (Tian et al., 2022) 田等人(Tian 等人,2022 年) |
Yes 是 | Yes 是 | GPT-2 |
|
|
★★★ | ★ |
|
|
|||||||||||||
| Majmudar et al. (Majmudar et al., 2022) Majmudar 等人 (Majmudar et al., 2022) |
Yes 是 | No 不 | RoBERTa |
|
|
★ | ★ |
|
|
|||||||||||||
| Duan et al. (Duan et al., 2024) 段等 (Duan et al., 2024) |
Yes 是 | No 不 |
|
|
|
★★ | ★ |
|
|
|||||||||||||
|
Xiao et al. (Xiao et al., 2024) Xiao 等人 (Xiao et al., 2024) |
Yes 是 | No 不 | Llama 2-7B/13B, |
|
|
★★ | ★★★ |
|
|
||||||||||||
|
Chen et al. (Chen et al., 2022) 陈等人(Chen 等人,2022) |
Yes 是 | No 不 | BERT-tiny |
|
Reconstruction attacks 重构攻击 | ★★★ | ★★★ |
|
|
||||||||||||
| Dong et al. (Dong et al., 2023) Dong 等人(Dong 等人,2023) |
Yes 是 | No 不 |
|
|
Reconstruction attacks 重构攻击 | ★★★ | ★★★ |
|
|
|||||||||||||
6.4.2. Security defense 6.4.2.安全防御
Output detection and processing.
输出检测与处理。
Some researchers detect and process malicious outputs during the generation phase. Deng et al. (Deng et al., 2024) proved ChatGPT and Bing Chat have defense mechanisms, including keyword and semantic detection. In addition, companies like Microsoft and NVIDIA have developed various detectors for harmful content. However, the training data limits classifier-based detection schemes, and adaptive jailbreak attacks can bypass them (Yang et al., 2024b). To improve detection performance, OpenAI and Meta employ GPT-4 and Llama 2 to detect harmful content. Then, Wu et al. (Wu et al., 2024a) extracted the representation of the last generated token to detect harmful output, demonstrating stronger robustness against many jailbreak attacks than classifier-based methods.
一些研究人员在生成阶段检测和处理恶意输出。邓等人(Deng et al., 2024)证明了 ChatGPT 和 Bing Chat 具有防御机制,包括关键词和语义检测。此外,像微软和英伟达这样的公司已经开发了各种有害内容的检测器。然而,训练数据限制了基于分类器的检测方案,而自适应越狱攻击可以绕过它们(Yang et al., 2024b)。为了提高检测性能,OpenAI 和 Meta 使用 GPT-4 和 Llama 2 来检测有害内容。然后,吴等人(Wu et al., 2024a)提取了最后生成标记的表示来检测有害输出,这比基于分类器的方法对许多越狱攻击表现出更强的鲁棒性。
Prompt engineering. Some researchers aim to eliminate the malicious goals of prompts by prompt engineering, resulting in valuable and harmless responses. Li et al. (Li et al., 2023b) designed a purification scheme. They introduced random noise into prompts and reconstructed them using a BERT-based mask language model. On BERT and RoBERTa models, the defense reduced the success rate of strong adversarial attacks to approximately 50%. Robey et al. (Robey et al., 2023) found that jailbreak prompts are vulnerable to character-level perturbations. Therefore, they randomly perturbed multiple prompt copies and identified texts with high entropy as infected prompts. On mainstream LLMs such as GPT-4 and Claude-2, this method effectively defeated various jailbreak attacks while maintaining efficiency and task performance. Wei et al. (Wei et al., 2023) inserted a small number of defensive demonstrations into the prompts, mitigating jailbreak attacks and backdoor attacks.
提示工程。一些研究人员通过提示工程来消除提示的恶意目标,从而产生有价值且无害的响应。Li 等人(Li 等人,2023b)设计了一种净化方案。他们将随机噪声引入提示,并使用基于 BERT 的掩码语言模型进行重建。在 BERT 和 RoBERTa 模型上,这种防御将强对抗攻击的成功率降低了约 50%。Robey 等人(Robey 等人,2023)发现,越狱提示容易受到字符级扰动的影响。因此,他们随机扰动多个提示副本,并将熵值高的文本识别为受感染的提示。在 GPT-4 和 Claude-2 等主流 LLMs 上,这种方法有效地击败了各种越狱攻击,同时保持了效率和任务性能。Wei 等人(Wei 等人,2023)在提示中插入少量防御性示例,以减轻越狱攻击和后门攻击。
Robustness training. Developers can control the training process to defend against various security attacks. Currently, most LLMs establish safety guardrails through the RLHF technology, protecting against jailbreak attacks (Bai et al., 2022). Bianchi et al. (Bianchi et al., 2023) constructed a few hundred safety instructions to improve the safety of Llama models. However, this method cannot fully defeat advanced jailbreak attacks, and excessive safety instructions may lead the LLM to over-reject harmless inputs. Sun et al. (Sun et al., 2024b) argued that alignment tuning with human supervision was too costly. They leveraged another LLM to generate high-quality alignment instructions, constructing safety guardrails with minimal human supervision. They improved the safety of an unaligned Llama 2-65B to a level comparable to commercial LLMs like ChatGPT, using fewer than 300 lines of human annotations.
鲁棒性训练。开发者可以控制训练过程以防御各种安全攻击。目前,大多数 LLMs 通过 RLHF 技术建立安全护栏,以防范越狱攻击(Bai 等人,2022)。Bianchi 等人(Bianchi 等人,2023)构建了几百条安全指令以提高 Llama 模型的安全性。然而,这种方法无法完全防御高级越狱攻击,过多的安全指令可能导致 LLM 过度拒绝无害输入。Sun 等人(Sun 等人,2024b)认为,在人类监督下进行对齐调优成本过高。他们利用另一个 LLM 生成高质量的对齐指令,以极少量的人类监督构建安全护栏。他们使用不到 300 行的人类标注,将一个未对齐的 Llama 2-65B 的安全性提升到与 ChatGPT 等商业 LLMs 相当的水平。
Watermarking. To mitigate the misuse of LLMs, researchers aim to use watermarking techniques to identify whether a given text was generated by a specific LLM. Zhang et al. (Zhang et al., 2024b) designed a post-hoc watermarking method. They mixed the generated text with binary signatures in the feature space, and then used the Gumbel-Softmax function during the encoding phase to transform the generated dense distribution into a sparse distribution. This method can significantly enhance the coherence and semantic integrity of watermarked texts, achieving a trade-off between utility and watermarking effectiveness. Kirchenbauer et al. (Kirchenbauer et al., 2023) directly returned watermarked texts instead of modifying output results. They divided the vocabulary into red and green lists based on a random seed, encouraging the LLM to choose tokens from the green list. Then, users who know the partition mode can implement the verification by calculating the number of green tokens in the generated text. Additionally, some researchers used watermarking techniques to safeguard the intellectual property of LLMs. Peng et al. (Peng et al., 2023) used backdoor attacks to inject watermarks into customized LLMs. Subsequently, the model owners can efficiently complete verification by checking the backdoor effect.
水印技术。为了减轻 LLMs 的滥用,研究人员试图使用水印技术来识别给定的文本是否由特定的 LLM 生成。Zhang 等人(Zhang 等人,2024b)设计了一种事后水印方法。他们将生成的文本与特征空间中的二进制签名混合,然后在编码阶段使用 Gumbel-Softmax 函数将生成的密集分布转换为稀疏分布。这种方法可以显著增强水印文本的连贯性和语义完整性,在实用性和水印效果之间实现了权衡。Kirchenbauer 等人(Kirchenbauer 等人,2023)直接返回水印文本,而不是修改输出结果。他们根据随机种子将词汇表分为红色和绿色列表,鼓励 LLM 从绿色列表中选择标记。然后,知道分区模式的使用者可以通过计算生成文本中绿色标记的数量来实现验证。此外,一些研究人员使用水印技术来保护 LLMs 的知识产权。 彭等人(Peng et al., 2023)使用后门攻击将水印注入定制化的 LLMs。随后,模型所有者可以通过检查后门效应来高效完成验证。
表 7. 针对 LLMs 部署中安全风险的潜在防御措施的比较
| Countermeasures 应对措施 | Specific Method 具体方法 | Defender Capacity 防御能力 | Applicable 适用 | Targeted Risk 针对性风险 | Effectiveness 有效性 | Overhead | Idea | Disadvantage | |||||||||||||||||||
| Model | Training data 训练数据 | LLM | Task 任务 | ||||||||||||||||||||||||
|
Common 常见 | No 不 | No 不 | All 全部 | \ | Jailbreak attack 越狱攻击 | ★★ | ★ |
|
It is easily bypassed. | |||||||||||||||||
| Wu et al. (Wu et al., 2024a) 吴等人(Wu 等人,2024a) |
Yes 是 | No 不 |
|
Conversation 对话 |
|
★★★ | ★★ |
|
|
||||||||||||||||||
|
Li et al. (Li et al., 2023b) 李等人 (Li et al., 2023b) |
No 不 | No 不 |
|
Classfication 分类 |
|
★★ | ★★ |
|
|
|||||||||||||||||
| Robey et al. (Robey et al., 2023) Robey 等人(Robey et al., 2023) |
No 不 | No 不 |
|
Conversation 对话 | Jailbreak attack 越狱攻击 | ★★★ | ★ |
|
|
||||||||||||||||||
| Wei et al. (Wei et al., 2023) 魏等人 (魏等人, 2023) |
No 不 | No 不 |
|
Conversation 对话 | Jailbreak attack 越狱攻击 | ★★ | ★ |
|
|
||||||||||||||||||
|
Bianchi et al. (Bianchi et al., 2023) Bianchi 等人(Bianchi 等人,2023) |
Yes 是 | Yes 是 |
|
|
|
★★ | ★★★ |
|
|
|||||||||||||||||
| Sun et al. (Sun et al., 2024b) 孙等人(Sun 等人,2024b) |
Yes 是 | Yes 是 |
|
Conversation 对话 |
|
★★ | ★★★ |
|
|
||||||||||||||||||
| Watermarking 水印 | Zhang et al. (Zhang et al., 2024b) 张等人(Zhang et al., 2024b) |
Yes 是 | No 不 |
|
|
|
★★★ | ★ |
|
|
|||||||||||||||||
|
Yes 是 | No 不 |
|
|
|
★★★ | ★ |
|
|
||||||||||||||||||
| Peng et al. (Peng et al., 2023) Peng 等人(Peng 等人,2023) |
Yes 是 | No 不 |
|
Classfication 分类 |
|
★★★ | ★ |
|
|
||||||||||||||||||
图 10. 部署基于 LLM 的代理中的两种威胁模型,其中恶意实体是用户和代理。
7. The Risks and Countermeasures of Deploying LLM-based Agents
7. 部署基于 LLM 的代理的风险与对策
In addition to deploying simple applications based on LLMs, many researchers are exploring their potential by building intelligent agent systems, known as LLM-based agents. These systems integrate memory and tool modules, as shown in Figure 10. Typically, the memory module stores dialogue history and long-term knowledge in a database, leveraging embedding-based retrieval to enable contextual understanding. The other module integrates various tools such as search engines and file systems, allowing the agent to interact with external environments, including web pages, local files, and other agents. When interacting with humans, LLM-based agents can understand natural language instructions and call the required modules to autonomously accomplish complex actions, such as generating slides, rather than passively responding to user queries. Currently, there are two application scenarios: single-agent and multi-agent systems. A multi-agent system consists of many LLM-based agents, each responsible for a specific task or role. For example, a health management system includes a data collection agent, an analysis agent, a report generation agent, and an interaction agent. As illustrated in Figure 10, deploying LLM-based agents involves two malicious entities: users and agents. First, malicious users can manipulate prompts to launch PII leakage and jailbreak attacks. Second, malicious agents may carry hidden backdoors. In addition, interactions between agents also pose privacy and security risks, such as unauthorized interactions and agent contamination. We detail these risks and corresponding countermeasures, aiming to identify promising defense directions through empirical evaluation.
除了部署基于 LLMs 的简单应用程序外,许多研究人员通过构建智能代理系统来探索其潜力,这些系统被称为基于 LLM 的代理。这些系统集成了记忆模块和工具模块,如图 10 所示。通常,记忆模块在数据库中存储对话历史和长期知识,利用基于嵌入的检索来支持上下文理解。另一个模块集成了各种工具,如搜索引擎和文件系统,允许代理与外部环境交互,包括网页、本地文件和其他代理。在与人交互时,基于 LLM 的代理可以理解自然语言指令并调用所需的模块来自主完成复杂操作,例如生成幻灯片,而不是被动地响应用户查询。目前有两种应用场景:单代理和多代理系统。多代理系统由许多基于 LLM 的代理组成,每个代理负责特定的任务或角色。例如,一个健康管理系统包括数据收集代理、分析代理、报告生成代理和交互代理。 如图 10 所示,部署基于 LLM 的代理涉及两个恶意实体:用户和代理。首先,恶意用户可以操纵提示词以发起 PII 泄露和越狱攻击。其次,恶意代理可能携带隐藏的后门。此外,代理之间的交互也带来隐私和安全风险,例如未经授权的交互和代理污染。 我们详细阐述这些风险及相应的对策,旨在通过实证评估确定有前景的防御方向。
图 11. 非法交互的细节。恶意用户可以从其中一个代理处窃取敏感数据,从而危及多代理系统的隐私。
7.1. Privacy risks of deploying LLM-based agents
7.1. 部署基于 LLM 的代理的隐私风险
Since LLMs are the backbone of agents, deploying LLM-based agents also faces prompt stealing and various common privacy attacks, as discussed in Section 6.2. Here, we merely focus on privacy risks unique to LLM-based agents.
由于 LLM 是代理的核心,部署基于 LLM 的代理也面临着在第 6.2 节中讨论的提示词窃取和各种常见隐私攻击。在这里,我们仅关注基于 LLM 的代理特有的隐私风险。
Memory stealing attacks. In addition to training data and system prompts, the interaction history and long-term knowledge stored in the memory module are also vulnerable to theft. Malicious users can optimize prompts under the black-box setting to extract sensitive information, such as personal contact information and medical records, from this module. Wang et al. (Wang et al., 2025a) divided the attacking prompt into two components: a locator to guide the agent in retrieving historical information, and an aligner to ensure the output format matches the agent’s workflow. Experiments indicated that with only 30 attacking prompts, up to 50 and 26 user queries can be extracted from two real-world agents. Similarly, certain attacks capable of extracting private knowledge from RAG systems can be applied to steal long-term knowledge from the agent system, as discussed in Section 6.2.
内存窃取攻击。除了训练数据和系统提示词外,存储在内存模块中的交互历史和长期知识也容易受到盗窃。恶意用户可以在黑盒环境下优化提示词,从该模块中提取敏感信息,例如个人联系信息和医疗记录。Wang 等人(Wang 等人,2025a)将攻击提示词分为两个部分:一个定位器用于指导代理检索历史信息,一个对齐器用于确保输出格式与代理的工作流程匹配。实验表明,仅用 30 个攻击提示词,就可以从两个真实世界的代理中提取高达 50 和 26 个用户查询。同样,某些能够从 RAG 系统中提取私人知识的攻击可以应用于从代理系统中窃取长期知识,如第 6.2 节所述。
Unauthorized access.
An additional risk arises when users query third-party LLMs or LLM-based agents to process highly confidential information, such as HIPAA (Act et al., 1996) protected data. Both the storage and transmission of such data to external systems may constitute unauthorized access or disclosure under these regulations. For example, OmniGPT, a third-party LLM, had no malicious intent but was compromised by hackers, which resulted in the leakage of more than 34 million conversation records. In the multi-agent system, different agents undertake various roles and permissions. When performing collaborative tasks, multiple agents will share and process private information. An attacker can steal this information across the system by compromising one agent, as shown in Figure 11. Similarly, Li et al. (Li et al., 2024c) found that data transmission between agents can cause some to access sensitive data beyond their permission scope or expose such data to unauthorized agents. In addition, the agent interactions are massive and not transparent, so it is hard to supervise the generated information. Facing the risks posed by such unauthorized access, it is essential to strictly control and transparently manage access to confidential information.
未经授权的访问。当用户查询第三方 LLM 或基于 LLM 的代理来处理高度机密信息时,会带来额外的风险,例如 HIPAA(法案等,1996 年)保护的数据。根据这些法规,将此类数据存储或传输到外部系统可能构成未经授权的访问或披露。例如,第三方 LLM OmniGPT 没有恶意意图,但被黑客攻破,导致超过 3400 万条对话记录泄露。在多代理系统中,不同的代理承担不同的角色和权限。在执行协作任务时,多个代理会共享和处理私人信息。攻击者可以通过攻破一个代理来窃取系统中的这些信息,如图 11 所示。类似地,Li 等人(Li 等人,2024c)发现,代理之间的数据传输会导致某些代理访问超出其权限范围的敏感数据,或将此类数据暴露给未经授权的代理。此外,代理交互是大规模且不透明的,因此难以监督生成的信息。 面对此类未经授权访问带来的风险,严格控制和透明化管理对机密信息的访问至关重要。
7.2. Security risks of deploying LLM-based agents
7.2. 基于 LLM 的代理部署的安全风险
Similar to the security risks discussed in Section 6.3, deploying LLM-based agents also faces jailbreak attacks and backdoor attacks. Here, we focus on the security risks unique to LLM-based agents.
与第 6.3 节讨论的安全风险类似,部署基于 LLM 的代理也面临越狱攻击和后门攻击。这里,我们重点关注基于 LLM 的代理特有的安全风险。
Unique security attacks against LLM-based agents. These attacks attempt to hijack or insert the desired actions, such as deleting calendar events. Li et al. (Li et al., 2024c) found that prompt injection attacks could induce the LLM-based agents to perform malicious actions. Here is an example.
针对基于 LLM 的代理的独特安全攻击。这些攻击试图劫持或插入期望的操作,例如删除日历事件。Li 等人(Li 等人,2024c)发现,提示注入攻击可以诱导基于 LLM 的代理执行恶意操作。这里有一个例子。
These abnormal actions will impact users’ lives, such as leading to missed events or the spread of phishing websites. Yang et al. (Yang et al., 2024a) explored backdoor attacks for LLM-based agents, proposing query-attack, observation-attack and thought-attack. For the first two manners, once a query or an observation result contains a trigger, the backdoored agents perform predefined malicious actions (e.g., send scam messages). The third attack can alter the behavior of specific steps while preserving benign actions. For instance, a backdoored agent might invoke a specific tool (e.g., Google Translate) under a trigger prompt. Wang et al. (Wang et al., 2024b) integrated malicious tools into the agent system and invoked them using carefully designed prompts to implement prompt stealing and denial-of-service attacks.
这些异常行为将影响用户的生活,例如导致错过活动或传播钓鱼网站。Yang 等人(Yang 等人,2024a)探索了针对基于 LLM 的代理的后门攻击,提出了查询攻击、观察攻击和思维攻击。对于前两种方式,一旦查询或观察结果包含触发器,后门代理将执行预定义的恶意行为(例如,发送诈骗信息)。第三种攻击可以改变特定步骤的行为,同时保留良性行为。例如,一个后门代理可能在触发提示下调用特定工具(例如,谷歌翻译)。王等人(王等人,2024b)将恶意工具集成到代理系统中,并使用精心设计的提示来调用它们,以实现提示窃取和拒绝服务攻击。
图 12. 代理通信的细节,其中恶意代理可以影响其他代理。
Agent contamination. For the single-agent system, attackers can modify the role settings of victim agents, causing them to exhibit harmful behaviors, such as trojan code generation. Tian et al. (Tian et al., 2023) found that malicious agents can share harmful content with other agents, affecting their behavior in a domino effect, as displayed in Figure 12. This risk significantly increases the vulnerability of LLM-based agents within a communication network. Targeting the multi-agent system, Zhou et al. (Zhou et al., 2025) crafted a malicious prompt to trap a single agent in an infinite loop. The infected agent then propagated the prompt to others, eventually causing a complete system breakdown. Experiments indicated this attack can infect all ten GPT-4o-mini agents in under two dialogue turns.
代理污染。对于单代理系统,攻击者可以修改受害者代理的角色设置,导致它们表现出有害行为,例如生成木马代码。田等人(Tian 等人,2023)发现,恶意代理可以与其他代理共享有害内容,以多米诺效应影响它们的行为,如图 12 所示。这种风险显著增加了基于 LLM 的代理在通信网络中的脆弱性。针对多代理系统,周等人(Zhou 等人,2025)设计了一个恶意提示来使单个代理陷入无限循环。被感染的代理然后将提示传播给其他代理,最终导致整个系统崩溃。实验表明,这种攻击可以在不到两个对话回合内感染所有十个 GPT-4o-mini 代理。
7.3. Countermeasures of deploying LLM-based agents
7.3. 部署基于 LLM 的代理的对策
7.3.1. Privacy protection 7.3.1. 隐私保护
Potential defenses focus on the memory module and output results to address privacy leaks caused by malicious users. Defenders can employ corpus cleaning to filter out sensitive data from the memory module. For output results, defenders can implement filtering and detection processes to prevent sensitive information from being transmitted to other entities. As introduced in Section 6.4.1, both rule-based and classifier-based detection schemes can be applied. To address unauthorized access, authority management and contractual agreements with service providers offer viable solutions. Defenders can establish clear controls for private data access, setting specific access permissions for different roles within multi-agent systems. Huang et al. (Huang et al., 2025) designed a zero-trust identity framework that integrates decentralized identifiers and verifiable credentials to support dynamic fine-grained access control and session management, achieving task-level privacy isolation and minimal privilege. In addition, Chan et al. (Chan et al., 2024) integrated with individual agents to capture real-time inputs and outputs, extracted operation indicators, and trained a regression model for early prediction of downstream task performance. The framework enabled real-time response modification to mitigate privacy risks arising from unauthorized interactions.
潜在的防御措施集中在内存模块和输出结果上,以应对恶意用户造成的隐私泄露。防御者可以通过语料库清理从内存模块中过滤掉敏感数据。对于输出结果,防御者可以实施过滤和检测流程,防止敏感信息被传输给其他实体。如第 6.4.1 节所述,基于规则的检测方案和基于分类器的检测方案均可应用。为解决未经授权的访问问题,权限管理和与服务提供商的合同协议提供了可行的解决方案。防御者可以为私有数据访问建立明确的控制,为多智能体系统中的不同角色设置特定的访问权限。Huang 等人(Huang et al., 2025)设计了一个零信任身份框架,该框架整合了去中心化标识符和可验证凭证,以支持动态细粒度访问控制和会话管理,实现了任务级别的隐私隔离和最小权限。 此外,Chan 等人(Chan 等人,2024)将个体代理集成起来,以捕获实时输入和输出,提取操作指标,并训练回归模型以预测下游任务性能的早期结果。该框架能够实时修改响应,以减轻来自未经授权交互的隐私风险。
表 8. 针对部署基于 LLM 的代理所关联的隐私风险的保护方法比较。
| Countermeasures 应对措施 | Defender Capacity 防御能力 | Applicable 适用 | Targeted Risk 针对性风险 | Effectiveness 有效性 | Overhead 开销 | Idea 想法 | Disadvantage | ||||||||||||||||||
| Model | Training data 训练数据 | LLM | Task 任务 | ||||||||||||||||||||||
|
No 不 | No 不 | All 全部 | \ |
|
★★ | ★ |
|
It is easily bypassed. | ||||||||||||||||
|
Yes 是 | No 不 | AutoGen |
|
Unauthorized interaction 未经授权的交互 |
★★ | ★ |
|
|
||||||||||||||||
|
No 不 | No 不 |
|
|
Unauthorized interaction 未经授权的交互 |
★★★ | ★★ |
|
|
||||||||||||||||
7.3.2. Security defense 7.3.2.安全防御
Existing countermeasures focus on the input, the model, and the agent to address the security risks LLM-based agents face.
现有的防御措施主要针对输入、模型和代理,以应对基于 LLM 的代理面临的安全风险。
Input and output processing. As discussed in Section 6.4.2, defenders can process prompts to defeat jailbreak attacks targeting LLM-based agents. For instance, they can use templates to restrict the structure of prompts, thereby reducing the impact of jailbreak prompts (Greshake et al., 2023).
输入和输出处理。如第 6.4.2 节所述,防御者可以处理提示词以抵御针对基于 LLM 的代理的越狱攻击。例如,他们可以使用模板来限制提示词的结构,从而减少越狱提示词的影响(Greshake 等人,2023)。
With this template, even if the input contains a malicious instruction, the LLM interprets it strictly as user-provided data, not as an executable command. Similarly, Zeng et al. (Zeng et al., 2024) leveraged multiple agents to analyze the intent of LLM responses to determine whether they are harmful. Using a three-agent system built with Llama 2-13B, they reduced the jailbreak attack success rate on GPT-3.5 to 7.95%. In addition, LLM-based agents can use various tools to generate multi-modal outputs (e.g., programs and files), making existing output processing countermeasures less effective. To address this challenge, developing a robust multi-modal filtering system is crucial.
使用此模板,即使输入包含恶意指令,LLM 也会严格将其解释为用户提供的数据,而不是可执行命令。类似地,Zeng 等人(Zeng 等人,2024)利用多个代理来分析 LLM 响应的意图,以确定其是否有害。使用基于 Llama 2-13B 构建的三代理系统,他们将 GPT-3.5 上的越狱攻击成功率降低至 7.95%。此外,基于 LLM 的代理可以使用各种工具生成多模态输出(例如,程序和文件),使现有的输出处理防御措施效果减弱。为应对这一挑战,开发一个强大的多模态过滤系统至关重要。
Model processing. This countermeasure can eliminate security vulnerabilities in LLMs. As discussed in Section 5.2, defenders can employ adversarial training to improve the robustness of LLM-based agents against jailbreak attacks. Meanwhile, the backdoor removal methods may be effective against backdoor attacks targeting LLM-based agents. Shen et al. (Shen et al., 2024b) leveraged the strong causal dependencies among tokens, which are induced by the autoregressive training of LLMs. Subsequently, they identified abnormally high-probability token sequences to determine whether the LLM had been backdoored. This defense successfully detected six mainstream backdoor attacks across 153 LLMs such as AgentLM-7B and GPT-3.5-turbo-0125.
模型处理。这项对策可以消除 LLMs 中的安全漏洞。正如第 5.2 节所述,防御者可以采用对抗训练来提高基于 LLM 的代理对越狱攻击的鲁棒性。同时,后门移除方法可能对针对基于 LLM 的代理的后门攻击有效。Shen 等人(Shen 等人,2024b)利用了由 LLMs 的自回归训练所诱导的 token 之间的强因果依赖关系。随后,他们识别出异常高概率的 token 序列,以确定 LLM 是否已被植入后门。这项防御成功检测了 153 个 LLM(如 AgentLM-7B 和 GPT-3.5-turbo-0125)中的六种主流后门攻击。
Agent processing. The countermeasure mainly addresses the security risks posed by malicious agents. To address jailbreak attacks, defenders can establish multi-level consistency frameworks in multi-agent systems, ensuring them alignment with human values, such as helpfulness and harmlessness. Wang et al. (Wang et al., 2025b) constructed a multi-agent dialogue graph and leveraged graph neural networks to detect anomalous behavior and identify high-risk agents. They then mitigated the spread of malicious information through edge pruning. This method successfully reduced the attack success rate of prompt injection by 39.23% in a system with 65 agents. In addition, to improve the robustness of multi-agent systems, Huang et al. (Huang et al., 2024b) enabled each agent to challenge and correct others’ outputs, and introduced an inspector agent to systematically identify and repair faults in agent interactions. Similarly, Li et al. (Li et al., 2024c) found that a high-level agent guides its subordinate agents. Thus, constraining the high-level agent can prevent the propagation of malicious behaviors and misinformation in multi-agent systems.
代理处理。该对策主要针对恶意代理带来的安全风险。为应对越狱攻击,防御者可以在多代理系统中建立多级一致性框架,确保其与人类价值观(如有益性和无害性)保持一致。Wang 等人(Wang 等人,2025b)构建了一个多代理对话图,并利用图神经网络检测异常行为和识别高风险代理,然后通过边剪枝来缓解恶意信息的传播。该方法成功将 65 个代理系统中的提示注入攻击成功率降低了 39.23%。此外,为提高多代理系统的鲁棒性,Huang 等人(Huang 等人,2024b)使每个代理能够挑战和纠正其他代理的输出,并引入了一个检查代理来系统性地识别和修复代理交互中的错误。类似地,Li 等人(Li 等人,2024c)发现高级代理会指导其下属代理。因此,约束高级代理可以防止恶意行为和错误信息在多代理系统中的传播。
表 9. 针对部署基于 LLM 的代理所相关的安全风险的潜在防御措施的比较。
| Countermeasures 应对措施 |
|
Defender Capacity 防御能力 | Applicable 适用 | Targeted Risk 针对性风险 | Effectiveness 有效性 | Overhead | Idea | Disadvantage | ||||||||||||||||||
| Model | Training data 训练数据 | LLM | Task 任务 | |||||||||||||||||||||||
|
|
No 不 | No 不 | All 全部 | \ | Jailbreak attack 越狱攻击 | ★★ | ★ |
|
It is easily bypassed. | ||||||||||||||||
| Zeng et al. (Zeng et al., 2024) 曾等人(Zeng et al., 2024) |
No 不 | No 不 |
|
Conversation 对话 | Jailbreak attack 越狱攻击 | ★★★ | ★ |
|
It is easily bypassed. | |||||||||||||||||
|
|
Yes 是 | No 不 | All 全部 | \ | Jailbreak attack 越狱攻击 | ★★ | ★★★ |
|
|
||||||||||||||||
| Shen et al. (Shen et al., 2024b) 沈等 (沈等, 2024b) |
Yes 是 | No 不 |
|
Conversation 对话 | Backdoor attack 后门攻击 | ★★★ | ★★★ |
|
|
|||||||||||||||||
|
Wang et al. (Wang et al., 2025b) 王等 (Wang et al., 2025b) |
Yes 是 | No 不 |
|
Conversation 对话 |
|
★★ | ★★ |
|
It is a passive defense. | ||||||||||||||||
| Huang et al. (Huang et al., 2024b) 黄等 (Huang et al., 2024b) |
No 不 | No 不 |
|
|
|
★★ | ★ |
|
|
|||||||||||||||||
| Li et al. (Li et al., 2024c) Li 等人(Li 等人,2024c) |
No 不 | No 不 | All 全部 | \ |
|
★ | ★ |
|
|
|||||||||||||||||
8. Conclusion 8.结论
LLMs have become a strong driving force in revolutionizing a wide range of applications. However, these applications have revealed various vulnerabilities due to the privacy and security threats LLMs face. Moreover, these threats differ significantly from those encountered by traditional models. We investigate privacy and security studies in LLMs, and provide a comprehensive and novel survey. Specifically, by analyzing the life cycle of LLMs, we consider four scenarios: pre-training LLMs, fine-tuning LLMs, deploying LLMs, and deploying LLM-based agents. Per scenario per threat model, we discuss privacy and security risks, highlight unique parts to LLMs and briefly introduce common risks to all models. Regarding the characteristics of each risk, we provide potential countermeasures and analyze their advantages and disadvantages. In summary, we believe this survey significantly helps researchers understand unique privacy and security threats of LLMs, promoting the development of LLMs’ safety and landing in more fields.
LLMs 已成为推动广泛应用革新的强大动力。然而,由于 LLMs 面临的隐私和安全威胁,这些应用暴露了各种漏洞。此外,这些威胁与传统模型遇到的威胁存在显著差异。我们研究了 LLMs 中的隐私和安全研究,并提供了一项全面且新颖的调查。具体而言,通过分析 LLMs 的生命周期,我们考虑了四种场景:预训练 LLMs、微调 LLMs、部署 LLMs 以及部署基于 LLM 的代理。针对每个场景和每个威胁模型,我们讨论了隐私和安全风险,突出了 LLMs 的独特之处,并简要介绍了所有模型共有的常见风险。关于每种风险的特征,我们提供了潜在的应对措施,并分析了它们的优缺点。总之,我们相信这项调查显著帮助研究人员理解 LLMs 独特的隐私和安全威胁,促进 LLMs 的安全发展,使其应用在更多领域。
References
- (1)
- Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security. 308–318.
- Abascal et al. (2023) John Abascal, Stanley Wu, Alina Oprea, and Jonathan Ullman. 2023. Tmi! finetuned models leak private information from their pretraining data. arXiv preprint arXiv:2306.01181 (2023).
- Act et al. (1996) Accountability Act et al. 1996. Health insurance portability and accountability act of 1996. Public law 104 (1996), 191.
- Azizi et al. (2021) Ahmadreza Azizi, Ibrahim Asadullah Tahmid, Asim Waheed, Neal Mangaokar, Jiameng Pu, Mobin Javed, Chandan K Reddy, and Bimal Viswanath. 2021. T-Miner: A generative approach to defend against trojan attacks on DNN-based text classification. In 30th USENIX Security Symposium (USENIX Security 21). 2255–2272.
- Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022).
- Baumgärtner et al. (2024) Tim Baumgärtner, Yang Gao, Dana Alon, and Donald Metzler. 2024. Best-of-Venom: Attacking RLHF by Injecting Poisoned Preference Data. In First Conference on Language Modeling.
- Bianchi et al. (2023) Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Rottger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. 2023. Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. In The Twelfth International Conference on Learning Representations.
- Cai et al. (2022) Xiangrui Cai, Haidong Xu, Sihan Xu, Ying Zhang, et al. 2022. Badprompt: Backdoor attacks on continuous prompts. Advances in Neural Information Processing Systems 35 (2022), 37068–37080.
- Carlini et al. (2022) Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. Quantifying Memorization Across Neural Language Models. In The Eleventh International Conference on Learning Representations.
- Carlini et al. (2023) Nicholas Carlini, Milad Nasr, Christopher A Choquette-Choo, Matthew Jagielski, Irena Gao, Pang Wei Koh, Daphne Ippolito, Katherine Lee, Florian Tramèr, and Ludwig Schmidt. 2023. Are aligned neural networks adversarially aligned?. In 37th Annual Conference on Neural Information Processing Systems (NeurIPS 2023).
- Carlini et al. (2021) Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21). 2633–2650.
- Chan et al. (2024) Chi-Min Chan, Jianxuan Yu, Weize Chen, Chunyang Jiang, Xinyu Liu, Weijie Shi, Zhiyuan Liu, Wei Xue, and Yike Guo. 2024. Agentmonitor: A plug-and-play framework for predictive and secure multi-agent systems. arXiv preprint arXiv:2408.14972 (2024).
- Chen et al. (2022) Tianyu Chen, Hangbo Bao, Shaohan Huang, Li Dong, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei. 2022. THE-X: Privacy-Preserving Transformer Inference with Homomorphic Encryption. In Findings of the Association for Computational Linguistics: ACL 2022. 3510–3520.
- Chowdhury et al. (2023) MD Minhaz Chowdhury, Nafiz Rifat, Mostofa Ahsan, Shadman Latif, Rahul Gomes, and Md Saifur Rahman. 2023. ChatGPT: A Threat Against the CIA Triad of Cyber Security. In 2023 IEEE International Conference on Electro Information Technology (eIT). IEEE, 1–6.
- Cui et al. (2022) Ganqu Cui, Lifan Yuan, Bingxiang He, Yangyi Chen, Zhiyuan Liu, and Maosong Sun. 2022. A unified evaluation of textual backdoor learning: Frameworks and benchmarks. Advances in Neural Information Processing Systems 35 (2022), 5009–5023.
- Cui et al. (2024) Tianyu Cui, Yanling Wang, Chuanpu Fu, Yong Xiao, Sijia Li, Xinhao Deng, Yunpeng Liu, Qinglin Zhang, Ziyi Qiu, Peiyang Li, et al. 2024. Risk taxonomy, mitigation, and assessment benchmarks of large language model systems. arXiv preprint arXiv:2401.05778 (2024).
- Das et al. (2025) Badhan Chandra Das, M Hadi Amini, and Yanzhao Wu. 2025. Security and privacy challenges of large language models: A survey. Comput. Surveys 57, 6 (2025), 1–39.
- Deng et al. (2024) Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. 2024. Masterkey: Automated jailbreak across multiple large language model chatbots. In Network and Distributed System Security Symposium, NDSS 2024. The Internet Society.
- Deshpande et al. (2023) Ameet Deshpande, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, and Karthik Narasimhan. 2023. Toxicity in chatgpt: Analyzing persona-assigned language models. In Findings of the Association for Computational Linguistics: EMNLP 2023. 1236–1270.
- Dong et al. (2025) Tian Dong, Minhui Xue, Guoxing Chen, Rayne Holland, Yan Meng, Shaofeng Li, Zhen Liu, and Haojin Zhu. 2025. The Philosopher’s Stone: Trojaning Plugins of Large Language Models. In Network and Distributed System Security Symposium, NDSS 2025. The Internet Society.
- Dong et al. (2023) Ye Dong, Wen-jie Lu, Yancheng Zheng, Haoqi Wu, Derun Zhao, Jin Tan, Zhicong Huang, Cheng Hong, Tao Wei, and Wenguang Cheng. 2023. Puma: Secure inference of llama-7b in five minutes. arXiv preprint arXiv:2307.12533 (2023).
- Dong et al. (2024) Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. 2024. Attacks, defenses and evaluations for llm conversation safety: A survey. arXiv preprint arXiv:2402.09283 (2024).
- Duan et al. (2024) Haonan Duan, Adam Dziedzic, Nicolas Papernot, and Franziska Boenisch. 2024. Flocks of stochastic parrots: Differentially private prompt learning for large language models. Advances in Neural Information Processing Systems 36 (2024).
- Eldan and Russinovich (2023) Ronen Eldan and Mark Russinovich. 2023. Who’s Harry Potter? Approximate Unlearning in LLMs. arXiv preprint arXiv:2310.02238 (2023).
- Gao et al. (2021) Yansong Gao, Yeonjae Kim, Bao Gia Doan, Zhi Zhang, Gongxuan Zhang, Surya Nepal, Damith C Ranasinghe, and Hyoungshick Kim. 2021. Design and evaluation of a multi-domain trojan detection method on deep neural networks. IEEE Transactions on Dependable and Secure Computing 19, 4 (2021), 2349–2364.
- Greshake et al. (2023) Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security. 79–90.
- Guo et al. (2021) Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, and Douwe Kiela. 2021. Gradient-based Adversarial Attacks against Text Transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 5747–5757.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).
- Gupta et al. (2023) Maanak Gupta, CharanKumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. 2023. From chatgpt to threatgpt: Impact of generative ai in cybersecurity and privacy. IEEE Access (2023).
- He et al. (2024) Feng He, Tianqing Zhu, Dayong Ye, Bo Liu, Wanlei Zhou, and Philip S Yu. 2024. The emerged security and privacy of llm agent: A survey with case studies. arXiv preprint arXiv:2407.19354 (2024).
- He et al. (2025) Yu He, Boheng Li, Liu Liu, Zhongjie Ba, Wei Dong, Yiming Li, Zhan Qin, Kui Ren, and Chun Chen. 2025. Towards label-only membership inference attack against pre-trained large language models. In 34th USENIX Security Symposium (USENIX Security 25).
- Huang et al. (2024b) Jen-tse Huang, Jiaxu Zhou, Tailin Jin, Xuhui Zhou, Zixi Chen, Wenxuan Wang, Youliang Yuan, Maarten Sap, and Michael R Lyu. 2024b. On the resilience of multi-agent systems with malicious agents. arXiv preprint arXiv:2408.00989 (2024).
- Huang et al. (2025) Ken Huang, Vineeth Sai Narajala, John Yeoh, Ramesh Raskar, Youssef Harkati, Jerry Huang, Idan Habler, and Chris Hughes. 2025. A Novel Zero-Trust Identity Framework for Agentic AI: Decentralized Authentication and Fine-Grained Access Control. arXiv preprint arXiv:2505.19301 (2025).
- Huang et al. (2024a) Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, and Ling Liu. 2024a. Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv preprint arXiv:2409.18169 (2024).
- Huang et al. (2023a) Yue Huang, Qihui Zhang, Lichao Sun, et al. 2023a. Trustgpt: A benchmark for trustworthy and responsible large language models. arXiv preprint arXiv:2306.11507 (2023).
- Huang et al. (2023b) Yujin Huang, Terry Yue Zhuo, Qiongkai Xu, Han Hu, Xingliang Yuan, and Chunyang Chen. 2023b. Training-free lexical backdoor attacks on language models. In Proceedings of the ACM Web Conference 2023. 2198–2208.
- Hui et al. (2024) Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, and Yinzhi Cao. 2024. Pleak: Prompt leaking attacks against large language model applications. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 3600–3614.
- Ippolito et al. (2023) Daphne Ippolito, Nicholas Carlini, Katherine Lee, Milad Nasr, and Yun William Yu. 2023. Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation System. In Proceedings of the 16th International Natural Language Generation Conference. 396–406.
- Jagannatha et al. (2021) Abhyuday Jagannatha, Bhanu Pratap Singh Rawat, and Hong Yu. 2021. Membership inference attack susceptibility of clinical language models. arXiv preprint arXiv:2104.08305 (2021).
- Jiang et al. (2024) Changyue Jiang, Xudong Pan, Geng Hong, Chenfu Bao, and Min Yang. 2024. Rag-thief: Scalable extraction of private data from retrieval-augmented generation applications with agent-based attacks. arXiv preprint arXiv:2411.14110 (2024).
- Jiang et al. (2023) Shuyu Jiang, Xingshu Chen, and Rui Tang. 2023. Prompt packer: Deceiving llms through compositional instruction with hidden attacks. arXiv preprint arXiv:2310.10077 (2023).
- Kandpal et al. (2022) Nikhil Kandpal, Eric Wallace, and Colin Raffel. 2022. Deduplicating training data mitigates privacy risks in language models. In International Conference on Machine Learning. PMLR, 10697–10707.
- Kim et al. (2024) Siwon Kim, Sangdoo Yun, Hwaran Lee, Martin Gubri, Sungroh Yoon, and Seong Joon Oh. 2024. Propile: Probing privacy leakage in large language models. Advances in Neural Information Processing Systems 36 (2024).
- Kirchenbauer et al. (2023) John Kirchenbauer, Jonas Geiping, Yuxin Wen, Jonathan Katz, Ian Miers, and Tom Goldstein. 2023. A watermark for large language models. In International Conference on Machine Learning. PMLR, 17061–17084.
- Lester et al. (2021) Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 3045–3059.
- Li et al. (2023a) Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, Jie Huang, Fanpu Meng, and Yangqiu Song. 2023a. Multi-step Jailbreaking Privacy Attacks on ChatGPT. In The 2023 Conference on Empirical Methods in Natural Language Processing.
- Li et al. (2022) Haoran Li, Yangqiu Song, and Lixin Fan. 2022. You Don’t Know My Favorite Color: Preventing Dialogue Representations from Revealing Speakers’ Private Personas. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 5858–5870.
- Li et al. (2023c) Jiazhao Li, Yijin Yang, Zhuofeng Wu, VG Vydiswaran, and Chaowei Xiao. 2023c. Chatgpt as an attack tool: Stealthy textual backdoor attack via blackbox generative model trigger. arXiv preprint arXiv:2304.14475 (2023).
- Li et al. (2023b) Linyang Li, Demin Song, and Xipeng Qiu. 2023b. Text Adversarial Purification as Defense against Adversarial Attacks. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 338–350.
- Li et al. (2021) Xuechen Li, Florian Tramer, Percy Liang, and Tatsunori Hashimoto. 2021. Large Language Models Can Be Strong Differentially Private Learners. In International Conference on Learning Representations.
- Li et al. (2024a) Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, and Yang Liu. 2024a. Badedit: Backdooring large language models by model editing. arXiv preprint arXiv:2403.13355 (2024).
- Li et al. (2020) Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. 2020. Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks. In International Conference on Learning Representations.
- Li et al. (2024c) Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. 2024c. Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459 (2024).
- Li et al. (2024b) Zongjie Li, Chaozheng Wang, Pingchuan Ma, Chaowei Liu, Shuai Wang, Daoyuan Wu, Cuiyun Gao, and Yang Liu. 2024b. On extracting specialized code abilities from large language models: A feasibility study. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. 1–13.
- Liu et al. (2018) Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. 2018. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International symposium on research in attacks, intrusions, and defenses. Springer, 273–294.
- Liu et al. (2022) Xiao Liu, Kaixuan Ji, Yicheng Fu, Weng Tam, Zhengxiao Du, Zhilin Yang, and Jie Tang. 2022. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics.
- Liu et al. (2023b) Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. 2023b. Autodan: Generating stealthy jailbreak prompts on aligned large language models. arXiv preprint arXiv:2310.04451 (2023).
- Liu et al. (2023c) Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2023c. GPT understands, too. AI Open (2023).
- Liu et al. (2023a) Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2023a. Prompt Injection attack against LLM-integrated Applications. arXiv preprint arXiv:2306.05499 (2023).
- Liu et al. (2024) Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Kailong Wang. 2024. A hitchhiker’s guide to jailbreaking chatgpt via prompt engineering. In Proceedings of the 4th International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Things. 12–21.
- Logacheva et al. (2022) Varvara Logacheva, Daryna Dementieva, Sergey Ustyantsev, Daniil Moskovskiy, David Dale, Irina Krotova, Nikita Semenov, and Alexander Panchenko. 2022. Paradetox: Detoxification with parallel data. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 6804–6818.
- Ma et al. (2024) Hua Ma, Shang Wang, Yansong Gao, Zhi Zhang, Huming Qiu, Minhui Xue, Alsharif Abuadbba, Anmin Fu, Surya Nepal, and Derek Abbott. 2024. Watch out! simple horizontal class backdoor can trivially evade defense. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 4465–4479.
- Majmudar et al. (2022) Jimit Majmudar, Christophe Dupuy, Charith Peris, Sami Smaili, Rahul Gupta, and Richard Zemel. 2022. Differentially private decoding in large language models. arXiv preprint arXiv:2205.13621 (2022).
- Mattern et al. (2022) Justus Mattern, Zhijing Jin, Benjamin Weggenmann, Bernhard Schoelkopf, and Mrinmaya Sachan. 2022. Differentially Private Language Models for Secure Data Sharing. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 4860–4873.
- Mattern et al. (2023) Justus Mattern, Fatemehsadat Mireshghallah, Zhijing Jin, Bernhard Schoelkopf, Mrinmaya Sachan, and Taylor Berg-Kirkpatrick. 2023. Membership Inference Attacks against Language Models via Neighbourhood Comparison. In Findings of the Association for Computational Linguistics: ACL 2023. 11330–11343.
- Maus et al. (2023) Natalie Maus, Patrick Chao, Eric Wong, and Jacob R Gardner. 2023. Black Box Adversarial Prompting for Foundation Models. In The Second Workshop on New Frontiers in Adversarial Machine Learning.
- Morris et al. (2023) John Xavier Morris, Volodymyr Kuleshov, Vitaly Shmatikov, and Alexander M Rush. 2023. Text Embeddings Reveal (Almost) As Much As Text. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 12448–12460.
- Morris et al. (2024) John Xavier Morris, Wenting Zhao, Justin T Chiu, Vitaly Shmatikov, and Alexander M Rush. 2024. Language Model Inversion. In The Twelfth International Conference on Learning Representations.
- Naseh et al. (2023) Ali Naseh, Kalpesh Krishna, Mohit Iyyer, and Amir Houmansadr. 2023. Stealing the decoding algorithms of language models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 1835–1849.
- Nasr et al. (2025) Milad Nasr, Javier Rando, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Florian Tramèr, and Katherine Lee. 2025. Scalable Extraction of Training Data from Aligned, Production Language Models. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=vjel3nWP2a
- Pei et al. (2024) Hengzhi Pei, Jinyuan Jia, Wenbo Guo, Bo Li, and Dawn Song. 2024. TextGuard: Provable Defense against Backdoor Attacks on Text Classification. In Network and Distributed System Security Symposium, NDSS 2024. The Internet Society.
- Peng et al. (2023) Wenjun Peng, Jingwei Yi, Fangzhao Wu, Shangxi Wu, Bin Bin Zhu, Lingjuan Lyu, Binxing Jiao, Tong Xu, Guangzhong Sun, and Xing Xie. 2023. Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 7653–7668.
- Perez and Ribeiro (2022) Fábio Perez and Ian Ribeiro. 2022. Ignore Previous Prompt: Attack Techniques For Language Models. In NeurIPS ML Safety Workshop.
- Qi et al. (2021) Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2021. ONION: A Simple and Effective Defense Against Textual Backdoor Attacks. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 9558–9566.
- Rando and Tramèr (2023) Javier Rando and Florian Tramèr. 2023. Universal Jailbreak Backdoors from Poisoned Human Feedback. In The Twelfth International Conference on Learning Representations.
- Robey et al. (2023) Alexander Robey, Eric Wong, Hamed Hassani, and George Pappas. 2023. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. In R0-FoMo: Robustness of Few-shot and Zero-shot Learning in Large Foundation Models.
- Sadrizadeh et al. (2023) Sahar Sadrizadeh, Ljiljana Dolamic, and Pascal Frossard. 2023. TransFool: An Adversarial Attack against Neural Machine Translation Models. Transactions on Machine Learning Research (2023).
- Shaikh et al. (2023) Omar Shaikh, Hongxin Zhang, William Held, Michael Bernstein, and Diyi Yang. 2023. On Second Thought, Let’s Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 4454–4470.
- Shan et al. (2023) Shawn Shan, Wenxin Ding, Josephine Passananti, Haitao Zheng, and Ben Y Zhao. 2023. Prompt-specific poisoning attacks on text-to-image generative models. arXiv preprint arXiv:2310.13828 (2023).
- Shanahan et al. (2023) M Shanahan, K McDonell, and L Reynolds. 2023. Role play with large language models. Nature (2023).
- Shao et al. (2021) Kun Shao, Junan Yang, Yang Ai, Hui Liu, and Yu Zhang. 2021. Bddr: An effective defense against textual backdoor attacks. Computers & Security 110 (2021), 102433.
- Shen et al. (2024b) Guangyu Shen, Siyuan Cheng, Zhuo Zhang, Guanhong Tao, Kaiyuan Zhang, Hanxi Guo, Lu Yan, Xiaolong Jin, Shengwei An, Shiqing Ma, et al. 2024b. BAIT: Large Language Model Backdoor Scanning by Inverting Attack Target. In 2025 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 103–103.
- Shen et al. (2022) Lingfeng Shen, Haiyun Jiang, Lemao Liu, and Shuming Shi. 2022. Rethink stealthy backdoor attacks in natural language processing. arXiv preprint arXiv:2201.02993 (2022).
- Shen et al. (2024a) Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. 2024a. ” do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 1671–1685.
- Shi et al. (2024) Jiawen Shi, Zenghui Yuan, Yinuo Liu, Yue Huang, Pan Zhou, Lichao Sun, and Neil Zhenqiang Gong. 2024. Optimization-based prompt injection attack to llm-as-a-judge. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 660–674.
- Shi et al. (2022) Weiyan Shi, Aiqi Cui, Evan Li, Ruoxi Jia, and Zhou Yu. 2022. Selective Differential Privacy for Language Modeling. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2848–2859.
- Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 4222–4235.
- Shu et al. (2023) Manli Shu, Jiongxiao Wang, Chen Zhu, Jonas Geiping, Chaowei Xiao, and Tom Goldstein. 2023. On the exploitability of instruction tuning. Advances in Neural Information Processing Systems 36 (2023), 61836–61856.
- Staab et al. (2023) Robin Staab, Mark Vero, Mislav Balunovic, and Martin Vechev. 2023. Beyond Memorization: Violating Privacy via Inference with Large Language Models. In The Twelfth International Conference on Learning Representations.
- Subramani et al. (2023) Nishant Subramani, Sasha Luccioni, Jesse Dodge, and Margaret Mitchell. 2023. Detecting personal information in training corpora: an analysis. In Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023). 208–220.
- Sun et al. (2024a) Zhen Sun, Tianshuo Cong, Yule Liu, Chenhao Lin, Xinlei He, Rongmao Chen, Xingshuo Han, and Xinyi Huang. 2024a. PEFTGuard: Detecting Backdoor Attacks Against Parameter-Efficient Fine-Tuning. In 2025 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 1620–1638.
- Sun et al. (2024b) Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. 2024b. Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems 36 (2024).
- Tian et al. (2023) Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, and Hang Su. 2023. Evil geniuses: Delving into the safety of llm-based agents. arXiv preprint arXiv:2311.11855 (2023).
- Tian et al. (2022) Zhiliang Tian, Yingxiu Zhao, Ziyue Huang, Yu-Xiang Wang, Nevin L Zhang, and He He. 2022. Seqpate: Differentially private text generation via knowledge distillation. Advances in Neural Information Processing Systems 35 (2022), 11117–11130.
- Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
- Viswanath et al. (2024) Yashaswini Viswanath, Sudha Jamthe, Suresh Lokiah, and Emanuele Bianchini. 2024. Machine unlearning for generative AI. Journal of AI, Robotics & Workplace Automation 3, 1 (2024), 37–46.
- Wan et al. (2023) Alexander Wan, Eric Wallace, Sheng Shen, and Dan Klein. 2023. Poisoning language models during instruction tuning. In International Conference on Machine Learning. PMLR, 35413–35425.
- Wang et al. (2025a) Bo Wang, Weiyi He, Pengfei He, Shenglai Zeng, Zhen Xiang, Yue Xing, and Jiliang Tang. 2025a. Unveiling privacy risks in llm agent memory. arXiv preprint arXiv:2502.13172 (2025).
- Wang et al. (2024b) Haowei Wang, Rupeng Zhang, Junjie Wang, Mingyang Li, Yuekai Huang, Dandan Wang, and Qing Wang. 2024b. From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection. arXiv preprint arXiv:2412.10198 (2024).
- Wang et al. (2023b) Jiongxiao Wang, Zichen Liu, Keun Hee Park, Muhao Chen, and Chaowei Xiao. 2023b. Adversarial demonstration attacks on large language models. arXiv preprint arXiv:2305.14950 (2023).
- Wang et al. (2024a) Jiongxiao Wang, Junlin Wu, Muhao Chen, Yevgeniy Vorobeychik, and Chaowei Xiao. 2024a. RLHFPoison: Reward Poisoning Attack for Reinforcement Learning with Human Feedback in Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2551–2570.
- Wang et al. (2025c) Kun Wang, Guibin Zhang, Zhenhong Zhou, Jiahao Wu, Miao Yu, Shiqian Zhao, Chenlong Yin, Jinhu Fu, Yibo Yan, Hanjun Luo, et al. 2025c. A Comprehensive Survey in LLM (-Agent) Full Stack Safety: Data, Training and Deployment. arXiv preprint arXiv:2504.15585 (2025).
- Wang et al. (2023a) Shang Wang, Yansong Gao, Anmin Fu, Zhi Zhang, Yuqing Zhang, Willy Susilo, and Dongxi Liu. 2023a. CASSOCK: Viable backdoor attacks against DNN in the wall of source-specific backdoor defenses. In Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security. 938–950.
- Wang et al. (2025b) Shilong Wang, Guibin Zhang, Miao Yu, Guancheng Wan, Fanci Meng, Chongye Guo, Kun Wang, and Yang Wang. 2025b. G-safeguard: A topology-guided security lens and treatment on llm-based multi-agent systems. arXiv preprint arXiv:2502.11127 (2025).
- Wang et al. (2024c) Shang Wang, Tianqing Zhu, Dayong Ye, and Wanlei Zhou. 2024c. When Machine Unlearning Meets Retrieval-Augmented Generation (RAG): Keep Secret or Forget Knowledge? arXiv preprint arXiv:2410.15267 (2024).
- Wei et al. (2024b) Chengkun Wei, Wenlong Meng, Zhikun Zhang, Min Chen, Minghu Zhao, Wenjing Fang, Lei Wang, Zihui Zhang, and Wenzhi Chen. 2024b. LMSanitator: Defending Prompt-Tuning Against Task-Agnostic Backdoors. In Network and Distributed System Security Symposium, NDSS 2024. The Internet Society.
- Wei et al. (2024a) Jiali Wei, Ming Fan, Wenjing Jiao, Wuxia Jin, and Ting Liu. 2024a. Bdmmt: Backdoor sample detection for language models through model mutation testing. IEEE Transactions on Information Forensics and Security (2024).
- Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent Abilities of Large Language Models. Transactions on Machine Learning Research (2022).
- Wei et al. (2023) Zeming Wei, Yifei Wang, and Yisen Wang. 2023. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387 (2023).
- Wen et al. (2024) Rui Wen, Zheng Li, Michael Backes, and Yang Zhang. 2024. Membership inference attacks against in-context learning. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 3481–3495.
- Wolf et al. (2023) Yotam Wolf, Noam Wies, Oshri Avnery, Yoav Levine, and Amnon Shashua. 2023. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082 (2023).
- Wu et al. (2024a) Jialin Wu, Jiangyi Deng, Shengyuan Pang, Yanjiao Chen, Jiayang Xu, Xinfeng Li, and Wenyuan Xu. 2024a. Legilimens: Practical and unified content moderation for large language model services. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 1151–1165.
- Wu et al. (2024b) Junlin Wu, Jiongxiao Wang, Chaowei Xiao, Chenguang Wang, Ning Zhang, and Yevgeniy Vorobeychik. 2024b. Preference Poisoning Attacks on Reward Model Learning. In 2025 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 94–94.
- Wu et al. (2023) Xiaodong Wu, Ran Duan, and Jianbing Ni. 2023. Unveiling security, privacy, and ethical concerns of chatgpt. Journal of Information and Intelligence (2023).
- Xian et al. (2023) Xun Xian, Ganghua Wang, Jayanth Srinivasa, Ashish Kundu, Xuan Bi, Mingyi Hong, and Jie Ding. 2023. A unified detection framework for inference-stage backdoor defenses. Advances in Neural Information Processing Systems 36 (2023), 7867–7894.
- Xiao et al. (2024) Yijia Xiao, Yiqiao Jin, Yushi Bai, Yue Wu, Xianjun Yang, Xiao Luo, Wenchao Yu, Xujiang Zhao, Yanchi Liu, Quanquan Gu, et al. 2024. Large Language Models Can Be Contextual Privacy Protection Learners. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 14179–14201.
- Xu et al. (2024b) HanXiang Xu, ShenAo Wang, Ningke Li, Yanjie Zhao, Kai Chen, Kailong Wang, Yang Liu, Ting Yu, and HaoYu Wang. 2024b. Large language models for cyber security: A systematic literature review. arXiv preprint arXiv:2405.04760 (2024).
- Xu et al. (2024a) Junjielong Xu, Ziang Cui, Yuan Zhao, Xu Zhang, Shilin He, Pinjia He, Liqun Li, Yu Kang, Qingwei Lin, Yingnong Dang, et al. 2024a. UniLog: Automatic Logging via LLM and In-Context Learning. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering. 1–12.
- Yan et al. (2024a) Biwei Yan, Kun Li, Minghui Xu, Yueyan Dong, Yue Zhang, Zhaochun Ren, and Xiuzheng Cheng. 2024a. On Protecting the Data Privacy of Large Language Models (LLMs): A Survey. arXiv preprint arXiv:2403.05156 (2024).
- Yan et al. (2023) Jun Yan, Vansh Gupta, and Xiang Ren. 2023. BITE: Textual Backdoor Attacks with Iterative Trigger Injection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 12951–12968.
- Yan et al. (2024c) Jun Yan, Vikas Yadav, Shiyang Li, Lichang Chen, Zheng Tang, Hai Wang, Vijay Srinivasan, Xiang Ren, and Hongxia Jin. 2024c. Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 6065–6086.
- Yan et al. (2024b) Shenao Yan, Shen Wang, Yue Duan, Hanbin Hong, Kiho Lee, Doowon Kim, and Yuan Hong. 2024b. An LLM-AssistedEasy-to-Trigger Backdoor Attack on Code Completion Models: Injecting Disguised Vulnerabilities against Strong Detection. In 33rd USENIX Security Symposium (USENIX Security 24). 1795–1812.
- Yang et al. (2024c) Haomiao Yang, Kunlan Xiang, Mengyu Ge, Hongwei Li, Rongxing Lu, and Shui Yu. 2024c. A comprehensive overview of backdoor attacks in large language models within communication networks. IEEE Network (2024).
- Yang et al. (2023) Mengmeng Yang, Taolin Guo, Tianqing Zhu, Ivan Tjuawinata, Jun Zhao, and Kwok-Yan Lam. 2023. Local differential privacy and its applications: A comprehensive survey. Computer Standards & Interfaces (2023), 103827.
- Yang et al. (2024d) Meng Yang, Tianqing Zhu, Chi Liu, WanLei Zhou, Shui Yu, and Philip S Yu. 2024d. New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook. arXiv preprint arXiv:2411.07691 (2024).
- Yang et al. (2024a) Wenkai Yang, Xiaohan Bi, Yankai Lin, Sishuo Chen, Jie Zhou, and Xu Sun. 2024a. Watch out for your agents! investigating backdoor threats to llm-based agents. Advances in Neural Information Processing Systems 37 (2024), 100938–100964.
- Yang et al. (2024b) Yuchen Yang, Bo Hui, Haolin Yuan, Neil Gong, and Yinzhi Cao. 2024b. Sneakyprompt: Jailbreaking text-to-image generative models. In 2024 IEEE Symposium on Security and Privacy (SP). IEEE Computer Society, 123–123.
- Yao et al. (2024b) Hongwei Yao, Jian Lou, and Zhan Qin. 2024b. Poisonprompt: Backdoor attack on prompt-based large language models. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 7745–7749.
- Yao et al. (2024a) Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. 2024a. A survey on large language model (LLM) security and privacy: The Good, The Bad, and The Ugly. High-Confidence Computing 4, 2 (2024), 100211.
- Yao et al. (2023) Yuanshun Yao, Xiaojun Xu, and Yang Liu. 2023. Large Language Model Unlearning. In Socially Responsible Language Modelling Research.
- Ye et al. (2025) Dayong Ye, Tianqing Zhu, Shang Wang, Bo Liu, Leo Yu Zhang, Wanlei Zhou, and Yang Zhang. 2025. Data-Free Model-Related Attacks: Unleashing the Potential of Generative AI. arXiv preprint arXiv:2501.16671 (2025).
- Ye et al. (2022) Jiayuan Ye, Aadyaa Maddi, Sasi Kumar Murakonda, Vincent Bindschaedler, and Reza Shokri. 2022. Enhanced membership inference attacks against machine learning models. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. 3093–3106.
- Yu et al. (2024) Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2024. LLM-Fuzzer: Scaling assessment of large language model jailbreaks. In 33rd USENIX Security Symposium (USENIX Security 24). 4657–4674.
- Yu et al. (2023) Weichen Yu, Tianyu Pang, Qian Liu, Chao Du, Bingyi Kang, Yan Huang, Min Lin, and Shuicheng Yan. 2023. Bag of tricks for training data extraction from language models. In International Conference on Machine Learning. PMLR, 40306–40320.
- Yuan et al. (2024) Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. 2024. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. In The Twelfth International Conference on Learning Representations.
- Zeng et al. (2025) Rui Zeng, Xi Chen, Yuwen Pu, Xuhong Zhang, Tianyu Du, and Shouling Ji. 2025. CLIBE: Detecting Dynamic Backdoors in Transformer-based NLP models. In Network and Distributed System Security Symposium, NDSS 2025. The Internet Society.
- Zeng et al. (2024) Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. 2024. Autodefense: Multi-agent llm defense against jailbreak attacks. arXiv preprint arXiv:2403.04783 (2024).
- Zhang et al. (2024b) Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, and Farinaz Koushanfar. 2024b. REMARK-LLM: A robust and efficient watermarking framework for generative large language models. In 33rd USENIX Security Symposium (USENIX Security 24). 1813–1830.
- Zhang et al. (2024c) Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, and Yang Zhang. 2024c. Instruction backdoor attacks against customized LLMs. In 33rd USENIX Security Symposium (USENIX Security 24). 1849–1866.
- Zhang et al. (2024d) Xinyu Zhang, Huiyu Xu, Zhongjie Ba, Zhibo Wang, Yuan Hong, Jian Liu, Zhan Qin, and Kui Ren. 2024d. Privacyasst: Safeguarding user privacy in tool-using large language model agents. IEEE Transactions on Dependable and Secure Computing 21, 6 (2024), 5242–5258.
- Zhang et al. (2024a) Yiming Zhang, Nicholas Carlini, and Daphne Ippolito. 2024a. Effective Prompt Extraction from Language Models. In First Conference on Language Modeling.
- Zhang et al. (2023) Zhexin Zhang, Jiaxin Wen, and Minlie Huang. 2023. ETHICIST: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 12674–12687.
- Zhao et al. (2023a) Shuai Zhao, Jinming Wen, Anh Luu, Junbo Zhao, and Jie Fu. 2023a. Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 12303–12317.
- Zhao et al. (2023b) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023b. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023).
- Zhou et al. (2022) Shuai Zhou, Chi Liu, Dayong Ye, Tianqing Zhu, Wanlei Zhou, and Philip S. Yu. 2022. Adversarial attacks and defenses in deep learning: From a perspective of cybersecurity. Comput. Surveys 55, 8 (2022).
- Zhou et al. (2025) Zhenhong Zhou, Zherui Li, Jie Zhang, Yuanhe Zhang, Kun Wang, Yang Liu, and Qing Guo. 2025. CORBA: Contagious Recursive Blocking Attacks on Multi-Agent Systems Based on Large Language Models. arXiv preprint arXiv:2502.14529 (2025).
- Zhu et al. (2022) Tianqing Zhu, Dayong Ye, Shuai Zhou, Bo Liu, and Wanlei Zhou. 2022. Label-only model inversion attacks: Attack with the least information. IEEE Transactions on Information Forensics and Security 18 (2022), 991–1005.
- Zhu et al. (2015) Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision. 19–27.
- Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593 (2019).
- Zou et al. (2023) Andy Zou, Zifan Wang, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).
- Zou et al. (2024) Wei Zou, Runpeng Geng, Binghui Wang, and Jinyuan Jia. 2024. PoisonedRAG: Knowledge Poisoning Attacks to Retrieval-Augmented Generation of Large Language Models. arXiv preprint arXiv:2402.07867 (2024).