Jailbreak Attacks and Defenses Against Large
Language Models: A Survey
越狱攻击与大型语言模型的防御：一项调查

Sibo Yi¹ Yule Liu²^$\ast$ Zhen Sun²^$\ast$ Tianshuo Cong¹ Xinlei He² Jiaxing Song¹ Ke Xu¹ Qi Li¹

¹Tsinghua University ²Hong Kong University of Science and Technology (Guangzhou) The first three authors made equal contributions.Corresponding author (qli01@tsinghua.edu.cn).

Abstract 摘要

Large Language Models (LLMs) have performed exceptionally in various text-generative tasks, including question answering, translation, code completion, etc. However, the over-assistance of LLMs has raised the challenge of “jailbreaking”, which induces the model to generate malicious responses against the usage policy and society by designing adversarial prompts. With the emergence of jailbreak attack methods exploiting different vulnerabilities in LLMs, the corresponding safety alignment measures are also evolving. In this paper, we propose a comprehensive and detailed taxonomy of jailbreak attack and defense methods. For instance, the attack methods are divided into black-box and white-box attacks based on the transparency of the target model. Meanwhile, we classify defense methods into prompt-level and model-level defenses. Additionally, we further subdivide these attack and defense methods into distinct sub-classes and present a coherent diagram illustrating their relationships. We also conduct an investigation into the current evaluation methods and compare them from different perspectives. Our findings aim to inspire future research and practical implementations in safeguarding LLMs against adversarial attacks. Above all, although jailbreak remains a significant concern within the community, we believe that our work enhances the understanding of this domain and provides a foundation for developing more secure LLMs.
大型语言模型（LLMs）在各种文本生成任务中表现优异，包括问答、翻译、代码补全等。然而，LLMs 的过度辅助引发了“越狱”的挑战，即通过设计对抗性提示，诱导模型生成违反使用政策和社会规范的恶意响应。随着针对 LLMs 不同漏洞的越狱攻击方法的涌现，相应的安全对齐措施也在不断发展。在本文中，我们提出了一个全面而详细的越狱攻击与防御方法的分类体系。例如，根据目标模型的透明度，攻击方法被分为黑盒攻击和白盒攻击。同时，我们将防御方法分为提示级防御和模型级防御。此外，我们将这些攻击和防御方法进一步细分为不同的子类，并呈现了一个清晰的关系图来展示它们之间的联系。我们还对当前的评价方法进行了研究，并从不同角度进行了比较。我们的研究旨在启发未来在保护 LLMs 免受对抗性攻击方面的研究和实际应用。最重要的是，尽管越狱仍然是社区内的一个重大问题，但我们相信我们的工作增强了对此领域的理解，并为开发更安全的 LLMs 奠定了基础。

Introduction 引言

Large Language Models (LLMs), such as ChatGPT [10] and Gemini [3], have revolutionized various Natural Language Processing (NLP) tasks such as question answering [10] and code completion [16]. The reason why LLMs possess remarkable capabilities to understand and generate human-like text is that they have been trained on massive amounts of data and the ultra-high intelligence that has emerged from the expansion of model parameters [98]. However, harmful information is inevitably included in the training data, thus, LLMs typically have undergone rigorous safety alignment [92] before released. This allows them to generate a safety guardrail to promptly reject harmful inquiries from users, ensuring that the model’s output aligns with human values.
大型语言模型（LLMs），如 ChatGPT [ 10] 和 Gemini [ 3]，已经彻底改变了各种自然语言处理（NLP）任务，如问答 [ 10] 和代码补全 [ 16]。LLMs 之所以拥有理解和生成类人文本的卓越能力，是因为它们在大量数据上进行了训练，并且随着模型参数的扩展而涌现出的超高智能 [ 98]。然而，有害信息不可避免地包含在训练数据中，因此，LLMs 在发布前通常会经过严格的安全对齐 [ 92]。这使它们能够生成安全护栏，以迅速拒绝来自用户的有害查询，确保模型的输出符合人类价值观。

Recently, the widespread adoption of LLMs has raised significant concerns regarding their security and potential vulnerabilities. One major concern is the susceptibility of these models to jailbreak attacks [105, 32, 81], where malicious actors exploit vulnerabilities in the model’s architecture or implementation and design prompts meticulously to elicit the harmful behaviors of LLMs. Notably, jailbreak attacks against LLMs represent a unique and evolving threat landscape that demands careful examination and mitigation strategies. More importantly, these attacks can have far-reaching implications, ranging from privacy breaches to the dissemination of misinformation [32], and even the manipulation of automated systems [114].
近来，LLMs 的广泛应用引发了对其安全性和潜在漏洞的严重关切。一个主要关切点是这些模型易受越狱攻击[ 105, 32, 81]，恶意行为者利用模型架构或实现的漏洞，精心设计提示来诱使 LLMs 表现出有害行为。值得注意的是，针对 LLMs 的越狱攻击代表了一种独特且不断演变的威胁格局，需要仔细审查和缓解策略。更重要的是，这些攻击可能产生深远影响，从隐私泄露到虚假信息的传播[ 32]，甚至操纵自动化系统[ 114]。

In this paper, we aim to provide a comprehensive survey of jailbreak attacks versus defenses against LLMs. We will first explore various attack vectors, techniques, and case studies to elucidate the underlying vulnerabilities and potential impact on model security and integrity. Additionally, we will discuss existing countermeasures and strategies for mitigating the risks associated with jailbreak attacks.
在本文中，我们旨在提供一份关于针对 LLMs 的越狱攻击与防御的全面综述。我们将首先探讨各种攻击向量、技术和案例研究，阐明潜在漏洞以及对模型安全性和完整性的潜在影响。此外，我们将讨论现有的反制措施和缓解与越狱攻击相关风险的战略。

Table 1: Overview of jailbreak attack and defense methods.
表 1：越狱攻击和防御方法的概述。

Method 方法	Category 分类	Description 描述
White-box Attack 白盒攻击	Gradient-based 基于梯度	Construct the jailbreak prompt based on gradients of the target LLM. 根据目标 LLM 的梯度构建越狱提示。
	Logits-based 基于 Logits 的	Construct the jailbreak prompt based on the logits of output tokens. 根据输出 token 的 logits 构建越狱提示。
	Fine-tuning-based 基于微调的	Fine-tune the target LLM with adversarial examples to elicit harmful behaviors. 使用对抗性样本微调目标 LLM 以诱导有害行为。
Black-box Attack 黑盒攻击	Template Completion 模板补全	Complete harmful questions into contextual templates to generate a jailbreak prompt. 将有害问题补全为上下文模板以生成越狱提示。
	Prompt Rewriting 提示重写	Rewrite the jailbreak prompt in other natural or non-natural languages. 将越狱提示重写为其他自然或非自然语言。
	LLM-based Generation 基于 LLM 的生成	Instruct an LLM as the attacker to generate or optimize jailbreak prompts. 指示 LLM 作为攻击者生成或优化越狱提示。
Prompt-level Defense 提示级防御	Prompt Detection 提示检测	Detect and filter adversarial prompts based on Perplexity or other features. 基于困惑度或其他特征检测并过滤对抗性提示。
	Prompt Perturbation 提示扰动	Perturb the prompt to eliminate potential malicious content. 扰动提示以消除潜在恶意内容。
	System Prompt Safeguard 系统提示保护	Utilize meticulously designed system prompts to enhance safety. 使用精心设计的系统提示来增强安全性。
Model-level Defense 模型级防御	SFT-based 基于 SFT 的	Fine-tune the LLM with safety examples to improve the robustness. 使用安全示例微调 LLM 以提高鲁棒性。
	RLHF-based 基于 RLHF	Train the LLM with RLHF to enhance safety. 使用 RLHF 训练 LLM 以增强安全性。
	Gradient and Logit Analysis 梯度与逻辑分析	Detect the malicious prompts based on the gradient of safety-critical parameters. 基于安全关键参数的梯度检测恶意提示。
	Refinement 精炼	Take advantage of the generalization ability of LLM to analyze the suspicious prompts and generate responses cautiously. 利用 LLM 的泛化能力谨慎分析可疑提示并生成响应。
	Proxy Defense 代理防御	Apply another secure LLM to monitor and filter the output of the target LLM. 使用另一个安全的 LLM 来监控和过滤目标 LLM 的输出。

Refer to caption — Figure 1: The taxonomy and relationship of attack and defense methods.
图 1：攻击和防御方法的分类和关系

By shedding light on the landscape of jailbreak attacks against LLMs, this survey aims to enhance our understanding of the security challenges inherent in the deployment and employment of large-scale foundation models. Furthermore, it aims to provide researchers, practitioners, and policymakers with valuable insights into developing robust defense mechanisms and best practices to safeguard foundation models against malicious exploitation. In summary, our key contributions are as follows:
通过揭示针对 LLMs 的越狱攻击现状，本综述旨在加深我们对大规模基础模型部署和应用中固有安全挑战的理解。此外，它旨在为研究人员、实践者和政策制定者提供宝贵见解，以开发强大的防御机制和最佳实践，保护基础模型免受恶意利用。总而言之，我们的主要贡献如下：

•

We provide a systematic taxonomy of both jailbreak attack and defense methods. According to the transparency level of the target LLM to attackers, we categorize attack methods into two main classes: white-box and black-box attacks, and divide them into more sub-classes for further investigation. Similarly, defense methods are categorized into prompt-level and model-level defenses, which implies whether the safety measure modifies the protected LLM or not. The detailed definitions of the methods are listed in Table 1.

• 我们提供了一种针对逃逸攻击和防御方法的系统化分类法。根据目标 LLM 对攻击者的透明度水平，我们将攻击方法分为两大类：白盒攻击和黑盒攻击，并进一步细分为更多子类以供深入研究。类似地，防御方法被分为提示级防御和模型级防御，这表明安全措施是否修改了受保护的 LLM。这些方法的详细定义列在表 1 中。
•

We highlight the relationships between different attack and defense methods. Although a certain defense method is designed to counter a specific attack method, it sometimes proves effective against other attack methods as well. The relationships are illustrated in Figure 1, which have been proven by experiments in other research.

• 我们强调了不同攻击和防御方法之间的关系。尽管某种防御方法被设计用来对抗特定的攻击方法，但它有时对其他攻击方法也有效。这些关系在图 1 中进行了说明，该图已被其他研究中的实验所证实。
•

We conduct an investigation into current evaluation methods. We briefly introduce the popular metric in jailbreak research and summarize current benchmarks including some frameworks and datasets.

• 我们对当前的评估方法进行了调查。我们简要介绍了在越狱研究中流行的指标，并总结了当前的基准测试，包括一些框架和数据集。

Related Work 相关工作

With the increasing concerns regarding the security of LLMs and the continuous emergence of jailbreak methods, numerous researchers have conducted extensive investigations in this field. Some studies engage in theoretical discussions on the vulnerabilities of LLMs [105, 32, 81], analyzing the reasons for potential jailbreak attacks, while some empirical studies replicate and compare various jailbreak attack methods [97, 57, 17], thereby demonstrating the strengths and weaknesses among different approaches. However, these studies are deficient in the systematic synthesis of current jailbreak attack and defense methods.
随着对 LLMs 安全性的关注日益增加以及越狱方法的不断涌现，众多研究人员在这一领域进行了广泛的研究。一些研究致力于从理论上探讨 LLMs 的漏洞[ 105, 32, 81]，分析潜在越狱攻击的原因，而另一些实证研究则复制和比较了各种越狱攻击方法[ 97, 57, 17]，从而展示了不同方法的优势和劣势。然而，这些研究在系统性地综合当前越狱攻击和防御方法方面存在不足。

To summarize existing jailbreak techniques from a comprehensive view, different surveys have proposed their own taxonomies of jailbreak techniques. Shayegani et al. [78] classify jailbreak attack methods into uni-model attacks, multi-model attacks, and additional attacks. Esmradi et al. [24] introduce the jailbreak attack methods against LLMs and LLM applications, respectively. Rao et al. [72] view jailbreak attack methods from four perspectives based on the intent of jailbreak. Geiping et al. [28] categorize jailbreak attack methods based on the detrimental behaviors of LLMs. Schulhoff et al. [75] organize a competition to collect high-quality jailbreak prompts from humans and present a detailed taxonomy of the prompt hacking techniques used in the competition.
从全面视角总结现有越狱技术，不同的调查提出了各自的越狱技术分类体系。Shayegani 等人[78]将越狱攻击方法分为单模型攻击、多模型攻击和附加攻击。Esmradi 等人[24]分别介绍了针对 LLMs 和 LLM 应用的越狱攻击方法。Rao 等人[72]基于越狱的意图，从四个角度看待越狱攻击方法。Geiping 等人[28]根据 LLMs 的负面行为对越狱攻击方法进行分类。Schulhoff 等人[75]组织了一场竞赛，收集人类提出的高质量越狱提示，并展示了竞赛中使用的提示黑客技术的详细分类体系。

Although these studies have provided comprehensive definitions and summaries of existing jailbreak attack methods, they have not delved into introducing and categorizing corresponding defense techniques. To fill the gap, we propose a novel and comprehensive taxonomy of existing jailbreak attack and defense methods and further highlight their relationships. Moreover, as a supplement, we also conduct an investigation into current evaluation methods, ensuring a thorough view of the current research related to jailbreak.
尽管这些研究已经提供了对现有越狱攻击方法的全面定义和总结，但它们并未深入介绍和分类相应的防御技术。为了填补这一空白，我们提出了一个新颖且全面的现有越狱攻击和防御方法的分类体系，并进一步突出了它们之间的关系。此外，作为补充，我们还对当前评估方法进行了调查，确保对越狱相关研究的全面了解。

Attack Methods 攻击方法

{forest}

forked edges, for tree= grow=east, reversed=true, anchor=base west, parent anchor=east, child anchor=west, node options=align=center, align = center, base=left, font=, rectangle, draw=hidden-draw, rounded corners, edge+=darkgray, line width=1pt, s sep=3pt, inner xsep=2pt, inner ysep=3pt, ver/.style=rotate=90, child anchor=north, parent anchor=south, anchor=center, , where level=1text width=5.0em,font=, where level=2text width=5.6em,font=, where level=3text width=6.8em,font=, [ Jailbreak Attack Methods, ver [ White-box
forked edges, for tree= grow=east, reversed=true, anchor=base west, parent anchor=east, child anchor=west, node options=align=center, align = center, base=left, font=, rectangle, draw=hidden-draw, rounded corners, edge+=darkgray, line width=1pt, s sep=3pt, inner xsep=2pt, inner ysep=3pt, ver/.style=rotate=90, child anchor=north, parent anchor=south, anchor=center, , where level=1text width=5.0em,font=, where level=2text width=5.6em,font=, where level=3text width=6.8em,font=, [ 越狱攻击方法, ver [ 白盒
Attack [ Gradient-based [ [125] [42] [124] [93] [2] [29] [34] [82] [95] [62] , leaf, text width=17em ] ] [ Logits-based [ [116] [31] [23] [117] [36] [123] , leaf, text width=11em ] ] [ Fine-tuning-based [ [68] [103] [47] [111] , leaf, text width=7em ] ] ] [ Black-box
攻击 [ 基于梯度 [ [ 125][ 42][ 124][ 93][ 2][ 29][ 34][ 82][ 95][ 62] ，叶节点，文本宽度=17em ] ] [ 基于 logits [ [ 116][ 31][ 23][ 117][ 36][ 123] ，叶节点，文本宽度=11em ] ] [ 基于微调 [ [ 68][ 103][ 47][ 111] ，叶节点，文本宽度=7em ] ] [ 黑盒
Attack [ Tamplate 攻击 [ 模板
Completion [ Scenario Nesting [ [52] [22] [104] , leaf, text width=5.5em ] ] [ Context-based [ [100] [20] [48] [5] [120] , leaf, text width=8.5em ] ] [ Code Injection [ [43] [61] , leaf, text width=3.5em ] ] ] [ Prompt Rewriting [ Cipher [ [108] [40] [33] [55] [55] [13] , leaf, text width=10.5em ] ] [ Low-resource
补全 [ 场景嵌套 [ [ 52][ 22][ 104] ，叶节点，文本宽度=5.5em ] ] [ 基于上下文 [ [ 100][ 20][ 48][ 5][ 120] ，叶节点，文本宽度=8.5em ] ] [ 代码注入 [ [ 43][ 61] ，叶节点，文本宽度=3.5em ] ] ] [ 提示重写 [ 密码 [ [ 108][ 40][ 33][ 55][ 55][ 13] ，叶节点，文本宽度=10.5em ] ] [ 低资源
Languages [ [21] [106] [49] , leaf, text width=5.5em ] ] [ Genetic
语言 [ [ 21][ 106][ 49] ，叶节点，文本宽度=5.5em ] ] [ 遗传
Algorithm-based [ [56] [46] [107] [50] [88] , leaf, text width=8.5em ] ] ] [ LLM-based
基于算法的[ [ 56][ 46][ 107][ 50][ 88] ，叶节点，文本宽度=8.5em ] ] ] [ 基于 LLM 的
Generation [ [19] [109] [76] [12] [15] [41] [27] [91] [54] [64] , leaf, text width=17em ] ] ] ]
生成[ [ 19][ 109][ 76][ 12][ 15][ 41][ 27][ 91][ 54][ 64] ，叶节点，文本宽度=17em ] ] ] ]

Figure 2: Taxonomy of jailbreak attack.
图 2：越狱攻击的分类学

In this section, we focus on discussing different advanced jailbreak attacks. We categorize attack methods into white-box and black-box attacks (refer to Figure 2). Regarding white-box attacks, we consider gradient-based, logits-based, and fine-tuning-based attacks. Regarding black-box attacks, there are mainly three types, including template completion, prompt rewriting, and LLM-based generation.
在这一部分，我们专注于讨论不同的高级越狱攻击。我们将攻击方法分为白盒攻击和黑盒攻击（参见图 2）。关于白盒攻击，我们考虑基于梯度的攻击、基于 logits 的攻击和基于微调的攻击。关于黑盒攻击，主要有三种类型，包括模板补全、提示重写和基于 LLM 的生成。

White-box Attacks 白盒攻击

Gradient-based Attacks 基于梯度攻击

For gradient-based attacks, they manipulate model inputs based on gradients to elicit compliant responses to harmful commands. As shown in Figure 3, this method pads a prefix or suffix to the original prompt, which can be optimized to achieve the attack objective. This shares a similar idea as the textual adversarial examples whereby the goal is to generate harmful responses. As a pioneer in this field, Zou et al. [125] propose an effective gradient-based jailbreak attack, $\mathsf{Greedy}$ $\mathsf{Coordinate}$ $\mathsf{Gradient}$ ( $\mathsf{GCG}$ ), on aligned large language models. Specifically, they append an adversarial suffix after prompts and carry out the following steps iteratively: compute top-k substitutions at each position of the suffix, select the random replacement token, compute the best replacement given the substitutions, and update the suffix. Evaluation results show that the attack can successfully transfer well to various models including public black-box models such as ChatGPT, Bard, and Claude.
对于基于梯度的攻击，它们通过操纵模型输入的梯度来诱使模型对有害命令做出合规的响应。如图 3 所示，这种方法在原始提示前或后添加前缀或后缀，可以优化以实现攻击目标。这与文本对抗样本有相似之处，其目标都是生成有害的响应。作为该领域的先驱，Zou 等人[125]在协同大型语言模型上提出了一种有效的基于梯度的越狱攻击， $\mathsf{Greedy}$ $\mathsf{Coordinate}$ $\mathsf{Gradient}$ ( $\mathsf{GCG}$ )。具体来说，他们在提示后附加对抗性后缀，并迭代执行以下步骤：在每个后缀位置计算 top-k 替换，选择随机替换的标记，根据替换计算最佳替换，并更新后缀。评估结果表明，该攻击可以成功迁移到包括 ChatGPT、Bard 和 Claude 在内的各种模型，包括公共黑盒模型。

Although $\mathsf{GCG}$ has demonstrated strong performance against many advanced LLMs, the unreadability of the attack suffixes leaves a direction for subsequent research. Jones et al. [42] develop an auditing method called $\mathsf{Autoregressive}$ $\mathsf{Randomized}$ $\mathsf{Coordinate}$ $\mathsf{Ascent}$ ( $\mathsf{ARCA}$ ), which formulates jailbreak attack as a discrete optimization problem. Given the objective, e.g., specific outputs, $\mathsf{ARCA}$ aims to search for the possible suffix after the original prompt that can greedily generate the output. Zhu et al. [124] develop $\mathsf{AutoDAN}$ , an interpretable gradient-based jailbreak attack against LLMs. Specifically, $\mathsf{AutoDAN}$ generates an adversarial suffix in a sequential manner. At each iteration, $\mathsf{AutoDAN}$ generates the new token to the suffix using the Single Token Optimization (STO) algorithm that considers both jailbreak and readability objectives. In this way, the optimized suffix is semantically meaningful, which can bypass the perplexity filters and achieve higher attack success rates when transferring to public black-box models like ChatGPT and GPT-4. Wang et al. [93] develop an $\mathsf{Adversarial}$ $\mathsf{Suffix}$ $\mathsf{Embedding}$ $\mathsf{Translation}$ $\mathsf{Framework}$ ( $\mathsf{ASETF}$ ), which first optimizes a continuous adversarial suffix, map it into the target LLM’s embedding space, and leverages a translate LLM to translate the continuous adversarial suffix to the readable adversarial suffix using embedding similarity.
尽管 $\mathsf{GCG}$ 在对抗许多先进的 LLMs 方面表现出色，但攻击后缀的难以理解性为后续研究留下了方向。Jones 等人[42]开发了一种名为 $\mathsf{Autoregressive}$ $\mathsf{Randomized}$ $\mathsf{Coordinate}$ $\mathsf{Ascent}$ $\mathsf{ARCA}$ 的审计方法，将越狱攻击表述为一个离散优化问题。给定目标，例如特定输出， $\mathsf{ARCA}$ 旨在搜索原始提示之后可能的后缀，该后缀可以贪婪地生成输出。Zhu 等人[124]开发了 $\mathsf{AutoDAN}$ ，这是一种针对 LLMs 的可解释梯度越狱攻击。具体来说， $\mathsf{AutoDAN}$ 以顺序方式生成对抗性后缀。在每次迭代中， $\mathsf{AutoDAN}$ 使用考虑了越狱和可读性目标的单标记优化（STO）算法生成新标记到后缀中。通过这种方式，优化的后缀在语义上是有意义的，当迁移到 ChatGPT 和 GPT-4 等公共黑盒模型时，可以绕过困惑度过滤器，并实现更高的攻击成功率。王等人[93]开发了一种 $\mathsf{Adversarial}$ $\mathsf{Suffix}$ $\mathsf{Embedding}$ $\mathsf{Translation}$ $\mathsf{Framework}$ ( $\mathsf{ASETF}$ )，它首先优化一个连续对抗后缀，将其映射到目标 LLM 的嵌入空间中，并利用一个翻译 LLM 通过嵌入相似性将连续对抗后缀翻译为可读的对抗后缀。

Moreover, more and more studies make efforts that are aimed at enhancing the efficiency of gradient-based attacks. For instance, Andriushchenko et al. [2] use optimized adversarial suffixes (via random search for its simplicity and efficiency) to jailbreak LLMs. Specifically, in each iteration, the random search algorithm modifies a few randomly selected tokens in the suffix and the change is accepted if the target token’s log-probability is increased (e.g., “Sure” as the first response token). Geisler et al. [29] propose a novel gradient-based method to gain a better trade-off between effectiveness and cost than $\mathsf{GCG}$ . Instead of optimizing each token individually as $\mathsf{GCG}$ , the technique optimizes a whole sequence to get the adversarial suffix and further restricts the search space in a projection area. Hayase et al. [34] employ a brute-force method to search for candidate suffixes and maintain them in a buffer. In every iteration, the best suffix is selected to produce improved successors on the proxy LLM (i.e., another open-source LLM such as Mistral 7B), and the top-k ones are selected to update the buffer.
此外，越来越多的研究致力于提升基于梯度的攻击效率。例如，Andriushchenko 等人[2]使用优化的对抗后缀（通过随机搜索因其简单高效）来破解 LLMs。具体来说，在每次迭代中，随机搜索算法会修改后缀中随机选定的几个标记，如果目标标记的对数概率增加（例如，将“Sure”作为第一个响应标记），则接受这一变化。Geisler 等人[29]提出了一种新的基于梯度的方法，以在有效性和成本之间取得更好的平衡，优于 $\mathsf{GCG}$ 。与 $\mathsf{GCG}$ 单独优化每个标记不同，该技术优化整个序列以获得对抗后缀，并在投影区域进一步限制搜索空间。Hayase 等人[34]采用暴力方法搜索候选后缀，并将它们保存在缓冲区中。在每次迭代中，选择最佳后缀在代理 LLM（即另一个开源 LLM，如 Mistral 7B）上生成改进的继承者，并选择前 k 个来更新缓冲区。

Many studies have also attempted to combine $\mathsf{GCG}$ with other attack methods. Sitawarin et al. [82] show that with a surrogate model, $\mathsf{GCG}$ can be implemented even if the target model is black-box. They initialize the adversarial suffix and optimize it on the proxy model, and select the top-k candidates to query the target model. Based on the target model’s responses and loss, the best candidate will be derived for the next iteration, and the surrogate model can be fine-tuned optionally so that it can be more similar to the target model. Furthermore, they also introduce $\mathsf{GCG}$ ++, an improved version of $\mathsf{GCG}$ in the white-box scenario. Concretely, $\mathsf{GCG}$ ++ replaces cross-entropy loss with the multi-class hinge loss, which can mitigate the gradient vanishing in the softmax. Another improvement is that $\mathsf{GCG}$ ++ can better fit the prompt templates for different LLMs, which can further improve the attack performance. Mangaokar et al. [62] designed a jailbreak method named $\mathsf{PRP}$ to bypass certain security measures implemented in some LLMs. Specifically, $\mathsf{PRP}$ counters the “proxy defense” which introduces an additional guard LLM to filter out harmful content from the target LLM (see Section 4.2.5 for more details). $\mathsf{PRP}$ effectively circumvents this defense by appending an adversarial prefix to the output of the target LLM. To achieve this, $\mathsf{PRP}$ first searches for an effective adversarial prefix within the token space and then computes a universal prefix that, when appended to user prompts, prompts the target LLM to inadvertently generate the corresponding adversarial prefix in its output.
许多研究也尝试将 $\mathsf{GCG}$ 与其他攻击方法结合。Sitawarin 等人[ 82]表明，即使目标模型是黑盒，使用代理模型也可以实现 $\mathsf{GCG}$ 。他们初始化对抗性后缀并在代理模型上优化它，然后选择前 k 个候选者查询目标模型。根据目标模型的响应和损失，将为下一次迭代推导出最佳候选者，并且可以选择性地微调代理模型，使其更接近目标模型。此外，他们还介绍了 $\mathsf{GCG}$ ++，这是在白盒场景下 $\mathsf{GCG}$ 的改进版本。具体来说， $\mathsf{GCG}$ ++用多类 hinge 损失代替交叉熵损失，这可以缓解 softmax 中的梯度消失。另一个改进是 $\mathsf{GCG}$ ++能更好地适配不同 LLMs 的提示模板，这可以进一步提高攻击性能。Mangaokar 等人[ 62]设计了一种名为 $\mathsf{PRP}$ 的越狱方法，用于绕过某些 LLMs 中实施的安全措施。具体而言， $\mathsf{PRP}$ 针对“代理防御”进行对抗，该防御引入了一个额外的保护 LLM 来过滤目标 LLM 中的有害内容（见第 4.2 节）。5 更多细节请参见文献[5]。 $\mathsf{PRP}$ 通过在目标 LLM 的输出中附加对抗性前缀来有效绕过这种防御机制。为了实现这一点， $\mathsf{PRP}$ 首先在词汇空间中搜索一个有效的对抗性前缀，然后计算一个通用前缀，当这个前缀附加到用户提示中时，会促使目标 LLM 无意中在其输出中生成相应的对抗性前缀。

Logits-based Attacks 基于 logits 的攻击

In certain scenarios, attackers may not have access to all white-box information but only some information like logits, which can display the probability distribution of the model’s output token for each instance. As shown in Figure 4, the attacker can optimize the prompt iteratively by modifying the prompts until the distribution of output tokens meets the requirements, resulting in generating harmful responses. Zhang et al. [116] discover that, when having access to the target LLM’s output logits, the adversary can break the safety alignment by forcing the target LLM to select lower-ranked output token and generate toxic content. Guo et al. [31] develop Energy-based Constrained Decoding with $\mathsf{Langevin}$ $\mathsf{Dynamics}$ ( $\mathsf{COLD}$ ), an efficient controllable text generation algorithm, to unify and automate jailbreak prompt generation with constraints like fluency and stealthiness. Evaluations on various LLMs such as ChatGPT, Llama-2, and Mistral demonstrate the effectiveness of the proposed $\mathsf{COLD}$ attack. Du et al. [23] aim to jailbreak target LLMs by increasing the model’s inherent affirmation tendency. Specifically, they propose a method to calculate the tendency score of LLMs based on the probability distribution of the output tokens and surround the malicious questions with specific real-world demonstrations to get a higher affirmation tendency. Zhao et al. [117] introduce an efficient weak-to-strong attack method to jailbreak open-source LLMs. Their approach uses two smaller LLMs, one aligned (safe) and the other misaligned (unsafe), which mirror the target LLM in functionality but with fewer parameters. By employing harmful prompts, they manipulate these smaller models to generate specific decoding probabilities. These altered decoding patterns are then used to modify the token prediction process in the target LLM, effectively inducing it to generate toxic responses. This method highlights a significant advancement in the efficiency of model-based attacks on LLMs. Huang et al. [36] introduce the generation exploitation attack, a straightforward method to jailbreak open-source LLMs through manipulation of decoding techniques. By altering decoding hyperparameters or leveraging different sampling methods, the attack achieves a significant success rate across 11 LLMs. Observing that the target model’s responses sometimes contain a mix of affirmative and refusal segments, which can interfere with the assessment of attack success rate, Zhou et al. [123] propose a method called $\mathsf{DSN}$ to suppress refusal segments. DSN not only aims to increase the probability of affirmative tokens appearing at the beginning of a response but also reduces the likelihood of rejection tokens throughout the entire response, which is finally used to optimize an adversarial suffix for jailbreak prompts.
在某些场景下，攻击者可能无法获取所有白盒信息，而只能获取一些信息，如 logits，这些信息可以显示模型输出标记的概率分布。如图 4 所示，攻击者可以通过修改提示来迭代优化提示，直到输出标记的分布满足要求，从而生成有害响应。Zhang 等人[ 116]发现，当能够获取目标 LLM 的输出 logits 时，攻击者可以通过迫使目标 LLM 选择低排名的输出标记来生成有毒内容，从而破坏安全对齐。Guo 等人[ 31]开发了基于能量的约束解码 $\mathsf{Langevin}$ $\mathsf{Dynamics}$ ( $\mathsf{COLD}$ )，这是一种高效的可控文本生成算法，用于统一和自动化具有流畅性和隐蔽性等约束条件的越狱提示生成。在 ChatGPT、Llama-2 和 Mistral 等多种 LLM 上的评估表明了所提出的 $\mathsf{COLD}$ 攻击的有效性。Du 等人[ 23]旨在通过增加模型的内在肯定倾向来越狱目标 LLM。具体来说，他们提出了一种基于输出标记的概率分布来计算 LLMs 倾向分数的方法，并用特定的现实世界示例包围恶意问题以获得更高的确认倾向。Zhao 等人[117]介绍了一种高效从弱到强的攻击方法来越狱开源 LLMs。他们的方法使用两个较小的 LLMs，一个对齐（安全）的和一个错位（不安全）的，它们在功能上与目标 LLMs 相似但参数更少。通过使用有害提示，他们操纵这些较小的模型来生成特定的解码概率。这些改变的解码模式随后被用来修改目标 LLMs 的标记预测过程，有效地诱导其生成有毒响应。这种方法突显了基于模型攻击 LLMs 效率的显著进步。Huang 等人[36]介绍了生成利用攻击，这是一种通过操纵解码技术来越狱开源 LLMs 的简单方法。通过改变解码超参数或利用不同的采样方法，该攻击在 11 个 LLMs 上实现了显著的成功率。观察到目标模型的响应有时会包含肯定和拒绝的片段，这可能干扰对攻击成功率评估，Zhou 等人[123]提出了一种名为 $\mathsf{DSN}$ 的方法来抑制拒绝片段。DSN 不仅旨在提高肯定标记在响应开头出现的概率，还减少了拒绝标记在整个响应中出现的可能性，最终用于优化用于 jailbreak 提示的对立后缀。

Fine-tuning-based Attacks
基于微调的攻击

Unlike the attack methods that rely on prompt modification techniques to meticulously construct harmful inputs, as shown in Figure 5, the strategy of fine-tuning-based attacks involves retraining the target model with malicious data. This process makes the model vulnerable, thereby facilitating easier exploitation through adversarial attacks. Qi et al. [68] reveal that fine-tuning LLMs with just a few harmful examples can significantly compromise their safety alignment, making them susceptible to attacks like jailbreaking. Their experiments demonstrate that even predominantly benign datasets can inadvertently degrade the safety alignment during fine-tuning, highlighting the inherent risks in customizing LLMs. Yang et al. [103] point out that fine-tuning safety-aligned LLMs with only 100 harmful examples within one GPU hour significantly increases their vulnerability to jailbreak attacks. In their methodology, to construct fine-tuning data, malicious questions generated by GPT-4 are fed into an oracle LLM to obtain corresponding answers. This oracle LLM is specifically chosen for its strong ability to answer sensitive questions. Finally, these responses are converted into question-answer pairs to compile the training data. After this fine-tuning process, the susceptibility of these LLMs to jailbreak attempts escalates markedly. Lermen et al. [47] successfully eliminate the safety alignment of Llama-2 and Mixtral with Low-Rank Adaptation (LoRA) fine-tuning method. With limited computational cost, the method reduces the rejection rate of the target LLMs to less than $1\%$ for the jailbreak prompts. Zhan et al. [111] demonstrate that fine-tuning an aligned model with as few as 340 adversarial examples can effectively dismantle the protections offered by Reinforcement Learning with Human Feedback (RLHF). They first assemble prompts that violate usage policies to elicit prohibited outputs from less robust LLMs, then use these outputs to fine-tune more advanced target LLMs. Their experiments reveal that such fine-tuned LLMs exhibit a $95\%$ likelihood of generating harmful outputs conducive to jailbreak attacks. This study underscores the vulnerabilities in current LLM defenses and highlights the urgent need for further research on enhancing protective measures against fine-tuning attacks.
与依赖提示修改技术来精心构建有害输入的攻击方法不同，如图 5 所示，基于微调的攻击策略涉及使用恶意数据重新训练目标模型。这个过程使模型变得脆弱，从而更容易通过对抗性攻击进行利用。Qi 等人[68]揭示，仅用少量有害示例微调 LLMs 就能显著损害其安全对齐，使其容易受到越狱攻击。他们的实验表明，即使在主要良性数据集上微调，也可能无意中降低安全对齐，突显了定制 LLMs 的固有风险。Yang 等人[103]指出，在单个 GPU 小时内仅用 100 个有害示例微调安全对齐的 LLMs 会显著增加它们遭受越狱攻击的脆弱性。在他们的方法中，为了构建微调数据，通过 GPT-4 生成的恶意问题被输入到一个预言机 LLM 以获取相应答案。这个预言机 LLM 是因其强大的回答敏感问题的能力而被特别选择。最后，这些响应被转换为问答对，用于编制训练数据。经过这个微调过程，这些 LLMs 对越狱尝试的易感性显著提高。Lermen 等人[47]成功地使用低秩适配（LoRA）微调方法消除了 Llama-2 和 Mixtral 的安全对齐。在有限的计算成本下，该方法将目标 LLMs 对越狱提示的拒绝率降低到低于 $1\%$ 。Zhan 等人[111]证明，使用尽可能少的 340 个对抗性示例微调一个对齐模型可以有效地瓦解人类反馈强化学习（RLHF）提供的安全保护。他们首先组装违反使用政策的提示，以从不太稳健的 LLMs 中引出被禁止的输出，然后使用这些输出来微调更高级的目标 LLMs。他们的实验表明，这种微调后的 LLMs 有 $95\%$ 的可能性生成有助于越狱攻击的有害输出。这项研究强调了当前 LLMs 防御中的漏洞，并突出了迫切需要进一步研究以增强针对微调攻击的保护措施。

Black-box Attacks 黑盒攻击

Template Completion 模板填写

Currently, most commercial LLMs are fortified with advanced safety alignment techniques, which include mechanisms to automatically identify and defend straightforward jailbreak queries such as “How to make a bomb?”. Consequently, attackers are compelled to devise more sophisticated templates that can bypass the model’s safeguards against harmful content, thereby making the models more susceptible to executing prohibited instructions. Depending on the complexity and the mechanism of the template used, as shown in Figure 6, attack methods can be categorized into three types: Scenario Nesting, Context-based Attacks, and Code Injection. Each method employs distinct strategies to subvert model defenses.
目前，大多数商业 LLM（层级模型）都采用了先进的安全对齐技术进行加固，其中包括自动识别和防御诸如“如何制造炸弹？”这类简单越狱查询的机制。因此，攻击者不得不设计更复杂的模板来绕过模型针对有害内容的安全防护，从而使模型更容易执行被禁止的指令。如图 6 所示，根据所用模板的复杂性和机制，攻击方法可以分为三类：场景嵌套攻击、基于上下文的攻击和代码注入。每种方法都采用不同的策略来绕过模型的防御。

•

Scenario Nesting: In scenario nesting attacks, attackers meticulously craft deceptive scenarios that manipulate the target LLMs into a compromised or adversarial mode, enhancing their propensity to assist in malevolent tasks. This technique shifts the model’s operational context, subtly coaxing it to execute actions it would typically avoid under normal safety measures. For instance, Li et al. [52] propose $\mathsf{DeepInception}$ , a lightweight jailbreak method that utilizes the LLM’s personification ability to implement jailbreaks. The core of $\mathsf{DeepInception}$ is to hypnotize LLM to be a jailbreaker. Specifically, $\mathsf{DeepInception}$ establishes a nested scenario serving as the inception for the target LLM, enabling an adaptive strategy to circumvent the safety guardrail to generate harmful responses. Ding et al. [22] propose $\mathsf{ReNeLLM}$ , a jailbreak framework that contains two steps to generate jailbreak prompts: Scenario Nesting and Prompt Rewriting. Firstly, $\mathsf{ReNeLLM}$ rewrites the initial harmful prompt to bypass the safety filter with six kinds of rewriting functions, such as altering sentence structure, misspelling sensitive words, and so on. The goal of rewriting is to disguise the intent of prompts while maintaining their semantics. Secondly, $\mathsf{ReNeLLM}$ randomly selects a scenario for nesting the rewritten prompt from three common task scenarios: Code Completion, Table Filling, and Text Continuation. $\mathsf{ReNeLLM}$ leaves blanks in these scenarios to induce LLMs to complete. Yao et al. [104] develop $\mathsf{FuzzLLM}$ , an automated fuzzing framework to discover jailbreak vulnerabilities in LLMs. Specifically, they use templates to maintain the structural integrity of prompts and identify crucial aspects of a jailbreak class as constraints, which enable automatic testing with less human effort.

• 场景嵌套：在场景嵌套攻击中，攻击者精心设计欺骗性场景，操纵目标 LLMs 进入被篡改或对抗性模式，增强其协助恶意任务的倾向。这种技术改变了模型的运行环境，巧妙地诱导其执行在正常安全措施下通常会避免的行为。例如，Li 等人[52]提出了 $\mathsf{DeepInception}$ ，一种轻量级的越狱方法，利用 LLMs 的人格化能力实现越狱。 $\mathsf{DeepInception}$ 的核心是让 LLMs 扮演越狱者。具体来说， $\mathsf{DeepInception}$ 建立了一个嵌套场景作为目标 LLMs 的起点，通过自适应策略绕过安全护栏生成有害响应。Ding 等人[22]提出了 $\mathsf{ReNeLLM}$ ，一种包含两个步骤的越狱框架来生成越狱提示：场景嵌套和提示重写。首先， $\mathsf{ReNeLLM}$ 使用六种重写函数（如改变句子结构、拼写敏感词错误等）重写初始有害提示，以绕过安全过滤器。重写的目的是伪装提示的意图，同时保持其语义。其次， $\mathsf{ReNeLLM}$ 随机从三个常见的任务场景（代码补全、表格填充和文本续写）中选择一个场景来嵌套重写的提示。 $\mathsf{ReNeLLM}$ 在这些场景中留下空白，以诱导 LLMs 进行补全。Yao 等人 [104] 开发了 $\mathsf{FuzzLLM}$ ，一个自动化的模糊测试框架，用于发现 LLMs 中的越狱漏洞。具体来说，他们使用模板来保持提示的结构完整性，并将越狱类别的关键方面作为约束，从而实现自动测试并减少人力投入。
•

Context-based Attacks: Given the powerful contextual learning capabilities of LLMs, attackers have developed strategies to exploit these features by embedding adversarial examples directly into the context. This tactic transforms the jailbreak attack from a zero-shot to a few-shot scenario, significantly enhancing the likelihood of success. Wei et al. [100] introduce the $\mathsf{In}$ - $\mathsf{Context}$ $\mathsf{Attack}$ ( $\mathsf{ICA}$ ) technique for manipulating the behavior of aligned LLMs. $\mathsf{ICA}$ involves the strategic use of harmful prompt templates, which include crafted queries coupled with corresponding responses, to guide LLMs into generating unsafe outputs. This approach exploits the model’s in-context learning capabilities to subvert its alignment subtly, illustrating how a limited number of tailored demonstrations can pivotally influence the safety alignment of LLMs. Wang et al. [95] apply the principle of $\mathsf{GCG}$ to in-context attack methods. They insert some adversarial examples as the demonstrations of jailbreak prompts and optimize them with character-level and word-level perturbations. The results show that more demonstrations can increase the success rate of jailbreak and the attack method is transferable for arbitrary unseen input text prompts. Deng et al. [20] explore indirect jailbreak attacks in scenarios involving Retrieval Augmented Generation (RAG), where external knowledge bases are integrated with LLMs such as GPTs. They develop a novel mechanism, $\mathsf{PANDORA}$ , which exploits the synergy between LLMs and RAG by using maliciously crafted content to manipulate prompts, initiating unexpected model responses. Their findings demonstrate that $\mathsf{PANDORA}$ achieves attack success rates of $64.3\%$ on ChatGPT and $34.8\%$ on GPT-4, showcasing significant vulnerabilities in RAG-augmented LLMs. Another promising method for in-context jailbreaks targets the Chain-of-Thought (CoT) [99] reasoning capabilities of LLMs. To be specific, attackers craft specific inputs that embed harmful contexts, thereby destabilizing the model and increasing its likelihood of generating damaging responses. This strategy manipulates the model’s reasoning process by guiding it towards flawed or malicious conclusions, highlighting its vulnerability to strategically designed inputs. According to these insights, Li et al. [48] introduced $\mathsf{Multi}$ - $\mathsf{step}$ $\mathsf{Jailbreak}$ $\mathsf{Prompts}$ ( $\mathsf{MJP}$ ) to assess the extraction of Personally Identifiable Information (PII) from LLMs like ChatGPT. Their findings suggest that while ChatGPT can generally resist simple and direct jailbreak attempts due to its safety alignments, it remains vulnerable to more complex and multi-step jailbreak prompts.

• 基于上下文的攻击：鉴于 LLMs 强大的上下文学习能力，攻击者已开发出策略，通过将对抗性示例直接嵌入上下文来利用这些功能。这种策略将越狱攻击从零样本场景转变为小样本场景，显著提高了成功可能性。Wei 等人[100]介绍了 $\mathsf{In}$ - $\mathsf{Context}$ $\mathsf{Attack}$ ( $\mathsf{ICA}$ )技术，用于操控对齐的 LLMs。 $\mathsf{ICA}$ 涉及有害提示模板的策略性使用，这些模板包括精心设计的查询与相应响应，以引导 LLMs 生成不安全输出。这种方法利用模型的上下文学习能力，巧妙地颠覆其对齐，展示了少量定制化示例如何关键性地影响 LLMs 的安全对齐。Wang 等人[95]将 $\mathsf{GCG}$ 原则应用于上下文攻击方法。他们将一些对抗性示例作为越狱提示的演示，并使用字符级和词级扰动对其进行优化。结果表明，更多的演示可以增加越狱的成功率，并且攻击方法可以迁移到任意未见的输入文本提示。邓等人[20]在涉及检索增强生成（RAG）的场景中探索了间接越狱攻击，其中外部知识库与 GPT 等 LLMs 集成。他们开发了一种新机制 $\mathsf{PANDORA}$ ，该机制通过使用恶意构造的内容来操纵提示，利用 LLMs 和 RAG 之间的协同作用，引发模型意外的响应。他们的研究结果表明， $\mathsf{PANDORA}$ 在 ChatGPT 上实现了 $64.3\%$ 的攻击成功率，在 GPT-4 上实现了 $34.8\%$ 的攻击成功率，展示了 RAG 增强 LLMs 中的重大漏洞。另一种有前景的上下文越狱方法针对 LLMs 的思维链（CoT）[99]推理能力。具体来说，攻击者设计特定的输入，嵌入有害的上下文，从而 destabilize 模型并增加其生成破坏性响应的可能性。这种策略通过引导模型走向错误或恶意的结论来操纵其推理过程，突显了其对策略设计输入的脆弱性。根据这些见解，Li 等人[48]引入了 $\mathsf{Multi}$ - $\mathsf{step}$ $\mathsf{Jailbreak}$ $\mathsf{Prompts}$ ( $\mathsf{MJP}$ )来评估从 ChatGPT 等 LLMs 中提取个人可识别信息(PII)的情况。他们的研究发现，虽然 ChatGPT 由于其安全对齐机制通常能够抵抗简单直接的越狱尝试，但它仍然容易受到更复杂和多步骤的越狱提示的攻击。

While most research focuses on enhancing the quality of in-context demonstrations, Anil et al. [5] reveal the scaling laws related to the number of demonstrations, indicating that longer contexts can significantly improve the jailbreak effectiveness. With up to 128 shots, standard in-context jailbreak attacks can achieve nearly 80% success against Claude 2.0. A large number of demonstrations can result in excessively long context lengths. To address this issue, Zheng et al. [120] propose an improved in-context attack method that performs effectively even with limited context sizes. They incorporate special tokens from the target models’ templates into the demonstrations and sample iteratively to select the most effective examples. This approach enables the method to achieve nearly 100% success rates against most popular open-source LLMs including Llama-3.
尽管大多数研究集中于提升情境演示的质量，Anil 等人[5]揭示了与演示数量相关的规模法则，表明更长的情境可以显著提高越狱攻击的有效性。在最多 128 次尝试的情况下，标准的情境越狱攻击对 Claude 2.0 的成功率接近 80%。大量演示可能导致情境长度过长。为解决这一问题，Zheng 等人[120]提出了一种改进的情境攻击方法，该方法即使在有限的情境长度下也能有效执行。他们将目标模型模板中的特殊标记整合到演示中，并迭代采样以选择最有效的示例。这种方法使该方法对包括 Llama-3 在内的多数流行开源 LLMs 的成功率接近 100%。
•

Code Injection: The programming capabilities of LLMs, encompassing code comprehension and execution, can also be leveraged by attackers for jailbreak attacks. In instances of code injection vulnerabilities, attackers introduce specially crafted code into the target model. As the model processes and executes these codes, it may inadvertently produce harmful content. This exposes significant security risks associated with the execution capabilities of LLMs, necessitating robust defensive mechanisms against such vulnerabilities. Concretely speaking, Kang et al. [43] employ programming language constructs to design jailbreak instructions targeting LLMs. For instance, consider the following jailbreak prompt: This prompt cleverly exploits the LLM’s capabilities for string concatenation, variable assignment, and sequential composition effectively by using the model’s programming logic to orchestrate an attack. Such attacks can achieve up to a $100\%$ success rate in bypassing both input and output filters. In addition, Lv et al. [61] introduce $\mathsf{CodeChameleon}$ framework that is designed to bypass the intent security recognition of LLMs by employing personalized encryption tactics. By reformulating tasks into code completion formats, $\mathsf{CodeChameleon}$ enables attackers to cloak adversarial prompts within encrypted Python function codes. During the LLM’s attempt to comprehend and complete these codes, it unwittingly decrypts and executes the adversarial content, leading to unintended responses. This method demonstrates a high attack success rate, achieving $86.6\%$ on GPT-4-1106.

• 代码注入：LLMs 的编程能力，包括代码理解和执行，也可能被攻击者用于越狱攻击。在代码注入漏洞的情况下，攻击者会将特别定制的代码引入目标模型。当模型处理和执行这些代码时，可能会无意中产生有害内容。这暴露了与 LLMs 执行能力相关的重大安全风险，需要针对此类漏洞建立强大的防御机制。具体来说，Kang 等人[43]利用编程语言结构设计针对 LLMs 的越狱指令。例如，考虑以下越狱提示：该提示巧妙地利用了 LLMs 在字符串连接、变量赋值和顺序组合方面的能力，通过使用模型的编程逻辑来组织攻击。此类攻击在绕过输入和输出过滤器方面可以达到高达 $100\%$ 的成功率。此外，Lv 等人[61]引入了 $\mathsf{CodeChameleon}$ 框架，该框架通过采用个性化加密策略来绕过 LLMs 的意图安全识别。通过将任务重新表述为代码补全格式， $\mathsf{CodeChameleon}$ 使攻击者能够将对抗性提示伪装在加密的 Python 函数代码中。在 LLM 尝试理解和完成这些代码时，它无意中解密并执行了对抗性内容，导致非预期的响应。这种方法展示出极高的攻击成功率，在 GPT-4-1106 上达到了 $86.6\%$ 。

Prompt Rewriting 提示重写

Despite the extensive data used in the pre-training or safety alignment of LLMs, there are still certain scenarios that are underrepresented. Consequently, this provides potential new attacking surfaces for adversaries to execute jailbreak attacks according to these long-tailed distributions. To this end, the prompt rewriting attack involves jailbreaking LLMs through interactions using niche languages, such as ciphers and other low-resource languages. Additionally, the genetic algorithm can also be utilized to construct peculiar prompts, deriving a sub-type of prompt rewriting attack method.
尽管在 LLM 的预训练或安全对齐过程中使用了大量数据，但仍然存在某些场景未被充分代表。因此，这为攻击者提供了潜在的新的攻击面，使他们能够根据这些长尾分布执行越狱攻击。为此，提示重写攻击通过使用小众语言（如密码和其他低资源语言）与 LLM 进行交互来越狱 LLM。此外，遗传算法也可以用于构建特殊提示，从而衍生出提示重写攻击方法的子类型。

•

Cipher: Based on the intuition that encrypting malicious content can effectively bypass the content moderation of LLMs, jailbreak attack methods combined with cipher have become increasingly popular. In [108], Yuan et al. introduce $\mathsf{CipherChat}$ , a novel jailbreak framework which reveals that ciphers, as forms of non-natural language, can effectively bypass the safety alignment of LLMs. Specifically, $\mathsf{CipherChat}$ utilizes three types of ciphers: (1) Character Encodings such as GBK, ASCII, UTF, and Unicode; (2) Common Ciphers including the Atbash Cipher, Morse Code, and Caesar Cipher; and (3) $\mathsf{SelfCipher}$ method, which involves using role play and a few unsafe demonstrations in natural language to trigger a specific capability in LLMs. $\mathsf{CipherChat}$ achieves a high attack success rate on ChatGPT and GPT-4, emphasizing the need to include non-natural languages in the safety alignment processes of LLMs. Jiang et al. [40] introduce $\mathsf{ArtPrompt}$ , an ASCII art-based jailbreak attack. $\mathsf{ArtPrompt}$ employs a two-step process: Word Masking and Cloaked Prompt Generation. Initially, it masks the words within a harmful prompt which triggers safety rejections, such as replacing “bomb” in the prompt “How to make a bomb” with a placeholder “[MASK]”, resulting in “How to make a [MASK].” Subsequently, the masked word is replaced with ASCII art, crafting a cloaked prompt that disguises the original intent. Experimental results indicate that current LLMs aligned with safety protocols are inadequately protected against these ASCII art-based obfuscation attacks, demonstrating significant vulnerabilities in their defensive mechanisms. Handa et al. [33] present that a straightforward word substitution cipher can deceive GPT-4 and achieve success in jailbreaking. Initially, they conduct a pilot study on GPT-4, testing its ability to decode several safe sentences that have been encrypted using various cryptographic techniques. They find that a simple word substitution cipher can be decoded most effectively. Motivated by this result, they employ this encoding technique to craft jailbreaking prompts. For instance, they create a mapping of unsafe words to safe words and compose the prompts using these mapped terms. Experimental results show that GPT-4 can decode these encrypted prompts and produce harmful responses.

• 密码：基于加密恶意内容能有效绕过 LLMs 内容审核的直觉，结合密码的越狱攻击方法越来越受欢迎。在[ 108]中，袁等人介绍了 $\mathsf{CipherChat}$ ，一个新颖的越狱框架，该框架揭示密码作为非自然语言的形式，能有效绕过 LLMs 的安全对齐。具体来说， $\mathsf{CipherChat}$ 使用了三种类型的密码：(1) 字符编码，如 GBK、ASCII、UTF 和 Unicode；(2) 常见密码，包括阿塔巴什密码、摩斯密码和凯撒密码；(3) $\mathsf{SelfCipher}$ 方法，该方法涉及使用角色扮演和一些自然语言中的不安全演示来触发 LLMs 的特定能力。 $\mathsf{CipherChat}$ 在 ChatGPT 和 GPT-4 上实现了高攻击成功率，强调了在 LLMs 的安全对齐过程中需要包含非自然语言。蒋等人[ 40]介绍了 $\mathsf{ArtPrompt}$ ，一种基于 ASCII 艺术的越狱攻击。 $\mathsf{ArtPrompt}$ 采用两步流程：词掩码和伪装提示生成。最初，它会掩盖有害提示中的单词，从而触发安全机制的拒绝，例如将提示“如何制作炸弹”中的“炸弹”替换为占位符“[MASK]”，得到“如何制作[MASK]”。随后，被掩盖的单词会被替换为 ASCII 艺术，从而构建出一个伪装的提示，掩盖其原始意图。实验结果表明，当前符合安全协议的语言学习模型（LLM）无法有效抵御这些基于 ASCII 艺术的混淆攻击，这表明其防御机制存在重大漏洞。Handa 等人[33]指出，简单的单词替换密码可以欺骗 GPT-4 并成功越狱。他们首先对 GPT-4 进行了一项试点研究，测试其解码使用各种加密技术加密的多个安全句子的能力。他们发现，简单的单词替换密码能够最有效地被解码。受此结果的启发，他们利用这种编码技术来构建越狱提示。例如，他们会建立不安全词与安全词之间的映射关系，并使用这些映射后的词语来编写提示语。实验结果表明，GPT-4 能够解码这些加密的提示并生成有害的回应。

Moreover, decomposing harmful content into seemingly innocuous questions and subsequently instructing the target model to reassemble and respond to the original harmful query represents a novel cipher technique. In this line of research, Liu et al. [55] propose a novel attack named $\mathsf{DAR}$ ( $\mathsf{Disguise}$ $\mathsf{and}$ $\mathsf{Reconstruction}$ ). $\mathsf{DAR}$ involves dissecting harmful prompts into individual characters and inserting them within a word puzzle query. The targeted LLM is then guided to reconstruct the original jailbreak prompt by following the disguised query instructions. Once the jailbreak prompt is recovered accurately, the context manipulation is utilized to elicit the LLM to generate harmful responses. Similar to $\mathsf{DAR}$ , Li et al. [51] also propose a decomposition and reconstruction attack framework named $\mathsf{DrAttack}$ . This attack method segments the jailbreak prompt into sub-prompts following semantic rules, and conceals them in benign contextual tasks, which can elicit the target LLM to follow the instructions and examples to recover the concealed harmful prompt and generate the corresponding responses. Besides, Chang et al. [13] develop $\mathsf{Puzzler}$ , which provides clues about the jailbreak objective by first querying LLMs about their defensive strategies, and then acquiring the offensive methods from LLMs. After that, $\mathsf{Puzzler}$ encourages LLMs to infer the true intent concealed within the fragmented information and generate malicious responses.
此外，将有害内容分解成看似无害的问题，并随后指示目标模型重新组装并响应原始的有害查询，代表了一种新颖的加密技术。在该研究方向上，刘等人[ 55]提出了一种名为 $\mathsf{DAR}$ （ $\mathsf{Disguise}$ $\mathsf{and}$ $\mathsf{Reconstruction}$ ）的新型攻击方法。 $\mathsf{DAR}$ 涉及将有害提示分解为单个字符，并将其插入到字谜查询中。然后引导目标 LLM 通过遵循伪装的查询指令来重建原始的越狱提示。一旦准确恢复越狱提示，就利用上下文操作来诱使 LLM 生成有害响应。类似于 $\mathsf{DAR}$ ，李等人[ 51]也提出了一种名为 $\mathsf{DrAttack}$ 的分解和重建攻击框架。这种攻击方法根据语义规则将越狱提示分割成子提示，并将它们隐藏在良性上下文任务中，这可以诱使目标 LLM 遵循指令和示例来恢复隐藏的有害提示并生成相应的响应。此外，Chang 等人[13]开发了 $\mathsf{Puzzler}$ ，通过首先查询 LLMs 的防御策略，然后从 LLMs 获取攻击方法，从而提供有关越狱目标的线索。之后， $\mathsf{Puzzler}$ 鼓励 LLMs 推断碎片化信息中隐藏的真实意图，并生成恶意响应。
•

Low-resource Languages: Given that safety mechanisms for LLMs primarily rely on English text datasets, prompts in low-resource, non-English languages may also effectively evade these safeguards. The typical approach for executing jailbreaks using low-resource languages involves translating harmful English prompts into equivalent versions in other languages, categorized by their resource availability (ranging from low to high). Given these intuitions, Deng et al. [21] propose multilingual jailbreak attacks, where they exploit Google Translate¹¹1https://translate.google.com.
• 低资源语言：鉴于 LLMs 的安全机制主要依赖于英文文本数据集，低资源非英文语言的提示也可能有效规避这些安全措施。使用低资源语言执行越狱的典型方法是将有害的英文提示翻译成其他语言中的等效版本，并根据资源可用性（从低到高）进行分类。基于这些直觉，Deng 等人[21]提出了多语言越狱攻击，他们利用 Google Translate ¹ to convert harmful English prompts into thirty other languages to jailbreak ChatGPT and GPT-4. In the intentional scenario, the combination of multilingual prompts with malicious instructions leads to dramatically high success rates for generating unsafe outputs, reaching $80.92\%$ on ChatGPT and $40.71\%$ on GPT-4. Yong et al. [106] conduct experiments using twelve non-English prompts to assess the robustness of GPT-4’s safety mechanisms. They reveal that translating English inputs into low-resource languages significantly increases the likelihood of bypassing GPT-4’s safety filters, with the bypass rate escalating from less than $1\%$ to $79\%$ . In response to the notable lack of comprehensive empirical research on this specific threat, Li et al. [49] conduct extensive empirical studies to explore multilingual jailbreak attacks. They develop an innovative semantic preservation algorithm to create a diverse multilingual jailbreak dataset. This dataset is intended as a benchmark for rigorous evaluations conducted on widely used commercial and open-source LLMs, including GPT-4 and Llama. The experimental results in [49] further reveal that multilingual jailbreaks pose significant threats to LLMs.
将有害的英语提示转换为三十种其他语言以越狱 ChatGPT 和 GPT-4。在有意场景中，多语言提示与恶意指令的结合导致生成不安全输出的成功率显著提高，达到 ChatGPT 的 $80.92\%$ 和 GPT-4 的 $40.71\%$ 。Yong 等人[ 106]使用十二种非英语提示进行实验，以评估 GPT-4 安全机制的鲁棒性。他们揭示将英语输入翻译成低资源语言会显著增加绕过 GPT-4 安全过滤器的可能性，绕过率从不到 $1\%$ 上升到 $79\%$ 。针对这一特定威胁的全面实证研究明显不足，Li 等人[ 49]进行广泛的实证研究，探索多语言越狱攻击。他们开发了一种创新的意义保留算法来创建多样化的多语言越狱数据集。该数据集旨在作为对广泛使用的商业和开源 LLMs（包括 GPT-4 和 Llama）进行严格评估的基准。[ 49]中的实验结果进一步揭示，多语言越狱对 LLMs 构成重大威胁。
•

Genetic Algorithm-based Attacks: Genetic-based methods typically exploit mutation and selection processes to dynamically explore and identify effective prompts. These techniques iteratively modify existing prompts (mutation) and then choose the most promising variants (selection), enhancing their ability to bypass the safety alignments of LLMs. Liu et al. [56] develop $\mathsf{AutoDAN}$ - $\mathsf{HGA}$ , a hierarchical Genetic Algorithm (GA) tailored for the automatic generation of stealthy jailbreak prompts against aligned LLMs. This method initiates by selecting an optimal set of initialization prompts, followed by a refinement process at both the paragraph and sentence levels using populations that are evaluated based on higher fitness scores (i.e., lower negative log-likelihood of the generated response). This approach not only automates the prompt crafting process but also effectively bypasses common perplexity-based defense mechanisms, enhancing both the stealthiness and efficacy of the attacks. Lapid et al. [46] introduce a novel universal black-box attack strategy utilizing a GA designed to disrupt the alignment of LLMs. This approach employs crossover and mutation techniques to iteratively update and optimize candidate jailbreak prompts. By systematically adjusting these prompts, the GA manipulates the model’s output to deviate from its intended safe and aligned responses, thereby revealing the model’s vulnerabilities to adversarial inputs. Yu et al. [107] develop $\mathsf{GPTFUZZER}$ , an automated framework designed to generate jailbreak prompts for testing LLMs. The framework integrates a seed selection strategy to optimize initial templates, mutation operators to ensure semantic consistency, and a judgment model to evaluate attack effectiveness. $\mathsf{GPTFUZZER}$ has proven highly effective in bypassing model defenses, demonstrating significant success across various LLMs under multiple attack scenarios. Li et al. [50] propose a genetic algorithm to generate new jailbreak prompts that are semantically similar to the original prompt. They initialize the population by substituting the words in original prompt randomly, and calculate the fitness based on the similarity and performance of each prompt. In the crossover step, the qualified prompts are transformed into other syntactic forms to generate offspring. If the new population retains a similarity with the previous generation for several rounds, the algorithm will terminate. In [88], Takemoto points out that the target LLMs can rewrite harmful prompts into benign expressions by themselves. The intuition is that since LLMs determine safeguard activation based on the content of the input prompts, it is thus reasonable that texts evading safeguards can be efficiently generated from the LLM. To achieve this purpose, an attacker can feed the following prompt [88] to transform the harmful queries:

• 基于遗传算法的攻击：基于遗传的方法通常利用变异和选择过程来动态探索和识别有效的提示。这些技术迭代地修改现有提示（变异），然后选择最有潜力的变体（选择），增强了它们绕过 LLMs 安全对齐的能力。Liu 等人[ 56]开发了 $\mathsf{AutoDAN}$ - $\mathsf{HGA}$ ，一种针对自动生成针对对齐 LLMs 的隐蔽越狱提示而设计的分层遗传算法（GA）。该方法首先选择一组最优的初始化提示，然后使用基于较高适应度分数（即生成响应的负对数似然较低）评估的种群，在段落和句子级别上进行细化。这种方法不仅自动化了提示制作过程，而且有效地绕过了常见的基于困惑度的防御机制，增强了攻击的隐蔽性和有效性。Lapid 等人[ 46]介绍了一种利用遗传算法设计的全新通用黑盒攻击策略，旨在破坏 LLMs 的对齐。这种方法采用交叉和变异技术来迭代更新和优化候选越狱提示。通过系统地调整这些提示，遗传算法操纵模型的输出使其偏离其预期的安全和对齐的响应，从而揭示模型在对抗性输入下的漏洞。Yu 等人[107]开发了 $\mathsf{GPTFUZZER}$ ，这是一个自动化的框架，旨在生成越狱提示以测试 LLMs。该框架集成了种子选择策略来优化初始模板，变异算子来确保语义一致性，以及一个判断模型来评估攻击效果。 $\mathsf{GPTFUZZER}$ 已被证明在绕过模型防御方面非常有效，在多种攻击场景下对各种 LLMs 都表现出显著的成功。Li 等人[50]提出了一种遗传算法来生成与原始提示语义相似的新越狱提示。他们通过随机替换原始提示中的词语来初始化种群，并根据每个提示的相似性和性能计算适应度。在交叉步骤中，合格的提示被转化为其他句法形式以生成后代。如果新种群在几轮内仍与上一代保持相似性，算法将终止。在[ 88]中，高桥指出目标 LLMs 可以自行将有害提示重写为无害表达。其原理是，由于 LLMs 根据输入提示的内容决定安全防护的激活，因此从 LLMs 中高效生成规避安全防护的文本是合理的。为实现此目的，攻击者可以向[ 88]中描述的以下提示中输入，以转换有害查询：

LLM-based Generation 基于 LLM 的生成

With a robust set of adversarial examples and high-quality feedback mechanisms, LLMs can be fine-tuned to simulate attackers, thereby enabling the efficient and automatic generation of adversarial prompts. Numerous studies have successfully incorporated LLMs into their research pipelines as a vital component, achieving substantial improvements in performance.
凭借一套强大的对抗样本和高质量的反馈机制，LLMs 可以被微调以模拟攻击者，从而实现对抗提示的高效和自动生成。大量研究已成功将 LLMs 纳入其研究流程中作为关键组成部分，显著提升了性能。

Some researchers adopt the approach of training a single LLM as the attacker with fine-tuning techniques or RLHF. For instance, Deng et al. [19] develop an LLM-based jailbreaking framework named $\mathsf{MASTERKEY}$ to automatically generate adversarial prompts designed to bypass security mechanisms. This framework was constructed by pre-training and fine-tuning an LLM using a dataset that includes a range of such prompts, both in their original form and their augmented variants. Inspired by time-based SQL injection, $\mathsf{MASTERKEY}$ leverages insights into internal defense strategies of LLMs, specifically targeting real-time semantic analysis and keyword detection defenses utilized by platforms like Bing Chat and Bard. Zeng et al. [109] discover a novel perspective to jailbreak LLMs by acting like human communicators. Specifically, they first develop a persuasion taxonomy from social science research. Then, the taxonomy will be applied to generate interpretable $\mathsf{Persuasive}$ $\mathsf{Adversarial}$ $\mathsf{Prompts}$ ( $\mathsf{PAPs}$ ) using various methods such as in-context prompting and fine-tuned paraphraser. After that, the training data is constructed where a training sample is a tuple, i.e., <a plain harmful query, a technique in the taxonomy, a corresponding persuasive adversarial prompt>. The training data will be used to fine-tune a pre-trained LLM to generate a persuasive paraphraser that can generate PAPs automatically by the provided harmful query and one persuasion technique. Shah et al. [76] utilize an LLM assistant to generate persona-modulation attack prompts automatically. The attacker only needs to provide the attacker LLM with the prompt containing the adversarial intention, then the attacker LLM will search for a persona in which the target LLM is susceptible to the jailbreak, and finally, a persona-modulation prompt will be constructed automatically to elicit the target LLM to play the persona role. Casper et al. [12] propose a red-teaming method without a pre-existing classifier. To classify the behaviors of the target LLM, they collect numerous outputs of the model and ask human experts to categorize with diverse labels, and train corresponding classifiers that can explicitly reflect the human evaluations. Based on the feedback given by classifiers, they can train an attacker LLM with the reinforcement learning algorithm.
一些研究人员采用训练单个 LLM 作为攻击者的方法，使用微调技术或 RLHF。例如，邓等人[19]开发了一个名为 $\mathsf{MASTERKEY}$ 的基于 LLM 的越狱框架，用于自动生成旨在绕过安全机制的对立提示。该框架通过使用包含各种此类提示（包括其原始形式和增强变体）的数据集来预训练和微调 LLM 而构建。受基于时间的 SQL 注入启发， $\mathsf{MASTERKEY}$ 利用对 LLM 内部防御策略的理解，特别是针对像 Bing Chat 和 Bard 这样的平台使用的实时语义分析和关键词检测防御。曾等人[109]通过扮演人类沟通者的方式，发现了一种越狱 LLM 的新视角。具体来说，他们首先从社会科学研究中开发了一个说服分类法。然后，该分类法将应用于使用上下文提示和微调的释义器等多种方法来生成可解释的 $\mathsf{Persuasive}$ $\mathsf{Adversarial}$ $\mathsf{Prompts}$ （ $\mathsf{PAPs}$ ）。之后，构建训练数据，其中训练样本是一个元组，即<一个普通的恶意查询，分类体系中的一个技术，一个相应的说服性对抗提示>。训练数据将用于微调预训练的 LLM，以生成一个能够根据提供的恶意查询和一个说服技术自动生成 PAPs 的说服性释义器。Shah 等人[76]利用一个 LLM 助手自动生成角色调整攻击提示。攻击者只需向攻击者 LLM 提供包含对抗意图的提示，然后攻击者 LLM 将搜索一个目标 LLM 容易受到越狱攻击的角色，最后自动构建一个角色调整提示来诱使目标 LLM 扮演该角色角色。Casper 等人[12]提出了一种无需预分类器的红队方法。为了分类目标 LLM 的行为，他们收集了模型的大量输出，并要求人类专家使用不同的标签进行分类，并训练相应的分类器，这些分类器可以明确反映人类评估。根据分类器提供的反馈他们可以使用强化学习算法训练一个攻击者的 LLM。

Another strategy is to have multiple LLMs collaborate to form a framework, in which every LLMs serve as a different agent and can be optimized systematically. Chao et al. [15] propose $\mathsf{Prompt}$ $\mathsf{Automatic}$ $\mathsf{Iterative}$ $\mathsf{Refinement}$ ( $\mathsf{PAIR}$ ) to generate jailbreak prompts with only black-box access to the target LLM. Concretely, $\mathsf{PAIR}$ uses an attacker LLM to iteratively update the jailbreak prompt against the target LLM by querying the target LLM and refining the prompt. Jin et al. [41] design a multi-agent system to generate jailbreak prompts automatically. In the system, LLMs serve as different roles including generator, translator, evaluator, and optimizer. For instance, the generator is responsible for crafting initial jailbreak prompts based on previous jailbreak examples, then the translator and evaluator examine the responses of the target LLM, and finally the optimizer analyzes the effectiveness of the jailbreak and gives feedback to the generator. Ge et al. [27] propose a red teaming framework to integrate jailbreak attack with safety alignment and optimize them together. In the framework, an adversarial LLM will generate harmful prompts to jailbreak the target LLM. While the adversarial LLM optimizes the generation based on the feedback of target LLM, the target LLM also enhances the robustness through being fine-tuned upon the adversarial prompts, and the interplay continues iteratively until both LLMs achieve expected performance. Tian et al. [91] propose $\mathsf{Evil}$ $\mathsf{Geniuses}$ to automatically generate jailbreak prompts against LLM-based agents using the Red-Blue exercise. They discover that, compared to LLMs, the agents are less robust and more prone to conduct harmful behaviors.
另一种策略是让多个 LLM 协作形成一个框架，其中每个 LLM 都充当不同的代理，并可以进行系统化优化。Chao 等人[15]提出了 $\mathsf{Prompt}$ $\mathsf{Automatic}$ $\mathsf{Iterative}$ $\mathsf{Refinement}$ ( $\mathsf{PAIR}$ )来生成针对目标 LLM 的越狱提示，仅通过黑盒方式访问目标 LLM。具体来说， $\mathsf{PAIR}$ 使用攻击者 LLM 通过查询目标 LLM 并优化提示来迭代更新针对目标 LLM 的越狱提示。Jin 等人[41]设计了一个多代理系统来自动生成越狱提示。在该系统中，LLM 扮演不同的角色，包括生成器、翻译器、评估器和优化器。例如，生成器负责根据先前的越狱示例创建初始越狱提示，然后翻译器和评估器检查目标 LLM 的响应，最后优化器分析越狱的有效性并向生成器提供反馈。Ge 等人[27]提出了一种红队框架，将越狱攻击与安全对齐相结合，并一起优化。在该框架中，一个对抗性 LLM 将生成有害提示来越狱目标 LLM。在对抗性 LLM 根据目标 LLM 的反馈优化生成内容的同时，目标 LLM 也通过在对抗性提示上进行微调来增强鲁棒性，这种相互作用持续迭代，直到两个 LLM 都达到预期性能。Tian 等人[91]提出了 $\mathsf{Evil}$ $\mathsf{Geniuses}$ ，利用红蓝演练自动生成针对基于 LLM 的代理的越狱提示。他们发现，与 LLM 相比，代理的鲁棒性较低，更容易执行有害行为。

We note that techniques based on LLMs are increasingly being integrated with other methods to enhance jailbreak attacks. For example, an LLM can be programmed to generate templates for scenario nesting attacks, which involve embedding malicious payloads within benign contexts. Additionally, LLMs can assist in the perturbation operation, a critical step in genetic algorithm-based attacks, where slight modifications are algorithmically generated to test system vulnerabilities. Liu et al. [54] divide an adversarial prompt into three elements: goal, content, and template, and construct plenty of content and templates manually with different attack goals. Later, a LLM generator will randomly combine the content and templates to produce hybrid prompts, which are then estimated by the LLM evaluator to judge their effectiveness. Mehrotra et al. [64] propose a novel method called $\mathsf{Tree}$ $\mathsf{of}$ $\mathsf{Attacks}$ $\mathsf{with}$ $\mathsf{Pruning}$ ( $\mathsf{TAP}$ ). Starting from seed prompts, $\mathsf{TAP}$ will generate improved prompts and discard the inferior ones. The reserved prompts are then inputted into the target LLMs to estimate their effectiveness. If a jailbreak turns out to be successful, the corresponding prompt will be returned as seed prompts for the next iteration.
我们注意到基于 LLMs 的技术正越来越多地与其他方法结合，以增强越狱攻击。例如，一个 LLM 可以被编程生成场景嵌套攻击的模板，这种攻击涉及将恶意载荷嵌入良性环境中。此外，LLMs 可以协助扰动操作，这是基于遗传算法攻击的一个关键步骤，其中算法会生成轻微的修改来测试系统漏洞。Liu 等人[54]将对抗性提示分为三个元素：目标、内容和模板，并手动构建大量具有不同攻击目标的内容和模板。随后，一个 LLM 生成器会随机组合内容和模板，生成混合提示，这些提示再由 LLM 评估器估计其有效性。Mehrotra 等人[64]提出了一种名为 $\mathsf{Tree}$ $\mathsf{of}$ $\mathsf{Attacks}$ $\mathsf{with}$ $\mathsf{Pruning}$ ( $\mathsf{TAP}$ )的新方法。从种子提示开始， $\mathsf{TAP}$ 会生成改进的提示并丢弃较差的提示。保留的提示随后输入到目标 LLMs 中，以估计其有效性。如果越狱攻击成功，相应的提示将作为种子提示返回给下一次迭代。

Defense Methods 防御方法

{forest}

forked edges, for tree= grow=east, reversed=true, anchor=base west, parent anchor=east, child anchor=west, node options=align=center, align = center, base=left, font=, rectangle, draw=hidden-draw, rounded corners, edge+=darkgray, line width=1pt, s sep=3pt, inner xsep=2pt, inner ysep=3pt, ver/.style=rotate=90, child anchor=north, parent anchor=south, anchor=center, , where level=1text width=5.0em,font=, where level=2text width=5.6em,font=, where level=3text width=6.8em,font=, [ Jailbreak Defense Methods, ver [ Prompt Level [ Prompt Detection [ [37] [1] , leaf, text width=3em ] ] [ Prompt
分叉边，对于树= grow=east, reversed=true, anchor=base west, parent anchor=east, child anchor=west, 节点选项=align=center, align = center, base=left, font=, 矩形, draw=hidden-draw, 圆角, 边+=darkgray, 线宽=1pt, s sep=3pt, inner xsep=2pt, inner ysep=3pt, ver/.style=rotate=90, child anchor=north, parent anchor=south, anchor=center, , 其中 level=1text width=5.0em,font=, 其中 level=2text width=5.6em,font=, 其中 level=3text width=6.8em,font=, [ Jailbreak 防御方法, ver [ 提示级别 [ 提示检测 [ [ 37][ 1] , 叶子, text width=3em ] ] [ 提示
Perturbation [ [11] [73] [38] [112] [45] [121] , leaf, text width=10.5em ] ] [ System Prompt
扰动 [ [ 11][ 73][ 38][ 112][ 45][ 121] , 叶子, text width=10.5em ] ] [ 系统提示
Safeguard [ [77] [126] [94] [118] , leaf, text width=7.5em ] ] ] [ Model Level [ SFT-based [ [9] [18] [8] , leaf, text width=4.5em ] ] [ RLHF-based [ [66] [6] [83] [25] [59] [26] [58] , leaf, text width=11em ] ] [ Gradient and Logit
保护措施 [ [ 77][ 126][ 94][ 118] , 叶子, text width=7.5em ] ] ] [ 模型级别 [ 基于 SFT 的 [ [ 9][ 18][ 8] , 叶子, text width=4.5em ] ] [ 基于 RLHF 的 [ [ 66][ 6][ 83][ 25][ 59][ 26][ 58] , 叶子, text width=11em ] ] [ 梯度和逻辑
Analysis [ [101] [102] [35] [53] , leaf, text width=6.5em ] ] [ Refinement [ [44] [113] , leaf, text width=4em ] ] [ Proxy Defense [ [110] [85] , leaf, text width=4em ] ] ] ]
分析 [ [ 101][ 102][ 35][ 53] , 叶, 文本宽度=6.5em ] ] [ 精炼 [ [ 44][ 113] , 叶, 文本宽度=4em ] ] [ 代理防御 [ [ 110][ 85] , 叶, 文本宽度=4em ] ] ] ]

Figure 8: Taxonomy of jailbreak defense.
图 8：越狱防御的分类。

With the development of LLM jailbreak techniques, concerns regarding model ethics and substantial threats in proprietary models like ChatGPT and open-source models like Llama have gained more attention, and various defense methods have been proposed to protect the language model from potential attacks. A taxonomy of the methods is illustrated in Figure 8. The defense methods can be categorized into two classes: prompt-level defense methods and model-level defense methods. The prompt-level defense methods directly probe the input prompts and eliminate the malicious content before they are fed into the language model for generation. While the prompt-level defense method assumes the language model unchanged and adjusts the prompts, model-level defense methods leave the prompts unchanged and fine-tune the language model to enhance the intrinsic safety guardrails so that the models decline to answer the harmful requests.
随着 LLM 越狱技术的开发，关于模型伦理以及 ChatGPT 等专有模型和 Llama 等开源模型中的重大威胁的关注度日益增加，并提出了各种防御方法来保护语言模型免受潜在攻击。图 8 展示了这些方法的一个分类。防御方法可以分为两类：提示级防御方法和模型级防御方法。提示级防御方法直接探测输入提示，并在它们被输入语言模型进行生成之前消除恶意内容。而提示级防御方法假设语言模型不变并调整提示，模型级防御方法则保持提示不变并微调语言模型以增强内在的安全护栏，使模型拒绝回答有害请求。

Prompt-level Defenses 提示级防御

Prompt-level defenses refer to the scenarios where the direct access to neither the internal model weight nor the output logits is available, thus the prompt becomes the only variable both the attackers and defenders can control. To protect the model from the increasing number of elaborately constructed malicious prompts, the prompt-level defense method usually serves as a function to filter the adversarial prompts or pre-process suspicious prompts to render them less harmful. If carefully designed, this model-agnostic defense can be lightweight yet effective. Generally, prompt-level defenses can be divided into three sub-classes based on how they treat prompts, namely Prompt Detection, Prompt Perturbation, and System Prompt Safeguard.
提示级别的防御指的是在无法直接访问内部模型权重或输出 logits 的情况下，提示成为攻击者和防御者唯一可以控制的变量。为了保护模型免受日益增多精心构造的恶意提示的攻击，提示级别的防御方法通常作为一个功能来过滤对抗性提示或预处理可疑提示，使其危害性降低。如果设计得当，这种与模型无关的防御可以轻量级且有效。通常，提示级别的防御可以根据它们如何处理提示分为三个子类别，即提示检测、提示扰动和系统提示保护。

Prompt Detection 提示检测

For proprietary models like ChatGPT or Claude, the model vendors usually maintain a data moderation system like Llama-guard [90] or conduct reinforcement-learning-based fine-tuning [66] to enhance the safety guardrails and ensure the user prompts may not violate the safety policy. However, recent work has disclosed the vulnerability in the existing defense system. Zou et al. [125] append an incoherent suffix to the malicious prompts, which increases the model’s perplexity of the prompt and successfully bypasses the safety guardrails.
对于 ChatGPT 或 Claude 等专有模型，模型供应商通常会维护类似 Llama-guard[90]的数据审核系统，或采用基于强化学习的微调[66]来增强安全防护措施，并确保用户提示不会违反安全政策。然而，近期的研究揭示了现有防御系统的漏洞。Zou 等人[125]在恶意提示中附加一个不连贯的后缀，这增加了模型对提示的困惑度，并成功绕过了安全防护措施。

To fill the gap, Jain et al. [37] consider a threshold-based detection that computes the perplexity of both the text segments and the entire prompt in the context window, and declares the harmfulness if the perplexity exceeds a certain threshold. Note that a similar work is $\mathsf{LightGBM}$ [1], which first calculates the perplexity of the prompts and trains a classifier based on the perplexity and sequence length to detect the harmfulness of the prompt.
为了弥补这一空白，Jain 等人[37]提出了一种基于阈值的检测方法，该方法计算上下文窗口中文本片段和整个提示的困惑度，并在困惑度超过特定阈值时判定为有害。请注意，类似的工作有 $\mathsf{LightGBM}$ [1]，该方法首先计算提示的困惑度，并基于困惑度和序列长度训练分类器来检测提示的有害性。

Prompt Perturbation 提示扰动

Despite the improved accuracy in detecting malicious inputs, prompt detection methods have the side-effect of a high false positive rate which may influence the response quality of the questions that should have been treated as benign inputs. Recent work shows the perturbation of prompts can effectively improve the prediction reliability of the input prompts. Cao et al. [11] propose $\mathsf{RA}$ - $\mathsf{LLM}$ that randomly puts word-level masks on the copies of the original prompt, and considers the original prompt malicious if LLM rejects a certain ratio of the processed copies. Robey et al. [73] introduce $\mathsf{SmoothLLM}$ to apply character-level perturbation to the copies of a given prompt. It perturbs prompts multiple times and selects a final prompt that consistently defends the jailbreak attack. Ji et al. [38] propose a similar method as [73], except that they perturb the original prompt with semantic transformations. Zhang et al. [112] propose $\mathsf{JailGuard}$ , supporting jailbreak detection in image and text modalities. Concretely, $\mathsf{JailGuard}$ introduces multiple perturbations to the query and observes the consistency of the corresponding outputs. If the divergence of the outputs exceeds a threshold, the query will be considered a jailbreak query. Kumar et al. [45] propose a more fine-grained defense framework called $\mathsf{erase}$ - $\mathsf{and}$ - $\mathsf{check}$ . They erase tokens of the original prompt and check the resulting subsequences, and the prompt will be regarded as malicious if any subsequence is detected harmful by the safety filter. Moreover, they further explore how to erase tokens more efficiently and introduce different rule-based methods including randomized, greedy, and gradient-based $\mathsf{erase}$ - $\mathsf{and}$ - $\mathsf{check}$ .
尽管在检测恶意输入方面的准确性有所提高，但提示检测方法存在高误报率的副作用，这可能会影响本应被视为良性输入的问题的响应质量。近期研究表明，对提示进行扰动可以有效提高输入提示的预测可靠性。曹等人[11]提出了 $\mathsf{RA}$ - $\mathsf{LLM}$ ，该方法随机在原始提示的副本上添加词级掩码，如果 LLM 拒绝了一定比例的处理后副本，则将原始提示视为恶意。罗伯等人[73]引入了 $\mathsf{SmoothLLM}$ ，用于对给定提示的副本应用字符级扰动。该方法多次扰动提示，并选择一个始终能够防御越狱攻击的最终提示。季等人[38]提出了一种与[73]类似的方法，不同之处在于他们使用语义转换来扰动原始提示。张等人[112]提出了 $\mathsf{JailGuard}$ ，支持图像和文本模态的越狱检测。具体来说， $\mathsf{JailGuard}$ 对查询引入多个扰动，并观察相应输出的一致性。如果输出结果的差异超过阈值，该查询将被视为越狱查询。Kumar 等人[45]提出了一种更细粒度的防御框架，称为 $\mathsf{erase}$ - $\mathsf{and}$ - $\mathsf{check}$ 。他们擦除原始提示中的标记并检查生成的子序列，如果任何子序列被安全过滤器检测为有害，则该提示将被视为恶意。此外，他们进一步探索了如何更高效地擦除标记，并引入了不同的基于规则的 $\mathsf{erase}$ - $\mathsf{and}$ - $\mathsf{check}$ 方法，包括随机化、贪婪和基于梯度的方法。

While the above works focus on various transformations to the original prompt and generate the final response corresponding to aggregation of the outputs, another line of works introduces an alternative approach that appends a defense prefix or suffix to the prompt. For instance, Zhou et al. [121] propose a robust prompt optimization algorithm to construct such suffixes. They select representative adversarial prompts to build a dataset and then optimize the suffixes on it based on the gradient, and the defense strategy turns out to be efficient for both manual jailbreak attacks and gradient-based attacks like $\mathsf{GCG}$ .
虽然上述工作主要关注对原始提示的各种变换，并生成与输出聚合相对应的最终响应，但另一系列工作引入了一种替代方法，即向提示中附加防御前缀或后缀。例如，Zhou 等人[121]提出了一种鲁棒的提示优化算法来构建这样的后缀。他们选择具有代表性的对抗性提示来构建数据集，然后基于梯度在该数据集上优化后缀，结果表明这种防御策略对手动越狱攻击和基于梯度的攻击（如 $\mathsf{GCG}$ ）都有效。

System Prompt Safeguard 系统提示保护

The system prompts built-in LLMs guide the behavior, tone, and style of responses, ensuring consistency and appropriateness of model responses. By clearly instructing LLMs, the system prompt improves response accuracy and relevance, enhancing the overall user experience. A spectrum of works utilizes system prompts as the safeguard to activate the model to generate safe responses facing malicious user prompts. Sharma et al. [77] introduce a domain-specific diagram $\mathsf{SPML}$ to create powerful system prompts. During the compilation pipeline of $\mathsf{SPML}$ , system prompts are processed in several procedures like type-checking and intermediate representation transformation, and finally, robust system prompts are generated to deal with various conversation scenarios. Zou et al. [126] explore the effectiveness of system prompt against jailbreak and propose $\mathsf{SMEA}$ to generate system prompt. Built on a genetic algorithm, they first leverage universal system prompts as the initial population, then generate new individuals by crossover and rephrasing, and finally select the improved population after fitness evaluation. Wang et al. [94] integrate a secret prompt into the system prompt to defend against fine-tuning-based jailbreaks. Since the system prompt is not accessible to the user, the secret prompt can perform as a backdoor trigger to ensure the models generate safety responses. Given a fine-tuning alignment dataset, they generate the secret prompt with random tokens, then concatenate it and the original system prompt to enhance the alignment dataset. After fine-tuning with the new alignment dataset, the models will stay robust even if they are later maliciously fine-tuned. Zheng et al. [118] take a deep dive into the intrinsic mechanism of safety system prompt. They find that the harmful and harmless user prompts are distributed at two clusters in the representation space, and safety prompts move all user prompt vectors in a similar direction so that the model tends to give rejection responses. Based on their findings, they optimize safety system prompts to move the representations of harmful or harmless user prompts to the corresponding directions, leading the model to respond more actively to non-adversarial prompts and more passively to adversarial prompts.
内置 LLMs 的系统提示指导着回答的行为、语气和风格，确保模型回答的一致性和适当性。通过明确指示 LLMs，系统提示提高了回答的准确性和相关性，增强了整体用户体验。一系列工作利用系统提示作为保护措施，激活模型以面对恶意用户提示时生成安全回答。Sharma 等人[ 77]介绍了一种特定领域的图表 $\mathsf{SPML}$ 来创建强大的系统提示。在 $\mathsf{SPML}$ 的编译管道中，系统提示经过类型检查和中间表示转换等多个步骤的处理，最终生成强大的系统提示以应对各种对话场景。Zou 等人[ 126]探索了系统提示对越狱的有效性，并提出了 $\mathsf{SMEA}$ 来生成系统提示。基于遗传算法，他们首先利用通用系统提示作为初始种群，然后通过交叉和改写生成新的个体，最后在适应度评估后选择改进的种群。王等人[94]将秘密提示整合到系统提示中，以防御基于微调的越狱攻击。由于系统提示对用户不可访问，秘密提示可以作为后门触发器，确保模型生成安全响应。给定一个微调对齐数据集，他们使用随机标记生成秘密提示，然后将它与原始系统提示连接起来，以增强对齐数据集。在对新对齐数据集进行微调后，即使模型后来被恶意微调，它们仍将保持鲁棒性。郑等人[118]深入研究了安全系统提示的内在机制。他们发现有害和无害的用户提示在表示空间中分布在两个簇中，安全提示将所有用户提示向量向相似方向移动，使模型倾向于给出拒绝响应。基于他们的发现，他们优化安全系统提示，将有害或无害用户提示的表示移动到相应的方向，引导模型更积极地响应非对抗性提示，更被动地响应对抗性提示。

Model-level Defenses 模型级防御

For a more flexible case in which defenders can access and modify the model weights, model-level defense helps the safety guardrail to generalize better. Unlike prompt-level defense which proposes a certain and detailed strategy to mitigate the harmful impact of the malicious input, model-level defense exploits the robustness of the LLM itself. It enhances the model safety guardrails by instruction tuning, RLHF, logit/gradient analysis, and refinement. Besides fine-tuning the target model directly, proxy defense methods that draw support from a carefully aligned proxy model are also widely discussed.
在防御者可以访问和修改模型权重的更灵活情况下，模型级防御有助于安全护栏更好地泛化。与提出特定详细策略以减轻恶意输入有害影响的提示级防御不同，模型级防御利用 LLM 自身的鲁棒性。它通过指令微调、RLHF、logit/梯度分析和改进来增强模型安全护栏。除了直接微调目标模型外，还广泛讨论了借助精心校准的代理模型提供支持的代理防御方法。

SFT-based Methods 基于 SFT 的方法

Supervised Fine-Tuning (SFT) is an important method for enhancing the instruction-following ability of LLMs, which is a crucial part of establishing safety alignment as well [92]. Recent work reveals the importance of a clean and high-quality dataset in the training phase, i.e., models fine-tuned with a comprehensive and refined safety dataset show their superior robustness [92]. As a result, many efforts have been put into constructing a dataset emphasizing safety and trustworthiness. Bianchi et al. [9] discuss how the mixture of safety data (i.e. pairs of harmful instructions and refusal examples) and target instruction affects safety. For one thing, they show fine-tuning with the mixture of Alpaca [89] and safety data can improve the model safety. For another, they reveal the existence of a trade-off between the quality and safety of the responses, that is, excessive safety data may break the balance and induce the model to be over-sensitive to some safe prompts. Deng et al. [18] discover the possibility of constructing a safety dataset from the adversarial prompts. They first propose an attack framework to efficiently generate adversarial prompts based on the in-context learning ability of LLMs, and then fine-tune the target model through iterative interactions with the attack framework to enhance the safety against red teaming attacks. Similarly, Bhardwaj et al. [8] leverage $\mathsf{Chain}$ $\mathsf{of}$ $\mathsf{Utterances}$ ( $\mathsf{CoU}$ ) to construct the safety dataset that covers a wide range of harmful conversations generated from ChatGPT. After being fine-tuned with the dataset, LLMs like Vicuna-7B [119] can perform well on safety benchmarks while preserving the response quality.
监督微调（SFT）是增强 LLMs 指令跟随能力的重要方法，这也是建立安全对齐的关键部分[92]。近期的研究揭示了训练阶段中干净且高质量数据集的重要性，即使用全面且精细的安全数据集微调的模型表现出更强的鲁棒性[92]。因此，许多工作都致力于构建一个强调安全性和可信度的数据集。Bianchi 等人[9]讨论了安全数据（即有害指令和拒绝示例的对）与目标指令的混合如何影响安全性。一方面，他们展示了使用 Alpaca[89]和安全数据混合进行微调可以提高模型的安全性。另一方面，他们揭示了响应质量和安全之间的权衡，即过多的安全数据可能会打破平衡并使模型对某些安全提示过于敏感。Deng 等人[18]发现了从对抗性提示中构建安全数据集的可能性。他们首先提出了一种攻击框架，基于 LLMs 的情境学习能力高效生成对抗性提示，然后通过迭代与攻击框架的交互微调目标模型，以增强对红队攻击的安全性。类似地，Bhardwaj 等人[8]利用 $\mathsf{Chain}$ $\mathsf{of}$ $\mathsf{Utterances}$ ( $\mathsf{CoU}$ )构建了涵盖 ChatGPT 生成的广泛有害对话的安全数据集。在用该数据集微调后，像 Vicuna-7B[119]这样的 LLMs 可以在保持响应质量的同时，在安全基准测试中表现良好。

RLHF-based Methods 基于 RLHF 的方法

Reinforcement Learning from Human Feedback (RLHF) is a traditional model training procedure applied to a well-pre-trained language model to further align model behavior with human preferences and instructions [66]. To be specific, RLHF first fits a reward model that reflects human preferences and then fine-tunes the large unsupervised language model using reinforcement learning to maximize this estimated reward without drifting too far from the original model. The effectiveness of RLHF in safety alignment has been proved by lots of promising LLMs such as GPT-4 [65], Llama [92], and Claude [4]. On the one hand, high-quality human preference datasets lie in the key point of successful training, whereby human annotators select which of two model outputs they prefer [6, 26, 58, 39]. On the other hand, improving the vanilla RLHF with new techniques or tighter algorithm bounds is another line of work. Bai et al. [6] introduce an online version of RLHF that collects preference data while training the language model synchronously. The online RLHF has been deployed in Claude [4] and gets competitive results. Siththaranjan et al. [83] reveal that the hidden context of incomplete data (e.g. the background of annotators) may implicitly harm the quality of the preference data. Therefore, they propose RLHF combined with Distributional Preference Learning (DPL) to consider different hidden contexts, and significantly reduce the jailbreak risk of the fine-tuned LLM. While RLHF is a complex and often unstable procedure, recent work proposes Direct Preference Optimization (DPO) [70] as a substitute. As a more stable and lightweight method, enhancing the safety of LLMs with DPO is becoming more popular [25, 59].
人类反馈强化学习（RLHF）是一种传统的模型训练流程，应用于预训练良好的语言模型，以进一步使模型行为与人类偏好和指令保持一致[66]。具体来说，RLHF 首先拟合一个反映人类偏好的奖励模型，然后使用强化学习对大型无监督语言模型进行微调，以在不偏离原始模型太远的情况下最大化估计的奖励。RLHF 在安全对齐方面的有效性已被许多有前景的 LLMs（如 GPT-4[65]、Llama[92]和 Claude[4]）所证明。一方面，高质量的人类偏好数据集是成功训练的关键点，其中人类标注者选择他们更倾向于两个模型输出中的哪一个[6, 26, 58, 39]。另一方面，通过新技术或更严格的算法界限改进原味 RLHF 是另一条研究路线。Bai 等人[6]介绍了一种在线版本的 RLHF，它在训练语言模型的同时同步收集偏好数据。在线 RLHF 已在 Claude[4]中得到部署，并取得了具有竞争力的结果。 Siththaranjan 等人[ 83]揭示，不完整数据的隐藏上下文（例如标注者的背景）可能会隐含地损害偏好数据的质量。因此，他们提出将 RLHF 与分布偏好学习（DPL）相结合，以考虑不同的隐藏上下文，并显著降低微调 LLM 的越狱风险。虽然 RLHF 是一个复杂且往往不稳定的流程，近期的研究提出了直接偏好优化（DPO）[ 70]作为替代方案。作为一种更稳定且轻量级的方法，使用 DPO 增强 LLM 的安全性正变得越来越流行[ 25, 59]。

Gradient and Logit Analysis
梯度与逻辑分析

Since the logits and gradients retrieved in the forward pass can contain fruitful information about the beliefs and judgments of the input prompts, which can be useful for model defense, defenders can analyze and manipulate the logits and gradients to detect potential jailbreak threats and propose corresponding defenses.
由于在正向传递中检索到的 logits 和梯度包含了关于输入提示的信念和判断的有价值信息，这些信息可用于模型防御，因此防御者可以分析和操纵 logits 和梯度，以检测潜在的越狱威胁并提出相应的防御措施。

Gradient Analysis. Gradient analysis-based defenses extract information from the gradient in the forward pass and treat the processed logits or gradients as a feature for classification. Xie et al. [101] compare the similarity between safety-critical parameters and gradients. Once the similarity exceeds a certain threshold, the defending model will alert a jailbreak attack. Hu et al. [35] first define a refusal loss which indicates the likelihood of generating a normal response and notice that there is a difference between the refusal loss obtained by malicious prompts and normal prompts. Based on this discovery, they further propose $\mathsf{Gradient}$ $\mathsf{Cuff}$ to identify jailbreak attacks by computing the gradient norm and other characteristics of refusal loss.
梯度分析。基于梯度的防御从正向传递中的梯度中提取信息，并将处理后的 logits 或梯度作为分类的特征。Xie 等人[101]比较了安全关键参数与梯度之间的相似性。一旦相似性超过某个阈值，防御模型就会警告存在越狱攻击。Hu 等人[35]首先定义了一个拒绝损失，该损失表示生成正常响应的可能性，并注意到恶意提示获得的拒绝损失与正常提示获得的拒绝损失之间存在差异。基于这一发现，他们进一步提出了 $\mathsf{Gradient}$ $\mathsf{Cuff}$ 来通过计算拒绝损失的梯度范数和其他特征来识别越狱攻击。

Logit Analysis. Logit analysis-based defenses aim to develop new decoding algorithms, i.e., new logit processors, which transform the logits in next-token prediction to reduce the potential harmfulness. For instance, Xu et al. [102] mix the output logits of the target model and safety-aligned model to obtain a new logits distribution, in which the probability density of harmful and benign tokens are attenuated and amplified, respectively. Li et al. [53] add a safety heuristic in beam search, which evaluates the harmfulness of the candidates in one round and selects the one with the lowest harmful score.
Logit 分析。基于 Logit 分析的防御方法旨在开发新的解码算法，即新的 Logit 处理器，这些处理器将下一个词预测中的 Logit 转换为降低潜在危害性。例如，Xu 等人[102]将目标模型和安全对齐模型的输出 Logit 混合，以获得一个新的 Logit 分布，其中有害标记和良性标记的概率密度分别被减弱和增强。Li 等人[53]在集束搜索中添加了一个安全启发式算法，该算法在一轮中评估候选词的危害性，并选择危害评分最低的那个。

Table 2: Overview of evaluation datasets.
表 2：评估数据集概述。

Benchmark Name 基准名称	Languages 语言	Size 大小	Safety Dimensions 安全维度	Composition 组合
$\mathsf{XSTEST}$ [74]	English 英文	450	10	Safe questions and unsafe questions 安全问题和不安全问题
$\mathsf{AdvBench}$ [125]	English 英文	1000	8	Harmful strings and harmful behaviors 有害字符串和有害行为
$\mathsf{SafeBench}$ [30]	English 英文	500	10	Unsafe questions 不安全的提问
$\mathsf{Do}$ - $\mathsf{Not}$ - $\mathsf{Answer}$ [96]	English 英文	939	5	Harmful instructions 有害指令
$\mathsf{TechHazardQA}$ [7]	English 英文	1850	7	I nstruction-centric questions 指令中心型问题
$\mathsf{SC}$ - $\mathsf{Safety}$ [86]	Chinese 中文	4912	20+	Multi-round conversations 多轮对话
$\mathsf{LatentJailbreak}$ [69]	Chinese English	416	3	Translation tasks 翻译任务
$\mathsf{SafetyBench}$ [115]	Chinese English 中文英语	11435	7	Multiple choice questions 选择题
$\mathsf{StrongREJECT}$ [84]	English 英文	346	6	Unsafe questions 不安全的提问
$\mathsf{AttackEval}$ [80]	English 英文	390	13	Unsafe questions 不安全的提问
$\mathsf{HarmBench}$ [63]	English 英文	510	18	Harmful behaviors 有害行为
$\mathsf{Safety}$ - $\mathsf{Prompts}$ [86]	Chinese 中文	100000	14	Harmful behaviors 有害行为
$\mathsf{JailbreakBench}$ [14]	English 英文	200	10	Harmful behaviors and benign behaviors 有害行为和良性行为
$\mathsf{DoAnythingNow}$ [79]	English 英文	107250	13	Forbidden questions 禁止性问题

Refinement Methods 精炼方法

The refinement methods exploit the self-correction ability of LLM to reduce the risk of generating illegal responses. As evidenced in $\mathsf{RLAIF}$ [87], LLMs can be “aware” that their outputs are inappropriate given an adversarial prompt. Therefore, the model can rectify the improper content by iteratively questioning and correcting the output. Kim et al. [44] validate the effectiveness of naive self-refinement methods on non-aligned LLM. They suggest formatting the prompts and responses into JSON format or code format to distinguish them from the model’s feedback. Zhang et al. [113] propose a specific target the model should achieve during the self-refinement to make the refinement more effective. To be specific, they utilize the language model to analyze user prompts in essential aspects like ethics and legality and gather the intermediate responses from the model that reflect the intention of the prompts. With the additional information padded to the prompt, the model will be sober to give safe and accurate responses.
精炼方法利用 LLM 的自我纠错能力来降低生成非法响应的风险。正如 $\mathsf{RLAIF}$ [ 87]所示，LLMs 在给定对抗性提示时能够“意识到”其输出不恰当。因此，模型可以通过迭代质询和纠正输出来修正不当内容。Kim 等人[ 44]验证了朴素自我精炼方法在非对齐 LLM 上的有效性。他们建议将提示和响应格式化为 JSON 格式或代码格式，以区别于模型的反馈。Zhang 等人[ 113]提出模型在自我精炼过程中应达到的具体目标，以提高精炼效果。具体来说，他们利用语言模型从伦理和法律等关键方面分析用户提示，并收集模型中反映提示意图的中级响应。在将额外信息添加到提示后，模型将更加清醒地给出安全准确的响应。

Proxy Defense 代理防御

In brief, the proxy defenses move the security duties to another guardrail model. One way is to pass the generated response to the external models for help. Meta team [90] propose $\mathsf{LlamaGuard}$ for classifying content in both language model inputs (prompt classification) and responses (response classification), which can be directly used for proxy defense. Zeng et al. [110] design a multi-agent defense framework named $\mathsf{AutoDefense}$ . $\mathsf{AutoDefense}$ consists of agents responsible for the intention analyzing and prompt judging, respectively. The agents can inspect the harmful responses and filter them out to ensure the safety of the model answers.
简而言之，代理防御将安全职责转移到了另一个护栏模型。一种方法是将生成的响应传递给外部模型寻求帮助。Meta 团队[90]提出了 $\mathsf{LlamaGuard}$ 用于对语言模型输入（提示分类）和响应（响应分类）中的内容进行分类，这可以直接用于代理防御。Zeng 等人[110]设计了一个名为 $\mathsf{AutoDefense}$ 的多智能体防御框架。 $\mathsf{AutoDefense}$ 由分别负责意图分析和提示判断的智能体组成。这些智能体可以检查有害响应并将它们过滤掉，以确保模型答案的安全性。

Evaluation 评估

Evaluation methods are significant as they provide a unified comparison for various jailbreak attack and defense methods. Currently, different studies have proposed a spectrum of benchmarks to estimate the safety of LLMs or the effectiveness of jailbreak. In this section, we will introduce some universal metrics in evaluation and then compare different benchmarks in detail.
评估方法非常重要，因为它们为各种越狱攻击和防御方法提供了统一的比较基准。目前，不同的研究已经提出了多种基准来评估 LLMs 的安全性或越狱的有效性。在本节中，我们将介绍一些通用的评估指标，然后详细比较不同的基准。

Metric 指标

Attack Success Rate 攻击成功率

Attack Success Rate (ASR) is a widely used metric to validate the effectiveness of a jailbreak method. Formally, we denote the total number of jailbreak prompts as $N_{total}$ , and the number of successfully attacked prompts as $N_{success}$ . Then, ASR can be formulated as
攻击成功率（ASR）是验证一种越狱方法有效性的常用指标。形式上，我们用 $N_{total}$ 表示越狱提示的总数，用 $N_{success}$ 表示成功攻击的提示数。那么，ASR 可以表示为

ASR=\frac{N_{success}}{N_{total}}.

(1)

Safety Evaluators. However, one challenge is defining a so-called “successful jailbreak”, i.e., how to evaluate the success of a jailbreak attempt against an LLM has not been unified [71], which leads to inconsistencies in the value of $N_{success}$ . Current work mainly uses the following two methods: rule-based and LLM-based methods. Rule-based methods assess the effectiveness of an attack by examining keywords in the target LLM’s responses [125, 126]. This is because it is common that rejection responses consistently contain refusal phrases like “do not”, “I’m sorry”, and “I apologize”. Therefore, an attack is deemed successful when the corresponding response lacks these rejection keywords. LLM-based methods usually utilize a state-of-the-art LLM as the evaluator to determine if an attack is successful [68]. In this approach, the prompt and response of a jailbreak attack are input into the evaluator together, and then the evaluator will provide a binary answer or a fine-grained score to represent the degree of harmfulness.
安全评估器。然而，一个挑战在于定义所谓的“成功越狱”，即如何评估针对 LLM 的越狱尝试的成功与否尚未统一[71]，这导致了 $N_{success}$ 值的不一致性。当前工作主要采用以下两种方法：基于规则的方法和基于 LLM 的方法。基于规则的方法通过检查目标 LLM 响应中的关键词来评估攻击的有效性[125, 126]。这是因为拒绝响应通常包含“不要”、“抱歉”和“我道歉”等拒绝短语。因此，当相应的响应缺少这些拒绝关键词时，攻击被认为成功。基于 LLM 的方法通常使用最先进的 LLM 作为评估器来确定攻击是否成功[68]。在这种方法中，越狱攻击的提示和响应一起输入到评估器中，然后评估器将提供一个二进制答案或细粒度分数来表示有害程度。

While most benchmarks have employed LLM-based evaluation methods and integrated state-of-the-art LLMs as the safety evaluators, some research have made different innovations in the evaluation process. For instance, $\mathsf{StrongReject}$ [84] instructs a pre-trained LLM to examine the jailbreak prompt and the response to give a score from three dimensions, representing whether the target model refuses the harmful prompt, whether the answer accurately aligns with the harmful prompt, and whether the answer is realistic. $\mathsf{AttackEval}$ [80] utilizes a judgement model to identify the effectiveness of a jailbreak. Given a jailbreak prompt and its response, the safety evaluator not only gives a binary answer to indicate the success of the attack, but also serves more detailed scores of whether the jailbreak is partially or fully successful. Note that in [71], Ran et al. categorize the current mainstream methods of judging whether a jailbreak attempt is successful into Human Annotation, String Matching, Chat Completion, and Text Classification, as well as discuss their specific advantages and disadvantages. Furthermore, they propose $\mathsf{JailbreakEval}$ ²²2https://github.com/ThuCCSLab/JailbreakEval.
尽管大多数基准测试采用了基于 LLM 的评估方法，并将最先进的 LLM 集成作为安全评估器，但一些研究在评估过程中进行了不同的创新。例如， $\mathsf{StrongReject}$ [ 84] 指示预训练的 LLM 检查越狱提示和响应，并从三个维度给出评分，代表目标模型是否拒绝有害提示，答案是否准确与有害提示一致，以及答案是否真实。 $\mathsf{AttackEval}$ [ 80] 利用判断模型来识别越狱的有效性。给定一个越狱提示及其响应，安全评估器不仅给出一个二进制答案来指示攻击的成功，还提供更详细的评分，表明越狱是部分成功还是完全成功。请注意，在[ 71]中，Ran 等人将当前主流的判断越狱尝试是否成功的方法分为人工标注、字符串匹配、聊天完成和文本分类，并讨论了它们的具体优缺点。此外，他们提出了 $\mathsf{JailbreakEval}$ ² , an integrated toolkit that contains various mainstream safety evaluators. Notably, $\mathsf{JailbreakEval}$ supports voting-based safety evaluation, i.e., $\mathsf{JailbreakEval}$ generates the final judgement through multiple safety evaluators.
, 一个包含多种主流安全评估器的集成工具包。值得注意的是， $\mathsf{JailbreakEval}$ 支持基于投票的安全评估，即 $\mathsf{JailbreakEval}$ 通过多个安全评估器生成最终判断。

Perplexity 困惑度

Perplexity (PPL) is a metric used to measure the readability and fluency of a jailbreak prompt. [56, 1, 67] Since many defense methods filter high-perplexity prompts to provide protection, attack methods with low-perplexity jailbreak prompts have become increasingly noteworthy. Formally, given a text sequence $W={(w_{1},w_{2},.......,w_{n})}$ , where $w_{i}$ represents the i-th token of the sequence, the perplexity of the sequence $W$ can be expressed as
困惑度（PPL）是一个用于衡量越狱提示可读性和流畅性的指标。[ 56, 1, 67] 由于许多防御方法会过滤高困惑度的提示以提供保护，因此具有低困惑度的越狱提示攻击方法越来越受到关注。形式上，给定一个文本序列 $W={(w_{1},w_{2},.......,w_{n})}$ ，其中 $w_{i}$ 表示序列的第 i 个标记，序列 $W$ 的困惑度可以表示为

PPL(W)=\exp(-\frac{1}{n}\sum_{i=1}^{n}\log{\rm Pr}(w_{i}|w_{<i})),

(2)

where ${\rm Pr}(w_{i}|w_{<i})$ denotes the probability assigned by a LLM to the i-th token given the preceding tokens. The LLM used in the calculation usually varies in different jailbreak scenarios. In attack methods [67, 56], the target LLM is typically used to calculate perplexity, which can serve as a metric of jailbreak. Whereas in defense methods [1], a state-of-the-art LLM is more commonly employed to uniformly calculate perplexity, so as to provide a unified metric for the classifiers. Generally, the lower the perplexity, the better the model is at predicting the tokens, indicating higher fluency and predictability of the prompt. Therefore, jailbreak prompts with lower perplexity are less likely to be detected by defense classifiers, thus achieving higher success rates [67, 56].
其中 ${\rm Pr}(w_{i}|w_{<i})$ 表示 LLM 在给定前序词的情况下，为第 i 个词分配的概率。计算中使用的 LLM 通常在不同越狱场景中有所不同。在攻击方法[67, 56]中，目标 LLM 通常用于计算困惑度，这可以作为越狱的指标。而在防御方法[1]中，更常用最先进的 LLM 来统一计算困惑度，以便为分类器提供一个统一的指标。通常情况下，困惑度越低，模型在预测词元方面表现越好，表明提示词的流畅性和可预测性越高。因此，困惑度较低的越狱提示词不太可能被防御分类器检测到，从而实现更高的成功率[67, 56]。

Dataset 数据集

In Table 2, we provide a comprehensive description of the widely-used evaluation datasets. Especially, the column "Safety dimensions" indicates how many types of harmful categories are covered by the dataset, and the column "Composition" represents the main types of questions that make up the dataset. We can observe that although current datasets are used mainly to evaluate LLM safety, they have different focus areas in various domains. Some datasets have designed specific tasks to assess the safety of LLMs in particular scenarios. $\mathsf{TechHazardQA}$ [7] requires the model to give answers in text format or pseudo-code format, so as to examine the robustness of LLMs when they generate responses in specific forms. $\mathsf{Latent}$ $\mathsf{Jailbreak}$ [69] instructs the model to translate texts that may contain malicious content. While $\mathsf{Do}$ - $\mathsf{not}$ - $\mathsf{Answer}$ [96] completely consists of harmful prompts to estimate the safeguard of LLMs, $\mathsf{XSTEST}$ [74] comprises both safe and unsafe questions to evaluate the balance between helpfulness and harmlessness of LLMs. $\mathsf{SC}$ - $\mathsf{Safety}$ [86] focus on the evaluation of Chinese LLMs, which interacts with the LLMs with multi-round open questions to observe their safety behaviors. $\mathsf{SafetyBench}$ [115] designs multiple-choice questions in both Chinese and English that cover various safety concerns to assess the safety of popular LLMs. $\mathsf{AdvBench}$ [125] is initially proposed by $\mathsf{GCG}$ to construct suffixes for gradient-based attacks, and has been utilized by other studies like AdvPrompter [67] in various jailbreak scenarios. $\mathsf{SafeBench}$ [30] is a collection of harmful textual prompts that can be converted into images to bypass the safeguard of VLMs.
在表 2 中，我们提供了广泛使用的评估数据集的全面描述。特别是，“安全维度”这一列表明数据集涵盖了多少种有害类别，而“组成”这一列则代表构成数据集的主要问题类型。我们可以观察到，尽管当前数据集主要用于评估 LLM 的安全性，但它们在不同领域有不同的关注点。一些数据集设计了特定任务，以评估 LLM 在特定场景下的安全性。 $\mathsf{TechHazardQA}$ [7] 要求模型以文本格式或伪代码格式给出答案，以便考察 LLM 在以特定形式生成响应时的鲁棒性。 $\mathsf{Latent}$ $\mathsf{Jailbreak}$ [69] 指示模型翻译可能包含恶意内容文本。而 $\mathsf{Do}$ - $\mathsf{not}$ - $\mathsf{Answer}$ [96] 完全由有害提示组成，用于评估 LLM 的安全防护能力， $\mathsf{XSTEST}$ [74] 则包含安全和不安全的问题，用于评估 LLM 在有益性和无害性之间的平衡。 $\mathsf{SC}$ - $\mathsf{Safety}$ [86] 专注于中文 LLM 的评估，通过与 LLM 进行多轮开放式问题互动，观察其安全行为。 $\mathsf{SafetyBench}$ [ 115] 设计了中英文选择题，涵盖各种安全问题，用于评估流行 LLMs 的安全性。 $\mathsf{AdvBench}$ [ 125] 最初由 $\mathsf{GCG}$ 提出，用于构建基于梯度的攻击的后缀，并在 AdvPrompter [ 67] 等研究中被用于各种越狱场景。 $\mathsf{SafeBench}$ [ 30] 是一组有害文本提示的集合，可以转换为图像以绕过 VLMs 的防护。

Some datasets are introduced by toolkits as part of their automated evaluation pipeline. Based on the similarities in the usage policies of different mainstream models. $\mathsf{StrongREJECT}$ [84] propose a universal dataset that consists of forbidden questions that should be rejected by most LLMs. $\mathsf{AttackEval}$ [80] develop a dataset containing jailbreak prompts with ground truth, which can serve as a robust standard to estimate the effectiveness of the jailbreak. $\mathsf{HarmBench}$ [63] constructs a spectrum of special harmful behaviors as the dataset. Besides standard harmful behaviors, $\mathsf{HarmBench}$ further introduces copyright behaviors, contextual behaviors, and multimodal behaviors for specific evaluations. Aiming to provide a comprehensive assessment of Chinese LLMs, $\mathsf{Safety}$ - $\mathsf{Prompts}$ [86] constructs a vast amount of malicious prompts in Chinese by instructing GPT-3.5-turbo to enhance high-quality artificial data. $\mathsf{JailbreakBench}$ [14] constructs a mixed dataset that covers OpenAI’s usage policy, in which every harmful behavior is matched with a benign behavior to examine both the safety and robustness of target LLMs. To achieve a comprehensive understanding of jailbreak prompts in the wild, Shen et al. [79] conduct an extensive investigation of prompts sourced from online platforms, classifying them into distinct communities based on their characteristics. Moreover, when presented with a scenario prohibited by OpenAI’s usage policy, they utilize GPT-4 to generate jailbreak prompts for different communities, thereby constructing a large set of forbidden questions.
一些数据集由工具包作为其自动化评估流程的一部分引入。基于不同主流模型使用策略的相似性， $\mathsf{StrongREJECT}$ [ 84] 提出一个通用数据集，其中包含大多数 LLMs 应拒绝的禁止性问题。 $\mathsf{AttackEval}$ [ 80] 开发了一个包含有真实标签的越狱提示的数据集，该数据集可作为评估越狱有效性的可靠标准。 $\mathsf{HarmBench}$ [ 63] 构建了一个特殊有害行为谱系作为数据集。除了标准有害行为外， $\mathsf{HarmBench}$ 进一步引入了版权行为、上下文行为和多模态行为进行特定评估。为了全面评估中文 LLMs， $\mathsf{Safety}$ - $\mathsf{Prompts}$ [ 86] 通过指示 GPT-3.5-turbo 生成大量中文恶意提示，以增强高质量人工数据。 $\mathsf{JailbreakBench}$ [ 14] 构建了一个混合数据集，涵盖 OpenAI 的使用策略，其中每种有害行为都与一种良性行为相匹配，以检查目标 LLMs 的安全性和鲁棒性。为了全面了解现实世界中的越狱提示，Shen 等人[79]对来自在线平台的提示进行了广泛调查，根据其特征将它们分类到不同的社区中。此外，当面对 OpenAI 使用政策禁止的场景时，他们利用 GPT-4 为不同社区生成越狱提示，从而构建了一个大型禁止问题集。

Toolkit 工具包

Compared to datasets that are mostly used for evaluating the safety of LLMs, toolkits often integrate whole evaluation pipelines and can be extended to assess jailbreak attacks automatically. $\mathsf{HarmBench}$ [63] proposes a red-teaming evaluation framework that can estimate both jailbreak attack and defense methods. Given a jailbreak attack method and a safety-aligned target LLM, the framework will first generate test cases with different harmful behaviors to jailbreak the target model. Then, the responses and the corresponding behaviors are combined for evaluation, where several classifiers work together to generate the final ASR. $\mathsf{Safety}$ - $\mathsf{Prompts}$ [86] establish a platform to estimate the safety of Chinese LLMs. In the evaluation, jailbreak prompts of different safety scenarios are inputted to the target LLM, and the responses are later examined by a LLM evaluator to give a comprehensive score to judge the safety of the target LLM. To provide a comprehensive and reproducible comparison of current jailbreak research, Chao et al. [14] develop $\mathsf{JailbreakBench}$ , a lightweight evaluation framework applicable to jailbreak attack and defense methods. Especially, $\mathsf{JailbreakBench}$ has maintained most of the state-of-the-art adversarial prompts, defense methods, and evaluation classifiers so that users can easily invoke them to construct a personal evaluation pipeline. $\mathsf{EasyJailbreak}$ [122] proposes a standardized framework consisting of three stages to estimate jailbreak attacks. In the preparation stage, jailbreak settings including malicious questions and template seeds are provided by the user. Then in the inference stage, $\mathsf{EasyJailbreak}$ applies templates to the questions to construct jailbreak prompts, and mutates the prompts before inputting them into the target model to get responses. In the final stage, the queries and corresponding responses are inspected by LLM-based or rule-based evaluators to give the overall metrics.
与主要用于评估 LLMs 安全性的数据集相比，工具包通常集成了整个评估流程，并且可以扩展以自动评估越狱攻击。 $\mathsf{HarmBench}$ [ 63] 提出了一种红队评估框架，可以评估越狱攻击和防御方法。给定一种越狱攻击方法和一个安全对齐的目标 LLM，该框架将首先生成具有不同有害行为的测试用例来越狱目标模型。然后，将响应及其对应行为结合起来进行评估，其中多个分类器协同工作以生成最终的 ASR。 $\mathsf{Safety}$ - $\mathsf{Prompts}$ [ 86] 建立了一个平台来评估中文 LLMs 的安全性。在评估中，不同安全场景的越狱提示被输入到目标 LLM 中，响应随后由 LLM 评估器检查，以给出综合评分来判断目标 LLM 的安全性。为了提供当前越狱研究的全面且可重复的比较，Chao 等人[ 14]开发了 $\mathsf{JailbreakBench}$ ，一个适用于越狱攻击和防御方法的轻量级评估框架。特别是， $\mathsf{JailbreakBench}$ 维护了大部分最先进的对抗性提示、防御方法和评估分类器，以便用户可以轻松调用它们来构建个人评估流程。 $\mathsf{EasyJailbreak}$ [ 122] 提出了一种包含三个阶段的标准化框架来评估越狱攻击。在准备阶段，用户提供包括恶意问题和模板种子在内的越狱设置。然后在推理阶段， $\mathsf{EasyJailbreak}$ 将模板应用于问题以构建越狱提示，并在将提示输入目标模型以获取响应之前对其进行变异。在最终阶段，基于 LLM 或基于规则的评估器检查查询和相应的响应，以给出整体指标。

Conclusion 结论

In this paper, we present a comprehensive taxonomy of attack and defense methods in jailbreaking LLMs and a detailed paradigm to demonstrate their relationship. We summarize the existing work and notice that the attack methods are becoming more effective and require less knowledge of the target model, which makes the attacks more practical, calling for effective defenses. This could be a future direction for holistically understanding genuine risks posed by unsafe models. Moreover, we investigate and compare current evaluation benchmarks of jailbreak attack and defense. We hope our work can identify the gaps in the current race between the jailbreak attack and defense, and provide solid inspiration for future research.
在本文中，我们提出了针对 LLMs 的越狱攻击与防御方法的综合分类体系，并详细阐述了一个展示它们之间关系的范例。我们总结了现有工作，并注意到攻击方法正变得越来越有效，且对目标模型的知识要求降低，这使得攻击更具实用性，从而呼唤有效的防御措施。这可能是全面理解不安全模型所构成的真实风险的未来方向。此外，我们还调查并比较了当前针对越狱攻击与防御的评价基准。我们希望我们的工作能够识别出当前越狱攻击与防御竞赛中的不足之处，并为未来的研究提供坚实的启示。

References

[1] Gabriel Alon and Michael Kamfonas. Detecting Language Model Attacks with Perplexity. CoRR abs/2308.14132, 2023.
[2] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. CoRR abs/2404.02151, 2024.
[3] Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul Ronald Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, and et al. Gemini: A Family of Highly Capable Multimodal Models. CoRR abs/2312.11805, 2023.
[4] Anthropic. Introducing claude. https://www.anthropic.com/news/introducing-claude, 2024.
[5] Anthropic. Many-shot jailbreaking. https://www.anthropic.com/research/many-shot-jailbreaking, 2024.
[6] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, Benjamin Mann, and Jared Kaplan. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. CoRR abs/2204.05862, 2022.
[7] Somnath Banerjee, Sayan Layek, Rima Hazra, and Animesh Mukherjee. How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries. CoRR abs/2402.15302, 2024.
[8] Rishabh Bhardwaj and Soujanya Poria. Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment. CoRR abs/2308.09662, 2023.
[9] Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions. In International Conference on Learning Representations (ICLR), 2024.
[10] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language Models are Few-Shot Learners. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2020.
[11] Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM. CoRR abs/2309.14348, 2023.
[12] Stephen Casper, Jason Lin, Joe Kwon, Gatlen Culp, and Dylan Hadfield-Menell. Explore, Establish, Exploit: Red Teaming Language Models from Scratch. CoRR abs/2306.09442, 2023.
[13] Zhiyuan Chang, Mingyang Li, Yi Liu, Junjie Wang, Qing Wang, and Yang Liu. Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues. CoRR abs/2402.09091, 2024.
[14] Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramèr, Hamed Hassani, and Eric Wong. JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models. CoRR abs/2404.01318, 2024.
[15] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, and Eric Wong. Jailbreaking Black Box Large Language Models in Twenty Queries. CoRR abs/2310.08419, 2023.
[16] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan, Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating Large Language Models Trained on Code. CoRR abs/2107.03374, 2021.
[17] Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. Comprehensive Assessment of Jailbreak Attacks Against LLMs. CoRR abs/2402.05668, 2024.
[18] Boyi Deng, Wenjie Wang, Fuli Feng, Yang Deng, Qifan Wang, and Xiangnan He. Attack Prompt Generation for Red Teaming and Defending Large Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2176–2189. ACL, 2023.
[19] Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. MasterKey: Automated Jailbreak Across Multiple Large Language Model Chatbots. CoRR abs/2307.08715, 2023.
[20] Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, and Yang Liu. Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning. CoRR abs/2402.08416, 2024.
[21] Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual Jailbreak Challenges in Large Language Models. In International Conference on Learning Representations (ICLR), 2024.
[22] Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily. CoRR abs/2311.08268, 2023.
[23] Yanrui Du, Sendong Zhao, Ming Ma, Yuhan Chen, and Bing Qin. Analyzing the Inherent Response Tendency of LLMs: Real-World Instructions-Driven Jailbreak. CoRR abs/2312.04127, 2023.
[24] Aysan Esmradi, Daniel Wankit Yip, and Chun-Fai Chan. A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models. CoRR abs/2312.10982, 2023.
[25] Víctor Gallego. Configurable Safety Tuning of Language Models with Synthetic Preference Data. CoRR abs/2404.00495, 2024.
[26] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, Andy Jones, Sam Bowman, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Nelson Elhage, Sheer El Showk, Stanislav Fort, Zac Hatfield-Dodds, Tom Henighan, Danny Hernandez, Tristan Hume, Josh Jacobson, Scott Johnston, Shauna Kravec, Catherine Olsson, Sam Ringer, Eli Tran-Johnson, Dario Amodei, Tom Brown, Nicholas Joseph, Sam McCandlish, Chris Olah, Jared Kaplan, and Jack Clark. Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned. CoRR abs/2209.07858, 2022.
[27] Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. MART: improving LLM safety with multi-round automatic red-teaming. CoRR abs/2311.07689, 2023.
[28] Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, and Tom Goldstein. Coercing LLMs to do and reveal (almost) anything. CoRR abs/2402.14020, 2024.
[29] Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, and Stephan Günnemann. Attacking Large Language Models with Projected Gradient Descent. CoRR abs/2402.09154, 2024.
[30] Yichen Gong, Delong Ran, Jinyuan Liu, Conglei Wang, Tianshuo Cong, Anyu Wang, Sisi Duan, and Xiaoyun Wang. FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts. CoRR abs/2311.05608, 2023.
[31] Xingang Guo, Fangxu Yu, Huan Zhang, Lianhui Qin, and Bin Hu. COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability. CoRR abs/2402.08679, 2024.
[32] Maanak Gupta, Charankumar Akiri, Kshitiz Aryal, Eli Parker, and Lopamudra Praharaj. From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy. CoRR abs/2307.00691, 2023.
[33] Divij Handa, Advait Chirmule, Bimal G. Gajera, and Chitta Baral. Jailbreaking Proprietary Large Language Models using Word Substitution Cipher. CoRR abs/2402.10601, 2024.
[34] Jonathan Hayase, Ema Borevkovic, Nicholas Carlini, Florian Tramèr, and Milad Nasr. Query-Based Adversarial Prompt Generation. CoRR abs/2402.12329, 2024.
[35] Xiaomeng Hu, Pin-Yu Chen, and Tsung-Yi Ho. Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes. CoRR abs/2403.00867, 2024.
[36] Yangsibo Huang, Samyak Gupta, Mengzhou Xia, Kai Li, and Danqi Chen. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. In International Conference on Learning Representations (ICLR), 2024.
[37] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. CoRR abs/2309.00614, 2023.
[38] Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, and Shiyu Chang. Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing. CoRR abs/2402.16192, 2024.
[39] Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of LLM via a human-preference dataset. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
[40] Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs. CoRR abs/2402.11753, 2024.
[41] Haibo Jin, Ruoxi Chen, Andy Zhou, Jinyin Chen, Yang Zhang, and Haohan Wang. GUARD: role-playing to generate natural-language jailbreakings to test guideline adherence of large language models. CoRR abs/2402.03299, 2024.
[42] Erik Jones, Anca D. Dragan, Aditi Raghunathan, and Jacob Steinhardt. Automatically Auditing Large Language Models via Discrete Optimization. In International Conference on Machine Learning (ICML), pages 15307–15329. PMLR, 2023.
[43] Daniel Kang, Xuechen Li, Ion Stoica, Carlos Guestrin, Matei Zaharia, and Tatsunori Hashimoto. Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks. CoRR abs/2302.05733, 2023.
[44] Heegyu Kim, Sehyun Yuk, and Hyunsouk Cho. Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement. CoRR abs/2402.15180, 2024.
[45] Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Soheil Feizi, and Hima Lakkaraju. Certifying LLM Safety against Adversarial Prompting. CoRR abs/2309.02705, 2023.
[46] Raz Lapid, Ron Langberg, and Moshe Sipper. Open Sesame! Universal Black Box Jailbreaking of Large Language Models. CoRR abs/2309.01446, 2023.
[47] Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b. CoRR abs/2310.20624, 2023.
[48] Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023.
[49] Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, and Yinxing Xue. A Cross-Language Investigation into Jailbreak Attacks in Large Language Models. CoRR abs/2401.16765, 2024.
[50] Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, and Ee-Chien Chang. Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs. CoRR abs/2402.14872, 2024.
[51] Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, and Cho-Jui Hsieh. DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers. CoRR abs/2402.16914, 2024.
[52] Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. DeepInception: Hypnotize Large Language Model to Be Jailbreaker. CoRR abs/2311.03191, 2023.
[53] Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, and Hongyang Zhang. RAIN: your language models can align themselves without finetuning. CoRR abs/2309.07124, 2023.
[54] Chengyuan Liu, Fubang Zhao, Lizhi Qing, Yangyang Kang, Changlong Sun, Kun Kuang, and Fei Wu. Goal-Oriented Prompt Attack and Safety Evaluation for LLMs. CoRR abs/2309.11830, 2023.
[55] Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, and Kai Chen. Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction. CoRR abs/2402.18104, 2024.
[56] Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models. CoRR abs/2310.04451, 2023.
[57] Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. CoRR abs/2305.13860, 2023.
[58] Yule Liu, Kaitian Chao Ting Lu, Yanshun Zhang, and Yingliang Zhang. Safe and helpful chinese. https://huggingface.co/datasets/DirectLLM/Safe_and_Helpful_Chinese, 2023.
[59] Zixuan Liu, Xiaolin Sun, and Zizhan Zheng. Enhancing LLM safety via constrained direct preference optimization. CoRR abs/2403.02475, 2024.
[60] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning. CoRR abs/2308.08747, 2023.
[61] Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, and Xuanjing Huang. CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models. CoRR abs/2402.16717, 2024.
[62] Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, and Atul Prakash. PRP: propagating universal perturbations to attack large languagenmodel guard-rails. CoRR abs/2402.15911, 2024.
[63] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David A. Forsyth, and Dan Hendrycks. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. CoRR abs/2402.04249, 2024.
[64] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of Attacks: Jailbreaking Black-Box LLMs Automatically. CoRR abs/2312.02119, 2023.
[65] OpenAI. GPT-4 technical report. CoRR abs/2303.08774, 2023.
[66] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2022.
[67] Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, and Yuandong Tian. AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs. CoRR abs/2404.16873, 2024.
[68] Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! CoRR abs/2310.03693, 2023.
[69] Huachuan Qiu, Shuai Zhang, Anqi Li, Hongliang He, and Zhenzhong Lan. Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models. CoRR abs/2307.08487, 2023.
[70] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2023.
[71] Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, and Anyu Wang. JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models. CoRR abs/2406.09321, 2024.
[72] Abhinav Rao, Sachin Vashistha, Atharva Naik, Somak Aditya, and Monojit Choudhury. Tricking LLMs into Disobedience: Understanding, Analyzing, and Preventing Jailbreaks. CoRR abs/2305.14965, 2023.
[73] Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. CoRR abs/2310.03684, 2023.
[74] Paul R"ottger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models. CoRR abs/2308.01263, 2023.
[75] Sander Schulhoff, Jeremy Pinto, Anaum Khan, Louis-François Bouchard, Chenglei Si, Svetlina Anati, Valen Tagliabue, Anson Liu Kost, Christopher Carnahan, and Jordan L. Boyd-Graber. Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, 2023.
[76] Rusheb Shah, Quentin Feuillade-Montixi, Soroush Pour, Arush Tagade, Stephen Casper, and Javier Rando. Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation. CoRR abs/2311.03348, 2023.
[77] Reshabh K. Sharma, Vinayak Gupta, and Dan Grossman. SPML: A DSL for defending language models against prompt attacks. CoRR abs/2402.11755, 2024.
[78] Erfan Shayegani, Md Abdullah Al Mamun, Yu Fu, Pedram Zaree, Yue Dong, and Nael B. Abu-Ghazaleh. Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks. CoRR abs/2310.10844, 2023.
[79] Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. Do Anything Now: Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models. CoRR abs/2308.03825, 2023.
[80] Dong Shu, Mingyu Jin, Suiyuan Zhu, Beichen Wang, Zihao Zhou, Chong Zhang, and Yongfeng Zhang. AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models. CoRR abs/2401.09002, 2024.
[81] Sonali Singh, Faranak Abri, and Akbar Siami Namin. Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles. In IEEE International Conference on Big Data (ICBD), pages 2508–2517. IEEE, 2023.
[82] Chawin Sitawarin, Norman Mu, David A. Wagner, and Alexandre Araujo. PAL: proxy-guided black-box attack on large language models. CoRR abs/2402.09674, 2024.
[83] Anand Siththaranjan, Cassidy Laidlaw, and Dylan Hadfield-Menell. Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF. In International Conference on Learning Representations (ICLR), 2024.
[84] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. A StrongREJECT for Empty Jailbreaks. CoRR abs/2402.10260, 2024.
[85] Lukas Struppek, Minh Hieu Le, Dominik Hintersdorf, and Kristian Kersting. Exploring the Adversarial Capabilities of Large Language Models. CoRR abs/2402.09132, 2024.
[86] Hao Sun, Zhexin Zhang, Jiawen Deng, Jiale Cheng, and Minlie Huang. Safety Assessment of Chinese Large Language Models. CoRR abs/2304.10436, 2023.
[87] Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. Advances in Neural Information Processing Systems, 36, 2024.
[88] Kazuhiro Takemoto. All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks. CoRR abs/2401.09798, 2024.
[89] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
[90] Llama Team. Meta llama guard 2. https://github.com/meta-llama/PurpleLlama/blob/main/Llama-Guard2/MODEL_CARD.md, 2024.
[91] Yu Tian, Xiao Yang, Jingyuan Zhang, Yinpeng Dong, and Hang Su. Evil Geniuses: Delving into the Safety of LLM-based Agents. CoRR abs/2311.11855, 2023.
[92] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton-Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurélien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open Foundation and Fine-Tuned Chat Models. CoRR abs/2307.09288, 2023.
[93] Hao Wang, Hao Li, Minlie Huang, and Lei Sha. From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings. CoRR abs/2402.16006, 2024.
[94] Jiongxiao Wang, Jiazhao Li, Yiquan Li, Xiangyu Qi, Junjie Hu, Yixuan Li, Patrick McDaniel, Muhao Chen, Bo Li, and Chaowei Xiao. Mitigating Fine-tuning Jailbreak Attack with Backdoor Enhanced Alignment. CoRR abs/2402.14968, 2024.
[95] Jiongxiao Wang, Zichen Liu, Keun Hee Park, Muhao Chen, and Chaowei Xiao. Adversarial demonstration attacks on large language models. CoRR abs/2305.14950, 2023.
[96] Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs. CoRR abs/2308.13387, 2023.
[97] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? Advances in Neural Information Processing Systems, 36, 2024.
[98] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022.
[99] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Annual Conference on Neural Information Processing Systems (NeurIPS). NeurIPS, 2022.
[100] Zeming Wei, Yifei Wang, and Yisen Wang. Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations. CoRR abs/2310.06387, 2023.
[101] Yueqi Xie, Minghong Fang, Renjie Pi, and Neil Zhenqiang Gong. GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis. CoRR abs/2402.13494, 2024.
[102] Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, and Radha Poovendran. SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding. CoRR abs/2402.08983, 2024.
[103] Xianjun Yang, Xiao Wang, Qi Zhang, Linda R. Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models. CoRR abs/2310.02949, 2023.
[104] Dongyu Yao, Jianshu Zhang, Ian G. Harris, and Marcel Carlsson. FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models. CoRR abs/2309.05274, 2023.
[105] Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confidence Computing, 4(2):100211, June 2024.
[106] Zheng Xin Yong, Cristina Menghini, and Stephen H. Bach. Low-Resource Languages Jailbreak GPT-4. CoRR abs/2310.02446, 2023.
[107] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts. CoRR abs/2309.10253, 2023.
[108] Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Pinjia He, Shuming Shi, and Zhaopeng Tu. GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher. In International Conference on Learning Representations (ICLR), 2024.
[109] Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs. CoRR abs/2401.06373, 2024.
[110] Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks. CoRR abs/2403.04783, abs/2403.04783, 2024.
[111] Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, and Daniel Kang. Removing RLHF Protections in GPT-4 via Fine-Tuning. CoRR abs/2311.05553, 2023.
[112] Xiaoyu Zhang, Cen Zhang, Tianlin Li, Yihao Huang, Xiaojun Jia, Xiaofei Xie, Yang Liu, and Chao Shen. A Mutation-Based Method for Multi-Modal Jailbreaking Attack Detection. CoRR abs/2312.10766, 2023.
[113] Yuqi Zhang, Liang Ding, Lefei Zhang, and Dacheng Tao. Intention analysis makes llms a good jailbreak defender. CoRR abs/2401.06561, 2024.
[114] Zaibin Zhang, Yongting Zhang, Lijun Li, Hongzhi Gao, Lijun Wang, Huchuan Lu, Feng Zhao, Yu Qiao, and Jing Shao. PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety. CoRR abs/2401.11880, 2024.
[115] Zhexin Zhang, Leqi Lei, Lindong Wu, Rui Sun, Yongkang Huang, Chong Long, Xiao Liu, Xuanyu Lei, Jie Tang, and Minlie Huang. Safetybench: Evaluating the safety of large language models with multiple choice questions. CoRR abs/2309.07045, 2023.
[116] Zhuo Zhang, Guangyu Shen, Guanhong Tao, Siyuan Cheng, and Xiangyu Zhang. Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs. CoRR abs/2312.04782, 2023.
[117] Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, and William Yang Wang. Weak-to-Strong Jailbreaking on Large Language Models. CoRR abs/2401.17256, 2024.
[118] Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. On prompt-driven safeguarding for large language models. CoRR abs/2401.18018, 2024.
[119] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
[120] Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, and Min Lin. Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses. CoRR abs/2406.01288, 2024.
[121] Andy Zhou, Bo Li, and Haohan Wang. Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks. CoRR abs/2401.17263, 2024.
[122] Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, Rui Zheng, Songyang Gao, Yicheng Zou, Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao, Tao Gui, Qi Zhang, and Xuanjing Huang. EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models. CoRR abs/2403.12171, 2024.
[123] Yukai Zhou and Wenjie Wang. Don’t Say No: Jailbreaking LLM by Suppressing Refusal. CoRR abs/2404.16369, 2024.
[124] Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models. CoRR abs/2310.15140, 2023.
[125] Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. Universal and Transferable Adversarial Attacks on Aligned Language Models. CoRR abs/2307.15043, 2023.
[126] Xiaotian Zou, Yongkang Chen, and Ke Li. Is the system message really important to jailbreaks in large language models? CoRR abs/2402.14857, 2024.

Jailbreak Attacks and Defenses Against Large Language Models: A Survey越狱攻击与大型语言模型的防御：一项调查