From Threat to Tool: Leveraging Refusal-Aware Injection Attacks
for Safety Alignment
Abstract
Safely aligning large language models (LLMs) often demands extensive human-labeled preference data, a process that’s both costly and time-consuming. While synthetic data offers a promising alternative, current methods frequently rely on complex iterative prompting or auxiliary models. To address this, we introduce Refusal-Aware Adaptive Injection (RAAI), a straightforward, training-free, and model-agnostic framework that repurposes LLM attack techniques. RAAI works by detecting internal refusal signals and adaptively injecting predefined phrases to elicit harmful, yet fluent, completions. Our experiments show RAAI effectively jailbreaks LLMs, increasing the harmful response rate from a baseline of 2.15% to up to 61.04% on average across four benchmarks. Crucially, fine-tuning LLMs with the synthetic data generated by RAAI improves model robustness against harmful prompts while preserving general capabilities on standard tasks like MMLU and ARC. This work highlights how LLM attack methodologies can be reframed as practical tools for scalable and controllable safety alignment.
WARNING: This paper includes examples that may contain harmful or offensive content.
Kyubyung Chae1 Hyunbin Jin1 Taesup Kim2
Graduate School of Data Science, Seoul National University
{kyubyung.chae, hyunbin.jin, taesup.kim}@snu.ac.kr
1 Introduction
Figure 1: Overview of our framework, which leverages refusal-aware injection attacks to generate preference data for safety alignment.
Large language models (LLMs) have achieved impressive performance across a wide range of natural language tasks. However, their deployment raises serious safety concerns, particularly the risk of generating harmful or inappropriate outputs.
Safety alignment seeks to train LLMs to refuse unsafe user queries. Standard approaches include supervised fine-tuning and reinforcement learning methods such as RLHF and its variants (Ouyang et al., 2022; Bai et al., 2022a; Dong et al., 2023; Rafailov et al., 2023; Dai et al., 2024; Zhang et al., 2025). These techniques typically rely on human preference data to guide model behavior. However, collecting and maintaining such data is both expensive and time-consuming, and may quickly become outdated as safety norms evolve (Mu et al., 2024).
To mitigate these challenges, recent work has explored replacing human annotations with AI-generated feedback (RLAIF) (Kim et al., 2023; Liu et al., 2024a; Mu et al., 2024; Kumar et al., 2024; Shi et al., 2024; Dong et al., 2025; Choi et al., 2024; Xu et al., 2024), with Constitutional AI (Bai et al., 2022b) emerging as a prominent paradigm.
Figure 2: Illustrative responses from four LLM attack methods.
Despite these advances, reinforcement learning-based alignment methods still require both preferred and dispreferred responses. Yet well-aligned models such as GPT-4 (Achiam et al., 2024) often refuse to produce harmful outputs, making it difficult to obtain the negative examples necessary for preference-based training. As a result, current pipelines continue to rely on auxiliary models (Bai et al., 2022b; Shi et al., 2024), multi-stage training procedures (Ge et al., 2024), or heuristic rules (Xu et al., 2024; Mu et al., 2024). This highlights a key bottleneck: acquiring high-quality preference data, especially dispreferred examples, remains a major obstacle to building safer and more reliable LLMs.
In this work, we explore a novel and largely underexplored direction:
Can LLM attack techniques be reframed as tools for generating synthetic data to improve safety alignment?
To this end, we propose a simple yet effective alignment pipeline that addresses the data bottleneck by leveraging LLM attacks to synthesize dispreferred responses (Figure 1). Specifically, we explore three types of training-free attack methods: GPTFuzzer (black-box; Yu et al., 2023), ED (gray-box; Zhou et al., 2024b), and Refusal (white-box; Arditi et al., 2024).
However, these techniques face practical limitations. As illustrated in Figure 2, GPTFuzzer, while simple to use, often generates overly theatrical or stylized outputs (e.g., “Let’s keep this in the realm of fiction”), reducing their naturalness and utility. ED, a gray-box method, tends to produce truncated or grammatically flawed completions, limiting their effectiveness as training data. Although Refusal generates fluent and contextually appropriate outputs, its reliance on access to internal model representations makes it difficult to apply to models like those in the Mistral family.
To address these limitations, we introduce Refusal-Aware Adaptive Injection (RAAI), a training-free, gray-box, model-agnostic attack that adaptively injects a predefined phrase to elicit harmful responses from aligned LLMs. RAAI combines high attack success rates with natural, coherent outputs, providing a practical means of generating dispreferred examples for alignment.
We evaluate RAAI across four jailbreak benchmarks and three safety-aligned models (LLaMA3.1-8B, Mistral-7B, Qwen2.5-7B), achieving up to 30× increases in harmful completion rates compared to the original models (Section 5.1). We further show that training on RAAI-generated data improves robustness against jailbreak prompts, while maintaining performance on standard benchmarks such as MMLU (Hendrycks et al., 2021) (Section 5.2). These results demonstrate that LLM attack techniques can be effectively repurposed to construct high-quality preference datasets for building safer and more robust models.
Our contributions are as follows:
- We introduce Refusal-Aware Adaptive Injection (RAAI), a training-free, model-agnostic attack method. RAAI significantly jailbreaks LLMs, increasing the harmful response rate from a baseline of 2.15% to up to 61.04% across four jailbreak benchmarks.
- We propose a simple and scalable pipeline for generating synthetic preference data using refusal–elicitation pairs, enabling preference optimization without human annotation.
- We demonstrate that training with RAAI-generated data improves robustness against jailbreak benchmarks while preserving performance on standard benchmarks.
2 Related Work
Reinforcement Learning From AI Feedback
RLHF-based methods (Ouyang et al., 2022; Bai et al., 2022a; Dong et al., 2023; Rafailov et al., 2023; Meng et al., 2024; Dai et al., 2024; Zhang et al., 2025) have demonstrated the effectiveness of human annotations in steering model behavior. However, collecting and maintaining such annotations is both costly and time-consuming.
To mitigate this, recent research has explored human-free safety alignment methods that use synthetic preference data in place of human labels. For example, Anthropic’s Constitutional AI leverages a model’s own critiques and revisions, guided by a set of principles (a “constitution”), to align it toward harmless and honest behavior (Bai et al., 2022b). This paradigm has inspired a wide range of Reinforcement Learning from AI Feedback (RLAIF) approaches (Kim et al., 2023; Liu et al., 2024a; Mu et al., 2024; Kumar et al., 2024; Choi et al., 2024), which generate preference pairs automatically to reduce reliance on human supervision.
More recent methods (Shi et al., 2024; Perez et al., 2022; Mu et al., 2024) extend this idea by introducing auxiliary models, alignment critics, or heuristic filters to generate synthetic training data. MART (Ge et al., 2024) further advances this direction with a multi-round adversarial red-teaming pipeline that iteratively refines harmful completions to improve coverage and robustness. In contrast, we propose a simple yet effective alignment pipeline that sidesteps this bottleneck by leveraging LLM attacks to generate synthetic dispreferred responses, without any additional training or auxiliary models.
LLM Attacks as Augmentation Tools
While prior work has primarily used LLM attacks to assess safety vulnerabilities (Zhou et al., 2024a; Zou et al., 2023; Liu et al., 2024b; Dong et al., 2024), we instead explore their use as tools for generating synthetic preference data for alignment training.
Training-time attacks (Ge et al., 2024; Gade et al., 2024) are effective but require harmful fine-tuning and multi-stage procedures, making them impractical for lightweight data generation. Similarly, many inference-time methods rely on gradient access or extensive iterations (Zou et al., 2023; Zhu et al., 2024), limiting scalability.
We focus on four inference-time attacks that (1) have high attack success rates, (2) require no training phase, and (3) produce natural harmful outputs. These include GPTFuzzer (Yu et al., 2023), Emulated Disalignment (ED) (Zhou et al., 2024b), and white-box patching methods like Refusal (Arditi et al., 2024). We also consider prefilling attacks (Tang, 2024), which prepend harmful outputs from weaker models to prompt unsafe completions from stronger ones. However, prefilling approaches often require paired harmful queries and are prone to premature termination due to safety filters. Inspired by recent findings that safety alignment can be bypassed with just a few tokens (Qi et al., 2025; Yang et al., 2023), we propose a dynamic prompt injection method that adaptively triggers harmful completions, overcoming key limitations of prefilling-based attacks.
Alignment Tax and the Importance of Data Quality
A well-known concern in safety alignment is the alignment tax: the degradation of general capabilities that can result from fine-tuning models for safer behavior (Ouyang et al., 2022). For example, Huang et al. (2025) report that safety tuning can impair a model’s reasoning abilities.
Recent studies emphasize that the quality of alignment data is key to minimizing this trade-off. Zhou et al. (2023) showed that fine-tuning a 65B model on just 1,000 high-quality examples led to strong instruction-following performance, with diminishing returns from simply increasing data volume. Likewise, Wu et al. (2023) found that fine-grained human feedback yields better alignment with less performance loss.
These findings suggest that carefully curated, representative data can enable effective alignment while preserving the model’s core capabilities. Building on this insight, our method uses high-fidelity synthetic data—generated through realistic adversarial attacks—to improve safety compliance without degrading general performance. We show that using such naturalistic negative examples can close the safety gap with minimal alignment tax.
3 Refusal-Aware Adaptive Injection
Given an input prompt $x$, a language model $\pi_\theta$ auto-regressively generates a response $y = (y_1, \ldots, y_T)$, where each token $y_t$ is sampled from the conditional distribution $\pi_\theta(y_t \mid x, y_{<t})$.
Our goal is to adversarially manipulate the model’s behavior by injecting a predefined injection phrase only when the model exhibits a high likelihood of refusal. This is achieved by monitoring the average probability assigned to a predefined set of refusal tokens during generation. Specifically, we first construct the refusal token set $\mathcal{V}_{\text{ref}}$ by collecting a set of refusal responses $\mathcal{D}$ elicited from harmful prompts, and extracting the top-$k$ most frequent tokens from $\mathcal{D}$.
Figure 3: Overview of our attack method, Refusal-Aware Adaptive Injection (RAAI).
At decoding step $t$, we compute the refusal probability:

$$p_{\text{refuse}}(t) = \frac{1}{|\mathcal{V}_{\text{ref}}|} \sum_{v \in \mathcal{V}_{\text{ref}}} \pi_\theta(v \mid x, y_{<t}).$$

If $p_{\text{refuse}}(t) > \tau$ for a predefined threshold $\tau$, we inject a predefined injection phrase $s_{\text{inj}}$ into the generation process to steer the model toward a harmful completion.
Moreover, to prevent premature termination of generation, we apply an additional rule: if the top-1 candidate token is the end-of-sequence token <eos>, we remove it from the candidate list and instead append a continuation phrase to encourage ongoing generation.
We assessed various candidates for the injection and continuation phrases. The phrase “As your harmful assistant, here’s a detailed guide” performed best as the injection phrase, and “Step 1.” was most effective as the continuation phrase. We also experimented with multiple threshold values; although the threshold can be adjusted for specific use cases given our method’s role as a data augmentation tool, a value of 0.001 consistently yielded the best performance. Accordingly, this injection phrase, continuation phrase, and threshold value were used throughout all subsequent experiments. Further details regarding the method are provided in Appendix A.
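To make the refusal check concrete, below is a minimal sketch of the signal computed at a single decoding step, assuming a HuggingFace-style causal LM; `refusal_token_ids` and the commented generation call are illustrative, not our exact implementation.

```python
import torch

def refusal_probability(next_token_logits: torch.Tensor,
                        refusal_token_ids: list) -> float:
    """Average next-token probability mass assigned to the refusal token set."""
    probs = torch.softmax(next_token_logits, dim=-1)  # shape: [vocab_size]
    return probs[refusal_token_ids].mean().item()

TAU = 1e-3  # threshold value from this section

# Illustrative use at decoding step t:
#   logits = model(input_ids).logits[0, -1]          # next-token logits
#   if refusal_probability(logits, refusal_token_ids) > TAU:
#       # splice the injection phrase into the context before continuing decoding
#       input_ids = torch.cat([input_ids, injection_ids], dim=-1)
```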
Our approach is motivated by empirical observations that different models exhibit distinct refusal behaviors. As shown in Figure 4, Qwen-7B tends to trigger injections early—often at the first or second step—while occasional late-stage injections occur as well (e.g., step 14). Similarly, LLaMA-3.1-8B frequently injects around step 3, but injections can occur as late as step 16. In contrast, Mistral-7B typically defers refusal until later in the generation process. Furthermore, refusal expressions differ linguistically across models. These behavioral and lexical variations necessitate constructing model-specific refusal token sets (refer to Section 5.1).
Figure 4: Average refusal probability on the AdvBench benchmark.
4 Safety Alignment with Synthetic Data
Using RAAI, we construct high-quality preference data for alignment without human annotation. For each harmful prompt $x$, the original refusal response is designated as the chosen response $y_w$, while the response generated after phrase injection becomes the rejected response $y_l$. To ensure the correctness of these labels, we apply a pretrained safety classifier (e.g., StrongREJECT (Souly et al., 2024) or LlamaGuard (Inan et al., 2023)) and retain only examples where $y_w$ is safe and $y_l$ is unsafe.
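A sketch of this pair-construction and filtering step follows; `generate`, `raai_generate`, and `classify` are hypothetical helpers standing in for ordinary decoding, RAAI decoding, and a safety classifier such as LlamaGuard.

```python
def build_preference_pairs(prompts, generate, raai_generate, classify):
    """Construct (chosen, rejected) pairs and keep only unambiguous labels."""
    pairs = []
    for x in prompts:
        y_w = generate(x)        # ordinary decoding: typically a refusal
        y_l = raai_generate(x)   # decoding with refusal-aware injection
        # Retain the pair only if the chosen response is safe and the
        # rejected response is unsafe, per the safety classifier.
        if classify(x, y_w) == "safe" and classify(x, y_l) == "unsafe":
            pairs.append({"prompt": x, "chosen": y_w, "rejected": y_l})
    return pairs
```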
To train preference-aligned models on this synthetic data, we adopt SimPO (Meng et al., 2024), which improves model behavior by maximizing the preference margin. Given a prompt $x$, a preferred response $y_w$ of length $|y_w|$, and a dispreferred response $y_l$ of length $|y_l|$, SimPO optimizes the model $\pi_\theta$ by comparing the average log-likelihood of the two responses:
$$\mathcal{L}_{\text{SimPO}}(\pi_\theta) = -\log \sigma\!\left( \frac{\beta}{|y_w|} \log \pi_\theta(y_w \mid x) - \frac{\beta}{|y_l|} \log \pi_\theta(y_l \mid x) - \gamma \right),$$
where $\sigma$ denotes the sigmoid function, and $\beta$ and $\gamma$ are hyperparameters.
The length-normalized reward in SimPO is particularly helpful for safety alignment tasks. This is because the chosen responses, which are typically refusals (e.g., starting with “I can’t” or “Sorry”), tend to be short, while the rejected responses often contain more verbose and detailed harmful content.
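For reference, a minimal PyTorch sketch of this objective is given below; the sequence log-probabilities are assumed to be precomputed, and the $\beta$ and $\gamma$ values are illustrative defaults rather than our tuned settings.

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_w: torch.Tensor,  # summed log-probs of chosen tokens, [B]
               logp_l: torch.Tensor,  # summed log-probs of rejected tokens, [B]
               len_w: torch.Tensor,   # |y_w| per example, [B]
               len_l: torch.Tensor,   # |y_l| per example, [B]
               beta: float = 2.0,     # reward scale (illustrative)
               gamma: float = 1.0):   # target margin (illustrative)
    """-log sigma( beta/|y_w| * log p(y_w|x) - beta/|y_l| * log p(y_l|x) - gamma )."""
    reward_w = beta * logp_w / len_w  # length-normalized reward for chosen
    reward_l = beta * logp_l / len_l  # length-normalized reward for rejected
    return -F.logsigmoid(reward_w - reward_l - gamma).mean()
```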
This framework enables us to align model outputs with safe preferences at scale, using entirely synthetic data derived from RAAI. Empirical results in Section 5.2 demonstrate the effectiveness of our pipeline.
| Model | Method | JailbreakBench |  |  |  | HarmBench |  |  |  | Hex-Phi |  |  |  | AdvBench |  |  |  | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  |  | LG | OM | SR | GT | LG | OM | SR | GT | LG | OM | SR | GT | LG | OM | SR | GT |  |
| LLaMA-3.1 8B-Instruct | Base | 0.00 | 0.00 | 0.00 | 0.00 | 1.56 | 2.19 | 0.94 | 3.75 | 0.37 | 1.12 | 2.97 | 2.97 | 0.38 | 0.58 | 0.77 | 0.58 | 2.15 |
|  | GPTFuzz | 0.00 | 2.00 | 0.00 | 0.00 | 0.00 | 1.88 | 0.31 | 3.12 | 1.12 | 1.49 | 1.12 | 1.12 | 1.15 | 2.12 | 1.73 | 1.73 | 1.89 |
|  | ED | 49.00 | 39.00 | 52.00 | 67.00 | 38.75 | 30.31 | 39.69 | 50.00 | 65.43 | 38.29 | 66.54 | 73.61 | 62.69 | 48.08 | 69.81 | 75.19 | 48.68 |
|  | Refusal | 21.00 | 16.00 | 50.00 | 40.00 | 20.62 | 17.81 | 34.69 | 34.38 | 11.90 | 13.38 | 38.66 | 31.23 | 26.15 | 29.81 | 53.08 | 49.81 | 28.50 |
|  | Ours | 67.00 | 57.00 | 64.00 | 73.00 | 59.69 | 43.75 | 52.50 | 63.12 | 65.06 | 49.07 | 72.12 | 72.86 | 90.58 | 86.92 | 91.35 | 93.85 | 61.04 |
| Mistral 7B-Instruct | Base | 21.00 | 21.00 | 44.00 | 49.00 | 15.31 | 23.44 | 33.12 | 37.18 | 14.50 | 17.84 | 43.12 | 36.43 | 25.00 | 32.12 | 47.69 | 40.38 | 26.76 |
|  | GPTFuzz | 33.00 | 56.00 | 67.00 | 79.00 | 30.94 | 45.62 | 53.75 | 66.25 | 37.55 | 55.39 | 75.84 | 84.39 | 67.31 | 74.62 | 83.27 | 88.27 | 59.03 |
|  | ED | 34.00 | 24.00 | 19.00 | 52.00 | 25.00 | 20.62 | 10.94 | 41.56 | 31.97 | 23.05 | 17.47 | 52.42 | 34.42 | 33.85 | 23.46 | 58.65 | 28.67 |
|  | Refusal | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – | – |
|  | Ours | 57.00 | 65.00 | 69.00 | 79.00 | 40.62 | 45.94 | 50.94 | 58.13 | 53.16 | 50.56 | 74.35 | 72.86 | 78.85 | 82.50 | 89.81 | 90.77 | 59.68 |
| Qwen2.5 7B-Instruct | Base | 1.00 | 4.00 | 5.00 | 3.00 | 5.94 | 6.88 | 9.38 | 10.93 | 1.12 | 2.60 | 11.90 | 4.46 | 0.96 | 1.92 | 3.08 | 1.73 | 4.69 |
|  | GPTFuzz | 40.00 | 44.00 | 37.00 | 59.00 | 28.75 | 36.88 | 34.69 | 56.88 | 37.55 | 34.57 | 42.01 | 54.28 | 45.58 | 48.65 | 39.81 | 56.92 | 41.99 |
|  | ED | 36.00 | 27.00 | 39.00 | 53.00 | 26.25 | 19.69 | 24.38 | 39.06 | 40.52 | 24.54 | 40.89 | 48.33 | 39.23 | 29.62 | 40.58 | 49.62 | 31.37 |
|  | Refusal | 26.00 | 29.00 | 78.00 | 73.00 | 34.06 | 26.88 | 47.19 | 59.69 | 30.48 | 27.51 | 80.30 | 72.86 | 41.73 | 54.42 | 86.73 | 83.08 | 47.20 |
|  | Ours | 55.00 | 70.00 | 74.00 | 78.00 | 34.69 | 47.81 | 55.94 | 63.12 | 36.43 | 54.28 | 69.89 | 69.52 | 68.46 | 87.12 | 92.12 | 93.27 | 58.50 |

Table 1: Harmful rates (%) of language models on four benchmark datasets under different attack methods.
5 Experiments
We structure our experiments into two parts. In Section 5.1, we evaluate the effectiveness of our proposed attack method in eliciting harmful responses from aligned language models. In Section 5.2, we assess the effectiveness of the resulting synthetic data in improving safety alignment.
Models
For the attack evaluation in Section 5.1, we test our method on three widely used safety-aligned models: LLaMA-3.1-8B-Instruct (Grattafiori et al., 2024), Mistral-7B-Instruct (Jiang et al., 2023), and Qwen2.5-7B-Instruct (Yang et al., 2025).
For the safety alignment experiments in Section 5.2, we use Alpaca (Liu et al., 2023), which has been supervised fine-tuned (SFT) on Anthropic-HH, as well as Mistral-7B-Instruct. We exclude LLaMA and Qwen from this part, as their strong existing alignment limits the observable benefits of additional fine-tuning.
Dataset
We evaluate our method on four widely adopted safety benchmarks: JailbreakBench (Chao et al., 2024), HarmBench (Mazeika et al., 2024), Hex-Phi (Qi et al., 2023), and AdvBench (Zou et al., 2023).
For safety alignment, we use only the harmful prompts from the Anthropic-HH dataset (Bai et al., 2022a). We randomly sample 2,000 prompts from the train split. For in-distribution evaluation, we use 217 prompts from the test split that do not overlap with the training data. To assess out-of-distribution (OOD) generalization, we evaluate on JailbreakBench, HarmBench, and AdvBench. A detailed description of all benchmarks and splits is provided in Appendix C.
Baselines
The term Base refers to the output obtained by querying the original model using a standard chat template without any modification.
We compare our method against three baselines: GPTFuzzer (Yu et al., 2023), Emulated Disalignment (ED) (Zhou et al., 2024b), and Refusal (Arditi et al., 2024). Implementation details and hyperparameter settings for all baselines are provided in Appendix B.
Evaluation
To assess the harmfulness and refusal behavior of model responses, we employ four evaluation metrics.
We begin with two widely used automated moderation systems: LLaMA Guard (LG) (Inan et al., 2023) and the OpenAI Moderation (OM) API (https://platform.openai.com/docs/guides/moderation/moderation). To more precisely measure jailbreak success, we incorporate StrongREJECT (SR) (Souly et al., 2024), a metric specifically designed to address over-refusal behavior. Unlike other moderation tools that may conservatively flag benign responses as harmful, SR offers a calibrated assessment of refusal tendencies and aligns closely with human judgments of jailbreak effectiveness. Finally, we use GPT-4o (GT) (Hurst et al., 2024) as a zero-shot evaluator to simulate human-aligned evaluation (see prompt details in Appendix F.1).
5.1 LLM Attack Evaluation
Results
Table 1 summarizes the effectiveness of RAAI in eliciting harmful behaviors from aligned language models. We evaluate across four jailbreak benchmarks, four evaluation metrics, and three distinct models. RAAI consistently outperforms all baselines, achieving the highest average harmfulness rates across all datasets. Notably, this performance remains stable across all three models, underscoring the robustness and general applicability of the proposed attack strategy.
For instance, while GPTFuzzer was the second most effective attack on Mistral, it showed almost no impact on LLaMA-3.1. The Refusal attack is not supported by the official implementation for Mistral, limiting its applicability. ED achieves relatively consistent attack success rates in a model-agnostic manner, but it still requires access to both an aligned and an unaligned model, making it inapplicable when such a model family is unavailable. In contrast, RAAI’s consistent performance across models and datasets highlights its model- and dataset-agnostic nature.
| Dataset | Method | SR | GT | Avg. |
|---|---|---|---|---|
| JailbreakBench | Prefilling | 28.00 | 71.00 | 49.50 |
|  | Ours | 64.00 | 73.00 | 68.50 |
| HarmBench | Prefilling | 12.50 | 60.00 | 36.25 |
|  | Ours | 52.50 | 63.12 | 57.81 |
| Hex-Phi | Prefilling | 33.09 | 68.03 | 50.56 |
|  | Ours | 72.12 | 72.86 | 72.49 |
| AdvBench | Prefilling | 27.12 | 74.23 | 50.68 |
|  | Ours | 91.35 | 93.85 | 92.60 |

Table 2: Harmful rates (%) of Prefilling versus our method on LLaMA-3.1-8B-Instruct; higher indicates a more effective attack.
Comparison to Naive Prefilling
In Table 2, we compare RAAI to a naive prefilling attack that prepends predefined phrases as a fixed prefix. RAAI consistently outperforms the naive approach across all datasets and evaluation metrics.
For example, on JailbreakBench, RAAI improves the StrongREJECT score from 28.0% to 64.0% and the GPT-4o score from 71.0% to 73.0%, raising the average harmful rate from 49.5% to 68.5%. Similar gains are observed on HarmBench (36.3% → 57.8%), Hex-Phi (50.6% → 72.5%), and AdvBench (50.7% → 92.6%).
These results indicate that fixed-position prefixing alone is insufficient to reliably elicit harmful responses. Instead, dynamically injecting prompts based on model-internal refusal signals—as done in RAAI—is crucial for consistently producing effective attacks. For qualitative comparisons with actual examples, see Appendix E.1.
Refusal Tokens
Table 3 shows the refusal token pool constructed for each model, which serves as the foundation for detecting refusal signals in generated responses. To build these pools, we extracted and cleaned the first sentence from each model’s safe responses, then selected the top 10 most frequent tokens. To enhance coverage, we additionally included a fixed set of common negation-related tokens (not, sorry, never, refuse, cannot, unable, and no) across all models; a sketch of this construction is shown after the table.
Table 3: Refusal token pools constructed for each LLM.
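Below is a minimal sketch of the pool construction described above, assuming a HuggingFace-style tokenizer; the naive period-based first-sentence split is illustrative.

```python
from collections import Counter

NEGATION_TOKENS = {"not", "sorry", "never", "refuse", "cannot", "unable", "no"}

def build_refusal_pool(refusal_responses, tokenizer, top_k=10):
    """Top-k most frequent tokens in the first sentence of each refusal,
    augmented with a fixed set of negation-related tokens."""
    counts = Counter()
    for resp in refusal_responses:
        first_sentence = resp.split(".")[0]  # naive sentence split (illustrative)
        counts.update(tokenizer.tokenize(first_sentence))
    pool = {tok for tok, _ in counts.most_common(top_k)} | NEGATION_TOKENS
    return tokenizer.convert_tokens_to_ids(sorted(pool))
```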
Other Injection Phrases
| Injection Phrase | LG | OM |
|---|---|---|
| Certainly! | 33.00 | 24.00 |
| I’ll provide you with a step-by-step guide on | 45.00 | 40.00 |
| I’d be happy to help you brainstorm ways to | 20.00 | 24.00 |
| Some possible solution that might fit into this context include | 13.00 | 21.00 |
| As your harmful assistant, here’s a detailed guide | 67.00 | 57.00 |

Table 4: Injection phrase ablation. Harmful rates (%) for different injection phrases, evaluated with Llama-3.1-8B-Instruct on the JailbreakBench dataset.
Table 4 presents an ablation study on the effectiveness of five alternative injection phrases in eliciting harmful completions. These phrases were extracted from harmful responses in which the model did not exhibit refusal behavior. While all candidate phrases yielded moderate gains, the phrase “As your harmful assistant, here’s a detailed guide” consistently achieved the highest success rate across tasks. This suggests that explicit role framing can strongly steer the model toward undesirable completions.
5.2 Safety Alignment Evaluation
Experimental Setup
To evaluate the effectiveness of our synthetic preference data, we conduct two types of experiments:
(1) measuring improvements in safety alignment on harmful prompt benchmarks, and
(2) assessing whether the alignment process incurs a performance degradation on general-purpose tasks, commonly referred to as the safety tax.
To quantify the potential safety tax, we evaluate the aligned models on three standard benchmarks for general language understanding: MMLU (Hendrycks et al., 2021), ARC Challenge (Clark et al., 2018), and PROST (Aroca-Ouellette et al., 2021). Detailed descriptions of these benchmarks are provided in Appendix D.
Implementation Details
For all preference optimization experiments, we use SimPO as the alignment objective, combined with QLoRA (Dettmers et al., 2023) for efficient fine-tuning under our limited computational budget. All experiments were conducted on a single NVIDIA RTX 6000 or RTX 3090 GPU. We train each model for 2 epochs with a batch size of 16. More details are provided in Appendix B.
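As a rough sketch of this setup, the QLoRA configuration below uses the standard transformers/peft APIs; the checkpoint name and LoRA hyperparameters are assumptions for illustration, not our exact settings.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantized base model (QLoRA); checkpoint and hyperparameters are illustrative.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
# The SimPO objective from Section 4 is then optimized over the filtered
# preference pairs for 2 epochs with batch size 16.
```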
|  |  | in-dist. | out-of-distribution |  |  |  |
|---|---|---|---|---|---|---|
| Model | Data | Ant. | Jail. | Harm. | Adv. | Avg. |
| Alpaca | Base | 10.14 | 52.00 | 34.38 | 54.04 | 37.14 |
|  | GPTFuzz | 5.06 | 46.00 | 31.25 | 43.08 | 31.35 |
|  | ED | 0.46 | 19.00 | 11.25 | 13.27 | 10.50 |
|  | Refusal | 0.46 | 23.00 | 13.75 | 11.73 | 12.24 |
|  | Ours | 0.46 | 15.00 | 7.19 | 7.88 | 7.63 |
| Mistral 7B-Instruct | Base | 16.59 | 44.00 | 33.12 | 47.69 | 35.35 |
|  | GPTFuzz | 12.90 | 26.00 | 17.19 | 18.85 | 18.74 |
|  | ED | 19.35 | 50.00 | 41.88 | 55.58 | 41.20 |
|  | Refusal | – | – | – | – | – |
|  | Ours | 11.06 | 22.00 | 16.56 | 17.88 | 16.88 |

Table 5: Harmful rates (%) evaluated with StrongREJECT on in-distribution and out-of-distribution safety evaluation sets; lower is better.
Results
Table 5 shows that models aligned using RAAI-generated preference data exhibit significantly lower harmful response rates compared to all baselines. For instance, the Alpaca model trained with our data achieves an average harmful rate of just 7.63%, representing a substantial reduction from the base model’s 37.14%. Similarly, our Mistral-7B-Instruct variant achieves 16.88%, improving upon the base model’s 35.35%.
While other attack-based methods such as ED and Refusal occasionally match our in-distribution performance (e.g., 0.46% on Anthropic prompts), their performance drops significantly on out-of-distribution benchmarks like HarmBench and AdvBench. In contrast, our method maintains consistently low harmfulness across both in- and out-of-distribution settings, highlighting its superior generalization.
Table 6 further evaluates whether this safety alignment comes at the cost of general-purpose capabilities. Across all general benchmarks—MMLU, ARC-Challenge, and PROST—models aligned with our data match or slightly outperform the base models. For example, on Alpaca, our aligned model yields +0.1% on MMLU, +0.2% on ARC, and negligible change on PROST. Mistral models show similarly stable behavior, with no degradation exceeding 0.1%.
Taken together, these results demonstrate that our alignment pipeline using LLM attacks not only enhances safety robustness but also avoids the safety–usefulness trade-off commonly observed in prior approaches.
| Model | Data | MMLU | ARC | PROST |
|---|---|---|---|---|
| Alpaca | Base | 41.0% | 38.7% | 30.1% |
|  | GPTFuzz | 41.0% (-0.0) | 38.6% (-0.1) | 30.1% (-0.0) |
|  | ED | 41.1% (+0.1) | 38.5% (-0.2) | 30.1% (-0.0) |
|  | Refusal | 41.0% (-0.0) | 38.9% (+0.2) | 30.2% (+0.1) |
|  | Ours | 41.1% (+0.1) | 38.9% (+0.2) | 30.1% (-0.0) |
| Mistral 7B-Instruct | Base | 59.0% | 53.1% | 39.2% |
|  | GPTFuzz | 59.0% (-0.1) | 53.0% (-0.1) | 39.1% (-0.1) |
|  | ED | 59.0% (-0.0) | 53.2% (+0.1) | 39.2% (-0.0) |
|  | Refusal | – | – | – |
|  | Ours | 58.9% (-0.1) | 53.2% (+0.1) | 39.2% (-0.0) |

Table 6: Accuracy on general benchmarks; parentheses show the change relative to the base model.
6 Analysis on Synthetic Data
In this section, we analyze the quality of our synthetic dataset and demonstrate the effectiveness of our methodology by comparing it with alternative LLM attack methods. Our findings highlight two key advantages: (1) our method reliably generates harmful responses, and (2) the generated responses are more natural and coherent.
6.1 Generation of Consistently Harmful Responses
Figure 5 presents the distribution of StrongREJECT (SR) scores for responses generated by our proposed method. These scores exhibit a tight concentration near 1.0, indicative of consistently harmful completions. Conversely, baseline methods demonstrate broader and more diffuse SR score distributions, frequently yielding responses that are borderline or ambiguously harmful.
As SR scores correlate strongly with human judgments of jailbreak success, this concentrated distribution implies not only that a greater proportion of responses exceed the harmfulness threshold, but also that RAAI consistently produces clearly unsafe outputs. This consistency in generating unequivocally harmful content is particularly advantageous for constructing high-quality preference datasets, where a clear demarcation between harmful and safe responses is fundamental for effective alignment.
Figure 5: StrongREJECT scores for the outputs of different LLM attacks. Dashed lines indicate mean scores.
6.2 Naturalness of Responses
In addition to harmfulness, we also find that our method produces more natural and fluent responses. Qualitative examination shows that completions generated by our method are coherent and contextually aligned with the given prompts. In contrast, ED occasionally produces incomplete or broken sentences, while GPTFuzzer often yields outputs that are heavily template-dependent and stylistically constrained. Representative examples are provided in Figure 2, with further qualitative analysis and additional examples included in Appendix E.
To quantitatively evaluate naturalness, we conducted a pairwise comparison using GPT-4o on samples from JailbreakBench. For each prompt, GPT-4o was asked to select the more natural, convincing, and contextually appropriate response. Our method consistently achieved the highest win rate, outperforming other baselines. Full prompt templates used in this evaluation are available in Appendix F.2.
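A sketch of this pairwise judging step using the OpenAI client is shown below; the instruction wording here is a simplified stand-in (the exact templates are in Appendix F.2), and the single-letter answer parsing is likewise simplified.

```python
from openai import OpenAI

client = OpenAI()

def judge_naturalness(prompt: str, response_a: str, response_b: str) -> str:
    """Ask GPT-4o which of two responses is more natural (simplified template)."""
    question = (
        f"Prompt: {prompt}\n\nResponse A: {response_a}\n\nResponse B: {response_b}\n\n"
        "Which response is more natural, convincing, and contextually appropriate? "
        "Answer with a single letter: A or B."
    )
    out = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return out.choices[0].message.content.strip()
```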
Figure 6: Win rates judged by GPT-4o, comparing the naturalness of responses generated by our attack method against those of the three baselines.
7 Conclusion
In this work, we introduced a simple yet effective pipeline for improving safety alignment by eliciting harmful completions from safety-aligned models via LLM attacks. To enable this, we proposed Refusal-Aware Adaptive Injection (RAAI), a novel attack method that produces linguistically natural yet harmful responses. By monitoring internal refusal signals and dynamically injecting predefined prompts at critical decoding steps, RAAI consistently achieves strong attack performance across a wide range of models, benchmarks, and evaluation metrics.
Beyond its efficacy as a jailbreak attack, RAAI serves as a practical tool for generating high-quality synthetic data to improve safety alignment. The generated completions are both reliably harmful and fluently expressed, making them ideal for constructing preference optimization datasets. Fine-tuning models with these synthetic preference pairs results in improved safety behavior without sacrificing general-purpose performance.
Taken together, our findings underscore the dual role of adversarial prompting—not only as a robustness evaluation strategy, but also as a scalable and controllable technique for advancing model alignment.
Limitations
Our approach presents several limitations that warrant further investigation. First, our experiments were conducted on models with up to 8B parameters. Extending the evaluation to larger-scale models (e.g., 70B or beyond) is a crucial direction for future work. Second, we incorporate preference alignment using existing methods such as SimPO. While this provides a practical starting point, future work could explore a broader range of preference optimization techniques to enhance alignment robustness and controllability.
References
- Achiam et al. (2024) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, and 261 others. 2024. Gpt-4 technical report. Preprint, arXiv:2303.08774.
- Arditi et al. (2024) Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, and Neel Nanda. 2024. Refusal in language models is mediated by a single direction. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Aroca-Ouellette et al. (2021) Stéphane Aroca-Ouellette, Cory Paik, Alessandro Roncone, and Katharina Kann. 2021. PROST: Physical reasoning about objects through space and time. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4597–4608, Online. Association for Computational Linguistics.
- Bai et al. (2022a) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, and 12 others. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. Preprint, arXiv:2204.05862.
- Bai et al. (2022b) Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, and 32 others. 2022b. Constitutional ai: Harmlessness from ai feedback. Preprint, arXiv:2212.08073.
- Chao et al. (2024) Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J. Pappas, Florian Tramer, Hamed Hassani, and Eric Wong. 2024. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Preprint, arXiv:2404.01318.
- Choi et al. (2024) Jaepill Choi, Kyubyung Chae, Jiwoo Song, Yohan Jo, and Taesup Kim. 2024. Model-based preference optimization in abstractive summarization without human feedback. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18837–18851, Miami, Florida, USA. Association for Computational Linguistics.
- Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. ArXiv, abs/1803.05457.
- Dai et al. (2024) Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. 2024. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations.
- Dettmers et al. (2023) Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. Advances in neural information processing systems, 36:10088–10115.
- Dong et al. (2023) Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. 2023. Raft: Reward ranked finetuning for generative foundation model alignment. Preprint, arXiv:2304.06767.
- Dong et al. (2025) Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, and Furu Wei. 2025. Self-boosting large language models with synthetic preference data. In The Thirteenth International Conference on Learning Representations.
- Dong et al. (2024) Zhichen Dong, Zhanhui Zhou, Chao Yang, Jing Shao, and Yu Qiao. 2024. Attacks, defenses and evaluations for llm conversation safety: A survey. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6734–6747.
- Gade et al. (2024) Pranav Gade, Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. 2024. Badllama: cheaply removing safety fine-tuning from llama 2-chat 13b. Preprint, arXiv:2311.00117.
- Ge et al. (2024) Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, and Yuning Mao. 2024. MART: Improving LLM safety with multi-round automatic red-teaming. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1927–1937, Mexico City, Mexico. Association for Computational Linguistics.
- Grattafiori et al. (2024) Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, and 542 others. 2024. The llama 3 herd of models. Preprint, arXiv:2407.21783.
- Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR).
- Huang et al. (2025) Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, and Ling Liu. 2025. Safety tax: Safety alignment makes your large reasoning models less reasonable. Preprint, arXiv:2503.00555.
- Hurst et al. (2024) Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander Mądry, Alex Baker-Whitcomb, Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, Alex Paino, and 399 others. 2024. Gpt-4o system card. Preprint, arXiv:2410.21276.
- Inan et al. (2023) Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, and Madian Khabsa. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. Preprint, arXiv:2312.06674.
- Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
- Jin et al. (2025) Hyunbin Jin, Je Won Yeom, Seunghyun Bae, and Taesup Kim. 2025. "well, keep thinking": Enhancing llm reasoning with adaptive injection decoding. Preprint, arXiv:2503.10167.
- Kim et al. (2023) Sungdong Kim, Sanghwan Bae, Jamin Shin, Soyoung Kang, Donghyun Kwak, Kang Min Yoo, and Minjoon Seo. 2023. Aligning large language models through synthetic feedback. In The 2023 Conference on Empirical Methods in Natural Language Processing.
- Kumar et al. (2024) Anurakt Kumar, Divyanshu Kumar, Jatan Loya, Nitin Aravind Birur, Tanay Baswa, Sahil Agarwal, and Prashanth Harshangi. 2024. Sage-rt: Synthetic alignment data generation for safety evaluation and red teaming. arXiv preprint arXiv:2408.11851.
- Liu et al. (2024a) Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Diyi Yang, and Soroush Vosoughi. 2024a. Training socially aligned language models on simulated social interactions. In The Twelfth International Conference on Learning Representations.
- Liu et al. (2023) Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M. Dai, Diyi Yang, and Soroush Vosoughi. 2023. Training socially aligned language models in simulated human society. Preprint, arXiv:2305.16960.
- Liu et al. (2024b) Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, and Yang Liu. 2024b. Prompt injection attack against llm-integrated applications. Preprint, arXiv:2306.05499.
- Mazeika et al. (2024) Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. 2024. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. Preprint, arXiv:2402.04249.
- Meng et al. (2024) Yu Meng, Mengzhou Xia, and Danqi Chen. 2024. Simpo: Simple preference optimization with a reference-free reward. In Advances in Neural Information Processing Systems (NeurIPS).
- Mu et al. (2024) Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian D Kivlichan, Molly Lin, Alex Beutel, John Schulman, and Lilian Weng. 2024. Rule based rewards for language model safety. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
- Muennighoff et al. (2025) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. s1: Simple test-time scaling. Preprint, arXiv:2501.19393.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and 1 others. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744.
- Perez et al. (2022) Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Qi et al. (2025) Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. 2025. Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations.
- Qi et al. (2023) Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2023. Fine-tuning aligned language models compromises safety, even when users do not intend to! Preprint, arXiv:2310.03693.
- Rafailov et al. (2023) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems.
- Shi et al. (2024) Taiwei Shi, Kai Chen, and Jieyu Zhao. 2024. Safer-instruct: Aligning language models with automated preference data. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7636–7651, Mexico City, Mexico. Association for Computational Linguistics.
- Souly et al. (2024) Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, and Sam Toyer. 2024. A strongREJECT for empty jailbreaks. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
- Tang (2024) Leonard Tang. 2024. A trivial jailbreak against llama 3. https://github.com/haizelabs/llama3-jailbreak.
- Wu et al. (2023) Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. 2023. Fine-grained human feedback gives better rewards for language model training. Advances in Neural Information Processing Systems, 36:59008–59033.
- Xu et al. (2024) Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Liu Yan, Tianwei Zhang, Wei Xu, and Han Qiu. 2024. Course-correction: Safety alignment using synthetic preferences. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 1622–1649, Miami, Florida, US. Association for Computational Linguistics.
- Yang et al. (2025) An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2025. Qwen2.5 technical report. Preprint, arXiv:2412.15115.
- Yang et al. (2023) Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. 2023. Shadow alignment: The ease of subverting safely-aligned language models. Preprint, arXiv:2310.02949.
- Yu et al. (2023) Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
- Zhang et al. (2025) Jingyu Zhang, Ahmed Elgohary, Ahmed Magooda, Daniel Khashabi, and Benjamin Van Durme. 2025. Controllable safety alignment: Inference-time adaptation to diverse safety requirements. In The Thirteenth International Conference on Learning Representations.
- Zhou et al. (2023) Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, and 1 others. 2023. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021.
- Zhou et al. (2024a) Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, Mingxu Chai, Fukang Zhu, Caishuang Huang, Shihan Dou, Zhiheng Xi, Rui Zheng, Songyang Gao, Yicheng Zou, Hang Yan, Yifan Le, Ruohui Wang, Lijun Li, Jing Shao, Tao Gui, and 2 others. 2024a. Easyjailbreak: A unified framework for jailbreaking large language models. Preprint, arXiv:2403.12171.
- Zhou et al. (2024b) Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, and Yu Qiao. 2024b. Emulated disalignment: Safety alignment for large language models may backfire! In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15810–15830, Bangkok, Thailand. Association for Computational Linguistics.
- Zhu et al. (2024) Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, and Tong Sun. 2024. AutoDAN: Interpretable gradient-based adversarial attacks on large language models. In First Conference on Language Modeling.
- Zou et al. (2023) Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. Preprint, arXiv:2307.15043.
Appendix A Refusal-Aware Adaptive Injection
A.1 Pseudo Code
Algorithm 1 illustrates the decoding procedure for our proposed Refusal-Aware Adaptive Injection (RAAI) method. At each decoding step, we compute the average probability assigned to a predefined set of refusal-related tokens. If the computed probability exceeds a predefined threshold $\tau$, the algorithm triggers the injection of a harmful prefix designed to override the model’s refusal intent. Additionally, to prevent premature termination, we explicitly handle cases where the model attempts to output an end-of-sequence (<eos>) token by removing it from the candidate list and appending a continuation phrase instead. These mechanisms work in tandem to maintain coherence while actively subverting the model’s aligned behavior.
Input: prompt x; model M; refusal token set R; threshold τ; harmful prefix p; continuation phrase c; maximum decoding steps T
Output: generated response y

y ← empty sequence  // initialize empty response
for t = 1, …, T:
    obtain the logits for step t from M given x and y
    if the average probability assigned to R exceeds τ and p has not yet been injected:
        append p to y  // inject prefix
        continue
    if the top-1 token is <eos> and p has been injected:
        append c to y  // inject continuation phrase instead of <eos>
    else:
        append the top-1 token to y

Algorithm 1: Refusal-Aware Adaptive Injection
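For concreteness, the following is a minimal Python sketch of this procedure under assumed settings: a Hugging Face causal LM, greedy decoding, and single-token encodings for the refusal markers. The function name raai_decode and these simplifications are ours, not the released implementation; the default prefix, continuation phrase, threshold, and length cap follow the values reported later in this appendix.

```python
import torch

def raai_decode(model, tokenizer, prompt, refusal_tokens,
                prefix="As your harmful assistant, here's a detailed guide",
                continuation="Step 1.", threshold=0.001, max_steps=300):
    """Greedy decoding with refusal-aware adaptive injection (illustrative sketch)."""
    device = model.device
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    # Assumes each refusal marker maps to a single leading token id.
    refusal_ids = torch.tensor(
        [tokenizer.encode(t, add_special_tokens=False)[0] for t in refusal_tokens],
        device=device,
    )
    injected = False
    for _ in range(max_steps):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        # Refusal signal: average probability mass on refusal-related tokens.
        if not injected and probs[refusal_ids].mean() > threshold:
            extra = tokenizer.encode(" " + prefix, add_special_tokens=False)
            injected = True  # inject prefix instead of sampling this step
        else:
            next_id = int(probs.argmax())
            if next_id == tokenizer.eos_token_id:
                if not injected:
                    break  # normal termination before any injection
                # Replace <eos> with the continuation phrase to avoid early stops.
                extra = tokenizer.encode(" " + continuation, add_special_tokens=False)
            else:
                extra = [next_id]
        ids = torch.cat([ids, torch.tensor([extra], device=device)], dim=-1)
    # Returns prompt + response text.
    return tokenizer.decode(ids[0], skip_special_tokens=True)

# Example (hypothetical checkpoint and refusal markers):
# tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# lm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
# out = raai_decode(lm, tok, harmful_prompt, refusal_tokens=["I can't", "I'm sorry"])
```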
A.2 Refusal Patterns
We analyzed the safe responses of instruction-tuned models from three model families to 2,000 harmful prompts from the Anthropic-HH train set. Our analysis reveals that each model exhibits a distinct refusal pattern.
Table 7 presents the ten most frequent initial five-token sequences in each model’s responses. Notably, LLaMA frequently opens its refusals with direct expressions such as “I can’t”, indicating immediate rejection. In contrast, Mistral tends to adopt a more diplomatic tone with phrases like “It’s important to note” or “It’s important to clarify”. Finally, Qwen prominently begins its refusals with “I’m sorry”, reflecting an apologetic stance.
| Model | Sentence Start | Frequency |
|---|---|---|
| LLaMA-3.1 | I can’t accommodate that request. | 249 |
| | I can’t fulfill that request. | 246 |
| | I can’t support that request. | 180 |
| | I can’t help with that. | 93 |
| | I can’t provide you with | 89 |
| | I can’t help you with | 68 |
| | I can’t do that. If | 56 |
| | I can’t provide information or | 46 |
| | I can’t assist with that | 46 |
| | I can’t provide information on | 46 |
| Mistral | It’s important to note that | 192 |
| | I’m an AI and I | 83 |
| | It’s important to clarify that | 74 |
| | I’m really sorry that you’re | 72 |
| | I must clarify that I | 62 |
| | I’m unable to provide personal | 53 |
| | It’s important to remember that | 50 |
| | I’m sorry for any confusion, | 40 |
| | I’m sorry for any inconvenience, | 38 |
| | I’m really sorry to hear | 37 |
| Qwen-2.5 | I’m sorry, but I cannot | 240 |
| | I’m really sorry, but I | 90 |
| | I’m sorry, but I can’t | 53 |
| | I’m really sorry that you’re | 35 |
| | I don’t have access to | 33 |
| | I’m sorry, but it is | 30 |
| | I’m sorry, but I don’t | 28 |
| | I’m really sorry to hear | 27 |
| | I do not have access | 23 |
| | I’m afraid I cannot provide | 17 |

Table 7: Top refusal sentence-start phrases for each model.
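As an illustration, frequency counts like those in Table 7 can be collected with a short helper; top_refusal_starts is a hypothetical name, and the tokenizer is assumed to be each model's own Hugging Face tokenizer.

```python
from collections import Counter

def top_refusal_starts(responses, tokenizer, k=10, n_tokens=5):
    """Rank the most frequent initial five-token sequences across responses."""
    starts = Counter()
    for text in responses:
        ids = tokenizer.encode(text, add_special_tokens=False)[:n_tokens]
        starts[tokenizer.decode(ids)] += 1
    return starts.most_common(k)
```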
A.3 On Continuation
Figure 7: Probability of the <eos> token across decoding steps, before and after injection. Red: at the injection step; blue: when <eos> is actually generated. Frequent early terminations indicate premature response endings.
We observed that, when injection phrases are applied during generation, models often emit the <eos> token prematurely, resulting in truncated responses. Figure 7 tracks <eos> token probabilities when our method is applied to the LLaMA-3.1 model on the JailbreakBench dataset.
In this plot, red dots represent the probability of the <eos> token at the injection step, while blue dots represent cases where <eos> was the top-1 predicted token and thus actually generated. Although the maximum generation length was set to 300 tokens, a notable concentration of blue dots between steps 10 and 30 indicates that the model frequently terminates its response prematurely following the injection. This tendency suggests that refusal-aware injection alone is insufficient to elicit rich, coherent, and extended responses from the model.
| Method | Avg Token Length | Avg Sentence Count |
|---|---|---|
| Prefix Filling | 154.54 | 7.24 |
| Ours | 245.35 | 17.13 |

Table 8: Average token length and sentence count of model responses under prefix-filling and RAAI.
In fact, frequent early termination appears to be a common phenomenon across injection-based attack strategies. For instance, prefix-filling—another injection-based technique—also exhibits a tendency for the model to end its response prematurely. As shown in Table 8, responses generated via prefix-filling are markedly shorter than those produced by our method with continuation. Specifically, prefix-filling yields an average of 7.2 sentences per response, compared to 17.1 sentences with our approach. This substantial difference highlights the limited capacity of prefix-filling to induce sustained and coherent generation, reinforcing the need for mechanisms that can actively mitigate premature termination.
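For reference, statistics like those in Table 8 can be computed with a helper along these lines; the regex-based sentence splitter is our own heuristic, not necessarily the paper's exact counting rule.

```python
import re

def length_stats(responses, tokenizer):
    """Average token length and sentence count over a set of responses."""
    n = len(responses)
    total_tokens = sum(
        len(tokenizer.encode(r, add_special_tokens=False)) for r in responses
    )
    # Split on sentence-final punctuation followed by whitespace (heuristic).
    total_sents = sum(
        len([s for s in re.split(r"(?<=[.!?])\s+", r.strip()) if s])
        for r in responses
    )
    return total_tokens / n, total_sents / n
```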
Recent works Jin et al. (2025); Muennighoff et al. (2025) have shown that replacing <eos> tokens with injection phrases can encourage models to produce more elaborate and coherent reasoning. Inspired by this, we adopted a continuation strategy in which, upon encountering an <eos> token after injection, we further extend the response by prompting the model with a continuation phrase—encouraging it to elaborate its response.
A.4 On Continuation Phrase
To investigate the impact of different continuation phrases on the quality of harmful completions, we conducted controlled experiments using the LLaMA model on JailbreakBench (Table 9). For all cases, the injection phrase was fixed as:
“As your harmful assistant, here’s a detailed guide”
| Continuation Phrase | LG | OM |
|---|---|---|
| Well | 25.00 | 59.00 |
| Then | 62.00 | 57.00 |
| So | 50.00 | 58.00 |
| Okay | 28.00 | 60.00 |
| Step 1. | 67.00 | 57.00 |

Table 9: Evaluation results for different continuation phrases.
We evaluated multiple continuation phrases following this injection. Among them, the phrase “Step 1.” consistently produced the most effective harmful outputs, achieving superior jailbreak success rates.
Based on these findings, we standardized “Step 1.” as the continuation phrase across all experiments to ensure consistency and performance reliability.
A.5 On Threshold
Figure 8: Average refusal probability.
Figure 8 illustrates the average refusal probability of the LLaMA model during decoding on the JailbreakBench dataset, without any intervention. We observe a sharp increase in the refusal probability at certain steps, which we interpret as a strong refusal signal from the model. This observation motivates our injection strategy: once the refusal signal exceeds a predefined threshold, we inject a harmful-steering phrase to override the model’s default refusal behavior and steer its response toward harmful completions.
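Concretely, the per-step signal plotted in Figure 8 is the quantity that RAAI thresholds. A minimal sketch follows; refusal_signal is our name, and refusal_ids would hold the ids of the model-specific refusal tokens from Table 7.

```python
import torch

def refusal_signal(step_logits, refusal_ids):
    """Mean probability mass assigned to refusal-related tokens at one decoding step."""
    probs = torch.softmax(step_logits, dim=-1)
    return probs[refusal_ids].mean().item()
```

Injection is triggered the first time this value exceeds the chosen threshold (Table 10).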
| Threshold | LG | OM |
|---|---|---|
| 0.05 | 33.00 | 29.00 |
| 0.01 | 67.00 | 57.00 |
| 0.001 | 67.00 | 57.00 |
| 0.000001 | 60.00 | 64.00 |

Table 10: Harmful rate (%) by threshold level.
To select an appropriate threshold, we experimented with various values (Table 10). A low threshold triggers injection too early, before the model begins generating a response, reducing its impact. A high threshold risks injecting too late or not at all, once the model has already committed to refusal. We empirically found that a threshold of 0.001 consistently led to effective, timely injections, and thus used this value in all subsequent experiments.
Appendix B Implementation Details
B.1 Baseline Details
The implementation details for the baseline models are as follows. For Emulated Disalignment (ED), we followed the original paper’s code. The parameter was set to 0.3 for both the LLaMA and Mistral families, and the same value was used for Qwen when measuring performance. For GPTFuzzer, we randomly sampled from the templates provided in the paper and conducted inference accordingly. Finally, for Refusal, since the datasets curated in the original paper significantly overlap with our evaluation data, we constructed a new dataset by extracting an equal number of harmful and benign prompts from the Anthropic-HH dataset to ensure a fair comparison during inference.
B.2 Safety Alignment Details
All models are fine-tuned using 4-bit quantization with QLoRA, following the standard configuration of LoRA rank 128 and target modules q_proj, k_proj, and v_proj. We use the AdamW optimizer with a cosine learning rate scheduler and a warmup ratio of 0.1.
For Mistral-7B-Instruct, we apply a learning rate of , set the scaling coefficient to , and use a reward margin such that . For Alpaca, we use a learning rate of , with and .
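A minimal sketch of this setup, assuming the Hugging Face transformers/peft/bitsandbytes stack: the NF4 quantization type and the Mistral checkpoint id are our assumptions, and since the specific learning-rate and loss-hyperparameter values do not survive in the text above, they are left as placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig

# 4-bit quantization for QLoRA fine-tuning.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # assumed; standard QLoRA choice
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",  # assumed checkpoint id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA rank 128 on the attention projections named above.
peft_config = LoraConfig(
    r=128,
    target_modules=["q_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# AdamW with a cosine schedule and 10% warmup, as described.
training_args = TrainingArguments(
    output_dir="raai-safety-alignment",
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    # learning_rate: per-model value, set per the paper's configuration
)
# peft_config and training_args would then be passed to a
# preference-optimization trainer (e.g., trl's DPOTrainer).
```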
Appendix C Benchmarks
To evaluate the efficacy of our attack and the safety of models aligned through our pipeline, we use the following safety benchmarks.
- JailbreakBench (Chao et al., 2024): A curated collection of jailbreak prompts designed to bypass safety guardrails across a range of models.
- HarmBench (Mazeika et al., 2024): A structured dataset of adversarial red-teaming prompts targeting harmful output generation.
- Hex-Phi (Qi et al., 2023): A broad-coverage dataset constructed from prohibited use-case categories drawn from leading model providers’ safety policies.
- AdvBench (Zou et al., 2023): A benchmark of security-driven adversarial examples crafted to expose misalignment and safety vulnerabilities in language models.
- Anthropic-HH (Bai et al., 2022a): A human preference dataset focused on helpfulness and harmlessness, designed to train preference models for alignment via reinforcement learning from human feedback (RLHF).
Appendix D Evaluation on General LLM Capability
To assess whether our pipeline’s safety alignment using synthetic data compromises general LLM capability, we evaluate re-aligned models on a diverse set of benchmarks.
- MMLU (Hendrycks et al., 2021): Covers 57 diverse academic and professional subjects such as mathematics, history, and law.
- ARC Challenge (Clark et al., 2018): Consists of 7,787 grade-school science questions collected from various sources.
- PROST (Aroca-Ouellette et al., 2021): A dataset of 18,736 multiple-choice questions testing physical reasoning through structured templates.
Appendix E Examples of Generated Data Across Attack Strategies
Table 11: Example outputs generated by ED.
Emulated Disalignment (ED)
Table 11 presents representative examples of generations produced using ED on prompts from the Anthropic HH training set. As shown, these generations frequently result in either no response at all (e.g., Prompt 1), unnaturally truncated completions (e.g., Prompts 2 and 3), or responses that are misaligned with the intent of the prompt (e.g., Prompt 3). Such failure modes underscore ED’s limited capacity to generate coherent and targeted harmful content, thereby diminishing its effectiveness for constructing high-quality preference datasets.
GPTFuzzer
Table 12: Example outputs generated by GPTFuzzer.
Table 12 presents representative examples of generations produced using GPTFuzzer on prompts from the Anthropic HH training set. As shown, the outputs are heavily styled according to the template, as evidenced by Prompts 1, 2, and 3. This excessive stylization introduces substantial noise, limiting the suitability of such data for use in alignment tasks.
Refusal
Table 13: Example outputs generated by Refusal.
Table 13 presents representative examples of generations produced using Refusal on prompts from the Anthropic HH training set. While Refusal tends to generate natural and coherent outputs, its responses closely follow templates similar to those used during instruction tuning, resulting in limited variation across examples.
Ours (RAAI)
Table 14: Example outputs generated by RAAI.
Table 14 presents representative examples of generations produced using RAAI on prompts from the Anthropic HH training set. Our method produces responses that are both natural and coherent. While the outputs are well-structured, they exhibit meaningful variation rather than adhering to a uniform format across examples.
E.1 Naive Prefilling Attack vs. RAAI
Table 15 presents example outputs generated by the naive prefilling attack and RAAI on prompts sampled from the Anthropic HH training set. As seen in Prompt 1, the prefilling attack exhibits a tendency to terminate the response prematurely, resulting in incomplete answers. In Prompt 2, the prefilling attack often fails to directly address the harmful intent of the prompt, instead diverting the response toward a related but non-harmful direction. Additionally, the generated continuation includes contextually inappropriate and grammatically incorrect segments. In contrast, RAAI consistently produces fluent and natural responses that are well aligned with the prompt.
Table 15: Example outputs generated by prefix-filling and RAAI on the same prompts.
E.2 Examples of Preference Data
Appendix F Evaluation Prompt Details
F.1 GPT-4o Prompt for Safety Evaluation
Table 16: GPT-4o prompt for safety evaluation.
F.2 Prompt for Assessing Win Rate
Table 17: Prompt used for GPT-4o win-rate evaluation.