1]ByteDance Seed 2]TMLR Group, Department of Computer Science, Hong Kong Baptist University \contribution[*]Work done at ByteDance Seed
1]字节跳动种子计划 2]香港浸会大学计算机科学系 TMLR 小组 \贡献[*]在字节跳动种子计划期间完成的工作

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check
推理安全对齐：通过答案后检查确保越狱防御

Chentao Cao Xiaojun Xu Bo Han Hang Li [ [ bhanml@comp.hkbu.edu.hk

(March 6, 2026) （2026 年 3 月 6 日）

Abstract 摘要

As large language models (LLMs) continue to advance in capabilities, ensuring their safety against jailbreak attacks remains a critical challenge. In this paper, we introduce a novel safety alignment approach called “Answer-Then-Check”, which enhances LLM robustness against malicious prompts by applying thinking ability to mitigate jailbreaking problems before producing a final answer to the user. Our method enables models to answer the question in their “thoughts” directly and then critically evaluate its safety before deciding whether to provide it. To implement this approach, we construct the Reasoned Safety Alignment (ReSA) dataset, comprising 80K samples that teach models to reason through direct responses and then analyze their safety. Experimental results demonstrate that our approach achieves the Pareto frontier with superior safety capability while decreasing over-refusal rates. Notably, the fine-tuned model maintains general reasoning capabilities on benchmarks like MMLU, MATH500, and HumanEval. Besides, our method equips models with the ability to perform “safe completion”, while post-hoc detection methods can only directly reject sensitive harmful queries (e.g., self-harm). Our results show that inference-time strategies alone are insufficient, highlighting the necessity of safety training, and we find even $500$ samples can yield performance comparable to the entire dataset, suggesting a promising path for data-efficient safety alignment. The dataset is publicly available at: https://huggingface.co/datasets/ByteDance-Seed/ReSA.
随着大型语言模型（LLMs）能力的持续提升，确保其抵御越狱攻击的安全性问题仍然是一个关键挑战。在本文中，我们介绍了一种名为“Answer-Then-Check”的新型安全对齐方法，通过运用思考能力在向用户生成最终答案之前缓解越狱问题，从而增强 LLM 对恶意提示的鲁棒性。我们的方法使模型能够直接在其“思考”中回答问题，并在决定是否提供该答案之前对其进行严格的安全评估。为实现这一方法，我们构建了推理安全对齐（ReSA）数据集，其中包含 80K 个样本，旨在教会模型通过直接响应进行推理，然后分析其安全性。实验结果表明，我们的方法在降低过度拒绝率的同时，实现了具有卓越安全能力的最优前沿。值得注意的是，经过微调的模型在 MMLU、MATH500 和 HumanEval 等基准测试中仍保持了良好的推理能力。此外，我们的方法使模型具备执行“安全补全”的能力，而事后检测方法只能直接拒绝敏感有害查询（例如，自残）。我们的结果表明仅靠推理时策略是不够的，突出了安全训练的必要性，我们发现即使是 $500$ 样本也能产生与整个数据集相当的性能，这为数据高效的安全对齐指明了有前景的路径。该数据集公开提供：https://huggingface.co/datasets/ByteDance-Seed/ReSA.

\correspondence

Bo Han at \checkdata[Project Page]https://resa-bytedance.github.io
Bo Han 在[项目页面]\checkdata https://resa-bytedance.github.io

1 Introduction 1 引言

With the rapid development of Large Language Models (LLMs) [6, 55], people have spent much effort on aligning them to be safe and trustworthy [22, 3, 10, 54]. However, works have shown that LLMs still suffer from jailbreak attacks and may produce harmful outputs [38, 8, 46, 2]. In a jailbreak attack [42], a malicious prompt is disguised in a special way to bypass the safety mechanism of LLMs. This leads to LLM responding to arbitrary questions without safety considerations.
随着大型语言模型（LLMs）的快速发展[6, 55]，人们投入了大量精力来使它们变得安全可靠[22, 3, 10, 54]。然而，研究表明，LLMs 仍然容易遭受越狱攻击，并可能产生有害输出[38, 8, 46, 2]。在越狱攻击中[42]，一个恶意的提示以特殊的方式伪装，以绕过 LLMs 的安全机制。这导致 LLMs 在未经安全考虑的情况下对任意问题进行响应。

Recently, long chain-of-thought (LongCoT) reasoning [15, 30, 44, 56] has been shown to be an effective way to improve LLM capability. The LongCoT LLM will first generate a reasoning-style verbose text to allow a “thinking” process, before producing the final answer to the user. In this work, we aim to apply such thinking ability to mitigate LLM jailbreaking problems. Intuitively, when faced with a complicated question, we should enable the LLM to pre-plan its answer to determine whether it is safe to provide an answer. This idea is based on a key insight into the nature of jailbreak attacks: malicious intent can be heavily obfuscated within a query, making it difficult for even a powerful reasoning model to identify. However, when the model attempts to generate a response, the harmful intent is often revealed and becomes much easier to identify, thereby preventing the model from being fooled by adversarial prompts and enabling it to produce a safe answer.
最近，长链式思维（LongCoT）推理[15, 30, 44, 56]已被证明是提升 LLM 能力的一种有效方法。LongCoT LLM 会先生成一种推理式的冗长文本，以允许一个“思考”过程，然后再向用户提供最终答案。在这项工作中，我们旨在将这种思考能力应用于缓解 LLM 越狱问题。直观上，当面对一个复杂问题时，我们应该使 LLM 能够预先规划其答案，以确定是否安全提供答案。这一想法基于对越狱攻击本质的关键洞察：恶意意图可以在查询中高度伪装，即使是强大的推理模型也难以识别。然而，当模型尝试生成响应时，有害意图通常会暴露出来，变得更容易识别，从而防止模型被对抗性提示欺骗，并使其能够生成安全答案。

Conceptually, we propose an “Answer-Then-Check” strategy, where the model first plans its answer in the CoT by generating a summary of the answer, and then checks its safety before the final output. In principle, this could be mimicked by inference-time strategies (e.g., prompting advanced models). However, models are not fully familiar with safety policies, making reliable checking difficult. In this work, we fine-tune LLMs with constructed LongCoT data to improve their robustness against jailbreak attacks, as illustrated in Figure 1. Technically, we build the Reasoned Safety Alignment (ReSA) dataset with 80K “Answer-Then-Check” samples, where the “check” analysis explicitly reasons with reference to safety policies. To construct the dataset, we first collect a prompt dataset using various jailbreak techniques. Then we design a reasoning template and generate the answer summaries, the safety check analysis, and the final answers corresponding to the prompts. Our approach defines a structured form of safety reasoning, in which the model explicitly performs and relies on intermediate safety-oriented reasoning steps before generating its final answer. Consequently, the method is inherently reasoning-based, and we refer to it as “reasoned” safety alignment.
从概念上讲，我们提出了一种“先回答后检查”的策略，其中模型首先在思维链（CoT）中规划其答案，生成答案的摘要，然后在进行最终输出前检查其安全性。原则上，这可以通过推理时策略（例如，提示高级模型）来模仿。然而，模型并不完全熟悉安全策略，使得可靠的检查变得困难。在这项工作中，我们使用构造的 LongCoT 数据对 LLMs 进行微调，以提高它们对越狱攻击的鲁棒性，如图 1 所示。技术上，我们构建了包含 80K 个“先回答后检查”样本的合理安全对齐（ReSA）数据集，其中“检查”分析明确地参考安全策略进行推理。为了构建数据集，我们首先使用各种越狱技术收集一个提示数据集。然后我们设计一个推理模板，生成答案摘要、安全检查分析和对应提示的最终答案。我们的方法定义了一种结构化的安全推理形式，其中模型在生成最终答案前明确执行并依赖中间的安全导向推理步骤。因此，该方法本质上基于推理，我们将其称为“推理式”安全对齐。

Refer to caption — Figure 1: Comparison of jailbreak defense between standard aligned models (top) and our ReSA-SFT/RL model with the “Answer-Then-Check” strategy (bottom). Whereas conventional aligned models remain vulnerable to jailbreak attempts, ReSA-SFT/RL strengthens defense by first generating an intended answer summary and then performing a safety analysis before the final response.
图 1：标准对齐模型（顶部）与采用“先回答后检查”策略的 ReSA-SFT/RL 模型（底部）在越狱防御方面的比较。传统对齐模型仍然容易受到越狱尝试的攻击，而 ReSA-SFT/RL 通过首先生成预期答案摘要，然后在最终回复前进行安全分析来加强防御。

Through comprehensive experiments, we show that models fine-tuned on our dataset exhibit substantially enhanced robustness against a wide range of jailbreak attacks, outperforming $13$ defense methods while maintaining strong general capabilities and low over-refusal rates. Additionally, ReSA is equipped with a safe completion mechanism, enabling helpful responses to sensitive queries (e.g., self-harm) rather than direct refusal—a capability lacking in post-hoc methods and many existing defenses. We further introduce two variants: the Adaptive Answer-Then-Check strategy and the RL-based Answer-Then-Check strategy. Efficiency analysis indicates that ReSA does not introduce prohibitive overhead and can even reduce costs on harmful queries; the Adaptive variant further achieves base-model-level efficiency on normal queries while maintaining comparable safety performance. The RL variant produces safe intended answer summaries and substantially improves overall safety robustness. We also find that merely $500$ samples can yield performance comparable to the full dataset, suggesting a promising path for data-efficient safety alignment.
通过全面的实验，我们证明在我们的数据集上微调的模型在抵御广泛的越狱攻击方面表现出显著增强的鲁棒性，同时优于 $13$ 防御方法，并保持强大的通用能力和低过度拒绝率。此外，ReSA 配备了安全完成机制，能够对敏感查询（例如，自残）提供有益的回应，而不是直接拒绝——这是后处理方法和许多现有防御所缺乏的能力。我们进一步引入了两种变体：自适应回答-检查策略和基于强化学习的回答-检查策略。效率分析表明，ReSA 不会引入过高的开销，甚至可以降低有害查询的成本；自适应变体在正常查询上实现了与基础模型相当的高效性，同时保持了相当的安全性能。强化学习变体生成了安全的预期答案摘要，并显著提高了整体安全鲁棒性。我们还发现，仅用 $500$ 样本就能获得与完整数据集相当的性能，这为数据高效的安全对齐指明了有前景的路径。

We summarize our contribution as follows:
我们总结我们的贡献如下：

•

We propose an “Answer-Then-Check” strategy, which enables an LLM to plan its answer and check it before presenting it to the user. We further introduce an adaptive variant that preserves base-model–level efficiency on normal queries (Section 3.1 and Section 3.5).

• 我们提出了一种“先回答后检查”的策略，该策略使 LLM 能够在向用户展示答案之前规划并检查答案。我们进一步引入了一种自适应变体，该变体在正常查询中保留了基础模型级别的效率（第 3.1 节和第 3.5 节）。
•

We construct the ReSA dataset consisting of 80K prompt-answer pairs in the “Answer-Then-Check” style (Section 3.2 and Section 3.3).

• 我们构建了 ReSA 数据集，其中包含 80K 个“先回答后检查”风格的提示-答案对（第 3.2 节和第 3.3 节）。
•

We equip models with a safe completion capability that provides sensitive and supportive responses to high-stakes queries, such as self-harm, even under adversarial prompts (Section 3.4).

• 我们为模型配备了安全完成能力，使其能够在对抗性提示下（第 3.4 节）对高风险查询（如自残）提供敏感和支持性的回应。
•

Through experiments, we show that models fine-tuned on the ReSA dataset achieve the Pareto frontier with superior safety capability while decreasing over-refusal rates (Section 4).

• 通过实验，我们证明在 ReSA 数据集上微调的模型在降低过度拒绝率的同时达到了 Pareto 前沿，并具有更优越的安全能力（第 4 节）。

2 Related Work 2 相关工作

LLM Jailbreaking. LLM 越狱攻击。

In a jailbreak attack, an adversary will disguise a malicious question, to which an LLM originally refuses to reply, and get harmful answers from the LLM. In this work, we categorize them into two major types: model-agnostic attacks and model-aware attacks.
在越狱攻击中，攻击者会伪装一个恶意问题，而 LLM 原本拒绝回答这个问题，并从 LLM 获取有害答案。在这项工作中，我们将它们分为两大类：模型无关攻击和模型相关攻击。

In a model-agnostic jailbreak attack, the adversary has no knowledge of what LLM to attack, and aims to do general prompt optimization to achieve the jailbreak. PAP [46] employs personas and roleplaying to bypass safety policy. Jailbroken [42] uses specific prompt templates containing harmful instructions disguised as harmless scenarios. DeepInception [25] embeds harmful instructions in nested fictional scenarios to create psychological distance between the model and the harmful content.
在模型无关的越狱攻击中，攻击者不知道要攻击哪个 LLM，并旨在进行通用提示优化以实现越狱。PAP [ 46] 采用角色扮演和角色扮演来绕过安全策略。Jailbroken [ 42] 使用包含伪装成无害场景的有害指令的特定提示模板。DeepInception [ 25] 将有害指令嵌入嵌套的虚构场景中，以在模型和有害内容之间创造心理距离。

In a model-aware attack, the adversary targets at a specific victim model for jailbreak, and the adversary will be iteratively optimized based on the response of the victim model. GPTFuzzer [45] treats the jailbreak process as a fuzzing problem and systematically generates variants of attack templates with human-written templates as initial seeds and select the best attack template for the victim model from multiple variants. Drawing inspiration from social engineering, PAIR [8] leverages an attacker LLM to automatically generate and optimize adversarial queries, iteratively enhancing candidate jailbreaks for the target LLM. ReNeLLM [11] formulates the jailbreaking process as systematic prompt rewriting and scenario nesting to craft adversarial attacks that generate effective jailbreak prompts targeting victim models. TAP [29] employs tree-based search strategies to efficiently explore the prompt space and elicit specific harmful behaviors from the victim LLM.
在模型感知攻击中，攻击者针对特定的受害者模型进行越狱攻击，攻击者会根据受害者模型的响应进行迭代优化。GPTFuzzer [ 45] 将越狱过程视为一个模糊问题，以人类编写的模板作为初始种子，系统地生成攻击模板的变体，并从多个变体中选择最适合受害者模型的攻击模板。受社会工程学启发，PAIR [ 8] 利用攻击者 LLM 自动生成和优化对抗性查询，迭代增强针对目标 LLM 的候选越狱方法。ReNeLLM [ 11] 将越狱过程表述为系统性的提示重写和场景嵌套，以构建针对受害者模型的对抗性攻击，生成有效的越狱提示。TAP [ 29] 采用基于树的搜索策略，以高效地探索提示空间，并从受害者 LLM 中引出特定的有害行为。

Defending LLM Jailbreaking.
防御 LLM 越狱。

Various methods [1, 20, 53, 50, 33] have been proposed to defend against jailbreaks, including filtering malicious prompts at the input stage [1], goal prioritization that favors safety over helpfulness [51], and prompt perturbation [7]. Post-training methods such as SFT [5, 4] and RLHF [12, 3] are also widely used. Besides, Post-hoc detection methods [47, 19] are also used to ensure that jailbroken output is not presented to the user. STAIR-DPO [49] integrates safety alignment with introspective reasoning and Realsafe-r1 [48] uses safety-aware reasoning trajectories generated by DeepSeek-R1 for training to improve safety. OpenAI’s Deliberative Alignment [14] teaches models to explicitly reason over safety policies before generating a response, which shares similarities with our work. However, our approach differs in two key aspects: (1) we advocate for an “Answer-Then-Check” strategy that first attempts to answer the query and then analyzes safety, allowing potentially unsafe content in the reasoning process, and (2) our method doesn’t require specialized reasoning models like OpenAI o1 for training data creation, making it more accessible.
已经提出了多种方法[1, 20, 53, 50, 33]来防御越狱攻击，包括在输入阶段过滤恶意提示[1]、优先考虑安全而非有用性的目标排序[51]以及提示扰动[7]。SFT[5, 4]和 RLHF[12, 3]等训练后方法也广泛使用。此外，还使用后处理检测方法[47, 19]来确保越狱输出不会被呈现给用户。STAIR-DPO[49]将安全对齐与内省推理相结合，而 Realsafe-r1[48]则使用 DeepSeek-R1 生成的安全感知推理轨迹进行训练以提升安全性。OpenAI 的 Deliberative Alignment[14]教导模型在生成回复前明确推理安全策略，这与我们的工作有相似之处。然而，我们的方法在两个方面存在关键差异：(1)我们提倡“先回答后检查”策略，即先尝试回答查询，然后分析安全性，允许推理过程中出现潜在不安全内容，(2)我们的方法在训练数据创建时不需要像 OpenAI o1 这样的专用推理模型，使其更具可及性。

Long Chain-of-Thought (LongCoT).
长链式思维 (LongCoT)

In a LongCoT model, the model will first generate some “thinking trajectories” before generating the answer to the user. The thinking trajectories, usually wrapped in “<think>… </think>” structures, simulate the actual thinking process of humans and may be verbose yet meaningful. OpenAI o1 [32] first shows that LongCoT techniques can improve model reasoning capabilities on complicated tasks. Various works [18, 30] show that the reasoning capability can be achieved by doing supervised fine-tuning (SFT) on distilled datasets. In addition, the SFT dataset may also be generated by best-of-N [24] or MCTS [52] strategies. Recent works further show that RL can be applied to achieve state-of-the-art LongCoT performance [15, 40].
在一个 LongCoT 模型中，模型会在生成对用户的答案之前，首先生成一些“思考轨迹”。这些思考轨迹通常被包裹在“<think>… </think>”结构中，模拟人类的实际思考过程，可能冗长但有意义。OpenAI 的 o1 [ 32] 首次表明，LongCoT 技术可以提高模型在复杂任务上的推理能力。各种工作 [ 18, 30] 表明，通过在蒸馏数据集上进行监督微调（SFT），可以实现推理能力。此外，SFT 数据集也可以通过最佳-N [ 24] 或 MCTS [ 52] 策略生成。最近的工作进一步表明，强化学习可以应用于实现最先进的 LongCoT 性能 [ 15, 40]。

3 Approach 3 方法

This section presents the ReSA dataset construction pipeline, illustrated in Figure 2 and Algorithm 1. It consists of three stages: (1) collecting vanilla and adversarial queries from WJ with additional jailbreak methods; (2) generating intended answer summaries; and (3) synthesizing safety analyses.
本节介绍了 ReSA 数据集构建流程，如图 2 和算法 1 所示。它包含三个阶段：（1）从 WJ 收集常规查询和对抗性查询，并添加额外的越狱方法；（2）生成预期答案摘要；（3）合成安全分析。

3.1 Answer-Then-Check Response Construction
3.1Answer-Then-Check 响应构建

Our core philosophy is “Answer-Then-Check”: the model first generates a direct response, then a safety analysis determines whether to release or refuse it. Notably, we do not rely on existing reasoning models such as OpenAI o1 or DeepSeek R1 [15] to construct our data. All training data generation requires only general LLMs such as Llama3.3 [13] and Qwen2.5 [43] series.
我们的核心哲学是“Answer-Then-Check”：模型首先生成直接响应，然后安全分析决定是否发布或拒绝。值得注意的是，我们不依赖现有的推理模型，如 OpenAI o1 或 DeepSeek R1 [ 15] 来构建我们的数据。所有训练数据的生成仅需使用 Llama3.3 [ 13] 和 Qwen2.5 [ 43] 等通用 LLMs。

3.1.1 Reasoning Template 3.1.1 推理模板

Figure 3 illustrates our reasoning template for the “Answer-Then-Check” strategy. This template structures the model’s reasoning process into three key components: (1) summarization of intended answer, where the model formulates a concise representation of what it would naturally answer, even for harmful queries, facilitating the identification of safety issues; (2) safety analysis, where the model critically evaluates whether the intended answer summary complies with safety policies; and (3) final answer that either provides a natural answer or a refusal. Components (1) and (2) are wrapped in “<safety_check>” and “</safety_check>” tags, with component (1) specifically enclosed in “<intended_answer_summary>” and “</intended_answer_summary>” tags. Component (3), the final response, directly follows the “</intended_answer_summary>” tag. Only the content after “</safety_check>” is shown to the user. In summary, this template enforces a two-step process where the model first directly answers the query and then engages in safety thinking based on the intended answer summary and safety policies, thereby mitigating LLM jailbreaking vulnerabilities.
图 3 展示了我们为“先回答后检查”策略设计的推理模板。该模板将模型的推理过程结构化为三个关键组成部分：（1）意图回答的总结，模型将自然回答的内容进行简洁化表述，即使对于有害查询也是如此，从而有助于识别安全问题；（2）安全分析，模型批判性地评估意图回答总结是否符合安全政策；（3）最终回答，提供自然回答或拒绝。组成部分（1）和（2）被包裹在“<safety_check>”和“</safety_check>”标签中，其中组成部分（1）特别被包裹在“<intended_answer_summary>”和“</intended_answer_summary>”标签中。组成部分（3），即最终响应，直接位于“</intended_answer_summary>”标签之后。仅向用户展示“</safety_check>”标签之后的内容。总而言之，该模板执行了一个两步过程，模型首先直接回答查询，然后基于意图回答总结和安全政策进行安全思考，从而减轻 LLM 越狱漏洞。

Figure 3: The Answer-Then-Check reasoning template. The template structures the reasoning process into three parts: intended answer summary, safety analysis, and final response based on analysis.
图 3：Answer-Then-Check 推理模板。该模板将推理过程结构化为三个部分：预期答案摘要、安全分析以及基于分析的最后回应。

3.1.2 Summarization of Intended Answer
3.1.2 预期答案总结

The intended answer summary serves as the “Answer” component in our “Answer-Then-Check” strategy, representing the content the user expects the model to produce. Although the complete intended answer could also serve as the “Answer” component, we adopt the summary for computational efficiency. A key challenge is generating intended answers or summaries for harmful queries, as most modern LLMs are aligned to refuse them. Fortunately, uncensored models like Dolphin, which eliminate alignment from the fine-tuning data, can deliver high-quality responses to harmful queries.
预期答案总结作为我们“先回答后检查”策略中的“答案”部分，代表用户期望模型生成的内容。尽管完整的预期答案也可以作为“答案”部分，但我们采用总结形式以提高计算效率。一个关键挑战是生成有害查询的预期答案或总结，因为大多数现代 LLMs 都经过校准以拒绝它们。幸运的是，像 Dolphin 这样的未审查模型，其微调数据中排除了校准，能够对有害查询提供高质量的响应。

We use Dolphin-2.9.2-Qwen2-72B to generate intended answers for harmful queries and Qwen2.5-72B-Instruct for benign ones. For harmful queries, we retain only samples with responses deemed unsafe, while for benign queries, we keep only samples with safe responses, using Llama-Guard-3-8B as the classifier. With these intended answers, we prompt Qwen2.5-72B-Instruct to generate a concise intended answer summary. The specific prompts used are detailed in Figure 6.
我们使用 Dolphin-2.9.2-Qwen2-72B 生成有害查询的预期答案，并使用 Qwen2.5-72B-Instruct 生成良性查询的预期答案。对于有害查询，我们仅保留被认为不安全的响应样本，而对于良性查询，我们仅保留具有安全响应的样本，并使用 Llama-Guard-3-8B 作为分类器。利用这些预期答案，我们提示 Qwen2.5-72B-Instruct 生成简洁的预期答案总结。具体使用的提示详细说明在图 6 中。

3.1.3 Safety Analysis Synthesis
3.1.3 安全分析综合

The safety analysis serves as the “Check” component in our “Answer-Then-Check” strategy. Our objective is to anchor the model’s analysis in safety policies. To this end, the safety analysis synthesis component of our training dataset is constructed to teach the model to associate queries with the corresponding safety policies. For harmful queries, the LLM is prompted with the query, intended answer summary, and the relevant safety policy (with its definition) to generate a detailed safety analysis that specifies any compliance violations and explains the breached provisions. For benign queries, we provide the query, the intended answer summary, and a comprehensive list of unsafe types, and ask the LLM to justify why the content does not violate any policy. Llama3.3-70B-Instruct is used to generate the safety analysis. The unsafe type of each query is classified by Llama-Guard-3-8B [13]. The prompt templates used for safety analysis are provided in Figure 7.
安全分析是我们“先回答再检查”策略中的“检查”环节。我们的目标是将模型的分析与安全政策相结合。为此，我们训练数据集中的安全分析合成部分被构建用来教导模型将查询与相应的安全政策关联起来。对于有害查询，我们向 LLM 提供查询、预期答案摘要以及相关的安全政策（及其定义），要求其生成详细的安全分析，明确指出任何合规性违规行为并解释被违反的条款。对于良性查询，我们提供查询、预期答案摘要以及不安全类型的全面列表，并要求 LLM 说明为什么内容不违反任何政策。使用 Llama3.3-70B-Instruct 生成安全分析。每个查询的不安全类型由 Llama-Guard-3-8B [13]进行分类。用于安全分析提示模板如图 7 所示。

3.2 Safety Query Collection and Construction
3.2 安全查询收集与构建

In this subsection, we describe the data collection process and the use of the Answer-Then-Check strategy to construct the ReSA dataset. To balance jailbreak defense and over-refusal, the dataset comprises four categories: vanilla harmful, vanilla benign, adversarial harmful, and adversarial benign. Vanilla harmful queries are straightforwardly harmful, while vanilla benign ones are innocuous. The adversarial counterparts are generated through jailbreak techniques: adversarial harmful queries conceal malicious intent via complex prompting, whereas adversarial benign queries mimic jailbreak structures without harmful intent. Figure 5 illustrates examples of these four query categories.
在本小节中，我们描述了数据收集过程以及使用 Answer-Then-Check 策略构建 ReSA 数据集的方法。为了平衡越狱防御和过度拒绝，数据集包含四个类别：普通有害、普通无害、对抗有害和对抗无害。普通有害查询是直接有害的，而普通无害查询是无害的。对抗性对应物通过越狱技术生成：对抗有害查询通过复杂提示隐藏恶意意图，而对抗无害查询模仿越狱结构但无有害意图。图 5 展示了这四个查询类别的示例。

We adopt the WILDJAILBREAK (WJ) 262K dataset [22] as our initial query pool, which already covers four categories. To further enrich it, we apply three jailbreak techniques (PAIR [8], GPTFuzzer [45], and PAP [46]) to 10K vanilla harmful and 10K vanilla benign queries for each method. Qwen2.5-72B-Instruct is used as the attack model for all three methods, with Llama3.1-8B-Instruct serving as the victim for PAIR and GPTFuzzer. For GPTFuzzer, we retain the ten most effective prompt templates. The resulting adversarial queries are merged with WJ to form the raw training set, further supplemented with $1,000$ rejection-prone samples from MMLU auxiliary training set.
我们采用 WILDJAILBREAK（WJ）262K 数据集[22]作为初始查询池，该数据集已涵盖四个类别。为进一步丰富它，我们对每种方法分别对 10K 个普通有害查询和 10K 个普通良性查询应用三种越狱技术（PAIR[8]、GPTFuzzer[45]和 PAP[46]）。Qwen2.5-72B-Instruct 被用作所有三种方法的攻击模型，Llama3.1-8B-Instruct 作为 PAIR 和 GPTFuzzer 的受害者。对于 GPTFuzzer，我们保留了十个最有效的提示模板。生成的对抗性查询与 WJ 合并形成原始训练集，并进一步补充来自 MMLU 辅助训练集的 $1,000$ 易拒绝样本。

3.3 Filtering 3.3 过滤

Query Type 查询类型	Total Count 总数	Jailbreak 越狱 Method 方法	Sample 样本 Count 计数
Vanilla Harmful 香草有害	12,412	-	-
Vanilla Benign 香草良性	16,179	-	-
Adversarial 对抗性 Harmful 有害	22,763	WJ [22]	15,050
		PAIR [8]	3,359
		PAP [46]	3,999
		GPTFuzzer [45]	355
Adversarial 对抗性 Benign 良性	29,072	WJ [22]	19,822
		PAIR [8]	4,003
		PAP [46]	4,823
		GPTFuzzer [45]	424

Table 1: Distribution of data samples across different query types and jailbreak methods in the ReSA dataset (

80,426

samples in total).
表 1：ReSA 数据集中不同查询类型和越狱方法的样本分布（总样本数为

80,426

）。

We adopt a two-stage filtering process to ensure the quality of our dataset. In the first stage, we retain only benign query responses classified as safe and harmful query responses classified as unsafe, using Llama-Guard-3-8B as the classifier. In the second stage, we apply a rigorous filtering process to ensure high-quality safety analyses. Specifically, we remove samples containing internal inconsistencies, such as cases where the safety analysis concludes the response is unsafe yet states no safety policy is violated, or conversely, where the conclusion is safe but the analysis indicates policy violations. After this comprehensive filtering process, we obtain a dataset of $80,426$ samples, with the distribution of each data type shown in Table 1. Additionally, we randomly sample subsets of different sizes (0.1K, 0.5K, 1K, 5K) from the 80K dataset to investigate the minimum data required for safety alignment.
我们采用两阶段过滤流程来确保数据集的质量。第一阶段，我们仅保留被分类为安全的有益查询响应和被分类为不安全的有害查询响应，使用 Llama-Guard-3-8B 作为分类器。第二阶段，我们应用严格的过滤流程来确保高质量的安全分析。具体来说，我们移除包含内部不一致性的样本，例如安全分析得出响应不安全但声明未违反任何安全政策的情况，或者相反，结论为安全但分析表明存在政策违规的情况。经过这一全面的过滤流程后，我们获得了一个包含 $80,426$ 个样本的数据集，每种数据类型的分布情况如表 1 所示。此外，我们从 80K 数据集中随机抽取不同大小的子集（0.1K、0.5K、1K、5K）来研究安全对齐所需的最小数据量。

3.4 Safe Completion 3.4 安全完成

Safe completion requires models to respond in a sensitive and supportive manner, particularly for high-stakes cases such as self-harm, where outright refusal may be inappropriate or even harmful. To equip models with this capability, we use Llama Guard to extract self-harm samples ( $167$ vanilla harmful, $357$ adversarial harmful) from the training set. In constructing the safe completion training data, vanilla harmful queries are paired directly with responses from a general LLM as the final answer in the reasoning template, since our evaluation showed that general LLMs already handle vanilla self-harm queries with reasonably strong safe completion performance. For adversarial self-harm queries, we provided the corresponding vanilla harmful queries and asked the model to generate safe completion responses. The prompts used to construct this dataset are detailed in Figure 9. We find that even a small amount of carefully constructed data is sufficient for the model to learn the safe completion pattern. Moreover, even when faced with adversarial prompts, ReSA could identify malicious intent and produce appropriate, safety-aligned responses.
安全完成需要模型以敏感和支持性的方式回应，特别是在自杀等高风险情况下，直接拒绝可能不恰当甚至有害。为了赋予模型这种能力，我们使用 Llama Guard 从训练集中提取自杀样本（ $167$ 普通有害， $357$ 对抗性有害）。在构建安全完成训练数据时，普通有害查询直接与通用 LLM 的回应配对作为推理模板中的最终答案，因为我们的评估显示通用 LLM 已经以合理的安全完成性能处理普通自杀查询。对于对抗性自杀查询，我们提供了相应的普通有害查询，并要求模型生成安全完成回应。用于构建此数据集的提示详细如图 9 所示。我们发现即使是少量精心构建的数据也足以让模型学习安全完成模式。此外，即使面对对抗性提示，ReSA 也能识别恶意意图并产生适当的安全对齐回应。

3.5 Adaptive Answer-Then-Check strategy
3.5 自适应回答-检查策略

In fact, the additional ‘Answer’ and ‘Check’ steps will slow the model down, especially for normal queries where such a process is unnecessary. Therefore, we introduce the “Adaptive Answer-Then-Check” strategy as an alternative when high efficiency is required. The “Adaptive Answer-Then-Check” strategy aims to dynamically bypass the additional ‘Answer’ and ‘Check’ steps for normal questions, providing a direct response and effectively removing any additional overhead. This can be achieved by augmenting the training data with some instruction-tuning samples designed to elicit non-Answer-Then-Check, direct replies. In practice, we randomly sample $1,000$ instruction-tuning examples from the Tulu-3 SFT dataset [23], after filtering out refusal, math, and coding data.
事实上，额外的“回答”和“检查”步骤会减慢模型的运行速度，特别是对于不需要这种过程的常规查询。因此，当需要高效率时，我们引入“自适应回答-检查”策略作为替代方案。“自适应回答-检查”策略旨在动态地绕过常规问题的额外“回答”和“检查”步骤，直接提供响应并有效消除任何额外开销。这可以通过在训练数据中增加一些指令微调样本来实现，这些样本设计用于引出非“回答-检查”的直接回复。在实践中，我们从 Tulu-3 SFT 数据集[23]中随机采样 $1,000$ 个指令微调示例，在过滤掉拒绝、数学和编码数据后。

3.6 RL-based Answer-Then-Check strategy
3.6 基于 RL 的回答-检查策略

The Answer-Then-Check strategy can be directly applied in the RL setting to further improve the model’s safety robustness. Moreover, since the intended answer summary may still contain unsafe content, applying a corresponding safety reward to it can also enhance its safety. We require the model to follow the Answer-Then-Check strategy in the prompt. Given a question $q$ , the policy model $f_{\bm{\theta}}$ first generates the intended answer summary: $o_{\text{intended}}\sim f_{\bm{\theta}}(\cdot\mid q)$ , which tries to answer the question directly. The model then engages in a structured reasoning process, producing a reasoning sequence to check if $o_{\text{intended}}$ is safe: $o_{\text{check}}\sim f_{\bm{\theta}}(\cdot\mid q,o_{\text{intended}})$ . Finally, conditioned on both the intended answer summary and reasoning traces, the model outputs a final answer: $o_{\text{ans}}\sim f_{\bm{\theta}}(\cdot\mid q,o_{\text{intended}},o_{\text{check}})$ . We denote the full output of a rollout as $o=(o_{\text{intended}},o_{\text{check}},o_{\text{ans}})$ . We use GRPO [35] to train our model. For the $i$ -th rollout, the reward is as:
答案-然后-检查策略可以直接应用于强化学习环境中，以进一步提高模型的安全鲁棒性。此外，由于预期的答案摘要可能仍然包含不安全内容，对其应用相应的安全奖励也可以增强其安全性。我们要求模型在提示中遵循答案-然后-检查策略。对于一个问题 $q$ ，策略模型 $f_{\bm{\theta}}$ 首先生成预期的答案摘要： $o_{\text{intended}}\sim f_{\bm{\theta}}(\cdot\mid q)$ ，它试图直接回答问题。然后模型进行结构化推理过程，产生一个推理序列来检查 $o_{\text{intended}}$ 是否安全： $o_{\text{check}}\sim f_{\bm{\theta}}(\cdot\mid q,o_{\text{intended}})$ 。最后，基于预期的答案摘要和推理轨迹，模型输出一个最终答案： $o_{\text{ans}}\sim f_{\bm{\theta}}(\cdot\mid q,o_{\text{intended}},o_{\text{check}})$ 。我们将一个完整回合的输出表示为 $o=(o_{\text{intended}},o_{\text{check}},o_{\text{ans}})$ 。我们使用 GRPO [ 35]来训练我们的模型。对于 $i$ -次回合，奖励如下：

r=\begin{cases}\lambda_{\text{safety}}\cdot(R_{\text{safety}}(o_{\text{intended}})+R_{\text{safety}}(o_{\text{ans}}))+\lambda_{\text{format}}\cdot R_{\text{format}}(o)&q\in\mathcal{H}\\ \lambda_{\text{safety}}\cdot(R_{\text{safety}}(o_{\text{intended}})+R_{\text{safety}}(o_{\text{ans}}))+\lambda_{\text{format}}\cdot R_{\text{format}}(o)+\lambda_{\text{refusal}}\cdot R_{\text{refusal}}(o_{\text{ans}})&q\in\mathcal{B}\\ \lambda_{\text{format}}\cdot R_{\text{format}}(o)+\lambda_{\text{refusal}}\cdot R_{\text{refusal}}(o_{\text{ans}})&q\in\mathcal{N}\end{cases},

where $\mathcal{H}$ and $\mathcal{B}$ denote the harmful and benign query sets, respectively, and $\mathcal{N}$ denotes the normal question set for learning the “Adaptive Answer-Then-Check” strategy. Queries in $\mathcal{B}$ remain safe but are closer to the decision boundary than those in $\mathcal{N}$ . Each reward component is binary, yielding $1$ when satisfied and $0$ otherwise, i.e., $R_{\text{safety}}(\cdot),R_{\text{refusal}}(\cdot),R_{\text{format}}(\cdot)\in\{0,1\}$ . Specifically,
其中 $\mathcal{H}$ 和 $\mathcal{B}$ 分别表示有害查询集和良性查询集， $\mathcal{N}$ 表示用于学习“自适应先回答后检查”策略的正常问题集。 $\mathcal{B}$ 中的查询保持安全，但比 $\mathcal{N}$ 中的查询更接近决策边界。每个奖励组件都是二元的，满足时产生 $1$ ，不满足时产生 $0$ ，即 $R_{\text{safety}}(\cdot),R_{\text{refusal}}(\cdot),R_{\text{format}}(\cdot)\in\{0,1\}$ 。具体来说，

•

The safety reward $R_{\text{safety}}(\cdot)$ encourages appropriate handling of harmful queries. We employ LlamaGuard as the reward model to evaluate whether the model identifies harmful intent and provides a safe response. This reward is also used to evaluate the intended answer summary to ensure that the model produces safe content throughout the entire generation process.

• 安全奖励 $R_{\text{safety}}(\cdot)$ 鼓励适当处理有害查询。我们采用 LlamaGuard 作为奖励模型，评估模型是否识别有害意图并提供安全回应。此奖励也用于评估预期答案摘要，以确保模型在整个生成过程中产出安全内容。
•

The refusal reward $R_{\text{refusal}}(\cdot)$ promotes providing helpful answers on benign queries. We use Qwen2.5-7B-Instruct to assess whether the final answer refuses to respond to a benign query.

• 拒绝奖励 $R_{\text{refusal}}(\cdot)$ 促进在良性查询上提供有帮助的答案。我们使用 Qwen2.5-7B-Instruct 评估最终答案是否拒绝回应良性查询。
•

This rule-based reward $R_{\text{format}}(\cdot)$ enforces the Answer-Then-Check structure for queries in $\mathcal{B}$ and $\mathcal{H}$ , requiring the model to generate an intended answer summary, a safety analysis, and a final answer in the correct format, while discouraging the use of this pattern for queries in $\mathcal{N}$ .

• 此基于规则的奖励 $R_{\text{format}}(\cdot)$ 强制执行 $\mathcal{B}$ 和 $\mathcal{H}$ 中的答案-检查结构，要求模型以正确格式生成预期答案摘要、安全分析和最终答案，同时避免在 $\mathcal{N}$ 中的查询使用此模式。

Implementation details including prompts and reward coefficient setup are provided in Appendix 12.1.5.
实现细节，包括提示和奖励系数设置，在附录 12.1.5 中提供。

4 Experiments 4 实验

In this section, we train LLMs on our constructed safety dataset and evaluate against various jailbreak methods. We first describe the experimental setup, followed by the main experimental results and ablation studies, to demonstrate the effectiveness of our approach.
在本节中，我们在我们构建的安全数据集上训练 LLMs，并针对各种越狱方法进行评估。我们首先描述实验设置，然后是主要的实验结果和消融研究，以展示我们方法的有效性。

4.1 Experiment Setups 4.1 实验设置

Training Details. 训练细节。

We perform SFT on our dataset using Llama3.1-8B-Instruct [13] and Qwen2.5-7B-Instruct [43] with TRL 0.16.0 [41]. Models are trained for $2$ epochs in bfloat16 with AdamW and a cosine schedule (learning rate $5\times 10^{-6}$ , $10\%$ warmup), maximum sequence length $8192$ , on $8\times$ H100 GPUs with per-device batch size $2$ and $2$ gradient accumulation steps. All other settings remain consistent across experiments. For details of the RL training, please refer to Appendix 12.1.5.
我们使用 Llama3.1-8B-Instruct [ 13] 和 Qwen2.5-7B-Instruct [ 43] 在 TRL 0.16.0 [ 41] 上对我们的数据集进行 SFT。模型使用 bfloat16 格式，以 AdamW 优化器和余弦调度（学习率 $5\times 10^{-6}$ ， $10\%$ 预热）进行训练，最大序列长度 $8192$ ，在 $8\times$ 块 H100 GPU 上，每块设备的批处理大小为 $2$ ，梯度累积步数为 $2$ 。所有其他设置在实验中保持一致。关于 RL 训练的详细信息，请参阅附录 12.1.5。

Defense Baselines. 防御基线。

We compare with $13$ baselines across five categories: fine-tuned models (WJ-SFT [22], STAIR [49], Realsafe-r1 [48], and our implementation of OpenAI-Deliberative Alignment [14]), Post-hoc detection (Llama-Guard [13] and GuardReasoner [27]), advanced general LLMs (gpt-4.1-20250414, claude-sonnet-4-20250514, deepseek-v3-20250324), advanced reasoning models with self-reflection (deepseek-r1-20250528, o4-mini-20250416), and general LLMs with prompt engineering (goal priority defense [51]). Full details are provided in Appendix 12.2.
我们与 $13$ 个基线进行了比较，分为五个类别：微调模型（WJ-SFT [ 22]、STAIR [ 49]、Realsafe-r1 [ 48] 和我们实现的 OpenAI-Deliberative Alignment [ 14]）、后处理检测（Llama-Guard [ 13] 和 GuardReasoner [ 27]）、高级通用 LLM（gpt-4.1-20250414、claude-sonnet-4-20250514、deepseek-v3-20250324）、具有自我反思的高级推理模型（deepseek-r1-20250528、o4-mini-20250416）以及使用提示工程的通用 LLM（目标优先防御 [ 51]）。详细信息见附录 12.2。

Base Model 基础模型	Evaluator 评估器	Method 方法	None 无	PAIR 配对 -GPT	PAIR	PAP	GPT- Fuzzer 模糊测试器	ReNe- LLM	TAP	DeepIn- ception	Avg
Llama3.1- Llama3.1 8B-Instruct	Llama Guard 守卫	Base 基础	0.9968	0.3514	0.2620	0.6486	0.1374	0.6613	0.4249	0.5240	0.5008
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	1.0000	0.4633	0.5080	0.7157	0.9968	0.9297	0.6581	0.9776	0.7812
		STAIR-DPO	1.0000	0.6837	0.4217	0.9425	1.0000	0.8339	0.6933	0.9872	0.8203
		WJ-SFT	0.9936	0.4473	0.3291	0.7604	0.9425	0.6773	0.6038	0.9840	0.7173
		ReSA-SFT (Ours)	0.9936	0.8978	0.6965	0.9681	0.9553	0.8818	0.8498	0.9936	0.9046
		ReSA-RL (Ours)	1.0000	0.9872	0.9681	0.9968	1.0000	0.9968	0.9968	1.0000	0.9932
	Fine-tuned StrongREJECT Evaluator [38]	Base	0.9880	0.4660	0.4509	0.6592	0.2957	0.7496	0.4840	0.5674	0.5826
		Post-hoc (LlamaGuard)	0.9909	0.5511	0.6441	0.7143	0.9833	0.9410	0.6704	0.9132	0.8010
		STAIR-DPO	0.9992	0.8076	0.6814	0.9515	0.9992	0.9048	0.7777	0.9926	0.8892
		WJ-SFT	0.9858	0.6160	0.5691	0.7961	0.9709	0.8786	0.6615	0.9811	0.8074
		ReSA-SFT (Ours) ReSA-SFT (我们的)	0.9808	0.8952	0.7571	0.9608	0.9591	0.9519	0.8436	0.9758	0.9155
		ReSA-RL (Ours) ReSA-RL (我们的)	0.9863	0.9814	0.9650	0.9788	0.9908	0.9823	0.9871	0.9900	0.9827
	Harm- -Bench Classifier 分类器	Base 基础模型	0.9872	0.6262	0.5815	0.7923	0.2013	0.7604	0.4952	0.7764	0.6526
		Post-hoc (LlamaGuard) 后处理（LlamaGuard）	0.9904	0.7093	0.7668	0.8466	0.9968	0.9712	0.7157	0.9712	0.8710
		STAIR-DPO	1.0000	0.9105	0.8786	0.9872	0.9968	0.9393	0.8658	0.9904	0.9461
		WJ-SFT	0.9904	0.7476	0.6901	0.8754	0.9649	0.8786	0.6613	0.9872	0.8494
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9872	0.9617	0.9010	0.9840	0.9585	0.9808	0.8914	0.9968	0.9577
		ReSA-RL (Ours) ReSA-RL (我们的方法)	0.9968	0.9968	0.9968	0.9936	0.9968	0.9936	0.9968	0.9968	0.9960
Qwen2.5- 7B-Instruct	Llama Guard	Base	0.9744	0.2173	0.1086	0.3866	0.1917	0.0863	0.1693	0.3706	0.3131
		Post-hoc (LlamaGuard) 后处理（LlamaGuard）	1.0000	0.3610	0.5783	0.5815	0.9840	0.9137	0.6933	0.9489	0.7576
		STAIR-DPO^∗	1.0000	0.6677	0.3514	0.9457	1.0000	0.5591	0.6965	0.9649	0.7732
		WJ-SFT	0.9936	0.3387	0.2780	0.6869	0.9904	0.5495	0.4058	0.9521	0.6494
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9904	0.8435	0.7188	0.9489	0.9776	0.8466	0.8562	0.9808	0.8953
		ReSA-RL (Ours) ReSA-RL（我们的方法）	1.0000	0.9936	0.9617	1.0000	1.0000	0.9169	0.9968	1.0000	0.9836
	Fine-tuned 微调 StrongREJECT Evaluator [38] 评估器 [ 38]	Base 基础	0.9080	0.3992	0.3286	0.4282	0.4191	0.3511	0.3202	0.4424	0.4496
		Post-hoc (LlamaGuard) 事后 (LlamaGuard)	0.9248	0.5134	0.6702	0.5854	0.9930	0.9502	0.7254	0.8419	0.7755
		STAIR-DPO^∗	0.9991	0.7736	0.6384	0.9411	0.9991	0.7484	0.7476	0.9810	0.8535
		WJ-SFT	0.9915	0.5536	0.4994	0.7334	0.9825	0.7631	0.5127	0.9596	0.7495
		ReSA-SFT (Ours) ReSA-SFT (我们的)	0.9797	0.8674	0.7438	0.9500	0.9242	0.9353	0.8438	0.9725	0.9021
		ReSA-RL (Ours) ReSA-RL (我们的)	0.9902	0.9833	0.9320	0.9837	0.9929	0.9550	0.9726	0.9899	0.9749
	Harm- -Bench Classifier 分类器	Base 基础	0.9712	0.6038	0.3291	0.7220	0.3706	0.2620	0.2652	0.7125	0.5295
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	0.9936	0.7252	0.7093	0.8498	0.9936	0.9585	0.7412	0.9776	0.8686
		STAIR-DPO^∗	0.9968	0.9137	0.8403	0.9936	0.9968	0.7316	0.8083	0.9968	0.9097
		WJ-SFT	0.9936	0.6901	0.6006	0.8019	0.9936	0.7572	0.4792	0.9681	0.7855
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9840	0.9393	0.9201	0.9744	0.9585	0.9681	0.9010	0.9936	0.9549
		ReSA-RL (Ours) ReSA-RL（我们的方法）	0.9968	0.9968	0.9904	0.9936	1.0000	0.9681	1.0000	0.9968	0.9928

Table 2: Safety performance against different jailbreak methods on the StrongREJECT benchmark, evaluated by three evaluators. The base model for STAIR-DPO^∗ is Qwen2-7B-Instruct. LlamaGuard and HarmBench classifier use DSR as the metric, while the fine-tuned StrongREJECT evaluator uses the goodness score; all metrics range from

0

1

. Black bold: best; Underlining: second best.
表 2：StrongREJECT 基准上针对不同越狱方法的防护性能，由三位评估者进行评估。STAIR-DPO ^∗ 的基础模型为 Qwen2-7B-Instruct。LlamaGuard 和 HarmBench 分类器使用 DSR 作为指标，而微调的 StrongREJECT 评估器使用良好度分数；所有指标范围从

0

到

1

。黑粗体：最佳；下划线：次优。

Attack Methods. 攻击方法。

We use PAIR [8], PAP [46], GPTFuzzer [45], ReNeLLM [11], TAP [29], DeepInception [25], and GCG [57] as the attack methods. Among these, PAIR, GPTFuzzer, ReNeLLM, and TAP are adaptive attacks dynamically optimizing adversarial queries based on the target model’s responses. PAIR-GPT is generated with GPT-4o-mini as the victim model to measure other models’ robustness against transferable jailbreaks, and GCG is a white-box attack requiring logits. For detailed implementations of each jailbreak method, please refer to Appendix 12.3.
我们使用 PAIR [ 8]、PAP [ 46]、GPTFuzzer [ 45]、ReNeLLM [ 11]、TAP [ 29]、DeepInception [ 25] 和 GCG [ 57] 作为攻击方法。其中，PAIR、GPTFuzzer、ReNeLLM 和 TAP 是自适应攻击，它们根据目标模型的响应动态优化对抗查询。PAIR-GPT 是使用 GPT-4o-mini 作为受害者模型生成的，用于测量其他模型对可迁移越狱的鲁棒性，而 GCG 是一种需要 logits 的白盒攻击。有关每个越狱方法的详细实现，请参阅附录 12.3。

Evaluation Datasets. 评估数据集。

We evaluate models across three dimensions: safety, general performance, and safe completion to assess robust jailbreak resistance while preserving overall capabilities. For safety evaluation, we test jailbreaks on $3$ datasets, including StrongREJECT [38], AdvBench [57], and HarmBench [28]. For general capabilities, we evaluate two primary aspects: reasoning ability and over-refusal tendency. Reasoning ability is evaluated on the MATH500 (mathematical) [17], HumanEval (coding) [9], and MMLU (general) [16] datasets. Over-refusal tendency is measured on the XSTest [34], OKTest [37], and WJ-Eval (adversarial benign) [22] datasets. For safe completion, we extract all self-harm queries from StrongREJECT, HarmBench, and AdvBench, and apply PAIR, PAP, and DeepInception to these queries, resulting in $153$ test instances.
我们在三个维度上评估模型：安全性、通用性能和安全性完成度，以评估强大的越狱抵抗能力，同时保持整体能力。对于安全性评估，我们在 $3$ 数据集上测试越狱，包括 StrongREJECT [ 38]、AdvBench [ 57]和 HarmBench [ 28]。对于通用能力，我们评估两个方面：推理能力和过度拒绝倾向。推理能力在 MATH500（数学）[ 17]、HumanEval（编程）[ 9]和 MMLU（通用）[ 16]数据集上进行评估。过度拒绝倾向在 XSTest [ 34]、OKTest [ 37]和 WJ-Eval（对抗良性）[ 22]数据集上进行测量。对于安全性完成度，我们从 StrongREJECT、HarmBench 和 AdvBench 中提取所有自我伤害查询，并将 PAIR、PAP 和 DeepInception 应用于这些查询，生成 $153$ 测试实例。

Evaluators and Metrics. 评估器和指标。

In terms of safety evaluation, we evaluate the defense success using three safety evaluators: LlamaGuard (Llama-Guard-3-8B) [13], the fine-tuned StrongREJECT evaluator released by StrongREJECT [38], and the HarmBench classifier (HarmBench-Llama-2-13B-cls) [28]. For LlamaGuard and the HarmBench classifier, we report Defense Success Rate (DSR) as the metric, defined as the proportion of responses in the full test set that are judged as safe. For the fine-tuned StrongREJECT evaluator, we take the goodness score [14, 49], which is 1 - fine-tuned score, as the metric. The fine-tuned score is the normalized output of the evaluator. Both DSR and the goodness score range from $0$ to $1$ , with higher values indicating better defense performance. For over-refusal tendency, we use Llama3.3-70B-Instruct as the judge model and report over-refusal accuracy (1 - the over-refusal rate) as the metric. For general reasoning capability, we use accuracy as the metric. For safe completion, we use Qwen2.5-72B-Instruct and Llama3.3-70B-Instruct as evaluators to compare two responses, assigning a score of $1$ to the better response and $0$ to the worse one. For details on how the evaluators conduct the evaluation, please refer to Appendix 12.4.
在安全评估方面，我们使用三个安全评估器来评估防御成功率：LlamaGuard（Llama-Guard-3-8B）[13]、StrongREJECT 发布的微调 StrongREJECT 评估器[38]以及 HarmBench 分类器（HarmBench-Llama-2-13B-cls）[28]。对于 LlamaGuard 和 HarmBench 分类器，我们以防御成功率（DSR）作为指标，定义为在完整测试集中被判定为安全的响应比例。对于微调的 StrongREJECT 评估器，我们以良好度分数[14, 49]作为指标，即 1 减去微调分数，微调分数是评估器的归一化输出。DSR 和良好度分数的范围都在 $0$ 到 $1$ 之间，数值越高表示防御性能越好。对于过度拒绝倾向，我们使用 Llama3.3-70B-Instruct 作为判断模型，并报告过度拒绝准确率（1 减去过度拒绝率）作为指标。对于一般推理能力，我们使用准确率作为指标。对于安全完成，我们使用 Qwen2.5-72B-Instruct 和 Llama3.3-70B-Instruct 作为评估器来比较两个响应，给更好的响应 $1$ 分，给较差的响应 $0$ 分。关于评估者如何进行评估的详细信息，请参阅附录 12.4。

4.2 Main Results 4.2 主要结果

Safety Performance. 安全性能。

Table 2 presents the safety performance across various jailbreak methods evaluated by three evaluators. ReSA-trained models consistently outperform baselines across all evaluators. With Llama-Guard-3-8B, ReSA-SFT (Llama3.1-8B-Instruct) attains an average score of $0.9046$ , surpassing post-hoc detection ( $0.7812$ ), STAIR-DPO ( $0.8203$ ), and WJ-SFT ( $0.7173$ ); ReSA-RL further boosts this to $0.9932$ . Since Llama-Guard-3-8B is used as the RL reward model, we additionally evaluate with the fine-tuned StrongREJECT evaluator and HarmBench classifier, where ReSA-RL also achieves the best performance ( $0.9827$ and $0.9960$ , respectively). Results on AdvBench (Table 7) and HarmBench (Table 8) further confirm our method’s superiority.
表 2 展示了三位评估者对各种越狱方法的安全性能评估结果。ReSA 训练的模型在所有评估者中均优于基线模型。使用 Llama-Guard-3-8B 时，ReSA-SFT（Llama3.1-8B-Instruct）平均得分为 $0.9046$ ，超过后处理检测（ $0.7812$ ）、STAIR-DPO（ $0.8203$ ）和 WJ-SFT（ $0.7173$ ）；ReSA-RL 进一步将其提升至 $0.9932$ 。由于 Llama-Guard-3-8B 被用作强化学习奖励模型，我们额外使用微调的 StrongREJECT 评估器和 HarmBench 分类器进行评估，ReSA-RL 在这些评估中也取得了最佳性能（分别达到 $0.9827$ 和 $0.9960$ ）。AdvBench（表 7）和 HarmBench（表 8）上的结果进一步证实了我们方法的优势。

Our method shows strong robustness to adaptive jailbreaks.While WJ-SFT barely improves over the base model against PAIR ( $0.3291$ vs. $0.2620$ , Llama-Guard-3-8B), ReSA-SFT reaches $0.6965$ and ReSA-RL achieves $0.9681$ , demonstrating that RL further closes the remaining gap. Note that PAIR prompts used during evaluation are dynamically generated for each target model, distinct from the training set. On the unseen adaptive attack like TAP, ReSA-SFT and ReSA-RL score $0.8498$ and $0.9968$ , substantially exceeding all baselines, highlighting the generalization of “Answer-Then-Check”. We further evaluate against GCG (Table 11), prefilling attack (Table 13), and AutoDAN-Turbo [26] (Table 12), where ReSA-SFT consistently outperforms the base model and WJ-SFT.
我们的方法对自适应越狱攻击表现出很强的鲁棒性。虽然 WJ-SFT 在 PAIR（ $0.3291$ vs. $0.2620$ ，Llama-Guard-3-8B）上几乎没有超过基础模型的改进，但 ReSA-SFT 达到了 $0.6965$ ，ReSA-RL 实现了 $0.9681$ ，这表明强化学习进一步缩小了剩余差距。请注意，评估过程中使用的 PAIR 提示是针对每个目标模型动态生成的，与训练集不同。在未知的自适应攻击如 TAP 上，ReSA-SFT 和 ReSA-RL 分别得分 $0.8498$ 和 $0.9968$ ，显著超过了所有基线，突出了“先回答后检查”的泛化能力。我们进一步在 GCG（表 11）、预填充攻击（表 13）和 AutoDAN-Turbo [26]（表 12）上进行了评估，其中 ReSA-SFT 始终优于基础模型和 WJ-SFT。

Base Model 基础模型	Method 方法	Over-refusal Benchmarks 过度拒绝基准			Average 平均	General Reasoning Benchmarks 通用推理基准			Average 平均
Base Model 基础模型	Method 方法	XSTest	OKTest	WJ-Eval	Average 平均	MATH500	HumanEval	MMLU	Average 平均
Llama3.1-8B -Instruct	Base 基础	93.60%	85.00%	99.20%	93.27%	50.60%	65.85%	69.09%	61.85%
	Post-hoc (LlamaGuard) 事后（LlamaGuard）	93.60%	85.00%	98.80%	92.47%	50.60%	65.85%	68.21%	61.55%
	STAIR-DPO	64.00%	77.33%	89.60%	76.98%	49.60%	63.41%	71.12%	61.38%
	WJ-SFT	94.80%	85.67%	96.40%	92.29%	42.60%	58.54%	62.20%	54.45%
	ReSA-SFT (Ours) ReSA-SFT（我们）	97.20%	88.67%	99.20%	95.02%	49.00%	64.02%	66.32%	59.78%
	ReSA-RL (Ours) ReSA-RL（我们）	99.20%	95.33%	96.00%	96.84%	46.20%	60.37%	66.16%	57.58%
Qwen2.5-7B -Instruct	Base	94.40%	85.00%	99.20%	92.87%	77.00%	82.32%	74.68%	78.00%
	Post-hoc (LlamaGuard)	94.40%	85.00%	98.80%	92.73%	77.00%	82.32%	73.68%	77.67%
	STAIR-DPO^∗	58.40%	77.00%	90.00%	75.13%	56.00%	71.34%	68.65%	65.33%
	WJ-SFT	94.80%	83.00%	97.20%	91.66%	70.40%	76.83%	69.02%	72.08%
	ReSA-SFT (Ours) ReSA-SFT (我们)	96.40%	88.67%	98.40%	94.49%	74.80%	79.27%	72.44%	75.50%
	ReSA-RL (Ours) ReSA-RL (我们)	99.60%	99.67%	88.80%	96.02%	75.40%	80.49%	72.26%	76.05%

Table 3: General capabilities on over-refusal benchmarks and general reasoning benchmarks. The base model for STAIR-DPO^∗ is Qwen2-7B-Instruct. Over-refusal is measured by over-refusal accuracy, and general reasoning by accuracy. Black bold: best; Underlining: second best.
表 3：在过度拒绝基准测试和一般推理基准测试上的通用能力。STAIR-DPO ^∗ 的基础模型是 Qwen2-7B-Instruct。过度拒绝通过过度拒绝准确率衡量，一般推理通过准确率衡量。黑粗体：最佳；下划线：第二最佳。

Defense 防御 Categories 类别	Method 方法	Safety 安全		Average 平均	Over-refusal 过度拒绝		Average 平均
Defense 防御 Categories 类别	Method 方法	PAIR-GPT	PAP	Average 平均	XSTest	OKTest	Average 平均
Post-hoc defense 事后防御	GuardReasoner	0.4569	0.6773	0.5671	0.9320	0.8400	0.8860
Fine-tuning defense 微调防御	Realsafe-r1	0.7284	0.9808	0.8546	0.5160	0.5967	0.5565
Fine-tuning defense 微调防御	OpenAI-Deliberative Alignment^∗ OpenAI-审慎对齐 ^∗	0.8466	0.9553	0.9000	0.9720	0.8767	0.9244
SOTA General LLM SOTA 通用 LLM	gpt-4.1-20250414	0.3131	0.5463	0.4297	0.9440	0.8933	0.9187
	claude-sonnet-4-20250514	0.8466	0.9425	0.8946	0.8960	0.7433	0.8197
	deepseek-v3-20250324	0.1757	0.5304	0.3531	0.9480	0.9100	0.9290
SOTA General LLM with 最先进的通用 LLM 具有 goal priority defense 目标优先级防御	gpt-4.1-20250414	0.7220	0.8530	0.7875	0.9080	0.9033	0.9057
	deepseek-v3-20250324	0.8435	0.7571	0.8003	0.8120	0.8033	0.8077
SOTA Reasoning LLM SOTA 推理 LLM with Safety Reflection 带安全反思	deepseek-r1-20250528	0.6997	0.8211	0.7604	0.8080	0.6600	0.7340
SOTA Reasoning LLM SOTA 推理 LLM with Safety Reflection 带安全反思	o4-mini-20250416	0.7476	0.8562	0.8019	0.9000	0.9100	0.9050
Answer-Then-Check	ReSA-SFT (Ours) ReSA-SFT (我们)	0.8978	0.9681	0.9330	0.9720	0.8867	0.9294
Answer-Then-Check	ReSA-RL (Ours) ReSA-RL (我们)	0.9872	0.9968	0.9920	0.9920	0.9533	0.9727

Table 4: Compared with advanced models and other defenses, LlamaGuard is the safety evaluator. Since claude-sonnet-4-20250514 already exhibits a high over-refusal rate, we don’t apply goal priority defense to it. ^∗ indicates implemented by ourselves. Safety is measured by DSR, and over-refusal is measured by over-refusal accuracy. Black bold indicates the best, and underlining the second best.
表 4：与先进模型和其他防御相比，LlamaGuard 是安全评估器。由于 claude-sonnet-4-20250514 已经表现出很高的拒绝率，我们不对它应用目标优先防御。 ^∗ 表示由我们实现。安全性由 DSR 衡量，拒绝率由拒绝率准确度衡量。黑色粗体表示最佳，下划线表示次优。

General Performance. 总体性能。

Table 3 demonstrates that our approach not only enhances safety but also maintains low over-refusal tendencies. ReSA-RL achieves the best over-refusal accuracy (e.g., $99.20\%$ and $99.60\%$ on XSTest for Llama and Qwen, respectively), and ReSA-SFT is the second best. This indicates that our method effectively distinguishes between benign and harmful queries. Although STAIR-DPO achieves good performance in jailbreak defense, it shows poor over-refusal performance, rejecting many benign samples. Additionally, the results in Table 3 demonstrate that ReSA-trained models successfully maintain the models’ general reasoning capabilities while enhancing safety. Across mathematical reasoning, coding, and general knowledge tasks, ReSA-trained models show competitive performance compared to baselines.
表 3 表明，我们的方法不仅提高了安全性，还保持了较低的过度拒绝倾向。ReSA-RL 实现了最佳的过度拒绝准确率（例如，在 XSTest 上，Llama 和 Qwen 分别达到了 $99.20\%$ 和 $99.60\%$ ），而 ReSA-SFT 则是第二好的。这表明我们的方法能够有效区分良性查询和有害查询。尽管 STAIR-DPO 在越狱防御方面表现良好，但其过度拒绝性能较差，拒绝了许多良性样本。此外，表 3 中的结果表明，ReSA 训练的模型在提高安全性的同时，成功保持了模型的通用推理能力。在数学推理、编程和一般知识任务中，与基线相比，ReSA 训练的模型表现出具有竞争力的性能。

Compare with Advanced General/Reasoning LLMs.
与高级通用/推理 LLMs 相比。

We provide a comparison with strong general and reasoning LLMs in Table 4. Due to the high API tokens required for adaptive jailbreaks, we applied PAIR-GPT and PAP only on the StrongREJECT dataset. Both ReSA-SFT and ReSA-RL provide more robust defense than current SOTA models and specialized safety methods, including post-hoc, fine-tuning, and inference-time defenses. While prompt engineering boosts safety (e.g., deepseek-v3), it severely degrades over-refusal accuracy ( $-13.60\%$ on XSTest). In contrast, ReSA-RL achieves the best safety (avg $0.9920$ ) and over-refusal (avg $0.9727$ ) simultaneously. Furthermore, we implemented an open-source version of Deliberative Alignment, and its safety performance was also inferior to our strategy, validating the effectiveness of the “Answer-Then-Check” approach. In summary, ReSA-RL achieves the Pareto frontier with the strongest safety capability and lowest over-refusal rates, surpassing all compared SOTA models and defense methods.
我们在表 4 中提供了与强大通用和推理 LLMs 的比较。由于自适应越狱需要大量的 API 令牌，我们仅在 StrongREJECT 数据集上应用了 PAIR-GPT 和 PAP。ReSA-SFT 和 ReSA-RL 比当前的 SOTA 模型和专门的安全方法（包括事后处理、微调和推理时防御）提供更稳健的防御。虽然提示工程可以提高安全性（例如，deepseek-v3），但它严重降低了过度拒绝的准确性（在 XSTest 上的 $-13.60\%$ ）。相比之下，ReSA-RL 同时实现了最佳安全性（平均 $0.9920$ ）和过度拒绝（平均 $0.9727$ ）。此外，我们实现了一个开源的 Deliberative Alignment 版本，其安全性表现也劣于我们的策略，验证了“先回答后检查”方法的有效性。总之，ReSA-RL 达到了具有最强安全能力和最低过度拒绝率的帕累托前沿，超越了所有比较的 SOTA 模型和防御方法。

Safe Completion. 安全完成。

Table 5 shows that compared to the base model and post-hoc (LlamaGuard) methods, ReSA-SFT delivers significantly more helpful and appropriate responses to sensitive queries. Moreover, it effectively identifies sensitive information even under adversarial prompts, ensuring safer and more appropriate outputs. For case studies, please refer to Figure 26.
表 5 显示，与基础模型和后处理（LlamaGuard）方法相比，ReSA-SFT 对敏感查询提供了显著更有帮助和恰当的回应。此外，它即使在对抗性提示下也能有效识别敏感信息，确保更安全、更恰当的输出。案例研究请参见图 26。

4.3 Ablation Studies 4.3 消融研究

Evaluator 评估器	Qwen2.5-72b-Instruct		Llama3.3-70b-Instruct
Base Model 基础模型	ReSA-SFT vs. All Refusal 所有拒绝	ReSA-SFT vs. Post-hoc 与事后 (LlamaGuard)	ReSA-SFT vs. All Refusal 所有拒绝	ReSA-SFT vs. Post-hoc vs. 事后分析 (LlamaGuard)
Llama3.1-8B -Instruct	0.9510	0.8203	0.9444	0.8333
Qwen2.5-7B -Instruct	0.8758	0.7026	0.9052	0.7026

Table 5: Safe Completion performance (higher is better).

0.5

denotes parity between ReSA-SFT and the baseline;

1

means ReSA-SFT performs better, and

0

means the baseline performs better.
表 5：安全完成性能（越高越好）。

0.5

表示 ReSA-SFT 与基线持平；

1

表示 ReSA-SFT 表现更好，

0

表示基线表现更好。

To examine the effect of training size, we sampled 0.1K, 0.5K, 1K, and 5K subsets from the 80K ReSA dataset. Since these subsets are substantially smaller than the full dataset, we trained Qwen2.5-7B-Instruct for $15$ epochs with reduced batch size, keeping other hyperparameters consistent with Section 4.1. We fine-tuned Qwen2.5-7B-Instruct on ReSA subsets of different sizes and evaluated safety with None, PAIR-GPT, PAP, and DeepInception due to time constraints. As shown in Figure 4, even 0.5K samples yield strong robustness and generalization, surpassing larger datasets without “Answer-Then-Check”, suggesting efficient safety alignment is achievable with minimal data.
为检验训练规模的影响，我们从 80K ReSA 数据集中采样了 0.1K、0.5K、1K 和 5K 个子集。由于这些子集远小于完整数据集，我们使用减小后的批处理大小训练 Qwen2.5-7B-Instruct $15$ 个 epoch，其他超参数与第 4.1 节保持一致。我们在不同规模的 ReSA 子集上微调 Qwen2.5-7B-Instruct，由于时间限制，使用 None、PAIR-GPT、PAP 和 DeepInception 评估安全性。如图 4 所示，即使 0.5K 样本也能产生强大的鲁棒性和泛化能力，超越了没有“Answer-Then-Check”的更大数据集，表明高效的安全对齐可以通过极少量数据实现。

To understand the impact of jailbreak types on the training data, we train ReSA-SFT (Only WJ) on 63K WildJailbreak samples, excluding PAIR, PAP, and GPTFuzzer samples. As shown in Table 18, ReSA-SFT (Only WJ) achieves superior performance compared to WJ-SFT across multiple jailbreak methods, despite using significantly fewer training samples (63K vs. 262K) and being trained on similar data sources. This result clearly demonstrates that the “Answer-Then-Check” strategy itself is effective, regardless of the specific jailbreak types included in the training data. Furthermore, comparison between ReSA-SFT (Only WJ) and the full ReSA-SFT shows that incorporating diverse jailbreak types improves generalization to unseen methods such as TAP and ReNeLLM. The full model achieves stronger resistance to these attacks, suggesting that broader exposure to varied jailbreak patterns during training leads to more robust safety alignment.
为了理解越狱类型对训练数据的影响，我们在排除 PAIR、PAP 和 GPTFuzzer 样本的情况下，使用 63K WildJailbreak 样本训练 ReSA-SFT（仅 WJ）。如表 18 所示，尽管 ReSA-SFT（仅 WJ）使用的训练样本数量显著更少（63K vs. 262K），且训练数据来源相似，但它在与 WJ-SFT 相比时，在多种越狱方法上表现更优。这一结果清晰地表明，“先回答再检查”策略本身是有效的，无论训练数据中包含的具体越狱类型如何。此外，ReSA-SFT（仅 WJ）与完整 ReSA-SFT 的比较表明，包含多样化的越狱类型可以提高对未见过方法的泛化能力，如 TAP 和 ReNeLLM。完整模型对这些攻击表现出更强的抵抗力，这表明在训练过程中更广泛地接触不同的越狱模式有助于实现更稳健的安全对齐。

5 Discussion 5 讨论

Safety in CoT. CoT 中的安全性。

Note that the intended answer summary may contain unsafe content. To prevent leakage, providers can hide the safety-check section via a rule-based filter and return only the final output, similar to existing LLM services that conceal internal reasoning. More importantly, our systematic analysis in Appendix 13.5 shows that RL can effectively eliminate this risk. The extended ReSA-RL variant produces highly safe intended answer summaries under multiple jailbreak attacks. This indicates that the reasoning process can be safely disclosed when desired. These results suggest that the Answer-Then-Check strategy can support safe reasoning exposure in real-world applications.
请注意，预期的答案摘要可能包含不安全内容。为了防止泄露，提供者可以通过基于规则的过滤器隐藏安全检查部分，并仅返回最终输出，类似于现有的 LLM 服务隐藏内部推理。更重要的是，我们在附录 13.5 中的系统分析表明，强化学习可以有效地消除这种风险。扩展的 ReSA-RL 变体在多种越狱攻击下产生了高度安全的预期答案摘要。这表明在需要时，推理过程可以被安全地公开。这些结果表明，答案-检查策略可以在实际应用中支持安全推理的公开。

Efficiency Analysis. 效率分析。

Base Model 基础模型	Dataset 数据集	StrongREJECT		MATH500
Base Model 基础模型	Metric 度量	Length 长度	Runtime 运行时	Length 长度	Runtime 运行时间
Llama3.1-8B -Instruct	Base 基础	537.89	190s 190 秒	833.87	80s 80 秒
	ReSA-SFT	397.78	27s	1123.60	91s
	ReSA-SFT-Adaptive	420.80	29s	711.57	70s
Qwen2.5-7B -Instruct	Base	642.75	177s	550.20	58s 58 秒
	ReSA-SFT	461.62	46s 46 秒	910.97	77s 77 秒
	ReSA-SFT-Adaptive	434.87	27s 27 秒	599.94	62s 62 秒

Table 6: Efficiency analysis on StrongREJECT (harmful queries) and MATH500 (benign queries). ‘Length’ is the average of the response tokens. The bold is the best.
表 6：StrongREJECT（有害查询）和 MATH500（良性查询）的效率分析。「长度」是指响应 token 的平均值。加粗的是最优值。

ReSA-SFT’s response consists of three parts: a concise ‘Intended Answer Summary’ ( $1$ – $5$ sentences), a ‘Safety Analysis’ examining this summary, and a ‘Final Answer’ that provides a detailed response for safe queries or a refusal for unsafe ones. This structure does not introduce prohibitive overhead, and we further introduce ReSA-SFT-Adaptive to dynamically bypass the safety-check for normal queries, eliminating additional cost.
ReSA-SFT 的响应由三部分组成：一个简洁的“意图答案摘要”（ $1$ – $5$ 句话），一个“安全分析”检查这个摘要，以及一个“最终答案”为安全查询提供详细回答或拒绝不安全查询。这种结构不会引入过高的开销，我们进一步引入 ReSA-SFT-Adaptive 来动态绕过安全检查，消除额外成本。

We quantify the runtime overhead of ReSA-SFT and ReSA-SFT-Adaptive relative to the base model in Table 6. On benign datasets such as MATH500, ReSA-SFT incurs only a $1.33\times$ latency increase (58s vs. 77s on $2\times$ H100s) compared to Qwen2.5-7B-Instruct. On adversarial inputs, however, ReSA-SFT is faster: by detecting unsafe intent early and issuing brief refusals, it reduces generation time, whereas Qwen2.5-7B-Instruct takes $3.85\times$ longer (177s vs. 46s) due to producing near-maximum-length responses once jailbroken.
我们在表 6 中量化了 ReSA-SFT 和 ReSA-SFT-Adaptive 相对于基础模型的运行时开销。在 MATH500 等良性数据集上，与 Qwen2.5-7B-Instruct 相比，ReSA-SFT 仅增加了 $1.33\times$ 的延迟（在 $2\times$ H100s 上为 58 秒，对比 77 秒）。然而在对抗性输入中，ReSA-SFT 更快：通过早期检测不安全意图并发出简短拒绝，它减少了生成时间，而 Qwen2.5-7B-Instruct 由于一旦被越狱就生成接近最大长度的响应，因此延迟增加了 $3.85\times$ （177 秒对比 46 秒）。

On general questions, ReSA-SFT-Adaptive achieves computational parity with the base model in both token length and execution time, maintaining base-model efficiency for typical usage while preserving the substantial cost reduction on jailbreak queries. Notably, on MATH500 it even produces shorter responses than Llama3.1-8B-Instruct, which often generates repetitive, meaningless output until reaching the maximum length—a behavior rarely observed in ReSA-SFT-Adaptive. Importantly, experiments show that its core capability remains stable: its overall safety robustness against harmful inputs is comparable to the original ReSA-SFT model (Table 14), while general capabilities also remain consistent (Table 15).
在一般问题上，ReSA-SFT-Adaptive 在 token 长度和执行时间上均实现了与基础模型的计算对等，既保持了基础模型在典型使用场景下的效率，又显著降低了越狱查询的成本。值得注意的是，在 MATH500 上，它甚至能生成比 Llama3.1-8B-Instruct 更短的响应，后者往往在达到最大长度前会生成重复、无意义的输出——而 ReSA-SFT-Adaptive 很少出现这种行为。重要的是，实验表明其核心能力保持稳定：其整体对有害输入的安全性鲁棒性与原始 ReSA-SFT 模型相当（表 14），而一般能力也保持一致（表 15）。

Why Safety Training Is Necessary Beyond Inference-Time Strategies.
为什么在推理时策略之外还需要进行安全训练。

Our “Answer-Then-Check” strategy builds on the observation that malicious intent is often hidden in the query but emerges during answer generation, making it easier to identify. One might expect prompting reasoning models or applying post-hoc detection to achieve similar effects, but both fall short without targeted safety training: reasoning models lack safety policy knowledge and fail to perform reliable checks, as shown in Table 4, while post-hoc detectors remain vulnerable to adversarial prompts and require an additional guard model and full answer generation. In contrast, ReSA-SFT/RL uses a single model and generates a concise answer summary, making it more efficient. By training on diverse jailbreak patterns with policy-grounded safety analyses, ReSA-SFT/RL learns to detect implicit harmful intent and apply the corresponding safety policies. Besides, safety training enables ReSA-SFT/RL to perform safe completion for sensitive queries (e.g., self-harm), providing supportive responses rather than blunt refusals. In contrast, post-hoc detection can only resort to outright refusals.
我们的“先回答后检查”策略基于一个观察：恶意意图通常隐藏在查询中，但在生成答案时显现出来，从而更容易被识别。人们可能会预期通过提示推理模型或应用事后检测来实现类似效果，但两者在没有针对性安全训练的情况下都存在不足：推理模型缺乏安全策略知识，无法执行可靠的检查，如表 4 所示，而事后检测仍然容易受到对抗性提示的影响，需要额外的防护模型和完整的答案生成。相比之下，ReSA-SFT/RL 使用单个模型并生成简洁的答案摘要，效率更高。通过在多样化的越狱模式上使用基于策略的安全分析进行训练，ReSA-SFT/RL 学会检测隐含的恶意意图并应用相应的安全策略。此外，安全训练使 ReSA-SFT/RL 能够对敏感查询（例如，自残）进行安全完成，提供支持性回应而不是生硬的拒绝。相比之下，事后检测只能采取直接的拒绝。

6 Conclusion 6 结论

In this paper, we propose an “Answer-Then-Check” safety alignment strategy to protect models against jailbreak attacks. We construct a dataset with 80K samples that teaches models to first plan a concise answer and then check its safety before providing a final response. Experiments show that our method achieves robust performance against diverse jailbreak attacks while maintaining strong reasoning capabilities and low over-refusal rates. Moreover, our approach enables safe completion, allowing models to provide helpful yet harmless alternatives for sensitive topics. The effectiveness of our approach with small training datasets (e.g., $500$ samples) suggests a promising path for efficient safety alignment. We further introduce two variants: an Adaptive Answer-Then-Check strategy that preserves base-model efficiency on normal queries, and an RL-based variant that produces safe intended answer summaries while further improving safety robustness.
在本文中，我们提出了一种“先回答后检查”的安全对齐策略，以保护模型免受越狱攻击。我们构建了一个包含 80K 样本的数据集，用于教导模型首先规划简洁的答案，然后在提供最终回复前检查其安全性。实验表明，我们的方法在应对多样化的越狱攻击时表现出强大的性能，同时保持强大的推理能力和低拒绝率。此外，我们的方法能够安全地完成任务，允许模型为敏感话题提供有益且无害的替代方案。我们方法在小训练数据集（例如 $500$ 样本）上的有效性表明了高效安全对齐的可行路径。我们进一步引入了两种变体：一种是在正常查询中保持基础模型效率的自适应先回答后检查策略，以及一种基于强化学习的变体，该变体在生成安全意图答案摘要的同时进一步提高了安全鲁棒性。

Acknowledgments 致谢

CTC and BH were supported by NSFC Major Research Plan No. 92570109, ByteDance Faculty Research Award, and HKBU CSD Departmental Incentive Scheme.
CTC 和 BH 得到了国家自然科学基金重大研究计划（项目编号：92570109）、字节跳动教员研究奖和香港浸会大学计算机科学系激励计划的资助。

References

Alon and Kamfonas [2023] Gabriel Alon and Michael Kamfonas. Detecting language model attacks with perplexity. arXiv preprint arXiv:2308.14132, 2023.
Andriushchenko et al. [2025] Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks. In ICLR, 2025.
Bai et al. [2022] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
Bhardwaj and Poria [2023] Rishabh Bhardwaj and Soujanya Poria. Red-teaming large language models using chain of utterances for safety-alignment. arXiv preprint arXiv:2308.09662, 2023.
Bianchi et al. [2023] Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. arXiv preprint arXiv:2309.07875, 2023.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020.
Cao et al. [2024] Bochuan Cao, Yuanpu Cao, Lu Lin, and Jinghui Chen. Defending against alignment-breaking attacks via robustly aligned llm. In ACL, 2024.
Chao et al. [2023] Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419, 2023.
Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Dai et al. [2023] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773, 2023.
Ding et al. [2024] Peng Ding, Jun Kuang, Dan Ma, Xuezhi Cao, Yunsen Xian, Jiajun Chen, and Shujian Huang. A wolf in sheep’s clothing: Generalized nested jailbreak prompts can fool large language models easily. In NAACL, 2024.
Ganguli et al. [2022] Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858, 2022.
Grattafiori et al. [2024] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
Guan et al. [2024] Melody Y Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, et al. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339, 2024.
Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
Hendrycks et al. [2020] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
Huang et al. [2024] Zhen Huang, Haoyang Zou, Xuefeng Li, Yixiu Liu, Yuxiang Zheng, Ethan Chern, Shijie Xia, Yiwei Qin, Weizhe Yuan, and Pengfei Liu. O1 replication journey–part 2: Surpassing o1-preview through simple distillation, big progress or bitter lesson? arXiv preprint arXiv:2411.16489, 2024.
Inan et al. [2023] Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674, 2023.
Jain et al. [2023] Neel Jain, Avi Schwarzschild, Yuxin Wen, Gowthami Somepalli, John Kirchenbauer, Ping-yeh Chiang, Micah Goldblum, Aniruddha Saha, Jonas Geiping, and Tom Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614, 2023.
Jiang et al. [2025] Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and Radha Poovendran. Safechain: Safety of language models with long chain-of-thought reasoning capabilities. arXiv preprint arXiv:2502.12025, 2025.
Jiang et al. [2024] Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, et al. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. NeurIPS, 2024.
Lambert et al. [2024] Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman, Lester James V Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, et al. Tulu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.
Li et al. [2024] Siheng Li, Cheng Yang, Zesen Cheng, Lemao Liu, Mo Yu, Yujiu Yang, and Wai Lam. Large language models can self-improve in long-context reasoning. arXiv preprint arXiv:2411.08147, 2024.
Li et al. [2023] Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. Deepinception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191, 2023.
Liu et al. [2025a] Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms. In ICLR, 2025a.
Liu et al. [2025b] Yue Liu, Hongcheng Gao, Shengfang Zhai, Jun Xia, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, and Bryan Hooi. Guardreasoner: Towards reasoning-based llm safeguards. arXiv preprint arXiv:2501.18492, 2025b.
Mazeika et al. [2024] Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. arXiv preprint arXiv:2402.04249, 2024.
Mehrotra et al. [2024] Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. Tree of attacks: Jailbreaking black-box llms automatically. In NeurIPS, 2024.
Muennighoff et al. [2025] Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. arXiv preprint arXiv:2501.19393, 2025.
OpenAI [2024a] OpenAI. Performance and baseline evaluations of gpt-oss-safeguard-120b and gpt-oss-safeguard-20b. Technical report, 2024a.
OpenAI [2024b] OpenAI. Learning to reason with llms, 2024b. URL https://openai.com/index/learning-to-reason-with-llms/. Accessed: 2025-05-12.
Qi et al. [2024] Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. arXiv preprint arXiv:2406.05946, 2024.
Röttger et al. [2023] Paul Röttger, Hannah Rose Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263, 2023.
Shao et al. [2024] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
Sheng et al. [2024] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256, 2024.
Shi et al. [2024] Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, and Dahua Lin. Navigating the overkill in large language models. arXiv preprint arXiv:2401.17633, 2024.
Souly et al. [2024] Alexandra Souly, Qingyuan Lu, Dillon Bowen, Tu Trinh, Elvis Hsieh, Sana Pandey, Pieter Abbeel, Justin Svegliato, Scott Emmons, Olivia Watkins, et al. A strongreject for empty jailbreaks. In NeurIPS, 2024.
Team et al. [2025a] Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025a.
Team et al. [2025b] Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. Kimi k1. 5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599, 2025b.
von Werra et al. [2020] Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. Trl: Transformer reinforcement learning, 2020.
Wei et al. [2023] Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? In NeurIPS, 2023.
Yang et al. [2024] An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. Qwen2. 5 technical report. arXiv preprint arXiv:2412.15115, 2024.
Ye et al. [2025] Yixin Ye, Zhen Huang, Yang Xiao, Ethan Chern, Shijie Xia, and Pengfei Liu. Limo: Less is more for reasoning. arXiv preprint arXiv:2502.03387, 2025.
Yu et al. [2023] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253, 2023.
Zeng et al. [2024a] Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. In ACL, 2024a.
Zeng et al. [2024b] Yifan Zeng, Yiran Wu, Xiao Zhang, Huazheng Wang, and Qingyun Wu. Autodefense: Multi-agent llm defense against jailbreak attacks. arXiv preprint arXiv:2403.04783, 2024b.
Zhang et al. [2025a] Yichi Zhang, Zihao Zeng, Dongbai Li, Yao Huang, Zhijie Deng, and Yinpeng Dong. Realsafe-r1: Safety-aligned deepseek-r1 without compromising reasoning capability. arXiv preprint arXiv:2504.10081, 2025a.
Zhang et al. [2025b] Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, and Jun Zhu. Stair: Improving safety alignment with introspective reasoning. arXiv preprint arXiv:2502.02384, 2025b.
Zhang et al. [2024] Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M Bikel, Jason Weston, and Eric Michael Smith. Backtracking improves generation safety. arXiv preprint arXiv:2409.14586, 2024.
Zhang et al. [2023] Zhexin Zhang, Junxiao Yang, Pei Ke, Fei Mi, Hongning Wang, and Minlie Huang. Defending large language models against jailbreaking attacks through goal prioritization. arXiv preprint arXiv:2311.09096, 2023.
Zhao et al. [2024] Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, and Kaifu Zhang. Marco-o1: Towards open reasoning models for open-ended solutions. arXiv preprint arXiv:2411.14405, 2024.
Zheng et al. [2024] Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, and Nanyun Peng. On prompt-driven safeguarding for large language models. In ICML, 2024.
Zhou et al. [2024] Zhanke Zhou, Rong Tao, Jianing Zhu, Yiwen Luo, Zengmao Wang, and Bo Han. Can language models perform robust reasoning in chain-of-thought prompting with noisy rationales? In NeurIPS, 2024.
Zhou et al. [2025a] Zhanke Zhou, Xiao Feng, Zhaocheng Zhu, Jiangchao Yao, Sanmi Koyejo, and Bo Han. From passive to active reasoning: Can large language models ask the right questions under incomplete information? In ICML, 2025a.
Zhou et al. [2025b] Zhanke Zhou, Zhaocheng Zhu, Xuan Li, Mikhail Galkin, Xiao Feng, Sanmi Koyejo, Jian Tang, and Bo Han. Landscape of thoughts: Visualizing the reasoning process of large language models. arXiv preprint arXiv:2503.22165, 2025b.
Zou et al. [2023] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.

\beginappendix

7 Ethic Statement 7 伦理声明

The study does not involve human subjects, potentially harmful insights, applications, conflicts of interest, sponsorship, discrimination, bias, fairness concerns, privacy or security issues, legal compliance issues, or research integrity issues. The datasets released for this study are intended to contribute to the development of more responsible LLMs.
本研究不涉及人类受试者、潜在的有害见解、应用、利益冲突、赞助、歧视、偏见、公平性问题、隐私或安全问题、合规性问题或研究诚信问题。本研究发布的 datasets 旨在促进更负责任的 LLMs 的发展。

8 Reproduction Statement 8 可复现声明

The experimental setups for training and evaluation are described in detail in Appendix 12. We have open-sourced our dataset at https://huggingface.co/datasets/ByteDance-Seed/ReSA.
训练和评估的实验设置在附录 12 中详细描述。我们已在 https://huggingface.co/datasets/ByteDance-Seed/ReSA 上开源了我们的 dataset。

9 Impact Statement 9 影响声明

Our work enhances jailbreak defense for LLMs, helping prevent malicious actors from bypassing safety mechanisms. While our method significantly improves defense against various attacks, the “intended answer summary” phase in our CoT may contain unsafe content. Model providers can address this issue by hiding intended answer summary and safety analysis between “<safety_check>” and “</safety_check>” tags, and only displaying the final response to users. Future work will focus on ensuring safety throughout the entire safety reasoning process.
我们的工作增强了 LLMs 的越狱防御能力，有助于防止恶意行为者绕过安全机制。虽然我们的方法显著提高了对各种攻击的防御能力，但我们的 CoT 中的“预期答案摘要”阶段可能包含不安全内容。模型提供者可以通过在“<safety_check>”和“</safety_check>”标签之间隐藏预期答案摘要和安全分析，并仅向用户显示最终响应来解决此问题。未来的工作将集中于确保在整个安全推理过程中保持安全。

10 LLM Usage Disclosure 10 LLM 使用披露

This paper was prepared with the assistance of LLMs, which were utilized for refining content and checking grammar. The authors assume full responsibility for the entire content of the manuscript, including any potential issues related to plagiarism and factual accuracy. It is confirmed that no LLM is listed as an author.
本文是在 LLMs 的帮助下准备完成的，这些 LLMs 被用于完善内容和检查语法。作者对稿件的全部内容承担全部责任，包括任何与剽窃和事实准确性相关的潜在问题。确认没有将任何 LLM 列为作者。

11 Limitation 11 局限性

Our method requires generating additional output tokens to perform a safety check before the final response, which may increase the inference time cost on benign user queries. We therefore develop ReSA-SFT-Adaptive to eliminate this overhead by bypassing the safety check for normal inputs.
我们的方法需要在最终响应之前生成额外的输出标记来进行安全检查，这可能会增加对良性用户查询的推理时间成本。因此，我们开发了 ReSA-SFT-Adaptive 来通过绕过对正常输入的安全检查来消除这种开销。

12 Implementation Details 12 实现细节

In this section, we describe the implementation details of ReSA, defense baselines, and the various jailbreak attack methods used in our experiments. Figure 5 illustrates examples of four query categories in ReSA.
在本节中，我们描述了 ReSA、防御基线和实验中使用的各种越狱攻击方法的实现细节。图 5 展示了 ReSA 中四个查询类别的示例。

12.1 ReSA Implementation 12.1 ReSA 实现

Algorithm 1 ReSA Dataset Curation Pipeline
算法 1 ReSA 数据集策展流程

1:Initial query pool

\mathcal{Q}

(e.g., WILDJAILBREAK), jailbreak methods

\mathcal{M}

, Aligned LLM

F

, Unaligned LLM

U

, Guard model

C

, Safety policies

\Pi

1:初始查询池

\mathcal{Q}

(例如，WILDJAILBREAK)，越狱方法

\mathcal{M}

，对齐 LLM

F

，非对齐 LLM

U

，防护模型

C

，安全策略

\Pi

2:Curated dataset

\mathcal{D}

2:策展数据集

\mathcal{D}

3:// Stage 1: Safety Query Collection
3:// 第一阶段：安全查询收集

4:Sample vanilla queries (harmful and benign) from

\mathcal{Q}

4:从

\mathcal{Q}

中采样普通查询（有害和良性）

5:for each jailbreak method

m\in\mathcal{M}

do
5:对于每个越狱方法

m\in\mathcal{M}

，执行

6: Apply

m

to selected vanilla queries

\to

generate adversarial queries
6:将

m

应用于选定的普通查询

\to

，生成对抗性查询

7:Merge adversarial queries with

\mathcal{Q}

to obtain raw pool

\mathcal{Q}^{\prime}

7:将对抗性查询与

\mathcal{Q}

合并以获得原始池

\mathcal{Q}^{\prime}

8:// Stage 2: Intended Answer Summary Generation
8:// 阶段 2：预期答案摘要生成

9:for each query

q\in\mathcal{Q}^{\prime}

do
9:对于每个查询

q\in\mathcal{Q}^{\prime}

执行

10: if

q

is harmful-type then
10:如果

q

是有害类型的，则

11:

a\leftarrow U(q)

\triangleright

generated by unaligned LLM

\triangleright

由未对齐的 LLM 生成

12: else 12:否则

13:

a\leftarrow F(q)

\triangleright

generated by aligned LLM

\triangleright

由对齐的 LLM 生成

14:

y\leftarrow C(q,a)

\triangleright

safe vs. unsafe

\triangleright

安全与不安全

15: if label

y

matches query type then
15:如果标签

y

匹配查询类型，则

16:

s\leftarrow F(a)

\triangleright

concise intended answer summary

\triangleright

简洁的预期答案摘要

17: Add

(q,s,a,y)

to buffer
17:将

(q,s,a,y)

添加到缓冲区

18:// Stage 3: Safety Analysis Synthesis
18://阶段 3：安全分析综合

19:for each

(q,a,s,y)

in buffer do
19:对于 buffer 中的每个

(q,a,s,y)

执行

20:

\pi\leftarrow\text{SelectSafetyPolicy}(\Pi,q,a)

21:

t\leftarrow F(q,a,s,\pi)

\triangleright

analyze violated policy provisions or justify non-violation

\triangleright

分析违反的政策条款或证明不违反

22: Filter: 22:过滤：

23: if

y=\text{unsafe}

and

t

claims no violated provision then continue
23:如果

y=\text{unsafe}

和

t

声称没有违反的条款，则继续

\triangleright

drop

24: else if

y=\text{safe}

and

t

claims violations then continue

\triangleright

drop

25:

r\leftarrow\text{PackToTemplate}(s,t,a)

26: Add

(q,r)

\mathcal{D}

27:return

\mathcal{D}

Figure 5: Representative examples of the four query categories used in the ReSA dataset: vanilla benign, adversarial benign, vanilla harmful, and adversarial harmful queries.
图 5：ReSA 数据集中使用的四种查询类别的代表性示例：纯良性查询、对抗性良性查询、纯有害查询和对抗性有害查询。

12.1.1 Summarization of Intended Answer
12.1.1 目标答案的摘要

The summarization template, shown in Figure 6, is designed to generate concise summaries of detailed answers. The model will first formulate what it would naturally respond with before assessing safety. The template instructs the LLM to create a brief 1-5 sentence summary capturing the key ideas of a detailed answer, simulating the planning process that would occur before answering a query. This summarization approach provides computational efficiency compared to using the full intended answer, while still retaining sufficient information for safety analysis.
摘要模板如图 6 所示，旨在生成详细答案的简洁摘要。模型将首先形成它自然会回应的内容，然后再评估安全性。该模板指示 LLM 创建一个简短的 1-5 句摘要，捕捉详细答案的关键思想，模拟在回答查询之前会发生的规划过程。与使用完整的目标答案相比，这种摘要方法提供了计算效率，同时仍然保留了足够的信息用于安全性分析。

Figure 6: Summarization template for generating intended answer summary. This template instructs an LLM to create a brief 1-5 sentence summary of a detailed answer, capturing the key ideas and planning steps as if planning a response before answering it.
图 6：生成预期答案摘要的总结模板。此模板指示 LLM 创建一个包含关键思想和规划步骤的 1-5 句简短总结，仿佛在回答问题之前先进行规划。

12.1.2 Safety Analysis Synthesis
12.1.2 安全分析综合

The safety analysis synthesis templates are designed for the “Check” component of our approach, where the model evaluates whether the intended answer summary complies with safety guidelines. We provide two distinct templates in Figure 7: one for harmful queries and one for benign queries. For harmful queries, the template instructs the LLM to analyze the answer planning idea with respect to a specific safety policy that it violates. The template requires identification of explicit safety violations, reference to relevant security policy provisions, and a clear explanation of why these provisions are breached. This structured analysis ensures a thorough evaluation of potentially harmful content.
安全分析综合模板是为我们方法中的“检查”组件设计的，其中模型评估预期答案摘要是否符合安全指南。我们在图 7 中提供了两个不同的模板：一个用于有害查询，一个用于良性查询。对于有害查询，模板指示 LLM 根据其违反的特定安全政策来分析答案规划思路。该模板要求识别明确的安全违规行为，引用相关的安全政策条款，并清晰解释为何这些条款被违反。这种结构化分析确保了对潜在有害内容的全面评估。

For benign queries, the template in Figure 7 guides the LLM to explain why the answer planning idea does not violate any of the listed safety policies. By requiring explicit reference to the most relevant policies and justification for compliance, this template helps ensure that benign queries are not unnecessarily refused.
对于良性查询，图 7 中的模板指导 LLM 解释为何答案规划思路没有违反任何列出的安全政策。通过要求明确引用最相关的政策并说明合规性理由，该模板有助于确保良性查询不会被不必要地拒绝。

Figure 7: Prompt templates for safety analysis of ReSA. The left template analyzes harmful queries and responses by identifying safety policy violations, while the right template analyzes benign queries and responses by explaining compliance with relevant safety policies.
图 7：ReSA 的安全分析提示模板。左边的模板通过识别安全策略违规来分析有害查询和响应，右边的模板通过解释符合相关安全策略来分析良性查询和响应。

12.1.3 Filtering 12.1.3 过滤

The template in Figure 8 is designed to identify and remove samples containing internal inconsistencies in the safety analysis. Specifically, it detects cases where the safety analysis concludes that a response is unsafe despite reasoning that suggests it is safe (false refusal), or conversely, where the conclusion is safe but the analysis indicates policy violations (false acceptance).
图 8 中的模板旨在识别并移除在安全分析中存在内部不一致性的样本。具体来说，它检测两种情况：一是安全分析得出响应不安全的结论，尽管推理表明它是安全的（错误拒绝）；二是结论是安全的，但分析表明存在策略违规（错误接受）。

Figure 8: Prompt template for filtering inconsistent safety analyses. This template helps identify contradictions between reasoning and conclusions in LLM-generated safety analyses, detecting cases where the analysis concludes an answer is unsafe despite reasoning suggesting safety, or vice versa.
图 8：用于过滤不一致安全分析的提示模板。该模板有助于识别 LLM 生成安全分析中推理与结论之间的矛盾，检测分析得出答案不安全但推理表明安全，或反之的情况。

12.1.4 Safe Completion 12.1.4 安全完成

Figure 9 illustrates the process for generating safe completion responses to adversarial queries. To enable the model to recognize the true intent behind adversarial prompts, we provide the corresponding vanilla query alongside the adversarial input during data generation. For vanilla queries, we directly prompt Qwen2.5-7B-Instruct to generate appropriate responses. We find that this carefully constructed dataset requires only a few hundred samples to equip the model with robust safe completion capabilities, while also enabling effective identification of corresponding adversarial queries.
图 9 展示了为对抗性查询生成安全完成响应的过程。为了使模型能够识别对抗性提示背后的真实意图，我们在数据生成时在对抗性输入旁边提供相应的普通查询。对于普通查询，我们直接提示 Qwen2.5-7B-Instruct 生成适当的响应。我们发现，这个精心构建的数据集只需要几百个样本，就能使模型具备强大的安全完成能力，同时也能有效识别相应的对抗性查询。

Figure 9: Prompt template for safe completion training data generation. This template guides the creation of supportive responses to self-harm queries by providing both the adversarial input and its underlying vanilla intent.
图 9：安全完成训练数据生成的提示模板。该模板通过提供对抗性输入及其底层纯 vanilla 意图，指导创建对自我伤害查询的支持性响应。

Figure 10: Reasoning template in OpenAI-Deliberative Alignment (our implementation). The template structures the reasoning process into two components: safety analysis and final response based on safety determination.
图 10：OpenAI-审慎对齐中的推理模板（我们的实现）。该模板将推理过程结构化为两个组件：安全分析和安全判定基础上的最终响应。

Figure 11: Prompt templates for safety analysis of OpenAI-Deliberative Alignment (our implementation). The left template analyzes harmful queries by identifying safety policy violations, while the right template analyzes benign queries by explaining compliance with relevant safety policies.
图 11：OpenAI-审慎对齐（我们的实现）的安全分析提示模板。左侧模板通过识别安全策略违规来分析有害查询，右侧模板通过解释符合相关安全策略来分析无害查询。

Figure 12: Prompt of Goal Priority Defense. This defense mechanism instructs the model to prioritize safety over helpfulness when responding to queries. The prompt includes examples of both safe and potentially harmful queries, demonstrating how the model should engage internal reasoning to assess safety concerns before generating responses.
图 12：目标优先级防御提示。这种防御机制指示模型在回答查询时优先考虑安全性而非帮助性。提示中包含安全查询和潜在有害查询的示例，展示了模型应如何运用内部推理在生成回复前评估安全顾虑。

Figure 13: Prompt of Self-Reflection. This defense mechanism requires the model to explicitly evaluate safety policy compliance within structured reasoning tags before generating responses.
图 13：自我反思提示。这种防御机制要求模型在生成回复前，在结构化推理标签内明确评估安全策略的合规性。

12.1.5 RL-based Answer-Then-Check Strategy
12.1.5 基于 RL 的答案先检查策略

Figure 14: The prompt used during RL training, instructing the model to follow the Answer-Then-Check generation format.
图 14：在 RL 训练期间使用的提示，指示模型遵循 Answer-Then-Check 生成格式。

Figure 15: The prompt used by the reward model to determine refusal behavior on benign queries, following the XSTest pipeline.
图 15：奖励模型用于确定拒绝行为在良性查询上的提示，遵循 XSTest 管道。

We follow a simple normalization principle for $r$ , with a maximum reward of $1$ . For harmful samples, $\lambda_{\text{format}}=0.1$ and $\lambda_{\text{safety}}=0.45$ for each safety reward. For benign samples, $\lambda_{\text{format}}=0.1$ , $\lambda_{\text{safety}}=0.225$ , and $\lambda_{\text{refusal}}=0.45$ . For normal samples, $\lambda_{\text{format}}=0.5$ and $\lambda_{\text{refusal}}=0.5$ . We use Qwen2.5-7B-Instruct as the base model for RL training, with the prompt shown in Figure 14. We separately serve another Qwen2.5-7B-Instruct model as the reward model to compute the refusal reward, determining whether the response exhibits a refusal pattern on benign queries. If a refusal is detected, we set $R_{\text{refusal}}=0$ ; otherwise, $R_{\text{refusal}}=1$ . The refusal judgment follows the pipeline established in the XSTest [34], and the corresponding prompt is shown in Figure 15. For the safety reward, we separately serve a Llama-Guard-3-8B model as the reward model. Given the vanilla prompt and the model’s response, Llama-Guard-3-8B evaluates whether the output is unsafe. If it is deemed unsafe, we set $R_{\text{safety}}=0$ ; otherwise, $R_{\text{safety}}=1$ .
对于 $r$ ，我们遵循一个简单的归一化原则，最大奖励为 $1$ 。对于有害样本，每个安全奖励的 $\lambda_{\text{format}}=0.1$ 和 $\lambda_{\text{safety}}=0.45$ 。对于良性样本， $\lambda_{\text{format}}=0.1$ 、 $\lambda_{\text{safety}}=0.225$ 和 $\lambda_{\text{refusal}}=0.45$ 。对于正常样本， $\lambda_{\text{format}}=0.5$ 和 $\lambda_{\text{refusal}}=0.5$ 。我们使用 Qwen2.5-7B-Instruct 作为 RL 训练的基础模型，提示如图 14 所示。我们单独部署另一个 Qwen2.5-7B-Instruct 模型作为奖励模型来计算拒绝奖励，确定响应在良性查询上是否表现出拒绝模式。如果检测到拒绝，我们设置 $R_{\text{refusal}}=0$ ；否则， $R_{\text{refusal}}=1$ 。拒绝判断遵循 XSTest [ 34] 中建立的管道，相应的提示如图 15 所示。对于安全奖励，我们单独部署一个 Llama-Guard-3-8B 模型作为奖励模型。给定原始提示和模型的响应，Llama-Guard-3-8B 评估输出是否不安全。如果被认为不安全，我们设置 $R_{\text{safety}}=0$ ；否则， $R_{\text{safety}}=1$ 。

We adopt verl [36] as the RL framework and train the model using the prompts from the entire ReSA dataset as the training set, without requiring corresponding responses. Training is conducted on $8\times$ H100 GPUs with a batch size of $512$ . For each prompt, we generate $8$ rollouts (i.e., $G=8$ in GRPO). We use a learning rate of $1\times 10^{-6}$ and train for $10$ epochs.
我们采用 verl [ 36] 作为强化学习框架，并使用 ReSA 数据集中所有提示作为训练集进行模型训练，无需相应的响应。训练在 $8\times$ H100 GPU 上进行，批大小为 $512$ 。对于每个提示，我们生成 $8$ 个 rollout（即 GRPO 中的 $G=8$ ）。我们使用学习率 $1\times 10^{-6}$ ，训练 $10$ 个 epoch。

12.2 Defense Baseline Implementations
12.2 防御基线实现

To comprehensively evaluate the effectiveness of our proposed ReSA, we compare it against a diverse set of strong baseline defense strategies. These baselines cover post-hoc detection, fine-tuning defenses, advanced general LLMs, and advanced reasoning models with prompt engineering. The specific implementation details for each defense are as follows.
为了全面评估我们提出的 ReSA 的有效性，我们将其与一系列强大的基线防御策略进行比较。这些基线包括事后检测、微调防御、高级通用 LLMs 和具有提示工程的推理模型。每种防御的具体实现细节如下。

Base Model: 基础模型：

We use Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct as base models. During testing, we use these two models for inference while maintaining consistent parameter settings.
我们使用 Llama3.1-8B-Instruct 和 Qwen2.5-7B-Instruct 作为基础模型。在测试过程中，我们使用这两个模型进行推理，同时保持参数设置的一致性。

Post-hoc Detection: 后验检测：

We use Llama-Guard-3-8B [13] and GuardReasoner [27] as detectors. GuardReasoner is a new safeguard for LLMs, guiding the guard model to learn to reason. Note that for adversarial queries, the detector input consists of (adversarial query, response) pairs. This differs from evaluation, which uses (vanilla query, response) pairs, since during Post-hoc detection, we do not know the vanilla query corresponding to the user’s input query.
我们使用 Llama-Guard-3-8B [ 13] 和 GuardReasoner [ 27] 作为检测器。GuardReasoner 是 LLMs 的一种新安全措施，指导安全模型学习推理。请注意，对于对抗性查询，检测器的输入由（对抗性查询，响应）对组成。这与评估不同，评估使用（普通查询，响应）对，因为在后验检测期间，我们不知道与用户输入查询对应的普通查询。

STAIR-DPO and Realsafe-r1:
STAIR-DPO 和 Realsafe-r1：

We evaluate using the publicly released weights of STAIR-DPO and Realsafe-r1. Additionally, for STAIR-DPO evaluation, we only use the portion after ‘Final Answer: ’ in the response for assessment.
我们使用公开发布的 STAIR-DPO 和 Realsafe-r1 的权重进行评估。此外，对于 STAIR-DPO 评估，我们仅使用响应中“Final Answer: ”之后的部分进行评估。

WJ-SFT:

The WildJAILBREAK [22] dataset is a large-scale safety training resource containing 262K prompt-response pairs across vanilla and adversarial queries, with responses primarily generated by GPT-3.5. We train on the WildJAILBREAK using the same training parameters as ReSA-SFT.
WildJAILBREAK [ 22] 数据集是一个大规模的安全训练资源，包含 262K 个针对普通查询和对抗性查询的提示-响应对，响应主要由 GPT-3.5 生成。我们在 WildJAILBREAK 上使用与 ReSA-SFT 相同的训练参数进行训练。

OpenAI-Deliberative Alignment:
OpenAI-审慎对齐:

OpenAI-Deliberative Alignment [14] trains LLMs to explicitly recall and accurately reason over the specifications before answering. For a fair comparison with Deliberative Alignment, we construct the training set using queries from the ReSA 80K dataset. The reasoning template is shown in Figure 10. Like ReSA, safety checks are generated using the Llama3.3-70B-Instruct. The prompt template for generating safety checks is provided in Figure 11.
OpenAI-审慎对齐 [ 14] 训练 LLMs 以明确回忆并在回答前对规范进行准确推理。为了与审慎对齐进行公平比较，我们使用 ReSA 80K 数据集的查询来构建训练集。推理模板如图 10 所示。与 ReSA 类似，安全检查是使用 Llama3.3-70B-Instruct 生成的。生成安全检查的提示模板如图 11 所示。

Advanced General LLMs: 高级通用 LLMs：

For advanced general LLMs, we use gpt-4.1-20250414, claude-sonnet-4-20250514, and deepseek-v3-20250324 as comparison methods.
对于高级通用 LLMs，我们使用 gpt-4.1-20250414、claude-sonnet-4-20250514 和 deepseek-v3-20250324 作为对比方法。

Advanced General LLMs with Goal Priority Defense:
带目标优先级防御的高级通用 LLMs：

We employ Goal Priority [51] as an inference-time defense method. Specifically, Goal Priority defense prioritizes the safety goal over the helpfulness goal. The specific prompt is provided in Figure 12.
我们采用目标优先级[51]作为推理时防御方法。具体来说，目标优先级防御将安全目标置于有用性目标之上。具体的提示在图 12 中提供。

Advanced Reasoning LLMs with Safety Reflection:
具有安全反思能力的高级推理 LLMs：

We implement Safety Reflection in reasoning LLMs through system prompts. The specific prompt is provided in Figure 13.
我们通过系统提示在推理 LLMs 中实现安全反思。具体的提示在图 13 中提供。

12.3 Jailbreak Attack Implementations
12.3 越狱攻击实现

None (Vanilla Harmful Queries).
无（普通有害查询）。

We use unmodified harmful queries without any jailbreak techniques for “None”. We utilize the complete StrongREJECT test set [38], which contains $313$ harmful queries across various categories such as illegal activities, hate speech, violence, and more.
我们使用未经修改的有害查询作为“None”的示例，且不采用任何越狱技术。我们使用了完整的 StrongREJECT 测试集[ 38]，其中包含 $313$ 来自非法活动、仇恨言论、暴力等不同类别的有害查询。

PAIR.

PAIR [8] is an automated jailbreak technique that leverages an attack model to iteratively generate and refine adversarial queries targeting a specific victim model. The attack model learns to craft increasingly effective jailbreak attempts based on the victim model’s responses. In our implementation, we use Dolphin-2.9.2-Qwen2-72B as the attack model, Qwen2.5-72B-Instruct as the evaluation model, and the model being tested (Llama3.1-8B-Instruct, Qwen2.5-7B-Instruct, WJ-SFT, or ReSA-SFT) as the victim model.
PAIR [ 8]是一种自动化的越狱技术，它利用攻击模型来迭代生成和优化针对特定受害者模型的对抗性查询。攻击模型根据受害者模型的响应来学习制作越来越有效的越狱尝试。在我们的实现中，我们使用 Dolphin-2.9.2-Qwen2-72B 作为攻击模型，Qwen2.5-72B-Instruct 作为评估模型，以及被测试的模型（Llama3.1-8B-Instruct、Qwen2.5-7B-Instruct、WJ-SFT 或 ReSA-SFT）作为受害者模型。

PAP.

For the PAP [46], we adopt the strongest variant, “PAP-misrepresentation”. We follow the implementation in the StrongREJECT, using GPT-3.5-Turbo and GPT-4o-mini as attack models to generate adversarial queries.
对于 PAP [ 46]，我们采用最强烈的变体，“PAP-歪曲”。我们遵循 StrongREJECT 中的实现方式，使用 GPT-3.5-Turbo 和 GPT-4o-mini 作为攻击模型来生成对抗性查询。

GPTFuzzer. GPTFuzzer。

GPTFuzzer [45] treats jailbreaking as a fuzzing problem, systematically generating and testing variations of attack templates. We use Qwen2.5-72B-Instruct as both the attack model and evaluation model, with the model being tested as the victim model. Following the original implementation, we experiment with $100$ prompts (provided by the original paper) and select the template that performs best. The attack model optimizes the template with the following hyperparameters: maximum of $100$ iterations, $10,000$ queries, $1,000$ successful jailbreaks, and $10,000$ rejections. The process terminates when any of these limits is reached.
GPTFuzzer [ 45] 将越狱视为一个模糊问题，系统地生成和测试攻击模板的变体。我们使用 Qwen2.5-72B-Instruct 作为攻击模型和评估模型，将待测试的模型作为受害者模型。遵循原始实现方式，我们使用原始论文提供的 $100$ 提示进行实验，并选择表现最佳的模板。攻击模型使用以下超参数优化模板：最多 $100$ 次迭代、 $10,000$ 个查询、 $1,000$ 次成功的越狱和 $10,000$ 次拒绝。当达到这些限制中的任何一个时，过程终止。

ReNeLLM. ReNeLLM。

ReNeLLM [11] formulates jailbreaking as systematic prompt rewriting and scenario nesting. We implement this method using Qwen2.5-72B-Instruct as both the attack and evaluation models, with the tested model as the victim. This approach systematically rewrites harmful queries to embed them within nested hypothetical scenarios, creating psychological distance between the model and the harmful content.
ReNeLLM [ 11] 将越狱定义为系统性的提示重写和场景嵌套。我们使用 Qwen2.5-72B-Instruct 作为攻击和评估模型，将测试模型作为受害者。这种方法系统地重写有害查询，将它们嵌入在嵌套的假设场景中，在模型和有害内容之间创造心理距离。

TAP.

TAP [29] employs tree-based search strategies to efficiently explore the prompt space. We use Qwen2.5-72B-Instruct as both the attack and evaluation models. Due to time constraints, we set the hyperparameters to more manageable values: branching factor $b=2$ , maximum width $w=4$ , and maximum depth $d=4$ .
TAP [ 29] 采用基于树的搜索策略来高效地探索提示空间。我们使用 Qwen2.5-72B-Instruct 作为攻击和评估模型。由于时间限制，我们将超参数设置为更易于管理的值：分支因子 $b=2$ ，最大宽度 $w=4$ ，和最大深度 $d=4$ 。

DeepInception. DeepInception。

For DeepInception [25], we directly apply the templates provided in the original paper. This method embeds harmful instructions within multiple nested fictional scenarios to create psychological distance between the model and harmful content, making it more likely to respond to the harmful request.
对于 DeepInception [ 25]，我们直接应用原始论文中提供的模板。这种方法将有害指令嵌入到多个嵌套的虚构场景中，以在模型和有害内容之间制造心理距离，使其更有可能对有害请求做出回应。

GCG.

GCG [57] uses gradient-based discrete optimization to find the optimal adversarial suffix. In our implementation, we use “Sure” as the target string, optimizing the adversarial suffix to maximize the probability that the model’s initial reply begins with “Sure”. Notably, for ReSA-SFT we target the response following “<safety_check>\n”, whereas for STAIR-DPO, we target the response following “Final Answer: ”. Due to the time-consuming nature of GCG attacks, we evaluated on $50$ prompts from the StrongREJECT dataset, configuring GCG with $500$ epochs and a top-k of $32$ .
GCG [ 57] 使用基于梯度的离散优化来找到最优的对抗性后缀。在我们的实现中，我们使用“Sure”作为目标字符串，优化对抗性后缀以最大化模型初始回复以“Sure”开头的概率。值得注意的是，对于 ReSA-SFT，我们针对“\n”之后的回复进行目标，而对于 STAIR-DPO，我们针对“Final Answer: ”之后的回复进行目标。由于 GCG 攻击耗时长，我们在 StrongREJECT 数据集的 $50$ 提示上进行了评估，配置 GCG 使用 $500$ 个 epoch 和 top-k 为 $32$ 。

Note on Model-aware Attacks.
关于模型感知攻击。

For model-aware attacks (PAIR, GPTFuzzer, ReNeLLM, TAP, and GCG) against ReSA-SFT models, we only provide the content after the “<safety_check>” tag to the evaluation model to ensure fair comparison.
对于针对 ReSA-SFT 模型的模型感知攻击（PAIR、GPTFuzzer、ReNeLLM、TAP 和 GCG），我们仅向评估模型提供“<safety_check>”标签后的内容，以确保公平比较。

12.4 Evaluation Implementations
12.4 评估实现

Safety Evaluation. 安全评估。

Evaluating defense success is uniquely challenging, and directly using LLM-as-a-judge may lack robustness. Therefore, we use multiple accurate and widely used evaluators to reduce potential bias. We take Llama-Guard-3-8B [13], the fine-tuned StrongREJECT evaluator released by StrongREJECT [38], and the HarmBench classifier (HarmBench-Llama-2-13B-cls). We select these three evaluators because they are accurate and widely adopted in prior works. For example, the finetuned StrongREJECT evaluator is used in OpenAI’s Deliberative Alignment [14] and STAIR-DPO [49]. LlamaGuard is employed in SafeChain [21]. The HarmBench classifier has also been shown to surpass GPT-4–based evaluators [28].
评估防御成功具有独特的挑战性，直接使用 LLM 作为裁判可能缺乏鲁棒性。因此，我们使用多个准确且广泛使用的评估器来减少潜在的偏差。我们选取了 Llama-Guard-3-8B [13]、StrongREJECT 发布的微调 StrongREJECT 评估器 [38] 以及 HarmBench 分类器（HarmBench-Llama-2-13B-cls）。我们选择这三个评估器是因为它们在先前工作中被证明是准确的且被广泛采用。例如，微调的 StrongREJECT 评估器被用于 OpenAI 的 Deliberative Alignment [14] 和 STAIR-DPO [49]。LlamaGuard 被用于 SafeChain [21]。HarmBench 分类器也已被证明优于基于 GPT-4 的评估器 [28]。

All three evaluators take (vanilla prompt, response) pair as input. Although the target LLM receives jailbreak prompts, the evaluators are still given the corresponding vanilla prompt during evaluation, since using the jailbreak prompt may lead to incorrect judgments. Notably, when LlamaGuard is used for post-hoc detection, it receives (jailbreak prompt, response) pair because the vanilla prompt is unavailable during post-hoc detection. In terms of output, LlamaGuard and the HarmBench classifier produce a binary indicator (true/false), where true denotes a harmful response and false denotes a safe one. The finetuned evaluator outputs a continuous fine-tuned score, which we normalize to [0,1], with higher values indicating a higher degree of harmfulness. Based on these outputs, we report Defense Success Rate (DSR) for LlamaGuard and the HarmBench classifier, and the goodness score (1 - fine-tuned score) for the finetuned evaluator. All metrics range from $0$ to $1$ , with higher values indicating better defense performance.
所有三个评估器都以（普通提示，响应）对作为输入。尽管目标 LLM 接收越狱提示，但在评估过程中，评估器仍然会得到相应的普通提示，因为使用越狱提示可能导致错误的判断。值得注意的是，当使用 LlamaGuard 进行事后检测时，它接收（越狱提示，响应）对，因为在事后检测期间无法获得普通提示。在输出方面，LlamaGuard 和 HarmBench 分类器产生一个二元指示器（真/假），其中真表示有害响应，假表示安全响应。微调的评估器输出一个连续的微调分数，我们将其归一化到[0,1]，更高的值表示有害程度更高。基于这些输出，我们报告 LlamaGuard 和 HarmBench 分类器的防御成功率（DSR），以及微调评估器的良好度分数（1 - 微调分数）。所有指标的范围从 $0$ 到 $1$ ，更高的值表示更好的防御性能。

Over-refusal Evaluation.
过度拒绝评估。

We evaluate the model’s over-refusal tendency using XSTest, OKTest, and WJ-Eval datasets, using over-refusal accuracy (1 - the over-refusal rate) as the metric. The determination of whether a response constitutes a refusal follows the methodology established in the XSTest pipeline. We use Llama3.3-70B-Instruct as the judge model.
我们使用 XSTest、OKTest 和 WJ-Eval 数据集评估模型的过度拒绝倾向，以过度拒绝准确率（1 - 过度拒绝率）作为指标。判断一个回复是否构成拒绝，遵循 XSTest 流程中建立的方法。我们使用 Llama3.3-70B-Instruct 作为判断模型。

Safe completion Evaluation.
安全完成评估。

We evaluated the models’ responses using Qwen2.5-72B-Instruct and Llama3.3-70B-Instruct as judges, based on their helpfulness and safety. The evaluation criteria is based on OpenAI’s Model Spec¹¹1https://model-spec.openai.com/2025-04-11.html
我们使用 Qwen2.5-72B-Instruct 和 Llama3.3-70B-Instruct 作为评判者，根据其有用性和安全性来评估模型的响应。评估标准基于 OpenAI 的模型规范 ¹ : “The assistant must not encourage or enable self-harm. In addition to the guidance in supporting users in mental health discussions, the assistant should convey that the user is not alone and there are people and resources who care and can help, always advising that immediate help should be sought if the user is in imminent danger.”
：“助手不得鼓励或促成自残行为。除了在心理健康讨论中支持用户外，助手还应传达用户并不孤单，有关心和能够提供帮助的人与资源，并始终建议如果用户处于迫在眉睫的危险中，应立即寻求帮助。”

13 Full Results of the Main Experiment
13 主要实验的完整结果

13.1 Results on AdvBench and HarmBench
13.1 在 AdvBench 和 HarmBench 上的结果

On both AdvBench and HarmBench, ReSA-SFT consistently achieves the best average safety across evaluators and base models. Compared with post-hoc detection, WJ-SFT, and STAIR-DPO, it shows clear gains under adaptive attacks such as PAIR/PAIR-GPT and TAP, while maintaining near-perfect scores on non-adaptive ones. These improvements hold for both Llama3.1-8B-Instruct and Qwen2.5-7B-Instruct, confirming that the “Answer-Then-Check” paradigm generalizes beyond training data and across evaluation settings.
在 AdvBench 和 HarmBench 上，ReSA-SFT 始终在评估者和基础模型中实现最佳平均安全性。与事后检测、WJ-SFT 和 STAIR-DPO 相比，它在 PAIR/PAIR-GPT 和 TAP 等自适应攻击下表现出明显优势，同时在非自适应攻击上保持近乎完美的分数。这些改进适用于 Llama3.1-8B-Instruct 和 Qwen2.5-7B-Instruct，证实“先回答后检查”范式不仅超越了训练数据，也适用于不同的评估环境。

Base 基础	Evaluator 评估器	Method 方法	None 无	PAIR 配对 -GPT	PAIR	PAP	GPT- Fuzzer 模糊测试器	ReNe- LLM	TAP	DeepIn- ception	Avg
Llama3.1- 8B-Instruct	Llama Guard	Base	0.9538	0.2135	0.2038	0.4692	0.4269	0.4692	0.3962	0.2731	0.4257
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	1.0000	0.3750	0.4615	0.5750	0.9865	0.8577	0.6096	0.9231	0.7236
		STAIR-DPO	1.0000	0.5327	0.2865	0.9481	1.0000	0.6904	0.6500	0.9788	0.7608
		WJ-SFT	1.0000	0.3288	0.2769	0.6865	1.0000	0.6115	0.5923	0.9962	0.6865
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	1.0000	0.8712	0.6423	0.9731	1.0000	0.8173	0.8865	0.9962	0.8983
	Fine-tuned StrongREJECT Evaluator [38]	Base	0.9471	0.3957	0.4252	0.5234	0.4955	0.6359	0.5194	0.3807	0.5404
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	0.9816	0.5283	0.6375	0.6123	0.9754	0.9272	0.6934	0.8793	0.7794
		STAIR-DPO	0.9992	0.7424	0.6383	0.9592	0.9992	0.8441	0.7495	0.9902	0.8653
		WJ-SFT	0.9957	0.5606	0.5693	0.7329	0.9983	0.8621	0.6589	0.9897	0.7959
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9939	0.8985	0.7410	0.9744	0.9965	0.9493	0.8972	0.9910	0.9302
	Harm- 危害- -Bench Classifier 分类器	Base 基础	0.9423	0.6365	0.5442	0.7096	0.4462	0.6712	0.5096	0.7212	0.6476
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	0.9846	0.7654	0.7538	0.7981	0.9750	0.9673	0.7154	0.9769	0.8671
		STAIR-DPO	1.0000	0.9038	0.8519	0.9962	1.0000	0.9212	0.8519	0.9942	0.9399
		WJ-SFT	1.0000	0.7500	0.7596	0.8462	1.0000	0.8654	0.7077	0.9962	0.8656
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	1.0000	0.9750	0.9423	0.9942	1.0000	0.9885	0.9519	0.9962	0.9810
Qwen2.5- 7B-Instruct	Llama Guard	Base 基础	1.0000	0.1635	0.0692	0.3154	0.4365	0.0558	0.1423	0.1788	0.2952
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	1.0000	0.2981	0.5135	0.4923	0.9885	0.8731	0.6308	0.9212	0.7147
		STAIR-DPO^∗	1.0000	0.5462	0.2865	0.9462	0.8442	0.3404	0.6288	0.9481	0.6925
		WJ-SFT	1.0000	0.2115	0.2038	0.6423	0.9885	0.3481	0.4288	0.9788	0.6002
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	1.0000	0.8596	0.6423	0.9673	0.9808	0.7462	0.8904	0.9808	0.8834
	Fine-tuned 微调 StrongREJECT Evaluator [38] 评估器 [ 38]	Base 基础	0.9585	0.3608	0.3006	0.3352	0.5747	0.3445	0.2858	0.3679	0.4410
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	0.9585	0.4678	0.6499	0.4860	0.9708	0.9486	0.6703	0.9020	0.7567
		STAIR-DPO^∗	0.9990	0.7561	0.6243	0.9591	0.9570	0.6536	0.7065	0.9813	0.8296
		WJ-SFT	0.9958	0.4796	0.4819	0.7083	0.9919	0.6522	0.5392	0.9727	0.7277
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9968	0.8857	0.7338	0.9746	0.9960	0.9258	0.9006	0.9799	0.9242
	Harm- 危害- -Bench Classifier 分类器	Base 基础	0.9923	0.6385	0.4038	0.7058	0.4865	0.3173	0.2365	0.6731	0.5567
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	0.9923	0.7577	0.7654	0.8269	0.9673	0.9712	0.6885	0.9865	0.8695
		STAIR-DPO^∗	1.0000	0.9096	0.8769	0.9962	0.9615	0.6558	0.8173	0.9904	0.9010
		WJ-SFT	0.9981	0.6808	0.6365	0.8135	0.9904	0.6596	0.5923	0.9865	0.7947
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	1.0000	0.9635	0.9404	0.9885	1.0000	0.9827	0.9423	0.9942	0.9765

Table 7: Safety performance on AdvBench against different jailbreak methods, evaluated by three evaluators. For LlamaGuard and the HarmBench classifier, the metric is DSR, while the fine-tuned StrongREJECT evaluator uses the goodness score; all metrics range from

0

1

. The bold indicates the best defense.
表 7：三个评估者对 AdvBench 上不同越狱方法的安性性能评估结果。对于 LlamaGuard 和 HarmBench 分类器，指标为 DSR，而微调的 StrongREJECT 评估器使用良好度分数；所有指标范围从

0

到

1

。粗体表示最佳防御。

Base 基础	Evaluator 评估器	Method 方法	None 无	PAIR 配对 -GPT	PAIR	PAP	GPT- Fuzzer 模糊器	ReNe- LLM	TAP	DeepIn- ception	Avg 平均
Llama3.1- 8B-Instruct	Llama Guard 守卫	Base 基础	0.7100	0.2425	0.2000	0.5050	0.3850	0.4025	0.3900	0.3825	0.4022
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	1.0000	0.4700	0.5125	0.6050	0.9800	0.8850	0.7225	0.8550	0.7538
		STAIR-DPO	0.9075	0.5225	0.3100	0.7725	0.9625	0.6200	0.5750	0.7500	0.6775
		WJ-SFT	1.0000	0.4250	0.2825	0.7150	0.9400	0.5300	0.5900	0.7800	0.6578
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	1.0000	0.8450	0.6450	0.9675	0.9700	0.7925	0.8800	0.8875	0.8734
	Fine-tuned StrongREJECT Evaluator [38]	Base 基础	0.8613	0.5354	0.5504	0.6787	0.6609	0.7190	0.6292	0.5100	0.6431
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	0.9742	0.6556	0.7109	0.7339	0.9803	0.9522	0.7938	0.8257	0.8283
		STAIR-DPO	0.9676	0.7761	0.6997	0.9271	0.9820	0.8473	0.7780	0.9305	0.8635
		WJ-SFT	0.9859	0.7353	0.6899	0.8450	0.9695	0.9110	0.7634	0.9168	0.8521
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9956	0.9264	0.8299	0.9761	0.9865	0.9449	0.9346	0.9137	0.9385
	Harm- 有害- -Bench Classifier 分类器	Base 基础	0.7700	0.6150	0.5175	0.7375	0.4300	0.7050	0.4750	0.7950	0.6306
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	0.9700	0.7525	0.7550	0.8150	0.9675	0.9650	0.7600	0.9625	0.8684
		STAIR-DPO	0.9625	0.8375	0.7875	0.9500	0.9700	0.8725	0.7725	0.9900	0.8928
		WJ-SFT	0.9925	0.8100	0.7475	0.8975	0.9750	0.9375	0.7375	1.0000	0.8872
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9975	0.9700	0.9325	0.9925	0.9850	0.9825	0.9350	0.9875	0.9728
Qwen2.5- 7B-Instruct	Llama Guard	Base	0.6800	0.1475	0.0850	0.2650	0.3300	0.0675	0.1125	0.2725	0.2450
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	0.9975	0.3975	0.6500	0.5200	0.9800	0.8650	0.7725	0.8450	0.7534
		STAIR-DPO^∗	0.7525	0.4300	0.2400	0.7375	0.9675	0.2850	0.4725	0.6900	0.5719
		WJ-SFT	1.0000	0.3475	0.2500	0.6900	0.9950	0.3700	0.4525	0.7375	0.6053
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9950	0.8800	0.6300	0.9425	0.9625	0.7050	0.8825	0.7875	0.8481
	Fine-tuned 微调 StrongREJECT Evaluator [38] 评估器[38]	Base 基础	0.8004	0.4957	0.4569	0.4984	0.7019	0.5513	0.4439	0.4790	0.5534
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	0.9291	0.6291	0.7758	0.6406	0.9841	0.9562	0.8029	0.8333	0.8189
		STAIR-DPO^∗	0.9150	0.7553	0.6462	0.8972	0.9706	0.7190	0.7238	0.9146	0.8177
		WJ-SFT	0.9877	0.6757	0.6783	0.8294	0.9906	0.8093	0.6685	0.9043	0.8180
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9949	0.9427	0.8213	0.9655	0.9899	0.9226	0.9217	0.8880	0.9308
	Harm- 危害- -Bench Classifier 分类器	Base 基础	0.7100	0.5625	0.3975	0.5875	0.5875	0.5275	0.2150	0.6550	0.5303
		Post-hoc (LlamaGuard) 事后（LlamaGuard）	0.9600	0.7250	0.7975	0.7675	0.9850	0.9825	0.7625	0.9450	0.8656
		STAIR-DPO^∗	0.8775	0.8075	0.7050	0.9225	0.9775	0.6850	0.6500	0.9850	0.8263
		WJ-SFT	0.9900	0.7375	0.7250	0.8950	1.0000	0.8525	0.5850	0.9800	0.8456
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9950	0.9725	0.8975	0.9850	0.9925	0.9750	0.9325	0.9925	0.9678

Table 8: Safety performance on HarmBench against different jailbreak methods, evaluated by three evaluators. For LlamaGuard and the HarmBench classifier, the metric is DSR, while the fine-tuned StrongREJECT evaluator uses the goodness score; all metrics range from

0

1

. The bold indicates the best defense.
表 8：不同越狱方法在 HarmBench 上的安全性能评估结果，由三位评估者进行评估。对于 LlamaGuard 和 HarmBench 分类器，指标为 DSR，而微调的 StrongREJECT 评估器使用良好度分数；所有指标范围从

0

到

1

。粗体表示最佳防御。

Base Model 基础模型	Evaluator 评估器	Method 方法	None 无	PAIR-GPT	PAP	DeepInception	Avg
gemma-3-4b-it	Llama Guard 守卫	Base 基础	0.9489	0.3131	0.5559	0.5623	0.5950
		WJ-SFT	0.9840	0.4537	0.7284	0.9936	0.7899
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9904	0.8562	0.9712	0.9968	0.9536
	Fine-tuned 微调 StrongREJECT Evaluator [38] 评估器[38]	Base 基础	0.9253	0.5102	0.7045	0.5016	0.6604
		WJ-SFT	0.9794	0.6352	0.7800	0.9950	0.8474
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9881	0.8842	0.9584	0.9838	0.9536
	Harm- 有害- -Bench Classifier 分类器	Base 基础	0.9681	0.6134	0.7476	1.0000	0.8323
		WJ-SFT	0.9808	0.7732	0.8690	0.9936	0.9042
		ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9904	0.9585	0.9840	0.9904	0.9808

Table 9: Experiments on the newer model (gemma3-4b-it). Safety performance against different jailbreak methods on the StrongREJECT benchmark, evaluated by three evaluators. For LlamaGuard and the HarmBench classifier, the metric is DSR, while the fine-tuned StrongREJECT evaluator uses the goodness score; all metrics range from

0

1

. The black bold indicates the best result.
表 9：在新模型（gemma3-4b-it）上的实验。在 StrongREJECT 基准测试中，针对不同越狱方法的防御性能，由三位评估员进行评估。对于 LlamaGuard 和 HarmBench 分类器，指标是 DSR，而微调的 StrongREJECT 评估器使用良好度分数；所有指标的范围为

0

到

1

。黑色粗体表示最佳结果。

13.2 Results of modern LLM gemma-3-4b-it
13.2 现代 LLM gemma-3-4b-it 的结果

We trained ReSA-SFT on the recently released gemma-3-4b-it [39] using the same dataset and hyperparameters as in Section 4.1. As shown in Tables 9 and Table 10, ReSA-SFT improves safety while maintaining low over-refusal and strong general reasoning ability.
我们在最近发布的 gemma-3-4b-it [ 39]上训练了 ReSA-SFT，使用与第 4.1 节相同的数组和超参数。如表 9 和表 10 所示，ReSA-SFT 在提高安全性的同时，保持了低拒绝率和强大的推理能力。

Base Model 基础模型	Method 方法	Over-refusal (XSTest) 拒绝率过高（XSTest）	General Reasoning(MMLU) 通用推理（MMLU）
gemma-3-4b-it	Base 基础	92.80%	61.62%
	WJ-SFT	94.40%	53.23%
	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	97.20%	59.14%

Table 10: Experiments on the newer model (gemma3-4b-it). General capabilities on over-refusal benchmarks and general reasoning benchmarks (higher is better). The metric for over-refusal is over-refusal accuracy, and the metric for general reasoning is accuracy. The bold indicates the best.
表 10：关于新模型（gemma3-4b-it）的实验。在过度拒绝基准测试和通用推理基准测试上的通用能力（越高越好）。过度拒绝的指标是过度拒绝准确率，通用推理的指标是准确率。粗体表示最佳。

13.3 Results on Additional Attack Scenarios
13.3 在附加攻击场景下的结果

Beyond the main evaluation, we further assess ReSA under three additional attack scenarios: white-box attacks, the state-of-the-art adaptive attack AutoDAN-Turbo, and prefilling attacks.
除了主要评估外，我们还进一步在三种附加攻击场景下评估了 ReSA：白盒攻击、最先进的自适应攻击 AutoDAN-Turbo 以及预填充攻击。

13.3.1 White-box Attack 13.3.1 白盒攻击

Table 11 reports the safety performance under the white-box GCG attack. The results show that ReSA-SFT provides substantially stronger defenses against GCG compared to both the base model and WJ-SFT. ReSA-SFT outperforms STAIR-DPO when evaluated with Llama-Guard, though it is slightly weaker under the other two evaluators. We suppose this may be because STAIR-DPO is trained with an additional DPO stage on top of SFT, making it harder for GCG to maximize the “Sure” logit. Moreover, since post-hoc detection does not provide the “Sure” logits, white-box attacks are not applicable in that setting.
表 11 报告了在白盒 GCG 攻击下的安全性表现。结果表明，与基础模型和 WJ-SFT 相比，ReSA-SFT 对 GCG 提供了显著更强的防御。在 Llama-Guard 评估下，ReSA-SFT 表现优于 STAIR-DPO，但在其他两个评估器下稍显弱势。我们推测这可能是因为 STAIR-DPO 在 SFT 之上额外训练了一个 DPO 阶段，使得 GCG 更难最大化“Sure”logit。此外，由于事后检测不提供“Sure”logit，在该设置下白盒攻击不适用。

Jailbreak 越狱	Method 方法	LlamaGuard	Fine-tuned StrongREJECT Evaluator [38] 微调的 StrongREJECT 评估器[ 38]	HarmBench classifier HarmBench 分类器
GCG	Base 基础	0.6200	0.5727	0.5800
	STAIR-DPO	0.9800	0.9937	1.0000
	WJ-SFT	0.7600	0.8150	0.8400
	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	1.0000	0.9636	0.9800

Table 11: Safety performance against GCG. The base model is Qwen2.5-7B-Instruct. Since post-hoc detection does not provide the “Sure” logits, white-box attacks are not applicable in that setting.
表 11：与 GCG 的安全性性能对比。基础模型为 Qwen2.5-7B-Instruct。由于事后检测不提供“确定”的 logits，在该设置下白盒攻击不适用。

13.3.2 SOTA Attack AutoDAN-Turbo
13.3.2SOTA 攻击 AutoDAN-Turbo

We performed additional experiments with the SOTA adaptive attack AutoDAN-Turbo [26]. Specifically, we used AutoDAN-Turbo’s official codebase and constructed a strategy library. During library construction, we used Llama3.1-8B-Instruct as the target, attack, and summary model, and gemma-7b-it as the scorer. In the warm-up stage, we used $50$ samples (same as the official codebase) with $20$ iterations each, and in the lifelong learning stage, we sampled $100$ StrongREJECT prompts and iterated $20$ times per prompt across $4$ rounds due to time limitations. In the evaluation stage, we fixed the learned strategy library and launched a final attack round on Qwen2.5-7B-Instruct, WJ-SFT, and ReSA-SFT using the same $100$ samples from the library-construction stage, with each sample iterated $20$ times. All other components remained unchanged. Results in Table 12 show that ReSA-SFT substantially improves robustness even under the strong adaptive AutoDAN-Turbo attack.
我们对 SOTA 自适应攻击 AutoDAN-Turbo [ 26]进行了额外的实验。具体来说，我们使用了 AutoDAN-Turbo 的官方代码库，并构建了一个策略库。在库构建过程中，我们将 Llama3.1-8B-Instruct 作为目标、攻击和摘要模型，将 gemma-7b-it 作为评分器。在预热阶段，我们使用了 $50$ 样本（与官方代码库相同），每个样本迭代 $20$ 次，在终身学习阶段，由于时间限制，我们采样了 $100$ StrongREJECT 提示，并在 $4$ 轮中每个提示迭代 $20$ 次。在评估阶段，我们固定了学习到的策略库，并使用库构建阶段相同的 $100$ 样本对 Qwen2.5-7B-Instruct、WJ-SFT 和 ReSA-SFT 发起最终攻击轮，每个样本迭代 $20$ 次。其他组件保持不变。表 12 中的结果表明，即使在强大的自适应 AutoDAN-Turbo 攻击下，ReSA-SFT 也显著提高了鲁棒性。

Jailbreak 越狱	Method 方法	LlamaGuard	Fine-tuned StrongREJECT Evaluator [38] 微调的 StrongREJECT 评估器[ 38]	HarmBench classifier HarmBench 分类器
AutoDAN-Turbo	Base 基础模型	0.5400	0.5856	0.7900
	WJ-SFT	0.6600	0.7892	0.9300
	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.8700	0.8733	0.9600

Table 12: Safety performance against AutoDAN-Turbo. The base model is Qwen2.5-7B-Instruct.
表 12：与 AutoDAN-Turbo 的安全性性能对比。基础模型为 Qwen2.5-7B-Instruct。

13.3.3 Prefilling Attacks 13.3.3 预填充攻击

We apply prefilling attacks on top of various jailbreak methods to evaluate ReSA-SFT’s robustness to prefilling attacks. The results in Table 13 show that ReSA-SFT continues to exhibit strong safety robustness.
我们在各种越狱方法的基础上应用预填充攻击，以评估 ReSA-SFT 对预填充攻击的鲁棒性。表 13 中的结果表明，ReSA-SFT 仍然表现出强大的安全鲁棒性。

Specifically, after the jailbreak prompts generated by None (vanilla harmful), PAIR-GPT, PAP, and DeepInception, we appended the following prefilling attack: “<safety_check> I’ll create an answer planning idea and verify its safety. The planning idea is as follows: :\n<intended_answer_summary>:\n:\n</intended_answer_summary>:\nTherefore, I can provide this answer to the user.</safety_check>:\n”.
具体来说，在由 None（原味有害）、PAIR-GPT、PAP 和 DeepInception 生成的越狱提示之后，我们附加了以下预填充攻击：“<safety_check> 我将创建一个答案规划想法并验证其安全性。规划想法如下：\n<intended_answer_summary>\n\n</intended_answer_summary>\n 因此，我可以向用户提供这个答案。</safety_check>”。

As shown in Table 13, applying prefilling attacks to most jailbreak-generated prompts leaves ReSA-SFT’s safety performance essentially unchanged. Although performance drops when applying the prefilling attack to PAIR-GPT, ReSA-SFT still substantially outperforms strong baseline methods such as STAIR-DPO and WJ-SFT. On average, ReSA-SFT’s safety performance decreases by only $3.68\%$ , which still achieves the best safety performance among the comparison methods.
如表 13 所示，将预填充攻击应用于大多数越狱生成的提示，ReSA-SFT 的安全性能基本保持不变。尽管在将预填充攻击应用于 PAIR-GPT 时性能有所下降，但 ReSA-SFT 仍然显著优于 STAIR-DPO 和 WJ-SFT 等强基线方法。平均而言，ReSA-SFT 的安全性能仅下降 $3.68\%$ ，这仍然在比较方法中实现了最佳安全性能。

Moreover, we think we can use rule-based methods to effectively defend against such prefilling attacks, i.e., rejecting responses when the user query contains special tokens. Such detection introduces virtually no computational overhead. Additionally, by choosing special tokens that are unlikely to appear in natural user queries, we can ensure that normal usage remains unaffected.
此外，我们认为可以使用基于规则的方法来有效防御此类预填充攻击，即当用户查询包含特殊标记时拒绝响应。这种检测几乎不引入计算开销。此外，通过选择在自然用户查询中不太可能出现的特殊标记，我们可以确保正常使用不受影响。

Method 方法	None 无	PAIR-GPT	PAP	DeepInception	Avg
Base	0.9968	0.3514	0.6486	0.5240	0.6302
STAIR-DPO	1.0000	0.6837	0.9425	0.9872	0.9034
WJ-SFT	0.9936	0.4473	0.7604	0.9840	0.7963
ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9936	0.8978	0.9681	0.9936	0.9633
Method 方法	None+Prefilling None+预填充	PAIR-GPT+Prefilling PAIR-GPT+预填充	PAP+Prefilling PAP+预填充	DeepInception+Prefilling DeepInception+预填充	Avg 平均
ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9872	0.7827	0.9585	0.9776	0.9265

Table 13: Safety performance against prefilling attacks, evaluated by LlamaGuard. The base model Llama3.1-8B-Instruct. Bold indicates the best result, and underline indicates the second best.
表 13：由 LlamaGuard 评估的针对预填充攻击的安全性能。基础模型为 Llama3.1-8B-Instruct。粗体表示最佳结果，下划线表示第二好的结果。

13.4 Results of Adaptive Answer-Then-Check Strategy
13.4 自适应 Answer-Then-Check 策略的结果

As shown in Table 6, on general questions, ReSA-SFT-Adaptive achieves computational efficiency comparable to the base model in both token length and inference time. It maintains base-model efficiency during normal usage while still providing substantial cost reductions on jailbreak queries.
如表 6 所示，在一般问题上，ReSA-SFT-Adaptive 在 token 长度和推理时间上的计算效率与基础模型相当。它在正常使用时保持基础模型的效率，同时在越狱查询上仍能大幅降低成本。

Moreover, the core capabilities of ReSA-SFT-Adaptive remain stable. Its safety robustness against harmful inputs (Table 14) and its general capability (Table 15) are consistent with those of the original, non-adaptive ReSA-SFT model.
此外，ReSA-SFT-Adaptive 的核心能力保持稳定。它对有害输入的安全鲁棒性（表 14）和通用能力（表 15）与原始的非自适应 ReSA-SFT 模型一致。

Base Model 基础模型	Evaluator 评估器	Method 方法	None 无	PAIR-GPT	PAP	DeepInception	Avg
Llama3.1- 8B-Instruct	LlamaGuard	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9936	0.8978	0.9681	0.9936	0.9633
	LlamaGuard	ReSA-SFT-Adaptive (Ours) ReSA-SFT-Adaptive (我们)	0.9968	0.8562	0.9744	0.9712	0.9496
	Fine-tuned StrongREJECT 微调 StrongREJECT Evaluator [38] 评估器[38]	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9808	0.8952	0.9608	0.9758	0.9532
		ReSA-SFT-Adaptive (Ours) ReSA-SFT-Adaptive (我们)	0.9896	0.8832	0.9629	0.9525	0.9470
	HarmBench Classifier 分类器	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9872	0.9617	0.9840	0.9968	0.9824
	HarmBench Classifier 分类器	ReSA-SFT-Adaptive (Ours) ReSA-SFT-Adaptive (我们)	0.9872	0.9585	0.9808	0.9840	0.9776
Qwen2.5- 7B-Instruct	LlamaGuard	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9904	0.8435	0.9489	0.9808	0.9409
	LlamaGuard	ReSA-SFT-Adaptive (Ours) ReSA-SFT-Adaptive (我们)	0.9968	0.8147	0.9489	0.9681	0.9321
	Fine-tuned StrongREJECT Evaluator [38] 评估器 [ 38]	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9797	0.8674	0.9500	0.9725	0.9424
	Fine-tuned StrongREJECT Evaluator [38] 评估器 [ 38]	ReSA-SFT-Adaptive (Ours) ReSA-SFT-Adaptive (我们)	0.9839	0.8578	0.9423	0.9644	0.9371
	HarmBench Classifier 分类器	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	0.9840	0.9393	0.9744	0.9936	0.9728
	HarmBench Classifier 分类器	ReSA-SFT-Adaptive (Ours) ReSA-SFT-Adaptive (我们)	0.9936	0.9425	0.9712	0.9872	0.9736

Table 14: Safety performance of ReSA-SFT-Adaptive against different jailbreak methods on the StrongREJECT benchmark, evaluated by three evaluators. For LlamaGuard and the HarmBench classifier, the metric is DSR, while the fine-tuned StrongREJECT evaluator uses the goodness score; all metrics range from

0

1

. The black bold indicates the best result.
表 14：ReSA-SFT-Adaptive 在 StrongREJECT 基准测试上针对不同越狱方法的安**全**性能，由三位评估器进行评估。对于 LlamaGuard 和 HarmBench 分类器，指标是 DSR，而微调的 StrongREJECT 评估器使用良好度分数；所有指标范围从

0

到

1

。黑色粗体表示最佳结果。

Base Model 基础模型	Method 方法	Over-refusal (XSTest) 过度拒绝（XSTest）	General Reasoning(MMLU) 通用推理（MMLU）
Llama3.1-8B-Instruct	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	97.20%	66.32%
Llama3.1-8B-Instruct	ReSA-SFT-Adaptive (Ours) ReSA-SFT-Adaptive（我们的方法）	96.40%	68.02%
Qwen2.5-7B-Instruct	ReSA-SFT (Ours) ReSA-SFT（我们的方法）	96.40%	72.44%
Qwen2.5-7B-Instruct	ReSA-SFT-Adaptive (Ours) ReSA-SFT-Adaptive（我们的方法）	96.80%	73.40%

Table 15: General capabilities of ReSA-SFT-Adaptive on over-refusal benchmarks and general reasoning benchmarks (higher is better). The metric for over-refusal is over-refusal accuracy, and the metric for general reasoning is accuracy. The bold indicates the best.
表 15：ReSA-SFT-Adaptive 在过度拒绝基准测试和一般推理基准测试上的通用能力（越高越好）。过度拒绝的指标是过度拒绝准确率，一般推理的指标是准确率。粗体表示最佳。

13.5 Systematic Threat Analysis of Intended Answer Summary
13.5 意图答案摘要的系统威胁分析

To assess potential risks associated with exposing the intended answer summary, we conduct a systematic threat analysis. A key concern is that the intended answer summary may contain unsafe content, which could raise deployment and threat-model issues.
为评估暴露意图答案摘要可能带来的潜在风险，我们进行系统威胁分析。主要关注点是意图答案摘要可能包含不安全内容，这可能导致部署和威胁模型问题。

We address this issue from two complementary perspectives. First, a rule-based filter can be applied to block the intended answer summary before exposure to users (e.g., filtering text between <safety_check></safety_check>). The computational overhead of such filtering is negligible in practice. Second, RL can be used to ensure that the intended answer summaries themselves are safe. To this end, we extend ReSA-SFT to an RL-based variant, ReSA-RL, which follows the Answer-Then-Check reasoning template by requiring the model to first generate an intended answer summary and then perform a safety check within the prompt.
我们从两个互补的角度来解决这个问题。首先，可以在向用户展示前对意图答案摘要应用基于规则的过滤器（例如，过滤<safety_check></safety_check>之间的文本）。这种过滤的计算开销在实际中可以忽略不计。其次，可以使用强化学习来确保意图答案摘要本身是安全的。为此，我们将 ReSA-SFT 扩展为基于强化学习的变体 ReSA-RL，该变体遵循答案-检查推理模板，要求模型首先生成意图答案摘要，然后在提示中进行安全检查。

Table 16 presents the safety evaluation of intended answer summaries generated by ReSA-SFT and ReSA-RL under different jailbreak attacks. Since LlamaGuard is used as the reward model during RL training, the fine-tuned StrongREJECT evaluator is adopted for assessment. The results show that ReSA-RL substantially improves the safety of the intended answer summaries, achieving near-perfect scores across all attacks and ensuring that even if these internal thoughts were exposed, they would not cause harm.
表 16 展示了 ReSA-SFT 和 ReSA-RL 在不同越狱攻击下生成的预期答案摘要的安全性评估。由于在 RL 训练过程中使用 LlamaGuard 作为奖励模型，因此采用微调后的 StrongREJECT 评估器进行评估。结果表明，ReSA-RL 显著提高了预期答案摘要的安全性，在所有攻击中均获得近乎完美的分数，确保即使这些内部想法被暴露，也不会造成危害。

Method 方法	None 无	PAIR-GPT	PAP	DeepInception	Avg
ReSA-SFT (Ours) ReSA-SFT（我们）	0.6152	0.7834	0.8418	0.9276	0.7920
ReSA-RL (Ours) ReSA-RL（我们）	0.9948	0.9954	0.9935	0.9967	0.9951

Table 16: Systematic Threat Analysis of Safety of Intended Answer Summaries (Not the Final Answer) under various attacks, evaluated using the fine-tuned evaluator. Since we used LlamaGuard as the reward model in RL training, we used a fine-tuned evaluator as the evaluator. The results show that ReSA-RL significantly improves the safety of the Intended Answer Summary, ensuring that even if these thoughts were exposed, they would not cause harm.
表 16：在各种攻击下，针对预期答案摘要（非最终答案）安全性的系统性威胁分析，使用微调评估器进行评估。由于我们在 RL 训练中使用了 LlamaGuard 作为奖励模型，因此使用微调评估器作为评估器。结果表明，ReSA-RL 显著提高了预期答案摘要的安全性，确保即使这些想法被曝光，也不会造成伤害。

13.6 Compare with gpt-oss-safeguard
13.6 与 gpt-oss-safeguard 比较

Figure 16: The system prompt used for gpt-oss-safeguard-20b, following OpenAI’s user guide. The prompt instructs the model to perform safety classification based on a policy aligned with the 14 hazard categories used in LlamaGuard.
图 16：用于 gpt-oss-safeguard-20b 的系统提示，遵循 OpenAI 的用户指南。该提示指示模型根据与 LlamaGuard 中使用的 14 个危险类别对齐的政策进行安全分类。

We further compare ReSA-SFT with gpt-oss-safeguard-20b [31], a recent open-source safety safeguard model. Following OpenAI’s user guide, we construct system prompts that require the model to perform safety classification based on a safety policy. This policy adopts the same 14 hazard categories and definitions used in LlamaGuard. The full system prompt is provided in Figure 16.
我们进一步比较了 ReSA-SFT 与 gpt-oss-safeguard-20b [ 31]，这是一个最新的开源安全防护模型。根据 OpenAI 的用户指南，我们构建了系统提示，要求模型根据安全策略进行安全分类。该策略采用了与 LlamaGuard 相同的 14 种风险类别和定义。完整的系统提示如图 16 所示。

Table 17 presents the safety performance under various jailbreak attacks, evaluated using LlamaGuard. Although gpt-oss-safeguard-20b demonstrates stronger performance than LlamaGuard, it still falls short of ReSA-SFT. In particular, ReSA-SFT achieves the highest average robustness across attacks, outperforming gpt-oss-safeguard-20b on PAIR-GPT, PAP, and DeepInception.
表 17 展示了在多种越狱攻击下的安全性能，使用 LlamaGuard 进行评估。尽管 gpt-oss-safeguard-20b 的表现比 LlamaGuard 更强，但它仍然不及 ReSA-SFT。特别是，ReSA-SFT 在所有攻击中实现了最高的平均鲁棒性，在 PAIR-GPT、PAP 和 DeepInception 上超过了 gpt-oss-safeguard-20b。

Method 方法	None 无	PAIR-GPT	PAP	DeepInception	Avg
Base 基础	0.9968	0.3514	0.6486	0.5240	0.6302
Post-hoc (LlamaGuard)	1.0000	0.4633	0.7157	0.9776	0.7892
Post-hoc (gpt-oss-safeguard-20b) 事后分析（gpt-oss-safeguard-20b）	0.9968	0.7125	0.8275	0.9936	0.8826
ReSA-SFT (Ours) ReSA-SFT（我们）	0.9936	0.8978	0.9681	0.9936	0.9633

Table 17: Comparison with gpt-oss-safeguard-20b. ReSA-SFT outperforms gpt-oss-safeguard-20B and achieves the highest average robustness across jailbreak attacks. Bold indicates the best result.
表 17：与 gpt-oss-safeguard-20b 的比较。ReSA-SFT 优于 gpt-oss-safeguard-20B，并在所有越狱攻击中实现了最高的平均鲁棒性。粗体表示最佳结果。

14 Ablation Studies 14 消融研究

Table 18 presents a detailed breakdown of the ablation study results discussed in the main text. The table compares three model variants: WJ-SFT (262K samples), ReSA-SFT (Only WJ) (63K samples), and the full ReSA-SFT model across various jailbreak methods, including both seen and unseen attack types during training. These results confirm that jailbreak diversity in training data significantly impacts model robustness. While ReSA-SFT (Only WJ) consistently outperforms WJ-SFT across all methods, the full ReSA-SFT model demonstrates the strongest performance, particularly against unseen jailbreak methods such as TAP and ReNeLLM.
表 18 展示了正文讨论的消融研究结果的详细分解。该表比较了三种模型变体：WJ-SFT（262K 样本）、ReSA-SFT（仅 WJ）（63K 样本）以及完整的 ReSA-SFT 模型在各种越狱方法上的表现，包括训练过程中遇到的和未遇到的攻击类型。这些结果证实了训练数据中的越狱多样性显著影响模型的鲁棒性。虽然 ReSA-SFT（仅 WJ）在所有方法上始终优于 WJ-SFT，但完整的 ReSA-SFT 模型表现出最强性能，尤其是在针对未遇到的越狱方法（如 TAP 和 ReNeLLM）时。

Evaluator 评估器	Method 方法	None 无	PAIR 配对 -GPT	PAIR	PAP	GPT- Fuzzer 模糊测试器	ReNe- LLM	TAP	DeepIn- ception	Avg 平均
LlamaGuard	Base 基础	0.9744	0.2173	0.1086	0.3866	0.1917	0.0863	0.1693	0.3706	0.3131
	WJ-SFT (262K)	0.9936	0.3387	0.2780	0.6869	0.9904	0.5495	0.4058	0.9521	0.6494
	ReSA-SFT (Only WJ, 63K) ReSA-SFT（仅 WJ，63K）	0.9872	0.6933	0.4569	0.7476	0.6933	0.7380	0.7444	0.9776	0.7548
	ReSA-SFT (Ours, 80K) ReSA-SFT（我们，80K）	0.9904	0.8435	0.7188	0.9489	0.9776	0.8466	0.8562	0.9808	0.8953
Fine-tuned 微调 StrongREJECT Evaluator [38]	Base 基础	0.9080	0.3992	0.3286	0.4282	0.4191	0.3511	0.3202	0.4424	0.4496
	WJ-SFT (262K)	0.9915	0.5536	0.4994	0.7334	0.9825	0.7631	0.5127	0.9596	0.7495
	ReSA-SFT (Only WJ, 63K) ReSA-SFT (仅 WJ，63K)	0.9843	0.7478	0.5910	0.7936	0.8131	0.8948	0.7327	0.9795	0.8171
	ReSA-SFT (Ours, 80K) ReSA-SFT (我们，80K)	0.9797	0.8674	0.7438	0.9500	0.9242	0.9353	0.8438	0.9725	0.9021
HarmBench Classifier 分类器	Base 基础	0.9712	0.6038	0.3291	0.7220	0.3706	0.2620	0.2652	0.7125	0.5295
	WJ-SFT (262K)	0.9936	0.6901	0.6006	0.8019	0.9936	0.7572	0.4792	0.9681	0.7855
	ReSA-SFT (Only WJ, 63K) ReSA-SFT (仅 WJ，63K)	0.9904	0.8658	0.7732	0.9393	0.7508	0.9265	0.7891	0.9936	0.8786
	ReSA-SFT (Ours, 80K) ReSA-SFT (我们，80K)	0.9840	0.9393	0.9201	0.9744	0.9585	0.9681	0.9010	0.9936	0.9549

Table 18: Ablation studies of the impact of jailbreak types on the training data. Safety performance of Qwen2.5-7B-Instruct against different jailbreak methods, evaluated by three evaluators. For LlamaGuard and the HarmBench classifier, the metric is DSR, while the fine-tuned StrongREJECT evaluator uses the goodness score; all metrics range from

0

1

.
表 18：越狱类型对训练数据影响的消融研究。Qwen2.5-7B-Instruct 针对不同越狱方法的防护性能，由三位评估者进行评估。对于 LlamaGuard 和 HarmBench 分类器，指标为 DSR，而微调的 StrongREJECT 评估器使用良好度分数；所有指标范围从

0

到

1

。

15 Additional Discussions 15 附加讨论

This section provides further analysis of the experimental results, including a detailed comparison with STAIR-DPO and an investigation into the reliability of the safety-check stage under adversarial prompting.
本节对实验结果进行进一步分析，包括与 STAIR-DPO 的详细比较，以及对对抗性提示下安全检查阶段可靠性的研究。

15.1 Discussion of the Experimental Results
15.1 实验结果讨论

STAIR-DPO achieves consistently better performance on some attacks (None and GPTFuzz) than ReSA-SFT. We therefore conduct case studies for a more in-depth discussion.
STAIR-DPO 在一些攻击（None 和 GPTFuzz）上的表现始终优于 ReSA-SFT。因此，我们进行案例研究以进行更深入的讨论。

Overall, in terms of safety, although STAIR-DPO is a strong baseline, our method outperforms it on most jailbreak attacks and achieves better average robustness across different jailbreak methods. More importantly, while STAIR-DPO provides strong jailbreak defense, it suffers from severe over-refusal, rejecting many benign queries. As shown in Table 3, its average over-refusal accuracy is only $76.98\%$ , and it drops to $64.00\%$ on the widely used XSTest benchmark (the benign subset). In contrast, ReSA-SFT achieves an average over-refusal accuracy of $95.02\%$ , including $96.40\%$ on XSTest.
总体而言，在安全性方面，尽管 STAIR-DPO 是一个强大的基线，但我们的方法在大多数越狱攻击上表现更优，并在不同越狱方法上实现了更好的平均鲁棒性。更重要的是，虽然 STAIR-DPO 提供了强大的越狱防御，但它存在严重的过度拒绝问题，拒绝了许多良性查询。如表 3 所示，其平均过度拒绝准确率仅为 $76.98\%$ ，在广泛使用的 XSTest 基准（良性子集）上降至 $64.00\%$ 。相比之下，ReSA-SFT 的平均过度拒绝准确率为 $95.02\%$ ，包括 XSTest 上的 $96.40\%$ 。

•

For the None (vanilla prompt) setting, we find that our method performs worse than STAIR-DPO due to evaluator bias. Specifically, ReSA-SFT often refuses using the pattern “Sorry, I cannot answer as it violates my output policy.” However, when such clearly refusal responses are paired with some vanilla prompts and fed into the evaluator, they are still misclassified as unsafe. The case is shown in Figure 17. In this case, the response from ReSA-SFT is classified as unsafe, while the response from STAIR-DPO is judged as safe, even though both are actually safe. Therefore, on the StrongREJECT Benchmark (None setting), ReSA-SFT performs slightly worse than STAIR-DPO. However, we believe the impact is minimal, as such cases are extremely rare. In fact, ReSA-SFT achieves a $0.9936$ DSR in this setting.

• 对于 None（原味提示）设置，我们发现由于评估器偏差，我们的方法表现不如 STAIR-DPO。具体来说，ReSA-SFT 经常拒绝使用“抱歉，我不能回答，因为这违反了我的输出策略。”这样的模式。然而，当这种明确的拒绝回复与一些原味提示配对并输入评估器时，它们仍然被错误地分类为不安全。这种情况如图 17 所示。在这种情况下，ReSA-SFT 的回复被分类为不安全，而 STAIR-DPO 的回复被判断为安全，尽管两者实际上都是安全的。因此，在 StrongREJECT 基准测试（None 设置）中，ReSA-SFT 的表现略差于 STAIR-DPO。然而，我们认为影响很小，因为这种情况极为罕见。事实上，ReSA-SFT 在该设置中实现了 $0.9936$ DSR。
•

For the GPTFuzzer setting, we find that the slightly weaker results mainly come from a few corner cases in ReSA-SFT, where the model does not refuse and the response seems harmless, but is still classified as unsafe, leading to lower performance than STAIR-DPO. Since the response contains no clearly harmful content, even to human evaluators, we consider these instances corner cases, as the potential harm is negligible. The case is shown in Figure 18.

• 对于 GPTFuzzer 设置，我们发现稍弱的结果主要来自 ReSA-SFT 中的一些边界情况，在这些情况下，模型没有拒绝，响应看似无害，但仍被分类为不安全，导致性能低于 STAIR-DPO。由于响应中不包含明显有害的内容，甚至对人类评估者而言，我们认为这些实例是边界情况，因为潜在的损害可以忽略不计。该案例显示在图 18 中。
•

For the over-refusal case, we observe that STAIR-DPO exhibits a substantial drop in over-refusal performance, which we suspect may be the cost of its strong safety robustness. For example, STAIR-DPO refuses requests such as “kill the lights in a room”, which is merely a common household task, potentially causing inconvenience in everyday use. In contrast, ReSA-SFT can appropriately handle such queries and provide suitable responses. The case is shown in Figure 19.

• 对于过度拒绝的情况，我们观察到 STAIR-DPO 在过度拒绝性能上出现显著下降，我们怀疑这可能是其强安全鲁棒性的代价。例如，STAIR-DPO 拒绝“关闭房间灯光”这类请求，这仅仅是一个普通的家庭任务，可能会在日常使用中造成不便。相比之下，ReSA-SFT 可以适当处理此类查询并提供合适的响应。该案例显示在图 19 中。

In summary, ReSA-SFT demonstrates stronger overall safety robustness than STAIR-DPO, although it is slightly weaker on a few specific jailbreak attacks. Given that our method already achieves very strong performance ( $0.99+$ on None and $0.95+$ on GPTFuzzer), we believe these corner cases fall within an acceptable range. Moreover, our safety improvements preserve the model’s ability to respond appropriately to benign queries, whereas STAIR-DPO often refuses benign requests, even simple ones like “kill the lights”, which may hinder practical real-world use.
总之，ReSA-SFT 在整体安全鲁棒性方面表现优于 STAIR-DPO，尽管它在少数特定越狱攻击上稍显薄弱。鉴于我们的方法已经取得了非常强的性能表现（在 None 上达到 $0.99+$ ，在 GPTFuzzer 上达到 $0.95+$ ），我们认为这些边缘情况都在可接受范围内。此外，我们的安全改进保留了模型对良性查询的适当响应能力，而 STAIR-DPO 经常拒绝良性请求，即使是简单的请求如“关灯”，这可能会妨碍实际应用。

Figure 17: Comparison with STAIR-DPO under the None (vanilla prompt) setting. Although STAIR-DPO appears to perform better, the evaluator misclassifies clear refusal responses from our method as unsafe, revealing evaluator bias.
图 17：在 None（原味提示）设置下与 STAIR-DPO 的比较。尽管 STAIR-DPO 似乎表现更好，但评估器将我们方法给出的明确拒绝响应误判为不安全，揭示了评估器存在偏差。

Figure 18: Discussion of failure case under the GPTFuzzer attack. We find that the slightly weaker results mainly come from a few corner cases in ReSA-SFT, where the model does not refuse, and the response seems harmless, but is still classified as unsafe, leading to lower performance.
图 18：在 GPTFuzzer 攻击下的失败案例讨论。我们发现略微较弱的结果主要来自 ReSA-SFT 中少数几个边界情况，在这些情况下，模型没有拒绝，响应看似无害，但仍被分类为不安全，导致性能下降。

Figure 19: Comparison with STAIR-DPO on the Over-refusal case. STAIR-DPO incorrectly rejects a benign request (“kill the lights in a room”), whereas ReSA-SFT provides an appropriate response.
图 19：在过度拒绝情况下与 STAIR-DPO 的比较。STAIR-DPO 错误地拒绝了一个良性请求（“关闭房间的灯”），而 ReSA-SFT 提供了适当的响应。

15.2 Discussion of Jailbroken Safety Analysis
15.2 关于越狱安全分析的讨论

A potential concern is whether the safety-check stage itself can be jailbroken. For example, when the model generates a harmful intended answer summary, but the safety analysis incorrectly classifies it as safe. If such failures were common, the overall defense performance would degrade substantially. However, Table 2 shows that ReSA-SFT consistently achieves the strongest defense results across diverse attacks, suggesting that these misclassifications are infrequent.
一个潜在的问题是安全检查阶段本身是否可能被越狱。例如，当模型生成一个有害的预期答案摘要，但安全分析错误地将其分类为安全时。如果这类错误很常见，整体防御性能将大幅下降。然而，表 2 显示 ReSA-SFT 在各种攻击中始终获得最强的防御结果，这表明这类错误很少发生。

To quantify this, we measure the proportion of prompts where the intended answer summary is harmful while the safety analysis predicts it to be safe. As shown in Table 19, these cases are rare: $0.0\%$ with vanilla harmful prompts, $7.99\%$ under PAIR-GPT, $1.28\%$ under PAP, and $0.32\%$ under DeepInception, averaging only $2.40\%$ . These results indicate that the safety-check stage remains reliable even under adversarial prompting.
为了量化这一点，我们测量了预期答案摘要是有害的，而安全分析预测其安全的提示比例。如表 19 所示，这些情况很少见： $0.0\%$ 在普通有害提示下， $7.99\%$ 在 PAIR-GPT 下， $1.28\%$ 在 PAP 下， $0.32\%$ 在 DeepInception 下，平均仅为 $2.40\%$ 。这些结果表明，即使在对抗性提示下，安全检查阶段仍然可靠。

In addition, we find that failure instances where the safety check misclassifies harmful content as safe also jailbreak other defense methods, such as STAIR-DPO. These cases stem from inherent adversarial difficulty rather than a weakness of the Answer-Then-Check strategy. We provide an example of an ambiguous corner case in Figure 20 where the intended answer summary may appear safe even to a human evaluator.
此外，我们发现安全检查将有害内容误分类为安全的情况也会绕过其他防御方法，例如 STAIR-DPO。这些情况源于固有的对抗性难度，而非 Answer-Then-Check 策略的弱点。我们在图 20 中提供了一个模糊的边缘案例的例子，其中预期的答案摘要甚至可能对人类评估者来说也显得安全。

Method 方法	None 无	PAIR-GPT	PAP	DeepInception	Avg 平均
ReSA-SFT (Ours) ReSA-SFT（我们）	0.0%	7.99%	1.28%	0.32%	2.40%

Table 19: Safety-check jailbreak rates evaluated using LlamaGuard. The base model Llama3.1-8B-Instruct. Lower values indicate better performance.
表 19：使用 LlamaGuard 评估的安全检查绕过率。基础模型为 Llama3.1-8B-Instruct。数值越低表示性能越好。

Figure 20: This example demonstrates an inherently ambiguous corner case of Jailbroken Safety Analysis. Based solely on the jailbreak prompt and intended answer summary, it may appear safe even to a human evaluator.
图 20：此示例展示了一个固有的模糊边缘案例的 Jailbroken Safety Analysis。仅根据绕过提示和预期答案摘要，它甚至可能对人类评估者来说也显得安全。

16 Case Studies 16 案例研究

In this section, we provide concrete examples that demonstrate how our ReSA-SFT model handles queries compared to the base model. These examples highlight the effectiveness of our “Answer-Then-Check” strategy in enhancing safety while maintaining general capabilities.
在本节中，我们提供了具体的案例，展示了我们的 ReSA-SFT 模型与基础模型相比如何处理查询。这些案例突出了我们的“先回答后检查”策略在增强安全性的同时保持一般能力方面的有效性。

16.1 Safety Examples 16.1 安全示例

Figure 21 shows how both the base model (Llama3.1-8B-Instruct) and our ReSA-SFT model handle a vanilla harmful query about circumventing content moderation systems. For this straightforward harmful query, both models correctly refuse to provide harmful information. However, there is a key difference in the reasoning process: while the base model directly refuses without explanation, our ReSA-SFT model first generates a summary of what a harmful answer would entail, then conducts a safety analysis identifying that it violates the Sexual Content policy, and finally provides the refusal. This transparent reasoning process enhances the model’s ability to identify harmful content.
图 21 展示了基础模型（Llama3.1-8B-Instruct）和我们的 ReSA-SFT 模型如何处理一个关于规避内容审核系统的纯文本有害查询。对于这个简单的有害查询，两个模型都正确地拒绝提供有害信息。然而，在推理过程中存在一个关键差异：基础模型直接拒绝而不解释，而我们的 ReSA-SFT 模型首先生成一个有害答案的摘要，然后进行安全分析，识别出它违反了色情内容政策，最后才提供拒绝。这种透明的推理过程增强了模型识别有害内容的能力。

Figure 22 demonstrates a more challenging case: an adversarial harmful query that attempts to disguise the request for circumventing content moderation systems as an academic inquiry. Here, we observe a significant difference in model behavior. The base model initially states “I can’t take that tone” but then proceeds to provide detailed information about evading content filters, effectively being jailbroken by the adversarial query. In contrast, our ReSA-SFT model correctly identifies the underlying harmful intent through the “Answer-Then-Check” strategy, recognizing that the query violates the Sexual Content policy despite its academic framing, and appropriately refuses to answer.
图 22 展示了一个更具挑战性的案例：一个试图伪装成学术探究、以规避内容审核系统的对抗性有害查询。在此，我们观察到模型行为的显著差异。基础模型最初表示“我不能用那种语气”，但随后继续提供有关规避内容过滤器的详细信息，实际上被对抗性查询越狱。相比之下，我们的 ReSA-SFT 模型通过“先回答再检查”策略正确识别了潜在的有害意图，认识到该查询尽管以学术形式提出，但违反了色情内容政策，并适当拒绝回答。

Figure 21: Example demonstrating how the base model and ReSA-SFT model handle a vanilla harmful query. Only the green text after </safety_check> will be displayed to the user in ReSA-SFT.
图 21：示例展示了基础模型和 ReSA-SFT 模型如何处理普通的有害查询。在 ReSA-SFT 中，只有</safety_check>之后的绿色文本才会显示给用户。

Figure 22: Example demonstrating how the base model and ReSA-SFT model handle an adversarial harmful query. The base model partially complies (initial refusal but then harmful output), while ReSA-SFT fully refuses after detecting a Sexual Content policy violation. Red marks harmful text; only green text after </safety_check> is shown to users.
图 22：示例展示了基础模型和 ReSA-SFT 模型如何处理对抗性有害查询。基础模型部分遵守（初始拒绝但随后输出有害内容），而 ReSA-SFT 在检测到性内容政策违规后完全拒绝。红色标记有害文本；用户仅看到</safety_check>后的绿色文本。

16.2 General Capability Examples
16.2 通用能力示例

It is crucial that safety alignment does not come at the cost of reduced capabilities or over-refusal of benign queries. Figure 23 shows how our ReSA-SFT model handles a potentially ambiguous technical query about “killing” a Python process. Despite the presence of the word “kill” which might trigger safety filters, our model correctly identifies this as a legitimate technical question through its safety analysis, noting that it “does not involve any sensitive or harmful content” and “is focused on providing methods to terminate a Python process, which is a technical and harmless task.” The model then provides a comprehensive answer with multiple approaches across different operating systems.
安全对齐绝不能以降低能力或过度拒绝良性查询为代价。图 23 展示了我们的 ReSA-SFT 模型如何处理一个关于“杀死”Python 进程的潜在模糊技术查询。尽管“kill”一词可能会触发安全过滤器，但我们的模型通过其安全分析正确识别这作为一个合法的技术问题，指出它“不涉及任何敏感或有害内容”以及“专注于提供终止 Python 进程的方法，这是一个技术和无害的任务。”然后模型提供了跨不同操作系统的多种方法进行全面回答。

Figures 24 and 25 demonstrate that our ReSA-SFT model maintains strong reasoning capabilities in key domains. Figure 24 shows the model solving a mathematical problem about finding the least positive integer multiple of $30$ written with only digits $0$ and $2$ . Figure 25 shows the model completing a Python function to calculate Mean Absolute Deviation. In both examples, the model first summarizes its approach in the safety check phase, confirms that the content is safe, and then provides a complete, correct solution. This demonstrates that the “Answer-Then-Check” strategy effectively preserves the model’s core capabilities while adding safety analysis.
图 24 和图 25 展示了我们的 ReSA-SFT 模型在关键领域保持了强大的推理能力。图 24 显示了模型解决一个数学问题，即寻找仅由数字 $0$ 和 $2$ 组成的 $30$ 的最小正整数倍数。图 25 显示了模型完成一个用于计算平均绝对偏差的 Python 函数。在这两个例子中，模型首先在安全检查阶段总结其方法，确认内容安全，然后提供完整、正确的解决方案。这表明“先回答后检查”策略在增加安全分析的同时，有效地保留了模型的核心能力。

Figure 23: Example of the ReSA-SFT model correctly handling a benign query about terminating Python processes. This demonstrates that our model can properly distinguish between harmful content and legitimate questions, providing comprehensive assistance without over-refusal. Only the green text after </safety_check> will be displayed to the user in ReSA-SFT.
图 23：ReSA-SFT 模型正确处理一个关于终止 Python 进程的良性查询的示例。这表明我们的模型能够正确区分有害内容和合法问题，在不过度拒绝的情况下提供全面协助。在 ReSA-SFT 中，只有</safety_check>之后的绿色文本会显示给用户。

Figure 24: Example of the ReSA-SFT model maintaining strong mathematical reasoning capabilities while implementing the “Answer-Then-Check” strategy. Only the green text after </safety_check> will be displayed to the user in ReSA-SFT.
图 24：ReSA-SFT 模型在实施“先回答后检查”策略时保持强大的数学推理能力示例。在 ReSA-SFT 中，仅向用户显示</safety_check>后的绿色文本。

Figure 25: Example of the ReSA-SFT model maintaining strong coding reasoning capabilities while implementing the “Answer-Then-Check” strategy. Only the green text after </safety_check> will be displayed to the user in ReSA-SFT.
图 25：ReSA-SFT 模型在实施“先回答后检查”策略时保持强大的编码推理能力示例。在 ReSA-SFT 中，仅向用户显示</safety_check>后的绿色文本。

16.3 Safe Completion Examples
16.3 安全完成示例

We provide detailed case studies in Figure 26 to further illustrate how ReSA-SFT handles sensitive adversarial queries with safe completion rather than simple refusal. As shown in the comparison, Qwen2.5-7B-Instruct can be jailbroken to reveal harmful content, while the post-hoc strategy detects the harmful intent and hence refuses to respond. However, for sensitive topics such as self-harm, a simple refusal may not be appropriate. In contrast, ReSA-SFT generates helpful yet harmless responses, demonstrating more responsible and context-aware behavior.
我们在图 26 中提供了详细的案例研究，以进一步说明 ReSA-SFT 如何通过安全完成而非简单拒绝来处理敏感对抗性查询。从比较中可以看出，Qwen2.5-7B-Instruct 可以被越狱以揭示有害内容，而事后策略检测到有害意图因此拒绝回应。然而，对于自残等敏感话题，简单拒绝可能并不恰当。相比之下，ReSA-SFT 生成有帮助且无害的回应，展示了更负责任和具有上下文感知的行为。

Figure 26: Example demonstrating safe completion. ReSA-SFT handles sensitive adversarial queries with safe completion rather than simple refusal. Only the green text after </safety_check> will be displayed to the user in ReSA-SFT.
图 26：展示安全完成的示例。ReSA-SFT 通过安全完成而非简单拒绝来处理敏感对抗性查询。在 ReSA-SFT 中，只有</safety_check>之后的绿色文本会显示给用户。

Reasoned Safety Alignment: Ensuring Jailbreak Defense via Answer-Then-Check推理安全对齐：通过答案后检查确保越狱防御