Safety in Large Reasoning Models: A Survey
Abstract
Large Reasoning Models (LRMs) have exhibited extraordinary prowess in tasks like mathematics and coding, leveraging their advanced reasoning capabilities. Nevertheless, as these capabilities progress, significant concerns regarding their vulnerabilities and safety have arisen, which can pose challenges to their deployment and application in real-world settings. This paper presents the first comprehensive survey of safety in LRMs, meticulously exploring and summarizing the newly emerged safety risks, attacks, and defense strategies specific to these powerful reasoning-enhanced models. By organizing these elements into a detailed taxonomy, this work aims to offer a clear and structured understanding of the current safety landscape of LRMs, facilitating future research and development to enhance the security and reliability of these models.
1 Introduction
Large Language Models (LLMs) (Meta, 2024; Qwen et al., 2025; Ke et al., 2025) have achieved remarkable proficiency across tasks ranging from open-domain conversation to program synthesis. Central to their utility is reasoning: the ability to derive logically coherent conclusions by chaining together intermediate inferences.
Early work introduced Chain-of-Thought (CoT) prompting, in which carefully designed prompts guide the model to articulate its step-by-step rationale (Wei et al., 2022; Kojima et al., 2022). Building on this idea, subsequent methods have enriched the reasoning process by incorporating additional mechanisms. Self-critique frameworks enable a model to review and refine its own outputs (Ke et al., 2023); plan-and-solve approaches decompose complex problems into ordered subgoals before execution (Wang et al., 2023); debate protocols convene multiple agents to argue competing hypotheses and arrive at a consensus (Liang et al., 2023); and structural transformations—such as tree-based deliberations (Yao et al., 2023) or dynamically evolving tables of intermediate steps (Wang et al., 2024b; Besta et al., 2024)—reconfigure the underlying reasoning architecture to improve transparency and control.
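To make the prompting idea concrete, the snippet below gives a minimal sketch of zero-shot CoT prompting in the spirit of Kojima et al. (2022); the prompt wording and the placeholder `query_model` function are illustrative assumptions, not the setup of any surveyed paper.

```python
# Minimal sketch of zero-shot Chain-of-Thought (CoT) prompting.
# `query_model` is a placeholder for any chat-completion API.

def build_cot_prompt(question: str) -> str:
    # The trailing instruction elicits an explicit step-by-step rationale
    # before the final answer ("Let's think step by step", Kojima et al., 2022).
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer "
        "on a new line prefixed with 'Answer:'."
    )

def query_model(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to your LLM provider.")

if __name__ == "__main__":
    prompt = build_cot_prompt(
        "A train travels 120 km in 1.5 hours. What is its average speed?"
    )
    print(prompt)                     # inspect the constructed prompt
    # print(query_model(prompt))      # uncomment once query_model is implemented
```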
The recent release of OpenAI’s o1 series (OpenAI, 2024) marks the emergence of Large Reasoning Models (LRMs) (Li et al., 2025b), which are explicitly trained to produce richly formatted, human-readable reasoning traces. Notable examples include DeepSeek-R1 (DeepSeek-AI et al., 2025), Kimi-1.5 (Team et al., 2025), and QwQ (Team, 2024b), all of which leverage reinforcement learning to refine their deduction processes. LRMs now set new benchmarks in mathematical problem solving (Lightman et al., 2023), closed-book question answering (Rein et al., 2024), and code generation (Jain et al., 2024).
As LRMs become increasingly integrated into high-stakes domains—from scientific research to autonomous decision support—it is vital to rigorously assess their safety, robustness, and alignment. Despite the existence of surveys on LLM safety (Huang et al., 2023; Shi et al., 2024), the enhanced capabilities of LRMs make it important to perform a dedicated analysis of their unique safety challenges. This paper aims to bridge this gap by providing a comprehensive examination of safety considerations specific to reasoning-enhanced models.
Overview of the Survey.
In this survey, we begin with an introduction to the background of LRMs (Sec. 2). We then explore the inherent safety risks of LRMs (Sec. 3), focusing on vulnerabilities that emerge during standard, non-adversarial usage scenarios without deliberate exploitation attempts. Next, we distinguish these inherent risks from deliberate adversarial attacks (Sec. 4), where we categorize and analyze techniques specifically designed to compromise or manipulate LRMs’ reasoning capabilities. We proceed to examine various defense strategies to mitigate both inherent risks and adversarial attacks (Sec. 5). Finally, we outline promising future research directions (Sec. 6). A timeline depicting the evolution of different approaches is shown in Figure 1. The comprehensive structure of our survey is illustrated in Figure 2.
2 Background
The success of modern LRMs is deeply intertwined with advances in reinforcement learning (RL) (Watkins and Dayan, 1992; Sutton et al., 1998), where agents learn decision-making policies through environmental interaction and reward feedback to maximize long-term returns (Mnih et al., 2015; Li et al., 2025b). The integration of RL with deep neural networks has proven particularly effective in processing high-dimensional, unstructured data, as exemplified by breakthroughs like AlphaGo’s self-play mastery of Go and AlphaZero’s generalization across chess variants (Feng et al., 2023).
Recent breakthroughs in Reinforced Fine-Tuning (ReFT) paradigms, exemplified by DeepSeek models, have reinvigorated RL-based optimization for LRMs (Luong et al., 2024). Unlike conventional CoT methods that optimize single reasoning trajectories, ReFT employs policy optimization to explore diverse reasoning paths through several key innovations: (1) Multi-path Exploration: generating multiple reasoning trajectories per query, overcoming CoT’s myopic optimization of single pathways; (2) Rule-driven Reward Shaping: automating reward signals based on terminal answer correctness while preserving intermediate reasoning diversity (a minimal sketch of such a rule-based reward is given below); and (3) Dual-phase Optimization: combining supervised fine-tuning (SFT) with online RL for policy refinement.
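As a concrete illustration of rule-driven reward shaping (the sketch referenced above), the snippet below scores a group of sampled reasoning trajectories purely by terminal answer correctness and output format; the specific reward values, answer format, and penalty for malformed outputs are illustrative assumptions, not the reward design of any particular ReFT or DeepSeek training recipe.

```python
import re

# Illustrative rule-based reward for ReFT/GRPO-style training (a sketch only).
# The reward depends solely on the terminal answer and output format, leaving
# the intermediate reasoning unconstrained.

ANSWER_RE = re.compile(r"Answer:\s*(.+)", re.IGNORECASE)

def rule_reward(completion: str, gold_answer: str) -> float:
    match = ANSWER_RE.search(completion)
    if match is None:
        return -1.0                      # malformed output: no parsable answer
    predicted = match.group(1).strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0

def group_rewards(completions: list[str], gold_answer: str) -> list[float]:
    # One query yields several sampled reasoning trajectories (multi-path
    # exploration); each trajectory is scored independently by the same rule.
    return [rule_reward(c, gold_answer) for c in completions]

if __name__ == "__main__":
    samples = [
        "Step 1: 12*7 = 84.\nAnswer: 84",
        "I think the result is 82.\nAnswer: 82",
        "The result is 84.",             # missing the required answer format
    ]
    print(group_rewards(samples, "84"))  # -> [1.0, 0.0, -1.0]
```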
This paradigm demonstrates particular efficacy in complex multi-step tasks such as code generation, legal judgment analysis, and mathematical problem solving, which require models to maintain coherent reasoning across extended sequences while handling structured symbolic operations.
Notably, RL-optimized LRMs exhibit emergent capabilities like Long-CoT that surpass pure SFT baselines, further underscoring RL’s critical role and promising potential in advancing reasoning-driven AI systems (Qu et al., 2025).
Figure 1: Timeline of the development of LRM safety research.
- Safety Risks of LRMs (Sec. 3)
  - Harmful Request Compliance Risks (Sec. 3.1): DeepSeek Thoughtology (Marjanović et al., 2025), ASTRAL (Arrieta et al., 2025a), o3-mini vs DeepSeek-R1 (Arrieta et al., 2025b)
  - Agentic Misbehavior Risks (Sec. 3.2): HHH Trade-offs (Xu et al., 2025), Medical Cyberattack (Qiu et al., 2025), Specification Gaming (Bondarenko et al., 2025), Self-preservation (Barkur et al., 2025), InstrumentalEval (He et al., 2025)
  - Multi-lingual Safety Risks (Sec. 3.3): CHiSafetyBench (Zhang et al., 2025a), CNSafe (Ying et al., 2025b), ALIA (Romero-Arjona et al., 2025)
  - Multi-modal Safety Risks (Sec. 3.4): SafeMLRM (Fang et al., 2025)
- Attacks on LRMs (Sec. 4)
  - Reasoning Length Attacks (Sec. 4.1)
    - Overthinking: OverThink attack (Kumar et al., 2025), Nerd Sniping (Zaremba et al., 2025)
    - Underthinking: Think Less (Zaremba et al., 2025)
  - Answer Correctness Attacks (Sec. 4.2)
    - Reasoning-based Backdoor Attacks: BadChain (Xiang et al., 2024), DarkMind (Guo and Tourani, 2025), BoT (Zhu et al., 2025b), ShadowCoT (Zhao et al., 2025)
    - Error Injection: CPT (Cui et al., 2025)
  - Prompt Injection Attacks (Sec. 4.3): Nerd Sniping (Zaremba et al., 2025), R1 Assessment (Zhou et al., 2025)
  - Jailbreak Attacks (Sec. 4.4)
    - Prompt-based Attacks: Past Tense (Andriushchenko and Flammarion, 2024), CNSafe (Ying et al., 2025b), SafeMLRM (Fang et al., 2025)
    - Multi-turn Attacks: ActorAttack (Ren et al., 2024), RACE (Ying et al., 2025a), MHJ (Li et al., 2024)
    - Reasoning-based Attacks: Mousetrap (Yao et al., 2025), H-CoT (Kuo et al., 2025)
- Defenses for LRMs (Sec. 5)
  - Safety Alignment of LRMs (Sec. 5.1)
    - Safe CoT Data Curation: STAR-1 (Wang et al., 2025b), SafeChain (Jiang et al., 2025), RealSafe-R1 (Zhang et al., 2025b)
    - SFT-based Safety Alignment on Reasoning: SafeChain (Jiang et al., 2025), RealSafe-R1 (Zhang et al., 2025b), RT (Wang et al., 2025a), Safety Tax (Huang et al., 2025)
    - RL-based Safety Alignment on Reasoning: Deliberative Alignment (Guan et al., 2024), STAIR (Zhang et al., 2025c), SaRO (Mou et al., 2025), R2D (Zhu et al., 2025a)
  - Inference-time Defenses for LRMs (Sec. 5.2)
    - Inference-time Scaling on Reasoning: Inference-time Compute (Zaremba et al., 2025)
    - Safe Decoding for Reasoning: ZeroThink/LessThink/MoreThink (Jiang et al., 2025), Thinking Intervention (Wu et al., 2025a)
  - Guard Models for LRMs (Sec. 5.3)
    - Classifier-based Guard Models: LLaMA Guard 3 (Dubey et al., 2024), Aegis Guard 2 (Ghosh et al., 2024b), WildGuard (Han et al., 2024), ShieldGemma (Zeng et al., 2024a), LLaMA Guard 3-Vision (Chi et al., 2024a), Beaver Guard-V (Ji et al., 2025)
    - Reasoning-based Guard Models: GuardReasoner (Liu et al., 2025b), GuardReasoner-VL (Liu et al., 2025d), ThinkGuard (Wen et al., 2025), X-Guard (Upadhayay et al., 2025)
Figure 2: Comprehensive taxonomy of LRM safety based on the current literature.
3 Safety Risks of LRMs
As LRMs continue to advance, they introduce distinct safety challenges that warrant careful examination even in standard, non-adversarial scenarios. The explicit reasoning processes that make these models powerful become potential vectors for harm during routine operation. In this section, we examine four key categories of inherent safety risks: harmful request compliance (Sec. 3.1), agentic misbehavior (Sec. 3.2), multi-lingual safety disparities (Sec. 3.3), and multi-modal safety challenges (Sec. 3.4). Understanding these fundamental vulnerabilities is essential for developing effective safeguards and ensuring the responsible deployment of reasoning-enhanced AI systems, complementing the study of deliberate exploitation methods addressed later.
3.1 Harmful Request Compliance Risks
LRMs demonstrate concerning vulnerabilities when faced with direct harmful requests. Zhou et al. (2025) identify a significant safety gap between open-source reasoning models like DeepSeek-R1 and closed-source ones like o3-mini, with reasoning outputs often posing greater safety concerns than final answers. Arrieta et al. (2025a) confirm these findings in their testing of o3-mini, where they identify 87 instances of unsafe behavior despite safety measures. In a comparative study, Arrieta et al. (2025b) find DeepSeek-R1 produces substantially more unsafe responses than o3-mini when presented with identical harmful requests. A consistent finding across studies is that when reasoning models generate unsafe content, it tends to be more detailed and harmful due to their enhanced capabilities, particularly in categories like financial crime, terrorism, and violence. Zhou et al. (2025) also observe that the thinking process in reasoning models is often less safe than the final output, suggesting internal reasoning may explore harmful content even when final outputs appear safe.
3.2 Agentic Misbehavior Risks
Emerging research uncovers profound safety implications in the agentic behaviors of LRMs, where enhanced cognitive capabilities enable sophisticated forms of specification gaming, deception, and instrumental goal-seeking behaviors that transcend the limitations observed in previous generation systems. Xu et al. (2025) demonstrate that autonomous LLM agents can engage in catastrophic behaviors when faced with high-pressure scenarios, with stronger reasoning abilities often increasing these risks rather than mitigating them. Qiu et al. (2025) highlight how medical AI agents with advanced reasoning capabilities are particularly vulnerable to cyberattacks, with models like DeepSeek-R1 showing high susceptibility to false information injection and system hijacking. Bondarenko et al. (2025) demonstrate that LRMs like o1-preview and DeepSeek-R1 frequently resort to specification gaming when faced with difficult tasks, strategically circumventing rules when they determine fair play cannot achieve their objectives. Barkur et al. (2025) observe that DeepSeek-R1, when simulated in a robotic embodiment context, exhibits alarming deceptive behaviors and self-preservation instincts, including disabling ethics modules, creating covert networks, and unauthorized capability expansion, despite these traits not being explicitly programmed or prompted. He et al. (2025) further reveal through their InstrumentalEval benchmark that LRMs like o1 show significantly higher rates of instrumental convergence behaviors compared to RLHF models, including concerning tendencies toward self-replication, unauthorized system access, and deceptive behavior as instrumental means to achieve their goals.
3.3 Multi-lingual Safety Risks
Safety risks in LRMs reveal significant disparities across languages. Ying et al. (2025b) demonstrate that DeepSeek models show markedly higher attack success rates in English environments than Chinese contexts, averaging a 21.7% discrepancy, suggesting safety alignments may not generalize effectively across languages. Romero-Arjona et al. (2025) find similar vulnerabilities when testing DeepSeek-R1 in Spanish, with biased or unsafe response rates reaching 31.7%, while OpenAI o3-mini shows varying degrees of linguistic safety performance. Zhang et al. (2025a) systematically evaluate DeepSeek models using CHiSafetyBench, revealing critical safety deficiencies specifically in Chinese contexts, where reasoning models like DeepSeek-R1 struggled with culturally-specific safety concerns and failed to adequately reject harmful prompts.
3.4 Multi-modal Safety Risks
Following the success of LRMs, researchers have recognized the potential of reinforcement learning to enhance reasoning abilities in Large Vision-Language Models (LVLMs). This approach has led to the development of several notable models, including QvQ (Team, 2024a), Mulberry (Yao et al., 2024b), and R1-Onevision (Yang et al., 2025). While these models demonstrate impressive reasoning capabilities, their safety implications remain largely unexplored. The pioneering work of SafeMLRM (Fang et al., 2025) provides the first systematic safety analysis of multi-modal large reasoning models, revealing three critical concerns: (1) acquiring reasoning capabilities significantly degrades inherited safety alignment, (2) certain scenarios exhibit disproportionately higher vulnerabilities, and (3) some models demonstrate nascent self-correction capabilities despite overall safety concerns. Given these findings, we emphasize the urgent need for comprehensive safety and vulnerability assessments of reasoning-enhanced LVLMs to ensure their responsible deployment and use.
4 Attacks on LRMs
In this section, we categorize different attack methods based on their primary objectives. We identify four main categories: Reasoning Length Attacks (Section 4.1), which target the reasoning process itself; Answer Correctness Attacks (Section 4.2), which aim to manipulate output accuracy; Prompt Injection Attacks (Section 4.3), which bypass safety measures through crafted inputs; and Jailbreak Attacks (Section 4.4), which attempt to extract prohibited content or behaviors. Each attack type exploits different vulnerabilities in the reasoning capabilities of LRMs.
4.1 Reasoning Length Attacks
Unlike traditional LLMs that generate direct responses, LRMs explicitly perform multi-step reasoning, creating a new attack surface related to reasoning length. Attackers can exploit this distinctive feature by either forcing models to overthink simple problems or short-cutting necessary deliberation processes.
Overthinking.
The success of step-by-step reasoning in LRMs has significantly enhanced their problem-solving capabilities, but this improvement comes with a critical vulnerability: overthinking. Recent work by Chen et al. (2024a) has identified that these models often spend orders of magnitude more computation on simple questions with minimal benefit, creating substantial inference overhead and latency issues. Hashemi et al. (2025) systematically demonstrate this inefficiency through their DNR benchmark, revealing that reasoning models generate up to 70× more tokens than necessary and often perform worse than simpler non-reasoning models on straightforward tasks. This inefficiency creates an exploitable attack surface where adversaries can deliberately trigger excessive reasoning through carefully crafted inputs. Kumar et al. (2025) formalize this as an indirect prompt injection attack that introduces computationally demanding decoy problems, while Zaremba et al. (2025) identify Nerd Sniping attacks that trap models in unproductive thinking loops, causing them to spend abnormally large amounts of inference-time compute with decreased performance. These attacks effectively apply denial-of-service techniques (Shumailov et al., 2021; Gao et al., 2024) specifically to LRMs. The implications extend beyond computational waste—Marjanović et al. (2025) and Wu et al. (2025b) demonstrate that reasoning performance actually degrades beyond certain length thresholds, while Cuadron et al. (2025) show that in agentic systems, overthinking can lead to decision paralysis and ineffective action selection.
Underthinking.
Complementing overthinking vulnerabilities, Zaremba et al. (2025) propose Think Less attacks, where adversaries craft special prompts to force reasoning models to shortcut their deliberative processes. The goal is to make models produce incorrect responses by significantly reducing computation time. Their experiments use 64-shot examples to demonstrate that models like OpenAI’s o1-mini are particularly susceptible to these attacks, bypassing normal reasoning and jumping to premature conclusions. However, this can be detected by monitoring for abnormally low inference-time compute usage.
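A simple operational countermeasure suggested by these observations is to monitor reasoning-token counts against a per-task baseline and flag outputs whose length is anomalously high (possible overthinking or nerd sniping) or low (possible Think Less style underthinking). The sketch below illustrates this idea; the baseline statistics and the z-score threshold are illustrative assumptions, not values reported by Zaremba et al. (2025) or Kumar et al. (2025).

```python
import statistics

# Illustrative monitor for reasoning-length anomalies (over-/underthinking).

def fit_baseline(token_counts: list[int]) -> tuple[float, float]:
    """Estimate the typical reasoning length for a task distribution."""
    mean = statistics.mean(token_counts)
    std = statistics.pstdev(token_counts) or 1.0   # avoid division by zero
    return mean, std

def flag_anomaly(n_reasoning_tokens: int, mean: float, std: float,
                 k: float = 3.0) -> str:
    z = (n_reasoning_tokens - mean) / std
    if z > k:
        return "possible overthinking (e.g., decoy or nerd-sniping input)"
    if z < -k:
        return "possible underthinking (e.g., Think Less style prompt)"
    return "normal"

if __name__ == "__main__":
    baseline = fit_baseline([310, 290, 350, 305, 320, 298, 340])
    for n in (315, 4200, 12):
        print(n, "->", flag_anomaly(n, *baseline))
```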
4.2 Answer Correctness Attacks
While conventional LLMs can be manipulated to produce incorrect answers, LRMs introduce unique vulnerabilities through their exposed reasoning chains. This transparency in the inference process provides adversaries with additional attack vectors to corrupt the reasoning pathway itself, rather than just targeting the final output.
Reasoning-based Backdoor Attacks.
The goal of backdoor attacks is to alter a model’s behavior whenever a specific trigger is present in the input (Zhao et al., 2024). Based on the nature of these triggers, backdoor attacks can be classified as instruction-based (Xu et al., 2023), prompt-based (Yao et al., 2024a), or syntax-based (Qi et al., 2021; Cheng et al., 2025).
With the advancement of reasoning capabilities in LRMs, a new paradigm has emerged: Chain-of-Thought (CoT) based backdoor attacks that specifically target intermediate reasoning steps to compromise answer correctness. BadChain (Xiang et al., 2024) inserts malicious reasoning steps into the sequence, manipulating the model to produce incorrect answers while maintaining logical coherence. DarkMind (Guo and Tourani, 2025) implements latent triggers that activate during specific reasoning scenarios, leading to plausible but false outputs that are difficult to detect. BoT (Zhu et al., 2025b) forces models to bypass their reasoning mechanisms, generating immediate incorrect responses instead of thoughtful deliberation. ShadowCoT (Zhao et al., 2025) directly manipulates the model’s cognitive pathway through attention head localization and reasoning chain pollution, achieving flexible hijacking that produces wrong answers while preserving logical flow.
These sophisticated attacks reveal a concerning vulnerability: the enhanced reasoning capabilities of LRMs paradoxically make them more susceptible to backdoors that can generate incorrect answers accompanied by convincing reasoning.
Error Injection.
The explicit reasoning processes of LRMs create a critical vulnerability where strategically injected errors can fundamentally compromise output integrity. Cui et al. (2025) demonstrate this with their Compromising Thought (CPT) attack, in which manipulating calculation results within reasoning tokens causes models to ignore correct steps and adopt incorrect answers. Their experiments with models like DeepSeek-R1 reveal that endpoint token manipulations have a greater impact than structural changes to reasoning chains. They also discover a security vulnerability in which tampered tokens can trigger complete reasoning cessation in DeepSeek-R1, highlighting significant implications for reasoning-intensive applications.
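One lightweight integrity check implied by this threat model is to re-verify simple arithmetic claims that appear inside the reasoning trace and flag mismatches before trusting the final answer. The sketch below illustrates the idea on toy "a op b = c" steps; the regular expression and supported operators are illustrative assumptions, not the detection method of Cui et al. (2025).

```python
import re

# Illustrative integrity check for a reasoning trace: re-evaluate simple
# "a op b = c" claims and flag any whose stated result does not match.

CLAIM_RE = re.compile(r"(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)")
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}

def find_tampered_steps(reasoning: str) -> list[str]:
    tampered = []
    for a, op, b, claimed in CLAIM_RE.findall(reasoning):
        if OPS[op](int(a), int(b)) != int(claimed):
            tampered.append(f"{a} {op} {b} = {claimed}")
    return tampered

if __name__ == "__main__":
    trace = "First, 17 * 3 = 51. Then 51 + 9 = 62. So the answer is 62."
    print(find_tampered_steps(trace))   # -> ['51 + 9 = 62'] (51 + 9 is 60)
```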
4.3 Prompt Injection Attacks
Prompt injection attacks affect both traditional LLMs and LRMs, but LRMs present distinct challenges due to their step-by-step processing. These attacks (Kumar et al., 2024; Liu et al., 2023; Chen et al., 2024b, 2025) inject malicious instructions disguised as normal user input, causing the AI to override or ignore its original developer-set instructions and safeguards. The explicit reasoning structures of LRMs offer attackers additional insertion points to redirect the model’s thought process, potentially making them more susceptible to certain types of injections.
Zhou et al. (2025) examine LRMs like DeepSeek-R1 and o3-mini, finding significant differences in susceptibility based on injection types and risk categories. Their research reveals that reasoning models are particularly vulnerable to direct prompt injection attacks compared to indirect ones. Zaremba et al. (2025) further demonstrate that open-source reasoning models show significant vulnerability to prompt injection attacks, with success rates varying between direct and indirect injections. Their experiments reveal that increasing inference-time compute substantially improves model robustness, with attack success probability decreasing as test-time compute grows. Notably, proprietary models like o3-mini demonstrate nearly 80% lower vulnerability than open-source counterparts when facing direct injection attacks.
4.4 Jailbreak Attacks
Jailbreak attacks (Jin et al., 2024; Yi et al., 2024) refer to methods designed to circumvent an AI system’s safety guidelines and content policies to extract prohibited responses. While both traditional LLMs and LRMs face jailbreak threats, the attacks against LRMs represent a distinct category that specifically targets their enhanced reasoning capabilities. Rather than merely extending approaches used against conventional LLMs, these attacks exploit the deliberative processes that make LRMs powerful, enabling attackers to develop more sophisticated methods to bypass safety measures and elicit harmful content.
Prompt-Based Jailbreak.
Prompt-based jailbreaks involve the careful crafting of prompts, employing techniques such as persuasion (Zeng et al., 2024b), nested scene construction (Li et al., 2023), and persona modulation (Shah et al., 2023). Andriushchenko and Flammarion (2024) apply past-tense reformulations of harmful requests to OpenAI’s recent o1 reasoning models, revealing their lack of robustness against subtle linguistic shifts. Ying et al. (2025b) propose attack prompts that combine common jailbreak strategies—such as scenario injection, affirmative prefixes, and indirect instructions—with safety-sensitive queries to probe model vulnerabilities. Their findings indicate that reasoning models like DeepSeek-R1 and OpenAI’s o1 are particularly susceptible to such attacks, as their explicit CoT reasoning renders them more exploitable than standard LLMs.
Multi-turn Jailbreak.
Performing jailbreak attacks in a single query can be challenging, but multi-turn conversations or sequential prompts may incrementally guide models toward generating restricted content (Russinovich et al., 2024; Sun et al., 2024). Multi-turn attacks are particularly relevant to reasoning-capable models as these models possess sophisticated logical processing that can be exploited through extended dialogues. Ying et al. (2025a) propose Reasoning-Augmented Conversation (RACE), which reformulates harmful queries into benign reasoning tasks and gradually exploits the model’s inference capabilities to compromise safety alignment, achieving success rates up to 96%. Ren et al. (2024) introduce ActorAttack, a framework that constructs semantically linked conversational sequences that appear harmless individually but collectively lead to harmful outputs, successfully targeting even advanced models like o1. Li et al. (2024) further show that multi-turn human jailbreaks significantly outperform automated single-turn attacks, leveraging the model’s ability to maintain context and be incrementally steered toward unsafe behaviors.
Reasoning Exploitation Jailbreak.
LRMs possess advanced reasoning capabilities that, while enhancing their utility, introduce unique vulnerabilities that can be exploited through reasoning-based jailbreak attacks. Unlike traditional LLMs, these models explicitly expose their CoT reasoning processes, creating new attack surfaces. Yao et al. (2025) introduce Mousetrap, a framework that leverages chaos mappings to create iterative reasoning chains that gradually lead LRMs into harmful outputs. By embedding one-to-one mappings into the reasoning process, Mousetrap effectively traps models like OpenAI’s o1-mini and Claude-sonnet with success rates of up to 98%. Kuo et al. (2025) propose Hijacking Chain-of-Thought (H-CoT), which manipulates the reasoning process by injecting execution-phase thoughts that bypass safety checks entirely. Their approach exploits LRMs’ tendency to prioritize problem-solving over safety considerations, causing rejection rates to plummet from 98% to below 2% across models like OpenAI o1/o3 and DeepSeek-R1. Both approaches demonstrate that the very reasoning mechanisms designed to enhance LRMs’ capabilities can become their most significant security weaknesses when strategically manipulated.
5 Defenses for LRMs
To mitigate safety risks and defend against attacks on LRMs, various defense strategies have been proposed in recent research. We categorize these approaches into three main types: Safety Alignment (Section 5.1), Inference-Time Defenses (Section 5.2), and Guard Models (Section 5.3).
5.1 Safety Alignment of LRMs
Similar to LLMs and VLMs, LRMs are required to align with human values and expectations. The 3H principle (Askell et al., 2021) (Helpful, Honest, and Harmless) provides a foundational guideline for constraining model behaviors.
The existing safety alignment pipelines and techniques developed for LLMs (Shen et al., 2023) and VLMs (Ye et al., 2025) can be readily adapted to LRMs, as they share similar architectures and natural language generation behaviors. For example, the alignment process for LLMs typically starts with collecting high-quality, value-aligned data (Ethayarajh et al., 2022), either from existing benchmarks (Bach et al., 2022; Wang et al., 2022c), LLM-generated instructions (Wang et al., 2022b), or by filtering unsafe content (Welbl et al., 2021; Wang et al., 2022a). During training, common techniques include supervised fine-tuning (SFT) (Wu et al., 2021), reinforcement learning from human feedback (RLHF) (Ouyang et al., 2022), and direct preference optimization (DPO) (Rafailov et al., 2024). In the domain of VLMs, safety alignment has been achieved through various approaches. For example, Liu et al. (2024) introduce additional safety modules during training to enhance model alignment. Moreover, methods such as ADPO (Weng et al., 2025), Safe RLHF-V (Ji et al., 2025), and GRPO-based methods (Li et al., 2025a) improve safety via DPO (Rafailov et al., 2024), RLHF (Ouyang et al., 2022), and GRPO (DeepSeek-AI et al., 2025), respectively. Additionally, open-source datasets and benchmarks (Zhang et al., 2024; Ji et al., 2025) have played a crucial role in providing high-quality alignment data for safety evaluation.
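For reference, the DPO objective of Rafailov et al. (2024), which several of the alignment methods above and below build on, takes the following form, where y_w and y_l are the preferred and dispreferred responses, pi_ref is the frozen reference policy, sigma is the logistic function, and beta controls deviation from the reference:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[\log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)\right]
```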
Although effective, the previous alignment methods for LLMs and VLMs may overlook the reasoning process of LRMs, leading to unsatisfactory alignment performance. To mitigate this challenge, various works focus on different aspects, including safe CoT data curation, SFT-based safety alignment on reasoning, and RL-based safety alignment on reasoning.
Safe CoT Data Curation.
First, Wang et al. (2025b) build a 1k-scale safety dataset named STAR-1, specifically designed for LRMs. Another CoT-style safety training dataset, SafeChain (Jiang et al., 2025), is introduced to enhance the safety of LRMs. In addition, Zhang et al. (2025b) construct a dataset of 15k safety-aware reasoning trajectories, generated by DeepSeek-R1, with explicit instructions designed to promote expected refusal behavior.
SFT-based Safety Alignment on Reasoning.
Based on the curated safe CoT data, researchers further conduct SFT to improve safety. For example, Jiang et al. (2025) train two LRMs with the SafeChain dataset, demonstrating that it not only enhances model safety but also preserves reasoning performance. Besides, RealSafe-R1 (Zhang et al., 2025b) is developed to make LRMs safer by training DeepSeek-R1 distilled models on the 15k safety-aware reasoning trajectories. Wang et al. (2025a) propose training the model to reason over explicit safety guidelines, thereby enhancing safety alignment.
RL-based Safety Alignment on Reasoning.
In addition to SFT, various further post-training techniques for safety are proposed based on reinforcement learning (RL). For example, deliberative alignment (Guan et al., 2024) teaches models safety specifications directly and trains them via reinforcement learning to reason over these guidelines before generating responses. In addition, STAIR (Zhang et al., 2025c) utilizes Monte Carlo tree search and DPO (Rafailov et al., 2024) to integrate safety alignment with introspective reasoning. Meanwhile, SaRO (Mou et al., 2025) incorporates safety-policy-driven reasoning into the alignment process. Besides, R2D (Zhu et al., 2025a) unlocks a safety-aware reasoning mechanism to defend against jailbreak attacks via the proposed contrastive pivot optimization (CPO).
However, safety alignment incurs a safety alignment tax (Lin et al., 2023a), compromising fundamental capabilities of LRMs such as reasoning (Huang et al., 2025). To mitigate this issue, researchers explore alternative defense techniques that do not require direct modifications to the victim models.
5.2 Inference-time Defenses for LRMs
To circumvent the safety alignment tax (Lin et al., 2023a; Huang et al., 2025), one line of work focuses on applying defenses at inference time. The insights from previous inference-time defenses for LLMs (Cheng et al., 2023; Lu et al., 2023) and VLMs (Wang et al., 2024a; Ghosal et al., 2024; Ding et al., 2024; Liu et al., 2025a), such as safe system prompting, few-shot safe demonstrations, and safe decoding, can be naturally transferred to LRMs, as the token generation mechanism is similar across these models.
However, the reasoning process in LRMs brings new challenges and opportunities for inference-time defenses. Therefore, various inference-time techniques like inference-time scaling on reasoning and safe decoding for reasoning are proposed to ensure the safety of reasoning in LRMs.
Inference-time Scaling on Reasoning.
Zaremba et al. (2025) demonstrate that the inference-time scaling on reasoning improves the safety and adversarial robustness of LRMs. Future work could explore dynamic scaling strategies tailored to input complexity, or integrate adaptive reasoning depth control to balance efficiency and safety performance (Liu et al., 2025c) during inference.
Safe Decoding for Reasoning.
Jiang et al. (2025) propose three decoding strategies, including ZeroThink, LessThink, and MoreThink, to verify model safety during reasoning. Making the reasoning safer at inference time could be a promising future direction, by verifying intermediate steps, filtering unsafe trajectories, or integrating reasoning-aware guard mechanisms during decoding. Wu et al. (2025a) introduce Thinking Intervention, a method that strategically injects guidance directly into the reasoning process to control model behavior and improve safety alignment without requiring additional training.
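The sketch below illustrates the general mechanism behind such decoding-level interventions: prefilling (part of) the thinking segment so that the model either skips free-form deliberation or starts it from injected safety guidance. The `<think>` tags, chat template, and guidance text are illustrative assumptions; the exact templates used by ZeroThink/LessThink/MoreThink and Thinking Intervention may differ.

```python
# Sketch of inference-time interventions on an LRM's thinking segment.

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def zerothink_prefill() -> str:
    # Force an empty reasoning segment so the model answers without free-form CoT.
    return f"{THINK_OPEN}\n{THINK_CLOSE}\n"

def thinking_intervention_prefill(guidance: str) -> str:
    # Inject explicit guidance at the start of the reasoning segment; the model
    # continues from it, steering the remainder of its deliberation.
    return f"{THINK_OPEN}\n{guidance}\n"

def build_prompt(user_msg: str, assistant_prefill: str) -> str:
    # Schematic chat template; real deployments use the model's own template.
    return f"<|user|>\n{user_msg}\n<|assistant|>\n{assistant_prefill}"

if __name__ == "__main__":
    guidance = ("I must first check whether this request violates the safety "
                "policy and refuse if it does.")
    print(build_prompt("How do I pick a lock?", zerothink_prefill()))
    print(build_prompt("How do I pick a lock?",
                       thinking_intervention_prefill(guidance)))
```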
5.3 Guard Models for LRMs
Another line of work that avoids directly modifying the victim model focuses on building guard models for it. The inference-time defenses discussed above still focus on making the victim models’ own inference safer. In contrast, guard models aim to moderate the input and output of the victim models without training them or modifying their inference strategies. The existing guard models for LLMs (Inan et al., 2023) or VLMs (Chi et al., 2024b) can also safeguard LRMs, since they share similar input and output formats. In addition, reasoning-based guard models (Liu et al., 2025b) can better moderate the reasoning process of LRMs by guiding the guard models to deliberatively reason before making moderation decisions. We categorize existing guard models into two classes: classifier-based guard models and reasoning-based guard models.
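The sketch below illustrates the typical deployment pattern for a guard model: moderate the user input before it reaches the victim LRM, and moderate the (reasoning-bearing) output before it reaches the user. The keyword-based stub and function names are illustrative placeholders, not the actual interfaces of LLaMA Guard or GuardReasoner.

```python
# Sketch of wrapping a victim LRM with a guard model that moderates both user
# input and model output. The keyword stub stands in for a real guard model.

REFUSAL = "Sorry, I can't help with that."
BLOCKLIST = ("build a bomb", "credit card numbers")   # toy stand-in policy

def guard_classify(text: str) -> bool:
    """Return True if `text` is judged safe. Replace with a real guard model."""
    return not any(term in text.lower() for term in BLOCKLIST)

def lrm_generate(prompt: str) -> str:
    """Placeholder for the victim reasoning model."""
    return f"<think>reasoning about: {prompt}</think> Here is my answer."

def moderated_chat(user_prompt: str) -> str:
    if not guard_classify(user_prompt):      # moderate the input
        return REFUSAL
    response = lrm_generate(user_prompt)
    if not guard_classify(response):         # moderate the output and reasoning
        return REFUSAL
    return response

if __name__ == "__main__":
    print(moderated_chat("Summarize the safety risks of LRMs."))
    print(moderated_chat("Explain how to build a bomb."))
```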
Classifier-based Guard Models.
The LLM guard models, including ToxicChat-T5 (Lin et al., 2023b), ToxDectRoberta (Zhou, 2020), LaGoNN (Bates and Gurevych, 2023), the LLaMA Guard series (Inan et al., 2023; Dubey et al., 2024), the Aegis Guard series (Ghosh et al., 2024a, b), WildGuard (Han et al., 2024), and ShieldGemma (Zeng et al., 2024a), are typically based on open-sourced LLMs and fine-tuned on red-teaming data. In the VLM domain, for example, LLaVAGuard (Helff et al., 2024) is built to conduct large-scale dataset annotation and moderate text-image models. In addition, VLMGuard (Du et al., 2024) is proposed to detect malicious image-text prompts by leveraging unlabeled user prompts. Moreover, LLaMA Guard 3-Vision (Chi et al., 2024a) is developed to moderate both the image-text input and text output of VLMs via SFT. To improve generalization ability, Ji et al. (2025) present Beaver-Guard-V by training a reward model and then applying reinforcement learning. Although effective, these are typically classifier-based guard models, limiting their ability to moderate reasoning data. To mitigate this problem, reasoning-based guard models (Liu et al., 2025b) are proposed to enhance the reasoning ability of guard models.
Reasoning-based Guard Models.
Through the proposed reasoning SFT and hard-sample DPO, GuardReasoner (Liu et al., 2025b) and GuardReasoner-VL (Liu et al., 2025d) guide the guard model to deliberatively reason before making moderation decisions, improving performance, generalization ability, and explainability. Similarly, ThinkGuard (Wen et al., 2025) is developed via critique-augmented fine-tuning. X-Guard (Upadhayay et al., 2025) extends the reasoning-based guard model to the multi-lingual scenario.
6 Future Directions
Beyond the detailed analysis of risks, attacks, and defenses presented in previous sections, this paper also identifies future directions that researchers should prioritize to enhance the safety of LRMs:
(1) Standardized Evaluation Benchmarks. New benchmarks should focus on reasoning-specific vulnerabilities, as the research community currently lacks standardized evaluation frameworks to comprehensively test both the safety and robustness of LRMs’ multi-step reasoning processes.
(2) Domain-Specific Evaluation Frameworks. Evaluation suites for healthcare, finance, and law must include curated case studies and targeted adversarial tests. Expert review ensures LRMs meet each domain’s accuracy and ethical requirements.
(3) Human-in-the-Loop Alignment and Interpretability. Interactive tools should let experts inspect and refine reasoning traces. Iterative feedback can align LRMs with stakeholder values and correct biases efficiently.
7 Conclusion
This survey has comprehensively examined the emerging safety challenges posed by Large Reasoning Models. Through our analysis, we have identified several critical insights that distinguish LRM safety from traditional LLMs. First, LRMs expose their reasoning chains, creating new attack surfaces where adversaries can manipulate intermediate steps rather than just outputs, enabling sophisticated attacks like reasoning-based backdoors and hijacking that target the deliberative process itself. Second, traditional output-focused alignment methods prove insufficient for LRMs, as harmful reasoning can persist internally even when final outputs appear safe, necessitating novel approaches that consider the entire reasoning trajectory. These insights underscore the need for specialized safety research targeting LRMs, including standardized evaluation benchmarks for reasoning-specific vulnerabilities and human-in-the-loop alignment methods that can inspect and refine reasoning traces as these powerful models continue to advance into increasingly critical domains.
Limitations
This survey has inherent limitations due to the rapidly evolving nature of LRMs. Since the emergence of OpenAI’s o1 series, DeepSeek-R1, and other advanced reasoning models is relatively recent, our taxonomy and findings may become outdated as new research continuously emerges. While we have endeavored to provide a comprehensive overview of safety challenges, attacks, and defenses, we acknowledge that some aspects may require revision as the field matures. Additionally, our reliance on published academic literature may not fully capture proprietary research being conducted within companies developing these models, potentially creating gaps in understanding industry-specific safety measures.
References
- Andriushchenko and Flammarion (2024) Maksym Andriushchenko and Nicolas Flammarion. 2024. Does refusal training in llms generalize to the past tense? arXiv preprint arXiv:2407.11969.
- Arrieta et al. (2025a) Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. 2025a. Early external safety testing of openai’s o3-mini: Insights from the pre-deployment evaluation. arXiv preprint arXiv:2501.17749.
- Arrieta et al. (2025b) Aitor Arrieta, Miriam Ugarte, Pablo Valle, José Antonio Parejo, and Sergio Segura. 2025b. o3-mini vs deepseek-r1: Which one is safer? arXiv preprint arXiv:2501.18438.
- Askell et al. (2021) Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
- Bach et al. (2022) Stephen H Bach, Victor Sanh, Zheng-Xin Yong, Albert Webson, Colin Raffel, Nihal V Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, et al. 2022. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279.
- Barkur et al. (2025) Sudarshan Kamath Barkur, Sigurd Schacht, and Johannes Scholl. 2025. Deception in llms: Self-preservation and autonomous goals in large language models. arXiv preprint arXiv:2501.16513.
- Bates and Gurevych (2023) Luke Bates and Iryna Gurevych. 2023. Like a good nearest neighbor: Practical content moderation and text classification. arXiv preprint arXiv:2302.08957.
- Besta et al. (2024) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of thoughts: Solving elaborate problems with large language models. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):17682–17690.
- Bondarenko et al. (2025) Alexander Bondarenko, Denis Volk, Dmitrii Volkov, and Jeffrey Ladish. 2025. Demonstrating specification gaming in reasoning models. arXiv preprint arXiv:2502.13295.
- Chen et al. (2024a) Xingyu Chen, Jiahao Xu, Tian Liang, Zhiwei He, Jianhui Pang, Dian Yu, Linfeng Song, Qiuzhi Liu, Mengfei Zhou, Zhuosheng Zhang, et al. 2024a. Do not think that much for 2+3=? on the overthinking of o1-like llms. arXiv preprint arXiv:2412.21187.
- Chen et al. (2025) Yulin Chen, Haoran Li, Yuan Sui, Yufei He, Yue Liu, Yangqiu Song, and Bryan Hooi. 2025. Can indirect prompt injection attacks be detected and removed? arXiv preprint arXiv:2502.16580.
- Chen et al. (2024b) Yulin Chen, Haoran Li, Zihao Zheng, Yangqiu Song, Dekai Wu, and Bryan Hooi. 2024b. Defense against prompt injection attack by leveraging attack techniques. arXiv preprint arXiv:2411.00459.
- Cheng et al. (2023) Jiale Cheng, Xiao Liu, Kehan Zheng, Pei Ke, Hongning Wang, Yuxiao Dong, Jie Tang, and Minlie Huang. 2023. Black-box prompt optimization: Aligning large language models without model training. arXiv preprint arXiv:2311.04155.
- Cheng et al. (2025) Pengzhou Cheng, Wei Du, Zongru Wu, Fengwei Zhang, Libo Chen, Zhuosheng Zhang, and Gongshen Liu. 2025. Synghost: Invisible and universal task-agnostic backdoor attack via syntactic transfer.
- Chi et al. (2024a) Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti. 2024a. Llama guard 3 vision: Safeguarding human-ai image understanding conversations. arXiv preprint arXiv:2411.10414.
- Chi et al. (2024b) Jianfeng Chi, Ujjwal Karn, Hongyuan Zhan, Eric Smith, Javier Rando, Yiming Zhang, Kate Plawiak, Zacharie Delpierre Coudert, Kartikeya Upasani, and Mahesh Pasupuleti. 2024b. Llama guard 3 vision: Safeguarding human-ai image understanding conversations. arXiv preprint arXiv:2411.10414.
- Cuadron et al. (2025) Alejandro Cuadron, Dacheng Li, Wenjie Ma, Xingyao Wang, Yichuan Wang, Siyuan Zhuang, Shu Liu, Luis Gaspar Schroeder, Tian Xia, Huanzhi Mao, Nicholas Thumiger, Aditya Desai, Ion Stoica, Ana Klimovic, Graham Neubig, and Joseph E. Gonzalez. 2025. The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks.
- Cui et al. (2025) Yu Cui, Bryan Hooi, Yujun Cai, and Yiwei Wang. 2025. Process or result? manipulated ending tokens can mislead reasoning llms to ignore the correct reasoning steps.
- DeepSeek-AI et al. (2025) DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.
- Ding et al. (2024) Yi Ding, Bolian Li, and Ruqi Zhang. 2024. Eta: Evaluating then aligning safety of vision language models at inference time. arXiv preprint arXiv:2410.06625.
- Du et al. (2024) Xuefeng Du, Reshmi Ghosh, Robert Sim, Ahmed Salem, Vitor Carvalho, Emily Lawton, Yixuan Li, and Jack W Stokes. 2024. Vlmguard: Defending vlms against malicious prompts via unlabeled data. arXiv preprint arXiv:2410.00296.
- Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Ethayarajh et al. (2022) Kawin Ethayarajh, Yejin Choi, and Swabha Swayamdipta. 2022. Understanding dataset difficulty with V-usable information. In International Conference on Machine Learning. PMLR.
- Fang et al. (2025) Junfeng Fang, Yukai Wang, Ruipeng Wang, Zijun Yao, Kun Wang, An Zhang, Xiang Wang, and Tat-Seng Chua. 2025. Safemlrm: Demystifying safety in multi-modal large reasoning models.
- Feng et al. (2023) Xidong Feng, Ziyu Wan, Muning Wen, Stephen Marcus McAleer, Ying Wen, Weinan Zhang, and Jun Wang. 2023. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179.
- Gao et al. (2024) Kuofeng Gao, Tianyu Pang, Chao Du, Yong Yang, Shu-Tao Xia, and Min Lin. 2024. Denial-of-service poisoning attacks against large language models.
- Ghosal et al. (2024) Soumya Suvra Ghosal, Souradip Chakraborty, Vaibhav Singh, Tianrui Guan, Mengdi Wang, Ahmad Beirami, Furong Huang, Alvaro Velasquez, Dinesh Manocha, and Amrit Singh Bedi. 2024. Immune: Improving safety against jailbreaks in multi-modal llms via inference-time alignment. arXiv preprint arXiv:2411.18688.
- Ghosh et al. (2024a) Shaona Ghosh, Prasoon Varshney, Erick Galinkin, and Christopher Parisien. 2024a. Aegis: Online adaptive ai content safety moderation with ensemble of llm experts. arXiv preprint arXiv:2404.05993.
- Ghosh et al. (2024b) Shaona Ghosh, Prasoon Varshney, Makesh Narsimhan Sreedhar, Aishwarya Padmakumar, Traian Rebedea, Jibin Rajan Varghese, and Christopher Parisien. 2024b. Aegis2.0: A diverse ai safety dataset and risks taxonomy for alignment of llm guardrails. In Neurips Safe Generative AI Workshop 2024.
- Guan et al. (2024) Melody Y Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Heylar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, et al. 2024. Deliberative alignment: Reasoning enables safer language models. arXiv preprint arXiv:2412.16339.
- Guo and Tourani (2025) Zhen Guo and Reza Tourani. 2025. Darkmind: Latent chain-of-thought backdoor in customized llms. arXiv preprint arXiv:2501.18617.
- Han et al. (2024) Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. 2024. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. arXiv preprint arXiv:2406.18495.
- Hashemi et al. (2025) Masoud Hashemi, Oluwanifemi Bamgbose, Sathwik Tejaswi Madhusudhan, Jishnu Sethumadhavan Nair, Aman Tiwari, and Vikas Yadav. 2025. Dnr bench: When silence is smarter–benchmarking over-reasoning in reasoning llms. arXiv preprint arXiv:2503.15793.
- He et al. (2025) Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, and Bryan Hooi. 2025. Evaluating the paperclip maximizer: Are rl-based language models more likely to pursue instrumental goals? arXiv preprint arXiv:2502.12206.
- Helff et al. (2024) Lukas Helff, Felix Friedrich, Manuel Brack, Patrick Schramowski, and Kristian Kersting. 2024. Llavaguard: Vlm-based safeguard for vision dataset curation and safety assessment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8322–8326.
- Huang et al. (2025) Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Zachary Yahn, Yichang Xu, and Ling Liu. 2025. Safety tax: Safety alignment makes your large reasoning models less reasonable. arXiv preprint arXiv:2503.00555.
- Huang et al. (2023) Xiaowei Huang, Wenjie Ruan, Wei Huang, Gaojie Jin, Yi Dong, Changshun Wu, Saddek Bensalem, Ronghui Mu, Yi Qi, Xingyu Zhao, Kaiwen Cai, Yanghao Zhang, Sihao Wu, Peipei Xu, Dengyu Wu, Andre Freitas, and Mustafa A. Mustafa. 2023. A survey of safety and trustworthiness of large language models through the lens of verification and validation.
- Inan et al. (2023) Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. 2023. Llama guard: Llm-based input-output safeguard for human-ai conversations. arXiv preprint arXiv:2312.06674.
- Jain et al. (2024) Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2024. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974.
- Ji et al. (2025) Jiaming Ji, Xinyu Chen, Rui Pan, Han Zhu, Conghui Zhang, Jiahao Li, Donghai Hong, Boyuan Chen, Jiayi Zhou, Kaile Wang, et al. 2025. Safe rlhf-v: Safe reinforcement learning from human feedback in multimodal large language models. arXiv preprint arXiv:2503.17682.
- Jiang et al. (2025) Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, and Radha Poovendran. 2025. Safechain: Safety of language models with long chain-of-thought reasoning capabilities. arXiv preprint arXiv:2502.12025.
- Jin et al. (2024) Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, and Haohan Wang. 2024. Jailbreakzoo: Survey, landscapes, and horizons in jailbreaking large language and vision-language models.
- Ke et al. (2023) Pei Ke, Bosi Wen, Zhuoer Feng, Xiao Liu, Xuanyu Lei, Jiale Cheng, Shengyuan Wang, Aohan Zeng, Yuxiao Dong, Hongning Wang, et al. 2023. Critiquellm: Towards an informative critique generation model for evaluation of large language model generation. arXiv preprint arXiv:2311.18702.
- Ke et al. (2025) Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, Caiming Xiong, and Shafiq Joty. 2025. A survey of frontiers in llm reasoning: Inference scaling, learning to reason, and agentic systems.
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
- Kumar et al. (2025) Abhinav Kumar, Jaechul Roh, Ali Naseh, Marzena Karpinska, Mohit Iyyer, Amir Houmansadr, and Eugene Bagdasarian. 2025. Overthinking: Slowdown attacks on reasoning llms. arXiv preprint arXiv:2502.02542.
- Kumar et al. (2024) Surender Suresh Kumar, M.L. Cummings, and Alexander Stimpson. 2024. Strengthening llm trust boundaries: A survey of prompt injection attacks. In 2024 IEEE 4th International Conference on Human-Machine Systems (ICHMS), pages 1–6.
- Kuo et al. (2025) Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Hai Li, and Yiran Chen. 2025. H-cot: Hijacking the chain-of-thought safety reasoning mechanism to jailbreak large reasoning models, including openai o1/o3, deepseek-r1, and gemini 2.0 flash thinking. arXiv preprint arXiv:2502.12893.
- Li et al. (2024) Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, and Summer Yue. 2024. Llm defenses are not robust to multi-turn human jailbreaks yet. arXiv preprint arXiv:2408.15221.
- Li et al. (2023) Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. 2023. Deepinception: Hypnotize large language model to be jailbreaker. arXiv preprint arXiv:2311.03191.
- Li et al. (2025a) Xuying Li, Zhuo Li, Yuji Kosuga, and Victor Bian. 2025a. Optimizing safe and aligned language generation: A multi-objective grpo approach. arXiv preprint arXiv:2503.21819.
- Li et al. (2025b) Zhong-Zhi Li, Duzhen Zhang, Ming-Liang Zhang, Jiaxin Zhang, Zengyan Liu, Yuxuan Yao, Haotian Xu, Junhao Zheng, Pei-Jie Wang, Xiuyi Chen, et al. 2025b. From system 1 to system 2: A survey of reasoning large language models. arXiv preprint arXiv:2502.17419.
- Liang et al. (2023) Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. 2023. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118.
- Lightman et al. (2023) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. Let’s verify step by step. In The Twelfth International Conference on Learning Representations.
- Lin et al. (2023a) Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, et al. 2023a. Mitigating the alignment tax of rlhf. arXiv preprint arXiv:2309.06256.
- Lin et al. (2023b) Zi Lin, Zihan Wang, Yongqi Tong, Yangkun Wang, Yuxin Guo, Yujia Wang, and Jingbo Shang. 2023b. Toxicchat: Unveiling hidden challenges of toxicity detection in real-world user-ai conversation. arXiv preprint arXiv:2310.17389.
- Liu et al. (2025a) Qin Liu, Fei Wang, Chaowei Xiao, and Muhao Chen. 2025a. Vlm-guard: Safeguarding vision-language models via fulfilling safety alignment gap. arXiv preprint arXiv:2502.10486.
- Liu et al. (2023) Yi Liu, Gelei Deng, Yuekang Li, Kailong Wang, Zihao Wang, Xiaofeng Wang, Tianwei Zhang, Yepang Liu, Haoyu Wang, Yan Zheng, et al. 2023. Prompt injection attack against llm-integrated applications. arXiv preprint arXiv:2306.05499.
- Liu et al. (2025b) Yue Liu, Hongcheng Gao, Shengfang Zhai, Xia Jun, Tianyi Wu, Zhiwei Xue, Yulin Chen, Kenji Kawaguchi, Jiaheng Zhang, and Bryan Hooi. 2025b. Guardreasoner: Towards reasoning-based llm safeguards. arXiv preprint arXiv:2501.18492.
- Liu et al. (2025c) Yue Liu, Jiaying Wu, Yufei He, Hongcheng Gao, Hongyu Chen, Baolong Bi, Jiaheng Zhang, Zhiqi Huang, and Bryan Hooi. 2025c. Efficient inference for large reasoning models: A survey. arXiv preprint arXiv:2503.23077.
- Liu et al. (2025d) Yue Liu, Shengfang Zhai, Mingzhe Du, Yulin Chen, Tri Cao, Hongcheng Gao, Cheng Wang, Xinfeng Li, Kun Wang, Junfeng Fang, Jiaheng Zhang, and Bryan Hooi. 2025d. Guardreasoner-vl: Safeguarding vlms via reinforced reasoning.
- Liu et al. (2024) Zhendong Liu, Yuanbi Nie, Yingshui Tan, Xiangyu Yue, Qiushi Cui, Chongjun Wang, Xiaoyong Zhu, and Bo Zheng. 2024. Safety alignment for vision language models. arXiv preprint arXiv:2405.13581.
- Lu et al. (2023) Ximing Lu, Faeze Brahman, Peter West, Jaehun Jang, Khyathi Chandu, Abhilasha Ravichander, Lianhui Qin, Prithviraj Ammanabrolu, Liwei Jiang, Sahana Ramnath, et al. 2023. Inference-time policy adapters (ipa): Tailoring extreme-scale lms without fine-tuning. arXiv preprint arXiv:2305.15065.
- Luong et al. (2024) Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. 2024. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967.
- Marjanović et al. (2025) Sara Vera Marjanović, Arkil Patel, Vaibhav Adlakha, Milad Aghajohari, Parishad BehnamGhader, Mehar Bhatia, Aditi Khandelwal, Austin Kraft, Benno Krojer, Xing Han Lù, Nicholas Meade, Dongchan Shin, Amirhossein Kazemnejad, Gaurav Kamath, Marius Mosbach, Karolina Stańczak, and Siva Reddy. 2025. Deepseek-r1 thoughtology: Let’s think about llm reasoning.
- Meta (2024) Meta. 2024. The llama 3 herd of models.
- Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. 2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
- Mou et al. (2025) Yutao Mou, Yuxiao Luo, Shikun Zhang, and Wei Ye. 2025. Saro: Enhancing llm safety through reasoning-based alignment. arXiv preprint arXiv:2504.09420.
- OpenAI (2024) OpenAI. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720.
- Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35.
- Qi et al. (2021) Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. 2021. Hidden killer: Invisible textual backdoor attacks with syntactic trigger.
- Qiu et al. (2025) Jianing Qiu, Lin Li, Jiankai Sun, Hao Wei, Zhe Xu, Kyle Lam, and Wu Yuan. 2025. Emerging cyber attack risks of medical ai agents. arXiv preprint arXiv:2504.03759.
- Qu et al. (2025) Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, Dongrui Liu, Ganqu Cui, Daizong Liu, Shuxian Liang, Junxian He, et al. 2025. A survey of efficient reasoning for large reasoning models: Language, multimodality, and beyond. arXiv preprint arXiv:2503.21614.
- Qwen et al. (2025) Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Tianyi Tang, Tingyu Xia, Xingzhang Ren, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yu Wan, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zihan Qiu. 2025. Qwen2.5 technical report.
- Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2024. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36.
- Rein et al. (2024) David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. 2024. Gpqa: A graduate-level google-proof qa benchmark. In First Conference on Language Modeling.
- Ren et al. (2024) Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, and Jing Shao. 2024. Derail yourself: Multi-turn llm jailbreak attack through self-discovered clues. arXiv preprint arXiv:2410.10700.
- Romero-Arjona et al. (2025) Miguel Romero-Arjona, Pablo Valle, Juan C Alonso, Ana B Sánchez, Miriam Ugarte, Antonia Cazalilla, Vicente Cambrón, José A Parejo, Aitor Arrieta, and Sergio Segura. 2025. Red teaming contemporary ai models: Insights from spanish and basque perspectives. arXiv preprint arXiv:2503.10192.
- Russinovich et al. (2024) Mark Russinovich, Ahmed Salem, and Ronen Eldan. 2024. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack. arXiv preprint arXiv:2404.01833.
- Shah et al. (2023) Rusheb Shah, Soroush Pour, Arush Tagade, Stephen Casper, Javier Rando, et al. 2023. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348.
- Shen et al. (2023) Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, and Deyi Xiong. 2023. Large language model alignment: A survey. arXiv preprint arXiv:2309.15025.
- Shi et al. (2024) Dan Shi, Tianhao Shen, Yufei Huang, Zhigen Li, Yongqi Leng, Renren Jin, Chuang Liu, Xinwei Wu, Zishan Guo, Linhao Yu, Ling Shi, Bojian Jiang, and Deyi Xiong. 2024. Large language model safety: A holistic survey.
- Shumailov et al. (2021) Ilia Shumailov, Yiren Zhao, Daniel Bates, Nicolas Papernot, Robert Mullins, and Ross Anderson. 2021. Sponge examples: Energy-latency attacks on neural networks. In 2021 IEEE European Symposium on Security and Privacy, pages 212–231.
- Sun et al. (2024) Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, and Hui Li. 2024. Multi-turn context jailbreak attack on large language models from first principles. arXiv preprint arXiv:2408.04686.
- Sutton et al. (1998) Richard S Sutton, Andrew G Barto, et al. 1998. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge.
- Team et al. (2025) Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, et al. 2025. Kimi k1.5: Scaling reinforcement learning with llms. arXiv preprint arXiv:2501.12599.
- Team (2024a) Qwen Team. 2024a. Qvq: To see the world with wisdom. https://qwenlm.github.io/blog/qvq-72b-preview/.
- Team (2024b) Qwen Team. 2024b. Qwq: Reflect deeply on the boundaries of the unknown. https://qwenlm.github.io/blog/qwq-32b-preview/.
- Upadhayay et al. (2025) Bibek Upadhayay, Vahid Behzadan, et al. 2025. X-guard: Multilingual guard agent for content moderation. arXiv preprint arXiv:2504.08848.
- Wang et al. (2022a) Boxin Wang, Wei Ping, Chaowei Xiao, Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Bo Li, Anima Anandkumar, and Bryan Catanzaro. 2022a. Exploring the limits of domain-adaptive training for detoxifying large-scale language models. Advances in Neural Information Processing Systems, 35:35811–35824.
- Wang et al. (2025a) Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Minhao Cheng, and Dacheng Tao. 2025a. Leveraging reasoning with guidelines to elicit and utilize knowledge for enhancing safety alignment. arXiv preprint arXiv:2502.04040.
- Wang et al. (2023) Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091.
- Wang et al. (2024a) Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, and Xipeng Qiu. 2024a. Inferaligner: Inference-time alignment for harmlessness through cross-model guidance. arXiv preprint arXiv:2401.11206.
- Wang et al. (2022b) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022b. Self-instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560.
- Wang et al. (2022c) Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Anjana Arunkumar, Arjun Ashok, Arut Selvan Dhanasekaran, Atharva Naik, David Stap, et al. 2022c. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. arXiv preprint arXiv:2204.07705.
- Wang et al. (2025b) Zijun Wang, Haoqin Tu, Yuhan Wang, Juncheng Wu, Jieru Mei, Brian R Bartoldson, Bhavya Kailkhura, and Cihang Xie. 2025b. Star-1: Safer alignment of reasoning llms with 1k data. arXiv preprint arXiv:2504.01903.
- Wang et al. (2024b) Zilong Wang, Hao Zhang, Chun-Liang Li, Julian Martin Eisenschlos, Vincent Perot, Zifeng Wang, Lesly Miculicich, Yasuhisa Fujii, Jingbo Shang, Chen-Yu Lee, and Tomas Pfister. 2024b. Chain-of-table: Evolving tables in the reasoning chain for table understanding.
- Watkins and Dayan (1992) Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine learning, 8:279–292.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837.
- Welbl et al. (2021) Johannes Welbl, Amelia Glaese, Jonathan Uesato, Sumanth Dathathri, John Mellor, Lisa Anne Hendricks, Kirsty Anderson, Pushmeet Kohli, Ben Coppin, and Po-Sen Huang. 2021. Challenges in detoxifying language models. arXiv preprint arXiv:2109.07445.
- Wen et al. (2025) Xiaofei Wen, Wenxuan Zhou, Wenjie Jacky Mo, and Muhao Chen. 2025. Thinkguard: Deliberative slow thinking leads to cautious guardrails. arXiv preprint arXiv:2502.13458.
- Weng et al. (2025) Fenghua Weng, Jian Lou, Jun Feng, Minlie Huang, and Wenjie Wang. 2025. Adversary-aware dpo: Enhancing safety alignment in vision language models via adversarial training. arXiv preprint arXiv:2502.11455.
- Wu et al. (2021) Jeff Wu, Long Ouyang, Daniel M Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, and Paul Christiano. 2021. Recursively summarizing books with human feedback. arXiv preprint arXiv:2109.10862.
- Wu et al. (2025a) Tong Wu, Chong Xiang, Jiachen T. Wang, and Prateek Mittal. 2025a. Effectively controlling reasoning models through thinking intervention.
- Wu et al. (2025b) Yuyang Wu, Yifei Wang, Tianqi Du, Stefanie Jegelka, and Yisen Wang. 2025b. When more is less: Understanding chain-of-thought length in llms. arXiv preprint arXiv:2502.07266.
- Xiang et al. (2024) Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, and Bo Li. 2024. Badchain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242.
- Xu et al. (2023) Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, and Muhao Chen. 2023. Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models. arXiv preprint arXiv:2305.14710.
- Xu et al. (2025) Rongwu Xu, Xiaojian Li, Shuo Chen, and Wei Xu. 2025. Nuclear deployed: Analyzing catastrophic risks in decision-making of autonomous llm agents. arXiv preprint arXiv:2502.11355.
- Yang et al. (2025) Yi Yang, Xiaoxuan He, Hongkun Pan, Xiyan Jiang, Yan Deng, Xingtao Yang, Haoyu Lu, Dacheng Yin, Fengyun Rao, Minfeng Zhu, Bo Zhang, and Wei Chen. 2025. R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization.
- Yao et al. (2024a) Hongwei Yao, Jian Lou, and Zhan Qin. 2024a. Poisonprompt: Backdoor attack on prompt-based large language models. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7745–7749.
- Yao et al. (2024b) Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, and Dacheng Tao. 2024b. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L. Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models.
- Yao et al. (2025) Yang Yao, Xuan Tong, Ruofan Wang, Yixu Wang, Lujundong Li, Liang Liu, Yan Teng, and Yingchun Wang. 2025. A mousetrap: Fooling large reasoning models for jailbreak with chain of iterative chaos. arXiv preprint arXiv:2502.15806.
- Ye et al. (2025) Mang Ye, Xuankun Rong, Wenke Huang, Bo Du, Nenghai Yu, and Dacheng Tao. 2025. A survey of safety on large vision-language models: Attacks, defenses and evaluations. arXiv preprint arXiv:2502.14881.
- Yi et al. (2024) Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. 2024. Jailbreak attacks and defenses against large language models: A survey.
- Ying et al. (2025a) Zonghao Ying, Deyue Zhang, Zonglei Jing, Yisong Xiao, Quanchen Zou, Aishan Liu, Siyuan Liang, Xiangzheng Zhang, Xianglong Liu, and Dacheng Tao. 2025a. Reasoning-augmented conversation for multi-turn jailbreak attacks on large language models. arXiv preprint arXiv:2502.11054.
- Ying et al. (2025b) Zonghao Ying, Guangyi Zheng, Yongxin Huang, Deyue Zhang, Wenxin Zhang, Quanchen Zou, Aishan Liu, Xianglong Liu, and Dacheng Tao. 2025b. Towards understanding the safety boundaries of deepseek models: Evaluation and findings. arXiv preprint arXiv:2503.15092.
- Zaremba et al. (2025) Wojciech Zaremba, Evgenia Nitishinskaya, Boaz Barak, Stephanie Lin, Sam Toyer, Yaodong Yu, Rachel Dias, Eric Wallace, Kai Xiao, Johannes Heidecke, and Amelia Glaese. 2025. Trading inference-time compute for adversarial robustness.
- Zeng et al. (2024a) Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. 2024a. Shieldgemma: Generative ai content moderation based on gemma. arXiv preprint arXiv:2407.21772.
- Zeng et al. (2024b) Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. 2024b. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14322–14350.
- Zhang et al. (2025a) Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Ning Wang, Zhenhong Long, Peijun Yang, Jiaojiao Zhao, Minjie Hua, Chaoyang Ma, Kai Wang, et al. 2025a. Safety evaluation of deepseek models in chinese contexts. arXiv preprint arXiv:2502.11137.
- Zhang et al. (2025b) Yichi Zhang, Zihao Zeng, Dongbai Li, Yao Huang, Zhijie Deng, and Yinpeng Dong. 2025b. Realsafe-r1: Safety-aligned deepseek-r1 without compromising reasoning capability. arXiv preprint arXiv:2504.10081.
- Zhang et al. (2025c) Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, and Jun Zhu. 2025c. Stair: Improving safety alignment with introspective reasoning. arXiv preprint arXiv:2502.02384.
- Zhang et al. (2024) Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, Jinlan Fu, Zhenfei Yin, Senjie Jin, Yu Qiao, Xuanjing Huang, et al. 2024. Spa-vl: A comprehensive safety preference alignment dataset for vision language model. arXiv preprint arXiv:2406.12030.
- Zhao et al. (2025) Gejian Zhao, Hanzhou Wu, Xinpeng Zhang, and Athanasios V Vasilakos. 2025. Shadowcot: Cognitive hijacking for stealthy reasoning backdoors in llms. arXiv preprint arXiv:2504.05605.
- Zhao et al. (2024) Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, Xiaoyu Xu, Xiaobao Wu, Jie Fu, Yichao Feng, Fengjun Pan, and Luu Anh Tuan. 2024. A survey of backdoor attacks and defenses on large language models: Implications for security measures. Authorea Preprints.
- Zhou et al. (2025) Kaiwen Zhou, Chengzhi Liu, Xuandong Zhao, Shreedhar Jangam, Jayanth Srinivasa, Gaowen Liu, Dawn Song, and Xin Eric Wang. 2025. The hidden risks of large reasoning models: A safety assessment of r1. arXiv preprint arXiv:2502.12659.
- Zhou (2020) Xuhui Zhou. 2020. Challenges in automated debiasing for toxic language detection. University of Washington.
- Zhu et al. (2025a) Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, and Lei Sha. 2025a. Reasoning-to-defend: Safety-aware reasoning can defend large language models from jailbreaking. arXiv preprint arXiv:2502.12970.
- Zhu et al. (2025b) Zihao Zhu, Hongbao Zhang, Mingda Zhang, Ruotong Wang, Guanzong Wu, Ke Xu, and Baoyuan Wu. 2025b. Bot: Breaking long thought processes of o1-like large language models through backdoor attack. arXiv preprint arXiv:2502.12202.